Multimodal RAG!? - Pushing the Boundaries of AI

Video Statistics and Information

Captions
Hello everybody, Adam Lucek here. I was recently browsing Chroma's website, that's ChromaDB, the super cool open source embedding database I've been using recently, and I noticed a small announcement they had up: launching multimodal. Naturally I had to click in, and it turns out they've now added integrated functionality for open source embedding models that aren't just meant for embedding text but for images too, so you can hold multiple modalities in a single embedding space. What that means is that we can now create retrieval augmented generation pipelines using different modalities, not just the traditional text but images as well, using some cool open source models, and then combine them with top-of-the-line vision models to create some really fun flows. So that's what we're going over today: how to set up this sort of multimodal RAG flow, with a focus on pictures.

To provide a little background and context before we jump in, let's go over what a CLIP embedding function is and what OpenCLIP is. CLIP comes from a paper released by OpenAI on January 5th, 2021 and the corresponding blog post, "CLIP: Connecting text and images." The one-liner is that CLIP stands for Contrastive Language-Image Pre-training. CLIP models connect images and text by learning from a wide variety of internet data, which enables them to both understand and generate detailed descriptions of images and to match images with corresponding text. This all comes down to how they're trained: during training, a CLIP model learns to predict which images and texts belong together by contrasting correct image-text pairs against incorrect ones, which encourages the model to build a shared embedding space where similar images and texts sit close together and dissimilar ones sit far apart. That, in turn, enables very robust zero-shot learning, the ability of the model to generalize to new tasks or data it never saw during training. This is important because it lets CLIP models perform tasks they weren't specifically trained for, on the fly. For instance, a CLIP model can classify images into categories it has never seen before just by understanding a textual description of those categories. We're going to take advantage of this by making the textual description our user query, and the model will pull from our database of images to find the relevant ones. Long story short, this takes the classic ability of text-based embedding models, where you have different chunks of text, submit a user query, and retrieve the chunks most similar to that query, and applies the exact same idea to photos, which is precisely what CLIP models are trained to do. And while OpenAI never released CLIP's training code or data, people have taken the paper and recreated it in an open source setting, so there's now a repo, OpenCLIP, containing all sorts of open source CLIP-based contrastive language-image pre-training models that do essentially the same thing. That's what we'll be taking advantage of, and it's what ChromaDB takes advantage of.
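To make the idea concrete, here is a minimal sketch of how a CLIP-style model scores an image against a few text labels using OpenCLIP. The checkpoint names follow the OpenCLIP model zoo and the local file path is a hypothetical placeholder, so treat the specifics as assumptions rather than the video's exact code.

```python
# Minimal OpenCLIP sketch: embed an image and a few text labels into the same
# space and compare them with cosine similarity (zero-shot classification).
import torch
import open_clip
from PIL import Image

# Model and pretrained tag are assumptions based on the OpenCLIP model zoo.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical local file
labels = ["a leather jacket", "a red top", "brown pants"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product is cosine similarity in the shared space.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze().tolist())))
```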
So hopping back to the code, I've got a nice diagram of what we're going to build: a user query first goes to a vector database, which pulls photos; then both the photos and the user query go to a language model with vision capabilities; and the language model produces the final output. Super cool stuff.

The first step is of course to get a relevant dataset, some collection of pictures you actually want to retrieve from. I decided to make my life easy and choose one from Hugging Face, since there are plenty of image classification datasets readily available there. One that caught my eye was the Fashionpedia dataset, which has thousands of rows of fashion images. It's a dataset that consists of two parts: an ontology built by fashion experts containing 27 main apparel categories and so on, and a collection of 48,000 everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built by the Fashionpedia team. They originally used this dataset to train a model for classification and localized attribution of objects in fashion images: given an image, their model can classify what the image is about, identify different attributes, and describe what's happening inside segmented boxes. They've got a detailed website you can check out, they even expose an API, and their paper is worth a read, but the key thing for us is the phenomenal open source dataset we can take advantage of.

I'm going to use the Hugging Face datasets package to load and prepare this data. Using load_dataset, I point to the Fashionpedia dataset; for any dataset you can find this path by clicking "Use this dataset" on the Hub, and it gives you a little snippet of code. Once it's loaded, everything sits in lists and dictionaries, and indexing into a row's image field pulls up a picture, so I display a sample image just to confirm it works. Next, since I'm only going to use a subset for this example, I need a way to grab some of the photos and save them locally. I created a quick function that points to a folder I'm calling fashion_dataset, creates it if it doesn't exist, and then a helper that takes the dataset, the folder, and the number of images to save, which for this example is the first thousand. It iterates through, grabs each image, and saves it as "image" plus an index running from 1 up to 1,000. Hopping over to that folder, it looks like this: exactly 1,000 pictures of different fashion photos, saved locally. Perfect, this is what we can now load into our vector database in the second part.
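A rough sketch of that loading-and-saving step is below. The dataset id on the Hub, the "image" column name, and the output file format are assumptions, so check the Fashionpedia dataset card for the exact identifiers.

```python
# Sketch: pull Fashionpedia from the Hugging Face Hub and save a local
# subset of images into a folder for indexing.
import os
from datasets import load_dataset

# Dataset id is an assumption; copy the exact path from the dataset card.
dataset = load_dataset("detection-datasets/fashionpedia")

def save_images(dataset_split, folder: str, num_images: int = 1000) -> None:
    """Save the first `num_images` images from a split to `folder`."""
    os.makedirs(folder, exist_ok=True)
    for i in range(num_images):
        image = dataset_split[i]["image"]  # assumed column holding a PIL image
        image.save(os.path.join(folder, f"image_{i + 1}.png"))

save_images(dataset["train"], "fashion_dataset", num_images=1000)
```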
Of course, this step can be skipped if you already have a folder with all the pictures you want to index. But let's get into setting up our vector database, and for this we'll use ChromaDB, the library that kickstarted this whole project. ChromaDB now has an integration with the OpenCLIP embedding models we went over earlier (I've linked a few more references in case you want to dig deeper). ChromaDB's OpenCLIP integration currently defaults to the CLIP ViT-B/32 laion2b_s34b_b79k embedding model. A bit of a mouthful, but all that means is it's a CLIP-style model trained on the LAION-2B English subset of LAION-5B. LAION is a large scale open dataset containing almost 6 billion filtered image-text pairs, a big resource for training and assembling models for image classification and other image tasks, and the LAION-2B English subset specifically has about 2.3 billion of those text-image pairs.

To get everything working we'll use ChromaDB's API. First we instantiate the client with chromadb.PersistentClient and point it to where we want the database to live, which I set to image_vdb. We also instantiate an image loader and the embedding model, the OpenCLIPEmbeddingFunction, and pass all of that to the Chroma client with get_or_create_collection: give it a name, point it to the embedding function, and point it to the data loader. That creates the image_vdb folder right here, and that's where the vector database for all of these images is going to sit. Very simply, that's all you need to start putting in pictures.

Getting the images from the files into the vector database is not too bad either. We use the add method, which takes ids and uris. The IDs can be any identifiers we want to define; I'm just using the strings "1" up to "1000". The URIs are uniform resource identifiers, which in this case are simply the paths to the images. So we start two empty lists, ids and uris, iterate over the folder, append an index number and a path to each, and pass them to the image_vdb collection we created above with ids=ids and uris=uris. That's all it takes to load the images in. Calling the count method then returns 1,000, meaning all 1,000 entries made it in. Perfect.
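Here is roughly what that setup and ingestion look like with Chroma's multimodal pieces. The collection name and folder paths mirror the ones mentioned in the video but should be treated as placeholders.

```python
# Sketch: a persistent Chroma client, an OpenCLIP embedding function, and an
# image data loader, wired together into one multimodal collection.
import os
import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

client = chromadb.PersistentClient(path="image_vdb")
embedding_function = OpenCLIPEmbeddingFunction()
image_loader = ImageLoader()

collection = client.get_or_create_collection(
    name="fashion",  # collection name is a placeholder
    embedding_function=embedding_function,
    data_loader=image_loader,
)

# Register each saved image by id and URI (file path); Chroma's data loader
# reads and embeds the images from the URIs.
ids, uris = [], []
for i, filename in enumerate(sorted(os.listdir("fashion_dataset")), start=1):
    ids.append(str(i))
    uris.append(os.path.join("fashion_dataset", filename))

collection.add(ids=ids, uris=uris)
print(collection.count())  # expect 1000
```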
Chroma also makes querying the database pretty simple. I wrote a quick helper function, query_db, that takes the query, the raw text from the user, and a results argument that defaults to five. We take the same image_vdb collection we defined and populated above, but now call its query method, passing in the query texts (in our case the user's query), how many results we want via n_results (five here), and which fields we want included in the return. I include the URIs, because the URIs give me the paths needed to display the pictures, and the distances, which are the calculated distances between the query embedding and each photo's embedding, in other words the measure of how similar your text is to each photo.

Let's test this out. I have a function, print_results, that enumerates over the ID and URI of each result returned from the query, prints the ID, the distance, and the path, and then, since the URI is the file path, uses the IPython display package to show the image, followed by a blank line. So if we put "leather jacket" into query_db, that text goes into query_texts, the results come back, and everything gets fed through print_results. And this is what it returns: from our "leather jacket" query we get five different pictures of people wearing leather jackets, and for each we can see its ID, the distance calculated between the picture and the text, and the path where the file actually lives in the folder. Perfect, it's working. Let's try something else, say "red tops". Look at that, red tops, great. This is working perfectly.
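A sketch of those two helpers might look like the following. The "uris" and "distances" include flags follow Chroma's query API; the display width is just a cosmetic choice, and the collection object comes from the setup sketch above.

```python
# Sketch of the retrieval helpers: query the multimodal collection with raw
# text and print/display what comes back.
from IPython.display import Image as IPyImage, display

def query_db(query: str, results: int = 5) -> dict:
    """Embed the text query with OpenCLIP and return the closest images."""
    return collection.query(
        query_texts=[query],
        n_results=results,
        include=["uris", "distances"],
    )

def print_results(results: dict) -> None:
    """Show the id, distance, path, and the image itself for each hit."""
    for idx, uri in enumerate(results["uris"][0]):
        print(f"ID: {results['ids'][0][idx]}")
        print(f"Distance: {results['distances'][0][idx]}")
        print(f"Path: {uri}")
        display(IPyImage(filename=uri, width=300))
        print()

print_results(query_db("leather jacket"))
```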
Now that we have retrieval out of the way, we need to set up the augmented generation part. It follows this order: the user submits a raw text input, which could be a question, a query, whatever; that input goes through the retrieval function, just like "red tops" did, and pulls relevant images from the vector database to match the user's input; those images, along with a prompt, are then fed into a vision-capable model, and that's where the augmented generation happens, as the model uses the images and the prompt as context to produce the final output. To make things easy we'll use GPT-4o, which as of the time of recording is OpenAI's latest model with audio, vision, and text capabilities, so it can take in all of those modalities at once and respond in any of them as well. We're simply going to pass in the image data along with a prompt, and it will return text, using all of that as context.

To put this together we'll use the LangChain framework; I've linked their documentation on multimodal prompts, but here's how I set everything up. First, instantiate the language model: use ChatOpenAI, point it at gpt-4o, set the temperature to whatever you like (I generally keep it at zero), and save it as gpt4o. We'll also use a StrOutputParser so the model's response comes back as a plain string without having to dig into anything strange.

Here's where we dynamically insert the context into the chat prompt. We use ChatPromptTemplate.from_messages, following the simple system/user/response flow that OpenAI chat messages tend to take. The system prompt is: "You are a helpful fashion and styling assistant. Answer the user's question using the given image context, with direct references to parts of the images provided. Maintain a conversational tone, don't make too many lists, and use Markdown formatting for highlights, emphasis, and structure." Something pretty simple. The user message then contains three pieces of content. The first is text: "What are some ideas for styling" followed by {user_query} in braces, which is where we dynamically pass in the actual input from the user. The second and third pieces are where the image context comes in: they're image_url entries, and, following the LangChain documentation, each one says it expects an image as a base64-encoded JPEG. The JPEG part doesn't matter much here or there; what matters is the base64 encoding, and these are the placeholders where we'll insert the image data once we've encoded it, which we'll get to in a bit. Then we create the chain: the image prompt gets piped to GPT-4o, which gets piped to the string output parser. Perfect, that's our RAG prompt set up with LangChain and GPT-4o.
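Under those assumptions, the LangChain side could look something like this. The system prompt text is paraphrased from the video, and the dict layout for the image_url content parts follows LangChain's multimodal prompt documentation.

```python
# Sketch of the generation side: a multimodal chat prompt with two base64
# image placeholders, piped into GPT-4o and a string output parser.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

gpt4o = ChatOpenAI(model="gpt-4o", temperature=0.0)
parser = StrOutputParser()

image_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a helpful fashion and styling assistant. Answer the user's "
        "question using the given image context, with direct references to "
        "parts of the images provided. Keep a conversational tone, avoid "
        "long lists, and use Markdown for emphasis and structure.",
    ),
    (
        "user",
        [
            {"type": "text", "text": "What are some ideas for styling {user_query}"},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image_data_1}"},
            },
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image_data_2}"},
            },
        ],
    ),
])

vision_chain = image_prompt | gpt4o | parser
```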
That leaves the question of how we actually format the user query and the image data and pass them through, which is what we tackle next: formatting the query results for LLM prompting. To feed the images in as context, we first need to encode them to base64 so the LLM can take them in. Base64 is a way of encoding data using 64 ASCII characters that are safe to use in text-based systems; the encoding transforms the binary data of the image into a text format that can be easily transmitted and stored. When you encode an image in base64 for a vision model, it's usually because the model, or the system hosting it, requires the image data as text, which keeps the data handled consistently across the different platforms and languages that might process it. So by encoding our images in base64 we turn them from image files into strings of text, making them much easier to send and process without worrying about data issues.

The function I've written below does exactly that and builds a dictionary, along with the original user query, to pass back into our chain, since the chain expects a dictionary with keys for the user query, the first image's data, and the second image's data. Here's how it works: it takes two inputs, the data, which is the result of our query, and the user query itself. We start with an empty dictionary and first add the key user_query with the value of the user's query. Then we get the paths to the images from the results' URIs, and use the base64 package to do the encoding: we open each image file in binary mode, read it, encode it with base64 into one long string of text, and add it under the keys image_data_1 and image_data_2. So we end up with a full dictionary whose keys are user_query, image_data_1, and image_data_2, and whose values are the user's text (the "what are some ideas for styling X" part) and the base64 encodings of the first and second images. That's exactly what the vision chain expects as input, so it works nicely when we pass in the results from retrieval along with the query.

Now all that's left is to put it all together. As mentioned, we have the retrieval step defined and the generation step defined, and combining them looks like this. I have some fancy little Markdown output so it displays nicely when we run it, but essentially we take the query as input, call our query_db function with it, grabbing only two results for this example, and save that as results. We then format the query and the results with the format_prompt_inputs function we just went over, giving us prompt_inputs. Then we get the response from the model by invoking the vision chain with that dictionary, which maps all the keys and values into their places in the prompt, passes everything to the language model, and gives us the model's response. Finally, we grab the URIs from the results to display the retrieved pictures, and print out the response.
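Here is a sketch of that formatting helper and the final loop that ties retrieval and generation together. The key names match the placeholders in the prompt template sketch above, query_db and vision_chain come from the earlier sketches, and the display sizing is an assumption.

```python
# Sketch: base64-encode the top two retrieved images and run the full
# retrieval-augmented generation flow end to end.
import base64
from IPython.display import Image as IPyImage, Markdown, display

def format_prompt_inputs(data: dict, user_query: str) -> dict:
    """Build the dict the vision chain expects: the raw query plus two
    base64-encoded images taken from the query results' URIs."""
    inputs = {"user_query": user_query}
    for n, uri in enumerate(data["uris"][0][:2], start=1):
        with open(uri, "rb") as f:
            inputs[f"image_data_{n}"] = base64.b64encode(f.read()).decode("utf-8")
    return inputs

query = "brown pants"
results = query_db(query, results=2)           # retrieval
prompt_inputs = format_prompt_inputs(results, query)

for uri in results["uris"][0]:                 # show what was retrieved
    display(IPyImage(filename=uri, width=300))

response = vision_chain.invoke(prompt_inputs)  # augmented generation
display(Markdown(response))
```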
Let's give it a shot. Fashion RAG is at your service, what would you like to style today? I'm going to say "brown pants", and that kicks off the retrieval and then the generation step, so let's give it a second and see what it comes up with. Great, here's what we got back. We wanted to style some brown pants, and it grabbed two pictures, and I'm seeing brown pants in both of them, so that looks phenomenal. Now let's see what the language model generated from those two pictures plus our prompt about styling brown pants. It says styling brown pants can be quite versatile and chic, and here are some ideas inspired by the images provided. First is a monochromatic look: in the first image, brown pants are paired with a matching brown blazer and a black top, which I'd say is pretty accurate, creating a sophisticated and cohesive look; you could add a statement belt, like the zebra print one shown, to add some flair and break up the monotony. She does have a zebra print belt on, I didn't even notice that at first. It gives a few more examples there. The second idea is a more casual chic look: the second image shows a more casual approach, with brown shorts paired with a white top, perfect for a relaxed yet stylish look. It also gives me some additional styling tips, different footwear options and accessories, and says that by mixing and matching these elements you can create a variety of stylish outfits with brown pants. This is great: it retrieved exactly the right pictures and then used them as context, along with my input, to generate a response. We now have a fully functioning multimodal RAG setup.

To sum things up, we now have the ability to use databases with modalities other than just text for our retrieval augmented generation flows. That's made possible by language models that handle multiple modalities, such as GPT-4o, which can process images, and by CLIP embedding models paired with vector databases such as Chroma. Super cool: we can take a user query, retrieve relevant photos from the text, and then pass all of that to a language model that can actually process it to generate our output. Of course, all of this code is available in the description below along with some other resources. If you enjoyed it, leave a like; if you've got questions, ask them in the comments; and feel free to subscribe if you found this useful. Thank you.
Info
Channel: Adam Lucek
Views: 8,996
Keywords: artificial intelligence, OpenAI, AI, Gemini, Mistral, Llama, DALLE, Open Source, HuggingFace, Machine Learning, Deep Learning, AI Trends, AI Innovations, AI Applications, AI for Beginners, AI Tutorial, AI Research, AI Solutions, AI Projects, AI Software, AI Algorithms, Artificial General Intelligence, AGI, AI Skills, AI Strategy, AI Integration, AI Development, ElevenLabs, AssemblyAI, Multimodal, Agent, Azure, LangChain, gpt-4o, gpt, langsmith, fine-tuning, fashion, chroma, styling, RAG, vector
Id: OPGmeFmFyq0
Length: 22min 19sec (1339 seconds)
Published: Fri Jun 07 2024