Multimodal RAG with GPT-4-Vision and LangChain | Retrieval with Images, Tables and Text

Video Statistics and Information

Captions
Retrieval-augmented generation has become a standard approach when you want to feed a large language model with your own knowledge base. If you've got text files, this is quite easy: convert the text into vectors using an embedding model and store everything in a vector database. But how does this work with more complex data like a PDF? PDFs are more than just files containing text: they include text, but also images and tables. These elements are usually not standalone; their sequence and arrangement also matter, and that is quite difficult to represent. Thanks to new multimodal models, tables and images are no longer lost in the process, and retrieval-augmented generation over PDFs can be significantly improved compared to before.

So let's first talk about one possible approach in theory. We extract all text elements, table elements and images from the PDF. Then we create a summary for each element: for the text elements and tables we can use a language model, and for the images we can use the new GPT-4 Vision model. Next we take the summaries and store them in a multi-vector retriever together with the corresponding embeddings. We also store the raw documents there; for the images we only store the summary, because currently I've not seen a way to process images and text in a single request to OpenAI. Then we make a standard request, retrieve the relevant documents from the multi-vector retriever through a similarity search, and send everything to the LLM. So that's the theory; let's now have a look at how this works in practice.

OK, I'm here in VS Code and we first have to install some packages. We will use LangChain to make the requests, OpenAI, and unstructured to extract all the relevant parts of the document from the PDFs. We also use pydantic, lxml, chromadb as our vector store, and tiktoken for tokenization. So let's run the installation, and after that we will use unstructured to extract the tables, the text and also the images from our PDF.

I will show you what the PDF looks like. This is the PDF file; to be honest, the content doesn't really matter, but as you can see we've got some text elements, we've got images, and at the bottom we've also got some tables. We want to extract all the relevant information so that neither the tables nor the images get lost, and we will use unstructured to achieve that. unstructured makes use of Tesseract, so we first have to install Tesseract; I will provide a link to download and install it in the description and the repository. Before you run the code, check that everything works: after installing Tesseract you should run tesseract --version, and if you don't get an error there, you know Tesseract is installed correctly. So what we do first is import partition_pdf from the unstructured package, and then we also set the Tesseract command to the path where our tesseract.exe is stored.
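A minimal setup sketch of the installation and Tesseract configuration just described. The exact package list and the Windows path are assumptions based on what is mentioned in the video; depending on your unstructured version, the OCR binding may be unstructured_pytesseract rather than pytesseract.

```python
# Packages mentioned in the video (names assumed; pin versions as needed):
#   pip install langchain openai "unstructured[pdf]" pydantic lxml chromadb tiktoken

# unstructured relies on Tesseract for OCR. After installing Tesseract,
# verify it with `tesseract --version`, then point the Python binding at the
# binary if it is not already on your PATH. The path below is hypothetical.
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```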
We can then run the partition_pdf function from unstructured over our test PDF; it will extract all the relevant parts of the document, so the tables, the images and also the text elements. We first have to provide the file name; I do that dynamically by combining the current working directory with test.pdf, which is where our file is stored. Then we say that we want to extract the images and the tables from the PDF, we provide a chunking strategy so that all the text elements are chunked by title, and, very importantly, we set the image output directory: a new directory called output that will contain all of our images. Let's run this; it can take its time, maybe a minute and a half, so let's wait a little bit.

OK, it's done, and now everything is stored in the raw_pdf_elements variable: the text and the tables are stored there, and the images are stored in the output folder. As we can see, we've got seven images in our output directory. Let's first have a look at what this raw_pdf_elements variable looks like. As you can see, we've got a list, and here we can see our classes: we've got CompositeElement and Table elements. The CompositeElement is the text and the Table element is the table.

Now we want to extract the relevant information: we want to store all table elements in a list, the text elements in a list, and also all of the images in a list. For the images we cannot just send the files as a whole to OpenAI; we have to convert each image into a binary format and encode it with base64, so we read the files and encode them into base64. For the text and the table elements we loop over raw_pdf_elements; as you saw before, we've got the classes, so we can convert the type into a string and check whether "CompositeElement" or "Table" is in that string. If it's a CompositeElement we append it to the text elements list, otherwise we append it to the table elements list. After doing this we of course extract the text, which is stored in the text property, because we don't want to store the raw classes in those lists. If we run that, we can see that we've got three table elements and 24 text elements in our PDF. The images are stored in the output folder, so we list everything in the output path, check if it's a PNG or JPEG, and pass the full path to the encode_image function, which reads the file, encodes it into base64, and appends it to the image elements list. So we've now got our texts, our tables and our images; we've got seven images and the length of that list is seven, so that's correct.

Now we can run a summary function over each of these elements, so we first have to create that function. For the text the prompt is "Summarize the following text", for the tables "Summarize the following table", and for the images we have to do it a little differently, because this doesn't work with a normal text model. For the tables and the text we use the GPT-3.5 Turbo model; for the images we have to use the gpt-4-vision-preview model, because that's currently the only one that can work with images.
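A sketch of the extraction and the text/table summarization steps described above, assuming a test.pdf and an output folder as in the video. The partition_pdf keyword arguments follow the unstructured API as of late 2023 and may be named differently in newer releases; the summarize_* helpers are hypothetical names, and the LangChain import paths are for the 0.0.x line.

```python
import base64
import os

from unstructured.partition.pdf import partition_pdf
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# Partition the PDF into text chunks, tables and images.
raw_pdf_elements = partition_pdf(
    filename=os.path.join(os.getcwd(), "test.pdf"),
    extract_images_in_pdf=True,        # write embedded images to disk
    infer_table_structure=True,        # keep tables as Table elements
    chunking_strategy="by_title",      # chunk the text elements by title
    image_output_dir_path="output",    # folder for the extracted images
)

# Sort the elements by class name: CompositeElement -> text, Table -> table.
text_elements, table_elements = [], []
for element in raw_pdf_elements:
    if "CompositeElement" in str(type(element)):
        text_elements.append(element.text)
    elif "Table" in str(type(element)):
        table_elements.append(element.text)

def encode_image(path: str) -> str:
    """Read an image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_elements = [
    encode_image(os.path.join("output", name))
    for name in os.listdir("output")
    if name.lower().endswith((".png", ".jpg", ".jpeg"))
]

# Text and table summaries use a plain chat model.
llm = ChatOpenAI(model="gpt-3.5-turbo")

def summarize_text(text: str) -> str:
    msg = HumanMessage(content=f"Summarize the following text:\n\n{text}")
    return llm.invoke([msg]).content

def summarize_table(table: str) -> str:
    msg = HumanMessage(content=f"Summarize the following table:\n\n{table}")
    return llm.invoke([msg]).content
```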
So here's how we build the image request: we create a system message, "You are a bot that is good at analyzing images", and then we provide a second class from LangChain, the HumanMessage. Its content has to be a list: we provide type equals text, and in that text we say "Describe the contents of the image"; then in another dictionary we provide the type image_url, and inside that image_url is yet another dictionary with the URL. Because we use our file locally, we have to provide it as a base64 data URL, so we just pass in the encoded image. Now let's run the code.

We created our function, and now we can create a summary for each text, each table, and also for the images. This will take its time, and my kernel crashed multiple times when running this before, so I will print a statement after summarizing each element, so we can see that the kernel has not crashed. Now let's do this for the text elements and also for the images. OK, that worked. In case you wondered why we use an index here: you can remove that in your own code, but I just want to make this faster and of course also a little bit cheaper, because the GPT-4 models are quite expensive compared to the GPT-3.5 Turbo model.

Perfect. After creating our summaries we can now use a MultiVectorRetriever to store our documents, the summaries and also the embeddings. For our vector store we will use Chroma, and we will also use a doc store where we keep our raw documents; here we will use an InMemoryStore from LangChain. After that we instantiate the MultiVectorRetriever class with our Chroma vector store, our document store, and an ID key which we will use later in the code; we will just call it doc_id.

Then we create a function to add documents to the retriever. To do this we first have to create some IDs, so we use uuid4 to create a UUID for each of our documents. We then build a list called summary_docs, where each entry is a new instance of the Document class, which is quite important for LangChain. It has a page_content attribute and also a metadata attribute, which is a dictionary: in page_content we put our summaries, and inside the metadata we put one of the IDs we created. Then we call add_documents, the normal function of a vector store, to add the summaries to the vector store. We also store our documents, so our raw data, in the doc store, and the link between vector store and doc store are these doc IDs: inside the vector store we've got the Document class whose metadata attribute contains the doc_id key, and inside the doc store we store the corresponding doc ID as well.

So now we've got a function to add documents, and we can add everything to the MultiVectorRetriever. For the text we provide the text summaries as the first argument, so they are stored in the vector store, and the text elements are stored in the doc store. For the tables, the table summaries are stored in the vector store and the table elements in the doc store. For the image summaries this is a little bit different, because we cannot retrieve images and send them in combination with text to OpenAI; this does not work yet. So we also provide the image summaries inside the doc store. That's a little bit unfortunate, but maybe in the future that will work.
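A sketch of the image summarization call and the MultiVectorRetriever wiring described above. The helper names (summarize_image, add_documents_to_retriever) are hypothetical, the import paths are for the LangChain 0.0.x line used in the video, and the snippet assumes the element lists and summarize_* helpers from the previous sketch.

```python
import uuid

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema import Document, HumanMessage, SystemMessage
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma

# Image summaries need the vision-capable model.
vision_llm = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=512)

def summarize_image(encoded_image: str) -> str:
    messages = [
        SystemMessage(content="You are a bot that is good at analyzing images."),
        HumanMessage(content=[
            {"type": "text", "text": "Describe the contents of this image."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
            },
        ]),
    ]
    return vision_llm.invoke(messages).content

# Summaries for all three element types (uses the helpers from the previous sketch).
text_summaries = [summarize_text(t) for t in text_elements]
table_summaries = [summarize_table(t) for t in table_elements]
image_summaries = [summarize_image(img) for img in image_elements]

# MultiVectorRetriever: summaries + embeddings go into the vector store,
# the raw documents go into the doc store, linked by a shared doc_id.
id_key = "doc_id"
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()
retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=docstore, id_key=id_key)

def add_documents_to_retriever(summaries, original_contents):
    doc_ids = [str(uuid.uuid4()) for _ in summaries]
    summary_docs = [
        Document(page_content=summary, metadata={id_key: doc_ids[i]})
        for i, summary in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)
    retriever.docstore.mset(list(zip(doc_ids, original_contents)))

# Texts/tables: summary in the vector store, raw element in the doc store.
# Images: the summary goes into both, since raw images can't be sent back with text here.
add_documents_to_retriever(text_summaries, text_elements)
add_documents_to_retriever(table_summaries, table_elements)
add_documents_to_retriever(image_summaries, image_summaries)
```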
So now we've added everything, and after that we can of course retrieve the information. Let's retrieve something from our vector store using the retriever's get_relevant_documents method. We ask "What do you see on the images in the database?"; to be honest, it doesn't really matter what we ask here. We get back "The image displays a screenshot of a digital interface with three separate boxes", so what we get back is a summary of one of the images, not the raw image.

Now we can use that in combination with a chat model. We create a new template: answer the question based only on the following context, which can include text, images and tables. This is the context which we retrieve with that function, so it will be filled in under the hood. We use a ChatOpenAI model, and now we use the LangChain Expression Language to create a new chain: we provide the retriever for the context, and for the question a RunnablePassthrough, so it is passed through unchanged. The retriever will fetch the four most relevant documents for the specific question. We pass that to the prompt, the variables are inserted, and then we pass everything to the model and the output parser. This is our chain, and if we invoke it we can ask "What do you see on the images in the database?" This is the answer: based on the given context, the images in the database display a screenshot of a digital interface with three separate text boxes. So this is the summary of one of our images, and that's how you can do multimodal retrieval with LangChain. If you liked the video, feel free to subscribe to my channel and like the video. Thank you very much, bye-bye.
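For reference, a minimal sketch of the retrieval query and the LCEL chain described at the end of the video. It assumes the retriever built in the previous sketch; the prompt wording follows the video.

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

# `retriever` is the MultiVectorRetriever built above.
# Direct retrieval returns the stored summaries (and raw text/table elements):
docs = retriever.get_relevant_documents("What do you see on the images in the database?")

template = """Answer the question based only on the following context,
which can include text, images and tables:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-3.5-turbo")

# The retriever fills {context} with the most relevant documents (4 by default,
# stringified into the prompt); RunnablePassthrough forwards the question unchanged.
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(chain.invoke("What do you see on the images in the database?"))
```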
Info
Channel: Coding Crashcourses
Views: 1,633
Keywords: langchain, openai, gpt-4, gpt-4-vision, multimodal
Id: 6D9mpFCPeI8
Length: 13min 8sec (788 seconds)
Published: Mon Nov 20 2023