Working with MULTIPLE PDF Files in LangChain: ChatGPT for your Data

Video Statistics and Information

Captions
If you are searching for a way to load multiple PDF files into LangChain to do information retrieval, then this video is for you. I'm going to show you how to load multiple PDF files into LangChain and use OpenAI's models for efficient information retrieval, so if you want to save time and streamline your research process, keep watching.

We're going to be using Google Colab, and I will put a link to this notebook. Before using the notebook, just make sure you go to File and make a copy in your own Drive. Let me walk you through it. Here we are installing the different packages that are needed: LangChain, unstructured (that's for reading the PDF files), openai, chromadb, Cython for faster speed, and tiktoken. Next we are importing two different classes: one is the UnstructuredPDFLoader, which will be used for loading the files, and the other is the VectorstoreIndexCreator. This is where all the magic is going to happen, so I'm going to go into the details later in the video.

Next we need an API key from OpenAI. You can go to this link; you will need to make an account, which is free, and usually you get some free credit. Go to your account, click on API keys, and create a new one: simply name your API key and click Create. We're storing this key in an environment variable. I'm going to delete mine, so you're not going to see my API key.

Next, using this code, we connect our Google Drive. If you run it for the first time, it will ask for permission to connect to your Google Drive; in my case it's already mounted, so I don't have to do anything again. Keep in mind you can run this whole notebook locally as well; you don't have to run it against Google Drive. What I did was create a folder called data inside my Google Drive and put two different PDF files in there: I took the State of the Union address and simply divided it into two files, part one and part two. Later on we are going to look at an example with research papers, which contain images as well, and see whether this approach works there too. This code simply takes the root directory of the Google Drive, appends my data folder to the path, and you can see there are two different files in there.

Next we use a list comprehension to load both these PDF files with two different data loaders. Essentially what is happening is a for loop: it iterates over the files in that directory, picks each one, and creates a separate loader for it. So here you can see that we have two data loaders; depending on the number of PDF files in your directory, the number of data loaders will change.

Next we will look at the VectorstoreIndexCreator. This one line of code is doing a lot of heavy-duty work. Using this function we load all the PDF files from these different loaders, and then there are three main steps that happen after the documents are loaded: first, the documents are split into chunks; then, for each chunk, embeddings are created; and finally the documents and embeddings are stored in a vector store, which by default is Chroma DB. In my previous video I showed you how to do these steps manually, but here this single function takes care of everything in the background. It's using OpenAI's text embeddings; we will look at that in a bit. So let's run this code.
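As a rough sketch, the notebook cells described above look something like the following. The Drive path, folder name, and file layout are placeholders, and the exact import paths can differ between LangChain versions:

```python
# Packages installed at the top of the notebook (Colab cell):
# !pip install langchain unstructured openai chromadb Cython tiktoken

import os

from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

# Store the OpenAI API key in an environment variable.
os.environ["OPENAI_API_KEY"] = "sk-..."  # paste your own key here

# In Colab, mount Google Drive first (skip this when running locally):
# from google.colab import drive
# drive.mount("/content/gdrive")

# Placeholder path to the Drive folder that holds the PDF files.
pdf_folder_path = "/content/gdrive/My Drive/data"
print(os.listdir(pdf_folder_path))

# One UnstructuredPDFLoader per PDF file, built with a list comprehension.
loaders = [
    UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn))
    for fn in os.listdir(pdf_folder_path)
]

# Load, split, embed, and store everything in a Chroma vector store.
index = VectorstoreIndexCreator().from_loaders(loaders)
```

Once the index exists, questions go through index.query(...), and index.query_with_sources(...) additionally reports which PDF the answer came from, as shown next.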
You will see this warning message because detectron2 is not installed. I was having issues with that library; when it was installed it was giving me errors, so I didn't install it. That's fine for the time being, and we will simply ignore these messages.

Now, in order to retrieve information from our vector store, all you have to do is use index.query with whatever query or prompt you have. In this case: what is the main topic of the address? The answer: the main topic of the address was investing in America and rebuilding infrastructure. Remember, we are looking at the State of the Union address.

Let's look at a slightly more complex example. Here I have another folder called data_2, and it contains two different research papers: one on lip syncing and the GPT4All paper. These are the type of documents you are probably going to be working with: there are images in there and a whole bunch of text. Let's do information retrieval on those. The main steps are exactly the same: first we define the path and see that there are these two documents, then we create our indices and build a vector store from them. Here I am asking a simple query, index.query: how was the GPT4All model trained? I get a response: the GPT4All model was trained with LoRA, and here is the number of parameters.

Now let's say I have multiple documents and I want to see which document contains this information. I can use query_with_sources instead, and if I give it the same prompt, then apart from the response we saw before, it will also tell us the source. If you remember, there are two papers, but it picked the actual paper where this information is available, and that's pretty neat, because if you want to look at the sources you can actually do that. Let's ask a question about the lip-sync paper: who wrote it? It returns that the lip-sync paper was written by this list of authors, and it adds the source as well, which is the file name we have in here. So it's pretty neat; you can easily work with documents that contain images.

Now let's look at what is actually happening under the hood in the VectorstoreIndexCreator. This is important to understand in case you want to change any of the hyperparameters or use a different model. We will use the LangChain documentation for this. The main steps are: first you load a document; then you split that document into chunks, in this case chunks of a thousand characters, and you can define the overlap, or choose not to overlap, it's up to you; after splitting you end up with multiple documents. Then you select your embeddings: here they're using OpenAI's embeddings, but you can replace them with any other type of embedding that you want. Using these embeddings and the chunked text documents, you create the vector store: by default Chroma is used, but you can swap in another vector store. Next, to retrieve information from the embeddings and the vector store, you simply create a retriever object; that is the retriever interface we'll be using.
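To make those steps concrete, here is a minimal sketch of the same pipeline written out manually instead of through the wrapper. The file name and chunk settings are just examples, and import paths again depend on your LangChain version:

```python
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# 1. Load the document (placeholder file name).
documents = UnstructuredPDFLoader("state_of_the_union_part_1.pdf").load()

# 2. Split it into chunks of roughly 1,000 characters, with no overlap here.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# 3. Choose an embedding model (OpenAI's text embeddings in the video).
embeddings = OpenAIEmbeddings()

# 4. Store the chunks and their embeddings in a Chroma vector store.
db = Chroma.from_documents(texts, embeddings)

# 5. Expose the vector store through a retriever interface.
retriever = db.as_retriever()
```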
Next, putting everything together, we simply create a chain which has our large language model, in this case an OpenAI model (I think it's text-davinci), and the retriever interface we just created, and then you simply run the query to get a response. The VectorstoreIndexCreator is a simple wrapper around all of these steps, and you can pass it different parameters: for example, the vector store (which is Chroma here), the type of embedding, and the text splitter you want to use. I went into a little more detail than I usually do, but this is important so that you are able to understand it and modify these things based on your own needs. I hope this video was helpful. If you have any questions, please comment below and I will try my best to answer them. Consider liking and subscribing to the channel. Thanks for watching, see you in the next one.
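As a closing reference, the chain step and the explicit parameterization of VectorstoreIndexCreator described above might look roughly like this. It is a sketch, not the exact notebook code: it continues from the retriever and loaders built in the earlier sketches, and the default completion model is an assumption based on LangChain's documentation of the time:

```python
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Tie the LLM and the retriever together in a question-answering chain.
# `retriever` is the object built in the previous sketch.
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),        # at the time this defaulted to a text-davinci completion model
    chain_type="stuff",
    retriever=retriever,
)
print(qa.run("What is the main topic of the address?"))

# The wrapper itself, with the same pieces passed in explicitly
# instead of relying on the defaults.
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=OpenAIEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0),
)
index = index_creator.from_loaders(loaders)  # `loaders` list from the first sketch
```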
Info
Channel: Prompt Engineering
Views: 39,630
Keywords: LLMs, prompt engineering, Prompt Engineer, natural language processing, GPT-4, chatgpt for pdf files, ChatGPT for PDF, langchain openai, langchain in python, embeddings stable diffusion, Text Embeddings, langchain demo, long chain tutorial, langchain, langchain javascript, gpt-3, openai, vectorstorage, chroma, train gpt on documents, train gpt on your data, train openai, train openai model, train openai with own data, langchain tutorial, how to train gpt-3, embeddings, langchain ai
Id: s5LhRdh5fu4
Length: 9min 2sec (542 seconds)
Published: Thu Apr 13 2023