ChatGPT for YOUR OWN PDF files with LangChain

Video Statistics and Information

Captions
If you have a bunch of PDF files and you want to have a conversation with them, just like you can have a conversation with ChatGPT about any topic, then this video is for you. I'm going to show you a website that uses the exact same concept and does the exact same work. I'll also show you how to get free credits from OpenAI so you can experiment with this, as well as my own usage and the cost associated with it.

For our application we're going to use LangChain, a framework for developing applications powered by language models. It lets you connect your language model APIs to other sources of data, and then lets the language model interact with its environment, in this case those data sources. You can use this with any model from OpenAI, along with all the recently open-sourced models such as LLaMA, Alpaca, or GPT4All.

Before looking at the code, let's walk through the architecture diagram to get a better understanding. Don't worry if you are not a coder; I will walk you through each and every line of code and explain what's going on, so it's going to be pretty easy to follow.

Let's get started. We start with the PDF files. You can have multiple PDF files; for this example I'm going to use just one. You simply read that PDF file and extract the data from it using Python. Now, say you have 100 or 200 pages in your document: you cannot really fit that into your large language model, because you will hit the token limit. So we divide the document into smaller chunks, such that the length of each chunk is smaller than the token size the model supports. In the diagram's example the document is divided into ten chunks. Next we convert each of the chunks into its corresponding embedding. An embedding is a vector, a list of floating-point numbers, and it essentially works like a compression algorithm: let's say each text chunk has a thousand characters, but using embeddings we can use it
to reduce it to a much smaller size; let's say the embedding size is only three. That is the compression being performed here. In this case we're going to use OpenAI's text embeddings. With embeddings, rather than comparing texts directly, you can simply compare embeddings and see which two texts are closer to each other.

Based on these embeddings we build a knowledge base. Now, with this knowledge base in place, when a user asks a question, we use OpenAI's embeddings to convert that question text into an embedding too, so it's going to be a list of numbers, and then we use a vector database to run a query against the knowledge base. Based on the embeddings of the documents we have stored, we get back results, ranked by their closeness or relatedness to our query. We then feed those results to a generative large language model to generate a response that goes back to the user. I hope this makes it simple to understand what's going on.

Now let's look at the code. Here is a notebook I put together for this exercise. First you need to go to Runtime; you don't really need a GPU runtime for this, so the normal CPU runtime is fine. Then come here and you'll see a Connect button, so click on that. If you want to run a cell, you'll see a play button next to it; click it and the cell runs. Running the first cell installs all the packages we need for this specific example. I'm using the technical report of GPT4All as my document; I have a video on the actual GPT4All model, and I'm going to put a link to it if you're interested in watching. So it's basically the PDF of their technical report, and we're going to ask questions about it. If you run the next cell, it imports all the packages. Next we run the cell where we need the API key, so
since we're using OpenAI, we'll need an OpenAI API key. To get your API key you can go to this link. You'll be asked to create an account; I already have one, so I don't need to create it. Then you'll see something like this: "API keys". Click on it; if you don't have any API key yet, it will ask you to create a new one. Let's say this is the API key: you can just copy it and close the dialog. I'm going to delete it afterwards, so don't try to use this specific API key; I just wanted to show you the process. After that, simply paste the API key and run the cell.

Since we're running this in Google Colab and my file is in my Google Drive, we're going to connect Google Drive to the Colab notebook. Run the cell and it will ask for permission to connect your Google Drive; say yes and allow it, and that's all. Then you'll see a message saying your Google Drive is mounted, and this is the base path of your Google Drive.

Next, I have the technical report in a folder on my Google Drive called "data", so I simply put in the path of that folder, and we're going to read the file. Just run the cell and it will create this reader object; it has all the information on how to read the contents of my PDF file. But we want the raw text, so if you run the next cell it will go through each page, read the text from it, and simply return the raw text. If I type raw_text, these are the contents of my PDF file. It didn't show the whole file, because of space limits; it simply truncates it.

All right, so we have the raw text. Now we need to divide it into chunks, as I showed you in the architecture diagram, and we're using a chunk size of a thousand characters, because that
will ensure that we are not hitting the token limit. In this case I'm also using an overlap of 200, which means that if the first chunk contains the first thousand characters, the second chunk starts from character 800 onward. Let me show you what I mean. Running the splitter gives us a total of eight chunks. If I read the first chunk... okay, let me run this: that's the text in the first chunk. But if we look at the second chunk, it starts from "we collected roughly one million prompt responses", and there is the overlap: you can see "we collected roughly one million prompt responses" here as well. That's because we defined an overlap of 200 characters. You don't have to do this, but in my case it seems to help.

Next we download the embeddings from OpenAI. As I said, an embedding is simply a list of floating-point numbers, and you can use it to measure the distance between two different text strings or sentences. Here we simply set up OpenAI's embeddings, but now we want to compute, or actually find, the embeddings of our own text chunks. For that we're using a vector database: essentially, it takes the text chunks, finds their corresponding embeddings, and stores them in the document-search object.

Next we import a question-answering chain from LangChain and the corresponding OpenAI object. Here you can actually pass in different models: by default I'm using a smaller text model, but if you need a more powerful model you can use one of those; I think you can even use GPT-4 API calls if you have access. For this simple experiment, though, this model is fine. This creates a chain, and now we're all set to start asking questions. For example, my first query is: who are the authors of this article?
Right, then from our embeddings we simply run a query: we give it this text, it finds the text in the document that is semantically closest to our search query, runs the chain on it, and gives us a response. Let's wait for it... The response is "Yuvanesh Anand, Zach..."; I don't know how to pronounce the names, but we can actually go to the article and see who the authors were. You can see that's the first author, that's the second, and then the third, fourth, and fifth. So it was able to correctly retrieve that information from the document. Now, the beautiful part is that there is no explicit list of authors, like "author one, author two, author three"; even so, from the way the technical report is written, it was able to infer that these are the authors. If you go down, there are other people's names as well, but it completely ignored those. That's pretty awesome.

Next I asked: what was the cost of training the GPT4All model? It returned an answer of a hundred dollars. If we go back to the article, it says their released model can be trained in about eight hours, and here's the total cost. The interesting part is that there are other dollar figures too, like $500 and even $800, but it was still able to give us the correct dollar figure, because those are the costs of training other models. That's pretty unique.

I asked a couple of other questions. For example, "How was the model trained?": it answered that the model was trained with LoRA, and it told us how many training examples, or what dataset, was used; that's pretty good. Then "What was the size of the training data?" (the phrasing changes a little, but it could still give a correct answer): this many prompt-generation pairs were used in the creation of the dataset. Another question: how is this different
from the other models? It gave me a whole description: it says it has a non-commercial license, it is trained with LoRA, and so on. I don't think it's really an exact answer, because that exact information is not present in the paper. So let's ask something that is not in the technical report; I'll ask "What is Google Bard?" Let's see what the response is... and the answer is "I don't know", because that information is actually not in the technical report. So it's very powerful.

Now let me show you a website that uses the same concept to let you query your own PDF documents. The website is called pdfGPT.io. It used to be free, but now it requires you to give them your API key, and then you'll be able to query your documents. Anyway, here is how the interface looks: you upload your own PDF file, then simply start asking your questions, and it starts giving you responses. It's very similar to what we did in this tutorial, but they have a pretty nice interface around it, and they're using a somewhat more powerful model than what we had.

Now let me also show you the usage cost I had with this tutorial. If you go to your OpenAI account using the link I provided, first, as I said, there are the API keys you can see; then you can go to Usage. They gave me a $5 credit; so far I have used around 50 cents of it and made around 62 requests. So it's a minimal fee, nothing crazy, but something to be careful about.

Well, I hope you found this video useful. If you did, don't forget to comment and like the video; it really helps with the algorithm. Thanks for watching, see you in the next one!
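The page-by-page text extraction described in the walkthrough can be sketched in a few lines of Python. This is a minimal sketch, not the notebook's exact code; it assumes PyPDF2-style page objects, i.e. anything iterable whose items expose an `extract_text()` method, as `PdfReader(...).pages` does:

```python
def extract_raw_text(pages) -> str:
    """Concatenate the text of every page, skipping pages with no
    extractable text. `pages` can be PyPDF2's PdfReader(...).pages or
    any iterable of objects with an extract_text() method."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # scanned or empty pages can yield None or ""
            parts.append(text)
    return "\n".join(parts)
```

With PyPDF2 installed, this would be called as `extract_raw_text(PdfReader("report.pdf").pages)`; the joined string is the `raw_text` the video prints.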
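The chunking step, a thousand characters per chunk with a 200-character overlap, can be reproduced with plain Python. In the notebook a LangChain text splitter does this work; the sketch below just illustrates the sliding-window behaviour the video describes, with the same chunk_size and overlap values:

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split text into chunks of at most `chunk_size` characters,
    where each chunk repeats the last `overlap` characters of the
    previous one (so context isn't cut mid-sentence)."""
    step = chunk_size - overlap  # each new chunk starts 800 chars later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the final chunk already reaches the end of the text
    return chunks
```

For example, the second chunk starts at character 800, which is why the video shows the sentence "we collected roughly one million prompt responses" appearing at the end of chunk one and again at the start of chunk two.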
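Comparing embeddings instead of raw text, and ranking the stored chunks by closeness to a query, is what OpenAI's embeddings plus the vector store do in the notebook. A toy stand-in using cosine similarity shows the idea; the three-dimensional embeddings below are made up purely for illustration (real OpenAI embeddings have over a thousand dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors;
    values near 1.0 mean the texts are semantically close."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_embedding, chunk_embeddings, k=2):
    """Return the k chunk texts whose embeddings are closest to the
    query embedding, i.e. the context passed to the language model."""
    ranked = sorted(chunk_embeddings.items(),
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

The chunks returned by `top_k` are what the question-answering chain receives as context, which is also why a question with no close chunk, like "What is Google Bard?", comes back as "I don't know".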
Info
Channel: Prompt Engineering
Views: 170,561
Keywords: LLMs, prompt engineering, Prompt Engineer, natural language processing, GPT-4, chatgpt for pdf files, ChatGPT for PDF, langchain openai, langchain in python, embeddings stable diffusion, Text Embeddings, langchain demo, long chain tutorial, langchain, langchain javascript, gpt-3, openai, vectorstorage, chroma, train gpt on documents, train gpt on your data, train openai, train openai model, train openai with own data, langchain tutorial, how to train gpt-3, embeddings, langchain ai
Id: TLf90ipMzfE
Length: 14min 20sec (860 seconds)
Published: Tue Apr 04 2023