Langchain: PDF Chat App (GUI) | ChatGPT for Your PDF FILES | Step-by-Step Tutorial

Video Statistics and Information

Captions
You've probably seen websites where you can upload your PDF files and then chat with the documents you uploaded. In today's video we're going to do exactly the same thing, but we'll build our own user interface. Here's how our app will look: you simply drag in a PDF file (or any other supported file type), ask questions directly about that PDF, and get a response. By the end of the video you'll be able to replicate this; all you need is some basic Python experience, and for the rest I'll walk you through the process step by step.

Here's the architecture we're going to build. We'll create both the front end and the back end of our chat app, and this will be an extremely detailed video with live coding throughout. The end product is very similar to something like ChatPDF. For example, to use ChatPDF you simply drag in your PDF file (say, the Constitution of the United States); it creates a summary, and after that you can interact with it: "How many states are in the USA?" It's using ChatGPT at the back end, and based on ChatGPT's training data as well as the data in your PDF file, it comes up with an answer. In this case it says it doesn't have information on the current number of states, but that as of 2021 there were 50 states. We're going to design a very similar chat app.

For the graphical user interface we'll use Streamlit. At the back end, processing your data, creating embeddings, and generating responses from the large language model will come from the OpenAI API. You can easily replace those with open-source models (I have a whole bunch of videos on that, so it's pretty easy to do), but to show you the process I'll use the OpenAI API. The front end, again, is based on Streamlit: the user uploads a PDF file, the app accepts prompts from the user as input, generates responses based on the LLM and the embeddings, and shows those back to the user.

That's an overview of how the system will look. In terms of background requirements, all you need is some knowledge of Python; the rest I'll walk you through step by step. Essentially, this is the pipeline: we upload our PDF file, divide it into chunks, compute embeddings, and create a vector store that becomes our knowledge base. For a query, the user asks a question, we compute embeddings for it, search the knowledge base, and return a result. If these things don't make sense yet, that's absolutely fine; by the end of the video you'll understand all of these concepts.

I'll be coding everything in Visual Studio Code. All the required packages are listed in the requirements.txt file, but if you hit any errors you may have to install a few other packages; just install those as needed. In my working directory there are just two files: app.py, where all the code will go, and requirements.txt. First, let's install the required packages with pip install -r requirements.txt; this goes through the file and installs the packages one by one. I've already installed them in my virtual environment, so I don't need to do that here. One other tip: make sure
that you create a new virtual environment for each project. If you have Python and conda installed, you'd do something like conda create -n followed by your environment name. I already have one that I'm using, myenv, so I'm not going to repeat that step.

As I said, we'll create our graphical user interface using Streamlit. So what exactly is it? Streamlit is a package that lets you build beautiful graphical user interfaces right in your Python code. It's very similar to normal Python, with just a couple of changes, and it's comparable to Gradio, which is used a lot on Hugging Face.

First, let's create the basic structure of our graphical user interface, and then we'll add more components and explain what's going on. I'll import Streamlit: import streamlit as st. Then I'll add some code to create a vertical sidebar; if you saw the start of the video, there's a sidebar with some information in it. Let's save this; I'll walk you through the file in a bit. To run it, we say streamlit run app.py. This is a very simple app, but when you run it you'll see a local URL; copy that into your browser and here's our basic app running. It has a sidebar with some information, and the main area is where we'll build the app itself.

Within the sidebar we define a title (the title we gave the app) and then a markdown block, just like normal markdown. In the markdown we define the 'About' section, the subtitle, and then Streamlit, LangChain, and OpenAI with their corresponding URLs (these are active links; clicking one takes you to that website), plus an emoji. The great thing about Streamlit is that you can make changes in your Python code and see them right away; you just need to rerun, which I'll show you in a bit.

We'll use this as our starting point, and I'm going to define a main function, so whenever we run this file it goes to main. If you want to write some text to the page, the simplest way is st.write, very similar to a print statement. Let's say 'Hello', save the Python file, click rerun, and the text shows up. That's how Streamlit works. What I'm actually going to do is add a header with st.header, which becomes the header of our web page; save, rerun, and there it is. I'll hide the sidebar, since we'll be working in the main area.

Next, similar to the ChatPDF website, we want to let users upload their PDF files. For that we say pdf = st.file_uploader, give the user a message like 'Upload your PDF', and set the accepted file type to 'pdf'. If we rerun, we now have a place where users can upload PDF files, with a limit of 200 MB per file. This simply uploads the PDF file; now let me show you how the upload actually works.
Here I'm dragging in a PDF file, the US Constitution, and it uploads. This only uploads the PDF, though; we still need to read it somehow, and for that we'll use PdfReader from the PyPDF2 package. Here's what I did: I took the pdf object (the uploaded file) and passed it into PdfReader to get a reader back. Let's inspect the object itself, pdf_reader: I save, upload a file, and you can see it now shows the reader object we just created.

There's one issue with doing things this way. If I rerun this fresh, before uploading any file, it throws an error, and the error comes from this line: since we haven't uploaded anything yet, the uploader returns None. Once we upload a file, we get the uploaded-file object and a PdfReader object as well. To take care of this we can do something very simple: if pdf is not None. We check whether a file has been uploaded; if not, we don't execute this block of code at all, and only run it once a file is there. Let's rerun: with nothing uploaded there are no issues, and when I upload a PDF we see the object again.

So now we have the PdfReader object, and we need to read data from it. Specifically for LangChain, you want to read one page at a time. There are multiple ways to do this.
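As a sketch, the page-reading step can be wrapped in a small helper (the helper name is my own; the .pages attribute and the per-page extract_text() method are the PyPDF2/pypdf reader API):

```python
def extract_all_text(pdf_reader):
    """Concatenate the text of every page of a PdfReader-like object.

    Works with any object exposing .pages, where each page has an
    extract_text() method, as PyPDF2/pypdf readers do.
    """
    text = ""
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text
```

Keeping it as a plain function also makes it easy to test without a real PDF by passing in a stub object.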
I have other videos that go through them, but an easier way is: for page in pdf_reader.pages. This takes one page at a time, and for each page we call page.extract_text() and append the result to the text variable we defined. Let's check that it works: rerun, and good, it read every page and appended them all together. Here is all the text inside the Constitution PDF. That completes the data-loading part; now the fun part starts.

Next we need to take these pages and split them into smaller chunks so our LLM can process them. The reason is that the large language models we'll be using have a limited context window: for ChatGPT I think it's 4,096 tokens; for GPT-4 the publicly available version is around 8k, and I believe there's a 32k version as well, but it's not yet publicly available. So we need to divide our text into smaller chunks, and to do that we'll use RecursiveCharacterTextSplitter from LangChain: from langchain.text_splitter import RecursiveCharacterTextSplitter.

Here's how we set it up: the splitter will divide the document into chunks of 1,000 characters each, with an overlap of 200 characters between consecutive chunks. To explain the overlap part, assume the first chunk is 1,000 characters long; the next chunk doesn't start where the first one ends, but a bit earlier, so the two chunks share 200 characters. This is important when a piece of information spans more than one chunk: if you have information spread across multiple interdependent sentences, you want the second chunk to cover some of the previous one so it keeps the context of the conversation.

With the text splitter defined, we convert the text into chunks by calling text_splitter.split_text (there's also split_documents if you're working with documents; here we've already converted everything to text, so we just pass the text object). To make sure it's actually splitting, I'll write out the chunks. Rerun, and here they are: chunks 0, 1, 2, 3, and so on. Another thing worth noticing is the overlap: look at how the first chunk ends, and then the second chunk starts back at 'Section 2. The House of Representatives shall be composed of...'. That's the overlap at work.

Next, for each of these chunks we need to compute the corresponding embeddings. You may ask, what is an embedding? An embedding is a numerical representation of your text. Let's find a good example: say you have a three-dimensional vector space in which each word is represented by three dimensions, so 'king' is one point and 'queen' is another point in that space.
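To make this concrete, here's a toy version with made-up 3-D vectors (illustrative numbers only; real embedding models such as OpenAI's produce vectors with hundreds or thousands of dimensions):

```python
def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means
    the vectors (and hence the texts they represent) are similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Made-up 3-D "embeddings"; only their relative directions matter here.
king = (0.9, 0.8, 0.1)
queen = (0.8, 0.9, 0.2)
banana = (0.1, 0.2, 0.9)

king_queen = cosine_similarity(king, queen)
king_banana = cosine_similarity(king, banana)
```

Because king and queen point in nearly the same direction, their similarity comes out much higher than king versus banana; applying the same comparison to chunk embeddings is how the search in the later steps finds relevant passages.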
An even better example: the word 'cat' represented by a four-dimensional embedding, a set of numerical values that simply stands in for the word. The embeddings we'll use work the same way, but on whole chunks: instead of a single word, a whole chunk, a bunch of sentences, is represented as a vector of numbers. Then you can find which chunks are similar to each other simply by comparing the vectors. I hope this concept of embeddings is clear; I have a whole bunch of other videos on the topic if you're interested.

To compute embeddings for our chunks we'll use OpenAI's text embeddings. Here's what happens: you provide the text, there's a pretrained text-embedding model, and it computes the embeddings and returns them to you. We'll use the wrapper within LangChain that wraps the OpenAI embeddings: from langchain.embeddings.openai import OpenAIEmbeddings. Then we create an embeddings object: embeddings = OpenAIEmbeddings().

That's just the embedding object; we still want to compute embeddings over our documents and put them in a vector store. LangChain supports a whole bunch of vector stores; if you look at the LangChain documentation, there's a list of all of them. In this specific case we'll use the FAISS vector store, but you can use whichever one you want. The way it works is we pass our chunks along with the embeddings: first import FAISS, then define a variable, vector_store = FAISS.from_texts(chunks, embedding=embeddings).

There's one thing you need to be very careful about: the cost of running this multiple times. What do I mean by that? Every time you upload a file and this code runs, it computes the embeddings again, and that incurs charges because we're calling the OpenAI API; with each run the cost adds up. So we want a way to compute the embeddings and the vector store only once per file and keep the result on disk. To do that, let's come up with a method. First, I'll comment this section out and rerun. The pdf object has a field called name: st.write(pdf.name) shows that the file we uploaded is constitution.pdf. So for each file we upload, I'll take the part of the name before the file extension, compute the vector-store embeddings, and store them. The next time a file is uploaded, I'll just check whether a vector store with that name already exists on disk.
If it already exists, we don't need to run the embedding step at all, which will save us a lot of money. Here's what we can do. First let's build the file name: I take pdf.name, drop the last four characters (the '.pdf' extension), and call the result store_name. Next we write the vector store to a pickle file: with open(...) as f, then pickle.dump. Let me first import pickle; pickle is basically for writing Python objects to storage. So I say pickle.dump(vector_store, f), because we want to write the vector store to disk.

Let me show you something. Initially there are only two files here, app.py and requirements.txt. We also need to store the OpenAI API key, because without it this will give us an error; let me run it so you can see what I mean, and then we'll fix it. Rerun, and you see 'validation error for OpenAIEmbeddings': it's not able to get the OpenAI API key, so we need to provide that first. I'll create a .env file, and inside it I'll type OPENAI_API_KEY= followed by the key we're going to use; it has to be spelled exactly like that. Then go to your OpenAI account at platform.openai.com; I have a whole bunch of keys already, but you can create a new one, give it a name, and simply copy it and paste it here.
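For reference, here's roughly what the loading step does under the hood (a minimal stand-in I wrote for illustration; the real python-dotenv load_dotenv() also handles quoting, comments, export prefixes, and more):

```python
import os

def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a dotenv-style file into os.environ,
    without overwriting variables that are already set."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

Libraries like the OpenAI wrappers then pick the key up from os.environ["OPENAI_API_KEY"] automatically.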
I've pasted in my OpenAI key; this is how it looks (I'll delete this key right after recording the video). Now we need to actually load that API key. For that we use from dotenv import load_dotenv; load_dotenv() is the function that loads the environment variables. We just need to call it, and you can do that outside the main function, since execution is sequential, or inside it; let's just call it here and make sure everything works. As you see there are now three files: .env, app.py, and requirements.txt. If we run this, it's supposed to write our vector store to disk. I hit an error at first because the file name I gave was wrong (.env), but after fixing that, save, rerun, and it loads FAISS successfully. And if you look, there's a new file called constitution.pickle.

So we're writing it to disk, but if we run this again it will still recompute the embeddings and write them to disk again. First we need to check whether the file already exists; if it does, we shouldn't write it anymore. To do that, let's add a condition: if os.path.exists(...) with the same file name. If that path exists, we simply need to read it: with open(..., 'rb') as f, opening in read mode, and then we
read the file and store it in the vector_store variable. Since it's a pickle file, we say vector_store = pickle.load(f); this part reads the store back from storage. Just to confirm that's what's happening, let's run a quick test and print 'Embeddings loaded from disk'. If the file does not exist, we want to recompute the embeddings, and that's when we run the whole operation; I'll move that block into the else branch. We also need import os at the top to bring in the operating-system package.

Let me walk you through the logic. We calculate the chunks; then we check whether, for this file name, embeddings already exist on disk. If they exist, we simply read them from disk. If they don't, we create an embeddings object with OpenAIEmbeddings, build a vector store with FAISS from the chunks and those embeddings, and write it to disk. This process will definitely save us some money. Let's test that it's really reading from disk: I'll add another message, 'Embeddings computation completed', so for a new file we should see that message, and otherwise the other one. Rerun, and it says 'Embeddings loaded from disk', which means it's reading the pickle. Awesome. I'll comment both messages out, since they were just for logging, save, and rerun.

Next, we need to be able to accept an input, a question, from the user.
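The cache-or-compute logic just described can be condensed into one helper (a sketch of the pattern only: load_or_build and the dict stand-in are mine, and in the real app build() would be the FAISS.from_texts call):

```python
import os
import pickle

def load_or_build(store_name, build):
    """Return the cached object from <store_name>.pkl if it exists;
    otherwise call build() (the expensive embedding step), persist
    the result, and return it."""
    path = f"{store_name}.pkl"
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)   # embeddings loaded from disk
    store = build()                  # e.g. FAISS.from_texts(chunks, embedding=embeddings)
    with open(path, "wb") as f:
        pickle.dump(store, f)        # embeddings computation completed
    return store
```

Because build() only runs on a cache miss, re-uploading the same PDF costs no new OpenAI API calls. (One caveat: pickling the store this way worked in early LangChain versions, but LangChain's FAISS wrapper also offers its own save_local/load_local methods for persistence.)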
If you go back to the ChatPDF site (let me just drag the file in again), it gives you a box where you can ask questions, and we want something similar. A simple way to do this is a text input: query = st.text_input('Ask questions about your PDF file'). Run it again and here's our input box. Let's type 'What is the capital of the USA?'; I don't know whether that information is in the Constitution or not, and right now nothing happens. What we can do is add st.write(query), to make sure that whatever the user typed is captured and echoed back. Now you can see the input is simply repeated, because we're just writing it out, so it works.

Going back to our architecture diagram, here's what we've done so far: we created the knowledge base and stored it as a FAISS index, and we can accept questions from the user. Next we need to compute embeddings for the question and do a semantic search; the result of a semantic search is a set of documents the system thinks are similar to our query. This operation is performed on the knowledge base, the vector store, not on the LLM. To do it we say docs = vector_store.similarity_search(query), because we want to find documents similar to the query we pass in.

Just to reiterate what happens: it takes the question, computes its embedding using the OpenAI embeddings, and then does a semantic search over all the chunks; by default I believe it returns the top three documents most similar to the query. To make sure we don't run into issues, I'll wrap this in if query: so the block only runs once a question has actually been asked. For debugging purposes, let's also write out the documents that were returned. Running this, when I ask the query it returns the top documents where it thinks the relevant information is. If we change the question to 'How many states are in the USA?', in this case it actually returns four documents, not three; you can pass the parameter k, which sets the number of documents to return (the top three, four, five, and so on).

This matters when dealing with large language models; it's where the concept of the context window comes in. This step simply returns the most relevant documents; you then feed those documents to the large language model, they become its context, and from the query plus that context the LLM generates a response. If you pass four or five documents, you might exceed the LLM's context window and it will start throwing errors.
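To demystify what similarity_search is doing over the vector store, here's a toy ranking function over made-up vectors (my own brute-force illustration of the idea; FAISS uses optimized nearest-neighbor indexes rather than this loop):

```python
def top_k_indices(query_vec, chunk_vecs, k=3):
    """Rank chunk vectors by cosine similarity to the query vector and
    return the indices of the k closest ones, best first."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        return dot / (norm_a * norm_b)
    scored = sorted(
        ((cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]
```

In the real app the query is first embedded with the same OpenAIEmbeddings model as the chunks, so query and chunks live in the same vector space and the comparison is meaningful.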
We'll stick to three here, and hopefully this works; rerun, and yes, now we see just three documents: 0, 1, and 2. We're almost there; only the final step remains. We have our returned documents, the ranked results, and we'll feed them as context to the LLM, in this case an OpenAI model, along with the question, to get a result. For this purpose we'll use chains from LangChain, specifically a question-answering chain: it needs an LLM (OpenAI in our case) and a chain type.

Let's set it up. First we load the wrapper for the OpenAI large language model: from langchain.llms import OpenAI. You can define which model you want to use; by default it uses the Davinci model, but you can change it to other models such as gpt-3.5-turbo, the ChatGPT model, and I'll show you how to do that as well. First, llm = OpenAI() with the default settings (I think we can set the temperature to zero, and you can pass a model name too, but let's experiment with the default first; there's a reason I'm starting with it, which I'll show you in a bit). Next we load the chain itself: from langchain.chains.question_answering import load_qa_chain. There's also a version that returns sources, which you can use as well, but right now I just want the answers, and we already looked at how to get the relevant documents ourselves, so we don't need that for this specific example.

So we define our LLM, then chain = load_qa_chain(llm=llm, chain_type='stuff'). I think there are four different chain types you can pass; if there's interest I can create more videos expanding on what goes on there. Then we run the chain with our documents and the question, get the response, and write it out. For example, the question was 'How many states are in the US?' and the response is '50 states'. Pretty neat; that's how you design a question-answering system for your PDF files.

Now, one more thing I want to show you, something I saw somewhere: how to look at the cost associated with each and every query, right from the code (I'll also show you what it costs on the OpenAI platform). First we create a callback: from langchain.callbacks import get_openai_callback. This will help us figure out how much each query actually costs. The way you use it is to wrap the code you want to measure, in this case the call that gets the response: with get_openai_callback() as cb:, get the response inside the block, and then print(cb). It prints the usage results from OpenAI, but it has no impact on anything else we're doing. Let me save this and run it again. At first there's absolutely nothing going on.
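Conceptually, the 'stuff' chain type just stuffs every retrieved chunk into a single prompt for the LLM. A sketch of that idea (the template wording here is hypothetical, not LangChain's actual prompt):

```python
def stuff_prompt(doc_texts, question):
    """Mimic the 'stuff' chain type: concatenate all retrieved chunks
    into one context block and append the user's question."""
    context = "\n\n".join(doc_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

This also makes the context-window concern from earlier concrete: the more documents you stuff in, the longer this single prompt gets, until it no longer fits the model's limit.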
But if I ask 'How many states are in the US?' and hit enter, we get a response of 50, and at the bottom it says tokens used: 587, completion tokens: 1, and a total cost of $0.01714. If you look at OpenAI's pricing, based on our initial example this model is actually pretty expensive, but for chat they say to use the gpt-3.5-turbo model, which is a lot less expensive. So let's run the query with that model. To change the OpenAI model, we pass model_name='gpt-3.5-turbo'. Here's the difference you'll see in the response as well. Save, run again, ask the same question, and... wait, there's an error. Actually, it's a warning about how we're instantiating the model: for chat models we should use ChatOpenAI directly. Anyway, if you look at the cost now, it's much lower, $0.0015. However, look at the response: 'As of now there are 50 states in the US; however, the provided context doesn't explicitly state this information.' So it couldn't find the answer in the chunks the vector search retrieved, but since it's a chat model it also has knowledge from the rest of the data it was trained on; it's not just looking at the PDF file, it's using other information as well. There are ways to restrict it to the document, and I think that's very similar to what ChatPDF does: if I ask exactly the same question there, you see a very similar response: 'I'm sorry, but the information you are looking for is not included in the pages of the PDF provided; however, as of 2021 there are 50 states in the US.'

This was a very long video, but I hope you learned something new. At the back end, most of the chat-with-your-documents applications out there use exactly this technology. If you're still here, you may want to check out our Discord server, where we talk about all the new AI stuff; I think it's a good place to learn, and I hope to see you there. If you can and would like to support the work I'm doing, check out my Patreon as well. If you have any questions, put them in the comments below and I'll do my best to answer them. If you haven't subscribed, consider subscribing to the channel. Thanks for watching; see you in the next one.
Info
Channel: Prompt Engineering
Views: 63,285
Keywords: prompt engineering, Prompt Engineer, chatgpt for pdf files, ChatGPT for PDF, langchain in python, long chain tutorial, train openai model, train openai with own data, langchain tutorial, what is langchain, langchain pdf, chatgpt for your own pdf, chatgpt for pdf, langchain ai, train openai on your data, prompt engineering course, chatgpt prompt engineering, prompt engineering chatgpt, streamlit langchain chatbot, langchain streamlit app, langchain streamlit chatbot
Id: RIWbalZ7sTo
Length: 46min 23sec (2783 seconds)
Published: Fri May 19 2023