Learn How To Query PDF Using LangChain and OpenAI in 5 Min

Video Statistics and Information

Captions
Hello all, my name is Krish Naik, and welcome to my YouTube channel. Many of you had requested a video on PDF querying using LangChain. Suppose you have a PDF with a lot of text, and you want to use LangChain and the OpenAI API to query that data, so that just by asking questions you get the output. Many people had requested this, and in this five-minute tutorial you will understand how to do it.

Now, LangChain: if I go and look at the documentation, there is something called document loaders. With document loaders you can take data from many different sources, such as a PDF or a text file, and the page shows the syntax and how many different types of document loaders there are. So please make sure you check out this page of the documentation.

Let's go step by step. If you follow these steps, querying a text file as a data source also becomes very easy. First of all, we install the basic libraries that are required: langchain, openai, PyPDF2, and faiss-cpu. I'll explain why faiss-cpu is required. PyPDF2 is a library that helps you read from the PDF file itself, and one more dependency is tiktoken, which focuses on creating tokens. Installing all these libraries will take some time. I've done the installation, so now I'm going to import the libraries. From PyPDF2 I'm going to import PdfReader, which will again be responsible for reading the
PDF files. From langchain.embeddings.openai I'm importing OpenAIEmbeddings. OpenAI also has these embeddings, these embedding vectors. Any time you have any confusion, you can just go and search for it: search for "OpenAI embeddings". What are embeddings? OpenAI's text embeddings measure the relatedness of text strings. Embeddings are commonly used for search, clustering, recommendations, anomaly detection, and classification. So OpenAI has provided almost everything you need here.

In this step we use OpenAI embeddings so that the questions I ask can be matched against the text of the PDF. As an example, I'm going to take the Indian budget: whatever budget was announced this year, I'll upload that PDF here, ask questions about that specific PDF, and get the answers.

The next thing is that I will import CharacterTextSplitter. If you don't know about CharacterTextSplitter, in short: whatever content I have inside the PDF, I'm going to split it on some special character such as a newline, and I can also define how large each piece of text should be. This is done because whenever I use OpenAI embeddings, there is a fixed limit on the number of tokens. This is a very important step that we really need to do.

The next important library is FAISS. This works like a vector database: whenever you create embeddings of the text data in your PDF, we store them in the vector store. These are simple things, but if you have any questions, just go and search for these libraries, for example OpenAI embeddings or CharacterTextSplitter, and you will be able to find all the
information, like how we searched here for what embeddings are. So make sure you do this. These four libraries are required, and here are all four that I'm going to import.

I have already executed the code that sets the OpenAI API key. I don't think we need a SerpAPI key unless you are doing a Google search. Use whatever API key you have; in my previous video I already showed how to set the API key.

As mentioned, the problem statement is this: this is my budget PDF, and I will read it with the help of PdfReader. I just have to give PdfReader the path of my budget PDF, and then I can read each and every page. How do I read it? For that I have one more import: from typing_extensions import Concatenate. Then I enumerate over all the pages of the PDF, extract the text of each page into a variable called content, and, if content is not empty, add it to raw_text. That means I am putting all the content into raw_text. Executing it takes some time because the PDF has many pages. Finally, you can see my raw text with all the information in it, and it's quite large: if I open the PDF, it is around 36 to 38 pages.

Now the next step. Since we imported OpenAIEmbeddings and CharacterTextSplitter, we take CharacterTextSplitter and split the entire text based on this separator and this chunk size. The chunk size says how long one sentence (one chunk) should be, which is 800 here, and the overlap says how much overlap is allowed: the next sentence can have an overlap from the
previous sentence of 200 at its end. The length function is nothing but len, which is a built-in function. Once I call text_splitter.split_text on raw_text, I get the split texts; split_text is a method that CharacterTextSplitter provides. Once we execute this, this is the total number of texts I get. Again, you can play with the chunk size. The main thing is that every model we use has a fixed token limit, and you should not exceed it. I could also set this to 1000.

Then I use OpenAIEmbeddings, which I already told you about, and call FAISS.from_texts with the texts and the embeddings. That means I'm converting all the text into vectors using these embeddings. So here is my document search. If I execute this, you can see that it is a LangChain vector store: the text is converted into embeddings and stored there.

Later on I use another tool from LangChain. For question answering you have load_qa_chain, and from langchain.llms I import OpenAI. As usual, I take load_qa_chain, pass it the OpenAI object, and set chain_type equal to "stuff", so that whenever you ask a question it can give you the answer. So this is my chain.

Now all I have to do is write a query. Let me open the PDF so that you can follow along. Let's say this is my PDF, and I want to ask a specific question. You can ask any question, for example: what is the vision for Amrit Kaal?
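The splitting step described above can be sketched without any libraries. This is not LangChain's CharacterTextSplitter (which splits on a separator and then merges pieces up to the chunk size); it is a simplified fixed-window version that only illustrates what a chunk size of 800 and an overlap of 200 mean. The function name chunk_text and the choice to measure size in characters are assumptions made for this illustration.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=200):
    """Split text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    Hypothetical helper, not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small demonstration on synthetic text (digits repeated 2000 times).
sample = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of more chunks to embed; a larger chunk size does the opposite, subject to the model's token limit mentioned above.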
I'll just ask this specific question. Let's consider this as one of the questions, but before that let me write it out: what is the vision for Amrit Kaal? Now, if I execute it, what happens? The document search, which holds all the embeddings, runs a similarity search for this query, and then we call chain.run, where I pass the input documents and the question, which is this query. The answer: our vision for the Amrit Kaal includes a technology-driven and knowledge-based economy. So you can see that the information is getting picked up: whatever we query about the PDF, we are able to get.

One more question was how much the agriculture credit target will be increased. Let me search for this in the PDF. Somewhere here there was something related to agriculture that I saw a while back. See: "the agriculture credit target will be increased to". So I have written "how much the agriculture target will be increased to", and if I execute it you can see 20 lakh crore. And suppose I add "and what the focus will be": when I run this query, it returns the agriculture target with a focus on animal husbandry, dairy, and fisheries, the same thing that is in the document.

I hope you are able to understand this. At the end of the day, you can take any document source, even a pandas DataFrame. If you want that specific example, please do let me know. With PDFs like this you can definitely do it, and this is quite amazing with LangChain. You don't have to be dependent on other things. Just imagine if you have
your entire financial data in the form of PDFs, say your expenditure. You just need to load those PDFs and chat with them, and you will be able to answer these kinds of questions. I hope you are able to understand this. I'll be sharing all of this material with you. That was it from my side. I'll see you all in the next video. Have a great day. Thank you, take care, bye.
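The whole walkthrough can be pulled together roughly as follows. This is a sketch, not the notebook from the video: it assumes the mid-2023 import paths named in the transcript (newer LangChain releases have moved several of them), that PyPDF2, langchain, openai, faiss-cpu, and tiktoken are installed, and that OPENAI_API_KEY is set in the environment. The function names pages_to_text and answer_from_pdf are hypothetical.

```python
import os


def pages_to_text(pages):
    """Concatenate the text of every page, skipping pages with no text."""
    raw_text = ""
    for page in pages:
        content = page.extract_text()
        if content:
            raw_text += content
    return raw_text


def answer_from_pdf(pdf_path, query):
    """Answer `query` from the PDF at `pdf_path` (hypothetical helper)."""
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set OPENAI_API_KEY before calling answer_from_pdf")

    # Imported here so pages_to_text stays usable without these packages.
    # Paths are the mid-2023 ones; treat them as an assumption on newer versions.
    from PyPDF2 import PdfReader
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.vectorstores import FAISS
    from langchain.chains.question_answering import load_qa_chain
    from langchain.llms import OpenAI

    # 1. Read every page of the PDF into one string.
    raw_text = pages_to_text(PdfReader(pdf_path).pages)

    # 2. Split on newlines into overlapping chunks (sizes from the video).
    splitter = CharacterTextSplitter(
        separator="\n", chunk_size=800, chunk_overlap=200, length_function=len
    )
    texts = splitter.split_text(raw_text)

    # 3. Embed the chunks and store the vectors in a FAISS index.
    document_search = FAISS.from_texts(texts, OpenAIEmbeddings())

    # 4. Retrieve the chunks most similar to the query and "stuff" them
    #    into a single prompt for the LLM.
    chain = load_qa_chain(OpenAI(), chain_type="stuff")
    docs = document_search.similarity_search(query)
    return chain.run(input_documents=docs, question=query)
```

With a budget PDF saved locally (the file name here is a placeholder), a call would look like answer_from_pdf("budget_speech.pdf", "What is the vision for Amrit Kaal?"). The index is rebuilt on every call in this sketch; for repeated questions it would make sense to build document_search once and reuse it.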
Info
Channel: Krish Naik
Views: 34,559
Keywords: yt:cc=on, how to query a pdf using langchain and openai, krish naik langchain
Id: 5Ghv-F1wF_0
Length: 10min 22sec (622 seconds)
Published: Fri Jun 16 2023