Chat with PDF files: ChatGPT for PDF using Langchain, HF and Chainlit || Step-by-step tutorial

Captions
Hi all! In the last video we built a ChatGPT-like chatbot using LangChain and Chainlit. That video received a tremendous response, and many of you requested a tutorial on building a chatbot based on your own PDF files. So in this video we are going to build a "chat PDF" that accepts a PDF from the user and, based on that PDF, answers the queries that are passed to it.

In order to build this chat PDF we have the same requirements as for our earlier chatbot: we need to install langchain and huggingface_hub, since we are working with the Hugging Face integration for LangChain. You can very well do this with OpenAI as well. Additionally, we need a vector store. When we discussed the capabilities of LangChain, we talked about indexing and how LangChain helps you create chatbots on your own data. In order to have LangChain work with your own data, you need to create vector stores, which are nothing but vector representations of the data you want your model to look into. We are not going into the details of vector stores and how they work; I will attach a link about that in the description below if you are interested. For this video, just understand that ChromaDB is the vector store we will use in integration with LangChain. We shall also use sentence-transformers in order to embed our text as vector representations.

Now, many a time when installing ChromaDB you may face an error like "chroma-hnswlib missing" or "failed to build chroma-hnswlib". There are multiple causes for this error, and I have attached a possible solution in the description. If you face it, it is most likely that you are missing some build tools in your environment, so go ahead and check the link in the description and make sure your configuration is correct.
All the steps are listed in that link. Additionally, we need to install pypdf, because we need to convert our PDF into text format.

Now, in order to build a chatbot based on our own PDF, we require six primary steps:

1. Document loader: loads the document or PDF file and converts it into text.
2. Chunking: your PDF file could be very long, and as you know, large language models (LLMs) have limits on the number of tokens they can deal with, so we break the text into chunks.
3. Embedding: creating vectors on top of the chunks we have generated.
4. Vector store: once we have created these embeddings, or vector representations of our text, we store them in a vector store. The vector store indexes our vectors and supports semantic matching; for instance, it helps identify which chunks best match a query, instead of searching across the entire text. It creates a representation that maps your query to the relevant chunks from the PDF file.
5. Large language model: required to perform the question answering.
6. Document retriever: combines all of these together, connecting your LLM, your PDF, and your queries to get the relevant answers.

With these six steps we are now going to build our very own chat PDF. So, quickly, we import the necessary libraries: os and getpass, and from LangChain the document loaders. We are currently using PyPDFLoader; there are other loaders, such
as CSVLoader and TextLoader, depending on the type of document you want to deal with. We then import a text splitter for chunking; there are two primary types of text splitters, and we use RecursiveCharacterTextSplitter, which tries to break your text down into chunks based on a list of separators. Then we import HuggingFaceEmbeddings in order to create the embeddings for our text, the Chroma vector store, HuggingFaceHub for our LLM model, and, for the chaining, RetrievalQA. RetrievalQA is a built-in chain provided by LangChain that helps you connect everything and build your own chat PDF. We have set our Hugging Face API key here, just as we do in all our videos; it has been kept hidden.

Next, we read our PDF and create the vector store. We take the input path from the user; currently I have kept two different PDF files here. We load the PDF file using PyPDFLoader and split the loaded document into pages. Let me run this: it asks for the file location, I enter the linear regression PDF, and the number of pages in this particular document is six. Let me show you a sample of one page: the first page contains "Linear Regression", what linear regression is (it models the relationship between two variables), its mathematical representation, machine learning, and all the other information we have in our PDF.

Now we perform the chunking operation: we bring in the RecursiveCharacterTextSplitter and specify the chunk size and the chunk overlap.
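The loading and chunking steps just described can be sketched as follows. This is a minimal sketch assuming the 2023-era langchain package layout (module paths have since moved in newer releases), and the chunk size of 1000 characters is an illustrative default, not a value stated in the video:

```python
def load_and_chunk(pdf_path, chunk_size=1000, chunk_overlap=0):
    """Load a PDF (one Document per page) and split the pages into chunks."""
    # Imports are kept inside the function so the sketch can be imported
    # even before langchain and pypdf are installed.
    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    pages = PyPDFLoader(pdf_path).load_and_split()  # one Document per page
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # max characters per chunk
        chunk_overlap=chunk_overlap,  # 0 = each chunk starts where the last ended
    )
    return splitter.split_documents(pages)
```

For the six-page linear regression PDF from the video, a call like `load_and_chunk("linear_regression.pdf")` would return the list of chunk Documents that the later steps embed and index.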
If you set the chunk overlap to zero, each new chunk starts exactly at the point where the previous chunk ended; if you use an overlap, a number of tokens are shared between consecutive chunks. Hope this makes sense; it is very simple to understand. We create the documents using our splitter. Let me show you an example: you can see the multiple chunks that have been generated, and if I check the length of these docs, there are 13 chunks generated out of these six pages.

Now we create the embeddings. We call HuggingFaceEmbeddings, which by default uses a sentence-transformers model; that is why we installed sentence-transformers in our environment. Then we create a vector store database from our documents. This performs a download and creates your vectors, so it takes some time, perhaps around 30 or 40 seconds depending on your environment.

Now let me show you one example of semantic search on the chunks we have generated. My query is "what is linear regression", and docsearch.similarity_search will filter out the top k chunks that are most relevant for your query. I have specified k as 3, so it filters out three chunks. The first chunk returned is "what is linear regression: models the relationship between two variables; mathematical representation; models target prediction based on independent variables" and so on; the second is "error can be defined as the difference between the actual value and the predicted value"; the third chunk is "there are many methods to apply linear regression". So, depending on the semantic similarity, it filters out the top three chunks for the query.
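The embedding, indexing, and similarity-search steps above can be sketched like this, again assuming the 2023-era langchain API with chromadb and sentence-transformers installed (HuggingFaceEmbeddings defaulted to the sentence-transformers all-MiniLM-L6-v2 model at the time):

```python
def build_vector_store(docs):
    """Embed the chunk Documents and index them in a Chroma vector store."""
    # Imports kept inside the function so the sketch loads without the packages.
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Chroma

    # Downloads the default sentence-transformers model on first use,
    # which is the 30-40 second wait mentioned in the video.
    embeddings = HuggingFaceEmbeddings()
    return Chroma.from_documents(docs, embeddings)


# Usage, where `docs` is the list of chunks produced by your text splitter:
#   docsearch = build_vector_store(docs)
#   top3 = docsearch.similarity_search("what is linear regression", k=3)
# `top3` holds the three chunks most semantically similar to the query.
```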
Hope this is making sense so far; this part is very clear and very simple. Next, we need to integrate our LLM in order to perform question answering for our chat PDF. Here we are using the tiiuae/falcon-7b model; the more advanced the model you use, the better the results you will get. Let us see how the results go with this particular model.

So what are we doing? We create the Hugging Face LLM object, passing the API token, specifying the repo ID, and specifying the model kwargs; we have set the max length for our answer. You can modify these parameters and play around.

The important thing that is different from the chatbot we built earlier is RetrievalQA. We are using a chain built into LangChain instead of creating our very own. It connects your LLM with your documents from the vector store and helps you answer queries based on the provided input. So: RetrievalQA.from_chain_type, we pass the llm, the chain type is "stuff", and the retriever is docsearch.as_retriever(). "Stuffing" means it simply concatenates the most semantically similar chunks that we filtered out; those are stuffed together, and the answer is generated based on that.

Our very first query is "what is the mathematical formulation of linear regression". Let us see how relevant the answer is: "linear regression is a mathematical model that is used to predict the value...". This particular answer hasn't been very relevant to our question, but that depends on the model we are using. Let us ask about the assumptions: the first assumption returned is that the variables are independent; the second gets cut off.
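The LLM and RetrievalQA wiring just described can be sketched as below, assuming the 2023-era langchain API; the `max_length` value is illustrative, since the exact number is not given in the video:

```python
def build_qa_chain(docsearch, hf_api_token):
    """Connect a hosted Hugging Face model to the vector store via RetrievalQA."""
    # Imports kept inside the function so the sketch loads without langchain.
    from langchain import HuggingFaceHub
    from langchain.chains import RetrievalQA

    llm = HuggingFaceHub(
        repo_id="tiiuae/falcon-7b",            # the model used in the video
        huggingfacehub_api_token=hf_api_token,
        model_kwargs={"max_length": 512},      # illustrative value; tune freely
    )
    # chain_type="stuff" concatenates the retrieved chunks into one prompt.
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=docsearch.as_retriever(),
    )


# Usage, where `docsearch` is your Chroma vector store:
#   qa = build_qa_chain(docsearch, token)
#   qa.run("What is the mathematical formulation of linear regression?")
```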
The truncation happens because we have set a maximum token length; still, we are getting a relevant result for the second query. Hope this is making sense.

Now, the final part of this tutorial: we are going to create an interface using Chainlit, just as we did in our last video, performing all the same things. I have already created this chat PDF script, so I'll quickly walk you through it. We import all the libraries that we did in our notebook the same way, set our environment variable, specify the path of the input PDF file, use PyPDFLoader, and perform all the steps we did so far. The only change is that when we call the retrieval chain, we move it inside the @cl.on_chat_start function, and in the message handler we send the result back as the message content. So there are two parts to this app, on_chat_start and on_message, just as we had in our previous tutorial; all you need is to integrate these two pieces with your LangChain code, since we already covered the other parts in the notebook. Add these two handlers and you're ready; this works well in a script environment.

We have created a Python script here, so I'll just run it and show you. Let me open an Anaconda prompt; it's time to execute our chatbot. We call: chainlit run chat_pdf.py -w. Let's see if it is running. Yes, it is asking for the file path, so I specify the path to the linear regression PDF. It takes some time to load: it loads the pre-trained sentence-transformers model, prints the anonymized-telemetry notice, creates our vector store, and then our chatbot is available. Let me ask "what is linear regression". It raised an error; just give me a second. It turns out our API token was incorrect.
I have re-specified the API token and will specify the path once again; now this should reload. "What is linear regression." Hmm, what happened? Let us try one more time; sometimes, due to server bottlenecks, your API call might not be authenticated and you get certain errors at times. See, now this is working fine: "linear regression is a statistical model used to predict the value of a dependent variable based on the...". Because of the max token length, the answer has been truncated; you can modify that and play around. Let me ask about the assumptions: the assumptions are that the data is normally distributed and the data is independent. So this works fine.

Hope you learned something new, and hope you enjoyed the lecture. If you like the content, make sure to give it a thumbs up, and make sure to subscribe to never miss a video. See you in the next lecture, have a nice day, bye!
Info
Channel: Datahat -- Simplified AI
Views: 4,285
Keywords: data science, data analysis, python, chatpdf, #chatpdf, chatpdf app, how to use chatpdf, chatpdf tutorial, unlock the power of pdfs with chatpdf, chat pdf, chatgpt pdf, langchain, chat pdf using langchain, chatbot using langchain, langchain tutorial, step by step chatbot, langchain demo, langchain explained, langchain chatgpt, langchain agents, langchain chat gpt, langchain hugging face, langchain tutorial python, ui for langchain, langchain own data, langchain qa documents
Id: OFTWL0c9GYM
Length: 13min 57sec (837 seconds)
Published: Sat Aug 26 2023