ChatGPT For Your DATA | Chat with Multiple Documents Using LangChain

Video Statistics and Information

Captions
Hello guys, welcome back, I hope you are doing great. In this video I will show you how you can chat with any documents. Let's say you have a folder, and inside that folder you have different file formats: a PDF file, a text file, a readme file, and so on. I will show you quickly how you can take all of your data, split it into different chunks, do the embeddings using the OpenAI embeddings, store them in the Pinecone vector store, and then, when you ask a question, it can do a semantic search, get the relevant chunks, pass those chunks into the large language model, and get the response. At the end I will also show you how you can have a simple chat UI, similar to a ChatGPT bot, for your own data. Let's get started.

Okay, so first things first, we need to set up the environment. I will be providing different links as I go in the notebook; you can refer to those documentations for further understanding. To set up the environment you need to install the packages as normal (I have already done this), then import the things, and then pass the OpenAI API key. Once you are done with this part, we need to take a document loader from LangChain. There are actually many document loaders, as you can see here, and the list keeps on increasing, because many people want to create applications out of LangChain. In my case I am going to use the DirectoryLoader, meaning that we can load all of the files you have in a directory or folder. Here I'm importing the DirectoryLoader from the document loaders, and I am passing the PDF loader, the readme loader, and the text loader. I have already uploaded some documents here, so you need to create a folder called documents, upload the files, get the path of this folder, and then you can
pass that path here. If there is a PDF you match it with *.pdf; *.md means the markdown (readme) files, and *.txt means the text files, of course. If I run this command now, all of these files are loaded in their respective loaders. What I can do next is create a list out of them and then build the documents from that, because we have the three loaders but we need a single set of documents. This is the first part: getting our documents ready for the application. If I run this cell it loads the loaders, and if I run this particular cell here it shows that there are three documents, of course, and that there are 10,101 characters in your documents. If I print what is in the first document, it is about the GPT4All paper, so you can see it is related to the GPT4All paper; similarly, you can print documents[1] and documents[2].

Now let's scroll down. The next part is making different chunks of the data, because there is a limit on the OpenAI tokens. By the way, if you want to know more about the OpenAI token limits, I have provided the link at the top of this file; just click it and you can see the different max tokens for GPT-4, GPT-3.5, and so on. I don't have GPT-4 access, so I'm using GPT-3.5 for this demonstration. Here I'm using the CharacterTextSplitter, and I'm passing a chunk size of 1,000 and an overlap of 40, because it seems that if we pass some overlap it performs better. You can play around with these numbers; it all depends on what kind of application you want to develop. Then I call text_splitter.split_documents and pass the documents that we just created. Let's run this command.
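The video does these two steps with LangChain's DirectoryLoader and CharacterTextSplitter. As a rough, dependency-free sketch of what they do (the 1,000/40 chunk numbers are from the video; the helper functions and their names are my own, and real PDFs would of course need a proper parser):

```python
from pathlib import Path

def load_documents(folder, extensions=(".pdf", ".md", ".txt")):
    """Collect raw text from every matching file in a folder,
    roughly what DirectoryLoader does. (plain read_text is only
    meaningful for the .md and .txt files, not real PDFs.)"""
    docs = []
    for path in sorted(Path(folder).glob("*")):
        if path.suffix.lower() in extensions:
            docs.append(path.read_text(errors="ignore"))
    return docs

def split_documents(docs, chunk_size=1000, chunk_overlap=40):
    """Split each document into overlapping character chunks,
    mimicking the chunk_size/chunk_overlap idea of the splitter."""
    chunks = []
    for text in docs:
        step = chunk_size - chunk_overlap
        for start in range(0, max(len(text), 1), step):
            chunk = text[start:start + chunk_size]
            if chunk:
                chunks.append(chunk)
    return chunks

# a 2,500-character document yields chunks starting at 0, 960, 1920
print(len(split_documents(["x" * 2500])))  # 3
```

The overlap means the last 40 characters of one chunk are repeated at the start of the next, so a sentence cut at a boundary is still seen whole in at least one chunk.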
Now those three documents are split into 23 different documents, or chunks. That completes the chunking part; if you want, you can print documents[0], documents[1], and so on, and you can see them printed here. Now comes the embedding part. This is just one line of code, because we are using the OpenAI embeddings: we say embeddings = OpenAIEmbeddings(), and with that we have instantiated the OpenAI embeddings. Now we need to store them somewhere, and here are two examples. If you want to go with the Chroma vector store, that is fine; I have shown the example here of how you can do that, but I will actually show you how to do this with Pinecone. Why Pinecone? Because running the OpenAI embeddings costs some amount, and the good part of Pinecone is that once the embeddings are stored there, you can just retrieve them later on; you don't need to recompute them every time. As I said, if you want to go with Chroma you can just run these two cells; I'm not going to run them, I will just close this.

Now we will see how to do the vector storing with Pinecone. First things first, you need to install the pinecone-client; just run this cell. I have provided all the links here on how to do this, with some explanations of what a vector store is, and by the way I can remove this part so that the document looks cleaner. What I need now is the Pinecone API key. You can just go to app.pinecone.io and take the API key and the environment. Just to show you: I have already logged into the system, but you need to go to the API Keys section, just copy the API key, and come
back here. We are using getpass for simplicity: I run this cell, it asks for input, I paste the key and press Enter, and then we have the Pinecone API key. This is a good idea if you don't want to show your API keys; it's a secure way to do it. I also need the Pinecone environment; this is not as sensitive, so I can just copy it from the console, because it is only the environment name. Anyway, I run this, and now I have both the API key and the environment.

Now I need to initialize Pinecone. If you follow step by step it's quite straightforward, but many people find it confusing, so I'm going step by step just to explain. You import pinecone, call pinecone.init, pass the API key and environment, and then you need an index name. You have to create the index yourself first. If you go to the Pinecone console, in the Indexes section I already have a langchain-demo index created, but to show you, I will delete mine: I delete langchain-demo, it is terminated, and I create a new one. You can give it any name you want, say langchain-demo. And what are the dimensions? This is the good part: we are using OpenAI, and the OpenAI embeddings have 1,536 dimensions. You can leave everything else as it is; Pinecone provides one index for free, so we can use that. Just create the index, and the index is initialized for us.

Once the Pinecone index is initialized, here is what we do next: pass the index name that we just created, and assign vector_store = Pinecone.from_documents(...), where we pass the documents, that is, the chunks we just made. How many chunks do we have? We had 23 chunks, meaning that we
will have 23 different vectors created in Pinecone. I will show you how it works later, but note that we also need to pass the embeddings (the OpenAI embeddings, as I said before), and the index name is just the index we are creating here. Let me see if it has been created; okay, it is still initializing. By the way, if you already have the index populated, you can reuse it later: you pass the same things as before, but call Pinecone.from_existing_index. When we first initialize we use from_documents, but when the index already exists you can just run from_existing_index. I could have done that, but I'm using from_documents to demonstrate how it works.

Okay, now it's ready, and if you look here there is nothing in it yet, zero vectors. If I run this command, it will populate the index for us. Let's see when this finishes; it says "retrying", but don't worry, it will pass. Okay, there is some error; what is the error? Something about an unexpected end of line, but I think it is transient, so if I run it again it should pass; for some reason it sometimes creates issues. Okay, again there is an error, an SSL-related one this time; I don't know exactly what it is, but I can just run it again, because connecting to the Pinecone vector store from Google Colab sometimes errors out. Okay, now it passes, as you see here. If I go back and refresh the page, there should now be 23 vectors; if I go to the index, yes, there are 23 total vectors. Those are the embeddings of our chunks. The Pinecone part is now done, and this is not 19 like before, because this time I had a few different
things loaded: here we have 23 chunks, and therefore 23 different vectors. That is all; now we can start querying our data. Just to show you how, the query is "who are the authors of the GPT4All paper?". I can just call docs = vector_store.similarity_search(query), and if I run this command it fetches the results. Now, what is it doing? This is the main thing you really need to understand. What did we do? We had our data, we loaded it with the DirectoryLoader from LangChain, we made the chunks, we did the embeddings, and we stored them in the vector store, in our case Pinecone; that was already done. Then we have our user query, which I just ran. What happens behind the scenes is that the query is also converted to embeddings, because machines only understand numbers; an embedding, in general, is just converting a string into numbers. That query embedding then goes to the Pinecone vector store, and Pinecone finds the similar ones, which is why the method is called similarity_search. So the question "who are the authors of the GPT4All paper?" is converted into a vector, that vector is searched against the 23 vectors in our Pinecone vector database, and it finds the most similar ones and gives us the matching chunks. That is all that is happening here: it finds the relevant data chunks, which can then be passed into the LLM to produce a response. Here you can see we have the vector store, the similarity search, and the query, but we haven't passed anything through the LLM yet; we are still only in the vector store part. When I run len(docs) it shows 4: it found four chunks whose embeddings are similar to "who are the authors of the GPT4All paper?". If you want to print the page content, you can print it here, and as you can see, the different authors' names are in
this particular chunk. Similarly, you can go and query against all the different documents. That part is done; now comes the LangChain part, where we create the chain, and we want to create it in such a way that when we ask something, the chatbot remembers the history of what we have been asking. Okay, the LangChain part: if I expand this cell, from langchain.llms I import OpenAI. By the way, there are many chains in LangChain; if I click this link, "chat over documents with chat history", you can go through that document for different kinds of chains, but here I'm just showing the ConversationalRetrievalChain. I'm importing OpenAI and setting up the retriever, because we need to retrieve from the vector store: vector_store.as_retriever with search_type "similarity" and k=2, so we take the top two chunks. Then qa, the question-answering chain, is ConversationalRetrievalChain.from_llm, where we pass OpenAI with temperature zero (you can pass anything here) and the retriever. This is the main part: as I showed before in the diagram, we have the user query, the embeddings are already stored, and when we pass the query we need to take the relevant chunks, pass them into the LLM, and get the response. All of that is done in the simplest way, with just one line of code, with LangChain. If I run this command: "OpenAI is not defined", because I didn't run the cell before; I run it, and now it is fine. Now we have the question-answering chain and I can ask questions. Let me first have the chat history be empty. The question is "how much was spent to train the GPT4All model?"; I pass the question as the query along with the chat history, which before this is just the empty list. If I run this command, it shows me the answer; let me see: $200.
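Under the hood, the retriever plus the conversational chain are doing something like the following. This is a toy sketch with a made-up bag-of-letters "embedding" and a canned "LLM" in place of the real OpenAI and Pinecone calls; every name in it is my own, not LangChain's API:

```python
import math

def embed(text):
    """Stand-in for OpenAIEmbeddings: a tiny letter-count vector.
    (The real model returns 1,536-dimensional vectors.)"""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def retrieve(query, chunks, k=2):
    """Similarity search: rank stored chunks against the query
    embedding by cosine similarity and return the top k."""
    ranked = sorted(chunks, key=lambda c: cosine(embed(query), embed(c)),
                    reverse=True)
    return ranked[:k]

def answer(query, chunks, chat_history, llm):
    """ConversationalRetrievalChain in miniature: fetch relevant
    chunks, hand them plus the history to the LLM, record the turn."""
    context = "\n".join(retrieve(query, chunks))
    result = llm(context, chat_history, query)
    chat_history.append((query, result))
    return result

chunks = ["GPT4All was trained for about $200.",
          "pandas is a data library."]
history = []
# a canned "LLM" that just echoes the top retrieved chunk
reply = answer("training cost of GPT4All?", chunks, history,
               lambda ctx, hist, q: ctx.splitlines()[0])
print(reply)         # GPT4All was trained for about $200.
print(len(history))  # 1
```

The history list of (question, answer) pairs is exactly what gets passed back in on the next call, which is how the chain "remembers" earlier turns.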
Right. Then with chat_history.append we append the query and the result's answer, and if I show you what it contains, it has appended "how much was spent to train the GPT4All model?", because that is the question I asked, and the answer, $200. So I actually have the chat history now. Next I ask a different question: "what is this number divided by 2?". I pass the question as the query, and I pass the chat history, which also contains the previous question-and-answer pair. If I run this command it should show me 100, but for some reason, as you can see, let me see what it says: 400. This is where you must be really careful: large language models are large, but that does not mean everything they say will be correct. With "what is this number divided by 2?" it remembers the history, but instead of showing 100 it shows 400. And what if I say "what is this number multiplied by 2?"? Let me see what it shows, because, as I said in my previous video, you have to be really good with the prompting part. Okay, now it says 1600. But the good part is that it remembers something: I haven't passed anything here about the paper, yet it knows the chat history, so we can have a conversation as we go.

Okay, that is all, but if you want this in a small chat UI, let's go through that now: creating a chatbot with memory using simple widgets. What you run is: from IPython.display import display, and import ipywidgets as widgets. As I said before, we need the chat history as a list, empty in the beginning, and there is a small function for the chat. If I run this cell, I am prompted for input and I can ask questions as normal. I can say, let me see, who are the
authors of the GPT4All paper? I press Enter, and it goes through the vector store, does the similarity search, finds the relevant chunks, passes that information into the large language model, and gives us the answer. As you can see: user, "who are the authors of GPT4All?", and the chatbot replies that the authors of GPT4All are these people, so you get the answer. You can then ask follow-up questions related to this, or anything you want, because I am passing three different things here: one PDF file, one readme, and one text file, and for the readme I used the pandas-ai readme. Let me ask something about that: "what is PandasAI?", because that was my last video, where you can have a conversation with tabular data with the help of PandasAI. It says here that PandasAI is a Python library that adds generative artificial intelligence capabilities to pandas. You can see the power of this: you can take any kind of documents, and with the help of LangChain you can pass all of them into the vector store and have a conversation with them. This can be used in an organization where you have many documents, or even just a few, and you need to query them. One thing, though: always be careful that you don't pass sensitive information. Although we don't send the whole documents with every message, because everything is stored in the vector store and only the question plus the similar chunks found there are passed to the LLM, you should still be careful when you implement something, because we never know; something could be leaked. Always be careful when you create these kinds of chatbots.

Okay, this is fine, but if you want to go one step further, you can even create a Gradio UI. The widget UI is already fine, but for the Gradio UI I have provided the link
here; you can go through the link. I actually copied the code from there, and for some reason it is not working in our notebook, so let me first show you how it works on its own. This is just example code from Gradio: running the cell creates an interactive UI. You can ask something, "hello", and it says "I love you", because it is just picking answers from three hard-coded choices; you can say "hi" again and it will again answer from those three. It works fine with this example, but when I tried to incorporate this piece of code into our application, as I'm showing you here, with the same code but the conversational chain instead of the random choices, passing the user message, chat history, and so on, for some reason it does not work. So here is your task: I will provide the code on GitHub; go through it and let me know if you are able to run it. What I am showing you here is: I can say "hello", and it runs the first time. It is also printing logs here because I am passing the debug flag, meaning it prints everything. Now if I ask "who are the authors of the GPT4All paper?", it shows an error: "unsupported chat history format: list", and so on. I haven't been through the documentation to fix this, but maybe you can; go through this notebook, which I will link in the description of this video, and let me know, and let others know in the comment section, if you were able to solve it. That is all I wanted to show you in this video. Although this looks quite complicated at a high level, with the help
of LangChain you can actually achieve these things easily, and you are no longer restricted to PDFs or tabular data or any one format. You can take different file formats, put them in a directory, load them with the LangChain document loaders, split them into chunks, do the embeddings, load those into the vector store, and then query with questions: it does the similarity search, passes the relevant chunks into the large language model, and you get the response. There are many pieces, but when we join them together it turns out to be really simple. I hope you liked this video; thank you for watching, and see you in the next one.
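On the "unsupported chat history format: list" error left as a task at the end: I haven't verified this against the exact notebook, but Gradio's chat components typically hand the history over as a list of [user, bot] lists, while ConversationalRetrievalChain expects a list of (user, bot) tuples, so a small conversion before calling the chain may be all that is missing (the function name here is my own):

```python
def to_tuple_history(gradio_history):
    """Convert Gradio-style [[user, bot], ...] history into the
    [(user, bot), ...] tuple list that ConversationalRetrievalChain
    accepts, dropping any turn whose bot reply is still pending."""
    return [(user, bot) for user, bot in gradio_history if bot is not None]

print(to_tuple_history([["hi", "hello"], ["who?", None]]))  # [('hi', 'hello')]
```

You would call this on the history argument inside the Gradio callback, right before invoking the chain.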
Info
Channel: Data Science Basics
Views: 11,318
Keywords: openai api, code, chat ai, large language models, llm, what is large language model, chat, langchain, lang chain gradio, langchain demo, langchain tutorial, langchain openai, langchain explained, framework, langchain hugging face, langchain chat gpt, llms, chat models, prompt, chain, agents, chatcsv, csv, chat with any csv, pandas, chat with data, pandas-ai, chat with any tabular data, how to create chart, create chart with llm, documents, markdown, chat with your data, own chatgpt, pinecone
Id: TeDgIDqQmzs
Length: 23min 15sec (1395 seconds)
Published: Sat May 06 2023