LangChain Tutorial - ChatGPT with own data

Video Statistics and Information

Captions
Hi, in this video I will show you how you can empower a large language model like GPT-4 using LangChain. With LangChain you can, for example, connect GPT to your own knowledge base that was not trained into the model. I will first explain the basics of why this is done and how it is implemented in theory; after that we will look at the code hands-on to make it easier to understand.

Let's take a fictional example. We have a knowledge base of, say, 100 texts, and we want to make these available to other people and, through GPT-4, generate human-like messages with them, not just query the facts out of these text files. The problem is that the prompt, the message you give to the bot, can only contain a certain number of words or tokens, so you have to pre-filter large data sets. This is ideally done with semantic filtering, and that's where vector databases come in handy: they check a question for similarity to the data in the database, and the most similar data is given back as output. This reduced output is then sent on as part of the prompt.

Let's now take the perspective of a developer. Initially we have only these text files, but of course we could also have other sources of knowledge, like PDFs or CSV files. With text files it's relatively easy to extract the data; with PDF files it's more difficult. Once we have extracted the data, we cannot just feed it into the database; we first have to split it into small parts, so-called chunks. For example, if we have 10,000 words and split them into groups of 100 words, we end up with 100 chunks.

Now we have a lot of chunks, and these chunks have to be converted into so-called embeddings. But what are embeddings? I'll show you on the OpenAI website, which gives a pretty good overview. On the left you have the text, or rather the text chunks, and these are fed into an embedding model, which converts the input text into a vector.
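The chunking step described above can be sketched in a few lines of plain Python. This is a toy illustration of the idea only (the video later uses a LangChain text splitter, which additionally adds overlap between chunks):

```python
def chunk_words(text: str, chunk_size: int = 100) -> list[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# 10,000 words split into groups of 100 words -> 100 chunks
document = " ".join(f"word{i}" for i in range(10_000))
chunks = chunk_words(document, chunk_size=100)
print(len(chunks))  # -> 100
```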
A vector, as you can see, is just a bunch of numbers: an array or list of numbers. These vectors are stored in the database, and they carry meaning: anything associated with "animal" ends up in one region of the vector space, and anything associated with "athlete" in another. Then you run a query, for example "What is an elephant?". The query also gets converted into a vector, and you retrieve the most similar vectors from the database; this is the semantic filtering I introduced at the beginning of the video. Let's say you get back the top four vectors from the database. These get translated back into text, and you take the question together with the retrieved text and send it as a prompt to ChatGPT, for example.

Okay, that was the theory. Now we will go into the code, and I will walk you through how you can create a vector database and how you run queries on it. I'm here in my Jupyter notebook; you can also get it on my GitHub page, you will find the link in the description. The first thing I do is install LangChain, a package which allows us to use vector databases and OpenAI to run queries very easily; it's a very nice package, and I'll show you the basics of what you get from it. You also need the openai package and an OpenAI API key. This is the only prerequisite; otherwise the notebook will not work, because I use OpenAI for my embedding model. I also need pickle, which is used to serialize and deserialize our vector database. As the vector database we will use FAISS, which is built by Facebook. We also install python-dotenv, because I store my API key inside a .env file, since I don't want to show it to everybody; I will just load it after installing the packages. So first we install langchain, openai, faiss, and python-dotenv (pickle ships with Python's standard library).
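The semantic lookup works exactly on this principle: embed the query, then return the stored vectors that point in the most similar direction. Here is a minimal sketch using cosine similarity and made-up 3-dimensional toy vectors (real embedding models such as OpenAI's produce vectors with on the order of a thousand dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "vector database" (values are invented for illustration).
store = {
    "elephant": [0.9, 0.1, 0.0],   # animal-ish region of the vector space
    "dog":      [0.6, 0.4, 0.2],
    "sprinter": [0.1, 0.9, 0.2],   # athlete-ish region
}

# Pretend embedding of the query "What is an elephant?"
query = [0.85, 0.15, 0.05]

# Semantic filtering: rank stored entries by similarity to the query.
ranked = sorted(store, key=lambda k: cosine_similarity(store[k], query),
                reverse=True)
print(ranked)  # most similar entry first
```

The top entries of this ranking are what gets translated back into text and packed into the prompt.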
After the installation I load my API key into a variable. Now we can take a look at the first concept of LangChain: so-called loaders. Loaders are classes which allow you to take your text sources and load them so they can be put into a database. There are a lot of different loaders: text loaders, PDF loaders, CSV loaders, and so on. I've got some FAQ files here, and these contain text about a fictional animal, a mixture of a rabbit and a dog. This is done to make sure that the answer to the query I run against my database is actually retrieved from my database and not made up by GPT, because GPT cannot invent this information.

To load the data, we first import the DirectoryLoader and TextLoader. I use the DirectoryLoader because everything in this FAQ directory is plain text: I pass the path, set the loader class to TextLoader, enable the progress bar, and load everything into memory. After doing this you can see that three files are loaded and stored in the docs variable: three, because we've got three text files in the FAQ directory.

After loading the data into memory we now have to split it into so-called chunks, and there are very nice text splitters for that. There are different kinds of text splitters; I use the RecursiveCharacterTextSplitter here, with a chunk size of 500 and a chunk overlap of 100, because I don't want to lose the context between chunks, so they share some overlap. After instantiating the class I split the documents stored in the docs variable. When I run this code you can see the chunks extracted from the text file, each with a chunk size of 500.
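The overlap idea can be illustrated with a simplified character-based splitter using the same numbers as in the video, chunk size 500 and overlap 100. This is a toy version: LangChain's RecursiveCharacterTextSplitter additionally tries to break on paragraph and sentence boundaries rather than at fixed positions.

```python
def split_with_overlap(text: str, chunk_size: int = 500,
                       overlap: int = 100) -> list[str]:
    """Slice text into chunk_size-character pieces; each chunk repeats the
    last `overlap` characters of the previous one to preserve context."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1,300 characters of sample text -> 4 chunks, each starting 400 chars
# after the previous one, so neighbours share 100 characters.
text = "".join(chr(97 + i % 26) for i in range(1300))
chunks = split_with_overlap(text)
print(len(chunks), len(chunks[0]))
```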
Now that we have our chunks, we need embeddings. LangChain offers different embeddings classes, and I will use OpenAIEmbeddings: I import it and instantiate it with openai_api_key set to the key I loaded from my .env file. With the instantiated class we can load the chunks into a vector database. There are different kinds of vector databases you can use with LangChain; I will use FAISS and import the FAISS class. FAISS has a class method from_documents, which I call with the documents and the embeddings instance; this creates a vector store, which can then be pickled, or dumped, so I now have the data on my file system. This vector store can now be used to run our queries and operations; we can simply load it from disk and put it back into memory.

Now we can send a question to GPT. But you don't just send the question: you give it more, and that "more" is the context. The context plus the question is called the prompt. You also give the bot, the GPT model, an identity; here I say "You are a veterinarian and you help users with their animals", and the answer goes below. Everything together is the prompt. You can get a nice PromptTemplate from LangChain and instantiate it with input variables, here the context and the question; the rest of the template is fixed, and only the variables get filled in. Now we have our prompt template.

This prompt template can now be used in a so-called chain. There are different chains for different use cases, like chatting or retrieval Q&A. I will use the RetrievalQA chain, because we will run a query against our database and then pass the output to a model, and this is done very nicely with the RetrievalQA chain.
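Under the hood, a prompt template is essentially string substitution. Here is a stdlib-only sketch of what happens when the {context} and {question} variables are filled in; the veterinarian identity is from the video, while the context chunk and its wording are hypothetical stand-ins for text retrieved from the vector store:

```python
# Fixed template text with two variables, as in LangChain's PromptTemplate.
template = (
    "You are a veterinarian and you help users with their animals.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

# context = top chunks retrieved from the vector store (hypothetical here),
# question = the user's query.
prompt = template.format(
    context="This fictional animal is a mixture of a rabbit and a dog.",
    question="How old do these animals get?",
)
print(prompt)
```

The chain does exactly this assembly for you: it retrieves the most similar chunks, substitutes them into the template, and sends the result to the model.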
First we have to instantiate our model. I will use OpenAI here and pass the OpenAI API key, and now we have our LLM, our large language model, which is passed as a variable into the chain. You also have to pass a retriever, and you can convert the vector store into a retriever. If you now inspect the whole chain, you can see the LLM that will be used, and that the data store will be used as well. I will run the query "How old do these animals get?" and GPT will answer it for me; first I have to run the template, of course, and then run the chain. And we have an answer: as you can see, it is nicely formatted, and it is information which is actually retrieved from the FAQs, not made up by GPT.

But now we have a little problem, or at least sometimes we have one: if we run multiple queries, we always have to give the full question, because the context of the whole conversation gets lost. This is where memory comes in very handy. The memory stores the whole conversation and feeds it back as input into the OpenAI prompt. LangChain provides multiple memory classes, for example the ConversationBufferMemory: you give it a memory key like "chat_history" and an output key, and you also say you want the messages returned from memory. I now create an instance of this ConversationBufferMemory, and we use a different chain, because the RetrievalQA chain is not able to use memory, but the ConversationalRetrievalChain is. We do the same as before: we instantiate it with the text-davinci model, which is well suited for text generation, give it a little bit of temperature, which controls how dynamic, or creative, the model will be, and now also pass the memory as an input parameter. If you run multiple queries, like "How old does the animal get?" and then "How much does it eat?", the "it" in this case gets interpreted from the memory.
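The memory concept boils down to prepending the accumulated dialogue to each new prompt so the model can resolve references like "it". A minimal stdlib sketch of a conversation buffer (LangChain's ConversationBufferMemory does this bookkeeping for you; the stored answer below is a hypothetical example):

```python
class ConversationBuffer:
    """Stores the running dialogue and renders it as prompt context."""

    def __init__(self):
        self.turns: list[tuple[str, str]] = []

    def save(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_context(self) -> str:
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)

memory = ConversationBuffer()
# Hypothetical first exchange, stored after the chain answers it.
memory.save("How old do these animals get?", "They live around ten years.")

# The follow-up alone is ambiguous; the history makes "it" resolvable.
follow_up = "How much does it eat?"
prompt = memory.as_context() + f"\nHuman: {follow_up}\nAI:"
print(prompt)
```

A conversational chain builds its prompt this way on every turn, which is why the follow-up question can be answered without repeating what "it" refers to.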
Otherwise GPT would not know what "it" actually refers to, so it has to have the memory in this case. If we run this, you can see the answer: a mixture of dog and rabbit should only eat a little bit of food, so the "it" is interpreted correctly. Okay, that's it. As you can see, LangChain is pretty easy to use and very powerful. If you liked my video, please don't forget to subscribe and like. Thank you very much, bye bye.
Info
Channel: Coding Crashcourses
Views: 3,722
Keywords: langchain, vector db, vector database, python, openai, gpt-3, gpt-4, gpt, embeddings
Id: BPduCIDym6w
Length: 12min 29sec (749 seconds)
Published: Thu May 18 2023