Fixing LLM Hallucinations with Retrieval Augmentation in LangChain #6

Captions
Large language models have a little bit of an issue with data freshness, that is, the ability to use data that is actually up to date. That's because the world, according to a large language model, is essentially frozen in time: the model understands the world as it was in its training data set. That training data is huge and contains a ton of information, but you're not going to retrain a large language model on new data very often, because it's expensive, it takes a ton of time, and it's just not very easy to do.

So how do we handle that problem? To keep the data in a large language model up to date, we can use retrieval augmentation. The idea behind this technique is that we retrieve relevant information from what we call a knowledge base and pass it into our large language model, not through training but through the prompt that we're feeding into the model. That makes this external knowledge base our window into the world, or into the specific subset of the world that we would like our large language model to have access to. That's what we're going to talk about in this video: how we can implement a retrieval augmentation pipeline using LangChain.

Before we jump into it, it's probably best we understand that there are different types of knowledge we can feed into a large language model: parametric knowledge and source knowledge. Parametric knowledge is gained by the LLM during its training. Within that big training process the LLM creates something like an internal representation of the world according to its training data set, and that all gets stored within the parameters of the large language model. You can store a ton of information that way, because these models are super big, but of course it's pretty static: after training, the parametric knowledge is set and it doesn't change.

That's where source knowledge comes in. When we feed a query or prompt into our LLM, so we have some prompt, here's a question, and the LLM returns an answer based on that prompt, that input is what we would call the source knowledge, while what lives in the model's parameters is the parametric knowledge. When we're talking about retrieval augmentation, what we're doing is adding more knowledge via the source knowledge to the LLM, while not touching the parametric knowledge.

So we're going to start with this notebook; there'll be a link to it somewhere near the top of the video. The first thing we're going to do is build our knowledge base. This is going to be the location where we store all of that source knowledge that we will be feeding, or potentially feeding, into our large language model at inference time, that is, when we're making predictions or generating text. We're going to be using the Wikipedia data set from Hugging Face Datasets, and we'll just have a quick look at one of those examples.
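As a rough sketch of that loading step: the transcript doesn't spell out the exact dataset config or split size, so the Simple English Wikipedia dump and the 10,000-row split below are assumptions, not confirmed details from the video.

```python
# Minimal sketch: load a Wikipedia dump from Hugging Face Datasets.
# The config ("20220301.simple") and split size are assumptions; newer
# versions of `datasets` may point you to the "wikimedia/wikipedia" mirror.
from datasets import load_dataset

data = load_dataset("wikipedia", "20220301.simple", split="train[:10000]")

print(data)                    # columns include: id, url, title, text
print(data[6]["text"][:500])   # peek at the start of one long article
```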
If we go across, we have all this text here; this is what we're going to be putting into our knowledge base, and you can see that it's pretty long, it goes on for quite a while. So there's a bit of an issue here: a large language model, and also the embedding model that we're going to be using, have a limited amount of text that they can efficiently process, and they also have a hard ceiling where they can't process any more and will return an error. More importantly, we have that efficiency threshold: we don't want to feed too much text into an embedding model, because the embeddings are usually of lesser quality when we do that, and we also don't want to feed too much text into our completion model, the model that's generating an answer, because its performance suffers too. For example, if you give it some instructions and feed in a small amount of extra text after those instructions, there's a good chance it's going to follow those instructions; if we put in the instructions and then loads of text, there's an increased chance that the model will forget to follow them. So both embedding quality and completion quality degrade the more text we feed into those models. What we need to do here is cut this long chunk of text down into smaller chunks.

To create these chunks, we first need a way of actually measuring the length of our text. We can't just count the number of words or the number of characters, because that's not how a language model measures the length of text; language models measure text using something called a token. A token is typically a word- or sub-word-length chunk of a string, and the exact size varies by language model and the tokenizer it uses. For us, we're going to be using the gpt-3.5-turbo model, and the encoding for that model is this one here. Let me show you how we can check for that: we import tiktoken, which is the tokenizer, or family of tokenizers, that OpenAI uses for a lot of their large language models, all of the GPT models. We say tiktoken encoding_for_model and pass in the name of the model we're going to be using, gpt-3.5-turbo, and that tells us the encoding we should actually be using, so lucky we checked. In reality there is very little difference between this tokenizer and the p50k tokenizer we saw before, so the difference is pretty minor, but anyway. We can take a look here and see that the tokenizer split this text into 26 tokens. If I take the same text, split it by spaces, and get the length of that list, that's the number of words, and I just want to show you that there's not a direct mapping between the number of tokens and the number of words, and obviously not the number of characters either. The number of tokens is not exactly the number of words.

Cool, so we'll move on. Now that we have this function, which is just counting the number of tokens within some text we pass to it, we can initialize what we call a text splitter. A text splitter allows us to take a long chunk of text like this and split it into chunks, and we can specify the chunk size, so we're going to say we don't want anything longer than 400 tokens. We're also going to add an overlap between chunks. Imagine we split into roughly 400-token chunks: at the end of one chunk and the beginning of the next, we might actually be splitting in the middle of a sentence, or between sentences that are related to each other, which means we might cut out some important connecting information between two chunks. To somewhat avoid this we add a chunk overlap, which says that for chunk zero and chunk one there's an overlap of about 20 tokens that exists within both of them, and that just reduces the chance of us cutting out something, a connection between two chunks, that is actually important information. Then we have the length function, which is what we created before up here, and we also have separators. What we're using here is a recursive character text splitter, and the separators say: try to split on double newline characters first; if you can't, split on a single newline character; if not, split on a space; and if not, split on anything. That's all that is.
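A sketch of that token counting and chunking step, using tiktoken and LangChain's recursive character text splitter as described (older `langchain` import paths from around the time of the video; newer releases moved the splitter into `langchain_text_splitters`):

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# tiktoken gives us the same tokenizer gpt-3.5-turbo uses
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

def tiktoken_len(text: str) -> int:
    # measure length in tokens, not words or characters
    return len(tokenizer.encode(text, disallowed_special=()))

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,            # no chunk longer than ~400 tokens
    chunk_overlap=20,          # ~20-token overlap to keep connecting information
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""],  # double newline, newline, space, anything
)

chunks = text_splitter.split_text(data[6]["text"])
print(len(chunks), [tiktoken_len(c) for c in chunks[:3]])  # each chunk under 400 tokens
```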
So we can run that, and we'll get these smaller chunks. They're still pretty long, but as we can see they are now all under 400 tokens, so that's pretty useful.

Now what we want to do is move on to creating the embeddings. Embeddings, or vector embeddings, are a very key component of this whole retrieval process, and essentially they will allow us to retrieve relevant information that we can then pass to our large language model based on a user's query. What we're going to do is take each of the chunks that we've created and embed them into what are essentially just vectors. But these are not just normal vectors; you can think of them as numerical representations of the meaning behind whatever text is within that chunk, and the reason we can do that is because we're using a specially trained embedding model that essentially translates human-readable text into machine-readable embedding vectors. Once we have those embeddings, we store them in our vector database, which we'll be talking about pretty soon, and then when we have a user's query we encode it using the same embedding model, compare those vectors within that vector space, and find the items that are most similar, basically in terms of their angular similarity. An alternative way to think of it is their distance within the vector space, although that's not exactly right, because it's actually the angular similarity between them, but it's pretty similar.

So we come down to here, and I'm going to first add my OpenAI API key. One thing I should note: obviously you would be paying for this, and if you don't have an API key yet, you can get one at platform.openai.com. Then we need to initialize the text-embedding-ada-002 model, which is basically OpenAI's best embedding model at the time of recording. We initialize it via LangChain using the OpenAIEmbeddings object, and with that we can just encode text: we have this list of text chunks, and we call embed_documents on the embedding model, passing in that list of chunks. Then we can look at the response: what we get back is two vector embeddings, because we passed in two chunks of text here, and each one has a dimensionality of 1536. That is just the embedding dimensionality of the text-embedding-ada-002 model; each embedding model varies, and this exact number isn't universal, but it's within the range of what would be a typical dimensionality for these embedding models.
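A minimal sketch of that embedding step, again using the older LangChain import path (newer versions expose this from `langchain_openai`); `OPENAI_API_KEY` is a placeholder for however you load your key, and `chunks` is reused from the splitter sketch above:

```python
from langchain.embeddings.openai import OpenAIEmbeddings

# text-embedding-ada-002: OpenAI's recommended embedding model at the time of the video
embed = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=OPENAI_API_KEY,  # placeholder: set this yourself
)

texts = chunks[:2]                 # embed a couple of chunks as a quick test
res = embed.embed_documents(texts)
print(len(res), len(res[0]))       # 2 vectors, each of dimensionality 1536
```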
Cool, so with that we can actually move on to the vector database part of things. A vector database is a specific type of knowledge base that allows us to search using these embedding vectors we've created, and to scale that to billions of records. We could literally have billions of these text chunks in there, encoded into vectors, and search through them and return results very, very quickly; at billion scale you're maybe looking at 100 milliseconds, maybe even less if you're optimizing it. And because it's a database, we can also manage our records: we can add, update, and delete records, which is super important for that data freshness issue I mentioned earlier. We can even do things like metadata filtering. To use the example of internal company documents: let's say you have documents that belong to engineering and documents that belong to HR; you could use metadata filtering to filter purely for HR documents or for engineering documents. You can also filter based on dates and all sorts of other things as well.

So let's take a look at how we would initialize that. To create the vector database we're going to be using Pinecone. You'll need a free API key from Pinecone; there may be a waitlist at the moment, but at least at the time of recording the waitlist was being processed pretty quickly, so hopefully you won't be waiting too long. First I'll get my API key and my environment: I've gone to app.pinecone.io, where you end up in your default project by default, then you go to API Keys and click copy, and also just note your environment, which for me is us-west1-gcp. So I run this, enter my API key, and enter my environment, us-west1-gcp.

Now I'm seeing an error, because I've already created the index here, so let me just add another line. I don't want to delete the index, so instead I'll say: if the index name is not in the index list, create it; otherwise I don't need to create it, because it's already there. Of course, if this is your first time running this notebook, it will create that index. After that we need to connect to the index. We're using this GRPCIndex, which is just an alternative to the standard Index; gRPC is a little more reliable and can be faster, so I like to use it, but you can use either, honestly it doesn't make a huge difference. And again, if you're running this for the first time, the vector count is going to say zero, because it will be an empty index; for me there are obviously already vectors in there.
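A sketch of that setup, following the older pinecone-client style shown in the video (the current Pinecone SDK uses a `Pinecone(...)` class instead of `pinecone.init`); the index name and `PINECONE_API_KEY` placeholder are assumptions:

```python
import pinecone  # older pinecone-client; GRPCIndex needs the [grpc] extra

pinecone.init(
    api_key=PINECONE_API_KEY,     # from app.pinecone.io (placeholder)
    environment="us-west1-gcp",   # your environment may differ
)

index_name = "langchain-retrieval-augmentation"  # hypothetical index name

# only create the index if it doesn't already exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        metric="cosine",   # angular similarity, as described above
        dimension=1536,    # must match text-embedding-ada-002
    )

index = pinecone.GRPCIndex(index_name)   # or pinecone.Index(index_name)
print(index.describe_index_stats())      # vector count is 0 on a fresh index
```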
Then what we would do is start populating the index. I'm not going to run this again because I've already run it, but let me take you through what's actually happening. First we set this batch limit, which says I don't want to upsert, or add, any more than 100 records at any one time. That's important for two reasons really: the API request to OpenAI can only send and receive so much data, and the API request to Pinecone, for the exact same reason, can only send so much data, so we limit the batch size so we don't go beyond where we'd likely hit a data limit. Then we initialize this texts list and this metadatas list, and we iterate through the records: we create the article-level metadata, we get our chunks using the split_text method, and then we create our chunk-level metadatas, which are just the metadata we created up here plus the chunk number. Imagine that for one record, like the Alan Turing example from earlier on, we had three chunks from that single record; in that case we would have chunk zero, chunk one, and chunk two, each with its corresponding text, while the rest of the metadata is at the article level, so it doesn't vary per chunk; it's just the chunk number and the text that vary. We append those to our current batches, and once we reach the batch limit we add everything to the index.

We might get to the end and have a few items left over, so we should also catch those: if the length of the texts list is greater than zero, we do the same thing one more time. That's just to catch, say, the final three items that would otherwise have been missed by the initial code, and we don't want to miss anything. We create our IDs using uuid4, we create our embeddings with embed_documents, just like before, and then we add everything to our Pinecone index: the way we do that is we create a list, or an iterable object, that contains tuples of IDs, embeddings, and metadatas, and upsert it. That's it; after that we will have indexed everything. Of course I already had everything in there, so for me the count doesn't change, but for you it should say something like 27.4-ish thousand. So that is our indexing process: we've just added all of the source knowledge to our knowledge base.
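A sketch of that batched indexing loop, reusing `data`, `text_splitter`, `embed`, and `index` from the earlier sketches; the metadata field names (wiki-id, source, title) are assumptions based on the dataset's columns, not confirmed by the transcript:

```python
from uuid import uuid4
from tqdm.auto import tqdm

batch_limit = 100   # stay under OpenAI / Pinecone request-size limits

texts, metadatas = [], []

for record in tqdm(data):
    # article-level metadata: the same for every chunk of this record (field names assumed)
    metadata = {"wiki-id": str(record["id"]), "source": record["url"], "title": record["title"]}
    # split the article into <400-token chunks
    record_texts = text_splitter.split_text(record["text"])
    # chunk-level metadata: chunk number plus the chunk's own text
    record_metadatas = [{"chunk": j, "text": t, **metadata} for j, t in enumerate(record_texts)]
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # once we hit the batch limit: embed, upsert, and reset the buffers
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=list(zip(ids, embeds, metadatas)))
        texts, metadatas = [], []

# catch any leftovers smaller than a full batch
if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=list(zip(ids, embeds, metadatas)))

print(index.describe_index_stats())  # should now show ~27.4k vectors
```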
Then what we want to do, back in LangChain, is initialize a new Pinecone instance. The Pinecone index we just created and populated wasn't built through LangChain; the reason is that creating the index and populating it in LangChain is a fair bit slower than doing it directly with the Pinecone client, so I tend to avoid doing that. Maybe at some point in the future it'll be optimized a bit better, but for now it isn't, so I avoid doing that part within LangChain. But we are going to be using LangChain for the next part, for the querying and for the retrieval augmentation with a large language model, because LangChain makes that much easier. So I'm going to re-initialize Pinecone, but in LangChain now. As far as I know (this might change), the gRPC index wasn't recognized by LangChain last time I tried, so we just use a normal index here, and we initialize our vector store: this is a vector database connection, essentially the same as the index we had up here. The only extra thing we need to do is tell LangChain where the text within our metadata is stored, so we're saying the text field is "text", and we can see that because we created it up here.

Cool, so we run that, and then we can do a similarity search across that vector store. We pass in our query, "who was Benito Mussolini?", and ask for the top three most relevant documents for that query. We see the page content: "Benito Mussolini... an Italian politician and journalist... Prime Minister of Italy... leader of the National Fascist Party", and so on, which is obviously relevant; then the next one, again I think clearly relevant; and obviously relevant again. So we're getting three relevant documents there.

Now, what can we do with that? It's a lot of information; if we scroll across, that's a ton of text, and we don't really want to feed all of that to our users. So what we want to do is come down here and layer a large language model on top of what we just did. Onto the end of that retrieval step we're going to add a large language model: it takes the query, it takes these contexts, the documents that we returned, we put them both together into the prompt, and then we ask the large language model to answer the query based on those returned documents, or contexts. We would call this generative question answering, and let's just see how it works. We initialize our LLM, using the gpt-3.5-turbo model with the temperature set to zero, so we decrease the randomness in the model's generations as much as possible. That's important when we're trying to do factual question answering, because we don't really want the model to make anything up; it doesn't protect us 100% from it making things up, but it limits that a bit more than a high temperature would. Then we use this RetrievalQA chain, which just wraps everything up into a single function: we give it a query, it sends it to our vector database, retrieves everything, and then passes the query and the retrieved documents into the large language model and gets it to answer the question for us. I run this, and it can take a little bit of time, partly because of my bad internet connection and also just the slowness of interacting with OpenAI at the moment. We get: Benito Mussolini was an Italian politician and journalist who served as Prime Minister of Italy, he was leader of the National Fascist Party and invented the ideology of fascism, he was dictator of Italy by the end of 1927, and his form of fascism, Italian fascism, and so on and so on. There's a ton of text in there, and it looks pretty accurate.
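Putting the pieces from this section together, a minimal sketch of the retrieval step plus the generative QA chain might look like this, using the older LangChain import paths (newer versions live in `langchain_openai` / `langchain_pinecone`) and reusing `pinecone`, `index_name`, `embed`, and the `OPENAI_API_KEY` placeholder from the earlier sketches:

```python
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

text_field = "text"  # tells LangChain which metadata field holds the raw text

# standard (non-gRPC) index here, since that's what LangChain expected at the time
vectorstore = Pinecone(pinecone.Index(index_name), embed.embed_query, text_field)

query = "who was Benito Mussolini?"

# pure retrieval: the top 3 most relevant chunks
print(vectorstore.similarity_search(query, k=3))

# generative QA: retrieve, stuff the chunks into the prompt, let the LLM answer
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0, openai_api_key=OPENAI_API_KEY)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever())
print(qa.run(query))
```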
But, you know, large language models are very good at saying things that are completely wrong in a very convincing way, and that's actually one of their biggest problems: you don't necessarily know that what the model is telling you is true. People who use these things a lot are pretty aware of this and will probably cross-check things, but even for me, and I use these all the time, sometimes a large language model will say something and I'm unsure whether it's true or not, I don't know, and then I have to check, and it turns out it's just completely false. That is problematic, especially when you start deploying this to users who are not necessarily using these sorts of models all the time.

There's no 100% foolproof solution to the problem of hallucinations, but we can do things to limit it. On one end, we can use prompt engineering to reduce the likelihood of the model making things up, and we can set the temperature to zero to reduce it further. Another thing we can do, which isn't really modifying the model at all, is giving the user citations, so they can actually check where the information is coming from. To do that in LangChain is really easy: we just use a slightly different version of the RetrievalQA chain, called the RetrievalQAWithSourcesChain, and we use it in pretty much the same way. We pass the same query about Benito Mussolini, run that, and wait a moment. We get pretty much the same answer, in fact I think it is the same answer, but now we can see that we also get the sources of this information. I think we can click on this, and yes, it takes us through, and this looks like a pretty good source, so maybe this is a bit more trustworthy. We can also just use it as a check: we can go through what we're reading, and if something seems a bit weird we can check against the source to see whether it's actually there or not. So that can be really useful; simply adding the source of our information can make a big difference and really help users trust the system that we're building.
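A sketch of that sources variant, reusing `llm`, `vectorstore`, and `query` from the previous sketch; for the sources to come back, it assumes a "source" field was stored in the metadata at indexing time, as in the earlier indexing sketch:

```python
from langchain.chains import RetrievalQAWithSourcesChain

# same idea as RetrievalQA, but the chain also returns the sources it relied on
qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

result = qa_with_sources(query)
print(result["answer"])
print(result["sources"])   # e.g. the Wikipedia URLs stored in the "source" metadata field
```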
Even just as developers, and for people like managers who want to integrate these systems into their operations, having those sources can make a big difference in trustworthiness.

So we've learned how to ground our large language models using source knowledge; source knowledge, again, is the knowledge that we feed into the large language model via the input prompt. Naturally, by doing this we're encouraging accuracy in our large language model's outputs and reducing the likelihood of hallucinations or inaccurate information. We can also obviously keep information super up to date with this approach, and, as we saw at the end, with sources we can actually cite everything, which can be super helpful in trusting the output of these models. We're already seeing large language models being used with external knowledge bases in a lot of really big products, like Bing AI and Google's Bard, and we're seeing ChatGPT plugins starting to use this sort of thing as well. So I think for the future of large language models these knowledge bases are going to be incredibly important: they're essentially an efficient form of memory for these models that we can update and manage, which we just can't do if we rely only on parametric knowledge. So I really think that this long-term memory for large language models is super important; it's here to stay, and whatever you're building at the moment, it's definitely worth thinking about whether it makes sense to integrate something like this and whether it will help.

But for now, that's it for this video. I hope this has been useful and interesting. Thank you very much for watching, and I will see you again in the next one. Bye.
Info
Channel: James Briggs
Views: 32,686
Keywords: python, machine learning, artificial intelligence, natural language processing, nlp, semantic search, similarity search, vector similarity search, vector search, pinecone database, pinecone vector database, vector database, chatgpt, langchain vector database, chroma db, chroma db langchain, llm, langchain tutorial, langchain, james briggs, gpt 3.5, gpt 4, langchain 101, langchain search, langchain memory, ai, openai api, langchain python, langchain ai, retrieval augmentation
Id: kvdVduIJsc8
Length: 31min 0sec (1860 seconds)
Published: Thu May 04 2023