Retrieval-Augmented Generation chatbot, part 1: LangChain, Hugging Face, FAISS, AWS

Captions
Hi everybody, this is Julien from Hugging Face. Retrieval-augmented generation, also known as RAG, is a popular technique to build more efficient chatbots. In this video I'm going to show you how to build a RAG chatbot using a combination of open-source libraries and AWS services: LangChain, Hugging Face, the FAISS library from Facebook, Amazon SageMaker, and Amazon Textract. Quite a lot of toys to play with, so let's get to work.

Before we jump into the code, let's take a look at typical chatbot architectures. The simplest one starts from a user query: we wrap it in a system prompt, send it to a large language model, and generate an answer. This is a good way to start and there's nothing wrong with it, but there are limitations. Knowledge only comes from the initial training process, which means recent events are not taken into account (the so-called cutoff date problem). Domain and company knowledge could also be too shallow, because the LLM wasn't trained on your internal company data, so if you ask deep domain questions you may get shallow answers. And if you take the model out of domain, asking deep domain questions about data the LLM never really saw during training, it's likely that the model will hallucinate. You could fine-tune the LLM on your internal data to mitigate those problems, but how often are you willing to run that? If you need very fresh answers, if you need to know what happened yesterday, fine-tuning is not a great option, because you don't want to fine-tune again and again, every single day.

This is where retrieval-augmented generation steps in, and there are really two different workflows here. The first is an ingestion workflow: we start from internal documents, which could be text or images or anything really, and we embed them using an embedding model, meaning we turn those documents (or document chunks) into high-dimensional vectors that we store in some kind of database we can query. The query workflow then looks like this: we start from the user query, embed the query itself, run a vector proximity search (semantic search) against our embeddings database, and return the top five or top ten documents most closely related to the query. We wrap everything in a system prompt that goes something like "as a helpful assistant, please answer the following query using the context found in these documents", and then we generate the answer.

There are two benefits. First, fresh knowledge: the ingestion process can run constantly, even in real time, and as soon as new knowledge is embedded and available in the database, it can be discovered and added to the generation process. Second, if we trust our search mechanism to work, we're always bringing relevant context to the generation: even if the query takes the vanilla LLM out of domain, at least we're pointing it at relevant information to start generating a good answer, instead of letting it hallucinate and invent things. That's what RAG is all about, and that's pretty much what we're going to build.
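To make the two workflows concrete before opening the notebook, here is a minimal toy sketch in plain Python. This is not the notebook code: the embed() function is a placeholder standing in for a real embedding model, and the chunks are invented.

```python
# Toy illustration of the ingestion and query workflows described above.
# embed() is a placeholder for a real embedding model; the chunks are made up.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in: a real system would call a sentence-embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(8)

# Ingestion workflow: embed document chunks and store the vectors.
chunks = ["Solar investment grew strongly in 2022.",
          "Coal demand plateaued last year.",
          "Wind capacity additions doubled."]
store = [(embed(c), c) for c in chunks]

# Query workflow: embed the query, retrieve the closest chunks, build the prompt.
query = "What is the latest trend for solar investments?"
q = embed(query)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

top = sorted(store, key=lambda pair: -cosine(pair[0], q))[:2]
context = "\n".join(text for _, text in top)
prompt = ("As a helpful assistant, answer the question using this context:\n"
          f"{context}\n\nQuestion: {query}")
print(prompt)  # this prompt would then be sent to the LLM for generation
```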
So let's switch to the notebook and start running some code. What are we building here? In a nutshell, a chatbot that can retrieve information extracted from PDF files. We'll start from a few PDF files containing information about energy trends and the energy market (why not some domain-specific documents?), then process those documents, embed them, and store the embeddings in a FAISS index. FAISS is a really nice library from Facebook; we'll talk about it later. Then we'll deploy an LLM on SageMaker and query it using, hopefully, relevant context extracted from our embeddings. That's the plan, so let's get started. Let me zoom in a bit.

First, we need to install some dependencies: the SageMaker SDK, LangChain to orchestrate everything, and a couple of additional packages for Amazon Textract and PDF processing. Then we import all the objects we need; we'll see them again as we go.

The first step is to deploy our LLM on a SageMaker endpoint. As you can see here, I'm deploying the Mistral 7B model, in the variant fine-tuned for instruction following. Again, all the links will be in the video description. It's a 7-billion-parameter model, so not a lot to worry about; we just need to keep an eye on the prompting format, but we'll get back to that. Deploying this model is super simple. In fact, I didn't write any of this code: I just went to the model page, clicked on "Deploy", selected "SageMaker", and copy-pasted the code here. That's one of the reasons SageMaker is a simple option: you can just copy, paste, and deploy. So we run this, and there's really nothing to explain. We point at the model on the Hub, use the built-in container for LLM inference on SageMaker, and deploy on a small G5 instance. This one has a single A10G GPU, so it's pretty small and not expensive, probably about a dollar an hour; not an expensive instance by any means. We wait for a few minutes and then we have our endpoint. We just need to grab the endpoint name, because LangChain is going to need it.
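For reference, here is a sketch of what that copy-pasted deployment code typically looks like. The model revision, container version, and instance type are my assumptions; use whatever the model page gives you.

```python
# Deploy Mistral 7B Instruct to a SageMaker endpoint (sketch; versions and instance type are assumptions).
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Model configuration: pull the model straight from the Hugging Face Hub.
hub = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.1",  # assumed revision
    "SM_NUM_GPUS": json.dumps(1),
}

# Built-in Hugging Face LLM inference container (TGI); the version is an assumption.
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(image_uri=llm_image, env=hub, role=role)

# Small single-A10G instance; the endpoint takes a few minutes to come up.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

endpoint_name = predictor.endpoint_name  # LangChain will need this
print(endpoint_name)
```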
Now that the endpoint is running in SageMaker, the next step is to configure it in LangChain. If we want to, we can set some model parameters here, such as the maximum number of tokens to generate, plus top-p and temperature, which in a nutshell control how creative you want the answer to be and how exotic you want the vocabulary to be. Feel free to look up those parameters; we've met them before. Then we need to provide input and output transforms, and these are really important because they literally adapt the model input and the model output for LangChain. JSON is going to be our input format, and we need to define the prompt format: we saw on the model page how the model wants to be prompted, and that's exactly what I've implemented here, with the prompt wrapped in those [INST] marker tags, plus the parameters. That's it for the input. For the output, we just decode the answer. The generated answer includes the instructions, which I found inconvenient for printing the result, so I'm basically filtering out everything that comes before the closing prompt tag, and we only see the generated answer. If you also want to see everything that came before, including the RAG chunks and so on, just return the full response instead of splitting it.

Now that we have our preprocessing and postprocessing functions, we can simply define our SageMaker endpoint as a LangChain LLM: the endpoint name, the model parameters, the content handler we just created, and a SageMaker runtime client, because this is what brings the AWS credentials. If you omit it, you're going to get permission errors; we need to prove that we're allowed to invoke that endpoint, and that's how you do it.

Before we go into RAG, we can try a basic question: a zero-shot example, providing no context. Here's my system prompt: "As a helpful energy specialist, please answer the question, focusing on numerical data. Don't invent facts. If you can't provide a factual answer, say you don't know what the answer is." Pretty reasonable. That's my prompt template: the system prompt plus the actual query. I define my LangChain LLM chain with the LLM and the prompt, and then I can ask a question. We're in the energy domain, so my question is: what is the latest trend for solar investments in China? If I run it, this is the answer I get: according to a report by the International Energy Agency, China was the world's largest solar market in 2020 (that's a few years ago), and the report also states that China's solar market is expected to grow; however, the report doesn't provide specific information on the latest trend for solar investments in China. Not a terrible answer, and probably factually correct, because I'm guessing this document was in the training set. But it's a little outdated, with nothing beyond 2020, and the answer is quite honest about the fact that it isn't super specific and definitely isn't about the latest trend. Again, not a horrible answer, but we can certainly do better.
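Here is a rough sketch of that content handler and endpoint configuration, using LangChain's SagemakerEndpoint integration. The parameter values and the output filtering are assumptions; depending on your LangChain version you may pass a region name or credentials profile instead of a boto3 client.

```python
# LangChain content handler + SageMaker endpoint configuration (sketch; values are assumptions).
import json
import boto3
from langchain.llms.sagemaker_endpoint import SagemakerEndpoint, LLMContentHandler

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Wrap the prompt in Mistral's instruction tags and serialize to JSON.
        payload = {"inputs": f"<s>[INST] {prompt} [/INST]", "parameters": model_kwargs}
        return json.dumps(payload).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Decode the response and drop everything before the closing [/INST] tag,
        # so only the generated answer is returned (return the full text to keep the rest).
        response = json.loads(output.read().decode("utf-8"))
        return response[0]["generated_text"].split("[/INST]")[-1].strip()

llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,                # from the deployment step above
    client=boto3.client("sagemaker-runtime"),   # brings the AWS credentials
    model_kwargs={"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
    content_handler=ContentHandler(),
)
```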
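And a sketch of the zero-shot example: a prompt template wrapping the system prompt and the question, fed to a plain LLMChain. The exact prompt wording is approximate.

```python
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# System prompt plus the actual query; no retrieved context yet.
zero_shot_template = """As a helpful energy specialist, please answer the question below, focusing on numerical data.
Don't invent facts. If you can't provide a factual answer, say you don't know what the answer is.

Question: {question}"""

zero_shot_prompt = PromptTemplate(template=zero_shot_template, input_variables=["question"])
zero_shot_chain = LLMChain(llm=llm, prompt=zero_shot_prompt)

print(zero_shot_chain.run("What is the latest trend for solar investments in China?"))
```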
This is where RAG steps in, so let's look at how we can add fresh context to the mix. We have a few more objects here. If you've already run the notebook, you will have embedded the documents and saved the index, so you can load it again and skip the reprocessing; if this is the first time you run it, that shortcut isn't available yet, so let's continue. As mentioned, I'll start from three PDF files coming from the International Energy Agency. You can use anything else; the code below is generic, so other PDF files will run just fine. As mentioned before, we're going to extract information from those files using Amazon Textract, which is a managed service. These are big, multi-page documents, and they need to be in S3, which is why I need an S3 bucket and prefix. I'm copying my three PDF files to that bucket and prefix, and we can see them there. Again, you shouldn't need to change anything here, unless you want to change the prefix. With my three files in place, I can easily build a list of S3 URIs, the full paths to those three PDF files in S3. Simple enough.

Now I'm going to analyze those documents, meaning extract their information. These are complex documents, with lots of tables and lots of graphs, definitely not just text, and that's why I'm using Textract: I just want to extract everything and hopefully use as much information as possible. Textract is actually pretty simple, and it's integrated with LangChain, which is good news. We need two objects: a Textract client from AWS, and, as you can imagine, a splitter to split the extracted documents into chunks. Here I decided on rather small chunks, 256 bytes, without overlap; no overlapping chunks, but feel free to experiment with different values. Then I simply loop over my three URIs, load each document into Textract, receive the extracted document (you can see how simple this is), split each document into those 256-byte chunks, and merge all the chunks. The first document was 137 pages, the second one 181 pages, and the last one 355 pages, so that's close to 700 pages in total, and a little fewer than 10,000 chunks. This took five minutes. Is that fast? Is that slow? I don't know; I didn't try to optimize it. It's simple, it's really all I wanted, and it's certainly fast enough for this demo.

So now we have all those chunks, which are just shorter bits of text extracted from those documents, and the next step is to embed them and store them in our backend. Which embedding model are we using? That's a question I get a lot. We have built a leaderboard for embedding models: you're probably familiar with our LLM leaderboard, but there's also one for embeddings, based on the Massive Text Embedding Benchmark (MTEB), which covers several languages and a range of tasks. Looking at the English results, the best models have lots of dimensions but are pretty large, around 1.3 GB, so I actually went for the smaller version of one of them, which is only about 130 megabytes, just a bit bigger than the sentence-transformers models a lot of folks use out there. It has fewer dimensions, but the benchmark numbers are still very good, and I felt it should be fast enough and accurate enough for this demo. Feel free to experiment with bigger models; obviously, the bigger the embedding model, the longer it takes to embed and the more storage you'll need, because the embeddings themselves are bigger. Anyway, the leaderboard is a good place to start, and this is the model I'm using here, defined as an embeddings model in LangChain. Then I simply create a new FAISS index, starting from all the chunks and embedding them with the embeddings model. I love how simple this is. If you look at the LangChain documentation, you'll see there are many options for data stores, from very simple things to full-blown vector databases; again, feel free to experiment, but for this demo I think the simplicity of FAISS is just excellent. It takes about six minutes to embed everything.
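Here is roughly what the extraction and chunking loop could look like with LangChain's Textract integration; the bucket, prefix, and file names below are placeholders for your own S3 URIs.

```python
# Extract text from multi-page PDFs with Amazon Textract and split it into small chunks.
import boto3
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

textract_client = boto3.client("textract")

# Small chunks, no overlap, as described above.
splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=0)

# Placeholder URIs: multi-page documents must live in S3 for Textract.
s3_uris = [
    "s3://my-bucket/rag-demo/report1.pdf",
    "s3://my-bucket/rag-demo/report2.pdf",
    "s3://my-bucket/rag-demo/report3.pdf",
]

all_chunks = []
for uri in s3_uris:
    loader = AmazonTextractPDFLoader(uri, client=textract_client)
    pages = loader.load()                     # Textract extracts the document page by page
    all_chunks.extend(splitter.split_documents(pages))

print(len(all_chunks), "chunks")
```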
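And the embedding and indexing step. The video doesn't name the exact checkpoint, so I'm assuming a small MTEB-leaderboard model such as BAAI/bge-small-en (about 130 MB); swap in whichever embedding model you prefer.

```python
# Embed every chunk and build a FAISS index (assumed embedding model).
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")

index = FAISS.from_documents(all_chunks, embeddings)
index.save_local("faiss_index")   # the "shortcut" for later runs

# Later runs can reload the index instead of reprocessing everything:
# index = FAISS.load_local("faiss_index", embeddings)
```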
This notebook is running on AWS on a T3 instance, a very small CPU instance, so I guess six minutes is fine. You would obviously get better performance on bigger CPUs or maybe even a GPU, but I wanted this demo to be economical. So, six minutes later, we have our FAISS index and we can save it; now we have the shortcut, and we're pretty much ready to query.

We configure our FAISS index as the retriever, the source of truth, and I've decided to fetch 10 documents per query; they should fit in the context, but again, feel free to experiment with this. Here's my template. I started from the same prompt as before, "as a helpful energy specialist, please answer the question, focusing on numerical data", and so on. The question goes here, and obviously I'm injecting the context into the prompt, so the 10 chunks retrieved through LangChain will be available there. I'm really pointing the model at it: hey, this is useful context to expand your built-in knowledge. There are probably better ways to phrase it, but I thought I would insist on "please use this stuff". I use this template to build the actual prompt; the context and the question, the two input variables, will be injected when we run the query. Now I can build the chain. I'm using a RetrievalQA chain type with the LLM; the "stuff" policy, which just grabs the 10 chunks and puts them into the prompt (there are additional strategies, such as refine or map-reduce, if you have tons of data that won't fit into the context, but I like to start simple, and "stuff" is certainly the simplest); the retriever, which is our FAISS index; and obviously my prompt.
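Here is a sketch of that retrieval chain: the FAISS index becomes the retriever fetching 10 chunks per query, the prompt injects the retrieved context, and the "stuff" chain type simply places those chunks into the prompt. The prompt wording is approximate.

```python
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Fetch the 10 closest chunks for each query.
retriever = index.as_retriever(search_kwargs={"k": 10})

rag_template = """As a helpful energy specialist, please answer the question below, focusing on numerical data.
Use the following context to expand your built-in knowledge. Don't invent facts.
If you can't provide a factual answer, say you don't know what the answer is.

Context: {context}

Question: {question}"""

rag_prompt = PromptTemplate(template=rag_template, input_variables=["context", "question"])

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,                          # the SageMaker endpoint LLM defined earlier
    chain_type="stuff",               # stuff the retrieved chunks straight into the prompt
    retriever=retriever,
    chain_type_kwargs={"prompt": rag_prompt},
)

print(qa_chain.run("What is the latest trend for solar investments in China?"))

# When you're done experimenting, remember to clean up (see the reminder below):
# predictor.delete_model()
# predictor.delete_endpoint()
```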
So now, asking my question again, I get this answer, and the model is pretty definitive about the latest trend. It certainly found information in the index, and it's giving us numbers for 2022. I feel this is just a better answer than the previous one: pointier, better documented, with more numbers. I'm not too sure what that "STEPS" thing means, though, so why don't we ask? What does STEPS mean? Here we get a very good answer: the STEPS scenario is a scenario that provides a sense of the prevailing direction of energy system progression, and so on; STEPS means Stated Policies Scenario. Note that I didn't provide any context in the question. STEPS could mean a million things, but in the context of the extracted information, which is obviously energy related, I get a very clear answer on what it is. Pretty cool stuff. Feel free to ask all kinds of questions and to try other PDF files; as you can see, this is a really simple way to build this. Once you're done, please don't forget to delete the model and, more importantly, the endpoint, to avoid unnecessary charges.

So there you go, this is really what I wanted to show you today. I think it's pretty cool that we have everything in a single notebook, with all the main elements: the embedding model, the LLM, document extraction, document embedding, and document querying. It's a very short notebook, there's really not that much happening here, and hopefully it's a good place to start your own experiments. Again, all the links to everything, including the code, are in the video description. I guess I'll see you soon with more videos. If you have questions, please ask them and I'll try to answer as many as I can. Until next time, keep rocking!
Info
Channel: Julien Simon
Views: 7,895
Keywords: aws, open source, chatbot, langchain, faiss, sagemaker, nlp, ai
Id: 7kDaMz3Xnkw
Length: 24min 7sec (1447 seconds)
Published: Tue Oct 24 2023