Hugging Face LLMs with SageMaker + RAG with Pinecone

Video Statistics and Information

Captions
Today we're going to take a look at how we do retrieval augmented generation (RAG) with open source models, using AWS SageMaker. To do this we need to set up a few different components. We're using an open source model for our LLM, but it's still a big language model, so we need a specialized instance to run it, and we can provision that through SageMaker. So we'll have one instance within SageMaker that hosts our LLM, and another instance that hosts our embedding model (I'll explain why we want that in a moment); we'll just call it "embed". We're going to see how to set both of those up.

Then we need a dataset. We're not going to use anything crazy big, but we will need one. Our dataset is essentially chunks of information about AWS, and we're going to use it to inform our large language model. These chunks become the long-term memory, or external knowledge base, for our LLM. The LLM has been trained on internet data, so I'm sure it knows about AWS and the services they provide, but it's probably not up to date, and that is the important part here.

What this looks like in practice: we take the relevant information, pass it through the embedding model, and get what we call vector embeddings, which we store in our vector database, Pinecone. Then, when we ask our LLM something about AWS, the query is not going to go straight to the LLM and give us a response like it usually would. Instead it goes to the embedding model, and from that we get a query vector, which I'll call xq. We take that into Pinecone and say: given my query vector, what are the most relevant records you have stored? Those are the AWS documents we embedded and stored earlier, and Pinecone returns a set of them, so we get a load of relevant information. Now we take our query and those contexts, put them together, and that gives us a context-augmented, or retrieval augmented, prompt. We feed that into our LLM, and now it can give us relevant, up-to-date information, which it usually wouldn't be able to do.

So let's dive into how we implement all of this. I'm going to begin on the home page of my AWS console and go to Amazon SageMaker (you can also just search "SageMaker" at the top and click on that), then go to my domain and launch my Studio. Here we are within Studio, and we're going to work through this notebook; there will also be a link to it on GitHub so you can copy it across into your own SageMaker. We start by installing everything we need: the SageMaker SDK, the Pinecone client, and ipywidgets.
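As a rough sketch, the install cell looks something like this; the exact version pins in the original notebook may differ, and the Pinecone code later in this walkthrough assumes the 2.x client:

```python
# Install the libraries used in this walkthrough. Version pins are illustrative;
# the Pinecone calls further down assume the older 2.x client API.
!pip install -qU sagemaker "pinecone-client<3" ipywidgets
```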
I said we're going to use open source models, and we're getting those from Hugging Face. To do that we need at least this version of the SageMaker SDK, because SageMaker has built-in support for Hugging Face, so we can import everything we need directly: the HuggingFaceModel class, which is essentially the configuration for the image we'll be initializing. Then we decide which model we'd like to use; I'm going to use Google's Flan-T5 XL.

Where do you get the model ID from? Go to huggingface.co, head over to Models at the top, and search. If we filter by the text-generation task we get a list of all the text generation models (you can see Llama 2 in there), but the Google Flan-T5 model actually isn't tagged under text-generation, so let's remove that filter and go again; it's tagged as text2text-generation, so maybe you could just apply both, one at a time. There we can see T5, and there's also a bigger version of the model; you could try that as well, and your results would be better, but you'd need a bigger compute instance to run it on. I'm going to go with the Flan-T5 XL model, which gives us good enough results, particularly when we're retrieving all this relevant information. The model ID is just the string on the model page, and that's what we copy into SageMaker, along with the text-generation task.

What we then need to do is retrieve the LLM image URI. There are different images you can use for Hugging Face models; if you're deploying a large language model, this is the one you want. Then we initialize the model (I believe this object wraps that image), and the next call deploys it to an instance. You can see a list of instances on the SageMaker pricing page (aws.amazon.com/sagemaker/pricing); let me copy the instance name and Ctrl+F on that page. Here it is: it uses an NVIDIA A10G, the instance memory is 64 GB, and the GPU memory, which is probably more important, is 24 GB, so it's definitely big enough for our Flan-T5 XL model.

Before running that, let me show the Amazon SageMaker console we saw before: if we go down to Inference and open Models, there's nothing in there at the moment, and we have nothing in Endpoints or Endpoint configurations either. That will change as soon as I run the next cell, which deploys our model. It takes a little bit of time; you'll see a little loading bar at the bottom in a moment, so I'm going to skip ahead to when it's done. While we're waiting, here's where we are in the rough diagram from before: we have just initialized this endpoint, so our LLM is now being deployed within SageMaker, using the Flan-T5 XL model as I said. That has now finished, and we can move on to the next steps.
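A minimal sketch of that deployment step is below. The container version passed to get_huggingface_llm_image_uri and the endpoint name are assumptions; check which LLM image versions are available in your region.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # IAM role of the Studio user

# Tell the container which Hugging Face model to pull and which task to serve
hub_config = {
    "HF_MODEL_ID": "google/flan-t5-xl",
    "HF_TASK": "text-generation",
}

# Dedicated LLM (text-generation-inference) image; the version here is an assumption
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

llm_model = HuggingFaceModel(env=hub_config, role=role, image_uri=llm_image)

# Deploy to a GPU instance: ml.g5.2xlarge has one NVIDIA A10G with 24 GB GPU memory
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="flan-t5-demo",  # hypothetical name
)
```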
What I want to do next is show you the difference between asking an LLM a question directly and asking it a question while providing some context, which is obviously what we want to do with RAG. I'm going to ask "Which instances can I use with Managed Spot Training in SageMaker?" and send it directly to the LLM. The generated answer refers to "SageMaker and SageMaker XL", which sounds like a great product but, as far as I'm aware, doesn't exist. So we need to pass some relevant context to the model. That context would look something like this (this is just an example, not how we're going to do it for real; I just want to show you what happens): we tell it "Managed Spot Training can be used with all instances supported in Amazon SageMaker." We run that, then create a prompt template where we feed in our context and the user's question, and that gives us the full prompt. Then we call llm.predict again, but this time with what is effectively a retrieval augmented prompt (retrieval in the sense that we put that information in there ourselves; later, of course, we'll automate it). The answer we get this time is "all instances supported in Amazon SageMaker", which is actually correct.

I also want to check whether our LLM is capable of following our instructions, because in the prompt I said: if you do not know the answer, or the context doesn't contain the answer, truthfully say "I don't know". So I ask it something it cannot know, like what color my desk is. Obviously the LLM doesn't know this (they're not that good yet), so it says "I don't know", which is great.
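Roughly, the two calls look like this; the prompt wording is paraphrased from the video rather than copied exactly:

```python
question = "Which instances can I use with Managed Spot Training in SageMaker?"

# 1) Ask directly: the model has to rely on whatever it memorized during training
out = llm.predict({"inputs": question})

# 2) Ask again, but hand the model one relevant piece of context
context = (
    "Managed Spot Training can be used with all instances "
    "supported in Amazon SageMaker."
)

prompt_template = """Answer the following QUESTION based on the CONTEXT given.
If you do not know the answer and the CONTEXT doesn't contain the answer
truthfully say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

out = llm.predict(
    {"inputs": prompt_template.format(context=context, question=question)}
)
print(out[0]["generated_text"])
```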
But obviously I just fed in the context by hand, and we're not going to do that in a real use case. In reality we'll probably have tons of documents and we'll want to extract little bits of information from them. One thing I've seen people do a lot is feed all of those documents into the LLM at once. Basically, don't do that, because it doesn't work very well. There's actually a paper on this (I'll make sure there's a link, probably at the top of the video right now, if you want to read it) showing that if you fill your LLM's context window with loads of text, it tends to forget everything that isn't either at the start or the end of that context window. It's also expensive: the more tokens you use, the more you pay. What we want instead is to be more efficient, and for that we use RAG, retrieval augmented generation. Essentially, given a question, we find the chunks of text from a larger database that look most likely to answer it.

To make this work, this is where our embedding model comes into play, so now we need to deploy it. Again we're using Hugging Face; we can copy the model ID and find it under Models. We're using the MiniLM model here; it's a very small and efficient model, but the performance is actually fairly good. That means two things: first, when we configure the HuggingFaceModel we don't need the LLM image, because this isn't a large language model, it's a small transformer model; and second, we change the task to feature-extraction, because we're extracting embeddings, which are like features, from the model.

We're deploying it to this instance; let's see where that is on the pricing page. It's actually not even on this page, and I don't remember where I found it, but essentially this is a small model, so if I search for "t2" you can see some similar instances. We're using ml.t2.large, which is just a CPU instance, not even a GPU, but again this embedding model is very small; you could use a GPU if you wanted it to be quicker. The memory is just 8 GB, which is plenty for this model; in reality I think you can load it into around 2 GB of GPU RAM, so that should be fine. Now let's deploy that.
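A sketch of that encoder deployment, assuming the sentence-transformers/all-MiniLM-L6-v2 model; the framework versions and endpoint name are assumptions, so match them to whichever Hugging Face DLC combination is available to you:

```python
from sagemaker.huggingface import HuggingFaceModel

hub_config = {
    "HF_MODEL_ID": "sentence-transformers/all-MiniLM-L6-v2",
    "HF_TASK": "feature-extraction",  # we want embeddings, not generated text
}

encoder_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.12",  # assumption: pick a supported DLC combination
    pytorch_version="1.9",
    py_version="py38",
)

# A small CPU instance is plenty for this model
encoder = encoder_model.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.large",
    endpoint_name="minilm-demo",  # hypothetical name
)
```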
That has now deployed, so we have both our LLM and our embedding model running, and we can have a look over in SageMaker. We can see both models (the one deployed earlier is our LLM, and this one is our embedding model), and under Endpoint configurations we can see the configurations for the two images, the MiniLM demo and the Flan-T5 demo, plus the actual endpoints that we'll be calling later. Actually, we already called the Flan-T5 endpoint, and we're about to call the MiniLM endpoint.

So what am I showing you next? Next I'm going to show you how we create xq. I'm not going to use the dataset yet; I'm just going to create a couple of small examples, pass them into our embedding model, and that will create our query vectors, the xq vectors. To do that we have our encoder, and we just use encoder.predict. Here we're passing in two contexts (chunks of text, documents, whatever you want to call them), which means we should get two embeddings back, and we do, but if we take a look at each of them they're not what we'd expect. When we create embeddings with these sentence transformer models, or any other embedding model, we expect a single vector out, and the dimensionality I'm expecting from the MiniLM model is 384. What we see here instead is two eight-by-something objects. Looking inside, we do have our two records, which makes sense, but each record, each embedding output, is actually eight 384-dimensional vectors.

The reason is that we put in some text (two inputs, something random, I don't remember exactly what I wrote), and each of those sentences gets broken apart into what we call tokens. Say the first sentence is short and the second one contains eight tokens, so it's a bit longer. The shorter sentence gets padded with what we call padding tokens, extending it to the same length as the longest sequence within the batch we pass to the model. So now we have two lists of eight tokens each, and those are passed into our embedding model, which outputs a vector embedding at the token level. That means we get eight token-level embeddings per input, but we want a single vector that represents the whole sentence or document. So we do something called mean pooling, where we take the average across each dimension over the tokens, and that gives us a single sentence embedding. We just need to add that onto the end of our process, and with that we get xq, our query vector. Actually, sorry, not necessarily the query vector: in this case these would be what we can call xc or xd, our context or document vectors, because right here we're not doing the query path yet, we're going along the indexing path, from the documents into the embedding model, creating what we'll call the xc vectors. So ignore the xq bit for now.

To get those, we take the mean across a single axis, and you can see that we now have two 384-dimensional vector embeddings. Then I just want to package that into a single function: it takes a list of strings, which are our documents or contexts, creates the token-level embeddings, and then does the mean pooling to give us sentence-level embeddings. That is how we create our context (or document) embeddings.
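The mean pooling step, as a small sketch: the endpoint returns token-level embeddings of shape (batch, tokens, 384), so averaging over the token axis gives one vector per input.

```python
import numpy as np

def embed_docs(docs: list) -> list:
    # Token-level output: (batch, tokens, 384); mean pooling over axis 1
    # collapses it to one 384-dimensional vector per document.
    out = encoder.predict({"inputs": docs})
    return np.mean(np.array(out), axis=1).tolist()

# Two short texts in, two 384-dimensional vectors out
vecs = embed_docs(["something random", "a slightly longer random sentence here"])
```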
Now we want to take what we've just learned and apply it to an actual dataset. We're going to use the Amazon SageMaker FAQs, which we can download from here, and open the data with pandas. We have Question and Answer columns, and we're going to drop the Question column because we only want the answers; that's all we'll be indexing. That gives us the database part of our diagram. Now we need to embed these answers, but we also need somewhere to store them: we need to take our context vectors and store them within Pinecone, and to do that we first need to initialize a vector index.

To initialize our connection with Pinecone we need a free API key, which we can get from app.pinecone.io. Once we're in Pinecone, we go over to API Keys, copy the key, and also note the environment, because we need that as well; I'm in us-west1-gcp, and the API key gets pasted in here. With those we initialize our connection to Pinecone, and we can make sure it's connected by listing the indexes. Mine is not empty, because I'm doing a lot of things with Pinecone right now, but in reality yours probably would be. One thing I do need to do is delete the index with this name that's already running, so I check whether it exists and, if so, delete it, because I want to start fresh. Then I create a new index with the same name. The dimension tells Pinecone the dimensionality of the vectors we'll be putting into the index, which we know is the 384 we saw before for our MiniLM model, and we also set the metric. With most embedding models you can use the cosine metric, but some need dot product or Euclidean distance, so check which embedding model you're using; if you can't find any information on which metric to use, cosine is usually a safe assumption. Then we just wait for the index to finish initializing before moving on, and list the indexes again, which looks exactly the same because I already had the retrieval augmentation AWS index in there. It takes around a minute to run.

I've written here that we do this in batches of 128; ignore that, we're actually doing it in batches of two, which is fine given the dataset size, because it's a really small dataset. Obviously, if you want to do this for a big dataset you should use a larger batch size, otherwise it will take a really long time, and to use a larger batch size you'd need a larger instance than the ml.t2 we're using. We're just going to upsert up to 1,000 of those vectors. We initialize our connection to the specific index we created above, and then loop through and upsert everything. Let me explain what's happening as it runs: we go through in batches of two; we get the IDs for each batch, which here are just 0, 1, 2, 3 and so on, nothing special (you should probably use real IDs if you're building something real); we create metadata for each record, where I just store the text within Pinecone because it makes things a little easier later on when we're retrieving everything; then we create our embeddings by taking the answers (the documents) in the batch and calling embed_docs, the function we defined earlier; and finally we upsert everything. That's it. If we now look at the number of records within the index, we should see 154, which is a tiny, tiny index; honestly, in reality you probably wouldn't use Pinecone for something this small, you'd want tens of thousands to millions of vectors, but for this example it's fine.
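Sketched with the 2.x pinecone-client API used at the time of the video (newer clients replace pinecone.init with a Pinecone class); the index name and the df["Answer"] column come from the FAQ dataframe described above and are assumptions that may differ slightly from the notebook:

```python
import pinecone

pinecone.init(
    api_key="YOUR_API_KEY",      # from app.pinecone.io
    environment="us-west1-gcp",  # your environment may differ
)

index_name = "retrieval-augmentation-aws"  # hypothetical name

# Start fresh if the index already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

pinecone.create_index(
    name=index_name,
    dimension=384,    # must match the MiniLM embedding size
    metric="cosine",
)
index = pinecone.Index(index_name)

answers = df["Answer"].tolist()[:1000]  # cap at 1,000 records
batch_size = 2  # tiny batches only because the encoder instance is small

for i in range(0, len(answers), batch_size):
    i_end = min(i + batch_size, len(answers))
    ids = [str(x) for x in range(i, i_end)]              # simple positional IDs
    metadatas = [{"text": t} for t in answers[i:i_end]]  # keep the raw text
    embeddings = embed_docs(answers[i:i_end])
    index.upsert(vectors=list(zip(ids, embeddings, metadatas)))

index.describe_index_stats()  # should report 154 vectors for this dataset
```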
Now let's take the question we initialized earlier: which instances can I use with Managed Spot Training in SageMaker? I'm going to embed that to create our query vector, which I was calling xq earlier, and then query Pinecone, including the metadata so that we can also see the text of the answers. We run that and get these contexts back. In terms of the diagram, we've built our database, and now we're asking a question, taking the query through the embedding model (this time we really are going along the query path), creating our query vector, taking it into Pinecone, and getting the relevant contexts back; the matches we just saw are those relevant contexts. So what's left? We need to take our query and our contexts and feed them into our LLM. I'll grab those contexts; it's literally just a list of the retrieved answers, and I can even show you that.

What we're going to do is construct this into a single string, and here we need to be careful about how much data we feed in at any one time, because we're not using a massive large language model; we're using Flan-T5 XL, which is okay, but it cannot hold a ton of text within its context window, so we need to be extra careful. What we do is iterate over the texts in the contexts and add each one until we reach a point where we can't add any more because we've hit the limit we set for the context window, and that limit is 1,000 characters. We run that, then run it with our contexts, and let's see what it actually returns; let me print it so it's easier to read. It tells us (sorry, I forgot I added this log) that with a maximum sequence length of 1,000 it selected the top four document sections, so we retrieved five and we're only passing through the top four; the last one couldn't fit within the limit we set. Great.

Now we do the same as we did much earlier with the prompt that said "answer the question below given the context". Let me do that in another cell so I can show it to you: print the text input, "answer the following question based on the context", with our context fed in and then our question. Now let's predict with that and see what we get. Which instances can I use with Managed Spot Training in SageMaker? "All instances are supported." So we've taken our contexts and fed them, along with our question, into a new prompt, our retrieval augmented prompt, which the LLM then uses to generate an answer. That is the full process; we now have our retrieval augmented generation pipeline and the answer it produces, so we can put all of that together into a single RAG query function.

Let's run that. Initially I'm going to ask the same question, because I don't know too much about AWS, so I want to ask something I know the answer to, the spot instances one, and we get the right answer. Now, I checked the dataset and there isn't actually any mention of Hugging Face instances in SageMaker, so although a question about them is a relevant question, the model should say "I don't know", because it doesn't actually have that piece of information. We can test that. The first chunk we get back here is the set of contexts being returned, and we can see they don't contain anything about Hugging Face; they're talking about something else. The reason we retrieve this irrelevant information is that when we embed the query and ask Pinecone for the top five most relevant records, it will always return the top five most relevant items, even though nothing in our database mentions Hugging Face. It does that, but fortunately we've told our LLM that if the context doesn't contain relevant information it should respond with "I don't know", and that is exactly what it does here: it responds with "I don't know".
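Putting the pieces together, a sketch of the context construction and the combined RAG query function used above, reusing prompt_template and embed_docs from earlier; the 1,000-character budget and the separator are the ones described in the walkthrough:

```python
def construct_context(contexts: list, max_section_len: int = 1000) -> str:
    # Add contexts until the character budget for the prompt is used up
    chosen, total = [], 0
    for text in contexts:
        total += len(text) + 4  # rough allowance for the separator
        if total > max_section_len:
            break
        chosen.append(text)
    return "\n---\n".join(chosen)

def rag_query(question: str) -> str:
    # Embed the question, retrieve the top 5 most similar records from Pinecone,
    # pack them into the prompt, and generate the final answer.
    xq = embed_docs([question])[0]
    res = index.query(xq, top_k=5, include_metadata=True)
    contexts = [m["metadata"]["text"] for m in res["matches"]]
    prompt = prompt_template.format(
        context=construct_context(contexts), question=question
    )
    out = llm.predict({"inputs": prompt})
    return out[0]["generated_text"]

rag_query("Which instances can I use with Managed Spot Training in SageMaker?")
```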
So, with that, we've seen how to use SageMaker for retrieval augmented generation with Pinecone using open source models, which I think is pretty cool and relatively easy to set up. One thing I should show you very quickly before finishing: right now we have some running instances in SageMaker, and you should probably shut them down. We can do that by going to our Endpoints, selecting them, and clicking Delete, and we'll want to do the same for the other items: the endpoint configurations for the images, and the models as well. Once you've gone through and deleted those, you won't be paying anything more for following along.

So that's it for this video. We've seen how to do RAG with open source models and with Pinecone, and it seems to work pretty well. Obviously, when you want more performant generations you'll probably want to switch up to a larger model; the Flan-T5 XL we demoed here is pretty limited in its abilities, but it's not bad, and it's definitely not bad for a demo. I hope this has all been useful and interesting. Thank you very much for watching, and I will see you again in the next one. Bye!
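For reference, the cleanup described above can also be done from the notebook instead of the console; a rough sketch (delete_endpoint also removes the endpoint configuration by default):

```python
# Stop paying for the two running endpoints once you're done
encoder.delete_model()
encoder.delete_endpoint()
llm.delete_model()
llm.delete_endpoint()
```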
Info
Channel: James Briggs
Views: 15,120
Keywords: python, machine learning, artificial intelligence, natural language processing, nlp, Huggingface, open source chatbot, open source chatbot alternatives, ai, james briggs, hugging face, hugging face tutorial, open source llm, retrieval augmented generation, vector database, vector db, pinecone demo, pinecone, ai search, pinecone tutorial, pinecone chatbot tutorial, pinecone vector database, pinecone chatbot, flan t5, aws sagemaker, sagemaker jumpstart tutorial, sagemaker chatbot
Id: 0xyXYHMrAP0
Length: 32min 30sec (1950 seconds)
Published: Tue Aug 22 2023