Better Llama 2 with Retrieval Augmented Generation (RAG)

Video Statistics and Information

Captions
In today's video we're going to be looking at more Llama 2. This time we're looking at a very simple version of retrieval augmented generation using the 13-billion-parameter Llama 2 model, which we're going to quantize and fit onto a single T4 GPU. The T4 is included in the free tier of Colab, so anyone can run this. It should be pretty fun, so let's jump straight into the code.

To get started with this notebook (there'll be a link to it at the top of the video), the first thing you'll have to do, if you haven't already, is request access to Llama 2, which you can do via a form. If you need guidance on that, there's a link to my previous Llama 2 video where I describe how to go through that process and get access. After getting your access, go to "Change runtime type" and make sure you're using GPU as the hardware accelerator and T4 as the GPU type. If you have Colab Pro you can use one of the faster GPUs and things will run a lot quicker, but a T4 is good enough. Then we install everything we need.

Once that's ready, we come down to the Hugging Face embedding pipeline. Before we dive in, maybe I should explain a little about what this retrieval augmented generation thing is and why it's so important. A problem we have with LLMs is that they don't have access to the outside world; the only knowledge contained within them is what they learned during training, which can be super limiting. For example, a little while ago I asked GPT-4 how to use the LLM chain in LangChain (LangChain being the new LLM framework), and the answer described LangChain as a "blockchain-based decentralized AI language model", which is completely wrong. It hallucinated, and the reason is that GPT-4 just didn't know anything about LangChain. It didn't have access to the outside world; it only had what we call parametric knowledge, the knowledge stored within the model itself, gained during training.

The idea behind retrieval augmented generation is that you give your LLM access to the outside world, or at least, in this example, a subset of it. We do that by searching with natural language, which is ideal because we interact with the LLM using natural language too. We'll ask a question, retrieve relevant information about that question from somewhere else, and feed that relevant information plus our original question back into the LLM. This is what we'd call source knowledge, rather than parametric knowledge.

A key part of this is the embedding model. The embedding model is how we build the retrieval system: it translates human-readable text into machine-readable vectors, and we need those vectors in order to search based on semantic meaning, rather than on keywords as traditional search would. In the spirit of going with open-source (or open-access, as is the case with Llama 2) models, we're going to use an open-source embedding model.
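As a rough sketch of that cell (assuming the model in the notebook is sentence-transformers/all-MiniLM-L6-v2, which matches the 384-dimension embeddings discussed below, and the LangChain API as of mid-2023), initializing the embedding pipeline looks something like this:

```python
# A minimal sketch, not the notebook verbatim: load a small open-source
# embedding model through LangChain's Hugging Face wrapper.
from langchain.embeddings import HuggingFaceEmbeddings

embed_model_id = "sentence-transformers/all-MiniLM-L6-v2"  # assumed model

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={"device": "cuda"},   # runs fine on CPU too
    encode_kwargs={"batch_size": 32},
)

docs = [
    "this is one document",
    "and another document",
]
embeddings = embed_model.embed_documents(docs)
print(len(embeddings), len(embeddings[0]))  # 2 embeddings, 384 dims each
```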
Specifically, we're using the Sentence Transformers library. If you've been watching my videos for a while, this will be a bit of a flashback: we used Sentence Transformers a lot before the whole OpenAI chat thing kicked off. The model here is very small and super easy to run; you can even run it on CPU. If we look at how much RAM we've used, at the moment it seems like hardly any; it may be that nothing is allocated until we actually start creating embeddings, which we do next. You can see we're using the CUDA device, and once we create some embeddings we're using some GPU RAM, but very little, about 0.9 gigabytes, which is nothing.

So what we've done here is created two documents, or chunks of text, and embedded them with our embedding model. The way we've initialized the Sentence Transformer is a little different from how I used to do it: we've essentially initialized it through Hugging Face and then loaded it into the LangChain HuggingFaceEmbeddings object. So we're using Hugging Face, via LangChain, to use Sentence Transformers. There are a few abstractions there, but it will make things a lot easier for us later on.

So we've loaded our embedding model, and we have two document embeddings, which means we had two documents, and each embedding has a dimensionality of 384. With OpenAI, for comparison, you'd be embedding to a dimensionality of 1,536, so with Pinecone in particular (the vector database I'll talk about in a moment) you can fit roughly four of these vectors in the space of one OpenAI embedding. The retrieval performance is lower with these small models, to be honest, but it depends a lot on your use case; much of the time you don't need the performance that OpenAI embeddings give you, and in this example the very small model actually works really well.

Now let's move on to the Pinecone part, where we create our vector database and build our vector index. To do that we need a free Pinecone API key, so I'll click the link in the notebook, which takes us to app.pinecone.io. I go to my default project and then to "API Keys". We need the environment listed there; for me it's us-west1-gcp, but for you it will likely be different, so note whatever environment appears next to your API key, then copy the key. Back in the notebook, paste in your API key and that environment (the cloud region). I initialize the client with my API key, and then we move on to the next cell.

In the next cell we initialize the index: the place where we'll store all the vectors we create with the embedding model. There are a few parameters here. The dimension needs to match the dimensionality of your embedding model; we already found ours, so it's 384. Then the metric: this can change depending on your embedding model. With OpenAI's text-embedding-ada-002 you can use either cosine or dot product; with open-source models it varies more, and sometimes you have to use cosine, sometimes dot product, and occasionally Euclidean, though that one is less common. It's worth checking the model card on Hugging Face to see which metric you need; the most common, the go-to, is cosine.
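A minimal sketch of those cells, assuming the pinecone-client v2 API that was current when this video was published (the key, environment, and index name below are placeholders):

```python
import pinecone

# Placeholders: use your own key and the environment shown next to it.
pinecone.init(
    api_key="YOUR_API_KEY",
    environment="us-west1-gcp",
)

index_name = "llama-2-rag"  # assumed index name

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=384,    # must match the embedding model
        metric="cosine",  # check the model card for the right metric
    )

index = pinecone.Index(index_name)
print(index.describe_index_stats())  # should show 0 vectors so far
```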
So we initialize the index. It does take a minute or so; for me it was about a minute just now. Then we connect to the index using its name, and we can describe the index to see what's in there, which at this point should be nothing.

With the index ready and the embedding model ready, we can begin populating our database. Just like a typical traditional database, with a vector database you need to put things in before you can retrieve them later, so that's what we'll do now. I quickly pulled together a small dataset; I think it's around 5,000 items, and it contains chunks of text from the Llama 2 paper and a few related papers. I built it by going through the Llama 2 paper, extracting the references, extracting those papers as well, and repeating that loop a few times.

Once we've downloaded it, we convert that Hugging Face dataset (this is using the Hugging Face Datasets library) into a pandas DataFrame, and we specify that we'd like to upload everything in batches of 32. Honestly, we could increase that to 100 or so, but it doesn't really matter; it's not a big dataset, so it won't take long to push everything to Pinecone. Looking at the loop: for each batch of 32 we take the batch from the DataFrame, get the IDs first, then the chunks of text, then the metadata. It might help if I show you what's in that DataFrame with data.head(). You can see we have a chunk ID (I believe I combine the DOI and the chunk ID to create the ID for each entry), then the chunk, which is just the chunk of text, and then the paper ID, the title of the paper, a summary, the source, and several other fields. We don't need all of that, so for the metadata we just keep the text, the source, and the title.

We run that, and it should be pretty quick; it took about 30 seconds for me. One thing I forgot to do: you can add `from tqdm.auto import tqdm` and wrap the loop in tqdm to get a progress bar, which is a little nicer than staring at a cell while it does something. Now if we describe the index stats again, we should see about 5,000 vectors in there, as shown in the sketch below.
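A rough sketch of that indexing loop, again under the pinecone-client v2 API; the exact column names here (`doi`, `chunk-id`, `chunk`, `source`, `title`) are assumptions based on what's described above:

```python
from tqdm.auto import tqdm  # progress bar over the batches

batch_size = 32

for i in tqdm(range(0, len(data), batch_size)):
    batch = data.iloc[i:i + batch_size]
    # Build a unique ID for each chunk from the paper DOI and chunk number.
    ids = [f"{row['doi']}-{row['chunk-id']}" for _, row in batch.iterrows()]
    texts = batch["chunk"].tolist()
    # Embed the batch of text chunks with the embedding model from earlier.
    embeds = embed_model.embed_documents(texts)
    # Keep only the metadata fields we actually need at retrieval time.
    metadata = [
        {"text": row["chunk"], "source": row["source"], "title": row["title"]}
        for _, row in batch.iterrows()
    ]
    index.upsert(vectors=zip(ids, embeds, metadata))

print(index.describe_index_stats())  # expect roughly 5,000 vectors
```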
Now, with our index ready like a database, what we want to do is add in the LLM: Llama 2. To do that we'll use the text-generation pipeline from Hugging Face and load it into LangChain. We're using the Llama 2 13B chat model, which you can see here, along with everything that comes with it. I've explained all of this (how to load the model, the quantization, and the rest) several times before, so I'm not going to go through it again; if you do want to go through it, it's in the video I linked earlier. What I will show you is how to get the Hugging Face authentication token. For that, go to huggingface.co, click your profile icon at the top, then Settings, then Access Tokens. Create a new token there (I've already created mine) and just make it a read token; you can use a write token if you want, but it grants permissions you don't need for this. Copy the token and paste it into the string in the notebook, then run the cell to load everything. We need the authentication token because, for Llama 2 and those related models, you need permission to use them, which you get by signing up through Meta's forms as I mentioned earlier. You don't need to authenticate for every model on Hugging Face, but for this one you do.

The model will take a moment to load. Note that I'm checking for a GPU and then switching the model to evaluation mode. Actually, we don't strictly need the device code here, because the device map figures it out by itself, but it's good to confirm we really are using CUDA; it should print out something like "model loaded on cuda:0". I'll skip ahead to when it's ready. That took eight minutes to load, and we can see GPU memory is now at 8.2 gigabytes. Considering that roughly a gigabyte of that was used by the MiniLM model, we're using around seven gigabytes for this quantized version of the model, which is pretty cool. Then I load the tokenizer and the pipeline; again, I went through all of this before, so I won't repeat it. Finally, we initialize the pipeline in LangChain, so that we can use all the different LangChain utilities.

Coming down to the next part, we need to initialize the RetrievalQA chain; this is about the simplest form of RAG you can get for your LLM. For the RetrievalQA chain we need a vector store, which is another LangChain object, and our LLM, which we already have. So we initialize our vector store and confirm it works: we take a query and do a similarity search. This isn't using the LLM; it's just retrieving what it believes are the most relevant documents. They're honestly kind of hard to read (I struggle with them, at least), but we'll see in a moment that the LLM does manage to extract good information from them. Then we create our RAG pipeline: we pass in our LLM, our retriever, and the chain type, where a chain type of "stuff" just means it's going to stuff all of the retrieved context into the LLM's context window alongside the query.
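Pulling those cells together into one hedged sketch: the model ID and general transformers/LangChain APIs match what was available in mid-2023, but the quantization settings, generation parameters, and the "text" metadata key are assumptions on my part, not confirmed from the notebook:

```python
import torch
import transformers
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA

model_id = "meta-llama/Llama-2-13b-chat-hf"
hf_auth = "YOUR_HF_TOKEN"  # placeholder: your Hugging Face read token

# 4-bit quantization so the 13B model fits in a T4's 16 GB (assumed settings).
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",       # picks the GPU automatically
    use_auth_token=hf_auth,
)
model.eval()
print(f"model loaded on {model.device}")  # expect something like cuda:0

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id, use_auth_token=hf_auth
)

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,   # LangChain expects prompt + completion
    max_new_tokens=512,      # assumed generation settings
    temperature=0.1,
    repetition_penalty=1.1,
)
llm = HuggingFacePipeline(pipeline=generate_text)

# Wrap the Pinecone index as a LangChain vector store; "text" is the
# metadata field holding each chunk's text (see the indexing loop above).
vectorstore = Pinecone(index, embed_model.embed_query, "text")

# The simplest RAG chain: retrieve, then "stuff" the chunks into the prompt.
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)
```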
Now we can begin asking questions, starting with: what is so special about Llama 2? If we run that through the LLM alone, it will take a while; we're using just about the smallest GPU possible here, and the quantization step we used to make the model this small also adds time to inference. So we wait a moment for the response; it took about a minute. (If you want to run this in production, you'll probably want more GPU power and to skip the quantization.) The answer we get is about actual llamas: it tells us a load of random things, like how their coats can be a variety of colors, that they're silky (I think it says that somewhere, as it did in a previous output), that they're calm, and so on. That's not what we want; we want to ask about Llama 2 the large language model. So now we run the same question through our RAG pipeline and see what we get. That took about 30 seconds; I think the first time you run the model it's a little slower. We get: Llama 2 is a collection of pretrained and fine-tuned large language models; additionally, they're considered a suitable substitute for closed-source models like ChatGPT, Bard, and Claude; they're optimized for dialogue and outperform open-source chat models on most benchmarks tested. Which is, I think, the special thing about Llama 2.

Let's try some more questions and see whether the RAG version keeps doing better. Asking "what safety measures were used in the development of Llama 2" with just the LLM, no retrieval augmentation, we get an answer where I don't even know what it's talking about; it's almost as if it's rambling about something, I'm just not sure what. Not a good answer. With retrieval augmentation we get: the development of Llama 2 included safety measures such as pretraining, fine-tuning, and model safety approaches, and the release of the 34-billion-parameter model was delayed because they didn't have time to red team it. That's a pretty good answer.

Now let's ask a little more about those red teaming procedures. I won't bother asking the plain LLM, since it clearly isn't capable of giving good answers here, so let's go straight to the retrieval-augmented pipeline. We ask what the red teaming procedures were for Llama 2, and it answers: the red teaming procedures used for Llama 2 included creating prompts that might elicit unsafe or undesirable responses from the model, such as sensitive topics or prompts that could cause harm if the model responded inappropriately, and these exercises were performed by a set of experts. It also notes that the paper mentions multiple additional rounds of red teaming performed over several months to ensure the robustness of the model. Cool.

One final question: how does the performance of Llama 2 compare to other local LLMs? The response: Llama 2 is compared to other models such as Chinchilla and Bard in the paper (although I wouldn't call Bard a local LLM); specifically, the authors report that Llama 2 outperforms these other models on the series of helpfulness and safety benchmarks they tested. Llama 2 also appears to be on par with some closed-source models, at least on the human evaluations they performed. That would be models like GPT-3.5, which seems a little better than Llama 2, though not by much, except for coding: at coding, Llama 2 is pretty terrible, but at everything else it seems pretty good. And that's the example.
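For reference, the comparison above boils down to calling the two objects side by side; a sketch using the names from the earlier snippet:

```python
query = "What is so special about Llama 2?"

# LLM alone: only parametric knowledge, so it answers about the animal.
print(llm(query))

# RAG pipeline: retrieves relevant chunks from Pinecone first; the answer
# comes back under the "result" key of the returned dict.
print(rag_pipeline(query)["result"])
```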
So we can see very clearly that retrieval augmentation works a lot better than no retrieval augmentation, and that's why this technique is so powerful. It means your LLM can answer questions about up-to-date topics, which it otherwise couldn't, and if you work in an organization with internal documents, it means the LLM can answer questions about those too. Overall, retrieval augmentation is really useful in most cases. That's it for this video; I hope it's been useful and interesting. Thank you very much for watching, and I will see you again in the next one. Bye!
Info
Channel: James Briggs
Views: 79,676
Keywords: python, machine learning, artificial intelligence, natural language processing, nlp, Huggingface, langchain chatbot, langchain agent, langchain chatbot tutorial, open source chatbot, open source chatbot alternatives, ai, james briggs, hugging face, hugging face tutorial, open source llm, llama 2, llama 2 huggingface, meta ai, llama 2 langchain, llama 2 chatbot, llama, llama 2 13b, retrieval augmented generation, vector database, vector db, pinecone demo, pinecone, ai search
Id: ypzmPwLH_Q4
Length: 20min 48sec (1248 seconds)
Published: Sat Jul 29 2023