Llama 2 in LangChain — FIRST Open Source Conversational Agent!

Video Statistics and Information

Captions
A few days ago Meta AI released Llama 2. What's exciting about Llama 2 is that it's open source, and it is currently the best performing open source model across a big variety of different benchmarks. One of the things I'm personally very excited about: when I see these new open source models being released, one of the first things I do is try them out as a conversational agent, that is, a chatbot that is actually able to use tools. Every single time I have tried this so far with other models I've been pretty disappointed; they either cannot use tools at all, or they're just very unreliable. So this "will it work as a conversational agent" benchmark has just become my personal go-to when new models are released. It's my way of benchmarking where open source is compared to OpenAI models; generally speaking, GPT-3.5, text-davinci-003, and especially GPT-4 are pretty capable as conversational agents. What I find in real world use cases is that conversational agents are the future of how we interact with large language models. Having a simple chatbot that just talks to us is great, but it's limited: it doesn't have the flexibility and access to external information that a conversational agent has, and it cannot use tools, like a Python interpreter, that a conversational agent can use. So that, for me, is super important, and finally with Llama 2 we have a model that has actually passed that test. I fairly quickly managed to prompt engineer my way to getting a Llama 2 model, the fine-tuned chat version of Llama 2, to work as a conversational agent, which I think is pretty insane.

So what I want to do in this video is show you how you can do the same. We're going to take a look at the biggest Llama 2 model, the 70B parameter model, and we're going to quantize it so that we can fit it onto a single A100 GPU. I'm actually going to be running all of this on Colab, so you can go ahead and run the same notebook. With this approach we're going to be able to fit that 70 billion parameter model into a minimum of 35 gigabytes of GPU memory, though after multiple interactions it pushes its way up to more like 38 gigabytes, which is really not that much for such a performant model.

Now let's dive into how we can actually do this. The first thing we have to do is sign up and get access to these models. It's pretty straightforward and doesn't take that long. Head over to huggingface.co/meta-llama, and from there go to the Meta website; we click on that and request access to the next version of Llama. You fill that out, and for me I got a response almost instantly (I tried with two different emails), and basically they're going to send you something like this: okay, you're all set, start building with Llama 2.
It also gives you the model weights that are available. This is not every single Llama 2 model; there is also a 34 billion parameter model which they have not finished testing yet, so that hasn't been released just yet. The one we are going to be using is the Llama 2 70B chat model, so on Hugging Face we need to go to llama-2-70b-chat-hf; this is the model we want to be using. You'll see that there's this "access Llama 2" form on Hugging Face. One thing you need to be aware of here, and it actually says it right there, is that your Hugging Face account email address must match the email you provide on the Meta website. So a minute ago, when we entered our details on the Meta website, make sure you used the email that you also use on Hugging Face. Once you've done that you can submit, and as long as those emails line up you will get access fairly quickly.

Now, besides waiting for that access to come through, one thing you will need is an access token: go to your profile, go to Settings, and create an access token. This will allow us to download the model within our code. You will need to create a new token; I'm just going to call this one "metallama" and we just need read permissions. With that we generate the token and I'm just going to copy it.

This is the notebook that we're going to be working through in this video; there will be a link to it at the top of the video around now, so you can follow along if you like. I will pre-warn you that parts of this notebook can take a little bit of time, particularly when you're downloading the model. With that in mind, I wouldn't necessarily recommend running this on Colab, because you're going to have to re-download the model every day that you use it, which is not ideal and fairly expensive; you should probably run this on your local computer if you have a good GPU, or on a cloud service somewhere.

So we come down to here; you'll need to enter your Hugging Face API key, and let me come down and show you what is happening. There's a fair bit of code that is just initializing the model for us, and as I mentioned, this download and initialization of the model does take a bit of time. This has actually been running now for one hour and ten minutes or a little bit longer, and I'm not expecting it to finish too soon, although I'm hoping it will not take too much longer; essentially we're going to be waiting a while for the model to download. But let's come up here and go through the code we've used to initialize it first. We're doing a pip install of all the libraries we're going to be using, and we do need all of these: Hugging Face Transformers, then these other libraries, which are there so we can run large language models and optimize how we're running them, and we also have LangChain, because later on in the notebook we're going to be using LangChain to create that conversational agent. What we need here is the large language model, a tokenizer for the large language model, and also a stopping criteria object, which is more of an optional item, I would say, for this model. Let's talk about those, starting with the LLM. For the LLM we have this model ID, which is coming from Hugging Face; if we come up here again and type in "llama 2" we see that there are all these different model IDs, and the one that we're using is this one here, llama-2-70b-chat-hf.
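For reference, here is a minimal sketch of that setup stage. The exact package list and versions in the notebook may differ, and the token value is a placeholder you would replace with your own:

```python
# Install the libraries used in the walkthrough (package list is illustrative).
# !pip install -qU transformers accelerate einops xformers bitsandbytes langchain

from torch import cuda

# The 70B chat model from the Meta Llama 2 collection on Hugging Face.
model_id = 'meta-llama/Llama-2-70b-chat-hf'

# Use the GPU if one is available (the notebook assumes a single A100).
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Read-only access token created under Settings -> Access Tokens on Hugging Face.
hf_auth = '<YOUR_HF_AUTH_TOKEN>'  # placeholder, not a real token
```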
So we have our model ID, and we're checking that we have a GPU available. Then we have this bitsandbytes config object. I've spoken about this in previous videos, so I'm not going to go too in depth, but essentially what we're doing here is minimizing the amount of GPU memory we need to store the model. This is a 70 billion parameter model, so let's do some very quick maths. We have 70 billion parameters, and each of those parameters in the standard data type is 32 bits of information: the standard data type is a float32, which is 32 bits. Within each byte there are eight bits of information, so we can calculate how much memory we need to store the model: it's just the number of parameters multiplied by the data type size in bits, divided by eight. That gives us 280 gigabytes, which is a lot; that's many, many GPUs. A single A100 is, I think, 40 gigabytes, so we'd need a few of those.

By doing this bitsandbytes quantization we can minimize that. What we're essentially doing is switching from a float32 data type to an int4 data type, which contains four bits of information, so now each of those parameters is not 32 bits, it's four bits. Let's calculate that: 70 billion parameters times 4 bits, divided by eight, gives us 35 gigabytes. Now, that's not precise, because with this quantization method, if we just converted everything into int4 we would lose a lot of performance. It works in a more intelligent way, quantizing the parts of the model that don't need quite as much precision, while the bits that do require more precision are converted into 16-bit floats. So it will be a little bit more than 35 gigabytes, but we're going to be within that ballpark, and that allows us to load this model onto a single A100, which is pretty incredible.

Then we load the model config from Hugging Face Transformers. Because we're downloading that from Hugging Face, we need to make sure we're using our authorization token, which you will need to set in here, and then we also download the Llama 2 model itself. We need to allow the custom remote code, because this is a big model and there is custom code that is needed to load it; you don't need that for all models on Transformers, but you do need it for this one. We pass the config object we just initialized and the quantization config we initialized up here, device_map needs to be set to auto, and we again need to pass in our authorization token. After that we switch the model into evaluation mode, which basically means we're not training the model, we're going to be using it for inference or prediction. And then we just wait. This is almost done now; I think it has just finished downloading the model, and now we need to wait for it to actually initialize the model from all of those downloaded shards, so I will see you in a few minutes when that is finished.

Okay, everything has now loaded and initialized, so we can get on with the rest of the code. We need the tokenizer: the tokenizer just converts plain text into what the model will actually be reading, so I'll make sure I define this and rerun it. So it converts plain text into tokens, which the model will read.
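Here is a rough sketch of that quantized loading step, assuming the transformers and bitsandbytes APIs available around the time of the video (argument names such as `use_auth_token` may have changed in newer releases):

```python
import torch
import transformers

# Rough memory math from above:
#   float32: 70e9 params * 32 bits / 8 = 280 GB
#   int4:    70e9 params *  4 bits / 8 =  35 GB (plus overhead for the parts
#            kept in 16-bit precision)

# 4-bit quantization config so the 70B model fits on a single A100.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Model config from Hugging Face (requires the auth token from earlier).
model_config = transformers.AutoConfig.from_pretrained(
    model_id, use_auth_token=hf_auth
)

# Download and initialize the model itself; this is the slow step.
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,        # allow the custom loading code for this model
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth,
)
model.eval()  # inference mode, not training

# Tokenizer: converts plain text into the token IDs the model reads.
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id, use_auth_token=hf_auth
)
```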
Then we come down to the stopping criteria of the model. With smaller models this is pretty important; with this model I would say less so, but we can add it in as a precaution. Basically, if we see that the model has generated either of these two items, we cut it off. This comes from a chat log: the assistant types its reply, and if it then moves on to the next line and starts generating the text for the human's response, it's generating too much text and we want to stop it. So we have that as a stopping criterion, and we also have these three backticks. The reason we use the three backticks is that when we are using Llama 2 as a conversational agent we actually ask it to reply to everything in, essentially, a markdown block containing JSON output. So we'll have it reply to everything in this format, and in there we'll have an action, which is something like "use a calculator", and also the action input, which would be something like "2 + 2". That is why we're including this in the stop list: essentially, once we get to that point we want the chatbot to stop generating anything. But as I said, with this model it doesn't seem to be that necessary, so you can add it as a precaution; what I'm going to do is just skip it for now, since I don't necessarily need it. If you do want to include it, you just uncomment that line, but I'm not going to initialize with it; if we do see any issues, then we'll go back and run it with the stopping criteria included.

Next we initialize the text generation pipeline with Hugging Face, so we can ask it to generate something. This is a question that I've used a few times in the past; we just want to make sure that it is actually working on the Hugging Face side of things, so: can this Hugging Face-initialized model generate text? It will take a little bit of time. As I said before, this is exciting because it is finally able, at least at a very basic level, to act as a conversational agent. In terms of speed and hardware requirements it's not the most optimal solution, at least not yet, but that's something that can be solved with more optimized hardware, or just throwing a load of hardware at it, at least on the time side of things. So that will take a little while to run, and we see that we get this response, which I think is relatively accurate; I haven't read through it all, but it looks pretty good.

Then, now that we have everything in Hugging Face, we want to transfer it over into LangChain. We do that by initializing this HuggingFacePipeline object from LangChain with the pipeline we initialized up here, and we just treat that as the LLM. So we run that, then run the same query again, and this produces a pretty similar output to what we got before; it's saying the same sort of thing, just with more text. Cool.

Now, coming out of here we have everything initialized in LangChain, so what we can do is use all of the tooling that comes with LangChain to initialize our conversational agent. A conversational agent, as I mentioned before, is conversational, meaning it has some sort of conversational memory, and it is also able to use tools; that is the advantage of using a conversational agent versus just a standard chatbot. So we initialize both of those.
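Before moving on to the agent, here is a hedged recap of the pipeline steps just described: the optional stopping criteria, the Hugging Face text-generation pipeline, and the LangChain wrapper. Import paths follow the pre-1.0 langchain releases used in the video, the generation parameters are illustrative, and the test prompt is just a stand-in:

```python
import torch
import transformers
from transformers import StoppingCriteria, StoppingCriteriaList
from langchain.llms import HuggingFacePipeline

# Optional stopping criteria: cut generation off if the model starts writing
# the human's turn or closes the ```-fenced JSON block.
stop_list = ['\nHuman:', '\n```\n']
stop_token_ids = [
    torch.LongTensor(tokenizer(x)['input_ids']).to(device) for x in stop_list
]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])

# Text-generation pipeline; stopping_criteria is left commented out here,
# as in the video, but can be passed in if the model over-generates.
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    return_full_text=True,   # LangChain expects the full prompt + completion
    # stopping_criteria=stopping_criteria,
    temperature=0.1,
    max_new_tokens=512,
    repetition_penalty=1.1,
)

# Quick sanity check on the Hugging Face side (any test prompt works here).
res = generate_text("Explain the difference between nuclear fission and fusion.")
print(res[0]['generated_text'])

# Wrap the pipeline so the rest of LangChain can treat it as an LLM.
llm = HuggingFacePipeline(pipeline=generate_text)
print(llm("Explain the difference between nuclear fission and fusion."))
```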
The conversational buffer window memory is going to remember the previous five interactions, and we're also going to load the llm-math tool, which is a calculator. So we initialize both of those, and then here we have an output parser. We don't need it for this model; you can have it in there as a precaution if you like, but for the most part I've found that it isn't needed with good prompting. Essentially, what I would usually do with this output parser is: if the agent returns some text without the correct format, that is, without the JSON format I mentioned earlier, I would assume it's trying to respond directly to the user, and all this output parser does is reformat that into the correct JSON-like response. But as I said, we can ignore it; we don't need it, at least for the tools we're using here. In a more complex scenario it might come in more useful. If you did want to use it you would just uncomment that and run it, but as mentioned, let's skip it and see how the agent performs without it; again, it's just a precaution.

So we initialize the agent here. We're using the chat-conversational-react-description agent, and these are fairly standard agent initialization parameters. What I want to show you here is the prompt it initially uses. This prompt doesn't work very well: the initial system prompt is super long and not that useful, and then the user prompt template, which again is super long, doesn't work that well either, so I've modified both.

One thing that is slightly different, or specific to Llama 2, is the use of these special tokens: one token indicates the start of some instructions, one indicates the end of instructions, one indicates the start of the system message (that initial message that tells the chatbot or LLM how to behave), and one indicates the end of the system message. So we initialize our system message and include those system-message tokens in there, and then we say "Assistant is an expert JSON builder designed to assist with a wide range of tasks". The intention here is to really drill in the point that Assistant needs to respond with JSON. We also mention that it needs to respond with the action and action_input parameters, and we can see an example of that in here. In this example I'm saying: this is how to use a calculator; you need to say "action": "Calculator", plus what you would like to use the calculator for as the action input. Then we have some further examples: responding directly to the user, using this JSON format; using the calculator again, using the JSON format; and we just keep giving a few of those examples. At the end of the system message we put that end-of-system-message token, and then we can run that.

Then we come down to here, and this is another thing they found in the paper: Llama 2, over multiple interactions, seems to forget those initial instructions. So all I'm doing here is saying we have some instructions, adding those instruction tags, and giving a little reminder to Llama 2 to respond to the following in JSON with action and action_input values, and we append that to every user query, which we can see here. Then we've just modified the human message prompt template, and what we end up with is this, which you can see down here; we're going to have that with every human message.
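A sketch of the memory, tools, agent initialization, and prompt modifications described above. The system and instruction strings are abbreviated stand-ins for the longer prompts in the notebook, and reaching into `agent.agent.llm_chain.prompt` (including the `messages[2]` index for the human message template) reflects the pre-1.0 LangChain internals used at the time:

```python
from langchain.memory import ConversationBufferWindowMemory
from langchain.agents import load_tools, initialize_agent

# Conversational memory over the last 5 interactions, plus the calculator tool.
memory = ConversationBufferWindowMemory(
    memory_key='chat_history', k=5, return_messages=True
)
tools = load_tools(['llm-math'], llm=llm)

agent = initialize_agent(
    agent='chat-conversational-react-description',
    tools=tools,
    llm=llm,
    verbose=True,
    early_stopping_method='generate',
    memory=memory,
)

# Llama 2 special tokens for instructions and the system message.
B_INST, E_INST = '[INST]', '[/INST]'
B_SYS, E_SYS = '<<SYS>>\n', '\n<</SYS>>\n\n'

# Abbreviated system message drilling in the JSON action / action_input format.
sys_msg = B_SYS + (
    'Assistant is an expert JSON builder designed to assist with a wide range '
    'of tasks. Assistant always responds with a JSON object containing '
    '"action" and "action_input" values, e.g. using the "Calculator" tool or '
    'the "Final Answer" action.'
) + E_SYS

new_prompt = agent.agent.create_prompt(system_message=sys_msg, tools=tools)
agent.agent.llm_chain.prompt = new_prompt

# Reminder appended to every user query, since Llama 2 tends to forget the
# initial instructions over multiple interactions.
instruction = (
    B_INST
    + ' Respond to the following in JSON with "action" and "action_input" values '
    + E_INST
)
agent.agent.llm_chain.prompt.messages[2].prompt.template = (
    instruction + '\nUser: {input}'
)
```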
Now we can actually begin asking questions. I just ran this one: "hey, how are you today?", and we see that we get this output, a final answer of "I'm good thanks, how are you?", which is pretty good. Let's try "what is 4 to the power of 2.1?", and we see that it's correctly using the calculator; it has the action input, which is 4 to the power of 2.1 in Python. This interaction takes a little bit longer because there are multiple LLM calls happening here: the first LLM call produces the "okay, I need to use a calculator" plus the input to that calculator; that is sent back to LangChain, which actually executes it in a Python interpreter, and we get the answer from that. That answer is sent back to the assistant, and based on it the assistant knows it can give the answer back to us, so the action is "Final Answer": it looks like the answer is this. So that is the output we get there.

Now let's use our conversational history and ask it to multiply that previous number by three. We can see from the first step that the calculator is being used correctly: we have 18.379 multiplied by three. Again it's going to take a moment, because it then needs to actually get the answer and generate a new LLM response based on it. Then we get our answer, roughly 55.1, and that's what we get, so it looks pretty good.

Now, as you saw, these answers where it's going through multiple steps take about a minute each. A lot of that time seems to be spent spinning up a Python interpreter, so it's not entirely on the LLM in this case, but it does take a little bit of time. Naturally, that is probably one of the biggest issues with using Llama 2 at the moment: it takes a lot of GPU memory to run, which comes with high costs, and especially if you are running on a single GPU like we are, with quantization, which slows the whole thing down, things are going to take a little bit of time. But nonetheless I think this looks really cool. What we've done here is a very simple agent, just using a calculator; we're not stress testing it, and honestly, if we want to start using other tools, I think we might run into some issues that require a bit more tweaking and prompt engineering than what I have done here. But I'm optimistic that we can use this with other tools, and when you consider that even GPT-3.5 is not that good at just producing the JSON response when you use it as a conversational agent (it can do it, but not always reliably, and it's not perfect), the fact that Llama 2, an open source model that we're fitting on a single GPU, is at least somewhat comparable to one of the best large language models in the world is, I think, pretty incredible. So I'm very excited to see where this goes. Naturally, Llama 2 has only been around for a few days as of me recording this, and we're probably going to see a lot of new models built by the community on top of Llama 2 appear within the next few days, and especially in the coming weeks and months, so it'll be very exciting to see where that goes. For now I'm going to leave it there for this video. I hope this has all been useful and interesting, so thank you very much for watching, and I will see you again in the next one. Bye!
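For completeness, the calls made in this final section look roughly like the following, with the outputs paraphrased from the walkthrough rather than reproduced exactly:

```python
agent("hey how are you today?")
# -> {"action": "Final Answer", "action_input": "I'm good thanks, how are you?"}

agent("what is 4 to the power of 2.1?")
# -> first LLM call chooses the Calculator tool with input "4 ** 2.1",
#    LangChain evaluates it (about 18.379), then a Final Answer is returned.

agent("can you multiply that by 3?")
# -> conversational memory recalls 18.379; the calculator returns roughly 55.1.
```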
Info
Channel: James Briggs
Views: 120,944
Keywords: python, machine learning, artificial intelligence, natural language processing, nlp, Huggingface, chatgpt, langchain chatbot, langchain agent, langchain chatbot tutorial, open source chatbot, open source chatbot alternatives, ai, james briggs, openai, hugging face, hugging face tutorial, open source llm, llama 2, llama 2 meta, llama 2 huggingface, llama 2 70b, llama 2 python, meta ai, llama 2 langchain, llama 2 chatbot, agent llama 2, llama 2 70b chat, llama-2-70b-chat-hf, llama
Id: 6iHVJyX2e50
Length: 26min 51sec (1611 seconds)
Published: Sat Jul 22 2023