Okay. In this video, I'm going to look at using LLaMA-2 with LangChain; specifically, I'm just going to use the small model here. I'll do a number of videos going through more advanced stuff, but what I'm trying to do here is show you the basics of getting something going and also how you can run it locally. In the future, we will look at running the 70-billion model in the cloud, where you can use it like an API. But in this one, I want to basically just load the whole thing in a notebook, run it with pretty good response times, and use it that way.
So you'll notice that, just to set up, we're bringing in the standard Transformers stuff and bringing in LangChain here. Because the LLaMA model requires you to get permission, as I've talked about in previous videos, you will need to put in your Hugging Face token. When you see this pop up, you can basically just click it, and it will take you to Hugging Face, where your token is. You can either create a new token or bring an existing token across; you just need a read token for this.
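To make that step concrete, here is a minimal sketch of the login, assuming you're in a notebook and have already created a read token on the Hugging Face site (the use of notebook_login is my assumption about how this is typically done, not something shown line-by-line in the video):

```python
# Minimal sketch of the Hugging Face login step.
# Assumes a notebook environment and a read token with access
# to the gated LLaMA-2 repo.
from huggingface_hub import notebook_login

notebook_login()  # opens a widget; paste your read token into it
```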
Once you've got that in, you can then download the model. The 7-billion model is not that big, so you'll find that you can probably load it. You can see that when I'm running this through, it's using under 15 GB of memory, so you could probably load it on a T4 GPU as well.
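As a rough sketch, loading the 7B chat model in half precision looks something like this (the exact model id and dtype here are assumptions on my part rather than details spelled out in the video):

```python
# Rough sketch: load the LLaMA-2 7B chat model and tokenizer.
# The model id and float16 dtype are assumptions; adjust as needed.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 keeps memory use under ~15 GB
    device_map="auto",          # place the weights on the available GPU
)
```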
Next, we need to set up some of the things that we did before. Remember, in the previous video I talked about the different sorts of prompts; this is setting that up. I'm using the same sort of system prompt, just altered a little bit, because we're going to be using it in LangChain. One of the challenges we have with LangChain for this kind of thing is that this model is really a chat model: Meta's actual API for serving it runs it as a chat model, just like you would use GPT-4, GPT-3.5 Turbo, or Claude. But when we're using it here, we're using it just as a completion model, so we need to go through and make this customization.
Here you can see what I've done: I've got this get_prompt function, which we can pass an instruction into. If we just pass in the instruction, we get back the default system template; you can see from here through to here is the default system template, with the instruction after it, all wrapped in the instruction tags. If we pass our own system template in here, then we get our new system template followed by the instruction, in the same format that LLaMA-2 wants to see. This is key for playing around with the prompts and trying different things out.
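To show roughly what that helper does, here's a sketch of a get_prompt function that wraps the system prompt and instruction in the tags the LLaMA-2 chat models expect (the default system prompt below is abbreviated, so treat this as an approximation of the one in the notebook):

```python
# Sketch of the get_prompt helper: wrap a system prompt and an instruction
# in LLaMA-2's chat format. The default system prompt is shortened from
# Meta's original wording.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

DEFAULT_SYSTEM_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible."
)

def get_prompt(instruction, new_system_prompt=DEFAULT_SYSTEM_PROMPT):
    # The system prompt goes inside <<SYS>> tags, and the whole thing is
    # wrapped as a single [INST] ... [/INST] turn.
    system_prompt = B_SYS + new_system_prompt + E_SYS
    return B_INST + " " + system_prompt + instruction + " " + E_INST
```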
Now, I will preface this by saying that while this small 7-billion model is, I think, a very good model, there are certain things it's not great at. It's not great at the logic stuff; you want a bigger model for that kind of thing, of course. It's also not great at returning things as JSON or in a structured output format. My guess is we will see some fine-tunes coming that improve that over time. But for now, what it is good at is that we can use it just like a normal language model to do a variety of different tasks: summarization, question answering, all this kind of thing.
So the key to this, though, is that you really want to play around with both the system template and the instruction in here. Don't be afraid to go and change the system templates that I've put in. I've put in some that I've played around with a bit, but I'm not going to say these are the perfect ones; you could probably get a lot better results by doing this.
Now, we set up the model quickly up here as a pipeline, a Transformers pipeline. This is where you would make changes if you want to make the generated content longer, or anything else you want to change. Then, coming down to use this in LangChain, we're just using the HuggingFacePipeline wrapper, where we bring in that pipeline; you can see I'm setting the temperature to zero.
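Roughly, the pipeline and the LangChain wrapper get wired up like this (max_new_tokens is an assumed value here; this is the spot to change if you want longer outputs):

```python
# Sketch: a Transformers text-generation pipeline wrapped for LangChain.
# max_new_tokens is an assumed value; increase it for longer outputs.
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
)

llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0})
```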
So once you've got your LLM set up with the HuggingFacePipeline, you then want to make an LLMChain, which is going to require a prompt. But we've actually got multiple prompts, right? We've got the system prompt and the instruction prompt. This is where our get_prompt helper function is going to be used. You can see here I'm passing in the instruction and passing in the system prompt, and that's going to format it out like this.
So we've got 'You are an advanced assistant that excels at translation' in the system prompt part, and we've got 'Convert the following text from English to French' in the instruction part. Then we've got this text variable that we're still going to pass in; that's why, when we define our prompt template, the input variable is going to be text, matching what we've got in there. Then we're just making our LLMChain, passing in the LLM and passing in the prompt, and then we can run it.
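Putting that together, the translation chain looks roughly like this (the prompt strings are paraphrased from the video, and the variable names are just illustrative):

```python
# Sketch of the translation chain: build the LLaMA-2 prompt with get_prompt,
# wrap it in a PromptTemplate with a {text} input, and run it through an LLMChain.
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

system_prompt = "You are an advanced assistant that excels at translation."
instruction = "Convert the following text from English to French:\n\n{text}"

prompt = PromptTemplate(
    template=get_prompt(instruction, system_prompt),
    input_variables=["text"],
)
llm_chain = LLMChain(prompt=prompt, llm=llm)

print(llm_chain.run("how are you today?"))
```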
And you can see here, we can ask it: the text is 'How are you today?', and that's going to be translated from English to French. You can see the output it gives us here, and if we look at Google Translate, we can see it seems to be translating quite well from English to French: the French translates back to the English we wanted. So even though this model is not built for translation, it's had enough data that it can actually do that task as it goes through.
So let's look at another task we might want to do: summarization. Here again, I've got my instruction. My system template is going to be 'You are an expert at summarization, expressing key ideas succinctly', and the instruction is 'Summarize the following article for me', with the text passed in after that. You can see that this is going to put it into the right format, and we've still got the text input for whatever we're going to put in.
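Only the two prompt strings really change for this task; a quick sketch of the summarization chain (again with the wording paraphrased) would be:

```python
# Sketch: summarization reuses the same get_prompt / PromptTemplate / LLMChain
# pattern; only the system prompt and instruction strings change.
system_prompt = "You are an expert at summarization. Express the key ideas succinctly."
instruction = "Summarize the following article for me:\n\n{text}"

prompt = PromptTemplate(
    template=get_prompt(instruction, system_prompt),
    input_variables=["text"],
)
summarize_chain = LLMChain(prompt=prompt, llm=llm)

# summary = summarize_chain.run(article_text)  # article_text holds the article to summarize
```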
So here is an article from TechCrunch, all about some of the changes at Twitter over the past few days. If we count the words, just splitting on spaces, it's 940 words. If we run that text through, we get back 'here's a summary of the article in 400 words or less'; it's actually a lot less than 400 words, and it gives us a decent summary for this.
Now, if you wanted bullet points, you would just play with this instruction to say 'summarize the following in key bullet points', et cetera. So again, this is making use of merging the two parts, the instruction prompt and the system prompt, to create the template, and then passing that into the prompt template with the input variables we're going to use. You could do a variety of different tasks with this: anything where you want to transform some kind of text from one thing into another, you would use this kind of setup for.
If we wanted to do a simple chatbot, we can certainly do that here, and this is going to be just a simple chatbot with memory. We're not using any tools here; in the future, I'll look at tool use and ways you can do that with the LLaMA-2 model as well. One of the key things here is that we're going to have a system prompt. I'm going to override the system prompt to say: 'You are a helpful assistant. You always only answer for the assistant.' This is key, because if you don't have something like that, you'll often find it will just generate lots of answers for both sides of the conversation. The system prompt also tells it to read the chat history to get the context of what's going on.
So here you can see I'm passing in the instruction, the chat history (which is one of the things we'll be passing in), and then the user input, and we're wrapping the whole thing in one instruction. Now, this is a little bit different from how Meta does it, where they wrap each interaction as a separate instruction. I found that actually wasn't necessary if you put the prompt together like this. Playing around with this prompt, I found you really need to tell it where the chat history is; it won't just infer that like perhaps a bigger model would. By making it really clear that below here is the chat history and then the user input goes here, it will then be able to operate on the history and use it like a memory.
So we've got our prompt template set up; this time we pass in both the chat history and the user input. We've got our ConversationBufferMemory, which is the chat history that we're going to pass in, and then we've got our LLMChain. You can see here that we're passing in both the LLM and the prompt, but also the memory.
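As a sketch, the chatbot chain with memory looks something like this (the system prompt wording is paraphrased from the video, and the memory_key has to match the {chat_history} variable in the template):

```python
# Sketch of the chat setup: one instruction containing the chat history and the
# user input, plus a ConversationBufferMemory wired into the LLMChain.
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory

system_prompt = (
    "You are a helpful assistant. You always only answer for the assistant, "
    "then you stop. Read the chat history to get context."
)
instruction = "Chat History:\n\n{chat_history}\n\nUser: {user_input}"

prompt = PromptTemplate(
    template=get_prompt(instruction, system_prompt),
    input_variables=["chat_history", "user_input"],
)
memory = ConversationBufferMemory(memory_key="chat_history")

chat_chain = LLMChain(llm=llm, prompt=prompt, memory=memory, verbose=True)

chat_chain.predict(user_input="Hi, my name is Sam")
```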
Okay, let's look at the conversation. If I start out just saying 'Hi, my name is Sam', it's passing the full prompt through, and there's no chat history at the start. So it says, 'Hello Sam, it's nice to meet you. How can I assist you today?' Then I ask it, 'Can you tell me about yourself?', and it comes back with 'Of course'. Notice that now it's got the chat history from before in there, so it answers: 'Of course, I'm just an AI designed to assist and provide helpful responses. I'm here to help with any questions or tasks you may have. How can I assist you today?'
Now, to show off the memory, I wanted to play around with a few things; for testing the memory, these are the kinds of things you want to try out. So here I'm saying, 'Today is Friday. What number day of the week is that?' It goes through and gives me an answer: 'Ah, great question. Friday is the fifth day of the week.' Now, I think in different calendars people count the days differently, but that's not really what I'm interested in. What I'm more interested in is the next question, when I ask, 'What is the day today?' Without that chat history, you'll find it will just make up a day, just generate something random. But here it's got the chat history, so it can see that the human said today is Friday, and it knows the answer: 'Today is Friday.' Now, this 'AI:' label here, I could actually have put that in the prompt too, so that it doesn't fill that bit out itself; it just gives us this as the response comes back. Another thing I wanted to try was, 'What is my name?' Remember, way back at the start I gave it my name, and sure enough, it's able to say, 'Your name is Sam.'
You will find differences with the different size models, by the way. This is an example from the 13-billion model, and in that one it says, 'Sure thing, Sam, as a helpful assistant, I can tell you that your name is Sam.' There's a bit more sassiness with the bigger models too. But back in the 7B, we can see that it's gotten it. If I now ask it a completely different question, 'Can you tell me about the Olympics?', it goes on to give me a bunch of information about the Olympics.
Then, as a final question, I ask it, 'What have we talked about in this chat?', and you can see that it's able to do a summary: 'Of course, here's what we've discussed in the chat.' I'm not actually printing these out with newlines here, but if we were, you would see: the assistant introduces themselves, the user asks them to tell them about themselves, and so on. It's got the conversation of what we've gone through and talked about in there, so it shows that the memory is working. This is a good sign: even the small model works with the memory, and it allows us to do that kind of thing.
If we want to incorporate tools, we'll look at that in a future video. Anyway, this gives you the quick basics of using LangChain for a variety of different tasks with LLaMA-2. You'll be able to do the same thing with a 4-bit version of the model if you're running this locally and want to run it as a 4-bit model; perhaps we'll look at that in a future video. And if you're actually pinging an API where the model is served in the cloud, you'd be able to do that as well. Anyway, as always, if you've got questions, please put them in the comments below. If you're interested in these kinds of videos, please click and subscribe, and I will talk to you in the next video. Bye for now.