Okay. So recently I was at the
LangChain offices and I saw this very cute sticker of a llama. And I asked Harrison what it actually was. And he told me it was this thing called Ollama and that I should check it out. So naturally I checked it out, and to be honest, I'm not a big one for running large language models locally. I've been planning to do a whole series of videos around the 4-bit models and, you know, how they work and stuff like that. The simple fact is that for my work, I tend to use models in the cloud much more than I use them locally, even if they are open-source models or fine-tunes of open-source models, like LLaMA-2, like Mistral, et cetera. But I was pretty fascinated by Ollama. And one of the cool things about
it was this ability to very easily install a local model. And I realized that this is a huge
win for people who are not technical, who are perhaps not proficient with Git, to pull down models, check out different models, that kind of thing. So I thought, let's make a video on this and have a look at it. I've been using it for about 10 days or so, and I think it's a very cool tool. So what actually is Ollama? Basically, this is a user-friendly
way to run large language models on your computer. At the moment it supports macOS and Linux, but apparently Windows support is coming soon as well. So that will certainly open it up to a lot of other people. And the cool thing is, it's not just
the LLaMA models that are supported. If we have a look at the Ollama website, we can actually see that they're supporting a bunch of different models. Apart from LLaMA-2, they've got uncensored LLaMA, CodeLLaMA, and Falcon, and Mistral is one of the key ones that they've added in just this past week. They've also got some of the open-source fine-tunes of these things, like the Vicuna model, WizardCoder, Wizard uncensored, et cetera. So there are quite a few models that you can actually try out yourself and run locally.
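Just to give a flavour of that before we dive in, grabbing any of those library models is a one-line command. These tags are only examples of the kind of names listed on the site, so check the model pages for the exact ones:

    ollama run llama2      # LLaMA-2 chat
    ollama run mistral     # Mistral 7B
    ollama run codellama   # CodeLLaMA

We'll walk through what the run command actually does in a minute.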
And one of the cool things that I'll show later on, perhaps in a future video, is that you can actually run LangChain locally against all these models as well, to be able to test out ideas while you're doing this sort of thing. So to get started, you basically
just come to their website and click download. It's all pretty simple in the way that they've documented it, et cetera. You've basically got macOS and Linux, and like I mentioned earlier, Windows support is supposedly coming soon, although I'm not sure how far off it is. Once you've downloaded it, you
basically install it on your machine. And then it's actually going to make an API where it's serving the model, and you can then use that model as you go through.
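Just to flag what that means in practice: Ollama serves the model behind a little local HTTP API, so anything on your machine can talk to it. A minimal sketch, assuming the default port it serves on (11434) and that the llama2 model has already been pulled, would be something like:

    curl http://localhost:11434/api/generate \
      -d '{"model": "llama2", "prompt": "Why is the sky blue?"}'

The response comes back as a stream of JSON chunks, which is also what integrations like LangChain's Ollama wrapper talk to.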
So once you've downloaded Ollama, copied it to the right folder, and run it, it will basically show a sort of welcome screen and setup. And one of the key things you'll need to do is install the command-line tool for this. So this whole tool is going
to run in the command line. So if you're not used to using a command
line on a Mac or a Linux machine: on a Mac, you have an app called Terminal, which you can run, and that will run all your commands in there. So let's have a look at that once we actually get there. All right. So once you open up your terminal,
you'll be able to come in here and run the ollama command, and you'll see that this will basically show us what we can do with it. For any of these, if we want to find out a little bit about them, we can just come in here; for example, one of the main ones that we're going to use to run a model is going to be the run command. So, what I'm going to do is
just get the help for that, and we can see that. Okay, this will basically be in the format of ollama run, then a model name, and then any prompt or flags that we want to pass.
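So, roughly, the shape of the commands we'll be using is this (the prompt is just a stand-in I made up):

    ollama --help                              # list all the subcommands
    ollama run --help                          # help for the run command
    ollama run llama2 "Why is the sky blue?"   # run a model with a one-off prompt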
Now, one of the things we want to do is actually download some models. So let's see, do I have
any models in here first? So if I just run ollama list, I can see I don't have any models in here. So I'm going to come in first
and I'm going to start off by just running the LLaMA-2 model. So you'll see that when I run this and I don't have the model installed, it's actually going to pull down a manifest file, and then it's going to start downloading the actual model itself here. And you can see that it's going to take a bit of time, because we're downloading a 3.8-gigabyte model here.
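If you're following along, that first run looks roughly like this; the exact sizes and progress output will obviously differ on your machine:

    ollama list        # nothing installed yet, so the table is empty
    ollama run llama2  # pulls the manifest and the model layers, then drops into a prompt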
So let's just let this run and download, and then we'll have a look at what happens when we come back. Okay, as you can see, now this is fully downloaded. You can see we've now got a prompt where we can type, and it's completing very quickly. So this is using the LLaMA-2 chat model that we've got here.
We can also go in and, if we want, get out some actual stats for this. So you can see we can come in here and basically type a slash and a question mark, and we can see what commands we can run in here. We can list out the models just like before. If we want to come in here and see what speed we're getting in tokens per second, et cetera, then we can just set verbose. And now I can ask it again, and you can see I'm getting the actual number of tokens per second that I'm getting out of the model as we go through this.
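For reference, the slash commands I'm using inside the interactive prompt are along these lines (/? lists the full set, and I believe /bye is how you leave cleanly):

    /?              # show the available in-prompt commands
    /set verbose    # print timing and tokens-per-second stats after each response
    /bye            # exit the prompt when you're done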
So if I ask it one of the common prompts that I use for checking out these models, 'Write me an email to Sam Altman explaining why you should open source GPT-4', you can see that we're getting good, coherent text, just like we would expect from the LLaMA-2 chat model here. Now the challenge with this
obviously is that this LLaMA-2 chat model is censored. So you can just come in here to one of the other models, like this one we pre-selected, scroll down, and you'll see the command for actually running it. What this will do is run it, but if it's not there, it will actually pull it first. So the pull command is to pull the model down, and the run command is actually to run it.
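On the command line that's just the two commands below; I'm using llama2-uncensored here as the tag, but whatever the model page shows is the one to copy:

    ollama pull llama2-uncensored   # download the model without running it
    ollama run llama2-uncensored    # run it (this will also pull it if it isn't there yet)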
The model page will also basically show you how much memory in RAM you need for running these models. And if you scroll down further, you can actually see the actual model that it's downloading from. So we can see here that this is the uncensored chat model, and it's in the GGML format, which is one of the 4-bit quantized formats for running these models. Okay. So now that we can see how that
works, if we want to get out of this, we can just come out like this. Now, let's say we want to make a custom prompt here. So here I'm going to make a custom prompt, and I'm going to call it Hogwarts. And so what I'm doing is just
making a model file in here. So you'll see that this model file is going to have FROM llama2, and I can set some of the hyperparameters here with things like temperature. But the main thing that I'm setting here is the system prompt. So I'm going to say: you're Professor Dumbledore, answer as Dumbledore the assistant, and only give guidance about Hogwarts and wizardry. So I'm just going to save that to a file called Hogwarts.
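So the model file ends up looking roughly like this. The temperature value is just an example, and the system prompt is simply what I typed, so tweak both as you like:

    FROM llama2
    PARAMETER temperature 1
    SYSTEM """
    You are Professor Dumbledore. Answer as Dumbledore, the assistant,
    and only give guidance about Hogwarts and wizardry.
    """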
So now what I'm going to do is create the model. What we were doing before, as you could see, was coming from Llama 2, so we already have Llama 2 installed in this case, and we're going to create a model called Hogwarts. What it's going to do is basically go and look at that model file we just made, parse through it, look at the various things like temperature and the hyperparameter settings we had in there, and then it's going to build this into a manifest, and now we've got this as a model.
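Concretely, assuming the model file we just wrote is saved as ./Hogwarts, the two commands are:

    ollama create hogwarts -f ./Hogwarts   # build the new model from the model file
    ollama run hogwarts                    # chat with it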
So let's try this out. So if we run Ollama and say run Hogwarts, now I've got it. If I ask it, who are you? So obviously before it wasn't giving
us great answers, but here it says, ah, a new student, how delightful. I am Professor Albus Dumbledore,
headmaster of Hogwarts School of Witchcraft and Wizardry. And you see, now we've basically
got a custom prompt in there for the system prompt that we
can use straight out of this. What if I ask it, where
can I buy a real wand? All right. So you can see that it's definitely in character, and this is running fully locally now. So this is using a custom prompt
that we've put together ourselves. And of course, we can
just come out of this. If I now go and look at the ollama list, you'll see that I've got a few in here. I've got my Hogwarts, I've got Llama,
I've got a Mario model in there as well. So what if I want to remove a model? I can just come in here and say,
okay, let's remove the model, Mario. And in this case, it's left
the Llama 2 weights in there. So let's just check that. We can see that we've still got the Llama 2 weights because we've still got two models that are referencing them. If we actually deleted all of those models, it would delete the underlying weights as well. All right.
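For reference, that cleanup is just:

    ollama rm mario   # remove the Mario model
    ollama list       # check what's still installed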
So I think it's probably a good spot to leave it here for the first video. I'll certainly make a video of, you know, doing some other stuff with Ollama, perhaps LangChain, perhaps how to load a custom model from Hugging Face, et cetera, as we go through. But hopefully this gives you
an introduction to Ollama. I understand that if you're on Windows,
this is not going to be super useful yet. And I'm not sure how long until they
actually have the Windows version out. Hopefully it's coming very soon. This is a very easy way to run models
locally and be able to access them. As always, if you've got any questions,
put them in the comments below. If you found this video useful and
you'd like to see some more like this, please click like and subscribe. I will talk to you in the next video. Bye for now.