Okay. So recently I was at the
LangChain offices and I saw this very cute sticker of a llama. And I asked Harrison what it actually was. And he told me it was this thing called Ollama and that I should check it out. So naturally I checked it out, and to be honest, I'm not a big one for running large language models locally. I've been planning to do a whole series of videos around the 4-bit models and, you know, how they work and stuff like that. The simple fact is that for my work, I tend to use models in the cloud much more than I use them locally, even if they are open-source models or fine-tunes of open-source models, like LLaMA-2, like Mistral, et cetera. But I was pretty fascinated by Ollama. And one of the cool things about
it was this ability to very easily install a local model. And I realized that this is a huge
win for people who are not technical, who are perhaps not proficient with Git, to pull down models, check out different models, that kind of thing. So I thought, let's make a video on this and have a look at it. I've been using it for about 10 days or so, and I think it's a very cool tool. So what actually is Ollama? Basically, this is a user-friendly
way to run large language models on your computer. At the moment it supports macOS and Linux, but apparently Windows support is coming soon as well. So that will certainly open it up to a lot of other people. And the cool thing is, it's not just
the LLaMA models that are supported. If we have a look at the Ollama website, we can actually see that they're supporting a bunch of different models. Apart from LLaMA-2, they've got uncensored LLaMA, CodeLLaMA, and Falcon, and Mistral is one of the key ones that they've added in just this past week. They've also got some of the open-source fine-tunes of these things, like the Vicuna model, WizardCoder, Wizard uncensored, et cetera. So there are quite a few models that you can actually try out yourself and run locally.
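Just to give a flavour of that before we dive in, grabbing any of those library models is a one-line command. These tags are only examples of the kind of names listed on the site, so check the model pages for the exact ones:

    ollama run llama2      # LLaMA-2 chat
    ollama run mistral     # Mistral 7B
    ollama run codellama   # CodeLLaMA

We'll walk through what the run command actually does in a minute.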
And one of the cool things that I'll show later on, perhaps in a future video, is that you can actually run LangChain locally against all these models as well, to be able to test out ideas while you're doing this sort of thing. So to get started, you basically
just come to their website and click download. It's all pretty simple in the way that they've documented it, et cetera. You've basically got macOS and Linux, and like I mentioned earlier, Windows support is supposedly coming soon, although I'm not sure how far off it is. Once you've downloaded it, you
basically install it on your machine. And then it's actually going to make an API where it's serving the model, and you can then use that model as you go through.
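Just to flag what that means in practice: Ollama serves the model behind a little local HTTP API, so anything on your machine can talk to it. A minimal sketch, assuming the default port it serves on (11434) and that the llama2 model has already been pulled, would be something like:

    curl http://localhost:11434/api/generate \
      -d '{"model": "llama2", "prompt": "Why is the sky blue?"}'

The response comes back as a stream of JSON chunks, which is also what integrations like LangChain's Ollama wrapper talk to.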
So once you've downloaded Ollama, copied it to the right folder, and run it, it will basically show a sort of welcome screen and setup. And one of the key things you'll need to do is install the command-line tool for this. So this whole tool is going
to run in the command line. So if you're not used to using a command
line on a Mac or a Linux machine: on a Mac, you have an app called Terminal, which you can run, and that will run all your commands in there. So let's have a look at that once we actually get there. All right. So once you open up your terminal,
you'll be able to come in here and run the ollama command, and you'll see that this will basically show us what we can do with it. For any of these, if we want to find out a little bit about them, we can just come in here; for example, one of the main ones that we're going to use to run a model is going to be the run command. So, what I'm going to do is
just get the help for that, and we can see that. Okay, this will basically be in the format of ollama run, then a model name, and then any prompt or flags that we want to pass.
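So, roughly, the shape of the commands we'll be using is this (the prompt is just a stand-in I made up):

    ollama --help                              # list all the subcommands
    ollama run --help                          # help for the run command
    ollama run llama2 "Why is the sky blue?"   # run a model with a one-off prompt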
Now, one of the things we want to do is actually download some models. So let's see, do I have
any models in here first? So if I just run ollama list, I can see I don't have any models in here. So I'm going to come in first
and I'm going to start off by just running the LLaMA-2 model. So you'll see that when I run this and I don't have the model installed, it's actually going to pull down a manifest file, and then it's going to start downloading the actual model itself here. And you can see that it's going to take a bit of time, because we're downloading a 3.8-gigabyte model here.
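If you're following along, that first run looks roughly like this; the exact sizes and progress output will obviously differ on your machine:

    ollama list        # nothing installed yet, so the table is empty
    ollama run llama2  # pulls the manifest and the model layers, then drops into a prompt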
So let's just let this run and download, and then we'll have a look at what happens when we come back. Okay, as you can see, now this is fully downloaded. You can see we've now got a prompt where we can type, and it's completing very quickly. So this is using the LLaMA-2 chat model that we've got here.
We can also go in and, if we want, get out some actual stats for this. So you can see we can come in here and basically type a slash and a question mark, and we can see what commands we can run in here. We can list out the models just like before. If we want to come in here and see what speed we're getting in tokens per second, et cetera, then we can just set verbose. And now I can ask it again, and you can see I'm getting the actual number of tokens per second that I'm getting out of the model as we go through this.
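For reference, the slash commands I'm using inside the interactive prompt are along these lines (/? lists the full set, and I believe /bye is how you leave cleanly):

    /?              # show the available in-prompt commands
    /set verbose    # print timing and tokens-per-second stats after each response
    /bye            # exit the prompt when you're done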
So if I ask it one of the common prompts that I use for checking out these models, 'Write me an email to Sam Altman explaining why you should open source GPT-4', you can see that we're getting good, coherent text, just like we would expect from the LLaMA-2 chat model here. Now the challenge with this
obviously is that this LLaMA-2 chat model is censored. So you can just come in here to one of the other models, like this one we pre-selected, scroll down, and you'll see the command for actually running it. What this will do is run it, but if it's not there, it will actually pull it first. So the pull command is to pull the model down, and the run command is actually to run it.
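On the command line that's just the two commands below; I'm using llama2-uncensored here as the tag, but whatever the model page shows is the one to copy:

    ollama pull llama2-uncensored   # download the model without running it
    ollama run llama2-uncensored    # run it (this will also pull it if it isn't there yet)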
The model page will also basically show you how much memory in RAM you need for running these models. And if you scroll down further, you can actually see the actual model that it's downloading from. So we can see here that this is the uncensored chat model, and it's in the GGML format, which is one of the 4-bit quantized formats for running these models. Okay. So now that we can see how that
works, if we want to get out of this, we can just come out like this. Now, let's say we want to make a custom prompt here. So here I'm going to make a custom prompt, and I'm going to call it Hogwarts. And so what I'm doing is just
making a model file in here. So you'll see that this model file is going to have FROM llama2, and I can set some of the hyperparameters here with things like temperature. But the main thing that I'm setting here is the system prompt. So I'm going to say: you're Professor Dumbledore, answer as Dumbledore the assistant, and only give guidance about Hogwarts and wizardry. So I'm just going to save that to a file called Hogwarts.
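So the model file ends up looking roughly like this. The temperature value is just an example, and the system prompt is simply what I typed, so tweak both as you like:

    FROM llama2
    PARAMETER temperature 1
    SYSTEM """
    You are Professor Dumbledore. Answer as Dumbledore, the assistant,
    and only give guidance about Hogwarts and wizardry.
    """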
So now what I'm going to do is create the model. What we were doing before, as you could see, was coming from Llama 2, so we already have Llama 2 installed in this case, and we're going to create a model called Hogwarts. What it's going to do is basically go and look at that model file we just made, parse through it, look at the various things like temperature and the hyperparameter settings we had in there, and then it's going to build this into a manifest, and now we've got this as a model.
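Concretely, assuming the model file we just wrote is saved as ./Hogwarts, the two commands are:

    ollama create hogwarts -f ./Hogwarts   # build the new model from the model file
    ollama run hogwarts                    # chat with it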
So let's try this out. So if we run Ollama and say run Hogwarts, now I've got it. If I ask it, who are you? So obviously before it wasn't giving
us great answers, but here it says, ah, a new student, how delightful. I am Professor Albus Dumbledore,
headmaster of Hogwarts School of Witchcraft and Wizardry. And you see, now we've basically
got a custom prompt in there for the system prompt that we
can use straight out of this. What if I ask it, where
can I buy a real wand? All right. So you can see that it's definitely in character, and this is running fully locally now. So this is using a custom prompt
that we've put together ourselves. And of course, we can
just come out of this. If I now go and look at the ollama list, you'll see that I've got a few in here. I've got my Hogwarts, I've got Llama,
I've got a Mario model in there as well. So what if I want to remove a model? I can just come in here and say,
okay, let's remove the model, Mario. And in this case, it's left
the Llama 2 weights in there. So let's just check that. We can see that we've still got the Llama 2 weights because we've still got two models that are referencing them. If we actually deleted all of those models, it would delete the underlying weights as well. All right.
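For reference, that cleanup is just:

    ollama rm mario   # remove the Mario model
    ollama list       # check what's still installed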
So I think it's probably a good spot to leave it here for the first video. I'll certainly make a video of, you know, doing some other stuff with Ollama, perhaps LangChain, perhaps how to load a custom model from Hugging Face, et cetera, as we go through. But hopefully this gives you
an introduction to Ollama. I understand that if you're on Windows,
this is not going to be super useful yet. And I'm not sure how long until they
actually have the Windows version out. Hopefully it's coming very soon. This is a very easy way to run models
locally and be able to access them. As always, if you've got any questions,
put them in the comments below. If you found this video useful and
you'd like to see some more like this, please click like and subscribe. I will talk to you in the next video. Bye for now.