This week we had quite a few announcements in
the AI world. First, Google kicked it off with Gemini 1.5. Then OpenAI released their Sora model, an AI that can create realistic scenes from text.
Then there was Meta with V-JEPA, and we had a little release from Ollama, which made a preview for Windows available. I think that's amazing; it can be used right now. Ollama has always been one of the easiest ways to get started with local large language models on the Mac. It packages
llama.cpp, which can use the CPU or GPUs under the hood, and provides a local OpenAI-compatible API that you can use from clients such as OpenWebUI. With Ollama on Windows, you can use NVIDIA GPUs, and you can decide how many layers of the model should run on the GPU and how many on the CPU. In addition, llama.cpp uses quantized models. This is why you can run quite large models on your PC with llama.cpp and Ollama.
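To give an idea of what that looks like, here is a minimal sketch (assuming the Ollama server is already running on its default port 11434 and the llama2 model has been pulled) that asks for a completion over the local REST API and controls how many layers are offloaded to the GPU via the num_gpu option:

    import requests

    # Ask the local Ollama server for a completion.
    # num_gpu is the number of model layers to place on the GPU; the rest run on the CPU.
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama2",
            "prompt": "Why is the sky blue?",
            "stream": False,
            "options": {"num_gpu": 20},  # 20 is just a guess for 8 GB of VRAM
        },
    )
    print(response.json()["response"])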
We will run this on my PC, which has a quite old GPU: an NVIDIA RTX 2070 Super with eight gigabytes of RAM. So let's get started. In this blog article, you can see how you can install it. Basically, you just download Ollama for Windows, double-click the installer, and run it. When you want to use the GPU, you need the NVIDIA drivers beforehand, but I already have them installed. I think Ollama ships the CUDA toolkit itself, so it is enough to run it on top of the NVIDIA drivers only. I think it's installed now, hopefully. There it is already. So I can now view logs and quit Ollama. Nothing really happens, actually, because it's an API.
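If you want to see that something is indeed listening, a tiny check works (a minimal sketch, assuming the default port 11434); the root endpoint of the local server simply answers that Ollama is running:

    import requests

    # The Ollama server listens on port 11434 by default.
    print(requests.get("http://localhost:11434/").text)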
You can check if you have Ollama installed by running "ollama", and you can see the commands available to you: basically serve, create, show, run, and so on.
I think it's already running the server, but we need to run a model first: "ollama run llama2". So let's try this model. To show the tokens per second, you start the run command with the verbose option: "ollama run llama2 --verbose". So, okay, 24 tokens per second. Not too bad. To end this command-line session, you just enter "/bye", and that's it.
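By the way, the same numbers are available programmatically: a non-streaming request to the generate endpoint returns eval_count and eval_duration fields, so you can compute the tokens per second yourself. A minimal sketch, again assuming llama2 has been pulled:

    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama2", "prompt": "Tell me a joke.", "stream": False},
    ).json()

    # eval_duration is reported in nanoseconds
    print(r["eval_count"] / (r["eval_duration"] / 1e9), "tokens per second")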
This is the easy way to get started using a large language model on your Windows PC, but there is a nicer way, which is using a web chat UI, like you maybe know from the OpenAI ChatGPT web client. There is a project called OpenWebUI, and this gives you a web UI on your PC, which can also use Ollama under the hood. To install it, you need Docker on your system, so either Docker Desktop or maybe Rancher Desktop, for example, and then you just run the command given on the readme page on your computer.
You can also use more than one model. When you run "ollama list", you can see which models are available right now. As we have seen so much visual stuff in the news this week, I want to have at least one visual model in my video here, so I run the llava model. That's it.
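Since llava is a multimodal model, you can also send it images over the same local API. A minimal sketch (assuming llava has been pulled and there is a photo.jpg next to the script; the API expects images as base64-encoded strings):

    import base64
    import requests

    # Base64-encode the image, as the Ollama API expects
    with open("photo.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llava",
            "prompt": "What is in this picture?",
            "images": [image_b64],
            "stream": False,
        },
    ).json()
    print(r["response"])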
Now I also want to run a model that's too large for my GPU. You can find the models on the Ollama website in the models area, and we can maybe use llama2, but not the 7B model, which is the default tag, but the 13B model. You can see that when we run "ollama run llama2", it will run the tag "latest", which is technically the same tag as the 7B model. When we want to run a larger model like this one, or maybe the 13B chat variant, we can run "ollama run llama2:13b-chat".
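For chat-tuned models like this one, the local API also offers a dedicated chat endpoint that takes a list of messages instead of a single prompt. A minimal sketch, assuming the 13b-chat tag has been pulled:

    import requests

    r = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama2:13b-chat",
            "messages": [{"role": "user", "content": "What is 6 times 7?"}],
            "stream": False,
        },
    ).json()
    print(r["message"]["content"])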
Let's start this one. So let's test this again. Okay, it works. Yeah, now we should be able to just run the OpenWebUI. Where is it? On localhost:3000, sign up. So that's it. Got it. And I can also type in 42. Model not selected. Ah, that's correct. Let's check the small one first, 42. Okay, that is enough. And then let's check another model. New chat, select model. Now I'll use the image model. That's normal. Yes, okay, it at least sees that it is a Llama. As
you can see, Ollama is a nice and easy way to get started using large language models
locally. It can use NVIDIA GPUs. Unfortunately, at the moment it cannot use AMD GPUs, but that
may change in the future. Of course, you can not only use it from this web chat UI, but also from Python or any other client. The prerequisite is that the client supports the OpenAI API, and then you can just use it.
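For example, here is a minimal sketch using the official openai Python package pointed at the local Ollama server (the api_key is required by the client but ignored by Ollama; the model must already be pulled):

    from openai import OpenAI

    # Point the OpenAI client at the local Ollama server instead of api.openai.com
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    response = client.chat.completions.create(
        model="llama2",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response.choices[0].message.content)

That's it. See you next time.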