Hi, in this video, I'll show you how you can
get a ChatGPT-like interface like this running locally on your machine for free. This is Vincent
Codes Finance, a channel about coding for finance research. If that's something that interests you,
consider subscribing so that you get notified of my future videos. In this video, I'll show you how
you can use Ollama and Open WebUI to create your own ChatGPT replacement that runs on your own machine. In my case, I've got a MacBook Pro with an M3 processor and 64 GB of RAM. That's more than enough; you don't need that much to run this, but the more RAM and the more powerful the GPU you have, the better. In order to install our ChatGPT replacement,
we'll first install Ollama and then Open WebUI. Ollama is a small program that runs in the
background and lets you manage and make available large language models that are open source, such
as Llama 2 from Meta or Mistral. In order to install Ollama, all you have to do is go to their
website and click download. If you're on Mac, you can also install it with Homebrew by using
`brew install ollama`. You can browse the models available on Ollama by clicking on Models on their website. Some are featured, and you can also sort by most popular. The most popular one is Llama 2; there's Mistral as well, and then a few variations of those. Most models, like Llama 2, have a few different
versions available. If we look under Tags, we'll see that the default variant of Llama 2 is the chat variant, which is optimized for chatting. This is what we want today, but they also have a variant optimized for raw text completion, as well as variants of different sizes: 7B, 13B, or 70B, for the number of parameters in the model. The more parameters, from 7 billion up to 70 billion, the more memory the model requires, but
also the more capable the model will be. If we scroll down, even the 7B chat model comes in a few variants: q4_0, q4_1, q5_0, and so forth. So, what are these? They are versions of the model with different quantization. A typical LLM stores its parameters as 32-bit floating-point values; quantized variants reduce the number of bits used for each parameter, which means you need less memory for the same number of parameters, but you lose some precision. There are trade-offs there. What I recommend is that you play a bit with them and see which one works best for you.
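As a rough sketch of why this matters: the memory needed for the weights alone is roughly the parameter count times the bits per parameter. These are back-of-the-envelope figures for intuition, not exact numbers:

```bash
# Weight memory estimate: parameters x bits per parameter / 8 bytes.
# Actual usage is somewhat higher (context, runtime overhead).
awk 'BEGIN {
  params = 7e9                                    # a 7B-parameter model
  printf "FP32 (32-bit): %4.1f GB\n", params * 32 / 8 / 1e9
  printf "q4_0  (4-bit): %4.1f GB\n", params *  4 / 8 / 1e9
}'
```

So a 4-bit quantized 7B model needs roughly 3.5 GB for its weights instead of about 28 GB at full precision, which is why these variants fit comfortably on a laptop.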
Besides the most popular models, there are also a few that might be interesting depending on your use case. For example, they have
uncensored models. You've got Llama 2 uncensored, which is a variation of the model fine-tuned to
remove the safeguards that Llama 2 has. This model will basically answer whatever you ask it. There
won't be a reply where the model tells you, "Well, I can't reply to this. This is too dangerous or
too bad. I won't do it." These are available if you sometimes need them for research purposes where a typical LLM would refuse. Ollama itself is a command-line application,
so you have to go to the terminal to interact with it directly. Depending on the way you've
installed Ollama, you might have to start the service manually or set it so that it
starts automatically. If, for example, you've installed it with Homebrew on Mac, it can be set up as a service so that it runs in the background all the time.
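For reference, here's a minimal sketch of both options on macOS, assuming the Homebrew install:

```bash
# Run Ollama as a background service managed by Homebrew:
brew services start ollama

# Or run the server manually in the foreground:
ollama serve
```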
If we go to the terminal, you can call Ollama by just typing `ollama`, and it will list the available commands. If you want to start the service, you do `ollama serve`. In my case, because I installed with Homebrew, the service is already running, so I get an error saying the port it's trying to listen on is already in use. You can also list the models you currently have installed with `ollama list`. If you've just installed Ollama, you won't have anything yet; these are the models I have on my computer.
If you want to install a model, for example Llama 2, you just do `ollama pull llama2`. In my case, it was really fast because I already had it installed; it just verified that I had the latest version, and since that was the case, it was all good. In your case, it might take a bit longer because it has to download the full model. The most powerful chat model they have at this time is called Mixtral. For that, you would do `ollama pull mixtral`. Again, it's fast for me because I already have it installed, but the download is about 30 GB for this one. If you just want to chat with a model, you can do it in the terminal with Ollama directly, for example with `ollama run llama2`.
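Putting the basic commands together (the download sizes are rough figures from my setup):

```bash
ollama pull llama2    # default Llama 2 chat variant, a few GB
ollama pull mixtral   # much larger model, roughly a 30 GB download
ollama list           # check which models are installed locally
ollama run llama2     # open an interactive chat in the terminal
```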
Running `ollama run llama2` starts the model and drops me into a prompt where I can send it a message. We can, for example, check whether it knows about Ollama: "What is Ollama?" Clearly, it didn't understand my question, or it doesn't know about Ollama. It's working, but
it's not necessarily the kind of interface we want to interact with on a day-to-day basis. I can just
quit this chat by typing `/bye`, which stops it. Okay, so now we've got Ollama working on our computer, but we don't want to interact with it like this. This is what we call the backend: the service that makes the large language models available on our machine.
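Under the hood, the backend exposes an HTTP API on port 11434, and that's what a frontend connects to. As a quick sketch, you can query it directly with curl:

```bash
# Ask the Ollama HTTP API (default port 11434) for a one-shot completion:
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What are Newey-West standard errors?",
  "stream": false
}'
```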
Now we want to install a frontend: the application that will serve as our user interface for interacting with these large language models. For that, we'll use Open WebUI. It
is an open-source ChatGPT replacement, and it offers a lot of the features that ChatGPT has. It lets you keep track of your chats and store modelfiles, prompts, and so forth. We'll see
what these are, but first, we'll have to install it. This is the somewhat tricky part of this
video because in order to install Open WebUI, you'll actually need Docker. If you don't
know what Docker is, it's container software. Containers are a bit like lightweight virtual machines that run on your computer, and Docker is the software that helps you manage and run them on your machine. It can be a bit confusing the first time, but it's actually
probably the safest way to run software like this because these containers are self-contained.
They're isolated from the rest of your machine. The reason Open WebUI runs in a container is that it's basically a web server: a ChatGPT replacement running on your machine that talks to Ollama. The container bundles everything that server needs. And since it's a web server, it also supports multi-user setups. If you want,
you could also set up Open WebUI as an enterprise ChatGPT replacement for a small team, with one computer running it and serving multiple users. This is not
what we're doing here. Here, we're installing it on our own machine so that it serves only
us, but this is what this software can do. To install Docker, go to Docker.com, then to Docker Desktop, and download the version for Mac with Apple silicon. Be aware of the license here: if you are at a large company, it might be binding for you. Because for
me, it's just a small personal project, so it's fine, but keep that in mind. You can also install Docker with Homebrew on Mac; I've put the instructions in the video description if you want to do it that way.
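The Homebrew route is most likely just the Docker Desktop cask; a sketch (check the description for the exact instructions):

```bash
# Install Docker Desktop via Homebrew (macOS):
brew install --cask docker
```

Once you've got Docker set up, we can go back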
to Open WebUI and look at the instructions here. They will give you the quick start with
Docker. What I'll do here is copy the command given under "If Ollama is on your computer, use this command". I'll go back to my terminal, paste the command, and press enter. Docker prints a long container ID, which tells me it is now running on my computer.
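For reference, at the time of recording the quick-start command looked like this (check their documentation for the current version):

```bash
# Run Open WebUI in Docker, connecting to the Ollama instance on the host:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```

If I go to the Docker Desktop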
dashboard, I can see that it is running here. And that's all there is to it. By default, it
will set up Open WebUI on port 3000. So, in order to connect to it, I would just go to
http://localhost:3000, and then it will ask me to sign in or sign up. If it's the first time
that you launch Open WebUI on your computer, you'll have to sign up. The first user
to sign up becomes the admin. After that, you'll be able to log in with the account you created. Don't worry: it's all local to your machine, so the account you create stays on your machine; you're not sending your information anywhere. And now, I've got a full-featured ChatGPT
replacement. I'll have the list of my chats here. When I first want to run a chat, all
I have to do is pick the model. For example, here, I could do Llama 2. I could set it as my
default, and then I can ask simple questions like, "What are Newey-West standard errors?"
The first time I ask a question, it might take a few seconds for the model to load. It depends on the size of the model, but overall, in my experience, it tends to be faster than ChatGPT; the speed will obviously depend on your machine. So, it is pretty cool. Now, here's another cool thing you can do with this.
If you start a new chat, I have my model here, and I can actually add a second model. For example, here, I could add Mixtral latest and then repeat my query. Now, Mixtral is quite a large model, so the first time I query it, it will take a few seconds to load. I can also see this if I bring up Activity Monitor: my computer's memory usage jumps quite high, but it works, and it provides me an answer. In fact, I get two answers, because I've added two models. So, I can actually
compare. This is the answer I got from Llama 2, and this is the answer I've got from Mixtral.
That's not something that's possible with ChatGPT, but here it works! This is
working on your own machine. You can actually add multiple models and compare the results
that you get from multiple different models. If we explore the other options that
you've got on the left sidebar, first there are modelfiles. What are modelfiles? They're pretty much the equivalent of GPTs for ChatGPT: built-in sets of prompts and instructions that steer a model toward a specific purpose. You can build your own, with your choice of base model, system prompt, and parameters, or you can discover ones designed by the Open WebUI community. If you scroll down, you'll see featured modelfiles, and you can also sort by most popular.
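Under the hood, these build on Ollama's Modelfile format. As a minimal sketch of creating one from the command line (the name `finance-helper` is just an example):

```bash
# Define a custom model on top of Llama 2 with a fixed system prompt:
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER temperature 0.7
SYSTEM """You are a concise assistant for finance research."""
EOF

# Register it with Ollama so it shows up as a selectable model:
ollama create finance-helper -f Modelfile
```

Then, you've got prompts. Prompts are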
a kind of simpler version of modelfiles. They're just prompts that you've saved
for future use. And you can also look at the Open WebUI community to see prompts
that have been shared by other users. And finally, you've got documents here. These
documents are stored for retrieval-augmented generation (RAG), which means it doesn't quite work the way it does with ChatGPT: the model won't see your full document when you query it. So, for example, here,
I've tried with a research paper. I wanted the chat to summarize that paper. It's
not able to do that because it's not able to see the whole document. Basically,
this is more for reference documents. So, it's going to be able to search for snippets
in your document that are related to your query and summarize those parts, but it won't
be able to get a full overview of a document. These are the main features of Open WebUI. You
can explore more if you click on your username and go to settings. You've got a few more options
there, where you can set the theme, your system prompt, and advanced parameters. You can also try options such as speech-to-text and text-to-speech, and you can even configure image generation. That takes a bit more work to set up than the text-based chat, but it works. So, that's it for today. I hope you enjoyed
this video. If you did, please like, and also consider subscribing to the channel
so that you are notified of my future videos.