Hi, in this video, I'll show you how you can
get a ChatGPT-like interface like this running locally on your machine for free. This is Vincent
Codes Finance, a channel about coding for finance research. If that's something that interests you,
consider subscribing so that you get notified of my future videos. In this video, I'll show you how
you can use Ollama and Open WebUI to create your own ChatGPT replacement that runs on your own machine. In my case, I've got a MacBook Pro with an M3 processor and 64 GB of RAM. That's more than enough; you don't need that much to run this, but the more RAM and the more powerful the GPU you have, the better. In order to install our ChatGPT replacement,
we'll first install Ollama and then Open WebUI. Ollama is a small program that runs in the
background and lets you manage and make available large language models that are open source, such
as Llama 2 from Meta or Mistral. In order to install Ollama, all you have to do is go to their
website and click download. If you're on Mac, you can also install it with Homebrew by using
`brew install ollama`. You can browse the models available on Ollama by clicking on Models on their website. Some are featured, and you can also sort by most popular. The most popular one is Llama 2; there's Mistral as well, and then a few variations of those. Most models, like Llama 2, have a few different
versions available. If we look under Tags, we'll see that the default variant of Llama 2 is the chat variant, which is optimized for chatting. This is what we want today, but they also have a variant optimized for raw text completion, as well as variants of different sizes: 7B, 13B, or 70B, for the number of parameters in the model. The more parameters, from 7 billion up to 70 billion, the more memory the model requires, but
also the more capable the model will be. If we scroll down, even the 7B chat model comes in a few variants: q4_0, q4_1, q5_0, and so forth. So, what are these? They are versions of the model with different quantization. A typical LLM stores its parameters as 32-bit floating-point values; quantized variants reduce the number of bits used for each parameter, which means you need less memory for the same number of parameters, but you lose some precision. There are trade-offs there. What I recommend is that you play a bit with them and see which one works best for you.
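As a rough sketch of why this matters: the memory needed for the weights alone is roughly the parameter count times the bits per parameter. These are back-of-the-envelope figures for intuition, not exact numbers:

```bash
# Weight memory estimate: parameters x bits per parameter / 8 bytes.
# Actual usage is somewhat higher (context, runtime overhead).
awk 'BEGIN {
  params = 7e9                                    # a 7B-parameter model
  printf "FP32 (32-bit): %4.1f GB\n", params * 32 / 8 / 1e9
  printf "q4_0  (4-bit): %4.1f GB\n", params *  4 / 8 / 1e9
}'
```

So a 4-bit quantized 7B model needs roughly 3.5 GB for its weights instead of about 28 GB at full precision, which is why these variants fit comfortably on a laptop.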
Besides the most popular models, there are also a few that might be interesting depending on your use case. For example, they have
uncensored models. You've got Llama 2 uncensored, which is a variation of the model fine-tuned to
remove the safeguards that Llama 2 has. This model will basically answer whatever you ask it. There
won't be a reply where the model tells you, "Well, I can't reply to this. This is too dangerous or
too bad. I won't do it." These are available if you sometimes need them for research purposes where a typical LLM would refuse. Ollama itself is a command-line application,
so you have to go to the terminal to interact with it directly. Depending on the way you've
installed Ollama, you might have to start the service manually or set it so that it
starts automatically. If, for example, you've installed it with Homebrew on Mac, it can be set up as a service so that it runs in the background all the time.
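For reference, here's a minimal sketch of both options on macOS, assuming the Homebrew install:

```bash
# Run Ollama as a background service managed by Homebrew:
brew services start ollama

# Or run the server manually in the foreground:
ollama serve
```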
If we go to the terminal, you can call Ollama by just typing `ollama`, and it will list the available commands. If you want to start the service, you do `ollama serve`. In my case, because I installed with Homebrew, the service is already running, so I get an error saying the port it's trying to listen on is already in use. You can also list the models you currently have installed with `ollama list`. If you've just installed Ollama, you won't have anything yet; these are the models I have on my computer.
If you want to install a model, for example Llama 2, you just do `ollama pull llama2`. In my case, it was really fast because I already had it installed; it just verified that I had the latest version, and since that was the case, it was all good. In your case, it might take a bit longer because it has to download the full model. The most powerful chat model they have at this time is called Mixtral. For that, you would do `ollama pull mixtral`. Again, it's fast for me because I already have it installed, but the download is about 30 GB for this one. If you just want to chat with a model, you can do it in the terminal with Ollama directly, for example with `ollama run llama2`.
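Putting the basic commands together (the download sizes are rough figures from my setup):

```bash
ollama pull llama2    # default Llama 2 chat variant, a few GB
ollama pull mixtral   # much larger model, roughly a 30 GB download
ollama list           # check which models are installed locally
ollama run llama2     # open an interactive chat in the terminal
```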
Running `ollama run llama2` starts the model and drops me into a prompt where I can send it a message. We can, for example, check whether it knows about Ollama: "What is Ollama?" Clearly, it didn't understand my question, or it doesn't know about Ollama. It's working, but
it's not necessarily the kind of interface we want to interact with on a day-to-day basis. I can just
quit this chat by typing `/bye`, which stops it. Okay, so now we've got Ollama working on our computer, but we don't want to interact with it like this. This is what we call the backend: the service that makes the large language models available on our machine.
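Under the hood, the backend exposes an HTTP API on port 11434, and that's what a frontend connects to. As a quick sketch, you can query it directly with curl:

```bash
# Ask the Ollama HTTP API (default port 11434) for a one-shot completion:
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What are Newey-West standard errors?",
  "stream": false
}'
```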
Now we want to install a frontend: the application that will serve as our user interface for interacting with these large language models. For that, we'll use Open WebUI. It
is an open-source ChatGPT replacement, and it offers a lot of the features that ChatGPT has. It lets you keep track of your chats and store modelfiles, prompts, and so forth. We'll see
what these are, but first, we'll have to install it. This is the somewhat tricky part of this
video because in order to install Open WebUI, you'll actually need Docker. If you don't
know what Docker is, it's container software. Containers are a bit like lightweight virtual machines that run on your computer, and Docker is the software that helps you manage and run them on your machine. It can be a bit confusing the first time, but it's actually
probably the safest way to run software like this because these containers are self-contained.
They're isolated from the rest of your machine. The reason Open WebUI runs in a container is that it's basically a web server: a ChatGPT replacement running on your machine that talks to Ollama. The container bundles everything that server needs. And since it's a web server, it also supports multi-user setups. If you want,
you could also set up Open WebUI as an enterprise ChatGPT replacement for a small team, with one computer running it and serving multiple users. This is not
what we're doing here. Here, we're installing it on our own machine so that it serves only
us, but this is what this software can do. To install Docker, go to Docker.com, then to Docker Desktop, and download the version for Mac with Apple silicon. Be aware of the license here: if you are at a large company, it might be binding for you. Because for
me, it's just a small personal project, so it's fine, but keep that in mind. You can also install Docker with Homebrew on Mac; I've put the instructions in the video description if you want to do it that way.
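The Homebrew route is most likely just the Docker Desktop cask; a sketch (check the description for the exact instructions):

```bash
# Install Docker Desktop via Homebrew (macOS):
brew install --cask docker
```

Once you've got Docker set up, we can go back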
to Open WebUI and look at the instructions here. They will give you the quick start with
Docker. What I'll do here is copy the command given under "If Ollama is on your computer, use this command". I'll go back to my terminal, paste the command, and press enter. Docker prints a long container ID, which tells me it is now running on my computer.
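For reference, at the time of recording the quick-start command looked like this (check their documentation for the current version):

```bash
# Run Open WebUI in Docker, connecting to the Ollama instance on the host:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```

If I go to the Docker Desktop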
dashboard, I can see that it is running here. And that's all there is to it. By default, it
will set up Open WebUI on port 3000. So, in order to connect to it, I would just go to
http://localhost:3000, and then it will ask me to sign in or sign up. If it's the first time
that you launch Open WebUI on your computer, you'll have to sign up. The first user
to sign up becomes the admin. After that, you'll be able to log in with the account you created. Don't worry: it's all local to your machine, so the account you create stays on your machine; you're not sending your information anywhere. And now, I've got a full-featured ChatGPT
replacement. I'll have the list of my chats here. When I first want to run a chat, all
I have to do is pick the model. For example, here, I could do Llama 2. I could set it as my
default, and then I can ask simple questions like, "What are Newey-West standard errors?"
The first time I ask a question, it might take a few seconds for the model to load. It depends on the size of the model, but overall, in my experience, it tends to be faster than ChatGPT; the speed will obviously depend on your machine. So, it is pretty cool. Now, here's another cool thing you can do with this.
If you start a new chat, I have my model here, and I can actually add a second model. For example, here, I could add Mixtral latest and then repeat my query. Now, Mixtral is quite a large model, so the first time I query it, it will take a few seconds to load. I can also see this if I bring up Activity Monitor: my computer's memory usage jumps quite high, but it works, and it provides me an answer. In fact, I get two answers, because I've added two models. So, I can actually
compare. This is the answer I got from Llama 2, and this is the answer I've got from Mixtral.
That's not something that's possible with ChatGPT, but here it works! This is
working on your own machine. You can actually add multiple models and compare the results
that you get from multiple different models. If we explore the other options that
you've got on the left sidebar, first there are modelfiles. What are modelfiles? They're pretty much the equivalent of GPTs for ChatGPT: built-in sets of prompts and instructions that steer a model toward a specific purpose. You can build your own, with your choice of base model, system prompt, and parameters, or you can discover ones designed by the Open WebUI community. If you scroll down, you'll see featured modelfiles, and you can also sort by most popular.
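Under the hood, these build on Ollama's Modelfile format. As a minimal sketch of creating one from the command line (the name `finance-helper` is just an example):

```bash
# Define a custom model on top of Llama 2 with a fixed system prompt:
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER temperature 0.7
SYSTEM """You are a concise assistant for finance research."""
EOF

# Register it with Ollama so it shows up as a selectable model:
ollama create finance-helper -f Modelfile
```

Then, you've got prompts. Prompts are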
a kind of simpler version of modelfiles. They're just prompts that you've saved
for future use. And you can also look at the Open WebUI community to see prompts
that have been shared by other users. And finally, you've got documents here. These
documents are stored for retrieval-augmented generation (RAG), which means it doesn't quite work the way it does with ChatGPT: the model won't see your full document when you query it. So, for example, here,
I've tried with a research paper. I wanted the chat to summarize that paper. It's
not able to do that because it's not able to see the whole document. Basically,
this is more for reference documents. So, it's going to be able to search for snippets
in your document that are related to your query and summarize those parts, but it won't
be able to get a full overview of a document. These are the main features of Open WebUI. You
can explore more if you click on your username and go to settings. You've got a few more options
there, where you can set the theme, your system prompt, and advanced parameters. You can also try options such as speech-to-text and text-to-speech, and you can even configure image generation. That takes a bit more work to set up than the text-based chat, but it works. So, that's it for today. I hope you enjoyed
this video. If you did, please like, and also consider subscribing to the channel
so that you are notified of my future videos.