Run your Own Private Chat GPT, Free and Uncensored, with Ollama + Open WebUI

Captions
Hi, in this video, I'll show you how you can get a Chat GPT-like interface like this running locally on your machine for free. This is Vincent Codes Finance, a channel about coding for finance research. If that's something that interests you, consider subscribing so that you get notified of my future videos. In this video, I'll show you how you can use Ollama and Open WebUI to create your own Chat GPT replacement that runs on your own machine. In my case, I've got a MacBook Pro with an M3 processor and 64 GB of RAM. That's more than plenty; you don't need that much to run this, but the more RAM and the more powerful the GPU you have, the better.

In order to install our Chat GPT replacement, we'll first install Ollama and then Open WebUI. Ollama is a small program that runs in the background and lets you manage and make available open-source large language models, such as Llama 2 from Meta or Mistral. To install Ollama, all you have to do is go to their website and click download. If you're on Mac, you can also install it with Homebrew by using `brew install ollama`. You can look at the different models that are available on Ollama by clicking on Models and browsing the list. Some are featured, and you can sort by most popular. The most popular one is Llama 2; there's Mistral as well, and then there are a few variations of each.

Most models, like Llama 2, have a few different versions available. If we go to Tags, we'll see that the default variant Llama 2 makes available is the chat variant, which is optimized for chatting. This is what we want today, but they also have a variant optimized for text completion, and then they have variants depending on the size you want: 7B, 13B, or 70B, for the number of parameters in the model. The more parameters, from 7 billion up to 70 billion, the more memory the model requires, but also the more powerful the model will be.

If we go down, even for the 7B parameter chat model, there are a few variants: q4_0, q4_1, q5_0, and so forth. So, what are these? These are models with different quantization. The idea is that a typical LLM stores its parameters as 32-bit floating-point values. What these quantized variants do is reduce the number of bits, the amount of memory allotted to each parameter, which means you need less memory to use the same number of parameters, but you lose some precision. For example, 7 billion parameters at 32 bits each is roughly 28 GB of weights, while a 4-bit quantization brings that down to around 3.5 GB. There are trade-offs, so what I recommend is that you play a bit with them and see which one works best for you.

Besides the most popular models, you also have a few models that might be interesting depending on your use case. For example, they do have uncensored models. You've got Llama 2 Uncensored, which is a variation of the model fine-tuned to remove the safeguards that Llama 2 has. This model will basically answer whatever you ask it; you won't get a reply where the model tells you, "Well, I can't reply to this. This is too dangerous or too bad. I won't do it." These are available to you if you sometimes need them, for example for research purposes where a typical LLM will be blocked.

Ollama is actually a command-line application, so you have to go to the terminal to interact with it directly.
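As a quick reference, here is roughly what the install-and-pull steps look like in the terminal on a Mac. The exact tag names, such as `llama2:7b-chat-q4_0`, come from each model's Tags page on the Ollama library and may change over time, so treat this as a sketch and check the site for current tags:

```sh
# Install Ollama with Homebrew (macOS); the download from ollama.com works too
brew install ollama

# Pull the default Llama 2 chat model
ollama pull llama2

# Or pull a specific size/quantization variant listed on the model's Tags page
ollama pull llama2:7b-chat-q4_0
```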
Depending on the way you've installed Ollama, you might have to start the service manually or set it so that it starts automatically. If, for example, you've installed it with Homebrew on Mac, it will be set up as a service for you automatically, so it will run in the background all the time. If we go to the terminal, you can call Ollama by just typing `ollama`, and it will give you the available commands. So here, we've got a few different commands. If you want to start the service, you do `ollama serve`. In my case, because I installed it with Homebrew, the service is already running, so I'm getting an error that the port it's trying to serve on is already in use. You can also list the models you currently have installed with `ollama list`. If you've just installed Ollama, you won't have anything yet; these are the models that I have installed on my computer.

If you want to install a model, for example Llama 2, you would just do `ollama pull llama2`. In my case, it was really fast because I already had it installed; it just double-checked that I had the latest version, and because that's the case, it's all good. In your case, it might take a bit more time because it has to download the full model. The most powerful chat model they have at this time is called Mixtral. For that, you would do `ollama pull mixtral`. Again, it's going to be fast for me because I already had it installed, but it's actually about a 30 GB download.

If you just want to chat with a model, you can do it in the terminal with Ollama directly. I can do, for example, `ollama run llama2`. It will start the model and make it available to chat with me. Here, I'm just prompted, and I can send it a message. We can, for example, check whether it knows about Ollama: "What is Ollama?" Clearly, it didn't understand my question, or it doesn't know about Ollama. It's working, but it's not necessarily the kind of interface we want to interact with on a day-to-day basis. I can quit the chat by typing `/bye`, and it will stop.

Okay, so now we've got Ollama working on our computer, but we don't want to interact with it like this. This is what we call the backend: the service that provides the large language model on our computer. Now we want to install a frontend, the application that will serve as our UI, our user interface, to interact with these large language models. For that, we'll use Open WebUI. It is a Chat GPT replacement that is open source, and it offers a lot of the features that Chat GPT has: it lets you keep track of your chats, store modelfiles, prompts, and so forth. We'll see what these are, but first, we'll have to install it. This is the somewhat tricky part of this video, because in order to install Open WebUI, you'll actually need Docker. If you don't know what Docker is, it is container software. So, what are containers? They are like little virtual machines that run on your computer, and Docker is the software that helps you manage these containers and run them on your machine. It can be a bit confusing the first time, but it's actually probably the safest way to run software like this, because containers are self-contained and isolated from the rest of your machine.

The reason why Open WebUI has to run in a container is that it's basically a web server.
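As an aside, this backend/frontend split is quite concrete: Ollama exposes a local HTTP API (on port 11434 by default), and a frontend like Open WebUI is essentially a web application that sends requests to it. Here is a minimal sketch of hitting that API directly with curl, assuming you've already pulled `llama2`:

```sh
# Ask the local Ollama server for a one-shot, non-streaming completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What are Newey-West standard errors?",
  "stream": false
}'
```

Open WebUI essentially wraps requests like this in a chat interface.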
Open WebUI, then, is a Chat GPT replacement: a web server running on your machine that interacts with Ollama. In order to do that, you have to run that server, and the container has everything built into it. Because it is a web server, it also supports multi-user setups. If you wanted, you could set up Open WebUI as an enterprise Chat GPT replacement, where, for a small team, you'd have one computer running it and serving multiple users. That's not what we're doing here; we're installing it on our own machine so that it serves only us. But this is what the software can do.

In order to install Docker, go to Docker.com, go to Docker Desktop, and download the version for Mac with Apple chip. Be aware of the license here: if you are at a large company, it might be binding for you. For me, it's just a small personal project, so it's fine, but keep that in mind. You can also install Docker with Homebrew on Mac; I've put the instructions in the video description if you want to do it that way.

Once you've got Docker set up, we can go back to Open WebUI and look at the instructions. They give you a quick start with Docker. What I'll do here is copy the command for "if Ollama is on your computer, use this command", go back to my terminal, paste it, and press enter. Now I get a little message with a big long identifier that tells me it is running on my computer. If I go to the Docker Desktop dashboard, I can see that it is running there. And that's all there is to it. By default, it will set up Open WebUI on port 3000, so in order to connect to it, I just go to http://localhost:3000, and it will ask me to sign in or sign up. If it's the first time you launch Open WebUI on your computer, you'll have to sign up; the first user to sign up will be the admin. After that, you'll be able to log in with the account you created. Don't worry, it is local to your machine: the account you create stays on your machine, and you're not sending your information anywhere.
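For reference, the quick-start command copied from the Open WebUI README (the "Ollama is on your computer" variant) looked roughly like this at the time; check the README for the current version, since the image tag and flags may change:

```sh
# Run Open WebUI in Docker, pointing it at the Ollama instance running on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```

The `-p 3000:8080` mapping is what makes the UI available on port 3000 of your machine.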
And now, I've got a full-featured Chat GPT replacement. I have the list of my chats here. When I want to start a chat, all I have to do is pick the model. For example, here I could pick Llama 2, set it as my default, and then ask simple questions like, "What are Newey-West standard errors?" The first time I ask a question, it might take a few seconds for the model to load; it depends on the size of the model. But overall, it tends to be faster than Chat GPT, at least in my experience. The speed will obviously depend on your machine, but on mine, it tends to be faster than Chat GPT. So, it is pretty cool.

Now, another cool thing that you can do with this: if you start a new chat, I have my model here, and I can actually add a second model. For example, here I could add Mixtral latest and then repeat my query. Mixtral is quite a large model, so the first time I query it, it will take a few seconds to load. I can also see it if I bring up my Activity Monitor: the memory usage on my computer jumps quite high, but it is working, and it is providing me an answer. And whether or not that's the answer I want, I actually have two answers, because I've added two models, so I can compare: this is the answer I got from Llama 2, and this is the answer I got from Mixtral. That's not something that's possible with Chat GPT, but here, in this case, it works, and it's working on your own machine. You can actually add multiple models and compare the results that you get from different models.

If we explore the other options in the left sidebar, we've got modelfiles. What are modelfiles? Well, they're pretty much the equivalent of GPTs for Chat GPT. They are built-in sets of prompts or instructions to a model that you can use to serve a specific purpose. You can build your own if you want, creating your own models with different kinds of instructions, prompts, and so forth, or you can discover the ones that have been designed by the Open WebUI community. If you scroll down, you'll see the featured modelfiles, and you can also click to see the most popular ones.

Then, you've got prompts. Prompts are a simpler version of modelfiles: they're just prompts that you've saved for future use. You can also look at the Open WebUI community to see prompts that have been shared by other users.

And finally, you've got documents. These documents are used in a RAG fashion, that is, retrieval-augmented generation, which means it doesn't quite work the way it does with Chat GPT: the model won't be able to access your full document when you query it. For example, I've tried with a research paper and asked the chat to summarize that paper. It's not able to do that, because it can't see the whole document. Basically, this is more for reference documents: it will search for snippets in your document that are related to your query and summarize those parts, but it won't be able to get a full overview of a document.

These are the main features of Open WebUI. You can explore more if you click on your username and go to Settings. You've got a few more options there, where you can set the theme, your system prompts, and advanced parameters. You can also try alternative options such as speech-to-text and text-to-speech, and you can configure image generation as well. That's one more step, and it's a bit more work to get set up than the text-based chat, but you can add image generation too.

So, that's it for today. I hope you enjoyed this video. If you did, please like, and also consider subscribing to the channel so that you are notified of my future videos.
Info
Channel: Vincent Codes Finance
Views: 19,125
Keywords: researchtips, research, professor, datascience, dataanalytics, dataanalysis, bigdata, data science, python pandas, big data, chatgpt, gpt, ollama, artificial intelligence, chat gpt, machine learning, uncensored, opensourceai, llama2, mistral, private, privacy, opensource, ai, private server, local server, self-hosted, web ui, open webui, future tech, llama3, local llm, llama 3
Id: UmUDpxnmLW4
Length: 16min 46sec (1006 seconds)
Published: Fri Mar 08 2024