In this video, I'll reveal five open-source
options for running LLMs locally. You can use these to run the newest language models
like Llama 3 on your own machine. I'll show you how to replace ChatGPT with beautiful desktop solutions like Jan or GPT4All. I'll also show you how to use RAG to chat with all of your documents easily. Importantly for us developers, we'll learn how to use open-source LLMs as replacements for the OpenAI API and, of course, for running agents with CrewAI or Microsoft Autogen. You might be wondering, why should I do
this? First, you might have experienced times when you were frustrated with limited
and biased AI responses. Sometimes the AI refuses to complete a task even if it seems
reasonable or avoids certain topics altogether. By using open-source LLMs, you can bypass
these limitations and get more creative, useful, and unbiased responses on all topics. Next, let's talk about data privacy and
security. Are you concerned about data privacy? In general, I avoid sharing my
data with others when it's not necessary, especially if it's sensitive for personal
or work reasons. With locally hosted LLMs, you can be sure that your work and data
remain confidential and stay on your machine. And most important of all, cost efficiency
and business benefits. By using self-hosted LLMs, you avoid paying for subscriptions
or API call charges. Additionally, with proprietary LLMs, you're at the
mercy of the provider's pricing changes. Now that you understand the benefits of local
LLMs, you might be wondering how to get started. In the next section, I'll discuss the various
solutions available for running local models. LM Studio is often praised by YouTubers and
bloggers for its easy setup and user-friendly interface. It allows you to run LLMs on your
laptop entirely offline with features like an in-app chat UI, an OpenAI-compatible local server,
and the ability to download any compatible model files from Hugging Face repositories. LM Studio
is perfect for beginners looking to explore and select models with ease; however, it doesn't
allow you to chat with your own documents. In this video, I'll show you how to easily set
up LM Studio and use it with the new Llama 3 model. I'll also show you how to use Anything
LLM together with LM Studio to enable chatting with your own documents. Enough spoilers;
let's get started with LM Studio. There are installers available for Mac, Windows, and
Linux. In my case, I'll install the Mac version. After completing the installation, we are greeted
with the home view where we can directly see the most recent open-source models. We can also
search for specific language models using the search option. Let's search for Llama 3. Great,
there are many choices. I have already downloaded Llama 3 with 8 billion parameters. Let's give it
a try. We say hi, and it greets us politely back. Let's start with something simple: tell
a joke. It returns one of the standard open-source LLM jokes that I've seen hundreds
of times. Next, let's ask it to write a blog article about LLMs. It has started, but I
can already see that it is slower than some alternatives like Jan or GPT4All. Let's
speed it up a little until it is finished. Unfortunately, there is no direct way to chat with
documents. However, with the help of another tool, we can use LM Studio with our local
documents. For this, we go to the webpage of Anything LLM. We can download versions
for different operating systems. I'll opt for the Mac version. After the installation,
we can get started with Anything LLM. We are greeted with a clean welcome screen. As you
can see, you can use Anything LLM not only with local models, but it also shows a variety
of well-known remote LLMs like OpenAI or Gemini. Before Anything LLM can use our local model, we have to start LM Studio's local inference server. That sounds much more complex than it actually is. We simply select the
model we want to load, and after a few moments, the server starts automatically. Let's copy
the URL because we will need to paste it into Anything LLM in a few seconds. All right, let's
paste it and continue. The chat model selection has been filled out automatically. We set the
token context window to the standard 4096. Everything looks good. Let's create a new workspace. Now
everything is set up so we can already say hi. Okay, the communication with LM Studio seems
to work. That means it used the provided Llama 3 model via the LM Studio server. When
we click on the small upload symbol, we can start to upload documents. Let's
upload one of the transcripts from my recent YouTube videos. The goal is for
my questions to be answered based on the information I've shared in the videos.
That worked like a charm. Let's add some more transcripts to see if it can also provide
references to where it found the information. I simply upload all the other documents too. We can now select them all and move them to our current workspace. We click on Save and
embed. The documents are now embedded, and the information should be available in our chat. Let's
try it out. I ask a very specific question related to one of the videos where we show how GPT-4 Vision
can create recipes based on images of the fridge's contents. The answer is perfectly extracted from
the video transcripts. It explains that both work together to create recipes from a simple image of
your fridge. We can click on show citations, and it also shows where it found the information. LM
Studio and Anything LLM seem to be a dream team. In the last part about LM Studio, we saw how to
use the built-in chat interfaces to tell jokes or write blog articles. But LM Studio can also
be used in the same way as Ollama with LiteLLM. We can set up a server that works just like
the OpenAI API but runs completely on your local machine, making it free and private.
To do this, click on the server symbol on the left. Here we can easily start a so-called
local inference server. LM Studio provides many examples using cURL or various Python solutions.
To keep it simple, we will try out the provided chat completion example. Click on copy code, and
we can then try out this code directly. First, start the server by clicking on the green
button. Next, create a new Python script. Let's name it app.py and paste the copied code example. As we can see, it requires the OpenAI package. Therefore, let's
install it. We can do this with the command pip install openai. After the package is installed,
we can try running the script. Everything works like a charm. The local Llama 3 LLM introduces
itself. Let's change the prompt to tell me a joke. We don't need a system prompt for this. We clear
the terminal and run the script again. All right, and we see the same joke we saw earlier.
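For reference, here is a minimal sketch along the lines of the copied example. It assumes LM Studio's default port 1234; the exact code LM Studio generates and the model identifier on your machine may differ.

```python
# app.py - a minimal sketch, not LM Studio's exact generated code.
# Assumes the local inference server runs on LM Studio's default port 1234.
from openai import OpenAI

# Point the OpenAI client at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves whichever model is loaded
    messages=[{"role": "user", "content": "Tell me a joke."}],
)

# Prints the entire response object, metadata and all. To show only the joke,
# print completion.choices[0].message.content instead.
print(completion)
```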
One minor thing I'd like to change is to display only the actual joke. For this, we
only need to print out the message content. As you can see, LM Studio is a great way
to set up a server that works just like the OpenAI API but runs completely on your local
machine, making it free and private. LM Studio has become a de facto standard for running large
language models locally without hassle. However, its proprietary nature and potential licensing
constraints make it less ideal for business use. An exciting new option is Jan, an
open-source solution with a clean UI, active community engagement, good
documentation, and a transparent roadmap. In the next few minutes, we'll follow the
instructions in the GitHub repository to build it so we can try it out and
form our own impression. All right, let's start by cloning the project from GitHub.
Next, we navigate to the directory and build and execute the application using Make. This
process takes some time, so I'll speed up the video to keep things efficient. Once it's
finished, it opens a nice interface. It looks a little like ChatGPT, but we can see additional
settings on the right side. For example, we can choose the model and select from
the latest versions like GPT-4 Turbo, GPT-4 Vision, or GPT-4o. I'll go with my current favorite, GPT-4o. Let's say hi. Oh, it can't be processed because it needs an API key.
No problem, let's head over to the OpenAI page and quickly copy the API key. After pasting it into
the settings, we can try again. Great, this time it works. Let's start with a simple task: tell a
joke. And here's the answer delivered promptly. Now I want to switch to a local model. Let's
select the new Llama 3 model with 8 billion parameters. The download takes a few minutes,
so I'll speed up the video during this time. All right, the download is complete, and we
can use it now. On the right side of the chat, I can now switch the model from
GPT to Llama 3. Let's say hi again. As you can see, it remembers the context. This is
interesting because it means the chat history was sent along with the current prompt to the model.
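As a small illustration (this is the general pattern, not Jan's actual code), a chat UI "remembers" earlier messages simply by resending the previous turns with every request:

```python
# Hypothetical illustration: the "memory" of a chat UI is just the prior
# turns included in the messages list that is sent with each new request.
messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
    {"role": "user", "content": "hi again"},  # the model sees the whole history
]
```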
Good to know. Let's ask Llama to write a blog article. It starts generating content quickly,
and I'd say the speed is impressive and much faster than LM Studio. Now I want to include some
transcripts from previous videos so the models can use them to answer the next questions. To do this,
I need to go into the advanced settings and enable experimental mode. Then I go back to the chat,
and now I can enable the retrieval option. When I click on the small attachment icon, I can now
upload images and documents. I'll ask a question that is very specific to one of the videos where
we use GPT to analyze fridge contents to create recipes. The answer is creative but doesn't relate
to the transcripts. Let's try again with another transcript, but once again, no success. I suspect
there's a reason why it's still in experimental mode. Nevertheless, I really like the clean
UI and the easy way to switch between models. Even though Jan stands out as an open-source
alternative with great documentation and a clean UI, we found that its retrieval
augmented generation (RAG) function is still experimental and didn't work
well in our tests. In contrast, GPT4All is an all-in-one application that
mirrors ChatGPT's interface and quickly runs local LLMs for common tasks and RAG. The provided
models work seamlessly with local documents and even provide references. Installing GPT4All is straightforward. The installation page provides installers for Windows, macOS, and
Linux. For a detailed guide on installation and using the document chat feature, you
can access all the full-length videos in our new open-source LLM course on AI for
Devs. See the description for details. CLI tools like Ollama allow you to run open-source
models locally and integrate them directly into the console, which can then be connected to
front-end interfaces for a custom experience. They often provide endpoints compatible with the OpenAI
API, making it easy to switch models with minimal code changes. While chatbots are the most common
use, these tools can also power more complex systems like agents using frameworks such as CrewAI and Microsoft's Autogen; I'll sketch that idea briefly at the end of this section. Let's start with the installation process and later move on to coding
examples to see how these tools work in practice. The installation of Ollama is straightforward.
On the website, we see variants for Mac, Linux, and Windows. In my case, I chose the Mac version.
We see a charming "Welcome to Ollama" screen and click next. Then it asks us to install the command line version. Okay, let's click on install. It also provides us the command to run a model from the command line, in our case ollama run llama3. We copy and paste the command, and we can assume that it downloads the Llama 3
model in the background. When it has finished, it gives us a prompt. We start with something
easy, tell me a joke, and we get one of the standard open-source jokes. Okay, now something
more complex. Let's enter: write a blog article about LLMs. At this point, I'll increase the
speed of the video because it's a long article, and I can imagine that you're not interested
in the actual blog article. Okay, great, the blog article is written. Let's now examine how
we can use Ollama together with our Python scripts as an OpenAI replacement. First, we create a
virtual environment to keep the dependencies cleanly separated. We activate the environment, we
install LiteLLM to serve our local LLM models, enabling seamless integration with our Python scripts. The command litellm --model ollama/llama3 starts the proxy server that hosts the Llama 3 model, making it available for your Python script to
interact with. Let's create a new Python script where we add similar logic as if we were
communicating with the OpenAI API. First, we install the openai package and import it. The
package is necessary to create an OpenAI client, but you'll see that we don't communicate with the
OpenAI API at all. When we initialize the client, we set the base URL to localhost on port 4000. We
can set anything as the API key. We then create a so-called chat completion exactly as if we were
communicating with GPT via the API. We only have a user role and ask it to tell us a joke. In our
print statement, we output the actual response without any metadata. Okay, let's try it out,
and we see that it works like a charm. Now we have an offline, privacy-preserving replacement
for GPT that we can use in all our projects.
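For reference, the finished script might look something like the sketch below. It assumes the LiteLLM proxy from above is running on its default port 4000; the model name is an assumption and must match whatever your proxy serves.

```python
# app.py - a minimal sketch of a local OpenAI replacement, not a definitive recipe.
# Assumed setup:
#   pip install litellm openai    (newer LiteLLM versions may need litellm[proxy])
#   litellm --model ollama/llama3   # serves an OpenAI-compatible API on port 4000
from openai import OpenAI

# The base URL points at the local LiteLLM proxy; the API key can be any string.
client = OpenAI(base_url="http://localhost:4000", api_key="anything")

response = client.chat.completions.create(
    model="ollama/llama3",  # assumption: must match the model the proxy exposes
    messages=[{"role": "user", "content": "Tell me a joke."}],
)

# Print only the actual response text, without any metadata.
print(response.choices[0].message.content)
```

And here is the agent idea I mentioned earlier, sketched with the pyautogen package. This is an assumption about how you might wire Autogen to the same local server, not a tested recipe.

```python
# Hypothetical sketch: pointing Microsoft's Autogen (pip install pyautogen)
# at the local LiteLLM/Ollama server instead of the OpenAI API.
import autogen

config_list = [{
    "model": "ollama/llama3",             # assumption: matches the proxy's model
    "base_url": "http://localhost:4000",  # the local LiteLLM proxy
    "api_key": "anything",                # the local server ignores the key
}]

assistant = autogen.AssistantAgent("assistant", llm_config={"config_list": config_list})
user = autogen.UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)

# The agents now converse entirely against the local Llama 3 model.
user.initiate_chat(assistant, message="Tell me a joke.")
```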