5 Open-Source Options for Running LLMs Locally

Captions
In this video, I'll show you five open-source options for running LLMs locally. You can use these to run the newest language models like Llama 3 on your own machine. I'll show you how to replace ChatGPT with polished desktop apps like Jan or GPT4All, and I'll share some tricks for using RAG to chat with all of your documents easily. Importantly for us developers, we'll learn how to use open-source LLMs as replacements for the OpenAI API and, of course, for running agents with CrewAI or Microsoft AutoGen.

You might be wondering: why should I do this? First, you may have been frustrated by limited and biased AI responses. Sometimes the AI refuses to complete a task even when it seems reasonable, or avoids certain topics altogether. With open-source LLMs, you can bypass these limitations and get more creative, useful, and unbiased responses on all topics.

Next, let's talk about data privacy and security. Are you concerned about data privacy? In general, I avoid sharing my data when it isn't necessary, especially if it's sensitive for personal or work reasons. With locally hosted LLMs, you can be sure that your work and data remain confidential and stay on your machine.

And most important of all: cost efficiency and business benefits. By using self-hosted LLMs, you avoid paying for subscriptions or API call charges. With proprietary LLMs, you're also at the mercy of the provider's pricing changes.

Now that you understand the benefits of local LLMs, you might be wondering how to get started. In the next sections, I'll walk through the various solutions available for running local models.

LM Studio is often praised by YouTubers and bloggers for its easy setup and user-friendly interface. It lets you run LLMs on your laptop entirely offline, with features like an in-app chat UI, an OpenAI-compatible local server, and the ability to download compatible model files from Hugging Face repositories. LM Studio is perfect for beginners looking to explore and select models with ease; however, it doesn't let you chat with your own documents.

In this video, I'll show you how to easily set up LM Studio and use it with the new Llama 3 model. I'll also show you how to use AnythingLLM together with LM Studio to enable chatting with your own documents. Enough spoilers, let's get started with LM Studio. Installers are available for Mac, Windows, and Linux; in my case, I'll install the Mac version.

After completing the installation, we're greeted with the home view, where we can see the most recent open-source models right away. We can also search for specific language models using the search option. Let's search for Llama 3. Great, there are many choices. I've already downloaded Llama 3 with 8 billion parameters, so let's give it a try. We say hi, and it greets us politely back. Let's start with something simple: tell a joke. It returns one of the standard open-source LLM jokes that I've seen hundreds of times. Next, let's ask it to write a blog article about LLMs. It has started, but I can already see that it's slower than some alternatives like Jan or GPT4All. Let's speed up the video a little until it's finished.

Unfortunately, there's no direct way to chat with documents. However, with the help of another tool, we can use LM Studio with our local documents. For this, we go to the webpage of AnythingLLM.
We can download versions for different operating systems; I'll opt for the Mac version. After the installation, we can get started with AnythingLLM. We're greeted with a clean welcome screen. As you can see, you can use AnythingLLM not only with local models; it also supports a variety of well-known remote LLMs like OpenAI or Gemini.

Before it can work, we have to start a local inference server in LM Studio. That sounds much more complex than it actually is: we simply select the model we want to load, and after a few moments, the server starts automatically. Let's copy the server URL, because we'll need to paste it into AnythingLLM in a few seconds. All right, let's paste it and continue. The chat model selection has been filled out automatically. We set the token context window to the standard 4096. Everything looks good. Let's create a new workspace. Now everything is set up, so we can already say hi.

Okay, the communication with LM Studio seems to work. That means it used the provided Llama 3 model via the LM Studio server. When we click on the small upload symbol, we can start uploading documents. Let's upload one of the transcripts from my recent YouTube videos. The goal is for my questions to be answered based on the information I've shared in the videos. That worked like a charm. Let's add some more transcripts to see if it can also provide references to where it found the information. I simply select all the other documents, move them into our current workspace, and click on Save and Embed. The documents are now embedded, and the information should be available in our chat. Let's try it out. I ask a very specific question related to one of the videos, where we showed how GPT Vision can create recipes based on images of a fridge's contents. The answer is perfectly extracted from the video transcripts: it explains that both work together to create recipes from a simple image of your fridge. We can click on Show Citations, and it also shows where it found the information. LM Studio and AnythingLLM seem to be a dream team.

In the first part about LM Studio, we saw how to use the built-in chat interface to tell jokes or write blog articles. But LM Studio can also be used in the same way as Ollama with LiteLLM: we can set up a server that works just like the OpenAI API but runs completely on your local machine, making it free and private. To do this, click on the server symbol on the left. Here we can easily start a so-called local inference server. LM Studio provides many examples using cURL or various Python solutions. To keep it simple, we'll try the provided chat completion example. Click on Copy Code, and we can try out this code directly. First, start the server by clicking on the green button. Next, create a new Python script, name it app.py, and paste the copied code. As we can see, it requires the openai package, so let's install it with the command pip install openai. After the package is installed, we can run the script. Everything works like a charm: the local Llama 3 LLM introduces itself. Let's change the prompt to "tell me a joke"; we don't need a system prompt for this. We clear the terminal and run the script again. All right, and we see the same joke we saw earlier. One minor thing I'd like to change is to display only the actual joke; for this, we only need to print out the message content.
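For reference, the finished script looks roughly like this. This is a minimal sketch rather than LM Studio's exact sample code: it assumes the local inference server is running on LM Studio's default address, http://localhost:1234/v1, and the model name is just a placeholder, since the server answers with whichever model is currently loaded.

```python
# app.py - minimal sketch of calling LM Studio's local inference server.
# Assumes the server runs on LM Studio's default http://localhost:1234/v1;
# the API key is required by the client but ignored by the local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio uses the loaded model
    messages=[{"role": "user", "content": "Tell me a joke."}],
)

# Print only the message content, without the surrounding metadata.
print(completion.choices[0].message.content)
```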
As you can see, LM Studio is a great way to set up a server that works just like the OpenAI API but runs completely on your local machine, making it free and private. LM Studio has become a de facto standard for running large language models locally without hassle. However, its proprietary nature and potential licensing constraints make it less ideal for business use. An exciting new option is Jan, an open-source solution with a clean UI, active community engagement, good documentation, and a transparent roadmap.

In the next few minutes, we'll follow the instructions in the GitHub repository to build it, so we can try it out and form our own impression. All right, let's start by cloning the project from GitHub. Next, we navigate into the directory and build and launch the application using Make. This process takes some time, so I'll speed up the video to keep things efficient. Once it's finished, it opens a nice interface. It looks a little like ChatGPT, but we can see additional settings on the right side. For example, we can choose the model and select from the latest versions like GPT-4 Turbo, GPT-4 Vision, or GPT-4o. I'll go with my current favorite, GPT-4o. Let's say hi. Oh, it can't be processed because it needs an API key. No problem, let's head over to the OpenAI page and quickly copy the API key. After pasting it into the settings, we can try again. Great, this time it works. Let's start with a simple task: tell a joke. And here's the answer, delivered promptly.

Now I want to switch to a local model. Let's select the new Llama 3 model with 8 billion parameters. The download takes a few minutes, so I'll speed up the video during this time. All right, the download is complete, and we can use it now. On the right side of the chat, I can switch the model from GPT to Llama 3. Let's say hi again. As you can see, it remembers the context. This is interesting because it means the chat history was sent along with the current prompt to the model. Good to know. Let's ask Llama to write a blog article. It starts generating content quickly, and I'd say the speed is impressive, much faster than LM Studio. Now I want to include some transcripts from previous videos so the model can use them to answer the next questions. To do this, I need to go into the advanced settings and enable experimental mode. Back in the chat, I can now enable the retrieval option, and when I click on the small attachment icon, I can upload images and documents. I'll ask a question that is very specific to one of the videos, where we used GPT to analyze fridge contents to create recipes. The answer is creative but doesn't relate to the transcripts. Let's try again with another transcript, but once again, no success. I suspect there's a reason why this feature is still marked experimental. Nevertheless, I really like the clean UI and the easy way to switch between models.
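A side note for developers: Jan can also expose an OpenAI-compatible local API server, so you can talk to it from code much like the LM Studio example above. This wasn't shown in the video, so treat the following as a sketch under two assumptions: that you have enabled the local server in Jan's settings (it listens on http://localhost:1337/v1 by default in the versions I've seen), and that the model id matches one shown in your Jan installation; llama3-8b-instruct below is a hypothetical example.

```python
# Sketch: calling Jan's OpenAI-compatible local API server.
# Assumes the server is enabled in Jan and listens on the default
# http://localhost:1337/v1; the model id below is a hypothetical example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1337/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="llama3-8b-instruct",  # use the model id shown in your Jan install
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)

print(completion.choices[0].message.content)
```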
Even though Jan stands out as an open-source alternative with great documentation and a clean, friendly UI, we found that its retrieval-augmented generation (RAG) function is still experimental and didn't work well in our tests. In contrast, GPT4All is an all-in-one application that mirrors ChatGPT's interface and quickly runs local LLMs for common tasks and RAG. The provided models work seamlessly with local documents and even provide references. Installing GPT4All is straightforward: the installation page provides installers for Windows, macOS, and Linux. For a detailed guide on installation and on using the document chat feature, you can access the full-length videos in our new open-source LLM course on AI for Devs; see the description for details.

CLI tools like Ollama allow you to run open-source models locally and integrate them directly into the console, which can then be connected to front-end interfaces for a custom experience. They often provide endpoints compatible with the OpenAI API, making it easy to switch models with minimal code changes. While chatbots are the most common use, these tools can also power more complex systems like agents built with frameworks such as CrewAI and Microsoft's AutoGen. Let's start with the installation process and later move on to coding examples to see how these tools work in practice.

The installation of Ollama is straightforward. On the website, we see variants for Mac, Linux, and Windows; in my case, I chose the Mac version. We see a charming "Welcome to Ollama" screen and click Next. Then it asks us to install the command-line version. Okay, let's click on Install. It also gives us the command to run a model from the command line. We copy and paste the command, and we can assume that it downloads the Llama 3 model in the background. When it has finished, it gives us a prompt. We start with something easy, "tell me a joke", and we get one of the standard open-source jokes. Okay, now something more complex. Let's enter "write a blog article about LLMs". At this point, I'll increase the speed of the video because it's a long article, and I can imagine that you're not interested in the actual blog post. Okay, great, the blog article is written.

Let's now examine how we can use Ollama together with our Python scripts as an OpenAI replacement. First, we create a virtual environment to keep the dependencies cleanly separated, and we activate it. Then we install LiteLLM to serve our local LLM models, enabling seamless integration with our Python scripts. Running the litellm command with the model argument pointing at Llama 3 starts the server that hosts the model, making it available for our Python script to interact with. Let's create a new Python script where we add the same kind of logic as if we were communicating with the OpenAI API. First, we install the openai package and import it. The package is necessary to create an OpenAI client, but you'll see that we don't communicate with the OpenAI API at all. When we initialize the client, we set the base URL to localhost on port 4000; we can set anything as the API key. We then create a so-called chat completion, exactly as if we were communicating with GPT via the API. We only have a user role and ask it to tell us a joke. In our print statement, we output the actual response without any metadata. Okay, let's try it out, and we see that it works like a charm. Now we have an offline, privacy-preserving replacement for GPT that we can use in all our projects.
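Putting those steps together, the script looks roughly like this. It's a sketch under the setup described above: a LiteLLM server (started, for example, with litellm --model ollama/llama3) proxying the Ollama-hosted Llama 3 model on LiteLLM's default port 4000.

```python
# Sketch: the OpenAI client pointed at a local LiteLLM server, which
# proxies the Ollama-hosted Llama 3 model on LiteLLM's default port 4000.
from openai import OpenAI

# The base URL targets the local server; the API key can be anything.
client = OpenAI(base_url="http://localhost:4000", api_key="anything")

completion = client.chat.completions.create(
    model="ollama/llama3",  # model name as registered with LiteLLM
    messages=[{"role": "user", "content": "Tell me a joke."}],
)

# Output only the actual response text, without any metadata.
print(completion.choices[0].message.content)
```

Run the script while the LiteLLM server is up, and the response comes back from the local Llama 3 model rather than the OpenAI API.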
Info
Channel: AI FOR DEVS
Views: 4,017
Keywords: llama 3, openai, gpt4o, lm studio, anythingllm, ollama, litellm, python tutorial, jan, chat gpt 4o, openai assistant api
Id: 0n35ETVXY9g
Length: 14min 42sec (882 seconds)
Published: Thu May 30 2024