In this video, I'll reveal five open-source
options for running LLMs locally. You can use these to run the newest language models
like Llama 3 on your own machine. I'll show you how to replace ChatGPT with beautiful desktop solutions like Jan or GPT4All. I'll also show you how to use RAG to chat with all of your documents easily. Importantly for us developers, we'll learn how to use open-source LLMs as replacements for the OpenAI API and, of course, for running agents with CrewAI or Microsoft Autogen. You might be wondering, why should I do
this? First, you might have experienced times when you were frustrated with limited
and biased AI responses. Sometimes the AI refuses to complete a task even if it seems
reasonable or avoids certain topics altogether. By using open-source LLMs, you can bypass
these limitations and get more creative, useful, and unbiased responses on all topics. Next, let's talk about data privacy and
security. Are you concerned about data privacy? In general, I avoid sharing my
data with others when it's not necessary, especially if it's sensitive for personal
or work reasons. With locally hosted LLMs, you can be sure that your work and data
remain confidential and stay on your machine. And most important of all, cost efficiency
and business benefits. By using self-hosted LLMs, you avoid paying for subscriptions
or API call charges. Additionally, with proprietary LLMs, you're at the
mercy of the provider's pricing changes. Now that you understand the benefits of local
LLMs, you might be wondering how to get started. In the next section, I'll discuss the various
solutions available for running local models. LM Studio is often praised by YouTubers and
bloggers for its easy setup and user-friendly interface. It allows you to run LLMs on your
laptop entirely offline with features like an in-app chat UI, an OpenAI-compatible local server,
and the ability to download any compatible model files from Hugging Face repositories. LM Studio
is perfect for beginners looking to explore and select models with ease; however, it doesn't
allow you to chat with your own documents. In this video, I'll show you how to easily set
up LM Studio and use it with the new Llama 3 model. I'll also show you how to use Anything
LLM together with LM Studio to enable chatting with your own documents. Enough spoilers;
let's get started with LM Studio. There are installers available for Mac, Windows, and
Linux. In my case, I'll install the Mac version. After completing the installation, we are greeted
with the home view where we can directly see the most recent open-source models. We can also
search for specific language models using the search option. Let's search for Llama 3. Great,
there are many choices. I have already downloaded Llama 3 with 8 billion parameters. Let's give it
a try. We say hi, and it greets us politely back. Let's start with something simple: tell
a joke. It returns one of the standard open-source LLM jokes that I've seen hundreds
of times. Next, let's ask it to write a blog article about LLMs. It has started, but I
can already see that it is slower than some alternatives like Jan or GPT4All. Let's
speed it up a little until it is finished. Unfortunately, there is no direct way to chat with
documents. However, with the help of another tool, we can use LM Studio with our local
documents. For this, we go to the webpage of Anything LLM. We can download versions
for different operating systems. I'll opt for the Mac version. After the installation,
we can get started with Anything LLM. We are greeted with a clean welcome screen. As you
can see, you can use Anything LLM not only with local models, but it also shows a variety
of well-known remote LLMs like OpenAI or Gemini. Before Anything LLM can use our local model, we have to start LM Studio's local inference server. That sounds much more complex than it actually is. We simply select the
model we want to load, and after a few moments, the server starts automatically. Let's copy
the URL because we will need to paste it into Anything LLM in a few seconds. All right, let's
paste it and continue. The chat model selection has been filled out automatically. We set the
token context window to the standard 4096. Everything looks good. Let's create a new workspace. Now
everything is set up so we can already say hi. Okay, the communication with LM Studio seems
to work. That means it used the provided Llama 3 model via the LM Studio server. When
we click on the small upload symbol, we can start to upload documents. Let's
upload one of the transcripts from my recent YouTube videos. The goal is for
my questions to be answered based on the information I've shared in the videos.
That worked like a charm. Let's add some more transcripts to see if it can also provide
references to where it found the information. I simply upload all the other documents too. We can now select them all and move them to our current workspace. We click on Save and
embed. The documents are now embedded, and the information should be available in our chat. Let's
try it out. I ask a very specific question related to one of the videos where we show how GPT-4 Vision
can create recipes based on images of the fridge's contents. The answer is perfectly extracted from
the video transcripts. It explains that both work together to create recipes from a simple image of
your fridge. We can click on show citations, and it also shows where it found the information. LM
Studio and Anything LLM seem to be a dream team. In the last part about LM Studio, we saw how to
use the built-in chat interfaces to tell jokes or write blog articles. But LM Studio can also
be used in the same way as Ollama with LiteLLM. We can set up a server that works just like
the OpenAI API but runs completely on your local machine, making it free and private.
To do this, click on the server symbol on the left. Here we can easily start a so-called
local inference server. LM Studio provides many examples using cURL or various Python solutions.
To keep it simple, we will try out the provided chat completion example. Click on copy code, and
we can then try out this code directly. First, start the server by clicking on the green
button. Next, create a new Python script. Let's name it app.py and paste the copied code example. As we can see, it requires the OpenAI package. Therefore, let's
install it. We can do this with the command pip install openai. After the package is installed,
we can try running the script. Everything works like a charm. The local Llama 3 LLM introduces
itself. Let's change the prompt to tell me a joke. We don't need a system prompt for this. We clear
the terminal and run the script again. All right, and we see the same joke we saw earlier.
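For reference, here is a minimal sketch along the lines of the copied example. It assumes LM Studio's default port 1234; the exact code LM Studio generates and the model identifier on your machine may differ.

```python
# app.py - a minimal sketch, not LM Studio's exact generated code.
# Assumes the local inference server runs on LM Studio's default port 1234.
from openai import OpenAI

# Point the OpenAI client at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves whichever model is loaded
    messages=[{"role": "user", "content": "Tell me a joke."}],
)

# Prints the entire response object, metadata and all. To show only the joke,
# print completion.choices[0].message.content instead.
print(completion)
```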
One minor thing I'd like to change is to display only the actual joke. For this, we
only need to print out the message content. As you can see, LM Studio is a great way
to set up a server that works just like the OpenAI API but runs completely on your local
machine, making it free and private. LM Studio has become a de facto standard for running large
language models locally without hassle. However, its proprietary nature and potential licensing
constraints make it less ideal for business use. An exciting new option is Jan, an
open-source solution with a clean UI, active community engagement, good
documentation, and a transparent roadmap. In the next few minutes, we'll follow the
instructions in the GitHub repository to build it so we can try it out and
form our own impression. All right, let's start by cloning the project from GitHub.
Next, we navigate to the directory and build and execute the application using Make. This
process takes some time, so I'll speed up the video to keep things efficient. Once it's
finished, it opens a nice interface. It looks a little like ChatGPT, but we can see additional
settings on the right side. For example, we can choose the model and select from
the latest versions like GPT-4 Turbo, GPT-4 Vision, or GPT-4o. I'll go with my current favorite, GPT-4o. Let's say hi. Oh, it can't be processed because it needs an API key.
No problem, let's head over to the OpenAI page and quickly copy the API key. After pasting it into
the settings, we can try again. Great, this time it works. Let's start with a simple task: tell a
joke. And here's the answer delivered promptly. Now I want to switch to a local model. Let's
select the new Llama 3 model with 8 billion parameters. The download takes a few minutes,
so I'll speed up the video during this time. All right, the download is complete, and we
can use it now. On the right side of the chat, I can now switch the model from
GPT to Llama 3. Let's say hi again. As you can see, it remembers the context. This is
interesting because it means the chat history was sent along with the current prompt to the model.
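As a small illustration (this is the general pattern, not Jan's actual code), a chat UI "remembers" earlier messages simply by resending the previous turns with every request:

```python
# Hypothetical illustration: the "memory" of a chat UI is just the prior
# turns included in the messages list that is sent with each new request.
messages = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
    {"role": "user", "content": "hi again"},  # the model sees the whole history
]
```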
Good to know. Let's ask Llama to write a blog article. It starts generating content quickly,
and I'd say the speed is impressive and much faster than LM Studio. Now I want to include some
transcripts from previous videos so the models can use them to answer the next questions. To do this,
I need to go into the advanced settings and enable experimental mode. Then I go back to the chat,
and now I can enable the retrieval option. When I click on the small attachment icon, I can now
upload images and documents. I'll ask a question that is very specific to one of the videos where
we use GPT to analyze fridge contents to create recipes. The answer is creative but doesn't relate
to the transcripts. Let's try again with another transcript, but once again, no success. I suspect
there's a reason why it's still in experimental mode. Nevertheless, I really like the clean
UI and the easy way to switch between models. Even though Jan stands out as an open-source
alternative with great documentation and a clean UI, we found that its retrieval
augmented generation (RAG) function is still experimental and didn't work
well in our tests. In contrast, GPT4All is an all-in-one application that
mirrors ChatGPT's interface and quickly runs local LLMs for common tasks and RAG. The provided
models work seamlessly with local documents and even provide references. Installing GPT4All is straightforward. The installation page provides installers for Windows, macOS, and
Linux. For a detailed guide on installation and using the document chat feature, you
can access all the full-length videos in our new open-source LLM course on AI for
Devs. See the description for details. CLI tools like Ollama allow you to run open-source
models locally and integrate them directly into the console, which can then be connected to
front-end interfaces for a custom experience. They often provide endpoints compatible with the OpenAI
API, making it easy to switch models with minimal code changes. While chatbots are the most common
use, these tools can also power more complex systems like agents using frameworks such as CrewAI and Microsoft's Autogen; I'll sketch that idea briefly at the end of this section. Let's start with the installation process and later move on to coding
examples to see how these tools work in practice. The installation of Ollama is straightforward.
On the website, we see variants for Mac, Linux, and Windows. In my case, I chose the Mac version.
We see a charming "Welcome to Ollama" screen and click next. Then it asks us to install the command line version. Okay, let's click on install. It also provides us the command to run a model from the command line, in our case ollama run llama3. We copy and paste the command, and we can assume that it downloads the Llama 3
model in the background. When it has finished, it gives us a prompt. We start with something
easy, tell me a joke, and we get one of the standard open-source jokes. Okay, now something
more complex. Let's enter: write a blog article about LLMs. At this point, I'll increase the
speed of the video because it's a long article, and I can imagine that you're not interested
in the actual blog article. Okay, great, the blog article is written. Let's now examine how
we can use Ollama together with our Python scripts as an OpenAI replacement. First, we create a
virtual environment to keep the dependencies cleanly separated. We activate the environment, we
install LiteLLM to serve our local LLM models, enabling seamless integration with our Python scripts. The command litellm --model ollama/llama3 starts the proxy server that hosts the Llama 3 model, making it available for your Python script to
interact with. Let's create a new Python script where we add similar logic as if we were
communicating with the OpenAI API. First, we install the openai package and import it. The
package is necessary to create an OpenAI client, but you'll see that we don't communicate with the
OpenAI API at all. When we initialize the client, we set the base URL to localhost on port 4000. We
can set anything as the API key. We then create a so-called chat completion exactly as if we were
communicating with GPT via the API. We only have a user role and ask it to tell us a joke. In our
print statement, we output the actual response without any metadata. Okay, let's try it out,
and we see that it works like a charm. Now we have an offline, privacy-preserving replacement
for GPT that we can use in all our projects.
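For reference, the finished script might look something like the sketch below. It assumes the LiteLLM proxy from above is running on its default port 4000; the model name is an assumption and must match whatever your proxy serves.

```python
# app.py - a minimal sketch of a local OpenAI replacement, not a definitive recipe.
# Assumed setup:
#   pip install litellm openai    (newer LiteLLM versions may need litellm[proxy])
#   litellm --model ollama/llama3   # serves an OpenAI-compatible API on port 4000
from openai import OpenAI

# The base URL points at the local LiteLLM proxy; the API key can be any string.
client = OpenAI(base_url="http://localhost:4000", api_key="anything")

response = client.chat.completions.create(
    model="ollama/llama3",  # assumption: must match the model the proxy exposes
    messages=[{"role": "user", "content": "Tell me a joke."}],
)

# Print only the actual response text, without any metadata.
print(response.choices[0].message.content)
```

And here is the agent idea I mentioned earlier, sketched with the pyautogen package. This is an assumption about how you might wire Autogen to the same local server, not a tested recipe.

```python
# Hypothetical sketch: pointing Microsoft's Autogen (pip install pyautogen)
# at the local LiteLLM/Ollama server instead of the OpenAI API.
import autogen

config_list = [{
    "model": "ollama/llama3",             # assumption: matches the proxy's model
    "base_url": "http://localhost:4000",  # the local LiteLLM proxy
    "api_key": "anything",                # the local server ignores the key
}]

assistant = autogen.AssistantAgent("assistant", llm_config={"config_list": config_list})
user = autogen.UserProxyAgent("user", human_input_mode="NEVER", code_execution_config=False)

# The agents now converse entirely against the local Llama 3 model.
user.initiate_chat(assistant, message="Tell me a joke.")
```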