Okay, so Gemma 2 was released yesterday for Keras, PyTorch, Hugging Face Transformers, and a number of other formats, and one of the formats that got released was the Ollama format. So today I've been playing around a little with both the 9B and the 27B in Ollama, and you actually get some really nice results, especially with the 9 billion model. I think the 27 billion model also has its place, perhaps for things like agents where you don't need responses in real time; I'm finding it's just quite a bit slower than the 9 billion model here. So you can just come into Ollama. You probably know by now how to install this; I've done a bunch of different videos about it. But you can see they've got both sizes in here, and you can install it and get using it straight away.
Recently I had a number of people asking me to do a fully local RAG. Surprisingly, a lot of the things on YouTube that are called local RAG apparently use embedding models from the cloud or other things like that. So what I thought I'd do is put together a very simple little script using the 9 billion Gemma 2 model, along with some different embeddings. The embeddings I'm going to use for this are the Nomic embeddings, which are here. You can basically just install both of these; there are instructions for installing them. Then I'm going to run everything locally in VSCode and walk through some code for doing this with LangChain, putting together a simple example of building a local RAG system. Perhaps we'll add to it in some future videos as well: we could add a UI and make a more advanced version. The goal here was really just to get something going with Gemma and see what it could do.
What I thought I'd do is take the transcripts from one of the YouTube channels: Alex Hormozi's channel. He does lots of interesting videos about business, about sales, about improving profit, that kind of thing. So I'm basically going to take those transcripts and embed them locally. I'm going to use ChromaDB for the database, although I could certainly use Qdrant or one of the other databases that runs locally as well. I'm then going to use Ollama both for the Nomic embeddings and for the Gemma 2 model. The idea is to put together a very simple RAG system, and then as I go along I can show you some of the other things I would start to add as I begin testing. But in this video I really just want to get the whole thing up and running with Gemma 2 and see how we go from there.
Now, the first thing you want to build for your RAG system is an indexer. Basically, an indexer will take your raw documents, do the splitting for you, create the vector store, set that up with the embeddings, and so on. By putting it into a separate file, we've also got somewhere we can test different things out. You can see here I've been testing out different text splitters: using the semantic chunker, using the recursive character text splitter. Certainly, over time, I would probably test a variety of different text splitting methods here to work out which is going to be the best. And by having it in a separate file like this, I can just make separate DBs of the whole thing. So even though the text files are, I think, around two and a half thousand files, making a database like this isn't going to take a huge amount of storage on your disk, so you can test it out and get a sense of what's working. The other thing is that at the start, I often just make a folder with 10 or 20 files and try it from there, to be able to get things going.
So let's just go through this. I've brought in the two text splitters, and we've got a directory loader, Ollama embeddings, and Chroma. We go to the directory, load up all the files, globbing everything that's a text file. We then run them through some embeddings, so we need an embedder for that, and this is where the Nomic embed-text model comes in. In this particular file I like to have show_progress on because it draws nice progress bars along the bottom, so we can actually see it going and get a sense of how long it's taking; that's very useful when you're doing quite a number of files. And you can see we're persisting the directory out to this db Hormozi directory. By keeping this separate, I can index everything once and it's just done: I don't need to go back and do it again, or redo it each time I run the main file.
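To make that concrete, here is a rough sketch of what an indexer like this looks like. The directory names (transcripts/, db_hormozi), chunk sizes, and splitter choice are my assumptions for illustration, not necessarily what's in the actual file.

```python
# Rough indexer sketch (assumed paths and chunk sizes).
# One-off setup in Ollama: `ollama pull nomic-embed-text` and `ollama pull gemma2`.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load every .txt transcript from the folder
loader = DirectoryLoader("transcripts/", glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()

# Split into chunks; the SemanticChunker (from langchain_experimental) is the
# other splitter worth testing here
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Local embeddings served by Ollama; show_progress draws the progress bars
embeddings = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)

# Build the Chroma store once and persist it to disk, so indexing is a one-off step
Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./db_hormozi",
)
```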
So next up, we want to actually do the RAG part. Here we're going to need the embeddings again, because remember, we need the embeddings not just for indexing but also for the query lookups as we go through this. So you can see we're getting that set up. What else do we need? We need to bring in Chroma and load up the directory we've already got, and then we need a retriever. So we take that vector store and add a retriever to it; in this case I'm going to use search type similarity with k=5, so it just brings back five results. I haven't put MMR on here, but that's something you would probably want to try as well. The idea with this is just to put the bare bones down, and then perhaps in future videos I'll start adding to it with some agent-style stuff and other things as well.
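As a sketch, loading that persisted store back and turning it into a retriever looks roughly like this, again assuming the db_hormozi directory name from the indexer:

```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Same embedding model as the indexer, needed again for query-time lookups
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Load the persisted index built by the indexer script
vectorstore = Chroma(persist_directory="./db_hormozi", embedding_function=embeddings)

# Plain similarity search returning the top 5 chunks
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

# MMR is the easy variation to try for more diverse results:
# retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5})
```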
Next up, we want to get our LLM. The LLM we're going to use is Gemma 2, and this is going to be the 9 billion one. So I'm just loading that in, setting max tokens to 512 (you could certainly play around with this), and setting temperature to zero. The keep alive is there so that you don't have the model loading and unloading each time you call it; here I've just set it to three hours. You can set it to be indefinite, or to a number of minutes or hours, whatever you want.
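In code, that's roughly the following sketch; num_predict is Ollama's name for the max-tokens setting, and the exact keep_alive value is whatever you prefer.

```python
from langchain_community.chat_models import ChatOllama

# Gemma 2 9B served locally by Ollama
llm = ChatOllama(
    model="gemma2",       # the 9B tag; use "gemma2:27b" for the larger model
    num_predict=512,      # cap on generated tokens (the "max tokens" mentioned above)
    temperature=0,
    keep_alive="3h",      # keep the model loaded between calls; -1 keeps it indefinitely
)
```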
Next we've got our prompt template. This should look really familiar for those of you who've seen a lot of my other RAG videos: we've got a prompt template that we're setting up here. You definitely want to tune this. I've started tuning it a little bit, but I'm not totally happy with it at the moment and would probably want to spend more time going through it. We're going to be passing in a context, and we're going to have a question in there.
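A bare-bones version of that template might look like this; the wording is just a placeholder, since the whole point is that you'd keep tuning it.

```python
from langchain_core.prompts import ChatPromptTemplate

# Minimal RAG prompt with a context slot and a question slot
prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the following context:

{context}

Question: {question}
"""
)
```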
At the moment I don't have any memory or anything like that going in here, so that's something I could add. One of the things I was experimenting with was using the Mesop UI, but there were just more challenges with the streaming aspect of it. You can certainly get it going with Mesop, and that can then help you with a lot of the history handling if you want, since it can pass that around.
All right, once we've got the chat prompt template set up, we need to make a chain. Our chain is just going to be a simple LangChain Expression Language chain. In here, we've got the retriever; obviously the retriever takes the question, so we're passing the question into that, and that question is the first thing we send into the chain. But we also want to pass the question on to the prompt as well: we're going to have the context coming into the prompt, and we want the question coming into the prompt too. That's why we've got this RunnablePassthrough in here. That all comes into our prompt, it goes through the LLM, and then we're just going to choose to stream it back. So you can see here I'm doing a streaming response, and we're going to basically run this.
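Put together, the chain is something like this sketch, with the streaming loop at the end being how you'd run it:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# The question feeds the retriever (which supplies the context) and is also
# passed straight through to the prompt, hence the RunnablePassthrough
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Stream the answer back token by token
for chunk in chain.stream("What is the rule of 100?"):
    print(chunk, end="", flush=True)
```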
All right, let's have a look at it in action. I'm just going to run the simple file we've been looking at, and I'll start off with a question: what is the rule of 100? I know he has a video about this. And you can see that after a second or two it's streaming back a result, and the rule of 100 states that you should do 100 primary actions every day to promote your business or make your products or services known. So that one worked really well. What if we try something totally different? Alex Hormozi is a pretty buff guy, and he occasionally talks about protein and stuff like that, so let's see what he comes up with. Okay, we got an on-point answer there. Let's ask it about this protein bomb. You can see we got quite a proper answer there too. Now, I don't actually have any memory or anything in this one, so we could certainly add that; you can also add it through the UI, and if we were using Mesop we would have gotten that sort of for free. All right, I'm going to quit out of this and talk about some of the other things you could do going along with this.
So one of the things I do for a lot of these is have a debugging script. This is often the first version I make: some sort of debugging one where I've got a bunch of different debugging things in there, just to see what's going on in the actual system. So let's say I come in here. First off, we've got the embeddings set to show_progress=True, so we can see when the embeddings kick in, and we can see some of the things coming back. Actually, this is using a much smaller DB; it's not the full one. Normally what I would do is have a folder of maybe just 10 files, try to get everything running and make sure it's working, and then scale up the index for the full thing. But in this one, okay, I've got this going now. What have I got here? Let me just bring this down a bit.
You can see that in our chain I've now also got this print-and-pass of the prompts, so I've got these two different functions where I can print out and see what's going into the actual LLM. This just helps immensely for knowing what's working and what's not: is it getting the right context, that kind of thing. Now, you can play around with this a lot. If I put this straight after this part, I could actually just print out the context, if that's all I wanted. It's very important, though, that if you go after this part you expect a dictionary input, and if you go after the prompt you expect the formatted prompt. So make sure your debugging function is expecting what will actually be coming in; otherwise you'll get frustrating errors.
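Here's a sketch of that debugging pattern: a small pass-through function dropped into the chain at the points you want to inspect. The function name here is my own placeholder, not necessarily what's in the actual script.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

def print_and_pass(x):
    # Print whatever flows through this point in the chain, then pass it on unchanged
    print(x)
    return x

debug_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | RunnableLambda(print_and_pass)   # here the input is a dict of context + question
    | prompt
    | RunnableLambda(print_and_pass)   # here it's the fully formatted prompt going to the LLM
    | llm
    | StrOutputParser()
)
```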
So I find these can be a really good way of just debugging something, to see where it's at and what I want to add to it going forward. I did look at some add-ons for this, some add-ons you could start thinking about. I mentioned the semantic chunker for the indexer before; that's definitely an interesting one to put in there. Another one is the multi-query retriever. Where you've got your retriever here, rather than returning five results from just the one question, we can set up a multi-query retriever where we add a prompt that takes the input query and rewrites five versions of it or so. Then we get responses back for each of those five versions and pass them all in.
It would be something like this, where the prompt is just taken from the LangChain docs. You would then go through and work this out: you want a parser so that each line comes back as a separate question, and you've got a little chain for that. So here I've got this multi-query chain, which is basically just the query prompt, the LLM, and the output parser. We can then feed that into the sort of standard RAG chain that we've got in here.
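As a sketch of that idea: the rewrite prompt below is adapted from the LangChain MultiQueryRetriever docs, and the parser and chain names are my own placeholders.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Prompt that rewrites the user's question into five alternative phrasings
query_prompt = PromptTemplate.from_template(
    """You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents
from a vector database. Provide these alternative questions separated by newlines.
Original question: {question}"""
)

def split_lines(text: str) -> list[str]:
    # Parser: each non-empty line of the LLM output becomes a separate question
    return [line.strip() for line in text.split("\n") if line.strip()]

# Little chain: one question in, a list of rewritten questions out
multi_query_chain = query_prompt | llm | StrOutputParser() | split_lines

# LangChain also ships a built-in wrapper for the same idea:
# from langchain.retrievers.multi_query import MultiQueryRetriever
# multi_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)
```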
All right, so that's something you can play around with and add to. And in the end, I'm really just scratching the surface here. Basically, my goal was to get something going with Gemma 2 and see how it comes out. Probably the big thing I would do next is go back and fine-tune my prompts, and really work on what I want from them: do I want the responses to be in the style of Alex Hormozi or something like that? Do I want my responses to be like mini reports with just bullet points of key points? Do I want my responses to be long? All of these things are things you should be thinking about, and you should be balancing the amount of context you're passing in with the amount you want to generate out as you go through this.
Anyway, this gives you a simple example of a LangChain RAG that's fully local. We're not calling the internet for anything here: we've got a local database, so we're not having to call some vector store in the cloud; we've got a local embedding model; and we've obviously got the local LLM, Gemma 2 in this case. One of the nice things about this is that once you've got it set up, it's very easy to swap in Llama 3, to swap out a bunch of different things on the fly, and also to test out different indexes.
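For example, swapping the model is a one-line change, assuming you've already pulled the other model with Ollama:

```python
# Swap Gemma 2 for Llama 3 without touching the rest of the chain
llm = ChatOllama(model="llama3", num_predict=512, temperature=0, keep_alive="3h")
```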
You can see here where I was testing out the semantic chunker. I would probably go for that, and just work out what the best breakpoint thresholds would be for this particular dataset, and try out some queries that I know are going to return different responses. So that gives you an idea of how you can do something like this, and how you can get started building something with Gemma 2 locally on your own computer.
So anyway, as always, if you've got any comments or questions, please put them in the comments below. If you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.