Okay, so Gemma 2 was released yesterday for Keras, PyTorch, Hugging Face Transformers, and a number of other formats, and one of the formats that got released was the Ollama format. So today I've been playing around a little with both the 9B and the 27B in Ollama, and you actually get some really nice results, especially with the 9 billion model. I think the 27 billion model also has its place, perhaps for things like agents where you don't need responses in real time; I'm finding it's just quite a bit slower than the 9 billion model here. So you can just come into Ollama. You probably know by now how to install this; I've done a bunch of different videos about it. But you can see they've got both sizes in here, and you can install it and get using it straight away.
Recently I had a number of people asking me to do a fully local RAG. Surprisingly, a lot of the things on YouTube that are called local RAG apparently use embedding models from the cloud or other things like that. So what I thought I'd do is put together a very simple little script using the 9 billion Gemma 2 model, along with some different embeddings. The embeddings I'm going to use for this are the Nomic embeddings, which are here. You can basically just install both of these; there are instructions for installing them. Then I'm going to run everything locally in VSCode and walk through some code for doing this with LangChain, putting together a simple example of building a local RAG system. Perhaps we'll add to it in some future videos as well: we could add a UI and make a more advanced version. The goal here was really just to get something going with Gemma and see what it could do.
What I thought I'd do is take the transcripts from one of the YouTube channels: Alex Hormozi's channel. He does lots of interesting videos about business, about sales, about improving profit, that kind of thing. So I'm basically going to take those transcripts and embed them locally. I'm going to use ChromaDB for the database, although I could certainly use Qdrant or one of the other databases that runs locally as well. I'm then going to use Ollama both for the Nomic embeddings and for the Gemma 2 model. The idea is to put together a very simple RAG system, and then as I go along I can show you some of the other things I would start to add as I begin testing. But in this video I really just want to get the whole thing up and running with Gemma 2 and see how we go from there.
Now, the first thing you want to build for your RAG system is an indexer. Basically, an indexer will take your raw documents, do the splitting for you, create the vector store, set that up with the embeddings, and so on. By putting it into a separate file, we've also got somewhere we can test different things out. You can see here I've been testing out different text splitters: using the semantic chunker, using the recursive character text splitter. Certainly, over time, I would probably test a variety of different text splitting methods here to work out which is going to be the best. And by having it in a separate file like this, I can just make separate DBs of the whole thing. So even though the text files are, I think, around two and a half thousand files, making a database like this isn't going to take a huge amount of storage on your disk, so you can test it out and get a sense of what's working. The other thing is that at the start, I often just make a folder with 10 or 20 files and try it from there, to be able to get things going.
So let's just go through this. I've brought in the two text splitters, and we've got a directory loader, Ollama embeddings, and Chroma. We go to the directory, load up all the files, globbing everything that's a text file. We then run them through some embeddings, so we need an embedder for that, and this is where the Nomic embed-text model comes in. In this particular file I like to have show_progress on because it draws nice progress bars along the bottom, so we can actually see it going and get a sense of how long it's taking; that's very useful when you're doing quite a number of files. And you can see we're persisting the directory out to this db Hormozi directory. By keeping this separate, I can index everything once and it's just done: I don't need to go back and do it again, or redo it each time I run the main file.
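To make that concrete, here is a rough sketch of what an indexer like this looks like. The directory names (transcripts/, db_hormozi), chunk sizes, and splitter choice are my assumptions for illustration, not necessarily what's in the actual file.

```python
# Rough indexer sketch (assumed paths and chunk sizes).
# One-off setup in Ollama: `ollama pull nomic-embed-text` and `ollama pull gemma2`.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load every .txt transcript from the folder
loader = DirectoryLoader("transcripts/", glob="**/*.txt", loader_cls=TextLoader)
docs = loader.load()

# Split into chunks; the SemanticChunker (from langchain_experimental) is the
# other splitter worth testing here
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Local embeddings served by Ollama; show_progress draws the progress bars
embeddings = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)

# Build the Chroma store once and persist it to disk, so indexing is a one-off step
Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./db_hormozi",
)
```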
So next up, we want to actually do the RAG part. Here we're going to need the embeddings again, because remember, we need the embeddings not just for indexing but also for the query lookups as we go through this. So you can see we're getting that set up. What else do we need? We need to bring in Chroma and load up the directory we've already got, and then we need a retriever. So we take that vector store and add a retriever to it; in this case I'm going to use search type similarity with k=5, so it just brings back five results. I haven't put MMR on here, but that's something you would probably want to try as well. The idea with this is just to put the bare bones down, and then perhaps in future videos I'll start adding to it with some agent-style stuff and other things as well.
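As a sketch, loading that persisted store back and turning it into a retriever looks roughly like this, again assuming the db_hormozi directory name from the indexer:

```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Same embedding model as the indexer, needed again for query-time lookups
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Load the persisted index built by the indexer script
vectorstore = Chroma(persist_directory="./db_hormozi", embedding_function=embeddings)

# Plain similarity search returning the top 5 chunks
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

# MMR is the easy variation to try for more diverse results:
# retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5})
```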
Next up, we want to get our LLM. The LLM we're going to use is Gemma 2, and this is going to be the 9 billion one. So I'm just loading that in, setting max tokens to 512 (you could certainly play around with this), and setting temperature to zero. The keep alive is there so that you don't have the model loading and unloading each time you call it; here I've just set it to three hours. You can set it to be indefinite, or to a number of minutes or hours, whatever you want.
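In code, that's roughly the following sketch; num_predict is Ollama's name for the max-tokens setting, and the exact keep_alive value is whatever you prefer.

```python
from langchain_community.chat_models import ChatOllama

# Gemma 2 9B served locally by Ollama
llm = ChatOllama(
    model="gemma2",       # the 9B tag; use "gemma2:27b" for the larger model
    num_predict=512,      # cap on generated tokens (the "max tokens" mentioned above)
    temperature=0,
    keep_alive="3h",      # keep the model loaded between calls; -1 keeps it indefinitely
)
```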
Next we've got our prompt template. This should look really familiar for those of you who've seen a lot of my other RAG videos: we've got a prompt template that we're setting up here. You definitely want to tune this. I've started tuning it a little bit, but I'm not totally happy with it at the moment and would probably want to spend more time going through it. We're going to be passing in a context, and we're going to have a question in there.
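A bare-bones version of that template might look like this; the wording is just a placeholder, since the whole point is that you'd keep tuning it.

```python
from langchain_core.prompts import ChatPromptTemplate

# Minimal RAG prompt with a context slot and a question slot
prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the following context:

{context}

Question: {question}
"""
)
```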
At the moment I don't have any memory or anything like that going in here, so that's something I could add. One of the things I was experimenting with was using the Mesop UI, but there were just more challenges with the streaming aspect of it. You can certainly get it going with Mesop, and that can then help you with a lot of the history handling if you want, since it can pass that around.
All right, once we've got the chat prompt template set up, we need to make a chain. Our chain is just going to be a simple LangChain Expression Language chain. In here, we've got the retriever; obviously the retriever takes the question, so we're passing the question into that, and that question is the first thing we send into the chain. But we also want to pass the question on to the prompt as well: we're going to have the context coming into the prompt, and we want the question coming into the prompt too. That's why we've got this RunnablePassthrough in here. That all comes into our prompt, it goes through the LLM, and then we're just going to choose to stream it back. So you can see here I'm doing a streaming response, and we're going to basically run this.
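Put together, the chain is something like this sketch, with the streaming loop at the end being how you'd run it:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# The question feeds the retriever (which supplies the context) and is also
# passed straight through to the prompt, hence the RunnablePassthrough
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Stream the answer back token by token
for chunk in chain.stream("What is the rule of 100?"):
    print(chunk, end="", flush=True)
```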
All right, let's have a look at it in action. I'm just going to run the simple file we've been looking at, and I'll start off with a question: what is the rule of 100? I know he has a video about this. And you can see that after a second or two it's streaming back a result, and the rule of 100 states that you should do 100 primary actions every day to promote your business or make your products or services known. So that one worked really well. What if we try something totally different? Alex Hormozi is a pretty buff guy, and he occasionally talks about protein and stuff like that, so let's see what he comes up with. Okay, we got an on-point answer there. Let's ask it about this protein bomb. You can see we got quite a proper answer there too. Now, I don't actually have any memory or anything in this one, so we could certainly add that; you can also add it through the UI, and if we were using Mesop we would have gotten that sort of for free. All right, I'm going to quit out of this and talk about some of the other things you could do going along with this.
So one of the things I do for a lot of these is have a debugging script. This is often the first version I make: some sort of debugging one where I've got a bunch of different debugging things in there, just to see what's going on in the actual system. So let's say I come in here. First off, we've got the embeddings set to show_progress=True, so we can see when the embeddings kick in, and we can see some of the things coming back. Actually, this is using a much smaller DB; it's not the full one. Normally what I would do is have a folder of maybe just 10 files, try to get everything running and make sure it's working, and then scale up the index for the full thing. But in this one, okay, I've got this going now. What have I got here? Let me just bring this down a bit.
You can see that in our chain I've now also got this print-and-pass of the prompts, so I've got these two different functions where I can print out and see what's going into the actual LLM. This just helps immensely for knowing what's working and what's not: is it getting the right context, that kind of thing. Now, you can play around with this a lot. If I put this straight after this part, I could actually just print out the context, if that's all I wanted. It's very important, though, that if you go after this part you expect a dictionary input, and if you go after the prompt you expect the formatted prompt. So make sure your debugging function is expecting what will actually be coming in; otherwise you'll get frustrating errors.
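Here's a sketch of that debugging pattern: a small pass-through function dropped into the chain at the points you want to inspect. The function name here is my own placeholder, not necessarily what's in the actual script.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

def print_and_pass(x):
    # Print whatever flows through this point in the chain, then pass it on unchanged
    print(x)
    return x

debug_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | RunnableLambda(print_and_pass)   # here the input is a dict of context + question
    | prompt
    | RunnableLambda(print_and_pass)   # here it's the fully formatted prompt going to the LLM
    | llm
    | StrOutputParser()
)
```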
So I find these can be a really good way of just debugging something, to see where it's at and what I want to add to it going forward. I did look at some add-ons for this, some add-ons you could start thinking about. I mentioned the semantic chunker for the indexer before; that's definitely an interesting one to put in there. Another one is the multi-query retriever. Where you've got your retriever here, rather than returning five results from just the one question, we can set up a multi-query retriever where we add a prompt that takes the input query and rewrites five versions of it or so. Then we get responses back for each of those five versions and pass them all in.
It would be something like this, where the prompt is just taken from the LangChain docs. You would then go through and work this out: you want a parser so that each line comes back as a separate question, and you've got a little chain for that. So here I've got this multi-query chain, which is basically just the query prompt, the LLM, and the output parser. We can then feed that into the sort of standard RAG chain that we've got in here.
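As a sketch of that idea: the rewrite prompt below is adapted from the LangChain MultiQueryRetriever docs, and the parser and chain names are my own placeholders.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Prompt that rewrites the user's question into five alternative phrasings
query_prompt = PromptTemplate.from_template(
    """You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents
from a vector database. Provide these alternative questions separated by newlines.
Original question: {question}"""
)

def split_lines(text: str) -> list[str]:
    # Parser: each non-empty line of the LLM output becomes a separate question
    return [line.strip() for line in text.split("\n") if line.strip()]

# Little chain: one question in, a list of rewritten questions out
multi_query_chain = query_prompt | llm | StrOutputParser() | split_lines

# LangChain also ships a built-in wrapper for the same idea:
# from langchain.retrievers.multi_query import MultiQueryRetriever
# multi_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)
```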
All right, so that's something you can play around with and add to. And in the end, I'm really just scratching the surface here. Basically, my goal was to get something going with Gemma 2 and see how it comes out. Probably the big thing I would do next is go back and fine-tune my prompts, and really work on what I want from them: do I want the responses to be in the style of Alex Hormozi or something like that? Do I want my responses to be like mini reports with just bullet points of key points? Do I want my responses to be long? All of these things are things you should be thinking about, and you should be balancing the amount of context you're passing in with the amount you want to generate out as you go through this.
Anyway, this gives you a simple example of a LangChain RAG that's fully local. We're not calling the internet for anything here: we've got a local database, so we're not having to call some vector store in the cloud; we've got a local embedding model; and we've obviously got the local LLM, Gemma 2 in this case. One of the nice things about this is that once you've got it set up, it's very easy to swap in Llama 3, to swap out a bunch of different things on the fly, and also to test out different indexes.
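For example, swapping the model is a one-line change, assuming you've already pulled the other model with Ollama:

```python
# Swap Gemma 2 for Llama 3 without touching the rest of the chain
llm = ChatOllama(model="llama3", num_predict=512, temperature=0, keep_alive="3h")
```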
You can see here where I was testing out the semantic chunker. I would probably go for that, and just work out what the best breakpoint thresholds would be for this particular dataset, and try out some queries that I know are going to return different responses. So that gives you an idea of how you can do something like this, and how you can get started building something with Gemma 2 locally on your own computer.
So anyway, as always, if you've got any comments or questions, please put them in the comments below. If you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.