RAG from the Ground Up with Python and Ollama

Video Statistics and Information

Captions
Retrieval-augmented generation, or RAG, is a technique that allows large language models to interact with data sets or documents of any size. In today's video we're going to build up a fundamental understanding of this process by coding our very own RAG pipeline, step by step, using Python and Ollama. We don't need to mess with fine-tuning, and we don't need any more compute than it takes to run the model. By the end of the video you're going to have a script that you can use to answer questions about your very own documents. Let's go.

First, let's set up our Python environment in a clean folder called rag. We start by creating our virtual environment (which we've covered in previous videos), activate it with the source command, and finally install our dependencies: ollama, which we're already familiar with, and numpy, a great Python library for working with numbers and mathematical operations.

Now that we have our dependencies installed, I want to quickly cover a couple of the Ollama functions we'll be using today. We open a Python interpreter and import ollama. The first method I want to cover is the chat completion; you should be familiar with this already from my intro Ollama video, but if not, I'll link it above. The format we're following is ollama.chat, where we pass in the name of the model we want to use, followed by a list of messages. Each message has a role, which is one of user, system, or assistant, and a content field, which is whatever our prompt happens to be. We hit enter, wait a minute for the model to load into memory, and then we get a JSON response. It contains a lot of meta information, but it also has the content: the actual answer to the query we asked.

The next Ollama function I want to address is called embeddings. I'll explain what an embedding is later, but for now just know that this method takes in a prompt, just like the chat completion, and gives back a list of numbers. That's all we need to worry about for now.
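As a quick reference, here's roughly what those two calls look like in a Python interpreter; this is a minimal sketch, assuming the mistral model is already pulled locally:

    import ollama

    # chat completion: a list of messages, each with a role and a content field
    response = ollama.chat(
        model="mistral",
        messages=[{"role": "user", "content": "Why is the sky blue?"}],
    )
    print(response["message"]["content"])  # the model's actual answer

    # embeddings: takes a single prompt and returns a long list of numbers
    result = ollama.embeddings(model="mistral", prompt="Why is the sky blue?")
    print(len(result["embedding"]))  # the dimensionality of the embedding vector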
Since we're building a RAG pipeline, the first thing we're going to need is source material. I'm going to use Project Gutenberg, which is a great source of freely downloadable, public-domain books; in my case I'm downloading Peter Pan. Browsing the file, one thing to call out right away is that the formatting is a little odd: even where the text is a single sentence, it has line breaks scattered throughout, so we'll want to keep that in mind as we start parsing this document. I've downloaded the source document and stored it as peterpan.txt, and I'm going to hide the file tree so we can focus on the code.

The first thing we want to do is parse this document. We saw the layout of the file, where each line is not necessarily one sentence or one paragraph, so I'm going to write some custom logic to handle that and group the text into paragraphs. First I define my function parse_file, which takes the file name, and I open the file in the standard Pythonic way, storing the reference to the file as f. The encoding we pass just helps us avoid any odd characters that might otherwise end up in the output. Next I create two variables: paragraphs, the total list of paragraphs we're building, and buffer, where we append each string as we work through a single paragraph. Then we read each line of the file and strip it, so that a line containing only a newline character or a tab collapses down to an empty string. If the line is truthy, meaning it isn't empty, we append it to the buffer. Otherwise, if the line is empty, which is how our file denotes breaks between paragraphs, and there's something currently in the buffer, we join everything in the buffer with a space, append it to the paragraphs list, and reset the buffer. Basically, we go line by line, and when we hit an empty line we say that everything we've seen above it is one paragraph, concatenate it with spaces, and append it to our list of paragraphs. Once we exit this loop, having read the whole file, the buffer might still hold some lines we've added, so we repeat the check: if there's anything left in the buffer, concatenate it with spaces and append it to the paragraphs. Finally we return our list of paragraphs.

Now we can add that to our code: we parse the file given the file name, and I print out the first ten paragraphs. Let's see if this works. Great, we get a list of things that appear to be paragraphs, and if we compare this to our input file we see that we correctly split on the blank lines and got clean paragraphs, without any stray spaces or newline characters in our corpus.
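Here's a sketch of that parsing function as described; I'm assuming a utf-8-sig encoding, which strips a byte-order mark if the Gutenberg file has one, and the variable names are my own:

    def parse_file(filename):
        # read the raw text and regroup the hard-wrapped lines into paragraphs
        with open(filename, encoding="utf-8-sig") as f:
            paragraphs = []
            buffer = []
            for line in f.readlines():
                line = line.strip()
                if line:
                    buffer.append(line)        # still inside the current paragraph
                elif buffer:
                    # blank line = paragraph break: join what we've collected so far
                    paragraphs.append(" ".join(buffer))
                    buffer = []
            # flush anything left in the buffer after the last line
            if buffer:
                paragraphs.append(" ".join(buffer))
            return paragraphs

    paragraphs = parse_file("peterpan.txt")
    print(paragraphs[:10])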
Now that we've written a function to parse our document and return atomic paragraphs, we need some way of determining which of these paragraphs is most relevant, or most similar, to a given query. The industry-standard way of doing this is with embeddings, so let me explain what an embedding is by way of example. Let's think about how we might describe and compare fruit. What are some ways you would describe fruit? You might say one is sweeter or tarter, it's one color versus another, it's bigger or smaller. You can take each of these as an axis on a chart, like I've done here. Once we do that, we see that more similar fruits are closer together in "fruit space" and dissimilar fruits are further away. You can even imagine that each axis is on a scale of zero to one, and each fruit's coordinates are now how we describe that fruit. These coordinates are basically how embeddings work: similar fruits have similar embeddings. Now, while we know what each dimension of this coordinate system means in our toy example, that won't be the case for model-generated embeddings. Furthermore, while our toy embeddings have only two dimensions, sweetness and color, model-generated embeddings are on the order of hundreds of dimensions. But the important takeaway is that the clustering effect still applies: similar items will have similar embeddings, regardless of the dimensionality.

Okay, now that we understand what embeddings are, let's put them to work in our project. The first thing we want to do is import ollama, and then I've defined this function get_embeddings, which just wraps the embeddings interface from Ollama itself. We pass it a model name, in case we want to use an embedding-specific model, and a list of chunks. It performs a list comprehension over the chunks, and for each one it gets that chunk's embedding from Ollama. Moving on, we say embeddings equals get_embeddings, and we'll just use the mistral model for now, though I'd recommend an embedding-specific model. Finally we pass it the paragraphs, but because this is an entire book's worth of paragraphs, I'm only taking a slice from index 5 to 90 to give us a sample of what this will look like. And because we know embeddings are just a big list of numbers, I simply print out the length of the embeddings we get back.

Now we can run python main.py, and you'll notice that this actually takes a long time. Especially as we run this file repeatedly over the course of this demo, we don't want to generate those embeddings fresh each time, because an embedding is always going to be the same for a given input, assuming we're using the same model. Okay, we finally got our output, and it took a really long time; I'm curious how long it actually took, so let's find out. I go up to our imports and import time, a standard Python library, then wrap our get_embeddings call: start equals time.perf_counter(), which records the current time in seconds, and once the call completes I print time.perf_counter() minus start. Let's see what this spits out. Wow: to generate just 85 embeddings it took over 16 seconds, so we're really going to want a way to speed this up.

I think a better way of doing this is, instead of generating the embeddings fresh each time we run the script, to save them to disk. Any time we want a specific file's embeddings again, we can check whether that file exists on disk and read the embeddings from there, so let's implement that. Above our get_embeddings function I've created a function called save_embeddings, which takes a file name and our list of embeddings. The first thing it does is check whether an embeddings directory exists, creating it if necessary, and then it opens a JSON file named after the source file, in our case Peter Pan, and dumps the embeddings straight into it. To facilitate this we need a couple of imports: os and json. So now we have the ability to save embeddings; next we'll need some way of loading them back.
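Before moving on to the loading side, here's a minimal sketch of the pieces so far: the uncached get_embeddings wrapper with the timing code around it. The model name and slice are the ones used in the walkthrough, and paragraphs comes from parse_file above:

    import time

    import ollama

    def get_embeddings(modelname, chunks):
        # one embedding per chunk, generated with a list comprehension
        return [
            ollama.embeddings(model=modelname, prompt=chunk)["embedding"]
            for chunk in chunks
        ]

    start = time.perf_counter()
    embeddings = get_embeddings("mistral", paragraphs[5:90])
    print(time.perf_counter() - start)  # seconds spent generating embeddings
    print(len(embeddings))              # how many embeddings we got back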
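And the saving half of the cache, roughly as described; the embeddings directory name follows the walkthrough:

    import json
    import os

    def save_embeddings(filename, embeddings):
        # make sure the embeddings directory exists, then dump the list as JSON
        if not os.path.exists("embeddings"):
            os.makedirs("embeddings")
        with open(f"embeddings/{filename}.json", "w") as f:
            json.dump(embeddings, f)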
For loading, I write a new function called load_embeddings. Given a file name, it checks whether the corresponding .json file exists; if it doesn't, we know there's nothing cached and we just return False. Otherwise we open that file in read mode, parse the JSON, and return the result. Finally, we need to modify our get_embeddings function to use these two functions we just defined, so I'm going to replace it. Our interface is now a little different: it takes a file name in addition to the model name and the chunks. The first thing we do is try to load the embeddings from disk and store the result in the embeddings variable; if that isn't False we just return the embeddings, because we know they were persisted and we don't need to create them again. If that fails, we fall through to our embedding-generation code, just like before, then save those embeddings so we can look them up later and return them. The only change we need to make in our main script is to add the file name to our get_embeddings call. Now I'm going to pass it the full paragraph set, so this will take a really long time, but after this I shouldn't need to wait for the embeddings ever again. Let's give it a shot. Okay, that took a really long time, but if we go back into our directory we now have an embeddings folder and a peterpan.json holding our huge list of embeddings; the file is massive. However, if we run the script again, now that the embeddings are pre-generated, the timer tells a different story: it took less than a second to load, and we now have an embedding for every single paragraph in the book.

Once we have the embeddings for each paragraph in the entire document, there's only one more embedding we need to create, and that's the embedding for the query itself. In our case I'm creating the prompt "who is the story's primary villain", and we embed it using Ollama as we normally would. What we need to do next is compare this prompt embedding to each of our paragraph embeddings to determine which ones are most relevant to our query. If you remember our fruit example, this would be easy to do in two dimensions, but it's a lot harder when the dimensionality is more like 700 or 800. To facilitate that I'm going to use a tool called cosine similarity. This is very mathy, so I'll just have you accept it as a given; it's a mathematical tool for calculating the similarity between two vectors. Next we define a function find_most_similar, whose arguments are a needle, the thing we're comparing against, and a haystack, the list of embeddings we're comparing the needle to. The primary thing happening here is a list comprehension that calculates the cosine similarity for each item in the haystack we provided. The return value is a sorted list of tuples, where the first element of each tuple is the similarity score, the output of our cosine similarity, which essentially says how similar each paragraph is to your query on a scale of zero to one, with one meaning they're basically identical, and the second element is that paragraph's index in the list. Finally, we reverse the sort so that the highest scores come first and the lowest scores sit at the bottom of the list.
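Here's how the caching side might look in code, following the description above; it relies on the ollama, os, and json imports and the save_embeddings helper from the earlier sketches:

    def load_embeddings(filename):
        # return False if we've never embedded this file before
        if not os.path.exists(f"embeddings/{filename}.json"):
            return False
        with open(f"embeddings/{filename}.json", "r") as f:
            return json.load(f)

    def get_embeddings(filename, modelname, chunks):
        # reuse the cached embeddings if they already exist on disk
        embeddings = load_embeddings(filename)
        if embeddings is not False:
            return embeddings
        # otherwise generate them fresh, then save them for next time
        embeddings = [
            ollama.embeddings(model=modelname, prompt=chunk)["embedding"]
            for chunk in chunks
        ]
        save_embeddings(filename, embeddings)
        return embeddings

    embeddings = get_embeddings("peterpan", "mistral", paragraphs)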
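And a numpy-based sketch of cosine_similarity and find_most_similar; the formula is just the dot product of the two vectors divided by the product of their norms:

    import numpy as np

    def cosine_similarity(a, b):
        # dot product over the product of the two vector norms
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def find_most_similar(needle, haystack):
        # score every paragraph embedding against the prompt embedding,
        # keep its index, and sort with the best matches first
        scores = [
            (cosine_similarity(needle, item), index)
            for index, item in enumerate(haystack)
        ]
        return sorted(scores, reverse=True)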
Now let's use this in our script. We call find_most_similar, where the needle is the prompt embedding and the haystack we're searching through is our list of embeddings for the entire document, and from that we take only the top five most relevant results. Then we iterate through each result, printing the similarity score, which sits at index zero of the tuple, and the paragraph located at that item's index. Let's run the script and see what we get. We correctly got back five responses, with the first part of each response being the similarity score against our prompt. Our prompt is "who is the story's primary villain", and the results we get are a little abstract; they're not directly related to Captain Hook, which is what we would expect, so let's think of some ways we could improve that. Two main things come to mind. First, we're actually embedding sentences instead of paragraphs: we'd expect each chunk we create an embedding for to be multiple sentences, but due to the formatting of the book, how the author laid it out, it turns out we're embedding individual sentences. That means that while each match may be more specific, it carries less information and less context overall, which is less helpful to the LLM. One exercise I'll leave to you is adding a chunking function somewhere around here that groups multiple sentences into three- or four-sentence chunks before embedding them. Second, as I called out earlier, the model you use for embedding really does matter. We're using mistral because I expect most of you to have it on your systems already, but I'd really recommend a model like bge-base, which is specifically designed for creating embeddings. It also yields much smaller embeddings, so the process will be faster and probably more accurate as well.

Moving along, one of the final steps in our process is taking those relevant chunks and passing them to an LLM to answer questions about. The first thing I want to do is create a system prompt that cues our LLM about its role. I've written a prompt that should be a good place to start: "You are a helpful reading assistant who answers questions based on snippets of text provided in context. Answer only using the context provided, being as concise as possible. If you're unsure, just say that you don't know." Then I leave it open to pass the context in, and we use that system prompt right here. We call Ollama's chat method, pass in the model mistral, and pass in our array of messages. The first message is the system prompt, so it comes from the role of the system: we take our prompt template, and for each of our most similar chunks we join the paragraph corresponding to that index onto the end of it. Then we add one message from the user, which is just the question we want our LLM to answer. Lastly, once we get a response, we print it to the console. Let's see what we get: "The primary villain in this context is Captain Hook", and then it elaborates a bit on the role of the crocodile. I think this is really good; it's able to take our text, digest it, and give us a reasonable answer.
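Tying it together, the tail end of the script might look roughly like this; it continues from the helpers above, the variable names are my own, and the system prompt wording follows the one read out in the walkthrough:

    SYSTEM_PROMPT = (
        "You are a helpful reading assistant who answers questions based on "
        "snippets of text provided in context. Answer only using the context "
        "provided, being as concise as possible. If you're unsure, just say "
        "that you don't know.\nContext:\n"
    )

    prompt = "who is the story's primary villain?"
    prompt_embedding = ollama.embeddings(model="mistral", prompt=prompt)["embedding"]

    # grab the five paragraphs whose embeddings are closest to the prompt's
    most_similar_chunks = find_most_similar(prompt_embedding, embeddings)[:5]
    for score, index in most_similar_chunks:
        print(score, paragraphs[index])

    response = ollama.chat(
        model="mistral",
        messages=[
            {
                "role": "system",
                "content": SYSTEM_PROMPT
                + "\n".join(paragraphs[index] for _, index in most_similar_chunks),
            },
            {"role": "user", "content": prompt},
        ],
    )
    print(response["message"]["content"])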
As one final touch, I'd like to provide my prompt directly from the command line, so I'm going to replace our fixed prompt with an input call that reads straight from the terminal. Let's run it again. What do I want to know? Where does Peter take Wendy in the story? "Peter takes Wendy to Neverland." Perfect.

So, in less than 100 lines of code we wrote our own RAG pipeline entirely by hand, including a caching layer, and we can use it on any document of any size. You might have noticed that we spent a lot of time thinking about how to parse our document and how to break it up into chunks we can pass to our LLM. In future videos I'm going to look at libraries like LangChain and LlamaIndex, which provide helpers that do a lot of this work for us, give us easy ways of parsing documents like CSVs or even PDFs, and offer much more advanced techniques for comparing the similarity of embeddings as well as storing and serving them. Let me know in the comments what you thought about this project. Do you feel like you have a better understanding of how RAG works? Could you explain embeddings to somebody if they asked? How do you see yourself using this project in your own life? Thanks for joining me, and if you enjoyed this video, please consider liking and subscribing.
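For reference, that command-line tweak is just a one-line change to the script; the exact prompt text is up to you, something like:

    # read the question interactively instead of hard-coding it
    prompt = input("what do you want to know? -> ")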
Info
Channel: Decoder
Views: 5,434
Keywords: ollama, machine learning, large language models, LLM, python, RAG, self-hosting, artificial intelligence
Id: V1Mz8gMBDMo
Length: 15min 32sec (932 seconds)
Published: Mon Mar 25 2024