Building Corrective RAG from scratch with open-source, local LLMs

Captions
Hi, this is Lance from the LangChain team. I'm going to talk about building self-reflective RAG apps from scratch, using only open-source, local models that run strictly on my laptop.

One of the most interesting trends in RAG research, and a family of methods that have become popular in recent months and weeks, is this idea of self-reflection. When you do RAG, you perform retrieval from an index based on a question. Self-reflection says: based on, for example, the relevance of the retrieved documents to my question, or the quality of the generations relative to my question or to the documents, I want to perform some kind of reasoning and potentially feed back and retry various steps. That's the big idea, and there are a few really interesting papers that implement it. What I want to show is that implementing these ideas with something we've developed recently called LangGraph is a really nice approach, and it works well with local LLMs that are much smaller than, say, API-gated, very large-scale foundation models.

We're going to look at a particular paper called Corrective RAG, or CRAG. This paper has gotten some attention, for example on Twitter, and it's a really neat piece of work; the idea is actually pretty simple and straightforward. If you go down to the figure in the paper: you perform retrieval, then you grade the documents relative to the question, doing a relevance assessment. There's then a heuristic: if the documents are deemed correct, the paper does some knowledge refinement, further stripping the documents to compress and retain the relevant chunks within them. If the documents are deemed either ambiguous relative to the query or incorrect, it performs a web search and supplements retrieval with the web results. That's the big idea, and it's a nice illustration of a general principle: don't just do RAG as a single-shot process where you perform retrieval and then go straight to generation. You can perform self-reflection and reasoning, you can retry, you can retrieve from alternative sources, and so forth.

In our build here we're going to make some minor simplifications. Here's the layout of the graph we're interested in: we perform retrieval, and for that we'll use Nomic embeddings, which run locally. We'll build a node for grading those documents relative to the question, to say whether they're relevant or not, and if any document is deemed irrelevant we'll do a query rewrite and a web search, and then go on to generation based on the web search results. That's the flow.

First things first: how do I get started running LLMs locally? Where I often direct people, and what I've found really useful, is Ollama. It's a really nice way to run models locally, for example on your Mac laptop, very easily, and they're launching support for various other platforms as well. If you go to their website, it's very simple: you download their application (you can see it's running here on my machine), and once you have it installed you can browse their model list, which I believe is sorted by popularity.
Mistral, obviously a really interesting open-source model, is near the top; you can see it has something like 210,000 pulls. If I click on it, I get a model page, and the Tags tab shows a bunch of model versions that I can very easily download and run. I'm going to choose Mistral instruct, their 7-billion-parameter instruct model.

Over in my notebook, all I've done so far is a few pip installs, and I've set a few environment variables to use LangSmith; we'll see why that's useful later. Now, for Ollama, I run `ollama pull mistral:instruct` to pull the model I want. Normally this takes a little while because you're actually downloading the model, typically a couple of gigabytes; I already have it, so it finishes quickly. That's really all you do. Then I create a variable, local_llm, set to the Mistral instruct model I just pulled. That's the LLM I'll work with: it's local on my system and served by Ollama, which runs in the background, so it's really seamless and easy to use.

The first thing I want to do for this approach is indexing, because corrective RAG needs an index that I care about and am actually performing RAG on. Here I'm going to use a particular blog post that I like, on autonomous agents. We can pull it up and have a look: it's a pretty neat, long, and meaty post, so it's a good target for retrieval, with lots of detail. I load it, then split it using a chunk size of 500 tokens; these are somewhat arbitrary parameters you can play with as you like. The point is just to build a quick local index. Now, here's the interesting bit: for embeddings I'm going to use GPT4All embeddings from Nomic. If we pull up the link, you can see it's a CPU-optimized, contrastively trained sentence model; you can drill into the sentence-transformers page to see the underlying work. The key point is that this is a locally running, CPU-optimized embedding model that works quite well: it runs on your system, no API, nothing, and it's fast. I'm also going to use Chroma, an open-source local vector store that's really easy to spin up. All I'm doing is taking my documents, defining a new collection with the GPT4All embedding model, and creating a retriever from it.
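To make this concrete, here's a minimal sketch of the setup so far. The blog URL (Lilian Weng's agents post) and the chunk overlap of 100 are assumptions, since the video only specifies the 500-token chunk size; the imports reflect the langchain_community package layout from around when this was recorded:

```python
# Shell, run once: pull the model so Ollama can serve it locally.
#   ollama pull mistral:instruct

from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.vectorstores import Chroma

local_llm = "mistral:instruct"  # the tag pulled via `ollama pull`

# Load the blog post (assumed URL) and split it into ~500-token chunks.
url = "https://lilianweng.github.io/posts/2023-06-23-agent/"
docs = WebBaseLoader(url).load()
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, chunk_overlap=100  # overlap is an assumed value
)
splits = splitter.split_documents(docs)

# Index the chunks in a local Chroma collection with GPT4All embeddings.
vectorstore = Chroma.from_documents(
    documents=splits,
    collection_name="rag-chroma",
    embedding=GPT4AllEmbeddings(),
)
retriever = vectorstore.as_retriever()
```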
There we go; it prints some parameters, so cool, I have a retriever. We can actually call it with get_relevant_documents and something like "agent memory," just to test. And okay, look at that: it's nice and quick, and we get a bunch of documents out that relate to memory; you can see "memory stream" in there, so the documents are sane. Everything seems to be working, and we have a retriever.

Now let's think a little about what we want to do next. When I build these kinds of logical RAG flows as graphs, I always try to lay out the logic first, and in each logical step what's happening is that I'm transforming state. In these graphs, really all you're doing is defining a state that you modify throughout the flow of the graph. In this case, because we're interested in RAG, our state is just going to be a dictionary, and, as I've laid out schematically here, it contains a few keys relevant to RAG: a question, then documents appended to the dict, and eventually a generation. That's really all that's going on in terms of how state is propagated through the graph: at every node you make some modification to state. So you start with a question from the user; you perform retrieval relevant to the question; you then grade the documents, which is a modification of the documents key; then you make a decision: are they relevant or not? If not, you transform the query (modifying the question) and do a web search, and the final step is a generation based on the documents. That's the flow.

What I want to call out here is that there's one very important conditional edge where, depending on the results of the grading step, I do one thing or another: I make a decision. And I want to show you something very convenient that we can use with Ollama to help here: Ollama's JSON mode. The basic logic behind that conditional edge, decide_to_generate, is going to be something like this. I already have a prompt laid out, and it basically takes a document and my question and does a comparison: is the document relevant to the question? But here's the catch: because I want that edge to process very particular output, either yes or no, I need the output structured in a way that can be reliably interpreted downstream in my graph. This is where JSON mode from Ollama is really useful. All I do is import ChatOllama, reference that local model I specified earlier (mistral:instruct, which I've downloaded, so it's available locally), and set the flag format="json" to tell the model to output JSON specifically. In my prompt I basically say: you're a grader; here's the document; here's the question; give a binary score, yes or no, and provide it as JSON with a single key, "score," and no preamble or explanation. So I explain in the prompt what I want, and calling the model in JSON mode enforces that JSON is returned, hopefully with the single "score" key and a binary yes/no value we expect.
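Here's roughly what that grader looks like as code; it's a sketch, with the prompt wording paraphrased from the description above:

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

# format="json" turns on Ollama's JSON mode, constraining the output.
llm = ChatOllama(model=local_llm, format="json", temperature=0)

prompt = PromptTemplate(
    template="""You are a grader assessing the relevance of a retrieved
document to a user question. Here is the retrieved document:
{document}
Here is the user question: {question}
Give a binary score 'yes' or 'no' to indicate whether the document is
relevant to the question. Provide the score as JSON with a single key
'score' and no preamble or explanation.""",
    input_variables=["document", "question"],
)

# Chain: fill the prompt, call the local model, parse the JSON string.
retrieval_grader = prompt | llm | JsonOutputParser()

# Quick test: grade one retrieved document against a question.
question = "agent memory"
docs = retriever.get_relevant_documents(question)
print(retrieval_grader.invoke(
    {"document": docs[0].page_content, "question": question}
))
# expected shape: {'score': 'yes'}
```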
So I run this as a chain: I supply the prompt to my LLM and then parse the JSON string out into a JSON object I can work with. Let's try it. We run retrieval for a question, get our docs, and grade one of them by passing in the question and a single document, using the document's page_content, which is basically all its text. Let's test that quickly... it's still running... now it's finished. Checking the output, we get JSON back that is just the score, yes or no. That's exactly what we want.

We can also look under the hood at that grading process in LangSmith. We can see that our prompt got populated with the context: here's the document, here's the question, and the task was of course to grade it. So here's the full prompt ("you're a grader assessing the relevance of a retrieved document...") and here's the model output: score yes. This is really nice: we've enforced the output format from our local LLM using JSON mode, so we know that every time it will output a binary yes/no score as a JSON object, which we then extract. That's a key point I wanted to flag: it's a very nice thing Ollama offers, and it's extremely helpful when building these kinds of logical graphs, where you really want to constrain the flow at certain edges.

A lot of the rest of this is actually pretty straightforward. Let's now define our graph state, the dictionary we pass between nodes. This is just some code I'm going to copy over; it defines the graph state as a dict, and that's really all there is to it. Next, I'll copy over code that implements a function for every node and every conditional edge in our graph. If you remember how our graph is laid out, for every node drawn there's a corresponding function that performs some operation. Retrieve, for instance, just uses the retriever we defined, calls get_relevant_documents, and writes the results out to state. We take the state dict into the function, extract the question from it, do retrieval, and write the state dict back out. Think of every node as doing some modification on the state: read it in, do something, write it back out. We can march across the diagram and see how each node is implemented as a function, and in every case we're using, for example, ChatOllama. In some cases we don't need JSON mode: a plain generation step doesn't, but grading does.
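As a sketch, the state and the retrieve node might look like this; wrapping everything under a single "keys" entry is one assumed way to shape the dict-style state described above:

```python
from typing import Any, Dict, TypedDict

class GraphState(TypedDict):
    """The state passed between nodes: a single dict of keys."""
    keys: Dict[str, Any]

def retrieve(state):
    """Node: retrieve documents for the question and write them to state."""
    state_dict = state["keys"]
    question = state_dict["question"]
    documents = retriever.get_relevant_documents(question)
    return {"keys": {"documents": documents, "question": question}}
```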
So here we implement the same thing we just showed, ChatOllama in JSON mode. We generate our score every time, extract the grade from that JSON, and we know the grade is constrained to yes or no. Then, and here's the key point, we do some logical reasoning on it: if the grade is yes, we append the document because it's relevant; if not, we filter that document out and also set a flag to perform a web search. What's really happening here is a logical gate: if a document is scored as relevant, we add it to our final list of filtered documents; if not, we set the search flag to yes and don't include that document in the output. You can see we return a dictionary containing our filtered documents, the question, and the flag for whether to run a web search. It defaults to no, but if we ever encounter an irrelevant document we flip it to yes. That's really all that's going on here.

You can see we do our query transform down here, again just using Mistral with a transform prompt; you get the idea. For the web search node we use Tavily, which is a really nice, quick way to perform web searches, and we simply supplement the documents with the web search results. And then the final piece: we wrote yes or no to our search key, and depending on that state, which we read in here, we decide either to return "transform_query" or to return "generate," which determines the next node to go to. That decide_to_generate function is our conditional edge, right here: it looks at what grade_documents wrote out, in particular the search yes/no key in our dict, and determines the next node to traverse to.

Now that we've copied over all of these functions, we can run that cell and lay out our graph. Here's where we define the full graph organization, how the nodes connect: we add the nodes first, we set our entry point, and then we add the edges between the nodes accordingly. The logic here just maps over to our diagram; that's really all that's happening. See the sketch below.
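Here's a hedged sketch of the grading node, the conditional edge, and the graph wiring; transform_query, web_search, and generate are assumed to be node functions written in the same read-state/write-state pattern as retrieve above:

```python
from langgraph.graph import StateGraph, END

def grade_documents(state):
    """Node: keep only relevant documents; flag a web search if any fail."""
    state_dict = state["keys"]
    question, documents = state_dict["question"], state_dict["documents"]
    filtered_docs, run_web_search = [], "No"
    for d in documents:
        grade = retrieval_grader.invoke(
            {"document": d.page_content, "question": question}
        )["score"]
        if grade == "yes":
            filtered_docs.append(d)  # relevant: keep it
        else:
            run_web_search = "Yes"   # any irrelevant doc triggers a search
    return {"keys": {"documents": filtered_docs, "question": question,
                     "run_web_search": run_web_search}}

def decide_to_generate(state):
    """Conditional edge: route based on the grading flag."""
    if state["keys"]["run_web_search"] == "Yes":
        return "transform_query"
    return "generate"

# Wire the graph. transform_query, web_search, and generate are assumed
# to be defined elsewhere in the same pattern as the nodes above.
workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("transform_query", transform_query)
workflow.add_node("web_search", web_search)
workflow.add_node("generate", generate)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {"transform_query": "transform_query", "generate": "generate"},
)
workflow.add_edge("transform_query", "web_search")
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", END)

app = workflow.compile()
inputs = {"keys": {"question": "Explain how the different types of agent memory work?"}}
for output in app.stream(inputs):
    print(list(output.keys()))  # prints each node name as the graph traverses it
```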
Cool. Now let's see it all working together. I'm going to compile my graph and ask a question: "Explain how the different types of agent memory work." Let's keep our diagram up for reference. When I call this, it traverses every step along the way and prints out what's happening. You can see it performs retrieval, then the grading steps, all running locally; the documents were all deemed relevant, so it goes ahead and generates, and there we go.

Let's look at what happened under the hood in LangSmith. We can see that at each of these steps we called ChatOllama with our Mistral 7B model running locally. This is our grading step, with each document being graded, and again it outputs a binary yes/no score as a dict, which is great. Further down, all of our documents are graded, and then here's the final LLM call, which packed everything into our RAG prompt: "You're an assistant for question-answering tasks; use the following context to answer the question." Here are all of our docs, and here's the answer. So this multi-step logical flow all works.

Now let's try something interesting: I'll ask a question that I know is not in the context and see if it performs that fallback to web search. I'll ask, "Explain how AlphaCodium works." That's a recent paper that's not relevant at all to this blog post, so retrieval should not be deemed relevant. Let's run it and convince ourselves that's true. And good, this is perfect: the grader determines these documents are not relevant, so it makes the decision to perform web search, going down the lower branch to transform the query and run the search. It all ran, and it tells us AlphaCodium is an open-source AI code-generation tool developed by CodiumAI, which is exactly what it is. We can go into LangSmith again and see what happened. The trace is a little more extensive this time because all of our grades came back irrelevant; again we get the nice JSON out. Here's our question-rewriting node ("provide an improved question without any preamble": "What is the mechanism behind AlphaCodium's functionality?"), so it modifies the question; we use Tavily search right here, which retrieves material related to AlphaCodium; and then we finally pass that to our model for generation based on this new context. And there we go: AlphaCodium, an open-source AI code-assistant tool.

That gives you the main idea, and the key point is that this is all running locally. I used GPT4All embeddings for indexing at the top, and I used Ollama with Mistral 7B instruct, with JSON mode for the one crucial step where I needed to constrain the output to a yes/no score; for everything else I used the model without JSON mode to perform generations, like the question rewrite and the final generation. In any case, I hope this gives you an overview of how to think about building logical flows, not necessarily just RAG, though RAG is a really good use case, using local models and LangGraph.

The thing I want to leave you with is this: there's a lot of interest in complex logical reasoning with local LLMs, and a lot of focus on using agents. I want to encourage you to consider whether, depending on the problem you're trying to solve, you actually need an agent. It's possible that implementing a state machine or graph as shown here, with a series of logical steps (which can incorporate cycles or loops back to prior stages; we have some more complex examples that show this), can work really well with local models,
because a local model is only performing a single step within each node. You're constraining it to do one small thing at a time: just rewrite the question, just grade the document, rather than using the local LLM as an agent executor that has to make all of these decisions jointly, in a less controlled workflow where, for example, the ordering of the various tasks can be determined arbitrarily by the agent. Here we nicely constrain the logical flow and let the local model do a small task at each step, and I've found that to be a lot more reliable and really useful for these kinds of logical reasoning tasks. So hopefully this is helpful; give it a try, and we'll make sure all this code is easily shared. Thank you.
Info
Channel: LangChain
Views: 81,914
Id: E2shqsYwxck
Length: 26min 0sec (1560 seconds)
Published: Fri Feb 16 2024