Reliable, fully local RAG agents with LLaMA3

Captions
Hi, this is Lance from LangChain. Meta Llama 3 came out today, which is super exciting and something I've been waiting for, so I want to hop on here and talk about how to build reliable agents using Llama 3 that can actually run locally on your laptop.

Just for a quick refresher: Llama 3 dropped today, and looking at the performance characteristics of the 8-billion-parameter model, they're very strong. I've done a lot of work with Mistral, which was previously my go-to, and on a number of popular benchmarks Llama 3 does look a bit better. I haven't tested it yet, so this is a first dry run, but it's really exciting.

To convince you that we can build local and reliable agents, I'm going to pick ideas from three different RAG papers. They're all fairly sophisticated, and they'll roll up into one interesting, complex RAG flow. We'll do routing from the adaptive-RAG paper, which takes a question and routes it either to a vector store or to web search based on the content of the question. We'll introduce the idea of fallback from the corrective-RAG paper: we retrieve from our vector store if the question is relevant to it, grade the retrieved documents, and if they're not relevant to the question we fall back to web search. And we'll do self-correction from the self-RAG paper: we check the generations for hallucinations and for relevance to the original question, and if they fail either check we again fall back to web search. The point is that we're going to implement an interesting, complex RAG flow and show that it runs reliably and locally. My laptop is a Mac M2 with 32 GB of RAM, so it's reasonably sized, but nothing insane.

First and foremost, what is an agent? That's a bit controversial in itself, but a really good blog post from Lilian Weng lays it out: an agent has planning, so it can break a task into smaller subgoals or subtasks; it has memory, such as chat history or long-term memory in a vector store; and it can use tools. Now, say we want to use an agent to build corrective RAG, the middle blue piece we just talked about. When people think about agents they usually jump straight to ReAct, a very popular framework for building them. For planning, the ReAct flow typically looks like this: the LLM selects an action, observes the result, thinks, and then chooses the next action. ReAct agents typically use memory, chat history or a vector store, and of course they can use different tools. If I implemented the corrective-RAG flow above as a ReAct agent, it would look like this: I take my question, first perform an action such as using my vector store to get documents, then observe the documents, think "okay, I need to grade them", go back to choosing an action, pick the grader tool, and keep going around this loop, hopefully following the trajectory laid out in the diagram.

Now I want to introduce a different way of implementing this: lay it out as a control flow. Instead of having the agent make a decision at every step of that loop, we, as engineers, lay out ahead of time the control flow we want the agent to take every time it runs.
I'm basically taking the planning away from the LLM, and what's nice is that the LLM then only has a specific task within each step. So in terms of planning, I lay out the control flow ahead of time. In terms of memory, I use what I'll call a graph state to persist information across the control flow, things relevant to RAG such as the documents and the question. And in terms of tool use, each graph node can use a different tool: the vector-store retrieval node uses a retriever, the grading node uses a grader, and the web-search node uses a web-search tool.

So what are the tradeoffs? With ReAct, one of the big challenges is lower reliability: the agent has to make the correct decision at every point, and that's where things go off the rails, particularly with small LLMs. With LangGraph, you lay the flow out ahead of time, so the agent effectively traverses the same path every run and the LLM doesn't have to make unconstrained choices about which node to go to next. In terms of flexibility, a ReAct agent is more flexible, since it can choose any sequence of actions given its tools, whereas the control flow I lay out with LangGraph is constrained and only ever traverses this path. But it's exactly because of this constrained control flow that these LangGraph agents are compatible with, and quite reliable on, local and smaller LLMs. That's the main benefit I want to bring home today.

Let's get to the code and start with the corrective-RAG piece, the middle part of the overall agent we want to build. I'm going to take a few of these components and test them individually so you can see them working. I have a notebook with a few pip installs, and my flow diagram to reference. For local embeddings I'm going to use GPT4All embeddings, which come from Nomic; we have a partner package with Nomic, and I really like these embeddings, they're really good. Llama 3 just came out and it's already available on Ollama, so all I have to do is `ollama pull llama3`. The only other thing to keep in mind is that Meta Llama 3 has a particular prompt format which we occasionally have to pay attention to. That's really it.

Let's kick this off. I set my local LLM to `llama3`, and the first thing I do is build an index of three web pages, blog posts that I like. I set a chunk size and use Chroma as a local vector store, and that all ran. So now I have an index, which is the key component of my RAG flow: I need to be able to retrieve documents.
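For reference, here is a minimal sketch of an index build like the one described. The exact URLs, chunk size, and collection name are illustrative assumptions (the video only says "three blog posts that I like"), not copied from the notebook.

```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative blog posts (Lilian Weng's posts on agents, prompt engineering,
# and adversarial attacks match the topics mentioned later in the video).
urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]

# Load each page and flatten into a single list of documents
docs = [doc for url in urls for doc in WebBaseLoader(url).load()]

# Split into small chunks suited to a local model's context window
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)
splits = splitter.split_documents(docs)

# Embed locally and store in a local Chroma vector store
vectorstore = Chroma.from_documents(
    documents=splits,
    collection_name="rag-chroma",
    embedding=GPT4AllEmbeddings(),
)
retriever = vectorstore.as_retriever()
```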
Now for some fun stuff: I want a retrieval grader, which retrieves documents and grades them for relevance relative to my question. Here's where Llama 3 comes in, and I'm going to use something really convenient: I've set my local LLM to `llama3`, and Ollama has a JSON mode which ensures the output from the LLM is JSON. My prompt basically just says "grade the document and return a JSON with a score of yes or no", that's it. I do a mock retrieval with the question "agent memory"; I can just call invoke on my retriever, which is a bit more convenient, and kick that off.

One other thing I did is turn on tracing in LangSmith, so as this runs I can inspect what's happening under the hood. We called ChatOllama and got nice JSON out, which is exactly what we want. So that's our grader.

For generation I'm just going to do good old RAG. I have a custom RAG prompt, nothing too unusual, I'm still using Llama 3, and I'm simply plumbing my documents and my question into it. You can see it runs pretty quickly; checking in LangSmith, the time is around 4 seconds, which is not bad, and I can look at my prompt, which contains the documents, and at the output. So we're rolling: we have our index, our grader, and our generation.

I'm also going to define a search tool, which I'll use to query the web. I like Tavily for this; it's a really nice, quick search tool.
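Here is a rough sketch of the retrieval grader and the RAG chain as described. The prompt wording is paraphrased rather than the exact prompt from the notebook, and `retriever` comes from the index sketch above.

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser

local_llm = "llama3"

# Grader: Ollama's JSON mode constrains the model to emit valid JSON,
# so the yes/no relevance score is easy to parse reliably.
grader_llm = ChatOllama(model=local_llm, format="json", temperature=0)
grade_prompt = PromptTemplate(
    template="""You are grading the relevance of a retrieved document to a user question.
Document:\n{document}\n\nQuestion: {question}
Return a JSON with a single key 'score' whose value is 'yes' or 'no'.""",
    input_variables=["document", "question"],
)
retrieval_grader = grade_prompt | grader_llm | JsonOutputParser()

# Generation: plain RAG -- stuff the retrieved documents and the question
# into a prompt and let Llama 3 answer.
llm = ChatOllama(model=local_llm, temperature=0)
rag_prompt = PromptTemplate(
    template="""Answer the question using only the context below.
Context:\n{context}\n\nQuestion: {question}""",
    input_variables=["context", "question"],
)
rag_chain = rag_prompt | llm | StrOutputParser()

# Quick test against the index built above
question = "agent memory"
docs = retriever.invoke(question)
print(retrieval_grader.invoke({"document": docs[0].page_content,
                               "question": question}))
context = "\n\n".join(d.page_content for d in docs)
print(rag_chain.invoke({"context": context, "question": question}))
```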
Now here's where I define my graph. Each of the green boxes in the diagram is a node, and each node is just a function: the "retrieve documents" step becomes a function called retrieve, generation becomes a function called generate, and grading documents is another function. Each of these functions, the nodes of my graph, takes in the state and modifies it in some way. The state is defined up top as a dictionary placeholder, and the way to think about it is that the state is the information I want to persist across my agent. It's that notion of memory: short-term memory that lives over the lifetime of the agent and contains everything I want the agent to be aware of throughout the control flow. For RAG that's intuitive stuff like the question, the generation, and the documents.

The retrieve node takes in my question, which is passed from the user, and does a document retrieval; I use invoke for this, which is a slightly nicer way to do it. The generate node is the same deal: I take the RAG chain we defined above and invoke it on the question and documents in my graph state. At every node we just write things like the question and documents back out to state, so we update the state at each node.

Grading is another node: we go through our documents, grade them for relevance, filter out the ones that aren't relevant, and if any document was deemed irrelevant we turn on a web_search flag so we'll go ahead and do web search. Web search is my final node; it hits the Tavily search API and appends those search results to the documents in my state.

Now here's where I introduce a conditional edge. The nodes we just talked about all take in state and modify it in some way: retrieval grabs documents and adds them to state, grading filters the documents in state. An edge is where I make a decision, based on the state, about where to go next, so this is where I can implement interesting logic. What I'm doing here is really simple: I previously set that web_search flag in the grade-documents node, so in this edge I take in my state, check whether web_search is set, and if yes go to web search, if no go to generate. The string I return is just the name of the next node, "websearch" or "generate". That's it, we've defined all our nodes and edges.

Now all we need to do is build the graph, and this graph build reads very nicely: I'm just implementing the control flow I want. I register all my nodes, then set the order of the nodes: the entry point is retrieve, I go from retrieve to grading, then I add my conditional edge, following the diagram, and after web search I go to generate, and after generate I end. I compile that, and now I can run my graph.

Let's see if this works. I kick it off, printing the steps as we go: it ran the retriever, now it's grading, the first document is relevant, the second document is relevant. Over in LangSmith I can really dig in: I can look at every document getting graded, the individual grade prompts, the individual documents, it's all logged for us. And it ran, so that's great. We just built a simple agent: it has memory, it has state, it has planning via the control flow, and it uses tools. It's an agent, and it ran locally on my laptop. That's step one.
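Here is a condensed sketch of that corrective-RAG graph. It reuses `retriever`, `retrieval_grader`, and `rag_chain` from the sketches above; the state fields, node bodies, and Tavily handling follow the walkthrough but are assumptions rather than the exact notebook code.

```python
from typing import List, TypedDict
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.documents import Document
from langgraph.graph import StateGraph, END

web_search_tool = TavilySearchResults(k=3)

class GraphState(TypedDict):
    question: str
    generation: str
    web_search: str          # "Yes" if any retrieved doc was irrelevant
    documents: List[Document]

def retrieve(state):
    docs = retriever.invoke(state["question"])
    return {"documents": docs, "question": state["question"]}

def generate(state):
    context = "\n\n".join(d.page_content for d in state["documents"])
    answer = rag_chain.invoke({"context": context, "question": state["question"]})
    return {"generation": answer}

def grade_documents(state):
    filtered, web_search = [], "No"
    for d in state["documents"]:
        grade = retrieval_grader.invoke(
            {"document": d.page_content, "question": state["question"]}
        )
        if grade["score"] == "yes":
            filtered.append(d)
        else:
            web_search = "Yes"   # any irrelevant doc triggers the fallback
    return {"documents": filtered, "web_search": web_search}

def web_search(state):
    results = web_search_tool.invoke({"query": state["question"]})
    web_doc = Document(page_content="\n".join(r["content"] for r in results))
    return {"documents": state["documents"] + [web_doc]}

def decide_to_generate(state):
    # Conditional edge: return the *name* of the next node to visit
    return "websearch" if state["web_search"] == "Yes" else "generate"

workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("websearch", web_search)
workflow.add_node("generate", generate)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents", decide_to_generate,
    {"websearch": "websearch", "generate": "generate"},
)
workflow.add_edge("websearch", "generate")
workflow.add_edge("generate", END)

app = workflow.compile()
for step in app.stream({"question": "What is agent memory?"}):
    print(list(step.keys()))   # print each node as it runs
```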
Now let's beef this up a little. I can throw in the self-RAG pieces, shown in green in the diagram. I just need two new graders, built with the same ideas we just covered: one grades the generations for hallucinations, and one grades the generations for relevance to my question. Let me add them up here for convenience.

Here's my hallucination grader. I kick it off with a simple test, and all it does is determine whether the answer is grounded in my documents: if it's grounded, yes, otherwise no. Same idea for the answer grader: does my generation answer the question? You can look at the prompts, and I'll share all this code of course. That all runs.

Now all I need to add, and this is actually pretty simple, is one additional conditional edge to my graph, so let's scroll down to where my edges are. Before, we defined decide_to_generate as a conditional edge, making the decision after document grading. The next decision is this hallucination conditional edge: if my hallucination grader finds hallucinations, we feed back, and we'll see that shortly. This additional conditional edge actually wraps both of my checks. First the hallucination grader looks at the generation relative to the documents and returns a grade; if the grade is yes, the generation is grounded in the documents, which is good, and we move on to test whether it's relevant to the question. So the edge returns one of three things: either the generation is "not supported" by the documents (it has hallucinations), or it is grounded, in which case it's either "useful" or "not useful" relative to the question.

That's how we set up the conditional edge, and all we need to do now is update the graph build. We map the outputs of the conditional edge, "useful", "not useful", and "not supported", to the nodes we want to go to next: if it's not useful (it doesn't answer the question) we fall back to web search; if it's not supported we try again and go back to generate; otherwise it's useful and we finish.

Let's go ahead and try that. We retrieve, we check document relevance, the documents are relevant, so the agent is chugging along. It's doing generation now, and we can open the trace in LangSmith and watch the whole thing in real time: the grading, the generation, and then the second grading step, and you can really drill into each of these pieces. I really like having the traces here so I can see what's going on under the hood; it's all nicely laid out, and you can collapse the parts you don't want to see. So it did the generation, checked for hallucinations, found that the generation is grounded in the documents, and then found that the generation addresses our question, which is really cool. We're really rolling here.
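A sketch of the two extra graders and the combined conditional edge follows. It reuses `grader_llm`, `PromptTemplate`, `JsonOutputParser`, `workflow`, and `END` from the earlier sketches; the prompt text is paraphrased, and in a real build this edge would take the place of the plain generate-to-END edge shown before.

```python
hallucination_prompt = PromptTemplate(
    template="""You are grading whether an answer is grounded in a set of facts.
Facts:\n{documents}\n\nAnswer: {generation}
Return a JSON with a single key 'score' whose value is 'yes' or 'no'.""",
    input_variables=["documents", "generation"],
)
hallucination_grader = hallucination_prompt | grader_llm | JsonOutputParser()

answer_prompt = PromptTemplate(
    template="""You are grading whether an answer is useful to resolve a question.
Answer:\n{generation}\n\nQuestion: {question}
Return a JSON with a single key 'score' whose value is 'yes' or 'no'.""",
    input_variables=["generation", "question"],
)
answer_grader = answer_prompt | grader_llm | JsonOutputParser()

def grade_generation_v_documents_and_question(state):
    docs = "\n\n".join(d.page_content for d in state["documents"])
    grounded = hallucination_grader.invoke(
        {"documents": docs, "generation": state["generation"]}
    )
    if grounded["score"] != "yes":
        return "not supported"      # hallucination -> regenerate
    useful = answer_grader.invoke(
        {"generation": state["generation"], "question": state["question"]}
    )
    return "useful" if useful["score"] == "yes" else "not useful"

# In the graph build, this conditional edge replaces the plain
# generate -> END edge from the earlier sketch:
workflow.add_conditional_edges(
    "generate", grade_generation_v_documents_and_question,
    {"not supported": "generate", "useful": END, "not useful": "websearch"},
)
```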
We really only need one more piece: the router. The router is pretty easy and builds on what we just did. Again I'm going to use JSON mode, but here I basically say: given the question, and given what's in my vector store (I tell it the vector store covers LLM agents, prompt engineering, and adversarial attacks), if the question relates to those topics use the vector store, otherwise fall back to web search, and return either "vectorstore" or "websearch". Let's do a quick test to make sure it works: I pass a question related to my vector store, and it decides to use the vector store. Easy.

So I want one more edge, route_question. Let me throw it in here with the rest of my edges; I can probably get rid of these extra prints, but why not keep them. It follows exactly what we did before: we look at the question, invoke the router, and depending on the router's output, if the source is web search we go to web search, and if it's the vector store we go to the vector store. Really simple stuff.

Now let's build the graph. This time we set a conditional entry point: the router decides whether to go to web search or to the retriever, and the rest of the control flow is the same as before. Let's try it: first it routes the question, and it decides to go to the vector store, as we expect. Then it follows the same flow we talked about before, so we've just implemented this final routing piece. Looking at the trace, we can dig into all of it: the router makes the right decision and goes to the vector store, then it retrieves, then it grades the documents, all as before, and we can look at the generation. It looks like it finished. Very cool.
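A sketch of the router and the conditional entry point is below, again reusing `grader_llm`, `PromptTemplate`, `JsonOutputParser`, and `workflow` from the earlier sketches; the prompt wording and key names are paraphrased, not the notebook's exact code.

```python
router_prompt = PromptTemplate(
    template="""You route user questions to a vectorstore or to web search.
The vectorstore contains documents about LLM agents, prompt engineering,
and adversarial attacks on LLMs. Use the vectorstore for questions on those
topics; otherwise use web search. Return a JSON with a single key
'datasource' whose value is 'vectorstore' or 'websearch'.

Question: {question}""",
    input_variables=["question"],
)
question_router = router_prompt | grader_llm | JsonOutputParser()

def route_question(state):
    source = question_router.invoke({"question": state["question"]})
    return "websearch" if source["datasource"] == "websearch" else "retrieve"

# Instead of a fixed entry point, let the router pick the first node to run
workflow.set_conditional_entry_point(
    route_question,
    {"websearch": "websearch", "retrieve": "retrieve"},
)
```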
Now let's sanity-check this with a question about current events: who are the Bears expected to draft first in the NFL draft? We route to web search, generate, check for hallucinations, and we can look at the trace to confirm what's going on. The answer comes back: the Bears are expected to draft USC star Caleb Williams, which is the consensus pick, so that looks great.

So, in a relatively short period of time we've seen that we can build a pretty complex RAG flow: routing, retrieval grading, interesting decision points, fallback to web search, and grading of generations against two different criteria. It runs reliably, it runs locally on my laptop, and it runs with Llama 3 8B. Looking at the LangSmith traces and latencies, the whole thing ran in about 14 seconds, which is pretty good for something running locally, and this is a non-trivial RAG flow introducing ideas from three papers, all done locally. This idea of control flows is really what lets you lay these agents out in a way that a local agent can run reliably, and I think that's the important point to bring home. I encourage you to play with this, I'll make sure the code is public, and hopefully this is useful. Feel free to leave any comments. Thanks.
Info
Channel: LangChain
Views: 100,073
Id: -ROS6gfYIts
Length: 21min 19sec (1279 seconds)
Published: Fri Apr 19 2024