How to Build, Evaluate, and Iterate on LLM Agents

Captions
[Music] Hey everyone, my name is Diana Chan Morgan and I work at deeplearning.ai running all things community. Today we have an amazing workshop with some of our course partners to bring together what's next for LLM agents. Today we're working with LlamaIndex and TruEra, and they will guide you through the entire process of building, evaluating, iterating on, and deploying successful LLM agents. This session will be recorded and the slides will be sent afterwards. We also have a notebook that our speakers were kind enough to share, so we will drop that in the chat for you to be able to access and follow along during the session.

This workshop utilizes cutting-edge open-source tools like LlamaIndex, a simple and flexible data framework for connecting custom data sources to large language models, and TruLens, a powerful platform for testing and tracking LLM app experiments. In this session you'll gain valuable insights into building your first LLM agent, evaluating effectiveness, hallucinations, and bias, iterating toward production-ready applications, and maintaining high performance in production. For any questions for the speakers, we've also dropped a link in the chat where you can ask and vote on the questions that we will answer in the last five to ten minutes of the workshop today. This workshop was also inspired by our new short course that just launched last week, called Building and Evaluating Advanced RAG Applications. I'm sure our speakers will talk a little more about it today. You can find it on our website, we'll also send it out afterwards, and you can find it in the link in the chat as well.

To start off, I want to introduce our first speaker, Jerry. Jerry Liu is the CEO of LlamaIndex. He brings a wealth of experience from his previous roles as an ML engineering manager at Robust Intelligence and a research scientist at Uber. Hey Jerry, really happy to have you here today. Hey, thanks for having me. Absolutely. And next we have Anupam. Anupam Datta is the president and chief scientist at TruEra and is a renowned AI expert and former professor at Carnegie Mellon University. His research focuses on ensuring accountability and fairness in AI systems. We're so happy to have you here today. Great to be here, Diana. Absolutely. Well, I'll let you guys take it away for the workshop. I know you have a lot planned, so I think our community is ready to dive into everything. Awesome, perfect, I'll let you take it away.

Awesome. The slides are visible? Yep. Great, so let's get started. Today's workshop is about how to build, evaluate, and iterate on LLM agents. Like Diana said, Jerry and I just recorded a course and released it on deeplearning.ai which focuses on RAG-based applications. You can think of this as a continuation of that, expanding into agents, and later on Jerry will comment more on the technical connection between the two.

Just to orient ourselves, we'll take a few minutes to look at some examples of frameworks and actual agent-based LLM apps that are starting to get quite a bit of adoption. ChatGPT put out their plugins some time back, as many of you are aware, and with these plugins, or LLM agents, you can augment ChatGPT to access up-to-date information, run computations, or use third-party services. We'll share the slides later, and these are hot links, so you can explore them. These are some examples of the first set of applications that were available; you can see that with these agents, for instance on Instacart, you can place orders.
So these agents can start planning for you; they can start acting on your behalf to order things like groceries, help you plan trips, and so on, or even do math. So that's one set of applications.

There were a couple of other very early, seminal pieces of work in this area. One is a very interesting result in a paper called ReAct, which combines reasoning and acting with large language models. This has been a very influential paper in the space of agents. Before it, thinking around reasoning, such as chain-of-thought reasoning, where an LLM, when asked a question, reasons through the answer first and then provides it, which is very widely used, was separated from acting, where the LLM is used to act on the environment on your behalf for various tasks, like planning your travel or making restaurant reservations. With ReAct, these two bodies of work were pulled together through an interleaving of reasoning and actions, and that significantly increased the power of agent-based LLM apps.

Let me give you a very quick example, taken from the ReAct paper. The question asked here is: aside from the Apple Remote, what other device can control the program the Apple Remote was originally designed to interact with? If you just ask a standard LLM, the answer was "iPod," which is incorrect. If you make the LLM reason through the answer step by step, even then it comes back with the wrong answer. If it only acts, it starts doing an external search, but the search does not produce a meaningful result. So on the left here are standard prompting, reason-only, and act-only, a separation between reasoning and acting where you do one at a time without interleaving; none of these produce the right result. On the right, where thinking and actions are interleaved, you suddenly start seeing a much better result. The thinking here was: well, I need to search for the Apple Remote. So it invokes a search API, and the search comes back with "Front Row media center" as a keyword. The second thought is that it needs to search for Front Row. The first search for Front Row does not give anything very concrete, but suggests some similar things to look at, which include the software, so the next round of thinking results in the action of searching for "Front Row (software)," and eventually it finds the right answer: you can also control these devices with keyboard function keys. The details of this example are not so important; the main thing is this interleaving of reasoning, or thinking, with actions, and this back and forth is really where a lot of the power of LLM-based agents derives from. I encourage you to take a look at this set of results, the ReAct paper, and its GitHub repo to get a sense of the history of one of the important developments that spurred the recent excitement about agents.
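To make the pattern concrete, here is a minimal sketch of a ReAct-style loop in Python. This is not the paper's implementation: the llm() and search() helpers are hypothetical stand-ins for a real model call and a real search API, and the prompt format is purely illustrative.

```python
# Minimal sketch of a ReAct-style interleaved loop. llm() and search() are
# hypothetical helpers: llm() is assumed to return a (thought, action, arg)
# triple parsed from the model's output; search() calls some search API.
def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Reasoning step: ask the model for a thought plus a proposed action.
        thought, action, arg = llm(
            "Continue this trajectory with a Thought and an Action "
            "(Search[term] or Finish[answer]):\n" + transcript
        )
        transcript += f"Thought: {thought}\nAction: {action}[{arg}]\n"
        if action == "Finish":
            return arg  # the model decided it has the answer
        # Acting step: execute the action and feed the observation back in,
        # so the next reasoning step can build on it.
        observation = search(arg)
        transcript += f"Observation: {observation}\n"
    return "No answer found within the step budget."
```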
The other project I'll point you to is AutoGPT. AutoGPT came out some time back and became one of the fastest-growing open-source projects on GitHub; you can see in this relatively recent screenshot that it has about 150k-plus stars. And this hackathon, which was recently held in October, has lots of examples of interesting LLM-based applications that you can explore, for example some interesting things around coding and so on.

Now, as people started diving into agent-based applications, some limitations started emerging as well, some failure modes. Often these applications involve augmenting LLMs with additional tools over APIs, to do search, to look for travel itineraries, to look for restaurant reservations, and so on. As you increase the number of tools, the LLM has to reason, based on a query, about which tool is the right one to use as part of creating the answer, and that's where mistakes happen; that's a common failure mode. There can be infinite loops. There are also hallucinations, a problem that is well known and very common with large language models and appears in this context too. We'll talk about these in a minute.

As we think about the space of AI agents, we think about it in three stages. There are specialized data agents, which are quite similar to retrieval from a vector store and the retrieval-augmented generation (RAG) kind of architecture, but with access to real-time information and some additional pieces. There are general data agents that have access to more than one tool and can accomplish a wider range of tasks. And then there are agents that can take action in the real world. We'll focus quite a bit on data agents in this talk, and in that context we will introduce you to how to build data agents with LlamaIndex and how to evaluate them systematically, understand the failure modes, iterate, and improve them by leveraging TruLens. These are our GitHub repos, and our developers enjoy it when you request features and make contributions; please give us stars as well if you like what you see there. With that, I'm going to transition over to Jerry, who will talk about how to build LLM agents with LlamaIndex. Jerry?

Awesome, yeah, thanks Anupam. Maybe taking a really quick pivot right here: we're featuring our course that just came out last week, Building and Evaluating Advanced RAG Applications. The course itself is focused a little more on advanced RAG techniques, retrieval-augmented generation. For those of you less familiar with RAG, RAG is basically a technique for augmenting your LLM with an external knowledge corpus. The way it works is you first index your data into a storage system, then you do retrieval from the storage system at query time, and then add that context to the prompt window of the LLM. There's a simple way of doing that, but these days a lot of developers are excited about exploring advanced methods, and to try out advanced methods you also need a rigorous system to do evaluations on both the retrieval and generation pieces. So we were really excited to work with TruEra on this course, because TruLens has a robust evaluation suite and LlamaIndex has invested a lot in advanced retrieval techniques, and we combined them into this overall short course. You should definitely check it out.

This also serves as a nice bridge into how we think about how agents evolve from RAG. There are these two buzzwords floating around, RAG and agents, so how do we really think about the transition between them? The way RAG works, we can think about it as a box: the user has a question, a query; RAG is this overall pipeline defined over your knowledge corpus; and within this overall RAG box we do retrieval from the knowledge corpus as well as synthesis, and after that we get back a final response.
One of the ways we think about agents, next slide, is as an additional layer in front of the RAG pipeline. Given a user query, we can feed it to an overall agent that can more dynamically decide how to route the query through to relevant tools to synthesize the response. In a traditional RAG setup, the LLM call occurs at the end of the overall pipeline: you do retrieval, typically a top-k embedding lookup, and then feed the results to the language model at the end. But here the idea is: what if we actually use the LLM at the beginning as well, to take in the query and figure out how to make use of the underlying tools, whether that's a RAG pipeline, an external API call, or some other service? So we think of agents as wrapping a layer on top of RAG that dynamically enriches the query with additional information and allows this higher-level abstraction to use tools in the right way to give you back a response. In the next few slides we'll go into a bit more detail about the architecture of how this works.

A few months ago we defined this concept of data agents within LlamaIndex, which are LLM-powered knowledge workers. A data agent is designed to help you automate knowledge work: it can do search and retrieval, synthesis, and also modify data. An example flow is given in this slide. Given a data agent, you can for instance make a call to read the latest emails you have, say from Gmail or some other external service. Afterwards, you could retrieve additional context from your knowledge base; this is a perfect example of a RAG pipeline as a tool within an agent, so this by itself could be a RAG tool. You could have an analysis agent that analyzes the file, for instance using a tool like a code interpreter or another RAG pipeline. And then you could have another tool where calling it takes actions and modifies state; in this case you send an update to a third-party service like Slack, and that triggers a message to be written to that service.

Next slide. The way we think about agents is that the core components are really, one, an agent reasoning loop, which Anupam covered a little in terms of a ReAct loop, and two, tools. The agent reasoning loop typically operates over the set of tools that you give to the agent, and it's typically powered by a large language model, because that allows it to dynamically decide, for instance, which tools it should use next given the user input, in order to synthesize the right results. The tools themselves can be, for instance, what we call query engine tools, i.e. RAG pipelines, so you can take all the techniques you learned in the short course we just discussed and plug them in as RAG tools to this overall agent. We also have a variety of different tools on LlamaHub that connect to third-party services; we have over 20 or 25 tools there, including API interfaces to services like Slack, Notion, Zapier, and a code interpreter.
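As a flavor of how small the core setup is, here is a minimal sketch of a data agent with a single function tool. It follows the documented LlamaIndex pattern of roughly this era (~v0.8/0.9); exact module paths may differ in newer versions, and the multiply tool is just a toy example.

```python
# A minimal data agent: the reasoning loop (an LLM) decides when to call
# the tool. Import paths follow llama_index ~0.8/0.9 and may have moved since.
from llama_index.agent import OpenAIAgent
from llama_index.tools import FunctionTool

def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result."""
    return a * b

# The function's signature and docstring become the tool's interface
# description that the agent reasons over.
multiply_tool = FunctionTool.from_defaults(fn=multiply)

agent = OpenAIAgent.from_tools([multiply_tool], verbose=True)
print(agent.chat("What is 121 * 3? Use a tool."))
```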
All of these LlamaHub tools act as API interfaces designed specifically for LLM agents, so you can mix and match, for instance, your RAG query engine tools with your LlamaHub tools and combine them into an overall agent that can both do search and retrieval and take actions on different services.

I think Anupam covered what ReAct is in the previous few slides. There are of course a few other ways to perform agent reasoning. One is a function-calling loop over the OpenAI agent; under the hood the API service handles that for you and decides which functions should be called given the inputs you provide. There are also other algorithms like tree-of-thought and plan-and-execute, and a few others; there are a few papers coming out of these conferences every year.

Next slide. Maybe a quick comment here on how you actually do search and retrieval from a knowledge base, which touches on the point about plugging in a RAG pipeline as a tool within this overall agent. Elaborating on this a little: in LlamaIndex, the TL;DR is that we have a bunch of these things called query engines, which are, for all intents and purposes, generalizations of RAG pipelines; they take in a user query and give you back a response. We have a variety of these query engines: semantic search, which is basically a standard RAG pipeline; summarization, where you return all the context and synthesize an answer over all of it instead of just the top k; query engines over structured data, for instance text-to-SQL over a SQL database; and advanced RAG pipelines, some of which you learn in the short course. These enable you to do things like document comparisons, as well as combined querying over hybrid structured and unstructured data.

The way the LlamaIndex architecture works is that you can plug all these query engines in as tools to an agent. So you have this outer agent layer, which acts as a reasoning loop, and if it calls a tool, the tool executes, and that tool is a query engine. This gives you the ability to compose more advanced reasoning abstractions and layers on top of your data: the agent first handles the user query, decides which tools to call, and then executes the tools. There's a variety of examples of this in the notebooks. It allows you to answer more complex queries, do a bit of query planning to execute some things in parallel, dynamically do both semantic search and summarization as well as text-to-SQL, and combine information from disparate data sources.

Next slide. An example here, from the screenshot: we have an example agent, and let's say we plugged in two query tools, one for Uber and one for Lyft, where each query tool is just a RAG pipeline over each company's annual report, the 10-K filing. So each tool corresponds to a top-k RAG pipeline over that company. You plug both tools into the agent, and now you can ask a question like "compare and contrast Uber and Lyft's revenue growth." The agent, through its reasoning loop, chain-of-thought process, and ReAct reasoning, will break the question down into sub-questions over the tools, and then each sub-question plus tool will be executed; each tool is executed with its sub-question, and within each sub-question we do per-document RAG. So first we look at Uber's revenue growth: go to the Uber tool, do top-k RAG, get back the answer. Then we look at Lyft's revenue growth: do top-k RAG, get back the answer. And we combine the results at the end. So this is just a concrete example.
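A sketch of what that Uber/Lyft setup might look like in code, assuming local copies of the 10-K filings (file names here are placeholders) and llama_index ~0.8/0.9 import paths:

```python
# Each 10-K gets its own RAG pipeline (query engine), exposed to the agent
# as a QueryEngineTool with a name and description the agent reasons over.
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.agent import OpenAIAgent
from llama_index.tools import QueryEngineTool, ToolMetadata

uber_docs = SimpleDirectoryReader(input_files=["uber_10k.pdf"]).load_data()
lyft_docs = SimpleDirectoryReader(input_files=["lyft_10k.pdf"]).load_data()

uber_engine = VectorStoreIndex.from_documents(uber_docs).as_query_engine(similarity_top_k=3)
lyft_engine = VectorStoreIndex.from_documents(lyft_docs).as_query_engine(similarity_top_k=3)

tools = [
    QueryEngineTool(query_engine=uber_engine, metadata=ToolMetadata(
        name="uber_10k", description="Answers questions about Uber's 10-K filing.")),
    QueryEngineTool(query_engine=lyft_engine, metadata=ToolMetadata(
        name="lyft_10k", description="Answers questions about Lyft's 10-K filing.")),
]

# The agent decomposes the comparison into per-document sub-questions,
# runs top-k RAG inside each tool, and combines the results.
agent = OpenAIAgent.from_tools(tools, verbose=True)
print(agent.chat("Compare and contrast Uber and Lyft's revenue growth."))
```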
In the next two slides we'll talk about some general architecture decisions for how to think about modeling these agents, and the slides after that go into detail about a lot of the failure modes as well as how you actually evaluate these agents. Some general things to think about: one is how do you handle large responses from tools? If a tool returns an entire essay, or a very large web page, how do you handle that? Some strategies we have, which we call the on-demand loader tool and the load-and-search tool, involve on-the-fly indexing of this data. Sometimes it's nice to first load the data and index it into vector storage, so that when you query the data you loaded from the tool, you do search and retrieval over it instead of trying to stuff everything into the context window. Even though context windows are getting bigger, they still overflow on large amounts of data, and indexing beforehand is a way to mitigate that.

Next slide. Another problem, which I think Anupam also touched on, is that a lot of these agents tend to struggle when you overload them with tools. If, for instance, you have more than about five tools, and in the limit you might have hundreds, thousands, even millions of tools, at a certain point this isn't going to fit into your context window. So what you can do is index the tools themselves, index the metadata of the tools, and then at query time you first do search and retrieval over the relevant tools and pass those to your agent to answer the question. These are just some general considerations, and we'll do a section at the end talking about some best practices for constructing agents, but hopefully this gives you a general overview. Passing it back to Anupam.
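For the many-tools case, the LlamaIndex docs of this era describe retrieving tools via an object index. The sketch below follows that pattern, assuming all_tools is your long list of tools; the class names (ObjectIndex, SimpleToolNodeMapping, FnRetrieverOpenAIAgent) are taken from that era's documentation and should be verified against your installed version.

```python
# When there are too many tools to fit in context, index the tools
# themselves and retrieve only the relevant ones per query.
from llama_index import VectorStoreIndex
from llama_index.agent import FnRetrieverOpenAIAgent
from llama_index.objects import ObjectIndex, SimpleToolNodeMapping

# all_tools: assumed to be a long list of tools defined elsewhere.
tool_mapping = SimpleToolNodeMapping.from_objects(all_tools)
obj_index = ObjectIndex.from_objects(all_tools, tool_mapping, VectorStoreIndex)

# The agent first retrieves, say, the top-2 most relevant tools for each
# query, then runs its reasoning loop over just those.
agent = FnRetrieverOpenAIAgent.from_retriever(
    obj_index.as_retriever(similarity_top_k=2), verbose=True
)
```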
Great, yeah, thanks a lot, Jerry. Now that you have a good sense of some of the key building blocks for how to build agents, and especially how to do it with LlamaIndex, where Jerry and team have driven a lot of the early work in this space, we'll shift gears a bit and start looking at evaluation: what are some of the failure modes of agents, how do you detect those kinds of issues, and how do you iterate to improve them?

I'll do a quick introduction to TruLens in this context. TruLens is an open-source library to track and evaluate your LLM experiments. As you're building your applications with LlamaIndex, for example, or any other Python framework, you can use TruLens fairly easily as part of that application-building process, and we'll show you a notebook; Diana is also sharing it, so you'll have access to the notebook on GitHub to play with yourself. As you're building your LLM application with, say, LlamaIndex, with just a few lines of code you can connect TruLens to it and start logging the records: the prompts, the responses, the intermediate results, the entire call trace. Then, interestingly, we have this nice abstraction of what we call feedback functions, or evaluations, which lets you log and evaluate the quality of your LLM applications along a number of different dimensions. A number of these are available out of the box, and you can add your own evaluations very easily as well. Once you have those set up, you can start exploring the records and evaluation results in a TruLens dashboard, see what the failure modes are, and that in turn can inform iteration and the selection of the best LLM application version for your use case.

So let's look at a concrete use case where we may want to use the kind of data agents Jerry introduced, for real-time retrieval, calling out to external APIs and tools. The user comes in with an input, a query, and the LLM acts as the reasoning agent: it picks a tool from the set of tools available to it, translates the user query into an input appropriate for the tool, gets back the output from the tool, and then produces the final response. In this process, what are some agent failure modes? The agent might, in the reasoning step, pick the wrong tool: if it's a question about restaurants, instead of picking Yelp it might pick something like arXiv and give you a research paper that talks about restaurant recommendations. That would not be very appropriate, but that kind of mistake in picking the right tool is pretty common, and it's a common failure mode for agents, especially as the number of tools goes up. The agent can get stuck in infinite loops, especially because many of these agents, at least the basic ones, don't carry state. API calls might fail, say if the input is not formatted correctly because the query translation was not done in the right format. You might also have other kinds of failure modes related to hallucinations.

So let me talk about a very concrete use case now, which is what's available in the notebook that's been shared with you. We'll talk about a restaurant-information chatbot, where the user starts asking questions and the agent makes use of the Yelp API, so one tool in this case, to answer them. And we'll compare that with a baseline where the LLM is just answering these questions based on its pre-trained knowledge, without access to an external API. Those are the two versions we will compare. This is available in this notebook; here is a QR code so you can play with it after the session if you like. It will also be shared with you, and it's available publicly on GitHub. To make things interesting, we instructed the agent to respond in the style of Gordon Ramsay and asked questions about restaurant recommendations.

Let me show you a couple of examples. The first question we asked was: what is the best restaurant, or the best diner, in Toronto? This went just to the OpenAI LLM, without access to the Yelp reviews, and this is the answer that came back; I'll give you a second to read it. Remember that the prompt was to respond in the style of Gordon Ramsay, and as you can see, it's not necessarily particularly respectful.
We asked the same question of the LLM agent, which had access to the Yelp APIs, and this is the response that came back. It's a bit more polite, maybe because there was an answer that was actually retrieved, which the LLM then had to summarize. So now we have a couple of these answers, and we want to be able to think through them and have tooling to do evaluations. Let me circle back and start talking about a framework for evaluations of LLM agents.

Before we go into LLM agents, I want to recap one slide from our deeplearning.ai course on advanced RAG techniques that Jerry and I recently launched, which you should take a look at; it's just one hour. In it, we talk about the RAG triad for guarding against hallucinations. If you think of a RAG, the user comes in with a query, much like in this example, "what's the best diner in Toronto?" Based on that query, a set of contexts, or chunks, is retrieved from a vector database, and those then get summarized by an LLM to produce the final response. Along each edge of this triangle there is a property you want to test for. On the first edge there's a property we call context relevance, which checks whether the retrieved pieces of context are relevant to the query that was asked. On the second edge there's groundedness: is the final response supported by the retrieved context? And finally, with answer relevance, we check whether the final response is relevant to the query that was asked. So this is the RAG triad, and in the deeplearning.ai course we discuss how to implement these kinds of evaluations within TruLens, programmatically and automatically, and you can make use of them out of the box.

Now, once we transition from RAGs to agents, where things get a little more sophisticated is in that first step Jerry was talking about. Much as Jerry said, especially for data agents, you can think of the initial reasoning step of the agent as a layer of abstraction over a RAG. So the RAG triad becomes the agent quad. When the user asks a query, there is a tool selection step, executed by the agent, which needs to be evaluated as well, and it involves both picking the right tool and translating the original user query into a query in the right format for the tool that was selected. Once that is done, the next few steps are similar to what you would do for a RAG-based application: you want to check for context relevance, groundedness, and answer relevance.
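For reference, this is roughly how the RAG-triad side of these checks is wired up in trulens_eval, following the ~0.18-era API used in the course; names and module paths may have shifted in later releases.

```python
# The RAG triad as TruLens feedback functions (trulens_eval ~0.18 era).
import numpy as np
from trulens_eval import Feedback, TruLlama
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI as OpenAIProvider

provider = OpenAIProvider()  # uses an LLM as the evaluator

# Answer relevance: is the final response relevant to the user's question?
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

# Context relevance: is each retrieved chunk relevant to the question?
f_context_relevance = (
    Feedback(provider.qs_relevance, name="Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)  # lens into retrieved chunks
    .aggregate(np.mean)
)

# Groundedness: is the final response supported by the retrieved context?
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(TruLlama.select_source_nodes().node.text)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
```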
What I'll do next is give you some concrete examples of how this plays out when we work with the example of the LLM agent that makes use of the Yelp API, and then Jerry and I will tag-team and share a bit of the flow and the code in the notebook that makes it all real.

In this first step, for this particular example, because there's only one tool, the Yelp API, tool selection simply reduces to query translation. The original question asked was "what's the address of Gumbo Social in San Francisco?" and the LLM agent translated it into this query for the Yelp API: "address of Gumbo Social in San Francisco." The automated evaluation available in TruLens checks for semantic equivalence between these two questions, leveraging LLMs. Some of the evaluations available in TruLens make use of large language models as part of the evaluation process; there are also other ways of doing evaluations that make use of standard NLP metrics, as well as smaller models, which offer some trade-offs between effectiveness and scalability. If you look at this example, the LLM agent does a good job of translating the original question into the kind of keywords that work with the Yelp API, and the evaluation gives it a high score on a scale of 0 to 1, a 1, meaning it did a good job of query translation.

Next we look at the second step in the quad, which is context relevance. We get back this answer from the API, and now we're asking: is this answer relevant to the question that was asked? This also works well here; the answer is quite relevant to the question, so it gets a high score as well.

The third test is around groundedness. Groundedness looks at whether the final response is grounded in the set of retrieved contexts. In this case, the final response very closely follows the retrieved context: the final response is "the address of Gumbo Social in San Francisco is" this particular address, and the retrieved context was just the address itself. So the final response provided by the LLM is well supported by the evidence in the retrieved context, in this case from the Yelp API, and it gets a high score. In general, these steps could fail: the LLM could hallucinate, make some stuff up, and produce a final sentence or response that is not backed up by the information retrieved from the Yelp API. In this example it relies on that information exclusively, so the groundedness score is high.

Here's an example where the groundedness score is not as high. This was the Toronto diner example. You can see there are a few sentences in the final response. The first sentence says the best diner in Toronto is subjective and can vary depending on personal preferences, and the supporting evidence for that, retrieved from the Yelp API, was: "it is difficult to determine the best diner in Toronto without prior knowledge; however, based on the context information provided, here are some diners worth considering." So this part is again well supported; it says there's no unique answer to the question of the best diner in Toronto. But then the LLM injected a bit of its own opinion: "I recommend you try them out and decide for yourself." And it also snuck this in: "just make sure to have low expectations, as most diners in Toronto are mediocre at best." That last sentence, perhaps the LLM responding to being prompted to act like Gordon Ramsay (it just can't get away from it), is not well supported by the retrieved context, and so you can see it gets a low score. In general, the groundedness evaluation is super powerful in identifying areas where the agent makes stuff up and engages in hallucinations.

And here's another example, from New York; maybe I'll go through it quickly, but if you take a look: for the best pizza places in New York, it talks about Rubirosa, Lombardi's, and so on, and then it injects some sentences: "these places are known for their delicious and authentic pizzas," and a few other things, like "Rubirosa offers a variety of thin-crust pizzas." These are likely things the LLM knows because it was pre-trained on such a huge dataset that it may have picked them up from its training data, or they're general trends it learned from its training data about the kinds of things available in pizza places; but they're not backed up by the retrieved pieces of context.
So this is super helpful, because it's showing you where you don't have nice traceability of the final response back to individual pieces of retrieved context from the Yelp search. Those are some examples of both successfully grounded responses and ones with blind spots and made-up content that is not grounded.

Finally, we can look at answer relevance, where again we're checking whether the final response is relevant to the question that was asked. That turns out to be the case here, and over here, for Toronto, where there's a less clear answer, it gets a lower score. As you experiment with these data agents, you can also use this kind of comparative view in TruLens, where you can look at different versions. Remember that we are trying out two different versions: one is just the OpenAI chat completion, without access to the Yelp reviews, so just a simple chatbot, whereas over here we have access to the Yelp agent. You can see that certain metrics change: the agreement with a source of ground truth goes up quite a bit as you go from the simple chat-based application to the one that makes use of agents.

I'm going to pause here. The next thing we wanted to do is get into the notebook itself, to show you how this kind of agent can be built easily with the abstractions available in LlamaIndex and evaluated with TruLens. Let me switch over to that; Jerry and I will tag-team a little and walk you through the notebook. Jerry, maybe you can kick things off with building the app, and then I can speak to the evaluations.

Yep, that sounds great. So we're basically walking through the notebook of first building a LlamaIndex data agent, and specifically in this use case we're building it over the Yelp tool. This is a little different from the original examples I shared, where we built an agent over a static knowledge corpus like a RAG pipeline; here we're building it over an external API tool, and we'll walk through how that works. Of course, we first need to do the pip installs: trulens-eval, llama-index, llama-hub, and yelpapi. We see that the Yelp tool is found on LlamaHub, which, as I mentioned, is our community-driven hub for a variety of different integrations, from agent tools to data loaders to templates. So the first thing you'll do, as Anupam is showing, is "from llama_index.agent import OpenAIAgent"; that just imports the class. You want to make sure your OpenAI API key is specified, and you also want a Yelp API key, because we're actually going to be using this agent to interact with the Yelp API. The next step is to construct the Yelp tool itself.
The statement is "from llama_hub.tools.yelp import YelpToolSpec". What exactly is a tool spec, you might ask? A tool spec is basically an API interface definition within LlamaIndex that's specifically designed for agent interactions. It's written as a standard Python class, with a class definition plus a bunch of functions; each function can take in a set of arguments and pass back a response. You could load this class and call these functions manually, but what's more interesting is that we let you plug these Python classes directly into an agent as tools, so that instead of you manually calling a function, the agent, which has an understanding of all the function signatures within the class, can call the different tools, or functions, with the appropriate parameters and get back the response. The response feeds back into the conversation history, and you can iterate from there. So here we have a Yelp tool spec, a set of functions specifically for interacting with the Yelp API; we'll get into the specific functions in just a bit, but for now we initialize a YelpToolSpec.

The other thing we're going to import is what we call the load-and-search tool spec, which I covered briefly in a previous slide; it's essentially a wrapper abstraction that helps you do on-the-fly indexing for handling large responses. For instance, some Yelp API calls, like looking at all the reviews for a restaurant, might return a lot of text data. If you do the naive thing of stuffing all that text into the context window of the agent's conversational memory, you might overflow the memory buffer. So we have some nice convenience abstractions that take in any sort of data, construct an index for it, and let the agent, instead of ingesting all the data into its conversation memory, do a top-k search or retrieval over it; and you can plug a LlamaIndex RAG abstraction in there to find the most relevant results.

In the next line we define a Gordon Ramsay prompt: "you answer questions about restaurants in the style of Gordon Ramsay, often insulting the asker." That part is pretty straightforward. The step after that is constructing the agent with OpenAIAgent.from_tools, and this is where you see that we pass in the Yelp tool spec, converted into a set of tools and wrapped with the load-and-search tool spec.
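Put together, the construction looks roughly like this, closely following the notebook described here. The API keys are placeholders, and the import paths are from the late-2023 llama_index/llama_hub releases, so they may differ in newer versions.

```python
# Sketch of the notebook's agent construction (late-2023 library versions).
import openai
from llama_index.agent import OpenAIAgent
from llama_hub.tools.yelp.base import YelpToolSpec
from llama_index.tools.tool_spec.load_and_search.base import LoadAndSearchToolSpec

openai.api_key = "sk-..."                                  # placeholder
tool_spec = YelpToolSpec(api_key="...", client_id="...")   # placeholder Yelp credentials

gordon_ramsay_prompt = (
    "You answer questions about restaurants in the style of Gordon Ramsay, "
    "often insulting the asker."
)

# Wrap each Yelp function in load-and-search so large responses get indexed
# and searched rather than stuffed into the agent's conversation memory.
tools = tool_spec.to_tool_list()
agent = OpenAIAgent.from_tools(
    [
        *LoadAndSearchToolSpec.from_defaults(tools[0]).to_tool_list(),
        *LoadAndSearchToolSpec.from_defaults(tools[1]).to_tool_list(),
    ],
    verbose=True,
    system_prompt=gordon_ramsay_prompt,
)
```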
So, are there two tools here because there are separate consumer and business interfaces? Yeah, you'll probably see what the tools are as you run through the app, but in this API definition there are a few tools, and I'm just going to speculate that they involve listing restaurants as well as being able to search over the reviews of a restaurant. I think that's it, and we can verify it as we go down the notebook. The next section just sets up GPT-3.5 for comparison purposes, and the part after that is around instrumentation with TruLens, so that you can set up some eval scaffolding around the agent. For that part I'll pass it back to Anupam to walk through how it works.

Yeah, so with TruLens you have a similar set of things to import. There's the abstraction of a feedback function that I mentioned earlier in the slides, so that's the class you import here. We make use of OpenAI to do the evaluations, but we can also work with other LLMs; in this particular notebook we're using OpenAI. There are a few other things here, and then there are some base feedback functions available out of the box, which you also import, like groundedness and ground truth and so on. There's a database you set up; this Tru object includes a database which logs the prompts, responses, intermediate results, and evaluations, and you reset it so you're starting from scratch.

Then, in this evaluation setup, you define some new feedback functions, like query translation. For query translation, as I informally mentioned earlier, question one is what the user asks, and question two is what the LLM agent translates it into to make it work for the tool it's calling, the Yelp tool in this case. This LLM-based evaluation, which makes use of GPT-3.5-turbo, is given a very simple prompt: given these two questions, the original user question and the question that the LLM produced to send to the Yelp API, how similar are they? And it gives a rating on a scale of 1 to 10. There are a few other things here which I'll skip in the interest of time, but then you set up this query translation feedback function to make use of the user input, the user query, and you also tell it where to get the agent-translated query from; those are the two inputs it works with. So this is one way to set up these feedback functions.

There are other feedback functions available here, like the groundedness one I mentioned, question-answer relevance (or answer relevance), and context relevance; all of these are already available out of the box, so here you're just setting up the inputs that these evaluations will work with. For context relevance, that's the original user query and the retrieved piece of information that comes back from the Yelp API call, and so on.
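A hedged sketch of the custom query-translation feedback: a plain Python function wrapped in Feedback. The prompt wording is illustrative rather than the notebook's exact text, and the selector into the agent's tool-call arguments is a hypothetical path; check the record's call trace for the real lens in your version.

```python
# Custom feedback: LLM-rated similarity between the user's question and the
# query the agent sent to the tool, rescaled from 1-10 to 0-1.
import openai
from trulens_eval import Feedback, Select

def query_translation_score(question1: str, question2: str) -> float:
    """Ask GPT-3.5 to rate semantic similarity of the two questions."""
    resp = openai.ChatCompletion.create(  # pre-1.0 openai client, as in late 2023
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Rate from 1 to 10 how semantically similar these two "
                "questions are. Respond with only the number.\n"
                f"Q1: {question1}\nQ2: {question2}"
            ),
        }],
    )
    return (float(resp.choices[0].message.content) - 1.0) / 9.0

f_query_translation = (
    Feedback(query_translation_score, name="Query Translation")
    .on_input()  # the user's original question
    # Hypothetical selector: points at the query string the agent passed to
    # the Yelp tool; the exact path depends on the instrumented call trace.
    .on(Select.Record.app.query[0].args.str_or_query_bundle)
)
```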
Once you've set these up, you also set up a ground-truth eval. There's a golden set here with the kinds of questions we walked through in the examples on the slides, like "what's the best diner in Toronto?", "where's the best pizza in New York?", and the Gumbo Social in San Francisco question. We set up a ground-truth eval for these just as a baseline.

Then we run the dashboard. When we do, a Streamlit dashboard pops up that looks a bit like this. It keeps track of the number of records, the prompts and responses that were processed, six in this set, along with average latency, cost, and the various feedback functions. You can see here that with just the OpenAI chat completion, the ground-truth eval is at 0.56 and answer relevance is 0.18, which is quite low. When you augment it with the Yelp agent, the answer relevance goes up a lot. Over here there aren't even any notions of groundedness or query translation or context relevance, because we don't know how to provide traceability for a purely LLM-based application, whereas here you can ground these things in the responses from the external source of truth, the Yelp API. So you get this leaderboard that lets you compare and pick the best version of the app.

And then, for individual records and queries, you can go deeper. If we go back to that best-diner-in-Toronto question in the leaderboard, you can drill down and get record-level evaluations for the relevant feedback functions. On "what's the best diner in Toronto?", this was the response from the OpenAI chat completion; the ground truth, on the other hand, is the George Street Diner, so on answer relevance it got a low score. If you look over here at the same question for the Yelp agent, you can see that context relevance is fairly high. On the ground-truth eval it did poorly, because it said White Lily Diner rather than George Street Diner, but it's well grounded: it did come back from the Yelp reviews that the best diner is White Lily Diner. For query translation it did well, and answer relevance did well as well.
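Pulling the pieces together, here is a rough sketch of how the agent gets wrapped, queried, and the dashboard launched, reusing the feedback functions from the earlier sketches; this follows the trulens_eval ~0.18-era API and the exact recorder interface may differ by version.

```python
# Wrap the agent with TruLlama, attach feedbacks, record queries, and
# launch the Streamlit dashboard (leaderboard + per-record drill-down).
from trulens_eval import Tru, TruLlama

tru = Tru()
tru.reset_database()  # start logging from scratch

tru_agent = TruLlama(
    agent,
    app_id="YelpAgent",
    feedbacks=[f_query_translation, f_context_relevance,
               f_groundedness, f_answer_relevance],
)

with tru_agent as recording:
    agent.query("What's the best diner in Toronto?")
    agent.query("What's the address of Gumbo Social in San Francisco?")

tru.run_dashboard()
```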
So this gives you a quick tour. I know we're getting close to time here, but we did want to spend a minute or two on best practices. I just want to wrap up and leave you with this takeaway of the agent quad for evaluating LLM agents for hallucinations. It's closely related to the RAG triad but has the additional step of tool selection and checking, which includes evaluating that the right tool was selected and that the query was appropriately translated; then we go into context relevance, groundedness, and answer relevance, which we just walked through. And we'll take a couple of minutes to give you some final best practices to keep in mind as you're building agents. Over to you, Jerry, and then we'll open it up for Q&A.

Sounds great. Some of these best practices are pretty intuitive; we wrote a blog post on this a few months ago, and our thinking on it is ever-evolving. One is writing good tool prompts for the API interfaces of the tools you pass to the agents. GPT-4, for instance, is probably the best model for interacting with some of these tools, but even it requires a bit of prompt engineering to make sure the function definitions are clear enough that it can actually call them; you need to make sure the arguments are correctly specified and that the model can interpret how to call these functions with the correct arguments.

One general principle: given that these agents aren't super reliable right now, you do want to give them a bit of guardrails. So make these tools tolerant of partial or faulty inputs, as shown in the sketch below. If the agent infers the wrong parameters, or doesn't fill in the parameters it's supposed to, can you, for instance, replace them with good defaults? Can you return a proper error message to the agent, giving it a few-shot negative example so it can correct itself in the next loop? Really design the tool so that it's friendly to agents that might not always use it correctly. This is an aspect of API interface design that's a little different from traditional API design in software engineering: you're really trying to guide the agent into interacting with your services the right way. This also relates to returning the right messages, whether from errors, like exception handling, or from POST requests; if the agent modifies state, can you return success or error so it knows whether it did the job correctly?

The last two: first, don't overload the agent with tools. We've mentioned this; if you give the agent more than about five or six tools, it starts getting confused and might not use the right ones correctly. Second, a practice we've found to be pretty good is the idea of hierarchical agent modeling: having an agent, instead of calling just a tool, call another agent, so that you have a network of agents. Each agent has access to, say, three or four tools, some of which might be other agents, and each agent is roughly specialized in its domain, executing a certain task. This means you don't overload a single agent with a bunch of tools; you have a network of different agents that can orchestrate and communicate with each other. These are all practices that are still evolving, but that's the list for now.
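Here is a small sketch of that "tolerant tool" principle: validate arguments, fall back to defaults, and return corrective error strings the agent can learn from on its next loop. The find_restaurants helper is hypothetical; only FunctionTool.from_defaults is a real LlamaIndex call of this era.

```python
# A tool designed to tolerate imperfect agent inputs.
from llama_index.tools import FunctionTool

def find_restaurants(city: str = "", cuisine: str = "any") -> str:
    """Find restaurants in a city, optionally filtered by cuisine."""
    if not city:
        # Return a readable error instead of raising, so the agent sees a
        # negative example and can correct itself on the next iteration.
        return ("Error: 'city' is required. Call again, e.g. "
                "find_restaurants(city='Toronto').")
    # cuisine defaults to 'any' rather than failing on a missing argument.
    return f"(placeholder) top {cuisine} restaurants in {city}"

restaurant_tool = FunctionTool.from_defaults(fn=find_restaurants)
```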
Great, with that I think we'll wrap up and open it up for questions. You can check out our observability waitlist, which is for a more scalable product that goes beyond the open source; you're welcome to get on that waitlist, and we'll give early access to folks. And of course it's been great to work with Jerry and the LlamaIndex team, collaborating both on teaching the course and on building and evaluating RAG-based applications and LLM agents. We're excited to take final questions; this is a great time for our field.

Perfect. Well, thank you so much, Jerry and Anupam; I know our community really enjoyed everything you taught us today, a very interesting session. I've gathered some questions, and I think before we end today we can get through a few. The first question is: how can I create a chatbot that accurately answers complex scientific questions using a knowledge base where exact phrasing and wording is paramount?

Yeah, I can take this; I was taking a look at this question in the Q&A. Basically, there are a bunch of different components you have to think about, and it really depends on the nature of your data, your use case, and your performance requirements. One is: what is the parsing strategy for this data? For instance, are they arXiv papers with embedded charts and tables? Are they web pages? Are they basically structured data? It really depends on the form factor of this scientific knowledge base. I always recommend starting simple, with some basic parsing strategies, before you try something more advanced; once you do try something more advanced, you can take a look at our deeplearning.ai course or the LlamaIndex documentation. One comment about scientific content in general is that people sometimes struggle if they're using a default LLM: even if it's pre-trained on a lot of data, it might not understand specific technical terms or concepts. This is really model-dependent, and if that's the case and you're locked into using that model, you might want to consider fine-tuning, so that the model understands the vocabulary of the domain you're operating in. Another piece is that it depends on the types of questions you want to ask: are you asking very analytical questions over structured data, or questions that require understanding charts and tables along with the text? These all require slightly different strategies to play around with. So in the end the answer is relatively broad, but I would think about all the dimensions, from model selection to data parsing and ingestion to the retrieval strategy; and of course every app needs to set up some sort of eval scaffold.

Yeah, and the RAG triad for evaluation could be particularly valuable there, with groundedness being extremely important. Yeah, good point.

Absolutely. I think for our next question: at what level of complexity does using agents make sense?

Yeah, that's a good question. It definitely depends a little on your use case. A lot of enterprises, for instance, are building RAG, and so if you're interested in doing search or retrieval, I'd probably start with some of the core RAG concepts first, basically doing retrieval and then synthesis using an LLM; start a little smaller. What we typically see is that companies and developers start wanting to add agentic behavior on top of their RAG pipelines once they want to handle more complex questions or start wanting to interact with services. So I'd typically think about it as: start small. Especially if you're doing RAG, build a basic RAG pipeline first, then add the agentic stuff on top. If, on the other hand, you directly want to build something that interacts with an API service, you basically need to build an agent, so just start there; for instance, if you want to build something that interacts with the Yelp API like Anupam just showed, start there.

Yeah, that makes a lot of sense. Do we have time for one last question? I think one more, and then we definitely have to end. How do you think companies will look to apply standard MLOps principles to LLMs and GPTs?

Yeah, I think I can kick that off, and then Jerry, feel free to add.
For example, on the observability side there's a lot of overlap. If you think about MLOps platforms, observability is increasingly recognized as an important component of them, and by that I mean full-lifecycle observability: during development, testing, and debugging, and then, once the model is in production, continuing to monitor it and circling back to debug if there's a problem. That carries over from traditional machine learning models to large language models and generative AI. The differences are in how you do the evaluation, which is what we focused on quite a bit today. The core components of the platform, tracking and monitoring things over time and so on, are shared infrastructure, but there are technical differences in how you do evaluations and how you scale those evaluations up to production workloads, which might run into millions or tens of millions of records. So there's a lot of overlap, and then on the development side there are some significant differences that maybe Jerry can speak to.

Yeah, I mean, it's basically what you said. Aside from the fact that you're not really training a model for this usage, you still need to set up some basic metrics, because you're trying to evaluate a black-box stochastic system. So definitely have some metrics and some dataset that makes sense, and try out some of the advanced techniques.

Perfect. Well, Jerry and Anupam, thank you so much for speaking to our community today; I know we learned a lot. We will send the slides and the notebook out afterwards, and we hope to see you all again very soon. We've also dropped a survey in the chat, so please fill it out; in 2024 we're going to have a lot of different events, and we'd love to hear your feedback on how to improve everything. But thanks again for coming, everyone. We'll see you next time. Bye-bye. Thank you.
Info
Channel: DeepLearningAI
Views: 26,300
Id: 0pnEUAwoDP0
Length: 62min 12sec (3732 seconds)
Published: Tue Dec 05 2023