Breaking Down & Testing FIVE LLM Agent Architectures - (Reflexion, LATs, P&E, ReWOO, LLMCompiler)

Video Statistics and Information

Captions
Hello everybody, Adam LK here, and today we're going to intuitively break down the flows of six main approaches to building agentic architectures for large language model-based applications. If you're new to agents, or unsure what an agentic approach to building language model applications even is, I have another video that will pop up right here that gives a broad overview, a good example, and a breakdown of what agents are and what it means to build LLM applications with an agent approach in mind. In this video, however, we're going to break down six main architectures that I've seen, and their associated papers and workflows, intuitively. I think this fills a niche, because a lot of the time these applications are very code heavy, and not a lot of people truly understand what's going on inside the actual execution of these architectures. So I wanted to break down both basic reflection and different abstractions of planning and execution, so that you can better understand how they're set up and maybe even create an architecture of your own someday.

Without further ado, the first three approaches are focused on reflection. Basic reflection is super simple and follows this flow: the initial user query is input, and an initial response is generated with the first prompt. Say the application is an essay writer; you might input a request like "generate me an essay on woodland birds." That initial response is generated and then goes to a reflection step. The reflection step is an additional prompt, say an essay grader, that acts almost like a teacher grading an essay; its whole point is to generate a critique. The critiques are recommendations for improvements, and different reflections and thoughts on the initially generated response. This is then fed back into a generation stage, which will hopefully take those critiques and self-generated improvement ideas and produce a second response. This repeats n number of times until you're happy with the final output. That's about as basic as a reflection loop can get: generate a response, reflect on it and generate a critique, regenerate, and repeat an arbitrary number of times until you hopefully get a better output back, as opposed to just generating an initial output and using it directly.

So let's see what that actually looks like when it runs. For all of these agentic approaches, I asked the same question and used the same model. The input question was "what are the current trends in digital marketing for tech companies?" and the model used for every step is the recent April 9th release of GPT-4 Turbo. One quick note: not all of these agents are set up to do the same task, have the same tools, or even have the same prompts, so this is not a scientific comparison or a determination of the best architecture. It should, however, give you a good idea of the different approaches and what they look like. For basic reflection, the run took about 118 seconds and about 18,160 tokens. In the trace, the first generation is followed by a review, and the review looks a little something like this: it talks about topic relevance and the introduction of new trends; it covers well-established trends but could benefit from the inclusion of emerging trends such as voice search optimization, and so on. Looking at the trace, we can see that it generates, reflects, generates, reflects, generates, reflects three times before a final generation at the end, and that final output is what we see back here in this nice format (I just took the markdown it output and rendered it).

My general thoughts: it took a lot of time, and there's a lot of processing, since it generates a decent amount of text and keeps all of it in its "memory" each time it goes through and reviews. So it's a bit token heavy and time intensive to reread and regenerate a full report at every step. Another thing: there are no actual tools attached to this basic reflection agent, so it's making a lot of interesting claims, things like video content and interactive media accounting for 82% of all internet traffic by 2022, or Amazon's personalized marketing reportedly increasing conversion rates by over 202%. I actually looked up a few of these, and it turns out 99% of the actual numbers are completely made up; nothing it outputs here is grounded in any sort of truth. Since it doesn't have access to things like web search tools, which a few of the later agent architectures will have, it's pretty much just generating to generate, and its internal weighting does not get a lot of these very specific numbers right. That's certainly one thing to consider when using basic reflection with no grounded tools. So, as my note here says: it makes things up, it's slow for processing, and the token usage is high. Overall though, basic reflection is super easy to implement. Moving on to the second example: this one has a little bit of actual research behind it.
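The generate-reflect-regenerate loop above can be sketched in a few lines. This is a minimal, runnable illustration, not LangChain's implementation: the two "LLM calls" are stand-in functions so the control flow is visible, and in a real app each would be a chat-completion call with its own prompt.

```python
# Minimal sketch of the basic reflection loop. The generate() and
# reflect() functions are stubs standing in for two separate LLM prompts.

def generate(query, critique=None):
    # Stand-in for the generation prompt: a first draft, or a revision
    # that incorporates the previous critique.
    if critique is None:
        return f"Draft essay on: {query}"
    return f"Revised essay on: {query} (addressed: {critique})"

def reflect(response):
    # Stand-in for the reflection prompt (the "essay grader").
    return f"critique of the draft: add depth and concrete examples"

def basic_reflection(query, n_rounds=3):
    response = generate(query)                 # initial response
    for _ in range(n_rounds):                  # arbitrary fixed loop count
        critique = reflect(response)           # reflection step
        response = generate(query, critique)   # regenerate with the critique
    return response

final = basic_reflection("woodland birds", n_rounds=3)
```

Note that every round feeds the whole prior response back through the model, which is exactly why the token count and latency grow the way the video describes.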
This is the architecture proposed by the paper Reflexion: Language Agents with Verbal Reinforcement Learning (I'm not 100% sure how "Reflexion" is meant to be pronounced, so I'll just go with how I read it). It's based on this paper that came out in October of 2023, and it takes the basic reflection idea, abstracts it a little further, and finally introduces things like tool use.

The general flow is as follows: the user query is input, and an initial response is generated along with a self-critique and also suggested tool queries. For this quick example, my question was "why is reflection useful in artificial intelligence?" After a few verification steps and generating that first response, we get this nice initial response, and it also generates its own critique at the same time. It says here, for instance, that the answer might be a bit too general in its discussion of reflection, without diving into technical mechanisms or specific use cases: it provides a good overview but lacks depth. The one tool it's provided with here is a web search tool, so it also generates some search queries: "real world examples of reflection in AI technologies," "enabling reflection in AI," and so on. That's the initial response. What differs is that instead of reflecting again immediately, it takes those suggested tool queries and executes them, in this example web searches for more information. Then the original response, the reflection, and the additional context from the executed tools are sent through a revision prompt. That's where the reviser, another language model prompt, comes in: once the initial response is made and the tools are executed, the reviser gets all of that information at once to generate a revised response.

The revised response updates the answer, creates a new self-reflection, and generates new tool queries. Here we see the revised answer, now citing things with actual references pulled from the search tool. It loops this way, because it has made more search queries, so it gets more and more context from the web, repeating n number of times until it's happy or until we hit our arbitrary limit, at which point the end result is sent to the user. The biggest difference between basic reflection and this Reflexion actor is that the tool-execution step is separate: instead of regenerating everything all at once, it takes the suggested queries, executes them all, and then all of that context, along with the prior context including the generated response, is looped through until the end.

Hopping down to the testing area, I prompted the Reflexion actor with "what are the current trends in digital marketing for tech companies?" and it took only about 69 seconds and 24,000 tokens to generate this response. Hopping over to LangSmith to check that we followed this flow precisely: the input is the current-trends question, and the output is the first generated response along with the search queries and a bit of a reflection. We see the response generated in the drafting stage, and then the initial search queries are executed in the execute-tools stage, which is exactly the initial response and tool execution we just described. All of the tools are executed, and then the reviser is able to see all of the context from the web searches.
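The draft-search-revise loop can be sketched like this. Again a hedged stand-in, not the paper's code: the drafter, search tool, and reviser are all stubs, and the dictionary shape I use for the agent state is my own illustration.

```python
# Sketch of the Reflexion actor flow: a draft comes back with a
# self-critique plus suggested tool queries; the queries are executed,
# and a reviser prompt gets the draft, critique, and tool results.

def draft(question):
    # Stand-in for the first responder prompt: answer + critique + queries.
    return {
        "answer": f"Initial answer to: {question}",
        "critique": "too general, lacks citations",
        "queries": [f"{question} examples", f"{question} research"],
    }

def web_search(query):
    # Stand-in for a web search tool.
    return f"[search results for '{query}']"

def revise(state, evidence):
    # Stand-in for the reviser prompt: new answer, critique, and queries.
    return {
        "answer": state["answer"] + f" | revised with {len(evidence)} sources",
        "critique": "check for superfluous content",
        "queries": [],  # a real run would usually propose new searches
    }

def reflexion(question, max_rounds=3):
    state = draft(question)
    for _ in range(max_rounds):
        if not state["queries"]:          # nothing left to look up
            break
        evidence = [web_search(q) for q in state["queries"]]  # tool step
        state = revise(state, evidence)                       # revision step
    return state["answer"]

answer = reflexion("why is reflection useful in AI?")
```

The key structural difference from basic reflection is visible here: tool execution is its own step between the critique and the revision, so the reviser sees grounded evidence rather than only the model's own prior text.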
Then it actually revises the answer, and this is where it enters a loop of reflection. You can see it specifically referencing missing things and also superfluous things, and now that it's using a web search tool, it keeps track of references. With that it loops: it executes the tools and revises, then executes the tools and revises twice more, so this is another example of running through the loop about three times, similar to the basic reflection agent, until finally, all the way down at the bottom, we have the final revised answer, still with some reflection attached, which we strip out to end up with the final report. So it actually took less processing time than the basic reflection agent. If we compare token counts, the big difference is that this can vary with the information it gets from the web, since the raw scrape from the web search tool can be a lot of tokens. The cool part, though, is that we almost cut the time in half by breaking out the execution and revision stages and having it loop like that.

The third architecture, the most abstract of the reflection-oriented approaches, is Language Agent Tree Search, or LATS for short. It's based on the paper Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models, which came out December 5th of last year. It's basically using large language models in three ways: as the actual agents doing the generation, as the value functions of the tree search, and as optimizers for the tree, all within the general framework of a Monte Carlo tree search. In LangChain's diagram they have this wonderful little synthesis of it that just repeats until solved: select a node, generate new candidates, act, reflect and score, and backpropagate up through the tree. But let's break that flow down a little more specifically, especially for those who aren't familiar with Monte Carlo tree search and might be wondering what the hell that means. Doesn't matter; we're going to break it down real easily.

The first step, as always, is that the user query is input, and the initial response is generated as a starting root node. That root node can either be an answer or a tool execution. What we're seeing in the generate-initial-candidate step is that the first root was actually just a tool search: in this specific flow example I asked it to generate a 500 word report on what Cisco Systems is doing with artificial intelligence, and it says it wants to use my search results tool with the query "Cisco Systems artificial intelligence initiatives." That gets executed, and afterwards the reflection prompt generates a reflection on the output, whether that's the output of the tool execution or the generated response, generates a score for the output, and determines whether a solution was found. It says here that the search result provides valuable information on Cisco's AI initiatives, the company has developed such-and-such, and so on; it gives it a score of eight and says this accurately found the solution. Following that, an additional n candidates are generated with the context of the prior output and reflection, which expands our tree out: coming from this node and its reflection, we generate five additional candidates. If the parent was a tool use, those candidates will be plain generation; if it was generation, they might be further generation or tool use, and in this case it looks like it was just generation: coming from the context of the user query plus the executed web search, it generated five different versions of the actual report.

Then it repeats: the reflection prompt grades, scores, and judges each new candidate, and the scores along the best trajectory are updated. It picks the child node with the highest score as the new candidate to start from, and from that best child node more candidates are generated for the next step, with the context of the prior output, and the reflection and scoring is executed further and further. This cycle repeats until either there's a high enough score or a max search depth is reached. In short (this diagram really confused me at first), all this does is generate something initially, generate candidates from that executed initial node, choose the best one, generate another set of candidates from the best one for the next step, and so on, until you either hit a max depth or iteration count, or the language model's own generated score crosses the threshold.

Jumping down to the testing area, LATS was actually able to generate its report on digital marketing trends in only 30 seconds with about 8,000 tokens used, which sounds phenomenal compared to the other architectures we've just gone over. However, there are a lot of concerns here, so let's first hop over to the trace and see this in action. Looking at it, the initial input is "what are the current trends in digital marketing for tech companies?" and then it does the generate-initial-candidate step.
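The expand-score-select cycle above can be sketched as follows. To be clear about what's simplified: this is a greedy best-first version of the full Monte Carlo tree search the paper uses (no backpropagation or exploration bonus), and the scorer is a deterministic stub standing in for the LLM reflection prompt that returns a 0-10 score.

```python
# Greedy sketch of the LATS loop: expand the current best node into
# several candidates, score each with a "reflection" call, follow the
# best child, and stop at a score threshold or a max depth.
import random

def generate_candidates(parent, n=5):
    # Stand-in for candidate generation from a node's context.
    return [f"{parent} -> candidate {i}" for i in range(n)]

def reflect_and_score(candidate):
    # Stand-in for the LLM scoring/reflection prompt. Seeding on the
    # candidate text makes the stub deterministic per candidate.
    return random.Random(candidate).randint(5, 10)

def lats(query, max_depth=3, target_score=10):
    node, best_score = query, 0
    for _ in range(max_depth):
        scored = [(reflect_and_score(c), c) for c in generate_candidates(node)]
        best_score, node = max(scored)     # follow the best-scoring child
        if best_score >= target_score:
            break                          # the scorer claims "solution found"
    return node, best_score

node, score = lats("current trends in digital marketing")
```

Even in this toy form you can see the failure mode the video describes: whatever function sits in `reflect_and_score` fully controls when the search stops, so an over-generous LLM scorer ends the search after one expansion.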
The initial candidate was a tool call with "current trends in digital marketing for tech companies in 2023," which our Tavily search tool answers here. It then reflects on this and grades it: the assistant provided a comprehensive overview of current trends in digital marketing, it scores it a 10, and it says it found the solution. So you might be starting to see some of the issues with using language models as every single part of your tree search algorithm. We then expand from this initial candidate with five different generations; it generates this candidate here, which has a report, plus five different reflections. All of the scores, as you can see, come out around nine, each claiming it found the solution, and since all of them score high and claim a solution, one was chosen and the final report was output at the end.

What you're probably seeing here is not great, if you know anything about tree search optimizers: the scoring of the generated responses and their reflections, given that it's an LLM, is very nondeterministic. Using an LLM as the optimizer for the generated responses can be a bit finicky; it tends to overscore, I would say, and it's very ready to accept an initial answer if it seems somewhat related to the input query. As we can see, it didn't really generate much past one initial expansion, because it determined that everything had been found. You could force it to generate more nodes, or one alternative approach would be a better optimizer that isn't language model based and is more hard-coded; that would be a lot less finicky and a lot more reliable for scoring the output you actually want. Still, it was able to give me a nice, grounded report on digital marketing trends. The other caveat is that it has to generate a lot of text, especially at five or six candidate nodes per expanded node; that can really run away with token counts if you don't have a max search depth set up, or if you're generating a lot of text all at once.

Now that we've gone from basic to very abstract and complicated approaches to reflection, let's move on to the last three, which are based around a different kind of reflection: more of a planning and reasoning stage, as opposed to direct reflection. The first one is aptly named plan-and-execute, and it's based on the paper Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models, which came out around May of last year. It takes the reflection idea and breaks the problem into subtasks, more like task decomposition than reflection proper (though that can be argued to be a type of reflection as well). I find this one pretty interesting. The general flow: the user query is input, as always, but instead of generating something right off the bat, an initial planning prompt comes up with a step-by-step approach to completing the query. For the objective "what is the hometown of the current Australian Open winner?" the plan is: identify the current year; find out who won the Australian Open in that year; search for the hometown of the winner identified in step two. The first step of the plan is then given to the agent, a single-task agent that either generates or executes a tool. After it handles that first step, the original query, the original plan, and the past step outputs are given to a replan prompt that
either updates the plan or returns the output to the user. Once the first task is executed, the replanner takes a look and asks: have we finished the task, do we need additional tasks, or are we good to respond to the user? If it's good to go, it responds to the user; if not, it updates the steps. The updated step is given back to the agent to either execute or generate, the updated generation with the context of the prior steps is looped through, and the input and plan are given back to the replanner. This repeats n times until the replanner determines the answer is adequate and returns it to the user. So basically: a plan is made; the first step of the plan is executed; we look and reflect to see whether that first step answers the whole plan, whether we need more tasks, or whether we're good to go; if we need more tasks, an updated task list is made and sent back to the agent to execute or generate, and we reflect again; if we're good to go, the final response is sent to the user.

Heading down to the agent testing environment, the plan-and-execute approach took only 24 seconds and about 3,000 tokens. As mentioned before, token counts can vary wildly, especially with things like web search tools, but I've seen that with a lot of these the actual time tends to go down. I also think this is a more interesting approach than straight reflection: the decomposition into subtasks can be more easily executed by LLMs, and breaking complex problems into simpler steps makes them easier to inspect as they happen. Taking a peek at the trace: the input is "what are the current trends in digital marketing for tech companies?" and then we have a planning step.
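The plan-execute-replan loop just described can be sketched like this. The planner, single-task agent, and replanner are stubs standing in for three separate LLM prompts; the exact function names and the "done when the plan is empty" rule are my illustration, not the paper's.

```python
# Sketch of plan-and-execute: a planner produces a step list, a
# single-task agent executes the first step, and a replanner either
# returns a final answer or hands back the updated plan.

def planner(query):
    # Stand-in planning prompt (mirroring the video's example plan).
    return [
        "identify the current year",
        "find the tournament winner for that year",
        "search for the winner's hometown",
    ]

def execute_step(step):
    # Stand-in for the single-task agent (generate or call a tool).
    return f"result of: {step}"

def replanner(remaining_plan, past_steps):
    # Stand-in replanner: finish when no steps remain, otherwise
    # return the (possibly updated) remaining plan.
    if not remaining_plan:
        return [], f"final answer built from {len(past_steps)} step results"
    return remaining_plan, None

def plan_and_execute(query):
    plan = planner(query)
    past_steps = []
    answer = None
    while answer is None:
        step = plan.pop(0)                        # hand first step to the agent
        past_steps.append(execute_step(step))     # execute or generate
        plan, answer = replanner(plan, past_steps)  # replan or finish
    return answer

result = plan_and_execute("hometown of the current Australian Open winner")
```

A real replanner could also append new steps to the plan here, which is where this differs from simply iterating over a fixed list.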
The plan: research recent articles, reports, and studies on digital marketing trends for tech companies; identify key trends mentioned in these sources, such as AI integration and personalization; analyze how these trends are being implemented; summarize the findings to outline the current trends in digital marketing for tech companies. Then we execute the first step, researching recent articles, reports, and studies on digital marketing trends for tech companies: it searches the web for recent articles and reports, generates an output based on that, and even gives some references. That generation is then shoved over to the replanner, which looks at the input, the plan, the past steps, everything that's been made, and determines that, based on the information provided, it appears that all necessary steps have been completed to identify and summarize the trends, and so on; no further steps are needed, and the summary can now be used for analysis and reporting. So in one shot, without any actual need to replan, it determined it was able to answer the question accurately, which gives us this output over here.

A few notes on this: forcing a planning stage makes it a bit more efficient than a series of revisions, since it approaches the problem with a plan instead of generating wildly and then generating a reflection, which it will always do if you always ask it to reflect on something. This tends to produce decent responses quicker, with a bit of its own guidance, following a plan instead of explicit reflection.

The next two architectures are some pretty interesting optimizations of the whole plan-and-execute approach. The first is Reasoning WithOut Observation, which has a phenomenal acronym, ReWOO, and is based on the paper ReWOO:
Decoupling Reasoning from Observations for Efficient Augmented Language Models, which came out in May of last year. In short, this combines a multi-step planner stage with variable substitution for very effective tool use, reducing both token consumption and execution time by running a full chain of tool use in a single pass. Let's break that down. As always, we start with the user query as the input, and then the planner generates a plan in the form of a task list. This task list includes special variables as the arguments, to allow for variable substitution. A good example is up in the diagram. Say the question was "who is playing in the upcoming Super Bowl?" The plan might be: I should look up the Super Bowl contenders, where the first variable, E1, is a search for the Super Bowl contenders; then, get the first team, where E2 is a large language model call for the first team from the results of E1. What you're seeing is that the full task list is generated with variable dependencies, and those dependencies persist throughout the entire task list, so it can all be executed in one stage.

To reiterate with the other example: plan step one is to identify the winner of the 2023 Australian Open, where E1 is the Google search; let's say the result comes back as "George." Plan step two is to find the hometown of the identified winner from E1, where E2 is a Google search for "hometown of" plus the E1 placeholder. The placeholder is substituted with its output, so E2 becomes a Google search for "hometown of George," and so on and so forth. As each step is executed, its output is substituted into the placeholder variables of the next step and fed back into the LLM agent. Once all of the steps are complete, the plan and the "evidence" from the tool executions are fed to a solver prompt, which generates and returns a final response to the user. So in essence, instead of executing one step of the task and then reflecting to see whether more steps should be done, a full task list is made up front, the entire task list is executed while resolving the dependencies as each step completes, and then all of that context is sent to a solver, which creates the final output for the user.

Hopping back down to our testing environment, we can see that ReWOO took only about 21 seconds to do the same thing the plan-and-execute agent did in 24. Looking at the trace, the task is "what are the current trends in digital marketing for tech companies?" and the plan that's created is: E1, a Google search for "current trends in digital marketing for tech companies in 2023"; then the second step, summarize the key trends identified, which is an LLM call to summarize the key trends in digital marketing for tech companies from the results of E1. E1 is executed with our search tool, all of that text is put into the second execution, the summarization, and the summarization is made into this report. All of that is then finally sent to the solver, which has the task, the plan strings, the steps, and the results from each one, and its output is this right here; it ended up being a little bit short, but still good. As we can see, this is really an optimization of the plan-and-execute style of agent: while it did have a somewhat higher token count, which as mentioned could be due to the web search, it still took less processing time. So it's definitely a quicker, more optimized version of the plan-and-execute loop.
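The planner-worker-solver flow above can be sketched as follows. The plan format, the fake tool results (reusing the video's own "George" placeholder example), and the solver are all stand-ins; only the variable-substitution mechanic is the point.

```python
# Sketch of ReWOO: the planner emits the full task list up front with
# #E placeholder variables; the worker executes every step in one pass,
# substituting earlier evidence into later arguments; a solver prompt
# then gets the plan plus all of the collected evidence.

def planner(question):
    # Stand-in planner output: (variable, tool, argument-with-placeholders).
    return [
        ("#E1", "Google", "winner of the 2023 Australian Open"),
        ("#E2", "Google", "hometown of #E1"),
    ]

def call_tool(tool, arg):
    # Stand-in tool execution with canned results.
    fake = {"winner of the 2023 Australian Open": "George",
            "hometown of George": "Springfield"}
    return fake.get(arg, f"[{tool} results for '{arg}']")

def solver(question, evidence):
    # Stand-in solver; real ReWOO sends plan + evidence to one final prompt.
    return f"Answer to '{question}': {evidence['#E2']}"

def rewoo(question):
    evidence = {}
    for var, tool, arg in planner(question):
        for known, value in evidence.items():
            arg = arg.replace(known, value)   # variable substitution
        evidence[var] = call_tool(tool, arg)  # execute, store as evidence
    return solver(question, evidence)

answer = rewoo("What is the hometown of the 2023 Australian Open winner?")
```

Notice there is no replanning loop at all: one planner call, one pass over the task list, one solver call, which is exactly where the token and latency savings come from.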
The final architecture example is a complete optimization of ReWOO called LLMCompiler, based on the paper An LLM Compiler for Parallel Function Calling from February 6th of 2024. In short, LLMCompiler pulls the idea of directed acyclic graphs from compiler design to automatically and algorithmically generate an optimized orchestration for parallel function calling, in a ReAct-style setting. What does that actually mean? It's pretty simple; a lot of fancy jargon makes it sound more complicated than it is, so let's break down the flow.

As we've been doing: the user query is taken as the input, and then there's a planner stage. The planner takes the input and generates a task list with placeholder variables for subsequent task dependencies, as well as thought lines for reasoning through the plan. So it starts off very similar to ReWOO, creating a task list with dependencies. Looking at this example, "what's the temperature in San Francisco raised to the third power?", the output is: do a web search for the current temperature in San Francisco; then do a math function call, raising the output of step one to the third power, with step one as its context; and then the results are joined. A task fetching unit parses the plan and determines the interdependencies of each step, and tasks that have no unresolved dependencies are fetched together and sent to an executor to run in parallel. That's where it differs from ReWOO, which goes through and executes each step one by one, resolving the dependencies one by one. If there are multiple steps that don't depend on each other, the LLMCompiler approach will execute them in parallel and resolve everything at once.

Looking at the second example from the paper, with the user input "how much does Microsoft's market cap need to increase to exceed Apple's market cap?", the first couple of steps of the plan, the search for Microsoft's market cap and the search for Apple's market cap, have no interdependencies, while the math in step three and the actual LLM response in step four do depend on steps one through three. So it fetches steps one and two and executes them in parallel: both happen at the same time, and both of their results come back at the same time. The executor's output is then fed back into the task fetching unit to resolve dependencies, repeating back and forth until the plan is fully executed. The output of the fully executed plan is sent to a joiner prompt, which either determines the final answer from the context, or appends a thought to the end of the original plan and sends it back for replanning to add any additional steps if needed; a replan, a continuation of the plan, is created, and the loop above runs until the joiner determines there's enough information to return an output to the user. This is getting a little complicated, but essentially it takes the ideas from reasoning-without-observation and plan-and-execute and combines them with algorithmic approaches for parallel function calling and parallel resolution of task dependencies, all to optimize speed and efficiency a little further.

Great, so hopping over to the trace, the initial plan from the input "what are the current trends in digital marketing for tech companies?" was: first step, search for current trends in digital marketing for tech companies; then send it to the joiner, end of plan. So this was a super simple plan.
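The scheduling idea, and only the scheduling idea, can be sketched like this. The task table mirrors the paper's Microsoft/Apple example in shape, but the tasks and "tools" here are stub lambdas of my own; a real system would have an LLM planner emit the DAG and a joiner decide whether to answer or replan afterwards.

```python
# Sketch of the LLMCompiler scheduling loop: group tasks whose
# dependencies are all resolved, run each group in parallel, repeat.
from concurrent.futures import ThreadPoolExecutor

# Stand-in plan: two independent searches, then a math step on both.
TASKS = {
    1: {"fn": lambda r: "msft market cap", "deps": []},
    2: {"fn": lambda r: "aapl market cap", "deps": []},
    3: {"fn": lambda r: f"diff({r[1]}, {r[2]})", "deps": [1, 2]},
}

def run_plan(tasks):
    results = {}
    while len(results) < len(tasks):
        # Task fetching unit: everything whose deps are already resolved.
        ready = [t for t in tasks
                 if t not in results
                 and all(d in results for d in tasks[t]["deps"])]
        with ThreadPoolExecutor() as pool:
            # Executor: run the whole ready batch in parallel.
            done = pool.map(lambda t: (t, tasks[t]["fn"](results)), ready)
            results.update(dict(done))
    return results

results = run_plan(TASKS)
```

On the first pass tasks 1 and 2 run concurrently; on the second pass task 3 runs with both of their results available, which is the whole speed win over ReWOO's strictly sequential execution.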
It makes a lot of sense when you see how this is all set up: it runs the query against the internet, and then the joiner prompt takes all of that, decides we've got enough to go on, and creates a response after giving it a bit of thought. That response is the final output we see here.

That pretty much brings us to the end of our very comprehensive overview of six main agent architectures from some popular papers. As always, all the code is available via LangChain's platform; full credit to them for setting up a lot of these good graphics and a lot of the code that I was able to use and learn from. Their YouTube tutorial series will be linked down below, as well as all of the papers and all of the LangChain traces that we went over in the video. If you enjoyed it, drop a like, hit subscribe, let me know if you have any questions in the comments, and I'll see you in the next one. Thank you!
Info
Channel: Adam Lucek
Views: 13,019
Keywords: artificial intelligence, OpenAI, AI, Gemini, Mistral, Mixtral, Llama, DALLE, Llama 2, Open Source, HuggingFace, Machine Learning, Deep Learning, Neural Networks, AI Trends, Future of AI, AI Innovations, AI Applications, AI for Beginners, AI Tutorial, AI Research, AI Solutions, AI Projects, AI Software, AI Algorithms, Artificial General Intelligence, AGI, AI Jobs, AI Skills, AI Strategy, AI Integration, AI Development, ElevenLabs, AssemblyAI, Multimodal, Agent, Azure, LangChain
Id: ZJlfF1ESXVw
Length: 36min 39sec (2199 seconds)
Published: Tue Apr 30 2024