5 Problems Getting LLM Agents into Production

Captions
All right. In this video I want to talk about the five problems I keep seeing again and again when people try to get their agents good enough to put into production. I get a lot of questions about frameworks around this, and while I'm trying to stay reasonably framework agnostic here, some of these issues certainly apply more to some frameworks than to others. One thing that came up recently was someone asking me about putting CrewAI into production, and my comment was that I currently would never put CrewAI into production, because there are so many issues with it that I wouldn't trust it. Putting something like LangGraph into production is certainly much more reliable. But you will hit some of these problems with every agent framework if you're not aware of them and not thinking about how to fix them. So let's dive in.

By far the number one problem for agents at the moment is reliability. Talking to a lot of startups and companies that want to build agents, the thing I see consistently is that companies are very reluctant to use agents for anything really complicated, simply because the reliability is so low. Your typical company wants five nines of reliability; they would probably settle for two nines, meaning 99%, but most agents are at best getting around 60 to 70 percent. There are some places where that may be okay, but for most things, getting something into production means making it reliable. It has to consistently produce an output the end user can actually benefit from and use, one that matches what they expect. There's no point in creating agents that only work some of the time and fail a large percentage of the time. The issue that creates is that humans then have to check every single thing the agent does. That's fine if you're just starting out, building training data with a human in the loop, but what we ultimately want from agents is for them to be fully autonomous, operating by themselves and producing a consistent level of results without a human in the loop.

That brings us to some of the things that actually go wrong. The second problem I see a lot is agents going into excessively long loops, and this can happen for a variety of reasons. It's quite common in CrewAI and some of the other frameworks: the agent doesn't like the output of a tool, often because the tool is failing or simply not working in some way, or an LLM takes the response from one sub-agent, decides that no, it needs to do that part again, and gets into a loop of going through it again and again. This is one of the frustrations I've felt a lot with CrewAI and some of the others. With LangGraph, what I actually do is hard-code things so we know how many steps it's taking. CrewAI has now added something similar, where you can limit the number of steps, repeats, and retries it does. But this is a very common pattern with LLM agents, and a lot of what you have to think about when architecting an agent is how to handle these loops. Ideally you want to reduce them to none, but if they do happen, you want your overall system to be aware of them and put a stop to them quickly. Otherwise you end up with an agent making LLM call after LLM call, and if it's fully autonomous and nobody is watching, that can get very expensive very quickly if you're using an expensive model.
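As a concrete illustration of that kind of guard, here is a minimal, framework-agnostic sketch of an agent loop with a hard cap on iterations and a simple repeated-action check. The `call_llm` and `run_tool` functions are hypothetical placeholders, not part of any particular framework.

```python
# Minimal sketch of a step-capped agent loop (assumed helper functions,
# not tied to LangGraph or CrewAI).

MAX_STEPS = 8  # hard ceiling so a confused agent can't loop forever


def call_llm(history: list[dict]) -> dict:
    """Hypothetical: ask the model for the next action given the history."""
    raise NotImplementedError


def run_tool(name: str, args: dict) -> str:
    """Hypothetical: execute a tool and return its observation."""
    raise NotImplementedError


def run_agent(task: str) -> str:
    history = [{"role": "user", "content": task}]
    last_action = None

    for step in range(MAX_STEPS):
        decision = call_llm(history)  # e.g. {"action": "search", "args": {...}} or {"final": "..."}

        if "final" in decision:
            return decision["final"]

        action = (decision["action"], str(decision["args"]))
        if action == last_action:
            # The agent is repeating itself: bail out instead of burning tokens.
            return "Stopped: agent repeated the same action twice."
        last_action = action

        observation = run_tool(decision["action"], decision["args"])
        history.append({"role": "tool", "content": observation})

    return "Stopped: step limit reached without a final answer."
```

The frameworks offer their own versions of this kind of cap; the point is simply that some hard stop should exist outside the LLM's own judgement.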
The third problem that can go wrong is around tools. Tools are something I've been meaning to make a lot more videos about. In the previous section I talked about failing tools, and this happens a lot more than people seem to realize. While the tools in things like LangChain are pretty nice for starting out, you're going to find that you want to customize them heavily for your specific use case. A lot of those tools were made over a year ago; they were very simple at the time, they weren't really made for agents, and they were often built more for RAG than for agentic use. What you really want to do is make your own set of custom tools. I'll follow up with a video about custom tools, but I will say that tools are really your agent's secret sauce: a good set of tools can filter inputs, use inputs in the right way, and generate outputs that are actually beneficial to the LLM. The whole tools question comes down to: how do you get data, how do you manipulate it, and how do you prepare it for an LLM? And when a tool fails, how does it tell the LLM that it has failed in a way that is actually useful, rather than sending the agent into an endless loop?

So for often quite simple things, I will make quite complex tools. One example is a webpage diffing tool that checks the output of a web page so an agent can tell when the page has been updated. A simple use of this tool was checking whether OpenAI's webpage had been updated; the agent could then assess what new links had appeared, go to those links, and find out what had been announced, returning news and other information. The same approach worked nicely on sites like CNN and other news sites. The idea is that this is a very custom tool for a very specific use case, and that's how you want to think about most of what you're building. When I look at the best agents companies are building, they generally have very specific tools that can handle different kinds of input, work out what they need to do to generate data, and provide it back to the agent in a way that's useful, so the agent knows what's going on.
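To make that concrete, here is a minimal sketch of what a webpage diffing tool along those lines might look like. This is not the tool from the video; the function names, the snapshot file, and the use of `requests` are all assumptions for illustration.

```python
# Sketch of a webpage "diff" tool: remembers the links seen last time and
# reports only what is new, so the agent gets a small, useful observation.
import json
import re
from pathlib import Path

import requests  # assumed available; any HTTP client would do

SNAPSHOT = Path("page_snapshots.json")  # hypothetical local cache of previous runs


def _load_snapshots() -> dict:
    return json.loads(SNAPSHOT.read_text()) if SNAPSHOT.exists() else {}


def check_page_for_updates(url: str) -> str:
    """Return a short, LLM-friendly summary of new links on the page."""
    try:
        html = requests.get(url, timeout=15).text
    except requests.RequestException as exc:
        # Tell the LLM clearly that the tool failed, instead of returning nothing.
        return f"TOOL ERROR: could not fetch {url}: {exc}"

    links = set(re.findall(r'href="(https?://[^"]+)"', html))

    snapshots = _load_snapshots()
    previous = set(snapshots.get(url, []))
    new_links = sorted(links - previous)

    snapshots[url] = sorted(links)
    SNAPSHOT.write_text(json.dumps(snapshots, indent=2))

    if not new_links:
        return f"No new links found on {url} since the last check."
    return f"{len(new_links)} new links on {url}:\n" + "\n".join(new_links)
```

Note the error path: a failed fetch comes back as an explicit message the LLM can reason about, which is exactly the point above about tools reporting failure in a useful way.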
One of the classic examples of this is the simple search tools: many of them return information about what's on a page but don't actually give you the URL. So you want to customize them so that you do get the URL back, store those URLs, and cache any response for a URL. If you're scraping a URL, cache it so your agent can use that cache again and again without repeating the call. This is a whole class of what I'd call intelligent tools that you want to build.

This brings us to the fourth problem I see a lot, which is self-checking. Your agent needs some way of checking its outputs and seeing whether they are useful or not. The classic example is code: if you've got an agent that generates code, you want to make sure that at some point that code gets checked, and that might be as simple as running a unit test to see whether all the imports work, whether the functions actually run, and whether they return what you expect. Set up tests like that so you can check the output of the code the agent is generating. In many other use cases you won't be generating code, so you need to think about how your agent will know whether something is right or wrong, whether it's useful or totally off base from what the end user wants. That can be things like checking URLs: LLMs love to hallucinate URLs, so check whether those URLs actually exist (there's a small sketch of such a check after this section). This idea of self-checking is really key.

The last problem you need to think about a lot, and that I see as a big issue with LLM agents, is the lack of explainability. When the user gets a result back from an agent, can the agent point to some explanation? Citations are a great way of doing this: citations showing exactly where the information used to make a decision or produce a result came from. That gives people a lot more confidence in the agent's output, because they can see why the agent said something or gave a certain result. It can also be things like being able to look at a set of log files, or at the outputs the agent produced along the way.

Which brings us to the sixth, bonus thing you need to think about: debugging an agent. You need some kind of outputs or logs that are intelligent, not just the raw calls to the LLMs and agents. That's one way of doing it, but it can be a very tedious way to go through. You need to be able to assess at which point the agent starts to fall apart. Remember, if you're using an LLM agent, you should mostly be using it to make decisions, and perhaps to generate tokens out as text or code; the reasoning part of the agent is there to make decisions. You want those decisions logged independently, in a way that makes it easy to look and say, okay, this looks a bit suspicious, what's going on here, can we debug this? You can then look at the reasoning points in the agent as you go along.
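Here is the small URL self-check mentioned above: before showing the user any links the agent produced, verify that they actually resolve. The function name and the use of `requests` are assumptions; any HTTP client would work.

```python
# Sketch of a self-check step: filter out URLs that don't actually resolve,
# since LLMs are prone to hallucinating plausible-looking links.
import requests  # assumed available


def verify_urls(urls: list[str]) -> tuple[list[str], list[str]]:
    """Split URLs into (reachable, suspect) by making a lightweight request."""
    reachable, suspect = [], []
    for url in urls:
        try:
            resp = requests.head(url, allow_redirects=True, timeout=10)
            # Some sites reject HEAD; fall back to GET before giving up.
            if resp.status_code >= 400:
                resp = requests.get(url, stream=True, timeout=10)
            (reachable if resp.status_code < 400 else suspect).append(url)
        except requests.RequestException:
            suspect.append(url)
    return reachable, suspect
```

The agent, or the code around it, can then drop or flag the suspect list before anything reaches the end user.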
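And for the debugging point, here is a sketch of logging the agent's decision points separately from the raw LLM traffic, so you can scan just the decisions when something looks off. The logger name and record format are assumptions, not a prescribed scheme.

```python
# Sketch of independent decision-point logging: one compact line per decision,
# kept separate from the full prompt/response logs.
import json
import logging
import time

decision_log = logging.getLogger("agent.decisions")  # hypothetical dedicated logger
decision_log.setLevel(logging.INFO)
decision_log.addHandler(logging.FileHandler("agent_decisions.jsonl"))


def log_decision(step: int, action: str, reason: str, inputs_summary: str) -> None:
    """Record what the agent decided and why, in one scannable JSON line."""
    decision_log.info(json.dumps({
        "ts": time.time(),
        "step": step,
        "action": action,          # e.g. "call_search_tool", "rewrite_answer"
        "reason": reason,          # the model's stated reasoning, truncated if long
        "inputs": inputs_summary,  # short summary, not the full prompt
    }))
```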
These are things you need to be thinking about constantly when you're doing anything with LLM agents or autonomous agents. Far too often I see people doing things that don't even need an LLM; you can just sequence the steps, with no decision point needed at all. When you're building your agent, you want it to have as few decision points as possible while still achieving the outcome you're after. So go back and assess some of your own agents: where are the decision points, and how are you checking that each of them is being handled, so that you actually get reliability out of the system?

I'll be making a bunch more videos on building things with LangGraph, and even with things like CrewAI. Even though I don't think CrewAI is ideal for production, I think it's great for trying ideas out really quickly, and I'll show you some of the things I've been doing with it to build crews quickly, try out ideas, and get a sense of what is likely to work and what isn't, and then look at converting them across to much more low-level code: things like LangGraph, or just coding them in plain Python. Often you don't need a framework at all, and that's something I want to go into more in the future.

Anyway, hopefully this video was useful to get you thinking about the key things that go wrong when getting LLM agents into production, and how you can start thinking about mitigating some of these problems. As always, if you've got comments or questions, please put them in the comments below. If you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.
Info
Channel: Sam Witteveen
Views: 12,279
Keywords: ai agents, crewai, autogen, phidata, phidata ollama, crewai tutorial, crewai local, agents in production, ai agents explained, ai agents use cases
Id: 06kslWw_QOc
Length: 13min 11sec (791 seconds)
Published: Tue Jun 04 2024