Streaming for LangChain Agents + FastAPI

Captions
Today we're going to learn how to do streaming with LangChain. Streaming is a very popular feature for large language models and chatbots, but it can be quite complicated to implement, or at least confusing to get started with.

For those of you who may not be aware of it, streaming is essentially when you're talking to an LLM or a chatbot and it loads the output token by token, or word by word. The whole point of using streaming is that, particularly if you're generating a lot of text, you can begin showing the user that text and they can begin reading sooner.

Implementing streaming can be very simple for a simple use case, but it begins to get more difficult first when you start using LangChain, then a little more complicated when we use agents, and more complicated again if we take it further and want to stream the data from our agent through to our own API. In this video we're going to go through all of that, so by the end we'll have an agent in LangChain that is streaming via a FastAPI instance. Let's jump straight into it.

The simplest form of streaming we can do is basically just printing out to the terminal or, as you'll see here, to a Jupyter notebook output, and that level of streaming is very easy to achieve. We use two parameters when initializing our LLM. Here we have a ChatOpenAI object, initialized as usual with all the typical parameters, but I'm also adding in the streaming parameter and the callbacks parameter. The streaming parameter is, I think, pretty obvious in what it does: it just switches on streaming. It's worth noting that this will only work with certain LLMs, since not all LLMs support streaming, but for OpenAI it is of course supported.

We also need to pass the callbacks parameter, and I think this is probably the more interesting part, because this is what handles the streaming. The handler is the StreamingStdOutCallbackHandler, where the standard output is basically like a print. We can even see what it's doing by heading over to its definition: it has an on_llm_new_token method, and when that is executed it prints the new token. So that's all we're doing by adding this callback: every newly generated token is sent to standard output.

With that initialized (and of course you should put in your OpenAI API key here if you haven't already), we run it like we would in LangChain normally: we create our HumanMessage and pass it to our LLM, and you'll see straight away that we have streaming. It took 2.2 seconds to complete, so if we hadn't enabled streaming there, it would have taken 2.2 seconds to show anything; the whole response would have appeared at once at the end. Instead, you saw it go through token by token, printing them to our output. This is really good, especially when we have longer outputs. If we say "tell me a long story" and run it again, it's going to take a little more time, and you can see it working through the text; if we were printing the whole thing out nicely, we'd be able to follow along and read as it goes.
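For reference, a minimal sketch of that setup, assuming the langchain import paths from around the time of this video (the model name and temperature are just illustrative choices):

```python
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.schema import HumanMessage

llm = ChatOpenAI(
    openai_api_key="YOUR_OPENAI_API_KEY",
    model_name="gpt-3.5-turbo",
    temperature=0.0,
    streaming=True,  # switch on token-by-token streaming
    callbacks=[StreamingStdOutCallbackHandler()],  # print each token as it arrives
)

# tokens are printed to stdout as they are generated
llm([HumanMessage(content="tell me a long story")])
```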
Compare that with not streaming: it keeps going, 19 seconds now, 20 seconds, 21 seconds, and we'd be waiting that whole time just to begin reading. That was almost 26 seconds in total; without streaming we'd have to wait that whole time before reading anything, but with streaming we can begin reading earlier, which is obviously a nice little feature to have.

Now, using an LLM and streaming to your terminal or Jupyter notebook output is the easiest thing you can do. It begins to get a bit more complicated as soon as you start adding in more logic, so let's take a look at how we might do it for an agent in LangChain.

We're going to initialize an agent: we have our memory, we load in one tool, and we initialize a conversational ReAct agent, using the same LLM as before. Because we're using the same LLM, our StreamingStdOutCallbackHandler is already included within our agent. One thing that you should do here is make sure return_intermediate_steps is set to False, because returning them will trigger issues with the callbacks, and you can pull back those intermediate steps by parsing what is being output to you anyway. I also set verbose equal to True so that we can see all of the outputs.

After initializing that, I create a prompt. It's just a string this time rather than messages, because we're using an agent rather than the LLM directly. And let's try it: you can see that it streamed, but it output the entire agent response, with the action and the action input. Whereas before we just got the answer text, with an agent we get more, because an agent is basically a little bit of added logic around the LLM, and in order for it to correctly parse what the LLM wants to do, it asks the LLM to return its output in a JSON format with an action and an action_input.

This is useful because sometimes we want our agent to use a tool, or to go through multiple thinking steps. So here's an example of that: we're going to use the calculator, the LLM math tool that we loaded, and ask "what is the square root of 71?"
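To recap this part, the agent setup plus the calculator query looks roughly like the sketch below. The memory class, window size, and the exact agent string are assumptions based on the description (a conversational ReAct agent with a single math tool):

```python
from langchain.agents import initialize_agent, load_tools
from langchain.memory import ConversationBufferWindowMemory

# conversational memory for the agent (window size is an assumption)
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    k=5,
    return_messages=True,
)

# the LLM math tool acts as our calculator
tools = load_tools(["llm-math"], llm=llm)

agent = initialize_agent(
    agent="chat-conversational-react-description",
    llm=llm,  # the same streaming LLM, so its callback carries over
    tools=tools,
    memory=memory,
    verbose=True,
    return_intermediate_steps=False,  # avoids issues with the streaming callbacks
)

agent("what is the square root of 71?")
```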
We run this, and you see it's using the calculator action: it passes the expression, the square root of 71, into the calculator tool, gets the answer from that, and then feeds it back to us in the final answer. When we ran it, it streamed the calculator step, it also streamed this little bit here (I'm actually not sure why it does that), and then it streamed the final answer.

So we can parse that output and extract the tool calls and the final answer, but how do we do that in a clean way? Because right now it's just outputting everything to us. We have two options. There's a LangChain callback handler built specifically for outputting only the final answer from an agent; that is literally what it's for. The other option is to create a custom callback handler. The custom callback handler is more flexible, of course; it just requires a little bit of extra work on our side. Both are pretty straightforward, so let's first have a look at the simple, out-of-the-box LangChain callback handler, the FinalStreamingStdOutCallbackHandler.

Because we're initializing a new callback handler, we need to re-initialize our LLM. For now I'm going to use the default tokens, so I initialize it like that, re-initialize the agent, and then go ahead and see what we get. And it didn't really stream anything; it just streamed this little bit at the end. That's because, by default, the handler is looking for a slightly different answer prefix than what our agent produces, something like "final answer" with a lowercase "a", or something along those lines, I don't quite remember. What we can do instead is pass the answer_prefix_tokens argument: once the handler sees those tokens, it begins streaming, so we should hopefully get a better result by doing this. Let's try. Okay, it started streaming, it streamed, but it's still kind of messy.

So honestly, the easier approach in my opinion is just to use a custom callback handler. To do that, we create our own class, inheriting from the StreamingStdOutCallbackHandler that we saw before. When we initialize it, we set a self.content variable equal to an empty string; we're going to be feeding every token into that, and we're going to use it as a check. When we see that "Final Answer" is within self.content, we switch from doing nothing to actually streaming, and the way we make that switch is with a final_answer flag. Then we override on_llm_new_token, the method we saw earlier in the LangChain definition of this class, creating our own version of it. This is called with every new token output by the LLM, so we take that token, add it to our content, and then say: if "Final Answer" is in self.content, that means we're now in the final answer section, so we set final_answer equal to True, and we also re-initialize our content.
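A sketch of that custom handler. It assumes the agent returns its JSON with "action" and "action_input" keys, so the exact substrings being checked (and the choice of writing to sys.stdout) are my assumptions:

```python
import sys
from typing import Any

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler


class CallbackHandler(StreamingStdOutCallbackHandler):
    def __init__(self) -> None:
        super().__init__()
        self.content = ""          # everything the LLM has produced so far
        self.final_answer = False  # flips once we reach the final answer

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        self.content += token
        if "Final Answer" in self.content:
            # we've reached the final answer; reset the buffer so the
            # "action_input" check below matches the final JSON block,
            # not an earlier tool call
            self.final_answer = True
            self.content = ""
        if self.final_answer:
            if '"action_input": "' in self.content:
                # stream only the final answer text
                sys.stdout.write(token)
                sys.stdout.flush()
```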
Maybe I can show you why we re-initialize the content there. If I comment that reset out, then as soon as final_answer is True, the check for the action input activates, and that's where we actually output our tokens (again, this is just equal to printing the token). We still need to wait for "Action Input", and once we see it we begin printing. So let's run all of that and try it. If you watched closely, it began printing at the action input as soon as it saw "Final Answer"; in fact, it even prints "answer" itself. The reason is that as soon as we see "Final Answer" the flag gets activated, but an action input from the earlier tool call is already sitting in the content, so it matches that first one and starts printing straight away rather than waiting for the second action input. That's why we re-initialize the content. Let's try it with the reset in place and see if it works any better. Okay, you can see it now begins streaming on the actual answer input.

Now, there is one thing that I haven't been able to figure out, and if anyone out there knows how to deal with this, please let me know: this little bit of trailing backticks is also streamed, which doesn't make much sense to me. I was looking at this, trying to figure out where it's being streamed from, and honestly I have no idea. But I'm not really too bothered about it; I could add some sort of filter around the backticks. Rather than dealing with that, what is more important to me is getting all of this working with an API, which again adds a little more complexity to the whole thing.

The issue with getting this working with an API is primarily that, in order to pass these tokens through an API (if you're using FastAPI, for example), you need to be running a loop that is consuming your tokens and passing them through what is called a streaming response object. The only issue is that, at the same time, we also need to be running our agent logic. So we need to be running these two separate bits of code at the same time, which means we're going to have to rely on a lot of async functions.

The first thing we need to do is set up our API, so I'll start with a very simple version that doesn't include streaming. It's just FastAPI, and it includes everything from the notebook: in here we initialize our agent, then we define the input format that we query with (what we send to our API), we have a health check endpoint, which we'll test in a moment, and we have the chat endpoint, which is where we send our queries, plus some FastAPI boilerplate to get everything running.

Before we run anything, we need to start the API. Switching over to the terminal, I check where we are: we have this demo.py file, which is what I'm running, so I start it with uvicorn demo:app --reload. If you're going to run this with what will become the main Python file I'll show you later, you just replace demo with main. We can check that it's running by calling the health endpoint, and we get the status back saying everything's good.
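Here's a rough sketch of what that first, non-streaming demo.py might look like. The endpoint paths, the Query model, and the choice of POST for the chat route are assumptions (the video switches between GET and POST requests at a couple of points), and the agent itself is assumed to be initialized above, exactly as in the notebook:

```python
# demo.py -- a simple FastAPI wrapper around the agent, no streaming yet
from fastapi import FastAPI
from pydantic import BaseModel

# ... LLM, tools, and agent are initialized here, as in the notebook ...

app = FastAPI()


class Query(BaseModel):
    text: str


@app.get("/health")
async def health():
    # simple health check endpoint
    return {"status": "ok"}


@app.post("/chat")
async def chat(query: Query):
    # call the agent and return the full response in one go
    return agent(query.text)
```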
The notebook also has a snippet showing how we would actually consume a stream, but we haven't implemented streaming yet, so let's just send a request like we usually would. What I'll do first is actually a GET rather than a POST: requests.get against the chat endpoint on localhost:8000, with a JSON body containing a text parameter, and we'll just put "hello there" as our query. We run that and get a 200, which is good, and if we look at what it contains, we've got the response. That's great, but that is obviously not streaming, and we want to implement streaming, so let's go through how we do that.

The first thing we're going to want to do is add the callback to our agent. We've already initialized it, so rather than rebuilding it we're just going to replace its callbacks. We write async def run_call(query, stream_it), where stream_it is a streaming iterator; initially it will be of type AsyncIteratorCallbackHandler. The first thing we do inside run_call is assign that stream iterator to the agent's callbacks. Then we await our agent call, but because we're doing everything async, we call acall, and we pass our inputs, which maps the input key to our query. That gets us our response, so let's say we return it. Then, in the chat endpoint, we call run_call: we create our streaming iterator, pass it in, and do response = await run_call(...). We save that, the server reloads, and we can check whether it's working.

Health check first: all looks good. Then the chat request: so far it just seems to give us the response straight away, but then again we're not making a streaming request on the client side, so we should try that. Let's try the streaming request. "Method not allowed": right, wrong HTTP method, that should be a POST, sorry, so let's change that. Now it's taking a long time, and that's because this isn't actually a stream; it's kind of just stuck. If we look at the server logs, we can see it entered the new chain and it generated a story, without any errors there, but the client just got an internal server error. So clearly whatever we just did doesn't work. Let's change the query to "hi there" so it's a little quicker. If we try to make another plain GET request, we see that error again, so basically we broke it; we'll just shut the server down and restart it. Now let's try again, and okay, the plain request looks good again. So we're clearly missing a few things from our file here; let's work through those.
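Before fixing it, here's roughly where the server code stands at this point. The attribute path used to swap in the callbacks (agent.agent.llm_chain.llm.callbacks) is my assumption about where the LLM sits inside the agent executor:

```python
from langchain.callbacks import AsyncIteratorCallbackHandler


async def run_call(query: str, stream_it: AsyncIteratorCallbackHandler):
    # the agent is already initialized, so swap the streaming callback
    # handler onto the LLM it wraps instead of rebuilding everything
    agent.agent.llm_chain.llm.callbacks = [stream_it]
    # everything here is async, so use acall rather than calling the agent directly
    return await agent.acall(inputs={"input": query})


@app.post("/chat")
async def chat(query: Query):
    stream_it = AsyncIteratorCallbackHandler()
    response = await run_call(query.text, stream_it)
    return response  # still returned in one go: no streaming yet
```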
To get streaming working, we can't just return a response like this; we need to return what is called a StreamingResponse. There's also a media_type value in there, which I'll talk about in a moment: we're returning this as an event stream ("text/event-stream"). I go up to the top of the file and uncomment the StreamingResponse import, and now we have our streaming response. The thing about a streaming response is that it wants a generator, but okay, maybe we can just put our response in there and see what happens. We try it and get an error, and if we look at the terminal we see "'dict' object is not an iterator". That is because our StreamingResponse expects a generator object, or at least something it can iterate over, and we don't have that; we're just kind of returning everything at once. So we need to fix that.

The way we fix it is by not giving it the response, but by giving it our iterator, because the callback handler is an iterator; we just need something more like a standard generator. So we create a generator function. Inside it we use asyncio.create_task (so I need to import asyncio), and with that we send our run_call off to execute in the background, concurrently, because at the same time we also need to be running the loop that pulls tokens out of the callback handler. Using create_task allows the agent call to run in the background while we iterate over the handler, and that loop is how we get tokens out of the handler and into our streaming response. To actually set that up, we get our generator, we can drop the response object since we no longer need it, we replace the return value with a StreamingResponse wrapping our generator, and obviously we no longer need the bare run_call line in the endpoint. (You'll see the full shape of this in the sketch below.) Save that, and let's try running again.

Nice, that kind of looks like it's streaming, but let's rerun it and watch the terminal as well, to see at what point it's actually streaming. Sometimes it can trick you: you might think you're streaming, but actually it has loaded everything and only then begins streaming, which is not ideal. What we're looking for is that it begins streaming while the logs still only show that we're entering the new executor chain; if the chain finishes before we start streaming, it means we're not actually streaming. So let's ask for the long story again, "tell me a long story", and it looks good: we're streaming and it hasn't come up with the final answer yet, so it really is giving us these tokens as they're being generated by the model. But then we get to the end and we still have that error, so we need to stop the app, restart it, and run it again; let's go with "hi there" to keep it quick.

What we can see now is that it streams, but it's returning the full agent-formatted response, and we don't necessarily want that; we kind of want it to just return the final answer text. To do that, we need to go back into custom callback handlers. We can modify things just like we did before with the custom callbacks, so I'm going to copy the custom callback code in and go through it with you. We need LLMResult and Any imported as well. This time I'm subclassing the AsyncIteratorCallbackHandler, and it has two methods that we need to override. In on_llm_new_token, because we're using agents, I want to do what we did before, checking for the final answer and also the action input, but slightly differently. And in on_llm_end, I want to check whether we've actually reached the final answer yet, because if we're using multiple tools, it would otherwise say "I'm done" as soon as it finishes the first tool call, which we don't want. By adding that if statement, we only stop streaming once the final answer has been given, and at that point we call self.done.set(), which tells the callback handler that streaming is complete.
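Putting the pieces together, here's a sketch of how the custom async handler, the generator, and the endpoint might fit. The class name, the exact substring checks, and the details of on_llm_end are assumptions based on the walkthrough:

```python
import asyncio
from typing import Any

from fastapi.responses import StreamingResponse
from langchain.callbacks import AsyncIteratorCallbackHandler
from langchain.schema import LLMResult


class AsyncCallbackHandler(AsyncIteratorCallbackHandler):
    content: str = ""
    final_answer: bool = False

    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        self.content += token
        if self.final_answer:
            # once we're inside the final answer JSON, forward the answer tokens
            if '"action_input": "' in self.content:
                self.queue.put_nowait(token)
        elif "Final Answer" in self.content:
            # we've hit the final answer; reset the buffer so the check above
            # matches the final "action_input", not an earlier tool call
            self.final_answer = True
            self.content = ""

    async def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        # the agent may call the LLM several times (for example once per tool),
        # so only signal completion once the final answer has been produced
        if self.final_answer:
            self.content = ""
            self.final_answer = False
            self.done.set()
        else:
            self.content = ""


async def create_gen(query: str, stream_it: AsyncCallbackHandler):
    # run the agent in the background while we yield tokens from the handler
    task = asyncio.create_task(run_call(query, stream_it))
    async for token in stream_it.aiter():
        yield token
    await task


@app.post("/chat")
async def chat(query: Query):
    stream_it = AsyncCallbackHandler()  # our custom handler instead of the default
    gen = create_gen(query.text, stream_it)
    return StreamingResponse(gen, media_type="text/event-stream")
```

On the client side, you would consume this with something along the lines of requests.post(url, json={"text": "hi there"}, stream=True) and then iterate over the response chunks as they arrive.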
Once that is set, the await task in our generator can finish as well. Now we just need to replace the default handler with our async callback handler, using the name we gave it, wherever it's created, and we should also change the type hints to match (as in the sketch above). We rerun that and try again, and now you can see we just get the answer text rather than the full agent output.

So with that, we have our pretty rough API that can handle streaming. Obviously there are a few other things we should really add: in most cases we should test it a lot more and see if there are any weird things that happen, particularly with the way we have the callback handler set up right now, where we're looking specifically for "Final Answer" in the content. What if the agent just decides it's not going to generate a final answer, or doesn't follow the format it's supposed to? We need some logic in there to handle those cases where it might do that. But for the most part, this is the core of what you need in order to actually have streaming and have it behind an API.

Now, that's it for this video. I hope this has been useful and interesting, but for now I will leave it there. So thank you very much for watching, and I will see you again in the next one. Bye!
Info
Channel: James Briggs
Views: 23,174
Keywords: artificial intelligence, natural language processing, nlp, chatgpt, langchain vector database, chroma db langchain, llm, langchain tutorial, langchain, james briggs, gpt 3.5, gpt 4, langchain 101, langchain search, langchain memory, ai, openai api, langchain python, langchain ai, retrieval augmentation, pinecone langchain, langchain agents, langchain streaming, langchain streaming response, langchain streaming response fastapi, langchain agent streaming, langchain agent fastapi
Id: y2cRcOPHL_U
Length: 27min 41sec (1661 seconds)
Published: Sat Sep 30 2023