Function Calling with Local Models & LangChain - Ollama, Llama3 & Phi-3

Video Statistics and Information

Captions
Okay, so recently I've been making a number of videos about building agents with the Llama 3 model, and we've been using Groq in the cloud to serve the 70-billion-parameter Llama 3. In this video I want to move away from the cloud entirely. We're going to use Ollama and the smaller Llama 3 8-billion model locally, in fact a quantized version of it on Ollama, to look at the question: can we do function calling with a local model?

I've made videos about function calling in the past, and generally I've used one of the big proprietary models served out of the cloud, whether that's OpenAI, Gemini, Anthropic, et cetera. What I want to do in this video is run these things locally, and I'm going to try to make a few more videos about building agents locally. One of the cool things you can do is let an agent run all night, use the internet, and do a whole bunch of different tasks while the LLM and all the other parts are running locally, so it isn't costing you anything in tokens and you don't have to worry about exceeding token limits or anything like that. It's all running locally.

So we'll start by looking at function calling with the local Llama 3. I also want to do this with the Phi-3 model. I didn't make a video when Phi-3 came out, but I do think it's a very impressive little model. It can be used for a variety of tasks, you can fine-tune it for a bunch of different things, and while it's gotten a lot bigger than the previous Phi models, it's still less than half the size of Llama 3 8B. So I want to see whether we can do any function calling, any sort of structured outputs or JSON, with both of these models, and how you would actually do it.

For measuring function calling there's a really nice paper called Gorilla, which looked at the whole idea of function calling, and one of the things that has come out of that is the function calling leaderboard. If you ever want to know which models out there can do function calling, and how well, you can come in here and see what they've got. You'll see the obvious models are there, like GPT-4 and Claude 3 Opus, et cetera. But if we scroll down, we can see Llama 3 70B in seventh place, and quite a lot further down, the Llama 3 8B model is also on the leaderboard. While it's nowhere near as good as the bigger version and the bigger proprietary models, it's still getting pretty decent results. So we're going to use this model, and I'm going to try out the Phi-3 model as we go through things as well.

All right, let's jump into the code. In the first example we're just going to look at setting up Llama 3 with Ollama. You really want to use ChatOllama for the model here; you can just pass it the model name, llama3. You can see I'm bringing in a string output parser and a chat prompt template. Something else you want to pay attention to is the keep_alive property. This matters most if you're working in notebooks: it can be a real pain to run things cell by cell and have the model load up again each time, so keep_alive is the way to keep the model loaded in memory as you go.
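Roughly, that whole first example — the model setup, the article prompt we'll look at next, and the streaming chain — looks like the sketch below. This assumes the langchain_community ChatOllama wrapper and a local Ollama server with llama3 pulled; keep_alive support may depend on your LangChain version.

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# keep_alive=-1 asks Ollama to keep the model loaded in memory between calls,
# so notebook cells don't have to reload the weights every time.
llm = ChatOllama(model="llama3", keep_alive=-1, temperature=0)

# Prompt -> model -> string parser, wired up with LangChain Expression Language.
prompt = ChatPromptTemplate.from_template(
    "Write me a 500 word article on {topic} from the perspective of a {profession}."
)
chain = prompt | llm | StrOutputParser()

# chain.invoke(...) returns the whole article in one go;
# chain.stream(...) yields it chunk by chunk as it generates.
for chunk in chain.stream({"topic": "LLMs", "profession": "shipping magnate"}):
    print(chunk, end="", flush=True)
```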
All right. Next we set up a very simple prompt template. You can see here I'm just saying "write me a 500 word article on {topic} from the perspective of a {profession}". We're going to use LangChain Expression Language, so we're just piping the prompt into the LLM and then into the string output parser, and those placeholders get swapped out for our topic and our profession. Then we can just invoke this to get it going. You'll see that when I run it, it takes a little bit of time to get started. Here we're doing the invoke, so we get the whole article back in one shot at the end. Let me scroll up so we can see it as it comes through. Depending on your machine this can take a short amount of time or quite a bit longer. All right, there's the article we got back, about LLMs being game changers for shipping magnates like me. You'll see again that it's in markdown format, but here we're just printing it out.

What if we want to see it as it's actually generating? We can come up here, comment out the invoke, and put the stream call in instead. Let me just clear this. This time when we run it, we get a streaming response back and we can see it as it streams out, once it's loaded the model and gotten going. You'll see very quickly that the model starts generating straight away, and it's producing quite a lot of tokens per second here. That will obviously change based on what computer you're using as you go through.

All right, next up, let's have a look at getting a JSON output from Llama 3. This is a modified version of one of the LangChain examples. What we're going to do in this case is define a JSON schema, try to get JSON back, and use a JSON output parser. One of the key things you want to do here is set the format of your model to be JSON. If you set this to JSON, I'm not a hundred percent sure what Ollama is doing in the background, but I think they're doing something like the JSONformer paper, where they make it much more likely that you get JSON out rather than free-form text, and they claim this guarantees you will get JSON out every time. I still find that you need to tell it you want JSON in your prompts as well, but this is a great way to extract information and get responses back in JSON. That's one of the key things you'll often want with agents: making sure you get something very specific back, with some kind of schema. You can see that the schema here is just telling it that it's going to be a person; we've got a description and some properties it's going to fill out as we go through.
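A sketch of that JSON-mode setup — the format="json" model, the person schema, and the chain through the JSON output parser that we walk through next — again assuming the ChatOllama wrapper; the prompt wording here is illustrative rather than the exact one from the video.

```python
import json

from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate

# format="json" asks Ollama to constrain the output to valid JSON.
llm = ChatOllama(model="llama3", format="json", temperature=0)

# The schema we want the model to fill in for a person.
json_schema = {
    "title": "Person",
    "description": "Identifying information about a person.",
    "type": "object",
    "properties": {
        "name": {"description": "The person's name", "type": "string"},
        "age": {"description": "The person's age", "type": "integer"},
        "favorite_food": {"description": "The person's favorite food", "type": "string"},
    },
    "required": ["name", "age"],
}

# Pass the schema in as an ordinary input variable so its curly braces
# don't clash with the prompt template's own placeholders.
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Answer in JSON using this schema:\n{schema}"),
        ("human", "{question}"),
    ]
)

chain = prompt | llm | JsonOutputParser()

result = chain.invoke(
    {
        "schema": json.dumps(json_schema, indent=2),
        "question": "Tell me about a person named John who is 35 years old and loves pizza.",
    }
)
print(result, type(result))  # e.g. {'name': 'John', 'age': 35, ...} <class 'dict'>
```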
You'll notice we've got the chat prompt here, where we pass that schema in and then pass in exactly what we want it to put out for us. For the schema, we're just taking the original JSON schema dictionary from above and converting it to a string to pass in. Then it goes through the prompt, through the LLM, and then through the JSON output parser. Once we run this, it parses everything out and we get back a dictionary: name John, age 35, and then hobbies: pizza. I think it made a mistake in the hobbies; that should have been favorite food. So it won't always get it a hundred percent right, and obviously the more detailed the things you ask for, the more chance it has of getting something wrong. We did get back the required name and age, though. You can see I've printed out the response, and also the type of the response, so we can see that we actually got a dictionary back.

What if we swap the JSON output parser for a plain string output parser? Rather than a dictionary, this time we just get a string back. The JSON output parser basically treats the output as a JSON string and converts it to a dictionary. With the string output parser we get the same response back, but with a whole bunch of whitespace after it, and right down at the bottom we can see the class is str. So generally you want to stick with the JSON output parser rather than the string output parser here.

Okay, next up we're going to look at getting structured responses using Ollama functions. This is the tool calling or function calling side of things; in LangChain, whenever you hear tool calling and function calling, they're interchangeable. Rather than giving the schema as a dictionary and converting it to a string like before, now we're using Pydantic to define a class for the things we want to extract. In this case the person is going to have a name, a height, and a hair color. This example is modified from the LangChain docs, where the context is: Alex is five feet tall; Claudia is one foot taller than Alex and jumps higher than him; Claudia is a brunette and Alex is blonde. That's the context we're going to pass into the prompt, and the prompt is: you are a smart assistant, take the following context and question below and return your answer in JSON. Looking at our model, you can see we're again using Llama 3 with format set to JSON and the temperature set to zero. We've got our prompt that takes in the context and the question, and you can see I'm wrapping it with the prompt format for Llama 3. Personally, I've found that getting the right prompt format is one of the key things for getting this to work properly.
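Put together, that structured-output example looks roughly like the sketch below. It assumes the experimental OllamaFunctions wrapper and its with_structured_output method behave as described here, and it leaves out the Llama 3 chat-template wrapping mentioned above.

```python
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_experimental.llms.ollama_functions import OllamaFunctions


# The attributes we want pulled out of the text.
class Person(BaseModel):
    name: str = Field(description="The person's name")
    height: float = Field(description="The person's height in feet")
    hair_color: str = Field(description="The person's hair color")


context = (
    "Alex is 5 feet tall. Claudia is 1 foot taller than Alex and jumps higher "
    "than him. Claudia is a brunette and Alex is blonde."
)

prompt = PromptTemplate.from_template(
    "You are a smart assistant. Take the following context and question below "
    "and return your answer in JSON.\n"
    "QUESTION: {question}\n"
    "CONTEXT: {context}\n"
    "JSON:"
)

llm = OllamaFunctions(model="llama3", format="json", temperature=0)

# Bind the Pydantic schema to the model so the chain returns a Person instance.
structured_llm = llm.with_structured_output(Person)
chain = prompt | structured_llm

result = chain.invoke({"question": "Who is taller?", "context": context})
print(result)  # e.g. Person(name='Claudia', height=6.0, hair_color='brunette')
```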
So I'm going to pass it a question and a context, and then prompt it to get the JSON out. We're using the with_structured_output idea here: we're binding that Pydantic model to the LLM to get structured output. So we've got the prompt, and we've got the structured LLM, which is just our LLM with the Pydantic class schema bound to it. Then we can pass in something like a question and a context and run it. It's now using the Pydantic model, and obviously we've got Llama 3 set to JSON in this case. Look at what we got back: Claudia, height 6.0, hair color brunette. So it's actually worked out that Claudia is one foot taller than Alex, and is therefore six feet tall, and that Claudia's hair is brunette. We've got the right answers out.

Let's try the same thing with the Phi-3 model, which is a lot smaller than Llama 3. The only changes here are that I've changed the prompt to use the Phi style of prompting and changed the model to Phi-3; the rest is exactly the same. Because this model is a lot smaller, I wouldn't expect it to be as good as Llama 3, but when I run it on something like this, you can see it also works out that the name is Claudia for who is taller, the height is 6.0, and the hair color is brunette. This shows we can use the Phi-3 model for a lot of this simple structured-output stuff. I'm not going to say it's going to be great at it; on the whole the Llama 3 8-billion-parameter model is probably going to be better, and the 70-billion-parameter model is certainly going to be a lot better, but you might be surprised that you can get some really nice outputs from Phi-3 doing exactly these kinds of things.

All right, let's move on to the tool use and function calling example. In this example I'm back to using the Llama 3 model, and now we're using the new, sort of experimental OllamaFunctions. This has just been added to LangChain recently, and it lets you bind tools to the model just like you do with other function calling approaches. This isn't necessarily going to work with every model on Ollama, but for the models that do support function calling you'll be able to do this: you've got Llama 3, you've got Phi-3 like we're playing with here, you've got some of the Mixtral models. There are a bunch of different models you can do something like this with through OllamaFunctions. We're taking one of the classic examples here: the tool it's going to use is get the weather. So we're passing in the definition of the tool here, not actually implementing the tool itself; you can look at other examples I've done for that. What we want to see is, if we give it something related to weather, will it come back telling us we should be calling this get_current_weather function? So I'm just passing in a simple question: what is the weather in Singapore?
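A sketch of that tool-calling setup, assuming the experimental OllamaFunctions bind_tools accepts a JSON-schema style tool definition like this; the exact shape of the returned message can vary between LangChain versions.

```python
import json

from langchain_experimental.llms.ollama_functions import OllamaFunctions

llm = OllamaFunctions(model="llama3", format="json")

# We only declare the tool's schema here; the actual weather lookup would be
# implemented separately and called once the model asks for it.
llm_with_tools = llm.bind_tools(
    tools=[
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city, e.g. Singapore",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    ],
    function_call={"name": "get_current_weather"},
)

response = llm_with_tools.invoke("What is the weather in Singapore?")

# The model doesn't answer directly; it returns which function to call and
# with what arguments, tucked into the message's additional_kwargs.
call = response.additional_kwargs["function_call"]
print(call["name"])                   # get_current_weather
print(json.loads(call["arguments"]))  # {'location': 'Singapore', 'unit': 'celsius'}
```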
Okay, I'm running that. What I want to see back here is not just a JSON response, but an actual function call response. And sure enough, we've got it: we've got content with additional kwargs and a function call. The function it wants to call is get_current_weather, and the arguments are location Singapore and unit celsius in this case. So now we can use those to actually make the function call and then pass the result back, just like I've done before with things like OpenAI.

Let's try the same thing with Phi-3. In the Phi-3 example everything is exactly the same; we've just swapped the model out for Phi-3. I think it's going to work for this one, but it won't necessarily always work, especially when you've got multiple tools. Let me run it. That was actually very quick, and you can see what we got back: there's no content, as in no text coming back, but we've got a function call, the name of the function, and the arguments. In this case it didn't give us the argument for the unit, it just gave us location Singapore. So it shows that Phi-3 is perhaps not as strong as the Llama 3 8-billion model at doing this, but it was still able to handle it and pass things back for us.

So that gives you a few ways you can do JSON, structured outputs, and tool use with these local models: your Llama 3, your Phi-3, some of the Mixtral models. And there are a whole bunch of other models that people are fine-tuning just for function calling; if they've been fine-tuned in a way that makes them compatible, you'll find they often work with this too. I actually find that Llama 3 70B can handle a variety of different styles of function calling, just because it does so well at picking up the format you're giving it and working out what it should return. But anyway, now you've got two models you can use locally to start building agents that can call functions and use tools, with effectively unlimited free tokens, running nonstop on a variety of tasks. In some future videos we'll start looking at actually building some agents and some RAG systems on these local models, so we can use them for a variety of different things.

Anyway, as always, if you've got any comments or questions, please put them in the comments below. I would love to hear how you're using this and what kinds of tools you're looking at using, or currently using, for your agents. And as always, if you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.
Info
Channel: Sam Witteveen
Views: 28,929
Keywords: llama 3 8b, function calling, llama 3, llama 3 tutorial, function calling mistral, function calling langchain, llama 3 70b
Id: Ss_GdU0KqE0
Length: 17min 29sec (1049 seconds)
Published: Wed May 08 2024