Okay. So recently I've been making a number of videos about building agents using the Llama 3 model. And we've actually been using
Groq in the cloud to serve the 70 billion Llama 3 model. In this video, what I want to do is
move away from using the cloud at all. And we're going to be using Ollama and the
smaller Llama 3 8 billion model locally. In fact, we're going to be using a
quantized version of that on Ollama, to look at the idea of: can we do function calling with a local model? So I've made videos about
function calling in the past, and generally I've tended to use one of the big proprietary models served out of the cloud, whether that's OpenAI, Gemini, Anthropic, et cetera. But what I thought I would do in this video is actually look at running some of these things locally, and I'm going to try and do a few of these videos showing building agents locally. One of the cool things you can do is get
an agent to basically run all night and use the internet and do a whole bunch of
different things, while the LLM and all the other parts are running locally. So it's not costing you anything in paid tokens, and you don't have to worry about exceeding token limits, all that kind of stuff. It's all running locally here. So what I thought we'd do is start off by looking at function calling for the local Llama 3. I also want to do this with the Phi-3 model. So I didn't make a video
when Phi-3 came out. I do think it's a very impressive little model that can be used for a variety of different tasks, and you can fine-tune it for a bunch of different things. And certainly, while it's gotten a lot bigger than the previous Phi models, it's still less than half the size of the Llama 3 8B. So I want to look at this and see,
okay, can we do any function calling, any sort of structured outputs, JSON
and stuff with both these models? And let's look at how you
could actually do this in here. So for measuring function calling,
there's a really nice paper called Gorilla, which looked at the whole idea of function calling. And one of the things that's come out of that is this function calling leaderboard. So if you ever want to know which models out there can do function calling and how well they do it, you can come in here and look at what they've got. You'll see that the obvious
models are there, like your GPT-4, your Claude 3 Opus, et cetera. But if we scroll down, we can see Llama 3 70 billion sitting in seventh place. And if we scroll down quite a lot further, we can see that the Llama 3 8 billion model is also on this leaderboard. Now, while it's nowhere near as good as the bigger version and the bigger proprietary models, it's still getting pretty decent results. So we're going to use this model, and I'm going to try out the Phi-3 model as well.

All right, let's jump into the code. So in the first example, we're just going
to look at setting up Llama 3 with Ollama. So you really want to use ChatOllama for the model there; you can just pass it the model name, llama3. You can see I'm bringing in a string output parser here, and I'm bringing in a chat prompt template. Something else you want to pay attention to is the keep_alive property. This is probably most useful if you're working in notebooks, where it can be a real pain to run something cell by cell and have the model load up each time. Setting keep_alive basically keeps the model in memory as you go through.

Next, we set up a very simple prompt template. You can see here, I'm just saying, write me a 500 word article on {topic} from the perspective of a {profession}. We're going to use LangChain Expression Language, so we're just chaining the prompt and the LLM. Obviously, those placeholders are going to be swapped out for our topic and our profession. So we've got the prompt going into the LLM, going into the string output parser, and then we can just invoke this to get it going. You'll see that when I run it, it takes a little bit of time to get going. Here we're basically doing the invoke, so we're going to get the whole article back in one shot at the end.
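Here's a minimal sketch of that first chain, assuming ChatOllama from the langchain_community package and a local Ollama install with the llama3 model pulled; the keep_alive value and the example topic and profession are just illustrative stand-ins.

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Local Llama 3 8B served by Ollama. keep_alive holds the model in memory
# between calls, which saves reloading it every cell in a notebook.
llm = ChatOllama(model="llama3", keep_alive="3h")

prompt = ChatPromptTemplate.from_template(
    "Write me a 500 word article on {topic} from the perspective of a {profession}."
)

# LangChain Expression Language: prompt -> LLM -> string output parser.
chain = prompt | llm | StrOutputParser()

article = chain.invoke({"topic": "LLMs", "profession": "shipping magnate"})
print(article)
```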
So let me just scroll this up so we can see it as it comes through. Now you'll find, depending on your machine, this can take a short amount of time or it can take longer. All right, so there's our article that we
got back, about LLMs being game changers for shipping magnates like me. You'll see again that it's in Markdown format, but here we're just printing it out. So what if we want to see it as it's actually generating? We can come up here, comment out the invoke, and put the stream in there instead. Let me just clear this.
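As a sketch of that swap, using the same hypothetical chain as above, streaming just means iterating over chain.stream instead of calling invoke:

```python
# Stream the article token by token instead of waiting for the full response.
for chunk in chain.stream({"topic": "LLMs", "profession": "shipping magnate"}):
    print(chunk, end="", flush=True)
```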
And you'll see this time when we run it, we're going to get a streaming response back, and we can see it as it streams out once it's loaded the model
and gotten going here. Okay, so this time when we run it, you'll see very quickly that the model starts generating straight away, and we can see that it's generating quite a lot of tokens per second. This will obviously change based on what computer you're using, et cetera.

All right, so next up, let's have a look at doing
something where we're trying to get a JSON output from Llama 3.

Okay, so this is a modified example of one of the LangChain examples. What we're going to do in this case is define a JSON schema, try to get JSON back, and use a JSON output parser. So let's talk about this. Now, one of the key things you want to do here is set the format of your model to JSON. If you set this to JSON, I'm not a hundred percent sure what Ollama is doing in the background, but I think they're doing something like the Jsonformer paper, where they basically make it so there's a much higher probability of getting JSON out rather than free-form text. And they claim this guarantees that you will get JSON out every time. I still find that you need to tell it you want JSON in your prompts as well, but this is a great way to extract information and get responses back in JSON. And this is one of the key things you'll often want with agents: making sure you're getting something very specific back that follows some kind of schema.

So we can see that the schema here is basically just telling it that it's going to be a person. We've got a description, and we've got some properties that it's going to fill out as we go through this. And you'll notice what we're doing: we've got the chat prompt here where we're going to pass that schema in, and then we pass in exactly what we want it to put out for us. For the schema, we're just taking the original JSON schema dictionary that we had up here and converting it to a string to pass in. Then you'll see we pass that in, and it goes through the prompt, through the LLM, and then through the JSON output parser.
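Here's a rough sketch of that JSON-mode chain, assuming ChatOllama with format="json" and LangChain's JsonOutputParser; the exact schema fields and the question are illustrative stand-ins for what's shown on screen.

```python
import json

from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate

# JSON schema describing the person we want back.
json_schema = {
    "title": "Person",
    "description": "Identifying information about a person.",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "The person's name"},
        "age": {"type": "integer", "description": "The person's age"},
        "fav_food": {"type": "string", "description": "The person's favorite food"},
    },
    "required": ["name", "age"],
}

# format="json" asks Ollama to constrain the output to valid JSON.
llm = ChatOllama(model="llama3", format="json", temperature=0)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Answer the question using JSON that matches this schema: {schema}"),
        ("human", "{question}"),
    ]
)

# The JSON output parser turns the model's JSON string into a Python dict.
chain = prompt | llm | JsonOutputParser()

response = chain.invoke(
    {
        "schema": json.dumps(json_schema, indent=2),
        "question": "Tell me about a person called John who is 35 and loves pizza.",
    }
)
print(response, type(response))  # e.g. {'name': 'John', 'age': 35, ...} <class 'dict'>
```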
Okay, so you'll see that once we run this, it's going to go through, parse it out, and we get back a dictionary. You can see the dictionary has name John, age 35, and hobbies pizza. I think it made a mistake with the hobbies; that should have been favorite food. So it won't always get it a hundred percent right, and obviously the more detailed things you ask for, the more chance it's going to get something wrong. But we did get back the required name and age. And you can see here that I printed out the response, but I also printed out the type of the response, so we can see that we actually got a dictionary back.

What if we change from using the
JSON output parser to just using a string output parser? You'll see that rather than getting the dictionary back, this time we get just a string. So the JSON output parser basically treats the output as a JSON string and converts it to a dictionary for you. And you can see here that we got the same response back, but with a whole bunch of whitespace after it, and right down at the bottom we can see the class this time is str. So generally you want to stick to the JSON output parser rather than the string output parser here.
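The swap being described is just the last step of the chain; reusing the hypothetical prompt, llm, and json_schema from the sketch above, it looks like this:

```python
from langchain_core.output_parsers import StrOutputParser

# Same prompt and JSON-mode model, but with a string parser at the end:
# you get the raw JSON text back (often with trailing whitespace), not a dict.
str_chain = prompt | llm | StrOutputParser()

raw = str_chain.invoke(
    {
        "schema": json.dumps(json_schema, indent=2),
        "question": "Tell me about a person called John who is 35 and loves pizza.",
    }
)
print(type(raw))  # <class 'str'>
```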
Okay, next up we're going to look at getting structured responses out using Ollama functions. So this is like tool calling or function calling; in LangChain, whenever you hear tool calling and function calling, they're interchangeable. Here you can see what we're going to do. Rather than give the schema as a dictionary and then convert it to a string, this time we're using Pydantic to define a class for the things we want to extract. So in this case, the person is going to have a name, they're going to have a height, and they're going to have a hair color. This example is modified from
the LangChain docs, where they're basically giving it some things to reason about. So you've got: Alex is five feet tall. Claudia is one foot taller than Alex and jumps higher than him. Claudia is a brunette and Alex is blonde. This is the context that we're going to pass into the prompt. Our prompt is going to be: you're a smart assistant, take the following context and question below and return your answer in JSON.

Looking at our model, you can see again we're using Llama 3, we're using JSON, and we've got the temperature set to zero. We've got our prompt that's going to take in the context and the question, and you can see I'm wrapping this with the prompt format for Llama 3. Personally, I've found that getting the right prompt format is one of the key things for getting these kinds of things to work properly. So I'm going to have a question and a context, and then I'm going to prompt it to get the JSON out.

Then we're going to use this with_structured_output idea. You can see that we're taking this LLM and binding the Pydantic model to it to get the structured output. So we've got the prompt, and we've got the structured LLM, which is just our LLM with the Pydantic class schema attached. And then you can see we can pass it in something like a question and a context.
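Here's a hedged sketch of that setup, assuming the experimental OllamaFunctions wrapper from langchain_experimental (which supports with_structured_output) and the Llama 3 instruct prompt format; the exact field names and wording are illustrative.

```python
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_experimental.llms.ollama_functions import OllamaFunctions

# Pydantic class describing exactly what we want extracted.
class Person(BaseModel):
    name: str = Field(description="The person's name")
    height: float = Field(description="The person's height in feet")
    hair_color: str = Field(description="The person's hair color")

context = (
    "Alex is 5 feet tall. "
    "Claudia is 1 foot taller than Alex and jumps higher than him. "
    "Claudia is a brunette and Alex is blonde."
)

# Instruction wrapped in the Llama 3 instruct prompt format.
prompt = PromptTemplate.from_template(
    """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a smart assistant. Take the following context and question below and return your answer in JSON.<|eot_id|><|start_header_id|>user<|end_header_id|>

QUESTION: {question}
CONTEXT: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
)

llm = OllamaFunctions(model="llama3", format="json", temperature=0)

# Bind the Pydantic schema to the model to get structured output back.
structured_llm = llm.with_structured_output(Person)
chain = prompt | structured_llm

response = chain.invoke(
    {"question": "Who is taller, and what color is their hair?", "context": context}
)
print(response)  # e.g. Person(name='Claudia', height=6.0, hair_color='brunette')
```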
And now we can just run it. You can see it's now using the Pydantic model, and obviously we've got Llama 3 set to JSON in this case. Hopefully we're going to get some things back, and it's come back now; look here at what we got back. We've got Claudia, height 6.0, hair color brunette. So what it's actually done is worked out that Claudia is one foot taller than Alex, and Claudia is therefore six feet tall, and that Claudia's hair is brunette. So we've gotten the right answers out.

So let's try the same thing
with the Phi-3 model. So the Phi-3 model is a lot smaller than Llama 3. The only changes in here are that we've changed the prompt to use the Phi style of prompting and I've changed the model to be Phi-3; the rest is exactly the same.
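As a sketch of those two changes, reusing the hypothetical Person class and context from above, only the prompt wrapper and the model name differ, with Phi-3's chat format in place of Llama 3's:

```python
# Same chain as before, but wrapped in Phi-3's chat format and pointed at phi3.
phi_prompt = PromptTemplate.from_template(
    """<|user|>
You are a smart assistant. Take the following context and question below and return your answer in JSON.
QUESTION: {question}
CONTEXT: {context}<|end|>
<|assistant|>"""
)

phi_llm = OllamaFunctions(model="phi3", format="json", temperature=0)
phi_chain = phi_prompt | phi_llm.with_structured_output(Person)

print(phi_chain.invoke(
    {"question": "Who is taller, and what color is their hair?", "context": context}
))
```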
This model is a lot smaller, so I wouldn't expect it to be as good as Llama 3, but you'll see what happens when I run it on something like this. Let's see what comes back. You can see that it can also work out that, okay, the name is Claudia for who is taller, the height is 6.0, and the hair color is brunette. So this shows us that we can use the
Phi-3 model for a lot of this simple structured output stuff. I'm not going to say it's going to be great at it; on the whole, the Llama 3 8 billion parameter model is probably going to be better, and certainly the 70 billion parameter model is going to be a lot better. But you might be surprised that you can actually get some really nice outputs from this Phi-3 model doing exactly these kinds of things.

All right, let's move on to the tool use and function calling example. Okay, so in this example, I'm back
using the Llama 3 model again, and now what we're doing is using this new, sort of experimental OllamaFunctions, which has just been added to LangChain recently. What it lets you do is basically bind tools to the model, just like you do with other function calling approaches. This is not necessarily going to work with every model on Ollama, but for the models that do support function calling, you'll be able to do this. So you've got your Llama 3, you've got Phi-3 like we're playing with here, and you've got some of the Mixtral models; there are a bunch of different models you can do something like this with and then get results back using OllamaFunctions.

So we're taking one of the classic examples here: the tool it's going to use is basically get the weather. We're passing in the definition of the tool here, not actually defining the tool itself; you can look at other examples I've done for that. But we want to see that if we give it something related to weather, will it actually come back telling us that we should be calling this get_current_weather function?
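Here's a rough sketch of that binding, assuming OllamaFunctions from langchain_experimental and its bind_tools method, with an OpenAI-style function definition for the weather tool (the exact schema is illustrative):

```python
from langchain_experimental.llms.ollama_functions import OllamaFunctions

model = OllamaFunctions(model="llama3", format="json")

# Bind the *definition* of the tool - we aren't implementing it here.
model_with_tools = model.bind_tools(
    tools=[
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city, e.g. Singapore",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        }
    ]
)

response = model_with_tools.invoke("What is the weather in Singapore?")
print(response.additional_kwargs)  # expect a function_call with name and arguments
```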
So you'll see I'm just passing in a simple question: what is the weather in Singapore? Okay, I'm running that. What I want to see back here is not just a response in JSON, but an actual function call response. And sure enough, we can see that we've got this back: we've got the content, plus additional kwargs containing the function call. We can see the function it wants to call is get_current_weather, the arguments are Singapore, and the unit is going to be Celsius in this case. So now we can use those to do our function call ourselves and then pass the result back through, just like I've done before when we've used OpenAI, et cetera.
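As a hedged sketch of that last step, get_current_weather below is a hypothetical stand-in for whatever real weather lookup you'd wire up; it just shows pulling the call out of additional_kwargs and executing it:

```python
import json

def get_current_weather(location: str, unit: str = "celsius") -> str:
    """Hypothetical stand-in - swap in a real weather API call here."""
    return json.dumps({"location": location, "temperature": 31, "unit": unit})

call = response.additional_kwargs.get("function_call")
if call:
    args = call["arguments"]
    # Arguments typically come back as a JSON string, OpenAI-style.
    if isinstance(args, str):
        args = json.loads(args)
    print(get_current_weather(**args))
```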
Let's try the same thing with Phi-3. Okay, now with the Phi-3 example, you can see that everything is exactly the same; we've just swapped out the model to be the Phi-3 model. I will say that I think it's
going to work for this one. Let me run it. It won't necessarily always work, especially when you've got multiple tools. Now, you can see that was actually very quick too, and we've got this back: we've got no content, as in no text coming back, but we've got a function call. We've got the name of the function call and we've got the arguments. In this case it didn't give us the argument for the unit; it just gave us location Singapore. So it does show that it's perhaps not as strong as the Llama 3 8 billion model at doing this, but it was still able to handle it and still able to pass things back for us.
that you can do sort of JSON use. structured outputs, and tool use
with these local models, like your Llama 3, like your Phi-3,
like some of the Mixtral models. And then there are a whole bunch of
other models that people are fine tuning just for function calling. So if they've been fine tuned in a way
that sort of makes them compatible, you'll find that the often work with this. I Actually find that the Llama
3 70 billion can do a variety of different ways of function calling. Just because it seems to do so well
that it's able to pick up, the sort of format you're giving it, the function
calling in and work out what it should be returning, pretty easily in here. But anyway, Now you've got two
models that you can use locally to start building agents where you
can do function calls with tools, and basically have free tokens for running nonstop on a variety of different tasks, et cetera. In some future videos, we'll start looking at actually building some agents and some RAG systems on these local models, so that we can use them for a variety of different things. Anyway, as always, if you've got any comments or questions, please put them in the comments below. I would love to hear back from people about how they're using this, what kind of tools you're looking at using, or currently
using for your agents, that kind of thing. And as always, if you found the video
useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.