Function Calling Datasets, Training and Inference

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
function calling is one of the most powerful tools that you can use with your language models by training and using models that are able to provide structured responses you can then send those responses to an arbitrary API this means you can connect a model with basically any other service and use structured python code or any code to send and receive data in this video I'm going to go through using function calling data sets to fine tune models I'll go through quite quickly the fine-tuning of models for function calling and then I'm going to focus on inferencing function calling models there is quite a bit of setup if you want to use function calling because you have to provide data on the functions that you want the model to be able to use and when the model responds with a structured response you have to be very careful in how you handle that response and properly send it to make the function call and return the information that the model needs here's what we have for agenda I'll start off by introducing ucing what a function calling model is giving you Lama 2 as an example I'll show you the repository which is open and available for you to use next I'll show you a function calling data set that I have developed it can be used for training any model to be able to do function calling and it's the data set that I use for any of the out of thebox function calling models that we'll talk about during this video next up I'll quickly go through fine-tuning models for function calling this is something I've covered in quite a bit of detail in a prev video I'll just show it up here uh but let me give some latest learnings that I have on fine-tuning the focus of this video will then be inferencing I'm going to talk in detail about how to send function data to models how to get back function calls handle those function calls and send back the responses to the model in a way that will be useful for them I'm going to go through a few examples here I'll actually very quickly run an open AI example of function calling but then I'll focus on custom models I'll do Lama 2 I'll do open chat 3.5 and then I'll do deep seep coder which is a 67 billion model these models are all available out of the box on Trellis research repo and huggingface just look up huggingface face. 
trellis and you'll find the models there's also a ye 34b with 200k context model available for function calling oh by the way we are going to have a few extras as well I have been asked to test out the model in espanol um Pro in ESP um function calls I'll do a little bit then on highlighting how to decide which model you should use what size model you should use if you expect different performance and I'll finish off with a few final tips let's start things off here on huggingface doco Tris you can scroll down and you can take a look at the collections uh so let's just expand the collections here and take a look at function calling V3 note that this is now going to take Pres since over function calling V2 um so if you're going to use models you're probably better off using V3 the main difference being that they use the open AI format for for uh function calls that doesn't mean the data was created with open AI I've created the data my myself manually so that there aren't licensing issues um but the format of the metadata is the same as open a ey which makes it a lot easier if you're already doing function calling with open AI let's head into this collection here and you can see that there are currently five models there's open chat there's um Mistral Lama 27b e 34b and then deep 667b and we're going to start off with the Lama 2 7B which is um publicly open for you to try out now as I said I'd start off just by explaining what is a function calling model well it's a fine-tuned model of some base model in this case the base model being Lama 2 specifically the chat version of Lama 2 and specifically the 7 billion param model so it's the smallest Lama model from meta to understand uh what function calling is it's easiest if you take a look at an example so I'm going to scroll down and we can take a look here at a prompt so the idea with function calling is right at the start of your prompt you tell the model that it has access to certain functions and to use them if required so here I've provided two functions get big stock which is um to get the names of the largest n stocks and the next function I've provided is get stock price so you put in the name of a stock and it should return the price of the stock so of course a language model is not going to know the current price of say Apple stock because the price moves around so the idea is if you have an API that can give you the price of Apple stock well the model can tell you by doing a function call hey I want the price of Apple stock and then you can get that from the API and send it back again to the model so you'll see just a few things here for example to get the stock you need to the model will need to provide an array of the stock names and I'm saying that's a required parameter so if the model calls to get the price of a stock it definitely needs to give the names of the stock it can't just say give me the Stock's name it needs to give the name of the stock all right so after that Prelude where we tell the model what functions are available we then ask the model to get the names of the large of the five largest stocks in the US by market cap uh and then we have this uh token which is typical for Lama that indicates the end of our instruction so we would basically send all of this prompt everything up to this point here to the llm and get it to respond and if the model is fine-tuned for function calling it will respond with a structured response like this which is adjacent object that gives the name of the function to call which is get 
big stocks it gives then the arguments so the number um of big stocks to uh find the names for and also the region because that was a parameter we said was possible uh within the function details up above and then because Lama uses SLS for an end of sequence token it emits this token to tell us okay I'm done I've made my function call so get me that information and then we can continue so this is a function uh calling fine-tuned model if you did not fine-tune the model for this purpose then then you would get a much different and much crazier response here we'll actually see some examples of that during training because I'll show you what it looks like to try and do function calling with a small base model and I'll show you the difference between before and after now that you know a little bit about what a function calling model is let's answer the question of how you would train a model for function calling well as with any training you need a data set specifically you need a data set that has function called calls along with the responses so you can habituate the model to responding with the structured Json response and you see linked here there's the data set that I use for fine tuning and it's available on hugging face for purchase if you take a look at it it's quite a small data set 66 rows of training that's the data used to train then 19 rows for validation this is to calculate um an independent loss during the training so you check that the model is improving with without overfitting and then there are also seven test rows these are some test examples where I use totally different functions just to see how well the model is performing on a quite independent data set now the reason you don't need that much data is uh you can read the Lima paper paper less is more actually for fine-tuning it's more important to have very high quality examples um that you send into the model rather than having a large amount of lowquality samples so that's how this data set is focused as I mentioned earlier in the video it was created manually by me it was not created using chat GPT or other models that have more restrictive licenses and therefore you can use this for commercial purposes so let's just um take a look at some of these prompts and I'll very briefly describe the data structure um let's start here down in the middle so we have three sections three columns with have the function list so it's just a list of functions like I previously showed then we have the user prompt and then we have the assistant response so here for example we have two functions one is search archive which is for patents and the other one is get the current weather so the question I then ask is I'm going on holidays later today how's the weather in Boston give me the temperature in Fahrenheit and then the response is name get current weather arguments um location Boston USA and the unit is Fahrenheit so this here is function called and the whole whole idea is we're going to feed in the uh left two columns as part of a prompt and then we're going to have the model generate a response and we're going to compare that response to the assistance response here and when there are differences between what the model is predicting for the assistant response and what we would like it to be we penalize the model and back propagate so that the next time the model will be a little bit closer to giving us this structured response now just a few little comments um I'm going to get into some specifics here some little comments on um why 
certain data is included here you'll see this is quite straightforwardly just a function call um but you might have noticed right up towards the top I have a question here that is no function call at all it says what happens to you if you eat watermelon seeds and the answer is watermelon seeds pass through your digestive system um this is taken from the truthful QA data set and the reason I include this is because I want to get the model used to seeing functions in um at the very start but then responding normally to a normal question cuz what I don't want is if I put a lot of functions into the model at the start I don't want it to just keep responding with function calls even if I ask a trivial question well I don't know if that's trivial fairly trivial what happens to you if you eat watermelon seeds um and then just a few other examples down here um I also have put in some pretty pretty kind of nonsensical or single word prompts like par of scissors uh or typewriter and then uh I get some response here and the reason for this is sometimes if you just put in a single word in a in a prompt and you have function list beforehand the model thinks it wants some kind of a function call on that single word but really the user shouldn't be putting in just single words and if they do then uh you really want to either respond in some sensible way to that word or you might want to ask for uh further context let me see if I can find find an example here so if I just prompt a model with some functions and then say why uh the answer I would like is could you please provide some more context or clarify your question I don't want a function call with with why so that's a very brief overview of this data set here um it's built on eight base functions and I use different functions then for the test set which you'll see when we go through the training and so we can have an evaluation um of that and indeed I have pre-trained some models so if you want to get some pre-trained models you can just check out the Lama model that I've already gone through here or if you'd like you can uh purchase any of these uh pre-train models here if you want larger higher performing models or you can even purchase the data set here uh V3 next step is to train these models using the data set for function calling there is a handy resource here TR /f function- calling where you can see a list of the data set the pre-train models but also you can find a link um to the training notebook that I'm going to use right now you can purchase the script by itself or you can buy access to the advanced fine-tuning repo which has a lot of other scripts including for supervised unsupervised fine-tuning um direct preference optimization also there are scripts for quantization for ggf and also for awq there too so I'm going to go over and use runp pod runpod is a Cloud Server service and it allows you to quickly deploy a GPU you could of course use your own GPU if you have one you could also train on a Mac using uh your Mac chip if you've an M1 or M2 um there are some tips in the advanced fine-tuning repo for that although I hope to dig in and do more of a series later anyway let's get started here on runp pod I'm going to go to um secure cloud and um you can find trus referral link if you want to support the channel you can find it below in the description uh so once you've created an account and you've topped it up with um some money you can deploy now I like deploying uh the a6000 it's got 48 GB of RAM so it's good for training 7B models 13B 
models in full Precision you can also train quantized versions of 70b um or 34b models as well so let's click on deploy here I'm going to go through an example for Lama 7B so when I click on customize here I don't actually need that much space that's probably more than enough I can just put in 50 here now actually I have used the wrong template here so I'm going to go back I'm going to go to deploy and the template I want for training is not an inference template I want the runp Pod Pi torch 2.1 so a pi torch and here I do like to increase this a little bit um the s B models are 13 about 13 GB in size but you end up need needing to save them a few times so I recommend leaving some Headroom here and you can set overrides continue and deploy and it'll just take a few minutes for this uh pod to start up and once it does I'm going to run a Jupiter Notebook on this pod and upload a script um the script I mentioned from the trellis homepage here already advanced fine-tuning repo all right the Pod um has just started up so I'm going to click on connect and I'll connect to Jupiter lab here once Jupiter lab opens up I will upload my function calling V3 script for training by the way you can do this in Google collab as well if you'd like I um think that is fine as an approach if you want to use the T4 you can also pay and use a V1 100 or an a100 the a100 are small on collab they're 40 GB instead of 80 um so it's actually smaller than an a6000 and I think it's not actually cheaper um so I like training with the a6000 it also allows you to use Flash attention version two which allows you to speed up your training process that's not possible if you use a T4 because it uses an older architecture all right so I'm going to upload um my file for fine tuning okay I've uploaded a fine tuning script I've also uploaded a quantum Iz ation script which I'll briefly talk about later but let's get started with this fine-tuning script and um the first thing you want to do is you want to um you want to log in with hogging face the reason is so that you can push the model to hugging face later on or if you're using a gated base model then you need to be logged in so you can access it after that you can set your base model actually here um I'm using open chat I think that's I think that's fine we can go ahead with open chat and next I'm going to do some installation you can see I'm installing Transformers flash attention if you want to fine-tune uh quantized version you can install bits and bytes um and accelerate as well so after some imports we're then going to load the model here you can see I've commented out uh quantization config let me just increase the size of my screen here that's going to help a little bit so here uh I've come out quantization because I'm going to um train with full um with full Precision or rather bf16 16 bit that's how I have fine-tuned all of the out of the box models I just think it's slightly better for quality but generally it's probably fine to use um bits and bites quantization the quality is reasonably good although there can be some complexities in pushing it to HUB which um I might get into on another video all right I'm using flash attention 2 you can't use that unless you're using either an a100 h100 or an a6000 maybe some other architectures but you can't use it on a T4 so just be careful about that once the model is loaded you're going to set up the tokenizer and once I've set up the tokenizer I like to check what the end of sequence token is um for this model it's end of turn 
and what the pad token is there's no pad token specified um for this open chat model um I set the tokenizer padding to the right hand side when I do function calling I think that's good for short responses because I like to have the model um start off it'll start off with the prompt and then give the response and then the padding rather than have a lot of padding um before the prompt and the response okay so since there isn't a pad token already specified what I do is I need to set a pad token if there's an unk which is an unknown token in the tokenizer so this represents any unknown tokens I like to use that I generally don't like adding a new pad token if I can use something already in the token ier because adding a new pad token means that there needs to be a new token added to the tokenizer which means that people need to use your new tokenizer with that token added it just adds an extra piece where things could go wrong um so in this case the on token is in the tokenizer which means that I can use it so I set the pad token to be equal to un and I don't need to change the number of tokens in the tokenizer you can actually see there's 32,2 tokens generally the number would be 32,000 but two more special tokens have been added um in this case by the model creators um you can actually see here a list of the special tokens they've added in these end of turn and Pad uh zero tokens okay so then I'm going to print out the model architecture open chat actually uses a Mistral type uh architecture and next I just print out a sample screen a string and tokenize it um The Next Step here is to set up um is to set up Laura so broader than fine-tuning every single parameter in the model we're going to fine-tune um only certain parameters in the model so only certain modules and we're not going to directly fine-tune those modules we're going to fine-tune an adapter which is a smaller representation so if you imagine a big Matrix to be trained what we do is we say okay we're going to train a smaller Matrix um and that smaller Matrix will have a mapping onto the larger Matrix and we're going to freeze all all of these parameters in the main model and just train this adapter and at the end of the training we're going to expand this adapter up and merge merge the two weights the fixed weights and the training adapter so this is called Laura it's low rank low rank because this Matrix here has got a low rank you can think of it as a low number of rows or columns and this ends up being more efficient for training and it also turns out that the performance is very good the reason being that large matrices within language models actually could be represented with lower rank representations um so this means actually you don't need to train something quite as large for fine-tuning you can do just as well or sometimes better by training this smaller representation okay so we're going to set up um Lura later down first we're going to enable gradient checkpointing this reduces um vram requirements it's not absolutely essential um I like to um I like to have a function to um Define the trainable parameters this lets me see a list of the modules in the model um here I'm going to set up the lower configuration so you remember I said we're not going to train everything we're only going to train certain modules the modules we're going to train are the attention and we're also going to train uh the linear layers so we're actually kind of covering most of the model if you look at these other things like the layer Norms or 
um you look at some other Norms these are actually a much smaller number of parameters so we're kind of training most of the parameters in the mo in the model however as I said we're training a much smaller adapter version of those parameters um so actually the percentage uh or the number of parameters we training is much much smaller than the model size in fact when um I run this here you can see there are approximately 20 million trainable parameters um and of all parameters there's of course about 7 billion because it's a 7 billion model and we're only going to train about 3% okay next um so once we've set up the Laura which is the adapter we're going to train we're going to prepare the data set and we will load trus function calling V3 notice that previously I would have used function calling extended but that's now deprecated so we print out the data which has the Train the validation and the test set and you can see it has the three columns function list user prompt and assistant response um next up we're we're going to do some setup so we're going to set up the prompts we have these three columns of data from the data set and we need to put those columns into a single uh line a single prompt for each row that we're going to feed into the model so here you can see that before the function list we'll say you've access to the following functions use them if required and we're going to also set up the appropriate start and end tokens for this model now the start and end tokens are usually specific to the model which is annoying because it means you need to use different templating for every model and we'll talk about that more when we get to inference but for now um we're going to use the appropriate one for open chat you can see if we were using Lama 2 or Mistral we would instead use inst and inst if we're using deep seek uh coder we would use this we using Y we would use this um but we're using open chat so we're going to use this so here's where we assemble the prompt and you can see that basically assembling it involves putting in the functions putting in the user prompt and then finishing off with the assistant response now I do go into a little more detail around the attention masks and also the end of sequence tokens that ends up being important for getting good performance on these fine-tunings um you need to make sure that you are penalizing the model based on the tokens that are being produced in the response um that's a little bit better than penalizing it based on um how well it's predicting both the tokens in the prompt itself and the response so for that I'll direct you to another video I'll link down below okay so our data set is now loaded and I like to generate then a sample of that data set so I set up a generate function that will just generate samples and let's uh see what that looks like so here you can see the whole prompt together GPT for correct user such a weird it's such a weird way to start off prompting on the user side but that's what open chat does you've access to the following functions use them if required you have um two functions the get bake stocks and get stock price and then we have the prompt request which is get the names of the five largest Stocks by market cap and then the Open chat model requires you to finish each turn with this end of turn token and then we have gp4 correct assistant so all of this prompt gets fed into the language model and we then get it to generate a response which we would like to be a function and here you can see what 
happens if you use a model it's not fine- tuned you can see it responds to say to get the names of five largest stocks you can use get big stocks uh with the following parameters so I mean the response is sensible but it's not providing a structured Json object so this is not helpful if we want to easily connect to an API and the correct response would be this uh this is ideally what we would like um we would actually like this with an end of sequence token following here which actually we'll see later so we run a seven tests and you can see for all seven of these tests that there aren't clean function calls because the model is not fine-tuned so let's move on past these seven quick tests which are purpose f for manual inspection uh we'll move on to training here and um we're going to set up the trainer and I will go through one or two of the parameters in the trainer itself so we're going to pass in the train data set and then for evaluation we'll pass in the validation data set and we're just going to train for one Epoch I like only showing the data once because it means that you're less likely to overfit um for supervised or unsupervised fine tuning you often pass the data through many times but for function calling because what you're really training is structure rather than the specifics of the content I prefer to um give more data um to convey more structure rather than just giving the same data because that's encouraging the model to repeat the exact same words okay so um we're going to train for one Epoch so the model should see the data once by the way sometimes you'll find it does a bit better for two epochs or maybe three but I find stronger models they only need one Epoch of training um the batch size um you could actually increase the batch size here but the the data set is quite small so uh this is not going to be a long training but you could increase the batch size the uh it takes up more space more vram on your GPU we've plenty of space on the GPU because it's 48 and the model itself in bf16 will be about 13 so we could easily increase the batch size here especially because the batches are short um so each batch um every batch is just taking up KV cash so every model is 13 billion and then You' have the KV cach for every batch so the batch uh KV cach is going to be really small for the attention um so you really could have quite a lot of batches I would bet you could probably run with a batch size of 16 if you wanted or 32 anyway um it's not a Long training so I just said it's one um technically this allows you better granular as well in the training it means that the model will pick up more individual Nuance so I think probably bat size small is good if you are not constrained by time gradient accumulation Um this can help if you have kind of wild jumpy results it will smooth because it instead of back propagating after every batch it will back propagate after every X batches so here I'm not going to use gradient accumulation um I don't see any instability issues so um this is fine okay um we're using a learning rate of 1 E minus 4 and the learning rate schedular type is constant so we're going to have a constant learning rate typically in big training runs you reduce the learning rate um because as it conver the model converges you can't uh you can't take as large steps in your optimization but I'm going to leave it constant because it's a short training run and we're only doing one Epoch okay and then we'll run the training and you can see here the training losses are 
pretty slow uh pretty low and the validation loss it goes down 82 79 73 and then up a little 76 so what I do here is I just check that the validation loss um is falling it's going up a little bit here that's probably okay if you see the validation loss starting to go up a lot then probably you've overtrained and so I would recommend um reducing the EPO but since we've only got the epoch of one this is fine if um if you were going to train multiple epochs like three epochs and you see the validation loss go up then probably you should reduce the number of epochs and here we'll go to an example after fine tuning so you should see the difference so now we'll have the very same question get the names of the of the five largest Stocks by market cap and indeed we get the exact response we want and we get this end of turn token it's very important the model emits this end of turn token because it looks different for different models but you need to know when the model is stopping generating um because otherwise it might generate the function and keep lving on and you've no way to know then uh to stop the generation so this token is very useful and important and you can look at uh the other examples as well the model does very well actually I think this model gets maybe six out of seven of them exactly correct uh the Open chat model is quite strong I recommend it it's probably the best function calling model of this 7B that I've trained so far uh I will just show you one little uh hiccup let's say and this can be expected particularly with uh smaller models so here we just prompt the model with some functions and then the word shop which that's a weird that's a weird prompt for the model it's not going to have a lot of data that makes a lot of sense of that and it does indeed respond with a function here which is probably not the best response I would have preferred it to say well would you like to know about shops um so this is one place where it trips up and then you can see I just have a generic test question about the planets in our solar system and this is to test that it doesn't call a function when it doesn't need to and you can see it does well here um it's got the same content of response as the answer that I would have liked now just a few comments once the model is trained you want to push it to HUB or you probably want to save it somewhere so I'm just going to show you how to push it to Hub um you've got the base model name you want to Define an adapter model remember we've actually trained the adapter we've Frozen the model and so we're going to need to merge the adapter and then push to HUB so I've defined um the adapter model name I've defined a new model name and you can see those here and next we're going to save the model and push it to HUB now when we just save and push to HUB we're saving the adapter and we're pushing the adapter um so we still need to merge which is why we call merge and unload so we merge uh basically we'll expand this adapter and merge it on uh to the main model and then when we push the Hub it will push the full model uh next then you can save the pre-train model and you can also save the tokenizer and push the tokenizer I'm actually going to skip down and show you that uh right here so um working backwards we should have um we should have some code here that's see um for pushing the tokenizer so here we save the tokenizer and here we push it to HUB now the reason I do that at the bottom I won't go into this in too much detail but the reason I do that is 
because I first want to adjust the chat template so if you look at a tokenizer object uh I think I do it up here if you print out uh tokenizer do chat template it's going to print out this here which is templating language I think it might be handlebars or something and what this does is it takes in uh a list of messages and it puts them in the format that this language model expects that's why you can see that it has things like gp4 correct which is the format for open chat and it has this end of turn um token which is specific to open chat but because we now need to not just have user and assistant messages we also need a function message it's ideal if we can also update that chat format to include uh function metadata handling so basically if I give a list with function metadata user assistant it's going to format all of that into the prompt that I want so here what I'm doing is I'm just setting up some functions and some sample messages you can see sample messages here include a function metadata message a user message I've even just for testing included a function call message a function response message um and then what I'm doing is setting up a chat template that's going to do well for this uh model and then I'm going to apply that chat template um I'm going to apply it to my series of messages and see if the formatting is correct so here you can see I've applied the chat template uh so remember we've got this discussion here what's the current weather in London and I wanted to format this into a prompt so indeed it's formatting it it's putting in the start token it's putting in the uh beginning of sequence token it's then putting in the fun functions um it's then putting in my prompt it's then putting in the first function call um and then it's putting in this little snippet here that's quite important actually so once you get a function call once you call that function and get the response back in addition to giving the language model the response you want to kind of guide the model to use that response in a helpful way and that's why I include here here is the response to the function call if helpful use it to respond to my question request um so um and yeah I think I've just realized if you're watching the Spanish part of the video later on this is not yet updated in my Spanish video this English language here and I think actually that would improve the performance a little bit so just keeping note of that I didn't uh translate that little snippet but anyway it's important for the handling of the function call response because once the language model receives this up until this point here the model will respond and say that's great now say hello end of turn um so once we've fed in everything up to this point here the the model then when it gives a response it's going to respond taking the useful information from the function and using it to answer so here it's answering the current weather in London is cloudy or temperature of 15° C and that looks good and I've just manually appended on another request which is saying uh just for the model say hello so if I take all of this formatted prompt and send it in the model indeed does respond with hello okay just before we wrap up on this training script here I just want to make a few quick comments because I get a lot of questions on pushing models to HUB so if you're merging the adapters it will work fine if you're not quantizing however if you are quantizing the model now what you have is you have a quantized big model and you have 
an adapter and you're still training the adapter but your base model is quantized and unfortunately it's not possible although um there's work being done on it it's not possible right now to merge an adapter with a quantized model um so what that forces you to do is instead of running this uh merge step here which is not going to work you actually first need to reload a base model and you need need to reload the base model in un quantized form so rather than loading it with um with quantized Precision let's see if I can uh just show that here rather than loading it with bits and bytes quantization or some other quantization you're going to load it in float 16 which is a 16bit format um and then merge that model merge the adapter onto that model now there are slight inaccuracy reasons um for that it will affect your Precision your perplexity a tiny bit but broadly speaking I found this still works if you're using a quantise model so just don't forget you need to reload your base model then you need to apply the adapter to that reloaded model and only then are you able to merge if you're using Cur um now just before I move off of this runp pod I'll just show you I often run a ggf quantization this allows you to run with Lama CPP if you want to run on a Mac um ggf is quite a quick quantization it's not data dependent so you don't have to load data it's um much quicker to quantize than other forms um so there's a script here that will allow you to just run through um there's some installations you need to install Lama CPP itself and then it will go through and quantize and by the end you should be in a position to be able to push the model which in this script here was Mistral uh in ggf form to the hub and I'll just show you that if you want that ggf model um I've created it for most models except for deep seek cuz I'm missing a file there the tokenizer model but for the others if you go to files and if you go to uh the ggf branch you'll be able to find the ggf file there the nice thing is it's quite a lot smaller it's about the quarter the size uh maybe a third the size of the model in the bf16 Precision okay let's take a step back at where we are we've gone through what a function model is function calling model is with the Lama 2 example I've walked walk you through the data set the latest one is V3 I've then walked through in detail fine-tuning uh a language model for f um for function calling including a brief section on ggf quantization so next up we're going to focus on inferencing and I'll take you through a very quick open AI function calling inference example and then I'll go through Lama 2 open chat and deep seek so let's head back over to tr.com function calling and you can see that there are in infering scripts available um I mentioned earlier there are fine-tuning scripts but these are inferencing scripts if you just want to get started quickly with making calls to a model that supports function calling you can buy the script itself or you can buy it as part of the advanced inference repo which also has inferencing for long context models for setting up a server on AWS and also for using runpod oneclick templates so we're going to head over to that repo right now and here we are we're in the the advanced inference repo I've just done git clone I recommend that you CD into the function H the API call section and you'll then want to create a virtual environment um there's a description in the readme of how to do that I've called it API en you want to activate that and then you want 
to install some packages I'll just show you the readme here um so rather it's this read me here specifically for API calls so as you've get cloned the repo CD into the repo set up the virtual environment and then install these packages here if you're going to use the open AI function calling you'll also want to install Tik token and open Ai and if you're going to use a ye model then you want to install sentence piece which is needed for the tokenizer next you'll want to go um to the environment variables and you want to set your open AI key um I've created a sample. EnV file so you can simply copy this file and rename it EnV and in the EnV file uh you'll put in your open API key and then for running runp pod you'll put in the Pod ID in the model but I'll show you that a little bit later so let's go back uh to the read me here once you're in the API calls folder um you then have the option to do some simple API calls to open AI or to run pod which is where you'll run a custom model or you can do what we're going to focus on today which is the function API calls for calling functions um so let me just CD into function API calls in fact I'm already there you can see if I do LS I've listed out the files uh functions um and actually I'm on the Espanol Branch so I'm going to do get switch and get to the main branch so this repo has got two branches right now if I do get Branch there's the main branch and then there's a Spanish branch which I'll talk about a little later in this video so on the main branch if I do Ls I'm now in this function API folder and I have a few scripts the first one I'm going to look at is open AI function call. py which is going to allow me to run open API open AI uh function calls now the basic setup you need for doing function calling with a model is first of all you need a model and second of all you need functions you need to define those functions and those functions are defined in the functions folder uh on underneath tools um so in tools what you have is a list of information that you will provide the model within the prompt here I've got get current weather and a description of the parameters that accepts and I've got get clo which is a function that uh when given the weather conditions it will get me the clothes to wear so for example if I call get current weather for Dublin it will say that it's cloudy and then if I call get Clos with it being cloudy it will say it bring maybe a rain Cod so you can see these functions I've set them up in a way that function called chaining is possible that would be a case where you want the model to first off uh call for the weather and then once it's got the weather the model should call to get the close to appropriate to wear for that wet so this is called tools. 
Json it's the same file that you will fill out um for calling open AI but it's also the same file You' fill out at least as I've set it up for calling any of the custom models now this is the data that goes into the models um as function metadata but you also need to Define individual functions which I've done here and they each have their own file so I for get close file which is a function that takes in temperature and condition and then it will output um the outfit that you should wear and I have get current weather which takes in a city and optionally takes in either Celsius or Fahrenheit and it will return uh the weather data for that City or it will say there's no data available for that specific City so you need to set up your two functions here like this now once this is done uh you can go to the open AI uh fun call. py and you can set the model that you want I'm just going to use GPT 3.5 turbo and this supports function calling I believe GPT 3.5 turbo without this uh date at the end will also support function calling and what's happening in this script is it's going to import my uh list of metadata for for the functions it will also import the functions themselves and when making a chat completion request it's going to feed in that list of functions um via tools so tools is my list of function metadata and it's going to feed it into the chat completion request and when the data comes back out um I'm going to have a function ready here called execute function called so when I do a chat completion request it will return a response and then it's going to allow me to call execute function call um here I have a function that will just uh print out the messages that I've submitted um and here I set set up a chat so I'm going to start by creating a messages array just that's empty and I'm going to add in uh a first message that says what's 1+ one and then I'll get a completion request and when I get a completion request I'll have an assistant message and I'll check if that assistant message um is a function call and if it is a function call then I will execute a function call and if I execute a function call then I'm going to send the data I get from the function call back into the language model and get a second chat response so basically the model says I want to know the weather if I get the weather I give it back to the model and then the model will respond so I can synthesize and respond with that data and then we'll have uh response at the end now here I've just asked what's 1+ one so I don't expect any function call for this first simple example so let's just try that out python open AI Funk call. 
py and let's see if this works okay so 1+ 1 is equal to two now we'll move on and we'll ask what's the weather in London so here we should expect a function call in this case Okay so we've watch the we in Dublin the assistant message is actually none in this case because open the eye considers that it's a function call response uh but there is a function call which is temperature and cloudy and given that information which is fed back to the model the assistant will respond the weather in London today is Clarity with a temperature 15° so that is working well and it's made use of my function called so let's try a little little bit more tricky question um which is what clothes should I wear what clothes should I wear let's just say I'm in Dublin so let's see what it does so here it should ideally ask first for the weather and then it should ask for close so indeed it does call uh to ask for the weather and that makes sense because if it asks for the weather then I'm going to be able to feed that back in to the model and then the model will be able to ask for what clothes to were so this is a very simple example I actually build on this for the custom models and I'll automat automate that chaining so that it will automatically call the model and it will keep on calling functions as much as it needs and then stop and provide a final summary response back so this is a quick summary of open Ai and the reason I wanted to show it to you is because you're going to use the exact same format for fun for formatting your functions and you're going to use the exact same functions when you use the custom models that we're going to run through runp pod next up I'm going to show you how to query models like Lama 2 fine tune for function calling then open chat and we'll finish off then showing you a deep seek model which is a 60 5 67 billion model and I'll also show you that model in Spanish as well so let's go over to the model card uh we can check out Lama uh here and just go to the model card and scroll down for the runp Pod template so what we want to do is get started quickly here and I'm going to open up runpod template which is one click and this is going to do allow us to set up a server very quickly that we can then do inferencing on with function calling okay so we have the template here and I'm going to pick a GPU I'm going to pick the a6000 again I'm going to click continue and deploy now I'll be able to grab I'll be able to grab the Pod ID here this is actually this is running still from the fine tuning I was showing you so just going to close that pod down so that I'm not spending money on it and then I will delete that pod as well and now copy the ID of this Lama 2od and head over to my EnV and update here to my 7B and here I need to put in the model ID uh it's the model slug from hugging face actually I can get it just by copying there and uh saving so this sets up the Pod um it means that we're going to be hitting this endpoint of runp pod running text generation inference and we're going to be using this model it's important to specify this model because that is going to define the chat template that we're going to use and where we're going to use it is in this runp pod TGI fun call chat template script which is going to allow us to inference the function calling model and you can see here that we're importing the Pod ID and we're importing the model the Pod ID is needed because that gives us the URL that we're going to hit uh actually this is the URL that we hit for generating responses 
with TGI and the Pod ID is inserted right in there so the Pod is going to take a few seconds to initialize if we we go over and check out the logs uh we can see here that the system is extracting it'll take a minute or two and then the shards will download and the model will start to run now before I start inferencing that I'm actually going to start up a second pod and the second pod that I want to start is going to be the Open chat pod uh just so we have that running in parallel so I'll just search for open chat Travis V3 and if I scroll down on that model card I should also be able to see uh I should also be able to see a one click link I think I've gone too far yeah here we go so there's my oneclick link now importantly when I use this template I need to specify a token um an access token because it's a gated repo because this model is available for purchase so let's check out um let's check out this function calling model with the a6000 but when we customize the deployment we're going to add in here another variable which is hugging face Hub token and then I'm going to go to my settings and I'm going to get a token so I'll scroll down to access tokens I'm going to use the one I use for runp pod read permissions is enough in this case and I'll copy it in here and click set overrides and continue so this now should be running two different pods both of which are starting up I'll go back to my pods here let's take a look at how things are going with um yeah it's still initiating so we'll give that a few minutes and the other pod should be starting too of course when we start to use the Open chat one we'll have to swap the Pod ID on our script in the EnV file okay so just as a brief overview of what happens in this script um first off we download the tokenizer for this model that's why we have to specify the model and the reason for this is to use the apply chat template apply chat template will format a list of messages automatically into a long prompt of course for open AI you just send the messages and behind the scenes open AI decides how to format The Prompt we don't see that um but because we're calling uh a custom model here we do need to properly format that prompt and that's why we need a tokenizer and that's why later on down here I can just show you once we go to the chat completion request uh one of the first steps we do is tokenizer apply chat template and get the formatted messages okay so we are just like open AI we're importing our list of tools it's the same list of tools it's just the function metadata then we import our functions again the same functions get close get current weather then we Define executing the function call that's the same as the open AI um we are just going to call the function when the model asks for us to call the function I have a short function here to test the API is up just checks that the um pod is running and we'll give you an error if it's not then we have the chat completion request but this time using runp pod we will first off check that the API is up and running and next we have a recursion depth check so you'll remember there was a case where I said we would call two functions so first we get the weather then we get the clothes so that's a chain of calls and you can have like multiple calls if the language model is strong enough and but what I want to do is limit an infinite Loop so I've got a counter and it's going to count how many times we have looped and it will exit after I think um four uh Max recur Max recursion depth of four so 
basically this will stop us from chaining infinitely if there was a problem with the model next we format the messages um just to show you what the messages might look like you set up a list of messages the first one has to be the function metadata then you have user and you keep appending in you need to append in the data you get back from the function call you need to append the data we get from the response um so this is your list uh just a sample list of messages I like to print the formatted messages so we can see just down here what that will look like and then we have our curl command so the curl command will make a call to runp pod and it will send in the payload which uh includes the formatted messages it also includes the max number of tokens to generate and do sample equals false do sample equals false means that it uses greedy decoding it means it picks the most likely next token every time so it doesn't have any temperature the temperature is zero I like this parameter a lot because I think with temperature you will tend to get um you will tend to get less consistent responses and you want to have consistent responses for function calling so once the curl command is set up uh we will try to get a response and once we get a response we're going to extract the generated text and once there's generated text we're going to check if there's a function called Json so we'll try and load a Json object which would indicate a fun function called and if there's no Chason object then that means it's a normal response so we just append the response um as an assistant message into our list of messages but if there is a function call and there's a function name specified then we're going to append a function call so we'll append that data in under the function call uh role so there's a user role an assistant role a function call role then we'll call the function and with the results of the function we'll append a function response so there this is the function response Ro and as I said earlier if um there's no function Jason or if there's no name of a function in the Json then we're just going to append the response as an assistant message okay just one last thing here I skipped over but it's important once we do get the function response we actually call the API again this is actually a recursive function so you can see we're calling the function that we're in uh so we're in the chat completion request but we're going to keep looping with more chat requests so long as functions are being called and this allows us to get chaining here I've just set up printing so it will print a nice conversation here to the screen and next we'll start with our chat script so first we have our messages um then we have to append the function metadata which is simply the content of the tools file here and we're not going to include a system message here you can with some models of system messages uh Lama supports it Mistral supports it uh ye does not deep seek does not and then we will start to send in some messages I'm going to send in a trivial one first what is 1+ one but first I need to check that we are up and running let's see here so in the container logs yep so it looks like our Lama model is up and running and we can check if open chat is up and running yeah actually both models are up and running so let's go across and we're now going to be able to run this script here so python run pod sorry this is so long but I wanted it to be descriptive because it's a run pod server running text generation 
inference for function calling specifically and it uses chat templating so you don't need to set up the prompt yourself because we use use this tokenizer apply chat template okay so calling with our simple question see what happen okay there's 1 plus one 1+ 1 is two that's good and just to show you here exactly the prompt we're sending in we're sending in you've access to the following functions so use them if required the two functions and then what is 1+ one so this is the exact prompt that goes in and then this is the pretty formatting of the discussion okay so that's our basic answer now let's try the next question which is what is the current weather in London what is the current weather in London we expect the model is going to respond with a function call to get the weather for London that's what we expect so let's see if it does that and it looks like it has requested the weather and based on our information excuse me based on the information temperature in London is 15° and the um condition is cloudy so that's uh great we're getting the right answer for the we in London using one function call and next we're going to use we're going to ask what clothes should I wear I am in Dublin so here what should happen the model should first realize it's in Dublin get the weather for Dublin and then figure out the clothes to wear in Dublin so let's try this out and remember we're using a llama model here just a small criticism of the Llama model and you could maybe improve this by adjusting the prompt format but if you look at that response we got um it does say the weather is 15° but it does include this little snippet we don't really want which says thank you for providing the response con function call so that's not ideal okay so here um let's see exactly what it's doing so actually what llama here is doing is calling a calling a function for get close and it's just hallucinating the temperature and the condition um and then the function response is working and it's responding correctly based on the function response it has so you can see some of the functionality is working but the model is the model is not strong enough here it's basically hallucinating unfortunately so let's move to look at the open chat model and see if that's going to run a little bit better what we need to do here is swap out our pod I'm going to shut down this LMA function calling pod let's close it let's close it okay and let's grab the Pod ID from the Open chat one and we got to paste that in here and we also need to grab um we need to grab the Open chat slug so I'll just put in open chat TR V4 is not here yet open chat trous V3 let's see what we get okay and here I'm going to paste this in let take a look there that's great so let's go again and ask all the basic questions we'll start with 1+ one what is 1pl one and um the beauty here is that we just had to change we just had to change the model and the Pod um and everything else is automated within the script so 1 plus 1 is two that is good news and what's the current weather so let's see how it handles this yeah so this is better it's better because it gives a clean response it doesn't it doesn't Prelude it by saying thank you for this information it just gives the response so that's quite nice and let's try the old tricky one so here we would love if it called the weather first and then called get close yeah so this is this is this is a thing of beauty here what closeth should I where I'm in Dublin function call get weather then it actually gets the weather 
So yeah, that's OpenChat, and you can see the difference there between OpenChat and Llama. Mistral, I think, is a bit better than Llama in my experience with these function calling models, and OpenChat is probably the best performer I've seen at this size.

With those two models done, let's take a look at a much bigger model called DeepSeek, a 67-billion-parameter model; there's also a Yi 34B model, which is good if you want long context. Now, something I've noticed - and I don't have enough data to say this for sure - is that models trained for coding tend to do particularly well at function calling, which probably makes sense because there are a lot of JSON objects within coding. That means if you want to do function calling with a coding model, it's going to perform well, whereas if you move towards more of a pure language model - and the Yi model seems to be more language than coding - it's not quite as strong. So even though Yi 34B is larger than OpenChat, I think in some cases it's a little worse at function calling. The problem, of course, is that many applications where you're using function calling are not just coding applications, so you may need a model that's strong on language, like Yi or Llama, or Mistral or OpenChat for the smaller ones. What you probably couldn't use so well for, say, customer service is something like DeepSeek Coder or CodeLlama: those models are really good at this - DeepSeek Coder specifically, not the plain DeepSeek chat model - but DeepSeek Coder would not be as good with the normal language that matters for your chats.

Next up, let's look at DeepSeek 67B, the function calling fine-tuned model. I've started up the one-click template, function calling DeepSeek 67B by Trellis; it's up and running, and if I check the logs here I can see that the server has started, which indicates it's ready to receive API calls. I've copied the ID of the pod and I'll go over to my scripts - I'm in the advanced inference repo - and in my .env file I've pasted in the pod ID and the name of the function calling fine-tuned model. Next I'm going to cd into the API calls folder and activate the virtual environment where I installed the packages earlier - so I'll source the environment's bin/activate - and now that the virtual environment is activated, I'm going to cd into the function API calls folder.
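Under the hood, the script picks the pod ID and model name up from that .env file and builds the endpoint URL from them. A minimal sketch, with assumed variable names (the actual keys in the repo's .env may be different), might look like this:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the repo root
pod_id = os.environ["RUNPOD_POD_ID"]   # pod ID copied from the RunPod dashboard (assumed key name)
model_id = os.environ["MODEL_ID"]      # e.g. the function-calling DeepSeek 67B repo (assumed key name)

# TGI pods on RunPod are typically reachable through the proxy on port 8080.
tgi_url = f"https://{pod_id}-8080.proxy.runpod.net/generate"
print(f"Calling {model_id} at {tgi_url}")
```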
Okay, so here I am in the function API calls folder, and I'm going to open up the RunPod TGI function-call chat template script. This is the script that lets me run function calling against the RunPod API I've set up. We're going to start off with a basic prompt using DeepSeek: let's scroll down to the very bottom, where I have the chat scripts, and the user prompt I'll use is the most basic one, "What is the current weather in London?". You'll remember that we've given the model a function to get the weather for various cities, so when I call this it should make a function call. So let's run the script with python and see what happens. It's going to call the API at least twice, because it needs to make the function call, get the response, and then summarise it.

So here we can see it's made a function call - let's see what it recommended. Actually, I forgot to click save, so I'm running the other prompt, "What should I wear? I'm in Dublin." So I'm asking what clothes to wear in Dublin: it makes a call to get the weather in Dublin, because to call the function for getting the clothes it needs to know the weather; it gets the weather back; it then calls the function to get the clothes; and then it summarises that, saying you should wear moderate clothing like a long-sleeve shirt and jeans. So this is perfect function calling performance - DeepSeek is quite strong; it's a big model. As you can see, it makes a function call, gets the response, makes another function call, and gets that response too. That's pretty advanced usage, because the language model has to figure out that it first needs to get the weather before it can make the clothes call.

Okay, let's make the more basic function call now, just asking the current weather in London - I'll click save and head back. Notice that there were three calls to the API in the previous run, because there were two function calls plus the summarising call. Here we just run and ask about the weather in London: it makes a get_current_weather function call, gets the response here, and then says the current weather in London is 15°C and cloudy. So it gets this right, which is not surprising because it's easier than the other question. And of course I always like to have a very mundane test that just asks the LLM directly, "What is 1 + 1?" - it says two, which is correct; I just want to make sure it doesn't accidentally call a function. So that's the DeepSeek model. It's probably the strongest model that has had function calling fine-tuning done, and it's available online from Trelis - you can find links just below the video.

The next thing I want to show you, again with the DeepSeek model, is function calling in Spanish. I have a branch here: git branch shows me the branches, and there's the main branch and the espanol branch. I do have some changes I need to discard first, so let's discard those and do git switch espanol. In addition to having the functions in English, we now also have functions in Spanish. I'll show you this very briefly: I've translated - or rather, ChatGPT has translated - the function for getting the current weather. It's very much the same; you'll notice that anything to do with code is kept in English, like def for defining the function, but everything else is translated into Spanish. I've done the same for the clothes function - apologies if there's bad Spanish in here; my Spanish is mediocre and I rely on GPT - and that one gets you a set of clothes given la temperatura. Then in the metadata, again, the keys are in English but the values are in Spanish - "function" here is still "function", but the rest is in Spanish. So I've translated everything to Spanish.
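To give a flavour of what that translation looks like, here is an illustrative sketch - the function names and the exact Spanish wording are my own examples, not copied from the espanol branch. Code keywords and JSON keys stay in English; descriptions and human-readable values move to Spanish.

```python
# Illustrative only - names and wording in the actual espanol branch may differ.
def obtener_ropa(temperatura: str, condicion: str) -> dict:
    """Devuelve una recomendación de ropa según la temperatura y la condición."""
    return {"ropa": "camisa de manga larga y vaqueros"}

# In the metadata, the JSON keys stay in English while the values are in Spanish.
herramienta_clima = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Obtener el clima actual en una ciudad determinada",
        "parameters": {
            "type": "object",
            "properties": {
                "ciudad": {"type": "string", "description": "La ciudad, por ejemplo Londres"},
            },
            "required": ["ciudad"],
        },
    },
}
```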
I've also made a little change to how the chat template is applied. Right up at the top in the espanol branch, when I load the tokenizer, I adjust the chat template, and the reason I adjust it is that I want the instruction in the prompt to be in Spanish too (the bit ending "...si es necesario"). So that's changed to Spanish as well, and this basically allows me to do a fully Spanish chat. If you go down to the sample messages, we're going to start off with a basic question, roughly "¿Cuánto es uno más uno?", so we should be expecting a plain text response here. I've actually just run it, but let's run it again - and we get the answer that uno más uno is dos, so that's good. The next one is the weather question for London, in Spanish, so let's try this one and see. Yes, we have a function call, and the response comes back in Spanish with the temperature in Londres and that it's nublado - cloudy. And just so you can see the function syntax, here's the raw prompt being sent in: the API has been up and running, and the prompt starts with the user turn containing the Spanish instruction ending "...si es necesario", then it gives the function descriptions in Spanish, then the question about Londres. Then the assistant calls a function, and the function response is supplied by my Python script, with a line saying, in Spanish, that here is the function's response and to use it to answer the question. The assistant then responds, in Spanish, with the weather in London. So this works quite well, and it works well for the basic functions.

I'll show you one further example - the harder one - because we're asking what clothes to wear in Dublin, which requires the LLM to first call for the weather and then call for what clothes to wear based on that weather. So let's run that and see whether it calls a function. And it doesn't: it replies, in Spanish, that to know what clothes to wear you first need to know what the weather is in Dublin, and could I tell it the weather in Dublin. That's a sensible response, but it's not as good as it could be, because it should be able to call the weather function automatically by itself. A few things here: possibly it's because the model just isn't as strong in Spanish as it is in English, but you could also look at the functions themselves. If I go to the tools here, there could definitely be improvements if you carefully write out these functions, or even get help from ChatGPT to improve their quality and clarity, making it really clear when the model should call them and what is required for a successful call. Just improving the functions themselves can definitely improve the quality of the responses you get. Still, it's good to know that this DeepSeek model provides pretty good responses in Spanish, and it's useful to know that if you translate the function metadata and all of the data, you can get pretty good performance in other languages.
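As a concrete illustration of the chat-template tweak mentioned at the start of this section, one way to do it is to patch the template string after loading the tokenizer. The model repo name and the exact English and Spanish strings below are assumptions based on the prompt shown earlier, not the literal contents of the espanol branch.

```python
from transformers import AutoTokenizer

# Assumed repo name - use the function-calling model you actually deployed.
tokenizer = AutoTokenizer.from_pretrained("Trelis/deepseek-llm-67b-chat-function-calling-v3")

# The chat template is just a Jinja string on the tokenizer, so the English
# instruction baked into it (wording assumed) can be swapped for Spanish.
tokenizer.chat_template = tokenizer.chat_template.replace(
    "You have access to the following functions. Use them if required",
    "Tienes acceso a las siguientes funciones. Úsalas si es necesario",
)
```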
And that's it, folks, for the big overview of function calling. You can get a lot more information on trellis.com, particularly in the function calling section, where I've linked the datasets, the inferencing and also the training. Just a few thoughts before I leave on which model to choose. As you move to larger models they become more accurate, and larger models are generally better at function calling as well, so if you need precision - particularly if you need chaining, or strong logic from the model within your function calling application - it's going to be better to use larger models. As you saw in this video, OpenChat seems to perform the best of the smaller function calling models compared to Llama; I didn't show it, but it also performs a bit better than Mistral, and it seems to be able to do function calling chaining in certain cases. Of the larger models, the DeepSeek model is probably the best for function calling applications and seems to perform similarly to, or better than, OpenChat 3.5 - which is partly to be expected given the model size.

Now, function calling syntax - exactly what symbols and prompts you put where - is really important for getting good performance. Little differences, like a few extra lines, an extra character, or a beginning-of-sequence token where it shouldn't be, can really affect the model's ability to produce JSON objects, so you need to be really careful with the templating. That's why, when I was building V3 of the function calling dataset and models, I wanted to release an inferencing repository as well, to make the way the prompt is generated systematic - and, as I showed, that can be done using the tokenizer's apply_chat_template. That's pretty much it for this video. Let me know any questions below, and if there are any extra models you think would be useful to fine-tune, let me know the application and the model name and I'll see what I can do. Looking forward to chatting more, folks - cheers!
Info
Channel: Trelis Research
Views: 10,144
Keywords: function calling, function-calling, function calling llama, function calling openai, function calling fine-tuning, function calling inference, function calling python, function calling model, function calling llm, function calling llama 2, function calling api, function calling server
Id: hHn_cV5WUDI
Length: 76min 36sec (4596 seconds)
Published: Thu Dec 07 2023