The Best Tiny LLMs

Captions
I'm going to talk about the best tiny large language models. I'll walk you through a performance comparison of some of the top models out there, show you how to fine-tune these models and how to inference them, and in particular I'll pay attention to using function-calling versions of these tiny models.

Let's get started with an overview of the best tiny LLMs. I'll take you through the motivation for using these small LLMs, and I'm comparing the performance of Phi-2, DeepSeek Coder 1.3B, and TinyLlama. I'll then give you a very key tip for fine-tuning these tiny LLMs. Next I'll move to function calling with tiny LLMs: I want to show you the performance of OpenChat with different quantizations, making it a really small OpenChat model, and show you how quantization affects performance for function calling. Then I'll talk you through some of the challenges of getting a tiny model to work for function calling, before showing you a custom model that I've developed. I've called it Trelis Tiny; it's a 1.3 billion parameter model based on DeepSeek that has been chat fine-tuned and then function-calling fine-tuned, and I'm going to show you how, even with this really small model, we're able to call APIs and get predictable JSON objects back in response.

There are two reasons why you might want a tiny large language model, which I should probably just start calling a tiny language model, although they're not that tiny all the same at 1.3 or even 2.7 billion parameters like Phi-2. First off, these models give you the best chance of running locally on your laptop; you can even use a quantized model that goes below one gigabyte in size, which makes it much easier to fit into the RAM on consumer hardware. Second, if you want a very high-throughput API, a server that's delivering over 100 tokens per second, the fact that you're delivering those tokens so fast means you can serve many more requests, and that's going to cut down your cost of serving per token.

So let's get started with that performance comparison between three different models. I'm going to use a Jupyter notebook that's available in the Advanced Inference repository. In fact, before I get started, let me just show you the two repositories I'll be using to guide us today. There's the Advanced Inference repository: this contains instructions for setting up servers in a number of ways, either with llama.cpp on your local computer, on RunPod, on Vast.ai, or, in this archive folder here, on an EC2 Ubuntu instance. Once you have a server running, you can then make use of these API call scripts; they allow you to run speed tests using text generation inference (TGI), vLLM, or even llama.cpp. There's also a set of scripts for function calling. These automatically handle making the calls, receiving the responses back from the functions, and then sending those responses for synthesis, for the language model to interpret and give you back a meaningful answer. The function API call scripts include llama.cpp, OpenAI, TGI (that's text generation inference), and vLLM. I'll show you those in detail a bit later in the video, or you can check out the function-calling video.

Before that, I'll talk a bit about fine-tuning. There are of course many different ways to fine-tune models, some of which are dealt with by specific scripts in each branch of the Advanced Fine-tuning repo. There are scripts for direct preference optimization, chat fine-tuning,
embeddings (that's retrieval-augmented generation), function-calling fine-tuning, long-context fine-tuning, quantization of your finished models, and then supervised and unsupervised fine-tuning. There are similarities between these scripts, and what I want to explain today is one tweak that you need to make in the LoRA training if you want to get good performance on these tiny models. These are the two private repositories, but hopefully I'll give you enough in this video to do it by yourself if you prefer to work through step by step with freely available materials.

So let's go back and start off our performance comparison. As I typically do in these videos, I head over to RunPod and get started in the secure cloud. What I like to do is pick out an A6000, the RTX A6000 down here at 79 cents an hour, click deploy, and pick out a PyTorch instance that will allow us to run a Jupyter notebook. I'll continue and deploy this, and once it's up and running I'm going to upload the LLM comparison script, which is in the base of the Advanced Inference repository. It's this script here, so let me just come back when the instance has started and I've opened the Jupyter notebook. In fact it's probably already ready, so all I need to do now is upload that LLM comparison script.

Here I have uploaded that script. It is going to allow us to perform a comparison between different tiny models. To start off, I'll log into Hugging Face just so I can access any gated repos; we're actually not going to use any gated models, they're all public, so that's not essential in this case. Now, the three models I'm going to compare are TinyLlama, Microsoft's Phi-2, and DeepSeek Coder, the 1.3 billion instruct model. Just to give you a little high level on each: the largest of these models is Microsoft Phi-2, which is over 2 billion parameters in size; the smallest is TinyLlama, at 1.1 billion; and DeepSeek Coder sits in between at 1.3 billion parameters. As the name suggests, DeepSeek Coder is specialized for coding, and in fact there's a system prompt there by default that tells the model not to respond to questions that are not coding related. That does restrict the field of use for which DeepSeek Coder is going to be relevant; still, it's a very strong model, and coding models are very strong for function calling, or when you fine-tune them for function calling, so it's a relevant model for us to consider. One other thing I want to mention before we move on is that the Microsoft Phi model is not available for commercial use; it's under a research-only license. So while you can use TinyLlama, and you can use DeepSeek for commercial purposes according to the DeepSeek license, you are not able to do that with the Microsoft Phi-2 model, and that is a significant disadvantage.

Now, I've set these three models and gone through my installations here. I've installed Transformers; we're not going to use quantization, because they're small models, so I'll run them in full precision so you can see the full accuracy. I'm installing flash attention as well, which will allow us to get speed-ups. When I move down to where I'm loading the three models, you'll note that for model A, which is TinyLlama, I'm using flash attention, and I'm also using it when loading the DeepSeek Coder model, which is actually a Llama-type architecture, I believe. But I'm not using it for Phi, because flash attention is not supported there, so that again is a little drawback of using the Phi model.
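As a rough sketch of what that loading step looks like with the Transformers library (the Hub IDs, dtype, and flags here are my assumptions about the notebook, not a copy of it):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face Hub IDs for the three models being compared
model_ids = {
    "tinyllama": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "phi-2": "microsoft/phi-2",
    "deepseek": "deepseek-ai/deepseek-coder-1.3b-instruct",
}

def load(model_id, use_flash_attention=True):
    # Unquantized load in bfloat16; device_map="auto" requires the accelerate package
    kwargs = dict(torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
    if use_flash_attention:
        # Requires the flash-attn package and a supported GPU
        kwargs["attn_implementation"] = "flash_attention_2"
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    return model, tokenizer

# TinyLlama and DeepSeek Coder with flash attention; Phi-2 without (not supported at the time)
model_a, tok_a = load(model_ids["tinyllama"])
model_c, tok_c = load(model_ids["deepseek"])
model_b, tok_b = load(model_ids["phi-2"], use_flash_attention=False)
```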
Once these models are loaded, which is pretty quick because they're all quite small, we can move on and set up the tokenizers. I have a little check here to see whether each tokenizer has a chat template. The chat template allows us to take a prompt and format it with all of the prefixes and tokens required according to what the model expects. There isn't a default template for Phi; it's actually a base model that has not been chat fine-tuned, so you would expect the performance to be a little less good for that reason, and because it's not chat fine-tuned there's no chat template defined. So I have defined a template based on what is suggested on the Phi-2 model card; you can check that out, and if you scroll down to the prompt section you can see that the recommendation is to format it with "Instruct:" followed by the prompt, then a new line, and then "Output:". That's the format I've set up for the template. Next I've set up a quick function that will allow us to stream a response, and I've then asked a simple question, which is to list the planets in our solar system.
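A minimal sketch of that template handling (the fallback string follows the Phi-2 model card's "Instruct: ... Output:" format; treat the exact helper as illustrative rather than the notebook's code):

```python
from transformers import AutoTokenizer

def build_prompt(tokenizer, user_message):
    """Format a single-turn prompt, falling back to a Phi-2-style template
    when the tokenizer has no chat template defined."""
    messages = [{"role": "user", "content": user_message}]
    if getattr(tokenizer, "chat_template", None):
        # TinyLlama and DeepSeek Coder ship chat templates with the tokenizer
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    # Phi-2 is a base model with no chat template; use the model-card format
    return f"Instruct: {user_message}\nOutput:"

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
print(build_prompt(tokenizer, "List the planets in our solar system."))
```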
So here we have started our performance evaluation, and the first result we can see is TinyLlama responding with a list of planets. TinyLlama here, even in the chat format, is very verbose: it actually gets the right answer, but it keeps going, including not just Pluto but many other "planets". This seems to be a characteristic of the TinyLlama model that we'll see a bit later as well; it tends to blab on, and that makes it difficult to get responses that finish, which is a desirable property, because we want an answer that stops once it's correct. The Phi-2 model is excellent: it just outputs a list of planets, and even though it hasn't been chat fine-tuned, it does very well here. For DeepSeek Coder, you can see there's a system message that gets injected by default: you are an AI programming assistant, you only answer questions related to computer science, you refuse to answer other questions, politically sensitive topics, etc. And indeed it refuses to list the planets; even though the model does know them, it does not answer. You might think, well, can I leave out this system prompt? You can, but that doesn't necessarily mean it's going to respond; it will often still refuse many questions, including ones that don't even fall under that restriction, because it's not perfectly precise in picking which questions it should or shouldn't respond to.

We'll now move to the evaluation, where we're going to look at three things. The first is returning a sequence of letters in reverse. This is a very difficult challenge for language models, and you'll see they do poorly; it's challenging even for quite large models. GPT-3.5 is sometimes able to reverse sequences of maybe six, seven, or eight letters in a row; GPT-4 is quite a bit stronger and can reverse maybe double that or more; but many open-source models have a lot of trouble returning sequences of letters in reverse. Next (and I've actually got the order flipped here) we'll do a code generation test, and then passkey retrieval, which asks the model to pull out a passkey inserted right in the middle of a long text.

So here I ask each of the models to return a sequence in reverse. First off, TinyLlama: we ask it to respond with the sequence in reverse, and it starts talking about a poem. It first responds with "ac", which is not the reverse of "ab", so it's simply incorrect and fails even for a sequence of two. With Phi, it takes in "ab" and just responds with "ab", which is also incorrect; it should be "ba". Next up is DeepSeek Coder, and here you do see better performance: it gets the first one correct (I'm not penalizing, by the way, if the model blabs on after giving the right answer; I still count that as correct), and in fact DeepSeek Coder gives the right answer and stops, which is beautiful. We move on to a sequence of three, and it also gets that in reverse; then a sequence of four, and there it fails. So DeepSeek Coder is definitely stronger, likely because it's a coding model and has seen more structured data. I think that probably helps it get statistically the right distribution of its weights to allow for a representation of reversing sequences. Clearly, based on this, the DeepSeek Coder model is the strongest.

We'll move on now and take a look at code generation. Here I asked each model to respond with some Python code that prints the first n primes in the Fibonacci series, specifically the first 10. Fibonacci is 1, 1, and then you add one and one to get two, then one and two is three, then two and three is five, then three and five is eight, so you have this increasing sequence, but the numbers are not all prime. We want the model to put out a piece of code that picks out the primes. Each of the models puts out a fairly large piece of code; the code from Phi is quite a bit shorter, whereas the other two are long. What I've done is copied those pieces of code and executed them. The TinyLlama code prints a series of twos, which is clearly not the correct answer. The Phi model, meanwhile, outputs the Fibonacci series itself, so it does something related, but it does not pick out the prime numbers. DeepSeek, though, manages to pick out the prime numbers within the Fibonacci series, the first 10 of them, and not only that, it gets the same answer as the GPT-4 version of ChatGPT. So that's very strong coding performance from what's only a 1.3 billion parameter model. Again the winner, in this coding-oriented challenge, is the coding model, which is perhaps not a surprise, but it leads me to wonder whether models that are not coding models would still benefit from training on a lot more code.
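For reference, a correct solution to that coding prompt looks something like this; this is my own sketch, not any of the model outputs:

```python
def is_prime(n):
    """Simple trial-division primality test."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def first_n_fibonacci_primes(n):
    """Walk the Fibonacci sequence and collect the first n primes found in it."""
    primes = []
    a, b = 1, 1
    while len(primes) < n:
        if is_prime(a):
            primes.append(a)
        a, b = b, a + b
    return primes

print(first_n_fibonacci_primes(10))
# [2, 3, 5, 13, 89, 233, 1597, 28657, 514229, 433494437]
```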
Next up is passkey retrieval. I upload a file, berkshire23.txt, which is a transcript of the Berkshire Hathaway 2023 meeting, and I insert a passkey right in the middle. With that passkey in the middle, the model is then asked to retrieve it. You could put the passkey anywhere else, but the hardest place is the middle; that's typically where models perform the worst. So I've asked that question. Let's take a look at TinyLlama: we put in this big piece of text, which does indeed contain the passkey, and the response here is not even related, so TinyLlama fails this test. Next is Microsoft Phi, and we ask it to find the passkey, and it manages to find the passkey within this long context. Next up is DeepSeek Coder: we ask it to find the passkey, and unfortunately we get a response saying it's not permitted to do that. What's pretty unfortunate about the DeepSeek model, at least the instruct version, is that it really restricts the answers you get. So while it has great performance, you can't really use the instruct model for a lot of things, which is why, later in the video, when I show you the Trelis Tiny model based on DeepSeek, I start with the DeepSeek base, not the instruct, and then I chat fine-tune that model so it doesn't block generating answers, and then I fine-tune it for function calling. That's the way you're able to get the good performance aspects of the model without accidentally blocking out answers you might want.

Next up, I'm going to talk about fine-tuning these tiny LLMs. Really, not a whole lot changes if you're running any of the scripts in the branches for the different fine-tunings, and by the way there's a video for each of these branches, so even if you don't have the repository you'll be able to get a lot of what you need just by watching the videos on the YouTube channel. Most of what's there is not going to change at all; there's just one change I want to talk about, and that's around using LoRA. So let me give a high-level overview of how LoRA works, just to recap, then explain what we should do differently for tiny models, and then show you an example, with function calling, of where this little tweak makes a difference in performance.

So let's go over to the slides and take a look at fine-tuning with LoRA. To understand how you need to tweak the training for tiny models, you need to understand a bit about LoRA. LoRA is low-rank adaptation, and it's a technique used for fine-tuning these models. Large language models consist of a bunch of matrices, and the matrices are bigger in bigger language models. So let's think of some reference language model, maybe a medium-sized one, and say the matrices are about 1,000 by 1,000 in size. Rather than training each of these large matrices, what we typically do instead is freeze the values in those matrices and train adapters. These adapters are a lot smaller: typically they have one dimension that's the same, so 1,024 in this case, but the other dimension is made a lot smaller. That dimension is called the rank, and it's a value of eight in a lot of cases when you're doing LoRA fine-tuning. There are two of these matrices, A and B, so that when you multiply them together in a certain way you get back to a matrix that's about 1,000 on each side. When it comes to training, though, you've got far fewer parameters, because each of these
smaller rank matrices has 1,024 × 8, so about 8,000 parameters; that's 2 × 8,000 across matrices A and B, whereas the original matrix, which is frozen, has 1,000 × 1,000, about a million parameters. This lets us significantly reduce the number of parameters we're going to train, and we do this adaptation for every single matrix, or at least a subset of the matrices, within the large language model. Just to repeat, to clarify how it works: we freeze the model's main weights, so they stay there, and in parallel we train this adapter with fewer parameters. When we're done, we collapse, i.e. we add the adapter on top of the original, to get what we call a merged model. This technique is called LoRA, and it's been shown to work well not just from an efficiency standpoint but from a performance standpoint as well, because the large matrices in language models tend to be sparse; in other words, they can be represented by lower-rank representations, i.e. the adapter we're striving to get.

OK, so that's a lot about LoRA, but how does it affect how we should tune a tiny model? Well, in a very large model, say one with a 1,000 by 1,000 main matrix for reference, even when we pick a small rank, say 8, that gives us about 8,000 times 2, about 16,000, trainable parameters, and 16,000 is still quite a lot of parameters for fine-tuning this matrix. But think now about going to a tiny model, and remember that as we go to tiny models we're going to have smaller matrices in each of the layers. So let's consider a smaller model with 256 by 256 matrices. The base model has smaller matrices, which means that when we train the adapters, the base dimension we start with for the adapter will be 256 by 8, so we'll have fewer parameters within our adapter. The problem is that a tiny model already doesn't have that many parameters, so the adapters can get really, really small, and that means you're just not training that many parameters, which means you can't really build much new information into the model. What I found, at least empirically (and this is the theory I'm using to try to explain what I think I'm seeing), is that if I continue to use a small rank, which results in a very small number of parameters in my adapters, I'm not able to get the model to adapt to the data I'm training on. I can get it to do one very specific thing, for example call a function, but then the model calls functions no matter what I tell it: even if I ask it what's 1 + 1, it won't respond "two", it will just respond with a function call for the weather or whatever functions I have built in. So the key takeaway I'm trying to explain is that when you're training a very tiny model, you don't necessarily want to use the same rank for your LoRA adapters, because it will result in training too few parameters.
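Writing out the arithmetic just described, with the standard LoRA update and the example matrix sizes from above:

```latex
W' = W + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d}

\text{trainable params per adapted matrix} = 2\, d\, r
\quad\text{vs.}\quad
\text{frozen params} = d^{2}

d = 1024,\ r = 8:\quad 2 \cdot 1024 \cdot 8 \approx 16\text{k trainable vs.} \approx 1\text{M frozen}

d = 256,\ r = 8:\quad 2 \cdot 256 \cdot 8 \approx 4\text{k trainable vs.} \approx 65\text{k frozen}
```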
Let's now see how that works in practice with a function-calling example. Here what I'm going to do is train the DeepSeek model, which was the strongest model we compared, and I'm going to do it using an advanced fine-tuning script from the function-calling branch: it's this fine-tuning function-calling v3 script, and it's going to make use of the Trelis function-calling v3 dataset, which is available for purchase on Hugging Face. Let's take a look, again via RunPod. So again I've started up an instance running PyTorch, and next I'm just going to upload that script.

OK, I've uploaded the script, and here we have it. I'm not going to go too slowly through this, because you can look at the function-calling video I made quite recently, but I will show you where the LoRA parameters are set and where I have bumped them up. As per usual, I connect with a Hugging Face login so I can push and pull from private repos, and I've connected Weights & Biases for tracking the model training. I've loaded a base model here; this is a chat model that I have fine-tuned myself from the base DeepSeek, which is kind of a side note for the purpose of what I'm showing now. I've installed the required packages, loaded the model (not quantized; I'm doing full fine-tuning with flash attention), loaded the tokenizer, and set the padding tokens, as I discuss in more detail in the function-calling video.

Moving on, I'm going to load my dataset, but first I'm going to set up LoRA, the low-rank adaptation. You can see here a list of the modules. A language model has multiple layers, and in each layer there are multiple matrices that serve different purposes: there are the attention matrices, and there are the linear layer matrices. It's common to train the attention matrices and sometimes also the linear layer matrices, and that's what I do in this example; together they account for the majority of the parameters within any language model. The norms and the other layers are important if you're doing chat fine-tuning, but they account for a smaller number of parameters and are not very important to fine-tune if you're doing function calling or structured-response fine-tuning.

So, moving on to the LoRA specification: there are a few things we have to do. As a reminder, the LoRA specification is basically deciding how we're going to set up these adapters. The first thing is to say, within each layer, which matrices we are going to apply this style of adapter to, and as I just mentioned, we'll apply the adapters to the attention modules within each layer and to the linear layer modules within each layer as well. Then up here we have two very important parameters for LoRA: the rank, which I've shown here as eight, and the alpha, which I'll explain in a moment. First off, note that, as you'll have seen in many of my tutorials, I often recommend a rank of about eight, and that value works well for function-calling fine-tuning all the way up to 70 billion parameter models. But in this case, because setting it to eight would result in training quite few parameters, I've increased it to 128. That's a significant increase: I'm basically increasing the number of trainable parameters so that we can better mold the training of the model around the specific data I'm going to use. The second parameter I'm going to choose is the LoRA alpha. I've already explained how we need to pick a higher value of r, but we also need to adjust alpha.
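In PEFT terms, the setup being described looks roughly like this; the target module names are typical for Llama-style architectures (which DeepSeek Coder uses) and are my assumption, so check model.named_modules() against your actual model:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stand-in for the chat-tuned DeepSeek base used in the video
base_model_id = "deepseek-ai/deepseek-coder-1.3b-base"

lora_config = LoraConfig(
    r=128,             # bumped up from the usual 8 because this is a tiny model
    lora_alpha=128,    # scaled along with r (the alpha / learning-rate discussion follows below)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        # attention projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        # linear (MLP) layers
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto")
model = get_peft_model(model, lora_config)
```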
Now, here's what alpha does. Any language model, when you're training it, has a learning rate. Based on the data, there's some correction the model needs to make, and the learning rate tells you how far in the direction of that correction you want to go, how big a step. If the learning rate is high, you take a big step in that direction, and you might get a fairly volatile but quick training; if you take a smaller learning rate, you take a smaller step, which can lead to a smoother, more stable training. Actually, I can skip down very quickly and show you the trainer: if we go down to the trainer, you'll see there's a place where we specify the learning rate, and it's set here to 1e-4. This would be the learning rate if we were doing a traditional training, where we would simply be training the main matrices. But we're not training the main matrices, because we're using the LoRA technique. I did in fact try training the main matrices, but that ends up being less stable and results in worse performance, at least with the parameters I checked, than training the LoRA adapters. So we're not going to train the main matrices, and that main learning rate is not what matters; what's relevant is the learning rate we're going to use for updating the parameters within the adapters, and that learning rate is related to the model learning rate by the following formula: the learning rate for the adapter is the learning rate for the model, the 1e-4, multiplied by the LoRA alpha and divided by the rank.

What this means is that when you set a given rank, say eight, and you pick an alpha of eight, the learning rate used for the adapter will be the same as the learning rate that would be used for the model if we were doing a full fine-tuning. Typically it's common to set the alpha to about four times the rank, which means you're training the adapter at a bit of a faster rate than you might train the base model, and that's why, when I recommend a rank of eight, I often recommend an alpha of 32: 32 divided by 8 is four, so it would effectively be a 4e-4 training rate. You can think about it in those high-level terms. Because we've decided to increase the rank (let's go back to our LoRA adapter settings), since I've set the rank to 128 I want to increase my value of alpha as well, because if I don't, I'm going to have a very low effective training rate for the adapters. I could keep alpha at a factor of four times the rank, and that would probably work fine, but I decided to just set my alpha to 128, which sets the effective learning rate for the adapters; you can think of it as 1e-4. So that's how we set the rank and how we set the alpha.
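Written out as a formula, that relationship is:

```latex
\text{lr}_{\text{adapter}} \;\approx\; \text{lr}_{\text{base}} \cdot \frac{\alpha}{r}

r = 8,\ \alpha = 32:\quad 10^{-4} \cdot \tfrac{32}{8} = 4 \times 10^{-4}
\qquad
r = 128,\ \alpha = 128:\quad 10^{-4} \cdot \tfrac{128}{128} = 10^{-4}
```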
After loading my dataset (again, you can look at the function-calling video for how that's all configured) and setting up the training, I train for one epoch on this function-calling dataset, which has 66 rows. Once that's done, we can move down and have a look at some of the answers. I'm going to show you the responses I get on a test dataset, so not the training set, and I'll talk you through what the answers looked like when I used a lower value of the rank that was insufficient.

Here we have an input to the model telling it that it has access to the following functions: one to get a stock price and one to get the list of names of the largest n stocks by market cap. So it has two functions in the metadata, and it's then asked to get the five largest stocks by market cap. The response it generates here, before fine-tuning, is this kind of obnoxious JSON object, which is not exactly in the format we would expect. These are the examples, and there are many more here, actually seven, just showing that without the fine-tuning you don't get a properly structured JSON object that gives you information the way you need it. Here, just for comparison, is the response I wanted: if I had trained the model and it performed as I wished, this is the correct assistant response. Now if I go down past the training (and you can see that during training we've got low training loss, and the validation loss is maybe slightly going down) and run the example after that training, you can see I've posed the same test question, get the names of the five largest stocks by market cap, and here we get proper syntax: a proper function call that closely matches the correct assistant response. It does add in this parameter "world" here, which is actually an optional parameter specified in the metadata, so that's reasonable. Here's another test question where we ask for the names of the five largest stocks, and you can see it's now generating the exact response we want.

Now, this is trained with a rank of 128, but I got the exact same thing with a rank of eight; you actually can get function-calling performance with a low rank. The problem is as follows. Let's go to a test question where I just say something trivial, for example "Greetings", so I give the function metadata but I don't ask anything requiring that data. In this case, if you train with a low rank, you will still get a function call: the model just isn't able to not do a function call, because you've trained so few parameters that it can't fit around different scenarios. However, when I increase the rank to 128, I get a sensible response here, which is "Hello, how can I help you today?", and the correct response is just saying greetings back. So that is the difference between using a low rank, a low number of trainable parameters, versus a high number.

One last tip: you can always look at how many parameters are being trained. Here you can see the trainable parameters are about 8%, and I'm training what looks like 119 million parameters, which is 8% of the model size; there are about 1.5 billion parameters in total here, because we have the base parameters of 1.3 billion plus the additional adapter parameters I've added on. If you set the rank to eight, what you'll find is that you get a very small percentage, but more concerningly, you just get a very small number of trainable parameters. That's why I say, as a tip, if you're training a model, just take a look at the total number of trainable parameters, and if it's getting really small, you may reach a point where you're not able to get enough detail with your fine-tuning.
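A quick way to apply that tip in practice (a generic sketch; if you're using PEFT you can also just call model.print_trainable_parameters()):

```python
def count_trainable_parameters(model):
    """Report how many parameters will actually be updated during training."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
    return trainable

# e.g. count_trainable_parameters(peft_model)
# For a 1.3B base with r=128 adapters this lands around ~119M (~8%), as reported above;
# with r=8 it would be a small fraction of that.
```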
If you're dealing with a very large model, it's not so much of a concern: even if you're only training 0.1% of the parameters on a 70 billion parameter model, that's still a lot of parameters relative to the dataset you're using, which is ultimately small, so it's not going to be an issue. But for tiny models it is, and that's the key takeaway: no matter what kind of training you're doing, supervised, unsupervised, or otherwise, just make sure you're training enough parameters if you're using a tiny model.

Next up on the agenda, I want to talk about function calling with tiny LLMs. There are two broad approaches I want to describe. If you want a tiny LLM, you can start with, say, a moderate-sized 7B LLM and see how far you can quantize it, quantize it as small as possible and see if it still performs; that's one approach, and I'll show you how it works or doesn't work. The other approach is to start with a tiny model, maybe TinyLlama, or DeepSeek (which is looking good from our comparisons), or perhaps the Phi model, and fine-tune that. I'll show you some performance examples when we fine-tune a DeepSeek model, and the custom model I'll take you through, which I'm calling Trelis Tiny, is a DeepSeek fine-tune for chat and for function calling. I'll show you how to get the performance out of that with a few tips for managing edge cases on tiny models.

A common question I get in general, setting aside tiny models, is which model to use for function calling, and there's an answer for small-to-medium models and an answer for large ones. If you want something around 7B, the best model in my experience is the OpenChat model. OpenChat is a fine-tune of Mistral and it performs very well with function calling; it's even able to chain function calls. In other words, if I ask it what clothes to wear in Dublin, and there's a function for what clothes to wear given the weather and a function for the weather in Dublin, it knows to first get the weather and then figure out what clothes you should wear. That's something you don't see in most other models of that size; indeed, many bigger models are not able to chain function calls. For the largest models, what I recommend is always a coding model: something like the DeepSeek Coder 34B or CodeLlama will perform very well on function calling, DeepSeek being a little bit stronger. Of course, they are coding models, so they're somewhat limited if you go outside the coding domain and want general questions answered as well; if you want a more general model, the DeepSeek 67B (I think) also performs well, and you can find it in the Trelis function-calling collection if you're interested in buying it, but it's a large model.

Now let's come back to the question of a tiny model and think about the approach of taking the OpenChat model, which is 7B, and seeing if we can quantize it. When we talk about quantization here, we can think about GGUF, which is the format used by the llama.cpp library. Here I've got a few quantized forms available: the original 16-bit GGUF is 14 GB, so still quite large; then there's the 8-bit model, which is 7.7 GB; then there are two 4-bit models, including a mixed-precision one that is about 4.4 GB in size; and the smallest is a mixed 2-bit precision type model that goes down to about 3 GB.
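As a rough back-of-the-envelope check on those file sizes: GGUF size scales with bits per weight; the bit widths I'm using below for the K-quants are approximations, not exact figures:

```python
def approx_gguf_size_gb(n_params_billion, bits_per_weight):
    """Very rough estimate: parameters * bits / 8, ignoring metadata overhead."""
    return n_params_billion * bits_per_weight / 8

# OpenChat is a ~7.2B-parameter Mistral fine-tune; bit widths below are approximate
for name, bits in [("f16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 3.3)]:
    print(f"{name}: ~{approx_gguf_size_gb(7.24, bits):.1f} GB")
# Roughly 14.5, 7.7, 4.3, 3.0 GB -- in line with the sizes listed above
```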
Now, this OpenChat model performs very well when you run it in 16-bit, it performs well in 8-bit, and it performs quite well in 4-bit mixed precision, but unfortunately, when you bring it down to 2-bit, you start to find that it gives incorrect function calls and the JSON objects are not correctly structured. So if you cannot run a model that requires at least about 5 GB of RAM, this isn't going to be an option. Even on a Mac like mine, with an M1 chip but only 8 GB of memory, I'm able to run the 4-bit model but not the 8-bit, and with the 4-bit the text generation is kind of slow. I would say you need at least 16 GB of VRAM if you're going to try to run this model, even in quantized form, at reasonable speed; alternatively you can run it on RunPod or on Vast.ai.

So let me give you a very quick demonstration of performance on the OpenChat model. I'm going to use the llama.cpp setup. There are already good instructions on the public llama.cpp GitHub project (I'll put that in the description), and I have additional guidance on getting started, some warnings and expectations, depending on whether you're setting up on a laptop or want to run remotely; you can also run with a one-click template on RunPod or on Vast.ai. Once you have llama.cpp up and running, which I'm just going to do now, I'll head over to a terminal and start a new window. I'm going to cd into a folder where I have llama.cpp fully installed following the instructions, and there's a models folder within llama.cpp, so I'll change directory into that models folder. You can see I've already downloaded OpenChat in the 2-bit format, and I've downloaded it in the 4-bit format as well. You can download those if you have access to this repo; you can of course download some of the open models as well: the Llama 2 function-calling model is publicly available, and there are many other GGUF models available from TheBloke. Now I'm going to cd back one folder, and I have a script called server.sh; it's just a one-line command where, as you can see, I'm going to call a model with two bits. So let's set up that server: I'll run server.sh, the server starts up, and it looks like it's ready, so we're ready to make calls into it. All of this, as I said, is described in the advanced server setup repo. So let's head over to VS Code, where I've opened up Advanced Inference, and navigate to the llama.cpp function-call script.
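If you just want to sanity-check the local server before running those scripts, the llama.cpp server exposes an OpenAI-compatible chat endpoint; a minimal sketch (the port is whatever server.sh passes, 8080 by default, and the model field is mostly a placeholder since the server serves whichever GGUF it was started with):

```python
import requests

# llama.cpp's server exposes an OpenAI-compatible chat endpoint
url = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "openchat",  # placeholder; the server uses the model it was launched with
    "messages": [{"role": "user", "content": "What is 1 + 1?"}],
    "max_tokens": 64,
    "temperature": 0.0,
}

response = requests.post(url, json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```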
This is the script I'm going to run. It will allow us to run some test calls using those functions, to get the weather and to get recommended clothes based on the weather. I'm going to set my API to localhost, because that's where the API is running now that I've started llama.cpp, and I need to set the model as well, because this sets the correct tokenizer. We'll look at Trelis Tiny later, but for now I need to run the OpenChat model, so I'll grab the OpenChat repo slug and paste it in right here as the model name. With that I can save, and I should be ready to call llama.cpp.

I'm going to start off with a very simple question, which is: what is 1 + 1? Let's see how this model does. I need to cd into the function API calls folder, and I'm going to run the llama.cpp function-call script with Python. Keep in mind this is the 2-bit model, so we'll see whether we get good performance or not. You can see already that the response is not that fast, because I'm running on an 8 GB M1, but it does get the answer: 1 + 1 is two. That's pretty good. Now the next question is: what's the weather in London? What the model should do is make use of the get-current-weather function, so it should call that function; let's see if it does. Here we've asked what's the weather in London, after inputting all this metadata, and let's see what kind of answer we get. OK, so it's made a function call, and based on that it's given me a suggestion of what to wear. Let's trace what's happening: it made a function call, which is good; the function response came back; and here it looks like we're not quite getting something sensible, because it's making up this function about suggesting clothes, which is not quite right. So you can see the model already fails in 2-bit quantization when you ask it to find the weather in London. If you go to the next level and ask what clothes to wear in Dublin, which requires a chained function call (it has to first get the weather in Dublin and then get clothes using a second function), it's definitely not going to be able to do that. So it succeeds at a basic non-function question, and it can call a function, but it's not able to handle the function response properly.

Let's go back and take a look at our server, and this time get it running with a better quantization. I'll open the script with nano, and instead of running the Q2_K model I'm going to run the Q4_K_M model, save that, and run the server. The server is now running with the 4-bit model; let's give it a moment. OK, we're ready to inference, so again we can ask the trivial question as a test. And here, with the trivial question, it's actually causing issues with my video, because it's being a bit slow too. So, perhaps quite appropriately, my laptop is not able to run the 4-bit OpenChat model while I'm also recording video, which just shows that we do need to move to a different tiny model if we actually want to run on a laptop like mine. I'll note that you can run llama.cpp on RunPod or Vast.ai or some similar service if you want; on GitHub there's a public repository from Trelis, the public one-click-llms repo, where you can find a variety of one-click templates, including
for llama.cpp. There's one here for Mistral 7B Instruct in 8-bit, but you could run it in any quantization using one of TheBloke's quantizations, and you can also modify the template for your own purposes.

For now, let's get back and talk a bit about fine-tuning a tiny model for function calling. There are a few challenges with doing this. Obviously the model is small, so it's not going to be as strong as a model like OpenChat or even larger models. Specifically, if you look at models that are not coding models, models like TinyLlama or Phi, these models are weak at function calling, and at being fine-tuned for function calling, because they're not trained on a lot of code or structured responses. I tried training Phi and the results I got were not even usable, and with TinyLlama, unfortunately, the responses tend to blab on more than you would want, which makes it a hard model to use. This means that DeepSeek is probably the best option if you want to fine-tune a model for function calling, and that's the model I chose when I first chat fine-tuned and then function-calling fine-tuned the base DeepSeek Coder model. The next problem is that chained function calling is difficult. Maybe somebody can find a way to do this, but the model needs to have a very good statistical distribution across data, and I'm not sure that's easy to do at this model size; in fact I'm sure it's not easy, I'm just not sure whether it's even doable at all. So I think we need to lower our expectations around chained function calling: having a model that will call one function is definitely doable, and I'll show you that, but getting chained function calling on a 1 billion parameter model is quite difficult. The next problem, a little like I showed you during the function-calling example, is that with small models, if you get the model to call a function, it just keeps calling functions; even when you give it a response, it will just continue to call another function. By using a larger number of trainable parameters we can get around this, as I already showed you, and the other trick is to make sure, when you make the inference calls, that you never allow the model to make more than one function call. In other words, once the function call is made by the model, you return it a response and then do not allow it to loop back and make another iteration. In fact, you can go a step further and prompt the model in a way that encourages it to respond with text when it gets a response back from a function, and I'll show you that.

You can go to Trelis-Tiny, which is where I have the tiny model; let's put it straight into the URL up here. You'll see I've noted that, when prompting the model, it can help, once you get back the response from a function, to say "based on the information available from the function call," and then have it answer. Let me show you what that looks like in code. If I go to a text generation inference template, you can see I have a function here that will get a chat response. Let's find the very top of that function: here's the chat completion request, and it will send a series of messages, put them in a prompt format, and get a response from the language model. Then there's a step where it will extract a JSON object if one is present, and if there is a JSON object, it will make a function call; it will call for the function to be executed, and when the function has executed, it will return the response to the language model and make another call to the language model. But what we can do here is add an extra parameter to this function so that, if it's a small model, we always suppress a further call being made to the language model. This is how you can manually force the language model to respond rather than just keep on calling functions. And as I said, we can go further: whatever response was given, we can append "now make use of the information in the JSON object and provide an answer to my previous question," and furthermore we can prompt it by giving it the start of the response, so the response must start with "Based on information available from function calls," and then it continues from there. Because I've forced it to start with this part of a sentence, it's not going to be easily able to respond with a JSON object, so the model gives more reasonable responses. A sketch of that control flow is shown below.
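Here is a minimal sketch of that control flow; names like call_llm, execute_function, the message roles, and the tiny_model flag are illustrative placeholders, not the repo's actual code:

```python
import json

def chat_with_functions(messages, call_llm, execute_function, tiny_model=False):
    """One round of function calling.

    call_llm(messages) -> assistant text; execute_function(call) -> function result.
    Both are placeholders for whatever inference backend and tools you are using.
    """
    reply = call_llm(messages)

    try:
        function_call = json.loads(reply)  # model responded with a JSON function call?
    except (json.JSONDecodeError, TypeError):
        return reply  # plain text answer, nothing more to do
    if not isinstance(function_call, dict):
        return reply

    result = execute_function(function_call)
    messages = messages + [
        {"role": "assistant", "content": reply},
        {"role": "function", "content": json.dumps(result)
            + "\nNow make use of the information in this JSON object to answer my previous question."},
    ]

    if tiny_model:
        # Suppress any further function calls: seed the start of the reply so the model
        # continues with prose instead of emitting another JSON object. This assumes the
        # backend will continue a partially written assistant turn.
        prefix = "Based on the information available from function calls, "
        return prefix + call_llm(messages + [{"role": "assistant", "content": prefix}])

    # Larger models (e.g. OpenChat) are allowed to loop and chain further calls.
    return chat_with_functions(messages, call_llm, execute_function, tiny_model=False)
```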
So let me actually show you running this script. I'll do a full example with the Trelis Tiny model. As I said, this is a model based on DeepSeek; it's been chat fine-tuned and also fine-tuned for function calling using the Trelis function-calling v3 dataset. It is capable of function calling, as you'll see, and it's capable of high token generation speed, whether that's done locally or hosted. Let me show you in two ways. I'll show you using a one-click template from RunPod, which is available on the model card for those of you who purchase access; if you don't want to buy access to the model, I would recommend fine-tuning the DeepSeek Coder model yourself, because that seems to be the strongest starting point from the base. So let's go to the quick server setup and find the one-click template. We can run that template on, well, it's going to fit on any GPU; in fact, if you want the lowest-cost GPU you should use Vast.ai, and I can develop a template for that as well. I'm just going to run it on an A6000 here, continue, and deploy.

Once I've deployed, there's just one more tweak I need to make. We can go into the readme, which gives full instructions, including how to make simple curl calls, but I want to take my Hugging Face token, edit the pod, go to the environment variables, and add it there, because I do need to put in my Hugging Face access token. I have my token, so let me just copy it in on RunPod. In the meantime, I'm going to go to the Trelis Tiny model and download a copy of the GGUF file, so I'll go to the GGUF branch. I'm going to run an 8-bit here; my general rule of thumb is that if you quantize below 8 bits, you're probably going to start to see degradation in performance, and 8 bits is kind of the cutoff where I see the performance being more or less the same. So let me download this model, 1.43 GB. By the way, the 4-bit model should also work quite well; I haven't made a 2-bit one because I think the quality just wouldn't be good enough. I'm going to put this into my llama.cpp folder, which I'll find here, and then put it into the models folder within llama.cpp and save. It's got a nice short name: tiny.Q8_0.gguf.
So I can head back over to my terminal, look at this quick command line, and adjust it, because we're now going to be operating with a Q8_0 model, and the file I think is simply going to be called tiny.Q8_0.gguf. Let's check if it's downloaded yet; still downloading, but yes, that is indeed the name of the file. In the meantime we can go back and check on our pod. I can close down the other pod that I was running to show the training scripts; don't forget not to leave pods running if you're not using them. Here we can see that the container is being set up for TGI, and that should just take a moment to load; it'll then download the model weights, and once those are downloaded we'll be able to inference. I'm going to copy the RunPod pod ID, bring it over here, and set it in my environment variables: I'll comment out llama.cpp and put in the pod that we're using for RunPod, and I also need to set the model name, like this. So here I've copied in the pod ID and made sure the model name is set, and I should be ready fairly soon, once the container is up and running.

I'm still waiting on the container to get up and running and for the weights to download, and still waiting for the GGUF file to download as well. Back in RunPod, we've got the weights downloaded, and we're running with EETQ, which is 8-bit quantization, so it should maintain good quality. I'm also using speculative decoding here, which you can learn more about in the speculative decoding video, with an n-gram setting of three, so that should give us a little speed-up as well. We're just waiting for the shards to load, and now the model is ready and the API is ready. So I'll head back over; we have everything set up, and we're now ready to make some function calls using the TGI function-call script.
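If you want to poke the TGI endpoint directly before using the script, a minimal request looks roughly like this (the pod URL pattern is a placeholder; check your pod's connect details for the real address and port):

```python
import requests

# Placeholder URL: substitute your RunPod pod's exposed TGI address
base_url = "https://YOUR_POD_ID-8080.proxy.runpod.net"

payload = {
    "inputs": "What is 1 + 1?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.01},
}

resp = requests.post(f"{base_url}/generate", json=payload, timeout=120)
print(resp.json()["generated_text"])
```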
When we look at that script, we'll start off with the same basic question we always use: what is 1 + 1? And let's see: 1 + 1 is two, so that's good news. Now we'll check: what is the current weather in London? Here we're expecting one function call. Let's look through what's happening a bit more slowly: the question is posed, what is the current weather in London; it makes a function call; the response comes back from the function along with the helper text to make use of it, plus some helper text to make sure the response starts off without calling another function; and indeed it does: the current weather in London is 15°C and cloudy. Now let me show you the limit of the model, so let's try that chained function-call question. I don't expect it to get this, but let's see what it does. Yeah, it actually just hallucinates the answer in this case and says "I am in Dublin"; it doesn't even make a function call. So you can see that single function calls are something you can actually do with a very, very small LLM, provided it's fine-tuned correctly. You do need to make some tweaks: you need to carefully chat fine-tune, then function-calling fine-tune, and then you need these manual filtering techniques to make sure that, since it's a small model, it doesn't recursively keep calling functions. By the way, if you have access to this Advanced Inference repo, this is all in the tiny-models branch, as opposed to the main branch where most of these scripts are located, and the little tweak I've made is to add a parameter where, if you say it's a tiny model, it will not allow the model to recursively call functions, whereas if it's not a tiny model, it will recursively call functions, which is actually desirable if you want chained function calling; the OpenChat model will successfully do that if it needs to call more than one function.

OK, so that's the TGI demonstration, and that one-click template is available; you can find it on the one-click-llms GitHub repository, which is public, on the Trelis GitHub. Now let me show you the same thing, this time using llama.cpp. By now our model should be downloaded; we can just list what's in the models directory to take a look, and indeed it's there: tiny.Q8_0.gguf. So we're ready to run server.sh.
This is going to get a server up and running; it's an 8-bit version of the Trelis Tiny model, and we're ready to start accepting requests. So I'll head back over to my Advanced Inference repo, comment out the RunPod API (I'll close down that server later), and comment in the llama.cpp API; note that there are also APIs here for Vast.ai if that's of interest. Now let's go back and this time use the llama.cpp function-calling script. Let's start off with our simple question of what is 1 + 1; I do have that stored, so now I'm running off my laptop, and hopefully it won't crash my video. OK, 1 + 1 is 2, so that's running well. Let's try asking what the current weather is in London, with a little helper text. And you can see here, for "what is the current weather in London", this is exactly the problem when you do not use the helper text: it just keeps on calling functions. Notice the difference in that answer; here I've got the tiny-model flag turned off. Compare that with when we ran the same model on RunPod, which should give the exact same results, and asked the exact same question. Let's scroll up to find the question "what is the weather in London": it gave the exact same function call, but because we included the helper text, and because we included the logic that stopped the model from looping, it was able to correctly synthesize the response. That just goes to show that even if your fine-tuned model has some issues, sometimes it's possible to deal with those using logic, as in this case, where a little bit of helper text allows you to make use of these tiny models for function calling in a way that will answer both normal questions and function-calling questions.

Let me give a quick summary before I let you go. If you're looking for a small or tiny language model, one of the best you can use for general conversation is the Phi-2 model, although it's not allowed to be used for commercial purposes. A very strong model is the DeepSeek Coder 1.3 billion model, although it really stops you from using it for non-coding questions through the way it has been instruction fine-tuned in the instruct format; there is the base model, but to use that you would need to fine-tune it in some way yourself. That's the motivation I had for developing Trelis Tiny, which is a model intended for utility purposes, especially function calling: it's able to make single function calls (not chained function calls), and it's also able to return short, normal responses. If you're going to fine-tune, don't forget to take a look at the number of parameters you're training; you might need to increase the LoRA parameters so that you train enough parameters for your model to have some nuance. If you're going to do inference, these tiny models will allow you to get very high speeds, and they're much more practical for running on a laptop; as you saw, running larger 7B models, even quantized, can run into issues with quality and with the memory of your laptop. That's it, folks. You can find more resources in the description or on Trelis.com, and let me know your questions in the comments right below. Cheers.
Info
Channel: Trelis Research
Views: 12,565
Keywords: tinyllama, phi-2, phi llm, deepseek 1.3b, deepseek coder, deepseek coder 1.3b, tiny llm, small llm, small lm, small language model, tiny language model, best small llm, best tiny llm, fastest small llm, best small language model, fine-tune tinyllama, fine-tune small llm, function calling small llm, function calling small language model, tinyllama 1.1b, phi-2 microsoft, llm function calling
Id: yxWUHDfix_c
Length: 62min 25sec (3745 seconds)
Published: Wed Jan 03 2024