Fine-tuning Language Models for Structured Responses with QLoRA

Captions
I'm going to take you all the way through fine-tuning a Llama 2 model, and we'll be doing it in Google Colab, so you should be able to follow along for free. I'll take you through a basic example, and I'll also show a detailed example for when you want to get results with a very small dataset. Before that, I'm going to talk a little bit about model size and quantization, and why that helps us use tools like Google Colab. Let's get started.

Llama 2 comes in a few sizes: 7 billion, 13 billion, and 70 billion. That number is the number of weights in the model, so you can think of the neural net of the 70B model as being represented by 70 billion weights. Each of those weights would typically be represented with 32 bits, so with 8 bits per byte, a 70 billion parameter model means 70 billion multiplied by 32 bits, divided by 8 bits per byte — which is 70 multiplied by 4 — so that's about 280 gigabytes of weights.

Now, the problem is that if you connect to some kind of runtime — and I'll show you the first of the notebooks we're going to go through; I'll put a link in the comments — Google Colab allocates a fairly small machine. The GPU hasn't started up yet here, but I know that because I'm on the free plan it's going to provide 15 gigabytes, so the GPU will be 15 gigabytes in size, and you can see that's going to be way too small to fit a model that's 280 gigabytes. It's a bit better if we look at Llama 2 7B: that's 7 billion multiplied by 32 bits over 8 bits per byte, which is 28 gigabytes — so even Llama 7B would be too big if we represent it in 32 bits. That's what brings us to quantization, which is a way to shrink the model so we can fit it into something like Google Colab.
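To make that back-of-the-envelope arithmetic concrete, here is a minimal Python sketch of the rule of thumb (it ignores overhead from activations, optimizer state and quantization constants, which is why real memory use comes in somewhat higher than these numbers):

```python
# Rough memory for the raw weights: bits per weight / 8 = bytes per weight.
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1e9  # using 1 GB = 1e9 bytes for a rough estimate

print(weight_memory_gb(70e9, 32))  # ~280 GB: Llama 2 70B at 32 bits
print(weight_memory_gb(7e9, 32))   # ~28 GB:  Llama 2 7B at 32 bits
print(weight_memory_gb(7e9, 4))    # ~3.5 GB: Llama 2 7B quantized to 4 bits
```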
Just for reference, when people talk about the most powerful GPUs, like an A100 from Nvidia, those come with 40 or 80 gigabytes of memory, so if you were to represent the model in 32 bits, the 70 billion Llama model would still take many of these GPUs. As I said, that's where quantization comes in. The idea with quantization is that instead of using 32 bits for each of those weights, you scale it down to 4 bits. That means the model is going to be less precise, because with 4 bits you have 2 to the power of 4, which is 16 discrete values a variable can take — on an integer scale it could be 0, 1, all the way up to 15 — whereas with a 32-bit representation, even just looking at an unsigned integer, you have 2 to the power of 32 possible values, which is a really large number. So you're taking a lot of granularity over a large range and simplifying it down to just 16 steps, and you have to record a range when you quantize: you might take those 16 steps going all the way from minus one thousand to plus one thousand, split that range into 16 intervals, and then a number that was between minus 1000 and plus 1000 can be represented discretely as one of those 16 steps.

What this allows us to do, for the 7 billion model, is to have 7 billion multiplied by 4 bits divided by 8 bits per byte, which is 3.5 gigabytes. So the value here is that we can now fit the 7 billion model very easily onto a Google Colab notebook: if it has 15 gigabytes, which you'll see in a minute, 3.5 gigabytes fits very easily. That's the idea of quantization. You do have some loss in accuracy, but the loss is maybe not as large as you might think, because for some of the weights it's just not that necessary to have very precise values — and in fact you can go a step further and decide to quantize, or shorten, some of the weights while leaving other, more important ones at higher precision; by keeping the most important ones long you can end up with a good trade-off that doesn't affect quality too much.

This brings us on to QLoRA. QLoRA is short for quantized LoRA, so it simply means doing LoRA — which I'll explain in a minute — but quantized: training with quantized weights, 4-bit in this case. LoRA is short for low rank, and without getting into too much detail, the idea is that instead of updating all 7 billion weights, we'll just update the most important ones — that's, at a very high level, what's happening in LoRA. So we're making the training easier because we're representing the numbers with a shorter number of bits, 4 instead of 32 or 16, and we're also not updating all of the weights, only some of them. There's library support for this in the Hugging Face ecosystem: the Transformers library now supports bitsandbytes, which is the quantization library, and the PEFT library supports LoRA. One other note: you'll hear about PEFT — parameter-efficient fine-tuning — which is another way of saying we only train some of the weights, and LoRA is basically one type of PEFT.

So that explains QLoRA. Now I'm going to step through every step of the notebook for fine-tuning, and then I'll step through a second notebook that's a bit more advanced. Basically, if you're very careful about exactly how you fine-tune, and where within the prompt you put the emphasis for the model to learn, you can train with less data, and that's what I want to talk about in the advanced fine-tuning: I'll talk about prompt masking and about the end-of-sequence token, and that's going to require me to come back and talk about two things — the first is attention, and the second is loss masking. I'll come back to those; I think it's more instructive to first go through the simple notebook.
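Before diving into the notebook, here is roughly what 4-bit (QLoRA-style) loading looks like with bitsandbytes and Transformers. This is a sketch, not the notebook's exact code, and the model repo name is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"    # assumption: the gated Meta repo; a sharded public copy also works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # quantize the weights to 4 bits on load
    bnb_4bit_quant_type="nf4",            # the "normal float 4" data type used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                    # place the quantized weights on the Colab GPU
)
```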
Okay, so I'm here on the notebook, and there are going to be a few steps: we'll install the packages, then do the setup for training, then get some data that we'll use for fine-tuning, do some training, do some inference — running some samples to get output — and then I'll show you how to push the trained model back to the Hub.

So here we are at the installation: I've just installed these packages, and we can move forward to loading, so let's get the loading started — this should take about a minute. Actually, if you're going to use Llama, this model here, you'll need to sign in to Hugging Face from the notebook, so I'm going to add a little snippet of code to do that; this is a snippet to log in. You will also need to request access to the model. Alternatively, you could use the Trellis sharded version, which is a publicly available model, so you don't need to get access from Meta, and it should let you download without doing the Hugging Face login. We could just do that right away, but instead I'll log in from the notebook: I'll add some code and run it, which lets me log in to Hugging Face — I'll go to my Hugging Face tokens, copy one of my tokens, and log in. Okay, I've logged in, and now I should be able to load this model without that error, because I've been approved by Meta. You could load a bigger model here as well; that would work fine. I'm just going to pause for a moment while this loads; it should take about another minute.

Okay, this has now been downloaded. Just a few notes: I did have to go back and reinstall — there's a temporary issue (it's August 9th) where the latest Transformers library has a breaking change, so I've pinned it to 4.31. You can see now that the model has been loaded, and that it's about 4.7 gigabytes; in the earlier back-of-the-envelope calculation I estimated 3.5, so the real number comes in a little higher than that estimate. So we have the model loaded and we can move on to the next step, which is the training setup.

Here we set up parameter-efficient fine-tuning, which means we'll only update some parameters — we're setting up the model to train only certain parameters. We'll also get a printout of the number of trainable parameters; this is the function that prints it, but it's not going to print just yet, because first we need to import the configuration. So we run this and this — and actually this needs one other fix as well, which I'll take from my detailed notebook: to fine-tune Llama you need to change the target modules, because they're defined differently than for other models, and the way you do that is by running this command here with the target modules specific to Llama models. I'm going to run this. So what's happening now is we're loading the parameter-efficient fine-tuning model with this configuration, and when we specify target modules we're selecting only certain groups of the weights to be updated by the fine-tuning — we're not going to update all the weights, only certain groups; in fact, only a fraction of a percent of the weights in this model will be trainable.
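A minimal sketch of that LoRA/PEFT setup with the Llama-specific target modules; the rank, alpha and dropout values here are assumptions rather than the notebook's exact settings:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training (gradient checkpointing, upcasting layer norms, etc.).
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Llama's attention projections are named differently from other architectures,
    # which is why the target modules have to be set explicitly:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # prints the small fraction of weights that will be updated
```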
Now we move on to the data setup. This is an English quotes dataset — I don't think it's the best choice for fine-tuning, but it's a simple set; you can find datasets on Hugging Face by just looking up Datasets. We'll go ahead with this one and load the samples, which should be fairly quick since it's just text data.

Now we move on to training. For this, we set a pad token. The way training works is that you have these sequences of data, but the sequences may not all be exactly the same length — we can just go in here, grab the dataset, and have a look. Here we have the English quotes: each example has a quote, tags, and an author. And if we look at how the data is being assembled, going back to the fine-tuning, it looks like we're fine-tuning on just the quotes — we're taking only the quote field — and the quotes are not all the same length. That's why you need to pad them to the same length, so they fit inside a matrix, and that's why we set a padding token. Llama doesn't have a dedicated padding token, so the end-of-sequence token — this one here — is what gets used for padding. Potentially that can create some issues, but it's quite detailed; we'll get into it in the advanced training.

Then we set up the trainer. We're just going to run this for 10 steps, so it's going to take the quotes and run through them. What the model does is: every time it feeds in a quote, it predicts the output and compares the predicted output to the actual output — it predicts each of the next words and compares them to the actual next words — and the difference results in a loss. If the predictions are very different from the actual next words, you'll have a higher training loss, and that loss is then used to back-propagate: the loss updates the weights by flowing backwards through the neural network. So you generate the prediction, you get the loss, and you use that loss to update all of the weights. And because we've set gradient accumulation steps to four, what's actually happening is that we calculate the loss four times and then use those accumulated losses (technically the gradients) to back-propagate, so the update effectively only happens every four steps. You can see the loss is kind of declining as we go along. I'm not expecting it to change the output much here — I'll show you in the advanced example how much of a difference training can make.
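A short sketch of the kind of training setup described here, assuming the commonly used Abirate/english_quotes dataset and typical QLoRA hyperparameters (the notebook's exact values may differ):

```python
import transformers
from datasets import load_dataset

# Assumption: the English-quotes dataset used in the standard QLoRA Colab.
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda sample: tokenizer(sample["quote"]), batched=True)

# Llama has no pad token, so the basic notebook reuses the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,   # accumulate gradients over 4 micro-batches before updating
        max_steps=10,                    # the short demo run from the video
        learning_rate=2e-4,              # assumed value
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```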
But let's move ahead and do some inference. Inference is where we actually do some prediction, so here I'm going to get it to stream a very brief comparison, using some code I copied over. Actually, this here needs a zero added, because we want to move the tensors to the GPU, and my syntax isn't exactly right, so let's go back and get the exact syntax — I think I need to go to the Jupyter notebook where I do inference, where I have a script that does all of this. The problem, if you remember, is that the runtime flag was undefined, so I'm just going to define that runtime flag, and now we will stream. We do need the text streamer to be installed. All right, we do get a warning here — there's actually a small script I should run before calling inference, which we can just run here if we want to clean that up — but anyway, you can see the output streaming right here. I'll just stop that because I want to clean it up, so let's clean up.

And this is worth looking at, because you get familiar with the structure. For Llama, the structure is: you have a system prompt that's wrapped within <<SYS>> tags; then both the system prompt and the first user prompt are wrapped within [INST], which is short for instruction; and then you have the <s> token, which is the beginning of sequence — it should wrap the system prompt, the user prompt, and the assistant's response — and you can see the end-of-sequence token right here at the end. So there we go: we've now fine-tuned and run inference on this model.

I'll show you now how, if you wanted to push this model to the Hub, you can run this cell. It's going to put it up as a repo under your own organization — you can see Trellis here; it's not going to let you push to Trellis unless you're logged in with permissions, but you can swap Trellis for whatever your organization name is. What's happening here is that it's saving the adapter model: the adapter is basically the deviation of the fine-tuned model from the original model, so we save that deviation, then we define a name and a target for the new model, and then we save the model — this is pushing the adapter onto the Hub, so you'll have a new repo for the adapter weights, which again are the difference between the original model and the fine-tuned model. Then what you do is reload the base model and apply the adapters — that's the merge and unload: reload the base model, load the adapter model, basically scrunch those two together, and then push the full model back up onto the Hub. That's handy if you want to run inference on the full model in future. The last step is that you might want to push the tokenizer as well, so that a tokenizer is available in your repo; it just copies the tokenizer from the base model. I think I just need to fix the base model name — that looks good — actually it's not base_model_name, it's model_id in this notebook. And then there's the max shard size: this splits up the weights in your model into files of that size. Usually I go for maybe 5 gigabytes; you could go for 10, but I wouldn't recommend going higher than that — the advantage of 5 is that it can make the shards easier to load into Google Colab notebooks.
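A sketch of that adapter-push, merge and re-upload flow, assuming `model` is the PEFT-wrapped model from the earlier sketches; the repo names are hypothetical:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "meta-llama/Llama-2-7b-hf"                 # assumption, as before
adapter_repo = "your-org/llama-2-7b-fine-tuned-adapter"    # hypothetical repo names
merged_repo = "your-org/llama-2-7b-fine-tuned"

# Push just the adapter weights (the small delta from the base model).
model.push_to_hub(adapter_repo)

# Reload the base model, apply the adapter, and fold ("merge") it into the base weights.
base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16, device_map="auto")
merged = PeftModel.from_pretrained(base, adapter_repo).merge_and_unload()

# Push the full merged model, splitting the weights into ~5 GB shards for easier loading.
merged.push_to_hub(merged_repo, max_shard_size="5GB")
AutoTokenizer.from_pretrained(base_model_id).push_to_hub(merged_repo)
```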
So that's pretty much a start-to-finish overview of a simple fine-tune and pushing it to the Hub. What we'll do now is move on to the advanced fine-tune. The advanced fine-tuning notebook is available if you'd like to support this channel — it's 14.99 — and it lets you do some more advanced, finicky things around the fine-tune that should help performance on small datasets. To explain the benefits, I need to go through two concepts: the first is attention and the second is loss masking.

Attention is the idea that the prediction of the next token depends on earlier tokens. As an example, take "the quick brown fox jumped". If we break that into tokens — let's say, for argument's sake, it splits word by word, which is not exactly how it would be tokenized, because the tokenizer uses subwords — then we have five tokens and we're trying to predict the sixth. Attention basically determines whether you consider the previous tokens when evaluating the next token. If you want to pay attention to all of these tokens, you'd have an attention mask that is all ones: when predicting the next token, we're considering all of the previous tokens.

Now, I mentioned earlier that when you're training you have many snippets — say quotes — and those snippets may not be the same length, so you pad them, which means putting in pad tokens. In the advanced example I'll show you in a moment, we're going to put in pad tokens like this: let's say we have one padding token, because instead of every training sequence being five tokens long, the maximum sequence length we're training with is six, so we need to pad every sequence to be six long. For the attention mask, we don't want to pay any attention to the pad token, because it shouldn't influence the prediction of the next token, so we put a zero there; any matrices that take previous tokens into account will have that entry multiplied by zero for the pad token, so the pad token has essentially no effect when it comes to attention. The idea with an attention mask is that you set it to ignore the pad tokens while still paying attention to all of the other tokens. That's pretty much attention, and I'll show you in the example how we tailor the attention mask to do exactly that.
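For example, padding a short sequence and inspecting the attention mask looks roughly like this (assuming the tokenizer already has a pad token set):

```python
# Left-padding a short sequence: the attention mask is 0 over the pad tokens,
# so they contribute nothing to the prediction of later tokens.
tokenizer.padding_side = "left"

batch = tokenizer(
    ["The quick brown fox jumped"],
    padding="max_length",
    max_length=8,
    return_tensors="pt",
)
print(batch["input_ids"])       # pad token id(s) on the left, then the real token ids
print(batch["attention_mask"])  # zeros over the padding, ones over the real tokens
```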
The next thing is loss masking, and the idea here is selecting which tokens — really, which token predictions — to penalize. Let's take our example again: the inputs are "the quick brown fox jumped", and the outputs are going to be predictions. We'll make up some predictions: after the word "the", let's say the guess is "boy"; after the word "quick", the guess is "brown"; and let's say it gets all the rest correct. So those are the inputs, that's the output (which I'll clarify by calling it the prediction), and the actuals — what the predictions should be — are just a copy of the inputs, shifted by one, because the prediction after "the" should be "quick", the prediction after "quick" should be "brown", and so on.

When calculating the loss, the model gives the predictions and we compare them to the actuals. I'm just going to make up the losses here: say there's a loss of 0.5 on the first one, because the model thinks the next token is "boy" but it's actually "quick"; on the next one the loss is, say, zero; then zero again; on "fox" the loss is 0.1; and let's say the last one is zero. One small point: even where the model gets the token right — it guesses "jumped" and the actual is "jumped" — there can still be a loss, because "jumped" might only be predicted with a probability of 90 percent. If the model doesn't predict it with 100 percent certainty, there's still some loss, because the model isn't fully sure. The only way you'd really get a loss of zero is if the model is 100 percent sure the next token is "brown" and it actually is "brown", so you very rarely see losses of exactly zero — you'll typically see small losses on correct tokens and bigger losses where it gets a token wrong. So we've got losses for each of the predictions, those losses are effectively averaged across the tokens, and that average loss is what's back-propagated through the network.

That was a bit of a roundabout way, but it needed to be gone through, because now we get to the loss mask. The loss mask determines which tokens to include in the loss calculation. Going back to our inputs: just like with the attention mask, if we have a pad token, we don't want to include it in the losses, so any pad token gets multiplied by zero, while all of the other tokens get loss-mask entries of one. That doesn't mean the losses are one; it means each of those loss values gets multiplied by the loss mask. To make this more concrete: say we do have a pad token at the start, and the model's prediction for the token after the pad is "hello", but the actual token is "the". When we calculate the loss, the prediction is not the same as the actual, so there's going to be some high loss value there — let's say 4. But we multiply the loss mask by the losses, so that first loss, even though it's 4, we don't really care about it and it gets multiplied by zero, whereas the next losses we do care about: say 5, then 0.01, 0.02, 0.1 and 0.3. So that's the loss mask: effectively it allows us to ignore certain tokens that we don't care about and don't want to penalize the model on.
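Using the made-up numbers from the example above, the masked loss calculation is just an element-wise multiply followed by an average over the unmasked positions:

```python
import torch

# Per-token losses as in the worked example (one value per predicted position)...
per_token_loss = torch.tensor([4.0, 5.0, 0.01, 0.02, 0.1, 0.3])
# ...and a loss mask that zeroes out the prediction made after the pad token.
loss_mask = torch.tensor([0.0, 1.0, 1.0, 1.0, 1.0, 1.0])

masked = per_token_loss * loss_mask
loss = masked.sum() / loss_mask.sum()   # average only over the positions we care about
print(loss)
```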
Now, where this becomes particularly important is when you have an example of text like this — let's go back and take an example from this model; we'll go to the inference section, take all of this and copy it across (sorry, it's not letting me do a nice layout here and it's doing some weird formatting, but that's fine). Here we have some text: there's a system message, then a user instruction, and then we're starting off the assistant response. When we feed in a prompt like this, we want to pay attention to all of these tokens, so we want the attention mask to be all ones (although if there are any padding tokens before it, we want zeros there). But for the loss mask, while it might sometimes be fine to penalize the model on predicting the next words of the question — it's just kind of generic text — usually we want to penalize the model specifically on the answer. We're not trying to train the model to predict the next word in a question; we're trying to train its ability to answer a question or request, so we want to train it specifically on the output. And in this case we're trying to train the model to provide a function call as a response — say it needs to provide a function call, a JSON object like this — so we definitely want it graded on that object, not on the question. By applying a loss mask that targets that part for the losses, you can focus the model's training just on the answer, which is more specific than training it on generic next-word prediction within a question, something the model is probably already well trained on.

So, with that as background, we move over to the advanced notebook. I'm going to shut down the runtime for the basic notebook, and we'll get set up here and log in to Hugging Face. You can see we're being allocated some resources from Google, and it will then ask me to log in to Hugging Face. You'll notice I run this cell first, before running any of the other cells, because if you don't run it first and wait, the notebook will start evaluating the rest before you've been approved as logged in, and that will cause bugs and crashes later on. One more step here, which is optional (you can comment it out): I typically connect my Google Drive. What this does is mean the model will be downloaded to — in fact I've already downloaded it to — my Google Drive, and the benefit is that the model loads more quickly; I'll probably save three or four minutes by connecting Google Drive. Then we load the Meta model — again, I need to be logged in to Hugging Face for that, which I've done, so that should be good. I approve the Drive mounting, then it loads the model with Google Drive as the caching directory, and then we move on to installing the packages, importing those installed packages, and loading the model itself, which I'll kick off now.
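A sketch of that optional Drive-caching step, reusing names from the earlier loading sketch; the cache path is hypothetical:

```python
# Optional: mount Google Drive so the downloaded weights persist between Colab sessions.
from google.colab import drive
from transformers import AutoModelForCausalLM

drive.mount("/content/drive")

cache_dir = "/content/drive/MyDrive/model_cache"   # hypothetical cache path

model = AutoModelForCausalLM.from_pretrained(
    model_id,                          # the Llama 2 repo, as in the earlier sketch
    quantization_config=bnb_config,    # the 4-bit config from the earlier sketch
    device_map="auto",
    cache_dir=cache_dir,               # weights are cached on Drive instead of the ephemeral disk
)
```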
I've separated out the loading of the tokenizer here — notice we'll see this printout later when we get to it. I'm using a fast tokenizer, I'm setting the padding side to the left, so each of my training sequences has its padding tokens on the left-hand side, and I'm also setting the pad token to a dedicated pad token — I'm basically defining a pad token here, which means it can't be confused with the end-of-sequence token that we used for padding in the basic notebook. When I define that, I also need to increase the number of embeddings to match the new vocabulary length, because I've added one more token. Then I do a check that the model's pad token and the tokenizer's pad token have both been set. We're still installing up here, so let's get these tokenizers going and get the pad token set.

For a sample string, I'll run this just so you can see how [INST] gets tokenized. It's probably not a great choice of delimiter, because [INST] actually gets broken into multiple tokens, and it does so in inconsistent ways — it wouldn't be my preferred choice, but it's the standard Llama choice. Then, like the basic notebook, we set up LoRA; this is very much the same as before, with the same target modules as in the basic example. You'll see the model isn't being downloaded, because it's in my Google Drive — instead we're loading the checkpoint shards, which takes about a minute and a half, and things move very fast after that.

Let's talk about the dataset. We'll install datasets, and we're going to use the function-calling extended dataset, which is for training models on function calling. Just to show you an example: in the system prompt we provide some context telling the language model it can make use of certain functions, then we ask the model to search Bing for instructions to fly a kite, and it responds in a structured format. That structured format allows the calling code to take out the search parameters, actually do a Bing search, and return that information back to the model. Notice also that there's some training data that doesn't call a function, because you don't want to condition the model to call a function every single time no matter what — you need to include questions where the functions are available but the question doesn't need a function call, like this one about watermelon seeds. This is a dataset that's been developed by Trellis; you can purchase access — I think it's 10 euro — if you're interested in function calling.

By now the model should be loaded; it's actually taking a bit longer than I expected, so let me check. Okay, it has just completed running, so the model is loaded, which is good, and you can see here that initially the end-of-sequence token is this </s>; there is no pad token set by default on Llama, so we set the pad token, and the pad token ID is 32000.
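A sketch of that tokenizer setup: left padding, a dedicated pad token, and resizing the embeddings to match (the exact pad-token string is an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.padding_side = "left"                      # pad on the left of each training sequence

# Define a dedicated pad token instead of reusing </s>,
# so padding can't be confused with the end-of-sequence token.
tokenizer.add_special_tokens({"pad_token": "<PAD>"})  # assumed token string
model.resize_token_embeddings(len(tokenizer))         # one extra embedding row; the new token gets id 32000
model.config.pad_token_id = tokenizer.pad_token_id

# Check that both the tokenizer and the model agree on the pad token.
assert model.config.pad_token_id == tokenizer.pad_token_id
```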
Here's the configuration of the Llama model, and then, with the sample string, I said I'd show you how [INST] gets tokenized — it actually gets broken up into three different pieces, but you can see that we tokenize, then de-tokenize, and come back to the same thing. We've done the LoRA setup, so LoRA is now set up and about 0.5 percent of the parameters will be trainable — a pretty small amount. Now we install the datasets, which I've already gone through: there are 78 training samples and 21 test samples.

So now comes the part where we put the training data together. What's happening here is that I've got the inputs, and the labels are the actual outputs — think of the labels as what the model should be predicting — and quite simply the labels are the inputs shifted one to the left. That's because the answer for predicting a given token's next token is always the next token in that array, so you just copy the array, clone it, and shift it to the left. And I've put an end-of-sequence token in the very last position of the labels — a </s> — so the model predicts the end of the sequence. This is important because, when the model responds with a function call, we want it to include an end-of-sequence token; we don't want it to continue blabbing on. We want it to know that once the JSON object is provided, that's the end, and it needs to signal that by adding an end-of-sequence token. It's also going to be important that we have a one in the last position of the loss mask, so that we actually consider the loss from predicting that end-of-sequence token. In fact, that's what we're doing right here: you can see we're setting a loss mask that matches the inputs minus the response — that's the starting index — so if you have your full sequence of input and response, we want ones for all of the response, and we'll also have a one in that last position. This is the shifted loss mask: the labels are shifted one to the left, so the loss mask also needs to be shifted one to the left, and then we add in a one to match up with that end-of-sequence token. We'll see this in a moment.

So we've run that and prepared the dataset — we're putting it all together in the format we need — and we apply that setup to the training set and the test set. Now let's take a look at some of the data. You can see that right at the start it's just padding tokens, because we're padding all the sequences to the same length, so that makes sense; notice how the attention mask is zero there, because we don't want to pay attention to those padding tokens, and we don't want to include the loss either, which is why the loss mask has zeros there too. At the end of the sample the attention mask is one, because we do want to pay attention to the previous tokens, but notice how the loss mask is still zero — that's because we're still in the question portion, and we want to focus the training on the answer part. There are actually only seven tokens that we're focusing the training on when it comes to the losses, and they're the last seven tokens: the end-of-sequence token and then "Apples are nutritious" — those are the seven tokens we're focusing the loss on.
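A rough sketch of that label and loss-mask preparation, under the assumption of left padding and a known response start index for the sequence; this is an illustration of the idea, not the notebook's code:

```python
import torch

def build_labels_and_mask(input_ids, attention_mask, response_start, eos_token_id):
    """Labels are the inputs shifted one position to the left, with an EOS token
    appended, and the loss mask covers only the response plus that final EOS."""
    labels = input_ids.clone()
    labels[:, :-1] = input_ids[:, 1:]       # target for position i is the token at i+1
    labels[:, -1] = eos_token_id            # the model should learn to finish with </s>

    loss_mask = torch.zeros_like(input_ids)
    loss_mask[:, response_start:] = 1       # grade only the response tokens...
    loss_mask = loss_mask * attention_mask  # ...and never the padding

    # The labels were shifted one to the left, so shift the mask too,
    # then switch on the final position so the appended EOS is graded.
    shifted_mask = torch.zeros_like(loss_mask)
    shifted_mask[:, :-1] = loss_mask[:, 1:]
    shifted_mask[:, -1] = 1
    return labels, shifted_mask
```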
We're only considering the loss from those seven tokens, and notice how we are including the end token, because, as I said, that's important: we don't want the model to ramble on afterwards, particularly when the response is a JSON object.

All right, now we'll generate one sample, and this is from the untrained model. You can see the system input here is "you're a helpful research assistant", we've provided two functions — a list-files function and a save-chat function — and then we give the instructions for how the language model should call a function. The question is "what are my files?", so what the model should do is call list files, and that's what the correct assistant response is: it calls list files. But you can see that the assistant is actually providing a kind of reasonable response and then blabbing on further with other stuff — that's a harder response to work with, and it also doesn't make sense, because it can't actually retrieve the files, and it then makes up the file names. So the generated response is not quite right, and that's why you want to train it to give a terser answer like the correct one. Just note that we're generating sample zero of the test dataset, which we'll also run later on.

All right, let's go to training. We've got a training script here, and we have a custom data collator — basically, I'm putting the data together in a more manual way rather than an automated one, and the reason is that I want to apply the loss mask; in particular, I want to make sure the loss mask is being multiplied by the actual losses, so that we focus on the response. This is the custom compute-loss function: we're computing the outputs in a forward pass and calculating the logits, which are a representation of the probabilities of each of the 32,000 tokens in the vocabulary — there's a probability for each possible vocabulary entry as the next token — and further down we calculate the losses by comparing the actual to the predicted, and this is the key step where I apply the masked loss: I'm applying the loss mask so that we only focus on the response. Sorry, I've already run that — it doesn't matter — so let's train.

You can see I've got a table printing here: an input token, a predicted token, and a true token, and the loss is obtained by comparing the predicted to the true. Notice how we have the loss mask, the raw loss, and the masked loss: there's a raw loss for pretty much every token, but when you apply the loss mask, the masked loss comes out as zero — basically, all of the masked losses should be zero until we get to the start of the response. You can see the instruction, two carriage returns, and then this is where the actual response starts. In this case the model is predicting a function call, but the actual correct answer is to answer the question about the origin of fortune cookies, so the model isn't getting this training step correct — but we are correctly applying a loss, and it's getting a big penalty here because it's incorrectly predicting how to respond: it shouldn't be responding with a function, and it is. Notice how it is predicting a lot of the later tokens correctly, and that's why it's getting smaller losses there.
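A sketch of what a masked compute-loss override like the one described could look like, assuming the labels have already been shifted and a `loss_mask` column is passed in with the batch:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class MaskedLossTrainer(Trainer):
    """Only the positions selected by `loss_mask` (the response tokens and the
    final EOS) contribute to the loss that gets back-propagated."""

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")          # pre-shifted labels, as in the earlier sketch
        loss_mask = inputs.pop("loss_mask")

        outputs = model(**inputs)
        logits = outputs.logits                # shape: (batch, seq_len, vocab_size)

        # Per-token cross-entropy, kept unreduced so the mask can be applied.
        per_token = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            reduction="none",
        ).view(labels.size())

        masked = per_token * loss_mask
        loss = masked.sum() / loss_mask.sum().clamp(min=1)
        return (loss, outputs) if return_outputs else loss
```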
Okay, so training is underway. It looks like some of these training steps are hitting errors — basically, during training the model can sometimes run into bumps where it gets into a state that's hard to get out of, and we'll see if it stays in that state. Just a reminder of the configuration here: we're using a batch size of two, which means two samples go in at a time, and gradient accumulation of two means we're saving up the losses — technically the gradients — for two steps. Let's have a look and see if we're still having issues with the training... it's not looking great, so I'm going to stop the training.

All right, I've re-run the training here. I made sure the model was downloaded fresh from the Meta repo, so I wasn't loading from Google Drive — I've noticed caching issues with that in the past — and I re-ran the exact same training, with two for gradient accumulation and two for the per-device train batch size, so we've got 40, or actually 38, steps, and you can see that every training step is now giving a reasonable loss, and the loss is arguably dropping a little bit as we go through the steps. So let's look at running another sample and see if we get any better results than last time — in fact, I won't be able to see that result just yet, because I'm currently pushing the model to the Hub.

Okay, I'm now going to run some inference on our trained model using this script here; I'll provide a link to it in the video. I've run the setup and install, I'll be running on GPU, and you can see that I'm calling the Trellis model on the Hub, which is the model we just pushed. I've not mounted Google Drive; I've got the model loaded and the tokenizer set up, and we're ready for inference. I've run a first inference here; the answer was very long, but it functioned as expected. Now what we want to do is test some of the prompts we tried previously and see if we can do better, so let's go back and find the pre-training result. I'm going to copy all of this across — the assistant's answer was a bit off before, and we want to see if we can do any better. So let's say prompt equals... I may need to stringify this prompt first, so let's do that. Okay, I've pasted in this instructions variable — it's really a prompt, so I'll just call it prompt — and then we'll stream and see what happens. Ah, it's double-wrapped, so that doesn't make sense — well, it actually worked, but I don't need to include that wrapping; I just want to set that as the whole input, so I'm going to copy this function and make a new version. Now when I run it, we've got that working: we're telling it to call a function and to respond immediately and only with a JSON object in the following format, and you can see the response is exactly what it should be — it's even got the brackets correct, which is good news. You can see it's matching "list files", and it's a lot better than what it was pre-training. Let's try another example.
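For reference, the streamed generation calls used in these tests look roughly like this; the prompt here is a hypothetical stand-in for the full function-calling prompt:

```python
from transformers import TextStreamer

# Hypothetical short prompt; in the video the full function-calling system prompt goes here.
prompt = "[INST] What are my files? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
```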
Let's modify this: instead of "what are my files", let's ask a question from the function-calling repo. I'll find the repo here and pick a question that does not have a function as the response — maybe something like "Did humans really land on the moon?". Notice that when I put this in, I'm going to keep the definition of the functions, because what I want to test is that the model doesn't respond with a function call just because it's been given the option of calling functions. So let's see if this works: "Did humans really land on the moon in 1969?" — interesting — and the answer comes back. In this case I may be a bit surprised that it didn't give a more comprehensive answer, but at least it's not providing a function call, so it's coherent. With another slight variation, this is looking reasonable: you can see that it's correctly not calling a function in this case — and this is only a 7 billion parameter model, so it's pretty impressive that this is working quite well for such a small model. I'm just going to delete this here, because we have no need for it, and put back in what we had originally — there's no need for that either — and get it back to generating a reasonable response. It's funny, it put in "nobody knows the difference", but it's actually plausible, and it's giving us the answer here.

So that's pretty much a summary. We've gone through a lot of steps and found a lot of interesting issues along the way. I'll share a link to the lecture notes here, and I'll share a copy of the free notebook. I'll give you a link if you want to support the channel and buy the advanced fine-tuning notebook, and I'll also provide a link to the function-calling extended dataset and to some of the models which you can download already trained — the 7 billion function-calling model — and at that link you'll find the inference script. I hope this was of some use, folks. Please let me know in the comments if you have any questions or requests for other videos you might like to see, or anything else you'd find useful. Cheers.
Info
Channel: Trelis Research
Views: 5,392
Keywords: google colab, fine-tuning, llm, llama 2, PEFT, LoRA, QLoRA, Language Model, bitsandbytes, quantization, large language models, fine tuning, quantized, fine tuning gpt, fine tuning llama 2, quantized fine tuning, fune tuning qlora, QLoRA fine-tuning, LoRA fine-tuning, fine-tuning google colab, llama 2 fine-tuning, llama 2 fine-tuning using qlora, llama 2 fine tuning tutorial
Id: OQdp-OeG1as
Length: 65min 27sec (3927 seconds)
Published: Wed Aug 09 2023