Fine-tuning Optimizations - DoRA, NEFT, LoRA+, Unsloth

Captions
I'm going to take you through a series of fine-tuning optimizations. A very common way to fine-tune language models is a technique called LoRA. It works very well in terms of efficiency and performance, except that sometimes it doesn't quite match the performance of a full fine-tune. I'll be covering some of the newest techniques that have emerged: DoRA, NEFT, LoRA+, and also Unsloth.

Let's take a look at the agenda. I'll start with a quick recap of how LoRA works. Then I'll describe DoRA, which is a slight modification of LoRA, a directional form. Then I'll cover LoRA+, which uses different learning rates for the LoRA matrices. Next I'll cover Unsloth, which is quite different to LoRA, although it supports LoRA: it's a combination of a number of clever speedups that allow you to get, usually, at least a 2x speedup on fine-tuning. Then I'll cover NEFT, which involves adding noise during fine-tuning, and that gives you reduced overfitting and generally better performance. I'll be showing you all of these in a Jupyter notebook script, which you can get from the advanced fine-tuning repository; you can purchase that over on Trelis.com. As usual, though, I'll try to give you enough detail so you can make the modifications by yourself.

The idea of LoRA is to avoid fine-tuning all of the parameters in the weights of a language model. Language models have different modules, and these modules have matrices, so there are many weight matrices within a language model. For example, you might have a matrix that's about 1,000 by 1,000 in size. The idea with LoRA is, for each of the matrices within the language model, rather than tuning what here would be about a million parameters, to instead tune an adapter. We fine-tune two smaller matrices, A and B. A will have the same width, 1,000, as the original matrix, but it will be smaller in height; here I've shown that as 8, and the same for B. These are both long, thin matrices, and when they're multiplied together, taking B transpose times A, you get back to a matrix that's about 1,000 by 1,000 in size. The idea with LoRA is to freeze all of the original weights and to train just this small subset of weights instead. By doing that, we only have to train about 16,000 parameters (2 x 8 x 1,000) in this specific case, compared to a million. So we have far fewer parameters to update, and the other benefit is that training an adapter gives you some form of smoothing: it tends to make the updates in a more even way than if you were individually optimizing the full million parameters.

Here are the equations for LoRA applied to one matrix, which we usually call W. Instead of training W, we train B transpose times A and freeze W; we represent the new weights as the original matrix plus this adapter, W' = W + B^T A. We train the adapter, and at the end of training we just merge the adapter on top of the original W. More specifically, when we initialize the trainable matrices, we initialize B to zeros and A to random values. The reason is that at the start of training, because B is all zeros, B^T A evaluates to zero, so right at the start of training we have a model that's exactly the same as the original. Over time, as B and A get updated, B picks up non-zero values, and B^T A contributes a fine-tuning update. And that's LoRA: it works very well, and it still works very well today, so if you're happy using LoRA, in most cases I would stick with it.
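As a minimal sketch of that idea in PyTorch (illustrative only, not the PEFT implementation; the rank of 8 and the shapes follow the example above), a LoRA-wrapped linear layer might look like this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen linear layer.
    Real implementations (e.g. PEFT) add scaling, dropout, and merge logic."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights W
        # A gets random init, B gets zeros, so the update B^T A starts at exactly zero
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = frozen path x W^T, plus the trainable low-rank path x (B^T A)^T
        return self.base(x) + x @ (self.B.T @ self.A).T
```

At merge time you would simply add B^T A into the base weight and drop the adapter, which is why a merged LoRA model costs nothing extra at inference.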
As an improvement, and a method that delivers something a little closer to a full fine-tune, we have DoRA. The idea behind DoRA is to take the original weight matrix and split it into a magnitude times a direction; you can think of this in simple terms as a scalar magnitude multiplying a matrix that represents a direction. We decompose the original weight matrix like this, and now, instead of doing LoRA on the full weight matrix W, we apply the LoRA update to the directional part only. Let me make that concrete: with DoRA we represent the weights not just as the weight matrix plus B·A, but as a magnitude vector times the directional matrix plus B·A (in the DoRA paper, the directional part is normalized column-wise). So basically, DoRA does LoRA, but on the directional matrix, and additionally DoRA makes the magnitude m trainable. You can think of DoRA as taking the original weight matrix, allowing the magnitude of that matrix to be trainable, and then using LoRA to train the direction of that matrix. DoRA just adds in this extra parameter, which allows us to train the magnitude of the original matrix. In graphical form it's quite simple: you're training m, the magnitude, times the directional form of the original matrix plus LoRA adapters applied specifically to that directional form. DoRA generally should be at least as good as LoRA, except it gives this extra fine-tuning degree of freedom around the magnitude, and it turns out that just being able to adjust the magnitudes of the original weight matrices can typically get you performance closer to a full fine-tune.

Now, how do we apply DoRA? I'll show you within a Jupyter notebook, but very simply, when we're setting up the LoRA configuration for the Hugging Face trainer, we pass in use_dora=True, and in addition to selecting the modules (the attention and linear layers, which I'll show you), we also make the LoRA magnitude vector, the m here, trainable. That's the summary of how DoRA works.
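In code, that's one flag in PEFT's LoraConfig. A minimal sketch (the target module names assume a Mistral-style model; at the time of this video the flag required a development branch of PEFT, while current PEFT releases have it built in):

```python
from peft import LoraConfig, get_peft_model

dora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    use_dora=True,  # decompose each weight into magnitude x direction
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention matrices
        "gate_proj", "up_proj", "down_proj",     # MLP matrices
    ],
)
model = get_peft_model(model, dora_config)
# With use_dora=True, PEFT also creates a trainable per-module
# lora_magnitude_vector alongside the usual lora_A / lora_B matrices.
model.print_trainable_parameters()
```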
Let's move to the next approach, LoRA+. LoRA+ is another modification of LoRA, and it's quite simple: it takes the original form of LoRA and applies different learning rates to the matrix B and the matrix A. Normally when we optimize, we use the same learning rate, call it LR, for all of the matrices. In LoRA+ they've realized that you can actually use a relatively higher learning rate for the matrix B, the one that's initialized to zeros. The intuition around this is a bit difficult; one slightly inexact way to think about it is that because B is initialized at zeros, you can afford a higher learning rate, because you need to bring it up to some steady state of non-zero values. In any case, empirically you're able to increase the learning rate of matrix B quite a bit above that of matrix A, probably up to about 16 times according to the paper, and overall this allows you to get faster convergence in your LoRA training. To use LoRA+ you don't change anything about LoRA itself; you just change the optimizer. I'll show you some code that I've put together with the help of the LoRA+ project on GitHub, which adjusts the optimizer so that the learning rate for the LoRA B matrices is some multiple of the learning rate for LoRA A.
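The core of that change is just optimizer parameter groups. A minimal sketch (this assumes PEFT's lora_A / lora_B parameter naming; the ratio of 16 is the paper's suggested upper end, not a tuned value):

```python
import torch

def build_loraplus_optimizer(model, lr=1e-4, loraplus_ratio=16.0):
    """LoRA+-style optimizer: LoRA-B parameters train at a multiple of the
    base learning rate. A sketch only; see the LoRA+ repo for the full version."""
    group_a, group_b = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (group_b if "lora_B" in name else group_a).append(param)
    return torch.optim.AdamW([
        {"params": group_a, "lr": lr},                   # LoRA-A and everything else
        {"params": group_b, "lr": lr * loraplus_ratio},  # LoRA-B runs hotter
    ])
```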
The next method I'll talk about is NEFT, and this involves adding noise to your model during fine-tuning; very specifically, it adds noise to the embedding layers. This again is not a LoRA-specific technique. What it involves is taking the embedding layers, which convert tokens into vector representations of those tokens, and applying some noise to just that specific layer. It turns out that when you add some noise, the language model is better able to appreciate the higher-level features of the training dataset, rather than focusing too much on one-off granularity that would result in overfitting. Simply by adding a little noise to the embeddings, you can improve the performance of the fine-tuning, and I'll show you that in the code.

The last class of optimizations I'll talk about is Unsloth. Unsloth is a very smart project where they've dug into the nitty-gritty of how the fine-tuning process works and found a large number of small speedups that together allow for a large speedup in the fine-tuning process. On their blog you can see, step by step, all of the different small improvements that together generally get you a 2x or faster speedup. Basically, there are different ways that the matrices are multiplied, and Unsloth has found ways to reduce and combine calculations so that you get overall speedups during fine-tuning. Hugging Face provides documentation for how to integrate Unsloth, and it's supported by Transformers. There are a few differences that I'll go through in the notebook: when you load the model, you need to use FastLanguageModel instead of AutoModelForCausalLM, and you also need to use FastLanguageModel when applying the LoRA adapters, that is, when getting the parameter-efficient fine-tuned model. There are some limitations to Unsloth. Generally it supports only Llama-type models; that does include Mistral models, and llamafied versions of models such as a llamafied version of the Yi model, but it does constrain the models you can use a bit. There are some other technical differences too; for example, you can't apply LoRA dropout, which means masking certain weights during fine-tuning in order to avoid overfitting. But broadly speaking, the Unsloth approach can be used with tools like SFTTrainer without any changes, which means you can even overlay optimizations like adding noise with NEFT, or using a slightly different optimizer if you want LoRA+.

Before I dive into the notebook comparison, I just want to give a quick overview. We have DoRA, LoRA+, NEFT, and Unsloth. As I mentioned, you can use DoRA, LoRA+, and NEFT on pretty much any model that's supported by Hugging Face; with Unsloth you're a bit more limited, to Llama-type models. In terms of quality, I think you get some boost using DoRA, and I'll show you that, and likewise with LoRA+ and NEFT. With Unsloth it's more about a speed boost, and you can expect about 2x. You'll see how difficult the setup is when I go through the notebook. In general, using DoRA is easy: it's just adding the use_dora parameter and then setting the LoRA magnitude vector to trainable. NEFT is very easy as well; you're just adding one flag to the SFTTrainer. LoRA+ is a bit more difficult, because right now you need some custom code to modify the optimizer, but I'll show you that and it's not too long. And Unsloth, I would say, is relatively easy to run; if you look at the GitHub, they provide very good scripts and notebooks for doing full fine-tunes. There can be some intricacies in getting the installation right at the start, depending on the CUDA drivers you have, so that's probably the trickiest part. Also, because Unsloth requires different model loading, and different PEFT (or rather LoRA) adapter loading, you'll have to make some changes to your code if you're coming from the original Transformers approach.
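To make those loading differences concrete, here's a rough sketch of Unsloth-style loading (the model name is an assumption; Unsloth publishes its own mirrors of supported Llama-type models):

```python
from unsloth import FastLanguageModel

# Load via FastLanguageModel rather than AutoModelForCausalLM
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b",  # assumed id; any supported Llama-type model
    max_seq_length=2048,
    dtype=None,          # auto-detect (bfloat16 on Ampere and newer GPUs)
    load_in_4bit=False,
)

# Apply the LoRA adapters through FastLanguageModel as well
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    lora_dropout=0,  # Unsloth does not support LoRA dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```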
It's time now for me to run through notebooks with each of these approaches. I'm going to be using the chat fine-tuning branch of the advanced fine-tuning repo; you can check that out, and make a purchase if you'd like, on Trelis.com. As a reminder, the advanced fine-tuning repo has a wide variety of scripts, everything from DPO to function calling, long-context fine-tuning, quantization, and supervised and unsupervised fine-tuning.

To get started, I'm going to use RunPod, with a one-click template for a CUDA 12.1 environment. I like using this template because it gives me a consistent setup, with consistent CUDA drivers, every time I run a fine-tuning. Typically I select an A6000, so I'll click deploy. Often I'll increase the size of the volume a little; I usually make it much bigger than I need, so I don't have to go back and increase the device size later on. Now that the pod has loaded, I upload the Jupyter notebooks from the chat fine-tuning branch, a few versions that I've saved so we can look at the different fine-tuning methods. I've just uploaded the four files we're going to compare: chat fine-tuning with LoRA, then with DoRA, then the same DoRA fine-tuning with the addition of noise (that's NEFT), and then a script for Unsloth with LoRA+. In that last script we'll see the speedup benefit of Unsloth, and we can also see the perplexity, or performance, benefit of using LoRA+. I'll go through the LoRA script first, and then more quickly through the other scripts, just highlighting what's different.

Here in the chat fine-tuning script, I start off by connecting to Hugging Face so I can push models. I'll also install and connect to Weights & Biases so I can track my run and its performance. Next come the installations; you can see that I've pinned specific versions, so that when people run the script in future they won't hit bugs from library upgrades. Once the installations are done, I enable the environment variable HF_HUB_ENABLE_HF_TRANSFER. This allows you to download and upload weights much, much faster; it's part of the hf_transfer package, and I highly recommend using it.

Next, we load the model, and the model I'm going to chat fine-tune is the base Mistral 7B model. There is an instruct version of this available, so what I'm doing is a bit redundant, but it's a good exercise to show you how to take a base model that isn't chat fine-tuned and then fine-tune it; I'll show you the dataset a little later. I load the model using AutoModelForCausalLM, with flash attention 2 to provide some speedup, and in bfloat16, which is possible because I'm using an Ampere GPU, an A6000. If you're using an A40, an A100, or an H100, those also support the bfloat16 brain-float format, which allows for improved quality. Note that I'm not loading in quantized format: usually when I fine-tune, I try to do it in 16-bit because it gives the best performance, and it then allows me to make quantizations off of that high-quality fine-tuned model. Last, I just load the tokenizer.
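Putting that loading step together, here's a sketch of roughly what those cells do (the model id is an assumption for the base, non-instruct Mistral 7B; the env var must be set before the Hugging Face libraries are imported):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # fast weight transfers via hf_transfer

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed id for the base model

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # Ampere+ GPUs: A6000/A40/A100/H100
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```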
Next, I like to run some loading checks. I check that there are no parameters on 'meta' and none on the CPU, so everything is on the single GPU, which is what I want. Now I prepare for LoRA fine-tuning. I create a function that will show me the trainable parameters in the model, and I enable gradient checkpointing, which saves some VRAM during training. I print the model, so you can see what modules, basically lists of matrices, are contained within it. The Mistral model has 32 layers, from 0 to 31; it has the self-attention matrices (Q, K, V), multi-layer perceptron layers (gate, up, down), input layer norms and post-attention layer norms, and a final RMS norm. When I create the LoRA configuration, I create adapters only for specific matrices, and a common approach is to create them for the attention layers, so Q, K, V, and also the multi-layer perceptrons. Notice that I've also listed lora_magnitude_vector: that would also get a set of adapters if I turned on DoRA. The fact that I've put it there just means the code will check whether there's any module named lora_magnitude_vector; there won't be, because I'm not turning on use_dora, so you can leave it there and it won't make a difference, but it will make a difference if you have DoRA turned on. As you can see, I've got DoRA turned off here; quite simply, I would flip use_dora to True if I wanted to use DoRA, and that would mean there's a magnitude vector in each layer. Next, I apply that LoRA configuration and load the model.

The next step is to set up the tokenizer and padding. Usually I like to print the tokenizer and inspect it: check the vocab size and the special tokens. I'll then often inspect the chat template, so I can set up the same chat template in my dataset for fine-tuning. I also like to check the pad token: usually I'll use the pad token if it's already defined, otherwise I'll use the UNK token, and if there's no UNK token I might manually define a pad token, using option B here.

I'm moving quickly on here to one more thing, which is necessary for chat fine-tuning: setting the embed and norm layers to trainable. It's usually not enough to fine-tune just the attention and multi-layer perceptron layers, especially because of the tokens at the start and end of conversations, or between the roles in a conversation; I find that you need to set the embed and norm layers to trainable. These layers are not very large matrices, so it doesn't add much extra training time, but I find it's a very important step for getting good performance. So I set some trainable parameter names, embed_tokens, input_layernorm, and post_attention_layernorm, and that means these modules are also set as trainable. They're just going to be trained fully; there won't be any LoRA adapters applied to them.
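A sketch of that step (the module-name fragments assume the Mistral architecture printed above; adjust them for other model families):

```python
# Fully train the embedding and norm layers alongside the LoRA adapters
trainable_fragments = ["embed_tokens", "input_layernorm", "post_attention_layernorm"]

for name, param in model.named_parameters():
    if any(fragment in name for fragment in trainable_fragments):
        param.requires_grad = True  # full training for these small layers, no adapters

# Sanity check: count what will actually be updated
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")
```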
Once that's applied, we have the LoRA parameters plus the embed and norm layers set to trainable, and we're ready to set up evaluation. I have a function here that creates a streaming output given some questions, and I like to run it on a few test questions. You can see here I've just run the evaluation: the first question is what planets are in our solar system, and it gives the correct planets but keeps rambling on, because the model hasn't been chat fine-tuned yet. Likewise, when I ask for the first five Fibonacci numbers, it gives the correct answer but keeps rambling on, and the same on the last question, about writing a Python snippet. So running this evaluation is a way to test whether the model has been chat fine-tuned yet.

Next, I load the dataset, and the dataset I'm going to use is the Open Assistant Llama-style dataset; we can have a quick look at it here on Hugging Face. It's a filtered version of the Open Assistant dataset, filtered by Tim Dettmers, and it's been further adapted to the chat format I need for Mistral (or Mixtral). You can see it includes end-of-sequence tokens, and also these beginning-of-instruction and end-of-instruction tokens here, and it's publicly available if you want to make use of it. After loading it, I often inspect the data, check everything is correct, and maybe test out some tokenization, before moving on to the training step.

For training, I would typically train for one epoch, although here I'm just going to train for 20 steps; I'm not going to bother running the full training, because I just want to run a comparison between the different optimizations. I'll run with a batch size of four, which fits within my GPU's VRAM, and with gradient accumulation of eight. I usually like the product of batch size times gradient accumulation to be 32, which means that in every step I'm processing 32 rows of data before I back-propagate. Next, I have a custom logging callback function, which just adds some extra logging to the training process, and I have the LoRA+ optimizer code here, which we'll use in a later iteration. For now, I'm just using a standard optimizer; you can see down here in the training args optim="adamw_torch", which will use the same learning rate, 1e-4, for all of the trainable parameters in the model. As I said, we're just going to run for 20 steps, so you'd want to comment that out if you actually want to run for one full epoch. I'm using the SFTTrainer here, which is nice because I'll also be able to use it for Unsloth, and it also allows me to add noise if I like, which we'll see down here via an extra parameter. This here is the commented-out parameter for passing in a custom optimizer, and we can also, as you'll see later, pass in a parameter for adding noise.
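A sketch of that training setup (dataset variable and split names are placeholders; the commented lines are the hooks used later for LoRA+ and NEFT, and depending on your TRL version you may also need to pass tokenizer and a dataset_text_field):

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="mistral-7b-chat-ft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # 4 x 8 = 32 rows per optimizer step
    learning_rate=1e-4,
    optim="adamw_torch",            # standard optimizer for the baseline run
    max_steps=20,                   # comment out to train a full epoch instead
    evaluation_strategy="steps",
    eval_steps=4,                   # five evaluations within the 20 steps
    logging_steps=1,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],   # placeholder split names
    eval_dataset=dataset["test"],
    # optimizers=(optimizer, None),   # pass the custom LoRA+ optimizer here
    # neftune_noise_alpha=5,          # pass this to add NEFT noise
)
trainer.train()
```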
I then run the training, which takes about 12 minutes using LoRA, and you can see that my validation loss goes down to about 1.12. Keep that in mind, because we're going to compare how low we can get the validation loss when we look at the other methods a little later. And that's a quick overview of LoRA; I wanted to give you that baseline because we'll now see how to apply the optimizations on top.

Next up, I'll show you the tweaks to make for using DoRA. There are three things to change within the script. The very first is a small installation update, because DoRA has not yet been merged into the PEFT package; I assume it'll be merged quite soon. This little cell needs to be run so that we uninstall PEFT and reinstall it from Benjamin Bossan's branch, which gives us DoRA integrated in this codebase. The second change, as we saw a little earlier, is to go down to the use_dora flag and make sure it's commented in. It's also important that we set lora_magnitude_vector to be trainable, because we need to train the length of those original matrices.

Here we are with the results of DoRA, and a few things stand out. The first is the time: it took about 27 minutes, whereas the original LoRA took about 12. This is because DoRA is not yet fully optimized, so unfortunately it actually provides a slowdown until optimizations bring it back on par with LoRA. Second, the validation loss is not significantly improved; in fact it's a little bit worse, 1.127 versus the 1.12 I got with the original LoRA. Now, that isn't a very big difference, and I do find that in some manual comparisons I sometimes get improved performance from the DoRA fine-tuning. For example, I've done some function-calling fine-tuning here: a LoRA example scores 0.86, whereas my DoRA example scores 0.856, so really not very different, although when I inspect the function-calling results, the percentage of correct answers is very slightly higher using DoRA. So I believe it's possible I'm getting an improvement here, but in short, until DoRA is optimized a bit more (and it'll be nice when it's merged into PEFT), I don't think using it is justified, given the slowdown and the lack of a very significant improvement, at least for these two specific cases of chat fine-tuning and function-calling fine-tuning.

Next up, I'll show you how to add noise to the embeddings to improve performance. I'm going to run the same DoRA notebook with one change: I'll just search the notebook for NEFT, and quite simply add neftune_noise_alpha=5 within the SFTTrainer. That is it; it's a very easy addition, and we can go down and take a look at the results. After step 20, I have a validation loss of 1.16, or 1.17 rounding up, versus the original of 1.22, so you can see a small improvement in validation loss from adding the noise. And comparing DoRA at 27 minutes to DoRA-with-noise at 27 minutes, there isn't any slowdown from adding the noise that I can measure here. So I think in a lot of cases that NEFT noise is a parameter worth adding in the SFTTrainer.
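That one-flag change, in sketch form (TRL's SFTTrainer accepts this directly; the noise is applied to the embedding outputs only during training and is removed at inference):

```python
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    neftune_noise_alpha=5,  # NEFT: add noise to the embeddings while fine-tuning
)
```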
Next, I'm going to take a look at using LoRA+, which is where we use a different learning rate for the LoRA B parameters. Everything stays the same here except the optimizer. I'll just search the notebook for 'optimizer', and I have a piece of code dedicated to setting up a custom optimizer: a LoRA+ optimizer, using code from the LoRA+ repo on GitHub, along the lines of the sketch shown earlier. This code splits the modules in two, group A and group B; for group A we apply the learning rate, and for group B we apply some multiple of that learning rate. Once the optimizer has been defined, I initialize it with a LoRA+ ratio. Now, I actually tried a ratio of 20 first, so 20-times-faster training of LoRA B, and I found that my validation loss was very high, so I came back and reduced it to a value of two, which worked reasonably well. I think possibly you could increase it further and get more benefit, but for now I'm just going to look at the case where the learning rate for the LoRA B matrices is twice the value for LoRA A. Then I need to load that custom optimizer within my trainer, and you'll see here the line optimizers=(optimizer, None). You could pass in a scheduler, but I'm using a constant learning rate, so I'm not going to pass one here. I'm also just going to comment out the learning rate and the optimizer in the training args; I believe they'd be overwritten in any case, but for clarity I'm commenting them out because I'm loading the custom optimizer.

So we'll move down to the results, and remember this run uses both Unsloth and the LoRA+ optimizer. You'll see, first off, the validation loss after 20 steps: 1.10, versus the original of 1.12, so some slight improvement, but nothing very material I would say. To determine whether that's owing to Unsloth or to the choice of optimizer, I can run a script that uses only Unsloth: everything the same, but with the custom optimizer turned off and the original default optimizer in place. In that case, we get down to a final loss of 1.15, compared to 1.1038. So basically, I think the conclusion here is that the performance isn't really changing all that much, whether I use Unsloth alone or the optimizer with LoRA B trained at a faster rate. Now, probably with more steps and more work I could figure out the optimal ratio; I previously said that a ratio of 20 becomes unstable, so maybe using a value between 2 and 20 would lead to a better improvement from LoRA+, but for now I'm not seeing a big one. And unfortunately, because you need to tune what this ratio is, LoRA+ is quite a bit less practical when you just want to get a fine-tuning done with a reasonable guess at the hyperparameters.

Just to confirm: when you run Unsloth on its own, you get a similar time, so we can tell that the optimizer itself isn't really changing the total training time. Now, Unsloth generally does provide a 2x or greater improvement in speed, and that's not what I've got here: 12 minutes with standard LoRA and just under 11 minutes using Unsloth. So there is some speedup, but I think the reason there isn't a bigger difference is that in this case I'm spending a lot of my time evaluating: the training run itself is very, very short, and I'm doing five evaluations within a small number of 20 steps. So even if Unsloth is really speeding up the training, a lot of my total 12 minutes is taken up by the evaluation, which means I'm not seeing a very big improvement; it's probably not showing Unsloth in the most favorable light. Certainly, Unsloth is bringing a speedup, though, and generally I would say that if the model is supported by Unsloth, it's a no-brainer to use it for training.

OK folks, so I've explained to you at a high level how each of these optimizations works, and I've shown you how to implement them within your scripts, but you can see that the improvements, versus the effort required to apply the optimizations, are not necessarily always worth it, and it doesn't necessarily bring clear benefits to apply all of these techniques. I'll give a big caveat, though: the results will depend on your specific fine-tuning task. Here I have just run 20 steps on a chat fine-tuning task, and I did mention function calling as well, but perhaps for more complex tasks, where you are really moving the model away from its base training set, you will see larger improvements from these optimizations; that's something specifically mentioned in the LoRA+ paper, and I believe it would also be true for DoRA. Taking a practical approach, I would recommend: if the model is supported by Unsloth, it will give you speedups, so use Unsloth fine-tuning if you can. Adding NEFT noise is a very simple parameter change and doesn't seem to slow anything down, so that's probably a smart optimization to add. For DoRA, right now, because it slows down the fine-tuning, I can't recommend using it until it's been further optimized and, probably, merged into the PEFT library. Lastly, I think LoRA+ could potentially provide speedups, but you need to be willing to spend some time tuning the right ratio between the LoRA B and LoRA A learning rates, so unless you're doing a significantly long run, where you'd run some short tests initially, it's probably not worth the effort to add LoRA+.

That's it for these optimizations. You can check out the scripts in the advanced fine-tuning repository; you can buy access to that repository and get access to any future scripts on this topic that I upload to the repo. In the meantime, let me know your questions right below. Cheers!
Info
Channel: Trelis Research
Views: 1,468
Keywords: dora, lora, dora explained, lora explained, neft explained, neft, dora huggingface, unsloth, unsloth explained, unsloth fine-tune, dora fine-tune, neft fine-tune, lora+, LoRA+, LoRA Plus, LoRA+ fine-tuning, dora fine-tuning, fine-tune with dora, language model fine-tuning, huggingface fine-tuning, dora llm, neft llm, unsloth llm, language models
Id: ae2lbmtTY5A
Length: 33min 25sec (2005 seconds)
Published: Mon Feb 26 2024