Low-Rank Adaptation of Large Language Models Part 2: Simple Fine-tuning with LoRA

Video Statistics and Information

Captions
Hello everybody, welcome back. This is Chris, and as promised, today we're going to be talking about the implementation of LoRA. We're going to walk through a notebook that has a simple implementation, and that's what we're going to do today. I'm not going to set this up in any specific fashion, so all of the models and data that I'm using are totally interchangeable, totally reliant on what you want to achieve, but we're going to go through an example today that's straightforward and that hopefully lets you understand one of the ways we can fine-tune using LoRA.

As always, we're going to have to get some dependencies. You are going to want to make sure that you get the most up-to-date versions of both PEFT and Transformers; these are libraries that are changing at a very fast pace right now, so please keep in mind that some of the quirks or bugs I might tell you about in this video may be fixed by the time you actually get to implementing this. We're going to be relying on a few libraries today: Accelerate, loralib, and PEFT. The PEFT library from Hugging Face is the parameter-efficient fine-tuning library, and it is what gives us access to the LoRA method of parameter-efficient fine-tuning. Parameter-efficient fine-tuning, or PEFT as I'll refer to it for the rest of the video, is a broad umbrella term for fine-tuning methods that are parameter efficient, which means they aren't fine-tuning the whole model. LoRA is a specific method or application of PEFT, and that's what we're going to be focusing on today. After you do this, you're just going to want to make sure that you have CUDA available.
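In code, that setup might look roughly like the following. This is a sketch rather than the notebook's exact cells, and the datasets install is my addition for the SQuAD data used later:

```python
# Grab the newest releases of peft and transformers; both are moving fast.
!pip install -qU peft transformers
!pip install -q accelerate loralib datasets

import torch

# Make sure a CUDA device is visible before trying to load a 3B model.
print("CUDA available:", torch.cuda.is_available())
```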
The model that I'm using today is the Bloom 3B model. It is going to take roughly 40 gigabytes of GPU RAM. You can use the Bloom 1b7 model and it will take less than 16 gigabytes of GPU RAM, depending on your dataset, so I would definitely look to use the smallest model that you can, given the resources you have. I'm using Colab Premium; we can see that by going to Runtime, Change runtime type, where you can see that I have the premium GPU class selected. So if you want to follow this exactly, you will need Colab Premium, but if you want to use the Bloom 1b7 model you will not need Premium at all, which is fantastic. Again, this depends on your dataset: if you have a very large dataset, and I don't mean many rows, I mean each item in your dataset is large, then you're going to be struggling for enough GPU RAM to keep up.

So why are we using Bloom? Well, let's take a look at it. First of all, it's decoder-only; it is a causal language model, and it has a ton of parameters, including a fairly permissive token max length, so we're able to squeeze a lot of context in, which is fantastic. Additionally, it has a fairly reasonable vocabulary size. Bloom is definitely not the best model to fine-tune for every application, but it is very good. One of the reasons it's very good is that it's trained on a fairly large set of languages, including code; it's trained on quite a lot of common programming languages. So I like Bloom because, first of all, it's relatively easy to fine-tune; second, it's relatively multilingual, so it can be fine-tuned on a number of different languages, which is always a benefit; and third, it has a permissive license. The BigScience RAIL license is fantastic because it basically says you're free to do what you want as long as you follow its rules; those are the use restrictions set forth in the BigScience RAIL license, and they essentially amount to "don't do bad stuff with the model," which is something I can get behind. So that's why we're using Bloom.

We do need to use AutoModelForCausalLM; again, this is a causal language model, so we need to be cognizant that we're always keeping that in mind when we're making selections for our parameters. We are going to use float16 for this, which just reduces the amount of memory it takes up, and the reduced precision doesn't really hurt us here, so that's great. We're also using device_map="auto". This is a feature of Accelerate from Hugging Face: it lets the model layers exist across multiple devices if necessary. Now, you will run into some errors if you choose to do that, and you will have to make some tweaks in order for it to work properly, but it is an option and it's a very powerful one, so you don't have to have enough compute on a single device anymore to train your model. Listen, it will train slower without enough compute, but you can shove a lot of it onto RAM and it will still run. We're just using the BigScience tokenizer; we need a tokenizer, and this is the one that we're using.

As you can see, the model itself is quite large: when I download it, you're going to see that it is 6.01 gigabytes, which is quite big, so we want to be aware of that. This is why we need enough GPU RAM; as much as LoRA is an efficient fine-tuning method, Bloom is still three billion parameters. Once the model is loaded, we're going to go ahead and print it, thanks PyTorch. As an aside, we are using torch 2.0 in the notebook, but it doesn't really matter; we're not taking advantage of any of the optimizations put forward in torch 2.0 yet. Once the Hugging Face libraries find a way to include them, I'm sure they will be there, but as of right now it just takes some extra fiddling, so I don't think it's necessarily worth the time.

Why do we want to print the model? Well, if you remember from the LoRA paper, we need to choose which weight matrices we want to decompose and learn the decompositions of. In the paper they chose Wq and Wv, the query and value projections. Bloom doesn't have that specific delineation, but we do have the query_key_value module, which is what we're going to use, so this is the module we're going to target with our fine-tuning. This is going to be different depending on the model that you select: if I loaded a different model, I would need to target a different module with LoRA. Say, for instance, I wanted to use the LLaMA-style models; there I would target q_proj and v_proj. But for Bloom, we're just going to use the query_key_value module as our target.
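A minimal sketch of that loading step, assuming the bigscience/bloom-3b checkpoint (swap in bigscience/bloom-1b7 if you have less GPU RAM):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Bloom is a decoder-only causal language model, hence AutoModelForCausalLM.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-3b",
    torch_dtype=torch.float16,   # half precision to cut memory usage
    device_map="auto",           # let Accelerate place layers across devices
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")

# Printing the model shows the module names; for Bloom, the attention
# projection we want LoRA to target is called "query_key_value".
print(model)
```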
We're going to do some pre-processing here. This is not strictly necessary, but it does help improve training stability, so why not? We have a helper function that just helps us see and visualize why LoRA is so good, and then we have the crux of the notebook, which is setting up our LoRA config. This is going to let us set our rank, that is, the number of dimensions we're going to limit our decomposed matrices to; in this case it would be rank by the initial dimension, and then the initial dimension by the rank, for the two decomposed matrices. We have our alpha; again, they didn't talk a lot in the paper about how to determine this. I have been going with double the rank and it's been working fine, but you can choose whatever works best for you.

This is a very important piece: remember that LoRA can apply to any weight matrix, so we can inject those two decomposed matrices in place of any weight matrix in the model. In this case, though, we want to choose the weight matrices of the attention mechanism, so we're going to do that by targeting query_key_value. Again, if we look at our model, we can see that we have all of these blocks of modules, and we are going to inject into every one of the query_key_value sub-modules across this stack of 30 blocks. Hopefully that's clear; if you have any questions, please leave them in the comments below. We could also, as the paper notes, target our MLP weights, but we don't want to for the purposes of this video; we're trying to stick relatively close to what the LoRA paper put out. We have our dropout, which serves the same purpose here as it would in any other machine learning task. We don't care about biases, so we can set that to none. Lastly, we want to make sure that our task type is set to causal LM. This is just because Bloom is a causal language model; if we were using, say, a masked language model, we would set this to a different value, but we're not, so we choose causal LM. This is dictated entirely by your model choice.

Now we can use the helper function get_peft_model from the PEFT library, which accepts both a model and this LoRA config and returns a model that has all of those injectable matrix pairs already slotted in. And yes, it is that easy. A lot of the time, when we're talking about these tasks or these interesting-sounding applications, places like Hugging Face have already done a lot of work to make the actual implementation quite straightforward. We can then use the helper function we put above to get the number of trainable parameters of our model, and as you can see, this is the benefit of LoRA: we are going to be fine-tuning less than a tenth of a percent of the parameters of this model, and the results will blow your mind. That's not even clickbait, they really will. This is the power of LoRA: we're able to go from three billion parameters to less than 2.5 million trainable parameters. Insane.
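Putting those pieces together, the config might look roughly like this sketch; the rank and target module follow the walkthrough, the dropout value is an assumption, and PEFT's built-in print_trainable_parameters stands in for the notebook's own counting helper:

```python
from peft import LoraConfig, TaskType, get_peft_model

config = LoraConfig(
    r=8,                                 # rank of the decomposed matrices
    lora_alpha=16,                       # scaling factor; roughly double the rank
    target_modules=["query_key_value"],  # Bloom's fused attention projection
    lora_dropout=0.05,                   # assumed value; pick what works for you
    bias="none",                         # we don't train biases
    task_type=TaskType.CAUSAL_LM,        # Bloom is a causal language model
)

# Wrap the base model so the low-rank matrix pairs are injected in place of
# each targeted weight matrix.
peft_model = get_peft_model(model, config)

# Report how few parameters are actually trainable.
peft_model.print_trainable_parameters()
```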
The data we're going to be using today is the SQuAD v2 dataset, which is a question-answering dataset: it's got context, it's got some questions, it's got some answers. We're going to be fine-tuning our Bloom 3B model on that dataset to see if we can get it good at some kind of extractive question answering. This is not something that Bloom is traditionally good at, but we want it to be, and that's what fine-tuning is for. We're going to give the model our questions and answers, with context, in this format. We're going to be focusing on a very simple-to-understand example that's relatively naive; if you want to go deeper, we run some workshops at FourthBrain that might go into more depth on these topics in the future, but for right now we're just going to get this thing going. I think the easiest way to start understanding the fine-tuning process for these models is just getting something that works, seeing it work, and then kind of evolving from there.

So again, we're going to be feeding it this exact format for every question, over and over again. This is the training process: we're just showing it this, and what we want it to learn is not the content, what we want it to learn is the structure. What that means is we're not teaching the model all of the things in the questions we're showing it; we're trying to show it that, hey, when you see this string of characters or combination of tokens and then some context, and then you see this other combination of tokens and then some query, we expect you to answer relative to those two things. We've also included the stop token just so it knows to stop and doesn't keep running on forever. That is what we're doing; that's the goal today. I'm a big proponent of the idea that fine-tuning is for structure, not knowledge. That isn't strictly true in the sense that it's not like you can't teach a model new things; I just feel it's incredibly inefficient, and sometimes the idea is that you want to teach these models new things when there are easier ways to approach that specific problem.

Once we have this desired format, all we need to do is put our question-and-answer dataset into that format, so we go ahead and do that using this helper function, which maps all of the examples to this specific prompt; you can see that being done here. Also, if there's no answer provided, we're going to show the model that it's allowed to respond with "Cannot find answer." This is helpful because we want the model to be able to say "I've got no clue, man." That's why we include "Cannot find answer" when there's no answer; it's just an easy way to let the model say it doesn't know. And we'll be back when this is done being mapped.
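A sketch of what that mapping step could look like, reusing the tokenizer loaded earlier; the prompt wording here is illustrative, since the exact template isn't reproduced in the captions:

```python
from datasets import load_dataset

qa_dataset = load_dataset("squad_v2")

def create_prompt(example):
    # Illustrative template: context, then the question, then the answer,
    # or "Cannot find answer" when SQuAD v2 has no answer, ended with the
    # EOS token so the model learns when to stop generating.
    answers = example["answers"]["text"]
    answer = answers[0] if len(answers) > 0 else "Cannot find answer"
    prompt = (
        f"CONTEXT:\n{example['context']}\n\n"
        f"QUESTION:\n{example['question']}\n\n"
        f"ANSWER:\n{answer}{tokenizer.eos_token}"
    )
    return tokenizer(prompt)

# Map every example in both splits to the tokenized prompt format.
mapped_qa = qa_dataset.map(
    create_prompt, remove_columns=qa_dataset["train"].column_names
)
```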
All right, now that that's done mapping, we're going to be using the Transformers Trainer class here, which makes this process so straightforward. Number one, we include the model; this is the model we set up above with the helper function, with the LoRA config applied, which means we've injected these rank-8 matrix pairs into each of the places where we'd expect to see the actual weight matrix. We also pass in the training half of the mapped QA dataset. We have our TrainingArguments class, which includes the batch size per device; we're using one device, so the batch size is just going to be four. We have our gradient accumulation steps, which shouldn't be necessary, but we set them anyway; we have our warm-up steps; and we have the max number of steps. You'll note that this is 100, so there's no way we're getting through our entire dataset, but again, we're not trying to give our model all that knowledge, we're just trying to teach it what the structure of our prompts is and what it should be doing. We have a learning rate set to a completely standard value; there's so much literature about what to set the learning rate to these days that you can just look up whichever paper you like the most and go with that. We loaded the model in float16, so we have to set fp16 to true. I just like seeing the loss logged out at every step, but you can choose whatever you'd like, and of course we have our output directory.

Say we wanted to use epochs here, and we wanted to use validation: how do we do that? Well, we can check out the Trainer docs; the Trainer docs give us everything we need to proceed in a way that makes sense to us. You can see there that we can set an eval dataset, which would be, say, the validation half of our mapped QA dataset. When we look at our TrainingArguments class, we can see that we have a do_eval boolean we can use; we can look at our evaluation strategy; and we can look at things like the max number of training epochs instead of steps. We can also use the torch compile backends; the key here is that they don't guarantee any of them will work, as the support is being progressively rolled out by PyTorch, so we're going to ignore that for now. We can set the number of evaluation steps, so the number of steps between each evaluation, or we can have it evaluate at the end of each epoch. Anyway, the documentation is basically your best friend here; it's going to tell you everything you need to know, and it's incredibly robust, so I would definitely look things up if you have any questions or want to do anything outside of this demonstration. But for today, this is what we're going with.

Now, finally, we need to include the mlm=False flag on the data collator, because Bloom is not a masked language model, so we can't use that; we're mostly using this to get rid of warnings. This is taken from a Hugging Face tutorial, so I've left it in because it removes the warnings, and that's fantastic. Once we've completed setting up our Trainer class, we can just call .train() on it; .train() is the new .fit(), if you remember sklearn, and we are in the .train() world now. As you can see, this will just start training, and we'll be back when it's done.
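Assembled into code, the trainer setup might look like this sketch; the values flagged as assumed are mine rather than the notebook's, and peft_model, mapped_qa, and tokenizer come from the earlier sketches:

```python
import transformers

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=mapped_qa["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # assumed; "shouldn't be necessary" anyway
        warmup_steps=10,                # assumed value
        max_steps=100,                  # nowhere near a full pass over SQuAD
        learning_rate=2e-4,             # assumed; any standard value will do
        fp16=True,                      # the model was loaded in half precision
        logging_steps=1,                # log the loss at every step
        output_dir="outputs",
    ),
    # Bloom is not a masked language model, so mlm=False; this collator also
    # builds the causal-LM labels and silences the related warnings.
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```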
All right, now that it's done, we can see that the loss has gone down some. We're not really concerned with this training loss; it's just the training loss, we're not training on the whole dataset, we're not doing any validation, and we don't care about any real metrics. It's good to know, but really we just want the model to have learned what we need it to have learned, so we're going to check that now.

Now, you're going to be able to save this to Hugging Face using Hugging Face's fantastic API; all you have to do is set your token and you're good to go. You'll also notice that during training we did use quite a bit of GPU memory: we spiked around the 36 gigabyte range. Again, that's because we're training a 3 billion parameter model, and because we're using a rather hefty dataset; some of the contexts included in SQuAD can get quite long. Now we just need to name our model and push it to the hub, so I've called it squad-bloom-3b, and that's it. Once we've pushed it to the hub, you'll notice, first things first, that this is all we need to push: less than 10 megabytes, because that's all we're training. We're just training those decomposed matrices, so we don't even need much bandwidth to store this entire adapter. And again, it's not technically an adapter, since "adapter" is an overloaded term; these are our injectable decomposed matrix pairs. That's all it takes to store them, less than 10 megabytes. Absolutely fantastic. Then we can reload the model. All right, that sucker's loaded, we're ready to rock.

Now I want to show you something that, to me, is mind-blowing; this is the part of LoRA that's insane to me. Let's just get rid of this, okay, boom, we got rid of it; now the QA model is equal to None. Let's re-initialize this PEFT model. Both the base model and the PEFT model currently exist locally; we've already downloaded the files, so they're already set up, but it loads in what can only be described as no time flat. This is part of the advantage of the LoRA system: these are hot-swappable entities. We can fine-tune a bunch of different LoRA tasks and then swap between them, even at inference time, because the latency introduced by loading them is minimal, so we can get this going very quickly.

So I know what you're saying: "Chris, I don't believe you, you're a big liar." Well, I've just gone ahead and restarted the runtime, and I'll do it again here while it's recording. You can see we're initializing the runtime: nothing in GPU RAM, nothing in system RAM. All right, let's click this button. We're going to load our base model here; this is something we're going to want to do at the beginning of our application startup. We don't want to do this every time someone hits the endpoint, because that would be ridiculous, but we do want to do it every time the application, or the container running it, starts up; it's part of the warm-up process, which is just a classic way to save some time. So we've got that, and we've already downloaded the pre-trained PEFT model, so what happens when we build the QA model here? And that's it; that's how long it took to get the QA model set up with that specific adapter.

Now let's say we wanted to load a different PEFT model; would we have to hold a whole new copy of the model structure in memory to do that? Well, we've got another one (I've trained more than one of these things): let's call this qa_model_v2, build it with PeftModel.from_pretrained(model, peft_model_id), and load that sucker. Look how fast that loaded, and we're not taking up additional GPU RAM. This is the power of PEFT: it's so fast that we can load these models at inference time; it doesn't introduce a tremendous amount of latency. So I know what you're saying: "Hey, what if we had a different model, how about that?" Well, let's try that. This is the marketing email model, and this time it is downloading the actual adapter itself, that less-than-10-megabyte file, while we already have the base model loaded. So that is with the download from the Hugging Face Hub; think of how little inference latency is introduced here. Yes, over the course of tens of millions of calls every bit of inference latency adds up, of course, but this is the thing: we don't need a whole brand-new model to do this, we don't need so much extra headroom, we can just load our base model and then inject layers into it like a dream.

So let's see how this thing works. We're going to load our QA model, make sure it's loaded, and run a query through it; this is just a helper function that does that. We have the simple context "Cheese is the best food" and the simple query "What is the best food?" We go ahead and click the make-inference button, and we get that the answer is cheese: "cheese is the best food." All right, let's look at a different example: the context is still "Cheese is the best food" and the question is "How far away is the moon from the Earth?" We make inference: "Cannot find answer." Let's use a longer context; look at all this context, beautiful. Go ahead, click the button: "What distance does the moon orbit the Earth at?" and it correctly extracts the correct answer. And this is all done on less than 40 gigabytes of VRAM.
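A sketch of that push-and-reload flow, with a hypothetical Hub id and an illustrative make_inference helper standing in for the notebook's own; the prompt again has to match whatever template was used during training:

```python
import torch
from huggingface_hub import notebook_login
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

notebook_login()  # set your Hugging Face token so push_to_hub works

# Upload only the LoRA weights: well under 10 MB.
peft_model.push_to_hub("squad-bloom-3b")  # hypothetical repo name

# Later (or after a runtime restart): load the heavy base model once at
# application start-up, then attach the tiny adapter.
peft_model_id = "your-username/squad-bloom-3b"  # hypothetical Hub id
config = PeftConfig.from_pretrained(peft_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Attaching (or hot-swapping) an adapter is the cheap part.
qa_model = PeftModel.from_pretrained(base_model, peft_model_id)

def make_inference(context: str, question: str) -> str:
    # Illustrative helper; mirrors the training prompt up to the answer slot.
    prompt = f"CONTEXT:\n{context}\n\nQUESTION:\n{question}\n\nANSWER:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(qa_model.device)
    with torch.no_grad():
        output = qa_model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

print(make_inference("Cheese is the best food.", "What is the best food?"))
print(make_inference("Cheese is the best food.", "How far away is the moon from the Earth?"))
```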
It took a grand total of less than 30 minutes to do the whole process. That's the power of LoRA: we built a model that is much better at this task than the base model. The base model fails at this consistently and spectacularly, but we built a model that does it well-ish. Is it perfect? Absolutely not. Will it be perfect? Absolutely not. But it is fine, and we built it in no time flat. That's the power of LoRA; that's the power of being able to train these massive three-billion-parameter models in, you know, ten minutes on a single consumer-level GPU.

Again, one of the other powers is that if we get our marketing email model going, which is just from a FourthBrain workshop that I did before (you can check out the video on FourthBrain's channel), and we use "the Culinator, a personal cooling device to keep you from getting overheated on a hot summer's day" as our product description, we can go ahead and run that. It's going to take a second because this one has a lot more tokens to generate, and we get a fairly reasonable marketing email. And again, we haven't reloaded the base model at any point here; the base model, which is just our Bloom 3B model, hasn't been loaded again. These are just those PEFT injectable matrices that we're injecting, and this is what we get.

Suffice to say, this is a fantastic thing; it's a very powerful tool to fine-tune models with, and I suggest you give it a shot. Again, you can run this in Colab and fiddle around with it, and if you build anything cool, please let me know. We had someone in a workshop build a natural-language-to-SQL converter when Bloom isn't even trained on SQL; that's how powerful LoRA is. And again, not to belabor the point, but I really want you guys to focus on this number: the trainable percentage. Less than a tenth of a percent of the parameters are trainable when using LoRA, and we get results that are fantastic, and this is on a hundred steps with no validation. We just showed the model a bunch of examples, and those matrices learned that task very well.

So thank you very much, guys, I hope you have a wonderful day. Thanks so much for watching. If you like the video, please, you know, smash that like button, subscribe if you want. I'm sorry, I'm trying to do the YouTube thing, but I just like making these videos, they're fun. If it adds value to you, though, click the like button or whatever, and we will see you in the next one.
Info
Channel: Chris Alexiuk
Views: 21,237
Id: iYr1xZn26R8
Length: 27min 19sec (1639 seconds)
Published: Thu May 04 2023