Fine-tuning LLMs with PEFT and LoRA

Captions
So what's the problem with training large language models and fine-tuning them? The key thing is that we end up with really big weights, and that raises two main problems. One, you need a lot more compute to train, and as the models get larger and larger you find you need much bigger GPUs, or multiple GPUs, just to be able to fine-tune some of these models. The second problem is that, in addition to needing the compute, the file sizes become huge. The T5-XXL checkpoint is around 40 gigabytes in size, not to mention the 20-billion-parameter models coming out now, which are getting bigger all the time.

So this is where the idea of parameter-efficient fine-tuning comes in; I'm just going to call it PEFT going forward. PEFT uses a variety of different techniques. The one we're looking at today is LoRA, which stands for low-rank adaptation, and it comes from a paper all about doing this for large language models. But PEFT also has some other cool techniques like prefix tuning, P-tuning and prompt tuning, which we'll look at in the future, along with when to use them and how they can be really useful. Some of these techniques are actually being used by companies like NVIDIA to let people fine-tune these models in the cloud, so that's something really interesting to look at.

What PEFT does, and LoRA in particular, is let you fine-tune only a small number of extra weights in the model while you freeze most of the parameters of the pre-trained network. The idea is that we're not actually training the original weights; we're adding some extra weights, and those are what we fine-tune. One advantage of this is that we've still got the original weights, which also tends to help prevent catastrophic forgetting. If you don't know, catastrophic forgetting is where a model tends to forget what it was originally trained on if you fine-tune it too much; you end up causing it to forget some of the things from the original data. PEFT doesn't have that problem, because it just adds extra weights and tunes those while the originals stay frozen.

PEFT also lets you get really good fine-tuning when you've only got a small amount of data, and it tends to generalize better to other scenarios as well. All in all this is a huge win for fine-tuning large language models, and even models like Stable Diffusion; a lot of the AI art models we're seeing currently are starting to use this too. One of the best things is that you end up with tiny checkpoints. In one of my recent videos I showed fine-tuning the LLaMA model to create the Alpaca model, and the final checkpoint for just the add-on part was something around 12 megabytes. So it's tiny. Now, you still need the original weights, so it's not like you're getting away from everything, but what you're adding is much smaller. In general, the PEFT approaches let you get performance similar to fine-tuning the full model just by tuning these add-on weights that you put into it.
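To make the "freeze the original weights, train a few extra ones" idea concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. This is an illustration, not the PEFT library's internals; the rank, scaling and names are assumptions chosen for clarity.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: output = base(x) + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze the pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # small, trainable
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # small, trainable
        nn.init.zeros_(self.lora_B.weight)              # the update starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

Only `lora_A` and `lora_B` ever receive gradients, and only they need to be saved, which is why the resulting checkpoints are so small.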
Hugging Face has released a whole library around this, and that's where PEFT comes in: they've taken a number of papers and implemented them to work with the Transformers library and the Accelerate library. That allows us to take off-the-shelf Hugging Face pre-trained models, whether they were done by Google, by Meta, or by a variety of other companies, and fine-tune them really well. So we're going to jump into the code and look at how to use PEFT to do a LoRA fine-tune of a model.

All right, in this notebook we're going to go through fine-tuning a model using PEFT and bitsandbytes, and producing a LoRA checkpoint. So this is a LoRA fine-tune. If you remember, the idea with LoRA is that we're training adapters that go on at various points in the model; we're not training the actual weights, we're adding weights to the model and fine-tuning those to get the results we want.

You start by installing your libraries here. I always like to set up the Hugging Face Hub early, because if you're going to leave this running and it gets to the end of training, you want it to save your weights up to the Hub as quickly as possible so that your Colab doesn't stop and you lose all your work. So I tend to put this up front: click here, get your Hugging Face token — you'll need a write token, obviously, to do this.

This Colab I've run on an A100, but you should be able to do it with a T4 if you change the model to a smaller version of the BLOOM model. The model I'm fine-tuning here is the 7-billion-parameter BLOOM, and there are also smaller versions (560 million, 1.1 billion, and so on) that you could try out.

So we're loading in the model. From Transformers we're bringing in bitsandbytes support, which handles turning our model into 8-bit, meaning it won't take up so much GPU RAM — that makes it quicker and makes it easier to store things later on too. We've got our AutoTokenizer and this AutoModelForCausalLM. When we call from_pretrained we pass in the name of the BLOOM 7B model, and all we have to do is pass load_in_8bit=True, and Transformers will take care of the 8-bit conversion using the bitsandbytes library.

If you're using a GPU at home, where you've perhaps got a 3090 or something like that and you want to try it there — or if you've got multiple GPUs — you can pass a device map to spread parts of the model across them. In this case we're just using "auto", and I suggest you try "auto" to start with.

Anyway, so we've got our model in and we've got our tokenizer in. The next thing we want to do is go through and freeze the original weights. You can see we're basically just going through and freezing these weights, with the layer-norm layers kept in float32, and the outputs we also want to keep as float32.
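As a rough sketch, that load-and-prepare step usually looks something like the following; the exact model name and casting details are assumptions about the notebook rather than a copy of it.

```python
# pip install -q transformers accelerate bitsandbytes peft datasets
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "bigscience/bloom-7b1"   # try a smaller BLOOM checkpoint on a T4

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,    # bitsandbytes handles the 8-bit conversion
    device_map="auto",    # let Accelerate place the layers automatically
)

# Freeze the original weights; keep the 1-D (layer-norm) parameters in float32 for stability
for param in model.parameters():
    param.requires_grad = False
    if param.ndim == 1:
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()   # trade a little compute for a lot of memory
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    """Keep the logits in float32 even though the body of the model runs in 8-bit."""
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)
```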
That's fairly standard boilerplate for doing this. Next up is setting up the actual adapters, and this all comes down to the config. Remember, up here we've got our model, and this is the full-size model with no LoRA added to it yet. Here we make the config, then pass in the model we had and get back the PEFT model, which is the original model with the LoRA adapters on it. So the config here is key.

You're basically setting the rank of the LoRA update and the alpha scaling, and, if you know your model has certain target modules, which modules the adapters should attach to. I don't find a lot of documentation about this in the library at the moment, but my guess is that going forward people will work out which modules in each large model are the best ones to put LoRA adapters on. You also set your dropout for LoRA, and another key one is the task type: is it a causal language model, meaning a decoder-only, GPT-style model, or is it going to be a seq2seq model, more like the T5 models, the Flan models, et cetera? I'll perhaps make another video going through fine-tuning a seq2seq model so you can see the differences. By playing around with the rank and alpha settings you'll change the number of trainable parameters quite a lot.

So you can try out some different ideas here, but you'll see that while the model has about 7 billion parameters in total, the trainable parameters are tiny, really tiny. Printing the trainable parameters shows exactly what's going on there.

All right. In this case, for data, I've just picked a really simple little task. There's this dataset of English quotes. Rather than doing what most people seem to do, which is use it to finish a quote (so if someone starts a quote the model completes it), I noticed when looking at the dataset that there's actually a bunch of tags for each quote. What I thought would be cool is to make a model where you can input your own quote and it will generate tags for it. So what I've done is merge some of the columns, so we've got the quote and then three special characters after it. Those three characters are chosen because they're probably not going to appear in that order very often in the pre-training data. We're trying to teach the model that any time it sees those three characters, it should condition on the input before them and generate the tags after them.

Looking at the dataset we've made, we've got "Be yourself; everyone else is already taken" and then the tags: be-yourself, honesty, inspirational, misattributed-to-Oscar-Wilde, these kinds of things. Now, being able to predict whether a quote was misattributed to someone is probably not going to be easy for the model to learn, especially if you're making up the quotes. But certainly the keywords in the quote should show up in the tags — as you see here with things like "So many books, so little time": books, humor.
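Here is a sketch of what that adapter setup typically looks like with the peft library. The specific rank, alpha, dropout and target module below are illustrative guesses rather than the notebook's exact values; "query_key_value" is the usual name of BLOOM's fused attention projection.

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                               # rank of the low-rank update matrices
    lora_alpha=32,                      # scaling applied to the LoRA update
    target_modules=["query_key_value"], # which Linear layers get adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",              # decoder-only; use "SEQ_2_SEQ_LM" for T5-style models
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...
```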
That "so many books" one is a good one to try out — let me grab that and we'll try it later on.

So we've got the data there, and we just run it through the tokenizer to get the input IDs and attention masks. Now we want to set up our training. The training just uses the Hugging Face Transformers Trainer: we pass in the model, then the train dataset, and then we pass in the arguments. So let's go through some of the arguments. First we've got the gradient accumulation steps and the batch size — these are the things you would change if you're trying to run on a smaller GPU.

Here we're doing four examples per forward pass, and we do four of those before we calculate the gradients. Normally, if you were training this with a lot of GPUs, you would just do a batch size of 128 or a lot more — in the LLaMA paper they use batches of 4 million tokens, because they have so many GPUs. Unfortunately we don't have that budget. What I'm showing here is probably under-utilizing the A100 — we could make the batches bigger — but we're doing four examples at a time and accumulating the gradients for four steps, and that makes one effective batch. So it's the equivalent of a batch size of 16.

Next up we set the warmup steps. We don't want to start with the learning rate at its full value and shake everything around; we start with the learning rate extremely low and build up to the rate we've set, and that takes a certain number of steps.

Then we set the max steps. The max steps here I've set very small — this is more of a toy project to show you getting something running. We're using 16-bit floating point, we set that here, we've got the output directory where we're going to be checkpointing things, and then we just kick off the training.

You can see it tells us how long it's going to train; in this case it trains very quickly, but you might find for your particular task it trains a lot longer. And we can see over time that, sure enough, our loss is going down, so the model is starting to learn something. You could experiment with training this much more than I've done here.

The next part is sharing this to the Hugging Face Hub. You can see I've just put my Hugging Face Hub username, a slash, and then the model name I'm going to call it — a BLOOM-7B LoRA tagger is what I've called this one. I can put some info in for the commit message, and I can set it to be private or public; I'll make this checkpoint public afterwards so you can play with it. That will then upload it — and it only uploads the LoRA weights, not the full BLOOM model plus the LoRA weights. So you'll find on the Hugging Face Hub that this is a tiny, tiny file. We're talking megabytes here, not multiple gigabytes.
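A condensed sketch of that data preparation and training setup follows. The dataset id, separator string, hyperparameters and Hub repo name are placeholders I've filled in to make the example self-contained, not necessarily the exact values used in the notebook; `model` and `tokenizer` are the objects prepared above, and pushing to the Hub assumes you've already logged in with a write token.

```python
import transformers
from datasets import load_dataset

# A dataset of English quotes with tags; merge quote and tags behind an unusual separator
data = load_dataset("Abirate/english_quotes")

SEPARATOR = " ->: "   # placeholder for the "three characters" used in the video

def merge_columns(example):
    example["prediction"] = example["quote"] + SEPARATOR + str(example["tags"])
    return example

data = data.map(merge_columns)
data = data.map(lambda ex: tokenizer(ex["prediction"]), batched=True)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,   # four examples per forward pass
        gradient_accumulation_steps=4,   # accumulate four of those -> effective batch of 16
        warmup_steps=100,                # ramp the learning rate up gradually
        max_steps=200,                   # toy-sized run; raise this for real training
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False   # silences warnings during training; re-enable for inference
trainer.train()

# Push only the LoRA adapter weights (tens of megabytes) to the Hub
model.push_to_hub("your-username/bloom-7b1-lora-quote-tagger")
```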
In fact, you can see that this one comes out at around 31 megabytes once it's fully uploaded.

The next thing is inference. If you just want to do inference, this is how you bring it in: you load the adapter, and it puts together the weights you've trained and also brings in the full base model. It works out from the adapter config that it needs the BLOOM 7B model, brings that in along with its tokenizer, and goes off and downloads them. Then finally you're left with a model you can do some inference with. Here we pass in a quote followed by our magic three characters, and it predicts something.

Now, I haven't trained it that long, so it does seem to go into a loop — we could even look at putting an end-of-sequence token or something like that into the data as well. But we can see that for "The world is your oyster" it's worked out the keywords there: world and oyster. Let's see — I think I'll put in that one from before, "So many books, so little time." We could also change the max tokens and so on here. Okay: "So many books, so little time" generates books, reading, time, reading, writing, time, writing, and then it goes on again — you can see it's gone into repeat mode. Training it a lot more would probably help with that.

Let's put in something like "Training models with PEFT and LoRA is cool" and see what it picks out. You'll find that for some inputs it picks out keywords, but for others it picks out other things. It's interesting: it's got "training" and "teaching" here, but it hasn't really worked out PEFT and LoRA, which is to be expected. You can also see that some of its previous training is still in there — it looks like there are things related to training models that it's bouncing off. You'd want to train this for longer if you really wanted to use it as a model, but this gives you a good example of how to fine-tune a bigger causal language model with LoRA and then use it for something you particularly want. It's very easy to swap in your own dataset and put the whole thing together.

As always, if there are any questions, please put them in the comments. If you found this useful, please click like and subscribe, and feel free to let me know what videos you'd like to see going forward.

Bye for now.
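For reference, here is a minimal sketch of the inference step described above: loading the LoRA adapter back from the Hub, attaching it to the base model, and generating tags for a new quote. The repo name and separator are the same placeholders used earlier.

```python
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

peft_model_id = "your-username/bloom-7b1-lora-quote-tagger"   # placeholder adapter repo
config = PeftConfig.from_pretrained(peft_model_id)

# Load the full base model first, then attach the small LoRA adapter on top of it
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, peft_model_id)

prompt = "So many books, so little time. ->: "   # quote plus the magic separator
batch = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**batch, max_new_tokens=50)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```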
Info
Channel: Sam Witteveen
Views: 48,531
Keywords: hugging face, PEFT, LoRa, fine-tuning, lora finetuning, LLM
Id: Us5ZFp16PaU
Length: 15min 35sec (935 seconds)
Published: Mon Apr 24 2023