So what's the problem with training large language models and fine-tuning them? The key thing is that we end up with really big weights, and that raises two main problems. One, you need a lot more compute to train, and as the models get larger and larger you find you need much bigger GPUs, or multiple GPUs, just to be able to fine-tune some of these models. The second problem is that, in addition to needing the compute, the file sizes become huge. The T5-XXL checkpoint is around 40 gigabytes in size, not to mention the 20-billion-parameter models that are coming out now, which keep getting bigger all the time.

This is where the idea of parameter-efficient fine-tuning comes in; I'm just going to call it PEFT going forward. PEFT uses a variety of different techniques. The one we're looking at today is LoRA, which stands for Low-Rank Adaptation, and it comes from a paper about doing exactly this for large language models. But PEFT also has some other cool techniques, like prefix tuning, P-tuning, and prompt tuning, which we'll look at in the future, along with when to use them and how they can be really useful. Some of these techniques are actually being used by companies like NVIDIA to let people fine-tune these models in the cloud, so that's something really interesting to look at.
What PEFT does, and LoRA in particular, is let you fine-tune only a small number of extra weights in the model while you freeze most of the parameters of the pre-trained network. The idea is that we're not actually training the original weights; we're adding some extra weights, and we fine-tune those. One advantage of this is that we still have the original weights, which also tends to help prevent catastrophic forgetting. If you don't know, catastrophic forgetting is where models forget what they were originally trained on: if you fine-tune too much, you end up causing the model to lose some of what it learned from its original training data. PEFT doesn't have that problem, because it just adds extra weights and tunes those while keeping the original ones frozen.

PEFT also lets you get really good fine-tuning results when you've only got a small amount of data, and it tends to generalize better to other scenarios as well.
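To make the low-rank idea concrete, here's a minimal sketch of what a LoRA update to a single linear layer looks like; the hidden size and rank below are made-up illustrative numbers, not defaults from the paper or the PEFT library:

```python
import torch

d, r = 1024, 8                        # hidden size and a small LoRA rank (illustrative)

W = torch.randn(d, d)                 # frozen pre-trained weight: d*d ~ 1M parameters
A = torch.nn.Parameter(0.01 * torch.randn(r, d))  # trainable, initialised near zero
B = torch.nn.Parameter(torch.zeros(d, r))         # trainable, initialised to zero

x = torch.randn(d)
y = W @ x + B @ (A @ x)               # original output plus the low-rank update
# (in practice LoRA also scales the update by alpha / r)

# Only A and B are trained: 2*d*r = 16,384 parameters instead of ~1M for this layer.
```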
All in all, this is a huge win for fine-tuning large language models, and even models like Stable Diffusion; a lot of the AI art models we're seeing at the moment are starting to use this as well. One of the best things is that you end up with just tiny checkpoints. In one of my recent videos I showed fine-tuning the LLaMA model to create the Alpaca model, and I think the final checkpoint for just the add-on part was something around 12 megabytes. So it's tiny. Now, you still need the original weights, so it's not like you're getting away from them entirely, but what you've produced is much smaller. In general, the PEFT approaches let you get similar performance to fine-tuning a full model just by training these add-on weights that you put into it.

Hugging Face has released a whole library around this, and that's where the name PEFT comes from: they've taken a number of papers and implemented them to work with the Transformers library and the Accelerate library. This lets us take off-the-shelf Hugging Face pre-trained models, made by Google, by Meta, by a variety of different companies, and fine-tune them really well. So let's jump into the code and look at how to use PEFT to do a LoRA fine-tune of a model.
All right, in this notebook we're going to go through training, or rather fine-tuning, a model using PEFT and bitsandbytes, and producing a LoRA checkpoint. So this is a LoRA fine-tune. If you remember, the idea with LoRA is that we're training adapters that sit on top of the model; we're not training the actual weights, we're adding weights at various points in the model and fine-tuning those to get our results.

You start by installing your libraries here. I always like to set up the Hugging Face Hub early, because if you leave this running and it gets to the end of training, you want it to save your weights up to the Hub as quickly as possible, so your Colab doesn't stop and you lose all your work. So I tend to put this at the front: click here and get your Hugging Face token. You'll obviously need a write token to do this.
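As a rough sketch of that setup cell (the exact package list and versions are my assumption, not necessarily what the notebook pins):

```python
# Install the libraries used in this walkthrough (choices are approximate)
!pip install -q bitsandbytes datasets accelerate
!pip install -q git+https://github.com/huggingface/peft.git transformers

# Log in early so the trained adapter can be pushed to the Hub as soon as training ends.
# You'll be prompted for a token; it needs write access.
from huggingface_hub import notebook_login
notebook_login()
```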
This Colab I've run on an A100, but you should certainly be able to do it on a T4 if you change the model to a smaller version of the BLOOM model. The model I'm fine-tuning here is the 7-billion-parameter BLOOM; there are also smaller versions, around 560 million and 1.1 or 1.7 billion parameters, that you could try out.

So we're loading in the model. You'll see that we're bringing in bitsandbytes, which handles turning our model into 8-bit. That means it won't take up so much GPU RAM, which makes things easier and quicker, and makes it easier to store things later on too. We've also got our AutoTokenizer and our AutoModelForCausalLM. When we call from_pretrained we pass in the name of the BLOOM 7B model, and all we have to add is load_in_8bit=True; Transformers will take care of the 8-bit conversion using the bitsandbytes library. If you're using a GPU at home, where you've perhaps got a 3090 or something like that, and you want to try it there, or if you've got multiple GPUs, you can pass a device map to spread parts of the model across devices. In this case we're just using "auto", and I suggest you try "auto" first anyway.
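A minimal version of that loading cell might look like this, assuming the bigscience/bloom-7b1 checkpoint (swap in a smaller BLOOM variant if you're on a T4):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"   # try a smaller BLOOM variant on smaller GPUs

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,    # bitsandbytes handles the 8-bit conversion
    device_map="auto",    # or a dict mapping layers to devices for multi-GPU setups
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```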
So we've got our model in and our tokenizer in. The next thing we want to do is go through and freeze the original weights. You can see here that we're just looping through and freezing those weights, with a few exceptions: the layer norm layers we want to keep, and keep in float32, and we also want the outputs to stay in float32. That's what this block is doing; it's fairly standard code for this.
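The freezing and casting block is roughly the standard int8-training prep you'll see in the PEFT examples; a sketch of it, which may differ slightly from the notebook:

```python
import torch

for param in model.parameters():
    param.requires_grad = False          # freeze the original weights
    if param.ndim == 1:
        # keep small 1-D params (e.g. layer norms) in fp32 for training stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()    # trade compute for memory
model.enable_input_require_grads()       # needed so gradients flow into the adapters

class CastOutputToFloat(torch.nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)   # keep the output head in fp32
```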
Next up is setting up the actual adapters, and this all comes down to the config. Remember, up here we've got the full-size model, but there's no LoRA added to it yet. Here we make the LoRA config, pass in the model we had, and get back the PEFT model, which wraps the original model with the LoRA adapters on it.

The config is the key part. You're setting r, the rank of the LoRA update matrices, and the alpha scaling. You can also set target modules if you know which modules of your model you want adapters on; I don't find a lot of documentation about this in the library at the moment, but my guess is that going forward people will work out which modules in each large model are the best ones to put LoRA adapters on. You set your dropout for LoRA, and another key one is the task type: is it a causal language model, meaning a decoder-only, GPT-style model, or is it going to be a seq2seq model, more like the T5 or FLAN models? I'll perhaps make another video going through fine-tuning a seq2seq model so you can see the differences.

By playing around with those settings, you can change the size of the trainable part quite a lot, so you can try out some different ideas here. But you'll see that while the full model has around 7 billion parameters, the trainable parameters are just tiny, really tiny; this gives us the total trainable parameters.
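That config-and-wrap step, sketched with the PEFT API; the specific values for r, alpha, and dropout here are just example settings, and targeting "query_key_value" is the usual choice for BLOOM's attention projection:

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                                # rank of the low-rank update matrices
    lora_alpha=32,                       # scaling applied to the LoRA update
    target_modules=["query_key_value"],  # which modules get adapters (BLOOM attention)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",               # decoder-only; use "SEQ_2_SEQ_LM" for T5-style models
)

model = get_peft_model(model, config)
model.print_trainable_parameters()       # a few million trainable vs ~7B total
```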
All right, now for the data. I've picked a really simple little task: there's a dataset of English quotes. What most people seem to do with it is use it to finish a quote, so that if someone starts a quote, the model can complete it. But looking at the dataset, I saw there are actually a bunch of tags attached to each quote, and I thought it would be cool to make a model where you can input your own quote and it will generate tags for that quote.

What I've done is merge some of the columns, so we've got the quote followed by these three characters. Those three characters are chosen because they're probably not going to appear in that order very often in the pre-training data, so we're teaching the model that any time it sees them, it should condition on the input before them and generate the tags after them. Looking at the dataset we've made, we've got "Be yourself; everyone else is already taken" and then the tags: be-yourself, honesty, inspirational, misattributed-to-oscar-wilde, those kinds of things. Now, some of these will be hard: predicting whether a quote was misattributed to someone is probably not something the model can learn to do, especially if you're making up the quotes. But certainly the keywords in the quote should show up as tags, like "So many books, so little time" giving books and humor. That's a good one to try out; let me take that one and we'll try it later on.

So we've got the data there, and we just run it through the tokenizer to get the input IDs and attention masks.
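Here's roughly what that data prep looks like; I'm assuming the Abirate/english_quotes dataset and using "->:" (with spaces around it) as the three-character separator, which may differ from the notebook's exact choice:

```python
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")

def merge_columns(example):
    # quote, then the "magic" separator, then the tags we want the model to generate
    example["prediction"] = example["quote"] + " ->: " + str(example["tags"])
    return example

data["train"] = data["train"].map(merge_columns)

# tokenize the merged text to get input_ids and attention_mask
data = data.map(lambda samples: tokenizer(samples["prediction"]), batched=True)
```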
Now we want to set up our training. The training just uses the Hugging Face Transformers Trainer: we pass in the model, we pass in the train dataset, and then we pass in the training arguments. Let's go through some of those arguments.

The first ones, the per-device batch size and the gradient accumulation steps, are the things you'd change if you're trying to run on a smaller GPU. Here we're doing four examples per forward pass, and then accumulating four of those before we calculate the gradients and take a step. Normally, if you were training this with a lot of GPUs, you'd just use a batch size of 128 or a lot more; in the LLaMA paper they use batch sizes of around 4 million tokens, because they're using so many GPUs. Unfortunately we don't have that budget. What I've set here is probably under-utilizing the A100, and we could make the batches bigger, but as it stands we do four examples at a time, accumulate the gradients for four steps, and that counts as one batch, so it's the equivalent of a batch size of 16.
Next we set the warmup steps. We don't want to go straight in at the full learning rate and shake everything around; we start with the learning rate extremely low and build up to the rate we've set, which takes a certain number of steps. Then we set the max steps, which I've kept very small here; this is more of a toy project to show you getting something working. We're using 16-bit floating point, which we set here, and we've got the output directory where checkpoints get written.
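Put together, the training setup looks something like this; the particular numbers (warmup, max steps, learning rate) are illustrative rather than anything definitive:

```python
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,   # 4 examples per forward pass
        gradient_accumulation_steps=4,   # accumulate 4 steps -> effective batch of 16
        warmup_steps=100,                # ramp the learning rate up slowly
        max_steps=200,                   # short toy run; increase for real training
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False   # avoids warnings with gradient checkpointing; re-enable for inference
trainer.train()
```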
Then we just kick off our training. You can see it tells us how long it's going to train; in this case it trained very quickly, but you might find your particular run takes a lot longer. And we can see over time that, sure enough, our loss is going down, so the model is starting to learn something. You could experiment with training this a lot longer than I've done here.
The next part is sharing this to the Hugging Face Hub. You can see I've just put my Hugging Face Hub username, a slash, and then the name I'm giving the model; this one is the BLOOM 7B LoRA tagger. I can put some info in for the commit message, and I can set it to be private or public; I'll make this checkpoint public afterwards so you can play with it. That will then upload it, and it's only uploading the LoRA weights, not the full BLOOM model plus the LoRA weights, so you'll find on the Hugging Face Hub that this is a tiny file. We're talking megabytes here, not multiple gigabytes; in fact, you can see it's about 31 megabytes once it's fully uploaded.
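The push itself is a single call on the PEFT-wrapped model; the repo name below is a placeholder for your own username and model name:

```python
# Only the LoRA adapter weights get uploaded, not the 7B base model
model.push_to_hub(
    "your-username/bloom-7b1-lora-tagger",   # placeholder repo id
    commit_message="basic training run",
    private=True,                            # flip to public when you're ready to share
)
```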
The next thing is inference. If you just want to do inference, this is how you bring the model back in: you load the adapter, and it puts together the adapter you've trained along with the full base model. It works out from the config that it needs the BLOOM 7B model, brings that in along with its tokenizer, and goes off and downloads those.
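A sketch of that loading step with the PEFT classes (again, the repo id is a placeholder):

```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "your-username/bloom-7b1-lora-tagger"   # placeholder repo id
config = PeftConfig.from_pretrained(peft_model_id)

# the adapter config records which base model it was trained on top of
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# layer the small LoRA adapter weights on top of the frozen base model
model = PeftModel.from_pretrained(model, peft_model_id)
```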
Finally, you're left with inference itself. Here we pass in a quote followed by our magic three characters, and the model predicts the tags after them.
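A generation call for that might look like the following; max_new_tokens is just a value I've picked for illustration:

```python
import torch

prompt = "The world is your oyster ->: "
batch = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_tokens = model.generate(**batch, max_new_tokens=50)

print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
```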
Now, you can see that I haven't trained it for very long, so it does tend to go into a loop; we could look at putting an end-of-sequence token into the data as well. But we can see that for "The world is your oyster" it's worked out the keywords: world and oyster. Let's also put in the one from before, "So many books, so little time"; we could obviously change the max new tokens and so on here. Okay, for "So many books, so little time" it's generated books, reading, time, reading, writing, time, writing, and then kept going; you can see it's going into repeat mode, and this would probably improve if we trained a lot more. Let's put in something like "Training models with PEFT and LoRA is cool" and see what it picks out.
You'll find that for some inputs it will pick out keywords, but for others it picks out other things. It's interesting: it's got training and teaching here, but it hasn't really worked out PEFT and LoRA, which is to be expected, and you can see that some of its previous training is still in there, with things related to training models bouncing around. You'd want to train this for longer if you really wanted to use it as a model, but this gives you a good example of how to fine-tune a bigger causal language model with LoRA and then use it for something you particularly want. It's very easy to swap in your own dataset and put the whole thing together in here.

As always, if there are any questions, please put them in the comments. If you found this useful, please click like and subscribe, and feel free to let me know what you'd like to see in future videos. Bye for now.