So what's the problem with training large language models and fine-tuning them? The key thing is that we end up with really big weights, and that raises two main problems. One, you need a lot more compute to train, and as the models get larger and larger you find you need much bigger GPUs, or multiple GPUs, just to be able to fine-tune some of these models. The second problem is that, in addition to needing the compute, the file sizes become huge. The T5-XXL checkpoint is around 40 gigabytes in size, not to mention the 20-billion-parameter models that are coming out now, which keep getting bigger all the time.

This is where the idea of parameter-efficient fine-tuning comes in; I'm just going to call it PEFT going forward. PEFT uses a variety of different techniques. The one we're looking at today is LoRA, which stands for Low-Rank Adaptation, and it comes from a paper about doing exactly this for large language models. But PEFT also has some other cool techniques, like prefix tuning, P-tuning, and prompt tuning, which we'll look at in the future, along with when to use them and how they can be really useful. Some of these techniques are actually being used by companies like NVIDIA to let people fine-tune these models in the cloud, so that's something really interesting to look at.
What PEFT does, and LoRA in particular, is let you fine-tune only a small number of extra weights in the model while you freeze most of the parameters of the pre-trained network. The idea is that we're not actually training the original weights; we're adding some extra weights, and we fine-tune those. One advantage of this is that we still have the original weights, which also tends to help prevent catastrophic forgetting. If you don't know, catastrophic forgetting is where models forget what they were originally trained on: if you fine-tune too much, you end up causing the model to lose some of what it learned from its original training data. PEFT doesn't have that problem, because it just adds extra weights and tunes those while keeping the original ones frozen.

PEFT also lets you get really good fine-tuning results when you've only got a small amount of data, and it tends to generalize better to other scenarios as well.
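To make the low-rank idea concrete, here's a minimal sketch of what a LoRA update to a single linear layer looks like; the hidden size and rank below are made-up illustrative numbers, not defaults from the paper or the PEFT library:

```python
import torch

d, r = 1024, 8                        # hidden size and a small LoRA rank (illustrative)

W = torch.randn(d, d)                 # frozen pre-trained weight: d*d ~ 1M parameters
A = torch.nn.Parameter(0.01 * torch.randn(r, d))  # trainable, initialised near zero
B = torch.nn.Parameter(torch.zeros(d, r))         # trainable, initialised to zero

x = torch.randn(d)
y = W @ x + B @ (A @ x)               # original output plus the low-rank update
# (in practice LoRA also scales the update by alpha / r)

# Only A and B are trained: 2*d*r = 16,384 parameters instead of ~1M for this layer.
```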
All in all, this is a huge win for fine-tuning large language models, and even models like Stable Diffusion; a lot of the AI art models we're seeing at the moment are starting to use this as well. One of the best things is that you end up with just tiny checkpoints. In one of my recent videos I showed fine-tuning the LLaMA model to create the Alpaca model, and I think the final checkpoint for just the add-on part was something around 12 megabytes. So it's tiny. Now, you still need the original weights, so it's not like you're getting away from them entirely, but what you've produced is much smaller. In general, the PEFT approaches let you get similar performance to fine-tuning a full model just by training these add-on weights that you put into it.

Hugging Face has released a whole library around this, and that's where the name PEFT comes from: they've taken a number of papers and implemented them to work with the Transformers library and the Accelerate library. This lets us take off-the-shelf Hugging Face pre-trained models, made by Google, by Meta, by a variety of different companies, and fine-tune them really well. So let's jump into the code and look at how to use PEFT to do a LoRA fine-tune of a model.
All right, in this notebook we're going to go through training, or rather fine-tuning, a model using PEFT and bitsandbytes, and producing a LoRA checkpoint. So this is a LoRA fine-tune. If you remember, the idea with LoRA is that we're training adapters that sit on top of the model; we're not training the actual weights, we're adding weights at various points in the model and fine-tuning those to get our results.

You start by installing your libraries here. I always like to set up the Hugging Face Hub early, because if you leave this running and it gets to the end of training, you want it to save your weights up to the Hub as quickly as possible, so your Colab doesn't stop and you lose all your work. So I tend to put this at the front: click here and get your Hugging Face token. You'll obviously need a write token to do this.
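As a rough sketch of that setup cell (the exact package list and versions are my assumption, not necessarily what the notebook pins):

```python
# Install the libraries used in this walkthrough (choices are approximate)
!pip install -q bitsandbytes datasets accelerate
!pip install -q git+https://github.com/huggingface/peft.git transformers

# Log in early so the trained adapter can be pushed to the Hub as soon as training ends.
# You'll be prompted for a token; it needs write access.
from huggingface_hub import notebook_login
notebook_login()
```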
This Colab I've run on an A100, but you should certainly be able to do it on a T4 if you change the model to a smaller version of the BLOOM model. The model I'm fine-tuning here is the 7-billion-parameter BLOOM; there are also smaller versions, around 560 million and 1.1 or 1.7 billion parameters, that you could try out.

So we're loading in the model. You'll see that we're bringing in bitsandbytes, which handles turning our model into 8-bit. That means it won't take up so much GPU RAM, which makes things easier and quicker, and makes it easier to store things later on too. We've also got our AutoTokenizer and our AutoModelForCausalLM. When we call from_pretrained we pass in the name of the BLOOM 7B model, and all we have to add is load_in_8bit=True; Transformers will take care of the 8-bit conversion using the bitsandbytes library. If you're using a GPU at home, where you've perhaps got a 3090 or something like that, and you want to try it there, or if you've got multiple GPUs, you can pass a device map to spread parts of the model across devices. In this case we're just using "auto", and I suggest you try "auto" first anyway.
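A minimal version of that loading cell might look like this, assuming the bigscience/bloom-7b1 checkpoint (swap in a smaller BLOOM variant if you're on a T4):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"   # try a smaller BLOOM variant on smaller GPUs

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,    # bitsandbytes handles the 8-bit conversion
    device_map="auto",    # or a dict mapping layers to devices for multi-GPU setups
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```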
So we've got our model in and our tokenizer in. The next thing we want to do is go through and freeze the original weights. You can see here that we're just looping through and freezing those weights, with a few exceptions: the layer norm layers we want to keep, and keep in float32, and we also want the outputs to stay in float32. That's what this block is doing; it's fairly standard code for this.
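The freezing and casting block is roughly the standard int8-training prep you'll see in the PEFT examples; a sketch of it, which may differ slightly from the notebook:

```python
import torch

for param in model.parameters():
    param.requires_grad = False          # freeze the original weights
    if param.ndim == 1:
        # keep small 1-D params (e.g. layer norms) in fp32 for training stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()    # trade compute for memory
model.enable_input_require_grads()       # needed so gradients flow into the adapters

class CastOutputToFloat(torch.nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)   # keep the output head in fp32
```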
Next up is setting up the actual adapters, and this all comes down to the config. Remember, up here we've got the full-size model, but there's no LoRA added to it yet. Here we make the LoRA config, pass in the model we had, and get back the PEFT model, which wraps the original model with the LoRA adapters on it.

The config is the key part. You're setting r, the rank of the LoRA update matrices, and the alpha scaling. You can also set target modules if you know which modules of your model you want adapters on; I don't find a lot of documentation about this in the library at the moment, but my guess is that going forward people will work out which modules in each large model are the best ones to put LoRA adapters on. You set your dropout for LoRA, and another key one is the task type: is it a causal language model, meaning a decoder-only, GPT-style model, or is it going to be a seq2seq model, more like the T5 or FLAN models? I'll perhaps make another video going through fine-tuning a seq2seq model so you can see the differences.

By playing around with those settings, you can change the size of the trainable part quite a lot, so you can try out some different ideas here. But you'll see that while the full model has around 7 billion parameters, the trainable parameters are just tiny, really tiny; this gives us the total trainable parameters.
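That config-and-wrap step, sketched with the PEFT API; the specific values for r, alpha, and dropout here are just example settings, and targeting "query_key_value" is the usual choice for BLOOM's attention projection:

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                                # rank of the low-rank update matrices
    lora_alpha=32,                       # scaling applied to the LoRA update
    target_modules=["query_key_value"],  # which modules get adapters (BLOOM attention)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",               # decoder-only; use "SEQ_2_SEQ_LM" for T5-style models
)

model = get_peft_model(model, config)
model.print_trainable_parameters()       # a few million trainable vs ~7B total
```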
All right, now for the data. I've picked a really simple little task: there's a dataset of English quotes. What most people seem to do with it is use it to finish a quote, so that if someone starts a quote, the model can complete it. But looking at the dataset, I saw there are actually a bunch of tags attached to each quote, and I thought it would be cool to make a model where you can input your own quote and it will generate tags for that quote.

What I've done is merge some of the columns, so we've got the quote followed by these three characters. Those three characters are chosen because they're probably not going to appear in that order very often in the pre-training data, so we're teaching the model that any time it sees them, it should condition on the input before them and generate the tags after them. Looking at the dataset we've made, we've got "Be yourself; everyone else is already taken" and then the tags: be-yourself, honesty, inspirational, misattributed-to-oscar-wilde, those kinds of things. Now, some of these will be hard: predicting whether a quote was misattributed to someone is probably not something the model can learn to do, especially if you're making up the quotes. But certainly the keywords in the quote should show up as tags, like "So many books, so little time" giving books and humor. That's a good one to try out; let me take that one and we'll try it later on.

So we've got the data there, and we just run it through the tokenizer to get the input IDs and attention masks.
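Here's roughly what that data prep looks like; I'm assuming the Abirate/english_quotes dataset and using "->:" (with spaces around it) as the three-character separator, which may differ from the notebook's exact choice:

```python
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")

def merge_columns(example):
    # quote, then the "magic" separator, then the tags we want the model to generate
    example["prediction"] = example["quote"] + " ->: " + str(example["tags"])
    return example

data["train"] = data["train"].map(merge_columns)

# tokenize the merged text to get input_ids and attention_mask
data = data.map(lambda samples: tokenizer(samples["prediction"]), batched=True)
```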
Now we want to set up our training. The training just uses the Hugging Face Transformers Trainer: we pass in the model, we pass in the train dataset, and then we pass in the training arguments. Let's go through some of those arguments.

The first ones, the per-device batch size and the gradient accumulation steps, are the things you'd change if you're trying to run on a smaller GPU. Here we're doing four examples per forward pass, and then accumulating four of those before we calculate the gradients and take a step. Normally, if you were training this with a lot of GPUs, you'd just use a batch size of 128 or a lot more; in the LLaMA paper they use batch sizes of around 4 million tokens, because they're using so many GPUs. Unfortunately we don't have that budget. What I've set here is probably under-utilizing the A100, and we could make the batches bigger, but as it stands we do four examples at a time, accumulate the gradients for four steps, and that counts as one batch, so it's the equivalent of a batch size of 16.
Next we set the warmup steps. We don't want to go straight in at the full learning rate and shake everything around; we start with the learning rate extremely low and build up to the rate we've set, which takes a certain number of steps. Then we set the max steps, which I've kept very small here; this is more of a toy project to show you getting something working. We're using 16-bit floating point, which we set here, and we've got the output directory where checkpoints get written.
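Put together, the training setup looks something like this; the particular numbers (warmup, max steps, learning rate) are illustrative rather than anything definitive:

```python
import transformers

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,   # 4 examples per forward pass
        gradient_accumulation_steps=4,   # accumulate 4 steps -> effective batch of 16
        warmup_steps=100,                # ramp the learning rate up slowly
        max_steps=200,                   # short toy run; increase for real training
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False   # avoids warnings with gradient checkpointing; re-enable for inference
trainer.train()
```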
Then we just kick off our training. You can see it tells us how long it's going to train; in this case it trained very quickly, but you might find your particular run takes a lot longer. And we can see over time that, sure enough, our loss is going down, so the model is starting to learn something. You could experiment with training this a lot longer than I've done here.
The next part is sharing this to the Hugging Face Hub. You can see I've just put my Hugging Face Hub username, a slash, and then the name I'm giving the model; this one is the BLOOM 7B LoRA tagger. I can put some info in for the commit message, and I can set it to be private or public; I'll make this checkpoint public afterwards so you can play with it. That will then upload it, and it's only uploading the LoRA weights, not the full BLOOM model plus the LoRA weights, so you'll find on the Hugging Face Hub that this is a tiny file. We're talking megabytes here, not multiple gigabytes; in fact, you can see it's about 31 megabytes once it's fully uploaded.
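The push itself is a single call on the PEFT-wrapped model; the repo name below is a placeholder for your own username and model name:

```python
# Only the LoRA adapter weights get uploaded, not the 7B base model
model.push_to_hub(
    "your-username/bloom-7b1-lora-tagger",   # placeholder repo id
    commit_message="basic training run",
    private=True,                            # flip to public when you're ready to share
)
```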
The next thing is inference. If you just want to do inference, this is how you bring the model back in: you load the adapter, and it puts together the adapter you've trained along with the full base model. It works out from the config that it needs the BLOOM 7B model, brings that in along with its tokenizer, and goes off and downloads those.
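A sketch of that loading step with the PEFT classes (again, the repo id is a placeholder):

```python
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "your-username/bloom-7b1-lora-tagger"   # placeholder repo id
config = PeftConfig.from_pretrained(peft_model_id)

# the adapter config records which base model it was trained on top of
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# layer the small LoRA adapter weights on top of the frozen base model
model = PeftModel.from_pretrained(model, peft_model_id)
```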
Finally, you're left with inference itself. Here we pass in a quote followed by our magic three characters, and the model predicts the tags after them.
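A generation call for that might look like the following; max_new_tokens is just a value I've picked for illustration:

```python
import torch

prompt = "The world is your oyster ->: "
batch = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_tokens = model.generate(**batch, max_new_tokens=50)

print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
```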
Now, you can see that I haven't trained it for very long, so it does tend to go into a loop; we could look at putting an end-of-sequence token into the data as well. But we can see that for "The world is your oyster" it's worked out the keywords: world and oyster. Let's also put in the one from before, "So many books, so little time"; we could obviously change the max new tokens and so on here. Okay, for "So many books, so little time" it's generated books, reading, time, reading, writing, time, writing, and then kept going; you can see it's going into repeat mode, and this would probably improve if we trained a lot more. Let's put in something like "Training models with PEFT and LoRA is cool" and see what it picks out.
You'll find that for some inputs it will pick out keywords, but for others it picks out other things. It's interesting: it's got training and teaching here, but it hasn't really worked out PEFT and LoRA, which is to be expected, and you can see that some of its previous training is still in there, with things related to training models bouncing around. You'd want to train this for longer if you really wanted to use it as a model, but this gives you a good example of how to fine-tune a bigger causal language model with LoRA and then use it for something you particularly want. It's very easy to swap in your own dataset and put the whole thing together in here.

As always, if there are any questions, please put them in the comments. If you found this useful, please click like and subscribe, and feel free to let me know what you'd like to see in future videos. Bye for now.