Fine-Tune Large LLMs with QLoRA (Free Colab Tutorial)

Captions
In this video we're going to learn how to fine-tune a large language model using something called QLoRA. QLoRA is a fairly new technique that lets you fine-tune large language models even with limited compute. Before we jump into the QLoRA code, I'd like to show you a quick demonstration of what LoRA is. Take the blue colour as the pre-trained weights of a large language model. With traditional fine-tuning you would end up updating the entire set of weights, all of the dense layers of the neural network. What LoRA does instead is create a new set of weight matrices, also called update matrices, the orange ones on the right-hand side, while freezing the blue pre-trained weights. The output activations of the frozen pre-trained weights are augmented by the new low-rank adapter, those update matrices, and this is how you end up with a much smaller file after fine-tuning without giving up much performance or accuracy, all while using far less compute.

Now LoRA has taken a completely new turn with QLoRA. Thanks to QLoRA you can train a really large model, say 65 billion parameters, on a single 48 GB GPU while preserving full 16-bit fine-tuning task performance. I'm not going to get into the details of QLoRA, but one interesting fact is that the authors of QLoRA also released a new model called Guanaco. Guanaco outperforms all the previously released open-source models on the Vicuna benchmark (this was before the arrival of Falcon, so let's leave Falcon out of it), and it reaches 99 percent of the performance level of ChatGPT while only requiring 24 hours of fine-tuning on a single GPU. So bottom line, QLoRA is a really exciting space, and thanks to the bitsandbytes library we can fine-tune, load and run inference on models using QLoRA, and that is exactly what we are going to see in this video: how to fine-tune a large language model using QLoRA.

We're going to use two main libraries, Transformers and bitsandbytes, to do quantized LoRA, which is LoRA combined with a quantization technique, and we need to install bitsandbytes, transformers, peft, accelerate and datasets. Any model supported by accelerate should work with this entire workflow. I'm going to show the demo with GPT-NeoX, using a Google Colab notebook that the Hugging Face and bitsandbytes teams released as part of a blog post; the blog post has a lot of detail, and reading it alongside the notebook will help.

First you load the existing model: import torch, and from transformers bring in AutoTokenizer, AutoModelForCausalLM and BitsAndBytesConfig. Inside BitsAndBytesConfig you specify that you want to load the model in 4-bit, along with the other settings that let you load a 20-billion-parameter model, roughly 40 GB in half precision, on this hardware. When you see the quantization type NF4 (normalized float 4), you can go to the blog post and read more about what that data type is and how the NF4 configuration and quantization actually work.
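For reference, here is a minimal sketch of that loading step. It follows the 4-bit BitsAndBytesConfig API from Transformers and bitsandbytes; the specific flag values (double quantization, bfloat16 compute) are taken as assumptions from the accompanying blog notebook rather than read off the video:

```python
# Libraries mentioned in the video (install first, e.g. with pip):
#   pip install -q bitsandbytes transformers peft accelerate datasets

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"

# 4-bit quantization: store weights as NF4 (normalized float 4) with double
# quantization, and run compute in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # load the 20B model as 4-bit weights
    device_map="auto",               # let accelerate place it on the free-tier GPU
)
```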
Basically you are quantizing the model, reducing its size, and loading it at a lower precision onto a lower-memory machine. In this case I've got a free Google Colab instance with about 12 GB of system RAM, roughly 80 GB of disk space and about 15 GB of GPU memory, so the entire workflow, the fine-tuning and the inference, happens within those limits even though the model itself is about 40 GB in half precision.

Next you download the tokenizer and the model, and while downloading the model you pass in the quantization configuration we just specified, the BitsAndBytesConfig. At this point we have successfully downloaded the model; it takes a bit of time because it is a big model, released by EleutherAI, which was originally meant to compete with GPT-3, long before the introduction of LLaMA and the other recent models.

Once the model is loaded there is some pre-processing you need to do before training. From peft you import prepare_model_for_kbit_training, you enable gradient checkpointing on the model, and you wrap the model with prepare_model_for_kbit_training. You also define a small print_trainable_parameters helper that reports the model's parameter counts.

Now comes the main part, where you specify the LoRA configuration. The LoRA configuration is where you decide the size of the new update matrices you are going to fine-tune. If you remember the animation I showed at the start, the rank factor r defines the size of those matrices, and that has an impact on the compute you need, the memory you need and the size of the output files. Here we simply set r = 8, the dimension of the update matrices, along with lora_alpha. The target_modules setting differs between model families: I'm using GPT-NeoX, so the target module, the part of the dense layers you want to fine-tune, is the one shown here, but it will be different for a different model; if you were doing Stable Diffusion with LoRA, for example, this would be completely different, so check which modules apply to the model you are using. If you run this code with GPT-NeoX it will work without any issue. You also decide whether to update the bias, and what kind of task it is; this is a simple causal LM task, and you can check the PEFT documentation for LoraConfig, where all the supported task types are listed. Once you create the configuration, you wrap the quantized GPT-NeoX model we already loaded with this LoRA configuration.
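A hedged sketch of that preparation and LoRA configuration follows; lora_alpha=32, lora_dropout=0.05 and target_modules=["query_key_value"] (the attention projection in GPT-NeoX) are assumptions based on the blog notebook, and the sketch uses PEFT's built-in print_trainable_parameters() in place of the hand-written helper:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Reduce activation memory and prepare the frozen 4-bit weights for training.
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# r is the rank of the update matrices; target_modules is model-specific
# (other architectures, or Stable Diffusion, use different module names).
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap the quantized base model with the LoRA adapters and report how
# few parameters are actually trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```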
Then you print the trainable parameters, that is, how many parameters will actually be trained. If you compare the total number of parameters in the model with the trainable parameters, we are only training a tiny fraction of a percent of the total parameters of this large language model. If it were, say, a 60-billion-parameter model, you can work out what that same fraction would come to; that small slice is all we are going to train.

Now we get to the training part. To train, we're going to use a simple English quotes dataset: it's a very small dataset with only a train split, no validation or test, and it's just used here to generate quotes. We load the dataset, and data["train"] gives us the training split.

So what do we have now? The base model is ready, the LoRA configuration is ready, and the fine-tuning dataset is ready, so we're going to use the Transformers Trainer class to train the model. You import transformers, add whatever padding or end-of-sequence token the tokenizer needs for your model, and then in transformers.Trainer you pass the model, which in our case is the LoRA-wrapped GPT-NeoX, and the training dataset, data["train"]. If you were doing supervised fine-tuning (SFT) or instruction fine-tuning, that instruction dataset is what would go in here instead. You also pass basic training arguments: the learning rate and whatever else you want to set. Another important thing is the output directory, where you want the outputs to be stored. This matters because the LoRA adapter you are training gets saved there, so you don't have to look anywhere else; let's call it finetuned_model. Run the whole thing and the trainer starts. As training goes on you can watch the training loss, and at the end you can see how long it all took, in this case about 169 seconds of training runtime, along with the time per step and the loss at every step. It started at around 2.38 and ended up around 2.46, so at some point it started overfitting, and the best loss was somewhere around step eight; all the details are right there.
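Here is a sketch of that training setup. The dataset id "Abirate/english_quotes" and the hyperparameters (batch size, steps, learning rate, paged 8-bit AdamW) are assumptions carried over from the blog notebook, so treat them as a starting point rather than the exact values used in the video:

```python
import transformers
from datasets import load_dataset

# A tiny quotes dataset with only a "train" split.
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

# GPT-NeoX has no padding token by default, so reuse the EOS token.
tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,                        # the LoRA-wrapped, 4-bit GPT-NeoX
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="finetuned_model",   # the LoRA adapter checkpoints land here
        optim="paged_adamw_8bit",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the gradient-checkpointing warning during training
trainer.train()
```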
What you have completed at this point is the fine-tuning of a huge model, the 20-billion-parameter GPT-NeoX, roughly 40 GB in half precision, inside Google Colab, which has far less memory than that. The next thing is using the model, and there are multiple ways to do it. One is to push it to the Hugging Face model Hub, which I'm not showing here. If you want to save the model locally, you could already find the adapter in the output directory, but if that's awkward you can run save_pretrained to save it explicitly. When you save it, it is not like a regular PyTorch checkpoint: in the folder you will see files like adapter_config.json and adapter_model.bin. Those are the files you get when you save the adapter, and you then load them back using the same LoRA configuration you already have: take the base model plus the LoRA configuration and use PEFT to attach the saved adapter on top of the base model (see the sketch after these captions). Once you have that, you can use the model just like any other Transformers model: write a prompt, set the device, tokenize the prompt and send the tensors to that device (CUDA in this case), then call generate on the combined model, which is the base model plus the LoRA component, and since it is quantized, really the QLoRA component. Ultimately you are combining the two, and that's exactly what we do in this step: use the base model together with the fine-tuned LoRA component and print the generated output. Unfortunately in my testing I didn't get a great quote out of it, but the whole concept of training and fine-tuning works fine.

Now you can push it to the Hugging Face model Hub, just like Tim Dettmers, the creator of bitsandbytes, has done. If you go to Tim Dettmers' profile you will see a lot of QLoRA weights being shared there: QLoRA Alpaca 33B, QLoRA LongForm 65B, QLoRA FLAN 33B, and also the Guanaco model I mentioned earlier. You can try out all of those models, but if you still want to fine-tune a model on your own dataset, this Google Colab notebook should help you do that. I'll link the Colab notebook in the YouTube description, and to understand it in more detail you can go through the blog post, which explains the quantization we are doing, the LoRA we are doing, and how bitsandbytes helps integrate all of this with Transformers. Even if you don't read the blog post, just by watching this video you should be able to successfully fine-tune a really large language model that otherwise wouldn't fit on, say, an NVIDIA RTX 3080 or a similar consumer GPU, or even in free Colab memory. Now you can do that with QLoRA, thanks to this amazing innovation that the authors describe as democratizing large language model fine-tuning and inference. I hope this video was helpful; let me know in the comments what you think. See you in another video, happy prompting!
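To round off the captions, here is a hedged sketch of the save, reload and generate flow described above. It reuses the bnb_config and tokenizer from the earlier cells; PeftModel.from_pretrained is the standard PEFT call for attaching a saved adapter (the video describes wrapping the base model with the saved configuration), and the prompt string is just a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Saving writes only the adapter: adapter_config.json and adapter_model.bin,
# a few tens of megabytes rather than the 40 GB base model.
model.save_pretrained("finetuned_model")

# In a fresh session you would reload the quantized base model exactly as
# before, then attach the saved adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    quantization_config=bnb_config,
    device_map="auto",
)
lora_model = PeftModel.from_pretrained(base_model, "finetuned_model")
lora_model.config.use_cache = True  # re-enable the KV cache for generation

prompt = "Two things are infinite: "   # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = lora_model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```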
Info
Channel: 1littlecoder
Views: 50,028
Keywords: ai, machine learning, artificial intelligence
Id: NRVaRXDoI3g
Length: 14min 45sec (885 seconds)
Published: Sat May 27 2023