LoRA & QLoRA Explained In-Depth | Finetuning LLM Using PEFT Techniques

Video Statistics and Information

Captions
Hi guys, in this video we are going to talk about fine-tuning of large language models. Specifically, we are going to talk about PEFT, parameter-efficient fine-tuning, and two fine-tuning methods under it: one is LoRA and the other is QLoRA. We will also look at why fine-tuning matters and what its significance is.

We have the pre-trained model, or what we call a foundation model, and we can think of this foundation model as a GPT model. In this visualization you can see that the pre-trained model is fed a huge number of books, Wikipedia data, and a humongous amount of data from the internet. All of this data has been fed into a neural network, which has been pre-trained on it and has learned how to predict the next word. In this case, think of it as the plain GPT model, not the ChatGPT model; I'll show you how the ChatGPT model evolved and got introduced in 2022.

Now, the plain GPT model is not trained to converse with human beings. If you say "hello, hi", it would not be able to answer you with "hi, hello" back. Instruct fine-tuning has to be done on top of these pre-trained models so that they can interact with humans, and that's how ChatGPT came about. In this table you can see the GPT family: it all started in June 2018 with GPT-1, then GPT-2, GPT-3, and GPT-3.5, which was introduced in March 2022. If you look at the capability column, GPT-3.5 introduced the chat capability, which is ChatGPT, and we all know what happened in 2022 after its release.

Now that we have seen the instruct part, let's talk about the safety part. Safety fine-tuning is very much needed because a foundation model has no built-in mechanism that restricts toxic or unlawful outputs; that's where safety fine-tuning becomes important. Any open-source model, or really any model coming to the market, will have gone through safety fine-tuning so that no toxic or unlawful outputs get produced, and the instruct part will be there whenever the organization wants the model to interact with users. So that is the overall picture of fine-tuning for the instruct part and the safety part.

There are other aspects of fine-tuning as well. Say you have a very specific downstream task, for example summarization, sentiment analysis, or Q&A. For each of these tasks you can do a separate fine-tuning, and that fine-tuning can also be very domain-specific. Suppose you have built a Q&A application but you are not satisfied with the question-answering capability of the LLM. What you can do is fine-tune your large language model on a domain-specific dataset that only you have, because it is customized data; it is quite possible that the foundation model has never seen that particular data you have in, say, the medical domain, and that could be the reason your question-answering application is not giving satisfying answers.

Okay, so in this slide we see a very simplistic form of a neural network: we have inputs, then a weighted summation, then the activation function, and we get the output. The basic idea of a neural network is that the neurons carry weights.
All these weights get updated in each and every epoch, where an epoch is basically one forward and one backward propagation through the whole network. So when we create this network, we have the matrix on the right side, which is called the weight matrix, and this weight matrix gets updated in every epoch. I assume you already have this background, so I'm not going into the details of it.

Until now we were talking about GPT, because GPT was the better example for the sake of explanation, but from here on you can think of it as Mistral, an open-source large language model. Why? Because for GPT we don't have much information about the network, the fine-tuning, or the internal details, but for Mistral, or for the other models you can download onto your own device, we at least have the basic information and we know all the versions. Mistral offers a 7-billion-parameter model, and other open model families go up to around 70 billion parameters, plus various other versions, but for the sake of explainability let's assume we have only two models: one with 7 billion parameters and one with 70 billion parameters.

In the very basic example we had 12 weights, but here the 7-billion-parameter model has 7 billion weights and the 70-billion-parameter model has 70 billion weights. Just imagine if I asked you to do full-parameter fine-tuning: how much time would it take to run the epochs and finish the fine-tuning? In the case of only 12 weights, I would say it's very easy; it would not need much hardware, no GPU would be needed, and everything would be fine. But the moment I talk about 7 billion or 70 billion parameters, it's a huge number. That's where the limitations of full fine-tuning come into the picture: you need to update all 70 billion weights if we are talking about the 70-billion-parameter model, and if the model is bigger you have to handle even more weights. Hence you need more computational power, high RAM, longer training time, and only large GPUs or GPU clusters can perform such computations.

Here is the equation: if we take 7 billion parameters and multiply by 32 bits, then divide by 8 bits per byte, it comes to 28 GB. So 28 GB is the memory that should be available on the device just for the weights, and there is one more catch: roughly double that amount should be available, because you also have to hold the gradients of these weights. As you know, there are forward and backward propagations in each epoch, and gradients get generated in every epoch which you have to hold so you can compute the loss and the update. So it's not 28 GB; it's more like 28 plus 28, which is a huge amount of memory, and when we talk about 70 billion parameters it's on another level entirely. For an individual, having that kind of memory and doing full fine-tuning is close to impossible, which is why only the big organizations can do this kind of full-parameter tuning.
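To make the 28 GB figure concrete, here is a minimal sketch of the same arithmetic; the simple doubling for gradients is the rule of thumb used in the video, and real training would also add optimizer state on top of this.

```python
# Rough memory needed to hold 32-bit weights, plus the same again for gradients
# (the simplification used in the video; optimizer states are ignored here).

def weights_memory_gb(n_params: float, bits_per_param: int = 32) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

for n_params in (7e9, 70e9):
    weights = weights_memory_gb(n_params)
    with_gradients = 2 * weights
    print(f"{n_params / 1e9:.0f}B params: ~{weights:.0f} GB for weights, "
          f"~{with_gradients:.0f} GB including gradients")
```

Running this prints roughly 28 GB and 56 GB for the 7B model, and 280 GB and 560 GB for the 70B model, which is exactly the scale problem described above.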
Now let's see how LoRA is different from full-parameter fine-tuning. LoRA and QLoRA are two methods that fall under the PEFT techniques, and PEFT is nothing but parameter-efficient fine-tuning. Let's look at this visualization: on the left side we have the existing model weights. In the earlier slide we were talking about a very simplified neural network and assumed it had 12 weights; in the same way, I want you to assume that all of these model weights sit on the left side.

What we do in LoRA fine-tuning is create a new matrix for the updated LoRA fine-tuned weights. We know that in each and every epoch there will be forward and backward propagation and the weights will get updated, but that update is not added into the existing model weights; it gets added to, or saved in, a new matrix. Then we freeze the existing model weights, so we never touch that entire matrix, and we only save the updated weights in the new matrix.

You could ask me: we were already talking about a huge RAM requirement, and now it looks like we have doubled it, because it's not only the model weights, we also have the new weights, and on top of that we are calculating gradients, so an even larger amount of memory seems to be needed. But as per the LoRA paper, "Low-Rank Adaptation of Large Language Models", we do something clever: instead of storing the whole update matrix, we use matrix decomposition, and that's where the whole magic happens. In this example, we multiply a 5x1 matrix with a 1x5 matrix and get a 5x5 matrix; it's like multiplying two small matrices and getting a significantly bigger one. Think of it in our case: we were talking about 7 billion and 70 billion parameters, and with this matrix decomposition method we save a huge amount of memory.

When we go to rank two, as you can see, we have a 5x2 and a 2x5 matrix, but the end result is still 5x5, and the approximation is better, because rank two carries more information and hence the result is closer to the full update. One thing you have to understand is that the higher the rank, the more trainable parameters there are. For example, here rank one gives only about 167K trainable parameters for the 7B model, while for the 70-billion-parameter model it is actually training about 529K parameters; but let's stick to the 7B-parameter model. The higher the rank, the larger the number of trainable parameters, but compared to the total number of parameters, how big is 167K really? Looking at it as a percentage, at rank one we are not even training 0.01% of the parameters, and as we increase the rank we gradually increase the percentage of trainable parameters. At rank 512 we are updating about 86 million parameters, but as a percentage that is still just 1.22%, which is significantly less than the total number of parameters in the model. So guys, that was the crux of LoRA fine-tuning: with such a small number of trainable parameters, we can still get high-quality fine-tuned models.
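As a rough sketch of the rank-versus-trainable-parameters idea for a single weight matrix: the hidden size below is a made-up illustration, not Mistral's actual architecture, and the video's 167K / 86M figures come from summing the adapters over every adapted matrix in the whole model, so this toy example will not reproduce them exactly.

```python
import numpy as np

d = 4096                     # hypothetical hidden size of one square weight matrix W
full_params = d * d          # parameters in the frozen W

for r in (1, 2, 8, 64, 512):
    lora_params = d * r + r * d          # B has shape (d, r), A has shape (r, d)
    print(f"rank {r:>3}: {lora_params:>9,} trainable params "
          f"({100 * lora_params / full_params:.3f}% of this one matrix)")

# The low-rank product has the same shape as W, so W + (B @ A) is a valid update,
# even though only 2 * d * r numbers were actually trained.
B = np.random.randn(d, 2)
A = np.random.randn(2, d)
delta_W = B @ A
print(delta_W.shape)         # (4096, 4096)
```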
Now if you compare this with the disadvantages of full-parameter fine-tuning that we already discussed, you can see that we have significantly overcome those issues. We don't need to update all the billions of weights, we don't need as much RAM, we don't need as much time, and we don't necessarily need GPU clusters; it all depends on the use case. If we are dealing with a huge number of parameters we still need serious hardware, but you can see how much we have reduced the requirements.

QLoRA is basically LoRA over a quantized large language model; think of it as the large language model loaded into memory using a low-precision data type. In the earlier full-parameter examples, all the weights were in 32-bit format, and with LoRA we learned how matrix decomposition reduces the memory footprint. Now let's take this one step further: not only do we use LoRA and its matrix decomposition, but on top of that, the weights, biases, and activations we work with are stored in 4-bit format, not 32-bit, not 16-bit, but 4-bit. Think of how significantly that reduces the memory footprint; with this we can fine-tune on a single GPU, we don't need very powerful hardware, and even a consumer GPU can be enough. That is basically the crux of QLoRA.

Now let's talk about the implementation of LoRA and QLoRA and how the two differ. On the left side we have two networks, one for LoRA and the other for QLoRA. The basic difference, as I said, is the number of bits. In LoRA, the weights, the biases, and the activations are in 32-bit, 16-bit, or 8-bit. In QLoRA, all these values, be it weights, summations, or activations, are in 4-bit NormalFloat, or another low-bit format depending on which configuration you go with, but in general we talk about 4-bit NormalFloat.

Looking at the implementation, for LoRA we have this example where we use AutoModelForCausalLM to load the model by passing the model name. For QLoRA, we additionally create a BitsAndBytesConfig and set load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4" (NormalFloat 4), and bnb_4bit_compute_dtype=torch.bfloat16. We'll go through what double quantization and the NF4 quant type mean, but let's understand the implementation first. Once the BitsAndBytesConfig is ready, we pass it to AutoModelForCausalLM.from_pretrained, and that is where the model gets loaded in 4-bit. The rest of the implementation is almost the same: we set up the training arguments and then create the LoraConfig. So that is really the only difference between LoRA and QLoRA: with LoRA you load the larger model because the data types are wider, and with QLoRA you use the compressed data type, which is NF4, which is why the memory footprint is smaller.
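Putting that 4-bit loading step into code, here is a minimal sketch of a QLoRA-style load using the flags mentioned above; the checkpoint name is just a placeholder and any causal LM checkpoint would work.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-v0.1"   # placeholder checkpoint

# 4-bit NF4 quantization with double quantization, computing in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,   # this is what loads the weights in 4-bit
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

For plain LoRA you would simply drop the quantization_config argument and load the model in its native 32-bit or 16-bit precision.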
Apart from that, if you look at the whole implementation, it is the same: here too you create the LoraConfig, and you use this same piece of code for both implementations, be it LoRA or QLoRA.

Let me talk about some of the hyperparameters; this LoraConfig has a few, so let's go through them one by one. lora_alpha is 16: it controls the scaling of the LoRA update, so it effectively acts like a learning rate specific to the LoRA parameters. lora_dropout is nothing but a regularization technique to prevent overfitting by randomly setting a fraction of the outputs to zero during training. r=64: as we saw in the LoRA explanation where we talked about rank, r is nothing but the rank. It is the rank of the low-rank matrices in LoRA; a lower rank means fewer parameters and less computational cost, but potentially less capacity to learn complex patterns. Here it is 64, which is more than what we usually use; in most cases a smaller value is enough, and there are cases where I have used r=2 and it worked because the use case was very simple, but in this example there may be complex tasks and complex patterns, which is why we went for r=64. In this use case bias is "none", and task_type is "CAUSAL_LM"; this defines the type of task the model is being configured for, and causal LM indicates the model is set up for causal language modeling, where each token can only attend to the context on its left. Then we create the SFTTrainer, which is nothing but the supervised fine-tuning trainer; we are not going to go into much detail here because I'm going to come up with another video covering the implementation in more depth.

Now let's try to understand the terms we mentioned earlier: double quantization and NF4. The 4-bit NormalFloat format is basically a quantized form of higher-bit formats: 32-bit or 16-bit values get compressed into 4 bits. I would suggest going through the blog "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA"; it is written by the researchers who came up with the 4-bit NormalFloat format, and they also explain the different formats, for example float32, float16, float8 and its two variants. There you can also get an understanding of what quantization is and how the compression works on 32-bit or even 16-bit values and how they get squeezed into the very small 4-bit format. It's a very good, very informative blog, and I'd suggest you go through it for a better understanding of quantization.

So what is double quantization? It is nothing but quantizing the already quantized constants. For instance, if the first quantization step produces constants stored in 8 bits, you quantize those constants again into an even smaller representation, so you substantially reduce the required memory.
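Pulling together the LoraConfig hyperparameters and the SFTTrainer discussed above, here is a minimal sketch; the dropout value, output directory, batch size, and dataset are illustrative assumptions, and the exact SFTTrainer signature varies a bit between trl versions.

```python
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

# LoRA adapter configuration with the hyperparameters discussed above
lora_config = LoraConfig(
    lora_alpha=16,          # scaling applied to the adapter update
    lora_dropout=0.1,       # illustrative dropout value for regularization
    r=64,                   # rank of the low-rank matrices
    bias="none",
    task_type="CAUSAL_LM",  # causal LM: each token attends only to its left context
)

training_args = TrainingArguments(
    output_dir="./lora-finetune",    # illustrative path
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

# `model` is the (optionally 4-bit) model loaded earlier; `train_dataset` is
# assumed to be a prepared instruction-tuning dataset.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```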
During this whole fine-tuning process, GPU memory spikes come up on a regular basis, and to reduce these spikes we have paged optimizers. Paged optimizers are used to address these memory spikes so that they don't derail the fine-tuning run.

And here is the comparative table of when to use what. For GPU memory efficiency, go for QLoRA. For speed, go for LoRA, because LoRA accelerates the tuning process and is about 66% faster than QLoRA. For cost efficiency, go for LoRA. For a higher max sequence length, go for QLoRA. Accuracy improvement is almost the same, as both methods deliver comparable accuracy enhancements, though there is a slight information loss with QLoRA, so keep that in mind. And for a higher batch size, again QLoRA, because it supports significantly larger batch sizes.

So that was the video, where I have tried to pack in as much information as possible about LoRA and QLoRA fine-tuning. I hope you liked this video, and if you learned something and found it helpful, please like and subscribe to this channel. Thank you, guys, have a nice day.
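For reference, the paged optimizers mentioned above are usually selected by name through the training arguments; this is a sketch assuming the bitsandbytes paged AdamW variants are available in your transformers install, and the output path and batch size are illustrative.

```python
from transformers import TrainingArguments

# Paged AdamW keeps optimizer state in pageable memory, so sudden GPU memory
# spikes can spill to CPU RAM instead of causing out-of-memory failures.
training_args = TrainingArguments(
    output_dir="./qlora-finetune",     # illustrative path
    optim="paged_adamw_8bit",          # or "paged_adamw_32bit"
    per_device_train_batch_size=4,
)
```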
Info
Channel: Ayaansh Roy
Views: 432
Keywords: #AI, #MachineLearning, #LoRA, #QLoRA, #FineTuning, #Innovation, #llms, #AIIntegration, #Tutorial, #ArtificialIntelligence, #DeepLearning, #NeuralNetworks, #NaturalLanguageProcessing, #AIDevelopment, #ModelIntegration, #AIProjects, #AIApplications, #AIProgramming, #WebDevelopment, #AIInnovation, #RAG, #aiapplications, #SoftwareDevelopment, #mistral, #mistralofmilan, #gemma, #ModelOptimization, #AIRevolution
Id: mz0oQlu6xtc
Length: 22min 34sec (1354 seconds)
Published: Mon Apr 29 2024