Mistral 7B Fine-Tuning with PEFT and QLoRA

Captions
Hey everyone, in this video I will go through the code for fine-tuning Mistral 7B. First you need to install the relevant libraries, and these are all the usual ones: accelerate, bitsandbytes for applying QLoRA and quantizing the model so that you can fit the entire model on a single GPU and do the fine-tuning reasonably quickly, then the PEFT library of course, plus datasets and TRL.

The first code you see here is quite standard in many production-grade fine-tuning code bases and applies to most of the open-source models used for fine-tuning nowadays. It starts with the script arguments, where a bunch of parameters are defined within a single class, so let's quickly go through them.

per_device_train_batch_size: I have kept it at 4, and depending on the power of your GPU you may have to keep this number lower. When it is set to 4 and you have a single GPU, your total effective batch size for training is 4; if you have four GPUs, the total effective batch size is 4 multiplied by 4, that is 16. That is how it is calculated. per_device_eval_batch_size is exactly the same concept, just for evaluation.

gradient_accumulation_steps is the number of update steps to accumulate the gradients for before performing a backward/update pass.

local_rank: in the context of distributed training, a single training job is split across multiple processes, often spread across multiple machines (nodes) and multiple GPUs, and each process running a piece of the training task is assigned a unique rank to identify it. local_rank is specifically the rank of a process within a single node. To give you an example, imagine you have two nodes, each with four GPUs, for a total of eight GPUs. The global rank would be an index from 0 to 7 that uniquely identifies each process across both nodes, while the local rank would be an index from 0 to 3 on each node, identifying which GPU is being used on the current node. local_rank becomes particularly important in strategies like DistributedDataParallel (DDP) in PyTorch, which uses it to partition the mini-batch across the GPUs on a single node. Why does it default to -1? When local_rank is -1, it generally means the training is not being run in a distributed environment; the framework then defaults to using all available resources on the single machine you are running the job on.

Then you have learning_rate, which is pretty obvious, and the next one is max_grad_norm. When backpropagation is performed, each parameter's gradient is computed; these gradients indicate how much the loss will change if the parameters are adjusted by a small amount. In some cases, especially in recurrent neural networks and very deep networks, these gradients can become very large, leading to what is known as the exploding gradients problem. Gradient clipping is the technique to prevent gradients from becoming too large during training: if we consider the gradients as vectors, the idea is to scale them down whenever their L2 norm exceeds max_grad_norm. That is the number we are setting here, 0.3, which controls the level at which clipping kicks in. The idea is very simple: if the gradient gets too large, we rescale it to keep it small.
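To make the walkthrough concrete, here is a minimal sketch of the kind of ScriptArguments dataclass being described; the field names follow common Hugging Face fine-tuning scripts, but the exact fields and default values in the notebook may differ.

```python
# Illustrative sketch only - field names and defaults are assumptions, not the notebook's exact class.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ScriptArguments:
    per_device_train_batch_size: Optional[int] = field(default=4)
    per_device_eval_batch_size: Optional[int] = field(default=4)
    gradient_accumulation_steps: Optional[int] = field(default=4)
    local_rank: Optional[int] = field(default=-1)        # -1 => not running distributed
    learning_rate: Optional[float] = field(default=2e-4) # assumed value
    max_grad_norm: Optional[float] = field(default=0.3)  # gradient-clipping threshold
    weight_decay: Optional[float] = field(default=0.001)
    use_4bit: Optional[bool] = field(default=True)
    use_nested_quant: Optional[bool] = field(default=False)
    gradient_checkpointing: Optional[bool] = field(default=True)
    model_name: Optional[str] = field(default="mistralai/Mistral-7B-Instruct-v0.1")
```

In a standalone script these fields would be populated from command-line flags by HfArgumentParser (discussed next); in the notebook they are simply set by hand, e.g. script_args = ScriptArguments(learning_rate=2e-4).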
Moving forward, the parameters use_4bit, use_nested_quant, etc. feed into the bitsandbytes config that I will come to in a second, so let's skip them for now. There are some others as well, and some of them are quite obvious, for example gradient_checkpointing, which is a boolean parameter. I definitely suggest you quickly go through the Hugging Face documentation, because all of these parameters are very well documented there.

After defining the entire ScriptArguments class, to actually invoke it I could use these two lines: HfArgumentParser and parser.parse_args_into_dataclasses(). This applies if you are executing an actual Python file, which is the normal case when you run this code in production, but here I have kept these two lines commented out because this is a Jupyter notebook environment and the argument parser would not work correctly. Just so you know what they are: the HfArgumentParser class is provided by Hugging Face's Transformers library; it takes one or more dataclasses and turns their data attributes into command-line arguments of an argument parser object. That is its whole job. Since this is a Jupyter notebook, instead I am using another variable, script_args, and manually defining each parameter's value: script_args is an instance of the original class, with the numbers for each parameter given directly. Each time I want to use any of these values I just write, for example, script_args.weight_decay.

The crucial thing is that the model I am using here is Mistral-7B-Instruct-v0.1, the instruct version of the model, and the dataset I will be fine-tuning it on is python_code_instructions_18k_alpaca, a very well-known dataset. I will show you what it contains in a second.

So this is the dataset I am going to use for this fine-tuning. You can see there is an instruction column, an input column, an output column, and then the all-important prompt column; if you click on any row you can expand it. The prompt column holds all the relevant information I need, but I need to make some formatting changes to it, because Mistral 7B has a very specific prompt requirement. Let's go through that.
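For reference, a small sketch of loading this dataset and peeking at the prompt column; the Hub repo id shown is the commonly used copy of this dataset and is an assumption on my part.

```python
# Sketch: load the alpaca-style Python-code dataset and inspect one prompt.
# The repo id "iamtarun/python_code_instructions_18k_alpaca" is assumed here.
from datasets import load_dataset

dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")
print(dataset.column_names)          # expected: ['instruction', 'input', 'output', 'prompt']
print(dataset[0]["prompt"][:300])    # alpaca-style text with "### Instruction" / "### Output" markers
```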
For Mistral-7B-Instruct, the model I am going to use, it is recommended to use the following chat template. The prompt needs to start with <s>, then [INST], then your actual instruction, then the closing [/INST], then the model's answer, then </s>, and for a follow-up instruction another [INST] ... [/INST] pair. If you are familiar with large language models this is actually quite easy: each model has its own particular prompt requirement, which is very important both for inference and while you are doing fine-tuning. Here <s> and </s> are special tokens, the beginning-of-string (BOS) and end-of-string (EOS) tokens, while [INST] and [/INST] are just regular strings.
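As an illustration, this is roughly what a single training example looks like once wrapped in that template; the instruction is the one quoted from the dataset later in the walkthrough, and the answer shown is a made-up placeholder.

```python
# Hand-built single-turn example in the Mistral-7B-Instruct chat format.
# <s> / </s> are the BOS/EOS special tokens; [INST] / [/INST] are plain strings.
instruction = "Create a function to calculate the sum of a sequence of integers."
answer = "def sum_sequence(seq):\n    return sum(seq)"   # placeholder answer for illustration

formatted = f"<s>[INST] {instruction} [/INST] {answer}</s>"
print(formatted)
```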
My job now is to convert the original dataset, python_code_instructions_18k_alpaca, into this particular format, and for that I have the gen_batches_train util method. Let's quickly go through it. The initial parts are the normal ones: I load the dataset, take 10,000 total samples, and assign a validation percentage so that the train and validation splits are completely separate. Then the loop over the dataset starts; once the counter exceeds the train limit we stop, because this method only handles the training portion of the data. Keep in mind that the original dataset is in the alpaca format we just saw, and I work only with the prompt column, because it already contains everything I need.

Inside the loop I go through the data row by row. First I extract the original prompt by removing some fixed boilerplate substrings, replacing them with empty strings. Then instruction_start finds the starting index of the actual instruction: original_prompt.find locates where the substring "### Instruction" begins, and adding the length of that marker moves the start point past the marker, to the beginning of the actual instruction content. If you look at the original data format it becomes clear: the instruction header starts there, and what I am interested in is only the actual instruction text, for example "Create a function to calculate the sum of a sequence of integers." Next, instruction_end finds the end index of the instruction by locating where the "### Output" substring begins; this marks the boundary between the instruction content and the output (response) that follows in the training data. The actual instruction is then the substring between those two indices, and I apply strip() to remove any leading or trailing whitespace.

content_start identifies the starting index of the actual output part of the prompt, found by locating the "### Output" substring and moving past it using its length, exactly as was done for the instruction. Finally, content is the slice from content_start to the end of the original prompt, again with strip() to remove extraneous whitespace. Then new_text_format simply formats my instruction and content exactly as required by the Mistral 7B model, wrapping the instruction in the [INST] tags and appending the content after them.

Finally I yield the results. I also tokenize the text here by invoking the tokenizer, which is defined in a nearby cell, and then I yield the text and the new_text_format and increase the counter. The loop runs row by row over the whole training portion. I have the exact same util method for my validation batches, gen_batches_validation; there is absolutely no difference, it just produces the validation split.
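A condensed sketch of that parsing-and-reformatting logic; the marker strings and the generator structure are inferred from the description above, so treat it as illustrative rather than a verbatim copy of the notebook.

```python
# Sketch of gen_batches_train's core logic: pull the instruction and output out of the
# alpaca-style prompt and re-wrap them in the Mistral instruct template.
def extract_and_reformat(original_prompt: str) -> str:
    instruction_start = original_prompt.find("### Instruction") + len("### Instruction")
    instruction_end = original_prompt.find("### Output")
    instruction = original_prompt[instruction_start:instruction_end].strip(": \n")

    content_start = original_prompt.find("### Output") + len("### Output")
    content = original_prompt[content_start:].strip(": \n")

    # Mistral-7B-Instruct template: <s>[INST] instruction [/INST] answer</s>
    return f"<s>[INST] {instruction} [/INST] {content}</s>"

def gen_batches_train(dataset, train_limit=10_000):
    for counter, row in enumerate(dataset):
        if counter >= train_limit:
            break
        yield {"text": extract_and_reformat(row["prompt"])}
```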
Next is the util method create_and_prepare_model, which actually prepares my model with the LoRA config, the BitsAndBytesConfig, the PEFT config, etc. Let's quickly go through the individual parameters of the BitsAndBytesConfig.

The first one, load_in_4bit, loads the model in 4-bit precision, which means the model weights are represented using 4 bits instead of the usual 32. This significantly reduces the memory footprint of the model (for the weights, roughly 8x less memory than full 32-bit precision) and can also make training and inference noticeably faster; however, if you need the highest possible accuracy you may want to use a full-precision model.

The next one, bnb_4bit_use_double_quant=True, enables double quantization (also called nested quantization), which applies a second quantization after the initial one and saves an additional 0.4 bits per parameter. Then bnb_4bit_quant_type="nf4" specifies the type of 4-bit quantization to be used; NF4 stands for normalized float 4, which is the default quantization type. The last one, bnb_4bit_compute_dtype=torch.float16, determines the compute data type used during computation. This setting is needed because, while 4-bit bitsandbytes stores the weights in 4 bits, the computation still happens in 16 or 32 bits, and any option can be chosen by the developer: float16, bfloat16, float32, etc. Here I have chosen float16; matrix multiplication and training will be much faster with a 16-bit compute dtype.

A basic question you may ask is whether 4-bit quantization has any special hardware requirement. Note that this method is only compatible with GPUs, hence it is not possible to quantize models in 4-bit on a CPU; that is the only requirement. Among GPUs there are no particular hardware requirements, so any GPU can run 4-bit quantization as long as you have CUDA version 11.2 or higher installed. Keep in mind also that only the weights and activations are compressed to 4 bits; the computation itself is still done in the desired compute dtype.

Now let's quickly go through the various parameters of the LoraConfig. I have already discussed target_modules in detail and why I need it. The next hyperparameter is r, which represents the rank of the low-rank matrices learned during fine-tuning. As this value is increased, the number of parameters that need to be updated during low-rank adaptation also increases. Intuitively, a lower r value may lead to a quicker, less computationally intensive training run but may affect the quality of the resulting model; however, increasing r beyond a certain value may not yield any discernible improvement in model output quality.

The next one is lora_alpha. This parameter is used for scaling: according to the original LoRA research paper, the weight update Delta W is scaled by alpha divided by r, where alpha is a constant. When optimizing with Adam, tuning alpha is roughly the same as tuning the learning rate if the initialization was scaled appropriately. The reason is that the number of parameters increases with r, and as you increase r the magnitude of the entries in Delta W (the weight updates) also grows; we want Delta W to scale consistently with the pretrained weights no matter what r value is used. That is why the authors simply set alpha to the first r they tried and did not tune it. The default lora_alpha value is 8.

Then I have lora_dropout, the probability that each neuron's output is set to zero during training, used to prevent overfitting. Dropout is a general deep learning technique that reduces overfitting by randomly selecting neurons to ignore with a given probability during training: the contribution of those selected neurons to the activation of downstream neurons is temporarily removed on the forward pass, and weight updates are not applied to those neurons on the backward pass. The default value of lora_dropout is 0.

Then there is the bias parameter, which can take one of the values "none", "all", or "lora_only"; if "all" or "lora_only" is used, the corresponding biases will be updated during training, and the default value is "none". Lastly, there is task_type, which simply represents the type of task the model is being fine-tuned for; possible task types include causal LM, feature extraction, question answering, sequence-to-sequence LM, sequence classification, and token classification.
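Putting those two configs together, here is a compact sketch of what create_and_prepare_model builds; the target_modules list and the r/alpha/dropout values shown are illustrative assumptions, not necessarily the notebook's exact choices.

```python
# Sketch of the quantization + LoRA configuration described above.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_use_double_quant=True,        # nested quantization, ~0.4 bits/param saved
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_compute_dtype=torch.float16,  # compute still runs in 16-bit
)

peft_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices (assumed value)
    lora_alpha=16,                         # scaling factor alpha (assumed value)
    lora_dropout=0.05,                     # dropout on the LoRA layers (assumed value)
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
```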
So create_and_prepare_model finally builds all the quantization pieces, the bitsandbytes config and the PEFT config, and from this util method I return the model, the peft_config, and the tokenizer; all three are used in the next cell, where I define the training arguments.

For the training arguments, almost all of the parameters come from the script_args we defined at the very top: they are part of the ScriptArguments class, and we defined the script_args variable by manually specifying all the values, which we have already discussed, so let's move on. After the training arguments are defined, I invoke create_and_prepare_model, which gives me the model, the peft_config, and the tokenizer.

Let's quickly check out the model. Once I print it, this is the model structure we are fine-tuning, and you can see the self-attention layers: these contain the target modules we specified in the LoRA config, the q_proj, k_proj, v_proj projections and so on.

Before actually starting the training I obviously have to define train_gen and val_gen, the training dataset and the validation dataset, by invoking the gen_batches_train and gen_batches_validation methods we defined earlier. Both of these methods, as already discussed in detail, simply pre-process the raw data into the particular format Mistral 7B requires.

With that we are almost done; this is pretty much the last step before actually running the training. There is one particular tokenizer setting you have to apply; I got it from other repos, where people have faced issues with fp16 training, because here we are not training in full fp32 precision but in fp16, and many people have used this tokenizer configuration to make the training run correctly. Then the trainer: I simply instantiate my SFTTrainer with all the parameters defined, and the parameters are quite simple; the model is the model, the train dataset is my train_gen, and the peft_config comes from the create_and_prepare_model call we just executed. With that the trainer is defined with all its parameters, and then you simply run trainer.train().
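To tie the pieces together, here is a hedged end-to-end sketch of this final step. It reuses the bnb_config, peft_config, script_args, and generator functions sketched earlier, and the exact SFTTrainer keyword arguments vary between trl versions, so treat the argument names and values as assumptions rather than the notebook's exact code.

```python
# End-to-end sketch: quantized model, generator-backed datasets, and the SFTTrainer run.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token       # common fp16 fix: give the tokenizer a pad token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

# Materialize the train/validation generators defined earlier into datasets.
train_gen = Dataset.from_generator(lambda: gen_batches_train(dataset))
val_gen = Dataset.from_generator(lambda: gen_batches_validation(dataset))

training_args = TrainingArguments(
    output_dir="mistral7b-python-code-sft",
    per_device_train_batch_size=script_args.per_device_train_batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    learning_rate=script_args.learning_rate,
    max_grad_norm=script_args.max_grad_norm,
    fp16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_gen,
    eval_dataset=val_gen,
    peft_config=peft_config,
    dataset_text_field="text",   # field yielded by the generators in the sketch above
    tokenizer=tokenizer,
)
trainer.train()
```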
Training may take a couple of hours depending on the power of your GPU. And that's pretty much it: a fairly production-grade setup for fine-tuning Mistral 7B. Thank you very much for watching, and do stay tuned, because I plan to do quite a few LLM- and fine-tuning-related videos over the coming weeks. See you in the next one.
Info
Channel: Rohan-Paul-AI
Views: 3,246
Keywords: machine learning tutorial, machine learning algorithms, machine learning projects, Deep Learning, natural language processing, hugging face, Deep Learning for NLP, huggingface, large language models, gpt 4, llm, langchain, langchain tutorial, large language models tutorial, open source llm models for commercial use, open source llm, generative AI, Langchain, llm training, large language models explained, large language model tutorial, fine tuning language model
Id: 6DGYj1EEWOw
Length: 24min 8sec (1448 seconds)
Published: Sun Nov 26 2023