Efficient Fine-Tuning for Llama 2 on Custom Dataset with QLoRA on a Single GPU in Google Colab

Captions
In this tutorial we will learn how to fine-tune the Llama 2 model on a custom dataset. I have divided the tutorial into four parts: in part one we create an instruction dataset; in part two we fine-tune the Llama 2 model on that custom data; in part three we evaluate the fine-tuned model; and in part four we push the fine-tuned model to the Hugging Face Hub.

Let me give you a quick demo of how we will create the instruction dataset. Here is the Google Colab notebook. First we load the tokenizer from Hugging Face and convert the raw text into tokens, then we download an embedding model from Hugging Face and convert the text into vectors. We will also see how to apply top-k sampling, and I will explain near-deduplication. Finally, we will map our dataset onto the chat template used by the Llama 2 model. Next we have the Colab notebook for fine-tuning Llama 2: we load the dataset we created in the first part, fine-tune the Llama 2 model on it, set the various configurations and training arguments, set the supervised fine-tuning parameters, train the model, evaluate it, and push it to the Hugging Face Hub, as you can see here. That is everything we will do in this tutorial, so let's get started.

In this part we will learn how to build an instruction dataset to fine-tune the Llama 2 model. There are different types of datasets that can be used to fine-tune large language models, including instruction datasets, raw completions, and preference datasets. In this notebook we will modify an existing instruction dataset, because we will fine-tune Llama 2 using supervised fine-tuning, and supervised fine-tuning requires an instruction dataset. There are two main options for fine-tuning Llama 2: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF); supervised fine-tuning is the usual choice. We also have two options for the data: create our own instruction dataset or modify an existing one. I will go with option two and modify an existing instruction dataset. So we start from the pre-trained Llama 2 base model and fine-tune it with supervised fine-tuning on an instruction
dataset. As I mentioned, we will modify an existing instruction dataset: the Open-Platypus dataset available on Hugging Face. It is a combination of several datasets, including PRM800K, ScienceQA, SciBench, ReClor, TheoremQA, OpenBookQA, and ARB. It is quite a large dataset, so we will only take a sample from it.

In the first step we install the required libraries: the datasets library, so we can load the dataset from Hugging Face into this Colab notebook; the transformers library, so we can import AutoTokenizer and convert the raw text into tokens; the sentence-transformers library, so we can download an embedding model from Hugging Face and convert our text into vectors; and faiss-gpu, which gives us FAISS as our vector database. FAISS is not the most powerful option, but it is very easy to use.

Next we import os and set the Hugging Face token. To get your token, log in to Hugging Face and go to Settings and then Access Tokens. A read token is enough to load datasets, embedding models, or tokenizers into the Colab notebook, but if you want to push a model or a dataset to the Hub you need a write token. Since I am only loading the dataset right now, a read token is fine, so I copy it and pass it here.

Now we load the Open-Platypus dataset from Hugging Face into the notebook; this takes a few seconds. To inspect it, we can convert it to a pandas DataFrame with .to_pandas(). We have an instruction column and an output column: the outputs are the responses, and the instructions are the inputs on which we will fine-tune the Llama 2 model. A minimal sketch of this setup follows below.

To analyze the dataset, we first import the required libraries: AutoTokenizer from transformers so we can convert the raw text into tokens, plus matplotlib and seaborn so we can plot the
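For reference, here is a minimal sketch of this setup. The dataset id garage-bAInd/Open-Platypus and the placeholder token are assumptions based on the narration; run the installs in their own Colab cell first.

```python
# Colab installs (run in their own cell):
#   !pip install datasets transformers sentence-transformers faiss-gpu

import os
from datasets import load_dataset

# Hugging Face read token (placeholder; paste your own).
os.environ["HF_TOKEN"] = "hf_..."

# Open-Platypus instruction dataset mentioned above.
dataset = load_dataset("garage-bAInd/Open-Platypus", split="train")
print(dataset)                      # row count and column names
print(dataset.to_pandas().head())   # instruction / output columns
```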
distribution of the tokens, that is, the distribution of the data. Now I load the tokenizer for Llama 2. I am not using the official Llama 2 model from Meta, because accessing it through Hugging Face requires extra access approval, so instead I am using the Llama 2 model with 7 billion parameters published by NousResearch. Running this cell downloads the Llama 2 tokenizer.

Our dataset has two columns of interest, instruction and output, and we will keep working with these columns as we go ahead. Next I tokenize each row of the instruction and output columns and count the number of tokens in each row. For the instruction column the first rows have, for example, 85, 53, 75, and 86 tokens, and the list of instruction token counts has length 24,926, the same as the number of rows in the dataset. We do the same for the output column: 223 tokens in the first row, 105 in the second, 193 in the third, and again 24,926 counts in total.

After counting tokens for the instruction and output columns separately, I combine them per row: 223 + 85 gives 308 for the first row, 53 + 105 gives 158 for the second, 193 + 75 gives 268 for the third, and the list of combined counts also has length 24,926. Then we plot a histogram with matplotlib so we can see the distribution of the
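A sketch of this token-counting step, assuming the NousResearch/Llama-2-7b-chat-hf tokenizer and the instruction/output column names from Open-Platypus:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoTokenizer

# Llama 2 tokenizer from NousResearch (no gated access needed).
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")

def count_tokens(texts):
    # Number of tokens the tokenizer produces for each row.
    return [len(tokenizer.encode(t)) for t in texts]

instruction_counts = count_tokens(dataset["instruction"])
output_counts = count_tokens(dataset["output"])

# Combined count per row: instruction tokens + output tokens.
combined_counts = [i + o for i, o in zip(instruction_counts, output_counts)]
print(instruction_counts[:4], len(combined_counts))

# Distribution of the combined token counts.
sns.histplot(combined_counts, bins=50)
plt.xlabel("tokens per row (instruction + output)")
plt.show()
```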
token counts. Looking at the histogram of the combined instruction-plus-output counts, the mean is around 500 tokens, but the tail is very long and goes up to about 5,000 tokens; the distributions for the individual instruction and output columns look similar.

You might be wondering why we need to know the number of tokens in the instruction column, the output column, and the two combined. The reason is that every large language model has an input context limit: you can only pass that many tokens to the model. For Llama 2 the input context limit is 4,096 tokens, while some of our rows go up to around 5,000 tokens. If a sample exceeds the model's input token limit it is basically not useful, so we need to know how many tokens each sample in our dataset has.

We will therefore filter out all rows with more than 2,048 tokens in the combined count, that is, the instruction token count plus the output token count. For example, the first row has 85 instruction tokens and 223 output tokens, which combine to 308; if the combined count of a row is greater than 2,048 we exclude that row from the dataset. Some rows come close to the limit, for example around 1,990 tokens, but they are still kept. In the code, we keep only the rows with at most 2,048 combined tokens and count how many valid rows remain; a sketch of this filtering step follows below.
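A sketch of the filtering step under the same assumptions, using the 2,048-token threshold stated above:

```python
MAX_TOKENS = 2048

# Indices of rows whose combined token count stays within the limit.
valid_indices = [i for i, c in enumerate(combined_counts) if c <= MAX_TOKENS]
print(f"keeping {len(valid_indices)} of {len(dataset)} rows")

# Keep only the valid rows.
dataset = dataset.select(valid_indices)
```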
Out of the 24,926 rows in the dataset, 24,895 have at most 2,048 combined tokens, so we only need to remove 31 rows. We find the indices of the rows with more than 2,048 tokens and keep only the valid rows based on their indices. Plotting the distribution of the filtered data, the token counts now start at zero and go up to a maximum of 2,048. Viewing the data as a pandas DataFrame again: the original dataset had 24,926 rows, and after excluding the 31 rows with a combined token count greater than 2,048 we are left with 24,895 rows.

For the next step, in this notebook I am using the GTE-Large embedding model. You can compare embedding models on the leaderboard on Hugging Face, where you can find all the details; GTE-Large is not the best model there, but it is fast, which is why I am using it. I use the sentence-transformers library to download the embedding model, FAISS as our vector database, and tqdm.autonotebook so we get a nice progress bar. I define a function that takes the embedding model name as an argument, and when I call it I pass the name of the GTE-Large model.

Now we embed every row of the output column of the dataset, that is, we convert the text of each output into a vector. The reason is near-deduplication: if two embeddings are at least 95% similar, we treat them as duplicates and keep only one of them. So we load the embedding model, convert the outputs into embeddings, build an index using FAISS as our vector database, normalize the embeddings, and then filter duplicates using a similarity threshold of 0.95: if one embedding is 95% similar to another, we remove one of the pair and keep only one. A code sketch of this near-deduplication step follows after the next paragraph.
This is the embedding model I am using; you can copy its name from Hugging Face and pass it here to use it yourself. (I had not run the earlier cells yet, so I run them now.) The near-deduplication step removes embeddings that are very similar to each other and keeps only one of them. In the code, we build a list called to_keep that holds the indices of all the samples we want to keep, and then use the dataset's select method to keep only those rows; every row whose index is not in to_keep is excluded. This is a fairly slow process, and downloading the embedding model also takes some time, so we wait for it to finish.

After the near-deduplication, the dataset size is reduced to 18,168 rows: previously, after the token filtering, we had 24,895 rows, so we have removed 6,727 near-duplicate rows.

Next we tokenize each row of the instruction and output columns of the deduplicated dataset again and count the tokens, following the same process as before (85 and 223 tokens in the first row, 308 combined), and the counts now have length 18,168, the length of the updated dataset. A minimal sketch of the near-deduplication step is shown below.
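Here is a minimal sketch of the near-deduplication step. It assumes the thenlper/gte-large model id for GTE-Large, and it uses a FAISS inner-product search over normalized embeddings with a fixed number of neighbours per row as a stand-in for the exact logic used in the video.

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-large")

# Embed every row of the "output" column and L2-normalise, so the
# inner-product search below is equivalent to cosine similarity.
embeddings = embedder.encode(dataset["output"], show_progress_bar=True)
embeddings = np.asarray(embeddings, dtype="float32")
faiss.normalize_L2(embeddings)

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

threshold = 0.95   # two outputs above this similarity count as duplicates
to_keep, seen = [], set()
for i, emb in enumerate(embeddings):
    if i in seen:
        continue
    to_keep.append(i)
    # Mark the near neighbours of row i as duplicates to be dropped.
    scores, neighbours = index.search(emb.reshape(1, -1), 10)
    for score, j in zip(scores[0], neighbours[0]):
        if j != i and score >= threshold:
            seen.add(int(j))

deduped_dataset = dataset.select(to_keep)
print(len(dataset), "->", len(deduped_dataset), "rows after near-deduplication")
```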
The next step is top-k sampling. For fine-tuning the Llama 2 model we only care about two columns, instruction and output; we skip the data source and input columns. My dataset now has 18,168 rows, and I will take the 1,000 rows with the most tokens: I sort the rows by combined token count in descending order and keep the top k. Earlier I was separating 500 rows, but you can vary the k value; here I keep k = 1,000. After that we again look at the instruction, output, and combined token counts of the updated dataset, which now contains just the instruction and output columns.

Llama 2 also has a chat variant that expects a specific prompt template, so I update the instruction column of my dataset to follow that chat template, and then push the dataset to the Hugging Face Hub. At the start of the notebook I passed a read token; to push the dataset I need a write token, so I go back to Hugging Face, open Settings and then Access Tokens, copy the write token, and pass it here. You can give the dataset repository any name you like. A sketch of the top-k sampling, chat-template mapping, and push step is shown below. That is how you create the dataset; in the next part we will fine-tune the Llama 2 model on it.
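A sketch of the top-k selection, chat-template wrapping, and dataset push. The k value of 1,000 follows the narration; the exact template string and the repository name your-username/llama2-platypus-1k are assumptions.

```python
K = 1000

# Combined token count per row of the deduplicated dataset.
counts = [
    len(tokenizer.encode(ins)) + len(tokenizer.encode(out))
    for ins, out in zip(deduped_dataset["instruction"], deduped_dataset["output"])
]

# Indices of the K rows with the most tokens (descending order).
top_indices = sorted(range(len(counts)), key=counts.__getitem__, reverse=True)[:K]
top_dataset = deduped_dataset.select(top_indices)

# Keep only the instruction and output columns.
drop = [c for c in top_dataset.column_names if c not in ("instruction", "output")]
top_dataset = top_dataset.remove_columns(drop)

# Wrap each instruction in the Llama 2 chat prompt template.
def add_chat_template(example):
    example["instruction"] = f"<s>[INST] {example['instruction']} [/INST]"
    return example

top_dataset = top_dataset.map(add_chat_template)

# Push to the Hub (requires a token with write access).
top_dataset.push_to_hub("your-username/llama2-platypus-1k")
```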
Now we will see how to fine-tune the Llama 2 model. There are two main fine-tuning techniques: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Supervised fine-tuning is what is usually followed, and in this notebook we will also use supervised fine-tuning to train Llama 2 on the dataset we created in the previous part. What is supervised fine-tuning? The model is trained, or fine-tuned, on a dataset of instructions and responses; if you remember our dataset, we have an instruction column and an output column, where the outputs are the responses. During supervised fine-tuning the weights of the LLM are adjusted to minimize the difference between the generated answers and the ground-truth responses, which act as labels.

We start by installing the libraries: transformers, datasets, accelerate, peft (parameter-efficient fine-tuning), trl, and bitsandbytes, which we use for quantization. The transformers library lets us import AutoTokenizer so we can load the Llama 2 tokenizer; after fine-tuning, when we push the model to the Hub, we also need to push the tokenizer. The datasets library lets us load the dataset from Hugging Face, and accelerate speeds up the whole process. The peft library gives us the LoRA configuration: with LoRA, instead of training all the weights, we add adapters in some layers and only train those added weights, which significantly reduces the cost of training and therefore the compute and memory requirements. The trl library is a wrapper from which we can import SFTTrainer for supervised fine-tuning, or an RLHF trainer if we were fine-tuning with reinforcement learning from human feedback. We install bitsandbytes because we are doing supervised fine-tuning with QLoRA rather than plain LoRA: QLoRA uses LoRA, but on a model that has been quantized, and that is the only difference between the two; more on this shortly.

After installing, I set the Hugging Face token as an environment variable (I showed how to get it in the previous part) and import everything we need: load_dataset from datasets; AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig (for quantization), TrainingArguments (to define the maximum steps, learning rate, and so on), and pipeline (to generate responses with the fine-tuned model) from transformers; LoraConfig, PeftModel, and prepare_model_for_kbit_training from peft; and SFTTrainer from trl. A sketch of these imports is shown below.
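A sketch of the installs and imports for the fine-tuning notebook (the package list is the one named above; the token value is a placeholder):

```python
# Colab installs (run in their own cell):
#   !pip install transformers datasets accelerate peft trl bitsandbytes
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer

os.environ["HF_TOKEN"] = "hf_..."   # placeholder Hugging Face token
```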
As discussed, we will fine-tune Llama 2 with supervised fine-tuning, and there are three ways to do that: full fine-tuning, LoRA, and QLoRA. In this notebook we use QLoRA. With full fine-tuning we train all the weights of the entire model, which is very costly; I am using the free Colab GPU with about 15 GB of VRAM, so full fine-tuning is simply not possible here, and it would also be very expensive and time-consuming. With LoRA, instead of training all the weights we add adapters in some layers and train only those added weights, roughly 1-2% of the total, which significantly reduces the cost of training. QLoRA is the same as LoRA except that the model itself is quantized: in LoRA we do not quantize the model, while in QLoRA we do. Why quantize? Because of the GPU memory we have, about 15 GB. If the Llama 2 weights occupy 16 bits each, we quantize them to 4 bits; this costs some precision, but it lets the model fit. To do the quantization we will use the bitsandbytes
configuration Library so in this notebook I fine tuning the Lama 2 model with 7 billion parameters so Lama 2 model comes with with 7 billion 13 billion and 60 billion parameters models so here I am using Lama 2 model with 7 billion parameters so here I'm using T4 GPU which offers 15 GB like you can see over here it offers 15 GB of GPU memory which is barely in enough to store the Lama to 7 billion model BS like you can see that 7 billion multip by two bites so 14 GB in uh six fp16 if we are using 16 bit of uh if we are just using basically uh 16 bits on that dis okay so it's 14 GP already occupied so if we use Lama 2 model with 7 billion and if you don't quantize it uh and so we have 15 GP of GPU Ram so this will not work because this will occupy 14 GP plus we need to also consider the overhead cost which include Optimizer States gradient and forward Activation so to reduce the vram like if you I want to uh reduce the vram like I don't want to use all the 15 GP then we will F tune the Lama 2 model in 4bit Precision so we will do quantization so to reduce the GPU memory usage we will uh quantize the Lama 2 model in 4bit precision and we will be using following the approach of Hora over here so first we will quantize the Lama 2 model in 4bit precision and we will be following the Kora approach over here so this is the base model I'm using Lama 2 model from uh jet model from noise research I'm not uh assessing the official Lama 2 model from meta because to assess that model you need to have a Google uh you need to have a hugging pH Pro account account so I don't have a pro account because it's as some cost Associated so I'm using a Lama to model chat model available from the noise research so I will be using this L 2 7 billion model available from the noise research uh because if we want to assess the official model from meta we need to have a G Count and this is the point to model name so after fine-tuning the Lama 2 model my model will have this name output model so now I'm just loading the data set from hugging face so if you want to load the data set from hugging face you can simply go over here you can simply click over here because in the last tutorial I've showed you how you can push your data set to hugging face up so it's not pushed to the hub you can simply copy this from here and you can just go back to the um your codap notebook and just paste this over here and just led to train then I am just Shing loading the tokenizer from lar Tu so it will download the tokenizer from Lama 2 over here like you can see over here plus here we are using the padding as end of sentence token you can see over here um so in l two we don't have the packing padding token which is uh really a very big problem because we have a data set like I showed you in the previous part in our data set we have uh different number of tokens in each of the row so we need to pad it so they all have the same length so of each row we have a different number of tokens like in in some row we have 80 tokens in some rows we have 90 tokens in some rows we have 200 tokens so using padding we will uh make the length of the token same and here I'm using end of sentence token and when we generate the response uh this end of sentence token will have an impact and when we generate a response from the fin model I will show you how this end of sentence token impacts our model so I'm using end window sense token for f tuning and we have just good to go so here is the data set which have I have loaded from the hugging ph up you can see 
It has two columns, instruction and output, which we discussed in the previous part. Next I set up the QLoRA configuration. First comes the quantization configuration: to reduce the GPU memory usage we load the model in 4-bit precision, and for that I use BitsAndBytesConfig. I set the quantization type to NF4, a format introduced in the QLoRA paper. The model weights are stored in 4-bit precision, but for computation we use 16 bits, which gives better accuracy, and we also enable double quantization, meaning the quantization constants themselves are quantized as well.

Then comes the LoRA configuration, the parameter-efficient fine-tuning part that reduces our compute and memory requirements. Remember that with LoRA we add adapters in some layers and only train those added weights. The alpha value represents the strength of the adapters: a low alpha merges the adapters into the layers weakly, a high alpha merges them strongly. A standard value is 32; here I set alpha to 15. I also add a dropout of 10% on the adapters, set bias to none (we do not care about biases here), and set the task type to causal language modeling.

Next I load the base model, the original Llama 2 chat model, passing the quantization configuration defined above, with the device map set to 0 so it runs on the GPU. Then I call prepare_model_for_kbit_training, which prepares the quantized model for training, for example by casting the layer norms to fp32; the aim of this function is to end up with the best possible fine-tuned model. A sketch of these configurations is shown below.
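A minimal sketch of the quantization and LoRA configurations. The lora_alpha of 15 and the 10% dropout follow the narration, while the adapter rank r=64 is an assumption, since no value is stated.

```python
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NF4 format from the QLoRA paper
    bnb_4bit_compute_dtype=torch.float16,   # compute in 16-bit for accuracy
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

peft_config = LoraConfig(
    lora_alpha=15,          # strength with which adapters are merged
    lora_dropout=0.1,       # 10% dropout on the adapter layers
    r=64,                   # adapter rank (assumption; not stated in the video)
    bias="none",
    task_type="CAUSAL_LM",
)

# Load the quantized base model onto the GPU and prepare it for k-bit training.
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map={"": 0},
)
model = prepare_model_for_kbit_training(model)
```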
Now we set the training arguments. The output directory is the results folder, and the fine-tuned model weights will be saved in its runs subfolder. We fine-tune Llama 2 for only one epoch here, just to show the demo, but 3 to 5 epochs is a better choice. The per-device train batch size is set to 4, which is the number of samples passed to the model at every step, and the gradient accumulation steps are set to 1. An evaluation strategy is set, but it is not really of interest here because we are not evaluating against a separate dataset, only training. We log results every 25 steps, and we use the Adam optimizer in its paged, 8-bit variant, which results in less memory usage. The learning-rate scheduler is linear (you can also use cosine), and the training results will be displayed on TensorBoard, which I will show below.

Then we do the supervised fine-tuning using SFTTrainer; as I said at the start, if you wanted to fine-tune with reinforcement learning from human feedback you would use the RLHF trainer instead. I pass the base model, the training dataset, and the same dataset for evaluation (we do not have a separate evaluation split), plus the parameter-efficient fine-tuning (LoRA) configuration defined above. The dataset text field is the instruction column. The maximum sequence length is 512: during dataset creation we used a threshold of 2,048 tokens to stay safely under Llama 2's context length of 4,096, but because we do not have enough GPU memory we reduce the maximum input length further to 512. I also pass the tokenizer and the training arguments defined above, then train the model and save the trained weights; a sketch of this training setup is shown below.

Fine-tuning Llama 2 for one epoch takes a while: it has been running for 56 minutes and needs about 33 more, so I will come back when it finishes, and then we will run the text-generation pipeline, pass an input prompt, and see what kind of response we get. Now it is done: we have fine-tuned Llama 2 on this dataset for one epoch, which took about an hour and a half; you can also fine-tune it for more epochs, say 5 to 8
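A sketch of the training arguments and the SFTTrainer call. The learning rate is an assumption (it is not stated), and the keyword names tokenizer, dataset_text_field, and max_seq_length follow older trl releases; newer releases move them into an SFTConfig.

```python
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,                  # demo run; 3-5 epochs is a better target
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    logging_steps=25,
    optim="paged_adamw_8bit",            # paged 8-bit AdamW to save memory
    learning_rate=2e-4,                  # assumption; not stated in the video
    lr_scheduler_type="linear",
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=dataset,                # same split; no separate eval set
    peft_config=peft_config,
    dataset_text_field="instruction",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()
trainer.model.save_pretrained(new_model_name)   # saves the trained LoRA adapter
```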
epochs. Here we have the TensorBoard: you can see the loss is continuously decreasing, which is a very good sign, meaning the performance of the model keeps improving. With training done, I run a text-generation pipeline with our model. I pass the input prompt "What is a large language model?" and wrap it in the correct chat template. Then I use the pipeline from Hugging Face with the task set to text-generation, pass my fine-tuned model and the tokenizer, set the maximum output length to 128 tokens, generate a response, and trim it by removing the instruction part.

So I asked "What is a large language model?" and got: "A large language model is a type of artificial intelligence model that is trained on a large corpus of text to generate language outputs." That is quite a good, long response, and you can ask other questions as well. However, the model does not stop there: it asks itself another question, "What are some potential applications of large language models?", and answers it (language translation and so on). The response keeps repeating because of the padding technique we used, padding with the end-of-sentence token; I told you at the start that this would have an impact on the model's response generation, and here you can see it, with another instruction and another response appearing. If you do not want this behaviour, use a different padding technique.

Since I do not want to generate further responses, I next free up GPU memory; running this cell frees up quite a lot of it. Then I merge the base model with the trained adapters. These adapter weights are what our QLoRA fine-tuning produced, so I reload the base model in fp16 and merge the LoRA adapters into it. I also reload the tokenizer, because when we push the fine-tuned model to the Hugging Face Hub we will push the tokenizer along with it. After freeing the memory you can see the GPU RAM usage has gone down a lot, and it rises again while I merge the base model with the trained adapters; a sketch of the generation and merge steps is shown below.
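A sketch of the generation test and the adapter merge, under the same naming assumptions as the earlier sketches; the prompt wrapping follows the Llama 2 chat template.

```python
prompt = "What is a large language model?"
gen = pipeline(
    task="text-generation",
    model=trainer.model,
    tokenizer=tokenizer,
    max_length=128,
)
result = gen(f"<s>[INST] {prompt} [/INST]")
print(result[0]["generated_text"])

# Free GPU memory, then reload the base model in fp16 and merge the
# trained LoRA adapter into it so a single standalone model remains.
del model, trainer
torch.cuda.empty_cache()

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
merged_model = PeftModel.from_pretrained(base_model, new_model_name)
merged_model = merged_model.merge_and_unload()

# Reload the tokenizer so it can be pushed alongside the merged model.
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```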
The memory does rise during the merge, but not enough to fail, and now it is finished. The last step is to push the fine-tuned model and tokenizer to the Hugging Face Hub. I pass the name of the model repository I created on Hugging Face; to create one yourself, go to Hugging Face, click New Model, enter a model name, and click Create model. Then copy the repository name, go back to your Colab notebook, and paste it here (and again where the tokenizer is pushed). You also need a Hugging Face token with write access: go to Settings and then Access Tokens and copy the write token. When you run the cell, a popup appears asking for your login token; paste it there, answer yes when asked whether to add the token as a git credential, and your model will be pushed to the Hub. A minimal sketch of this push step follows below. Pushing takes 5 to 10 minutes, so I will finish the video here rather than waste your time. That is all for this tutorial: we have learned how to create a dataset and how to fine-tune the Llama 2 model in a Google Colab notebook using the free GPU. Thank you for watching, bye-bye.
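Finally, a sketch of the push-to-hub step described above; the repository name is hypothetical, and notebook_login prompts for the write token.

```python
from huggingface_hub import notebook_login

notebook_login()   # paste a token with write access when prompted

# Push the merged model and the tokenizer to the same model repository.
merged_model.push_to_hub("your-username/llama-2-7b-platypus")
tokenizer.push_to_hub("your-username/llama-2-7b-platypus")
```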
Info
Channel: Muhammad Moin
Views: 7,733
Keywords: Llama 2, Llama, Efficient Fine Tuning, Fine Tuning Llama 2, Llama 2 on Custom Dataset, Fine Tune Llama 2 on Custom Dataset, Efficient Fine Tuning Llama 2, Fine Tuning Llama 2 with QLoRA, QLoRA on a Single GPU, QLoRA, Efficient Fine-Tuning, Llama 2 on Custom Dataset with QLoRA, QLoRA on a Single GPU in Google Colab, Fine-Tuning for Llama 2
Id: YyZqcNo4hdo
Length: 56min 15sec (3375 seconds)
Published: Sat Dec 16 2023