How to finetune LLMs with LoRA (PEFT hands-on): gemma-2b + @HuggingFace

Video Statistics and Information

Captions
Given a pre-trained model, we generally start with prompt engineering to understand how the model behaves, but soon there will be situations where we actually have to do some fine-tuning. In this video I'll dive into such an example, fine-tuning on the Databricks Dolly 15k dataset. I will walk you through how I created a mini dataset derived from Dolly 15k, and the model I'm going to fine-tune is the Gemma 2-billion-parameter model. I'll be fine-tuning on Google Colab, so anyone with just a Gmail account should be able to fine-tune this model, get their own fine-tuned version, and look at how it behaves. I will also show plots of the training loss to see how the loss decreases as training progresses. I've fine-tuned on only a thousand samples, but if you really want to play around with the model, you can fine-tune on the entire dataset. I've uploaded the dataset to the Hugging Face Hub and made it publicly available, and the fine-tuning notebook is also free to access through the link in the description below. So without further ado, let's get started.

Let's start with the preliminaries, which is installing the needed packages. We obviously need the Transformers library from Hugging Face and the Datasets library. One thing I noticed with the Gemma model is that we actually have to upgrade these libraries, because the default versions that ship with Google Colab do not include the Gemma model and tokenizer. Because we'll be fine-tuning on a single GPU on Colab (the T4, with around 15 GB of memory), we also need PEFT, TRL, bitsandbytes, and Accelerate, plus TensorBoard to visualize the training results and the jsonlines library, which we'll use to dump the dataset we're about to create before saving it to the Hugging Face Hub. We also have to log into the Hugging Face ecosystem, which we can do with the notebook_login function from the huggingface_hub library.

Once we have logged in and imported the packages, we can run inference on the Gemma model as a first step: load the model and the tokenizer, visualize the model architecture, and finally run a query (a prompt) through the pre-trained model to understand how it responds. The model name I've given is the 2B Gemma checkpoint. I load the model with AutoModelForCausalLM and the tokenizer with AutoTokenizer; the beauty of AutoTokenizer is that, based on the model name, it loads the tokenizer corresponding to that model, so we don't have to worry about which tokenizer gets loaded. Once the model and tokenizer are loaded, we can print the model architecture to see which layers we can use for parameter-efficient fine-tuning. We can see that the Gemma attention layer has the query, key, value, and output projection matrices; these are the four main matrices we'll actually be fine-tuning when we do parameter-efficient fine-tuning.
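A minimal sketch of that setup in a Colab cell follows; the package list mirrors what's described above, while the google/gemma-2b model id, the torch_dtype setting, and the generation settings are assumptions (you also need a Hugging Face token with access to the Gemma weights):

```python
# Install/upgrade the libraries mentioned above in a Colab cell first, e.g.:
#   !pip install -U transformers datasets
#   !pip install peft trl bitsandbytes accelerate tensorboard jsonlines

from huggingface_hub import notebook_login
from transformers import AutoModelForCausalLM, AutoTokenizer

notebook_login()  # log into the Hugging Face Hub (the Gemma weights are gated)

model_name = "google/gemma-2b"  # assumed Hub id for the 2B Gemma checkpoint

# AutoModelForCausalLM / AutoTokenizer resolve the right classes from the model name.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(model)  # the printed GemmaAttention blocks expose q_proj, k_proj, v_proj, o_proj

# Run one prompt through the pre-trained model to see how it responds.
prompt = "What should I do on a trip to Europe?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```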
To query the model, I've written a prompt: "What should I do on a trip to Europe?" The model responds along the lines of: the answer to this question is not as simple as it seems; there are many different things to see and do in Europe, and it can be difficult to know where to start; if you're planning a trip to Europe, here are some tips to help you get started — and to begin with it says to decide what you want to see and do, since there are so many amazing places in Europe. As an additional example, I also prompted: "Explain the process of photosynthesis in a way that a child could understand", and the response is a bit of gibberish about a 100-watt light bulb plugged into a standard socket, which I don't quite get. Anyway, let's find out what happens with the fine-tuned model.

Before that, just to motivate parameter-efficient fine-tuning, I'm going to load a dataset, initialize a trainer, and start training without any parameter-efficient fine-tuning at all, and see what happens. In this case I'm using the Databricks Dolly 15k dataset. It is a corpus of 15,000 records generated by thousands of Databricks employees to enable LLMs to exhibit the magical interactivity of ChatGPT: an open-source dataset of instruction-following records covering several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed question answering, generation, information extraction, open question answering, and summarization. So we can happily use this dataset.

Let's switch to the code and see what we can do. I load the dataset, take the training split, and from that use just the first thousand records. I visualize the first record to see what the instruction and the corresponding response look like: the instruction is "When did Virgin Australia start operating?" and the response is "Virgin Australia commenced services on 31 August 2000 as Virgin Blue with two aircraft on a single route." That's just one record; for a more detailed look, you can browse the Hugging Face dataset card or switch to the dataset viewer and inspect many more. There you'll see that some instructions come with a context and some without, and there are categories such as closed question answering, open question answering, classification, and information extraction, so the dataset can serve quite a few fine-tuning tasks.

What I do next is form the prompt, initialize the trainer without any parameter-efficient fine-tuning, and start the training with trainer.train().
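Loading and inspecting those first thousand Dolly records could look like the short sketch below; the dataset id is the public databricks/databricks-dolly-15k repo:

```python
from datasets import load_dataset

# Take only the first 1,000 records of the training split, as described above.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train[:1000]")

print(dataset)                    # features: instruction, context, response, category
print(dataset[0]["instruction"])  # "When did Virgin Australia start operating?"
print(dataset[0]["response"])     # "Virgin Australia commenced services on 31 August 2000 ..."
```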
As expected, we get a CUDA out-of-memory error, because a GPU with around 15 GB of memory is simply not sufficient to fully fine-tune a model with about 2 billion parameters. So we have to resort to parameter-efficient fine-tuning. For that I'll create a new dataset, visualize it, do a little bit of cleaning and pre-processing, upload the created dataset to the Hugging Face Hub, and, as a last step, use it to fine-tune the Gemma 2B model.

To create my own dataset (I don't have any in-house data), I'm going to use the Databricks Dolly 15k dataset, but I'm going to modify it into a mini dataset that I'll use for fine-tuning. I load the entire training split and first get a count of the different categories: for example, there are 3,742 records for open question answering, around 1,500 for information extraction, and so on, adding up to the 15,000 records in the Dolly dataset. To create the mini dataset, I filter out the records that have a context: a small condition says that if a record has a context, skip it; if it has no context, format it in the way that is specific to how we'll prompt Gemma. The input is formatted as "Instruction:" followed by a new line with the instruction, then a blank line, then "Response:" followed by a new line with the response. We need to stick to the format expected by the model we're fine-tuning, so I've written a few lines of code to format the data this way.

Looking at a couple of records after formatting: the first is "Which is a species of fish: Tope or Rope?" with the response "Tope", and the second is "Why can camels survive for long without water?" with the response that camels use the fat in their humps to keep them filled with energy and hydration for long periods of time. Finally, I write the entire data dump into a JSON Lines file using the jsonlines library we installed in Colab. Then, on the Hugging Face Hub, I create a new dataset: click New, then Dataset, give it a name and a license, and create it. I've already done that and created a dataset called databricks-mini, where we can see the records we just visualized, for example "Which is a species of fish: Tope or Rope?" with the response "Tope". One thing to note is that its format is quite different from the original Dolly dataset: Dolly has four features (instruction, context, response, and category), whereas here there is just one feature, which I've named text, and each record is a single line of text with the instruction and the response combined rather than kept as separate features.
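A sketch of the filtering and formatting step, assuming the prompt template described above; the output filename and the target dataset repo name are placeholders:

```python
import jsonlines
from datasets import load_dataset

full = load_dataset("databricks/databricks-dolly-15k", split="train")

# Keep only records without a context and flatten each one into a single "text" field
# using the Instruction/Response template described above.
records = []
for row in full:
    if row["context"]:  # skip records that come with a context
        continue
    text = f"Instruction:\n{row['instruction']}\n\nResponse:\n{row['response']}"
    records.append({"text": text})

# Dump the formatted records to a JSON Lines file with the jsonlines library.
with jsonlines.open("databricks-mini.jsonl", mode="w") as writer:
    writer.write_all(records)

# The .jsonl file can then be uploaded to a new dataset repo on the Hub
# (for example "your-username/databricks-mini") via the web UI or huggingface_hub.
```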
Once we have the dataset sorted, I load it again and take a thousand samples from the training split, just to demonstrate the fine-tuning process.

The next step is to define the different parameters: first the LoRA parameters, then the bitsandbytes parameters, then the training arguments themselves, and finally the supervised fine-tuning parameters. I've given the new model a name, gemma-ft (ft for fine-tuned). For LoRA we have to define three main parameters: the rank, alpha, and the dropout probability for the LoRA layers. If you have sufficient GPU memory you can set the rank higher, say 64 or 32; to make the fine-tuning easier and less memory-hungry, I've reduced the LoRA rank to 4 instead of 64 and left the other parameters intact. For bitsandbytes, we've chosen 4-bit precision for the base model, the compute data type for 4-bit is float16, and the quantization type is NF4 rather than plain FP4. We have also disabled nested quantization; if you've watched my previous video on QLoRA, you'll know there are two stages, standard quantization and double quantization, and this flag enables or disables the double quantization.

Then we define the training arguments. We provide an output directory where the training logs are written so that we can visualize the training metrics later. For the number of epochs I'm running just one, which means going through the entire training dataset once. Whether to use fp16 or bf16 depends entirely on the hardware: if we check the device capability of the GPU and the major version is at least 8, we can set bf16 to True; otherwise we set it to False. The batch size can also be varied depending on the hardware: I started with a batch size of 16, realized it was running out of memory, and reduced it to 2 or 4; choose a batch size that suits your hardware — with a cluster of four GPUs, for instance, you can increase it and the Transformers library will take care of scaling the processing. The rest are standard parameters such as the learning rate and weight decay that come with training any neural network; I haven't changed them. max_steps goes hand in hand with the number of training epochs: if you set the epochs, you generally don't have to worry about max_steps. I'm the kind of person who chooses epochs rather than steps, so I've set the number of epochs and set max_steps to -1, meaning it's ignored. The next thing that concerned me was the maximum sequence length: if you set it to quite a high number, say 128, you need enough compute to cope with it. I played around with the sequence length for a bit before arriving at 40; with a cluster of 4 or 8 GPUs, feel free to increase it to 256 or even higher, but I'm leaving it at 40, which was sufficient for my training.
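Collected as plain variables, the hyperparameters discussed above might look like this sketch; the values quoted in the video (rank 4, one epoch, max sequence length 40, NF4, no double quantization, batch size 4) are kept, while the learning rate, weight decay, alpha, and dropout are assumed defaults you should adapt to your hardware:

```python
import torch

new_model = "gemma-ft"            # name for the fine-tuned model/adapter

# LoRA parameters
lora_r = 4                        # rank reduced from 64 to fit the Colab T4
lora_alpha = 16                   # assumed default
lora_dropout = 0.1                # assumed default

# bitsandbytes (4-bit quantization) parameters
use_4bit = True
bnb_4bit_compute_dtype = torch.float16
bnb_4bit_quant_type = "nf4"       # NF4 instead of plain FP4
use_nested_quant = False          # double quantization disabled

# Training arguments
output_dir = "./results"
num_train_epochs = 1
per_device_train_batch_size = 4   # 16 ran out of memory on the T4
learning_rate = 2e-4              # assumed default
weight_decay = 0.001              # assumed default
max_steps = -1                    # train by epochs, not by steps
max_seq_length = 40               # raise to 128+ if you have more compute
packing = True                    # pack several short examples per input sequence

# Use bf16 only when the GPU's compute capability major version is >= 8 (Ampere or newer)
bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
fp16 = torch.cuda.is_available() and not bf16
```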
I've set packing to True because we pack multiple short examples into the same input sequence to make training more efficient. As for the device map: I was initially training on my local desktop, which has two GPUs, so I set the device map to "auto", which ensures training uses both GPUs; if you have only one GPU, you can set the device map to 0, the single GPU available to you.

To load the quantized model we need a bitsandbytes configuration: we ask it to load the model in 4-bit precision rather than the precision the model ships with, to use normal float 4 (NF4), and we also have to specify the nested-quant setting — we're not using nested (double) quantization, but we still need to say so explicitly. Once we have that configuration, we load the model and the tokenizer and pass the configuration to the model. The token argument is your Hugging Face token, which ensures you have access to the model you're using; the quantization config is the bitsandbytes configuration we just created; and the device map is "auto" so that all available GPUs are used. Finally, we don't want to use any cache, so we set use_cache to False. For the tokenizer, we set the pad token to the end-of-sequence token and fix any overflow issues by setting the padding side to "right".

Moving on to the LoRA configuration: as we saw before, LoRA has the alpha, dropout, and rank parameters, and we specify them in the LoraConfig to initiate parameter-efficient fine-tuning. On top of that, the most important bit is the target modules. If you've watched my LoRA video, you'll know that only a few modules, especially the attention modules, are actually fine-tuned while the rest are ignored, so we list the attention modules here as the target modules; these are usually the modules you fine-tune with LoRA.

After the LoRA configuration, we create the training arguments: the number of epochs, the output directory for logging, the batch size, and all the other training settings are passed to the TrainingArguments class. We can print the resulting training arguments to see what the training will use; for example, we have only one GPU, so n_gpu is 1, and we can also see values like the betas and epsilon for the Adam optimizer. Once we have the training arguments, we pass the model to the supervised fine-tuning trainer (SFTTrainer) class, along with the dataset we created and uploaded to the Hugging Face Hub (the databricks-mini dataset), the parameter-efficient fine-tuning configuration, the maximum sequence length (I chose 40, but you can go to 128 or higher depending on the compute at your disposal), the tokenizer, and the training arguments, and finally we instantiate the trainer.
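A minimal sketch of this wiring, assuming a trl version that still accepts these SFTTrainer keyword arguments (newer releases move them into SFTConfig) and a placeholder repo id for the mini dataset:

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

dataset = load_dataset("your-username/databricks-mini", split="train[:1000]")  # placeholder id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # load the base model in 4-bit precision
    bnb_4bit_quant_type="nf4",          # NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,    # nested/double quantization disabled
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=bnb_config,
    device_map="auto",                  # use all available GPUs
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"        # avoid overflow issues during training

peft_config = LoraConfig(
    r=4, lora_alpha=16, lora_dropout=0.1,
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # the attention projections
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-4,                 # assumed default
    weight_decay=0.001,                 # assumed default
    max_steps=-1,
    bf16=False,                         # set to True on compute capability >= 8
    logging_steps=25,
    report_to="tensorboard",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",          # the single text feature of databricks-mini
    max_seq_length=40,
    tokenizer=tokenizer,
    args=training_args,
    packing=True,
)
```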
Once the trainer object is successfully created, it's all about kicking off the training by invoking trainer.train(). I let this run for one epoch, which was 1,340 iterations, and the training losses were logged during the process. At first glance, the training loss has dropped steeply, starting from about 5.2, falling to 2.9, and then going down to around 2.7. We can also visualize the training in TensorBoard: run the tensorboard command, the visualization opens up, and we can see that the training loss has gone down, which is a nice thing. But sometimes the loss goes back up after a certain point, indicating overfitting, so we need to be wary of that; if there are signs of overfitting, we can resort to augmentation or add more data points to the training dataset. I fine-tuned with only a thousand records, so it's quite possible that the model has overfitted the fine-tuning data. Let's find out by querying the fine-tuned model.

For prompting the newly fine-tuned model I'm using the same prompt we used for the pre-trained model: "What should I do on a trip to Europe?" Because we have fine-tuned only the LoRA layers, we need to merge the base model with the adapter layers that were actually fine-tuned. For this we use the PeftModel class: we load the adapter we trained on top of the base model, which gives our fine-tuned model, and we then invoke the merge_and_unload function to arrive at the merged model. Once we have the merged model, we pass it the prompt and look at the response. The response I get is that there are many things you can do on a trip to Europe: you can visit the Eiffel Tower in Paris, the Colosseum in Rome, the Vatican in Rome — quite specific, actually — but it keeps repeating "the Pantheon in Rome, the Pantheon in Rome", which makes me feel the model has overfitted a little. So it's time to go back, add more records to the training dataset, rerun the fine-tuning, and see whether the response gets any better. We can keep iterating between training, evaluation, and testing: training more models, changing or updating the data, adding more augmentations, and training another round to compare the models. All of this becomes possible once we can actually evaluate the results. In my next video we'll look at how to evaluate these trained models, and also how to quantize the pre-trained or fine-tuned models so that we can compress them and deploy them for production use. Please stay tuned for that, and I'll see you in my next video. Until then, take care.
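A sketch of the merge-and-query step, assuming the LoRA adapter was saved under the gemma-ft name used earlier (the paths, prompt template, and generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# trainer.train()                              # run the fine-tuning (one epoch here)
# trainer.model.save_pretrained("gemma-ft")    # save the LoRA adapter weights

# Reload the base model and fold the LoRA adapter into it.
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "gemma-ft")
merged_model = model.merge_and_unload()        # merge adapter weights into the base model

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

# Query the merged model with the same prompt template used for fine-tuning.
prompt = "Instruction:\nWhat should I do on a trip to Europe?\n\nResponse:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(merged_model.device)
outputs = merged_model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

To view the loss curves, point TensorBoard at the output directory (in Colab: %load_ext tensorboard followed by %tensorboard --logdir results).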
Info
Channel: AI Bites
Views: 2,857
Keywords: machinelearning, deeplearning, transformers, artificial intelligence, AI, deep learning, machine learning, educational, how to learn AI
Id: _xxGMSVLwU8
Length: 24min 11sec (1451 seconds)
Published: Fri Mar 01 2024