Fine-Tune Your Own Tiny-Llama on Custom Dataset

Captions
In the last video we looked at TinyLlama, which is a small language model. In this video I'll show you how to fine-tune this model on your own dataset. If you watched the previous video, you'll know that TinyLlama is a very small model, so it's not really great for general language tasks. However, you can fine-tune it on very specific tasks and run it on edge devices, and in this video we'll look at how to fine-tune it on one very specific task.

To fine-tune the model we're going to use a dataset from Hugging Face called colors, but you can also format your own dataset, and I'll walk you through a step-by-step process of how to do that. Before looking at the dataset, I'd like to say that this video is heavily inspired by a blog post, TinyLlama colorist fine-tuned with a color dataset, so I'm essentially taking that idea and replicating exactly the same behavior. The link to the blog post is in the video description.

Let's first understand the dataset before we look at fine-tuning the model. There are two columns. The first one is description, which describes a given color; for example, "pure black" comes with a description of that color. The second column is color, which holds the corresponding hexadecimal code. We want to fine-tune the TinyLlama model so that it accepts the description as input and generates the corresponding hex code. You could do this task through prompting if you were working with a large language model, simply by telling it to give you the hex code corresponding to a color description. In our case, however, we want the model to look at the color description and generate the hex code without any other instructions. Here is an example: I took the TinyLlama chatbot, which is the instruction fine-tuned version, gave it a color description, and the output it generated is just plain text. Our fine-tuned model, by contrast, will generate the corresponding hexadecimal code without any system instructions. I hope this clarifies what exactly we're trying to achieve.

So now let's look at the code. I'm running this on a free Google Colab, and it has pretty good inference speed as well. First we need to install some packages: accelerate, bitsandbytes, transformers, and TRL for fine-tuning the model. We're also going to use the PEFT package from Hugging Face, because we're adding LoRA adapters on top of the model; we're not actually fine-tuning the original model, only the LoRA adapters. If you want to learn more about LoRA, I have a video on it; the link is in the description.

Next we import all the packages we'll need in order to fine-tune the model: PyTorch; the load_dataset function and the Dataset class, since we're downloading datasets from Hugging Face; the LoRA configuration from the PEFT package; AutoModelForCausalLM and the corresponding tokenizer for running the model; and the supervised fine-tuning trainer (SFTTrainer) from the TRL package.

Then we need to do some housekeeping. First we provide the name of the dataset we want to use, then the corresponding base model. Notice that I'm actually using the chat version of the model rather than the base model. You can do this with the base model as well, but just to keep the training time shorter I'm using the already fine-tuned version, which I think is trained on around 350 billion tokens. Finally, we define the output directory where we're going to store the model.
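Since the install and import steps go by quickly, here is a minimal sketch of that setup. The dataset and checkpoint identifiers are assumptions (the video only names a Hugging Face "colors" dataset and the TinyLlama chat model), so substitute the exact IDs you are working with.

```python
# Packages mentioned in the video (a minimal sketch; versions are not pinned here):
#   pip install -q accelerate bitsandbytes transformers trl peft datasets

import torch
from datasets import Dataset, load_dataset
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# Assumed identifiers; swap in the dataset and checkpoint you actually use.
dataset_id = "burkelibbey/colours"                     # the "colors" dataset referred to above (assumption)
base_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # chat version of TinyLlama (assumption)
output_dir = "tinyllama-colorist"                      # where checkpoints will be stored
```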
Next we need to format the dataset properly so we can use it to train the model. The TinyLlama instruct or chat version uses the ChatML format by default, and that's exactly what we're going to use in our fine-tune as well. Here is a function you can use to format your dataset: it places the initial special tokens, then the input from the user, then the corresponding response. We'll use this function in a moment, and here is how I'm formatting my dataset. First we provide the dataset ID and load the dataset, since it's available on Hugging Face, then convert it to a pandas DataFrame. We take both columns available in the dataset, description and color: the description becomes the input to the model, placed after the special tokens where the user input is expected, and the color column becomes the response, placed after the special tokens for the assistant. If you format your own dataset, it has to follow exactly the same structure. After formatting, we create yet another column called text, and this is critical: whenever you format your own dataset, you need to apply the prompt template you want to use and assign the result to a column called text. Finally we convert everything back to a Dataset object.

Here is how we process the data: we call this prepare_train_data function on our dataset ID. If I look at the resulting dataset, we have three columns, color, description, and text, and there are around 34,000 examples. If you look at the first data point, you have the color, the corresponding description, and the new text column we added: at the beginning you have the description, which becomes the input, followed by the corresponding hexadecimal code, which becomes the response from the model.

After this we need to set up our model. We're using the chat version of TinyLlama, so first we download and initialize the tokenizer corresponding to this model; the tokenizer is exactly the same as the Llama 2 tokenizer. We're using the bitsandbytes package, so we set all the corresponding configurations, and then we load or download the model itself. If we call this function with the model ID, it downloads the tokenizer as well as the model, sets the bitsandbytes configuration correctly, and returns the model and tokenizer.

Next we set up the LoRA configuration. In this case I'm going with the default values; I have another video that goes into more detail on how to choose different LoRA adapters, so if you're interested, watch that video, the link is in the description.
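Continuing from the setup sketch above, here is roughly what the data preparation, model loading, and LoRA configuration could look like. The ChatML special tokens, column names, and LoRA hyperparameters follow the description in the video where possible and are otherwise common defaults, not the exact values from the notebook.

```python
def formatted_train(user_input: str, response: str) -> str:
    """Wrap one example in the ChatML template the TinyLlama chat model expects."""
    return (
        f"<|im_start|>user\n{user_input}<|im_end|>\n"
        f"<|im_start|>assistant\n{response}<|im_end|>\n"
    )

def prepare_train_data(dataset_id: str) -> Dataset:
    """Load the colours dataset and add a 'text' column holding the full prompt."""
    data = load_dataset(dataset_id, split="train")
    df = data.to_pandas()
    # 'description' becomes the user turn, 'color' the assistant turn
    # (column names as described in the video; adjust for your own dataset).
    df["text"] = df[["description", "color"]].apply(
        lambda row: formatted_train(row["description"], row["color"]), axis=1
    )
    return Dataset.from_pandas(df)

def get_model_and_tokenizer(model_id: str):
    """Download the tokenizer and the 4-bit quantised model via bitsandbytes."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model.config.use_cache = False
    return model, tokenizer

train_data = prepare_train_data(dataset_id)
model, tokenizer = get_model_and_tokenizer(base_model_id)

# LoRA adapters with fairly standard default values (assumed, not read out in the video).
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```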
Next we need to set up the training parameters. I'm not pushing the model to Hugging Face, so I'm defining an output directory here. Since it's a small model, you can use a much bigger batch size even if you're running this on a free T4 GPU from Google Colab. We're using four steps of gradient accumulation, and then we choose the optimizer we want to use. These are things you need to experiment with; in particular you want to experiment with the learning rate, which controls how quickly training converges. I'm only running this for 150 steps, so it's not going to go through the whole dataset multiple times; I think it works out to a fraction of an epoch, not even a full one. If you want to run it for longer, I would recommend either commenting out the step limit or training for a larger number of epochs.

After this we create our supervised fine-tuning trainer (SFTTrainer) object. We provide the base model we want to use and the corresponding data; in this case the dataset has already been pre-formatted with the specific prompt template. In my Mixtral videos I showed how to instead provide a function that formats your dataset on the fly, so if you're interested in doing it that way, have a look at those. We pass in the LoRA configuration, and since the dataset is pre-formatted we need to tell the trainer which column to use for training, so we specifically point it at the text column. Then we provide the training arguments we defined above and the corresponding tokenizer. The last thing you need to set is the maximum sequence length; I'm keeping it at 1024, even though the examples in the dataset are much shorter.

So, to summarize, we only train for a few hundred steps, and here is the model training. There was no validation dataset, so it only shows the training loss, and if you look at it, it decreases pretty nicely. There are some fluctuations at the end; that's not really an indicator of overfitting yet, but we could reduce the learning rate and the curve would be much smoother. Another thing to pay attention to is that we only ran this for about half an epoch, not even a full one; if you want to train your own model, make sure you run it for at least one epoch.

This process only trains the LoRA adapters, not the actual model, so we need to merge the two together. I loaded the original model, provided the path of the last checkpoint that was trained, loaded that, and then merged and unloaded the models. This gives the final model with the LoRA adapters merged into it. The way you do it is with the PeftModel class: you provide the actual model you want to merge the LoRA into and the path of the LoRA adapters, and you get the final model. If you inspect the final merged model, you will actually see the LoRA adapters in there.
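Continuing the sketch, the trainer setup and the adapter merge might look like the following. The batch size, optimizer, learning rate, and checkpoint path are assumptions (the video does not read them all out), and the SFTTrainer keyword arguments shown match TRL releases from around the time of the video; newer releases move some of them into SFTConfig.

```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=16,     # a small model tolerates a larger batch even on a T4 (assumed value)
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",          # assumed optimizer; any AdamW variant works
    learning_rate=2e-4,                 # the main knob to experiment with (assumed value)
    max_steps=150,                      # only a fraction of an epoch on ~34k examples; raise or remove for full epochs
    logging_steps=10,
    save_strategy="steps",
    save_steps=50,
    fp16=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    peft_config=peft_config,
    dataset_text_field="text",          # the pre-formatted prompt column
    args=training_args,
    tokenizer=tokenizer,
    max_seq_length=1024,
)
trainer.train()

# Training only produces LoRA adapters, so merge them back into the base model.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto"
)
adapter_path = f"{output_dir}/checkpoint-150"   # path of the last saved checkpoint (adjust as needed)
merged_model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = merged_model.merge_and_unload()
```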
Now it's inference time; let's see whether the model actually learned anything. For inference we're going to use a function called generate_response (a code sketch of these helpers appears at the end of this transcript): it takes the input from the user, which is the prompt the user provides, and passes it through a formatted_prompt function so that it's put into the correct prompt template. That function receives a question, which becomes the user input wrapped in the corresponding special tokens, followed by the special tokens for the assistant response; after that, the trained model is supposed to generate a response for us. The rest is standard Hugging Face code. The only difference is that we start a timer to record the start time and stop it afterwards, which gives the time it took to make a prediction; because this is a small model, it should generate predictions really fast, even on a T4 GPU. We then look at both the output and the time it took to generate it.

Before testing it, here is another small helper function that accepts a hexadecimal color code and renders the corresponding color, so we can actually see whether the hexadecimal code matches the color description we provided.

The input prompt is "a light orange color". The response took only about 0.35 seconds, so under a second, which is pretty amazing. Here is the hexadecimal code the assistant generated, and here is the actual color rendered by our helper function from that code; it's actually pretty close to a light orange color. So it seems our model is really learning, which is pretty amazing. Just to reiterate: you can easily do this with any large language model by instructing it to convert a given color description into a hex code, but in this case the TinyLlama model was able to learn the task from the data without any explicit instruction. I'm genuinely excited about these small language models; I think we will see more of them during 2024, and they are one of the viable solutions for running models on consumer hardware such as phones and edge devices. I hope you learned something new in this video and found it useful. Let me know if there are any questions. Thanks for watching, and as always, see you in the next one.
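For reference, here is a minimal sketch of the inference helpers described in the transcript: generate_response with timing, plus a small notebook helper that renders a hex color as a swatch. The generation parameters and the HTML swatch approach are assumptions rather than the exact code from the video, and it continues from the sketches above.

```python
import time

def formatted_prompt(question: str) -> str:
    """Put a user question into the same ChatML template used for training, ending at the assistant turn."""
    return f"<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"

def generate_response(user_input: str) -> str:
    """Generate the model's reply and report how long the prediction took."""
    prompt = formatted_prompt(user_input)
    inputs = tokenizer(prompt, return_tensors="pt").to(merged_model.device)
    start = time.perf_counter()
    outputs = merged_model.generate(
        **inputs,
        max_new_tokens=20,               # a hex code is short
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    elapsed = time.perf_counter() - start
    # Decode only the newly generated tokens, not the prompt.
    text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"generated in {elapsed:.2f}s")
    return text.strip()

def show_colour(hex_code: str) -> None:
    """Render a small swatch for a hex colour code (e.g. '#ffa07a') in a notebook."""
    from IPython.display import HTML, display
    display(HTML(f'<div style="width:80px;height:80px;background:{hex_code}"></div>'))

# Example usage: ask for a colour and render the swatch the model predicts.
# reply = generate_response("a light orange color")
# show_colour(reply)
```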
Info
Channel: Prompt Engineering
Views: 16,810
Keywords: prompt engineering, Prompt Engineer, Tiny llama, tinyllama, how to fine tune llama, LLM
Id: OVqe6GTrDFM
Length: 14min 31sec (871 seconds)
Published: Wed Jan 10 2024