Fine-tuning Tiny LLM on Your Data | Sentiment Analysis with TinyLlama and LoRA on a Single GPU

Captions
Hey everyone, my name is Venelin, and in this video we're going to have a look at how you can fine-tune a tiny LLM on your own dataset. We're going to start with preparing the dataset for training, then look at which parameters you need to set to get your tokenizer and model ready for training, along with the LoRA setup. Then we're going to train the model within a Google Colab notebook on the free tier. Finally, we're going to load the trained model and evaluate it on a test set to see whether or not the fine-tuned model is doing a good job. Let's get started.

If you want to follow along, there is a complete text tutorial along with a link to a Google Colab notebook for this video. It is available within the bootcamp section of MLExpert.io, under "Fine-tuning Tiny LLM on Custom Dataset". This is available to MLExpert Pro subscribers, so if you want to support my work and get access to it, please subscribe to MLExpert Pro. Thanks!

So what do you need in order to fine-tune a tiny LLM? First, we're going to go through why you might want to choose a tiny LLM over something like a Llama 7-billion-parameter model. Then we'll look at why you would need to do some fine-tuning, and at the checkpoints you need to cover in order to choose and prepare your dataset. Finally, I'm going to give you some tips for fine-tuning a tiny LLM using LoRA.

So why a tiny LLM? First and most importantly, these models are relatively small compared to regular large language models such as the 7-billion-parameter Mistral or Llama 2. Tiny LLMs are models like TinyLlama, the one we're going to use in this video, and others like Phi and Phi-2, which is at the limit of what I would call a tiny LLM. Another important thing about tiny LLMs is that inference with them is much faster, and the training itself can be a lot faster than what you might get with a relatively larger LLM; you can even use older GPUs to train these types of models. Finally, even though these models are tiny, some of them are still trained on very high-quality data, such as Phi and Phi-2, or on a very large number of tokens, such as TinyLlama, which was trained on more than 3 trillion tokens.

Why would you want to do some fine-tuning? Well, first you can try starting with some prompt engineering, and if that works for you and the benchmark performance of your model is relatively good, then stick with prompt engineering. But if you want to increase the performance of your model, and you have enough data to do it, fine-tuning is a very good way to get much better performance out of your tiny LLM. In the general case, tiny LLMs are not as powerful as 70-billion-parameter-plus models such as Llama 2 or Mistral, and not even close to ChatGPT, GPT-4, or GPT-4 Turbo. So if you want a much smaller model that performs relatively well on your benchmarks and tasks, you will likely need to do some fine-tuning. Another good thing about fine-tuning is that it reduces the number of tokens you need to pass in with the prompt: you can just pass in your data with a much smaller template instead of some larger prompt, and this will make your inference time even faster.
Of course, you might have data that is private to you or your company. When you fine-tune your own models, you don't have to expose that data to the outside world, which is another positive of the fine-tuning approach.

How should you prepare your data? As a general rule of thumb, I would suggest more than a thousand examples of high-quality data. Preferably, you want humans to have looked through the data and gotten a feel for its quality; when you have good-quality data, your fine-tuned LLMs are going to be much, much better than if you have some, let's say, shady data points. You also have to think about what type of task you're solving. In this video I'm going to show you how to use the LLM for two different tasks at once, which is very convenient: in the past, if you had to solve multiple tasks, you essentially had to train multiple models, or a single model with multiple heads, one per prediction. In the era of LLMs, we can simply say that we want two outputs: one will be the sentiment of a news item, and the other the subject of the cryptocurrency news. That's what our dataset provides. You also have to look at how many tokens you need for the input and the output, check your model's maximum context width, and decide whether you can fit the inputs and outputs within the context window. Finally, you have to come up with a template that works well for formatting your own data.

The dataset we're going to use is Crypto News+, which is available on Kaggle. It contains crypto news articles with a title, text, and sentiment analysis. The sentiment was most likely predicted by some model, so the labels might not be perfect, but this is still a real-world example of what you might have. The data covers crypto news for the years 2021 to 2023 in a structured format including title, text, source, subject, and sentiment analysis. Here is an example of the data you get: there is a class for the sentiment, along with polarity and subjectivity scores, plus the subject, and all of these are accompanied by the title and text from the news, where the text is just the first paragraph of the article.

I have a Google Colab notebook in which I've loaded the crypto news data. I took the original CSV file and created a stratified split between train, validation, and test sets. Here is the head of the training data frame, and you can see the split between the training, validation, and test examples; we still have a lot of data. I've kept the subject, and I split the sentiment into a couple of columns, which makes it a bit easier to work with than the original dataset. Other than that, let me show you the splits between train, validation, and test: you can see that the stratified sampling has worked wonders for us. For each subject (Bitcoin, altcoin, blockchain, Ethereum, NFT, and DeFi), the validation and test sets follow pretty much the same frequencies as the training set. You can also see that we have a very large bias towards Bitcoin, altcoin, and blockchain examples.
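As a rough illustration, a stratified split like the one described here might look as follows; this is a minimal sketch, and the file name, column name, and split sizes are assumptions rather than the exact notebook code:

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file and column names; the Kaggle CSV may differ.
df = pd.read_csv("cryptonews.csv")

# Carve out a held-out portion, stratifying on the subject so every
# split keeps roughly the same class frequencies, then split that
# portion evenly into validation and test sets.
train_df, rest_df = train_test_split(
    df, test_size=0.2, stratify=df["subject"], random_state=42
)
val_df, test_df = train_test_split(
    rest_df, test_size=0.5, stratify=rest_df["subject"], random_state=42
)
print(len(train_df), len(val_df), len(test_df))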
That imbalance is, again, something you might not want in your dataset, but this is the real world. You can of course use techniques such as oversampling or undersampling to fight it, but for this fine-tuning example I'm going to stick with the original distributions. This is the subject, which we're going to try to predict, and then we have the sentiment. Again, thanks to the stratified sampling, the distribution is kept essentially the way the training set has it. You can see that we have positive, neutral, and negative sentiments, and that the data is somewhat skewed towards neutral and positive news, while negative news is much rarer than both; keep that in mind as well. And this is the subjectivity score, something we're not going to predict, but I'm showing its distribution so you can get a feel for this field too.

The first preprocessing step is to create a dataset from pandas using the Hugging Face Datasets library: I build a dictionary with the train, validation, and test subsets. Then I load the tokenizer for the model we're going to use, in our case TinyLlama; I'm getting the latest checkpoint that is not a chat model, the one trained on 3 trillion tokens. I set a padding (pad) token for the tokenizer: here you can see that I get the tokenizer for the model, add the pad token as a special token, and set the padding side to right. After loading the model itself, I resize the token embeddings to account for the newly added token, padding the embedding matrix to a multiple of eight. You can see that we've added the padding token and that the tokenizer now includes it among its available tokens. This is very important: if you don't have correct padding within the training set, your model will tend to repeat the last couple of words or tokens it generates, so this really helps against repetition. One more thing here: if you're using a GPU that is capable of Flash Attention 2, I strongly suggest you turn it on, but since I'm using the T4 GPU available on the free tier of Google Colab, I'm commenting that line out. So this is how you load the model and the tokenizer.
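Put together, the loading steps described above might look like this; a minimal sketch, where the exact checkpoint name and pad token string are assumptions:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint: the 3-trillion-token, non-chat TinyLlama model.
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.add_special_tokens({"pad_token": "<PAD>"})  # dedicated pad token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
    # attn_implementation="flash_attention_2",  # enable on GPUs that support it; not the T4
)
# Grow the embedding matrix to cover the new pad token, padded to a
# multiple of 8 for better GPU efficiency.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)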
Next, we're going to make sure the number of tokens fits within the context window of our TinyLlama model, which has a context width of 2048 tokens. For our example I'm going to create a format, or template, of my own choosing; it's nothing standard. I put the title, then the text of the news article, and then the prediction, and within the prediction I have the subject and then the sentiment. To see how many tokens we'll need, I count the tokens in each example after formatting it with this template, and the counts are much, much lower than the 2048-token limit: we'll need at most around 200 tokens for the input, so the context window should not be a problem at all. Our examples are very tiny compared to what the TinyLlama model can handle.

While you can fine-tune a TinyLlama in full, 1.1-billion-parameter models are not small by any means, even if the name is TinyLlama. If you have a single GPU, for example the T4 we're going to use in the Google Colab notebook, you might have a hard time fitting the model onto the GPU and fine-tuning it on its own. So I'm going to show you how to use LoRA to fine-tune the tiny LLM, which will even let us increase the batch size we use for training. One important thing to note is that with LoRA you essentially train a small model, called an adapter, on top of the original model: you still have to load the original model into memory, but then you create a much smaller set of parameter matrices and fine-tune only those. When you train models such as Llama 7B this way, you might train roughly 1% of the parameters or even less, but if you do the same with a tiny LLM you end up with only something like 1 to 10 million trainable parameters, which in the general case won't be enough; of course, it depends on the task at hand. As a general starting point I would recommend something like 100 million trainable parameters, which you can tweak from there. To get that for TinyLlama, we're going to increase the rank in the LoRA config to 128, which gives us roughly 8.5% of the original model's parameters for training, and I'm also going to increase the LoRA alpha to 128, to scale the learning rate accordingly rather than change its value.

To start with the training, I set the pad token ID on the model, and on the model config, to the tokenizer's pad token ID, then inspect the model config to double-check that the pad token ID has been properly set, which it has. Next we look at the model architecture, which tells us where to apply the LoRA target modules: in my config you can see that I'm targeting the self-attention projections and the MLP linear layers, essentially all of them. Regarding the rank of the matrices, a shout-out to the Trelis Research YouTube channel, where I saw that he targets tiny LLMs with a much higher number of trainable parameters; thanks to that, I learned that you actually need to increase the rank of the LoRA matrices to fine-tune tiny LLMs well. So here I set the rank to 128 and the LoRA alpha to 128 to scale the learning rate, and I apply a small dropout to the LoRA layers. This is the new adapter model: I declare it a causal language modeling task via the task type and then get the PEFT model on top of the original TinyLlama model by applying the LoRA config. You can see that we end up with roughly 101 million parameters for training, about 8.4% of the model.
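A minimal sketch of this LoRA setup; the dropout value and the exact module-name list are assumptions (module names taken from the Llama architecture):

from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=128,                 # high rank: roughly 100M trainable parameters on a 1.1B model
    lora_alpha=128,        # keep alpha equal to r so the effective scaling stays 1.0
    lora_dropout=0.05,     # small dropout (assumed value)
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    # Target all self-attention projections and MLP linear layers.
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model.config.pad_token_id = tokenizer.pad_token_id
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~101M params, ~8.4% of the model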
Next, I'm going to show you how to train only on the completions, which is something a colleague of mine showed me; thank you for that. Instead of using the whole text for training, you want the loss to be calculated only on the part the model should produce: I'm going to ignore the part that changes within the dataset, the prompt, and compute the loss only on the completion tokens, to measure how well the model is performing. This will drastically reduce the loss values you see. Keep in mind that if you're training for a task like the one here, with short completions, this type of collator does a great job; but if you're training something like an assistant with chats, it might not be a good use case for this data collator.

In our case I use the prediction section as the response template: I encode it and pass the response template token IDs to the collator, along with the tokenizer. Essentially, we tokenize the template ourselves, since without that the collator appeared to be failing, at least for me. Then I take a single example and tokenize it, to show you the labels this collator adds. When I create the data loader and fetch the next batch from it, you can see that we now have input IDs, an attention mask, and a new field called labels. If you look through the batch labels, everything before the template has been given an ID of -100, which means "ignore these tokens"; only the tokens after the template are used for the loss calculation. Since we get a bit of repetition with the "subject" and "sentiment" markers, you could improve the template further: get rid of those markers and print just the two values. That would probably work even better than what we have right now, and your LoRA would perform even better, but I'll leave it as an exercise.

For the training arguments, I use a batch size of four, but I multiply that by four to get an effective batch size of 16 using gradient accumulation. What this does is pass only four examples at a time through the GPU, accumulate the results over four such batches, and then compute the gradient update on top of that. This appears to help with the training: on this single GPU it gave me much lower losses. Then I use a regular AdamW, Adam with the weight-decay fix, from the Torch optimizers. We're not using any quantized optimizer, since we're training in fp16 (16-bit floating point); we don't need QLoRA for these tiny language models. Training this way is very fast and very stable, with very good results, so no quantization on this part. I also set a constant scheduler type, which is a bit redundant since we're not going to use any warm-up here.
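The collator and training arguments just described might be wired up like this; a sketch, where the response template string, output directory, and logging cadence are assumptions:

from transformers import TrainingArguments
from trl import DataCollatorForCompletionOnlyLM

# Tokenize the response template ourselves; matching raw text inside
# the collator can fail depending on tokenization.
response_template = "### Prediction:"  # assumed template marker
response_template_ids = tokenizer.encode(
    response_template, add_special_tokens=False
)
# Everything before the template gets label -100, so the loss is
# computed only on the completion tokens.
collator = DataCollatorForCompletionOnlyLM(
    response_template_ids, tokenizer=tokenizer
)

training_args = TrainingArguments(
    output_dir="experiments",        # assumed folder name
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    fp16=True,                       # no quantization needed for a tiny LLM
    optim="adamw_torch",             # plain AdamW with the weight-decay fix
    lr_scheduler_type="constant",    # redundant without warm-up, as noted
    evaluation_strategy="steps",
    eval_steps=100,                  # assumed cadence
    logging_steps=100,
)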
Another important thing: I'm going to train for just one epoch. Of course, you might want to train for multiple epochs, depending on the dataset size you have. I trained this for roughly 40 minutes, I believe, and if you train for longer you might actually get better results with these tiny LLMs, so it might be worth experimenting with that. Those are essentially the training arguments that we have. Then I have this format_prompts function, which takes a batch of examples and formats each one with the template we've seen thus far; this essentially creates our batches for us. And this is the trainer I'm going to use: I pass in the model, the training arguments, the training and validation sets, the tokenizer, the maximum sequence length (which could be increased, but in our case that's not needed), the formatting function, and the data collator that trains only on the completions.

This is the output of the training, and you can see that the model is performing very well. Here is the evaluation loss from the TensorBoard logs: we start at a relatively high value of 0.15, after 600 steps we're at 0.11, and then below 0.10. In about 26 minutes of training we get this far, which is really good; you can check the tutorial for the full outline. This is my training curve without any smoothing, and you can see that it is, again, generally decreasing. You might argue that we're hitting a plateau right here, but I would say the training went really well. In this table you can see the training loss and the validation loss, and they're fairly similar; the validation loss is actually a bit better in the later iterations, which is surprising but within the realm of what you might get, since the training set is much larger than the validation set, so it might just be randomness.

To save the model, I call save_pretrained on the trainer's model, and within the same folder I save the tokenizer as well, so it keeps its proper configuration.
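Putting the trainer together as described might look like this; a sketch, where the column names in format_prompts, the dataset variable, the folder name, and the maximum sequence length are assumptions:

from trl import SFTTrainer

def format_prompts(examples):
    # Hypothetical template matching the format described above:
    # title, text, then the prediction with subject and sentiment.
    return [
        f"### Title: {title}\n### Text: {text}\n"
        f"### Prediction:\nsubject: {subject}\nsentiment: {sentiment}"
        for title, text, subject, sentiment in zip(
            examples["title"], examples["text"],
            examples["subject"], examples["sentiment"],
        )
    ]

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    max_seq_length=512,          # plenty of room; examples need ~200 tokens
    formatting_func=format_prompts,
    data_collator=collator,      # loss on completions only
)
trainer.train()

# Save the adapter and the tokenizer into the same folder.
trainer.model.save_pretrained("trained-model")
tokenizer.save_pretrained("trained-model")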
To try out the model, at this point I've restarted the Google Colab notebook. What I did here was load the base model in float16, apply the PEFT model on top of it from the same training folder, and then merge the PEFT adapter into the original model. I also load the tokenizer again, which was saved correctly: you can see right here that we have a padding token, the correct padding side, and the correct pad token ID. After that, I set the pad token ID on the model and on the model config once more, just in case, and now we can use our fine-tuned model as a regular Hugging Face Transformers model. I create a text-generation pipeline and pass in the model, the tokenizer, and the maximum number of new tokens, which is only 16, since we already know that our model produces a very small number of tokens for the completion. Then I format each example for prediction: I take the title and the text from the example and append the prediction header, without the prediction itself, so the original subject and sentiment are never passed to the model.
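The load-merge-predict flow described here might look like this; a sketch, with assumed folder, template, and helper names:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("trained-model")

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
# The vocabulary was extended with a pad token before training, so
# resize the base model's embeddings to match before applying the adapter.
base_model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)

model = PeftModel.from_pretrained(base_model, "trained-model")
model = model.merge_and_unload()  # fuse the adapter into the base weights
model.config.pad_token_id = tokenizer.pad_token_id

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=16,  # completions are short: a subject and a sentiment
)

def format_for_prediction(example):
    # Same template as training, but cut off right after the header.
    return (
        f"### Title: {example['title']}\n"
        f"### Text: {example['text']}\n"
        f"### Prediction:\n"
    )

example = {"title": "Some headline", "text": "First paragraph of the article."}
print(pipe(format_for_prediction(example))[0]["generated_text"])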
To keep the output readable, I reduce the verbosity of the logging, and then I look at ten examples. Note that what you see first is the complete text from the example; then I call format_for_prediction with the example itself and print the model's prediction. So this is the first example, "Binance research report reviews...", and in the original data point the subject is NFT and the sentiment is positive. In the prediction you see that the model duplicates the sentiment line; this is relatively common, and we're going to address it in a bit, but the subject appears to be correct, and the sentiment appears to be positive as well. Let's look at another one: subject altcoin, sentiment positive; the model says altcoin, positive, which matches exactly. Then subject Ethereum, sentiment positive; again exactly correct. Next, the label is altcoin, positive, but the prediction was negative; let's look at the title: "Coinbase COO calls for regulation of centralized crypto entities. The demise of FTX has set back crypto by years, and this disaster is likely to steer regulators into action." The label says the sentiment is positive, but I wouldn't exactly agree with that; you can decide on your own, but I think our model is actually predicting a better sentiment than the one in the labels, which is very interesting. Another one: altcoin, positive; correct again. The next subject is altcoin, but the model says Bitcoin, with positive sentiment from both: "Bitcoin's price prediction as BTC breaks through...", "Bitcoin, the world's reserve currency", and yet the label is altcoin. Our model is performing very well indeed; it looks like the labels are not exactly perfect, but the model seems to be doing a good job even though the dataset is not of that high a quality. You can go through a lot of examples and see for yourself.

Next, I'm going to do something a bit different: extract predictions for the complete test set, which is 1,242 examples; this took about 10 minutes. These are the predictions: the title, the text, the true subject, the true sentiment, the predicted subject, and the predicted sentiment; this is the data frame we get. From it I calculate a very rough accuracy for the subject, which according to this calculation is 78.6%. Of course, you might want to go through some examples and decide for yourself whether the model is actually better than the labels. Here is a heat map, a confusion matrix, of the predicted subjects against the real values; you can see some overlap between blockchain and altcoin, but nothing really major. Comparing true and predicted subjects, let's take an example from right here: "AI optimizing crypto exchange functions. Artificial intelligence tools are providing...". The true subject is Bitcoin, but the predicted subject is blockchain; at least from the first couple of words, it appears that, once again, our model is performing better than the labels, though I might be wrong. Go over the title and the text for some examples on your own.

Next, for the sentiment, we have exactly the same calculation, and this time we get just a tiny bit over 90% accuracy on the test set, which is really impressive with such a small dataset. Again, here is the confusion matrix, and again let's look at some examples. "Bad news is good news? Bitcoin plays with USD. Bitcoin reaches its highest target in nearly seven...": here the label is positive, while our prediction is neutral, and I would agree that the label is better than what the model produced. Another headline, about a promise of 100 in Bitcoin and the Coinbase CEO, is labeled neutral, while our prediction is negative; I'm not sure, I'd have to see the title and the text for that one, but even if the label is correct, 90% is very good for such a small training run.

So this is it for this video. You now know how to fine-tune a tiny LLM on your own dataset, how to set up the LoRA configuration correctly for it, and how to save the model after training, apply the adapter on top of the original model, and do some inference with it. In the next video, I'm going to show you how you can take the adapter model, fuse or merge it into the original model, push that to a Hugging Face Hub repository, and from there deploy the model in production behind an API, so we can run inference on a real-world example. Thanks for watching, guys! Please like, share, and subscribe, and join the Discord channel that I'm going to link in the description below. I'll see you in the next one. Bye!
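As a recap of the evaluation step above, the rough accuracy calculation over the saved predictions might look like this; a sketch, where the predictions data frame and its column names are assumptions:

from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed data frame: one row per test example with true and predicted columns.
subject_acc = accuracy_score(
    predictions_df["subject"], predictions_df["predicted_subject"]
)
sentiment_acc = accuracy_score(
    predictions_df["sentiment"], predictions_df["predicted_sentiment"]
)
print(f"subject accuracy:   {subject_acc:.1%}")   # roughly 78.6% in the video
print(f"sentiment accuracy: {sentiment_acc:.1%}") # just over 90% in the video

# Confusion matrix for the sentiment, as plotted in the video's heat map.
print(confusion_matrix(
    predictions_df["sentiment"], predictions_df["predicted_sentiment"]
))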
Info
Channel: Venelin Valkov
Views: 10,207
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning
Id: _KPEoCSKHcU
Length: 31min 41sec (1901 seconds)
Published: Tue Jan 30 2024