Fine-tuning Alpaca: Train Alpaca LoRa for Sentiment Analysis on a Custom Dataset

Video Statistics and Information

Captions
So I need to pass in the input, and then the instruction, which is already recorded; the instruction was "Detect the sentiment of the tweet." "A project with great prospects and opportunities" — this should be a positive sentiment. Let's run it through the model.

Hey everyone, my name is Venelin, and in this video we are going to fine-tune the LLaMA 7 billion parameter model using Alpaca-LoRA. We are going to create a custom dataset that contains tweets with Bitcoin sentiment, convert this dataset from a CSV file into JSON, and use it to fine-tune the LLaMA model. Let's get started.

The dataset that I'm going to use is called BTC Tweets Sentiment and it's on Kaggle; I'm going to link it in the tutorial that will also be available with this video. In here we have BTC Tweets Sentiment. It looks like the authors scraped around 50k tweets, and along with the tweet text they also recorded the sentiment of each tweet. I'm going to use this dataset, but I'm going to apply some preprocessing: in essence, I'm going to remove the retweets and also remove all the tweets that contain a link.

The Alpaca-LoRA repository is actively maintained, and I'm going to show you probably the easiest way to fine-tune the Alpaca-LoRA models, right from the README of the project. You can go there and they have a pretty good example calling finetune.py, where you specify the name of the base model (llama-7b, for the 7 billion parameters), a data path pointing to your dataset, and an output directory. This is pretty much the fastest way to fine-tune the model, but in this video I'm going to show you what is actually behind the fine-tune script. I'm going to take some bits of that script and show you step by step how you can implement the fine-tuning process yourself.

To fine-tune our model we need a dataset, and Alpaca-LoRA uses a somewhat custom format for that, so I'm going to convert the tweets and their sentiment data into a format that is suitable for fine-tuning with Alpaca-LoRA.

I have created a notebook, and in it I've started by listing all of the requirements from the Alpaca-LoRA repo. Here you can see that we have bitsandbytes, the datasets library from Hugging Face, PEFT, which we're going to use for easier and faster training, the latest version of the Transformers library, PyTorch 2.0, SentencePiece, since it is required for the tokenizer, and I'm also installing tensorboardX and Gradio. These are pretty much the requirements from the Alpaca-LoRA repo, but I've pinned their versions.

Other than that, I have these imports, and we're going to use pretty much all of them. Some of the more important ones are the LLaMA tokenizer and the LLaMA model for causal language modeling; we also have PEFT, which again is used for the faster training or fine-tuning. I'm importing some plot styling, and finally I'm checking for a CPU or CUDA device; in this case I'm using a Google Colab notebook that has a CUDA device.

Next I'm going to download the data, which, as I've already mentioned, is preprocessed quite a bit compared to the Kaggle data that I've shown you. We have a CSV file, and I'm going to load it using pandas; it's called bitcoin-sentiment-tweets.csv. Let's see what it contains: it has the date of the tweet, the tweet text, and then the sentiment.
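A minimal sketch of this loading step, assuming the CSV filename and the sentiment column name shown in the video:

```python
import pandas as pd
import torch

# use the Colab GPU if one is available, otherwise fall back to CPU
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# filename assumed from the video; the file has a date, the tweet text, and a sentiment score
df = pd.read_csv("bitcoin-sentiment-tweets.csv")
print(df.head())
print(df["sentiment"].value_counts())  # column name assumed
```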
Let's see what types of values we have for the sentiment and how many of each, so I'm going to call value_counts. You can see that we have roughly equal amounts of neutral and positive sentiment, and the negative ones are much rarer, around one fourth. Let me show you this in a nicer format: now you can visually see how much the positive and the neutral tweets overwhelm the negative ones. So at least at this time the sentiment for Bitcoin was quite positive, or bullish, one might say.

Next I'm going to change the format of the dataset. We are required to pass in a JSON file, so I'm going to convert this CSV into JSON. To do that, I'm going to create a function called sentiment_score_to_name, which converts the scores 0, 1, or -1 into text. It takes the score as a float and returns a string: if the score is greater than zero I return "Positive", if the score is less than zero I return "Negative", and otherwise I return "Neutral". Then I'm going to create the dataset data, which is essentially a list comprehension: I iterate over every dictionary from the data frame, converted with to_dict, and build three keys — the instruction, which is a constant that I'm going to phrase as "Detect the sentiment of the tweet."; the input, which is the text of the tweet; and the output, which is the result of calling sentiment_score_to_name with the sentiment. Let me run this and look at the first row of the data. Okay, we have an instruction, an input, and an output, and this is the dataset format that Alpaca-LoRA requires. We need to save it to a JSON file and use it as the dataset for fine-tuning, so I'm going to call it alpaca-bitcoin-sentiment-dataset.json, open the file for writing, and dump the JSON into it. We should now have a JSON file that contains the instructions, the inputs, and the outputs.

Now that we have a dataset, we need a base model, and in this case Alpaca-LoRA uses the LLaMA 7 billion parameter model, so I need to initialize it and use the pre-trained weights from the Hugging Face repository, both for the tokenizer and the model. This is pretty easy to do. I first specify the name of the base model, which is pretty much the same model that we used in the previous video; this is the repo on Hugging Face, and it is essentially the LLaMA model with 7 billion parameters. Next I initialize the model: I use the LLaMA model for causal language modeling with this base model, and I want the weights to be loaded as 8-bit tensors, since this saves us a lot of memory and probably speeds up the training process. Then I specify the torch dtype, float16, and device_map="auto", which makes sure the model is placed on the GPU. Next I initialize the tokenizer: the LLaMA tokenizer, again with from_pretrained and the base model, and here I change the pad token id to zero; I'm taking this from the fine-tuning script. Then I set the padding side to the left; I'm not sure whether this is the default value.
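A rough sketch of these two steps — converting the data frame into the Alpaca-style JSON file and loading the base model and tokenizer in 8-bit. The column names and the exact Hugging Face repo id are assumptions; substitute the checkpoint used in the video:

```python
import json

import pandas as pd
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# --- convert the CSV into the instruction / input / output format ---
def sentiment_score_to_name(score: float) -> str:
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

df = pd.read_csv("bitcoin-sentiment-tweets.csv")  # filename assumed from the video
dataset_data = [
    {
        "instruction": "Detect the sentiment of the tweet.",
        "input": row["tweet"],                              # column name assumed
        "output": sentiment_score_to_name(row["sentiment"]),
    }
    for row in df.to_dict(orient="records")
]

with open("alpaca-bitcoin-sentiment-dataset.json", "w") as f:
    json.dump(dataset_data, f)

# --- load the LLaMA 7B base model and tokenizer as 8-bit weights ---
BASE_MODEL = "decapoda-research/llama-7b-hf"  # assumed repo id; use the same base model as in the video

model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,            # 8-bit tensors to save memory (requires bitsandbytes)
    torch_dtype=torch.float16,
    device_map="auto",            # put the model on the available GPU
)

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token_id = 0        # taken from the fine-tuning script
tokenizer.padding_side = "left"
```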
Running this goes ahead and starts downloading the weights of the LLaMA 7 billion parameter model, which should take some time, about three and a half minutes. After it is done, the model and the tokenizer are loaded, and they're pretty large, so that's all right.

Next I'm going to load the dataset from the JSON file. We have the model, and for the data files I pass in the alpaca-bitcoin-sentiment-dataset.json file, which should load it. If I have a look at the train data, we see that we have around 1,900 examples in our dataset.

The authors of the GitHub repo have implemented a prompt template, and from here you can see the template text: "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.", followed by the instruction, the input, and the response. In previous versions of this repo there was a function called generate_prompt which contains essentially the same thing, and I'm going to paste it here; I find it much more understandable. It's pretty much the same template, except it accepts a data point that contains the instruction, the input, and the output. I'm going to run this function, and then I'm also going to paste in two other functions provided by the repo and go through them: tokenize and generate_and_tokenize_prompt. Here we have the cutoff length, which I'm going to define as 256; since the context of the model, and therefore the tokenizer, is limited, we're going to limit each example to 256 tokens. What happens here is that we generate a full prompt and then tokenize it: I take the template and run the tokenizer on it with a maximum of the cutoff length, no padding, and no tensor type specified for the return value, with truncation set to true, so anything longer than the cutoff length is simply cut off. Other than that, the authors also add the labels to the result using the input ids, so the labels are just the input ids from the tokenizer. This is pretty standard if you understand the Transformers library and how tokenizers work.

Next I'm going to take the data and split it into train and validation sets. On the train split we've just looked at, I call train_test_split; the test size is going to be 200 examples, I want this to be shuffled, and I set a seed of 42. Let me run this and show what we have so far: the split is done, and now we have 200 examples in the test (or validation) set, and roughly one thousand seven hundred examples in the train set. For the train data I take the train split, shuffle it, and then map generate_and_tokenize_prompt, so every example is run through the function described above, and for the validation data I take the test split and do the same thing.
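Putting these pieces together, a sketch of the prompt template, the tokenization, and the train/validation split; it reuses the tokenizer loaded above and is a simplified version of the helpers from the repo:

```python
from datasets import load_dataset

CUTOFF_LEN = 256  # maximum number of tokens per example

data = load_dataset("json", data_files="alpaca-bitcoin-sentiment-dataset.json")

def generate_prompt(data_point):
    # Alpaca-style template: instruction, input, response
    return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{data_point["instruction"]}

### Input:
{data_point["input"]}

### Response:
{data_point["output"]}"""

def tokenize(prompt):
    result = tokenizer(            # the LlamaTokenizer initialized earlier
        prompt,
        truncation=True,
        max_length=CUTOFF_LEN,
        padding=False,
        return_tensors=None,
    )
    result["labels"] = result["input_ids"].copy()  # labels are just the input ids
    return result

def generate_and_tokenize_prompt(data_point):
    return tokenize(generate_prompt(data_point))

# hold out 200 shuffled examples for validation, with seed 42
train_val = data["train"].train_test_split(test_size=200, shuffle=True, seed=42)
train_data = train_val["train"].shuffle().map(generate_and_tokenize_prompt)
val_data = train_val["test"].map(generate_and_tokenize_prompt)
```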
Now that the dataset is completely prepared, we also need to prepare our model for 8-bit training. We're going to use the LoRA helpers from the PEFT library and do all the necessary configuration in order to start the training process. I'm going to continue by pasting in some configuration constants; some of them are related to LoRA, which we configure by calling LoraConfig. Other than the LoRA-related ones, we have a batch size and a micro batch size; we're going to use a micro batch size of 4, a learning rate which is again taken from the repo, and I'm going to train for 300 steps. You might want to increase this if the results are still improving, but for the purposes of this video I've trained my model for only 300 steps. Of course, if you have better GPUs you might want to train for longer, and the same goes for a much larger dataset. I'm also training on a single GPU, so you might need to adapt this, or use the finetune script from the original repo if you want to train on multiple GPUs.

Next I prepare the model for 8-bit training by calling prepare_model_for_int8_training, and then I create a LoraConfig; this uses the PEFT library from Hugging Face. LoRA is an acronym for Low-Rank Adaptation of Large Language Models, and there is a paper linked from the PEFT repo on Hugging Face; if you're curious, you might want to take a deeper dive into what it is. If you go through the source code, you'll find that LoraConfig essentially extends PeftConfig. Here we have r, the attention dimension, which in our case is going to be 8; the target modules, where we're going to fine-tune only the query and value projections of the Transformer; the alpha parameter, which is used for scaling, at least it looks like that; and the dropout, in our case five percent. These are pretty much the only parameters we're going to use. So, continuing with the notebook: the rank r is 8 in our case — I'm taking all of these values from the finetune script — along with the dropout; I don't want any bias, again taken from the fine-tuning script; and the task type is causal language modeling. I apply this config to the model by calling get_peft_model with the config, and finally I print the trainable parameters. Let's see if we get completions — probably not, and I had a typo — all right, let's run this. As you can see, we are training roughly 0.06 percent of the possible parameters, which number nearly 7 billion, so we are effectively fine-tuning a much smaller model compared to the original LLaMA model.
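A sketch of the LoRA setup described here; r, the target modules, the dropout, and the number of steps follow the video, while the alpha, learning rate, and total batch size are assumed Alpaca-LoRA defaults:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

LORA_R = 8                                   # attention dimension (rank)
LORA_ALPHA = 16                              # scaling factor; assumed default
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["q_proj", "v_proj"]   # only the query and value projections

BATCH_SIZE = 128                             # assumed default from the repo
MICRO_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
LEARNING_RATE = 3e-4                         # assumed default from the repo
TRAIN_STEPS = 300

# `model` is the 8-bit LLaMA loaded earlier
model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=LORA_TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()           # roughly 0.06% of the ~7B parameters
```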
To train the model we are going to use the Hugging Face Transformers Trainer, which accepts a lot of training arguments that we use to configure the training process itself. I start with the training arguments and create an instance of TrainingArguments from Transformers; let's see if we get any help here — probably not, that's all right. We specify the micro batch size as the per-device batch size, then the gradient accumulation steps, which we calculate from the batch size and the micro batch size. I specify the number of warm-up steps for the learning rate, again taken from the Alpaca-LoRA repo, then the number of training steps, in our case 300, and the learning rate. I also want mixed-precision training, logging every 10 steps, and the optimizer to be AdamW (Adam with the weight-decay fix) from PyTorch. I want both the evaluation strategy and the checkpoint saving to be based on steps: evaluate every 50 steps and save a checkpoint every 50 steps. The output directory of the experiment is going to be the output dir, which is "experiments", I want to keep at most three checkpoints, I want the trainer to load the best model at the end, and I want the results to be reported to TensorBoard. All right, let's run this — gradient accumulation steps — okay, this looks all right.

Next I create a data collator for the batches, DataCollatorForSeq2Seq. Here I pass in the tokenizer, pad each batch to a multiple of 8 tokens, set the return tensors to PyTorch, and enable padding. The data collator is done, and next we pass the model, the data, the data collator, and the training arguments into a Trainer: the model, the train dataset (the train data), the eval dataset (the validation data), the training arguments, and the data collator we just created. Let's run this; the trainer looks good.

Next I set the model to not use the cache; this is again taken from the original script. I also replace the model's state dict with the PEFT model state dict: I save the old state dict, then replace state_dict with a function that calls get_peft_model_state_dict with the model and the old state dict, bound back to the model with __get__ — this is taken directly from the repo. Finally, I compile the model using PyTorch 2.0, which will hopefully speed up some of the training. Then I call trainer.train(), and once the training is complete I save the pre-trained model into the output directory. Let's run this and see. I took this part from the original script to be sure I did it correctly, because otherwise I might have introduced some errors. The training might be a bit slow, but you will see the progress start right here; how long it takes to completely train the model depends on your device.

This is the training progress from my model, which I've also uploaded to the Hugging Face model hub, and you can see that the eval loss decreases pretty rapidly, with somewhat of a plateau around step 300. Maybe with more data this would get even better. I've uploaded the model to Hugging Face as alpaca-bitcoin-tweets-sentiment, it's available from my account, and we're going to continue with this model.
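The training setup described above, collected into one rough sketch; it continues from the earlier snippets, and the number of warm-up steps is an assumption taken from the Alpaca-LoRA defaults:

```python
import torch
import transformers
from peft import get_peft_model_state_dict

OUTPUT_DIR = "experiments"

training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    warmup_steps=100,                  # assumed value
    max_steps=TRAIN_STEPS,
    learning_rate=LEARNING_RATE,
    fp16=True,                         # mixed-precision training
    logging_steps=10,
    optim="adamw_torch",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=50,
    save_steps=50,
    output_dir=OUTPUT_DIR,
    save_total_limit=3,
    load_best_model_at_end=True,
    report_to="tensorboard",
)

data_collator = transformers.DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=training_arguments,
    data_collator=data_collator,
)

model.config.use_cache = False

# make state_dict() return only the LoRA weights, as in the original finetune.py
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
).__get__(model, type(model))

model = torch.compile(model)           # PyTorch 2.0; may speed up training

trainer.train()
model.save_pretrained(OUTPUT_DIR)      # stores only the LoRA adapter weights
```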
Now that we have a completely fine-tuned model, we need to have a look at how you can use it. I'm going to use the generate.py script from the original Alpaca-LoRA repo, but with our custom model, and we're going to see if our new model is actually able to detect the sentiment of some Bitcoin tweets. The first thing I need to do is clone the original repo and check out a specific commit, since I want this to be reproducible; this should give you this directory right here, which is the repo at that particular commit. Other than that, I'm going to follow the instructions for generating from the model given in the README, but change them to use our own custom model. Let me close this. To do that, I call the generate.py script: I want it to load in 8-bit, I want the base model to be the one we specified for our fine-tuning, which is this one, for the LoRA weights I want my custom model, and I want a shareable Gradio link. I just need to start this, and it should take some time to first load the LLaMA model and then create the shareable link.

Once the loading of the generate script is complete, you should get a live link, and if you open it you should get this interface — not exactly this one, because I've changed it a bit — Bitcoin Sentiment Analysis with Alpaca. Let's try it with the tweet I showed at the start: I pass in the input, and the instruction is "Detect the sentiment of the tweet." — "A project with great prospects and opportunities". This should be a positive sentiment; let's run it through the model. Sometimes you might get some errors, but in this case it does appear to be a positive sentiment. Let's try another tweet that I took earlier, "get ready to take short positions" — let's see — this one is a neutral sentiment, that's all right. Then one popular tweet among bitcoiners, "BTC 2023 boom market powered by foot" — let's see — again this is a positive sentiment, which is very good. Now I have some live tweets. Let's try this one: "if you think the run up for BTC is over, good luck" — let's see what our model thinks — it thinks this is a negative sentiment. Let's try another one, from Bob Lucas: "did you know that as soon as BTC broke the SP in the Gaussian channel the bull market started" — let's see what our model thinks of this part of the tweet — this is positive. Okay, that actually looks quite impressive.

In this video we've seen how we can fine-tune the LLaMA model using the Alpaca-LoRA repository. We took the long road and went step by step through custom code that is based on the finetune script from the original repo. We fine-tuned the model on a set of tweets with Bitcoin sentiment, we ran the fine-tuned model with the Gradio app from the original repository, and we saw the sentiment predictions on some tweets, including some live tweets as well. Guys, please like, share, and subscribe. Thank you very much for all of the comments, likes, and views on the previous video. I'll also include a complete text tutorial, which will come with a Google Colab notebook. Thanks for watching, bye!
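For reference, a minimal inference sketch that loads the saved adapter on top of the base model and builds the same style of prompt the Gradio demo uses; the base checkpoint id, the adapter path, and the generation settings are assumptions rather than the exact values from generate.py:

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = "decapoda-research/llama-7b-hf"   # assumed; use the same base model as in fine-tuning
LORA_WEIGHTS = "experiments"                   # the adapter saved above, or its Hugging Face repo

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)
model.eval()

def predict_sentiment(tweet: str) -> str:
    prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Detect the sentiment of the tweet.

### Input:
{tweet}

### Response:
"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # greedy decoding for brevity; generate.py uses beam search and more settings
        output = model.generate(**inputs, max_new_tokens=16)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return text.split("### Response:")[-1].strip()

print(predict_sentiment("A project with great prospects and opportunities"))  # expected: Positive
```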
Info
Channel: Venelin Valkov
Views: 21,133
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning
Id: 4-Q50fmq7Uw
Length: 33min 32sec (2012 seconds)
Published: Thu Mar 30 2023