LLAMA-2 🦙: EASIEST WAY TO FINE-TUNE ON YOUR DATA 🙌

Video Statistics and Information

Captions
This is the easiest way you can fine-tune a Llama 2 model on your own dataset. We will do it with a single line of code using the AutoTrain library from Hugging Face, and the best part is that you can fine-tune any model you want using the exact same line of code.

In order to run this locally on your own machine, you will need to install the AutoTrain Advanced package from the Hugging Face GitHub repo; I'll put a link in the description of the video. All you need to do is run the command pip install autotrain-advanced. You will need a Python version greater than 3.8 for this to work, and you will also need an Nvidia GPU in order to fine-tune a model. If you don't have a GPU, you can always try the free Google Colab, and that's exactly what we're going to be doing in this tutorial.

Okay, so let me show you how to run this in Google Colab. First, go to Runtime and make sure that you are using a GPU runtime. After that, simply run the first cell: this will install the autotrain-advanced package as well as the huggingface_hub package. Sometimes it will ask you to restart the runtime after installing, so go ahead and restart your runtime. If you're running this on Google Colab specifically, you probably also want to run autotrain setup --update-torch; if you run it locally, you probably don't need to do this.

After that, we need to provide a Hugging Face token. Go to your Hugging Face account, go to Settings, and click on Access Tokens. Here you can create a new token if you want; I already have a token, so I'm going to copy it, go back to my Google Colab, paste the token in there, and hit Login. You can see that we have successfully logged in to Hugging Face.
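For reference, here is a minimal sketch of what those setup and login cells might look like in a Colab notebook. The package names and the notebook_login helper come straight from the steps above, but treat the cell layout as an approximation rather than a verbatim copy of the notebook shown in the video:

    # Cell 1: install AutoTrain Advanced and the Hugging Face Hub client
    !pip install autotrain-advanced huggingface_hub

    # Cell 2 (Colab only): update torch for the Colab environment,
    # then restart the runtime if prompted
    !autotrain setup --update-torch

    # Cell 3: authenticate with the access token from your Hugging Face settings
    from huggingface_hub import notebook_login
    notebook_login()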
Now, in order to fine-tune the model, we are going to use this single line of code. Don't worry if it looks complicated; I will walk you through it step by step. So let's break this down: this is exactly the same line of code, but I broke it into multiple parts so that it's easier to understand.

We start with autotrain, which is the package we're using. We have the exclamation mark since we are running it within the Google Colab environment; if you are not running it in Google Colab but in a terminal, just remove it. Next we have the llm flag. The AutoTrain package is not limited to fine-tuning large language models only; it's a package from Hugging Face which you can use to fine-tune other types of models as well, including computer vision models or neural networks on tabular datasets, so we need to specifically tell it that we want to fine-tune a large language model. Next we have the train flag, which tells it that we want to train our model. After that we need to provide a project name; in this specific case I'm calling my project llama2-openassistant, because I'm going to be using a dataset from Open Assistant to fine-tune my model.

Next we need to define which model we want to fine-tune or retrain, and that is set with the model flag. In this specific case I'm using Llama 2 7B, and I'm using a sharded version because I don't have enough VRAM to load the whole model at once in my Google Colab notebook. Keep in mind that this method is not limited to fine-tuning Llama models; you can literally pick any model available on Hugging Face and use the exact same code to fine-tune it. Another thing to keep in mind is that the sharded checkpoint is not the official version: somebody was kind enough to create a sharded version from the original Llama model weights, and a lot of people have done this, so you can pick any one you want.

The next thing we need to take care of is which dataset we want to use to fine-tune this model. We provide the name of the dataset, or the path to the dataset, using the data_path flag. You can provide datasets in two different ways. The easiest one is to simply upload your data to Hugging Face and use the repo id of that dataset; if you look here, I'm using the Open Assistant Guanaco dataset. Or, if your dataset is present locally and you're running this locally, you can simply provide the path of the folder in which the dataset is located. It expects the dataset in the form of a CSV file, but the path you pass is the path to the folder; you don't want to include the file name when providing the path.

Now, the only thing you need to worry about is the format of the dataset, so let's look at two different example datasets available on Hugging Face that you can use with this approach. The first one is the alpaca GPT-4 dataset, and it has 52,000 rows, or 52,000 examples. If you look at this dataset, it has four columns: instruction, input, output, and then a text column which basically combines the other three columns with a system message wrapped around them. So if you look here, there's a special token for the instruction, which is the instruction from the first column, and then, since there is no input in this specific example, there is a special token for the response, followed by the response the model is supposed to generate. So you have the input or instructions, and what the output from the model should be. We can look at some other examples as well: here you have all three columns present, and the text column simply combines everything together. This is the typical alpaca dataset format, and the trainer expects a single column that puts everything together; in this case it is simply called the text column. You can call it whatever you want, but by default the trainer looks for a column named text.

Here is another example, and this is the one that we're using: the Open Assistant Guanaco dataset. It has almost 10,000 samples or examples, and it follows a very similar format. There's a special token for the human turn, and then the response from the assistant. You can easily see this format throughout the dataset: you have the human input, and then a corresponding assistant response. You want to format your dataset in exactly the same way. It doesn't matter what you call these roles, but you want to use these special tokens so the model can tell what is the input and what is the output.
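To make that format concrete, here is a small, hypothetical illustration (not from the video) of preparing a local CSV with a single text column in the Guanaco-style ### Human / ### Assistant layout. The folder name, the file name train.csv, and the toy examples are my own assumptions:

    # Hypothetical example: build a local train.csv with a single "text" column
    # in the "### Human: ...### Assistant: ..." style used by the Guanaco dataset.
    import os
    import pandas as pd

    examples = [
        "### Human: What is fine-tuning?### Assistant: Fine-tuning continues "
        "training a pretrained model on a smaller, task-specific dataset.",
        "### Human: Name one parameter-efficient method.### Assistant: LoRA "
        "(low-rank adaptation) trains small adapter matrices instead of all weights.",
    ]

    # By default the trainer looks for a column named "text"; point --data_path
    # at the folder containing this file (here "data/"), not at the file itself.
    os.makedirs("data", exist_ok=True)
    pd.DataFrame({"text": examples}).to_csv("data/train.csv", index=False)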
One more thing that I want to highlight: the format we're looking at here is very different from the original prompt template of Llama 2. If you look at the prompt template of the Llama 2 chat models, here is the format that they use, but the format these datasets come in, or the format you will use for your own dataset, is going to be very different. The reason is simple: we are actually fine-tuning the base model, not the chat model, and since the base model does not follow that specific prompt format, you get to decide what your data is going to look like.

After that, we define the text column to use. By default the trainer looks for a column named text, but you can call it whatever you want and pass that name in here. After that we use the use_peft flag, which means the model is going to be fine-tuned with parameter-efficient fine-tuning (PEFT) methods, an approach packaged by Hugging Face, and we're using it in 4-bit, so we're not using full precision. After that we define the learning rate: the learning rate basically controls the speed of convergence during the training process; you can set it to a lower value so that training takes longer but converges better.

The rest of the parameters you can set depending on the hardware that you have. You can set the train batch size; since I'm running this on Google Colab, I'm using a value of 2. For smaller datasets I have found that a value of 4 seems to work, but it really depends on how many GPUs you have and how much VRAM each GPU has, and you can select it based on that. After that we need to define the number of training epochs; I'm using 3, but again, depending on the dataset, you probably want to use a higher number here. The trainer that we're using is sft, which is supervised fine-tuning: essentially we're providing the dataset in an input-output format, and that's why we're using the sft trainer.

Next we define the model max length. If you recall, the new Llama 2 models have a context window of 4,096 tokens, but I'm defining a much shorter context window in order to speed up the training process; none of the examples in this dataset are longer than 2,048 tokens, and that's why we're simply using a much shorter max length.

Now, if you want to push this to the Hugging Face Hub, to your own account, you can use the push_to_hub flag. For that, you will need to provide the repo id; in my account I'm going to be creating another repo, or model, called llama2-openassistant, and once the training process is complete it will simply push the model to that specific repo id. Keep in mind that since we're fine-tuning a pretty large language model, it will take a long time for the model to appear in your Hugging Face account; you can expect at least an hour or so, depending on the internet speed that you have, for it to show up, so be mindful of that. Then we simply define the block size, and we also want to store all the training logs. So that's how you fine-tune a large language model on your own training dataset with a single line of code.
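Putting all of those flags together, the single line of code described above looks roughly like the sketch below, written as a terminal command split across lines for readability (in Colab you would prefix it with ! and can keep it on one line). The flag spellings follow the AutoTrain Advanced releases from around the time of this video and may have changed in newer versions, so check autotrain llm --help against your installed version; the sharded base-model repo, the repo id, and the learning-rate and block-size values are placeholders or commonly used defaults, not confirmed values from the video:

    autotrain llm --train \
        --project_name llama2-openassistant \
        --model TinyPixel/Llama-2-7B-bf16-sharded \
        --data_path timdettmers/openassistant-guanaco \
        --text_column text \
        --use_peft \
        --use_int4 \
        --learning_rate 2e-4 \
        --train_batch_size 2 \
        --num_train_epochs 3 \
        --trainer sft \
        --model_max_length 2048 \
        --push_to_hub \
        --repo_id your-username/llama2-openassistant \
        --block_size 2048 > training.log

The trailing redirect simply saves the console output to a log file, which is the "store all the training logs" part mentioned above.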
So in this case, if you watch it going through the training process, it's going through the tokenization of the dataset. I selected a relatively large dataset, so this is going to take a while; it seems like I'm going to be here for a while, so I probably won't be able to include the final model. However, I want to show you what you actually see during the training process. It will create a project folder with the same name as our project name, and it will keep track of whatever is happening during training. Right now it's going through the tokenization step, but once training is complete you will see a config.json file as well as the tokenizer and model files in this folder, and then you can use the Transformers library to load the tokenizer and the model and do inference or prediction on top of them. Or, since we enabled pushing to Hugging Face, you will be able to see the model in your Hugging Face account after a while, and again you can use the Transformers package to download the model and start using it on your local machine (a minimal loading sketch follows at the end of this transcript).

So that's how you fine-tune a large language model on your own dataset using this powerful AutoTrain package; the only thing you actually need to do is provide your own dataset. In an upcoming video I will show you a process for creating your own datasets rather than using the ones available on Hugging Face, and for that we're probably going to be using GPT-4 along with the Code Interpreter, so keep an eye out for that video. Just to reiterate, you will actually need a powerful GPU for this to work, because if you're running this on a free Google Colab it will probably time out on you unless you have a very small dataset.

If you run into any issues or you want to discuss different fine-tuning methods, check out the Discord server in the description of the video; we have a very active community where people are helping each other out. If you need help with fine-tuning your own model or would like to discuss it with me, check out my Calendly link in the description of the video. I hope this video was helpful; consider subscribing to the channel and liking the video. Thanks for watching, and as always, see you in the next one.
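As mentioned above, here is a minimal inference sketch for after training finishes. It assumes the run produced a PEFT/LoRA-style adapter (because of the use_peft flag) and that it was pushed to the placeholder repo id used earlier; the base-model repo is the same example sharded checkpoint, so swap in whatever you actually trained on:

    # Minimal sketch: load the base model, apply the fine-tuned adapter, and generate.
    # "your-username/llama2-openassistant" is the placeholder repo_id from the command above.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_id = "TinyPixel/Llama-2-7B-bf16-sharded"      # example sharded base checkpoint
    adapter_id = "your-username/llama2-openassistant"  # repo the adapter was pushed to

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
    model = PeftModel.from_pretrained(model, adapter_id)

    # Prompt in the same ### Human / ### Assistant format the model was fine-tuned on.
    prompt = "### Human: What is supervised fine-tuning?### Assistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))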
Info
Channel: Prompt Engineering
Views: 36,588
Keywords: prompt engineering, Prompt Engineer, finetune llama on custom dataset, how to finetune llama, finetune llama on google colab, how to finetune llm, llama finetuning on local machine, llama v2 instruction finetuning, llama finetuning, llama fine tuning, falcon finetuning, llama v2 finetuning, llm training custom dataset, llm finetuning, how to train llm, autotrain llm, autotrain llm training, artificial intelligence, autotrain advanced, llama google colab, machine learning
Id: LslC2nKEEGU
Length: 14min 16sec (856 seconds)
Published: Sun Jul 30 2023