Mistral: Easiest Way to Fine-Tune on Custom Data

Video Statistics and Information

Captions
The real power of open-source large language models comes from the ability to fine-tune them on task-specific datasets, and Mistral 7B is one of the best options if you're looking to fine-tune a small large language model. In this video I will show you how to fine-tune Mistral 7B on your custom dataset. I'll also show you how to correctly format your dataset for fine-tuning, and at the end of the video I'll show you an alternative you can use to fine-tune large language models on your own data if you don't have access to powerful GPUs. So let's get started.

I'm using the Pro version of Google Colab with an A100 GPU that has 40 GB of VRAM. Later in the video we are going to look at Gradient, a platform you can use to fine-tune and serve your LLMs; they are the sponsor of today's video. The credit for this notebook goes to the AI Makerspace YouTube channel; they have an awesome channel and I recommend everybody check it out.

First and foremost, we need to install all the required packages: Transformers, TRL, Accelerate, PyTorch, and bitsandbytes. We will use PEFT for training the model, and we're going to use a dataset from Hugging Face, but I'm going to show you how to format your own dataset as well.

Once the installation is complete, we load the instruct-v3 dataset from MosaicML. They were the creators of the MPT models, which were some of the early open-source large language models. Let's try to understand how the dataset is structured. The dataset has three columns: the first column is the prompt (the user input), then there is the model response, and then there is a source column, which we will look at to learn how the data is formatted. There are two splits: a training set and a separate test set.

In terms of composition, this dataset is a combination of nine other datasets. It contains Databricks Dolly 15K combined with the Anthropic Helpful and Harmless dataset, plus a few other datasets listed here, including competition math, the well-known GSM8K, SummScreenFD, and Spider. For this video, though, we're only interested in the Dolly section of the dataset.

Now let's understand the structure of the data. First we download the dataset and call it instruct_tune_dataset. If you look at the dataset dictionary, there are two components, a train split and a test split, and each contains the three columns. The source column defines where the data comes from, so it will be one of those nine distinct datasets, and we're going to combine the prompt and response together in a specific prompt template that works with Mistral 7B.

First we filter the data down to just the Dolly helpful-and-harmless subset. We do that with a lambda function that checks the source column of each sample; only samples belonging to the Dolly helpful-and-harmless source are included in the final data, and samples from the other eight sources are simply ignored. As a result, our train set now has about 34,000 examples and our test set has about 4,700 examples; the original unfiltered dataset had around 57,000 examples in the train set and around 6,800 in the test set.
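For reference, here is a minimal sketch of the loading-and-filtering step described above. It assumes the dataset is mosaicml/instruct-v3 on Hugging Face and that the Dolly/HH subset is labelled "dolly_hhrlhf" in the source column; the exact label may differ from the notebook.

```python
# Load the MosaicML instruct-v3 dataset and keep only the Dolly/HH samples.
# The source label "dolly_hhrlhf" is an assumption; check the dataset card.
from datasets import load_dataset

instruct_tune_dataset = load_dataset("mosaicml/instruct-v3")

instruct_tune_dataset = instruct_tune_dataset.filter(
    lambda example: example["source"] == "dolly_hhrlhf"
)

print(instruct_tune_dataset)  # DatasetDict with the filtered train and test splits
```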
We are not going to train the model on all 34,000 of these examples, because that would take a while and cost a lot of money. Instead, we use a subset: I simply select the first 1,000 examples from the train set and the first 200 examples from the test set, so now we have 1,000 training examples and 200 test examples.

However, I don't want to use the format the data originally comes in, so let's go back and look at the data again. Here is an example from the dataset. At the top there is a system prompt, which states: "Below is an instruction that describes a task. Write a response that appropriately completes the request." Then the instruction is a question, and the model generates a response based on that question. We want to reverse this: we provide the response text as the input, and we want the model to generate an appropriate question for that text. So the text becomes the input to the model, and the question it produces becomes the model's response.

With that in mind, we create a single column that combines both the prompt and the model response, and this is exactly how you want to structure your own dataset as well. The prompt template we use looks like this: there is an instruction, which states "Use the provided input to create an instruction that could have been used to generate the response with an LLM." Then there is a special marker for the input; this is where we put the text, i.e. the answer the model should use to generate a question. Then there is another special marker, followed by the question that was originally used to generate that response. This is how you want to structure your own datasets.

Here is a Python function that does exactly that. It starts with the beginning-of-sequence token and handles the original system message from the dataset. If we look at the examples that belong to the Dolly dataset, they all have that system instruction at the start of every prompt. We want to replace the original system message with our new system message ("Use the provided input to create an instruction that could have been used to generate the response with an LLM"). So, given a sample from the dataset, we strip out the original system message, the original instruction marker, and the original response marker by replacing them with empty strings, and we take the response from the training sample to use as the input.

Then we build the full prompt: start with an empty string, append the beginning-of-sequence special token, then the special marker for the instruction, then our new system message, then the special marker for the input followed by the input itself (the original model response), then the special marker for the response followed by the original question, as shown in the sketch below.
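Here is a rough sketch of the subsetting and the create_prompt function along the lines described above. The marker strings ("### Instruction:", "### Input:", "### Response:") and the wording of the original system message are assumptions based on the format discussed in the video; adjust them to match your own data.

```python
# Take a small subset, as in the video: first 1,000 train / 200 test examples.
train_data = instruct_tune_dataset["train"].select(range(1000))
test_data = instruct_tune_dataset["test"].select(range(200))

# Rebuild each sample into a single prompt string: new system message, the
# original model response as the input, and the original question as the
# response target. Marker strings are illustrative assumptions.
ORIGINAL_SYSTEM_MESSAGE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
NEW_SYSTEM_MESSAGE = (
    "Use the provided input to create an instruction that could have been "
    "used to generate the response with an LLM."
)

def create_prompt(sample):
    # Strip the original system message and markers so only the bare
    # question remains; it becomes the new response target.
    question = (
        sample["prompt"]
        .replace(ORIGINAL_SYSTEM_MESSAGE, "")
        .replace("### Instruction:", "")
        .replace("### Response:", "")
        .strip()
    )
    input_text = sample["response"].strip()  # the original answer becomes the input

    full_prompt = "<s>"  # beginning-of-sequence token
    full_prompt += "### Instruction:\n" + NEW_SYSTEM_MESSAGE + "\n\n"
    full_prompt += "### Input:\n" + input_text + "\n\n"
    full_prompt += "### Response:\n" + question
    full_prompt += "</s>"  # end-of-sequence token
    return full_prompt

# Example: inspect the prompt built from the first training sample.
# print(create_prompt(train_data[0]))
```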
Finally we add the end-of-sequence token. Here we can look at an example: we pass the first training sample as input, and in the output you can see the instruction has been updated to our new system message, "Use the provided input to create an instruction that could have been used to generate the response with an LLM," so that looks correct. Then we have the input, which is a small paragraph, and the response, which in this case is the original question that was asked to generate the text provided as input. So this seems to be working.

We just looked at an example of the output of the create_prompt function, but we haven't applied it to the dataset yet. You could transform the whole dataset using the Python map function: just pass the create_prompt function to it, and it will map every training and test example to the new format. However, instead of doing this manually here, we're going to do it as part of the training process; I'll show you that in a bit.

Okay, after the data preprocessing we are all set to train our model. We're going to use the Transformers package for training. Even though we are running on an A100 GPU with 40 GB of VRAM, fine-tuning the model in full 32-bit precision would need a lot more VRAM than that, so we load the model in 4-bit. That means the weights take only a fraction of the memory that 32-bit precision would need; for the actual computations and weight updates we upcast to 16-bit so we don't lose too much information. Even with this, I saw VRAM consumption go up to 32 GB during training, and not everybody has access to something like an A100. Keep in mind I'm not loading the model and updating the weights in different shards here; you can do that as well (I have a video on that), and in that case you could potentially use something like a T4 GPU, but the performance of that model may not be as good as doing the updates in 16-bit.

Now, if you're resource-constrained when it comes to training LLMs, that's where the sponsor of today's video comes in. Gradient is a platform that lets you fine-tune your own large language models, and it also lets you serve them through an API so you can access them anywhere you want. All you need to do is provide your dataset and they take care of the infrastructure. They have a number of large language models to choose from, including Llama 2 and Nous Hermes Llama 2, and a very powerful Python SDK: simply choose the base model you want to fine-tune on your own dataset, set up the hyperparameters for training, provide your dataset in JSON format, and let the system take care of everything else. Another great thing about them is their embeddings API, which serves an open-source embedding model you can use in your own RAG application, and it integrates with both LangChain and LlamaIndex, so with a single API call you can bring your custom fine-tuned model into LangChain or LlamaIndex. Do check them out; I'll be making a lot more content on them. Now back to the video.

Next we load the model we want to fine-tune. For that we use AutoModelForCausalLM with a pretrained, instruct fine-tuned Mistral 7B.
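A sketch of the 4-bit model loading step might look like the following. The exact checkpoint name (mistralai/Mistral-7B-Instruct-v0.1) and the quantization settings are assumptions; the video only specifies an instruct fine-tuned Mistral 7B loaded in 4-bit with 16-bit compute.

```python
# Load Mistral 7B Instruct with 4-bit weights and bfloat16 compute.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in 16-bit
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let Accelerate place the model on the available GPU
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
```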
Keep in mind I'm not using the base version of Mistral but rather the instruct fine-tuned version. The reason is that the base models are simple next-token prediction models, so to fine-tune one on an instruct dataset the way we've set things up, we would need a much bigger dataset than a thousand examples for it to work well. The instruct fine-tuned version is already trained on question-answer pairs, so it's a lot easier to fine-tune it with new knowledge. We load this onto our GPU and let the device map decide which GPU to use; since there's a single A100, we won't run into any issues, and we load the model in 4-bit. Apart from the model, we also need the corresponding tokenizer, and that's how we load it.

Before fine-tuning the model on our own dataset, let's look at how the base model we just loaded responds to our new prompt template. Here's the prompt: we put the instruction at the beginning, "Use the provided input to create an instruction that could have been used to generate the response with an LLM," and here is our input text: "There are more than 12,000 species of grass," and so on. We were expecting a single line, a question, but the model generated a fairly detailed response, even though we gave it a specific instruction to just produce an instruction that could have been used to generate the text. So you can already see the model is good, but it's not doing exactly what we want it to do, and that's where fine-tuning comes in.

Next we look at the actual training part. For training we use LoRA, which is implemented in the PEFT library from Hugging Face. To understand the concept of LoRA, look at this simple diagram. The idea is that the weight updates needed when fine-tuning a model like Mistral 7B are largely redundant (they have low rank), and LoRA exploits that to reduce the number of trainable weights. LoRA stands for Low-Rank Adaptation of large language models, and this is the paper that introduced the concept. To keep the description brief: instead of updating the original weights of the model during fine-tuning, we attach much smaller weight matrices and update only those; later on we merge the two together to get the merged, fine-tuned weights. This technique saves a lot of the VRAM you would usually need to fine-tune these large language models.

Here we define all the configurations for LoRA, then we apply them to our model. We use prepare_model_for_kbit_training, a function from the PEFT package from Hugging Face, and then attach the LoRA adapters, so we have a new model with LoRA applied, but we haven't trained the adapters yet; see the sketch below.

Next we set a few more hyperparameters for training. Again, as I said at the beginning of the video, this is based on a pretty awesome notebook from AI Makerspace, so do check out their channel. Here we provide the output directory where the trained model will be stored. When training, you can either specify the number of epochs you want to run or the number of steps; I'll explain the difference in a bit.
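Before moving on to the trainer settings, here is a minimal sketch of the LoRA setup with the PEFT library. The rank, alpha, dropout, and target modules are illustrative assumptions, not the exact values used in the notebook.

```python
# Attach LoRA adapters to the 4-bit model; only the adapters will be trained.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    r=16,              # rank of the low-rank update matrices (assumed)
    lora_alpha=32,     # scaling factor for the LoRA updates (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)

model = prepare_model_for_kbit_training(model)  # make the k-bit model trainable
model = get_peft_model(model, peft_config)      # wrap it with the LoRA adapters
model.print_trainable_parameters()              # only a small fraction is trainable
```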
You also need to select the batch size, which is the number of examples used at once to train the model. There are some other parameters, but the most important ones to keep track of are the learning rate, which controls the convergence and speed of training, and the number of epochs or steps you train for, because if you train for too long on a small dataset you will probably overfit the model.

Let me explain the difference between epochs and steps. An epoch is one iteration in which the whole training set is processed once. So if you're using a training set of a thousand examples with a batch size of four, it takes a total of 250 steps to pass the whole dataset through the model once, and that is one epoch. The number of steps is simply how many batches you pass through the model, and we set it to a small number here just to save on compute cost. So you can either go with full epochs or a sub-epoch run controlled by the maximum number of steps. Then, every 20 steps, we evaluate the performance of the model on the test set that is available to us.

All right, this is the part where the actual training takes place. We use the supervised fine-tuning trainer (SFTTrainer) from Hugging Face. Here we provide our model (keep in mind we added the LoRA adapters to it), then the PEFT configuration we already defined, and the maximum sequence length. Even though Mistral 7B supports a much larger context, we limit it to a smaller sequence length, because both the input we provide and the response it is supposed to generate fit within about 2,000 tokens. Then we provide our tokenizer. Now here's the most important part to consider: as I said at the beginning, you can use the map function to convert your dataset to the new format, or you can pass the create_prompt function as the formatting function of the SFTTrainer, which applies the prompt template on the fly as the data goes through the trainer. Then we provide our training set and our test set, as shown in the sketch below.

After that you can simply run the training process. Here you can see that both the training and the validation loss are going down, which is a good indicator that the model is actually learning, and this run covered about 0.4 epochs, since the 100 steps we defined are roughly 0.4 of the 250 steps it would take to complete a single epoch. At the end we store the model to a local directory.

You can also push this model to the Hugging Face Hub if you want to use it later. You provide your credentials: if you run this block it asks you for a token, you provide that, and then you give the repo ID you want to push to. Note that it only pushes the LoRA adapters, so you will need to merge those into the original model if you want to use it again.

So let's test the model. Here's a function that generates a response; I pass in both the prompt and the model, and this is the merged model we're using. The input text describes how to make guacamole, and the model's response is "How do you make guacamole?"
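Putting the training pieces together, a sketch of the trainer setup could look like this. The hyperparameter values are illustrative, and the argument names follow the older TRL-style API from around the time of the video (newer TRL versions move max_seq_length and similar options into SFTConfig). Since the LoRA adapters were already attached in the sketch above, the PEFT config is not passed to the trainer again here, although the video supplies it there as well.

```python
# Supervised fine-tuning with TRL's SFTTrainer, formatting prompts on the fly.
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="mistral-7b-instruct-generator",  # hypothetical output directory
    per_device_train_batch_size=4,
    learning_rate=2e-4,          # illustrative; adjust if training is too slow
    max_steps=100,               # sub-epoch run: ~0.4 of the 250 steps per epoch
    evaluation_strategy="steps",
    eval_steps=20,               # evaluate on the test split every 20 steps
    logging_steps=20,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,                   # already wrapped with the LoRA adapters
    args=training_args,
    train_dataset=train_data,      # 1,000-example subset
    eval_dataset=test_data,        # 200-example subset
    formatting_func=create_prompt, # builds the full prompt on the fly
    max_seq_length=2048,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("mistral-7b-instruct-generator")  # saves the LoRA adapters
```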
So it's actually working pretty well with the relatively small dataset we used for training, and that's how you fine-tune a Mistral 7B model on your own dataset.

Just to recap: you need to be careful about the preprocessing of your own dataset; that, I think, is the most important part. The rest of the parameters we saw in this video are pretty much the same ones you can reuse. You can also play around with the learning rate if the training is too slow, and with the number of epochs, which depends on how big your dataset is, because training will take quite a while on a large dataset. I hope you found this video useful. Let me know if you want me to make more content like this and go into a lot more detail on the training process; I would love to do that. I already have some videos on how to format datasets using GPT-4, so I'll put a link to that video in the description. Thanks for watching, and as always, see you in the next one.
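To round off the walkthrough, here is a sketch of the final test step mentioned above: merging the LoRA adapters back into a full-precision copy of the base model and generating a question from an input paragraph. The checkpoint name, adapter directory, and prompt text are illustrative assumptions.

```python
# Merge the trained LoRA adapters into the base model and generate a response.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mistral-7B-Instruct-v0.1"    # assumed checkpoint
adapter_path = "mistral-7b-instruct-generator"    # assumed adapter directory

base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
merged_model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = merged_model.merge_and_unload()    # fold the adapters into the weights

tokenizer = AutoTokenizer.from_pretrained(base_id)

def generate_response(prompt, model, max_new_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

prompt = (
    "### Instruction:\nUse the provided input to create an instruction that "
    "could have been used to generate the response with an LLM.\n\n"
    "### Input:\nMash ripe avocados, then mix in lime juice, diced onion, "
    "cilantro, and salt to taste.\n\n### Response:\n"
)
print(generate_response(prompt, merged_model))
```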
Info
Channel: Prompt Engineering
Views: 23,637
Keywords: prompt engineering, Prompt Engineer, Mistral-7B, Fine-tune Mistral 7B, Easiest Way to fine tune, LLMs
Id: lCZRwrRvrWg
Length: 22min 5sec (1325 seconds)
Published: Mon Dec 11 2023