Fine-tune Mixtral 8x7B (MoE) on Custom Data - Step by Step Guide

Video Statistics and Information

Captions
This is probably one of the most requested videos on my YouTube channel. Today we're going to look at how to fine-tune Mixtral 8x7B, the mixture-of-experts model from Mistral AI, on your own dataset. The idea behind mixture of experts, or MoE, has gotten a lot of traction recently because it's speculated that GPT-4 is potentially a mixture of experts. Mixtral, the mixture-of-experts model from Mistral AI, is also really good for its size and is able to beat GPT-3.5 on a whole bunch of benchmarks. So in this video I'll show you how to fine-tune this model on your own dataset. We're not only going to look at the code; I'll also talk about some of the practical things you need to consider when fine-tuning any LLM, or any machine learning model for that matter.

First and foremost, you will need around 60 to 65 GB of VRAM to fine-tune this model, even if you're using LoRA or QLoRA. Thanks to my friends at SX AI I got access to an H100, so I thought I'd just use that. Let's get started. First we need to install all the required packages: Transformers, TRL for training the model, Accelerate, PyTorch, and bitsandbytes. For training we're going to use a Hugging Face dataset, but I'll show you how the data is formatted so you can substitute your own dataset if you want to fine-tune this model. We're also going to use flash attention. If you want to train or fine-tune this model yourself, you'd probably need something like four T4 GPUs, which comes out to around 64 GB of VRAM.

Next, we need a dataset to train the model. Again, I'm using the MosaicML instruct-v3 dataset; this is the same dataset I used in another video where I showed you how to fine-tune the original Mistral 7B model. Let's have a quick look at it. The dataset has two splits, train and test, and there are three columns: the prompt, which contains a system message and the actual user input; the corresponding response; and a source column showing where each example comes from, since this dataset draws from multiple sources — you can see Dolly, competition math, Chain of Thought GSM8K, and so on. I discussed this dataset in a lot more detail in my previous video, so if you're interested I highly recommend watching that.

Let's have a closer look at the data. There are about 56,000 examples in the train set and around 6,800 examples in the test set, with the three columns prompt, response, and source. Most of the data comes from one specific source — I believe it's Dolly plus Anthropic's harmless/helpful data. When I filter for that specific source, we get 34,000 examples in the train set and around 4,700 in the test set, and this subset is the data we're going to use to fine-tune our model.
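As a minimal sketch of that setup, something like the following should work. The package list mirrors what's mentioned above; the exact source label ("dolly_hhrlhf") is my assumption about how the Dolly / Anthropic HH subset is tagged inside mosaicml/instruct-v3, not something shown in the video.

```python
# pip install -q -U transformers trl accelerate bitsandbytes datasets flash-attn

from datasets import load_dataset

# Load the MosaicML instruct-v3 dataset; it ships with train and test splits,
# each with "prompt", "response", and "source" columns.
dataset = load_dataset("mosaicml/instruct-v3")

# Keep only the Dolly / Anthropic HH subset (~34k train / ~4.7k test examples).
train_ds = dataset["train"].filter(lambda ex: ex["source"] == "dolly_hhrlhf")
test_ds = dataset["test"].filter(lambda ex: ex["source"] == "dolly_hhrlhf")

print(train_ds.num_rows, test_ds.num_rows)
```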
Next, in order to fine-tune Mixtral 8x7B, we need to format our data and bring it into a single column, using a specific format that we'll call the prompt template. If you look at an example from the dataset, the prompt contains a system instruction and then what the user wants from the model — basically a question — plus some special tokens, and then there's the response the user is supposed to get back from the model. The Mixtral 8x7B Instruct version, however, follows a different format: you have special tokens at the beginning, then the system message and user input, then another special token that marks the end of the user input; the model then generates a response, followed by a special token that marks the end of the model response. If you're fine-tuning the Instruct version of Mixtral 8x7B, you need to provide your data in this format, but if you're fine-tuning the base model, which is a next-word-prediction model, you have the freedom and flexibility to define your own prompt template. In my case, even though I'm fine-tuning the base version of Mixtral 8x7B, I will still follow the prompt template that Mistral recommends.

This function reformats the initial data into the prompt template we want. To make it a bit more challenging, we will also rearrange the data: instead of providing a system message and a question from the user and getting a response from the model, our system message is going to state, "Use the provided input to create an instruction that could have been used to generate the response with an LLM." So basically we are providing text as input and asking the LLM to generate the question that a user could have asked to get that text. Here is an example of the reformatted data: this is the initial instruction we're giving, then the text we're providing to the model, and the model is supposed to generate a question based on that text.

Next we need to load our base model, but before that, as I mentioned in my previous video, this notebook is based on the amazing work done by the folks at AI Makerspace — do check out their channel, the link is in the description. As I said, we are using the base version, so this is not the Instruct version. Next we set some configuration: we're going to load the model in 4-bit, but for compute we're going to keep it in 16-bit. One comment in the notebook doesn't apply here because we're using the base model, although to fine-tune the base model properly we will need a lot more data than when fine-tuning the Instruct version. Next we load the actual model: we provide the model ID, we use all the GPUs available on the system, and we make use of flash attention 2 — if your GPU supports flash attention, make sure to enable it, because it will speed up your training.

After that, we load the tokenizer and set the end-of-sequence token. Not all sequences are going to have the same length, so we also need to pad the ones that don't reach the maximum number of tokens, or max sequence length. The neural network — the Transformer architecture in this case — is performing mathematical computation, so you want your data in matrix form, and for that every row needs to have exactly the same dimension, i.e. the same number of columns. So if your max sequence length is 200 tokens but some examples have only 150 tokens, you pad those examples so their length is 200, and the matrix computation becomes much more efficient. Now, the question is: how do you determine the max sequence length? To explain that, we're going to look at an example later in the video.
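Here's a rough sketch of those steps — the prompt template, the 4-bit loading configuration, and the tokenizer setup. The exact template string, the quantization settings, and the `attn_implementation` argument are my assumptions, not copied from the notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mixtral-8x7B-v0.1"  # base (next-word-prediction) model

SYSTEM_MSG = (
    "Use the provided input to create an instruction that could have been "
    "used to generate the response with an LLM."
)

def create_prompt(example):
    # Mixtral-style template. The roles are deliberately swapped: the model is
    # shown the response text and must produce the instruction behind it.
    return (
        f"<s>[INST] {SYSTEM_MSG}\n\n{example['response']} [/INST] "
        f"{example['prompt']}</s>"
    )

# Load the base model in 4-bit while keeping compute in 16-bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",                        # spread across available GPUs
    attn_implementation="flash_attention_2",  # only if your GPU supports it
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token     # pad short sequences with EOS
tokenizer.padding_side = "right"
```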
Before fine-tuning the model, let's look at what kind of responses we get out of the base model. I'm going to use a function that takes a prompt from the user and has the model generate a response. Let's look at an example. My input prompt is "Use the provided input to create an instruction that could have been used to generate the response with an LLM," and the text we're providing is: "There are more than 12,000 species of grass. The most common is Kentucky bluegrass, because it grows quickly and easily, and it is soft to the touch." So this is the prompt we're providing, and we're asking the model to generate a potential question a user could have asked an LLM. When I ran this prompt through the model, here's the response it generated — I simply reformatted it so it's easier to read. For some reason it added a couple of extra special tokens, but anyway: here's the original prompt we provided, and after that the model — since it's a base model that just does sentence completion, or next-word prediction — reiterated my prompt and then simply started producing a whole bunch of random special tokens. This is expected behavior from a next-word-prediction model: it's not really following instructions, and that's absolutely fine. Let's see if we can fine-tune this model so that it starts following these instructions correctly.

Before looking at the fine-tuning process, let's again talk about tokenization and the max sequence length that your fine-tuned model is going to support. This is very important, because the bigger your max sequence length, the longer training will take, since it requires a lot more computation. Mixtral 8x7B supports 32,000 tokens out of the box, but if you're working on a specific application that doesn't need that long a sequence, you want to fine-tune it with a much shorter sequence length, and let me show you how to do that. This is a code snippet I borrowed from Brev — I think they have an example notebook; I'll put a link in the description. Basically, we take the create-prompt function that we use to reformat the prompts, reformat all the training data as well as the test data, and pass it to the Mixtral 8x7B tokenizer, so we get a tokenized dataset. After that, using this function, we look at the distribution of the sequence length of every sample. Here is the frequency distribution: most of the mass of this histogram is in the lower range, which means most of the examples in the dataset have a sequence length of probably somewhere around 700 or 800 tokens. Here I looked at a relatively zoomed-in plot of the same data, and most of the sequences are shorter than about 1,200 tokens. If you're fine-tuning a task-specific LLM, just look at the distribution of token lengths across your examples and pick a max sequence length that covers most of them; there might be some outliers, and you can simply throw those out and ignore them completely — this will save you a lot of computation.
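A rough version of that sequence-length check, assuming the `tokenizer`, `create_prompt`, and `train_ds` from the earlier sketches are in scope; the bin count and variable names are mine, not from the Brev notebook:

```python
import matplotlib.pyplot as plt

def token_lengths(dataset):
    # Tokenize each formatted example and record its length in tokens.
    return [len(tokenizer(create_prompt(ex))["input_ids"]) for ex in dataset]

lengths = token_lengths(train_ds)

plt.hist(lengths, bins=50)
plt.xlabel("tokens per example")
plt.ylabel("number of examples")
plt.title("Sequence-length distribution of the formatted training data")
plt.show()

# If most examples fall well under ~1,200 tokens, a max sequence length of
# 1024 covers the bulk of the data while keeping compute manageable.
```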
Next, we'll quickly look at the architecture of Mixtral 8x7B. Pay close attention to the linear layers you see here: these are the ones we're going to attach extra adapters to using LoRA. Now let's set up our LoRA configuration. These are the different modules or layers we want to attach the LoRA adapters to; if you go back, you will see these block names in the model, and those are the ones we're targeting. I've covered LoRA in a previous video, and I'll put a link to it. After setting the specific target layers, you can play around with the other parameters — for example, you can define the LoRA dropout, which regularizes the model and can reduce overfitting. After setting those up, we apply the LoRA adapters to our model, and we're going to train only these adapters. Using this function, we can see that out of all the parameters available — and there are a lot — we are training just 0.24% of the parameters in the model. Essentially, we're only updating the weights of the LoRA adapters, not touching the weights of the base model. If you wanted to fine-tune the whole model, that would take a lot of compute, a lot of time, and a lot of resources — and that is the beauty of LoRA. Now let's look at the model again after attaching the LoRA adapters: this is one of the linear layers, and we added the LoRA dropout and the LoRA adapters to it; then we took the second linear layer and again added both. Notice that we are only targeting the linear layers — we are not touching the nonlinear layers, or even the router module.

Next, we set the hyperparameters for training. First we check whether there are multiple GPUs on the system; if so, we enable model parallelism for training. All the training is performed with the amazing TRL package from Hugging Face. We need to define the output directory where we want to store the model. Then, in order to train the model, you have two options: you can either define the number of epochs you want, or define the maximum number of steps. For this quick experiment we set it to 250 steps with a fairly large batch size of 32. We're going to look at the results every 10 steps, and I'm setting a relatively low learning rate here because I want a smooth training process — convergence will take longer, but hopefully the loss won't shoot up over time. These are parameters you definitely want to play around with, because they control your training process.

With all that set, we next define our trainer. I'm using the supervised fine-tuning trainer (SFTTrainer) from the TRL package. For max sequence length, I looked at the plot from earlier, and setting it to 1024 covers the majority of the samples in the training set. You could manually discard the samples whose sequence length exceeds whatever you set here, but in this specific example I'm not doing that. Another thing you'll notice is the formatting function: I did not reformat my training dataset ahead of time; instead I reformat it on the fly using my create-prompt function, and we do that for both the training and the test dataset. On my H100 — again, thanks to my friends at SX — it took around 85 minutes to complete 250 steps. If you look at the training and validation loss, there's a nice gradual decrease in both, so we don't really see any sign of overfitting, which is a good indicator.
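Here is a sketch of the LoRA and trainer setup described above, written against the PEFT/TRL APIs as they were around the time of the video. The rank, alpha, learning rate, output directory, and target-module names are illustrative guesses rather than the exact values from the notebook.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments
from trl import SFTTrainer

# Target the linear projection layers (attention and expert MLPs), but leave
# the router / gate module alone.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()   # roughly 0.24% of parameters are trainable

training_args = TrainingArguments(
    output_dir="mixtral-8x7b-moe-lora",   # placeholder name
    max_steps=250,
    per_device_train_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=10,
    logging_steps=10,
    learning_rate=2e-5,                   # deliberately low for a smooth loss curve
    bf16=True,
)

def formatting_func(batch):
    # SFTTrainer passes a batch of examples; return one formatted string each.
    return [
        create_prompt({"prompt": p, "response": r})
        for p, r in zip(batch["prompt"], batch["response"])
    ]

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    max_seq_length=1024,              # chosen from the length histogram
    formatting_func=formatting_func,  # reformat examples on the fly
    tokenizer=tokenizer,
)

trainer.train()
```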
It also means that I could potentially have trained for longer to get better performance out of the model. After training, we can store the weights in a local folder. You can also push this model to Hugging Face: you'll need to log in with the Hugging Face CLI from the notebook, which will ask for your credentials, and after that you can use the trainer's push command, provide the repo ID where you want to push your model, and it will upload it there. Now, we only trained the LoRA adapters, so in order to use this model we need to merge the original base model with the adapters we trained.

Let's see how good this model actually is. Here I'm using the same function to generate a response, and we'll use exactly the same prompt, but this time the model is the merged model. Here's the response I got — I just formatted it for readability. Initially we have our input, and the output is this: it says "use this input," and we expect the model to generate questions. The first question is "How many species of grass are there?", then "What is the most common grass?", and "What grass would you find in a very dry area?" These are definitely relevant to the text we provided. For some reason the model repeated the same question again, but at least it's following instructions and generating questions. If we trained this model for much longer, we would start getting much better responses. The reason is that we used only 250 steps with a batch size of 32, so the model looked at only about 8,000 examples — it hasn't even seen the whole dataset yet. That's why, if you have a large dataset, you want to train these models for at least one or two epochs, and in some cases more than that, depending on how large your dataset is.

So this was a very quick tutorial on the step-by-step process of fine-tuning the newly released Mixtral 8x7B mixture-of-experts model from Mistral AI. Let me know in the comment section below if you have any questions or if there are other specific topics you'd like me to cover in future videos. I hope you found this video useful — consider liking the video and subscribing to the channel for more content like this. Thanks for watching, and as always, see you in the next one.
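For completeness, here's a rough sketch of those last steps — saving, pushing, reloading with the trained adapters, and generating — reusing the names from the earlier sketches; the folder and repo names are placeholders of mine, not from the video.

```python
from peft import PeftModel

# Save the trained LoRA adapters locally; optionally push them to the Hub
# (requires `huggingface-cli login` or notebook_login() beforehand).
trainer.save_model("mixtral-8x7b-moe-lora")
# trainer.push_to_hub()  # repo taken from TrainingArguments.hub_model_id

# For inference, load the base model again and attach the trained adapters.
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
tuned_model = PeftModel.from_pretrained(base_model, "mixtral-8x7b-moe-lora")
# tuned_model = tuned_model.merge_and_unload()  # optional full merge of adapters

def generate(prompt, model, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

grass_text = (
    "There are more than 12,000 species of grass. The most common is Kentucky "
    "bluegrass, because it grows quickly and easily and is soft to the touch."
)
print(generate(f"<s>[INST] {SYSTEM_MSG}\n\n{grass_text} [/INST]", tuned_model))
```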
Info
Channel: Prompt Engineering
Views: 10,769
Keywords: prompt engineering, Prompt Engineer, Mistral AI, Mixtral, Mixtral 8x7B, Mixture of Expert, Fine tuning LLMs
Id: RzSDdosu_y8
Length: 19min 20sec (1160 seconds)
Published: Fri Dec 22 2023