Fine-Tune Mixtral 8x7B (Mistral's Mixture of Experts MoE) Model - Walkthrough Guide

Captions
Hi, I'm Harper, and we're from Brev. Today we're going to fine-tune Mixtral, which is Mistral's mixture-of-experts (MoE) model. It's eight 7-billion-parameter models, each one an expert in something. This is actually architected very similarly to GPT-4, which is why we're seeing such great results from it. The way it works is that, based on your prompt, the input is routed to the most appropriate expert, and it outperforms Llama 2 70B on most tested benchmarks. So let's dive in.

We're going to start in the notebook, which I will link, and we don't need an A100 for this. We're going to be using QLoRA, which is quantizing the model and then applying low-rank adaptations (LoRA). I have an article on how that works, which I'll share with you all as well, and I'll talk a bit about it as we go through the notebook, as we get to each stage of QLoRA.

First we want to instantiate the GPU, so we can just click this badge right here. We're going to need a 4x T4. In almost all of my other guides I use QLoRA for fine-tuning and you only need an A10G, a smaller machine with, I think, 24 GB of GPU memory. Here we need a 4x T4: a T4 is smaller at 16 GB, but we're going to get four of them, and that's about $4 an hour right now. So let's just click deploy; everything is preconfigured for you. Oh, I have to enter my card, so let me just refresh this because it should have my card info, and it should load everything again. Here we go. Okay, still loading. I hope you all have a wonderful holiday season (I think I'll be releasing this as part of the 12 Days of Christmas), and I hope everything is great in your world.

Okay, so here we are in the environment UI, and now we're going to click build. The CUDA version and Python version are already preset for you if you use that link from the guide, the deploy-now badge link I clicked earlier. So let's let this build; we'll wait a few minutes and come back in a little bit. We can see the instance is building. This builds the Verb container, which is kind of like a Docker container, and we'll wait a few more minutes until it's finished.

Okay, cool, so it looks like the machine is ready. Verb is finished, as it says: your Verb container is ready. Unfortunately the notebook button isn't clickable, it says unhealthy, and it's been a few minutes, so that probably means we're running into a bug that we know about and are working on fixing. The way to get around it is to read these docs, or just follow me: get a shell (I use iTerm) and type `brev upgrade` to make sure you have the newest version of the CLI (I do), and then `brev notebook` followed by the name of your machine. It gives you a link, which we command-click to open. I already have the notebook uploaded, but if you don't, just download it from GitHub with the little download button and then upload it from your machine.

So we've already gotten the GPU (Eevee, my dog, is very protective of me), and now we just want to run this install cell once per machine. Even if you stop the machine and restart it, you only need to run this once. We'll give that a minute. If you see the asterisk, that means it's running, and when it turns into a number, it has completed. So we'll just wait until this completes... and it's done. Great.
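The video doesn't show the contents of that one-time setup cell, so the package list below is just my guess at a typical stack for this kind of QLoRA fine-tune, not the notebook's actual cell:

    # Hypothetical one-time setup cell (exact packages not shown on screen).
    # These are the usual libraries for a QLoRA fine-tune with Hugging Face:
    # 4-bit quantization (bitsandbytes), adapters (peft), metrics (wandb).
    !pip install -q -U transformers datasets accelerate peft bitsandbytes wandb matplotlib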
Okay, so now we want to set up the accelerator, which should speed up training, and then we also want to set up Weights & Biases, which is good for in-training metrics like your eval loss and your training loss; it creates graphs for you that you can look at. We'll just run this, and then it prompts me to authorize it. I'm logged in, so I'll just copy this, go back, paste it, and then we're good to go.

Now we can move on to loading the dataset. We want to use the Viggo dataset. It's really great for testing whether fine-tuning works, because it teaches the model a unique form of desired output. You'll see when we go through it: it's very specific in what it wants, and it's quite unusual, so the model usually doesn't perform well on it out of the box, and you can really see afterward that the model has learned. So we want to load the train, eval, and test splits, and I print them just to make sure we have them all. This is from Hugging Face, so you can swap this out with your own Hugging Face dataset; just paste its name in here. If I go to Hugging Face and search for Viggo, it should be in Datasets; here we go. So if you want a different dataset, this is the name I copied, and you'd just paste yours back in the same spot. Oops, I lost my place, but it should still be running, so we should be good. Great, we have them all loaded.

Okay, so now we want to load the base model. Again, that's Mixtral 8x7B, the mixture-of-experts model that outperforms Llama 2 70B on most tested benchmarks. Pretty cool. I guess it's not that surprising, since 8x7B isn't far off from 70B, which is basically 10x 7B, but it's still cool. This will take a minute to load because it's a pretty big model. Okay, cool, the model is loaded.
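As a rough sketch of those two cells (the `gem/viggo` dataset id, the `mistralai/Mixtral-8x7B-v0.1` model id, and the exact quantization settings are my assumptions; the notebook may differ slightly):

    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Load the Viggo train/eval/test splits from the Hugging Face Hub;
    # swap "gem/viggo" for your own dataset name to fine-tune on other data.
    train_dataset = load_dataset("gem/viggo", split="train")
    eval_dataset = load_dataset("gem/viggo", split="validation")
    test_dataset = load_dataset("gem/viggo", split="test")
    print(train_dataset, eval_dataset, test_dataset)

    # Quantization config: load the weights in 4-bit so Mixtral fits in
    # the 4x T4's combined GPU memory (more on quantization below).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    base_model_id = "mistralai/Mixtral-8x7B-v0.1"
    model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=bnb_config,
        device_map="auto",  # shard the layers across all four T4s
    )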
Now we can move on to tokenization. This is a screenshot from the Viggo Hugging Face dataset page, and we can see these length distributions. You can use them to set this max length variable, which we need for the model and which dictates the size of our dataset matrices. We want to get the length right: you don't want to overdo it and use extra GPU memory you don't need if the training examples aren't that long, but you also don't want to cut off the training examples and lose data. So you want it just right.

What we're going to do is load the tokenizer and define this generate-and-tokenize-prompt function, which describes the Viggo task. The dataset has a target sentence and a meaning representation, and this function is how I format the data that goes into the model. I'm just describing the task to the model: describe the target string as one of these functions (inform, request, give_opinion, and so on), where the attributes must be one of the following (name, release_year, has_multiplayer, any of these); then we have the target sentence; and then the output, which is what we want the model to learn, namely how to express that sentence as a meaning representation. So for every Viggo training example, it takes the task description, the target sentence, and the meaning representation. At test time, we will give the model everything up to the start of the meaning representation and hope that it's able to output the correct one.

So now we tokenize the dataset, and if we decode an example, we can see that it looks how we'd expect: the description, the target sentence, and the meaning representation. Now we want to plot the lengths, because again we want the right length; we don't want to overdo it or underdo it. It's pretty well distributed, so let's go with 340; I think we can fit it. So we set max length to 340, we add a beginning-of-sentence token and an end-of-sentence token, and then we pad so that all of the training examples are the same length. We can see each example is padded with 2s: it starts with the beginning-of-sentence token, which is 1, and ends with the end-of-sentence token, which is 2. And if we untokenize it, we can see it looks how we'd expect, with the task description, the target sentence, and the meaning representation we want it to learn. Just another note on this: they should all be the same length after we pad, because we're representing the training data as a matrix, so every row needs the same length for the matrix multiplications to work.
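Putting that together, here's a minimal sketch of the tokenization cells, assuming the Viggo column names `target` and `meaning_representation` and the max length we just picked; the exact prompt wording in the notebook may differ:

    from transformers import AutoTokenizer

    max_length = 340  # chosen from the token-length distribution above

    tokenizer = AutoTokenizer.from_pretrained(
        base_model_id,
        add_bos_token=True,   # prepend BOS (id 1) to every example
        add_eos_token=True,   # append EOS (id 2) to every example
    )
    tokenizer.pad_token = tokenizer.eos_token  # pad with 2s

    def generate_and_tokenize_prompt(example):
        # Task description, then the target sentence, then the meaning
        # representation we want the model to learn to produce.
        prompt = (
            "Given a target sentence, construct the underlying meaning "
            "representation of the input sentence as a single function "
            "with attributes and attribute values.\n"
            f"### Target sentence: {example['target']}\n"
            f"### Meaning representation: {example['meaning_representation']}"
        )
        result = tokenizer(
            prompt,
            truncation=True,
            max_length=max_length,
            padding="max_length",  # every matrix row is the same length
        )
        result["labels"] = result["input_ids"].copy()
        return result

    tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
    tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)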
First, let's see how the base model does. Again, we chose this dataset because it's easy to tell whether the model learned. If we gave it a basic factual dataset, it might be hard to tell that the model is actually learning anything from fine-tuning, because it might already have access to those facts and the base model might perform pretty well on what we're training it on. This dataset is useful in that it's quite an unusual task, so we can expect the model may not perform very well on it out of the box. So we'll see how it does. Here's the sentence we're giving it, this is what we want the output to look like, and this is what it actually looks like. We can see it's missing quite a bit: it got the target function incorrect (it said inform instead of verify_attribute), it got the name right, but it got has_multiplayer wrong, and the rest of these wrong too.

So now we're going to start on LoRA. LoRA is low-rank adaptation, which takes the original full-size model and gives the layers smaller matrices than they had before. The model has multiple layers, each of those layers is a matrix, and the values in the matrix are updated as data passes through the model and backpropagation happens. Updating all of those parameters is extremely intensive for the machine, so you need pretty large machines for a typical full-size fine-tune. What LoRA does is make an adaptation of those matrix layers that is smaller than the original, so you're actually training on a much smaller number of parameters than the original. Low-rank adaptation is the technique, and SVD (singular value decomposition) is the low-rank decomposition form used in LoRA, at least as far as I saw when looking at the paper. We're able to make representations of these matrices with a smaller rank, and rank is kind of analogous to dimensionality. It's not the same, but for the non-math people out there, you can think of it similarly. So it represents the matrix while losing some granularity; it's a bit lossy, but it should represent the original decently.

That's similar to quantization, which I actually forgot to mention: we're loading the model in, let's see here... 4-bit, excuse me (a different guide of mine uses 8-bit; we have so many guides these days). So we load it in 4-bit here. The model's weights are originally 32-bit or 16-bit floating point, and what loading in 4-bit, which is quantization, does is map those floating-point values to 4-bit fixed representations. It basically shrinks the model to between a quarter and an eighth of its size, so you can fit this large model in a smaller amount of GPU memory. Quantization and LoRA are kind of similar in that way, and QLoRA pairs them: you load the model quantized, in fewer bits, and then you also apply these LoRA adapters, the low-rank adapters, onto the layers of the model. Here we do it on the linear layers to reduce the trainable parameters. So we're making this model smaller, and it's lossy, so we lose some of the data it has, but it fits on a GPU, and in some cases it may actually provide some regularization: if the model is overfitting because it has all these parameters, quantization and LoRA adapters can sometimes help with generalization. Okay, long-winded explanation of what's going on here, but let's load the model.

So we have these linear layers that we're going to apply LoRA to: the q, k, v, and o projections, then these three expert layers, and lm_head at the bottom here; I list them all in the config. I'm going to choose a rank of 8. The rank is the rank of the low-rank matrices used in the adapters, and it controls the number of parameters trained, which we just talked about: a higher rank allows for more expressivity, but there's a compute trade-off. Alpha is the scaling factor for the learned weights; the weight matrix is scaled by alpha over r, so a higher value of alpha assigns more weight to the LoRA activations. We're going to set alpha to about two times the rank, which is pretty common for LoRA in my experience. And dropout is usually 0.05; that helps with regularization by randomly dropping out values. Cool. So now we can print the model, and we can see that the linear layers we saw before all have LoRA adapters on them. Yay, okay, cool.
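Here's a sketch of that configuration with the `peft` library. The rank, alpha, and dropout values are the ones discussed above, while the exact `target_modules` list is my reading of the layer names shown on screen:

    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model = prepare_model_for_kbit_training(model)  # prep the 4-bit model

    lora_config = LoraConfig(
        r=8,            # rank of the low-rank adapter matrices
        lora_alpha=16,  # scaling: adapter output is scaled by alpha / r
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        # The linear layers we attach adapters to: the attention
        # projections, the three expert MLP layers, and the LM head.
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "w1", "w2", "w3",
            "lm_head",
        ],
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # shows the much smaller trainable count

Printing the model afterward is what shows each of those linear layers wrapped with its small lora_A and lora_B matrices.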
Okay, so now we've gotten to the moment of truth: training. We can run this. We have four devices, and you should have four too if you're using the Brev link. The project is the Viggo fine-tune; you can rename this; it's for Weights & Biases, and you can comment it out if you don't want to report to Weights & Biases. I'm going to run it for 500 steps, and I'm setting the learning rate to 2.5e-5 because that worked well in the Mistral run I did before, and I think it's about 10 times the Mistral learning rate (not Mixtral, Mistral, but I'm assuming maybe they're similar). You can tweak that, and you can do a hyperparameter sweep if you'd like to make your model really good, tweaking things like the learning rate.

I'm saving every 50 steps and evaluating every 50 steps, so every 50 steps I save a checkpoint with a training loss and an eval loss. I might find that an earlier model, say checkpoint 300 at training step 300, performs better, with a lower eval loss than a model later on. Perhaps a later model starts to overfit: the training loss keeps going down, but the eval loss starts to go up. What that means is the model is learning its training data really well, but too well; it's not generalizing. It's making an elaborate curve around the training data, but that curve doesn't generalize well to the eval dataset. So we might find that we want a model that doesn't overfit, which is why we save these checkpoints, so we can potentially load an earlier model. We'll see. We're also going to log every 25 steps, and I'm using a per-device train batch size of 1. I know that's small, but it works with my GPU size. You could probably tweak this, play with it, and see if it crashes on you; if it does, it's no big deal, just reload the machine or restart your kernel, depending on how bad it is, and run the notebook again.

One thing you'll want to do before you leave your model to train: open another shell and run caffeinate, which prevents your laptop from going to sleep. Then turn the brightness all the way down and leave your machine. You don't want it to go to sleep while it's training, because you'll lose the training run.
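As a sketch, the training cell with the hyperparameters mentioned above might look like this; the output directory and run name here are placeholders of mine, not the notebook's actual values:

    import transformers

    trainer = transformers.Trainer(
        model=model,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_val_dataset,
        args=transformers.TrainingArguments(
            output_dir="./mixtral-viggo-finetune",  # checkpoints land here
            per_device_train_batch_size=1,  # small, but it fits in memory
            max_steps=500,            # I'd go to 1000+; it hadn't converged
            learning_rate=2.5e-5,     # worked well in my earlier Mistral run
            logging_steps=25,         # log every 25 steps
            save_strategy="steps",
            save_steps=50,            # save a checkpoint every 50 steps
            evaluation_strategy="steps",
            eval_steps=50,            # evaluate every 50 steps
            report_to="wandb",        # comment out to skip Weights & Biases
            run_name="viggo-finetune",
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(
            tokenizer, mlm=False  # causal LM, not masked LM
        ),
    )

    model.config.use_cache = False  # silence warnings; re-enable for inference
    trainer.train()

The checkpoints then land in the output directory as checkpoint-50, checkpoint-100, and so on, which is what lets us reload an earlier, less overfit model later.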
Okay, so we have run it for 500 steps. I should have gone longer, actually, so I'm going to update this for you all to go to a thousand, and you can always stop it early via Kernel > Interrupt Kernel. Let's go look at the Weights & Biases graphs, just so we can practice reading them. The eval loss is going down. I don't know why this chart only has one point; something messed up with the graph, which is a bummer, but we can see the eval loss is going down, and once it starts to climb again, that's when it's overfitting. It's not terribly overfitting; it's maybe perhaps starting to, but I don't know if there are really enough data points to tell. If I did it again, I would probably train for a thousand steps or even longer; you can always cancel it early. We don't have the training loss chart for some reason, but we can see that the loss continues to go down, so it doesn't seem to have converged yet. Again, I would train for longer.

Okay, so let's restart the kernel, and then we're going to load the Mixtral 8x7B base model and load the LoRA adapters from checkpoint 500 on top of it. Let's load both, and I'll be back in a minute to run the same test prompt that we ran before; hopefully it does better after training for about 500 steps.

Okay, so compared to how it did before: this is what we wanted it to produce, and before, it got inform wrong and said has_multiplayer for Little Big Adventure, so it didn't do very well at all. This is the gold label, and ours also ran a bit long, but it says verify_attribute; great, got that right. It got the first part right, got the rating close, got has_multiplayer right, platforms PlayStation... oh no, it didn't get PlayStation; that's in the gold. But anyway, it did okay, and it did better. And again, it didn't converge, so I would train it for longer, at 1,000 steps or even more, and you can always stop as soon as you see that the training loss stops going down or the eval loss starts to go up.

I also want to shout out that today we got 1,000 followers on YouTube and 20,000 views on the original Mistral video. I'm so glad this has been helpful, and I love connecting with you; I really appreciate your support. This is so fun, so please do let me know if you have any other guides you want to suggest, or recommendations or questions. Let's connect: I'm on X, I'm on the Discord, and here on YouTube (this will also be posted on X). But anyway, thanks a lot, happy holidays, and I'm excited to see you next time and connect with you soon. Take care.
Info
Channel: Brev
Views: 14,972
Id: zbKz4g100SQ
Length: 23min 11sec (1391 seconds)
Published: Thu Dec 21 2023