Mixtral Fine-tuning and Inference

Captions
OpenAI's GPT-4 is believed to be a mixture of experts, and now there's an open-source mixture of experts called Mixtral available. I'll be showing you how to run inference with the model and how to fine-tune it, and I'll also tell you a little bit about how mixtures of experts work. Let's take a look at the agenda. I want to explain why you would use a mixture of experts over a normal model, then give some detail on how mixtures of experts work. I'll start with the basic Transformer design and show you what a single layer looks like in a standard Transformer, because that will make it easier to understand how a layer is different in the mixture-of-experts design. Then I'll show you how to run inference using a one-click template for RunPod and give some tips on the best ways to do inference right now. I'll then walk through a fine-tuning example. It will be an example for function calling, but a lot of the tips apply equally to the unsupervised or supervised fine-tuning scripts I've gone through in the past. Last of all, I'll leave you with some thoughts on future work to incorporate mixture-of-experts designs into common packages and tools you might be familiar with.

The idea with a mixture of experts is that instead of asking a single model for an answer, you have a little module called a router that decides which expert to ask. For example, there might be eight different experts, and instead of asking one very large model, the router picks one of these smaller experts to give you an answer. In short, it allows you to get the quality of a very large model, say a 70-billion-parameter model, at roughly the speed of a small model, perhaps a 13-billion-parameter one. The reason is that once the router has chosen the expert best suited to the question, that smaller expert can calculate the next token much faster than a very large model could. However, because you have all of these experts, together they cover a span of knowledge that is somewhat equivalent to what a very large model would cover.

Let's take a deeper look at how mixtures of experts are designed by starting with the basic Transformer. If you're working with Llama 2, the base Mistral model, Falcon, the Yi model, DeepSeek, or indeed GPT-3.5, most likely you're looking at a multi-layer Transformer that takes in inputs like "how are you" and helps to generate the next token, which in this case is shown predicted as "today". These models consist of a series of matrices that first convert the words into numbers and then propagate those numbers through a series of layers to calculate the next-token prediction. In the diagram I've shown about seven different layers; in practice these models often have anywhere between roughly 20 and 60 layers. The layers are all structured the same way, so their architecture is identical in each layer, but the exact values of the matrices used for the multiplications are different in every single layer.
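To make that pipeline concrete, here is a toy PyTorch sketch of the idea, not anything like the real models: a made-up four-word vocabulary, an embedding table that turns words into numbers, a stack of identically structured layers, and a final projection that scores the next token. All sizes and names here are illustrative.

```python
import torch
import torch.nn as nn

# Toy illustration only: words become numbers via an embedding table, then pass
# through a stack of identically structured layers with different weights each,
# and a final projection scores the next token. Vocabulary and sizes are made up.
vocab = {"how": 0, "are": 1, "you": 2, "today": 3}
dim, n_layers = 16, 7                      # seven layers, as in the diagram

embed = nn.Embedding(len(vocab), dim)      # word -> vector of numbers
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=dim, nhead=1, batch_first=True)
    for _ in range(n_layers)               # same architecture, different values per layer
)
to_vocab = nn.Linear(dim, len(vocab))      # scores for the next-token prediction

x = embed(torch.tensor([[vocab["how"], vocab["are"], vocab["you"]]]))
for layer in layers:
    x = layer(x)                           # propagate the numbers layer by layer
print(to_vocab(x[:, -1]).softmax(dim=-1))  # untrained, so the "today" guess is random
```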
Now, there are two ways you can increase a Transformer in size: you can have more of these repeating units or layers, or you can make each layer bigger, meaning you increase the number of matrix parameters within a layer. Let's take a quick look at what happens inside a layer. There are two things that happen within a Transformer layer. The first is a feed-forward mechanism. A feed-forward mechanism takes the inputs, for example "how", "are", "you", and multiplies each one by some kind of constant, so you could have a × how + b × are + c × you. Now you'll ask, how do you multiply words? Well, remember that before the words go into the first layer they are converted into numbers, so when I say a × how I mean multiplying a by the number representing the word "how". So that's a feed-forward layer: a constant times an input plus a constant times an input, and you can expand the size of this linear layer just by adding more constants. Here you have one node, or output, which is a × how + b × are + c × you, and you can easily add another node to the layer to make it bigger: d × how + e × are + f × you. The choice of these parameters is determined by the training process, so a, b, c, d, e and f will all be optimized during training. To recap: if you want to increase the complexity of a feed-forward layer, you keep adding more nodes, and you connect the output of the previous layer to the input of the next layer, so each layer receives information from the previous layer and passes it on to the next. Of course, at the very bottom layer you can quite literally see the words "how are you" being passed up; as you move to higher layers, the meaning of the outputs becomes more and more abstract, so it wouldn't be possible to attribute a given output to any one particular word. The important thing about a feed-forward layer, though, is that it's linear: it's just these constants multiplied by the inputs in order to generate outputs.

The second thing that happens in a Transformer layer is called attention, and this is the key feature of the 2017 paper "Attention Is All You Need" that makes Transformers so effective. What the attention mechanism does is allow a given word to be compared to other words. For example, if we have "how are you", with attention the word "you" is compared (which mathematically we can describe as a multiplication) with "how" and with "are". This lets the Transformer look back at interactions between words within the sentence, and that's really important for accurately predicting the next word. Just as before, these words are represented by numbers, and those numbers are multiplied by constants, here g, h and i, and again these are trainable parameters that get updated when we back-propagate through the model during training. That's a very simple attention mechanism with just one node, but if you want to add more complexity to the model you can add multiple nodes within that layer, so you could have constants g, h, i, j, k, l and keep going if you want to make your model larger and larger.

Here is all of this put together in the original Transformer paper, and you can see they do both things I mentioned, the attention and the feed-forward, and they have them in series. In fact you can put them in parallel, and some of the more recent papers look at that, but the important thing is that each layer of a Transformer like GPT typically has two mechanisms: the feed-forward and the attention. So this is how the original Transformer works.
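Here is a minimal PyTorch sketch of a single layer with those two pieces: an attention step that lets each word be compared with the others, and a linear feed-forward step whose width you can grow by adding more nodes. This is a toy, not the actual implementation used by any of the models mentioned; residual connections and layer norms are left out for clarity.

```python
import torch
import torch.nn as nn

class TinyLayer(nn.Module):
    """Toy Transformer layer: one attention step plus one feed-forward step."""
    def __init__(self, dim=16, ffn_nodes=32):
        super().__init__()
        # Attention: lets each word be compared (multiplied) against the others.
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        # Feed-forward: constants times inputs; widening ffn_nodes is the
        # "add more nodes" way of making the layer bigger.
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_nodes), nn.ReLU(),
                                 nn.Linear(ffn_nodes, dim))

    def forward(self, x):
        attended, _ = self.attn(x, x, x)   # "you" attends to "how" and "are"
        return self.ffn(attended)          # then the linear feed-forward step

tokens = torch.randn(1, 3, 16)             # number-vectors for "how", "are", "you"
print(TinyLayer()(tokens).shape)           # torch.Size([1, 3, 16])
```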
As I mentioned, if you want to make a Transformer more complex you have a few options: you can increase the number of layers, that is, the number of repeating units; you can increase the size of the feed-forward layers, so more trainable parameters in the feed-forward; and/or you can increase the size of the attention mechanism within the layers, so g, h, i or whatever constants you're using to represent the attention matrices. Of course, as you make your model larger it's able to better fit the data you train it on, but it also gets slower, because there are more calculations, more parameters that need to be multiplied to get from the input all the way through to the output. And this is how you arrive at the idea of the mixture of experts: you keep increasing the size within each layer, you keep increasing the number of layers, and your model gets better and better, but as you add parameters to fit the data better, it becomes slower.

What a mixture of experts does is keep the same attention mechanism within each layer (here we're looking at one layer, which is replicated maybe 30 times throughout the Transformer). Typically in a mixture of experts the attention is kept as is, but in the feed-forward part, instead of having one single set of nodes, one big matrix, we split it into multiple matrices, say eight. Rather than running through that one big matrix every time, we have a gating mechanism, and the gate decides which one or two of the smaller matrices we're going to use on each forward pass. So if a token comes in that says "how", it goes through attention, it gets to the gate, and the gate decides: okay, this is the token "how", we're going to send it to experts one and two out of the eight. That means we only need to run the computations through two of those eight smaller experts rather than through one very large matrix. So if we're choosing two out of eight experts, the mixture of experts cuts down the feed-forward computation by roughly a factor of four.

Just to make it concrete, let's head over to the model card for Mixtral-8x7B-Instruct. Here we have a mixture of eight experts, each of which has roughly 7 billion parameters, and we can go into the files and look at the size of the model saved in 16-bit format. There are 19 safetensors files of roughly 5 GB each, so the full model is a bit over 90 GB on disk, which corresponds to about 45 to 47 billion parameters in total. If you were simply replicating eight 7B models, eight times seven is 56, so you'd expect about 56 billion parameters instead of roughly 47. However, you'll remember that the attention mechanism is the same in every expert: for any given layer, say layer 29, each expert uses the same attention constants as the others. That's why the total comes to roughly 47 billion rather than 56.
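To make the layer modification concrete, here is a toy sketch of a sparse mixture-of-experts feed-forward block in PyTorch: the attention part is left untouched, the feed-forward is split into eight small expert networks, and a gate routes each token to the top two. Sizes, names and the simple MLP experts are illustrative, not Mixtral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoEFeedForward(nn.Module):
    """Toy sparse MoE block: the feed-forward is split into several experts
    and a gate routes each token to the top two of them."""
    def __init__(self, dim=16, ffn_dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, ffn_dim), nn.SiLU(), nn.Linear(ffn_dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)   # the router
        self.top_k = top_k

    def forward(self, token):                     # one token vector, shape (dim,)
        weights, chosen = torch.topk(F.softmax(self.gate(token), dim=-1), self.top_k)
        # Only the chosen experts' matrices are multiplied: 2 of 8, so roughly a
        # quarter of the feed-forward compute of one equivalently large dense layer.
        return sum(w * self.experts[i](token) for w, i in zip(weights, chosen.tolist()))

moe = ToyMoEFeedForward()
print(moe(torch.randn(16)).shape)                 # torch.Size([16])
```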
Now, if you think about larger models: a model with roughly 45 billion parameters is somewhere around two thirds the size of a Llama 70B model. However, because only two of the eight experts are activated at any one time, you're basically getting the speed of a 13B model (two times roughly 7B) while getting the breadth of knowledge of a model that's about 45 billion parameters in size. So that's the theoretical overview of Mixtral: it's very roughly a 45-to-50-billion-parameter model, but it has the inference speed of a model that's only about 13 billion parameters, because it runs two experts of roughly 6 to 7 billion parameters each. Additionally, Mixtral makes use of 7B Mistral-style models trained on high-quality datasets, and those datasets have improved since the ones used for Llama 2, for example. So when I say it's like a 45-or-50-billion model, that's just counting parameters; in terms of performance it probably behaves like an even larger model, because those larger models are older and, I assume, were trained on data of lower quality, which leads to lower performance than the Mistral family.

Okay, with that, let's dive into an example of inference, how to run this model, and then I'll talk a bit about fine-tuning. The easiest way to run inference is to start with the one-click template. This is if you want to set up an API where you'll be able to make calls, including parallel calls, and get responses from Mixtral. Those of you who've already purchased the Advanced Inference repo will find a link under RunPod; you'll see under TGI that there's a one-click template there. Those of you who haven't purchased it will find the link just below in the description. I've clicked on that link, which pops up the Mixtral Instruct API template by Trelis, and I'm now going to pick a GPU. A few notes on picking the GPU: to load the model fully into VRAM in 16-bit precision you need roughly 100 GB, which is a lot, even more than an A100, so it's very difficult (I don't think it's possible) to run the model in 16-bit precision on a single GPU. However, if you run it in 8-bit quantization (EETQ) with TGI, you'll be able to fit it in 48 GB, and if you run it with bitsandbytes NF4, which I'll also show briefly, you can probably run it in 24 to 26 GB of VRAM and maybe squeeze it onto a slightly smaller GPU. So, for GPUs, I'm going to use an A6000 here, which has 48 GB of VRAM, so it will fit the model in 8-bit precision; you could also run on an A100, which is more expensive but should give you somewhat better speed.

We'll click on deploy and check the custom parameters. You can see it's already set up by default with quantize set to eetq, which is 8-bit quantization, and, if I increase my screen size, you can see I've chosen the Mixtral-8x7B-Instruct model and the latest version of the Docker image for text-generation-inference. I'll set those overrides and then click continue and deploy to get this running. Here I am with my instance up and running, and you can see that about 69% of the disk space has been taken up, because there's on the order of 100 GB of weights to download. We can check the logs, and what we see is that all of the weights have been downloaded. This can take some time, and I've noticed there can be hiccups downloading the weights, so if it gets stuck after downloading a few shards, you can just click restart pod. That means you won't lose your progress on the installation or on the weights that have already been downloaded; the pod will just pick up and continue downloading the remaining shards.
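While you're waiting for the shards, here is a rough way to check from Python whether the endpoint is up. Recent versions of TGI expose /health and /info routes; the RunPod proxy URL pattern and port below are assumptions based on this template, so substitute your own pod ID from the dashboard.

```python
import time
import requests

# Rough readiness check while the shards download and load. The /health and /info
# routes are TGI's; the proxy URL pattern and port are assumptions for this template.
BASE_URL = "https://your-pod-id-8080.proxy.runpod.net"   # hypothetical placeholder

for _ in range(120):                          # poll for up to roughly an hour
    try:
        if requests.get(f"{BASE_URL}/health", timeout=10).status_code == 200:
            info = requests.get(f"{BASE_URL}/info", timeout=10).json()
            print("Model loaded:", info.get("model_id"))
            break
    except requests.RequestException:
        pass                                  # endpoint not reachable yet
    print("Still downloading / loading shards ...")
    time.sleep(30)
```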
You might have to restart once or twice to make sure all 19 shards have been downloaded, and then the model should start loading onto the GPU. I've just opened the logs on the pod, and you can see that, after quite some time, all of the shards have now loaded and we're ready to query the API. For that we need the RunPod pod ID, so I copy the pod ID and paste it into the environment variables in my repo. I have my Advanced Inference repo open here, and in the environment variables file (.env) I've enabled the API endpoint for RunPod, pasted in the pod ID, and also set the model name we're going to query. We're now ready to make calls to the API, and we're going to do that with the TGI speed script; you can check out the inference video I made, but this just sends a quick question asking for a long essay on the topic of spring. So let's run that, save those updates, and wait on the first query to come back. Here we go: you can see an example of the response. I asked for a maximum of 500 tokens, the generation time was just under 30 seconds, and the throughput is about 19 tokens per second. So with EETQ quantization you're getting about 19 tokens per second on this setup.

Now let's run a concurrent test with the TGI speed concurrent script, which sends the same request but pings the endpoint every eighth of a second, twenty times, so all of those requests overlap and the language model has to handle them in parallel; we want to see whether that slows things down. Okay, we've got the results back: you can see all of the requests going through, and each request again took about 30 seconds to complete in total (of course, if you were streaming, you'd see tokens come back much sooner than that). The tokens per second here is about 17; the lowest I see is 16.94 and 16.81. So there's only a very small penalty for doing parallel requests, which shows that with EETQ 8-bit quantization the server can handle a lot of parallel requests without much impact on the time per token.

Now let me briefly show you another form of quantization. If you go to the pod and click edit pod, you'll see where we previously set quantize to eetq. Instead, you can set this to bitsandbytes-nf4, which gives a 4-bit format: it compresses the model by roughly another factor of two, so you can fit it into a smaller amount of VRAM. I will note, though, that the speed of requests can be slower, and it can also be harder to hit the API with many requests in parallel. I think this is because when you quantize further there's a lot of dequantization, expansion back up to 16 bits, that has to happen inside the GPU, since the final computation is done in 16 bits. So this is helpful for fitting the model into smaller VRAM, but it may slow down your performance.
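For reference, here is a standalone sketch of the kind of speed test described above: one request to TGI's /generate route, then twenty overlapping requests staggered by an eighth of a second. This is not the repo's script; the endpoint URL is the same assumed placeholder as before, and words per second is only a crude stand-in for tokens per second.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

# Sketch of the single and concurrent speed tests, against an assumed TGI endpoint.
# The prompt wrapping follows the Mixtral instruct [INST] ... [/INST] format.
URL = "https://your-pod-id-8080.proxy.runpod.net/generate"   # hypothetical placeholder
PAYLOAD = {
    "inputs": "[INST] Write a long essay on the topic of spring. [/INST]",
    "parameters": {"max_new_tokens": 500},
}

def one_request(_):
    start = time.time()
    text = requests.post(URL, json=PAYLOAD, timeout=300).json()["generated_text"]
    return len(text.split()) / (time.time() - start)   # crude words/sec proxy

# Single request first ...
print(f"single: ~{one_request(0):.1f} per second")

# ... then twenty overlapping requests, staggered by an eighth of a second.
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = []
    for i in range(20):
        futures.append(pool.submit(one_request, i))
        time.sleep(0.125)
    for f in futures:
        print(f"concurrent: ~{f.result():.1f} per second")
```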
Okay, there's one more inference setup I want to show you, and that's the Trelis Mixtral function calling model that's available on Hugging Face. It's a gated repo you can purchase access to, and it includes a one-click template. If I scroll down and click on that one-click template, the function calling model pops up, and because it's the same Mixtral base model, you again want an A100 or an A6000, and you can run it with the same quantization options I recommended. So here you would pick an A6000 and deploy, and you can check the setup options; note, by the way, that I'm setting the disk volume to 150 GB to leave plenty of space for the roughly 100 GB of model weights. I've got that pod running already (I set it up a little earlier to save time), so I'll head over to my pods in Secure Cloud and check the logs just to confirm that all 19 shards have been downloaded and loaded onto the GPU, and the host is listening on 0.0.0.0, which means everything should be running nicely. I won't go through the function calls in detail because that's covered in a very recent video on function calling v3.

So let's move on, and I'll show you an example of fine-tuning. I'm back over in RunPod, which I typically use for fine-tuning. You can also use Vast.ai, which is useful in particular for small models and smaller GPUs because they offer lower price points; however, the Mixtral model is quite large, so you're going to need at least two A6000s, or two A100s, or two H100s if you want to fit the model in full precision. You could alternatively try to fine-tune using QLoRA, and you could probably do that on a single A6000 if you use 4-bit quantization; I think it's probably difficult to fine-tune on smaller GPUs for the moment. So typically I would get started by picking two of these A6000s and clicking deploy. The template I want to run is PyTorch 2.1, which has the NVIDIA drivers pre-installed. It's important to have plenty of disk space when fine-tuning these models: you need about 100 GB for the model once over, but if you're training different versions and re-saving, I like to leave a lot of headroom, so I'd typically set that value quite a bit higher. Once that's ready, we can set the overrides and then continue and deploy.

While that pod is loading, I'm going to go back to my presentation and show you a little bit of how the training will work. With an original Transformer, if we're training something like Llama 7B, we often use a technique called LoRA. LoRA means we train a small adapter in parallel, which has a smaller number of weights, and at the end of training we take those trained weights and merge them on top of the frozen base model weights. There are a few ways to select which matrices to train: in many of my videos I might choose to train only the matrices in the attention portion, but sometimes you want to train more parameters and you also train some of the weights within the feed-forward network. In the chat fine-tuning video I showed recently, I trained LoRA weights in the feed-forward and also in the multi-head attention.
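To make that merge step concrete, here is a minimal sketch using PEFT, under the assumption that you've already trained and saved an adapter; the adapter repo name is a hypothetical placeholder, and merge_and_unload() is PEFT's standard call for folding LoRA weights back into the frozen base model's matrices.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Sketch of "train a small adapter, then merge it onto the frozen base".
# The adapter repo name below is a hypothetical placeholder.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "your-username/your-lora-adapter")
merged = model.merge_and_unload()        # one ordinary model: base weights + adapter
merged.save_pretrained("mixtral-merged")
```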
However, now that we're moving on to the mixture of experts, we're not going to train the feed-forward network. It's more complex here, because there's a router that has to choose between experts, and with this architecture it becomes more difficult to make those parameters trainable and still get sensible results. So we're going to focus only on training the multi-head attention, and in fact we're not going to train those parameters directly either: we'll train a small side adapter, the LoRA adapter, which is typical practice and achieves good results with lower memory requirements.

So I'm heading back over to my pods; the PyTorch pod is ready to go, so I'll connect to Jupyter Lab. Next I'm going to upload a file: the fine-tuning script. You can purchase the fine-tuning script on its own for function calling, you can also purchase scripts for unsupervised or supervised fine-tuning individually, or you can buy access to the full Advanced Fine-tuning repo; if you have access to the repo, you get the benefit of all future scripts I upload to it as well. Here I've uploaded the Mixtral script, which is the script I used earlier for function-calling training. A lot of what I'm talking about here applies directly to the supervised fine-tuning, unsupervised fine-tuning, chat fine-tuning and long-context fine-tuning scripts; the main difference is that we are not going to train the linear feed-forward layers, but everything else will be very much the same for this training, which I'll demo using function calling.

As usual, I start by connecting to Hugging Face so I can push and pull weights, including from private repos. I've set the base model to Mixtral (let me increase my screen size a little), and I install the same packages I typically install: transformers, bitsandbytes if you're using quantization, and flash attention to improve training speed. After importing the modules, I load the model itself, and you can see I've commented out the quantization, so I'm loading it in full 16-bit precision using flash attention. Note that this way of enabling flash attention has recently been updated; I give a list of updates in the Trelis newsletter on Substack.
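Here is a minimal sketch of that load step: full 16-bit weights, flash attention, and device_map="auto" to spread the model across the two GPUs. The exact model ID and the pad-token choice mirror the walkthrough, but the flash-attention keyword has changed between transformers versions, so treat this as illustrative and check the version you have installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative load step: 16-bit weights, flash attention, model split across GPUs.
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",   # newer API; older versions used use_flash_attention_2=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.unk_token      # pad with the unknown token, as in the script
```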
Once the model has been loaded, we set up the tokenizer. We set the pad token to the unknown token, which is a common practice you'll see in many of the other videos as well; the unknown token is already defined in the Mixtral tokenizer. Next we set up LoRA. In fact, first we apply gradient checkpointing, which reduces memory a little during training, and then we print out the model. You can see that the Mixtral model has these attention layers, which are the ones we're going to train, and it also has experts 0 to 7, the eight experts listed here, but we're not going to touch any of that, because as I said it's too complicated to try to train this routing network of experts. We're going to focus on the attention, which is the same for every expert. We're not going to touch the layer norms either; that's something you do need to train if you're training for longer context length, and sometimes for chat fine-tuning, but we're not doing that here, we're just looking at function calling. So when we configure the LoRA, I've actually commented out the layers I might typically train, and I'm only training the self-attention layers. Once I've set the LoRA config, I get a LoRA model, which basically freezes the base model and creates a set of low-rank adapters on the side that we train; you can see we're only training a very small portion, about 0.1% of the parameters, and it works quite well.

Next we load a dataset: the Trelis function calling v3 dataset, which you can check out on Hugging Face if you're interested. It's a gated repo, and I'll put a link down below; it's a dataset specifically designed to get function-calling performance out of a base model. With the dataset loaded and the data organized into the right prompt format, which is basically the Llama 2 prompt format, nothing too exotic, we're ready to examine the datasets and print our first sample. I know I'm moving quickly, but look back at the function calling v3 video if you want more detail; I'm just trying to highlight the parts that are different for a mixture of experts. Here I run a test: the prompt says you have access to the following functions, lists the metadata for the two functions the model can access, and then asks for the names of the five largest stocks by market cap. You can see that, by default, the model does not respond with a structured JSON object; it responds more verbosely, when we would like a structured JSON object, and of course that's the purpose of the training. I've run a few more examples that can be used later for comparison, but let's move swiftly on to the training. The good news is that everything is the same as usual: I've used the same learning rate and the same number of epochs, just one epoch of training, and you can see the validation loss falls nicely. Once we get to an example after fine-tuning, just below, we again ask for the names of the five largest stocks by market cap, and the model now responds with a structured JSON object and an end-of-sequence token, which is exactly what we want.
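To recap the adapter setup described above, here is a sketch of a LoRA configuration that targets only the attention projections and leaves the expert feed-forward matrices and the router untouched. The rank, alpha and dropout values are illustrative rather than the exact script settings, and the module names are the ones you see when you print the Mixtral model.

```python
from peft import LoraConfig, get_peft_model

# Sketch: LoRA on the self-attention projections only; the expert feed-forward
# matrices and the router gate are deliberately left out of target_modules.
lora_config = LoraConfig(
    r=16,                     # illustrative rank
    lora_alpha=32,            # illustrative scaling
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
)
model = get_peft_model(model, lora_config)   # 'model' is the Mixtral model loaded earlier
model.print_trainable_parameters()           # reports only a tiny fraction as trainable
```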
Coming back to that fine-tuned response, you'll notice one small thing: it does include a recommendation to search for the biggest stocks in the world. That's just an artifact of how we've defined the functions; it wasn't a required parameter, so the model didn't have to provide it, and it may simply have defaulted to doing so. Okay, once the model has been trained, the same procedure as usual follows: defining a new model name, which is effectively the address we're going to push to, then saving the model, pushing the adapters to the Hub, and later merging the model and pushing the merged model to the Hub. I did hit an error with push_to_hub for Mixtral, so instead I used the upload functions from the Hugging Face API to upload the files to the Hugging Face repo. I'm not sure whether that was just a specific issue I faced (I've created an issue on GitHub), so perhaps the normal mechanism will work for you, but in any case you can use the upload_file or upload_folder functions if push_to_hub doesn't work.

Before I round off this video on Mixtral, a few highlights on fine-tuning and inference. For any fine-tuning, whether function calling, supervised or unsupervised (all of those scripts are in the Advanced Fine-tuning repo), the key thing to keep in mind is that you should not train the feed-forward layers: focus on LoRA and train LoRA adapters for the attention layers only. Other than that, transformers from Hugging Face works quite well, and most of the functionality seems to work smoothly with Mixtral, more or less in parallel with how it works for other models like Llama or the base Mistral model. When it comes to inference, the easiest way right now is again via Hugging Face and the bitsandbytes or EETQ forms of quantization, which give you 4-bit or 8-bit quantization to fit the model on one machine. There is a GPTQ model from TheBloke, and it can be inferenced using the transformers package, but it's not yet supported by text-generation-inference, the API software I showed you, so if you want to set up an API I still think it's easier to use EETQ. Over time, AWQ will probably be supported, as will GPTQ, and at that point I'd probably recommend AWQ for speed. You could also use vLLM; if you want to check out a video I made recently on inference, you can see more details on using vLLM, and perhaps by the time you watch this, vLLM will support Mixtral, which it doesn't currently. All right folks, let me know any questions below, and I'll leave links to all the various repositories in the description. Cheers!
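As a footnote to the push_to_hub issue mentioned in the captions above, here is a sketch of the upload fallback using the Hugging Face Hub API. The local folder and destination repo names are hypothetical placeholders.

```python
from huggingface_hub import HfApi

# Sketch of the fallback: upload the saved files directly if push_to_hub misbehaves.
api = HfApi()
api.create_repo(repo_id="your-username/mixtral-function-calling", exist_ok=True)
api.upload_folder(
    folder_path="./mixtral-function-calling-adapter",  # directory written by save_pretrained
    repo_id="your-username/mixtral-function-calling",  # hypothetical destination repo
    repo_type="model",
)
```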
Info
Channel: Trelis Research
Views: 7,890
Keywords: mixtral tutorial, mixtral 7b, mixtral fine-tuning, mixtral inference, train mixtral, mixtral, mixtral-8x7b, mistral 8x7b, mixtral ai, mixtral moe, mixture of experts fine-tuning, finetuning mistral 7b, mistral 8x7b install, mixtral function calling
Id: EXFbZfp8xCI
Length: 33min 33sec (2013 seconds)
Published: Mon Dec 18 2023