Mixtral On Your Computer | Mixture-of-Experts LLM | Free GPT-4 Alternative | Tutorial

Captions
Hello and welcome back to another video, guys. In this video I want to talk with you about the Mixtral mixture-of-experts large language model, which achieves impressive scores on various benchmarks, indicating that it could potentially succeed Llama 2 as the most capable open large language model. I will briefly explain mixture-of-experts models, then show you how to run the Mixtral model on your own computer, and also how to potentially fine-tune it. It has to be said that a lot of research is still ongoing into how to train mixture-of-experts large language models, but more on that later.

Only 24 hours ago the Mistral AI team published the Mixtral 8x7B large language model, which according to Mistral outperforms the Llama 2 70B LLM on most benchmarks with six times faster inference, which is incredible news for everyone working with such big large language models. It also comes with a permissive Apache 2.0 license, which allows you to use it for commercial purposes as well. As the authors describe, it outperforms the Llama 2 70B model on most benchmarks and also matches or outperforms GPT-3.5 on most standard benchmarks, which is very impressive and the first key highlight of the model. At the same time, this has to be taken with a grain of salt, because several studies have shown that it's pretty easy to achieve better results on common LLM benchmarks, for example by adding the training data of the evaluation benchmark to the data used for pre-training the large language model. We can see that a 7B Llama 2 model which didn't include any training data of the MMLU evaluation benchmark achieves a score of around 43, while the version that did include the MMLU training data achieves a way higher score. This holds true across different benchmarks and also across different backbone or foundation LLMs. I definitely don't want to say that Mistral added data from evaluation benchmarks, but it's always helpful to keep that in mind when looking at benchmark numbers, and I think the best proof as of right now is to work with such LLMs yourself and figure out their capabilities. As a company you obviously want to train and publish models that achieve the highest scores on all the benchmarks out there to show how good your model is, which then attracts investors, so that could be a helpful trick to get more investment into your company. Again, I really don't want to give the impression that Mistral did this; I just want you to be aware that it's technically possible.

Another highlight of the Mixtral model is that it can handle English, French, Italian, German and Spanish, which is really cool. Mistral is a French company, so I think that's why they're also interested in supporting languages other than English, which is definitely interesting, especially for developers in Europe. One thing I found interesting, however, is the table where different benchmarks (ARC Challenge, HellaSwag and MMLU, I hope I pronounce them correctly) are translated into French, German, Spanish and Italian, and the Mixtral model is compared to the Llama 2 70B and the Llama 1 33B models. While you can see that the Mixtral model achieves the best results for all four languages, it has to be said that the Llama 2 model in particular was solely focused on English, so it's not really a big surprise that it achieves worse results than the Mixtral model, which was also trained on French, German, Spanish and Italian. Still, it's cool to have a model that is not only good in English but also in French, German, Spanish and Italian.
The last thing I would like to highlight is that the Mixtral model can handle a context of up to 32,000 tokens, which, in case you're not already thinking in tokens, is around 50 pages of text. Compared to the 4,000-token context window of the Llama 2 model, this is definitely a huge step, so it's really cool that the Mixtral model supports a context size of 32,000 tokens.

As I mentioned at the start of the video, the Mixtral model is a mixture-of-experts model, and here we can see an example of what that actually means. We have a layer that takes an input, and where you would usually see a feed-forward neural network, also called a multi-layer perceptron, so a dense neural network, we instead first have a gating network that decides where our input should be passed. We have multiple experts, from expert 1 up to expert n, and the gating network decides to which of these experts we pass our input. As you can maybe imagine, this has several benefits. One is listed in this really nice blog article from Hugging Face: mixture of experts enables models to be pre-trained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model; in particular, a mixture-of-experts model should reach the same quality as its dense counterpart much faster during pre-training. If we jump back to the illustration, we can see that during the forward pass only two expert networks are used, and this way we save a lot of compute by not passing through all the other expert networks. During backpropagation this also means that we only have to calculate the gradients of these two expert networks. With this, our model can contain way more parameters while, during inference and also during training, only a smaller part of the overall architecture is actually used. That's also why mixture-of-experts models are called sparse: not all the neurons or parameters are used during a forward pass and during backpropagation, only some of them. Given the same budget, this approach allows you to pre-train much bigger large language models, and very likely that's the reason why Mistral trained such a model with Mixtral.

To summarize mixture-of-experts networks: their pre-training is much faster than that of dense models, and they also allow faster inference than a dense model with the same number of parameters, because we only pass through a few experts and hence have less computational effort inside the network; for Mixtral, for example, we pass two experts per token instead of all eight experts inside the Mixtral LLM. One downside of mixture-of-experts models is that they still require a lot of VRAM, since all experts have to be loaded in memory. You can't really predict in advance which token will be processed by which expert, so you have to keep all experts available in your VRAM at all times and can't load and unload them dynamically, because that would just take too much time.
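To make the routing idea more concrete, here is a minimal, simplified sketch of a top-2 gated mixture-of-experts layer in PyTorch. This is not Mistral's actual implementation, just an illustration of the mechanism described above; the class and parameter names are made up for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-2 mixture-of-experts layer (illustrative only, not Mixtral's code)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The gating (router) network produces one score per expert for every token.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model), one row per token.
        router_logits = self.gate(x)                                   # (num_tokens, num_experts)
        weights, selected = torch.topk(router_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                           # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for expert_idx, expert in enumerate(self.experts):
            token_idx, slot = (selected == expert_idx).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                               # this expert received no tokens
            # Only the tokens routed to this expert are pushed through it.
            out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
        return out

# Each token is routed to 2 of the 8 experts, so only a fraction of the
# parameters take part in the forward (and backward) pass.
layer = MoELayer(d_model=64, d_ff=256)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

The loop over experts makes the sparsity visible: an expert that receives no tokens contributes nothing and its weights get no gradient in that step, which is exactly why pre-training and inference are cheaper than in an equally sized dense model.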
Now you might ask yourself: why do the experts in a mixture of experts diversify instead of collapsing into a single model, and how can the router learn to dispatch the data to the right expert? If so, I can recommend checking out the paper "Towards Understanding Mixture of Experts in Deep Learning", which was published last year in August. This shows that mixture of experts is still heavily researched, and the Mixtral model will probably help generate new knowledge, having such a big community using it, potentially at least. One of the empirical results that I want to share with you is that they found the cluster structure of the underlying problem and the non-linearity of the experts to be pivotal to the success of mixture of experts. With text data I think that kind of makes sense: you have, for example, medical or financial domains, so certain tokens can be routed to a specific domain or expert.

In more practical terms, this means that inside the Transformer architecture the feed-forward network is replaced with a mixture-of-experts network. In this particular illustration they replace the feed-forward network only in every other Transformer layer, so in the following Transformer block the ordinary feed-forward layer is still there; that's also why the number of Transformer blocks is halved in the figure, just so you don't get confused by it. Thinking of large language models, which process tokens, the router decides which expert network processes which token, and this is also how it's done in the Mixtral model: in the example, the token "more" is processed by the second expert network, while the second token, "parameters", is passed through the first expert network.

To run the model we will use the popular Transformers library from Hugging Face, and they also wrote a blog article on how to run the Mixtral model using their library. Before we start, I quickly want to show you the VRAM requirements, which are very high: the Mixtral model is roughly equivalent in size to a 45-billion-parameter dense model. This is one of the downsides of mixture-of-experts models: while you can scale your model and add capacity, your VRAM requirements increase at the same time. The benefit, obviously, is that because you don't pass through all the experts during the forward pass, inference is faster. We can see that it takes at least 23 GB of VRAM to load the model even with 4-bit precision. This is also an area where I could see a lot of progress in the coming weeks and months, with way more people exploring such a sparse mixture-of-experts large language model; there is already work on a mixture of quantized experts, which could also be interesting for you, and I'm sure we will soon see people running the Mixtral model maybe on their CPU or at least on GPUs with less VRAM. But as of right now, these are the requirements to run the model. Luckily, NVIDIA was kind enough to support my channel with their RTX 6000 Ada GPU, which has 48 GB of VRAM; I will link this GPU in the description box in case you're interested. It would actually be really cool to only load the weights or parameters that are used per token, because this way we could drastically reduce the memory needed to load the model, but if you download the Mixtral model it has a size of around 93 GB, and you can imagine that loading and unloading experts of the model would just take way too long.
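As a rough back-of-the-envelope check on these numbers (assuming roughly 46.7 billion total parameters for Mixtral 8x7B, which is the figure Mistral reports, and ignoring activations, KV cache and other overhead), the weight sizes work out like this:

```python
# Approximate weight memory for Mixtral 8x7B; numbers are rough estimates.
total_params = 46.7e9            # assumed total parameter count of Mixtral 8x7B

bytes_bf16 = total_params * 2    # 16-bit weights: 2 bytes per parameter
bytes_4bit = total_params * 0.5  # 4-bit quantized weights: 0.5 bytes per parameter

print(f"bf16 checkpoint: ~{bytes_bf16 / 1e9:.0f} GB")  # ~93 GB, matches the download size
print(f"4-bit weights:   ~{bytes_4bit / 1e9:.0f} GB")  # ~23 GB, matches the quoted VRAM floor
```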
Now we can see how to run the Mixtral model on your computer, in case you have enough VRAM. First make sure to install the Transformers library with version 4.36, along with the other libraries I've listed here; I'm using those specific versions. Then we have the model ID on Hugging Face: I'm using the Instruct model, which gives the chat-style behavior we know from ChatGPT. This setting specifies that the model should be loaded in 4-bit precision, which I also specify here, and this is the internal computation type, which by default was set to float32; I got a warning about that, so I explicitly set it to torch.float16. The Transformers library then makes it super convenient to load the model, pass prompts to it and generate outputs using a pipeline, in this case a text-generation pipeline. We pass the model ID that we specified above, and the rest are more or less hyperparameters for loading the model. Then I just press Shift+Enter.

Once our model is loaded, we can convert our instruction into a prompt. The Mixtral model expects the following format: a beginning-of-sentence token, then the instruction start string and the instruction end string, and finally an end-of-sentence token, which indicates that the generation of new tokens should stop. Luckily we don't really have to worry about any of that, because it's already handled by the pipeline we initialized. All we need to do is bring our instructions and the assistant's responses into the following format: a message list where each entry has a role, so "user" with the instruction as its content, and, for the assistant, the role "assistant" with the assistant's answer as its content. Let me quickly run this command so you can see what the prompt actually looks like: here you can see the beginning-of-sentence token, the start of the instruction, the actual instruction, and the end of the instruction, and now the model knows it's time to generate the reply.

We then forward this prompt to the pipeline and define the maximum number of tokens to generate, which is 256 here; feel free to increase it if you want. We enable sampling, which means we are not always taking the next token with the highest probability but sampling from a certain collection of tokens. The temperature affects the token probabilities: it's usually said that the lower the temperature, or the closer it is to zero, the more deterministic the assistant's answers will be, which means you will see the same answer more often, while with a higher temperature around one or even above, the answers will be more diverse. Top-k means we only sample from the 50 tokens with the highest probability, and top-p means we build a cumulative probability sum up to 0.95; since we also defined top-k, if it takes more than 50 tokens to reach a cumulative sum of 0.95, we still only sample from the 50 most probable tokens, just to explain these settings.
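For reference, here is a minimal sketch of this setup with the Transformers library. It follows the approach from the Hugging Face blog rather than being the exact notebook shown in the video; the prompt and the sampling values mirror the ones discussed above where they were given, and the temperature value is just an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Load the weights in 4-bit precision and do the internal computations in float16.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# The chat template takes care of the <s>[INST] ... [/INST] format for us.
messages = [{"role": "user", "content": "Explain what a mixture of experts is in less than 100 words."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipe(
    prompt,
    max_new_tokens=256,  # upper bound on the generated reply
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,     # lower -> more deterministic, higher -> more diverse (example value)
    top_k=50,            # only consider the 50 most probable tokens
    top_p=0.95,          # ...and only up to a cumulative probability of 0.95
)
print(outputs[0]["generated_text"])
```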
Then we get our output. The instruction in this case is "explain what a mixture of experts is in less than 100 words", so let's see what the Mixtral model basically says about itself. Here we can see the format we had before, and that the model had to generate its answer after the end-of-instruction string.

Since I usually prefer to work with LLMs through a user interface, I wrote a pretty concise script that allows you to interact with the Mixtral model on a UI. I will link it in the description box, so feel free to check it out and run it yourself. For demonstration purposes I will quickly run it by typing "python mixtral.py"; as we've seen before, the model will now be loaded, and then we can interact with the LLM using a user interface. You can copy this link, open it in your browser and start interacting with the Mixtral LLM. Now we can ask the same question again: explain what a mixture of experts is in less than 100 words. Here I was actually surprised how fast the generation is, which is partly thanks to the mixture-of-experts architecture, which allows much faster inference for such a big large language model. At the same time, I used to work a lot with RTX 3090 GPUs from NVIDIA, and this RTX 6000 Ada really makes a noticeable difference when generating tokens or text. This is not limited to the RTX 6000, though; I think you would see a similar leap with an RTX 4090, and I'm actually curious whether people are able to run the Mixtral model on an RTX 4090, so in case any of you try that, please let me know in the comments, I would be very keen to know how it goes. I could go on and ask the model to tell me more, and I guess it's obvious why I'd rather interact with models this way than by running individual code cells.
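The UI script itself isn't walked through in the video, but as a rough idea of what such a script could look like, here is a minimal sketch built on Gradio's ChatInterface, reusing the pipe and tokenizer from the loading snippet above. It assumes a Gradio version whose ChatInterface passes the history as (user, assistant) tuples; the function name is just a placeholder, and this is not the script linked in the description.

```python
import gradio as gr

# Assumes `pipe` and `tokenizer` from the loading snippet above are already defined.

def respond(message, history):
    # Rebuild the chat history in the message format the chat template expects.
    messages = []
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    # Strip the prompt so only the newly generated reply is shown in the chat window.
    return outputs[0]["generated_text"][len(prompt):]

# launch() prints a local URL you can open in your browser.
gr.ChatInterface(respond).launch()
```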
To now fine-tune the Mixtral model, we can again have a look at the Hugging Face blog post, where they describe how to fine-tune the Mixtral model using their TRL library. I have to say I'm planning a video on how to fine-tune large language models with the Transformers library anyway, because the procedure for fine-tuning Llama 2, Falcon or Mixtral is more or less always the same, and for that reason I'm planning a dedicated video which I will upload soon; in case that sounds interesting to you, definitely make sure to subscribe to the channel if you haven't already.

Fine-tuning the Mixtral model, or in a broader sense sparse mixture-of-experts models, is definitely more difficult than fine-tuning dense large language models, primarily because the overfitting dynamics are very different between dense and sparse models, and sparse models are more prone to overfitting. One very interesting paper here is ST-MoE, "Designing Stable and Transferable Sparse Expert Models", by a research lab from Google Brain. In chapter four, on the fine-tuning performance of sparse models, they compare two tasks: the CB task with 250 training sequences and the ReCoRD task with 138,000 training sequences, so tremendously more. The sparse model achieves a higher score on the training eval for both tasks: the blue line is above the green dense training eval in both cases. But the issue with sparse mixture-of-experts models is that fine-tuning them is a little harder. For the task with only 250 training sequences, the dense model achieves higher scores on the validation data, so the red dense validation eval line lies above the sparse validation eval, while for the task with many training sequences it's the other way around, and the sparse model achieves higher validation scores than its dense counterpart. The first observation, then, is that to effectively fine-tune a sparse mixture-of-experts model we need many, many examples of that specific task; please keep that in mind, because we will come back to it in a second.

Another very interesting finding of the paper comes from investigating which parameters of the overall architecture should be fine-tuned. They try different configurations, for example fine-tuning only the parameters inside the mixture-of-experts network, which more or less means fine-tuning only the experts for a specific task. One very interesting result is that fine-tuning all non-mixture-of-experts parameters is almost as effective as fine-tuning the whole model. This might not be obvious, but the mixture-of-experts layers make up the majority of all parameters, so only fine-tuning the non-MoE parameters can speed up your fine-tuning by a huge margin, which is definitely a very interesting finding.
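To make that last finding concrete, here is a small sketch of how you might restrict training to the non-MoE parameters of the Hugging Face Mixtral model. It assumes the Transformers implementation exposes the expert and router weights under modules named "block_sparse_moe", which is how the Mixtral code names them at the time of writing; treat the exact names as an assumption and check model.named_parameters() yourself. In practice, with 4-bit weights you would typically combine this idea with an adapter method like LoRA on the non-MoE modules rather than training them directly, but that is beyond the scope of this sketch.

```python
# Freeze every parameter that belongs to a mixture-of-experts block
# (the experts and the gate/router), so only non-MoE parameters get gradients.
# Assumes `model` is the MixtralForCausalLM loaded in the earlier snippet.
frozen, trainable = 0, 0
for name, param in model.named_parameters():
    if "block_sparse_moe" in name:   # expert FFNs and the router (assumed module name)
        param.requires_grad = False
        frozen += param.numel()
    else:                            # attention, norms, embeddings, lm_head, ...
        trainable += param.numel()

print(f"trainable params: {trainable / 1e9:.1f}B of {(trainable + frozen) / 1e9:.1f}B total")
```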
In further experiments they also found that for sparse models it's actually better to use a smaller batch size: 65,000 is still pretty large, but increasing the batch size further (the orange bars) makes the score worse, while for dense models it's more or less the opposite, with the score continuing to increase, at least up to a point. It's similar with the learning rate: for dense models it's better to use a lower learning rate, while for sparse models choosing a higher learning rate seems to lead to better fine-tuning results.

There's one more paper I want to show you, called "Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models", which should already make us a little excited, especially in the context of the Mixtral model. To keep it short, the authors found that, compared to dense models, sparse mixture-of-experts models benefit more from instruction tuning and are more sensitive to the number of instruction-tuning tasks. In red we see the sparse mixture-of-experts model and in blue its dense counterpart, and while the dense model also benefits from more instruction-tuning examples, the average eval metrics are higher for the sparse model and the gains are larger overall. This also matches the earlier observation that having many examples for a specific task helps the sparse model adapt better to that task. Overall, we can say that fine-tuning sparse mixture-of-experts models is harder than fine-tuning dense LLMs because sparse mixture-of-experts models are more prone to overfitting, and to overcome this, having many examples for a specific task is helpful during fine-tuning. In the context of large language models, this means that having many instruction-tuning examples of a specific task is very beneficial for fine-tuning a mixture-of-experts large language model like Mixtral.

All right, that's it for today's video. Let me know in the comments what you think about the Mixtral model. I definitely think it's cool to see kind of a new architecture with mixture-of-experts models, and I'm very curious to see what people will create using the Mixtral model, and when the first person manages to run the model on their CPU or finds a better way to quantize it. That's it for today; I hope you learned something, I hope you enjoyed it, and I hope to see you in the next video. Until then, have a great time. Bye-bye.
Info
Channel: Martin Thissen
Views: 28,912
Id: ucov1AWvGEc
Length: 22min 4sec (1324 seconds)
Published: Fri Dec 15 2023