Mixtral - Mixture of Experts (MoE) Free LLM that Rivals ChatGPT (3.5) by Mistral | Overview & Demo

Video Statistics and Information

Captions
Hey everyone, my name is Venelin, and in this video we're going to talk about Mixtral by Mistral AI, a mixture-of-experts model that combines eight 7-billion-parameter experts into one. We're going to have a look at what this model achieves.

This is the official blog post on the Mistral AI web page, and as you can see, this is the Mixtral 8x7B model: a sparse mixture-of-experts model, or SMoE, with open weights. So the model is open source, at least the weights are, and it is licensed under Apache 2.0, so you can use it in commercial settings. One of the important claims here is that, at least on their benchmarks, it outperforms Llama 2 70B with 6x faster inference, which is a very strong result if it holds up. Another important claim is that it outperforms GPT-3.5 on most standard benchmarks, again very strong results if true.

One important property of this model is its long context: it can handle 32k tokens. The model is also trained on multilingual data: English, French, Italian, German and Spanish, unfortunately no Asian languages as far as I know. They also say it performs very well in code generation settings, and they're open sourcing an instruction-following model, a chat model, which we're going to have a look at in a bit.

So what is this mixture-of-experts network, and why does it work so well, at least according to the authors? The interesting thing is that they're using something you can think of as ensemble learning, but in a sparse form. At every layer of the network, for every token, a router network chooses two of eight groups of parameters (the "experts") to process the token and combines their outputs additively. We're going to look at how this is done in code, but essentially each token is passed through only the top two of the eight experts, as chosen by the router. They also note that the model has roughly 47 billion parameters in total but uses only about 13 billion of them per token, which of course gives much faster inference compared to a dense model of the same total size.
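To make the routing mechanism concrete, here is a minimal toy sketch in PyTorch of a top-2 sparse MoE layer, written from the description above rather than taken from Mistral's code; the class name and dimensions are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy sparse MoE layer: a linear router picks the top-2 of 8 expert
    MLPs for every token and combines their outputs additively."""

    def __init__(self, hidden_dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts)  # the router
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, 4 * hidden_dim),
                nn.SiLU(),
                nn.Linear(4 * hidden_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden_dim)
        probs = F.softmax(self.gate(x), dim=-1)            # routing probabilities
        weights, chosen = torch.topk(probs, self.top_k)    # top-2 experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize to sum to 1
        out = torch.zeros_like(x)
        for t in range(x.size(0)):  # per-token loop, written for clarity, not speed
            for k in range(self.top_k):
                expert = self.experts[int(chosen[t, k])]
                out[t] += weights[t, k] * expert(x[t])
        return out

layer = ToyMoELayer()
tokens = torch.randn(5, 64)  # 5 tokens with hidden dimension 64
print(layer(tokens).shape)   # torch.Size([5, 64]); only 2 of 8 experts ran per token
```

Only two of the eight expert MLPs run for any given token, which is where the gap between the ~47 billion total and ~13 billion active parameters comes from.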
The performance of the model is shown right here. On MMLU, Mixtral slightly beats GPT-3.5, which is very impressive for such a small model, if you think of it as a small model, of course. You can see that it is better on pretty much every benchmark compared to Llama 2 70B, and on some of the benchmarks it even beats GPT-3.5. As far as we know, GPT-3.5 is rumored to be on the order of 175 billion parameters, so this mixture-of-experts model outperforming it on these benchmarks is really impressive.

Here is a very nice comparison between Mistral 7B, the original model from Mistral, the Mixtral 8x7B model we're looking at, and Llama 2. On the x axis you have the number of parameters, since we have Llama 2 at 7, 13 and 70 billion parameters, and Mixtral sits around the 13-billion-parameter point, at least by the authors' count of active parameters. You can see that for this inference budget of roughly 13 billion parameters, you are getting roughly Llama 2 70B performance, or even more. The model is very strong on code generation, but on the comprehension benchmarks it is lacking a bit compared to the Llama 2 models.

Mixtral 8x7B is provided along with an instruct model, a chat model, and the authors say it was trained with supervised fine-tuning and direct preference optimization (DPO). DPO is a technique that is pretty much standard these days when you're fine-tuning or training your own chat or instruct models. They claim it reaches a score of 8.30 (on MT-Bench), making it the best open-source model, with performance comparable to GPT-3.5. Of course, we'll have to find out on our own once the model is more widely available. They also say that you pretty much have to do your own moderation, and they provide examples of how to do that, so I would say this model was not strongly censored, if you will.

The Mixtral model is already available on the Hugging Face Hub, and the latest version of the Transformers library can be used to run it. Of course, the models are very large: this is roughly a 47-billion-parameter model, so you can't easily fit it in a Google Colab notebook. But here is an example of, for instance, the tokenizer and how you can load it with the current version of the Transformers library, so if you have enough resources you can run this model on your local device or machine.
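As a sketch of what running it locally looks like, assuming the published Hub checkpoint mistralai/Mixtral-8x7B-Instruct-v0.1 and a recent Transformers release, loading could look roughly like this (the half-precision weights alone are close to 90 GB, so smaller machines would need quantization):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" (requires the accelerate package) spreads the ~47B
# parameters across whatever GPUs and CPU RAM are available; on smaller
# machines you would additionally quantize, e.g. with bitsandbytes.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "[INST] Explain the mixture-of-experts idea in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```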
Hugging Face has released a really nice blog post called "Mixture of Experts Explained". This is probably a push that was first started by GPT-4, since there were rumors that GPT-4 is actually a mixture-of-experts model: the thinking was that it has a number of experts behind a gating network, or router network, that chooses which expert or experts get your prompt, or parts of it. We're not really sure how GPT-4 works under the hood, but there are papers describing these mixture-of-experts networks from, let's say, top researchers, some of whom are still working at OpenAI, so it is pretty interesting to see whether those ideas are used in GPT-4's training and also, now, in the open-source world.

The blog post gives background on Mixtral 8x7B and how it works: what mixtures of experts are, why these models are essentially hard to train, and what the challenges and benefits of training them are. They have a very nice TL;DR: MoE models pretrain much faster and have faster inference, since you're not using all of the parameters of the model at inference time (and this also goes for training), but they still require high amounts of GPU memory. Essentially, you need to load the complete model into GPU RAM, and then it is very hard to fine-tune or train these types of models; even with the Mixtral models it is still a bit challenging.

The main components, at least from the standpoint of Transformer models, are the sparse layers, the mixture-of-experts layers if you will. These are used instead of the typical feed-forward layers (the linear layers in PyTorch), and you essentially pass a parameter specifying the number of experts to use. In practice, each expert is still implemented as a feed-forward neural network layer; we're going to have a look at the implementation in a bit. Then you have the gate network, or router, that chooses which tokens are sent to which expert. Essentially, this is a classifier that, given a token, chooses which expert or experts will receive it. In their example, the token "More" is sent to the second expert and the token "Parameters" is sent to the first.

They have a very nice illustration of a Switch Transformer encoder block and how the whole setup works. We have two tokens, "More" and "Parameters". These go through self-attention and normalization layers, and then, for the first token, the router chooses from four experts: in this case the probability distribution over the router weights sends it through the second expert, as you can see. So again, the router is just doing classification based on the tokens. For the second token, the probability distribution from the router suggests it should be passed through the first expert, and then the normal operation of the model continues.

This is the open-source implementation of the Mixtral models in the Transformers library. As we've already discussed, it is in the latest version of the library, and here we have the sparse mixture-of-experts block that is used to build the Mixtral model. One of the important things to note is the actual implementation of the layers. As you can see, the gating or router layer is actually just a linear layer: it takes the number of hidden dimensions as input and classifies over the number of experts you have. Then, for each expert, a MixtralBlockSparseTop2MLP layer is created (the config is passed to it), and these are collected into a module list of experts. So this is a very high-level overview of how everything works. For the router weights, you can see that the hidden states of the input are passed to the gate, which produces the weights; a softmax is applied on top of them, and then the top-k elements are selected, where k is the number of experts each token is passed through. To see the defaults, have a look at the default config of the Mixtral model, MixtralConfig in the Transformers library: the hidden size, the intermediate size, the number of attention heads of the underlying Transformer model, then the number of local experts, which is eight, and the number of experts per token, which is the top two the model uses when routing.
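If you want to double-check those defaults yourself, MixtralConfig is importable in Transformers 4.36 and later; the values in the comments are what I'd expect for the released 8x7B checkpoint:

```python
from transformers import MixtralConfig

config = MixtralConfig()           # defaults mirror the Mixtral-8x7B setup
print(config.hidden_size)          # 4096
print(config.intermediate_size)    # 14336 (the width of each expert MLP)
print(config.num_attention_heads)  # 32
print(config.num_local_experts)    # 8  -> eight experts in every MoE layer
print(config.num_experts_per_tok)  # 2  -> top-2 routing per token
```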
The Mixtral model has already been integrated into HuggingChat, which is a way to play around with different models. As you can see, we have the Mixtral model active right here, and you can try it out against the Llama 2 70B model, the Falcon 180B model, and so on. Let's try a couple of examples. I'm going to take some examples from the AI Bootcamp that is going to be available on MLExpert, so if you want to try out the bootcamp, please go ahead and subscribe on mlexpert.io.

For the first one I'm going to use a system prompt; let's see how well this works with system prompts. I'll paste it in right here: "You're a Slavic cook. You love everything Slavic and understand that it is superior to anything else." Let's close this and check that it's saved. OK, it appears to be working, and I'm going to ask the model for the most iconic dish prepared by Slavic grandmothers. Let's see what we get: pelmeni (Russian dumplings), pierogi, borscht, sarma. It looks really interesting; at least sarma is very important in Slavic culture, I can attest to that.

Let's try it with some coding. I'm going to change the system prompt again: "You're an experienced Python developer who writes efficient and readable code. You always strive to use built-in libraries." We're going to do the second part right here: "Write a function that fetches the daily price of Tesla stock for the last week." It gives us a way to install the libraries it is going to use. Let's have a look at the function. I'm not going to check whether or not this works, but it appears to be using pandas while actually missing the pandas import, as you can see. It also takes the current day using a timedelta, which is likewise not imported. Interesting. The only library that is actually imported is the Yahoo Finance library (yfinance), and it takes the closing price with an interval of one day, from a start date up to today. Interestingly, it fetches only the last five days by default, which I'm not so sure about, since I asked for the last week. So this really needs a bit of improvement before you can run it on your own, but with some imports it would probably work; you'd have to test it.
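For reference, here is a cleaned-up sketch of what the model seemed to be going for, with the missing pandas and timedelta imports added and the window widened to a full week. This is my reconstruction, assuming the yfinance package, not the model's verbatim output:

```python
from datetime import date, timedelta

import pandas as pd    # the generated code used pandas but forgot this import
import yfinance as yf  # the only import the model actually included


def fetch_tsla_daily_prices(days: int = 7) -> pd.DataFrame:
    """Fetch daily TSLA prices for roughly the last week."""
    end = date.today()
    start = end - timedelta(days=days)  # timedelta was missing in the generated code
    return yf.download("TSLA", start=start, end=end, interval="1d")


prices = fetch_tsla_daily_prices()
print(prices["Close"])  # the closing price for each trading day in the window
```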
Let's continue with another example, analyzing tweets; these are really interesting on the chat side. The system prompt: "You're an expert social media analyst. When analyzing text, you always take into account the content and put heavy importance on the author." I'm going to take this tweet by Elon Musk and ask the model to do something for us, as you'll see in a bit: "What is the meaning of this tweet? Do sentiment analysis. Rewrite it in the words of Marcus Aurelius." The tweet reads: "I hope that even my worst critics remain on Twitter, because that is what free speech means." The model says he might be implying that some individuals have threatened to leave Twitter due to his presence or actions, but he believes that everyone should have the right to voice their thoughts freely. The overall sentiment of the tweet appears to be positive, as Musk seems to be advocating for open discourse and respectful dialogue between people with different viewpoints. Rephrasing the tweet in the style of Marcus Aurelius, the Stoic philosopher, we might say: "May all voices, including those which challenge me most fiercely, find a home on this digital agora, for true freedom lies not in silence or suppression but in the courageous exchange of ideas." Wow, this is really inspiring, actually. Pretty good.

So thanks for watching, guys. Please like, share and subscribe. I hope you found the Mixtral mixture-of-experts model by Mistral AI really interesting. Hopefully we're going to try a sparser or distilled version in a Google Colab notebook once those are available. Please also join the upcoming AI Bootcamp on mlexpert.io; I'm going to link it in the first comment down below. Thanks for watching, bye, and I'll see you in the next one.
Info
Channel: Venelin Valkov
Views: 2,338
Keywords: Machine Learning, Artificial Intelligence, Data Science, Deep Learning
Id: ZpgxcKIFO0c
Length: 18min 49sec (1129 seconds)
Published: Tue Dec 12 2023