How Did Open Source Catch Up To OpenAI? [Mixtral-8x7B]

Video Statistics and Information

Captions
It only took open source a year to make a model that reaches the level of GPT-3.5. While some may say OpenAI has no moat because of this, a year in AI time is really on a different scale than in real life. Anyway, this moat discussion all came back into the picture thanks to Mistral publishing Mixtral 8x7B, and right off the bat you'll notice it has a really unique naming scheme. Instead of a single number representing the model's parameter count, which is what people usually do, they were doing math in its name, and that's because they are referring to an architecture paradigm they introduced with this model called mixture of experts (MoE), which is completely different from how most LLMs operate. While the main idea of mixture of experts is nothing new, it has not been a prominent method for LLMs, especially at scale, but the people at Mistral were able to make this method work and perform better than Gemini Pro, Claude 2.1, and GPT-3.5.

So what exactly is this mixture-of-experts approach? Rather than using a feed-forward layer that is, for example, 512 wide, why not split it into eight networks of 64 each? If something like a router can pick the correct network for each inference, then we technically only have to run 1/8 of the neurons on every forward pass. The core idea is that there are eight expert models that specialize in different things, and instead of combining results from all of them, a router decides which two experts to trust when given a question or a prompt. By only using two experts, it reduces the computational cost and increases the speed of generation. This combines the strengths of multiple smaller expert models to solve whatever problem the user throws at it.

But how can it be so good? One reason might be that a lot of research has already shown that a smaller model focused on a specific topic can outperform a larger, more generalized model. So you can interpret Mixtral 8x7B in two ways: it is a 47B model with 13B-model speed, or a 13B model with the performance of a 47B model. That makes the router extremely important here, because it has to choose the experts that will generate the best results.

How does this router know which experts to choose, then? It learned to choose the best experts during training. This is achieved by having the router train together with all the experts, then using a softmax gating function to model a probability distribution over the experts and decide who gets what. But if we just train it naively, we will most likely end up with one strong and seven weak experts, because the expert that happens to get good first will always be picked, which defeats the purpose of using MoE. So we need to make sure all of them are equally good. To overcome this, they add noise to the router and penalize it if it does not distribute its choices equally across all experts, which incentivizes the router to reach an equilibrium where all the experts are used equally during training.

So overall, the researchers do not get to decide what each expert specializes in; gradient descent does, which still makes it kind of a black box, but as long as it's better, it's free real estate. They said that, surprisingly, they do not observe obvious patterns in the assignment of experts based on topic, but that the router does exhibit some structured syntactic behavior. So the experts appear to be more aligned with syntax and semantics rather than the knowledge domains we would have otherwise assumed.
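To make the routing idea above concrete, here is a minimal PyTorch sketch of noisy top-2 softmax gating with a simplified load-balancing penalty. The hidden size, the noise scale, and the exact form of the penalty are illustrative assumptions, not Mistral's actual implementation (in Mixtral the router sits inside every transformer layer and feeds its weights into eight feed-forward experts).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopTwoRouter(nn.Module):
    """Toy router: picks 2 of 8 experts per token via noisy softmax gating."""

    def __init__(self, hidden_dim: int = 512, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim) -> gating logits: (num_tokens, num_experts)
        logits = self.gate(x)
        if self.training:
            # Added noise keeps routing from collapsing onto the same experts early on.
            logits = logits + torch.randn_like(logits) * 0.1
        probs = F.softmax(logits, dim=-1)

        # Keep only the top-2 experts per token and renormalize their weights.
        weights, expert_ids = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        # Simplified load-balancing penalty: the mean routing probability per expert
        # should stay close to uniform (1/num_experts), otherwise this term grows.
        mean_prob = probs.mean(dim=0)
        aux_loss = probs.shape[-1] * (mean_prob ** 2).sum()
        return expert_ids, weights, aux_loss


if __name__ == "__main__":
    router = NoisyTopTwoRouter()
    tokens = torch.randn(6, 512)                  # 6 tokens, hidden size 512
    expert_ids, weights, aux_loss = router(tokens)
    print(expert_ids)   # which 2 of the 8 experts each token is sent to
    print(weights)      # how much each chosen expert's output is weighted
```

In a full MoE layer, only the two experts selected for each token are actually evaluated, and their outputs are summed with the renormalized weights; a term like aux_loss would be added to the training loss so the router learns to spread tokens evenly across experts instead of favoring one.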
It kind of makes sense given how the model is trained to hop from expert to expert between tokens. In the example they gave, each color represents the expert model that the router assigns a token to. For the Python code you can see the assignment patterns more clearly, with tokens like "self" and "def" consistently assigned to a different expert. The tokens are not all the same color even though they are all about coding; the colors vary across the board, which shows that the router assigns experts based on syntax or semantics and not on subject domain. Some people have even made their own variants already; wow, these open source people really do work fast.

And if you do want to run Mixtral, I have some bad news for you. While it is claimed to only use 13 billion parameters when running, 86 GB is still the recommended VRAM size to run it without quantizing, so good luck collecting that much VRAM in the wild. There is a TensorRT-LLM version of Mixtral though, so use this information at your own discretion. (A rough back-of-the-envelope on the memory math is sketched after these captions.)

Before we end this video, here's an early heads up for NVIDIA's GTC 24, because if you attend the digital session anytime between March 18th and 21st, you may have a chance to win an RTX 4080 Super from me. So if you are interested in NVIDIA's upcoming AI breakthroughs and announcements, or just want to win a brand new GPU, use the link down in the description to sign up right now so you don't forget. The GTC 24 AI conference this year has three different registrations: in-person, virtual, and workshop. The GTC in-person conference pass has early bird pricing right now, and you can connect with some cool industry people face to face during the event. The virtual session is completely free to sign up for, but it has limited spaces; it lets you attend some key GTC events across the web. And then there's the full-day workshop, which you can attend either in person or virtually and earn a certificate of competency. This year's GTC conference has topics such as generative AI, computer vision, and innovative workflows, so don't miss out on this chance to learn from the global experts. I am still planning how to run the giveaway, but it'll probably be along the lines of taking a photo of yourself watching GTC virtually, so sign up now and you won't forget.

Thank you so much for watching. A big shout out to Andrew Lelas, Chris, Leo, Alex J, Alex Marice, migm, Dean, fifal, and many others who support me through Patreon or YouTube. Follow my Twitter if you haven't, and I'll see you all in the next one.
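As a rough back-of-the-envelope for the VRAM point above, here is a small sketch of why the full ~47B parameters still have to live in memory even though only about 13B are active per token. The numbers are approximations used for illustration, not Mixtral's exact published figures.

```python
# Approximate memory/compute math for Mixtral 8x7B (illustrative numbers).
NUM_EXPERTS = 8           # experts per MoE layer
ACTIVE_EXPERTS = 2        # the router evaluates only 2 of the 8 per token
TOTAL_PARAMS = 47e9       # ~47B parameters stored (all experts + shared weights)
ACTIVE_PARAMS = 13e9      # ~13B parameters touched per forward pass
BYTES_PER_PARAM = 2       # fp16 / bf16, i.e. no quantization

weights_vram_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Weights alone at 16-bit precision: ~{weights_vram_gb:.0f} GB")   # ~94 GB
print(f"Fraction of weights used per token: ~{active_fraction:.0%}")     # ~28%
```

Because any of the eight experts can be selected at any layer for any token, every expert's weights must already be loaded; routing saves compute and latency, not memory. That is why the unquantized footprint lands in the same ballpark as the ~86 GB figure mentioned in the video, and why the model runs at roughly 13B-model speed despite its 47B size.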
Info
Channel: bycloud
Views: 156,827
Keywords: bycloud, bycloudai, nvidia, gtc24, gtc, mixture of experts, mixtral, mistral ai, mistral.ai, MoE, moe explained, AI Moe, moe llm, mistral moe, moe paper, mixture of experts explained, mixture of experts paper, what is moe, what is mixture of experts, what is mixtral, mixtral of experts explained, mixtral-8x7b, mixtral-8x7b explained
Id: PYZIOMvkUF8
Length: 5min 46sec (346 seconds)
Published: Thu Feb 01 2024
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.