Mistral AI: Frontier AI in Your Hands | NVIDIA GTC 2024

Captions
Now let's get started with our session. We'll have time at the end of the session for a few questions. You're welcome to submit questions via the mobile app or the session catalog, or you'll be able to ask at one of the microphones in the aisle at the end. Please welcome Arthur Mensch. Arthur is CEO of Mistral AI, which trains state-of-the-art generative models with a strong emphasis on customization and control. He'll be speaking to us today about lessons learned while training their first Mistral models, and give us a glimpse of the year ahead as well. Welcome. [Applause]

Arthur: Is it talking? So yeah, good afternoon everyone. I'm CEO of Mistral AI, and I'll give you a bit of an overview of what we've built so far and also where we're coming from. But before I present the company, how we are differentiating, and the kind of models that we intend to build and have already built, I wanted to give our view of what happened in the last two years for language models and how things evolved, from, I guess, GPT-3 in 2020 to where we're standing today, with basically an assistant that can be deployed on laptops. So I will try to explain how I saw it unfold, because I think it's a pretty interesting story.

So if you look at what kind of large language model (LLM) you could turn into a useful assistant, as a scientist you can say that it needs to score higher than 60% on MMLU (Massive Multitask Language Understanding). Below that, it's not useful enough and it's a bit frustrating. Beyond that, you start to interact with it like an assistant or a companion, which is good enough and clever enough. Well, if you look at the sizes of the models that matched this kind of performance: in 2020, GPT-3 wasn't at 60% MMLU, but it wasn't far away. Then at DeepMind we did a paper called Gopher, which was actually bigger and reached the 60% MMLU threshold. At the time there was somehow a race towards making bigger and bigger models, with all the infrastructure challenges associated with it. We had to do pipelining and tensor parallelism in basically every direction, and we reached a point where NVIDIA released a 530-billion-parameter model. Then suddenly, in 2022, the same kind of performance was achieved by a much smaller model; I'll explain how we actually got to that. That was at DeepMind. At that point in time, there was a realization that we didn't need to have very big models, and that it was probably not such a great idea anyway. If you look at what happened in 2022, you had ChatGPT, very closely linked, I guess, to the fact that the Chinchilla paper corrected the scaling laws. Then we continued in that direction in 2023 with Llama and Llama 2 from Meta; many people at Mistral today were involved in that. At that point, you see these question marks on the slide, which basically say that the field entered into some form of opacity about the technology used by the different providers. This is something that kickstarted Mistral and is the reason why we created it.

What we focused on when we started the company was the goal of reaching 60% MMLU with a model that could run on a laptop. That's how we got to Mistral 7B.
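For reference, a minimal sketch of the "assistant on a laptop" point, running the openly released Mistral 7B Instruct weights locally with Hugging Face `transformers` (the model id, half-precision loading, and chat-template helper shown here are assumptions based on the public release, not anything described in the talk; on a real laptop you would typically also quantize the weights):

```python
# Minimal local-inference sketch; assumes `torch` and `transformers` are installed
# and the public "mistralai/Mistral-7B-Instruct-v0.1" checkpoint is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain what MMLU measures in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```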
What's interesting is that there's basically a factor of 40 between Gopher, which was the first model to achieve that performance, and Mistral 7B. That was the premise of how we started, and I believe this opens up a lot of avenues for interesting applications. So it's kind of a biased slide, because it ends with Mistral 7B and Mixtral 8x7B, but there is a trend there, and I'll try to explain exactly how that happened from a scientific perspective.

In 2020, the trend was to make these models infinitely big, and I guess the starting point for that was the paper from Kaplan et al. in 2020. They came to the conclusion that if you have 10 times the training budget, meaning if you go from one week to 10 weeks of training on a cluster, you should allocate that to five times the model size and two times the data. As it turns out, the compute spent is basically the multiplication of two factors: the size of the model and the size of the data. Kaplan's conclusion was that if you go to infinity, you need an infinitely big model trained on a comparatively small amount of data. This premise pushed the entire field to train models that were bigger and bigger on the same amount of tokens, which happened to be 300 billion tokens, for two years. Everybody was training models on 300 billion tokens, which kind of slowed down the field a bit, because at some point, if you train your model on 300 billion tokens, you start saturating and overfitting your dataset. That was a problem we started to observe at the end of 2021. We noticed this problem, and the answer to it was that you needed to train on more than 300 billion tokens.

The way we approached the problem at the time was to consider the compute budget. The compute budget is basically the number of parameters multiplied by the number of tokens you train on. So how do you make the appropriate trade-off, for a given compute budget, between the model size and the amount of data? It turns out, and it may seem obvious now but it wasn't at the time, that if you have 10 times the training budget, you should multiply the model size and the data each by the square root of 10. This is the only setting that doesn't explode when you go to infinity. It tells you that if you have an infinite amount of compute, you should scale the size of the model and the amount of data at the same speed, which means that your compression factor should stay roughly constant. As it turns out, the optimal compression factor is around 30 tokens per parameter. So regardless of your compute budget, and it also depends on the quality of the data, you should aim for around 30 tokens per parameter. This was a good insight, but it also directed the field towards a bad representation, where everybody started to think that models should be "Chinchilla optimal" (a term coined on Twitter). However, if you're Chinchilla optimal with a 7B-parameter model, you're basically training on 200 billion tokens, and the model you end up with is at chance level on MMLU; it doesn't work at all.

So compute optimality is not great for training if you want to make models that can run on laptops. Approximately eight months later, the Llama paper and effort from Meta showed the world that you should actually train your model more, especially if the model is small.
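As a back-of-the-envelope illustration of this allocation rule (a sketch assuming the common C ≈ 6·N·D FLOPs approximation, which is not a figure from the talk, together with the roughly 30 tokens-per-parameter ratio quoted above):

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # common approximation: ~6 FLOPs per parameter per training token
TOKENS_PER_PARAM = 30       # compression factor quoted in the talk (data-quality dependent)

def compute_optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a training budget (in FLOPs) into model size N and token count D.

    Uses C = 6 * N * D with D = 30 * N, so N = sqrt(C / (6 * 30)).
    Both N and D then scale with the square root of the compute budget.
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for budget in (1e21, 1e22, 1e23):  # FLOPs
        n, d = compute_optimal_allocation(budget)
        print(f"budget={budget:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")
    # A 10x larger budget multiplies both N and D by sqrt(10) ≈ 3.16.
```

With these numbers, a 1e22-FLOP budget lands at roughly a 7B-parameter model trained on about 220 billion tokens, which matches the "Chinchilla-optimal 7B is about 200 billion tokens" figure quoted in the talk.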
Llama 7B was probably the first 7B model to have decent performance, and the reason for that performance was that it was trained about three times more than the compute-optimal regime that Chinchilla was advocating for. This suggests that, beyond your training budget, you should consider your inference budget when making choices in terms of architecture. If you only consider your training budget, then you should train something that gives you around 30 tokens per parameter. However, if you take into account the fact that you're going to deploy your model on many GPUs, you should try to make your model smaller. The more tokens you see during training, the smaller the model can be for the same performance. So you invest some compute at training time to compress your model more. This is one part of the recipe we applied to achieve Mistral 7B.

In that sense, it's useful to consider graphs where you compare the model size to the amount of training FLOPs (floating-point operations) that should be applied. In these graphs, points that have the same loss (indicative of performance) lie on a single line. The Chinchilla-optimal point is where you have a horizontal tangent; these were the Chinchilla-optimal models. If you want to make your model smaller while maintaining the same loss as the red point, you will have to pay more in terms of computation. As you move along the y-axis, you are paying more computation, which translates into paying more dollars to your cloud provider or buying more GPUs. However, you can achieve significant improvements by moving from a model with 100 billion parameters to one with approximately 10 billion parameters. There is a cost of compression involved in this process, and it's a very interesting phenomenon that can be measured empirically. This graph sets the economics of this period: you need to invest a certain amount of compute to achieve a certain size and compression level, and this is what you should optimize for when running a foundation model company.

For the mathematicians in the room, the loss in this context is defined by a certain term, the natural entropy of the text, plus two other terms: one that depends on the size of the model, where larger models have more representation capacity and can lower the loss, and another term related to stochastic gradient descent, which introduces stochastic noise that depends on the number of tokens. These two terms indicate that to lower the loss, you can either increase the weights (representation capacity) or increase the number of tokens. By plotting this kind of loss, you can make informed choices during training. When we started the company in June, we put this into practice and ended up with a model that was much better than Llama 7B. This is how we achieved Mistral 7B: we pushed compression as much as possible and surpassed the performance of Llama 2 13B at the time. We continued in that direction, and in December we released a model called Mixtral, which had only about 12 billion active parameters but performed at the same level as Llama 2 70B. By pushing compression and making architectural changes, we gained a factor of five compared to the existing state of the art. This was significant because it allowed us to deploy the model on laptops and run it five times faster. This movement and progress explain what you could observe across various companies in 2023.
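For reference, the loss decomposition described here matches the parametric form popularized by the Chinchilla paper; written out (a reconstruction for the reader, not an equation shown in the talk), it is:

```latex
% Chinchilla-style parametric loss in model size N and training tokens D
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here E is the irreducible (natural) entropy of the text, the A/N^α term shrinks as the parameter count N grows (more representation capacity), and the B/D^β term shrinks as the number of training tokens D grows (less noise from stochastic gradient descent), with A, B, α, and β fitted empirically.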
Efficiency and open source were important to us when we started the company. We released Mistral 7B in September and Mixtral 8x7B in December as open-source models, and we continue to make progress in the open-source domain, creating better and better models. Additionally, we have opened up optimized models as part of a commercial platform that serves them. This platform is portable and can be deployed transparently.

So that means we can bring the platform, and we can bring the model weights, for customers to modify and fine-tune them. And we have made a variety of strategic partnerships with cloud providers, but also with NVIDIA; that's something that Jensen announced on Monday. The mission that we have is to bring AI to everybody: make it small enough so that it runs on laptops, make it good enough so that it reaches the best reasoning performance, and make it portable enough so that enterprises and developers can work where they are used to working. That can be on the cloud, on our platform, on a private cloud, or on-prem if needed. And oftentimes it is on-prem, because what you usually do with generative AI is work on your proprietary data, so there are a lot of compliance aspects, and that has pushed us to make a portable platform.

So we offer five models covering the different use cases and business needs: Mistral 7B and Mixtral, but also Mistral Small, which actually performs better than Mixtral with lower latency, and Mistral Large, which is currently among the top-tier models in terms of reasoning capabilities. Beyond the scientific effort we did until December, we started working on adding new capabilities to the models. They are now multilingual, and they support function calling. Function calling is actually quite important in today's applications, when you build agents and make your model interact with databases and tools. We have a JSON mode that lets you force the model to output something you can then use to call an API. And we've optimized for what we call RAG use cases, really optimizing for the model to be able to reason over the entire context it sees. We also have an amazing embedding model that we released in December, which used to be state-of-the-art and is still pretty much state-of-the-art; it's a very fast-moving field, even more so than large language models. Everything is available through our platform and progressively available through a variety of partners. It's also available on ai.nvidia.com now.

In terms of performance, we have been optimizing, including for Mistral Large, the capacity of the model so that it is as small as possible for a certain level of performance. So today, Mistral Large has quite low latency compared to others. We will continue bringing this compression aspect to our technology, because we think this is the way complex applications can be made. The reason latency matters is that once you reach a certain level of latency, through better hardware and better models, you can start thinking about how your AI can run in the background. Instead of just having assistants that talk to people, you can have assistants that think on their own at a sufficient speed, so that you can create applications with a good user experience.
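For reference, a minimal sketch of the JSON mode mentioned above, calling Mistral's chat completions endpoint over plain HTTP. The endpoint path, model name, and `response_format` field follow the OpenAI-compatible convention of Mistral's platform; treat the exact field names as assumptions to verify against the current API reference.

```python
import json
import os

import requests  # third-party HTTP client: pip install requests

# Assumptions: endpoint, model name, and "response_format" follow Mistral's
# OpenAI-compatible API schema; the environment variable name is your choice.
API_URL = "https://api.mistral.ai/v1/chat/completions"
API_KEY = os.environ["MISTRAL_API_KEY"]

payload = {
    "model": "mistral-large-latest",
    "messages": [
        {"role": "system", "content": "Extract the booking details as JSON."},
        {"role": "user", "content": "Book a table for 4 people tomorrow at 7pm at Chez Nous."},
    ],
    # JSON mode: constrain the model to emit valid JSON that downstream code can
    # parse and, for example, feed into an API call (the agent pattern from the talk).
    "response_format": {"type": "json_object"},
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
structured = json.loads(resp.json()["choices"][0]["message"]["content"])
print(structured)
```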
This is really something that we're optimizing for. We've optimized for privacy, as I've said: we don't track your inputs. We've optimized for security: the custom VPC deployment is basically as secure as you can get. We have optimized for guardrailing, so you can define what is appropriate for your specific use case; you can set the guardrails you need in the prompt, and we ensure that those are respected. And because we have been curating our data pretty well, we have very strong bias control, so our models demonstrate top-level performance when it comes to biases related to gender, religion, politics, and ethnicity.

So, I guess, those are basically our core propositions. We intend to be the leader in open source. Currently, we're serving the best open-source models, and our intention is to continue doing so in the coming months. We have better models coming, we have new capabilities coming. Multilinguality is something that is quite important to us and that we're pushing forward.

And as I've said, the idea of being open and having very strong open-source models is also to enable our customers to customize, modify, and fine-tune the models for their use cases, so that they get better performance and the model actually adapts and gets better over time. This is also something that is coming to our platform, and on which we will bring our expertise in AI, the idea being to lower the barrier to entry for fine-tuning models. This is coming; I'm not announcing dates, but it's coming pretty soon.

Thank you, everyone, and I'm happy to take any questions. Thank you very much.

Moderator: Thank you very much, Arthur. We do have time for a few questions. If you have a question, please step to the microphone in the aisle, over on that side of the room.

Audience: Are you using any synthetically generated data in your models or model building?

Arthur: Yes, we do, to a certain extent.

Audience: I thought so. That's all. Thank you.

Audience: I want to bring something back to a paper you worked on, the RETRO transformer from DeepMind. You were doing retrieval-augmented generation before it was cool, and it was integrated into the architecture itself, via cross-attention. I was wondering if you see that as another method going forward, integrating retrieval-augmented generation, or agents, directly into the architecture. I'd like to get your thoughts on that.

Arthur: Yes. So RETRO brought a couple of architecture innovations and training innovations. Some of them were good, some of them were not. In particular, pre-training with retrieved neighbors is both a good idea and a bad idea. It's a good idea because your model actually learns how to retrieve from documents. But it's a bad idea because it makes the task much easier: your model sees much more context, so it doesn't need to memorize things the way it would if it didn't have access to retrieved documents. So in that sense, I don't think we cracked it, and nobody has since, on the architecture side. The encoder-decoder-like models that process documents separately, I think, are pretty exciting; it's a pretty exciting paradigm. It's very close to what we proposed for Mistral 7B, which is sparse attention.
Sparse attention is basically another, more flexible way of having an encoder-decoder structure, and I think this is also a very promising direction for the AI development we'll see in the coming years.

Audience: Okay, thank you very much. Hi, thank you for your talk. I just wonder, do you plan to come up with an LLM in the French language from scratch, or are you planning to stick with translation LLMs? And if yes, how different is the process for the French language?

Arthur: It doesn't seem to be a great idea to train models on a single language from scratch, particularly for languages that are rarer than English. So this is not what we do. Our models are multilingual by design, and I think we have evidence that there's some transfer between languages. So if you want your model to be very good at French, you need to train it on English a bit.

Audience: Okay, thank you. Thank you for the great talk. I have some questions about the architecture. Did you find some scaling laws specific to it? Why did you choose this architecture? And also, the Grok model appeared, but it's not as efficient as Mixtral. What do you think of it?

Arthur: So Mixtral and Grok have some different trade-offs. They're very good at processing a lot of input, but they're a little worse at outputting tokens, because you need to load all of the experts every time you have to output a token. So there are some trade-offs depending on the workload you have. Generally speaking, they're quite efficient for training. Then, I know that some of our competitors have trained very large models, very large MoEs. Well, it's not all about size, right? It's also about how you set things up so that you get something that is compact and has a lot of knowledge. So I'm not sure why Mixtral is as good as it is, but it is good enough.

Audience: Yeah, okay, thank you.

Moderator: Thank you very much, Arthur. That's all the time we have for today. Thank you, and enjoy the rest of GTC. [Applause]
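For reference, a toy sketch of the top-2 mixture-of-experts routing discussed in this last answer (a hypothetical NumPy illustration, not Mixtral's implementation): per token only two experts run, so the active parameter count stays small, yet all expert weights must stay resident in memory, which is why token-by-token decoding is the relatively expensive part of the workload.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_FF = 64, 256    # toy dimensions
N_EXPERTS, TOP_K = 8, 2    # Mixtral-style: 8 experts, 2 active per token

# All expert weights must be kept in memory even though only TOP_K run per token.
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
     rng.standard_normal((D_FF, D_MODEL)) * 0.02)
    for _ in range(N_EXPERTS)
]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Apply a top-2 mixture-of-experts feed-forward block to tokens x of shape (T, D_MODEL)."""
    logits = x @ router                              # (T, N_EXPERTS) routing scores
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]    # indices of the 2 best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                     # softmax over the selected experts only
        for w, e in zip(weights, top[t]):
            w_in, w_out = experts[e]
            hidden = np.maximum(x[t] @ w_in, 0.0)    # simple ReLU feed-forward expert
            out[t] += w * (hidden @ w_out)
    return out

tokens = rng.standard_normal((4, D_MODEL))
print(moe_layer(tokens).shape)  # (4, 64): compute uses only 2 of the 8 experts per token
```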
Info
Channel: NVIDIA Developer
Views: 6,008
Id: Vhl3Q-Oe5Ug
Length: 23min 42sec (1422 seconds)
Published: Tue Apr 09 2024