Now let's get started with our session.
We'll have time at the end of the session for a few questions. You're welcome to
submit questions via the mobile app, the session catalog, or you'll be able to ask at
one of the microphones in the aisle at the end. Please welcome Arthur Mensch. Arthur is CEO
of Mistral AI, which trains state-of-the-art generative models with a strong emphasis on
customization and control. He'll be speaking to us today about lessons learned while training
their first models, Mistral 7B and Mixtral, and give us a glimpse of the year ahead as well. Welcome. [Applause]
Arthur: Is it talking? So yeah, good afternoon everyone. I'm CEO of Mistral AI, and I'll
effectively give you a bit of an overview of what we've built so far and also where we're
coming from. But before I actually present the company, how we are differentiating, and the
kind of models that we intend to build and have already built, I wanted to give a bit of
our vision of what has happened over the last few years for language models and how that evolved,
from, I guess, GPT-3 in 2019-2020 to where we're standing today, with basically an
assistant that can be deployed on laptops. So I will try to explain how I saw it unfold
because I think it's a pretty interesting story. So if you look at what kind of
large language model (LLM) you could turn into a useful assistant, as a scientist you
can say that it needs to score higher than 60% on MMLU (Massive Multitask Language Understanding). Below
that, it's not useful enough and it's a bit frustrating. Beyond that, you start to interact
with it like an assistant or a companion that is good enough and clever enough. Well, if
you look at the sizes of the models that reached this kind of performance: in 2020, GPT-3
wasn't at 60% MMLU, but it wasn't far away. Then at DeepMind, we did
a paper called Gopher, which was actually bigger and reached the 60% MMLU threshold. At the time,
there was somehow a race towards making bigger and bigger models, with
all the infrastructure challenges associated with it: we had to do pipelining and tensor parallelism
in basically every direction. We reached a point where Nvidia released a 530-billion-parameter
model. Then suddenly, in 2022, the same kind of performance was achieved by a much smaller model.
I'll explain how we actually got to that. That was at DeepMind. At that point in time, there was a
realization that we didn't need to have very big models and that it was probably not such a great
idea. If you look at what happened in 2022, Chinchilla happened: the Chinchilla paper
actually corrected the scaling laws. Then the field continued in that direction in 2023
with LLaMA and Llama 2 from Meta; many people at Mistral today were involved in that.
At that point, you see these question marks on the slide, which basically say that the field
entered into some form of opacity about the technology used by the different providers.
This is something that kickstarted Mistral, and it's the reason why we created it.
What we focused on when we started the company was to reach 60% MMLU with a model that
could run on a laptop. That's how we got to Mistral 7B. What's interesting is that there's
basically a factor of 40 between Gopher, which was the first model to achieve that performance,
and Mistral 7B. That was the premise of how we started, and I believe this opens up a lot of
avenues for interesting applications. So it's kind of a biased slide because it ends with
Mistral 7B and Mixtral 8x7B, but there is a trend there, and I'll try to explain exactly
how that happened from a scientific perspective.
In 2020, the trend was to make these models infinitely big, and I guess the starting point
for that was a paper from Kaplan et al. in 2020. They came to the conclusion that if you have
a 10 times training budget, meaning if you go from one week to 10 weeks of training on a cluster,
you should allocate that to five times the model size and two times the data. As it turns out, the
compute spent is basically the multiplication of two factors: the size of the model and the size
of the data. Kaplan's conclusion was that if you go to infinity, you need to have an infinitely
big model trained on a much smaller amount of data. This premise pushed the entire field to
train models that were bigger and bigger on the same amount of tokens, which happened to be
300 billion tokens for two years. Everybody was training models on 300 billion tokens, which
kind of slowed down the field a bit because at some point, if you train your model on 300 billion
tokens, you start saturating and overfitting your dataset. That was a problem we started to observe
at the end of 2021. We noticed this problem, and the answer to this problem was that you
needed to train on more than 300 billion tokens. The way we approached the problem at
the time was to consider the compute budget. The compute budget is the number of parameters
multiplied by the number of data points. So, how do you make the appropriate trade-off
between scaling your compute, model size, and the amount of data? It turns out, and it may
seem obvious now, but it wasn't at the time, that if you have a 10 times training budget, you should
multiply the model size and the data by the square root of 10. This is the only setting that
doesn't explode when you go to infinity. This tells you that if you have an infinite amount of
compute, you should scale the size of the model and the amount of data points at the same speed.
This means that your compression factor should stay relatively constant. As it turns out, the
optimal compression factor is around 30 tokens per parameter, though it also depends on the
quality of the data; regardless of your compute budget, you should aim for roughly that ratio.
This was a good insight, but it also pushed the field towards a bad mental picture,
where everybody started to think that models should be "Chinchilla optimal"
(a term coined on Twitter). However, if you are Chinchilla optimal with a 7B-parameter model,
you are basically training on about 200 billion tokens, and the model you end up with is close to
chance level on MMLU; it doesn't work at all. So, compute optimality is not
great if you want to make models that can run on laptops.
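To make these two allocation rules concrete, here is a minimal numeric sketch. It assumes the common approximation that training compute is roughly 6 × parameters × tokens; the baseline model size and token count are purely illustrative.

```python
import math

# Rough training-compute model: FLOPs ≈ 6 * N (parameters) * D (tokens).
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Arbitrary illustrative baseline: 10B parameters trained on 300B tokens.
n0, d0 = 10e9, 300e9
budget0 = train_flops(n0, d0)

# Given 10x the compute:
kaplan = (5 * n0, 2 * d0)                              # ~5x params, ~2x data
chinchilla = (math.sqrt(10) * n0, math.sqrt(10) * d0)  # sqrt(10)x each

for name, (n, d) in [("Kaplan", kaplan), ("Chinchilla", chinchilla)]:
    print(f"{name}: {n/1e9:.0f}B params, {d/1e9:.0f}B tokens, "
          f"{d/n:.0f} tokens/param, {train_flops(n, d)/budget0:.0f}x compute")

# At ~30 tokens per parameter, a "compute-optimal" 7B model sees only ~210B tokens.
print(f"Chinchilla-optimal data for 7B: {7e9 * 30 / 1e9:.0f}B tokens")
```

Both allocations spend the same 10x budget, but the Kaplan rule drifts toward ever fewer tokens per parameter, while the Chinchilla rule keeps the ratio constant.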
Approximately eight months later, the LLaMA paper and effort from Meta showed the world that you should actually train your
model more, especially if the model is small. LLaMA 7B was probably the first 7B model with decent
performance, and the reason was that it was trained about three times more than
the compute-optimal regime that Chinchilla was advocating for. This suggests that, beyond your
training budget, you should also consider your inference budget when making choices in terms of
architecture. If you only consider your training budget, then you should train something that
gives you around 30 tokens per parameter. However, if you take into account the fact that you're
going to deploy your model on many GPUs, you should try to make your model smaller. The
more tokens you see during training, the smaller the model can be for the same performance.
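As a toy illustration of this inference-aware reasoning, here is a sketch using the common approximations of about 6 × N × D FLOPs for training and about 2 × N FLOPs per generated token at inference. The two configurations and the serving volume are entirely hypothetical and are simply assumed to reach a similar loss.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

def serve_flops(n_params: float, tokens_served: float) -> float:
    return 2 * n_params * tokens_served

tokens_served = 10e12  # hypothetical: 10T tokens generated over the model's lifetime

# Two hypothetical models assumed to reach a similar loss: a roughly
# compute-optimal 13B model, and a 7B model overtrained on far more data.
configs = {"13B / 0.4T tokens": (13e9, 0.4e12),
           "7B  / 1.5T tokens": (7e9, 1.5e12)}

for name, (n, d) in configs.items():
    total = train_flops(n, d) + serve_flops(n, tokens_served)
    print(f"{name}: train {train_flops(n, d):.1e} FLOPs, "
          f"train+serve {total:.1e} FLOPs")
```

Despite the higher training bill, the smaller model wins once serving costs are included.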
So, you invest some compute at training time to compress your model more. This is one part
of the recipe we applied to achieve Mistral 7B. In that sense, it's useful to consider graphs
where you compare the model size to the amount of training flops (floating-point operations)
that should be applied. In these graphs, points that have the same loss (indicative of
performance) lie on a single curve. The Chinchilla-optimal point is where that curve has
a horizontal tangent; those were the Chinchilla-optimal models. If you want to make your model
smaller while maintaining the same loss as the red point, you will have to pay more in terms
of computation. As you move along the y-axis, you are paying more computation, which
translates to paying more dollars to your cloud provider or buying more GPUs. However,
you can achieve a significant reduction in size, moving from a model with 100 billion parameters
to one with approximately 10 billion parameters. There is a cost of compression involved in
this process, and it's a very interesting phenomenon that can be measured empirically.
This graph sets the economics of this period, where you need to invest a certain amount of
compute to achieve a certain size and compression level. This is what you should optimize for
when running a foundation model company. For the mathematicians in the room, the
loss in this context is defined by an irreducible term, the natural entropy
of the text, plus two other terms: one that depends on the size of the model,
where larger models have more representation capacity and can lower the loss, and another
related to stochastic gradient descent, which introduces stochastic noise that depends on
the number of tokens. These two terms indicate that to lower the loss, you can either increase
the number of weights (representation capacity) or increase the number of tokens. By plotting
this kind of loss, you can make informed choices during training.
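Written out, this decomposition is, I believe, the parametric form fitted in the Chinchilla paper (Hoffmann et al., 2022), with N the number of parameters and D the number of training tokens:

```latex
L(N, D) \;=\;
\underbrace{E}_{\text{entropy of text}}
\;+\;
\underbrace{\frac{A}{N^{\alpha}}}_{\text{finite model capacity}}
\;+\;
\underbrace{\frac{B}{D^{\beta}}}_{\text{finite data / gradient noise}}
```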
When we started the company in June, we put this into practice and ended up with a model
that was much better than Llama 7B. This is how we achieved Mistral 7B: we pushed compression
as much as possible and surpassed the performance of Llama 2 13B at the time. We continued
in that direction, and in December we developed a model called Mixtral 8x7B, which had only
about 12 billion active parameters but performed at the same level as Llama 2 70B.
By pushing compression and making architectural changes,
we gained a factor of five compared to the existing state of the art. This was significant
because it allowed us to deploy the model on laptops and run it five times faster.
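As a rough back-of-the-envelope illustration of the cost of compression discussed above, the sketch below plugs numbers into the parametric loss. The coefficients are the fitted values reported in the Chinchilla paper, quoted here from memory, so treat the exact outputs as indicative only.

```python
# Parametric loss L(N, D) = E + A/N**alpha + B/D**beta (see the formula above).
# Coefficients are the Chinchilla paper's fitted values, quoted from memory.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n: float, d: float) -> float:
    return E + A / n**ALPHA + B / d**BETA

def tokens_to_match(target: float, n: float) -> float:
    """Tokens a model of size n needs to reach `target` loss (inf if it cannot)."""
    gap = target - E - A / n**ALPHA
    return (B / gap) ** (1 / BETA) if gap > 0 else float("inf")

def train_flops(n: float, d: float) -> float:
    return 6 * n * d

# A roughly compute-optimal large model (~20 tokens per parameter)...
n_big, d_big = 70e9, 1.4e12
target = loss(n_big, d_big)

# ...versus a 10x smaller model trained until it matches that loss.
n_small = 7e9
d_small = tokens_to_match(target, n_small)

print(f"7B needs ~{d_small/1e12:.0f}T tokens and "
      f"{train_flops(n_small, d_small)/train_flops(n_big, d_big):.1f}x "
      f"the training compute to match the 70B model's loss")
```

The exact figures depend on the fit, but the shape of the answer is the point: a much smaller model can match the loss, it just costs extra training compute.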
This movement and progress explain the observations at various companies in 2023.
Efficiency and open source were important to us when we started the company. We released Mistral 7B
in September and Mixtral 8x7B in December as open-source models. We continue to make progress in
the open-source domain, creating better and better models. Additionally, we offer
optimized models as part of a commercial platform that serves them. This platform is
portable and can be deployed transparently: that means we can bring the platform,
and we can bring the model weights, for customers to modify them and to
fine-tune them. And we have made a variety of strategic partnerships with cloud providers,
but also with Nvidia; that's something that Jensen announced on Monday. And I
think the mission we have is to bring AI to everybody: make it small enough that
it runs on laptops, make it good enough that it reaches the best reasoning performance, and
make it portable enough that enterprises and developers can basically work where they
are used to working. That can be on the cloud, on our platform, on a private cloud,
or on-prem if needed. And oftentimes it is, because, as usual with generative AI, you work
on your proprietary data, and there are a lot of compliance aspects to it; that is what has
pushed us to make a portable platform. So, we do offer five models for all the cases and
business needs: Mistral 7B and Mixtral 8x7B, but also Mistral Small, which actually performs
better than Mixtral with lower latency, and Mistral Large, which is currently among the
top-tier models in terms of reasoning capabilities. We have also worked beyond the scientific
effort that we did until December.
We started to work on adding new capacities to the model. It is now multilingual, and it has
function calling. Function calling is actually quite important in today's applications: when you
build agents, your model has to interact with databases and with tools. We have
a JSON mode that lets you force the model to output structured data that you can then use to call
an API.
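As a deliberately generic sketch of what this looks like from the application side (this is not the Mistral SDK, just the pattern, with a hypothetical get_weather tool):

```python
import json

# Hypothetical tool that the application exposes to the model.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API call

TOOLS = {"get_weather": get_weather}

# Suppose the model, constrained to emit JSON, returned this string.
model_output = '{"tool": "get_weather", "arguments": {"city": "San Jose"}}'

call = json.loads(model_output)              # JSON mode makes this parse reliably
result = TOOLS[call["tool"]](**call["arguments"])
print(result)                                # fed back to the model as new context
```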
And we've optimized for what we call RAG (retrieval-augmented generation) use cases: really
optimizing for the model to be able to reason over the entire context it sees. We also have an
embedding model that we released in December, which used to be state-of-the-art and is still
pretty much state-of-the-art.
It's a very fast-moving field, even more so than large language models. Everything is
available through our platform and progressively available through a variety of partners.
It's also available on ai.nvidia.com now. In terms of performance, we have been optimizing,
including for Mistral Large, to make the model as small as possible for a given
level of performance. So today, Mistral Large has quite low latency compared to others. We will
continue bringing this compression aspect to our technology because we think that this is the
way in which complex applications can be made. The reason why latency matters is that once you reach
a certain level of latency through better hardware and better models, you can start thinking about
how your AI can run in the background. Instead of just having assistants that are talking to people,
you can have assistants that think on their own at a sufficient speed so that you can create
applications with a good user experience. This is really something that we're optimizing for.
We've optimized for privacy, as I've said: we don't track your inputs. We've optimized
for security: custom deployments and VPC deployments are basically as secure as you can
get. We have optimized for guardrailing, so you can define what is appropriate for
your specific use case; you set the guardrails you need in the prompt, and we ensure that they are
respected. And because we have been curating our data pretty well, we have very strong bias
control: our models demonstrate top-level performance when it comes to biases related to
gender, religion, politics, and ethnicity. So, I guess, yeah, those are basically our core
propositions. We intend to be the leader in open source. Currently, we're serving the best open
source models, and our intention is to continue doing so in the coming months. We have better
models coming, we have new capacities coming. Multilinguality is something that is quite
important to us and that we're pushing forward. And as I've said, the idea of being open
and having very strong open source models is also to enable our customers
to customize and modify the models, and to fine-tune them to their use cases so that
they get better performance, and so that the model actually adapts and improves over time. This
is also something that is coming to our platform, where we will bring our expertise
in AI; the idea is to lower the barrier to entry for fine-tuning models.
This is coming. I'm not announcing dates, but it is coming pretty soon.
Thank you, everyone, and I'm happy to take any questions. Thank you
very much. Thank you very much, Arthur. Uh, we do have time for a few questions.
Um, if you would, if you have a question, please step to the microphone that's in the
aisle, uh, over on that side of the room. Are you guys using any synthetically generated
data in your models or model building? Uh, so, uh, yes, we do, to a certain extent.
I thought so. That's all. Thank you. I want to bring something back to a paper you
worked on, the Retro Transformer from DeepMind. Uh, you guys were doing retrieval augmented
generation before it was cool, and it was integrated directly via cross-attention in the
architecture. I was wondering if you see that as another method going forward to
integrate retrieval augmented generation directly in the architecture or agents directly in the
architecture. I want to get your thoughts on that. Um, yes. So, I think that's, uh, so Retro
brought a couple of architecture innovations and training innovations. Some of them were
good, some of them were not good. In particular, pre-training with retrieved
neighbors is both a good idea and a bad idea. It's a good
idea because your model actually learns how to retrieve from documents. But it's a bad
idea because it makes the task much easier: your model sees much more context,
and so it doesn't need to memorize things the way it would if it didn't have access to
retrieved documents. So, in that sense, I don't think we cracked it, and nobody
has since on the architecture side. Encoder-decoder-like models that
process documents separately, I think, are pretty exciting stuff; it's a pretty
exciting paradigm. It's very close to what we proposed for Mistral 7B, which uses sparse attention.
Sparse attention is basically another, more flexible way of having an encoder-decoder
structure. And I think this is also a very promising aspect of the AI development we'll see
in the coming years.
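For context, the sparse attention used in Mistral 7B is sliding-window attention, in which each token attends only to a window of recent tokens (4096 in Mistral 7B). A minimal sketch of the corresponding mask, with a tiny window so the pattern is visible:

```python
# Causal sliding-window mask: token i may attend to tokens j with i - window < j <= i.
def sliding_window_mask(seq_len: int, window: int) -> list[list[int]]:
    return [[1 if i - window < j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

for row in sliding_window_mask(seq_len=6, window=3):
    print(row)
```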
Okay, thank you very much. Hi, thank you for your speech. I just wonder,
do you plan to come up with an LLM in the French language from scratch, or are you guys planning to
stick with the translation LLMs? And if yes, how different is the process for the French language?
Um, it doesn't seem to be a great idea to train models on a single language from scratch,
particularly for languages that are rarer than English. So, this is not what we do.
Our models are multilingual by design, and I think that we have evidence that there's some
transfer between the languages. So, if you want your model to be very good at French, you need
to train it on English a bit. Okay, thank you. Okay, thank you for the great talk. So, I have
some questions about the architecture. Did you find scaling laws specific to it? And
why did you choose this architecture? And also, the Grok model appeared, but it's not as
efficient as Mixtral. What do you think of it? Um, so, mixture-of-experts models like Mixtral and Grok have some different
trade-offs. They're very good at processing a lot of inputs, but they're a little worse at
outputting tokens because you need to load all of the experts every time you have to output a
token. So, there are some trade-offs depending on the workload you have. Generally speaking,
they're quite efficient for training. Then, I know that some of our competitors have trained
very large models, very large MoEs. Well, it's not all about size, right? It's also about
how you set things up so that you get something that is compact and has a lot of knowledge.
So, I'm not sure exactly why Mixtral is as good as it is, but it is good enough. Yeah, okay, thank you.
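To put rough numbers on that trade-off, here is a sketch of parameter accounting for a Mixtral-like mixture-of-experts stack (8 experts, 2 routed per token). The per-layer figures are illustrative rather than the exact architecture.

```python
# In a mixture-of-experts transformer, every expert's weights must be held in
# memory, but only the routed (top-k) experts run for each token, so per-token
# compute tracks the "active" parameter count.
def moe_params(attn_params: float, expert_params: float,
               n_layers: int, n_experts: int, top_k: int) -> tuple[float, float]:
    total = n_layers * (attn_params + n_experts * expert_params)
    active = n_layers * (attn_params + top_k * expert_params)
    return total, active

total, active = moe_params(attn_params=50e6, expert_params=176e6,
                           n_layers=32, n_experts=8, top_k=2)
print(f"total ≈ {total/1e9:.0f}B params (memory footprint), "
      f"active ≈ {active/1e9:.0f}B params (compute per token)")
```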
Thank you very much, Arthur. That's all the time we have for today. Thank you,
enjoy the rest of GTC. [Applause]