Now let's get started with our session.
We'll have time at the end of the session for a few questions. You're welcome to
submit questions via the mobile app, the session catalog, or you'll be able to ask at
one of the microphones in the aisle at the end. Please welcome Arthur Mensch. Arthur is CEO
of Mistral AI, which trains state-of-the-art generative models with a strong emphasis on
customization and control. He'll be speaking to us today about lessons learned while training
their first models, Mistral 7B and Mixtral, and give us a glimpse of the year ahead as well. Welcome. [Applause]
Arthur: Is it talking? So yeah, good afternoon everyone. I'm CEO of Mistral AI, and I'll
effectively give you a bit of an overview of what we've built so far and also where we're
coming from. But before I actually present the company, how we are differentiating, and the
kind of models that we intend to build and have already built, I wanted to give a bit of
our vision of what has happened over the last few years for language models and how that evolved,
from, I guess, GPT-3 in 2019-2020 to where we're standing today, with basically an
assistant that can be deployed on laptops. So I will try to explain how I saw it unfold
because I think it's a pretty interesting story. So if you look at what kind of
large language model (LLM) you could turn into a useful assistant, as a scientist you
can say that it needs to score higher than 60% on MMLU (Massive Multitask Language Understanding). Below
that, it's not useful enough and it's a bit frustrating. Beyond that, you start to interact
with it like an assistant or a companion that is good enough and clever enough. Well, if
you look at the sizes of the models that reached this kind of performance: in 2020, GPT-3
wasn't at 60% MMLU, but it wasn't far away. Then at DeepMind, we did
a paper called Gopher, which was actually bigger and reached the 60% MMLU threshold. At the time,
there was somehow a race towards making bigger and bigger models, with
all the infrastructure challenges associated with it: we had to do pipelining and tensor parallelism
in basically every direction. We reached a point where Nvidia released a 530-billion-parameter
model. Then suddenly, in 2022, the same kind of performance was achieved by a much smaller model.
I'll explain how we actually got to that. That was at DeepMind. At that point in time, there was a
realization that we didn't need to have very big models and that it was probably not such a great
idea. If you look at what happened in 2022, Chinchilla happened: the Chinchilla paper
actually corrected the scaling laws. Then the field continued in that direction in 2023
with LLaMA and Llama 2 from Meta; many people at Mistral today were involved in that.
At that point, you see these question marks on the slide, which basically say that the field
entered into some form of opacity about the technology used by the different providers.
This is something that kickstarted Mistral, and it's the reason why we created it.
What we focused on when we started the company was to reach 60% MMLU with a model that
could run on a laptop. That's how we got to Mistral 7B. What's interesting is that there's
basically a factor of 40 between Gopher, which was the first model to achieve that performance,
and Mistral 7B. That was the premise of how we started, and I believe this opens up a lot of
avenues for interesting applications. So it's kind of a biased slide because it ends with
Mistral 7B and Mixtral 8x7B, but there is a trend there, and I'll try to explain exactly
how that happened from a scientific perspective.
In 2020, the trend was to make these models infinitely big, and I guess the starting point
for that was a paper from Kaplan et al. in 2020. They came to the conclusion that if you have
a 10 times training budget, meaning if you go from one week to 10 weeks of training on a cluster,
you should allocate that to five times the model size and two times the data. As it turns out, the
compute spent is basically the multiplication of two factors: the size of the model and the size
of the data. Kaplan's conclusion was that if you go to infinity, you need to have an infinitely
big model trained on a much smaller amount of data. This premise pushed the entire field to
train models that were bigger and bigger on the same amount of tokens, which happened to be
300 billion tokens for two years. Everybody was training models on 300 billion tokens, which
kind of slowed down the field a bit because at some point, if you train your model on 300 billion
tokens, you start saturating and overfitting your dataset. That was a problem we started to observe
at the end of 2021. We noticed this problem, and the answer to this problem was that you
needed to train on more than 300 billion tokens. The way we approached the problem at
the time was to consider the compute budget. The compute budget is the number of parameters
multiplied by the number of data points. So, how do you make the appropriate trade-off
between scaling your compute, model size, and the amount of data? It turns out, and it may
seem obvious now, but it wasn't at the time, that if you have a 10 times training budget, you should
multiply the model size and the data by the square root of 10. This is the only setting that
doesn't explode when you go to infinity. This tells you that if you have an infinite amount of
compute, you should scale the size of the model and the amount of data points at the same speed.
This means that your compression factor should stay relatively constant. As it turns out, the
optimal compression factor is around 30 tokens per parameter, though it also depends on the
quality of the data; regardless of your compute budget, you should aim for roughly that ratio.
This was a good insight, but it also pushed the field towards a bad mental picture,
where everybody started to think that models should be "Chinchilla optimal"
(a term coined on Twitter). However, if you are Chinchilla optimal with a 7B-parameter model,
you are basically training on about 200 billion tokens, and the model you end up with is close to
chance level on MMLU; it doesn't work at all. So, compute optimality is not
great if you want to make models that can run on laptops.
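To make these two allocation rules concrete, here is a minimal numeric sketch. It assumes the common approximation that training compute is roughly 6 × parameters × tokens; the baseline model size and token count are purely illustrative.

```python
import math

# Rough training-compute model: FLOPs ≈ 6 * N (parameters) * D (tokens).
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Arbitrary illustrative baseline: 10B parameters trained on 300B tokens.
n0, d0 = 10e9, 300e9
budget0 = train_flops(n0, d0)

# Given 10x the compute:
kaplan = (5 * n0, 2 * d0)                              # ~5x params, ~2x data
chinchilla = (math.sqrt(10) * n0, math.sqrt(10) * d0)  # sqrt(10)x each

for name, (n, d) in [("Kaplan", kaplan), ("Chinchilla", chinchilla)]:
    print(f"{name}: {n/1e9:.0f}B params, {d/1e9:.0f}B tokens, "
          f"{d/n:.0f} tokens/param, {train_flops(n, d)/budget0:.0f}x compute")

# At ~30 tokens per parameter, a "compute-optimal" 7B model sees only ~210B tokens.
print(f"Chinchilla-optimal data for 7B: {7e9 * 30 / 1e9:.0f}B tokens")
```

Both allocations spend the same 10x budget, but the Kaplan rule drifts toward ever fewer tokens per parameter, while the Chinchilla rule keeps the ratio constant.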
Approximately eight months later, the LLaMA paper and effort from Meta showed the world that you should actually train your
model more, especially if the model is small. LLaMA 7B was probably the first 7B model with decent
performance, and the reason was that it was trained about three times more than
the compute-optimal regime that Chinchilla was advocating for. This suggests that, beyond your
training budget, you should also consider your inference budget when making choices in terms of
architecture. If you only consider your training budget, then you should train something that
gives you around 30 tokens per parameter. However, if you take into account the fact that you're
going to deploy your model on many GPUs, you should try to make your model smaller. The
more tokens you see during training, the smaller the model can be for the same performance.
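As a toy illustration of this inference-aware reasoning, here is a sketch using the common approximations of about 6 × N × D FLOPs for training and about 2 × N FLOPs per generated token at inference. The two configurations and the serving volume are entirely hypothetical and are simply assumed to reach a similar loss.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

def serve_flops(n_params: float, tokens_served: float) -> float:
    return 2 * n_params * tokens_served

tokens_served = 10e12  # hypothetical: 10T tokens generated over the model's lifetime

# Two hypothetical models assumed to reach a similar loss: a roughly
# compute-optimal 13B model, and a 7B model overtrained on far more data.
configs = {"13B / 0.4T tokens": (13e9, 0.4e12),
           "7B  / 1.5T tokens": (7e9, 1.5e12)}

for name, (n, d) in configs.items():
    total = train_flops(n, d) + serve_flops(n, tokens_served)
    print(f"{name}: train {train_flops(n, d):.1e} FLOPs, "
          f"train+serve {total:.1e} FLOPs")
```

Despite the higher training bill, the smaller model wins once serving costs are included.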
So, you invest some compute at training time to compress your model more. This is one part
of the recipe we applied to achieve Mistral 7B. In that sense, it's useful to consider graphs
where you compare the model size to the amount of training flops (floating-point operations)
that should be applied. In these graphs, points that have the same loss (indicative of
performance) lie on a single curve. The Chinchilla-optimal point is where that curve has
a horizontal tangent; those were the Chinchilla-optimal models. If you want to make your model
smaller while maintaining the same loss as the red point, you will have to pay more in terms
of computation. As you move along the y-axis, you are paying more computation, which
translates to paying more dollars to your cloud provider or buying more GPUs. However,
you can achieve a significant reduction in size, moving from a model with 100 billion parameters
to one with approximately 10 billion parameters. There is a cost of compression involved in
this process, and it's a very interesting phenomenon that can be measured empirically.
This graph sets the economics of this period, where you need to invest a certain amount of
compute to achieve a certain size and compression level. This is what you should optimize for
when running a foundation model company. For the mathematicians in the room, the
loss in this context is defined by an irreducible term, the natural entropy
of the text, plus two other terms: one that depends on the size of the model,
where larger models have more representation capacity and can lower the loss, and another
related to stochastic gradient descent, which introduces stochastic noise that depends on
the number of tokens. These two terms indicate that to lower the loss, you can either increase
the number of weights (representation capacity) or increase the number of tokens. By plotting
this kind of loss, you can make informed choices during training.
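Written out, this decomposition is, I believe, the parametric form fitted in the Chinchilla paper (Hoffmann et al., 2022), with N the number of parameters and D the number of training tokens:

```latex
L(N, D) \;=\;
\underbrace{E}_{\text{entropy of text}}
\;+\;
\underbrace{\frac{A}{N^{\alpha}}}_{\text{finite model capacity}}
\;+\;
\underbrace{\frac{B}{D^{\beta}}}_{\text{finite data / gradient noise}}
```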
When we started the company in June, we put this into practice and ended up with a model
that was much better than Llama 7B. This is how we achieved Mistral 7B: we pushed compression
as much as possible and surpassed the performance of Llama 2 13B at the time. We continued
in that direction, and in December we developed a model called Mixtral 8x7B, which had only
about 12 billion active parameters but performed at the same level as Llama 2 70B.
By pushing compression and making architectural changes,
we gained a factor of five compared to the existing state of the art. This was significant
because it allowed us to deploy the model on laptops and run it five times faster.
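As a rough back-of-the-envelope illustration of the cost of compression discussed above, the sketch below plugs numbers into the parametric loss. The coefficients are the fitted values reported in the Chinchilla paper, quoted here from memory, so treat the exact outputs as indicative only.

```python
# Parametric loss L(N, D) = E + A/N**alpha + B/D**beta (see the formula above).
# Coefficients are the Chinchilla paper's fitted values, quoted from memory.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n: float, d: float) -> float:
    return E + A / n**ALPHA + B / d**BETA

def tokens_to_match(target: float, n: float) -> float:
    """Tokens a model of size n needs to reach `target` loss (inf if it cannot)."""
    gap = target - E - A / n**ALPHA
    return (B / gap) ** (1 / BETA) if gap > 0 else float("inf")

def train_flops(n: float, d: float) -> float:
    return 6 * n * d

# A roughly compute-optimal large model (~20 tokens per parameter)...
n_big, d_big = 70e9, 1.4e12
target = loss(n_big, d_big)

# ...versus a 10x smaller model trained until it matches that loss.
n_small = 7e9
d_small = tokens_to_match(target, n_small)

print(f"7B needs ~{d_small/1e12:.0f}T tokens and "
      f"{train_flops(n_small, d_small)/train_flops(n_big, d_big):.1f}x "
      f"the training compute to match the 70B model's loss")
```

The exact figures depend on the fit, but the shape of the answer is the point: a much smaller model can match the loss, it just costs extra training compute.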
This movement and progress explain the observations at various companies in 2023.
Efficiency and open source were important to us when we started the company. We released Mistral 7B
in September and Mixtral 8x7B in December as open-source models. We continue to make progress in
the open-source domain, creating better and better models. Additionally, we offer
optimized models as part of a commercial platform that serves them. This platform is
portable and can be deployed transparently: that means we can bring the platform,
and we can bring the model weights, for customers to modify them and to
fine-tune them. And we have made a variety of strategic partnerships with cloud providers,
but also with Nvidia; that's something that Jensen announced on Monday. And I
think the mission we have is to bring AI to everybody: make it small enough that
it runs on laptops, make it good enough that it reaches the best reasoning performance, and
make it portable enough that enterprises and developers can basically work where they
are used to working. That can be on the cloud, on our platform, on a private cloud,
or on-prem if needed. And oftentimes it is, because, as usual with generative AI, you work
on your proprietary data, and there are a lot of compliance aspects to it; that is what has
pushed us to make a portable platform. So, we do offer five models for all the cases and
business needs: Mistral 7B and Mixtral 8x7B, but also Mistral Small, which actually performs
better than Mixtral with lower latency, and Mistral Large, which is currently among the
top-tier models in terms of reasoning capabilities. We have also worked beyond the scientific
effort that we did until December.
We started to work on adding new capacities to the model. It is now multilingual, and it has
function calling. Function calling is actually quite important in today's applications: when you
build agents, your model has to interact with databases and with tools. We have
a JSON mode that lets you force the model to output structured data that you can then use to call
an API.
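As a deliberately generic sketch of what this looks like from the application side (this is not the Mistral SDK, just the pattern, with a hypothetical get_weather tool):

```python
import json

# Hypothetical tool that the application exposes to the model.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API call

TOOLS = {"get_weather": get_weather}

# Suppose the model, constrained to emit JSON, returned this string.
model_output = '{"tool": "get_weather", "arguments": {"city": "San Jose"}}'

call = json.loads(model_output)              # JSON mode makes this parse reliably
result = TOOLS[call["tool"]](**call["arguments"])
print(result)                                # fed back to the model as new context
```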
And we've optimized for what we call RAG (retrieval-augmented generation) use cases: really
optimizing for the model to be able to reason over the entire context it sees. We also have an
embedding model that we released in December, which used to be state-of-the-art and is still
pretty much state-of-the-art.
It's a very fast-moving field, even more so than large language models. Everything is
available through our platform and progressively available through a variety of partners.
It's also available on ai.nvidia.com now. In terms of performance, we have been optimizing,
including for Mistral Large, to make the model as small as possible for a given
level of performance. So today, Mistral Large has quite low latency compared to others. We will
continue bringing this compression aspect to our technology because we think that this is the
way in which complex applications can be made. The reason why latency matters is that once you reach
a certain level of latency through better hardware and better models, you can start thinking about
how your AI can run in the background. Instead of just having assistants that are talking to people,
you can have assistants that think on their own at a sufficient speed so that you can create
applications with a good user experience. This is really something that we're optimizing for.
We've optimized for privacy, as I've said: we don't track your inputs. We've optimized
for security: custom deployments and VPC deployments are basically as secure as you can
get. We have optimized for guardrailing, so you can define what is appropriate for
your specific use case; you set the guardrails you need in the prompt, and we ensure that they are
respected. And because we have been curating our data pretty well, we have very strong bias
control: our models demonstrate top-level performance when it comes to biases related to
gender, religion, politics, and ethnicity. So, I guess, yeah, those are basically our core
propositions. We intend to be the leader in open source. Currently, we're serving the best open
source models, and our intention is to continue doing so in the coming months. We have better
models coming, we have new capacities coming. Multilinguality is something that is quite
important to us and that we're pushing forward. And as I've said, the idea of being open
and having very strong open source models is also to enable our customers
to customize and modify the models, and to fine-tune them to their use cases so that
they get better performance, and so that the model actually adapts and improves over time. This
is also something that is coming to our platform, where we will bring our expertise
in AI; the idea is to lower the barrier to entry for fine-tuning models.
This is coming. I'm not announcing dates, but it is coming pretty soon.
Thank you, everyone, and I'm happy to take any questions. Thank you
very much. Thank you very much, Arthur. Uh, we do have time for a few questions.
Um, if you would, if you have a question, please step to the microphone that's in the
aisle, uh, over on that side of the room. Are you guys using any synthetically generated
data in your models or model building? Uh, so, uh, yes, we do, to a certain extent.
I thought so. That's all. Thank you. I want to bring something back to a paper you
worked on, the Retro Transformer from DeepMind. Uh, you guys were doing retrieval augmented
generation before it was cool, and it was integrated directly via cross-attention in the
architecture. I was wondering if you see that as another method going forward to
integrate retrieval augmented generation directly in the architecture or agents directly in the
architecture. I want to get your thoughts on that. Um, yes. So, I think that's, uh, so Retro
brought a couple of architecture innovations and training innovations. Some of them were
good, some of them were not good. In particular, pre-training with retrieved
neighbors is both a good idea and a bad idea. It's a good
idea because your model actually learns how to retrieve from documents. But it's a bad
idea because it makes the task much easier: your model sees much more context,
and so it doesn't need to memorize things the way it would if it didn't have access to
retrieved documents. So, in that sense, I don't think we cracked it, and nobody
has since on the architecture side. Encoder-decoder-like models that
process documents separately, I think, are pretty exciting stuff; it's a pretty
exciting paradigm. It's very close to what we proposed for Mistral 7B, which uses sparse attention.
Sparse attention is basically another, more flexible way of having an encoder-decoder
structure. And I think this is also a very promising aspect of the AI development we'll see
in the coming years.
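For context, the sparse attention used in Mistral 7B is sliding-window attention, in which each token attends only to a window of recent tokens (4096 in Mistral 7B). A minimal sketch of the corresponding mask, with a tiny window so the pattern is visible:

```python
# Causal sliding-window mask: token i may attend to tokens j with i - window < j <= i.
def sliding_window_mask(seq_len: int, window: int) -> list[list[int]]:
    return [[1 if i - window < j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

for row in sliding_window_mask(seq_len=6, window=3):
    print(row)
```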
Okay, thank you very much. Hi, thank you for your speech. I just wonder,
do you plan to come up with an LLM in the French language from scratch, or are you guys planning to
stick with the translation LLMs? And if yes, how different is the process for the French language?
Um, it doesn't seem to be a great idea to train models on a single language from scratch,
particularly for languages that are rarer than English. So, this is not what we do.
Our models are multilingual by design, and I think that we have evidence that there's some
transfer between the languages. So, if you want your model to be very good at French, you need
to train it on English a bit. Okay, thank you. Okay, thank you for the great talk. So, I have
some questions about the architecture. Did you find scaling laws specific to it? And
why did you choose this architecture? And also, the Grok model appeared, but it's not as
efficient as Mixtral. What do you think of it? Um, so, mixture-of-experts models like Mixtral and Grok have some different
trade-offs. They're very good at processing a lot of inputs, but they're a little worse at
outputting tokens because you need to load all of the experts every time you have to output a
token. So, there are some trade-offs depending on the workload you have. Generally speaking,
they're quite efficient for training. Then, I know that some of our competitors have trained
very large models, very large MoEs. Well, it's not all about size, right? It's also about
how you set things up so that you get something that is compact and has a lot of knowledge.
So, I'm not sure exactly why Mixtral is as good as it is, but it is good enough. Yeah, okay, thank you.
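To put rough numbers on that trade-off, here is a sketch of parameter accounting for a Mixtral-like mixture-of-experts stack (8 experts, 2 routed per token). The per-layer figures are illustrative rather than the exact architecture.

```python
# In a mixture-of-experts transformer, every expert's weights must be held in
# memory, but only the routed (top-k) experts run for each token, so per-token
# compute tracks the "active" parameter count.
def moe_params(attn_params: float, expert_params: float,
               n_layers: int, n_experts: int, top_k: int) -> tuple[float, float]:
    total = n_layers * (attn_params + n_experts * expert_params)
    active = n_layers * (attn_params + top_k * expert_params)
    return total, active

total, active = moe_params(attn_params=50e6, expert_params=176e6,
                           n_layers=32, n_experts=8, top_k=2)
print(f"total ≈ {total/1e9:.0f}B params (memory footprint), "
      f"active ≈ {active/1e9:.0f}B params (compute per token)")
```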
Thank you very much, Arthur. That's all the time we have for today. Thank you,
enjoy the rest of GTC. [Applause]