Please welcome to the stage NVIDIA
founder and CEO Jensen Huang. Apparently, there's a huge line out there, and
they're coming into this room in an orderly way, but once they come into this room, it
becomes total mayhem. What is that about? Come on, you guys, hurry up, hurry up! I'm going to start!
So, so this room, this room is going to be packed, and there are two other breakout rooms,
they're going to be packed. And it's just that they're taking too long getting in here
because they're just timid. And the part that I just don't understand is once they get in
here, once they get within about 10 feet, all of a sudden chaos ensues. It's very, very nice to
see all of you. The computer, the computer has largely remained unchanged for 60 years. The year
after my birth (it had nothing to do with my birth, but the year after my birth), the modern computer
was described by the IBM System/360: the central processing unit, the I/O subsystem, multitasking,
the separation of hardware from software, software compatibility across a family, backwards
compatibility to protect the investment of software engineers. That described modern computing.
Hasn't changed since 1964. In the late '80s and the early '90s, the PC Revolution kicked it into
turbocharge and democratized computing as we know it, driving the marginal cost of performance
down every single year. Every five years, we reduce the cost of computing by about 10 times;
every 10 years, by 100 times; every 15 years, a thousand times; every 20 years, 10,000 times.
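For readers following along, that cadence compounds cleanly; a quick sketch, assuming a steady 10x cost reduction every five years:

```python
# Assumed cadence: computing cost falls about 10x every five years,
# so the cumulative reduction over a span is 10**(years / 5).
def cost_reduction(years):
    return 10 ** (years / 5)

for years in (5, 10, 15, 20):
    print(f"{years} years: {cost_reduction(years):,.0f}x cheaper")
# 20 years -> 10,000x, the figure quoted for the PC era
```

Over the 20-year PC era, that steady cadence compounds to the 10,000-times figure.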
In literally the 20-year era of the PC Revolution, computing costs reduced by 10,000 times, more
so than any other valuable commodity in history. Could you imagine if everything in life reduced
- everything that matters, everything that's valuable to you - reduced in cost by 10,000 times,
10,000 times over basically half a lifetime? 20 years by the time you became an adult. Something
you used to use, like a car, used to be $20,000, now it costs a dollar. Well, maybe that's
Uber. But computing costs dropped tremendously, and then one day it stopped. Not precisely one
day, but it stopped. The rate of change stopped. It still continuously improved a little bit every
single year, but the rate of change stopped. Well, we worked on another form of computing called
accelerated computing. It is not as easy to use - there's nothing easy about using
accelerated computing - because you have to take the problem, which is originally formulated
kind of like a recipe, a step-by-step procedure that you just do faster and faster every year,
and reformulate that recipe into parallel algorithms. Parallel algorithms are a whole field of science.
Insanely hard to do. Well, we believed in it anyway, and we believed that if we could accelerate
the 1% of code that represents 99% of the runtime, there are some applications where we could make a
tremendous benefit, a tremendous difference. We can make something impossible possible.
Or something that cost a lot of money to do cost-effective. Or something that cost
a lot of energy to do energy-efficient. And so we called it accelerated computing.
We worked on it for the entire duration of our company's history. And one application
domain after another application domain, we were able to accelerate. The first one
was computer graphics and video games. We accelerated computer graphics and video games
so well that people thought we were a computer games company. But nonetheless, we kept pursuing
it. We realized the value of computer graphics and games because it was simultaneously a
large market and drove technology innovation. That rarely happens. The confluence of a large
market with technology that is never good enough has the ability to drive incredible technology
revolutions. We found it initially with computer graphics and games. Well, to make a long story
short, in 2012, we had a first sighting - and that first sighting was AlexNet, the first contact
of artificial intelligence with NVIDIA GPUs. And it drew our attention - our attention to
this field. Several years later, something amazing happened, and it led to today. Today, I'll
tell you about that something in a second. That something led to generative AI. Now, you've heard
me say generative AI is, of course, incredible. The fact that software can not only recognize
an image of a cat and say "cat," it can take the word "cat" and generate an image of a cat. It can
take the word "cat" with a few more conditional prompts, like on a surfboard on a sunny day off
the coast of Maui, drinking a mai tai, wearing a ski hat, you know, whatever. You just keep on
adding your prompts, and somehow the generative AI is able to generate that. We have now taught a
software program how to understand the meaning of these pixels - to recognize the pixels,
understand their meaning, and, in fact, generate pixels from the meaning of something.
And so, this ability for us to learn from just about any data is incredibly transformative.
And it's led to today. You've heard me say that this is the beginning of a new revolution, a new
Industrial Revolution. And there's a reason for that. In this new Industrial Revolution, we're
producing something that never existed before. As in previous industrial revolutions, the last one,
water comes into a facility, energy is applied to it, this thing called a dynamo goes off to work,
and it creates this invisible thing of incredible value that we depend on today. Water comes into
a building, you basically light it on fire, you boil it, and what comes out the other
side of it is electricity. Water in, electricity out. Magic electricity used everywhere
and created the Industrial Revolution as we know it. A new facility creating a new product of great
value. Well, generative AI is a brand new type of software, and the software is produced. It,
of course, has to be created - amazing computer scientists have to go create it. But then, after
that, it's produced. It's produced in volume, in a building with machinery we call GPUs, essentially
a dynamo: a large building with machinery inside. You give it raw material; this raw material
is data, numbers. You give it energy, and this amazing thing comes out. Numbers go in, numbers
come out, and the numbers that come out do amazing things, unbelievable things. And they could be
used, of course, in all of the applications that you know, but it could be used in healthcare
and drug design, and it could be used in transportation and cause cars to drive, and
manufacturing and industrials, and every single industry that we know will benefit from this new
product, this new thing that is being produced. So, a brand new thing that the industries
have never seen is going to get produced, and it's going to be produced in facilities and
factories the world has never seen before. AI factories using and producing AI, and the AI being
used by every industry. So, what do you call that? A new Industrial Revolution. None of this existed
before, and now we're seeing it play out right in front of us. This next 10 years, you don't want to
miss. You don't want to miss this next 10 years. Unbelievable new capabilities will be invented,
and it started, it started at a point in time by some researchers. And so today, I thought we would
invite the inventors, the creators of a machine language model called Transformer. And the way we
thought we'd set it up is kind of like our living room, and there will be very little moderation.
And, you know, I was in, we were in the back, and I wish you were there. There were a lot of deep
learning jokes, a lot of deep learning jokes. And, you know, we're going to see if any of them land,
but a lot of good deep learning jokes and a lot of arguments. And so, what I thought I would do is
we would just tee up the joking and the arguments, and then we can see where it takes us. And so,
what I would do, let me now welcome the inventors of the Transformer, and they were the authors of
the paper that says "Attention is All You Need." Okay, so let's have Ashish Vaswani.
Ashish, Ashish is now the CEO of a brand new startup company called Essential AI. Noam Shazeer,
welcome Noam. He's also the CEO of a new startup, called Character.AI. Somehow all of their startups
have "AI" in the name. NVIDIA also has "AI," just not in the right order. I knew it all along,
I knew it all along that I needed the letter "a" and the letter "i" in there. I just didn't know
what order it had to be. Jakob Uszkoreit, Jakob, nice to see you. Jakob is also the
CEO of a startup - this is really, really interesting - Inceptive. Okay, Llion Jones,
founder and CEO of Sakana AI. Welcome Aidan Gomez, founder and CEO of Cohere. My goodness,
you know Lukasz, come on, Lukasz Kaiser, ladies and gentlemen, the only person who is
still an engineer. Lukasz, yeah, you're my hero. Illia Polosukhin, come on the stage. NEAR
Protocol, he's a co-founder of NEAR Protocol. Okay, and we have a colleague and friend who
couldn't make it because of a family emergency, and so our hearts and thoughts are with Niki.
So, let's, um, here, let's, let's, so first, first of all, they've actually never been in
the same room at the same time. That's true. I guess this work-from-home thing has gotten out
of control, but apparently, it doesn't stop innovation and invention. So, it's great to have
you guys here. You know, the Transformer was, um, and we're going to talk about the importance
and the meaning of the work, the incredible transformative capability of the Transformer, and
what it has done to industries. And obviously, as I was saying earlier, everything that we're
enjoying today can be traced back to that moment, the fact that we can learn from data of gigantic
amounts in a sequenced way, sequential data, as well as spatial data, but learn from just a
tremendous amount of data to find relationships and patterns and to create these gigantic
models was really quite transformative. The, my first question, and you guys all dive into it,
and we agreed that it is not impolite to cut each other off, talk over each other, disagree with
each other, even get out of your chair. We need a whiteboard then. For that, we need a whiteboard,
that's right. Nothing is off-limits today. But go back to the beginning. What were the problems, you
know, engineers, we need problems to inspire us. What were the problems that you were struggling
with or challenged with that led to this? I think everybody had a different problem, probably, but
for me and the team, we were working on question answering. So, very simple, you go to Google, you
ask a question, it should give you an answer. And Google has very low latency requirements. And
so, if you want to ship actual models that read, you know, search results like tons of
documents, you need something that can process that really quickly. And models at the time,
recurrent networks just cannot do that. Yeah, because there were RNNs, and RNNs had some
attention, but, you know, the difference is they need to read one word at a time. That's right. We were
generating training data much faster than we could actually train the most advanced architectures on.
And so, you actually had simpler architectures - just feed-forward nets with n-grams or so as input
features - that, because they trained so much faster on, you know, at least some problems where
Google had massive amounts of training data, basically always outran the much more
advanced, much more powerful RNNs at the time. And so, it seemed like a valuable thing to fix.
Yeah, we were already seeing these scaling laws back in like 2015, and you could see that if you
make the model bigger, it just gets smarter. And here is the best problem in the history of the
world. It's so simple, you're just predicting the next token, and it's going to get so smart and
be able to do a million different things. And you just want to scale that up and make it better. And
one big frustration is that RNNs were just a pain in the, you know, to deal with, right? And then,
so yeah, so I overheard these guys talking about, "Hey, let's replace it with convolution or with
attention." I was like, "Heck yeah, let's do this." And I think it was, I like to compare it
to, you know, RNNs were like the steam engine, and the Transformer is like internal combustion.
Like, we could have done the Industrial Revolution on the steam engine, but it would have just been a
pain in the butt, and things went way, way better with internal combustion. So now we have electric
vehicles, and now we're just waiting for fusion, right? That's the next wave. Our, I mean, two
lessons that I was reminded of constantly, especially during the time of the Transformer,
was I started tasting some of the bitter lesson when I was in grad school, when I was working
on machine translation, where I felt like, "Hey, I'm not going to, I think that gradient descent,
the way we train these models, is a much better teacher than me. So I'm not going to learn these
linguistic rules. I'm just going to ask gradient descent to do everything for me." And the second
piece was that, you know, I mean, just quoting from the bitter lesson, that general architectures
that can be scaled are ultimately going to win in the long run. Today it's tokens, tomorrow it's
actually the actions we take in a computer, and they'll sort of start mimicking our activities
and be able to automate a lot of the work that we do. So, the Transformer - self-attention
in particular, as we were discussing - had this quality that was extremely general,
and it also made gradient descent happy. And the second thing it made happy was physics because,
I guess, something I did learn from Noam over time was that matrix multiplications are a good idea.
So let's try to make accelerators happy. And both those things together, this motif has been
repeating. So every single time we add a bunch of rules, gradient descent will one day learn
those rules better than you. And this was it, like, we, like, all of deep learning has been,
we're building an AI model that's the shape of a GPU, and now we're building an AI model
that's the shape of a supercomputer. Yeah, the supercomputer is the model now. Yeah, that's
true. Yeah, supercomputers, they're there. Just so you guys know, we're building the supercomputer
to be the shape of the models. Yeah, yeah. Now, what were the problems you guys were
solving? Oh, machine translation, yeah, definitely. And it seemed so hard five years
ago. Like, you had to gather data, maybe it would translate, maybe it would be slightly wrong.
It was at the very basic level. Now, these models, you don't get any data at all. They just learn to
translate. You have this language, that language, it just emerges that the model can translate,
and it's very... What was the intuition that led to "attention is all you need"? So, I came up
with the title, and basically what happened was, at the point where we were looking for a title,
we were just doing ablations, and we had very recently started throwing bits of the model away
just to see how much worse it would get. And to our surprise, it started getting better, including
when we threw all the convolutions away. I'm like, "This is working much better." And that's
what was in my mind at the time, and so that's where the title comes from. Basically, what's
intriguing about that is we actually kind of started with that barebones thing, right? And then
we added stuff. We added convolutions, and I guess later on, we kind of knocked them out. And a lot
of the other things, like multi-head attention, were also a really super important piece.
But I was watching this movie, I don't know if you watched it, where this guy is in a
parallel universe where they don't have the Beatles, the Beatles never existed. I was wondering
what would happen, what would be the title in that universe. I don't know if you watched
it, I forget what it's called, "Yesterday" or something like that. Yeah, no idea, sorry. He's
got no time, he's trying to build a company. So, you guys, well, hey, I think this is
important. How did Transformer come up? What were some of the other choices? Who came up with
Transformer? Why is it called Transformer? It's an excellent name, by the way. Um, you, I think
Transformer, right? No, no, I liked the name that Jakob had for it. I was like, "That's a name,
let's use it." Yeah, yeah, I mean, it fits what the model does, right? And every step actually
transforms the entire signal it operates on, as opposed to having to iterate over it. By that
logic, almost all machine learning models are transformers. Yeah, oh, look at that. That's
what all machine learning models are becoming: transformers. Before, nobody thought to use the
name. And so, I guess we, yeah, I thought it was too simple. I was like, "Oh, every..." That's
exactly what I thought about it. But then, you know, like, I was overruled. Everybody thought
it was a great name, and they were right. And what was the name you came up with? I had a lot.
There were a lot of names. I mean, like, there was something called Cargonet. I had, like, there...
I wrote something. One layer was like convolution. One was attention. One I called recognition
or something. That was like the feed-forward net. And so, convolution, attention, recognition,
Google Cargonet. But I'm happy that now... Carg... That's... That's horrible. I'm glad you were...
You were outvoted. Yes, by wise people. I think the reason it became such a general name is that,
you know, in the paper, we were concentrating on translation. But we were definitely aware of
the fact that we were actually trying to create something that was very, very general, that really
could transform anything into anything else. And I don't think we predicted how well
that would actually work. Yeah, you know, when Transformers were being used for images,
that was kind of surprising. I mean, it's probably logical to you guys, but the fact that you could
chunk up the image and tokenize each little part, I think that was architecturally there very early
on. And so when we were building the tensor to tensor library, we were really focused on scaling
up auto-regressive training generally. It wasn't just for language. There were components in there
for images, audio, text, and both on the input and output side. Lukash said what he was working on
was translation, but I think he's underselling himself. All of these ideas that we're starting
to see now, of these modalities coming together and being a joint model, it was there day zero,
day ten in the Transformer repository. Because that's what Lukasz was going after. It didn't work.
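The image trick mentioned a moment ago - chunking up an image and tokenizing each little part - can be sketched as follows; the patch size and image shape are illustrative assumptions:

```python
import numpy as np

def patchify(image, patch):
    # Split an H x W x C image into non-overlapping patches, each
    # flattened into a vector: the move that lets a text-style
    # Transformer consume images as a sequence of "tokens".
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    tokens = (image[:rows * patch, :cols * patch]
              .reshape(rows, patch, cols, patch, C)
              .transpose(0, 2, 1, 3, 4)          # group pixels by patch
              .reshape(rows * cols, patch * patch * C))
    return tokens

img = np.zeros((32, 32, 3))
print(patchify(img, 16).shape)                   # (4, 768): 4 tokens of 16*16*3
```

From there, each patch vector is embedded and fed to attention exactly like a word token.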
We're like five years ahead now. Now it works. But I mean, there was this other paper, "One Model
To Learn Them All," but it did use self-attention. Yeah, eventually, it started working, but it was
really all there very early on, and those ideas were percolating. And it took some time. Lukas's
goal was, we have all of these academic datasets, and they go from image to text, text to image,
audio to text, text to text. We should train on everything. And that idea is really what
drove this scaling effort to model the web, which has, you know, succeeded, and
now many of us are doing similar things. And so, I think that North Star, it was there on
day zero, and it's been really exciting and gratifying to watch that come to fruition. We're
actually seeing it happen now. Yeah, and it's so interesting that in so much of knowledge, it is
about translation. Image to text, text to image, text to text. You know, tensor to tensor,
tensor to tensor. Yeah, this Transformer idea, this translation idea, is quite universal. And in
fact, you're using it for biology, that's right. Or maybe something that we like to call biological
software, which is maybe an analogy to computer software that starts its life as a program that
you then compile into something that could run on a GPU - in our case, in a cell. Basically, the life
of a piece of biological software starts as a specification of some behaviors you want, say,
print a protein this much, that specific protein in a cell. And then you learn how to translate
that using deep learning into RNA molecules. That's right, this idea really goes all the
way from not only translating, say, English into computer code, but also specifications of
medicines, hopefully transformational medicines. Any day now, into the actual molecules that we
then... And do you guys create a big wet lab that produces all this? You have to run experimentation
against nature, right? You really have to verify this. The data does not yet exist. There are tons
of extremely valuable genomic data that you can download - still largely openly and publicly
available, because it's generally still largely publicly funded. But really, you
still need data that specifically speaks clearly to the phenomena that you're trying to model
at hand in a given product. Say, for example, something like protein expression in mRNA vaccines
or so. Yeah, it's really quite true. Over in Palo Alto, we have a whole bunch of robots and people
in lab coats, both deep learning researchers and folks who were previously biologists - now we think
of ourselves as pioneers of something new - working on actually creating that data and
validating the models that design those molecules. So, the idea you're saying is that some of the
early ideas of translation, a fairly universal learner, universal translation, were there
in the beginning. What are some of the major architectural fixes, enhancements, breakthroughs
that all of you have seen along the way that you think are really great additional contributions
on top of the base Transformer design? I think on the inference side, there's been tons
of work to speed these models up, make them more efficient. I still think it kind of disturbs
me how similar to the original form we are. I think the world needs something better than the
Transformer. I think all of us here hope it gets succeeded by something that will carry us to a new
plateau of performance. Yeah, I wanted to ask a question to everyone here. What do you see comes
next? That's the exciting step because I think it is too similar to the thing that was there years
ago, right? Yeah, I think people are surprised how similar it is, like you said. And people do
like to ask me, you know, what is coming next, as if I'll just magically know because I'm on the
paper. But the way that I answer the question is to point out an important fact about how these
things progress. You don't just have to be better, you have to be clearly, obviously better. Because
if you're only slightly better, then that's not enough to move the entire AI industry to the
new thing. So we're stuck on the original model, despite the fact that probably technically it's
not the most powerful thing we have right now. But you know, everyone's toolset, right? But what
are the properties that you guys want to make better? The context window, you want to make better;
also the generation, the token generation ability - we want to make it faster. Well, I'm not sure if
you'll like this answer, but they're using too much computation right now. I think they're doing
a lot of wasted computation. We are trying to make that more efficient. Thank you, but actually, it's
about allocation. It's not so much about the total amount. You need any amount of computation, right?
It's really about spending the right amount of effort and ultimately energy on a given problem.
You don't want to spend too much on a problem that's easy or too little on a problem that's
hard and then fail to actually provide a proper solution. Here's a real example: it's like
2 plus 2. Right now, if you enter it into this model, it uses, you know, a trillion parameters,
even though computers are perfectly capable of doing that. So I think adaptive computation is one
of the things that has to come next. So we know how much computation to spend on a particular
problem. Yeah, again, an immediate follow-up paper, that I know a subset of
the authors here did, was Universal Transformers, which targeted exactly that. So these ideas
were... They were there, still there. Yeah, and the paper a year earlier, the mixture of experts,
that now everybody uses, that's everywhere. It's kind of now folded into the Transformer, but it
was before the Transformer. I actually don't know if folks here know, but we kind of failed at
our original ambition. We started this because we wanted to model the evolution of tokens. It
wasn't just linear generation, but text or code evolves. We iterate, we edit, and that allows us
to potentially mimic how humans are evolving text, but also have them as a part of the process.
Because if you naturally generate it as humans are generating it, they can actually get feedback.
"Oh, you didn't tell me." Yeah, so I mean, all of us read Shannon's papers, so yeah, we were
like, "No, no, no, let's just do language modeling and perplexity." But that has not happened. And
I think that's also where we could intelligently organize our computation well, right? And that
goes for images as well. Diffusion models have the interesting property that they're iteratively
refining and improving. We don't even have that. And yeah, I mean, this fundamental question of
what knowledge should exist within the model, what knowledge should exist outside it, retrieval
models - RAG, I guess, is one instance of this. And the same goes for reasoning too. What
reasoning should be done outside with symbolic systems, and what reasoning should be done
in the neural network? It's largely an efficiency argument. I do believe that large models will ultimately
learn the circuits to do 2 plus 2. But if you're using a trillion parameters to add up two numbers, that's
inefficient. Well, in the case of Illia's example, if asked 2 plus 2, the AI should just pick up a
calculator, use the least amount of energy that we know, which is a calculator, to do 2 plus 2.
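The calculator idea can be sketched as a tiny router; everything here (the operator whitelist, the routing rule, the placeholder model branch) is a hypothetical illustration of adaptive computation, not anyone's production system:

```python
import ast
import operator as op

# Whitelisted arithmetic operators for the cheap "calculator" path.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul}

def calculator(expr):
    # Return a result only for trivially simple "a <op> b" arithmetic;
    # anything else is declared not-a-calculator-problem.
    try:
        node = ast.parse(expr, mode="eval").body
    except SyntaxError:
        return None
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left, right = node.left, node.right
        if isinstance(left, ast.Constant) and isinstance(right, ast.Constant):
            return OPS[type(node.op)](left.value, right.value)
    return None

def route(query):
    # Spend the least energy that solves the problem: calculator if
    # possible, otherwise fall through to the expensive model.
    result = calculator(query)
    if result is not None:
        return f"calculator: {result}"
    return "model: (spend the big compute here)"

print(route("2 + 2"))                            # calculator: 4
print(route("How did you decide on 2 plus 2?"))  # model: (spend the big compute here)
```

The point is the allocation: easy queries take the cheap path, and only the genuinely hard ones pay for the trillion-parameter forward pass.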
However, if asked, "How did you decide on 2 plus 2?" or "Is 2 plus 2 the right answer?" then it
can go into math theory and explain from there. That's right, that's right. I'm pretty sure all
of you guys are creative and smart enough to go pick up a calculator. GPT does this right now,
exactly. No, that's right, Jakob. I think the model is just too cheap right now. It's too small.
Yes, it's too small, it's too cheap because, like Jensen said, you are producing computation
that costs like 10 to the negative 18th dollars per operation, or something on that order. Thank you for creating so
much of it. But if you look at a model with like half a trillion parameters and you're doing like
a trillion computations per token, that's still like a million tokens to the dollar. That's like
a hundred times cheaper than going out and buying a paperback book and reading it. It's so cheap.
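That envelope math checks out with round numbers; the 2-operations-per-parameter-per-token rule of thumb below is my assumption, not his exact figure:

```python
params = 0.5e12              # half a trillion parameters
ops_per_token = 2 * params   # assumed ~2 operations per parameter per token -> ~1e12
dollars_per_op = 1e-18       # "10 to the negative 18th dollars per operation"

dollars_per_token = ops_per_token * dollars_per_op
tokens_per_dollar = 1 / dollars_per_token
print(f"${dollars_per_token:.1e} per token, ~{tokens_per_dollar:,.0f} tokens per dollar")
```

That recovers the "million tokens to the dollar" figure quoted above.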
And we have applications that are a million times or more valuable than efficient computation on
a giant neural network. I mean, certainly curing cancer and that sort of thing is, but even just
talking to anyone, talking to a person, talking to your doctor, lawyer, programmer, that you pay like
a dollar a token or more. We've got this factor of a million to play with to make it way smarter.
That would be so amazing because sometimes just the right word is going to change the world. Yeah,
that's exactly it. I also think that to make it smarter, the right interfaces are essential. How
do we actually get the right feedback? How do we sort of decompose the task that we're doing in a
way that humans can intervene at the right time? And if we ultimately want to build models that can
mimic and learn how to solve tasks by watching us, the interface is going to be absolutely crucial.
This might be a great way to do this. Why did you start a company? You all left Google after you
invented the Transformer; you worked on the Transformer and you started your companies. Could
you all just quickly say something about your company and why you decided to start it? Because
a lot of the things that you're describing, you know, your companies are working on. So, yes,
Essential. We're really excited about building models that can ultimately learn to solve new
tasks at the same level of efficiency as humans. They watch what we do, they're able to understand
our intents, goals, and start mimicking what we're doing. And that's ultimately going to change
how we interact with computers and how we work. Basically, in 2021, one of the big reasons why I
left was that you couldn't make these models smarter in
the vacuum of a lab. You actually have to go out and put them in people's hands, because you kind
of need the world to interact with these models, get feedback from them, and make these models
smarter. So the way to do that is to go outside, build something useful. Learning does require an
experiential flywheel, absolutely right. And it was hard to do it in the vacuum of a lab, and
putting something out in the world was easier at the time. Yeah, that's cool. And Palmyra, yeah.
Oh yeah, so I founded Character in 2021, and you know, the biggest frustration I had at the time is
like, here's this incredible technology and it's not getting out to everyone. This is like the
most, it's so easy to use, it has so many uses. Can you guys imagine Palmyra being impatient? The
value of this is like, get it to like a billion people, let them do a billion things with
it. This is what Zen looks like. He's calm, deep learning Zen. This is when he's calm. Zen
looks calm in comparison to Noam, and you're sitting next to him, like, "Yeah, thank God for
giving us this incredible technology." And thank Jensen, and thank everyone. The ultimate goal is
to help everybody in the world. You guys all have to go to Character. You've got to go check this
out. But I'm serious, you have got to go. Yeah, let's start by doing this for real. Let's build
something as fast as we can, get it out there, and get billions of people able to use it. And
you know, to start with, a lot of people are using it just for fun or for emotional support or
companionship, or to replace entertainment, and it's really working. Just going to grow the
number of people who are using it is insane. It's really, really working. Congratulations. Thank
you. Yeah, that's awesome. I already said a little bit about biological software, but maybe
more about the why for me personally. In 2021, I co-founded Inceptive. You know, out of the need,
or the realization, that there can be a much more direct impact on improving people's lives with this
technology than what we had before. It was broad, but not very direct. My first child was
born during the pandemic, which certainly gave me a newfound appreciation for the fragility of life.
Then, a few weeks after, AlphaFold 2 results came out for protein structure prediction, winning
CASP14. And one of the big differences between AlphaFold 2 and AlphaFold 1 was that they started
using the Transformer and replaced the rest of their model or architecture with that. So it
became pretty clear that this stuff is ready for prime time in molecular biology. And then
a few weeks after that, the mRNA COVID vaccine efficacy results came out, and it was very clear
that mRNA and RNA, in general, can do anything in life. With the RNA World hypothesis, there is no
limit to what can be achieved with RNA. But for the longest time, it was the neglected stepchild
of molecular biology. So it just seemed like almost a moral obligation. This has to happen,
and somebody has to go do that. I've always thought of it as drug design, but I love that you
think of it as programming proteins, programming biology. It makes so much more sense, actually.
Yeah, I love the concept of it. And of course, this compiler would have to be learned. We're
obviously not going to write this compiler. So we have to go learn this compiler. That's right.
And if you're going to go learn this compiler, obviously you need a laboratory to test it
and generate the data. Yeah, this flywheel; I can't wait. I'm pretty excited, and I can see it
happening. Llion, yeah. So I was the last one to leave. It's still very early days, but I can tell
you what's going on so far. Yeah, I co-founded a company called Sakana AI. What does "Sakana" stand for,
anyways? It's a bit weird. It means fish. Sounds very weird in English, right? Call your company
"fish," you're off to a great start. Thankfully, the Japanese seem to like that. Yeah, so the
reason we named it "fish" is that it's supposed to be evocative of a school of fish. We want
to do nature-inspired artificial intelligence. The analogy is that a small fish can be quite
simple, but when you bring a lot of simple things together, they become quite complicated. But
people are not entirely sure what we mean when we say "nature-inspired," so I want to dive
into that a little bit. The central philosophy, right? What I try to push on researchers when
they join is that learning always wins. Anytime you go from humans trying to do something
by hand, trying to engineer it themselves, to actually just using computers to search through
the space, you always win. Even originally, the Deep Learning Revolution was an example of
that. We went from hand-engineering features to just learning them, and it worked so much better.
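The "search beats hand-engineering" point can be sketched as a toy evolutionary loop. Everything here, the linear rule, the population size, the mutation scale, is a hypothetical illustration of the idea, not Sakana AI's actual method:

```python
import random

random.seed(0)

# Toy task: recover the hand-engineerable rule 2*x - 0.5*y purely by search.
data = [(x, y, 2.0 * x - 0.5 * y) for x in range(-3, 4) for y in range(-3, 4)]

def fitness(candidate, data):
    # Mean squared error of a candidate (a, b) rule; lower is better.
    a, b = candidate
    return sum((a * x + b * y - t) ** 2 for x, y, t in data) / len(data)

def evolve(data, pop_size=32, generations=100, sigma=0.3):
    # Tiny truncation-selection evolutionary loop: keep the best quarter,
    # refill the population with Gaussian mutations of those parents.
    pop = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=lambda c: fitness(c, data))[: pop_size // 4]
        pop = parents + [
            (p[0] + random.gauss(0, sigma), p[1] + random.gauss(0, sigma))
            for p in random.choices(parents, k=pop_size - len(parents))
        ]
    return min(pop, key=lambda c: fitness(c, data))

best = evolve(data)
```

The same pattern, evaluate candidates, keep the best, mutate, repeat, scales from this two-parameter toy up to searching over how to merge and stack model layers.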
So, you know, to the researchers in this room, I want to remind you that with the mad amount of computation
that Nvidia has given us, there are other things we can do apart from gradient descent. We can
use it to search through the space of what we're currently hand-engineering. And actually, I would
like to tell you that, I think today or tomorrow, there's a sort of a time difference problem,
we're actually making an announcement. I'm sort of surprised that we have something to announce
so early, but we have some research that we're going to be open-sourcing. And it's very on-brand
because what's happening, what's in vogue right now, is model merging. But it's being done by
hand, like we're taking the algorithms by hand, how to merge these things. So what we did is we
took all the models available on Hugging Face, and then used a very large amount of computation
to use evolutionary computation to search through the space of how to merge and stack the layers.
And let's just say, it worked very, very well. So keep a lookout for that. Wow, okay, all right,
that's fantastic. It makes a lot of sense, actually. I'm also under strict orders to say,
"We're hiring." That's fantastic, good job. Aidan. Yeah, I think my reason for starting Cohere
was very similar to Noam's. I saw a technology that I thought could change the world. You know,
computers started speaking back to us, they gained a new modality. So I thought that should change
everything, every single product, the way that we work, the way that we interact with all the
stuff that we've built on top of computers. And it wasn't changing, there was stasis, and there
was this dissonance between the technology we were faced with, for those of us in the know, and
what it was out in the world doing. So I wanted to close that gap. I think the way that I've
gone about it is a bit different than Noam's, in the sense that Cohere builds for enterprises.
So we create a platform for every enterprise to adopt and integrate it into their product, as
opposed to going direct to consumers. But that's the way that we want to push that technology out
there, make it more accessible, make it cheaper, and help companies adopt it. You know, the thing I
really love is, this is what Aidan looks like when he's super excited, and that's what Noam looks
like when he's super calm. I just love that. Very true. Cohere, okay. Lukasz, well, you did not found
a company, I know, but you went on to change the world. Go ahead. Yeah, I
finally joined OpenAI after some time. You know, as Al Capone was asked why do you rob banks, he
said, "That's where the money is." Well, at the time I joined, that's where the best Transformers
were: at OpenAI. It's a lot of fun at the company. We know you can take a ton of data and a ton of
compute and make something nice. And I still hope we can remove the ton of data part; we'll just
need even more compute. Yeah, sorry. So, do yours, and then I want to ask these guys the next
question. Yeah, so I was actually the first one to leave, midway. And because, kind of similar to
our shift, I strongly believed that the way we're going to make progress towards, you know, pretty
much software is eating the world, and machine learning is eating software, is to teach machines
to code. So that you're able to actually generate software and transform everyone's access. Now,
this was 2017, it was a little bit too early. We did not have as good compute yet at the time.
And so what we tried to do, we tried to coordinate people instead to actually generate more data.
That's part of it. As a startup, you actually have this ability to put something in front of
users and then also incentivize them. And we ended up realizing we need the new basic primitive,
which is programmable money. Because programmable money is what allows us to coordinate people at
scale. And so we ended up building the protocol, which is a blockchain, which has been running
since 2020. It has the most users in the world in the blockchain space, with multiple millions
of daily users who don't even know they're using blockchain, but they are actually interfacing
with programmable money, programmable value. And now we're starting to use that to actually bring
back some of those tools to generate more data. And I think fundamentally, I mean, to, you know,
it's non-controversial in this group, but it's probably controversial elsewhere, copyright as
a technology from the 1700s will need to change. We have a new generative age that is in front of
us, and the way we are rewarding creators right now is broken. And the only way to do that is to
leverage programmable money, programmable value, and blockchain. And so one of the things we're
working on is actually creating a novel way for people to contribute data that models can then learn from.
That's super cool. And then you'll be able to build a new positive feedback system exactly into
everything that we're doing, and there's a great new economy on top of it. We've got programmable
humans, we've got programmable proteins, we've got programmable money. I love this. And so one of
the questions that people have is, the current generation of GPT models have training datasets
that are, you know, 10 trillion tokens large, which represents approximately the size of the
internet, everything that you can scrape off the internet freely. And so what's next? What kind
of new model technologies have to be explored, like reasoning, you know, so on and so forth? And
I'll let you guys talk about that. And where would the data come from? From interactions? Like,
it needs to come from interaction with users. And for that, like, you need massive platforms
to actually attract people; you need economic value people get from this to do this. And then on the
back end, you can funnel it to all of the models to actually become smarter. You can do that to
make a model even better. But how do you get to that incredible pre-trained model, the starting
point that everybody would want to interact with? Is there a way for models to interact with each
other through reinforcement learning? Are there synthetic data generation techniques? Like,
you know, there's all of this, right? I think, kind of between all of us, we're working on every
one of those techniques, probably. Yeah, I mean, I think the next big thing that's coming is
reasoning. I think a lot of people have realized this, and a lot of people are working on it. But
again, a lot of it is being hand-engineered right now. We're sort of writing prompts by hand and
then trying to get them to interact in the way that we think they should. And I, of course,
think that we actually should be searching through that space and actually learning how to
wire these things together and to get this really powerful reasoning that we want. Another way of
thinking about it is that models that are supposed to generate things that we want to consume as
humans, media that we would like to consume, should be trained on all the stimuli that we
would like to consume or that we can consume. So basically, any type of video, audio, any type
of way of observing the world, 3D information, spatial information, spatiotemporal information,
should all just be dumped in there. I'm also not sure if everyone understands that reasoning and
learning from little data are very related because if you have a model that can do reasoning, then
you can give it a little bit of data. It does all of this processing, working out why this thing follows from that, and it
can put a lot of computation into that, and then, you know, oh yeah, the answer comes out, and it
generalizes from way less data because of all this computation that it puts into reasoning. It's like
System 2 thinking, in human terms, and from that, you can just let it go and it will try to build these in
as things that it will do. Ideally, you want it to design its own experiments so it collects the most
impactful data for its reasoning to be able to continue searching. But I do think that reasoning,
when we figure it out, it will dramatically reduce the amount of data you need. But then the
quality of data you need will matter much more, which is where all the interactions with the real
world and people come in. So I do think that there will be a new age where we still pre-train on some
trillion tokens, but for the things that matter, maybe the high-quality things, it will be easier to
give people back money for contributing that data, for pretty much teaching machines to become better
and better. Yeah, a person has seen only like a billion tokens, so people learn pretty well. So
there's an existence proof here. Yes, yeah, that's right. I would also argue that a
lot of progress in the field has been made because of benchmarks and evals too. So like, you know,
what is the grade school mathematics analog of, say, automation? And so breaking down real-world
large-scale tasks into sort of simpler gradations is also important because our models could
probably accomplish some of them already, and we can deploy them, get more data, and then once that loop
is closed, they can actually take on more complex tasks. Also, because they're
potentially watching, observing what people are doing, that gives them more data, and then they can
do more complex tasks; we can give them higher-order primitives, more abstract tasks. So I do
feel like measuring progress, and making progress, is also going to require breaking things down,
creating this sort of science, in many ways like what we've done with some evals: the science
of automation, or the science of interaction, or the science of code generation. And then you can't
do good engineering without measurement systems. Exactly, really important. Yeah, yeah, yeah.
And so here, I got a question for you guys. What are the three questions you guys want to
ask each other? Okay, we're just gonna fire off one first question. So, what do you think?
Awesome? Too complicated? Not elegant enough yet? Oh, okay, wow. Well, the funny thing about those
state-space models is that, you know, we remember the pre-Transformer age, right? But a lot
of young researchers don't. But when I looked at the paper for the first time, it was very obvious
to me that it was a very poor man's LSTM. So all the problems that we were having back when we were
trying to get these things working are surely also in these models. But it seems that because people
have sort of forgotten the pre-Transformer stuff, they have to rediscover all the problems. So my
guess is that these things will be important, and we'll probably end up with a hybrid model.
Well, Transformers have their recurrent step. The fun fact I find is that nobody's actually playing
with the fact that you can run Transformers for a variable number of steps and train that
differently. So exploring what we can do with recurrence, because what this model does is with
every step, it kind of augments more information for every token and resolves things and does
reasoning. So obviously, if you only have six steps, you can only do, you know, actually five
steps of reasoning because the first step is just getting more context. And sometimes you don't need
six steps, sometimes you need more. So what are the different recurrences you can do on that?
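The variable-step recurrence being described can be sketched as a tiny weight-tied loop. The `block` function here is a hypothetical stand-in for a Transformer layer, not any real architecture; the point is that the same parameters can be applied for a variable number of refinement steps:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy model width
W = rng.normal(0.0, 0.3, (d, d))       # one shared block's weights (tied across steps)

def block(h):
    # Stand-in for a Transformer block: mix channels, nonlinearity, residual.
    return h + np.tanh(h @ W)

def run(h, steps):
    # Apply the SAME block a variable number of times: more steps means more
    # refinement per token, with no extra parameters.
    for _ in range(steps):
        h = block(h)
    return h

x = rng.normal(size=(4, d))            # four "tokens"
shallow = run(x, 2)                    # few refinement steps
deep = run(x, 8)                       # same weights, more compute per token
```

Training such a loop with the step count varied per input is the under-explored knob the speaker is pointing at.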
And then the other one is, how do we go away from tokens? Like, exactly, that's the bane of
our existence. I mean, with recurrence, I have this personal belief that we
have never truly learned how to train recurrent models with gradient descent.
Yeah, that's right. And maybe it's just impossible. I mean, LSTMs, they did poorly,
it worked a little bit. Then SSMs worked even better because they're structured for it. But
maybe fundamentally, you need to train it not with gradient descent. Maybe you need to train
it in a different way, like how we humans are, in some sense, recurrent. I mean, we
live in time. Our brains update in time, but it's not so clear that we're trained with
backpropagation; probably not. So maybe there is a way, it's just not gradient descent, and
that's why it's been so hard to figure it out. Well, guys, it's been so great spending time with
you all. I really hope that you get together every now and then and see what amazing magic can come
out next time from your interactions. We have a whole industry that is grateful for the work that
you guys did. Thank you. I appreciate that. Thank you. Thank you. Thank you. Thank you. Thank you.
And I'm just going to do one. Could you guys just give me one? I'm going to do one, and I'll give
everybody else theirs as we leave. This one is for Ashish Vaswani. You transformed the world. Okay,
thank you. And this one is for Jensen Huang. Here we go. Beautiful. Thank you very much. Thank you,
Noam. Thank you. Thank you very much. Good job. Thank you, Lukasz. Thank you, Illia. Thank you
all. Alright, guys, thank you. Thanks for coming.