So welcome to the talk, Future of AI, Part 2: Artificial General Intelligence, with two of my favorite people, Bryan Catanzaro and David Luan. I welcome you onto the stage; I'll introduce you. So David is CEO and co-founder at Adept, and their company is building AI agents for knowledge workers. Previously, he was VP of Engineering at OpenAI, where he oversaw research on language, supercomputing, reinforcement learning, safety, and policy, and where his teams shipped GPT, CLIP, and DALL-E. He also co-led a team at Google Brain. Welcome, David. Thanks. And Bryan is the VP of the Applied Deep Learning Research team at NVIDIA. They work on multimodal language modeling, chip design, audio, speech, graphics, and vision, and continue to find practical new ways to use AI for NVIDIA's products and workflows. While at NVIDIA, Bryan has helped create Pix2PixHD, DLSS, Megatron, cuDNN, Pascaline, WaveGlow, and DeepSpeech. Welcome. I'll let you guys
take it from there. Thank you. Well, it's so great to
see everyone here. Welcome to GTC. It's been a while since we had
one in person, and it feels incredible to see you all here. So thank you for coming, and
I hope we're going to have an interesting discussion today. David and I are friends, and we're
just going to be chatting about the work that we're each doing
and where we think it's going. Hopefully, we'll have some
time for questions at the end. Yeah, super great to be
here with you, Bryan. I feel like you've done so
many anchor contributions to the field, so I think this
is going to be a lot of fun to get to grill you a little bit about
some of the things you believe over the next 45 minutes or so. Don't grill me too hard. All right, that's good. So I guess just to start
off, I think NVIDIA, you have driven NVIDIA's AI
efforts for quite some time. I'm curious, how would you
describe the goals of NVIDIA's AI training and research programs
that you've been overseeing? Yeah, it turns out that NVIDIA
is working on our own AI efforts, and it's something that I'm
very excited about, and something that I'm hoping is going to
continue to develop. And I think there's two really
strategic reasons why NVIDIA is building its own AI. The first has to do with the
nature of accelerated computing. So the value that NVIDIA
provides when we sell systems for AI is in speed. And having that delivered to
the engineers and researchers around the world that are creating
AI requires us to understand the process of creating AI pretty deeply. There are a lot of things about
the structure of networks. How do we use low-precision
arithmetic, sparsity? How do we deal with networking? And all the various software
stacks, compilers, libraries, frameworks, communication
in the network, systems like Grace Hopper, where we have CPUs that
are coupled to GPUs in new ways. And all of these things,
there's so many choices. And the soul of NVIDIA's work
as an accelerated computing company is to make those
choices, but that requires us to actually understand what
is being accelerated really deeply. I always like to joke that
accelerated computing actually means decelerated computing
for almost everything. And the reason for that is
that if you just say, hey, I'm going to make a computer
and it's going to be fast, that's not really saying much. All computers try
to be fast, right? So the thing that makes accelerated
computing different is that it's specialized. But then that question of
what do we specialize for becomes essential. And the only way that we can
build the systems of the future is to be building AI ourselves so
that we understand what to build. So that's the first reason. The second reason, I think, has
to do with the opportunities as AI develops around the world. I believe that AI is going
to impact the world economy in every sector, in every company. But how is it going to do that? Because there's a lot of
specialized skills that need to be developed and
also an enormous amount of compute and data and resources
that get put into building AI. Not every company is going
to be able to invest in that. And when I think about NVIDIA
and our business of supporting the world, you know, NVIDIA
is able to partner with every company, old and young, large and
small, in finance or in consumer product goods or in technology. You know, we're able to help
every company incorporate technology into the beating
heart of their business in a way that preserves their identity and
allows them to take advantage of their own unique ideas and market
position to change the world. And I think in an era where
AI is changing everything, it makes sense for AI to be
part of NVIDIA's platform. And so that's the second reason
why we're developing AI. So you have had a front row seat,
basically, to the enormous gains just due to things like model scale
over the last couple of years. I'm curious, you know, actually
zooming way out for the audience, I think a lot
of people may be familiar. But, you know, how do you
think about scaling laws? Like, what do they mean for AI? And do you think that's going
to continue to hold? Yeah, well, I've been betting
on scaling AI now for 20 years, and it's been a good bet. You're one of the original
GPU programmers way back in the day, right? Yeah, it's a long time ago. Back when I was a grad student
at Berkeley, I was doing machine learning on GPUs and published
a paper at ICML back in 2008 on using GPUs to train models
so that we could scale them. And I actually got this response
from a bunch of people at ICML, this machine learning conference. They were like, what
are you doing here? Like, everything that we're
doing here is new mathematical formulations of machine learning
that allow experiments to be run, you know, new kinds
of experiments to be run by a grad student on a laptop. And the data sets that at
the time people were using might have, like, a few hundred
data points in it, and they might be, like, a few dimensions big. So it was an era of small-scale
machine learning, and there was still lots of interesting
things happening. But, you know, I believed
that scale in data and compute was going to change the world. And now looking back, I think
that's been clear. But it's also kind of a
disappointing feeling for many people working in AI,
this idea that data and compute is all you need. Because, like, we want to
believe actually what we need is more PhDs in probability
theory, because that's really fun. It's really fun to work on
probability theory, but the idea that actually we just need enormous
computers and enormous data sets, that doesn't feel as great. But I think we need both. Obviously, I love PhDs in
probability theory, but I feel like the foundation that has
been moving AI forward for the last several decades has been scale. And I don't think that we're
seeing the end of that yet.
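(For reference, the scaling laws being discussed here are usually stated as empirical power-law fits of loss against model size and data. One commonly cited form, from the Kaplan/Chinchilla line of work, is sketched below; the constants and exponents are fit per model family and dataset, so treat the specific symbols as assumptions rather than anything quoted in this talk.)

```latex
% One commonly reported empirical scaling-law form: validation loss L as a
% function of parameter count N and training tokens D, with an irreducible
% term E and fitted constants A, B, alpha, beta.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```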
Yeah, I feel totally the same way, by the way. There's this talk I oftentimes give that basically outlines the different eras of deep learning. Like, everything before 2012, I loosely just lump into prehistory. And then, like, 2012 to 2017 was, like, you and your three best friends write a research paper that changes the world. And after 2017, though, like,
after Transformer, after learning how we map these architectures
really efficiently to hardware, then it's just really become
a data and scale game. And sometimes people ask
me, you know, should I go leave and get a PhD? Should I go think about,
like, how I can make new algorithmic advancements? I definitely think some people
should continue to consider that kind of stuff. But on net, like, even if you go
back to look at the initial AlexNet paper, right, like, people thought
that that was really an idea shift. But it was really, like, Alex Krizhevsky sitting in a corner figuring out how to map, like, convolutional neural networks efficiently to the GTX 580. Right, the two of them, actually. He was very pioneering in the
actually. The two of them. He was very pioneering in the
sense that he, like, built a neural network specifically for the system
that he had, which had two GPUs. So he had to, like, partition
it in this very strange way. And the systems work underlay
the result that he got. And I think it's been true
for so many big, big results. Like, back when I was at OpenAI,
when we did GPT-2, when Alec Radford and I were writing
the paper for that, we had long sections about, you know,
all the evaluations. And, like, a short section about
how we unified all these tasks into just predict the next token. But the modeling section was,
like, a paragraph long. It's like: we used a vanilla decoder-only transformer with, like, these couple of configurations.
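(As an aside, a minimal sketch of what "a vanilla decoder-only transformer trained to just predict the next token" looks like is below. This is not the GPT-2 code; the sizes and the use of PyTorch's built-in transformer layers are illustrative assumptions.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoderLM(nn.Module):
    """Toy decoder-only language model: causal self-attention plus a next-token head."""
    def __init__(self, vocab_size=50257, d_model=256, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens):                       # tokens: (batch, seq) of token ids
        seq_len = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq_len, device=tokens.device))
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.blocks(x, mask=causal)               # causal mask makes it "decoder-only"
        return self.lm_head(x)                        # logits over the next token

def next_token_loss(model, tokens):
    """Every task is framed the same way: predict token t+1 given tokens up to t."""
    logits = model(tokens[:, :-1])
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```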
And we were just sitting there being, like, the academic community is going to roast us, because they'll say there's no novelty. And for a long time, they did. They did. And I just, like,
time. And they did. And they did. They did. And I just, like,
keep on seeing this happen over and over again. That, like, the new metagame
that people need to play to actually get advancements
in AI is, like, poo-pooed by the current incumbents. And I think we're seeing that
again in this era. I mean, it's a little different
from some of the more broad platforms work that you all do. But, like, for us, at least,
one of the things that we really believe in at Adept is that this
next era of AI is actually going to be about getting product right. And doing the correct co-design of
product and the research objective. And having a lot of new ideas
and research actually flow from what doesn't work for customers. And I feel like that's actually
another change from the way that, like, for example, like,
my old team at OpenAI or my old team at Google Research
would think about the problem. Yeah. Hey, can we back up a
second and you explain, like, what is Adept and what is Adept doing? Yeah. Oh, so Adept is kind of
an interesting company. We're configured a little differently from the other sort of startup labs that you all know about, like OpenAI and Anthropic and Mistral and stuff like that. What we do at Adept is we have a broader north star that's both a product and a research north star. And it's: can we train an AI
agent that can do anything a human can do on a computer? So how can we build models
that don't just read and write text or understand images, but that let you give them a natural-language instruction and have them carry out whatever steps are needed on your machine, with the software that you already use at work, to go achieve that goal? So, like, simple things like,
you know, take this invoice that showed up in my email
and put it into QuickBooks or find six different ways that I might be
able to plan out this particular trip that my team needs to
go on and have the model actually just actuate your machine like
a human to go get that stuff done. I think what we see a lot
is that, like, you know, as enterprises have adopted these LLMs in particular, right? People always use them for
summarization, text generation. Those things seem to work. And the moment those things
start working in a company, they're like, OK, great. How can I actually hand off whole
workflows from my team to a model such that they could be augmented? And that's basically the problem
that we've been trying to solve. In order to solve workflows,
you need to be able to solve things that look much more
like agents, which I'm really excited about, because if
you start working on agents, all of a sudden you get to bring
all the richness of the, like, like broader RL literature, all the
work that happened at DeepMind in the mid 2010s around, like, beating
humans at Go and all that stuff. You get to bring those to bear in
the LLM era, which is really cool. Wow. I feel like what you're
articulating is a future where humans are being augmented
by these models where, like, the goal of these models is
to help people get things done. I don't know. How do you think
about the way that humans and agents are going to coexist? Yeah, this is like a really strong
part of what I really believe in from a mission perspective
is that, like, you could frame work in this space as being, hey,
how do I just outperform people and then replace them at tasks? But I believe really strongly
that the much more interesting and correct path for us to
focus on is exactly as Bryan said. It's like, how do you build
AI systems that are actually here to augment people? And I think the line there
becomes like the way that we make that happen is we work
on tasks that are like 80 percent doable by these models on purpose,
because that way you get this 20 percent of human supervision where
they oversee what the models do. Like, for example, like one
of our customers is a logistics company that uses Adept basically to handle the lifecycle of containers. People on their team log on to their platform and there's, like, dozens or hundreds of shipping containers that need to be tracked, and they have to figure out, like, have they cleared customs, all that stuff that's, like, entirely done by hand now. But the way that they now use Adept is: Adept, in the background, now goes and visits all the different software tools required to understand where those containers are, and then gives the human team a really easy way to supervise whether or not Adept did a great job on this whole mass of shipping containers.
basically changed the role where humans are able to solve
the harder problems. Like, OK, this thing
didn't clear customs. We have to go fix it. And then also gives our model
feedback about how it can do better next time. And so I think like building
these like data flywheels, these data loops by combining
the product side with the actual AI R &D side just helps your models
get way better, because if you're just working in a pure replacement
world, by definition, it's a lights-out automation process. You're not getting feedback and
your models never get smarter. So I think there's also a
better way to get stronger capabilities as well. Yeah. Absolutely. Yeah. But I think, you know,
on that note, one of the big challenges always is like
I think everyone aspires to build data loops, right? Because over time these days,
as you continue to scale up these big models, people pour
huge amounts of compute into these base LLMs and they get smarter. There's still this giant missing
piece of what's specific to you and your company and your customers. And so, like, we've been talking a lot about, like, what is the role of privacy and private data and all that stuff? And I know you've been thinking a lot about that. I'm curious how
you think that's going to play out over the next couple of years. Yeah, well, I think we are
reaching the end of an era right now with large language
models, which is the end of sort of easily accessible tokens. You know, to train one of
these large language models takes on the order of tens
of trillions of tokens of text. And it turns out that that's
about the number of tokens that humanity has written, at least
that's available that we can get to on the internet in every language,
you know, all put together, including programming languages. And so this is an absolutely
astonishing amount of data, right? So we're training these models
basically to read the entire recorded output of humanity's
intellectual work. And then we're hoping that
the model, after reading all of that, is going to remember some
of it and is going to be able to use it to reason to solve problems. And the fact that that
actually works in some ways is kind of surprising. It's really exciting. You know, and it's one of
the things that, you know, sometimes I wake up in the
morning, I'm just like pinching myself, like, wow, I can't believe
that this like crazy thing that we as computer scientists tried to do
of like find all text that humans ever wrote and then train a model
on it, that that actually leads to a thing that can help people
solve problems if it's fine-tuned and supervised in the right way. So we've been pushing this,
you know, for the past few years, really since, I guess,
GPT-2; I think GPT-2 really kicked off that search for larger
and larger data sets and models. And, you know, the progress
has been really incredible, as you know, but there just
isn't more text to read, right? Like there's just not more. And so but yet we know
that our models are not actually done, right? There are many problems that
our models right now just have no hope of solving. I love that you brought up
the customs example, right? So like getting something
through customs, it doesn't seem intellectually very difficult,
but actually there's a lot of sophistication: you have to understand how these different systems work. And the rules are written
down in kind of vague ways. And like there's negotiation
that's happening. And different companies
have different rules. So it's like there's no
one-size-fits-all procedure for getting that done. That's right. And so it's clear
to me that the future of these models has a lot more to do with
the kinds of data that you're using to train the model or to supervise
it, to fine-tune it, right? So we're going to need to
teach the models very specific things rather than the general
thing of read the entire internet and then we're going to do
RLHF or supervised fine-tuning with a little bit of human
feedback about like some pretty basic kinds of problem solving. That is going to need to shift
towards something more specialized, something more in-depth. And I think it's pretty clear
that data quality and, sort of, the purpose of the data is going to matter much more in the future than it has. And, you know, today I think
it's true that the world's most valuable data is also
the world's most protected data. For example, like if you think
about my own personal valuable data, you know, like we all
have things that we protect. Like maybe my text messages. You know, I wouldn't like
those to be public. Or my medical records. Or my calls, you know, to my
family members and friends. And yet if there was a model
that saw my life in that detail, it probably would be a really
great assistant for me, right? And so the most valuable
tokens to me are also the most heavily guarded tokens. And I think that's true for
businesses as well. So I think, you know, my personal
belief is that every business is founded on a secret. It's usually the kind of
secret that, you know, Jensen Huang can shout from the
rooftops for 30 years. Like, hey, accelerated computing. It's a thing. But the world doesn't
understand it, right? Like that's the thing
about a good secret. Is that even when you explain
it, there's something about it that, you know, is unique. Like you have a unique way
of thinking about the world. You can explain it, but
it's still yours. And other people think
you're crazy, usually. They often think you're crazy. They don't understand, like,
how this would matter. But, you know, I believe
every company, not just tech companies, but every company
has something unique about it. Some secret, you know, that is
kind of the core of the purpose of the company, its mission,
or its market position, or the way that it goes about problem solving,
or maybe culturally, you know, how is the company held together? These are enormously valuable,
and yet they could never be public, right? So like the act of, like, exporting
all of your most secret data, basically exposing the beating
heart of your business to an agent, actually requires a lot
of sort of data provenance and also safeguarding because, you know,
these models, as they learned from this very valuable data, they're
going to become very powerful. But then, you know, the question
is how are we going to use them to augment the work that we're doing? And so to me, what that says
is that we're going to be entering in an era of increased
specialization where entities, companies, people are going
to be able to use their own data that's very, very valuable, but
very protected, and combine that with these models to make agents
that are actually super useful. Yeah, I think that's got to be
the way it's going to play out. Just to layer on my own perception
of how the last couple years have played out in that particular
space, like, I remember, I think back in maybe 2018-ish,
I was at a bar in Noe Valley with this guy, Durk Kingma, who invented the variational autoencoder. Really cool dude, and we were
just catching up on the state of research. He had just left OpenAI to
go to Google Research, and he was just like, you know,
David, like, I feel like this whole behavioral cloning thing
has a long way to run, and it might just end up working really well. And I was like, oh, well,
what do you mean? He's like, well, maybe the critical path
to general intelligence isn't actually that you need to go solve
this whole crazy RL problem and learn every possible behavior from
scratch, including language from running simulated agents running
around in virtual environments. Maybe the right answer is
you just clone everything that people have ever done
in their lives, and throw all the weights of that into one model. And that's exactly what we're
doing now, right? With LLMs, we're just
doing that for text. With multimodal models, we're
doing that for images and text, or audio and text, or
all of YouTube and text. We were just training these
models that, like, simply just predict, given the context
so far, like, what is actually going to be the most plausible
thing a similarly situated human would have done. And so it's really cool that that
works at all, but I think there's a couple of corollaries to that, one
of which is that these models are, one, only as smart as the smartest
data in its training set, really. Like, it has some generalization
capabilities, but, like, anything that is true new knowledge
discovery under the training objectives we have right now
is actually going to be penalized by the model, right? Because it doesn't match
anything in the training distribution that you put in. And also, these models end up
basically learning how to compress all the text or images or whatever
that you put in to go train it on. So if you have a bunch of
crappy data, the model is just wasting so many parameters
on that kind of thing, right? So I think the combination
of those two things really point you in the right direction. I have a joke about that, which
is, like, we were training a model, this was probably five years ago,
and, like, our model was diverging. And we couldn't figure out why. And it turned out that we
had downloaded some web pages where people were drawing pictures
with, like, ASCII art and emojis. And so we were, like, feeding
those tokens into our model as if they were English tokens. And our model wasn't big enough
to kind of understand that this was a different language,
a language of, like, ASCII art. And so it just exploded. So just, like, P of, like, pound
sign was, like, extremely high. The model just couldn't, at
the time, you know, five years ago, it could not learn how
to draw ASCII art and learn the English language at the same time. It's so funny. I think, like,
we've all sort of built up these, like, battle scars
of, like, stupid, like, quantities of data thrown at these things. I remember for one of the
GPTs, I forget which one, it turned out part of the corpus was just, like, pages and pages and pages of Canon printer serial numbers that we hadn't done a good job of filtering out. It just, it really does not end. And so that's actually why,
like, I feel like, you know, going back to the private
data thing, right, like, part of the goal for Adept is, like,
you know, we're training these agents to do work on your computer. We need to learn from the smartest humans possible doing the hardest tasks. Because if you don't have
that kind of data, which is not public data, it's not sitting
around on the internet, then it's really hard to push the increased
capabilities of your base models. And so I think, like, you know,
there's lots of interesting work that you can now bring
to bear on this particular domain in the agent space of
controlling your computer that helps you sidestep that a little bit. Because in the usual text
LLM domain, you don't really have a simulator. And because you don't really
have a simulator, you can't do as much interesting work, like,
for example, one of the things that we've been spending a lot of time
thinking about is self-play, right? So how can you train a model
that can use your computer, that can also scrutinize its
own decisions and let you spend compute at this sort of
post-training time to collect new experiences about how you might do
things on your machine and take the ones that are successful and train
on those, and build loops like that.
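(A hedged sketch of the kind of collect-filter-train loop being described: sample many attempts at a task, keep the ones a reward signal accepts, and fine-tune on those. The names `rollout`, `task_succeeded`, and `finetune` are hypothetical stand-ins, not Adept's actual APIs.)

```python
import random

def rollout(policy, task):
    """Run the current policy on a task and record the trajectory of steps it took."""
    return [policy(task, step) for step in range(task["max_steps"])]

def task_succeeded(task, trajectory):
    """Reward signal: did this trajectory actually accomplish the task? (stubbed here)"""
    return random.random() < 0.2

def finetune(policy, successful_examples):
    """Update the policy on the kept (task, trajectory) pairs. (stubbed here)"""
    return policy

def self_improvement_loop(policy, tasks, rounds=3, samples_per_task=8):
    for _ in range(rounds):
        keep = []
        for task in tasks:
            # Spend extra compute at post-training time: many attempts per task...
            for _ in range(samples_per_task):
                trajectory = rollout(policy, task)
                # ...but only train on the attempts the reward signal accepts.
                if task_succeeded(task, trajectory):
                    keep.append((task, trajectory))
        policy = finetune(policy, keep)
    return policy
```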
And, in addition to, like, solving these specialized-models-with-private-data problems, the other set of problems I'm really excited about is that I think in the next year or two, we're just going to see tremendous gains in AI capability in the post-training step, not just in the pre-training step that people know and love today. Yeah. Yeah. Well, I mean, I do
think that the post-training step has already shown to
be enormously important, right? Like, if you just take a raw
language model without doing SFT, RLHF, trying to align
it to human preferences and give it some problem-solving
capability, it turns out that the language model isn't actually
nearly as helpful, right? And so I think we're seeing
a lot of progress happening when we figure out how to make
something that's generally smart, but then we specialize it to try to
do a thing that's helpful, right? And then the question is,
you know, how is that going to continue going forward? So I have kind of a spicy
question for you, and I don't know the answer to this question,
but I have an opinion. So do you think that the way
that these problems are going to be solved is mostly going
to be through general intelligence, or do you think it's mostly
going to be through specialization? Or is that a stupid dichotomy and
we shouldn't ask that question? No, I think it's a good question. And I'd love to flip it on you after I give my answer, but my experience at least is that the quality of your raw pre-trained
model sort of sets the ceiling for behaviors and intelligence
that you might see, regardless of what you do after the fact, right? So you kind of want to make
sure during pre-training stage that you have support in the
training distribution for most of the tasks that you're
going to care about downstream. And then I think about everything
that happens after that as really, like, teaching specialized rules, teaching specialized knowledge. It's, like, collapsing the waveform: at time step X, the model could do one of N things, because in its training set people did it N different ways, down to, hey, at my company, I do it this one particular way. So push up the likelihood of the next step according to that particular way. I feel like that's really the
role of everything that happens during the post-training phase. So I think what's going to
happen is like in the next couple of years to get state of the art
capabilities, not like necessarily like fast local capabilities,
right, but to get state of the art capabilities, every organization
is going to need a combination of access to one of the few like
true frontier models that has like just the highest level intelligence
possible with the private data that's like teaching that
model, the particular things that are special to you and your task. So probably a combo. I'm curious. Yeah. Yeah. I mean,
I think I'm generally aligned. One of the things that I really
like thinking about is how multidimensional intelligence is. For example, I don't know
how many of you love, let's say Beyonce, right? Obviously iconic artist. I believe that she has a special
kind of intelligence. It's a really rare kind of
intelligence, the way that she's able to understand other
people and sort of cultural trends and then her own life experience
and then synthesize that into a thing that captivates the attention
of hundreds of millions of people around the world and thereby
makes large amounts of money. This kind of intelligence is
pretty rare, pretty useful. At least us humans, we
resonate with this. Like a lot of our culture
is driven by super unique forms of intelligence. I would almost say like not
to say that these are aliens among us, but they're certainly
icons among us, right? Of people that just really
have special skills. And you know, I don't know
what Beyonce's SAT score was. I don't know. I'm not actually very interested
in that question, right? It's not directly relevant to the
reason why she's so interesting and her work is so valuable. Now I do think that, like you
said, having a more general, a smarter, more general intelligence,
it does place kind of a ceiling on the capabilities. If your model just isn't very
general and just doesn't know very much, it's hard to get it to be
really amazing at anything, right? And so I do believe that
general intelligence is useful and we're going to
continue pushing that frontier. But my belief is that
because intelligence is so multidimensional, I
think there's probably 8 billion different dimensions
of intelligence because there's 8 billion humans on the planet. And I believe that there's
something that I could learn from pretty much all of them. And I think that we're going
to find out as we deploy AI to solve problems around the
world that there are so many different forms of intelligence
that we're going to build in order to solve these problems. And I think that's going to
be pretty exciting. But one of the sort of the
implications of that, which I think goes along with the
work that you're doing at ADEPT, is that I think replacing people isn't
actually very interesting, right? Because if you have this thing
that's so multidimensional and so complicated, making
something general that just all of a sudden does everything,
I just don't see that that's where we're going to go. Because I think the problem is
much more complex than that. I think it's way more complex. I mean, one of my coworkers
actually has a really good analogy for this, which is
that the best way to build AI that is really good at augmenting
people is to think about it kind of more like a cognitive tool and
cognitive technology than a robot. In the same way that our brains
changed when we evolved writing and when we invented mathematics. And similarly, our brains
changed when we were able to offload the majority of
facts to our phones and to learn how to use calculators, right? I think the same thing is
going to happen as we build these increasingly sophisticated
AI agents, right? Because then you have another
set of things that you don't really need to do, so you
can spend your own limited representational capacity
learning how to do something else. Then you sort of co-evolve
this joint way of thinking with these new models as they
get smarter and smarter. And I think that's probably
how this is going to play out. And I think most people don't
think that yet and are spending all their time figuring out
how do I take, in the same skeuomorphic way we saw with
early touch devices, right? People are still in the mode
of figuring out how the current analogies we have today could
apply to this world when you have smarter and smarter AI agents. When what you really need
to do is you actually need to go revisit those interaction
principles from scratch. Yeah. I totally agree with that. This is one of the reasons,
actually, why I continue to be excited about Omniverse, as we call it at NVIDIA, or virtual worlds in general, is that I
think a lot of the most interesting ways that people are going
to be solving problems with AI are going to happen in a virtual world,
as opposed to the skeuomorphic, like, oh, how do I interact with my
phone today or how do I work today? So how do we build agents that are
able to kind of bridge that gap, I think, is really interesting 100%. Taking it down maybe a different
direction, I'm really curious, Bryan, what you think about
you know, if you go look at this broader North Star of
either generalized AI agents or if we want to call it AGI,
what do you think are the remaining big open research problems? Like, the stuff that isn't
just, you know, scale these things more, put more data in them. Yeah. Do you think there's
anything that's left? I do. I think fundamentally the way
that we're doing inference today doesn't really allow for the kind
of problem solving that we need to be able to do, because it's fairly
linear, so, you know, most of the time when these models are actually
being deployed, you ask them a question, they provide an answer. But I don't know, maybe I'm
anthropomorphizing it a little bit, but back when I was in school and
I was taking a test, you know, the answers to some questions, sure,
you just write it down, the answers to other questions, they could
take a thousand times more thought. And right now it's really difficult
at inference time for our models to be able to allocate compute. You mean like adaptive compute? Yeah. Yeah. It seems like we need like
the ability for these models to be much more introspective about
the outputs they're generating, and that involves allocating
compute in different ways. And like if you need to spend
a thousand or a million times more compute to generate one
token than the others, then we should figure out how to do that. Do you feel like things like all
of the chain of thought prompting related tricks and stuff like
that are a way to approximate that? I think it's a start, but, you know, those aren't widely deployed right now. I think one reason is because
they're so expensive. And so I kind of, you know, going
back to the bitter lesson that we spoke about at the beginning,
I think we haven't really seen how the bitter lesson applies to
inference as much as to training. You know, most of the time
when we talk about the bitter lesson, we're talking about
how do we build frontier models and, you know, just dump insane
amounts of training into them. But I think actually there's
going to be an analog of that to the deployment phase. And the research for that,
I think, is pretty nascent. Interesting. Interesting. So when you say bitter lesson
for deployment phase, it'll be like getting rid of handcrafted
tricks during inference time to get the base model to have
already learned sort of the right things to do during inference? Or do you mean
something else? Yeah. I mean, I'm thinking that there's going to be some connection between the amount
of compute that we can spend on inference and the smartness
of the models that we produce. And I think that's the underlying
thing that I'm getting at. The way of, you know, how
we actually implement that, I think that's where the research
is going to have to go. I mean, there's a lot of papers
about this kind of topic right now. But I don't think anybody's
fully cracked it yet. And I think we're going to
see some pretty amazing things that come out from much more
compute-intensive inference.
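(One concrete, hedged example of spending more inference compute on a hard problem is self-consistency-style sampling: draw several chain-of-thought completions and take the majority answer, with the sample count as the compute knob. `generate` below is a hypothetical stand-in for whatever sampling API you use, not a specific product's interface.)

```python
from collections import Counter

def generate(model, prompt, temperature=0.8):
    """Sample one completion that ends with a line like 'Answer: <x>'. (stand-in)"""
    return model(prompt, temperature)

def answer_with_more_compute(model, question, n_samples=16):
    prompt = f"{question}\nLet's think step by step."    # elicit a reasoning trace
    finals = []
    for _ in range(n_samples):                           # n_samples is the compute knob
        completion = generate(model, prompt)
        finals.append(completion.split("Answer:")[-1].strip())
    # Majority vote over the sampled final answers.
    return Counter(finals).most_common(1)[0][0]
```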
I completely agree with that. I feel like when I think about post-training, there's, you know, straight up: how do you get the models to be smarter during inference time? But then there's also, just after you're done pre-training the thing, how do you use the artifact you just created to go actually improve the model itself before you even deploy it? And I think that second bucket
is going to be huge. I feel like this doesn't have
to be RL, but this combination of these base models that
have a lot of instincts that have been honed into them through
the pre-training phase with, you know, trying to get these things
to actually understand the reward signal of the task you're actually
trying to solve and then be able to spend compute to push up numbers
on that particular reward signal. And we see the early signs of
this with things like RLHF, right? But like, yeah, that's like we're
at the very, very surface of the full scope of research and what's
possible down that particular path. And I think over the next
year or two, we're going to see, and we're already seeing in the papers, like, true, discontinuous gains that happen when you're able to hook up RL and/or search in all sorts of different environments with these base models. I mean, just even thinking about
like within the agents domain, your computer is one example,
but also, like, all this excitement around universal foundation models for robotics. Like, that's another great
example of how, you know, right now we're doing the
pre-training phase for that. But then there's a very obvious
second step that happens after you've done the pre-training to
make those models seem like amazing controllers and amazing planners
for all sorts of robotics tasks. I think it'd be so cool.
Yeah, I totally agree. And I feel like there is a
there's a bootstrapping that's happening, which, you know,
this is a classic technology development cycle that we're going
through where, you know, you know, like Moore's law for many years
was powered by semiconductors. Right. So like you needed to have
better semiconductors in order to build the machines to build the
next generation of semiconductor. And I think that we're
seeing that with AI, right? You know, one of the
most interesting things that I'm excited about doing with
our foundation models is using them to understand our data set,
synthesize new data sets, train much smarter next gen foundation
models, because I do think there's a loop that's happening. Yeah, yeah, for sure. And I think we maybe see the early stages of that by using the models as, like, data filters or augmenters and stuff like that.
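(A minimal, hedged sketch of "models as data filters": score raw documents with an existing model and keep only the ones above a quality threshold for the next training run. `quality_score` is a hypothetical stand-in for a classifier head, a perplexity check, or a prompted judge model; this is not NVIDIA's actual pipeline.)

```python
def quality_score(model, document: str) -> float:
    """Return a 0-1 quality estimate for one document. (stand-in scorer)"""
    return model(document)

def filter_corpus(model, documents, threshold=0.7):
    """Keep only the documents the scoring model rates above the threshold."""
    kept = [doc for doc in documents if quality_score(model, doc) >= threshold]
    print(f"kept {len(kept)} of {len(documents)} documents")
    return kept
```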
Yeah. Very cool. All right. Well, we're running out of time. So I think maybe just
one more question. Like, what's something that
you're excited about that you think everybody else is
not excited about yet? Oh, that's a good one. So let's see. Well, I think I mentioned a little bit about, you know, how I think right now, so much
energy and so much effort is going into multimodal foundation
modeling, as it should. Right. Because, you know, it's clear multimodal models have sort of taken over as the default model family. Right. Like, I think in some time people will just always be bundling all this stuff in. And then soon we're going to add audio, and soon we're going to add video: all the tokens, just all the tokens, in one particular model. I think actions, like trajectories of behavior, are going to also be added into that.
base thing that can itself decide how it wants to allocate
its capacity to learn how to model all those things. So that's all great. I'm really,
really excited about that. But I think that's actually going to be the majority of, like, new advancements over the next couple of years. But there's these, like, domain
specific things that I am also very excited about, even
though I don't work on them. Like we just talked about robotics. But one of the ones one of
the projects that I that I helped fund at Google was
done by a friend Nal who and his team up in Amsterdam in Europe. But they what they did was
they trained a model basically to outperform the best scientific
simulators at weather prediction. And all they did was they just
turned the whole planet into these like small cells, each of which is
represented by a couple of numbers. That's like the current
precipitation level and humidity and temperature
and all these different things. And then and they just treated
it as like a like, OK, I now have this like tensor and
I just need to predict the tensor at the next time step,
the tensor at the next time step. And let's forget about any
physical modeling. And it turns out if you just
do this, you now have this sort of like universal like
like Earth model of these of these variables that actually
outperforms the physics simulator up until some amount of time. Like that's just so cool. There's so many other domains
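(A hedged sketch of the "just predict the next tensor" framing described here: the globe becomes a (channels, lat, lon) grid of variables and the model maps the grid at time t to the grid at t+1. The convolutional architecture, sizes, and toy data below are illustrative assumptions, not the actual weather model.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextStepWeatherModel(nn.Module):
    """Maps the gridded state of the atmosphere at time t to the state at t+1."""
    def __init__(self, channels=8):          # e.g., precipitation, humidity, temperature, ...
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, kernel_size=3, padding=1),
        )

    def forward(self, grid_t):                # grid_t: (batch, channels, lat, lon)
        return grid_t + self.net(grid_t)      # predict the residual to the next time step

# Training is plain regression against the observed next step; no physics engine involved.
model = NextStepWeatherModel()
grid_t = torch.randn(2, 8, 90, 180)           # toy resolution
grid_t_plus_1 = torch.randn(2, 8, 90, 180)    # would come from real reanalysis data
loss = F.mse_loss(model(grid_t), grid_t_plus_1)
```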
There are so many other domains where you could just literally say, hey, I have this infinitely flexible input-output engine; let me just model it and see what happens. But I'm curious what you think. Fantastic. Well, I wanted
to put in a plug again for virtual worlds and Omniverse. I think that we're going to find
that some of the most interesting experiences are going to come
through people interacting with AI. I think one of the questions I'm most curious about is, like, how is AI going to change our culture? And I think that it's going
to create a new form of media. You know, in the way that
video games were different from movies, AI is going to
be different from video games. It's going to be much more
useful, much more profound, much more interesting and
engaging and helpful. And I think that that's going
to happen in virtual worlds. And I think that virtual worlds
are going to make AI smarter. They're going to make AI more
grounded and understanding the problems that we face. And then we're going to work
together to solve problems with AI and virtual worlds. And, you know, for me, this is
kind of the synthesis of a lot of research that's been happening that
I've been watching at NVIDIA and elsewhere for the past 20 years. And I'm really excited to see where that's going next. I just love this framing of
like how will AI impact culture? So I think to me, that is also
like, by far one of the most important things. Like, I feel like anything you're working on hasn't really hit true utility
until it starts impacting culture. And we're already starting to
see the early days of that. But it's one of the things I love about Bryan: you're, like, a super well-rounded person. Like, we talk about AI on stage, but when we're not doing this, we talk about all manner of different things, and that sort of liberal-artsy bent you take on all of this is really cool. Oh, well, thank you. I mean, I
feel like it's good to be human. Awesome. Well, we have a few minutes for questions. I think there's
microphones in both aisles. Yeah, go ahead. Software partner for NVIDIA: let's think about some kind of Wintel duopoly; you know, I think two dominant players is still better than one. So I want to get Bryan's take on who will be the best software partner for NVIDIA. OK, so I think NVIDIA partners
with every software entity around the world. You know, we work with all of them. And so the answer is all of them are going to be successful. We're going to support all of them. You're a great diplomat, Bryan. Well, I mean, I'm also self-interested. It's one of the great joys about the work NVIDIA does, that we do support many, many different companies with many different perspectives. And, you know, in that way, as AI prospers, we get to kind of ride along. And so I don't think that NVIDIA would want to choose anyone as, like, the most important of our software partners. But we do love working with them. Hi, my name is Vijay. I'm from Dilation
Capital from New York. I have a question on scaling laws. You said that you have bet on scaling laws in a very successful manner for the last couple of decades. The human brain is estimated to have about hundreds of trillions of synapses. Given your conviction in scaling laws over the last two decades, how do you view these scaling laws working for the next decade? Are the hundreds of trillions of synapses sort of a high watermark, like where these models could eventually go? And what's the risk of overfitting these models? Thank you. I mean, on this
one, I feel like the whole parameter count thing is a
little bit of a megapixel war thing for the cameras. Right. It's like, hey, I
have 15 megapixels, but I have a shitty lens. I still have a bad camera. I feel like, ultimately, a better proxy is actually just the number of flops you've pushed through the model. In the near term, I think that's a better measure. But, like, I think that in the same
way that, you know, every scaling law is ultimately an S-curve. It's like, where are
we on the S-curve? But I think that, like, not only is there more to run on the pre-training S-curve, we have not even really started on what we were loosely talking about earlier, the post-training S-curve. And it's, like, waiting in the wings to take over for another huge amount of progress over the next while. So I personally am quite bullish
that we're going to continue to see predictable improvements
in progress due to compute and new ideas over the next 10 years. Yeah. And I would also want
to say that we still don't understand a lot about how
the human brain works. It's really complicated and
reducing it to a number is probably oversimplifying it quite a bit. I think there's a lot of baked
in specialization into the structure of the human brain that
means that we don't have to learn in the same way that our models
do, where our models literally start from random numbers. But humans, each of us, we start with a lot more knowledge that's baked into the structure of our brains. And I think that that's hard to
quantify and understand as well. What we're building with AI
is quite different. So I don't like comparing these
numbers because I don't think ultimately it tells us very much. Thank you. Hi, and wonderful chat. I'm glad I was able to attend. You guys mentioned, you know,
your big bet before was, like, you know, speed and everything else, and updates in terms of, not the system level, but, like, architectures. What is maybe your next big bet that you would make for the next 10 years? For example, I've been interested lately in neuro-symbolic architectures, world models. So, things like that, that maybe give some different form to, you know, algorithms for the future. So what can I buy a lottery
ticket for for the next? Well, I love I love world models. I mean, I think we're going to we
are seeing just amazing progress. And I was talking about
virtual worlds earlier because of because of that. I also I want to put in a big plug
for sparsity of different kinds. So I think that we are about
done playing out low precision arithmetic where we have crunched
it down, you know, quite a bit. We're running out of bits,
you know, so the way to get below one bit is to go sparse. And so I think we're going to
find that we do want to go sparse. We want more structure. Like I was saying earlier,
the human brain, there's I think there's a lot of knowledge
that's baked into the structure. It's not an all-to-all network. Right. And so how do we learn how to build sparsity into our networks so they can be dramatically more compute efficient? So, like, the intelligence per
flop can be increased. I think that's going
to be a big frontier for us. Plus one to world models, I
think if you frame pre-training correctly, and this is super hand-wavy, world models kind of pop out for free. And then the other one is on the architecture side. It's just whatever maps
better to hardware. I think so much of this is driven by hardware, by hardware cycles. Like, you could have such a clever architecture and just have it not run efficiently. And you'll never be able to scale a thing as well as someone doing something more vanilla that does map to the hardware. The bitter lesson again. Yeah. All right. So I have a
question about the pre-training data set and quality. So you were talking about how
quality for pre-training data sets is so important for knowledge. And if you have a bunch of
noise, it can really make the entire model diverge. But our current methods are
already super noisy. We have stuff like "at the next stop, take a left, a right, or go straight." And if the model, with our current next-token prediction, predicts "left" but the answer is "right," it's penalized as much as if it predicts "banana," something totally random.
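(The questioner's premise checks out: standard cross-entropy only looks at the probability the model assigned to the correct token, so "left" and "banana" incur the same loss when the target was "right." A toy three-token check, with made-up logits:)

```python
import torch
import torch.nn.functional as F

vocab = {"right": 0, "left": 1, "banana": 2}
target = torch.tensor([vocab["right"]])

logits_guess_left = torch.tensor([[0.0, 5.0, 0.0]])     # model puts its mass on "left"
logits_guess_banana = torch.tensor([[0.0, 0.0, 5.0]])   # model puts its mass on "banana"

# Both print the same value: the loss depends only on P(correct token).
print(F.cross_entropy(logits_guess_left, target))
print(F.cross_entropy(logits_guess_banana, target))
```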
So do you think, with our current next-token prediction, that we'll be able to achieve the kind of next levels of AGI or whatever? Or do you think we'll have to go with a different kind of optimization there? And if so, what kind? It's kind of like the lottery-ticket question again: what kind of beyond-next-token-prediction optimizations do you think we'll need? Well, first of all, like
David was just talking about how the thing that wins is
the thing that's easy and scalable and you can dump compute
into and next word prediction has that property, which is why
I think it's been so successful. So anything that comes along after
next word prediction, I think is going to share that property. But the second thing that I
wanted to say is it's hard to know what we should do beyond that next
word prediction because, you know, that these more intelligent, let's
say more symbolic approaches like penalizing the model specifically
in specific cases, they tend to run into the same problems that other approaches to AI have run into over the past 70 years, where the number of cases is just too enormous and we can't enumerate them. And when we try, we end
up messing it up. So the models don't actually
learn the right thing. And so that's one of the
strengths of next word prediction is that we can't mess it
up with our cleverness. But then the third thing
I want to say about next word prediction is that
it is tempting to reduce artificial intelligence to flops and loss functions and so forth. But we can do that biologically as well and just be like, does it make sense that intelligence would come from, like, amino acids and lipids? And, like, you know, the elements can be simple and the elaboration of those elements can be quite complex. And so I don't really feel like the simplicity of next word prediction disqualifies it any more than the simplicity of biochemistry disqualifies it. That's a pretty good analogy. I would also just say, like,
sometimes I hear people thinking about how architectures work as, you know, will architecture X be able to do Y? Will, like, training decision X be able to do Y, kind of thing? And the answer is always, well, as long as you haven't screwed it up, the answer is always yes. It's just how much compute. Right. And so, like, when we go evaluate whether or not any idea is good, we look at: does the new idea change either the slope or the intercept of the scaling laws? And usually the answer is never, like, OK, there's some straight-up discontinuity that only happens if you try some particular architectural idea. So I think as a result of that,
it's like I think many of these things have room for innovation. But I don't think that they're
like strictly necessary, actually, even for us to get to that
next level of intelligence. Thank you very much. Hi, I'm really enjoying this, and I
wanted to bring up a concern I have that dovetails off of you're
talking about exhausting the available tokens and the
multidimensionality of intelligence. I come from a education and
psychometrics and neuro and neuroscience background. Also did a little AGI back in
2005 long before it was cool. But anyway, as we've kind
of run out of the the time to do this, the text based stuff,
a lot of people are jumping on the hot buzzword is is synthetic data. And I see that as both an
opportunity and a potential real trap where where I know
from my work in education and human testing and other things like that,
that you can get yourself into this solipsistic loop where where you're
you're creating very artificial, ungrounded systems that where
where you're getting great results. But but they don't actually
mean anything as far as as far as intelligence and
problem solving goes. And I wanted you guys to
to talk about opportunities and dangers there. I just feel like synthetic data is
super useful as an augmentation, but it's a crutch because at
the end of the day, the underlying complexity of the generator
for synthetic data is usually, at least in my experience, and maybe
there's things I don't know about, but it's usually capped at some degree of model capacity. You kind of just end up modeling the generator and then you're kind of done. So the crutch runs out. So I'm curious what you think. Oh, absolutely. It's interesting. You know, in graphics, we've
been doing bottom up. Like, let's model every single
blade of grass and every light source and every photon and
like, bounce it all around. And we've kind of run into the limits there, right? Like, I think the future for graphics has to be a world model, because we've run out of ways
of enumerating everything else. And I think synthetic data is kind of a similar thing, where, just like you were saying, modeling the generator ends up becoming the same problem. And so I'm a big believer in synthetic data; we use it and it is important. And yet, to the extent that it's a trap, I think we're all aware of it and trying to make sure we don't fall in it.