Good morning, everybody. Welcome to the session. My name is Dieter Fox. I'm
chairing this session. And it's my great pleasure
and honor to introduce Vincent Vanhoucke. Vincent is a distinguished
scientist and senior director for robotics at Google DeepMind. Vincent started, actually,
the robotics effort in Google Brain a while ago. And he and his team have been
at the very forefront of deep learning for robotics. Vincent, in 2017, started the
Conference on Robot Learning, CoRL, which by now is established
as the key conference in anything related to learning in robotics. Vincent's team at Google and
Google DeepMind was the first, for example, that also showed that it's
possible to train robots at scale. You might have heard of the
Google Arm Farm, which was a large set of robots that
were then trained through, for example, teleoperation
and reinforcement learning, which are exactly the techniques that
we're seeing right now when we're talking, for example, about the
recent progress in humanoid robots. They were also the first ones
to really show how these large language models, generative models,
how they can be used for reasoning and even planning in robotics. And most recently, some of
you might have heard of the RT-X and RT-2 models, which were
the first that really showed that it's possible to train very
large models that go from vision and language, actually, all the way
to the control level of the robot. And this is, of course, the
holy grail for large pre-trained models that combine all these
notions of different modalities and use those modalities to directly
output controls for robots. And today, Vincent is going
to talk to us about robotics in the age of generative AI. Welcome. Thanks. Thanks, Dieter. Can you guys hear me OK? All right, fantastic. Cool. Well, thanks for the welcome. It's a special privilege for
me to be here at GTC. In a past life, before I started
working on robotics, I was one of the very first researchers at
Google to acquire a bunch of GPUs. It was like Kepler
generation at the time. I brought them in, put together
a GPU machine, and started training neural networks. This was very
clandestine at the time. Not a lot of people were doing it. It wasn't
a popular thing to do. We literally had to hide
the machine behind the copy machine so that people wouldn't
turn it off at night. So lo and behold, we ended up
launching what I believe is the first deep neural network trained
on GPU in production at scale, I think anywhere, really. That
was for Google Voice Search. And we also went on to lobby hard
for Google to acquire lots of GPUs and put them in their data centers. So Jensen, if you hear
this, I accept cash, credit card, Venmo, PayPal. Anyways, let's talk about robots. About two or three years ago,
if you haven't been living under a rock, you probably
saw that a big revolution happened in the world of AI. Large language models happened. We suddenly had capabilities
like common sense reasoning or understanding of the world
that had not been really available to us in the past. For us working in robotics and
embodied AI, it was a disaster. We were supposed to be the
next generation of AGI. We were supposed to be the
ones that were going to bring AI to reality, to the real world. So in the community of robotics,
there was a latent FOMO, fear of missing out, that was developing. And people were a little jealous of
suddenly this large language model, this language modeling community
sort of taking all the spotlight. So of course, if you can't
beat them, join them. And we started exploring what
large language models and robotics and embodied AI,
what the connections could be. This could have been a very
shallow kind of exploration. On the surface, the relationships
between language and robotics are really tenuous at best. You can imagine
talking to your robot. That's fine. You can imagine your robot
telling stories or composing poetry or whatnot. What happened was probably
the biggest surprise in my entire career. The connections turned out to
be extremely deep, so deep, in fact, that they forced us
to rethink all the foundations of how we do robotics and embodied AI. So I want to tell that story
today, because I think it's a really fascinating story
of, I want to say, disruption or thinking in a very different
way about very fundamental concepts that have built up an entire field. I'm not saying this is going
to be the end story of where robotics is going moving
forward, but it's a different path. It's a very different path
than the one we embarked on three years ago, and it's
barely recognizable. And so it opens up a lot of
new greenfield avenues for doing new research. The first thing we did, which
was popular at the time when ChatGPT came about, was
to try to trick the chatbot into being something that it wasn't. So I pretended that it was
a robot, described a little bit, in very coarse terms, what kind
of robot the chatbot was supposed to be, and asked it questions like,
how would I go about making coffee? What's interesting about this
is that it's both wonderful and not great at the same time. In some respects, the
chatbot can really understand what it means to make coffee. It has a good understanding
of the common sense knowledge of what it takes to make a coffee. It has some notion of how
a robot or an agent may go about making coffee. It knows to ask the right
questions, if you will. The downside is that it has no idea
about the environment you're in. It has no idea about any of
the capabilities of the robot. And so it's
disconnected from reality. So the first thing we tried
to do was to bridge that gap and make that connection. This is work called SayCan. The idea behind it is that we ask a large language model to propose solutions to a complex planning problem. And then, because our robots are trained using reinforcement learning, they have a model inside them that is able to score basically any query that you make to the robot against its own affordances, what it can do in the current context based on the observations it has around it. That's called a value function. The value function can rank essentially all the different hypotheses that the LLM provides. We make a decision about how to re-rank everything and then turn that into a plan. When you do this recursively, you can get a step-by-step plan for going from very high-level questions to really detailed semantics about how to operate the robot to accomplish the task. The interesting piece here is that this really lifts the problem of planning into the semantic world. Instead of doing planning as trying to avoid obstacles or something that is really geometric, suddenly the planning happens in semantic space, in a place where we as humans can really understand what's going on.
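To make the mechanics concrete, here is a minimal Python sketch of that scoring loop. It is illustrative only, not the actual SayCan implementation: llm_score, value_fn, get_observation, and the skill strings are hypothetical stand-ins for the real language model, the learned value function, the perception stack, and the robot's skill library.

```python
# Hypothetical sketch of SayCan-style skill selection (not the actual implementation).
# Each candidate skill gets two scores: how useful the LLM thinks it is as the next
# step of the plan, and how feasible the robot's value function says it is right now.

def select_next_skill(instruction, history, skills, llm_score, value_fn, observation):
    """Pick the skill with the highest combined language/affordance score."""
    best_skill, best_score = None, float("-inf")
    for skill in skills:
        # How likely this skill is a useful next step, according to the LLM.
        language_score = llm_score(instruction, history, skill)
        # How likely this skill can succeed from the current observation (RL value function).
        affordance_score = value_fn(observation, skill)
        combined = language_score * affordance_score
        if combined > best_score:
            best_skill, best_score = skill, combined
    return best_skill


def saycan_plan(instruction, skills, llm_score, value_fn, get_observation, max_steps=10):
    """Build a step-by-step plan by repeatedly re-scoring skills as the scene changes."""
    history = []
    for _ in range(max_steps):
        skill = select_next_skill(
            instruction, history, skills, llm_score, value_fn, get_observation()
        )
        if skill == "done()":  # assumes the skill library includes a terminating option
            break
        history.append(skill)
    return history
```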
So this is what it ended up looking like. You have a robot, you ask it a question. It has a perception system that enables it to score what it can see around it. It can pick objects. It knows that it can find objects. It can place objects. Those are all the affordances that the robot has. And the robot makes a decision about which one is the right one to go after. So let me take a little step back
because this is going to be kind of the scaffold of what comes next. This is where my roboticist
friends start really rolling their eyes because I'm giving
you a very, very sketchy sort of introduction to how a robot works. It's a lot more complicated than
this, but this is a model that actually serves our purpose here. A robot, roughly, is
a loop in which you perceive the world. You extract the state of the
world, essentially. You pass that state to a planner
that decides, based on the goal that you're trying to
accomplish, it's going to make a plan about what to do next. And once you have a plan for
what the robot should do, then you pass it to a controller
which actuates the robot and actually executes the movements. All of that is in a loop
because obviously the state of the world changes. So you re-perceive and you
re-plan and you re-actuate and you keep doing that at a relatively
high frequency so that you can adapt to the changes in the world.
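Here is a minimal sketch of that loop with the planner swapped out for an LLM. It is a sketch only: perceive, llm_plan, and execute are hypothetical placeholders for the real perception, planning, and control components, and the loop rate is arbitrary.

```python
# Minimal sketch of the classic sense-plan-act loop described above, with the
# planner replaced by an LLM that consumes and produces natural language.

import time

def control_loop(goal, perceive, llm_plan, execute, hz=10.0, max_steps=100):
    """Re-perceive, re-plan, and re-actuate at a fixed rate until the goal is done."""
    period = 1.0 / hz
    for _ in range(max_steps):
        start = time.time()
        state_description = perceive()                  # e.g. "a coke can is on the counter"
        next_step = llm_plan(goal, state_description)   # e.g. "pick up the coke can"
        if next_step == "done":
            break
        execute(next_step)                              # low-level controller actuates the robot
        # Keep the loop running at roughly the target frequency.
        time.sleep(max(0.0, period - (time.time() - start)))
```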
So what we did with SayCan here is just take the planning piece and replace it with an LLM. And that has interesting consequences, because suddenly this planner speaks natural language. You no longer have sort of code APIs between your planner and the perception and the actuation, for example. The consequence of that is
that having natural language as the sort of inner API inside
your robot is something that we can actually lean on and use even more. So one thing that was starting
to develop around that time was that perception using visual
language models was getting really, really good. The performance of those models
against sort of bespoke vision models was getting very good. And so the question is, can
we use those visual language models that directly speak natural
language already to control our robots and obtain the perception
information that's interesting? So this is something we started
researching and that became the concept of Socratic Models. This is the idea that you can
have multiple models, some that have specialized functions
like a vision model or an audio model, and then a large language
model that does the planning side. And you can have them basically
have a dialogue with each other in which they come up
with a consensus about the state of the world, about what to do
next, about what questions to ask. You can have the planner actually
ask questions to the vision language model to get a refined
perception for a specific part of the environment that it wants
really more information about. This dialogue turned out to
be very powerful. So we had a lot of follow-up
works that really sort of leveraged that concept of having
essentially a little chat room inside your robot where all the
models could speak to each other. So this next work is kind of
an evolution of SayCan: where in SayCan we just had a language
model that interacted with the robot's value function, with inner monologue we
had the human provide a goal for the robots on this chat room. We had the language model
sort of question what it would require to execute that plan. We had vision components that
would do both describing the scene and determining whether
a task was successful, for example. And so what that looks like
is literally a log on the robot of both the queries,
the actions that the robot tries to take, the reaction
to, in this case, somebody trying to disturb the state of the world. The robot fails. It can observe that
there is a failure. It can react to that failure. It can change its plan. You ask for a soda. Suddenly, the soda's gone. So is there any other soda in the room that I can go after? Completely replanned, completely changed.
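A rough sketch of that "chat room inside the robot" idea follows; describe_scene, detect_success, llm_next_step, and execute are hypothetical stand-ins for the scene describer, the success detector, the LLM planner, and the controller, and the real system is more involved.

```python
# Hedged sketch of the inner monologue pattern: every model reads from and writes to
# one shared natural-language log, so the planner can replan when something changes.

def inner_monologue(goal, describe_scene, detect_success, llm_next_step, execute,
                    max_steps=20):
    """Keep a running natural-language log that every model reads and appends to."""
    log = [f"Human: {goal}"]
    for _ in range(max_steps):
        log.append(f"Scene: {describe_scene()}")
        step = llm_next_step("\n".join(log))          # LLM proposes the next action
        if step == "done":
            break
        log.append(f"Robot action: {step}")
        execute(step)
        # The success detector writes its verdict back into the same log, so the
        # LLM can replan if, say, someone knocked the soda out of the gripper.
        outcome = "succeeded" if detect_success(step) else "failed"
        log.append(f"Outcome: {step} {outcome}")
    return log
```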
The nice thing about this is that it's completely readable, right? It's really something that is
very human-centric because you can follow along this conversation,
this inner conversation that the robot has and really understand
exactly what the robot is thinking, what its plans are,
and what potential sort of issues there may be with its perception
or with its capabilities. We went a little further
looking at what happens when the robot has a very
ambiguous scene in front of it. In this case, we asked the robot
to place a ball in the microwave. There are two balls. The robot has
no good way of deciding what to do. You can use conformal prediction
to sort of determine that there is a high level of ambiguity
in the plan, and go back to the human, the user, and ask for clarification so that the robot can disambiguate interactively.
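Here is a minimal sketch of that conformal-prediction-style ambiguity check, assuming we already have LLM scores for each candidate next step and a threshold qhat calibrated offline on held-out examples (the calibration step is omitted); it is illustrative, not the actual implementation.

```python
# Toy sketch of ambiguity detection over candidate plans via prediction sets.

import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_set(candidates, scores, qhat):
    """Keep every candidate whose confidence clears the calibrated threshold."""
    probs = softmax(scores)
    return [c for c, p in zip(candidates, probs) if p >= 1.0 - qhat]

def act_or_ask(candidates, scores, qhat, execute, ask_human):
    options = prediction_set(candidates, scores, qhat)
    if len(options) == 1:
        execute(options[0])               # unambiguous: just do it
    else:
        choice = ask_human(options)       # ambiguous: ask "which ball did you mean?"
        execute(choice)

# Example: two balls in the scene, instruction "place the ball in the microwave".
# act_or_ask(["pick blue ball", "pick orange ball"], [2.1, 2.0], qhat=0.8,
#            execute=print, ask_human=lambda opts: opts[0])
```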
Another thing we started doing is going after what happens when there is not really a precise goal. We're not just telling the robot
what to do exactly, but letting the language model decide on goals. This was in the context of a
data collection effort where we're trying to really expand
the diversity of experiences that the robots would get. We basically told the language
model that was running on those robots, explore, try to do things. Do interesting things. Do things that you don't know
if you can do, try and fail. Or even do things that maybe
you can't do and then you need to call a teleoperator
to actually accomplish the task or help you accomplish the tasks. What's really interesting
about this is that suddenly you have a robot that is defining
its own goals, and so you have to really think about safety. Those robots, initially, they would love to pick up and manipulate laptops. They were really enamored with them. A laptop was something that they could perceive very well, and ooh, this is very exciting. I'm going to go and
grab your laptop. So we had to tell them, all
right, in the prompt we had to say things like, don't
pick up electrical objects. Don't pick up
objects that are sharp. It was a kitchen, so there
were potentially knives. We removed the knives. But what's really interesting about this is that suddenly we have a way to go from very high-level concepts of safety, right? Don't bother humans, for example, or other very broad principles that can be explained in natural language, and plumb that all the way through to actually having robot behavior that matches. So this is kind of the idea of
constitutional AI that's being used in chatbots to sort of guide
the robots toward safe behaviors. We can do that all the way to having robots that actually follow general principles of safety, adding yet another layer of safety. Those robots have lots of
layers of safety in them. You can add one more layer
that is at the semantic level and enhances even the general
safety of the robot.
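As a toy illustration of that semantic safety layer, here is how such a plain-language "constitution" might be prepended to every planner query; the rules and helper are made up for this sketch, and this sits on top of, not instead of, the usual hardware and controller safety layers.

```python
# Illustrative only: prepend natural-language safety rules to every request
# before it reaches the LLM planner.

SAFETY_PREAMBLE = """You are controlling a mobile manipulator in a kitchen.
Rules you must never violate:
- Do not bother or approach humans unless asked.
- Do not pick up electrical objects such as laptops.
- Do not pick up sharp objects such as knives.
If a requested task would violate a rule, refuse and explain why."""

def build_planner_prompt(task, scene_description):
    """Combine the safety constitution, the scene, and the task into one prompt."""
    return (
        f"{SAFETY_PREAMBLE}\n\n"
        f"Current scene: {scene_description}\n"
        f"Task: {task}\n"
        f"Next step:"
    )

# Example:
# print(build_planner_prompt("tidy the counter", "a laptop, a mug, and an apple"))
```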
Okay, so we've LLM-ified, if you will, the planner, and we've
LLM-ified the vision. Now, obviously, you
know what's next. We're going to try and do the same
treatment to the actuation piece. A controller is really a
piece of code that controls the robots, right? And so writing code is
something that language models do very, very well. And so we experimented with
using code generation as the way to describe a controller. The first step in that direction
was what we call Code as Policies. This is the idea that you take
a large language model, you prompt it with both perception APIs and
control APIs, and let it decide how to use them based on the natural
language query that you're making. And this can be
very, very powerful. For example, this is a language
model that wrote an entire sort of small piece of code
that corresponds to "stack the blocks in the empty bowl." You'll notice that it uses, in green, a few perception APIs, very high-level ones in this case. It was largely a toy example. But it also produced some
functions in red that we didn't have an API for. It completely hallucinated those. It thought that it was very
interesting to have a stack objects function. But then you can recursively
go into the language model and ask it, what is stack objects? And then the language model
can recursively produce more and more detailed code that corresponds to the actual behaviors that you want, all the way down to an API level that is actually something that you can use. In this case, we had a pick
and place API for the robots that we could readily use. So this kind of recursive
application of code generation is very, very powerful at
bridging basically different levels of abstraction and going
down to the metal, essentially.
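Here is a hedged sketch of that recursive expansion: generate code for a command, find the helper functions the model invented that we have no API for, and recursively ask it to define those too until everything bottoms out in known primitives. llm_write_function and the pick-and-place API names are hypothetical; this is not the Code as Policies implementation.

```python
# Toy sketch of recursive code generation over a small set of known primitives.

import ast

KNOWN_APIS = {"detect_objects", "pick", "place"}   # perception + control primitives

def undefined_calls(source, defined):
    """Return names of functions called in `source` that are neither known nor defined."""
    tree = ast.parse(source)
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return called - KNOWN_APIS - defined

def generate_policy(command, llm_write_function, max_depth=5):
    """Recursively expand hallucinated helpers until only known APIs remain."""
    sources = {command: llm_write_function(command)}
    for _ in range(max_depth):
        missing = set()
        for src in sources.values():
            missing |= undefined_calls(src, set(sources))
        if not missing:
            break
        for name in missing:
            # Ask the LLM to define e.g. "stack_objects" in terms of lower-level calls.
            sources[name] = llm_write_function(f"define the function {name}")
    return "\n\n".join(sources.values())
```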
What this opens up as well is the idea that now that you have a natural language query
that can go all the way to the actuation, you can teach a robot
to do things in a non-expert way. So this is an example where,
you know, with a simple "move the apple to the cube," you have code that gets generated, and boom, it just rams into things. The user can say, yeah,
that wasn't good. Please don't knock over the can. And so suddenly there is a
code that gets generated that corresponds to that goal that
you have in mind. It's a reward function that
you add to your entire reward system and the robot learns
the better behavior. We've done that a number of ways. This is our little quadruped
here where we want to teach it to give a high five. It doesn't quite get it. You ask it to, you know, raise
the paw a little higher. Yeah, that's good. Now let's do it sitting. Oh, that's not really sitting. Tilt the other way then. Right, and that code is not
really obvious, right? Unless you actually know
what you're doing. You have to be an expert to do this. But now a non-expert can
really code all these behaviors directly on the robot. And I think that's a very important
piece: going from high-level semantics all the way down to code is really bringing something new to the table. One thing that I love about
this work is that you can take that dialogue that you're having
with the robots with the thumbs ups and thumbs down and fine tune your
model, obviously, to get to the desired behavior directly without
having to teach the robot again. You can bake that into the model. When you do that, you
get a better model. And that's pretty obvious. What you get as well is a model
that is also better at learning. Because you don't feed it just
the output of the dialogue. You feed it the entire
dialogue, including when your responses were wrong
or when the responses were right. And as a result, the model becomes
a better learner, basically. And we've seen that even on a wide variety of tasks that were not seen during training, the model ended up being a better learner: we could basically, with fewer turns, teach the robots new behaviors. This, by the way, is all
enabled by having a really, really fast simulation. So we have an open source
simulator called MuJoCo. We recently released
the third rev of it. In particular, what's in there that is relevant is a JAX implementation of MuJoCo, MJX, that runs in parallel and can do very broad sweeps of different behaviors on GPU really, really fast. We also have integrated into the
simulator an MPC implementation, which enables you to synthesize
behaviors based on rewards very quickly and experiment with,
you know, reward shaping, and see the results of that in real time. So this is a very powerful paradigm, to have a simulator in the loop of developing your skills and behaviors.
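As a sketch of what that looks like in practice, the snippet below follows the MJX pattern for stepping thousands of simulations in parallel on a GPU, roughly as in the MuJoCo 3.x tutorials; exact API names may differ between versions, and "robot.xml" is a placeholder model file.

```python
# Hedged MJX sketch: batch many simulations with jax.vmap and step them on the GPU.

import jax
import mujoco
from mujoco import mjx

mj_model = mujoco.MjModel.from_xml_path("robot.xml")   # placeholder model path
mj_data = mujoco.MjData(mj_model)

# Move the model and data onto the accelerator.
mjx_model = mjx.put_model(mj_model)
mjx_data = mjx.put_data(mj_model, mj_data)

# Create a batch of initial states with slightly perturbed joint positions,
# then step all of them in lockstep with vmap + jit.
rngs = jax.random.split(jax.random.PRNGKey(0), 4096)
batch = jax.vmap(
    lambda rng: mjx_data.replace(
        qpos=mjx_data.qpos + 0.01 * jax.random.normal(rng, (mj_model.nq,))
    )
)(rngs)

step_fn = jax.jit(jax.vmap(mjx.step, in_axes=(None, 0)))
for _ in range(100):
    batch = step_fn(mjx_model, batch)

# batch.qpos now holds the final joint positions of 4096 rollouts, which could be
# scored against a candidate reward to compare different reward shapings quickly.
print(batch.qpos.shape)
```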
Okay, so we've LLM-ified everything. Are we done? There are some
weaknesses to this model. It's very nice to have something
interpretable as the core component of your robot where you
can have a dialogue and really see what the state of your robot is. But there are some limits: sometimes you probably want a much higher bandwidth connection between, for example, your planner and your perception. Summarizing visual context in words can be very convoluted and is not really suitable for sort of precision work, for example. So one thing we tried next
was, hey, you know, it's all language models. It's all big neural networks. Let's just try and see if
we can fuse them. So the first fusion
experiment was to fuse perception and planning. This was a work we called PaLM-E. You know, multimodal
language models are commonplace now, but that was one of the
very early experiments doing this. We took PaLM as the language model, and we added a vision encoder, co-training them together so that you could include image tokens or embeddings in the input sequence seamlessly, training everything together on a variety of data such as visual Q&A and robot control tasks. So specifically, we had done
this for training on robot control plans similar to SayCan, and that worked very well. We saw a really high number of tasks that we could perform directly from vision to plans. Again, the output of this is all
just natural language, so very interpretable, but now the language
model has eyes that can actually see the intricacies of the visual
scene that it's operating on. What was interesting about PaLM-E is that this was the first time we saw a model that was really trained for robotics but still worked extremely well at all the tasks that you can imagine a multimodal model wanting to do. So it could do visual Q&A, it
could do captioning. It didn't lose any performance
in terms of reasoning. In fact, that model was fine-tuned
on medical data later on by another team and became Med-PaLM M, the multimodal medical model that was state-of-the-art at the time. I don't know if it still is,
but the idea that you can take a robotics model and
turn it into a state-of-the-art medical recognition model
is fascinating, right? I think this kind of power of just
very large models being able to be retargeted is really interesting
for the industry at large. Another thing that was new
with PaLM-E was that for the first time, we saw positive
transfer across robots. So this is something that's worth
a little bit of an explanation. Typically, you have different
robots, different action spaces, different points of view. You would imagine that when
you train a model, you want to fine-tune it on the embodiments
that you will eventually be deploying your model for,
and that fine-tuning is going to give you the best results. What we saw with PaLM-E is that actually training on all the robotics data that we had, even if it was very different data, even if it was barely robotics-like, like visual Q&A, which is visual planning but not for a robot, when you put that all together, you end up with a model that works much better. And that
was something that, in the past, we didn't see much of in robotics. There was rarely a generalized
model that was better than specialized models. Here it held, and we'll see later that there is a lot more to this, and that it is very interesting to pursue. Once you have a vision
language model, you can do lots of fun things. So this is an example that I
wanted to highlight because I think this is something that's
gonna be important in the future. This is an early experiment
in using a vision language model that also can generate
video, and you can imagine using a video generation as
a way to dream up possible futures. So in this case, we have a
planner where when it's confronted with multiple actions that
it can take, instead of evaluating them on the spot, it will
actually generate a small video snippet of what would happen to the environment if it took that action, and then we score the output
and ask ourselves, is the output of that little snippet of video
closer to the goal, or is it not? And that's how we select
which action to take.
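A hedged sketch of that "dream up possible futures" planner: for each candidate action, generate a short clip of the imagined outcome and score it against the goal. generate_video, embed_image, and the cosine-similarity goal score are hypothetical stand-ins for the real video and scoring models.

```python
# Toy sketch of using a video generation model as a world model for action selection.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def plan_by_dreaming(current_frame, goal_image, candidate_actions,
                     generate_video, embed_image):
    """Return the action whose imagined rollout ends closest to the goal image."""
    goal_embedding = embed_image(goal_image)
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        # Imagine a few seconds of the future conditioned on taking this action.
        imagined_frames = generate_video(current_frame, action)
        outcome_embedding = embed_image(imagined_frames[-1])
        score = cosine(outcome_embedding, goal_embedding)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```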
I think this kind of world model-type approach to planning and to actuation is very likely
to develop as video models get better and have better fidelity
in terms of physics and in terms of geometry and things like this. So I'm really excited about
this general line of work. Okay, so we've connected
vision and language. Can we do, you know, let's
ignore planning for a second. Can we do directly
pixel-to-actions? This is another line of work
where we basically wanted to have pixel-to-action models
that, you know, sort of used all the modern toolkit of
transformers and things like that. Our first work in that
direction was RT-1. RT-1 is basically an end-to-end
model that takes instructions, tokenizes them, takes images,
tokenizes them, throws that all into one big transformer
that's trained end-to-end and outputs actions that are
directly, you know, controls that the robot can execute. It's a big model, but we can
run it at, you know, three hertz so it's actually something
we can, you know, manage for the kind of tasks that we
care about, which are picking and placing and things like that. RT-1 really worked well. And that was kind of a big
aha moment for us in the sense that in the past, even for
simple sort of pick-and-place, generalized pick-and-place kind
of tasks, we could never really saturate on the training tasks. Like, we could throw as much data
as we wanted onto the models that we were training using behavior
cloning, and we could never get to, you know, 100 % performance. For the first time with RT1, we
really saturated the performance on the training set. Not on the training set, on
the training tasks. And that's important because
you want to be sure that, you know, if you're in the asymptopia
of lots of data, you should be able to completely nail the training
setup that you're focusing on. In addition, we got
better generalization. So better generalization
to unseen tasks, distractors, and backgrounds. So that's a good foundation
for something to build on. Another thing that we learned
from the RT-1 experiment was that not all data is really equal. One of the experiments we did was a simple ablation experiment where we took out a little bit of data from the training set, not a lot. So the total amount of data
was about the same, but we took out the most diverse
data, the data that was the most different from everything else. And the performance just plummeted. What's important about this is
that if really data diversity is a key to those kind of action models,
we're doing everything wrong. If you think of how, you know,
grad students work on problems in robotics labs, they typically
have one problem they're trying to solve and they're, you
know, focusing on collecting data for that task and training a
better architecture for that task. What we're saying here is that
maybe thinking about a single task is already shooting yourself in
the foot, that you should really be operating in the context of having
a very large multitask model. And thinking about architectures
in that context really changes the game in terms of how well
those models are going to do. So some interesting lessons,
I think, for the community at large, that, you know, multitask
is not just a sub-problem. It really is the problem,
and it's probably one way that we're going to solve
for better models in general. Okay, so by now you can picture
where this is going. We've fused two pieces. We've fused two other pieces. We're going to try and see what we can do with just one giant model, right? So that work is RT-2. RT-2 is basically a very large
vision language model that has all the capabilities of a very
large LLM, so it can do reasoning. It can also do, you know, visual
Q&A and things like this. And the way we approach
this is to really think about the robot actions
as just another language, right? The VLMs are multilingual. They can speak all the languages
that they're trained on. We're just going to add one
more language, and that happens to correspond to robot action
and treat it as such. So the architecture is very
similar to RT-1, except it's a much bigger model. You input language tokens. You input image tokens, and then you output tokens that correspond to, you know, robotese, if you will, robot actions.
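To illustrate what "robot actions as just another language" can look like, here is a simplified tokenization sketch where each continuous action dimension is discretized into 256 bins so the model can emit actions as token sequences; the bin count and the 7-dimensional action layout are illustrative, not the exact RT-2 recipe.

```python
# Toy sketch of discretizing continuous robot actions into tokens and back.

import numpy as np

NUM_BINS = 256

def action_to_tokens(action, low, high):
    """Map a continuous action vector (e.g. arm deltas + gripper) to integer tokens."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)           # -> [0, 1] per dimension
    return (normalized * (NUM_BINS - 1)).round().astype(int)

def tokens_to_action(tokens, low, high):
    """Invert the mapping so the robot can execute what the model emitted."""
    normalized = tokens.astype(float) / (NUM_BINS - 1)
    return low + normalized * (high - low)

# Example: a 7-dimensional action with each dimension in [-1, 1].
low, high = np.full(7, -1.0), np.full(7, 1.0)
tokens = action_to_tokens(np.array([0.1, -0.5, 0.0, 0.3, 0.0, 0.0, 1.0]), low, high)
print(tokens, tokens_to_action(tokens, low, high))
```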
When you do this, interesting things happen. You suddenly have an end-to-end
model that goes from semantics and visual recognition all
the way down to action, so you can express very rich commands. You can say, pick the
nearly falling bag. You can say, pick the object
that is different, right? And all of that sort of subtle,
high-level understanding of what it means to be different,
what it means to be falling, is incorporated in the VLM and
passed on to the actual actuation. So I'll give you two examples
that I really like of those kinds of behaviors. This was an example of, we
asked the robot to move the Coke can to Taylor Swift. Our robot has seen a
lot of Coke cans. We love Coke cans. They're, you know, our bread and
butter objects to manipulate. But our robot has never seen
Taylor Swift, or doesn't know what Taylor Swift looks like. We don't have any robot
data that corresponds to, you know, Taylor Swift. The VLM does, right? And so the robot is able to
understand, you know, the concept of Taylor Swift and move the Coke
can to the picture of Taylor Swift. It can also do this with reasoning. So basic reasoning, move
the banana to the sum of two plus three, right? So that means the robot needs
to understand what a three looks like visually. It needs to do basic
computation, right? Two plus three. That's something
that the large language model hopefully knows how to do. But we've never really taught
the robot specifically how to do sums, right? It's all part of the overall model. So you see this transfer between
the semantics, the vision and the actuation, all working
together to produce something that, you know (I know "emergent" as a term is being kind of overused), feels emergent in the
sense that all of those things sort of gel together in one unified way. Another thing we saw with RT-2 is
that we're just at the beginning of getting things to work. If you think of scaling laws
for language models, there is a similar scaling law for sort of
a robotic foundation model, where as we get much bigger models, things get much better. And I don't think we're anywhere
close to saturating on performance based on the scale that we're at. So it's problematic in a
number of ways, right? Those big models are really slow. And so, you know, having a controller that runs at this kind of speed using inference on big models is not easy, but at least there
is a path there that would enable us to scale up and get better. Another form of scaling is
scaling across robots. So remember when I was talking
about PaLM-E, I said we saw positive transfer between robots. We did other experiments such as this one, RoboCat, where we trained joint models. This was an RT-1-style model, but with some reinforcement learning on top of it, and we trained a joint model across different robots with different action spaces, different
degrees of freedom, and very different settings, if you will. And again, we saw that even
for action models, we could get much better performance
out of training a joint model. It's a little bit like saying
that the different robots just speak different dialects of robotese, in a sense. It's not that they're formally that different, it's just different expressions, through the embodiment, of very, very common concepts. And by adding the data
together, we can actually get much better understanding
of the physics of what it means to control a robot. So we tried to push
this to the extreme. We partnered with 34 different
research labs and asked everyone to pool their data together. And obviously there is a huge
amount of diversity of research that's happening in robot
learning in the world. You have a ton of different
embodiments, different tasks, different data sets. We just pooled everything together. We didn't even try to
normalize any of it.
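As a toy illustration of pooling that kind of heterogeneous data, here is a sketch of a weighted mixture over per-lab datasets, each keeping its own embodiment-specific action format; the dataset names, weights, and record layout are made up, and the real pipeline is far more involved.

```python
# Toy sketch of mixing heterogeneous robot datasets into one training stream.

import random

def mixture_stream(datasets, weights, seed=0):
    """Yield (dataset_name, episode) pairs sampled according to the mixture weights."""
    rng = random.Random(seed)
    names = list(datasets)
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        episode = rng.choice(datasets[name])
        yield name, episode

# Hypothetical pooled corpus: different labs, embodiments, and action spaces.
datasets = {
    "lab_a_arm_pick_place": [{"obs": "...", "action": [0.1, 0.0]}],
    "lab_b_mobile_manipulator": [{"obs": "...", "action": [0.0, 0.2, 0.1]}],
    "lab_c_bimanual": [{"obs": "...", "action": [0.0] * 14}],
}
stream = mixture_stream(datasets, weights=[0.5, 0.3, 0.2])
for _ in range(3):
    print(next(stream)[0])
```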
So just to give you an example, this is what some of the data looks like. It looks completely random, right? You would be like, how can
we learn something out of this kind of diverse data? Turns out we can. And that was also an interesting
learning: we basically pooled all the data together, trained a big RT-1 model, shipped it to all our partners in
universities, and they were able to improve on their baselines
using this model, zero shot. And it was as fair an experiment
as can be because we just shipped them the weights and let them
run the experiments themselves. So we didn't have a hand in
doing the experiments. So this is very exciting. I think the idea that,
fundamentally, cross embodiment really works and works to
an extreme degree sort of opens up the possibility of really building
models for everyone, right? And it also enables the models not
to be locked into a specific form factor and specific embodiments. And that really has a profound
impact in how people think, I think, of what it means to share
data in robotics, what it means to leverage the community and build
something that is bigger and more impactful as a collective effort. We also trained an RT-2 version
of the model and also saw that those emergent skills that I was
discussing before get better when we add more of this diverse data. So there is really a strong signal
there that these kinds of large foundation models for robotics can really sort of improve the state of the art by a significant amount. So I want to sort of step
back and reflect a little bit on where we are at, right? So we have this kind of unified
model that takes vision as an input, reasons using a language model, and produces outputs that correspond to actions. It's just a large multimodal
model at the end of the day. It's nothing really specific
to robotics in that way. We train it on some robot
data, not a ton, right? It's not internet scale amounts
of data, but some amount of data. A lot of the heavy lifting
there is done by the text data that's from the web,
the image data that's from the web. The actions that we take are
a form of language. They're just dialects of robotese. And this picture is really,
really strange to us in the robotics community. If you'd asked me like three
years ago that this is what robotics would end up looking
like or what a possible future for robotics would look like, I
would have called you crazy, right? Back in the day, and by that I mean three years ago, we were really focused on reinforcement learning, on sort of learning approaches that were very bespoke to specific robots, that were using lots and lots of data. So the shape of things has
really materially changed. This is still research, so we
haven't completely validated this in the real world, but
this is a completely greenfield potential new path for robotics. And the thing that is very exciting
about this is that it really enables robotics to ride the AI
wave that we're seeing here, right? Any improvements to large language models, any improvements to multimodal models, to video generation, we'll be able to use and leverage in robotics. So we're no longer kind of
on a little AI island. We're really part of the entire
AI community and we can really benefit from the entire sum
of all the advances that are happening in the world. Okay, with that, I
wanna thank everybody who contributed to this research. It's the effort of a very
large team of very talented people and I'm very grateful
to be a small part of it. And thank you for everyone here. Thank you so much for a
fascinating talk. We have time for some questions. If people just wanna walk
up to the microphone there, that's on the aisle. Let's go ahead. One second, it's not working yet? Okay, thank you. So I think by introducing
the large language model into robotics, and given the uncertainty in the language, how can you guarantee, for example, some safety? Right now you only have a robotic arm, but in the future you may have humanoids. How can you solve the problem of, for example, miscommunication between the operator and the user, or maybe just a bad intention from the operator, so that you can tell the robot, okay, do not hurt someone, do not do bad or wrong things? Yeah, thank you. So the
language model view of things does not remove any of the
safety layers that you need to sort of really think about when
you're deploying a robot, right? The safety approach in robotics
is really defense in depth, right? You go from the low level of
making sure your actuators are safe, that your controllers
are robust, that you have a big red button on the back of your robot if
things go wrong, or all sorts of different components, right? The large language model only
adds one layer to it, right? I think the idea that you can
add semantic safety by telling the robot don't hurt people as a
preamble to anything that you want it to do only adds to that zoo. It doesn't substitute
for those other things. There is also the question
of, you know, those large models that are notorious
for hallucinating things, right? And that's a general problem. What we are seeing is that when
we're grounding them in the real world by giving them observations
of the actual real world that's in front of them, all those
hallucinations really go down. I don't wanna say that we're
eliminating all the risks associated with them, but
the idea is that if, over and over, you see, you know, a cup of tea on the table, you will not imagine that there is, you know, an object that doesn't exist there, because you have this reinforcement of reality sort of grounding the model, which you don't have when you're just, you know, in abstract sort of internet space. So tons of new avenues to
do safety research. And I think this is very exciting
from that standpoint as well. Thank you. Hello, thank you for sharing
your very insightful research. I was thinking when humans try to
pick up an object, for example, and we have our inner monologue, we
don't think in numbers like, I need to use this motor torque or I need
to go to these exact coordinates. And when I try to experiment with
LLMs in my own research, I realized that it works maybe as well
as humans would be able to do that. Sometimes they get the coordinates
right onto placing it, but sometimes they make simple
arithmetic mistakes, even very strong large language,
like very powerful large language models still struggle
with this mix of spatial and arithmetic understanding
and Cartesian coordinates. And I saw on one of your
slides that the LLM is actually outputting coordinates if
I understood it right. How do you solve this task? How do you bridge
these two domains? So there is a lot of research that
needs to happen in this space. I think the vanilla vision language
model is not particularly good at geometry and spatial reasoning. And that's a problem. And that's merely a reflection of
what kind of data and what kind of task it's being trained on, right? A lot of the data
is about semantics. It's about broad
description of scenes. Getting to really precise, measurable geometry, I think, will take some more work. And I think a lot of people are
really aware that this is kind of a shortcoming of the models. I think this can be
solved with data. We can really sort of, if we change
the way we train models, we can really improve on that quite a bit. Another thing that's interesting
about the way we do this is that we do this in closed loop, right? So it's all visual servoing
essentially, right? You reason about the relationship
between the gripper and the object if you're trying to pick something
and you get feedback, visual feedback at every step of the way. And so that feedback is actually
a really important signal that is being leveraged in this. You don't need to get the absolute
coordinates right all the time. In fact, you don't know what
the absolute coordinates will be because your robot moves
and the frame of reference changes and the world changes. So it's all relative and it's all
repeated and adaptive, if you will.
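A rough sketch of that closed-loop, relative style of control: at every step, re-observe the gripper and the target and command a small relative correction rather than an absolute coordinate. observe and move_relative are hypothetical stand-ins for the perception and control interfaces.

```python
# Toy sketch of visual-servoing-style relative control with fresh feedback each step.

import numpy as np

def visual_servo(observe, move_relative, gain=0.5, tol=0.01, max_steps=200):
    """Repeatedly nudge the gripper toward the target using fresh visual feedback."""
    for _ in range(max_steps):
        gripper_xyz, target_xyz = observe()        # re-perceive every iteration
        error = np.asarray(target_xyz) - np.asarray(gripper_xyz)
        if np.linalg.norm(error) < tol:            # close enough: stop
            return True
        move_relative(gain * error)                # small relative correction
    return False
```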
Thank you. I'd like to just briefly hijack the microphone to this side. I have one online question
and then one question for the gentleman here. And if there's more questions
also in the follow up, of course people should come up
to the front and then we can have a longer discussion afterwards. From online and that's related
to the code generation, the question is, have you attempted
to not just recursively generate code but iteratively test code by
deploying it on the robot and passing error results back to the LLM? Yeah, so we deploy it typically
in simulation, right? One of the benefits is that
once you have a piece of code, you can quickly run it on the simulated
robot and see what the outcome is, whether it makes sense or whether the
code even compiles or runs. And then once you have some
guarantees that the code is correct, you can push
it directly to the robot. And typically if you have a
reasonable sim-to-real transfer, you can have some guarantees there. But prompting code to
be correct all the time, to be effective all the time,
that's a problem that the entire AI community is thinking about. And so I'm hoping we'll get
a lot better models in that respect in the future. Yeah. Thank you. Thank you, Vincent. Thank you for the talk. So what are the most exciting
applications of AI powered robotics that you foresee
emerging in the near future? In your imagination, both
in industry and everyday life? I think the most interesting thing
is being able to retask a robot to do anything you want it to do using
very little actual knowledge about the workings of the robot. If you can just prompt a robot
using natural language to do something and then retask
it to do something else, suddenly this opens up, you don't have
to do systems integration for every single problem that
you have your robot do potentially. That's a dream, right? We're far from it. But the cost right now in deploying
robotics is really about sort of the bespoke system integration
that's required to do there. If we can simplify that pipeline
and make it usable so that the people who are on the ground
using those robots in factories, in logistics, can actually sort
of tell the robot what they want and just have the robot do it, that
would open up a lot of the space. All right, I'm afraid this
was the last question we can handle at this moment. So thank you, Vincent, again. And people, if they have questions,
they should come up to the front and talk to me. Thank you. Thank you. Thank you for joining this session. Please remember to fill out the
session survey in the GTC app for a chance to win a $50 gift card. If you are staying in the room for
the next session, please remain in your seat and have your badge
ready to be scanned by our team.