Robotics in the Age of Generative AI with Vincent Vanhoucke, Google DeepMind | NVIDIA GTC 2024

Captions
Good morning, everybody. Welcome to the session. My name is Dieter Fox, and I'm chairing this session. It's my great pleasure and honor to introduce Vincent Vanhoucke. Vincent is a distinguished scientist and senior director for robotics at Google DeepMind. Vincent actually started the robotics effort in Google Brain a while ago, and he and his team have been at the very forefront of deep learning for robotics. In 2017, Vincent started the Conference on Robot Learning, CoRL, which by now is established as the key conference for anything related to learning in robotics. Vincent's team at Google and Google DeepMind was also the first, for example, to show that it's possible to train robots at scale. You might have heard of the Google arm farm, a large set of robots that were trained through, for example, teleoperation and reinforcement learning, which are exactly the techniques we're seeing right now when we talk about the recent progress in humanoid robots. They were also the first to really show how large language models and generative models can be used for reasoning and even planning in robotics. And most recently, some of you might have heard of the RT-X and RT-2 models, the first to really show that it's possible to train very large models that go from vision and language all the way to the control level of the robot. This is, of course, the holy grail for large pre-trained models: combining all these different modalities and using them to directly output controls for robots. Today, Vincent is going to talk to us about robotics in the age of generative AI. Welcome.

Thanks. Thanks, Dieter. Can you guys hear me OK? All right, fantastic. Cool. Well, thanks for the welcome. It's a special privilege for me to be here at GTC. In a past life, before I started working on robotics, I was one of the very first researchers at Google to acquire a bunch of GPUs — it was the Kepler generation at the time — bring them in, put together a GPU machine, and start training neural networks. This was very clandestine at the time; it wasn't a popular thing to do. We literally had to hide the machine behind the copy machine so that people wouldn't turn it off at night. Lo and behold, we ended up launching what I believe was the first deep neural network trained on GPUs to run in production at scale anywhere, really — that was for Google Voice Search. We also went on to lobby hard for Google to acquire lots of GPUs and put them in its data centers. So Jensen, if you hear this, I accept cash, credit card, Venmo, PayPal.

Anyway, let's talk about robots. About two or three years ago, unless you've been living under a rock, you probably saw that a big revolution happened in the world of AI. Large language models happened. We suddenly had capabilities like common-sense reasoning and understanding of the world that had not really been available to us in the past. For those of us working in robotics and embodied AI, it was a disaster. We were supposed to be the next generation of AGI. We were supposed to be the ones who would bring AI to reality, to the real world. So in the robotics community there was a latent FOMO, a fear of missing out, developing, and people were a little jealous of the language modeling community suddenly taking all the spotlight. So of course, if you can't beat them, join them.
And we started exploring what the connections between large language models and robotics and embodied AI could be. This could have been a very shallow kind of exploration. On the surface, the relationship between language and robotics is really tenuous. You can imagine talking to your robot; that's fine. You can imagine your robot telling stories or composing poetry or whatnot. What happened was probably the biggest surprise of my entire career. The connections turned out to be extremely deep — so deep, in fact, that they forced us to rethink all the foundations of how we do robotics and embodied AI. So I want to tell that story today, because I think it's a really fascinating story of disruption, of thinking in a very different way about very fundamental concepts that an entire field has been built on. I'm not saying this is going to be the end story of where robotics goes from here, but it's a different path — a very different path from the one we were embarked on three years ago, and it's barely recognizable. And so it opens up a lot of new greenfield avenues for new research.

The first thing we did — what was popular at the time ChatGPT came about — was to try to trick the chatbot into being something that it wasn't. So we pretended it was a robot, described in very coarse terms what kind of robot the chatbot was supposed to be, and asked it questions like: how would I go about making coffee? What's interesting about this is that it's both wonderful and not great at the same time. On one hand, the chatbot can really understand what it means to make coffee. It has a good grasp of the common-sense knowledge of what it takes to make a coffee, and some notion of how a robot or an agent might go about making coffee. It knows to ask the right questions, if you will. The downside is that it has no idea about the environment you're in and no idea about the capabilities of the robot, so it's disconnected from reality.

So the first thing we tried to do was to bridge that gap and make that connection. This is the work called SayCan. The idea behind it is that we ask a large language model to propose solutions to a complex planning problem. Our robots, because they're trained using reinforcement learning, have a model inside them that can score basically any query you make against the robot's own affordances — what it can do in the current context, based on the observations it has around it. That's called a value function. The value function can rank essentially all the different hypotheses that the LLM provides; we make a decision about how to re-rank everything and then turn that into a plan. When you do this recursively, you get a step-by-step plan that goes from a very high-level question down to really detailed semantics about how to operate the robot to accomplish the task. The interesting piece here is that this really lifts the problem of planning into the semantic world. Instead of planning as obstacle avoidance or something really geometric, suddenly the planning happens in semantic space, in a place where we as humans can really understand what's going on. So this is what it ended up looking like: you have a robot, you ask it a question, and it has perception that enables it to score what it can see around it. It can pick objects, it can find objects, it can place objects. Those are all the affordances that the robot has, and the robot makes a decision about which one is the right one to go after.
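To make the mechanics concrete, here is a rough, minimal sketch of this kind of SayCan-style selection. The two scoring functions are hypothetical placeholders, not the actual system: "usefulness" stands in for the LLM's probability that a skill is a sensible next step, and "affordance" stands in for the RL-trained value function's estimate that the skill can succeed right now.

    # Minimal SayCan-style step selection (illustrative sketch, not the released system).
    SKILLS = ["pick up the sponge", "bring the sponge to the user", "wipe the counter", "done"]

    def llm_usefulness(instruction, history, skill):
        # Placeholder: a real system uses the LLM's probability of `skill` being
        # the next step of a plan for `instruction`, given the steps taken so far.
        preferred = ["pick up the sponge", "wipe the counter", "bring the sponge to the user", "done"]
        return 1.0 / (1 + preferred.index(skill))

    def affordance_value(observation, skill):
        # Placeholder: a real system queries the learned value function
        # on the current camera observation.
        return 1.0 if skill in observation.get("feasible", []) else 0.0

    def next_skill(instruction, history, observation):
        # SayCan combines usefulness ("say") and feasibility ("can") multiplicatively.
        scores = {s: llm_usefulness(instruction, history, s) * affordance_value(observation, s)
                  for s in SKILLS}
        return max(scores, key=scores.get)

    obs = {"feasible": ["pick up the sponge", "done"]}
    print(next_skill("clean the spill on the counter", history=[], observation=obs))

The key design choice is that neither score decides alone: the language model supplies usefulness, the value function supplies feasibility, and their product picks the step; repeating the selection after each executed step yields the recursive plan described above.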
So let me take a little step back, because this is going to be the scaffold for what comes next. This is where my roboticist friends start really rolling their eyes, because I'm giving you a very, very sketchy introduction to how a robot works. It's a lot more complicated than this, but this model serves our purpose here. A robot, roughly, is a loop in which you perceive the world and extract the state of the world. You pass that state to a planner, which, based on the goal you're trying to accomplish, decides on a plan for what to do next. Once you have a plan for what the robot should do, you pass it to a controller, which actuates the robot and actually executes the movements. All of that runs in a loop, because obviously the state of the world changes, so you re-perceive, re-plan, and re-actuate, and you keep doing that at a relatively high frequency so that you can adapt to changes in the world.

What we did with SayCan was simply take the planning piece and replace it with an LLM. That has interesting consequences, because suddenly this planner speaks natural language. You no longer have code APIs between your planner and, for example, the perception and the actuation. The consequence is that having natural language as the inner API inside your robot is something we can lean on and use even more. One thing that was starting to develop around that time was that perception using vision-language models was getting really, really good. The performance of those models compared with bespoke vision models was getting very good. So the question is: can we use those vision-language models, which already speak natural language, to control our robots and obtain the perception information we're interested in?

This is something we started researching, and it became the concept of Socratic Models. The idea is that you can have multiple models — some with specialized functions, like a vision model or an audio model — and then a large language model that does the planning. You can have them basically hold a dialogue with each other, in which they come up with a consensus about the state of the world, about what to do next, about what questions to ask. The planner can actually ask the vision-language model questions to get refined perception for a specific part of the environment it wants more information about. This dialogue turned out to be very powerful, so we had a lot of follow-up work that leveraged the concept of having essentially a little chat room inside your robot where all the models can talk to each other.

The next work is an evolution of SayCan. In SayCan, we just had a language model interacting with the robot's value function. With Inner Monologue, we had the human provide a goal for the robot in this chat room, we had the language model question what it would require to execute that plan, and we had vision components that would both describe the scene and determine whether a task was successful, for example.
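A toy sketch of that "chat room inside the robot" loop might look like the following, with the three model calls stubbed out as placeholders for a vision-language model, an LLM planner, and a success detector; none of this is the actual system's code.

    # Toy sketch of a Socratic Models / Inner Monologue style loop: every component
    # reads and writes the same natural-language transcript. All model calls are
    # hypothetical placeholders.
    def describe_scene(image):
        return "a counter with a can of soda and a sponge"   # placeholder VLM

    def propose_next_step(transcript):
        return "pick up the soda can"                        # placeholder LLM planner

    def step_succeeded(image, step):
        return True                                          # placeholder success detector

    def run_episode(goal, get_image, execute, max_steps=5):
        transcript = f"Human: {goal}\n"
        for _ in range(max_steps):
            transcript += f"Scene: {describe_scene(get_image())}\n"
            step = propose_next_step(transcript)
            transcript += f"Robot: {step}\n"
            execute(step)
            ok = step_succeeded(get_image(), step)
            transcript += f"Success: {ok}\n"   # failures stay in the log, so the planner can re-plan
            if step == "done":
                break
        return transcript

    log = run_episode("bring me a soda", get_image=lambda: None, execute=lambda step: None)
    print(log)

The transcript itself is the interface: because every model appends and reads plain language, the same log is directly readable by a human, which is exactly the interpretability point made next.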
In practice, what that looks like is literally a log on the robot of the queries, the actions the robot tries to take, and its reactions — in this case, to somebody trying to disturb the state of the world. The robot fails, it can observe that there is a failure, it can react to that failure, and it can change its plan. You ask for a soda; suddenly the soda's gone — so is there any other soda in the room that I can go after? Completely re-planned, completely changed. The nice thing about this is that it's completely readable, right? It's really something very human-centric, because you can follow along with this inner conversation the robot is having and understand exactly what the robot is thinking, what its plans are, and what potential issues there may be with its perception or its capabilities.

We went a little further and looked at what happens when the robot faces a very ambiguous scene. In this case, we asked the robot to place a ball in the microwave, but there are two balls, so the robot has no good way of deciding what to do. You can use conformal prediction to determine that there is a high level of ambiguity in the plan, go back to the human user, and ask for clarification so that the robot can disambiguate interactively.

Another thing we started doing was looking at what happens when there is no precise goal — not just telling the robot exactly what to do, but letting the language model decide on goals. This was in the context of a data collection effort where we were trying to expand the diversity of experiences the robots would get. We basically told the language model running on those robots: explore, try to do things, do interesting things, do things you don't know whether you can do, try and fail — or even do things that maybe you can't do, and then call a teleoperator to accomplish the task or help you accomplish it. What's really interesting is that suddenly you have a robot defining its own goals, and so you really have to think about safety. Initially, those robots would love to pick up and manipulate laptops. They were really enamored with them: a laptop was something they could perceive very well, and — ooh, this is very exciting, I'm going to go grab your laptop. So in the prompt we had to say things like: don't pick up electrical objects, don't pick up objects that are sharp. It was a kitchen, so there were potentially knives; we removed the knives. But what's really interesting is that suddenly we have a way to go from very high-level concepts of safety — don't bother humans, for example, very broad principles that can be explained in natural language — and plumb them all the way through to robot behavior that matches. This is the idea of constitutional AI that's being used in chatbots, applied to guiding robots toward safe behaviors. We can go all the way to robots that actually follow general principles of safety and add yet another layer of safety. Those robots already have lots of layers of safety in them; you can add one more at the semantic level, which enhances the overall safety of the robot.
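As a rough illustration of the ambiguity check mentioned a moment ago, a conformal-prediction-style "ask for help" rule can be sketched as follows. The option scores and calibration numbers are made up, and the model that produces the scores is assumed rather than shown.

    # Sketch of using split conformal prediction to decide when to ask the human.
    import math

    def calibrate_threshold(calib_scores, calib_true_idx, eps=0.15):
        # Nonconformity = 1 - score of the correct option on held-out calibration data.
        n = len(calib_scores)
        nonconf = sorted(1.0 - s[i] for s, i in zip(calib_scores, calib_true_idx))
        k = min(n - 1, math.ceil((n + 1) * (1 - eps)) - 1)
        return nonconf[k]

    def prediction_set(option_scores, qhat):
        # Keep every option whose nonconformity is within the calibrated threshold.
        return [i for i, s in enumerate(option_scores) if 1.0 - s <= qhat]

    calib_scores = [[0.80, 0.10, 0.10], [0.70, 0.20, 0.10], [0.60, 0.30, 0.10], [0.45, 0.35, 0.20]]
    calib_true_idx = [0, 0, 0, 0]
    qhat = calibrate_threshold(calib_scores, calib_true_idx)

    # Test time: "place the ball in the microwave", but two balls are visible.
    options = ["place the blue ball", "place the red ball", "place the sponge"]
    scores = [0.48, 0.47, 0.05]            # made-up scores from a hypothetical LLM/VLM
    keep = prediction_set(scores, qhat)
    if len(keep) == 1:
        print("Executing:", options[keep[0]])
    else:
        print("Ambiguous, asking the human to choose among:", [options[i] for i in keep] or options)

When the calibrated prediction set contains more than one interpretation, the robot asks for clarification instead of guessing; when it contains exactly one, it proceeds.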
Okay, so we've LLM-ified, if you will, the planner, and we've LLM-ified the vision. Now, obviously, you know what's next: we're going to try to give the same treatment to the actuation piece. A controller is really a piece of code that controls the robot, and writing code is something that language models do very, very well. So we experimented with using code generation as the way to describe a controller.

The first step in that direction was what we call Code as Policies. The idea is that you prompt a large language model with both perception APIs and control APIs and let it decide how to use them based on the natural language query that you're making. And this can be very, very powerful. For example, here a language model wrote an entire small piece of code corresponding to "stack the blocks in the empty bowl." You'll notice that it uses, in green, a few perception APIs — very high level in this case; it was largely a toy example. But it also produced some functions, in red, that we didn't have an API for. It completely hallucinated those: it thought it would be very useful to have a stack_objects function. But then you can go back to the language model recursively and ask it, what is stack_objects? And the language model can recursively produce more and more detailed code corresponding to the actual behaviors that you want, all the way down to an API level that you can actually use. In this case, we had a pick-and-place API for the robot that we could readily use. So this kind of recursive application of code generation is very powerful at bridging different levels of abstraction and going down to the metal, essentially.

What this opens up as well is the idea that, now that a natural language query can go all the way to actuation, you can teach a robot to do things in a non-expert way. Here's an example: a simple "move the apple to the cube," code gets generated, and boom, it just rams into things. The user can say, yeah, that wasn't good — please don't knock over the can. And suddenly code gets generated that corresponds to the goal you have in mind: a reward function that gets added to your overall reward system, and the robot learns the better behavior. We've done that in a number of ways. This is our little quadruped here, where we want to teach it to give a high five. It doesn't quite get it. You ask it to raise the paw a little higher — yeah, that's good. Now let's do it sitting. Oh, that's not really sitting. Tilt the other way, then. And that code is not really obvious, right? Unless you actually know what you're doing, you have to be an expert to write it. But now a non-expert can really program all these behaviors directly on the robot. And I think that's a very important piece: going from high-level semantics all the way down to code really brings something new to the table.
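A toy sketch of that language-to-reward loop might look like the following: an LLM (stubbed out here) turns the user's correction into an extra reward term, which is simply added to the existing reward that the optimizer maximizes. The state fields and the stub are hypothetical, not the actual system's API.

    # Toy sketch of "reward from natural language feedback".
    import numpy as np

    def base_reward(state):
        # Original objective: get the apple close to the cube.
        return -float(np.linalg.norm(state["apple_pos"] - state["cube_pos"]))

    def write_reward_term(feedback):
        # Placeholder for the LLM call. For "please don't knock over the can",
        # it might emit a penalty on how far the can has moved from where it started.
        def term(state):
            return -5.0 * float(np.linalg.norm(state["can_pos"] - state["can_start_pos"]))
        return term

    reward_terms = [base_reward]
    reward_terms.append(write_reward_term("please don't knock over the can"))

    def total_reward(state):
        # The optimizer (RL or MPC over a fast simulator) maximizes this sum.
        return sum(term(state) for term in reward_terms)

    state = {
        "apple_pos": np.array([0.2, 0.0]), "cube_pos": np.array([0.5, 0.1]),
        "can_pos": np.array([0.31, 0.0]), "can_start_pos": np.array([0.3, 0.0]),
    }
    print(total_reward(state))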
One thing that I love about this work is that you can take that dialogue you're having with the robot — with the thumbs-ups and thumbs-downs — and fine-tune your model to get to the desired behavior directly, without having to teach the robot again. You can bake it into the model. When you do that, you get a better model; that's pretty obvious. But you also get a model that is better at learning, because you don't feed it just the output of the dialogue — you feed it the entire dialogue, including when its responses were wrong and when they were right. As a result, the model becomes a better learner. And we've seen that even on a wide variety of tasks that were not seen during training, the model ended up being a better learner: with fewer turns, we could teach the robot new behaviors.

This, by the way, is all enabled by having really, really fast simulation. We have an open-source simulator called MuJoCo, and we recently released the third major version of it. What's particularly relevant here is that we have a JAX implementation of MuJoCo that runs in parallel and can do very broad sweeps over different behaviors on GPU really, really fast. We've also integrated an MPC implementation into the simulator, which lets you synthesize behaviors from rewards very quickly, experiment with reward shaping, and see the results in real time. So this is a very powerful paradigm: having a simulator in the loop while you develop your skills and behaviors.

Okay, so we've LLM-ified everything. Are we done? There are some weaknesses to this model. It's very nice to have something interpretable as the core component of your robot, where you can have a dialogue and really see what the state of your robot is. But there are limits: sometimes you probably want a much higher-bandwidth connection between, for example, your planner and your perception. Summarizing visual context in words can be very convoluted, and it's not really suitable for precision work, for example. So the next thing we tried was: hey, it's all language models, it's all big neural networks — let's just see if we can fuse them.

The first fusion experiment was to fuse perception and planning. This was a work we called PaLM-E. Multimodal language models are commonplace now, but this was one of the very early experiments in doing this. We took PaLM as the language model and added a vision encoder, co-training them together so that you could seamlessly include image tokens — embeddings — in the token sequence of your input, and we trained everything jointly on a variety of data, such as visual Q&A and robot control tasks. Specifically, we trained on robot control plans similar to SayCan, and that worked very well: we saw a very high number of tasks that we could perform directly from vision to plans. Again, the output of this is all natural language, so it's very interpretable, but now the language model has eyes that can actually see the intricacies of the visual scene it's operating on.

What was interesting about PaLM-E is that this was the first time we saw a model really trained for robotics that still worked extremely well at all the tasks you'd want a multimodal model to do. It could do visual Q&A, it could do captioning, and it didn't lose any performance in terms of reasoning. In fact, that model was later fine-tuned on medical data by another team and became Med-PaLM M, the multimodal medical model that was state-of-the-art at the time. I don't know if it still is, but the idea that you can take a robotics model and turn it into a state-of-the-art medical model is fascinating, right? This kind of power of very large models being retargetable is really interesting for the industry at large.
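A toy sketch of that kind of multimodal input construction — image features projected into the language model's embedding space and spliced into the token sequence wherever an image placeholder appears — could look like the following. The shapes, vocabulary, and projection are illustrative only, not the actual PaLM-E code.

    # Toy sketch of interleaving image embeddings with text token embeddings.
    import numpy as np

    d_model = 512                                   # LM embedding width (illustrative)
    vocab = {"<img>": 0, "move": 1, "the": 2, "green": 3, "block": 4, "to": 5}
    token_embedding = np.random.randn(len(vocab), d_model) * 0.02

    def encode_image(image):
        # Stand-in for a vision encoder (e.g. a ViT): returns a few patch embeddings.
        return np.random.randn(4, 768)

    W_proj = np.random.randn(768, d_model) * 0.02   # learned projection in a real model

    def build_input(text_tokens, image):
        rows = []
        for tok in text_tokens:
            if tok == "<img>":
                rows.append(encode_image(image) @ W_proj)       # image embeddings, in place
            else:
                rows.append(token_embedding[vocab[tok]][None])  # ordinary text embedding
        return np.concatenate(rows, axis=0)  # (sequence_length, d_model), fed to the decoder

    seq = build_input(["<img>", "move", "the", "green", "block", "to", "<img>"], np.zeros((224, 224, 3)))
    print(seq.shape)   # the language model then decodes a plan in natural language from this sequence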
Another thing that was new with PaLM-E was that, for the first time, we saw positive transfer across robots. That's worth a little explanation. Typically you have different robots, different action spaces, different points of view. You would imagine that when you train a model, you want to fine-tune it on the embodiment you will eventually be deploying on, and that fine-tuning will give you the best results. What we saw with PaLM-E is that training on all the robotics data we had — even if it was very different data, even if it barely looked like robotics data; visual Q&A, for instance, is visual, but it's not for a robot — when you put it all together, you end up with a model that works much better. That was something we hadn't seen much of in robotics in the past: there was rarely a generalist model that was better than the specialized models. That finding held up, and we'll see later that there is a lot more to it, which is very interesting to pursue.

Once you have a vision-language model, you can do lots of fun things. This is an example I wanted to highlight because I think it's going to be important in the future. It's an early experiment in using a vision-language model that can also generate video, and you can imagine using video generation as a way to dream up possible futures. In this case, we have a planner which, when confronted with multiple actions it could take, instead of evaluating them on the spot, generates a small video snippet of what would happen to the environment if it took that action. Then we score the output and ask: is the outcome in that little snippet of video closer to the goal, or not? And that's how we select which action to take. I think this kind of world-model approach to planning and actuation is very likely to develop as video models get better and gain fidelity in terms of physics and geometry and things like that. So I'm really excited about this general line of work.
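A minimal sketch of that "imagine, then score" planner, with the video generator and goal scorer as hypothetical placeholders, might look like this:

    # Toy sketch of planning by dreaming up futures: for each candidate action,
    # a video model imagines the outcome, a scorer compares it to the goal,
    # and the best action wins. Both model calls are hypothetical placeholders.
    import random

    def generate_video(current_image, action):
        # Placeholder for a text/image-conditioned video model:
        # "show me the scene after the robot does `action`".
        return [current_image] * 8            # pretend: 8 imagined frames

    def goal_similarity(video_frames, goal_text):
        # Placeholder for a VLM-style scorer of "how close is this future to the goal?"
        return random.random()

    def choose_action(current_image, goal_text, candidate_actions):
        scored = {a: goal_similarity(generate_video(current_image, a), goal_text)
                  for a in candidate_actions}
        return max(scored, key=scored.get)

    action = choose_action(current_image=None,
                           goal_text="the drawer is open",
                           candidate_actions=["pull the handle", "push the drawer", "pick up the cup"])
    print("selected:", action)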
Okay, so we've connected vision and language. Let's ignore planning for a second: can we go directly from pixels to actions? This is another line of work where we basically wanted pixel-to-action models that use all the modern toolkit of transformers and the like. Our first work in that direction was RT-1. RT-1 is basically an end-to-end model that takes instructions and tokenizes them, takes images and tokenizes them, throws all of that into one big transformer trained end-to-end, and outputs actions that are directly controls the robot can execute. It's a big model, but we can run it at about three hertz, so it's manageable for the kinds of tasks we care about, which are picking and placing and things like that.

RT-1 really worked well, and that was a big aha moment for us, in the sense that in the past, even for simple, generalized pick-and-place tasks, we could never really saturate on the training tasks. We could throw as much data as we wanted at the models we were training with behavior cloning and never get to 100% performance. For the first time, with RT-1, we really saturated performance — not on the training set, on the training tasks. And that's important, because if you're in the asymptopia of lots of data, you should be able to completely nail the training setup you're focusing on. In addition, we got better generalization — to unseen tasks, distractors, and backgrounds. So that's a good foundation to build on.

Another thing we learned from the RT-1 experiments was that not all data is equal. One of the experiments we did was a simple ablation where we took a little bit of data out of the training set — not a lot, so the total amount of data stayed about the same — but we took out the most diverse data, the data that was most different from everything else. And the performance just plummeted. What's important about this is that if data diversity really is the key to these kinds of action models, we're doing everything wrong. If you think of how grad students work on problems in robotics labs, they typically have one problem they're trying to solve, and they focus on collecting data for that task and training a better architecture for that task. What we're saying here is that thinking about a single task is maybe already shooting yourself in the foot — you should really be operating in the context of a very large multitask model, and thinking about architectures in that context really changes the game in terms of how well those models do. So there are some interesting lessons here for the community at large: multitask is not just a sub-problem, it really is the problem, and it's probably one of the ways we're going to get better models in general.

Okay, so by now you can picture where this is going. We've fused two pieces, and we've fused two other pieces; now we're going to see what we can do with just one giant model. That work is RT-2. RT-2 is basically a very large vision-language model that has all the capabilities of a very large LLM: it can do reasoning, it can do visual Q&A, and things like that. The way we approached this was to think of robot actions as just another language. VLMs are multilingual — they can speak all the languages they're trained on — so we're just going to add one more language that happens to correspond to robot actions, and treat it as such. The architecture is very similar to RT-1, except it's a much bigger model: you input language tokens, you input image tokens, and you output tokens that correspond to "robotese," if you will — robot actions. When you do this, interesting things happen. You suddenly have an end-to-end model that goes from semantics and visual recognition all the way down to action, so you can express very rich commands. You can say, pick up the nearly falling bag. You can say, pick up the object that is different. And all of that subtle, high-level understanding of what it means to be different, or what it means to be falling, is incorporated in the VLM and passed on to the actual actuation.
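A toy sketch of what "robotese" tokens could look like: each action dimension is discretized into a fixed number of bins (the RT papers describe 256) and emitted as tokens, then decoded back into continuous commands. The ranges and dimension names here are purely illustrative.

    # Toy sketch of discretizing continuous robot actions into tokens and back.
    import numpy as np

    N_BINS = 256
    # Per-dimension ranges: (x, y, z, roll, pitch, yaw, gripper) - illustrative values.
    LOW  = np.array([-0.5, -0.5, 0.0, -np.pi, -np.pi, -np.pi, 0.0])
    HIGH = np.array([ 0.5,  0.5, 0.5,  np.pi,  np.pi,  np.pi, 1.0])

    def action_to_tokens(action):
        # Clip, normalize to [0, 1], and quantize each dimension to one of 256 token ids.
        normalized = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)
        return list((normalized * (N_BINS - 1)).round().astype(int))

    def tokens_to_action(tokens):
        # Inverse map: token ids back to continuous values in each dimension's range.
        normalized = np.array(tokens, dtype=float) / (N_BINS - 1)
        return LOW + normalized * (HIGH - LOW)

    a = np.array([0.12, -0.03, 0.25, 0.0, 0.1, -0.2, 1.0])
    toks = action_to_tokens(a)           # what the VLM is trained to emit as its "words"
    print(toks, tokens_to_action(toks))  # decoded command handed to the controller

Because actions are just another token vocabulary, the same decoder that writes a caption or an answer can write the next robot command.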
I'll give you two examples of those kinds of behaviors that I really like. In one, we asked the robot to move the Coke can to Taylor Swift. Our robot has seen a lot of Coke cans — we love Coke cans, they're our bread-and-butter objects to manipulate — but our robot has never seen Taylor Swift and doesn't know what Taylor Swift looks like. We don't have any robot data that corresponds to Taylor Swift. The VLM does, right? And so the robot is able to understand the concept of Taylor Swift and move the Coke can to the picture of Taylor Swift. It can also do this with reasoning — basic reasoning: move the banana to the sum of two plus three. That means the robot needs to understand what a three looks like visually, and it needs to do basic computation — two plus three — which is something the large language model hopefully knows how to do. But we never specifically taught the robot how to do sums; it's all part of the overall model. So you see this transfer between the semantics, the vision, and the actuation, all working together to produce something that — I know "emergent" is a term that gets overused — feels emergent, in the sense that all of those things gel together in one unified way.

Another thing we saw with RT-2 is that we're just at the beginning of getting things to work. If you think of scaling laws for language models, there is a similar scaling law for this kind of robotic foundation model: as we train much bigger models, things get much better. And I don't think we're anywhere close to saturating performance at the scale we're at. It's problematic in a number of ways — those big models are really slow, and having a controller that runs at that kind of speed, doing inference on big models, is not easy — but at least there is a path that would enable us to scale up and get better.

Another form of scaling is scaling across robots. Remember when I was talking about PaLM-E, I said we saw positive transfer between robots. We did other experiments, such as RoboCat, where we trained joint models — this was an RT-1-style model, but with some reinforcement learning on top — across different robots with different action spaces, different degrees of freedom, and very different settings. And again, we saw that even for action models, we could get much better performance out of training a joint model. It's a little bit like saying the different robots just speak different dialects of robotese: it's not that they're formally that different, it's just different expressions, through the embodiment, of very common concepts. By adding the data together, we can get a much better understanding of the physics of what it means to control a robot.

So we tried to push this to the extreme. We partnered with 34 different research labs and asked everyone to pool their data together. Obviously, there is a huge amount of diversity in the robot learning research happening around the world: a ton of different embodiments, different tasks, different datasets. We just pooled everything together; we didn't even try to normalize any of it. Just to give you an example, this is what some of the data looks like — it looks completely random, right? You'd wonder how we could learn anything from data this diverse. It turns out we can. And that was an interesting lesson as well: we pooled all the data together, trained a big RT-1 model, shipped it to all our partner universities, and they were able to improve on their baselines using this model, zero-shot. And it was as fair an experiment as can be, because we just shipped them the weights and let them run the experiments themselves — we didn't have a hand in running the experiments.
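One simple, purely illustrative way to co-train a single model on episodes from several embodiments is to sample them into a shared training stream with per-dataset mixture weights. This sketch is not the actual RT-X or RoboCat pipeline; the dataset names and weights are hypothetical.

    # Toy sketch of mixing episodes from several robots' datasets into one training stream.
    import random

    datasets = {
        "lab_a_arm":    [{"image": f"a{i}", "instruction": "pick cup",    "action": [0.1, i]} for i in range(100)],
        "lab_b_arm":    [{"image": f"b{i}", "instruction": "open drawer", "action": [0.3, i]} for i in range(40)],
        "lab_c_mobile": [{"image": f"c{i}", "instruction": "go to table", "action": [0.7, i]} for i in range(60)],
    }
    weights = {"lab_a_arm": 0.5, "lab_b_arm": 0.25, "lab_c_mobile": 0.25}   # mixture proportions

    def sample_batch(batch_size=8):
        names = list(datasets)
        probs = [weights[n] for n in names]
        batch = []
        for _ in range(batch_size):
            name = random.choices(names, probs)[0]
            example = random.choice(datasets[name])
            # A single model consumes all of it: the "dialects of robotese" differ,
            # but the (image, instruction) -> action interface is shared.
            batch.append((name, example["image"], example["instruction"], example["action"]))
        return batch

    for name, image, instruction, action in sample_batch():
        print(name, instruction, action)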
So this is very exciting. The idea that, fundamentally, cross-embodiment really works — and works to an extreme degree — opens up the possibility of really building models for everyone, and it also means the models aren't locked into a specific form factor or embodiment. That has a profound impact on how people think about what it means to share data in robotics, and what it means to leverage the community and build something bigger and more impactful as a collective effort. We also trained an RT-2 version of the model and saw that those emergent skills I was discussing before get better when we add more of this diverse data. So there is a really strong signal that these kinds of large foundation models for robotics can improve the state of the art by a significant amount.

I want to step back and reflect a little on where we're at. We have this kind of unified model that takes vision as input, reasons using a language model, and produces codes that correspond to actions. It's just a large multimodal model at the end of the day; there's nothing really specific to robotics about it. We train it on some robot data — not a ton; it's not internet-scale amounts of data, but some. A lot of the heavy lifting is done by text data from the web and image data from the web. The actions we take are a form of language — just dialects of robotese. And this picture is really, really strange to us in the robotics community. If you'd told me three years ago that this is what robotics would end up looking like, or what a possible future for robotics would look like, I would have called you crazy. Back in the day — and by back in the day, I mean three years ago — we were really focused on reinforcement learning, on learning approaches that were very bespoke to specific robots and that used lots and lots of data. So the shape of things has really materially changed. This is still research, so we haven't completely validated it in the real world, but it's a completely greenfield potential new path for robotics. And the thing that is very exciting about it is that it enables robotics to ride the AI wave we're seeing: any improvements to large language models, to multimodal models, to video generation — we'll be able to leverage them in robotics. We're no longer on a little AI island; we're really part of the entire AI community, and we can benefit from the sum of all the advances happening in the world.

Okay, with that, I want to thank everybody who contributed to this research. It's the effort of a very large team of very talented people, and I'm very grateful to be a small part of it. And thank you, everyone here.

Thank you so much for a fascinating talk. We have time for some questions. If people just want to walk up to the microphone there on the aisle — let's go ahead. One second, it's not working yet? Okay, thank you.

So, by introducing the large language model into robotics, given the uncertainty in language, how can you guarantee, for example, safety?
Right now you only have a robot arm, but in the future you may have humanoids. How can you solve problems like miscommunication between the operator and the user, or maybe bad intent from an operator — how do you tell the robot, okay, do not hurt someone, do not do bad things? Yeah, thank you.

So the language model view of things does not remove any of the safety layers you need to think about when deploying a robot. The safety approach in robotics is really defense in depth: you go from the low level of making sure your actuators are safe, that your controllers are robust, that you have a big red button on the back of your robot if things go wrong — all sorts of different components. The large language model only adds one layer to it. The idea that you can add semantic safety by telling the robot "don't hurt people" as a preamble to anything you want it to do only adds to that zoo; it doesn't substitute for the rest. There is also the question of those large models being notorious for hallucinating things, and that's a general problem. What we are seeing is that when we ground them in the real world, by giving them observations of the actual world in front of them, those hallucinations really go down. I don't want to say we're eliminating all the risks associated with them, but when, over and over, you see a cup of tea on the table, you will not imagine an object that doesn't exist there, because you have this reinforcement of reality grounding the model — a grounding that doesn't exist when you're just in abstract internet space. So there are tons of new avenues for safety research, and I think this is very exciting from that standpoint as well. Thank you.

Hello, thank you for sharing your very insightful research. I was thinking: when humans try to pick up an object, for example, and we have our inner monologue, we don't think in numbers — I need to use this motor torque, or I need to go to these exact coordinates. When I tried to experiment with LLMs in my own research, I found it works maybe about as well as a human could do it that way: sometimes they get the coordinates right for placing something, but sometimes they make simple arithmetic mistakes. Even very powerful large language models still struggle with this mix of spatial and arithmetic understanding and Cartesian coordinates. And I saw on one of your slides that the LLM is actually outputting coordinates, if I understood right. How do you solve this? How do you bridge these two domains?

So, there is a lot of research that needs to happen in this space. I think the vanilla vision-language model is not particularly good at geometry and spatial reasoning, and that's a problem. That's merely a reflection of the kind of data and tasks it's trained on: a lot of the data is about semantics, about broad descriptions of scenes. Getting to really precise, measurable geometry will, I think, take some more work, and I think a lot of people are aware that this is a shortcoming of the models. I think it can be solved with data: if we change the way we train the models, we can improve on that quite a bit.
Another thing that's interesting about the way we do this is that we do it in closed loop — it's all visual servoing, essentially. You reason about the relationship between the gripper and the object you're trying to pick, and you get visual feedback at every step of the way. That feedback is a really important signal that gets leveraged here. You don't need to get the absolute coordinates right all the time; in fact, you don't even know what the absolute coordinates will be, because your robot moves, the frame of reference changes, and the world changes. So it's all relative, and it's all repeated and adaptive, if you will. Thank you.

I'd like to briefly hijack the microphone on this side. I have one online question and then one question from the gentleman here. If there are more questions in the follow-up, of course, people should come up to the front and we can have a longer discussion afterwards. From online — and this relates to the code generation — the question is: have you attempted to not just recursively generate code, but iteratively test code by deploying it on the robot and passing error results back to the LLM?

Yeah, so we deploy it typically in simulation. That's one of the benefits: once you have a piece of code, you can quickly run it on the simulated robot and see what the outcome is — whether it's sensical, whether the code even compiles or runs. Then, once you have some guarantees that the code is correct, you can push it directly to the robot; typically, if you have reasonable sim-to-real transfer, you can have some guarantees there. But prompting code to be correct all the time, to be effective all the time — that's a problem the entire AI community is thinking about, and I'm hoping we'll get much better models in that respect in the future. Yeah. Thank you.

Thank you, Vincent. Thank you for the talk. So, what are the most exciting applications of AI-powered robotics that you foresee emerging in the near future — in your imagination, both in industry and in everyday life?

I think the most interesting thing is being able to retask a robot to do anything you want it to do with very little actual knowledge about the workings of the robot. If you can just prompt a robot in natural language to do something, and then retask it to do something else, suddenly you don't have to do systems integration for every single problem you want your robot to solve. That's a dream, right? We're far from it. But the cost right now in deploying robotics is really the bespoke systems integration that's required. If we can simplify that pipeline and make it usable, so that the people on the ground using those robots in factories and logistics can just tell the robot what they want and have the robot do it, that would open up a lot of the space.

All right, I'm afraid that was the last question we can handle at this moment. So thank you, Vincent, again. And if people have questions, they should come up to the front and talk to me. Thank you. Thank you.

Thank you for joining this session. Please remember to fill out the session survey in the GTC app for a chance to win a $50 gift card. If you are staying in the room for the next session, please remain in your seat and have your badge ready to be scanned by our team.
Info
Channel: NVIDIA Developer
Views: 9,672
Id: vOrhfyMe_EQ
Length: 50min 18sec (3018 seconds)
Published: Thu Apr 11 2024