Stanford Seminar - Connecting Robotics and Foundation Models, Brian Ichter of Google DeepMind

Captions
Today I'm going to be talking about how we can leverage foundation models for robotics and get the best out of both. To motivate this, I'll start with a brief demo. Right now we have a robot running in New York, and it has a language model that essentially commands what the robot is going to do: it writes code out, and then the robot executes that code. Basically, it will read my chat and act on it.

So to start: Mark, what's your favorite color of these bowls, and what candy would you like? I don't know if you can see this; I can make it a little bigger. The robot says okay, and you want some Skittles? Sure. One really nice thing about language models is that they act as a knowledge base of even very esoteric things, like the fact that "tasting the rainbow" means Skittles. You can see the robot actually doing it over here, and this is the robot's view. I just said "I'll take Reese's," and it remembers from before, when I said which bowl is which and who it belongs to. Again, you get this very abstract knowledge that "tasting the rainbow" means Skittles.

You can also do things like this, which was one of the things that really impressed us when we first started playing with language models: I can just say "put them all in different corners," and it understands that this means not the same corners, it understands what the corners are, it knows what all the candies are, and it's able to loop through them. About a year ago this sort of blew our minds. I think it might have missed one: because they were in the same bowl, it put the Reese's on top of the Skittles and went from there. This is one area we're trying to fix. Sometimes the grounding between the language model's code and the scene fails; the fact that it can't see things means it makes simple mistakes like this, and I'll talk through why this is a big problem in robotics and where we're going to fix it.

Let's see what else I can do. Something like "put the M&Ms in the center of the bowls" works because it just knows that the center is the middle. Making a line is also a somewhat abstract concept; it really requires you to understand how geometry works. The code is maybe hard to see, but essentially it should output something. Oh, I missed the Reese's, because I think it's too far off the screen. What I wanted it to do was put the Reese's in the initial spot, and it actually got the M&Ms in between the Skittles, while the Reese's is so far off that it has fallen off the table. But essentially it understood what was there and wrote code to do some abstract geometric reasoning like this. It's not that we trained it or showed it anything; this just falls directly out of the language model.

Something like "move the M&Ms down" requires it to understand that the direction of down is the negative y direction, and it's able to successfully move them.
Something like "move them a little bit up" requires it to understand not only general geometry but also that "a little bit" is not all the way up, just part of the way. And we can actually get some memory out of this: it remembers where it put them before, so "put it back" just means recalling the last command it executed and executing it again.

I'm going to talk through how we got to this point, and then some directions we've been looking at to expand beyond it and fix some of the issues, like when it picked up the wrong thing or didn't do exactly what you expected.

To start, this is all about foundation models, and the main idea is that you take the vast trove of information on the internet, throw it through a Transformer, and get performance out of it. As roboticists, the fact that we get this general performance without having to do any work or collect any data feels almost magical; it takes us a lot of work to collect the amount of data that is handed to us for free here. So we should strive to figure out how to use these models. Their capabilities keep shooting up: starting from language models that could answer trivia or explain jokes, we can now generate images, answer questions about images in VQA tasks, and even take audio and map it directly through a foundation model to get text out. All of these are great as they are, and they keep improving and becoming more and more general.

Robot learning, on the other hand, has historically lived in domains like these: really constrained environments, easy resets so the robot can keep collecting data (for example, two bins it moves objects between), very short-horizon pick-and-place, structured tasks, and robots that aren't really moving around; any mobile manipulation is usually at a very limited scale. So we want to put robots on this wave of foundation models and see how much we can get for free, without the robots having to experience the world themselves.

There are some areas of natural overlap. The reasoning capabilities of language models are super important; there is general knowledge in there that a robot needs in order to act in the world; and there is a lot of semantic knowledge of how things map to each other and what order to do things in. These are things robotics has traditionally struggled with. But there are actually more challenges than areas of overlap. Other fields have been very quickly revolutionized by foundation models; robotics has all sorts of remaining challenges. There is general grounding: you need to know what your robot is capable of and what's in the environment, and a language model can't see anything, which is a huge gap. There is interactive decision making: everything these models have been trained on is unsupervised internet data, with no interaction.
Making the wrong decision versus the right decision is a core part of robotics that these models basically don't have in them. There is also the data: on one side we're all about huge amounts of data, and on the other side we have not only small amounts of data but often fairly low-quality data. Even the data I'll talk about later that we collected is all teleoperation data; it's relatively slow and somewhat limited, so it's really hard to get the kind of expert data that we got for free from the internet. Vision and other sensing modalities are slowly getting better in vision-language models, but they haven't traditionally been an area of strength either. And finally, safety is really important for robotics: you can't make the same mistakes that we might forgive a language model for making. All of this means the two don't naturally fit together, and there are a lot of challenges and open areas of research in trying to combine them.

So to start, I'm going to talk about how we can ground these models in the world and in their environment. Then I'll talk about what just powered that demo, Code as Policies. Then I'll talk about steps forward for building foundation models, or using foundation models to give new knowledge to robotics, and finally I'll talk about training our own foundation model and what that looks like.

As I said before, a language model can't see. So imagine this task: you're a language model, you can't open your eyes, and you're trying to reason over the entire environment. Once you do open your eyes, you realize I've asked you, "I spilled my drink, can you help?" You see what's in the scene, things like a spilled drink and a sponge that might be able to help, and you also have a lot of experience in the world of doing different things. As we go through, we see where I might want to throw that away, what the gripper and the embodiment look like, and that in the past I've been able to pick up these objects. This is all crucial to the embodiment: it's both what you can do and what's in the environment. We basically want to bring this into a language model and give it, in some ways, the hands and eyes necessary to do these tasks.

If we just take a language model out of the box and ask it this question, it answers pretty reasonably. It says you could try using a vacuum cleaner or calling a cleaner; it even takes credit for the spill, which is a very reasonable thing from the language model's side but not necessarily actionable for robotics. So one of the issues is that these answers are not actionable. How do we get the language model to speak in a way that robots can actually execute, given their embodiment?

The first algorithm I'll put forward is called SayCan. The idea is that instead of letting the language model generate the maximum-probability next token, as it normally would, we fix the ways it can respond, so these are the only responses it can give. This is called scoring mode.
Basically, we see how likely the model is to respond in each of these ways. Say the instruction is "put the apple on the table," and the things the robot can do are "find an apple," "find a sponge," "pick up the apple," and so on. These are actual probabilities that come out of a language model, and you can see that "pick up the apple" is really high, which makes a lot of sense. But if you then open your eyes, you see there's no apple in front of you; you're just in the middle of the environment. So how do we bridge this gap?

One way is that robotic policies usually come with trained value functions, and these value functions give the probability of getting the reward, of succeeding at a task, given the state, given the image of the environment. Here the value functions might say that picking up an apple is unlikely to succeed because there's no apple there, and putting the apple down is extremely unlikely because you don't have an apple in hand, but navigating to various places has a higher value: the chance of getting that reward is reasonably high. So on one side we have the probability that a low-level skill is useful to the high-level task, and on the other side the probability that you can actually succeed at it given the current state of the environment and the robot's abilities. Together we have two probabilities, one that a skill is useful and one that it's possible, and if you multiply them we get the most useful and possible thing to do. In this case that becomes "find an apple," which makes a lot of sense: first you need to find it. Once that's done, you update the prompt going into the language model, so now it knows it found an apple; the value function gets updated, since there's now an apple in front of the robot; and it most likely picks "pick up the apple."

The intuition is that, given a query, on one side we have what's useful for the overall task, which is knowledge that comes from the language model, and on the other side what's possible in the environment, given the robot's abilities and its experience. What's really nice is that we get an interpretable view of its decision making: we can see what the language model is reasoning over, what the affordances are reasoning over, and how it comes to its decision, which makes it super easy to debug and understand. For example, if I say "I spilled my Coke, can you bring a replacement," notice that it picks up on the idea of a replacement, whereas before, when I said "I spilled my Coke, can you bring me something to help," it went for a sponge. Here it understands that "replacement" means a replacement for the Coke; it understands that deeper information. The affordances then tell it that what it can do right now is find a Coke, not pick one up, and it makes these decisions step by step. So we have a really nice, interpretable framework. I think we completed about 70% of our tasks, and with a better language model we got that up to about 85%. One thing that's really nice is that we can basically just swap out the language model and put in a new one as they get better and better.
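To make that combination concrete, here is a minimal sketch of the SayCan-style selection rule, not the actual implementation: it assumes a hypothetical `lm_log_prob` (the language model's score for a skill given the instruction, from scoring mode) and a hypothetical `affordance` function standing in for the learned value functions, and the skill names and numbers are made up for illustration.

```python
import math

# Hypothetical skill library; names are illustrative only.
SKILLS = ["find an apple", "pick up the apple", "put down the apple",
          "go to the table", "go to the counter"]

def saycan_step(instruction, lm_log_prob, affordance, skills=SKILLS):
    """Pick the next skill by combining two probabilities:
    - lm_log_prob(instruction, skill): how useful the skill is for the
      instruction, scored by the language model in scoring mode
    - affordance(skill): how likely the skill is to succeed from the current
      state, e.g. a learned value function's success probability
    """
    best_skill, best_score = None, -math.inf
    for skill in skills:
        # multiplying probabilities = adding log-probabilities
        score = lm_log_prob(instruction, skill) + math.log(affordance(skill))
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill

# Toy stand-ins for the two models (made-up numbers for illustration).
lm = lambda instr, s: {"pick up the apple": -0.5, "find an apple": -1.0}.get(s, -5.0)
vf = lambda s: {"find an apple": 0.8, "go to the table": 0.7}.get(s, 0.05)

print(saycan_step("put the apple on the table", lm, vf))  # -> "find an apple"
```

Multiplying the two scores is what makes "find an apple" win here, even though the language model alone would prefer "pick up the apple".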
The other part that I think is really important is that this grounding reflects the robot's own experience, including the errors it makes. So robotics actually has something to bring to the table: this experience in the environment is crucial for the language model to be able to successfully plan and command the robot.

Here's a quick video of this in action. We type a command, I think this one is "I spilled a Coke, bring me a replacement and something to clean it up," though maybe the internet is cutting out too much. As I was saying, on the right side it pops up with the actual decision-making process, so as it goes you can understand what it's trying to do and why. It's all in language, so you can actually follow it, rather than raw actions or something more difficult to grasp. I'll skip forward in the interest of time, but essentially we asked it to do two things, throw away the Coke can and also bring me something to clean up, and it's able to put both together into a longer plan and do a task that takes a couple of minutes to roll out and is truly long-horizon and sequential.

One issue that arises is that I enumerated those skills and scored over them, but if you go into a world like this one, it's computationally intractable to write down everything the robot can do; there are pairwise things like stacking one object on another. So SayCan doesn't really scale to this kind of complexity, the intractability of the real world. How do we go from this idea of grounding with value functions to a more complicated world like this? The idea in this work, Grounded Decoding, which Wenlong put together, is that instead of scoring a fixed set of primitives, we bias the language model at the token level. A language model is always predicting probabilities over the next token, and we can do the same with a policy, conditioning it on an arbitrary amount of natural language. So for "put the ...", the next object has to have a high probability in this scene: the Pepsi is pretty likely, the cake is there (I think those are little cakes), and certain letters are visible. Essentially we get token-level grounding of what's possible and what's useful. That means we can also include things like safety and user preferences, and decode in a much more varied, combinatorially complex world. We can do things like packing a picnic box while ignoring the knife that the robot maybe shouldn't handle, because the safety function biases the decoding away from it. It's a very general way to ground the language model in the world and in the capabilities of the robot without losing the expressiveness of the language model. We can also use it for navigation, choosing where to navigate to.
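Below is a minimal sketch of that token-level idea, under the assumption that each grounding model can be treated as a function scoring candidate tokens. `seen_in_scene` and `safe_to_grasp` are hypothetical stand-ins for an affordance model and a safety model, and the vocabulary and probabilities are toy values.

```python
import numpy as np

def grounded_decode_step(lm_probs, grounding_fns, vocab):
    """One decoding step, sketched: multiply the language model's next-token
    distribution by each grounding function's score and renormalize.

    lm_probs: the language model's probabilities over the next token.
    grounding_fns: functions mapping a token to a score in [0, 1], e.g. an
        affordance model, a safety model, or a user-preference model.
    """
    combined = np.array(lm_probs, dtype=float)
    for fn in grounding_fns:
        combined *= np.array([fn(tok) for tok in vocab])
    combined /= combined.sum()                    # back to a distribution
    return vocab[int(np.argmax(combined))], combined

# Toy example: the prompt so far is "put the ..." and these are candidate words.
vocab = ["pepsi", "cake", "knife", "banana"]
lm_probs = [0.40, 0.30, 0.20, 0.10]               # what the LM alone prefers
seen_in_scene = lambda t: 0.9 if t in {"pepsi", "cake", "knife"} else 0.05
safe_to_grasp = lambda t: 0.0 if t == "knife" else 1.0

token, dist = grounded_decode_step(lm_probs, [seen_in_scene, safe_to_grasp], vocab)
print(token)   # -> "pepsi": likely under the LM, visible in the scene, and safe
```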
The language model can even trigger when it should be grounded and when it shouldn't, so in a way this grounding becomes a tool the language model uses. Given that we want some object, it says "I'll go find...", and when it gets to that point it doesn't know what's in the world, so it can trigger that it needs to ground the next statement. It can then look at what it has seen in the world in the past, in this case using CLIP, realize that a Pepsi can or an apple was there, and make its decision based on that. The fact that this grounding is a tool the language model can call up when necessary means you can really leverage it for very varied tasks.

One thing that still goes wrong, and we saw it in the initial demo, is that the robot failed at a task but got no feedback that it had failed; it just failed and kept going to the next step. In the real world you do get this feedback: you're trying to unlock a door, you put the key in and it doesn't work, so you switch to a different one; you turn it, and now it opens. The language model isn't getting that feedback in these earlier setups. What we need is some kind of inner monologue: okay, I tried this, did it succeed, and what do I see in the world? There are a lot of kinds of feedback we can provide. We can use object recognition to tell the language model what's available, we can use success detection to say whether it succeeded at something, and we can build all of this into the prompt. We get a query like "bring me a drink from the table," we use SayCan to determine the next step, and we go to the table, but the language model doesn't actually know what's on the table yet. In SayCan it would reason over all the possible objects, but without knowing what's there it's very hard to blindly reason about what might be present. Instead, object recognition can just write a description into the prompt, and then it's very clear what should be done next. The robot can also ask questions: in its inner monologue it realizes a problem is underspecified and asks the user something. And when it tries to do something like pick up the Coke and fails, it can recognize that it failed and simply try again.

What this looks like in the real world is that we have all these sources of feedback, we build them into a prompt, and we let the language model handle this monologue of what's going on in its head: okay, I tried this, it failed, what should I do next? In this case it tries something and fails because we explicitly stopped it from working, picking up a Coke can, and now the Coke can isn't there anymore, so "bring me a soda" doesn't work so well. So it goes back to the user and asks: this didn't work, I don't see a Coke can anymore, what would you like instead? It tells us what it sees, there's this and this; we tell it that doesn't work for us; and it goes back to the planning stage and says, okay, now I need to find the lime soda we asked for instead. This allows the language model to be much more grounded in the world by using these additional tools, a success detector, human feedback, and a scene descriptor, and in this case it successfully delivers the drink to the person. Another nice thing is that it even terminates at the end: it says it has accomplished the task, and the last thing it does is raise the arm to its reset position, signaling that it has completed the task.
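Here is a rough sketch of the feedback loop just described: a prompt that accumulates scene descriptions, success signals, and human answers before each replanning step. Every helper passed in is a hypothetical stand-in; a real system would use something like SayCan for planning, a trained success detector, an object detector, and a low-level policy.

```python
def inner_monologue(instruction, plan_next_step, execute, describe_scene,
                    detect_success, ask_human, max_steps=10):
    """Accumulate feedback into the prompt and let the planner replan each step."""
    prompt = f"Human: {instruction}\n"
    for _ in range(max_steps):
        step = plan_next_step(prompt)                   # e.g. SayCan over the prompt
        prompt += f"Robot: {step}\n"
        if step == "done":                              # planner decides to terminate
            break
        if step.startswith("ask:"):                     # planner chose to ask the user
            prompt += f"Human: {ask_human(step)}\n"
            continue
        execute(step)                                   # low-level policy runs the skill
        prompt += f"Scene: {describe_scene()}\n"        # object-recognition feedback
        prompt += f"Success: {detect_success(step)}\n"  # success-detection feedback
    return prompt

# Toy stubs so the sketch runs end to end (canned plan, fake detectors).
steps = iter(["go to the table", "pick up the coke", "ask: which soda instead?",
              "pick up the lime soda", "bring it to the user", "done"])
log = inner_monologue(
    "bring me a drink from the table",
    plan_next_step=lambda prompt: next(steps),
    execute=lambda step: None,
    describe_scene=lambda: "coke, lime soda, water on the table",
    detect_success=lambda step: "no" if "coke" in step else "yes",
    ask_human=lambda question: "a lime soda, please",
)
print(log)
```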
But there are some limits to what language can do. I talked about putting things in a line, but that would be kind of hard to specify exactly in language; it's not necessarily the best format. Also, if you wanted to describe this scene in language, you would have to describe all the complexity: they're all bunched together, the green one is a little higher, the blue one is over here. Language isn't necessarily the best representation for that. And the case where we have to wait until we see something before making another decision feels a lot more like how code works. Code is actually a linguistic representation of actions; it's directly what we run on the robot at the lowest level, and it has all this power to call external methods, do any sort of computation, and use if statements and while loops for more complex reasoning. Something like "put things in a line" is pretty straightforward to write in code, even though it might be more difficult to enumerate in language. Most importantly, there is a ton of data of what code looks like; I think even the best GPT models now start by training on code and are then fine-tuned on instruction following and language. Code is really nicely structured, and we know it runs, otherwise it wouldn't be there, so there's a good signal that it's high quality. It's actually a really good fit, and it's what we run on robots, so we should just use these language models to write code and run that directly on the robot.

What's nice is that we can use APIs, say perception APIs that can detect where things are, we can use for loops, and we can even have the language model write new APIs. If it generates something like is_empty or stack_objects as a function, it can realize those weren't already defined and then just generate them next. This hierarchical code generation does extremely well at much more abstract and complicated reasoning, and it does it all essentially few-shot, without any training. These models have also seen every textbook you've seen, so they can write a full controller for something as simple as a cart-pole; they can understand emojis and different languages; all of this falls out for free. It can even adapt to different embodiments based on the available actions: here we just list "move up, move right, move back" for one robot versus "turn left, move forward" for another, which is how that robot moves, and the generated code immediately adapts to do new things. It really shows the power of prompting, especially when you're prompting in a language that's directly executable by the robot. We put this on two or three robots here with different end effectors. I really like the examples where it's drawing shapes, because that would be an extremely complicated control problem, yet in code it's trivial to write how to do it, and you can put it on a robot and have it running within a day, fairly robustly.
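A minimal sketch of this prompting pattern, in the spirit of Code as Policies rather than the actual system, is below. The "language model" is a canned lookup, and `detect` and `pick_place` are hypothetical placeholder APIs; the point is the hierarchical step, where an undefined helper the model invented (`place_all`) is itself generated before the program runs.

```python
import ast
import builtins

def missing_names(code, api):
    """Find function names called in generated code that are not yet defined
    in the robot's API namespace or in Python's builtins."""
    called = {node.func.id for node in ast.walk(ast.parse(code))
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    return called - set(api) - set(dir(builtins))

def run_instruction(instruction, generate_code, api):
    """Ask the model for code, ask it to define any helper it invented
    (one level of hierarchical code generation), then run the program."""
    code = generate_code(f"# {instruction}\n")
    for name in sorted(missing_names(code, api)):
        exec(generate_code(f"# define {name}\n"), api)   # add the new helper
    exec(code, api)                                       # run the top-level program

# Toy "language model": canned completions keyed by prompt. A real system would
# call an actual LLM with a prompt showing example API usage.
fake_lm = {
    "# put the blocks in a line\n":
        "positions = [(0.1 * i, 0.0) for i in range(len(detect('block')))]\n"
        "place_all('block', positions)\n",
    "# define place_all\n":
        "def place_all(name, positions):\n"
        "    for obj, pos in zip(detect(name), positions):\n"
        "        pick_place(obj, pos)\n",
}

# Hypothetical robot APIs; these names are placeholders, not a real robot SDK.
api = {
    "detect": lambda name: [f"{name}_{i}" for i in range(3)],            # fake perception
    "pick_place": lambda obj, pos: print(f"pick {obj} -> place at {pos}"),
}

run_instruction("put the blocks in a line", lambda prompt: fake_lm[prompt], api)
```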
You can also do very abstract reasoning, like making this square bigger or moving something to the side, or write while loops and if statements that wait for things to happen before making a decision. So it's a very powerful concept. But there are also limitations to how well you can write things as code, and there's a reason we sometimes want end-to-end policies: something like grasping a mug might be very difficult to write effectively in code. So instead, what we wanted to do was basically collect a bunch of data, maybe not quite at the scale of foundation models but as much as we could, and see what happens if we just train a model to output actions directly.

This is a paper called Robotics Transformer, or RT-1. The idea is a simple Transformer architecture with a few things to make it efficient enough to run quickly on a robot. We tokenize the whole input, the instruction and the image, put it through a couple of efficient layers, then through a Transformer, and directly output tokenized actions. We collected 130,000 episodes, each maybe 10 seconds long, over 13 robots, a year and a half, and about 700 tasks to get it to decent performance. And it actually does pretty well: across the seen tasks it achieves about 97%, so it's pretty near perfect except for little mistakes here and there, in scenes where we randomize which objects are present and where they are. It also does pretty well with unseen objects and unseen tasks, with a lot more objects added in and the space progressively more cluttered, and with a changed background: almost all of the data came from a scene with a gray tabletop, but we put it in front of a white tabletop with wooden cabinets and drawers, and it's still able to open the drawers and pick things up. It does a lot better than the other approaches we tried; we tried regular behavioral cloning without a Transformer and Gato, which has some subtle differences, and in general RT-1 works pretty well and, most excitingly, starts to generalize.

But the thing we were most excited about is its ability to ingest heterogeneous data. We collected a ton of episodes on this robot in this scene, then we added simulated data, which is very easy to collect, and data from a different robot, a Kuka, collected for a completely different project. We just made the action space the same and threw it in. First, the sim data: if we hold out certain objects so they're only seen in sim, it makes a massive difference. I think success went from around 10 or 20 percent to 85, so about 65 points better, and it can even do new skills with those objects that it hasn't seen in either domain, so there's cross-domain transfer. Adding in the new objects from the different robot also works pretty well, and we're suddenly able to pick those up. So we're actually starting to see this kind of scaling. We do see that it overfits pretty quickly to the environment we have; I think we had 97% there and around 80% in the next scene, which is great for robotics but maybe not the level of performance we hope to see.
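As a rough illustration of the interface described here, tokenized actions with a short image history as input, the sketch below discretizes continuous actions into bins and chunks a teleoperated episode into pairs of (instruction, last N images) mapped to the tokens of the very next action, the scheme mentioned again in the Q&A at the end. The bin count, action bounds, and episode format are illustrative assumptions, not the exact values from the paper.

```python
import numpy as np

N_BINS = 256                            # assumed number of bins per action dimension
ACTION_LOW, ACTION_HIGH = -1.0, 1.0     # assume actions normalized to [-1, 1]

def action_to_tokens(action):
    """Map each continuous action dimension to one of N_BINS discrete tokens."""
    action = np.clip(action, ACTION_LOW, ACTION_HIGH)
    frac = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

def tokens_to_action(tokens):
    """Invert the discretization, recovering the bin centers."""
    frac = (tokens + 0.5) / N_BINS
    return ACTION_LOW + frac * (ACTION_HIGH - ACTION_LOW)

def make_training_examples(episode, history=6):
    """Chunk one teleoperated episode into supervised examples:
    (instruction, last `history` images) -> tokens of the very next action."""
    images, actions = episode["images"], episode["actions"]
    instruction = episode["instruction"]
    examples = []
    for t in range(len(actions)):
        image_history = images[max(0, t - history + 1): t + 1]
        examples.append((instruction, image_history, action_to_tokens(actions[t])))
    return examples

# Toy episode: 4 timesteps of 8-dimensional actions and dummy 64x64 RGB frames.
episode = {
    "instruction": "pick up the coke can",
    "images": [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(4)],
    "actions": [np.random.uniform(-1, 1, size=8) for _ in range(4)],
}
examples = make_training_examples(episode)
print(len(examples), examples[0][2])    # 4 examples; the first action as 8 tokens
```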
But it can ingest this heterogeneous data, which I feel is the real crux of what a foundation model needs to be able to do: pull in data from all these different environments. So it's starting to feel like maybe we can train a foundation model. One of the big issues we came across, though: we essentially fixed everything else and slowly lowered the data size, either by uniformly removing data across all tasks or by cutting out the tasks with the least data. In the second case we lose only about five percent of the data, but it's the underrepresented, more diverse tasks, and performance drops really quickly. If we instead pull out data uniformly across the board and keep our diversity up, performance doesn't go down, or at least not nearly as quickly. So diversity is really the key here, but diversity is hard to get, especially for robotics: you need robots in a lot of different scenes, and scaling up within one scene, like the classroom where we collect most of our data, becomes kind of a waste of time once you've collected so much data there. So one of the big things we're looking at is how to vary this data collection.

One way to vary it is to imagine we were in different scenes. Maybe we've collected some data here; let's use diffusion models to imagine something different. That's essentially what we've done: we recognize which regions matter and then vary everything else that isn't important to the task. So the robot is picking up the Coke can, but it imagines it's in a kitchen, in a living room, in an office, and all of this becomes useful data for the robot. We've essentially used these massive diffusion models to inject new, imagined experience and give the robot the ability to handle diversity. The way it works is that we localize the areas that are important for the task, and everything else becomes a region that we inpaint, meaning we replace it with imagined experience, in this case from the Imagen model, which is a foundation model for image generation, and then we throw that in as though it were real robot experience. There's another somewhat trippy video where we even replace the thing it was picking up, so it learns a new task: it can learn to pick up these objects while also varying everything else in the scene, or while keeping the rest of the scene the same, making sure we stay in-domain for the parts we care about. We can even act as though the whole time it was picking up a cloth or other deformable objects, and we're actually able to do this at a higher success rate. There are a lot of numbers here, but the main point is that the bolded numbers are where we added this imagined experience with ROSIE, and across the board it helps and doesn't really hurt anything, so we're getting essentially free experience from the foundation model.
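The augmentation just described can be sketched roughly as follows. `segment_relevant` and `diffusion_inpaint` are hypothetical placeholders for an open-vocabulary segmenter and a text-conditioned inpainting model; the toy stubs at the bottom exist only so the sketch runs.

```python
import numpy as np

def augment_episode(episode, segment_relevant, diffusion_inpaint,
                    scene_prompt="the same manipulation, but in a kitchen"):
    """Keep the task-relevant pixels, re-imagine everything else, and return a
    new episode that is then treated as if it were real robot experience."""
    new_frames = []
    for frame in episode["images"]:
        keep_mask = segment_relevant(frame, episode["instruction"])   # robot + target object
        new_frames.append(diffusion_inpaint(frame, mask=~keep_mask, prompt=scene_prompt))
    return {**episode, "images": new_frames}      # actions and instruction unchanged

# Toy stubs: keep nothing, and "inpaint" by painting the masked region white.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
episode = {"instruction": "pick up the coke can", "images": [frame], "actions": [None]}
augmented = augment_episode(
    episode,
    segment_relevant=lambda img, instr: np.zeros(img.shape[:2], dtype=bool),
    diffusion_inpaint=lambda img, mask, prompt: np.where(mask[..., None], 255, img),
)
print(augmented["images"][0].mean())   # 255.0: the "re-imagined" background
```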
But I think what we really want to do is use the knowledge contained in foundation models at the lowest levels of robotics. So the next thing we tried was to actually train our own. We called it PaLM-E: it starts with PaLM, which is a massive language model from Google, and ViT, a massive vision model from Google. We put them together, kept all of the internet data in the training mixture, and also added a bunch of robotics data, data that is specifically useful for robotics tasks. We trained on this massive mixture and then tried it on a bunch of different tasks, and suddenly not only can it do VQA tasks, it can do task and motion planning, it can manipulate things in an environment, and it can do real tasks and generalize pretty well.

The way it works is that we have a language model taking in a bunch of text tokens, and we also train the ViT to output, essentially, text tokens, a co-embedding space between the two. It can be images, it can be other things, but the language model just reads these in as though they were all text tokens, and the other, smaller models output things that go in as vision tokens or object tokens and so on. We train the whole thing end to end so it learns to treat the text tokens and the image tokens as the same kind of thing. Not only can it do VQA tasks, some fairly complicated, like answering what's in a given image or handling emojis, and some really simple, like just describing an image, it can also do mobile manipulation, so we can do SayCan all within a single model; it can do task and motion planning and output a real plan for somewhat complex sequential reasoning; and it can do tabletop manipulation. We also tested language-only tasks, with no vision at all, and because we started with this language model we don't lose the original language reasoning capabilities: it can still do math, write haikus, a very general swath of things.

One of the most exciting results is that we really see positive transfer. If we train the same model on only one of these datasets at a time, we get middling performance, but if we put them all together along with the internet-scale data, all of them shoot up massively. So there's a reason to throw everything into one pile. Even though these domains are pretty different visually, especially between the language-table data and the SayCan data, with very diverse visuals and different action spaces, it's still able to transfer across domains. We also see that as we get to larger and larger models, language performance doesn't drop as much when we add in the vision. On language-only tasks, with an 8-billion-parameter PaLM model, this is the performance it started with, and when we added this training it dropped massively; it basically becomes a terrible language model at that point. With a bigger model it drops a little less, and with the really large one it performs essentially the same, and on some tasks it even got better. So if you get to a really large model, it can ingest all of this new information without catastrophic forgetting of its previous knowledge.
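A toy sketch of the multimodal interface described above, image features projected into the language model's token-embedding space and spliced into the text sequence, is shown below. The dimensions and the random projection are illustrative assumptions; in the real model the projection is learned end to end and the resulting sequence feeds a large decoder.

```python
import numpy as np

D_MODEL = 512      # language-model embedding size (illustrative)
D_VISION = 768     # vision-encoder feature size (illustrative)
rng = np.random.default_rng(0)
project = rng.normal(size=(D_VISION, D_MODEL)) / np.sqrt(D_VISION)  # learned in practice

def build_multimodal_sequence(text_embeddings, image_features, insert_at):
    """Splice projected image features into the text-token embedding sequence;
    the language model then attends over the result as if it were all text."""
    image_tokens = image_features @ project                  # (n_patches, D_MODEL)
    return np.concatenate(
        [text_embeddings[:insert_at], image_tokens, text_embeddings[insert_at:]],
        axis=0)

# Toy inputs: a 10-token text prompt and 16 ViT patch features for one image.
text = rng.normal(size=(10, D_MODEL))
image = rng.normal(size=(16, D_VISION))
sequence = build_multimodal_sequence(text, image, insert_at=3)
print(sequence.shape)   # (26, 512): text tokens and "vision tokens" in one sequence
```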
That lack of forgetting is extremely promising: maybe we don't lose too much by doing this. Some visual reasoning examples that we were pretty excited about, or just thought were fun, are here. It can read off the prices of these items and then do math on them to tell you how much two custom pizzas would cost. This one, I think, involves particularly complex reasoning because it's a mix of words and rules: the sign says you can't enter except on a bike, so it has to understand the exception, and there's also a picture of a bike and of emergency vehicles, and it reasons through this to recognize that on a bike it's okay to go through. We can also use multiple images: we can compare two photos, and these are definitely not in the training set, one of us went to Marine Layer and took some pictures, and ask what's the difference between these two images and what matches. It can apply this complex reasoning even to images it has never seen before, which is very exciting, especially for robotics tasks.

And then we can put it on robots. With the same model we do these tasks on the right. In neither case is it outputting actions directly; it's commanding a low-level policy that outputs the actions. But it's able to handle adversarial disturbances: here we block it from picking something up, and it closes the loop with the environment, repeatedly recognizing what it has done and what it still needs to do as we manipulate the environment, and it recovers. It does all of this from the same checkpoint, so it feels like a real step toward a multi-embodiment brain, a foundation model for robotics.

This is my last slide. What's next: we want to go beyond language and keep pushing on the vision-language domain, but we also recognize that there are a lot of issues with applying foundation models to robotics. They're really not perfectly matched to each other, which, as a robotics researcher, is kind of exciting, because there's a lot left to figure out. In the areas where we can leverage them, we should leverage them as best we can, and in a way that makes sense for robotics: things like grounding them in the scene, recognizing constraints, and using value functions and other sources of information that robotics already has. But at the end of the day, what really needs to improve is the robotics side. While language models and vision-language models continue to increase in performance, the interaction side of robotics is still the bottleneck of the overall system. That said, there are a lot of really positive signs that an embodied foundation model is possible: the positive transfer across tasks and the ability to ingest heterogeneous data, between different robots and from sim, are extremely promising, and the fact that we don't get catastrophic forgetting as we train these models with more robotics data means we can leverage everything that was already there while still adding in our own data. A lot of people worked on this, most of whom are pictured here, but not all. I think that's it.
I can obviously take questions, and we also have the robot running in the back if there's anything we want to try on it.

Q: I had a quick question. When you were measuring the success rate, you mentioned you had safety precautions to make sure there weren't any catastrophic failures, the kind of mistake that a language model might be forgiven for. What was the range of failures? Was it just things that didn't perfectly accomplish the task, or did the robot ever do something that actually left the place worse than it started?

A: Certainly sometimes it picks something up and knocks something else off the table, which leaves things worse. Our policies are trained to terminate, or to register a failure, if they hit something like the table too hard, things that could cause damage, so that's both baked into the robot stack and trained into the policy as a negative outcome. So nothing too terrible happened. Generally the errors were split pretty evenly between the robot policy failing and the language model doing the wrong reasoning about what's in the scene, roughly 50/50. But as language models continue to get better: we benchmarked this probably a year ago, and if we used a language model from today, I think the errors would be more on the robotics side than on the language model side.

Q: Thank you. We have a few questions from the Zoom room. Ken asks: what about monitoring force feedback in teleoperation to teach or learn about force interactions?

A: I feel like I might need more context on exactly what that means, but a lot of the data we've collected here is through teleoperation, or maybe the question is about improving teleoperation. Right now it's done with an Oculus controller tied to the hand pose, and honestly our teleoperation is pretty basic; I've tried to pick things up with it and it's hard to do. So I think things like adding haptic feedback would be incredibly impactful. We see things like the recent ALOHA paper, where a much better controller seems to make dexterous manipulation much easier. This is a huge area: poor teleoperation performance is limiting our expert data, which is then limiting our policies.

Q: Maybe as a follow-up: a lot of the tasks you show are vision based. In robotics you have lots of other sensors. Anything you can say about your thoughts on applying this to force, IMU, and so on?

A: I think for a lot of tasks, not having depth or torque sensing is probably a little limiting. The way we think about it is that we're trying to push as far as we can in the simplest setup to see what level of performance we can get. If we were a product trying to build the best possible system, we might add these in.
But the fact that we can get to 97% on all these tasks, moving objects, opening drawers, putting things into things, flipping things over, knocking a can over, tasks where you'd think haptics would be important, and that we can still get there through learning, shows that you can get pretty far with images. Since cameras are available pretty much across the board, we want to take the simplest approach. But I do think closing that last three percent probably depends on adding extra forms of supervision like that.

Q: Cool. A final question from the chat: the viewer is curious whether many of these models can run on consumer hardware for smaller robots, like an NVIDIA Jetson. 550 billion parameters isn't going to be practical for hardware of that size, so what are the options?

A: I think there's a lot of work around distilling policies that seems relatively promising. I also think the fact that these models are hosted in the cloud with fairly minimal latency, we can run a lot of this at under one hertz and sometimes at five hertz, means that maybe the Jetson doesn't need to have the model running on it at all; that's probably the real answer, that it's not as crucial as we might expect. There are probably some really high-frequency components, and there are ways to speed the models up, but for now the cloud seems to work pretty well.

Q: Thank you. Can you elaborate more on how you're using teleoperated data for training? For image and text you don't need it; are you training on the trajectories?

A: Yes. The architecture essentially takes the last six images and outputs the next action the person took. So if there's a trajectory, we chunk it up into single timesteps where we predict the very next action given those last six images.

Q: I see, so there is someone controlling that arm?

A: Yes. To get the data, someone takes an Oculus controller and moves as though they're going to pick up the object; they can also use a handheld controller that moves the robot hand to grasp the object. We then take all that demonstration data and turn it into single timesteps where we take the images and output an action.

Q: I see, thank you.

Q: Thank you, Brian, for the talk, it was great. You mentioned that RT-1 outputs actions directly, whereas PaLM-E outputs decisions that a low-level controller executes. Going forward, which of these two approaches do you think is more reliable? And I'm also curious how reliable the low-level controller that PaLM-E uses to execute actions is.

A: The low-level controller that PaLM-E uses to execute actions is actually RT-1, so it's relatively reliable as long as it's in the scenes it has been trained in; we see good, though maybe not perfect, generalization. As for why we split it this way right now: most of the internet data the language model has been trained on is at that higher level; there isn't action data on the internet. Going forward, though, we're definitely pushing on which levels we can train into these models and still get high performance, and it's somewhat of an open question where you get better performance. Maybe the language model or the vision-language model generalizes much better, so you need to find the sweet spot and the trade-offs between performance and generalization across the two.
I think the small local model probably gives higher performance and the other one gives higher generalization; that's roughly the answer.

Q: Got it. A quick follow-up: for most of the failures you're observing, are they at the low-level controller level or at the LLM or VLM level?

A: So, the high level making a mistake versus the low level making a mistake: with PaLM-E or something like it, it's now much more the low level making mistakes, probably 80/20. The improving performance of foundation models has certainly put robotics in the lower-performing position; the errors are now on our stack.

Q: Thank you so much.

A: Thank you again.

Q: Oh, one more, just out of curiosity, now that we have you here and the robot running: what happens if you ask it to do something with objects that aren't actually in the scene? Does it throw an error, or does it try something weird?

A: Because it's working through code, it probably won't do much, but we can try. What would you like? An apple? Let's see. It says "I don't have an apple." It does have a list of objects that are coded in for it to look for, which goes to the object detector, and apple isn't on that list. I hadn't tried that before, but it doesn't do anything. Unfortunately I can't increase the size of this, but you can see all the code that actually gets run; in this case the language model just responded without writing any code.

All right, thank you again. Thanks. [Applause]
Info
Channel: Stanford Online
Views: 5,348
Keywords: Stanford, Stanford Online
Id: wnPNRh-jtZM
Length: 47min 51sec (2871 seconds)
Published: Fri May 19 2023