OPTIMUS Gen 2 w/ James Douma & Scott Walter. Diving Deep w/ Tesla AI & Robotics Experts PART 1

Captions
as we know, 2024 is already shaping up to be an incredible year of progress for artificial intelligence and robots, and what's especially interesting is the fusion between artificial intelligence and robots. I wanted to talk about this today with two of the very best people in the field that I know, and that is our friend Dr. Scott Walter and the one and only James Douma. So today we're going to be diving into the topic of humanoid robots: Optimus, Figure, Aloha, a number of the developments that have happened recently, and really just try to dig into some of the more nitty-gritty details to see where the technology is today, what challenges are still open, and what the size of the engineering challenge is. There's a difference between a scientific proof of concept and an actual commercializable product, so we'll be talking about some of those things: the state of the hardware, separating out the difference between the brains of a useful humanoid robot and the actual physical body. Those are going to be some incredibly interesting topics for us to think deeply about together, and I couldn't think of two better people to have that discussion with. So with that, I'm just going to turn it over to you, James. Why don't you start by talking about the general state of where we are today in AI-powered robotics, help us understand what part is a software problem and what part is a hardware problem, and just give a precursor to how we got to where we are. Okay, so I think different people are going to have different ways of breaking down and describing the state of where we are today. My 10,000-foot view on this is that we're finally taking it seriously and working hard at it. I think the industry itself might be in a little bit of a hype cycle right now; there haven't been technical reasons why the stuff we're doing today couldn't have been done two, three, four years ago. I mean, we're doing it better, and there are definitely tools we have today that we didn't have a couple of years ago, but I think a big part of where we are right now is that the market, this whole area, has been validated, and now there are a lot of people interested, and suddenly we see lots of companies making an effort to do the humanoid robot thing. I think externally, if you haven't been watching the space for a long time, you might be under the impression that that's because of recent software breakthroughs, and that's mostly not true. Recent software advances have definitely made it more reasonable to be pursuing this, but you could have started several years ago, and a lot of the progress that we're seeing right now could have happened a few years ago. To get to commercial robots, we break the problem down into a software part and a hardware part, and for robots they're both pretty hard right now. I would argue that the really core stuff that you had to have demonstrated was possible in order for this to be a feasible product-development effort — all the really important ingredients of that — we actually had a few years ago. You could have decided, and there were probably small companies that had decided, that you wanted to try working on a humanoid robot a few years ago. But I feel like Tesla coming in and doing this has really validated the space, and so now there's much
more interest in investing in the space or many more people who are interested in like you know working on the problem not of doing a laboratory demonstration that yes a robot can in principle if you put enough effort into it do things you know that are gen that that a human can do it can be used as a substitute for a human in some in some space of tasks but rather that that we're at the point where it's a matter of of refining and productizing this so that's the software end to it the hardware end um you know I think one of the things we learned from Boston Dynamics was that uh that you know if cost is No Object uh you could make really performant uh robot bodies and we've been able to do it for for quite a while but if you want to make a commercial robot body it's it has to be inexpensive it has to be durable it has to have um a variety of degrees of free fre Freedom that are a good match to what a human being can do and those bodies didn't exist so now there there's very little in the development of the body which requires you know design breakthroughs there's instead there's there's a lot of iteration of testing design variations of getting getting you know building some prototypes getting them out there finding out where they work where they don't matching them to the software figuring out how that fit is and then of course there's scaling up to drive the cost of some of these especially the actuators but other aspects of the bodies as well can't easily be built uh off of things that are already mass-produced I mean there's a significant amount of custom design and there are some things like the actuators that where you know they're going to need to be scaled up to drive the cost down and so there's that whole manufacturing ramp and learning ramp that that that needs to be done on the hardware so we have so the hardware is in that state it's basically in refining and Manufacturing ramp State and the software up until like up to five years ago I would have said we were in a space where you could have looked at and you could said there are things that need that we need for a humanoid robot that we just simply don't know how to do and we don't know when we're going to be able to do them but that hasn't been true for the last five years so so the software also has moved into a similar kind of state where it's it's more about figuring out how to use the building blocks that we do have like what's the combination of software elements that's going to make you know make for the the most useful performant flexible uh humanoid robot and somewhere Downstream ideally you know in the on the order of five years from now we've got a body that's ramping and is highly performant and highly flexible and the software is also converging on this point of being truly human functional in the in the S like what we want a robots where you say go do that task you've never done before here's how I want you to do it you know verbally and and it can do it and it can do it with a level of of uh of uh competence that is comparable it's on the same scale as what a human can do it doesn't need to be as good as a human for to I mean there's so many jobs that do not tax human capabilities that we use humans to do and so the robot can start out doing those kind of things so I would say that's where we are right now I mean both of both the software and the hardware are in refinement stages and we're going to see them gradually get refined you know in we are we may be within a year of there being lots of practical 
applications for robots I think we are probably significantly more than a year away from the you know being able to you know uncreate your a ro robot for your home that can do your laundry and do the dishes and mow the yard and walk the dog and all that kind of stuff I I we're we're still a little ways away from that so that's my that would be my 10,000 foot view of where we're at I actually have one question I want to ask real quickly of you James before we move on and that is you know is it a coincidence that we see uh Tesla really moving into this realm with the Optimus project right around the exact same time that we are getting multimodal different versions of artificial intelligence that are potentially you know laying the groundwork for having software systems that can be much more unified instead of having you know basically multiple different pieces of software you know to handle audio input versus video input versus the control architecture um and then have to do you know deal with the the challenges of cobbling all of those together and and making something that works in a very seamless and unified way on the total coincidence versus totally not a coincidence I would put it closer to the coincidence end and the reason that I say that is because Tesla had pretty clearly made a commitment to this before a lot I mean the difficulty of doing multimodality looked really different two years ago compared to how it is now I mean one thing that large language models have done is they've brought multimodality they've they've they've shown us a way to do multimodality that is very straightforward which it wasn't before like llms uh they there are some things that they've shown us how to do and until llms really hit the mainstream and started getting traction got a lot of attention you wouldn't necessarily have assumed that that was going to be a big component that you were going to be able to use in the robot today it it it definitely seems like that is something that that you would do the the common sense that large language models seem to be able to absorb from their training Corpus uh and our ability to actually extract it and put it to use that's actually really surprising I don't think there were people in the space who thought that that was uh likely to happen around the you know uh before Tesla basically made their big move so I think on the other hand you know all of this is a consequence of us figuring out neural networks right it's uh we we basically figured out how to make neural networks work around uh 2012 2013 kind of time frame right and many of these things are are are consequences of that I think Tesla deciding now was the time to do robots itself was a consequence of the fact that we can get neural networks working and neural networks are so good at solving so many of the problems that you really want to be able to uh you really need to be able to do well in order to get a robot working and of course the llms and a lot of this multimodality stuff that's also a consequence of us figuring out neural network so these things are kind of moving in parallel but they're all kind of interdependent also so it's not entirely a coincidence I mean these things definitely feed back on one another but my sense of the timing is that Tesla had made a commitment to doing the robot because I think they could see the trajectory of the software like it was going to get there and if you look at if you if you imagine like how long is it going to take to get a robot body to like develop a robot 
body get it working and get it into production you look at that that's that's not one or two years I mean that's a that this that's a significant commitment it's going to take some time if you can anticipate that the software will be ready by the time you're ready to go into volume well that's good enough and I I I feel like that's kind of a better way of of describing you know Tesla's timing on this is that is that they saw that the software was going to get there and they decided now is the time to start working on the robot body so we have it ready when the software is ready because if the software is working and you don't have any robots you're gonna you got to spend five or 10 years figuring out how to make them and then get them into production so you've just wasted 10 years you could have had humanoid robots for 10 years but you waited until the software was perfect before you started the factory now we see people doing the other thing and especially Tesla doing it at scale like I have believed I'm on the record saying I think you really got to go to scale if you really want to unlock the potential of this kind of stuff and that takes time it takes somebody you know to really do the chops and that was why I said you know a few years ago like I wanted to see Tesla doing this because I because that was the approach I thought they would take it wouldn't be one of these you know we have a really good-looking robot we built 10 of them and we're waiting for the dod to give us a billion dollars to build a factory so we build a hundred of them right and we're going straight to Millions okay James so I'll sort of agree with you on your your overview of it and say that there's always been kind of this chicken and egg problem with what do we need to have first the software and the hardware and how good does a hardware have to be for the the robot to be useful to be able to do useful work and remember from the standpoint of Tesla it has to do useful work in the factory it doesn't have to do useful work in the in the home and um I was also pretty much on the record saying that looking at the original Tesla bot design it's like well it'll be able to do maybe your laundry but just won't be able to fold the clothes it had kind of the dexterity and the capability of that and I wasn't sure it would be able to go and start folding t-shirts or something like that and I was sort of proven a little bit wrong not by Tesla but by the the Stanford group but before Aloha about a year ago when if you remember they were using very simple tele manipulation with a really crappy grippers and they were able to do things like you know open a can do a Ziploc bag also close a zip and stuff like that and what they were able to prove is that it wasn't so much of a hardware problem as it was a neural network problem in this case the neural network of course was a human demonstrating that if you have a good enough net you can take really crappy Hardware now of course if you have better Hardware you're not going to be taxing your network so much you might be able to perform a little bit faster so seeing the Tesla about being able to fold a t-shirt wasn't really a surprise however it's something that had to be proven and again it was they're proving the hardware capability using a neural network which is not quite the software yet but again James is absolutely right if you wait for the software to be perfect you won't have the hardware right there and the question is is the hardware good enough we know the Gen one was good 
but not quite good enough was able to do a lot of things and now they're getting ready to um get the next version and the question is like is the hands good enough to be able to do everything do you need to have hands that are good enough to play piano or not that good and still be able to do something as useful as maybe folding a t-shirt which we know you don't need to do so they they're sort of showing you could do a lot more with the software than you thought you could and you don't need to have this perfect human hand you just need to have one that's kind of good enough the question is where that is and they're starting to answer that question right now and the question really is do they feel confident enough for the hardware design right now that they will scale it up to have enough to start doing the training that they need to do to be able to start helping the software because again we're getting a little bit of a chicken in the egg that the software I think can only get so far without the body to put it in to start doing enough demonstrations enough testing enough training and if you only have one or two Bots for training it's going to take you a long time to get the training data yeah okay so we're in agreement about that now looking at where we're going forward and what this other stuff means in consistent with the description that I just gave I think think that what we're looking at now is a number of years of gradual refinement of both the hardware and the software sort of converging on getting both of those things to converge I think Scott was making the point that there are lots of useful things you can do with the robot today in a factory where you have a large range of tasks at your that you could decide to apply a robot to and you can adjust the task to fit the robot so you can start with a somewhat simpler robot somewhat simpler software get use out of it get experience with it put wear and tear on it learn how the actuator stand up the body stands up figure out you know what additional features you need in order to expand the functional envelope you know and that's turning the crank so I wouldn't be surprised if Tesla can start using robots in the factory this year for some simple jobs and that that will gradually you know grow over time but I I do think there are significant me functional physical features that need to be added before you're it's going to be able to do the R the full range of things that we would like a humanoid robot to do and certainly the software has a lot of refinement that it's going to need um actually figuring out the architecture of the software I think even at this point is something that really hasn't been settled on because we are seeing so much um Improvement in surprising directions in what you can do with neural networks that a plan that you came up with last year for how you might want to approach it you might see a new development say hey we could use that and you know which that's what we see with FS right um they didn't initially uh you know a year ago they weren't thinking that they were going to be mimicking human drivers in order to do the planning stack but it you know there were developments in the field they did some internal experimentation that turned out to be useful and so they were able to Pivot and go in that direction we you know we're seeing that leak into the Tesla to Optimus now and and we may see more of that and we may see may see more pivots in the future too U depending on yeah this getting common sense and 
planning out of the uh knowledge that llms absorbed from their training corpora that's that's a really powerful technique and we haven't figured out yet how to fully exploit it in the robotic space to the extent that we do that that's going to simplify a lot of other things that we might otherwise have to do the hard way yeah James you you mentioned a little bit about uh some of the changes you might like to see in the the robot going forward so we CLE since the last time we spoke there was only the Gen one robot and now there's been the Gen 2 robot and of course these these new kind of ways of training the robots that we've seen that figure has has shown us and that Tesla just showed us recently with a t operation um now they added two more degrees of freedom to the Gen 2 robot which is in the neck and the question is do they really need to add it there um we seems like they cleaned up the design at a bunch of places but didn't really radically change anything as far as the degrees of freedom or or move anything around too much there it just seems to be a cleaner design I'm not quite sure where they lost 10 kilograms uh do you think they replace some metallic Parts with some uh plastic parts is it possible as far as the rigidity and uh do you think there's any other kind of modifications that you would have made that you wish you had seen come in the Gen 2 bot that you think would show up in the Gen 3 bot well where did they lose 10 10 kilos right I think they probably lost 10 kilos in bits in lots of different places I think they probably lightened the frame I think they might have done some materials changes in the frame at the the lower I mean the structural parts that we can see they look similar to to the shape they had before they re they redid the electronics and they changed the harnessing so they probably were carrying a lot of gratuit you know unnecessary weight in both the harness and the electronics because they they had taken the electronics and just adapted it straight over from the car so essentially purpose-built Electronics that's going to save you some weight purpose built harnesses that's going to save you some weight uh especially like if you go to like you know in the cybert trck you know you have a can bus if you have a bus that runs around the robot and you have local actuators it really cuts down the amount of wiring that you need for communication if you don't have have to run separate you know pair of wires to every actuator lightening up the um the actuators right like getting the actuators so that so that their so that their performance envelope is actually a better fit to what you actually need that'll let you save weight you could change materials in the actuator like you know initially the prototypes that they were Machining they might have been using one grade of materials and doing you know subtractive machine maybe they're doing some additive I mean there are lots of lots of opportunities I feel like for getting uh for getting weight out of the of the body and I'm not surprised that 10 kilograms is pretty good for one iteration but it they they really cleaned up the design a lot I feel like between those two they uh you mentioned the degrees of freedom on the head I that was actually that was a fascinating design change in my opinion to to to choose because as we were talking about offline it's not really functional like the robot doesn't need those two degrees it's expensive and complicated to add two degrees of freedom in the actuators and all that kind 
of stuff to the head and robots they don't need to turn their head they can see you know in all directions at once it's cheaper to add more cameras than it is to add actuators you know to a head so why do you do that well look at how different it seem you know you know you see the the video with Optimus like you know looking around twiddling his fingers it your your human factors have just improved so much like if you're going to put that on an assembly line working next to a human being the body the the the robot's ability to move its head and have expressive body language so that humans around it can understand you know it can look confused now which you know that's hard to pull off with a robot you can do all of these really interesting things that yeah you know it can it can nod when you're talking to it and or turnning you know if it doesn't want to that it's just a great piece of human factors so you've seen other people doing a thing where they put a screen on the face and they have icons or little face so that they can have and that's been tried a bunch of times but I I kind of feel like having the thing just be able to like nod its head or you know do expressions with that kind of stuff and to to tell you what it's paying attention to by turning its head to focus on that I think that's brilliant and it like it it was surprisingly powerful I think when you see it in the video so I think that that was a really interesting decision that would not have been on my list of additional actuated joints you know but in retrospect I think it was a really it was a great idea like I'm really impressed with how much um you know how much functionality they're getting out of that Rel relatively inexpensive change right yeah I'll I'll I'll jump in on that for a second there James about the head because there was a mystery to me is that if you look at the the roll of the wrist there was an actuator purpose-built actuator just for that and that's the only place that showed up in the body all the other actuators were being used in more than one place and so I was saying man it's like they really designed an actuator just for the wrist roll and not to be used anywhere else in the body and I was thinking and plus it had all these funky kind of attachments and things sticking out from it that were nonfunctional that just didn't belong in there so I think they originally designed that to go into the head and they finally decided to put it there because they I think according to the Isis book The Head was moving in the beginning and I think in bumblebee or something like that and just the head movement itself was causing it to fall over when they were walking and so Elon just said get rid of it we don't need it and it gave it a bit more stabilization so now it see it came back in there so I wasn't totally surprised and I'm pretty sure it's this one that's up there in the neck and the one thing is unfortunately optimists won't be able to go because they don't have that degree of Freedom it's got this degree of Freedom it has get that degree of freedom but I'm not sure it could do that which is why I've always said it's GNA suck at soccer because it can't do the header that way um the other question that I had James on the the addition of those two actuators was whether or not it was really more about making the bot more compatible with the movements of the T operator because you know like you pointed out earlier a robot has that perfect 360 degree Vision but a human does not and so if a human is 
teleoperating it it has to move its head to look in the direction at least if it's beyond you know where it can direct its eyes to and um so I didn't know if it was really more about the training and the use of teleoperation in the data Gathering portion of Optimus um or if it was actually a functional need for Optimus itself it certainly actuating the head is definitely going to simplify certain aspects of the teleoperated data Gathering stuff that's that's certainly true I I think you can probably do it you know it's not super complicated to do it without that but if you do do it you you know you simplify your data Gathering to some extent because you can just have a camera and the operator it can contct the operator's head to the extent that they can do that that the because as Scott said it doesn't have all the flexibility I mean human beings we have you know nine joints in our neck and each of them is has like two do at least two doof maybe three doof depending on you know how you do that so human necks are pretty flexible but uh they did a pretty good job of mimicking human movement with the two I I thought that the that what they were doing was like the super super spinatus type of wrist thing where they've got you know the heads mounted on like a ball joint and they have two uh actuators that are differentially driving the head to get the left right up down motion as a pair and the head's really light so you don't need much power in it like I would be surprised if it was this if it was the same maybe it is the same actuator I don't know the um but it like I would have thought that the head actuator was yet another actuator that they weren't already using somewhere else in the body because it seems like the the only other super lightweight power profile ones are the ones the fingers and those probably wouldn't be appro I mean you could spring the head and then use the finger things if you wanted to maybe that would do it um I don't know it's we can't see right we haven't seen any details on on how they're doing the head but no I think Hans you make a good point there is a Synergy there between being able to turn the head and uh and sort of being able to directly take data from a human operated uh T operation rig and and train the robot with it so maybe that's useful too yeah I think the the only reason I'm thinking they went ahead and used um those two actuators is because the original design showed that for the neck that they had two actuators up there to be able to make go back and forth but you're right we can't see enough and that would be another way of being able to do it is is by making it almost like a a little steart platform that's able to move it itself around in almost any way um but again it could be that that would mean they'd have to come up with yet another actuator unless I mean that could also be a good guess may maybe the actuators and the fingers could be used uh for that so for some reason they decided to move that way and if you remember when we did see the one video of the T operator um he was moving his head a lot so that from from last February when the blocks were being done and we were able to see the headset on there the operator was not able to keep his head fixed it was constantly moving around yet you could see the bot when it was doing the same thing its head was fixing was not mimicking that and so they may have decided that again just for T operation it was very important to do it plus you get the other benefits like you say is that everyone else 
around there can sort of tell what the robot is doing yeah I think the body the human factors thing was I mean I had this I I get the impression other people had this expression too you know when we saw that little that last teaser video for the most recent generation of the robot and that that motion where like it's looking around you know and twiddling its fingers it's so so human it's it's crazy and a huge chunk of that is the head movement right that that that's really what you know if it been doing that imagine doing the same motion without the the the fingers is not nearly as impressive uh it's not as nearly as kind of U it's not a gotcha moment it doesn't it doesn't get you in the same way when you do that like that's just like I was I was really impressed with how powerful the gesture of just turning that being able to turn your head and indicate interest and you know because we do use you know in a very natural way the attitude of our head to to to as as a communication mechanism like it it's almost as good as facial expressions right which like anybody who's seen a Disney Pixar movie knows it doesn't take very many degrees of freedom to get a lot of expression out of something right if they can make a if they can make a desk lamp you know if they can give it a personality well you can do that with a robot too but motion really helps right it really does one of the things that I've heard is that you know part of the reason that some of the more advanced mammals have white around the edges of their eyes instead of you know reptiles where it's pretty much solid color is actually to give other members of you know that Community the ability to track where your attention is by seeing the direction that your eyes are pointed there's a lot of things that we use for non-verbal communication and where our attention is directed is one of those things uh we actually do read a lot into maybe you know subconsciously most of the time um and so it'll be interesting to see if we end up getting any additional things that are added to the Optimus spot that allow us to have more more intuitive Communication in a non-verbal way with it yeah I think that that attention Direction thing is it's clearly a really big component of human communication I I don't know if you've ever thought about it's sometimes sit down Scott try to work out the numbers of like how much res resolution do you need to be able to tell if someone who's 30 feet away from you is looking at you or not because if try this sometime when you're at a party or something like that look across the room look at somebody who's 30 feet away and and ask yourself who that person is looking at because whether they're looking at you the person next to you somebody six feet over something like that it's it's amazing how you can tell even in a crowded room even with lots of targets even with somebody who's really far away exactly where they're looking part of it is that the contrast between the Scara and the cornea right that that we can you know that that that's a trick but uh but humans like that's how important that is to us like as we our our brains dedicate a significant amount of resources to to tracking what the attention of other humanoids in our environment is that that helps us understand the Dynamics of a situation really well and of course head posture is that's a component of it too like being able to estimate that there's another thing humans are really good at you can tell to a very very fine degree exactly what direction somebody's 
face is pointing like where somebody's nose is pointing like you can you can tell that uh a long way away and because it's important communication and line workers people working with robots they'll be able to make use of that in a really natural way you know now that the head can move around doing this doing the trick with the eyes is probably a little bit more complicated because you can't mimic it's hard to mimic the movement of human eyes and the exact geometry and all the stuff that you would need to do with that but the head that's almost a gimme and and and it was great uh like I'm very impressed that they realized that and were immediately able to make such great use of it well if you look at what they also added to the head there uh that that sort of LED that's outlining everything almost looks kind of like a beard that's going around um is that going to be a way of being able to also communicate a little bit so having that feature makes it a lot like imagine if they just made the whole head black and so all you had was the outline then telling the direction that it's facing is a lot harder right you want some markings but you don't if you put markings on the face it's a little bit weird so instead what they did was they they put a perimeter around it and it's illuminated right it really stands out so and that helps your eye and it's a it's a thin line too like it's much easier for your eye to tell like if they had colored the whole side of the head or something it would be harder to tell but because you have that irregular shaped line on the head it makes it really easy to by ey to discern like what direction the head is facing so yeah I I think that it's functional I mean it it looks nice too it's a nice design element as well and imagine they'll have color coding as well that um green for operational or or red stay away I'm in a bad mood whatever yellow black blinking gone berserk um sure yeah you could do all or you could you know it could be the way you tell them apart too if you've got three or four you know if you're working in room with a bunch of optimine they all look exactly the same there will be situations where that is a little inconvenient you know if you've got multiple ones working different stations crossing paths and that kind of stuff for for human operators to be able to identify them by you know an individual one or by category like what job is this guy doing right now that that uh that could also be really useful I mean I imagine they'll probably be all this kind of stuff maybe they'll wear t-shirts too logos jerseys with a number on the back yep 42 get over here I bet they'll get names it you know just because so Scott I was trying to make the case earlier and I I asked uh Hans to queue up a couple of different videos um to support my contention that the that the that the that the essential building blocks of what we need to get uh to be able to like train robot bodies to have the kind of sort of functional Grace dexterity and whatnot that human beings have that that that the that the building blocks of that we've had for a while and I had queued up a couple videos here that just kind of look at what's been done say over the last several years in various different ways so this one this video the Aloha robot video this was you were talking about it before this the Stanford lab basically demonstrating you can really do a lot with just a pincher robot uh you know essentially you can your Hardware can be pretty limited and you can still do pretty amazing things 
with it. But there are lots of things that you can't do with that. Like, I'm shocked that these guys figured out how to crack an egg with a pincher robot and not smash the egg and get a good result — I don't know how many takes it took to get this, but the fact that they can do it at all is pretty impressive. Or handling a spatula — handling a spatula is kind of tough; they're definitely designed for human hands, and I can imagine spatulas this robot wouldn't be able to use just because of their shape. And there are lots of things like, well, try to use a power drill with a tweezer robot — that's just not going to happen; there are all kinds of human tools that you can't use with this. And a lot of the movements this robot is doing, you're not going to train by mimicking people: you're going to gather some data with a human teleoperator or whatnot, you're going to do a lot of reinforcement learning training, then you're going to have a planner, and you could get something pretty functional, but this is not going to be a humanoid robot. To be a drop-in replacement for a human, you need a range of motion, degrees of freedom, grip strength, and so on that basically match physically what a human being can do. Hans, you want to pull up the DeepMind video? Let's talk about that one for a minute. Okay, so this thing actually was a breakthrough. This was DeepMind — they started out basically showing that you could do reinforcement learning on Atari games; that was a cover story in Nature magazine some time back, and it was a really big breakthrough, because being able to do reinforcement learning against Atari games was really tough. Once they had done that, they had these long time-horizon training problems. The reason you use reinforcement learning is for situations where you have to do many operations and then you get a reward after many, many steps, and you have this essential challenge of figuring out which of the many, many things I did made it work. One of the processes that has this property is trying to get a humanoid robot to stand up or walk or run or do parkour or whatever: your reward is staying up, but you do lots and lots of things in the process of staying up. So figuring out how to train something like that — which of the many things I did allowed me to stand up, allowed me to jump across the gap, and which did not — that's really tough. Reinforcement learning was a natural fit for this, and this paper is DeepMind basically demonstrating that it works for robots, and that was a really big deal. We did not know how to do that; we weren't even sure you could use reinforcement learning for this kind of problem. So that was a really big deal, but once we knew you could do it, all of a sudden doing it well was a matter of refining the process. And that was a number of years ago, so this really critical building block for robots we've had for quite a while.
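To make that credit-assignment problem concrete, here's a minimal sketch — my own toy example in Python, not DeepMind's code — of REINFORCE with a single sparse reward at the end of an episode. The environment, feature vector, and learning rate are invented for illustration; the point is just that every action in a successful episode receives the same terminal credit, which is why long-horizon reinforcement learning is slow and data-hungry.

```python
# Toy illustration (not any lab's actual code) of why sparse, delayed rewards make
# credit assignment hard: every action in an episode is reinforced by the same
# terminal reward, so the learner can't tell which of the many steps mattered.
import numpy as np

rng = np.random.default_rng(0)

def run_episode(policy_w, horizon=200):
    """1-D 'stay upright' toy: state is a tilt angle; reward arrives only at the end."""
    angle, log_grads = 0.0, []
    for _ in range(horizon):
        state = np.array([angle, 1.0])                        # simple feature vector
        mean = state @ policy_w                               # linear-Gaussian policy
        action = mean + rng.normal(0.0, 0.1)
        log_grads.append((action - mean) * state / 0.1**2)    # d log pi / d w
        angle += 0.05 * angle + 0.1 * action + rng.normal(0.0, 0.02)
        if abs(angle) > 1.0:                                  # "fell over": no reward
            return 0.0, log_grads
    return 1.0, log_grads                                     # sparse reward: stayed up

w = np.zeros(2)
for _ in range(500):                      # REINFORCE: every step in a successful episode
    reward, log_grads = run_episode(w)    # gets the same +1 credit, whether or not
    for g in log_grads:                   # that particular step actually helped
        w += 1e-3 * reward * g
```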
But one thing I want to draw your attention to: these agents, when they're trained, their goal is just the goal. You can see this thing is running with its hand up, pumping its fist, and that's not functional. Why is it doing that? It doesn't matter to the robot; it's not trying to be efficient, so it doesn't look like a human being — it looks kind of goofy. And that has some costs: in addition to training it to be able to do the motion, you also want to train it to be efficient. People tried just training for efficiency, and you do get more realistic-looking results — things that don't look quite as crazy and don't have lots of extraneous movement built in — but it turned out there was a more straightforward way to get really good results and actually get it to converge faster. Hans, you want to pull up the DeepMimic one? Okay, so show us some spin kicks — here's cartwheels, backflips. Okay, so this is also using reinforcement learning to train a robot body. Now why does it look so great? The way this is done is you do motion capture on a human being doing some kind of action — incidentally, this work was done for training video game characters, it wasn't used for training robots, but you can use it to train robots — and it demonstrated a really powerful technique. You do motion capture on a reference creature — a human being for a humanoid robot, a dog for a dog robot, or whatever — and now when you do your reinforcement learning, instead of giving it a really simple goal like keep your head above ground and move forward, or move the center of your body closer to the target, you give it that goal, but you also give it extra points to the extent that it can keep its body points — its elbows and knees and so on — in the same configuration relative to its body as the human being. So it's mimicking a human, and this is the core version of all the mimic stuff we see getting done today. It illustrates a couple of aspects of why this is so powerful. One of them is that you automatically get efficiency, because you're imitating an efficient gait. Instead of having to discover and solve for efficiency separately — which you can do, but it's complicated and can be time-consuming, and you won't get humanlike behavior out of it, because what's efficient for a robot and what's efficient for a human are different — a lot of the time we want the robot to move like a human being because of the body-language thing. And there are a lot of tasks in the real world that we know how to do really well because humans have figured them out, and by watching a human do one you get a very efficient, very well-structured breakdown of the task. So just mimicking a human directly is a pretty powerful thing. Okay, so this is also an old paper — this is from five years ago now, so this is not brand new stuff. This is work that's mostly done on a PC with a GPU, using a little bit of motion capture; it doesn't take a lot of resources to do this, and the technique has been around for a while. But look how amazing these motions are — they're incredibly human-realistic. And then they demonstrated — see, this is an Atlas — the same motion mapped onto an Atlas. Why can you do this with the weight distribution being so different? Because it's trained with reinforcement learning. The fundamental training is still reinforcement learning; it's just that instead of giving it the simple goal of moving close to the target with whatever floppy limbs you want, you get extra points for having your elbows and knees and hips do what the human did in the motion capture.
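A rough sketch of the reward shaping James is describing — a simplified, hypothetical version of the DeepMimic idea, not the paper's actual reward terms or weights: the agent keeps a task goal, but most of the signal comes from matching the motion-capture reference pose at the same point in the clip.

```python
# Rough sketch of a DeepMimic-style shaped reward (my own simplification): the agent
# still gets a task reward, but most of the reward comes from matching a motion-capture
# reference pose, which is what pulls the learned gait toward looking human.
import numpy as np

def imitation_reward(joint_pos, ref_joint_pos, com_vel, target_vel,
                     w_pose=0.7, w_task=0.3):
    """joint_pos / ref_joint_pos: joint angles for the agent and the mocap reference."""
    pose_err = np.sum((joint_pos - ref_joint_pos) ** 2)
    r_pose = np.exp(-2.0 * pose_err)                 # 1.0 when exactly matching the reference
    r_task = np.exp(-(com_vel - target_vel) ** 2)    # simple "move at the target speed" goal
    return w_pose * r_pose + w_task * r_task

# Example: a pose close to the reference earns most of its reward from imitation.
agent = np.array([0.31, -0.52, 0.12])
reference = np.array([0.30, -0.50, 0.10])
print(imitation_reward(agent, reference, com_vel=1.1, target_vel=1.2))
```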
So this is also a technique that we've had for a while. When I saw this paper five years ago, I thought, we're done — this is so good. One of the things we still didn't quite have at that point was the sim-to-real thing: getting something working in simulation and then putting it on a real robot, that was a challenge. It's still not easy, but a year or so after this paper came out, ETH demonstrated a super beautiful hack for it. Basically, you train your robot in simulation, and then you have your real robot, and the real robot is different in some important ways — the motors have a little more torque, the weight distribution you can never perfectly mimic in simulation — and the thing is, balance, especially bipedal locomotion, is such a fine skill that if you're off by more than a little bit, it just completely breaks. So how do you map between those two? What ETH figured out is that you could use a neural network as a shim layer. Essentially, you train a neural network to just flop the limbs around or something like that, and it predicts how the limbs should move; then you run that on the actual body, and you train a shim layer until it can make the real limbs move exactly the way the limbs moved in simulation. Now you have a shim layer that captures the difference between your imaginary body in simulation and the actual body, and it turns out you pop that shim layer into the real robot, take what you trained in simulation, drop it on the robot, and it works. That was beautiful — it turns out to be a pretty straightforward hack — and they demonstrated it, like I said, about five years ago. Since then we've known that reinforcement learning works for training robots, we've learned you can get really efficient, very human gaits by mimicking humans — that's something we've known for a few years — and we now know, not just the ETH way but other ways that people have demonstrated, how to close the sim-to-real gap, so you can train in simulation and move it to a real body. So we had all the essential ingredients some years ago for being able to use a neural network to control the body.
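Here's a small sketch of that shim-layer idea as described above — a toy of my own in PyTorch, not ETH's published method: fit a small network on data logged from the real robot so that, given the motion the simulation-trained policy wants, it outputs the command that reproduces that motion on the real hardware. The "real body" dynamics, dimensions, and data here are all stand-ins.

```python
# Minimal sketch (my own toy, not ETH's published code) of a "shim layer": learn an
# inverse model of the real actuators from logged data, then use it to translate what
# the simulation-trained policy asks for into commands the physical robot needs.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_joints = 12

# Logged from the physical robot: commands that were sent, and the motion that resulted.
sent_commands = torch.randn(4096, n_joints)
# Stand-in for the real body's slightly "off" dynamics: gain error + coupling + noise.
real_response = (0.9 * sent_commands
                 + 0.05 * torch.roll(sent_commands, 1, dims=1)
                 + 0.02 * torch.randn(4096, n_joints))

# Inverse model: desired motion -> command that produces it on the real hardware.
shim = nn.Sequential(nn.Linear(n_joints, 64), nn.ReLU(), nn.Linear(64, n_joints))
opt = torch.optim.Adam(shim.parameters(), lr=1e-3)
for _ in range(2000):
    loss = nn.functional.mse_loss(shim(real_response), sent_commands)
    opt.zero_grad(); loss.backward(); opt.step()

# Deployment: the policy (trained purely in simulation) asks for a motion; the shim
# translates that into the command the real actuators should receive.
desired_motion = torch.randn(1, n_joints)          # what the simulated body would do
command_for_real_robot = shim(desired_motion)
```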
Now, you still don't see people doing this — why doesn't Optimus walk as well as this mannequin does? I think the short answer is that robots are expensive, and experimenting with them breaks them. We're still at a stage — and this is a side effect of the fact that these robots are precious; they take a long time to build and to get working — where, if Optimus only cost $10,000, they would be a lot more willing to break a few Optimi sorting out a software problem, but at the point they're at right now they don't want to do that. So instead of using the techniques we're seeing demonstrated here to get really impressively flexible, dynamic, fast movement — Scott, you were talking about how fast they walk — they're still using extremely conservative systems for locomotion, because they don't want to break any more robots than they have to at this point in the game. And a lot of the stuff they want to do right now doesn't really need the super fancy locomotion. Right now Optimus's legs are just a way of moving his torso around the factory, and almost all the stuff they expect him to do in the short run is either using those legs to move from one spot to another, or standing as a platform for the upper body to do its work. So Tesla, at this point at least, has broken the problem down into two chunks: they've separated the locomotion out so they can do it in a pretty conservative manner — that's why Optimus's walking still looks so robotic; it is robotic. They haven't moved over to really doing the human-mimic, super-efficient thing there; they're being very conservative with that. But for the upper body, it's safer to go with mimicking human beings, so we're seeing them do that, and especially for things like block sorting there's a ton of stuff that you just need the neural network to be able to do. So I think that was the stuff — oh, we had a couple of other ones there, that ETH demo of massively parallel reinforcement learning for quadrupeds. When we first started doing reinforcement learning training, one of the challenges was that it could be really time-consuming: the training signal in reinforcement learning is relatively weak, so you have to train for a really long time before you accumulate enough information to have trained the robot really well. But people started sorting out different ways of doing this. This is another ETH paper that I like — they do a good job of making great demos for their papers — and what they basically did was design RL environments specifically for this. It's basically a chunk of code that runs on an Nvidia GPU, and they just put something like a thousand robots in the same space and give them this complicated environment to work over, and all of these robots have a set of goals that keep changing. So essentially they demonstrated you can do this massively in parallel — because it's not really a thousand different robots, it's really one robot control network that's running on a thousand different robots — and that became really efficient. This is them demonstrating running exactly the trained thing on an actual robot body; here the human is just being the harness while it goes up and down the stairs, but this quadruped learned to deal with these blocks, learned to go up and down stairs, learned all that stuff purely in simulation, and then they put it on a real robot body — they put that shim layer in, that's one of their tricks — and now the real robot can basically do the stuff it was trained to do. And incidentally, I think there's a stat in this where they only do 20 minutes of training from scratch on one GPU to train the neural network that controls the robot body for this quadruped. So anyway, once again supporting the notion that we've known how to do this — the guts of it, the essential techniques; we've known it was possible and approximately how to do it for a while.
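A toy sketch of why massively parallel simulation is so efficient — my own illustration, not the ETH or Isaac Gym code: the "thousand robots" are just one batched tensor, and a single policy network acts for all of them in one forward pass on the GPU. The stand-in physics function below is where a real GPU physics engine would go.

```python
# Toy sketch (my own, not ETH/NVIDIA code) of massively parallel RL rollout:
# one policy network, thousands of simulated robots, all advanced as a single batch.
import torch
import torch.nn as nn

num_envs, obs_dim, act_dim = 4096, 48, 12          # thousands of simulated robots at once
policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(), nn.Linear(128, act_dim))

obs = torch.zeros(num_envs, obs_dim)               # state of every robot, one row each

def batched_step(obs, actions):
    """Stand-in physics: in a real setup this is the GPU physics engine stepping all envs."""
    next_obs = obs + 0.01 * torch.tanh(actions).repeat(1, obs_dim // act_dim)
    rewards = -next_obs.pow(2).mean(dim=1)          # e.g. "stay near the reference pose"
    return next_obs, rewards

for _ in range(100):                                # every step advances all 4096 robots
    with torch.no_grad():
        actions = policy(obs)
    obs, rewards = batched_step(obs, actions)
```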
So now, as far as controlling the body in a natural way — using a neural network to get the kind of grace and flexibility that animals enjoy, because animals have powerful neural networks doing that — we know basically that this can be done, we know approximately how to do it, and now it's a matter of doing the hard work of actually doing it. I would argue that that is where we are for the body-control stuff. Now, planning is separate; planning is a separate, complicated thing, and planning for humanoid robots is another thing that needs to be resolved. To a first approximation, as they did with FSD, you can get a lot of traction just using non-neural-network planning — the traditional, good-old-fashioned-AI sort of methods for planning — and we do see them doing that. We also know there are lots of kinds of planning that neural network techniques are really useful for. Oh, this was the efficiency demo — this is a paper that actually predates neural networks in this design. What this researcher did was put muscles on skeletons in simulation and solve for efficient gaits, and what they demonstrated with this — well, one thing I take away from it as somebody looking at robotics, which wasn't the goal of the person who did this work, is that natural gaits and efficient gaits are basically the same thing. That is, it's no coincidence that humans look the way they do when they're walking, or that dogs do, or that birds do. The motions we have are a natural consequence of the shape and connectivity of our skeleton and the fact that we use muscles — because muscles are a certain kind of actuator with certain properties; for instance, muscles can pull but they can't push, so you have to have antagonist muscles, which are sometimes connected in different ways — and so the motion that comes out of that is a consequence of the muscles, the skeleton, and efficiently achieving some kind of target. So this is something we've also known for a while: we've known efficient numerical techniques for getting very natural movement out of arbitrary devices — you can see the great variety of structures this technique is being applied to. This is also being simulated in a physics simulation, so these things are being trained to do what they do in an environment; those blocks, incidentally, are just there to perturb the thing under simulation — it has to be more robust if it can keep walking while you're throwing bricks at it. Yeah, I think that was all the videos — oh, I think there's an Nvidia character animation one also in here. In this segment, by the way, what they're demonstrating is that you can train for a particular task, but sometimes you want the motion to be conditioned on an intermediate instruction. For instance, what that arrow was doing was basically an intermediate instruction — now go this way, now go that way — and the algorithm controlling the body was flexible enough that not only could it walk straight and walk sideways, but it could interpolate smoothly between the two if you needed it to. Or it can walk — you don't have to teach it
to run at every different pace you don't have to Tre teach it to run on every different ground you can condition motion actions that you've that you've trained on on other inputs so in the Nvidia one here this is another one where I wanted to highlight so Nvidia is actually doing this mimicking kind of a different way they're using a physics simulation there is some human mimicking going into this but this is another demonstration now uh Nvidia has a tool a commercially available tool that you can just get you can download this for free actually if you buy nvidia's stuff if you if you use their gpus and stuff they have they have kits that you can just basically find now once again these are for training video game characters because robots are expensive video game characters are cheap and there's a big application for video game characters but you can see how natural the movements are they're natural because they're they're trained in a physics simulation so that they end up looking like uh you know the way a body would actually move and then they train on uh you know you start with motion capture and a bunch of different Atomic movements and then the then the you can use good oldfashioned AI to combine the different Atomic movements into very natural sequences of of of motions and this would also work with robots right if your robot was cheap enough that you could afford to break it while you were sorting out the software it would work with robots so uh the so the guts of the software we've known for a while how to do it there's a lot of detailed work to do to get us there and S you know as kind of a separate but related track you know as we were talking about earlier we need to get we need to get the we need to figure out how to make a cheap body a good body you know one that has the degrees of freedom that can mimic what a human being does and then ramp that puppy up so that so that all of this other work is Justified you're not building five robots and getting you know five humans worth of economic value out of it you're building a million or a billion and you're getting millions or billions of you know people's worth of physical labor value out of it so one of the questions I had James is you know when humans learn to walk obviously we're much shorter we weigh a lot less our bones are a lot squishier and um you the other thing is that we kind of learn how to walk in a more or not how to walk but how to fall in a more graceful way that's less painful um you know that's something that people who do something like parkour or gymnastics or you know these types of things that part of the first part of their training is that they learn how to fall in a way that is going to do the least amount of damage and they obviously operate uh when they're training in an environment where things are a little bit more squishy to absorb some impact and so I'm curious what do you think the role of trying to make those types of accommodations for this initial learning experience for various robots how do you think that that's going to play out what's it going to look like practically I think that when we actually get to scale we we probably won't be doing that stuff um I think what what we're what you're going to see is people are going to develop really powerful models where you you train them in simulation you do the Sim toore conversion and they're and they're really good at dealing at at the thing the techniques aren't sufficiently refined right now that people want to bet lots of expensive robot 
And I think it's not a priority right now. The thing that you said about humans, I mean, humans learn to walk under much easier conditions; it's a lot easier to learn to walk if you're a foot or two feet tall. The taller you get, the harder balance gets; there's a scale dependency on this stuff, so the bigger you get, the harder it is to do that kind of thing. Learning to walk when you're small and then smoothly transitioning to being tall, that's a good trick if you can pull it off, but I don't see people doing that with robots.

People have suggested curriculum training for robots. Curriculum training is basically where you give it something easy, then a little harder, then a little harder, so you're guiding it toward doing something more difficult the same way we do; when you go to school you learn the easy stuff and the hard stuff builds on the easy stuff. People have tried using that idea. It's mostly useful in situations where you're trying to be really efficient with your compute for training your core model, but I think the techniques are turning out to work so well that they're probably not going to need to do that, and it's inconvenient to have all these extra steps. If you can just brute force it, go straight to the target body, train in simulation, get it working really well, and then have a shim module that adapts what you learned in simulation to the actual physical body, that's easy, right, if you've got a straightforward way of generating the shim module. And if you can close a loop where you drop those neural networks into the body, have it walk around a little bit, and it refines the shim module in the actual body, then you can constantly update the shim as the robot wears down, because you get wear and tear on the robot and the bodies will change slightly as they age. In order for the robot to gracefully accommodate those changes to its body as it ages, you want to constantly be measuring how the body is actually responding and slightly tuning that interface module so that it's always working pretty well.

Very little things can have a big effect. I think, Scott, you had mentioned earlier, or maybe it was Hans, that when you actuate the head it makes walking a lot harder. Well, heads are heavy, right, and they change the center of gravity when they move back and forth, so it's a thing you have to allow for; it has to be built into your model, and your model has to be able to handle it. If you train a model that doesn't have a head and then you put a head on the body, it's just going to throw itself off balance constantly, because heads are heavy and they're up at the wrong end. If you shift some weight around at ankle level it doesn't have as much impact as if you shift weight around at the top of this thing you're trying to balance.

Yeah, and of course you see chickens and things like that when they're walking, they're moving their heads around sometimes to actually keep their balance. They don't have arms and they don't have a waist, right, so they ended up having to do that.
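As a very rough sketch of what such a shim could look like, here is a per-joint offset correction that is slowly updated from observed tracking error. Everything here, including the class name, the 28-joint count, and the specific numbers, is a made-up illustration; an actual shim would presumably be a learned model, but the idea is the same: the big sim-trained policy stays frozen while a tiny, cheap-to-update layer absorbs wear, unit-to-unit variation, and drift.

```python
# A minimal sketch of an actuator "shim": per-joint command offsets
# refined online from the gap between intended and achieved motion.
import numpy as np

class ActuatorShim:
    def __init__(self, num_joints: int = 28, rate: float = 0.01):
        self.offset = np.zeros(num_joints)
        self.rate = rate

    def apply(self, intended):
        # adjust the policy's intended joint targets before sending them to the actuators
        return intended + self.offset

    def update(self, intended, achieved):
        # if a joint consistently falls short of its target, bias its command the other way
        self.offset += self.rate * (np.asarray(intended) - np.asarray(achieved))

# Usage sketch: after every step, compare what the policy asked for with what the
# joints actually did, and let the shim soak up the difference over time.
shim = ActuatorShim()
intended = np.zeros(28)
for _ in range(300):
    commanded = shim.apply(intended)
    achieved = commanded.copy()
    achieved[3] -= 0.02          # pretend joint 3 droops by 0.02 rad (wear, backlash, ...)
    shim.update(intended, achieved)
print(round(shim.offset[3], 4))  # converges toward +0.02, compensating the droop
```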
And it was sort of funny that the Tesla bot in the early days was falling over because of the movement of the head, because they weren't accounting for that. And it's an interesting thing that you talk about the shim model, because it seems like we have an internal shim model. Of course we've talked about proprioception and how important that is, and there's this famous trick that I always do: I can close my eyes and touch my nose every time. I've been able to do that for years and years, yet my body has changed, my muscles have atrophied, my joints are a little bit stiffer, and I can still do it. That's because every day when I get up, my body is basically going through and recalibrating for the fact that I am wearing down. The same thing is definitely going to happen here: it happens to industrial robots, and it's certainly going to happen to humanoid robots. They are going to wear, they're going to get out of calibration, they're going to fall down, something's going to get a little bit torqued or bent one way or the other, and you just kind of get up, look at where your hand is, kind of like what RoboCop did, do a few things, calibrate, and you're set to go. So it's not surprising that a module like that would be in there; that's definitely what I would expect. You're going to have to do it no matter what, because the robot is never going to be perfect even from day one, it's going to be a little bit off, and especially going from one bot to another, they still won't be exactly the same. So you're always going to have to have an adjustment, and that's the reason we have sensors like eyes, because that's what allows you to constantly recalibrate and make sure you're on target. Even if your internal model says, oh, this is what I have to do to get to that target, and you miss it by a centimeter, the fact that you can see it means you can move over a little bit and now you're going to get it for sure.

Mid-course correction, right, and it's an absolute reference. But robots, actually the robot bodies we build today, have a big advantage in that their proprioception is much, much better than human proprioception in terms of how accurate it is. The actuators we're building right now have absolute position sensors in them, whereas human beings have velocity and tension sensors, and our brain integrates those to get a position out of them, and that drifts over time. You were talking about being able to touch your nose: if you close your eyes for a long time, like if you wake up in the morning before you open your eyes and try to touch your nose, it's actually really hard to get it on the first try, because the proprioception for your limbs has gone out of calibration overnight; the DC offset has drifted on that kind of stuff. And this is a thing: you put humans in a sensory isolation tank or whatnot, and they can't tell what position their limbs are in after a couple of minutes, because there's no feedback to correct it.

One of the reasons I am so confident that we are going to be able to make really cheap, really versatile robot bodies is because the stuff we're building today is massive overkill.
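A toy numerical illustration of that drift point, with made-up bias and noise values: integrating a velocity signal alone accumulates error, the way proprioception does when its DC offset drifts, while an occasional absolute reading (an encoder, or vision providing a mid-course correction) keeps the estimate pinned.

```python
# A minimal sketch of drift from integrating a biased velocity sense,
# versus occasional correction by an absolute reference. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
dt, steps = 0.01, 2000
true_angle = 0.0
est_velocity_only = 0.0     # integrate a velocity sense: drifts over time
est_with_absolute = 0.0     # same estimate, but occasionally corrected

for t in range(steps):
    vel = np.sin(0.01 * t)                          # true joint velocity
    true_angle += vel * dt
    sensed_vel = vel + 0.02 + rng.normal(0, 0.05)   # small DC offset plus noise
    est_velocity_only += sensed_vel * dt
    est_with_absolute += sensed_vel * dt
    if t % 100 == 0:                                # an occasional absolute fix
        reading = true_angle + rng.normal(0, 0.005) # e.g. encoder or vision
        est_with_absolute += 0.5 * (reading - est_with_absolute)

print(f"error, velocity only:    {abs(est_velocity_only - true_angle):.3f} rad")
print(f"error, with corrections: {abs(est_with_absolute - true_angle):.3f} rad")
```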
When the neural networks are good enough to deal with the kinds of limitations that human bodies have, in terms of how crappy our proprioception is, and you put that on a robot, everything can be much simpler. We make robots stiff now because if they have vibration modes and whatnot, or if they flex under load, that makes the model that much more complicated; so we make robots stiff and we make actuators really accurate because it keeps the model simple. When the models can deal with all that complexity, we can really dial down the tolerances and rigidity and power. Human bodies have all kinds of crazy dynamic modes and we never have any problems with them, because our software is so sophisticated that it actively suppresses all of the modes we don't want expressed, and robot bodies will be able to do all that stuff too. If anything, we're learning it's much easier than expected. If you'd asked experts ten years ago how much compute power, how much sophistication it was going to take to do the kinds of things that animals can do, the bar would have been really high. But if you look at the size of the neural networks, the networks controlling those little dancing figures and whatnot are tiny, microscopic compared to large language models. It's hard to get them trained up, but once you do, very small models and very small amounts of compute can run that model on the body and get really good performance out of it. So if anything, we're learning that once you get it down, you can crank these puppies out in large volume at low cost, and they're going to be really impressive in the real world.

We hope you enjoyed part one of this Optimus Gen 2 deep dive conversation with James Douma and Scott Walter. If you enjoyed this conversation, stay tuned for part two coming next week, where we discuss the current capability of Optimus Gen 2 and the changes that we expect to see in both software and hardware over time as Tesla creates a more and more capable humanoid robot. Thanks for watching and have a great day.
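To put rough numbers on the size comparison made above between locomotion policies and large language models: the figures below are back-of-the-envelope assumptions (a small MLP of the kind used for simulated characters versus a 7-billion-parameter language model), not measurements of any actual Optimus or NVIDIA model.

```python
# Back-of-the-envelope parameter counts; all sizes are illustrative assumptions.
def mlp_params(layer_sizes):
    """Total weights + biases in a fully connected network."""
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))

policy_params = mlp_params([48 + 3, 256, 128, 28])   # a small locomotion policy
llm_params = 7_000_000_000                           # a "small" modern LLM

print(f"locomotion policy: ~{policy_params:,} parameters")
print(f"7B language model: {llm_params:,} parameters")
print(f"ratio: ~{llm_params // policy_params:,}x")
```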
Info
Channel: Hans Nelson
Views: 33,615
Keywords: Tesla, Elon, Musk, Elon Musk, Tesla Engineering, Tesla AI, Tesla Manufacturing, Tesla Dojo, AI, AGI, Technology, Software, Computing, machine learnings, AI models, Engineering, Tesla Bot, Bot, Optimus, Optimus Gen 2, Tesla Optimus, James Douma, James Doma, Scott Walter, Dr. Scott Walter, GoingBallistic
Id: 7RdGZDS_IQs
Length: 62min 55sec (3775 seconds)
Published: Thu Feb 01 2024