Thesis Defense (Lucas Manuelli): July 20, 2020

Captions
What do you think, should we start it up? Yeah, I think so. Okay, well, thank you all for virtually coming. Lucas is in the lab right there, I'm in the next room over, but that's the closest we could come to being in the same place for this. It's my honor to introduce Lucas here for his defense. I'm very, very proud that this day has come, and very sad that he'll be leaving, you have to know that.

So let's see. Lucas had a bit of a checkered past: he came to MIT as an economist, and we had to save him and bring him to robotics, and thank goodness for that, because he's just absolutely lit us up over the last few years. He came to our group right when we were in the DARPA Robotics Challenge, and he promised to show some of the videos from that; there's a great one of him jumping on the back of the Polaris trying to knock the robot over. Actually, the first project he worked on was trying to get Atlas to get back up when it falls down. Atlas is an almost 400-pound humanoid robot, and we were terrified that in this competition the robot would fall down and it would just be game over. So that was Lucas's project: okay, the robot falls over, let's have a plan so it can get back up. Long story short, it's a 400-pound robot with something like a 300-pound backpack, and given the kinematic reachability, if that robot falls down it is never getting back up. Now there's the new Atlas, which can get back up, and in dramatic fashion, but ours never did. But Lucas went on to work on the balancing control, and like I said, you'll see him and Andres jumping on the back of the Polaris trying to knock the robot over. He did an amazing job as part of the DARPA Robotics Challenge team. When we got a Valkyrie robot here, Lucas took over and got the control that we had running on Atlas working on Valkyrie in a ridiculously small amount of time; it was incredibly impressive work. So he's always been extremely solid on the dynamics and control side of things.

But then we started thinking more about manipulation, and Lucas started thinking more about manipulation, and as much as we like the dynamics and control side of things, you can't do manipulation without thinking deeply about perception. Lucas, with some great collaborations with Pete and other people in the lab, really helped this lab learn how to do perception. Lucas and Pete and others thought really deeply about what it means to do perception so that you can do interesting dynamics and control with a manipulation robot, and that journey has been so fun to watch and so great to be a part of. I think the series of results he's going to tell you about today, including some hot-off-the-presses results that I think none of you have seen yet (he just got back into lab to do the last few experiments), are a fantastic body of work. This is probably our last week in lab together, and I'm very sad about that, but I just can't wait to see what he comes up with after this. Lucas, please take the stage.

All right, thanks Russ. I hope everyone can hear me okay, and thanks for that great introduction. Today I'm going to talk about robot manipulation with learned representations, and I'll try to give an overview of what I've done during my PhD and how it all fits together. So, a little bit about me.
I'm a PhD student in the Robot Locomotion Group here at MIT, and as Russ mentioned, my research has been mostly in robotic manipulation, at the intersection of perception, control, and machine learning. That involves getting robots to do non-trivial manipulation tasks, like picking up a mug and hanging it on a rack by the handle. As Russ mentioned, I previously worked on the MIT DARPA Robotics Challenge team doing planning and control for walking robots. This is Andres and I testing out the balancing controller for the whole getting-out-of-the-car sequence, which was comical given how large the robot is and how small the car is. I had a lot of great times and learned a ton on this project. This was around four in the morning, I think, right before we sent the robot to California, and I think Russ was actually the one filming this video. So I still have a soft spot in my heart for walking robots.

Here are a few of the projects I'm going to talk about today; these four projects make up the core of the body of work for the PhD. One public service announcement on how questions are going to work: I think it's best to hold off on questions until the end. If you want to unmute yourself and ask questions, we can do that at the end, and in the meantime, if you throw a question up in the chat, I'll try to look there occasionally and answer.

To set the stage a little for what I mean by manipulation, here are some examples. Here's Spot doing this cleaning-up-the-kitchen task. On the top right (this is actually one of my favorite robot videos) is the PR1 from Willow Garage, the predecessor to the PR2, doing an amazing job cleaning up a living room, the caveat being that this is teleoperated. And on the bottom you have some examples of humans doing manipulation tasks, like making sushi, and cutting an avocado and spreading it on top. This is the kind of thing we should keep in mind when we're thinking about manipulation.

The goal is that we want robots to be able to accomplish useful manipulation tasks in the wild, by which I mean without the aid of a highly instrumented environment, an external motion capture system, fixtures, or things of that nature. You might ask the robot to do something with objects that maybe it's never seen before. I'm a roboticist, so I like to formulate this in the language of robotics and optimal control. With a bit of abuse of notation, because I'm using x for state when really we have to work from observations, we can write this down as an optimal control problem: find some policy pi that maps from states to actions to minimize some expected cost, subject to some dynamics constraints. I think this is pretty familiar to anyone in the robotics field.
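Written out, the problem looks roughly like this. This is a loose restatement of what's on the slide in generic notation: c is a running cost, f the world dynamics, and T a horizon, none of which are pinned down yet, and x is standing in for observations as I said.

```latex
\min_{\pi}\; \mathbb{E}\left[\sum_{t=0}^{T} c(x_t, u_t)\right]
\quad \text{subject to} \quad x_{t+1} = f(x_t, u_t), \qquad u_t = \pi(x_t)
```

Each of the three challenges that follow corresponds to one ingredient of this problem: the state x, the cost c, and the dynamics f.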
But I think manipulation presents some fundamental differences and challenges compared to other areas of robotics, and there are at least three. The first is that it's not clear what the state space should be, and by that I mean especially the state space of the world rather than the state space of the robot. For the robot we have an IMU, we have encoders, we can know what our state is; but what's the state of that shoe, or of the rope on the table? That's a much more nebulous question. Secondly, what's the cost function? In robotics we love having, say, a quadratic trajectory-tracking cost for something like LQR, but if I ask your robot to clean up the kitchen, it's not at all obvious that a trajectory-tracking cost is going to get us there. And lastly, it's not obvious what the dynamics function is; again, by the dynamics function I mean the dynamics of the world rather than the dynamics of the robot. Even though Atlas is a very complicated robot, it is well modeled by Lagrangian mechanics, whereas if I asked you what would happen if you grabbed that rope and pulled it, that's actually much harder to answer.

That naturally led to the three questions I try to answer in the thesis: what's the right state space or object representation for doing manipulation? How can you communicate task objectives to the robot, especially when you have to manipulate novel object instances? And how can you enable robots to actually do closed-loop feedback control for manipulation? These are the three questions that will underpin the talk, and I'll try to show how each project connects back to them.

The first project I want to talk about is Dense Object Nets, and this is really tackling the first question: what's the appropriate state space or object representation? So what is it? This is joint work with Pete Florence, by the way. It's a dense visual representation that's useful for a variety of manipulation tasks; it doesn't rely on having 3D object meshes or poses, and it can be learned self-supervised. The reasons we wanted to do this project were that we wanted a rich visual representation for doing manipulation beyond what I call the pick-and-drop task (I'll explain what that is in a second), we wanted to get away from relying on object meshes and pre-built templates, and we also wanted to be able to handle deformable objects to some extent.

If we're going to talk about manipulation, we should think about tasks, and there are at least two axes to think about tasks on: one is how complex the task is, and the other is how general your solution is. The simplest task you could imagine is just "pick up anything," and there's been really good work here; the solutions are very general in the sense that they can pick up any object you might put in front of the robot. Going down the list of complexity, you have the Spot cleaning-up-the-kitchen task, but I would venture to say that this maybe isn't the most general solution: if you look really carefully down here, you'll see an AprilTag that the robot is using to localize.

Now I want to talk a little bit about the "pick up a specific object" task, and this is basically what the Amazon Robotics Challenge was. You'll see a lot of this in industry now: robots doing these bin-picking tasks. The Amazon Robotics Challenge was essentially this task, where additionally you maybe had to pick a specific object, rather than just any object, from the pile. So how did teams actually solve the Amazon Robotics Challenge? I think most teams ended up with a solution that combined a grasp success model, which just answers the question: if I were to put my gripper or suction cup here and execute my grasp or suction, would I actually manage to pick something up?
And they combined that with a visual representation, which is just instance segmentation: which pixels correspond to which object. I think most teams' solutions were basically a combination of these two things, and there's been a lot of good work on both sides of this problem.

To set the stage a little, I want to talk about what the perceptual representations are in computer vision and how they relate to robotic manipulation. There was a revolution in computer vision in 2012 with deep learning and big datasets like ImageNet, and the representations started simple, from classifying the whole image as in ImageNet, through detection, up to instance segmentation, where you get a per-pixel annotation of what each object is. That's essentially the representation that was used in the Amazon Robotics Challenge. If you want to do tasks that just involve grabbing something, moving it somewhere else, and dropping it, then this instance segmentation representation is sufficient. But if you want to do a more complex, more purposeful manipulation task, where you actually want to grab something and do something beyond just dropping it, then you're going to need to know more than just which pixels correspond to the object; you need to know something about its structure. There have been other developments in the computer vision world, both sparse keypoint detection and these dense pose approaches, and the work we're going to do is very related to those two. Actually, this mug-on-rack example will be the next project, and that will use a sparse keypoint representation.

Coming back to this task hierarchy, Dense Object Nets, the current project, is going to do a task that's a little more than just pick-and-drop: we want to grasp the object by a specific point, say. This is just an example task to show off the object representation, and later we'll show what else you can do with that representation. We set ourselves a couple of goals for this project: we wanted it to work for deformable objects, and we would also like it to be self-supervised. Just as a preview of the results: you can click on a point on the object, identify that point with your camera, and go pick the object up by it, and you can even do that at the category level.

The starting point is really this paper from Tanner Schmidt and Dieter Fox, where they were learning these dense visual descriptors in the context of a human pose estimation task. I want to spend a little time on this slide explaining what dense descriptors are. I think of dense descriptors as essentially providing a coordinate system for the object. What we're going to do is take an input image, which is W by H by 3, and map it to what we call a descriptor image, which is W by H by D, where D is a hyperparameter, a choice; typically we use D equal to 3. We want these descriptors to have specific properties. We want the descriptor corresponding to a point on the object, say the tail of the caterpillar, to be unique and distinct from the descriptor corresponding to a different point on the object, like the ear. And we also want them to be invariant under changes in camera position, lighting, and deformation: the "address" of the tail doesn't change if I rotate the caterpillar around. That's really what allows me to solve the correspondence problem.
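To make the input/output contract concrete, here's a toy PyTorch stand-in for that mapping, an image of shape (3, H, W) going to a descriptor image of shape (D, H, W). The real network is a pretrained fully convolutional backbone, so treat this purely as an illustration of the shapes involved; the layer choices here are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenseDescriptorNet(nn.Module):
    """Maps an RGB image (B, 3, H, W) to a descriptor image (B, D, H, W).

    Toy stand-in: a few strided convs plus bilinear upsampling back to
    full resolution. A real implementation would use a pretrained
    fully convolutional backbone.
    """
    def __init__(self, descriptor_dim: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, descriptor_dim, kernel_size=3, padding=1),
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(rgb)                        # (B, D, H/4, W/4)
        return F.interpolate(feat, size=rgb.shape[-2:],  # back to (B, D, H, W)
                             mode="bilinear", align_corners=False)

def descriptors_to_rgb(desc: torch.Tensor) -> torch.Tensor:
    """For D = 3, min-max normalize each channel to [0, 1] so the descriptor
    image can be displayed directly as an RGB visualization."""
    lo = desc.amin(dim=(-2, -1), keepdim=True)
    hi = desc.amax(dim=(-2, -1), keepdim=True)
    return (desc - lo) / (hi - lo + 1e-8)
```

The second helper is one simple way to produce the color visualizations described next.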
If I choose D equal to 3, then I can map that descriptor vector onto the RGB color space and you get these really nice visualizations. What you should see is that different points on the object have different, unique colors, and also (I'll play a video of this later) that the color corresponding to a point on the object, which corresponds to the descriptor for that point, doesn't change if I rotate the object; the color of the tail should always stay the same.

So how are we going to get such a representation? We're going to use deep learning, and to train it we're going to use something called a pixelwise contrastive loss. We're given a pair of images for which we know correspondence (I'll explain how we get these in a second): we know that this pixel in the left image, which is on the right leg of the caterpillar, is actually the same point as this pixel in the right image. If you think about what we asked our representation to do, the descriptors corresponding to these two pixels should be the same. So we have what we call the match loss, which pushes those two descriptor vectors together. But that's only half the battle: if we only had that, we could trivially map every point to the origin and trivially minimize the loss. So we also need to ensure that pixels corresponding to different points get mapped to different points in the embedding space; the head of the caterpillar is not the background, so the descriptors corresponding to those two pixels should be different. This non-match loss just asks them to be a certain margin apart, and the total loss is a combination of the two.

Whenever you do machine learning you have to think about where your data comes from, so how do we actually get these pairs of images? We have a robot with a camera mounted on the end effector, and we can move that camera around to get many different views of a scene. Because we've calibrated the camera and we know how to do forward kinematics, we know the relative transform between the camera poses, and then we can back-project points. In practice this means I can take a point on the object, say the right leg, and because I know where the cameras were, I can back-project and determine that this pixel is the same object point as that pixel. This is how we get the supervision for our method: it's really multi-view consistency, and it's completely self-supervised, so we never had to label anything. Here's a picture of what our robot setup looked like at the time; we had these very homemade rubber-band fingers, which we've since improved, and you can see the camera mounted on the wrist. And this is what that process actually looks like: we make a scan, we know where all the relative camera poses are, and we can do that back-projection.
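Here's a minimal sketch of that pixelwise contrastive loss, assuming the descriptor images for the two views have already been computed and that the multi-view back-projection step has produced lists of matching and non-matching pixel indices. The names, the margin value, and the exact weighting are mine, and the scaling tricks mentioned next are left out.

```python
import torch
import torch.nn.functional as F

def pixelwise_contrastive_loss(desc_a, desc_b, matches_a, matches_b,
                               nonmatches_a, nonmatches_b, margin=0.5):
    """desc_a, desc_b: (D, H, W) descriptor images for views A and B.
    matches_a/b, nonmatches_a/b: (N,) flat pixel indices (row * W + col)
    of corresponding / non-corresponding pixels in A and B."""
    D, H, W = desc_a.shape
    flat_a = desc_a.reshape(D, H * W)   # (D, H*W)
    flat_b = desc_b.reshape(D, H * W)

    # Match loss: pull descriptors of corresponding pixels together.
    d_match = (flat_a[:, matches_a] - flat_b[:, matches_b]).norm(dim=0)
    match_loss = (d_match ** 2).mean()

    # Non-match loss: push non-corresponding descriptors at least
    # `margin` apart (hinge on the descriptor-space distance).
    d_non = (flat_a[:, nonmatches_a] - flat_b[:, nonmatches_b]).norm(dim=0)
    nonmatch_loss = (F.relu(margin - d_non) ** 2).mean()

    return match_loss + nonmatch_loss
```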
There were a few technical points needed to make this work well. I'm going to go fairly quickly over these and focus on the big picture, but I encourage you to ask questions at the end or read the thesis if you want more details. It's important to focus the representational power of the model on the object by doing some clever masking; you also need to domain-randomize the background, so that you learn a coordinate system that's really tied to the object and not to the background; and you need to appropriately scale the loss function.

So what do the results actually look like? Here we're visualizing a three-dimensional descriptor image (again, D is 3) projected onto the color space. This is Pete holding our caterpillar object, and what you should see is that as we move it around, the colors corresponding to, say, the tail or the ear stay the same, independent of lighting and deformation. Every frame is being processed independently here, we're not doing any tracking, and we can run this at about 20 Hz, which is roughly real time.

Fundamentally, what all of this has done is allow us to solve the correspondence problem. On the left you see a reference image with a green reticle at the bottom, and when I play this video you'll see me moving that reticle around with my mouse. The goal is to find the corresponding pixel: if I highlight, say, the right ear in the left image, I want to find the right ear in the right image. The center image and the right image are actually the same; you'll see the red reticle in the right image track the correspondence, and the heat map shows you the distances in descriptor space. I can highlight the right ear, go to the left ear, and because this is dense I can move all the way down the object and find correspondences. At its core, this is what those descriptor vectors are used for: finding correspondences across images or scenes. I don't have time to get into the quantitative results, but we did a lot of analysis, and our approach offers significant improvements over baselines; I encourage you to read the paper if you want to see more.

So what can you actually do with this for manipulation? You can click a reference point in a single image, store that descriptor, and now you can put the object down in front of the robot and always go pick it up by that point, in spite of changing object pose and, in this case, even deformation. Here we're showing that you can distinguish the right ear from the left ear, which is interesting because with something like a patch-based descriptor the right and left ears look very similar, but the receptive field of the network here is large enough that you can actually do this.

Throughout the training process we noticed that descriptors seemed to naturally overlap across different objects, which led us to ask: what happens if we train on different instances of the same class or category? So here are a bunch of different hats, and we train our model on all of them. Notice that because our method just uses multi-view consistency as the source of supervision, there's nothing in the data that connects, say, this MIT hat in the bottom-left corner to the Princeton hat in the center. But somewhat magically, what ends up happening is that you get very consistent embeddings in descriptor space across the category, even though there's no cross-instance supervision. This was a pleasant surprise.
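Mechanically, the "pick it up by that point" behavior, both for the caterpillar and for the category-level shoe grasping coming up next, boils down to a nearest-descriptor lookup plus a deprojection through the depth image. Here's a rough NumPy sketch; the camera intrinsics (fx, fy, cx, cy), the camera-to-robot transform, and the hand-off to the grasp planner are left abstract.

```python
import numpy as np

def find_best_match(ref_descriptor, desc_image):
    """ref_descriptor: (D,) descriptor stored from the reference click.
    desc_image: (D, H, W) descriptor image of the current view.
    Returns (row, col) of the pixel whose descriptor is closest."""
    D, H, W = desc_image.shape
    dists = np.linalg.norm(desc_image.reshape(D, -1) - ref_descriptor[:, None], axis=0)
    idx = int(dists.argmin())
    return idx // W, idx % W

def deproject(row, col, depth_image, fx, fy, cx, cy):
    """Pinhole-camera deprojection of a pixel to a 3D point in the camera
    frame, given depth in meters and intrinsics fx, fy, cx, cy."""
    z = depth_image[row, col]
    x = (col - cx) * z / fx
    y = (row - cy) * z / fy
    return np.array([x, y, z])

# Usage sketch: locate the stored point in the current image, then hand the
# 3D grasp target (transformed into the robot frame) to a grasp planner / IK.
# row, col = find_best_match(ref_descriptor, desc_image)
# p_cam = deproject(row, col, depth_image, fx, fy, cx, cy)
```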
So what can we actually do in terms of manipulation? Similar to the caterpillar, we click on a single point, here the tongue of the shoe, and then we can pick up objects by that point, but not just one object: lots of different objects from the category. These are all shoes from the training set (again, there are no cross-shoe labels), and here are some previously unseen shoes, and it generalizes quite well. Here I was actually standing behind the camera and threw my own shoe on there, and it still works.

Just to summarize, what is a dense descriptor representation? It works for essentially any 3D-reconstructable object, it's pretty sample-efficient, and the output, which is fundamentally solving correspondence, is pretty interpretable, which allows it to be used as input to a variety of other systems. You're going to see two other projects in this thesis that use this representation. The nice thing is that we take advantage of the structure and existing algorithms we know, like calibration, 3D reconstruction, inverse kinematics, and grasp planning, and we focus on learning the hard part, which in this case is the perceptual model.

The next project I want to talk about is kPAM, and coming back to those thesis questions, this is really tackling the first two: what is the right way to represent objects, and how can you communicate task objectives to the robot? This is joint work with Pete and Wei. What is it? It's a framework and algorithm for tackling category-level manipulation tasks, by which I mean tasks you have to do with many different instances of a category; here you see me doing this mug task, but for lots of different mugs. It's also a category-level object representation that can handle large intra-class shape, topology, and texture variation. The reason we embarked on this project was that we felt existing object representations couldn't quite capture what was needed for these category-level tasks; the dense descriptors show pretty good category-level generalization, but they weren't quite the right fit for this project.

If you recall, Dense Object Nets was doing the pretty simple task of grabbing an object by a specific point, which in and of itself is not necessarily a useful manipulation task. kPAM is going to do a task that's one step more complex: putting objects from a category into a specific configuration. We want to be able to do things like grab a mug and hang it on the rack by the handle, and we want an approach that can do this not just for this one mug but for all these different mugs; every object you'll see in this section has never been seen before by the robot. So we want one approach that can handle all the size, shape, texture, and topology variation you get within a category. More formally, the problem statement is that you want to manipulate potentially unknown rigid objects from a category, like mugs or shoes, into desired target configurations. If we think about the different steps of the problem, we first need to grasp the object, and we think there are pretty good solutions out there for this; you can use either a classical geometry-based approach or deep learning. So we don't think grasping the object is the main difficulty.
Once you've grasped the object, it's actually pretty simple to move it: with a good grasp it's basically rigidly attached to your end effector, so moving the object isn't the challenge either. The challenge is really deciding where you want to put the object, and that's what we're going to focus on.

Pose-based approaches have had a lot of success in manipulation in the past, and I think they can work really well when you're dealing with a single known object, but there are at least two challenges in using pose-based approaches for these category-level tasks. First, pose isn't necessarily well defined across different instances of a category. If I show you a sneaker, which is an instance of the shoe category, I can also show you a high heel or a boot, and if you have to define a pose that's consistent across all of those, it's not clear how to do it. Additionally, there's a lot of ambiguity: I could put the origin of the frame at the heel, at the sole, at the tongue, or at the toe. For a single object those are all equivalent, but they have very different implications if you propagate them to the other instances of the category. And it can be pretty difficult to specify tasks using pose-based targets exactly because of this ambiguity. To give a simple cartoon example, here's a cartoon of putting the mug on the rack. I have this mug in the top left; if I take that mug and shrink it down, then pose is still totally well defined, but if I just place the shrunken mug at the pose of the big mug, we obviously aren't accomplishing the task, because the task is not about pose, it's really about having the handle around the rack. So pose targets can be physically infeasible, or they can simply fail to accomplish the task, and this arises because of intra-category shape variation. How do you define the target pose for all instances of a category? For any one instance it's just an SE(3) transform, but it's going to be different for every instance because they have different geometry. In our view, this means SE(3) pose is not very meaningful in the context of category-level manipulation tasks.

So what we're going to do instead is use semantic 3D keypoints as the object representation, and use an optimization program that lets you put costs and constraints on those keypoints as the task specification mechanism. This goes nicely with the fact that keypoint detection has really matured in the computer vision literature, and you can detect keypoints from RGBD images, for example. To give you a toy example, consider putting mugs upright on a table. Instead of representing the object by a pose, we represent it by two keypoints, say the bottom center and the top center, and we represent the action, because we rigidly grasp the object and can move it however we want, as an SE(3) transform that moves the keypoints. So if I want to specify the task of putting the mug upright at a specific location, how would I do that? We say it's an optimization program: you get to optimize over the action you're going to take, and you can put costs and constraints on how the keypoints will move. In this example, you could put a constraint that the red keypoint ends up at the target location, and a cost saying that the mug axis is aligned with the vertical.
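As a concrete illustration of that toy mug-upright example, here's a small SciPy sketch: the decision variables are an axis-angle rotation and a translation, with an equality constraint pinning the bottom keypoint to the target and a cost aligning the bottom-to-top axis with vertical. The real kPAM formulation has a richer set of costs and constraints and a proper nonlinear solver; this is only meant to show the shape of the optimization, and all names here are mine.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def solve_upright_placement(p_bottom, p_top, target, axis_weight=1.0):
    """Toy kPAM-style program: find a rigid transform (R, t) of the grasped
    mug such that the bottom keypoint lands on `target` and the
    bottom-to-top axis points straight up. Decision variables are an
    axis-angle rotation (3) and a translation (3)."""
    def unpack(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        t = x[3:]
        return R, t

    def cost(x):
        R, _ = unpack(x)
        axis = R @ (p_top - p_bottom)
        axis = axis / np.linalg.norm(axis)
        # Cost: mug axis should align with +z (vertical).
        return axis_weight * np.sum((axis - np.array([0.0, 0.0, 1.0])) ** 2)

    def keypoint_constraint(x):
        R, t = unpack(x)
        # Equality constraint: transformed bottom keypoint sits on the target.
        return R @ p_bottom + t - target

    res = minimize(cost, x0=np.zeros(6), method="SLSQP",
                   constraints=[{"type": "eq", "fun": keypoint_constraint}])
    return unpack(res.x)
```

Executing the result is then just applying the returned transform with the end effector, which is exactly the "rigidly grasped, so the action is an SE(3) transform" assumption from the slide.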
If you do that, it's actually a pretty good way to specify a lot of pick-and-place tasks at the category level. If you recall our cartoon example, this is what happens if you do pose-based transfer; but if you just put a constraint that the handle-center keypoint is on the rack, plus some costs on where the other keypoints go, you can get really nice transfer to different instances of the category using this keypoint-based approach. And I want to emphasize that these keypoints are in 3D, not 2D, and that's important: the real world is in 3D, so if you want to optimize over an SE(3) transform, which is what your robot can actually apply, you need your representation to be in 3D as well.

So how does the full pipeline work? The robot takes an RGBD image, we do instance segmentation, we run our 3D keypoint detection network to detect the 3D keypoints (I'll talk more about that in a second), then we run our optimization program to figure out what action we actually want to apply, and then we combine that with your favorite grasp planner to grasp the object and apply the action. The nice thing about this pipeline is that it's very modular and the inputs and outputs are well defined, so you can mix and match; if you want to swap in a different grasp planner, that's very easy to do in this framework. Here's an example of us actually doing the task in practice.

So how do we do keypoint detection? We detect 3D keypoints directly from RGBD images, and we very much build on advances from the computer vision literature. One important difference from Dense Object Nets, which was completely self-supervised, is that here we're back in the world of supervised learning, so we need a way to build our own labeled dataset. To build it, we borrow a trick from the previous approach, which is to label in 3D and back-project those labels to 2D, because we know where the camera has moved. Here you can see me labeling this shoe in 3D. We labeled a pretty small dataset by computer vision standards, only about 10 shoes and 20 mugs, and to give you an idea, it only took us about four hours to generate this dataset. The keypoint detection network essentially outputs a probability distribution over where it thinks each keypoint is, and the way we actually localize the keypoint is by taking the expectation of that distribution. An important thing to note is that because the keypoints are sparse and in 3D, we can do inference through occlusion: even if this bottom heel keypoint is occluded, we can still detect it.
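That expectation step is essentially a spatial soft-argmax. Here's a small PyTorch sketch of the 2D version; the actual network predicts 3D keypoints, for example by also taking an expectation over depth, which is omitted here, and the heatmap shapes are assumptions for illustration.

```python
import torch

def expected_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (B, K, H, W) unnormalized scores, one map per keypoint.
    Returns (B, K, 2) expected (row, col) locations.

    Taking the expectation (a spatial soft-argmax) instead of the argmax
    keeps the localization step differentiable, so the whole network can
    be trained end to end.
    """
    B, K, H, W = heatmaps.shape
    probs = heatmaps.reshape(B, K, H * W).softmax(dim=-1).reshape(B, K, H, W)
    rows = torch.arange(H, dtype=probs.dtype, device=probs.device)
    cols = torch.arange(W, dtype=probs.dtype, device=probs.device)
    exp_row = (probs.sum(dim=-1) * rows).sum(dim=-1)   # (B, K)
    exp_col = (probs.sum(dim=-2) * cols).sum(dim=-1)   # (B, K)
    return torch.stack([exp_row, exp_col], dim=-1)
```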
To give you an idea of the experimental results, we did a large number of experiments on three different category-level tasks. You've already seen two of them: hanging the mug on the rack, which is here at the bottom, and putting the mug on the shelf; and there's one more task involving shoes. On the left you can see the keypoints we chose for each object category: for the mugs we have three keypoints, the top, the bottom, and the handle, and for the shoes we chose six keypoints that roughly capture the shape of the shoe.

Here's the shoe-on-a-rack task, where you need to grab the shoe and place it on the rack in this position with the toe facing away from the robot. The really nice thing about this method is that it can handle all the variation you get within a category. Here's a video of some of the shoes we used in our experiments: everything from sneakers to soccer shoes to boots, and we even have some baby shoes in there that our mechanical engineer brought in. I spent several days running experiments on all these different shoes and trying to quantify the results, and you get these nice tiled videos showing all the shoes we ran experiments on. I don't want to spend too much time on this slide because it's a little dense, and I encourage you to read the paper if you have more questions, but I want to draw your attention to the success rates. We do really well on the shoe task and on the mug-on-shelf task, and for the mug-on-rack task we do really well on anything you would consider a regular-sized mug. But it turns out we also had some really small, kid-sized mugs in our dataset, whose handles are about one to one-and-a-half centimeters in diameter; we can still do those about half the time, but it's definitely more challenging, and I'm happy to talk about that in the questions at the end.

I want to briefly comment on keypoints as an object representation. They're definitely a partial, task-specific representation of objects in a category; they're not everything. You still need object geometry for things like grasping and collision-free planning, but the semantic information is really coming from the keypoints, and their strength comes from the fact that they ignore task-irrelevant details, which makes them robust to shape, texture, and topology variation. There are other ways you could go about detecting keypoints besides a neural network, but the nice thing about the neural network is that its performance is really only limited by its dataset. As an example: during the MIT visit days a few years ago, we were running this demo live, and we were having the students put their shoes up there if they wanted to. Someone put a high heel up, and we failed, because there were no high heels in the dataset. So we went on Amazon, bought a few high heels, labeled them, retrained our network, and then we could do high heels, and these are different heels than the ones in the dataset. The nice thing is that there's a clear path to repairing and improving the system when it fails, and that's something you get from being back in the supervised learning world.

Just to summarize: it's a novel formulation of the category-level pick-and-place problem that represents objects as semantic 3D keypoints, with task specification by an optimization program on those keypoints. We also developed a manipulation pipeline that lets you nicely factor the problem, and we did extensive hardware experiments demonstrating the effectiveness of the approach.

Now I want to talk about the visuomotor policy learning paper, to try to complete the story of the thesis. So far we've only thought about the first two questions, and in this project we really try to answer all three.
this in this project we really try to answer all three okay so this is uh joint work with uh pete um and so what is this project it's a factorization of visiomotor policies that use dense objnets as the representation and then it's using this factorization to perform efficient imitation learning and the reason is because we wanted to get to closed-loop feedback control policies for manipulation so what i mean by that is in the first two projects it was very much the case of the robot looks at the world it does some uh processing and then it closes its eyes and it just executes the policy open loop um and that led to some really comical failures and uh we really wanted to get a policy that continuously looks at images and and decides what to do um so we're going to try to do uh tasks like this so these tasks um you can see the robot actually reactive uh we've got a task with a formal object a category level task some non-prehensible manipulation in the bottom right you can see a task that really highlights the real-time feedback nature of the approach okay so suppose you want to do a task like this so pick up a hat in this case and hang it on a hat rack how would you go about doing that and you actually want a policy that goes from images to actions so images are really high dimensional so a typical vga resolution image has like 640x480 by three and that's just too much for a policy to handle so typically we use uh something to compress that image down to a low dimensional representation z uh we combine that with information from our robot like our encoders and then we run it through a fairly small policy model this is typically just a few layers of mlp or lstm to determine our action and typically what's been done in the literature is to use something like an autoencoder and then to let z be the latent code uh from the autoencoder and there's been a lot of great work on this so what we decided to do is actually replace the autoencoder with our dense correspondence model so we think that for doing tasks like this spatial information about the object is actually really really valuable so what we're going to do is we're going to um select descriptors corresponding to points on the object and because the dense descriptor network is fundamentally solving the correspondence problem it's going to allow us to track these descriptors over time so now what z is going to be is it's going to be the location of these descriptors and that location can either be in the pixel space or in the 3d space so how are we going to train this well if you think back to the original densovic nets paper we had a static scene with a wrist mounted camera that we would move around and that's how we would get our supervision so now the scenes are dynamic so we can't really do that so what we would do is we'd have multiple uh calibrated and time synchronized external cameras and then we can get correspondence via that source so these are two images taken at the same time instant and we know for example that this pixel here in the left image corresponds to that pixel there in the right image and this is what allows us to train our densopic network presentation still in a self-supervised way so to give you an idea of what the representation actually is here on the left if you think back to that original caterpillar example where i was mousing over and you saw these heat maps this is the same thing but now the right images are going to move so on the left i'm highlighting the point that i want to find in the right image but 
To give you an idea of what the representation actually is: think back to the original caterpillar example where I was mousing over the image and you saw those heat maps. This is the same thing, but now the right images are going to move, so on the left I'm highlighting the point I want to find in the right image, and I can do this through time now: I can track this point through time and space, even as the object moves around. This is really the input to the policy network, and it's not just one keypoint; we track a whole set of reference keypoints through time, and their locations are the input to our policy.

How do we go from this representation to an actual policy? We train it using behavior cloning, which basically just says: copy what the expert did. The demonstrations are provided by a human teleoperating the robot; here's an example of Pete teleoperating this shoe-flip task, and we ask the network to copy what Pete did when it sees a similar image. That's behavior cloning. (There's a small sketch of this policy and training step below.)

So what kinds of results can you get with this? Here we did about 50 demonstrations, which amounts to about 15 minutes of demonstration time, of us teleoperating the task of pushing this box across the table to the red line, and now we can run the policy autonomously. The robot is receiving commands at 5 Hz, and the images are coming from this camera right here. You can see it pushing the object across the table, and this might look similar to some of the previous things I've shown you, but now we can actually go and disturb it, and the policy is very reactive: it's doing real-time closed-loop feedback off the images. This was a first for us and we were excited. You can get a pretty high-performance solution out of this; it's quite accurate, and it's not a shaky robot like you may have seen in some other approaches. You can also manipulate deformable objects, here grabbing a hat and hanging it on the rack, and you can do this even if the object is changing position; in the next one Pete is going to hit the hat mid-go and the policy will adjust.

If you think back to the original Dense Object Nets paper, we showed that the descriptors generalize at the category level, so we could find correspondences across different instances of a category. Here's that same heat map visualization finding correspondences across a single shoe, but we can also find correspondences across different shoes: on the left is a boot and on the right are sneakers. So if we give demonstrations at the category level and combine that with our category-level perception system, we can do some category-level tasks. We provided a bunch of demonstrations of flipping over different shoes, and now we can deploy a policy that works on many different shoe instances. These are all shoes from the training set, but pretty soon here we start putting down previously unseen shoes and the policy still generalizes, which was pretty exciting for us. It doesn't work every time (here it's actually going to retry), but it works on a wide variety of shoes and shows some interesting behavior.

Just to summarize: it's a novel formulation of visuomotor policy learning that uses self-supervised correspondence as the object representation, techniques for multi-camera, time-synchronized descriptor learning, and sample-efficient imitation learning with this policy, and we showed that it works on real hardware.
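To make the "few layers of MLP plus behavior cloning" part concrete, here's a minimal sketch. The actual policy for some tasks was an LSTM, and the dimensions, action parameterization, and training details here are placeholders rather than what was actually run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointPolicy(nn.Module):
    """Small policy head: tracked keypoint locations z plus robot state in,
    commanded action out. An LSTM variant would replace the MLP core."""
    def __init__(self, num_keypoints, robot_state_dim, action_dim, hidden=128):
        super().__init__()
        in_dim = num_keypoints * 2 + robot_state_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, z, robot_state):
        # z: (B, K, 2) tracked keypoint locations; robot_state: (B, S).
        return self.net(torch.cat([z.flatten(start_dim=1), robot_state], dim=-1))

def behavior_cloning_step(policy, optimizer, z, robot_state, expert_action):
    """One supervised step: regress the teleop expert's action."""
    optimizer.zero_grad()
    loss = F.mse_loss(policy(z, robot_state), expert_action)
    loss.backward()
    optimizer.step()
    return loss.item()
```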
Now I want to get to the last project, which I think ties everything together nicely. Similar to the previous project, this is going to address all three questions, but in a different way. What is it? It's going to use dense descriptors, but we're actually going to learn a dynamics model of how those keypoints move, and once we have a dynamics model, we can run closed-loop feedback via model predictive control. The reason we want this is that we want to stay in the world of closed-loop feedback control policies, but we don't want to rely on imitation learning, for a variety of reasons I'll talk about; and once you have a model, you can do really powerful things like use it for planning and achieve diverse goals.

Just a brief preview of the results: using this approach we can still do a task that's pretty similar to what you saw before, stabilizing a push across the table, but now if I want to do something else, like spin the box 180 degrees and then push it across the table, we can use the same approach for that other task, and in fact we can use it to follow arbitrary trajectories. The motivation is that the previous approach does get you to closed-loop feedback control policies, but it relies on imitation learning, and while imitation learning has a lot of nice features, it also has some drawbacks: we did 50 demonstrations just to learn a single task, we needed expert demonstrations, you can't learn from off-policy data, and your performance can never really exceed the demonstrator's. In the current work we want to be able to do that task of pushing across the table, but we also want to be able to do the other, more diverse tasks: spinning, S-shaped trajectories, whatever you can come up with. And just to motivate why you really need closed-loop feedback: on the left is a closed-loop rollout of this 180-degree spin, and on the right I placed the box as close as I could to the initial condition and ran the plan open loop, and you can see how quickly it diverges. You really do need closed-loop feedback.

I want to come back to the slide I showed at the very beginning of the talk and show you how the different approaches fit in. kPAM represents objects via semantic 3D keypoints and uses an optimization program to say what the task is; that's effectively the cost function in this equation. It makes a really simple assumption about the dynamics model, namely that you rigidly grasp the object so it moves together with your end effector, and once you've done that you can get a policy by doing trajectory optimization with that model. The problem is that this doesn't work for tasks where you're not rigidly grasping the object, because that step breaks down. The visuomotor policy learning paper takes a really different approach: it says, I'm not going to think about the cost function or the dynamics model, I'm going to jump directly to learning a policy that goes from images to actions, and the way it does that is via imitation learning; it basically has a bunch of supervision in the form of state-action pairs and it copies them. The current project says: let me actually learn a model of how these things move; then if I want to push them to a specific point or goal state, I can do that by solving an optimization problem to get a policy. That's how this last project fits into the story of the thesis.
So how do you learn a dynamics model? You want to learn a model of the form x_{t+1} = f(x_t, u_t), where x is state and u is action. The problem in the real world is that you don't have access to the underlying state; you get an observation, in this case an image, and of course you also have access to your robot state, joint encoders and so on. So how should you represent the object state? That's a question we've offered various answers to throughout the thesis: you could use pose, you could try to learn dynamics on the full image, you could use an encoder latent state, but what we're going to do is use object keypoints. Similar to the visuomotor policy learning paper, we track these points through time and space, and even though they're drawn as a 2D projection here, these points really are in 3D, and I'll show some 3D visualizations to drive that point home. What we're trying to learn is how these points will move: if the robot moves to the right here, where will the keypoints go? That's what's represented by the purple here.

How do we learn a dynamics model? We just collect some training data of us interacting with the object, which is time-series data of state-action pairs, and we want to learn to predict how the keypoints will move given a sequence of actions. We parametrize the dynamics model as a function f with parameters theta; given the current state and action, we can predict the next state, and by repeatedly applying the model we can predict arbitrarily far into the future. Then we train it to minimize the prediction error: we want to find parameters theta such that our predicted states match the states that actually occurred. This is back in the world of supervised learning, but because we're using time-series data there's no manual supervision needed; you just collect rollouts and that's enough to train the system. (I'll give a small sketch of this model and loss below.)

In terms of training data, you only need about 10 minutes of interaction data to learn the dynamics. You don't need expert demonstrations; any interactions are fine, so you can use all your data, not just on-policy data; previously, if a demonstration failed, we would just throw it out. And it learns a global dynamics model that's valid everywhere. At the end I'll show how we use just a single demonstration to provide a trajectory to track. If you compare this to the visuomotor policy learning paper: there we used 50 demonstrations, which amounted to about the same amount of time, 15 minutes, but those were all used to learn just that one task. If you want a different trajectory, like spinning 180 degrees, you'd have to collect a whole new set of demonstrations, and you can only use demonstrations that succeed; that's the sense in which it only uses on-policy data.
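Here's a minimal sketch of what "parametrize f with theta, roll it out, and minimize multi-step prediction error" can look like. The MLP architecture and the residual form are my assumptions for illustration, not necessarily what was actually run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointDynamics(nn.Module):
    """One-step model x_{t+1} = f_theta(x_t, u_t), where x_t stacks the 3D
    keypoint positions and u_t is the robot action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, x, u):
        # Predict the change in state; the residual form is just one
        # common choice that tends to train stably.
        return x + self.net(torch.cat([x, u], dim=-1))

def rollout(model, x0, actions):
    """Apply the model repeatedly to predict arbitrarily far ahead.
    x0: (B, state_dim); actions: (B, T, action_dim) -> (B, T, state_dim)."""
    preds, x = [], x0
    for t in range(actions.shape[1]):
        x = model(x, actions[:, t])
        preds.append(x)
    return torch.stack(preds, dim=1)

def prediction_loss(model, x0, actions, true_states):
    """Multi-step prediction error against the states that actually occurred."""
    return F.mse_loss(rollout(model, x0, actions), true_states)
```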
OK, so I said we're going to learn a dynamics model; let me show you a visualization of what that looks like. Here in teal, or light blue, are the current detected keypoints; in dark blue is the dynamics model's prediction of what it thinks will happen to those keypoints over time; and in green is the ground truth of what actually happens. I'll let this video play, and if the green and blue land roughly on top of each other, that means we've done a good job learning the dynamics. You can see the model is not perfect, but it captures the main dynamics, and if we couple that with high-rate feedback, that's enough to get good closed-loop control.

So how do I get from having a dynamics model to having a policy, which is what I actually want? Let's say I'm initially here, at this initial state, and I want to drive the system to this goal state. What I want to do is solve an optimization problem: find a sequence of actions that drives me from the initial state to the goal state. If I execute the first action in that sequence and then replan, that's a model predictive controller. There are a few caveats here. It's not trivial to solve this planning problem even if I give you a known model, because in this system the dynamics are hybrid and nonlinear, which makes this a non-convex optimization, and in our case f is a neural network, which makes things even harder. Currently what we do, though you're by no means restricted to this, is use a gradient-free optimizer called model predictive path integral (MPPI), which happens to work well with the fact that we can roll out a large number of samples in parallel, because f is a neural network and we have it on a GPU. (There's a rough sketch of this kind of controller after this passage.) To make the planning problem a little easier, we can collect just a single demonstration to tell the system what we would like it to do. Here's a demonstration of me teleoperating the system from a start to a goal; all we do is record the trajectory that the keypoints underwent, and now we can add that as a trajectory running cost instead of having only a terminal cost, and that helps the optimization converge. Just to show you a 3D version: here's the 3D visualization of what that trajectory looks like, and every time you see an image like this, you should think of these 3D visualizations, because the underlying representation really is 3D.

You've seen a lot of videos with colors like this on them, so let me step through what each component is. Light blue is the current state. The green reticles are the goal state; you'll notice the image on the right looks like a blend of two images, because it's a blend of the goal image and the current image. The green lines are the demonstration trajectory I just showed you. And the most interesting part is the purple, which is the model's prediction of what will happen to the keypoints under the optimized action sequence from our controller. If this is working well, you should see the predicted states track the demonstration and end up at the goal, and then the true states follow. If I let the video play, that's exactly what happens, and you can see what the MPC is doing along the way.
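And here's a rough sketch of an MPPI-style update using that learned model, reusing the `rollout` helper from the dynamics sketch above. The sampling scheme, cost shaping, and hyperparameters are simplified stand-ins, not the controller that actually ran on the robot.

```python
import torch

def mppi_action(model, x0, nominal_actions, cost_fn,
                num_samples=1000, noise_std=0.1, temperature=1.0):
    """One MPC step with a gradient-free, MPPI-style update.

    model: learned keypoint dynamics (cheap to roll out in batch on a GPU).
    x0: (state_dim,) current keypoint state.
    nominal_actions: (T, action_dim) current best action sequence (warm start).
    cost_fn: maps predicted states (N, T, state_dim) to costs (N,), e.g. a
             running cost tracking the demonstration keypoint trajectory
             plus a terminal cost on the goal keypoints.
    Returns the updated action sequence; execute its first action, then replan.
    """
    T, action_dim = nominal_actions.shape
    noise = noise_std * torch.randn(num_samples, T, action_dim)
    candidates = nominal_actions.unsqueeze(0) + noise                 # (N, T, A)

    states = rollout(model, x0.expand(num_samples, -1), candidates)   # (N, T, S)
    costs = cost_fn(states)                                           # (N,)

    # Exponentially weight low-cost samples and average their perturbations.
    weights = torch.softmax(-costs / temperature, dim=0)              # (N,)
    return nominal_actions + (weights[:, None, None] * noise).sum(dim=0)
```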
Just to note, all the videos you'll see are 1x speed unless noted otherwise. The nice thing is that once you have a model, you can use it to track lots of different trajectories. Here are four different trajectories I thought up: spinning 180 degrees, pushing along different sides of the box, and this curved trajectory down here. You can reliably track these long-horizon trajectories from just a single demonstration. To give you a sense of what this looks like from an off-board view, here's the robot tracking the 180-degree spin trajectory, and in the bottom right there's a picture-in-picture of what the MPC algorithm is doing. Here's one where we start quite far off from the initial plan, and you can really clearly see the robot performing real-time visual feedback; again, it's feeding back off this camera right here. And this is an interesting one where the controller messes up the initial plan and ends up on the wrong side of the box, but about halfway through it decides to switch contact modes and use the other side of the box to stabilize the plan and reach the goal.

Just like before, we can be robust to external disturbances; here I am with the poker annoying the robot. It was actually much harder to do this than in the previous project, because I was in the lab by myself, and this demo moves a lot faster than the previous one, so you need to be pretty quick between hitting enter and going to apply the disturbance. We can also be robust to visual clutter and objects in the background; our vision model is pretty robust to that kind of thing. And just to show you that it's pretty reliable, here's one of those trajectories; this is the only video that's sped up, I think 10x, just so we can get through them, but it's a single uncut video showing that we can do this pretty reliably.

It doesn't always work. I want to caveat this by saying that I think most demos you see don't work 100% of the time, and that's true of this one as well. Pushing is a little more challenging than it seems: it's hybrid, it's non-holonomic, so small distances in L2 space can actually be pretty big in terms of the underlying dynamics, and it can be challenging to recover from perturbed initial conditions. Digging into our method a bit, some of the causes of failure are limitations of the MPC algorithm: just because I've written down that MPC optimization problem doesn't mean I can actually solve it every time. Sometimes it would have benefited from a longer MPC horizon, and sometimes we ran into control limits. To show you two examples: here the robot similarly ends up on the wrong side of the box and gets it pretty close to the goal, pushing all the way to the end, but it would have needed to switch contact modes, and that's just hard from an optimization standpoint in this case. As another example, here I started with quite a perturbed initial condition and we would like to go further; you'll hear the robot click, and what happened there is that we ran into a kinematic singularity of the robot. The algorithm in its current form doesn't have these limits built in; it wanted to go further toward the cameras, but the robot can't actually do that. I think all of these things are solvable; they're not fundamental limitations, and the dynamics model itself is agnostic to these problems.
Okay, so to show you some quantitative results, here are four different trajectories; I encourage you to read the paper if you want to see more, but this is meant to give you a sense of how big the funnels are. This is the deviation of the initial condition from where the plan started, so for example this point is 2 centimeters and 60 degrees off; blue is a success and red is a failure. So it can handle quite a variety of initial conditions. Just to summarize: it's a novel formulation of predictive model learning using learned dense visual descriptors, we did lots of experimental validation and real-world experiments, and if you read the thesis you can see simulation experiments showing improved performance over a variety of baselines.

So just to wrap up, I talked about four different projects: Dense Object Nets, which is a way of solving the correspondence problem and representing objects; kPAM, which used semantic 3D keypoints for doing these category-level tasks; the visuomotor policy learning, which uses Dense Object Nets in an imitation learning framework to get closed-loop feedback control; and the final project, which shows how to use the dense object net representation to learn a dynamics model and get closed-loop control through model predictive control rather than imitation learning.

I see that there are a few questions, but maybe I will just go through my acknowledgments first and then we'll open it up for questions. First I want to thank Russ for being a great advisor. He took a chance on me when I was still an econ PhD student, letting me be on the DRC team and later join the lab. Initially I thought that all the PIs were like Russ, but over the years I've realized that that's not the case. He has a great eye for the big picture but also loves to get into the details of equations on the whiteboard, and he even still writes code, which I think is pretty unique. On the academic side he's really instilled in me the importance of doing serious, rigorous work and has always pushed me to hold myself to a high standard, and I really appreciate that. He's guided me through the PhD but also allowed me the freedom to pursue my own ideas, so I just want to say a big thanks to Russ. I also want to thank my committee, Alberto and Phil, for giving me great feedback on my thesis and being really supportive throughout the process, which was difficult at the end with the whole COVID situation. Alberto has been a great mentor to me throughout the years, even before he was on my committee, and we had a lot of great chats. I also want to thank all the other MIT faculty that I've interacted with over the years; I've gotten used to just running into you in the hall or at seminars and being able to have great, stimulating discussions, and now that I'm leaving MIT I realize what a privilege that's been. I also want to say thanks to Pete for being both a great friend and a co-author. He was really my main partner in crime throughout the thesis and we worked on a lot of projects together, so it was a pleasure to work with someone who I can also call a friend, and it definitely made the PhD much more enjoyable. I want to say thanks to the DRC team; I learned a lot during my time with the team and also had a great time doing it.
I really want to say a special thanks to Pat and Andres, who spent a lot of hours answering my questions at the beginning in a tiny little office in N9 (this was the only picture I could find); they really helped me go from knowing nothing about robotics to knowing something. I want to thank the Robot Locomotion Group in general for being a great and intellectually stimulating place, and also my collaborators on the different projects over the years. I want to thank my friends for going on lots of great adventures and having a great time; there are too many of you to name, but I think you all know who you are, and hopefully you're up here in a picture somewhere. I want to thank my family, my mom, my dad, and my sister, for always encouraging me and believing in me, even through periods that were difficult and more uncertain, and I especially want to say thanks to my parents for spending so much time with me growing up and working hard to give me every opportunity to succeed; they've really shaped who I've become, both as a scholar and, more importantly, as a person. And finally I want to say thanks to my girlfriend Katie, who's been a great sport tolerating all my rants about robotics. It's been great having a partner throughout the journey, a journey which isn't always the easiest, and you've definitely helped remind me that there's a life outside research; all the great adventures we've had through the years have really helped me keep a good balance. And so with that, I'm happy to open it up for questions. Maybe I'll try to look at the chat, or Russ, do you want to say how this is going to work?

Yeah, I think you all have the ability to unmute yourselves. Chris and Ben sent in some great questions, so why don't you start with those, Lucas.

Sure. So Chris had a question, I'll just repeat it: is there some mechanism by which the visuomotor policy retries in a different way after a failure, or is this implicit from the randomness of the disturbance caused by the prior failure? So actually, all the demonstrations had no retrying; for the shoes, the one you're talking about, they actually just went and succeeded on the first try. But we were using an LSTM as our policy network, which has internal state, and it just turned out that if it missed and got back to something that kind of looks like the initial state, after a while the internal state of the LSTM would kind of reset back to the beginning and it would autonomously retry. That was completely unexpected to us; we didn't do anything special to make that happen.

Ben Burchfiel asked for clarification on the keypoint estimate: do you use the mode or the center of mass of the heatmap? It's the center of mass, so the expectation. The reason we did that is that going from a distribution to its expectation is a differentiable operation, versus something like taking the max, and that allows the whole thing to be trained end-to-end.
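The expectation-over-heatmap operation described in that answer is often written as a spatial soft-argmax. Here is a minimal sketch under assumed shapes (one unnormalized (H, W) heatmap per keypoint); it illustrates the general technique rather than the exact network head used in the thesis.

```python
# Differentiable 2-D keypoint location from a predicted heatmap: take the
# expectation of pixel coordinates under the softmax-normalized heatmap
# (a "spatial soft-argmax") instead of the argmax. Shapes are illustrative.
import torch

def expected_keypoint(heatmap_logits):
    """heatmap_logits: (H, W) unnormalized scores for one keypoint.
    Returns a (2,) tensor with the expected (row, col) location."""
    H, W = heatmap_logits.shape
    probs = torch.softmax(heatmap_logits.reshape(-1), dim=0).reshape(H, W)
    rows = torch.arange(H, dtype=probs.dtype)
    cols = torch.arange(W, dtype=probs.dtype)
    exp_row = (probs.sum(dim=1) * rows).sum()   # marginal over columns, then expectation
    exp_col = (probs.sum(dim=0) * cols).sum()
    return torch.stack([exp_row, exp_col])      # differentiable w.r.t. the logits
```

Because the expectation is a smooth weighted sum of pixel coordinates, gradients flow back into the heatmap, whereas an argmax has zero gradient almost everywhere.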
Ben also asked: when you use dense descriptors for visuomotor policies, how do you select the low-dimensional set of reference points to use as input to the policy network? Good question; the answer is randomly. As long as you select random descriptors that are roughly separated on the object, that captures enough information for doing the policy. For the keypoint dynamics work it's a little bit different and much more subtle, so I don't have time to get into that now, but I encourage you to reach out to me or read the thesis for the differences there.

Hongkai asked: for the learned model, how robust is the dynamics? Is it easy to generalize the dynamics model to slightly different geometry, shape, and dynamic parameters like mass and inertia? Good question; I think the answer is that I don't know yet, and we have to do more work to figure that out. I think keypoints are great because they're a sparse representation, but when you're learning dynamics, especially for something like pushing, geometry really matters, so to really get a high-performance solution in the future we're going to need some mix of sparse and dense representations.

Eric asked: for the MPPI setup, do you have to explicitly encode anything for figuring out different contact modes? No. Even though all the results I showed you for that final keypoint dynamics approach were in that pushing example, there's actually nothing specific to pushing baked into the method anywhere: the way that we learn dynamics is general and the way that we do planning is general. The controller is commanding the end-effector velocity in that case, that's what u is, and it just figured out to go and get on the other side of the box, so there was nothing explicit there. Thanks.

Andy Barry asked: can you explain a bit more about the integration of robot state into the dynamics model? Do you have states for the contact point with respect to the keypoints? So the way it works is that the state is the keypoints in the 3D world, and then we also just want to know where our end effector is, so we include the xyz location of the end effector; that's how the robot state gets put in. Does that make sense, Andy? Andy followed up: so it's not in a relative frame, and do you also have a state for whether you are in contact or not? No, and I thought a lot about that. The answer is no because, first, how would you actually know that? And the other thing is that when you have a dynamics model that you want to run for multiple time steps into the future, anything you put in the state you had better be able to predict. There are things you could put in, like "my contact sensor is firing, I'm in contact," but if you actually want to predict, starting from not being in contact, what's going to happen three seconds in the future, you would need to predict that signal too. So no, it's not in there, and the model has to figure it out.
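As a concrete illustration of the state described in that answer, here is a hedged sketch of how the model's input might be assembled: the tracked 3D keypoints plus the end-effector position, flattened into one vector. Names and shapes are assumptions for illustration, not the thesis code, and a state built this way is what would be rolled forward in the planner sketched earlier.

```python
# Illustrative sketch (assumed shapes, not the thesis code) of the state fed to
# the learned dynamics model: object keypoints in the world frame plus the
# end-effector position. Anything placed in this vector must be re-predicted by
# the model at every step of a multi-step rollout, which is why a raw
# contact-sensor bit is not included.
import torch

def make_state(keypoints_xyz: torch.Tensor, ee_xyz: torch.Tensor) -> torch.Tensor:
    """keypoints_xyz: (N, 3) keypoint locations; ee_xyz: (3,) end-effector position."""
    return torch.cat([keypoints_xyz.reshape(-1), ee_xyz])  # (3N + 3,) flat state vector
```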
Okay, I actually have a second question now that I have you: can you wildly speculate about how far you are from folding laundry with this type of strategy? Say you had a t-shirt and you had keypoints; it seems like you're maybe not that far. So what I learned from the last project is that I have hope for learning dynamics models; I actually think that's possible. But there was one slide where I said that just because you have a learned dynamics model doesn't mean the battle is won: solving the planning problem is still really hard, and I think people in robotics have always known that. It's not like planning became easy just because we're doing learning. Right now the planner is very simple, basically a fancy cross-entropy method, and more work needs to be done to get a policy and a plan rather than just doing this online planning. Building up a value function to allow shorter-horizon planning in your MPC is definitely an interesting direction, but I think the hard part of laundry would be the planning, not so much the dynamics, if I had to guess.

Okay, we got another question: for the keypoint detection, what accuracy can you get, and how do you see this method extending to tasks that require accurate positioning? For example, you mentioned that the method would fail for smaller mugs. So for that project, the second one, hanging the mugs, it's supervised learning and you can get really good accuracy, like sub-centimeter accuracy. The reason we were failing on the smaller mugs is actually not the perception system. Let's see if I can find a little mug here; okay, I've got my trusty CSAIL mug. What happens is that the real failures come from the fact that that pipeline is open loop: you look at the world, you detect the keypoints, but when you go in for the grasp, inevitably in the real world you end up moving the object a little bit as you grasp it. If you have a mug with a big handle like this, that's fine, you're still going to succeed at the task, but if you have a handle that's one centimeter in diameter and you move the mug half a centimeter on the grasp, that's where the failure comes from. I think the solution is just to move to a closed-loop system: re-perceive after you grasp, or use a contact sensor for finer localization. That's solvable; it's not a fundamental limitation of the method, and you can get really accurate detections.

Speculation request from Eric Cousineau: how well do you think this would handle object grasping and re-grasping, like reorienting a mug, and how would you do the play data to learn the dynamics part? One thing that maybe an astute observer noticed is that there isn't a natural way right now to handle occlusions. Partial occlusions are fine, as long as the camera can see some of the keypoints, but for things like grasping you're really going to get into heavy occlusions, so I think going to a multi-camera setup or using tactile sensors is going to be pretty important. Maybe we can speculate more offline, Eric, thanks.

I think Hongkai also had a question about the keypoint speed: does it help if you include the keypoint speed in the dynamics model? Yes. In the interest of time, and of keeping the slides clean, I didn't have this in the talk, but the state x is actually the keypoint locations in both the current image and the previous image, so by having both of them you get velocity. Hopefully that answers your question.
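Following that answer, the earlier state sketch could be extended so that the state stacks the keypoints from the current and the previous frame, making velocity implicitly available to the model. Again, names and shapes are illustrative assumptions rather than the thesis code.

```python
# Illustrative extension of the earlier sketch: stack keypoints from times t and
# t-1 so the learned dynamics model can infer velocity (shapes are assumptions).
import torch

def make_state_with_history(kp_t: torch.Tensor, kp_tm1: torch.Tensor,
                            ee_xyz: torch.Tensor) -> torch.Tensor:
    """kp_t, kp_tm1: (N, 3) keypoints at the current and previous frame;
    ee_xyz: (3,) end-effector position, as in the earlier sketch."""
    return torch.cat([kp_t.reshape(-1), kp_tm1.reshape(-1), ee_xyz])
```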
Speculation request from Pete: for new graduate students just starting out, what related or adjacent research problems do you recommend they work on? I think: stay in the world of closed-loop feedback control; that is a very rich and very promising world. I really do like the dynamics learning approach, so figuring out how to do that while integrating more modalities than just vision, especially for dealing with things like occlusion, is really interesting, and also how to go from there to actually building a value function, to get a global policy rather than just relying on your MPC to get you there. I think those are all fruitful areas. Okay, I think I've answered all the questions from the chat, so if there are other questions, feel free to unmute yourself and just ask.

That's awesome. Not only a great presentation, but I just love that you have such a community of people who want to know the details and everything; really, really good. Sorry that we ran over; sorry to the committee, I guess, who have to stay (everybody else stayed optionally). Okay, so here's what we're going to do now: I'm going to shuffle the committee into a breakout room. If there are any faculty who would like to join us in the breakout room, you are welcome, just send me a quick chat right now and I will include you. I'll leave everybody else here; you're welcome to talk amongst yourselves and discuss the finer points of Lucas's work, or whatever you want to talk about, and we'll grill Lucas and then we'll send him back.
Info
Channel: Lucas Manuelli
Views: 5,977
Rating: 5 out of 5
Keywords:
Id: Gb-t2hIpYpk
Length: 74min 45sec (4485 seconds)
Published: Fri Jul 24 2020