Jiajun Wu: Learning to see the physical world

Captions
Today's talk will be by Jiajun Wu, who is an assistant professor of computer science at Stanford, working on computer vision, machine learning, and computational cognitive science. Before joining Stanford he was a visiting faculty researcher at Google Research, and he received his PhD in electrical engineering and computer science at MIT. His research has been recognized through a whole bunch of awards, including the ACM Doctoral Dissertation Award Honorable Mention, the AAAI/ACM SIGAI Doctoral Dissertation Award, and MIT's George M. Sprowls PhD Thesis Award, and there is a whole lot more; I can't mention every one of them. He is well known for his work on physical reasoning and 3D generative models, and I'm very excited for this talk today, so please take it away.

Thank you. Let me see, can I share the screen? Does it work? Is it blurry? I turned on "optimize for video," which sometimes makes it blurry, but if it's fine then I can go ahead. A quick question before we start: how do you want to handle audience questions? I guess I can pause once in a while and ask, and if anyone has questions you can post them in the chat and I can take a look sometime in the middle, or the host can flag them for me later. Cool.

Great, thanks for the introduction and thanks for having me here. I'm very excited to virtually be at Mila; Montreal is the city I have visited most often in Canada, because of NeurIPS. So, I'm Jiajun, a new first-year assistant professor of computer science at Stanford, and I'm going to talk about learning to see the physical world. This is a general overview of my research, which we sometimes call physical scene understanding, because we want to build scene-understanding AI systems that can not only recognize objects (I started as a computer vision person), so not only classify images into cats and dogs, but also understand that an image like this shows a scene made of objects, that objects have geometry and physics, and that things are going to happen in the future. Can you predict what is going to happen? Can your agent interact with the scene? We humans live in this physical world, and we are able to effortlessly see, navigate, and interact with it, so we definitely want to build machines that can do the same thing.

Let's start, like every other talk nowadays, with a baby video; I'm no exception. Here is a six-month-old baby playing with Jenga, and look at what she's doing. Apparently she has a different notion of playing; the way she plays the game is different from how adults play it. But although she doesn't understand the rules of the game, it's still quite remarkable that at just six months old she can effortlessly see the objects, touch and feel them, interact with them, pick them up, and even try to put one into her mouth. So in some sense our ability to see and interact with the world exists at a very young age, and our goal is to build what we call "infant machines"; we put it in quotes because infants, we now know, actually know a lot, so we are only trying to replicate a subset of their abilities.
And that is not only for blocks but for any objects, because to an infant many scenes are new: you are always encountering a pile of toys or random objects, some of which you are familiar with and have played with before, but many of which are genuinely new. How can we build a machine that, for any new object it has never seen before, can see its shape and physics, predict its behavior, predict how objects will interact with each other and how humans could interact with them, and actually manipulate those objects?

If this is our goal, then what should our approach be? Of course there are neural networks, and they can do very well. This is a typical video: segmenting autonomous-driving scenes into trees, cars, roads, people, and so on. This is really great, and this is back in 2017; even then you could do this well. But while these systems are really powerful, they are also limited. Everything you see here requires a lot of training data: you have to annotate a lot of cars to be able to segment cars. There is no notion of 3D; everything is a purely 2D sequence. And the system is not trying to generalize to new objects: every object here, cars and trees, has been seen many times during training, and you are not trying to segment or recognize objects you are unfamiliar with.

People could say: if we want to go from 2D to 3D, from familiar to unfamiliar, from passive perception to active manipulation, maybe one approach is to just scale up, to throw everything at the learning system and let it learn everything, not only semantics but also appearance, physics, and actions. This approach relies on data: let's collect a larger, more powerful dataset. The challenge is that annotations beyond semantics are just so hard to acquire. It is relatively easy to annotate whether a pixel is a car, and people have come up with better ways to do that, but some things are much harder to annotate, like the 3D geometry of the car, and some are almost impossible, such as the Young's modulus of the car. What even is that, and what if the object is not homogeneous? How would you even annotate it?

An alternative approach is to remember that we live in a physical world, so there is actually an underlying physical model governing how the world behaves. If you look at an image like this, you are seeing the pixels the way they are because of an underlying physical state. That state includes object-centric representations, where every object has its intrinsic properties (geometry, appearance, materials) and extrinsic properties (position, velocity), plus scene descriptions such as lighting conditions and camera parameters. Taking all those physical states and putting them together to produce an image is an important subject in computer graphics called rendering: you produce the image from the state.
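To make the idea of an underlying physical state concrete, here is a minimal sketch of an object-centric scene description with a toy "render" function that turns state into pixels. This is purely illustrative: the fields and the 2D disk renderer are my own stand-ins, not the representation used in any of the systems discussed in the talk.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectState:
    radius: float             # toy stand-in for full 3D geometry (intrinsic)
    color: tuple              # appearance (intrinsic)
    mass: float               # material/physical property (intrinsic)
    position: np.ndarray      # where the object is (extrinsic)
    velocity: np.ndarray      # how it is moving (extrinsic)

@dataclass
class SceneState:
    objects: list
    light_dir: np.ndarray = field(default_factory=lambda: np.array([0.0, -1.0]))
    camera_height: int = 64
    camera_width: int = 64

def render(scene: SceneState) -> np.ndarray:
    """Toy top-down 'graphics engine': physical state in, image out."""
    img = np.zeros((scene.camera_height, scene.camera_width, 3))
    ys, xs = np.mgrid[0:scene.camera_height, 0:scene.camera_width]
    for obj in scene.objects:
        mask = (xs - obj.position[0]) ** 2 + (ys - obj.position[1]) ** 2 < obj.radius ** 2
        img[mask] = obj.color
    return img

scene = SceneState(objects=[ObjectState(radius=8.0, color=(1.0, 0.0, 0.0), mass=1.0,
                                        position=np.array([20.0, 30.0]),
                                        velocity=np.array([1.0, 0.0]))])
image = render(scene)   # vision is the inverse problem: image -> SceneState
```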
You can do the same for objects at different states, and across time there are dynamics: conditioned on the current state and an action, what is going to happen in the future? Here you can see it as gravity. So there is a physical model, a physical explanation, behind whatever you observe. This kind of top-down generative modeling is really popular in computer graphics and computer games, and it is very powerful. Here is a video I can play, made with the open-source software Blender in 2010, before any deep learning; the movie is called Sintel. It was made to demonstrate the power of those rendering algorithms, completely open source and free: with a rendering engine, physical simulation, and graphics engines, you can create your own virtual world like this. It is of course still not real, but it is pretty realistic, with a lot of fine textures, lighting conditions, and motion. So people may ask: why don't we just replicate this and build such a physical model for the entire world?

But think about what the model can bring us. If you rely on models, you face other challenges. First, these models are top-down: they tell you how to simulate the world and produce an image, but not how to solve the bottom-up inference problem, how, given an image, to infer the things you care about, like geometry or physics, from videos or observations. Traditionally people said you can see vision as inverse graphics and just solve the inverse problem; because the forward model is complex and not differentiable, you could solve it with sampling or MCMC methods, but those are really slow and hard to scale to the very high-dimensional state of the physical world, especially for real-world applications. Second, all these models are approximate: they are still not realistic enough, and they are mostly deterministic, so they don't capture all the realism and stochasticity of the real world.

My research philosophy is therefore to think about what model we should build: what is the minimal but universal and powerful structure that we can borrow and integrate with learning methods, so that we get the power of both the data and the model? In particular, we use learning in two ways. The first we call learning to invert: because it is so hard to use those top-down models for bottom-up inference, we use learning to solve the inference problem, to invert these physical models, by using them as guidance for what the learning targets, the intermediate representations, and the training data should be. The other is learning to augment: models are approximate, sometimes slow, and deterministic, so can we use learning to make them more accurate, more efficient, and stochastic?
These are the two directions of my research. The key principle, to summarize, is to think about what the physical model is. "Physical" means it is true: we are not building things specific to a certain domain or application, but universal, true physical models; and for whatever we cannot yet build or cannot model well, we use learning to help us. This is the high-level overview; since there are no questions, let me go into specific examples and see how this can be applied in computer vision, graphics, and robotics.

First, learning to invert. Let's look at an extremely simplified problem: inverse graphics where I am given a single image, the image contains a single object, the object is centered, and I just want to reconstruct the 3D geometry of the object from the image. What is the physical model in this case? Think about the forward image-formation process: why does the image look the way it does? It's because there is an underlying 3D shape (you can rotate it; it is an actual 3D shape) sitting in 3D space. In an extremely simplified illustration, there is a light source; it sends rays to the object, they hit the visible surface of the object and get reflected to the camera; in front of the camera there is an image plane, so you see an image of the object. But what you are really seeing is not the full 3D geometry: there is a fourth leg that is occluded, and you are not seeing it. What you actually see is the visible surface of the object, the other three legs but not the fourth. The reason we humans understand that the chair must have four legs, must be stable, must be symmetric, is our prior knowledge: we have seen many chairs and interacted with them, so even though I can only see three legs on the visible surface, I can automatically complete or imagine what the chair must look like to be symmetric, stable, and four-legged.

So we thought: if this is how the physical model works, can we design an inverse system with a similar structure? First estimate the visible surface, which we re-parameterize as a depth map, so you first estimate depth; second, do shape completion, going from the depth map, the visible surface, to the full shape using prior knowledge. The first step is a local process: you extract whatever you can get from the image, and the depth map really captures all the information the single image can give you; beyond that, there is nothing more the image can tell you, and you have to rely on prior knowledge. This is why we chose this particular physical intermediate representation rather than others: the visible surface has a physical meaning, and, conditioned on the visible surface, the input image and the output 3D shape become conditionally independent.
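A minimal PyTorch-style sketch of this two-stage design, a depth (2.5D) estimator followed by a shape completion network, each with its own supervision, might look as follows. The layer sizes and module names are placeholders of mine, not the actual architecture from the paper.

```python
import torch
import torch.nn as nn

class DepthEstimator(nn.Module):
    """Image -> 2.5D sketch: a depth map of the visible surface (local evidence)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),          # one depth value per pixel
        )
    def forward(self, image):
        return self.net(image)

class ShapeCompletion(nn.Module):
    """2.5D sketch -> full 3D occupancy grid; this is where prior knowledge lives."""
    def __init__(self, vox=32):
        super().__init__()
        self.vox = vox
        self.encode = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU())
        self.decode = nn.Linear(256, vox ** 3)
    def forward(self, depth):
        occupancy = torch.sigmoid(self.decode(self.encode(depth)))
        return occupancy.view(-1, self.vox, self.vox, self.vox)

depth_net, completion_net = DepthEstimator(), ShapeCompletion()
image = torch.rand(1, 3, 64, 64)                 # rendered ShapeNet-style training image
depth = depth_net(image)                         # supervised with ground-truth depth
voxels = completion_net(depth)                   # supervised with the full 3D shape
```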
It turns out that if you build a system following this philosophy, with two modular networks for depth estimation and shape reconstruction, you can train it on ShapeNet, a large dataset of 3D CAD models: randomly sample shapes from ShapeNet, put them in random poses in front of random backgrounds, and render all the annotated data you need for training. Then you can test it on real data; these are chairs from PASCAL 3D+, which are not the easiest. Without modeling depth, just using an end-to-end neural network, this is what you get, which is not terrible: it captures the rough structure of the object but misses a lot of fine details. With this one small change, introducing the intermediate depth representation, the system can focus more on the geometry and fine structure of the object. Here are a few more examples. This is back in NeurIPS 2017.

Q: Sorry, a question of my own. The change we're seeing here comes only from enforcing this intermediate step, adding depth as an intermediate representation and then going from depth to model completion, as opposed to going end-to-end from image to a complete model, right?
A: Right, that's right.
Q: Interesting, thanks; that was it.

So this is back in NeurIPS 2017. At the beginning we talked about how our ability to deal with the world rests heavily on our ability to deal with objects we've never seen before, to generalize to unseen objects. So let's see how this system does if we try to generalize to unseen object classes, new categories of objects. We take the model trained on chairs, planes, and cars, the three largest categories in ShapeNet and the most common object categories you can find CAD models for, and test it on tables; tables are the fourth-largest category, so these are also pretty common objects, and tables and chairs are both furniture. But if you take this image of a table and send it to the model trained on chairs, planes, and cars, this is what you get: it doesn't look like a table, it looks much more like an airplane. That's because the model hasn't really learned what 3D reconstruction means; it is essentially doing nearest-neighbor retrieval in 2D. I think every learning system is doing some version of smart nearest neighbor, so you have to find the right prior for the model to appear to have some extrapolation ability. With this straightforward implementation, the model effectively says: what is this? I've never seen it; the top looks flat and round, and some airplanes look like that, so let me just retrieve the nearest airplane. And this is not a problem specific to our approach: all those methods that do single-view 3D reconstruction with different representations, point clouds, voxels, collections of surfaces, multi-view images, or octrees, have the same issue.
For people who are interested, Maxim Tatarchenko, Vladlen Koltun, Thomas Brox, and colleagues have a paper called "What Do Single-view 3D Reconstruction Networks Learn?" at CVPR 2019 that analyzes this problem in detail.

So we asked: is there a way to solve this problem, or at least do a little better? We are inspired by graphics rendering engines, this physical rendering pipeline, so we looked back at the physical model and thought about what the graphics engine is really doing if we want to invert it. Look at the second stage of the system, the shape completion network that goes from depth to the 3D shape. We said this step uses prior knowledge to complete the shape, but that's not the whole story: the depth map, although it describes the visible surface, is still a 2D representation, still a 2D image, while the output reconstructed shape is in 3D. So the shape completion network is also implicitly back-projecting the depth into 3D. But that back-projection is a deterministic, fully differentiable process: we know how to back-project a depth map into 3D as a visible surface. It is perspective projection, implemented in every graphics engine; you can write down the equations and make them differentiable. If that's the case, why should the shape completion network be over-parameterized and re-learn this deterministic mapping? It is something we already know, it is physically true, and it applies to every image. So we implemented it as a layer with no learnable parameters, but still differentiable, so you can still do end-to-end training, and put that differentiable layer into the learning system. The system now works as follows: first estimate the depth, then back-project the depth into the partial, visible surface, but now in 3D (you can see the fourth leg is missing), and then the shape completion network can really focus on completing the shape instead of re-learning the back-projection.

Now if we take this model, train it again on chairs, planes, and cars, and test it on the same table: this is what you get with direct prediction, and this is what you get when modeling depth and this back-projection process. Just with these two small changes, the depth representation and the back-projection module, you are able to generalize and do much better when reconstructing objects from unseen categories. This is joint work with Zhoutong and Xiuming, which we call GenRe, for generalizable reconstruction. This was back when I was an MIT PhD student, and both Xiuming and Zhoutong are MIT PhD students as well.
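Here is a sketch of what such a fixed back-projection layer might look like: inverse perspective projection under a pinhole camera model, with no learnable parameters but full differentiability so gradients can still flow from a 3D loss back into the depth estimator. The camera intrinsics below are made up for illustration.

```python
import torch

def back_project(depth, fx, fy, cx, cy):
    """Depth map (H, W) -> 3D points of the visible surface via inverse perspective
    projection. No learnable parameters, but fully differentiable, so gradients from
    a 3D loss still flow back into the depth estimator."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return torch.stack([x, y, z], dim=-1)        # (H, W, 3) partial surface, camera frame

depth = (torch.rand(64, 64) + 1.0).requires_grad_()
points = back_project(depth, fx=60.0, fy=60.0, cx=32.0, cy=32.0)
points.sum().backward()                          # gradients reach the depth map
```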
It seems there's a question.
Q: I'm pretty sure I know the answer, but just to be sure: in the 2017 work, you're not supervising the depth, right? Or are you?
A: We are supervising the depth, but it's all from synthetic data. For both the baselines and our methods, we train everything on synthetically rendered data that we generate ourselves using ShapeNet objects, and then all the testing is on real data. But yes, we do supervise the depth.
Q: So when you say it's trained end-to-end, I'm not sure what you mean, because it seems like the depth estimation part is trained independently of the shape completion part.
A: You just have two different losses: a loss in between and a loss at the end. We actually also have a consistency loss, which I didn't talk about, just to make sure the two parts match.
Q: And one last question: your final output 3D shape, what type of representation are you using?
A: This is a little complicated. For this paper, for what I'm talking about here, the output is eventually back-projected into voxels, but there are some intermediate representations I didn't talk about because of the time constraint. In the shape completion part, once we have this partial surface, we project it onto a spherical map representation and do the completion there, and then finally back-project into voxels. So the final output is in voxels.
Q: Cool, thanks.
A: Of course, this was back in 2018, before all those implicit representations. These days people would say, why don't we do this with implicit functions and do much better? Especially if you look at the bumpiness, you can probably tell it doesn't look like something you would get with implicit representations. So of course there are ways to do better.
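As a rough illustration of the spherical-map idea mentioned in that answer, the partial surface can be re-parameterized as a 2D grid over viewing directions that a standard network can then complete. The binning scheme below is an assumption of mine, not the paper's exact formulation.

```python
import numpy as np

def to_spherical_map(points, n_az=64, n_el=32):
    """Project an object-centered partial surface (N, 3) onto a spherical depth map
    indexed by (elevation, azimuth): a 2D grid a standard network can complete."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x ** 2 + y ** 2 + z ** 2) + 1e-8
    az = np.arctan2(y, x)                         # [-pi, pi]
    el = np.arccos(np.clip(z / r, -1.0, 1.0))     # [0, pi]
    ai = ((az + np.pi) / (2 * np.pi) * n_az).astype(int).clip(0, n_az - 1)
    ei = (el / np.pi * n_el).astype(int).clip(0, n_el - 1)
    smap = np.zeros((n_el, n_az))
    for i in range(points.shape[0]):              # keep the farthest hit per direction
        smap[ei[i], ai[i]] = max(smap[ei[i], ai[i]], r[i])
    return smap

partial_surface = np.random.randn(500, 3)         # stand-in for the back-projected depth
spherical = to_spherical_map(partial_surface)     # a completion network would run on this
```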
We also did some other tests of generalization to unseen object classes: train on chairs, planes, and cars, then test on bookshelves and sofas. Here are the baselines, which represent objects as a collection of surfaces and work really well on the training object classes; when you test them on new categories they are still trying to do something, but there are many more artifacts compared with our method or the ground truth. Ours still has errors and roughness, which we can probably fix with better representations, but it does capture the rough idea of what the object is without having seen the category. We also tested a bit on generalizing to non-rigid object classes: again we train on chairs, planes, and cars, but here we use depth as input, so we only evaluate the second part, the shape completion. These are the reconstructed shapes from a model trained only on chairs, planes, and cars; there are of course still a lot of artifacts, but it is quite remarkable to me that it gives a roughly complete shape, it understands the object should be locally smooth and enclosed, and compared with the ground truth it captures roughly what the thing looks like.

So we made a number of changes, but what really matters, if you look at the experimental results, is the use of these intermediate representations, the depth as well as the back-projection module, the two things I have been talking about. In computer vision people call this the 2.5D sketch, proposed by David Marr 40 years ago as his hypothesis for human visual processing. Whether humans actually use it is debated, with plenty of evidence and arguments on both sides, but in AI we thought it is something physical and universal: it applies to every image and every scene you see, because there is perspective projection going on. So we tried to integrate this kind of physical structure into learning systems, and it really does help. That is about learning to invert, which lets you generalize.

I can also quickly show some examples of augmenting: how these intermediate representations let you augment a rendering pipeline and do a little better. What does that mean? In computer graphics you take the shape, the texture, and a viewpoint, and you can just render an image, so what does it mean to augment a rendering pipeline? Here we use it to mean: what if I want to generate new shapes, or generate shapes with new textures? We first extended the GAN, which I think everyone is familiar with, to 3D, so we can synthesize not only 2D images but also 3D shapes. Then we take the same projection module, but use it in the other direction: take the 3D shape and do a forward projection to compute the 2.5D sketch, the depth map of the object, and after that apply a 2D GAN, something like a CycleGAN, to apply a texture, so you can synthesize not only 3D shapes but also their corresponding images with textures. In some sense you can see this as a hybrid graphics engine: you have the 3D shape representation, the projection, and a 2.5D surface, just like a standard graphics engine, but in addition you have a generative model over possible 3D shapes and a conditional generative model with a texture code you can change, so you can apply different textures to the image. The two generative parts are learned, while the projection part is differentiable but not learned. The benefit is a disentangled representation: there is a shape code, you can inject a pose or viewpoint code at projection, and the texture is conditioned on a texture code. So you can change one factor at a time: synthesize the object in a new viewpoint, or, if I've already synthesized a number of shapes and then see other images of cars whose textures look really great, I can ask how those textures would look on the shapes I synthesized; you take the texture from one image and apply it to the synthesized shapes.
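A schematic sketch of this hybrid pipeline: a shape code goes to a 3D shape, a fixed projection turns it into a 2.5D sketch, and a texture network conditioned on a texture code produces the image. All modules below are tiny stand-ins of mine (the real system uses 3D and 2D GANs), and the axis-aligned projection is only a proxy for actual rendering.

```python
import torch
import torch.nn as nn

class ShapeGenerator(nn.Module):
    """Shape code -> coarse 3D occupancy grid (a 3D GAN generator in the real system)."""
    def __init__(self, z_dim=64, vox=32):
        super().__init__()
        self.vox = vox
        self.net = nn.Sequential(nn.Linear(z_dim, 512), nn.ReLU(),
                                 nn.Linear(512, vox ** 3), nn.Sigmoid())
    def forward(self, z_shape):
        return self.net(z_shape).view(-1, self.vox, self.vox, self.vox)

def project(voxels, view):
    """Fixed (non-learned) projection: 3D shape + viewpoint -> 2.5D sketch.
    An axis-aligned max projection stands in for real perspective rendering."""
    axis = {"front": 1, "top": 2, "side": 3}[view]
    return voxels.max(dim=axis).values            # (B, vox, vox) silhouette/depth proxy

class TextureNetwork(nn.Module):
    """2.5D sketch + texture code -> RGB image (a conditional image GAN in the real system)."""
    def __init__(self, vox=32, z_dim=16):
        super().__init__()
        self.vox = vox
        self.net = nn.Sequential(nn.Linear(vox * vox + z_dim, 512), nn.ReLU(),
                                 nn.Linear(512, vox * vox * 3), nn.Tanh())
    def forward(self, sketch, z_texture):
        h = torch.cat([sketch.flatten(1), z_texture], dim=1)
        return self.net(h).view(-1, 3, self.vox, self.vox)

z_shape, z_texture = torch.randn(1, 64), torch.randn(1, 16)
shape = ShapeGenerator()(z_shape)                               # sample a 3D shape
image = TextureNetwork()(project(shape, "front"), z_texture)    # swap z_texture to re-texture
```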
This disentangled representation, this hybrid graphics engine, gives you the flexibility and benefits of a standard graphics engine while enabling more flexible image generation and editing. And we can put the two together, not only inverting but also augmenting: given an image, can we first infer this object-centric, shape-aware, 3D-aware representation, then make some changes, and re-synthesize or edit the image? Here is an example with autonomous-driving scenes. There are a number of objects, especially cars, and we first run an inverse-graphics pipeline. We made some additional changes here because we want to model the background: things like sky and trees, whose geometry is really hard to explain, we simply model as a 2D segment with a latent code. But for the objects of interest, the cars, we have what we call a semi-interpretable representation, where you actually infer the geometry of the car, its pose, and its texture. Then you can ask: the car is currently at the center, but what if I move it to the right, how would the scene look? You send the edited representation back into the hybrid graphics engine to re-synthesize the image with everything else staying the same, the background, the weather, the object textures and shapes, only the position of the car moved to the right. So you get this flexible inversion and augmentation: computer vision, recognizing and perceiving objects, but also synthesizing and re-rendering the image.

It seems there is another question; I think this is a good place to stop. Can you unmute yourself, do you want to ask the question live?
Q: The reconstruction, it's often symmetric, right?
A: Yes, this is a good point. I have a whole different talk entirely on programs and symmetry, on how this kind of regular structure, especially knowledge from graphics that people have used for procedural modeling, can be integrated with vision, and especially with neural program synthesis, which is a very popular topic these days. I didn't give that talk here because this is the robot learning seminar, so I wanted to talk about things more related to physics and robotics, but we can chat more offline, and near the end of this talk there is one slide where I say a little about how knowledge like symmetry can be useful for scene perception. For symmetry and regularity in general: if you have this mug, you know it is rotationally symmetric; if you are seeing these blocks, you have the knowledge that they must have similar shapes. There are all these regular structures, especially in man-made artifacts, in chairs, in buildings, and so on.
We are also looking more into their applications in computer vision and computer graphics, image editing, generation, and shape modeling, which I thought is not directly relevant to robotics; that's why I'm not covering it here, given the name of the seminar, but I'm happy to chat offline.

Okay, so far we have seen learning to invert physical models and how the inversion and augmentation ideas apply to a graphics engine. But the world is not static. If you see an image like this one, you recognize that there are objects, and objects have their own geometry, texture, color, and position, but you also have the strong intuition that the tower is not going to be stable; it is going to fall. How can we give machines this kind of understanding, which is critical for robotic applications? The most straightforward thing you can do is perception, or state estimation: get your estimated state representation, send it directly to a standard physics or dynamics engine to predict what is going to happen in the future, and then re-render that predicted future. This is a very straightforward approach, inferring everything from the image and then relying purely on simulation. It turns out these kinds of methods can actually be very powerful, even compared with end-to-end neural networks. If you show an image like this and ask what happens next, you can have a neural network directly predict the future in pixels, and the predictions are not bad: compared with the ground truth, these models do capture roughly how many blocks are falling and in which directions. But because these models are purely pixel-based, they have no notion of objectness or object geometry, and long-term prediction, partly because the scene is chaotic, can be very blurry. If you compare with the more physical models, where you first identify the object states, make a physical prediction, and re-render it, the long-term predictions are not perfect either, but they definitely have a notion of objects, they capture how many objects are falling and in which directions, and these physical representations also enable other applications. If I have seen this object and can reconstruct it, I can flexibly predict what-ifs: what if there is a wind coming from the right, what if I poke the object on the top, what will happen? Or, if you see a tower that you predict is going to fall, what stabilizing forces can you apply to stop it from falling, and then take those actions.
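The perceive-simulate-re-render loop can be summarized in a few lines. Everything below is a toy stand-in (a hand-written perceive function and a point-mass gravity step), meant only to show the structure of the pipeline, not any of the actual models from the talk.

```python
import numpy as np

def perceive(image):
    """Stand-in for inverse graphics: image -> object states (position, velocity, mass).
    In the real systems this is a learned recognition network."""
    return [{"position": np.array([0.0, 10.0]), "velocity": np.zeros(2), "mass": 1.0}]

def simulate(objects, dt=0.1, gravity=np.array([0.0, -9.8])):
    """One step of a toy physics engine acting on the inferred states."""
    stepped = []
    for o in objects:
        v = o["velocity"] + gravity * dt
        stepped.append({**o, "velocity": v, "position": o["position"] + v * dt})
    return stepped

def render(objects):
    """Stand-in for graphics: object states -> a predicted observation (here, positions)."""
    return np.stack([o["position"] for o in objects])

states = perceive(image=None)        # 1. invert: pixels -> physical state
for _ in range(10):                  # 2. simulate forward in state space
    states = simulate(states)
prediction = render(states)          # 3. re-render the predicted future state
```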
We later updated the model a bit, but the most important change is that we realized it on a real robot. Suppose the robot is given a single image and would like to rebuild the tower from that image. What does it need to understand? It has to understand that the scene is made of objects: there are eight objects, with different geometry, textures, and colors (here mostly geometry and colors), and it has to understand the physics going on. If you want to rebuild this tower or bridge, you have to start from the bottom; if you place the blue box at the top first, it will just fall to the ground. The system was able to invert these object-centric representations and then plan actions so that it could rebuild the tower from a single image. We tested on more examples, ten different ones. It is not always successful, especially because depth estimation from a single image can be very ambiguous; at the bottom right you can see cases where the system simply fails, but in seven of the ten cases it successfully reconstructed the tower.

So this is probably the most straightforward way to extend a static perception module and a rendering module to handle dynamic scenes. But what is missing in this straightforward approach? First, objects can have very similar appearance. Here are two objects, each made of two parts glued together; these are two objects, not four blocks: each tower is one object with two parts. The two objects look very similar, but if I put them on the ground and apply the same external force, their trajectories, their behaviors, are very different. Why? Despite their similar appearance, objects can have very different physical properties; they can be made of different materials. The top block of the first tower is made of aluminum, while the top block of the second tower is made of steel, which is much heavier; that is why the second tower didn't move as much. These physical properties really affect object dynamics, but they are very hard to perceive from a single image: if you go from a single image to an estimate of the state, how would you tell whether a block is aluminum or steel? Sometimes the difference in appearance is really subtle while the difference in physical properties is huge. I won't have time to talk about it, but we have also worked on learning to invert not just graphics, so you can infer object geometry and state, but also dynamics, inverting physics, so you can infer the physical properties that really matter to object dynamics, such as density and Young's modulus. That is another line of interesting research that, due to time constraints, I will not cover in detail.

There is a second issue with dynamics models. If you put an object on a surface in exactly the same position and push it multiple times, even if you apply exactly the same force to the object in exactly the same state, the outcomes can be stochastic, or uncertain. It's not that physics itself is uncertain.
It is because there are things in the real world that are really hard to model and that make the dynamics look stochastic: the microstructure and micro-friction of objects, the roughness of the surface. These are extremely hard to model, not really perceivable, and not things we can write down analytically; and even if you could, say with differential equations, it would be computationally very expensive, so you couldn't solve it in real time or efficiently. In the end there are things that seem very hard to capture with analytical physical models, and this seems like a natural place to apply the other idea: can we use learning to augment physics, to make it more accurate and to capture the stochasticity?

Let me give a quick example of how we tried that. We can use learning to augment a dynamics engine. The dynamics engine takes the state and an action and tells you what happens next. The good thing is that people have spent decades developing these engines; they are often very good, and by building in the governing equations, their solvers can give you quite accurate estimates of future states. But as we just discussed, they often make simplifying assumptions, for efficiency reasons, and they are mostly deterministic. On the other hand, you can have a learner, these days usually a neural network, that also takes the current state and action; you can relatively easily make it stochastic, and neural networks are more expressive, they can capture more complex distributions and functions. The issue is that they are very data-hungry, they require a lot of training data, you have to train them for specific cases, and it is really hard for them to generalize because, as we talked about, they are essentially memorizing and doing smart nearest-neighbor retrieval.

So we thought: if we put the two together, maybe we can get the best of both worlds. If you already have a dynamics engine, then let the learner, the neural network, take not only the current state and action but also an additional input: the prediction, the estimate, from the dynamics engine. Instead of learning everything from scratch, the learner now learns a minor correction, a residual, the things the dynamics engine doesn't capture, to make it more accurate but also to capture the noise distribution. Hopefully it can be as expressive and stochastic as a neural network but much more data-efficient, with the help of the dynamics engine. In practice we implemented it as a conditional variational recurrent neural network, which you can see as a version of a stochastic neural network conditioned on this additional input, and we tested it on a pushing scenario: pushing exactly the same object from exactly the same positions, on the dataset of more than a million pushes from Alberto Rodriguez's group.
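A simplified sketch of this "simulator in the loop" idea: the learner receives the physics engine's prediction as an extra input and outputs a distribution around it. The real system used a conditional variational recurrent model; here a small Gaussian-output network stands in for it, and the toy physics engine is invented for illustration.

```python
import torch
import torch.nn as nn

def physics_engine(state, action):
    """Analytical but approximate, deterministic simulator (a toy stand-in)."""
    return state + action

class ResidualCorrector(nn.Module):
    """The learner takes (state, action, simulator prediction) and outputs a
    distribution over the next state, so it only has to model what the
    simulator gets wrong, rather than learning the dynamics from scratch."""
    def __init__(self, dim=2):
        super().__init__()
        self.dim = dim
        self.net = nn.Sequential(nn.Linear(3 * dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * dim))
    def forward(self, state, action, sim_pred):
        out = self.net(torch.cat([state, action, sim_pred], dim=-1))
        mean_residual, log_std = out[..., :self.dim], out[..., self.dim:]
        return sim_pred + mean_residual, log_std   # stochastic correction around sim_pred

model = ResidualCorrector()
state, action = torch.zeros(1, 2), torch.tensor([[0.5, 0.0]])
sim_pred = physics_engine(state, action)
mean, log_std = model(state, action, sim_pred)
# Training would maximize the likelihood of observed next states under N(mean, exp(log_std)**2).
```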
We can compare the predictions of our model, shown in red, with the ground truth, shown in yellow, and you can see that our model does capture this stochasticity, the two modes of the possible outcomes, whereas the baseline that models the dynamics with a Gaussian process still does pretty well but only captures one mode. That is the comparison with traditional methods; compared with a pure neural network that learns everything from scratch, our model is indeed more data-efficient. To achieve a one percent relative error in the final position of the pushed object, our model needs about 700 examples, while an end-to-end, purely neural model needs a bit more than 2,000. The plot is on a log scale, so our model requires about one third of the data to reach the same level of error. We then deployed it on a robot to see how this learned hybrid dynamics model can actually help manipulation when combined with model-predictive control. Here the robot has to solve an indirect-interaction problem: it uses its arm to poke the disc on the left, which in turn pushes the disc on the right, and the goal is to move the disc on the right, whose center is shown in red, to the target shown in green. Here is how the robot was able to do that using the learned model.
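For the control part just mentioned, a generic random-shooting MPC loop with a learned dynamics model looks roughly like this; the dynamics function and cost below are placeholders, not the controller actually used on the robot.

```python
import numpy as np

def learned_dynamics(state, action):
    """Placeholder for the learned hybrid model: next state given state and push."""
    return state + 0.8 * action

def mpc_push(state, goal, horizon=5, n_samples=256, seed=0):
    """Random-shooting MPC: sample action sequences, roll them out through the
    learned model, and return the first action of the lowest-cost sequence."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, state.shape[0]))
    best_cost, best_action = np.inf, None
    for seq in candidates:
        s = state.copy()
        for a in seq:
            s = learned_dynamics(s, a)
        cost = np.linalg.norm(s - goal)      # distance of the disc to the target
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action

state, goal = np.array([0.0, 0.0]), np.array([1.0, 2.0])
action = mpc_push(state, goal)               # in practice, re-plan after every real push
```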
There are other ways learning can help model dynamics. The pushing case is one where the dynamics are relatively well known, so we can write them down and implement a physics engine; but there are other cases where even writing down the physical equations is challenging. For example, if you want to represent objects as particles, so that you can handle not only rigid bodies but also deformable objects and fluids, we have used learning, in particular hierarchical graph neural networks, to approximate how particles interact with each other, so you can simulate interactions not just between rigid bodies but also, as in this example, how the fluid (shown in blue) interacts with the rigid body (shown in green). These are other cases where learning can help augment dynamics models.

So, quickly: we said we want to build these infant machines that, for any new object of any shape or category, can see its shape and physics, predict its behavior, and interact with it, and we discussed a few works that try to solve this problem by taking physical inspiration and representations from graphics and dynamics engines and integrating them with learning through learning to invert and learning to augment. There are other related works I have done; for example, human perception goes beyond seeing: there is sound and there is touch, and these can help us build better models and do better manipulation. Let me give one example about perceiving objects. Even though visual data is very powerful, sometimes there is intrinsic ambiguity. Looking at this physical model, say I want to reconstruct the 3D shape from the 2D image through the 2.5D surface as we discussed; this is the reconstruction you get, which is pretty good, but there is still an intrinsic ambiguity: from this viewpoint it is very hard to tell the thickness of the bottle. That is something you cannot easily resolve if you only have a single color image. But look at the physical model behind tactile data, where you don't see the object but feel it: the tactile signals (here we use a GelSight sensor) can be seen as local information about the same 2.5D surface. Unlike the 2.5D representation you get from the image, which is global and captures more context, the touch signals are much more local, but they are also much more accurate, so they can refine your estimate and resolve things you cannot resolve purely from vision, like this intrinsic ambiguity. The two can be easily integrated because they are essentially the same thing, both about the 2.5D surface. With the tactile data you can refine your estimate of the thickness of the bottle and do a little better at resolving the ambiguity. So this is about going beyond seeing, also leveraging information from sound and touch.
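One way to picture the fusion, since vision and touch both live in the same 2.5D representation, is to patch accurate local tactile depth into the coarse global visual depth map. The snippet below is a toy version of that idea; the real system learns the fusion rather than overwriting values directly.

```python
import numpy as np

def fuse_depth(visual_depth, touches):
    """Refine a coarse, global depth map from vision with accurate, local
    tactile depth patches; both live in the same 2.5D representation."""
    fused = visual_depth.copy()
    for row, col, patch in touches:
        h, w = patch.shape
        fused[row:row + h, col:col + w] = patch   # trust the tactile reading locally
    return fused

visual_depth = np.ones((64, 64))          # global but ambiguous (e.g. bottle thickness)
tactile_patch = 0.8 * np.ones((8, 8))     # local but precise (e.g. from a touch sensor)
fused = fuse_depth(visual_depth, [(28, 28, tactile_patch)])
```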
If you want to go a little beyond infant machines (and there is some psychological discussion about how much infants already have, not only perception but also abstraction), there is a more advanced stage where you go beyond pure perception in an unstructured space and think about what level of abstraction we may want in our representations. One example: look back at the GenRe reconstruction we talked about, going from the input image to reconstructing unseen object categories. The top of the table has this weird triangular shape, which doesn't match our intuition. We have a strong prior that the table is man-made, so it is very likely to be symmetric, in this case circularly symmetric; why would the reconstruction be this weird triangular shape with all that roughness and bumpiness? This is again very hard to resolve if you only solve the problem from a single image, because from the input viewpoint the reconstruction already looks extremely similar to the input; there is very little signal in the image telling you the shape should also be round and regular. Here we probably have to go from shape reconstruction to shape abstraction. Shapes, especially man-made objects and artifacts, often have an abstract, program-like structure; in computer graphics people use this a lot for procedural modeling: the chair legs should be the same, they have regular layouts, they repeat themselves. As a first attempt, we designed our own belief about what shapes of furniture should look like, in a domain-specific language, and we represent objects not only in an unstructured space like voxels, point clouds, or even implicit functions, but in a more structured way, as what we call shape programs. With this somewhat stronger prior you can reconstruct shapes in a more regular way. Of course there is a trade-off: by assuming what shapes will look like, you lose a lot of expressiveness, and for things that are unstructured, or scenes with more shape variety, you might not be able to represent them. An interesting question is how this kind of learning-based expressiveness can be integrated with these human priors about regularity, repetition, and symmetry. Again, due to time constraints I won't go into detail about how we did this, but I'm happy to chat offline about how we used neural program synthesis with these structured shape representations.
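To give a flavor of what a shape program might look like, here is a toy domain-specific snippet for a table: the top plus a loop that repeats one leg at four symmetric positions, so regularity and symmetry are built into the representation itself. The primitives and parameters are invented for illustration, not the DSL from the paper.

```python
def cuboid(center, size):
    """Primitive of the toy DSL: an axis-aligned box."""
    return {"type": "cuboid", "center": center, "size": size}

def table_program(top_w=1.0, top_d=1.0, top_t=0.05, leg_h=0.7, leg_t=0.06):
    """A 'shape program' for a table: one top plus the same leg repeated at four
    symmetric positions, so regularity comes from the program structure itself."""
    parts = [cuboid(center=(0.0, 0.0, leg_h + top_t / 2), size=(top_w, top_d, top_t))]
    for sx in (-1, 1):
        for sy in (-1, 1):
            x = sx * (top_w / 2 - leg_t / 2)
            y = sy * (top_d / 2 - leg_t / 2)
            parts.append(cuboid(center=(x, y, leg_h / 2), size=(leg_t, leg_t, leg_h)))
    return parts

shape = table_program()   # 1 top + 4 identical legs, symmetric by construction
```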
So finally: we started from graphics and said, let's do inverse graphics to infer object states from images, but as we have seen, this is really about inverting perception rather than just inverting graphics from images. You can invert from multi-sensory observations, and it's not purely about representing object states; there is an implicit reasoning process where you want to discover the right level of abstraction, discover it from data, and use it to get a more intelligent understanding of what the scene representation should be. Some of this is about action abstraction: these abstractions don't only apply to static objects, shapes, and scenes; in some sense the same levels of abstraction, the same concepts, also exist in human actions. Look at this five-year-old playing Jenga: he is doing much better, and what is remarkable is that he not only plays the game but also has a level of abstraction about the actions he takes. Look again: he uses vision and touch to estimate object states, and then he decides that some of the blocks are simply not movable, which he can tell just by touching them, while others are movable, and for those he applies a totally different action, taking them out. So there is a level of abstraction in the reasoning: it applies not only to static object shapes and scenes but also to which actions, and which dynamics models, you should use depending on the state of the object. Some objects you don't want to touch; others you should really push so you can continue the game. This kind of abstraction also guides the learning of the dynamics model and helps us do better in object manipulation. Taking all of this into account, together with Nima Fazeli and Alberto Rodriguez (Nima was a PhD student at MIT and is now an assistant professor at the University of Michigan), we built a Jenga-playing robot that takes inspiration from all of these ideas: it uses vision and touch to identify the state of the objects, figures out which blocks are actually pushable, and uses the learned dynamics model together with MPC to interact with the objects and play the game of Jenga.

To summarize: my research is about learning to invert, which means inverting different types of physical models: the perception model, to estimate 3D shapes from raw observations; the dynamics model, which we did not dwell on, to infer physical properties; and in some sense the reasoning process, going from raw states to some level of abstraction. The other direction is learning to augment. In some sense, and I didn't talk about this, you not only want to discover abstractions, to recognize cuboids and cylinders, you also want to understand what these abstractions really mean: why we care about cuboids and cylinders, especially in the context of natural language. When I say something is a cube, how do I know that "cube" refers to a shape rather than a texture or a color? It also means augmenting dynamics models so you can use them for planning and manipulation, and augmenting perception and graphics engines so we can synthesize 3D shapes and 2D images and do image editing and manipulation.

The key point behind my research is that if we really want to go from what we call infant machines to intelligent machines, we should think about what the physical model of the physical world is. We have to be extremely careful here, because there are always people arguing, for example from Canada, that there is this bitter lesson: no matter what we do, any time we try to build in something more specific, it eventually hurts because it doesn't generalize as well, and whenever there is more data, the data always wins. But what we call a physical model here is slightly different, in the sense that we are not building anything specific: what we call the physical model is very minimal but also very universal. It is true that building in things that are specific may eventually hurt when you want to generalize to more complex data or different tasks. But what we are building here, a depth map or a projection, is a very minimal structure that is also physical and universal, and it applies to every image you look at.
So this kind of minimal but very universal structure seems to help you generalize instead of hurting you. And of course you want to use learning to learn whatever these models don't give us, including doing inference — inverting them — and including doing better simulation, augmenting them. Thank you so much.

Thanks again so much, thank you. Are there any questions? Okay, Liam has a question — why don't you go ahead.

Yeah, really great talk, thanks so much. I was curious to get your impression on the relationship between physical models and affordances. You show in a lot of your examples that, through planning and the dynamics, you're able to accomplish tasks that require a really good understanding of the dynamics. But what I'm curious about is your impression of how we can generalize these plans to maybe leverage the affordances of objects that we've inferred — affordances that maybe weren't part of the original task specification, or that allow us to do problem solving and reasoning a little above the object abstraction level.

Yeah, that definitely makes sense, because that was partially the original motivation — why we want to represent things this way. In the shape program paper, the original motivation, although we were never able to get there, was to represent not just object geometry in terms of abstractions, but also to represent that these are the different parts, they have different functionalities, and then you can do different things with them. You can recognize that chair legs and table legs are similar: they're similar in terms of their geometry, and they're similar in terms of their functionality and their physics, because they're both trying to make the thing stable. That means if I have a table that is not stable, maybe I can just add another leg, or maybe some leg is too long or too short, and those are things I can do to fix it. So I think this totally makes sense, and it's what motivated us to do this work, although right now we haven't really tried it.

Yeah, thank you.

Cool, thank you. You're welcome.
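As a rough, hypothetical illustration of the part-level abstraction described in this answer — a shape written as a small program over primitives, with per-part labels that could carry functionality — here is a sketch. The Cuboid primitive, the table_program function, and all parameters are invented for illustration; this is not the DSL from the shape program paper.

```python
from dataclasses import dataclass

@dataclass
class Cuboid:
    # A box primitive: center (x, y, z), size (sx, sy, sz), and a part label
    center: tuple
    size: tuple
    label: str  # e.g. "top" or "leg" — a hook for part-level semantics

def table_program(width=1.0, depth=0.6, height=0.7,
                  leg_thickness=0.05, top_thickness=0.04):
    """Generate a table as a short 'program': one top plus a loop over four legs.

    Editing a parameter (e.g. making the legs shorter) changes the whole shape
    consistently, which is the kind of part-level reasoning discussed above.
    """
    parts = [Cuboid(center=(0.0, 0.0, height - top_thickness / 2),
                    size=(width, depth, top_thickness), label="top")]
    for sx in (-1, 1):
        for sy in (-1, 1):
            cx = sx * (width / 2 - leg_thickness / 2)
            cy = sy * (depth / 2 - leg_thickness / 2)
            parts.append(Cuboid(center=(cx, cy, (height - top_thickness) / 2),
                                size=(leg_thickness, leg_thickness,
                                      height - top_thickness),
                                label="leg"))
    return parts

for part in table_program():
    print(part.label, part.center, part.size)
```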
Yeah. And before — oops, okay, turns out I'm unmuted; I just thought it was me. Yeah, we can hear you, Krishna.

Cool. So my question was basically about some of the scalability aspects of these techniques. For example, all of the approaches you have shown so far are mainly geared towards object reconstruction. Do you have any thoughts on which of these aspects might scale well in general to scene-level understanding, and where there are avenues for future work?

Do you mean, for instance, scenes like indoor scenes, where there are floors and walls and then many more objects — is that what you're saying?

I mean, in general, scenes that you see everywhere — like a messy tabletop, for example, where there might be lots of objects and multiple relationships between objects. I'm just trying to get your perspective on what challenges are not visible in the single-object setting that would be important to tackle there.

I see, yeah, that makes sense. It is true that the 3D reconstruction work I talked about is mostly just for a single object, but there are also cases — for example the Jenga robot case, which involves a large number of objects, where we're actually modeling the blocks separately, as well as this example where we try to model something a little more complex, with multiple objects interacting. A challenge — and I agree with you here — is that these kinds of representations, especially if you use either voxels or particles, are very hard to scale up to more complex scenes: scenes are three-dimensional while images are two-dimensional, so the number of voxels or particles required to model a scene — it's not exponential, it's cubic in three dimensions — increases very quickly. It then becomes much harder to learn, and much harder to fit into a neural network or even into GPU memory.

So it seems there are two different approaches people have been thinking about. One is to increase the level of abstraction: you build a better hierarchy and only model things when needed, so for things you don't care about you model them at a coarse scale, and for things you do care about you model them at a finer scale, and this can even be a bit dynamic. The ways you can actually implement this — things like octrees, or some attention mechanisms — are, I think, one way people have thought about scaling up. The other is that instead of using these explicit representations, you switch to implicit representations, where, for an object, instead of modeling what is actually there, I just have a neural network that is a compact representation of the scene. I can query that network and it can tell me, okay, what is the occupancy of that particular point, what is the radiance of that particular point, what is the visibility of that particular point. Assuming you have a good way to train this neural network, its size wouldn't grow with the size of the scene. So these seem to be the two approaches, and of course there are also ways to combine them: you can have a neural sparse voxel field, things like that, which integrate the octree or hierarchical idea with the implicit representation idea. So these seem to be the ways we can try to do better. Not sure if that answers your question.

Yeah, thanks.

Okay, it looks like we're out of time. Let's thank our speaker one more time — thanks so much for presenting. Thank you. And members of the REAL team, I hope to see you in five minutes, so nine minutes past two EST, in the meet-and-greet with JJ; that's the Google Meet link that I just received earlier.
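To make the "query a neural network at a point" idea from this answer concrete, here is a minimal sketch of an implicit occupancy representation: a small MLP mapping a 3D coordinate plus a scene latent code to an occupancy probability. The layer sizes, the latent code, and the occupancy function are illustrative assumptions, not a specific published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a latent code summarizing the scene, plus a 3D query point
LATENT_DIM, HIDDEN = 32, 64
W1 = rng.normal(0, 0.1, (LATENT_DIM + 3, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W3 = rng.normal(0, 0.1, (HIDDEN, 1))

def occupancy(points, latent):
    """Query occupancy in [0, 1] at arbitrary 3D points.

    points: (N, 3) query coordinates; latent: (LATENT_DIM,) scene code.
    Memory cost is fixed by the network size rather than by the scene's spatial
    extent, which is the scaling argument made in the answer above.
    """
    z = np.broadcast_to(latent, (points.shape[0], LATENT_DIM))
    h = np.concatenate([points, z], axis=1)
    h = np.maximum(h @ W1, 0.0)              # ReLU layer
    h = np.maximum(h @ W2, 0.0)              # ReLU layer
    return 1.0 / (1.0 + np.exp(-(h @ W3)))   # sigmoid -> occupancy probability

# Query a few points through the (untrained) network, just to show the interface
pts = np.array([[0.0, 0.0, 0.0], [0.5, -0.2, 1.0]])
print(occupancy(pts, latent=np.zeros(LATENT_DIM)))
```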
Info
Channel: Montreal Robotics
Views: 499
Id: 0c6bEyXQ898
Length: 60min 35sec (3635 seconds)
Published: Fri Apr 16 2021