Unsupervised Monocular Depth Estimation With Left-Right Consistency

Video Statistics and Information

Captions
Hi everyone, I'm going to present the work I've done with Oisin Mac Aodha and Gabriel Brostow at University College London on unsupervised monocular depth estimation. We want to build a system that, given an input image like this one, will produce a dense depth map like that one, which came out of our system; in this case, brighter means closer.

So why are we interested in depth? Depth is crucial when it comes to understanding the world around us, for tasks such as navigation and mapping: if you are a robot you would like to know where you are, where you want to go, and how far away things are. If you want to grasp things, or interact with the physical world around you, depth is also crucial. Finally, if you want to do non-physical interactions such as augmented reality, say I want to put an object on the table right there, I want to know how big this table is and how far away it is.

So how do you usually get depth? I'm just going to name-drop a few methods here: structured light, stereo cameras, or even LiDAR. So why are we interested in monocular depth estimation? Well, let's say you want depth for this photograph, which is a postcard from Hawaii in the 70s. You can't get it, because there was no depth sensor at the time, and the same goes for any other photograph that doesn't come with depth information but for which you would like to have it. We ran our method on this picture, and I think the result looks pretty good. You might also have a physically restrictive setup, such as endoscopy: you don't have enough space to put a stereo camera or a laser scanner at the end of your endoscope, so monocular depth estimation is very useful in this case. And finally, why not: if pirates can see the world and perceive depth with only one eye, I think we should be able to do so too.

There has been some work done on depth estimation, and it usually relies on supervised depth, which means you have a dataset of input color images with associated ground-truth depth images aligned to them. You then train a model that outputs a depth map and rely on some loss or optimization method to update your parameters and get a better estimate. This is essentially the work of Automatic Photo Pop-up or Make3D, which are local methods relying on handcrafted features. More recently there have been advances using neural networks, which allow a more global understanding of the scene; we have the seminal work of Eigen and colleagues, and since then a lot of other papers doing supervised depth estimation.

So you might be saying: hey, why don't we just use ground-truth depth and train a model? Well, as it turns out, ground-truth depth is actually really hard to capture. To give you an idea, this is the KITTI stereo dataset from 2015, and this is the car that was used to capture the data: it has two stereo cameras at the top and an expensive LiDAR scanner on the roof. This is an input image and this is the ground-truth depth that comes with it. As you can see it is pretty sparse, it doesn't have any depth information above the horizon, and some objects, such as these buses, are actually missing from the data because they are moving. We could also use structured light, like the Kinect. Everyone loves the Kinect, right? However, the Kinect doesn't love the sun: it doesn't work very well outside and has a very limited range. So what do we do? Is it all doom and gloom? Well, not really.
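As an aside, to make the supervised setup described above concrete, here is a minimal sketch of what such a ground-truth regression loss could look like. This is an illustrative PyTorch snippet, not the code of any of the cited methods; the masking simply reflects the sparsity of the LiDAR ground truth just mentioned.

```python
import torch
import torch.nn.functional as F

def supervised_depth_loss(pred_depth, gt_depth):
    """L1 regression loss against sparse ground-truth depth.

    pred_depth, gt_depth: tensors of shape (B, 1, H, W).
    Pixels with no LiDAR return are assumed to be stored as 0 in gt_depth
    and are excluded from the loss, since the ground truth is sparse.
    """
    valid = gt_depth > 0                    # pixels that actually have a depth value
    if valid.sum() == 0:
        return pred_depth.new_zeros(())     # nothing to supervise in this batch
    return F.l1_loss(pred_depth[valid], gt_depth[valid])

# Hypothetical usage with any monocular depth network `model`:
#   depth = model(image)                    # (B, 1, H, W)
#   loss = supervised_depth_loss(depth, lidar_depth)
#   loss.backward()
```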
Remember we said earlier that we could get depth from stereo. Let's say I'm capturing a scene with my stereo cameras, left and right, and I'm looking at this apple on this tree. If I can solve for correspondence between these two images I essentially get disparity, and disparity is directly related to depth. So if I can get my network to solve for disparity, or correspondence, from a single image, then I get depth. Let's train with stereo data. Stereo data is easy to capture, there are lots of devices around, and actually I'm sure a lot of you have a stereo camera in your pocket, because the iPhone 7 Plus has two cameras on the back.

We're not the first ones to use stereo data to train models: last year at ECCV we had the Deep3D model, which was trained with stereo images to do image interpolation, and at the same conference we had the work of Garg and colleagues, which also used stereo pairs to infer depth. However, they both made strong approximations in their image formation model, which led to poor reconstruction quality, and we will show in this work that we can actually get state-of-the-art results using only stereo images at training time.

So how does it work? In unsupervised, or self-supervised, depth estimation we have stereo pairs at training time. Our target is one of these two images, which means that our model needs to output an image, and we then have a reconstruction loss between this output image and our target image. We still want depth, however, so we need our network to output a disparity. Then how do we go from this disparity to depth? We just sample one of the input images using this disparity, and this is what we will call our baseline. As our first contribution, for the sampler we use the bilinear sampler from spatial transformer networks from 2015, which is fully differentiable, which means that we can train end to end and get good reconstruction quality.

To give you an idea of the results, here is an input image from the KITTI stereo dataset and this is the result of that baseline. As you can see it's already pretty good: it gets the general sense of the scene and we have good boundaries, but it has some artifacts, such as here and there. So we didn't stop there; we kept going and improved our method, and this is the result of our full method, where, as you can see, we solve all those artifacts.

So how does it work? As I mentioned before, this is unsupervised depth estimation, so we push one image through the network and sample the other one to generate a new one. But then we use a commonly used trick from traditional stereo, which is to output two disparity maps, one for the left image and one for the right image. We can then enforce these two disparities to describe the same thing, to be consistent with each other, and it turns out we can also do that inside the network, which means that we can train everything end-to-end. We also, of course, output images, and we have a loss on the reconstruction of these two images; finally, we also have a smoothness loss on both disparities to make sure that our output is smooth.

A few words on the architecture: it's fully convolutional, which means that you can pick whatever encoder you want; we use VGG-like and ResNet-50. We use skip connections between the encoder and the decoder, similar to DispNet and FlowNet, which are from Thomas Brox's group. We also have multi-scale generation, which means that we generate the output at multiple scales and we also have a loss at multiple scales.
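To make the losses just described a bit more concrete, here is a minimal sketch of the reconstruction, left-right consistency, and smoothness terms, assuming a PyTorch setup. The function names, loss weights, and disparity sign convention are illustrative choices rather than the authors' released code, and the paper's appearance loss also includes an SSIM term and per-scale weighting that are omitted here.

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(src, disp):
    """Differentiable bilinear sampling of `src` along the x axis by `disp`.

    src:  (B, C, H, W) image to sample from (e.g. the right image).
    disp: (B, 1, H, W) horizontal disparity, as a fraction of image width.
    Returns a reconstruction of the other view (e.g. the left image).
    The sign convention here is illustrative.
    """
    b, _, h, w = src.shape
    # Base sampling grid in the normalized [-1, 1] coordinates grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=src.device),
        torch.linspace(-1.0, 1.0, w, device=src.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Shift the x coordinate by the disparity (a full image width spans 2 units here).
    x = base[..., 0] - 2.0 * disp.squeeze(1)
    grid = torch.stack((x, base[..., 1]), dim=-1)
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def smoothness(disp):
    """Simple L1 penalty on horizontal and vertical disparity gradients."""
    dx = (disp[..., :, 1:] - disp[..., :, :-1]).abs().mean()
    dy = (disp[..., 1:, :] - disp[..., :-1, :]).abs().mean()
    return dx + dy

def unsupervised_stereo_loss(left, right, disp_l, disp_r, w_lr=1.0, w_sm=0.1):
    """Reconstruction + left-right consistency + smoothness (sketch)."""
    # Reconstruct each view by sampling the opposite image with its disparity.
    left_rec = warp_with_disparity(right, disp_l)
    right_rec = warp_with_disparity(left, disp_r)
    photometric = F.l1_loss(left_rec, left) + F.l1_loss(right_rec, right)

    # Left-right consistency: each disparity map should agree with the other
    # one sampled at the corresponding location.
    lr = F.l1_loss(disp_l, warp_with_disparity(disp_r, disp_l)) + \
         F.l1_loss(disp_r, warp_with_disparity(disp_l, disp_r))

    return photometric + w_lr * lr + w_sm * (smoothness(disp_l) + smoothness(disp_r))
```

At test time, a metric depth map can be recovered from the predicted disparity with the usual stereo relation depth = f * B / d, where f is the focal length, B the stereo baseline, and d the disparity in pixels.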
Finally, it is fast: the current model, which wasn't optimized for speed, runs at 30 frames per second on a Titan X.

Now let's see a video of the results. This is a video of the test set on KITTI; the network has never seen any of these frames, and every single frame is processed independently, there is no video or temporal information, all the frames are independent from each other. As you can see, we reconstruct all the pedestrians, the walls, the signs, and the road.

Now let's compare with other methods. This is an input image from KITTI and this is the ground-truth depth, which we interpolated for visualization purposes; as you can see, the top part is missing. I will go through the results of other methods: we have Eigen, Liu, and Garg, which is our closest competitor as it is also unsupervised and from ECCV, and this is our result. As you can see, we recover more details and we're more faithful to the input image. We also ran some numbers. We did a thorough and careful evaluation: we re-ran most of the other methods and compared with the same evaluation code, which we share with everyone, and as you can see we beat all previous supervised as well as unsupervised methods. That means that with only stereo pairs at training time we can actually get state-of-the-art results against methods that use ground-truth depth. We also ran our method on an unseen dataset, namely Make3D. This is an input image, this is the ground-truth depth, then Karsch, Liu, and Laina, which is state of the art on this dataset, and our result; as you can see, we do pretty well around the edges of the scene.

There are a few things to know about our method. Because we are based on a reconstruction loss, everything that is not Lambertian will have unreliable depth; this could be addressed with more supervision, which was actually studied as subsequent work in a paper presented two days ago. We also need calibrated data, which means it needs to be synchronized and rectified; if that's too much for you, then you should stay for the next talk, where they use less supervision at the expense of reconstruction quality.

As a conclusion: we can get depth from a single photograph; we get self-supervision from stereo data, which is both cheap and scalable; and our method is accurate, as it beats all previous results on the KITTI dataset. We made the code, the models, and the evaluation code available for everyone, you can just search for monodepth, and if you have more questions come to poster 21. And this is a video I took yesterday in Hawaii; as you can see, it does a pretty good job. Thank you.

[Session chair] Okay, so we have time for questions. There are a number of microphones here in the room, it's best to use those; otherwise you can also try to tweet and maybe we'll find the questions. To get started, maybe I have a question. With stereo there are ambiguities, in particular if you have horizontal lines aligned with the epipolar lines: those will tend to systematically be missing in the left/right projection, since stereo essentially fails on them most of the time, meaning that you systematically miss them in your training data. Did you address that, did you think of how you could solve it, did you consider trinocular setups that could actually help with that?
[Speaker] It's a good question. We didn't really have that problem, because most of the structures are really vertical objects, and in terms of the loss it was doing a pretty good job with that. But we could definitely use more than two cameras, or space them differently; we haven't tried that with this method, but it could definitely be done.

[Audience question] You said that it makes a Lambertian-world assumption, and you used the KITTI dataset, where there are a lot of specular things. Do you have any kind of qualitative intuition on how much highly specular surfaces affect output quality?

[Speaker] It's a good question. Actually, if you look at the results, you can see that the car hoods are usually slightly bumped in, as are the windows; and the KITTI dataset used CAD models, so the windows are actually at the depth of the window, so we probably lose a bit there. We don't have a quantitative evaluation of that, but it didn't seem to hurt too much.

[Session chair] Okay, thank you.
Info
Channel: ComputerVisionFoundation Videos
Views: 25,403
Rating: 4.9463086 out of 5
Keywords: CVPR17
Id: jI1Qf7zMeIs
Length: 12min 43sec (763 seconds)
Published: Thu Jul 27 2017