Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry

Video Statistics and Information

Captions
Let's finish with the last talk of this session. It's on Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry. It's a paper by Nan Yang, Rui Wang, Jörg Stückler and Daniel Cremers, and the talk is going to be given by Nan Yang. [Music]

Thanks for the introduction. Hello everyone, my name is Nan. I'm glad to introduce our work, Deep Virtual Stereo Odometry. This is joint work with Rui Wang, Jörg Stückler and Daniel Cremers.

As we all know, deep learning has swept a lot of areas in computer vision, not only high-level perception tasks like classification and object detection, but also low-level tasks like super-resolution and optical flow estimation. However, there is still one field where deep learning methods cannot compete with classical methods: visual SLAM, or as we can say, visual odometry. It means to reconstruct the world from moving cameras and estimate the camera poses at the same time. In fact, in large-scale outdoor scenes, as you see in this video, classical methods can only achieve good performance with stereo cameras. Using a single camera, however, monocular methods cannot perform well on such large-scale outdoor sequences, because the metric scale cannot be recovered from a single camera, and this results in very large scale drift in the end.

Here is an example of scale drift. This sequence is taken from the KITTI dataset. Direct Sparse Odometry (DSO) is the state-of-the-art monocular visual odometry system, but you can see that it still has a very large scale drift.

Over the last two years, researchers have proposed a number of quite impressive deep learning approaches to visual SLAM or visual odometry. These are a few representative works. Most of these approaches tackle the problem with an end-to-end deep neural network. However, none of these methods can outperform classical approaches in terms of quantitative evaluation on established odometry benchmarks such as the KITTI or TUM datasets.

In contrast, in this work we propose a hybrid method which combines the advantages of deep learning and classical visual SLAM methods. Our method provides state-of-the-art performance on established benchmarks. In fact, the performance of our monocular visual odometry system is on par with state-of-the-art stereo methods, but with only a single camera. We achieve this by integrating deep-learning-based monocular depth estimation into classical geometric methods. Deep learning can recover the metric scale of the depth from a single image because it can learn prior knowledge of objects and the typical scene layout. The depth estimates are not only used for the initialization of new points but are also integrated into the optimization of the error function of visual odometry.

To get accurate monocular depth estimation, we first introduce a semi-supervised deep neural network. The loss function is a combination of a self-supervised loss, a supervised loss and a regularization term. Our self-supervised loss is inspired by the work of Godard et al.: from one single left image, the network is able to predict both the left disparity map and the right disparity map. Instead of using LiDAR ground truth as Kuznietsov et al. proposed, we use stereo visual odometry to collect sparse depth as the supervision signal. In particular, we use the state-of-the-art stereo visual odometry system Stereo DSO, proposed by Wang et al., to collect the sparse depth maps. In this way we reduce the cost of collecting sparse labeled depth data. The regularization terms deal with occluded areas and textureless areas.

For the network architecture, we propose StackNet, which is composed of two sub-networks, SimpleNet and ResidualNet, inspired by DispNet and FlowNet. Both of them are fully convolutional neural networks with skip connections. ResidualNet learns the residual signal of SimpleNet and gets additional cues like the image, the residual image and the disparity map from SimpleNet. The final outputs are the element-wise summation of the outputs from the two sub-networks.

Let's first see the evaluation results of StackNet. We compare with the state-of-the-art self-supervised approach of Godard et al. and the state-of-the-art semi-supervised approach from Kuznietsov et al. As you can see here, at the time of submission we achieved state-of-the-art performance on most of the metrics. In terms of qualitative evaluation, as you can see here, our approach delivers better predictions, especially on object boundaries, for example the traffic sign on the right part of the image.

To show the generalization capability, we also ran StackNet on the Cityscapes dataset, and the model was trained only on the KITTI dataset. You can clearly see that our network can predict plausible depth maps and can still recover the shape of objects, for example the poles in the left image and the car in the right image.

Okay, now we have a deep neural network, StackNet, which produces the two disparity maps of a stereo camera, but from only a single image. We then want to use this to mimic a stereo setup in a monocular visual odometry system. How can we do this? The depth of the selected points on each new keyframe is first initialized using the left disparity map. Now we want to construct a stereo photometric error, but only from this single image. So we take the baseline from the training set of StackNet and project the selected points onto a virtual right coordinate. If we want to construct a stereo photometric error, we need the intensity of this virtual right image. Here comes the right disparity map: we backward-warp this virtual right coordinate to the original left image. Now we have a photometric error between these two terms; we call it the virtual stereo term. Together with the temporal photometric error term, we have the total photometric error, and we use the Gauss-Newton method to optimize the error function and obtain refined sparse depth estimates and the poses of the keyframes.

This is another sequence from the KITTI dataset. You can see that monocular DSO can actually deliver locally good 3D reconstruction, but the problem is that after running for a long time the scale drift is very large. After using the initialization from the left disparity map, you can see that the scale drift is largely eliminated, but the accuracy is still not very high. After adding the virtual stereo term, you can see that the accuracy is improved very much. Basically there is no visual difference between the estimated trajectory from DVSO and the ground truth.

We also evaluate DVSO on the training set of KITTI against other state-of-the-art stereo methods. Green numbers mean the best performance and blue numbers mean the second-best performance. t_rel and r_rel mean translational error and rotational error, respectively. Sequences marked with the symbol are those used for training StackNet, and the other sequences are not used for training StackNet. You can see that DVSO achieves comparable results to the state-of-the-art stereo methods on both subsets of the sequences, but DVSO uses only a single camera.

We also submitted our results to the KITTI test set, on which the ground truth is not publicly available. DVSO also achieves better results than Stereo DSO on the test set. In fact, DVSO is the best monocular visual odometry system on the KITTI benchmark, and its performance is on par with stereo or LiDAR methods.

We also tested DVSO on segments of the Cityscapes dataset, where the camera properties are totally different from those of the KITTI dataset. The red line is the GPS ground truth provided by the dataset and the blue line is our result after performing similarity alignment. DVSO can still deliver very good results.

Recently I also recorded a sequence around the Gasteig, where you are sitting now, and this is the street view. You can see that although the result is not as good as on the KITTI dataset, it still delivers reasonable depth estimation; for example, it can recover the shape of the traffic signs and the shape of the cars. Of course, we also ran DVSO on the recorded sequence to deliver a 3D reconstruction, and this is the sparse point cloud from DVSO. I hope you can tell that this is the building you are sitting in now. Although we see that the depth map from StackNet is not very accurate, DVSO can refine the depth using the classical optimization method. This is the reconstruction you can see here.

To conclude: first, we propose StackNet, a semi-supervised monocular depth estimation network. We do not use LiDAR ground truth, and it achieves state-of-the-art performance on the KITTI dataset. We also propose DVSO, a monocular visual odometry system which leverages the depth estimation from StackNet and achieves performance on par with state-of-the-art stereo methods, while using only a single camera. We think our approach is a promising direction to enable monocular autonomous navigation in GPS-denied environments. If you want to discuss in more detail, you are welcome at my poster session. Thank you.

Thank you very much for this exciting talk. There's a question up here. Yeah, over there on your right.

Very, very good talk, and really good results, very impressive. One difference with Godard might be that Godard can also train with just monocular sequences, not just test on monocular sequences but train. Can your method also train on monocular sequences? I mean semi-supervised in the sense that part of the dataset is stereo and part is monocular.

I don't know which paper from Godard et al. you refer to, but in last year's CVPR paper I think Godard et al. proposed the self-supervised approach, and in that paper they also trained on stereo sequences.

I'm talking about the latest one. Sorry, you're right. It can train on monocular sequences also.

For now we use the self-supervised loss, and I think it can be easily extended to the monocular case, because we can also use the poses from Stereo DSO or from other stereo methods as a kind of ground truth, or use some robust kernel, to train on monocular sequences, also using similar methods like a photometric error between different frames, where we warp the current frame to the reference frame to get the training signal.

Thank you.

Sorry, I didn't hear.

Very nice work, thanks. One question I have: why don't you use the motion parallax from successive frames to improve your depth maps?

Oh, you mean for the network during test time?

Yeah, so you estimate the depth, so why don't you use the motion parallax?

Oh yes, this is a very good question. I think this can be an extension of our work. Actually, I think this method can also be extended to an online training scheme, let's say, because we also get the pose estimates while running DVSO, so we can use these poses and the depth estimates from DVSO to in turn refine the StackNet deep neural network.

Yeah, thank you. Any more questions? Then let's thank the speaker again. [Applause] [Music]
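The virtual stereo term described in the talk can be sketched for a single point as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sample_row`, `huber`, `virtual_stereo_residual`), the Huber threshold and the toy data are all hypothetical, the images are assumed rectified, and disparities are assumed positive so that a left pixel at column u with disparity d appears in the right view at column u - d.

```python
import numpy as np


def sample_row(img, v, u):
    """Linearly interpolate img along integer row v at fractional column u."""
    u = float(np.clip(u, 0.0, img.shape[1] - 1.0))
    u0 = int(np.floor(u))
    u1 = min(u0 + 1, img.shape[1] - 1)
    w = u - u0
    return (1.0 - w) * img[v, u0] + w * img[v, u1]


def huber(r, delta=9.0):
    """Huber cost of a scalar residual (delta is an assumed threshold)."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def virtual_stereo_residual(left_img, right_disp, u, v, depth, focal, baseline):
    """Photometric residual between the left image and a virtual right view.

    left_img   : grayscale left image, shape (H, W)
    right_disp : predicted right disparity map in pixels, shape (H, W)
    (u, v)     : pixel of the selected point in the left image
    depth      : current depth estimate of the point in metres
    focal      : focal length in pixels; baseline in metres
    """
    # Project the point into the virtual right camera using the known baseline.
    u_right = u - focal * baseline / depth
    # Look up the predicted right disparity at the virtual right coordinate.
    d_right = sample_row(right_disp, v, u_right)
    # Backward-warp into the left image to synthesize the virtual right intensity.
    i_virtual = sample_row(left_img, v, u_right + d_right)
    return i_virtual - sample_row(left_img, v, u)


# Toy scene: a fronto-parallel plane with a constant disparity of 4 pixels.
left_img = np.tile(np.arange(32, dtype=float) ** 2, (8, 1))
right_disp = np.full((8, 32), 4.0)
focal, baseline = 100.0, 0.5

# A depth consistent with the predicted disparity (100 * 0.5 / 4 = 12.5 m).
r_good = virtual_stereo_residual(left_img, right_disp, 10, 3, 12.5, focal, baseline)
# A depth twice too large, as could happen under monocular scale drift.
r_bad = virtual_stereo_residual(left_img, right_disp, 10, 3, 25.0, focal, baseline)
print(r_good, huber(r_good))
print(r_bad, huber(r_bad))
```

In this toy scene the residual vanishes only when the depth agrees with the predicted disparity, which is how such a term can anchor the metric scale. In a full system this cost would be summed over all selected points, added to the temporal photometric term, and minimized with Gauss-Newton as the talk describes.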
Info
Channel: European Computer Vision Association
Views: 1,955
Rating: 5 out of 5
Keywords: ECCV, ECCV 2018, Computer Vision, Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry
Id: 2_nDLpGtY1Y
Length: 15min 18sec (918 seconds)
Published: Tue Dec 04 2018