Learning Deep Convolutional Frontends for Visual SLAM | Daniel DeTone, Magic Leap

Video Statistics and Information

Video
Reddit Comments

White paper here:

https://arxiv.org/abs/1812.03245

Localisation via visual means is a key technology for self-driving cars.

👍︎︎ 5 👤︎︎ u/Mantaup 📅︎︎ Dec 26 2018 🗫︎ replies

Google owns a decent chunk of Magic Leap; I'd be curious if they are working with Waymo at all.

👍︎︎ 2 👤︎︎ u/bartturner 📅︎︎ Dec 27 2018 🗫︎ replies

This is cool.

Does anyone know if this has been tried in an autonomous vehicle (to clarify, an automotive / passenger autonomous vehicle)?

Or any plans to?

I'd like to see some performance metrics if it has

👍︎︎ 2 👤︎︎ u/Mattsasa 📅︎︎ Dec 27 2018 🗫︎ replies

Very interesting tech. This has a lot of applications far beyond AVs. Much of modern video compression relies on point detection between frames, for example. There are tons of video and photo applications as well, from removing unwanted objects to stitching photos together.

I'd like to see how accurate a dead-reckoning rig you could build with a single laser depth sensor, humidity/temp sensor, and RGB camera. Point them all at the ground, start moving it, and see how far you can go before you are more than 10 cm off your starting point over various terrain.

The "Chen et al. (ECCV 2018) Estimating Depth from RGB and Sparse Sensing" paper seems even more interesting. Looking at the error map in the video, it seems you could dramatically increase Radar and Lidar resolution with this technique with no much error.

👍︎︎ 2 👤︎︎ u/WeldAE 📅︎︎ Dec 28 2018 🗫︎ replies
👍︎︎ 1 👤︎︎ u/mslavescu 📅︎︎ Dec 27 2018 🗫︎ replies
Captions
[Applause] [Music]

Thank you very much for the introduction. Today I'm going to talk about learning deep convolutional front-ends for visual SLAM. It's a bit of a contrast from the last talk: this one is focused more on the "where" from images rather than the "what", which is something deep learning has been applied to a little bit less. In general, deep learning has been really successful on semantic tasks, but bringing it into the geometry world, which is where our objects and people and places live, has been a little more challenging. I'm going to talk about some of the research I've done at Magic Leap over the past three years in this area. I'm pictured here wearing the Magic Leap device; it looks like my laser pointer doesn't work too well on this screen, so I'll just point and talk.

A quick advertisement: for those who don't know Magic Leap, we recently released a developer edition of the Magic Leap One, which is a wearable display. There's a picture of me wearing it, and some other pictures from YouTube; a variety of developers are playing around with this sort of technology right now.

For the focus of today's talk, I'll present primarily the work I've been directly involved with, but I'll also advertise some of the other deep learning work we've done. They say that the intersection of deep learning and SLAM is what I do, but I'll mention a little bit of other work as well. To break down the talk: first I'll introduce what SuperPoint is; it's essentially a deep SLAM front-end made of a multi-task, fully convolutional neural network. Then I'll talk about the main challenge of this network, which is how you train the system to detect interest points, things that are in general very hard for a human to label, so it's difficult to come up with a training set. And lastly I'll go over a quick snapshot of some of the other deep learning projects we work on at Magic Leap.

First, a brief summary of the last 15 years of what has happened in SLAM. Let me clarify what SLAM is versus visual SLAM. SLAM is simultaneous localization and mapping; in general, the problem is: given a sensor input, can you reconstruct the environment the sensor was in, and also localize the sensor within that environment? Visual SLAM usually means you're working with image data: you don't have expensive lidars or other types of sensors (there are hundreds of different sensors you could use), visual SLAM tends to just mean RGB camera images. So what has happened in the last 15 years? There has been a ton of progress, and there are systems that work today and are deployed at large scale in industry. But one of the characteristics that stood out to me when I started working on this problem, especially in 2015, was that there were very few learned components in these systems. I was really puzzled by that, because by 2015 deep learning had taken the computer vision industry by storm and was beating all the benchmarks on recognition, detection, and so on. So the question I kept asking was: why has this not yet broken into the SLAM problem?
Some of the early work in this area was what I like to call simple end-to-end deep SLAM: you take images, or whatever sensor data you have, hook it up to the biggest, latest and greatest deep convolutional neural network you can find, and output the pose quantities or the map quantities directly. This was exciting because at the time no one had really tried it, and this recipe was working across many other tasks, so myself and a lot of other researchers in the community thought: why not try this? The advantage is that it's purely data-driven. You don't need experts in 3D geometry and computer graphics to come in and hand-design a lot of heuristics for these systems to work; you can just set up the problem, collect your data, hit train, and it should work, right?

We actually spent some time working on homography estimation: given a pair of images, can you estimate the homography between them? It does work reasonably well, but unfortunately the accuracy was not really competitive at large scale with some of the engineered systems that came before. One reason this may be true is that it's very hard to collect this data at large scale. I often dreamed of having an ImageNet of SLAM, with all different sets of cameras, depth sensors, stereo rigs, millimeter-accurate pose information and so on, but the truth is it doesn't exist, maybe because it's a really technically difficult problem. As of today there is no ImageNet of SLAM, and a lot of the techniques I'm talking about today were geared towards getting around that: how can we develop our own data, and think about the problem in such a way that we can train these large-scale systems?

[Audience question] Yeah, I don't mind. The typical recipe, for example with ImageNet, was pretty straightforward: you go to Amazon Mechanical Turk and say, hey, what's in this image, click on it, draw a bounding box, and so on. But it's really hard for a human to go in and say, given a pair of images, how many centimetres the camera is displaced. So then you say, okay, we have more expensive sensors we can use, as well as external sensors and motion-capture-type technology, but a single dataset with all the diversity of the world and all the different types of sensors and distortions: I don't know of a dataset like that.

[Audience follow-up] You can do that to a certain extent, and there are datasets that have some of this, and they are growing more and more. But in terms of diversity of visual content, diversity of lenses, different physical sensors, rolling shutter versus global shutter, getting all of that set up is very expensive; you need external sensors and so on. It just doesn't exist at the scale of tens or hundreds of millions of examples that I think would be needed to do something like this.
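As a rough illustration of the "simple end-to-end" setup described above, here is a minimal sketch in PyTorch that stacks an image pair and regresses a homography directly. The architecture and the 8-parameter output (for example, four corner offsets) are illustrative assumptions, not the speaker's exact network.

```python
import torch
import torch.nn as nn

class EndToEndHomographyNet(nn.Module):
    """Toy end-to-end regressor: stacked image pair in, homography parameters out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),   # 2 stacked grayscale frames
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regressor = nn.Linear(128, 8)   # 8 numbers parameterize the homography

    def forward(self, image_pair):           # image_pair: (B, 2, H, W)
        x = self.features(image_pair).flatten(1)
        return self.regressor(x)

# Training would minimize, e.g., an L2 loss against ground-truth parameters obtained
# by synthetically warping images, since real large-scale pose data is scarce.
```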
So, why aren't these end-to-end methods competitive yet? I mentioned that data could be one reason, but in some sense they also lack interpretability. When the system just outputs a quantity, you often want to know more: things like confidences, and, if it made a mistake, why it made that mistake. So there are some other challenges as well.

That leads into a diagram like this. What's commonly done in SLAM is to split the problem into a front end and a back end. The front end abstracts the high-dimensional images into a small set of features that can be used for some sort of optimization in the back end, which gives you pose information or information about the scene. Over the past couple of years there has been some success in applying deep learning to the front end. Why is that? The most powerful component that has really revolutionized deep learning, in my opinion, is the convolutional neural network: a single image comes in and some quantities come out, and that aligns quite well with how the front end works. The work I'll talk about today is on the front end. There has been some very early work in applying deep learning to the back end, but that problem is harder to set up, because the inputs are no longer images but arbitrarily sized quantities, so it doesn't leverage the success of deep learning as directly. There is some exciting work in this area, it's just a little bit earlier.

That leads me to SuperPoint. What is SuperPoint? It's a deep convolutional neural network that takes a single image as input and outputs two quantities: first, 2D keypoint locations, and second, descriptors that can be used to match those 2D keypoints. The points and descriptors are computed jointly, which means that in practice the two tasks share about 90% of the compute; that's really attractive for compute-constrained platforms. In terms of architectural design, the 2D keypoint output has the full resolution of the image, and what's typically done for an image-input-to-image-output mapping is an encoder-decoder network with deconvolution layers. The design we came up with here doesn't use a decoder, purely to save compute, and I'll explain how we did that.

For the keypoint decoder, the input is a height-by-width image, which gets processed by a series of convolutional layers that reduce the resolution to H/8 by W/8 (height and width each divided by 8), a spatially reduced output with 65 channels. The 65 channels are designed to output a probability over a local 8x8 region of pixels (64 positions in the cell, plus one "no interest point" bin), so at each spatial position of the encoder output you have a classifier over the local 8x8 region. The essential design idea, thinking about SLAM and image matching, is that it's not that helpful to have multiple points within a local region; we want to extract points that are well spread out and cover the image well, and that's one of the design decisions that went in here. For those who are curious, the reshape back to the original height-by-width resolution is a trick used in super-resolution, so it's relatively commonly used; you can then apply 2D non-max suppression and get your keypoints. The descriptor is another head in the network, which also outputs a W/8-by-H/8 map, and once you have the 2D keypoints you can interpolate into this descriptor field to get a corresponding descriptor for each keypoint.
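To make the cell-wise keypoint head concrete, here is a minimal sketch (not the released SuperPoint code) of how the 65-channel, H/8-by-W/8 output can be decoded back to a full-resolution heatmap with a pixel-shuffle reshape. The tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cell_logits_to_heatmap(semi):
    """semi: (B, 65, H/8, W/8) raw logits from the keypoint head.

    Channel 64 acts as a "no interest point in this cell" bin; the other
    64 channels correspond to the 8x8 pixel positions inside each cell.
    """
    prob = F.softmax(semi, dim=1)        # probability over the 65 classes per cell
    prob = prob[:, :-1, :, :]            # drop the "no keypoint" bin -> (B, 64, H/8, W/8)
    heatmap = F.pixel_shuffle(prob, 8)   # reshape to (B, 1, H, W): the super-resolution trick
    return heatmap

# Thresholding the heatmap and applying 2D non-max suppression yields the keypoints.
semi = torch.randn(1, 65, 15, 20)        # e.g. logits for a 120x160 input image
heatmap = cell_logits_to_heatmap(semi)   # -> (1, 1, 120, 160)
```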
OK, so that's interesting: I've presented a somewhat unique network. In general you could set up a standard encoder-decoder network and it would probably do reasonably well, so what's interesting here? The key thing is how you train SuperPoint. Training the descriptors is a reasonably well-studied topic; the hardest part was how to get ground truth for the 2D keypoint locations. How do you create a dataset with supervised keypoint detections? We want to set up the problem so we can use supervised machine learning, which is the thing that works best in deep learning, so how do we do it?

The general recipe we used relies on homographies. We load up a single image, in this case from a large-scale database of natural images like MS COCO, and warp it according to a homography. That gives us a pair of images for which we know the geometric correspondence for every pixel, and that allows us to set up the descriptor loss: we can use standard metric learning, Siamese-style training, to train the descriptors. Once we have interest point labels, we can set up a classification loss and use softmax cross-entropy to train the interest point detector. But like I said, where do we get these keypoint labels? They're just too hard for a human to label. We can ask humans to label things like elbows, shoulders and knees, more semantic things, but humans have no idea what an interest point is; if you ask them, they don't know what you're talking about.

So we took a self-supervised approach. We started with a very simple synthetic renderer that generates simple shapes: quadrilaterals, triangles, ellipses, checkerboards. We trained the detector to detect corner-like interest points. With these images there is no ambiguity about where the corner locations are; for a triangle you have three vertices, and in the generation program you just use those three vertices as the ground truth. The idea is that we train a detector in this domain and then use homographic adaptation to overcome the domain gap that comes from training in this synthetic world and applying it to real images. It's abstract art, really.
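As a small illustration of the pair-generation step described above, here is a minimal sketch that warps an image with a random homography, so that dense pixel correspondences are known by construction. The corner-perturbation scheme and magnitudes are illustrative assumptions, not the exact sampling used in the paper.

```python
import cv2
import numpy as np

def random_homography_pair(image, max_shift=0.2):
    """Warp `image` with a random homography; returns (warped image, 3x3 H)."""
    h, w = image.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Randomly perturb each corner by up to max_shift of the image size.
    perturb = (np.random.rand(4, 2).astype(np.float32) - 0.5) * 2 * max_shift * np.float32([w, h])
    H = cv2.getPerspectiveTransform(corners, corners + perturb)
    warped = cv2.warpPerspective(image, H, (w, h))
    return warped, H   # H maps pixel coordinates in `image` to pixels in `warped`

# Every pixel correspondence is known through H, so both the descriptor loss
# (Siamese / metric learning) and the keypoint classification loss can be set up
# without any human labels.
image = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # any natural image, e.g. from MS COCO
warped, H = random_homography_pair(image)
```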
It's very simple, and we were surprised that this actually worked. Even training only on this synthetic abstract art, it performs reasonably well in the real world: it will detect things like the corners of a door or the tiles on the ceiling, more geometric structures, and we present those results in the paper. But it still wasn't quite at the level of some of the other methods, like SIFT, and that's where homographic adaptation came in, to have it fire on more things. I remember taking the system and pointing it at text; it had never seen anything like text, and text has some corner-like structure, but it just didn't know what to do with it. Because we trained it with so much noise and so much augmentation, it tended to under-detect, and in general we wanted to help it detect more things. That's where the homographic adaptation came in, which I'll talk about in a moment.

One key thing I'll mention about the synthetic training: those are the raw images, but the key to any high-performing deep-learning-based system is augmentation, so we augment these images aggressively. We warp them, zoom them, rotate them, and we add Gaussian noise, salt-and-pepper noise, brightness changes, and synthetic shadows. We made the problem really, really hard, to the point where something like ORB or a Harris corner detector, and this should be the ideal case for a Harris corner detector if you think about it, was failing by the time we had added that much noise. That got us really excited, because we saw that our system was able to overcome the synthetic noise. To give you a picture of that: on the x-axis is the degree of noise, and we ran experiments where we slowly increased the noise on the images. The top two lines show the deep-learning-based system and the other three lines show classical detectors, and you can see that we can add so much noise that the images are almost black, a human can barely make out what's going on, and the system can still detect the points. That massive performance gap is what got us excited about this work; the core idea behind SuperPoint is starting off with something that's better than the classical detectors. And like I said, the system works surprisingly well on real-world images.
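Here is a minimal sketch of the kind of heavy photometric augmentation described above (Gaussian noise, salt-and-pepper noise, brightness and contrast changes); the magnitudes and probabilities are illustrative assumptions, not the exact values used in training. Geometric augmentation (random warps, zooms, rotations) can be layered on top with the same homography machinery shown earlier.

```python
import numpy as np

def augment_photometric(image, rng=np.random):
    """Apply aggressive photometric augmentation to a grayscale uint8 image."""
    img = image.astype(np.float32)
    img += rng.normal(0.0, 10.0, img.shape).astype(np.float32)   # Gaussian noise
    mask = rng.random(img.shape)
    img[mask < 0.02] = 0.0                                        # pepper noise
    img[mask > 0.98] = 255.0                                      # salt noise
    img = img * rng.uniform(0.7, 1.3) + rng.uniform(-30.0, 30.0)  # contrast / brightness change
    return np.clip(img, 0.0, 255.0).astype(np.uint8)
```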
So, homographic adaptation: how did we take this system and these labels and start labeling real data? We wanted to get real natural images into our training set. We start by simulating planar camera motion with homographies. If your scene is planar and you move a camera around it, you can explain all of the pixel transformations with a homography, and the nice thing is that we can generate these homographies synthetically. We don't need expensive camera or sensor rigs; we can just use simple OpenCV calls to warp images, do the interpolation, and so on. We start with an unlabeled image, generate something like a hundred random warps of it, and fire the synthetically trained detector on each of these warped images independently. Across the different views, some of the points will only be seen from a couple of views, and we want to suppress those: they might come from random image noise, or from something that only looks like a point from one particular angle. And we want to enhance the points that are repeatedly seen across all the detections and all the views. Once we have run the detector on all these views, we aggregate the point sets, because we have the homographies that relate all of the warped images back to the original, and that results in a super set of points. That's where we got the name SuperPoint. Like I said, we want to suppress the spurious detections and enhance the repeatable ones, and we do this by starting with a base detector and running it from many different views.

Now, some of you might be wondering: you have a deep convolutional neural network, and you're saying you want to run it a hundred times on an image; the network already needs a GPU to run, so that could be pretty expensive. Well, this is only done at training time. Think of it as, rather than having a human in the loop clicking on where the corners are, we have a system do it. At inference time this all goes away, so we can afford the extra compute as part of the labeling process.
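Here is a minimal sketch of the homographic adaptation procedure just described: run a base detector on many random warps of an image, map the detections back through each homography, and aggregate them into a single heatmap. The `detect_heatmap` interface and the averaging scheme are my own assumptions, used only to illustrate the idea.

```python
import cv2
import numpy as np

def homographic_adaptation(image, detect_heatmap, num_warps=100):
    """detect_heatmap(img) -> (H, W) float heatmap from the base (synthetically trained) detector."""
    h, w = image.shape[:2]
    accum = detect_heatmap(image).astype(np.float32)
    counts = np.ones((h, w), dtype=np.float32)
    for _ in range(num_warps):
        warped_img, H = random_homography_pair(image)          # from the earlier sketch
        heat = detect_heatmap(warped_img).astype(np.float32)
        H_inv = np.linalg.inv(H)                               # map detections back to the original frame
        accum += cv2.warpPerspective(heat, H_inv, (w, h))
        counts += cv2.warpPerspective(np.ones_like(heat), H_inv, (w, h))
    return accum / np.maximum(counts, 1e-6)                    # repeatable points get reinforced

# Thresholding this aggregated heatmap yields pseudo-ground-truth keypoint labels
# for real images, which can then be used to retrain the detector.
```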
To evaluate the system, we used HPatches, a dataset of image pairs with ground-truth homographies relating them, and in general we saw that SuperPoint produces a denser set of correct matches. We compared it to SIFT, which I've mentioned, a very well-known method developed in the early 2000s; to ORB, which is more of a real-time detector in a similar vein; and to LIFT, a recent learning-based detector in the style of SIFT (they have similar names, but it's a modern deep detector). Here's an example where only the lighting changes: the camera is stationary and you want to see how robust the system is to illumination. In another, the viewpoint changes. The general story is that SuperPoint detects a denser set of points and the matches cover the image better. On the HPatches benchmark we got great results, beating out the other methods: given a pair of images, we compute points, compute descriptors around those points, run a standard OpenCV RANSAC homography estimation engine, and count the fraction of pairs in the dataset for which the homography is correct, so higher is better. Approximately 68% of the pairs had the homography correct to within the threshold. When we looked at the sub-metrics, we were really happy with the descriptor and its ability to discriminate. One interesting thing about the detector: when we looked at repeatability, we found that ORB actually has the most repeatable detections, yet it does the worst at homography estimation. That validated our thinking that a system which detects points that are well covered and well distributed in the image should do better at these kinds of geometric tasks. If you look at ORB, a lot of the points are clustered around high-texture areas, whereas SuperPoint is designed so that it only tries to find a single point within each local 8x8 region, and we were really happy with the visual distribution of the points in the image. Again, LIFT is a competing method. In designing SuperPoint we wanted it to be real-time, because at Magic Leap we have an embedded system and we need to run things with a low amount of compute.

One question you might ask: the system was trained in 2D, using homographies, which are synthetic 2D transformations, and it was evaluated in 2D; how does it work in 3D? Most people care about 3D scenes. To show this, we did a qualitative evaluation where we rigged up a simple connect-the-dots style algorithm that tracks points in a sparse optical-flow manner, and we applied it to a variety of datasets and image streams: some indoor data, some synthetic data, a fisheye lens, a stereo camera from a driving dataset. We were really happy with how well it performed and how general the detector and the detections were, just visually.

I'll give you a little demo here, because I have it set up. For those of you who are interested, the code is publicly available on GitHub and you can try it out. We've made the dependencies really small; you should be able to get it up and running in about five minutes. It uses Python and PyTorch, a deep learning framework you can install with pip, and the demo is pretty fun to play around with. Here's the GitHub page with all the information. I've got my little webcam here pointing at the table, and I'm running the demo; it doesn't seem to be handling the autofocus very well, so I'll adjust it; now you can see it's focusing on the room. This is running at QQVGA, a resolution of 120 by 160, and it runs relatively real-time on a CPU. In the visualization, which my girlfriend likes to call "the dancing worms", the detected points are shown in blue and they're matched over the last five frames by default, and the colors represent the confidence of the match: red means more confident, blue means less confident. One thing to point out is that the features are very well distributed in the image, which is what you want to see. If you want to wave your hand, I'm not sure it will see it at this resolution; yeah, it's kind of getting it. You can play around with the different parameters, but by default it has a five-frame memory, and the matching is done just by nearest neighbour; there's no prior on a search radius that assumes anything, so the points can jump around all over the place and the descriptor has to discriminate each point from all the other ones. Some of the points will pop in and out, and the idea is that by colouring them with the descriptor confidence, the less confident matches get the lower colour.
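Here is a minimal sketch of the kind of frame-to-frame matching used in the demo just described: plain nearest-neighbour matching of L2-normalised descriptors with no search-radius prior. The mutual check and the distance threshold are illustrative assumptions rather than the exact demo settings.

```python
import numpy as np

def match_descriptors(desc_prev, desc_curr, max_dist=0.7):
    """desc_*: (N, D) L2-normalised descriptors. Returns (i, j, distance) triples."""
    # Pairwise L2 distances between unit vectors via their dot products.
    dist = np.sqrt(np.maximum(2.0 - 2.0 * desc_prev @ desc_curr.T, 0.0))
    nn_fwd = dist.argmin(axis=1)      # best match in the current frame for each previous point
    nn_bwd = dist.argmin(axis=0)      # best match in the previous frame for each current point
    matches = []
    for i, j in enumerate(nn_fwd):
        if nn_bwd[j] == i and dist[i, j] < max_dist:   # mutual nearest neighbours, gated by distance
            matches.append((i, int(j), float(dist[i, j])))
    return matches

# The match distance can drive the confidence colouring in the demo: small distances
# (confident matches) in one colour, large distances in another.
```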
It's not always perfect, but it's fun to play around with, so if you're interested I encourage you to download the pre-trained network and run it on your webcam, on your favourite YouTube video, on a dataset, and so on.

So, in summary, what is SuperPoint? It's a modern deep SLAM front-end that operates on images and extracts features suitable for back-end optimization of things like pose. The recipe is self-supervised, meaning the system labels the data itself; there's no need to buy an expensive set of 3D sensors or spend a hundred thousand dollars on Amazon Mechanical Turk to get your dataset labeled. You can just collect your data, run the homographic adaptation process, and train your own. The public code lets you run SuperPoint; the full training stack isn't available, but it uses pretty standard, off-the-shelf machine learning mechanisms, so it should be relatively straightforward to recreate, and some people online have already started replicating it.

I've got a few more minutes, so I'll do a quick two-minute overview of some of the other research we're working on. Multi-task learning is something I mentioned, and it came up in the last talk as well. One of the difficulties with multi-task learning is balancing the learning across tasks: how do you tell the learning process which tasks are more important and which are harder? One of the things we developed is called GradNorm; this was done by some of my colleagues at Magic Leap. The basic idea is that you look at the gradients flowing through the learning process and balance their weighting so that no single task overpowers the others: you normalize the scales of the gradients based on an analysis of the gradients themselves. We found success using this to train a multi-task network that predicts depth, normals, and keypoints from indoor data, so I encourage you to check that out; it was published at ICML this year. [Audience question] In general you can tweak this to get what you want, but it's very difficult to know a priori which tasks will be easier and which will be harder given the architecture, so this gives you a framework for setting that up. What we found was that by balancing the tasks, all of them actually improve, which was a surprising finding. What happens with these networks is that they extract a multi-scale representation of the image whose features are shared across the tasks, so by balancing the gradients, the shared representation gets better and helps all the tasks in a similar way.
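Here is a heavily simplified sketch of the gradient-balancing idea: measure each task's gradient norm on the shared parameters and reweight the losses so that no task dominates. This only illustrates the concept; it is not the actual GradNorm update rule from the paper.

```python
import torch

def balanced_loss(task_losses, shared_params):
    """task_losses: list of scalar loss tensors; shared_params: list of shared parameters."""
    norms = []
    for loss in task_losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True, allow_unused=True)
        flat = torch.cat([g.reshape(-1) for g in grads if g is not None])
        norms.append(flat.norm())
    mean_norm = torch.stack(norms).mean()
    # Scale each task so its gradient norm is pulled toward the mean norm.
    weights = [(mean_norm / (n + 1e-8)).detach() for n in norms]
    return sum(w * l for w, l in zip(weights, task_losses))

# Hypothetical usage on a shared-encoder multi-task network:
# total = balanced_loss([loss_depth, loss_normals, loss_keypoints], shared_encoder_params)
# total.backward(); optimizer.step()
```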
Another project here is deep depth densification, if you can say that quickly. The basic idea is that we want to use sparse depth measurements to produce a dense depth image. There's a video that explains it better than I can: on one side is the input image, and here is the sparse depth coming in, those little dots sampled across the image. As the video goes on, the density of the incoming sparse depth measurements slowly increases, and you can see the system's output depth improving as more points come in. What we're trying to show is that with your depth sensor you might not need to compute a depth value for every single pixel, if you can use the corresponding RGB image to help infill between the depth measurements. The sparse depth provides a grounding for the system, because ultimately monocular depth estimation is an ill-posed problem due to scale ambiguity, but we showed that you can use a sparse sampling of the depth map to get high-quality per-pixel depth. I encourage you to check this out; it was presented at ECCV a couple of weeks ago.

That's pretty much it. If you're interested in doing research or engineering here, or know anybody who is, you can send an email to the ECCV address shown on this slide (it's what we used around ECCV); you can also email me directly, that would work too. Our two main locations are the Bay Area and Zurich. And here are some references and a selection of the people in our group who are working on these problems. That's it; I'll now take questions. Thanks. [Applause] [Music]
Info
Channel: AR MR XR
Views: 3,300
Rating: undefined out of 5
Keywords: Augmented Reality, Mixed Reality, Extended Reality, Virtual Reality, AR, MR, XR, VR, Magic Leap
Id: kjaRRGLw4RA
Channel Id: undefined
Length: 36min 51sec (2211 seconds)
Published: Mon Dec 24 2018