Deep Visual SLAM Frontends: SuperPoint, SuperGlue, and SuperMaps (#CVPR2020 Invited Talk)

Video Statistics and Information

Captions
Good afternoon everyone. My name is Tomasz Malisiewicz, and I'm here today to talk to you about my team's work on deep visual SLAM. The title of my presentation is "Deep Visual SLAM Front-Ends: SuperPoint, SuperGlue, and SuperMaps." This talk is being delivered at the Joint Workshop on Long-Term Visual Localization, Visual Odometry and Geometric and Learning-based SLAM at CVPR 2020. Let's begin.

Visual SLAM is the problem of performing simultaneous localization and mapping solely from images. It is a very important problem in numerous applications, such as mixed reality (depicted on the top row) as well as robotics (depicted in the next two rows). Within robotics we have tasks such as self-driving vehicles and autonomous cars, both requiring significant capabilities in mobility and perception of space.

In today's talk we will cover three things. First, I'll describe SuperPoint, with a discussion of architectures and training paradigms; these are the things you really need to know if you want to replace local features with convolutional neural networks. Second, I'll discuss SuperGlue, our team's attempt to use graph neural networks and attention to improve the feature matching process. Finally, I'll talk about SuperMaps: some ideas for extending our work beyond pairwise matching, and a road map towards end-to-end deep visual SLAM.

Let's begin with the first part. Visual SLAM can typically be decomposed into two parts: a front end and a back end. The goal of the front end is to deal with the image input. It's natural to use deep learning here, particularly convolutional neural networks, because we've seen many successes in applying deep nets directly to images. In the context of visual SLAM, the back end is an optimization problem over pose and map quantities; this is generally solved as a nonlinear least-squares problem, also known as bundle adjustment.

Our solution to this decomposition of SLAM into front and back ends starts with SuperPoint. SuperPoint is our deep SLAM front end: a convolutional neural network that takes an image as input and produces keypoint locations as well as keypoint descriptors. The network is fully convolutional, which means you don't have to extract patches first and then send them through a separate machine learning system. The points and descriptors are also computed jointly, meaning the backbone shares most of the computation between the two tasks. Finally, we've had great results using VGG-like backbones, and there's no reason why we can't modernize this with ResNets and more recent approaches. It's important to note that our approach was really designed for real-time processing on a GPU. That does mean we had to use a smaller backbone than researchers might like, and we have to deal with sparse points, because using sparse points is still the best way to build SLAM systems that are extremely efficient.

I'll make a quick point here: what was interesting about the SuperPoint work is that we devised the keypoint (interest point) detection head to perform a classification problem. The only thing the network has to do is classify which pixel is interesting. We cast this as a probability over 65 classes: the 65 classes are the pixels of an 8-by-8 region together with a "dustbin" class meaning no interest point. We don't use deconvolutional layers in our network, unlike U-Nets and SegNets, which makes it extremely fast.
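To make the classification formulation concrete, here is a minimal PyTorch sketch of a SuperPoint-style detector head. The layer sizes and names are illustrative assumptions, not the exact released architecture:

```python
# Minimal sketch of a SuperPoint-style detector head (assumed shapes, not the
# exact released model). The shared VGG-like backbone downsamples the H x W
# input by 8, so each cell of the feature map covers an 8x8 pixel region. The
# head classifies, per cell, which of the 64 pixels is an interest point, with
# a 65th "dustbin" class meaning "no interest point in this cell".
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectorHead(nn.Module):
    def __init__(self, in_channels=128):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.logits = nn.Conv2d(256, 65, kernel_size=1)  # 64 pixels + dustbin

    def forward(self, features):                       # (B, C, H/8, W/8)
        x = self.logits(F.relu(self.conv(features)))   # (B, 65, H/8, W/8)
        prob = F.softmax(x, dim=1)[:, :-1]             # drop the dustbin bin
        # Rearrange the 64 per-cell scores into a full-resolution heatmap;
        # no deconvolution layers are needed, which keeps inference fast.
        heatmap = F.pixel_shuffle(prob, upscale_factor=8)  # (B, 1, H, W)
        return heatmap
```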
So now let's ask ourselves: how can we train such a front end? To set up the training, we generate a large number of image pairs where we know the correspondences. We use Siamese training, and our pairs are generated so that they are related by a homography. Training the descriptor is relatively straightforward, because once we generate the homography we know where a pixel in the left image maps to in the right image.

The remaining difficulty is the keypoints themselves: where do they come from? Where can we get a dataset that defines which points are interesting? It would be very difficult to take images such as the ones I'm showing on the screen right now, send them to Amazon Mechanical Turk, and ask people to label interesting points. Interest points are quantities that were devised by the computer vision research community to help machines tackle the image matching problem; interest points do not exist in the human mind, except that of researchers. We decided to propose a self-supervised training procedure that works as follows. First, we define a synthetic world where corners and interesting things are well delineated, and we train on that. Then we use the resulting detector to label a real dataset of images, the MS COCO dataset in our case. The procedure we invented to propagate labels from one dataset to the next is called homographic adaptation because, to no surprise, it uses a lot of homographies.

Let's take a look at some of the synthetic images we pre-trained with. These are simple, non-photorealistic renderings of simple shapes: the kinds of things you would see in a classic book on corner detection from the 70s and 80s, the kinds of images people typically tested Harris corner detectors on. We generate these in Python, and we can generate millions if not billions of examples. They work extremely well. Our earlier version of SuperPoint, which was just the interest point detection part, we called MagicPoint. In experiments comparing it to other typical interest point detectors, we were pleasantly surprised that not only did our method outperform the earlier ones, it also worked extremely well as the images got darker and more blurred.

Let's now take a look at the homographic adaptation training procedure. The goal is to simulate planar camera motion. Homographic adaptation is a self-labeling technique designed to suppress spurious detections and enhance repeatable points. It's a little more complicated than just running the detector once on an image and saving the output. Let's see how it works. We start with an unlabeled input image, warp it multiple different ways, and run our MagicPoint detector on each warp. Each output gives us a different set of points; we then warp them back and reason about the superset of detected points. This procedure works extremely well.
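Here is a hedged Python sketch of the aggregation step just described. The helpers `detect_heatmap` and `sample_homography` are hypothetical stand-ins for a trained MagicPoint-style detector and a random-homography generator, not released APIs:

```python
# Sketch of homographic adaptation: run the detector under many random
# homographies of the same unlabeled image, warp the heatmaps back into the
# original frame, and aggregate them into a single self-supervised label.
import cv2
import numpy as np

def homographic_adaptation(image, detect_heatmap, sample_homography, n_warps=100):
    h, w = image.shape[:2]
    accum = detect_heatmap(image).astype(np.float32)  # identity warp counts too
    counts = np.ones((h, w), dtype=np.float32)
    for _ in range(n_warps):
        H = sample_homography(h, w)                    # random 3x3 homography
        warped = cv2.warpPerspective(image, H, (w, h))
        heat = detect_heatmap(warped).astype(np.float32)
        # Warp the detections back into the original frame with the inverse H.
        accum += cv2.warpPerspective(heat, np.linalg.inv(H), (w, h))
        # Track which pixels were visible under this warp, for normalization.
        visible = np.ones((h, w), dtype=np.float32)
        counts += cv2.warpPerspective(visible, np.linalg.inv(H), (w, h))
    return accum / counts  # averaged heatmap; threshold/NMS it to get labels
```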
Let me show you some final examples. This is the fully trained SuperPoint system running on an example pair of images. In the top left we see SuperPoint; in the top right we see LIFT, another deep-learning-based system; in the bottom left we see SIFT, one of the most well-known local feature matching methods; and in the bottom right we see ORB. Notice that SuperPoint is able to produce a higher density of green lines. The green lines are the correct matches, and matches are determined by matching the descriptors as nearest neighbors in descriptor space. Notice also that ORB, in the bottom right, tends to concentrate a large number of detections around highly textured regions. This is generally not a good thing if your goal is not only to produce a large number of matches but later to estimate the relative pose between the two images. In this next example we see a slightly more difficult pair; LIFT and SIFT also perform fairly well, but we see more interest points coming out of the SuperPoint method. And in this last example there is a rotation between the two images; the LIFT method suffers, and again SuperPoint has a nice, clean, high density of matches.
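The matching rule used in these comparisons is simple. Here is a minimal NumPy sketch of mutual nearest-neighbor descriptor matching, assuming L2-normalized descriptors:

```python
# "Connect-the-dots" matching: descriptors are matched as nearest neighbors
# in descriptor space, and a match is kept only when it is mutual.
import numpy as np

def mutual_nn_match(desc_a, desc_b):
    """desc_a: (N, D), desc_b: (M, D), assumed L2-normalized."""
    sim = desc_a @ desc_b.T              # (N, M) cosine similarities
    nn_ab = sim.argmax(axis=1)           # best match in b for each a
    nn_ba = sim.argmax(axis=0)           # best match in a for each b
    idx_a = np.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a       # keep matches that agree both ways
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)  # (K, 2) pairs
```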
So far I have talked about SuperPoint and its training, and I explained that all of the training is based on 2D images and 2D correspondences. We wanted to ask ourselves how this would generalize. A lot of people doubted that our 2D-based training procedure would work on real sequences. We used the simple connect-the-dots nearest-neighbor algorithm and ran SuperPoint on a large number of videos from different datasets. You can see that SuperPoint seems to work quite well; what we are showing here is pairwise tracking put together in video form. We released a pre-trained SuperPoint network, and it has been a big success in the community. We see many people using our system; it is implemented in PyTorch and very easy to get up and running. We first released it at the first Deep Learning for Visual SLAM workshop at CVPR 2018. Please go to our GitHub page if you want to take a look at SuperPoint and play with it yourself.

Before I continue and talk about the other projects, I want to take a little break and discuss the robustness properties of SuperPoint. We asked ourselves whether we could apply a similar architecture to other highly related tasks. We adapted SuperPoint to an instance detection problem for a specific marker pattern that is used for camera calibration. The method is called ChArUcoNet, and it is the same thing as SuperPoint except we replaced the real-valued descriptors with an ID classifier. In this case our pattern has 16 points, so we have a 16-way classifier: each detected point can be one of the sixteen points or none of the above. ChArUcoNet works extremely well as the images get darker. In this visualization, going from left to right, we see the same image getting darker; the raw images are at the top, with the OpenCV output (a classical image processing system) and our Deep ChArUco output below. Notice that some of the images are so dark a human can't see the pattern, yet ChArUcoNet is able to detect the IDs correctly. Here is a video showing this robustness property of Deep ChArUco. In each video, the left-hand side shows our Deep ChArUco and the right-hand side shows the OpenCV output. Every time a frame turns red, it means there were not enough detections and the system failed. Notice that there are significantly more red frames with the classical system than with our Deep ChArUco system.

Next we asked ourselves whether we could improve SuperPoint with real data and a visual odometry (VO) back end. So far SuperPoint was trained on MS COCO data, which is non-sequential. Let's take a look at what happens when you wire up a very simple VO system together with SuperPoint: it's not that difficult to recover the camera poses as well as the 3D locations of the associated 2D points. Here's another example showing the visual odometry back end running with SuperPoint on a sequence from the Freiburg TUM RGB-D dataset. We realized that one of the benefits of VO-based SuperPoint training is that we can establish correspondences across time. We saw earlier that the SuperPoint tracker worked quite well; now we're essentially taking this tracker and upgrading it to 3D, with the bundle adjustment computation determining which points were tracked successfully. Additionally, this lets us determine which points are stable and which ones aren't.

Let's take a quick look at our self-improving VO algorithm. We start with an input monocular sequence and run SuperPoint on it to create point tracks. These point tracks are upgraded to 3D using VO, which gives us a labeled point track sequence. We define stability by looking at the reprojection error: if the reprojection error is less than 1 pixel, we say the point is stable; if it is greater than 5 pixels, or some other predetermined large amount, we say the point is not stable; otherwise we ignore the point. Here are some videos of taking SuperPoint outputs, running them through VO, and looking at the resulting stability labels: in green we have the stable points, in black the ignored points, and in red the unstable points. Training with these labeled sequences follows the recipe of SuperPoint, using random homographies to create more data augmentation.
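As a concrete reference for the labeling rule above, here is a small sketch. The thresholds follow the talk, while the function and constant names are illustrative assumptions:

```python
# Stability labeling: a tracked 3D point is labeled by its mean reprojection
# error after bundle adjustment (< 1 px stable, > 5 px unstable, else ignored).
STABLE, IGNORE, UNSTABLE = "stable", "ignore", "unstable"

def stability_label(reproj_errors_px, stable_thresh=1.0, unstable_thresh=5.0):
    """reproj_errors_px: reprojection errors (pixels) of one track over time."""
    mean_err = sum(reproj_errors_px) / len(reproj_errors_px)
    if mean_err < stable_thresh:
        return STABLE        # supervise as a positive (green) point
    if mean_err > unstable_thresh:
        return UNSTABLE      # supervise as a negative (red) point
    return IGNORE            # ambiguous (black): excluded from the loss
```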
We evaluated our SuperPoint VO system against the original SuperPoint and numerous other methods on the pose estimation task on ScanNet. Two images are fed into the system, and the goal is to recover the relative pose. The plot on the left shows the rotation error and the plot on the right the translation error. The dotted line at the top is SuperPoint VO, the orange line right below it is the SuperPoint system trained on ScanNet, and in green we have the SuperPoint system trained on COCO. The SuperPoint system trained on ScanNet does not use the sequences; VO only comes in with the dashed line. When we ran our experiments with a small baseline of one second, we noticed that VO only helps a little. We repeated the experiments with a larger frame difference, namely frames 60 apart, corresponding to two seconds of motion; now we started seeing a larger gap between the top-performing method, namely SuperPoint VO, and all the other methods. Finally, at the largest time delta, a difference of three seconds, we saw the biggest performance gap. One of the things we learned from these experiments is that we really should be looking at wider and wider baselines; it seems that for small baselines, the trick of training with homographies is good enough.

Let's now go into the second part, called SuperGlue, where I'll discuss how we applied deep matching ideas to SuperPoint. We really wanted to answer the question: how can we learn to solve the correspondence problem, and do something more profound than just making SuperPoint better? This work, "SuperGlue: Learning Feature Matching with Graph Neural Networks," will be presented at this year's main CVPR conference. It is work done with Paul-Edouard Sarlin, Daniel DeTone, myself, and Andrew Rabinovich at Magic Leap. Paul did his master's thesis on this work, and he will later tell you more about the ins and outs of the technique.

SuperGlue is made up of two components: a graph neural network and an optimal transport layer. The goal of SuperGlue is to solve wide-baseline matching for image pairs, and to do it in real time using a GPU. SuperGlue gives us state-of-the-art indoor and outdoor matching. We have separate SuperGlue models: one that works with SIFT and one that works with SuperPoint. An important thing to note is that SuperGlue's goal is to be better than motion-guided matching, without any motion model at all. We saw matching done with VO in our earlier work, and in practice using motion estimates is important to make everything work better. What we really wanted to do is replace heuristic design with one big network that can learn how to solve this kind of ambitious alignment problem without requiring any motion priors.

The first part of SuperGlue is a graph neural network with attention. This part encodes contextual cues and priors, and it reasons about the 3D scene. The second part is the solution of a partial assignment problem; we use the Sinkhorn algorithm here, a classical optimization technique. The final output is a partial assignment between the points in the left image and the points in the right image. An important thing to note is that SuperGlue requires both sets of local features as input. We broke the traditional paradigm of doing all the processing on each image independently and then doing simple nearest-neighbor matching; we instead fuse the representations relatively early, do a lot of communication across both images, and then produce the final correspondences.
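To give a flavor of the optimal transport step, here is a simplified log-domain Sinkhorn sketch with a dustbin row and column so that unmatched points have somewhere to go. In the actual SuperGlue model the dustbin score is a learned parameter; this standalone version is only an approximation of that layer:

```python
import torch

def sinkhorn_partial_assignment(scores, z=1.0, iters=50):
    """scores: (N, M) pairwise matching scores; returns a soft (N, M) assignment."""
    N, M = scores.shape
    # Augment the score matrix with a dustbin row and column holding score z.
    S = torch.full((N + 1, M + 1), float(z))
    S[:N, :M] = scores
    # Log-marginals: every real point carries unit mass; each dustbin can
    # absorb up to all the points from the other image.
    log_a = torch.cat([torch.zeros(N), torch.log(torch.tensor([float(M)]))])
    log_b = torch.cat([torch.zeros(M), torch.log(torch.tensor([float(N)]))])
    u = torch.zeros(N + 1)
    v = torch.zeros(M + 1)
    for _ in range(iters):  # alternating row/column normalization in log domain
        u = log_a - torch.logsumexp(S + v.unsqueeze(0), dim=1)
        v = log_b - torch.logsumexp(S + u.unsqueeze(1), dim=0)
    P = torch.exp(S + u.unsqueeze(1) + v.unsqueeze(0))
    return P[:N, :M]  # soft partial assignment between the two point sets
```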
Let's take a look at some videos of SuperPoint and SuperGlue in action. On the right we have SuperPoint with SuperGlue, and on the left we have SuperPoint with nearest-neighbor matching and a couple of other heuristics. We see that SuperPoint together with SuperGlue produces a larger number of higher-quality matches, depicted by green lines; the red lines are spurious matches. Notice the larger number of red lines on the left compared to the right. We trained another SuperGlue system, this time on outdoor data, to show that we can get really competitive results on a large number of datasets. On the left we're looking at SuperPoint plus nearest-neighbor matching plus last year's OANet, which is an inlier classifier system, and on the right we have SuperPoint with SuperGlue. Note again the large number of high-quality green matches.

We evaluated SuperGlue on indoor and outdoor datasets. The most important thing to see in this figure is the red curve, which is SuperPoint used with SuperGlue. If you look at the dark blue, you'll see the results of SIFT with SuperGlue; this is a different SuperGlue, trained to work specifically with SIFT. The most important take-home message is that SuperGlue yields large improvements in all cases.

We released our pre-trained SuperGlue network. It runs at 15 frames per second on 640-by-480 images using approximately 512 keypoints on a GPU; you can run it on a commodity desktop with a GPU. The demo also works reasonably fast, though not quite at 15 frames per second, on a MacBook Pro. We encourage you to go to the GitHub page, get the code, and play with it yourself. Here are six videos of what I could come up with in about one to two minutes per video; by downloading the SuperGlue code, you can run something that works directly on your webcam. In these visualizations, red means a high-confidence match and blue means a low-confidence match. The camera is not calibrated in this case, so I can't say that a match is correct or incorrect; I can only say it is highly confident or not. Notice that by moving the camera around, shaking it, occluding it, and breaking the rigidity assumption of the scene, you can learn a lot about how this works.

Finally, let's talk about what comes next: what comes after SuperPoint and SuperGlue. Rather than describing how to build the next thing, I want to describe the high-level capabilities we need in order to take SuperPoint and SuperGlue and make them work like an entire SLAM system. First of all, the work I showed you earlier operates on a pair of images; what we really need is to work with a set of images. That means thinking about multiple images being matched to one image, or multiple images being matched to multiple other images. Second, SuperPoint and SuperGlue solely perform the matching component; they require a classical pose estimation system on top to get relative poses. I believe the future is to design networks that can do the pose estimation inside; this will be an important milestone in turning our work into a full end-to-end SLAM pipeline. Also, SuperPoint and SuperGlue do not have any loop closure mechanism at this point. It's a little too expensive to run the entire SuperGlue pipeline on all potential keyframe candidates, so what's necessary is a very quick procedure to determine whether it's worth running SuperGlue at all. This can be done with keyframe embeddings, and the SuperPoint front end can be augmented to create one global image descriptor.
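As a hypothetical illustration of this gating idea (the embedding model itself is an assumption, not something released), a keyframe index could shortlist loop closure candidates by cosine similarity before invoking the full matcher:

```python
# Keep one global embedding per keyframe (e.g. from a hypothetical
# global-descriptor head on the SuperPoint backbone) and only run the
# expensive SuperGlue matcher against the few most similar candidates.
import numpy as np

class KeyframeIndex:
    def __init__(self):
        self.embeddings = []   # one L2-normalized vector per keyframe
        self.frame_ids = []

    def add(self, frame_id, embedding):
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        self.frame_ids.append(frame_id)

    def candidates(self, query_embedding, top_k=5, min_sim=0.7):
        """Return keyframe ids worth passing to the full matcher."""
        if not self.embeddings:
            return []
        q = query_embedding / np.linalg.norm(query_embedding)
        sims = np.stack(self.embeddings) @ q       # cosine similarities
        order = np.argsort(-sims)[:top_k]
        return [self.frame_ids[i] for i in order if sims[i] >= min_sim]
```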
The SuperPoint and SuperGlue modules are also trained independently. It will be important to take these ideas and show how to do end-to-end training, although I feel this is not as important as other people would like you to believe. Last, SuperPoint plus SuperGlue is a combination of convolutional neural networks and graph neural networks, so the system has two different notions of receptive field: the convolutional neural network has a growing, Gaussian-like receptive field, while the receptive field of SuperGlue is much larger, because the attention mechanism allows the features to communicate with each other across the entire image. It will be important to reconcile these two notions, study them, and determine what the job of the feature descriptor should be if you're going to use a complicated network like SuperGlue for the matching afterwards.

Before I conclude, I want to outline some open problems at the intersection of deep learning and SLAM that I strongly believe will drive innovation. First, multi-user SLAM: I believe it's going to be important to create representations, or maps, that work across a large number of agents. The problem of one SLAM system running in a room and creating a map, and then a second person coming in from a brand-new location and being able to align themselves to that map, is very important. As we have more and more robots collaborating, it will be impossible to assume that each robot can maneuver to the same point in space. Second, it will be important to integrate object recognition capabilities with SLAM front ends. We've already seen a lot of object recognition systems use deep learning, so this might not be the most difficult thing to do, but it will be important if we are to ship these SLAM systems and say that they solve real-world perception tasks. Finally, enabling lifelong learning has really been a passion of mine. The reason I was excited to cast SLAM as a deep learning problem was the idea that continual deployment can create data that improves the quality of the core system. Each time humans go out into the world, they see something, they experience something; when they come home and digest what they experienced, they are more agile the next day maneuvering in that same space. I believe these capabilities will be important for creating the next generation of SLAM systems.

In summary, I talked about SuperPoint, our convolutional neural network architecture for visual SLAM front ends. I discussed self-supervised learning using homographies as well as visual odometry back ends. I also discussed the robustness of SuperPoint, and what happens when you take SuperPoint ideas and apply them to slightly different problems, like pattern-specific SuperPoints. I introduced SuperGlue, which has been very successful at many matching tasks, and we hope that more and more people will see the light and extend our technique to build more impressive SLAM systems. Finally, I outlined some basic ideas for how we can go beyond pairwise matching and eventually build an end-to-end SLAM system.

Before I wrap up, I want to let everybody know that tomorrow Paul-Edouard Sarlin will be speaking about SuperGlue at this workshop, because our method won first place in two visual localization challenges at this workshop. On Friday, June 19th, Paul will also be speaking at the Image Matching: Local Features and Beyond workshop; we applied our SuperGlue technique to that outdoor data, and it also works extremely well. If you're interested in learning more about SuperGlue, I've shown on this slide the different places where you can learn more over the next few days. Finally, I would like to thank you for your time and attention. If you are interested in learning more about our work, follow us on Twitter, and if you're interested in collaborating or have questions about our work, feel free to send us an email. Thank you very much, and have a great day.
Info
Channel: Tomasz Malisiewicz
Views: 7,779
Rating: 4.95 out of 5
Keywords: computer vision, deep learning, SLAM, robotics, superpoint, superglue, image matching, local features, graph neural network, convolutional neural network, visual odometry, visual localization
Id: u7Yo5EtOATQ
Length: 26min 36sec (1596 seconds)
Published: Sun Jun 14 2020