Keyframe-based SLAM for hand-held Augmented Reality

Video Statistics and Information

Captions
Each year Microsoft Research hosts hundreds of influential speakers from around the world, including leading scientists, renowned experts in technology, book authors and leading academics, and makes videos of these lectures freely available.

Good morning, thanks for coming. It's an absolute pleasure to introduce Georg Klein. He's originally from Austria, got his PhD working with Tom Drummond in Cambridge, which he finished in 2006, and he's currently a postdoc in Oxford working with David Murray in the Active Vision Group. I think I can promise you will see some absolutely cutting-edge work on handheld simultaneous localization and mapping aimed at augmented reality. So take it away, Georg.

Well, thanks very much, and thanks David for inviting me; it's a pleasure to be here. So: keyframe-based SLAM for handheld AR — SLAM being simultaneous localization and mapping, and AR being augmented reality, which is what I've really been interested in for the last five or six years. A quick outline of the talk: initially I'm going to talk a bit about what augmented reality is and what we need for it; then I'll look at what previous attempts at visual SLAM relating to AR have been; then I'll show how I think keyframe-based SLAM can make a positive difference there; and finally I'll show some recent work we presented at ECCV, which makes this keyframe-based SLAM more agile, specifically for handheld use. The work I'm going to cover is from two papers: one of them is an ISMAR paper from last year, and the other is this year's ECCV paper.

Augmented reality is inserting virtual graphics into the real world. The effect we'd like to achieve is the same as in the movies, where you have really nicely rendered computer graphics which look like they're part of the real world — except we want to do this live, in real time, on small devices, as opposed to on a render farm taking weeks. It has a huge number of applications, and AR has changed somewhat recently. We used to believe that the dream AR setup was that everybody would wear these very fashionable eyeglasses which would project the virtual graphics directly into the real world. Unfortunately that hasn't really materialized, so now we're shifting to handheld AR using some sort of screen. The setup I use is just a laptop with a camera, but increasingly we're looking at things like PDAs, and mobile phones are becoming powerful enough to really be the AR platform of choice.

Now, the fundamental requirement for providing augmented reality is registration: knowing where the camera is — or, better said, where the viewer is — relative to the world. For video see-through augmented reality, which is what I'm going to be talking about, that equates to just knowing where the camera is, so we need to track the six-degree-of-freedom camera pose in real time. The predominant approach in AR right now is to use fiducial markers; in fact this is the most cited paper in augmented reality, for good reason, because it works and it works really well: Kato and Billinghurst's ARToolKit system. You print out these square fiducials, you put them around the scene, your camera observes them, and because they're quite distinctive it can work out their pose, and it can work out their ID from the little pattern in the middle, and then you can draw your virtual graphics correctly registered in the scene. Now, there are some labs which actually go and print out dozens of these and put them all around the walls, so you can go anywhere in the lab and know where you are.
Another application is that you can actually move the markers around and have interactions based on where they are relative to each other. So if you can put fiducials around the scene, that's great. Otherwise, maybe we don't want to do that, but we know a bit about the scene, so another approach to tracking in AR is to use models. Some work I did previously: we wanted to have this printer maintenance demo, so I measured out the edges of the printer, stuck them in a CAD file, and then we can track the printer by tracking those edges. A similar approach using feature points was done by EPFL, both circa 2003. In general, if you have some sort of prior knowledge of the scene you're going to be trying to augment, it makes a whole lot of sense to use it, because then you can put your graphics in the right place, tracking works great, and everything's fine.

But today we're looking at something a bit more tricky: the situation where you don't have any prior information about the scene, where you essentially have to learn the structure of the scene at the same time as learning the camera motion. This problem is well studied as SLAM in the robotics community. There they're really interested in a robot, like the one we have here, exploring the world and figuring out the structure of the world it's exploring at the same time as working out where it is. It's a big topic in robotics, mostly using a wide array of sensors — for example the laser rangefinders you can see mounted on this robot, two of them, which give you both an orientation and a range as the laser sweeps across the room. In computer vision, SLAM is a bit less well established; the first real-time monocular SLAM system was presented by Davison in 2003, and ever since there's been a fair bit of interest.

Looking at a few developments: like I said, the initial EKF SLAM implementation in 2003 by Andrew Davison; then a UKF implementation which was a bit more robust thanks to using SIFT descriptors; then Eade and Drummond started doing interesting work using a different type of filtering framework, FastSLAM 2, which can handle many more features; and finally something called local SLAM, which combines EKF and bundle adjustment to tie local submaps together. All of these have something in common: they're all filtering methods.

There have been applications of these methods to AR; I'll just show you some of them. Davison presented this kitchen augmentation software using a SLAM system way back in 2004; then we tried doing some work bringing SLAM to AR in 2007, and likewise there were a few attempts at ISMAR. Let me just show you what this looked like. This was the first application of real-time EKF SLAM to AR, by Nick Molton, Andrew Davison and Walterio Mayol. If you're not familiar with this sort of EKF SLAM history in vision, this is a pretty early system: essentially you move the camera around, features are added to the map, and it can work out where the camera is from that. Eventually, if we fast-forward, you can see the map being created as the camera moves around, and then a shelf has been added to the map, and this is now integrated into the scene as the augmented reality.

Just to show you our efforts in this direction: the focus of that work was actually to add a relocalizer to SLAM, because SLAM occasionally broke and we needed a method to recover from tracking failure.
So we have this demo here where we put these circles in the world and then a little guy runs in between them, and the idea is that this is correctly integrated into the world. A slight problem though — if we just rewind a bit, you'll see that everything is not quite completely stable; everything jitters around a bit, and the quality of tracking isn't really that great. We also ended up encountering some other issues; in general, the frame-by-frame SLAM methods we saw applied to AR were all somewhat fragile.

First of all, starting with this tracking jitter I just showed you — I'll have a slide on that in just a second, but please do ask questions right away if you have them. There are a number of reasons, I think. First of all, the maps are all relatively sparse, so there aren't many features, and this contributes to the jittery tracking. Then there's agile motion: move the camera fast and most of these SLAM methods will fail. And finally, certainly in the methods we used, after long-term exploration of the environment the map always ends up corrupt, and that makes the whole system fail.

Let's look at why this is, and let me explain what I mean by frame-by-frame SLAM. All these systems work by filtering an estimate of the state, and this state vector — I've given an example down here for the case of EKF SLAM — includes the camera pose up here and the camera velocity, and simultaneously all these y's, which are the point features that make up the map. So we have maybe a camera pose and, say, a hundred feature points in the world, and we're jointly estimating the position of the camera and the positions of all these feature points. Every frame we make some measurements and then filter these into the distribution over this state vector; in the case of EKF SLAM the distribution is described by a single Gaussian PDF.

Staying with the example of EKF SLAM, this is what you do every frame. First you modify the state vector by applying some sort of motion model and process noise, which also expands your uncertainty. Then you project your current uncertainty of the state vector into the image — this is the principle of Andrew Davison's active search; you can see it drawing all these ellipses, which are the projections of the Gaussian state estimate into the current frame. Using these ellipses we make some measurements by localizing the appropriate feature within each search ellipse, and then we take these measurements and use them to modify our state, reduce our uncertainty, and integrate them into our current state estimate. Once we've done that we've got our new, updated posterior state estimate for that frame, and then the measurements are discarded. I hope that answers your question.
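As a minimal sketch of this per-frame filtering loop — assuming a generic linearised motion model and measurement model rather than the specific ones used in these systems — the predict/project/update cycle might look roughly like this:

```python
import numpy as np

# Minimal sketch of the per-frame EKF SLAM loop described above. The state
# stacks camera pose, camera velocity and all map points; the motion and
# projection Jacobians (F, H) are assumed to be supplied by linearising
# whatever models a particular system uses.

class EkfSlamSketch:
    def __init__(self, x0, P0):
        self.x = x0                     # [pose(6), velocity(6), points(3*N)]
        self.P = P0                     # joint covariance over camera and map

    def predict(self, F, Q):
        # Motion model plus process noise: the uncertainty grows.
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + Q

    def search_ellipse(self, H, R):
        # Projected uncertainty in the image defines the active-search ellipse.
        return H @ self.P @ H.T + R

    def update(self, z, z_pred, H, R):
        # Fold one image measurement into the joint state; afterwards the
        # measurement itself is discarded, as described in the talk.
        S = H @ self.P @ H.T + R                    # innovation covariance
        K = self.P @ H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ (z - z_pred)
        self.P = (np.eye(len(self.x)) - K @ H) @ self.P
```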
There are a couple of downsides to this approach. The first two are perhaps not fundamental, but they are certainly downsides at the moment. The first is scalability: you're filtering a huge state vector here — your whole map and also the camera pose — and this turns out to be quite expensive if you want to do it every frame, and you need to do it at 30 Hz to provide good graphics for AR. This favours reducing the size of the state vector by, say, running with a small map; for example, Andrew Davison has this phrase of using a sparse map of high-quality features — essentially the idea is to use almost as few features as you can while still ensuring proper localization of the camera. Another current problem is consistency: monocular SLAM systems have a tendency, due to unfortunate mathematical reasons, to overestimate the amount of information they have. This means their uncertainty estimates end up too small, which leads to corruption of the map. This might not be a fundamental problem, but it's certainly a problem which all current methods have.

But the biggest downside to frame-by-frame SLAM, this filtering approach, is what happens if one of your measurements was incorrect. These measurements go into the PDF and then they're locked in — you can never get them out again. That means if you make a single incorrect measurement, at any time, in any frame, you've permanently corrupted your map. This is a big enough problem that some current SLAM implementations go through a great deal of effort to avoid it ever happening. Even the active search I showed you first — projecting the ellipses into the image — helps not only to concentrate the processing on the regions where it's most useful, which helps real-time operation, but also means we're not going to get outliers, because we're not going to look in unlikely regions. There are also various methods for trying to avoid including outliers, such as joint compatibility branch and bound, or RANSAC; both of those do roughly the same thing, splitting all the measurements into an inlier set and an outlier set. And finally, most SLAM systems require some sort of conservative tuning: you really want to avoid including outliers, so you have to shrink the size of your search ellipses, and all of this essentially prevents agile operation.

So this is our proposed alternative: keyframe-based SLAM. We don't want this effect where measurements are folded in, discarded, and then left to corrupt our map; instead we're going to keep all our measurements and essentially do all of our estimation using bundle adjustment. So instead of filtering we use bundle adjustment; we keep all the measurements we ever make which go into the map, and we also keep all the poses in the bundle adjustment, which means that the whole thing — estimating all the poses and all the map points using bundle adjustment — is pretty much the optimal way of determining camera trajectory and map structure from a bunch of monocular measurements. Unfortunately we can't possibly do this at 30 Hz: we can't keep all the frames we've ever seen and all the poses and try to update this every frame, given that bundle adjustment ultimately scales as the cube of the number of poses in the estimation. So what we're going to do is keep a subset of the poses — we call these keyframes — and once we've decided we're not going to use every frame, we can actually split the mapping component off from the tracking and run the two in separate threads, which, given that now everybody has dual-core machines, means we can run mapping and tracking in parallel on the same processor.

So here are the two threads. On one hand we have a tracking thread, which is responsible for estimating the camera pose so that we can get our nice augmented graphics in the right place; this therefore has to run at 30 Hz, but it can simplify things by assuming that the map is fixed, which means all it ever has to estimate is a six-degree-of-freedom camera pose.
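A rough structural sketch of this tracking/mapping split — the worker functions here are placeholders standing in for the real pose tracker, point insertion and bundle adjustment, so the details are assumptions, not the actual system's code:

```python
import queue
import threading

# Hypothetical skeleton of the two-thread split described above: the tracker
# runs at frame rate against a map it treats as fixed, and hands selected
# keyframes to a mapping thread that grows and refines the map.

keyframe_queue = queue.Queue()
world_map = {"points": [], "keyframes": []}
map_lock = threading.Lock()

def track_pose(frame):                  # placeholder: 6-DOF pose from a fixed map
    return "pose"

def is_good_keyframe(frame, pose):      # placeholder: baseline/blur/quality checks
    return False

def tracking_loop(frames):
    for frame in frames:                # runs at ~30 Hz in the real system
        with map_lock:
            pose = track_pose(frame)
        if is_good_keyframe(frame, pose):
            keyframe_queue.put((frame, pose))   # hand over to the mapper

def mapping_loop():
    while True:
        try:
            frame, pose = keyframe_queue.get(timeout=0.1)
            with map_lock:
                world_map["keyframes"].append((frame, pose))
                # ...add new points by epipolar search, then bundle adjust...
        except queue.Empty:
            pass                        # idle time: map maintenance, re-measuring

threading.Thread(target=mapping_loop, daemon=True).start()
tracking_loop(frames=[])                # frames would come from the camera
```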
Then we can focus on making this thread as good as it can possibly be given the map: we're going to focus on robustness and agility, and also try to get it as accurate as possible. On the other side we have the mapping thread, which is trying to make the map, and since we're not looking at every frame we can actually take our time: for every keyframe we can spend lots of time and try to make the map really as good as it can be given the keyframes it has to work with. There's a bit of interaction between the two: the mapping thread obviously provides a map for the tracking thread, and the tracking thread occasionally gives keyframes to the mapping thread.

So let's look initially at how we make the map. The map, as usual in SLAM systems, is a cloud of point features, and it's made from the set of keyframes, which means the whole thing is actually quite similar to multi-view reconstruction techniques — except that we don't start with all the keyframes ab initio; we start with a single stereo pair and then keyframes are added as the system goes along. All the keyframes that we add usually have a pose already attached to them, because they came from the tracking system which calculates this pose — except for the first two, and here we have to do something special, which is just a stereo initialization. Here you can use any stereo algorithm you like; we used to use the five-point algorithm, but now, because we usually run on scenes which include a planar target, we use a homography initializer instead. All this needs is a stereo pair with feature matches between them, and the way we generate these feature matches is with cooperation from the user: we have the user place the camera, press a button, move the camera gently, and press another button; the system tracks the features between these two frames and generates an initial stereo pair from that. (In response to a question: yes, you avoid that by choosing how you move the camera, and in practice the scene is never quite planar anyway.)

Once the stereo pair is created, the mapping thread enters this infinite loop, whereby it's essentially idle until it gets a new keyframe. When it gets a new keyframe it starts adding new points and integrates the keyframe into the whole estimation; once it's added the new points it goes and optimizes the map using bundle adjustment, and finally it does a bit of map maintenance, and perhaps iterates over this if the bundle adjustment didn't quite converge. I'll now go over these steps and explain them.

Keyframes: we don't want every frame to be a keyframe. We only want a keyframe if, first, it's not in exactly the same place as any of the keyframes we already have — we need some sort of baseline in order to add and triangulate features, so we want a certain separation between a new keyframe and all the previous ones. We also want the frame to be good, which means it shouldn't be motion-blurred, so we can add features well. And finally, the tracker should actually be reasonably confident of the position estimate it had for that keyframe. Then, when a keyframe is added, if the mapping thread was doing anything like a huge bundle adjustment, all that gets stopped and it handles the new keyframe as a priority. It tries to add new points to the map, and even before it does that it tries to measure all the points which are already in the map in that keyframe, just to get as much information out of it as possible.
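A hedged sketch of that keyframe-selection test — the thresholds and the form of the tracking-quality score are illustrative assumptions, not the system's actual values:

```python
import numpy as np

def should_add_keyframe(cam_position, keyframe_positions, tracking_quality,
                        is_blurred, min_baseline=0.1, min_quality=0.5):
    # Keyframe criteria described above: enough baseline from every existing
    # keyframe, a sharp (non-blurred) frame, and a confident tracker.
    if is_blurred or tracking_quality < min_quality:
        return False
    if not keyframe_positions:
        return True
    nearest = min(np.linalg.norm(np.asarray(cam_position) - np.asarray(k))
                  for k in keyframe_positions)
    return nearest > min_baseline      # separation needed to triangulate points
```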
To add new points to the map — and this is one of the key places where our system differs from other SLAM systems — we want our maps to be as rich as possible, which means that everything in the keyframe which can be a map point, we will try to make a map point. So we take all the maximal FAST corners in the keyframe; for each one we first check if there's already a map point there, and if there is we skip it; otherwise we do a Shi-Tomasi score check just to see if there's enough gradient information there, and if it isn't already in the map we attempt to add it as a new map point. The way that happens is we extract an 8x8-pixel patch and then try to find the same patch in a neighbouring keyframe by epipolar search; if we can find it, we triangulate it and add it to the map. We repeat this four times: we work not only on the full-size image, which is 640x480, but also go through three other pyramid levels, shrinking the image by half each time. This lets us later do some coarse-to-fine tracking, and also lets us track across multiple scales as the camera changes distance from the target.

Bundle adjustment looks like this: we're optimizing all of the keyframe poses except for the first, and all of the map point positions, to minimize the reprojection error of all the measurements we've ever made in any keyframe. We use a robust estimator for the reprojection error — we use the Tukey M-estimator for that — which has the advantage that outliers are flagged and we can remove outlier measurements from the estimation at a later date. This is implemented as a Levenberg-Marquardt solver, and it scales, ultimately, if we have many many keyframes, cubically with the number of keyframes; in practice we find it scales more like quadratically with the number of views we have. (In response to a question: no, I just project the map points into the cameras, but beyond that I don't do anything else — I don't do any variable reordering — and in practice, if I move around like this, it will probably turn out that pretty much every view can see every point.) But yes, this is expensive: it can take tens of seconds to converge as we hit the scalability limit of about 200 keyframes. That means, if I'm exploring, I don't want to wait tens of seconds for each new keyframe to be added. As I said earlier, we can abort this if a new keyframe comes along, but we also do something else: when we're exploring we do a local bundle adjustment initially, which is just a bundle adjustment of a few local keyframes, and then once that's converged we can go and bundle adjust the entire map.

Then finally, once bundle adjustment has converged and the map maker is just sitting there waiting for a new keyframe, it has a bit of idle time, so it can do a bit of housekeeping. It does several things. First of all, if there was ever a measurement in the bundle adjustment that was flagged as an outlier, it will go back and try to search in that keyframe again — because we keep all the keyframes — to see if we can make a better measurement of that point. Likewise, for new features which are added to the map, we seek them out in all the old keyframes and see if we can maybe make measurements of them there. In general we try to make the map as dense as possible, to fill in as much information as possible, which is actually pretty much the opposite of what you'd want to do if you were trying to explore with a robot.
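As an illustration of the robust weighting mentioned above, a Tukey-style biweight on the reprojection residuals might look like this; the MAD-based scale estimate and the constant 4.685 are standard choices assumed here rather than values taken from the talk:

```python
import numpy as np

def tukey_weights(residuals, c=4.685):
    # Tukey biweight: measurements whose reprojection error exceeds c * sigma
    # get zero weight, so gross outliers stop influencing the bundle adjustment
    # and can be flagged for removal later, as described above.
    residuals = np.asarray(residuals, dtype=float)
    sigma = 1.4826 * np.median(np.abs(residuals)) + 1e-9   # robust scale (MAD)
    u = residuals / (c * sigma)
    return np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
```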
So that covers the operation of the mapping thread; I'll now tell you what the tracker does. This is reasonably straightforward: it just tries to track the camera pose at 30 Hz to provide it for augmented reality. We want it to run at 30 Hz; to be accurate, meaning low jitter but also good absolute accuracy; to be reasonably robust to temporary occlusions, things like people's hands moving in front of the camera; and it should ideally support the user moving the camera in a fairly unconstrained manner — we don't want to have to treat the camera like a sort of raw egg. We can now do all that because we've separated the tracker from the map-making: we can make all sorts of mistakes in tracking and they won't immediately corrupt the map, which means we can be quite aggressive. We assume that the map is fixed — there's no estimate of uncertainty in any of the point positions — and we use the standard active-search method: we project the features into the image, find matches in a circle around each projection, and then update the camera pose to minimize the reprojection error.

We do two stages of tracking, a coarse-to-fine method. Initially we project the coarsest points into the image, find them, estimate a sort of intermediate camera pose, and then use that to find many more measurements and get our final camera pose estimate. Once we've done the fine stage we have a quality assessment, and finally we can draw the graphics in the correct place. So initially a frame comes in and we pre-process it: we split it into a black-and-white version, which we use for tracking, and a colour version, which we use for the augmented reality; we generate our pyramid levels from 640x480 all the way down to 80x60, and detect FAST corners in each of these. This whole pre-processing is fairly rapid — it takes about three milliseconds — and now we can go and project our features into the image and try to find them.

To project the points, we first take the previous frame's estimate and apply a decaying camera velocity model. Then — we might have 10,000 or 20,000 points in our map — we project all of these into the current viewing frustum and keep maybe the few thousand which we could possibly see in the current view. We also filter them according to how big they are: if they're way too big compared to when we first saw them, we discard them; and if the patch normal is pointing in the wrong direction, we discard them. We'll maybe end up with two or three thousand which we might be able to measure in that frame. We can't measure all of them at 30 Hz, so we restrict ourselves to measuring about a thousand of them: initially we choose the 50 biggest — the ones most likely to be found even despite big camera motion — search for them, do a pose update, and then search for up to a thousand of the smaller ones.

The way the patches are found is that we use 8x8-pixel patches. This is very fast, although the quality is not so great. We generate the search template by warping the patch according to our estimate of the patch normal, then we project it into the image and search in a fixed circle around the projection using zero-mean SSD, restricting our search to FAST corners — this is what makes it possible to search for so many features; if we tried to exhaustively search all these circles we couldn't possibly handle this many points. Finally we refine each point to sub-pixel accuracy using some Lucas-Kanade-style tracking, once we already have the pixel-precise estimate from the search. Then, finally, updating the camera pose is just a minimization in six degrees of freedom; again we use the Tukey M-estimator, and this iterates ten times per frame.
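A small sketch of that patch-search step; the function names and the search radius are illustrative, and the real system warps the template by the patch normal and pyramid level before this point:

```python
import numpy as np

def zero_mean_ssd(a, b):
    # Zero-mean sum of squared differences between two equally sized patches.
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum((a - b) ** 2))

def search_for_patch(image, template_8x8, predicted_xy, fast_corners, radius=10):
    # Active search as described above: only FAST corner locations inside a
    # fixed circle around the predicted projection are scored against the
    # 8x8 template; sub-pixel (Lucas-Kanade style) refinement would follow.
    px, py = predicted_xy
    best_score, best_xy = None, None
    for cx, cy in fast_corners:
        if cx < 4 or cy < 4:
            continue                     # too close to the image border
        if (cx - px) ** 2 + (cy - py) ** 2 > radius ** 2:
            continue
        patch = image[cy - 4:cy + 4, cx - 4:cx + 4]
        if patch.shape != (8, 8):
            continue
        score = zero_mean_ssd(patch.astype(np.float32), template_8x8)
        if best_score is None or score < best_score:
            best_score, best_xy = score, (cx, cy)
    return best_xy
```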
Finally, the tracking quality assessment. This is our protection of the map from poor estimates by the tracker, as it were: we only want to include good keyframes in the map, which means that if the tracker is going a bit crazy we don't want to use those frames as keyframes. What we have for this is a number of heuristics. First of all, we want to reject frames by the amount of motion blur in them, because if we put motion-blurred frames into the keyframe set this will break things: we won't be able to harvest many new features from that frame, and we also won't be able to find new features from other frames in it, so we essentially don't want those frames. We also have a check based on the number of features the tracker could find compared to the number of features the tracker wanted to find, just to see if the tracker thinks this is a good pose estimate. And if the tracking performance is really poor, we just declare tracking lost and initialize a recovery process, which I'll talk about at the end.

Let me just show you what I've talked about so far, starting with the stereo initialization procedure. What I'm going to do is hold the camera here and press the button, and then you can see all these little trails — these are my feature correspondences for the first stereo frame. The second time I press the button, it will have generated a very simple 3D map of the initial plane that it saw; you can also see a sort of cloud of outliers down here. Then, as I move the camera around, whenever I get sufficiently far from previous keyframes you can see it inserts a keyframe, and as it does so it adds new points to the map. So as I move the camera around it will hopefully keep adding features, occasionally breaking, and in this way I can grow the map. This represents roughly what our system looked like this time last year. You can see the map up here — it's just now generated a map with 22 keyframes and 1,600 point features — and as I move around it can add a few more. As you can see, the map contains not only inliers but also a huge number of outliers; down here, for example, these were features which were aliased from the keyboard — because it's a repeating structure they get put at the wrong depth — and things like that don't really affect the operation of the system too badly. That's due to the use of robust estimation throughout, not only in the tracking but also in the map generation; even if we do have bad measurements in the map, they don't end up affecting the quality of the thing too badly.

Anyway, let me carry on. Already, with just the points, in a suitably textured environment we can generate maps of many thousands of features — say up to ten or twenty thousand, using up to two hundred keyframes — and if we insert virtual graphics into this, the quality of the tracking far exceeds what we got from the EKF. I'll show you more demos later, but let me just say that we still had a few problems at this stage. Although we could make these maps, the tracking quality wasn't perhaps as good as we would have liked; we also of course have the scalability issue — it doesn't go past 200 keyframes — and the quality heuristics are poor.
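An illustrative version of the found-versus-attempted check described above; the thresholds and the three-way outcome are assumptions made for the sketch:

```python
def assess_tracking_quality(n_found, n_attempted, good_ratio=0.6, lost_ratio=0.25):
    # Compare how many features were found against how many the tracker
    # tried to measure, as described earlier.
    if n_attempted == 0:
        return "lost"
    ratio = n_found / n_attempted
    if ratio >= good_ratio:
        return "good"        # this frame may contribute a keyframe
    if ratio >= lost_ratio:
        return "poor"        # keep tracking, but don't add keyframes
    return "lost"            # trigger the recovery procedure
```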
But the problem I'm going to talk about now is really the agility. As we start shaking the camera around, tracking will very frequently fail, and although we now have methods of recovering from tracking failure, this is still an annoyance for the user; it would be better if tracking didn't fail in the first place. Part of the reason tracking fails is the sort of cameras we use: quite small lenses, quite small sensors; they don't gather a huge amount of light, so they need very long exposure times — pretty much 30 milliseconds on this model. So we soon start running into images like this. This is a standard image: if you move the camera a bit you already see substantial motion blur, and this is what happens if I start really shaking it around. This is a problem for us. As I said earlier, we use the FAST corner detector to find our features; on an image like this the FAST corner detector will only find image noise as corners, and no actual true corners. And even if the FAST corner detector did find corners, it would probably have trouble matching them with our 8x8-pixel search patch, because the 8x8 patch is just going to be completely dominated by motion blur. So we need something that can, ideally, track these sorts of images.

We came up with two methods I'm going to talk about: one of them is to add edgelets to the map — I'll tell you why in just a moment — and the other is to add a frame-to-frame rotation estimator, which becomes necessary once we add edgelets.

So first of all, edgelets: why? Well, here's an image of a kitchen and a Canny edge extraction next to it, and you can see it's nicely populated by edges in every orientation; it actually looks really nice and trackable. Now, edges have the property that they're locally one-dimensional — and actually motion blur is too. So what happens if we motion-blur this image is that we've lost all the vertical edges, except maybe one still left here, but all the horizontal edges are actually still there; we can still detect them with a standard edge detector, which means we should be able to track them trivially. So whereas most of the point features here have been wiped out, half of the edges are still there. Not only that, we can actually even track the ones which have been blurred: if there was a nice strong intensity step, then motion blur is just convolving it with a box filter, which produces, instead of an intensity step, an intensity ramp. So if, instead of looking for an intensity step, we look for an intensity ramp using a matched filter, we still get a reasonably strong response at the correct location — so we should be able to track these blurred edges too.

Now, people have tried tracking edges in SLAM systems before; there have been two previous approaches, both presented at BMVC in 2006, and they're very different. The first was the system by Smith, Reid and Davison, which was based fundamentally on EKF SLAM. EKF SLAM doesn't scale very gracefully with the number of map features, so the approach taken there was to include as few edges as possible, but to make those edges really good: Paul Smith went and tried to find long edges between corners — he looked at every pair of image corners and checked if there was a long edge in between them, and if there was, it was added to the map. The other approach was that of Eade and Drummond, who didn't use an EKF: they used FastSLAM as the underlying SLAM engine, which scales more gracefully with the number of map features, so they were able not just to look for a few long edges but rather to add lots of tiny little edgelets to the map.
That has the advantage that if you're looking at a long edge which is not all in the image at the same time, you can still track it by just adding these little small features to it as you go along. We used the method of Eade and Drummond — or, better said, we used the concept of using edgelets. An edgelet is a very small piece of an edge which is only locally straight; a long edge will be made of many, many of these. The way we represent them is to give each one a full six-degree-of-freedom coordinate frame relative to the world, such that the transition from dark to light is in the direction of its x-axis and the edgelet itself points along the z-axis. This is clearly an over-parameterization, given that an edgelet only has four degrees of freedom, but of course we only optimize the relevant four degrees of freedom in the optimization — that is, we ignore rotation around the z-axis and translation along the z-axis.

Adding new edgelets to our SLAM system is much the same as how we add points. Whenever we have a new keyframe, we find Canny edges in it — for this we use a plain Canny edge extractor with a slightly modified edge linker — and then we use epipolar search to add these edges to the map. Let me just go through this. This might be a new keyframe: we perform a maximal edgel extraction, much like Canny, and then when we link these edgels together we slightly modify the linker so that it breaks the chains at points of high curvature. We're left with these chains of locally reasonably straight edges, and then we walk along the chains and try to find pieces which would actually qualify as edgelets in our measurement stage, which I'll describe later. These are all candidates, and we're going to try to add all of them to the map. To do that we perform an epipolar search in another keyframe, and unfortunately, since edgelets all look the same, we're going to get a lot of matches for each one; this means we have to go to a third keyframe and verify each hypothesis, and we only add an edgelet to the map if exactly one hypothesis survives.

Having added the edgelets to the map, we still need to make image measurements of them in order to optimize them in bundle adjustment, and this works much like edge trackers do — it's basically the same as the RAPiD edge tracker from 1989. Initially we project the edgelet into the image, then we initialize sample points along it, and from each sample point we do a perpendicular search in the image to find the nearest maximal gradient. Because this is a keyframe measurement procedure — remember, keyframes are nicely behaved, they have no motion blur, and the tracker already has a really good pose estimate for them — this search can be very local and very constrained, and we try to make it as accurate as possible. Once we've done the search we fit a best-fit straight line, and this straight line becomes the measurement of that edgelet in that keyframe; this is what we use to perform bundle adjustment. In bundle adjustment we project the current estimate in, measure two virtual sample points against that best-fit line, and that is our two-degree-of-freedom measurement of an edgelet in a keyframe.
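A sketch of that keyframe edgelet measurement — sample along the projected edgelet, search perpendicular to it for the strongest gradient, then fit a line. The function names, the gradient-magnitude input and the sample counts are illustrative assumptions:

```python
import numpy as np

def measure_edgelet(grad_mag, p0, p1, n_samples=5, search_px=4):
    # grad_mag: image of gradient magnitudes; p0, p1: endpoints (x, y) in pixels
    # of the projected edgelet. Returns a point on the fitted line plus its
    # direction, standing in for the two-DOF keyframe measurement above.
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    d = (p1 - p0) / np.linalg.norm(p1 - p0)      # along-edge direction
    n = np.array([-d[1], d[0]])                  # perpendicular direction
    hits = []
    for t in np.linspace(0.0, np.linalg.norm(p1 - p0), n_samples):
        base = p0 + t * d
        offsets = np.arange(-search_px, search_px + 1)
        pts = base + offsets[:, None] * n        # perpendicular search line
        xs = np.clip(np.round(pts[:, 0]).astype(int), 0, grad_mag.shape[1] - 1)
        ys = np.clip(np.round(pts[:, 1]).astype(int), 0, grad_mag.shape[0] - 1)
        hits.append(pts[np.argmax(grad_mag[ys, xs])])   # strongest gradient
    hits = np.asarray(hits)
    centroid = hits.mean(axis=0)                 # total-least-squares line fit
    _, _, vt = np.linalg.svd(hits - centroid)
    return centroid, vt[0]
```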
Let me show you: here's that kitchen scene, first with just the point features, and then as soon as we add the edgelets it's actually visually reasonably satisfying — you can suddenly see all this structure emerge. Although we don't exploit this for a human operator yet, it does make a nice difference perceptually.

Now, in the tracker, the frames aren't nearly as nicely behaved as the ones the map maker encounters — here we don't have any guarantees about motion blur — so the way we track edgelets in the tracker is very different. First of all, under motion blur there's really no point in trying to determine the angle of an edgelet, given how small they are and how blurred the frame is going to be, so we only make a one-degree-of-freedom measurement, which is the perpendicular distance: we project the edgelet into the current frame and then, instead of looking for the nearest intensity step, we look for the nearest intensity ramp, as I said earlier, based on an estimate of the camera's current rotational velocity. This would work perfectly if all image edges lived in isolation; unfortunately, what happens in practice is that you get, say, a switch from dark to light and back to dark, and if you blur that you'll find the two edges merge together and the position of the maximum actually moves. We want to avoid taking those measurements, because we would no longer be measuring just a single edgelet. The way we avoid this is that in each keyframe we go to each edgelet and simulate what would happen to an edge search under various levels of motion blur, and this tells us the maximum search range we can use for any level of motion blur, and also what edge threshold we should set for that search later on.

So much for edgelets; let me then talk about the rotation estimator. This handles rapid accelerations. The edgelets can track through motion blur, and motion blur happens at rapid velocities, so now maybe we can track the rapid velocities — but we need an estimate of how fast the camera was rotating in order to predict how much motion blur there was, so that we can track them. Velocities are okay, but accelerations are now going to be a problem, and for this we have the frame-to-frame rotation estimator. What we do is try to emulate the effect of having a gyroscope strapped to the camera. Most of the rapid accelerations that happen in handheld motion are due to camera rotation rather than translation, so we're going to assume that the big image motions between frames are only due to rotation, and try to estimate this between every pair of frames. We do this by a direct, three-degree-of-freedom, image-to-image minimization: we subsample each image down to very low resolution — down to 40 by 30 pixels, so it looks like this — and then, just to aid convergence a bit, we add a bit more Gaussian blur, and then we essentially just track the three-degree-of-freedom difference between these two frames. This usually converges in five or six iterations using efficient second-order minimization (ESM), and then we transform this three-degree-of-freedom image-space transformation into a three-degree-of-freedom camera rotation by taking account of the camera parameters. That's it — that gives us our frame-to-frame rotation estimate. It's very quick because the images are so small: it takes just half a millisecond to estimate. (And it's no longer ten iterations, it's more like six — there was a bug.)
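As a simplified stand-in for that step, here is a sketch of aligning the two small blurry images, assuming a plain Gauss-Newton update and translation only; the talk's version estimates a full three-degree-of-freedom image-space transform with ESM and then converts it to a camera rotation using the intrinsics:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom, map_coordinates

def small_blurry_image(frame, size=(30, 40), sigma=2.5):
    # Subsample the frame to roughly 40x30 and blur it, as described above.
    h, w = frame.shape
    small = zoom(frame.astype(np.float32), (size[0] / h, size[1] / w), order=1)
    return gaussian_filter(small, sigma)

def align_translation(prev_small, cur_small, iters=6):
    # Gauss-Newton alignment of two small blurry images for a 2-D shift only;
    # a stand-in for the full 3-DOF ESM alignment used in the real system.
    p = np.zeros(2)                                # (dy, dx)
    gy, gx = np.gradient(prev_small)
    J = np.stack([gy.ravel(), gx.ravel()], axis=1)
    H = J.T @ J                                    # constant approximate Hessian
    ys, xs = np.mgrid[0:prev_small.shape[0], 0:prev_small.shape[1]]
    for _ in range(iters):
        warped = map_coordinates(cur_small, [ys + p[0], xs + p[1]],
                                 order=1, mode='nearest')
        residual = (warped - prev_small).ravel()
        p -= np.linalg.solve(H, J.T @ residual)    # Gauss-Newton step
    return p
```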
So let me then show you what the result is. I'm going to have to lock down the exposure first. Now, if we look at the map — well, apart from a huge outlier — you can see that apart from points we also have these sort of glowing green edges being added, and if I zoom into them you can see that initially the estimate might be reasonably poor, but as I add more and more measurements these should end up straightening out and forming, hopefully, continuous long edges. It takes a bit longer to add the edgelets to the map than it does to add the map points, because for map points any two views will do, since any pair gives me a baseline, whereas for edgelets, if I move the camera vertically I can only add horizontal edgelets, and if I move the camera horizontally I can only add vertical edgelets — so it takes a bit more moving around to get lots of edges in the map. But eventually I can hopefully add a reasonable number and get a nicely filled-in scene. These things then basically help me track the camera reasonably fast.

Let me just show you what the rotation estimator does if I turn off tracking completely. If you just look at this grid here — this is my initial reference plane — and I now turn off tracking entirely, so I'm just using the rotation estimator, you can see it pretty much follows the camera rotations around, and it's pretty immune to rapid motions. Of course, translation will be misinterpreted as rotation, so we still need proper tracking, and it's also quite sensitive to this sort of thing — you can see, as I move my hand in front of it, that because it's using the whole image at once it can't tell that my hand isn't part of the scene. But combined with the tracking and the edgelets, it gives me reasonably agile performance: I can move this thing around and dance around the scene quite a bit, and it will happily handle reasonable amounts of motion blur. I think it's probably best, if you're curious, that I let you have a play with it yourselves so you can judge for yourselves how well this works or not.

(An audience member asks why the view starts drifting when only the frame-to-frame estimator is running.) Oh, that's just the frame-to-frame tracker accumulating error. (Another asks whether someone could somehow trick it into thinking it's lost.) Well, hopefully the tracking-quality heuristics will capture the fact that it had an error. Yes, if I wave my hand in front of it with all the other tracking turned off it would do that rotation, but of course it's also tracking the features, and that happens after this estimator, so it will reasonably stay snapped on until it reaches a certain critical point and then it will just break completely — but it will recognize that it's broken and not add a keyframe, hopefully. Yes, this is a disadvantage of adding this estimator; for the sort of general manipulation you might get in AR it's usually not a problem, but this sort of thing used to be more robust before I added it.

All right, let me just show a few more things. This was the kitchen scene I had earlier, and in the videos you can actually pause on some of the frames. Here you can see, for example, reasonable quantities of motion blur: this whole part of the image is completely wiped out.
All the point features you see here are in fact incorrect — these are just places where it has incorrectly made a measurement — and yet all the edges you can see drawn were actually tracked in this frame, and that's what's providing the camera pose estimate during these sorts of motions. To show another: here is a graph of what happens to the number of features we can track as camera motion increases. You can see the points drop off quite quickly as we get a bit of motion blur, whereas edges — partially due to their directional nature, because some of them will always remain unblurred — we can still track. Of course it doesn't entirely help if we can only track the unblurred edges, because they won't fully constrain the camera pose, but it does help somewhat.

So finally, let me talk quickly about how we recover from failure. This used to be a big deal for SLAM systems: they were basically unusable in practice, because as soon as the tracking went even slightly wrong you had to start from scratch. So a number of people started looking at recovering from tracking failure. One of our approaches was to use the map to recover — this was joint work with Brian Williams and Ian Reid. What we did was take our EKF SLAM system and train a randomised-trees classifier for each map point; then, when tracking was lost, we were able to generate correspondences from a novel view — we could say this point in the image is that one we have in the map over there — and then, using just RANSAC, you can estimate where the camera is in the map and restart tracking. There have been a number of other approaches along a similar vein since, mostly replacing the randomised trees with other forms of classifier, because randomised trees don't scale hugely well with the number of feature points. For example, Eade and Drummond have a version of this which runs with SIFT and bags of words; Chekhlov and colleagues in Bristol used a Haar wavelet pre-filter; and there is actually a version of randomised trees which does scale to many points, presented this year, and that's been used for SLAM relocalization as well. But all these methods take quite a bit of effort to implement, and they also don't really exploit the fact that we have keyframes in our map.

So what I tried instead was, rather than relocalizing against the map of points, to relocalize against the images of the keyframes, and this can be done trivially easily. What we do is, as we add each new keyframe, we subsample it down to 40 by 30 pixels and blur it very heavily, and this blurred image becomes our descriptor for that keyframe. So we have all of our keyframes — for a small scene here, that's just a hundred keyframes — and if the camera ever gets lost, all you do is apply the same procedure to the current view and compare it to all the keyframes in the map using zero-mean SSD, and that will actually, usually, give you the correct match. Then of course the camera won't be in exactly the same position, so you use the rotation estimator I described earlier to match the rotations, and even though it still won't be in the same translated position, this will now be close enough that the tracker can lock on in the first frame and resume tracking.
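A minimal sketch of that keyframe-image relocalizer, reusing the same kind of subsampled, heavily blurred image as a descriptor; the names and constants are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def keyframe_descriptor(frame, size=(30, 40), sigma=2.5):
    # Heavily blurred, subsampled copy of the frame, stored per keyframe.
    h, w = frame.shape
    small = zoom(frame.astype(np.float32), (size[0] / h, size[1] / w), order=1)
    return gaussian_filter(small, sigma)

def relocalize(lost_frame, keyframe_descriptors):
    # Compare the lost view against every stored keyframe descriptor with
    # zero-mean SSD; the best match seeds the rotation estimator, after which
    # normal tracking resumes, as described above.
    d = keyframe_descriptor(lost_frame)
    d = d - d.mean()
    scores = []
    for kd in keyframe_descriptors:
        k = kd - kd.mean()
        scores.append(np.sum((d - k) ** 2))
    return int(np.argmin(scores))      # index of the best-matching keyframe
```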
Of course there are problems. If I change the lighting substantially, this will fail; if I change the scene layout substantially, this will fail; and it can't relocalize upside-down, which we used to be able to do, but that's no huge loss. It has the very nice property that it's extremely predictable in operation: essentially, if you were tracking here and tracking worked fine, and then you move over there and it breaks, all you have to do is move the camera somewhere it has been before and tracking resumes — and this often happens so fast that people won't even notice that tracking ever failed. It's also extremely quick: it takes one and a half milliseconds for as big a map as we ever make.

Right, so much for relocalization; that pretty much concludes my talk. Just a few things in conclusion. First of all, I quite like this idea of using keyframes. Many people think it's inelegant not to use all the information in estimating the map, but I'm of the opinion that some frames are simply not worth using — it's a somewhat philosophical difference in computer vision, I think. In practice you can't just use appropriate noise models for some of the bad measurements you get, and it's better to just throw some data away: instead of looking at all frames and spreading some sort of minimal processing over everything, we choose a few frames that we think are good and concentrate all of our processing resources on those, and I think this gets a better result. Secondly, the way we've split the tracking apart from the mapping is not very complex — we essentially just create the map in one process and read it out from the other — and it turns out this isn't at all a problem in terms of things like synchronization; it really hasn't been an issue at all, and the split does allow us to make the tracking far more robust and aggressive than we could otherwise do if the results from the tracking were always fed straight into the map-making stage. And finally, this method is very problem-specific: I work with the camera in my hand in a small volume, I don't try to make large loops around buildings, and this is not in any way a replacement for general-purpose SLAM — it's just something that works quite well for the sort of augmented reality problems I try to deal with. All right, well, thank you very much.

Oh yes, of course — I haven't actually shown you any of the AR applications. I said initially that I use this plane initializer; that's because, if you're going to do SLAM in an unknown environment, the big question is what you can actually usefully draw into this environment that makes sense. Well, not very much — but if we know that there's a plane in there, then we can start adding things to that plane. So, for example, let me just get right to it: this is the old favourite — I've got a little character running around my computer here, and eventually I'll get these little hordes of enemies, and then you can sort of shoot them and do little things like that. Let me perhaps — oh dear — let me perhaps show some videos of this. Just because you've only seen the tracking in this one environment: this is actually where I do all of my work, my helpfully cluttered desk, showing lots of good structure for the system to use to build maps, and this is the intended scale of operation for the system. If I fast-forward a bit you'll see a slightly different environment — this is just shot from a train window; as soon as there's anything planar in the environment like this, then we can actually go and augment it, and it'll be reasonably happy with that.
Then, once again: you start with a planar environment, and after that you can move around — the rest of the environment doesn't have to be planar; you can expand the map as you will. You can see here that the large amounts of clutter outside didn't bother the system at all back then; I have to admit I haven't tried this one with the rotation estimator, but that probably wouldn't make a difference. You'll soon see the map superimposed on the world here, which shows you all the features it's added, and it carries on adding them as the camera moves around. Here's another example from outdoors: even though there isn't really much texture on the ground, there's enough texture around it to constrain the camera pose fairly well, and again you can run these little games with characters running around. This is just an example of tracking slightly larger environments. We don't have any explicit loop-closing mechanism — there's nothing which actively recognizes when the camera is going around a loop — but because we use bundle adjustment and sub-pixel measurements, it's actually accurate enough that you have a fairly good chance of the loop just closing by itself, although that's by no means guaranteed.

And then here we have some failure modes — this made more sense if you were at ISMAR. First of all, we have very repetitive texture on the ground, so this is going to track very poorly, and you'll see the map actually gets substantially changed by this — things start to move around, and that makes tracking very unhappy. Then there are some things we just can't handle: we add features by epipolar search, and if you look at this foliage here, everything looks like everything else, so the epipolar search fails completely and the repeated texture breaks tracking. A similar sort of thing happens here with the gravel ground: corner features everywhere, everything looks much like everything else, and again we just have way too many bad features for the system to handle. And finally, another little example of the sort of augmentation we can add to a planar scene: if we know that it's a plane, we can lift the texture off, manipulate it, and make it appear as if the environment is actually deforming.

Right. (In response to a question about working over longer ranges and loop closing:) the problem I encounter is actually very different — it's dealing with occlusions. If I can see something from here — this little point here — my system will currently assume it can see it from anywhere within a sort of hemisphere around it, which means that if I go somewhere where it's covered by something else, it will expect to see that point, then not see it, and conclude there's some sort of problem; and because it wants to avoid adding bad keyframes, it will then just refuse to add anything else to the map. This turns out to be the limiting factor to scaling right now, so things like going through doorways will be problematic. You can go forwards — you can make quite a large map looking forwards — but then turning around and re-observing the map you've just created is problematic. There are approaches which do handle that: there was a recent paper by Eade and Drummond where they do just that, and relocalization is treated as a loop-closing event, whereby whenever tracking fails they start a new map instantly and then splice it back onto the old map later.
I'm not convinced it should be that automatic, so I actually don't mind having to press something to add a new map. Otherwise, I think this depends very much on what application you're looking at, and whether you're exploring an environment or just making a little local space for you to work in.
Info
Channel: Microsoft Research
Views: 8,255
Keywords: microsoft research
Id: UPLROIlyBWs
Length: 58min 37sec (3517 seconds)
Published: Tue Sep 06 2016