Keynote Talk by C. Stachniss: Map-based Localization for Autonomous Driving (ECCV 2020 Workshops)

Video Statistics and Information

Captions
Hello, my name is Cyrill Stachniss, and I would like to welcome you to my talk on vehicle localization. I especially want to thank the organizers of this event for inviting me to speak here. I want to talk about how we can use non-metric information to support vehicle localization. In the end, we of course want to localize our vehicle in metric space, but the question is how this non-metric information can actually help us in our localization system. The work I am presenting here was done by several people from my lab, so a lot of the credit for what I am presenting belongs to them.

I want to start arguing for the use of this non-metric information by pointing out that vehicles must operate under very varying conditions: the world around us changes, and so does its appearance. We want autonomous cars or other mobile vehicles that navigate through space and can estimate where they are, and for that they need to deal with challenging situations. What you see here are pairs of images, this pair and this pair, which were taken at the same location but at different points in time, and you can see that the appearance of the place has substantially changed. The question is: can we deal with such strong appearance changes and still build a system that can localize, that can make the data association, recognize the same place, and say yes, this vehicle is at a place where it has been before? That is what the first part of this talk is about.

Just to show you that this is, based on individual images, a challenging problem, I brought a few pairs of images, so let's see if you are able to identify whether this is the same place. So, question: is this the same place, yes or no? I think that one is easy, it's a yes. This building is quite distinct, as is this flag; one image was taken in summer, the other in winter, but we as humans are still able to identify that. Let's go to the next example over here. At first you would say that actually looks pretty much the same, but it turns out no, it is not the same place; it is not even the same street. It is a different street in the same neighborhood, and you can see it: for example, there is a tree over here, and this tree is missing, but maybe they took down the tree, you never know. Still, it is definitely not the same place. Then I have the next example: is this the same place, yes or no? Again, it is a similar neighborhood in Freiburg; if you drive around, you experience those images. It turns out it is the same place, although it doesn't look the same, and you can see it here: this building has been demolished, so the building doesn't exist anymore. The place appears and looks different, but it is actually the same place.

So it turns out that doing this recognition task based on individual images is quite challenging, but we can actually do it if we take sequences of images into account. If we take into account that we have been recording one image after the other, and we exploit this sequential information to align image sequences, the problem becomes much easier, and we can use it in a localization system by exploiting this temporal or sequential information. What I can do is the following: let's say I have one image sequence recorded that I use as my database, maybe even associated with map information from a mapping run, recorded, say, in winter. Now I am driving around in summer, and I want to take just the image information and match it against the previously recorded images. How do I do this?
Let's start with the most naive case: we build a matrix in which we compare every image of the query sequence with every image of the reference sequence, comparing the images for example based on feature descriptors computed from them, or any other way of comparing images. We end up with a cost matrix like this one, where bright values mean the images look similar and dark values mean the images appear different. You can see that, based on individual dark and bright values alone, it is hard to say where the vehicle has been. But if you take the full matrix into account, exploiting the sequence, you can see a bright line going through this matrix, and this is the trajectory along which the images were taken. By using this full matrix, we can try to find a path through the matrix and in this way find the best matching pairs of images.

Technically, we can build up a so-called data association graph: a graph in which each node represents a data association, a match between two images, and then we can plan the shortest path through this graph, which gives us the sequence of matched images. The cost of a node is computed based on the similarity of the two images, and by planning a path through this graph, taking local connectivity into account, we can exploit the sequential information. You see this graph here with the red nodes, and in this case a blue node, a so-called hidden node. That is a node along the path where we know the system has been, although the images do not look very similar; we could not find a match based on the image alone, but based on the sequential information we know the system must be here. Given this, we can align those image sequences, align one trajectory with another, although the images were taken under very different conditions.

The problem with this approach is that we need to compute the full cost matrix, and this is computationally expensive, because we need to compare every image of the query sequence with every image in the database. This is something we cannot do in real-world situations. So the goal is to turn this into an online approach where we do not need to expand the full graph, but only the part illustrated in green here, say the most promising images. If we only need to compute those green parts and then find a path through this promising region, we may be able to do this in an online fashion. And we can actually do this: we can build up the graph on the fly, so that we do not need to compute the full cost matrix but only the region shown in green; everything in black does not need to be computed. The question is how we can realize this, and the way to achieve it is by building up the graph during the search: we do not build the full graph first, we only expand those neighboring nodes that we actually need to compute.
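Before moving on to the online version, here is a minimal sketch of the offline baseline described above: a cost matrix built from global image descriptors and a path through it found by dynamic programming. The cosine-distance cost, the allowed steps, and the hidden-node cost cap are illustrative assumptions, not the exact formulation used in the talk.

```python
# Minimal sketch: offline sequence matching via a cost matrix and a
# dynamic-programming shortest path (the "data association graph" idea).
import numpy as np

def cost_matrix(query_desc, ref_desc):
    """Pairwise matching cost: 1 - cosine similarity (rows: query images)."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    r = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    return 1.0 - q @ r.T

def best_alignment(C, max_node_cost=0.8):
    """Min-cost path from the first to the last query image.

    Each node (i, j) matches query image i to reference image j.  Capping
    a node's cost at max_node_cost plays the role of a 'hidden node': the
    path may pass through a visually bad match if the sequence demands it.
    """
    C = np.minimum(C, max_node_cost)
    n, m = C.shape
    D = np.full((n, m), np.inf)
    D[0] = C[0]
    parent = np.zeros((n, m), dtype=int)
    for i in range(1, n):
        for j in range(m):
            # local connectivity: stay, or advance 1-2 reference images
            prev = [j - s for s in (0, 1, 2) if j - s >= 0]
            k = min(prev, key=lambda p: D[i - 1, p])
            D[i, j] = C[i, j] + D[i - 1, k]
            parent[i, j] = k
    # backtrack from the cheapest end node
    path = [int(np.argmin(D[-1]))]
    for i in range(n - 1, 1 - 1, -1):
        path.append(parent[i, path[-1]])
    return list(reversed(path))  # path[i] = matched reference index
```

This version still evaluates every entry of the cost matrix, which is exactly the expense the online graph search described next is designed to avoid.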
In order to do this in an efficient manner, we need a good heuristic. We know this from graph search: informed search techniques such as A* are very efficient if we have a good heuristic. A heuristic is a technique to provide a lower bound on the cost that will be generated if I go from the current node to my goal node. If I can provide such a lower-bound estimate, I can basically turn this into a lazy traversal of the data association graph and expand only those nodes that currently look the most promising, and in this way expand only a small fraction of the graph while still being able to find the matching sequence through it.

If you think about informed search, you probably think about A*, but it turns out A* does not really work well here. The reason is that it is very difficult to design a heuristic that is both useful and admissible. Admissible means you need to underestimate the cost and have zero cost at the goal; useful means that it is not just the zero function but actually provides useful information. This is different from a search in Euclidean space, for example, where the straight-line distance gives me a good admissible heuristic; here we are in the space of image similarities, and it is much harder to come up with a good heuristic.

What you need to do in this graph search is the following: say you have expanded up to this node, and now another node looks more promising, like this one over here. Then, in our data association graph, we need to estimate how expensive it is to go from here down to the goal and match the next L − i images. We have done this using a heuristic that is not admissible, so it is not guaranteed to provide a lower bound. It takes the average cost of the best path found so far and multiplies it by a factor smaller than one: we say we expect the best path to cost only, say, 80% of the currently best found path, which is an assumption we make here, and then multiplies this by the number of remaining images. So we have the expected cost of an image comparison along a matching sequence, the number of images we still need to match, and this expansion factor, which scales the cost down in order to allow the search to explore other alternatives.

The problem with this heuristic is that it changes during the search, because we find better paths and thus get better estimates of the expected cost. This prevents us from using A* directly, so we have to modify our search algorithm a little bit to take into account that the heuristic may change during the search.
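As a minimal sketch, the heuristic just described might look as follows; the 0.8 expansion factor matches the example in the talk, while the function signature is an illustrative assumption.

```python
# Sketch of the non-admissible heuristic described above.  It is not a
# lower bound: it assumes the remaining match costs will be about
# expansion_factor (e.g. 0.8) times the average cost of the best path
# found so far, and it changes whenever a better path is found -- which
# is why plain A* cannot be used unmodified.
def heuristic(num_remaining_images, best_path_cost, best_path_length,
              expansion_factor=0.8):
    if best_path_length == 0:          # nothing matched yet
        return 0.0
    avg_cost = best_path_cost / best_path_length
    return expansion_factor * avg_cost * num_remaining_images
```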
If we do this, we can actually perform this online visual localization, or place recognition, in changing environments. You can see here our images: these are the database images, you can see where each image was taken, and the current estimate of where we are. Here the vehicle is driving along a path that has not been mapped before, this road over here, so it does not find any match, but as soon as it re-enters a known part of the environment, it can establish the data association and localize these images. Localize means here that through the matching with the database we know where we are, and in this way we can localize the vehicle just based on this image data. Again, this is something you would of course not use as a stand-alone localization system, but you may integrate it into an existing metric localization system to tell you where you are on a global scale, or to provide an alternative means of estimating your position globally.

This online search helps a lot to reduce the number of computations. You can see here a rather large cost matrix from the VPRiCE challenge for performing place recognition in those environments, and only a very small area of this graph actually needs to be explored; in this case it was only 0.5 percent of the overall comparisons that would otherwise need to be done. If you go to other very challenging environments, the number may be a bit lower, but you still only need to explore a fraction of the graph, so you can actually run this on a real platform and perform those comparisons multiple times per second. You can again see here the current image, the corresponding database image, and the part of the cost matrix that is actually computed and explored during the search; this is a zoomed-in view of the expanded cost matrix, and you can see how the system only needs to explore a small part of the matrix and actually finds the data associations.

We can also do this, for example, over a day-night drive: here we are driving at night and localizing with respect to an image sequence that was recorded during the day, and even under these strongly changing conditions we only need to expand a small part of the search tree. We can also show that we can find and track multiple hypotheses over extended periods of time, for example if we have two trajectories that look very similar. That is actually an interesting result: just by exploiting this sequential information, we can quite robustly match image sequences although the appearance of the environment has changed substantially. As a short reminder, we have done this using purely visual information: no geometry has been used, no connectivity information except the sequential information, no odometry, no visual odometry, and no cameras need to be synchronized. The cameras are just exposed, say, two or three times per second, and we perform this localization with just these image sequences.

This was, however, only part of the truth: there are situations where the system can fail, and one of the reasons is very flexible trajectories. For example, if we deviate from the originally taken trajectory and drive into a part of the environment where the system has never been, the system basically gets lost and re-enters the mapped space at a later point in time; the question is whether we can follow up on that. Shortcuts on our current route with respect to the reference trajectory are also something that currently cannot be handled. You can see it here in these cost matrices: here is a path the system is following, and then it would actually need to jump over here and continue the search there; or we have areas in the matching matrix where we cannot find any match, these two areas over here, where the system has been somewhere else. If I run the standard approach in this first setup, with no match here and no match here, and a matching trajectory here and here, we may find the first one, but finding the second one is just by chance, just luck, because we have been in an area that is not part of the map. This, however, is not a novel problem; it is something that has been investigated in the past.
One prominent example here is the kidnapped robot problem in Monte Carlo localization; this is an AAAI paper from 2000. What the authors suggest is that if the sensor data does not fit the current state estimate anymore, in this case a particle filter, a good idea is to sample new particle locations based on the observation model: take the observation into account to sample new hypotheses and see if those locations do a better job. What we need to do here is similar: if we do not find good matches anymore, we need to find other places that are potentially good matches. The only question is how to actually find them, how to find potential matches in this large database.

What we have done is a technique inspired by locality-sensitive hashing, where we use a hashing technique to find other potential matches. It is a form of inverted index: it takes the different dimensions of the feature vector, binarizes them, looks at the zeros and ones in the individual binarized dimensions, and then uses an inverted index to find potential images that look similar to the current image. It turns out that this is a very good technique for finding potential re-entry points for the search. It is not the case that this hashing technique always provides the best matching image, so what we do is look for a certain number of potential matches, use those matches to create new neighbors in the data association graph search, compare those images, and if we find a good match, we continue with this hypothesis.

Again we have this example: in this part of the environment we deviated from the trajectory, the system was lost, and the standard approach really failed; it only found this part of the trajectory and then, just by chance, the last one, but that was just good luck, and everything else did not work well. If we use the hashing technique instead, whenever we cannot find a good match we check whether we can find re-entry points somewhere else. It turns out the system is able, with this hashing technique, to say "I don't find anything up here", then finds the correct re-entry point here and localizes the system appropriately. Then it gets lost again and starts sampling; you see these blue dots over here, probing locations, but they are not very promising, so it does not expand them further, until it gets an image that is similar to a place it has already seen, and at some point the hashing technique actually finds the re-entry point, expands the search, and is able to localize the system again. So with this hashing technique I am able, in a lot of situations, to find those re-entry points and relocalize the system by just creating a few new neighboring edges in the data association graph.

You can also use this, for example, to localize along multiple trajectories: if you have multiple reference trajectories, not only one, you can use this hashing technique to find re-entry points in several of those trajectories. I brought a few examples of how that looks. This is a query trajectory taken with a car, and several reference trajectories, I think three or four in this case, had been recorded before; the system then localizes the trajectory that I am currently driving with respect to these reference trajectories.
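A minimal sketch of this binarized inverted index, assuming global image descriptors; the median-based binarization and the voting query are illustrative choices, not necessarily the exact scheme used in the work.

```python
# Binarize each descriptor dimension against its median over the database
# and keep an inverted index from (dimension, bit) to the images carrying
# that bit.  Candidate re-entry points are the database images that agree
# with the query in the most dimensions.
from collections import defaultdict
import numpy as np

class BinaryInvertedIndex:
    def __init__(self, ref_desc):
        self.num_images = ref_desc.shape[0]
        self.thresholds = np.median(ref_desc, axis=0)
        self.index = defaultdict(list)
        for img_id, row in enumerate(ref_desc > self.thresholds):
            for dim, bit in enumerate(row):
                self.index[(dim, bool(bit))].append(img_id)

    def candidates(self, query_desc, k=10):
        """Return the k database images sharing the most bits with the query."""
        votes = np.zeros(self.num_images)
        for dim, bit in enumerate(query_desc > self.thresholds):
            for img_id in self.index[(dim, bool(bit))]:
                votes[img_id] += 1
        return np.argsort(-votes)[:k]
```

The returned candidates would then be inserted as new neighbors into the data association graph and kept only if the actual image comparison confirms them, as described above.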
You can see here the correct matches in green, and the wrong ones, false positives or false negatives where the system missed an image; through the sequential information, the system can of course recover its position and knows where it actually is. If I then compare where I was driving with the trajectory I was matched against, we only need to find one color: as long as we found the correct color, everything worked well. Here I was localizing on the red trajectory, then I switched to the black one, and then here to the cyan trajectory, because this was the most similar-looking place in the scene.

We can go a step further and actually use Street View images, localizing a dashboard camera against Street View. In this example we recorded a dataset in Kiev, driving through the city, and matched only against Street View images; by extracting image sequences from a virtual drive through Street View, I am actually able to localize the camera in my car in this environment. That is very nice, because this is much closer to a map: I can just use existing image material from Street View in order to localize my vehicle. You can also see the trajectories we recorded and how they match against the sequences generated out of Street View, so we can actually localize in the Street View data.

I can take this even a step further and use a random YouTube video, so nothing we recorded on our own, just something we found on YouTube, and localize it against Street View images. We can take videos found randomly on the web from dashboard cameras, which a lot of people upload, and actually localize those images. I do not have ground truth information here, and it does not work as well as with the trajectories we recorded ourselves, but you can see that a large number of frames from this YouTube video can be localized within the Street View images, which is a very nice indicator that this is a robust system for localizing such sequences of image material.

As a short summary so far: we proposed a system for visual sequence-based place recognition that exploits the sequential nature of the data and works in changing environments, across different seasons, even matching current images against Street View data. So even if the environment has changed substantially, we can do this in an efficient way and in an online fashion, building up the graph on the fly and dealing with multiple trajectories, and the hashing approach really helps whenever the system is lost or has deviated from the trajectory.

The next question is: most of those cars actually have laser range data on board as well, so can we use laser range data to do something similar and localize with it? It turns out yes, we can. The approach is slightly different; we could have used the data association graph idea here as well, although we have not really done that, because this work started as a technique for improving loop closing in SLAM. Whenever you build a map of the environment and localize yourself in it, which is called simultaneous localization and mapping, you want to find those re-entry points, those loop closure points, and you can use a place recognition system for that.
This is an example: we have two 3D scans from a car; in one, the car is driving around here, and in the other it is driving down here, and the question is: is this the same place? We can rephrase this question as: do the 3D scans overlap? If those 3D scans overlap, I should be able to find a loop closure here, so I want to focus on this question: do these 3D scans overlap? Scan overlap is an older concept from photogrammetry, where you look at how images overlap, and we simply use it on range images taken from a 3D LiDAR such as a Velodyne or an Ouster scanner. We have a scan A and a scan B, and if we put the scans at the same location, we typically get a higher overlap between them than in situations where they are not aligned. Again, this overlap estimation has no idea where those scans were taken; we want to estimate the potential overlap between the scans directly. What we have done here is a neural network-based approach, fed with training data of scans taken at similar locations and at other locations, where the scans overlap and don't overlap, which tries to estimate this overlap. This is a technique we presented at RSS this year, and it is a technique for finding loop closures, again without using any metric information.

How the system works, in 20 seconds: we have different inputs from the two scans, such as the range information from our scanner, normal information, intensity information, and also semantic information estimated by a semantic segmentation system. Then there is a siamese network with two legs over here, and there are two heads: one estimates the overlap between the scans, and the other estimates the yaw angle, the relative rotation around the vertical axis, and this gives me an estimate of how similar those scans are. There is some pre-processing involved, computing the normals and taking the intensity information into account; then we have our network structure, the OverlapNet with the two legs and the two heads, which provides the output information. What I get is an estimate of the overlap, how much the scans overlap, and the yaw rotation between them, so that I can use this information for finding loop closures in the context of SLAM.

That is something we have done: we integrated this into SuMa, one of the LiDAR-based mapping systems developed by Jens Behley in my lab. What we see here: the blue curve is the standard loop closing technique used in SuMa so far, the green curve is the approach with this OverlapNet, not using any geometric information, and the red curve is the OverlapNet if we also take the pose uncertainty that we have in our SLAM system into account in order to find those closures. So OverlapNet can be used for loop closures, for doing place recognition based on 3D LiDAR scans, again without using metric information in the sense of where we are right now; of course, there is metric information within the local scan. What OverlapNet does is predict loop closures also in very challenging situations, where the scans share only very small overlapping areas and current loop closing systems typically fail, and it is now part of our SuMa system.
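To make the notion of overlap concrete, here is a minimal numpy sketch of how an overlap value between two scans could be computed when the relative pose is known, for example to generate training labels; the spherical projection parameters and the 1 m range threshold are illustrative assumptions. OverlapNet itself learns to predict this quantity, plus the relative yaw, from the two scans alone.

```python
# Project one LiDAR scan into the range image of the other and count
# pixels whose ranges agree; the known relative pose T_ab is only used
# here to compute a label, as one might for training.
import numpy as np

def spherical_projection(points, h=64, w=900, fov_up=3.0, fov_down=-25.0):
    """Project 3D points (N, 3) to an (h, w) range image (Velodyne-like FoV)."""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    r = np.linalg.norm(points, axis=1)
    yaw = -np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))
    u = (0.5 * (yaw / np.pi + 1.0) * w).astype(int) % w
    v = ((fov_up - pitch) / (fov_up - fov_down) * h).astype(int)
    v = np.clip(v, 0, h - 1)
    img = np.full((h, w), -1.0)
    img[v, u] = r            # later points overwrite earlier ones
    return img

def overlap(scan_a, scan_b, T_ab, max_range_diff=1.0):
    """Fraction of jointly valid range-image pixels where scan B, moved
    into scan A's frame by the 4x4 pose T_ab, agrees with scan A."""
    img_a = spherical_projection(scan_a)
    b_in_a = scan_b @ T_ab[:3, :3].T + T_ab[:3, 3]
    img_b = spherical_projection(b_in_a)
    valid = (img_a > 0) & (img_b > 0)
    agree = valid & (np.abs(img_a - img_b) < max_range_diff)
    return agree.sum() / max(valid.sum(), 1)
```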
The question we had after developing this work for RSS is: can we use this idea of overlap for localization as well? It is very similar to a place recognition system, so how can we integrate it into a localization system? We started out with good old Monte Carlo localization, a particle filter-based localization system, which basically has two steps. The prediction step samples from a motion model: where is the system at the current point in time, given we know where it has been before and some odometry command? The correction step of this recursive Bayes filter is done with the observation model: the likelihood of the observation given the pose and the map. For computing this observation model we now want to take OverlapNet into account: we put in our query scan, the scan we currently have, and a map scan, a scan taken at that location when building the map, and then compute this overlap.

What we do in practice is build a 3D model of the environment in the mapping phase and generate virtual frames at different locations on a small grid; those map scans are pre-computed scans generated at different locations by back-projection. We then need to compare them with the query scan, an operation done on a per-particle basis; for particles that are very close together we can group this operation, so that we are able to compute an observation likelihood for each particle using this idea of OverlapNet. In this example we generate these virtual scans for all the blue locations; if we zoom in, there is the area, we put a grid over it and can, as an example, take three locations over here; in practice it is a 20-centimeter grid. We compute the virtual scans, store them in the map, and use them for the comparison. So we have those map scans, and then we compare the query scan to the map scan of the individual particle: every particle is associated with one of those map scans, which is compared to our actual scan, and then we can use this OverlapNet architecture to come up with an observation model that we can use in Monte Carlo localization.

These are rather fresh results: we can then use this to estimate the pose of a vehicle. You can see the particle filter over here converges successfully after a short amount of time, and then we take the current 3D scan, take the generated map scan, perform the particle filter-based localization, and track the system over time, again just using information that is obtained in a topological way, and doing this only with the 3D LiDAR scans, not taking other information such as visual information into account here.
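A minimal sketch of this particle filter update, where `predict_overlap` stands in for the network inference and `nearest_map_scan` for the lookup of the pre-computed virtual scan closest to a particle; both names, the noise values, and the resampling scheme are illustrative assumptions.

```python
# Monte Carlo localization with an overlap-based observation model:
# particles are weighted by the predicted overlap between the current
# scan and the virtual map scan at the particle's position.
import numpy as np

def motion_update(particles, odom, noise=(0.05, 0.05, 0.01)):
    """Sample a new (x, y, yaw) pose per particle from the motion model."""
    dx, dy, dyaw = odom                 # odometry in the robot frame
    n = len(particles)
    c, s = np.cos(particles[:, 2]), np.sin(particles[:, 2])
    particles[:, 0] += c * dx - s * dy + np.random.normal(0, noise[0], n)
    particles[:, 1] += s * dx + c * dy + np.random.normal(0, noise[1], n)
    particles[:, 2] += dyaw + np.random.normal(0, noise[2], n)
    return particles

def observation_update(particles, query_scan, nearest_map_scan,
                       predict_overlap):
    """Weight particles by predicted overlap, then resample."""
    weights = np.array([predict_overlap(query_scan, nearest_map_scan(p))
                        for p in particles])
    weights = np.maximum(weights, 1e-6)
    weights /= weights.sum()
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]
```

In practice, nearby particles fall into the same grid cell of pre-computed virtual scans, so the network only has to be evaluated once per occupied cell rather than once per particle, which is the grouping mentioned above.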
However, if we go to real autonomous vehicles, we typically have maps that may look like this. This is an example of a map used by one of my collaboration partners; two of my PhD students whom I mentioned in the beginning work with Volkswagen and build vehicle localization there. It is a map with certain features, such as poles, which you can see here, and lane markings, and you want to localize in this map using your sensor data. What you typically do is a local data association between what you see right now and your map, and then, to make that robust, there is often a matching step involved, where you take the full map and perform a full match against it. On the one hand, this allows you to recover from failures quite quickly, and it is similar to the more global approach where you could also use image information to detect that you have deviated from the trajectory without the system realizing it. Then a temporal smoothing is involved; in this case it is not a particle filter but a graph-based, or factor graph-based, approach that performs the estimation. The steps are the same: take the sensor data into account, do local data associations, but also fuse global information in there.

How this online optimization can be done in an efficient way was worked out by Christian Merfels: it is a sliding window optimization approach; he developed it for pose graphs, but you can transfer it in the same way to factor graphs. It is a black-box integration system in which you can feed different pose sources and very seamlessly integrate visual odometry, laser-based localization, and place recognition, simply by feeding them into this pose graph. Today we transform this into a factor graph representation, where you have factors between the nodes you want to estimate, such as landmarks, vehicle positions, or even map reference frames. One of the key things here is that you can have GNSS involved, but you can also have alternative sources in there, such as the localization or place recognition capabilities of the systems I explained before, and you can feed them in the same way as you would a GNSS node. In the end, you can realize a localization system like in this example by Daniel Wilbers, done at Volkswagen, where you use all the information I have presented: you build up this graph on the fly, you fuse in data from matching poles, you can also integrate visual information and other sources in order to localize your vehicle, do this in an online fashion, estimate within a few milliseconds where your platform is, and integrate all this sensor information to localize the vehicle.

With this, I come to the end of my talk. What I hopefully have shown you is that you can use image data and LiDAR data for place recognition and vehicle localization. I talked more about the place recognition part, but you can integrate it into the localization system itself. Sequence information helps, especially if the environment changes substantially; I think we have seen this nicely in the first part about image-based localization. And this topological information also helps me to localize the system metrically, and it can be quite elegantly integrated through factor graph optimization. With this, I come to an end. I would especially like to thank Olga, Reini, Jens, Christian, and Daniel for their contributions to this work. For all the parts I talked about, except the last one, everything is available as open source, so you can download it and try it yourself. I hope this was entertaining and you enjoyed the talk. Thank you very much for your attention, and I wish you a nice remaining conference. Thank you very much.
Info
Channel: Cyrill Stachniss
Views: 2,409
Keywords: robotics, photogrammetry
Id: NB63C8L8CRM
Length: 31min 57sec (1917 seconds)
Published: Thu Aug 20 2020