Keynote Talk by C. Stachniss on LiDAR-based SLAM using Geometry and Semantics ... (ITSC'20 SLAM-WS)

Video Statistics and Information

Captions
Hello, my name is Cyrill Stachniss and I'm from the University of Bonn. It is my great pleasure to be here and talk to you about semantic and geometric information for the simultaneous localization and mapping problem in the context of autonomous vehicles or self-driving cars. I would really like to thank the organizers for giving me the opportunity to speak here, and I hope I can entertain you for the next 25 minutes or so, presenting the work that my team and I have done over the last approximately three years in this context.

I want to start with the questions that most autonomous vehicles have to answer when they want to navigate through an environment: where are we, what does the environment look like, and what is going to happen around us? These are central questions which pose a lot of state estimation challenges, and we aim at providing solutions and building blocks for the perception problems related to them.

How could that information look? What you see here is an illustration of an autonomous vehicle together with a colored 3D lidar scan. The points in the 3D world tell you something about the geometry, and the colors tell you something about the semantics, that is, what we are actually seeing: which point belongs to a car, to the road, to vegetation, and so on. On the other side you also see which instance an object belongs to. If we see things like cars or pedestrians, we may wonder which points that have been reflected by a car actually belong to the same car, so that we get the instance information needed to perform prediction and estimate what is going on in the scene.

The talk today addresses several aspects of what we need to estimate in order to get that knowledge. On the one hand we are interested in pose information: where are we with our vehicle? We are obviously also interested in geometric information: what does the world around us look like from a geometric point of view, where are obstacles, where are planar surfaces, and so on. But we are also interested in semantic information: what is it that causes a return signal? Is it a car, a pedestrian, a human? Is the flat surface in front of me actually the road surface, or is what I see part of a building? So we want to estimate, for every point in the 3D world shown in the illustration, a color value, where every color stands for a semantic class.

We do not only need semantic information in the form of a so-called semantic segmentation, we also want to estimate instance information: which objects do we see, and which of the pixels belong to the same object? Those green dots over here are car number one, and those blue dots over here are car number two. Both are 3D points that caused a return on a car surface, but they belong to two different cars, and the ability to distinguish them is important if you, for example, want to track the motion of those cars, because two cars may, and eventually will, move differently through the environment. So you want to be able to distinguish those cars from each other and get this instance information, which will allow you to do tracking and afterwards also make predictions about what is going to happen in the future, so that the vehicle can adapt its plans based on what it predicts is going to happen.
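To make the data layout of such a semantically and instance-labeled scan concrete, here is a minimal sketch in NumPy; the class ids, array names, and point count are purely illustrative and not taken from the talk.

```python
import numpy as np

# Hypothetical layout of one labeled lidar scan: geometry plus two label channels.
num_points = 120_000
points = np.zeros((num_points, 3), dtype=np.float32)    # x, y, z of each return
semantic_label = np.zeros(num_points, dtype=np.int32)   # e.g. 0 = road, 1 = car, 2 = vegetation
instance_id = np.zeros(num_points, dtype=np.int32)      # 0 = no instance ("stuff"), >0 = object id

# "Car number one" and "car number two" share the semantic class but differ in instance id.
car_mask = semantic_label == 1
first_car = car_mask & (instance_id == 1)
second_car = car_mask & (instance_id == 2)
```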
We are not tackling the prediction and tracking tasks here today, but we will start with poses, then geometric and semantic information, and give an outlook on instance information and what is needed in order to build such a system.

The first thing you need to address if you want to estimate poses and geometry is to provide a solution to the SLAM problem, SLAM meaning simultaneous localization and mapping. The simultaneous localization and mapping task allows you to estimate where you are and what the world surrounding the vehicle looks like from a geometric point of view. We are not interested in estimating semantic information at that point; we want to estimate the geometry of the scene. In our systems we rely on the so-called SuMa approach, a technique that Jens Behley, a postdoc in my lab, has developed over the last years. It is a lidar-based SLAM system designed for autonomous vehicles; it also works with other robots, of course, but some of the adaptations are tailored towards autonomous vehicles. It takes a rotating 3D lidar scanner, such as a Velodyne or an Ouster scanner or a similar sensor setup, and turns multiple of those scans into a globally consistent model of the environment.

How does it work, and what is special about this approach? Internally, the system does not work with point clouds; it builds a surfel-based representation. A surfel is a surface element for which we store a location but also a normal describing how the local surface is oriented, and the map internally is just a set of those surfels representing the surroundings. Surfel maps do not necessarily look very beautiful or aesthetic from a human point of view, but they represent the environment well enough to align new scans with respect to old scans and, for example, to perform loop closing, and therefore we use this surfel-based representation internally. Through the correction of the poses we can, in the end, of course also render globally aligned point clouds.

Internally, the scan registration works with a projective approach. It takes the model and an initial estimate of where the vehicle is in that model, renders a view by projecting the 3D surfel information into a virtual image, and then aligns this virtual image with the actual 3D range data the sensor provides. This form of projective registration allows us to avoid an explicit data association step beforehand: we basically get the data association through the projection and then optimize the viewpoint and align the scans. That is a technique that has been used successfully before, for example for RGB-D registration in indoor environments, and we use it here for 3D lidar scanners in the outside world.

We also need to perform loop closures. The standard approach again starts with an initial estimate that comes from the state estimation, tries to find locations where the projected image looks similar to the actual range image the system gets, and then tries to find loop closures based on scan registration. As is typical for such approaches, you need to see a few scans over time that are consistent with each other in order to accept the loop closure.
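The alignment step itself is not spelled out in the talk; as a rough illustration of the kind of update used in projective, point-to-plane registration, here is a minimal damped Gauss-Newton step. It assumes the correspondences (model points and normals per scan point) have already been obtained, for example by projecting the current scan into the rendered model view, which is a simplification of what SuMa actually does.

```python
import numpy as np

def point_to_plane_step(src_pts, model_pts, model_normals):
    """One damped Gauss-Newton update minimizing sum_i (n_i . (R p_i + t - q_i))^2.

    src_pts:       (N, 3) points of the current scan, already warped by the current pose guess
    model_pts:     (N, 3) corresponding points q_i rendered from the map
    model_normals: (N, 3) corresponding surface normals n_i
    """
    r = np.einsum('ij,ij->i', model_normals, src_pts - model_pts)      # residuals n . (p - q)
    J = np.hstack([np.cross(src_pts, model_normals), model_normals])   # rows: [(p x n)^T, n^T]
    H = J.T @ J + 1e-6 * np.eye(6)                                     # small damping term
    xi = np.linalg.solve(H, -J.T @ r)                                  # xi = [omega, t]
    omega, t = xi[:3], xi[3:]

    # Rodrigues' formula: turn the small rotation vector omega into a rotation matrix.
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        R = np.eye(3)
    else:
        k = omega / theta
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    return R, t   # apply as: src_pts_new = src_pts @ R.T + t
```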
Once you have a loop closure, you run a graph-based optimization in the background. We use a pose-graph-based system that allows us to correct the map and build up a globally consistent model. That is the work called SuMa, which was published at RSS 2018 and is now our standard toolbox.

Here is an example of how that looks. You see a vehicle moving through the environment; the color information here does not tell you anything about semantics, it just encodes the height of each 3D point above the ground. The scene is rendered from the KITTI dataset, and what is displayed is not the surfel information but the registered, corrected point cloud; you will see the internal surfel representation later in some of the videos. You can see how the map is consistently built up: this is a bridge going over the other road, and the vehicle navigates through the environment building a map. While it is driving, we build up the pose graph on the fly, indicated by the coordinate systems you see here, and when the system re-enters a known part of the environment, the loop closure system kicks in, finds a loop closure, and corrects the drift that was accumulated along the way, so that we get a consistent map of the environment. We can do the same for other sequences of the KITTI dataset: what you see here are the different sequences and the system building consistent maps, again registering scans with respect to the existing surfel model in this projective fashion, having a loop closure system in place, and running pose graph optimization. With this approach, SuMa is able to build consistent maps of the environment.

Okay, so this was the geometry part. What about semantics, what can we do to estimate the semantics of the scene? We have invested quite a bit of time in building fast semantic segmentation approaches for camera images. This was work done by Andres Milioto from my lab, who released Bonnet and the newer Bonnetal pipeline, which is basically an implementation of existing CNN architectures but really tailored towards speed, so that we can efficiently perform high-quality semantic segmentation at 50 to 70 frames per second and turn a camera image into semantic information, estimating which pixel belongs to which semantic class. One line of work in the lab, where Andres also took the lead, was using this for lidar data. The key question is: can we do the same thing for a 3D lidar scan, providing a point-wise labeling, so that we can take the raw 3D scan, where the color here is just the distance from the projection center of the scanner, and turn it into a semantically segmented point cloud, where the color refers to a semantic class, ideally exploiting the existing CNN architectures?
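Going back to the pose-graph backend mentioned above: as a toy illustration of the underlying least-squares problem, here is a minimal SE(2) pose graph solved with SciPy. SuMa optimizes 3D poses with its own solver, so the 2D parameterization, the fixed first pose, and the missing information matrices are simplifications for illustration only.

```python
import numpy as np
from scipy.optimize import least_squares

def wrap(a):
    """Wrap an angle to (-pi, pi]."""
    return (a + np.pi) % (2.0 * np.pi) - np.pi

def edge_residual(xi, xj, z):
    """Residual of one SE(2) edge: measured pose of j in frame i vs. the prediction."""
    Ri = np.array([[np.cos(xi[2]), -np.sin(xi[2])],
                   [np.sin(xi[2]),  np.cos(xi[2])]])
    t_pred = Ri.T @ (xj[:2] - xi[:2])          # translation of j expressed in frame i
    return np.array([t_pred[0] - z[0],
                     t_pred[1] - z[1],
                     wrap(xj[2] - xi[2] - z[2])])

def optimize_pose_graph(poses, edges):
    """poses: (N, 3) initial [x, y, theta]; edges: list of (i, j, [dx, dy, dtheta])."""
    x0 = poses[1:].ravel()                      # pose 0 stays fixed as the anchor

    def residuals(x):
        p = np.vstack([poses[0], x.reshape(-1, 3)])
        return np.concatenate([edge_residual(p[i], p[j], np.asarray(z)) for i, j, z in edges])

    sol = least_squares(residuals, x0)
    return np.vstack([poses[0], sol.x.reshape(-1, 3)])
```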
What we can do, instead of feeding the full 3D information into such a CNN, which is quite difficult, is to turn the 3D point cloud into a range image, because in the end the scanner also produces an image. The scanner rotates with, in this example, 64 lidar beams, so one firing can be seen as a single column of an image, and a range image is simply the collection of those scan lines arranged into an image. It does not use the central projection you would use for a camera; you basically project onto a cylinder, unroll the cylinder, and obtain a range image. This is again a 2D data structure in which every pixel stores the distance to the obstacle, the measured range. The question is whether we can use this 2D representation and turn it into a semantically annotated range image. If we can estimate the semantic information on this range image, we can project it back into the 3D scene and obtain the semantically segmented 3D point cloud. So instead of working on the 3D points directly, we go through the range image representation, perform the semantic segmentation there, exploiting what we know from image-based CNNs, and then project the result back into the 3D scene.
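A common way to implement this cylinder-unrolling projection is sketched below; the image size and the vertical field of view are typical values for a 64-beam scanner and are assumptions, not numbers given in the talk.

```python
import numpy as np

def scan_to_range_image(points, H=64, W=1024, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) lidar scan onto an H x W range image (spherical/cylindrical projection)."""
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    fov = fov_up - fov_down

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                      # measured range per point
    yaw = np.arctan2(y, x)                                  # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))

    u = 0.5 * (1.0 - yaw / np.pi)                           # column coordinate: unrolls the cylinder
    v = 1.0 - (pitch - fov_down) / fov                      # row coordinate: top row = upper beams
    col = np.clip((u * W).astype(np.int32), 0, W - 1)
    row = np.clip((v * H).astype(np.int32), 0, H - 1)

    range_image = np.full((H, W), -1.0, dtype=np.float32)   # -1 marks pixels with no return
    order = np.argsort(r)[::-1]                             # write far points first, closer returns win
    range_image[row[order], col[order]] = r[order]
    return range_image, row, col
```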
Besides the CNN that does the work, we need to train these systems, so we need data that tells us what semantic objects look like in this range image representation. For that, my lab, together with the labs of Jürgen Gall and Sven Behnke, both also at the University of Bonn, set up SemanticKITTI, which provides annotations for every single lidar point in the KITTI dataset. We manually labeled the training data by labeling every single point of every single range scan recorded in the KITTI dataset. This is a substantial labeling effort that takes more than a person-year, but once you have it, you can exploit this semantic information on the range images to train your classifier so that it produces similarly good estimates from new data. As another short example, in this video sequence you see the car driving around, and what I really like is that you can actually see the kid on the swing as the red object going forth and back, because really every scan has been manually annotated with the semantic class of every point.

The approach we developed, with work by Andres Milioto, Jens Behley, and a few other people from my lab, exploits a semantic segmentation architecture and investigates what we can do to improve semantic segmentation when working with this range image data. We feed the range image into our CNN architecture, which is heavily inspired by the SqueezeSeg architecture with a small number of modifications, and turn the range image into a semantically segmented range image. The problem we face in reality is that the boundaries of objects are often not segmented very well. In this range image you can see a few objects, like the car over here, and you can see that pixels which look nearby in the range image are actually far apart in 3D: part of the car's label gets projected onto the wall behind the car. You see a similar effect at a tree trunk, where some of the tree pixels end up on the vegetation lying behind the tree. So objects in the 2D range image bleed into the objects behind them, and one of the reasons is the downsampling in the CNN architecture, which leads to these mistakes.

What we have been doing is cleaning up those labels with a k-nearest neighbor voting in 3D space, taking the depth information into account, so that we can fix these errors. If you look at the standard output of the CNN-based semantic segmentation for lidar scans, you can see a lot of shadows of the cars on the walls, which are not well represented. The question is how this cleaning of the labels helps us do a better job. If I draw your attention to the circled regions, you can see that after the clean-up sweep the mistakes in those circles are fixed, and as a result the approach generates smooth labels: most of the boundary effects where the semantic segmentation bleeds into the environment go away, and we get consistent semantic labels estimated from the lidar data alone, taking just the lidar scan and nothing else into account.
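A simplified version of such a depth-aware k-nearest-neighbor vote might look as follows. The published post-processing runs on the range image with a GPU-friendly window search, so this plain 3D variant is only meant to convey the idea, and the neighborhood size and range threshold are made up.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_label_cleanup(points, labels, ranges, k=5, max_range_diff=1.0):
    """Re-label each point by a majority vote over its k nearest 3D neighbours,
    ignoring neighbours whose measured range differs too much (likely another surface).
    'labels' must be non-negative integer class ids."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)        # first neighbour is the point itself
    cleaned = labels.copy()
    for i, nbrs in enumerate(idx):
        nbrs = nbrs[1:]                                          # drop the point itself
        nbrs = nbrs[np.abs(ranges[nbrs] - ranges[i]) < max_range_diff]
        if len(nbrs) > 0:
            cleaned[i] = np.argmax(np.bincount(labels[nbrs]))    # majority vote
    return cleaned
```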
The next question is how we can exploit the geometric and the semantic information inside the SLAM system. So far I have told you how our geometric SLAM system works, but not how the semantics are integrated into it. What we want in the end is a map that encodes the geometry of the scene as well as its semantics: if we zoom into a local region, we want a representation in which every surface element has not only a geometrically correct pose but also its semantic information. There is an approach, an extension of SuMa published last year, that combines the ideas of SuMa and the surfels with the semantic information we are estimating. Again we take the raw input point cloud and perform a semantic segmentation very similar to what we did before with the SqueezeSeg-based architecture. For efficiency reasons, and because of how it could be integrated into the pipeline, it does not use the k-nearest neighbor approach but a depth-aware flood fill algorithm to filter out some of the noise; the results are very similar to the k-nearest neighbor voting that takes depth into account, the only advantage being that it was easy to integrate into the TensorRT-based framework in which this system has been implemented. So we basically obtain a cleaned-up label image, and then two things happen.

The first thing is that we can use the semantic information to filter dynamics in an easier and more robust way than if we ignored the semantics, and the second is that we can use the semantic information in the scan alignment by incorporating it into the projective ICP. The first step is a depth-aware, multi-class flood fill which eliminates small artifacts in the semantic annotations, similar to the k-nearest neighbor voting. The next important step is the filtering of dynamics, because we want to get rid of information that belongs to dynamic objects. As an example, these are the traces of a car moving through the environment, and ideally we want to filter them out so that they do not end up in our scans. It is, however, not as easy as simply removing all car pixels; you can do that, but it can lead to sub-optimal estimation behavior. We only want to remove those objects which are actually moving. In the original scan you can see a car coming towards us, but you also see a lot of parked cars nearby. We would like to remove the pixels of the oncoming car, but keep the parked cars, because they actually help the registration: they provide good 3D structure that the scanner can use for incremental pose estimation. So instead of removing everything that is movable, which would make all the cars vanish, we remove only the car coming in the opposite direction and keep the parked cars, by taking the semantic information into account, grouping the points, and checking which of those parts are actually moving and which are static. This lets us filter out dynamic objects; not all of them, but a good number. We can then also take the semantics into account when performing the scan registration by computing a mask of what we are seeing versus what is in our map; points that are inconsistent with the map and associated with typically dynamic classes can be masked out or down-weighted, so the semantic information is used within the registration process.

What you see here is a top-down view of a car navigating through the scene. This is now the surfel-based representation, where the color of a surfel encodes its semantic class. It is not just the semantic information from the current scan: each surfel maintains a probability distribution over classes and updates its belief over time. If a surfel has initially been labeled as, say, a car, and is then observed multiple times as a different class, for example road, the road class will eventually dominate the estimate, similar to how updates work in occupancy grid mapping.
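As a small sketch of what such a per-surfel belief update could look like; the symmetric noise model and the numbers below are assumptions for illustration, not the update used in the actual system.

```python
import numpy as np

class SemanticSurfel:
    """Minimal sketch of a surfel that keeps a probability distribution over semantic classes
    and fuses new, possibly noisy, per-scan labels, in the spirit of occupancy grid updates."""

    def __init__(self, num_classes, prior=None):
        self.prob = np.full(num_classes, 1.0 / num_classes) if prior is None else np.asarray(prior, float)

    def update(self, observed_label, hit=0.75):
        """Fuse one observed label; 'hit' is the assumed probability that the label is correct."""
        num_classes = len(self.prob)
        likelihood = np.full(num_classes, (1.0 - hit) / (num_classes - 1))
        likelihood[observed_label] = hit
        self.prob *= likelihood                 # Bayes update with a symmetric noise model
        self.prob /= self.prob.sum()
        return int(np.argmax(self.prob))        # current most likely class of this surfel

# A surfel first seen as "car" (class 1) but then repeatedly observed as "road" (class 0)
# will switch its most likely class to "road" after a few updates.
```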
We can also show that this gives a competitive advantage for the scan matching, especially in challenging situations. What you see here is probably the trickiest part of the KITTI dataset: the car drives along a highway towards a traffic jam, and all cars move at basically the same speed as, or a similar speed to, our vehicle. As a result there is very little 3D structure the system can rely on, and if we run a purely geometric SLAM approach on this, there is a large deviation between the estimated trajectory and the ground truth. If we instead take the semantic information into account, we can better estimate which cars are moving, exclude them from the registration, and in this way improve the estimate. Let me rerun that video so you can see it better: you see the ground truth and the vehicle pose estimated by our setup basically overlapping. You can also see the traffic signs: in the purely geometric result they are smeared out, while here each traffic sign keeps its position. On a few static objects like these, we as humans can judge whether a map is consistent or not, and this is the kind of information the semantic SLAM approach exploits, so that the semantics actually help the geometry. That is not always needed; geometry by itself works pretty well, but there are a few situations that lead to failure cases, such as this one, where a lot of cars move similarly to our own vehicle and the semantic information makes the difference.

The next ingredient that SLAM systems typically have is a loop closing system. Loop closing means identifying whether we are at a place we have visited before: if you have, for example, two 3D range scans, shown here as top views, the question is whether this is the same place, whether this is a loop closure. Most systems, including our SuMa system, take the initial estimate they have, project the model into the current position, asking what an observation would look like if the estimate were right, and then try to align what the system sees with that projected scan; if this is successful for a number of consecutive scans, the loop closure is accepted. We can also go for a different approach, and one approach developed by Xieyuanli Chen, a PhD student in my lab, is a so-called overlap-based approach. Overlap is a concept used in photogrammetry for computing the overlap between images, and we generalized it to range images and 3D scans: it is basically the fraction of the range image that overlaps between two scans. We can use a learning approach to estimate whether two 3D input scans, encoded as range images, overlap, without using any pose information at query time. For training the CNN we of course need that information, but generating the training data is fairly easy if you already have a working SLAM system in place; you may add some loop closures manually or fix wrong ones, and then use this data to train the model to find those overlaps.
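As a rough illustration of the overlap notion between two range images, here is a minimal sketch. It assumes, for simplicity, that both scans have already been rendered into the same virtual viewpoint, which hides the reprojection step used in practice, and the agreement threshold is arbitrary.

```python
import numpy as np

def range_image_overlap(d1, d2, eps=1.0):
    """Fraction of jointly valid pixels whose ranges agree within eps metres,
    assuming both range images were rendered from the same virtual viewpoint."""
    valid = (d1 > 0) & (d2 > 0)                 # pixels with a return in both images
    if valid.sum() == 0:
        return 0.0
    return float(np.mean(np.abs(d1[valid] - d2[valid]) < eps))
```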
The system works as follows; it is called OverlapNet and was published at RSS this year. You have two range scans, and from each you use the range image, normal information that you can extract from the range image, the laser intensities, and the semantic information; this is a pre-processing step. You feed that into a CNN whose architecture consists of two legs with shared weights, one taking the first scan and the other taking the second scan. There are two outputs: one is the so-called delta-hat, which estimates the overlap between the two scans, and in addition we estimate the yaw angle offset using a correlation head. We estimate only the yaw angle, not the pitch and the roll, but in the autonomous driving domain this is probably the most relevant orientation angle. This information can then be used, for example, as an initial guess for a loop closure, which is verified using a scan matching approach. OverlapNet works very well; it has to be trained, of course, but we can also show that you can train it in one environment, transfer it to another environment, and still find good loop closure candidates.

A few examples: in this precision-recall curve, the blue dot is the original system used in SuMa, and you want to be as close as possible to that corner. The green curve is our approach without using any geometric information. If you additionally use the geometric information that comes from your pose graph to reject loop closures that lie, say, outside the three-sigma bounds, you can push it even further to the red curve, which is better than the original SuMa loop closing and also better than alternative loop closure measures that are popularly used. So this learning-based approach can tell you, for a pair of range scans, whether it is a potential loop closure, and you can use that as an initial guess for the subsequent ICP or loop closure verification that you already have, and in this way obtain a fairly robust system for finding loop closures even in very challenging environments.
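A stripped-down sketch of such a two-leg, shared-weight network in PyTorch might look as follows. The real OverlapNet uses several input cues per scan, a different backbone, and a dedicated correlation head for the yaw estimate; the layer sizes, the plain fully connected heads, and the yaw binning below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SiameseOverlap(nn.Module):
    """Two legs with shared weights: each leg encodes one range image; one head regresses the
    overlap in [0, 1], a second head classifies the yaw offset into discrete bins."""

    def __init__(self, in_channels=1, yaw_bins=360):
        super().__init__()
        self.leg = nn.Sequential(                                      # shared encoder
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 8)), nn.Flatten())                # -> (B, 64 * 8)
        self.overlap_head = nn.Sequential(nn.Linear(2 * 64 * 8, 128), nn.ReLU(),
                                          nn.Linear(128, 1), nn.Sigmoid())
        self.yaw_head = nn.Sequential(nn.Linear(2 * 64 * 8, 256), nn.ReLU(),
                                      nn.Linear(256, yaw_bins))

    def forward(self, scan_a, scan_b):
        fa, fb = self.leg(scan_a), self.leg(scan_b)    # same module = shared weights
        f = torch.cat([fa, fb], dim=1)
        return self.overlap_head(f).squeeze(1), self.yaw_head(f)
```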
With this I am coming to the end of my talk about the core SLAM system that we have developed using semantic and geometric information, and I want to use the last minutes to give you a short outlook. This is work that will be published very soon at IROS 2020 this fall, and it consists of two parts: one is a generalization to other datasets and other sensor setups, and the second is a panoptic segmentation system.

If you remember, our semantic segmentation system goes from the raw scan to the semantically segmented scan through the range image. The problem with this approach, which is attractive because you can directly use the CNN architectures from computer vision, is that the generation of these images depends on the configuration of your scanner. If you use a different scanner, or a different sensor setup, the resulting scans look substantially different from the original range scans, and in this case the performance of the system typically degrades. The reason is that the way the scanner generates the image, and where the scanner is mounted on the car, has a substantial impact on what the scanner sees, and this does not generalize well to other setups. Here is a small example: the system was trained on KITTI, so this is a semantic segmentation on KITTI, and if you run the classifier trained on KITTI on the nuScenes dataset, you can see that the prediction gets worse compared to the ground truth; there are a lot of label mismatches between those scenes. That means we need to adapt our classifier so that it can run on new datasets without retraining from scratch. What we have done is build a system for domain transfer, which uses a combination of a SLAM system, simulating data with a different sensor configuration from that SLAM map, and a GAN-based approach to perform label transfer, so that we can transfer our classifiers to new domains. This allows us to transfer a system trained with one scanner even if the sensor setup of our car changes, which we believe is a very important step.

The last outlook I want to give is a panoptic segmentation approach, a unified perception system providing you with the semantic segmentation on the one hand and the instance segmentation on the other. Most state-of-the-art systems today take the lidar input and run one neural network that performs object detection and gives you the instance information, and an independent neural network that provides the semantic segmentation, and then they try to fuse this into a semantic instance segmentation. This has several problems: first, you do quite a bit of redundant computation, because you need to run those neural networks in parallel; second, there is not necessarily good output coherence between the semantic information and the instance segmentation, because these are two different pipelines. We have been working on fusing this into a single pipeline with one encoder and two different decoders, an instance decoder and a semantic decoder, which allows a joint estimation. With this network architecture you can estimate, on the one hand, all the instances, for example the cars, the pedestrians, the cyclists, the "things" you see, together with the "stuff" classes, so what is vegetation, what is road surface, and based on a single scan obtain a consistent estimate of the semantic segmentation as well as the instance segmentation, which is a valuable input for the autonomous driving domain.

This brings me to the end of my talk. I have shown you that maps do not have to store only geometric information, as was done for quite a while; we can also estimate semantic information online and integrate it into the mapping system. We looked here exclusively at lidar data, using 3D lidar scans to perform this estimation and fusing everything in a combined semantic and geometric estimation approach. I also gave you two outlooks, on how to generalize to other sensor setups and on what a panoptic segmentation system could look like, which we will release at IROS this year. Thank you very much for your attention. I would like to thank my collaborators Andres Milioto, Jens Behley, Xieyuanli Chen, Ferdinand Langer, Ignacio Vizzo, Emanuele Palazzolo, Chris McCool, and Philippe Giguère for their contributions to this work. Thank you very much for your attention.
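As a final illustration of the shared-encoder, two-decoder idea mentioned in the outlook above, here is a minimal PyTorch sketch. The offset-to-centre instance head is one common way to obtain instances and is an assumption here; the actual decoder design, channel counts, and class count are not taken from the talk.

```python
import torch
import torch.nn as nn

class SharedEncoderTwoDecoders(nn.Module):
    """One encoder over the range image, two decoders: a semantic decoder predicting per-pixel
    class scores and an instance decoder predicting per-pixel 2D offsets to the object centre."""

    def __init__(self, in_channels=5, num_classes=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

        def decoder(out_ch):
            return nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1))

        self.semantic_decoder = decoder(num_classes)   # per-pixel class scores ("things" and "stuff")
        self.instance_decoder = decoder(2)             # per-pixel offset to its instance centre

    def forward(self, x):
        f = self.encoder(x)                            # shared features, computed once
        return self.semantic_decoder(f), self.instance_decoder(f)
```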
Info
Channel: Cyrill Stachniss
Views: 2,875
Keywords: robotics, photogrammetry, self-driving cars
Id: vrdlk2p9AZI
Length: 31min 57sec (1917 seconds)
Published: Sat Sep 19 2020