Simultaneous Localization And Mapping (SLAM)

Captions
I'm Ed Sperling, editor-in-chief of Semiconductor Engineering. I'm here with Cadence's Amol Borkar, who's going to talk today about simultaneous localization and mapping. So what is simultaneous localization and mapping? We typically know this as SLAM.

Yeah, it's a good question. Simultaneous localization and mapping, or, certainly in the industry, SLAM, is a computational problem that typically involves two parts: one is building a map of an unknown environment or space, and at the same time being able to track the position or movement of your camera, or otherwise an agent, in that space. So you are able to very accurately articulate the movement of a particular object in the scene.

So what markets will actually use this? Is this automotive, or does it go beyond that?

SLAM is used in a lot of markets and a lot of applications that exist in the world today. You can start with mobile phones and a lot of the augmented reality applications, a lot of the self-driving cars, drones when they are moving around and flying around by themselves, maybe virtual reality as well, video games. The list is really endless. There's very broad usage for this particular technology.

So why don't you draw this out for us? What are we looking at here?

You're looking at a basic flow diagram for how SLAM is typically implemented in most applications used today. We start first with a sensor that feeds data into a feature-matching block, which in turn is used to come up with the basic visual odometry, or pose estimation. Then you can do further refinements with loop closure, and with another stage following that called bundle adjustment.

So let's drill down into each one of these. What happens on the sensor side? What's actually going on in there?

Sure, great question. On the sensor side, SLAM in a lot of camera-based applications uses some type of camera sensor as input, but if you look at the autonomous driving space, they use a lot of radar and lidar, so it's typically sensor-agnostic. From a sensor perspective, this could be an RGB camera, grayscale, time-of-flight, stereo, or it could be radar or lidar. I've even seen cases where customers used something like a barometer. If you use the sensor properly for its particular capability, you can feed it into your SLAM block.

If you focus on visual SLAM, which is more camera-based, the feature extraction stage is more or less how we perceive the scene. When we look out at the world, how do we understand the different points in the scene? We look at corners, we look at edges, we look at colors, and things like that. So in the feature-matching stage, the goal is basically to find those specific corners that may be there in a particular environment. You can say that if a camera is looking at this duster over here, there could be several corners on the duster, but when you move to frame number two, the goal is to be able to identify corners 1, 2, 3, 4 as the same corner points, or interest points, in the second frame. For how you relate those, there are a variety of algorithms that can be used, but typically for feature extraction you can use things like SIFT, SURF, difference of Gaussians, ORB, and things like that.

Moving now to the pose estimation stage, you take these features that you have found and identify how they move from one frame to the next, and that allows you to estimate how the camera, or the object, has moved from one frame to the next.
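
To make the feature extraction and matching stage concrete, here is a minimal sketch using OpenCV's ORB detector, one of the algorithms named above. The frame file names and parameter values are placeholders, and a production front end would add filtering and outlier handling on top of this:

```python
import cv2

# Two consecutive frames from the camera (paths are placeholders)
frame1 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread("frame_002.png", cv2.IMREAD_GRAYSCALE)

# Detect corner-like interest points and compute binary descriptors for each frame
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(frame1, None)
kp2, des2 = orb.detectAndCompute(frame2, None)

# Match descriptors between the frames (Hamming distance suits ORB's binary descriptors)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Each match pairs an interest point in frame 1 with the same physical point in frame 2;
# these correspondences are what the pose-estimation stage consumes
print(f"{len(matches)} candidate correspondences between the two frames")
```
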
The human brain tends to pick this up very easily with object permanence. But if you're looking at a billboard from a car, and the car has cameras on it and you're driving along, does it understand that this billboard is the same billboard in the next frame as what you saw before, since you're coming at it from a slightly different angle?

Yes, that's a great question. That's where your feature-matching stage actually comes in, because the goal of features like SIFT, which stands for scale-invariant feature transform, or other descriptors and things like that, is to identify those interesting points or corners in frame number one, and when you see that same object in frame number two, these feature-matching capabilities allow you to identify that, let's say, those four corners you saw in frame 1 are the same four corners you see in frame 2. Once you have that information, you can essentially correlate those two frames and say, okay, it's moved this much from frame 1 to frame 2, and from that I can articulate and understand how my camera has moved in the scene.

Are you working off of probabilities, as in there's a 99% probability that this is the same object you were looking at before?

Yeah, probabilities do come into play here, and the thresholds can vary depending on the implementation. Everybody has their own flavor for the algorithms, so it varies from one case to another, but typically probabilities are there. There's also outlier elimination, because obviously there is noise that comes up when you are trying to match these feature points, and that's where techniques like RANSAC are used to eliminate outliers and keep predominantly the key feature points that correspond well from one frame to the next.

So what else is in the flow here?

Okay, so once you have the pose estimation, the next step is called loop closure. The goal of loop closure is basically to identify that you have visited, or have been to, that particular spot before. A simple example: let's say I'm in this room over here. If I walk around this room and come back to this particular point, loop closure would help me identify or establish that, hey, I have been here before, so it's not a new area that has to be mapped. And then bundle adjustment is a common step that is used to further reduce the accumulated errors once you do a loop closure. Now, typically what you would also see over here is a feedback step, which would come from either loop closure or bundle adjustment back into pose estimation.

So let's dig into this feedback a little bit more. What's actually going on here, and what's the impact of it?

Sure. As you can see, loop closure and bundle adjustment both feed back into the pose estimation, because what happens with this feedback is that it is constantly refining or updating your state estimate, which is basically the pose of your camera, based on the additional information that you have captured by, let's say, walking around the scene and building a more accurate and more robust map.
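
As an illustration of the pose-estimation step, with RANSAC used to discard mismatched features as described above, here is a rough monocular sketch built on OpenCV. The intrinsic matrix K and the matched point arrays are assumed inputs (from camera calibration and a matching stage like the one sketched earlier); this is one common formulation, not necessarily the specific implementation being discussed in the interview:

```python
import numpy as np
import cv2

def estimate_relative_pose(pts1, pts2, K):
    """Estimate rotation R and unit-scale translation t of the camera between two frames.

    pts1, pts2: Nx2 arrays of matched pixel coordinates in frame 1 and frame 2.
    K: 3x3 camera intrinsic matrix, assumed known from calibration.
    """
    # RANSAC fits the essential matrix while rejecting outlier correspondences
    E, inlier_mask = cv2.findEssentialMat(
        pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)

    # Decompose the essential matrix into the camera motion between the two frames
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
    return R, t, inlier_mask

# Placeholder intrinsics for illustration; pts1/pts2 would come from the
# matched ORB keypoint coordinates in the earlier sketch
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
```
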
There are a couple of terms that tend to go with this, one of which is VIO. How does that play in here?

Great question. The terms, I think, are used somewhat interchangeably in the industry. There is visual odometry, or VO, and visual-inertial odometry, VIO. There's also SLAM, and 6DoF, which is just six degrees of freedom. Visual odometry and VIO are somewhat similar in the sense that there is typically no feedback coming from loop closure or bundle adjustment, as compared to SLAM, because visual odometry is a more local problem. Let's say I have frame 1 and frame 2; I'm just estimating how my camera has moved from one frame to the next, so it's far more local. SLAM, in contrast, uses a lot more global data, because you're building a map, and as you add more and more data to that map you get a more accurate representation of your position, further updating your camera position over and over again with loop closure, relocalization, and all those things. So it depends on the application, it depends on the compute budget that your particular processor or platform has, and the varieties of applications are different. If you're talking about drones, I've seen many cases where they just used visual-inertial odometry, because they would have to build a very large map, and keeping that much data locally is not possible. Versus on mobile phones, where you're doing augmented reality, it's usually a much smaller playing area, so SLAM is often more applicable there.

Like all electronics, something always goes wrong. What can go wrong here?

With SLAM, or with VIO, it typically comes down to the limitations of your sensors. In SLAM, for example, they typically do a combination of a variety of sensors, so sensor fusion. The most common solutions have maybe a camera along with an IMU, an inertial measurement unit, which allows you to understand how your object is rotating in the scene. Now, although cameras can see the world very easily, there are some cases where there are limitations. If a camera is looking at a flat wall like this, there are not many features there, and if you were to just move the camera around looking at a flat wall, SLAM wouldn't really be able to work too well, because from visual data alone you don't know how the camera is moving. Other than that, there are also noise models that you have to estimate very well. Typically for your IMU you have to understand the different types of noise that are there on the particular device and be able to accurately model them, to give you a much better SLAM implementation.

And sensors also need to be recalibrated over time, because they do begin to drift. What you start out with is not necessarily what you end up with. How does that affect this?

Typically with SLAM, loop closure and relocalization, which is using loop closure to understand that you have come back to your original point, help with a lot of cases of drift. If you did not have this and were pretty much just relying on, say, your visual-inertial odometry, in those cases you could probably expect a lot of drift to happen, unless you have an algorithm or an implementation that is very, very accurate and not prone to any noise, which in most cases is probably not going to be the case.
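
The drift-and-loop-closure behavior described here can be illustrated with a small toy example. This is purely conceptual: it dead-reckons a noisy square trajectory and then spreads the loop-closure residual evenly along the path, whereas a real SLAM back end would solve a pose-graph or bundle-adjustment optimization instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D walk around a square room that returns to its starting point
true_steps = ([(0.1, 0.0)] * 50 + [(0.0, 0.1)] * 50 +
              [(-0.1, 0.0)] * 50 + [(0.0, -0.1)] * 50)

# Odometry-only (dead-reckoned) trajectory: every measured step carries a little
# noise, so error accumulates into drift by the time we are back at the start
poses = [np.zeros(2)]
for step in true_steps:
    poses.append(poses[-1] + np.array(step) + rng.normal(0.0, 0.005, size=2))
poses = np.array(poses)

print(f"drift before loop closure: {np.linalg.norm(poses[-1] - poses[0]):.3f} m")

# Loop closure: recognizing the starting point tells us the final pose should
# coincide with the first one. Crudely distribute the residual along the path.
residual = poses[0] - poses[-1]
weights = np.linspace(0.0, 1.0, len(poses)).reshape(-1, 1)
corrected = poses + weights * residual

print(f"drift after loop-closure correction: "
      f"{np.linalg.norm(corrected[-1] - corrected[0]):.3f} m")
```
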
Another angle on this is power and performance. Obviously you need as much performance as you can possibly get, and as little power as you can possibly get. What can you do here to actually reduce the power and improve the performance?

Great question. SLAM as a technology has been around for a long time; I think it was introduced back in the '90s, and you started seeing some real-time implementations somewhere in the mid-2000s. The DARPA Grand Challenge and Urban Challenge contests were primarily relying on SLAM implementations, but at that time you had rack-mount servers mounted in the trucks, and you can't really go to production with something like that. Over the years the algorithms have been refined and have gotten more power-efficient, and the hardware has obviously improved, so you have seen implementations moving onto CPUs, onto GPUs, and now even onto mobile phones. But power and performance are two things that are obviously very important for a customer, because you don't want SLAM running on a phone that you can only run for 10 minutes before it gets super hot and you have to charge it. So power efficiency is very, very important. Although the solutions run very well on CPUs and GPUs, those types of platforms aren't built for these specific applications, so they work well but they are power-hungry and not the most efficient. If you start scaling more toward things like DSPs or accelerators, you probably get much better performance as well as a much smaller power envelope.

Alongside all of this there's been a huge push into accelerators for almost everything involving large amounts of data. What's going on here, and what's the best platform for this? Do you stay on a CPU, a GPU, a DSP? Do you add accelerators in here? How do you make that choice?

Sure, great questions. I think DSPs are probably a great way to go right now. CPUs are great for prototyping, in my opinion, and if you want to get some more performance or some more juice out of it, GPUs are a good way to go. Once you start targeting large-volume deployment, DSPs are definitely a great way to go. For hardware accelerators, I think it's still too early for SLAM, because a hardware accelerator is typically some type of RTL block that will give you a performance boost, but then you are really locked into however the algorithm is implemented or configured. Those types of hardware, in my opinion, are best when there is a lot of standardization to the implementation. You typically see accelerators benefiting in cases like video codecs and things like that, where you have well-defined standards for what needs to be done in the implementation. With SLAM, everybody still has their own specific flavors, so some amount of flexibility and programmability is still very valuable. That's where a DSP, maybe with a couple of acceleration packages, is probably a good trade-off or a great combination.

And these algorithms are changing almost weekly, right?

Yes. If you're on the bleeding edge, it's changing constantly, because there are research papers being published constantly. Although the SLAM industry is fairly mature now, there's always room for improvement and optimization: better feature extraction techniques, better tracking and pose estimation, integrating some AI, artificial intelligence, into the feature extraction stages. So this is constantly evolving. And aside from the R&D side, if you go to the actual commercial entities that provide SLAM, everybody has their own spice or their own flavor that they're putting on it, targeted at a specific usage or application. So although it's quite horizontal, you'll always find that for an automotive usage somebody has tweaked their SLAM slightly differently versus what's running on drones, versus what's running on the robot vacuum cleaner you might have at home.
Amol Borkar, thanks for a great explanation.

Thank you very much for your time.
Info
Channel: Semiconductor Engineering
Views: 16,707
Keywords: Cadence, Semiconductor Engineering, SLAM, sensor fusion, feature match, radar, LiDAR, automotive, loop closure, GPUs, accelerators, DSPs, AI
Id: MxuBLW8hmRY
Length: 14min 9sec (849 seconds)
Published: Tue Nov 05 2019