Open-source SLAM with Intel RealSense depth cameras

Captions
You can think of every frame from a depth camera as a standalone 3D snapshot of the environment, capturing all the distances and the geometry of the scene. This technology has been around for a while, but it's becoming more widespread as sensors get cheaper and more robust. There are many great 3D sensors on the market, but for the purpose of this talk I'll be focusing on our D400 series of depth cameras. The D400 cameras share the ability to operate indoors as well as outdoors, including in bright sunlight. They provide reasonable accuracy within several meters, they create no multi-camera interference, allowing you to use as many of them as you want on a project, and they operate at reasonably low power of about one and a half watts. They were introduced to the market at the beginning of last year, so they have been around for a while now, but they have been evolving ever since, with new capabilities and new models being introduced all the time.

The D400 cameras operate using the stereoscopic vision principle. With depth from stereo, every frame is captured at the same time from two distinct viewpoints displaced by a constant baseline, and then the depth of every pixel is determined by how much that pixel moved between the two viewpoints. A pixel that moves a lot is close to the camera, while one that doesn't move at all is further away.

This technology has many applications. In this talk we'll be talking about localization and mapping, which is a critical component for indoor navigation, but depth is also very useful for collision avoidance, especially if you have obstacles coming from all directions and not just located on a single floor plane. Depth can be combined with modern, cutting-edge computer vision techniques for better scene understanding: let's say you have a robot that needs to detect cats, so you deploy some network on your device and run inference, but then you can use depth information to make sure that what you detected is actually a real object and not an artifact, and not, let's say, a picture of a cat. Measurement and manipulation are two other applications which are closely related to mapping, and finally you can use gesture and pose detection as a way to communicate with your system.

Now, this is nice in theory, but let's see it in the real world. This is from IROS last year. The two platforms are very similar, provided by HEBI Robotics, and they're trying to pick up the sandbags, with the main difference being that one of them is operated manually by a human while the other one is fully autonomous, driven by a depth camera. You can imagine which is which; the future is in automation.

Another interesting component is the inertial measurement unit. The IMU is a collection of sensors, usually containing an accelerometer and a gyroscope, and sometimes other sensors. The accelerometer is constantly measuring the total force acting on the device, and the gyroscope is measuring the angular velocity. Using these two inputs, and by passing them through a simple filter, you can estimate the orientation of the device in 3D space, as well as get some limited understanding of local motion. We added this capability to our D400 series of devices last November with the D435i device.

Now let's talk about some software components that you might find useful when operating this hardware. First of all, you'll need a way to talk to the camera. This can be done using our open-source SDK: we are on GitHub, we respond to and encourage community feedback, and we accept community contributions. We run on most common operating systems and have integrations with a long list of languages and technologies. To name a few, we have examples with OpenCV, we integrate into Unity3D, we provide Python pip packages, obviously, and for the purpose of this talk we also support ROS, the Robot Operating System, which will be our next component.
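Since the SDK ships Python pip packages, grabbing a depth frame takes only a few lines. The following is a minimal sketch, not code from the talk; it assumes the pyrealsense2 package (`pip install pyrealsense2`) and a D400-series camera attached over USB:

```python
# Minimal sketch: read one depth frame and query a distance with pyrealsense2.
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
# Ask for a 640x480 depth stream at 30 FPS; available modes depend on the camera model.
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

try:
    frames = pipeline.wait_for_frames()
    depth = frames.get_depth_frame()
    # Distance, in meters, to the pixel at the center of the image.
    w, h = depth.get_width(), depth.get_height()
    print("distance at center:", depth.get_distance(w // 2, h // 2), "m")
finally:
    pipeline.stop()
```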
Well, ROS is not actually an operating system, but rather a collection of services provided for anyone who wants to build a robot. This includes standardized ways to communicate between the different components and the different host machines in the robot, a simulation environment, tools for visualization, and a fully functional navigation stack, which we can integrate at the end of this presentation.

Okay, the next component we're going to need is visual odometry. Visual odometry, or visual SLAM, is a family of algorithms designed to estimate the position and orientation of the device in relation to the environment, based on visual features that it sees in the environment, while at the same time building and updating a map of that environment. Conceptually this is somewhat similar to how sailors used to navigate based on the constellations and the horizon, using the same ideas of triangulation. There are many free and non-free SLAM solutions, but for this demo I chose to show you one called RTAB-Map. RTAB-Map is a free visual SLAM package. It's very robust, it can work with pretty much any depth camera, not just ours, and it provides the capability of loop closure, which means that when a robot approaches a place it has seen before, it will pick up on the previously discovered features, correct its position, and also propagate that correction to the map it's building. We'll see it in action in just a minute. RTAB-Map is also available as a ROS package, so installing it is very simple.

If we take the concept of visual odometry and compare it to the process of IMU filtering, we can see that these two technologies are closely related and complementary to each other. IMU data is available to you at a very high frequency of several hundred samples every second, while visual odometry is limited by the exposure of the camera and the visual processing that you must perform. On the other hand, the IMU suffers from accumulated drift over time, while the visual side can correct itself based on the features in the scene. That said, if you have no features in your scene, then you have no visual odometry, and that's where the IMU can save you, because the IMU is very robust: unless you're on another planet, the gravity vector is going to be there. And finally, IMU integration can be performed very fast, while visual algorithms usually require a significant amount of CPU power.

So the natural question is, why not have both? And this is exactly what we're going to do. We're going to take the RGB data, add the depth, and pass it through RTAB-Map, and we're going to take the IMU data and pass it through an IMU filter, in this case the Madgwick filter, which is available in ROS, and we're going to add an additional Kalman filter on top to fuse the two types of odometry. This way our camera transform is going to be very frequently updated from the IMU data, but then occasionally corrected using the data coming from RTAB-Map. This process is documented on our wiki, and I'm not going to show you the exact terminal commands, but it's all in there; you can just follow a simple set of instructions and have this whole thing running on your computer.
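The exact ROS graph (Madgwick filter plus a Kalman filter on top) is documented on the wiki. Purely as an intuition aid for why the fusion helps, here is a deliberately simplified, toy complementary-filter sketch: high-rate gyro integration that gets nudged by occasional visual-odometry updates. The class and numbers are made up and this is not the filter used in the demo:

```python
# Toy 1-D heading fuser: integrate the gyro at high rate, and blend in the
# slower visual-odometry estimate when it arrives. This is NOT the
# Madgwick + Kalman pipeline from the talk, just the underlying idea.

class YawFuser:
    def __init__(self, blend=0.05):
        self.yaw = 0.0      # estimated heading in radians
        self.blend = blend  # how strongly a visual-odometry update pulls the estimate

    def on_gyro(self, yaw_rate, dt):
        # High-frequency prediction (hundreds of Hz): integrate angular velocity.
        # Any gyro bias accumulates here, which is why corrections are needed.
        self.yaw += yaw_rate * dt

    def on_visual_odometry(self, yaw_vo):
        # Low-frequency correction (camera rate): pull toward the visual estimate,
        # which does not accumulate bias but can drop out when the scene has no features.
        self.yaw += self.blend * (yaw_vo - self.yaw)

fuser = YawFuser()
for _ in range(200):                            # one second of 200 Hz gyro samples
    fuser.on_gyro(yaw_rate=0.52, dt=0.005)      # turning at roughly 30 deg/s
fuser.on_visual_odometry(yaw_vo=0.50)           # an occasional VO fix reins in drift
print(round(fuser.yaw, 3))
```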
And here are some results. On the left you see the camera feed, and on the right is the real-time 3D point cloud of the environment as the map is being built. In this experiment the wheeled platform was traveling roughly 200 feet around a dense cubicle area, and you can already see that there's some accumulated drift, the corridors are not exactly straight, but this is exactly where loop closure will kick in: as the platform approaches its original position, you can see it snap back into the correct geometry.

So now let's talk about the limitations and ways to improve this solution. The first problem that we're going to encounter is that we need to be very careful going around corners. In this example we're approaching a mostly flat wall and the visual odometry is not picking up on many features, so it's relying on the IMU, and the smallest bump to the odometry just sends the scene spinning; you can see that it completely lost track of where it is. Now, we can live with it by backtracking to a point we have seen before and then letting RTAB-Map close the loop and relocalize to the known position. But the ways we could improve this are, first of all, by increasing the camera field of view: if we increase the field of view we will be capturing more features at any given time, so it will be much harder to find ourselves completely blindfolded. The second way we could improve is by investing in IMU quality, which would let us operate based on the IMU alone for slightly longer; still not infinitely, but a bit longer.

The second limitation is the accumulated drift, at least prior to loop closure. Once the system detects the loop closure and corrects itself, it propagates that knowledge across the entire graph, but before then, if you were just walking in a straight line, there is some visible drift and the accuracy is not that great. Of course, this is somewhat of a moot point, because we just took a couple of off-the-shelf components, without playing with any of the parameters, threw them together, and these are the results we are getting. If you are willing to invest more in fine-tuning the parameters for your specific use case, you can definitely get much better accuracy from this method.

Another point that I want to discuss is the trade-offs that are inherent between the localization problem and the mapping problem. For localization you basically need the SLAM, and for mapping you need the depth camera, or some kind of sensor that will map the environment, and there are some competing trade-offs. If you want better tracking, as I mentioned before, you want to increase the field of view, but as you increase the field of view you are basically stretching the same number of pixels over a bigger physical area, so the depth quality will significantly suffer; for depth quality you want a narrow field of view. Secondly, depth cameras love projecting different types of patterns: structured light projects one type of pattern, active stereo has its own, and all these patterns are terrible for tracking, because instead of tracking the environment you'll be tracking the pattern, which is somewhat like chasing a laser pointer. So for better tracking we would prefer to operate just in visible light, where there are not many projections (because we as humans don't usually like them), but for depth we would usually want a broader spectrum of visible and infrared light.
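A rough back-of-the-envelope for that field-of-view trade-off, using the standard stereo relation depth = f·B/disparity. The numbers below (baseline, resolution, FOVs) are made up for illustration and are not the specs of any particular camera:

```python
# Why a wider FOV hurts stereo depth: Z = f_px * B / disparity, so for a fixed
# image width the focal length in pixels shrinks as the FOV grows, and each
# pixel of disparity error costs more depth error. Illustrative numbers only.
import math

def focal_px(image_width_px, hfov_deg):
    # Pinhole model: f = (W / 2) / tan(HFOV / 2)
    return image_width_px / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))

def depth_err_m(range_m, f_px, baseline_m, disparity_err_px=1.0):
    # First-order error propagation: dZ ~= Z^2 / (f * B) * d_disparity
    return (range_m ** 2) / (f_px * baseline_m) * disparity_err_px

baseline = 0.055   # 55 mm baseline (made up)
width = 1280       # depth image width in pixels

for hfov in (65.0, 90.0):   # a narrower lens vs a wider one
    f = focal_px(width, hfov)
    err = depth_err_m(range_m=3.0, f_px=f, baseline_m=baseline)
    print(f"HFOV {hfov:.0f} deg: f = {f:6.1f} px, ~{err * 100:.0f} cm error at 3 m per pixel of disparity error")
```

With these illustrative numbers, widening the lens from 65 to 90 degrees roughly doubles the depth error per pixel of disparity noise at the same range, which is the trade-off described above.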
And the way to tackle this problem is by introducing a dedicated sensor with a wide field of view and an IR-cut filter, chosen specifically for the problem of tracking. Finally, performing visual odometry on the CPU can be quite expensive; in fact, on some platforms it's outright impossible, so it might be a good idea to offload this specific problem to a dedicated hardware accelerator. One example of such a hardware accelerator is the Intel Movidius chip, which is designed for low-power computer vision tasks.

These steps and these problems are exactly what brought us to the development of our latest product, the T265 tracking camera. We took a highly optimized SLAM algorithm, ported it to a low-power compute board, and paired it with a set of high-quality sensors that were chosen specifically for the problem of tracking. Here is a very similar experiment, but in this case, while we're still using a depth camera for mapping, the whole problem of localization is being offloaded to the dedicated tracking camera, and you can see that with the improved optics and the improved quality of the IMU, the tracking is both more robust and more accurate. In fact, as it approaches the point of relocalization, it's very hard to even see where exactly it misses the origin point.

Now, to give a somewhat live demo, not entirely live but almost live: do you recognize this room? This is a scan that I did using this exact software that I described, with these two cameras, and if you want to stick around later I can actually show it live; I don't want to risk a live demo while presenting. But you can actually see the results you would get, with the overall geometry of the room being pretty well captured, the different types of furniture all very much recognizable, and the stairs are also clearly visible, except that I don't know how to [inaudible]. So this is pretty much the raw point cloud that you're getting from what I described here. Now, with RTAB-Map you can basically convert it to whatever you want: you can export it to a format that you can play with in MeshLab, but you can also convert it to a 2D occupancy map and use it for navigation using the standard ROS navigation stack, and basically do whatever you want with it.

Here are some additional resources about RealSense and about our cameras, and that's my talk. Thank you so much, I'll be happy to answer as many questions as possible.
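The 2D occupancy map mentioned above comes out of RTAB-Map itself in the demo. Purely to illustrate the concept, here is a hypothetical sketch that flattens a 3D point cloud into a coarse 2D occupancy grid with NumPy; it is not how RTAB-Map or the ROS navigation stack implement it:

```python
# Flatten a point cloud (N x 3 array of x, y, z in meters) into a 2D occupancy
# grid: any cell containing points between floor and head height is "occupied".
# Concept illustration only; real mapping stacks also model free space,
# ray-trace, and handle noise.
import numpy as np

def occupancy_grid(points, cell=0.05, z_min=0.1, z_max=1.8):
    pts = points[(points[:, 2] > z_min) & (points[:, 2] < z_max)]  # drop floor/ceiling
    if len(pts) == 0:
        return np.zeros((1, 1), dtype=np.uint8), (0.0, 0.0)
    origin = pts[:, :2].min(axis=0)                       # grid origin in world coordinates
    idx = np.floor((pts[:, :2] - origin) / cell).astype(int)
    grid = np.zeros(idx.max(axis=0) + 1, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1]] = 1                        # mark occupied cells
    return grid, (float(origin[0]), float(origin[1]))

# A fake L-shaped wall, just to show usage:
wall = np.array([[x * 0.05, 0.0, 1.0] for x in range(40)] +
                [[0.0, y * 0.05, 1.0] for y in range(40)])
grid, origin = occupancy_grid(wall)
print(grid.shape, int(grid.sum()), "occupied cells, origin at", origin)
```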
Question: a quick question about the loop closure. How deep is the image buffer? In other words, let's say you want to recognize a distinct feature to close the loop; how long will it go until it can't remember anymore?

Well, that's an excellent question, and there's no single measurable answer for that. We're using a buffer of roughly 60 megabytes for the map that we maintain inside the device, and ultimately there is a graph constructed from the different nodes that you visit during your path. When relocalization occurs there's a back-propagation, a bundle adjustment performed to optimize that graph, but the size of the graph is basically determined by the topology and that 60-megabyte limitation of the device.

Question: that 60-megabyte limitation, are you referring to the D435, the T265, or both?

Currently I'm talking about the T265, because it is the one where we provide localization capabilities. In the case of RTAB-Map, it also maintains a similar graph data structure and it also does loop closure. I'm not entirely sure if it's capable of relocalization, I might be wrong on this, but basically the difference is this: loop closure is when you go around a circle and say "hey, I'm back where I started," while relocalization is when you go somewhere, then you blindfold yourself, then you go somewhere else and ask "where am I?" It's a very similar concept, but the implementation is a bit different. And in the case of RTAB-Map, I think these parameters are pretty much configurable; there's a lot of configurability in this package that we didn't even touch upon, but we can read the documentation.

Question: you said you used two cameras; which two, and why?

Yes, so for this demo I was using the D415 for the purpose of mapping, for the depth, and the T265 for tracking, for localization. The D415 was chosen because of its narrow field of view: we have several cameras, and the D415 is the one with the narrow field of view, which means that for this purpose it provides slightly better depth accuracy compared to a wider field-of-view solution. For the T265, it's the only one we have, but basically the purpose, what it's doing, is that while the main computer is working on the quadtree, the rendering, whatever inference you want to do, this device is quietly crunching numbers and running SLAM in the background, so you know that no matter what happens you still have the pose available. You can definitely work with the T265 on a Raspberry Pi; that's part of the purpose of this device, to make this visual odometry stuff available on these very low-power platforms, mainly to encourage robotics, but it's also very relevant in the case of VR and AR, augmented reality, where you put it on a headset and it tracks the motion of your head. In all these cases the compute that you're going to get is usually very limited, and to use this device you don't actually need that much; you don't even need wide USB bandwidth, because all you're getting back is the location and orientation of where you are, so it works well with USB 2.

Question: I was confused about one point. Intel is currently recommending the D435 and the T265 as a pair, but you used the D415 (that's right), so would you actually recommend the D415 instead?

It depends, it really depends on your application. Usually the D435 goes well with the T265 because it has a global shutter, so if you have a rapidly moving platform, for instance, the D435 family is pretty much the only one going on drones, so for that use case the D435 is better. But the D415 is absolutely usable, and the point is more that the individual selection doesn't matter that much: I could have just swapped in the D435, or any of the future RealSense devices, and it would have pretty much just worked.

Question: regarding different kinds of inference, what have you already done, and what is possible?

So this hardware is very similar to the Neural Compute Stick, but it's not exactly the same. When operated as a tracking camera, the only thing it is currently doing is solving the problem of tracking. In addition, Movidius chips are also available in different products and popping up in many configurations, but the most famous one is the Neural Compute Stick, which allows offloading of network inference, but that's a somewhat different product.

Question: what about mesh processing or extraction?

That's an excellent question, because our next speaker is going to show a lot of cool stuff, including mesh generation, and I'll let him pretty much introduce himself. Right, so, any more questions? [Applause]
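As a footnote to the Q&A answer about the T265 returning only a pose: reading that pose through the SDK's Python wrapper really is only a few lines. A minimal sketch, not shown in the talk, assuming a T265 attached (USB 2 is enough for this stream):

```python
# Minimal sketch: stream 6-DoF pose from a T265 with pyrealsense2.
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.pose)   # the T265 exposes a dedicated pose stream
pipeline.start(config)

try:
    for _ in range(100):
        frames = pipeline.wait_for_frames()
        pose = frames.get_pose_frame()
        if pose:
            data = pose.get_pose_data()
            # Translation in meters, rotation as a quaternion, plus a 0-3 confidence level.
            print(f"xyz=({data.translation.x:.3f}, {data.translation.y:.3f}, "
                  f"{data.translation.z:.3f}) confidence={data.tracker_confidence}")
finally:
    pipeline.stop()
```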
Info
Channel: Intel RealSense
Views: 47,024
Rating: 4.9375 out of 5
Keywords: SLAM, Open-source SLAM, stereo depth, computer vision, autonomous navigation, collision avoidance, object detection, object detection drones, object detection robots, collision avoidance drones, Intel, RealSense
Id: tcJHnHpwCXk
Length: 27min 40sec (1660 seconds)
Published: Mon Jul 01 2019