New SOTA Depth Estimation Model with a Monocular Camera

Video Statistics and Information

Captions
Hey guys, welcome to a new video. In this video we're going to take a look at some new state-of-the-art monocular depth estimation models: we can basically just take a single image, feed it into a model, and get a depth map out. In previous videos here on the channel we have covered the MiDaS model and how you can use it. It is still faster compared to these models, but now we have some recent research using diffusion models to create these depth maps, so we can take a single image, feed it through the model, and get the depth map out with really good accuracy from these new state-of-the-art models.

Before we jump into the first one, I'm just going to show you this example from Google DeepMind. I'm not sure if they are going to release the code, but it is a really nice model. Then we're going to see another one where all the code is already available and you can play around with it; it is significantly better compared to the MiDaS model and anything else out there. It is not as fast, because it uses diffusion models, basically built on top of Stable Diffusion; they use those encoders to be able to generate these really nice depth maps.

Here we can see some examples; let's just go through them. We can see the input and the ground truth, so this is the input image, and then we can see the results down at the bottom. To the right we have the model from Google DeepMind: zero-shot metric depth with a field-of-view conditioned diffusion model. For all of the examples we look at, we're also going to go into the code and see how to run it, both in the Hugging Face Space and in a Google Colab notebook. Here we can see their results, and they also compare to ZoeDepth, which is another really good model for depth estimation. These models actually predict absolute depth values: with the MiDaS model the output is relative, so all the depth values are relative to the camera, whereas these models try to predict absolute depth, so we actually get a metric for all the values in the image. Now we can actually predict that this chair is, say, one metre away from our camera.

We can see that ZoeDepth has some problems the deeper we get into the image, or the further out we get, whereas the model from Google DeepMind keeps really good accuracy, even in this classroom example: we have the ground truth, and then their model at the bottom to compare. There is still some way to go to match the ground truth, but we're going to see the other model in just a second, which is the new state-of-the-art one, and we have code for that. If you use, for example, a stereo camera, you often just get a depth map like this, and you can see that these models are significantly denser and smoother compared to that. One of the other main problems with monocular depth estimation models is handling both outdoor and indoor scenes.
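A quick aside on the relative-versus-metric point above: a relative (affine-invariant) depth map like the one MiDaS produces can be aligned to metric units if you have even a few known distances in the scene, by fitting a scale and shift in the least-squares sense. This is not from the video, just a small numpy illustration with made-up numbers:

    import numpy as np

    # Relative depth predictions at a few pixels (arbitrary units, e.g. from MiDaS)
    # and the true metric distances at those same pixels (e.g. from a depth sensor).
    # All numbers here are made up for illustration.
    pred_rel = np.array([0.20, 0.35, 0.50, 0.80])
    gt_metric = np.array([1.1, 1.9, 2.6, 4.2])  # metres

    # Solve gt ≈ s * pred + t in the least-squares sense.
    A = np.stack([pred_rel, np.ones_like(pred_rel)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt_metric, rcond=None)

    # Apply the same scale and shift to the full relative depth map.
    full_rel_depth = np.random.rand(480, 640)       # stand-in for a real prediction
    full_metric_depth = s * full_rel_depth + t      # now in (approximate) metres
    print(f"scale={s:.3f}, shift={t:.3f}")

Models that predict metric depth directly, like the DeepMind one above, skip this alignment step entirely.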
So let's now jump into the second model, the one we're mainly going to look at in this video. It has a Google Colab notebook, a Hugging Face Space with code, and also a paper, and again it is a diffusion-based model where they use Stable Diffusion to generate these monocular depth maps. Again, it's just a single image: throw it into the model and we get our depth map out. These are not as fast as MiDaS; with MiDaS and those models we have actually tried running on a live webcam, so definitely go check out those videos if you're interested in that, or if you want something real-time for basic obstacle avoidance or whatever you want to use it for. It is very easy to set up. But if you want really high accuracy, as we can see here, we can now use these diffusion-based models.

These are the results we get with this model. If we scroll a bit further down we have an overview, and again we have a side-by-side comparison: here we have theirs, and they compare it to a model called LeReS. You can see the details, even in the outdoor scenes. Let's take a couple more examples: look at the details in this depth map if you go back and forth between the input image and the output. Again, this is just a single image, but you can see all the details; we can even see the rings here, with a bar behind them, and it predicts that the rings are in front of the bars. The warmer the colors, the closer the objects are to the camera: the red colors are closer, and the blue colors are further away.

If we scroll a bit further down we can see how the fine-tuning works. It uses the latent encoder, the exact same encoder for getting into latent space, so both the image and the depth map are encoded into latent space with the variational autoencoder, the exact same one as Stable Diffusion, and that's why we can get these really highly detailed depth maps. This could probably be used for something like facial recognition: say you have an iPhone and you want to do face unlock, you could probably use a model like this to generate a depth map and turn it into a point cloud, then feed that into a model combined with some other features to classify whether this is the correct person or not. That could be a really good and cool use case for these specific models. Again, they are kind of slow, so they're not running in real time; it takes a couple of seconds to process an image, depending on your hardware.

So basically the image is encoded into latent space, they add sampling noise to the depth map as well, then they do concatenation and run a latent diffusion U-Net structure: this is basically just a U-Net combined with the latent encoder from Stable Diffusion. We also have the inference scheme here: we take the input image through the latent encoder, add noise for the inference, then the model diffuses it, and we get the depth map out at the end.
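On the point-cloud idea mentioned above (for example for face unlock): once you have a metric depth map and the camera intrinsics, back-projecting it into a point cloud is just the pinhole camera model. Here is a minimal numpy sketch; the intrinsics fx, fy, cx, cy are placeholder values, not anything from the video or from the Marigold code:

    import numpy as np

    def depth_to_point_cloud(depth, fx, fy, cx, cy):
        """Back-project a metric depth map (H x W, metres) into an N x 3 point cloud."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
        z = depth
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
        return points[points[:, 2] > 0]  # drop invalid / zero-depth pixels

    # Placeholder intrinsics for a 640x480 camera; use your own calibration in practice.
    depth = np.random.uniform(0.5, 3.0, size=(480, 640)).astype(np.float32)
    cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
    print(cloud.shape)  # (N, 3)

For a real application you would replace the random array with the predicted depth map and feed the resulting points into whatever downstream model you are using.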
Before we jump into the code, let's compare with some other methods. We have mainly been focusing on MiDaS here on the channel; it is still very good, especially if you want to run real-time applications, and they have a bunch of different models available. But Marigold, the new model we're looking at here, is significantly better, outperforming the other models on pretty much all the benchmarks. The most common ones are the first four here, I would say; KITTI is also pretty nice for outdoor scenes, autonomous cars driving around. If you look at the values, which are percentages, it is actually outperforming the other models significantly. This is a new state-of-the-art model and it can be used for a lot of different use cases, but again, it is still pretty slow. Then again, hardware and models keep getting better over time, and this is the newest research; the model came out only a couple of days ago.

Let's go over a couple of examples. You can just use the Hugging Face Space if you want to run it and download the images directly without setting up any code. We can also take a look at the GitHub repository, which shows how to run it locally; it doesn't really take too much time, you just have to set it up on either Ubuntu, macOS, Windows, and so on. They have some installation guides: if you're on Windows, you actually need WSL and then you also need to install CUDA support for it, so it's a bit harder to install there, but on macOS or Ubuntu it is pretty easy. Or you can just use the Google Colab notebook, as I'm going to show you in a second.

Here we can take one of the examples we already saw on their website. They have this really nice Space that you can go in and use directly. You can also drag and drop your own image: just drop it in on the left and it will run the prediction. There's a slider so you can go back and forth between the depth map and the colored depth map, and you can download them, both as 16-bit and as floating-point 32, depending on what you want to use in your application. We can download the image, view it, and use it in our own applications. If you want to use the code directly, you can go into the Files tab and look at the app file; that is basically how this Hugging Face Space is set up, so you can strip out the demo part and use the code directly in your own applications.

So now we have covered the examples and how to run some quick inference and try out the model in the Hugging Face Space. They also have a whole Colab notebook that you can just run through. They link the project website that we looked at, the GitHub repository, the paper, the Hugging Face Space, the Hugging Face model, and also the license. With the Google Colab notebook you can run it directly: you can either upload your own images or use the sample images, it will display the input images, and then you run inference at the end and download the results.
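If you would rather call the model from your own Python script than through the Space or the Colab notebook, a minimal sketch could look like the following. This assumes a recent diffusers release that ships a Marigold depth pipeline; the checkpoint name, the example image URL, and the helper calls follow the diffusers documentation rather than the exact repository code shown in the video, so treat it as a starting point:

    import torch
    import diffusers

    # Assumes a diffusers version that includes MarigoldDepthPipeline and that the
    # "prs-eth/marigold-depth-lcm-v1-0" checkpoint is available on the Hugging Face Hub.
    pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
        "prs-eth/marigold-depth-lcm-v1-0",
        variant="fp16",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Example image from the Marigold project page (any RGB image works).
    image = diffusers.utils.load_image(
        "https://marigoldmonodepth.github.io/images/einstein.jpg"
    )
    result = pipe(image)

    # Save a colored visualization and a 16-bit depth PNG, similar to the Space's downloads.
    vis = pipe.image_processor.visualize_depth(result.prediction)
    vis[0].save("einstein_depth_colored.png")

    depth_16bit = pipe.image_processor.export_depth_to_16bit_png(result.prediction)
    depth_16bit[0].save("einstein_depth_16bit.png")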
But we can also run it locally, as shown in the GitHub repository. I already did all the setup; it was actually pretty straightforward, and I'm on Windows right now, which is a bit more involved than just running it on macOS or Ubuntu. There is an installation guide for WSL, which is basically how we can run Ubuntu under the hood on Windows. You can go into the Microsoft Store, type in Ubuntu, and download the Ubuntu distribution that you want to use; you can see that I have it installed already. You can choose the version you want, but I'd just go with 22.04 in this example. You can follow the installation guide for WSL; it is straightforward, it's just from Microsoft, so there's not really a lot to it, and then you download the Ubuntu app from the Microsoft Store.

After that we need to install CUDA support for WSL, and NVIDIA has a really simple guide for that. First of all we need to install the NVIDIA GPU driver, so we go to the downloads page from NVIDIA. After that we can install WSL 2: we just run the install and update commands in a Windows Terminal, a Command Prompt, PowerShell, or whatever you prefer. I already have it installed, but this is pretty much everything you need to do: run wsl --install, and after that wsl --update. You can see that Ubuntu is already installed from the Microsoft Store, so that's everything you need there. For the driver, go to the download drivers page on NVIDIA's website; I'm on an RTX 4090, so Game Ready Driver, search, and then you can download it directly from the website. Now we can just run WSL from the Windows command line, and the default distribution is Ubuntu.

Then we need to install CUDA support for WSL 2. First you need to create a username and a password when you go into Ubuntu: just open up your Ubuntu terminal (you'll find it over here on the left), and now we're actually inside Ubuntu. Then you follow each of the individual steps, and you get option one or option two for installing the CUDA toolkit for WSL-Ubuntu. I used the first option, which I found to be the easiest: you download it and then copy-paste each of the individual commands into your Ubuntu terminal, and that installs the CUDA toolkit for WSL Ubuntu. After that we pretty much have everything and can run the model locally.

So let's go back. We need the drivers plus WSL and CUDA support, because this is running Linux under the hood, and that's why it's easier on native Ubuntu or macOS. We also need Python 3.10; they have tested it on an M1 MacBook and on RTX 3080 and 4090 cards, and you can also use Mamba if you want. You need to create an environment using Mamba, which can be installed together with Miniforge. I also did that step; it's not too complicated: you run the curl command and then bash the installer, because we're now inside a Linux environment in the Ubuntu terminal. Two commands and you have that installed as well.
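Before moving on to the environment itself, a quick sanity check that the driver, WSL 2, and CUDA toolkit setup above actually works is to ask PyTorch whether it can see the GPU from inside the Ubuntu terminal (this assumes PyTorch is installed in whatever Python environment you run it from):

    import torch

    # Should print True and the name of your GPU (e.g. an RTX 4090) if the
    # WSL 2 + CUDA setup worked; False usually means a driver or toolkit issue.
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))
        print(torch.version.cuda)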
Now we go back and create an environment with Mamba. You call mamba env create for Marigold, so the environment is created from the yml file with all the dependencies and everything that needs to be installed, and then we can activate the Marigold environment. Let me just walk you through that: open up the Ubuntu command prompt again, run the activate command for marigold, and now you can see that we have activated the conda environment.

Now we can test it on our own images or just run inference. If you don't have any images to test on and you just want to check that the model works, you can call the bash script that downloads some sample data, which is the exact same data as in the Google Colab notebook. I've already done that, so I can go straight to running the Python file. Let's take a look at that, because that's actually the only thing we need: a couple of installation steps, setting up the Ubuntu distribution if you're on Windows, and then this run file takes care of everything. It loads the model, sets up the directories, picks the device (MPS, CPU, or CUDA), does a forward pass, extracts the depth map, and converts it depending on whether you're using float16 or float32, all from the pretrained Marigold checkpoint. This is the exact same code as in the GitHub repository and the Google Colab notebook; it produces the depth prediction and the colored depth prediction and saves them to a number of different directories. It's also pretty much what the Hugging Face Space does, just with the Gradio demo wrapped on top.

If we go back here and clear all of this, we should be able to run it. That was the wrong window I opened. So here we have Marigold: run python run.py with the input RGB directory set to the in-the-wild example, and you also need to specify the output directory. When we run this we just get an error, because we forgot to cd into the directory, the Marigold repository that I cloned before. Now we should be able to run the command: python run.py with the input directory and the output directory, the device is CUDA, it found eight images, it's loading the pipeline components and estimating depth. You can see we have a progress bar showing that we're going to process eight images, with the inference batches, and we'll also see the diffusion steps in a second. Depending on your hardware this might take some time, probably around 10 seconds per iteration, so a couple of minutes in total. Some of the images in the example dataset are also very large, 4K images, so those take significantly longer than lower-resolution images. You can try it out with your own images: just put them in the input directory or specify the directory path, and for the output you can create an empty folder; the input directory can be any arbitrary folder you choose.

So here you can see it's doing all the inference. To access the directory and the files in the Linux environment, open the File Explorer, go into Linux, then Ubuntu, and you should be able to locate it. Inside the home directory we have the Marigold folder that we cloned, with the input directory, which is where you put your own images (these are the same example images you saw in the Hugging Face Space, the GitHub repository, and the Google Colab notebook), and also the outputs: the black-and-white depth maps, the colored ones, and also the numpy arrays if you want to use those directly.
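If you want to work with those numpy arrays yourself instead of the saved images, they can be loaded and converted with a few lines. The file path below is illustrative (adjust it to wherever the run script actually wrote its outputs), and the Spectral colormap is simply chosen here to give a similar warm-near, cool-far look to the demo's colored maps:

    import numpy as np
    from PIL import Image
    from matplotlib import colormaps

    # Illustrative path; point this at one of the .npy files in your output directory.
    depth = np.load("output/in-the-wild_example/depth_npy/example_pred.npy")  # H x W float array

    # Normalize to [0, 1] before visualizing.
    d_min, d_max = float(depth.min()), float(depth.max())
    depth_norm = (depth - d_min) / (d_max - d_min + 1e-8)

    # 16-bit grayscale PNG, like the downloadable depth maps.
    Image.fromarray((depth_norm * 65535.0).astype(np.uint16)).save("depth_16bit.png")

    # Colored depth map: small normalized depth (near) maps to red, large (far) to blue.
    colored = (colormaps["Spectral"](depth_norm)[..., :3] * 255).astype(np.uint8)
    Image.fromarray(colored).save("depth_colored.png")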
Let's go and take a look at the colored ones. This will run for a couple of seconds, or a couple of minutes, but you can see the high level of detail in the output depth maps. Let's go to the next one; this is the cat again, and even if you zoom in you can see this really nice, smooth gradient throughout the whole depth map. This one is also pretty cool: the house, with nice details around it, and the ferris wheel. These are some really nice depth maps; you can even see the cat behind the grid in the outdoor one. And here you can see it's estimating depth over the iterations, because I've already run this example. It takes around 80 seconds per iteration, and I'm running this on an RTX 4090 graphics card, so this is definitely not going to run in real time; there is a lot of processing going on, and a lot of resources would be needed to run these models in real time if you want to use them in real-time applications. So again, MiDaS is still a pretty good trade-off between accuracy and speed if you want something like that. But if you just want really high detail, maybe for facial recognition where you don't really care about speed, or if you want to generate 3D animated images, or create whole environments based on a single image or a single prompt, then this model is really good for that. When you look at models like text-to-3D or text-to-room, which basically take text or an image and predict it into 3D space, they are using models like this under the hood.

So that's how we can run it. We have been over a couple of examples, both the one from DeepMind and then Marigold; we looked at the model architecture and the results, did some comparisons, and went through the GitHub repository and how to run it in the Hugging Face Space, which is really good for just testing out the model. You can download the model, extract the code, and use it in your own applications and projects. That's pretty much everything we have covered, and now you can use it on your own, even locally. I hope you have learned a ton and that you can use this model; it is definitely state of the art, and the details are very nice. I hope to see you guys in one of the upcoming videos; until then, happy learning.

If you want to take your machine learning, AI, and computer vision skills to the next level, I also have my courses on the website that you can go check out. We have everything from object detection with deployment to object tracking with YOLOv8, and we also have Transformers and segmentation courses. The most interesting one for me is definitely the research paper implementation course, where we learn how to actually implement research paper architectures: we have the architecture on one side and the code on the other side.
Info
Channel: Nicolai Nielsen
Views: 2,203
Keywords: computer vision, distance estimation, opencv, python, opencv python, python opencv, depth maps opencv, depth maps python, disparity map opencv, opencv depth maps, opencv disparity maps, computer vision python, disparity computer vision, depth estimation opencv, distance estimation opencv, computer stereo vision, vision, monocular depth estimation, monocular camera distance, monocular camera opencv, depth opencv, python depth maps, marigold, stablediffusion, zero-shot
Id: Xjs4RQpViO4
Length: 18min 38sec (1118 seconds)
Published: Thu Jan 18 2024