ESP32 stereo camera for object detection, recognition and distance estimation

Video Statistics and Information

Captions
So I've made a stereo camera for object detection and distance estimation using two ESP32-CAMs, and in this video I'll show you exactly how I did that.

Hey everyone. Today I'm making a stereo camera with the ESP32-CAM. To make a stereo camera I basically need two cameras, and with two cameras I can reconstruct the three-dimensional world from a two-dimensional projection using geometry. What I do need is to have the two cameras constrained to a grid, fixed at a known distance from each other while I take photographs, because I need to be able to measure that distance for the geometry, and it's best if they're at the same height. So it would be good to have some kind of grid structure to connect the two, and the perfect thing for this, which gives us a natural grid structure, is a piece of perfboard.

Here's what I prepared earlier. All I've done is solder these female header pins to the perfboard and break out a couple of wires from each side for the power connection. Then I've added a battery shield with an 18650 battery, cut open a USB cable, and soldered a couple of jumper leads onto it, which I've connected to the power leads. So now I can just plug a right camera and a left camera in, and there we go: a stereo camera for the ESP32. I'm going to use this for object detection and distance estimation, running it through a bit of Python code that I've written. I'll show you how I set all that up in the next scenes, but first I'm going to discuss a little of the theory behind a stereo camera in the context of computer vision.

Before I go through the code setup and the geometry of the stereo camera, here's a little demonstration of how it works. I've got it mounted on a Lego stand that I made, with a Lego person about 20 centimeters and a little toy car about 30 centimeters from the camera. On the left side I've got the left camera, which is called Left Eye, and on the right side the right camera, called Right Eye, and these are streaming through Python. When I press P it does the inference, which takes about five or six seconds on my CPU, and a little longer while I'm running OBS, which is what I'm using to record the screen. It says the car is 30.4 centimeters away, which is fairly accurate because I've got it at about 30 centimeters. It thinks the Lego person is 18.6 centimeters away, and it detects that it's a person, and that the other object is a car; the person really is at about 20 centimeters. The computer monitor, which it thinks is a TV (close enough), it puts at about 57.3 centimeters, and it's actually about 56 centimeters away. So all in all that's working pretty well. It also thinks a little mark on the wall, a chip in the paint, is a bird, 66.5 centimeters away, which as a distance is actually fairly accurate.

Okay, so now I'm going to take you through the geometry of the stereo camera and the code. Hey everyone. I've got the two cameras running the CameraWebServer example that comes with the Arduino IDE 1.8.
I've got the left camera on the left side and the right camera on the right side, mounted on this piece of Lego just to keep them steady. I'll start streaming and we'll see what that looks like. Each camera has this Vanish bottle in view; I'll make the image a bit bigger and change the resolution so you can see it better. Notice that in the right image the bottle is a little further to the left, and in the left image it's a little further to the right. Because the bottle is far away the effect isn't very pronounced, but as we move it closer it gets more obvious: when it's quite close, in the right image it's way over to the left and in the left image it's way over to the right, and we can see that quite easily. We can use this shift to work out how far away the object is, and that's how we reconstruct the 3D world.

The geometry is not that difficult; it's just solving a bunch of equations, and I'll go through that with you in a few minutes. The challenging part, at least from a computer vision point of view, is matching the features between the left image and the right image. How do we know this part of the right image matches this part of the left image? For us it's fairly obvious, because we have a very complex visual system, but for a computer it's a little more difficult.

One of the things I'm doing in my spare time is a graduate diploma in AI. In my day job I work in a bank; the course is just at night, to keep my mind busy and to help with my robotics projects. One of the subjects we're doing in the course is computer vision, and the material just blows me away; it's amazing how advanced some of it is. I did my PhD in image processing about 20 years ago, before I worked in a bank, and what they're doing now we could only have dreamed about back then. We've come so far that anything predating AlexNet in 2012 is treated almost like Newtonian mechanics.
One of the things we went over in some detail in this computer vision class was a pedestrian tracking algorithm and how to implement it in Python. Basically it segments people and tracks them walking across a busy station. That got me thinking: stereo vision is just the same as tracking. We're tracking objects from the left image to the right image. So I'm going to apply this tracking algorithm to the stereo camera, but in the stereo case we've got a few more constraints: whereas pedestrians can move in any direction, with the stereo camera an object can only move in one direction between the two images, and it doesn't move up and down, because I've got the cameras fixed at the same level.

Okay, now I'm going to go over the geometry of the stereo camera very briefly. Let's consider what happens with one camera. We've got our camera here and an object at distance D1 from the focal point, filling the whole frame, and let's say the object is D2 wide. This point here is the focal point of the camera, so this is our focal length f, and the total distance from the camera is f + D1. Now let's call the image width x1. In the literature a lot of the time they use similar triangles: x1 / f = D2 / D1, so D1 = f * D2 / x1. The issue with doing this with a digital camera is that x1 is in pixels and D2 is in centimeters. There are ways to reconcile the units, but what I prefer to do, instead of using x1 and f, is to use the half-angle of view theta. Then tan(theta) = (D2 / 2) / D1, and therefore D1 = D2 / (2 * tan(theta)).

Now say we've got an object in the frame which is D3 across. By similar triangles, D3 / D2 = p / n, where n is the total number of pixels across the image and p is the pixel width of the object; say the object is a car that is p pixels across in an image n pixels across. Then D2 = n * D3 / p, and substituting back into D1:

distance = n * D3 / (2 * p * tan(theta)) + f

where f is the focal length, p is the pixel count and D3 is the real width of the object. So if we know the width of the object we can measure how far away it is. But how can we measure a known real-world distance in terms of pixels in the image? One way is to add a second camera. If we add another camera a distance DC away, we can match points between the two images, and because the cameras are DC apart, the pixel shift of a matched point corresponds to the known distance DC. So with a stereo pair we get:

distance = n * DC / (2 * p * tan(theta)) + f

where n is the image width in pixels, DC is the distance between the cameras, and p is the disparity: the number of pixels a matched point moves between one image and the other. Now tan(theta), f and DC are constants, and p we get by counting. The way we can find tan(theta) and f is to use known distances: if we measure a distance with a ruler and then take a photograph, we can back out tan(theta) and f. So:

A1 = n * DC / (2 * p1 * tan(theta)) + f
A2 = n * DC / (2 * p2 * tan(theta)) + f

where p1 and p2 we get just by counting pixels. We can solve this pair for tan(theta) and f, which is what I've done, and I'll take you through the notebook that uses that.
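As a sanity check, here's a minimal Python sketch of the distance formula and the two-distance back-solve just described. The function names and all the numbers are illustrative, not the video's code:

```python
# Minimal sketch of the stereo distance formula and its calibration.

def stereo_distance(p, n, d_c, tan_theta, f):
    """Distance to an object from its disparity p (pixels), given image
    width n (pixels), camera baseline d_c (cm), and the calibrated
    tan(theta) and focal length f (cm)."""
    return n * d_c / (2 * p * tan_theta) + f

def calibrate(a1, p1, a2, p2, n, d_c):
    """Back out tan(theta) and f from two known distances a1, a2 (cm)
    and their measured disparities p1, p2 (pixels).

    Substituting k = n*d_c / (2*tan_theta) turns the pair of equations
    a_i = k/p_i + f into a system that is linear in k and f."""
    k = (a1 - a2) / (1.0 / p1 - 1.0 / p2)
    f = a1 - k / p1
    tan_theta = n * d_c / (2.0 * k)
    return tan_theta, f

# Made-up example: targets at 30 cm and 50 cm with disparities of
# 210 px and 120 px, 1600 px wide images, cameras 5 cm apart.
tan_theta, f = calibrate(a1=30.0, p1=210, a2=50.0, p2=120, n=1600, d_c=5.0)
print(tan_theta, f)
print(stereo_distance(p=210, n=1600, d_c=5.0, tan_theta=tan_theta, f=f))  # ~30 cm
```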
We'll go through the notebook next, but the challenging part of this isn't the geometry, because all these equations are solved for us and there's lots of documentation on them. The challenging part is matching this point in one image with this point in the other, and then counting the pixels. Once we've done that, we can measure our distance.

So I've got the bottle 30 centimeters from the camera; I know the distance, so I'm going to take a left image and a right image and use them to back-solve for theta and the focal length of the camera. Copy image. Now I'm going to move the bottle a little further from the camera, to about 50 centimeters, and save two more images. I need two sets of images because I've got two parameters to solve for. Copy image. Now that I've got those two sets of images, I'll put them through my Python demo notebook and we'll work out theta and the focal length of the camera.

Hey everyone. This is a demo notebook I've set up which matches objects detected in a left image to objects detected in a right image and calculates how far away each object is. It just takes in two images; it doesn't connect to the camera yet. I've set it up basically to calculate the parameters of the model, the tan(theta) and the focal length of the camera, and I'll walk through it because it's fairly self-explanatory. I'm using the Mask R-CNN model from the PyTorch framework. This model does object segmentation and labeling for 80 different classes, and it produces a mask segmentation, basically a pixel mask over each object, which I thought I could use because I can tweak it a little later on. I've also put links to some documentation here. PyTorch does provide some visualization utilities, but I've used a lot of other, lower-level functions for visualization, because they let me get a bit deeper under the hood.

The first thing I do is import my libraries, as you do with any Python code, and then I define all my auxiliary functions. load_image just loads an image from a file. preprocess_image is quite important because it converts the image into a tensor, which it needs to be for input into the segmentation model. display_image just displays an image, and display_image_pair displays two images side by side, which is important because with a stereo camera I need to show the left image and the right image together.
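For reference, here's roughly what that model setup and preprocessing look like with torchvision's Mask R-CNN. The helper name and file name are illustrative, not necessarily the notebook's exact code:

```python
import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

# Pre-trained Mask R-CNN: boxes, labels, confidence scores and per-pixel
# masks for the (roughly 80) COCO classes.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()  # inference mode

def preprocess_image(path):
    # load an image file and convert it to the tensor the model expects
    return F.to_tensor(Image.open(path).convert("RGB"))

with torch.no_grad():
    detections = model([preprocess_image("left_eye_50.jpg")])  # placeholder file
# each detection dict holds 'boxes', 'labels', 'scores' and 'masks'
print(detections[0]["boxes"])
```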
Then I read in my images. I've got a left-eye image and a right-eye image; one set was taken at 50 centimeters and the other at 30 centimeters. To get the 30-centimeter set you just change this value to 30; as written, it loads the 50-centimeter set, where the bottle was 50 centimeters away, the one I took before. This loads and plots the images: this is the left image and this is the right image, and there are three objects in view: a computer screen, a bottle and a tape measure.

Then I get the model, which is just the Mask R-CNN model. This defines the model, and model.eval() means I'm using the model for inference. The get_detections function passes the images into the model for detection and returns four objects. det0 holds the bounding boxes for the left image and det1 the bounding boxes for the right image; labels are the class labels, scores are the confidences, and masks are the segmentation masks. I do all that here, and then I can see what I've got: it's found a bottle, a TV and a cell phone in the left image, and a bottle, a cell phone and a TV in the right image. So it thinks the tape measure is a cell phone and the computer monitor is a TV, which is close enough for the monitor.

These next functions are for visualization. draw_detections draws the bounding boxes, annotate_class annotates the class label, and draw_instance_segmentation_mask puts a semi-transparent segmentation mask over the image. I've drawn the boxes and annotations here, and then here I've put the segmentation mask over the image; this is all in the code and you can look through it.

Then this function, get_horizontal_distance_centers, returns the horizontal distance between the center of each object in one image and the center of every object in the other. It returns a matrix whose (i, j) entry is the distance between the centers of the i-th object in the left image and the j-th object in the right image, and I use that for matching a bit later on, as part of my cost function.

The cost function consists of several properties, or differences, between objects in the two images (see the sketch after this section). The first thing we consider is the vertical move of the bounding box. Because the two cameras are level, we don't expect any vertical movement between an object in the left image and the same object in the right image, so the cost function penalizes vertical moves quite heavily. The next feature is the area of the bounding box: it's the same object, so we expect roughly the same area. We do notice that the monitor, because it's further off the edge of the frame in the right image than in the left, has a different area, so that will affect the cost a little, but the other features still pull it to the correct match. The next feature is the horizontal move: between the left image and the right image an object can only shift in one direction, so we calculate the horizontal move and multiply moves in the wrong direction by a factor of 10, which means objects hopefully won't get matched if they've moved the wrong way. And then the other thing we do is add a penalty if the two objects have different class labels.
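Here is a minimal sketch of a cost function combining those features. The weights, and the sign convention for the horizontal move, are illustrative assumptions rather than the video's exact values:

```python
import numpy as np

def build_cost_matrix(boxes_l, labels_l, boxes_r, labels_r,
                      w_vert=10.0, wrong_way_factor=10.0, class_penalty=1e4):
    """Cost of matching left object i to right object j.
    Boxes are (x1, y1, x2, y2); weights are illustrative."""
    cost = np.zeros((len(boxes_l), len(boxes_r)))
    for i, (bl, cl) in enumerate(zip(boxes_l, labels_l)):
        for j, (br, cr) in enumerate(zip(boxes_r, labels_r)):
            # 1. vertical move of the box centre: should be ~0 for level
            #    cameras, so penalise it heavily
            dy = abs((bl[1] + bl[3]) - (br[1] + br[3])) / 2.0
            # 2. difference in box area: the same object should be a
            #    similar size in both images
            da = abs((bl[2] - bl[0]) * (bl[3] - bl[1])
                     - (br[2] - br[0]) * (br[3] - br[1]))
            # 3. horizontal move of the box centre: with side-by-side
            #    cameras a matched object sits further left in the right
            #    image, so under this convention dx should be positive;
            #    moves the wrong way are multiplied by a factor of 10
            dx = ((bl[0] + bl[2]) - (br[0] + br[2])) / 2.0
            if dx < 0:
                dx = -dx * wrong_way_factor
            # 4. flat penalty for mismatched class labels
            mismatch = class_penalty if cl != cr else 0.0
            cost[i, j] = w_vert * dy + da + dx + mismatch
    return cost
```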
Then we do an optimization on this. The cost function returns a matrix where the (i, j) entry is the cost of matching the i-th object in the left image with the j-th object in the right image, and we match the objects so as to minimize the total cost, using scipy.optimize.linear_sum_assignment. The advantage of this function is that it does a global optimization. Say we have two objects in the left image that both move in the right image: assigning one pair might be locally optimal, minimizing the cost for that single pair, but it can leave a worse assignment for the remaining objects. We don't want that; we want the assignment that is better overall, and that's what linear_sum_assignment gives us. It returns two arrays: the indices for the left image and the indices for the right image. So, for example, the index-0 object in the left image goes with the index-0 object in the right image, the index-1 object in the left image goes with the index-2 object in the right image, and the index-2 object in the left image goes with the index-1 object in the right image. I use those to do the matching, and here we've got the bottle matched with the bottle, the TV with the TV, and the cell phone with the cell phone.

The next thing is to calculate the distance in centimeters. I've done the calibration, and these are the calibration formulas I showed you, calculated from the 50-centimeter and 30-centimeter image sets. Then we calculate the distances. The way we do that is to take the horizontal disparity between the top-left corners of the matched boxes and between the bottom-right corners, and use whichever corner is closest to the center of the image. Take the monitor, which is a good example because it runs off the edge of the frame: we measure the horizontal distance between the two top-left corners and between the two bottom-right corners, and keep the one closest to the center. The reason we use the corner closest to the center is precisely in case the object runs off the edge of the screen: comparing the clipped corners would basically give a horizontal disparity of zero.

Once we've found the distances we can show the image annotated with them. The TV, which is actually the monitor, comes out at 54.8 centimeters, which is fairly accurate. The bottle comes out at exactly 50 centimeters, which makes sense because I used the 50-centimeter images to calibrate the parameters of the model. And the cell phone, which is actually a tape measure, comes out at about 52 centimeters, which is about right; I think it's more like 51 centimeters, but it's fairly close. The next thing to do is to stream the images from the ESP32s directly into the Python code and do the calculations live, and I'll show you how I do that in the next scene.
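Before the streaming scene, here's a minimal example of that assignment step, with a made-up cost matrix chosen so the optimal pairing mirrors the one described above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative 3x3 cost matrix: entry (i, j) is the cost of matching
# object i in the left image to object j in the right image.
cost = np.array([
    [0.2, 6.0, 9.0],
    [7.0, 6.0, 0.4],
    [8.0, 0.3, 6.5],
])

left_idx, right_idx = linear_sum_assignment(cost)
print(left_idx)   # [0 1 2]
print(right_idx)  # [0 2 1]: left 0 <-> right 0, left 1 <-> right 2, left 2 <-> right 1
```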
Hi everyone. To get a live feed from the ESP32-CAM into Python, I'm following this project by Daniel Rossi on hackster.io. Daniel loads the CameraWebServer sketch from the Arduino 1.8 IDE onto the ESP32. This sketch might not compile if you're using PlatformIO or a later Arduino IDE such as 2.0, because some of the functionality it uses may be deprecated; if you want to build the sketch you either have to roll back your board manager or use Arduino IDE 1.8. The reason some of the functionality is deprecated is that the sketch uses some of the facial recognition code from the DL lib, so you could also just delete the face recognition code from the sketch.

There are two very important pieces of the sketch that Daniel's project relies on: the command handler and the stream handler. Now, a stream handler comes with a lot of other sketches you'll find online in various projects, so you'd think you could just load one of those and use its stream handler; but there's one very important difference that I'm going to point out. This is the sketch from the CameraWebServer, and this is a sketch I downloaded from Instructables. Both have a stream handler, but the one from Instructables sends the multipart boundary after it sends the image data, whereas the CameraWebServer one sends the boundary first. To get a stream handler like that to work with the Python code, you need to swap the line around: take this line from here and move it up to here, and then it will work; otherwise the stream will freeze when you try to use it with Python. The other thing we need is the command handler, which takes requests from Python back to the ESP32. You can copy it into another sketch, but you have to delete all the face detection code first.

Hey everyone. Now I'm going to take you through the Python notebook that takes the live feeds from the two ESP32-CAMs and does the object detection, labeling and distance estimation. I've put all the function definitions into a .py file, stereo_image_utils.py; these are the same functions I included in the notebook I used for calibrating the camera, and here I import everything I need from it. Next I define the URLs for the left camera and the right camera; you'll just have to replace those with your own. And these are the camera parameters, the focal length and the tan(theta) that I calibrated earlier. I capture the left camera using the cv2 library, just by opening the stream URL, and then I capture the right camera as well, so cap_left and cap_right are the two camera feeds. Then there's a bunch of functions which I got from Daniel Rossi's project; they just send requests back to the camera: this one changes the resolution, this one changes the JPEG quality, and this one sets the AWB (auto white balance).
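Here is a sketch of that capture-and-control setup. The URLs are placeholders for your own cameras; the port numbers and the /control?var=...&val=... query string follow the stock CameraWebServer sketch, which streams on port 81 and takes commands on port 80:

```python
import cv2
import requests

URL_LEFT = "http://192.168.1.100"   # left-eye ESP32-CAM (placeholder)
URL_RIGHT = "http://192.168.1.101"  # right-eye ESP32-CAM (placeholder)

# open the two MJPEG streams
cap_left = cv2.VideoCapture(URL_LEFT + ":81/stream")
cap_right = cv2.VideoCapture(URL_RIGHT + ":81/stream")

def set_resolution(url, index=10):
    # framesize index 10 is UXGA (1600 x 1200) in the CameraWebServer sketch
    requests.get(url + "/control?var=framesize&val={}".format(index))

def set_quality(url, value=10):
    # JPEG quality: lower value means higher quality
    requests.get(url + "/control?var=quality&val={}".format(value))

def set_awb(url, enabled=1):
    # toggle auto white balance
    requests.get(url + "/control?var=awb&val={}".format(enabled))

set_resolution(URL_LEFT)
set_resolution(URL_RIGHT)
```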
These functions use the command handler in the ESP32 sketch to communicate back to the ESP32, and you can basically rewrite them to send any data you like back to the board. That's the chief reason I wanted to use this project: it has a method for sending data back to the ESP32 as well as receiving data from it, whereas the other projects I looked at only received the camera feed and had no way of communicating back to the microprocessor.

In the main function, the first thing the notebook does is set the resolution on the ESP32s. I'm using index 10, which corresponds to UXGA, 1600 × 1200; that's the highest resolution I'm using at the moment, but you can use lower resolutions if you like, and it works pretty well with them too. The next thing it does is display the two video feeds, the left image and the right image. Then, if both frame reads succeed, it can do the inference on the images, display the detected objects, and annotate each with its label and distance estimate; that all happens here. Then come the key presses: the loop waits for a key, and if you press R you can change the resolution, Q sets the quality, and P runs the inference. It doesn't run inference on every frame it captures, only after you press P, because the inference is quite slow. Key code 27 is Escape, so pressing Escape breaks the loop. And now I'll leave you with a demonstration of the camera. Thank you for watching, and I'll see you in the next video.
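To round out the walkthrough, here's a minimal sketch of that main loop. The stream URLs are placeholders, and run_inference is a stand-in for the detection, matching and distance pipeline built in the earlier scenes, not the notebook's exact code:

```python
import cv2

# placeholder stream URLs; replace with your own cameras
cap_left = cv2.VideoCapture("http://192.168.1.100:81/stream")
cap_right = cv2.VideoCapture("http://192.168.1.101:81/stream")

def run_inference(frame_l, frame_r):
    # stand-in for detection + matching + distance estimation
    pass

while True:
    ret_l, frame_l = cap_left.read()
    ret_r, frame_r = cap_right.read()
    if ret_l and ret_r:
        cv2.imshow("Left Eye", frame_l)
        cv2.imshow("Right Eye", frame_r)

    key = cv2.waitKey(1) & 0xFF
    if key == ord('p'):
        # inference only on demand: it takes several seconds on CPU
        run_inference(frame_l, frame_r)
    elif key == 27:  # Escape
        break

cap_left.release()
cap_right.release()
cv2.destroyAllWindows()
```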
Info
Channel: Jonathan R
Views: 33,164
Keywords: esp32, esp32-cam, stereo camera, pytorch, mask-rcnn, object detection, AI
Id: CAVYHlFGpaw
Length: 35min 50sec (2150 seconds)
Published: Wed Jan 18 2023