OpenCV Python Course - Learn Computer Vision and AI

Captions
OpenCV is a popular Python library for real-time computer vision. This course comes directly from the creators of OpenCV and is the perfect course for beginners. At the end of the course there is an interview with the CEO of OpenCV, Dr. Satya Mallick, who talks about how to get a job in computer vision and AI.

In 2005, for the first time in human history, an autonomous vehicle traveled 132 miles through the Mojave Desert to win the two-million-dollar DARPA Grand Challenge. The name of the car was Stanley, and it used a computer vision library called OpenCV.

Hello everybody, I'm Satya Mallick, and I'm thrilled to help you get started with OpenCV. Built over 20 years, OpenCV is the most extensive computer vision library in the world. It is downloaded between one and two million times a week, and it contains over 2,500 optimized algorithms for building computer vision and AI applications. OpenCV is the first library you need to learn.

Welcome to this free Getting Started series. We designed it for absolute beginners; all you need is an intermediate level of programming knowledge and Python. OpenCV is vast, and it is not possible to cover all aspects of the library in a short period of time. This series is your first step and will help you get started. The material will be covered using Jupyter notebooks, and we will go over every notebook in a video to help you understand the code.

The first few notebooks are all about basics: what images and videos are, how we represent them inside OpenCV, and which OpenCV functions we use to read, write, and manipulate photos and videos. Next we will go over image enhancement and filtering. Our objective in this series is also to give you a glimpse into applications you can build using OpenCV functions: we will go over different image transformations and show you how to align two images, and the same idea can be modified slightly to create beautiful panoramas. Next we will dip our toes into computational photography and create high dynamic range images by combining photos taken at different exposures into one single, beautifully lit photo. OpenCV also implements many classical machine learning algorithms and has an entire module dedicated to deep learning inference; we will learn how to implement face detection and object tracking. Finally, we will wrap up the series by learning how to use the deep learning module for object detection and pose estimation. It's going to be very interesting.

That's all we will cover in this Getting Started series. After completing it, I encourage you to take a look at the free content at opencv.org and LearnOpenCV.com, and when you're ready for structured learning and seeking mastery in computer vision and AI, check out our courses at opencv.org. I wish you all the best on your learning path, but today, let's get started with OpenCV.

Hi everyone. In this introductory video we're going to cover several basic image processing concepts related to working with images. All of this material is essential, and we'll be using Python, and specifically OpenCV, to demonstrate, for example, how to read an image and display it, how images are represented by data, the differences between grayscale and color images (in particular, what it means to have multiple channels in color images), and finally how to save images.

A couple of things before we get started: we'll be using Jupyter notebooks for most of these demonstrations, mainly because they are a very convenient way to display intermediate results,
and they have some very nice documentation features that make it easy to present code and supporting material. In some cases we might use Python scripts for specific applications, but mainly we'll be using Jupyter notebooks in this series.

To get started, the first section of the notebook imports some required libraries. One thing to mention is that when you use Jupyter notebooks you want to include the %matplotlib inline specification so that images can be displayed directly in the notebook. We'll also be using the IPython Image function, which allows us to render images directly in the notebook.

In the first example below, we use that function to display two black-and-white checkerboard images. The first is 18 by 18 pixels and the second is 84 by 84 pixels, and if we read in those files and display them directly in the notebook, their actual size is faithfully rendered, so the 18-by-18-pixel image is quite small. One reason we start with this is that generally we'll be using OpenCV to read an image, store that data in memory as a NumPy array, and then work with that array to manipulate, save, or display the image. In those cases we're really displaying a mathematical representation of the image, not necessarily a faithful rendering of it in the browser. That will become clearer as we proceed through the notebook, but we wanted to draw that distinction.

Now let's look at the first OpenCV function we'll use. When we introduce a new function, the notebooks include a documentation section that summarizes the function syntax, describes the required and optional arguments, and provides links to the OpenCV documentation. We won't go through these sections in great detail, but we will refer to them when we describe the code.

So let's look at how we use imread. In the first example we read in the smaller checkerboard image. The first argument is the file name for the image, which can be either a relative or an absolute path. There is also an optional second argument, and here we're passing zero; looking back at the documentation section, zero corresponds to a flag specifying that we want to read the image in as a grayscale image. We'll talk more about this further below, but there is an optional second flag that we'll make use of quite a bit. The return value from imread is a 2D NumPy array representing the image, and we can print it with the print command: it has 18 rows and 18 columns, each value is the intensity of a pixel, and the values are in the range 0 to 255 because the image is represented by unsigned 8-bit integers.
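To make that concrete, here is a minimal sketch of the read-and-inspect step in Python; the file name is illustrative, so substitute the path to your own checkerboard image.

import cv2

# Read the image as grayscale (flag 0); the filename here is only illustrative.
cb_img = cv2.imread("checkerboard_18x18.png", 0)

# imread returns a 2D NumPy array of pixel intensities (0-255 for uint8).
print(cb_img)
print("Image size:", cb_img.shape)   # e.g. (18, 18)
print("Data type:", cb_img.dtype)    # uint8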
Let's scroll down a bit further and print some information associated with the image. In this code block we use the shape and dtype attributes of the NumPy array to print the image size and data type: 18 by 18 and unsigned 8-bit integer, as we pointed out.

At this point let's talk about how to display the image. Here we use the Matplotlib function imshow, passing it the NumPy array that represents the image, and you'll see the plot below. Notice that this is a plot, a mathematical representation of the image; it's not 18 pixels wide on the screen, it's just a plot representing 18 pixels, and the axes correspond to 18 by 18. Also notice that it's not black and white as we expected, but yellow and dark navy blue. The reason is that Matplotlib uses color maps to represent image data, and in this case it's using a color map other than grayscale. If we want to display this as an actual grayscale image we need to set the color map, so we call imshow with the optional argument cmap='gray', and then we get what we expected: a black-and-white rendering.

Let's look at another example. Here we read in another checkerboard image of the same size, but with pixel intensities that range from 0 to 255 with lots of gray values in between. You can see that reflected in the matrix itself, and when we plot the image you can see the grayscale representation of those middle-tone values. It's not a very interesting image, but it demonstrates that a grayscale image can have values between 0 and 255, represented on a continuum from pure black to pure white.

Now we're ready to talk about color images, so let's scroll down to the next section, where we read in a high-resolution image of the Coca-Cola logo. First we use the IPython Image function to render it in the browser. Then we use OpenCV imread to read it in and store the data in a matrix called coke_img. Notice that when we read it in we specify an optional second argument of one, so the image is read in as a color image. This image has three channels, one each for red, green, and blue, and when we print the size and data type of the matrix we get 700 by 700 by 3, where 3 is the number of channels. But notice that when we display the image using Matplotlib imshow, it comes up blue, which is unexpected. The reason is that OpenCV uses a different convention for storing the channels than most other applications: for color images OpenCV uses a channel order of BGR rather than RGB. Matplotlib expects RGB, but OpenCV stored the data as BGR, so the logo looks blue unless we swap the order of the channels. That's what the next bit of code does: it takes the coke image array and swaps the order of the channels.
The slicing syntax reverses the order of the last axis of the array, and when we display the result it comes up red and white as we expected. That's just something to be aware of: whenever you're working with OpenCV you need to keep the channel-order convention in mind, and we'll see it come up again and again in these notebooks.

Now let's take a look at splitting and merging color channels. Splitting and merging are pretty straightforward in OpenCV, and you can refer to the documentation link for more details, but here's the example. On the first line we use imread to read in a color image, and notice that for the optional argument we specify the named flag (as opposed to the value one) to indicate that we want a color image. We store the result in img_NZ_bgr; we specifically used "bgr" in the variable name as a reminder of what it contains, since we read it in with OpenCV. On the next line we call the OpenCV split function to take that multi-channel image and split it into its components b, g, and r, and each of these variables is a 2D NumPy array containing the pixel intensities for that color channel. In the next section of code we use imshow to display each of those channels as a grayscale map, and the last bit of code takes the individual channels and uses the merge function to merge them back into what should be the original image, which we call img_merged and display as well.

Looking at the images below, we have the red, green, and blue grayscale representations of each channel and the merged output on the right. It's worth a brief mention that you can build some intuition by looking at the original image. For example, the lake is a kind of turquoise blue: it has some green and blue in it for sure and probably very little red. If you go back to the channels, the red channel for the portion of the lake is low, meaning there's not much red component in that color, so it's darker, closer to zero. The green and blue channels are fairly high intensity for their respective colors, indicating that the color of the water has very little red in it but quite a bit of green and definitely quite a bit of blue. That's the interpretation.

Next let's talk about another OpenCV function, cvtColor, which lets you convert between color spaces. The syntax is: you supply a source image and a code indicating the type of conversion you want, and the result is an image in a different color space. This is easiest to explain with an example, so we have a very simple one here, converting from BGR to RGB. We call cvtColor, pass it the BGR representation of the image we read in above, and specify the code for BGR to RGB. This is simply a flag telling OpenCV that we're supplying a BGR image and want it converted to RGB. We store the result in a new variable and use imshow to display it, and this is what we expect: the original image, displayed correctly.
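Here is a rough sketch of the channel-handling calls just described; the file name is illustrative, and the variable names follow the description above rather than the exact notebook code.

import cv2
import matplotlib.pyplot as plt

# Read a color image; cv2.IMREAD_COLOR is equivalent to the flag value 1.
img_NZ_bgr = cv2.imread("New_Zealand_Lake.jpg", cv2.IMREAD_COLOR)  # illustrative filename

# Split into individual channels and merge them back.
b, g, r = cv2.split(img_NZ_bgr)
img_merged = cv2.merge((b, g, r))

# cvtColor performs the BGR-to-RGB conversion via a conversion code;
# reversing the last axis with [:, :, ::-1] achieves the same thing.
img_NZ_rgb = cv2.cvtColor(img_NZ_bgr, cv2.COLOR_BGR2RGB)

# Show each channel as a grayscale map alongside the merged result.
plt.figure(figsize=(18, 4))
plt.subplot(141); plt.imshow(r, cmap="gray"); plt.title("Red channel")
plt.subplot(142); plt.imshow(g, cmap="gray"); plt.title("Green channel")
plt.subplot(143); plt.imshow(b, cmap="gray"); plt.title("Blue channel")
plt.subplot(144); plt.imshow(img_merged[:, :, ::-1]); plt.title("Merged (RGB)")
plt.show()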
If you look at the documentation for cvtColor, you'll see that there are all kinds of conversion codes that let you convert between color spaces, and that's the subject of the next section of the notebook, where we convert the image to a different color space. On the first line we convert the BGR representation of the image to an HSV representation. HSV stands for hue, saturation, and value, and it's another color space that's often used in image processing and computer vision. We store the result in a variable named img_hsv. Now we can split the channels just as we did above and get the h, s, and v components, and just like the example above we plot all four images: the three channels and the original image. Hue represents the color itself, saturation represents the intensity of the color, and value represents how light or dark the color is. You can think of saturation as the difference between a pure red and a dull red, and value as how light or dark the color is irrespective of the color itself, while hue is the representation of the actual color.

As an example, in the next section we modify one of the channels. On the first line of code we take the current hue channel and add 10 to it, shifting where we are on the color spectrum. Then we merge that new channel with the original s and v channels to get a merged image, and we use cvtColor to convert it from HSV to RGB. So we've modified one of the channels, merged it, and converted it, and then we use imshow to display each of the color channels and the modified merged image. You can see that the modified image looks different from the original, because we've changed the hue.

That's a brief introduction to color channels and color spaces, and now we're ready to move on to the final section of this notebook, which has to do with saving images and writing them to disk. There's a function in OpenCV, similar to imread, called imwrite, and it's very simple to use: you pass in the file name you want to save the file as, and then the image itself. In the example we call imwrite and give it a file name, New_Zealand_Lake_SAVED, to indicate that we're saving it from our notebook, and we pass it the image we've been working with, in BGR format. On the very next line we use the IPython Image function to read that file back in and render it right here in the browser. The last thing we conclude with is taking this image and using imread to read it back in, both as a color image and as a grayscale image, and printing both arrays: the one read in as a color image has three channels, and the one read in as grayscale has just a single channel.

We covered quite a bit of material in this notebook, and we hope it's been a good reference for you. There is one other thing we'd like to discuss, but not from within the notebook, so we're going to move over to a development environment and spend a few minutes talking about the differences between Matplotlib's imshow and OpenCV's imshow. We'll continue there in just a second.
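As a compact sketch of the HSV manipulation and the save step described above (the file names, the .png extension, and the hue offset behavior are illustrative assumptions; note that uint8 arithmetic wraps around, which is acceptable for the cyclic hue channel):

import cv2
import matplotlib.pyplot as plt

img_bgr = cv2.imread("New_Zealand_Lake.jpg", cv2.IMREAD_COLOR)  # illustrative filename

# Convert to HSV and split the channels.
img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
h, s, v = cv2.split(img_hsv)

# Shift the hue channel by 10 to move along the color spectrum,
# then merge and convert back to RGB for display.
h_new = h + 10
img_hsv_modified = cv2.merge((h_new, s, v))
img_rgb_modified = cv2.cvtColor(img_hsv_modified, cv2.COLOR_HSV2RGB)

plt.imshow(img_rgb_modified)
plt.title("Hue shifted by 10")
plt.show()

# Save the BGR image to disk, then read it back as grayscale.
cv2.imwrite("New_Zealand_Lake_SAVED.png", img_bgr)
img_reloaded = cv2.imread("New_Zealand_Lake_SAVED.png", cv2.IMREAD_GRAYSCALE)
print(img_reloaded.shape)  # single channel when read back as grayscale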
Okay, here we are in the development environment, and we've put a script together so that we can demonstrate some of the differences between the Matplotlib version of imshow and the OpenCV version of imshow. We start by reading in two color images: the first is a color checkerboard image and the second is the Coca-Cola logo. The very first block of code uses the Matplotlib version of imshow to display the checkerboard image, and we've seen examples of that before in the notebook. The rest of the code is centered on the OpenCV version of imshow, and there's some extra code required to use it properly. The first thing we do is create a named window, and then we call the OpenCV imshow, passing it the named window and the image we want to display in it. Notice that right after this we call the waitKey function, whose argument is the number of milliseconds the window will be displayed, eight seconds in this case. If we didn't call waitKey, the imshow window would be displayed indefinitely and there would be no way to exit it, so waitKey is meant to be used in conjunction with imshow. We then perform the same set of actions with a different image, the coke image in this case, just so you can see the dynamic behavior. Finally, in the third section, we do the same thing, displaying the checkerboard image again, but this time we pass 0 to waitKey: rather than displaying the image for eight seconds, 0 means it will be displayed indefinitely until any key on the keyboard is struck. This lets you wait for user input rather than a fixed amount of time. There's one other option down here that you can use in conjunction with a while loop: you can call imshow inside the loop and monitor the keyboard so that the loop only exits when the user types a specific key, for example the lowercase q to quit.

Let's see how this behaves when we run the demonstration. First the image is displayed using Matplotlib, and I can exit that window. Now the checkerboard image is displayed using the OpenCV version of imshow, but only for eight seconds, and then the same with the Coca-Cola logo; notice that I'm not able to exit the window, I have to wait out the eight seconds. Now we're back to the checkerboard image, but in this case we passed a zero to waitKey, so it will be displayed indefinitely until I hit a key on the keyboard. I'll go ahead and hit the space bar, and it disappears. Now we're in the while loop and the Coca-Cola image is being displayed; if I hit the space bar nothing happens, because waitKey is being monitored for the user to type a lowercase q, so lowercase q is the only key that will allow this to exit. I'll go ahead and hit q. So that's another way to use the OpenCV version of imshow. We hope this gives you a better feel for some of the subtleties of using it, and that's all we wanted to cover in this section.
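A minimal sketch of that OpenCV display pattern, assuming an illustrative image file name:

import cv2

img = cv2.imread("checkerboard_color.png")  # illustrative filename

# Create a named window and display the image with OpenCV's own imshow.
win_name = "Preview"
cv2.namedWindow(win_name, cv2.WINDOW_NORMAL)
cv2.imshow(win_name, img)

# Keep the window up for 8000 ms; waitKey(0) would block until any key is pressed.
cv2.waitKey(8000)

# A loop that exits only when the user presses the lowercase 'q' key.
while True:
    cv2.imshow(win_name, img)
    if cv2.waitKey(1) == ord("q"):
        break

cv2.destroyAllWindows()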
In this video we're going to talk about basic techniques for manipulating images, which include changing the values of individual pixels as well as some other useful transformations like cropping, resizing, and flipping. These are all very standard transformations and very easy with the help of OpenCV functions, so let's get started and take a look at the first example below, where we manipulate individual pixels. This is the image we worked with in the first video: the black-and-white checkerboard that's 18 pixels wide and 18 pixels tall.

Let's talk about how to access individual pixels within this image. Suppose we wanted to access the very first pixel in the first black box and then the very first white pixel on that row; those correspond to the element 0 up here and the first element with value 255. The first thing to mention about accessing these elements is that NumPy arrays are zero-based. In this code we print the cell associated with the first row and first column, that is, the (0, 0) pixel, which is the 0 right up here. The next print statement prints the pixel in the first row and the seventh column: counting one, two, three, four, five, six, seven, but we give it an index of six because indexing is zero-based. If you look at the printout you get 0 and 255, the values of those two pixels.

Now let's scroll down a little further and see how to modify pixels. In the next code block we make a copy of the image so we can modify it and still retain the original for reference. We modify the values of four specific pixels, indicated by the four assignments from the (2, 2) entry to the (3, 3) entry; that corresponds to the third and fourth rows and the third and fourth columns, those four pixels right there, and we set them to 200.
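A tiny sketch of that pixel access and modification, again with an illustrative file name:

import cv2

cb_img = cv2.imread("checkerboard_18x18.png", 0)  # illustrative filename

# NumPy arrays are zero-based: [row, column].
print(cb_img[0, 0])   # first pixel of the top-left black box -> 0
print(cb_img[0, 6])   # first white pixel on that row -> 255

# Work on a copy so the original stays intact, then set four pixels to 200.
cb_img_copy = cb_img.copy()
cb_img_copy[2, 2] = 200
cb_img_copy[2, 3] = 200
cb_img_copy[3, 2] = 200
cb_img_copy[3, 3] = 200
print(cb_img_copy)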
This could also be accomplished with NumPy slicing notation, using a slice covering rows 2 to 3 and columns 2 to 3 instead of writing all four assignments. In either case, if we print the image you can see the modified pixels in the matrix, and if we display the image you can see that they have been set to a light gray tone in the center. So it's very simple to modify individual pixels, and we just wanted to give you a brief demonstration of that.

The next topic we're going to cover is cropping. Cropping is similar in some sense to what we described above, because it also involves array indexing. In this example we read in a color image of a boat in the water, specifying of course that it is to be read in as a color image, and we store it in BGR format. The first thing we do is swap the color channels, as we've indicated before, and then display the image using Matplotlib imshow. Now suppose we're interested in cropping out the small area around the boat, say rows 200 to 400 and columns 300 to 600. The way we do that is simply to index into the original array with rows 200 to 400 and columns 300 to 600 and assign those values to a new variable called cropped_region, and if we plot that cropped region, that's what we get. Cropping is very simple and straightforward: it's just indexing into an existing image and extracting the region you're interested in.

Now let's move on to the next section, which is resizing. For this we use the OpenCV function resize, which takes several arguments. There are two required arguments: the source image and the desired output size of the image. Then there are several optional arguments, specifically fx and fy, which are scale factors that we demonstrate below, and the interpolation method. We're not going to go into those details, but just know that there are several interpolation methods you can select from; for example, when you're resizing an image up, you're having to invent new pixels, and an interpolation is required to do that.

Let's take a look at an example. In the first method we use the scaling factors fx and fy. Here's the call to the resize function: we pass it the cropped image, and since the second argument is required we have to specify it, but it's okay to pass it as None, which allows us to use the scaling factors fx and fy instead. In this example we set both to two, so we double the size of the cropped region, and you can see when we display the result that's exactly what we get: it's now 400 pixels high and 600 pixels wide.

Now let's talk about another method for resizing, where we set a specific width and height for the image, in this case 100 and 200. We create a two-element vector with both of those dimensions and pass it as the second argument to the resize function, then display the resized cropped region. In this case we get exactly what we asked for, 100 wide and 200 high, but of course the image has been distorted, because we didn't maintain the original aspect ratio.
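Here is a short sketch of the cropping and the two resize variants described above; the file name is illustrative, and the interpolation flag is just one reasonable choice.

import cv2
import matplotlib.pyplot as plt

img_bgr = cv2.imread("boat.jpg", cv2.IMREAD_COLOR)  # illustrative filename

# Cropping is just NumPy indexing: rows 200-400, columns 300-600.
cropped_region = img_bgr[200:400, 300:600]

# Method 1: scale factors fx and fy, with the required dsize passed as None.
resized_2x = cv2.resize(cropped_region, None, fx=2, fy=2)

# Method 2: an explicit (width, height); this distorts the aspect ratio.
resized_100x200 = cv2.resize(cropped_region, dsize=(100, 200),
                             interpolation=cv2.INTER_AREA)

# Reverse BGR to RGB only for display.
plt.subplot(131); plt.imshow(cropped_region[:, :, ::-1]); plt.title("Cropped")
plt.subplot(132); plt.imshow(resized_2x[:, :, ::-1]); plt.title("2x via fx, fy")
plt.subplot(133); plt.imshow(resized_100x200[:, :, ::-1]); plt.title("100 x 200")
plt.show()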
That leads us to the last method, which still uses the dimension vector, but we start by specifying a width of 100 and then calculate the desired height while maintaining the aspect ratio. We create the ratio of the desired width to the original width of the image and use that factor to derive the desired height, and when we pass the revised dimensions to the resize function we get an image that is 100 pixels wide and the appropriate height to maintain the proper relationship, which turns out to be about 67 pixels.

Further below, we write these images to disk and then read them back in. This is the cropped region that was resized by a factor of two: we swap the channels, write it to disk under this file name, then read it back in and display it directly in the browser so you can see how large it is. We also perform the same operation on the cropped region prior to resizing and display that, and you can see it's half the size of the one we resized. That's all there is to resizing; it's fairly straightforward.

Now we move on to image flipping. The function we'll be using is flip, and it simply takes the image itself as the first argument and then a flip code that specifies how we want to flip the image. There are three options: horizontally, vertically, or in both directions, and those are specified by 1, 0, and -1. In the example we pass in the original image and make three function calls corresponding to the three options, and we display them below: the first has been flipped horizontally, this one vertically, this one around both axes, and finally this is the original. That's all we wanted to cover in this section. Thank you, and we'll see you next time.
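A brief sketch of the aspect-ratio-preserving resize, the save step, and the three flip codes; the file names are illustrative.

import cv2

img_bgr = cv2.imread("boat.jpg", cv2.IMREAD_COLOR)  # illustrative filename
cropped_region = img_bgr[200:400, 300:600]

# Resize to a width of 100 pixels while preserving the aspect ratio.
desired_width = 100
aspect_ratio = desired_width / cropped_region.shape[1]
desired_height = int(cropped_region.shape[0] * aspect_ratio)
resized = cv2.resize(cropped_region, dsize=(desired_width, desired_height))

# Save the resized crop and the original crop to disk.
cv2.imwrite("resized_cropped_region.png", resized)
cv2.imwrite("cropped_region.png", cropped_region)

# Flip codes: 1 = horizontal, 0 = vertical, -1 = both axes.
img_flipped_horz = cv2.flip(img_bgr, 1)
img_flipped_vert = cv2.flip(img_bgr, 0)
img_flipped_both = cv2.flip(img_bgr, -1)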
In this video we're going to describe how to annotate images with lines, circles, rectangles, and text. Keep in mind that this also applies to video frames, so it can be very helpful there as well. To get started, we read in an image and display it here in the browser; we'll be working with this image for the rest of the notebook. One thing to point out: even though this may have been a grayscale image, we're actually reading it in as color, because we want to demonstrate annotations in color, so we need a color representation to work with.

Let's proceed to the first section, where we learn how to draw a line on an image. The OpenCV functions that allow you to annotate images are all very straightforward. In this case the first argument is the image itself, the next two arguments are the first point and last point of the line, and then the color; those are the four required arguments. Then there are some optional arguments that we'll specifically look at as well, including thickness and line type.

In the first example we make a copy of the image, simply so we can preserve the original and annotate the copy, which is called image_line, and we draw a line on it. You can already see it's a yellow line, drawn from point one to point two: that's 200 along the x-axis and 100 along the y-axis for the first point, and 400 along the x-axis and 100 along the y-axis for the second. Then we specify yellow, and recall that this has to be in BGR, not RGB; that's the reason the last two channels here are 255, to produce yellow from red and green. We use a line thickness of five, and for the line type we use LINE_AA, which stands for anti-aliased. That's usually a good choice: it uses semi-transparent pixels and often produces a very smooth, nice-looking result. That's the first example, and the remaining examples are very similar, with just some minor variations.

Now let's look at how to render a circle on an image. In this case we have to specify the center and radius of the circle, but everything else in the argument list is the same. Scrolling down to the example, we render a circle at the coordinate (900, 500), that is, 900 along the x-axis and 500 down, with a radius of 100.

Moving on to rectangles, there are again four required arguments, but in this case point one and point two refer to the top-left corner and the bottom-right corner of the rectangle; all the other arguments and specifications are the same. In this example we draw a rectangle around the launch tower: the upper-left corner of the rectangle is at (500, 100), which you can see here, and the lower right is at (700, 600), which is right here, and we specify a different color for it.

Finally, moving on to text: text is a little different, and there are obviously some additional arguments. The first argument is the text string, and the next argument is the origin of the text, which refers to the bottom-left corner of the text string where it will be placed on the image. Then there's the font face, which you might also think of as the font style, and the font scale, a floating-point number that scales the font size, and again we have some optional arguments. In this example we set several of the arguments right here: the text string, a font scale of 2.3, and the font face FONT_HERSHEY_PLAIN; if you go to the documentation link you can find out what font faces are available, and I just selected this one. The font color is bright green, the font thickness is two, and the origin is right here at (200, 700), at the lower-left base of that text string. That's really it; annotations using OpenCV are very straightforward and simple, and that's all we wanted to cover in this section. We'll see you next time.
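A sketch of the four annotation calls; the file name, the circle and rectangle colors, and the text string are illustrative placeholders rather than the exact values used in the notebook.

import cv2

img = cv2.imread("rocket_launch.jpg", cv2.IMREAD_COLOR)  # illustrative filename
annotated = img.copy()

# Line from (200, 100) to (400, 100); color is BGR, so (0, 255, 255) is yellow.
cv2.line(annotated, (200, 100), (400, 100), (0, 255, 255),
         thickness=5, lineType=cv2.LINE_AA)

# Circle centered at (900, 500) with a radius of 100.
cv2.circle(annotated, (900, 500), 100, (0, 0, 255),
           thickness=5, lineType=cv2.LINE_AA)

# Rectangle from the top-left corner (500, 100) to the bottom-right (700, 600).
cv2.rectangle(annotated, (500, 100), (700, 600), (255, 0, 255),
              thickness=5, lineType=cv2.LINE_AA)

# Text anchored at the bottom-left of the string, here (200, 700).
cv2.putText(annotated, "Annotated with OpenCV", (200, 700),
            cv2.FONT_HERSHEY_PLAIN, 2.3, (0, 255, 0), 2, cv2.LINE_AA)

cv2.imwrite("annotated.png", annotated)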
In this video we're going to cover several basic image processing techniques used both for image enhancement and as upstream pre-processing steps for many different applications. We'll cover quite a few topics, including arithmetic operations, thresholding and masking, and bitwise operations. All of these are fundamental to many computer vision processing pipelines, and we've also included a couple of application examples so you can get a better feel for how these techniques are used in practice.

With that, let's take a look at the first example. We read in a color image and display it right here within the browser, and what we'd like to do is adjust the brightness of this image. In this section we display the original image in the center, with the brightness adjusted darker to the left and brighter to the right, so let's look at the code that accomplishes that. We start in the first line by creating a matrix using the NumPy ones method, passing in the shape of the original image and the data type, unsigned 8-bit, and we multiply it by 50. The result is a matrix the same size as the original image with pixel intensities of 50 everywhere. Now we simply use the OpenCV add and subtract functions to add and subtract that matrix from the original image, and then we display the results. That's all that's required to generate an image that's darker than the original and an image that's lighter.

Now let's talk about how to change the contrast of an image. That's a little different, because contrast is defined by the differences in the intensity values of the pixels within an image, so it requires a multiplication operation. We start by creating two matrices, each using the NumPy ones function to create a matrix the same size as the original image, and in both cases we multiply those matrices by a factor: 0.8 in the first case and 1.2 in the second. These matrices now contain floating-point values scaled by those two factors. On the next two lines we multiply them by the original image, but note that some nesting is required: the two matrices we defined contain floating-point values, so in order to multiply them by the original image, which was unsigned int, we first convert the image to float, do the multiplication using the OpenCV multiply function, and then convert the result back to unsigned 8-bit. We now have the results in these two variables and we simply display them. You can see the results below: the original image in the center, the lower-contrast image to the left, and the higher-contrast image to the right.
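A compact sketch of the brightness and contrast adjustments just described (the file name is illustrative, and the final line already applies the clipping fix discussed next):

import cv2
import numpy as np

img_bgr = cv2.imread("new_zealand_coast.jpg", cv2.IMREAD_COLOR)  # illustrative filename
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

# Brightness: add or subtract a constant-intensity matrix of the same size.
matrix = np.ones(img_rgb.shape, dtype="uint8") * 50
img_brighter = cv2.add(img_rgb, matrix)      # cv2.add saturates at 255
img_darker = cv2.subtract(img_rgb, matrix)   # cv2.subtract saturates at 0

# Contrast: scale the pixel values. The scale matrices are floats, so convert
# the image to float before multiplying, then back to uint8 for display.
matrix1 = np.ones(img_rgb.shape) * 0.8
matrix2 = np.ones(img_rgb.shape) * 1.2
img_lower_contrast = np.uint8(cv2.multiply(img_rgb.astype(np.float64), matrix1))

# A plain uint8 cast lets values above 255 wrap around; clipping first
# (as discussed next) avoids that artifact.
img_higher_contrast = np.uint8(
    np.clip(cv2.multiply(img_rgb.astype(np.float64), matrix2), 0, 255))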
You'll notice that in the right-hand image there is some odd color coding; something has gone wrong. The reason is that when we multiply the original image by the matrix with a factor of 1.2, we potentially get values greater than 255. If you look at the original image, the clouds were probably close to 255, some of them at least, and when we multiplied by 1.2 we exceeded 255. When we then attempt to convert those values to an unsigned 8-bit number, rather than saturating at 255 they simply roll over to some small number, which is why those intensity values are now close to zero. That's the reason for the issue.

So how do we remedy that? If we scroll down to the next section and look at that line of code, what we can do is use the NumPy clip function to first clip those values to the range 0 to 255 before converting them to unsigned 8-bit. Now when you look at the right-hand image it looks fine, and in fact this portion of the image has been completely saturated: some of these values are right at 255, so they carry no information; they're the extreme highlights of the image. That's a summary of brightness adjustment and contrast adjustment.

Let's continue to the next section of the notebook, which covers image thresholding. Thresholding is a very important technique that is often used to create binary images, which allow you to selectively modify portions of an image while leaving other portions intact. We have a couple of examples below to demonstrate this, but first a few notes from the documentation section, where we list two different functions: threshold and adaptiveThreshold. Looking at the threshold function to see how it works, it takes as input a source image, then a threshold value between 0 and 255, then a max value for the binary map, and then the type of thresholding to perform. In all of our examples below we use a binary threshold. The idea is that, whatever threshold you specify, pixels in the original image below that threshold will be set to zero, and pixels above it will be set to 255, so the result is a binary map containing either zeros or intensity values of 255, or whatever you set the max value to, which is typically 255.
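A minimal sketch of a global binary threshold and an adaptive threshold. The file name, block size, and constant are illustrative settings, not necessarily the notebook's values.

import cv2

# Read the image as grayscale for thresholding.
img = cv2.imread("building_windows.jpg", cv2.IMREAD_GRAYSCALE)  # illustrative filename

# Global binary threshold: pixels below 100 become 0, the rest become 255.
retval, img_thresh = cv2.threshold(img, 100, 255, cv2.THRESH_BINARY)

# Adaptive threshold: the threshold is computed per pixel neighborhood
# (blockSize 11) with a constant 7 subtracted from the local mean.
img_adaptive = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                     cv2.THRESH_BINARY, 11, 7)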
Now let's look at the adaptiveThreshold function. It also takes a source image and a maximum value for the binary map, then a method type for performing the adaptive thresholding, a threshold type (the same kind of input as in the first function), and also a block size and a constant value. Both of those are used by the adaptive method: the block size is essentially the size of the pixel neighborhood that is considered when computing the adaptive threshold spatially across the image. There's a lot of detail here; we simply wanted to include it for reference and then show you some examples below.

Let's take a look at the first example. We read in an image, a photograph of a building with lots of windows and a geometric structure, and we call the threshold function, passing it that image, a threshold of 100, a max value of 255, and the flag indicating that we want a binary map. What's returned from this function is the binary image; the first return value is not important at this point, it's the second argument that contains the actual binary map. We then display both the original and the thresholded image, and you can see that there's an opportunity to use this map as a way to selectively process certain parts of the image.

Let's look at a more concrete example below. Suppose you were interested in building an application that could read and decode sheet music. This is very similar to optical character recognition, where the goal is to recognize characters in text documents; in this case you'd be trying to recognize musical notes for the purpose of digitizing that information. In the example below it's easiest to start by talking about the two images, and then we'll come back up and talk about the code. On the left is the original photograph of some sheet music. You can see that it's kind of dark in the lower-right corner, clearly not a white background, but the notes are fairly well defined; they're all very dark black, which looks good. The idea is that we'd like to threshold this image to achieve a binary map similar to the one shown on the right. Looking at the dark values of all the notes in the musical notation, it looks like the intensity values of those black areas might be below 50, for example; they all look pretty black, and even some of the marks up here look fairly dark. So if we create a binary map with a threshold of 50, we're hoping to isolate all of this important information. The image on the right was in fact produced with a threshold of 50, and the result is rather surprising: notice that there's no information in the top portion of the image. That would mean the intensity values of all those black notes are actually above 50, which isn't very intuitive, because they look rather dark. So that's just one example.

Let's go back up and look at the code that produces these plots. We read in the original image and call the threshold function, passing it the image, a threshold of 50, a max value of 255, and the flag indicating that we want to produce a binary map. What we get back is the thresholded image, which is what's displayed down here.
There's some additional code here as well: another threshold call, this time with a higher threshold of 130, in the hope that we can extract more of the information in the top portion of the image. Finally, we call the adaptive thresholding function, specifying the type of thresholding algorithm, the fact that we want a binary map, and some settings for the algorithm, and then we display all four results below. You can see the first two, which we've already talked about, so let's scroll down and look at the next two. The one at the lower left, produced with the global threshold of 130, did a better job of isolating the musical notes in the top portion of the page, but that threshold was far too high to accommodate the lower portion of the page. What's going on there is that the dark values on the page, essentially the shadow, are actually lower than 130, so as a result that whole portion of the page is just blacked out. Neither of these global thresholds, 50 or 130, does a very good job, and if you experiment with other thresholds you'll find that there isn't a single global threshold that works well in this situation. Notice, however, that the plot at the lower right, produced using adaptive thresholding, is much better. This is a very good example of how you can take a pretty challenging image with several dark areas and isolate just about everything you want in it. We simply wanted to point out the importance of thresholding, and in particular adaptive thresholding.

Now let's move on to the next section of the notebook, which covers bitwise operations. In the documentation section you can see that we have four different functions: bitwise and, or, xor, and not. Here we show an example of bitwise_and: it takes two input images, which can be the same image but don't have to be, and then an optional mask, which specifies which portion of the two images the logical operation applies to. In the example we read in two different images; both are grayscale binary images that we're going to perform these operations on.

Let's see how that works, starting with the bitwise AND operator. We pass in the image of the rectangle and the image of the circle and indicate that the mask is None, so we're simply doing a bitwise AND comparison between the two images. The value returned from the comparison will be 255, or white, wherever the corresponding pixels in both images are white, so in this case the result is just the left side of the half circle, since that's the only region where the pixels in both images are white. Now let's look at the bitwise OR operation: in this case the return value will be white if the corresponding pixel in either image is white, so we get the entire left side of the rectangle, which is white, plus the right-hand side of the circle. Finally, let's look at the XOR operator. We pass the same set of arguments; only the operation is different. The exclusive OR returns white only where one of the corresponding pixels is white, but not both, and this is the result you get.
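A short sketch of the four bitwise operations on two binary images; the file names are illustrative.

import cv2

# Two binary (grayscale) images: a white rectangle and a white half circle.
img_rec = cv2.imread("rectangle.jpg", cv2.IMREAD_GRAYSCALE)  # illustrative filenames
img_cir = cv2.imread("circle.jpg", cv2.IMREAD_GRAYSCALE)

# AND: white only where both images are white.
result_and = cv2.bitwise_and(img_rec, img_cir, mask=None)

# OR: white where either image is white.
result_or = cv2.bitwise_or(img_rec, img_cir, mask=None)

# XOR: white where exactly one of the two images is white.
result_xor = cv2.bitwise_xor(img_rec, img_cir, mask=None)

# NOT: inverts a single image.
result_not = cv2.bitwise_not(img_rec)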
That's a summary of those three functions. Now let's look at an application that uses bitwise operations and binary maps. In this example we're interested in manipulating the Coca-Cola logo: we start with the logo and with what we call a background image, this colorful checkerboard, and we'd like to achieve the result shown on the right, essentially displaying the background image behind the white lettering so that it shows through. That's the goal, and we'll proceed through the notebook to see how it's done.

First, we read in what we call the foreground image, which is the Coca-Cola logo itself, and display it in the browser. Further down we do the same thing with the background image; in that case there's a little extra code to make sure that image is the exact same size as the Coca-Cola logo, and as we've seen in a previous video, we make use of the OpenCV resize function to accomplish that.

Next we create a couple of masks from the Coca-Cola logo. In the top portion we pass the logo to cvtColor to convert it to grayscale, then use the OpenCV threshold function to create a binary mask from that grayscale image. We call the result img_mask and display it below; it contains only values of 0 and 255. We then perform a similar operation to get the inverse mask, but not using the threshold function, although we could have, by specifying a binary-inverse threshold type; instead we simply call the bitwise_not function on the image mask to return the inverse mask. You can see both masks displayed here in the browser.

Now we make use of them below. In this section we do a bitwise AND of the background image with itself, but using the image mask. This operation performs a bitwise AND between the corresponding pixels of the two images, which here are the same image, but it applies only where the mask is white, which is the lettering in this case. So the result is that everything else is zero and we get just the colors showing through in the logo lettering. Then we do a similar operation on the logo itself, which is image_rgb: a bitwise AND with itself, passing it the inverse mask, which lets only the red foreground show and leaves everything else black. And now you can see that if you simply add these two images together, the black regions sum to zero, and you get the final result. We thought that was an interesting way to demonstrate how you can use binary maps, thresholding, and logical operations to accomplish something like this.

We covered a lot of material in this video. Keep in mind that all of it is fundamental to many different image processing and computer vision pipelines, and we'll continue in this course with a little more focus on actual applications now that we have some of this basic material under our belts. Thank you, and we'll see you next time.
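Here is a sketch of that logo-and-background pipeline; the file names and the threshold value of 127 are illustrative assumptions rather than the notebook's exact settings.

import cv2

# Foreground (logo) and background images; filenames are illustrative.
img_logo_bgr = cv2.imread("coca_cola_logo.png", cv2.IMREAD_COLOR)
img_background_bgr = cv2.imread("checkerboard_color.png", cv2.IMREAD_COLOR)

# Make the background the same size as the logo.
h, w = img_logo_bgr.shape[:2]
img_background_bgr = cv2.resize(img_background_bgr, (w, h))

# Binary mask of the white lettering, plus its inverse.
img_gray = cv2.cvtColor(img_logo_bgr, cv2.COLOR_BGR2GRAY)
retval, img_mask = cv2.threshold(img_gray, 127, 255, cv2.THRESH_BINARY)
img_mask_inv = cv2.bitwise_not(img_mask)

# Background shows through only where the lettering mask is white;
# the logo keeps only its red foreground where the inverse mask is white.
img_background = cv2.bitwise_and(img_background_bgr, img_background_bgr, mask=img_mask)
img_foreground = cv2.bitwise_and(img_logo_bgr, img_logo_bgr, mask=img_mask_inv)

# The black regions sum to zero, so adding the two gives the final result.
result = cv2.add(img_background, img_foreground)
cv2.imwrite("logo_with_background.png", result)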
In this video we're going to describe how to access the camera attached to your computer and send that streaming video to an output window on your display. We have a short script that accomplishes this; we'll walk through it and then execute it so you can see it in action.

Starting on lines 35 and 36, we import OpenCV and the sys module, both of which are required below. On line 38 we specify a default camera device index of zero, and on lines 39 and 40 we simply check whether there was a command-line argument to override that default; in this case we'll just use zero. On line 42 we call the VideoCapture class to create a video capture object, passing the device index into it. A device index of zero accesses the default camera on your system; if you had more than one camera attached, you would need to indicate the index that points to the correct one, so zero is the default, one is the second camera, two is the third, and so forth. On lines 44 and 45 we create a named window, which we will eventually send the streamed output to. Finally, we create a while loop that allows us to continuously stream video from the camera and send it to the output unless the user hits the Escape key; that's what the waitKey call does here, continuously checking whether the user has hit Escape. The first line in the loop, line 48, uses the video capture object to call the read method of that class, which returns a single frame from the video stream along with a logical variable, has_frame. If there's any kind of problem reading the video stream or accessing the camera, has_frame will be false and we break from the loop; otherwise we continue on and call the OpenCV imshow function to send the video frame to the output window. That's all there is to it; there's not much code, but we wanted to walk through it and give you an example of how to do this.

Let's go ahead and execute it. And there it is: a window streaming video from my webcam right to the display. As soon as I hit the Escape key this will exit, and that's all we really wanted to cover in this video. In the next video we'll build on this a little and do some processing of the video frames from the camera, then send the post-processed output to the display, so that will be a little more interesting to look at. That's it for now, and we'll see you next time.
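A minimal sketch of a camera-preview script with the structure described above; details such as the window name are illustrative.

import sys
import cv2

# Default camera index 0, optionally overridden from the command line.
s = 0
if len(sys.argv) > 1:
    s = int(sys.argv[1])

source = cv2.VideoCapture(s)

win_name = "Camera Preview"
cv2.namedWindow(win_name, cv2.WINDOW_NORMAL)

# Stream frames until the user presses Escape (key code 27).
while cv2.waitKey(1) != 27:
    has_frame, frame = source.read()
    if not has_frame:
        break
    cv2.imshow(win_name, frame)

source.release()
cv2.destroyWindow(win_name)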
In this section we're going to describe how you can save videos to disk. In a previous video we described how to send the streaming output from your webcam to the display; here we want to cover how to actually write that stream to disk. We start by specifying a source, which could have been the webcam, but in this example we specify a file on disk and read it in by creating a video capture object, passing the video file as the input argument. In the next section we simply check that this was successful, and then, scrolling down a little, we use the read method of that class to retrieve the first frame of the video and use imshow to display it in the notebook. We can also load the video itself in the browser with this command, which we've already executed, so I'll go ahead and play it; it's just a very short clip of a race car.

Next let's talk a little about the VideoWriter class in OpenCV. It allows you to create a video file on disk, and it takes as arguments the file name, the fourcc argument, which stands for four-character code and describes the codec used to compress the video frames, then the frames per second, and then the frame size. The frame size is important because it needs to match the dimensions of the frames in memory that you want to write to disk; we'll see an example of that below.

Let's look at the next code section. The very first thing we do is use the video capture object to call its get method, which retrieves the dimensions of the video frames we have in memory. Then we create two VideoWriter objects, one for an AVI format and one for an MP4 format. In both cases we specify the file name and then the fourcc codec, using the VideoWriter_fourcc function; for an AVI file you specify one set of arguments and for an MP4 file another, and you can look at the documentation and read up on this. This is the tricky part of writing video files: getting the codec right and making sure the frame dimensions match the frame size you're trying to write.

Now let's see how we actually do the writing. We create a while loop and use the read method of the video capture object to read every frame from the video file, and then we simply pass each frame to the two VideoWriter objects we created above, one for the AVI format and one for the MP4 format. When the processing is done, we release the resources. That's all there is to it, but we did want to point out that getting the codecs correct and making sure the frame dimensions match the dimensions of the frames you have in memory are the two key things to watch out for. That's all we wanted to cover in this section; thanks so much, and we'll see you next time.
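A sketch of that read-and-rewrite loop; the file names, the 10 frames-per-second setting, and the codec choices (MJPG for AVI, mp4v for MP4) are assumptions that may need adjusting for your platform.

import cv2

cap = cv2.VideoCapture("race_car.mp4")  # illustrative filename

# Frame dimensions must match the frames being written.
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# One writer per container/codec; 10 fps is an assumed setting here.
out_avi = cv2.VideoWriter("race_car_out.avi",
                          cv2.VideoWriter_fourcc("M", "J", "P", "G"),
                          10, (frame_width, frame_height))
out_mp4 = cv2.VideoWriter("race_car_out.mp4",
                          cv2.VideoWriter_fourcc(*"mp4v"),
                          10, (frame_width, frame_height))

# Read every frame and pass it to both writers.
while True:
    has_frame, frame = cap.read()
    if not has_frame:
        break
    out_avi.write(frame)
    out_mp4.write(frame)

# Release the resources when done.
cap.release()
out_avi.release()
out_mp4.release()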
In this video we'll be demonstrating some of the more common image processing techniques used in computer vision pipelines. To do this we're going to build on the camera demonstration from the last video, where we sent streaming video from the camera to an output window, but this time we'll do some image processing on the video frames first and then send those results to the display. To get started we'll take a look at the code and then execute it, so we can talk in more depth about the various processing techniques and their parameter settings. Starting on line 39 we define the four run modes for the script: a preview mode, a blurring filter, a corner feature detector, and a Canny edge detector. Then on line 44 we define a small dictionary of parameter settings for the corner feature detector. These include the maximum number of corners the algorithm will return, and a quality level, which characterizes the minimum acceptable quality of image corners. The way that works is that the corner feature with the highest score in the entire image is multiplied by this parameter, and that value is used as the minimum threshold for filtering the corner features in the final list returned by the algorithm. For example, if several features were detected and the maximum score among them was 100, we'd multiply 100 by 0.2 to get 20, and 20 would then be the threshold for deciding whether a feature qualifies as a detected corner. The next parameter, minimum distance, is the minimum distance between adjacent corner features, measured as a Euclidean distance in pixel space; it describes how close two corner features can be in the returned list. Finally, block size is the size of the pixel neighborhood used by the algorithm when computing the corner features. The next block of code, starting on line 48, is very similar to the code in the previous video: we set the device index for the camera, create an output window for the streamed results, and create a video capture object so we can process the video stream in the loop below. On line 61 we enter a while loop; the first line in the loop reads a frame from the video stream, and on line 66 I flip that frame horizontally, mainly as a convenience so it's easier for me to point things out in the video. Then on line 68, depending on the run configuration, we execute one of these OpenCV functions. In preview mode we simply set the frame as the result and display it directly in the output window using imshow, but for the other run modes we do some processing first and then send the processed result to the output window; a sketch of these processing calls follows this paragraph. Starting on line 71 we call the Canny edge detection function in OpenCV. The first argument is the image frame, followed by two additional arguments, a lower threshold and an upper threshold. The upper threshold decides whether a series of pixels should be considered an edge: if the intensity gradient of those pixels exceeds the upper threshold, we declare them a sure edge, and likewise pixels whose intensity gradients fall below the lower threshold are completely discarded. Pixels whose gradients fall between the two thresholds are treated as candidate edges and kept only if they can be associated with a nearby segment that has already been declared an edge; in other words, we allow weaker edges to be connected to stronger ones when the weaker edges are likely to lie along the same true edge. We'll see an example of that when we run the edge detection demo in a little bit. The next function is the blur function in OpenCV, which uses a box filter to blur the image. The first input is the image frame itself, and the second input is the dimensions of the box kernel, in this case a 13 by 13 kernel that is convolved with the image to produce the blurred result. A smaller kernel gives less blurring, and a larger kernel gives more substantial blurring.
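Here is a sketch of the per-frame processing just described. The parameter values are assumptions for illustration (for example the Canny thresholds and the corner count limit), and the corner detector call shown in the last branch is covered in detail next.

```python
import cv2

# Parameters for the corner feature detector, as described above
# (the specific values here are example settings).
feature_params = dict(maxCorners=500,
                      qualityLevel=0.2,   # e.g. strongest corner score 100 -> threshold of 20
                      minDistance=15,     # minimum Euclidean distance in pixels between corners
                      blockSize=9)        # pixel neighborhood used to compute each corner score

def process_frame(frame, mode):
    """Apply one of the run modes described above to a single video frame."""
    if mode == 'blur':
        return cv2.blur(frame, (13, 13))          # 13x13 box kernel
    if mode == 'canny':
        return cv2.Canny(frame, 80, 150)          # lower / upper thresholds (example values)
    if mode == 'corners':                         # corner detection, discussed next
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        result = frame.copy()
        corners = cv2.goodFeaturesToTrack(gray, **feature_params)
        if corners is not None:
            for x, y in corners.reshape(-1, 2):
                cv2.circle(result, (int(x), int(y)), 10, (0, 255, 0), 1)
        return result
    return frame                                  # preview mode: pass the frame through
```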
Finally, for the corner feature detector, we convert the image frame to grayscale, and then on line 77 we call the function goodFeaturesToTrack. Although it isn't obvious from the name, what this function does is compute corner features. The first argument is the grayscale version of the video frame, and the second argument is the dictionary of feature parameters we described above. It returns a list of corners detected in the image, and if one or more corners were detected we simply annotate the result with small green circles to mark the locations of those features. After all of that, whatever result we have for whatever run mode we're in, we send it to the output stream. The next block of code simply monitors the keyboard for user input; the script was written so that the run modes can be toggled interactively, so for example if you were running in preview mode and wanted to switch to Canny edge detection mode, you would simply type a c on the keyboard. So that's all there is; there really isn't much code required to implement this, and at this point we're ready to execute the script, cycle through the different run options, and talk a little about the results we see. This is the preview mode, where we're simply sending the video stream from the camera to the output window. Next let's toggle through the other filters we implemented, starting with the blurring filter: I'll type a b on the keyboard, and you can see that the image has been slightly blurred. There are a few reasons you might want to do this. For example, if you had a noisy image, you could apply a small amount of blurring and still obtain an aesthetically pleasing result. More importantly, in computer vision and image processing we often use blurring as a pre-processing step before feature extraction. The reason is that most feature extraction algorithms use some kind of numerical gradient computation, and computing numerical gradients on raw pixel data can be noisy and poorly behaved, so smoothing the image before computing gradients turns out to be much more robust. That's one of the primary reasons we use blurring in computer vision. Now let's look at the next option, the corner feature detector. With that mode turned on you can see a small number of corner features in the image: there are some here on the microphone, a few around my face, and in particular several in the painting of the horses behind me. Let's talk a little about how these features are generated based on the input arguments we selected; there were two arguments in particular, the minimum distance between features, which is fairly straightforward, and the quality level of the features. First, the minimum distance. Here I've got a textbook with some very well defined characters on the front that have nice sharp edges, and we're detecting all kinds of corners in those characters; each of those letters probably has two or maybe three corner features. But when I move the book much closer to the camera, you'll see that there are now several more corner features associated with each character.
The reason is that those letters are now much larger in pixel space, so we're no longer constrained by the minimum distance between features. One other thing I'd like to show you: I'll draw your attention to this section of the book with the graphic image on it. If I hold it very close to the camera, we detect all kinds of corners, especially in the dots of that pattern. The detections jump around a bit because I'm having a hard time holding the book perfectly still, but the main point is that I'm getting all these detections. Now watch what happens when I lower the book and expose the text of the title: as soon as the title text is visible, all the features associated with the graphic image below it are filtered out. Let's look at that again: I raise the book and get all these features, and when I lower the book and expose the title, they're all filtered out. The reason is that the quality level threshold we talked about is based on the highest corner feature score in the entire image, and because the corner features associated with the characters in the title are so much stronger, their feature scores are higher, so we're effectively raising the detection threshold for corner features everywhere else. I thought that was an interesting example of how that parameter is actually used and how it can affect the algorithm in your particular application. Finally, let's cycle to the Canny edge detection mode. Now you can see the results of edge detection: the microphone is very well defined, the corner of my shoulder against the light background of the wall is very well defined, and up behind my shoulder you can see a painting of some horses whose subject matter is only partially defined, with a lot of broken edges. So I thought it would be interesting to look at the threshold inputs for the Canny edge detector and see whether we can improve that. Before we do, I'll take a screenshot of this video feed so we have something to compare against. I'll set that aside and now edit the thresholds for the Canny edge detector. Previously the lower threshold was very close to the upper threshold, so there wasn't much opportunity to find weaker edges and connect them to stronger ones, but if I lower it to something like 80 we now have a chance to consider weaker edges that might be associated with the stronger ones. I'll run this and put the two results side by side for comparison. If you look at the image below, the outline of the horses was rather broken in places, and comparing that to the video stream above you can see there's been some improvement: we're effectively extending those edges because we're allowing them to connect to weaker edges that fall between the two thresholds, rather than discarding those edges altogether. I thought that was an interesting way to demonstrate how these inputs affect the results. Obviously all of these algorithms require some experimenting and tuning that will depend on your particular application, but we hope this was a nice introduction for you.
We definitely encourage you to take a look at the OpenCV documentation for these and other functions, write similar scripts, and do some experimentation. That's all we wanted to cover in this section, and we'll see you next time. In this video we're going to be talking about image alignment, which is also referred to as image registration. Image alignment is used in many applications, such as building panoramic images from multiple photos or constructing HDR photos from multiple images taken at different exposures. It's also used in the medical field for comparing digital scans to highlight small changes between images. In this particular example we're going to introduce the topic by showing how to perform document alignment. The image on the left is a photo of a form that was printed out, filled out by hand, and placed on a table, and the goal is to transform that image into the one on the right, which aligns with the original template of the form and therefore makes optical character recognition of the form a much easier task. So that's a preview of what we're going to talk about. First, let's scroll down a little and cover some of the theory of transformations. On the left we have an original image in the shape of a square. One of the simplest transformations we can apply is a translation, which is simply a shifting of the pixel coordinates from the original image to the translated image, as shown here. Beyond that we have Euclidean transformations, which add rotation; notice that the size and shape of the original image are preserved. Further to the right we have affine transformations, which encompass Euclidean transformations but also include shear and scale changes, so now we have some distortion, although parallel lines remain parallel. Finally we have the homography, the most general transformation for 2D images, which allows us to transform the original square into an arbitrary quadrilateral. The reason this is useful and interesting is that it allows us to warp an image to effectively change its perspective. As a more concrete example, let's scroll down to the next section, where we show two images of the same book taken from different perspectives. Here we're interested in the homography between these two images, specifically for the 2D plane represented in each image by the front cover of the book. If we can identify at least four points in both images that correspond to the same physical locations on the front cover, then we can compute the homography that relates the two images. For example, we've identified four color-coded points in both images that represent the same points on the physical book; we call these corresponding points, since they correspond to the same physical locations but are represented by different pixel coordinates in each image. Given a set of points like these, we can simply call an OpenCV function to compute the homography and then apply it as a transformation to the image on the left, for example, to effectively change its perspective to look like the image on the right. Just keep in mind that four points is the minimum required; in practice we want to find many more corresponding points, and thankfully there are other functions in OpenCV that enable us to do just that, so we'll take a look at those details further below in the notebook.
Getting back to the document alignment example, let's start looking at the actual code. On lines 2 through 3 we import some required modules, and the very next thing we do is read in the images of the template form and the scanned form so we have both available. In the next section we simply display both images: the original form is on the left, and on the right is the photo of the form that was filled out, placed on a table, and photographed. Again, our goal is to take this image and apply a homography to it so that it lines up with the form on the left. So let's see how to do that. The very first step in the process is finding some number of key points in both images. There isn't a lot of code here, but there is a lot going on that needs some explanation. Lines 2 and 3 convert the images we read into grayscale, because the feature extraction code that follows only requires a grayscale representation of each image. Then there's the code that configures an ORB object using the ORB_create class. If you're not familiar with image features and feature extraction in computer vision, just know that various algorithms have been invented over the years to extract what we call features from images. The objective is to extract meaningful information that is contextually related to the image itself: typically we're looking for edges, corners, and texture, and people have invented various ways to represent that information compactly. ORB features are one way to do that, and they're available in OpenCV. So here we create the ORB object and then use it to detect and compute key points and descriptors for each of the images; a minimal sketch of this step follows this paragraph. Each of these function calls returns a list of key points and a list of associated descriptors. The key points are interesting features in each image, usually associated with a sharp edge or corner, and each is described by a set of pixel coordinates giving its location, a size indicating its scale, and an orientation. For each key point there is an associated descriptor, a vector of information describing the region around the key point, which effectively acts as a signature for that key point. The idea is that if we're looking for the same key point in both images, we can use the descriptors to match them up. Scrolling down a bit further, let's talk about these two displays. We've computed the key points and descriptors for each image, and in these figures we're displaying just the key points as red circles: the center of each circle is the location of the key point, the size of the circle represents its scale, and the line from the center to the edge of the circle represents its orientation.
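Here is a minimal sketch of the key point and descriptor step just described; the file names and the feature count limit are placeholders, not the notebook's exact values.

```python
import cv2

# Placeholder file names for the template form and the photographed form.
img_template = cv2.imread('form_template.jpg')
img_scanned = cv2.imread('form_scanned.jpg')

# Feature extraction only needs a grayscale representation of each image.
gray1 = cv2.cvtColor(img_template, cv2.COLOR_BGR2GRAY)
gray2 = cv2.cvtColor(img_scanned, cv2.COLOR_BGR2GRAY)

# Create an ORB feature extractor and compute key points and descriptors.
MAX_FEATURES = 500                       # assumed limit on the number of features
orb = cv2.ORB_create(MAX_FEATURES)
keypoints1, descriptors1 = orb.detectAndCompute(gray1, None)
keypoints2, descriptors2 = orb.detectAndCompute(gray2, None)

# Visualize key point location, scale, and orientation, as in the figures above.
vis = cv2.drawKeypoints(img_template, keypoints1, None, color=(0, 0, 255),
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
```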
Those visualization details aren't terribly important for this demonstration; I just wanted to point out that all of these red circles represent the key points, and that associated with each key point is a vector representation of the image patch around it, which we're not displaying, and it's those descriptors that are actually used to match the key points up. Notice that the figure on the left has red circles all over the form, while the figure on the right has circles in regions that aren't even present on the left, so the two lists of key points overlap but are certainly not identical. There are presumably some key points in both images that correspond to the same locations, and those are the ones we're going to try to find so that we can compute the homography between the two image representations. So that's the introduction to key points; now let's scroll down and talk about how we match them. The first step in the matching process is to create a matcher object by calling the DescriptorMatcher_create function, passing a configuration that indicates the type of matching algorithm, brute force, and the metric for computing the distance between descriptors, which is a Hamming distance; that's because ORB descriptors are binary strings, so a Hamming metric is the appropriate choice. We then use the matcher object to call the match function, which attempts to find the best matches between the two lists of descriptors, and we get back a data structure containing the list of matches for the key points computed above. Once we have that list, we sort it by the distance between the descriptors, and on lines 9 and 10 we further limit it to the top 10 percent of the matches returned by the matching function. We then use that filtered list to draw the matches in the code below, shown in the image: we call the drawMatches function and pass in the key points and image for image one, the same for image two, and the filtered list of matches; a sketch of this matching step follows this paragraph. If you look at the two images, you can see that several key points in one image match key points in the other. For example, there's a small icon of a person on the form, and further down what looks like an icon of a house, and several key points in both images in those regions were declared matches by the matching function, meaning the descriptors were close enough to call them the same point. So now we have a set of corresponding key points. Notice, though, that this lavender line down here goes to some other location on the form to the right; that's not a true match, but the descriptors for those two key points happened to be very close by coincidence, so the matcher decided it was one. It's okay to have some false positives; the important thing is that we have an overwhelming number of actual matches, which will allow us to compute the homography.
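Continuing the sketch from above, the matching step might look like this; the 10 percent keep fraction mirrors the description above.

```python
# Brute-force matching with a Hamming distance, since ORB descriptors are binary strings.
matcher = cv2.DescriptorMatcher_create(cv2.DESCRIPTOR_MATCHER_BRUTEFORCE_HAMMING)
matches = list(matcher.match(descriptors1, descriptors2, None))

# Sort matches by descriptor distance and keep only the best 10 percent.
matches.sort(key=lambda m: m.distance)
matches = matches[:int(len(matches) * 0.10)]

# Draw the retained matches between the two images for inspection.
im_matches = cv2.drawMatches(img_template, keypoints1,
                             img_scanned, keypoints2, matches, None)
```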
The final couple of steps in the notebook are to compute the homography and then warp the image. To compute the homography we simply call the findHomography function, passing it both sets of key points that were filtered by the matching process above, along with an optional argument selecting the algorithm used to compute the homography. Here we indicate the RANSAC algorithm, which is definitely the one you want to use, since it's very robust at filtering out the outliers left over from the matching process. There's a little bit of code above this that reformats the points to comply with what findHomography expects, but that's a detail; the point is that we can compute the homography from a set of corresponding key points. Finally, once we have the homography h, which is a 3 by 3 matrix, we call the warpPerspective function on image two, which you'll recall is the photo of the filled-out form sitting on the table, pass in the homography, and get back the registered, or aligned, image shown below on the right; a sketch of these last two calls follows this paragraph. We've effectively changed the perspective of the photo on the table so that it is very closely aligned with the template form, and that makes automatically processing the form a much simpler task: the form on the right can be compared to the form on the left, and an algorithm can be written that knows where, say, the last name field is on the form and can then recognize those characters as the last name of the person who filled it out. So that's one example of how you can use image alignment, or image registration, and there are many other applications as we mentioned earlier. As you can see, you can get this up and running in very few lines of code, and experimenting with your own images is a lot of fun, so we encourage you to explore it further. We hope this was helpful, and we'll see you next time. In this video we're going to describe how you can create panoramas from multiple images using OpenCV. Much of the processing pipeline for creating panoramas is very similar to the steps we described in the image alignment video, since panoramas require image alignment: we still need to find key points and descriptors in each of the images, determine their pairwise correspondences through a feature matching process, and estimate the homographies to facilitate image warping. Once the images have been transformed in this way, we need an additional step to stitch and blend them together so they look realistic. Fortunately, there's a high-level convenience function available in OpenCV's Stitcher class that lets us create panoramas by simply passing in a list of images. We do think it's important to understand the underlying concepts, but since we covered much of that material in detail in the image alignment video, we're simply going to use the Stitcher class in this example to show you just how easy it is to create panoramas with a single function call. Just remember that the images used to create panoramas need to be taken from the same vantage point, ideally on a tripod panning around the optical axis of the camera, and it's also important to take the photos at roughly the same time to minimize any lighting changes between them. Adhering to these suggestions will lead to the best results.
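Before moving on to the panorama code, here is the tail end of the alignment pipeline in sketch form, continuing from the previous snippets; the point-reformatting step mentioned above is shown explicitly.

```python
import numpy as np

# Reformat the matched key points as Nx2 arrays of (x, y) coordinates.
points1 = np.float32([keypoints1[m.queryIdx].pt for m in matches])
points2 = np.float32([keypoints2[m.trainIdx].pt for m in matches])

# Estimate the 3x3 homography with RANSAC to reject outlier matches, then warp
# the photographed form into the template's frame of reference.
h, mask = cv2.findHomography(points2, points1, cv2.RANSAC)
height, width = img_template.shape[:2]
img_aligned = cv2.warpPerspective(img_scanned, h, (width, height))
```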
So let's get started and take a look at the code that's required. In the first cell block we import some required modules, and in the next code section we use glob to retrieve the file names from a subdirectory. In the for loop that follows, we simply read in each file, convert the image from BGR to RGB, and append it to a list of images, and in the next section we plot each of the images so you can see the sequence; there are six images in total. Finally, in the last section, you can see that we can create the panorama in just two lines of code: we create a stitcher object from the Stitcher_create class, use that object to call the stitch method, and simply pass in the list of images, and the result we get back is the panorama image shown below. The only thing we'd mention at this point is that the returned panorama includes these black regions, which are a result of the warping that was required to stitch the images together. One thing you might consider doing is writing your own code to programmatically crop out those black regions; you could use a combination of thresholding techniques, bitmaps, and contour finding for that task. So that's all we really wanted to cover in this section; thanks so much, and we'll see you next time. Hi everyone, in this video we're going to be talking about high dynamic range imaging, also referred to as HDR imaging, and I think the best way to get started is to look at the simple example below. Here we have two photos of a young boy. The photo on the left was taken with an iPhone in standard camera mode; in that case the camera's metering system attempts to determine the main subject of the photo, in this case the young boy, and sets the exposure accordingly. It's done a nice job here, the boy is properly exposed, yet the background, the sky and the clouds, is completely washed out. Sometimes the opposite occurs: you get a background that's properly exposed while the foreground is far too dark. The reason this happens is that the intensities in the real-world scene far exceed the camera's ability to record them; most cameras only have 8 bits per channel, so there just aren't enough bits to capture the full dynamic range of the scene. In contrast, the image on the right was also taken with an iPhone, but in HDR mode, and in this case both the foreground and the background are properly exposed and the photo looks fantastic. So how exactly is this done? What the iPhone does is take three photographs at different exposures in quick succession, so that there's little or no movement between the three shots, and then it merges those three low dynamic range photos into an HDR photo like the one shown here. That's the basic idea, and we'll talk about it in a little more detail below with a different example. This is a common photo sequence of the Old Courthouse in St. Louis that's often used to describe HDR imaging. You can see there are four different images taken at different exposures. The image at the far left is underexposed quite a bit, although there is some area in the lower portion of the building that looks properly exposed and might contain some useful detail.
Further to the right, a little more of the building is properly exposed, the center still looks nice too, and other areas might provide useful information as well; further to the right again, the buildings in the background start to have proper exposure; and finally, in the rightmost image, the well-lit portion in the center is completely blown out, but perhaps the background buildings, the sky, and some of the areas in the foreground contain useful information. The hope is that, collectively, this sequence of four images contains some useful information for every pixel, which can be merged together to form a single HDR image with proper exposure everywhere. So let's take a look at some of the code that implements this example. Down here we import some required modules, and then we define a convenience function that reads the images and the exposure time for each image. Inside it we list the file names of the four images and set the exposure times; we happen to know them for this example, but you could also extract that information programmatically from each image's metadata. Finally, in the for loop we read each image, convert it to RGB, and return the list of images along with the exposure times. The next step, once we've read in the images, is to make sure they're properly aligned. Even if the images were taken in quick succession, or even on a tripod, it's important that they be very accurately aligned, down to the pixel or even sub-pixel level. As an example, the image on the left is an HDR image produced without alignment, and you can see that the zoomed-in section at the top of the building has several ghosting artifacts; it just doesn't look right and isn't a true representation of that portion of the scene. Contrast that with the HDR image on the right, produced with alignment, where the top portion of the building looks much more correct. However, because the images in the sequence are taken at different exposures, they actually look different from one another, and therefore standard alignment techniques just don't work. Fortunately there is a special class in OpenCV that uses bitmaps for this purpose, called createAlignMTB, where MTB stands for median threshold bitmap. So down here we create an AlignMTB object, use it to call the process method of that class, pass it the list of images, and get back the aligned images; a sketch of this step follows this paragraph. Once the images are aligned, the next step is to compute the camera response function. The reason we need to do this is that most cameras are not linear, meaning, for example, that if the radiance in a scene is doubled, the pixel intensities recorded by the camera will not necessarily double, and this is a problem when we want to merge images taken at different exposures. If the response function were linear, the intensities of the input images could simply be scaled by their exposure times, which would put them on the same radiance scale, and we could then compute an average intensity at every pixel location across the images to synthesize an HDR image.
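Here's a sketch of the reading and alignment steps just described; the file names and exposure times are placeholders for whatever bracketed sequence you're using.

```python
import cv2
import numpy as np

def read_images_and_times():
    # Placeholder file names and exposure times (in seconds); the exposure times
    # could also be read programmatically from each image's metadata.
    filenames = ['exposure_1.jpg', 'exposure_2.jpg', 'exposure_3.jpg', 'exposure_4.jpg']
    times = np.array([1 / 30.0, 0.25, 2.5, 15.0], dtype=np.float32)
    images = [cv2.imread(f) for f in filenames]
    return images, times

images, times = read_images_and_times()

# Align the differently exposed images using median threshold bitmaps.
align_mtb = cv2.createAlignMTB()
align_mtb.process(images, images)   # the list of images is aligned in place
```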
However, since the response function is not linear, we need to estimate it so that we can linearize the images before combining them. And since the response functions of various cameras are considered proprietary information by the manufacturers, we have to use the images captured by the camera itself to estimate its response function. This is actually a rather involved optimization problem, but fortunately OpenCV has two different classes we can use for the purpose, both named after the people who invented the algorithms. Let's take a look at the code in OpenCV that accomplishes this. The one we'll focus on is createCalibrateDebevec; there's another by Robertson, and in either case there's a class for each algorithm. Here we create a CalibrateDebevec object and use it to call the process method of that class, passing in the list of images and their associated exposure times, and we get back the inverse camera response function. The next block of code simply plots the camera response function: you can see that at lower intensity values the function is quite linear, then it becomes non-linear, and finally at the higher end of the spectrum we see clipping at 255 as the intensities in the actual scene exceed the recording limits of the camera. Also notice that the three channels are calibrated separately, since their sensitivities are slightly different. Now we can use this function to linearize the input images, by mapping the measured pixel intensities to calibrated intensities, so that the images can be merged appropriately. In the next section we use a separate class for that purpose: we call the createMergeDebevec class to create an object, then use it to call the process method of that class, passing in the list of images, their exposure times, and the response function we calculated above, and that method returns the HDR image we've been looking for; a sketch of the calibration and merging steps follows this paragraph. It's worth mentioning that the merging process intentionally ignores pixel values close to 0 or 255, because values near those extremes contain no useful information, so it's common to apply a hat-shaped weighting function to each of the input images to filter those pixels out of the merge. To briefly summarize: because there are multiple images of the scene at different exposure settings, the hope is that for every pixel there is at least one image that records an intensity that is neither too dark nor too bright. One problem still remains, though: the merged intensity values are no longer confined to the 0 to 255 range. Black is still zero, but HDR images can record light intensities from zero up to essentially unbounded brightness, and because they don't have a fixed 8-bit range they need to be stored as 32-bit floating point numbers. Since our displays require 8-bit images, we need one last step to bring the image intensities back down into the 0 to 255 range. That brings us to the final step in the process, called tone mapping, which refers to mapping HDR images to 8-bit-per-channel images. There are several algorithms implemented in OpenCV for this purpose, mostly designed to preserve as much detail as possible from the original image while converting it to 8 bits per channel.
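Continuing the sketch above, the calibration, merging, and an example tone mapping step (discussed next) might look like this; the Drago parameters shown are example values.

```python
# Estimate the inverse camera response function from the aligned exposures.
calibrate_debevec = cv2.createCalibrateDebevec()
response = calibrate_debevec.process(images, times)

# Merge the exposures into a single 32-bit floating point HDR image.
merge_debevec = cv2.createMergeDebevec()
hdr_image = merge_debevec.process(images, times, response)

# Tone map back to an 8-bit image for display (Drago's operator, one of the
# methods discussed next; the parameters are example values).
tonemap_drago = cv2.createTonemapDrago(1.0, 0.7)
ldr = tonemap_drago.process(hdr_image)
ldr_8bit = np.clip(ldr * 255, 0, 255).astype('uint8')
```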
The main thing to keep in mind is that there's no single correct way to perform tone mapping; sometimes the goal is an aesthetically pleasing image that isn't necessarily realistic. The algorithms implemented in OpenCV tend to produce fairly realistic results, but they differ from one another, and each has configurable parameters. In the first example we use Drago's method: we call the createTonemapDrago class to create an object, use that object to call the process method of that class, and simply pass it the HDR image, which returns the tone-mapped 8-bit-per-channel color image shown below. It does a very nice job of properly exposing all regions of the scene; it's very pleasing in my opinion, and even the buildings in the background seem properly exposed, so a very nice result. Moving on to the next example, this one uses Reinhard's method, and it also looks very nice, perhaps not quite as aesthetically pleasing but certainly very realistic, with everything properly exposed. Finally, there's one more example down here that's almost a combination of the two, with a little of the glow in the center like the first image, and again everything looking fairly realistic and properly exposed. So that's a summary of the HDR imaging process: we covered a lot of detail, but when you step back and look at how much code was required, it wasn't very much. That's all we really wanted to cover in this section, and we'll see you next time. In this video we're going to be talking about object tracking. This is a really interesting topic and a lot of fun to experiment with, and we hope you enjoy the demonstration. First of all, what is tracking? Tracking usually refers to estimating the location of an object and predicting its location at some future point in time, and in the context of computer vision that usually amounts to detecting an object of interest in a video frame and then predicting its location in subsequent frames. We accomplish this by developing both a motion model and an appearance model. The motion model, for example, estimates the position and velocity of a particular object and uses that information to predict its location in future video frames, while the appearance model encodes what the object looks like and searches the region around the location predicted by the motion model to fine-tune the object's location. In other words, the motion model approximates where the object might be in a future frame, and the appearance model refines that estimate. All of the code we'll use below is from the OpenCV tracker API, and we'll talk about that more as we scroll through the notebook. As a concrete example, suppose we're interested in tracking a specific object, like the race car identified here in the first frame of a video clip. In order to initialize the tracking algorithm we need to specify the initial location of the object, and to do that we define a bounding box, shown here in blue, given by two sets of pixel coordinates that define the upper left and lower right corners of the box. Once the tracking algorithm is initialized with this information, the goal is to track the object in subsequent video frames by producing a bounding box in each new frame.
We'll talk more about this below, but before we get into the code description let's look at the tracking algorithms available in OpenCV. There are eight different algorithms listed here, and we're not going to review the details of each of them, but it's worth noting that depending on your application one might be more suitable than another: some are more accurate, some are faster, and some are more robust to occlusions of the object being tracked. That's worth keeping in mind when you experiment with these different algorithms. One other thing worth mentioning is that the GOTURN model is the only one here that's deep learning based; we'll talk a little more about that further below. Just as a preview, I've got the test video clip right here, and let's play it once or twice. You'll notice that early on the car's appearance is relatively constant, as is its fairly uniform motion, but as it starts to make a turn we see the broadside of the car, the lighting changes quite a bit, and then the car gets smaller and smaller as it heads off into the distance. Those kinds of changes represent real challenges for some of the tracking algorithms, and we'll talk more about that. Let's start with the first code block in the notebook. Here we import some required modules, and on line 10 we indicate the file name of the video clip we're going to process. Then we define some convenience functions that let us render bounding box information on the output video stream and annotate the output frames with text. Recall that one of the algorithms is the GOTURN model, which requires an inference model, so this block of code downloads that model. The figure here is a very high-level description of how the GOTURN tracker is trained and used: in the center we have a pre-trained neural network, also known as an inference model, which takes as input two cropped images, one from the previous frame and one from the current frame. It uses the bounding box from the previous frame to crop both images, so the object of interest is centered in the previous-frame crop; if the object has moved, it won't be centered in the current-frame crop, because we used the previous frame's bounding box to crop both. It's then the job of the inference model to predict the bounding box in the output frame. That's just a high-level description of how it works. Let's scroll down a bit and look at the next code block, where we create a tracker instance. We start by defining a list of tracker types, which is just the list of string names available in the OpenCV API, and depending on the tracker algorithm you want to execute you set the appropriate index into that list; since it's set to two here, we're indicating that we'd like to use the KCF tracker. The if-else block that follows then calls the appropriate class to create the tracker object.
In the default case we'd be calling the TrackerKCF_create class to create a tracker object of that class. Scrolling down to the next section, this block of code sets up the input and output video streams: on line 2 we pass in the input video file name to create a video capture object, and on the next line we read the first frame from that file; then on lines 13 and 14 we do a similar thing for the output stream and create a video writer object that we'll write our tracking results to. In this next section, as we discussed earlier, we need to define a bounding box around the object we want to track, and here we're doing that manually: notice that I'm specifying the two sets of pixel coordinates for the corners of the box directly. In practice you would either select the box through a user interface or perhaps use a detection algorithm to detect objects of interest and do this programmatically, but for demonstration purposes we'll just set the box manually. Down here we're ready to initialize the tracker: we use the tracker object to call the init function and pass it the first frame of the video clip and the bounding box we defined above. Once that's done, we enter a loop to process all the frames in the video. The first line of code in the loop, on line 2, reads the next frame from the clip, and on line 10 we pass that frame to the tracker's update function, which hopefully returns a bounding box for the object. If we detect the object and get a bounding box back from the update function, we render a bounding box rectangle on the current frame; if not, the ok flag is false and we simply annotate the frame with a tracking failure message. Further below we also annotate the frame with the type of tracker being used and the computed frames per second, and then write the frame to the output video stream. That's all the loop does: it cycles through each frame in the clip, calls the tracker's update function, annotates the frame, and sends it to the output stream; a compact sketch of this loop follows this paragraph. Now let's scroll down and look at some results. This notebook has already been executed a few times with different trackers, so we're just going to replay those results; everything shown here is an output video stream annotated with tracker results. This first example is the KCF tracker, and I'll go ahead and play it so we can see how it performs. It looks like it does a fairly good job of tracking the car, a little off-center but still maintaining track on what is obviously the car in the frame; as the car rounds the corner it does okay, and then right here at the end it has some difficulty and drops track of the car.
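Here's the core of that tracking loop in compact form, a sketch with an assumed file name and assumed bounding box values; note that the OpenCV tracker API takes the box as (x, y, width, height).

```python
import cv2

video_input_file = 'race_car.mp4'        # placeholder file name
cap = cv2.VideoCapture(video_input_file)
ok, frame = cap.read()

# Create a tracker (KCF here; other choices include CSRT, MIL, GOTURN, ...).
tracker = cv2.TrackerKCF_create()

# Initial bounding box as (x, y, width, height) -- assumed values; in practice
# you would select this interactively or obtain it from a detector.
bbox = (1300, 405, 160, 120)
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    ok, bbox = tracker.update(frame)     # returns a success flag and the new box
    if ok:
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
    else:
        cv2.putText(frame, 'Tracking failure detected', (80, 140),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 0, 255), 2)
    cv2.imshow('Tracking', frame)
    if cv2.waitKey(1) == 27:             # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```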
Let's take a look at the next example, the CSRT tracker. This one does a little better job of tracking the car, with the bounding box encompassing most of the car and staying pretty well centered on it. As the car makes the turn, the box sits on the front of the car, but I'd say it still has the car in track, and then right at the end it looks like it's having difficulty maintaining a precise location for the car. Let's go down to the final example, the GOTURN tracker, which again was trained offline as a deep neural network. It maintains track as well; the bounding box is a little narrower but centered on the car, and as the car rounds the corner it still maintains track, with the car pretty much in the center of the box. The box gets a little wider for a moment, but as the car tails off into the distance the tracker keeps it essentially centered in the bounding box right through the end. So of the three examples, the GOTURN tracker probably did the best job of maintaining track throughout the entire video, especially right there at the end. We hope that gives you a good feel for how to exercise the various tracking algorithms in OpenCV, and especially for the small amount of code required to get something up and running. We encourage you to experiment further, try some of your own videos, and explore the various algorithms. We hope this was helpful to you, and we'll see you next time. In this section we're going to show you how you can use a pre-trained neural network to perform face detection, and to do that we'll be using OpenCV, which allows us to read in a pre-trained model and perform inference using that model. To get started, there's a little bit of code at the top of the script that sets the device index for the camera, creates a video capture object, and creates an output window for sending results to the display; since we covered that in a prior video, we'll skip that discussion and focus our attention here on line 47.
OpenCV has several convenience functions that allow us to read in pre-trained models that were trained using various frameworks. Caffe, TensorFlow, Darknet, and PyTorch, for example, are all deep learning frameworks that let you design and train neural networks, and OpenCV has built-in functionality for using networks pre-trained in those frameworks to perform inference. To be clear, you cannot use OpenCV to train a neural network, but you can use it to perform inference with a pre-trained network, which is very convenient for getting familiar with neural networks and getting started. The function readNetFromCaffe is specifically designed to read in a Caffe model, and it takes two arguments: the first is the prototxt file, which contains the network architecture, and the second is the Caffe model file, a much larger file that contains the trained weights. Notice that we're pointing to these files on the local system, but they can also be downloaded from the internet, so let's take a look at the git repo that contains these models. In this repo there are several scripts at the top level, and if you scroll down a bit you'll see a download_models script, and a little further down a readme file with a description and instructions for using that script to download various models. It turns out that the script references a models.yaml file, which is right here, and it's instructive to take a look at it. At the top of that file you'll see a block that references the Caffe model we're going to use, with the URL for downloading the weights file and several other parameters related to how the model was trained. We'll talk about these in a minute, since we'll reference these values in our script, but for now just notice that there's a mean value, a scale factor, a height and width, and an RGB flag. Going back to the script: when we call the readNetFromCaffe method, it returns an instance of the network, and we'll use that object further below to perform inference on test images from the video stream. The next section identifies the model parameters that were used when the model was trained, and it's important to be aware of these because any image we pass through the model for inference needs to be processed the same way the training images were. Here we have the size of the input images used to train the model, 300 by 300, a list of mean values for each of the color channels across all the training images, and a confidence threshold, a value you can set to determine the sensitivity of your detections. Scrolling down a bit further we enter the while loop. The first thing we do in the loop is read one frame at a time from the video feed, and on line 59 I flip the frame horizontally, just as a convenience so that when I point to things in the camera's field of view it's easier for me to do so; it has no other consequence. On lines 60 and 61 we retrieve the size of the video frame, and then line 64 is important: here we do some pre-processing on the image frame by calling the blobFromImage method.
There are several arguments here and we'll go through them, but all this amounts to is pre-processing the input image and putting it into the proper format so that we can perform inference on it. It takes as input the image frame from the video stream. The next argument is the scale factor; recall that the scale factor in the yaml file was one, but that's not always the case, since sometimes images are scaled to different ranges during training, and if that had been done here this value would be something other than one. Next is the input width and height of the images, which was 300 by 300 as we identified above, and then the mean value, which is subtracted from every image. Then there's the swapRB flag, where RB stands for red and blue; notice that it's set to false, and the reason is that Caffe and OpenCV use the same ordering convention for the three color channels. Some models use a different convention, and in those cases you would need to swap the red and blue channels. Finally there's the last input argument, crop, which indicates whether the input image should be cropped to the correct size or resized; because crop is set to false, we simply resize the image to 300 by 300. This function call returns a blob representation of the input frame with all of that pre-processing handled, along with a format change, and we then pass that blob to the setInput function, which prepares it for inference. The very next line, net.forward, makes a forward pass through the network, performing inference on this representation of our input image. Then, for however many detections are returned, we loop over all of them, and right here we determine whether the confidence for a particular detection exceeds the detection threshold. If it does, we proceed to query the detections list for the bounding box coordinates of that detection, and the rest of the code renders a bounding box rectangle for the detection on the image frame, builds a text string indicating the confidence level, and annotates the frame using OpenCV's rectangle and putText functions. Once we're done processing all the detections, we call the getPerfProfile function, which returns the time required to perform inference; we convert that to milliseconds, build another text string, and continue annotating the frame with the inference time. Finally we use imshow to display the annotated frame in the output window. So that's all there is; there isn't much code required to perform inference with the model, and in fact most of this code is related to annotating the frame itself. A compact sketch of the whole loop follows this paragraph.
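Here is a compact sketch of the inference path just described. The model file names, the mean values, and the confidence threshold are assumptions based on the description above, so treat them as placeholders.

```python
import cv2

# Pre-trained Caffe face detector: architecture (.prototxt) + weights (.caffemodel).
net = cv2.dnn.readNetFromCaffe('deploy.prototxt',
                               'res10_300x300_ssd_iter_140000.caffemodel')

in_width, in_height = 300, 300            # training input size, per the model's yaml entry
mean = [104, 117, 123]                    # per-channel mean values (assumed)
conf_threshold = 0.7                      # detection sensitivity (assumed)

cap = cv2.VideoCapture(0)
while cv2.waitKey(1) != 27:               # Esc to quit
    has_frame, frame = cap.read()
    if not has_frame:
        break
    frame = cv2.flip(frame, 1)
    h, w = frame.shape[:2]

    # Pre-process into a blob: scale factor 1.0, resize to 300x300, subtract the mean,
    # no R/B swap (Caffe and OpenCV share the same channel order), no cropping.
    blob = cv2.dnn.blobFromImage(frame, 1.0, (in_width, in_height), mean,
                                 swapRB=False, crop=False)
    net.setInput(blob)
    detections = net.forward()

    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > conf_threshold:
            x1 = int(detections[0, 0, i, 3] * w)
            y1 = int(detections[0, 0, i, 4] * h)
            x2 = int(detections[0, 0, i, 5] * w)
            y2 = int(detections[0, 0, i, 6] * h)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, 'confidence: %.4f' % confidence, (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

    t, _ = net.getPerfProfile()
    label = 'inference time: %.2f ms' % (t * 1000.0 / cv2.getTickFrequency())
    cv2.putText(frame, label, (10, 25), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
    cv2.imshow('Face Detection', frame)

cap.release()
cv2.destroyAllWindows()
```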
At this point we're ready to execute the script, and when we do, we'll cycle through some demonstrations and see how it performs. Here you can see the model detecting my face very nicely, and I can obscure my face a little with my hand and it still does a nice job. The reason I like using the video stream is that it's a lot of fun to experiment with: you can hold up images to the camera and vary their scale and orientation in real time, and that's what we're going to do. I've got a magazine here with a lot of interesting images in it, so we'll cycle through it and see what you think; I'll scooch out of the way and we'll get started in just a second. In this first image you can see the boy's face is in a downward pose and his bangs are obscuring his forehead and even a portion of his eyes, yet the model still detects his face nicely, so we thought we'd start with this image and progress to some that are a little more difficult. In the next image the young girl's face is also in a downward pose, but it's a profile view as well, and she's wearing eyeglasses, which may present an additional challenge, yet the model still performs nicely. The next couple of images have a mixture of the face and some graphics mixed in, a kind of mixed-media obscuration of the face if you will, and the model does very nicely on this one; here's another example on the opposing page, and in this case notice the different scales of the faces being detected, which the model still handles very nicely. This next one coming up is my favorite, primarily because it's the most impressive. As you can see, the woman's face is heavily occluded, and there's even been some manipulation of the image around her eyes and around her chin and mouth, almost a blurring to some extent. Both of those represent significant challenges, yet the model is able to detect her face fairly well. We hope this gets you excited about computer vision and especially deep neural networks. Just remember that you don't have to train your own models: you can use a pre-trained model, as we've done in this demonstration, and write just a small amount of code to do your own testing with your own images. That's all we wanted to cover, we encourage you to try it, and we'll see you next time. In this section we're going to describe how to perform deep learning based object detection; specifically, we'll be using a neural network called SSD, the Single Shot MultiBox Detector, trained using TensorFlow, and as in previous videos we'll be using OpenCV both to read the model and to perform inference on some sample test images. Looking at the name, the "single shot" refers to the fact that we make a single forward pass through the network to perform inference, yet detect multiple objects within an image. Like other types of networks, SSD models can be trained with different architectural backbones, which essentially means you can keep the same detection concept while using different backbones depending on your application; in this case we're using a MobileNet backbone, a smaller model intended for mobile devices. Before we get started I wanted to point out a resource: there's a TensorFlow object detection model zoo at this URL, and if you go to that repository you can download a variety of different object detection models, so we wanted you to be aware of it. In this particular case we're going to use the ssd_mobilenet_v2_coco_2018 archive listed here, and if you extract it you'll see it has a structure like the one shown here. We simply wanted to point out that you only need one file from that archive, the frozen inference graph, which is the weights file for the model.
the frozen inference graph, right here, which is the weights file for the model. There are actually two other files we'll need in order to run this notebook, so let's scroll down and take a look at those as well. Right here we're specifying the three files that are required: the frozen inference graph, which we just described; a configuration file for the network, indicated here with the .pbtxt extension; and the class labels for the dataset used to train this model, which is the COCO dataset. You can Google the COCO dataset and retrieve this class labels file from numerous places on the internet, and for the configuration file there's actually a script you can use to generate it from the frozen inference graph, indicated right here. We've already executed this notebook, so we have all these files locally on our system, but we wanted to review how to obtain each of the three files. One thing worth pointing out at this stage: take a look at the class labels printed in the lower portion of the screen, and notice the difference between a deep learning object detector and a traditional computer vision object detector. We used to have a separate detector for every class, for example a face detector, a person detector, and so on, and those were all separate models. Deep learning models have enormous capacity to learn, so a single model can detect multiple object classes over a wide range of aspect angles and scales, which is the real beauty of deep learning. Let's scroll down a little further to the next section of the notebook. To summarize, here are the three steps that need to be performed: first, load both the model and the input image into memory; then detect objects using a forward pass through the network; and finally display the detected objects with bounding boxes and class labels. The first step is indicated here, where we're calling the OpenCV function readNetFromTensorflow. That takes as input the model file and the configuration file, both of which we specified above, and returns an instance of the network, which we'll use further below to perform inference.
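As a rough sketch of that first step, the snippet below loads the three files and creates the network instance. The exact file names are assumptions based on what typically ships with the ssd_mobilenet_v2_coco_2018 archive and the OpenCV samples (the .pbtxt is commonly generated with the tf_text_graph_ssd.py script from the OpenCV repository), so adjust them to match your local copies.

```python
import cv2

model_file = "frozen_inference_graph.pb"                 # weights from the archive
config_file = "ssd_mobilenet_v2_coco_2018_03_29.pbtxt"   # generated config (assumed name)
class_file = "coco_class_labels.txt"                     # COCO class labels, one per line (assumed name)

# Read the class labels into a list so a class id can be mapped to a name.
with open(class_file) as f:
    labels = f.read().splitlines()

# Load the TensorFlow model into OpenCV's dnn module.
net = cv2.dnn.readNetFromTensorflow(model_file, config_file)
print(len(labels), "classes, for example:", labels[:5])
```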
Next we define a convenience function called detect_objects, which takes as input the network instance and the test image. We've seen this before: there's another OpenCV function called blobFromImage, which takes the test image plus several other arguments related to pre-processing it. Recall that when we prepare an image for inference, we need to perform the same pre-processing on it that was performed on the training set, and this function's arguments cover that required pre-processing. The first argument is a scale factor, set to 1, which indicates that the training set didn't have any special scaling performed on it. Then we indicate the size of the training images, 300 by 300, so the test image will be reshaped to this size. The next argument is the mean value: if the training images had a mean value subtracted from them, this would be some other vector, but since these images don't require any mean subtraction we simply pass zeros here. The next argument, swapRB, controls whether we want to swap the red and blue channels, and in this case we do, since the training images used a different channel convention than OpenCV. Finally, the crop flag is set to False, which means the images are simply resized rather than cropped to the right size. This function returns a blob representation of the image that's been pre-processed, so there's a pre-processing step and also a format conversion step, if you will. The blob representation of the image is then passed to the setInput method to prepare the network for inference, and finally we perform inference on the test image by calling the forward method, which returns some number of detected objects; we return that from the function. There are a couple more convenience functions below, so let's take a look at those. The first one, display_text, takes in the test image frame, a text string, and some coordinates; it simply annotates a bounding box with the class label by drawing a black rectangle and writing the class label text inside it. Finally there's the display_objects function, which takes the test image, the list of objects that were detected, and the detection threshold. Here we retrieve the shape of the input test image, then loop over all the objects detected by the network and retrieve their class IDs and scores. In the next section we retrieve the coordinates of each object's bounding box and convert them to the original test image coordinates, and then, if the score for the object is greater than the input threshold, we annotate the frame with the class label by calling the display_text function we just described, and render the bounding box rectangle on the test image frame in white, right here.
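Here is a minimal sketch of those convenience functions, assuming the detections come back in the usual [1, 1, N, 7] layout with normalized box coordinates. The 0.25 threshold, the font settings, and the drawing colors are illustrative choices, not values quoted from the notebook.

```python
import cv2

def detect_objects(net, im, dim=300):
    # Pre-process: scale factor 1.0, resize to 300x300, zero mean,
    # swap R and B (the TensorFlow model expects RGB), resize rather than crop.
    blob = cv2.dnn.blobFromImage(im, 1.0, (dim, dim), (0, 0, 0), swapRB=True, crop=False)
    net.setInput(blob)
    return net.forward()          # detections with shape [1, 1, N, 7]

def display_text(im, text, x, y):
    # Draw a filled black rectangle and write the class label inside it.
    font, scale = cv2.FONT_HERSHEY_SIMPLEX, 0.7
    (tw, th), baseline = cv2.getTextSize(text, font, scale, 1)
    cv2.rectangle(im, (x, y - th - baseline), (x + tw, y + baseline), (0, 0, 0), cv2.FILLED)
    cv2.putText(im, text, (x, y), font, scale, (0, 255, 255), 1, cv2.LINE_AA)

def display_objects(im, objects, labels, threshold=0.25):
    rows, cols = im.shape[:2]
    for i in range(objects.shape[2]):
        class_id = int(objects[0, 0, i, 1])   # assumes labels list is ordered by class id
        score = float(objects[0, 0, i, 2])
        # Bounding box coordinates are normalized; convert to image coordinates.
        x1 = int(objects[0, 0, i, 3] * cols)
        y1 = int(objects[0, 0, i, 4] * rows)
        x2 = int(objects[0, 0, i, 5] * cols)
        y2 = int(objects[0, 0, i, 6] * rows)
        if score > threshold:
            display_text(im, labels[class_id], x1, y1)
            cv2.rectangle(im, (x1, y1), (x2, y2), (255, 255, 255), 2)
    cv2.imshow("Detections", im)

# Example usage (file name is a placeholder):
# im = cv2.imread("street.jpg")
# objects = detect_objects(net, im)
# display_objects(im, objects, labels)
```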
Now let's take a look at some results. You can see we're reading in a test image, and then we use the detect_objects function we created above, passing it the network instance and the image we just read; the return value is the list of detected objects. Then we call display_objects, passing in the test image and the array of objects, and you can see the result down here: all kinds of objects are being detected. There's a person here, a bicycle here, a car here, cars off in the distance, and even way out in the distance a traffic light has been detected. This is a very robust object detection model with about 80 classes. Let's look at another example below, a sports scene. You can see the same sequence: read the image, call detect_objects, then display_objects. In this case we're getting both people in the image, the bat, the baseball glove, which is really nice, and then the baseball; but notice that the baseball actually has a false positive. There's only one baseball, yet the detection algorithm reports two, so there's a false positive there, but other than that it has done a very nice job. Let's look at one more example, another sports scene, and you can see it's detecting the soccer player here, the soccer ball here, and then there's a false positive here: it thinks the tip of his shoe is another sports ball. One thing you can do in cases like this, after you've established some number of false positives, is take these image examples and perform what's called hard negative mining: training the network with additional examples like these to reduce the number of false positives. We hope that's been a nice introduction to object detection; that's all we wanted to cover in this section, and we'll see you next time. In this section we're going to show you how to perform 2D human pose estimation on your own images and video using a pre-trained model called OpenPose. OpenPose was developed at the Carnegie Mellon Perceptual Computing Lab, and if you're not familiar with pose estimation, the figure below from the OpenPose research paper provides a nice graphic. Essentially, the problem requires taking an input image that may contain one or more people, identifying the key points associated with the major joints of the human anatomy, and then logically connecting those key points, as shown in the figure on the right-hand side. The model actually produces two types of output: part confidence maps and part affinity fields. However, for the code demonstration below we'll only have a single person in the image, and therefore we'll only need the confidence maps, which are also referred to as probability maps; we'll see how that's done further below. A brief bit of history: for a long time, human pose estimation was a very difficult problem to solve robustly, especially on some of the more challenging benchmark cases. The reason the problem can be hard is that joints are not always very visible, there are numerous opportunities for occlusions of one type or another, clothing or other objects can further obscure the image, and then there's the added complexity of not only identifying key points but associating them with the right people when there are multiple people in the image. However, once deep learning was applied to this problem domain just a few years ago, we began to see dramatic improvements, and it's been exciting to see just how well these models now perform. In this demo we'll be using the OpenPose Caffe model trained on the MPII image dataset, and we'll be doing that with a single image, which we'll get to in just a minute. But human pose estimation is often applied to video streams for applications such as intelligent trainers, so we wanted to start with some example results on a video clip to whet your appetite; then we'll walk through the code for a single-image implementation. Just remember that the code can easily be adapted to process a video stream, as we've shown in prior videos. Scrolling down to this first example: this video clip was processed using OpenPose, and the results have been overlaid on the video stream, so let's take a look. All three hockey players are wearing bulky uniforms, which is a challenge, and they're also occluding each other to some extent, yet the model performs pretty nicely if you look at the results. It's a lot of fun to work with, and of course we'll be looking at a single-image example, but we thought it was instructive to show you how exciting it is to process video. Let's continue on
and take a look at the rest of the notebook. In this first section we're simply specifying the model: the prototxt file here and the Caffe model, or weights file, right here. We've downloaded those already and have already executed this notebook, but there are references in the notebook for where you can download these files. In the next section we specify the number of points in the model and the associated linkage pairs, by their indices. Each of these blocks refers to a linkage in the human anatomy: 0 is the head, 1 is the neck, 2 is the right shoulder, 3 is the right elbow, and so forth. This is the mapping the model used during training, and we'll need it to process the output from the network further below. Then, right here on this line, we call readNetFromCaffe, which we've seen in a previous video: we just pass in the prototxt file and the weights file for the trained network, and that creates an instance of the network for us, which we'll use below for inference. Now we're ready to read in our test image, which we do right here in this code block with imread; on the next line we swap the red and blue color channels, and then these two lines retrieve the size of the image, which we'll use further below. Let's take a look at the image. This is a picture of Tiger Woods hitting a driver, seen from behind at the top of his backswing, and I chose it because it's a little challenging and makes a nice example. Notice that his upper body is at right angles to his lower body: his lower body faces to the right of the camera while his upper body faces the camera, and his left arm is occluding his right shoulder, so that's going to make things a little more complicated. Let's continue to the next section. Now we're at the point where we're ready to pre-process our image. Recall that when networks are trained, the training images have a specific size and potentially some scaling performed on them, and we need to make sure that whatever images we perform inference on are pre-processed in the same way. Here we set the net input size to 368 by 368, and then we call the OpenCV function blobFromImage. Recall from a previous video that this takes several arguments related to that pre-processing, and it also converts the image into a blob representation, which we'll pass into the setInput function to prepare the network for inference. Let's review these arguments briefly. The first argument is the image itself; the second is a scaling factor, the same scaling factor that was applied to the training images, so we need to perform that same transformation on the input image. Then we indicate the net input size we just talked about, 368 by 368. There was no mean value subtracted from the training images, so we simply pass a vector of zeros; the swapRB flag is set to True; and we're not cropping, so we resize the input image to match the size of the images used during training, which was 368 by 368.
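A minimal sketch of this setup is shown below. The file names, the 1/255 scale factor, and the 15-point MPI linkage table follow the commonly used OpenPose MPI sample files; they are assumptions here rather than values read directly off the notebook, so treat them as placeholders.

```python
import cv2

proto_file = "pose_deploy_linevec_faststage.prototxt"   # network definition (assumed name)
weights_file = "pose_iter_160000.caffemodel"            # trained weights (assumed name)

n_points = 15   # MPI model: 0 head, 1 neck, 2 r.shoulder, 3 r.elbow, 4 r.wrist, ...
POSE_PAIRS = [[0, 1], [1, 2], [2, 3], [3, 4], [1, 5], [5, 6], [6, 7],
              [1, 14], [14, 8], [8, 9], [9, 10], [14, 11], [11, 12], [12, 13]]

# Create the network instance from the prototxt and weights files.
net = cv2.dnn.readNetFromCaffe(proto_file, weights_file)

im = cv2.imread("tiger-woods.jpg")                # test image (assumed file name)
im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)          # swap the red and blue channels
im_height, im_width = im.shape[:2]

# Pre-process: assumed 1/255 scaling, resize to the 368x368 net input size,
# zero mean, swapRB=True, resize rather than crop.
net_input_size = (368, 368)
blob = cv2.dnn.blobFromImage(im, 1.0 / 255, net_input_size, (0, 0, 0),
                             swapRB=True, crop=False)
net.setInput(blob)
```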
Now we're ready to use the model to perform inference on our test image, and we do that right here by calling the forward method. That returns the output from the network, which consists of both confidence maps and affinity fields, and as we mentioned earlier, we'll only be using the confidence maps for key point detection in this demonstration. For each key point we receive a probability map, and in the next two lines of code we simply plot each of these probability maps. You'll see that they're color-coded heat maps indicating the probability of the location of the detected key point, with red being a very high probability. So in each of these probability maps you can see the likely location for key point 0, key point 1, key point 2, and so forth; remember that 0 corresponds to the head, 1 to the neck, 2 to the right shoulder, and so on. We can use these probability maps to overlay the key points on the original image, and to do that we have to scale them to the same scale as the input image, which is what the next block of code does. Right here we use the output shape of the network, in other words the shape of the probability maps, together with the shape of the input test image, to compute two scale factors, x and y, that we'll use below to determine the location of the key points in the actual test image. Before we do that, we need to determine the location of the key points in the probability maps, which is what we do in the next code block. This for loop iterates over all the key points, and for each one we retrieve its probability map from the network's output array and call the OpenCV function minMaxLoc, passing it the probability map. That returns the location of the point associated with the maximum probability; the coordinates of that point are in this variable here, point. Once we have that location in probability-map coordinates, we multiply it by the x and y scale factors we computed above to get the key point location in the original test image, and if the probability returned by the function is greater than a minimum threshold (which we set above), we take that x, y point, now in the coordinates of the test image, and append it to a list of points. And now we're ready to render those points on the test image.
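Continuing the sketch above, the snippet below runs the forward pass, extracts one key point per probability map with minMaxLoc, scales it back to image coordinates, and then does the simple rendering that the next part of the walkthrough describes. The 0.1 threshold, the variable names, and the colors are illustrative assumptions.

```python
threshold = 0.1                        # minimum confidence for a key point (assumed value)

output = net.forward()                 # shape [1, channels, H, W]: confidence maps (+ PAFs)
scale_x = im_width / output.shape[3]   # map probability-map coords back to image coords
scale_y = im_height / output.shape[2]

points = []
for i in range(n_points):
    prob_map = output[0, i, :, :]
    # Location of the maximum probability in this map = most likely key point position.
    _, prob, _, point = cv2.minMaxLoc(prob_map)
    x, y = scale_x * point[0], scale_y * point[1]
    points.append((int(x), int(y)) if prob > threshold else None)

# Render: numbered points on one copy of the image, the connected skeleton on another.
im_points, im_skeleton = im.copy(), im.copy()
for i, p in enumerate(points):
    if p is not None:
        cv2.circle(im_points, p, 8, (255, 255, 0), -1)
        cv2.putText(im_points, str(i), p, cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2, cv2.LINE_AA)

for part_a, part_b in POSE_PAIRS:
    if points[part_a] is not None and points[part_b] is not None:
        cv2.line(im_skeleton, points[part_a], points[part_b], (255, 255, 0), 2)
        cv2.circle(im_skeleton, points[part_a], 8, (255, 0, 0), -1)
```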
Let's scroll down to our results and first just take a look at the image. This is the input image with all the key points annotated on the frame, and the image to the right shows the same key points, without the numbers but with the linkages connected, so two different views of the same data. If you look at the area we knew was going to be difficult (the head at 0, the neck at 1, the right shoulder at 2, the right elbow at 3, the right wrist at 4), it looks really nice, even though the left arm is occluding the right shoulder. And if you go back to the image on the right, you can see the skeletal view looks pretty spot on, so the model did a nice job of detecting the key points and connecting them in a way that makes sense. Let's walk through this code a little. On the first two lines we make two copies of the input image, one called points and the other called skeleton. Then we loop over all the points we just created in the for loop above, which are the coordinates of the key points in the test image coordinate frame, and we use the OpenCV circle and putText functions to draw and label those points on the points image, the image on the left down here. Next we render the skeleton view displayed on the right with this for loop: we loop over all the pose pairs defined further up in the notebook, retrieve each pair as part A and part B, and use those as indices into the points list we created above, which contains the key point locations in the test image. Then we simply use the OpenCV line and circle functions to draw a color-coded line from one joint to the next and a circle at the first key point in that link. Finally we use imshow to display both images below. So that's all there is to it: there wasn't much code in this notebook, since we're leveraging OpenCV's ability to perform inference for us; really, the code amounts to a few function calls and a little logic to parse the outputs and render the information on the original image. Just for fun, I ran the same model on a photo of my son, who also plays golf; this is a view of his follow-through. The model does a pretty good job, but notice one thing: if you look over to the right, the neck and the right shoulder are off a little. This is the head, point 0; this is point 1, which should be right here; and this is the right shoulder, which should be over here. But notice that his right shoulder is actually occluded quite a bit by his back, to the point of being almost invisible, so it's definitely a challenging pose, and it was a lot of fun to try this out. The main point, though, is that you can use a pre-trained model, leverage the inference capabilities of OpenCV, and start playing around with your own images and video; we think you'll enjoy doing this. Thanks so much, that's all we wanted to cover in this video, and we'll see you next time. Well, thanks everybody, I hope you enjoyed the course we just covered in this getting started series in computer vision. We covered a lot of ground and a lot of material at a pretty good level for a getting started series, and I think this is a good opportunity for us to talk with Dr. Satya Mallick, the CEO of OpenCV.org, to get his take on how you get a job in computer vision, for example, and on some of the other course offerings at OpenCV.org. Thanks a lot, Bill; it's a pleasure to talk to this audience, because they have just completed their first steps in OpenCV, and I can understand it's very exciting. It's a very exciting field, and the next thing people ask is, how do you get a job in computer vision and AI? The path is that you have to dedicate yourself, you have to commit yourself to learning various aspects of computer vision, but there is a path, and if you commit to it you will find a job at the end of it. And as you know, Bill, your own journey is a great example of that: you started your
journey in aeronautics and astronautics; you did your master's at MIT in that area, and for the longest time you worked in that area. But then you made a switch, because in your job you wanted to use AI, and gradually you moved into computer vision and AI. In fact, I'd love to hear that story, so let's start there. Yeah, sure, I'm excited to talk about that. I worked in the space and defense industry for many years, a very rewarding experience; I loved the idea of space travel when I was a young child and ended up graduating with a degree in aeronautics and astronautics, and it was a wonderful experience for me. Along the way I had the opportunity to work on a project that involved something completely unrelated to what I was doing: I was performing technical management on a project related to machine learning, about eight or nine years ago, and it was really exciting to see just how stunning the results were on some of the use cases we were working on. It was a small research project, and shortly afterward I began a journey of expanding my continuing education. I studied machine learning and also image processing, and as I moved forward I felt a strong passion for working in the field of computer vision. So I spent quite a bit of time taking extra classes while working full time, early-morning study sessions and lots of weekends spent learning all this material, and I'm happy to say that I've landed in a very nice place: I'm working in the field full time now and I really enjoy it. Thank you. Yeah, and you took Deep Learning with PyTorch, which is one of our courses, as well, right? Right. So as I was saying, there is a path. It's not necessarily easy, especially if you're working full time; you have to dedicate your nights and weekends. But there is definitely a path for people who are interested, and if you're a student there is a very clear, relatively easy path. One thing I like to tell people is that if you want to learn physics, you have 200 years of history to learn before you can contribute to the field. Computer vision, on the other hand, is a relatively new field: even though research started in the 1960s, it's really only now that algorithms that work in the real world are available to the general public, so you're literally talking about a decade's worth of techniques, which frankly is not a lot. For people trying to start in this area, especially those who are not doing it full time, I would say set aside anywhere between six months and a year to learn the material thoroughly, and that's hard work: six months to a year of nights and weekends to get a foothold in this field, to the point where you can think about making a career switch. There are people in our courses who were able to make a switch by taking just one course, but I usually recommend that you understand traditional computer vision algorithms and also learn deep learning algorithms; you need a flavor of both. Let me actually start by explaining the various
aspects of AI, so that we lay out the landscape and it's easy for beginners to understand the topics we're talking about. The first question people ask is, what is artificial intelligence? It's a fuzzy term; it doesn't refer to a specific technique, but in general, whenever we try to make machines think like humans, we call it artificial intelligence. There are various ways of approaching this. You can think about a rule-based system where you encode all the rules, like in a chess game: if you do this, I'm going to do that. That used to be very popular a few decades back, but then people realized you can actually train the machine by giving it data, not explicitly telling it what the rules are, and it will figure things out automatically from the data. That's machine learning: a subset of AI techniques where we care about data. The other part is deep learning. You may have heard a lot about this newer kind of technique, which is nothing but solving machine learning problems using deep neural networks; in our course we explain why it's called "deep", which is a subtle point, but it is basically solving AI problems using neural networks. Computer vision overlaps with deep learning and the rest. Computer vision basically means the analysis of images and videos, and it's different from image processing: in image processing you have an input image and the output is also an image (you could be encoding an image, enhancing an image, and so on), whereas in computer vision we usually have an input image and the output is the information we want. The input could be this video session, and the output could be the faces that we detected: image in, information out. Not only that, computer vision also handles many things that have nothing to do with artificial intelligence. For example, in this video series you learned how to create panoramas; that is a classical computer vision technique and has nothing to do with AI, because you're stitching images together using the geometry of image formation, techniques that are not really machine learning or AI but are still very useful. So there is an overlap between artificial intelligence and computer vision, but computer vision does a lot more as well, for example the whole field of 3D computer vision: there used to be very little AI in it, though now AI is being used to enhance those areas too. That's the general lay of the land. So from the artificial intelligence standpoint, computer vision is like machine learning for images; that's one way I like to think about that component of it, and image understanding is another way to think about the portion of computer vision associated with machine learning and artificial intelligence. Yeah, and there are other fields of AI, for example natural language processing, where you deal with text data. We also have speech recognition, which is another big field: for example, when you talk to Alexa, there's an artificial intelligence module that recognizes your voice, does the processing,
interprets it, and so on; that's speech processing. But among these different fields I think computer vision has the biggest potential, because if you look at the human visual system, it spends roughly 30 percent of its processing power on vision, because visual information is so rich. We're also in a lucky spot in that there are hundreds of millions of cameras in the world continuously gathering data, so there's a lot of activity in the computer vision space. You know, even in aerospace, for military and other applications, there are a ton of uses; and everybody's mobile devices have cameras, so there's all kinds of video and image processing taking place. Yeah, and it's going to transform multiple industries. We also do consulting work, and we see people using computer vision in manufacturing, in agriculture, in security, and obviously in autonomous driving, which is a very big area. Computer vision is everywhere, and it's going to explode; it's already on the rise, and there are so many jobs out there. We'll show you, right here in the video, the pay grades of people who get jobs in this area: you can see that people make more than a million dollars at companies like Facebook. Of course these are senior people, level-six engineers who are accomplished and at the cutting edge, but it gives you a sense; this is not an entrepreneur or a senior manager, this is an AI engineer making that kind of money at big companies. So that gives you a sense of why it's worthwhile taking a very hard look at these emerging fields as a career option. So what about some of the course offerings, do you want to talk about some of those? Yeah, sure. But before we even go there, it's instructive to know what you need to learn to get into the field and which libraries you need to know. The very first thing is that it's very important to be very good at coding these algorithms and building systems; you should be a very good programmer first, because this is a very engineering-oriented field and you need to be able to write code. Python is a great place to start, but don't end there; learn as much as possible, and if you need to learn C++, learn it, because it expands your chances of getting a job. But let's start with Python. Suppose you have Python expertise: what do you need to learn to get a job? First of all, as I said, OpenCV is a fundamental library; you have to have good expertise in using OpenCV, and that's number one. Then there are two other libraries that are very important. One is PyTorch, a deep learning framework from Facebook; it's open source and was developed at Facebook. The second is TensorFlow with Keras, developed by Google and also open source. These are all very good libraries, and if you have mastery over these three libraries, and mastery over computer vision and deep learning techniques, I think there is no way in the world you will not get a job; it's very easy to get a job once you have those two or three things under your belt. From our course point of view, Computer Vision 1 covers traditional computer
vision applications, and not so much deep learning; we show you how to use deep learning in applications, but we don't show you how to train deep learning models. Computer Vision 2 is all about applications, and there we go over many different applications; we don't even worry about which library you choose, we expose you to several libraries so you can build your arsenal of techniques, libraries, and tools for building real-world applications. The third course is Deep Learning with PyTorch, where you go over the fundamentals of deep learning using PyTorch, and by the end of this year, 2021, we will also launch Deep Learning with TensorFlow and Keras. So basically anything related to computer vision that involves deep learning is covered in these courses: we go over image classification, object detection, image segmentation, pose estimation, and so on. These are very meaty courses in the sense that we go over all the theoretical details; you will learn what backpropagation is and things like that. Once you take these courses, I think it's equivalent to getting a master's in computer vision and machine learning; you will get that level of knowledge. In fact, I can easily say that by the time I completed my master's, I did not have the knowledge that people gain by taking Computer Vision 1 and Deep Learning with PyTorch, even just those two courses, and I can say that without any hesitation. These are solid courses: we took the best from various sources, and we are very industry-oriented, very applications-oriented, so we picked the topics that are actually used in industry and left out the things that are only of theoretical importance. So what about the prerequisites? People might be wondering what's actually required to get started, and what the potential learning tracks are with the three or four courses you offer. The prerequisite is just an intermediate level of knowledge of the Python programming language; once you have that, you don't need any other prerequisites. But if you're just starting out and you've never tried OpenCV, I would suggest taking our first course, OpenCV for Beginners. That's meant to be a quick, fun course; it's short, it's affordable, and it's something you can try in a month, and it will give you an idea of whether you actually enjoy building applications and whether you actually enjoy this field of computer vision and AI. Once you've completed that course and you're certain that you want to invest time and energy into this field, then I would say take Computer Vision 1, which is about classical computer vision; we also cover some deep learning there, but it's very important to have the foundation ready. A lot of people jump directly into deep learning. That's also an option, but then what happens is that in the real world there are many problems you don't solve using deep learning; it's just absurd to use that technique for some very easy problems in computer vision, and if you don't have that foundational knowledge, you will go looking for a nail because you have this deep learning hammer.
Exactly. Right, so that is a pitfall people should avoid. So Computer Vision 1 is the second course, and you can also skip OpenCV for Beginners if you're convinced you want to commit three to four months: you can take Computer Vision 1 directly and then take Deep Learning with PyTorch. If you have more time to invest, say you're a student, then you should definitely take all four courses, or at least start with Computer Vision 1, Computer Vision 2, and Deep Learning with PyTorch; and when Deep Learning with TensorFlow and Keras comes along, that will also be very useful. So there must be a little overlap between Computer Vision 1 and Computer Vision 2, but Computer Vision 2 is more application-focused; is it considered a little more advanced, or what are the prerequisites for that? It's not more advanced, but we don't dive into a lot of theory there; we're more interested in teaching people about the tools. For example, if we cover something like a barcode or QR code scanner, we'll tell you which is the best library to use for that application, but we may not go into how a QR code is actually read; we don't go into that level of detail. Similarly, we cover applications related to faces, face swapping and things like that, and in those cases we go over enough theory for you to understand what's going on, but we may not go into the theoretical detail of, say, facial landmark detection, which can be mathematically challenging for people. So it's based on building applications; that's the main focus. We'll teach you how to build web applications, for example, so you get exposed to a wide variety of applications. One thing I noticed in my own continued education in this area was that having a little bit of overlap is actually valuable, because you might cover one particular topic in a class and then hit it again in another class, maybe from a slightly different perspective, or just with the passage of time you're looking at the same topic several months later from a slightly different vantage point, and I found that very helpful. Actually, that's true about our Deep Learning with PyTorch and Deep Learning with TensorFlow courses as well. You may ask, if I've taken Deep Learning with PyTorch, does it make sense to take Deep Learning with TensorFlow too? The short answer is yes, absolutely. TensorFlow is the most popular deep learning library in the world, and PyTorch is the one rising the fastest. A lot of people like using PyTorch because it's very pythonic; for Python developers the learning curve is gentle, and you can pick it up easily. But when you go to industry looking for a job, you want to offer the best of what you have; you cannot say PyTorch is my library of choice, I don't work in TensorFlow, because then the people who actually use TensorFlow as engineers are not going to hire you. We should not be married to our tools; we are there to solve problems. And if you take Deep Learning with TensorFlow, the theory is already covered in Deep Learning with PyTorch as well, so there will be overlap and there will be repetition, but it will be in the context of a new course, so you're actually
looking at it from a different framework; TensorFlow is a different framework, and we're also adding new applications, so you get different applications in these two courses. With, say, 30 percent extra effort you learn a new framework, and now you've covered everything with PyTorch and TensorFlow, you're all set in deep learning, and you also know the theory, because whichever course you take, we cover the theory that's necessary. So, as you said, it's also a revision of the theory when you take the second course. And people should not be married to tools; I say this about Python and C++ as well. Python is definitely the first and easiest language to get started with in AI, but don't ignore C++ completely, because when you go to the job market you may find there are an equal number of jobs in the two areas, and you've already done the hard work of learning the basics, the foundation; now you're only worried about not liking one language or finding the other hard. A language is just a way of solving problems; we should not be married to one particular language, because as engineers we want to see ourselves as problem solvers, and whatever the right tool is for the problem, we'll use it. Two of our courses, Computer Vision 1 and Computer Vision 2, are offered in both languages; in fact, when you purchase the course in one language, the other language version is free, so you can compare code side by side and see how something is done in the C++ version of OpenCV. That's really great; I like that feature especially, I think that's great. So we talked about Python versus C++ earlier, but some people might be wondering: obviously both would be good to have under your belt, but to what extent is each language used in industry? If you only knew one language very well, or had an affinity for one of those languages for whatever reason, what is the market share for that language in industry? So Python is fast becoming the language of scientific computing; you can easily learn the concepts using Python, and there are several jobs; I would say 50 percent of the jobs would easily take somebody with Python skills. But then there's this whole other world of embedded computer vision, for example, where you're trying to do computer vision on devices that are not very powerful; you're not using a GPU, or if you are, it's an embedded GPU or something similar. In those cases the algorithms are often implemented in C++, and sometimes even in C, because you don't have access to that much computational power; and that's also a very big market, if you think about all the security cameras and all kinds of embedded devices. So if you love one language, Python is great: try Python, and you will gradually learn C++ over time. What I want to emphasize is that you shouldn't think of yourself as a Python developer only; think of yourself as an engineer, and you will learn whatever needs to be learned to make yourself fit for the job market. All right. So one thing people might be wondering is that there's certainly a lot of freely available course material on
the internet; many top universities are offering many of their courses online for free, and there are lots of tutorials. How do people sort this out, and what do your courses provide that maybe isn't available online for free? I actually encourage people to look at all the free material first, because it builds confidence and momentum; they start knowing what to expect and what they want to learn. So by all means, there are a lot of free tutorials on OpenCV.org and on LearnOpenCV.com; go ahead and try those, and there are several other very good bloggers you can try, and free courses from universities as well. But at some point you may feel that you're missing structure. Sometimes what happens is that you learn all these pieces of computer vision but you don't get the bigger picture, and you're not confident enough to present yourself in front of a job interviewer, because you know you learned in bits and pieces and haven't connected the dots completely, and that can be a blow to your confidence. When you take a course, it's structured, and you know the important things have been covered; and if the interviewer asks you something beyond what you've learned, you can confidently say, I haven't learned that, and that's okay, but I know these other things, because I've gone through a structured path to learn them. In our courses you also go through a lot of code, and you write a lot of code. As I was saying, you need to be very good at programming, so you need to write a lot of code and read a lot of code; that's very good practice. We also give you assignments and projects, and together that creates a sense of urgency and a sense of responsibility to finish. What happens with online material is that you read the code but never write any code; you feel you've understood it, but if you take that code away, you can't write it from scratch; you use the material as a crutch. When you start doing something like a project, it actually comes together. The job market is basically a confidence game: you need your expertise at a level where you're confident you can face an interviewer. And beyond that, on the internet, in a blog format, it's simply not possible to cover things in depth, so a lot of the time people gloss over things, and we do that ourselves too. In a blog you have to condense everything to be easily consumable; you don't have the reader for an hour, you have to make sure they learn what they want in ten minutes, and in doing so the medium is restricted in some sense. In courses we don't have that restriction: we know people are committed and ready to spend time, so we take the time to explain things in depth, and that's very important as well. There's also peer pressure: when you see other people asking questions in the course forum and enjoying the course, it puts a little bit of pressure on you in a positive way, which propels you to think, I also want to do
something. Well, you're all in it together, so there's a little bit of camaraderie too. And I think you said it perfectly, that confidence comes from mastery: you feel confident when you know you've mastered a particular topic. The other thing you brought up is doing projects. Actually executing a project, taking what you've learned and creating an extension of it, is very valuable and rewarding, and it also forces you not to skip steps. You could be reading a blog or watching a video online and nodding along, yes, this all makes sense, but when it comes time to actually code something, or create something a little different from what you learned, having to program it and get your hands dirty forces you not to skip steps or gloss over details that are actually required. I think that's a very valuable experience. So let's talk a little about what sorts of jobs are available in industry. Some immediate ones come to mind, the entertainment industry perhaps, medical imaging, manufacturing, but what are the different domains and fields where computer vision is being used more recently? Professor Andrew Ng, who is also the founder of Coursera, likes to say that AI is like electricity: when electricity was invented it was used for lighting, but within a few years it transformed multiple industries; it was used in manufacturing, in agriculture, and in many other things. AI has the same power. It's used in certain industries right now, but it is transforming many more, including manufacturing, automotive and autonomous driving, and agriculture, where we have a lot of people working on pest control and on removing weeds in agricultural settings. Medical imaging is huge, because AI is now doing better than radiologists in some areas; when the data becomes available and the task is repetitive, AI is going to do better than humans over time. Even in creative fields, say music or generating new images, AI is learning from existing data and creating new art forms. But there are certain fields where AI is not going to take over; comedy is one example. If you look at the jokes produced by an AI system, they're usually pretty lame; they try to rhyme something. To create comedy you need to combine things that don't sit in one specific area: you could tell a joke that combines something going on in the music industry with something going on in sports, and when you put them together it's funny. Right, it's really hard to quantify that. Right, and it's not domain-specific, because you took something from one domain and combined it with something very different from another, and those kinds of things are one-offs; even the best comedians don't know what will fly, so they do a lot of testing themselves. Those kinds of fields are very difficult for AI, but everywhere else, where the environment is very structured, like manufacturing or warehouses, which are very structured, controlled environments, a lot of
variation is removed, which makes things easier. A lot of variation is removed, and there are repetitive tasks that can generate a lot of data. The key thing is that they are repetitive: the same task is happening over and over again, and no matter how complex the task, you can put a camera there and gather as much data as you want, and ultimately that problem will be solved. If you look at which companies are hiring: when I graduated back in 2006, I had a few options; I could go to Microsoft Research, to Google, to a few different places. But now the people who come to us for consulting are all over the place. Obviously the big companies are still there: Microsoft, Google, Facebook, Adobe all hire computer vision research engineers. But then there are small companies, one- to five-person companies, working on something very specific that you may not have thought about, and they need computer vision expertise. For example, one of the first consulting projects I did was sorting Lego pieces. We don't realize it, but there are 10,000 unique Lego pieces, and they wanted to identify reliably which piece is which, because they wanted a replacement system: if a piece is lost, they can replace it. That's something that came out of the blue; I had never thought of it as an application area. Another one: there's a company using computer vision to detect the species of fish. When they catch fish, they want to know the size of the fish and the species they've caught, and they're using computer vision for that analysis. We've worked with microscopy companies, and on so many other applications, even ones with a huge impact, like a project identifying shooters in schools; we worked on a proof of concept with a company, and a few weeks ago I learned they've grown into a full-fledged company and are now offering the service. My point is that computer vision is everywhere; it's not just concentrated in these large companies, and I've just given you several examples. I'll add one more: there is a company that engaged us to detect fraud in fashion merchandise, like handbags. These bags are very expensive, some of them two thousand dollars for a small bag from a recognized brand, and the brand identity needs to be protected, because for something that costs two thousand dollars you can make a very good copy at a fraction of the price, sell it well below the genuine price, and still make a very good profit. Fortunately these companies have very high standards, and a machine learning algorithm was trained to tell the difference between a counterfeit bag and the real thing. So you can see that we've now worked in the fashion industry, and this is a small consulting company of 40 or 50 people, yet we receive a very diverse set of problems to work on. We've worked on sports analytics too, tracking a soccer ball in a sports setting, and other companies have approached us about baseball, golf, and other projects like
that. So it's a very diverse range, and we're lucky to be in this position. I'm really curious about the example you pointed to with the counterfeit merchandise: was that machine learning only, or did it incorporate deep learning? It was a deep learning project. This company is basically in the business of selling second-hand handbags; they buy the handbags and then resell them, so when they buy them from people they have to make sure they're genuine, and they have experts who check whether each one is genuine or not, which requires a lot of expertise. Even to give a quote for a handbag, how much it should be priced at, you need to know the exact make and model, and again there are about 10,000 of those, so you have to determine automatically, at least to get a ballpark, which kind of handbag it is. Otherwise there are tens of experts employed just to look at these images and confirm, for example, that this is a Gucci handbag of such-and-such model and therefore it should be priced at such-and-such. Then the handbag comes in and you have to determine whether it is counterfeit or not, which requires a different level of expertise altogether, because mistakes can be really expensive. Very expensive, right; they're putting their name on it, saying this is a genuine item. So we've covered quite a bit in this discussion today, and I'm just curious whether you have any final thoughts for people who are perhaps on the fence about getting into computer vision, or wondering exactly how they might start. Yeah, a few final thoughts. First of all, don't be afraid of taking a leap into this field. The job opportunities are tremendous, and it's a very good career switch for people who are interested. The first thing you have to answer for yourself is: are you interested in computer vision and AI? If it sparks joy in you, then this field is full of opportunities, high-paying jobs, and so on. And take any learning path; it's not necessary that you take our courses, though you can. Jump onto this learning path, try to learn as much as possible from free material, and when you're ready, come to our courses; but even if you don't, it doesn't matter. The important thing is that you embark on this journey knowing that there are a lot of jobs available in this area, and if you spend about six months to a year and dedicate your time and energy to it, you can become an expert, an engineer who can join this AI revolution. With that in mind, I wish you all the best, and welcome to the AI revolution.
Info
Channel: freeCodeCamp.org
Views: 80,935
Id: P4Z8_qe2Cu0
Length: 180min 26sec (10826 seconds)
Published: Mon Jun 07 2021