VIOLA JONES FACE DETECTION EXPLAINED

Video Statistics and Information

Captions
Today I am going to explain the Viola-Jones face detection algorithm. The topics to be covered are the Viola-Jones face detection algorithm and its subtopics: Haar features, the integral image, AdaBoost, and cascading. Before we proceed further, I would like to mention that I have used some slides from sources available online, just to make this presentation clearer. So let's get started.

A face detection system is designed by giving it some faces and non-faces as input and training a classifier, something that identifies a face. We train it using faces and non-faces, and once the training is done, the data we have obtained lets us detect faces in any image. To make this a little clearer, it is like showing images of a face to an alien who has no previous knowledge of what a human face is. We show it hundreds or thousands of human faces and tell it that each is a human face, and we also show it hundreds or thousands of non-faces and tell it that these are not faces. Once that alien is trained to identify those features, whenever we show it any new image later on, it will be able to classify it as a face or a non-face. That is exactly what we are trying to do: train the computer to understand what a face is and what a non-face is. Once the computer is trained, it has extracted certain features, and everything is stored in a file. All we do then is take that file, and for any new input image, check the features from that file against the image. If the image passes through all the stages of that feature comparison, we say it is a face; otherwise it is not a face. This is exactly what a computer does, and digital cameras and mobile cameras do the same thing to detect faces: the trained data with all the learned features is already there, and any system just uses that data to classify a given image as a face or a non-face by referring to the file. I hope the basic idea behind face detection using Viola-Jones is clear to you now.

So how are we going to detect faces, and how are we going to extract the features? Before getting into what features are extracted and how everything is done, let me give you a brief introduction to edge detection. The way edge detection works is this: we have a pattern with some low values here and high values here, like a bright area surrounded by two dark regions above and below it. I am trying to find a single horizontal high-value line in this image, so I create a kernel that is similar to what I would like to extract from the image, and it looks like a horizontal line. What will be extracted from the input image is the horizontal lines: I apply this kernel all over the image, and the output image has high values only at the places where the pattern matches the image.

Now let us try to understand what Haar features are. Haar features are similar to the convolution kernels we just saw, and they are used to detect the presence of a feature in the image. Here we have the Haar features that are generally used in the Viola-Jones algorithm. If you look at this Haar feature, the black region is replaced by +1 and the white region by -1; by that I mean it is exactly like a convolution kernel of one row and two columns, where the right column is +1 and the left column is -1. To apply this mask to an image, we just subtract the pixel values under the white region from the pixel values under the black region, and the output is a single value. That single value tells us how strongly this feature is present at that position in the image.
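The black-minus-white computation just described can be sketched in a few lines. This is a minimal brute-force version for the two-rectangle (left/right) feature only; the image, coordinates, and function name are illustrative, not from any library:

```python
def haar_two_rect(img, r, c, h, w):
    """Two-rectangle Haar feature: sum of pixels under the black (right)
    half minus sum under the white (left) half of an h x (2*w) region
    whose top-left corner is (r, c). Brute force, no integral image yet."""
    white = sum(img[i][j] for i in range(r, r + h) for j in range(c, c + w))
    black = sum(img[i][j] for i in range(r, r + h) for j in range(c + w, c + 2 * w))
    return black - white

# A tiny 4x4 test image: dark left half (1s), bright right half (9s),
# so the feature responds strongly at this position.
img = [[1, 1, 9, 9],
       [1, 1, 9, 9],
       [1, 1, 9, 9],
       [1, 1, 9, 9]]
print(haar_two_rect(img, 0, 0, 4, 2))  # (8 * 9) - (8 * 1) = 64
```

On a flat region the two sums cancel and the feature value is 0, which is exactly why these features respond only where the dark/bright pattern is present.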
Similarly, we have different kinds of features. Consider this feature: it is similar to the previous one but varies in size and position. When I apply it to an image, it sums up all the pixel values under the black rectangle (because black is all +1), sums up all the pixel values under the white rectangle, subtracts the white sum from the black sum, and the single value we get is the output. I hope this is clear.

So now, what are these Haar features, why do we use them, and what do they signify? Say you have this feature. It resembles the bridge of the nose, where only the bridge is brighter and the surroundings are darker compared to the bridge, so this feature will be able to extract the nose bridge from the image. This is done by applying the feature all over the image; after that, I get high values only at the pixels where this pattern matches, that is, only at this region. From that I understand that this pattern is present in the picture at these pixels. Similarly, have a look at another feature: this one is used to identify the brighter region of the cheeks beneath the darker region of the eyes, because the eyes are darker compared to the cheeks. When I apply this feature all over the image, I get high values only in the regions where the pattern matches the image, the dark and bright regions, so I can pick out this particular feature in the image. What we understand from this is that all these Haar features have some resemblance to facial features, to characteristics of faces: each Haar feature represents some characteristic of a face.

Let's move on. Viola-Jones uses a 24×24 sub-window of the image and calculates these features all over it. What I mean is this: you have a feature of two pixels, so you apply this two-pixel feature here, calculate the value, then shift it by one pixel and calculate the value again, and so on, moving across the entire window until you reach the bottom corner pixel. That was the two-pixel feature. Now we increase its size: we make it 2 pixels white and 2 pixels black, so the feature is 4 pixels in size, and we apply it again all over the window, shifting by one pixel each time, and get the values again. Then I make it 4 pixels white and 4 pixels black and apply it again. The same thing is done by increasing the size and width of all the features and moving them around the entire window. If you consider all the variations of size and position of all these features, you end up calculating more than 160,000 features for every 24×24 window, because each single type of feature is repeated all over the window at every scale, size, and position, and everything combined gives that many combinations.

Now we have a problem: we need to calculate this huge set of about 160,000 features for every 24×24 sub-window in any new image, and that looks practically very difficult, nearly impossible, for real-time face detection. So the basic idea is to eliminate the redundant features, the ones that are not useful, and select only those features that are very useful for us. This is done by AdaBoost.
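The "more than 160,000" figure can be checked by simple enumeration. The sketch below counts every position and integer scale of the five classic Haar feature types inside a 24×24 window; with these (standard, but assumed here) conventions the exact total is 162,336:

```python
def count_haar_features(W=24, H=24):
    # Base shapes (height, width) of the five classic Haar feature types:
    # horizontal 2-rect, vertical 2-rect, horizontal 3-rect,
    # vertical 3-rect, and the diagonal 4-rect feature.
    base_shapes = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 2)]
    total = 0
    for bh, bw in base_shapes:
        # Every integer scaling of the base shape that still fits the window...
        for sw in range(1, W // bw + 1):
            for sh in range(1, H // bh + 1):
                w, h = bw * sw, bh * sh
                # ...can be placed at (W - w + 1) * (H - h + 1) positions.
                total += (W - w + 1) * (H - h + 1)
    return total

print(count_haar_features())  # 162336
```

Evaluating all 162,336 features on every window of every frame is clearly hopeless for real time, which motivates both AdaBoost and the integral image that follow.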
AdaBoost eliminates all the redundant features, the features we don't need, and narrows them down to the several thousand features that are very useful. But before going into AdaBoost, let us bring into the picture something called the integral image.

Every single time, I need to sum up all the pixels in the black region and then sum up all the pixels in the white region. Whenever I want to calculate the sum of an area like this, it is not computationally efficient for real-time use; it becomes very lengthy, and that work is repeated for thousands of features. So Viola and Jones came up with an idea, basically a trick, to solve this problem, called the integral image. The basic idea is that if we want to calculate the sum over a patch, we do not need to sum up all of its pixels; instead we use the corner values of the patch and do a simple calculation that I am going to explain now.

Let me explain what the integral image is. This is the given input image; how do we calculate the value at a pixel of the integral image? To get the value at this pixel, we just sum up everything to the top and to the left. So I sum up all the pixel values here and get 6; for this pixel I sum up the top and left-side pixels and get 2; for this pixel I sum up all the top and left-side pixels to get 6. So in the integral image, each new pixel value is the sum of all the pixel values falling in the region above and to the left of it.

Now let us see the advantage of converting any given input image into this integral image format. If you want to calculate the sum of this patch, resembling the previous example I showed you, you just refer to your integral image, go to the corresponding patch on it, add the pixel values at corners 1 and 4, and then subtract the sum of the pixel values on the other diagonal, corners 2 and 3. To make it clearer: say you want to sum up all the pixels of this patch. In the integral image you have already summed up all the values to the top and to the left, so the value at 4 is the sum of regions A + B + C + D, the value at 1 is the sum of all the pixels in region A, the value at 2 is the sum of regions A and B, and the value at 3 is the sum of regions A and C. So if you take the sum of one diagonal (the values at 1 and 4) and subtract the sum of the other diagonal (the values at 2 and 3), you end up with the sum of all the pixels in region D.

Moving further: as I told you, AdaBoost is used to eliminate the redundant features. Let's say we have taken all the combinations of all positions, all sizes, and so on, that is, all 160,000 features. Are all of them relevant? Definitely not. The nose-bridge feature I mentioned earlier will identify the bridge of the nose when it is applied in that position, and it will yield the highest values there, so it is a very relevant feature for extracting the nose-bridge pattern in a face image. Whereas this other feature would give no relevant information, because the region here at the upper lip is more or less constant, so you do not get any relevant data from it. You can say it is an irrelevant feature and eliminate it, so that it is not considered for further evaluation.
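The integral image trick described above can be sketched directly. This is a minimal illustration (the padding row/column and the example image are my own choices, not from the talk): build the table of top-left cumulative sums once, then any rectangle sum costs just four lookups, matching the 1 + 4 minus 2 + 3 diagonal rule:

```python
def integral_image(img):
    """ii[r][c] = sum of all img pixels above and to the left of (r, c),
    inclusive. An extra leading row/column of zeros simplifies lookups."""
    H, W = len(img), len(img[0])
    ii = [[0] * (W + 1) for _ in range(H + 1)]
    for r in range(H):
        row_sum = 0
        for c in range(W):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of the h x w patch with top-left corner (r, c), using only the
    four corner values: (bottom-right + top-left) - (top-right + bottom-left)."""
    return ii[r + h][c + w] + ii[r][c] - ii[r][c + w] - ii[r + h][c]

img = [[5, 2, 3, 4],
       [1, 5, 4, 2],
       [2, 2, 1, 3],
       [3, 5, 6, 4]]
# 2x2 patch at (1, 1): 5 + 4 + 2 + 1 = 12, found with four lookups.
print(rect_sum(integral_image(img), 1, 1, 2, 2))  # 12
```

However large the patch, the cost stays at four lookups, which is what makes evaluating thousands of Haar features per window affordable.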
In this way, we determine which features are relevant and which are irrelevant among all these 160,000 features. A very important point here is that this relevance is determined by AdaBoost, and it selects only the few features that are relevant to us. What AdaBoost does is identify a certain number of features from all 160,000; after identifying these features, it gives each of them a weight, and a linear combination of all these weighted features is used to decide whether a window is a face or not.

Now I am going to introduce another term: the weak classifier. By a weak classifier I mean a good or relevant feature; a weak classifier is something that at least performs better than random guessing. What I mean is that if I give it a hundred faces, it will be able to detect more than 50 of them as faces. So a weak classifier is just a relevant feature extracted by AdaBoost; we apply that relevant feature, find its corresponding weight, and combine all the relevant features with their corresponding weights to form a strong classifier, or strong detector.

Now, what is the output of a weak classifier? It is either 1 or 0. It outputs 1 when it has performed well and identified its feature in the image: say this is the nose-bridge feature; when this feature is applied and detected, we say the classifier has passed and it outputs a 1. If it gives a 0, it means there is no sign of this classifier's pattern in the input image. All these classifiers output a binary value, either 1 or 0, and the combination of all the weak classifiers together forms a strong classifier. Generally, about 2,500 features are used to form a strong classifier.
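The weak classifier is a threshold on a single Haar feature value (with a polarity saying which side counts as "present"), and the strong classifier is a weighted vote of many weak ones. Here is a minimal sketch of that decision rule; the feature values, thresholds, polarities, and weights below are made-up illustrative numbers, not trained values:

```python
def weak_classifier(feature_value, threshold, polarity=1):
    # Output 1 ("pattern present") if the feature value falls on the
    # correct side of the learned threshold, else 0.
    return 1 if polarity * feature_value < polarity * threshold else 0

def strong_classifier(feature_values, weak_params, alphas):
    # Weighted vote of the weak classifiers; declare "face" when the
    # weighted sum of votes reaches half the total weight.
    votes = sum(a * weak_classifier(f, t, p)
                for f, (t, p), a in zip(feature_values, weak_params, alphas))
    return 1 if votes >= 0.5 * sum(alphas) else 0

# Illustrative numbers only: three weak classifiers with weights (alphas).
features = [12.0, -3.0, 40.0]                 # feature values on some window
params = [(20.0, 1), (0.0, 1), (35.0, -1)]    # (threshold, polarity) pairs
alphas = [0.8, 0.5, 1.2]
print(strong_classifier(features, params, alphas))  # 1 (all three vote "face")
```

In the real detector the thresholds, polarities, and alphas all come out of the AdaBoost training loop; only the form of the decision rule is shown here.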
Now we move to our final section: cascading. In every 24×24 window, you need to evaluate the 2,500 features we obtained after running AdaBoost. If you have an input image of, say, 640×480 resolution, you need to move this 24×24 window all through the image, and for each single 24×24 window, evaluate 2,500 features, take a linear combination of all 2,500 outputs, see whether it exceeds a certain threshold, and then decide whether it is a face or not. That is the whole process, but we want to set up a hierarchy for deciding whether a window is a face. Instead of calculating all 2,500 features every time on every single 24×24 window, we use cascades: out of the 2,500 features, the first ten features are kept in one classifier, meaning we make a set of ten features into one stage; then the next 20 or 30 features are kept in another classifier, the next 100 or 200 features in another, and so on, increasing the complexity. The advantage is that when we apply this cascade to a certain 24×24 window of the image, we can often tell that it is not a face based on just the output of the first stage.

What we have done here is arrange the 2,500 features in a cascading structure, so I can reject an input window in very little time. Let me repeat: I take a 24×24 window, and instead of evaluating all 2,500 features on it, I split them into several stages: stage 1, stage 2, stage 3, and so on, with some features in stage 1, more in stage 2, more in stage 3, up to all 2,500 features. I put a hierarchy on the classifiers: I check whether the window passes the first hierarchical stage before proceeding to the next stage of the hierarchy. By that I mean, if the input passes the first stage, it may be a face, and we need to evaluate it further to confirm whether it is a face or not; that is done by the second stage. But if it does not pass the first stage, it is definitely not a face, so it is eliminated. In real time, when we try to detect faces in an image, this gives a big advantage: the areas, the windows, that do not contain faces are rejected immediately.

To summarize: we have faces, we have non-faces, and we train a cascade of classifiers with AdaBoost. We end up with stages that are cascaded, and in each stage the classifiers are selected using AdaBoost. Each classifier that is selected has a threshold and a weight, everything determined by AdaBoost. After the cascading is done, you apply each 24×24 window, passing it all over the image, and in this way faces are detected in the image.
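The cascade just summarized is simply a chain of stages where any failure rejects the window immediately. The sketch below shows only that control flow; the stage functions are cheap stand-ins (in the real detector each stage is a small AdaBoost-trained strong classifier, the first with about ten features):

```python
def cascade_classify(window, stages):
    """Run the window through each stage in order; any failing stage
    rejects at once, so most non-face windows exit after a few features."""
    for stage in stages:
        if not stage(window):
            return False   # definitely not a face: stop evaluating
    return True            # passed every stage: report a face

# Stand-in stages, each pretending to be a small trained classifier
# that checks one cheap property of the window.
stages = [
    lambda w: sum(map(sum, w)) > 10,      # stage 1 (cheapest check first)
    lambda w: w[0][0] < w[1][1],          # stage 2
    lambda w: max(map(max, w)) < 200,     # stage 3
]
face_like = [[3, 4], [5, 6]]
flat_patch = [[1, 1], [1, 1]]
print(cascade_classify(face_like, stages))   # True
print(cascade_classify(flat_patch, stages))  # False (rejected at stage 1)
```

Because the overwhelming majority of windows in a real image contain no face, almost all of them pay only the cost of stage 1, which is what makes the detector fast enough for cameras.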
Info
Channel: Rahul Patil
Views: 135,108
Keywords: Face Detection, Viola Jones, Viola, Jones
Id: _QZLbR67fUU
Length: 20min 46sec (1246 seconds)
Published: Mon May 19 2014