Data Science & Machine Learning Project - Part 3 Data Cleaning | Image Classification

Captions

In the last video we looked at data collection for our image processing project. In this video we are going to talk about data cleaning. In any data science project, the majority of the time goes into the data cleaning process, and you will see that in this project, too, we will be spending a significant amount of time cleaning our images, because when we download images from the Internet they might have a lot of issues.

Now, when you want to detect a person from an image, just think about it: when I show you a photo of a person, how would you go about detecting that this person is X or Y? The majority of the time you will be using the face of the person. Using the height, hands, and legs you can tell something about a person to some extent, but your final decision about who that person is is mostly based on the face, and we are going to use the same concept. For all the images that we downloaded from Google, we will first try to detect the face of the person. Sometimes the face might be obstructed, so we want to detect whether the face is clearly visible or not. Now, how do you detect that? You will try to detect two eyes as well. So if in a photo you can find a face with two eyes clearly, then you will keep that image; otherwise you will discard it.

For face detection and detecting the eyes we will be using OpenCV, which is a famous image processing library in Python, and for the specific detection we will be using a technique called Haar cascade. That's a famous technique for detecting the face and the two eyes, and we'll see how you can actually do it; it's pretty straightforward. At the end of this tutorial you will have a clean dataset that you can use for feature engineering later on.

Here are some sample images from our dataset. You see a lot of variety here. For example, here Serena Williams's face is not clearly visible. In these two pictures, along with Virat Kohli, there are two other people: here is Anushka, his wife, and Dhoni. In this picture Messi's face is visible, but it is only a side face, so
his two eyes are not visible. Now, how do you handle these different images? Let's look at a very simple example. This is an image of Maria Sharapova. Over here the face and two eyes are clearly visible, so we'll detect this using OpenCV, and once you detect the face and two eyes clearly, you keep that image. In this photo there are two faces, so first we'll detect those two faces. Using OpenCV you cannot say that this is Virat Kohli versus this is Dhoni; OpenCV can only tell you that there are two people and here are the regions where they have their faces and two eyes visible. So once you get these two cropped faces, we'll have to run a process of manual verification. Eighty to ninety percent of our data cleaning happens through our Python code in an automated way, but there is 10% where you have to spend manual effort cleaning your data. So we'll delete the image of MS Dhoni in the manual verification step. In this case Serena Williams's face is not properly visible, hence we will not use this image in our classification, because if we do, our classifier might make mistakes, and we want to make sure our classifier's accuracy stays high. In this image, although we know Messi from the side face, for a computer it might be hard if two eyes are not visible, and hence we will also discard this image.

So, just to go over the overall data cleaning process: you had raw images; you created cropped faces, basically detecting the faces out of all these images; you also discarded some images where faces were obstructed, such as Messi's and Serena Williams's pictures, which do not appear here; then you run a manual data cleaning process where you delete unwanted images. In our case we are doing sports celebrity classification for only five players: Virat Kohli, Roger Federer, Serena Williams, Lionel Messi, and Maria Sharapova. Here there are images of Anushka Sharma and Dhoni, who are not among our classification classes, hence we are just deleting those images.

In the next video we'll be looking at wavelet
transform and how you can do feature engineering to extract features from the cropped image in an effective way. We'll then use these wavelet-transformed images as well as the raw images: we will do vertical stacking of the raw and wavelet-transformed image, train our model, and then hyperparameter-tune it. Once that is done we will save our model to a file. We have already looked at this architecture, where we'll have a Python Flask server around it, which will be serving our website.

Let's jump into coding now. In my code directory I have created a sportsperson classifier folder where I am going to host all my code for this project. I have already created three different folders: in the model folder we'll do the model building and training, server will host the Python Flask server code, and UI will host the UI, of course; it's obvious from the name. If you go to model, I have placed my dataset there; the Google images that we downloaded are in this dataset folder. You can see that there are five people here. If you look at Maria Sharapova, all her images will be in this folder, and the same applies to the other players. Okay, so you have already done that. The other folder I have is opencv. I will talk about that folder a little later, and I'm going to provide this folder on my GitHub, so check the video description below and you will be able to download this folder.

I also have requirements.txt, which contains all the modules that you need to install. The way you install them is you go to your command prompt. See, right now I am in this folder, so I can just go to my model folder and say pip install -r requirements.txt, and this is how you install all your modules. Now, I have already installed these modules, so it's going to say these modules are already installed. If you get this kind of error where it says access is denied, what you can do is start Git Bash as an administrator; if you run it as an administrator you will be able to install things. Okay, I assume that you already have Anaconda
installed. So you install Anaconda, and on top of that you have to install these three modules. I'm going to provide the requirements.txt file, which is this one, on my GitHub, so again you just have to download it and run pip install -r requirements.txt. I also have this test images folder, guys, where I have a couple of test images to try a few things, and with that we can now start our Python coding.

So I opened Jupyter Notebook, created a new notebook by going here, called it sports person classifier model, and imported a couple of important modules. This is the cv2 module, which is OpenCV basically, and it will be helping us a lot throughout this project. The very first basic thing you can do in cv2 is read an image. So here, from the test images folder that I have, I am reading an image of Maria Sharapova. Okay, so this is a beautiful-looking image; I am just loading it, and when you read this image in cv2 you will realize that the shape has three dimensions. So this is x and y, okay, the x and y coordinates, and the third dimension is the RGB channels. You know that any color can be represented using RGB values, and therefore you have this third dimension for your RGB values.

Now, if you want to quickly show that image, you can use plt, which is matplotlib basically, and plt has this method called imshow, and that will show you the image. When you look at this image, it's a colorful image, basically, which has RGB values, and if you want to change it to a gray image you can do something like this, where you can see that it removes that third dimension you had. All right, and the gray values, I mean, ultimately they are all numbers; it's an n-dimensional array with numbers from 0 to 255. When you plot a gray image, again using matplotlib (matplotlib has this imshow function), you can use cmap gray, and the gray image looks something like this.

Now we are going to detect the face from this image, and also the eyes. Now, if you look at the OpenCV documentation, they have
this nice article on how we can detect faces and eyes using Haar cascades. We are not going to go into too much detail on what a Haar cascade is and how it works, because it requires a long discussion, but just to give you a brief idea: you have these line and edge features, and it will use a moving window of these edge features to detect where your nose is and where your eyes are. For example, in this image, where you have the eyes, the area of the eyes tends to be darker than the area below. Similarly, where you have the nose, the area of the eyes tends to be darker and the tip of the nose will be a little brighter, so you can use all these masks to detect these areas. Okay, and the OpenCV documentation contains this ready-made API which you can use to detect the face in an image. So if you don't want to bother too much about the Haar cascade and its inner workings, just assume that there is this cool technique called Haar cascade which helps you detect faces in images, and your result will look something like this. So I'm going to try the same code here on our image of Maria Sharapova.

Okay, now going back to our folder structure, see, here we have this opencv folder, which I downloaded from OpenCV's GitHub, and it has all these Haar cascades. So what are these Haar cascades? They allow you to detect different features on the face: they allow you to detect the face, an eye, the left eye, the right eye. So these are the different XMLs, or pre-trained classifiers, that you can use for detecting various features, and I'm going to first try the face. Okay, and I'm just copy-pasting the code from my other notebook so that it saves me time on typing, because there is a lot of code that we'll be writing. So I loaded that XML file, which is called frontal face default, and I also loaded the eye cascade; I am not using the eye cascade for now. So when I load the face cascade and I say detectMultiScale on this gray (what is this gray? gray is nothing, guys, but this gray image of Maria Sharapova), on that you are saying: now detect me faces. And what
it returned is an array of faces. So if you had two faces it would return two faces; right now it returned only one face, and this is an array of four values. So what are these four values? They are your x, y, width, and height. See, this image has this scale, so 352: you see 300 here, so 352 will be somewhere in between here, and 38 will be somewhere here. So, see, at this point the face starts, and the width and height are 233, so you will go 233 here and 233 here, and this will be your face. So let's draw that face so that we know how it looks.

Since faces is a two-dimensional array, we are going to take the first face and store it in x, y, w, and h values. Once you have this, you can now, not exactly crop, but draw a rectangle around that face using OpenCV. So in OpenCV you can say cv2.rectangle on my image, so img. What is img? Well, img is my original image, and on that image I am saying draw a rectangle with a red color. See, this is RGB, and R is 255; that's why this is going to draw a red rectangle. And the rectangle dimensions: it will start at x and y and then go to x plus w and y plus h, and I will store that in my face image, and when I draw it you get something like this. So now my face is very clearly detected.

Now I am going to draw the two eyes. So this is the code, nothing fancy about it; the OpenCV documentation has this code, so I have just copy-pasted it from there. What we are doing is iterating through all the faces. In our case it will work even if you don't have a for loop, because we have just one face. For each of the faces we are first drawing the face image; see, the face image is nothing but this, so we are doing that, and then we are applying the eye cascade. So the eye cascade will give you eyes, and you might have multiple eyes, so you're going to run a for loop on those eyes and again draw the same kind of rectangle, but you see I am now drawing the rectangle in green. So this is RGB; this was red
before, now I am doing that in green. This code is extremely simple, believe me. Now you get these two eyes detected. In this code, roi_color was nothing but the rectangle region, that red rectangle that you're seeing for the face, so if I just plot roi_color you will notice that I get a cropped face, and this is something we are interested in. I am calling it ROI because it's the region of interest. We are interested in the facial region of every image in our dataset, so we will be cropping the face region from all the images, and we will store these cropped images in a different folder and use them for our model training.

So now what I'm going to do is write a function where I can input this image. For example, I had this image, right, the original image. Let's say I have a function where I input this image and the function returns me the cropped face if the face and two eyes are detected clearly, and that function I can run on all my images. So let's write that function. It's the same code; I'm just creating a simple function out of it. My image will be supplied using an image path as an input to this function. It will read the image, then convert it to gray, and then detect the faces first. Then you go through all the faces, and if the number of eyes that you get in your face is greater than or equal to two, then it returns you the region of interest, basically.

So let's try this function on Maria's image. Whatever we did previously, we are now doing the same thing using the function. So see, this is the original image I have, and let's see what kind of image is returned by this function. I'm calling this function on that image, passing the path to it, and the function is returning me the cropped image. When I plot that image, it looks something like this. So this is pretty cool, because this way I get a cropped image. Now, if the face is not clear and the two eyes are not clearly visible, we want this function to return nothing, because we want to ignore that image.
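The function described above (read the image, convert it to gray, detect faces, and return the face region only when at least two eyes are found) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's exact code: the helper `pick_face_with_two_eyes` and the loader `load_cascades` are hypothetical names introduced for illustration, the `1.3` and `5` arguments to `detectMultiScale` (scale factor and minimum neighbours) are assumed values, and the cascades are loaded from the copies bundled with the `opencv-python` package rather than from the opencv folder mentioned in the video.

```python
def pick_face_with_two_eyes(faces, count_eyes):
    """Return the first (x, y, w, h) box whose region holds at least two eyes, else None.

    `faces` is any iterable of 4-value boxes; `count_eyes` is a callable that
    takes one box and returns how many eyes were detected inside it.
    """
    for box in faces:
        if count_eyes(box) >= 2:
            return tuple(int(v) for v in box)
    return None


def load_cascades():
    """Load the pre-trained face and eye Haar cascades bundled with opencv-python (assumption)."""
    import cv2  # imported lazily so the helper above can be used without OpenCV installed

    face = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    eye = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
    return face, eye


def get_cropped_image_if_2_eyes(image_path, face_cascade, eye_cascade):
    """Return the cropped colour face region, or None if no face with two eyes is found."""
    import cv2

    img = cv2.imread(image_path)
    if img is None:  # missing or unreadable file
        return None
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)  # assumed parameters

    def count_eyes(box):
        x, y, w, h = box
        roi_gray = gray[y:y + h, x:x + w]  # search for eyes only inside the face box
        return len(eye_cascade.detectMultiScale(roi_gray))

    box = pick_face_with_two_eyes(faces, count_eyes)
    if box is None:
        return None
    x, y, w, h = box
    return img[y:y + h, x:x + w]  # region of interest in colour
```

Like the function in the video, this returns only the first qualifying face, so a photo with two clear faces yields a single crop. Splitting the selection rule into its own helper is a design choice made here so that the "at least two eyes" logic can be checked without any image files.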
If you look at our test images folder, let's see, in our test images I have a second image of Maria where her face is actually obstructed, because the two eyes are not clearly visible, so I don't want to use this image for my classification. So let's see how our function behaves for this particular image. First I loaded this image and plotted it here, so this is how it looks. Now I am going to call my function; see, I'm calling this function on this image, sharapova2, which is this image, and when I run this function I get nothing. So the cropped image is now None, which means the face is obstructed and I don't want to use this image in my model training.

All right, now in my dataset folder I want to create a new folder called cropped, which I will do programmatically, and in that I want to store all the cropped images. If you look at my original images, they are the originals and they have a lot of things in them, right, and I now want to generate a cropped folder. All right, so how do I do that? The first thing I'm going to do is initialize a couple of variables. So that dot slash means the current directory. My current directory for this notebook is this; see, here is my notebook, this is my current directory, and my dataset is in the dataset folder, which you can see here. And my cropped dataset, so cr means cropped dataset, I'm going to store in the cropped folder.

Okay, and first let me store the paths of all the individual subfolders in a Python list. So I'm using Python's os module, and when you do os.scandir, what it will do is go through all the subdirectories within my dataset folder. My dataset folder has how many directories? These five directories, okay. The names of those directories are going to be stored in this image directories variable, so if you print that variable it will look something like this. Now I have the complete paths of the individual folders for each of these players. So now what I'm going to do is, if the cropped folder doesn't exist, then I am going to create it.
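The folder bookkeeping described here can be sketched with the standard library alone. This is a minimal sketch, assuming a dataset folder with one subfolder per player; the function names `list_subdirs` and `recreate_dir` are hypothetical and introduced only for illustration. The second helper also clears an existing folder before creating it, which is the behaviour the walkthrough goes on to describe for repeated runs.

```python
import os
import shutil


def list_subdirs(path_to_data):
    """Return the paths of the immediate subdirectories of path_to_data, sorted by path."""
    return sorted(entry.path for entry in os.scandir(path_to_data) if entry.is_dir())


def recreate_dir(path):
    """Delete path if it already exists, then create it empty.

    Clearing first means crops left over from an earlier run do not get
    mixed in with the new ones.
    """
    if os.path.exists(path):
        shutil.rmtree(path)
    os.mkdir(path)
```

One thing to watch with this layout: if the cropped folder lives inside the dataset folder, listing the subdirectories after the cropped folder has been created would include it alongside the player folders, so the listing is best taken first.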
Right now there is no cropped folder inside my dataset, okay, so let's see. This is my dataset folder; it doesn't have any cropped folder, you can see that, but this code will generate that folder if it doesn't exist. See, now I have the cropped folder. So what is this code doing? It's very, very simple code. What I am saying is, if the folder exists (os.path.exists means does this folder exist?), no, oh, sorry, here what it is doing is, if the folder exists, then I am removing it, so that if you are doing multiple runs and you have some old images, you clean them out. So the first thing is, if the folder exists, remove it; then this line will create that folder. So make directory will create that folder, and now I have this folder. Pretty cool. It's looking good, guys; so far life is pretty good, we have no issues.

Now what we're going to do is iterate through each of these image directories, so: for image directory in image directories. Okay, so I'm going to iterate through all these image directories first, and let me just copy-paste some variables here. I will need these two variables, and I'll explain the reason. Cropped image directories is similar to image directories, but it contains the cropped folder path for each of our five players. So when I go through these image directories, the first thing I want is the celebrity name. I'm again just going to copy-paste; I am doing copy-paste just to save time. So when you do this, what happens is, see, what is my image directory, first of all? The image directory is this, okay. When I split this string by slash, it will give me these two tokens, dataset and lionel_messi, okay, and these two tokens will be stored in a list. And you know, in Python lists, when you use index minus one it gives you the last element of the list. So here what's happening is I am splitting all these strings one by one and taking the last element, which will be the name of the celebrity. So if you want to just verify really quickly, and if you see the celebrity name, you
see you are now getting the celebrity name in this variable called celebrity name. All right, what is my second thing now? My second thing is I now want to iterate through each of these folders and iterate through all those images. So let's see, what is my folder? My folder here is, say, lionel_messi. So now I'm going to go through all these images one by one and use that get cropped image if 2 eyes function to create a cropped image. Okay, os.scandir is a nice function: you supply the image directory and it will give you an iterator which can help you go through each of the images, or each of the files, in that folder. Okay, so my entry, all right, so entry.path will have the path of that image, and on this path I want to call my function. So when I call my function, in roi_color I will get the cropped face if the face and eyes are clearly visible. If they are not, what will be the value of roi_color? We already saw that the function will return None. See, if we look at this function, it returns only if the eyes are clearly visible; otherwise it returns nothing, which means it's None.

So now what we have to do is first check that roi_color is not None. If it is not None, that means my face and two eyes are clearly visible, and in that case you can store that image in a cropped folder. So first you need to get the individual folder for the celebrity. So what is path to cr data? Path to cr data is dataset slash cropped, okay, so let's see, it will be this. In this cropped folder you want to create a subfolder for your player first. So who is the player? Well, celebrity name; celebrity name is the player. Okay, so you are going to create a cropped folder which will be path to cr data plus celebrity name, which will be dataset slash cropped slash lionel_messi. Okay, if you print the name of the folder, it will print that, and if that folder doesn't exist (so here, see, if this folder does not exist), then the first thing is you create the folder. Python's os module
is pretty handy: you just call os.makedirs and it will create that folder for you. All right, so now I'm going to just print this folder so that we know it's generating this folder. If you want to run this code just for fun, you can run it. See, right now it is doing the cropping; we are not saving the images yet, but I just want to show that it's going through all those Lionel Messi images, creating cropped images, and generating your folder. It is still working on it. Now see, it went to Maria Sharapova and it is saying it generated a folder for Maria Sharapova, and so on. Okay, so I'm going to stop this cell. If you click on this button it will stop the execution, because I just wanted to demo the code so far. If you look at your cropped folder, see, lionel_messi and maria_sharapova, you got two folders, but they are empty, so I'll just delete them. I will finish the rest of the coding and then we'll run the same code block again.

Okay, so I now have the cropped folder. What I'm going to do now is this: I have this cropped image directories variable, which is nothing but a list of your cropped image directories for each of the sports persons, and to that I am going to append the cropped folder. Okay, so this is just a helper variable; it will help us later on. So that's what it is. The code so far is extremely simple, okay; I hope you're understanding it up to this point. If you don't, take a pause and try to think about it. No rocket science, guys; this is an extremely simple project.

All right, now I am going to do one more thing, which is, once you create a folder, outside that if block I want to generate the name of the file. So the name of the file I'll just call, you know, lionel_messi1.png, lionel_messi2.png, and so on; I want to keep it that simple. So then I need a count for the one, two, you know, so that's where I have count here: celebrity name, count, dot png. I have not initialized it, so I realize I need to initialize it in here. Okay, and this will be the name of the file, and this will be the
full path of that file. Now what we are going to do is save this roi_color as an image at this cropped file path. Very simple. Now how do you do that in OpenCV? cv2.imwrite. Very simple. What is the first argument? Your file path. The second argument? Your region of interest, whatever you got back from your get cropped image if 2 eyes function. That's an awesome function we wrote, guys; we made a big achievement by writing that function. Although that function is a big achievement, there is one shortcoming, which I will tell you about so that you don't complain later on: if you have two faces in an image, it is going to return only the first face. Okay, so if you want to make this more robust (I am kind of feeling lazy), you can return the two roi_colors as an array and then save those two images. But you know, sometimes I feel lazy and I don't care about little details, because, guys, understand, my schedule is very busy and I am doing this YouTube thing on the side, and I don't have time to go too much into details. I try my best, though, if you understand my situation. Okay, enough of the side talk.

Now, once you execute this line, your cropped image is stored in the cropped folder. Amazing. But we need to do one more thing: we need to store all those image file paths in a dictionary. That dictionary will be useful later on. So this is that dictionary, the celebrity file names dictionary. So what is this? The key in this dictionary will be the name of the celebrity and the value will be the list of file paths. So it will look something like this, guys: see, you'll say lionel_messi, and it will have the file paths, you know, like dataset, whatever, messi1.png, messi2.png. You want this kind of dictionary so that it helps you later on. We'll see how it helps, but you get the idea. This is very easy: you are creating a dictionary where you have the paths of all your cropped images stored here in this beautiful
looking dictionary. See, this is the dictionary I am trying to create, the celebrity file names dictionary. Now here I can append, but I need to initialize the key of this dictionary somewhere here. So how do I do that? Well, you can just do this. Okay, so what is this? When your dictionary is empty, the first key is lionel_messi and you are creating a blank array. So the blank array will be the value, and once you have the blank array you can insert all these image paths one by one, one by one, friends. That's what I am doing here.

All right, now it looks like my code is ready for the big execution. I just want to show that I'm not bluffing: I have only the cropped folder and it is blank, and now I'm going to do Ctrl+Enter. So now it is executing the code; it is going through all the images, generating the cropped images. I paused this video while this was going on because we were doing some heavy lifting. Depending on your computer's speed it might take you a few minutes. After the execution is complete for this cell, let me show you what my cropped folder looks like. So the cropped folder, you see, now has five beautiful subfolders, and in each of these subfolders I see the cropped images. You see, I see cropped images, guys, guys and girls, cropped images. My life is pretty cool, see.

Okay, now one issue you'll notice is that we have cropped images, but then we have an image of Venus Williams; she's the sister of Serena Williams. We have this image which looks pretty blurry. Then we have this image; I think this person is Serena's husband, if I'm not wrong. See, so what is this person doing in Serena's folder? What happened was this image, see, it has this face, so it just returned that face into the cropped folder. So using Python you can crop these images, but you can do the data cleaning only to a certain extent. Most companies will try to do data cleaning in as automated a way as possible, but you have to rely on humans to do some manual remediation. So what many companies will do is hire a workforce in countries where
the labor is cheap. Some companies also use a crowdsourcing platform. For example, in our case we could easily use a crowdsourcing platform: we can post these images as a crowdsourcing job and ask people, does this image look like Serena Williams? People will easily detect this and say no, this doesn't look like Serena Williams, and in that case you can delete the image. Okay, so companies have these workflows where they use a crowdsourcing platform or many different tools and assign these micro-tasks to people, especially in countries where labor is cheap, to do manual remediation or manual cleaning of these images.

Now, our dataset is small, so we are just going to eyeball and clean these images. Okay, so what I'm going to do is go through each of these folders, see, just quickly. It's okay, this one is not focused. This looks like Serena Williams or Venus Williams, her sister. This image is blurry, so I'm deleting it; I'm just manually deleting it. So now I'm done with my automated way of data cleaning, and now I am doing manual data cleaning. See, this was a background image which it caught. See, this is an image of someone else. Okay, so the Serena Williams folder is cleaned. Similarly, see, I look at lionel_messi and I see this boy's image; maybe this is Lionel Messi's son. There's another boy; I'm going to delete that. Virat Kohli, my favorite player, the Indian cricketer: I see this blurry image, and I see this other Virat Kohli image, see, I want to delete that one. And then, see, Anushka Sharma, his wife; I see three images, see, one, two, three. Oh, I see Deepika as well. Okay, delete all those images, and once you delete all these images you have a clean dataset that you can use further.

Okay, so that's all I have for this tutorial. In the next tutorial we are going to look at wavelet transform and how we can use that to generate the features, and we'll use wavelet-transformed as well as these raw images for our model training later on. If you're liking this series so far, please leave a comment
below, because your comments help me design future content in a better way. So if you like projects like this, if you want me to build this kind of project, please give it a thumbs up or comment below so that I know there is demand for this kind of project, and I can maybe focus more on these projects in the future.
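The cropping loop built up in this tutorial (derive the celebrity name from the folder path, crop each image, save the crop as name plus counter, and record the saved paths in a dictionary) can be sketched as below. It is a rough sketch under stated assumptions rather than the notebook's code: `crop_dataset` is a hypothetical name, the cropping and saving steps are passed in as `crop_fn` and `save_fn` so the loop itself needs nothing beyond the standard library, and `os.path.basename` stands in for the split-by-slash step described in the video.

```python
import os


def crop_dataset(img_dirs, path_to_cr_data, crop_fn, save_fn):
    """Crop every image under img_dirs and return {celebrity_name: [cropped file paths]}.

    crop_fn(image_path) returns the cropped face or None when the image should be skipped.
    save_fn(file_path, cropped_face) writes the crop to disk.
    """
    celebrity_file_names = {}
    for img_dir in img_dirs:
        # Last path component is the player's folder name, e.g. "lionel_messi"
        celebrity_name = os.path.basename(os.path.normpath(img_dir))
        count = 1
        for entry in sorted(os.scandir(img_dir), key=lambda e: e.name):
            if not entry.is_file():
                continue
            roi_color = crop_fn(entry.path)
            if roi_color is None:  # no face with two clear eyes: ignore this image
                continue
            cropped_folder = os.path.join(path_to_cr_data, celebrity_name)
            os.makedirs(cropped_folder, exist_ok=True)
            cropped_file_path = os.path.join(cropped_folder, f"{celebrity_name}{count}.png")
            save_fn(cropped_file_path, roi_color)
            # Start an empty list for a new celebrity, then append the saved path
            celebrity_file_names.setdefault(celebrity_name, []).append(cropped_file_path)
            count += 1
    return celebrity_file_names
```

With OpenCV installed, `save_fn` could simply be `cv2.imwrite`, which takes the file path first and the image second, and `crop_fn` could wrap the two-eyes cropping function with the loaded cascades. The manual clean-up of wrong faces still has to happen afterwards, as described above.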

Info

Channel: codebasics

Views: 32,332

Keywords: image classification machine learning project, complete machine learning project, complete data science project, data science project step by step, data science project for beginners, data science project in python, machine learning project python, machine learning project in python step-by-step, machine learning projects in python with code, machine learning projects with source code, machine learning projects for beginners in python, data cleaning in machine learning

Id: kwKfWBb6frs

Length: 40min 44sec (2444 seconds)

Published: Sat Jun 13 2020