Outlier detection and removal: z score, standard deviation | Feature engineering tutorial python # 3

Captions
In this video we are going to look at how we can use z-score and standard deviation to remove outliers from your data set. We will be using a real data set from Kaggle.com and remove outliers using z-score and three standard deviations. In the end we'll have an interesting exercise for you to work on.

We will be using a weight and height data set from Kaggle; thanks to Mustafa Ali for providing this data set. It has height and weight as two columns, basically the weight and height of different people, and just to make things simple I have removed weight from that data set. My CSV file looks something like this: you can see that it has 10,000 records in it, and I'm going to load that into my Jupyter notebook. So here I imported a couple of important modules and then I imported the data set into a pandas data frame. It looks something like this.

The first thing I'm going to do now is plot a histogram, just to understand the data distribution. The histogram will look something like this. Here I have this height column that I am using to plot the histogram, and bins is how many bars you want to see on the chart. If you increase the number of bins, you will see more bars in the chart. This histogram shows a normal distribution. Now, if you don't know about histograms, this is very simple. What this is saying is: for 65 inch height I have more than a thousand samples, and for 60 inch height I have around 370 or 380 samples. So the y axis shows the number of samples that you have in the data set for that x value, which is your height. If you don't know about histograms, again, Google it and get yourself familiar.

Once you have the histogram you can figure out if your distribution is a normal distribution or not. Now what is a normal distribution? If you go to the Math is Fun website (this is a great website, by the way), it explains what a normal distribution looks like. It often has this shape of a bell curve, because it looks like a
bell, and what it means is that the majority of the values are centered around the mean, or average, and then as you go away from the mean the number of values goes down. In nature we see many data sets which follow a normal distribution. For example, here they are listing heights of people, sizes of things produced by machines, blood pressure, marks on a test. These are all examples of a normal distribution. If you take the example of heights of people, which is this data set, you will notice that the majority of people have heights in this area, 65 to 68 inches, and people with a height of around 80 inches, which is close to 7 feet, are very few, so you can see the number of samples is small. Similarly, the number of people with a very low height, which is like 4 feet or so, is also small. So this data is clearly following a normal distribution.

Now what we're going to do is plot the bell curve that you saw here. This yellow chart is a histogram and this curve is a bell curve, and we'll just plot that bell curve for our visualization purposes. You will need to use the SciPy module; I think it comes with the Anaconda installation, so if you have that installed already you should be good. These three lines are the same as this, so you're plotting a histogram here, and then in these two lines you are plotting the actual bell curve. What these two lines are doing is producing the range of x values. df.height.min() and df.height.max() will give you the minimum and maximum values for your height, so let me just bring them here for your information. The minimum height that we see in our data set is 54 inches, which is 4 feet 6 inches, because 4 feet is 48 inches. The max height in the data set is 78 inches, which is 6 feet 6 inches. You can also do describe, and that will tell you the quick statistics on the height column. Here it says the minimum height is 54, the max is 78, then this is the standard deviation, and the
count is 10,000; we saw the file has 10,000 data points. Once this is all done you can execute this cell and it will plot the chart for you. Now, np is not defined here, so I will import numpy as np, and that will plot this nice looking bell curve for you. This bell curve clearly shows a normal distribution; we already saw on this particular chart, if you remember, what a normal distribution looks like.

What we are going to do next is find out the mean and standard deviation. We already saw them in the describe output, but if you want to print them out you can say df.height.mean(), which prints the mean of this column, and df.height.std(), which prints the standard deviation. Now what is standard deviation? If you don't know, the standard deviation basically shows you how far away the data points typically are from the mean value. In this example most of the data points are within one standard deviation. If your data set is normally distributed, about 68% of the samples will be within one standard deviation of the mean, 95% will be within two standard deviations, and 99.7% will be within three standard deviations. For example, our mean value is about 66 inches, and if we have a data point which is, let's say, 78, then 78 is quite far away from the mean, so it will naturally be more standard deviations away. But if you have something like 67 or 68, those will most likely be within one standard deviation. Our standard deviation is 3.84, so 66 plus 3.84 is one standard deviation above the mean and 66 minus 3.84 is one standard deviation below it.

Now we will first use three standard deviations to remove the outliers. Three standard deviations is a common practice in the industry for removing outliers. Sometimes people use four standard deviations
or five standard deviations as well. If the data set is small, I have seen people using even two standard deviations, but you really have to use your judgment to come up with that threshold on how many standard deviations you want to use. So here I will figure out my upper limit by saying: I want mean plus three standard deviations to be my upper limit, and anything which is more than 77.91 I'm going to mark as an outlier. You can do a similar thing on the lower end and say mean height minus three standard deviations is my lower limit, which means any height less than 54.82 I will mark as an outlier.

Now let's quickly see what the outliers are in our data frame. In the data frame, when you do something like this, where you are saying "if my height is greater than the upper limit or lower than the lower limit, show me those data points", I find these seven data points. Here the height is really high; 78 is 6 feet 6 inches. The height is in inches, and you can convert it into feet and inches: 54 is 4 feet 6 inches. Now these heights might actually be valid heights. I'm not saying these are data errors, but many times, even if a data point is legitimate, we can remove it as an outlier because that can help with your model or with your data analysis process in general. Sometimes, because of an error, you might see crazy values such as 2000 or 1000. In that case removing those outliers becomes really easy; you don't have to think much. But here I can discuss this result with the business manager for whom I am building my model, and if he agrees I can remove these as outliers.

How do you remove these outliers? You can say df and specify a condition here: df.height, I want the height to be within these limits; it should be less
than the upper limit, and the height should be greater than the lower limit. That will give you a data frame which has your normal, good-looking data points, and this you can store in a new data frame. You can print the shape of this new data frame, and you will find that out of 10,000 samples, 9,993 are valid. This is the shape of your filtered data frame, and shape[0] returns the number of rows. If you do this you will find that you removed seven outliers using this method.

Now we are going to use z-score to do the same thing. Z-score and standard deviation are very similar things. The z-score gives you a number which tells you how many standard deviations away from the mean you are. If your data point is three standard deviations away, then the z-score for that data point will be 3; if it is 2.5 standard deviations away, it will be 2.5. I'm just going to pull up this other notebook and walk you through the definition of z-score and the equation. See, here it's saying it indicates how many standard deviations away your data point is. For example, our mean is 66.37 and 3.84 is the standard deviation, and if your data point is 77.91 then the z-score will be 3, because 77.91 is three standard deviations away. The equation is extremely simple: the z-score is equal to x, which is the data point value (in this case 77.91), minus the mean, divided by the standard deviation. Z-score is an extremely simple thing, guys; there is no rocket science in it. It is very similar to standard deviation; it just gives you a number for how many standard deviations you are away from the mean.

To calculate the z-score I am going to create a new column
in my data frame and call it z-score; you might know that this is how you can create new columns in a data frame. This will be height minus the height mean, so it's x, which is this data point, minus the mean, divided by the standard deviation. And this is how you can get the standard deviation: df.height gives you a column in the data frame (a pandas Series, which works much like a NumPy array), and .std() on it gives you the standard deviation. Once that is done you can print a couple of samples, and you will find that now I have a new column called z-score, and I see all these z-scores here.

Now you can filter out those rows which have a z-score of 3 or above, or -3 or below. A z-score of 3 means three standard deviations, so you are basically doing exactly the same thing that we did here. If you don't want to use z-score and would rather use this technique, fine, you can stop watching this tutorial; z-score is just an alternate way of doing the same thing. Let's say I got a z-score of 1.94 for this data point. How did I get it? All you do is take this data point minus the mean (the mean is 66.37) and divide that by 3.84, which is your standard deviation, and that's how you get this 1.94. Pretty simple, nothing to worry about.

Now I will figure out all the data points which have a z-score of more than 3. This is how you can get them, and you see you got those same 78 inch data points. Similarly, if you do z-score less than -3 you will get these two data points. So five and two make seven; these are the same outliers that you saw here. Those outliers can be filtered in one shot, but let's first view all of them by having this condition. I am just doing an "or" on these two conditions so that it will show both of them; it's a union of both of these sets, and you see these seven outliers. To filter these outliers, again, you can use the same approach that we used
previously, which is to say: I want to keep those rows which have a z-score greater than -3 and less than 3. This will be your new data frame, and when you do something like this, where df.shape[0] gives you the number of rows in the original df and this is the new data frame and the rows in that, you find that you removed seven outliers in total.

All right, that's all I had. Now is the time for the exercise. For the exercise I have this data set for you, the Bangalore property prices, so let me open that. What you need to do is first remove the outliers using the percentile technique, and then you get a new data frame. On that data frame you have to use four standard deviations to remove the outliers, then plot a histogram, and then use a z-score of 4 to remove the same outliers. I have a solution link here, but please try it out on your own first and only then look at the solution. The link to this notebook I'm going to provide in the video description, so you can look at the notebook for the code that we covered in this tutorial today, and if you go to the bottom you will find the exercise description. The CSV file you will find here: if I'm on this particular notebook, you can just remove this last part here and you find a folder, and then there is an exercise folder. If you go there, the CSV file is located here. This is a Bangalore home price data set which I got from Kaggle, so thanks, Kaggle, for providing all these amazing data sets. And this is the solution for that exercise, but do not look at that solution first; try it out on your own and only then look into the solution.

I'm going to cover many more interesting tutorials on feature engineering in the future, so keep watching, keep learning. Feature engineering is very, very important in data analysis. Whether you are building a machine learning model or not, it will help you
tremendously if you are targeting a data analyst or a data scientist career, so please pay close attention to all the feature engineering tutorials that I'm going to provide and work on the exercises. If you liked this tutorial please give it a thumbs up and share it with your friends. It really helps with my YouTube ranking, so please share it with as many people as possible. Thank you.
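
The two filters described in the captions (the three-standard-deviation limits and the z-score column) can be sketched roughly as below. This is only an illustration, not the notebook from the video: the CSV is not reproduced here, so the heights are synthetic values drawn from a normal distribution using the mean (66.37) and standard deviation (3.84) quoted in the captions, and the column name `zscore` and the variable names are assumptions. The number of rows removed will therefore differ from the seven reported in the video.

```python
import numpy as np
import pandas as pd

# Assumption: synthetic stand-in for the height column of the CSV used in the video.
rng = np.random.default_rng(0)
df = pd.DataFrame({"height": rng.normal(loc=66.37, scale=3.84, size=10_000)})

mean = df.height.mean()
std = df.height.std()  # pandas computes the sample standard deviation (ddof=1)

# Method 1: limits at mean +/- 3 standard deviations.
upper_limit = mean + 3 * std
lower_limit = mean - 3 * std
outliers_sd = df[(df.height > upper_limit) | (df.height < lower_limit)]
df_no_outliers_sd = df[(df.height < upper_limit) & (df.height > lower_limit)]

# Method 2: z-score column, z = (x - mean) / std, then keep -3 < z < 3.
df["zscore"] = (df.height - mean) / std
outliers_z = df[(df.zscore > 3) | (df.zscore < -3)]
df_no_outliers_z = df[(df.zscore > -3) & (df.zscore < 3)]

# Both methods should remove the same rows.
print(df.shape[0] - df_no_outliers_sd.shape[0], "rows removed by the 3-sigma limits")
print(df.shape[0] - df_no_outliers_z.shape[0], "rows removed by the z-score filter")
```

Note that keeping the inliers needs "and" between the two conditions, while listing the outliers needs "or"; the two methods are the same test written in different units, which is why they select the same rows.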
Info
Channel: codebasics
Views: 111,493
Keywords: z score python pandas, z score python example, z score outlier detection python, standard deviation outlier python, outlier detection and removal in python, outlier detection python z score, feature engineering tutorial python, outlier detection, outliers, outlier removal, z score, outlier detection standard deviation, outlier detection and removal, standard deviation feature engineering, outlier detection and removal z score, outlier detection and removal standard deviation
Id: KFuEAGR3HS4
Length: 20min 5sec (1205 seconds)
Published: Thu May 28 2020