Outlier detection and removal using percentile | Feature engineering tutorial python # 2

Captions
Outliers are unusual data points which are very different from the rest of your observations. For example, you are analyzing a data set which has people's ages in it. You might see up to ninety or a hundred years of age, but if you see a data point that has a thousand years, then that's an outlier that clearly indicates an error in data collection. Sometimes outliers can happen just because there is natural variation in your data set. For example, you see a data point with an age of 120 years. That might not be an error; maybe it's a legitimate, valid data point, but since it is very different from the rest of the data points, it can skew the statistical power of the data analysis process. For that reason, often if not all the time, it makes sense to detect outliers and remove them.

Now, there are many different ways of detecting and removing outliers. There are statistical techniques such as percentile, z-score, and standard deviation, and you can also use visualization with a box plot or a scatter plot to detect the outliers. In this particular tutorial we are going to look at the percentile way of detecting and removing outliers. We'll write simple Python code on a simple data set initially, then we'll look into a more complex data set and remove outliers from that using percentile, and in the end we'll have an interesting exercise for you to work on. This tutorial series is going to be awesome because I will be producing different videos for each of the outlier detection techniques. Let's get started.

Let's first understand what exactly a percentile is. If you know percentile then you can skip this section; I have the timeline of this video in the video description, so you can easily skip to the next section. You might have noticed that some test scoring techniques use relative scores. Here in this Excel file I have added the scores of different people out of a hundred. Now, if you use your usual percentage score, then this will be the percentage score, because these numbers are out of 100,
so the percentage is the same as that. But sometimes people use a relative scoring technique where 69 is the highest score, hence they will say, okay, this person achieved 100%, and 27 is the lowest score, hence they will say this person achieved zero. So basically he is at the bottom and this guy is at the top. This is a percentile rank, by the way. Here, for 56, it is 50%, which means that 50% of the samples are below the value 56. Let's count it: there are 4 samples below it, 1, 2, 3 and 4, so 4 out of 8. If you don't count this data sample itself, there are 8 in total after excluding 56, and out of those 8 samples, 4 are below 56, which means this is the 50th percentile. This one is the 100th percentile because all the other data points are below 69. So that's some basics on percentile rank.

Now, in this tutorial we are going to examine a data set of people's heights. Just assume that there is an apparel clothing company that wants to perform data analysis on people's heights so that they can design clothes of the relevant length accordingly. There are some dummy people's names, and I have listed all the heights here. The data set is very simple, so you can spot the outliers easily by visual examination, but the idea is that in real life your data set will be much bigger and you will need to use statistical techniques.

So I'm going to load this data in my Jupyter notebook. I have loaded it here in my pandas data frame, and then I will use the percentile feature of pandas. You all know that if you want to access the height column, you can access it by doing this, and that will return you the column of numbers (a pandas Series). On that you can call quantile, which will give you the percentile value. If you want the data samples which are above the 95% quantile, then you will get this value. What this means is that 9.68 is the 95 percent quantile, and anything above this is something we can consider an outlier. Now, what do you want to set your threshold to? It
really depends on the situation; there is no fixed guideline, but here I am just using 95 percent. Let me store this in a variable called max threshold, and the max threshold value is this. Now, in your data frame, this is how you identify the outliers. See here, this person's height is 14 feet. That cannot be true; it's hard to find a person whose height is 14 feet. So we just detected an outlier. We can also detect outliers on the minimum end by doing this: we can say my min threshold is quantile 0.05, meaning give me the value below which 5% of the data falls. Then I get this value, and when you filter for less than that min threshold it will also show you some outliers. Here Josef's height is 1.2. Assume that this is a data set for adults; 1.2 feet seems to be really low, and it's most likely a data error, an outlier which we can remove easily.

Now, if you have domain knowledge, you can actually use it. For example, for people's height we know that the max height could be around 7 feet or maybe 7.5 feet, so even if I don't want to use quantile, I can directly say, okay, if the height is greater than 7.5 then it's an outlier. But unfortunately, when we deal with data sets in real life we often don't have that much domain knowledge, and the features are very, very complex, so it becomes very hard to come up with a fixed threshold. At that time, using quantile can be very useful, because what you are doing is removing the samples on the far ends, on the far left as well as the far right. For that reason, quantile is one of the techniques that you can use.

Now, here in our example, if you want to remove these outliers, what you can do is this: in your data frame, say that if the height is less than the max threshold and greater than the min threshold, then only keep those samples. So you will get all these samples, and you see that there is no Josef with 1.2 height here, and we also removed this particular sample which had 14.5
feet height.

Now let's look at a slightly more complex data set. I have a data set of Bangalore property prices which I got from Kaggle. I have preprocessed it a little bit and I'm going to load this CSV file into my data frame. You can see that this has around 13,000 rows, and these are some of the very basic features for property prices. I have loaded them in my notebook here and this is how it looks. So now let's start analyzing this data. The first thing I will do is just confirm how many rows and columns there are: 13,200 rows and seven columns. You can also use the describe function just to get a quick feel for your data set. Here, the 25 percent and 50 percent that you're seeing are percentiles. What this means, quickly, is that, for example, for the total square feet column, 75 percent of samples have total square feet less than this value. This is the mean value, this is the max value. Similarly, let's look at price per square foot. Here, 75 percent of your data samples have a value less than 7,317 rupees per square foot. Now, if you're living in Bangalore, you will get a feel that, okay, this is probably about right. But then look at the maximum value: the maximum value is 1.2e+07, so this value is really high. This could be either a data error or a legitimate property, but this type of outlier will really hurt the performance of our model, so we need to tackle them.

So let's first find out the min and max thresholds by using quantile. Here you can also supply an array to your quantile function and it will return you the min and max thresholds. I am doing this on price per square foot; I basically want to remove outliers based on the price per square foot feature that I have in my data set. My minimum threshold is 1,366 rupees per square foot and the maximum is about fifty thousand rupees per square foot. Now, if you're living again in Bangalore, you would
know that if you're getting a home which has a price per square foot value of less than 1,366, it is most likely not true. It cannot be true; Bangalore is pretty expensive. So let's see which data points have a very low value. You can do something like this, where in your data frame you say my price per square foot is less than the min threshold, and give me all those data points. So I got all these data points. Here the price per square foot is ridiculously low for Bangalore. I mean, come on, you can't get a home for 371 rupees per square foot, so these are most likely errors in our data set. Look at this one here: a 9 BHK (bedroom, hall, kitchen) apartment, 42,000 square feet in area, and you are getting it (the price is in lakhs) for one crore 75 lakhs. This cannot be true. So these are clearly outliers.

Now, I used 0.1 percent and 99.9 percent; you can use different ones. Based on the situation and based on your intuition, you can use different thresholds. Also, if you have domain knowledge, let's say you are working for makaan.com or some real estate website, and you have a talented business manager who comes to you and directly says, you know what, based on my knowledge of real estate in the city, the price per square foot should be restricted to this range, then it's fine; you don't need to use quantile. But as I mentioned before, many times you don't have that domain knowledge and many times features are complex, and that's when quantile can be really useful.

Now let's look at the data points where price per square foot is greater than the max threshold. Now we are looking at the data samples on the higher end. Here, check this property: the price per square foot is ridiculously high. All of these are very, very high prices. Now, I'm not saying these are data errors; maybe some of them are valid. But if you keep these data points while building your model, they are really going to
hurt the performance. So to remove these outliers, what you can do is create a new data frame. How do you create a new data frame? Like this. Here, what I did is keep anything whose price per square foot is less than the max threshold and greater than the min threshold, so this will automatically remove your outliers. Now my new data frame, which is df2, has 13,172 rows; initially it was 13,200, so we removed a few outliers here. And if you do df2.sample (by the way, you can use sample just to randomly sample some rows from your data frame) and print 10 random rows, it shows these values, and you see that these values look pretty decent. That's all I had for this tutorial.

Now, the most interesting part of this tutorial is an exercise. You have to use the Airbnb New York data set from Kaggle. If you right-click on this link you will find this data set on Kaggle, and you can click on this button to download the data set. If you don't want to do that, you can go to my GitHub, and on my GitHub, under this outlier folder, there is an exercise folder. In that exercise folder you'll find the CSV file. Now, this IPython notebook is actually a solution for the problem, so don't look at the solution until you have tried it out yourself. What you have to do is analyze this data set and remove the outliers based on the price per night. The Airbnb data set shows various properties and hotels, and there is a price per night per apartment or whatever, and you have to use your intuition to determine your percentile range and then remove the outliers. Then you can verify your solution against my solution. I may have used a different percentile range, so your solution doesn't need to exactly match mine. This is a very simple exercise, by the way; you'll be doing pretty much the same thing I did in this tutorial, so go try it out. Once you try it out, you will
find that your understanding of whatever you have learnt in this tutorial becomes more solid. I'm going to provide a link to this notebook in the video description, so go check it out; towards the end of the notebook there is an exercise description. Now, many times people ask me how to download the CSV file, etc. What you can do is go to the root directory, which is py, click on the clone or download button, and then go to the ML directory. Under ML I have different ML tutorials, and here is the location where I'm going to host all my feature engineering code. Thank you very much for watching. Bye.
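
The code shown on screen is not captured in these captions. As a rough sketch of the two ideas described in the walkthrough (the percentile rank count, and keeping only the rows between a lower and an upper quantile), something like the following could be used. It assumes pandas is installed. The scores, names and heights are made-up illustrative values, not the video's files, and `percentile_rank` and `trim_by_quantile` are hypothetical helper names that do not come from the tutorial's notebook.

```python
import pandas as pd


def percentile_rank(values: pd.Series, value: float) -> float:
    """Share of the *other* samples that fall strictly below `value`.

    Follows the counting described in the captions: the sample itself is
    left out of the total, so `value` is assumed to be one of `values`.
    """
    below = int((values < value).sum())
    others = len(values) - 1
    return below / others


def trim_by_quantile(df: pd.DataFrame, column: str,
                     lower: float, upper: float) -> pd.DataFrame:
    """Keep rows whose `column` lies strictly between the two quantiles."""
    # Passing a list to quantile returns both thresholds at once.
    min_threshold, max_threshold = df[column].quantile([lower, upper])
    mask = (df[column] > min_threshold) & (df[column] < max_threshold)
    return df[mask]


# Hypothetical scores out of 100 (9 samples, lowest 27, highest 69).
scores = pd.Series([27, 35, 42, 50, 56, 60, 63, 66, 69])
print(percentile_rank(scores, 56))  # 4 of the 8 other samples are below 56

# Hypothetical heights in feet, with one very low and one very high value.
heights = pd.DataFrame({
    "name": [f"person_{i}" for i in range(1, 15)],
    "height": [1.2, 4.9, 5.1, 5.2, 5.4, 5.5, 5.5,
               5.6, 5.9, 6.1, 6.2, 6.5, 7.1, 14.5],
})

# 5% and 95% are the cut-offs used for the height example; other data
# sets may call for different ones (for instance 0.001 and 0.999).
trimmed = trim_by_quantile(heights, "height", 0.05, 0.95)
print(trimmed)
```

With these made-up heights, the 1.2 and 14.5 rows fall outside the thresholds and the remaining twelve rows are kept. The strict comparisons mirror the "less than max threshold and greater than min threshold" filter described above; whether to include the boundary values is a judgment call.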
Info
Channel: codebasics
Views: 54,345
Keywords: outlier detection and removal in python, outlier detection python percentile, pandas outlier detection feature engineering, feature engineering machine learning, feature engineering tutorial python, feature engineering in data science, outlier detection, outlier detection techniques, how to remove outliers, outliers, how to remove outliers from dataset, outlier analysis, outliers python, how to remove outliers in python, outlier detection python, removing outliers in python
Id: 7sJaRHF03K8
Length: 17min 18sec (1038 seconds)
Published: Mon May 25 2020