How to Identify and Treat Outliers in Stata | Stata Tutorial

Captions
In this video we are going to understand what an outlier is, what the different methods of finding an outlier in Stata are, and then how we can treat the outliers.

Starting with the definition: an outlier is an extremely high or extremely low value in the data set. But that definition is quite vague, and that is how it is with outliers. There are multiple definitions of outliers, and no definition is right or wrong; it depends on the type of data, the researcher's objective, and the setting. So at the end of the day we will provide some tools to identify outliers, but how to treat them, or whether to treat them at all, depends on the researcher.

So what are the different methods of finding outliers? First let's load some data; in this case again we are going to use the auto data. In this video we are only going to cover univariate outliers. Univariate outliers are outliers that are particular to a single variable. In another video we will cover multivariate outliers, the outliers that affect not just one variable but multiple variables. Because a univariate outlier is particular to one variable, we are going to work with a single variable, price.

Starting with the first method: sorting the data using the sort command. We sort the price variable and look at the data. Now that it is sorted, we can see the lowest values, and we see that they are close enough to each other, so we don't identify any outliers in the lower part of this data. Among the higher values there seem to be certain outliers, but they might or might not be outliers. So sorting the data just gives you an idea; it is not a statistical method. One more thing: let me edit this value and change it to 50,000, and now we can clearly see that it is
an outlier. If this were the case in your data set, then just by sorting the data you could identify the outlier.

So let's move to the next method, which is the box plot. The box plot is the graphical equivalent of the five-number summary, or the interquartile method of finding outliers. We can draw the box plot by clicking on the Graphics menu, then on Box plot, and we get a new dialog box. We just need to select one variable, in this case the price variable. We click OK and the graph appears. You will notice that this is our box, these are the whiskers, this line is the 75th percentile, this line is the 25th percentile, and the middle line is the median. If you understand the interquartile range then you have probably understood what a box plot is; we will have another video on box plots in particular. In this case you can clearly notice that this dot, this specific observation, is way above the normal pattern of the data, and this clearly is an outlier.

But the issue with the box plot in Stata is that it does not give us the exact value, or the row number, of the case that is an outlier. So we do not know which value is the outlier. We would have to go to the data, sort the price variable, and look at the highest values to get an idea of which one it is; the box plot itself does not tell us. But there is a way around this, so let's see how we can label this point so that the graph gives us the exact ID of the observation.

Let us generate a new variable by the name of id, where each value is the row number of that specific row. Let us create it, and now let me show you the data: you will notice that we have given an ID to each case, so we have a total of 74 IDs. Now we will draw the graph again, but let's use the command line this time. The command for a box plot is graph box, followed by the variable name. This is the simple command, but we want to use another option, which is marker(1), and then we want
to label it. Within marker there is another option, mlabel, so we want to label it with id, the ID variable that we just created: graph box price, marker(1, mlabel(id)). Press Enter; I missed a parenthesis. So now you will see that it gives us the ID of those observations, and this observation is the 74th observation. If I move to the data view I can clearly identify this observation, the 74th one, and now I can treat it.

The next method of identifying outliers is the extremes command. This isn't a built-in command of Stata; it is a user-written command, so we first have to install it. To install a user-written command we type ssc install and then the command name, in this case extremes. I have already installed it, so I do not need to repeat the process.

So what is the issue with the box plot, and why do we need extremes? The problem with the box plot is that it does not allow us to change the interquartile multiplier. In the box plot, the top whisker extends at most 1.5 times the interquartile range above the 75th percentile, which is the default, and the box plot does not allow us to change this. There are different definitions of outliers: one definition says that any value more than 1.5 times the interquartile range beyond the quartiles is considered an outlier, another suggests a multiplier of 2.2, and there is research that suggests 3 times the interquartile range as the cutoff point to identify an outlier. If we wanted to change that definition, we cannot do that with the box plot.

So we have a way around, and that is to use the extremes command. The simple syntax is to type the command name and then the variable name, which is price in our case. When you press Enter, it just gives us the lowest five observations and the highest five observations. So by default it is just giving us the five lowest and the five highest values, but this isn't quite useful
for our task; we need to identify outliers. So what we do is add another option to the extremes command, which is iqr. It stands for interquartile range, and the default multiplier is 1.5, the same as we discussed for the box plot. So it gives us all these values as outliers. But values beyond 1.5 times the interquartile range are not necessarily considered outliers; there are other cutoff points, for example 3. If we use a cutoff of 3, then we see that these values come out as outliers. So this is also a way to identify the extreme values. It is not necessary that all these values are outliers, but it gives us an idea that all these values lie more than 3 times the interquartile range beyond the quartiles.

We can also specify two variables, say for example the price and the mileage variable. In that case the command gives us the extreme values of the first variable, so these values are the same as before, and the values of the other variable for those same cases. In this case the case is the 70th, its price is 12,990, and its mileage is 14. This doesn't suggest that these are the extreme values of the mileage variable; it just gives us the extreme values of the first variable, whichever variable we wrote first, which is price here, and then the corresponding values of the second variable for those specific cases. So we have got a certain idea of how to identify outliers.

The next method that we are going to use is the histogram. We can create a histogram by clicking on Graphics, then Histogram, and then we select the variable, which is the price variable. We can change the y axis to frequency instead of density, click OK, and it clearly shows that this is the observation that is different from the other observations. So this data is clearly skewed. We have a whole video on how to use histograms; you can check out the link in the description box below.

There is one more graphical method that we can use, which is
called the spike plot. For that, click on Graphics, then on Distributional graphs, and this is the last option, the spike plot. We select the variable name, click OK, and it gives us the spike plot. In a histogram everything is summed up in bins, but a spike plot shows an individual spike for each value of a continuous variable. So in this case these are all the data points. We can clearly see that these data points are clustered together, so they are not outliers; these ones are quite near to the cluster. This one value seems to be an outlier; it is clearly different from the rest of the cases.

The last method on our list is the z-score. We calculate the z-score of a variable, and if any data point falls more than three standard deviations away from the mean, it is considered an outlier. How do we create this z-score? The command is egen, which stands for extensions to generate. Let's call the new variable std_price, which stands for standardized price; the z-score is also used to standardize variables. Then we have the std() function, and in parentheses we write the variable name: egen std_price = std(price). Let's move to the data view, and you will see that a z-score has been created for each value. If we sort this data and scroll down, you will see that this specific value is beyond the third standard deviation. According to this rule, all the values should fall within 3 standard deviations of the mean. This value is 7.5 standard deviations away from the mean, which clearly indicates that it is an outlier.

So once we have identified the outliers, how do we treat them? The first method is to do nothing and leave them as they are. We say that outliers are also humans, and we can use some kind of nonparametric test that is not affected by these outliers. For example, say the normal height is 5.5 feet, but there would be someone whose height is eight or ten feet. They are outliers, but in some cases we won't delete them.

The next method is to correct data entry errors. There are
different reasons for outliers, and one of them is data entry error. For example, you have coded a gender variable as 0 and 1, and there is an entry in the gender variable of 11, double one. This might be due to human error, so we need to correct it: we replace the double one with a single one, because we know that we mistakenly pressed the 1 key twice.

The third method is to winsorize the variable. By winsorizing we mean changing the value of the outlier to the nearest value of an observation that is not an outlier. For example, in the case of our price variable, what we can do is replace this last outlier, which is 50,000, with the nearest value that is not an outlier. If we consider 14,500 a legitimate value, then we can replace 50,000 with 14,500. An outlier can be legitimate: there could be a car whose price is 50,000, so it is a legitimate value, not an error, but in this case the value is not representative of our sample. So in most cases what we would do is winsorize these values.

So how do we winsorize a variable in Stata? We install a user-written command, winsor2, and we are going to use this command to winsorize our variable. I have already installed it, so I do not need to reinstall it. Before that, let me summarize the price variable, and let's use the detail option to get the different percentile values. So now we know that the first percentile is 3,291, this is the price, and the 99th percentile is 50,000. What we want is to replace any value beyond these two percentiles with the percentile values.

So let's use the command. We type winsor2, which is the command name, then the variable name, then a comma. The first option is the replace option; we either generate a new variable or replace the previous one. Then we write the cuts; in this case let's say I want to use the 1st and 99th percentiles, cuts(1 99). That means replacing the outliers with the 1st percentile and the 99th percentile, and it
will replace it. Let us go back to the data view, and you will notice that it has not been replaced. The reason is, as you would have noticed in the summarize table, that the 99th percentile was the same value, 50,000. So winsor2 did replace the outlier, but with this same value. So we need to change the cutoff points; in this case we might change them to, say, the 5th and 95th percentiles. In this case the highest value, 50,000, is replaced with 13,466. Now if we summarize again, you will notice that the 50,000 value has been replaced with the 95th percentile.

You can use any cutoff; it is not mandatory to use the 5th or 95th percentile, and for that matter you can use any percentile. In this case you will notice that all these top values have been replaced with the 95th percentile, so it would have been better if we had used the 98th or 97th percentile.

So what the winsor2 command does is replace the outlier values with the percentile values. In this case we told Stata to replace all the values above the 95th percentile. The 95th percentile value was 13,466, and that value was repeated in all the cases that were higher than that value. It is another way of saying: replace price with 13,466 if price is greater than 13,466. So what the winsor2 command does is first calculate the percentile value, and then replace the variable's values that are higher than that specific percentile value.

The last method of dealing with outliers is to simply delete them, drop them, or what we call trimming the outliers. We can perform this task using the same winsor2 command, but first I have to change the data, because I winsorized the outliers in it. Now I have created the outlier again. So if I want to trim the data instead of winsorizing it, I simply use the trim
option. In this case I rewrite the command, which is winsor2 price, replace, and like last time I use the cutoff points of, say, 5 and 95, and then I add trim. By default it winsorizes the variable, but if I add the trim option, then instead of winsorizing it removes the outliers. So now you will notice that that specific value, that outlier, is deleted.

Please click on the subscribe button and hit the bell icon, and you will get a notification as soon as we upload a new video. Also leave a comment in the comment box below with any recommendations or suggestions that you would like to give.
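
The commands spoken in the captions can be collected into a short do-file. This is only a sketch reconstructed from the walkthrough, not the presenter's own file. A few things in it are assumptions: the variable names id and std_price, the `, replace` added to the ssc install lines so they can be rerun, changing the top price to 50,000 with `if _n == _N` instead of editing it by hand, and letting winsor2 create new variables (its default suffixes, price_w and price_tr) rather than using the replace option as in the video, so the outlier does not have to be recreated between steps.

```stata
* Sketch of the walkthrough; id and std_price are illustrative names.
sysuse auto, clear

* 1. Sort, then mimic the manual edit: the highest price becomes 50,000.
sort price
replace price = 50000 if _n == _N
generate id = _n

* 2. Box plot with outside values labelled by row number.
graph box price, marker(1, mlabel(id))

* 3. extremes (user-written, installed from SSC).
ssc install extremes, replace
extremes price              // five lowest and five highest values
extremes price, iqr         // default multiplier of 1.5
extremes price, iqr(3)      // stricter multiplier of 3
extremes price mpg, iqr(3)  // extremes of price, with mpg for the same cases

* 4. Histogram (frequencies on the y axis) and spike plot.
histogram price, frequency
spikeplot price

* 5. z-scores: flag values more than 3 standard deviations from the mean.
egen std_price = std(price)
list id price std_price if abs(std_price) > 3 & !missing(std_price)

* 6. Winsorize and trim at the 5th and 95th percentiles.
ssc install winsor2, replace
summarize price, detail
winsor2 price, cuts(5 95)        // creates price_w
winsor2 price, cuts(5 95) trim   // creates price_tr
summarize price price_w price_tr
```

As far as I am aware, the trim option of winsor2 sets the trimmed values to missing rather than dropping the rows, so the observation count stays at 74 while price_tr has fewer non-missing values; it is worth confirming this against the command's help file before relying on it.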
Info
Channel: The Data Hall
Views: 11,294
Rating: 5 out of 5
Keywords: outliers, outliers in stata, winsor2, winsor2 in stata, univariate outlier detection stata, univariate outlier, winsorize a variable stata, extremes stata, winsorize outliers stata, how to address outliers in stata, finding outliers with stata, winsorize command stata, boxplot stata, stata drop percentiles, stata boxplot label outliers, stata extremes, stata interface, thedatahall, data, winsorize, graph, histogram, graph box, stata, data management, stata tutorials, how to use stata
Id: XUk2gA11_JU
Length: 20min 34sec (1234 seconds)
Published: Tue Jun 30 2020