Identifying Multivariate Outliers with Mahalanobis Distance in SPSS

Video Statistics and Information

Captions
Hello, this is Dr. Grande. Welcome to my video on identifying multivariate outliers using the Mahalanobis distance.

I have fictitious data here in the Data View of SPSS: an ID variable, an independent variable with three levels, three more independent variables measured at the scale level of measurement (functioning, severity index, and motivation), and an outcome variable, also measured at the scale level. I want to test those three scale-level independent variables for multivariate outliers. What I'm really looking for are unusual combinations of these three scores, combinations so unusual that I would regard them as outliers. Then, based on the criteria for the study, I might decide to adjust or transform the variables, or eliminate the outlying cases.

First we need to compute the Mahalanobis distance for these three variables, and the way we do that is through Analyze > Regression > Linear. Even though a linear regression is not of interest to us, it's how we compute the Mahalanobis distance in SPSS. The dependent variable doesn't matter, it doesn't affect the new variable that gets created, so we'll just use outcome as the dependent. The independents, of course, are all the variables we want to screen for multivariate outliers, so we move functioning, severity, and motivation into the Independent(s) list box. The Method remains Enter, Statistics remain the same, and Plots remain the same. Under Save, however, you can see Mahalanobis under Distances; check that and click Continue. No changes to Options and no changes to Style, so I click OK. You'll see it generates a new variable, MAH_1; this is the Mahalanobis distance for these three independent variables. Now let's sort the data set by this new variable, descending. We can see some relatively large values compared to the more common values found in this variable, so we suspect there are some outliers, but we don't know for sure yet.

To find out, we want to compare these Mahalanobis distance (MD) values to a chi-square distribution with the same degrees of freedom, and the degrees of freedom in this case will be equal to the number of predictors, so 3. Go to Transform > Compute Variable and set a new target variable; we'll call this one probability_MD. Start the numeric expression with "1 −", then go over and find the cumulative distribution function for chi-square, CDF.CHISQ, which returns the cumulative probability that a value comes from the chi-square distribution. Double-click it, move the new MD variable into the first argument, and enter the degrees of freedom, which we know is 3, as the second argument, so the expression is 1 − CDF.CHISQ(MAH_1, 3). This calculates the p-value for the right tail of the distribution.

Let's take a look at the result. It only shows us two digits to the right of the decimal place, so I'll go to the Variable View and change Decimals for probability_MD to 5. Back in the Data View, we now want to compare these probabilities against .001: if one of these probabilities is below .001, that combination of the three variables represents a multivariate outlier. You can see the first value, .00015, would be an outlier; similarly .00034, but not .004. So in this particular case the only two outliers are ID 1007 and ID 1018.

Now I want to show you an alternate way to calculate this probability. Go to Compute Variable; you can see it retains the last expression I used. If you don't want to use 1 minus the cumulative distribution function of chi-square, you can delete all of this and type SIG.CHISQ with the same arguments. I'll change the target name to probability_MD1 just to distinguish it from the other variable. When I go to the Variable View and display five digits to the right of the decimal, you can see it returns the same results; the values are equal. Either way is fine for calculating that probability.

If you're dealing with a particularly large data set and you'd like some way to flag outliers with a 1 or a 0, here's how. I'm going to clear the second probability variable, because it's the same as the first. Go into Compute Variable, change the target name to outlier, delete the old numeric expression, and simply enter probability_MD < .001, then click OK. It creates a new variable named outlier, and you can see the two cases we identified as outliers are now labeled with a 1 in the outlier variable, and the remaining cases with a 0. In this case we do not want any digits to the right of the decimal place, so go to the Variable View, adjust Decimals to 0, and then go back to the Data View; you can see we have the two 1 values and the rest are 0.

I hope you found this video on identifying multivariate outliers using the Mahalanobis distance to be helpful. As always, if you have any questions or concerns, feel free to contact me and I'll be happy to assist you.
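Outside SPSS, the distances from the regression trick above can be reproduced directly from the definition. Below is a minimal NumPy sketch with invented data (the three columns only loosely mirror the video's fictitious functioning/severity/motivation variables); note that what SPSS saves as MAH_1 is the squared Mahalanobis distance, which is why it is compared to a chi-square distribution.

```python
import numpy as np

# Invented scores for three predictors; one row is planted as an
# unusual *combination* of values (a multivariate outlier).
rng = np.random.default_rng(42)
X = rng.normal(loc=[50.0, 10.0, 30.0], scale=[8.0, 2.0, 5.0], size=(25, 3))
X[0] = [80.0, 2.0, 5.0]  # far from the joint pattern of the other rows

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)   # sample covariance (n-1 denominator)
diff = X - mean

# Squared Mahalanobis distance of each row: (x - mean)^T S^{-1} (x - mean)
md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

print(np.round(md2, 2))
```

A handy sanity check on any implementation: when the sample mean and sample covariance are used, the squared distances always sum to p·(n−1), here 3 × 24 = 72.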
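The chi-square comparison can also be checked by hand. For 3 degrees of freedom the chi-square survival function has a closed form, so this sketch reproduces 1 − CDF.CHISQ(md2, 3) using only the Python standard library; the md2 values below are made up for illustration.

```python
import math

def chi2_sf_df3(x):
    """Right-tail probability of a chi-square variable with 3 df.

    Closed form for df = 3:
        P(X > x) = 1 - erf(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    which matches 1 - CDF.CHISQ(x, 3) (and SIG.CHISQ(x, 3)) in SPSS.
    """
    return (1.0 - math.erf(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# Hypothetical squared Mahalanobis distances, like SPSS's MAH_1 column
md2 = [17.9, 14.2, 9.1, 4.4, 1.3]

p_values = [chi2_sf_df3(d) for d in md2]
outlier = [int(p < 0.001) for p in p_values]  # 1 = multivariate outlier

for d, p, o in zip(md2, p_values, outlier):
    print(f"MD^2 = {d:5.1f}   p = {p:.5f}   outlier = {o}")
```

Only the first case falls below the .001 cutoff here, because the df = 3 critical value at p = .001 is about 16.27; a distance of 14.2 gives a small p-value but not a small enough one.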
Info
Channel: Dr. Todd Grande
Views: 157,205
Rating: 4.9555106 out of 5
Keywords: SPSS, Mahalanobis, distance, Mahalanobis distance, independent variables, predictor, predictor variable, outcome, outcome variable, dependent variables, outliers, multivariate outliers, linear regression, regression, CDF.CHISQ, 1- CDF.CHISQ, SIG.CHISQ, compute variable, degrees of freedom, counseling, Grande, Outlier, Multivariate Statistics
Id: AXLAX6r5JgE
Length: 8min 23sec (503 seconds)
Published: Sun Sep 06 2015