Tutorial: Data Cleaning

Video Statistics and Information

Captions
Welcome to your tutorial on data cleaning. We'll start by looking at identifying impossible values and response sets.

Okay, so we have a data set here with a number of variables. We have ID; gender, which is a categorical variable; favorite color, which is also a categorical variable; happiness and vitality, which are interval-level variables measured on 1-to-10 scales; and reaction time, which is a ratio-level variable that goes from zero to infinity.

So the first thing we're going to do is look for impossible or out-of-range values. The easiest way to do that is through the descriptive statistics, using the frequency distributions. So we click here and get the frequency tables for all of the variables except for ID. We click OK, and what we get are different tables that tell us which values we have. So we have our five variables. If we go down, for gender we have male and female, and then we have this random 4 here, which doesn't make sense because we don't have a category for 4. If we look at favorite color, we have red, blue, green, yellow; everything's good. For happiness, we see that we have scores of 4, 5, 6, and 7 out of 10, so everything's fine, and the same with vitality. And then we look down here, and everything is within normal range for reaction time as well.

So we do have an impossible value here for gender, and essentially what that means is we have to go deal with it. Since it's a categorical and demographic variable, unless we can be really certain about what this should have been (for example, if we can access the original data and see what the participant entered at first), we have to get rid of it. So if we go back to our data set, we can see here, since the data set is not very big, that the 4 is right there. So what we really should do is delete it. However, when you are deleting something like this,
it's not a bad idea to check and see what else that participant has answered. We notice here that we might have evidence of a response set, since this participant appears to have entered the score 4 for every variable, even if that doesn't make sense, like here. So I would say in this case, based on the 4 here and the fact that there are 4s everywhere, this would be a worthwhile candidate for deleting. So we're not just going to delete the score here; we're going to take out this whole row. So we can delete participant 13 entirely (actually, I don't know what happened with our numbers, but anyway). We delete that here and that participant is gone. So we have dealt with our impossible value and our response set at the same time. If there had been any other impossible values in any of the other variables, you would just need to delete them if you're not sure what they should have been.

Next we'll look at handling missing data.

Okay, so next we need to deal with our missing values. If we run the frequency tables again, we can get a sense of how much missing data we have; it tells us right in this row here. For gender we have one missing score, for favorite color we have six missing scores, for happiness we have one, and for vitality and reaction time we have none.

If we recall, how we handle missing data depends on the variable. When it comes to categorical variables, we simply remove the data and code it as missing. For categorical variables that aren't demographic, if more than 5% of the data is missing, we can create a new category and call it "other", or we can just leave it as missing. And for interval and ratio data, so these three variables here, we would replace the missing value with the mean. We'll go through each of those right now. So if we look here, we can see what the overall percentage is. So on gender,
one missing score is about 3 percent. However, we do see here that on favorite color, six missing scores is equivalent to about 21 percent. This means that for this variable we could recode those missing scores as an "other" category. And then, since happiness is an interval/ratio variable, we would replace the missing score with the average of all of the scores. So let's go do that.

Okay, so for gender we need to replace the missing score with a code, so we'll give it a 99. Then we need to go back to the other side here and make sure SPSS knows that 99 means it's missing.

If we go back to favorite color, we have four categories and a lot of missing data. So what we're going to do is create a fifth category and call it "other". So we enter 5 here, and then we have to go tell SPSS that those are another category. We click here, add 5, and call it "other".

And then finally, for the happiness variable, we can see that we have the missing score right here for participant number 24. However, we need to figure out the average of this variable in order to plug in that score. So we go to Analyze, Descriptives; we can stay at Frequencies or go to Descriptives, either one will give us the option. We'll take out all the variables that we're not interested in and simply ask SPSS to generate the mean score for this variable. We click OK, and we see that the mean score for happiness is 5.5. That means we can go and recode this missing value with the score of 5.5. So here it is. Excellent.

Now if we rerun our descriptive statistics with all of our variables and look at our frequency tables, we'll see that the corrections have been made. You will notice that the only remaining missing score is on gender, and that makes sense because we can't impute that. However, for happiness we no longer have a missing score, and you do
notice that we have one observation of 5.5; that's the new one. But the average hasn't changed, because we simply replaced a score that was blank with the average, so it's not going to have an impact on the average. And for gender, it now tells us the code has been properly associated with the missing value, and we have our "other" category here. So that is how you would handle your missing data in SPSS.

Now we'll look at identifying and managing univariate outliers.

Okay, so the next thing we need to deal with is our outlier scores. If you recall, outlier scores can only occur on interval or ratio data, and an outlier is an observation that is significantly different from the rest. In SPSS, we're going to focus today on finding any outlier scores on this reaction time variable. What we're looking for are scores that are very different, and if you recall, we do that by calculating the z-scores to identify which ones are a problem. To get the z-scores for all these observations, we go to Analyze, Descriptives, and then Descriptives. We only select reaction time, and we check this little box here, "Save standardized values as variables". What this does is create a new column right here, a new variable named reaction time with a little Z in front of it, which represents the standardized scores of all of these scores. So we click OK, we can close that, and what we have are all the z-scores.

If you recall, we're looking for scores that are bigger than 1.96 or smaller than -1.96. So these are all fine; they're within one point of the mean, or of the center, so very normal scores. If we go down, everything's fine until we get here, where we have a score of 4.68. Obviously I made this a really big outlier so it'd be easy to find, but this score is much bigger than 1.96. So if you recall, we need to handle this by doing something called Winsorizing, where we would replace this
score of 15 seconds with the score that is the most extreme but still within normal range. If you remember, the formula for determining that is 1.95, which is just one value inside of 1.96, times the standard deviation of this variable, plus the mean of this variable. So the first thing we need to do is calculate the mean and standard deviation for this variable. We go to Analyze, Descriptives, unselect this box, and automatically, just by being in the descriptive statistics options, we get our mean and standard deviation right here. We click OK, and we have a mean of 2.93 and a standard deviation of 2.58. So we'll grab the calculator here. The formula is going to be 1.95 times 2.58 plus the mean, so we enter 1.95 times 2.58 plus 2.93, and that gives us 7.96. So we close this and replace this score of 15 (whoops) with a score of 7.96. What that represents is the score that is the most extreme but still within normal range, so if we were to rerun this, it would now be equivalent to about 1.95. Unfortunately, if you do rerun it, it's going to readjust now that the outlier score is no longer in it, and it's still probably going to show this as an outlier. But you only want to do this once, because it's going to keep readjusting the distribution, so doing this one time is sufficient. And that's how you perform Winsorizing.

Finally, we'll look at measuring univariate normality.

Okay, so the last thing we need to look at is the univariate normality of our variables. We're going to look at happiness and vitality, and what we're interested in determining is whether either of those variables is significantly skewed or has significant kurtosis. To calculate this, it's pretty straightforward. Again, we go to Analyze, Descriptives,
and we can do this from Frequencies or Descriptives, either way. So we take out that one, put happiness and vitality here, and then in Options we select, under Distribution, kurtosis and skewness. This will have SPSS generate the kurtosis score as well as the standard error of kurtosis, and the skewness score as well as the standard error of skewness. So we click OK and run the analysis, and we have our window right here. Basically, we have all the same descriptive statistics we had already asked for, plus we now also have this information on skewness and kurtosis.

What we need to do is divide the statistic by the standard error to obtain the ratio, and if we recall, we're looking for values that are bigger than 1.96 or smaller than -1.96 to indicate that we have an issue. If we use, for example, the happiness variable and look at skewness, we have 0 divided by 0.434, and we get 0, naturally. That means we have no issues with the skew on this variable; it has a nice distribution. For kurtosis we do the same thing. We have -1.12 divided by 0.845, and this time we get -1.353. So even though this is smaller than 0, it's not smaller than -1.96, so we don't have a significant problem here. There's a slight issue with the pointiness or flatness, but not enough that we need to worry.

Okay, next we'll look at vitality. For skewness, we have 0.164 divided by 0.434, and this time again we're much smaller than 1.96, so we don't have any issues with the skew on this variable either. And then let's do the kurtosis one. We've got -1.043 divided by 0.845, and this time again, although we're smaller than -1, we're not smaller than -1.96, so we can conclude that
there is no significant issue with kurtosis either.

Okay, so that covers everything you should need to prepare your data set for analysis. If you'd like, I can quickly show you what would happen if we had some issues with skew. Let's see here. Okay, so I've just replaced this variable here with a lot of very extreme scores. If we run the descriptives on just the regular reaction time variable, you see here the values have changed a lot. So now for skewness we have 1.209 divided by 0.434, and we get 2.79. This is bigger than 1.96, so we would say in this case, since it's a positive value, that we have a significant positive skew on that variable, and we would have to report that. Kurtosis, I think, is still fine, but we can check it anyway. And yes, much smaller than 1.96. Okay, so that is everything you need. Thank you.
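
The arithmetic in the captions can also be reproduced outside SPSS. What follows is a minimal Python sketch, not the presenter's procedure: the function names are hypothetical, the happiness scores in the imputation example are made up for illustration, and the remaining figures (mean 2.93, standard deviation 2.58, the 1.96 cutoff, the 1.95 "inside" value, skewness 1.209 with standard error 0.434) are the ones quoted in the video.

```python
from statistics import mean

Z_CUTOFF = 1.96  # scores beyond +/-1.96 are treated as extreme in the video
Z_INSIDE = 1.95  # "one value inside" the cutoff, used for the replacement score


def impute_mean(scores):
    """Replace missing scores (None) with the mean of the observed scores."""
    observed = [s for s in scores if s is not None]
    average = mean(observed)
    return [average if s is None else s for s in scores]


def z_score(value, var_mean, var_sd):
    """Standardize a raw score using the variable's mean and standard deviation."""
    return (value - var_mean) / var_sd


def winsorized_value(var_mean, var_sd):
    """Most extreme high score still inside the cutoff: 1.95 * SD + mean."""
    return Z_INSIDE * var_sd + var_mean


def ratio_test(statistic, standard_error):
    """Divide a skewness or kurtosis statistic by its standard error.

    Returns the ratio and whether its absolute value exceeds 1.96.
    """
    ratio = statistic / standard_error
    return ratio, abs(ratio) > Z_CUTOFF


if __name__ == "__main__":
    # Hypothetical happiness scores with one blank; imputing keeps the mean.
    filled = impute_mean([4, 5, 6, 7, None])
    print(mean(filled))                                # 5.5

    # Reaction time figures quoted in the video.
    print(round(z_score(15, 2.93, 2.58), 2))           # 4.68
    print(round(winsorized_value(2.93, 2.58), 2))      # 7.96

    # Skewness of the deliberately skewed reaction time variable.
    ratio, significant = ratio_test(1.209, 0.434)
    print(round(ratio, 2), significant)                # 2.79 True
```

The sketch only handles a high outlier, as in the video; a low outlier would presumably use the mean minus 1.95 standard deviations, which is an assumption rather than something the captions state.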
Info
Channel: Meredith Rocchi
Views: 52,715
Rating: 4.7987423 out of 5
Id: 2ZMXjDVIdd8
Length: 16min 17sec (977 seconds)
Published: Wed Aug 20 2014