SEM Series Part 2: Data Screening

Captions

In this video I'm going to talk about data screening, by case and by variable. If you want to follow along, you'll need this data set: it's the YouTube SEM Series data set on the home page of StatWiki. If it doesn't show up, just hit Ctrl+F5 to refresh the web page and it should show up. That's Ctrl+F5, not just F5 and not just the refresh button.

Okay, so here's what we're going to do. Let's go over to Data Screening... oh, not that one, General Guidelines, excuse me. Wait for it... there we go. Here is the order of operations: we need to do case screening and then variable screening.

So let's go look at our data set. It's the exact same data set you have, YouTube SEM Series.sav. Here it is in the Variable View. You can see we have all the independent variables: that's playfulness, comprehensiveness of use, and atypical use. Usefulness is a mediator, joy is a mediator, and info acquisition and decision quality are dependent variables. Then we have our controls, gender, age, and education, and our two multigroup moderators, frequency and experience, and then an ID variable just for keeping each record straight.

If I go to the Data View, here's all the data. What we need to do first is find out if we had any cases, that is, rows or people, that did not respond to very much. The easiest way to do this, actually, is to hit Ctrl+A, Ctrl+C and do it all in Excel. So let's do that: Ctrl+N, Ctrl+V. All right, here we are. You know what I really need is those variable names, so let's go get them. The way to get them is to go back to SPSS, just highlight all those names, Ctrl+C, and go back. I should actually insert a row here. On a different sheet, hit Ctrl+V, then copy it again and paste it with the transpose option. It's going to transpose it in that order, which is kind of nice. Ctrl+C, go back, and paste it right here, Ctrl+V. That should go over to the last variable. Yes. Okay.
We're in. So the first thing we need to do is find out if we have many values missing from a single respondent. The best way to do this, I think, is to use =ISBLANK, and for the value just Ctrl+Shift+Left, and... oh, it's counting, excuse me, not ISBLANK: =COUNTBLANK, yes, and then give it the range. There should be zero blanks in the variable-name row. Let's go here and rename this to "blanks", and then I can just sort by blanks. Now, we're allowed to sort because we have an ID that we can refer back to and maintain a consistent order; we can always sort by ID to get back to our original sort order. So let's sort: Data, Sort, Z to A.

And we see here, out of our... how many? Let's see, you can see here 49 variables. These two guys are missing 47 of them. Are they useful? No, they're completely useless, so I'm just going to delete them. That's justifiable. Now, these guys are missing two. That's not that many. If it were, you know, like 10% of the variables, so in this case about 5, then I'd be concerned, but it's only 2, so we'll just impute those values, which I'll show you how to do very soon. Okay, and the rest had nothing missing. Wonderful people, aren't they?

Okay, now let's go back to our order of operations. We did case screening by missing data; next we need to look at unengaged responses. Hmm, all right, what is an unengaged response? It's someone who responds with the exact same value for every single question, so they put fours all the way across or threes all the way across. There are other things they could have done, like 1 2 3 4, 1 2 3 4, but it's harder to detect that without a visual inspection. So what I'm going to do is type in... not there... I'm going to look at the standard deviation. Let's see: =STDEV.P, sure. Of which ones? Well, just the latent variables, probably. There we go, all the way back to the beginning there, and Enter, and drag that down. It's just too many numbers to look at, so decrease the number of
decimals, and let's sort again. Okay, Data, Sort, A to Z, and it'll show us... my guess... this person has zero standard deviation. What does that mean? That means they answered three for every single question. So are they useful? No, not at all. Get rid of him. That is a justifiable deletion.

All right, this person had very little standard deviation, 0.15, so let's go look at the responses: four, four, four, four, four, four, four, four, four, four... wow, a lot of fours and a three. So are they useful? Not really. I mean, they might have been telling the truth, but they're completely useless because they don't have any variance in their responses. An increase in, let's say, playfulness isn't going to make any difference when we look at an increase in information quality, because there is no increase; they stay the same. There's no variance. So I'm going to say they're pretty useless. In fact, anything under about 0.5 is probably pretty useless.

Let's just go look at a few others to make sure. Here's a 0.3: four, four, four, four... it's a lot of fours and a couple of threes. On atypical use they put threes. That's a little bit more engaged in the survey. For these ones I would just do a really quick visual inspection. Threes, four, three, two: okay, fairly engaged. Fours, threes, fours, five, and so on. So we might say that's good, except this top one, the 0.15 one; they were completely useless, not engaged at all. So I'm just going to get rid of those and keep the rest of these. You always want to remove as little data as possible. You want to remove the ones that are clearly unengaged or malicious; hopefully we don't have any of those.

Okay, what do we do next? Let's go look: that was unengaged responses; next, outliers. Okay, so let's put all of this back into SPSS after deleting these two rows we just used and after re-sorting by the ID column. Okay, we're back in the original order, and we're minus a few. Let's see, how many do we have now? We have 381 responses now.
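
The two case-screening checks above (counting blanks per respondent, then looking at each respondent's standard deviation) can also be sketched outside Excel. This is only an illustration under stated assumptions: blanks are represented as `None`, the function names and the two thresholds are hypothetical and merely encode the rules of thumb mentioned in the walkthrough (about 10% of items missing, SD under about 0.5), and the code flags cases for a manual look rather than deleting anything.

```python
from statistics import pstdev

# Hypothetical thresholds, taken from the rules of thumb in the walkthrough.
MAX_BLANK_SHARE = 0.10   # about 10% of items blank is worrying
MIN_ROW_SD = 0.5         # a row SD under about 0.5 deserves a visual check


def count_blanks(row):
    """Count missing answers in one respondent's row (None marks a blank)."""
    return sum(1 for value in row if value is None)


def row_sd(row):
    """Population SD of the answered items, in the spirit of Excel's STDEV.P."""
    answered = [value for value in row if value is not None]
    return pstdev(answered) if answered else 0.0


def flag_cases(rows):
    """Map case ID -> list of reasons the case deserves a manual look.

    `rows` maps an ID to that respondent's Likert answers. Nothing is
    deleted here; cases that raise no flag are left out of the result.
    """
    flags = {}
    for case_id, row in rows.items():
        reasons = []
        if count_blanks(row) >= MAX_BLANK_SHARE * len(row):
            reasons.append("many blanks")
        if row_sd(row) < MIN_ROW_SD:
            reasons.append("low variance")
        if reasons:
            flags[case_id] = reasons
    return flags


if __name__ == "__main__":
    # Made-up answers for three respondents, keyed by ID.
    survey = {
        1: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        2: [1, 2, 3, 4, 5, 4, 3, 2, 1, 2],
        3: [4, 4, 4, 4, 4, 4, 4, 4, 4, 3],
    }
    print(flag_cases(survey))
```

Keeping the ID as the dictionary key plays the same role as the ID column in the spreadsheet: the flagged cases can be traced back to the original records regardless of sort order.
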
I'm going to copy all of this except the top row. That's Ctrl+Shift, and then I have to hold Shift and press Up and Left a little bit. There, I've got it all. Ctrl+C, go back to SPSS, Data View, and delete everything that's there, because we're going to overwrite it. You don't just Ctrl+V or paste over it; you actually delete it all, because now we have fewer responses. And paste. There we go. Just scroll down to the bottom to make sure we're clean. Okay, 380, yep. Let's scroll all the way to the right, and the ID column is filled, so good.

Okay, the next thing we said was to look for outliers. All right, we're using latent variables on a Likert scale of 1 to 5, so is there such a thing as an outlier? Not really. There might be somebody who answered differently, you know, everybody answered fives but this guy answered one. Is that really an outlier? Well, we don't know, and so we can't remove him. So no, not really. But we could have outliers on gender... well, not gender, it's a forced response, so it's only a 1 or a 2. We could have them on age and education and frequency and experience. Who knows, maybe there were some oddballs on each of these. So let's go look at those. Analyze... the easiest way to do this is probably Graphs, Legacy Dialogs, Boxplot, and just do a Simple one, by separate variables. Is that right? Define, yeah. Then we want to stick in age, education, frequency, and experience, and label cases by ID. Hit OK.

All right, and here we have it. So, age: we have somebody who was 35, apparently, and somebody who was 33-ish. Are they outliers? Well, not in the sense of abnormal, erroneous, bad data, no. It's just somebody who was old who took this survey. Old, as in 35, ha ha. So can we delete them? Well, only if we say that we're only interested in people between the ages of 18 and 25, so I wouldn't delete them. Let's look at the next one. This one is... I can't tell. Let's see, we started with age, and the next one's education. Education: somebody said they had 11 years of education. Wow, that's pretty good. It's
number 193. Let's go find out how old number 193 is, just to make sure this person could really have, you know, that many years of education. Oh, there we go, 193. So here's education, ten years, but they're 22. I mean, they started college when they were 12? You know, it could be possible, you've got those child prodigies, but if they did, then why are they still in an undergraduate intro course? I'm thinking this is probably a mistake on education, and it would make sense to replace it with the median. It's definitely an outlier, an erroneous outlier. So the median is... it's hard to tell. Let's see if it gives it here. Nope. Let's go run it: Analyze, Descriptives, Descriptives. Let's go get education, and we're just looking for the median, or the mean, fine. Continue, and the mean is 2.18 years. I'll just stick a 2 in there, because that's probably the median, 2 years. And that has changed.

All right, let's go to the next one, which was frequency. No outliers in frequency. And then on experience, holy schnikes, this guy has 25 years of experience. Has Excel even been out for 25 years? Let's go look at number 291... 391. 25 years of experience, and he's 21 years old. Huh. I think that's a mistake. So let's go find out what the median experience is, or the mean experience. Descriptives, throw education out, look at experience. Okay, and the mean is 4.4 years, so I'm just going to stick 4.4 in there. 4.4, okay, and that solves that problem. That was clearly an erroneous outlier. You can't be 21 and have 25 years of experience, okay, even if you were using Excel in the womb.

All right, so that takes care of outliers. Let's go to the next one: missing data, for variable screening. Let's save. Now I'm going to go to Analyze, Descriptives, Frequencies, and we're just going to stick everything in there except ID; I suppose we don't need ID. We're going to go to Statistics... looks like we're good here. I think it automatically tells us about missing values, so hit OK. And here's the missing table. I'm just going to copy it. It's a fairly
big table. Yeah, I'm just going to copy it, go back over to Excel, stick it in here, and just, you know, highlight all that and do something like... where's Conditional Formatting? Oh, bother. Hmm. Conditional Formatting, Highlight Cells Rules, Greater Than, zero. Okay, now we can see who's got some missing: experience, age, decision quality, info quality (or acquisition, excuse me), joy, and usefulness.

Okay, so what we need to do is replace those values with the median if they're on Likert scales like this, or, if we want, with the mean for more continuous variables such as age and experience. So let's do that. I'm actually just going to consolidate here so I can see all the ones that are missing data. Okay, so usefulness 2, 3, and 5. We'll go here: Transform, Replace Missing Values. What was it again? Usefulness 2, 3, 4, 5... 2, 3, did I say 4? I meant 5: 2, 3, 5. And then joy 6, info acq 3... joy 6, info acq 3, and decision quality 1, and age, and experience. All right, that's a lot.

So what do we do with these? We have to do them one at a time. So, usefulness: we want to rename it, actually, to just the same thing; we're just going to replace the missing values inside it. Here, use Median of nearby points, and use All for the span; that's just the median of the whole variable column. Hit Change, and you see it changed up here. I'm going to do the same for the next usefulness item: get rid of that, give it a median, All, Change. I'll do this for each one, just a minute.

Okay, I've done all the Likert scales; you can see they've all been changed. But I haven't changed these more continuous variables like age and experience. What I'm going to do for these is, instead of the median, use the mean. So for age, do the Series mean, Change, and for experience, do the Series mean, Change, and hit OK. Yes, I want to change existing variables. And it tells me how many it replaced: for usefulness 2 it replaced one value, and now we have 384 valid, non-missing values for each of these variables. So we're good. And save. Oh, not the output.
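
The two variable-level fixes described so far (spotting box-plot outliers on the continuous controls, then replacing missing values with the median for Likert items or the mean for continuous variables) might look roughly like this in code. It is a sketch, not the SPSS procedure itself: `boxplot_outliers` and `impute` are hypothetical helper names, blanks are assumed to be `None`, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from the hinges SPSS draws. Whether a flagged value is actually erroneous still needs the kind of judgment shown above (for example, 25 years of experience at age 21).

```python
from statistics import mean, median, quantiles


def boxplot_outliers(values, ids):
    """Return (id, value) pairs lying beyond 1.5 * IQR from the quartiles.

    Missing answers (None) are ignored. This is the usual box-plot rule;
    the quartile method is an assumption and may not match SPSS exactly.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    low = q1 - 1.5 * (q3 - q1)
    high = q3 + 1.5 * (q3 - q1)
    return [(case_id, v) for case_id, v in zip(ids, values)
            if v is not None and not low <= v <= high]


def impute(values, how="median"):
    """Copy one variable's column, replacing None with its median or mean."""
    if how not in ("median", "mean"):
        raise ValueError("how must be 'median' or 'mean'")
    observed = [v for v in values if v is not None]
    fill = median(observed) if how == "median" else mean(observed)
    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    # Made-up experience values (years) for eight respondents.
    ids = [101, 102, 103, 104, 105, 106, 107, 108]
    experience = [2, 3, 4, 4, 5, 5, 6, 25]
    print(boxplot_outliers(experience, ids))
    print(impute([4, None, 5, 3, 4], how="median"))   # Likert item
    print(impute([20, None, 22], how="mean"))         # continuous variable
```

Returning a new list instead of editing in place keeps the original column available for comparison, which the "change existing variables" option in SPSS does not.
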
Sorry, go back to SPSS, save. Okay, and what's next? Skewness and kurtosis. All right, last thing for the variable stuff. Let's go to Analyze, Descriptives, Frequencies, I believe, will do it. Statistics: we don't really need skewness on a five-point Likert scale, so we'll just do kurtosis. Continue. Of everything? Yep, that's right. We'll need to see skewness of education and frequency, though, and age. Age isn't in there. Let's see, age... oh, it took out all the ones I recalculated, so I'll throw those back in there. In fact, let's throw it all out so it's alphabetized: Ctrl+A, throw it back in, get rid of ID, and... it didn't alphabetize. Oh well. Hit OK, and here we have it: lots and lots of stuff. Okay, but this is the table I'm looking for, so I'm just going to copy this, Ctrl+C, and go over to Excel. I love using Excel, by the way.

Instead of the standard error of kurtosis... oops... I'm going to just use the kurtosis value, and anything greater than, or less than, the absolute value, I guess, of one. So anything greater than one is a candidate for being kurtotic. I actually don't like one, because it's just too close to the border. I'm going to use two, to show us just the extremely kurtotic items. Okay, and in Conditional Formatting one more time: Less Than, negative two. Okay, and now we'll see the ones that have issues. I'm going to go ahead and get rid of the ones that don't have issues, and it's just those guys, actually. Wow. Okay, so it's just usefulness, info acq 1 and 3, and age.

Age is highly kurtotic. Why? Because we looked at undergrads in a class, so they're essentially all only 19 to 23 years old. So the distribution is very narrow, and it's very centered around the median. So of course they're going to be kurtotic. Is there anything we can do about that? Nope, so we'll leave it. All right, then you've got info acquisition at 2.589. Yeah, that's pretty high. It's not a three, but it's pretty high, which means there wasn't a lot of variance on that item; people answered it very
similarly. Same with these two. Now, if they were strong negative values, it would mean that everybody answered fairly differently and there wasn't a central tendency towards the median. So what do we do about these? Well, we just watch them. We just make a note that these are kind of kurtotic, and then look at them in the EFA to see if they cause problems. For example, they might have low communality values, or they might not load on any single factor. And that's that, I believe, for screening. Whew.
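
The kurtosis check at the end of the walkthrough (flag any variable whose kurtosis is above 2 or below -2) could be reproduced along these lines. This is a hedged sketch: `excess_kurtosis` and `flag_kurtotic` are hypothetical names, blanks are assumed to be `None`, and the statistic is the bias-corrected sample excess kurtosis used by Excel's KURT function. It is assumed here, not confirmed by the video, that this is the same value SPSS prints in its Frequencies table.

```python
from statistics import fmean


def excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (the formula behind Excel's KURT).

    None is skipped. Needs at least four observed values with some spread.
    """
    x = [v for v in values if v is not None]
    n = len(x)
    if n < 4:
        raise ValueError("need at least four observed values")
    m = fmean(x)
    variance = sum((v - m) ** 2 for v in x) / (n - 1)   # sample variance
    if variance == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth = sum((v - m) ** 4 for v in x) / variance ** 2
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


def flag_kurtotic(columns, threshold=2.0):
    """Return {name: kurtosis} for variables with |kurtosis| > threshold.

    The default of 2 is the cutoff chosen in the walkthrough; 1 was
    mentioned there as a stricter alternative.
    """
    flagged = {}
    for name, values in columns.items():
        k = excess_kurtosis(values)
        if abs(k) > threshold:
            flagged[name] = k
    return flagged


if __name__ == "__main__":
    # Made-up columns: one spread evenly, one piled up on a single answer.
    data = {
        "spread_item": [1, 2, 3, 4, 5],
        "peaked_item": [3] * 18 + [1, 5],
    }
    print(flag_kurtotic(data))
```

As in the video, a flag here is only a note to watch the item later in the EFA, not a reason to drop it.
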

Info

Channel: James Gaskin

Views: 78,343

Rating: 4.8954248 out of 5

Keywords: Screencast-O-Matic.com, Statistics, SEM, SPSS, Data Screening, Missing data, Outliers, Skewness, Kurtosis

Id: 1KuM5e0aFgU

Length: 22min 13sec (1333 seconds)

Published: Thu May 02 2013