- [Instructor] Okay, up next is a complete example of hierarchical multiple linear regression. We're gonna cover, from start to finish, how to run a multiple regression in steps, including data screening, power, what you might say in the write-up, and an example of a possible representation of the data. So this is data set two from Blackboard, and what's in the data is gender, where zero is female and one is male, age of the participant, and extroversion, where high scores are extroverted and low scores are introverted. We're really looking at how well people take care of their cars, so the dependent variable is car: are they washing it, cleaning it, giving it an oil change, getting checkups, that sort of thing. And so what we're gonna do is control for the demographic variables of sex and age, and then test if extroversion adds something to that equation in predicting how well people take care of their cars. Okay? You'll wanna start with power, and power for regression in G*Power is pretty simple; there's only
really a couple of options, so click on F tests, and
then pull down that window, and you'll get two options,
linear multiple regression, R squared deviation from zero, which tests whether the overall model is significant, or R squared increase, which you could also use for this type of model, and that would test whether extroversion is an addition to the model. I'm gonna go with deviation from zero, 'cause I wanna know if the overall model is significant, but both options are viable. If you don't know, the effect size here is f squared, so not your normal eta squared or R squared. If you hover over it, it'll give you the conventional sizes, or you can hit Determine over here and calculate it from a couple of different things, but this squared multiple correlation, that's rho squared, you can put an R squared there and it will calculate f squared for you. So I'm gonna close this bad boy and leave it at .15, alpha is always .05, power is 80%, and in this case we
have three predictors total, so we use three. That says we need 77 people to
detect a significant effect. I only have 40, so let's see what happens; it'll also tell me my calculated power.
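If you ever want to sanity-check that 77 outside of G*Power, here is a minimal sketch of the same calculation in Python, assuming the usual noncentral-F setup where the noncentrality parameter is f squared times N; the function name is just illustrative.

```python
from scipy.stats import f as f_dist, ncf

def power_for_n(n, f2=0.15, k=3, alpha=0.05):
    """Power of the overall F test for a regression with k predictors and n people."""
    df1, df2 = k, n - k - 1
    lam = f2 * n                          # noncentrality parameter (f squared times N)
    f_crit = f_dist.ppf(1 - alpha, df1, df2)
    return 1 - ncf.cdf(f_crit, df1, df2, lam)

# Cohen's f squared from an R squared, if that's what you know: f2 = r2 / (1 - r2)
n = 10
while power_for_n(n) < 0.80:              # smallest n that reaches 80% power
    n += 1
print(n)                                  # should land at 77, like G*Power
print(round(power_for_n(40), 2))          # achieved power with only 40 people
```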
The next thing I wanna do is the really intense process of data screening for regression. But this isn't a fake regression, it's a real regression, so it's a little easier 'cause I don't have to create some random variable to test with. The first thing is always missing data and accuracy of your data, so go Analyze, Descriptive Statistics,
and then Frequencies. I'm gonna select everything
and move it over. Under Statistics, really you just need the minimum and the maximum, but it doesn't hurt to also look at the means and the standard deviations if this is your own research data and not sort of a silly example. You can notice things like, wait, why is that score so low? Oh no, maybe I forgot to reverse code it, that sort of thing. Okay, let's look at the output here. It shows that my gender variable runs from zero to one, which is good, 'cause gender should be roughly evenly split between zeros and ones. My ages don't seem abnormal; you wouldn't expect somebody to be four and have a car. My extroversion scores, let me remember what that scale was, I think it's zero to 100, so we're doing pretty good there. And the car scale, how well they're taking care of their car, is also zero to 100. So far everything looks good. And I don't have any missing data here, see, no missing, so that first assumption check works out. Now, to do outliers, what we're gonna do is we're actually gonna
set up the regression to run as if we were ready to test and then check for outliers
in three different ways. The reason I picked these three is that they seem to be the most popular, they really get at the point of what regression is testing, and they'll sort of cover you. There are lots and lots of options, as you'll see in a second, to test for outliers in regression, but to me these are the best three. Okay, so let's set up the
analysis as if we're gonna run it. So Analyze, Regression, Linear. Our DV is car. Now, this is a hierarchical regression, so we're gonna get to use
these different blocks here. They're not actually called blocks in the output, they're called models; block just means, what do you want to do next? So first, we're gonna control for demographics: put sex and age in as the independents. Hit Next to get block two, or model two, and then put extroversion in here. You do not have to include all three again; it does that for you automatically, so whatever you've used in step one will carry over to the later steps, 'cause you wanna keep controlling for it. It'll show them to you several times in your output. Okay, after you do that, what you wanna hit is Statistics. We're gonna get R squared change, which is super important for the way I'm gonna suggest you write this up. Part and partial correlations are the sr and pr, and then hit Continue. Under Plots, for data screening, put ZPRED in Y and ZRESID in X, and check histogram and normal probability plot; that's your normal data screening. For the graphs, one thing you can do when there are multiple variables, to kinda get an idea of
how well your equation is doing, is to graph the predicted values against the actual values. Remember that big R is the correlation between Y hat, your predicted score (what would I have guessed the score to be?), and Y, your actual score. So the better your R and the bigger your R squared, the closer you're getting to the real scores. If the dots were perfectly aligned, you would have done a great job, but that almost never happens; it's just a way to see how well we're doing. So hit Next, and this is where I'm gonna put DEPENDNT in Y and adjusted predicted in X. That's gonna give me Y hat, all of my Xs combined with their coefficients, on the X axis, and Y on my Y axis. Then hit Continue. Under Save, we're gonna click
the three different distances: Mahalanobis, Cook's, and leverage. Look, there are so many options here: influence statistics, DfBeta is pretty popular, studentized deleted residuals are also pretty popular. Almost all of this is different ways to look at outliers; we're gonna cover these three. Hit Continue. That should be good, hit OK. First things first, we wanna check for outliers, so I'm gonna ignore all my output so far and flip back to the data window. You'll see that I have three new columns, and those columns are
for each of the separate outlier analyses. Let's start with Mahalanobis. The cut-off score for Mahalanobis comes from a chi-square table: we have three predictors, so three degrees of freedom, and let me find that chi-square table option, there it is. We're gonna use .001, 'cause we want people to be really crazy before we delete anybody. For three degrees of freedom, the cut-off is 16.27. So that's my cut-off score.
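If you don't have a chi-square table handy, that same cut-off is a one-line lookup; this is just a sketch in Python, with the degrees of freedom equal to the number of predictors.

```python
from scipy.stats import chi2

k = 3                                    # number of predictors
cutoff = chi2.ppf(1 - 0.001, df=k)       # critical chi-square at p = .001
print(round(cutoff, 2))                  # 16.27
```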
Now, normally you would just sort and look. But in this sort of analysis, where I have three things I wanna compare and I kinda wanna keep track of what I'm doing, I'm gonna show you a way to create separate columns that tell me whether people are outliers on each measure, and then create a total outlier score. I don't think this data set's too crazy, we don't have a whole lot of outliers, but if you had 400 participants you wouldn't wanna code this by hand; that would take way too long. So what you're gonna
do is go to Transform, Recode into Different Variables. Let's take Mahalanobis distance here, move it over. I'm gonna call this
out_mah, so I know it's the outlier marker for Mahalanobis. You have to click Change so that variable name shows up here. And then before you hit OK, you have to tell it the old and new values, what you're gonna transform this into; this is also how a lot of people recode or reverse-code variables. So click Old and New Values. We're gonna use this HIGHEST option: I wanna take everybody above my cut-off score of 16.27 and make them a one. That basically codes everyone whose score is too high as a one. Then I'm gonna take everybody else, everything below 16.27 and all the other random decimal points, and make them zeros. So that codes everybody as zero, not an outlier, or one, an outlier. Then hit Continue, and OK. The crappy part is that since they each have different cut-off scores, you have to do them one at a time. So I didn't get anybody with
outliers on Mahalanobis. I'm gonna do that twice
more, once for Cook's, which is a measure of influence, meaning discrepancy and leverage together, and then once for leverage, which is just straight up how much they're changing the slope. So let's do Cook's now. Transform, Recode into
Different Variables. I'm gonna hit Reset to
clear everything out. Move over Cook's, type
out_cook here, hit Change, then Old and New Values. So what's my cut-off score for Cook's? Well, the formula is four divided by n minus k minus one, or four over the residual degrees of freedom. So I have four divided by: n is 40, minus three for k, for the three predictors age, sex, and extroversion, minus one. 40 minus three minus one is 36, and four over 36 is .111, so that's my cut-off score for Cook's. Same functions: value through HIGHEST, so .111 and up is gonna be a one, and all other values are gonna be, ooh, not missing, a zero. And then Add. So everybody above .111 gets
a marker for being an outlier, everybody below that score
gets a zero for not being an outlier. Continue and OK. Right, and so it looks like I've got two Cook's scores that are too high. One of them, oops, that's leverage; .114, and then one is .312, so those are too high. One more time for leverage. Transform, Recode into Different Variables, Reset. And let's do leverage,
and call it out_lev, hit Change, then Old and New Values here. So what's my cut-off score for leverage? Well, let's see, the formula for leverage is two k plus two, divided by n. So two times k, which is three, is six, plus two is eight, divided by n, which in this case is 40, so eight over 40 is .20. So I'm gonna do value through HIGHEST, so .20 and up is gonna be a one, those are my outliers, and then all other values can be zero, those are my not-outliers. Continue and OK. And so I have an outlier for leverage as well; their score is higher than .20. Now this is very easy to see
because there's only 40 people and I can kinda scroll through it, but again, if you have 100 or more, or even just a couple more than this, it can be kinda tedious
to look through them, and sorting on multiple columns in SPSS is not always the best thing. So what you wanna do is go Transform, Compute Variable, and just add all those together. This is gonna be total outliers, so I'm gonna call it out_total. Then I'm just gonna do out_mah plus, double-click, out_cook plus, double-click, out_lev, so it just adds them all up. Hit OK. And now I can sort my out_total column.
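By the way, if you ever do have 400 participants, that mark-and-sum logic is easy to script. Here is a hedged sketch in Python with pandas, assuming you've exported the data set with the saved distances, where MAH_1, COO_1, and LEV_1 are the default names SPSS gives those saved columns and the file name is made up; the cut-offs are the formulas we just worked out.

```python
import pandas as pd
from scipy.stats import chi2

dat = pd.read_csv("regression_distances.csv")    # hypothetical export of the data set

n, k = len(dat), 3                               # sample size and number of predictors
mah_cut = chi2.ppf(1 - 0.001, df=k)              # Mahalanobis: chi-square cut-off at p = .001
cook_cut = 4 / (n - k - 1)                       # Cook's: 4 / (n - k - 1)
lev_cut = (2 * k + 2) / n                        # leverage: (2k + 2) / n

dat["out_mah"] = (dat["MAH_1"] > mah_cut).astype(int)
dat["out_cook"] = (dat["COO_1"] > cook_cut).astype(int)
dat["out_lev"] = (dat["LEV_1"] > lev_cut).astype(int)
dat["out_total"] = dat[["out_mah", "out_cook", "out_lev"]].sum(axis=1)

# Sort so anyone with two or more markers floats to the top.
print(dat.sort_values("out_total", ascending=False).head())
```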
Remember, you can right-click on the column header and click Sort, but for some reason that doesn't work well on my Mac with no mouse, so I'm gonna do this through Sort Cases and put the highest people at the top. So I have one person who
has two or more markers, so they're two out of three. I would delete this
person because their score has two markers out of three
that indicate it's an outlier. I mean, you don't have to delete them; really, what is going on here? Look at the data before you delete it, clearly. They're a young person who has a high extroversion score and takes good care of their car, and more than likely they're at the top of those two variables. So they're getting those high Cook's and leverage scores because they're probably discrepant, which means they're far away from the rest of the data; being at the very top or the very bottom tends to make you far away from everybody. And it looks to me like they're really, especially, far away on the car score. If you're following
along in my User's Guide, I did delete them. You can leave them in and try it, and then take them out and
try it to see what happens. That's the popular thing to do. But since I wanna match the
handouts that you're looking at, I'm gonna delete this person because they have two markers out of three. There we go. Alright, so that being said, that makes all of this output moot, 'cause I deleted something, so I'm gonna get rid of it. The next thing I wanna check is multicollinearity. So Analyze, Correlate, Bivariate. Remember, this is only for the independent variables; you do want them to be correlated with your DV, that's the point, you just don't want them too correlated with each other. So sex, age, and extroversion, we move those over and hit OK. And that is gonna show
me that gender and age aren't correlated, which
isn't too surprising. Gender is correlated with extroversion, so there are differences between men and women, and age and extroversion are also correlated, but none of these are too high. The cut-off score is .9, but remember that at around .7 you might get some suppression with multiple regression, so I might tell you to try it both ways and see what happens if you get that high. Okay, so I'm gonna rerun my regression because I deleted somebody, and I'm gonna make a point to talk about the fact that, I'm just gonna hit OK, when I do that it's gonna give me three new outlier columns, because I ran it again. Don't delete anybody again. Don't do it. Don't think about it. Don't make this a thing. Don't delete people multiple times. So essentially, these three columns, we don't need. Alright, so, there's my output. Alright, we're gonna
check normality first. So that looks pretty good. Maybe a little bimodal, but not too bad; we have at least 30 people, it's centered over zero, and it ranges from about minus two to two, so I'd say it's okay. Then linearity: pretty good, especially with only 40 people. Homogeneity and homoscedasticity also look pretty good. Most of the data is between minus two and two; the axis goes up to three here because one point is just slightly over two, but really that's almost perfectly between minus two and two in both directions. And that's about as square an area as you're gonna get, so homogeneity and homoscedasticity both check out. Okay. There's one more plot; we're gonna come back to what
this plot is in a second. So all my assumptions check out after I deleted one outlier. Now let's look at the actual analysis. Which is just a little bit
higher up in my notes here. I'll copy this into Word so you can read it a little better, rather than side by side. Well, thank goodness that wasn't anything salacious; there we go, it was just a z-test. (sighs) Now SPSS is doing that fun thing where it doesn't like to copy. (shutter clicks) Let's turn off the sound here. Struggling. There we go. So the first question
you have to ask yourself in regression is, is the
overall model significant? So let's talk about model one,
it's just my demographics. And yeah, it's significant. So I'm gonna report F, and here we go, this first line: the degrees of freedom are 2 and 36, so F(2, 36) = 21.66, my p value is less than .001, and my R squared for just this step is .55. So what does that tell me? That means 55% of the variance
is due to demographics. Whoa, that's huge. And it is significant. Next thing is model two, so
this is our extroversion, or extraversion, either
way you think about it. And I'm not gonna use that ANOVA box. So the interesting thing about
the two different boxes here that you don't see in a
simultaneous regression is that they're gonna be different. So what does this change statistics thing out here do? It is testing this number right here: is the R squared change greater than zero? For the first model, the first step, those two numbers match because you're starting at zero, so it just asks, is it greater than zero? When you add a second step, what happens is that now it's testing whether this change is different from zero. So is 7% a significant addition to the model? Versus this number down here in the ANOVA box, which is testing whether the overall R squared, 61%, is greater than zero. And, I mean, you can go either way, but I feel like reporting the ANOVA is a little bit of cheating if your first step was really big: your second step is still gonna be significant 'cause the first one was big, even if that addition is not. So I'm always biased towards
using the change statistics, 'cause that's kind of the point of doing a hierarchical regression: to show that that extra step is significant. Adding this variable was
important, so we should do it. So that's what's
different between the two. But this is an example, so of course it is significant. If I can get a capital F here, there we go: the degrees of freedom are one and 35, so F(1, 35) = 5.96, and my p value is .02. My R squared change, which I'm gonna
cheat and copy from up here, is .07. And then what I would
do in Word to make it super duper clear what I'm talking about is insert a change symbol, which is delta, the little triangle. So I'm saying the change in F is significant, along with the change in R squared. That tells people, or at least it tells me, that this is the change in R squared, the addition to R squared. And most people can figure that out, because they don't assume that after getting 55% of the variance you somehow magically dropped to only 7%; they go, oh, that must mean an additional chunk. So you don't really need to list the total R squared, because hopefully people can figure out to just add them together, and that's how you get 61%. It's gonna look a little high because we've rounded up on both of them, so in that case I might tell you to use three decimals, but I mean it's off by .01, so it's not a huge deal.
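As an aside, if you ever want to double-check the change test outside of SPSS, the same nested-model comparison can be scripted. This is only a sketch, assuming the data sit in a CSV with columns named sex, age, extroversion, and car, which are hypothetical names; statsmodels' anova_lm gives the F test for the added block, and the difference in the two R squared values is the R squared change.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.read_csv("dataset2.csv")       # hypothetical file with sex, age, extroversion, car

step1 = smf.ols("car ~ sex + age", data=data).fit()                   # block 1: demographics
step2 = smf.ols("car ~ sex + age + extroversion", data=data).fit()    # block 2: add extroversion

delta_r2 = step2.rsquared - step1.rsquared        # R squared change for the added block
print(anova_lm(step1, step2))                     # F change test, analogous to SPSS's F Change
print(round(step1.rsquared, 2), round(delta_r2, 2))
```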
Okay, the next question is: which predictors are significant? And so I'm gonna take the coefficients box here in my output, bop, and use that to answer that question. So the way I learned this was to only talk about the predictors
in the step they're entered. And people vary on this point. I think about it as more
of a theoretical view: I'm gonna control for demographics, so here's what happens to demographics when they're by themselves; once I control for them, I'm basically done with them, and then I'm gonna add extroversion. So after controlling for demographics, what happens with extroversion? 'Cause you'll notice that the coefficients do change. That's because there are other variables in the equation, so mathematically they have to change; we can't actually hold them constant. It's more of a theoretical idea of, I'm controlling for these and then doing this. I have seen it both ways, where people report them in both steps or only in the last step. But the way I think about it is to just talk about them in the step they're entered, because you did them
in steps for a reason. So talk about them in
the step they're entered. Remember, the number one rule when I help people with things is: do what your advisor wants. Do what the reviewer wants, as much as you practically can. And basically go with
what makes sense to you. If it makes more sense to talk about both, do both and see what happens. See if people will
accept your explanation. So I'm gonna talk about them
in the step they're entered. So that means, for model
one, when I'm controlling for demographics, sex is
a significant predictor. I'm gonna list, I'm gonna do beta, so Insert. The advantage of beta is that it's standardized, so I can compare, there's beta, I can compare across predictors. So I don't know why this always comes up in this other font; there we go, let's do Times New Roman. Sorry, it's one of my things, it just makes me crazy. Alright, there we go. So I'm gonna list beta. What's the advantage of beta? Beta is standardized,
because gender and age are definitely not on the same scale, 'cause one is zero and
one, and the other one is in years. Beta will let me tell which
predictor is stronger, but so will the partial correlations, so you could go with either one, remembering that b is more interpretable, since it's in the scale you're using, so you can talk about it more easily, and beta is standardized, so you can compare better. Either one.
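If you ever need to convert between b and beta by hand, the relationship is just the unstandardized slope rescaled by the two standard deviations; here's a tiny sketch with made-up numbers, not the values from this data set.

```python
def standardize_slope(b, sd_x, sd_y):
    """Convert an unstandardized slope b into a standardized beta."""
    return b * sd_x / sd_y

# purely hypothetical numbers, just to show the rescaling
print(standardize_slope(b=2.0, sd_x=1.5, sd_y=10.0))   # 0.3
```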
Alright, so beta is .68, and my t says it's significant. Remember, the degrees of freedom for t match the second degree of freedom for F in the step we're talking about, so it's 36 here, 'cause it's n minus k minus one. So t(36) = 6.00, p is less than .001. And I'm gonna use pr
squared as my effect size. So what in the heck is pr squared? Sr and pr are types of partial correlations. This output out here, the zero-order column, is just plain r, the correlation between gender and my DV, car. Partial correlations
are in the second column where it says partial. That is the correlation
between gender and car controlling for age, like subtracting out all the variance for age. Semi-partial correlations are the relationship between gender and car with age's variance still included, so the difference between
pr in the middle column and sr in the last column
is the denominator. Pr is calculated only
over the leftover variance: it basically takes age and just carves it out, and says that variance due to age doesn't exist anymore. Poof, gone. For sr, sorry, semi-partial correlations, that variance due to age is still part of the denominator, so it's over total variance on the bottom. If you can't remember the order, like I do sometimes, remember that pr is always larger than sr (unless they're both zero), because the denominator is smaller. And so go with the larger column, which is this one. I'm gonna square that; they're both effect sizes, so it doesn't actually matter, but I like to think about it as an R squared, and so we'll keep the same theme here. And that tells me how much
variance is accounted for; it's actually 50%. We'll talk about what that means here in a second. So for age, the beta is .33, also significant. That doesn't always happen; sometimes it might just be one of them. t(36) = 2.92, and p is less than .01 here. And let's do pr squared, if Word will keep up with me here: .44 squared is .19. And here's the tricky part. Because these don't have
the same denominator as R squared, they do not add up; they will not add up to my total R squared. Sometimes the sum is bigger, sometimes it's smaller, it just depends on the mathematical properties and the overlap between sex and age. But since those two are fairly uncorrelated, the pr values will be bigger. The more correlated they are, the smaller they'll be. Don't expect those to add up; that's just my word of warning here. Right, so 50% of the
unaccounted for variance is due to gender, and 19% is due to age. I can also look at beta
and tell that gender is a better predictor. The interpretation for age here is: for every one-unit increase in age, we get .33 standard deviations, or about .54 to .55 points, of increase in car care. As age goes up, care for car goes up. The tricky part with these categorical variables is, as sex goes up, what does that mean? That's an odd way to say it. Basically, as we go from zero to one: the zero group is girls, females, and the one group is guys, males. So the difference between boys and girls is .68 standard deviations, or 26 points. So as sex goes up, as we're looking at the guys,
care for car goes up. Our guys are taking
better care of their cars than our girls. Sorry, ladies. Alright, so let's talk about extroversion; I added that in model two. So what happens here? I already know it's a significant predictor, because extroversion is the only variable I added in model two and that model change was significant. Let's see what happens with it: beta is .33, about the same size as age. Now my degrees of freedom for t are gonna be different, though, because it's the second degree of freedom here, so that's 35 instead of 36. So t(35) = 2.44, p = .02, which, with only one added variable, will match the p value up here. Let's do pr squared. Can you tell it's late? Getting silly voices. Alright, we've got .38; squared, so come here, Calculator: .38 squared is .14. And you know that it
does not match the R squared change. The overall addition to the equation is .07, which would be this sr of about .26 squared, I'm pretty sure, so let's try that: .26 squared, (clicks) yep, that's where the .07 is coming from. So if you square a semi-partial correlation, you get R squared change.
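To make that calculator work concrete, here's the same arithmetic in a few lines of Python, using the rounded values read off the output, so the numbers only approximately reproduce SPSS. The relationships in play are that the squared semi-partial for the added variable equals the R squared change, and the squared partial equals the R squared change divided by one minus the previous step's R squared.

```python
r2_step1 = 0.55       # R squared for sex + age (rounded from the output)
r2_step2 = 0.62       # R squared for the full model (rounded)
sr_extro = 0.26       # semi-partial (part) correlation for extroversion
pr_extro = 0.38       # partial correlation for extroversion

delta_r2 = r2_step2 - r2_step1
print(round(sr_extro ** 2, 2))              # about .07, the R squared change
print(round(pr_extro ** 2, 2))              # about .14, extroversion's share of leftover variance
print(round(delta_r2 / (1 - r2_step1), 2))  # pr squared again, up to rounding in the inputs
```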
But we're talking about partial correlations, so age and gender's variance has been taken out of the DV, meaning some variance is subtracted out of the denominator, and that's why it's 14% of the unaccounted-for variance. So it's a significant predictor, and to write that up I would talk about all of those different pieces. One caveat that I always tell people is that if a predictor is not significant, you can't just pretend like
it didn't exist anymore. So talk about predictors even
if they're not significant. And again, my thing is: talk about them in the step they're entered. All of mine were significant
in their specific steps, so we'd talk about them all. But you really don't want to ignore one just 'cause it wasn't significant, because people are gonna go, what happened to that other variable, they just stopped talking about it. Say it's not significant. Now for pictures: what can I do to make a graph, a representation of this? It's usually a little hard, because with three predictors you're technically predicting in more dimensions than you can draw. The sort of cheap way to do it, it's not really cheap, but it's the easiest way, would be to create a
picture here, this one, of the relationship between the predicted values and the actual, real values, 'cause this gives me a picture of: all these variables together equal what? Now, I got that scatter
plot when I ran my plots with dependent as Y and
adjusted predicted as X, but this graph is terrible, so here's what I would do to make it APA style. Remember, APA does not have all this stuff at the top. It's not letting me delete here, oh, there we go, it's being grumpy. There we go. Then I would change this label at the bottom: click once to select it, click twice to get to where you can type, and either the equation is a good one, so Sex + Age + Extroversion, or one or all of the variable names. You could also just call this Predicted Values; it doesn't have to be an equation, that's the other option, calling it Predicted Values. I like to remind people
what are the variables I'm using unless you have 10,
then it might get kinda long. Over here, Car is not a very good label, so click once, click twice, and this is my Car Care, oops, not Care Care, Car Care Score. You can delete this awful blurred gray background: double-click on it, change it to transparent here, and Apply; that's just a personal preference, 'cause the gray is awful. But I also like to add the fit line, so Add Fit Line at Total, that will add your fit line, and then you can turn off 'Attach label to line' right here, since that's not actually the full equation. Apply. So I don't wanna include that equation, because my real equation
has three predictors with their coefficients; that's what you're gonna report, with all of your beta values, or your b values. This is just a way to get it to give you the line. So how are we doing? Let's close this and
it'll pop back over here, there we go. We're doing pretty good,
because lots of dots are close to the line. I mean, only one person is even touching the line, but the rest are pretty close; it could be way more spread out. Remember, this is 61% of the variance, and that's a lot. So we're getting pretty good
at guessing people's scores with all three variables at once. And that is how you run a hierarchical multiple linear regression: you've got the steps, how you would talk about each piece in your write-up, and a potential graph or way to visualize the data.