- [Instructor] Okay, up next is a complete example of hierarchical multiple linear regression. We're gonna cover, from start to finish, how to run a multiple regression in steps, including data screening, power, what you might say in the write-up, and an example of a possible representation of the data. So this is data set two from Blackboard, and what's in the data is gender, where zero is female and one is male, age of the participant, and extroversion, where high scores are extroverted and low scores are introverted. We're really looking at how well people take care of their cars, so the dependent variable is car: are they washing it, cleaning it, giving it an oil change, getting checkups, that sort of thing. And so what we're gonna do is control for the demographic variables of sex and age, and then test if extroversion adds something to that equation in predicting how well people take care of their cars. Okay? You'll wanna start with power, and power for regression in G*Power is pretty simple; there's only
really a couple of options, so click on F tests, and
then pull down that window, and you'll get two options,
linear multiple regression, R squared deviation from zero, which tests whether the overall model is significant, or R squared increase, which you could also use for this type of model, and that would test whether extroversion is an addition to the model. I'm gonna go with deviation from zero, 'cause I wanna know if the overall model is significant, but both options are viable. If you don't know, the effect size here is f squared, so not your normal eta squared or R squared. If you hover over it, it'll give you the conventional sizes, or you can hit Determine over here and calculate it from a couple of different things, but this squared multiple correlation, that's rho squared, you can put an R squared there and it will calculate f squared for you. So I'm gonna close this bad boy and leave it at .15, alpha is always .05, power is 80%, and in this case we
have three predictors total, so we use three. That says we need 77 people to
detect a significant effect. I only have 40, so let's see what happens; it'll also tell me my calculated power.
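If you ever want to sanity-check that 77 outside of G*Power, here is a minimal sketch of the same calculation in Python, assuming the usual noncentral-F setup where the noncentrality parameter is f squared times N; the function name is just illustrative.

```python
from scipy.stats import f as f_dist, ncf

def power_for_n(n, f2=0.15, k=3, alpha=0.05):
    """Power of the overall F test for a regression with k predictors and n people."""
    df1, df2 = k, n - k - 1
    lam = f2 * n                          # noncentrality parameter (f squared times N)
    f_crit = f_dist.ppf(1 - alpha, df1, df2)
    return 1 - ncf.cdf(f_crit, df1, df2, lam)

# Cohen's f squared from an R squared, if that's what you know: f2 = r2 / (1 - r2)
n = 10
while power_for_n(n) < 0.80:              # smallest n that reaches 80% power
    n += 1
print(n)                                  # should land at 77, like G*Power
print(round(power_for_n(40), 2))          # achieved power with only 40 people
```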
The next thing I wanna do is the really intense process of data screening for regression. But this isn't a fake regression, it's a real regression, so it's a little easier 'cause I don't have to create some random variable to test with. The first thing is always missing data and accuracy of your data, so go Analyze, Descriptive Statistics,
and then Frequencies. I'm gonna select everything
and move it over. Under Statistics, really you just need the minimum and the maximum, but it doesn't hurt to also look at the means and the standard deviations if this is your own research data and not sort of a silly example. You can notice things like, wait, why is that score so low? Oh no, maybe I forgot to reverse code it, that sort of thing. Okay, let's look at the output here. It shows that my gender variable runs from zero to one, which is good, 'cause gender should be roughly evenly split between zeros and ones. My ages don't seem abnormal; you wouldn't expect somebody to be four and have a car. My extroversion scores, let me remember what that scale was, I think it's zero to 100, so we're doing pretty good there. And the car scale, how well they're taking care of their car, is also zero to 100. So far everything looks good. And I don't have any missing data here, see, no missing, so that first assumption check works out. Now, to do outliers, what we're gonna do is we're actually gonna
set up the regression to run as if we were ready to test and then check for outliers
in three different ways. The reason I picked these three is that they seem to be the most popular, they really get at the point of what regression is testing, and they'll sort of cover you. There are lots and lots of options, as you'll see in a second, to test for outliers in regression, but to me these are the best three. Okay, so let's set up the
analysis as if we're gonna run it. So Analyze, Regression, Linear. Our DV is car. Now, this is a hierarchical regression, so we're gonna get to use
these different blocks here. They're not actually called blocks in the output, they're called models; block just means, what do you want to do next? So first, we're gonna control for demographics: put sex and age in as the independents. Hit Next to get block two, or model two, and then put extroversion in here. You do not have to include all three again; it does that for you automatically, so whatever you've used in step one will carry over to the later steps, 'cause you wanna keep controlling for it. It'll show them to you several times in your output. Okay, after you do that, what you wanna hit is Statistics. We're gonna get R squared change, which is super important for the way I'm gonna suggest you write this up. Part and partial correlations are the sr and pr, and then hit Continue. Under Plots, for data screening, put ZPRED in Y and ZRESID in X, and check histogram and normal probability plot; that's your normal data screening. For the graphs, one thing you can do when there are multiple variables, to kinda get an idea of
how well your equation is doing, is to graph the predicted values against the actual values. Remember that big R is the correlation between Y hat, your predicted score (what would I have guessed the score to be?), and Y, your actual score. So the better your R and the bigger your R squared, the closer you're getting to the real scores. If the dots were perfectly aligned, you would have done a great job, but that almost never happens; it's just a way to see how well we're doing. So hit Next, and this is where I'm gonna put DEPENDNT in Y and adjusted predicted in X. That's gonna give me Y hat, all of my Xs combined with their coefficients, on the X axis, and Y on my Y axis. Then hit Continue. Under Save, we're gonna click
the three different distances: Mahalanobis, Cook's, and leverage. Look, there are so many options here: influence statistics, DfBeta is pretty popular, studentized deleted residuals are also pretty popular. Almost all of this is different ways to look at outliers; we're gonna cover these three. Hit Continue. That should be good, hit OK. First things first, we wanna check for outliers, so I'm gonna ignore all my output so far and flip back to the data window. You'll see that I have three new columns, and those columns are
for each of the separate outlier analyses. Let's start with Mahalanobis. The cut-off score for Mahalanobis comes from a chi-square table: we have three predictors, so three degrees of freedom, and let me find that chi-square table option, there it is. We're gonna use .001, 'cause we want people to be really crazy before we delete anybody. For three degrees of freedom, the cut-off is 16.27. So that's my cut-off score.
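If you don't have a chi-square table handy, that same cut-off is a one-line lookup; this is just a sketch in Python, with the degrees of freedom equal to the number of predictors.

```python
from scipy.stats import chi2

k = 3                                    # number of predictors
cutoff = chi2.ppf(1 - 0.001, df=k)       # critical chi-square at p = .001
print(round(cutoff, 2))                  # 16.27
```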
Now, normally you would just sort and look. But in this sort of analysis, where I have three things I wanna compare and I kinda wanna keep track of what I'm doing, I'm gonna show you a way to create separate columns that tell me whether people are outliers on each measure, and then create a total outlier score. I don't think this data set's too crazy, we don't have a whole lot of outliers, but if you had 400 participants you wouldn't wanna code this by hand; that would take way too long. So what you're gonna
do is go to Transform, Recode into Different Variables. Let's take Mahalanobis distance here, move it over. I'm gonna call this
out_mah, so I know it's the outlier marker for Mahalanobis. You have to click Change so that variable name shows up here. And then before you hit OK, you have to tell it the old and new values, what you're gonna transform this into; this is also how a lot of people recode or reverse-code variables. So click Old and New Values. We're gonna use this HIGHEST option: I wanna take everybody above my cut-off score of 16.27 and make them a one. That basically codes everyone whose score is too high as a one. Then I'm gonna take everybody else, everything below 16.27 and all the other random decimal points, and make them zeros. So that codes everybody as zero, not an outlier, or one, an outlier. Then hit Continue, and OK. The crappy part is that since they each have different cut-off scores, you have to do them one at a time. So I didn't get anybody with
outliers on Mahalanobis. I'm gonna do that twice
more, once for Cook's, which is a measure of influence, meaning discrepancy and leverage together, and then once for leverage, which is just straight up how much they're changing the slope. So let's do Cook's now. Transform, Recode into
Different Variables. I'm gonna hit Reset to
clear everything out. Move over Cook's, type
out_cook here, hit Change, then Old and New Values. So what's my cut-off score for Cook's? Well, the formula is four divided by n minus k minus one, or four over the residual degrees of freedom. So I have four divided by: n is 40, minus three for k, for the three predictors age, sex, and extroversion, minus one. 40 minus three minus one is 36, and four over 36 is .111, so that's my cut-off score for Cook's. Same functions: value through HIGHEST, so .111 and up is gonna be a one, and all other values are gonna be, ooh, not missing, a zero. And then Add. So everybody above .111 gets
a marker for being an outlier, everybody below that score
gets a zero for not being an outlier. Continue and OK. Right, and so it looks like I've got two Cook's scores that are too high. One of them, oops, that's leverage; .114, and then one is .312, so those are too high. One more time for leverage. Transform, Recode into Different Variables, Reset. And let's do leverage,
and call it out_lev, hit Change, then Old and New Values here. So what's my cut-off score for leverage? Well, let's see, the formula for leverage is two k plus two, divided by n. So two times k, which is three, is six, plus two is eight, divided by n, which in this case is 40, so eight over 40 is .20. So I'm gonna do value through HIGHEST, so .20 and up is gonna be a one, those are my outliers, and then all other values can be zero, those are my not-outliers. Continue and OK. And so I have an outlier for leverage as well; their score is higher than .20. Now this is very easy to see
because there's only 40 people and I can kinda scroll through it, but again, if you have 100 or more, or even just a couple more than this, it can be kinda tedious
to look through them, and sorting on multiple columns in SPSS is not always the best thing. So what you wanna do is go Transform, Compute Variable, and just add all those together. This is gonna be total outliers, so I'm gonna call it out_total. Then I'm just gonna do out_mah plus, double-click, out_cook plus, double-click, out_lev, so it just adds them all up. Hit OK. And now I can sort my out_total column.
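By the way, if you ever do have 400 participants, that mark-and-sum logic is easy to script. Here is a hedged sketch in Python with pandas, assuming you've exported the data set with the saved distances, where MAH_1, COO_1, and LEV_1 are the default names SPSS gives those saved columns and the file name is made up; the cut-offs are the formulas we just worked out.

```python
import pandas as pd
from scipy.stats import chi2

dat = pd.read_csv("regression_distances.csv")    # hypothetical export of the data set

n, k = len(dat), 3                               # sample size and number of predictors
mah_cut = chi2.ppf(1 - 0.001, df=k)              # Mahalanobis: chi-square cut-off at p = .001
cook_cut = 4 / (n - k - 1)                       # Cook's: 4 / (n - k - 1)
lev_cut = (2 * k + 2) / n                        # leverage: (2k + 2) / n

dat["out_mah"] = (dat["MAH_1"] > mah_cut).astype(int)
dat["out_cook"] = (dat["COO_1"] > cook_cut).astype(int)
dat["out_lev"] = (dat["LEV_1"] > lev_cut).astype(int)
dat["out_total"] = dat[["out_mah", "out_cook", "out_lev"]].sum(axis=1)

# Sort so anyone with two or more markers floats to the top.
print(dat.sort_values("out_total", ascending=False).head())
```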
Remember, you can right-click on the column header and click Sort, but for some reason that doesn't work well on my Mac with no mouse, so I'm gonna do this through Sort Cases and put the highest people at the top. So I have one person who
has two or more markers, so they're two out of three. I would delete this
person because their score has two markers out of three
that indicate it's an outlier. I mean, you don't have to delete them; really, what is going on here? Look at the data before you delete it, clearly. They're a young person who has a high extroversion score and takes good care of their car, and more than likely they're at the top of those two variables. So they're getting those high Cook's and leverage scores because they're probably discrepant, which means they're far away from the rest of the data; being at the very top or the very bottom tends to make you far away from everybody. And it looks to me like they're really, especially, far away on the car score. If you're following
along in my User's Guide, I did delete them. You can leave them in and try it, and then take them out and
try it to see what happens. That's the popular thing to do. But since I wanna match the
handouts that you're looking at, I'm gonna delete this person because they have two markers out of three. There we go. Alright, so that being said, that makes all of this output moot, 'cause I deleted something, so I'm gonna get rid of it. The next thing I wanna check is multicollinearity. So Analyze, Correlate, Bivariate. Remember, this is only for the independent variables; you do want them to be correlated with your DV, that's the point, you just don't want them too correlated with each other. So sex, age, and extroversion, we move those over and hit OK. And that is gonna show
me that gender and age aren't correlated, which
isn't too surprising. Gender is correlated with extroversion, so there are differences between men and women, and age and extroversion are also correlated, but none of these are too high. The cut-off score is .9, but remember that at around .7 you might get some suppression with multiple regression, so I might tell you to try it both ways and see what happens if you get that high. Okay, so I'm gonna rerun my regression because I deleted somebody, and I'm gonna make a point to talk about the fact that, I'm just gonna hit OK, when I do that it's gonna give me three new outlier columns, because I ran it again. Don't delete anybody again. Don't do it. Don't think about it. Don't make this a thing. Don't delete people multiple times. So essentially, these three columns, we don't need. Alright, so, there's my output. Alright, we're gonna
check normality first. So that looks pretty good. Maybe a little bimodal, but not too bad; we have at least 30 people, it's centered over zero, and it ranges from about minus two to two, so I'd say it's okay. Then linearity: pretty good, especially with only 40 people. Homogeneity and homoscedasticity also look pretty good. Most of the data is between minus two and two; the axis goes up to three here because one point is just slightly over two, but really that's almost perfectly between minus two and two in both directions. And that's about as square an area as you're gonna get, so homogeneity and homoscedasticity both check out. Okay. There's one more plot; we're gonna come back to what
this plot is in a second. So all my assumptions check out after I deleted one outlier. Now let's look at the actual analysis. Which is just a little bit
higher up in my notes here. I'll copy this into Word so you can read it a little better, rather than side by side. Well, thank goodness that wasn't anything salacious; there we go, it was just a z-test. (sighs) Now SPSS is doing that fun thing where it doesn't like to copy. (shutter clicks) Let's turn off the sound here. Struggling. There we go. So the first question
you have to ask yourself in regression is, is the
overall model significant? So let's talk about model one,
it's just my demographics. And yeah, it's significant. So I'm gonna report F, and here we go, this first line: the degrees of freedom are 2 and 36, so F(2, 36) = 21.66, my p value is less than .001, and my R squared for just this step is .55. So what does that tell me? That means 55% of the variance
is due to demographics. Whoa, that's huge. And it is significant. Next thing is model two, so
this is our extroversion, or extraversion, either
way you think about it. And I'm not gonna use that ANOVA box. So the interesting thing about
the two different boxes here that you don't see in a
simultaneous regression is that they're gonna be different. So what does this change statistics thing out here do? It is testing this number right here: is the R squared change greater than zero? For the first model, the first step, those two numbers match because you're starting at zero, so it just asks, is it greater than zero? When you add a second step, what happens is that now it's testing whether this change is different from zero. So is 7% a significant addition to the model? Versus this number down here in the ANOVA box, which is testing whether the overall R squared, 61%, is greater than zero. And, I mean, you can go either way, but I feel like reporting the ANOVA is a little bit of cheating if your first step was really big: your second step is still gonna be significant 'cause the first one was big, even if that addition is not. So I'm always biased towards
using the change statistics, 'cause that's kind of the point of doing a hierarchical regression: to show that that extra step is significant. Adding this variable was
important, so we should do it. So that's what's
different between the two. But this is an example, so of course it is significant. If I can get a capital F here, there we go: the degrees of freedom are one and 35, so F(1, 35) = 5.96, and my p value is .02. My R squared change, which I'm gonna
cheat and copy from up here, is .07. And then what I would
do in Word to make it super duper clear what I'm talking about is insert a change symbol, which is delta, the little triangle. So I'm saying the change in F is significant, along with the change in R squared. That tells people, or at least it tells me, that this is the change in R squared, the addition to R squared. And most people can figure that out, because they don't assume that after getting 55% of the variance you somehow magically dropped to only 7%; they go, oh, that must mean an additional chunk. So you don't really need to list the total R squared, because hopefully people can figure out to just add them together, and that's how you get 61%. It's gonna look a little high because we've rounded up on both of them, so in that case I might tell you to use three decimals, but I mean it's off by .01, so it's not a huge deal.
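As an aside, if you ever want to double-check the change test outside of SPSS, the same nested-model comparison can be scripted. This is only a sketch, assuming the data sit in a CSV with columns named sex, age, extroversion, and car, which are hypothetical names; statsmodels' anova_lm gives the F test for the added block, and the difference in the two R squared values is the R squared change.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.read_csv("dataset2.csv")       # hypothetical file with sex, age, extroversion, car

step1 = smf.ols("car ~ sex + age", data=data).fit()                   # block 1: demographics
step2 = smf.ols("car ~ sex + age + extroversion", data=data).fit()    # block 2: add extroversion

delta_r2 = step2.rsquared - step1.rsquared        # R squared change for the added block
print(anova_lm(step1, step2))                     # F change test, analogous to SPSS's F Change
print(round(step1.rsquared, 2), round(delta_r2, 2))
```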
Okay, the next question is: which predictors are significant? And so I'm gonna take the coefficients box here in my output, bop, and use that to answer that question. So the way I learned this was to only talk about the predictors
in the step they're entered. And people vary on this point. I think about it as more
of a theoretical view: I'm gonna control for demographics, so here's what happens to demographics when they're by themselves; once I control for them, I'm basically done with them, and then I'm gonna add extroversion. So after controlling for demographics, what happens with extroversion? 'Cause you'll notice that the coefficients do change. That's because there are other variables in the equation, so mathematically they have to change; we can't actually hold them constant. It's more of a theoretical idea of, I'm controlling for these and then doing this. I have seen it both ways, where people report them in both steps or only in the last step. But the way I think about it is to just talk about them in the step they're entered, because you did them
in steps for a reason. So talk about them in
the step they're entered. Remember, the number one rule when I help people with things is: do what your advisor wants. Do what the reviewer wants, as much as you practically can. And basically go with
what makes sense to you. If it makes more sense to talk about both, do both and see what happens. See if people will
accept your explanation. So I'm gonna talk about them
in the step they're entered. So that means, for model
one, when I'm controlling for demographics, sex is
a significant predictor. I'm gonna list, I'm gonna do beta, so Insert. The advantage of beta is that it's standardized, so I can compare, there's beta, I can compare across predictors. So I don't know why this always comes up in this other font; there we go, let's do Times New Roman. Sorry, it's one of my things, it just makes me crazy. Alright, there we go. So I'm gonna list beta. What's the advantage of beta? Beta is standardized,
because gender and age are definitely not on the same scale, 'cause one is zero and
one, and the other one is in years. Beta will let me tell which
predictor is stronger, but so will the partial correlations, so you could go with either one, remembering that b is more interpretable, since it's in the scale you're using, so you can talk about it more easily, and beta is standardized, so you can compare better. Either one.
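If you ever need to convert between b and beta by hand, the relationship is just the unstandardized slope rescaled by the two standard deviations; here's a tiny sketch with made-up numbers, not the values from this data set.

```python
def standardize_slope(b, sd_x, sd_y):
    """Convert an unstandardized slope b into a standardized beta."""
    return b * sd_x / sd_y

# purely hypothetical numbers, just to show the rescaling
print(standardize_slope(b=2.0, sd_x=1.5, sd_y=10.0))   # 0.3
```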
Alright, so beta is .68, and my t says it's significant. Remember, the degrees of freedom for t match the second degree of freedom for F in the step we're talking about, so it's 36 here, 'cause it's n minus k minus one. So t(36) = 6.00, p is less than .001. And I'm gonna use pr
squared as my effect size. So what in the heck is pr squared? Sr and pr are types of partial correlations. This output out here, the zero-order column, is just plain r, the correlation between gender and my DV, car. Partial correlations
are in the second column where it says partial. That is the correlation
between gender and car controlling for age, like subtracting out all the variance for age. Semi-partial correlations are the relationship between gender and car with age's variance still included, so the difference between
pr in the middle column and sr in the last column
is the denominator. Pr is calculated only
over the leftover variance: it basically takes age and just carves it out, and says that variance due to age doesn't exist anymore. Poof, gone. For sr, sorry, semi-partial correlations, that variance due to age is still part of the denominator, so it's over total variance on the bottom. If you can't remember the order, like I do sometimes, remember that pr is always larger than sr (unless they're both zero), because the denominator is smaller. And so go with the larger column, which is this one. I'm gonna square that; they're both effect sizes, so it doesn't actually matter, but I like to think about it as an R squared, and so we'll keep the same theme here. And that tells me how much
variance is accounted for; it's actually 50%. We'll talk about what that means here in a second. So for age, the beta is .33, also significant. That doesn't always happen; sometimes it might just be one of them. t(36) = 2.92, and p is less than .01 here. And let's do pr squared, if Word will keep up with me here: .44 squared is .19. And here's the tricky part. Because these don't have
the same denominator as R squared, they do not add up; they will not add up to my total R squared. Sometimes the sum is bigger, sometimes it's smaller, it just depends on the mathematical properties and the overlap between sex and age. But since those two are fairly uncorrelated, the pr values will be bigger. The more correlated they are, the smaller they'll be. Don't expect those to add up; that's just my word of warning here. Right, so 50% of the
unaccounted for variance is due to gender, and 19% is due to age. I can also look at beta
and tell that gender is a better predictor. The interpretation for age here is: for every one-unit increase in age, we get .33 standard deviations, or about .54 to .55 points, of increase in car care. As age goes up, care for car goes up. The tricky part with these categorical variables is, as sex goes up, what does that mean? That's an odd way to say it. Basically, as we go from zero to one: the zero group is girls, females, and the one group is guys, males. So the difference between boys and girls is .68 standard deviations, or 26 points. So as sex goes up, as we're looking at the guys,
care for car goes up. Our guys are taking
better care of their cars than our girls. Sorry, ladies. Alright, so let's talk about extroversion; I added that in model two. So what happens here? I already know it's a significant predictor, because extroversion is the only variable I added in model two and that model change was significant. Let's see what happens with it: beta is .33, about the same size as age. Now my degrees of freedom for t are gonna be different, though, because it's the second degree of freedom here, so that's 35 instead of 36. So t(35) = 2.44, p = .02, which, with only one added variable, will match the p value up here. Let's do pr squared. Can you tell it's late? Getting silly voices. Alright, we've got .38; squared, so come here, Calculator: .38 squared is .14. And you know that it
does not match the R squared change. The overall addition to the equation is .07, which would be this sr of about .26 squared, I'm pretty sure, so let's try that: .26 squared, (clicks) yep, that's where the .07 is coming from. So if you square a semi-partial correlation, you get R squared change.
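To make that calculator work concrete, here's the same arithmetic in a few lines of Python, using the rounded values read off the output, so the numbers only approximately reproduce SPSS. The relationships in play are that the squared semi-partial for the added variable equals the R squared change, and the squared partial equals the R squared change divided by one minus the previous step's R squared.

```python
r2_step1 = 0.55       # R squared for sex + age (rounded from the output)
r2_step2 = 0.62       # R squared for the full model (rounded)
sr_extro = 0.26       # semi-partial (part) correlation for extroversion
pr_extro = 0.38       # partial correlation for extroversion

delta_r2 = r2_step2 - r2_step1
print(round(sr_extro ** 2, 2))              # about .07, the R squared change
print(round(pr_extro ** 2, 2))              # about .14, extroversion's share of leftover variance
print(round(delta_r2 / (1 - r2_step1), 2))  # pr squared again, up to rounding in the inputs
```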
But we're talking about partial correlations, so age and gender's variance has been taken out of the DV, meaning some variance is subtracted out of the denominator, and that's why it's 14% of the unaccounted-for variance. So it's a significant predictor, and to write that up I would talk about all of those different pieces. One caveat that I always tell people is that if a predictor is not significant, you can't just pretend like
it didn't exist anymore. So talk about predictors even
if they're not significant. And again, my thing is: talk about them in the step they're entered. All of mine were significant
in their specific steps, so we'd talk about them all. But you really don't want to ignore one just 'cause it wasn't significant, because people are gonna go, what happened to that other variable, they just stopped talking about it. Say it's not significant. Now for pictures: what can I do to make a graph, a representation of this? It's usually a little hard, because with three predictors you're technically predicting in more dimensions than you can draw. The sort of cheap way to do it, it's not really cheap, but it's the easiest way, would be to create a
picture here, this one, of the relationship between the predicted values and the actual, real values, 'cause this gives me a picture of: all these variables together equal what? Now, I got that scatter
plot when I ran my plots with dependent as Y and
adjusted predicted as X, but this graph is terrible, so here's what I would do to make it APA style. Remember, APA does not have all this stuff at the top. It's not letting me delete here, oh, there we go, it's being grumpy. There we go. Then I would change this label at the bottom: click once to select it, click twice to get to where you can type, and either the equation is a good one, so Sex + Age + Extroversion, or one or all of the variable names. You could also just call this Predicted Values; it doesn't have to be an equation, that's the other option, calling it Predicted Values. I like to remind people
what are the variables I'm using unless you have 10,
then it might get kinda long. Over here, Car is not a very good label, so click once, click twice, and this is my Car Care, oops, not Care Care, Car Care Score. You can delete this awful blurred gray background: double-click on it, change it to transparent here, and Apply; that's just a personal preference, 'cause the gray is awful. But I also like to add the fit line, so Add Fit Line at Total, that will add your fit line, and then you can turn off 'Attach label to line' right here, since that's not actually the full equation. Apply. So I don't wanna include that equation, because my real equation
has three predictors with their coefficients; that's what you're gonna report, with all of your beta values, or your b values. This is just a way to get it to give you the line. So how are we doing? Let's close this and
it'll pop back over here, there we go. We're doing pretty good,
because lots of dots are close to the line. I mean, only one person is even touching the line, but the rest are pretty close; it could be way more spread out. Remember, this is 61% of the variance, and that's a lot. So we're getting pretty good
at guessing people's scores with all three variables at once. And that is how you run a hierarchical multiple linear regression: you've got the steps, how you would talk about each piece in your write-up, and a potential graph or way to visualize the data.