Welcome to Highline BI
348, class video number 47. If you want to download this
Excel workbook, BI 348 Chapter 4 or the PowerPoints, click
on the link below the video. In this video, we have a bunch
of awesome topics in regard to linear regression. We're going to talk
about r squared, a measure of goodness of fit; we're going to talk about SST, SSR, and SSE, measures of variation; and we're going to talk
about the standard error of the estimate,
an actual estimate of the standard deviation
of our regression line. Now, we're going to start over
in our PowerPoints on slide 32. This is what we did in
the last couple of videos. We have weekly ad expense. That's our x variable,
but we're hoping that it can help us predict
our y variable, weekly sales. We plotted all the
individual sample points, saw that there was a
strong, direct relationship. Last video, we
created our equation, y hat equals slope
times x plus intercept, and there's the line. Now, we want to
start off this video by looking at r
squared, which is a measure of the goodness of fit
of this line to the actual data points. Now, let's think
about this line. What would it mean if we
had a perfect equation that predicted exactly like the
original sample points? Go to our next slide. It would look like this. That means we
would have our line and all of the
original sample points would be exactly on the line. But that's not what we have.
Here is what we have. Here's our estimated
regression line, and there's all of the
original sample data points. They're not on the line. You can think of the
distance between each one of the original sample points and the line as a bit of error when we use this model. Now, the distance between
a particular value and the estimated value we
would get at that x value is called a residual. Now, if we go to
our next slide, we can see a picture of what
the residuals look like. Here's our original
sample data points, and in each one of these sample
data points is a y value. Well, guess what? This equation is predicting
y hat, which is a y value. If we took the predicted
value and the particular value from our sample
and compared them, the distance between those would be our residual. We'll actually take the original y value and subtract the predicted value. Now, sometimes the actual predicted value will be underneath, which means the equation is under-predicting; other times, it will over-predict. Here's our estimation
right there. There is the original y sample
point from our sample data. It's way below. Now, calculating the
residuals is not hard. We go to our next slide. Here are our original sample x values and our original sample y values. And we already calculated
the slope and the intercept. So guess what? We're going to take our y hat,
but each one of the x's will be from the original data set,
and if we calculate that down, we have y hat. That means we have an exact
predicted value at the same x as the original sample y value. Then we simply subtract. I'm going to say, particular
y minus the predicted, and this is called the residual. And when we make all
of those calculations, we get the distance,
the red line for each one of our
original x values. Now, check this out. If we were going
to add these up, because we want to analyze the
error in using our model as compared to the original sample
points, if we add these up, we get 0 just like
with our deviations. Now, since we want to analyze these, we will get around that the same way we did with deviations when we calculated standard deviation earlier: we will square the values and then add them up, and that will be called the sum of the squares of the errors. Now, before we go
and calculate that, I want to think about
our model here, this is the y hat model,
compared to a y bar model. What's a y bar? Now, we saw how to plot
this in an earlier video, but think about this. If a manager in
a business didn't bother to go out
and see if there was a relationship
between weekly ad experience and weekly sales to
try and build this model here to predict, as weekly
ad expense increases, it looks like weekly
sales increase. If the manager
didn't do that, they might be stuck with just using y
bar at every single possible x. That means no matter what
the weekly ad expense was, they would just be using y bar. Now, we want to think
about what would happened if we only used y bar. Well, there's the
particular value. They're all over here. Look at the distance there. So the error in using y bar
as our model would be huge. Now, look at our y hat model. The distance between
using the actual predicted value and
the particular value is much smaller. That is going to be the
essence of calculating r squared, the goodness of fit. We'll actually calculate all
of the errors, or deviations, in using the y bar model and we'll compare them to the actual distances using our y hat model. And you can see
right off the bat that it's going to
be much smaller. So a lot of errors
when making predictions with our y bar model. These are called
deviations. There's much less error when we're using our estimated equation, and these errors are
called residuals. Next slide, our calculation,
as we mentioned, is going to be called r squared
also known as coefficient of determination, and it
will measure the goodness of the fit of our
estimated regression line to the actual
sample data points. Now, before we go and
calculate and review, here's what a perfect
model would look like, all the points on the line. Here's our residuals. That's the error in using our
estimated regression equation. Here's the total error that we would have if we used y bar. You can see that's much bigger than the error we would get if we used our regression equation. Now, there are actually two parts to this error, if we go to our next slide. Here's the total amount. That's the total error, but this total error, you see that line kind of splits it into two parts. The first part is the actual error in using the estimated
equation to make a prediction. But guess what? This bottom part is
the extra benefit we get in using our estimated
line over the y bar line. Now, let's see how to
calculate both of these errors. Total error, as
it will be called, is always going to be the
particular value minus the y bar. The actual residual, particular
value minus the predicted, you can see particular
value minus y hat. And the bottom part
is going to be the part that's
explained by using our regression equation. That's going to be the predicted
value minus the y bar, so y hat minus y bar. Now, there's some synonyms
for this error, residual error or the unexplained
part of total error when we use our
regression equation. This part here is
the explained part. Again, it's like the benefit
of using this equation over a y bar equation. It's also called the
regression amount. Now, here's what
we'd like to do. I'd like to calculate
every single blue line for every single point
and add it up and then do the same for the
residuals and add it up. And then we can compare
those two parts to the total and get what percentage is
explained by using the equation and which part is not explained. But we already know
what the problem is. If we add up all the
residuals, we get 0. Guess what? If we add up all these amounts,
we're going to get 0 too. So if we look at our
next slide-- and we'll go do this in just a second. Here's the total. And we already know this one. This is from doing standard
deviation, a particular value minus the mean. Do it for every
single one, you're always going to get a total
of 0 for all the deviations. Here's our residual. We already saw that
that adds up to 0, but also the explained part
or the regression part, if we take our predicted
value minus y bar, do it for every single original
x, we get a total of 0. But that's no problem. We know what to do. We can square all
these and then add them up just like we did
for standard deviation. Next slide. That's exactly what we do. Just like for standard
deviation, particular value minus the mean, squared, add them up, except now in our analysis of our regression equation, this will be called sum of the squares total. That's the whole error, in essence, of using y bar. And we have the first of the two parts, the regression part-- predicted value
minus y bar squared. Do it for each individual
x and add them up. That's called sum of the
squares of regression. Our residual squared
will take particular y minus the predicted, squared, and do it for each individual x, add it up, and that will be
sum of the squares of error. Now, there's an important
relationship between these. Remember if we go back a slide
here, two parts make the whole. If you add the individual error
to the individual regression amount, you get exactly the total. If we go back to slide 44,
that's the same relationship here. If you add sum of the
squares of regression to sum of the
squares of error, you're going to get sum of the squares total. So our next slide defines each
one of these as a math formula. The sum of all the deviations squared, that's going to be SST. The sum of all of the predicted values minus y bar, squared, that will be SSR. The sum of all the particular values minus our predicted values, squared, that would be SSE.
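As a minimal sketch, those three sums can also be computed straight from the raw ranges in Excel (assuming the y values are in C4:C193 and the x values in B4:B193, the layout used later in this sheet; DEVSQ, TREND, and SUMPRODUCT are built-in functions):

=DEVSQ(C4:C193)    for SST, the sum of the squared deviations of y from y bar
=SUMPRODUCT((C4:C193-TREND(C4:C193,B4:B193))^2)    for SSE, the sum of the squared residuals, since TREND returns the predicted y hat at each original x
=DEVSQ(C4:C193)-SUMPRODUCT((C4:C193-TREND(C4:C193,B4:B193))^2)    for SSR, using the relationship SST minus SSE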
Now, I want to be sure that we're thinking about SST, SSR, and SSE in the right way. So when we go to
our next slide, SST, you can think of
this as how well the actual particular values
are clustered around the y bar line. When we get to
SSE, you can think of that as how well
the particular values cluster around the
estimated regression line. And when we go to our
next slide for SSR, how to think about
SSR, it's going to be a measure of how
much better using y hat is for making
predictions than y bar. Again, we're going to take
all those blue distances, square them, and then add them. So our relationship,
SST equals SSR plus SSE. And that means if we want to
figure out the goodness of fit, remember this is the
part that's explained. We simply use that
in the numerator, this in the denominator. In our next slide,
we get coefficient of determination or r squared. SSR divided by SST, that is our
goodness of fit, or r squared. Now, sometimes you already have SSE, the sum of squared errors; in that case, you compare that part to SST to get its percentage, and you subtract it from 1. And of course, that's going to be exactly the same because SSE plus SSR, the two parts, make the SST.
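As a quick sketch of those two equivalent forms (the cell addresses are hypothetical; point them at wherever your totals actually sit, here assumed to be SSR in G1, SSE in G2, and SST in G3):

=G1/G3    r squared as SSR divided by SST
=1-G2/G3    the same r squared as 1 minus SSE divided by SST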
Now, r squared always is going to result in a number between 0 and 1; the closer to 1, the better the fit of the estimated equation to the xy sample data points. r squared can be thought of
as the measure of goodness of fit of the estimated
regression equation line to the xy sample data. You can think of it
as the proportion of the sum of squares total that can be explained by using the
estimated regression equation. You can also think
of it this way: the proportion of the variability in the dependent variable y that is explained by the estimated regression equation. Now, we want to go over to Excel
and make these calculations. We're on the sheet
r squared and s1. All right. We can calculate our
predicted, and guess what? I'm going to show you a
different way than before. We calculated the individual
slope and intercept using the slope function
and the intercept function. But if you don't already have
those calculated in the cells, there's a great function
called FORECAST. And what it will do is if you
give it an x, comma and then the known y's comma
and the known x's, it will internally calculate
slope and intercept and run that equation
for that particular x. So I'm going to
have to lock these. I forgot, so I'm going to hit
the F4 key and watch this. I'm going to click
[INAUDIBLE] y and hit F4. And there we go,
close parentheses. All right, so FORECAST
is substituting for us having to individually do the slope and the intercept. Control-Enter and copy it down. That FORECAST function did the entire process, calculated our y hat at each particular x.
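Here is a minimal sketch of that formula (assuming the x values are in B4:B193 and the y values in C4:C193, the ranges used later in this sheet, with the first x in B4):

=FORECAST(B4,$C$4:$C$193,$B$4:$B$193)    y hat for the x in B4; the dollar signs are the F4 locking, so the known y's and known x's stay fixed when you copy the formula down the column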
Now, I want to note something about the y values and the predicted values. If you add up both of these columns, I'm going to
Alt-Equals, and when I hit Enter, I come
over here and I do it for the predicted values,
Alt-Equals, and lo and behold, they come out exactly
the same, a nice way to check if you did your predicted values correctly.
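As a sketch of that check (same assumed layout, with the predicted y hat column placed in D4:D193, which is an assumption about where you put it):

=SUM(C4:C193)    total of the original y values
=SUM(D4:D193)    total of the predicted y hat values; for a least-squares line with an intercept, these two totals match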
Now, we can calculate our residual. Remember, the residual is
looking at the difference between the actual
y from the sample and our model predicted value. So are you ready? Equals and you take
the particular value minus the predicted
value, Control-Enter, and copy it down. Now, these are residuals, so when we add them up, Alt-Equals, we better get 0.
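A minimal sketch of the residual column and the zero check (again assuming the y's are in column C, the predicted values in a hypothetical column D, and the residuals in a hypothetical column E):

=C4-D4    the residual for the first row, particular y minus predicted y hat; copy it down
=SUM(E4:E193)    the residuals added up, which should come out to 0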
Now, I want to calculate total, explained, and unexplained; total, regression, and residual or error, and we're going to
square this because we want to add them, sum
of the squares total, in parentheses, a particular
y value minus our y bar. Make sure to lock
it with the F4 key, close parentheses, caret 2. This is no different if we were
calculating standard deviation longhand, the deviations
for y squared. So we copy these down,
Alt-Equals to add them up. If we're doing
standard deviation, then we'd have to
divide it by n minus 1 and take the square
root, but we're not. We're calculating SST to
analyze our estimated regression equation. There's SST. That's the total sum of squares. All right. Now we can calculate the regression amount:
equals, in parentheses, the predicted value
minus our y bar. F4 to lock it, close
parentheses, caret 2, Control-Enter. Copy it down. I came down to the bottom
because I squared it. When I add them up--
Alt-Equals-- I do not get 0. That's the sum of the
squares regression, in essence, the amount
of this SST total that the estimated
regression equation explains. Now we can calculate
residuals squared, and I already
calculated the residual, so I'm simply going to get
that amount there, caret 2, Control-Enter,
and copy it down. Now, I come to the bottom,
Alt-Equals to add it up, and there's my sum of the squares of error, SSE. So now we have SSE, SSR, and SST.
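Pulling those three longhand columns together, here is a sketch of the row-level formulas (the y range C4:C193 is the one used later in this sheet; the predicted column D is an assumption, and the video locks a separate y bar cell with F4 rather than using AVERAGE inline):

=(C4-AVERAGE($C$4:$C$193))^2    deviation squared, for the SST column
=(D4-AVERAGE($C$4:$C$193))^2    predicted minus y bar, squared, for the SSR column
=(C4-D4)^2    residual squared, for the SSE column

Copy each one down and Alt-Equals at the bottom of each column to get SST, SSR, and SSE.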
Now, I'm going to come down here, and I want to do a slightly different calculation for SST. I just want to prove that
the relationship is true, so instead of going and
getting this one right here, I'm going to say, SSR
plus SSE, and it better be exactly the same. And sure enough, there
it is, exactly equal. Now, I want to
calculate SSE next, and I could just come up
here and get that amount, but I want to show you a
second method of calculating sum of the squares of error. And the only reason we
have a second method here is because we already
have our residuals. And what do we do over
here on this column? Square them and add them. Well, if you have a
column of residuals, there's a function that does
specifically that, SUMSQ. It will square and then add
whichever numbers you put into the number one argument. So I'm going to put
all the residuals and it will
automatically, from the S cubed part of the
function, square it, and the sum part we'll add it. When I hit Enter, I
get exactly the same as the calculated
amount up here. Now, SSR, since there
is this relationship, I can say equals, give me the
total minus the other part, and this will give me SSR. And you can see they're
exactly the same. Now, the moment
of truth, here it is, r squared,
goodness of fit, equals SSR, that's the
explained part, divided by SST, the total sum of squares above and beyond our y bar line. And there it is,
goodness of fit, 0.84. 84.2% of the total
sum of squares can be explained by the
estimated simple linear regression equation line. In essence, about 84% of
the variation in the y can be explained by the
estimated simple linear regression equation. This means that we have a strong
fit of the estimated regression equation to the xy
sample data points. So remember, the max
this can be is 1. And down here we can
think of it this way. If we had not
collected x and y data, we would have been stuck
with the y bar line with the total variation
of y minus y bar. Our predictions would
be much less accurate. Our equation explains 84% of
the variation above that y bar line. Now, we don't have to
do the longhand process, but certainly doing
it and understanding how the numbers come
together to calculate this helps in our understanding. But guess what? Just like there are so many
other statistical functions, if you want to
calculate r squared, there's an r squared function called RSQ. You've just got to put in your known y's, comma, and your known x's. And sure enough, it will say about 84% of the variability in y can be explained by our estimated regression equation.
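A minimal sketch of that direct function (same assumed C4:C193 and B4:B193 ranges):

=RSQ(C4:C193,B4:B193)    r squared straight from the known y's and known x's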
We can also, if we know SSE and SST, do 1 minus. And just to prove
it, I'm going to say, there's that other part,
not the regression part, the error part
divided by the total and 1 minus-- that's our
complement rule-- it will give us exactly the same thing. Now, that's goodness of fit. I want to actually jump
over to our PowerPoints and talk about the relationship
between coefficient of correlation and
coefficient of determination. This is r. This is r squared. r, the correlation, can be calculated if we know r squared: we
take the square root of it and assign the sign
from our slope. So if slope is minus, we
put a minus in front of it. If it's positive,
we just leave it as square root of r
squared. Whereas for the coefficient
we know r, we simply square it to get r squared,
thus the name of r squared. Now, correlation,
we remember it's a number between minus 1 and 1. It measures the
strength and direction of the linear
relationship between one independent variable and
one dependent variable. For coefficient
of determination, it's a number between 0 and 1. Remember we're comparing
parts to the whole, so if one of those
parts goes down to 0, that's the smallest that
number is ever going to be. Coefficient of
determination, r squared, measures the
strength and goodness of fit of the relationship. Now, here's the deal
about r squared. It could be used on a linear
or nonlinear relationship, and it can be used for one or
more independent variables, and we'll see multiple
linear regression in a couple of videos
ahead, and there we'll refer to it as a big R squared
instead of little r squared. So little r squared will be
for simple linear regression when we have one
independent variable. Big R squared is when
we have two or more independent variables. But remember, for correlation,
it's only for linear and it's only one
independent variable. Now, I want to go
back over to Excel, and down here, I want
to try and calculate from r squared getting r. Well, I need the sign. I could just look up-- and I
didn't calculate the slope, so I'm going to say equals
slope, known y's comma, known x's. And I could just look at
that and go it's a positive and take the square
root of r squared, but I'm not going to do that. Just in case this is a template
and we put some new data in there and it
comes out negative, there's a great function
in Excel called SIGN. And all it does is you
put in some number-- and I'm going to
put in the slope-- and it tells you whether
it's a plus or a minus. Now right now, it
will come out as 1, and I'm just going
to overwrite this. If I put any negative number
here, it shows minus 1. Now, I'm going to
Control-Z to undo that, so that's what SIGN does. It says 1, yes, you're positive; minus 1, yes, you're negative-- times, and then I'm going to go get my r squared, except I need to take the square root of that. And that will work. When I hit Enter, I get exactly r. That is correlation. Now, we could test it with our PEARSON or CORREL function. There is one array, comma, here's a second array, and boom, we get exactly the same answer.
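A sketch of both directions (same assumed ranges; SIGN, SLOPE, SQRT, RSQ, and CORREL are all built-in functions):

=SIGN(SLOPE(C4:C193,B4:B193))*SQRT(RSQ(C4:C193,B4:B193))    r, with its sign taken from the slope
=CORREL(C4:C193,B4:B193)    the same r, as a check
=CORREL(C4:C193,B4:B193)^2    going the other way: squaring r gets you back to r squared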
here's a second array, and boom, we get exactly the same answer. Now, the reverse is easier. If we're going from
r squared to r, we have to do that SIGN thing. But I don't have a cell for
it, but right over to the side I'm just going to say, hey,
if I already know r, I simply take r and square it. It doesn't matter if it's
positive or negative, because when you square
it, it's going to turn out to be the same exact number. So that's the relationship
between r squared and r. Now, we're going to go back
over to our PowerPoints because we're going
to have to calculate one more awesome statistic
for our linear regression line. I'm going to go to slide 54. And we want an estimate for
variance and standard deviation of the estimated
regression line. Well, check this out. We already took all of the
errors and squared them. Now, remember
standard deviation, we take the deviations,
square them, and then we divide by n minus 1. But here, when we're going to
calculate standard deviation for our estimated
regression line, we're going to take the
sum of the residual squared and we're going to divide it
by, in essence, n minus 2. Now, this textbook
says n minus q minus 1, where q is the number of independent variables. Now, n minus q minus 1, or in our case n minus 2, that's called
degrees of freedom. And degrees of freedom
is simply, take the n and subtract how many population
parameters you've estimated. And what do we do? We estimated slope and the
intercept, so that's two. So we'll take sum of residual
squared divided by n minus 2. That will give us the estimate
of variance for our regression equation. And it has a special
name, mean square error. Now, once we have our
mean square error, that's our estimate for variance,
remember back in our chapter where we did variance
and standard deviation? How do we go from variance
to standard deviation? We simply take the
square root, and s will indicate the estimate of
the standard deviation for the regression
equation, which has a special name, standard
error of the estimate. Sometimes you hear
standard error of the y, but we're going to say
standard error of the estimate. So let's go over to
Excel and calculate this. All right. So here we have q. That's the number of independent
variables for this example. n minus q minus 1, I'm
going to say equals, there's our count
at the top minus number of independent
variables minus 1 and Enter. So our degrees of
freedom will equal 188. Now, for mean square
error, estimate of variance for our regression
equation, we simply say, equals and we go
find SSE divided by our degrees of
freedom, 188, and there is our estimate of variance. Now, for standard
deviation, we simply say, square root of
our variance and boom, we have standard
error of the estimate. That is the standard deviation
for the regression equation. Now, if you have
your x and y data, there's a built-in function for the standard error of the estimate-- that means the standard deviation for our regression line-- called STEYX: equals ST for standard, E for error, and YX-- and there it is, standard error for our y and x data. We simply put in our known y's, and I'm going to have to scroll up here and get my y's. That's the range C4 to C193, comma, and our x's, that's B4 to B193. And when I hit Enter, I get exactly the same thing.
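Putting the whole standard-error calculation together, here is a sketch (same assumed C4:C193 and B4:B193 ranges; the SSE cell, G2, is hypothetical):

=COUNT(B4:B193)-1-1    degrees of freedom, n minus q minus 1, which is 190 minus 1 minus 1, or 188
=G2/188    mean square error, the SSE divided by the degrees of freedom
=SQRT(G2/188)    the standard error of the estimate, the square root of MSE
=STEYX(C4:C193,B4:B193)    the built-in function, which should match the longhand result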
And just as when we did standard deviation, you can think of the standard error of the estimate as suggesting that any particular prediction will deviate from the actual y value by about $52,000, on average. All right. Now let's go look
at a second example. We'll go to r squared and
s2, and here's our same data set, bike weight and price. We already calculated slope and intercept, and now we want to calculate r squared. Well, of course, we're simply going to use RSQ for r squared, and there's the
known y's, comma, known x's. And so r squared,
goodness of fit, of our equation to the
sample data points is 0.8. So that's pretty strong. Now, if we were to
calculate r from this, remember, we have to take the sign of our slope times the square root of r squared. And sure enough, that result comes out formatted in dollars, so if I apply General with Control-Shift-grave accent (the tilde key), the keyboard shortcut for General, I get minus 0.89, so almost a minus 0.9 correlation with an
us the strength and direction of a linear relationship with
one independent variable. This tells us the
percentage of SST, the total measure of variation above and beyond the y bar line, that is explained by our regression equation. Now, we can also calculate the
standard deviation for this line: equals STEYX-- ST for standard, E for error, YX-- our known y's, comma, our known x's. And, in essence, our standard deviation is approximately $71. All right. So in this video, we did a lot. We talked about the standard
error of the estimate. We talked about
r squared, and we talked about SST, SSR, and SSE. All right. When we come back,
in our next video, we're going to talk about
an awesome feature, the Data Analysis Regression feature. All right. We'll see you next video.