(gentle music) - [Brandon] Hello, thanks for watching, and welcome to the next video in my series on basic statistics. Now as usual, a few things
before we get started. Number one, if you're watching this video because you are struggling
in a class right now, I want you to stay positive
and keep your head up. If you're watching this, it
means you've accomplished quite a bit already. You're very smart and
talented, but you may have just hit a temporary rough patch. Now I know with the
right amount of hard work, practice, and patience,
you can work through it. I have faith in you, many
other people around you have faith in you, so so should you. Number two, please feel free to follow me here on YouTube, on Twitter, on Google+, or on LinkedIn. That way when I upload a new video, you know about it. And it's always nice to
connect with my viewers online. I feel that life is much
too short and the world is much too large for
us to miss the chance to connect when we can. Number three, if you like the video, please give it a thumbs up. Share it with classmates or colleagues, or put it on a playlist. That does encourage me to
keep making them for you. On the flip side, if you
think there's something I can do better, please
leave a constructive comment below the video
and I will take those ideas into account when I make new ones. And finally, just keep
in mind that these videos are meant for individuals
who are relatively new to stats. So I'm just going over basic concepts, and I'll be doing so in a
slow, deliberate manner. Not only do I want you to know what is going on, but also
why and how to apply it. So all that being said, let's
go ahead and get started. So this video is the next in our series about simple linear regression. In our last two videos we
talked about the very basics of regression and introduced
other basic concepts like the algebra of lines,
and general patterns to look for on a scatter plot. In this video, we're going to
learn about the fundamental concept in linear regression,
the least squares method. We will talk about how
the least squares method relates to previous
concepts we have learned and then we will actually use the method to calculate the least squares line, or the regression line. This video will involve formulas
and simple calculations. While you may not have to
find a simple regression line by hand very often, it's
not that difficult to do. And this video will at
least show you how it's done so you understand the
underlying mechanics. So if you are new to
regression or are still trying to figure out exactly what
it even is, this video is for you. So sit back, relax, and let's
go ahead and get to work. So in this video we will
continue to use the previous problem we have used in other videos, so I will review it very quickly. Let's assume that you are
a small restaurant owner or a very business
minded server or waiter, at a nice restaurant. Here in the U.S. and elsewhere,
tips are a very important part of a waiter's pay. Most of the time the dollar amount of tip is related to the dollar
amount of the total bill. So there's dependency there. As the waiter or owner,
you would like to develop a model that will allow
you to make a prediction about what amount of tip to
expect based on the bill. Therefore one evening, you
collect data for six meals. But in the last video,
you only had the tip data. But now you were able to go back and get the actual bill data as well. So now you are working with two variables that are matched pairs. So you can see we have six
meals over here on the right, so in the left column
we have the total bill, and then on the right we have the tip. So for our first meal, or first table, or whatever it was,
the total bill was $34, and the tip that went
along with it was $5. Then the next meal the total was $108, and then the tip amount was
$17, and so on and so forth. Now you want to know to what
degree can the tip amount be predicted by the bill amount. So in this case the tip
is the dependent variable. So we're sort of making a logical claim that the tip amount is dependent
on the total bill amount. Then it's important to set
up your variables this way. It would not make sense to
say, the total bill amount is dependent on the tip
amount, that's reverse. So we wanna say what's actually true sort of in real life, that the tip amount is dependent on the total bill amount. So the tip amount is
the dependent variable, and the bill is the independent variable. But in a previous video
we looked at a situation where we only had the tip data, and what we determined was that, when we only have the one variable, the tip data, all we can do is use its mean as the best predicted value. So based on that data
we found that the mean of the tips was $10. Therefore our best prediction
for the seventh meal would be $10. That's the best we could do. So I went ahead and plotted our points and we notice that we
have a horizontal line at the tip amount of $10. Just a black dotted line. And then we put all of
our tips in relation to that line, then we found the difference or the distance between
the line and the tip, then we squared that difference,
and what that gave us are
the squared residuals, or the squared errors. And then we added them up. So the sum of the squared errors, or the sum of squared residuals, was 120. Now when conducting
simple linear regression with two variables, we
will determine how good the regression line fits
the data by comparing it to this type. Literally in this case in this problem, we're gonna compare a regression solution to this line, where we
pretend the second variable doesn't even exist. So remember beta sub-one is our slope, a horizontal line has a slope of zero. So in this case the line
we're looking at here, our beta sub-one is zero. But the whole idea is that
when we do regression, we're gonna compare that
model to this model, and hopefully it's better. And we'll talk about
exactly what better means as we go. So what exactly is the
least squares method? Well it depends on this general idea called the least squares criterion, and it looks like this. Now it's kind of an
ugly expression there, but I'm gonna pick it
apart so we can understand exactly what each thing means. Now let's start at the left, what does min mean? Well it means minimum, or minimization. Then we have the summation
symbol there in the middle, now look over further to the right, and we notice that we have
two values in parenthesis that we are subtracting. So we're gonna find the
difference of two somethings, we don't know yet. And then we're gonna
square that difference. So let's put it all together. We're gonna find the
difference of two things, we're gonna square that difference, and then we're gonna
add them all together. That's the summation. And the goal is to minimize that sum. So y sub-i is the actual observed value of the dependent variable. In this case, it's the actual tip amount that occurred in the restaurant, that's why it's the observed value. Now y-hat sub-i is the
estimated or the predicted value of the dependent variable. So this is the predicted tip amount based on a regression model. So, what are we gonna do here? We're gonna have two values
for every x on the graph. We're gonna have the actual observed tip, and then we're gonna have
the tip the model predicted. Now those are not gonna
usually be the same, there's gonna be some
difference between the two. So we're gonna find the difference between those two things,
we're gonna square the difference and then
add up all the squared differences, and we want that sum to
be as little as possible. So in plain English the
goal is to minimize the sum of the squared differences
between the observed value for the dependent
variable, so the actual tip in the restaurant, and the
estimated or predicted value of the dependent variable that is provided by the regression line. So think about this. Let's say that we have
a bill, I don't know, that's $50, and the
patron at the restaurant left a tip of $5, but our regression line predicted a tip amount of $7.50. So we have a difference there. Our observed value was $5,
our predicted value was $7.50, so we find the difference
between those two, and then we'll square the difference. And then we'll do that for every point along the regression line. But not only that, but the
sum of the squared residuals should be much smaller than when we use just the dependent variable alone. So remember in that case, the
slope of the line was zero, the predicted value for
every point of x was $10, 'cause we only had the tip data. And that sum of squared residuals, or sum of squared errors, was 120. So when we actually find
these squared residuals using the regression
line, they should be a lot smaller than 120. So let's walk through this step by step. So step one is to do a scatter plot. Now it seems obvious, but a lot of people just go in and do the
math and don't actually look at the data. So do a scatter plot of your data, you can look at the general pattern, you can look for any outliers,
or anything that seems odd. But you also wanna make
sure that your graph is scaled correctly. So you can see down
here in the bill amount, I started at $20. Well that's because our
smallest bill was $34. So if I went all the way down to zero, that wouldn't make a whole lot of sense. So I went ahead and did
the scale from 20 to 120. Same thing for the tip amount. Our smallest tip was $5, so I started that axis at four. So you always wanna make
sure that your graph is set up proportionally
so the scatter plot is not distorted. So step two, look for a rough visual line. Now does the data seem
to fall along a line? Well yes it does. So we don't know if any
one of these lines here is the actual regression line, but in general the data
points do fall along a line. And what if they don't? Well if the data points are all scattered all over the place, if
they're like a big blob, or if you've ever seen sort of a shotgun blast against a target,
the shot is everywhere then there would be no linear pattern. You would actually stop
in your regression. There's not going to be a linear pattern in a blob of data points. Now some people will go ahead and do it because with the computers it's very easy to do anyway, but really it's a waste of time. And it's technically
not an appropriate test or appropriate technique
to use when the data are just sort of
randomly all over the place. Now step three correlation
I would consider optional, but I think it's a good thing to do because the correlation
coefficient is involved in other things later
in regression and also in multiple regression,
so you might as well go ahead and do it anyway. So what is the correlation
coefficient for our data here? Well in this case, it's 0.866. Now you obviously have
to know what that means. So in this case, is the
relationship strong? Well, yes it is. A correlation coefficient
of 0.866 indicates a strong, positive, linear relationship. So it sort of gives evidence
to our conclusion earlier that in fact there is
a linear relationship between data points. Step four, descriptive
statistics and the centroid. So over in the right,
we can see that we have the bill column and the tip column. The first thing we wanna
do is find the mean of each variable. So our average bill amount, or
a mean bill amount, was $74. For the tips, it was $10. Now what we can do is actually
graph this on the graph. So we'll take our mean
of the bills of $74, and we'll take the mean of
the tips, which was $10, and we'll actually graph a point there. Now this point is very important, and it's called the centroid. And here's why it's important. The best fit, or the least
squares regression line, will or must, pass through the centroid. So whatever our regression
line happens to be, it has to go through the
centroid, which is comprised of the mean of the x variable, and the mean of the y variable. And remember, it takes
two points to make a line. So the centroid automatically
gives you a point to work with, and that's
important as we go forward. But always find the mean of each variable, then we can plot the centroid on the graph knowing that our regression
line must go through that point. Let's go ahead and walk
through the calculations. Now remember the general model. So y-hat sub-i equals b
sub-zero, plus b sub-one times x sub-i. That's a lot of variables in there. But all that means is this,
it's comprised of two parts. So b sub-one is the slope. Now, the formula to find
the slope is this over here on the right. Now that looks very complex, but it's not. As you can see we have
things in there like x bar. Well, that's the mean of the x variable. We have y bar, that's the
mean of the y variable, so on and so forth. It's not that hard to do,
it just takes a few steps, and we'll walk through them. But that's how we find the b sub-one, which is the slope of our regression line. So in this case, x bar is the mean of the independent variable,
in this case, the bill amount. Y bar is the mean of
the dependent variable, which in this case are the tips. Now x sub-i is the value
of the independent variable for a point, and y sub-i is the value of the dependent variable for a point. So we have the mean of
the independent variable, we have the mean of
the dependent variable, and then x sub-i and y
sub-i are simply a pair of tip and meal data. Now the intercept is the
other component here. So it's b sub-zero. Now to find b sub-zero,
we just take the y bar, which is the mean of
the dependent variable, and then subtract the
slope, times the mean of the independent variable. So we have to find b sub-one first, because we will use that
to find the intercept. So these are very simple calculations if you set them up in a
table and actually walk through them step-by-step. But there is no magic here,
the four things we need are things we already know. The mean of both variables,
and then a point. So in this case, a
bill amount and a tip amount. Let's talk about how to calculate these. So here's our b sub-one. Remember this is the slope
of our regression line. So to find the numerator,
we do these things. For each data point, we take the x value and subtract the mean of the x variable, or the independent variable. We take the y value and subtract
the mean of the y variable, in this case, the tip amount. So all we're doing is
taking the difference between the mean and the actual data point for each variable. And then we multiply
those two things together, and then add up all the products. Now we're gonna walk through
an actual calculation, so we actually see it in action, but this is what we do. Now on the bottom, for each data point, we take the x value and
subtract the mean of x, then we square it and add them up. So you can see it's
actually pretty simple, just using the four
things we already have. The mean of each
variable, and then a point that's an x sub-i and y sub-i. Now to find b sub-zero
which is the intercept, we use what we find for
b sub-one the slope, and then we use the mean
of y and the mean of x, to do a simple calculation,
and then we have it. So here is a table where we can actually walk through each step of the calculation. So here on the left, we have each meal one through six, and then
we have the dollar amounts for the bill and the tip. So the bill of $34 had a tip of $5, the bill of $108 had a tip of $17, and so on and so forth. Then at the bottom of these columns, you'll see that we have the mean of each variable. So the mean of the total bills was $74, and the mean of the tips was $10. So the first thing we've got
to do in this next column is the bill deviation. So we're gonna take x
sub-i, and then subtract the mean of x. So it looks like this. So let's walk through a couple of them so you can see where we actually get them. So remember, x sub-i, all that
is is the x value over here in the total bill column. So in the first case, it's 34. Then we subtract the mean
of that column, which is 74. So 34 minus 74 is negative 40. Let's go to the next one. So 108 minus 74 is 34, same thing. 64 minus 74 is negative
10, so on and so forth. So each x value minus the mean. Now I'll do the same thing for the y's. So for the first one, a tip amount of $5, minus the mean of $10 is negative five. Then we have 17 minus 10, which is seven. 11 minus 10, which is one, eight minus 10, which is negative two,
and so on and so forth. So each y value minus its mean. Now look at the next column. What are we gonna do? Well, we're gonna multiply
those two things together, it's just the product of those two. So negative 40 times negative five is 200. 34 times seven is 238. Negative 10 times one is negative 10, and so on and so forth. Now we need to find the
sum of those things. So we're going to add them all up, and then when we do,
we have a value of 615. Now, the next column, what do we do? We're gonna square what we found in the bill deviation column. So negative 40 squared is 1,600. 34 squared is 1,156. Negative 10 squared is 100,
and so on and so forth. Then we're gonna add all
those up, and that's 4,206. Very, very simple if you just
followed along step-by-step. Let's go ahead and calculate the slope of our regression line. So remember, here it is. So b sub-one is equal
to this fraction here that involves things we just
found in the previous slide. Now I've color-coded
everything so you can see how everything relates
to the actual formula. So in the numerator, you
can see that the terms in that numerator are the
same thing as the terms in the deviation products column. So the sum of that column
will be our numerator. Down in the denominator,
we can see that that term is the same thing as our bill
deviations squared over here, so the sum of that column
will be the denominator. B sub-one, or the slope
of our regression line, is 615 divided by 4,206, and we get those from our columns over here on the right. So I'm gonna go ahead
and do that division, the slope of our
regression line is 0.1462. Now what about the y-intercept? Well here is our general formula. Now we know what our b sub-one is, or our slope is, we just found it. And here is the rest of the information we need. We need y bar and x bar,
which are just the mean of the tips and the mean of the bills.
substitute everything in. So the mean of the tips,
or the y bar is 10, minus b sub-one, which is our slope, which is up there at the top. And then the mean of the x's, or the mean of our total bills, which is 74. So we'll go ahead and do all that out. And we end up with an
intercept of negative 0.8188. What is our regression line? So here is a general formula,
b sub-zero is our intercept, we found that out, negative 0.8188, our slope is 0.1462, now all we have to do is assemble it. We gotta put it all together, and it looks like this, or this. So y-hat sub-i equals negative 0.8188, that's our intercept, plus 0.1462x. So that's our slope b
sub-one up there at the top. Now we could rearrange it
and put the slope first, so 0.1462x minus 0.8188 for the intercept. So it doesn't matter how you rearrange it. Some software packages
will display it one way, some will display it the other way. It really does not matter,
as long as you know what each section or
each term in there means. Now I went ahead and did
this in Microsoft Excel, and here's what it gave us. It gave us a y equals
0.1462x minus 0.8203. Now what was our calculation? 0.1462x minus 0.8188. Now notice there's a
little bit of difference due to rounding in the
intercept, because we rounded the slope to 0.1462 before using it, and that's ok.
that regression is very sensitive to rounding. So it is always best practice
to take your calculations out to four decimal places. But here we have a slight
difference due to rounding, but our manual calculation
that we just did is the exact same that Excel
came up with there at the top. So our slope is 0.1462, and our intercept is negative 0.8203 in Excel's case, or
negative 0.8188 in our hand calculation case, which
is close enough for me. Now look at our centroid. Remember I said our centroid has to fall on the regression line. So in this case, our centroid was 74, 10. So a bill amount of $74
and a tip amount of $10. Well, guess what? It does. So that supports the
point that our regression line is accurate. It goes through the centroid,
and our hand calculation matched Excel's calculation. So how do we actually
interpret our regression line? So here it is, so y-hat sub-i, and again remember that's predicted. The y-hat means this is how
we find predicted values, 0.1462x minus 0.8188. What does that actually mean? Well, here's what it means. For every $1 the bill amount,
which is our x, increases,
we would expect the tip
amount to also increase by $0.1462, or about $0.15. So for every dollar the
bill amount increases, we expect or predict the tip amount to increase by $0.15 approximately. Now, what does the intercept mean? If the bill amount is
$0, then the expected or predicted tip amount is negative $0.82. Well does that make any sense? No, and here's one other important thing. The intercept may or may not have any real meaning in real life. So it may or may not make sense. In this case, it doesn't make any sense. So it has to be part of our prediction equation or a regression equation, but it doesn't really
make sense in real life. Sometimes it does, sometimes it doesn't. It just depends on the problem. But the important thing
to get out of this slide is how the dependent
variable changes in relation to one unit change in
the independent variable. So for every $1 the bill amount increases, we expect the tip amount to increase by about $0.15, or $0.1462. And that's because it's positive. But is this regression
line model any good? Well we don't know that yet. And that will be the
topic of our next video. Okay so we've reviewed the heart, or the core, of simple linear regression, which is the least squares method. So we talked about how to
put our data on a graph, make sure it actually follows
a linear relationship. If it doesn't we should
abandon the regression because it doesn't make sense
to try to force a linear model on data points that are
scattered all over the place. We talked about how we should format our graphs so it doesn't
distort our data points as well. We also talked about doing
a correlation analysis to figure out if a strong
linear relationship exists between the two variables. Now again that's optional but I do suggest doing it because that comes into play later on, and also in multiple regression. Now once we have that, we use
the actual step-by-step method to calculate b sub-one, which is the slope of our regression line, and b
sub-zero which is the intercept. Now when we calculated those,
we found out that, by putting them together, we generated
our regression line. Now we can go ahead and
graph a regression line and it better go through our centroid, which is the mean of each variable as plotted as a distinct
point on the graph. But of course, in the end, we
don't know if this regression line is any good. We're gonna compare it to the situation where we did not have
an independent variable. So remember, in that
case, the best prediction we had for any tip was $10. So when we find the squared residuals using the regression line, it had better be a lot less than 120,
otherwise a regression model is no better than just using
the mean of the tips alone. And that's the whole point. We're going to compare a regression line to the situation where we're only using the mean of the dependent variable. So we'll get to that in the next video. (gentle music)
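
The hand calculation walked through in this video can be reproduced in a few lines of Python. This is only a sketch under stated assumptions, not something shown in the video. The first three bills (34, 108, 64) and the first four tips (5, 17, 11, 8) are read aloud in the transcript; the remaining values below are assumed, chosen so that they agree with the stated means (74 and 10) and the stated column sums (615 and 4,206). The function name `least_squares_line` and the variable names are hypothetical.

```python
# Six (bill, tip) pairs. Only the first three bills and the first four tips
# are spoken in the video; the rest are ASSUMED values that reproduce the
# stated means (74, 10) and column sums (615 and 4,206).
bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]


def least_squares_line(x, y):
    """Return (b0, b1, x_bar, y_bar, s_xy, s_xx) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n  # mean of the independent variable (bill)
    y_bar = sum(y) / n  # mean of the dependent variable (tip)
    # Numerator of the slope: sum of the products of the deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator of the slope: sum of the squared bill deviations.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept, computed from the unrounded slope
    return b0, b1, x_bar, y_bar, s_xy, s_xx


b0, b1, x_bar, y_bar, s_xy, s_xx = least_squares_line(bills, tips)

# Intercept as computed by hand in the video: slope rounded to 4 places first.
b0_hand = y_bar - round(b1, 4) * x_bar

# The regression line must pass through the centroid (x_bar, y_bar).
centroid_prediction = b0 + b1 * x_bar

# Sum of squared residuals for the regression line, and for the
# mean-only model (a horizontal line at y_bar).
sse_line = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(bills, tips))
sse_mean_only = sum((yi - y_bar) ** 2 for yi in tips)

print(f"slope b1           = {b1:.4f}")
print(f"intercept b0       = {b0:.4f} (unrounded slope)")
print(f"intercept by hand  = {b0_hand:.4f} (slope rounded to 0.1462)")
print(f"prediction at mean = {centroid_prediction:.4f}")
print(f"SSE, regression    = {sse_line:.4f}")
print(f"SSE, mean only     = {sse_mean_only:.4f}")
```

Carrying the unrounded slope through gives the intercept Excel reported, while rounding the slope to 0.1462 first gives the hand-calculated value, which is where the small difference in the intercept comes from.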