Statistics 101: Linear Regression, The Least Squares Method

Captions

(gentle music) - [Brandon] Hello, thanks for watching, and welcome to the next video in my series on basic statistics. Now as usual, a few things before we get started. Number one, if you're watching this video because you are struggling in a class right now, I want you to stay positive and keep your head up. If you're watching this, it means you've accomplished quite a bit already. You're very smart and talented, but you may have just hit a temporary rough patch. Now I know with the right amount of hard work, practice, and patience, you can work through it. I have faith in you, many other people around you have faith in you, so should you. Number two, please feel free to follow me here on YouTube, on Twitter, on Google+, or on LinkedIn. That way, when I upload a new video, you know about it. And it's always nice to connect with my viewers online. I feel that life is much too short and the world is much too large for us to miss the chance to connect when we can. Number three, if you like the video, please give it a thumbs up. Share it with classmates or colleagues, or put it on a playlist. That does encourage me to keep making them for you. On the flip side, if you think there's something I can do better, please leave a constructive comment below the video and I will take those ideas into account when I make new ones. And finally, just keep in mind that these videos are meant for individuals who are relatively new to stats. So I'm just going over basic concepts, and I'll be doing so in a slow, deliberate manner. Not only do I want you to know what is going on, but also why and how to apply it. So all that being said, let's go ahead and get started. So this video is the next in our series about simple linear regression. In our last two videos we talked about the very basics of regression and introduced other basic concepts like the algebra of lines, and general patterns to look for on a scatter plot. 
In this video, we're going to learn about the fundamental concept in linear regression, the least squares method. We will talk about how the least squares method relates to previous concepts we have learned and then we will actually use the method to calculate the least squares line, or the regression line. This video will involve formulas and simple calculations. While you may not have to find a simple regression line by hand very often, it's not that difficult to do. And this video will at least show you how it's done so you understand the underlying mechanics. So if you are new to regression or are still trying to figure out exactly what it even is, this video is for you. So sit back, relax, and let's go ahead and get to work. So in this video we will continue to use the problem we have used in previous videos, so I will review it very quickly. Let's assume that you are a small restaurant owner, or a very business-minded server or waiter at a nice restaurant. Here in the U.S. and elsewhere, tips are a very important part of a waiter's pay. Most of the time the dollar amount of the tip is related to the dollar amount of the total bill. So there's a dependency there. As the waiter or owner, you would like to develop a model that will allow you to make a prediction about what amount of tip to expect based on the bill. Therefore one evening, you collect data for six meals. In the last video, you only had the tip data. But now you were able to go back and get the actual bill data as well. So now you are working with two variables that are matched pairs. So you can see we have six meals over here on the right. In the left column we have the total bill, and then on the right we have the tip. So for our first meal, or first table, or whatever it was, the total bill was $34, and the tip that went along with it was $5. Then for the next meal the total was $108, and the tip amount was $17, and so on and so forth. 
Now you want to know to what degree the tip amount can be predicted by the bill amount. So in this case the tip is the dependent variable. So we're sort of making a logical claim that the tip amount is dependent on the total bill amount. And it's important to set up your variables this way. It would not make sense to say the total bill amount is dependent on the tip amount, that's reversed. So we wanna say what's actually true sort of in real life, that the tip amount is dependent on the total bill amount. So the tip amount is the dependent variable, and the bill is the independent variable. But in a previous video we looked at a situation where we only had the tip data, and what we determined in that case, where we only had the one variable, was that all we can do is use its mean as the best predicted value. So based on that data we found that the mean of the tips was $10. Therefore our best prediction for the seventh meal would be $10. That's the best we could do. So I went ahead and plotted our points and we notice that we have a horizontal line at the tip amount of $10. Just a black dotted line. And then we put all of our tips in relation to that line, then we found the difference, or the distance, between the line and each tip, then we squared that difference, and what that gave us are the squared residuals, or the squared errors. And then we added them up. So the sum of the squared errors, or the sum of squared residuals, was 120. Now when conducting simple linear regression with two variables, we will determine how well the regression line fits the data by comparing it to this type of model. Literally in this case, in this problem, we're gonna compare a regression solution to this line, where we pretend the second variable doesn't even exist. So remember beta sub-one is our slope, and a horizontal line has a slope of zero. So in this case, for the line we're looking at here, our beta sub-one is zero. 
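As a rough check of the baseline described above, here is a minimal Python sketch that predicts every tip with the mean and adds up the squared residuals. The transcript reads out only four tip values (5, 17, 11, and 8); the last two values below (14 and 5) are assumed, chosen so that they agree with the stated mean of $10 and the stated sum of 120.

```python
# Tip amounts for the six meals.
# 5, 17, 11, and 8 are read out in the transcript; 14 and 5 are ASSUMED values
# that are consistent with the stated mean ($10) and sum of squared errors (120).
tips = [5, 17, 11, 8, 14, 5]

# With only one variable, the best prediction for every meal is the mean tip.
mean_tip = sum(tips) / len(tips)

# Residual = observed tip minus predicted tip (the mean); square and add up.
sse_baseline = sum((tip - mean_tip) ** 2 for tip in tips)

print("mean tip:", mean_tip)
print("sum of squared errors around the mean:", sse_baseline)
```

This is the number any regression line will later be compared against.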
But the whole idea is that when we do regression, we're gonna compare that model to this model, and hopefully it's better. And we'll talk about exactly what better means as we go. So what exactly is the least squares method? Well it depends on this general idea called the least squares criterion, and it looks like this. Now it's kind of an ugly expression there, but I'm gonna pick it apart so we can understand exactly what each thing means. Now let's start at the left, what does min mean? Well it means minimum, or minimization. Then we have the summation symbol there in the middle. Now look over further to the right, and we notice that we have two values in parentheses that we are subtracting. So we're gonna find the difference of two somethings, we don't know what yet. And then we're gonna square that difference. So let's put it all together. We're gonna find the difference of two things, we're gonna square that difference, and then we're gonna add them all together. That's the summation. And the goal is to minimize that sum. So y sub-i is the actual observed value of the dependent variable. In this case, it's the actual tip amount that occurred in the restaurant, that's why it's the observed value. Now y-hat sub-i is the estimated or the predicted value of the dependent variable. So this is the predicted tip amount based on a regression model. So, what are we gonna do here? We're gonna have two values for every x on the graph. We're gonna have the actual observed tip, and then we're gonna have the tip the model predicted. Now those are not gonna usually be the same, there's gonna be some difference between the two. So we're gonna find the difference between those two things, we're gonna square the difference and then add up all the squared differences, and we want that sum to be as small as possible. 
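Written out, the least squares criterion described above would look something like the following. The index limits (i running from 1 to n, with n the number of observations) are an assumed notation; the transcript only describes the min, the summation, and the squared difference.

```latex
\min \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Here $y_i$ is the observed tip and $\hat{y}_i$ is the tip predicted by the regression line for the same bill.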
So in plain English the goal is to minimize the sum of the squared differences between the observed value for the dependent variable, so the actual tip in the restaurant, and the estimated or predicted value of the dependent variable that is provided by the regression line. So think about this. Let's say that we have a bill, I don't know, that's $50, and the patron at the restaurant left a tip of $5, but our regression line predicted a tip amount of $7.50. So we have a difference there. Our observed value was $5, our predicted value was $7.50, so we find the difference between those two, and then we'll square the difference. And then we'll do that for every data point. And not only that, the sum of the squared residuals should be much smaller than when we use just the dependent variable alone. So remember in that case, the slope of the line was zero, and the predicted value for every value of x was $10, 'cause we only had the tip data. And that sum of squared residuals, or sum of squared errors, was 120. So when we actually find these squared residuals using the regression line, their sum should be a lot smaller than 120. So let's walk through this step by step. So step one is to do a scatter plot. Now it seems obvious, but a lot of people just go in and do the math and don't actually look at the data. So do a scatter plot of your data. You can look at the general pattern, and you can look for any outliers, or anything that seems odd. But you also wanna make sure that your graph is scaled correctly. So you can see down here in the bill amount, I started at $20. Well that's because our smallest bill was $34. So if I went all the way down to zero, that wouldn't make a whole lot of sense. So I went ahead and did the scale from 20 to 120. Same thing for the tip amount. Our smallest tip was $5, so I started that axis at four. So you always wanna make sure that your graph is set up proportionally so the scatter plot is not distorted. 
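To make the $50 bill example above concrete, here is a tiny Python sketch of a single squared residual. The variable names are only illustrative; the $5 and $7.50 figures are the hypothetical ones from the transcript.

```python
# Hypothetical example from the transcript: a $50 bill, a $5 tip actually left,
# and a regression line that predicted a $7.50 tip.
observed_tip = 5.00
predicted_tip = 7.50

# Residual = observed value minus predicted value.
residual = observed_tip - predicted_tip

# Squaring removes the sign, so over- and under-predictions both count as error.
squared_residual = residual ** 2

print("residual:", residual)
print("squared residual:", squared_residual)
```

The least squares line is the one that makes the total of these squared residuals, taken over all six meals, as small as possible.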
So step two, look for a rough visual line. Now does the data seem to fall along a line? Well yes it does. So we don't know if any one of these lines here is the actual regression line, but in general the data points do fall along a line. And what if they don't? Well if the data points are scattered all over the place, if they're like a big blob, or if you've ever seen sort of a shotgun blast against a target, where the shot is everywhere, then there would be no linear pattern. You would actually stop your regression. There's not going to be a linear pattern in a blob of data points. Now some people will go ahead and do it because with computers it's very easy to do anyway, but really it's a waste of time. And it's technically not an appropriate technique to use when the data are just sort of randomly all over the place. Now step three, correlation, I would consider optional, but I think it's a good thing to do because the correlation coefficient is involved in other things later in regression and also in multiple regression, so you might as well go ahead and do it anyway. So what is the correlation coefficient for our data here? Well in this case, it's 0.866. Now you obviously have to know what that means. So in this case, is the relationship strong? Well, yes it is. A correlation coefficient of 0.866 indicates a strong, positive, linear relationship. So it sort of gives evidence for our conclusion earlier that in fact there is a linear relationship between the data points. Step four, descriptive statistics and the centroid. So over on the right, we can see that we have the bill column and the tip column. The first thing we wanna do is find the mean of each variable. So our average bill amount, or mean bill amount, was $74. For the tips, it was $10. Now what we can do is actually plot this on the graph. 
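Steps three and four can be sketched in a few lines of Python. The correlation here uses the standard Pearson formula, which the transcript does not spell out. Only the first three bills (34, 108, 64) and first four tips (5, 17, 11, 8) are read out in the transcript; the remaining values are assumed, chosen to agree with the stated means of $74 and $10.

```python
import math

# Bill and tip for each of the six meals.
# ASSUMPTION: bills 88, 99, 51 and tips 14, 5 are not read out in the transcript;
# they are chosen to be consistent with the means and sums the transcript states.
bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]

n = len(bills)
x_bar = sum(bills) / n  # mean bill
y_bar = sum(tips) / n   # mean tip

# Building blocks of the Pearson correlation coefficient.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
sxx = sum((x - x_bar) ** 2 for x in bills)
syy = sum((y - y_bar) ** 2 for y in tips)

r = sxy / math.sqrt(sxx * syy)

# The centroid is the point (mean of x, mean of y).
centroid = (x_bar, y_bar)

print("correlation:", round(r, 3))
print("centroid:", centroid)
```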
So we'll take our mean of the bills of $74, and we'll take the mean of the tips, which was $10, and we'll actually graph a point there. Now this point is very important, and it's called the centroid. And here's why it's important. The best fit, or the least squares regression line, will or must, pass through the centroid. So whatever our regression line happens to be, it has to go through the centroid, which is comprised of the mean of the x variable, and the mean of the y variable. And remember, it takes two points to make a line. So the centroid automatically gives you a point to work with, and that's important as we go forward. But always find the mean of each variable, then we can plot the centroid on the graph knowing that our regression line must go through that point. Let's go ahead and walk through the calculations. Now remember the general model. So y- hat sub-i equals b sub-zero, plus b sub-one, x sub-i. That's a lot of variables in there. But all that means is this, it's comprised of two parts. So b sub-one is the slope. Now, the formula to find the slope is this over here on the right. Now that looks very complex, but it's not. As you can see we have things in there like x bar. Well, that's the mean of the x variable. We have y bar, that's the mean of the y variable, so on and so forth. It's not that hard to do, it just takes a few steps, and we'll walk through them. But that's how we find the b sub-one, which is the slope of our regression line. So in this case, x bar is the mean of the independent variable, in this case, the bill amount. Y bar is the mean of the dependent variable, which in this case are the tips. Now x sub-i is the value of the independent variable for a point, and y sub-i is the value of the dependent variable for a point. So we have the mean of the independent variable, we have the mean of the dependent variable, and then x sub-i and y sub-i are simply a pair of tip and meal data. Now the intercept is the other component here. 
So it's b sub-zero. Now to find b sub-zero, we just take y bar, which is the mean of the dependent variable, and then subtract the slope times the mean of the independent variable. So we have to find b sub-one first, because we will use that to find the intercept. So these are very simple calculations if you set them up in a table and actually walk through them step-by-step. But there is no magic here, the four things we need are things we already know. The mean of both variables, and then a point. So in this case, a bill amount and a tip amount. Let's talk about how to calculate these. So here's our b sub-one. Remember this is the slope of our regression line. So to find the numerator, we do these things. For each data point, we take the x value and subtract the mean of the x variable, or the independent variable. We take the y value and subtract the mean of the y variable, in this case, the tip amount. So all we're doing is taking the difference between the actual data point and the mean for each variable. And then we multiply those two things together, and then add up all the products. Now we're gonna walk through an actual calculation, so we actually see it in action, but this is what we do. Now on the bottom, for each data point, we take the x value and subtract the mean of x, then we square it and add them up. So you can see it's actually pretty simple, just using the four things we already have. The mean of each variable, and then a point that's an x sub-i and y sub-i. Now to find b sub-zero, which is the intercept, we use what we find for b sub-one, the slope, and then we use the mean of y and the mean of x to do a simple calculation, and then we have it. So here is a table where we can actually walk through each step of the calculation. So here on the left, we have each meal one through six, and then we have the dollar amounts for the bill and the tip. So the bill of $34 had a tip of $5, the bill of $108 had a tip of $17, and so on and so forth. 
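The model and the two formulas described in words above can be written as follows. This is a reconstruction from the spoken description, since the slide itself is not part of the transcript; the sums run over all of the data points.

```latex
\begin{aligned}
\hat{y}_i &= b_0 + b_1 x_i \\
b_1 &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \\
b_0 &= \bar{y} - b_1 \bar{x}
\end{aligned}
```

Because $b_0$ depends on $b_1$, the slope has to be computed first.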
Then at the bottom of these columns, you'll see that we have the mean of each variable. So the mean of the total bills was $74, and the mean of the tips was $10. So the first thing we've got to do in this next column is the bill deviation. So we're gonna take x sub-i, and then subtract the mean of x. So it looks like this. So let's walk through a couple of them so you can see where we actually get them. So remember, x sub-i, all that is is the x value over here in the total bill column. So in the first case, it's 34. Then we subtract the mean of that column, which is 74. So 34 minus 74 is negative 40. Let's go to the next one. So 108 minus 74 is 34, same thing. 64 minus 74 is negative 10, so on and so forth. So each x value minus the mean. Now I'll do the same thing for the y's. So for the first one, a tip amount of $5, minus the mean of $10 is negative five. Then we have 17 minus 10, which is seven. 11 minus 10, which is one, eight minus 10, which is negative two, and so on and so forth. So each y value minus its mean. Now look at the next column. What are we gonna do? Well, we're gonna multiply those two things together, it's just the product of those two. So negative 40 times negative five is 200. 34 times seven is 238. Negative 10 times one is negative 10, and so on and so forth. Now we need to find the sum of those things. So we're going to add them all up, and when we do, we have a value of 615. Now, the next column, what do we do? We're gonna square what we found in the bill deviation column. So negative 40 squared is 1,600. 34 squared is 1,156. Negative 10 squared is 100, and so on and so forth. Then we're gonna add all those up, and that's 4,206. Very, very simple if you just follow along step-by-step. Let's go ahead and calculate the slope of our regression line. So remember, here it is. So b sub-one is equal to this fraction here that involves things we just found in the previous slide. 
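The table walked through above can be rebuilt with a short Python loop. As a caveat, the transcript reads out only the first three bills (34, 108, 64) and the first four tips (5, 17, 11, 8); the remaining values below are assumed, chosen so that the column sums come out to the stated 615 and 4,206.

```python
# Bill and tip for each of the six meals.
# ASSUMPTION: bills 88, 99, 51 and tips 14, 5 are not read out in the transcript;
# they are chosen to match the stated means (74, 10) and sums (615, 4,206).
bills = [34, 108, 64, 88, 99, 51]
tips = [5, 17, 11, 8, 14, 5]

x_bar = sum(bills) / len(bills)  # mean bill
y_bar = sum(tips) / len(tips)    # mean tip

sum_products = 0.0  # numerator of the slope formula
sum_squares = 0.0   # denominator of the slope formula

print(" bill  tip  bill_dev  tip_dev  product  bill_dev^2")
for x, y in zip(bills, tips):
    bill_dev = x - x_bar          # x_i minus the mean of x
    tip_dev = y - y_bar           # y_i minus the mean of y
    product = bill_dev * tip_dev  # deviation product
    square = bill_dev ** 2        # squared bill deviation
    sum_products += product
    sum_squares += square
    print(f"{x:>5} {y:>4} {bill_dev:>9.0f} {tip_dev:>8.0f} {product:>8.0f} {square:>11.0f}")

print("sum of deviation products:", sum_products)
print("sum of squared bill deviations:", sum_squares)
```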
Now I've color-coded everything so you can see how everything relates to the actual formula. So in the numerator, you can see that the terms in that numerator are the same thing as the terms in the deviation products column. So the sum of that column will be our numerator. Down in the denominator, we can see that that term is the same thing as our bill deviations squared over here, so the sum of that column will be the denominator. B sub-one, or the slope of our regression line, is 615 divided by 4,206, and we get those from our columns over here on the right. So I'm gonna go ahead and do that division, and the slope of our regression line is 0.1462. Now what about the y-intercept? Well here is our general formula. Now we know what our b sub-one is, or our slope is, we just found it. And here is the rest of the information we need. We need y bar and x bar, which are just the mean of the tips and the mean of the bills. So we go ahead and substitute everything in. So the mean of the tips, or y bar, is 10, minus b sub-one, which is our slope, which is up there at the top, times the mean of the x's, or the mean of our total bills, which is 74. So we'll go ahead and work all that out. And we end up with an intercept of negative 0.8188. What is our regression line? So here is the general formula. B sub-zero is our intercept, we found that out, negative 0.8188, and our slope is 0.1462. Now all we have to do is assemble it. We gotta put it all together, and it looks like this, or this. So y-hat sub-i equals negative 0.8188, that's our intercept, plus 0.1462x. So that's our slope b sub-one up there at the top. Now we could rearrange it and put the slope first, so 0.1462x minus 0.8188 for the intercept. So it doesn't matter how you arrange it. Some software packages will display it one way, some will display it the other way. It really does not matter, as long as you know what each term in there means. 
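The slope and intercept arithmetic above can be reproduced from the four numbers stated in the transcript alone. This is a minimal sketch that follows the hand calculation, including rounding the slope to four decimal places before computing the intercept; the variable names are illustrative.

```python
# Column sums and means as stated in the transcript.
sum_deviation_products = 615   # sum of (x_i - x_bar) * (y_i - y_bar)
sum_squared_bill_devs = 4206   # sum of (x_i - x_bar) ** 2
x_bar = 74                     # mean bill
y_bar = 10                     # mean tip

# Slope: numerator over denominator, rounded to four places as in the video.
b1 = round(sum_deviation_products / sum_squared_bill_devs, 4)

# Intercept: mean of y minus slope times mean of x.
b0 = round(y_bar - b1 * x_bar, 4)

print(f"y-hat = {b0} + {b1}x")
```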
Now I went ahead and did this in Microsoft Excel, and here's what it gave us. It gave us y equals 0.1462x minus 0.8203. Now what was our calculation? 0.1462x minus 0.8188. Now notice there's a little bit of difference due to rounding in the intercept, and that's ok. Now I will say this, that regression is very sensitive to rounding. So it is always best practice to take your calculations out to four decimal places. But here we have a slight difference due to rounding, and otherwise our manual calculation that we just did is the same as what Excel came up with there at the top. So our slope is 0.1462, and our intercept is negative 0.8203 in Excel's case, or negative 0.8188 in our hand calculation case, which is close enough for me. Now look at our centroid. Remember I said our centroid has to fall on the regression line. So in this case, our centroid was 74, 10. So a bill amount of $74 and a tip amount of $10. Well, guess what? It does. So that helps confirm that our regression line is accurate. It goes through the centroid, and our hand calculation matched Excel's calculation. So how do we actually interpret our regression line? So here it is, y-hat sub-i, and again remember that's predicted. The y-hat means this is how we find predicted values, 0.1462x minus 0.8188. What does that actually mean? Well, here's what it means. For every $1 the bill amount, which is our x, increases, we would expect the tip amount to also increase by $0.1462, or about $0.15. So for every dollar the bill amount increases, we expect or predict the tip amount to increase by $0.15 approximately. Now, what does the intercept mean? If the bill amount is $0, then the expected or predicted tip amount is negative $0.82. Well does that make any sense? No, and here's one other important thing. The intercept may or may not have any real meaning in real life. So it may or may not make sense. In this case, it doesn't make any sense. 
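The small gap between the two intercepts can be traced with a few lines of Python. This sketch assumes the difference comes only from carrying the slope at full precision instead of rounding it to 0.1462 first, which is consistent with the numbers quoted above; it also checks the centroid and the per-dollar interpretation. The `predict` helper is a hypothetical name introduced for illustration.

```python
# Sums and means as stated in the transcript.
x_bar, y_bar = 74, 10
slope_full = 615 / 4206                    # slope kept at full precision
intercept_full = y_bar - slope_full * x_bar

# Hand calculation from the video: slope rounded to four places first.
slope_hand = round(slope_full, 4)
intercept_hand = y_bar - slope_hand * x_bar

print("full-precision intercept:", round(intercept_full, 4))
print("hand-rounded intercept:  ", round(intercept_hand, 4))


def predict(bill):
    """Predicted tip for a given bill, using the full-precision line."""
    return intercept_full + slope_full * bill


# The least squares line passes through the centroid (74, 10).
print("predicted tip at the mean bill:", round(predict(x_bar), 4))

# Each extra dollar on the bill adds the slope to the predicted tip.
print("extra tip per extra dollar:", round(predict(51) - predict(50), 4))
```

Rounding the slope before multiplying by 74 shifts the intercept in the third decimal place, which is why carrying extra digits matters.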
So it has to be part of our prediction equation or a regression equation, but it doesn't really make sense in real life. Sometimes it does, sometimes it doesn't. It just depends on the problem. But the important thing to get out of this slide is how the dependent variable changes in relation to one unit change in the independent variable. So for every $1 the bill amount increases, we expect the tip amount to increase by about $0.15, or $0.1462. And that's because it's positive. But is this regression line model any good? Well we don't know that yet. And that will be the topic of our next video. Okay so we've reviewed the heart, or the core, of simple linear regression, which is the least squares method. So we talked about how to put our data on a graph, make sure it actually follows a linear relationship. If it doesn't we should abandon the regression because it doesn't make sense to try to force a linear model on data points that are scattered all over the place. We talked about how we should format our graphs so it doesn't distort our data points as well. We also talked about doing a correlation analysis to figure out if a strong linear relationship exists between the two variables. Now again that's optional but I do suggest doing it because that comes into play later on, and also in multiple regression. Now once we have that, we use the actual step-by-step method to calculate b sub-one, which is the slope of a regression line, and b sub-zero which is the intercept. Now when we calculated those, we found out that, by putting them together, we generated our regression line. Now we can go ahead and graph a regression line and it better go through our centroid, which is the mean of each variable as plotted as a distinct point on the graph. But of course, in the end, we don't know if this regression line is any good. We're gonna compare it to the situation where we did not have an independent variable. 
So remember, in that case, the best prediction we had for any tip was $10. So when we find the squared residuals using the regression line, it had better be a lot less than 120, otherwise a regression model is no better than just using the mean of the tips alone. And that's the whole point. We're going to compare a regression line to the situation where we're only using the mean of the dependent variable. So we'll get to that in the next video. (gentle music)

Info

Channel: Brandon Foltz

Views: 450,488

Rating: 4.9678602 out of 5

Keywords: statistics 101 regression, statistics 101 simple linear regression, statistics 101 linear regression, least squares regression, least squares method, least square method, statistics 101: simple linear regression (part 3), brandon foltz, least squares, simple linear regression part 3, linear regression statistics, linear regression part 3, simple regression, regression, statistics 101, linear regression machine learning, machine learning, linear regression, Regression analysis

Id: Qa2APhWjQPc

Length: 28min 37sec (1717 seconds)

Published: Fri Dec 06 2013