Statistics 101: Linear Regression, The Very Basics 📈

Captions
(gentle acoustic guitar music)

- [Brandon] Hello, thanks for watching, and welcome to the next video in my series on basic statistics. Now as usual, a few things before we get started.

Number one, if you're watching this video because you are struggling in a class right now, I want you to stay positive and keep your head up. If you're watching this, it means you've accomplished quite a bit already. You're very smart and talented, but you may have just hit a temporary rough patch. Now I know with the right amount of hard work, practice, and patience, you can work through it. I have faith in you, many other people around you have faith in you, so you should have faith in yourself, too.

Number two, please feel free to follow me here on YouTube, on Twitter, on Google Plus, or on LinkedIn. That way when I upload a new video, you know about it. And it's always nice to connect with my viewers online. I feel that life is much too short and the world is much too large for us to miss the chance to connect when we can.

Number three, if you like the video, please give it a thumbs up. Share it with classmates or colleagues, or put it on a playlist. That does encourage me to keep making them for you. On the flip side, if you think there's something I can do better, please leave a constructive comment below the video, and I will take those ideas into account when I make new ones.

And finally, just keep in mind that these videos are meant for individuals who are relatively new to stats, so I'm just going over basic concepts, and I will be doing so in a slow, deliberate manner. Not only do I want you to know what is going on, but also why, and how to apply it. So all that being said, let's go ahead and get started.

Okay, so this is the first video in what will be, or is, depending on when you're watching this, a multi-part video series about simple linear regression. In the next few minutes, we will cover the basics of simple linear regression, starting at square one. And for the record, from now on if I say just regression, I am referring to simple linear regression, as opposed to multiple regression or models that are not linear, which we will hopefully get to at a later date.

Now regression allows us to mathematically model the relationship between two or more variables, using very simple algebra, to be specific. For now, we'll be working with just two variables: an independent variable and a dependent variable. The truth is, when we talk about how quote "good" a regression model is, we are actually comparing it to another specific model. Oftentimes, students don't realize this. So in this video, we're gonna talk about that idea. I will also begin introducing basic terminology and concepts that will carry you through your work using regression. There are no formulas or calculations in this video; we're just introducing the underlying meaning behind good regression models. So if you are new to regression, or are still trying to figure out exactly what it even is, this video is for you. So sit back, relax, and let's go ahead and get to work.

So as always, I like starting out my videos with a problem, and a relatively real-world problem at that. We'll call this one "tips for service." Let's assume that you are a small restaurant owner, or a very business-minded server or waiter in a nice restaurant. Here in the US, tips are a very important part of a waiter's pay. Most of the time, the dollar amount of the tip is related to the dollar amount of the total bill.
So if the bill is $5, that would have a smaller tip than a bill that is $50. Now as the waiter or the owner, you would like to develop a model that will allow you to make a prediction about what amount of tip to expect for any given bill amount. So one evening, you collect data for six meals, a random sample of six meals. But unfortunately, when you begin to look at your data, you realize you kind of forgot something: you collected data for the tip amount and not the bill amount that goes with it. So unfortunately, right now, this is the best data you have: a random sample of six meals and the tip amount for each one of those meals. So $5, $17, $11 and so on.

Now here's the question. How might you predict the tip amount for future meals using only this data? There's only one variable here, the tip amount; the meal number's just a descriptor. So we have one variable, the tip amount. But I still want to challenge you to come up with a model that will allow you to predict, within some reason, what the next tip is going to be. How can you do that? Think about it.

So the first thing we're gonna do is visualize our data. As you know if you watch my other videos, I am a huge advocate of visualizing our problems: making charts, graphs, diagrams, whatever we have to do to make them visual. So the first thing we'll do is make a graph of our tips. On the x-axis, on the bottom, we have our meal number. Now that's not a variable, that's just a descriptor of which meal we're graphing. On the y-axis, the vertical axis, that's where we will graph our tip amount. Let's go ahead and see what this looks like. So for meal one, with a tip of $5, we'll go ahead and graph that at around $5. For meal two, with a tip of $17, that goes way up there. For meal three, with a tip of $11, that goes there. Meal four, with a tip of $8, that goes there. Meal five, that was a $14 tip. And meal number six, that was a $5 tip. So here are our data points. Remember, we're only dealing with one variable, the tip amount, and the meals along the bottom just describe where we're graphing each point. And the order does not matter; we could have graphed these in any order. This just happens to be the one we ended up with.

Now, what's really the most you can figure out about this data? How would you predict what the tip for meal number seven would be? Is it going to be like meal number six, at $5? Is it gonna be like meal number two, at $17? How would you come up with the best guess or estimate for the next meal using only one variable? Well, you would use its mean. So the mean for all six tips is $10. And guess what? That's the best we can do. With only one variable, the best estimate, the best prediction, for any given meal's tip is $10. So we go ahead and put a line at $10. For this model, that is our best fit line; that's all we have. One variable, the tip amount, and the mean is the best predictor of any given tip amount. Now obviously, if you look at this chart, our tips do not fall on the $10 line; they're scattered around it. But still, the mean is the best estimate of what the next tip, for any given meal, would be.
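Here is a minimal Python sketch of that one-variable model (the code is not from the video, and the names are illustrative):

```python
# With only the six observed tips, the best prediction
# for any future tip is the sample mean.
tips = [5, 17, 11, 8, 14, 5]  # tip amounts from the example

mean_tip = sum(tips) / len(tips)
print(f"Best prediction for the next tip: ${mean_tip:.2f}")  # $10.00
```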
So here's our graph again with our tips and our mean. I just want to stress that the mean line is labeled y bar, the mean of y, and that's for two reasons. One, the tip amount is the dependent variable, which it will be as we progress forward, and the dependent variable is always the y of the x and y axes. And two, of course, we're graphing it on the y-axis, so it should be y bar.

So here it is, the basic concept I really want you to keep in your head as you go forward. Obviously, simple linear regression is about two variables, but we're starting off here because this is where it all begins: with only one variable and no other information, the best prediction for the next measurement is the mean of the sample itself. The variability in the tip amounts, because they're not on the line, they're above and below it, can only be explained by the tips themselves, because that's all we have. The way they're above and below the line, that's just the natural variation in the tips. But the basic point is this: with only one variable, the best way, the only way, we can make a prediction about the next tip amount is the mean. So our best prediction for the tip of meal number seven is $10.

So let's talk about the goodness of fit for this line and our tips. Now obviously we know that the data points, the actual observed values, do not fall on that $10 line; some are above and some are below it. That tells us how well this line fits these observed data points. One way we can measure that is the distance each point is from the best fit line. Now we did this to some degree when we were talking about standard deviation. Remember, there we measured the distance each data point is from the mean. And guess what we're doing here? Measuring the distance each data point is from the mean, because the mean is our line of $10.

So for meal number one, our tip was $5; that's $5 below our mean of $10, so that's negative five. Meal number two got a tip of $17; that was $7 above our mean. Meal three was $11, $1 above our mean. Meal four was $8, $2 below our mean. Meal five was $14, $4 above our mean. And meal six was $5, $5 below our mean. So these are the distances, dollar amounts in this case, by which each observed value deviates from the mean of $10.

Now we have a name for these: they're called residuals. The distances from the best fit line, which in this case, because we have one variable, is the $10 mean, to the observed values are called residuals. They're also called the error, because that's how far off the observed value is from the best fit line. Now you'll notice a few more things here. If you add up the residuals above the line, seven plus one plus four, that's 12. Add up the residuals below the line, five, two, and five, that's minus 12. So the residuals always add up to zero. That's another important concept to keep in mind as we go forward.
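A short Python sketch of those residuals (again illustrative, not from the video) shows the same values and confirms they sum to zero:

```python
# Residuals: signed distances of each observed tip from the mean line.
tips = [5, 17, 11, 8, 14, 5]
mean_tip = sum(tips) / len(tips)  # 10.0

residuals = [tip - mean_tip for tip in tips]
print(residuals)       # [-5.0, 7.0, 1.0, -2.0, 4.0, -5.0]
print(sum(residuals))  # 0.0 -- residuals around the mean always sum to zero
```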
But if you remember, in standard deviation, one of the steps was that we took the deviations from the mean and we squared them. Well guess what? We're gonna do the exact same thing here. So the residual for meal one was negative five, $5 below, so we square that and it squares to 25. Meal number two was $7 above; we square seven, that's 49. So on and so forth. So in the right-hand column of our table, we have our squared residuals.

Now the question is, why do we square them? Well, we square them for the same reasons we square the deviations when calculating the standard deviation. Number one, it makes them all positive: if we square a negative number, it obviously becomes positive. And number two, it emphasizes the larger deviations: a deviation of two will square to four, but a deviation of five will square to 25. So the squaring really exaggerates the points that are further away.

Now what we can do is take these squared residuals in the right-hand column and add them up. They're called the sum of squared residuals, or the sum of squared errors, or the SSE. Now where have you heard that before? Well, you've heard it everywhere in statistics. You've obviously heard it in standard deviation, and you've heard it in ANOVA. Same idea: sum of the squared errors. It's a fancy way of saying we add up the squared residuals, and when we do so, it's 120.

Now when we say squaring the residuals, we literally mean squaring them. So the 25 over here on the left-hand side, that's negative five squared; 49 is seven squared, and so forth. And we really do mean squares: when we square each residual, or error, we're literally making squares on the graph. So when we say sum of squares, we literally mean the sum of those squares. So 49 plus 25 plus one plus four plus 16 plus 25 adds up to 120.
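Extending the sketch, squaring the residuals and summing them reproduces that SSE of 120 (illustrative code, not from the video):

```python
# Sum of squared errors (SSE) for the mean-only model.
tips = [5, 17, 11, 8, 14, 5]
mean_tip = sum(tips) / len(tips)

squared_residuals = [(tip - mean_tip) ** 2 for tip in tips]
print(squared_residuals)       # [25.0, 49.0, 1.0, 4.0, 16.0, 25.0]
print(sum(squared_residuals))  # 120.0, matching the table in the video
```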
Now, here is sort of the blockbuster bombshell concept of this video. The goal of simple linear regression is to create a linear model that minimizes the sum of squares of the residuals, which is the same thing as the sum of squares of the error. So what we're gonna do, once we introduce an independent variable, is create a different line through the data that minimizes the size of these squares. And actually, mathematically, we'll come up with the line through the data that minimizes these squares as much as they can be minimized. That will be our best fit line for the data.

But again, in this problem, we're only using one variable, the dependent variable. When we introduce the independent variable, it will sort of claim some of this error for itself. If our regression model is significant, it will eat up some of the raw error we had when we assumed, like in this problem, that the independent variable did not even exist. So what we're doing here is taking a simple linear regression problem that in theory has an independent variable, called the bill amount, and a dependent variable, called the tip amount, and pretending that the bill amount doesn't even exist. We're only using the tip amount, and that creates a sum of squared residuals of 120.

Now, when we introduce the independent variable of bill amount, we'll create a different best fit line through our data, and it will sort of eat up some of this sum of squares. So when we do regression, we're gonna have sum of squares regression and sum of squares error. By introducing that independent variable of bill amount, we create a new line that goes through the data. That new line will explain some of the sum of squares, and therefore it will reduce the SSE, the sum of these squares, as much as it can be reduced. So the regression line will, and should, literally fit the data better; it will minimize the residuals. So when conducting simple linear regression with two variables, we will determine how well that line fits the data by comparing it to this one, where we pretend the second variable does not even exist.

So when we say a linear regression model is good, what we're saying is that it reduces the sum of squares of the error by a large amount. Which is another way of saying we're comparing the other best fit line to this one you're looking at right here. Simple linear regression is always in comparison to what we would have if we only had the dependent variable. So if a two-variable regression model's best fit line looks just like this example, what does the independent variable do to help us explain the dependent variable? Well, it does nothing. If we introduced bill amount into a two-variable simple linear regression, but the best fit line looked exactly like this, then the bill amount didn't give us anything. It didn't explain the variability in the tip amount any more than the tip amount itself did. So we're always comparing our simple linear regression best fit line to this one: basically, the mean of the dependent variable alone.

Okay, so quick review. Simple linear regression is really a comparison of two models. The first one is where the independent variable does not even exist and we just use the mean of the dependent variable, like we did in this video. The other uses the best fit regression line, where we introduce that second variable, the independent variable, in this case the bill amount; that creates a different line, and then we compare it to the first one. But if there's only one variable, like in this example, the best prediction for other values is the mean of that dependent variable; in this case, it was $10. The difference between the best fit line and an observed value is called the residual, or the error. The residuals are squared and then added together to generate the sum of squared residuals, most often called the sum of squared errors, or SSE. So simple linear regression is designed to find the best fitting line through the data, the one that minimizes the SSE, the area of those squares. And actually, through calculus, it is the best fitting line. Now I'm not going to go into the calculus behind that, but you're just gonna have to trust me that when we come up with a best fit line in simple linear regression, it literally is the line that reduces the SSE as much as possible.

Okay, so that wraps up the very first video of many on simple linear regression. I just want you to realize that later, when we talk about the best fit line in regression, we're actually comparing it to the situation where we don't have the independent variable at all; we're just comparing it to the mean of the dependent variable. So in this video, all we had was the mean of the tips, 10 bucks; that's all we had to go on. Therefore, our best guess or best prediction for the next tip was $10. Later, when we introduce the bill amount, we'll get a different best fit line that will explain, or take up, some of that error. It'll reduce the error, we'll have a different line, and then hopefully we'll have smaller residuals. If the regression line turned out flat across, like the mean line in this example, then the regression wouldn't tell us anything; the bill amount wouldn't mean anything, and the best guess would still just be the mean of the tips.
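One last illustrative sketch (not from the video) checks the claim that the mean is the best flat line: among candidate horizontal lines, the $10 mean gives the smallest SSE.

```python
# Among flat (horizontal) lines, the mean minimizes the SSE.
tips = [5, 17, 11, 8, 14, 5]

def sse(line_height):
    """Sum of squared residuals for a flat line at the given height."""
    return sum((tip - line_height) ** 2 for tip in tips)

for candidate in [8, 9, 10, 11, 12]:
    print(candidate, sse(candidate))
# 8 -> 144, 9 -> 126, 10 -> 120 (the mean; the minimum), 11 -> 126, 12 -> 144
```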
So we'll go more into this example in the second video; I just wanted to lay the foundation for that. I look forward to seeing you next time. (gentle acoustic guitar music)
Info
Channel: Brandon Foltz
Views: 1,640,707
Rating: 4.94 out of 5
Keywords: linear regression, Simple Linear Regression, linear regression statistics, statistics 101 regression, Regression Analysis, linear regression analysis, simple regression, statistics linear regression, regression tutorial, simple regression analysis, regression, simple linear regression statistics, regression statistics, statistics regression, linear regression model, statistics 101, stats 101, brandon foltz, regression line, machine learning tutorial, data science
Id: ZkjP5RJLQF4
Length: 22min 55sec (1375 seconds)
Published: Sat Nov 23 2013