Basic Excel Business Analytics #47: SST = SSR + SSE & R Squared & Standard Error of Estimate

Captions
Welcome to Highline BI 348, class video number 47. If you want to download this Excel workbook, BI 348 Chapter 4, or the PowerPoints, click on the link below the video. In this video, we have a bunch of awesome topics in regard to linear regression. We're going to talk about r squared, a measure of goodness of fit; we're going to talk about SST, SSR, and SSE, measures of variation; and we're going to talk about the standard error of the estimate, an actual estimate of the standard deviation around our regression line.

Now, we're going to start over in our PowerPoints on slide 32. This is what we did in the last couple of videos. We have weekly ad expense. That's our x variable, and we're hoping that it can help us predict our y variable, weekly sales. We plotted all the individual sample points and saw that there was a strong, direct relationship. Last video, we created our equation, y hat equals slope times x plus intercept, and there's the line. Now, we want to start off this video by looking at r squared, which is a measure of the goodness of fit of this line to the actual data points.

Now, let's think about this line. What would it mean if we had a perfect equation that predicted exactly like the original sample points? Go to our next slide. It would look like this. That means we would have our line, and all of the original sample points would be exactly on the line. But that's not what we have. Here is what we have. Here's our estimated regression line, and there are all of the original sample data points. They're not on the line. You can think of the distance between each one of the original sample points and the line as a bit of an error when we use this model. Now, the distance between a particular value and the estimated value we would get at that x value is called a residual.

Now, if we go to our next slide, we can see a picture of what the residuals look like. Here are our original sample data points, and in each one of these sample data points is a y value. Well, guess what? This equation is predicting y hat, which is a y value. If we take the predicted value and the particular value from our sample and compare them, the distance between those is the residual, and we'll actually calculate it as the original y value minus the predicted value. Now, sometimes the actual predicted value will be underneath, which means the equation is under-predicting; other times, it will over-predict. Here's our estimate right there. There is the original y sample point from our sample data. It's way below.

Now, calculating the residuals is not hard. We go to our next slide. Here are our original sample x values and our original sample y values. And we already calculated the slope and the intercept. So guess what? We're going to take our y hat, but each one of the x's will be from the original data set, and if we calculate that down, we have y hat. That means we have an exact predicted value at the same x as the original sample y value. Then we simply subtract. I'm going to say, particular y minus the predicted, and this is called the residual. And when we make all of those calculations, we get the distance, the red line, for each one of our original x values. Now, check this out. We want to analyze the error in using our model as compared to the original sample points, but if we add these up, we get 0, just like with our deviations. Since we still want to analyze these, we will have to get around that.
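Here is a minimal sketch of those worksheet formulas, assuming the x's sit in B4:B193 and the y's in C4:C193 (the ranges used later in the video), with the predicted values in column D and the residuals in column E (those two columns are assumptions for illustration):

    =FORECAST(B4, $C$4:$C$193, $B$4:$B$193)    predicted y hat at the x in B4
    =C4-D4                                     residual: particular y minus predicted y hat
    =SUM(E4:E193)                              the residuals always total 0

Copy the first two formulas down through row 193; the F4-locked absolute references keep the known y's and known x's fixed as you copy.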
The same way we did when we calculated standard deviation and deviations earlier, we will square the values and then add them up, and that will be called the sum of the squares of the errors. Now, before we go and calculate that, I want to think about our model here. This is the y hat model, compared to a y bar model. What's y bar? Now, we saw how to plot this in an earlier video, but think about this. If a manager in a business didn't bother to go out and see if there was a relationship between weekly ad expense and weekly sales, to try and build this model here to predict that, as weekly ad expense increases, it looks like weekly sales increase -- if the manager didn't do that, they might be stuck with just using y bar at every single possible x. That means no matter what the weekly ad expense was, they would just be using y bar.

Now, we want to think about what would happen if we only used y bar. Well, there's the particular value, all the way over here. Look at the distance there. The error in using y bar as our model would be huge. Now, look at our y hat model. The distance between the actual predicted value and the particular value is much smaller. That is going to be the essence of calculating r squared, the goodness of fit. We'll actually calculate all of the errors, or deviations, in using the y bar model, and we'll compare them to the errors in using our y hat model. And you can see right off the bat that they're going to be much smaller. So a lot of error when making predictions with our y bar model -- these are called deviations -- and much less error when we're using our estimated equation -- these errors are called residuals.

Next slide. Our calculation, as we mentioned, is going to be called r squared, also known as the coefficient of determination, and it will measure the goodness of the fit of our estimated regression line to the actual sample data points. Now, before we go calculate, let's review. Here's what a perfect model would look like: all the points on the line. Here are our residuals. That's the error in using our estimated regression equation. Here's the total error that we would have if we used y bar. You can see that's much bigger than the error we would get if we used our regression equation.

Now, there are actually two parts to this error, if we go to our next slide. Here's the total amount. That's the total error, but you see that line kind of splits it into two parts. The first part is the actual error in using the estimated equation to make a prediction. But guess what? This bottom part is the extra benefit we get in using our estimated line over the y bar line. Now, let's see how to calculate both of these errors. The total error, as it will be called, is always going to be the particular value minus y bar. The actual residual is the particular value minus the predicted: you can see, particular value minus y hat. And the bottom part, the part that's explained by using our regression equation, is going to be the predicted value minus y bar, so y hat minus y bar. Now, there are some synonyms for this error: residual error, or the unexplained part of total error when we use our regression equation. This part here is the explained part. Again, it's like the benefit of using this equation over a y bar equation. It's also called the regression amount. Now, here's what we'd like to do. I'd like to calculate every single blue line for every single point and add them up, and then do the same for the residuals and add them up.
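As a minimal sketch of that two-part split, row by row, assume y bar is computed once, say in H1 with =AVERAGE($C$4:$C$193), alongside the layout above (the cell addresses are assumptions):

    =C4-$H$1    total error: particular y minus y bar
    =C4-D4      unexplained part, the residual: particular y minus y hat
    =D4-$H$1    explained part, the regression amount: y hat minus y bar

For every row, the residual plus the regression amount equals the total error exactly.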
And then we can compare those two parts to the total and get what percentage is explained by using the equation and what part is not explained. But we already know what the problem is. If we add up all the residuals, we get 0. Guess what? If we add up all of these amounts, we're going to get 0 too. So if we look at our next slide -- and we'll go do this in just a second -- here's the total. And we already know this one. This is from doing standard deviation: a particular value minus the mean. Do it for every single one, and you're always going to get a total of 0 for all the deviations. Here's our residual. We already saw that that adds up to 0. But also the explained part, or the regression part: if we take our predicted value minus y bar and do it for every single original x, we get a total of 0. But that's no problem. We know what to do. We can square all of these and then add them up, just like we did for standard deviation.

Next slide. That's exactly what we do. Just like for standard deviation -- particular value minus the mean, squared, then add them up -- except now, in our analysis of our regression equation, this will be called the sum of the squares total. That's the total error, in essence, of using y bar. And then we have our two parts. The regression part: predicted value minus y bar, squared. Do it for each individual x and add them up. That's called the sum of the squares of regression. Our residual squared will take the particular y minus the predicted, squared; do it for each individual x and add them up, and that will be the sum of the squares of error.

Now, there's an important relationship between these. Remember, if we go back a slide here, two parts make the whole. If you add the individual error to the individual regression amount, you get exactly the total. If we go back to slide 44, that's the same relationship here. If you add the sum of the squares of regression to the sum of the squares of error, you're going to get the sum of the squares total. So our next slide defines each one of these as a math formula. The sum of all the deviations squared, that's going to be SST. The sum of all of the predicted values minus y bar, squared, that will be SSR. The sum of all the particular values minus our predicted, squared, that will be SSE.

Now, I want to be sure that we're thinking about SST, SSR, and SSE in the right way. So when we go to our next slide: SST, you can think of as how well the actual particular values are clustered around the y bar line. When we get to SSE, you can think of that as how well the particular values cluster around the estimated regression line. And when we go to our next slide for SSR, how to think about SSR: it's going to be a measure of how much better using y hat is for making predictions than y bar. Again, we're going to take all those blue distances, square them, and then add them. So our relationship: SST equals SSR plus SSE. And that means if we want to figure out the goodness of fit -- remember, this is the part that's explained -- we simply use that in the numerator and this in the denominator. On our next slide, we get the coefficient of determination, or r squared. SSR divided by SST, that is our goodness of fit, or r squared. Now, sometimes you already have SSE, the sum of squared errors, so you compare that part to SST, and once you have that percentage, you subtract it from 1. And of course, that's going to be exactly the same, because SSE plus SSR, two parts, make the SST.
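Those three sums can also each be sketched in a single cell with SUMPRODUCT, which squares and adds in one step (same assumed layout as above: y's in C4:C193, predicted values in D4:D193, y bar in H1):

    =SUMPRODUCT((C4:C193-$H$1)^2)      SST, sum of the squares total
    =SUMPRODUCT((D4:D193-$H$1)^2)      SSR, sum of the squares regression
    =SUMPRODUCT((C4:C193-D4:D193)^2)   SSE, sum of the squares error

With SST = SSR + SSE, r squared is then =SSR/SST, or equivalently =1-SSE/SST (writing SSR, SSE, and SST for the cells holding those sums).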
Now, r squared is always going to result in a number between 0 and 1; the closer to 1, the better the fit of the estimated equation to the xy sample data points. r squared can be thought of as the measure of goodness of fit of the estimated regression line to the xy sample data. You can think of it as the proportion of the sum of squares total that can be explained by using the estimated regression equation. You can also think of it this way: the proportion of the variability in the dependent variable, y, that is explained by the estimated regression equation.

Now, we want to go over to Excel and make these calculations. We're on the sheet r squared and S1. All right. We can calculate our predicted values, and guess what? I'm going to show you a different way than before. We calculated the individual slope and intercept using the SLOPE function and the INTERCEPT function. But if you don't already have those calculated in cells, there's a great function called FORECAST. What it will do, if you give it an x, comma, then the known y's, comma, and the known x's, is internally calculate the slope and intercept and run that equation for that particular x. So I'm going to have to lock these. I forgot, so I'm going to hit the F4 key, and watch this. I'm going to click [INAUDIBLE] y and hit F4. And there we go, close parentheses. All right, so FORECAST substitutes for us having to individually do the slope and the intercept. Control-Enter and copy it down. That FORECAST function did the entire process and calculated our y hat at each particular x.

Now, I want to note something about the y's and the predicted values. If you add up both of these columns -- I'm going to Alt-Equals, and when I hit Enter, I come over here and do it for the predicted values, Alt-Equals -- lo and behold, they come out exactly the same, a nice way to check whether you did your predicted values correctly. Now, we can calculate our residual. Remember, the residual is looking at the difference between the actual y from the sample and our model's predicted value. So are you ready? Equals, and you take the particular value minus the predicted value, Control-Enter, and copy it down. Now, these are residuals, so when we add them up, Alt-Equals, we had better get 0.

Now, I want to calculate total, explained, and unexplained -- total, regression, and residual or error -- and we're going to square these because we want to add them. Sum of the squares total: in parentheses, a particular y value minus our y bar -- make sure to lock it with the F4 key -- close parentheses, caret 2. This is no different than if we were calculating standard deviation longhand, the deviations for y, squared. So we copy these down, Alt-Equals to add them up. If we were doing standard deviation, then we'd have to divide by n minus 1 and take the square root, but we're not. We're calculating SST to analyze our estimated regression equation. There's SST. That's the total sum of squares.

All right. Now we can calculate the regression amount: equals, in parentheses, the predicted value minus our y bar, F4 to lock it, close parentheses, caret 2, Control-Enter. Copy it down. I came down to the bottom, and because I squared them, when I add them up -- Alt-Equals -- I do not get 0. That's the sum of the squares regression, in essence, the amount of the SST total that the estimated regression equation explains. Now we can calculate the residuals squared, and I already calculated the residual, so I'm simply going to get that amount there, caret 2, Control-Enter, and copy it down.
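A minimal sketch of those three squared columns, assuming the residual is in E4 and y bar in H1 as before (the addresses are assumptions):

    =(C4-$H$1)^2    deviation squared; the column total is SST
    =(D4-$H$1)^2    regression amount squared; the column total is SSR
    =E4^2           residual squared; the column total is SSE

Copy each down through row 193, then Alt-Equals at the bottom of each column gives SST, SSR, and SSE.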
Now, I come to the bottom, Alt-Equals to add it up, and there's my sum of the squares of error -- SSE, SSR, and SST. Now, I'm going to come down here, and I want to do a slightly different calculation for SST. I just want to prove that the relationship is true, so instead of going and getting that one right there, I'm going to say SSR plus SSE, and it had better be exactly the same. And sure enough, there it is, exactly equal.

Now, I want to calculate SSE next, and I could just come up here and get that amount, but I want to show you a second method of calculating the sum of the squares of error. And the only reason we have a second method here is because we already have our residuals. And what did we do over here in this column? Square them and add them. Well, if you have a column of residuals, there's a function that does specifically that: SUMSQ. It will square and then add whichever numbers you put into the number1 argument. So I'm going to put in all the residuals, and it will automatically, from the SQ part of the function, square them, and from the SUM part, add them. When I hit Enter, I get exactly the same as the calculated amount up here. Now, SSR: since there is this relationship, I can say equals, give me the total minus the other part, and this will give me SSR. And you can see they're exactly the same.

Now, the moment of truth. Here it is, r squared, goodness of fit: equals SSR, that's the explained part, divided by SST, the total sum of squares above and beyond our y bar line. And there it is, goodness of fit, 0.84. 84.2% of the total sum of squares can be explained by the estimated simple linear regression equation line. In essence, about 84% of the variation in y can be explained by the estimated simple linear regression equation. This means that we have a strong fit of the estimated regression equation to the xy sample data points. Remember, the max this can be is 1. And down here, we can think of it this way: if we had not collected x and y data, we would have been stuck with the y bar line, with the total variation of y minus y bar. Our predictions would be much less accurate. Our equation explains 84% of the variation above that y bar line.

Now, we don't have to do the longhand process, but certainly doing it and understanding how the numbers come together to calculate this helps our understanding. But guess what? Just like there are so many other statistical functions, if you want to calculate r squared, there's an RSQ function. You just put in your known y's, comma, and your known x's. And sure enough, it will say about 84% of the variability in y can be explained by our estimated regression equation. We can also, if we know SSE and SST, do 1 minus. And just to prove it, I'm going to say, there's that other part, not the regression part, the error part, divided by the total, and 1 minus -- that's our complement rule -- will give us exactly the same thing (a compact sketch of these shortcut formulas appears below).

Now, that's goodness of fit. I want to actually jump over to our PowerPoints and talk about the relationship between the coefficient of correlation and the coefficient of determination. This is r. This is r squared. Correlation, r, can be calculated if we know r squared: we take the square root of it and assign the sign from our slope. So if the slope is minus, we put a minus in front of it. If it's positive, we just leave it as the square root of r squared. Whereas for the coefficient of determination, if we know r, we simply square it to get r squared, thus the name r squared.
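As referenced above, here is a minimal sketch of those shortcut formulas, with the same assumed ranges (residuals in E4:E193; SSE and SST stand for the cells holding those sums):

    =SUMSQ(E4:E193)            SSE: squares the residuals and adds them in one step
    =RSQ(C4:C193, B4:B193)     r squared straight from the known y's and known x's
    =1-SSE/SST                 the complement form; identical because SSR plus SSE equals SST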
Now, correlation, remember, is a number between minus 1 and 1. It measures the strength and direction of the linear relationship between one independent variable and one dependent variable. The coefficient of determination is a number between 0 and 1. Remember, we're comparing parts to the whole, so if one of those parts goes down to 0, that's the smallest that number is ever going to be. The coefficient of determination, r squared, measures the strength and goodness of fit of the relationship. Now, here's the deal about r squared. It can be used on a linear or nonlinear relationship, and it can be used for one or more independent variables. We'll see multiple linear regression a couple of videos ahead, and there we'll refer to it as big R squared instead of little r squared. So little r squared will be for simple linear regression, when we have one independent variable. Big R squared is when we have two or more independent variables. But remember, correlation is only for linear relationships and only one independent variable.

Now, I want to go back over to Excel, and down here, I want to try to calculate r from r squared. Well, I need the sign. I could just look it up -- and I didn't calculate the slope, so I'm going to say equals SLOPE, known y's, comma, known x's. And I could just look at that, see that it's positive, and take the square root of r squared, but I'm not going to do that. Just in case this is a template and we put some new data in there and it comes out negative, there's a great function in Excel called SIGN. All it does is, you put in some number -- and I'm going to put in the slope -- and it tells you whether it's a plus or a minus. Right now, it will come out as 1, and if I just overwrite this and put any negative number here, it shows minus 1. Now, I'm going to Control-Z to undo that. So that's what SIGN does. It says 1, yes, you're positive; minus 1, yes, you're negative. Times, and then I'm going to go get my r squared, except I need to take the square root of it. And that will work. When I hit Enter, I get exactly r. That is correlation. Now, we can test it with our PEARSON or CORREL function. There is one array, comma, here's a second array, and boom, we get exactly the same answer. Now, the reverse is easier. If we're going from r squared to r, we have to do that SIGN thing, but going the other way, I don't have a cell for it, so right over to the side I'm just going to say, hey, if I already know r, I simply take r and square it. It doesn't matter if it's positive or negative, because when you square it, it's going to turn out to be the same exact number. So that's the relationship between r squared and r; a compact sketch of these formulas appears after this section.

Now, we're going to go back over to our PowerPoints, because we have to calculate one more awesome statistic for our linear regression line. I'm going to go to slide 54. We want an estimate of the variance and standard deviation around the estimated regression line. Well, check this out. We already took all of the errors and squared them. Now, remember, for standard deviation, we take the deviations, square them, and then divide by n minus 1. But here, when we calculate the standard deviation for our estimated regression line, we're going to take the sum of the residuals squared and divide it by, in essence, n minus 2. Now, this textbook says n minus q minus 1, where q is the number of independent variables. n minus q minus 1, or in our case n minus 2, is called degrees of freedom.
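A minimal sketch of moving between r squared and r, using the same assumed ranges:

    =SIGN(SLOPE(C4:C193, B4:B193))*SQRT(RSQ(C4:C193, B4:B193))    r, with the sign borrowed from the slope
    =PEARSON(C4:C193, B4:B193)                                    check: correlation computed directly
    =G10^2                                                        r squared from r, assuming r sits in G10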
And degrees of freedom is simply: take n and subtract how many population parameters you've estimated. And what did we do? We estimated the slope and the intercept, so that's two. So we'll take the sum of the residuals squared and divide by n minus 2. That will give us the estimate of variance for our regression equation, and it has a special name: mean square error. Now, once we have our mean square error, that's our estimate of variance. Remember back in the chapter where we did variance and standard deviation? How do we go from variance to standard deviation? We simply take the square root, and s will indicate the estimate of the standard deviation for the regression equation, which has a special name: the standard error of the estimate. Sometimes you hear standard error of the y, but we're going to say standard error of the estimate.

So let's go over to Excel and calculate this. All right. So here we have q. That's the number of independent variables for this example. For n minus q minus 1, I'm going to say equals, there's our count at the top, minus the number of independent variables, minus 1, and Enter. So our degrees of freedom will equal 188. Now, for mean square error, the estimate of variance for our regression equation, we simply say equals, and we go find SSE, divided by our degrees of freedom, 188, and there is our estimate of variance. Now, for standard deviation, we simply say square root of our variance, and boom, we have the standard error of the estimate. That is the standard deviation for the regression equation. Now, if you have your x and y data, there's a built-in function for the standard error of the estimate -- that means the standard deviation for our regression line -- equals STEYX, ST for standard, E for error, and YX for our x and y data. We simply put in our known y's -- I'm going to have to scroll up here and get my y's; that's the range C4 to C193 -- comma, and our x's, that's B4 to B193. And when I hit Enter, I get exactly the same thing. And just as when we did standard deviation, you can think of the standard error of the estimate as suggesting that any particular estimate will deviate on average by $52,000.

All right. Now let's go look at a second example. We'll go to r squared and S2, and here's our same data set, bike weight and price. We already calculated slope and intercept, and now we want to calculate r squared. Well, of course, we're simply going to use RSQ for r squared, and there are the known y's, comma, known x's. And so r squared, the goodness of fit of our equation to the sample data points, is 0.8. So that's pretty strong. Now, if we were to calculate r from this, remember, we have to take the sign of our slope times the square root of r squared. And sure enough -- that was showing in dollars, so if I apply General formatting with Control-Shift-grave-accent (the tilde key), the keyboard shortcut for General -- minus 0.89, so almost a minus 0.9 correlation with an r squared of 0.8. So remember, correlation tells us the strength and direction of a linear relationship with one independent variable. This tells us what percentage of SST, the total measure of variation above and beyond the y bar line, is explained by our regression equation. Now, we can also calculate the standard deviation for this line: equals STEYX, ST for standard, E for error, YX, our known y's, comma, our known x's. And our standard error of the estimate, or standard deviation, is approximately $71.
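A minimal sketch of the standard-error calculations from the first example, with the same assumed ranges and SSE standing for the cell holding the sum of squared errors:

    =COUNT(B4:B193)-1-1         degrees of freedom: n minus q minus 1, here 190-1-1 = 188
    =SSE/188                    mean square error, the estimate of variance
    =SQRT(SSE/188)              standard error of the estimate, the estimate of standard deviation
    =STEYX(C4:C193, B4:B193)    the built-in function; returns the same standard error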
All right. So in this video, we did a lot. We talked about the standard error of the estimate, we talked about r squared, and we talked about SST, SSR, and SSE. When we come back in our next video, we're going to talk about an awesome feature, the Data Analysis Regression feature. All right. We'll see you next video.
Info
Channel: ExcelIsFun
Views: 13,037
Keywords: Excel, Microsoft Excel, Highline College, Mike Girvin, excelisfun, Michael Girvin, Business Analytics, BI 348, Data Analysis, STEYX function, RSQ Function, FORECAST function, Standard Error, goodness of Fit of Regression Equation to X Y Data, SST, SSR, SSE, Sum of Squares Error, Sum of Squares Regression, Sum of Squares Total, R Squared, Coefficient of Determination, Standard Error of the Estimate, What does R Square Mean? What is a residual, Residuals, Predicted Values
Id: S2b8nseeJjA
Length: 30min 24sec (1824 seconds)
Published: Wed Nov 25 2015