Regression II - Degrees of Freedom EXPLAINED | Adjusted R-Squared

Video Statistics and Information

Captions
Hi, welcome to the second of five videos dealing with the statistical concept of regression. If you missed the first one, we discussed the very basics of regression: we looked at sums of squares (so all those SST, SSR and SSE type terms), we looked at the error terms, and we also touched on R squared. If any of those things seem confusing to you, have a look on the zedstatistics YouTube profile and you'll be able to find the first regression video. In this one we're going to look further at R squared, and in particular at something called the adjusted R squared, and we're also going to be dealing with the very pesky notion of degrees of freedom, which is the thorn in pretty much every student's side when it comes to statistics. Hopefully I'll be able to give you a really intuitive way of looking at degrees of freedom.

So let's recap and take the definition of R squared from the previous video, which is SSR on SST, or alternatively the proportion of variation in Y being explained by the variation in X. Now, a quick note: SSR in this equation is specifically the sum of squares due to the regression. This is what I'm using throughout all these videos: SSR is the sum of squares due to the regression, and SSE is the sum of squares due to error. A couple of people had questions after the previous video about whether SSR should really be SSE or vice versa. There are alternate definitions that use alternate letters for SSR and SSE, so just be very careful what those acronyms stand for when you see them; an alternate convention is to call the explained sum of squares ESS, which is the same as my SSR. Anyway, not to get you confused: that's what I'll be using throughout the entirety of the video. So we have R squared equals SSR on SST. Now, what does R squared actually represent?
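The R squared calculation described above can be sketched in a few lines of Python. The data here is made up purely for illustration; the point is just that R squared is SSR divided by SST, and that SST splits into SSR plus SSE.

```python
import numpy as np

# Hypothetical data: five observations, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the simple linear regression y = b0 + b1*x by least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares due to the regression
sse = np.sum((y - y_hat) ** 2)         # sum of squares due to error

r_squared = ssr / sst
print(round(r_squared, 4))
```

For this made-up data the points sit almost exactly on a line, so R squared comes out just under 1.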
Here we have four unique scatter plots, so each of these represents some kind of study, and we have five observations in each. If we start from the top left, all of the observations line up in a very neat line: you've got your y-axis, your x-axis, and all of these points lining up really nicely. As we said in the previous video, a regression is essentially drawing a line of best fit through all of the observations, so if we look at the top left you can see that the line of best fit goes through every single observation, which is not necessarily very realistic in practice. If that's the case, we say it has an R squared of 1. Now, the further the observations travel from that line of best fit, the lower that R squared is. I've just drawn these myself to make it clear what an R squared represents, but as R squared decreases you can see that the actual relationship is becoming weaker, so X is explaining less of Y as we go across and then down these little panels. Also keep in mind that R squared varies between 0 and 1: 1 is where we have a perfect linear relationship (that's the top left example), and 0 is where we have absolutely no relationship at all, which should look like a random scatter of points. So not even this one on the bottom right has an R squared of 0, and in fact you're very rarely going to get something that close to 0. You're never going to get 0 in practice; even the most unrelated variables are going to find some kind of relationship, be it weak or otherwise.

OK, so let's have a look at the concept of degrees of freedom. As I said, I'm going to give this a very intuitive flavour, so hopefully you can run with me here. The way I'm going to start the explanation of degrees of freedom is by looking at a simple linear regression with one independent variable (that's our X variable) and one dependent variable, Y.
Now I'm going to ask you: what's the minimum number of observations required to estimate this regression? Let's just say Y is a person's height, for example, and we're going to try to estimate a person's height via their weight. You'd expect that someone who weighs more might be taller, but it's obviously not going to be a one-to-one relationship; there's going to be some error associated with it. But how many people do you need in this study to make a regression? Here we have x and y. If you have one observation, so one person in the study, you can't run a regression: you can't draw a line of best fit through one point, and I think everyone can appreciate that you can draw a line in any direction you want through that one point. So you might think, all right, we need two observations for our regression, and you might think, OK, that's good, we can draw a line of best fit through those two points. But appreciate that it doesn't matter where that second point goes, whether it's over there or up here: we're always going to get an R squared of 1. What that means is that the line of best fit is always going to go through both of those points, and given that the R squared is always going to be 1, the strength of the relationship between Y and X just can't be assessed, so it's not really a regression at all. It's only when we get that third observation that the model gains some freedom to assess the strength of the relationship between X and Y: the line can actually go in between those three points now, and you can see that our R squared is not 1 here; we have an R squared of 0.87. The idea is that we have one degree of freedom, because that third observation allows the model to actually differ from the points themselves. Does that sort of make sense? So if we throw another observation into the mix, we now have two degrees of freedom, because there are two additional observations giving this model a bit more power.
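The two-point versus three-point argument can be checked directly. This is a minimal sketch with invented coordinates: with two observations the fitted line passes through both points exactly, so R squared is 1 wherever the second point sits; a third observation gives the line freedom to miss the points.

```python
import numpy as np

def r_squared(x, y):
    # Fit the least-squares line of best fit and return R^2 = SSR / SST
    b1, b0 = np.polyfit(x, y, 1)
    y_hat = b0 + b1 * x
    ssr = np.sum((y_hat - y.mean()) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return ssr / sst

# Two observations: the line goes through both points, so R^2 is always 1
two_point = r_squared(np.array([1.0, 2.0]), np.array([3.0, 7.5]))
print(two_point)

# Three observations: one degree of freedom, and R^2 can drop below 1
three_point = r_squared(np.array([1.0, 2.0, 3.0]), np.array([3.0, 7.5, 8.0]))
print(three_point)
```

The second printed value is strictly below 1, which is exactly the "freedom" the third observation buys.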
Now here's the real kicker, and this is what I think gets people across the line in appreciating how degrees of freedom interacts with the number of variables you have in a model. In this case I've thrown in a second variable: let's just say Y is height again, x1 is the person's weight, and x2 is perhaps the person's mother's height, for example. In this case, what's the minimum number of observations required to estimate the regression? Visually it's represented by a sort of three-dimensional space: we have x1 on the horizontal axis here, x2 on this axis coming out of the page, if you will, and then Y going up. So what's the minimum number of observations you need to run a regression? Let's just start with three. Unlike the two-dimensional equivalent, here we're actually drawing a plane of best fit: when you have two X variables, essentially what you're doing is putting a plane through those points, and appreciate that any three points in three-dimensional space can have a plane cut through all three of them. So here we have an R squared of 1. But when we introduce that fourth point, the plane actually gets some freedom to cut through those four points, meaning we have one degree of freedom in this case. So in the three-dimensional example we needed that fourth observation to get one degree of freedom, and if we had five observations we'd have two degrees of freedom, and so on. With that additional variable x2 we've actually lost some degrees of freedom for a given number of observations, which leads us to this particular formula: degrees of freedom equals n minus K minus 1, where n is the number of observations you have and K is the number of explanatory (or X) variables. You can see that as K increases for a given n, we're going to lower the number of degrees of freedom. So in the single X variable example, we had four observations and two degrees of freedom.
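The formula n minus K minus 1 from above can be written as a tiny helper, reproducing both worked examples (the 2-D line and the 3-D plane):

```python
def degrees_of_freedom(n, k):
    # n observations, k explanatory (X) variables, minus 1 for the intercept
    return n - k - 1

# One X variable (the line example): four observations give two degrees of freedom
print(degrees_of_freedom(4, 1))

# Two X variables (the plane example): the same four observations give only one
print(degrees_of_freedom(4, 2))
```

Adding a variable while holding n fixed always costs exactly one degree of freedom.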
If you think about it, n is 4 and K is 1 (we only have one X variable), so n minus K minus 1 is 4 minus 1 minus 1, which is 2. Now, if we throw in the second independent variable and we still have 4 observations, we only have one degree of freedom: the addition of that extra variable x2 has lost us a degree of freedom.

OK, so why do we care about degrees of freedom at all? What does it actually do for us? Well, as we'll see, degrees of freedom is closely related to R squared, and R squared is quite useful, don't forget, because it tells us how much of the variation in Y is explained by X; it's that measure of the strength of the relationship between X and Y, and it's affected by degrees of freedom. How does degrees of freedom relate to R squared? Well, as the degrees of freedom decreases (for example, because you're adding more and more variables to your model), R squared will only increase. That means that if you're throwing useless variables into your model, it doesn't matter how useless they are or how little they affect your Y variable: R squared is going to go up, not because you're adding any more explanatory power to your model, but because you're reducing the degrees of freedom.

So, to summarise all of that, we know that R squared can be quite deceiving when you have low numbers of degrees of freedom. What can we do about that? Here's a metric called adjusted R squared, which has a fairly complicated formula, but if you have the original R squared it's just a case of plugging the numbers in: you've got your n there, which is the number of observations, and K, the number of variables, and as K increases you'll see that the adjusted R squared actually decreases when you hold everything else constant. What the adjusted R squared is effectively doing is accounting for the reduced power in the model when you have a low number of degrees of freedom.
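The "fairly complicated formula" referred to here is the standard adjustment: adjusted R squared equals 1 minus (1 minus R squared) times (n minus 1) over (n minus K minus 1). A small sketch, using R squared values from the hypothetical example later in the video:

```python
def adjusted_r_squared(r2, n, k):
    # Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R^2 of 0.79 and seven variables, but different sample sizes:
print(round(adjusted_r_squared(0.79, 25, 7), 3))  # plenty of degrees of freedom
print(round(adjusted_r_squared(0.79, 10, 7), 3))  # only two degrees of freedom
```

With only two degrees of freedom left, the same raw R squared of 0.79 collapses to an adjusted value of roughly 0.055.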
So here I've written: as K increases, adjusted R squared will tend to decrease, holding everything else constant. Obviously, if you're adding very useful variables to the model, adjusted R squared will also increase, but if you're adding useless variables you'll find your adjusted R squared decreases, reflecting the fact that you have lost degrees of freedom.

To finish off, let's have a quick look at this very hypothetical situation where we have 25 observations for the first four models, with four variables in the first model, then five, six and seven variables in each respective model after that. I've kept that the same for the four models down here as well, but for these four we have ten observations, so ten people in the study. Now, the R squared values (again, I just made these up completely) are 0.71, 0.76, 0.78 and 0.79, so you'd be thinking, great, every time we've included a new variable the R squared has increased; let's put more variables in, because our model keeps getting better and better. But if we look at the adjusted R squared, you can see there's a sizeable jump when we've gone from four to five variables, and indeed the adjusted R squared increases as well; it increases again the next time, though not by much; and once we get to seven variables, even though the R squared has increased, the adjusted R squared actually decreases, using that formula on the previous slide. This indicates that we actually had the best situation with six variables, not seven. And you can see that when we have only very few observations relative to the number of variables, the effect is even greater: with the same R squared values here, the adjusted R squared decreases considerably, and 0.055 seems very, very low. Why that's the case is that we only have two degrees of freedom in this very last regression (remember, it's n minus K minus 1), which is not much at all. That's not a very healthy regression.
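Running the made-up R squared values through the standard adjustment, 1 minus (1 minus R squared) times (n minus 1) over (n minus K minus 1), reproduces the whole pattern described: with 25 observations the adjusted value peaks at six variables, and with only 10 observations the seven-variable model collapses. (These R squared values are the invented ones from the example, not real data.)

```python
def adjusted_r_squared(r2, n, k):
    # Standard adjusted R^2 formula
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2_values = [0.71, 0.76, 0.78, 0.79]   # hypothetical raw R^2, always increasing
ks = [4, 5, 6, 7]                      # number of X variables in each model

results = {}
for n in (25, 10):
    results[n] = [round(adjusted_r_squared(r2, n, k), 3)
                  for r2, k in zip(r2_values, ks)]
    print(n, results[n])
```

For n = 25 the adjusted values rise and then dip at seven variables, so the six-variable model wins; for n = 10 the last model's adjusted value falls to about 0.055.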
You need quite a few degrees of freedom to actually be able to explain anything, to allow the model to have error, to see whether the two or three or four variables are related to each other. A final point to note about adjusted R squared is that it's not bounded by 0 and 1; it can actually go negative, so it's not really an intuitive value. You can't say that 0.055 is 5% of something; it's not 5% of anything. But it does give us a way of comparing between models, so in this case you'd say this one looks like the best model, where we had four variables, and out of the top four we might select this one as the model with the best explanatory power. So that's it. I hope you've enjoyed the explanation of degrees of freedom and R squared in this, the second of hopefully five videos on regression. I'm Justin Zeltser, and this is zedstatistics.
Info
Channel: zedstatistics
Views: 302,774
Rating: 4.9670897 out of 5
Keywords: CamtasiaForMac, Coefficient Of Determination, Regression Analysis, Degrees Of Freedom, R-squared, Adjusted R-squared, Statistics (Field Of Study), zedstatistics, zstatistics
Id: 4otEcA3gjLk
Length: 14min 19sec (859 seconds)
Published: Sun Aug 11 2013