Learn Statistical Regression in 40 mins! My best video ever. Legit.

Video Statistics and Information

Captions
G'day team, Justin Zeltzer here for zedstatistics.com and the YouTube channel of the same name, or roughly the same name. I'm here today to do a new version of a video I made about 10 years ago. That video was probably the most popular video on my channel at the time, and it has stayed that way since. But because it was done 10 years ago, it kind of sucks, if I'm honest: it's me talking to an Excel spreadsheet, not very good looking, but it was a good video for its time. So I thought I'd do it proud and make a new version today, to be the perfect video on regression for those who are new to the topic, or those who just want a foundation in it. It's going to be 40 minutes. That seems like a long time, right? But 40 minutes will get you from zero to hero on the topic of regression. You don't need to know a thing coming into this video, and I guarantee there will be some real nuggets of wisdom for you over the next 40 minutes. So put the kettle on, sit down, and hopefully you'll enjoy what I've got to show you today.

This is the topic list for the video. We start with the objectives behind regression, giving you an intuition for what regression is all about, which I hope sets up a nice architecture for you. We then look at the population regression equation and the sample regression line; that's where we actually incorporate data, to see what we can do with a real data set to create a regression line, and how that line gets created. We'll look at SST, SSR and SSE, which are the nuts and bolts of the mathematics behind creating that regression line, and we'll talk about the confusion that can occur between the R and the E, because some textbooks actually reverse them. It gets a little tricky, so bear with me and check out that section if you're interested. We'll then look at R squared, which is a measure of the strength of a regression, and finally I deal with the pesky topics of adjusted R squared and degrees of freedom. I think I have quite a unique way of explaining those, which a lot of viewers seemed to get a lot out of in the previous video, so stick around for that section. It's a 40-minute video that will take you from zero to hero. I hope you appreciate it, and if you do, please share it; it has taken a lot to put the post-production of this video together. Leave a comment and do all those lovely things for the channel if you can, and I'll catch you on the other side. Let's get stuck into it.

So let's dive into the objectives of regression. We can start with a little definition: regression is a means of exploring the variation in some quantity. Maybe you're interested in figuring out why heart disease varies, or why interest rates go up or down. That variation, the way the quantity moves, has to be separated, and that's what regression does: it separates the variation into what can be explained and what is unexplained. There are two components. Let's use the example I'll be using throughout the entire video: the ice cream sales of a particular vendor, which we're going to explain using three different variables. First, we'll try to explain why ice cream sales vary using the daily temperature, which makes intuitive sense, right?
The higher the temperature on a particular day, the more ice cream sales you'd expect. We're also explaining it by the amount of rain on a particular day; you'd expect that the more rainfall there is, the fewer people will be out and about buying ice cream. And we'll use whether it's school holidays or not, predicting that during school holidays you're more likely to sell more ice creams. These are just predictions so far. But importantly, there's also an unexplained component. When we run a regression, there is going to be a part that is left unexplained, and regression will be able to quantify how much of the variation in ice cream sales is unexplained versus how much is explained. That's it; that is the simplest way of thinking about regression.

Now let's look at how we use algebra to map out this exploration, starting with the population regression equation, which you'll see in any textbook. To simplify things, let's explain ice cream sales using just one explanatory variable, or as we might say, independent variable: daily temperature, forgetting about rainfall and school holidays for the moment. In the population regression equation, the Y on the left-hand side is called our dependent variable, and that's our ice cream sales. Why is it dependent? Because the ice cream sales depend on the daily temperature, and not the other way around; the daily temperature can be whatever it wants, and the ice cream sales will follow. The daily temperature is our independent variable; that's X. The error term relates to everything that is still unexplained. Those B-looking symbols are actually Greek letters, beta naught and beta one, and they are our coefficients. Together with the X term, they describe a linear relationship with Y, and that's why this is called linear regression. If you remember back to high school, you had y = mx + b. Here we have exactly the same thing: it's not mx, it's beta one times X, and it's not plus b, it's plus beta naught, with the order reversed, but it's the same thing. It's just a linear relationship we're modelling between Y, our ice cream sales, and X, our daily temperature. The one little addition is our error term, which makes it a regression and not just a linear relationship.

So the role of a regression is, first, to estimate those betas: to figure out what the linear relationship is, how changing X, the temperature, affects Y, our ice cream sales. That's the first objective of a regression. We also want to quantify the error. Not only do we want to know the effect of X on Y, we want to know how much variation is left over, or as I say here, how much variation in ice cream sales is not yet accounted for by changing temperature, and you'd expect there to be a lot sitting in that error bucket at the moment. Now, the important thing to understand about the population regression equation, or more specifically about beta naught and beta one, is that these are parameters we can never, ever know for sure; we can only estimate them. Beta one represents the slope of the relationship between Y and X, the gradient, and beta naught represents the y-intercept: when X equals zero, Y equals beta naught.
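Written out, the population regression equation being described here, for the one-explanatory-variable case, is:

\[
y = \beta_0 + \beta_1 x + \varepsilon
\]

which lines up with the high-school form y = mx + b: beta one plays the role of the slope m, beta naught plays the role of the intercept b, and epsilon is the error term that makes it a regression rather than just a line.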
We're going to estimate those two parameters, but we can never know what they are for sure. To actually create estimates of beta naught and beta one, we're going to need some data, and this is my mate Paul, who's asking where the data fits into all this. Indeed, we need to move on to the sample regression line, where we include data (that's why it's called a sample regression line) to estimate that theoretical population regression equation.

All right, I hope you're enjoying the video so far. I thought I'd interrupt briefly to recommend a podcast called A Positive Climate, hosted by my very own brother, Nicholas Zeltzer, and Alex McIntosh. Together they investigate the positive initiatives being made to try to correct for climate change. They look at things like biofuels (chicken poo, for example, as a means of generating energy), electric vehicles, lab-grown meats, and alternatives to plastic. Really interesting topics, and they interview a whole bunch of CEOs in that green, renewables kind of space, which I think is a fantastic initiative. They take a bunch of the statistical analysis that I cover in theory on my channel and apply it to the real world, so it's quite an applicable use of the statistics you're seeing. Anyway, it's called A Positive Climate; check it out on all the platforms where you'd expect to find podcasts. Back to the video.

So let's have a look now at the sample regression line. We can't really talk about a sample regression line without an actual sample, so let's look at this data set. There are 10 observations here: we've surveyed our ice cream vendor across 10 Saturdays, and we've recorded how many ice creams that person sold and the maximum daily temperature on each of those Saturdays. You'd be familiar with the scatter plot; all we're doing is putting ice creams sold on the y-axis and the daily maximum temperature on the x-axis. A sample regression line is just a line of best fit through that data. In other words, it's the best estimate we have for the relationship between temperature and ice cream sales, and it seems that the higher the temperature, the more ice creams get sold, so there is indeed a positive relationship there.

But let's backtrack and look at the equation for the sample regression line. You'll notice it's slightly different from the population regression equation: there are hats now on beta naught and beta one. That's one way of writing the sample regression line. The hats mean an estimated value of beta naught and an estimated value of beta one, and in fact it's an estimated value of Y as well. The important point is that there's no error term; this is just an expression describing the line of best fit. That black line is described by a gradient, which is beta one with a hat on it, and a y-intercept, which is beta naught with a hat on it. So there's your y = mx + b from high school, just with different letters standing in for m and b. Other textbooks use lowercase b naught and lowercase b one to mean the same thing, so just be careful: some people like betas with hats on them, other people like lowercase b's, but they mean the same thing. They're actualised values; these will be numbers.
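In symbols, the sample regression line being described is:

\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \qquad \text{(some texts write } \hat{y} = b_0 + b_1 x\text{)}
\]

Note there is no error term here, because this expression describes the fitted line itself rather than any individual observation.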
So I can find out what beta naught hat is: I just need to know the y-intercept. And I can find out what beta one hat is: I just need to find the gradient of that line. In this case those numbers are minus 8.82 and plus 2.86; that's beta naught hat and beta one hat, so we have our sample regression line for this data set.

Now, as I said, because it's y hat equals all this stuff, you're describing the line itself. I can also create an expression with Y on the left-hand side, not y hat, and that describes the value of Y for each individual observation. If we're going to describe the value of Y for each individual observation, we're going to need an error term, because you can see that this particular observation has an error term, in other words a distance from that fitted line. And you'll notice it's actually a different error term from the one we saw in the population regression equation: this one is just a lowercase e, not an epsilon. I'll get to that distinction in just a second, but for the moment, what Paul is going to ask us is: how did this line actually get created? How do we know that this is the line of best fit? I can roughly draw that line in by eye, but how do we know it's the exact line that fits this data set best?

Your first reaction might be to find all the error terms, calculate how far each observation is from the line, and try to minimise the sum of those error terms. The problem is that you've got negative error terms (distances below the line) and positive error terms (distances above the line), so the sum of the error terms comes out to zero, and I can actually draw several different lines that all have a sum of error terms equal to zero; here's another one, where the positive error terms net out against the negative ones. So it's not good enough to minimise the sum of the error terms, because that won't give us a single line of best fit. Instead, we minimise the sum of the squared error terms, which avoids the issue with negative errors by turning everything positive. We find each of the error terms, square them, add them together, and then minimise that final value; the line of best fit is the line that does indeed minimise the sum of squared errors. That Greek letter sigma just means the sum, and this minimising of squared errors is why the whole process is called ordinary least squares; you might have seen it written somewhere as ordinary least squares regression. Of course, we'd use a computer program to do this for us. It is possible to calculate it with a calculator, but we're not going to go into that in this video because it's not necessary.
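Here is a minimal Python sketch of the ordinary least squares calculation just described. The temperature and sales numbers are invented for illustration (the video's actual 10-Saturday data set isn't reproduced in the captions), so the fitted coefficients won't be the minus 8.82 and plus 2.86 quoted above.

```python
import numpy as np

# Illustrative data: 10 Saturdays of max temperature (deg C) and ice creams sold.
# These numbers are made up for the sketch, not the data set shown in the video.
temp = np.array([18, 21, 24, 26, 28, 30, 31, 33, 35, 38], dtype=float)
sales = np.array([40, 45, 55, 60, 70, 75, 80, 90, 95, 110], dtype=float)

# Closed-form OLS estimates: the slope and intercept that minimise the sum of squared errors.
b1_hat = np.sum((temp - temp.mean()) * (sales - sales.mean())) / np.sum((temp - temp.mean()) ** 2)
b0_hat = sales.mean() - b1_hat * temp.mean()

fitted = b0_hat + b1_hat * temp      # y-hat for each observation
residuals = sales - fitted           # the lowercase-e error terms

print(f"beta0_hat = {b0_hat:.2f}, beta1_hat = {b1_hat:.2f}")
print(f"sum of residuals         = {residuals.sum():.6f}")        # ~0, which is why we square
print(f"sum of squared residuals = {(residuals ** 2).sum():.2f}")  # the quantity OLS minimises
```

For what it's worth, np.polyfit(temp, sales, 1) returns the same slope and intercept; the closed form is written out here only to show what "minimising the sum of squared errors" works out to.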
Now, this is going to be my favourite bit of the video, and in previous videos I've made on this topic, this is where the light bulbs really come on, so click your brains into gear. Take the expression we just created from our sample regression line, and notice that I've now put the error term back in, so it's an expression for the ith value of Y, or any specific value of Y. It has b naught, our calculated y-intercept, it has b one, our calculated gradient, and it also has its own error term. I could have used my betas with the hats on them here, but I've chosen to use b naught and b one, remembering that some textbooks do it that way. This is actually an estimate of our population regression equation: lowercase b naught and lowercase b one, or beta naught hat and beta one hat, are estimates of beta naught and beta one.

So here we found our sample regression line: the minus 8.82 and the plus 2.86 are estimates of beta naught and beta one based on the 10 data points we collected. They are our best guesses for what beta naught and beta one are, but we can never know what beta naught and beta one actually are. For example, if I took another sample of 10 days, say the next 10 Saturdays across August, September and October, we'd create a different line of best fit with a completely different equation. Now we have an estimate of beta naught of minus 52.4 and an estimate of beta one of positive 4.08: a steeper line with a lower y-intercept, for what that's worth. This is still an estimate of beta naught and beta one, but the idea is this: if we toggle between the two samples, there's the first one, there's the second one, and there is something we are trying to estimate. This yellow line is what might be called the true relationship between daily ice creams sold and daily maximum temperature. There is something we're trying to estimate, and it's that golden, godly line that we can never know. We only ever get a small snapshot of data to go into estimating that yellow line.

In the example we've got, that black line is our estimate, and every single observation has a calculated error term measured to its line of best fit; that's the lowercase e. But it also has a theoretical error term, which is the Greek letter epsilon. That's why we have two different error terms. Whenever you're looking at a population regression equation, the error term looks like that curly e, the Greek letter epsilon, and whenever you have an actual sample regression line, you'll have a lowercase e, which you can calculate for each particular observation. You cannot calculate the epsilon value, because it's theoretical: we can never truly know that yellow, golden line, which represents our population regression equation.
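To make the "golden line" idea concrete, here is a small simulation under assumed true parameters; the beta values, temperature range and error spread below are all invented for illustration. Every fresh sample of 10 Saturdays produces a different estimated line, even though they're all estimates of the same underlying relationship.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" population parameters -- invented purely for this sketch.
beta0_true, beta1_true = -10.0, 3.0

def one_sample_fit(n=10):
    temp = rng.uniform(15, 40, size=n)                                   # n Saturdays' temperatures
    sales = beta0_true + beta1_true * temp + rng.normal(0, 10, size=n)   # epsilon drawn around the golden line
    b1 = np.sum((temp - temp.mean()) * (sales - sales.mean())) / np.sum((temp - temp.mean()) ** 2)
    b0 = sales.mean() - b1 * temp.mean()
    return b0, b1

for i in range(3):
    b0, b1 = one_sample_fit()
    print(f"sample {i + 1}: beta0_hat = {b0:7.2f}, beta1_hat = {b1:5.2f}")
# Each sample estimates the same golden line, but the estimates bounce around it.
```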
All right, let's move on to SSR, SSE and SST. If you recall, I said that the purpose of regression is to separate the total variance in ice cream sales into the variance explained by the temperature and the variance that is still unexplained, and we need to quantify how much is left unexplained. Before we look at how to calculate these things, we need to be aware of a certain peculiarity around the lettering of SSR and SSE. SST always stands for the sum of squares total; some people write TSS for total sum of squares, but either way, if there's a T in it, it means the total variance in ice cream sales, and we'll see how the sums of squares fit in shortly. SSR, the convention I use, is the sum of squares due to regression: the variance in ice cream sales explained by our regression, in other words by our independent variable, temperature. SSE is the sum of squares due to error, the variance in ice cream sales still unexplained. But be very careful, because I've seen a couple of textbooks where TSS is the total sum of squares, ESS is the explained sum of squares (which relates to the variance explained by the temperature), and RSS is the residual sum of squares, residual being another word for error. So the R and the E have effectively swapped positions, which is really annoying. Look, don't shoot the messenger; there are some annoying quirks in statistical textbooks. I'm using the middle convention here, where the sum of squares total equals SSR, the sum of squares due to regression, plus SSE, everything we don't know so far, and I think that's the more common one too.

What we're going to do is look at each observation and try to figure out why it is different from the mean. Let's look at a particular observation, Saturday the 1st of July. The ice cream sales on that day were high, 112; in fact it was the highest in our sample. The question is: why is it that high? That's what regression is going to help us find out. Before we incorporate the daily maximum temperature into our predictions, we're only left with the difference between that particular point, 112, and the mean. We can calculate that difference: it's the value of Y, in other words the height of the point, minus the mean value of Y, which is 59. Y with a bar on it, or y bar, is the mean value of Y. So that distance represents the total deviation from the mean, and SST is the sum of all the squared distances to that mean. There are 10 observations (the others have been greyed out), and I can take the one down the bottom and find how far it is from the mean; it'll be a negative number, but we square it so it becomes positive, and we add all those squared values together to get SST, the sum of squares total. It's the total squared distance to the mean.

Now what we're going to do is split that total deviation into what is explained and what is unexplained by the X variable, our maximum temperature. So bear with me. Before we looked at the actual temperature as an explanatory variable, we had no idea why Saturday the 1st of July had such high sales. But we can incorporate the fact that it was a particularly hot day. For my North American viewers, 32 degrees Celsius is roughly 90 degrees Fahrenheit; I had to look that up, and I did not spell Fahrenheit correctly. You've got to get on board with Celsius, Americans. In fact, let's take a quick digression, because I was very curious how many countries use Fahrenheit: very few, just the United States, Liberia and the Cayman Islands. People, you've got to get on board; Celsius is where it's at. Nonetheless, 32 degrees Celsius is approximately 90 degrees Fahrenheit, a hot day, and we could say that because it was a hot day, we were expecting sales to be up here anyway, at this green cross point, which looks like about 85. What we can do, therefore, is create an explained component of this variation, which is the bottom bit, and a still-unexplained component, which is the top bit, and the two combine to form the total deviation from the mean. The bottom bit is going to feed into SSR, our explained sum of squares, and the top bit is going to feed into SSE, which is our unexplained sum of squares, the sum of squares due to error.
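For the Saturday 1 July observation, using the prediction of roughly 85 that is read off the chart in the video, the split being described works out as:

\[
\underbrace{y_i - \bar{y}}_{\text{total}} \;=\; \underbrace{(\hat{y}_i - \bar{y})}_{\text{explained, into SSR}} \;+\; \underbrace{(y_i - \hat{y}_i)}_{\text{unexplained, into SSE}}, \qquad 112 - 59 = (85 - 59) + (112 - 85) = 26 + 27 = 53.
\]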
So with each observation, you've got this explained deviation to the line of best fit, based on what the temperature was: on colder days we expect lower sales, on warmer days we expect higher sales. But this particular day outstripped what we expected by this much, so we incorporate that into our calculation of error. As I've written here, the top bit adds to SSE: it's Y minus y hat, our predicted value of Y. The bottom bit adds to SSR, because it's the predicted value of Y, y hat, minus the mean, y bar. And you can see that the total distance to the mean splits nicely into SSE and SSR. If we add all of these together across observations, it does in fact hold that SST, all of those total distances squared and summed, equals SSR, all of the blue distances squared and summed, plus SSE, all of the red distances squared and summed. Now, there will be some smart alecks who say it doesn't look like that necessarily holds when you sum across all of the observations; will those sums really add up like that? That's actually quite an involved mathematical question which I'm not going to bog myself down in for this video, but check the description: if I've made a video on why it works, it'll be linked there. The maths is beyond this video, so let's move on, but I hope that's given you a nice visual impression of what SST, SSR and SSE are about and how they link together.

All right, now we're going to look at R squared. R squared is a very widely used calculation, and it's computed as SSR over SST: the explained sum of squares divided by the total. In other words, what proportion of the variation in ice cream sales is being explained by daily temperature? This is going to be very useful, and it ranges between 0 and 1. To generalise, R squared is the proportion of the variation in the Y variable that is explained by the variation in the X variables. How much of our ice cream sales is explained by daily temperature? That's what comes back to us as R squared.

That brings us to a comparison of two scenarios. When the data set matches up really nicely with the line of best fit, the error terms, the distances to the line, are very small, so SSE is going to be a really small value; remember, the error terms, the residuals, make up that unexplained component. If SSE is really small (I've written "low SSE" there), we have a high R squared, because the numerator is pretty much the same as the denominator, SST being the combination of SSR and SSE. So again: if SSE is very small, R squared is a very high value, close to one, since it ranges from zero to one. If the data points are further away from the line of best fit, SSE gets larger, so the fraction gets smaller and heads towards zero. In the first case you might have an R squared of 0.91, which says this line of best fit is mapping onto the data we've collected really well; an R squared of 0.36 says there's still some relationship there, but it's not as strong as the previous one.
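Continuing the same invented-data sketch from earlier, the sums of squares and R squared can be computed directly, and the SST = SSR + SSE identity checked numerically:

```python
import numpy as np

# Same made-up 10-Saturday data as in the earlier sketch.
temp = np.array([18, 21, 24, 26, 28, 30, 31, 33, 35, 38], dtype=float)
sales = np.array([40, 45, 55, 60, 70, 75, 80, 90, 95, 110], dtype=float)

b1 = np.sum((temp - temp.mean()) * (sales - sales.mean())) / np.sum((temp - temp.mean()) ** 2)
b0 = sales.mean() - b1 * temp.mean()
fitted = b0 + b1 * temp

sst = np.sum((sales - sales.mean()) ** 2)   # total sum of squares
ssr = np.sum((fitted - sales.mean()) ** 2)  # explained by the regression
sse = np.sum((sales - fitted) ** 2)         # left unexplained (residuals)

print(f"SST = {sst:.1f}, SSR + SSE = {ssr + sse:.1f}")                          # equal, up to rounding
print(f"R^2 = SSR/SST = {ssr / sst:.3f} = 1 - SSE/SST = {1 - sse / sst:.3f}")
```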
OK, let's move into the final topic for this video, which is degrees of freedom and adjusted R squared. This topic is not dealt with well by lecturers, and sometimes textbooks don't even try to give you any intuition behind the concepts, so that's what I'm going to try to rectify. To do that, I'm going to ask you a question that I think really opens up the intuition: what is the minimum number of data points you need to run a regression? Say you have a regression with one explanatory variable, one X variable, and a Y variable; it could be daily maximum temperature as X, trying to explain the number of ice cream sales. How many observations do you need to run a regression in the first place?

You might think, hey, all I need is two observations, because with two observations I can draw a line of best fit and there we go, we have a regression. Well, no, that is not a regression, because no matter where you put those two points, you can always draw a line straight through both of them. There is no possibility for error at all; irrespective of where those points go, R squared will always be one, so you never actually have a regression. A regression needs the possibility for error, and you're never going to get an error term with only two observations. It's only with a third observation that we can run a regression, because the line of best fit can now escape from the observations themselves and be drawn in between them, and that's where we say we have one degree of freedom. It's not a great regression, not at all; you'd absolutely want more observations. With four observations you get two degrees of freedom, with five observations you get three degrees of freedom, and so on; you want more and more observations. But the key point is that two observations do not make a regression.

Now let's see what happens when we extend the regression to an extra explanatory variable. Instead of just daily maximum temperature, we'll also include daily rainfall to try to explain our ice cream sales. The analogy is no longer two-dimensional: we have X1 going to the right of the page and X2 coming out of the page, so I'm trying to draw a three-dimensional space, and instead of a line of best fit we're talking about a plane of best fit. With two X variables, the geometric analogy is no longer a line, it's a plane. For example, if you have three points in three-dimensional space (look at the room around you and imagine three raindrops hovering in it), you can always put a big flat piece of cardboard, a plane, through those three points, no matter where they are. So with three points you're again in the situation where R squared is one; there's no error, because the plane cannot escape those three points. It's only with a fourth observation that you finally have one degree of freedom, because the plane you construct can now miss the points and cut between them instead of touching all four, and with five observations you have two degrees of freedom, and so on.
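A quick way to convince yourself of the "two points is not a regression" point, again with arbitrary invented numbers: with two observations the fitted line passes exactly through both, so SSE is zero and R squared is forced to 1, while a third observation finally leaves room for error.

```python
import numpy as np

def r_squared(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # straight-line (degree 1) least-squares fit
    fitted = intercept + slope * x
    sse = np.sum((y - fitted) ** 2)          # unexplained
    sst = np.sum((y - y.mean()) ** 2)        # total
    return 1 - sse / sst

# Any two points are fitted exactly, so R^2 comes out as 1: no room for error.
print(r_squared([20, 30], [50, 90]))
# A third point gives the line one degree of freedom, so R^2 can finally drop below 1.
print(r_squared([20, 25, 30], [50, 75, 90]))
```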
So what we get is this relationship between the degrees of freedom, the number of observations, and the number of X variables, which we represent with the letter k. When you add more explanatory variables, more X variables, your degrees of freedom are reduced; we saw that when we added a second X variable, we needed an extra observation just to maintain the same number of degrees of freedom. The degrees of freedom represent, in a sense, how much error the model can potentially show, and you want degrees of freedom: you want the model to be able to show error, so we can get a sense of how good the model really is. By adding more explanatory variables, the degrees of freedom are reduced, so the opportunity for error in the model is reduced. To summarise what we saw before: with one X variable we needed three observations to get a degree of freedom, but when we added an extra explanatory variable, our R squared went up to one, not because the model got any better, but because we lost the opportunity to show error. What we're going to find is that by adding extra explanatory variables, we can fool ourselves into thinking our model is getting better, when in reality the model is just losing the opportunity to show any error.

So let's see how this operates when we add extra explanatory variables. Returning to our example, we have our ice cream sales over 10 observations, and we want to explain the variation in those sales. The first thing we do is incorporate the daily temperature as our first explanatory variable, and we find that the degrees of freedom here is eight, because we have 10 observations (n = 10), we have one explanatory variable, and n minus 1 minus 1 gives us 8 degrees of freedom. Our R squared when we run this model is 0.58, which tells us that 58% of the variation in ice cream sales is being explained by temperature. Awesome; that's not especially high, but it tells us we have a bit of information. When we incorporate a second explanatory variable, rainfall, notice that the degrees of freedom goes down to seven, because we now have n minus k minus 1, which is 10 minus 2 minus 1, since k is now equal to 2; we have two explanatory variables.
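In symbols, the relationship being described is

\[
\text{degrees of freedom} = n - k - 1,
\]

so with n = 10 observations, one explanatory variable gives 10 - 1 - 1 = 8 degrees of freedom and two give 10 - 2 - 1 = 7, exactly as in the example.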
Our R squared has increased to 0.74, so we're thinking, this is great, R squared is going up, we're explaining a larger proportion of ice cream sales: 74% of the variation in ice cream sales is now being explained by our model. Looking good. So we add another variable, school holidays, and you'll notice that school holidays happen in July here in Australia. This is actually called a dummy variable; we're not going to get too far into that, but let's incorporate that information. We now have three X variables, the degrees of freedom is six, and our R squared goes up again, so we're really patting ourselves on the back: this is awesome, our model is getting better and better at explaining ice cream sales. Then we incorporate what is clearly a nonsensical variable; I've put in the moon phase, which will have no impact on ice cream sales. We've increased the value of k because we have a new variable in the model, our degrees of freedom goes down again, and yet R squared still went up.

If we summarise this with our four models, progressively adding more and more X variables, you can see that R squared continued to increase, and here's the thing about R squared: it will only ever increase when you throw more variables into the model. By throwing more variables in, irrespective of how useless they are, R squared will still increase, and we might be thinking, hey, this is still good, R squared went up a little bit, moon phase is clearly relevant, let's keep the model that includes moon phase. But look at the adjusted R squared: the adjusted R squared actually goes down. This is the formula that's used to adjust for the fact that you're losing degrees of freedom. Remember what I said before: we can get fooled into thinking our model is getting better, when what's really happening is that the model is losing the ability to show error, to find error, and the adjusted R squared reflects that. You can see that when we included moon phase, the adjusted R squared went down, and that tells us the best model we had included temperature, rain and holidays, but not moon phase. Even though R squared has that really nice interpretation, where we can say 84% of the variation in ice cream sales is being explained by the variation in temperature, rain and holidays, we do have a bit of a problem with R squared when we have a small number of observations, because a small number of observations means a low number of degrees of freedom. In that case we might need to look at the adjusted R squared value, which takes that into account. You don't really have a nice interpretation of adjusted R squared, but at least we can use it for comparison across models.
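The adjusted R squared formula itself appears on screen rather than in the captions; the standard formula is adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1). Here's a small sketch applying it to the four models, using the R squared values quoted in the video; the moon-phase model's R squared isn't quoted, so the 0.85 below is a made-up placeholder consistent with "it still went up".

```python
def adjusted_r2(r2, n, k):
    # Standard adjustment for the degrees of freedom used up by k explanatory variables.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 10
models = [
    ("temp",                          1, 0.58),
    ("temp + rain",                   2, 0.74),
    ("temp + rain + holidays",        3, 0.84),
    ("temp + rain + holidays + moon", 4, 0.85),  # R^2 placeholder -- not quoted in the video
]

for name, k, r2 in models:
    print(f"{name:32s} k={k}  df={n - k - 1}  R^2={r2:.2f}  adj R^2={adjusted_r2(r2, n, k):.2f}")
```

Running it, R squared rises with every added variable, while the adjusted R squared rises for temperature, rain and holidays and then falls when moon phase is added, matching the conclusion above.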
So that brings us to the end of the video. I hope you've really enjoyed it, and if you have, feel free to like the video and subscribe to the channel. If you do subscribe, you'll be privy to what's happening on the channel over the next little while. I know I've been a bit lazy with posting content recently; I've been busy becoming a school teacher over the last few years, so cut me some slack. But I am going to be doing some interesting content on how high school mathematics applies to the real world; it's going to be called Mountain Maths, so stick around for the channel, or rather subscribe to the channel, so you can see that coming down the pipeline. Anyway, my name is Justin, zedstatistics.com is the website, and I will catch you at the next video. Catch you around. [Music]
Info
Channel: zedstatistics
Views: 211,216
Keywords: zedstatistics, zstatistics, justin zeltzer, regression, statistical regression, regression explained well
Id: eYTumjgE2IY
Length: 40min 25sec (2425 seconds)
Published: Mon May 22 2023