Stats 35 Multiple Regression

Captions
Hey folks, in today's video we're going to take a look at multiple linear regression. Multiple regression is a lot like simple regression, except that instead of using a single predictor variable to make a prediction about a dependent variable, you use multiple predictor variables to make one prediction about a single dependent variable. In simple regression you have one X and one Y; in multiple regression you have several X's and still one Y, one dependent variable.

Most of the same concepts from simple linear regression apply to multiple regression. In fact, multiple regression is still creating a best-fit line through the data, only now the data is multi-dimensional. You still use the least-squares method, finding the smallest sum of the squared residuals, meaning the differences between the data points and the predicted line, but in multiple regression it's more complicated.

So let's take a look at what an equation for multiple regression might look like. The equation here (I really should have put a "..." in it, since you can have any number of predictors) is a multiple regression equation with three predictor variables: X1, X2, and X3. You should recognize this; it looks similar to the equation you get for simple linear regression, except with three predictor variables instead of one. Here a is the y-intercept, and there is a coefficient for the X1 predictor, a coefficient for the X2 predictor, and a different coefficient for the X3 predictor. There is also still an error term, which is to say there is always some amount of uncertainty in a regression equation; it is never entirely accurate. The same error term is present in simple linear regression.

So what does each of these coefficients say? Obviously a is the y-intercept, but B1 says that for a single-unit change in X1, Y will change by B1. That's exactly the same interpretation you have in simple linear regression, and the same goes for the others: for a single-unit change in X3, Y changes by B3. You can almost think of these as three separate slopes; that may not be the best way to think about it, but what would be the slope in simple linear regression applies to every predictor variable in multiple regression.
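Written out, the three-predictor equation described here is:

$$ y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \varepsilon $$

where a is the y-intercept, each b is the coefficient on its predictor, and ε is the error term.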
The possible objectives of multiple regression include, first, coming up with an entire predictive model for your dependent variable: you want to know how Y is affected by all of these factors. Another very common reason to run multiple regression is to determine the relationship between one or two predictor variables and the Y variable, while throwing in a number of other predictor variables to control for their effects. In other words, if what you're really interested in is one or two predictors, you have to factor in the effects of all the other predictors to get the true effect of the ones you care about. Maybe we really want to know the relationship between X1 and Y, but we include X2 and X3 because we want to control for their effects and get a more accurate coefficient for X1 in predicting Y. Let's look at an example to illustrate this.

Say we want to know the relationship between the size of a house and its eventual sale price. Maybe we're a real estate company and we want to be able to predict what a house will sell for according to its square footage. Why wouldn't we just run a simple linear regression where the predictor variable is square footage and the dependent variable is the sale price? We could, and that would give us some relationship, but we can get a more accurate understanding of the relationship if we factor in additional variables in order to control for their effects. We could add factors such as the median household income in the neighborhood, the age of the home, the size of the lot, and the quality of the local schools. Throwing all of those into our multiple regression will give us a more accurate coefficient for square footage and its effect on the sale price.

Let me expand on one of these, the quality of the local schools, to explain the effect this can have. Imagine that all of our data comes from two neighborhoods, and one of those neighborhoods tends to have pretty small homes in terms of square footage, but it has the best local schools, so a lot of people want to move there. If you don't factor in the quality of the local schools, your results for the predictive power of square footage on sale price will look a little strange, because you'll see a lot of small houses selling for relatively more. Once you factor in school quality, that predictor variable is accounted for, and your coefficient for square footage will change to reflect this.

What I mean by that is: when you run multiple regression, you get a coefficient for each predictor variable that has taken into account all the other predictors you've added. Suppose you started by running an equation with two predictor variables, say square footage and median neighborhood income. If you rerun it on the same data but add in the age of the home, the coefficients for square footage and median neighborhood income will both change. If you then add in the size of the lot, the three other coefficients will change again. Each coefficient is affected by all the other predictor variables in the model, and as long as they're all good variables, you'll get more accurate results for the relationship between each one and the dependent variable. The sketch below illustrates the effect.
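Here is a minimal sketch of that effect; the variable names and synthetic data are stand-ins, not the video's dataset. School quality drives both smaller homes and higher prices, so omitting it distorts the square-footage coefficient:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# School quality pushes prices up but correlates with smaller homes,
# mimicking the "small homes, great schools" neighborhood story.
school_quality = rng.normal(0, 1, n)
sqft = 1800 - 300 * school_quality + rng.normal(0, 200, n)
price = 100 * sqft + 40000 * school_quality + rng.normal(0, 20000, n)

# Model 1: square footage alone (omitted-variable bias).
m1 = sm.OLS(price, sm.add_constant(sqft)).fit()

# Model 2: square footage plus the school-quality control.
X2 = sm.add_constant(np.column_stack([sqft, school_quality]))
m2 = sm.OLS(price, X2).fit()

print("sqft coefficient, no control:  ", round(m1.params[1], 1))
print("sqft coefficient, with control:", round(m2.params[1], 1))
```

Without the control, the square-footage coefficient comes out far below the true value of 100 used to generate the data; adding the control recovers it.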
Picking the right independent variables, the right predictor variables, to run in your multiple regression is really important; this is pretty much a science in itself. The bottom line is that you want to pick predictor variables that make sense. Another way to say the same thing: don't pick variables that are stupid. For instance, if you want to measure the efficiency of a manufacturing plant, don't use the color of the plant's exterior paint as a factor. If you want to know how someone is going to vote, it doesn't make sense to include how long it's been since they last got a haircut.

This is not just common sense; it can really mess up your regression, because sometimes a dumb predictor variable will actually come out as significant in your regression equation. Sometimes this happens because another factor influences both the predictor variable you ran and the dependent variable. For instance, if you ran how long it's been since someone last got a haircut against how they're going to vote, there might actually be a significant relationship, because factors that do make sense for predicting a vote, say age, gender, or racial or cultural background, might affect both how someone votes and how long it's been since their last haircut. So you want to be careful about throwing in variables that don't make sense and then declaring them significant. Also, if you run enough stupid factors at, say, 95% confidence for each predictor (based on the tests we looked at in the previous video), about one out of twenty will come out as significant even when it really isn't; that's just Type I error. The bottom line: don't pick stupid variables; pick variables that make sense.

Another thing to consider, and this is a lot more subtle, is choosing variables that don't have redundancy in their predictive power. This is a concept called multicollinearity. It means you might have two variables, both of which make sense in that they both have an effect on the dependent variable, but they're having basically the same effect. The trouble is that it then becomes hard to know which of them has more of the effect, or whether either has a significant effect. Picture everything we could know about the dependent variable Y as a region: one factor explains one chunk of it and another factor explains another chunk, but if the predictive power of X1 and the predictive power of X2 largely overlap, it's hard to separate them out. So generally you want to avoid picking variables that are essentially redundant.

Here's an example. Say we want to know the factors that affect the teaching effectiveness of K-12 teachers, and we include a predictor variable for the highest degree the teacher has earned, but we also include a variable for the number of years of schooling they've had. Obviously those will have pretty much the same effect; they're almost the same thing stated in different ways. Not exactly, but you can bet there's going to be a huge multicollinearity effect: a big overlapping effect. Even though either one of them might be a significant factor on its own, when you run the multiple regression you might get a fairly large p-value for each of them, and looking at those results you might conclude that neither is significant, just because they overlap so much. A common diagnostic for this kind of redundancy is sketched below.
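The video doesn't name a diagnostic, but a standard one is the variance inflation factor (VIF). A minimal sketch with made-up teacher variables, assuming statsmodels is available:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 200
degree_level = rng.normal(0, 1, n)
years_school = 0.95 * degree_level + rng.normal(0, 0.2, n)  # nearly redundant
experience = rng.normal(0, 1, n)                            # independent

X = sm.add_constant(pd.DataFrame({
    "degree_level": degree_level,
    "years_school": years_school,
    "experience": experience,
}))
for i, name in enumerate(X.columns):
    if name != "const":
        print(f"VIF {name}: {variance_inflation_factor(X.values, i):.1f}")
# A common rule of thumb: VIF above roughly 5-10 signals troublesome overlap.
```

Here degree_level and years_school produce very large VIFs, while the independent experience variable stays near 1.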
A more subtle example: go back to our housing prices example. Square footage, remember, is our key predictor variable, the one we really care about, but suppose we also throw in a factor for the number of bathrooms the house has. Presumably square footage and the number of bathrooms are going to be highly correlated with each other, and that's really what multicollinearity is: two X variables that are highly correlated with each other, so that their effects on the dependent variable are largely the same. You want to watch out for that and try not to run variables that have a high correlation with each other. The bottom line is that you're looking for a good predictive model that is informative but also practical: you can look at the results, see the coefficients for your X variables, and actually use them in a way that's actionable. That all comes into picking the right independent variables.

Okay, let's take a look at an actual multiple regression output. In our videos on simple linear regression we used the example of a basketball team, with the team's shooting percentage in a game as the predictor variable and their points scored as the dependent variable; we wanted to figure out the relationship between shooting percentage and points scored. We can continue to use this example, but I'm going to add in some additional factors. What if we also wanted to know the effect of turnovers, the number of turnovers a team has, on the number of points scored? We already saw that there was a positive coefficient for shooting percentage, which obviously makes sense: the better a team shoots, the more points it scores. We would expect a negative relationship with turnovers: the more turnovers a team has, the lower its points scored, because it's turning the ball over and not getting as many shots. And let's add in one more factor: where they played the game. Basketball teams generally tend to do a little better on their home court, with their own fans cheering for them, and a little worse away. So we might expect a team to score more points at home and fewer points away, and we also have a neutral site, which should be somewhere in between.

Let's use these in our multiple regression. But can we actually just use everything listed here and run a multiple regression? Turnovers and shooting percentage work, but site doesn't: it doesn't have numerical values. So can we use it? The answer is yes, but obviously not in its current form; we have to assign numerical values to the site factor. What are our possibilities? Could we just say neutral is 1, home is 2, and away is 3? Does that make sense? No, I don't think so.
First of all, it isn't good to rank the sites like that, because ranking makes a big assumption. Maybe that assumption makes sense based on what we know about basketball, but it's not good scientific practice. Apart from that, what we'd really be doing is assigning weights to each of these sites, and that's no good either: is home three times as good as away? Who knows? We can't really weight these.

What's much safer is to assign a separate variable for each of these possibilities. In other words, we assign a binary variable for home: the team is either home or not home. It gets a value of 1 if the team played that game at home and 0 if they played it at a neutral or away site. We add another variable for away, which gets a 1 if they're away and a 0 if they're at a neutral or home site.

Now, the trick here is that we don't actually assign a variable for all three possibilities; we only assign variables for two of them. That might not make intuitive sense, so let me explain. We need a default, something that everything else is compared to. So I'll create a binary variable for home and a binary variable for away, and leave neutral alone. In other words, if both home and away are 0, the site was neutral, and the coefficients I get for home and away will be interpreted as if relative to a neutral site. I would expect a positive coefficient for home, meaning the team plays better at home than at a neutral site, and presumably a negative coefficient for away, meaning they score fewer points as the visitors than at a neutral site. That's my hypothesis, my prediction. But we have to have something we're comparing the other dummy variables to, so the default will be neutral, and the coefficients for home and away will be relative to a neutral site. We need that baseline.

The choice of baseline is totally arbitrary. I could have made home the default and assigned binary variables for away and for neutral, and I would expect the coefficients on both of those to be negative, because they'd be measured relative to playing at home, where we'd expect the team to score the most points.

This process of turning non-numerical values into separate binary variables, each assigned a 1 or a 0, creates what are called dummy variables. Here's how it looks if I make neutral the default that home and away are compared to: every time the site is neutral, the two new dummy variables, home and away, are both 0. If the team is home, home gets a 1 and away gets a 0; if the team is away, away gets a 1 and home gets a 0. You'll never see home and away both equal to 1, because that would imply the team played both home and away in the same game, which is impossible.
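In code, this encoding is a couple of lines with pandas. A small sketch; the column name "site" is an assumption, since the video builds this table by hand in a spreadsheet:

```python
import pandas as pd

games = pd.DataFrame({"site": ["home", "away", "neutral", "home", "neutral"]})

# One binary column per category...
dummies = pd.get_dummies(games["site"], dtype=int)

# ...then drop the baseline, so home and away are read relative to neutral.
dummies = dummies.drop(columns="neutral")
print(dummies)
#    away  home
# 0     0     1
# 1     1     0
# 2     0     0
# 3     0     1
# 4     0     0
```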
So that's how we create the dummy variables, and now we have all numerical values: we're ready to run this multiple regression. Thanks for bearing with me; we'll actually revisit dummy variables in a bit more detail in a later video, so if that didn't make perfect sense, hopefully it will in round two.

All right, let's go ahead and run this regression. There are many ways you can run a regression, including several ways in Excel, but I use the Data Analysis add-on: go to Data, click Data Analysis, and choose Regression. My Y input range is points scored; that's the dependent variable. My X input range is all of the predictor columns; I don't have to select each column separately, I can just grab them all. I check the box for Labels, meaning my top row contains labels rather than numerical values, set a 95% confidence level, and run it.

Let me zoom in on the multiple regression output. A lot of this should look very similar to what you've seen with simple linear regression, especially the summary output. First of all, we have the number of observations. Then we have the multiple R, which is the coefficient of multiple correlation. It's kind of like the correlation you get in simple linear regression, where you compare one X and one Y and ask what the correlation is between those two variables. A correlation between two variables lies between -1 and 1, depending on whether it's a positive correlation (when one goes up, the other tends to go up) or a negative correlation (when one goes up, the other tends to go down, and vice versa). In multiple regression, the multiple R is always between 0 and 1, because the model is more complex and there's never a single purely positive or negative relationship between all of your predictor variables and the dependent variable. But between 0 and 1, the multiple R tells you something like the strength of the relationship between all of your predictor variables and your single dependent variable. A statistician would probably disagree with calling it the "strength of the relationship," but it's a simple way to think about it.

The R-squared is the same concept as in simple linear regression: it is the proportion (since we're dealing with decimals) of the variance of the dependent variable, the Y variable, that is explained by all of the predictor variables you've included in the multiple regression. It was the same with simple linear regression, except there it was the proportion of the variance of the dependent variable explained by your one X variable; here it's all of your X variables. So again, it's something like the strength of the predictive model: if R-squared is high, you're predicting a lot about the variance of the Y variable using your X values. How to interpret the strength of that predictive power is really context dependent, but that's the simplified version, and I think it's what we want here.
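The video runs this in Excel. As a rough equivalent, here is a hedged statsmodels sketch on randomly generated stand-in data; the coefficients used to generate it loosely echo the ones quoted later and are not the video's actual numbers. No real site effect is built in, which echoes what the video ends up finding. The printed summary contains the same pieces discussed here: observations, R-squared, adjusted R-squared, coefficients, p-values, and confidence intervals.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 40
df = pd.DataFrame({
    "turnovers": rng.integers(8, 22, n),
    "shooting_pct": rng.uniform(38, 55, n),
    "home": rng.integers(0, 2, n),
})
# away is 1 only when the game is not at home (never both 1).
df["away"] = np.where(df["home"] == 0, rng.integers(0, 2, n), 0)
# Points depend on turnovers and shooting only; site has no built-in effect.
df["points"] = (12 - 0.96 * df["turnovers"] + 2.25 * df["shooting_pct"]
                + rng.normal(0, 4, n))

model = smf.ols("points ~ turnovers + shooting_pct + home + away", data=df).fit()
print(model.summary())  # R-squared, adj. R-squared, coefficients, p-values
```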
Now, adjusted R-squared is something new, and it's actually probably a better measure of the strength of the predictive power of your regression model, because it factors in the number of predictor variables you've included. Here's what happens with plain R-squared: it always goes up when you include new predictor variables. Remember I said that one of the really important things in multiple regression is to choose predictor variables that make sense, not stupid ones; you don't want to throw just anything in there because it might come out as significant. But anything you add will always increase R-squared. R-squared never goes down; it always goes up when you add new predictor variables, no matter how inane those variables may be and no matter how little they actually contribute to the predictive relationship with your dependent variable. Adjusted R-squared, on the other hand, factors in the number of predictor variables you've chosen to include. So even though R-squared will always go up no matter how stupid the variable you threw in, adjusted R-squared will probably go down if that variable isn't really adding much to the relationship, because it is adjusted for the number of variables included.
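For reference, the standard adjustment, with n observations and k predictor variables, is:

$$ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} $$

The ratio (n-1)/(n-k-1) grows with k, which is what penalizes a model for carrying extra predictors.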
So what you're really trying to do when picking variables for your multiple regression is to get the highest adjusted R-squared. As I said, R-squared will always go up, so you could just keep throwing junk in and R-squared would rise, but adjusted R-squared may go down when you add a predictor variable that doesn't add much to your equation. Striving for a high adjusted R-squared is a good way to gauge your success in choosing predictor variables, because multiple regression is often a very iterative process: you might start with some predictor variables, find that some aren't significant and take them out, then throw in others that you think might improve the predictive power of the model, aiming for a higher adjusted R-squared with each iteration.

Now let's look at the actual X variables, the predictor variables, and their coefficients and significance. According to this model, if turnovers increase by one, we'd expect the team's points scored to go down by about 0.96. How do we interpret the coefficient for a dummy variable? The home coefficient says that at home, we'd expect to score about 0.937 points more than at a neutral site. Interestingly, and this is not what I would have expected, the away coefficient says that away, the team would score even more: over a full point more than at a neutral site. And for shooting percentage: as shooting percentage goes up by one point, we'd expect about 2.25 additional points scored. So that's the coefficients and how to interpret them.

Now let's look at the significance of each of these factors, because remember, we want to make sure they actually have a significant relationship at, say, 95 percent confidence. The way we do that is exactly the same as in simple linear regression. We can look at the p-values first. Shooting percentage is still extremely significant: if we want to be 95 percent sure that a factor is significant in terms of its predictive power with regard to the dependent variable, the p-value should be less than 0.05, and shooting percentage is clearly below that, as is turnovers. But when we get to the dummy variables we included, home and away, those don't look significant at all; based on these p-values, we can't say they actually have any effect on the dependent variable. The other way to think about significance is that you want to be, say, 95% confident that the real coefficient is not zero, so zero should be nowhere within the 95% confidence range. Zero isn't in the range for shooting percentage or turnovers, but it's almost smack-dab in the middle of the range for home and away.

So what do we do with this information? As I said, multiple regression is an iterative process. You would probably want to take out these variables; you might choose to substitute in another variable, like the number of days of rest for the team, something to that effect. But I'm going to just run this regression again after removing them and see what happens. Same Y range, but now a different X range: I'm only including turnovers and shooting percentage. Note that this is still a multiple regression, since we still have more than one X variable: two predictor variables, turnovers and shooting percentage, plus the y-intercept, our constant. Home and away didn't look like they had much of an effect at all, and we wouldn't expect much multicollinearity between home/away and turnovers or shooting percentage, so we'd expect the coefficients to change only slightly. Look at turnovers: in our new model it's -0.98, while with home and away factored in it was -0.96. Not much of a change, but you can see that coefficients change, as do p-values; the p-value has certainly changed a bit as well. This just reinforces the point that every factor you choose affects all the others.

Now let's look up at the R-squared. When we had all four variables, the R-squared was 0.892, and we'd expect that to go down once we remove factors, because R-squared always goes up when you add factors in.
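A minimal sketch of this iteration step, again on stand-in data rather than the video's dataset (it repeats the data setup from the earlier sketch so it runs on its own): fit the full model, drop the insignificant dummies, and compare.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 40
df = pd.DataFrame({
    "turnovers": rng.integers(8, 22, n),
    "shooting_pct": rng.uniform(38, 55, n),
    "home": rng.integers(0, 2, n),
})
df["away"] = np.where(df["home"] == 0, rng.integers(0, 2, n), 0)
df["points"] = (12 - 0.96 * df["turnovers"] + 2.25 * df["shooting_pct"]
                + rng.normal(0, 4, n))

full = smf.ols("points ~ turnovers + shooting_pct + home + away", data=df).fit()
reduced = smf.ols("points ~ turnovers + shooting_pct", data=df).fit()
print(f"full:    R2={full.rsquared:.3f}  adj R2={full.rsquared_adj:.3f}")
print(f"reduced: R2={reduced.rsquared:.3f}  adj R2={reduced.rsquared_adj:.3f}")
# R2 can only fall when variables are removed; if adjusted R2 rises,
# the smaller model is the better pick, which is the video's conclusion.
```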
And indeed, it's now just 0.89; we lost about 0.002. But we'd expect the adjusted R-squared to go up: we started with an adjusted R-squared of 0.863 with all four variables, and now it's 0.877. So the adjusted R-squared actually increased when we removed those factors, and we'd say we arguably have a better predictive model after dropping those two variables. Like I said, it's an iterative process; we might now go back, add in some additional variables, and see what results we get and how the adjusted R-squared is affected.

All right, to wrap this up, let's make a prediction for points scored using our new multiple regression equation. Keep in mind we've dropped the dummy variables for home and away; we no longer care about the site, because we weren't able to determine a significant relationship with points scored. So now we have just our two independent variables in the multiple regression. Let's say the team has 15 turnovers in the game and a shooting percentage of 49 percent, and we'll make a prediction for how many points they score. The prediction equals the intercept, plus the coefficient for turnovers times the number of turnovers, plus the coefficient for shooting percentage times the shooting percentage, which was 49. With 15 turnovers and 49 percent shooting, we'd expect this team to score about 107 points in that game.
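As arithmetic, using the coefficients quoted in the transcript (about -0.98 for turnovers and 2.25 for shooting percentage; the intercept is never read aloud, but a value near 11.5 is implied by the roughly-107-point result):

$$ \hat{y} \approx 11.5 - 0.98(15) + 2.25(49) = 11.5 - 14.7 + 110.25 \approx 107 $$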
Info
Channel: George Ingersoll
Views: 272,771
Rating: 4.7437654 out of 5
Keywords: Statistics (Field of Study), Statistics for Business, Linear Regression, Multiple Regression, Normal Distribution (Literature Subject), Confidence Interval (Literature Subject), Multicollinearity, R-Squared
Id: AkBjJ6OunR4
Length: 32min 23sec (1943 seconds)
Published: Sat Jan 04 2014