What is Multicollinearity? Extensive video + simulation!

Captions
Hi team. For those of you that have been watching this series on regression, you'd know that all regression is is a way of describing the relationship between one dependent variable Y and a whole bunch of independent X variables. But what happens if those X variables are themselves related? Well, that's multicollinearity, my friends. Come with me as we explore the concept. [Music]

Okay, so here we go for a deep dive into multicollinearity. All the other videos in the series on regression are up on zstatistics.com, and indeed we've touched on multicollinearity in the regression assumptions video, but here I'm spending just a little bit more time on some of these concepts, and I've also done something pretty special: I've created a little simulation of my own to give you a flavor of how multicollinearity actually feels. Not just how to diagnose it and how to deal with it, but I'm going to give you a sense of how it actually affects a regression.

Anyway, this is the plan of attack. We're going to have a look at the intuition behind multicollinearity. We're also going to assess why we bother caring about it in the first place: is it a big issue, and what, in theory, is it doing to our regression? We'll then look at two different methods of detection, one via what might be called bivariate correlations, and a second where we look at variance inflation factors, which is a little bit more robust than the correlations method. We'll then see what happens once we've diagnosed multicollinearity in our model; what remedies are available will be assessed here. I'll then show you that simulation I referred to earlier. Basically, what I've done is I've created a data set and then infected it with multicollinearity to greater and greater degrees, and we'll see what happens to our coefficients and standard errors. And finally we'll have a look and see what happens with perfect multicollinearity. Right, so we're ready to go. Let's dig straight in.

So let's think about the following regression. Let's just say we're trying to assess a lawyer's salary as a function of their number of years' experience and also their age. Now, what a regression will try to do is tease apart the individual effects of years of experience and age on the lawyer's salary, but you're going to have to afford the regression the opportunity to tease apart those effects. Multicollinearity occurs when the X variables themselves are related, such that those individual effects become obscured. So this is the perfect example, right? The more years' experience you get as a lawyer, the older you're getting at the same time. So this regression is going to have a tough time figuring out: was it your age increasing that improved your salary, or was it the years of experience increasing that improved your salary? How can you really figure that out if these two work in lockstep?

Now, another way of looking at this is to think about how we interpret these coefficients. So remember, from my previous videos, to interpret the coefficient of, say, years' experience, that's beta 1, you'd say that's the marginal effect on salary of one additional year's experience, holding other variables constant. And similarly for beta 2, that would be the marginal effect on salary of an additional year of age, holding other variables constant. Now, is it possible to really hold the other variables constant in each of these cases? Well, Janet down here is rightly cynical, because of course you can't really hold your age constant as your years of experience are increasing. So that's a problem, and that problem is called multicollinearity.
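[Editor's note: for reference, the model being described here can be written out in the usual notation; this is just a restatement of the narration, with variable names matching the video.]

```latex
\text{Salary}_i = \beta_0 + \beta_1\,\text{Experience}_i + \beta_2\,\text{Age}_i + \varepsilon_i,
\qquad
\beta_1 = \left.\frac{\partial\,\mathrm{E}[\text{Salary}]}{\partial\,\text{Experience}}\right|_{\text{Age held constant}}
```

The trouble is that when Experience and Age move together, "holding Age constant" is not something the data ever lets the regression observe.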
Okay, so why do we care about that? What does it do to our regression results? Let's just say I've run this regression and we've got some coefficients here for experience and age. These would suggest that for an increase of one year's experience, your salary is expected to increase by $3,886 on average. Now, for age, something similar happens: for an increase of one year of age, you're expecting an increase of about two thousand dollars to your salary. Each of these coefficients has a standard error associated with it, and again, if you've seen my previous videos, you'll know how to interpret these, plus the p-value as well.

But what part of this output does multicollinearity infect? The coefficients themselves are still unbiased; in other words, they still represent our best guess, or our best estimate, as to the true values of beta 1 and beta 2 respectively. But they become quite sensitive, so maybe if you throw another variable into the mix here, these coefficients might jump wildly. But really, what happens with multicollinearity is that it inflates the variance of the affected variables. So for experience and age, if they're affected by multicollinearity, these standard errors are going to blow out, and indeed they're quite high here, such that both of the p-values are reasonably high as well. And you'd know, looking at these p-values, that neither of these two variables seems to be statistically significant. That might be surprising, right? You'd think that surely years of experience is going to affect your salary, same with age. But it's almost like these two variables are fighting for the effect on salary, and they're kind of getting in each other's way because they're moving in the same direction, so the regression is having a tough time teasing apart beta 1 and beta 2, and these standard errors get blown out. Finally, it's probably worth noting that the overall model fit is not affected; multicollinearity only really influences this red section of the regression output. The R-squared, the ANOVA and the F statistic, and also your ability to use the model for prediction, are unaffected. So I guess, in summary, all that's happening is that we become more uncertain about these coefficients, or in other words, they just have a higher variance.

Okay, so how do we detect it in a model? Well, the first thing we can do is check the correlations between all pairs of X variables. Let's just say we had four X variables: instead of just years of experience and age, say we had two other variables as well, and let's just call them x3 and x4. You can check what's called the bivariate correlations between each pair of these X variables, and of course we know that correlations go from 1 to negative 1, so you'd be hoping for something around 0 here for these correlations. For example, x1 and x2 here have a correlation of 0.91, which is quite high. But Janet's going to ask us the question: how much is too much correlation? There's a general rule of thumb that says if the correlation is greater than about 0.9, this could start being a problem. But as I say, with all of these rules of thumb you'll get in statistics, there's nothing black-and-white about it; there's nothing magical about 0.9. In fact, there are a lot of sources that suggest this is a bit too conservative, and correlations up to about 0.95 are still pretty cool. So if you're looking to me for a black-and-white letter of the law on what to do with your correlations, you're not going to get it, because it really is a question of degree.
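[Editor's note: as a concrete illustration of this first detection method, here is a minimal Python sketch. The data frame and all of its values are invented for illustration; the point is simply that you print the pairwise correlation matrix of the X variables and scan it for values near +0.9 or -0.9 and beyond.]

```python
import pandas as pd

# Hypothetical X variables (names and values are illustrative only)
X = pd.DataFrame({
    "experience": [27, 22, 15, 9, 30, 12, 18, 5, 25, 7],
    "age":        [55, 48, 41, 33, 60, 38, 45, 29, 52, 31],
    "x3":         [3.1, 2.8, 2.2, 1.9, 3.5, 2.0, 2.6, 1.5, 3.0, 1.7],
    "x4":         [10, 12, 9, 7, 14, 8, 11, 6, 13, 7],
})

# Bivariate (pairwise) correlations between every pair of X variables.
# Entries around 0 are ideal; values beyond roughly 0.9 are the usual warning sign.
print(X.corr())
```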
And as we'll see when I do my simulation a little bit later, correlations of 0.9 and even 0.95 are not fatal to a regression model. We'll see how that works shortly.

But let's have a look now at the second detection method, which is variance inflation factors. As I said, this is a little bit more robust, because unlike correlations, we're no longer looking only at bivariate relationships between the X variables; we can actually do a little bit more here. So let's again look at a theoretical regression with Y and four X variables: x1, x2, x3 and x4. The way to calculate the variance inflation factor for each of the X variables is to first create an auxiliary regression for each of them. Let's just take x1: we'll create what's called an auxiliary regression for x1, and that's where we regress x1 on the three other X variables. So Y is nowhere to be seen here; it's just x1 being our new dependent variable, regressed on the three other X variables. What this is effectively trying to figure out is how much of x1 is being explained by the three other X variables: how superfluous is it in the original model, how much of its information is already contained in those three other variables? So of course we're going to have to find the R-squared of this auxiliary model, and the variance inflation factor for x1 will be 1 divided by (1 minus that R-squared), that is, VIF_k = 1 / (1 - R-squared_k), where the subscript k just means it's the R-squared from the k-th auxiliary regression. So you'd run this regression, find the R-squared, and you could then construct the variance inflation factor, and you would do that for the three other X variables as well: you put each of them in the hot seat, in the dependent variable position, regress it against the three other X variables, and see how it goes. Clearly, the higher the R-squared, the higher the variance inflation factor will be.

But Janet's going to ask us this question again: how high is too high for the variance inflation factors? What number are we looking for? Is there a figure beyond which we're going to say, you know what, that's too high? And of course my answer is going to be the same: yes, there's a rule of thumb, and typically people will say variance inflation factors above ten are problematic. So in this case, if our auxiliary regression for x3 gives us a variance inflation factor of 12, we might say that x3 is somewhat redundant in our original model; in other words, it's being explained by the three other X variables to a considerable degree. But again, there's nothing special about the number ten here; you'd always be using your judgment.

Okay, so what do we do now that we've detected multicollinearity in our model? Option one is to do absolutely nothing about it, and believe it or not, that's typically what you're going to do. If your model is being used for prediction only, then the standard errors of the coefficients are not important; so long as the coefficients themselves are unbiased, which they still are, you can use your model for prediction. Alternatively, if the correlated variables are not of particular interest to the study question, you can still use your model for everything else. It's not as if this invalidates all the other variables in the model, right? The model's goodness of fit is still intact, so you can still rely on things like the R-squared of the model, and you can certainly look at other variables that aren't suffering from multicollinearity, no problem there.
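[Editor's note: here is a minimal sketch of the auxiliary-regression calculation of the VIF described above, using statsmodels. The function and data layout are illustrative; statsmodels also provides a ready-made variance_inflation_factor helper in statsmodels.stats.outliers_influence that does the same job.]

```python
import numpy as np
import statsmodels.api as sm

def vif(X, k):
    """VIF for column k of X: regress that column on all the other columns,
    then return 1 / (1 - R^2_k)."""
    y_aux = X[:, k]                                    # the variable in the "hot seat"
    X_aux = sm.add_constant(np.delete(X, k, axis=1))   # the remaining X variables
    r2 = sm.OLS(y_aux, X_aux).fit().rsquared
    return 1.0 / (1.0 - r2)

# Illustrative use, assuming X is an (n x 4) array with columns x1..x4:
# for k in range(X.shape[1]):
#     print(f"VIF for column {k}: {vif(X, k):.1f}")   # values above ~10 are the usual warning sign
```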
Also, if your correlation is not super extreme, you can still just do nothing about it and be reasonably okay, and we'll have a look and see how extreme the correlation needs to be before we're forced to do something about it.

Option two is to remove one of the correlated variables. Here you'd assess the actual variables and figure out whether they're providing the same information. Now, arguably, age and years of experience are providing subtly different pieces of information, right? But certainly there are some instances where two variables are giving you the same information, just in slightly different ways, and in those cases you would want to remove one of those variables. The problem with option two is that in trying to solve multicollinearity, you might actually be creating another problem, which is called omitted variable bias, because all of a sudden you've got a variable outside the model that might still be pulling the strings of the remaining variable you've left in. So yeah, again, this is making option one look a little bit more appealing.

Still, there are two other options as well. Option three: you can combine the correlated variables, and this one's actually pretty good. For example, in our years-of-experience-and-age scenario where we're looking at lawyers' salaries, we could create something called a seniority score, or something like that, where we combine experience and age algorithmically, and that way both of those pieces of information are included in one variable that no longer suffers from multicollinearity. And the final option, which I have never used in practice, ever, is to use partial least squares or what's called principal components analysis. Now, I did find a good video on principal components analysis; I haven't done one myself, but the guy does it pretty well. It's long, but it's quite useful, so if you're a sucker for punishment you can have a look at that, and I'll put the link for that video in the description. But really, as we'll find, that's probably only a last resort in very rare scenarios.

Okay, so here we get to my favorite bit, unsurprisingly, because it's my simulation, and as I said in the intro, this is where we're going to get a feel for what multicollinearity does to our regression output. What I've done is I've concocted a data set myself of law firm salaries. Here we've got ten employees, I guess you'd call them partners, at a law firm, and here are their salaries. These are Australian dollars, peeps, but still, they're on good coin. And here are their years of experience. So Frank's earning the big bucks, 250 grand per year, but he's been at his job for 27 years; Dennis is earning 180 grand with 22 years of experience; et cetera, et cetera. I created this data set myself, and it only has 10 observations in it, but that's okay. I'm now going to run a regression that assesses the effect of experience on salary, and we would hope that the regression output would tell us there's a positive relationship between experience and salary. And indeed, when we run the regression, you can see that the coefficient is 6,014.8 with a standard error of 519. Let's highlight that: the coefficient of roughly 6,015 indicates that for an additional year's experience, one would expect the salary to go up by about $6,000. Not bad. And the standard error, 519, is quite small in comparison, giving us a p-value that is less than 0.0001. So a very, very small p-value, which is unsurprising, because quite clearly experience is affecting the person's salary.
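[Editor's note: to make the setup concrete, here is a minimal Python sketch of that first regression. Apart from Frank (250k, 27 years) and Dennis (180k, 22 years), the values below are invented stand-ins, since the actual ten-row dataset lives in the downloadable spreadsheet rather than the transcript, so the exact coefficient and standard error will not match the video's 6,015 and 519; the shape of the output is the same.]

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in for the ten-partner dataset described in the video
experience = np.array([27, 22, 15, 9, 30, 12, 18, 5, 25, 7], dtype=float)
salary = np.array([250_000, 180_000, 140_000, 95_000, 260_000,
                   120_000, 155_000, 80_000, 215_000, 90_000], dtype=float)

# Simple regression: salary on experience (plus an intercept)
fit = sm.OLS(salary, sm.add_constant(experience)).fit()
print(fit.params)    # intercept and the coefficient on experience
print(fit.bse)       # standard errors
print(fit.pvalues)   # p-values
```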
Okay, now what's going to happen here is I'm going to add another X variable. So our new regression equation might look a little bit like this, where we've got experience again, plus another X variable I've thrown in. It doesn't so much matter what this represents, but what I'm going to do in my simulation is simulate a new set of ten values for this X variable that is highly correlated with the ten values from the experience variable, and then we're going to see how these figures change. Will the coefficient of experience change now that there's a new correlated X variable in town? Will the standard error of the experience variable change? And is it possible that this very, very small p-value can blow out such that experience is no longer even significantly impacting salary at all? Of course, I'm going to be testing numerous strengths of association between X and experience, and I've tested where rho, which is the correlation, is 0.8, 0.9, 0.95 and 0.99. So I've run numerous simulations under each of these four conditions and then assessed how the coefficient and standard error of experience have changed. Juicy, right? Let's find out what happened.

So there's the equation again, with experience, and there's X, our variable that's correlated with experience. Now, this was the original condition where X didn't exist: when there was no additional variable, the coefficient of experience was around 6,015, if you recall, and the standard error, which is in black here, was 519. When I add that X variable with a correlation of 0.8, not much changes: the coefficient stays about the same, and the standard error, and don't forget this is the standard error of the coefficient of experience on salary, only increases a little bit. When I ramp the correlation between these two variables up to 0.9, this coefficient, beta 1, still stays about the same at around 6,000; there's a bit of fluctuation, because obviously there's some randomness in the samples I'm pulling, but the standard error increases a little bit further. This is still quite a small standard error, relatively speaking; it's only about a quarter of the coefficient, which still implies a fairly low p-value, suggesting that experience is still highly related to salary in this condition. When we get to 0.95 for the correlation between X and experience, the standard error is starting to creep up, and this is becoming of borderline significance; I think the p-value at this point is around 0.02. And finally, when we ramp the correlation up to 0.99, so that X and experience are very highly correlated, you can see the real ramp-up in the standard error.

So this is empirically showing what I was telling you at the beginning of this video: the coefficients are still unbiased in the presence of multicollinearity. This beta 1 character didn't change very much, a little bit of fluctuation, but certainly no structural change as the correlation increased; yet the standard error did ramp up significantly. And what actually happened at the end is that the p-value is about 0.2, meaning that experience is no longer significantly affecting salary at this point. So clearly, X and experience are fighting for significance in this model, such that the model can no longer tease apart the effect of experience on salary.
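[Editor's note: the original simulation was built in an Excel spreadsheet linked from the video. As a rough sketch of the same idea, reusing the hypothetical experience and salary arrays from the earlier snippet, you could generate an extra regressor with a chosen correlation rho to experience and watch what happens to the standard error on experience. The numbers it prints will not reproduce the video's figures; it only illustrates the mechanism.]

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Hypothetical stand-in data (same as the earlier sketch)
experience = np.array([27, 22, 15, 9, 30, 12, 18, 5, 25, 7], dtype=float)
salary = np.array([250_000, 180_000, 140_000, 95_000, 260_000,
                   120_000, 155_000, 80_000, 215_000, 90_000], dtype=float)

def fit_with_correlated_regressor(rho):
    """Add a regressor correlated (approximately rho) with experience and
    return the coefficient, standard error and p-value on experience."""
    z = (experience - experience.mean()) / experience.std()
    x_new = rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(len(z))
    X = sm.add_constant(np.column_stack([experience, x_new]))
    fit = sm.OLS(salary, X).fit()
    return fit.params[1], fit.bse[1], fit.pvalues[1]

for rho in (0.8, 0.9, 0.95, 0.99):
    coef, se, p = fit_with_correlated_regressor(rho)
    print(f"rho={rho}: coef={coef:.0f}  se={se:.0f}  p={p:.3f}")
```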
Now, I'll include the spreadsheet I used to conduct this simulation. I did it in Excel, for whatever reason; if someone wants to have a go at doing it using proper statistical software, feel free, and send it through to me. I'm happy to give you a little props in a video if you've done that. But just so you know, I haven't concocted this out of thin air; you can download my handiwork. I think that's pretty cool, and it gives you some visual impression of how multicollinearity actually affects your coefficients. The other thing that's quite surprising is just how much correlation you need for things to become problematic; even at 0.9 there's not that much of an issue. Now, of course, experience was very highly significant in explaining salary in the initial stage, so maybe that jump in standard error would still be a significant jump in other settings, but when I saw this I was definitely surprised at how little an issue these mildly high levels of correlation really were. Anyway, I thought that was interesting.

Okay, so a final point to note about perfect multicollinearity. You may have been looking at that plot on the last slide thinking, ooh, what happens if rho goes all the way up to one itself? That would mean the X variables are perfectly correlated. Well, if you try this using any statistical software, it gives you an error message that says "near singular matrix", which is basically software-speak for "computer says no". As soon as you have two variables in a model that are perfectly collinear, the whole model breaks down and you cannot get any regression output. Think about it: the model is trying to tease apart those individual effects, and you haven't given it any leverage to assess those individual effects.

Okay, so that's all well and good, but are there real-world examples where this actually happens? And yes, there are. Let's zoom in. Say I'm trying to assess the amount of energy burned by an Olympic swimmer when they're doing their swimming training. I can construct a model like this, where the amount of energy being burned, presumably in kilojoules, is a function of, say, I don't know, whatever they had for breakfast, how much sleep they had the night before, blah blah blah. You'd also want to put in the total distance covered in their training session, that's quite important, and maybe also the number of laps that they've done. But whoa, that's no good, because the distance is obviously going to be the laps times 50: you can get from the number of laps to the distance covered using that straight formula. So there is no extra information being provided by the number of laps when you've already provided the total distance covered. This is one of those scenarios where you would need to remove one of those two variables, and comfortably so, because they're providing the same information, just in different units, right? Meters or laps.

What about this one? You're assessing the water pressure on a diver as a function of the meters they are from the surface of the ocean and also the meters they are from the ocean floor. Hmm, now is that a problem? Well, again, you can get from one to the other: the meters from the ocean floor is just the total depth of the ocean minus the meters they are from the surface. So again, these two variables would exhibit perfect correlation, such that the regression suffers from perfect multicollinearity.
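[Editor's note: as a small illustration of why the software gives up, here is a sketch using the swimmer example with made-up numbers, assuming a 50-meter pool so that distance is exactly laps times 50.]

```python
import numpy as np

laps = np.array([40.0, 60.0, 80.0, 100.0, 55.0])
distance = 50.0 * laps   # a 50 m pool, so distance is exactly laps * 50

# Design matrix with an intercept, laps, and distance
X = np.column_stack([np.ones_like(laps), laps, distance])

# One column is an exact linear combination of the others, so the matrix has
# rank 2 rather than 3 and X'X is (numerically) singular. That is what the
# "near singular matrix" error from regression packages is telling you.
print(np.linalg.matrix_rank(X))   # 2
print(np.linalg.cond(X.T @ X))    # an enormous condition number

# Dropping either laps or distance restores full rank and the regression runs again.
```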
And finally, we might get one that looks a bit like this. We've got, say, sales as a function of a few things; here, with coefficient beta 5, our variable is the amount we've spent on advertising, but we also have what are called dummy variables for whether we're in quarter one, quarter two, quarter three or quarter four, quarter four being, say, the Christmas quarter. So if you're in quarter four, you would expect sales to be higher. But if these are indeed dummy variables, 1 if you're in quarter one and 0 otherwise, 1 if you're in quarter two and 0 otherwise, et cetera, then quarter four, q4, can be expressed as a linear combination of the other three: q4 is equal to 1 minus q1 minus q2 minus q3, because if you're in one of the other quarters, q4 must be zero. This is actually called the dummy variable trap. Remember how you're always told with dummy variables that you have to remove one of the categories and let that be your base case? Well, the reason is that if you include all of the categories, if you exhaust all of the possible categories as dummy variables, that final one will be causing perfect multicollinearity, and your regression shuts down.

So there you have it, team: multicollinearity. Pretty deep, right? Well, you asked for it. If you liked the video, feel free to subscribe. I've got heaps of others on zstatistics.com, on regression but also on a whole bunch of other things as well. I also do a podcast, and if you've got any suggestions at all, feel free to send through an email on the website or just make a comment on the video, and I shall reply. See you around. [Music]
Info
Channel: zedstatistics
Views: 87,669
Keywords: multicollinearity, what is multicollinearity?, regression, zedstatistics, zstatistics, justin zeltzer
Id: Cba9LJ9lS8s
Length: 27min 2sec (1622 seconds)
Published: Thu Jan 24 2019