Econometrics: Control Variables

Video Statistics and Information

Captions
Hello, I'm back. The scenery has changed, I've got a pandemic haircut, and we're going to talk about controls: what controls are and what they do for us.

We've moved on from having a single-variable regression to being able to have a multivariate regression, meaning that there's more than one variable on the right-hand side. Now, we're typically only interested in the effect of one of those variables: we're trying to identify the effect of one single variable on our outcome. So what are all the other variables doing there? We call them controls, because they need to be there to help solve endogeneity. We want to avoid identification issues, because if we just look at the relationship between y and x by itself, we might get results that are not representative of the causal effect of x on y. When we get an estimate that might be an accurate statistical estimate but does not reflect the underlying theoretical causal effect that we are interested in, we call that identification error. We can add controls to help get us to identification.

So what are controls going to do for us? First of all, a recap of what endogeneity is. We're going to assume that our true model looks a little bit like this, where we have y as an outcome variable and there's a linear relationship between y and x. By linear relationship I mean we are drawing a straight line: there's an intercept to that line, beta 0, and there's a slope to that line, beta 1.
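The slide showing the model is not reproduced in the captions. Going by the spoken description (a straight line with intercept beta 0, slope beta 1, and an error term epsilon), it is presumably the standard simple linear regression, written out here as an assumption rather than a copy of the slide:

```latex
y = \beta_0 + \beta_1 x + \varepsilon
```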
Now, of course, there's not a perfect prediction. Very few relationships are perfectly predicted by a straight line like this, and so we have that error term, epsilon, there. What that is saying, not statistically but in truth, is that there's some other stuff in y that's not being explained by x. Everything that explains y that is not x is in that error term. So if x is related to any of the other things that explain y, we're going to have an endogeneity problem. We're going to have an identification issue, because OLS is going to mistake the impact of those other variables for the impact of x.

Let's give a quick example. Let's say we're trying to explain why people eat ice cream when they do, and one thing we've noticed is that ice cream eating tends to be related to when people wear shorts: people wear more shorts on days when people eat more ice cream. However, if we just look at this relationship by itself, we will be attributing the actual effect of a different variable to shorts-wearing, and that variable is temperature. Temperature here is in the error term because it explains ice cream eating (when it's hot, people eat more ice cream; as I can tell you, right now it's very hot and we have a freezer full of ice cream) and it's not in our model. If it explains the outcome variable and it is not in the model, it is in the error term.

Now, temperature is also related to shorts-wearing: people tend to wear more shorts on days when it is hot. And OLS doesn't know that shorts-wearing doesn't actually cause ice cream eating. It can just see the data; it doesn't know how things work. So when it sees in the data that ice cream eating and shorts-wearing are related to each other, it will say, "Ah, this is a positive relationship. Looks like shorts-wearing causes people to eat ice cream." But actually it's that thing in the error term. OLS is misattributing the effect of temperature to shorts-wearing, because it can't tell the difference.

How can we make it tell the difference? We can bring temperature in and include it as a control variable. If we take it out of the error term and put it in the model, suddenly OLS is able to tell the difference between the effect of shorts-wearing and the effect of temperature, even though those two things are related to each other.

So that's the point of controls. There's something in the error term that is related to our variable of interest, the one we're trying to find the effect of. If we leave it in the error term, it's going to give us an identification issue, because OLS is going to mistakenly think that the effect of those variables is the effect of the treatment variable. But by bringing it into the model, we can get rid of that additional effect.

What controls do, and this is moving on to the actual point of controls, is look at the relationship between our outcome variable and our treatment variable of interest and say there are a couple of reasons why those two things are related to each other. Why are ice cream eating and shorts-wearing related? Well, one possible reason is that wearing shorts makes you eat ice cream. It's possible, I don't know. But another possible reason is temperature: if it's hot, both of those things are going to happen. Maybe there are other reasons why we might expect both of those things to occur together as well. So there are multiple reasons, but if we're only interested in the one where shorts-wearing causes you to eat ice cream, we need to separate those things out. Adding a control for a variable removes the part of the relationship between the two variables that is explained by that variable.

So why are ice cream eating and shorts-wearing related to each other? Maybe one causes the other, but maybe it's just temperature: on hot days both of those things happen. If I add temperature as a control, it literally takes the part of that covariation, the part of that correlation, that is explained by temperature and subtracts it out. Anything that's left over has to be the part of the relationship that has nothing to do with temperature, at least assuming that we've got our linear model correct.

Let's actually see this in action. Here's some raw data: we have an x variable and a y variable. If you just looked at them all together, it kind of looks like there's a positive relationship. But you can also see that the different colors, which are the w variable here, are what's actually causing that positive relationship to be there. If you look at the clusters by themselves, the relationship is kind of negative, but you only see that by looking at them separately.

So if we were just to take this raw data and regress y on x, we'd get a strong positive relationship, specifically a correlation of 0.425, which is not a weak correlation. But we also know that it's not really that x causes y in a positive way; it's that this w thing seems to be related to both of them. A w value of one seems to go with high x's and high y's together, and vice versa.

So how can we control for w? Literally what it's doing is saying: there's a difference in x that seems to be explained by w, and I'm going to subtract that difference out. And there's a difference in y that seems to be explained by w, and I'm going to subtract that difference out. By subtracting out the parts of x and y that are explained by w, we are subtracting out the part of the relationship between x and y that is explained by w, meaning that the remaining relationship between x and y is just the part that has nothing to do with w, which in this case is a correlation of negative 0.457.

Going back to the ice cream example, if we controlled for temperature, what we would get in our coefficient on shorts would be the part of the relationship between ice cream eating and shorts-wearing that has nothing to do with temperature. That's the point of adding a control, that's what adding a control does, and that's why we're interested in doing it: there aren't a whole lot of relationships that you can get the causal effect of without any sort of adjustment whatsoever. There's usually some sort of endogeneity problem, and so we're going to need to add some controls in order to handle that issue.

All right, that's it for this video. Thank you.
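The "subtract out the part explained by w" step can be tried on simulated data. What follows is a minimal sketch, not the data or code from the video: the data-generating process, the coefficients (-0.5, 2.0, 4.0), and the helper names `simulate`, `residualize`, and `ols` are all assumptions made up for illustration. It builds a binary w that pushes both x and y up while x itself lowers y, then compares the raw relationship with the one left after removing the part of x and y explained by w.

```python
import numpy as np


def simulate(n=2000, seed=0):
    """Hypothetical data: w raises both x and y; x itself lowers y."""
    rng = np.random.default_rng(seed)
    w = rng.integers(0, 2, size=n).astype(float)   # binary confounder
    x = 2.0 * w + rng.normal(size=n)               # x is related to w
    y = -0.5 * x + 4.0 * w + rng.normal(size=n)    # true effect of x is -0.5
    return x, y, w


def ols(y, *regressors):
    """OLS coefficients of y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def residualize(v, w):
    """Subtract out the part of v explained by w (and an intercept)."""
    b0, b1 = ols(v, w)
    return v - (b0 + b1 * w)


x, y, w = simulate()

# Raw relationship: w is left in the error term.
raw_corr = np.corrcoef(x, y)[0, 1]
slope_without_control = ols(y, x)[1]

# Controlling for w by hand: remove what w explains from both x and y.
x_res = residualize(x, w)
y_res = residualize(y, w)
partial_corr = np.corrcoef(x_res, y_res)[0, 1]
slope_residualized = ols(y_res, x_res)[1]

# Controlling for w the usual way: include it in the regression.
slope_with_control = ols(y, x, w)[1]

print(f"raw correlation:              {raw_corr:.3f}")
print(f"correlation after removing w: {partial_corr:.3f}")
print(f"slope on x, no control:       {slope_without_control:.3f}")
print(f"slope on x, controlling w:    {slope_with_control:.3f}")
print(f"slope from residualized data: {slope_residualized:.3f}")
```

In this setup the raw correlation comes out positive and the correlation after removing w comes out negative, the same sign flip described for the plotted data. The last two slopes agree because regressing the residualized y on the residualized x gives the same coefficient as including w in the regression (the Frisch-Waugh-Lovell result), which is the sense in which a control "subtracts out" what w explains.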
Info
Channel: Nick Huntington-Klein
Views: 4,739
Rating: 4.9591837 out of 5
Id: Ba2Nhn4co88
Length: 8min 23sec (503 seconds)
Published: Mon Aug 10 2020