An intuitive introduction to Difference-in-Differences

Captions
Difference-in-differences is one of the most widely applied methods for estimating causal effects of programs when the program was not implemented as a randomized controlled trial. In this video I will describe the situations where the method is applicable and give you the intuition behind it. I'll also explain how and why you might want to use regression to estimate diff-in-diff effects. Throughout, I will talk about the key assumption required for the diff-in-diff estimate to be valid.

Suppose Sao Paulo, the largest city in Brazil, institutes a free lunch program in its elementary schools in 2009. There are many reasons to expect students to perform better in school if they are guaranteed a free meal in the middle of the day, but it would be nice to know how large such effects might be. Suppose also that Brazilian fifth graders take a standardized math test at the end of every year. How might we evaluate the effects of this program on test scores?

One way to evaluate the program would be to compare test scores of kids in Sao Paulo in 2010 with the scores of Sao Paulo kids who took the exam in 2008, before the program was implemented. We're going to call that difference D1, and D1 is certainly partially due to the program. But suppose there was an important international soccer tournament during the week of the exam in 2008 but not in 2010. This tournament might also influence the difference in test scores between the two periods. So what we have is that D1 is both the program effect and what we're going to call the trend: whatever else might be happening at the same time.

Now suppose we also observe test scores in 2008 and 2010 in Rio, another large city in Brazil, not far to the north on the coast. If we're willing to assume that the difference across time in Rio is reflective of what would have happened in Sao Paulo without the program, then we can use the difference D2 as an approximation of what
the trend is, and we get our first difference-in-differences estimate, that is, the difference in the differences: D1 minus D2.

It's not the only way to get the diff-in-diff estimate from this data, though. You can also start with the simple difference between the test scores in Sao Paulo and Rio in 2010. Let's call that difference D3. This difference is going to be the sum of the program effect and whatever differences might exist between the two areas that have nothing to do with the program. If we're willing to assume that the area differences that existed in 2008 were approximately the same, then we can look at that difference in 2008, call that D4, and we get another estimate of the diff-in-diff effect, that is, D3 minus D4. It turns out these two diff-in-diff estimates are algebraically identical. That means if either one of the assumptions we had to make sounds fishy to you, then you should be worried about the validity of the diff-in-diff estimate.

In most but not all cases, you'll use diff-in-diff to estimate the program effect when you have one group that is affected by the program and another that is not, and you observe outcomes of both groups before and after program implementation. If the treatment is random, you don't need a diff-in-diff to get unbiased estimates of the effect; you can simply look at differences between the treatment and the control groups. That said, even in those cases a diff-in-diff can sometimes improve the precision of your estimates. If you're sure that nothing else changed between the measures of your outcomes before and after program implementation, then you could do a simple before-after difference to get the effect, but that's rarely a reasonable assumption. If the treatment was assigned to different groups based entirely on observable characteristics, you could use multiple regression and control for these characteristics to get an estimate of the program effect. Unfortunately, you often don't know how
the program was assigned or what other differences might exist between the treatment and control groups.

As we've seen, the diff-in-diff estimate is straightforward. You start with the means of the four groups, you create the pre-post differences for treatment and control, and then diff them. To make sure you're getting the idea, I suggest pausing the video now and computing the diff-in-diff estimate for the hypothetical free lunch program using this hypothetical data.

Okay, good. Now note that we observe test scores increase by exactly 70 in Sao Paulo, but they increased by 40 in Rio. So if we think that they would have increased by about 40 in Sao Paulo no matter what, then we can compute a diff-in-diff estimate, which is going to be the change over time in Sao Paulo minus the change over time in Rio: an increase of 30 points on this math test.

As we've seen, all we really need is aggregate-level data to compute the estimate. But we can also compute the required sample averages from repeated cross-sectional data, that is, separate samples for each cell, or from longitudinal data, where we observe the same people in both time periods. If you happen to have individual-level data, either cross-sectional or longitudinal, you can often estimate more flexible models that don't rely on assumptions as strong as the pure diff-in-diff model that we've been talking about.

Here we have data on kids before and after the intervention in both Sao Paulo and Rio. Y is the variable holding the test score, DTr is a dummy variable equal to 1 if the individual who took the test was in the treatment group, and DPost is equal to 1 if the individual took the test after the intervention. So, for example, Miguel got a 40 on the test; he took the test in Rio, the control group, and he took it before program implementation in Sao Paulo. Sofia, on the other hand, got a 100 on the test. She took the test
in Sao Paulo after program implementation, that is, she got the free lunch. You might want to pause the video again and compute the same sample averages we just used to compute the diff-in-diff estimate. You should get identical numbers.

Because we now have individual-level data, we can also estimate the difference-in-differences using a plain old regression model. All we have to do is regress our outcome variable on the two dummy variables, DPost, the dummy variable for whether the individual was measured before or after the program, and DTr, the dummy variable for whether or not the individual is in the treatment group, and the interaction of the two. The coefficient on the interaction is going to be our diff-in-diff estimate.

So here's the data we were just looking at. We've got our 10 observations and our same three variables. All we have to do is first create the interaction, so I'll interact the treatment dummy and the post-period dummy by creating a new variable, which is just the two multiplied together. Now I just run the regression: regress Y on the dummy variable for the treatment, the dummy variable for the post period, and the interaction. What we can see is that the coefficient on the interaction is 30, exactly what we got when we did the diff-in-diff by hand.

So why does this work? When we run the regression, we're saying we believe test scores are determined by this regression model. If we take the expected value of Y given all of our independent variables, we get this linear combination, and the error term drops out. We can plug in different values, say the post dummy variable equal to zero and the treatment dummy variable equal to zero, and when we plug in those values we get the population means shown in the table. So the expected value of the test score in the pre period in the control area, that would be Rio, is just going to be beta 0, and similarly, in the post period in the treatment area, the expected score
is going to be the whole thing: beta 0 plus beta 1 plus beta 2 plus beta 3. The difference-in-differences estimate for the population is easy to compute. The difference across time in the control group, remember that was D2, is going to be just beta 1. The difference across time in the treatment group is going to be beta 1 plus beta 3, and the difference between the two is beta 3. When we run this regression, we're getting an unbiased estimate of this coefficient, which again is our difference-in-differences estimate.

So we can do it, but why would we want to compute our difference-in-differences estimate using a regression? Well, there are three reasons. First of all, any time we run a regression, our stats software is going to give us estimates of the standard errors for our coefficients. We don't have to do anything special at all, and we get standard errors for our diff-in-diff estimates. That's convenient. Second, we can add additional controls; that's what these X's are. If the trends in the treatment and control areas unrelated to the program are different because of differences in observed characteristics, for example socioeconomic status, we can control for these differences and still get an unbiased estimate of the program effect. Finally, if we control for important determinants of Y, that's going to reduce the variance of epsilon, our error term, and thus give us smaller standard errors for our estimate of the program effect.

So at the end of the day, when is a difference-in-differences estimate a good estimate of the causal effect of the program? Let's think back to Sao Paulo and Rio. Suppose Sao Paulo was picked for the program specifically because kids were worse off there. That's okay for the difference-in-differences, because that's a pre-existing difference between the treatment and control groups, and we take care of that. Now, on the other hand, what if Sao Paulo and Rio were the same in 2009, but the policy team decided to run the program in Sao Paulo
because they knew a free lunch program was being implemented in Rio already and they didn't want to conflict with it? Well, that's not okay. If we have reason to believe the trends unrelated to the program might not be the same in the two groups, then the diff-in-diff estimate will give you the wrong answer.
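The table of population means mentioned in the video is not part of the captions. The block below is a reconstruction from the spoken description, assuming the regression is written with beta 1 on the post dummy, beta 2 on the treatment dummy, and beta 3 on the interaction, which is the ordering implied by "the difference across time in the control group is beta 1". The superscript notation for the dummies is my own.

```latex
% Regression model described in the video (notation is an assumption)
Y = \beta_0 + \beta_1 D^{Post} + \beta_2 D^{Tr} + \beta_3 \, D^{Tr} D^{Post} + \varepsilon,
\qquad E[\varepsilon \mid D^{Tr}, D^{Post}] = 0

% Conditional means for the four cells
\begin{aligned}
E[Y \mid D^{Tr}=0,\ D^{Post}=0] &= \beta_0                               && \text{(Rio, before)} \\
E[Y \mid D^{Tr}=0,\ D^{Post}=1] &= \beta_0 + \beta_1                     && \text{(Rio, after)} \\
E[Y \mid D^{Tr}=1,\ D^{Post}=0] &= \beta_0 + \beta_2                     && \text{(Sao Paulo, before)} \\
E[Y \mid D^{Tr}=1,\ D^{Post}=1] &= \beta_0 + \beta_1 + \beta_2 + \beta_3 && \text{(Sao Paulo, after)}
\end{aligned}

% Differences across time and the difference-in-differences
\begin{aligned}
D_2 &= (\beta_0 + \beta_1) - \beta_0 = \beta_1 \\
D_1 &= (\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2) = \beta_1 + \beta_3 \\
D_1 - D_2 &= \beta_3
\end{aligned}
```

The same algebra gives D3 minus D4 equal to beta 3 as well: D3 is beta 2 plus beta 3 and D4 is beta 2, which is the algebraic identity the video refers to.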
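The by-hand calculation and the regression described in the video can be checked against each other with a short sketch. The video's actual 10 observations are not in the captions, so the scores below are made up, apart from Miguel's 40 and Sofia's 100; they are chosen so that the cell means rise by 70 in Sao Paulo and by 40 in Rio, as in the video. The variable names and the use of NumPy least squares (rather than whatever stats package the video uses) are assumptions for illustration.

```python
import numpy as np

# Hypothetical individual-level data: (score, d_tr, d_post).
# Only Miguel's 40 and Sofia's 100 come from the video; every other score is
# invented so that the means change by 70 in Sao Paulo and 40 in Rio.
rows = [
    (40, 0, 0), (50, 0, 0),               # Rio, before (Miguel is the 40)
    (80, 0, 1), (90, 0, 1),               # Rio, after
    (10, 1, 0), (20, 1, 0), (30, 1, 0),   # Sao Paulo, before
    (80, 1, 1), (90, 1, 1), (100, 1, 1),  # Sao Paulo, after (Sofia is the 100)
]
data = np.array(rows, dtype=float)
y, d_tr, d_post = data[:, 0], data[:, 1], data[:, 2]


def cell_mean(tr, post):
    """Mean score for one of the four group-by-period cells."""
    return y[(d_tr == tr) & (d_post == post)].mean()


# Route 1: difference across time in each city, then difference those.
d1 = cell_mean(1, 1) - cell_mean(1, 0)   # Sao Paulo, after minus before
d2 = cell_mean(0, 1) - cell_mean(0, 0)   # Rio, after minus before
did_by_time = d1 - d2

# Route 2: difference across cities in each period, then difference those.
d3 = cell_mean(1, 1) - cell_mean(0, 1)   # Sao Paulo minus Rio, after
d4 = cell_mean(1, 0) - cell_mean(0, 0)   # Sao Paulo minus Rio, before
did_by_area = d3 - d4

# Route 3: OLS of y on a constant, d_post, d_tr, and their interaction.
X = np.column_stack([np.ones_like(y), d_post, d_tr, d_tr * d_post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did_regression = beta[3]                 # coefficient on the interaction

print(f"D1 - D2 = {did_by_time:.1f}")
print(f"D3 - D4 = {did_by_area:.1f}")
print(f"interaction coefficient = {did_regression:.1f}")
```

All three routes should agree because the model has one parameter per cell, so the fitted values reproduce the four cell means exactly. Standard errors are not computed here; a regression package would report them alongside the coefficients.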
Info
Channel: Doug McKee
Views: 165,445
Rating: 4.8966203 out of 5
Keywords: Difference In Differences, econometrics, statistics, economics
Id: J7q2H8aB8bQ
Length: 12min 49sec (769 seconds)
Published: Thu Jan 29 2015