Heteroskedasticity-consistent (robust) and cluster robust standard errors

Captions
Heteroskedasticity-consistent standard errors, or robust standard errors, are quite common in empirical work. Their extension is cluster robust standard errors, which take care of non-independence of the observations. These techniques are fairly simple to use because they have been programmed into most commonly available statistical software, and if your software supports these standard errors, their use is simply a matter of switching them on; you don't really need to understand the math behind how the standard errors are calculated. But in this particular case, going through the math behind these two types of standard errors allows us to learn something about what these standard errors do, what they are capable of, and what their limitations are. In the case of cluster robust standard errors, looking at the equation also allows us to learn something new about when clustering of observations will be a problem and when it will not.

So in this video I will walk you through how the heteroskedasticity robust standard errors are derived, how the cluster robust standard errors are derived, where we actually make the homoskedasticity and independence of observations assumptions in regression analysis, and what those assumptions actually mean for the calculations.

Let's start with heteroskedasticity. The idea of heteroskedasticity was that if we have a predictor here on the x axis, the population regression line, and the error term, which is the variation of the observations around the regression line, then the variance of the error term is not constant. In some parts of the regression line there is less variation than in other parts, so the variance basically varies as X varies.

The homoskedasticity assumption was that this variance around the regression line is constant, so that the observations don't spread out or move closer to the regression line as X changes. Of course, you can have many other shapes of heteroskedasticity beyond this simple funnel shape.

Homoskedasticity was the fifth assumption in regression analysis, and it was required for consistent estimation of the standard errors. If there is a lack of homoskedasticity - if there is heteroskedasticity in your data - then the conventional standard error equation will produce incorrect results. So let's take a look at why heteroskedasticity causes problems for the conventional standard errors.

The conventional standard errors are calculated using this equation, which is the variance of the estimates. We have the sigma here, which is simply the variance of the error term. We replace the variance of the error term with the variance of the residuals, which is its estimate, and then we divide by the sum of squares total of X, which is just the sum of squared differences of X from its mean - how much X varies around its mean, multiplied by the sample size. That gives us an estimate of the standard error of the regression coefficient in the simple regression case.

So let's first take a look at where this simple equation comes from and why we need the homoskedasticity assumption.
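As a reference, the conventional formula for the simple regression slope that the narration describes can be written out as follows (a reconstruction from the spoken description; SST_X and the residuals u-hat are the quantities referred to on the slide):

\widehat{\operatorname{Var}}(\hat{\beta}_1) = \frac{\hat{\sigma}_u^2}{SST_X},
\qquad
SST_X = \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\hat{\sigma}_u^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2

The standard error is the square root of this variance.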
If the homoskedasticity assumption fails, then we have this alternative formula, which can be found in many econometrics textbooks: you take the residuals, square them, multiply the squared residuals by the squared differences of the observations from their mean, take the sum, and divide by the sum of squares total of X to the second power. That gives you the heteroskedasticity robust standard errors.

So let's take a look at why this one works and this one doesn't under heteroskedasticity. We need to start by looking at what the variance of the regression coefficient is. We derive this variance formula on this slide, and I will take a couple of shortcuts to make it fit on one slide; you can get the full derivation in your favorite econometrics book.

We will derive one particularly useful form of this equation. Let's start with the covariance of X and Y divided by the variance of X, which is the simple regression coefficient estimated by OLS.

We can write out the covariance and variance equations. The covariance is simply the average product of two differences from the mean: an observation of one variable minus its mean, multiplied by the corresponding observation of the other variable minus its mean. So we work with differences from the means, we multiply the two differences together, and then we take the average. We divide by N minus 1, which gives an unbiased estimator of the covariance.

Then we have the variance, which is simply the covariance of the observation with itself. This can be simplified by eliminating the N minus 1, because it appears in both the numerator and the denominator, and we can further simplify by writing this product as a square. So we have a sum of squared differences from the mean. This is basically the sum of squared residuals from a regression equation that only has an intercept: it is the sum of squares total of X, the sum of squares of the null model where we regress X on an intercept only.

We write it as the sum of squares total of X. We take differences from the mean, we square those differences, and we take the sum - that is the sum of squares total - and we do that for the variable X.

Let's move on. We can take the upper part of this equation and separate it: we can write it out as the sum of (X minus X bar) times Y_i, minus the sum of (X minus X bar) times Y bar, where Y bar is simply the mean of Y.

It turns out that this second part is actually zero, so we can take it out, and we are left with X minus its mean, multiplied by Y, divided by the sum of squares total of X. Y, of course, is our dependent variable, and it can be written as a function of the population regression model: Y is beta 0 plus beta 1 times X plus the error term, as the model defines. We can further simplify this equation by splitting it into parts. The beta 0 is a constant and it will be eliminated, because this term has a mean of zero. Then we have the two remaining parts. We have the sum of beta 1 times (X minus X bar) times X_i, and it turns out that the sum of (X minus X bar) times X_i is the same as the sum of squares total of X, so this part simplifies to beta 1. This gives us a convenient formula: the estimate of beta 1 is beta 1 plus this remaining term here. So to understand how much beta hat, the estimate of beta, varies, we need to understand how much this remaining term varies, because the regression coefficient beta 1 in the population is a fixed value - it doesn't vary.
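In symbols, the decomposition just narrated and the robust formula stated at the start of this passage are (a reconstruction consistent with the spoken description, with u_i for the error term and û_i for the residual):

\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, u_i}{SST_X}

\widehat{\operatorname{Var}}_{\text{robust}}(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2\, \hat{u}_i^2}{SST_X^{\,2}}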
So the only thing that varies here is this part, and how much it varies is what our standard error quantifies. So now we start deriving the standard error: how do we actually estimate how much this term varies? We write out the variance here. The variance of beta hat is the variance of this sum, and we can now drop the beta 1 out because it doesn't vary - it's a population quantity, it's fixed.

In regression analysis, without going into details, we treat the X variables as fixed too. So our sum of squares total of X is fixed - it's a constant in our equation - and when we have a constant times something and we want to take the variance of that, the result is the constant squared times the variance of the something. If you remember the path analysis tracing rules, when you calculate the variance of something you always go to the source and come back, which means that you take squares. We take squares here as well.

So we have the sum of squares total of X to the second power, and it divides the variance of the sum of (X minus X bar) times the error term - the difference of X from its mean multiplied by the error term. We can simplify this further by moving the sum outside the variance function: the variance of a sum of independent variables is the sum of their variances. So that's the idea here.

We can still make this equation a bit simpler, because this X minus X bar is a fixed value - it's fixed because X is fixed - so we can move it outside the variance function. We get X minus its mean to the second power, multiplied by the variance of the error term for one particular observation. At this point we have to make the homoskedasticity assumption.

So far we haven't made any assumption that the variance of u_i would be constant. If we assume that the variance of u_i is constant - that it doesn't vary as a function of X or anything else - then we can actually take this variance of u, replace it, and move it outside the sum function.

We then have the variance of u, the variance of the error term, multiplied by this term here, which is simply the sum of squares total of X, and we have the sum of squares total to the second power in the denominator. That gives us the variance of the error term divided by the sum of squares total, and this variance of the error term is estimated with the variance of the residuals.

So that's the normal, conventional standard error. You can find this derivation in your favorite econometrics book, if the book is any good, and it may explain a few more steps, such as why some terms are zero - I just stated that they were zero without explanation.

So what if we have heteroskedasticity? What if we can't make the homoskedasticity assumption that was required for moving from this line to this line? What do we do about it? Let's take a look.

The idea here is that we can't move this variance of u_i outside the sum because it's not constant. We can move it if it's constant, because a sum of different elements multiplied by the same constant is the same as that constant multiplying the sum of those elements. If the variance of u_i is different for each observation, we of course can't move it out.

So how do we deal with this problem? We deal with it by replacing the variance of u_i here with the squared residual for that observation.
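Written out, the chain of steps just described looks like this (a sketch of the derivation under the stated assumptions that X is fixed and the observations are independent):

\operatorname{Var}(\hat{\beta}_1)
= \operatorname{Var}\!\left( \frac{\sum_{i} (x_i - \bar{x})\, u_i}{SST_X} \right)
= \frac{1}{SST_X^{\,2}} \sum_{i} (x_i - \bar{x})^2 \operatorname{Var}(u_i)

Only if \operatorname{Var}(u_i) = \sigma_u^2 for every observation (homoskedasticity) can the error variance be pulled out of the sum:

= \frac{\sigma_u^2 \sum_{i} (x_i - \bar{x})^2}{SST_X^{\,2}} = \frac{\sigma_u^2}{SST_X}

Otherwise each \operatorname{Var}(u_i) is estimated by the squared residual \hat{u}_i^2, which gives the robust formula above.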
So the idea was that a variance is the mean of squared differences from the mean. We know that the residuals have a mean of zero, and the error term has a mean of zero as well. So we can estimate the variance of the error term for each observation separately by using the squared residual. We take this kind of equation, and that is our heteroskedasticity-consistent standard error.

This heteroskedasticity-consistent standard error can also be used for regression with multiple predictor variables. In that case we use matrix equations, and the equation looks like that. These are called Eicker, Huber, or White standard errors, or some combination of those names, after the statisticians who discussed and introduced these concepts in the literature.

This is also called a sandwich estimator, because we have this X matrix here and the other X matrix here, and then this beef - the squared residual multiplied by the squared observation - is sandwiched between the two other matrices. That is why it is called a sandwich estimator.

You can see that there are sums: this is a sum of squares total and this is a sum of squares total, because with matrices, when you multiply two things together, the order matters. For that reason we have one on the left side and one on the right side instead of multiplying by it twice. The minus one is a matrix inverse, which is basically equivalent to dividing one by something. So you create an inverse, and otherwise it looks the same: we take the squared residual, we have the squared observation, and then we multiply by the sum of squares total.

The matrices are useful if you want to study this technique yourself, but as a normal researcher you don't really have to know how to read all that stuff.

The question now is: if these standard errors don't assume homoskedasticity, they are more general, because they also work under any heteroskedasticity - so when should we use them, and why not always use heteroskedasticity-consistent standard errors?

The thing is that heteroskedasticity-consistent standard errors have been proven to work in large samples, and there is some evidence that their performance may not be that good when the sample size is very small. So in practice, if you have a large sample - several hundreds or thousands of observations - then using heteroskedasticity robust standard errors as a standard practice is probably not a bad idea. If you work with small samples - for example, experimental data with maybe 40 people in each experimental group - then you may be better off using the normal standard errors, even if there is slight heteroskedasticity in your data.

The reason why these don't work as well in small samples is that the squared residual here is not a good estimator of the variance of the error term for a particular observation in small samples. This gets better and better as the sample size increases.

So heteroskedasticity robust standard errors allow you to deal with heteroskedasticity, and the way you use them is that you simply turn them on.

Understanding cluster robust standard errors is something that allows you to understand the effects of clustering. So let's take a look at the cluster robust standard errors.

The idea of heteroskedasticity robust standard errors, in matrix form, was that you take the residual of one observation and you square it.
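To make "simply turning them on" concrete, here is a minimal sketch in Python using numpy and statsmodels (the simulated data and variable names are invented for this example and are not from the video). It simulates heteroskedastic data, computes the slope's robust variance by hand with the simple regression formula above, and compares it with the conventional and HC0 standard errors reported by statsmodels.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)

# Simulate simple regression data where the error variance grows with x
n = 500
x = rng.uniform(1, 10, size=n)
u = rng.normal(0, 0.5 * x)          # heteroskedastic error term
y = 2 + 0.5 * x + u

X = sm.add_constant(x)

# Conventional standard errors (assume homoskedasticity)
ols_fit = sm.OLS(y, X).fit()
# Heteroskedasticity-consistent (White/HC0) standard errors: just "switch them on"
robust_fit = sm.OLS(y, X).fit(cov_type="HC0")

# Robust slope variance by hand: sum((x - xbar)^2 * resid^2) / SST_x^2
d = x - x.mean()
sst_x = np.sum(d ** 2)
resid = ols_fit.resid
var_robust_by_hand = np.sum(d ** 2 * resid ** 2) / sst_x ** 2

print("conventional slope SE:     ", ols_fit.bse[1])
print("HC0 slope SE (statsmodels):", robust_fit.bse[1])
print("HC0 slope SE (by hand):    ", np.sqrt(var_robust_by_hand))

With data like these, the conventional and robust standard errors typically differ noticeably, and the hand-computed robust value matches the HC0 value from statsmodels.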
In cluster robust standard errors you take two different residuals belonging to the same cluster and multiply them together, and you repeat that for every pair of observations in the same cluster. Why would you want to do that, what is the point, and what does analyzing this equation tell us about the effects of clustering on a normal regression model?

Let's take a look at this particular part here: why do we have two different residuals? Here we have one residual; here we have two different residuals. Of course we could have the same residual twice, but we basically multiply every pair of residuals together.

So let's go back to the derivation of the heteroskedasticity-consistent standard errors. That is the normal standard errors: we make the homoskedasticity assumption here, and we actually have to make the independence of observations assumption a bit before - it's here.

So why do we need the independence of observations here? The reason is that when we take a sum of these differences from the mean multiplied by the error term, the variance of this sum is the sum of the variances only if the observations are independent. If you take two variables, the variance of their sum is the sum of their variances only if those two variables are uncorrelated or independent.

So what do we do when that fails? We can't move the sum outside the variance here, because of the non-independence of observations.

What we actually do in the cluster robust standard errors is calculate this variance as a sum of variances plus the sum of all covariances within the cluster. So if we have ten observations in a cluster, then there are ten variances and 45 distinct covariances, and the variance of the sum is the sum of those ten variances plus two times the 45 covariances - or we can just use each covariance twice in the sum.

So we take a look at this covariance between the observations in the cluster. That covariance can come from multiple different sources. It can come from unobserved heterogeneity, so that some clusters are on average higher than others and there is no particular pattern. In panel data it can be autocorrelation, so that observations close to each other in time are more similar to one another than observations that are far apart in time.

What we do is take u_i and u_j, the error terms for two different observations, and replace them with residuals. We reorganize a bit: because the means are zero, the covariance between these two products is simply the product of all these terms multiplied together. So that is our cluster robust standard error, and that is the equation in matrix form.

Looking at this equation and at the variance equation for the regression coefficient with clustered data allows us to learn something about the effects of clustering. So let's take a look at these two equations.

This is the variance formula. If we know the variance and the covariance between two observations in the same cluster in the population, and we know the sum of squares total, then we can calculate how much the regression coefficients estimated from repeated samples from that population would vary from one sample to another. And this is the equation that we use to estimate that quantity.
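Written out for one cluster, the point about variances and covariances is (a sketch; i and j index observations within the same cluster):

\operatorname{Var}\!\left( \sum_{i} (x_i - \bar{x})\, u_i \right)
= \sum_{i} (x_i - \bar{x})^2 \operatorname{Var}(u_i)
+ \sum_{i \ne j} (x_i - \bar{x})(x_j - \bar{x}) \operatorname{Cov}(u_i, u_j)

The cluster robust estimator plugs in \hat{u}_i \hat{u}_j for each within-cluster product, so with ten observations per cluster we get the ten variance terms plus the 45 distinct covariance terms, each counted twice.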
So this is an estimate of the variance, or the standard error, and this is the actual variance if we know those values.

So let's take a look at that. What does it tell us? We can use a little bit of covariance algebra. Let's write it down, and I'm going to use d for X minus the mean of X, so we just work with deviation scores. The covariance of the deviation score for i times u_i with the deviation score for j times u_j is simply the expected value of the product of all these things, minus the product of the means of the two terms. Well, the means are simply zero, so that part is eliminated, and the covariance is simply the expected value, or the mean, of the product of all four things - the two error terms and the two deviations of X from its mean. Then we take a sum and divide by the number of terms: we calculate this quantity for each pair of observations and take the mean.

And because the error terms are assumed to be uncorrelated with the predictor - that is the no endogeneity assumption - this equation can be separated. The expected value of the product of the two deviations from the mean of X and the two error terms is the same as the covariance between the two predictor values multiplied by the covariance between the two error terms. This equation actually gives us some insights that are demonstrated in another video with a simulation, using a simulated data set.

The thing here is that if two observations of X are independent - if the ICC(1) of X is zero - then this term here will be zero, and so whatever the correlation between the two error terms is, it doesn't matter. So if your X values are independent of one another - there is no clustering effect, no autocorrelation, nothing in the X variable - then it turns out that non-independence of the error terms is actually not a problem for your analysis. That is a kind of interesting result. It is probably not very practical, but it explains why, in another video, we can get clustering effects by manipulating the ICC of one variable but not the other.

If we look at the actual equation that we use for calculating the standard error, we can look at this part here. Because we multiply two residuals together, and we do that separately for each pair within a cluster, and we do that for each cluster independently, this implies that the cluster robust standard errors are valid regardless of how the observations - the error terms - are correlated.

So there can be strong autocorrelation in some clusters and no autocorrelation in other clusters, and these standard errors don't care, because we don't make any assumptions about any of the covariances. We estimate every covariance within a cluster by multiplying two residuals together. So this is robust to arbitrary within-cluster correlations.

Most traditional techniques for panel data focus particularly on unobserved heterogeneity, and unobserved heterogeneity manifests as error terms that are correlated within a cluster, with a correlation that is constant. So in a normal, traditional panel data model there is no effect where two observations that are closer to one another are more similar than two observations that are farther from each other. That would require a separate model for autocorrelation.
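In software, cluster robust standard errors are again switched on rather than computed by hand. Here is a minimal sketch in Python with statsmodels (the cluster structure, group sizes, and variable names are invented for this example): it simulates data in which both X and the error term have a cluster-level component and then compares conventional and cluster robust standard errors.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# 50 clusters with 10 observations each
n_clusters, cluster_size = 50, 10
groups = np.repeat(np.arange(n_clusters), cluster_size)

# Both x and the error term get a cluster-level component,
# so observations within a cluster are not independent
x = rng.normal(size=n_clusters)[groups] + rng.normal(size=groups.size)
u = rng.normal(size=n_clusters)[groups] + rng.normal(size=groups.size)
y = 1 + 0.5 * x + u

X = sm.add_constant(x)
conventional = sm.OLS(y, X).fit()
clustered = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": groups})

print("conventional slope SE:  ", conventional.bse[1])
print("cluster robust slope SE:", clustered.bse[1])

Because X here has a within-cluster component (a nonzero ICC), the cluster robust standard error for the slope is typically clearly larger than the conventional one; if the cluster component is removed from X, the two should be close, illustrating the point about the ICC of X made above.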
So cluster robust standard errors, in contrast to, for example, GLS fixed effects and GLS random effects, also allow you to have other correlation structures beyond the basic structure where each observation is correlated at the same level with any other observation. So they are more robust, and that is one reason why, when you work with panel data, you should always consider using cluster robust standard errors, even if you have already applied an estimation technique that takes unobserved heterogeneity into account.

The second point that we learn from here is that when we multiply two residuals together, that is a poor estimator of the actual covariance unless our sample size gets large. So if the number of clusters is small, then the standard errors are typically too small - typically biased - and this is a problem that you can't solve by increasing the number of observations within clusters. The idea is that if you have, let's say, 30 companies that you follow for 10 years, you have 300 observations. If you are concerned that your cluster robust standard errors are slightly biased because of the small sample - the small number of clusters - then increasing the number of observations within a cluster from 10 to 20, to increase the total sample size to 600, wouldn't do anything about this small-sample bias.

Whether this estimator is accurate or not depends on the number of clusters, and what is sufficient is difficult to say. For example, Angrist says that maybe 40 would be a minimum limit: if your number of clusters is below 40, you could be in trouble; if you have more than 40, it could be fine. But this of course depends on many different things, so we can't set a single cutoff that will be useful in all scenarios. It does, however, give you a ballpark estimate of what kind of number of clusters is needed for this technique to be really useful.

So this video went through some of the math to show some insights about how clustering works and how it affects the variance of regression coefficients. The key takeaway is that these techniques are useful, but they require large sample sizes. If you want to apply these techniques, you don't actually have to understand much of the math here, because they are typically applied by just switching them on in software that supports them.
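To make the small-number-of-clusters caveat concrete, here is a rough Monte Carlo sketch in Python (all settings are invented for illustration and are not from the video). It repeatedly draws clustered samples, records the slope estimate and its cluster robust standard error, and compares the average reported standard error with the actual sampling standard deviation of the estimates; with few clusters the reported standard errors tend to be too small.

import numpy as np
import statsmodels.api as sm

def simulate_once(n_clusters, cluster_size, rng):
    # Clustered data: cluster components in both x and the error term
    groups = np.repeat(np.arange(n_clusters), cluster_size)
    x = rng.normal(size=n_clusters)[groups] + rng.normal(size=groups.size)
    u = rng.normal(size=n_clusters)[groups] + rng.normal(size=groups.size)
    y = 0.5 * x + u
    fit = sm.OLS(y, sm.add_constant(x)).fit(
        cov_type="cluster", cov_kwds={"groups": groups}
    )
    return fit.params[1], fit.bse[1]  # slope estimate and its cluster robust SE

rng = np.random.default_rng(7)
for n_clusters in (10, 50):
    draws = np.array([simulate_once(n_clusters, 10, rng) for _ in range(500)])
    estimates, reported_se = draws[:, 0], draws[:, 1]
    print(f"{n_clusters} clusters: sampling SD of slope = {estimates.std():.3f}, "
          f"mean reported cluster robust SE = {reported_se.mean():.3f}")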
Info
Channel: Mikko Rönkkö
Views: 3,823
Rating: 4.9607844 out of 5
Keywords: research methods, statistical analysis, organizational research
Id: XsyUzaZHs5o
Length: 28min 13sec (1693 seconds)
Published: Mon Sep 23 2019