Susan Athey, Sackler Big Data Colloquium

Captions
[Moderator] Okay, I think we're ready to start the next session, if we could come to attention. We're going to have three talks coming up now before lunch, and the first one is Susan Athey and Guido Imbens, who isn't here this time, but it's a joint presentation, given by Susan.

[Susan Athey] Great, thanks. I'm so delighted to be here with such a terrific audience; this has been such a great conference. And actually this session, the next three papers, are all things that are near and dear to my heart. I want to talk today about combining methods from machine learning with methods from causal inference. This was a journey for me, because I came into my experiences with big data and machine learning with a very particular perspective, and through interdisciplinary work, especially working on the search engine and search advertising system at Bing, I learned an enormous amount. I've been thinking for the last few years about how you can put together the things that I already knew with the new things that I was learning.

So to start out: what I felt I learned, after a number of years of being the only person from a more economics and social science perspective in a sea of people from machine learning, were some themes about strengths and weaknesses of the different approaches. What I learned from supervised machine learning, and what I thought was really missing from a lot of the work we had traditionally done in the social sciences, was that there are incredibly well-developed and widely used nonparametric prediction methods that work very well with big data. They've long been used in a certain set of applications, and I think it's just in the last few years that they're really getting exposure in the social sciences and other areas. One of the really big things that stands out, and that I found attractive, is the use of cross-validation for model selection: a very rigorous and well-specified method for how you choose your models. In economics we generally think of that as an art, or it's supposed to be given by theory, even though what sane person would have a theory about how two thousand variables affect an outcome, and in what functional form? It's completely insane, but we all sort of pretended that we did, and that was a huge defect when I started working with big data problems. There's also a huge focus on prediction and the application of predictions, and that's tightly wound with the approach of cross-validation: cross-validation says, I'm going to see how well my model makes a prediction for a particular observation, and I'm going to judge it by how well my prediction matches reality. That's a key component of it.

A weakness, now, except for the people in this room, who are the leading exceptions to all the rules, is that most of machine learning does not pay a lot of attention to causality, and especially the second tier of people out there in the world who've been trained in machine learning really don't have a good language for even talking about it. That was a huge shock for me: people had PhDs, and the average person coming out of an average school just couldn't even have an intelligent
conversation with me about it, and that was kind of disappointing. The people here have been working to really change that within machine learning, but it's still a fair generalization today.

Now, in econometrics, social science, and statistics we have decades of experience with a formal theory of causality. There's one branch based on the potential outcomes framework, which maps very nicely onto economic approaches: we think about somebody as being potentially treated with a whole bunch of different drugs or a whole bunch of different prices, and we have a theory that says they would have had a different outcome if they'd seen a different price or gotten a different dosage of the drug. Another branch, which looks very different when you read the papers but is in some sense conceptually similar, is what we call structural models in economics. That's actually what most of my research was on until the last couple of years. These are models with a more fully specified functional form for the way the world works. A lot of my research has been on auctions (I have a bunch of survey papers on my website you can read), where we develop a theory: if I see somebody bidding in an auction, I figure out, given the environment they were in, the strategy they should have been following; if I see their bid, I can infer their valuation; and then I can compute optimal reserve prices, optimal market design, entry policies, whether I should have small business set-asides, and so on. This is a massive literature in economics; Leon is going to present some work in the next session that fits into that literature. It's also widely used in practice. In merger analysis, for example, the Federal Trade Commission and the DOJ use these classes of models: I see a firm setting prices, I can figure out what their costs were, and I can figure out how their incentives would change if the firms merged and how much they would raise prices. So this is a very well accepted approach, and all the big consumer products firms have teams of people trained in it, and that's what they do to set prices.

So generally there are well-developed and widely used tools for estimation and inference of causal effects, often using observational data, although sometimes using experiments. You're going to use these to say: what price should you set, what would happen if I changed my prices, what would happen if I changed the minimum wage, what would happen if I changed class size. There are thousands, tens of thousands, hundreds of thousands of papers using this methodology for these problems. Some of the big weaknesses, though, and this is where it really hit the fan for me: I had a toolkit, I went into an environment with lots of covariates, I started having spreadsheets with thousands of models, this made no sense, and it felt incredibly unprincipled.

So the research agenda we've embarked on tries to grapple with these problems. Many problems in the social sciences entail a combination of prediction and causal inference. Your ultimate question is causal inference, but you might have thousands of variables which could really be thought of as predictive: we don't have a theory about them and we're not interested in changing them. I might have millions of patients, but their fundamental
characteristics aren't going to change. I'm not going to give them a drug and change their age; I'm not going to give them a drug and change the climate they live in. I want to hold fixed these attributes of individuals and think about intervening only on a small number of variables. At least in the social sciences, and from what I read in machine learning, that distinction between the different kinds of variables is not commonly made. We run a regression; we might have a causal variable and a bunch of attributes in that regression model, but we don't treat them differently. We use the same $(X'X)^{-1}X'Y$ formula and the same formula for the standard errors, and all of those treat these variables the same. So, first, we want to treat them differently. Secondly, the existing machine learning approaches off the shelf are not directly optimized for the problem of estimating causal parameters. And third, inference is often more challenging with machine learning methods: they're designed for prediction, not for hypothesis testing.

So our proposals are: to formally model the distinction between the causal and predictive parts of the model and treat them differently in estimation and inference; to develop new estimation methods, staying very close to machine learning when we're using machine learning and to standard econometric approaches when we're doing causal inference (we don't want to invent something new when we don't need to; instead we want to build on the hundreds of thousands of papers that preceded us, but bring them together in ways that work); to develop particular new approaches to cross-validation that are optimized for causal inference, which is really the focus of my talk today; and to develop some robustness measures. I'm going to start by giving a quick summary of a couple of these other papers I've been working on, and then I'll dive into a specific example of estimating heterogeneous treatment effects.

So just to start, let me introduce the potential outcomes model for causal inference, just to define causality. What we mean is that we want to know what would happen if a policymaker changed the policy. $Y_i(w)$ is the outcome person $i$ would have if I gave them treatment $w$: if I gave them a high dose, a medium dose, or a small dose of the drug, this is the outcome they would have. For a binary treatment, which is what I'll use for simplicity today, the treatment effect is the difference in the potential outcomes. Every single individual in this room has a treatment effect, but the fundamental problem of causal inference is that I will never see you, at this moment, here in this room, both with the drug and without the drug. That is going to be my statistical problem. I'm also going to have fixed attributes of the units of study, which I'll call X's. These are the things I want to highlight: we're going to be focusing on this distinction a little differently, or more precisely, than a lot of the literature. The units have fixed attributes, and these attributes would not change with alternative policies. For example, if I want to think about assigning minimum wage laws to states, I'm not contemplating moving coastal states inland when I change the minimum wage policy. The states aren't going to change; I'm going to change the policy for a fixed set of states. And by the way, there are only 50 states, so that's my population.
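As an aside, here is a minimal sketch, in notation, of the potential outcomes setup just described in words; it is reconstructed from the talk (binary treatment $W_i$, fixed attributes $X_i$), not copied from a slide:

```latex
% Potential outcomes setup as described above (a sketch, not the paper's exact notation):
% Y_i(w): outcome unit i would have under treatment w; W_i in {0,1}; X_i: fixed attributes.
\begin{align}
  \tau_i      &= Y_i(1) - Y_i(0)
      && \text{(unit-level treatment effect, never observed directly)} \\
  Y_i^{obs}   &= W_i\, Y_i(1) + (1 - W_i)\, Y_i(0)
      && \text{(only the outcome under the assigned treatment is observed)} \\
  \tau(x)     &= \mathbb{E}\bigl[\, Y_i(1) - Y_i(0) \mid X_i = x \,\bigr]
      && \text{(treatment effect as a function of the fixed attributes)}
\end{align}
```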
That population is not going to change. So the first paper is a paper about inference, and actually it's really a small data paper, not a big data paper, but it was motivated by big data applications and it applies to both. It's basically saying that in regression models, which are the most widely used models in economics and a lot of other social sciences, we've all been doing our inference incorrectly, although it turns out that some of it has been the right inference for the wrong reasons. One of the things we want to start with is to formally define what question we're asking, and that helps highlight why we've been answering the wrong question. We want to formally define a population of interest and how sampling occurs, and then define an estimand that answers the economic question using these objects, distinguishing between effects and attributes. For example, my population of interest might be 50 states, or it might be the users of a search engine on a particular day. In each of those cases I might very well see the entire population: I might see all the users of the search engine on a day, or all 50 states' incomes or earnings. So whatever my problem is, it's not that I'm sampling from a population; I have a population. We want to keep that in mind. Instead we want to ask: what data is missing, what's the difference between your estimator and the estimand, and what makes that difference uncertain?

For example, if I have a data set from 2003 with the income of all 50 states, I actually know with certainty, for that data set, what the average difference was between coastal states and interior states. There is no sampling uncertainty; that's just a number, and it's a number that I know. You might say: well, you had data from 2003, but I didn't really care about 2003, I wanted to know what the average difference was for a random year, or for next year. But that's actually not what we're doing, because if we were doing that, we would be writing in our papers about how we thought the cross-sectional variation across states relates to the intertemporal variation within a state. Of course there's no reason at all to think those would be the same. If I want to use 2003 to predict 2004, that's a pretty stupid thing to do; if I really wanted to predict 2004, I might use 2002 and 2003 and estimate the serial correlation, and it might be that even though things change from year to year, states are highly serially correlated over time, and the serial correlation is much stronger than the cross-sectional correlation. So it would take a lot more assumptions, which nobody ever writes down, if you were really using 2003 to predict 2004, and that's not what people do. I would argue that the real question there is something you know with certainty, and what's really unknown is the causal effect of changing minimum wages. That's something we'll never know, because we're never going to see any state in both scenarios.

So once we've set up this kind of framework, we then re-derive what the inference should be, and what we find is that the commonly used Huber-White robust standard errors are conservative, but they're actually the best feasible estimate for causal effects. So the way people have normally been doing inference is fine, but for fixed
attributes, like whether you're on the coast or in the interior, the standard errors may be highly overstated, because they're taking into account sampling variation that isn't there. The good news is that if you adopt the standard errors from our paper, all your standard errors will go down, so I'm optimistic that this will be a widely cited paper; you never want to write a paper that says people's standard errors are too low.

A second paper, which is coming out in May, asks a different but related question, also using this distinction between causal variables and attributes. Here we want to look at the robustness of causal estimates, and I would say we came to this project inspired by machine learning, by its rigor and its systematic model selection approaches. Here we ask: what if you have a model, what if you're ready to publish? What economists and social scientists do, if they're being rigorous, is they might say: here's my preferred specification and here are five more columns; I put in fixed effects, I took out fixed effects, I controlled for this, I didn't control for that, I changed my population, and look, my effect is robust across these five things. Now that makes sense, sort of, if you had ten variables; you might actually have done an exhaustive search and basically be reporting what came out of that. Of course nobody corrects their standard errors for that process either, but that's another story. It's not very systematic, and if you have lots of covariates it's just not feasible for the human brain to have really absorbed all the possible robustness checks.

So we wanted to start a little bit of a literature here. We've got a proposal that could be improved upon, but the basic idea is that we want a measure of robustness for a particular model. Our proposal is to use a series of tree models to partition the sample by attributes. For example, we can go one attribute at a time: we can take old people and young people as the first partition, then within each of those partitions we re-estimate; each of those gives an average causal effect, and we can average back up across all the ways we split the sample. Depending on how you split and re-estimate, you're going to get different average effects across the different models. You're always going after the same thing, the population average treatment effect, but you can estimate it in lots of different ways, and the different ways are different partitions; doing this in other ways would be an avenue for future research. We then propose using the standard deviation of effects across these partitions as a robustness measure. We applied that to a bunch of past studies and found, as we had hoped, that randomized experiments tended to look very robust: it doesn't matter how you split the sample and re-estimate, you always get the same result. But observational studies tended to be less robust; when you used matching methods and so on, the way you specified the model might actually have made a big difference, and you weren't as robust as you would hope to be. So the proposal to social scientists would be: let's be inspired to be more systematic, let's look at the algorithms that have been used for exploring alternative specifications in machine learning and take them as an inspiration for developing robustness approaches; this is a first step in that direction.
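A minimal sketch of that kind of robustness measure, under assumptions not spelled out in the talk: a randomized binary treatment, one median split per attribute rather than full tree models, and a simple difference in means within each cell. The function and column names (`y`, `w`) are made up for illustration; the paper's actual procedure uses tree-based partitions, but the aggregation idea is the same.

```python
import numpy as np
import pandas as pd

def ate(df):
    """Difference-in-means estimate of the average treatment effect
    (assumes a randomized binary treatment column 'w' and outcome 'y')."""
    return df.loc[df.w == 1, "y"].mean() - df.loc[df.w == 0, "y"].mean()

def robustness_measure(df, attributes):
    """For each attribute, split the sample at its median, estimate the ATE
    within each cell, and average back up weighting by cell size. The standard
    deviation of these partition-based estimates is the robustness measure:
    a small value means the estimate barely depends on how you split."""
    estimates = []
    for col in attributes:
        cutoff = df[col].median()
        low, high = df[df[col] <= cutoff], df[df[col] > cutoff]
        if len(low) == 0 or len(high) == 0:
            continue  # skip degenerate splits
        weighted = (len(low) * ate(low) + len(high) * ate(high)) / len(df)
        estimates.append(weighted)
    return np.std(estimates)
```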
Now let me spend the rest of my time jumping into a specific paper, and this paper is really focused on heterogeneous effects. The first motivation, and I think this motivation may be the broadest in terms of interest across different sciences, is experiments and data mining. Generally in medicine, if you're going to go through FDA approval, you have to pre-specify your analysis plan. You're not allowed to look at your data on a thousand people and then figure out: okay, these three guys the drug did really well for, and they happened to be 67 and 68 years old and live in Montana, so therefore this drug should be approved for 67 and 68 year olds who live in Montana. You're not allowed to do that, and there's a very good reason: that's called fishing, the bad version of data mining; multiple testing is the more formal term for it. You can't just go look in your data, find the few people for whom something worked, and say that you found an effect. That's bad science and bad statistics. However, in a world with lots of data, big data, it's pretty ridiculous to imagine that you as a doctor or you as an economist could possibly come up with a full pre-analysis plan. Why would you have a theory about every possible combination? You don't know who the drug is going to work on; you can guess, but you don't know. So in some sense we're throwing away data, and potentially people are dying over the years, because you might look at your data from the first experiment, find a group, and then have to run a whole new study with the right pre-analysis plan to find the effect you're looking for. What we would like is a systematic method where you can look at the data after the experiment has been run but still have valid inference, to find out who on earth this drug worked for. That's what we're going to deliver: anybody who has a randomized experiment or a randomized medical study can use our method, discover who the drug worked for, and get valid inference. We think that's important.

More broadly, the object of interest we want is the treatment effect function: how does the effect of this drug vary with the covariates, with people's characteristics? That's different from the problem of finding the optimal policy, but, inspired by the social sciences, we want to think of having a structural function, a function that says what the benefit of this drug is across people's characteristics, or what the benefit of the minimum wage is across characteristics of a state. The world might change, the price of the drug might change, we might discover complications later, but if we know what the benefits look like, we can do cost-benefit analysis in a wide range of settings.

So what we're going to deliver is a model that again distinguishes between causal effects and attributes and estimates treatment effect heterogeneity. It combines supervised machine learning prediction methods and causal inference tools, and, in what I think is a really novel part of the paper that could inspire a lot of follow-on work improving on what we've done, it introduces and analyzes new cross-validation approaches that are customized for causal inference. And finally we want to do inference, as I just discussed.

So just to start, since this is a broad audience, let me
start by reviewing probably the simplest supervised machine learning method, the one that's easiest to understand; it turns out that being simple and easy to understand is also going to make it very attractive for these causal inference problems. It's a regression tree. We have an outcome $Y_i$ and attributes $X_i$. For a standard prediction tree (not causal inference yet, just prediction), you take a training sample, you partition the sample into subsets which we call leaves, and you predict $Y$ conditional on a realization of $X$ within each leaf using the sample mean. You keep splitting the leaves using some goodness-of-fit criterion, like how well you're predicting the outcomes, and finally you select the tree complexity using cross-validation based on prediction quality. So if you've built a really complicated tree but you don't predict outcomes well out of sample, you make the tree less complex. We're going to build on this approach, the simplest, easiest to understand, easiest to interpret supervised machine learning approach, and try to modify it for causal inference.

Here's just a picture; this is a search engine example of what the output of a tree looks like. You take the population, you divide it according to characteristics of the query, and then within each leaf you take a mean, and you have a certain number of observations in the leaf.

So let me break down the overview I gave you of these tree models into a few components, because what we're going to do is change those components. I'm still in the standard framework here; the things in red on the slide are the things we're going to change. The estimator of the predicted outcome is the sample mean of the outcome within a leaf. The typical in-sample goodness-of-fit function is mean squared error, the deviation of your predicted outcome from the actual outcome. When you do cross-validation and go out of sample, you use that exact same criterion, but on a test sample. Your tuning parameter is the number of leaves; there's a penalty on the number of leaves, and what cross-validation does is figure out which penalty for complexity gives the best out-of-sample fit in terms of mean squared error.
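A minimal sketch of that standard prediction-tree pipeline, assuming scikit-learn; the cost-complexity pruning parameter `ccp_alpha` plays the role of the penalty on the number of leaves, and the data here is simulated purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Simulated data standing in for (X_i, Y_i); no causal structure yet.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000)

# Each leaf predicts with its sample mean; ccp_alpha penalizes complexity.
# Cross-validation picks the penalty that gives the best out-of-sample MSE.
search = GridSearchCV(
    DecisionTreeRegressor(min_samples_leaf=25),
    param_grid={"ccp_alpha": [0.0, 0.001, 0.01, 0.05, 0.1]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
tree = search.best_estimator_
print("chosen penalty:", search.best_params_, "leaves:", tree.get_n_leaves())
```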
Okay, so now what I'm going to do is modify this for causal inference. I'm going to change the estimator, the in-sample goodness of fit, and the out-of-sample goodness of fit. Again, the causal framework uses the potential outcomes notation. What we're trying to do is predict an individual's treatment effect, but as a function of their observables: $\tau(x)$ is the difference between the average outcome if you were treated and have attributes $x$ and the average outcome if you were not treated and have attributes $x$. That treatment effect function $\tau(x)$ is the object we're after, for the motivations I already gave.

If I wanted to just go off the shelf and apply machine learning to this problem, let me give you the two most obvious approaches; not surprisingly, since they're fairly obvious, people have taken some attacks on these before. The first thing you could do is say: well, I've got a treatment group and I've got a control group, let's analyze them separately, and if it's a stratified experiment or if there's selection on observables, so that your probability of being treated depends on your attributes, we'll use propensity score weighting, which is a very standard method in the social sciences. We'd do within-group cross-validation to choose the tuning parameters. So, for example, I would build a tree for treatment outcomes, which says how the treated outcomes depend on $X$, I'd build a tree for control outcomes to see how the control outcomes depend on $X$, and I'd take the difference. The second approach is: let's just build one big model. What we're trying to model is $\mu$, the average outcome as a function of whether you were treated and your covariates, so let's build one big tree, or one big lasso, or one big random forest, or one big whatever, and then use that function to estimate the effects. Of course, if you're building a tree and there are parts of the tree where you don't even split on the treatment, then you get an estimate of exactly zero for the treatment effect there.
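Roughly, those two off-the-shelf baselines might look like this (a sketch assuming scikit-learn and a randomized binary treatment indicator `w`; with a stratified design you would add propensity score weights, which are omitted here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def two_model_effect(X, y, w, X_new):
    """Approach 1: fit separate outcome models on treated and control units,
    then predict the effect as the difference of their predictions."""
    m1 = RandomForestRegressor(n_estimators=200).fit(X[w == 1], y[w == 1])
    m0 = RandomForestRegressor(n_estimators=200).fit(X[w == 0], y[w == 0])
    return m1.predict(X_new) - m0.predict(X_new)

def single_model_effect(X, y, w, X_new):
    """Approach 2: fit one big model of the outcome on (w, X) and predict the
    effect by flipping the treatment indicator. If the model never splits on w
    in some region, the estimated effect there is exactly zero."""
    m = RandomForestRegressor(n_estimators=200).fit(np.column_stack([w, X]), y)
    ones, zeros = np.ones(len(X_new)), np.zeros(len(X_new))
    return (m.predict(np.column_stack([ones, X_new]))
            - m.predict(np.column_stack([zeros, X_new])))
```

Note that in both cases the cross-validation inside the fitted models is still targeting outcome prediction, not the treatment effect, which is exactly the criticism that comes next.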
Now I'd say these are fine methods, better than what we've been doing in many cases if you have a big data set, but they're not optimized for the goal: you haven't actually chosen the complexity of your model to optimize the trade-off between complexity and prediction of the treatment effect. And I should say that with two trees or something like that, you're going to have a much more complex model than you need, because you might have a very different partition for the treatment group than for the control group. But the real fundamental problem is that the off-the-shelf machine learning methods are all built around the following idea: for each observation in my test sample I know the ground truth. In my test sample I see your outcomes and I see your covariates, so I know what I should have gotten. But of course the fundamental problem of causal inference is that I don't know the ground truth for anybody. So I cannot take this directly off the shelf; I've got to find some other way to do it. And that's probably one reason why people have started with these methods that turn the causal inference problem into a prediction problem: we know how to predict, but we don't know how to solve this other problem.

Some people have looked at this. There are a bunch of people who have done things in the spirit of the single tree and the two trees I just described, and there are a few people who have tried, say, splitting trees on treatment effects, but none of those, when they go to cross-validation (if they do cross-validation), avoid going back to prediction. They say: I'm going to split a leaf if the treatment effect is different in the two leaves, but when I see how well my model did, I'm still going to look at how well I predicted outcomes, because after all outcomes are what I observe; I don't observe causal effects. So we're going to come up with a couple of methods to solve that problem. What you're going to see is that there's no one right answer, and I'll preview that even in our simulations and applications one right answer does not emerge; there are trade-offs, which is why there's more work to be done here. But I'm going to start with a simple proposed approach. It's not going to turn out to be the best approach, but it actually solves this problem, and it's very close to propensity score weighting, though a little bit different.

Let me start with a very simple example. Suppose I have a 50/50 randomized experiment between treatment and control. I'm going to define a new variable: two times your outcome if you were in the treatment group, and minus two times your outcome if you were in the control group. I'm going to call this $Y^*$, and it's defined for every individual; each individual in my population has a $Y^*$. This $Y^*$ has the lovely property that it is actually an unbiased estimate of the treatment effect. Now this seems a little bizarre to start with: I'm saying that one person's outcome can give me an unbiased estimate of a treatment effect, but it's mathematically true. It's only really useful if I take an average over multiple people. Why is that? Well, if I take the average of this transformed outcome, I get the positive outcome for the treated guys, who are half the guys, and the negative outcome for the other half, and multiplying by two I'm back to my treatment effect. This is very close to propensity score weighting; here the propensity score is one half. But the key thing, a small difference that is very important, is that I'm combining them: I'm taking one single variable for the whole population, this $Y^*$, and what that gives me is a ground truth for every individual. It's not a very good ground truth for anybody in particular; your transformed outcome is not a good measure of your own treatment effect, but it's unbiased for any subpopulation. So I'm going to use that as my ground truth. It has a couple of defects, but the beauty of it is that if you wanted to tell your student to do this, they barely have to write any code: line one is $Y^*_i = 2Y_i$ if $W_i = 1$ and $-2Y_i$ if $W_i = 0$; line two is apply lasso, random forests, trees, whatever, in R, and everything that follows comes off the shelf. You can generalize this: if you have a non-50/50 trial, a stratified experiment, or selection on observables, then instead of two you're going to have something like a $p(X)$.

So, to be formal, our first approach, which is an estimation and cross-validation approach, we're going to call the conventional tree with the transformed outcome. The estimator within a leaf is the sample mean of the transformed outcome. The in-sample goodness-of-fit function is the squared deviation from the transformed outcome. Of course every single person's prediction is going to be wrong: I'm going to estimate a treatment effect for Leon, and his transformed outcome is either $2y$ or $-2y$, so it's not going to be very good for Leon, but on average this criterion is an unbiased estimate of the goodness of fit. And out of sample I use the same criterion. So again, I don't actually have to write any new code for this; I just transform the outcome and plug it in, and everything else is the same.
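As a concrete sketch, the transformed outcome and the "plug it into any off-the-shelf learner" step might look like this in Python (assuming scikit-learn; the general formula with a propensity score $p(x)$ in the docstring is the standard inverse-propensity-weighting form, reconstructed rather than quoted from the talk):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def transformed_outcome(y, w, p=0.5):
    """Transformed outcome Y*: with a 50/50 trial (p = 0.5) this equals
    2*y for treated units and -2*y for control units. More generally,
    Y* = y * (w - p(x)) / (p(x) * (1 - p(x))), whose conditional mean
    is the treatment effect tau(x) under selection on observables.
    p may be a scalar or an array of unit-level propensity scores."""
    return y * (w - p) / (p * (1 - p))

# "Line two": feed Y* to any off-the-shelf learner; its predictions are then
# estimates of tau(x), the treatment effect, rather than of the outcome itself.
def fit_transformed_outcome_tree(X, y, w, **tree_kwargs):
    y_star = transformed_outcome(y, w)
    return DecisionTreeRegressor(**tree_kwargs).fit(X, y_star)
```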
Now, there's a really obvious critique of this. It was nice that I didn't have to write any code, but my estimator of the treatment effect within a leaf is kind of a dumb estimator, because within a particular leaf you might not have a 50/50 split even if it was 50/50 in the population. If I gave you a sample of ten people and asked for the treatment effect, and six were treated and four were control, you would take the average of the six against the average of the four. But my method is not going to adjust for the fact that it was 60/40; it's going to pretend it was 50/50. So it's kind of a stupid estimator: nice that I could plug it in off the shelf, but not a good estimator. What I do in my next approach is say: let's have a smart estimator of the average treatment effect within a leaf. I'll just take the sample average treatment effect within the leaf as my estimator, and if there's selection on observables I'll use propensity score weighting. Everything else stays the same. Now I do have to write some code, because I have to change what happens within a leaf; it's not hard to code, but it's not a command in R. It gives a more sensible estimate, though.

The last thing we looked at a bit more is what other methods we can use to examine goodness of fit, in sample and out of sample. Just to remind ourselves, the infeasible goodness of fit would be the squared difference between your actual, unobservable treatment effect and your predicted treatment effect. We can expand out that infeasible criterion: it's a square, so we get the square of the truth, the square of the estimate, and the cross term. The square of the truth is of course the same across all candidate models, so it doesn't matter for comparing them; what really matters is the square of the estimates and the cross term between the estimate and the truth. So we can think about various ways to estimate that cross term, both in sample and out of sample. One thing you could do is use matching: this unobservable term is something that could be estimated by matching. We think that's computationally too costly to do inside a cross-validation loop, but when we compare the performance of models out of sample at the end, we might use such a criterion. Another observation, very analogous to classification problems in machine learning, is that in sample we know that our estimates of the treatment effect within a leaf are unbiased and constant within the leaf, so the expected value of the product of the estimated treatment effect and the truth is just the square of the estimated treatment effect, since in sample you've fit the mean exactly. What that tells us is that we can use an in-sample goodness-of-fit criterion that is just the square of the estimator, and this is very analogous to using the Gini coefficient to split in classification problems. It says a better estimator is one with a higher variance of predictions, one that discriminates better between parts of the covariate space. So for our last model we propose using, as the in-sample goodness-of-fit measure analogous to the Gini coefficient, just the variance of my predictions, which is to say I'm going to reward predictors that discriminate well in terms of treatment effects, that say these guys have high effects and these guys have low effects. That's actually going to perform better in practice, because it has lower variance than the transformed-outcome criterion.
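Spelled out, the decomposition being described is roughly the following (a sketch using the notation from earlier; $\hat\tau(X_i)$ is the model's predicted effect and $\tau_i$ the unobservable true effect):

```latex
% Infeasible criterion and its expansion; only the last two terms vary across models.
\begin{align}
  \frac{1}{N}\sum_i \bigl(\tau_i - \hat\tau(X_i)\bigr)^2
    = \underbrace{\frac{1}{N}\sum_i \tau_i^2}_{\text{same for all models}}
    \;-\; \frac{2}{N}\sum_i \tau_i\,\hat\tau(X_i)
    \;+\; \frac{1}{N}\sum_i \hat\tau(X_i)^2
\end{align}
% In sample, with leaf estimates that are unbiased and constant within each leaf,
% the cross term has the same expectation as the squared estimates, so a feasible
% in-sample criterion is to reward a large (1/N) * sum_i hat-tau(X_i)^2, i.e. a
% high variance of predicted treatment effects -- the Gini-style analogue above.
```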
So those are the five approaches. They're going to be similar if the treatment effects and the levels are highly correlated, that is, if people have a high treatment effect when they have a high outcome. But some of them will do very badly if, for example, outcomes have a lot of variance but treatment effects are relatively stable, and that's actually a very common setting in the world: some people have a high chance of dying and other people have a low chance of dying, but the differences in treatment effects are not as big. We have a couple of ways to compare the approaches: we can do simulations with an oracle, we can use our transformed-outcome goodness-of-fit measure, and we can use matching to estimate the infeasible goodness of fit. In our paper we explore in simulations which ones do better, and generally the fifth method is usually the best, but you can construct examples where others are better.

Finally, for inference, an attractive feature of trees is that you can easily separate the construction of the tree from treatment effect estimation. A tree constructed on the training sample is independent of the sampling variation in the test sample. Just to repeat: we construct a partition using the attributes, and we estimate treatment effects within each leaf. You can think of the tree construction as just a function of the covariates, saying how to split my sample, and if I estimate the tree on one sample and do inference on the other sample, I get valid standard errors. We can't do it within the training sample, because in the training sample you will find leaves where, due to sampling variation, the treatment effect is really big, and that would be the fishing problem: I find three guys with a high treatment effect and I call that a big treatment effect. I can't do standard errors based on that, because I chose the leaf exactly because those people had high treatment effects. In the test sample that's not a problem, and it's very simple to prove that you get valid standard errors if you build the tree on a training sample and do your inference on the test sample. That actually implies that on the test sample you could also use a different method of estimating treatment effects within a leaf, maybe one that is more computationally costly, because once you've figured out the leaves of the tree you can do whatever you want within each leaf on the test sample; you could use matching, you could use something more computationally expensive if you wanted to. I'll just point out that the literature showing that propensity score weighting works in observational studies does require additional conditions: for example, the leaves can't get too small; leaf sizes need to be bounded relative to the size of the sample for the standard errors to be correct if you're using propensity score weighting. If it's a simple experiment, it's just a standard comparison of means.
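A minimal sketch of that sample-splitting recipe for a randomized experiment, again assuming scikit-learn and the hypothetical `fit_transformed_outcome_tree` helper from the earlier sketch; the leaf-level estimate here is the simple difference in means with its usual standard error:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def honest_leaf_estimates(tree, X_test, y_test, w_test):
    """Given a tree built on the training sample only, estimate the treatment
    effect and its standard error within each leaf using the held-out sample.
    Because the partition never saw the test data, these standard errors are
    valid (simple randomized-experiment case: difference in means per leaf)."""
    leaves = tree.apply(X_test)  # leaf id for each test observation
    results = {}
    for leaf in np.unique(leaves):
        m = leaves == leaf
        y1, y0 = y_test[m & (w_test == 1)], y_test[m & (w_test == 0)]
        if len(y1) < 2 or len(y0) < 2:
            continue  # not enough treated/control units to estimate in this leaf
        tau_hat = y1.mean() - y0.mean()
        se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
        results[leaf] = (tau_hat, se, m.mean())  # effect, std. error, share of sample
    return results

# Usage sketch: build the tree on one half, do inference on the other half.
# X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(X, y, w, test_size=0.5)
# tree = fit_transformed_outcome_tree(X_tr, y_tr, w_tr, min_samples_leaf=100)
# print(honest_leaf_estimates(tree, X_te, y_te, w_te))
```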
So we have an application to search; I think I'm probably out of time to go through it, but just really fast: here's a picture of what a tree looks like. At each bottom node of the tree, each leaf, we get an average treatment effect, a standard error, and the proportion of the sample, and what you can look at is all the leaves, with the effects, standard errors, and proportions in the training and test samples. As expected, the variance of effects is higher in the training sample, because the training sample is building leaves that respond to sampling variation; the test sample has a lower variance of treatment effects, and it gives you valid standard errors. We find things like: leaf 3 has a very low effect; that's when you searched for celebrities, and it turns out you don't click on the organic links very much, you click on the pretty picture of Britney Spears, so it doesn't matter how you order the organic results. Whereas if you have very purely informational queries, you have big effects of the ordering of the links.

So just to conclude: the keys to our approach are, first, to distinguish between the causal and predictive parts of the model; then to try to use the best of both worlds, taking the really great things from the supervised machine learning literature, combining them with the state-of-the-art techniques from the causal inference literature, and coming up with standard errors, which is what the causal inference people are looking for. And I do think that in practice people should use this. We have a bunch of randomized drug experiments where we haven't actually gone out and spent the money to prove that a drug works for a certain population, and I think it would be really important, and would save some lives, if we went back to those studies with a systematic and valid approach to use that data to figure out whose lives the drug saves. Thanks.

[Applause]

[Audience member] Really interesting work. One of the things that came up here is this idea that you have the wrong loss function if you're doing causal inference and just applying a standard loss function, and to me that evokes the work on targeted maximum likelihood estimation: that you need to construct a different loss function for the quantity of interest, and then do your cross-validation with the normal loss function but add some kind of targeting step to actually focus on that loss function, which would usually involve making the bias-variance tradeoff differently so as to have lower bias for that particular causal quantity of interest. That's the Mark van der Laan line of work. So I'm wondering how this compares to that, and whether that machinery has the advantage that I can use a big ensemble of any machine learning techniques I want, as opposed to constraining myself to, say, trees.

[Athey] Sure, two points. Absolutely, in a longer talk that's the next related-literature slide, so it's very similar in spirit. And actually, I should have emphasized this more: there's no constraint to trees, especially with the transformed-outcome approach; any method will work. Here we're trying to focus on getting the heterogeneous treatment effect approach really optimized for that problem, but I would say they're similar in spirit; maybe we can follow up offline about the details.

[Audience member] It would be interesting to see a direct comparison with those methods, especially since there are some provable efficiency guarantees for TMLE.

[Athey] That's right. So here, in some sense, we're going to be able to
use efficient estimation, in the sense that we'll have efficient estimators within leaves, but the efficiency of the entire algorithm is not...

[Applause]
Info
Channel: NAS Colloquia
Views: 8,743
Keywords: National Academy Of Sciences (Membership Organization), sackler colloquia, Susan Athey (Academic), Big Data (Industry)
Id: L72E08QsyMc
Length: 41min 0sec (2460 seconds)
Published: Thu Apr 02 2015