Susan Athey: Counterfactual Inference (NeurIPS 2018 Tutorial)

Captions
Welcome to the tutorial on counterfactual inference. It's a great pleasure to introduce Professor Susan Athey, who is the Economics of Technology Professor at the Stanford Graduate School of Business; she previously taught in the economics departments of MIT, Stanford, and Harvard. She has had many awards and accolades, and I won't cut into her time by telling you about all of those, but she was elected to the American Academy of Arts and Sciences in 2008 and the National Academy of Sciences in 2012. Her research focuses on the economics of the Internet, online advertising, the news media, marketplace design, virtual currencies, and the intersection of computer science, machine learning, and economics. I think there's a lot of interest in causality and counterfactuals, so it's great to have her here. Just before we start, there are two things we need to handle administratively. One is that there will be a break at about 55 minutes, a 10-minute break, and we'll want you to come back afterwards. The other is that I think it's probably best if we leave the questions to the end; when there are questions we'll have to use the microphones here, because this is a large room. But I'll let Susan get on with her tutorial.

Thank you so much for having me here. It's a real honor to be able to give the tutorial to such a terrific and large audience. One of the things that I'm finding as I go out and speak about this topic is that in some ways it's gotten harder to give tutorials and lectures over the last few years, because when I started working on this topic, maybe eight years ago, from a machine learning perspective, there was a sense that only a very small part of the machine learning community was really thinking about counterfactual inference, or really even thinking about the words, and very little of the economics and social science community was thinking about machine learning. So if I went to an audience that was mostly one or mostly the other, I could pretty much count on people knowing a lot about one thing and not so much about the other. One of the really exciting things that's happening now is that a lot of these literatures are coming together, and there's a lot more interest and knowledge out there. So today I'm going to try my best to hit a little bit of both worlds. I'm going to start from the beginning and talk at a high level to people who aren't that familiar with the topic, because I hope some people have come because they're curious, and I'm also going to try to give some insights to the more advanced folks, and we'll see how that goes. One thing, if you're interested in more: I haven't given that many references throughout my slides, but when I post them they will have some references at the end. For those who are interested, I did a two-day lecture series for the American Economic Association, and there are videos and slides available, as well as two days of lecture notes from a much longer tutorial. I also have links to several survey papers there. So if you're interested, you can google "Susan Athey AEA", that's the American Economic Association, and you'll find all that material there, and I would love to answer any questions via email as well.

I've been working myself on using data to answer counterfactual questions for most of my career. I started working on auctions, actually, when I was about 17, in 1987, and I was an undergraduate.
At that time nobody really cared about auctions, and even in economics we were relatively early in using data to do counterfactual estimation, and that was a very exciting time. I came to machine learning only around 2007 and 2008, when I went to work as consulting chief economist for Microsoft for a few years, particularly focused on the Bing search engine. There I met a fantastic set of collaborators from the machine learning community, from Microsoft Research, the engineers building the search engine, and so on. One of the really interesting things that I encountered was that I felt we were actually needing to solve a lot of counterfactual problems, yet a lot of the folks from the machine learning community didn't really have a good language to talk about that. So at the beginning we had a lot of culture clashes, with people insisting one thing was possible while I said, well, I thought that was impossible without running an experiment, and we had difficulties communicating. But those early debates and interactions really bore fruit later, as working with some of the smartest people in the world, all attacking an applied problem and really trying to get to answers, helped us all learn from each other, and it really embarked me on a new research agenda. I had been more on the applied side, really using data to help make decisions and to do counterfactual estimation for specific problems, but I felt like this whole area was just so wide open. Like many of you, the world changed and we all dropped everything to come and think about machine learning, and in my case the intersection of machine learning and causal inference. This has gone from being a very small literature to a much bigger one in the last few years; it's growing very quickly, and so even my survey of a year ago is already getting out of date.

So let me start with a big motivation for people who think about artificial intelligence broadly. I want to talk about two sets of issues that I think make causal inference and counterfactual reasoning must-know, must-understand concepts for artificial intelligence. The first is that we are all seeing that there are some gaps between what we're doing in our research and what firms are applying. Of course there are all these amazing success stories, Google Images and so on, but actually even at the top tech companies there's maybe less adoption of machine learning and artificial intelligence in some systems than you'd expect, given that all the people there are inventing it and embrace it. When we think about some of the impediments to adopting machine learning and artificial intelligence in applications, there's a whole grab-bag of issues, and each of these issues is the subject of tutorials and workshops in and of itself, but to me they're all pretty interrelated. Typically, before a firm is going to dump their old simple regression credit-scoring model and put in one based on a black-box machine learning algorithm, for example, they're going to have some reservations. They're going to worry about what happens when they use a black-box algorithm. A first thing that comes up is: is this interpretable? How would I understand whether it was working and why it was working? And you might say, well, why do they need to understand it? Because
they're going to make decisions, and it might actually take five years for the loans to be repaid or not repaid, so it's going to take some time to really get feedback on whether they work. One reason that firms and economists historically used very simple models is that we were trying to come up with things that we could understand, because we were typically in a data-poor environment, and we were in an environment where it was pretty hard to know just by looking at the data whether we were right or wrong. So if something is interpretable, you can reason about it. If you were an applied economist, what you would do with a research paper, an empirical paper (you'll be really amazed by this) is that we would often spend four or five years getting a paper published. We might present it 25 times and sit in a room of smart people for an hour and a half while everybody beat up our models and tried to reason about whether they made sense, and then we would improve them, think about implications, and ask what the model would say in this edge case and that edge case, to see if it was reasonable. That kind of reasoning is harder to do with a black box. Closely related to that are issues of stability and robustness. If we use Twitter data today to decide whether someone is creditworthy, that may be difficult, because maybe the people who were on Twitter three years ago are different from the people who joined Twitter during the presidential election, and maybe how your Twitter usage is changing is a different signal, year by year, of what kind of person you are. So things like stability are also important for applications, and that again is related to robustness. A similar concept is transfer learning: we might estimate a model in one setting and want it to apply in another setting. There's a very exciting machine learning literature around fairness and discrimination, and again, with black-box models we have to figure out ways to assess them and see if they are discriminating, or if they're unfair in certain ways, or if they're making generalizations that we're uncomfortable with. And I think there's a bigger picture. We're building a big initiative at Stanford on human-centered artificial intelligence, which I'm involved in organizing and leading, and one of our missions is to create more human-like artificial intelligence. But to me, what does it mean to be human-like? One big aspect is that you're going to make reasonable inferences and reasonable decisions in scenarios that you haven't seen before, and again, that's going to be part of causal inference and counterfactual reasoning. So my argument is going to be that all of these desiderata are satisfied, sort of by design, in a causal model. Now I'm not going to argue that all problems can be solved, or that I can solve all problems, because it's actually really hard to implement a causal inference framework and to get credible estimates. But the causal inference framework gives a framework for understanding, and at least a guideline for how you would address all of these issues. In a causal inference framework, the stated goal is to learn a model of how the world works. That can be something simple, like what happens to your body if I give you a drug; that's a simple treatment, a simple intervention. It can be like a production function, where I'm trying to understand the mapping from inputs to outputs of a firm. It can be
something like understanding what would happen if I raised the price: which consumers would change their behavior? And it might even be something as complicated as doing inference about what happens if I change the rules of an auction. That's something we thought about at the search engine: what would happen if we moved from a generalized second price auction to a Vickrey auction? In my early career I worked on timber auctions and looked at comparing first-price sealed-bid auctions to open ascending auctions and second-price auctions. So we might want to do inference about what happens when we change the rules of the game, and indeed economists have applied these techniques not just to timber auctions but also to Treasury bill auctions and online advertising auctions. These are the kinds of questions we want to answer. Now, within economics, within empirical economics, I would say 95 percent of the empirical work is in the causal inference framework; almost all the work done in that field is trying to get causal effects. And even though everybody agrees on that goal, there are huge arguments about how to approach it and whether any particular paper achieves it, and that's partly because you can't just look at the data and see whether you got the right answer. For causal inference, it's almost always assumptions, unless you have repeated experiments, and of course that's something we look for: ideally you would have a world with lots of experiments, so you could use data from some experiments and see how you do on other experiments. But you don't always have that scenario, especially if you've never done something. One of the things we spend a lot of time worrying about is that the impact of an intervention can be context-specific. We worry about external validity: I might learn something in this setting, but it doesn't generalize well to other settings. Now, that's not a problem with the framework, because in principle you can write down all the things about the context that affect whether the drug works, or whether giving children bed nets or lowering class size is a good idea, but in practice it's very hard to estimate all of those things. So again this comes to be a practical problem, but at least we have a language and a formalism to talk about what goes wrong when we try to generalize. Our model would map context and interventions to outcomes, and we would then have a formal language to separate correlates from causes. Think about something like gender. If you say, okay, here I am, a 47-year-old woman with three kids, how much do I know about artificial intelligence? If that's all you know about me, your guess is probably not that much; your average 47-year-old woman with three kids in the United States probably doesn't know that much about artificial intelligence. Although I would say the border patrol officer at the airport asked me if I was a natural language specialist, and I said no, I was doing causal inference, and he complimented me for being original, so maybe everybody knows something now. But of course we have lots of covariates that can be predictive of people's attributes in the absence of other information. You might want to think about what would actually cause me to know something about this subject, and it might be my computer science degree from undergrad, it might be all my
publications and the research that I've done; those are the real causes of my expertise. The ideal causal model is by definition both stable and interpretable, because it is a model, and when we write down a causal model we are giving it an interpretation. When I say here's a drug and I want the effect of the drug, that's interpretable; it's a well-defined mathematical object. When I show you a number and say, hey, this drug makes you ten percent better, you can argue about whether I've actually measured that properly, whether I have an unbiased estimate, and whether there were confounders and compliance issues and all sorts of other problems, but conceptually the thing I'm estimating, my estimand, is perfectly interpretable; it's just a question of whether I've actually succeeded in estimating it in practice. Transferability: in principle, if I understand the impact of a treatment and I understand how the context changes the effect of the treatment, then I can transfer it to any new environment. And again with fairness, many aspects of discrimination relate very closely to correlation versus causation and the fact that we're doing statistical discrimination. As human beings we're all Bayesian, right? As Bayesian human beings we update: if we only have a few pieces of data we draw inferences based on that data, and our models are drawing inferences from data as well, and if there are too many covariates and we do some regularization, we're going to load up on some attributes even if those aren't actually causal. So I would argue that, in some sense, a lot of the world's problems could be solved if we could all just estimate causal models all the time. But as I mentioned, I'm not going to argue that I or any of my friends can solve all of your problems, because it's actually really, really hard to do causal inference, and all sorts of challenges remain. Probably the biggest one, and the one we've struggled with for decades in economics, is that we just don't have the right kind of variation in the data. Until recently, firms didn't run many experiments, so if we wanted to understand what would happen if a firm changed its price, an economic consultant might go to a firm and say, all right, show me your data, and they'd say, well, we've kept the price of our cereal constant for the last three years. Okay, so how am I going to figure out what would happen if you changed the price? That can be really difficult if they've never changed the price in the past. So there might be a lack of variation at all, or it might be that the variation they do have is due to confounders. Suppose I wanted to understand the impact of a hotel raising its price. I could go out and count people going into hotels, and I could scrape prices from hotels.com to see what the prices were, and I would see prices change and I would see the number of people in the hotel change. But what I would also see is that when the prices are high, is the hotel full or empty? Full, right. What are prices like right here, right now? They're high, because you all filled up the rooms. And that's because the hotels are using algorithms to set their prices, and in particular, when it looks like there's a big conference coming to town, of course there are discounts and so on for some people, but your regular business traveler who wanted to stay here this week is paying a very high price.
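A tiny simulation of the point this example is building toward (all numbers hypothetical, not from any real hotel data): when an unobserved demand shock such as a conference drives both the algorithmic price and the occupancy, the naive regression of occupancy on price comes out with the wrong sign relative to the assumed causal effect.

```python
# Hypothetical illustration of the hotel example: an unobserved demand shock
# (say, a conference in town) raises both the algorithmic price and occupancy,
# so the naive regression of occupancy on price recovers a positive slope even
# though the assumed causal effect of price on occupancy is negative.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
demand_shock = rng.normal(size=n)                              # conferences, seasonality (unobserved)
price = 100 + 30 * demand_shock + rng.normal(scale=5, size=n)  # pricing algorithm tracks demand
true_effect = -0.2                                             # assumed causal effect of price
occupancy = 50 + 40 * demand_shock + true_effect * price + rng.normal(scale=5, size=n)

naive_slope = np.polyfit(price, occupancy, 1)[0]
print(f"naive slope of occupancy on price: {naive_slope:+.2f} (assumed true effect: {true_effect:+.2f})")
```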
So there's variation in the data, there are lots of price changes in the data, but it's not the right kind of variation. It's not experimental variation; in particular, it's moving more or less one-for-one with demand, and so it's hard to know what would happen if I changed my algorithm to set a higher price in every state of the world. We also typically have a hard time observing all the context, or getting enough data about the context, in order to understand treatment effect heterogeneity, and analysts often lack knowledge about the model. Again, this is a problem we struggled with in economics for the last few decades: we were often in a data-poor environment. We've had people working on problems like dynamic games between firms, or dynamic decisions that workers make about unemployment, unemployment insurance, or whether to take a job. We've been building these models for decades, and they've worked pretty well; this is inverse reinforcement learning, and it turns out several of the components of AlphaGo map right onto techniques that were used in the 1990s in economics. But those methods didn't change the world, and firms didn't replace their store-opening managers with AIs, because it was actually pretty hard. Our problem wasn't that we didn't know the right techniques; the problem was that we didn't have enough data, and we didn't have enough variation in the data, to really understand what would happen if Walmart opens a store here and doesn't open a store there, if Target opens a store here, and so on. So the problems are really not conceptual problems; the conceptual framework is solid. The problem is how we implement it in practice. Now, the very exciting thing is that we have firms running tons of experiments, and we're interacting with humans in a digital environment, so we're creating environments with lots and lots of variation, lots and lots of data, lots and lots of contexts. The reason this got really exciting for me is not because there was some groundbreaking new philosophy, since we've had the philosophy for a very long time; what was exciting was, wow, we can actually do something with it outside of a few very special circumstances, we can do something credible. At the same time, I'm much more optimistic about the ability of all these techniques to work in settings that have lots of experiments and lots of randomization, like the tech-firm settings, than in settings that don't. So building an AI to run the central bank, or to make opening and closing decisions for Walmart, I think is much, much further off than some of the other problems. So now I'm drilling a little more deeply into true AI algorithms, and I'm going to use contextual bandits as an example in this talk, because that's a very simple, tractable example of a very simple artificial intelligence algorithm. A multi-armed bandit is balancing exploration and exploitation to try to figure out which arm is the best to pull, which is the best treatment arm, and a contextual bandit understands that in different states of the world there might be different optimal treatments. One application I'm looking at is recommending charities: getting people to give to charity when they come to check out with PayPal.
In that case, the contextual information might be which websites they were coming from, and maybe something about their purchase history, and that might help me recommend a charity, as well as a motivational message, to the person, and I'm going to want to learn online, over time, what the best arm to pull is. So that's a simple example; of course reinforcement learning, the robots climbing walls, the agents playing video games and so on are more complex examples. But in all of these cases, what the AI is doing is selecting among alternative choices, and it therefore must have an explicit or implicit model of the payoff from the alternatives, and that's a counterfactual model. It's not surprising that the people working in the contextual bandit literature in particular have been closer to the statistical literature on causal inference than some other parts of machine learning, because you recognize that if you're running experiments, you're thinking about counterfactual reasoning. Of course, in the initial phases of learning you have limited data, and a lot of the theory in this literature is just: okay, n goes to infinity, everything is good, we've learned the best thing eventually. But if I'm running survey experiments, or I'm trying to get people to vote, or whatever, I often don't have enough data; treatment effects are often pretty weak, and I'm often very underpowered. So you really do care about what happens in the initial stages, and it's important to be a good statistician. I would argue that inside every AI of this type is a statistician performing counterfactual reasoning. You haven't necessarily written it down or spelled it out, but that must be what you're doing, unless you just programmed in all these heuristics to start with. An AI that's exploring and learning about the world is going to be building a model from its past data and drawing inferences about the best action to take in the future, and I think it's then sort of tautological that you would rather your AI be a good statistician than a bad statistician. Of course, at the beginning we're going to be really excited about all of our AIs that are simple statisticians (I won't call them bad statisticians, but they are simplistic statisticians), because they can do really cool things: they climb over walls, they play video games, they run mazes, and so on. But eventually we're going to hit some diminishing returns, and then for the next round of improvements, being a good statistician is going to be very important. Indeed, I've seen in practice, in applications of bandit algorithms, that they can go off the rails due to biases that are very predictable if you think from the perspective of causal inference. Basically, in a region where you think arm A is good, you're going to pull arm A a lot, and in a region where you think arm B is good, you're going to pull arm B a lot, and then as a statistician you're going to end up with biased data, because you're getting high outcomes for arm A in the region where arm A is good, and high outcomes for arm B in the region where arm B is good, and you're not necessarily going to extrapolate well to the rest of the distribution. Luckily we have great techniques for dealing with that, but they haven't been adopted so much, so that's something that I've been working on with coauthors.
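A small simulated sketch (hypothetical numbers, not from any of her applications) of the bias just described, and of the inverse propensity weighting idea that the reweighting discussion later in the talk formalizes: when each arm is pulled mostly in the region where it looks good, the naive per-arm average overstates that arm's value over the whole population, while weighting each observed reward by the inverse of the probability the arm was pulled recovers it.

```python
# Hypothetical sketch: a bandit-like policy pulls arm A mostly where A looks good,
# so the naive average reward among A-pulls is biased upward; inverse propensity
# weighting (dividing by the probability A was pulled given the context) recovers
# arm A's average reward over the whole population of contexts.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(-1, 1, size=n)               # context
reward_a, reward_b = 0.5 * x, -0.5 * x       # arm A is better for high x, arm B for low x
prob_a = np.where(x > 0, 0.9, 0.1)           # policy: pull A 90% of the time where A looks good
pulled_a = rng.uniform(size=n) < prob_a
reward = np.where(pulled_a, reward_a, reward_b) + rng.normal(scale=0.1, size=n)

naive_a = reward[pulled_a].mean()            # roughly +0.2 here, although the true mean is 0
ipw_a = np.mean(pulled_a * reward / prob_a)  # roughly 0, arm A's population average reward
print(f"naive estimate for arm A: {naive_a:+.3f}   IPW estimate: {ipw_a:+.3f}")
```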
All right, so in the next little section of my talk I'm still going to be very conceptual, and I'm going to talk through a couple of different types of counterfactual inference. I should say that this is one thing that gets very confusing, especially in interdisciplinary audiences, because you say the words and everybody thinks they mean something specific, but perhaps something different. We have a good history of that in economics, because we've been having arguments, huge clashes, wars, debates, fights between the leaders of different approaches to causal inference over the last decades. Just because you agree that you want to think about causal reasoning doesn't mean that you agree about how you should write it down, whether you should use equations or this notation or that notation, and it doesn't mean that everybody agrees on what the most interesting questions are. I think a lot of that noise can be dealt with by realizing that there are lots of different questions, that there are usually multiple ways to write down the same thing, and that we just need to be a little more precise about what we're talking about; there's really no reason to fight about mathematics, we just have to write down correct mathematics. So the first type of counterfactual inference, which is very popular in the social sciences and biostatistics, is what we call in economics program evaluation, or more broadly, treatment effect estimation. The examples of the kinds of questions we've been worrying about for a long time are things like: what was the impact of raising the minimum wage? There's a very famous paper from several decades ago looking at the impact of changing the minimum wage, comparing New Jersey to Pennsylvania, where you try to use one as a control for the time trend and the other to understand the impact of the minimum wage. These types of problems have been with us for a very long time, even in a world where we didn't have a lot of data. Training programs have also been a really important and active area, and there are big literatures on the effect of reducing class size for kids. In the more modern economy, one of the biggest applications of this literature is advertising effectiveness. There's an economist friend of mine at Netflix who's doing a ton of this; basically Netflix is amazingly good at estimating the return on investment of their advertising, using a lot of randomized experiments and also bringing together observational data with experimental data. So we have questions like: did the advertising campaign work, and what was the ROI? These are actually very hard questions because, as some of my colleagues pointed out in a research paper, most advertising experiments run by large firms are underpowered. That means you don't have enough data to even tell if the campaign worked, even if you had a perfectly designed experiment. If you're lucky you can just tell whether the advertising worked, but it's often almost impossible to tell whether there was a positive return on investment, whether it was worth the money. So we really need to be careful with our statistics on those problems, because the signal-to-noise ratio is such a problem.
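A back-of-the-envelope power calculation (hypothetical numbers) of why those advertising experiments end up underpowered: detecting even a 1 percent relative lift on a 5 percent baseline conversion rate, at conventional significance and 80 percent power, takes on the order of a million and a half users per arm.

```python
# Hypothetical power calculation: sample size per arm needed to detect a 1%
# relative lift on a 5% baseline conversion rate (alpha = 0.05, power = 0.8).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, lifted = 0.05, 0.05 * 1.01
effect = proportion_effectsize(lifted, baseline)     # Cohen's h for two proportions
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"required users per arm: {n_per_arm:,.0f}")   # on the order of 1.5 million
```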
Another really active area, in the political science community, is get-out-the-vote campaigns, and again since the 80s people have been running these large-scale experiments. They used to do big mailings, and there was a famous study where they sent people letters (your voting record in the U.S. is public) saying, we're going to tell your neighbors whether or not you voted, and they compared the effectiveness of that to other types of messages and found that telling people they were going to tell their neighbors whether or not they voted was a very effective way to get people to vote. So we would want to estimate, first of all, whether these campaigns work, and the more modern versions, which really are just getting rolled out in the last couple of campaigns, are more personalized policies; these campaigns are starting to use things like bandits to learn the best policies. And what's an optimal policy for assigning workers to training programs? So in this world, the goal is to estimate the impact of an intervention, and this literature generally focuses on low-dimensional interventions. There are extensions to more complex cases, but something like 95 percent of the literature is about a binary treatment, like a drug, or raising the minimum wage, but one change in the minimum wage. The estimands, the things that people are interested in learning, are things like the average effect: did this thing work? The more sophisticated, more recent versions look for heterogeneous effects, for whom did it work, because that will tell me, in the future, who I should send a mailing to to try to get them to vote. And more recently there's a lot more emphasis on estimating optimal policies: policies mapping from people's characteristics to their assignments. A big emphasis in this whole literature is confidence intervals. Because we can't observe the ground truth, and because most treatments have pretty noisy effects and experiments are expensive, we are typically really worried about sampling variation. We're typically really worried about whether the effects we're finding are spurious, and we're also really worried about bias: worried that we're going to say something is good when it's actually bad. So there's a huge emphasis on estimating an effect, estimating it consistently, and putting confidence intervals around it, and just to get published in an economics journal that's basically table stakes. It's one reason people have been slow to adopt a lot of machine learning methods: people don't know what to do if they're going to write a paper where there's not a coefficient estimate, an asymptotic normality theorem, and confidence intervals. That's been part of my research agenda: to provide those for economists on machine learning methods, so that they can actually start to use them. Now there's a big, important point that I want to pause on for a moment, which is that we don't actually know the ground truth, and that's really a big difference between supervised learning and causal inference. I could think about what the treatment effect is, for each of you, of giving you coffee before you walked in. I could try to estimate that effect, but of course people for whom coffee is effective tend to have chosen coffee, and people who don't normally drink coffee and stay alert without it didn't get coffee. So if I just looked at the correlation in the audience, I wouldn't be able to get the causal effect, and if I held out 10 percent of you, that wouldn't really change anything. I could find a big positive correlation, or actually in this case maybe a zero correlation, between coffee and
heart rate; I might get that in the training set, and I would also get it in the test set, but that wouldn't tell me anything about whether I got the right answer about what would happen if I force-fed you coffee. So we don't have treatment effects stamped on our foreheads, and we don't have a held-out test set that tells us whether we got the answer right or wrong, and that's one reason we focus a lot on theory. Instead, this literature focuses a lot on designs that enable what we call identification and estimation of these effects. First of all, this whole literature tends to focus on treatments that have been observed in the past: we have a data set, some people got the drug, some people didn't. The problem is that maybe they weren't randomly assigned. Random assignment is of course the gold standard for figuring out causal effects, but what happens if you need to make a decision and you don't have that? This literature focuses on what we call designs, different designs that would allow you to learn about causal effects even without a randomized experiment. One category is natural experiments; instrumental variables are one example, and I'll come back to that. Unconfoundedness would also fall into this category under some assumptions, and I'm going to go into those in some detail. There are three others that I won't go into today but that are also very, very popular for drawing causal inferences in the social sciences. One is regression discontinuity, where you compare people near a boundary. For example, if I want to figure out the impact of being in school A versus school B, I compare people on opposite sides of the street where the school district boundary is, and I say, well, their neighbors were similar, their houses were similar in price, and all that differed was the school they went to. Or if there's a test score criterion for getting into a school, we look at people whose test scores were just below or just above the cutoff, or at people whose incomes just qualified them or just didn't qualify them for a program, and so on. Of course there can be a lot of problems with these as well; for each of these examples you can think of reasons it could go wrong, but it's a very, very popular type of strategy, and in fact tech firms can think about designing these types of experiments. Hal Varian, Google's chief economist, recently wrote a paper with Art Owen from Stanford about this, arguing that, for example, on YouTube they give people free t-shirts if they get enough likes on their videos; instead of having a hard cutoff, where if you get more than ten thousand likes I send you a t-shirt, I should take all the people near 10,000 and randomize them, and then I would get even more statistical power from that program and could actually learn whether t-shirts help or whether Google is wasting its money sending people all these t-shirts. So that's the regression discontinuity design. Then there's the difference-in-differences design, and closely related to it, looking at longitudinal data. A classic example would be, say, Kansas has a tax reform. Of course the world is changing, the economy is changing, you're coming out of a Great Recession, but you can look at states that are a lot like Kansas and look at their time trends, and then Kansas at some point passes tax reform, and you can see how its outcomes go downhill relative to the time trend established by the other states.
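For reference, in its simplest two-group, two-period form the difference-in-differences estimator she is describing is

```latex
\hat{\tau}_{\text{DiD}}
  = \left(\bar{Y}_{\text{Kansas},\,\text{post}} - \bar{Y}_{\text{Kansas},\,\text{pre}}\right)
  - \left(\bar{Y}_{\text{comparison states},\,\text{post}} - \bar{Y}_{\text{comparison states},\,\text{pre}}\right),
```

and the identifying assumption is parallel trends: absent the tax reform, Kansas would have followed the same time trend as the comparison states.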
And this area is one that's seeing a lot of activity right now in terms of bringing together machine learning methods, particularly matrix factorization methods, with causal inference methods; that's something I have some papers on that you can find on my website, but I'm not going to talk about it any more today. OK, approach two. I should say that in economics, the people who do approach one and the people who do approach two generally are not friends; they argue a lot. I've done both in my career, but historically not that many people did both; now it's becoming a bit more common. Structural estimation: the way the structural estimation people would put it is that they make more assumptions, but they also answer more interesting questions. At some level, the first approach is only good for comparing people who got something to people who didn't get something, where you can just measure outcomes. The goal of structural estimation is more than that: it's to be able to think about worlds we haven't seen before, and also to say something about welfare. I want to understand, if I raise the price, what that's going to do to consumer welfare and what it's going to do to firm profits. These kinds of models have also been used for decades in antitrust cases. Staples and Office Depot want to merge; some economist is going to get paid hundreds of thousands of dollars to write a report building a little model that says, well, if these two merge, this is how they're going to change their prices, and this is what it's going to do to consumer welfare. The area where I worked on this most, historically, was auctions, and I'll show you an example of doing that at the end. Now I'm working with David Blei on models of pricing. What would happen to firm demand if the price increases? What would happen to prices, consumption, and welfare if two firms merge? What would happen to platform revenue, advertiser profits, and consumer welfare if Google switched from a generalized second price auction to a Vickrey auction? That's a change Facebook made at some point, and those of us working in this literature discussed it at the time. OK, so how does this work? Our goal is to estimate the impact on welfare and on the profits of participants under alternative counterfactual regimes, and these regimes may never have been seen before. So how could you possibly do that? I remember when I first explained this to friends at Microsoft ten years ago; when I finally convinced them of what I was doing, they said, you're smoking dope, that can't be science. How could there be a science where you're claiming to make a prediction about a world you haven't seen before? And the answer is, you can only do that with some assumptions, but in some cases you can make assumptions that make a lot of sense. And of course your artificial intelligence agents are going to have to do that; they have to reason about worlds they haven't seen before, so we're all in this business. You basically need a behavioral model for the participants; you need to understand why the agents in your data are doing what they're doing. For example, I could assume that when you go into the supermarket and you choose a cereal, you're choosing the one that maximizes your utility. That's a behavioral assumption: you chose the one that maximizes your utility.
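A minimal sketch of what that behavioral assumption can look like in code (purely illustrative; this is not the model from her supermarket work): utility falls in price, the shopper buys when utility beats the outside option, and a logit fitted to observed choices and prices is then used to predict purchase probabilities at a price never seen in the data.

```python
# Illustrative revealed-preference sketch (hypothetical utility and prices): the
# shopper buys when utility u = 4 - 1.5*price + noise is positive; a logit fit to
# the observed choices recovers the price sensitivity, which is then used to
# predict demand at a counterfactual price that never appears in the data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 20_000
price = rng.choice([2.0, 2.5, 3.0, 3.5], size=n)     # weekly price variation in the data
utility = 4.0 - 1.5 * price + rng.logistic(size=n)   # assumed utility function
bought = (utility > 0).astype(int)

model = LogisticRegression().fit(price.reshape(-1, 1), bought)
for p in [3.0, 5.0]:                                 # 5.0 was never observed
    prob = model.predict_proba([[p]])[0, 1]
    print(f"predicted purchase probability at price {p:.2f}: {prob:.2f}")
```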
Of course, I could also imagine that you don't gather all the information, or that you're sloppy; there are lots of other assumptions I could make, but the starting behavioral assumption we would make is that you're maximizing your utility. So if there were two cereals offered at different prices and you took one of them, you must have liked it better than the one you didn't take, and from that I can learn something about your preferences. We call that revealed preference: your choices reveal something about your preferences. Of course, we still need designs that enable identification and estimation. For example, if I want to learn about people's responsiveness to price changes, I still need there to have been price changes in my historical data, and I still need the right kind of variation, say quasi-random variation, in the data. But I'm going to go farther with these models: I'm going to rely on the behavioral model to estimate what you would do in different circumstances. Suppose I watch you go to the supermarket over 18 months, which I do in some of my papers; over time I'm going to learn how you feel about prices, so I can predict what would happen if a price changed. I can say that you're very price-sensitive, so if this price goes up, even though I've never seen you at that price before, I don't think you're going to buy. In this particular data set we have prices changing every week, so we hold out data where prices change, and we show that we do indeed do a good job of figuring out who buys and who doesn't buy when the price goes up. Now, I'm not going to be able to talk about this much today, but there's a really lovely paper by a young economist at Yale named Igami, a little paper that's basically "artificial intelligence as structural estimation." The paper is written for economists, and it shows how things like AlphaGo are using what would be known in machine learning as inverse reinforcement learning, and what we've talked about in economics as dynamic programming, value function iteration, policy function iteration, and so on. Now, there's some question about how much value making that mapping adds. In the examples I'm going to show you today, I'm going to show how I think I can make AI better by being a better statistician. We have some ideas about how we could make AlphaGo better by also thinking about it the way we've thought about these problems, but I think it's a little less clear whether those improvements will materialize and how important they will be. So, another type of counterfactual inference, causal discovery, or learning the causal graph, is about uncovering the causal nature of a system. This is mostly attributed today to Judea Pearl, and it is often what machine learning people think I'm working on when I say I'm working on causal inference. This has actually not been the main focus of economics, often because in economics we're answering a much simpler question. We want to know: did the minimum wage change work? And we already have an economic model of what causes what: if I raise the price, people won't buy my stuff; if I raise the minimum wage, certain things will happen. So I understand qualitatively how the world works, I just don't know quantitatively, and so in economics we've focused a lot on estimating the magnitude of things and getting really precise, unbiased estimates of those magnitudes. For other applications, you often just don't really
understand how the world works at all. I have some big software program with lots of variables, and I don't know which ones cause what, and I don't know which are inputs and which are outputs. Same with the human body: I just don't really understand how it works. I have some idea that I start getting nervous and I start to release some hormones and then I start to sweat; there's this causal pathway that my body goes through in response to stimuli, but I don't understand what it is. So trying to discover the nature of causal relationships is really a very distinct subfield, and I'm not going to go into it today, partly because that's not what I work on and there are other people who can speak much more clearly to it; I've never done any applications in this area. I would say that recently these literatures have started coming together: economics, statistics, computer science, engineering, all trying to come together, and I think artificial intelligence, decision making, personalized treatment policies, all of that is unifying everyone in pursuit of a common goal. So I would say the recent literature is really bringing causal reasoning, statistical theory, and modern machine learning algorithms together to solve important problems. What are some of the themes of this literature? First of all, causal inference versus supervised learning. As I mentioned, one of the amazing sociology-of-science points about supervised learning is how having all these images and being able to hold out a test set has led to the advancement of science. If everybody can agree that I did a better job telling cats from dogs by testing me on a test set, then we can all agree what progress is, and we can make a lot of progress really fast. My cautionary note would be that most problems aren't like that. For most problems you can't tell whether you did a good job or not, certainly not that precisely and objectively. As I said, I would spend two years arguing with people about whether I got the right estimate in an empirical paper, and at the end half the people are unconvinced. So progress in understanding the impact of the minimum wage is very, very slow, even though there are a bunch of really smart people devoting their whole careers just to that. On other problems you can go much faster: if I have a billion iterations of a game, I can see who won and lost, and I can really understand what happened and what wins and what loses. So causal inference is different, because I don't have an observed ground truth. Now, it turns out we can make progress despite that, and what our literature has done is figure out ways to estimate objective functions, to find model-free ways to estimate how well we've done. That's hard to do; it's noisier, more prone to mistakes, and requires more assumptions than supervised machine learning, but nonetheless we've been making good progress, and basically the theme is that we change the objective function. Another thing, as I said, is that sampling variation matters, and we care a lot more about uncertainty quantification. I would say that's a really important problem when I see firms going to apply AI: some team builds a decision engine, they send it off to a loan officer, the loan officer looks at it, sees a recommendation, and says, that doesn't make sense to me, and nobody has told them whether that's an accurate assessment or not, and so they just
distrust the whole machine, because sometimes they get crazy answers. So we need, as a community, to put more emphasis on where an algorithm is just going to get things wrong: maybe it's biased, maybe there are omitted variables, maybe it's just very uncertain, and we need to really express that uncertainty quantification; that's been a big theme all along in causal inference. We do require theoretical assumptions and domain knowledge. And then one of the things I've been finding very interesting is that tuning a model for counterfactuals is actually very different from tuning a model for prediction. You're going to choose different complexities of models, different functional forms of models; all of your choices will be different if you're tuning for a counterfactual versus tuning for a prediction. Indeed, in a simple case like trying to understand the impact of changing a price, it's very common to be able, with a simple correlation, to explain 95 percent of the variation. If I regress hotel occupancy on price, I can explain maybe 95 percent of the variation, because the hotels raise their prices when they're full; but if I wanted the causal impact, I might have an R-squared of something like 0.01, because it's just much, much harder to get the causal effects. So what are the insights from statistics and econometrics that machine learning can learn from? First, we usually start out by thinking about identification and then estimation. We first ask: could we solve our problem with infinite data? That's what we call the identification question, and it's a very helpful exercise, because it forces you to think about what in the data could possibly answer your question. In many cases the answer is: I can't answer my question; this data set just doesn't have the right kind of quasi-experimental variation to answer my causal question; all I can get is correlations, even with infinite data. So we have a design-based approach, and then we want to do estimation. Where the big lift from machine learning has come is that we now have much more data and many more covariates, and machine learning has done a great job of figuring out how to make the best and most efficient use of the data you have, once you give it the right objective function. A few other themes. Regularization induces omitted variable bias: if you have two things that are highly correlated, a model will typically choose one of them and not the other, because they're providing similar information, but that creates interpretability problems. Say I have parents' income and parents' education; I might do a variable importance measure in a random forest and say, oh, parents' income is important and parents' education is not, which is a silly conclusion, because they're just highly correlated with each other. In most economic data sets, and frankly most firm data sets, all the variables about people are highly correlated with each other, and so I think people make a lot of mistakes in interpretation when they don't think about omitted variable bias, and that challenges causal inference. Another thing, which I'll have to flesh out in the more technical part, is semiparametric efficiency theory, which is a big literature in statistics and econometrics; it can actually be very helpful, and we've been able to improve on the best known regret bounds from machine learning by bringing in the insights from semiparametric efficiency theory and changing machine learning algorithms.
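Going back to the regularization point above, here is a small hypothetical illustration: two nearly collinear covariates that matter equally for the outcome, where the lasso loads almost entirely on one of them, which is exactly the pattern that gets misread as "education doesn't matter."

```python
# Hypothetical illustration of regularization-induced interpretation problems:
# parents' income and parents' education are nearly collinear and both matter
# equally for the outcome, but the lasso puts (almost) all of the weight on one
# of them and shrinks the other toward zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n = 10_000
income = rng.normal(size=n)
education = 0.9 * income + 0.1 * rng.normal(size=n)       # correlation ~0.99 with income
outcome = income + education + 0.5 * rng.normal(size=n)   # both truly matter, equally

X = np.column_stack([income, education])
coefs = Lasso(alpha=0.3).fit(X, outcome).coef_
print(f"lasso coefficients  income: {coefs[0]:.2f}  education: {coefs[1]:.2f}")
```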
And there are things we do called cross-fitting and orthogonalization, which basically make your models more robust to mistakes you make in estimating what we call nuisance parameters. If you're trying to get a causal effect, there are things like estimating an outcome model, predicting what your outcomes would be rather than your treatment effects, and for those outcome models you try to residualize everything so that the mistakes you make in that estimation are orthogonal to the problem you're really interested in; it turns out that increases performance a lot and also improves the theory. Finally, I would say: exploit the structure of the problem carefully for better counterfactual predictions. A lot of the old folks, and I'll call myself one of the old folks since I started in computer science in the 80s (I do have a computer science degree from before I abandoned it for economics), get a little grumpy about all these black-box algorithms. To my economist friends I'm trying to evangelize the black-box algorithms, but those old folks are a little grumpy because we used to spend a lot of time thinking about modeling and generative modeling. I would argue that if you're really going to do hard AI, or human-like AI, we do have to go back to the models; we're going to use the black boxes for components of those models, but we're going to use structure, because you want to be able to make sensible decisions in states of the world you haven't seen before, and a lot of times that structure really helps. In my supermarket analysis, for example, it really helps to say, hey, we have these functional forms that tell us something about how, if the price of this goes up, you're going to substitute to other types of paper towels, and how to use past behavior in a structured way. Those models have been super successful over decades, and I can often beat the best kind of reduced-form, black-box machine learning methods by incorporating a little structure while also using modern machine learning, putting them together. OK, so now I'm going to spend some time going into some very specific models, and I'm going to get progressively more technical. The first place, and this is where a lot of the emphasis in causal inference has been, is estimating average treatment effects under unconfoundedness, and indeed some folks in causal inference land who aren't from the social sciences think this is the only thing we ever do, when I would say it's actually the least favorite way to do causal inference in economics. In fact there are lots of people who think we should never do this, but I argue back to them that tech firms are now generating lots of the kinds of data that would satisfy the assumptions required for this setting to work. The idea is that only observational data is available, that is, we don't have a simple uniform experiment with A/B testing from the past, but the analyst has access to data that is sufficient, in the sense that it contains the part of the information used to assign units to treatments that is related to the potential outcomes. For example, a tech firm might be deciding what Facebook is going to show you on the newsfeed based on a bunch of characteristics, and that data, the things that go into their algorithm, is logged by Facebook. Now, their algorithm itself is a black box, so typically, although for the ads we wish
they would if you're an advertiser, firms typically don't actually log a probability of assignment. Suppose the firm was randomizing what you were going to see; they typically don't log the probability, because that would require taking a lot of draws from the black box, and that would be expensive, and why would you need to do that? You just do one draw and show that thing to the person. But as an analyst I can think of that as random and go back and look at it later, and of course it's a computer algorithm, so whatever was used in the assignment must have been recorded at some point. The analyst doesn't know the exact assignment rule, though, and there was some randomness. In this setting, conditional on the observables, we have random assignment, but the assignment is not uniform: people for whom the firm thought the ad would be effective were probably more likely to be shown the ad than people who were not. Contextual bandit data is another good example: if the tech firm is itself running a contextual bandit, or a political campaign is using a contextual bandit to say, I'm going to map your characteristics to the email I send you, that would generate data like this, and then the question is, what's the treatment effect? For an online ad, ads are targeted using cookies: a user sees car ads because the advertiser knows the user visited car review websites, but there's still some randomness as to whether you saw the ad, even conditional on that. So the interest in cars is the unobserved confounder, but if the analyst can observe the history of websites visited by the user, which is the same thing that was used for targeting, then in principle I can control for those things and do causal inference. In this setup, we formalize these assumptions as unconfoundedness, also called ignorability, and basically what that says is that, conditional on some covariates X, the potential outcomes you would have had from seeing the ad or not seeing the ad are independent of the treatment. Some of the things that could lead to randomness could be running out of budget: a firm has a limited budget and so might only show the ads to half the people, or competitors might be running out of budget; there are lots of things that could be producing that randomness. This notation is the potential outcomes notation, and Y_i(1) and Y_i(0) are the counterfactual outcomes: for each of you there's a counterfactual outcome, what you would have had if you saw the ad and what you would have had if you didn't see the ad. Even though I'm never going to observe both of those for the same person at the same time, they exist hypothetically, and it's really important to write them down, because that's what allows us to distinguish correlation from causality and to write down the objects of interest. Our object of interest, a treatment effect, would be the difference between your outcome if treated and your outcome if not treated. We need there to be enough randomness that for every type of person there was some chance you saw the ad and some chance you didn't. If there are people who just never get shown car ads, we can't say anything about what would have happened if we did show them the car ads, so we should just throw them out; we can't say anything about them without extrapolating too much. But if there are people who had some chance of seeing the ad based on their characteristics, then we can put them in, even if that chance is not high.
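In the potential outcomes notation she has just introduced, the two conditions and the target estimand read

```latex
\{Y_i(0),\, Y_i(1)\} \;\perp\; W_i \mid X_i \quad (\text{unconfoundedness}), \qquad
0 < e(x) \equiv \Pr(W_i = 1 \mid X_i = x) < 1 \quad (\text{overlap}), \qquad
\tau = \mathbb{E}\left[Y_i(1) - Y_i(0)\right] \quad (\text{average treatment effect}),
```

where W_i indicates treatment (seeing the ad or not) and e(x) is the propensity score that comes up next.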
So the old literature from the 1980s basically shows that it's sufficient to control for the propensity score: you don't have to control for the whole vector of X's, which might be hard and high-dimensional — the propensity score is enough. If I compare two people who have the same probability of being treated, then I can think of the treatment assignment as being as good as random. That's not completely obvious; it's a theorem, and you can find the algebra for all of that in my lecture notes. If you control for X well, then you can estimate the average treatment effect, and there are a bunch of popular methods used to estimate treatment effects in that circumstance. Maybe I can do this best visually. Here's a sketch of a data set: say the reds are treated and the purples are controls, and here's a covariate X. What we're seeing is that the treated units look a little bit above the controls, but it's a little hard to tell, because it also looks like X increases outcomes, and there are a few more treated units with high X's than with low X's. So X is a confounder: it's correlated with your outcome — higher X gives you a higher outcome — and it also seems to be correlated with the treatment assignment, since people with higher X were more likely to get the treatment. There are a few different ways to do causal inference in this setting. The first thing you can do is just reweight: the units with higher X's in the control group get weighted a little more than the units with low X's in the control group (I did this by hand, so apologies that my blobs aren't perfect). If I reweight, I can basically adjust for the fact that X was correlated with the treatment assignment, and I can then look at the weighted average of the purple units, the controls, and say: that's what would have happened to the treated units if they hadn't been treated. If the weighted purple average is lower than the red average, then there's a treatment effect. It turns out this propensity weighting has been very popular in machine learning, because we already had the idea of reweighting: it's very easy to take a machine learning estimator and reweight it off the shelf, so this has been the most popular application of this causal inference literature in machine learning. Unfortunately, it turns out this is not actually the best way to do things, even though it's popular and conceptually easy. Another thing you can do is outcome modeling. We could say: let me just take the control group and estimate the relationship between X and Y; then for every unit I can adjust for the impact of X, and once I've adjusted for the impact of X, I can compare the treated outcomes and the control outcomes. This approach is implicitly used in a lot of the contextual bandit literature — linear Thompson sampling, the lasso versions of Thompson sampling, and UCB. These algorithms implicitly rely on outcome modeling: they basically assume that you can run a regression, understand the impact of having different contexts, and use that to draw inferences about which arm to pull.
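Here is a minimal sketch of the two strategies just described — reweighting by an estimated propensity score, and outcome modeling — on simulated data. The particular learners (logistic and linear regression from scikit-learn) and the data-generating process are illustrative assumptions, not the speaker's code:

```python
# Sketch of (1) inverse propensity weighting and (2) outcome modeling on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 1))                       # confounder
e = 1 / (1 + np.exp(-X[:, 0]))                    # true propensity rises with X
W = rng.binomial(1, e)                            # treatment assignment
Y = 2.0 * X[:, 0] + 1.0 * W + rng.normal(size=n)  # outcome: true effect = 1

# (1) Reweighting: weight treated by 1/e_hat, controls by 1/(1 - e_hat)
e_hat = LogisticRegression().fit(X, W).predict_proba(X)[:, 1]
ipw = np.mean(W * Y / e_hat) - np.mean((1 - W) * Y / (1 - e_hat))

# (2) Outcome modeling: fit E[Y | X, W], then contrast predictions at W=1 vs W=0
mu = LinearRegression().fit(np.column_stack([X, W]), Y)
mu1 = mu.predict(np.column_stack([X, np.ones(n)]))
mu0 = mu.predict(np.column_stack([X, np.zeros(n)]))
reg = np.mean(mu1 - mu0)

print(f"IPW estimate: {ipw:.3f}, outcome-model estimate: {reg:.3f}")
```

Both estimates land near the true effect of 1 here because the propensity and outcome models happen to be well specified; the discussion that follows is about what happens when they are not.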
Now, outcome modeling like that is going to work okay, but if you don't have a lot of data, then — going back to the picture — there are a lot more control observations at low X's, and if the relationship is curved you might really screw up. If the control units have lower X's, you would be extrapolating quite a long way to say what should be happening in the region of high X, so if you have the wrong functional form you can go way wrong, and that's why those models don't perform as well as they could. The approach that tends to be the highest performing from this literature on causal inference is what's called doubly robust estimation, and what we do there is both reweight the data and use an outcome model. If you think about it, once I've done the reweighting I'm not extrapolating very far, so if I get the slope of the line a little bit wrong it doesn't really matter much, because after reweighting the treated and control groups look very similar. On the other hand, if I do the outcome modeling well, then it doesn't matter if I do my reweighting very well, because if I have a correct model mapping X's to Y's, I can adjust for the effect of X on Y and control for the difference. But if you combine outcome modeling and reweighting, then you can make some mistakes in both and still do a better job. And in a bandit you never have enough data, especially early on — you're always in a data-poor environment — so you're never going to have exactly the right model, and you're always going to get some benefit out of doing this. Now, if you did happen to have the right outcome model, then the reweighting will increase your variance, so there is in fact a trade-off, but in most real-world settings you don't have the right outcome model, and so you do better by reducing bias. This doubly robust intuition turns out to be very important when applying machine learning to causal inference, because generally we have these high-dimensional models and you do make a lot of mistakes: it's hard to estimate the treatment assignment policy, it's hard to estimate the outcome model, you can't control for everything, and you have regularization-induced biases — so combining the two and being more robust makes you perform better.
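For concreteness, the doubly robust combination described here is often written as the augmented inverse propensity weighted (AIPW) estimator — a sketch, using an estimated outcome model \(\hat\mu(x,w)\) and propensity score \(\hat e(x)\):

```latex
\hat\tau_{\mathrm{DR}}
= \frac{1}{n}\sum_{i=1}^{n}
\Biggl[
\hat\mu(X_i,1) - \hat\mu(X_i,0)
+ \frac{W_i\,\bigl(Y_i - \hat\mu(X_i,1)\bigr)}{\hat e(X_i)}
- \frac{(1-W_i)\,\bigl(Y_i - \hat\mu(X_i,0)\bigr)}{1-\hat e(X_i)}
\Biggr]
```

The estimator is consistent if either \(\hat\mu\) or \(\hat e\) is estimated well, which is the "make some mistakes in both and still do a better job" property described above.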
Okay, so I'm going to stop here and take a break, and we'll come back at 3:40. Thanks. [Applause]

Where I left off last time was this canonical problem, which I think is really the basic starting point for thinking about causal inference outside the randomized-experiment setting: the setting with unconfoundedness. The final recommendation out of this — and I have formal results in a series of papers — is that you can actually make your optimal policy estimation or your bandits work better if you use doubly robust estimation. We have a paper coming out at AAAI, I guess in January, that illustrates this in contextual bandits, and I'll also show you, if I get time, how it improves the best known rates from the machine learning literature on estimating optimal policies. Again, doubly robust is not my idea — it goes back a very long time — but I think the reason it works really well in machine learning settings in particular is that in high-dimensional settings we never get the models right: we don't have enough data to really estimate the complete functions, and so this combination of outcome modeling — where the outcome model is basically trying to adjust for the effect of confounders on outcomes — and reweighting together can work better.

Let me go back, because somebody was asking me in the break and I went a little fast here, so let me say one more time what all the objects are. These outcomes Y_i(1) and Y_i(0) are the potential outcomes: Y_i(1) is what you would have gotten if you were treated, Y_i(0) is what you would have gotten if you were a control. If both of those were stamped on your forehead, then causal inference would basically just look like a supervised learning problem. The problem is that there's a missing data problem: I only observe one or the other for any particular person. And that's not a problem, of course, if the data is just missing at random, as in a randomized experiment — having data missing at random is not a problem. The problem is if the data is missing not at random, that is, the people who drank the coffee or took the drug look different from the other people. W_i is the treatment, X_i are the features, and again the unconfoundedness assumption is that if you condition on all the X's, these confounders, then after conditioning on them the treatment assignment is as good as random.

I should mention one reason this approach is somewhat out of favor in the social sciences: what we found over a period of many decades is that people would apply this to observational data and argue that they'd controlled for everything, but it was hard — they didn't have enough data to really do it well — and so they got different answers depending on which functional forms they chose. There was sort of a methodology war around this: some people said matching, some people said outcome modeling, some people said propensity scores, and on any particular data set you could get different answers from different methods. I think the beauty that machine learning brings to all of this is that it gives you a systematic way to choose functional forms, and if you adopt this sort of doubly robust approach, you have many fewer decisions to make as a researcher; using very flexible functional forms, selected in a data-driven way with the correct objective function, can really improve things. And finally, as I mentioned, in the modern digital world it's actually more plausible that we might have a data set where the assumptions are satisfied. In the past I might have said, oh, I'm going to try to figure out the effect of going to college, and if I just control for enough stuff, going to college is as good as random — and you might say, no it's not: take two people who looked identical in terms of all the stuff I could observe in the census data, one went to college and one didn't; those people are probably different in some other ways, and it's just not plausible that I've really controlled for everything relevant in an application like that, or occupational choice, or something. But if I'm thinking of a setting where I'm interacting in a digital environment and the treatments are being assigned digitally, then the X's must have been observed in order to make the treatment assignments work.
Okay, so let me just breeze through a few of the ways machine learning has come into this area recently. An early idea people had was to use machine learning to estimate propensity scores. Again, the propensity score is the probability that you're treated conditional on your X's: if your X's show that you're very sick, you might have a high propensity to get the drug; if the X's show that you're healthy, you might have a low propensity to get the drug. So the propensity score is your propensity to be treated as a function of the covariates — if you go to a lot of car websites, you might be more likely to see a car ad than if you don't. This early approach was to estimate propensity scores and use those for reweighting, and a number of people in the machine learning literature have adopted propensity weighting in some form or another; for example, a group at Microsoft Research New York — John Langford and co-authors — has done that a lot with contextual bandits.

A second method is regression adjustment. As I mentioned, that was what I tried to show in the picture by drawing the line: you estimate how the X's, your features, affect outcomes, and if I can adjust for that, then I can adjust for the differences between the treatment and control groups and get a causal effect. One of the early economics papers in this area — I mention it here not because it's the best method to use, but because it makes a really nice point — shows that it's very different to do supervised learning than it is to do causal inference. In particular, what they suggest is that you use lasso regression to estimate two separate models. The first regression is of the treatment assignment on the features: this is estimating the propensity model, looking at the data and asking which X's predict who gets treated. The second, separate regression asks which X's predict outcomes. Then they suggest you take all of the X's that were not zeroed out in either regression — if a feature was selected in either one, keep it — and finally run a regression of the outcome on the treatment, Y on W, controlling for all of those X's. Now, since lasso is optimizing for goodness of fit, we know that you're going to be sacrificing goodness of fit: if you do the cross-validation correctly, just regressing Y on X and letting lasso tell you which X's predict Y would be the best thing from the perspective of goodness of fit. So why would you sacrifice goodness of fit to throw in these other X's that don't explain the outcome very strongly but do explain the treatment well? The answer is that if you don't control for confounders you'll be biased, and you'll actually get the wrong answer. In the example of a drug, there could be some medical sign that is important for doctors in assigning the drug: it's only weakly predictive of outcomes, but it's important for getting the drug and it's indicative of being sick. There might actually be a lot of weak predictors, and lasso will zero out predictors if they're weak enough, so you might have a lot of weak predictors of being sick that are important for assigning the drug and that in aggregate also affect how sick you are.
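A minimal sketch of the double-selection recipe just described, on simulated data; the cross-validated lasso and plain least-squares calls from scikit-learn are placeholder choices, not the original paper's implementation:

```python
# Lasso the treatment on X, lasso the outcome on X, take the union of selected
# features, then regress the outcome on the treatment plus that union.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(1)
n, p = 2000, 200
X = rng.normal(size=(n, p))
# X[:, 0] is a weak-but-real confounder: it drives treatment and (weakly) outcome
W = (X[:, 0] + rng.normal(size=n) > 0).astype(float)
Y = 1.0 * W + 0.3 * X[:, 0] + X[:, 1] + rng.normal(size=n)

sel_w = np.flatnonzero(LassoCV(cv=5).fit(X, W).coef_)   # X's predicting treatment
sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, Y).coef_)   # X's predicting outcome
keep = np.union1d(sel_w, sel_y)                         # union of selected controls

design = np.column_stack([W, X[:, keep]])
tau_hat = LinearRegression().fit(design, Y).coef_[0]    # coefficient on W
print(f"double-selection estimate of the treatment effect: {tau_hat:.3f}")
```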
But if you just do a predictive model of Y on X, you won't pick those up. The point is that for causal inference we're not just concerned about predictive power, we're concerned about bias: we're worried that we might omit a variable that is a confounder — a variable that relates to both your treatment assignment and your outcomes — and that we would confound the causal effect. So this is a nice, easy-to-explain example of the contrast between off-the-shelf supervised learning and causal inference; it's really because our goal is different, and you can think about this as changing the objective function. Again, this is not the highest-performing method; it's just a simple example.

Another method that's been fairly popular is to estimate the conditional average treatment effect and then average over the X's. Jennifer Hill had one of the early papers here using BART, and that has been very successful: we have these causal inference competitions at the Atlantic Causal Inference Conference, and the BART-based methods have done really well in a bunch of competitions, especially in settings with the signal-to-noise ratio in a certain range. What she does is use BART to estimate the outcome model: we think about estimating the expected value of the outcome conditional on your covariates as well as on the treatment, and if I can estimate that conditional mean function, I can take the difference between what the function says you would get when treated and what it says you would get when not treated, interpret that as a causal model, and estimate the average treatment effect from the difference. This is an example of off-the-shelf outcome modeling; in extensions, people have tried to bring propensity weighting into BART and so on as well.

The approach I would recommend based on the statistical literature is doubly robust, or double machine learning, methods. What this does is take what's called an efficient score from the statistics literature — this is related to the theory of how you efficiently estimate a parameter — and in the case of unconfoundedness, this is what the scores look like (there are different ways to write the scores). For every observation, I'm first going to form an estimator of the treatment effect: I'm going to estimate a conditional average treatment effect model, the average treatment effect as a function of your X's. I haven't told you yet how we do that — I'll come to it a little later; there's a variety of different techniques. I've been very into random forests, so I would recommend my generalized random forest package, which is available in R, but you can use lots of other methods as well. So for every observation you basically have a baseline estimate of the treatment effect for that observation, and then I'm going to adjust that treatment effect. There's some complicated algebra, but let me just say it in words, because the algebra is a little hard to read: there's a term which is basically the residual of a regression of Y on X and W.
So I'm also going to separately build an outcome model, mu hat, which is the conditional mean of the outcome given the features and the treatment. Then if an observation was a treated observation, I'm going to weight it by the probability of being treated, and if it was a control observation, I'm going to weight it by the probability of being control — that's all the middle expression says. Basically what this score is doing is orthogonalizing things: it's taking a residual, and the moment is constructed so that mistakes I make in estimating these nuisance parameters — like the mu, which is not the main object of interest — are orthogonal to the moment. What that means is that if I get mu hat a little bit wrong, it's not going to bias my estimate of the treatment effect. This statistical theory is complicated, so I'm not going to do it full justice; I just want you to take away the idea that for every problem you look at, you can do a bunch of algebra and come up with a set of moments you can use to estimate the parameter, and they're called orthogonal moments when the parameters that aren't your parameter of interest — I'm not intrinsically interested in estimating the outcome model — are orthogonal to your moment. In particular, here it's also doubly robust, as I said before: if you get either the outcome model or the propensity model wrong, you'll still get the right answer. And if I use this approach, I'm going to get root-n convergence for the treatment effect even if the nuisance parameters converge more slowly, say at rate n to the one-quarter, which helps in high dimensions.

Another thing we're going to do here, which is not standard in machine learning — although it might be if you're using random forests — is to use either out-of-bag estimates of the nuisance parameters or cross-fitting. If we were using a neural network, we might break the data into ten folds and, for each observation, estimate its nuisance parameters using the data other than that observation. What that does is the following: if one particular observation is an outlier, that observation might pull the outcome model toward itself, and that would create a correlation between the outcome and the mistakes in these outcome models, the mu hat functions. By using cross-fitting I make myself much more robust, and that's actually important for the statistical theory here. What we found in applications is that using cross-fitting, or out-of-bag predictions for the random forests, really does improve the performance of these models, and I'll talk a little later about a deep neural net application to instrumental variables where they also found that various types of orthogonalization made a big difference in performance.
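A sketch of how cross-fitting and the doubly robust score fit together, on simulated data. The score below is written in the standard AIPW form (one of the equivalent ways of writing the scores mentioned in the talk), and the gradient boosting learners are placeholder choices, not the speaker's:

```python
# Cross-fitting: each observation's nuisance models are fit on the other folds.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 5))
e = 1 / (1 + np.exp(-X[:, 0]))
W = rng.binomial(1, e)
Y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 1.0 * W + rng.normal(size=n)  # true effect = 1

scores = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    e_hat = np.clip(
        GradientBoostingClassifier().fit(X[train], W[train])
                                    .predict_proba(X[test])[:, 1],
        0.01, 0.99)                               # trim extreme propensities (overlap)
    mu = GradientBoostingRegressor().fit(np.column_stack([X[train], W[train]]), Y[train])
    mu1 = mu.predict(np.column_stack([X[test], np.ones(len(test))]))
    mu0 = mu.predict(np.column_stack([X[test], np.zeros(len(test))]))
    # doubly robust (AIPW) score for each held-out observation
    scores[test] = (mu1 - mu0
                    + W[test] * (Y[test] - mu1) / e_hat
                    - (1 - W[test]) * (Y[test] - mu0) / (1 - e_hat))

tau_hat, se = scores.mean(), scores.std(ddof=1) / np.sqrt(n)
print(f"cross-fitted doubly robust ATE: {tau_hat:.3f} (std err {se:.3f})")
```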
A final method that I've worked on actually gets rid of the assumption that you can even estimate the treatment assignment model. Suppose Facebook is using, say, 3,000 features to decide whether you see an ad or what goes in your newsfeed, and suppose in my advertiser's experiment I don't have enough data to really estimate that function well. We use a programming approach to avoid actually estimating a propensity model: the programming approach directly tries to estimate weights, and it still gets double robustness. I'm going to skip that math in the interest of time.

Now I want to talk about instrumental variables, which is something you can use when unconfoundedness fails — I would say it's actually a much more popular method in the social sciences. A couple of years ago I gave a talk about this at KDD in Australia, and at that time I asked how many people knew about instrumental variables and almost nobody did, so maybe people have been exposed in the meantime. Let me ask again: how many people know about instrumental variables? All right, there we go — diffusion of ideas. I'm still going to go over them from the beginning, but then I'm going to try to talk about more recent research around these ideas.

The idea of an instrumental variable is that it's a variable that is correlated with your treatment assignment — that means it's relevant — but it's independent of the potential outcomes. Some examples: if your treatment is military service, suppose I want to know the effect of putting someone in the army. Well, people who go to the army are different from people who don't, so I can't just directly compare them. But when we had the military draft for the Vietnam War, we could say that people with high lottery numbers were more likely to go to the war than people with low lottery numbers, and the lottery number was randomly assigned. Now, rich people, or people who had sore ankles or heels or things like that, still didn't go to war even with a high lottery number, so the instrument didn't completely determine who goes to war — it just made you more likely to go if you had a high draft lottery number. There are a bunch of other examples. In another one of my papers on random forests, we use the example from a famous economics paper about the impact of having multiple children on women's labor force participation. You're going to think: wow, I can't run a randomized experiment where I plop down babies on women, this is going to be pretty hard to understand — but it turns out that if your first two children are the same sex, that actually increases your odds of having a third child, and having two children of the same sex is randomly assigned, at least until we start genetically screening our babies, so we can think of that as something that's orthogonal to your labor market outcomes. Things like your quarter of birth determine how many years of education you get, because being born right after a cutoff or just before a cutoff changes how much schooling you get, and so on. Even with things like advertising experiments or getting a drug, you can have non-compliance: I can try to show an ad on Facebook, but Facebook doesn't actually show it, so I had a randomized experiment where I tried to show ads to these people and not to those people, but some other ad beat me out for some of the people. What's random is my intention to treat, but not who actually saw the ad. And people might not comply with a drug trial. Those are some of the other big applications of instrumental variables.

Let me actually skip over some of the details. There's something called a local average treatment effect, which can help give the intuition for instrumental variables. Suppose the treatment and the instrument are both binary.
So, for example, the treatment is seeing an ad and the instrument is whether you were assigned to the treatment group. We have some assumptions: relevance means that the instrument has to be correlated with the treatment; exclusion means that the instrument has to be random with respect to your potential outcomes — it's not related to your treatment effects; and then there's something called monotonicity, which I'll skip over. The local average treatment effect is the ratio of two quantities. First, in the numerator, is the impact of the instrument on the outcome — that's like the reduced form, so it's just comparing the people I assigned to see the ad to the people I didn't, or the people I tried to give the drug to to the people I didn't try to give the drug to. Then, in the denominator, is the expression that says what the impact of the instrument is on the treatment — in a sense, how much does being assigned to the treatment group increase your chance of taking the drug. The ratio of those two things is a treatment effect. I can also do this conditional on covariates, and that's what I'm showing here, and that's going to be more interesting from the machine learning perspective: I might have a lot of covariates, and I might be interested in heterogeneous treatment effects.
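Written out, the local average treatment effect just described — conditional on covariates, with a binary instrument Z_i and binary treatment W_i — is the ratio of the instrument's effect on the outcome to its effect on the treatment:

```latex
\tau_{\mathrm{LATE}}(x)
= \frac{\mathbb{E}\left[Y_i \mid Z_i = 1,\, X_i = x\right]
       - \mathbb{E}\left[Y_i \mid Z_i = 0,\, X_i = x\right]}
       {\mathbb{E}\left[W_i \mid Z_i = 1,\, X_i = x\right]
       - \mathbb{E}\left[W_i \mid Z_i = 0,\, X_i = x\right]}
```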
There are a bunch of different things you can do with machine learning around instrumental variables. Victor Chernozhukov and co-authors at MIT had an early look at this using lasso models, and they basically argued that you could use lasso to figure out which instruments to include in your regression — in particular, to select instruments when you have lots and lots of them. You can think about different scenarios where you might have lots of instruments. One example is a tech firm: in the background you're running lots of A/B tests — say Bing or Google is running hundreds of A/B tests at the same time. They're designed for different purposes, but each A/B test is going to have some impact on the user, and I'll show an example in a minute with ads, where we used A/B tests as instruments for the ranking of the ads: the experiments weren't necessarily designed to change the ranking of ads, but they had an impact on the ranking, and since which treatment group you were in was random, we could use that as an instrument. Victor and his co-authors looked at this in a paper on the effect of years of schooling on labor market outcomes, where the instruments were the quarter you were born in but also the interactions of those with lots of pre-treatment variables, lots of characteristics of people, and they got much more precise estimates.

One of the experiments I ran when I was at Microsoft was again trying to look at the impact of being in higher positions for ads. Here are two queries, iPhone and viagra — two different queries that people would put into Bing, each of which would show ads. If you just did a simple regression of whether the ad was clicked on the position effects as well as the identities of the advertisers, it would look like there were really strong position effects: being in the second position looked about two-thirds as good as being in the first position, going down to the third position was about 40 percent, and the sidebar positions were like five percent. So it looked like the clicks were falling off really fast — but of course this is not a causal effect, because Bing isn't stupid and they put the best ads on top. An ad in the second position might get clicked less not just because it's in the second position but also because the ad is less relevant. So what we did is we used the identifiers for a whole bunch of A/B tests — again, these A/B tests were not designed to do ranking, they just had the consequence of affecting ranking — and we used those as instruments for the treatment, the position of the ad. What we found was that the position effects were much flatter than in the correlation, and of course that's what you expect: in the correlation there are two reasons a lower position would get clicked less, one because it's in a lower position and second because it's a less good ad, so you expect the clicks to fall off faster in terms of correlation than in terms of causation, and indeed that's what we found. Separately, I ran a bunch of experiments randomizing positions at Bing and found similar results in a purely randomized experiment designed to look at this. I would just say this was an example, circa 2008 to 2010, where the search engine wasn't really taking this into account: the click prediction model was just using the observational data and trying to control for ad effects and position effects, but wasn't really factoring in that there was this bias — and indeed we found the estimates in the click prediction algorithm were biased. I then worked together with a group of people, including Leon Bottou, who really led a lot of this work, to try to get more experimentation into the click prediction algorithms.

Okay, so another thing we can try to do is look at heterogeneous treatment effects: we want to understand for whom a treatment effect is large. I'm going to illustrate this with two different approaches, one from a paper of mine with Julie Tibshirani and Stefan Wager that's just coming out in the Annals of Statistics, and another by a team that was at Microsoft Research at the time, including Matt Taddy, Greg Lewis, and Kevin Leyton-Brown from UBC, that uses neural nets. I'll start with the neural net approach. First of all, here's a little causal diagram showing the instrumental variables application they were looking at. They were interested in understanding the effect of price, so the goal is to understand, say for a firm, what happens if it changes its price — and by the way, this was part of a project the team was working on to commercialize some machine learning algorithms designed to help firms make decisions, with the idea that a lot of firms want to do causal inference about things like prices, not just prediction. The causal diagram shows that there are a bunch of things that affect a price. In the firm's historical data there might have been shocks — things that caused the firm to change the price that were also related to the outcome Y: the firm might raise its price on a high-demand day, and that e would be the unobservable; it's not observed by the analyst, but it was the thing that caused the firm to both raise the price and have high demand. There are also some covariates X, which are just characteristics of the context, and then Z are the instruments. For the case of price, they might be cost shocks, or they might be randomized experiments the firm ran if it's more of a tech firm pricing online.
Z would be something that's affecting the price but that's not related to demand. In economic applications, historically, we looked for things like cost shocks that would cause firms to change the price for reasons unrelated to demand. The key thing is that Z is excluded from the outcome relationship: Z is not directly related to, say, the demand for the firm that day. What that means is that I can write the expectation of outcomes conditional on X and Z in a very particular way — here I'm writing it as an integral, E[Y | x, z] = ∫ g(p, x) dF(p | x, z). There's some function g of p and x, which tells me basically how changing the price changes outcomes, and there's also the distribution of prices in the historical data: prices tended to be different on average in different contexts X, and prices were also different with different Z's. But the key is that Z is not in the g function — that's the formalization of the exclusion restriction. Z is shifting the price distribution, but it's not actually affecting the overall demand relationship, and that exclusion restriction is what allows you to learn about the effect of price on demand; without it you don't have a good way to learn that. What they argue is that we can use neural nets to solve these two different parts of the problem: first, I can solve the problem of figuring out how X and Z predict Y — that's a prediction problem, understanding what that whole integral looks like — but separately I can also use neural nets to figure out how X and Z affect the price, and then I've got this integral that says how they fit together, and I can basically try to invert it. Now, economists have thought about this for a long time — I'm citing here a paper by Newey and Powell from 2003, which was actually written in the early 1990s (I told you we take ten years to publish our papers) — so this approach had been around for a long time; it's just that we never operationalized it very well in high dimensions. We had great theory but no practical algorithms. The typical way we would have done this in economics is to just assume everything was linear; then we could run this with regressions, and inverting it would be very easy, because if everything were linear, my demand function would just be some parameter times the price, which I could pull out of the integral. So if I could just estimate how Z, the instrument, affects prices, and how Z affects outcomes, I could take the ratio and that would be my estimate of tau. That's a very easy thing to do, and it's called two-stage least squares — a staple of undergraduate econometrics. We then tried to make this more complicated with sieves and other types of nonlinear functions, but it just never worked very well, so nobody ever adopted it. What this paper, Deep IV, suggests is: let's go back and actually target this loss function directly. They're going to change the objective function, use neural nets to model each of these functions, and then try to find the demand function that actually minimizes the loss function — they've turned this causal inference problem into two generic machine learning tasks.
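For reference, here is a sketch of the linear special case just mentioned: two-stage least squares with a single instrument, where the estimate reduces to the ratio of the two reduced-form regressions. The data-generating process is simulated and purely illustrative:

```python
# Two-stage least squares / Wald ratio with one instrument, on toy data.
import numpy as np

rng = np.random.default_rng(3)
n = 10000
Z = rng.normal(size=n)                           # instrument, e.g. a cost shock
U = rng.normal(size=n)                           # unobserved demand shock (confounder)
P = 1.0 * Z + 1.0 * U + rng.normal(size=n)       # price responds to both
Y = -2.0 * P + 3.0 * U + rng.normal(size=n)      # true price effect = -2

def ols_slope(x, y):
    """Slope of a simple regression of y on x (with intercept)."""
    x_c, y_c = x - x.mean(), y - y.mean()
    return (x_c @ y_c) / (x_c @ x_c)

naive = ols_slope(P, Y)                 # biased by the demand shock
first_stage = ols_slope(Z, P)           # effect of instrument on price
reduced_form = ols_slope(Z, Y)          # effect of instrument on outcome
tsls = reduced_form / first_stage       # 2SLS estimate: ratio of the two effects

print(f"naive OLS: {naive:.2f}, 2SLS: {tsls:.2f} (truth: -2)")
```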
And they applied them. The earlier IV regressions, the things I did in the late 2000s — they came back and did a much more thorough job with the same application, trying to understand the impact of position on the click-through rate for Bing ads. And now not only were they trying to estimate the effect of position, they were trying to understand the heterogeneous effects: some kinds of queries are brand queries and some are not, some queries are navigational and some are not, and in addition some queries are for websites that are very popular and others are less popular. They used this Deep IV approach to look at the causal effect of position and see how it changed with all of these different covariates, and again they used the A/B test IDs as instruments — the A/B tests were not designed to shift the ranking, they just had that as a consequence; each particular experiment only had a small effect, but there were lots of experiments, so they basically had lots of small natural experiments in the data. For example, they found that for off-brand queries the position effect was fairly constant with the rank of the website that was clicked on, but for brand queries the position effects were much stronger for unpopular websites: these low numbers mean that the relative click rate of, say, being in the second position was much lower than being in the first position, while if you were a very popular brand website then being demoted to the second position didn't do much to your click-through rate, because people who were looking for you would click on you anyway. So manipulation, in some sense, would be less effective in reducing clicks for brand queries. And actually I have a student — an undergraduate, now applying for PhD programs — who took my class and did another nice application of Deep IV in genetics: you can think about your genes as being random, affecting the probability of getting a disease for reasons other than behavior and what you eat, so he made a nice application of Deep IV to try to understand the effects of diseases and treatments. That was a nice application of this technique as well.

So now I'll tell you about one of the methods I've been spending a lot of time on, which is generalized random forests. I know random forests seem very old-fashioned to this audience, but nonetheless they still work pretty well out of the box, and they turn out to have really nice statistical properties, so we've been getting a lot of traction with them in the social sciences, because we're able to do things like prove asymptotic normality and get confidence intervals — which are still somewhat elusive for neural nets, although progress is being made on that, including by some colleagues at Chicago. The basic idea of what we're doing with generalized random forests is local estimation: we basically reinterpret random forests as a way to generate neighborhood functions. We say: if I want to figure out the treatment effect for you, I'm going to build a bunch of trees, and people who are more often in the same leaf as you are people I'm going to consider your neighbors — I'm going to treat them as being more like you. And then I'm going to run analyses like instrumental variables,
but I'm going to run them separately for each covariate value, doing a local analysis that weights more heavily the people who are similar to you. There's a lot of computational work that goes into this, particularly because we want to make sure we're looking for heterogeneity in parameters: our goal again is treatment effect heterogeneity, not predicting outcomes, so we optimize our random forest splitting rules for heterogeneity in the parameter estimates. So imagine a setting where we have a parameter of interest, theta of X — that would still be something like the treatment effect — and we have some moment conditions, or maybe maximum likelihood equations, that tell us how to estimate the parameter, but we want to do it locally: we want to make it a function of X, and we want to do that very flexibly. This moment depends on our parameter of interest theta, and it also depends on nuisance parameters. In the case of IV regression there are a bunch of different ways you can write down how you estimate it, but one way is in terms of a moment condition where you set equal to zero the product of the instrument and the residual of a regression of the outcome on the actual treatment. Those moment conditions can then be estimated locally as a function of X, and the way we operationalize that is that for each target x we create a weighting function alpha, and we solve for the parameters that set the moment condition equal to zero, weighting nearby X's more heavily. What do I mean by nearby? I'm going to generate a bunch of trees, each tree is a partition of the covariate space, and I'm just going to take the frequency of being in the same leaf as the weight that you get in this weighting function. Now, there are a few more details, because in order to get asymptotic normality we actually need sample splitting: in particular, we use subsampling to create the trees, and in fact we use two samples each time we create a tree — one sample to create the partition and another to compute the weights — because again we're being careful not to let an individual's own outcome influence the weights that it receives. So in this paper we establish asymptotic normality of the parameter estimates and provide confidence intervals, we also recommend orthogonalization, and we have some reasonably high-performing software on CRAN, in R, called generalized random forests (grf) — and if there are any students in the audience, we have lots of people contributing to this project and we welcome contributors. The package will do things like quantile regression, regular causal inference under unconfoundedness, randomized experiments, and instrumental variables; it can also deal with clustering and so on. We've also been working on some known weaknesses of random forests. In a lot of the economic applications I look at, there are smooth relationships — your outcomes might be increasing in income, or, if you think about a tech firm application, there's a lot of monotonicity: people who are more active are going to do much more of lots of other things, and they might also have higher treatment effects. The problem with forests is that they make these rectangles, and they weight everybody equally within a rectangle.
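Here is a deliberately simplified illustration of the "forest as a neighborhood function" idea described above: the leaves of an off-the-shelf random forest are used to build weights, and a weighted regression of the outcome on the treatment is solved at each target point. This is only a sketch of the intuition — the actual generalized random forest algorithm uses its own splitting rules, honesty/sample splitting, and orthogonalization, none of which are reproduced here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 4000
X = rng.uniform(-2, 2, size=(n, 2))
W = rng.binomial(1, 0.5, size=n)          # randomized treatment, for simplicity
tau = np.maximum(X[:, 0], 0.0)            # heterogeneous effect in the first covariate
Y = X[:, 1] + tau * W + rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=20).fit(X, Y)
train_leaves = forest.apply(X)            # (n, n_trees) leaf ids for training points

def forest_weights(x0):
    """alpha_i(x0): how often i shares a leaf with x0, normalized within each leaf."""
    target_leaves = forest.apply(x0.reshape(1, -1))[0]        # (n_trees,)
    same_leaf = (train_leaves == target_leaves)               # (n, n_trees) booleans
    return (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)   # average over trees

def local_effect(x0):
    """Weighted least squares of Y on W near x0 -> local treatment effect."""
    a = forest_weights(x0)
    w_bar, y_bar = np.average(W, weights=a), np.average(Y, weights=a)
    return np.sum(a * (W - w_bar) * (Y - y_bar)) / np.sum(a * (W - w_bar) ** 2)

for x1 in (-1.0, 0.0, 1.0):
    est = local_effect(np.array([x1, 0.0]))
    print(f"x1 = {x1:+.1f}: estimated effect {est:.2f}, truth {max(x1, 0.0):.2f}")
```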
Now, when we average over the rectangles we get smoother weighting functions, but we still aren't really accounting for the fact that if someone is on the boundary of a rectangle, they might be fairly far from the target observation. So together with a student of ours, Rina Friedberg, we wrote a paper called local linear forests, motivated by work that's been done in the statistics literature, where we adjust for observations within a leaf being far away. Another way to think about it is that we run a local linear regression at each point, weighting using the random forest kernels. Where a regular random forest would fit a step function and have a hard time estimating a steep slope in a small data set, by using these local linear adjustments we get much smoother fits. We applied this to causal inference: we looked at a randomized survey experiment done by the government — the big social survey that's done every year — where, for a period of years, they randomized asking people one of two questions: some people were asked whether they wanted to provide assistance to the poor, while other people were asked whether they liked welfare. Those are actually the same thing, but people in the U.S. don't like the word "welfare"; it's been poorly branded by conservatives. If you look here, we see that the treatment effect is actually much higher for conservatives than liberals, and for rich people than for poor people, but the normal causal forest kind of flattens things out near the boundaries, while the local linear forest makes things smooth.

So let me now skip to the last part. I want to talk a little bit about structural models, and discrete choice models in particular. This is an area where I've been working together with David Blei to put together some modern machine learning methods from Bayesian inference and variational inference with traditional econometric structural estimation. This is the problem of a firm trying to estimate the impact of changing prices. Dan McFadden worked on these models in the early 1970s and won the Nobel Prize for this work more than ten years ago, maybe fifteen. He was actually motivated by transportation problems: he was trying to say, counterfactually, what would happen if they expanded public transportation in San Francisco, so he really needed to understand welfare, and he used revealed preference to do that. I'm applying this type of model to supermarket scanner data, which is a very common application in economics and marketing. What we do is model, first, that each person has a mean utility: user u has a mean utility for product i at time t, which might depend on the observed characteristics of the products, and it's decreasing in price — the higher the price, the less I like the thing. Then I'm going to say that my utility is equal to my mean utility plus some idiosyncratic error, and it turns out that if that's an extreme value error — if it has a particular distribution that gives the logit functional form — then we can write the probability that user u buys item i at time t in this pretty multinomial logit format: it's just e to the mean utility of the item divided by the sum of e to the mean utilities of all the items.
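Written out, the choice model just described is the standard multinomial logit (the symbols below are conventional notation of my choosing, not the slides': psi denotes user u's mean utility for item i at time t, and epsilon the extreme-value error):

```latex
U_{uit} = \psi_{uit} + \varepsilon_{uit},
\qquad
\Pr\bigl(u \text{ buys } i \text{ at } t\bigr)
= \frac{\exp(\psi_{uit})}{\sum_{j} \exp(\psi_{ujt})}
```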
Lots of people run multinomial logits, but part of the Nobel prize-winning work was relating this statistical model to a user utility model, and he also did more work looking at some of the unpleasant assumptions that come out of the simple model and did various generalizations that made it more realistic. Now, if you just have a cross-section you can figure out something about overall price sensitivity, but if you have a panel of people, where you see, say, their shopping data at the store over time, you can actually learn an estimate of each person's price sensitivity, as long as prices are changing over time. That's what I do in a series of papers: I have some on shopping data, and we also have some using mobile location data for people going to lunch, where we have a large data set in which we observe people's mobile phones during the day, so we see where they work as well as where they go to lunch. For each of these we do a couple of things. We first bring matrix factorization into this very simple model: we look at lots of products, and instead of trying to estimate a utility for each product separately, we use matrix factorization techniques to improve efficiency. We also use modern computational techniques — historically, marketing people used Markov chain Monte Carlo, but that didn't scale very well, so we use variational inference — and we also do things like use data about the product hierarchy: we study products where people only choose one of them, and we use that fact in our estimation. So in our model, instead of having fixed coefficients, the mean utility is factorized and latent: we try to learn a latent factorization of the mean utility, and we also try to learn a latent factorization of the price sensitivity. Each user will be characterized by a vector of mean preference parameters as well as a vector of price parameters, and each product will also be characterized by a vector of parameters. So we'll learn that lettuce is like tomatoes, and that if you're price sensitive for tomatoes you might also be price sensitive for lettuce — we learn that from the data, but we have this nice functional form, and this multinomial logit functional form has the property that it tells us, if I pull one of the items out — say your most popular item — how you substitute to other products. So the model will actually tell you the consequences of raising the price of one product on the demand for all of the other products, and that functional form turns out to work very well in practice. Now, one of the modifications we needed was to model whether you choose the product category at all, because on most shopping trips you just don't buy paper towels at all. That's called nested logit — one of the things Dan McFadden got his Nobel Prize for — and so now we have a factorized version of the nested logit, where we also factorize the probability of not buying an item at all, and there's some nice algebra that shows how all of these things relate and simplifies the computation. So I just want to close with two interesting things that come out of this.
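A small sketch of the factorized-utility idea described above: each user and each item gets latent vectors for baseline preferences and for price sensitivity, a softmax over mean utilities gives purchase probabilities, and raising one item's price shifts demand to its substitutes. The dimensions, names, and random "estimates" here are illustrative assumptions — the paper fits these latents with variational inference, which is not shown:

```python
import numpy as np

rng = np.random.default_rng(5)
n_users, n_items, k = 100, 30, 5

theta = rng.normal(size=(n_users, k))          # user preference factors
alpha = rng.normal(size=(n_items, k))          # item attribute factors
gamma = np.abs(rng.normal(size=(n_users, k)))  # user price-sensitivity factors
beta = np.abs(rng.normal(size=(n_items, k)))   # item price-sensitivity factors

def choice_probabilities(prices):
    """Multinomial-logit purchase probabilities for every user, given item prices."""
    mean_utility = theta @ alpha.T - (gamma @ beta.T) * prices   # (n_users, n_items)
    expu = np.exp(mean_utility - mean_utility.max(axis=1, keepdims=True))
    return expu / expu.sum(axis=1, keepdims=True)

prices = rng.uniform(1.0, 3.0, size=n_items)
base = choice_probabilities(prices)

# Counterfactual: raise item 0's price by 10% and see how demand shifts,
# both away from item 0 and toward its substitutes.
prices_cf = prices.copy()
prices_cf[0] *= 1.10
cf = choice_probabilities(prices_cf)
print("change in expected demand for item 0:", (cf[:, 0] - base[:, 0]).sum().round(3))
print("change in expected demand for others:", (cf[:, 1:] - base[:, 1:]).sum().round(3))
```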
First of all, one of the things we did differently in this paper is that, because we had lots and lots of price changes in the data, we actually tuned our model for the counterfactual. In particular, instead of just tuning for the log-likelihood — for goodness of fit overall — we held out a data set of weeks that had price changes and then selected the model that did best at predicting the change in purchase probabilities from price changes, rather than the one that predicted best on average. It turned out those were different models, and that's important. This is just one of our goodness-of-fit tables: we looked at what happens in a week where another product in the category changed prices — so if one brand of bottled water increased its price, we looked at the goodness of fit in predicting what happens to the other brands of bottled water — and we do the same thing for out-of-stock events and own price changes as well. We find that our model — of course I'm showing you the picture, so of course our model does best — and we compare it both to traditional economics and marketing models and to what happens if you take a reduced-form machine learning model: you just throw a big factorization at things, take the parameters out, and stick them into an economics model. It turns out that taking a big machine learning model, getting reduced-form predictions that don't use any structure, and sticking those predictions as covariates into classic economic and marketing models does a lot better than not doing it, so if you were trying to scale at a large tech company, that might be a good intermediate solution relative to ours, which is more computationally expensive.

The last thing I want to show is what you can do with this kind of counterfactual model. This counterfactual is about personalized pricing: once I have estimated every individual's price sensitivity, I can say that some people are more price sensitive than others to a particular product, and that varies across products, so I can do a counterfactual where I imagine targeting a certain set of people with coupons. We compare the performance of our model with other models in terms of how much personalization helps, and by using this rich factorization model, relative to the traditional economics and marketing models, we get much more personalized estimates and therefore a lot more benefit from personalization. In our paper we also validate those things by looking at held-out data, and show that indeed the people we think are most price sensitive tend to be more profitable on days when they get prices that are more appropriate for them. So I will stop here and take questions — thanks very much. [Applause]

If anybody has questions, please come up to the microphone. While people are coming up with questions, I'll just say I'm going to make my slides available — I'm not sure I can get them on my website by tomorrow, but if you google Susan Athey AEA machine learning (AEA is the American Economic Association), I have a little Drive there where I keep all my lecture notes updated, so you can find a link to the Google Drive and I'll put them there by tomorrow. Yeah, question.

Hi, thanks for the talk. This is kind of a practical question for internet companies: why can't they just know the propensity of the treatment? They're the ones making that decision — it seems like it's easy to just have a model that outputs a probability of the treatment and then records that probability along with the treatment.
Yeah, so the question was: why don't the tech firms just keep the propensity? The causal inference folks lobby for that quite heavily — I'm actually on the board of some companies that do a lot of advertising, and we have lobbied tech companies to provide advertisers with those propensities, to help us evaluate experiments that we run or use observational data. The problem is that it's expensive: you can think of it like this — if there's randomness going on, when you show up at Facebook they just sample from a black box and figure something out. If they were going to give you a probability, they might have to draw a hundred samples from the black box, and if it's a black box they can't really write down the probabilities, though they could do something approximate offline. So we wish they would keep track of them more often, and maybe with enough lobbying they will; sometimes they do.

I guess I don't understand why it's a black box that outputs a decision instead of a probability of a decision, because normally in machine learning we have a probability distribution over what decision the classifier is going to make.

I mean, not always — not always in these tech systems. For the big search engine, you might have, like, 200 classifiers, but then they go into a final thing and out pops a page, so there's actually not a probability distribution over all the pages you could have shown. Thanks.

I have a question regarding local linear forests: for the forests that you have used, is there any incremental learning behind choosing what goes into the local regression?

Yes — I would say that's not fully solved. We would suggest that you might do some pre-processing to figure out which covariates go into that regression, and in the software we use a ridge regression to regularize.

Okay, but it's a bit complex, because random forests by themselves already have a lot of hyperparameters we need to fix, and with local linear forests there are even more parameters to fix, so it's a somewhat complex model.

Yeah, the tuning makes more of a difference — you're absolutely right — and we are working on tuning algorithms, but they work sort of well, not great, and they're computationally expensive, so you give something up there.

So you gave a pretty strong recommendation for doubly robust methods, and you also mentioned the Atlantic Causal Inference Conference competition; my understanding of the 2016 results was that modeling the outcome alone did extremely well, and I was interested in your thoughts on that result.

Yeah, so I think a lot of this really depends on the data generating process. If you can model outcomes well — if you have enough data to model outcomes well — then the propensity weighting just increases the variance, and I can absolutely generate lots of examples where that happens. But I guess I've been playing a lot with bandits lately, where we don't have enough data: in the beginning stages of a bandit you definitely don't have enough data, and so the outcome models are just not good enough.
But I think it completely depends — it does depend on the setting, and the small-data literature actually has the same message, it depends, which is a little bit sad. Still, I feel that as we've moved to higher-dimensional methods, it pushes us more towards doubly robust methods and orthogonalization, and then there are some of the results we have: we have theorems about estimating optimal policies, and we have good regret bounds — the best bounds are with doubly robust, and we don't think we could get the same regret bounds otherwise, although we haven't proven a counterexample. So I'm becoming more convinced of that, but I think in any particular competition, anything can win.

Okay, last question.

Hey, thanks for the talk. I'd like to ask a question about the interpretation of the results when we use so many IVs, like many A/B tests. We know that by using IV we're basing things on a local average treatment effect, so if we have so many IVs, how should we interpret the final estimate?

Yeah, it's a great question. There are a lot of subtleties with using instrumental variables, and if there's some unobservable that affects your treatment effects, then it can be very complicated to interpret your results. Actually, my husband, Guido Imbens, has done a lot of work on local average treatment effects — he coined that term — and I tried to get him to work on this problem ten years ago, and he didn't want to work on it for just the reason you said: he said it would be too hard, too messy. And then other people just assumed the problem away and went forward. So yes, instrumental variables can give very messy interpretations, and that's just a problem I don't have a solution for.

I see, thanks.

All right, thanks very much. Thank you.
Info
Channel: Steven Van Vaerenbergh
Views: 7,368
Rating: 4.879518 out of 5
Keywords: nips, neurips, 2018, tutorial
Id: yKs6msnw9m8
Length: 124min 0sec (7440 seconds)
Published: Thu Dec 13 2018