Michael Johns: Propensity Score Matching: A Non-experimental Approach to Causal... | PyData NYC 2019

Captions
I'm Mike Johns, a data scientist at HelloFresh. Thanks for coming. I'm going to talk about propensity score matching, which is a non-experimental approach to trying to accomplish causal inference.

Just a little bit about myself: data science is now the fourth leg of a very long career I've had using data and analytic methods to answer questions and solve problems. I started off as an experimental psychologist, torturing undergraduates and p-hacking a lot. I went on to epidemiology, worked for the city for many years doing policy research, and most recently made the leap into data science proper. So this is a topic that comes from my home field of the social and behavioral sciences.

A quick outline: I'll start with an overview, propensity score matching in a nutshell, and then get into the specifics: how to calculate a propensity score, which involves building a model and selecting covariates; how to use those scores to perform matching in order to find a comparison group; and then, if there's time, I'll briefly touch on estimating treatment effects. That's actually a pretty complex topic in and of itself that we could spend a couple of hours on, so I'll do a lot of hand-waving there.

So, propensity score matching. This is a quasi-experimental method designed to help you find a control group in a situation where random assignment is impossible. You're using the propensity to be exposed to something, say a marketing campaign, some new policy, some new law, as a proxy for assignment to a condition, as you would in an experiment. It's a pretty common method in the social and behavioral sciences, gets used a lot in educational research, and econometricians are probably pretty familiar with it. The original theory was developed by Don Rubin and Paul Rosenbaum back in the 80s, so it fits naturally within the social and behavioral science world. It's basically founded on the idea that the key to drawing causal
inferences is for the groups in a study, treated and non-treated, to be balanced, essentially to be equivalent on average. This is an idea that comes from the potential outcomes framework that Rubin has also been heavily involved in. You want exposure to be independent of any observed or unobserved background characteristics, so that you can infer that any differences between groups are based solely on exposure to some treatment or condition.

Now, this method is best suited for dealing with selection bias, which is just a pre-existing tendency to behave in a certain way, to expose yourself to certain material, and so forth. It cannot necessarily eliminate other types of confounds. The list I have here, history, maturation, regression, these all have a temporal aspect to them, so this isn't necessarily a solution for every potential source of confounding in an observational study.

Let's say we have a situation I often find myself in. I work for HelloFresh; we sell food, so it probably makes sense that we'd want to put some advertising up on the Food Network. In the ideal situation we would have some means of controlling who sees that ad and who doesn't, and then after some time, seven days, fourteen days, twenty-one days, we'd look to see how many of the people who saw the ad show up as customers, and how many of those who didn't see it show up. The difference between those two conversion rates is our estimate of lift, of a treatment effect. Unfortunately, the real world is not so convenient or cooperative. Often all we know is who saw the ad and who didn't; we had no control over whether or not they were assigned to see it. So almost certainly we have a confounding situation based on selection bias: people who are interested in cooking are almost certainly more likely to watch the Food Network, especially if they have
cable, and by that fact more likely to be exposed to the ad, and almost certainly more likely to become a customer of HelloFresh if they're interested in cooking. This is the classic illustration of a selection problem. We have background variables, in this case a latent variable that is related to both the chance of being exposed to the ad and the outcome itself. Typically we don't have access to this latent variable; we have information about other variables or factors that are related to it: demographic characteristics, past behavior, the psychographics we talk about in the world of marketing.

What we really want is to sever the relationship between these confounds, these selection biases, and exposure to the ads, so that when we look at the difference between the people who saw the ad and those who didn't, that inference isn't confused or confounded by the fact that the people who saw the ad are just different to begin with. Those of you familiar with causal graphs and the work of Judea Pearl will know this as the back-door problem: we want to shut that back door, and propensity score matching is basically one way to try to accomplish this.

There are four key steps. First you calculate the propensity score: you build a predictive model predicting the probability of exposure for both the people who, in this case, saw the advertisement and those who didn't. Then you use that score to create a matched control group for the exposed group. You check that the matching actually achieved balance, that it made the groups equivalent on average. And then you estimate a treatment effect based on the matched sample.

For those of you who like equations and this kind of formulation, the propensity score is pretty simply just the
probability of being exposed to some treatment, say an advertisement, conditioned on some set of covariates. Covariates basically means variables, features if you like, background variables and so forth: average age in a household, number of people in a household, median income, things like that. You're predicting the probability of being exposed, and then you use that probability to find unexposed units that are very similar to the exposed units.

Your goal here is to build a model of the selection process. One idea that I really like is that a propensity score takes all this background information and condenses it down to one scalar, and that makes it a little easier to find similar people, households, units, and so forth. What you're trying to do is optimize balance between the groups, optimize average equivalence, so that you can make selection "strongly ignorable," in the words of causal inference. For all the data scientists in the room, I'll call out what you're not doing: you're not trying to build the most predictive model that will then generalize to unseen data. What we're really building here is a descriptive model, not a predictive model.

To illustrate that: on the left you have your classic machine learning ideal, where for your positive cases the probability mass is pushed all the way up against one, for your negative cases the probability mass is pushed all the way up against zero, and there's very little overlap between the two groups. I'm not saying it always works out that way, but that's what we're going for. In the propensity score situation we're actually looking for some overlap. You still probably want the unexposed group's probability mass pushed up
against zero, but for the exposed group it's almost ideal to have a nice spread across the different propensities, because that region of common support is where you're going to find matches for your exposed group.

When it comes to building this propensity score model, predicting who's exposed and who isn't, one of the key things that comes up a lot in the literature is capturing the proper functional form of the relationship between the background variables in your population and exposure. Typically this is a question of whether there are complex interactions or nonlinear relationships you're missing. Despite that, what people typically use is logistic regression; it's a simple, handy tool. There's some limited research on more traditional machine learning models, random forests and boosting, and they typically don't give you any advantage except, maybe not surprisingly, when there's a lot of non-linearity and non-additivity, which violate regression assumptions. For the most part, logistic regression usually gets the job done.

Jumping back a little, in terms of how you construct this model and what goes into it, what your covariates or predictors are: again, you're trying to build a model of the selection process, so your predictors should ideally be related to that selection process. Really you're looking for what you could call true confounders, things that predict exposure as well as the outcome. You generally also want to avoid what are called instrumental variables; the economists in the room will know what I'm talking about. These are variables that are related to exposure but not to the outcome. Of course, this is a quasi-experimental method, and your ability to get a truly unbiased estimate depends on finding and including the key variables that shape that selection process.
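As an illustration of this step (this is my sketch, not code from the talk), here is a minimal propensity model fit with scikit-learn's logistic regression. The covariates and the selection process are invented for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Two hypothetical covariates, e.g. household size and an "interest in cooking" proxy.
X = rng.normal(size=(n, 2))

# Simulated selection process: exposure probability depends on the covariates.
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.5 * X[:, 1])))
exposed = rng.binomial(1, p_true)

# The propensity score is just the model's predicted probability of exposure.
model = LogisticRegression().fit(X, exposed)
propensity = model.predict_proba(X)[:, 1]
```

In a real application `X` would hold your household-level covariates and `exposed` the observed exposure indicator; the descriptive (not predictive) framing from the talk means you care about which covariates go in, not about held-out accuracy.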
Omitting variables is a matter of great concern. When you're thinking about how to build this model and which variables to include, I'd generally advocate having a theory of selection. Hopefully you have some domain knowledge and expertise you can draw on to think about the problem: what are the characteristics of people or households that increase the likelihood they see an ad for your product, and also increase the likelihood they become a customer without any exposure to advertising?

Along those lines, past behaviors are great to include, anything that happened before the exposure: demographics, attitudes, geography-related variables. Probably one of the better ones is any pre-exposure value you have on the outcome you're going to predict, if that's possible; in a business case that's usually more relevant when you have a product people purchase multiple times over time. As for unobserved variables, this is really the big shortcoming: by definition they're unobserved. But this is where having a theoretical model can be helpful. Think about what is likely to create selection bias, and whether there are proxies for that variable that you could use in the model. In the world I work in, marketing, you have all these third-party data providers who purport to give you data on things like interest in camping or interest in magazines. How good those data actually are is an open question, but there are potential sources out there you could use to get at things that are more latent: interests, propensities, proclivities.

This next point I make very cautiously: typically, over-including variables is not going
to be a huge problem, because remember, we're not technically worried about overfitting here. That said, overfitting is generally not a good thing, and we want to be careful about multicollinearity. But if you're not sure, this is a case where you can over-include, draw on more traditional statistical learning methods, use regularization, maybe do some variable selection, and then calibrate your probabilities just to be safe.

Once you've built the model, you have a probability, a propensity, for your exposed and unexposed groups, and now you have to do some matching. The population is scored, you match, and essentially you're trying to find overlapping groups that have similar propensities, both within pairs and across the entire group. There's a whole literature on matching algorithms, and we could spend an hour on different approaches, but the three core dimensions you see are: how greedy the matching algorithm is, greedy versus optimal; whether you put constraints on the distance, setting a caliper versus leaving it open through nearest neighbor; and symmetry, one-to-one versus one-to-many.

To walk through these quickly: in the greedy approach, you just iterate through looking for an acceptable match; you seize on it, stop, and exit. If we imagine we were looking for a match within about ten percent of our exposed unit and we started from the top, we might get to 0.56 or so and say, that's great, it's in my range, I'm taking that one, and exit, even though the closest match might be further down the list. In optimal matching, we're looking to minimize the total within-pair difference. This is a more traditional optimization problem where you go through the entire list of exposed units, try to find the best match for every pair, and then
optimize across the whole group.

Caliper versus nearest neighbor is a little tricky. A caliper just means you set some defined limit on how close the match needs to be. Sticking with the example where we're looking for about a ten percent match, we could pick any of the four highlighted candidates as a potential match. With nearest neighbor you're obviously looking for the closest match, but given the circumstances you could end up with a very close match or a very distant one; no limit is placed on how near or far the neighbor actually has to be. One-to-one versus one-to-many is pretty self-explanatory.

The general recommendation that comes out of the literature, and what people usually do, is to match on the logit of the propensity score and set a caliper, a limit on how close the match in the unexposed group has to be to the exposed unit, of about 0.25 standard deviations. In terms of greedy versus optimal, there's not a lot of evidence on which one performs better; often a greedy algorithm will give you fine balance between groups, and that's ultimately what you're after, but there can be instances where optimal is the better approach. One-to-one versus one-to-many depends in a lot of ways on the level of common support: how much overlap there is between the propensities of your unexposed and exposed groups. Unfortunately, this is an area with no silver bullet or magic bullet, whichever metaphor is right. It's going to depend on your situation: on how much the propensity scores themselves overlap, which depends somewhat on the class imbalance, and on the baseline differences between your covariates, how big those differences are between the two groups.
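The recommended recipe above, greedy one-to-one nearest-neighbor matching on the logit of the propensity score with a 0.25-standard-deviation caliper, can be sketched in a few lines. This is my illustrative implementation, not a library function:

```python
import numpy as np

def greedy_caliper_match(ps_exposed, ps_control, caliper_sd=0.25):
    """Greedy 1:1 matching on the logit of the propensity score.

    Returns (exposed_index, control_index) pairs; exposed units with no
    available control within the caliper simply go unmatched.
    """
    logit = lambda p: np.log(np.asarray(p, float) / (1 - np.asarray(p, float)))
    le, lc = logit(ps_exposed), logit(ps_control)
    # Caliper: 0.25 standard deviations of the pooled logit scores.
    caliper = caliper_sd * np.concatenate([le, lc]).std()
    available = set(range(len(lc)))
    pairs = []
    for i, score in enumerate(le):
        if not available:
            break
        # Nearest still-available control...
        j = min(available, key=lambda k: abs(lc[k] - score))
        if abs(lc[j] - score) <= caliper:  # ...accepted only within the caliper
            pairs.append((i, j))
            available.remove(j)
    return pairs
```

For example, `greedy_caliper_match([0.5, 0.6], [0.49, 0.61, 0.9])` pairs each exposed unit with its close neighbor and leaves the 0.9 control unused. An optimal matcher would instead minimize the total within-pair distance over all assignments.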
Once you've done your matching, you have your exposed group and a matched group of unexposed units that will now be your control. Now you need to check whether the matching worked: did you achieve balance, did you reduce those pre-existing differences on the covariates you're using? Generally you want to avoid p-values for testing the difference between the groups. A p-value isn't going to answer the question you want answered, which is "have I reduced the size of the difference?"; it tells you the probability of a difference this size or larger, and that's not really what you want to know. What people generally use instead is what I know as Cohen's d, also just called the standardized mean difference: you take the mean of your exposed group after matching, subtract the mean of the matched unexposed group, and divide by the pooled standard deviation, which gives you a mean difference in standard deviation units. It can be positive or negative; I usually look at the absolute value. Down in the right-hand corner is an illustration of what a Cohen's d of one looks like, and it's actually quite a lot of overlap. Like I said, there's no clear consensus, but 0.1 is usually a pretty good threshold to use, though it can vary depending on your use case.

This is an example from a recent match I did. On the left, before matching, are density plots for the exposed and unexposed groups, very similar to the ideal I showed you earlier; that was totally coincidental, not intentional at all. You can see that for the unexposed group most of the probability density is pushed up against zero, but for the exposed group there's pretty good support, pretty good coverage. Once you match, you can see the propensity distributions overlap almost completely. There's an interesting little kink right around 0.5, which probably has to do with the matching algorithm.
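The standardized mean difference just described takes only a few lines; a minimal sketch, with the 0.1 rule of thumb from the talk noted in a comment:

```python
import numpy as np

def std_mean_diff(x_exposed, x_control):
    """Absolute standardized mean difference (Cohen's d) for one covariate."""
    x1 = np.asarray(x_exposed, float)
    x0 = np.asarray(x_control, float)
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return abs(x1.mean() - x0.mean()) / pooled_sd

# Rule of thumb from the talk: an absolute SMD below ~0.1 after matching
# suggests the groups are acceptably balanced on this covariate.
```

You would run this for every covariate, before and after matching, to produce the kind of balance plot shown on the slide.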
More importantly, looking at the covariates, you want to check that you've achieved your goal of balance. In blue you have the unmatched differences across these demographic characteristics, percentage male, different age categories, race/ethnicity, education, and so forth, and in orange you see the size of those differences after matching. The dotted line is a standardized difference of 0.1, and matching generally does a pretty good job here, bringing most of the differences down to 0.1 or less.

Okay, estimating the treatment effect. There's a lot of research on this topic and a lot of different approaches you could take. The general recommendation is that once you have your matched set of exposed and unexposed, you want to incorporate the propensity score into the analysis somehow. The basic idea is that the propensity score and the matching aren't going to be perfect; there may be residual confounding, and adding the score into the model, as a sort of proxy for all the variables you're trying to control for, might give you a little extra bias reduction. There are three general ways to accomplish this. The standard way is regression adjustment: you have a model where you regress your outcome variable, let's say conversions, on an indicator of exposure, and then you include either a stratified version of your propensity score, usually quintiles, or the raw score. There's also a recommendation that came from Rubin that if you run your diagnostics and there's still a lack of balance on particular covariates, you can throw those in as well, just to be safe. The last two approaches I'll highlight are slightly different from what I just described.
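The regression-adjustment idea, outcome regressed on an exposure indicator plus propensity-score quintiles, can be sketched as follows. To keep this dependency-free I use an ordinary least-squares fit via NumPy rather than the logistic model that would be more conventional for a binary conversion outcome; the function is illustrative, not from the talk:

```python
import numpy as np

def adjusted_effect(outcome, exposed, propensity):
    """Outcome ~ intercept + exposure + propensity-quintile dummies.

    Returns the coefficient on exposure, i.e. the adjusted treatment effect.
    """
    outcome = np.asarray(outcome, float)
    exposed = np.asarray(exposed, float)
    # Stratify the propensity score into quintiles (indices 0..4).
    q = np.quantile(propensity, [0.2, 0.4, 0.6, 0.8])
    strata = np.searchsorted(q, propensity)
    dummies = np.eye(5)[strata][:, 1:]  # drop first quintile as reference
    X = np.column_stack([np.ones_like(outcome), exposed, dummies])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta[1]
```

Per Rubin's suggestion mentioned above, covariates that remain imbalanced after matching could be appended as extra columns of `X`.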
Stratified analysis is a classic epidemiological method for controlling for confounding: you take some confounder, stratify it (again, I think it was Cochran who suggested quintiles), estimate the treatment effect comparing exposed and unexposed within each stratum, then weight and pool the results. You could do this after matching, but this type of analysis really came out of the idea that you wouldn't do any matching at all; you'd do a more classic epidemiological-style analysis using the stratification. I haven't really seen that used much either way in this literature, but it was talked about early in the method's development, back in the 80s and 90s.

Inverse probability of treatment weighting is a method I now see popping up in marketing more and more; it's often referred to as the synthetic control method. Essentially, you fit a regression model and weight it by the inverse of the propensity. Typically this is done without matching, which potentially lets you get inference over the entire range of exposed and unexposed units, though you could also do it on a matched sample. My general preference, if possible, is to find a good match and basically create a true quasi-experiment, where you have a treated, exposed group and you estimate the average treatment effect in the standard way.

Okay, limitations and cautions. As I've said multiple times, this is a quasi-experimental method, not an experimental one, and the ability to produce valid inferences depends on how well you can account for the key confounding variables. Again, I think theory is really important here, but ultimately it comes down to, I would argue, whether you can rule out all plausible alternative explanations for your findings. I want to emphasize this because I feel like that point gets lost a lot in the statistical literature on causal inference.
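The inverse-probability weighting described above amounts to weighting exposed units by 1/p and unexposed units by 1/(1-p), so each group is reweighted to resemble the full population. A minimal sketch (my own, simplified to a weighted difference in means rather than a weighted regression):

```python
import numpy as np

def iptw_effect(outcome, exposed, propensity):
    """Average treatment effect via inverse-probability-of-treatment weighting."""
    y = np.asarray(outcome, float)
    t = np.asarray(exposed)
    p = np.asarray(propensity, float)
    # Exposed units weighted by 1/p, unexposed by 1/(1-p).
    w = np.where(t == 1, 1.0 / p, 1.0 / (1.0 - p))
    treated_mean = np.average(y[t == 1], weights=w[t == 1])
    control_mean = np.average(y[t == 0], weights=w[t == 0])
    return treated_mean - control_mean
```

Note that weights blow up when propensities approach 0 or 1, which is one reason trimming or good common support still matters even though no matching step is required.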
Ruling out alternative explanations is really what causal inference is all about, and having some theory is very helpful for thinking about the potential confounds, the potential threats to the validity of your inferences.

Your inferences are also potentially based on a subsample, the region of common support where the propensities overlap. If you go back to the machine learning ideal with a very small area of overlap, imagine trying to do a propensity score analysis in that region. Maybe with big enough data you could, but in a sense your inference would be about a subpopulation that's right around fifty-fifty, which is a somewhat contrived situation; it's always seemed a little odd to me. That's why you really want to make sure you have good coverage.

The general literature, including simulation research, suggests propensity score matching is pretty consistent, pretty good for directional inference: is there an effect, and is it positive or negative? There's a lot of mixed evidence about how well you can recover the size of the treatment effect. Unfortunately there's no silver or magic bullet, pick your preferred metaphor, for this particular problem, and your results are definitely going to vary. Probably the best thing you can do, if you want to implement this for the use cases you're dealing with, is to create some synthetic data similar to the data you're working with and test different approaches and do sensitivity analysis: see what happens when you change your matching algorithm, change model specifications, try different interactions, things like that. I wish there were a simple and easy cookbook for this approach, but it's generally going to take a bit of work up front to get it to a place where it's useful for your particular use case.
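The "test it on synthetic data" advice can be made concrete: generate data with a known treatment effect and a known confounder, then check how far a naive comparison drifts from the truth before trying your matching pipeline on it. All numbers here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

# A synthetic world with a known effect, for sanity-checking a pipeline.
interest = rng.normal(size=n)                    # latent confounder ("interest in cooking")
p_exposure = 1 / (1 + np.exp(-1.5 * interest))   # selection on the confounder
exposed = rng.binomial(1, p_exposure)

TRUE_EFFECT = 0.10
convert_prob = np.clip(0.05 + 0.10 * interest + TRUE_EFFECT * exposed, 0, 1)
converted = rng.binomial(1, convert_prob)

# The naive difference in conversion rates is biased upward by selection,
# because 'interest' raises both exposure and conversion.
naive = converted[exposed == 1].mean() - converted[exposed == 0].mean()
```

A matching or weighting estimator run on the same data should land closer to `TRUE_EFFECT` than `naive` does; comparing estimators on worlds like this is exactly the sensitivity analysis the talk recommends.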
Some resources. Perhaps not surprisingly, there's not a ton of packages in Python; this is a method that comes out of statistics, so that's where most of the packages have been developed. There are two packages I found: causalinference, which looks like it was built about three years ago (I don't think there's been a pull request in about two years, so maybe you want to make one), and pymatch, which is in a similar situation; it looks like it was built for a particular project about two years ago, and I'm not sure it's even being maintained. If you want to just dive in and you're not hung up on using Python, you can check out MatchIt, an R package that does a lot of this. On theory and practice, the original article by Rosenbaum and Rubin is probably worth checking out. For more applied articles, Peter Austin at the University of Toronto has done a ton of research on propensity score matching, different matching algorithms, different modeling approaches, and so forth, and the last article is a really nice overview of the issues you need to think about at each step of a propensity score matching analysis.

Okay, I think that's all I have; happy to take questions.

[Q&A] Yes, that's a good question. You're really concerned with including and capturing all of the potential confounding factors, so it's a more traditional statistical model, where I'm just trying to build the model that I think best represents how the predictors relate to the outcome. When I've used this, I do very often apply a light amount of regularization and do a train/test split, but I'm not really trying to optimize anything, to be honest; I'm focused more on which variables are in the model and whether those are the right variables to potentially eliminate any confounds or
selection biases. You could use scikit-learn; it's not going to be a detriment in any way, but it's not necessary either. You can use statsmodels and just build a simple regression model.

[Audience] Can you talk about why this is used as opposed to a model like covariate matching? Let me repeat your question back to you, because I think what you're asking is: what is the advantage of a propensity score versus just matching on covariates in their original form? That's a great question. The traditional approach is, you have, say, five background factors and you try to match on them almost exactly, or via stratification, but you quickly run into practical limits on how many covariates you can match on. One of the real advantages of this approach is that you could have 10, 20, 30, 40 covariates that you're theoretically matching on simultaneously. In a way it's more efficient; it gives you more bang for your buck. In some of the projects I've done, where we had a lot of potential background information, there's just no way you could do a straight one-to-one or even what I guess you'd call a hyper-stratified match. Does that answer your question? And like I said, the way I've always thought about it, which is really nice, is that you're using this model to take all this information and get it down to one number, and once you've got that one number, the matching is a lot easier.

Oh, sorry, that's not a question, that's "five minutes left."

[Audience question about multiple treatment conditions] So for the propensity model, say you had a control condition and three treatment conditions? Yes, you could adapt it to that situation. I actually haven't seen a lot of work on this; I think right now the general approach would be,
let's say you had three treatment conditions: you would build three propensity models, one for each treatment against control. I've not seen it done where the exposure is some continuous variable; I think you'd still need to categorize it in some way, to create those buckets, those levels, but theoretically you could do it.

[Audience] My question is, if there's not enough... Yeah, that's a good question. You could try different model specifications, frankly, to see if you could increase that coverage. In a situation like that, your best option is probably something like inverse probability of treatment weighting, so that you can use the entire sample; I think that's why that method has become so popular, because it doesn't need that overlap. But to some degree you're kind of stuck, and in an interesting way, the model is telling you that these are just two very different groups that maybe aren't comparable.

What is it called again? Wang? Okay. And I should say, I've found one or two papers that try to use boosting to build the propensity model; there's not a lot out there, but that's good to know.
Info
Channel: PyData
Views: 11,939
Keywords: Python, Tutorial, Education, NumFOCUS, PyData, Opensource, download, learn, syntax, software, python 3
Id: gaUgW7NWai8
Length: 34min 25sec (2065 seconds)
Published: Sat Nov 30 2019