Causal Inference in Tech

Captions
All right, thanks Michael. OK, more hand-raising: thank you for joining me after lunch on the final day. How many of you have heard the statement "correlation is not causation"? OK, everyone. Now, how many of you have estimated a causal impact? All right, totally what I expected. What that shows me is that everyone hears that correlation is not the be-all and end-all and that causation is really important, but in practice it isn't done very much, and one reason for that is that it's difficult and really, really messy.

Causal inference is the process of drawing a conclusion about a causal effect based on the occurrence of an event. It's difficult because, if you thought data munging was dirty, if you thought one-hot encoding was difficult, if you thought standardizing your variables could be really ugly, well, I have a treat for you, because causal inference is really not sexy. It's a branch of data science that's not sexy, but it's incredibly important, and hopefully by the end of today's talk I'll have proven that to you.

So let's go over some examples. I want to bring this back to the real world and give you a sense of some of the questions we face in the tech industry in Silicon Valley, not just at Yelp but across the board. Causal inference can give you the causal impact of things, so it can answer questions like: what is the monetary value of an additional user hour? We can actually put a number on that causal effect. Or: how is advertiser churn affected by switching to a new sales model? When you roll out a new sales model you might expect some advertiser churn, and we can measure that impact with causal inference. Or: if you reroute users from the mobile site to downloading the app, how does that change pageviews on your website? A lot of these questions are only answerable with causal inference.

OK, so the purpose of causality is establishing definitively that a feature X causes an outcome Y, instead of just trending along with it. So why don't we just establish correlation? Correlation is often enough if you want to do prediction. We've all seen ML models; no one ever has to talk about causality. You can predict and forecast the future out to however many days with correlation alone, without knowing the underlying dynamics that govern the causal flow of the variables. So if you're just doing prediction, it might be enough not to worry about causal inference. Or if you just want to understand general trends in your data and overlay heuristics, you may not need causal inference either; you don't need to invest in all of this. You could also just conduct an A/B experiment. The purpose of an A/B test is to establish causality, so a perfect A/B experiment, done well, does the causality for you, and in those cases you don't need to do causal inference separately. It's not always possible, though; in fact, in most cases it's not. It's very difficult to run a perfectly clean A/B test, especially when you're scaling up your enterprise.

OK, so let's first look at A/B testing. What would an A/B test look like, and how would we determine causality from it? Here I have two screenshots of a potential home screen that a user would land on in the Yelp app: some users would get the control home screen on the left-hand side, and some users would see the treatment, the new app home screen, on the right-hand side.
If we want to know how this new home screen affected user behavior, we can have different users see different things and then try to measure the effect that way, by seeing how they behave differently. Now, we can already see a few problems with this (this is not a real A/B test we ran, by the way). A lot of things are changing on the right-hand side: not only have all the buttons become different icons, but we've added an "add photo" icon at the top left, we've added an "add tip" button, there are some ads, and there are new icons at the bottom as well. So if you want to know the effect of adding the "add photo" button alone, you're not going to get that from this A/B experiment; all you're going to get is the effect of the whole new layout compared to the previous, status-quo layout.

OK, so let's say we find a two percent lift in some metric Y. This is something we do all the time; we run hundreds of A/B tests, and so do all the big tech companies: Amazon, Google, all of them. Great: in this particular case, if we've done the A/B test correctly, then we already know the causal impact of the new home screen; it's a two percent lift in metric Y.

But we can't always do an A/B test. There are a lot of issues with that. For example, if you have historic data and you want to study a treatment or event that occurred in the past, you can't go back and split your users into different cohorts and test it; you have to rely on the historic data and establish causality that way. Or, as I see a lot, there are imperfect experiments: the example I just gave was exactly one of those, because a lot of things were changing at once. If you want to determine which change caused the two percent lift in metric Y, it's going to be difficult to tease that out of the experiment we just ran. In imperfect experiments we also see non-random cohorting: sometimes users are not perfectly randomly split into treatment and control cohorts, so the experiment is not valid. There are also confounding changes. Tech companies run a lot of experiments simultaneously, and we try to layer them with different salts so they don't compromise each other, but it's not always perfect; sometimes experiments overlap, and then it's difficult to isolate the effect of one particular change. There's also causality from actions we can't control: if I want to look at user behavior and how it drives a business outcome, I can't go out and force users to perform certain actions. I can't have a controlled environment for something that happens in the wild, so if you want to measure the effect of something beyond your control, you're going to have to rely on causal inference. In addition, an A/B test is not always desirable: that was a big interface change, which could have a big impact on the user experience, and some changes we've made have taken hits that we wanted to mitigate, so sometimes you don't want to roll a big change out across the whole platform. And some tests may simply not be ethical or fair; you can imagine experiments on your sales reps or on some of your users that would not be acceptable.

OK, so the idea is that if we want to pull a targeted business lever in order to change an outcome, we do have to determine that the correlation is causal. Just having two variables trend together does not mean that we can move one of them to push the other.
Here's an example of that. Imagine we implement an X percent increase in push notifications (I'm going to introduce some terminology and call that the treatment), and we see that it's followed by a Y percent lift in the click-through rate. Click-through rate is a big buzzword in Silicon Valley; websites care a lot about it. If we know the relationship is causal, then we can actually use push notifications to reach a desired click-through rate. But if the relationship is not causal, then no amount of changing the push notifications is necessarily going to move the click-through rate. Causality also helps us properly attribute outcomes to business decisions: if we're investing in a bunch of different initiatives and the business is doing well, we need to know which initiative is actually causing the business to do well in order to invest further in it. We want to properly attribute success to specific business initiatives.

As Michael mentioned, real-world data is really, really messy, and performing causal inference on messy data is a pain; I suspect that's a huge reason why a lot of people stay away from it. It's not like an ML model. We may think an ML model is difficult: you have to do some standardization, and for a neural net you have to make all the pictures the same size, do some resizing and rescaling. But this is really different, because we're going to have to layer on a ton of assumptions in order to eke out just one answer. I'll walk through different data types today, including longitudinal data (otherwise known as panel data), cross-sectional data, and time series with a lot of periodicity. We're going to have to deal with all of these data types, with selection bias, with a bunch of confounding variables, and with reverse causality as well. These are all issues we face when we do causal inference.

At the heart of any causal inference problem is determining the counterfactual: we want to know what would have happened to the treatment cohort if they had never been treated. Let me translate that back into real-world terms: what would have happened to the people who did see the new app home screen if they had not seen it? Determining the counterfactual is something machine learning and deep learning care a lot about too, so this headache doesn't apply only to causal inference. If I can tease out what would have happened, then I can measure the size of the impact of what did happen, and I can get a sense of the return on our investment.

All right, first things first. Before we get started with the tooling I'm going to introduce, we need to check a few things, and they will keep coming back throughout the presentation: we're going to talk about selection bias, about identifying natural experiments, and about identifying confounders. No matter what inference technique you use, you're not going to get away from these three things.
All right, so I'm going to keep using actual examples of the kind we see in this online space. One question we ask is: does app usage actually increase user engagement? That's a pretty important question for any platform that has users and relies on them to generate content as part of the business model. There's only one problem: the general population may be infrequent Yelp users, but the users who have the app are probably more likely to be extremely engaged Yelp users. They're fundamentally different from the general population, so I can't just compare the general population with users who have the Yelp app installed.

So before we do any causal inference, we need to check whether there is selection bias. The treatment in this case is having the app, so I want to see whether the treatment is equally distributed across my cohorts: people with the Yelp app are the treatment cohort, people without the app are the control cohort, and I'm going to compare the effect of the app that way. If I look at the pre-treatment period (these are not real numbers, I can't show real numbers), I can see that, in fact, users with the app were already more engaged than users without the app. So how can I just subtract the outcomes of those two groups? I can't. I have to do some work beforehand to make sure the treatment and control cohorts are comparable.

Among many other techniques, we do that most commonly using matching. In this example, again, the treatment cohort is the users who have the app installed and the control cohort is those without it. We have this general population, but in our causal analysis we want to remove the people who definitely would have had the app and the people who had no chance of ever downloading it. We want people who are as comparable as possible; in some circles it's known as a twin study: for every control individual we find a twin in the treatment cohort. So the question becomes: how do I know the likelihood that a given person would have downloaded the app? If I know that likelihood, I can match people on it, so the first step is to determine that likelihood. There are many ways to do it, and one is the propensity score matching method, which is based on a logistic regression: first we model your likelihood of undergoing treatment, in this case installing the app. Of course, with any sort of linear modeling you have to be very careful; it's more of an art than anything else. There are a few packages available for this, but as you'll see throughout this talk, a lot of the causal inference packaging within Python is underdeveloped, so we rely on a lot of UDFs and on individual contributors to provide these modules. There is a causalinference package authored by an individual developer, Laurence Wong, and it's actually very good to use; but since propensity matching is based on a simple logistic regression, you can also implement the logistic regression via either scikit-learn or statsmodels. And for other matching methods, we can use clustering approaches such as nearest neighbors, readily available in scikit-learn.
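To make that concrete, here is a minimal sketch of the propensity-score-and-matching idea using only scikit-learn; the user table, covariate names, and data-generating process are made up purely for illustration, and a real analysis would also check covariate balance before and after matching.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for a user table; in reality the covariates would be
# whatever plausibly drives installing the app (the "treatment").
rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "mobile_sessions": rng.poisson(10, n),
    "num_other_apps": rng.poisson(30, n),
    "android_preinstall": rng.binomial(1, 0.2, n),
})
score = 0.15 * df["mobile_sessions"] + 2.0 * df["android_preinstall"] - 2.5
df["has_app"] = rng.binomial(1, 1 / (1 + np.exp(-score)))
df["engagement"] = 2.0 * df["has_app"] + 0.3 * df["mobile_sessions"] + rng.normal(0, 1, n)

covariates = ["mobile_sessions", "num_other_apps", "android_preinstall"]

# Step 1: model the likelihood of undergoing treatment (the propensity score).
propensity_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["has_app"])
df["propensity"] = propensity_model.predict_proba(df[covariates])[:, 1]

# Step 2: for every treated user, find the control user with the closest score.
treated = df[df["has_app"] == 1]
control = df[df["has_app"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["propensity"]])
_, idx = nn.kneighbors(treated[["propensity"]])
matched_control = control.iloc[idx.ravel()]

# Step 3: compare outcomes only within the matched sample.
print(treated["engagement"].mean() - matched_control["engagement"].mean())
```

This is one-to-one matching with replacement and no caliper; checking the overlap of the two propensity distributions before and after matching is exactly the before-and-after plot described next.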
Let me show you what the implementation looks like with the causalinference package. Here's how we model a user's likelihood of installing the app, that is, of undergoing treatment: you take a set of covariates, the X variables that determine selection into having the app or not. In this particular example, you can imagine what makes you likely to have the app: being a heavy mobile user in general (if we had data on mobile usage), having a lot of other apps, or having the app pre-installed on certain Android phones that we have contracts with. Those are all things that make you likely to have the app. We model your likelihood, and once we have everyone's likelihood, we match them up and keep only a pool of people who are similar. With the package, you essentially start up a causal model object, fit it, and get a propensity calculation back; then you have to do the matching yourself afterwards, because this just gives you the likelihood.

So, did we correct for it? This is an example of something I did: the before and after of the likelihood of undergoing treatment once I corrected for the selection bias. You can see that before the matching, the treatment cohort was much more likely to take whatever the treatment action was (in our real case it wasn't installation of the app, but in this example it is), so the people in the two groups were very different from each other. Afterwards, I took out all the people who were not similar to each other, restricting to a subset of the sample, and I'm left with a much more comparable pool of people.

We also want to look for natural experiments. They're a great way of getting exogeneity, which is essentially what you want in any causal analysis. Natural experiments are interventions that occur outside of a controlled setting, anything that happens naturally, and the treatment has to differentially affect different groups of people so that we can see the different impacts on the different groups. One example for us, which I mentioned before, is that the Yelp app comes pre-installed on certain Android phones. So if I'm looking at the impact of having the app on your phone, I don't need to look only at people who manually went and installed it; I can look at people who had the app on their phone naturally, so they didn't select into the treatment or control group. That's a natural experiment, and if you have one of those, it's much easier to determine causality.

We also have to very carefully identify all the confounding features. In this case, again, we're looking at having the app and its effect on user engagement, their likelihood of coming back to the app. What if we're forgetting something: another variable Z that's associated with both? I already mentioned some of those potential confounding factors. If I don't account for the fact that some people have the app pre-installed, or that some people are just heavy mobile users, then I'm going to get a really biased estimate of the impact of X on Y; I'll capture the effect of Z when I estimate the impact of X on Y.
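Here is a small simulated sketch of that omitted-variable problem; the data-generating process and variable names are invented purely for illustration, but it shows how leaving the confounder Z out of a regression inflates the apparent effect of X on Y.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000

# Z: underlying "heavy mobile user" tendency (the confounder).
z = rng.normal(size=n)
# X: app usage, partly driven by Z.
x = 0.8 * z + rng.normal(size=n)
# Y: engagement, driven by both X and Z; the true effect of X is 0.3.
y = 0.3 * x + 0.5 * z + rng.normal(size=n)

# Naive regression: omit the confounder Z.
naive = sm.OLS(y, sm.add_constant(x)).fit()
# Adjusted regression: control for Z.
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()

print("naive estimate of X's effect:   ", round(naive.params[1], 3))     # biased upward
print("adjusted estimate of X's effect:", round(adjusted.params[1], 3))  # close to 0.3
```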
OK, so we finally get to some techniques. These are the five techniques I'm going to go over, and I'm also going to talk about the tooling that's available and what you can do when tooling is not available. It turns out these are all based on statistical methods, so even when packages aren't available, we can drop down a level and build things from the ground up.

Let's start off easy, with a standard regression. Consider the question: how does price affect ad sales? That's definitely something we care about; we care about whether our cost per click is going to affect ad sales. But a "simple" regression is never simple when you're trying to get causality, because if you don't control for all the other potential confounding factors, your counterfactual cannot be established. The counterfactual here is based on the assumption that we have accounted for all possible causes of ad sales. This is one example in which both the cost per click and the ad spend (the advertising budget of the businesses) are really important predictors of ad sales, so if I only look at one of those dimensions and don't factor in the other, we're going to have problems.

There are lots of tools for regression, and I'm sure you've seen some of them: in scikit-learn we have the LinearRegression model, and in statsmodels we have OLS as well as a ton of nonparametric regression tooling. Doing this is fairly simple: you numpy column-stack the variables that you think determine the outcome, and then you simply fit. I like to show this even though it's a very simple thing, because it illustrates that, unlike other machine learning models where we rely a lot on the machinery, here we really rely on you, the data scientist, to have the domain knowledge to model this behavior properly. We can't rely on a neural net to find the patterns we can't see; we have to see the patterns in advance and put them into the model, otherwise you're not going to get a good estimate.
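For completeness, here is roughly what that column-stack-and-fit step looks like with scikit-learn; the data are synthetic stand-ins for the cost-per-click, ad-spend, and ad-sales variables described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 500

# Hypothetical data: ad_spend confounds the price -> sales relationship.
ad_spend = rng.gamma(shape=2.0, scale=500.0, size=n)              # advertiser budgets
cost_per_click = 0.5 + 0.0005 * ad_spend + rng.normal(0, 0.1, n)
ad_sales = 2.0 * cost_per_click + 0.01 * ad_spend + rng.normal(0, 1.0, n)

# Column-stack every variable we believe determines the outcome, then fit.
X = np.column_stack([cost_per_click, ad_spend])
reg = LinearRegression().fit(X, ad_sales)

print(reg.coef_)   # [effect of cost_per_click, effect of ad_spend], given the controls
```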
OK, another technique is Bayesian structural time series. I really like this fairly new package written by Google because it lets us really leverage time series data. Bayesian structural time series is based on the idea that we have control time series that can predict another time series; it's related to Granger causality, where one set of time series variables predicts something else. The question here is: how many additional daily clicks were generated by an ad campaign, net of cannibalization? What does "net of cannibalization" mean? Imagine I run an ad for a business and a user clicks on it, but that user might have clicked through to the business page anyway. I have to take away the effect of them potentially clicking anyway; I can't attribute that to the success of the ad campaign, because that would have been an organic user. So this technique will hopefully let us do that.

What we have here (is there a laser pointer? No? OK, no laser, that's all right) is the outcome variable Y. The second vertical dotted line is when the ad campaign started. Before that, we fit the time series model on the observations of Y prior to when the treatment occurs at that second dotted line. We have X1 and X2 as control time series that determine Y, and we fit the model to the pre-treatment time frame. Then, post-treatment, everything after that second vertical dotted line, we determine the counterfactual: we estimate what would have happened if the ad campaign had never happened. The dotted blue line on the right-hand side is our estimate based on the red and green control time series, and that's our estimate of what would have happened.

This assumes a lot of things. It assumes that the time series X1 and X2 were not impacted by the ad campaign (they have to be exogenous), and we have to assume that they have the exact same relationship with the outcome variable Y, in this case daily ad clicks, before and after the treatment. As I mentioned before, there are a lot of strong assumptions here, and in order to model the time series properly we have to do local linear regression and time series prediction with various periodicities. There's a lot of heavy tooling behind this, but luckily we have help from the package I'll talk about in a bit.

What this gives us is the following: in the top graph we're modeling ad clicks on a daily basis, and we get a prediction of what the ad clicks would have been each day had we not run the ad campaign, so every single day we can measure how much the ad campaign contributed to ad clicks. In the lower graph you get the cumulative impact. For something like an ad campaign you don't want the effect of just one day, or of exactly 30 days after the campaign started; you want a period. In this particular case we're looking at a seven-week window after the ad campaign started, and we want to know the total number of clicks generated by the campaign. The model gives us this cumulative effect across time.

Luckily, a lot of tooling goes into this, and our good friends at Google created a package for it called CausalImpact. They all have very clever names: causalinference, CausalImpact. The Google CausalImpact package is unfortunately written solely in R, but lucky for us there is the rpy2 connector to Python, so you can write R code straight into a Jupyter notebook. Of course, it does require you to know how to write R, which is potentially a problem. But again we're saved by an individual developer: there's a developer named Jamal (I can't pronounce the last name) who ported the CausalImpact package from R into Python. Again, this is alpha; I have not rigorously tested it, but so far it looks good in my few tests. What you do is set up a pre-period datetime list and a post-period datetime list, choose some control time series X1 and X2, and simply run the causal impact function. It does all of the heavy lifting under the hood: a bunch of nonparametric stuff, ARIMA, Fourier components, all of that. You don't even have to do it; that's the coolest part.
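As a rough sketch of that workflow, assuming the `CausalImpact(data, pre_period, post_period)` calling convention used by the Python ports of the package (the data here are synthetic, and the campaign date is made up):

```python
import numpy as np
import pandas as pd
from causalimpact import CausalImpact  # the Python port mentioned above

# Synthetic stand-in data: daily ad clicks plus two control series that are
# assumed NOT to be affected by the ad campaign (the exogeneity assumption).
dates = pd.date_range("2017-06-01", "2018-02-18", freq="D")
rng = np.random.default_rng(3)
x1 = 100 + rng.normal(0, 5, len(dates)).cumsum() * 0.1
x2 = 80 + rng.normal(0, 5, len(dates)).cumsum() * 0.1
y = 1.2 * x1 + 0.5 * x2 + rng.normal(0, 3, len(dates))
y[dates >= "2018-01-01"] += 15                       # the campaign's true lift

data = pd.DataFrame({"y": y, "x1": x1, "x2": x2}, index=dates)
pre_period = ["2017-06-01", "2017-12-31"]            # window used to fit the model
post_period = ["2018-01-01", "2018-02-18"]           # window where the counterfactual is projected

impact = CausalImpact(data, pre_period, post_period)
print(impact.summary())   # average and cumulative effect of the campaign
impact.plot()             # observed vs. counterfactual, plus cumulative impact
```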
But if you don't trust that individual contributor and you don't want to write R code, there's another solution: you could use Facebook's Prophet, which is a very powerful time series forecasting tool. I highly recommend it, but it does require you to know what you're doing in terms of modeling the time series itself: the weekly, monthly, and daily cyclicality. Then you get the prediction and determine the difference between the prediction and the observed outcome yourself. So you can figure it out yourself if you want, and that gives you a lot more control over what's going on in the backend. There are a lot of ways around that.

All right, another method that's used a lot in statistics is what's known as difference in differences. Here we already know we have a problem: we already know that the control group and the treatment group are fundamentally different, we recognize the selection bias, and we're going to correct for it naturally within the technique itself. The question here (I like to give you examples of all the different questions we face) is: did our new sales model affect our sales reps' call behavior? We switch out our sales models once in a while; they're built on machine learning predictive models and we update them from time to time. We put out a new sales model: did that result in reps calling more people, fewer people, or was there no effect at all? Here we know that some sales reps call more people than other sales reps, so there may have been differences between our control and treatment cohorts to begin with. What we do, in order to determine the effect of the new sales model (in other words, the intervention), is assume something very strong known as the parallel trends assumption: that the sales reps have a similar relationship between their slopes before and after the treatment. We estimate the counterfactual by assuming that the same relationship between their slopes would have been maintained had one group not seen the new sales model. One group got the new sales model and one didn't; in this example, the better-performing reps in terms of call volume got the new sales model first. So, to determine the effect while getting rid of the selection bias, you look at the change in slope relative to your estimated counterfactual.

Now, something a lot of tech companies face is that things are not deployed simultaneously. If you have a new model or a new feature on your site, you don't necessarily roll it out to 100% of your users immediately; you might roll it out in waves. Similarly here, we didn't roll the new sales model out to all of our sales reps at once; we were testing it, so we rolled it out in waves. So here's an example where the intervention occurs at different times: sales reps one through three see different before and after periods, and we have to do a little bit of manipulation (as I told you, tons of manipulation goes into eking out this one little number). What we can do is look at a fixed window of days before and after each rep got the intervention and cut them off that way, so even though they got the intervention at different times in the calendar year, we just look at each of them thirty days prior and thirty days after they got the intervention, that is, saw the new sales model.

Some tools for difference in differences: it's actually pretty easy to implement. This is one of the nicer and easier techniques to implement, and it's also very powerful because it has the selection bias correction built into the model. All you have to do is fit a standard linear regression with an interaction between your treatment indicator and the cohort group, and the coefficient estimate on the interaction T × X is your impact. Pretty standard; you can do that with a simple regression, as sketched below. If you want to do the matching across time intervals I talked about, those are custom functions, but they're easily done with the datetime functionality in pandas and numpy; I always implement them manually because there's no package for that.
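Here is a minimal difference-in-differences sketch using a statsmodels formula; the panel, column names, and effect sizes are hypothetical, with one row per rep per day already windowed around each rep's own rollout date.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_reps, n_days = 40, 60

# Hypothetical panel: 30 days before / 30 days after each rep's rollout.
calls = pd.DataFrame({
    "rep": np.repeat(np.arange(n_reps), n_days),
    "treated": np.repeat(rng.integers(0, 2, n_reps), n_days),   # got the new sales model
    "post": np.tile((np.arange(n_days) >= 30).astype(int), n_reps),
})
calls["daily_calls"] = (
    20 + 5 * calls["treated"]                   # treated reps call more to begin with
    + 1 * calls["post"]                         # everyone drifts a bit over time
    + 3 * calls["treated"] * calls["post"]      # the true treatment effect is 3 extra calls
    + rng.normal(0, 2, len(calls))
)

# The coefficient on the interaction is the difference-in-differences estimate.
did = smf.ols("daily_calls ~ treated * post", data=calls).fit()
print(did.params["treated:post"])
```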
We also have instrumental variables. Here's the scenario: how does signing up for Cash Back affect a user's activity? Yelp has something called Cash Back, by the way: if you check in at certain restaurants you can get some cash back, and I get 10% off my meals all the time as a result. The Cash Back team especially wants to know how it's affecting user activity. There's just one problem: users that are more active are more likely to sign up for Cash Back. That's known as simultaneity, where the arrows between X and Y go in both directions. In cases of simultaneity, what you often want is an exogenous variable: a totally different variable that's correlated with your endogenous treatment variable but not itself correlated with your outcome variable. In this particular case I could potentially look at sign-ups for third-party restaurant promo mailing lists, because those people are likely to have signed up for Cash Back but not necessarily likely to be very active Yelp users. You do, of course, have to test that a variable is a good instrument. There are a variety of tools for instrumental variables, also known as IV: you can do a simple OLS, or even just a correlation analysis, to check whether X and Z are correlated, so that Z is an effective instrument. There's also what's known as two-stage least squares, which is implemented in statsmodels' GMM module and does it for you. You put in your set of endogenous regressors (the ones you think are compromised, in this case the likelihood of having Cash Back), put in your exogenous regressors, and put in a set of instruments, in this case signing up for the restaurant mailing list, and you simply fit. The IV2SLS model is also available via the linearmodels package, which is separate and not as well tested as statsmodels.
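To show the mechanics, here is a hand-rolled two-stage least squares sketch on synthetic data; in practice you would use statsmodels' IV2SLS (in the GMM module) or linearmodels, which also produce the correct second-stage standard errors, and all the variable names here are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5_000

# Hypothetical variables:
#   z - instrument: signed up for a third-party restaurant mailing list
#   x - endogenous treatment: signed up for Cash Back (also driven by latent activity u)
#   y - outcome: user activity (also driven by u, which creates the simultaneity)
u = rng.normal(size=n)                                    # unobserved "active user" tendency
z = rng.binomial(1, 0.3, size=n).astype(float)
x = (0.8 * z + 0.9 * u + rng.normal(size=n) > 0.5).astype(float)
y = 0.4 * x + 1.0 * u + rng.normal(size=n)                # true causal effect of x is 0.4

# Stage 1: regress the endogenous treatment on the instrument.
stage1 = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = stage1.fittedvalues

# Stage 2: regress the outcome on the *predicted* treatment.
stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()

print("naive OLS estimate:", sm.OLS(y, sm.add_constant(x)).fit().params[1])  # biased upward
print("2SLS estimate:     ", stage2.params[1])                               # closer to 0.4
```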
All right, the final technique before we go into some of the pitfalls: regression discontinuity. Yelp has something called Yelp Elite: if you write enough reviews and you're a pretty engaged Yelp user, you can get Elite status, and then we fly you out to all sorts of events, you get to go to a lot of parties, and you get a lot of Yelp perks. So we can ask: does becoming a Yelp Elite cause users to actually write more reviews? Do they become more engaged once we give them this status symbol; will they contribute more to our community? The problem is that you become a Yelp Elite partly by writing a lot of reviews, so the outcome is partly determining the treatment itself.

How do we account for that? With regression discontinuity, we look at pre-treatment and post-treatment behavior and graph them against each other, and we leverage the arbitrary cutoff at which a user becomes a Yelp Elite. I don't actually know what the cutoff is; at some number of reviews you can become an Elite, and it's an arbitrary cutoff. So we say: if we look just locally at the users who were just below that cutoff and the users who were just above it, we can assume they're pretty similar. The difference between writing 49 reviews and 51 reviews, if the cutoff were 50 as in this example (I don't know whether it actually is), makes for pretty similar users. If we just look at how those two groups behave differently after one of them becomes Elite, we can get at the effect of Elite status. Again, these are not actual numbers, just to be clear.

There are a lot of tools for regression discontinuity, and once again this is a case in which R has most of the tooling, so we're going to have to use connectors or implement it ourselves, and as I said, it's not easy. First, we have to know that we can do a regression discontinuity at all. All of these techniques require strong assumptions, so every time you apply a model you have to make sure it fits the problem at hand; even in ML, if you're going to fit a binary classification model, you'd better make sure it's a binary problem. Here we need to make sure we can run a regression discontinuity, and to do that we can run a McCrary density test, which sounds fancier than it is: it just checks whether there's a discontinuity in the distribution of the running variable. You can write a custom UDF for that, because the concept behind it is pretty simple, or you can use R's DCdensity function, which implements the McCrary density test, and connect to it from Python via rpy2 (we do a lot of rpy2 connecting). On top of that, the regression discontinuity design relies on a lot of nonparametric estimation. In this example I showed straight lines being fitted on both sides of the cutoff, but they don't have to be linear; in fact they should not be, they should be nonparametric. We could, for example, use LOESS regression to get nonlinear fits on both sides, and then we'll have to interpolate some points in between because there's going to be some sparsity there, so we can use SciPy's interpolate functions. PyQt-Fit is another package that has this nonparametric estimation and does the interpolation as well, so there are a lot of options. And if you want to go the parametric route, just estimating linear fits before and after the cutoff, then you can use statsmodels or scikit-learn.
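Here is a minimal sketch of that parametric route on synthetic data, with a made-up cutoff of 50 reviews and a made-up bandwidth; a real analysis would use a nonparametric fit and a carefully chosen bandwidth.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
cutoff, bandwidth = 50, 15            # hypothetical Elite threshold and local window

# Synthetic users: reviews written before (running variable) and after (outcome).
reviews_before = rng.integers(20, 80, size=4_000)
elite = (reviews_before >= cutoff).astype(int)
reviews_after = (
    5 + 0.10 * reviews_before          # more prolific users keep writing more anyway
    + 2.0 * elite                      # the true effect of Elite status is 2 extra reviews
    + rng.normal(0, 1.5, size=reviews_before.shape)
)

df = pd.DataFrame({"running": reviews_before - cutoff, "elite": elite, "y": reviews_after})
local = df[df["running"].abs() <= bandwidth]      # keep only users just around the cutoff

# Local linear fit with separate slopes on each side; the coefficient on
# `elite` is the estimated jump in the outcome at the cutoff.
rdd = smf.ols("y ~ elite + running + elite:running", data=local).fit()
print(rdd.params["elite"])
```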
This is just a summary of some of the packages I've talked about. Of course, at the heart of it all, everything is implemented on NumPy and pandas; you don't have to rely on nested dictionaries or anything like that.

There are a lot of pitfalls in causal inference as well. I like this quote a lot, from Andrew Lang: "He uses statistics as a drunken man uses lamp-posts, for support rather than illumination." Great quote, and absolutely true: we always have to be suspicious of what we're finding. Again, it's an art. Here's a problem I ran into. Someone asked me: by how much does an additional user vote on a business page increase page views? On the Yelp platform you can vote on things like whether the hours are correct; a business says it accepts credit cards, and a user can vote "no, that's not true" or "yes, I agree that's true." So they were asking whether people engaging on the platform and voting on these attributes increases traffic to the site. This is the workflow of how that goes: user one goes to the business page; user one makes a vote, say "yes, this restaurant has a romantic ambiance"; then user two goes to the same page, and user two sees user one's vote. Now there's a problem here, because the question was: once a user sees the vote, are they more likely to view the page? But in the actual flow, the page view happens before the user sees the vote. So you can't actually do this analysis as it stands; you have to reframe the problem. It's very much about understanding the nature of, and having domain knowledge about, your problem.

There's also omitted variable bias. We ask ourselves: when a user adds an additional photo, does it affect the number of users that check in? The hypothesis is that if there are more photos on the business page, other people will see them, think "that's really great," go to the restaurant, and check in. But what happens if we just look at the raw relationship of photos to check-ins? There are a lot of things in between affecting it. If I want to know the value of an additional photo, I have to know how many photos were already there, how many reviews the business had, how old the business is and how long it's been around, because those are all things that affect whether a user checks in at that business regardless of whether there was an additional photo. So we have to be really careful about identifying all potential control variables. It's an interesting problem because you get to be super creative: it's not just hyperparameter tuning, like the other parts of my job where you run hyperparameter tuning through a program; here you have to be creative all the time.
Another pitfall is reverse causality. Does purchasing Yelp ads improve a business's performance? Obviously all of our advertisers want to know: if I purchase Yelp ads, is it going to improve my performance? The problem is that if the business is doing well, it's also more likely to have an ad expenditure budget and more likely to purchase Yelp ads. So we have causality going in two directions, and we need to mitigate that; one of the techniques I mentioned before, instrumental variables, helps with that.

And then we have stratification. Here you see a plot with the overall population on the left-hand side: in the overall population, some variable X is negatively associated with Y. But then we group the observations, into say men and women, or app users and non-app users, and suddenly the trend reverses. This is known as Simpson's paradox, and it's more prevalent than you might think. Imagine you're doing an analysis and you group your people such that the trend actually reverses compared to the general population: that's not good, and you're not necessarily getting the correct intuition. There are several ways this can happen. One is if there's an unequal probability distribution of the treatment across your groups, so Group A and Group B had different likelihoods of undergoing the treatment to begin with; it's oftentimes still about selection bias and removing it. Another possibility is that we're looking at a specific snapshot in time rather than the overall aggregate effect across time, so you have to make sure you look at the effect over the entire period rather than a cross-section at specific points in time. And you should also check whether proper cohorting was done before your analysis.

The final thing, which I've worked with a lot, is geospatial differences. We're actually a multinational company (a lot of people don't know that Yelp is multinational), and in another lifetime I was an international economist, so I looked at a lot of countries at once. And guess what: users in one country are not going to have the same effect on an outcome as users in another country. Time and time again, even in ML prediction models, when I put in geographical features I find they're hugely important to the predictive capacity, and in this case they're incredibly important for determining causality and controlling for the differences that occur geographically. There are many ways to do this. You can have one-hot encoding for each of the different geographies, but you can imagine the dimensionality exploding: if you're a company like Google and you operate in almost every city in the entire world, and you want to control for every single city, your dimensionality is already in the millions. So one thing you can do is group regions together and one-hot encode that way: you can group by state, by administrative or census region, by metro area, or by country if you're a big enough company, or maybe you only care about your top 50 markets, so you group into your top 50 markets versus everyone else. There are all sorts of ways; this is definitely about testing out a bunch of different approaches, but it's incredibly important to control for geospatial differences.
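As a tiny sketch of that grouping idea (the market list, metro values, and column names here are all hypothetical):

```python
import pandas as pd

# Hypothetical user table with a free-form metro-area column.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "metro": ["San Francisco", "Toronto", "Lisbon", "New York"],
})

# Collapse the long tail: keep the top markets, bucket everything else together.
top_markets = {"San Francisco", "New York", "Toronto"}   # stand-in for a top-50 list
users["market_group"] = users["metro"].where(users["metro"].isin(top_markets), "other")

# One-hot encode the grouped geography so it can be used as control variables.
geo_controls = pd.get_dummies(users["market_group"], prefix="market")
print(geo_controls.head())
```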
All right, so in conclusion: causal inference is hard, and it's really unsexy. I made a lot of assumptions along the way, had to check for a lot of different things, control for tons of factors, and tweak; it was very bespoke. There's no package where you can just toss your data in and it will spit out the causal effect; you have to engineer it and massage it very carefully. But that's also what makes it fun: you get to be really creative and apply a lot of your domain knowledge, and no two people will do causal inference exactly the same way. There is some Python tooling available, but not much of it, so right now we're relying on a lot of individual developer contributions (bless them), and I'm hopefully going to contribute that way at some point too. My hope is that in the future we'll pipe a lot more of R's functionality in this space over to Python; otherwise, in the meantime, we develop custom tooling, which requires a solid fundamental understanding of what's going on in the background.
Info
Channel: Anaconda, Inc.
Views: 3,827
Rating: 5 out of 5
Keywords: anaconda, conda, analytics, python, open source
Id: NOcqmcdEHFI
Length: 46min 0sec (2760 seconds)
Published: Mon Apr 16 2018