Foundations of causal inference and its impacts on machine learning webinar

Captions
Hi everyone, welcome to this webinar on the foundations of causal inference and its impacts on machine learning. My name is Emre Kiciman, I'm a senior principal researcher at Microsoft Research in Redmond, and I'll be speaking to you today together with Amit Sharma, who is a researcher at MSR India. We're both part of a group of researchers at Microsoft interested in causal inference and causal machine learning, and that will be our topic today. In this webinar we're going to talk about why causal inference is important and what we believe is special about it, we'll walk you through the four steps of causal inference and give you a detailed walkthrough of those steps with the DoWhy Python library, and finally we're going to close with some of the more fundamental connections between causality and core machine learning challenges.

So why is causal inference important? We're motivated, and I'm certainly motivated, by the trend of computing becoming more and more integrated into so many parts of our society, with data-driven decision making really transforming the way we make decisions in health and medicine, industry, retail, and many other scenarios. Think about medical scenarios, like whether an image of a skin lesion seems to be cancerous; about how people's behavior will change if I raise or lower retail prices; or about the implications of agricultural decisions, like what cover crops to plant or when to irrigate. These are all questions that are benefiting from the trend toward more data-driven decision making, enabled by the advances in computing over the last decades.

When we think about how to take that data and turn it into decisions, we often think about machine learning. What machine learning is doing, at heart, is searching for patterns in the data and using them to pull out insights that give us a better handle on what decisions we want to make in all these various domains. The problem is that the patterns correlational machine learning finds are often spurious ones, and spurious patterns are not robust. Machine learning fundamentally assumes that the data it sees in training is representative of the data on which it is deployed, and when that assumption is broken, machine learning makes mistakes. For example, algorithms that work really well at reading handwriting, recognizing objects, or answering questions about images all start to fail when the data they're deployed on changes: digits presented at different angles, stop signs viewed from different angles, or image questions slightly tweaked from the kind of questions asked during training. This can lead to serious problems and bad decision making; importantly, it can lead to biases in machine learning models in really critical situations.

So why do we care about this so much in decision-making scenarios? It turns out that when we start to use machine learning algorithms for decision making, we're not only assuming that the environment isn't going to change.
At the same time, through the decisions and actions we take, we're actively changing the environment in ways that often break those patterns. For example, think about the simple task of learning a machine learning model to help us make irrigation decisions on a farm. Say we're going to learn a model to predict soil moisture levels based on current readings and future weather patterns. We can train the model on years of sensor data from real farms and then ask: here's the status of my farm right now, and I know it's going to be hot in a couple of days, so do I need to water my fields to make sure the soil stays moist enough? A model trained on years of data is very likely to say no: the soil moisture will be high when the temperature gets really high, don't worry about it. Why might the model say this? It doesn't make sense; if it's going to be really hot, that should dry out the soil. Well, maybe the model learned that in the past, high temperatures were associated with high soil moisture because the farmer always watered on those hot days. That's a pattern the machine learning model picked up on, but it's a pattern we're going to break the moment we look at this prediction and decide not to irrigate.

So putting this together: computing is helping with decision making in a broader and broader set of scenarios, and we see this challenge where conventional machine-learned patterns aren't up to the task of helping us make good decisions. The interventions we make based on these analyses break the correlations that the machine learning algorithms depend on, and when we look into the models' internals for insight, we often find that they don't actually tell us how we might want to act in the first place. For decision making we need something a little different: we need to find the features that actually cause the outcome, and we need to be able to estimate how that outcome would change if those features were changed. One way to think about this is that in supervised machine learning we assume the training data matches the test data, and under that assumption we have great ways of estimating and evaluating a prediction model and making sure we're right. Causal inference is really a different task. In some ways it looks like a similar setup, but we're not assuming that the training data is the same as the test data, and instead of simply predicting a value, what we're actually trying to do is uncover the underlying generative model, the causal mechanisms that are creating our observed data.

Now this is a good time to step back for a moment and ask: what is causation? That is a tremendously philosophical question, and we're going to take a fairly practical definition just so we can move forward and make progress on some of these important problems. We're going to say that some action, treatment, or decision T causes an outcome Y that we care about if and only if changing the value of T leads to a change in Y while everything else stays constant. What that means is that we'll have some real-world data where we actively take action and change T.
We're going to say that T causes the change in Y if Y would have been different in some counterfactual world where we did not set T to that value. If there is a causal relationship, the causal effect is the magnitude by which the outcome changes every time T changes by a unit, and that is the value we're going to estimate.

This leads us to two really interesting fundamental challenges. One is that we never actually observe the counterfactual world. We get to see what we did, but if we're serious about the "all other things being equal" assumption, we never get to see the counterfactual world where we did something different. This gives us the first challenge of causal inference: we can't directly calculate the causal effect, say for ground-truth validation; we always have to estimate the counterfactual. If we do one thing, we have to estimate what would have happened had we done the other thing, and that means, among other things, that there will be challenges in validating our methods. The second fundamental challenge is that there are multiple causal mechanisms that can fit any data distribution we actually observe. We won't go into detail about why, but suffice it to say that data alone is not enough for causal inference: we need domain knowledge and assumptions to disambiguate the potential mechanisms that could be generating the observations we see. These are two of the challenges we'll keep coming back to as we go through this presentation.

I also want to highlight that, in the context of these decision-making tasks, a lot of data science and scientific questions are really, at heart, causal questions. When we run an A/B experiment, that's one way of answering a causal question: if I change this algorithm, will it lead to, for example, a higher success rate in whatever metric I'm optimizing? If I'm thinking about policy decisions, I might want to estimate the impact of a policy change after I've deployed it: in hindsight, did that policy help or hurt, given everything else that has happened since? That's a very interesting counterfactual question, where I'm combining my knowledge of how the policy works with the extra information I've observed about the world since deploying it. And there are fun problems in credit attribution: are people buying things because of the recommendation algorithm I just deployed, or would they have bought anyway? Whenever I see some outcome I care about, can I tell where it came from?

So to recap why we care about causal inference: decision making in all these important scenarios depends on understanding the effects of decisions and actions. If we really want to optimize how we make decisions, we need to be thinking causally. The predictions from correlational machine learning are insufficient, because our decisions and actions change the very correlations that machine learning depends on. Causal inference addresses this challenge directly, but it does run into new challenges of its own.
Primarily, we have to start bringing in outside knowledge to augment our observed data; we talk about these as modeling assumptions that we have to make explicit. And we need new methods for validation and new methods for estimating those counterfactual scenarios, so that we can really know we're making progress and trust our results.

In the second part of the talk we're going to go through the four key steps of causal inference. The first step is modeling the problem: creating a causal graph that encodes the assumptions we're bringing in from our domain knowledge, our knowledge about how the world works, to augment the data we're using. The second step is identification: taking those assumptions, that causal graph and model, and using them to formulate what we need to estimate. The third step is to actually compute the estimate: given all the realities of the data set, what's the best way to trade off bias and variance for the task at hand, so we get a causal estimate of the impact of doing one thing versus another on the outcomes we care about? And the fourth step is to validate our assumptions and run a number of sensitivity analyses, to see whether we can refute the estimate, whether we can come up with a reason not to trust it. This comes back to the challenge of lacking ground-truth validation for many of these tasks, which means we have to find new ways of making sure our analyses are trustworthy.

To implement these four steps we built DoWhy, an open-source Python library for causal inference. Our motivation for DoWhy was based on our experience working with data scientists and others on causal inference. We noticed that the most challenging parts of the causal inference process were not the statistical estimation, but the initial encoding of modeling assumptions in the form of a causal model, and the final step of validating and refuting the results. These are two steps that are different from what we do in conventional machine learning, so this is where we thought people could use more assistance, and that's what DoWhy provides. It steps people through the four stages in a way that ensures a transparent declaration of their assumptions, and it helps them evaluate those assumptions and the results they get at the end of the analysis, as much as is theoretically possible. These four key steps are the framing we'll use for the rest of this part of the talk.

We'll start with modeling our assumptions. Modeling and capturing assumptions is about converting domain knowledge, both intuition and concrete knowledge about how the world and the particular system we're analyzing work, into a formal model we can use to augment our data. A good model is built around exposing the key causal relationships between the outcome we're analyzing and the other variables that might influence it, including the actions we're considering. The assumptions are about how different variables affect each other, and we usually encode them as a causal graph, where each edge encodes a mechanism. A directed edge from one node to another, say from A to B, represents a direct cause: the value of A influences the value of B.
The graph as a whole implies a number of conditional statistical independences. In this example, A and C are independent of each other, and D is independent of A conditioned on B. You can read these statistical independences off the graph using rules called d-separation rules; we won't get into them here, but it's a fascinating topic if you want to dive deeper.

The key intuition about these causal graphs is that the assumptions are not actually the edges; the assumptions are encoded by the missing edges and by the direction of the edges. The way to think about it is this: if I draw an arrow between two nodes A and B, I'm saying that when A changes, B's value will change, weighted in some way, but that weight could be zero. So drawing an edge between two nodes isn't really assuming anything; it only allows that there might be a relationship. If I remove the edge, however, I'm setting that weight to zero by definition: I'm saying that no matter how I change A, as long as there is no path from A to B, B's value will not change. The relationships encoded in the graph also represent stable and independent mechanisms. That is, if we have a large causal graph representing many parts of some system or environment, and we reach in and change the system in some way, altering some of those arrows, we can assume the rest of the system, the rest of the causal graph, is unaffected.

Another key intuition is that the graph can't be learned from data alone. While we can look at the data and learn something about what the shape of the graph might be, it turns out that the same data can be represented by multiple graphs. We can narrow it down to an equivalence class, but we won't be able to pick out the single right one, and that's why we need to encode our domain knowledge explicitly; we can't skip this step and figure it out later. Finally, it's important to note that these graphs are a tool to help us reason about a specific problem. We can write them at different levels of abstraction, at a high level or in great detail; we can capture only the major causal relationships or the most minor ones. The right way to model these causal relationships depends on the question we care about and the context of the specific problem.

Here's an example graph, just to dive a little into how to interpret these graphs. It shows some of the relationships between user interest, user fatigue, what people have done in the past, some anonymous treatment we're considering (we won't say what it is), and an outcome variable. Reading this, we can read off our assumptions about the underlying system. From the fact that there is no arrow from user fatigue to user interest, and no path along directed edges from user fatigue to user interest, we see that user fatigue does not affect user interest. We can change the value of user fatigue, maybe find ways of making sure our users get more sleep or more rest, walk away from the system we're building and feel more relaxed, and we will not be affecting their interests. We also see that past clicks do not directly affect the outcome.
The history of what a user has done does, however, affect the outcome Y indirectly, through its influence over whether or not people see the treatment we're studying, and so on. Now, note that in this example graph the treatment T is influenced by quite a few things. When we talk about wanting to go in and set or change T, what we're really saying is that we're going to build a new system where the treatment T is not influenced by those other variables; it's influenced only by whatever we decide to do. We're going to reach in and change T, potentially completely independently of what's going on in the rest of the system. This is actually a new intervention graph, a new system in which all the edges into T have been removed, and it represents a new data distribution. We usually refer to that distribution using the do-operator, do(T). The causal effect we want to capture in causal inference is the distribution of Y under this intervention, the probability of Y given do(T): the distribution of the outcome given that we set the feature T to some value.

This leads directly to the identification problem, the identification step of causal inference. Here we take that P(Y | do(T)) and try to figure out how to calculate it. The challenge is that our observed data is generated by the graph on the left, where T is influenced by everything else going on in the system, and those things also influence the outcome we observe; yet we want to answer questions about the intervention graph, where T is set by us independently of everything else. So the question becomes: how do we express quantities from the right-hand graph, P(Y | do(T)), using only the statistical observations in the data generated from the left-hand graph?

There's a trivial solution in the case of randomized experiments. In a randomized experiment, the value of T is set with a coin flip; it is statistically independent of everything else, so there are no edges from the rest of the system into T in the first place. That means the observed graph is the same as the intervention graph, and therefore the probability of Y given do(T) is equal to the observed distribution P(Y | T). This gives us the intuition for one approach to causal identification: can we generalize this insight about randomized experiments by taking our observed data and finding ways to adjust it so that it simulates a randomized experiment? When the treatment T is caused by other features, call them the set Z, we try to adjust for their influence to simulate a randomized experiment. This approach is called the adjustment formula; the formula you can see on the screen adjusts for all the observed values of the features Z that influence the treatment. The features Z can't be just any set of features, though; they have to be what's called a valid adjustment set in order for this formula to give a correct estimate of the causal effect of changing T on the value of Y.
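For reference, the adjustment formula being described is the standard backdoor adjustment over a valid adjustment set Z:

P(Y | do(T = t)) = Σ_z P(Y | T = t, Z = z) · P(Z = z)

In words: condition on the adjustment set and then average over its observed distribution.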
There are a couple of different kinds of valid adjustment sets. One is the set of all parents of T: if you set Z to be all the parents of T, the formula gives the right answer. You can also identify a valid set of features using graphical criteria such as the backdoor criterion; we won't define these here, but there are algorithmic approaches to finding such feature sets. The key intuition is that the union of all features is not necessarily a valid adjustment set: you can't just throw all of your data at this and expect a correct estimate. You might be wondering why we don't always use the set of all parents of T, and why we need more complicated ways to identify valid adjustment sets. The issue is that some of those parent features, even though we know they exist, might be unobserved in our data set. We might know they're there and that they influence the value of T, but for one reason or another we weren't able to measure or capture them, which means we have to find alternative adjustment sets that don't include those unobserved variables.

In addition to adjustment sets there are many other identification methods. Some rely solely on the shape of the causal graph, such as the front-door criterion and natural experiments that provide as-good-as-random variation. Others require, in addition to the causal graph, further constraints or assumptions about the shape of the causal relationships, or they narrow down the kind of causal estimate we're interested in identifying; these include instrumental variables, regression discontinuities, and difference-in-differences. Many of these methods are already available in DoWhy.

At this point I'm going to hand off the presentation to Amit Sharma. As I mentioned earlier, Amit is a researcher at MSR India and a co-developer of the DoWhy Python library. He'll step us through the rest of the four steps of causal inference, starting from estimation, and then the final part of the talk on the connection between causal analyses and core problems in machine learning. Thanks very much.

Thank you, Emre. I'm Amit Sharma, and I'm excited to present this webinar. Let's move to the estimation step. Emre has talked about the first two steps, modeling and identification, and once we have the identified estimand, we want to use the observed data to compute the target probability expression. For the common identification strategies based on adjustment sets, estimation really just boils down to conditional probability estimation. Here, for example, the do-operator converts to the probability distribution of Y conditioned on T and W, assuming of course that W is a valid adjustment set. In the binary treatment case things become even simpler: to estimate the causal effect, you simply take the difference between the conditional distribution at T = 1 and the conditional distribution at T = 0. So across all the estimation methods you may encounter, the final goal is to estimate this conditional probability while keeping the confounders constant.
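As a concrete illustration of that computation, here is a minimal sketch, with hypothetical column names and assuming a binary treatment and a discrete confounder, of estimating the average treatment effect by stratifying on the confounder:

```python
import pandas as pd

def adjusted_ate(df: pd.DataFrame, treatment="t", outcome="y", confounder="w"):
    """E[y | do(t=1)] - E[y | do(t=0)] via the adjustment formula, stratifying on a discrete confounder."""
    effect = 0.0
    for _, stratum in df.groupby(confounder):
        p_w = len(stratum) / len(df)                                # P(w)
        y1 = stratum.loc[stratum[treatment] == 1, outcome].mean()   # E[y | t=1, w]
        y0 = stratum.loc[stratum[treatment] == 0, outcome].mean()   # E[y | t=0, w]
        effect += p_w * (y1 - y0)                                   # weight each stratum by P(w)
    return effect
```

Note that if some stratum contains only treated or only control units, the corresponding mean is undefined; that is exactly the difficulty the matching discussion below runs into.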
How do we keep the confounders constant in practice? Let's start with one of the simplest methods: matching. Imagine you have data about health, with some people who were given a treatment, say they cycled, and others who did not cycle, and your goal is to find the effect of cycling on their health. The simple matching method says: match people who have the same baseline health and confounders with each other. You match children with each other, you match people with similar characteristics, shown by the colors here, and once you do that you have effectively created a set where the confounders are the same within each pair. Therefore, when you compare outcomes within pairs, you recover the causal effect. Of course, in practice you won't have exact matches; the data may be high-dimensional, so typically we define a distance metric and say the matching condition is that the distance should be below some value epsilon.

This can work really well if you have a small number of confounders; the estimate, as the equation here shows, is just the average of the outcome differences within each matched pair. But in practice many people cannot be matched at all; they're shown in the gray box at the bottom. These are people for whom we couldn't find any match, so we have to exclude them from the data. Even this simple method reveals the challenges of building a good estimator, especially with high-dimensional data. We removed all those people because we could not match them under a very stringent matching criterion, and that means the estimate will not be reliable, because it rests on only a small number of matches. On the other hand, if you relax the matching criterion to obtain many more matches, the estimator becomes biased, because it's no longer capturing the do-operator you wanted; it's no longer keeping the confounders constant. In the real world things get even more complicated, because often only a small set of people received the treatment, cycling in this case, so those few people find very few matches, which leads to both high variance and high bias. What has happened over the past decade is that people have realized we need better methods to navigate this bias-variance tradeoff, and I'll talk about two examples of recent machine learning methods that can help with this estimation problem.

The question I want you to think about is this: suppose you try to match a person, a data point, and a good match does not exist. Is it possible to create a synthetic match? Let me describe what I mean. On the right-hand side you see a plot of people: the orange points are the treated people and the blue points are the ones who did not get the treatment. If you try the simple matching strategy, as the vertical lines show, you find that there is no one in the blue group with an equal value of the confounder to the treated person I'm pointing at, and similarly, if you pick a control point at random and look for a treated person with the same confounder value, you won't find anybody.
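As an aside, here is what the plain matching estimator described above looks like as a minimal sketch, with hypothetical column names and a single numeric confounder:

```python
import numpy as np
import pandas as pd

def matching_ate(df: pd.DataFrame, treatment="t", outcome="y", confounder="w", epsilon=0.1):
    """Match each treated unit to its nearest control unit within epsilon, then average the outcome differences."""
    treated = df[df[treatment] == 1]
    control = df[df[treatment] == 0]
    diffs = []
    for _, unit in treated.iterrows():
        dist = (control[confounder] - unit[confounder]).abs()
        if dist.min() <= epsilon:                      # matching condition: distance below epsilon
            match = control.loc[dist.idxmin()]
            diffs.append(unit[outcome] - match[outcome])
        # treated units with no close-enough control are dropped: the "gray box" problem above
    return float(np.mean(diffs)) if diffs else float("nan")
```

Shrinking epsilon keeps the pairs well matched but drops more units (high variance); growing it keeps more units but lets the confounder differ within pairs (bias). That is the tradeoff described above, and exactly the situation in which no acceptable match exists.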
What machine learning can help us do in that situation is estimate the outcome as a function of the confounders for different treatment values. We take all the people who were treated and fit the relationship between the confounders W and the outcome Y; in this case it's a straight line. We can do the same for the control group. What does this give us? It gives us a pretty good match for any person in treatment or control: for the blue point here, instead of matching to a real person, I can match to the prediction the model gives. We always have to make sure the function we fit is a good approximator, because the key assumption is that the estimated f captures the true relationship between the confounders and the outcome. If that holds, the causal effect is again given by simply differencing the outcomes of the treated and the control while they are matched on the same confounder values.

Other machine learning methods can generalize estimation to harder problems as well; I'll give the example of continuous treatments. This has been a hard problem in causal inference for many years: it's comparatively easy when the treatment is binary, one or zero, but what do you do when the treatment is continuous? One recently proposed method takes the cue that, as we discussed earlier, just fitting a predictor is not enough; you can't simply take your treatment and confounders and build a predictive model of the outcome. But there is a simple trick that still lets you use predictive models: break the task into two subtasks. Look at the plot first; we can always come back to the equations. The plot shows that the outcome Y typically has some correlation with the confounder, which is why you see a tilted line, and the treatment can also be correlated with the confounder. That's the issue: we want to remove this correlation so we can capture the effect of T on Y. What the double machine learning (Double ML) method says is: build a predictor that uses W to predict Y, and similarly build a predictor that uses W to predict T. Once we do that, whatever residual is left in Y is exactly the amount of variation not explained by W, the unconfounded variation we want, and likewise we get a residual for the treatment that is independent of the confounder. In the plot, the residualized points are flat with respect to the confounder. Causal inference then becomes a simple regression of these residuals on each other. What we've really done is use machine learning to extract the confounded part of both the outcome and the treatment, and once we have the residuals, we use linear regression, or another simple method, to find the causal effect of T on Y.

Depending on the properties of the data set, different estimation methods can be used. We've talked about matching and Double ML, but there are many other methods, and you have to try them and see which one works for your data set.
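To make the residual-on-residual idea concrete, here is a bare-bones sketch of my own, not the exact method from the talk or from any particular library; a real Double ML implementation also uses cross-fitting (sample splitting), as in EconML:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

def double_ml_effect(W: np.ndarray, t: np.ndarray, y: np.ndarray) -> float:
    """Partial out the confounders W (shape n x d) from both treatment and outcome,
    then regress outcome residuals on treatment residuals."""
    t_res = t - GradientBoostingRegressor().fit(W, t).predict(W)   # variation in t not explained by W
    y_res = y - GradientBoostingRegressor().fit(W, y).predict(W)   # variation in y not explained by W
    # The slope of the y-residuals on the t-residuals is the (constant) causal effect of t on y.
    return LinearRegression().fit(t_res.reshape(-1, 1), y_res).coef_[0]
```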
The good news is that all these methods are implemented in DoWhy, either directly or through EconML, another library from Microsoft, so methods that are quite complicated to implement are now available in a simple library for you to use.

But that's not enough. Now we have our estimate, which is great, but can we call it a causal estimate? As you probably remember from the earlier part of the talk that Emre presented, causal inference really depends on the assumptions, so just having an estimate from a causal method is not sufficient. Now is our chance to check whether the assumptions are actually valid in the current setup. As a recap, we have assumptions, most of them untestable, running through all three steps we've discussed so far. In the modeling step there are assumptions about which edges should or shouldn't be there; maybe we made a mistake, or maybe we missed an unobserved variable that could be confounding our estimate. In identification there can be parametric assumptions made in deriving the estimand, or other assumptions about the graph that we now need to check against the data. And in estimation, as you just saw, we made assumptions about the functional form, and some choices can lead to high variance in the estimate. This refutation step is quite important, to me probably the most important step of causal inference, because unlike machine learning, where you have cross-validation, there is no gold test data set available here. We really have to get creative and think about how to check the robustness of the estimate.

The best practice from experts is to run as many robustness or refutation tests as possible, so you can rule out violations of your assumptions. There are typically two types of tests, and you can think of them from a software development perspective. Unit tests check one part of the analysis pipeline: for example, you can test whether the model is consistent with the data using conditional independence tests, you can test identification using d-separation, and so on. Separately, there are what you might call integration tests, which check whether the whole process is correct: they begin with an assertion about what a correct end-to-end analysis should produce and test whether your method satisfies it. We'll specifically discuss the placebo treatment refuter and sensitivity analysis; we won't have time to cover all of the tests, but they are all implemented in DoWhy, so for any estimation it is now easy to run them all at once and check the assumptions.

One caveat before we begin: these tests are not proofs. They cannot tell you that your analysis is correct; think of them instead as a way to weed out bad analyses quickly. They can tell you when a method is wrong, but if all the tests pass, it doesn't mean the method is correct; it just means it isn't obviously bad and that you've done due diligence in your analysis.

I'll describe three methods to give an intuition for how refutation works. You can think of it almost like the scientific process: you have an estimate, which is now your theory, and you want to try to falsify it.
If you are worried about your modeling, say that you might have missed some confounding or some other variable, one easy thing we can do is use the properties of causal graphs: every causal graph implies certain conditional independence constraints over its nodes. Take the graph on the right as an example: there's a treatment and an outcome, and the user has said there are other variables A and B that affect the confounder W. Just by looking at this graph we can deduce, analytically, that A should be independent of B, that A should be independent of the treatment T given W, and that B should be independent of T given W as well. That's quite useful, because the user has effectively said, "here are the independences I believe hold in my data," and the refutation method can reply, "sure, I have the data, let me check whether they actually hold." That's exactly what this method does: it checks whether the observed data satisfies the implied independence conditions, using an appropriate statistical test. If all the conditional independences pass, great; if not, we have to look at the model again and perhaps change some arrows or make it more complex.

Once you're less worried about your model, you might worry about things further down the pipeline, for example that something went wrong in identification or estimation, and that's where integration tests help. One example is the placebo treatment refuter. The idea is that whatever the modeling and identification assumptions are, there is one thing we can assert: if the treatment is completely random, it should have no effect on the outcome. Given such a data set, any causal inference method should return an estimate of zero. That's exactly what we do. In the figure on the right, the first table is the true data, and the second is simulated data in which we take the treatment column, remove it, and replace it with a completely simulated random variable, drawn uniformly at random or, say, from a Gaussian. Because we generated this variable ourselves, we know it cannot be causing the outcome, so the effect has to be zero. We then rerun the whole pipeline and check whether, with this placebo treatment, the estimate actually comes out as zero. If it does, great; if it doesn't, the method is incorrect and we have to make amends.

The final refuter I'll talk about is, in many ways, the most interesting: we can test how sensitive our method is to unobserved confounding. Remember, this is one of the hardest assumptions in causal inference; we can never be sure that our model isn't missing some variable. What we can do is simulate a confounder with some correlation with both treatment and outcome, redo the analysis, and see how much the estimate changes. This method typically doesn't give a yes-or-no answer, but it gives you a sense of how robust the estimate is to even a small confounder that might be missing from your data, and that gives you a subjective understanding of whether the estimate can be trusted.
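Going back to the "unit test" idea, here is a minimal sketch of checking one of the implied conditional independences from the example graph above (A independent of T given W) with a simple partial-correlation test. The column names are hypothetical, and a test like this is only appropriate for roughly linear, continuous relationships; more general conditional independence tests exist in dedicated packages.

```python
import numpy as np
from scipy import stats

def partial_corr_test(x: np.ndarray, y: np.ndarray, z: np.ndarray):
    """Test whether x and y are correlated after regressing out z (evidence about x independent of y given z)."""
    Z = np.column_stack([np.ones_like(z), z])
    x_res = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # part of x not explained by z
    y_res = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # part of y not explained by z
    return stats.pearsonr(x_res, y_res)                    # (correlation, p-value)

# e.g. r, p = partial_corr_test(df["a"].to_numpy(), df["t"].to_numpy(), df["w"].to_numpy())
# A small p-value is evidence against the assumed graph and a cue to revisit the model.
```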
Now, armed with these four steps, let's walk through how to carry them out using the DoWhy library, just to give you a sense of how this causal analysis happens in practice. To make things fun, we've created a mystery problem; it's available online, and I'll share the link later. Imagine you're a data analyst and someone hands you a data set. Over time, X and Y look reasonably correlated, and the data has three variables: X, Y, and W. What do we do?

In DoWhy, the first step is to model the problem. Before even thinking about what to estimate, we model the whole setup. You do that using the CausalModel class, as shown here: you specify which variable is the treatment and which is the outcome, and you can also specify a graph. DoWhy records this model and, if you ask, renders it as a graph; here it's a familiar picture with a treatment, an outcome, and some unobserved confounders.

The second step is that, given the graph, everything else in DoWhy is automatic; the graph is where human knowledge and judgment come in. After that we just say, "you have the model, now identify the effect." That's the second line of the API. Something interesting happens in the output: DoWhy shows you a warning, reminding you that this is observational data, not data from a randomized experiment, so there may always be some missing confounding. It's a way of prompting the user to be careful and recheck the model in case something is missing. It then automatically produces the exact estimand to estimate based on the model you provided; it's essentially implementing do-calculus for you, so you can supply a graph and immediately know what to estimate.

Once you know what to estimate, you call estimate_effect, the third part of the API, on the same model. In this case we use linear regression, but as I said, you can use multiple methods of varying complexity. DoWhy does the estimation and comes up with an estimate of one, which means that changing the treatment by one unit changes the outcome by one unit. Interestingly, the causal effect is different from what the raw data suggests: it's lower than the effect you would infer if you just believed the correlation.

Finally, the most important step is to check whether the estimate is robust. Here I show one refuter; we chose the placebo treatment refuter, which, remember, replaces the treatment with a placebo, so the goal is to check whether the estimated effect then goes to zero. As you see in the output, the effect does go to essentially zero (not significantly different from zero), so we can feel reasonably confident that our model is a good one and that we can trust the estimate. In fact, because this is a simulated data set, we can also check against the ground truth, which we can't do in the real world, and the estimated effect matches the true one.
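For readers who want to map the walkthrough onto code, the four API calls look roughly like this. The data-generating lines are a hypothetical stand-in for the mystery data set (treatment x, outcome y, confounder w), and common_causes is used here as a compact alternative to passing a full graph; the DoWhy calls themselves (CausalModel, identify_effect, estimate_effect, refute_estimate) are the library's actual four-step API:

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Hypothetical stand-in for the mystery data set: w confounds both x and y,
# and the true causal effect of x on y is 1.
rng = np.random.default_rng(0)
w = rng.normal(size=5000)
x = w + rng.normal(size=5000)
y = 1.0 * x + 2.0 * w + rng.normal(size=5000)
df = pd.DataFrame({"x": x, "y": y, "w": w})

# Step 1: model the problem (a causal graph string can be passed instead of common_causes).
model = CausalModel(data=df, treatment="x", outcome="y", common_causes=["w"])

# Step 2: identification -- DoWhy derives the estimand (backdoor adjustment on w).
estimand = model.identify_effect(proceed_when_unidentifiable=True)

# Step 3: estimation with a simple linear-regression estimator.
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print("Causal estimate:", estimate.value)   # ~1 here, lower than the naive regression slope

# Step 4: refutation -- a placebo treatment should drive the estimated effect to ~0.
refutation = model.refute_estimate(estimand, estimate,
                                   method_name="placebo_treatment_refuter")
print(refutation)
```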
You can also try out the whole example on GitHub; it has many more details that you can check out at the link below. Our goal here was simply to give you a quick run-through of how DoWhy works and how causal inference can be made practical through the four steps. In practice, DoWhy has a much more extensible API: our goal is not just to implement the four steps ourselves, but to allow external implementations of those steps that still follow the same four-step API. For example, we already support estimation methods from external libraries like EconML and CausalML; the only difference for the user is that instead of a built-in DoWhy method name, they specify the external estimator, for example a method name beginning with backdoor.econml. We also have external contributions; for example, Adam Kelleher, to whom we're very thankful, has added a pandas DataFrame extension, which means you can estimate causal effects using the same pandas API you'd use to work with a DataFrame.

In summary, we've created an end-to-end library for causal inference. You can answer what-if questions and estimate the causal effect of any variable, and importantly, you do so using both data and domain assumptions; you need both. The four steps are shown here again; I'll just repeat the most important points: we provide a formal framework to make assumptions explicit, and we provide a set of automated tests to check the robustness of the estimate.

Let's pause for a second and think about what we've achieved so far. We started with the problem that machine learning is powered by correlations, and we saw that causal inference is a framework that can help us make sense of the data and find the causal effect of any variable we're interested in. But in the last five or six years there has been another line of work that is, in some sense, even more interesting: we're realizing that we can use causal inference to improve machine learning. It's not just that machine learning methods can be used for high-dimensional causal inference; causal inference can also be used for machine learning, and that's really exciting. I'll give a view into this recent research, much of which is still active, and we'll see how causal models can be more generalizable, can have better privacy guarantees, and give a principled framework for thinking about responsible machine learning.

All right, let's get into it. What is causal learning? In standard machine learning we're given a data set, and the goal is to predict Y as a function of X. Even though we might sometimes have structural information, for example the graph I'm showing on the right, people typically don't use it; they just minimize the loss of a function from X to Y. Causal learning says that's not enough to build truly generalizable models; let's use the knowledge we have about our variables. In this case, the graph shows that only XC, a subset of the variables, causes Y, because those are the parents of Y in the graph; all the others are merely correlated with Y, or blocked by XC. So causal learning says: build a model only on the XC variables. In most cases we actually have to learn a representation of XC; it isn't given to us, so the causal ML problem becomes one of learning a causal feature representation. As I'll show, such a model can have lower training accuracy at some points, but it stays consistent as you move to new domains.
There are three benefits you get if you can build a model that depends only on the causes. The first is that the out-of-distribution error of causal models is lower in the worst case. The intuition is simple: with a graph like this, imagine moving to a different domain; there may have been an intervention, so the distribution of the data and of the features may change, which is what the hammers in the figure indicate. But the nice thing about causal features is that across all these interventions, the function connecting the causal features to the outcome stays the same. So a model that predicts disease in one hospital will keep working as you move it to other hospitals.

The second property is that causal models have stronger differential privacy guarantees than associational models. This is a slightly technical result, but to give the intuition: because a causal model focuses on the common, invariant relationships, it doesn't depend much on individual-specific relationships, so the extent to which you can use the model's predictions to re-identify individuals in the training data has to be low. What we show is that the differential privacy guarantee is always stronger for a causal model than for a comparable associational machine learning model. Finally, and as a consequence of the differential privacy benefits, we can also show that causal models are more robust to privacy attacks such as membership inference. In membership inference, just by observing the outputs of a machine learning model, the logits for example, an attacker can predict whether a particular person was in the model's training data. Intuitively, this works because machine learning models sometimes overfit, so their predictions carry information about the training set; since causal models do better on differential privacy, it turns out they are also more robust to these attacks.

How to build such causal predictive models is a really exciting area of research, and it's something our group at Microsoft works on too. Three main approaches have emerged: use multiple domains and the diversity across them; use some randomized experiment data; or use causal constraints, that is, domain knowledge that people already have. We're excited to see what else emerges for building models that generalize well.

Finally, I want to end by showing that causal reasoning also provides useful definitions for responsible AI concepts like explanation and fairness. Take the typical explanation example: you applied for a loan and were denied. What usually happens is that you get a feature-importance explanation saying that annual income was the most important feature, but that gives you no actionable information; it doesn't tell you what you could do. That's where causality comes in: there has been work proposing counterfactual explanations, which tell you explicitly that if you had increased your income above a certain amount, you would have gotten the loan. It tells you what you can change and gives you actionable information. Similarly, fairness can be explored from a causal angle.
The simple idea is that if you can find counterfactual explanations for a model using only sensitive attributes like gender, then most likely there is a problem with the classifier. There is a formal definition of this in Kusner et al.; the simple version is that if a person gets rejected when you change only their gender, that's a problem with the model, and you can extend this using more formal counterfactual definitions. So, in summary, causal reasoning can help us enable better generalization, better privacy, and more responsible ML models, and there is a lot of interesting work still to be done in this area.

To conclude, we hope that during this session we were able to communicate that causal inference is key to many decision-making tasks: it can capture what-if scenarios that are not captured by traditional ML. But it is also harder; there is no free lunch. We have to model our assumptions well, and we have to be careful to validate them. More excitingly, there is an interesting exchange going on between conventional ML and causal ML, as we just discussed, and it's a very exciting area that we are contributing to as well. We also want to make these resources more accessible to the broader computer science community: that's why we've started writing a book, with a few chapters on causal machine learning publicly available at the link here, and of course there is our DoWhy library, which we'd love for you to check out. If you're interested in our publications, you can find the work of our causal ML group at the link below. With that, thank you; it's been a great pleasure giving this webinar, and we look forward to your questions. Emre and I will now be live and taking questions.

Hi everyone, thanks for attending the "Foundations of causal inference and its impacts on machine learning" webinar. We are now in the live Q&A session. I'm Amit Sharma, I have with me Emre Kiciman, and over the next 15 minutes we'll answer some of the questions you asked through the chat. Let's get started. Our first question: what do you do when you don't have the causal model, and how do you go about constructing a causal model for a practical system? Emre, do you want to answer that? ... Are we live yet? We are live; it sounds like Emre is having some connection issues. Amit, if you want to take that one, go ahead.

Sure. One of the things to understand is that we don't need the full causal graph; we just need the variables relevant to the question we're interested in. Suppose your goal is to estimate the effect of a recommendation system, where one product is being shown and another product is recommended next to it. In that case, it's not important to know everything that's happening inside the recommendation system or the live system. All we need to say is: here is a product, here is a second product, and one of the biggest factors influencing both of them is users' preferences for these two products. As long as you have variables that speak to users' preferences, along with contextual variables like time, location, and so on, you're already in a good place for causal analysis, in the sense that if you assume this causal model, with these variables as the confounders, then what you estimate will already be better than what you could have gotten by simply putting all the variables together in one model.
So that's perhaps the first insight: we don't really need to model the full system. For a specific question, you just want to find the variables relevant to that question, and everything else can be lumped together as unobserved nodes or confounders. The second thing to note is... Emre, do you want to take this question? It seems Emre is still having some trouble.

The last thing I would add about causal models is that, in our experience, building them is often a dialogue with domain experts. As computer scientists we often have this desire to learn as much as possible from the data, but one of the fundamental results in causal discovery is that it's not possible to uniquely determine a graph from observational data alone. You either need to do a lot of experiments or, what we've found very useful in practice, you talk to the domain experts. If you're looking at a system in healthcare, you can talk to doctors; for any other system, you can talk to the people who designed it or know it best. The key insight we want to convey is that it's about having the right causal model, even if it's an incomplete one: in practice, even with an incomplete causal model you'll be closer to the true estimate than if you didn't use one at all. A partial model that is causal may give you better insights than no model at all.

Let's look at the second question, which is about explanation; two of you asked it. There are explanation methods like Shapley values, and other feature importance methods. How do causal inference and DoWhy differ from those methods, and how can DoWhy be used for feature importance? The first thing to understand is that if you abstract what many of these feature importance methods are doing, they essentially say: given this input, this set of features x1, x2, x3, there was a particular output y-hat that I observed. These methods then imagine multiple scenarios, or, now that we have the vocabulary, multiple counterfactual worlds, where these features took on other values, and they estimate, over all those worlds, the importance of a particular feature. That's what methods like Shapley values do: they imagine worlds where, say, x1 is present and ask how much effect you get when you add x2 as well, and conversely where x2 is present and ask how much x1 adds. The key challenge these methods face is that they have no knowledge of which orderings are the right ones, or which combinations of variables are simply impossible in the real world. That is what DoWhy can offer here: if you're doing feature importance or explanation with any of these methods, you can imagine a causal graph behind them that tells you whether certain combinations are possible at all. From domain knowledge you might know that two features can never be active at the same time, or that x1 is always the first feature to be activated.
Then there was another question about how Microsoft Research is using causal inference. This has been an interesting journey for us: we've found applications of causal inference in a number of places at Microsoft. I'll talk about where we started and where it's having an impact now. Where we started was with questions about online systems, online recommendation systems, and Emre did some great work on online social networks. Hi Emre, are you back?

Yes, I am, sorry about those technical difficulties. I'm not sure exactly where we are; do you want to take this?

Sure. We were just discussing how Microsoft Research is using causal inference, and I started by saying that both of us began with online systems: you did work on social networks, and I was working on recommendation systems. Now we're seeing multiple applications in business and marketing, for example in sales, in finding the right interventions. We're also seeing applications in traditional, predictive machine learning models and how to make them more generalizable to new domains, as well as applications in finding out for whom an intervention works well; there are applications in Bing too. Emre, do you want to say more about how we're using it?

Yes. The way I see it, the place where causal inference can have the most direct impact on our computing systems is decision-making systems: wherever we're making decisions and intervening in a system as a result of a machine learning model is where we need causal methods the most. That thinking spans a lot of domains. We've seen it in trying to understand the signals we get from people's digital traces to understand what's happening in the real world in health contexts, with real-world evidence; we see it in online systems, where we study user behavior to understand what is and isn't working in our products; and we're starting to see it in industrial contexts as well, where people are using machine learning models across industry, agriculture, and other settings, in addition to the sales and marketing domains where econometrics has more conventionally used these methods. One other interesting line of work, Amit, is using causal insights to improve traditional machine learning tasks: improving prediction and classification, for example, where we might not need a causal understanding of the underlying system, but where we believe that using causal methods, or relying on more causal features and relationships, will improve core machine learning attributes that we care about, like generalizability and robustness.
Yeah, that's a great point, Emre, and it connects well to another question that was asked: how can we understand why conventional ML methods may not generalize as well, and what is the secret sauce behind causal methods that makes them generalize better? I'll give a very simple example to ground our intuitions. This is a common problem that is now well researched and still an active research area: suppose you have two domains and a prediction task. Take the simplest computer vision task, MNIST, where you're trying to predict the digit from the image, and suppose that in one of the domains color happens to be correlated with the digit. Conventional ML just looks for the simplest feature, the easiest route to high accuracy, so a conventional model has no way to know whether color or shape is the right feature. What we're seeing is that infusing techniques and insights from causal inference into the prediction problem, for example developing causal regularizers that encode that features like shape matter more here, gives the predictive model more generalizability to new domains. That's one of the simplest examples of the broader question: when you have data from multiple domains, how do you extract the features that are consistent across all of them?

One thing worth emphasizing is that traditional ML models rely solely on the data, so if you have spurious correlations in your data, the model can't easily distinguish between the spurious and the causal relationships, and spurious relationships, by definition, may change under some environmental difference. What causal methods let us do is formalize our assumptions, either about the structure of the data-gathering process or about the fact that the data came from multiple environments, and bring that knowledge into the analysis in a more systematic and rigorous way. That's the extra bit that helps focus these models on the causal relationships. I wouldn't say there's anything we can do with data alone using a causal method that we couldn't do with a conventional machine learning model; it's that the causal analysis brings in an ability to reason about additional structure and relationships among the data itself.
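As a toy illustration of that intuition, and not the specific causal regularizer mentioned above, here is a sketch that scores features by how stable their relationship with the label is across two simulated environments; the data-generating setup and the penalty are assumptions made for this example.

```python
# Toy sketch: prefer features whose relationship with the label is stable across domains.
# "shape" causes y in every environment; "color" is only spuriously correlated in env 0.
import numpy as np

rng = np.random.default_rng(1)

def make_env(n, color_strength):
    shape = rng.normal(size=n)
    y = shape + 0.3 * rng.normal(size=n)              # shape -> y in every environment
    color = color_strength * y + rng.normal(size=n)   # color only tracks y in some envs
    return np.column_stack([shape, color]), y

envs = [make_env(2000, color_strength=2.0), make_env(2000, color_strength=0.0)]

# Per-environment univariate regression slope of y on each feature.
slopes = np.array([
    [np.polyfit(X[:, j], y, 1)[0] for j in range(X.shape[1])]
    for X, y in envs
])

# Instability penalty: variance of each feature's slope across environments.
penalty = slopes.var(axis=0)
print("slopes per environment (shape, color):\n", slopes.round(2))
print("instability penalty (shape, color):", penalty.round(3))
# The stable feature (shape) gets a near-zero penalty; the spurious one (color) does not.
```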
So maybe we have time for one last question, about unobserved confounding; we obviously talked a lot about that in the presentation, and the question is how we account for unobserved confounders in our causal model. Maybe I can start. There are two places where we want to account for them. The first is in the modeling itself. Often, if you're doing traditional statistical analysis, or even machine learning analysis, you look at what data is available to you, as Emre was saying, and try to build the best estimator for the effect or quantity you're interested in. The downside is that if your data doesn't contain all the right variables, or misses a really important one, your estimate might be off, and you'd have no formal, or even intuitive, way of knowing that this is happening. So the first way to account for unobserved confounders, and this is what we've tried to do in DoWhy, is to state them explicitly in your causal model. In your graph, for example, you can say not only that you believe an unobserved confounder is affecting the problem, but also how it affects it: does it cause just one of the variables in your model, possibly all of them, or something in between? That gives us some perspective, some humility, about what the estimate we're computing is really capturing: is it the causal effect or not? That's the first way.

The second way, which is probably one of the biggest strengths of DoWhy as a library, is that you can do your analysis without having access to that variable, but then at the last step, in the refutation, you can say: I know my analysis is not perfect because there was an unobserved confounder, so let me simulate such a confounder and see how much my estimate changes. You can think of this as a sensitivity analysis: we don't know how strong the confounding actually was, but we can simulate unobserved confounders of different strengths and see how much our estimate would have changed had we included such a variable in the analysis. Looking at that, along with domain knowledge about how strong a missing confounder could plausibly be, gives us a much better idea of how robust our estimate is and how trustworthy or generalizable it may be.

Yeah, just following up on that: it turns out that even when we don't have the opportunity to observe a particular confounder, domain experts can still have intuition about how strong that relationship might be. If we can work out the strength of unobserved confounding that would be necessary to reverse our understanding of a causal relationship, we can compare that strength to the strengths of the confounders we are able to see, and a domain expert can say it doesn't make sense that there's an unobserved confounder thousands or millions of times stronger than anything we've seen so far. In that case you trust your analysis more than if even a tiny bit of confounding would reverse your estimates.
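For reference, here is a minimal sketch of that refutation step using DoWhy's add_unobserved_common_cause refuter on a simulated dataset from the library's dataset helper. The effect-strength values are illustrative only, and parameter names may differ across DoWhy versions, so check the documentation for the version you use.

```python
# Minimal sketch: simulate an unobserved confounder of a chosen strength and see how
# much the estimate moves. Effect-strength values below are illustrative only.
import dowhy.datasets
from dowhy import CausalModel

data = dowhy.datasets.linear_dataset(
    beta=10, num_common_causes=3, num_samples=5000, treatment_is_binary=True
)
model = CausalModel(
    data=data["df"],
    treatment=data["treatment_name"],
    outcome=data["outcome_name"],
    graph=data["gml_graph"],
)
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")

refutation = model.refute_estimate(
    estimand,
    estimate,
    method_name="add_unobserved_common_cause",
    confounders_effect_on_treatment="binary_flip",
    confounders_effect_on_outcome="linear",
    effect_strength_on_treatment=0.05,
    effect_strength_on_outcome=0.05,
)
print(refutation)
```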
Great. Thank you all for attending today; we really appreciate your participation and all the questions you asked. If you're interested in learning more, we have some great resources in the resource list to the right of your screen. The list includes a reference to the DoWhy library on GitHub, and within that repository you'll find notebooks you can explore as case studies of how DoWhy can be used. There are also links to book chapters we are writing on the same topic of causality and machine learning; the first two chapters of our book are available in the resources. We look forward to seeing how you build on this research and evolve the area of causality and machine learning. For any further questions, we're happy to hear from you: you can find us online, and on the DoWhy issues page you can raise issues and ask questions, and we'll be happy to answer them. With that, have a great day, and thank you so much. Thanks, everyone.
Info
Channel: Microsoft Research
Views: 1,638
Rating: 4.87 out of 5
Id: LALfQStONEc
Length: 76min 57sec (4617 seconds)
Published: Mon Mar 29 2021