What is causal inference, and why should data scientists know? by Ludvig Hult

Captions
Oh, okay. Hi everyone, nice to see this big crowd. I was going to ask: who of you would situate yourselves as working with data, in that slightly broader sense than just data scientist? Okay, so an even much larger bunch. And who of you are a bit familiar with statistics? Okay, yeah, that's good, then I have an idea.

These two are cats, aren't they? The pictures are taken from a research article from 2017 about adversarial attacks; if you know that word, you know where this is heading. They fed these pictures into Google's by-then very high-performing image labeler, Inception v3. For the first picture it says: well, we're quite certain that is a cat; if we're allowed to round a little bit, it's 100% sure it's a cat. It could be a different type of cat, who knows, but it's definitely a cat. But the other picture, is that really a cat? I think so, but Inception thought it was guacamole. I think it's a funny example. Does the computer really think that the cat is mashed avocado? It most certainly doesn't; it doesn't really have ideas about what things are. It only had some training set and some test set, it performed quite well, people probably did not use SHAP on it, so they just said that the model performed quite well. And then some other researchers come along and find this really stupid example; they probably added some noise. I have the reference at the end; there's a really good blog post on how to reproduce this if you're interested. As for why neural networks are sensitive to adversarial attacks like this, people have not entirely reached consensus, but the way I think about it is like this: if you have a complicated model, it will identify patterns in the data, and the computer will have no idea what a reasonable explanation is; that is always up to humans, at least this far. And one way to think about causal inference is really about that: we're not happy to just find statistical patterns in a data set, we're interested in
what the causal structures are. For those of you who are wondering what these two weird words mean, think about it this way: "causal" means it has to do with cause and effect; "inference" means to infer, to draw conclusions. So causal inference is drawing conclusions about cause and effect. A rough translation to Swedish, for those of you who speak Swedish, could be something like "orsaksslutledning", maybe, I don't know, it's very convoluted.

Who am I? My name is Ludvig Hult. I'm a PhD student at Uppsala University, where I study machine learning and causal inference. We try to find the best of two worlds, where we draw these robust conclusions while still using modern machine learning techniques. The first three points on the agenda are more about the "what", so I will not give you a definition; actually, I will give you a characterization by example. And sorry, I need to check the time, okay. Then at the end I will talk a little bit more about the "why".

Maybe you know this comic from xkcd. It's a quite common example: someone takes a statistics class, they learn about correlation and causation, and they know that these are not the same; a statistical pattern is not a causal one. But there is another view on the same topic, promoted by some people in this field of research. It's a philosophical principle, and it says that if I find two things that are statistically dependent, something causes this; dependence won't arise just spontaneously, as long as you don't have small sample sizes or something like that. There is always some reason, some causal background. And this is my first characterization of causal inference: acknowledging that every time you see a pattern there's a reason for it, and not settling for saying "there is a pattern", but actually finding the causal direction.

My second characterization is from Miguel Hernán. He uses this in the data science education at Harvard, and he categorizes three typical categories of tasks that you do in data
science. The first one is answering questions about description. Let's say you work at a beer company and someone asks: where do we sell our light lagers? The answer is in the data; just calculate it and you will have the answer. The second category of tasks is about prediction. People ask you a question that does not have its answer directly in the data; maybe they ask about the future. How much beer will we sell in Germany in the spring? It's in the future, you don't know, but the data has seasonal patterns and geographical patterns, and they help you answer this question. Causal inference is, in this setting, the next step: answering the even more difficult question of what will happen when we start to intervene on the system. If we don't have data on, for example, buying Google ads, this will be a very difficult question. Maybe one of the branches in your organization did try that, but they are not comparable to every other part of your organization, so you really need to think a lot before answering the question. Causal inference is not a certain tool; it's answering certain questions, and it's this type of question.

We do have a certain toolkit, though, and the first, and I think the best, place to start if you don't know anything about causal inference from before is structural causal models. They are equations of this sort. They might look a little bit scary, but there are several parts that are not that important. The most important things are the X's and Y's: those are the variables that we might have data on, and we say that they get their value from the thing on the right. The equals sign is an assignment, as in Python, and fX and fY are the names of functions. In one case X is a function of a random variable UX that we don't know; it makes X random. Y in this case is a function of the random variable UY, but it's also a function of X. We say that X is a direct cause of Y. In this type of model we can represent
this graphically, and you can see there is a correspondence: there's an arrow from X to Y, so X causes Y. We can extend it to more variables and introduce, say, a Z, and you can see in the equations that this corresponds to the graphical representation. Z has no arrows coming in; it is only a function of UZ. X is a function of UX but also of Z, so there's an arrow from Z to X. And Y is causally affected by both X and Z. We call this diagram the fork; it's one of many patterns that you learn to recognize.

The funny thing with this type of model is that you can do what we call interventions: we can set the value of X to something, let's say 9. For those of you who have read a bit more statistics, the interesting thing is that the resulting distribution of X, Y and Z will not be the conditional distribution that you might find by just subsetting your data to the places where X equals 9. It gives rise to a different probability distribution, and this means the model captures interventions: it models what will happen when we do something.

Putting it in a little bit more context: we start with a structural causal model, which models how we intervene on a system. Since it is a rule for how to produce the random variables X, Y and Z, it implies a probability distribution on X, Y and Z, i.e. how X, Y and Z randomly correlate, and from that you get data. So the first two steps of data science are answering questions about that probability distribution (what is the shape of my data, essentially), but causal inference is trying to go all the way back to the structural causal model. This picture also explains why what we're doing is very difficult: there might be several different structural causal models that give rise to the same probability distribution and the same data set. It's what in math you would call an inverse problem; you need a lot of assumptions to make it work. So research in causal inference is very much about being
transparent about what your assumptions are, for example drawing them graphically so people understand them, and about minimizing the number of assumptions that you need.

Okay, so now you will hopefully recognize this kind of graphical model and know that we use it to describe the causal structure. How do we use it? I will give an example with what you would call adjustment methods, and show how to do that in a few lines of Python. The case we're interested in is a famous data set from the 80s about kidney stone surgery. In a subset of that data we're comparing only open surgery and minimally invasive surgery, and we want to know the average causal effect, or ACE, of the surgery type on the success rate of getting rid of the kidney stones. I can produce the data in a few very compact lines of Python; I make the data from a summary table, so it's kind of like cheating. The data looks like this. I've added a third variable, which I will talk more about in a while. Every variable is zero or one, so they are binary and I don't have to think about the feature engineering part. For every patient there's a zero or one for whether or not they got the open surgery, whether there was a favorable outcome of the surgery, and whether the kidney stones were large or not.

Just looking at the average over the whole data sample, the blue bar is the average outcome for those with open surgery and the reddish one is for the minimally invasive surgery. From this you would say that they perform quite evenly, but open surgery seems to be a little bit worse. The funny thing happens when I group the data by kidney stone size: now we can see that open surgery is better for both subgroups. This is a funny statistical thing we call a reversal, or Simpson's paradox, where something can be good, or in this case bad, on average, but good for every subgroup. This is really funny, and you can find the explanation in the structural causal model: the
data comes from the fork diagram that I talked about. Comparing with the Reichenbach principle from before, the reason that A and B are statistically dependent can also be that there is something else affecting both. We want to know the causal effect of open surgery on whether or not the surgery was favorable, but it's confounded by this other variable, whether the stones were large. There are rules in structural causal modeling for how we should do the analysis. In this case we should include the stone size as an extra regressor, for example, or we should reweight our data accordingly. The most interesting thing is that even if the data were more complicated, as long as we know the graphical structure, there is a quite simple rule for how to make the right analysis. The gist of it is that the causal influence from the surgery on the outcome must flow along arrows that go from surgery to outcome, but statistical correlation can be created by arrows going in any direction, as long as they connect the treatment and the outcome. The takeaway for you is that as long as we know the graph, it's possible to do the right thing.

So we do what we call a backdoor adjustment. We look at the backdoor paths, those that go into the treatment choice and confound the relationship, identify a backdoor adjustment set, and apply an algorithm. You see the equation for the average causal effect on the left: it is the difference in expected outcome between two different interventions, using the do-operator that I mentioned before, so we needed this type of model to express it. There is also a relatively simple formula for computing it in this case with binary data. The only thing I want to draw your attention to in that equation, the red part, is that you divide by P(X | Z). We say that it's an inverse conditional probability, and this estimator is called an inverse probability weighting estimator, or, sorry, a propensity score weighting estimator. These are implemented in
libraries, and this is my point: there is a library, the one I imported in the previous code snippet. It's Microsoft-backed. I feel it's a little bit immature still, but it has all the basics in it, and it has a quite nice API. It's called DoWhy; you can find it on GitHub. The first thing you need to do is define the graphical structure. Here you do it in GML, which is also compatible with NetworkX, for those of you who have worked with that. This is the three-variable fork diagram from before. I input that and my pandas DataFrame and create this CausalModel, which is a new object that holds both the graphical structure and the data with all the statistical patterns. I ask it to find out how I should adjust my data, what the confounders are and so on, and then finally I say: please compute the effect via this inverse probability estimator. It will say that there is a positive effect of treatment A, the open surgery, and that is consistent with what we saw in the subgroups: it's good for everyone, the causal effect is positive. This is just what we like to see. So that was an example of a toolkit for using the structural causal information in our analysis. There are, I should say, lots of more difficult and involved ways to do this analysis, but this is one of the good starting points.

When do we need to concern ourselves with this, and not only think about statistical patterns? Should I reach for causal inference if I have a new problem I've never seen before but I have some data? Then I think probably not; you should start thinking about the domain knowledge for this problem, because if you are to reason about graphs and make assumptions, you need that knowledge first. So this is maybe not your starting point, but it's the next step, I guess. A lot of you work in web, maybe; nod, maybe shake your head, some people at least. So you might A/B test your site and see how that causally affects your customers' behavior while searching your site. You do an experiment. Can I do that and not think about this weird framework?
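Before moving on: the kidney stone analysis described above can be reproduced end to end. The sketch below is mine, not the speaker's slides; it assumes the classic counts from the 1986 kidney stone study (81/87 and 192/263 successes for open surgery on small and large stones, 234/270 and 55/80 for the minimally invasive one), shows the Simpson reversal, and then computes the ACE with an inverse probability (propensity score) weighting estimator. The optional last step cross-checks with DoWhy; the `CausalModel` / `identify_effect` / `estimate_effect` calls are DoWhy's public API, but check its documentation for details.

```python
# Kidney stone data: expand the summary table into (x, y, z) records,
# x = open surgery?, y = success?, z = large stone?
def expand(x, z, successes, total):
    return [(x, 1, z)] * successes + [(x, 0, z)] * (total - successes)

data = (expand(1, 0, 81, 87) + expand(1, 1, 192, 263)      # open surgery
        + expand(0, 0, 234, 270) + expand(0, 1, 55, 80))   # minimally invasive

def success_rate(x, z=None):
    rows = [y for xx, y, zz in data if xx == x and (z is None or zz == z)]
    return sum(rows) / len(rows)

# 1. Simpson's paradox: open surgery looks worse overall ...
naive_diff = success_rate(1) - success_rate(0)             # about -0.05
# ... but is better within both stone-size subgroups.
small_diff = success_rate(1, z=0) - success_rate(0, z=0)   # positive
large_diff = success_rate(1, z=1) - success_rate(0, z=1)   # positive

# 2. Backdoor adjustment via inverse probability weighting, with the
#    propensity P(open surgery | stone size) estimated from the data.
def propensity(z):
    rows = [xx for xx, y, zz in data if zz == z]
    return sum(rows) / len(rows)

n = len(data)
ey_do_open = sum(y * x / propensity(z) for x, y, z in data) / n
ey_do_mini = sum(y * (1 - x) / (1 - propensity(z)) for x, y, z in data) / n
ace = ey_do_open - ey_do_mini                              # about +0.05
print(f"naive difference: {naive_diff:+.3f}, adjusted ACE: {ace:+.3f}")

# 3. Optional cross-check with DoWhy; runs only if the library is installed.
try:
    import pandas as pd
    from dowhy import CausalModel
except ImportError:
    CausalModel = None

if CausalModel is not None:
    df = pd.DataFrame(data, columns=["open_surgery", "success", "large_stone"])
    df["open_surgery"] = df["open_surgery"].astype(bool)
    graph = """graph [ directed 1
      node [ id "open_surgery" label "open_surgery" ]
      node [ id "success" label "success" ]
      node [ id "large_stone" label "large_stone" ]
      edge [ source "large_stone" target "open_surgery" ]
      edge [ source "large_stone" target "success" ]
      edge [ source "open_surgery" target "success" ] ]"""
    model = CausalModel(data=df, treatment="open_surgery",
                        outcome="success", graph=graph)
    estimand = model.identify_effect()                 # finds the backdoor set
    estimate = model.estimate_effect(
        estimand, method_name="backdoor.propensity_score_weighting")
    print("DoWhy ACE:", estimate.value)
```

Because the propensities are the empirical ones, the weighting estimator here reproduces the stratified (adjustment-formula) answer exactly: the naive comparison is negative, the adjusted effect positive, which is the reversal the talk describes.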
Well, maybe. It depends on how many customers you have, how many transactions you run in your experiment, and what the sample size in your experiment will be. If you can do good experiments, then you are quite fine, but if you cannot make perfect experiments, this will help you overcome those problems. If you need to understand not only that there is a causal relationship but how that causal relationship is manifested, which is very important for example in detecting discrimination in observational data, then this is the type of framework that you typically use. Or if you simply have non-randomized data and you still want to draw a conclusion, this is the framework you use.

So, a slight bit of outlook. That was the basics of what it is and why; where is it heading? This whole field started when statistics started, something like a hundred years ago, but it has had a big upswing in the last 15 years. The well-studied part is what I talked about before, like weighting and matching, which was the type of estimator that I showed you. The more recent and current research is about causal discovery: I said that you need to know the graph, but when I have a data set and I don't know the graph, what can the data tell me? This is causal discovery, and it's also a big part of Bayesian network research, if you know something about that. We're also looking at transportability properties: maybe I have an experiment on one branch of my organization and I want to, as we call it, transport those conclusions to some other group. It's kind of a transfer learning setting, so to say. That is quite current research. And the bleeding edge is the deeper marriage between this type of framework and the more standard machine learning methods that you might know of, like neural networks. Some of the big thought leaders in neural networks are turning towards the causal inference field as well, so
we'll see what will come out of that.

In summary, causal inference is about separating the causal patterns in your data from the purely statistical artifacts. Using correct assumptions, such as about the graphical structure, we can draw robust, good conclusions even if our data is not perfect. You don't always need this type of framework; it all depends on your data. If you do have good experimental data that you trust, you might be fine either way. And by using some of the conclusions from this research field, such as those about robustness and invariance, I think we will see this more and more integrated into our standard methods going forward, and hopefully the methods will be able to not confuse cats and avocados. So if you thought this was interesting, please find me in the break or send me an email. You can find me on some different platforms; I'm not very active, but on Medium I've written something at least. Thank you for your attention. [Applause]

Do we have time for questions? Yes.

Q: Thank you. Can you say something about Reichenbach's common cause principle and its history? Does it predate the other one, "correlation does not imply causation"? How did it arise, and so on?

A: It's a philosophical statement; Hans Reichenbach was a philosopher in the 50s. The history is that causality is a philosophically very complicated problem, like, what is it even, and people do not agree. I should also say that the principle is a little bit debated; it doesn't really work well with quantum entanglement, which is one of the exceptions, maybe, depending on how you see it. His classic example was geysers that fire irregularly but typically in sync, and he said this could probably not be chance; there must be something that explains it, and in that case it was the geothermal properties. The principle says that there is something, and this is really from a philosophical standpoint; it might not
be something simple and tangible. A good way to think about this is, for example, spurious correlation: the reason you see such correlation structures can be that you have a small sample, so it might not even be a statistical dependence. That is the other way to think about it, so it's maybe not really a dichotomy between the two perspectives. I should also say that a statistical dependence between A and B is a different thing from a correlation, which is what I have on this slide, so the slide is maybe slightly deceiving: correlation is just a linear interdependence between two things, whereas statistical dependence, as in this case, is meant in a very general sense. Was that roughly an answer? Perfect.

Q: Hi, really nice talk. I come from a background in brain simulation, and there we have very complex graphs; the graphs are brain networks. How can you apply this to more complex graphs, not just three connected variables?

A: I think causal discovery is the thing you would typically lean on. Either you say that you have a certain pattern in your graph that is repeating, or you say that this type of structure is allowed or disallowed. The problem you then find is that you have a combinatorial problem in how many different graphs you can make. Let's say you have a thousand variables, which might be too few in your case; the number of possible graphs is immense. So one thing that is current research is trying to recast this as a smooth optimization problem. If the functional dependencies along the edges in your graph are of a certain kind, like generalized linear, then there are smooth techniques to deal with this. You don't get uniqueness of the solution, but that might be fine; I mean, this is something we see in neural networks in general, and still we use them. So it might not be that big of a
problem, depending on your situation.

Q: You had a slide where you compared the causal models with probabilistic models. This sounds very similar to Bayesian inference, which is applied to probabilistic models. Is causal inference an analysis tool which can be applied to probabilistic models, or is it something different?

A: You can say that it's kind of one level up: for every structural causal model that you have, you get a probabilistic model, so this is one abstraction step more. They live kind of in parallel; it's not an either-or. I can also say that there are attempts to make a Bayesian formulation of this framework, and I think there is an equivalence there, but it's current research, so we don't really know.

Q: Hi, thank you for an excellent talk. You mentioned the DoWhy framework developed by Microsoft. Are there other frameworks out there in the Python ecosystem?

A: There is a bunch. I have not done a full survey, but there should be at least three, four, five, something like that, and they have quite different maturity. This is the one I picked because it's backed by someone who will fund it, so it will keep running for longer. Others are typically by someone who has done research in the field, or just by what you'd call an enthusiast, someone who is passionate about it. So there are definitely alternatives, but I cannot, top of mind, name them.

Q: Thanks, very good talk as well. I was also wondering about building the graphs, understanding and including everything. You pointed to causal discovery as kind of the future, and from the previous question you talked about how, if you have a lot of variables, you can start trying to build this graph. But what if
you don't have all the variables? Quite often it's hard, and maybe you're actually missing something, even if you have good domain knowledge. Is there still strength in this? I think that must be a problem in research; that's kind of what you're trying to figure out.

A: Yes, it's called the problem of unmeasured confounding, and there is active research on it; there's a very much ongoing debate. Some researchers have proposed that if you have a lot of variables measured, such as in genetics, where we have hundreds of thousands, then under certain assumptions you can actually reconstruct the unmeasured confounding, but you need other domain-specific assumptions to make it work. You also have complementary toolkits, such as instrumental variable regression, for example, which can be formulated in this framework. So there's no easy answer; there are different specific tools to deal with specific situations, but there is no general answer.

Thank you for all your questions, and you can talk to him at lunch. That will be all. Thank you, Ludvig, for coming here and presenting in front of all these really good Python enthusiasts, and this is from us. Thank you. [Applause]
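The instrumental variable regression mentioned in the last answer can be sketched in a few lines. The model below is entirely made up for illustration: an unmeasured confounder U drives both treatment X and outcome Y, so the naive regression slope is biased, while a binary instrument W (which shifts X but affects Y only through X) lets the Wald estimator recover the true effect.

```python
# Wald / instrumental-variable estimator under unmeasured confounding.
# Made-up model: U confounds X and Y; W is a valid instrument (W -> X only).
import random

rng = random.Random(1)
true_effect = 2.0
w_, x_, y_ = [], [], []
for _ in range(200_000):
    u = rng.gauss(0, 1)                  # unmeasured confounder
    w = 1 if rng.random() < 0.5 else 0   # instrument, randomized
    x = 1.0 * w + u + rng.gauss(0, 1)    # treatment: driven by W and U
    y = true_effect * x + 3.0 * u + rng.gauss(0, 1)  # outcome: confounded by U
    w_.append(w); x_.append(x); y_.append(y)

def mean(v):
    return sum(v) / len(v)

# Naive OLS slope of Y on X: biased upward by the confounder U.
mx, my = mean(x_), mean(y_)
naive = (sum((x - mx) * (y - my) for x, y in zip(x_, y_))
         / sum((x - mx) ** 2 for x in x_))

# Wald estimator: Cov(Y, W) / Cov(X, W), here a ratio of group contrasts.
y1 = mean([y for w, y in zip(w_, y_) if w == 1])
y0 = mean([y for w, y in zip(w_, y_) if w == 0])
x1 = mean([x for w, x in zip(w_, x_) if w == 1])
x0 = mean([x for w, x in zip(w_, x_) if w == 0])
wald = (y1 - y0) / (x1 - x0)

print(f"naive OLS slope: {naive:.2f} (biased), Wald IV estimate: {wald:.2f}")
```

With these parameters the naive slope lands around 3.3 while the Wald estimate sits near the true effect of 2, which is the point: a valid instrument substitutes for measuring the confounder.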
Info
Channel: PyCon Sweden
Views: 36,372
Id: dFp2Ou52-po
Length: 27min 28sec (1648 seconds)
Published: Tue Nov 26 2019