Lectures on Causality: Jonas Peters, Part 1

Captions
It's a pleasure to welcome Jonas Peters, who very kindly agreed to give an eight-hour mini-course on top of the seminar we had originally agreed on. He is a great lecturer and has given courses at machine learning summer schools. He did his PhD with Bernhard Schölkopf, received a medal from ETH for his dissertation, then led a research group in Tübingen for a short time, and is now an associate professor at the University of Copenhagen. I won't take any more of his time — enjoy the lectures.

Thank you, and thanks everybody for coming; I'm super glad to be here. As I already said, it's a bit of a funny coincidence that I'm back, because the last time I was in Boston was actually in 2001, shortly after 9/11, visiting my brother who was studying here. It was my first New Year's in Boston, so I brought a lot of fireworks, because I thought that would be a good way to celebrate. It turned out you are not allowed to set them off in public, so I ended up taking all the fireworks back to Germany — but I forgot to take them out of my hand luggage and only discovered this at the gate. No one had found the fireworks, my sister freaked out, and we ended up dumping them in a trash can. I'm glad nobody found them, because otherwise I would probably not be here today.

In this mini-course I will talk about causality. I'm very glad to see so many people, and I'm a bit unsure about your background. What I'm trying to do is introduce you to some of the concepts of causality, and I'm trying to keep it basic. If it's going too slowly or too quickly, please do interrupt me at any time — if you don't understand something, it's most likely because of me, not because of you. I have a couple of slides prepared, but I'm flexible, so if you want to hear more about one specific topic, please come and talk to me in the break or afterwards and maybe we can adjust a bit.

I will talk about many things, and this is a very incomplete list of people who contributed ideas to this talk, so let me just mention a couple of them. Judea Pearl is one of the early figures in the field; he is at UCLA and received the Turing Award for his work on causality and graphical models. There is a group of people at CMU working on this topic — for historical reasons they are all in the philosophy department, although they are pretty strong in mathematics as well. There are Donald Rubin and Jamie Robins at Harvard. I personally worked with Peter Bühlmann and Nicolai Meinshausen at ETH Zurich and with Dominik Janzing and Bernhard Schölkopf at the MPI in Tübingen, who contributed a lot to the ideas here; also Joris Mooij, now at the University of Amsterdam, and Patrik Hoyer. There are many, many others, so please do not feel offended if I forgot your name here.

What I will do is try to introduce you to the topic in four slides, and if you take anything home from this eight-hour course, please let it be these first four slides — they are quite important to me. They describe the essential problem of causality. Consider the following problem: we have data from some genes — say, some measure of the activity of a certain gene A — and at the same time we measure some phenotype.
Think of the flowering time, for example. What you see in the data is a dependence: the more this gene is expressed — think of a log expression level here — the larger the phenotype. Now, in statistics the question you are answering is often a question of prediction: assume you observe that the activity of gene A is around six; what is then the best prediction for the phenotype? This is the problem we have studied for many, many years in statistics, and it is well understood. We have all these prediction techniques; you can use a linear model if you like, and the answer will be something like: if the activity is around six, then you expect the phenotype to lie around 15.

In causality, however, we are addressing a different problem, and this is very essential to understand. Here we ask the following question: what happens if we actively intervene on the system? For example, what happens if we delete gene A — what is our best prediction for the phenotype if we delete gene A, meaning we set its activity to zero? (The laser pointer here is a bit broken, so I'll use an analogue pointer — a long stick, or a finger.) So the question is: what is the best prediction for the phenotype if we delete the gene, that is, set the activity to zero? And this is a very different question. Why? My argument is that if you want to answer this question, the one thing you need to include is causality: to answer it, you have to talk about the causal structure of the underlying problem. I'm trying to show this with these two slides. Imagine now that there is a different gene, B — not gene A but gene B — that shows a very similar dependence structure. What we will see is that, depending on the causal structure, the answer to this question will be different.

Assume — we will make this precise later — that gene A really is causal for the phenotype. What do we expect if we delete gene A, setting its activity to zero? You can think of taking a hammer to gene A. Then we expect to see a change in the phenotype as well, and the best prediction for the phenotype is to say it probably lies around this lower interval. This is at least intuitively clear, hopefully: if gene A causes the phenotype and we delete gene A, the phenotype drops as well. Gene B, in this example, works differently. We have the same dependence structure, but now a very different causal structure can lead to a very similar dependence: assume that gene B is not causal for the phenotype, but there is a hidden common cause — something we call a confounder — that causes both gene B and the phenotype at the same time. What happens if we now intervene on gene B? If we delete gene B, again setting the activity to zero, what do we expect the phenotype to do? Basically, we expect the phenotype to do what it always does: it stays in the range in which we have observed it. So the best prediction for the phenotype is to say it does what it always does.
So here, if you take a hammer to gene B and delete it, we do not expect a change in the phenotype.

[Audience] Just to be very clear: you're saying it's an intervention, not an observation, right? You're actually deleting gene B's activity, as opposed to just observing it. — Yes, exactly; that is the correct notion, and it's the one we're going to use. The question is: we want to predict what the system does after an intervention. [Audience] And there is nothing special about these values — it could be any active change, and we ask how the system responds? — Yes, the gene deletion is only one special case of an intervention that I chose here. The take-home message is really: if you want to answer this question, you have to talk about causality. I should also say that causality is a rather young research topic, and there are many subtleties we are going to discuss. It's fine if you don't want to talk about causality, but then I think it's important to realize that the best answer you can give to these questions is "I do not know." If you want to give an answer, you have to talk about causality. Any other questions about this example?

Good. Then, slide number three: what is a causal model? Usually in statistics we build a model for some distribution: you have a data-generating process and you model the distribution, say a Gaussian with a certain mean and variance. A causal model is something slightly different, because with a causal model you want to do several things at the same time. You want to be able to model the distribution you observe — these are the data points you saw on the last slide — but at the same time you want the causal model to be able to say how the system reacts under interventions, like the gene deletion on the last slide. What usually also comes with a causal model is a causal graph — you have already seen one — which is very convenient, for example, for getting an overview of the causal structure. And then, if time allows, we will also discuss counterfactuals a bit; these are the "what if" statements that come out of some of these causal models, and we will discuss them later. If anyone wants to sell you a causal model on the street, make sure these properties are satisfied, otherwise it's not a causal model: we want to be able to model the observational distribution and, most importantly, also the interventional distributions. This is what we will discuss in this mini-course.

Then, slide number four: if you want to start doing research in causality, it's super easy, because you will see that all the concepts are quite basic, there are lots of open questions, and I guess most of you are smart enough to immediately start doing great research on this. So what are some of the questions studied in this field? One is: how does the causal model actually work? This is something we will discuss, but there are lots of subtleties people are still thinking about. Something which is also very important for practical applications: things get more complicated if hidden variables or feedback are involved. Another question is about graphical representations.
In this mini-course we will mostly talk about so-called directed acyclic graphs (DAGs), so we do not assume feedback, and mostly we do not allow for hidden variables; if you do want to include them, the graphical representations become rather complicated — these are the MAGs and PAGs you may have heard of. For counterfactual statements, one thing people investigate is whether it is actually possible to test counterfactual statements in practice. Then there is a very important question that we have also been studying a lot: in practice we are often not given the graph, we are given data, and the question is whether we can somehow infer the graphical structure — the causal structure — from the data. The last question here, which I think is a very interesting research topic but still in its infancy, so to say, is whether causality is useful for classical machine learning or statistical problems. The reasoning is as follows: usually you would say that causality becomes important whenever you are interested in answering interventional questions, so to a first-order approximation, if you are not interested in interventions you do not care about causality so much. But whenever you have systems that change over time — that are not stationary — think of problems like domain adaptation or transfer learning, the question arises whether the causal point of view can help make the statistical method more efficient in terms of making use of the data. This is something we are beginning to investigate as well. OK, so these are the four steps into causality. Any questions?

[Audience] How many data points are needed to recover the causal structure, and what is the computational complexity of the methods? — This is of course a very difficult question to answer. You're referring to causal discovery, this question here: we are given a sample from the joint distribution and we try to recover the causal model. I would say the question is not yet well posed, because what we are going to see is that this backward step — recovering the causal model from the distribution, or from a sample from the distribution — is only possible under further assumptions. There is an analogue in statistical learning theory: there, too, if you receive data from an underlying distribution, recovering the underlying regression function is only possible under smoothness assumptions, for example. If you think about a regression problem and you make no assumptions about the smoothness of the function you are trying to learn, then of course you will not be able to learn the function. So assumptions certainly have to play a role here, and the second question we should discuss is whether the assumptions we make in order to get this link from the distribution to the causal model actually hold in practice. These are the questions we have to answer first, and then I can point you to some results on statistical efficiency.

[Audience] What data would be most useful
in order to infer the causal structure? — OK, this again goes towards the second part of the mini-course, the question of inferring the causal structure. A short answer: this research field is still rather young, and I will present some ideas for how to learn causal models. What I would say is most promising at the moment is the case where you have a system that changes a bit over time — we call these different environments. Then, under some assumptions, it is possible to learn the causal structure from just observational data; I can show you some of these ideas, and there are promising results on real data, but it is a very difficult problem. I think you gain a lot if you somehow have perturbations of the system — you do not even need to specify how the system was perturbed, but it helps a lot for learning the causal structure. The second answer, maybe, is that a time-series structure also helps a lot: whenever your system has a time structure, you at least know that the causal arrows are probably not going to point backwards in time. If your measuring system measures one step at a time, then whatever causal arrows you find should point forward in time, so time also helps.

OK, so these are the three parts that I would like to discuss: how does the model work — we need to introduce a bit of language — then the question you were referring to, can we infer the causal graph structure, and then at the end the machine learning questions. If there is anything else you would like to talk about, again, please approach me in the break.

This is one picture that I would like to show you as well. What you see at the bottom is the classical statistical setting. What you do first, in statistics 101 or probability theory, is start with a probabilistic model and do what we would call probabilistic reasoning: you say, if I draw a sample, what is the distribution of the mean, for example, if I have 50 data points? Statistics, or statistical learning, goes in the other direction: there is an underlying probabilistic model that I do not know but want to infer — how do I do it? I get some observations and I want to estimate a parameter, for example, or infer some property of my underlying probabilistic model. In causality things are slightly different, but you can draw a similar picture: we have a causal model, and causal reasoning means that we make statements not only about observations but also about changes and interventions. Going back — and this is what we just discussed — we would usually call causal learning or causal discovery, or some people say structure learning: we are given data and want to infer something about the causal structure. So it looks like there is a correspondence to the usual
statistical world, but there is one crucial difference, and it is the following. Usually in statistics, if you are given an infinite amount of data you are done: if you are interested in a parameter, say the mean, and someone gives you infinitely many data points, you just read it off. If you are given the full distribution, there is nothing left to do. This is different in causality: even with an infinite amount of data, this backward step does not become trivial. Even with infinitely many data points it is still a question what the underlying causal structure is. You have seen the example at the beginning with the genes: it is not a question of how many data points you see — I can give you as many as you like — both of those distributions could have been produced by two different causal structures, which would make very different statements about what happens under interventions. In this mini-course, mostly due to time, I will for the most part not talk about the finite-sample setting; I will show you the ideas, and in many cases I will just discuss what happens if you have knowledge of the full distribution — on top of which you of course have the full statistical machinery. Does this make sense? [Audience question about time] — It doesn't matter here; this is more general, it can have a time structure or not. The causal models I will introduce do not involve time, but you can write the time versions down immediately yourself; there is no conceptual difference when including time.

Good. I will start with a couple of examples, some of which you may have seen before. This one is by now a classic, but it is still one of my favorites: a study from 2012. What you see here is the chocolate consumption in kilograms per year per person for different countries, and on the y-axis the number of Nobel Prize winners per ten million inhabitants, and there is a very strong correlation. What I think is nice is that Sweden is a bit of an outlier — this should make you suspicious — and the other thing I find interesting is that Swiss chocolate seems to be much better than German chocolate; at least it tastes better as well. But of course the question is: this is a dependence structure — what is the causal structure? You may have seen these examples before; it is funny how quickly you get statements like "eating chocolate produces Nobel Prize winners" — this one is from Confectionery News, so maybe it should be taken with a grain of salt. This is of course also why I brought chocolate for Philip, so we can do the experiment and see whether it really helps. Others claim the other causal direction — that geniuses are more likely to eat lots of chocolate — which is probably false as well, but we will have a look at the data set later.

The next data set is, I think, very important for historical reasons: a paper from 1950 about smoking and lung cancer. It is a very well done study — I recommend looking at the paper — in which they found a dependence between smoking and lung cancer, and this is some of the data.
The details are not important; the most important thing is that the more cigarettes you smoke, the higher the chance of lung cancer. It is pretty impressive how many cigarettes some people smoke — more than half a million; it used to be very popular. Relatively soon after the study, the politicians said: we have to act on the link between smoking and lung cancer — and here you really have the same problem, because the study by itself does not show it. They said smoking is causing lung cancer, so we should introduce a tobacco tax in order to reduce lung cancer; the tobacco industry of course said: no, no, there is a dependence, but it may be due to a hidden common cause that has not been measured. For historical reasons this was a super important example, and it really raises the question: how do you find out whether there is a causal link or not?

Look at this example as well, another study, a Nature publication. Some children are apparently afraid of the dark, so you put a night light into the plug, and then it is almost daylight in the room. The study found that whenever you have these night lights, the child is more likely to develop myopia — short-sightedness. So the more light, the more myopia. Now, the authors did not want to claim causality, so they write something like: "the strength of the association does suggest that the absence of a daily period of darkness during childhood is a potential precipitating factor in the development of myopia." You do not mention the word "cause", but you write "precipitating factor"; it means the same thing, but you have not said the c-word. So what do you do with such a study? If you are clever, you invent something like this: a patent for a night light with a sleep timer. It is something you plug in; it is almost daylight, but after half an hour it switches off and it is dark again. The idea, of course, is that this helps: if I reduce the room light, the chance of developing short-sightedness decreases. That is the causal interpretation of the study, and we will see that this was actually not the case — it is again just looking at the dependence and claiming it is causal. The question is: does a night light with a sleep timer really help? This is an example we will have a look at.

This one is also very famous: it is about kidney stones and how well patients recover, and it is known under the name of Simpson's paradox. We will resolve it, and hopefully you will agree with me afterwards that it is not so much a paradox but rather a question of causal phrasing. We have patients with kidney stones and two different treatments, treatment A and treatment B, and these are the recovery rates. In total there were 700 patients; 350 received treatment A and 350 received treatment B, and it looks like treatment B is better than treatment A, because the overall recovery rate is higher. And now comes the magic: what I am showing you next is exactly the same data
set, but I am providing you with more information. It turns out you can classify the kidney stones into small stones and large stones, and I now show you the results on these subcategories. For small stones, treatment A is actually better than treatment B, and for large stones the same happens: there, too, treatment A is better than treatment B. You can check that the numbers really do add up: 87 plus 263 equals 350, I hope. This is what is known as Simpson's paradox: how can it be that in both subcategories treatment A is better than treatment B, but overall it looks like treatment B is better? Do we have an intuition for this? Exactly — the answer is that it is an unfair comparison, because treatment A had to deal with many more of the large stones, and the large stones, as you can see here, are the more difficult cases: the recovery rates are lower than for the small stones. It is unfair because treatment A got assigned many more of the difficult cases. And indeed, in this study the doctors already believed that treatment A probably works better, so whenever a patient came in with large stones, they felt pity and said: you probably get treatment A.

In terms of causality, this is not so difficult to write down. This is the causal structure: we have the recovery, which is indeed influenced by the treatment, but we have a confounding factor, the size of the stone. We are interested in this causal link here, and this is something we are going to compute: what is the expected recovery, for example, if everyone gets treatment B? That would be an intervention. It is still a bit astonishing to me — I talked about this example once in a class, and during the break one of the students left; the next week he came back and asked what I was going to talk about the following week. I wondered what he meant, and it turned out he had got a stomach ache during that lecture — and he had kidney stones. I am not kidding: during the break of that lecture he went to the hospital, they diagnosed kidney stones, and he got treatment A. So I hope none of you is leaving the class today. A very weird coincidence.
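For reference, here are the numbers behind this example. The transcript only quotes a few of them (700 patients, 350 per treatment, 87 + 263 = 350); the remaining entries are the figures of the classic kidney-stone data set this example is based on, so treat the exact counts as reconstructed:

```
                 Treatment A          Treatment B
Small stones     81/87   (~93%)       234/270 (~87%)
Large stones     192/263 (~73%)       55/80   (~69%)
Overall          273/350 (~78%)       289/350 (~83%)
```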
OK, this is a very well-known example, one that goes in the direction of applications to machine learning: advertisement. If you are using an ad blocker, as I do, I suggest switching it off for at least an hour while browsing the web — it is a very different experience. If you use Google and type in something like "buy coffee beans", then above the search results — before the first actual search result — you find all these ads. They are called the mainline ads, and they are the ones that make the most money. You see the small label indicating that this is an advertisement, and you can buy something like "Kicking Horse" — I would be curious what kind of coffee that is. This is a machine learning system, and nowadays it is mostly trained with data: you have data from a lot of customers, and you know, after they searched for something, what they clicked on. I want to argue that it is beneficial to at least think about causality in this context. Why? This is a very simplified picture of the underlying causal structure. You have the user intention — whether the user is looking for information or wants to buy something — and this is hidden. Then you have a lot of user data, especially if you are using Gmail or something, so they know where you are from, your IP address, the time of day, and so on, and the search query — what you are looking for. Then there is a parameter called the mainline reserve, which determines how many of these mainline ads are shown — it is between 0 and 4 — and then there is the number of ads in the mainline, with some other factors going in there as well, and at the end whether the person clicks on an ad or not. This, very simplified of course, is the system underlying this machinery, and I will argue later that you can actually benefit from looking at this causal structure. Here the mainline reserve is the parameter we can tweak — if you are Google, for example, this is the parameter you can change — and the question is how to choose an optimal mainline reserve given all the other information. You may know this kind of problem from reinforcement learning, but it is also something you can phrase in causal language.

A last example, and then we will start doing some real work. This is some data we have been working with: gene interactions, a data set from yeast. We have about 6,000 genes and we measure their activity, and we have 160 of what are called wild-type measurements — I am not a biologist — which you can consider observational data. What is nice about this data set is that you also have a lot of gene deletions: about 1,500 gene deletions where the target is known. For example, the first data point here means gene number 5,304 was deleted and then all the genes were measured. In fact the biologists even repeated each experiment five times, but they only provided us with the mean, so you even have some statistical confidence here. Here you see one example of the observational data: two genes, 5954 and 4710, and you see that there is a dependence between the two. And here, these are the 1,500 gene deletions — if you like, a lot of interventional data. The goal is to predict causal interactions between these genes: we are given the data and we want to see whether we can predict that, say, gene 17 causes gene 120. Something we will make use of, and which may be interesting to see directly from the picture, is that these causal models come with a kind of stability guarantee. What do I mean by this? If you have found a cause, then what often happens is that if you intervene somewhere else in the system — anywhere except on the target — the dependence remains the same. You see this here: I chose the genes such that this gene really is causal for the other one, and what you find is that the dependence — modeled with a linear model, for example — does not change if you intervene somewhere else in the system. This is a very interesting property of causal structures.
We will mainly use this data set to try to infer some of these causal relationships.

[Audience] Does the dependence stay the same when you intervene on some other gene, no matter what that gene is doing? — Not in general. If you have these two genes, say X and Y, then indeed, if you intervene on some other variable W, you can have examples where the dependence structure remains the same. But now imagine you intervene on this one here — then suddenly it is not stable anymore. [Audience] Maybe I did not understand the point — are these all the deletions combined, or one deletion at a time? — These are 1,500 different deletions; they are very different interventions. [Audience] But could one of them not change the dependence? — You are right, but it is very unlikely that the dependence structure remains the same if you intervene on a direct cause of Y. You can tweak a subset, but usually you do not expect stability of the marginal dependence, because imagine there is some variation coming in from a different part of the system: if that contribution suddenly becomes very large, the dependence here becomes much stronger, because the noise component that is independent of X does not matter anymore. What you do get — and this is getting a bit ahead, maybe — is that if you look at the model where you condition Y on all of its parents, then it really does not matter where you intervene: the conditional distribution of Y given its parents — and this is what I mean by a causal model — will be stable no matter where else you intervene. If you do not condition on all the parents, you will not get this stability. That was the last motivating example; if there are no further questions, I would now like to talk about how you actually do this.

Good. This is the first part, called "language and causal reasoning": how are these causal models built? There are a couple of ways to do it, and I will mostly focus on so-called structural causal models. This is the picture you should have in mind: we are trying to build a causal model that can really do all of these things. Here you see the first structural causal model, and you will see immediately that it is something very easy. What is it? For a structural causal model over two variables X and Y, it is just a collection of two equations: we write X as a function of noise, and Y as a function of X and some noise. These are sometimes called structural equations; we use the term structural causal model, some people say structural equation model — it is basically the same thing. Formally it is a set of two assignments, together with the noise variables, which for now we assume to be independent. As an example, think of X as the altitude of a location and Y as the average temperature at that location. We know — and this is roughly correct, at least over a certain range — that if you go up 100 meters, the temperature decreases by roughly 0.6
degrees Celsius. Of course this is only an approximation — it is not going to be an exactly linear effect, and we will see some data later — but let's say it is a linear model for now. Then we say that X simply follows some distribution — you have a distribution of locations in a country, say — and the average temperature is a function of this altitude plus some noise, which we model as independent of the altitude; this noise could stand for other geographic features. Formally, you have the set of assignments, which we call S, and the distribution of the noise variables, which here we assume to be i.i.d., say Gaussian N(0, 1). You can then draw the corresponding causal graph, and that is trivial: for each assignment you check which variables appear on its right-hand side, and you draw an edge from each of those variables into the variable on the left-hand side.

The first thing we need to check is why this induces, or entails, a joint distribution over X and Y. Hopefully you can guess the corresponding distribution if I write down this model. X must be N(0, 1), since its noise is N(0, 1). What is the marginal distribution of Y? Gaussian, yes. With what mean? Zero. And what variance? 37, exactly. And what is the covariance between X and Y? Minus 6 — I think I heard it correctly. So writing down a model like this induces a Gaussian distribution over X and Y — in this case a bivariate Gaussian with a certain mean and a certain covariance structure. We will see that this holds in general, even with more than two random variables.

Now comes the important part: how do we model interventions? This is actually something very easy as well. You just take one of these assignments and replace it by something else. Take this very hypothetical intervention: you take your city and raise it up — I think they did something similar in Seattle, where they raised the city by ten meters or so — say you raise it by 300 meters by building a giant platform. Mathematically, you replace this structural assignment by another one: you start from the same structural causal model, and this of course induces a new distribution. What is it? In this case we set X to 3 — this is what we call an intervention — so X is always 3 now; the probability of X being 3 is one. And what happens to Y? The distribution of Y changes: its mean is now minus 18 and its variance is 1. This is the mathematical formalization of an intervention.
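As a worked version of this slide (the structural coefficient −6 and the unit-variance noise for Y are not read out explicitly in the recording; they are the values consistent with the quoted variance 37, covariance −6, and the do(X = 3) result, so treat them as reconstructed):

```latex
\begin{align*}
&\text{SCM } \mathcal{S}: \quad X := N_X, \qquad Y := -6\,X + N_Y,
  \qquad N_X, N_Y \overset{\text{iid}}{\sim} \mathcal{N}(0,1).\\[4pt]
&\text{Entailed observational distribution:}\\
&\quad \operatorname{Var}(Y) = (-6)^2 \operatorname{Var}(N_X) + \operatorname{Var}(N_Y) = 37,
  \qquad \operatorname{Cov}(X,Y) = -6 \operatorname{Var}(N_X) = -6,\\
&\quad (X,Y) \sim \mathcal{N}\!\left( \begin{pmatrix}0\\0\end{pmatrix},
  \begin{pmatrix}1 & -6\\ -6 & 37\end{pmatrix} \right).\\[4pt]
&\text{Intervention } do(X := 3): \text{ replace the first assignment by } X := 3,\\
&\quad P^{do(X:=3)}: \quad X \equiv 3, \qquad Y = -18 + N_Y \sim \mathcal{N}(-18,\, 1).
\end{align*}
```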
[Audience] Is that the same as conditioning? — Very good question; I will come back to this several times, and maybe it becomes clearer on the next slides.

Here is another intervention. You can also intervene on the temperature: now we say we do not care about the altitude anymore; we build a giant dome over the city, we have a giant heating machine, and we always set the temperature to 2 degrees Celsius, with a variance of 2. What happens now? This is another intervention you can do, and I have already depicted it in the graph: if you do this, the altitude no longer matters for the temperature, because of this giant dome over the whole city, so in a sense X and Y now become independent. Again, this entails a distribution over X and Y, and now it looks very different. The marginal distribution of X — this is important — did not change: it is still N(0, 1), as before. But the distribution of Y is now Gaussian with mean 2 and variance 2, and we have created an independence between X and Y. We will see later that you already get this here: if instead of a Gaussian N(2, 2) you set Y to 4, for example, you would have the same thing — X and Y would still be independent — whereas if you condition, this is not the case; we will see this more clearly later. So this is the super simple mathematical formalization of interventions; this is how we formally do it. Any questions?
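A minimal simulation sketch of this SCM and the two interventions (the linear coefficient −6 and the noise scales are the assumed values from the worked example above; the point is only to see that do(X := 3) shifts Y, while do(Y := …) leaves X untouched and makes the two independent):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(do_x=None, do_y=None):
    """Ancestral sampling from the altitude/temperature SCM.
    X := N_X;  Y := -6*X + N_Y  (assumed coefficients, see text).
    do_x / do_y replace the corresponding structural assignment."""
    n_x = rng.normal(0.0, 1.0, n)
    n_y = rng.normal(0.0, 1.0, n)
    x = n_x if do_x is None else np.full(n, do_x)
    if do_y is None:
        y = -6.0 * x + n_y
    else:
        y = do_y(n)          # e.g. a new noise distribution for Y
    return x, y

# Observational distribution: Y has mean ~0, variance ~37, Cov(X, Y) ~ -6
x, y = sample()
print(y.mean(), y.var(), np.cov(x, y)[0, 1])

# Intervention do(X := 3): Y now has mean ~ -18 and variance ~ 1
x, y = sample(do_x=3.0)
print(y.mean(), y.var())

# Intervention do(Y := N(2, 2)): X keeps its N(0, 1) distribution,
# and X and Y are now (empirically) uncorrelated
x, y = sample(do_y=lambda m: rng.normal(2.0, np.sqrt(2.0), m))
print(x.mean(), x.var(), y.mean(), y.var(), np.cov(x, y)[0, 1])
```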
[Audience] I am not sure I am getting the picture here. Suppose I have a set of samples (x1, y1), (x2, y2), ..., and X and Y are jointly Gaussian; then X is a linear function of Y and Y is also a linear function of X. Looking at the second picture, after getting these samples, how do I draw the arrow from X to Y? I could technically write the model in either direction. — So the question is: how do I find out that the arrow points from X to Y and not from Y to X? Two remarks. First, you talked about samples; so far this is really about distributions — there is no finite sample involved. This is about building a model: what does it mean to be a causal model? We will talk about samples very soon, but not now. The second question — how do we find out whether the linear model goes from X to Y or from Y to X — is a very important one that we will address in the second part, on causal discovery. Right now, in this first part, we assume that someone gives you the correct structural causal model. Of course we want to relax this, because in practice that will never happen, but first we have to make sure we can work with these structural causal models if someone hands one to us; at the end of the day we want to infer them from data ourselves, but this is the first step — here we assume the model is given. [Audience] But the notation matters: the fact that in that equation you write Y as a function of X means that X is a cause of Y? — Yes, exactly. Since we now have the language: if you write down the same model but exchange the roles of X and Y — writing X as a linear function of Y — you obtain a model that induces exactly the same observational distribution but very different interventional distributions. If X to Y is the correct causal structure, then intervening on X changes Y; whereas if you exchange the roles, intervening on X would not change Y. So it really does matter how you write it down, and the question of how you find the direction in practice — I have to apologize — we will only answer in the second part. But it is important that it makes a difference, and right now we assume the model is given.

OK, you can do this for an arbitrary number of random variables; here I chose four. It is the same idea: you have four random variables X1 to X4, and a structural causal model is just a set of assignments together with a joint distribution over the noise variables, which are all assumed to be independent; for now we assume the corresponding graph has no cycles. You draw the corresponding graph very easily: X1 appears on the right-hand side of the assignment for X2, so we draw an edge from X1 to X2, and so on — that's it. [Audience] Just to clarify: implicit in that structure, you cannot have a directed cycle, right? In other words, it would be impossible to say X1 is a function of X2 and X2 is a function of X1? — Let me repeat the question: is it possible to introduce feedback here? We will come back to this. But first, a related point: why does this construction always induce a joint distribution over X1 to X4? This is the first property we needed to check; we computed it by hand in the bivariate example, but why is it the case here? Why is it that whenever I write down equations like this — again, think of Gaussian N(0, 1) noises, and I tell you the functions, say X1 := X3² + sin(N1) — this always entails a joint distribution over X1 to X4? Exactly: because there is no cycle. Can you make this more specific? Exactly — you start at the source. Think of a computer program, think of sampling from this distribution: you start at a source variable, in this case X3, you first sample from the noise distribution N3 — say it is a Gaussian N(0, 1) — and you get a value for X3; that is the first coordinate generated. Then you go to the next variable: you already have the value for X3, so you can evaluate its assignment — you sample from N1, plug both into the function, and you get the corresponding value for X1. That is exactly what you do: you start at the sources and propagate down the graph.
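A sketch of the sampling argument just described, for a hypothetical four-variable SCM. Only X1 := X3² + sin(N1) is quoted in the lecture; the graph (X3 → X1 → X2 → X4, X3 → X4) and the remaining functions are placeholder assumptions chosen to be consistent with the fragments in the transcript:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scm(n, do=None):
    """Ancestral sampling: visit the variables in a topological order of the
    (acyclic) graph, draw the noise for each, then evaluate its assignment.
    `do` maps a variable name to a constant that replaces its assignment."""
    do = do or {}
    n1, n2, n3, n4 = (rng.normal(0, 1, n) for _ in range(4))

    x3 = do.get("x3", n3)                         # source node: X3 := N3
    x1 = do.get("x1", x3 ** 2 + np.sin(n1))       # X1 := X3^2 + sin(N1)
    x2 = do.get("x2", 0.8 * x1 + n2)              # X2 := 0.8*X1 + N2   (placeholder)
    x4 = do.get("x4", x2 + 0.5 * x3 + n4)         # X4 := X2 + 0.5*X3 + N4 (placeholder)
    return x1, x2, x3, x4

# A sample from the entailed observational distribution
x1, x2, x3, x4 = sample_scm(10_000)

# Under do(X1 := 0) the assignment for X1 is replaced by a constant,
# so the edge X3 -> X1 disappears and X1 is identically zero
x1_do, _, _, _ = sample_scm(10_000, do={"x1": 0.0})
```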
This relates to the question about cycles: the acyclicity is exactly what we used to find a source node, and if you have a cycle there is no source node. Is it still possible to write down a distribution entailed by such a model? The answer is: often yes, sometimes no. This is the one thing you have to settle — if you have cycles, you have to say which distribution you are talking about. One solution people sometimes consider is an equilibrium distribution: you run this procedure, let the system evolve over time, and if it converges to a stable distribution, that is a well-defined object. But this is really the crucial point: we need to make sure that our causal model entails a joint distribution. Does this answer your question? [Audience] So here it is just a triangular system of assignments? — Yes; we just need the system to have, almost surely, a unique solution, and in the acyclic case — if you have densities, everything is easier — you can prove it exactly as described, by thinking about generating samples with the computer program. [Audience] What does it take for the cyclic case to converge? — Not much; you essentially introduce time. Think of copying the graph over time: you can model these as the instantaneous effects at time t, copy the whole structure to t + 1, and draw edges from X4 at time t to X4 at time t + 1, for example. You can write it down in exactly the same way — this is actually what we have done, and it is very easy.

Good. Now some notation. The observational distribution we often call P, and the interventional distributions are obtained exactly as you have seen before: an intervention replaces one of the structural assignments. We use the same "do" notation that Judea Pearl introduced: because the intervention induces a new distribution, we give that distribution a name and write, say, do(X1 = 0). This indicates that we started from a certain structural causal model and replaced the structural assignment for X1 by X1 := 0. Let me go back one slide: before, X1 depended on X3 — think of the computer program — but after setting it to 0 it no longer depends on X3, so this edge from X3 into X1 disappears. You can do many other interventions as well; for example, you can intervene on X4 and set it to 13, and you would call the resulting distribution do(X4 = 13).

Now I come back to the earlier question: is this the same as conditioning? It is not, and I would like to make that clear here. Think of the system with all the assignments being, say, linear with positive coefficients — X2 is a linear function of X1 plus some noise, and so on. Go back to the original SCM and imagine we learn that X4 is very large, say 2,045. What does that tell you about X1? It increases the chance that X1 was also large — or maybe X3 was large, but something had to be large; it is pretty unlikely that X4 is very large otherwise. So if you condition on X4 being very large, then probably one of these other variables was large as well: the information propagates up the graph. If instead you intervene — go to the intervened distribution where we set X4 to 13 — then the distributions of the other variables do not change. This is very different from conditioning: if you think of the computer program, the assignments for the other three variables did not change, so they still have the same distribution; whereas conditioning means "probably something upstream was large", so those distributions do change. This is why the do-distribution is very different from the conditional distribution, and we will see this a couple of times.
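A small Monte-Carlo sketch of this difference, reusing the hypothetical four-variable SCM from the previous snippet (repeated here so the block is self-contained): conditioning on a large X4 shifts the distribution of its ancestor X1 upwards, whereas intervening on X4 leaves it untouched.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def sample(do_x4=None):
    """Same placeholder SCM as before (X3 -> X1 -> X2 -> X4, X3 -> X4);
    do_x4 replaces the assignment of X4 by a constant."""
    n1, n2, n3, n4 = (rng.normal(0, 1, n) for _ in range(4))
    x3 = n3
    x1 = x3 ** 2 + np.sin(n1)
    x2 = 0.8 * x1 + n2
    x4 = x2 + 0.5 * x3 + n4 if do_x4 is None else np.full(n, do_x4)
    return x1, x2, x3, x4

x1, x2, x3, x4 = sample()

# Conditioning on a large X4: the information propagates up the graph,
# so X1 (an ancestor of X4) is probably large as well
print(x1[x4 > 6].mean())                  # clearly above E[X1] = E[X3^2] = 1

# Intervening on X4: the assignments of X1, X2, X3 are untouched,
# so the distribution of X1 under do(X4 := 13) equals the observational one
x1_do, _, _, _ = sample(do_x4=13.0)
print(x1.mean(), x1_do.mean())            # both approximately 1
```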
[Audience] Do we always have to consider assignments that are constant? — No, and actually the non-constant ones are more informative. Think about randomized experiments: setting a variable to a fixed value is just one way of intervening. There you do not want to set it to 13; you randomize, so you might set it to 13 plus some Gaussian noise with a certain variance. You can go even further: in the advertisement example, the mainline reserve is set depending on the user data, and if we change the mainline reserve we want to see how the system performs — how many clicks we receive. Of course it would be silly to always set it to 3 and always show three ads; the interesting intervention is actually a conditional one, where you say: I still want it to depend on, say, the user data, but in a different way. That is also something you can do — you could intervene on X2 but still make it depend on X1, just differently. So these interventions are quite flexible, and you do use them in practice.

[Audience] With this notation, what distribution are we talking about — is it a joint distribution? — Yes, it is a joint distribution over X1, X2, X3, X4; if you like, it is just a measure on R^4. The important point is that it is a different distribution: if you do not intervene you get the distribution we call P, and if you do intervene you get a different one. For example, X4 can take many different values under P, whereas if I intervene and set it to 13 it is always 13, so it must be a different distribution — and this is just our name for that new distribution. If you do not like the notation, call it P-tilde; I am just telling you it will be convenient later to call it this way. [Audience] Can we intervene on several equations simultaneously? — Yes, and that is a good question, because sometimes we will need to do exactly that. [Audience] Maybe you can tell us what P under do(X1 = 13) would be? — It is the same idea: we start with this structural causal model and replace the assignment for X1 by X1 := 13. And how does the corresponding graph look? You can probably answer this yourselves: X1 is now always 13, so one of the edges disappears — this one, the edge from X3 into X1.

Good, I think we do a short break, and then we look at a concrete example with the kidney stones. Just a five-minute break.

OK, let's continue. Let's look at a concrete example: the kidney stones.
What do we have? As I tried to argue before, we are interested in what happens if we set the treatment to, say, A — everyone gets treatment A — so this is an intervention. The point now is that I am giving you the causal structure, but I am not giving you the full structural causal model: we do not know what the functions and the noise distributions are. We are only given the observational distribution and the causal graph. In fact I am lying a bit, because we are given a finite amount of data, but let's say it is very close to the observational distribution, because I do not want to focus on the statistics part here. So the question is: we are given data from this observational distribution — can we somehow infer something about this interventional distribution, where we intervene on the treatment? The answer is yes, we can, and what leads to the answer is, let us say, a very nice tautology. Again, we want to compute the distribution under the intervention do(Treatment = A): this is what we are interested in, but we have data from the "wrong" distribution. So here is what I call the most useful tautology ever. How does it read? "If you intervene only on X_j, then you intervene only on X_j." This is clearly a tautology — so why does it help? Look at the example: this is the observational distribution we have data from, with the variables size of the stone, treatment, and recovery, and we have an interventional distribution where we set the treatment to A. We are interested in that distribution — for example, in the probability of recovery under this do-intervention — but we only have data from the observational one. Now the tautology comes into play. We are intervening on the treatment: we set the treatment to A. Implicitly this also means that we are not changing the way the recovery depends on the size of the stone and on the treatment: we intervene on the treatment, but we do not intervene on the recovery, so the structural assignment for the recovery remains the same after the intervention. This means — and you can think this through in the break if you like — that the conditional distribution of recovery given treatment and size does not change: it is the same in the observational distribution and in the interventional distribution. In a way this follows from the tautology, but it is a very useful fact, and it is what we are going to use now. Is this clear? This is what we are looking for, and this is what we can use. What else remains the same? There is one more thing. What about the distribution of the treatment? That one of course changes — it is where we intervened. But there is one more. Yes: the size of the stone. The marginal distribution of the size of the stone is exactly the same, because we are not intervening there, so the structural assignment for the size of the stone remains the same. Now I would like to ask you to take pen and paper and see whether you can compute the target quantity — just five minutes.
This is the target, the quantity we want to compute, and these are the facts you can use. Let me also write down the other one: P^{do(T:=a)}(S = s) = P(S = s), so this does not change either. What you have to do is massage the target a bit so that it only depends on terms of these two forms, because then you can transfer everything to the observational distribution.

[An audience member suggests conditioning on the two stone sizes, whose probabilities are about the same, and averaging the recovery rates under treatment a.] Yes, that is exactly the right intuition, even without pen and paper. Shall I show you how to do it formally? The details do not matter too much, but this is the quantity we want to compute, and we do not have data from this distribution; we have data from the observational distribution, which is not the interventional distribution. Now we use the fact that some of the terms are the same in both distributions, namely those conditionals, and we massage the formula until it looks exactly like what you suggested. The first step is simply to marginalize over S; that we can always do. Because T is always a under this intervention, we can also add the event T = a, since it happens with probability one when we intervene on T and set it to a. The next step is just the definition of a conditional distribution: the joint equals the conditional of R given S and T times the probability of S and T, summed over S. Then we get rid of the T = a term again (you do not have to copy this; I can put the slides online if you like). What you are left with is an expression that still carries the do(T = a), so it still refers to the interventional distribution, but now we can use our most useful tautology ever: this conditional is the same in the observational and in the interventional distribution, so we can replace it, and the same for the marginal of S. What you end up with is exactly what you suggested: the recovery given that the size of the stone is small or large and the treatment is a, multiplied by the probability of having a small or a large stone. Because the two groups were roughly the same size, roughly 50 percent each, you essentially take the average of the two recovery rates under treatment a. Going back to the data: this is the probability of recovering given a small stone and treatment a, and this is the probability of recovering given a large stone and treatment a, and you get roughly their average, a tiny bit closer to the second one because that group is a tiny bit larger. That is why you get 0.832.
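Written out compactly (this is my own notation for what is on the slide), the whole derivation is

p^{do(T:=a)}(R=1) = \sum_s p^{do(T:=a)}(R=1 \mid S=s, T=a)\; p^{do(T:=a)}(S=s) = \sum_s p(R=1 \mid S=s, T=a)\; p(S=s),

where the last step replaces both factors by their observational counterparts, which is allowed precisely because the intervention on T changes neither the structural equation for R nor the one for S; plugging in the numbers from the slide gives roughly 0.832 for treatment a.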
You can do exactly the same computation for treatment b, and what you will see is that if you force the person to get treatment b, the probability of recovery is roughly 78 percent, so it is lower. What does this mean? It answers our first important question, namely which of the two treatments is better. This is exactly the quantity you want to look at if you are wondering which treatment is better, because you cannot just look at the data as it is: it is messed up by the doctors assigning treatment a to the more difficult cases. If you are a patient and you do not know whether you have a small stone or a large stone, and you just have to decide whether to get treatment a or b, this is exactly what you are looking for: what is the probability of recovering if I force myself to get treatment a, or treatment b? And here you see that treatment a is indeed better than treatment b.

[Question: why do we need the causal framework for this; could we not just look at the data and the confounder, the way the groups were split?] A very good question, and a very good transition to the next two slides, I hope. By the way, just as a reminder, because we have seen this already: this quantity is different from the probability of recovery given treatment a. I want to stress this once more. Conditioning means you see yourself in the study, you do not think about interventions, and you observe that you got treatment a; what is my probability of recovery then? That observation carries information: if you got assigned treatment a, you were probably a hard case, so the recovery rate is lower. It is a different thing to condition and to intervene. You see this all over the place, but this is one thing I would like you to take home.

Now, why do we need the causal framework? It turns out this idea holds very generally, and that leads to the definition of what is called an adjustment set. Imagine that it is not just the size of the stone, but that there are many other variables as well; then we need to adjust for them. Formally, we say: we are given a structural causal model over X, Y and some covariates W, and we are interested in the causal effect from X to Y, very much like before, where X is the treatment and Y is the outcome, the recovery. Then W is called a valid adjustment set if we can compute the interventional distribution by what is called adjusting for W. This is the averaging we did: we took the recovery under treatment a and averaged over the size of the stone; you can think about taking averages here again. And again, this is different from just conditioning Y on X; it really is a different object. Why do you want such an adjustment set? Because then you can do what we just did for the kidney stones in a completely general manner: you are given data from the observational distribution, and you can compute the interventional distribution from the observational distribution alone.
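In formulas (again my own notation, in the spirit of the slide), W being a valid adjustment set for the pair (X, Y) means that for all x and y

p^{do(X:=x)}(y) = \sum_w p(y \mid x, w)\, p(w),

so the interventional distribution on the left can be computed from conditionals and marginals of the observational distribution on the right, exactly as we averaged over the size of the stone; note that the right-hand side is in general not the same as p(y \mid x).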
It is like magic: we do not have to intervene at all. Given the causal structure, we can just read it off. The parent adjustment is the easiest form of adjusting: it says you can always use the causal parents of X for adjusting; they always form a valid adjustment set for (X, Y). We will see a more complicated graph in a second.

I want to say one word about adjusting in linear models. The interventional distribution is a full distribution, and sometimes you want a simpler summary. If you want to talk about the causal effect from X to Y, you could always compare P(Y) with P(Y) under do(X := x): if these two are very different, then X probably has a strong effect on Y. You intervene on X, and if the distribution of Y changes a lot, you would say there is a strong causal effect from X to Y. There is one way of summarizing this with a single number, which is especially interesting in linear models; it is sometimes called the causal strength or the causal effect from X to Y. How do we define it? We ask how much the expectation changes: you take the expectation of Y in the new distribution, under do(X := x), and then you take the derivative with respect to x. Intuitively it is the same idea: if you change x a tiny bit and you see a dramatic change in the expectation of Y, you would say there is a very strong effect from X to Y. This is sometimes used as a one-number summary of the causal effect.

This quantity has a very nice property in linear Gaussian models, where all the structural equations are linear with Gaussian noise: you can read the effect off the graph. Think about a graph like this one, with coefficients on the edges, say 2, minus 3, 4, 1, 5, 7, minus 7. If you are interested in the causal effect from X to Y, you multiply the coefficients along the directed causal paths from X to Y; in this case the causal effect equals 1 times 5, which is 5. That is intuitive: there is another path here, but it is not a directed path from X to Y, and you do not care about it, because when you intervene on X and ask how much the expectation of Y changes, what really matters is the directed path, 1 times 5.
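As a formula (my notation), the one-number summary is

C_{X \to Y} = \frac{\partial}{\partial x}\, \mathbb{E}\big[\, Y \mid do(X := x) \,\big],

and the claim for linear Gaussian structural causal models is that this derivative equals the sum over all directed paths from X to Y of the products of the edge coefficients along each path; with a single directed path carrying the coefficients 1 and 5, that is 1 times 5 = 5.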
Any guesses what happens if there is a second directed path? Right: then it is minus 2, because you add up the contributions of all the directed paths; these are, I believe, Wright's path rules, or something like that. It is a small exercise, if you like, to make sure this is really the case; I am just claiming it here, but I think it is nice for the intuition, because in linear causal models this is what you would expect: you are really only interested in the directed causal paths from X to Y, and the other paths are not interesting.

[Question about comparing with the conditional expectation instead.] You can also do that, but you have to be a bit careful. Here we are not comparing with the conditional at all; we only look at the intervention. But think of a situation where X and Y are connected only through a common cause: you want the causal effect from X to Y to be zero, while the conditional really changes, because if X is large then Y is probably also large. So one has to be a bit careful there. I am not saying by any means that this definition is optimal; there are many different ways one can define such a summary.

But now we will see that this is actually quite useful in the following sense. These were adjustment sets; how does adjustment come into play? If you want to find this effect in practice, there is another way of getting this number, the five, or in this case the minus two, and it is important for practical reasons, because it is exactly what you do when you have data. The first way was to look at the paths. The alternative is: if Z is a valid adjustment set (and the proposition, for example, gives you one way of checking this), then the causal effect from X to Y equals the regression coefficient of X in a linear model in which you regress Y on X and Z. It is the same number, and it is very useful in practice: if you know a valid adjustment set, then all you have to do to find the correct causal effect from X to Y is to fit a linear model.

Let us look at an example; hopefully this makes it clearer. Here is a causal graph that is a bit more complicated, and we are interested in the causal effect from X to Y. How can we compute it? One way is to use the first criterion: what is a valid adjustment set? It said that we can always use the parents of X as a valid adjustment set. Why is this intuitively the case? You cannot just condition Y on X, or just use the regression of Y on X, because of what we call a backdoor path; it messes up the relation. Think of the kidney stones example: this was the treatment and this was the recovery, and we could not just condition recovery on treatment because of the dependence on the size of the stone. Those are the variables you want to adjust for. At least intuitively, if you condition on the parents of X, you block all the backdoor paths.
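As a formula, the regression version of adjustment from a moment ago (my notation, for the linear Gaussian case): if Z is a valid adjustment set for (X, Y), then

C_{X \to Y} = \beta_X \quad \text{in the least-squares regression} \quad Y = \beta_X X + \beta_Z^\top Z + \varepsilon,

i.e. fitting this linear model and reading off the coefficient of X recovers the causal coefficient, up to estimation error.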
That is the intuition; formally, the proposition tells us that if you adjust for A and C, in this case the parents of X, you are doing fine: you block all the paths you are not interested in, and only the directed causal path from X to Y remains. The parent set is a valid adjustment set, so we can correct for A and C and still obtain the correct causal coefficient from X to Y. Can you guess whether there are other valid adjustment sets? Is it really necessary to adjust for C? Not really: A seems important, because it blocks this backdoor path, but C somehow does not matter. And there is a more general criterion than the parent adjustment, namely the backdoor adjustment: what you really want is to block all these backdoor paths, so called because they enter X through the back door, so to say. We have seen that {C, A} is a valid adjustment set, the parent adjustment, and I am telling you without proof that you can, for example, also use {K} for adjustment, and you can also use {F, C, K}. It does not matter; it is all the same thing.

Now this becomes more interesting, because imagine someone gives you this causal structure and data from it, and you are interested in the causal effect from X to Y. You know you can adjust for A and C, but now imagine you do not observe A: it is hidden, you have no data about it. Then this theory tells you how you can still adjust. This is especially important if you have a lot of hidden variables and you still need to block the backdoor paths somehow; the theory tells you exactly how to do it.

I just want to show you one example; maybe that is helpful, and it only takes two minutes. Take a linear model, put coefficients on all of these edges, and let us say I now generate a data set with exactly these coefficients, so this one is minus 2 and this one is minus 1. What is the causal effect we would like to recover? Plus 2, of course: the product of the path coefficients. But we pretend we do not know that these are minus 2 and minus 1; we just have data and the causal graph, and we want to recover this plus 2. What you would do in practice is as follows. This is just a very small R program; can you see it, or is it too small? Let me make it bigger. What are we doing here? We are generating data from the structure I just showed you; these are all linear structural equation models, and I use Gaussian noise, but of course you can use anything else if you like. I am generating a data set, and this is how you generate data from a structural equation model: you start at the source nodes. Let me show you the picture once more: we start with C, then we simulate A, and then we propagate down through the graph. So far so good. Now we want to recover the causal coefficient from X to Y, so we are not just simulating data; we have our data set, and you can look at it, some distribution, some data.
Now the question is: what is the causal coefficient from X to Y? If you just regress Y on X, that does not do the trick: the coefficient you get is about 1.3; it is biased, it is not the correct causal coefficient. Why? Because of the backdoor path that you are not controlling for. Instead, the parent adjustment says you should adjust for the parents of X, that is, for A and C; or you can adjust for K. What this means in practice is that you just use a linear model including K: you regress Y on X and K and look at the regression coefficient of X, and suddenly you get an unbiased estimate, much closer to 2. As I said, you can also use F, C and K; I chose that one on purpose because I am a big fan of FC Köln, but it does not matter. Again the coefficient of X is very close to 2, something like 1.998. So if you adjust for these variables, you get an unbiased estimator.

But, and this is of course important, you cannot do this for every set. It is not the case that the more variables you include, the better: if you also include H, for example, you again get a biased estimate; then you are not close to 2. This theory of adjustment tells you exactly what you may adjust for and what not. Intuitively: adjusting for A and C is fine because we block this backdoor path; including F as well does not hurt, which is a bit trickier to see, but you can imagine that it has no influence on this path; including H or G you are not allowed to do, because by including G you are messing up this path here; and whether you include C or not does not matter. The theory tells you exactly which sets are valid adjustment sets and which are not. I showed you a sufficient condition, but there are also necessary conditions.

[Question about how this relates to the definition of a valid adjustment set.] It is different, but not so different; I am not doing the proof here. It is an exercise: you can show that if Z is a valid adjustment set, then you obtain the causal coefficient as the regression coefficient of X in a linear model in which you regress Y on X and Z. That is what I meant earlier, and it does need to be proven. [Question about the backdoor criterion and why the slide says A and C.] Because those are the parents of X; if I had written something else, it would be wrong. [Follow-up: is it obvious from the definition that the other set is not valid?] That is a bit more tricky: I am telling you it is not valid, but it is not so clear from the definition; you have to work it out. It takes you maybe seven minutes, I guess, but there is something to show. In general there is a characterization of valid adjustment sets, an if-and-only-if condition.
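Here is a minimal R sketch of the kind of experiment just described. Note that the graph below is a simplified stand-in with a single confounder A and a forbidden variable H, not the exact graph on the slide, and the coefficients are made up for illustration:

set.seed(1)
n <- 5000
A <- rnorm(n)                    # confounder, stands in for the backdoor path
X <- -1 * A + rnorm(n)           # treatment-like variable
Y <-  2 * X + 3 * A + rnorm(n)   # true causal coefficient from X to Y is 2
H <-  X + Y + rnorm(n)           # a descendant of X and Y, not allowed in an adjustment set

coef(lm(Y ~ X))["X"]             # biased: the backdoor path through A is open
coef(lm(Y ~ X + A))["X"]         # close to 2: {A} is a valid adjustment set
coef(lm(Y ~ X + A + H))["X"]     # biased again: including H destroys the estimate

The same pattern as on the slide shows up: the plain regression of Y on X is biased, adjusting for a valid set recovers the causal coefficient, and throwing in a variable such as H, which is downstream of X and Y, breaks it again.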
We have written a book on causality, which will appear at MIT Press next year; it is already online, so you can have a look at it. There we go through this valid adjustment set business, and what you see is that it is really based only on this tautology, on MUTE, the most useful tautology ever: you always have to ask which of the conditionals remain the same and which ones change, and from that you can derive the if-and-only-if condition for what a valid adjustment set is. Here you have to trust me a bit. Intuitively, what you are doing when you adjust is something like conditioning: you block the paths you want to block and leave the causal path alone. That is not a mathematical proof, but the proof is really based on this idea.

[Question: if you measure how the estimated causal effect from X to Y changes as you adjust for different factors, does that give you evidence about where those factors sit on the pathway?] That is a very interesting idea. If I understand correctly, you are saying: if I do not know the causal structure, I estimate the causal effect several times with different adjustment sets, and that may give me a hint about the structure. A very good point. I thought about this once, but I do not know of any method that exploits it; I more than welcome you to look into it, I think it is an interesting thing. There is one method, which I may discuss tomorrow or later today, that comes close to this; it looks at these kinds of conditionals.

OK, back to one example, and this one I think is quite nice. It is 1950 and you are working for the tobacco industry. The politicians tell you that smoking is causing lung cancer and that they have to put a tax on tobacco. You work for the tobacco industry; what do you say? Obviously the first thing you say is that this is rubbish, but then they ask you what you mean. So what do you tell the politicians? [Suggestion: the tax is causing lung cancer.] OK, you might get a bonus from the tobacco industry, but I think it will be a very small one. Anyone willing to earn a larger bonus? [Suggestion: it is just correlation.] But then you have to explain the correlation, right? You explain it by a hidden factor. So what do you claim it is? Socioeconomic status: very good, you get a medium bonus. Why not the highest bonus ever? Because it is a good idea, but the problem is exactly this idea of adjusting. Imagine the hidden variable is socioeconomic status; then, if you are interested in the causal effect from smoking to lung cancer, what do you have to do? You have to adjust for this variable, and we know how to do that: we get data from this causal structure, we measure the socioeconomic status and adjust for it, for example by including it in a linear model, or by using the formula I showed you before, and what you get out is the causal effect from smoking to lung cancer. And this paper did a lot of exactly that: if you read it, they adjust for many, many things,
for age, for stress factors, for sex, obviously for where people come from, and so on, and whatever they adjust for, they always see that there is an effect from smoking to lung cancer. [Suggestion: there is something else on the cigarettes, something that is not the tobacco.] Also a nice idea, but then the politicians would say: fine, we do not care, we just tax the cigarettes instead of the tobacco. These are all good ideas; I am just fishing for the idea that they actually used. [Suggestion: genes.] There you get a very high bonus. This is what they did, and it is pretty clever: they said there is a hidden factor, and it is a genetic factor, and because it is the 1950s, there is no way of measuring it. Of course, from today's perspective it sounds ridiculous: they really claimed there is one gene such that, if this gene is expressed, you feel the desire to smoke and you also get lung cancer with a very large probability. Nowadays we know that there is no single gene causing anything like this, and having two such very different things encoded by one gene is very unlikely. But it is 1950, and if this were the case, you would need to adjust for it, and there is no way to do it: the only valid adjustment sets you find in this causal graph include this hidden variable, and if it is not measurable, there is no way of adjusting for it. This is what they did; these guys were pretty clever, and I am pretty sure they got a very good bonus for it.

Something that astonished me a bit: this is a book that I recommend, I enjoyed reading it a lot, Merchants of Doubt. It turns out that some of these scientists who were experts on lung cancer and smoking were also active on climate change, and there too they found very effective ways of claiming either that climate change does not happen or that it is not caused by humans. The book describes how one can try to influence the interpretation of science.

[Question about looking at family histories or at different populations.] Yes, partially that is what they did: they also looked at, for example, populations in different countries, which comes close to this. It was serious work, because it was a very important question. If you see the same increase in two very different genetic population structures, you can try to argue against the genetic explanation, but it becomes very hard to do this cleanly. [Question about other exposures, such as soot in the air, causing lung cancer in people with no relation to smoking.] Also a good point. What you are doing then is broadening your model class; that is what you have to do, and we will see that this leads to what are called independence-based or constraint-based methods, where you can argue about this from a mathematical point of view.
There is one more thing I want to mention today, and it is maybe surprising that it only comes now, because it is so essential for causality: the idea of randomized studies. It turns out that a couple of centuries ago scurvy was a big problem; I read that in the 18th century it caused more deaths among British sailors than any enemy action. It was a big problem on the ships, and James Lind, a Scottish medical doctor, conducted one of the first randomized experiments. Usually the early examples you see come from a different field: the Facebook and Google of two centuries ago was agriculture, which is where all the statisticians worked, but this is an example from the British Navy that I like.

What do you do in a randomized experiment? Maybe I am biased, but I think it is a really genius idea. As we have seen a couple of times now, we have a treatment and a recovery, and we are interested in the causal effect from the treatment to the recovery, but then you always have to adjust, because of these hidden backdoor paths. It was the size of the stone in the example I showed you, and in many other studies it is the same game. Take the chocolate consumption and the Nobel prize winners: there is a hidden common cause, probably something like the economic strength of the country, because rich countries spend more money on research and on chocolate at the same time. You always have these hidden factors, and you can try to correct for them, as was done in the lung cancer studies, but you can never be sure that there is not another one you did not think of.

Now, what do you do in a randomized study? I think it is really genius: you randomize the treatment. You are not just observing; instead you throw a die, and if the outcome is one, two or three you get the treatment, and if it is four to six you get a placebo or no treatment. Apparently, throwing dice is not random enough: at some point the medical doctors had to call a number, and the person on the other end of the phone told them a random number, which they then had to use, because apparently some doctors would throw the die and, if they did not like the result, throw it again. That is maybe not what you want in a randomized study, and for some reason it is harder to cheat when another person tells you the random number. Anyway, these are implementation details, which do matter, but the idea is really that you randomize the treatment, and this automatically kills all incoming arrows. You are then getting data from a causal structure in which the treatment does not have any incoming arrows. So what is a valid adjustment set? The parent adjustment says it is the empty set: there are no parents of the treatment, so we do not have to adjust at all. You can just run a linear regression, or any model that you like, of the recovery on the treatment. I think this is really genius, because no matter how complicated the system is, if you have a way of randomizing the treatment, it becomes very easy to find the causal effect.
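A tiny R sketch of this point (the variable names and numbers are made up for illustration): even though a hidden severity variable affects recovery, the randomized treatment has no parents, so the plain regression of recovery on treatment already gives the causal effect, with no adjustment at all.

set.seed(2)
n <- 10000
severity  <- rnorm(n)                        # hidden common cause, never observed
treatment <- rbinom(n, 1, 0.5)               # randomized: no incoming arrows
recovery  <- 1.5 * treatment - 2 * severity + rnorm(n)

coef(lm(recovery ~ treatment))["treatment"]  # close to 1.5 without adjusting for anything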
This is what our society makes use of nowadays: these are the randomized studies we run in medicine, used everywhere, and it seems to be something we trust. In formulas, it says that the distribution of the recovery if we intervene on the treatment is the same as the distribution if we condition on the treatment. Usually that is not true, but because there are no incoming arrows into the treatment it is, so we can just read it off from our data.

How does this work out in practice? This was one of the first studies. What James Lind did is, he was on one of the ships, and we can read this together: on the 20th of May 1747 he selected twelve patients in the scurvy (there was a scurvy outbreak) on board the Salisbury at sea. What they thought was that you need some sort of acid to cure it; they did not know anything better. So they distributed different treatments, all related to acid. Two were ordered each a quart of cider a day; those were the lucky ones. Two others took twenty-five drops of elixir of vitriol, which is basically sulfuric acid, three times a day; those were the less lucky ones. Two others took two spoonfuls of vinegar three times a day. Two of the worst patients were put on a course of sea-water; here you see it was actually not a well done randomized study, because you should not look at which patients are the worst cases. And two others had each two oranges and one lemon given to them every day. They did not know about vitamin C, it had not been discovered yet, but those were the ones who got better, as you can imagine. The two remaining patients took an electuary recommended by a hospital surgeon, some weird stuff that was common in those days. The consequence was that the most sudden and visible good effects were perceived from the use of the oranges and lemons, one of those who had taken them being at the end of six days fit for duty. It took some more time for this to really get published, acknowledged and repeated, but it was one of the very first randomized experiments, and this guy had the correct idea. Any questions about this? Is it clear how these randomized studies fit into the picture? That is important to me. Good.

There is one more thing, and it is only a side comment; it is not important for the remainder of the mini course, but at some point it was important to me. So far we are working only in the world of mathematics: we define a structural causal model, we define what an intervention is, we define adjustment, and that is all math. At some point we have to link this to reality: what is a cause in reality? I think the following notion is pretty useful. Two causal models are called probabilistically equivalent if they entail the same observational distribution, and interventionally equivalent if they agree on the observational and on the interventional distributions. It is a trivial definition in a way, but for me it is really important for linking the mathematics to reality.
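Stated a bit more formally (my wording of the definition on the slide): two causal models M_1 and M_2 over the same variables are

probabilistically equivalent if P_{M_1} = P_{M_2}, and
interventionally equivalent if, in addition, P_{M_1}^{do(X_j := \,\cdot\,)} = P_{M_2}^{do(X_j := \,\cdot\,)} for all the interventions under consideration,

so interventional equivalence is the stronger notion: agreement on the observational distribution and on every interventional distribution.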
Because it turns out that it does not matter what kind of interventions you look at: it suffices to consider interventions in which you randomize a single variable X_j, that is, you set it to a noise variable with full support. We have seen that there are many, many interventions one could do, but it suffices to look only at interventions where you take one of the variables, say the treatment, and randomize it, setting it to a random variable. This is exactly the idea of the randomized treatment, and I think it establishes the link to reality.

In reality we have a model, and this is no different from statistics: you write down a model that somehow describes the data generating process; say you claim there is a Gaussian distribution. You receive data, and at some point, according to Popper at least, you want to be able to falsify this model. So you construct tests: you get more data, you test whether it looks Gaussian or not, and if it does not, you reject your model. I think this is very important for science, if you do it properly, or at least roughly so. Here we can now do the same thing: we are not looking at a statistical model but at a causal model, and we should somehow be able to falsify it too; otherwise it is a bit of an open question what game we are playing. And we can. A causal model can be falsified by looking at the data generating process: if the observational distribution does not seem to come from the model, we can reject it, and that is the game we have played before. But now we can also link it to interventions: if the causal model predicts what happens in a randomized experiment, and I actually perform that randomized experiment and the distribution of Y comes out very different, then it is probably the wrong causal model. It is nice that you only have to look at these randomized experiments, and this really is the link for how you can falsify causal models. From a philosophical point of view I think that is important.

[Comment: if the system had been randomized properly, we would not have needed to adjust.] Yes: if the kidney stones study had been a randomized study, the size of the stone could not have been causing the treatment; the treatment would have been completely random, and then you do not have to adjust. But in a way it is an interesting thought, because if you already believe that treatment a is better, it is a very fair question whether you should randomize at all. It is tricky, but if there is a way to argue that the assignment depends only on the size of the stone and on nothing else, then we can actually adjust for it, and then you save more lives at the end of the day. So there is a trade-off during the study over how many people get hurt. In some studies this is very difficult, because as soon as the doctor sees the patient, you would suddenly need to correct for a lot of things. [Comment: even in randomized trials one still sometimes needs adjustment.] Yes; maybe two comments on that.
One point that I do not address here, but which is of utter importance in practice, is the question of statistics. You can adjust, but you have seen that the values I got in the R program vary around two, so the question is what the correct confidence intervals are, and also whether some adjustment sets are better than others in that respect. The second point is that yes, in randomized experiments you sometimes still need to adjust, and the reason is that things can become arbitrarily complex. There are experts on this not far from here; Jamie Robins, for example, works a lot on it. Often you have sequential treatments: you treat someone, you randomize, and then, depending on the outcome, maybe you do not see a result or you see a side effect, you change the treatment later on, and then you sometimes need to adjust. Another problem that turns out to be very important in practice is that some people who think they got the placebo drop out, so they stop taking it, and this is something you need to correct for. In practice you get all of these problems, so this is a simplified picture, maybe not a very easy case, but it is a full line of research if you like.

[Question: if you have a series of interventions in which you have actually played with all the variables, can you infer the causal graph?] Yes. The simple way to see this is to imagine you have intervened on every variable. Say there is an underlying causal structure, a directed acyclic graph over, let us say, twelve variables, and you have performed twelve interventional experiments. Then of course, if you intervene on X5 and you see X7 and X11 change, they must be downstream, and if some others do not change, they are not. If you know for every variable which other variables are downstream of it, you can infer the structure. But there exist much more efficient ways; I may not be able to talk about them in this mini course, but in the seminar talk on Friday I will talk a bit about one of these methods.

I have learned that one should sometimes do visual breaks, so here is my visual break from Copenhagen, and I think this is a very good place to stop. Let me make sure when we reconvene: 2 p.m., if you are interested. OK, thanks for your attention. [Applause]
Info
Channel: Broad Institute
Views: 51,017
Keywords: Broad Institute, Broad, Science, Institute, of, MIT, and, Harvard
Id: zvrcyqcN9Wo
Length: 104min 5sec (6245 seconds)
Published: Wed May 24 2017