Toward Causal Machine Learning

Video Statistics and Information

Captions
Each year, Microsoft Research hosts hundreds of influential speakers from around the world, including leading scientists, renowned experts in technology, book authors, and leading academics, and makes videos of these lectures freely available.

Very good, I think we are just about ready to roll. It's a great pleasure to have in the lab again Bernhard Schölkopf. I say again because when I first came to the lab, which is nearly 16 years ago or something, Bernhard was already here as a scientist. He has been on a long journey since then, going via New York to become director of the Max Planck Institute where he is now, except it's a slightly different Max Planck Institute than when he started. In the meantime he has, I would say, transformed our understanding of kernel learning, and everybody even close to that field now uses his ideas. He has been a remarkably innovative scientist, and more recently has got into the issue of causality in machine learning, which is a fascinating issue. Bernhard, we're very much looking forward to hearing what you have to tell us about it.

Thank you very much for the introduction and the invitation. It's a great honor to be here, and also a great pleasure to be back; I have very fond memories of my time in Cambridge. It's where I spent some of the best years of my life, and I also learned a lot. In particular, I learned that the world is not just kernels and VC dimension; there are also Bayesian methods and things like that, and maybe you will see some influence of this in today's presentation, which is partly about kernels, but not only about kernels.

Let me start with an application. I work in machine learning and machine inference, and of course this has many applications in science and industry. For instance, if you are looking on a shopping website for this item, a circular glass cutter, then from past purchases of other customers Amazon can recommend what else you might want to buy and give you some suggestions, such as this balaclava, or this baseball bat, or, maybe a little bit surprisingly, this rucksack with bat wings. So there are various things you might want to buy.

Let's be a little more abstract now and think about machine learning and inference more abstractly. Assume we have measured two observables x and y and found them to lie approximately on a straight line, so we might be willing to infer a corresponding law of nature. Already Leibniz thought about these kinds of problems, and he pointed out that even if we scatter spots of ink randomly, for instance by taking a quill pen and shaking it over a piece of paper, we would still find a mathematical equation that is satisfied by these points; and he argued that we wouldn't call this a law of nature, because no matter how the points are distributed we can always find such an explanation. So according to Leibniz, the explanation should be simple in some sense. Of course this raises the question of what makes an equation simple, and a lot of people have thought about this from the mathematical point of view, such as Kolmogorov and Chaitin. Now if you're a physicist you could take a more pragmatic view, and this is a physicist here, Rutherford: he said that if your experiment needs statistics, you ought to have done a better experiment. I'll try to convince you that there are interesting inference problems where better experiments are not enough, so inference problems that are non-trivial in some sense.
I think we're currently experiencing a revolution in the natural sciences, and maybe not only there, where people try to solve exactly those problems, and it turns out that learning and inference play a central role in this. Here's an example from bioinformatics: we trained a support vector machine to solve a certain classification problem based on human DNA sequence information. It's reasonably high dimensional; the training set consists of 15 million sequence snippets. It doesn't matter what exactly this curve shows; the main message is down here. If we only have a thousand training examples, the performance is essentially at chance level, but as we move up to ten million or more examples, the performance gets very good. So there is a regularity in the world, it's not chance, but as humans we wouldn't see it, because we wouldn't be able to look at ten thousand of these examples; we wouldn't be able to make any sense out of such a data set. It's a structure that we wouldn't see, and I think that's really quite interesting, and it's highly non-trivial to find it.

Some characteristics of these kinds of problems are that they are high dimensional, they are complex, for instance nonlinear and nonstationary, and we typically have little prior knowledge, so we don't have a mechanistic model of what exactly is going on. As a consequence of all these issues we need big data sets, and to process them we need computers, and we need automatic inference methods, machine learning methods.

The core problem of machine learning is called generalization, and I'll try to illustrate it with a little toy example. Suppose we've seen these four digits and ask what's the next digit. I think any computer scientist can answer this question. I guess 11 is a very good answer; this is called the lazy caterer's sequence, because it's the number of pieces into which you can divide up a cake with n cuts, if you just make sure that every cut intersects all previous cuts. But 12 is also a good answer, 13 is a very nice answer, it even has a name, and 14 is a nice answer; in this case I would predict that the sequence ends. This is also a nice answer, which was proposed by a former postdoc of mine, and which combines the decimal expansions of pi and e. In fact, you don't even need that: you can go to this nice website which lets you search for digit sequences in pi, and then you find out that it occurs at position 16,992, and here's how it continues, so that's a very compact description of this sequence. And in this website, the Online Encyclopedia of Integer Sequences, you get 600 hits for these digits. So obviously that can't be the whole answer, and if we ask which continuation is correct, there's really no way to tell. This is called the problem of induction in philosophy.

Now, in statistical learning theory we try to ask a slightly easier question: how can we use observed data to come up with a law that's correct, or that's typically correct, or that's with high probability more or less correct? This is close to what Karl Popper called the demarcation problem, also a classical problem of philosophy: the problem of what separates physics from metaphysics, so what kind of methods you should be using to call yourself a physicist as opposed to a metaphysicist. Vapnik and Chervonenkis tried to answer this kind of question, starting in the 1960s at the Institute for Control Sciences of the Russian Academy of Sciences.
As PhD students, they studied the simplest scenario of machine learning, which is called two-class classification, or pattern recognition. In this problem we are given observations consisting of inputs x and labels ±1; for instance, these could describe the class of a handwritten character, is it a 0 or a 1. We assume there's an unknown regularity, modeled as a probability distribution, which we think has generated these observations, and our goal is to minimize what's called the risk, the expected error on future data drawn from the same distribution, based on these training observations. We can write down the expected future error, but we can't compute it, because we don't have access to this distribution; we only have data from it. We can approximate this quantity by an empirical quantity, which would be called the training error, and then we can ask ourselves: if we minimize the training error, will we get a solution that is close to minimizing the future error? The answer, in general, is that this doesn't work; we need additional conditions or assumptions. We need conditions on the class of functions over which we minimize, and essentially the class of functions shouldn't be too complex: if it's arbitrarily complex, then we can explain the data no matter how we shook the quill pen.

One way to capture the complexity of such a class of functions is now referred to as the VC dimension. This is a notion they came up with as PhD students, and it is now the basis of whole branches of mathematics, such as empirical process theory. The main statement is this: the process of minimizing the training error, which is called empirical risk minimization, is consistent, so in the limit it leads to the right answer, independent of the regularity that has generated the data, provided the VC dimension of the function class is finite. Maybe I don't need to define the VC dimension; the message is that it's a measure of complexity for function classes. It's a combinatorial measure, difficult to evaluate for complicated function classes, but we know how to do it for linear function classes, which is nice because afterwards we will reduce something else to the linear case.

So we can do it for linear function classes, but what if the problem is not linear? Here we have a typical nonlinear decision problem: we want to separate these blue crosses from the red circles, and the true decision boundary is this ellipse. It turns out, if you think about it for a second, that if we first transformed all training points using this nonlinear mapping, mapping them into a higher-dimensional space, this three-dimensional space spanned by these three coordinates, then we could suddenly solve this classification problem using a linear equation, using what's called a hyperplane. What's nice about this trick is that if we want to compute dot products in this three-dimensional space, the dot products reduce to a simple function of the two-dimensional inputs. This is called the kernel trick, and this is an example of a degree-2 polynomial kernel. The same thing also works for higher-degree kernels, not just degree 2, and it also works for d-dimensional inputs, so we can use this kind of trick to implicitly compute dot products in higher-dimensional spaces of points that we have nonlinearly mapped into those spaces. The same thing works even more generally, for everything that satisfies the condition of positive definiteness.
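To make the kernel trick just described concrete, here is a minimal numerical sketch (my own illustration, not code from the talk): for the degree-2 polynomial kernel on two-dimensional inputs, the dot product of the explicitly mapped points equals a simple function of the original inputs. The feature map `phi` and the helper names are my own choices.

```python
# Verify the degree-2 polynomial kernel trick numerically:
# <phi(x), phi(z)> = (x . z)^2 for phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(x):
    """Explicit feature map into the 3-d space spanned by the monomials."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, z):
    """Kernel evaluated directly on the 2-d inputs: k(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = float(np.dot(phi(x), phi(z)))   # dot product in the feature space
rhs = poly2_kernel(x, z)              # computed from the inputs alone
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```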
So whenever we have a so-called positive definite kernel, or a covariance function as some people prefer to call it, we know there is such a representation in a higher-dimensional space that we call the feature space, and in this space we can do all sorts of things. We can, for instance, kernelize, as people call it, a linear algorithm such as principal component analysis, which then leads to something nonlinear, and in this case to a different and interesting feature extraction method: the first two nonlinear principal components separate this simple data set into the three clusters, and then the higher-order components start looking for structure within the clusters in interesting ways; for instance, this component is orthogonal to this one, this one is roughly orthogonal to this one, and so on. This contains various other approaches as special cases.

Now, a support vector machine, which I think a lot of people know about, so very briefly, uses the same idea; actually it used this idea already before. It maps data that may not be linearly separable into a high-dimensional space, and then it computes a linear solution, in this case a separating hyperplane with a large margin of separation, the reason being that for this we can control the VC dimension, so we can make sure that our method generalizes well. The linear separation in the high-dimensional space will then correspond to a nonlinear separation in the input domain. I have a little demonstration here: we start with a mapping into the feature space that's almost linear, by taking a Gaussian kernel that's very wide, and then I gradually turn up the nonlinearity by making the kernel narrower, and you can see that the decision boundary becomes more and more nonlinear. There are various nice properties of these approaches, one of them being that we can prove that the solution is always an expansion in terms of kernel functions, and that it can be computed by solving a convex QP, and various other aspects. So in all cases we generate linear surfaces in the high-dimensional feature space, corresponding to nonlinear surfaces in the input domain.
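The demonstration just described, turning up the nonlinearity by narrowing the Gaussian kernel, can be sketched with scikit-learn as follows. This is my own illustrative code, not material from the talk; the toy dataset and parameter values are assumptions.

```python
# An RBF-kernel SVM whose kernel width is gradually narrowed; the decision
# boundary goes from nearly linear to highly nonlinear as gamma grows.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

# gamma = 1 / (2 * sigma^2): small gamma = wide kernel (almost linear boundary),
# large gamma = narrow kernel (very nonlinear, wiggly boundary).
for gamma in [0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:6.1f}  support vectors={clf.n_support_.sum():3d}  "
          f"training accuracy={clf.score(X, y):.2f}")
```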
These nonlinear surfaces can look rather complicated. This is an example of an application in computer graphics, or computational geometry, where we approximate point clouds, and we then developed a method to morph between different point clouds, which led to this automatic method for morphing shapes.

In all machine learning methods, certain issues have to be addressed one way or another, and I think it's nice that kernels provide a unified framework to do so. The three issues are these. We need a notion of similarity that we use to compare data points, which seems especially important if the data points are not vectorial to begin with, if they are, for instance, strings or graphs or something like that. The kernel induces a linear representation of the data, so it implicitly tells us how we represent the data when we process them. And finally it encodes the function class, because, as I mentioned before, one can prove that the solutions of kernel methods are expansions in terms of kernels.

There's much more to kernels, though, and in recent years there have been some interesting developments where they were used in a slightly more general sense. For me the starting point was a tutorial example which we came up with while working on the book I had on the last slide; a lot of this work was actually done while I was at Microsoft. This is a trivial algorithm where what we try to do is build a classifier that is as simple as possible. In this case we have two point clouds, we map them into the feature space, and then we classify a new point based on whether it's closer to the mean of the one cloud or to the mean of the other cloud. Obviously this induces a hyperplane decision boundary in the high-dimensional space, and you can ask yourself what this hyperplane, or what the decision function, corresponds to in the input domain. It turns out to be a so-called plug-in classifier based on Parzen window density estimates, but it's actually a little bit more than that, because these kernels don't need to be valid density models. Now that's interesting: here we have a point set that corresponds to an element in some other space.

Actually, we can think of this more generally. We can think of the feature map, this mapping that takes points and maps them into the higher-dimensional space where we apply our linear methods, as a mapping not just for points but also for sets of points, and even for continuous probability distributions. If we take a set of points, we define the mapping to be the average over the mappings of the individual points; in the case of a distribution, we define it to be the expectation over this distribution. Then, by the choice of kernel, we determine how much information we retain when we map this object into the high-dimensional space. If we choose a linear kernel, we just retain the first moment, so the mean or the expectation; if we choose a polynomial kernel, we retain the moments up to its degree, either the empirical moments or the analytical moments of a distribution; and if we use a slightly more restricted class of kernels, called strictly positive definite or characteristic, then it turns out we retain all the information, so we don't lose anything. All the information about a data set, or about a distribution, is stored in this one element of the feature space.

Then we can ask ourselves what we can do with these kinds of elements, what kind of linear methods we can apply to them to perform inference on distributions, and there are various applications or special cases. One is something well known from statistics, the moment generating function; this is a special case for one particular kernel. We can also do independence testing: if we map the product of the two marginal distributions and also the joint distribution of two random variables, and compare how far apart they lie, that is, their distance in the feature space, we can construct independence tests. We can construct homogeneity tests, which means testing whether some data set has been generated by the same distribution as some other data set. One can do things like Bayes updates, and one can give nice physical interpretations connecting this to wave optics. Finally, there is a problem that I've been working on more recently, but which I won't go into in detail today, related to probabilistic programming. If you think about what probabilistic programming might mean, from my point of view it could mean that we lift all operations that we can define on standard data types to distributions over these data types; so I'm not thinking of something specific for performing a certain type of inference, I'm just thinking of operations on distributions over data types.
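As a small concrete sketch of the kernel mean embedding and the homogeneity (two-sample) test just mentioned, the following is my own illustration, not code from the talk: each sample is mapped to the average of the feature maps of its points, and two samples are compared by the distance between these means in feature space (the squared maximum mean discrepancy), computed via the kernel trick. The kernel bandwidth and data are illustrative assumptions.

```python
# Kernel mean embeddings with a Gaussian kernel and the (biased) empirical MMD^2.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Squared distance between the empirical mean embeddings of X and Y."""
    return (gaussian_kernel(X, X, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))   # sample from P
Y = rng.normal(0.0, 1.0, size=(500, 1))   # another sample from P
Z = rng.normal(1.0, 1.0, size=(500, 1))   # sample from a shifted distribution Q

print("MMD^2(P, P) ~", mmd2(X, Y))   # close to zero: same distribution
print("MMD^2(P, Q) ~", mmd2(X, Z))   # clearly positive: different distributions
```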
It turns out that if we have some distributions and embed them into such a feature space with this kernel mean mapping, we can give an elegant way of performing general arithmetic operations on such embedded distributions. This is a non-trivial problem, because if you think about it for a second: suppose you have two distributions, let's say very simple ones, both Gaussian, and you want to compute the distribution of the sum. More precisely, we have two random variables that both have a Gaussian distribution, and you want to know the distribution of the sum of these random variables. In this case it's doable: it's the convolution of the two densities, assuming we have densities. If you have a more general function, say a product, you can still do something, but if it's something more general it gets very complicated, and people essentially resort to sampling. This gives us a nice approach that in some sense generalizes sampling, or provides some kind of smooth sampling, for this kind of problem. But I'm going to skip over this and instead get to the main part of the talk, where I want to talk about causality and about what motivated me to leave the standard machine learning field, at least for part of my time.

I didn't talk about applications of these methods so far, but there are lots of applications; they are used everywhere, your cell phone uses them, your camera uses them, and lots of applications were developed in this lab here, of course. We also have a relatively good understanding of how and why they work, as long as nobody intervenes in the underlying system that generates the data. Unfortunately, this assumption is often unrealistic, and then things can break, sometimes in embarrassing ways. I want to illustrate this with another shopping example. Here we have a person who is shopping for a laptop rucksack, and then the website recommends that along with the rucksack the person should buy a laptop. This sounds funny, it feels somehow wrong, and if we think about why it feels wrong, then intuitively we would think that buying the laptop can be a cause for wishing to buy a rucksack; so buying the rucksack is not a cause but an effect of already having the laptop, or of having decided to buy a laptop at the same time. Both the cause and the effect contain information about each other, of course, but in some sense for different reasons: the cause contains information about the effect because it controls the effect, and the effect contains information about the cause because it carries some kind of footprint of the cause. If you measure this in terms of information theory, using mutual information, then mutual information is a symmetric concept, so you wouldn't be able to distinguish between these two dependencies. If we only look at that, which in some sense is what machine learning is doing, then this directionality is lost. There is some kind of causal process going on in the real world which has generated the statistical dependence that we observe, but as soon as we only look at the statistical dependence, using machine learning for instance, we have lost the underlying reason for the dependence. In other words, we could say that what we have done is imitate the exterior of a process; this is a quotation taken out of context, and I'll tell you where it's from later on.
So what we've done is imitate the superficial exterior of a process without having any understanding of the underlying substance; that's what we do if we do standard machine learning. Let's look at another example that also tries to illustrate this difference between statistical dependence and causality, a well-known example about storks and babies. If you look at the frequency of storks in European countries and the human birth rate, you find a strong correlation between the two. Of course everybody knows that storks don't bring the babies, so we wouldn't infer causality from this, but there are much more subtle examples where it's essentially the same situation and we do think we can infer causality, so this is quite a tricky point. The main point is that we would not try to increase the birth rate by increasing the number of storks; nobody would believe that this kind of intervention would lead to the desired outcome. The statistical model simply doesn't make predictions for this kind of intervention, and in fact the intervention in some sense also violates the typical setting, what we call the i.i.d. setting of independent and identically distributed data. So it's tricky, and as a summary so far we could say that causal links seem to generate statistical dependence, but the converse is a bit more complicated; they are related, but they're not the same.

Hans Reichenbach, a very interesting physicist and philosopher, had a profound insight into the link between causality and statistics, and I think even today many methods are still essentially based on it. What he postulated was that if we have two observables that are statistically dependent, so think of the storks and the babies, then there must be a third observable that causally influences both of them. This third observable, I should say, could coincide with one of the two, so there could be a direct causal link, but the generic case would be that a third quantity has a causal effect on both of them. For instance, storks and babies are related, and maybe there's a confounder, say the development of the country, or the space that you have, or various economic factors, and this confounder has a causal effect on both of them and generates the dependence between the observed quantities. Moreover, Reichenbach postulated that this third observable screens x and y off from each other, in the sense that if we condition on the third one, then x and y become independent.

That's nice, and it leads to, or here is, a model of causality which can be shown to satisfy Reichenbach's principle; it's almost equivalent, but there are some details that I won't go into. In this model we assume we have a set of observables connected by arrows, and these arrows represent direct causal links; that's where the causal semantics comes in. Each observable, so let's think of this green one, is determined by its parents in the graph and by a variable U, where U stands for unexplained; sometimes this is also called the noise. So each variable is a function of its parents and a noise, which is not shown in this picture. The noises are random variables, and these noises are jointly independent; that's the assumption, and everything is deterministic apart from these noises. I should also say this is a directed acyclic graph. From the fact that these noises are random variables, we can then, in some canonical ordering, compute the distributions of all these quantities; they all become random variables, and we get a joint distribution over all these variables.
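A toy functional causal model in the sense just described can be simulated in a few lines. This is my own sketch with made-up coefficients, not an example from the talk: three observables, each a deterministic function of its parents plus an independent, unexplained noise term, arranged in the confounded graph X <- Z -> Y. As Reichenbach's principle suggests, X and Y are dependent but become (nearly) independent once we condition on the common cause Z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

U_z, U_x, U_y = rng.normal(size=(3, n))   # jointly independent noise variables
Z = U_z                                   # Z has no parents
X = 2.0 * Z + U_x                         # X := f_X(Z, U_x)
Y = -1.5 * Z + U_y                        # Y := f_Y(Z, U_y)

print("corr(X, Y)            =", np.corrcoef(X, Y)[0, 1])    # far from zero
# Partial correlation of X and Y given Z: correlate the residuals after
# regressing each of X and Y on Z (exact for this linear-Gaussian toy example).
rx = X - np.polyval(np.polyfit(Z, X, 1), Z)
ry = Y - np.polyval(np.polyfit(Z, Y, 1), Z)
print("corr(X, Y | Z) approx =", np.corrcoef(rx, ry)[0, 1])  # near zero
```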
It's an interesting joint distribution because it has a conditional independence structure specified by what's called the causal Markov condition, which I have written here. The Markov condition in one form, the so-called local Markov condition, says that each variable, think of this variable, is independent of its non-descendants over here when conditioned on its parents. So all information exchange between a variable and its non-descendants has to go via the parents. This leads to something which is also called a graphical model, and there are of course great experts on graphical models in this lab here. So this is a graphical model, and one could also say that every graphical model with respect to a directed acyclic graph can be written in this way, but not uniquely: a functional causal model contains a little more specific information than the graphical model. In a nutshell, without going into detail, the difference is that such a functional causal model also lets us make statements about what are called counterfactuals, but I'm not going into this now.

OK, so the central question of causal inference is now the following. Suppose we observe a lot of data, maybe so much data that we can estimate the full distribution, so let's assume we have the distribution describing all these random variables. Can we recover the graph from this distribution? In other words, can we infer the causal links from purely observational data? That's a tricky question; a number of people have worked on it, people like Spirtes and Glymour, and there is a beautiful theory, or a beautiful answer, but not really a complete answer. Remember, before I told you that the graph structure implies certain conditional independence properties of the distribution. Now we could say: how about we just test for these conditional independence properties and then infer the graph structure from them? There are several problems with that. The first is that the implication actually goes the other way around: the Markov condition says that certain graph properties imply certain conditional independencies, but what we actually want is to observe conditional independence properties and infer graph properties from them. People have argued that this is almost an equivalence, or that it would be very coincidental for it not to be; this assumption is called faithfulness. Maybe that's reasonable, one can debate about that, but it's probably hard to justify for finite data sets, which is an issue people haven't thought so much about in causality. The next problem is that if the functions sitting on the nodes, I told you that on each node we have a function, if these functions are complex, then conditional independence testing based on finite samples can become arbitrarily hard. So we shouldn't just assume that we can do it; it's a problem that's as hard as the fundamental problem of generalization in machine learning, where theoreticians have worked for a long time. In causality we have been neglecting this, but it's an issue we shouldn't neglect. And finally, and this is the main content of my talk today, if we have only two variables, then there are no conditional independencies; conditional independence always connects three quantities, so there's nothing we can observe, nothing we can base our inference on.
So how can we decide about the causal direction for two variables only? This is the problem of, given two variables, deciding which is the cause and which is the effect, and of course a lot of people have thought about this, starting again with the philosophers. Nietzsche, for instance, said that the error of mistaking cause for consequence is reason's intrinsic form of corruption. Very briefly, I'll tell you two ideas for how to deal with this.

The first idea is that cause and mechanism should be, in some sense, independent. One can formalize this mathematically, but I'll rather show you a picture to illustrate the idea. This object stands in an outdoor art museum, and when we see this object, or when we see any object, our brain makes the assumption that the object and the mechanism by which its light enters our brain are somehow independent; people sometimes call this the generic viewpoint assumption. If we violate this assumption, and here we can violate it by looking through this hole, then we actually see an object that isn't there: we see this chair, so our perception has been misled. There's another nice example here, from the National Gallery: a painting by Hans Holbein called The Ambassadors. There is this strange object down here which, as you look at it from far to the side, turns out to be a skull. This was of course done to impress people with the skill of illusionistic painting, but I think it's also a nice example of this principle of independence of input and mechanism.

So let's assume that in the generic case this principle holds true. Then it turns out it tells us something about the causal direction, and the idea is the following. Assume that the input distribution and the mechanism are independent; in general we think of the mechanism as a conditional of output given input, but let's do it even simpler and assume it's an invertible function, a deterministic relationship. If that's true, then it turns out that the output distribution carries a footprint of the mechanism: in this simple case, for instance, where the function is flat, the output density will be large. This is just an intuitive picture, but one can formalize it in various ways. One way is what we've done: we postulate that the logarithm of the derivative of this function and the value of the input density, assuming a density exists, both viewed as random variables on the unit interval with respect to Lebesgue measure, have zero covariance. From this postulate we can actually prove that in the backward direction, unless the function is trivial, the identity, we get a strictly positive covariance between the two backward random variables. That's nice: it's an asymmetry, and the asymmetry can be rephrased in various ways that I won't go into detail on; it can be connected to information geometry.
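One way the covariance postulate just mentioned can be written down is the following. This is a sketch of the formalization in my own notation, under the assumptions of a deterministic, invertible, increasing mechanism f on the unit interval, input density p_X, and output density p_Y; it is not a transcription of the slide.

```latex
% log f' and p_X are viewed as random variables on [0,1] w.r.t. Lebesgue measure.
% Postulate (independence of input distribution and mechanism):
\operatorname{Cov}\!\left(\log f',\, p_X\right)
  = \int_0^1 \log f'(x)\, p_X(x)\, dx
  - \int_0^1 \log f'(x)\, dx \cdot \int_0^1 p_X(x)\, dx
  = 0 .
% Claimed asymmetry: in the backward direction, with g = f^{-1} and output
% density p_Y, the analogous covariance is strictly positive unless f is
% trivial (the identity):
\operatorname{Cov}\!\left(\log g',\, p_Y\right) > 0 .
```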
Rather than that, I'll briefly tell you about our second approach, which is based on making an assumption about how the noise influences these functions. Remember, I told you that in this structural equation model, or functional causal model, each node is a function of its parents and a noise. Let's take the simplest case, just two variables x and y, and let's assume X causes Y, so Y is a function of X and an independent noise term. X also has a noise term but no other parents, so let's just identify X with its noise term and say that X and U are independent. If we have such a model, it still turns out to be an extremely rich model: for instance, if the noise were discrete, switching between d different values, we could use it to randomly switch between d different mechanisms that transform input into output, and this d could be very large; there is no restriction, and the domain of the noise could even be continuous. So it's somehow clear that even with a large data set, if we don't have restrictions on these functions, and in particular on how the noise acts on them, there's no way we can learn this kind of model from data.

One restriction that we have studied is to require additive noise. You can think of it as a first-order Taylor expansion, where we say the mechanism first processes X with a function f and then adds the noise. It turns out that if we make this assumption, then again we get an asymmetry between cause and effect. If the forward model has additive, independent noise, and if we additionally, hypothetically, assume that there is a backward model of the same form explaining the same observed joint distribution of x and y, then it turns out that these quantities, the distributions involved and the nonlinearity of the function, have to be matched to each other in a very specific way, which is highly unlikely to occur in reality. There is only a three-dimensional solution space for which the model can be fitted in both directions, and in the generic case this kind of model turns out to be identifiable. I'm going to skip over the details.
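The additive-noise idea just described can be sketched as follows. This is my own toy illustration, not the authors' code: generate Y = f(X) + N with N independent of X, fit a regression in both directions, and compare how independent the residuals are from the regression input. For brevity the dependence is scored here with a crude correlation-based proxy; a real implementation would use a proper independence test such as HSIC, and a more flexible regressor than a polynomial.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-2, 2, size=n)
N = 0.3 * rng.normal(size=n)          # noise independent of X
Y = X ** 3 + X + N                    # nonlinear mechanism f plus additive noise

def fit_residuals(a, b, degree=7):
    """Regress b on a with a polynomial and return the residuals."""
    coeffs = np.polyfit(a, b, degree)
    return b - np.polyval(coeffs, a)

def dependence_score(a, r):
    """Crude dependence proxy between regression input a and residuals r."""
    a, r = (a - a.mean()) / a.std(), (r - r.mean()) / r.std()
    return abs(np.corrcoef(a, r)[0, 1]) + abs(np.corrcoef(a ** 2, r ** 2)[0, 1])

forward = dependence_score(X, fit_residuals(X, Y))   # residuals ~ independent of X
backward = dependence_score(Y, fit_residuals(Y, X))  # residuals depend on Y
print(f"forward score  (X -> Y): {forward:.3f}")
print(f"backward score (Y -> X): {backward:.3f}")
print("inferred direction:", "X -> Y" if forward < backward else "Y -> X")
```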
Before, I was telling you that we can do conditional independence testing to learn something about the graph structure. Typically the answer is that if we do the conditional independence tests, and if we assume faithfulness, we can infer an equivalence class of graphs that are consistent with the conditional independencies that were observed; normally not a single graph, but at least a set of graphs. One intuitive way to think about it is that we have injected independent noise everywhere, remember each node had a noise variable and the noises are jointly independent, and then we let the noises spread through the graph and measure in different places what arrives where and how these different pieces of information relate to each other. That's one way to think about conditional independencies. The new ideas, or potentially new ideas, that I've told you about are that rather than just checking independence of noises, we can also think about independence of noises and functions; I told you about this idea of independence of mechanism and input. As these noises spread through the graph, they pick up information not just about the graph topology but also about the functions by which they were transformed as they passed through the nodes, and I think that's something interesting to think about further. I also talked about restricting the function classes. We have also thought about how to combine these aspects, which I'm not going to talk about in detail, and we have thought about an approach that uses algorithmic complexity, Kolmogorov complexity, instead of Shannon information and probability theory, which leads to something like graphical models in terms of Kolmogorov complexity, where on the nodes we don't have functions processing probability distributions but instead programs processing bit strings, and independence of bit strings is measured by whether we can gain by compressing them jointly, and so on. One can develop something like graphical models, or causal graphical models, in this setting as well.

But rather than talking about this, I want to briefly tell you how causal knowledge can help machine learning, because my long-term goal is really to combine causality and machine learning. I'll give two examples: one, very briefly, about semi-supervised learning and distribution shift, and the other from the field of exoplanet detection. It turns out that in some machine learning problems we are learning to predict the effect from the cause, and in others it's the other way around, we predict the cause from the effect; actually the latter seems to be more common. We could call these problems causal and anticausal. For instance, digit recognition is an anticausal problem, because we decide in our heads on the class label and then we produce the digit, so the digit is the effect and the class label is the cause; the direction of prediction is the opposite of the direction of causality. In some other problems they coincide. Now it turns out, if you think about these two settings, that this actually makes a difference for practical machine learning problems. In general, whenever we want to estimate a mapping from X to Y, we have to estimate properties of the conditional of Y given X. Now remember the causal assumption I made before, that the distribution of the cause and the mechanism should be independent in some sense, which one can formalize in various ways. In semi-supervised learning, I think you might know this, but for those who haven't heard about it, the idea is that we want to improve our estimate of this conditional by having more data from P of X. Covariate shift, on the other hand, assumes that P of X changes between training and test, and P of Y given X might or might not be robust to this. Now let's think about these two settings. If we have a causal learning problem, then prediction and causality are aligned: X is the cause, Y is the effect, and we assume that P of X and P of Y given X are independent. This independence means that semi-supervised learning should be impossible, because in semi-supervised learning we use additional information about P of X in order to improve our estimate of P of Y given X, and if they are independent that's pointless. On the other hand, covariate shift should be easy, because our causal assumption says precisely that P of Y given X is not affected by P of X, so if P of X changes, P of Y given X should be invariant under this change. That's nice. It turns out that in anticausal learning it's exactly the other way around: the independence holds in the causal direction, which, as I mentioned before, generically implies a dependence between P of X and P of Y given X, so semi-supervised learning might be possible, but covariate shift should be harder. I don't show the data here, but we've done a meta-analysis, looking at benchmarks of other people and categorizing their data sets into causal and anticausal; usually people in machine learning are not interested in that distinction. If we do that and look at the benchmark results, they are very much consistent with this prediction, which is nice. Also, the various assumptions that people have come up with to justify semi-supervised learning can all be interpreted causally, so that's also nice. OK.
Finally, the last application, which is from exoplanet detection in astronomy. That's a problem we've been very much interested in lately, and the setting is the following. There is a space telescope which was launched in 2009; this is where we are in the Milky Way, and the telescope was launched in order to observe a tiny fraction of our Milky Way, staring in this direction all the time, in search of exoplanets. If you look at the search field, this is the Milky Way, this is the constellation Cygnus, and the telescope was looking at this patch here with a set of CCDs, charge-coupled device imaging sensors, for about four years. In this field it was taking pictures, a series of 30-minute exposures. Each picture looks like this; you can see the stars, and on each picture I think there are about half a million stars. Out of these, the astronomers declared 150,000 to be interesting and extracted time series, because we have this series of exposures, so for each of these 150,000 stars we get a time series. The telescope is called Kepler, after the astronomer.

Now we have this big data set, and what we're looking for are transits. What's a transit? Imagine there's a star somewhere out there, the star has a planet, and we happen to look at the system from the side, so every so often the planet passes in front of the star. If you think about what kind of signal this would produce, if you looked at our system of Earth and Sun from space, you can work out that it would be an 84 parts per million signal, just from the size of the Earth: the diameter of the Earth is roughly a hundredth of the Sun's, so it's ten to the minus four roughly. The signal would last about half a day, and you would have to be lucky to see it, because purely geometrically it would be visible from only about half a percent of all directions in the Milky Way. So this is the kind of signal we're looking for. It's a relatively faint signal, which is why it's better to do this from space; if you look from the ground you have all these atmospheric disturbances. But even from space, it turns out that the errors caused by the spacecraft and by the stars, and the stars are actually more variable than we used to think, lead to changes that are sometimes much bigger than the signal we're looking for. Nevertheless, this space telescope has found many planets, some of them possibly habitable in the sense that they allow liquid water, they are at the right distance from their stars, but those are all around stars much cooler and smaller than the Sun, because that means the habitable zone is closer, the period is much shorter, and then we see more of these transit events, which gives us better statistical power.
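As a quick sanity check of the transit numbers quoted above (my own arithmetic with rounded textbook radii, not figures from the talk): an Earth-sized planet crossing a Sun-like star blocks a fraction of the light equal to the squared ratio of the radii.

```python
# Transit depth for an Earth analogue in front of a Sun-like star.
R_earth_km = 6_371.0
R_sun_km = 696_000.0

depth = (R_earth_km / R_sun_km) ** 2
print(f"transit depth ~ {depth:.2e}  (~{depth * 1e6:.0f} parts per million)")
# ~8.4e-05, i.e. roughly 84 ppm -- on the order of 10^-4, as stated in the talk.
```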
Each star projects light onto a number of pixels, and if we look at these pixels, these for instance all belong to the same star, and these are the light curves of the individual pixels, unfortunately all these curves look quite different, and it looks quite hopeless. This is due to various errors, and the biggest one is probably telescope pointing: the pointing is exquisitely accurate, but nevertheless small changes of a fraction of a pixel, together with the fact that the pixel sensitivities differ, lead to these different behaviours of the pixels; this is scaled up, of course. The astronomers have developed a data processing pipeline to deal with this kind of systematic error, something they call PDC, presearch data conditioning, and when they put these things together and process them, they get light curves like this, and then you can see an exoplanet's transit here, this dip. Actually it's not a little one, this is a big one, because this star actually has three exoplanets; one of them has a period of eight days, and this shows about 30 days, so we should also see some small dips that we don't see here. Anyway, this is the kind of data we're looking at, and it's a difficult problem.

We thought about this from the causal point of view, and I want to motivate the method with this little illustration done by my wife. The idea is that we try to remove these systematic errors, and we do it in the same way as in the following problem. Suppose you see this picture of five siblings. There are some statistics, this probably doesn't apply to England, but the statistics say that up to 20 percent of all children have an unexpected father. In this case it might be easy to tell which one, but there is a slightly harder problem, and this harder problem is: what does the milkman of this family look like? In our astronomy application, what we want to do is essentially to reconstruct the milkman from the picture of many siblings, without knowing the mother.

Here I have a little graph of what we want to do. This is the observed star, this is what we want to infer about the star, and this is the noise, so unfortunately the mother plays the role of the noise here, and we see many other siblings over here. The idea is, we have to assume, actually, that the milkman and the other siblings are not related, and then we try to predict properties of the one child, or of the observed star, from the other stars. Whatever can be explained that way must be due to the fact that both are affected by the same noise process, since the true star is causally independent of the other stars, and therefore also of the measurements of the other stars; whatever we can explain in terms of them has nothing to do with the true star. So the idea is that we perform a regression of Y on X and then somehow remove this regression from Y in order to reconstruct Q. That sounds optimistic, but it turns out that under suitable assumptions we can prove that this works, which surprised me, although in retrospect it is maybe not so surprising. The assumptions are, again, an additive-noise-type assumption; the independence of the milkman and the other children that I've already told you about; and finally we have to assume that the effect of the noise can in principle be predicted from the other stars, that is, there exists some function such that the effect of the noise is that function of the other stars. We don't need to know that function, but in principle the effect of the mother should be visible in the other children. Under these assumptions we can prove that we can reconstruct Q, up to a constant offset, by simply subtracting the regression from Y. And it's nice: we can reconstruct the full variable with probability one, not just its distribution but the actual values, by doing this simple regression. We can generalize this slightly by relaxing the assumption that the effect of the noise is fully predictable: if we relax it, then we can still prove that, roughly speaking, if X determines the effect of the noise up to a small variance, we get a good reconstruction in terms of a small mean squared error.
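The idea just described can be sketched in a few lines. This is my own toy illustration with made-up numbers and a plain linear regressor, not the system used on the Kepler data: the quantity of interest Q is corrupted by systematic noise that also drives many other, causally unrelated observations X, so regressing the corrupted signal Y on X and subtracting the prediction recovers Q approximately, up to a constant offset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_other = 2000, 50

Q = np.sin(np.linspace(0, 20, n_time))        # true signal of "our" star
N = rng.normal(size=(n_time, 3))              # shared systematics (e.g. pointing drift)
X = N @ rng.normal(size=(3, n_other))         # other stars: driven only by the systematics
Y = Q + N @ np.array([1.0, -0.5, 2.0])        # observed star: true signal + systematics

# Regress Y on the other stars (ordinary least squares) and subtract the fit.
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
Q_hat = Y - X @ coef

offset = (Q - Q_hat).mean()
rmse = np.sqrt(np.mean((Q - Q_hat - offset) ** 2))
print(f"RMS reconstruction error (up to a constant offset): {rmse:.3f}")
# Small compared to the signal amplitude of 1: Q is recovered up to an offset.
```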
But maybe I won't give the details; instead, just one picture here. This was what the astronomers did so far, and this is now our curve after taking out these systematic errors; we think it looks nicer, and the astronomers like it. We are still working on this problem, and it's a complicated one: you need not just to normalize the curves, you also need a model of these dips, you have to do the periodic search, and so on. We have a more comprehensive system that does all of this, with which we have recently produced a list of candidate planets, and some of them have already been confirmed, so that's nice.

So, coming to the end, I promised I would come back to this citation. This quote, which was maybe a somewhat naughty description of what machine learning is doing, is actually from a philosophy website, and it's a description of what people call a cargo cult. Cargo cult is a term for what happened on some Pacific islands, described for instance by Feynman, where people started to associate certain objects and acts with cargo arriving: airports, landing strips, people waving flags and things like that were associated with the positive things that came from the skies along with American soldiers. Afterwards, when the soldiers didn't come anymore, the islanders were still waving flags and doing things like that. So in a sense they had observed a correlation, they tried the intervention of waving flags, and it didn't work; it illustrates the difference between machine learning and a causal model. Now of course, if you apply machine learning in the right way, it would still be valid: if you had a generative model that produces lots of Pacific islands and tells you on which islands there are people waving flags and landing strips and so on, then you would probably be able to predict, at least if you sample from the right time, so that distribution shift is not allowed and we stay i.i.d., whether there is cargo or not. So if you apply machine learning in the right way, and we have become very good at generating i.i.d. data, big companies are paying people to label data, and I bet Microsoft is also doing this in order to train machine learning systems, then of course it's extremely powerful. But at the same time I think it's interesting to think about what we can do if these assumptions are violated, and I think that causality is one interesting direction to consider in this context. With that, I would like to acknowledge my co-workers and thank you for your attention.

We have a little time for questions from the audience, if people have questions about this fascinating talk.

Yes: randomized experiments can now be run at massive scale, and that has particular power for causality. How would you compare that power to the methods you were describing? Of course you can handle cases where you cannot do experiments, but if you could do experiments, would that be preferable?

I think if you can do good experiments, that will certainly be preferable to these kinds of methods, because all the methods I talked about need assumptions. In the simple case of cause-effect inference we think it works maybe 80 percent of the time, but if you can do a controlled randomized experiment, then I think you are much, much closer to the ground truth.
So in those situations I think it's great to do these experiments, and I agree that nowadays we have the means to do this in many domains. But as you also say, there are many domains, in neuroscience and elsewhere, where we can't do this kind of experiment. So from the practical point of view, if you can do them, it's better; from the purely scientific point of view, there are interesting inferences without them, and it's very interesting to think about, fundamentally, how far we can get without these experiments. I think it's a hard problem; it's also a hard problem for humans to perform causal inference if we cannot intervene in a system.

You mentioned the storks and the babies; the idea there is that there must be some other factors we are not observing that influence both the storks and the babies. If you perform inference on some of these graphs, and your goal is to find causality between the nodes that you observe, then in some problems there will be other factors that you're not observing. Are there methods out there to find, or at least detect, the existence of such factors, and to determine, say, how many of them there might be, or which nodes they might be connecting, and so on?

I think the answer is that there are some, but it's very much an open problem from my point of view; I think it's one of the most interesting problems of causality. I haven't talked about it today, how to detect whether there are confounders, and maybe how to bound the effect of confounders if you're willing to make certain assumptions about how the confounding works, but I could send you some references if you're interested. I think it's a very interesting problem, for sure.

Could one maybe use these methods to figure out which experiments one should do, in this hybrid space, bridging between the two? I think that would be interesting, and I think some people have thought about this kind of problem; I haven't studied it myself, but I think that's also an interesting direction. So, for those in the back who may not have understood, the question was how you could combine the possibility of doing some experiments, which might be expensive, with these kinds of methods, which might suggest directions to explore, or suggest what the causal directions might be, and so on. Yes, I think it would certainly be useful to combine them.

Do we have further questions? Well, I heard Bernhard lecturing recently at the Royal Society, where he won the Milner Award, which is the top European award in computer science and has only been given three times. The lecture was followed by refreshments, and so I infer from that that a lecture by Bernhard causes refreshments to be served. That is the case today too, so let's thank him and then go enjoy them.
Info
Channel: Microsoft Research
Views: 7,023
Keywords: microsoft research, machine learning, deep neural networks
Id: 9IT3nUUL5WI
Length: 57min 33sec (3453 seconds)
Published: Mon Jun 27 2016