Bernhard Schölkopf: Learning Causal Mechanisms (ICLR invited talk)

Captions
So it's my great pleasure to introduce Bernhard Schölkopf, and it's really easy to introduce him because he's famous for so many things. He's a director at the Max Planck Institute for Intelligent Systems, and he's probably most famous for developing support vector machines and, more generally, leading the kernel revolution of the early 2000s. For those of you who are perhaps so new to the field that you might not even know: kernels were sort of the last big wave of excitement before deep learning. Recently Bernhard has started doing more foundational work on causality, which I'm excited to hear about. So please take it away, Bernhard.

Thank you very much for the introduction, and thank you very much for the invitation; it's great to be here, and thanks for coming to my talk. I'll talk about work done at the Max Planck Institute for Intelligent Systems; I should add that I'm currently at Amazon, so it was actually a very easy drive by car up from Seattle, making this one of my first conference talks without jet lag. I'll be talking about causality.

Dependence versus causation is a big historical issue in the philosophy of science and in science in general, and I'll start with an example: storks and human birth rates. There is a strong correlation between the number of storks, in this case living in Germany, and the human birth rate, so given the number of storks we can predict the human birth rate with relatively high accuracy, in this case even using a linear model: roughly speaking, for every pair of brooding storks we get about 500 new babies per year. So if Angela Merkel calls Donald Trump, she could tell him that number, and Trump could potentially use it for prediction in the US. Now, how should Angela communicate the model to Donald in a way that he will be able to understand? It turns out that although this is maybe the simplest possible mathematical model imaginable, it's non-trivial to communicate. We can't simply say that each pair of storks brings 500 babies per year: the model is only valid in a certain iid setting, which I hope will become clear during the talk, and chances are he won't understand this iid setting. Maybe he isn't interested in prediction anyway; maybe he's more interested in intervention. Suppose he wants to have more babies in the US: should he change the number of storks or not?

This of course leads us to the relationship between dependence and causality, something that was maybe first understood by the philosopher and physicist Hans Reichenbach. He postulated what he called the common cause principle: if you find two observables X and Y to be statistically dependent, then there exists another variable, call it Z, that causally influences both of them and explains all the dependence between X and Y, in the sense that X and Y become independent when we condition on Z. As a special case, this variable can coincide with X or Y, in which case we get these graphs, but in the generic case we would have this kind of graph. Now, if the storks bring the babies, the green graph is the correct one; if the babies attract storks, it's this graph; and if some other variable, such as economic development, causes both of them, then the red one is correct. The crucial insight is that without making any additional assumptions, we cannot distinguish these three cases based on observational data.
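As a minimal sketch of the common cause principle (illustrative Python, with made-up coefficients, not code from the talk): a confounder Z drives both X and Y, making them strongly correlated, while partialling Z out, a simple stand-in for conditioning on Z, removes the dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Common cause Z (think: economic development) drives both X and Y.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)    # "storks"
y = -1.5 * z + rng.normal(size=n)   # "birth rate"

print(np.corrcoef(x, y)[0, 1])      # strongly negative: X and Y dependent

# Partial correlation given Z: regress Z out of both, correlate residuals.
rx = x - np.polyval(np.polyfit(z, x, 1), z)
ry = y - np.polyval(np.polyfit(z, y, 1), z)
print(np.corrcoef(rx, ry)[0, 1])    # close to 0: X and Y independent given Z
```

All three causal structures mentioned above can generate the same joint distribution over X and Y, which is exactly why a simulation like this, on its own, cannot tell them apart.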
The class of observational distributions over X and Y that we can realize with these models is the same in all three cases. So a causal model (and these are three different causal models) genuinely contains more information than a statistical one.

This connection between causality and statistics was further elaborated, refined, and built into a beautiful theory by Pearl and others; here is a picture of Judea Pearl. In this theory, in the simplest case, we assume a set of observables, or variables, that form the vertices of a directed acyclic graph, in which arrows represent direct causation. Moreover, we assume that at each node there is a function which gives us the value of that observable depending on the values of its parents in the graph (PA stands for the parents, so these are the outputs of the parent nodes) and a noise variable U, or as people call it, an unexplained variable; we often think of it as noise, but it could be something else. So we have an unexplained variable at each node, giving us a set of such variables, and to make the whole thing statistical we specify a factorizing joint distribution over them: we assume these noise variables U_i are jointly independent. Everything is deterministic apart from the noises that we feed in, and once we specify a factorizing joint distribution over these noises, we get an overall distribution over all observables, called the observational distribution.

It turns out this distribution has an interesting property, which it inherits from the topology of the graph, characterized by the causal Markov condition: whenever we condition on the parents of a node (say the green one here), that random variable becomes independent of its non-descendants. These are the descendants; everything else is a non-descendant, except the parents. So the node becomes independent of its non-descendants when conditioned on its parents, and the graph together with the joint distribution is called a graphical model. Interestingly, this property is a kind of footprint of the topology of the graph that we can pick up in the joint distribution in terms of conditional independence properties. You can think of it as follows: you feed in a noise probe at each node, independent noises, and then you test conditional properties of the form "what does this node know about that one that it doesn't know from this other one", and so on. If you ask all these questions, you learn something about the graph topology. You can ask them if you have a functional conditional independence test, which is a hard problem in its own right, but in principle it can be done. However, it only works if you have at least three nodes: with two, there is no conditional independence property to test, since it is a ternary relationship and you need an A, a B and a C. So in the two-variable case there's nothing you can do. This was a problem we focused on for quite a while, and it led us to some insights that, I think, also have implications for the case of more than two variables.
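A minimal sketch of such a functional causal model (illustrative Python; the graph, functions, and coefficients are invented): each variable is computed from its parents and an independent unexplained variable, and ancestral sampling yields the observational distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scm(n):
    """Ancestral sampling from a toy SCM over the DAG Z -> X, Z -> Y, X -> Y."""
    u_z, u_x, u_y = (rng.normal(size=n) for _ in range(3))  # independent U_i
    z = u_z                        # Z := f_Z(U_Z)
    x = np.tanh(z) + 0.5 * u_x     # X := f_X(Z, U_X)
    y = x**2 + z + 0.1 * u_y       # Y := f_Y(X, Z, U_Y)
    return z, x, y

z, x, y = sample_scm(10_000)       # draws from the observational distribution
```

The causal Markov condition can then be probed on such samples with conditional independence tests, which, as noted above, requires at least three variables.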
And, as I will try to convince you today, it also has implications for machine learning. The two-variable problem is of course a classic one. This is an old philosopher, Nietzsche, who discussed it in this book: the chapter on the four great errors starts with the biggest of them all, the error of mistaking cause for consequence, which he calls reason's intrinsic form of corruption. So, can we tell cause from consequence?

I want to introduce one idea that will be elaborated in this talk, using a simple example to give you some intuition: two variables, altitude and temperature. These are average annual temperature readings from a number of weather stations in Germany, and you can see that the higher it gets, the colder it gets (this axis is in degrees Celsius, this one in meters), and that most places are relatively low. Now you could ask: does temperature cause altitude, or does altitude cause temperature? As with all causal questions, it is quite subtle to really decide, and you can question anything, but I think most people would say it's reasonable to believe that altitude has a causal effect on temperature, and that temperature does not directly have a causal effect on altitude. From that point of view, it's reasonable to consider altitude as the input of our causal model and to factorize the distribution accordingly: we first choose an altitude, and then some mechanism, given the altitude, converts it into a distribution over plausible temperatures.

Now suppose you don't believe me and I have to convince you somehow. There are multiple ways I could try. The first is the most expensive: an intervention experiment. I could build a big machine to lift the city of Vancouver by one kilometer, continue measuring the temperature, and next year we would find that it's actually a little colder than this year. That could be a possible way to go, but obviously you can't always intervene; in many cases you can't, for ethical reasons, for mechanical reasons, for whatever reasons, so we really want other means of doing inference. Sometimes it's enough to do a thought experiment, a hypothetical intervention. I could tell you: let's imagine we built this apparatus to lift the city of Vancouver; since we've both taken some high school physics, maybe we will agree that if we go up, the air pressure goes down, and when the air pressure goes down there's less atmosphere around to keep the earth from cooling into space, or whatever; I don't want to speculate on the exact mechanisms, but I could probably convince you that it's plausible that the change of altitude will have a causal effect on the temperature. To convince you, I would talk about physics, about physical mechanisms that, given the altitude, plausibly imply that the temperature will change. So it's really an argument that implicitly assumes an independence between the physical mechanism and the input distribution. That's the second point: the first one is intervention, the second is a certain form of independence.
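To make the distinction between observing and intervening concrete, here is a toy sketch (illustrative Python; the SCM and its coefficients are invented, loosely using the atmospheric lapse rate of roughly 6.5 degrees per kilometer): an intervention do(A = a0) replaces the assignment for altitude while the mechanism for temperature stays fixed.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

def simulate(do_altitude=None):
    """Toy SCM: altitude A -> temperature T (coefficients are made up)."""
    if do_altitude is None:
        a = rng.uniform(0, 3000, size=n)         # observational p(a)
    else:
        a = np.full(n, float(do_altitude))       # do(A = a0) replaces f_A
    t = 15.0 - 0.0065 * a + rng.normal(0, 1, n)  # mechanism p(t|a) unchanged
    return a, t

a_obs, t_obs = simulate()                        # observational distribution
a_int, t_int = simulate(do_altitude=1000.0)      # interventional distribution
print(t_obs.mean(), t_int.mean())                # the intervention shifts T
```

Only the module for the cause was replaced; the conditional for the effect did not have to be touched, which is precisely the modularity the talk is about.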
And this independence is also related to a notion of invariance, which can be used as follows. Suppose I can get another data set; say I have two data sets from somewhat comparable countries, Austria and Switzerland. The data sets look a little different, but I find that I can fit both using the same conditional, an invariant conditional. That would also plausibly argue that this conditional, since it applies both in Austria and in Switzerland, is a property of physics, whereas the other factor, the distribution of altitudes, is maybe culturally contingent; maybe the Austrians like to build their cities higher than the Swiss, or something like that.

So these are three different ideas, and the underlying notion of independence can also be illustrated with this example, the so-called Beuchet chair. Whenever we see an object, our brain assumes that the object and the mechanism by which the light reaches our eye are independent. We can violate this assumption by looking at the object from one very specific vantage point, in this case through a little hole, and then perception fails: we see a 3-D structure which in reality isn't there. The independence assumption our visual system makes is a good one, because most of the time it holds: our brain assumes that objects are independent of the vantage point and of the illumination, and this rules out all sorts of accidental coincidences, 3-D structures that line up in 2-D and so on; this is called the generic viewpoint assumption in vision. Of course, if I move around an object, my vantage point changes, but I assume that the object itself is unaffected by this change. That's an invariance implied by the independence, and it allows me to do what in vision is called structure from motion: I can close one eye, move my head around an object, and still get 3-D information, and that only works because I'm assuming the object doesn't change.

So let's assume we have a causal generative process composed of autonomous modules that do not inform or influence each other. Every causal graph implies a corresponding factorization of the joint distribution, and I've written down such a factorization here in terms of the parents in the graph, the causal parents. This is the factorization that I assume decomposes into independent conditionals, and those we will call the causal mechanisms. I will generally assume, or argue, that every change in the real-world distributions we see out there has to come from a change in causal conditionals, or in other words in structural assignments (or structural equations, whichever terminology you like) and noise variables. Something of this kind always has to change, and in the generic case not all of them have to change: changing one of them does not automatically imply a change of another, since they are independent objects. This locality of change does not hold if we factorize according to some other graph: if we factorize according to a graph which does not capture the correct causal structure, then in the generic case changing one factor has to be counterbalanced by changing other factors at the same time.
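A toy numerical illustration of this locality of change (illustrative Python with invented coefficients): when only the cause distribution p(a) differs between two environments, the fitted causal conditional p(t|a) is invariant, while the anticausal factorization has to adjust.

```python
import numpy as np

rng = np.random.default_rng(3)

def environment(mean_a, std_a, n=100_000):
    a = rng.normal(mean_a, std_a, n)        # cause distribution p(a) varies
    t = -0.5 * a + rng.normal(0.0, 0.3, n)  # mechanism p(t|a) stays fixed
    return a, t

def linfit(u, v):
    """Least-squares fit v ~ slope * u + intercept."""
    slope, intercept = np.polyfit(u, v, 1)
    return round(float(slope), 3), round(float(intercept), 3)

for mean_a, std_a in [(0.0, 1.0), (2.0, 0.5)]:        # two environments
    a, t = environment(mean_a, std_a)
    print("causal     fit of p(t|a):", linfit(a, t))  # stays near (-0.5, 0.0)
    print("anticausal fit of p(a|t):", linfit(t, a))  # changes across envs
```

Both parameters of the causal fit stay put, while the backward regression absorbs the shift in p(a): with the wrong factorization, one change must be compensated elsewhere.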
Here's a kind of special case of this, something that Yoshua Bengio said in a talk last year, which I found intriguing; he called it the abstraction challenge for unsupervised learning. The question he posed is: why is it so much harder to directly model the density of acoustics, the acoustic signal, than to separately model the conditional of acoustics given phonemes times the density of phonemes? One explanation, from the point of view of causality, is that the second formulation is a decomposition into causal conditionals: no matter which phonemes I want to produce, I will use approximately the same conditional. It is an invariant property, the same across many different tasks or data sets, and in that sense it should be easier to learn.

Now, I haven't yet told you what I really mean by independence of mechanisms, and of course this has to be formalized. I'll give you two formalizations: a statistical one, and one in terms of algorithmic complexity. The first looks at the special case of only two variables and no noise. In this case the conditional is just a function, and this function should be independent of the input distribution, which is shown down here. If this independence holds (still informally), we would expect to see a dependence between the function and the output distribution: wherever the function is flat, the output density tends to have large values. Slightly more formally, we postulate that the covariance between the input density and the derivative of the function (in fact the log of the derivative, for some technical reasons) is zero. You can see that here, in this case, there is a positive correlation between the density value and the derivative of the inverse function; so we assume that in the forward, causal direction, the covariance between these two quantities vanishes. Of course, for this covariance to make sense, we have to specify what we mean: we define a probability space on which the two objects are random variables, simply using the interval [0,1] with Lebesgue measure, and then we can define it formally. Moreover, we can prove that unless the function is the identity, the same independence measure in the backward direction is strictly nonzero. So independence in the forward direction implies dependence in the backward direction: an asymmetry between cause and effect, implied by assuming independence of the input and the mechanism. This is the two-variable special case of the general concept of independent mechanisms; note that the input here is just a distribution, and the preparation of the input distribution can, if you like, itself be identified with a mechanism.

This kind of reasoning leads to estimators that you can actually try on data. We have a set of benchmark data sets, and it turns out that if you apply this kind of method (and there is by now a whole set of different causal inference methods), you can do a reasonable job of telling cause from effect, which people previously thought impossible: roughly 75% correct, sometimes more.
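Here is a minimal sketch in the spirit of such estimators (a simplified, slope-based variant of the IGCI idea, written for illustration; it assumes a deterministic, invertible relation and variables roughly scaled to [0,1], and is not the authors' reference implementation): it compares the average log slope in both directions and prefers the direction with the smaller value.

```python
import numpy as np

def slope_score(x, y, eps=1e-12):
    """Average log |dy/dx| along the curve, estimated from sorted samples."""
    order = np.argsort(x)
    dx = np.diff(x[order])
    dy = np.diff(y[order])
    keep = (np.abs(dx) > eps) & (np.abs(dy) > eps)
    return float(np.mean(np.log(np.abs(dy[keep] / dx[keep]))))

def infer_direction(x, y):
    # Smaller score <=> hypothesized causal direction (IGCI-style rule).
    return "x -> y" if slope_score(x, y) < slope_score(y, x) else "y -> x"

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 5000)        # the cause, with a "generic" input density
y = np.tanh(3 * x)                 # deterministic, invertible mechanism
print(infer_direction(x, y))       # expected output: "x -> y"
```

On noisy real-world pairs, such as the altitude and temperature data, this bare version is of course far too naive; the benchmark numbers quoted in the talk come from much more careful methods.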
So far I've talked about links between causal structure and statistical structure, but the causal structure is actually the fundamental one, because it captures the physical mechanisms that generate statistical dependence in the first place. The statistical structure, from my point of view, is just an epiphenomenon: you have a causal model with some unexplained variables, you make those unexplained variables random, and this implies a distribution with certain statistical properties. But it doesn't have to be like that: you can instead make the unexplained variables random bit strings, which is what we've done in this case. Here we have a DAG with a computer program u_j sitting at each node; the programs run on a Turing machine T, they take as inputs the outputs of the parents, and we assume that all the programs are jointly independent in the sense of Kolmogorov complexity, meaning that a joint compression of these bit strings does not save space compared to individual compressions. If we are willing to make these assumptions, we can prove that this model (which is like a structural equation model, a functional causal model) implies a causal Markov condition, just as in the statistical case. That's interesting, because you get something like a theory of graphical models, analogous to the statistical one, but in terms of Kolmogorov complexity. So this is an interesting alternative formalization of independence.
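Kolmogorov complexity is uncomputable, but the flavor of "joint compression saves no space" can be illustrated with an off-the-shelf compressor (a crude toy in the spirit of compression-based dependence measures such as the normalized compression distance; not a method from the talk):

```python
import os
import zlib

def clen(b: bytes) -> int:
    """Compressed length: a rough computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(b, 9))

a = os.urandom(20_000)   # two (algorithmically) independent random strings
b = os.urandom(20_000)

# Independent strings: compressing them jointly saves essentially nothing.
print(clen(a) + clen(b) - clen(a + b))   # close to 0
# Fully dependent strings (a followed by a copy of itself): joint compression
# saves almost an entire string's worth, revealing the dependence.
print(clen(a) + clen(a) - clen(a + a))   # close to clen(a)
```

Note that zlib only finds repetitions within its 32 KB window, so the strings are kept short; this is a heuristic probe, not a substitute for the algorithmic theory.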
This formalization also has, I think, some interesting implications for physics. To see this, consider the following experiment: a beam of particles comes in, and this is our cause, if you want; we prepare this cause distribution, this cause state, and then the cause is transformed by a conditional, it gets scattered at an object, so this is the mechanism; the outgoing beam, the arrows going out in different directions, is the effect. The outgoing beam obviously contains information about the object; that's what makes vision possible, what makes photography possible: photons contain information about the object at which they were scattered. Now, we all know that microscopically, from the point of view of physics, the time evolution is reversible, so things should be symmetric microscopically. Nevertheless, the photons contain information about the object only after the scattering, not before. Why is that, and why do photographs show the past rather than the future? It is due to our independence assumption at the beginning: we put it in, and we automatically get out the thermodynamic arrow of time. In this case the independence assumption reads as follows: we have some initial state of a physical system and a mechanism, which is just the system dynamics applied for some fixed time, an invertible mapping, and we assume the initial state and the mechanism are algorithmically independent, in the sense that their algorithmic mutual information vanishes: knowing S does not enable a shorter description of M, and vice versa. If we assume this, we can actually prove that Kolmogorov complexity is non-decreasing under the dynamics: applying the mechanism to the state yields a Kolmogorov complexity lower-bounded by the initial one, plus a constant (a detail: Kolmogorov complexity is only ever defined up to an additive constant). So this quantity cannot decrease, it can only increase, and in a way that is the second law of thermodynamics: it gives us the thermodynamic arrow of time, provided you are willing to accept Kolmogorov complexity as a sensible measure of entropy, which quite a few physicists working on fundamental problems have put forward. I think that's an interesting connection, but now I want to come back to statistical notions of causal structure and talk about the impact on machine learning.

This is a topic that has gained some traction in recent years, and I'll try to discuss two or three of the applications listed here. All of them, I think, are related to the factorization of the joint distribution into independent conditionals, or independent mechanisms. Just to remind you: if we factorize according to the underlying causal graph, we expect these mechanisms to be independent; if we use the wrong graph to factorize our distribution (for each distribution, many such factorizations are possible), we would not expect these objects to be independent. We've recently observed this point in neural network models as well, and I thought that might be interesting for this community. To talk about it, let me first distinguish causal from anticausal learning problems. In supervised machine learning we have some X and try to predict Y, and we usually don't care about the underlying causal structure; but among the many possibilities, the two simplest are that X is a cause of Y, or that Y is a cause of X. The first case we call a causal learning problem, the second an anticausal learning problem. In some cases, such as VAEs or encoder/decoder models, both directions occur in the same model, and it's interesting to look for independent mechanisms in such models. To this end we can use a dependence measure; we use a specific one we developed, which looks for dependence in Fourier space. We originally developed it for time series signals and applied it with some success there, and we have now applied it to spatial convolutions, to measure the dependence between filters in successive layers of deep network models. When we do this (this is work with Michel Besserve), we observe two interesting things. First, in the causal direction, the generative direction, you do find approximate independence between filters of successive layers, which in our case means that these blue distributions are roughly centered on the number 1 (the quantity we measure equals 1 under independence). Second, in the anticausal direction, in the encoder, we always see dependence, and the dependence increases as we get closer to the low-level features: we start with relative independence, and as we move toward the lower-level features the orange curves move further away from 1. What this illustrates is that the underlying causal direction does have an influence; it makes a difference for the physics of machine learning, if you want, because the independent mechanisms assumption is a physical assumption, one that may or may not hold true for real data, and it turns out it makes a difference.
Now let's look at another example, my favorite implication of causal direction: the implication for semi-supervised learning. Whenever we want to learn a mapping from X to Y, we want to estimate properties of the conditional P(Y|X): in regression we estimate the conditional mean, and in classification we estimate where this quantity passes through 0.5. In semi-supervised learning we try to improve this estimate using additional data from P(X). Of course, this can only possibly help if there is a link, a dependence, between P(X) and P(Y|X). And this should make you suspicious, given what I told you before about the independence assumption: in the pure case of causal learning we assume from the outset that these two objects are independent, and therefore semi-supervised learning should be impossible. In the anticausal setting it's the other way around: in the backward direction, in the simple model from before, we got a strict dependence between these two objects, so there semi-supervised learning could in principle be possible. One can make a similar argument for covariate shift and transfer learning, where it's kind of the other way around, but let's stick with semi-supervised learning. So we predict that in the causal direction semi-supervised learning should be impossible, and this is what we very nicely found, without even having to run experiments: we took benchmarks that other people had run and labeled their data sets as causal, anticausal, or confounded; each such label tells us whether we would expect unlabeled data to help or not, and our meta-analysis of these benchmarks beautifully confirmed the prediction that in the causal direction semi-supervised learning doesn't help. It's also instructive to look at the different assumptions people have proposed to justify semi-supervised learning: they can all be viewed as linking P(X) and the conditional P(Y|X); they all need some kind of link between these two objects, one example being the cluster assumption.
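A hedged toy illustration of this asymmetry (entirely invented, not one of the benchmarks from the meta-analysis): in the anticausal direction the class generates the features, so the marginal p(x) is a mixture whose cluster structure is informative about p(y|x); in the causal direction the class boundary leaves no trace in p(x).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n = 2000

# Anticausal: Y -> X. The marginal p(x) is a two-component mixture whose
# structure reveals where the classes sit.
y_anti = rng.integers(0, 2, n)
x_anti = rng.normal(loc=3.0 * y_anti, scale=0.5)

# Causal: X -> Y. The boundary (here at 0.7) leaves no trace in the
# featureless marginal p(x), so unlabeled x cannot help locate it.
x_caus = rng.uniform(0, 1, n)
y_caus = (x_caus > 0.7).astype(int)

# Using unlabeled data alone, clustering recovers the anticausal class
# regions (centers near 0 and 3, boundary near 1.5); no analogous trick
# exists for the causal data set above.
km = KMeans(n_clusters=2, n_init=10).fit(x_anti.reshape(-1, 1))
print(sorted(km.cluster_centers_.ravel()))
```

This is exactly the cluster assumption at work: it links p(x) to p(y|x), which is the kind of link the causal factorization rules out.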
But let's move on to the next example, an application of causality-inspired work in machine learning based on data from the Kepler satellite, a space telescope NASA launched in 2009, named after the astronomer Kepler. It observed 150,000 stars, in this direction, for four years, looking for exoplanet transits: events where a planet partially occludes its host star (because we view the system from the side), causing a slight decrease in brightness; a very slight one, usually orders of magnitude smaller than the influence of instrument errors. We observe a set of pixel light curves: this is a pixel on the CCD, and each point on these curves is typically a half-hour exposure, the number of photons gathered by that pixel in that half hour. We see some of them get fainter, some get brighter; these actually belong to the same star, which slightly shifts around on the CCD, even though the telescope tries to correct for changes in solar radiation pressure. In the end we are interested in the signal that comes from the star, but we only observe one that is affected both by the star and by the instrument errors. We would like to reconstruct the stellar signal, which seems hard since we don't know the instrument error; but we are measuring 150,000 stars, many other stars at the same time, and two aspects of these other stars are important: they have no causal link to the signal we are measuring, and they are measured by the same instrument. The other stars, as we measure them, are half-siblings of our star. We developed an inference method that removes the effect of the instrument, and we have several theoretical results: we can prove that the method recovers the desired random variable almost surely, up to a constant offset, subject to two assumptions. The first is that the observed quantity is the sum of Q and some function of the noise; we don't need to know that function. The second is that all the information about that function of the noise is present somewhere in the other measurements; we don't need to know where, but in principle the effect of the noise should be some function of the other measurements. Combining this method, which in the end is just a regression method, with efficient methods for searching light curves, we get a nice way of finding exoplanets, and we have discovered about 15 new exoplanets.
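A minimal sketch of the half-sibling idea on synthetic data (illustrative Python; the shapes, the ridge regressor, and all coefficients are placeholders, and the actual method and its guarantees are in the paper): regress the target pixel on other stars, which share only the instrument systematics, and subtract the prediction.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
n_time, n_stars = 2000, 50

systematics = rng.normal(size=(n_time, 3))              # shared instrument noise
siblings = systematics @ rng.normal(size=(3, n_stars)) \
           + 0.1 * rng.normal(size=(n_time, n_stars))   # other stars' pixels

q = np.sin(np.linspace(0, 20, n_time))                  # true stellar signal Q
target = q + systematics @ np.array([1.0, -2.0, 0.5])   # observed X = Q + f(N)

# Half-sibling regression: E[X | siblings] can only capture the systematics,
# because the siblings are causally unrelated to Q.
pred = Ridge(alpha=1.0).fit(siblings, target).predict(siblings)
q_hat = target - pred                  # recovers Q up to a constant offset
print(np.corrcoef(q, q_hat)[0, 1])     # close to 1
```

The causal reasoning does the real work here: it is only because the siblings cannot contain the stellar signal that subtracting the regression prediction is safe.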
I'm going to skip this next piece of work, which I didn't expect to get through anyway (a causality-based method for fairness), and instead briefly talk about a method for learning independent mechanisms using neural network models. Here we assume we have a data set that has been transformed by a number of different, unknown transformations, and we try to recover these factors of variation without supervision. The data are handwritten digits, and the transformations are things like adding noise, contrast inversion, or translations. What we do is learn to invert these mechanisms with an architecture that combines competition among a set of experts with something like a GAN. When a transformed digit comes in, all the experts try their luck: each attempts to invert the transformation and produces an output; the discriminator then tells us which output looks most like an MNIST digit, and the winner gets trained to increase the score measured by the discriminator. The discriminator, meanwhile, is continually trained to distinguish transformed digits from real ones.

Now suppose the transformations we want to invert are independent to begin with, in the sense that they do not contain information about each other (this is slightly informal, and there are ways to make it formal); by this we mean that performing well at one expert's task does not improve performance at another. They go in different directions in the space of learning, so learning one doesn't help with another, and this is exactly what allows the experts to specialize: if one of them gets better at one task, it gains no advantage at the other tasks, so the other experts can specialize on those. We initialize all the experts to the identity, and in this sense the competitive training procedure is exactly what exploits the assumption of independent causal mechanisms.

If we do this (each plot is a mechanism to be learned: left translation, down translation, and so on, and each color is an expert), the first green curve shows the discriminator value the green expert achieves on the left-translation task; you can see it eventually specializes to invert this transformation. The red expert initially cannot decide between these two tasks; at some point it specializes on this one, the task up here is taken over by the green expert, and in the end they have all converged. This worked out nicely and quite robustly. Once the system is trained, you feed in random inputs that have been transformed by one of the mechanisms, and this is the output of the winning expert. We can quantify how much this helps with the following experiment: we feed the transformed digits to a standard MNIST classifier trained on the original digits. On the original digits its accuracy is close to one hundred percent, but on transformed digits it drops to about forty percent; if we preprocess the digits with our model, performance comes close to the optimum after relatively short training. Training our model for only about 200 iterations, each a mini-batch of 32 examples, so roughly six thousand examples, already gives quite respectable performance. If the number of experts is too large or too small, you still get reasonable results: with too many experts, some of them do not specialize; with too few, some specialize on multiple transformations at once. My favorite example is that this also generalizes nicely to other classes of inputs: the experts were trained on MNIST, and we can afterwards apply them, at inference time, to these Omniglot characters, and each expert (this is the contrast-inversion expert, these are translation experts) does something reasonable.
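The competitive training loop can be sketched as follows (a condensed, hypothetical PyTorch rendering; the linear experts, network sizes, and learning rates are placeholders rather than the paper's actual architecture): every expert proposes an inversion, the discriminator picks a winner, and only the winner receives a gradient.

```python
import torch
import torch.nn as nn

n_experts, dim, batch = 3, 28 * 28, 32

# Experts are initialized to the identity, as in the talk; the discriminator
# scores how much a reconstruction looks like a clean digit.
experts = [nn.Linear(dim, dim) for _ in range(n_experts)]
for e in experts:
    nn.init.eye_(e.weight)
    nn.init.zeros_(e.bias)
disc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

opt_e = [torch.optim.Adam(e.parameters(), lr=1e-4) for e in experts]
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(x_transformed, x_clean):
    # 1) All experts try their luck on the same transformed batch.
    outs = [e(x_transformed) for e in experts]
    winner = max(range(n_experts), key=lambda i: disc(outs[i]).mean().item())

    # 2) Only the winning expert is trained to raise its discriminator score.
    opt_e[winner].zero_grad()
    bce(disc(outs[winner]), torch.ones(batch, 1)).backward()
    opt_e[winner].step()

    # 3) The discriminator keeps learning to tell clean from reconstructed.
    opt_d.zero_grad()
    loss_d = (bce(disc(x_clean), torch.ones(batch, 1)) +
              bce(disc(outs[winner].detach()), torch.zeros(batch, 1)))
    loss_d.backward()
    opt_d.step()

train_step(torch.randn(batch, dim), torch.randn(batch, dim))  # smoke test
```

The winner-take-all gradient is the design choice that matters: because only the best expert on a given example improves, experts have no incentive to cover each other's transformations, mirroring the independence-of-mechanisms assumption.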
Okay, I'm coming to the end; the last three slides. The long-term goal of this kind of work, my long-term goal, is to learn structural causal models from a multitude of tasks in multiple environments; in a way, like a brain that can do many things, not just one. These models need to reuse components, which requires the components to be robust across tasks, and I think a sensible inductive bias for this is to look for independent causal mechanisms; maybe competitive training can play a role in this. The overall problem is of course closely related to what people have started to call disentanglement. I have to admit I find that problem not very well defined; I don't quite understand it, and maybe one could make sense of it in terms of causality, but I don't have a good answer for that yet. We have a paper on a slightly more principled way of doing disentanglement in the workshop session, so please check it out. Overall, I think we as a field, and especially this community, have made a lot of progress in representation learning, where that usually refers to representing a fixed probability distribution; it's kind of an iid setting, and we represent an iid distribution. But our brains really need to represent causal models, and causal models capture not just one distribution but a whole set of distributions, interventional distributions. How we should represent those is, I think, completely open; it probably has to do with RL, with causality, with planning and reasoning, and, if you want, with everything related to thinking, which we're not very good at yet. I'm going to skip ahead, so let me just mention that a beautiful way to think about thinking is due to the ethologist Konrad Lorenz, for whom thinking is nothing but acting in an imagined space. So we need to learn models in which we can act in an imagined space, and I think these won't be purely statistical models, but they also won't be super-complicated differential equation models; we need a level of complexity in between, and causality might play a role there.

As we heard this morning, the first industrial revolution was triggered by the steam engine, and the second was mainly driven by electrification. Thought of broadly, both were about how to generate and convert forms of energy; and we should delete the word "generate", because energy, as we know, is a conserved quantity. We are currently experiencing a third revolution; actually it's not so current, I think it started in the mid-20th century under the name of cybernetics, and the most interesting thing about it is that it replaced energy with information. Just like energy, information can be processed by people, but to do it at an industrial scale we need computers, and maybe to do it intelligently we need machine learning. The analogy goes further: just like energy, information is probably a conserved quantity in physics, so we can only convert and process it, and one can speculate about this analogy a bit more. With energy, it took a long time, maybe until Emmy Noether in the early 20th century, to really understand it as a constant of motion related to certain symmetries, or covariances, and we might be at a similar stage now with respect to information: we know how to extract some of it, with relatively crude methods, and maybe one could also say we are currently a little bit intoxicated by the success of these methods, a bit like people must have been amazed by what could be done with steam engines, without fully understanding it yet. I think that may have to do with causality: statistical information is just an epiphenomenon, and it comes from underlying causal structures. With that, I would like to thank all my coworkers, and thank you for your attention.

Thank you, Bernhard. Okay, we have time for one question while the next speaker comes up here. Wow, no questions; it's a big chance. Okay, then one last question.

Question: you were talking earlier about time reversibility; when we have time series data, say in a hidden Markov model, how much work is done for free for us by encoding the timestamps of the data?

Answer: time gives us constraints; we know that the future can't cause the past, so certain causal connections are ruled out by time. But in a way that's only a small part of the problem: when X happens before Y, it doesn't follow that X causes Y, and the problem of confounding is just as hard as it is in the case where we don't talk about time. In a way it's interesting that one can talk about causality without talking about time.
That's a whole discussion in its own right, and maybe I shouldn't start it now, but I think it's a very interesting problem.

All right, let's thank Bernhard. [Applause]
Info
Channel: Steven Van Vaerenbergh
Views: 8,802
Rating: 4.8930483 out of 5
Keywords: iclr, iclr 2018, talk, presentation, machine learning, causality
Id: 4qc28RA7HLQ
Length: 43min 10sec (2590 seconds)
Published: Fri May 04 2018