in computer vision applications
with structured outputs in the late 1990s and the theory of large-scale learning in
the 2000s. During the last few years he has focused on
clarifying the relation between learning and reasoning with increasing attention on the
many aspects of causation such as inference, invariance, reasoning, affordance, and intuition. Leon has pointed out that learning algorithms often capture spurious correlations present in the training data distribution instead of addressing the task of interest. These correlations occur because the data collection process is subject to uncontrolled confounding biases. But suppose that we have access to multiple data sets exemplifying the same concept but whose distributions exhibit different biases. Can we learn something that is common across
all these distributions while ignoring the spurious ways in which they differ? I think that he will be answering that question
for us today in a presentation on learning representations using causal invariance. So without further ado, let's welcome him
to the stage. *Applause*

Leon Bottou: Well, thank you
very much, Anna. Thank you for giving me a chance to speak here in front of an audience that I'm a bit afraid of, because it's much bigger than what I see here. I'm going to spend some time trying to motivate what I'm doing by first saying that I don't believe what's written in the papers about AI. You read in the papers that AI is just around the corner and everything, and I think we have some serious difficulties we need to address. To give an idea, I wanted to show something that you might find amusing. I have a thirty-year-old demo here. The numerical code was written somewhere in '89 and the graphics in '91-'92, and the data file
— I wanted to load it before — is composed of 480 digits, because this is what we could handle. Actually no, this is a demo; we could handle ten times more, but the demo just has these. We're going to use 320 for training — that's what I'm doing here, 320 here — and so that's the training set and the testing set, and I'm going to prepare a network. So I'm going to load one, and then we need
to adjust it — actually yes, because this is too small. The problem of this demo is that it runs too fast; the problem is to make it slower. Okay, there. And to make it slower, one trick is to have a lot of graphics. So these are performance curves loading, and I can add some more, some more, and maybe if I go over the training set — I have to start to initialize my weights, initialize the learning rate — and, well, you see it's too fast, so I can slow it down a little bit. So you see these digits — they are mouse-drawn, because we didn't have a camera or scanner — and that doesn't work. Now we're going to train, so let's stop this and start training. I'm going to switch off this display and the network goes up and up and up. Now we're going, so the black thing is the
training error — the black dot that you see here — and the white one is the testing error, or initially the opposite, I don't remember; it's going to be easy to see. And this is the error whenever I show an example; just plotting it makes it almost reasonable. That used to take 15 minutes. I can even go a bit faster, because this is not the point of the talk. And stop, stop, stop, okay. So the point now is that now it works — I decided to do that. It's always harder to do when I'm on the spot like this. Why, what's going on? This is the one I want. Okay, so you see the digits going by and
you see that the white curve going up and down — it means that it learns — and now I can do it a bit slower so you can see that it works. And there are some amusing things you can do: you can look at some specific patterns that are bizarre, like this one — well, I know it's a five, okay. I think there is one like this — no, it's been a while, I don't remember the difficult patterns by heart — oh no, it's working really well, okay, but it's working. You can add noise, you can do things. So that was thirty years ago. This is a convolutional neural network; it's
very small by today's standards, trained on just 320 examples, but it learns and generalizes on a task that's not completely trivial. Thirty years of Moore's law — computer speed doubles every year and a half, so that's a factor of about a million — and I told you we could run this at the time on a couple thousand examples. It's not hard to see that the fact that today we can use a thousand times more examples and a thousand times bigger networks is not very surprising. It's fundamentally the same kind of phenomenon, the same level of things. And what we wanted to do at that time —
well, we knew that this was quite interesting, we knew it was too small, but we knew that it would change, and we wanted to find a way to program computers not by programming them but by teaching them, by training them to do something — an alternative way to use computers, an alternative way to see how computers work: instead of being programmed laboriously, detail by detail, we want to just train them and make them work in a way that is a little bit more similar to us. It's not AI yet, and there are other things
in AI, but if you can get a computer to do what you want by training it, it's clearly a step. And have we done that? This is where I'm going to go to my slides, which should be somewhere here. What I've done — well, it will be in the title — is joint work with Martin, who is a PhD student at New York University, Ishaan, who was at Google at the time and is now at MIT finishing his degree, and David, who is a researcher at Facebook AI Research in Paris. Okay, that's by way of introduction, but the thing is,
this is what really puzzled me. That was 2014; neural networks were reappearing, and we were trying to show some transfer learning properties in vision. One of the tasks we looked at was the action recognition problem in Pascal VOC, and one of the actions is detecting whether somebody is giving a call, is placing a call: you have images, you have a bounding box, and if the person in the bounding box is calling, is on the phone, you are supposed to say yes; otherwise you are supposed to say, well, maybe this person is doing something else. That looks all very nice, and we got seventy
percent correct, which was the state of the art at that time, and I think it's not far from the state of the art today. But Maxim, a student who was working on this, came back and said, "You know, we got 70 percent, but it doesn't work — look at this picture." There is a row of pay phones, which is something that still existed in 2014, there is a person here, and this person is obviously not calling; but if we move the bounding box in front of the pay phone, it's going to say, very strongly, calling. In fact, whenever there is a person in proximity to a phone, it says calling. So of course it's not solving the problem of detecting whether a person is calling. And the worst part of this is that the algorithm
is right. If you take the pictures you typically get on the web, when there is a person near a phone, most of the time this person is calling. It's a selection bias: you don't take a picture of somebody who just walks by a phone; when you take a picture of somebody near a phone, that person is calling. So that means that 1) we're not solving the problem and 2) the algorithm is right. So what's missing there? And what's missing is that you have
to realize that the task of detecting whether somebody is calling is a task that we don't know how to solve directly, so we define a proxy problem. The proxy problem is a statistical problem: here is a data set of yeses and nos, try to replicate that. And between the proxy problem and the task there is a world of things that we ignore. We start machine learning by saying: let's suppose we have i.i.d. data, one part is the training set, one part is the testing set. But here you see a practical example where we do have an i.i.d. data set, because we split it into a training set and a testing set — the Pascal VOC people did that — and so it perfectly fits the assumptions of machine learning theory, and yet it's totally missing the point. Now, it turns out that if you look at what has happened
in the last thirty years: thirty years ago we would make the data sets very carefully. For instance, in the nineties at Bell Labs we worked on zip code recognition or check amount recognition, these kinds of things, and we would make data sets that were at most ten thousand or a hundred thousand examples for the biggest ones, and we would be very careful in curating them so that they represent the thing we want. This curation is quite an involved process. Nowadays, as I said, the data comes from the web in huge quantities; these are not things we can look at, and in fact the sizes that we want to use are so large that it is impossible for a human to even look at them in a careful way, and they're corrupted by plenty of biases. And if you look at the papers, there are plenty of papers that comment about
bias, like this one by Torralba and Efros, a computer vision paper where they take several object recognition databases and try to recognize a car — the car class — in each of them. Basically what they show — at that time, this was before ConvNets, before the rediscovery of the results of convolutional networks — is that when you train on one data set it performs very badly on the others; on the other hand, if you take an image and you want to train a classifier to say from which data set it comes, that works really well. So that tells you that each of these data sets is so specific that it's impossible for the learning algorithm not to catch the specifics of the data set instead of the concept you want it to have. There is also work about recommendation systems or ad placement systems
where you find a lot of causal effects that cause the data to be completely biased. And another one that's quite recent is about visual question answering, which looks like a very nice task: you get a picture, you get a question, and the computer must answer. There's a picture — so, what is the color of the tie of the man who is walking in second position? — and the answer should be right. So it seems, from our biased perspective, that if your computer can do this, it means it understands the image and the question. And very quickly systems got to seventy
percent correct, and then somebody said: stop, stop, stop, there is a problem if you don't even look at the image and just look at the question. When the question is "what is covering the ground?", the answer is "the snow". And when the question is "is there something on the something?", the answer is "yes". And that comes back to how the data was collected: first, images from the web were collected, and then a first set of Mechanical Turk workers was supposed to invent questions and a second set was supposed to give the answers. Now, the imagination of people in terms
of questions is not very big: if there is something with something on the shelf, they're going to say "is there a flower pot on the shelf?" because they saw a flower pot there, but they are not going to ask whether there is a giraffe on the shelf or something else; when there is a flower pot, it is natural to ask a question like this. So the result is that the data set is so biased that results that were looking very promising were in fact barely better than the ones you can get by just exploiting trivial biases like that — looking just at the question, or just at the image. So the lesson is that the data collection creates a lot of biases — you have confounding biases, feedback loops in the systems, you have selection biases — and we cannot control for them, and all the machine learning algorithms are going to absolutely love to take advantage of the spurious correlations. If they can find an easy way to solve the
problem, they will use the easy way, not the hard way that requires understanding something. So if I go back to these spurious correlations: when do we say a correlation is spurious? Take my phone example: why do I say that the fact that a person being close to a phone is strongly correlated with the person calling is spurious? The reason is that I do not expect that this is going to work in the future. I do not expect that my system is going to be used only in situations where I can make this assumption. And the question is, what informs us? Why do we say such things? Well, we might have substantive knowledge about
what it is to give a call: when you call, well, it's not enough to be close to a phone, because calling involves consequences — people are going to be informed of something. But we can also ask where this substantive knowledge comes from, and we have to say that, whether it is in one person or in mankind, it comes from past observations. Now, that's the problem, because the past observations,
we said, are biased. And this is where there is an interesting concept that you can find in the philosophy literature about causation, going back to Hume and other people: humans are not just looking for correlations, they are looking for stable properties. What does it mean, a "stable property"? It means that you want to see that the action of calling is connected to what you see in ways that are stable, that are not going to change. And that means that maybe, when we look
at the past, we take the data from the past and we say these things are always correlated — but the past is not uniform. Nature doesn't shuffle the data; we shuffle the data when we do machine learning. We shuffle data because we want it to be i.i.d., because this is what we understand, but in fact when we collect the data, we collect it at different points in time, different points in space. If you take even the calling example and take it at different points in time: nobody is going to have a cellphone; in the past it is going to be a big phone with the rotary dial. And if you look in different countries it might be different too, because people might hold the phone differently, they might have different equipment; or in different experimental settings, you can have a high-resolution camera or a low-resolution camera. Sometimes I think that when you look at ImageNet and you recognize all these dogs — you know, when you take a picture of a dog, everybody takes the picture with the same kind of phone, from the same distance, with the same focal length, because this is how you take a picture of a dog — am I recognizing the dog, or the subtle noise patterns that tell you which phone was used and how it was set up? And so then we shuffle the records, we take
all this data, we mix it, and we say it is i.i.d. and we can proceed with machine learning. And that clearly throws away a lot of information. So we started to follow a line of work that comes from Jonas Peters — around 2016, maybe before, with Peter Bühlmann and Nicolai Meinshausen in Zurich, around 2015. We consider that the data set we have doesn't come from a single distribution, but comes from several distributions that we call environments. Initially, a discrete set of distributions P_e, so we have (X_e, Y_e) for e = 1, 2, 3, ..., a small number, and so we have a bunch of these distributions, and we have training sets — which I'm going to assume are large from now on — provided for some of these distributions, and we want a predictor that's going to work for many of them. And the important point is that I'm not going to assume that the distributions I have are a kind of random sample of the possible environments. What I call an "environment" is one of these distributions; I'm just going to say I observed them. So when you have a situation like this, the
classical way in statistics is to try to be robust. You're going to say, "I'm going to minimize the maximum over all environments." So if you look at this formula here — maybe I'm going to use the mouse so that it is visible — you have the minimum, over your family of functions, of the maximum over all environments of the squared error, and you can have a per-environment baseline that I'm going to discuss later. That says that you want something that's going to perform well on all of them, not just one of them or a mixture of them.
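My reconstruction of the formula on the slide (the notation is assumed): write R_e(f) for the expected squared error in environment e and r_e for the per-environment baseline, so the robust objective reads

$$
\min_{f \in \mathcal{F}} \; \max_{e} \; \Big( \mathbb{E}_{(x,y)\sim P_e}\big[(f(x)-y)^2\big] - r_e \Big)
\;=\;
\min_{f \in \mathcal{F}} \; \max_{e} \; \big( R_e(f) - r_e \big).
$$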
And then you realize there is a problem. Suppose you take the calling example and you have two environments: one of them is made of very clean pictures taken last year, and the other one is made of pictures from the '20s — black-and-white, grainy, not pretty. Well, it's going to be harder, so the error you're going to make on the old pictures is going to be higher, because you don't see them nearly as well, even though the phones are way bigger — but that's another detail. So you might say: well, for each of these environments,
maybe I want a baseline — and I'm not going to say how to compute the baseline — and what I want is that by having a single function for all environments, instead of one trained for each environment, I'm not losing too much compared to the baseline. I mean that the relative loss of accuracy compared to the baseline is not large, even though I'm going to use the same function to work in all these environments. But when you have this, you can start doing
a bit of mathematics and restate the problem. I say f is the argmin of M subject to, for all environments e, M being greater than what I want to minimize. And when you have a constrained problem you can use Karush-Kuhn-Tucker theory, which basically tells you there is a set of nonnegative lambdas such that the solution of that problem is a first-order stationary point of a proper mixture of my squared errors. So I'm back to my initial point: if I want to minimize this problem, the only thing I have to do is mix my environments in the right proportions and I'm going to satisfy this condition. And the worst part is that by changing the baseline I can change the lambdas any way I want; choosing the lambdas or choosing the baselines is the same problem, and so the robust approach amounts to mixing the environments in the correct proportions.
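Again as a reconstruction with the same assumed notation: the min-max problem becomes a constrained problem, and the KKT conditions give multipliers that turn it into a weighted mixture of environment risks,

$$
\min_{f,\,M} \; M
\quad \text{s.t.} \quad R_e(f) - r_e \le M \;\; \forall e
\qquad\Longrightarrow\qquad
\exists\, \lambda_e \ge 0 \ \text{such that} \ \nabla\!\Big(\sum_e \lambda_e R_e\Big)(f^{*}) = 0 .
$$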
So here is an example: I have four distributions, P1, P2, P3, P4, and the robust approach says I'm going to be good on all of them, and therefore I'm going to be good on the whole convex hull of these distributions. And this is attained by minimizing a specific mixture. Now, the interesting thing is that there might be distributions outside of this convex hull — with barycentric coordinates that are slightly negative — and such a thing can be a legitimate distribution, but I have no guarantee on it. The approach doesn't say anything about anything outside. Is that important? Let's take another example. Take a search engine. My example is made up, but if you take a search
engine, it's often interesting to classify the queries into different categories — like whether the query is commercial in nature or navigational in nature — that's quite important. And suppose that I have a number of environments which are the sets of queries from today, yesterday, the day before yesterday, and so on. Quickly you're going to see that there are three kinds of queries: you have the queries that remain constant, whose frequency is the same over all this period; some of them are growing, like for instance when you come close to a particular event — come close to Christmas, queries about Christmas presents are growing; and some of them are decreasing — after Christmas, the queries about Christmas presents are decreasing or going away. And the growing and the decreasing ones, they
are a very small subset of everything. So if I take my triangle here and say these are the many queries that stay the same, these are the ones that are decreasing in popularity, and these are the ones increasing in popularity, then my four days form a very small set of points that are very close to this corner. If I'm robust, I'm going to perform well in that little domain here; but in fact, if I wait, let's say, one month, I'm going to move away in that direction. So what you see here is an example where having a guarantee that works only in the convex hull of my environments is not really sufficient: I can do better and I would like to do better. In fact it's a problem of interpolation versus extrapolation, and interpolating is something that we understand and can do quite well; extrapolating is always a mystery. So if we go back to this idea of learning
stable properties — an idea you find in a number of old philosophical works. Suppose — I'm going to come back to my problem of calling — that you have a set of pictures taken from the web: there is a selection bias, and pictures where I see a person near a phone often represent the person calling. Now suppose I also have little movies, and let's say the movies have the same selection bias, because if I take my little movie, at some point in the movie somebody close to a phone is giving a call, is placing a call. But if I take the frames of the movie, even though at some point the movie is going to show the person calling, there is the before-calling and the after-calling: I have a lot of frames where somebody was close to a phone but not calling yet, and close to a phone and not calling because the call is finished. And that's interesting, because it means that if I take images from these different sources — pictures taken from the web or frames taken from movies — they both have the selection bias, in the sense that the correlation between the proximity of a person and a phone and the event of calling is high, but they have it in different ways. They differ in strength, and if it differs in
strength, it means that if your regression system has the choice between two kinds of things — like, for instance, features that represent the shape of the person and the position of the hand, and features that represent the presence of objects nearby — the regressions will be different: if I compute the regression on the first data or on the second data, they're not going to rely on the same features in the same way, because in the first case the proximity is a more reliable indicator of calling, while in the second case it is a bit less reliable, enough to use a little bit of something else, even though using the proximity is still very favorable in terms of accuracy. The idea is that we would like to learn phenomena that
remain invariant across environments. We really want to learn a regression or a classifier that uses the features in the same way across environments; if one feature is wobbly, because it has different strengths in different environments, we're going to say that this one is suspicious. So this idea is very related to the notion that we don't take all the data as a single distribution; we look at its interior structure and say that if some correlation is maybe highly predictive but changes in strength across environments, we see it with suspicion. So why is it interesting? Let's first consider invariant regression, which
is a strong requirement. Suppose that instead of minimizing the maximum error across environments, I'm searching for a function that minimizes the error for all environments simultaneously. It's not guaranteed that such a function exists, and it is not so simple to find it, but what does it mean in terms of mixture coefficients? If f* is a stationary point of my error for all environments, it's also a stationary point of any superposition of my environment errors, and that's true for all lambdas, positive or negative; if some lambdas are negative this is still true. So the invariance property is in a sense stronger: we want something that's way, way stronger — maybe hard to achieve — but if we achieve it, we don't just generalize to the distributions that are in the convex hull of the ones I have, but to the biggest extent that we can reach with, let's say, negative barycentric coordinates, which is what I wanted to say here.
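In other words — my paraphrase of the argument, with the same assumed notation —

$$
\nabla R_e(f^{*}) = 0 \;\; \text{for every } e
\quad\Longrightarrow\quad
\nabla\!\Big(\sum_e \lambda_e R_e\Big)(f^{*}) = 0 \;\; \text{for any coefficients } \lambda_e \in \mathbb{R},
$$

including mixtures with negative coefficients, which is what takes you outside the convex hull.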
There are some trivial existence cases, like the noiseless case: maybe there is a function that works and classifies everything correctly, so it exists, and these trivial existence cases annoy me because I don't know how to deal with them very well. But I'm going to be interested in the cases where there is no single f* that minimizes the regression error for all my environments. And I'm going to say: well, rather than playing with the
function family, maybe I should go straight to the point. I'm going to say that I want to find a representation Phi(X), and on top of this representation I want the relation from Phi(X) to Y to be invariant across environments.
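As a rough reconstruction of this goal — I believe this matches the bilevel formulation in the paper this talk is based on, with notation assumed —

$$
\min_{\Phi,\; w} \;\; \sum_{e} R_e(w \circ \Phi)
\qquad \text{subject to} \qquad
w \in \operatorname*{arg\,min}_{\bar w} \; R_e(\bar w \circ \Phi) \;\; \text{for all } e .
$$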
So the idea there is that if my problem is noisy enough — and I don't know yet how we will have to deal with noiseless problems, that's an issue — but if it is noisy enough, the only way I can find a function that's going to minimize my regression error for all the distributions, all the environments, is by first projecting my patterns into a certain representation that essentially eliminates the features that are unstable, the features that are spuriously correlated with what I want. And that gives the idea that we can use this criterion of invariance not just to learn, but also to guide the creation of features, and that's very different from plain recognition. When you do a neural network and you have
hidden states in the network, these hidden states are created in a way that permits the best possible prediction. Here we're not interested in the best possible prediction; we're interested in a prediction that remains invariant across environments, and that's something that's very, very important in science. For instance, suppose that you are watching an
apple falling from the tree. What do you have? You just have an image or a little movie, this is what you see, and you could pay attention to a lot of things. You could pay attention to the color of the apple; you can observe that when the apple is falling, very often it's red, because it's ripe, but that doesn't tell you anything very important about the trajectory of the falling apple. You could look at the size of the leaves, the size of the tree, the size of the trunk, the number of leaves on the stem of the apple. But in fact if you look at the right variables, let's say the position and the speed, you observe that all your apples now obey exactly the same equation, and because you observe it on all the apples, you can say: oh, maybe it's going to work on all falling objects, maybe I have found something more important. So in that case we know it's not going to
work on kites, for instance — it's not going to work that way — but for a number of falling objects the same equations are going to work. So there is some value in finding exactly the same solution across environments, in the sense that finding exactly the same solution is a hint that this solution is a bit more general, and it's also a hint that the data, or the representation of the phenomenon, in which you find this invariant solution is something important. Now, there's a lot of related work. The first point is that this idea of invariance
is in fact related to causation, and this has been known for a long time: there is work by Nancy Cartwright, for instance, and there are people working in philosophy, like epistemologists. But it's easy to understand in the causation framework of, let's say, the statisticians: what you want to do is not predict what the system is going to do, but predict what the system is going to do when you intervene on the system. For instance, if you want to look at
the efficacy of a drug, you can run some tests and everything, but what you want to know is: if I give that drug to everybody, will the population be better? And this is not obvious. When you want to do this, there are actually two things you can use: you can use your knowledge of the intervention — I'm going to give the drug to everybody — and the second one is that you can use what you believe remains invariant before and after the intervention. Initially you give the drug just to a little set of test people, and afterwards you give the drug to the whole population; if the probability of getting better, given that you have the drug or not and given all the variables of interest, is preserved, then you can use that. So you're looking for properties that are invariant, and in fact all the tools like the do-calculus or ignorability assumptions are tools to try to model these kinds of things that are invariant. For instance, you have a graph, you intervene
on the causal graph, and you can detect that some conditional distributions are going to be invariant and some are not going to be invariant, and do some kind of calculus. Now, invariance is attractive for learning because reconstructing causal graphs from data has proven very difficult, while learning *Audio cuts out* that are stable across time or across various conditions, you actually do something that's as powerful as doing causal inference, because you get half of it — the part you can use directly — and it seems to be an interesting alternative. So I mentioned the paper of Jonas Peters that
was a big inspiration for this work. The paper of Jonas Peters considers causal graphs on which you intervene — this is what the little hammers show — and each causal graph with its intervention describes a slightly different distribution, because of the intervention. Now you get all these distributions, you have a variable of interest, this is Y, and you're going to try to find an invariant regression for Y. In the case of Jonas, he assumes that all the variables are known, so the representation is just selecting which variables I'm going to regress from. And what he shows is that, under caveats that are mostly technical, if you find an invariant representation — if you find a set of variables such that when you compute the regression from these variables to Y you get the same one in every environment — then you have found the direct causes of Y in the graph. Which is a very nice result; now, the limitation
of this, of course, is that you have to assume that you know which are the important variables; I want to assume you just get a bunch of pixels. Finding the important variables is the difficult part, and this is just about narrowing down: you assume here that you have a small set of variables and you know that the important ones are a subset of them, and you just have to refine; but finding what's important to measure to start with is the difficult part. So another related topic is adversarial domain
adaptation, which is a recent thing, and the goal is to learn a classifier that does not depend on the environment — the idea is that you want to learn a classifier that you are going to train on some distributions and that is going to work well on the other ones. The simplest version adds an adversarial term that says: if you take the states from a hidden layer somewhere and try to classify which environment the data comes from, you cannot do it anymore. So you're trying to find a classifier that has a hidden representation from which you cannot recover the environment. And if you think about the paper I mentioned earlier, the observation that, given an image, you can tell from which data set it comes is already a problem, so they're trying to alleviate this by saying: I'll take my image and I'll map it to a set of features from which I cannot recognize the data set anymore. Now, you realize this is too strong, because
it might be that the different environments have different probabilities of the yes/no answer. If you say that these features do not allow you to recognize the environment, it means that the distribution of these features in all the environments is the same — if it were different, you could say from which environment a sample comes better than chance — and if they're the same, it means that the distribution of the class label you're going to predict is the same in all environments, which is a bit too strong. And there are some other variants, and basically the question is whether you force P(H), the distribution of the hidden layer, to be independent of the environment, or P(H, Y) jointly, or P(Y | H), or P(H | Y). What we do is weaker: we just want the regression
to be the same. And finally you have robust learning, which is something that happens — let's say maybe the most common, the most popular idea about this is the PGD approach to resist adversarial examples. You say essentially that instead of minimizing on the distribution of the data, you're going to define a set of neighboring distributions and minimize the maximum of your error on all these distributions. In contrast, we use multiple environments which come from the data — they are not defined a priori — and then we ask for invariance. How much time do I have? 15 minutes? Okay, so I'm going to start with the linear
case. So in the linear case, you have X; the representation function is a matrix S, the regression is a vector v, and in fact the whole operation is linear — it is some vector w — and you can see already that this is very over-determined, because you could change S a little bit and compensate in the regression back and forth, and you also have lots of degenerate solutions. What's interesting is that if I choose S equal to 0, of course it's invariant, but it's not very good, meaning that I'm just eliminating all the features. You can see that what matters is the null space of S, which is the information being censored by my system. Another difficulty is that if what matters is the null space of S, a small change of S can completely change that null space: in the vicinity of a singular matrix there are plenty of non-singular matrices whose null space is just the zero vector, so it's a finicky criterion to minimize. But if you do a little linear algebra you can characterize the solutions, and you characterize them not in terms of S and v but in terms of w, the whole thing. You can see which w's satisfy the invariance property: in fact, a w satisfies the invariance property — meaning that there is an S and a v, with the same v for all environments — if and only if a certain condition is true. Then we can reconstruct the S — there are lots of them. In the least-squares case this condition represents ellipsoids: essentially, for each environment, you have an ellipsoid in w-space, and they all pass through 0 — w = 0 is a solution, not an interesting one — and the w's that have the invariance property are those that are at the intersection of all these ellipsoids. Which is sort of bad news, because the intersection of ellipsoids typically is not connected, so it's going to be a bit difficult to search.
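Here is a sketch of the condition I believe is being described, for the least-squares case with assumed notation: write the representation as $\Phi(x) = Sx$, the regression as $v$, and the end-to-end predictor as $w = S^{\top}v$. Requiring $v$ to be simultaneously optimal for every environment implies, for every environment $e$,

$$
w^{\top}\,\mathbb{E}_{P_e}\!\big[X X^{\top}\big]\, w \;=\; w^{\top}\,\mathbb{E}_{P_e}\!\big[X Y\big],
$$

and each such quadratic equation defines an ellipsoid through the origin in $w$-space; the invariant predictors lie in the intersection of these ellipsoids.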
I'm going to skip the part about computing ranks and high-rank solutions, and maybe go straight to IRM. So one possible idea is to use this criterion as a regularizer: you have all these ellipsoids and you're going to say, "in order to be close to the intersection of the ellipsoids, I can add a term to the cost." *Audience member interrupts* They all share S; everything is shared, there is only one classifier at the end. So the question is what information S is going to remove in order to make sure that, after applying S, the linear regression is the same for all environments. So there is only one S, only one v. And it's not every w that has this property, so I can regularize towards them; and since I have an ellipsoid per environment, I can measure the distance to each ellipsoid, which happens to be a fourth-degree thing — which is not fun, but we knew it. One way to look at it is to say: I have S and v, and I can insert a dummy multiplier here, theta, that's fixed at 1 — I'm going to say it's one — but I can compute the derivative of my cost function with respect to theta, and it turns out that this is exactly what I want.
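As a concrete illustration, here is a minimal PyTorch sketch of that kind of penalty: freeze a scalar dummy multiplier at 1 and penalize the squared gradient of each environment's risk with respect to it. The function names are mine, and I use a binary cross-entropy loss for illustration, whereas the formulas in the talk are written with squared error.

```python
import torch
import torch.nn.functional as F

def invariance_penalty(logits, y):
    """Squared gradient of the per-environment risk w.r.t. a dummy multiplier frozen at 1."""
    theta = torch.ones(1, requires_grad=True)             # the dummy 'theta' fixed at 1
    loss = F.binary_cross_entropy_with_logits(logits * theta, y)
    grad, = torch.autograd.grad(loss, [theta], create_graph=True)
    return (grad ** 2).sum()

def objective(model, envs, penalty_weight=1e4):
    """Average risk over environments plus the invariance penalty."""
    risk, penalty = 0.0, 0.0
    for x, y in envs:                                      # one (x, y) batch per environment
        logits = model(x)
        risk = risk + F.binary_cross_entropy_with_logits(logits, y)
        penalty = penalty + invariance_penalty(logits, y)
    return (risk + penalty_weight * penalty) / len(envs)
```

The penalty is small exactly when none of the environments would benefit from rescaling the frozen multiplier, which is the "no environment is calling for a change of theta" idea described above.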
So basically what I'm saying here is: maybe you've heard of domain adaptation layers — you have a system and you have various domains, or environments, and you say, well, I'm going to train everything jointly, but for each environment I'm going to use a little extra layer that I'm going to optimize for that domain. You can see theta as a domain adaptation layer, a trivial one, and what I'm saying is that actually I don't need to adjust it: I'm looking for a solution such that, if I had a domain adaptation layer, I wouldn't need to change it to model all my environments properly. And there is an equivalence between the two approaches. This way of looking at it is interesting because if the model is nonlinear I can still make the same kind of reasoning: I can still say, well, I'm inserting a frozen domain adaptation layer — it is the identity, it doesn't change anything — but what my regularization term says is that when I go and look at all my environments, none of them is calling for a change of theta.
So I'm going to take an example; I call it Colored MNIST: digits with misleading colors. We take MNIST and we split it into two classes, the low digits zero to four and the high digits five to nine. I told you I need noise in my system, because I'm going to use noise to constrain the representation, so I'm adding 25% label noise: the highest classification accuracy I can achieve by using the shape of the digits is 75%. But then I'm going to add colors. I'm going to say that if my class Y is 0, my digit is going to be red with probability 1 minus e and green with probability e, meaning that my low digits are going to tend to be red and my high digits are going to tend to be green, and my two environments are going to be defined by setting e to 0.1 and 0.2. That means that if I use only the color, I'm going to classify better than if I use the shape: if I use the shape I can only be 75% correct, while if I use the color, in one environment I get just 10 percent error and in the other environment 20% error, both of which are less than the 25% error which is the best I can achieve with the shape because of my level of noise.
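For concreteness, here is a rough sketch of how such an environment can be built (my guess at the construction being described, not the exact code used in the experiments): binarize the digit label, flip it with probability 0.25, then color the image so that the color agrees with the noisy label except with probability e.

```python
import torch

def make_environment(images, labels, e):
    """Build one Colored MNIST environment whose color/label correlation is 1 - e."""
    def bernoulli(p, size):
        return (torch.rand(size) < p).float()

    y = (labels >= 5).float()                        # low digits -> 0, high digits -> 1
    y = torch.abs(y - bernoulli(0.25, y.shape))      # flip the label with probability 0.25
    color = torch.abs(y - bernoulli(e, y.shape))     # color agrees with y, except with probability e
    images = torch.stack([images, images], dim=1)    # two color channels (say, red and green)
    # erase the channel that does not correspond to the chosen color
    images[torch.arange(len(images)), (1 - color).long(), :, :] = 0
    return images.float(), y

# Two training environments and a color-reversed test environment, as in the talk:
# train_envs = [make_environment(x1, y1, 0.1), make_environment(x2, y2, 0.2)]
# test_env = make_environment(x3, y3, 0.9)
```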
So I'm training with e equal to 0.1 and 0.2. If I train with normal training, minimizing the empirical risk, well, I get something about halfway between 10 and 20% error, which is normal; but if I test with e equal to 0.9 — I reverse the color scheme — well, it doesn't work at all. If I do the same training but add the invariance regularization term — and it is very painful to train, it's not a nice numerical problem, it's slow and everything — I get consistent performance. So basically I was able to say that because the relation between my pattern and the color is not stable, I don't want to use it and I have to rely on the shape, which is corrupted by noise in that case, which doesn't make it very easy. And if you look at the output of my classifier —
these are dots corresponding to examples in each environment, and you see that for 0.1 and 0.2 you get an answer like this, but for the green one you get the opposite answer, because you rely on the color and the color has been switched. If I train with the invariance penalty, well, it's more reasonable, even though I have a little bit of junk that remains. So it's a small example, and it works only
when it's very noisy, which is the problem, and the next question is how to scale these kinds of ideas up. This is where we start having problems. First of all, we have numerical issues: the regularization is very non-convex — we are essentially targeting the intersection of plenty of ellipsoids, and the intersection points are all over the place, they're not connected, so it's not going to be easy. And then we have a different problem, the realizable problems: many of the problems that are interesting for people nowadays are problems where you can achieve essentially zero loss, and there I don't have the noise that allows me to make the system work, so I have to find a way. In fact, if I go back to this realizable case, which is the case where, well, you know, there is a function that is able to fit everything,
let's look at my little observer setup. I have a phenomenon, a scene, and there is a variable of interest that exists in the scene — maybe it is not observable directly — that I call Y-pre, before labeling. I get some kind of image X, and I get a bunch of people that I call *indecipherable* — this is what we do in supervised learning — and they're giving me labels that are Y-post. Suppose the label is whether a person is calling, for instance: Y-post is the label that the labelers give, and it's not necessarily the truth, because somebody could be calling while hidden, and if you don't see the person calling, the labelers are going to say no, and they're going to agree, but in fact the person was calling. So there is a slight difference between the Y that is reality and the Y that comes from the labelers. Now, the labeling process is often designed
to be as deterministic as possible: we train the labelers to be consistent, we ask the questions in a way where there are not many ways to give answers. So we're trying to make sure that there is a function that works for all the environments — and that's true for the post-labeling Y, not necessarily for the pre-labeling Y. Okay, so what this means is that when you have a supervised problem where the labels are given by labelers, we have artificially created the situation in which there is an invariance, not because we found an invariance in reality, but because the action of the labelers was invariant, and because there is no noise at all it is hard to sort these out. So that's my problem at the moment. So,
if I should conclude this talk: I said that the statistical problem is only a proxy, but something very important; and between the real problem and the statistical proxy there is a huge gap that we haven't explored. This initial idea that maybe you can use computers in a different way, by teaching them to do things instead of programming them, is not going to be achieved unless we understand what sits in the gap between what we want to do and the statistical proxy. The second point is that nature doesn't shuffle the examples; we shuffle the examples because it matches our ideas about learning, but by doing so we are removing a lot of useful information. One piece of this information is about the
properties that are stable, and when you start looking for things that are stable, invariant across environments, you start making sense of extrapolation. The idea that you can extrapolate to new environments sort of makes sense in this situation, because when you are optimal for each of your environments, you are also optimal for mixtures, possibly with negative barycentric coordinates, so you can go beyond strict interpolation. Then, invariance across environments is related to causation — it's an alternative view of causation, as I explained, and there are formal results connecting invariance to causation. You can try to find invariant representations that enable invariance across environments, and that only works if it happens that not all representations enable invariance — if everything is invariant, it doesn't work that easily. And this is why, in my program at the end, I want to understand how to slightly change these concepts to make them applicable to the realizable problems, the ones with zero loss, which are tricky in different ways. That's it, thank you. *Applause*

Off-screen voice: Okay, so let's have five
minutes for questions.

Leon: I can repeat the question. Okay, so the question is about what I think of policies that try to determine which form of data augmentation works best. First of all, that could work, but it is not what I'm looking for, and the reason is that when you speak of data augmentation, you are in a situation where you define a ball around the distribution that you have. You say: instead of taking the examples I have, I perturb the examples and I create a combined distribution. So data augmentation
is just a way to look in a small ball around your distribution, and this small ball is arbitrary, because you're perturbing the data in ways that you think are good — like, for instance, you're thinking maybe I want to have more rotation, more translation, more jitter, or more color changes. But what about the thing you didn't think of? Like, for instance: oh, it turns out that all the pictures of potatoes were taken with the same focal length and the same kind of camera, and that's detectable in the noise. Well, you didn't think of that one, and if you didn't think of that one you're not going to be able to fix it. So somehow we want to be able to handle these kinds of situations, and for this we need extra information, and that extra information, I think, can be found in the data of the different distributions. So the question is about the connection between
looking at the data this way versus SGD, because SGD theory is very often built on the i.i.d. assumption. Well, this is why we looked at it as saying: we have a set of sub-distributions, and we assume we have training data from each of them, and so we can do SGD on each of them, or we can try to have a regularizer on top of it. But it's true that in many situations the formulations you can give to a problem like this are challenging for SGD. It's difficult to find the SGD procedure that computes the right solution, and you have people who do it with adversarial means and other things — it's very hot and interesting — but you have to realize that it's awfully slow, you know: just training a GAN is terrible compared to training a plain network. I couldn't show you a thirty-year-old demo of training
a substantial GAN, because it was totally out of reach; I can show SGD with a CNN on digit recognition, no problem. So the adversarial approaches, and having constraints on the distributions — saying that the distribution of the hidden layer and Y must satisfy certain relations — are always difficult to express with SGD, but that's normal, because we are going away from having a criterion that is a simple average over my data, where SGD optimizes the average over the data. Yeah, that's a problem.

Off-camera voice: So yeah, great talk — like,
I love the paper. For me, intuitively, it makes a lot of sense when you're doing classification: you want to find the invariant relationship, it's kind of like spot-the-difference — hey, there's always a cow in all these photos. But suppose I'm trying to do an inference task and I use your IRM predictor, and it's something like the expected outcome of Y conditional on, I don't know, treatment and covariates — what is the invariant relationship in that sense, when we're thinking about, you know, causal inference?

Leon: If you take causal inference
with graphs, and you take the typical situations of confounding and unconfounding: if you just collect conditional distributions on the basis of what you observe, without conditioning on the potentially confounding variables, the conditional you obtain is not invariant when you apply the intervention. On the other hand, if you condition with respect to all the potential confounders, the distribution you obtain is then invariant.

Off-camera voice: So IRM will just not work if
there's any, like, you know, unobserved confounding that you're not accounting for?

Leon: Well, first of all, IRM is not an end, it's a beginning, so it doesn't do much yet. We were happy to make Colored MNIST work, and we have a hard time, you know, making it bigger or something like this. But what's interesting is that it's a direction where you try to go away from the statistical problem and look for other properties of the solution that you believe should be important. In the situation we consider, we assume that
we have, let's say, rich data like images, and by censoring information you can reduce it to a set of variables that is sufficient to make a reasonable prediction that is invariant. Now, if you have a situation where you have hidden variables — let's say, in the case of causation, that would be confounders that you don't see — well, then you have a harder problem, because either you reconstruct the confounders, or you just say "there is a confounder that I don't see, so I can't conclude". And the machinery of, let's say, the do-calculus is going to tell you: be careful, there is something bizarre there — but you have to put that into your assumptions, essentially, and it is just going to tell you "I cannot do it; I cannot compute this do-probability on the basis of what you have; you need to make a different experiment." That's not the situation I consider, where
I have very rich data to start with — large images where maybe, somewhere in the image, there is everything I want to measure. But it's true that in physics, for instance, sometimes you cannot predict what's happening because of things you can't see, and through the history of mankind we have needed to find ways to see them. So yeah, no, I can't do that, and I don't know how to do it, and I don't know if anybody can, but it seems that, as a group of people, the scientific process has been able to do it in many interesting cases.

New off-camera voice: Hi, thank you for your
presentation. I was just wondering how the extrapolation
system works exactly, because in the MNIST example we have green and red digits, but what if you want to be invariant to color entirely and have images that are, say, purple for example? Even in the related work — for example, the domain adversarial approach — you train on a loss that knows that this example is coming from this domain, and so there's a predefined set of domains. So how does it generalize?

Leon: So in the case of this experiment, the
thing that's interesting is that it doesn't only generalize to color schemes that are in the same range as the ones I used for training; it generalizes to the completely opposite color scheme, which means that essentially the system became colorblind. Even though we had two data sets that were both biased, the fact that they were biased in slightly different ways was used to say that we don't want to use color at all, so if you introduce a new color it's still going to work. That's the good point. The bad point is that, first of all, it's not a very efficient process: when you have data sets that differ very little in the dependence between color and label, it takes a while to see it; and humans obviously do it in a much easier way, which I can't describe — I wish I could, but I can't.

Off-camera voice: Thank you. *Applause*