in computer vision applications
with structured outputs in the late 1990s and the theory of large-scale learning in
the 2000s. During the last few years he has focused on
clarifying the relation between learning and reasoning with increasing attention on the
many aspects of causation such as inference, invariance, reasoning, affordance, and intuition. Leon has pointed out that learning algorithms often capture spurious correlations present in the training data distribution instead of addressing the task of interest. These correlations occur because the data collection process is subject to uncontrolled confounding biases. But suppose that we have access to multiple data sets exemplifying the same concept but whose distributions exhibit different biases. Can we learn something that is common across
all these distributions while ignoring the spurious ways in which they differ? I think that he will be answering that question
for us today in a presentation on learning representations using causal invariance. So without further ado, let's welcome him
to the stage. *Applause*

Leon Bottou: Well, thank you
very much, Anna. Thank you for giving me a chance to speak here in front of an audience that I'm a bit afraid of, because it's much bigger than what I see here. I'm going to spend some time trying to motivate what I'm doing by first saying that I don't believe what's written in the papers about AI. You read in the papers that AI is just around the corner and everything, and I think we have some serious difficulties we need to address. To give an idea, I wanted to show something that you might find amusing. I have a thirty-year-old demo here. The numerical code was written somewhere in '89 and the graphics in '91-'92, and the data file
— I wanted to load it before — is composed of 480 digits, because this is what we could handle. Actually no, this is a demo; we could handle ten times more, but the demo just has these. We're going to use 320 for training — that's what I'm doing here, 320 here — and so that's the training set and the testing set, and I'm going to prepare a network. So I'm going to load one, and then we need
to adjust it — actually yes, because this is too small. The problem of this demo is that it runs too fast; the problem is to make it slower. Okay, there. And to make it slower, one trick is to have a lot of graphics. So these are performance curves loading, and I can add some more, some more, and maybe if I go over the training set — I have to start to initialize my weights, initialize the learning rate — and, well, you see it's too fast, so I can slow it down a little bit. So you see these digits — they are mouse-drawn, because we didn't have a camera or scanner — and that doesn't work. Now we're going to train, so let's stop this and start training. I'm going to switch off this display and the network goes up and up and up. Now we're going, so the black thing is the
training error — the black dot that you see here — and the white one is the testing error, or initially the opposite, I don't remember; it's going to be easy to see. And this is the error whenever I show an example; just plotting it makes it almost reasonable. That used to take 15 minutes. I can even go a bit faster, because this is not the point of the talk. And stop, stop, stop, okay. So the point now is that now it works — I decided to do that. It's always harder to do when I'm on the spot like this. Why, what's going on? This is the one I want. Okay, so you see the digits going by and
you see that the white curve going up and down — it means that it learns — and now I can do it a bit slower so you can see that it works. And there are some amusing things you can do: you can look at some specific patterns that are bizarre, like this one — well, I know it's a five, okay. I think there is one like this — no, it's been a while, I don't remember the difficult patterns by heart — oh no, it's working really well, okay, but it's working. You can add noise, you can do things. So that was thirty years ago. This is a convolutional neural network; it's
very small by today's standards, trained on just 320 examples, but it learns and generalizes on a task that's not completely trivial. Thirty years of Moore's law — computer speed doubles every year and a half, so that's a factor of about a million — and I told you we could run this at the time on a couple thousand examples. It's not hard to see that the fact that today we can use a thousand times more examples and a thousand times bigger networks is not very surprising. It's fundamentally the same kind of phenomenon, the same level of things. And what we wanted to do at that time —
well, we knew that this was quite interesting, we knew it was too small, but we knew that it would change, and we wanted to find a way to program computers not by programming them but by teaching them, by training them to do something — an alternative way to use computers, an alternative way to see how computers work: instead of being programmed laboriously, detail by detail, we want to just train them and make them work in a way that is a little bit more similar to us. It's not AI yet, and there are other things
in AI, but if you can get a computer to do what you want by training it, it's clearly a step. And have we done that? This is where I'm going to go to my slides, which should be somewhere here. What I've done — well, it will be in the title — is joint work with Martin, who is a PhD student at New York University, Ishaan, who was at Google at the time and is now at MIT finishing his degree, and David, who is a researcher at Facebook AI Research in Paris. Okay, that's by way of introduction, but the thing is,
this is what really puzzled me. That was 2014; neural networks were reappearing, and we were trying to show some transfer learning properties in vision. One of the tasks we looked at was the action recognition problem in Pascal VOC, and one of the actions is detecting whether somebody is giving a call, is placing a call: you have images, you have a bounding box, and if the person in the bounding box is calling, is on the phone, you are supposed to say yes; otherwise you are supposed to say, well, maybe this person is doing something else. That looks all very nice, and we got seventy
percent correct, which was the state of the art at that time, and I think it's not far from the state of the art today. But Maxim, a student who was working on this, came back and said, "You know, we got 70 percent, but it doesn't work — look at this picture." There is a row of pay phones, which is something that still existed in 2014, there is a person here, and this person is obviously not calling; but if we move the bounding box in front of the pay phone, it's going to say, very strongly, calling. In fact, whenever there is a person in proximity to a phone, it says calling. So of course it's not solving the problem of detecting whether a person is calling. And the worst part of this is that the algorithm
is right. If you take the pictures you typically get on the web, when there is a person near a phone, most of the time this person is calling. It's a selection bias: you don't take a picture of somebody who just walks by a phone; when you take a picture of somebody near a phone, that person is calling. So that means that 1) we're not solving the problem and 2) the algorithm is right. So what's missing there? And what's missing is that you have
to realize that the task of detecting whether somebody is calling is a task that we don't know how to solve directly, so we define a proxy problem. The proxy problem is a statistical problem: here is a data set of yeses and nos, try to replicate that. And between the proxy problem and the task there is a world of things that we ignore. We start machine learning by saying: let's suppose we have i.i.d. data, one part is the training set, one part is the testing set. But here you see a practical example where we do have an i.i.d. data set, because we split it into a training set and a testing set — the Pascal VOC people did that — and so it perfectly fits the assumptions of machine learning theory, and yet it's totally missing the point. Now, it turns out that if you look at what has happened
in the last thirty years: thirty years ago we would make the data sets very carefully. For instance, in the nineties at Bell Labs we worked on zip code recognition or check amount recognition, these kinds of things, and we would make data sets that were at most ten thousand or a hundred thousand examples for the biggest ones, and we would be very careful in curating them so that they represent the thing we want. This curation is quite an involved process. Nowadays, as I said, the data comes from the web in huge quantities; these are not things we can look at, and in fact the sizes that we want to use are so large that it is impossible for a human to even look at them in a careful way, and they're corrupted by plenty of biases. And if you look at the papers, there are plenty of papers that comment about
bias, like this one by Torralba and Efros, a computer vision paper where they take several object recognition databases and try to recognize a car — the car class — in each of them. Basically what they show — at that time, this was before ConvNets, before the rediscovery of the results of convolutional networks — is that when you train on one data set it performs very badly on the others; on the other hand, if you take an image and you want to train a classifier to say from which data set it comes, that works really well. So that tells you that each of these data sets is so specific that it's impossible for the learning algorithm not to catch the specifics of the data set instead of the concept you want it to have. There is also work about recommendation systems or ad placement systems
where you find a lot of causal effects that cause the data to be completely biased. And another one that's quite recent is about visual question answering, which looks like a very nice task: you get a picture, you get a question, and the computer must answer. There's a picture — so, what is the color of the tie of the man who is walking in second position? — and the answer should be right. So it seems, from our biased perspective, that if your computer can do this, it means it understands the image and the question. And very quickly systems got to seventy
percent correct, and then somebody said: stop, stop, stop, there is a problem if you don't even look at the image and just look at the question. When the question is "what is covering the ground?", the answer is "the snow". And when the question is "is there something on the something?", the answer is "yes". And that comes back to how the data was collected: first, images from the web were collected, and then a first set of Mechanical Turk workers was supposed to invent questions and a second set was supposed to give the answers. Now, the imagination of people in terms
of questions is not very big: if there is something with something on the shelf, they're going to say "is there a flower pot on the shelf?" because they saw a flower pot there, but they are not going to ask whether there is a giraffe on the shelf or something else; when there is a flower pot, it is natural to ask a question like this. So the result is that the data set is so biased that results that were looking very promising were in fact barely better than the ones you can get by just exploiting trivial biases like that — looking just at the question, or just at the image. So the lesson is that the data collection creates a lot of biases — you have confounding biases, feedback loops in the systems, you have selection biases — and we cannot control for them, and all the machine learning algorithms are going to absolutely love to take advantage of the spurious correlations. If they can find an easy way to solve the
problem, they will use the easy way, not the hard way that requires understanding something. So if I go back to these spurious correlations: when do we say a correlation is spurious? Take my phone example: why do I say that the fact that a person being close to a phone is strongly correlated with the person calling is spurious? The reason is that I do not expect that this is going to work in the future. I do not expect that my system is going to be used only in situations where I can make this assumption. And the question is, what informs us? Why do we say such things? Well, we might have substantive knowledge about
what it is to give a call: when you call, well, it's not enough to be close to a phone, because calling involves consequences — people are going to be informed of something. But we can also ask where this substantive knowledge comes from, and we have to say that, whether it is in one person or in mankind, it comes from past observations. Now, that's the problem, because the past observations,
we said, are biased. And this is where there is an interesting concept that you can find in the philosophy literature about causation, going back to Hume and other people: humans are not just looking for correlations, they are looking for stable properties. What does it mean, a "stable property"? It means that you want to see that the action of calling is connected to what you see in ways that are stable, that are not going to change. And that means that maybe, when we look
at the past, we take the data from the past and we say these things are always correlated — but the past is not uniform. Nature doesn't shuffle the data; we shuffle the data when we do machine learning. We shuffle data because we want it to be i.i.d., because this is what we understand, but in fact when we collect the data, we collect it at different points in time, different points in space. If you take even the calling example and take it at different points in time: nobody is going to have a cellphone; in the past it is going to be a big phone with the rotary dial. And if you look in different countries it might be different too, because people might hold the phone differently, they might have different equipment; or in different experimental settings, you can have a high-resolution camera or a low-resolution camera. Sometimes I think that when you look at ImageNet and you recognize all these dogs — you know, when you take a picture of a dog, everybody takes the picture with the same kind of phone, from the same distance, with the same focal length, because this is how you take a picture of a dog — am I recognizing the dog, or the subtle noise patterns that tell you which phone was used and how it was set up? And so then we shuffle the records, we take
all this data, we mix it, and we say it is i.i.d. and we can proceed with machine learning. And that clearly throws away a lot of information. So we started to follow a line of work that comes from Jonas Peters — around 2016, maybe before, with Peter Bühlmann and Nicolai Meinshausen in Zurich, around 2015. We consider that the data set we have doesn't come from a single distribution, but comes from several distributions that we call environments. Initially, a discrete set of distributions P_e, so we have (X_e, Y_e) for e = 1, 2, 3, ..., a small number, and so we have a bunch of these distributions, and we have training sets — which I'm going to assume are large from now on — provided for some of these distributions, and we want a predictor that's going to work for many of them. And the important point is that I'm not going to assume that the distributions I have are a kind of random sample of the possible environments. What I call an "environment" is one of these distributions; I'm just going to say I observed them. So when you have a situation like this, the
classical way in statistics is to try to be robust. You're going to say, "I'm going to minimize the maximum over all environments." So if you look at this formula here — maybe I'm going to use the mouse so that it is visible — you have the minimum, over your family of functions, of the maximum over all environments of the squared error, and you can have a per-environment baseline that I'm going to discuss later. That says that you want something that's going to perform well on all of them, not just one of them or a mixture of them.
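My reconstruction of the formula on the slide (the notation is assumed): write R_e(f) for the expected squared error in environment e and r_e for the per-environment baseline, so the robust objective reads

$$
\min_{f \in \mathcal{F}} \; \max_{e} \; \Big( \mathbb{E}_{(x,y)\sim P_e}\big[(f(x)-y)^2\big] - r_e \Big)
\;=\;
\min_{f \in \mathcal{F}} \; \max_{e} \; \big( R_e(f) - r_e \big).
$$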
And then you realize there is a problem. Suppose you take the calling example and you have two environments: one of them is made of very clean pictures taken last year, and the other one is made of pictures from the '20s — black-and-white, grainy, not pretty. Well, it's going to be harder, so the error you're going to make on the old pictures is going to be higher, because you don't see them nearly as well, even though the phones are way bigger — but that's another detail. So you might say: well, for each of these environments,
maybe I want a baseline — and I'm not going to say how to compute the baseline — and what I want is that by having a single function for all environments, instead of one trained for each environment, I'm not losing too much compared to the baseline. I mean that the relative loss of accuracy compared to the baseline is not large, even though I'm going to use the same function to work in all these environments. But when you have this, you can start doing
a bit of mathematics and restate the problem. I say f is the argmin of M subject to, for all environments e, M being greater than what I want to minimize. And when you have a constrained problem you can use Karush-Kuhn-Tucker theory, which basically tells you there is a set of nonnegative lambdas such that the solution of that problem is a first-order stationary point of a proper mixture of my squared errors. So I'm back to my initial point: if I want to minimize this problem, the only thing I have to do is mix my environments in the right proportions and I'm going to satisfy this condition. And the worst part is that by changing the baseline I can change the lambdas any way I want; choosing the lambdas or choosing the baselines is the same problem, and so the robust approach amounts to mixing the environments in the correct proportions.
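Again as a reconstruction with the same assumed notation: the min-max problem becomes a constrained problem, and the KKT conditions give multipliers that turn it into a weighted mixture of environment risks,

$$
\min_{f,\,M} \; M
\quad \text{s.t.} \quad R_e(f) - r_e \le M \;\; \forall e
\qquad\Longrightarrow\qquad
\exists\, \lambda_e \ge 0 \ \text{such that} \ \nabla\!\Big(\sum_e \lambda_e R_e\Big)(f^{*}) = 0 .
$$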
So here is an example: I have four distributions, P1, P2, P3, P4, and the robust approach says I'm going to be good on all of them, and therefore I'm going to be good on the whole convex hull of these distributions. And this is attained by minimizing a specific mixture. Now, the interesting thing is that there might be distributions outside of this convex hull — with barycentric coordinates that are slightly negative — and such a thing can be a legitimate distribution, but I have no guarantee on it. The approach doesn't say anything about anything outside. Is that important? Let's take another example. Take a search engine. My example is made up, but if you take a search
engine, it's often interesting to classify the queries into different categories — like whether the query is commercial in nature or navigational in nature — that's quite important. And suppose that I have a number of environments which are the sets of queries from today, yesterday, the day before yesterday, and so on. Quickly you're going to see that there are three kinds of queries: you have the queries that remain constant, whose frequency is the same over all this period; some of them are growing, like for instance when you come close to a particular event — come close to Christmas, queries about Christmas presents are growing; and some of them are decreasing — after Christmas, the queries about Christmas presents are decreasing or going away. And the growing and the decreasing ones, they
are a very small subset of everything. So if I take my triangle here and say these are the many queries that stay the same, these are the ones that are decreasing in popularity, and these are the ones increasing in popularity, then my four days form a very small set of points that are very close to this corner. If I'm robust, I'm going to perform well in that little domain here; but in fact, if I wait, let's say, one month, I'm going to move away in that direction. So what you see here is an example where having a guarantee that works only in the convex hull of my environments is not really sufficient: I can do better and I would like to do better. In fact it's a problem of interpolation versus extrapolation, and interpolating is something that we understand and can do quite well; extrapolating is always a mystery. So if we go back to this idea of learning
stable properties — an idea you find in a number of old philosophical works. Suppose — I'm going to come back to my problem of calling — that you have a set of pictures taken from the web: there is a selection bias, and pictures where I see a person near a phone often represent the person calling. Now suppose I also have little movies, and let's say the movies have the same selection bias, because if I take my little movie, at some point in the movie somebody close to a phone is giving a call, is placing a call. But if I take the frames of the movie, even though at some point the movie is going to show the person calling, there is the before-calling and the after-calling: I have a lot of frames where somebody was close to a phone but not calling yet, and close to a phone and not calling because the call is finished. And that's interesting, because it means that if I take images from these different sources — pictures taken from the web or frames taken from movies — they both have the selection bias, in the sense that the correlation between the proximity of a person and a phone and the event of calling is high, but they have it in different ways. They differ in strength, and if it differs in
strength, it means that if your regression system has the choice between two kinds of things — like, for instance, features that represent the shape of the person and the position of the hand, and features that represent the presence of objects nearby — the regressions will be different: if I compute the regression on the first data or on the second data, they're not going to rely on the same features in the same way, because in the first case the proximity is a more reliable indicator of calling, while in the second case it is a bit less reliable, enough to use a little bit of something else, even though using the proximity is still very favorable in terms of accuracy. The idea is that we would like to learn phenomena that
remain invariant across environments. We really want to learn a regression or a classifier that uses the features in the same way across environments; if one feature is wobbly, because it has different strengths in different environments, we're going to say that this one is suspicious. So this idea is very related to the notion that we don't take all the data as a single distribution; we look at its interior structure and say that if some correlation is maybe highly predictive but changes in strength across environments, we see it with suspicion. So why is it interesting? Let's first consider invariant regression, which
is a strong requirement. Suppose that instead of minimizing the maximum error across environments, I'm searching for a function that minimizes the error for all environments simultaneously. It's not guaranteed that such a function exists, and it is not so simple to find it, but what does it mean in terms of mixture coefficients? If f* is a stationary point of my error for all environments, it's also a stationary point of any superposition of my environment errors, and that's true for all lambdas, positive or negative; if some lambdas are negative this is still true. So the invariance property is in a sense stronger: we want something that's way, way stronger — maybe hard to achieve — but if we achieve it, we don't just generalize to the distributions that are in the convex hull of the ones I have, but to the biggest extent that we can reach with, let's say, negative barycentric coordinates, which is what I wanted to say here.
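In other words — my paraphrase of the argument, with the same assumed notation —

$$
\nabla R_e(f^{*}) = 0 \;\; \text{for every } e
\quad\Longrightarrow\quad
\nabla\!\Big(\sum_e \lambda_e R_e\Big)(f^{*}) = 0 \;\; \text{for any coefficients } \lambda_e \in \mathbb{R},
$$

including mixtures with negative coefficients, which is what takes you outside the convex hull.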
There are some trivial existence cases, like the noiseless case: maybe there is a function that works and classifies everything correctly, so it exists, and these trivial existence cases annoy me because I don't know how to deal with them very well. But I'm going to be interested in the cases where there is no single f* that minimizes the regression error for all my environments. And I'm going to say: well, rather than playing with the
function family, maybe I should go straight to the point. I'm going to say that I want to find a representation Phi(X), and on top of this representation I want the relation from Phi(X) to Y to be invariant across environments.
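As a rough reconstruction of this goal — I believe this matches the bilevel formulation in the paper this talk is based on, with notation assumed —

$$
\min_{\Phi,\; w} \;\; \sum_{e} R_e(w \circ \Phi)
\qquad \text{subject to} \qquad
w \in \operatorname*{arg\,min}_{\bar w} \; R_e(\bar w \circ \Phi) \;\; \text{for all } e .
$$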
So the idea there is that if my problem is noisy enough — and I don't know yet how we will have to deal with noiseless problems, that's an issue — but if it is noisy enough, the only way I can find a function that's going to minimize my regression error for all the distributions, all the environments, is by first projecting my patterns into a certain representation that essentially eliminates the features that are unstable, the features that are spuriously correlated with what I want. And that gives the idea that we can use this criterion of invariance not just to learn, but also to guide the creation of features, and that's very different from plain recognition. When you do a neural network and you have
hidden states in the network, these hidden states are created in a way that permits the best possible prediction. Here we're not interested in the best possible prediction; we're interested in a prediction that remains invariant across environments, and that's something that's very, very important in science. For instance, suppose that you are watching an
apple falling from the tree. What do you have? You just have an image or a little movie, this is what you see, and you could pay attention to a lot of things. You could pay attention to the color of the apple; you can observe that when the apple is falling, very often it's red, because it's ripe, but that doesn't tell you anything very important about the trajectory of the falling apple. You could look at the size of the leaves, the size of the tree, the size of the trunk, the number of leaves on the stem of the apple. But in fact if you look at the right variables, let's say the position and the speed, you observe that all your apples now obey exactly the same equation, and because you observe it on all the apples, you can say: oh, maybe it's going to work on all falling objects, maybe I have found something more important. So in that case we know it's not going to
work on kites, for instance — it's not going to work that way — but for a number of falling objects the same equations are going to work. So there is some value in finding exactly the same solution across environments, in the sense that finding exactly the same solution is a hint that this solution is a bit more general, and it's also a hint that the data, or the representation of the phenomenon, in which you find this invariant solution is something important. Now, there's a lot of related work. The first point is that this idea of invariance
is in fact related to causation, and this has been known for a long time: there is work by Nancy Cartwright, for instance, and there are people working in philosophy, like epistemologists. But it's easy to understand in the causation framework of, let's say, the statisticians: what you want to do is not predict what the system is going to do, but predict what the system is going to do when you intervene on the system. For instance, if you want to look at
the efficacy of a drug, you can run some tests and everything, but what you want to know is: if I give that drug to everybody, will the population be better? And this is not obvious. When you want to do this, there are actually two things you can use: you can use your knowledge of the intervention — I'm going to give the drug to everybody — and the second one is that you can use what you believe remains invariant before and after the intervention. Initially you give the drug just to a little set of test people, and afterwards you give the drug to the whole population; if the probability of getting better, given that you have the drug or not and given all the variables of interest, is preserved, then you can use that. So you're looking for properties that are invariant, and in fact all the tools like the do-calculus or ignorability assumptions are tools to try to model these kinds of things that are invariant. For instance, you have a graph, you intervene
on the causal graph, and you can detect that some conditional distributions are going to be invariant and some are not going to be invariant, and do some kind of calculus. Now, invariance is attractive for learning because reconstructing causal graphs from data has proven very difficult, while learning *Audio cuts out* that are stable across time or across various conditions, you actually do something that's as powerful as doing causal inference, because you get half of it — the part you can use directly — and it seems to be an interesting alternative. So I mentioned the paper of Jonas Peters that
was a big inspiration for this work. The paper of Jonas Peters considers causal graphs on which you intervene — this is what the little hammers show — and each causal graph with its intervention describes a slightly different distribution, because of the intervention. Now you get all these distributions, you have a variable of interest, this is Y, and you're going to try to find an invariant regression for Y. In the case of Jonas, he assumes that all the variables are known, so the representation is just selecting which variables I'm going to regress from. And what he shows is that, under caveats that are mostly technical, if you find an invariant representation — if you find a set of variables such that when you compute the regression from these variables to Y you get the same one in every environment — then you have found the direct causes of Y in the graph. Which is a very nice result; now, the limitation
of this, of course, is that you have to assume that you know which are the important variables; I want to assume you just get a bunch of pixels. Finding the important variables is the difficult part, and this is just about narrowing down: you assume here that you have a small set of variables and you know that the important ones are a subset of them, and you just have to refine; but finding what's important to measure to start with is the difficult part. So another related topic is adversarial domain
adaptation, which is a recent thing, and the goal is to learn a classifier that does not depend on the environment — the idea is that you want to learn a classifier that you are going to train on some distributions and that is going to work well on the other ones. The simplest version adds an adversarial term that says: if you take the states from a hidden layer somewhere and try to classify which environment the data comes from, you cannot do it anymore. So you're trying to find a classifier that has a hidden representation from which you cannot recover the environment. And if you think about the paper I mentioned earlier, the observation that, given an image, you can tell from which data set it comes is already a problem, so they're trying to alleviate this by saying: I'll take my image and I'll map it to a set of features from which I cannot recognize the data set anymore. Now, you realize this is too strong, because
it might be that the different environments have different probabilities of the yes/no answer. If you say that these features do not allow you to recognize the environment, it means that the distribution of these features in all the environments is the same — if it were different, you could say from which environment a sample comes better than chance — and if they're the same, it means that the distribution of the class label you're going to predict is the same in all environments, which is a bit too strong. And there are some other variants, and basically the question is whether you force P(H), the distribution of the hidden layer, to be independent of the environment, or P(H, Y) jointly, or P(Y | H), or P(H | Y). What we do is weaker: we just want the regression
to be the same. And finally you have robust learning, which is something that happens — let's say maybe the most common, the most popular idea about this is the PGD approach to resist adversarial examples. You say essentially that instead of minimizing on the distribution of the data, you're going to define a set of neighboring distributions and minimize the maximum of your error on all these distributions. In contrast, we use multiple environments which come from the data — they are not defined a priori — and then we ask for invariance. How much time do I have? 15 minutes? Okay, so I'm going to start with the linear
case. So in the linear case, you have X; the representation function is a matrix S, the regression is a vector v, and in fact the whole operation is linear — it is some vector w — and you can see already that this is very over-determined, because you could change S a little bit and compensate in the regression back and forth, and you also have lots of degenerate solutions. What's interesting is that if I choose S equal to 0, of course it's invariant, but it's not very good, meaning that I'm just eliminating all the features. You can see that what matters is the null space of S, which is the information being censored by my system. Another difficulty is that if what matters is the null space of S, a small change of S can completely change that null space: in the vicinity of a singular matrix there are plenty of non-singular matrices whose null space is just the zero vector, so it's a finicky criterion to minimize. But if you do a little linear algebra you can characterize the solutions, and you characterize them not in terms of S and v but in terms of w, the whole thing. You can see which w's satisfy the invariance property: in fact, a w satisfies the invariance property — meaning that there is an S and a v, with the same v for all environments — if and only if a certain condition is true. Then we can reconstruct the S — there are lots of them. In the least-squares case this condition represents ellipsoids: essentially, for each environment, you have an ellipsoid in w-space, and they all pass through 0 — w = 0 is a solution, not an interesting one — and the w's that have the invariance property are those that are at the intersection of all these ellipsoids. Which is sort of bad news, because the intersection of ellipsoids typically is not connected, so it's going to be a bit difficult to search.
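Here is a sketch of the condition I believe is being described, for the least-squares case with assumed notation: write the representation as $\Phi(x) = Sx$, the regression as $v$, and the end-to-end predictor as $w = S^{\top}v$. Requiring $v$ to be simultaneously optimal for every environment implies, for every environment $e$,

$$
w^{\top}\,\mathbb{E}_{P_e}\!\big[X X^{\top}\big]\, w \;=\; w^{\top}\,\mathbb{E}_{P_e}\!\big[X Y\big],
$$

and each such quadratic equation defines an ellipsoid through the origin in $w$-space; the invariant predictors lie in the intersection of these ellipsoids.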
I'm going to skip the part about computing ranks and high-rank solutions, and maybe go straight to IRM. So one possible idea is to use this criterion as a regularizer: you have all these ellipsoids and you're going to say, "in order to be close to the intersection of the ellipsoids, I can add a term to the cost." *Audience member interrupts* They all share S; everything is shared, there is only one classifier at the end. So the question is what information S is going to remove in order to make sure that, after applying S, the linear regression is the same for all environments. So there is only one S, only one v. And it's not every w that has this property, so I can regularize towards them; and since I have an ellipsoid per environment, I can measure the distance to each ellipsoid, which happens to be a fourth-degree thing — which is not fun, but we knew it. One way to look at it is to say: I have S and v, and I can insert a dummy multiplier here, theta, that's fixed at 1 — I'm going to say it's one — but I can compute the derivative of my cost function with respect to theta, and it turns out that this is exactly what I want.
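As a concrete illustration, here is a minimal PyTorch sketch of that kind of penalty: freeze a scalar dummy multiplier at 1 and penalize the squared gradient of each environment's risk with respect to it. The function names are mine, and I use a binary cross-entropy loss for illustration, whereas the formulas in the talk are written with squared error.

```python
import torch
import torch.nn.functional as F

def invariance_penalty(logits, y):
    """Squared gradient of the per-environment risk w.r.t. a dummy multiplier frozen at 1."""
    theta = torch.ones(1, requires_grad=True)             # the dummy 'theta' fixed at 1
    loss = F.binary_cross_entropy_with_logits(logits * theta, y)
    grad, = torch.autograd.grad(loss, [theta], create_graph=True)
    return (grad ** 2).sum()

def objective(model, envs, penalty_weight=1e4):
    """Average risk over environments plus the invariance penalty."""
    risk, penalty = 0.0, 0.0
    for x, y in envs:                                      # one (x, y) batch per environment
        logits = model(x)
        risk = risk + F.binary_cross_entropy_with_logits(logits, y)
        penalty = penalty + invariance_penalty(logits, y)
    return (risk + penalty_weight * penalty) / len(envs)
```

The penalty is small exactly when none of the environments would benefit from rescaling the frozen multiplier, which is the "no environment is calling for a change of theta" idea described above.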
So basically what I'm saying here is: maybe you've heard of domain adaptation layers — you have a system and you have various domains, or environments, and you say, well, I'm going to train everything jointly, but for each environment I'm going to use a little extra layer that I'm going to optimize for that domain. You can see theta as a domain adaptation layer, a trivial one, and what I'm saying is that actually I don't need to adjust it: I'm looking for a solution such that, if I had a domain adaptation layer, I wouldn't need to change it to model all my environments properly. And there is an equivalence between the two approaches. This way of looking at it is interesting because if the model is nonlinear I can still make the same kind of reasoning: I can still say, well, I'm inserting a frozen domain adaptation layer — it is the identity, it doesn't change anything — but what my regularization term says is that when I go and look at all my environments, none of them is calling for a change of theta.
So I'm going to take an example; I call it Colored MNIST: digits with misleading colors. We take MNIST and we split it into two classes, the low digits zero to four and the high digits five to nine. I told you I need noise in my system, because I'm going to use noise to constrain the representation, so I'm adding 25% label noise: the highest classification accuracy I can achieve by using the shape of the digits is 75%. But then I'm going to add colors. I'm going to say that if my class Y is 0, my digit is going to be red with probability 1 minus e and green with probability e, meaning that my low digits are going to tend to be red and my high digits are going to tend to be green, and my two environments are going to be defined by setting e to 0.1 and 0.2. That means that if I use only the color, I'm going to classify better than if I use the shape: if I use the shape I can only be 75% correct, while if I use the color, in one environment I get just 10 percent error and in the other environment 20% error, both of which are less than the 25% error which is the best I can achieve with the shape because of my level of noise.
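For concreteness, here is a rough sketch of how such an environment can be built (my guess at the construction being described, not the exact code used in the experiments): binarize the digit label, flip it with probability 0.25, then color the image so that the color agrees with the noisy label except with probability e.

```python
import torch

def make_environment(images, labels, e):
    """Build one Colored MNIST environment whose color/label correlation is 1 - e."""
    def bernoulli(p, size):
        return (torch.rand(size) < p).float()

    y = (labels >= 5).float()                        # low digits -> 0, high digits -> 1
    y = torch.abs(y - bernoulli(0.25, y.shape))      # flip the label with probability 0.25
    color = torch.abs(y - bernoulli(e, y.shape))     # color agrees with y, except with probability e
    images = torch.stack([images, images], dim=1)    # two color channels (say, red and green)
    # erase the channel that does not correspond to the chosen color
    images[torch.arange(len(images)), (1 - color).long(), :, :] = 0
    return images.float(), y

# Two training environments and a color-reversed test environment, as in the talk:
# train_envs = [make_environment(x1, y1, 0.1), make_environment(x2, y2, 0.2)]
# test_env = make_environment(x3, y3, 0.9)
```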
So I'm training with e equal to 0.1 and 0.2. If I train with normal training, minimizing the empirical risk, well, I get something about halfway between 10 and 20% error, which is normal; but if I test with e equal to 0.9 — I reverse the color scheme — well, it doesn't work at all. If I do the same training but add the invariance regularization term — and it is very painful to train, it's not a nice numerical problem, it's slow and everything — I get consistent performance. So basically I was able to say that because the relation between my pattern and the color is not stable, I don't want to use it and I have to rely on the shape, which is corrupted by noise in that case, which doesn't make it very easy. And if you look at the output of my classifier —
these are dots corresponding to examples in each environment, and you see that for 0.1 and 0.2 you get an answer like this, but for the green one you get the opposite answer, because you rely on the color and the color has been switched. If I train with the invariance penalty, well, it's more reasonable, even though I have a little bit of junk that remains. So it's a small example, and it works only
when it's very noisy, which is the problem, and the next question is how to scale these kinds of ideas up. This is where we start having problems. First of all, we have numerical issues: the regularization is very non-convex — we are essentially targeting the intersection of plenty of ellipsoids, and the intersection points are all over the place, they're not connected, so it's not going to be easy. And then we have a different problem, the realizable problems: many of the problems that are interesting for people nowadays are problems where you can achieve essentially zero loss, and there I don't have the noise that allows me to make the system work, so I have to find a way. In fact, if I go back to this realizable case, which is the case where, well, you know, there is a function that is able to fit everything,
let's look at my little observer setup. I have a phenomenon, a scene, and there is a variable of interest that exists in the scene — maybe it is not observable directly — that I call Y-pre, before labeling. I get some kind of image X, and I get a bunch of people that I call *indecipherable* — this is what we do in supervised learning — and they're giving me labels that are Y-post. Suppose the label is whether a person is calling, for instance: Y-post is the label that the labelers give, and it's not necessarily the truth, because somebody could be calling while hidden, and if you don't see the person calling, the labelers are going to say no, and they're going to agree, but in fact the person was calling. So there is a slight difference between the Y that is reality and the Y that comes from the labelers. Now, the labeling process is often designed
to be as deterministic as possible: we train the labelers to be consistent, we ask the questions in a way where there are not many ways to give answers. So we're trying to make sure that there is a function that works for all the environments — and that's true for the post-labeling Y, not necessarily for the pre-labeling Y. Okay, so what this means is that when you have a supervised problem where the labels are given by labelers, we have artificially created the situation in which there is an invariance, not because we found an invariance in reality, but because the action of the labelers was invariant, and because there is no noise at all it is hard to sort these out. So that's my problem at the moment. So,
if I should conclude this talk: I said that the statistical problem is only a proxy, but something very important; and between the real problem and the statistical proxy there is a huge gap that we haven't explored. This initial idea that maybe you can use computers in a different way, by teaching them to do things instead of programming them, is not going to be achieved unless we understand what sits in the gap between what we want to do and the statistical proxy. The second point is that nature doesn't shuffle the examples; we shuffle the examples because it matches our ideas about learning, but by doing so we are removing a lot of useful information. One piece of this information is about the
properties that are stable, and when you start looking for things that are stable, invariant across environments, you start making sense of extrapolation. The idea that you can extrapolate to new environments sort of makes sense in this situation, because when you are optimal for each of your environments, you are also optimal for mixtures, possibly with negative barycentric coordinates, so you can go beyond strict interpolation. Then, invariance across environments is related to causation — it's an alternative view of causation, as I explained, and there are formal results connecting invariance to causation. You can try to find invariant representations that enable invariance across environments, and that only works if it happens that not all representations enable invariance — if everything is invariant, it doesn't work that easily. And this is why, in my program at the end, I want to understand how to slightly change these concepts to make them applicable to the realizable problems, the ones with zero loss, which are tricky in different ways. That's it, thank you. *Applause*

Off-screen voice: Okay, so let's have five
minutes for questions.

Leon: I can repeat the question. Okay, so the question is about what I think of policies that try to determine which form of data augmentation works best. First of all, that could work, but it is not what I'm looking for, and the reason is that when you speak of data augmentation, you are in a situation where you define a ball around the distribution that you have. You say: instead of taking the examples I have, I perturb the examples and I create a combined distribution. So data augmentation
is just a way to look in a small ball around your distribution, and this small ball is arbitrary, because you're perturbing the data in ways that you think are good — like, for instance, you're thinking maybe I want to have more rotation, more translation, more jitter, or more color changes. But what about the thing you didn't think of? Like, for instance: oh, it turns out that all the pictures of potatoes were taken with the same focal length and the same kind of camera, and that's detectable in the noise. Well, you didn't think of that one, and if you didn't think of that one you're not going to be able to fix it. So somehow we want to be able to handle these kinds of situations, and for this we need extra information, and that extra information, I think, can be found in the data of the different distributions. So the question is about the connection between
looking at the data this way versus SGD, because SGD theory is very often built on the i.i.d. assumption. Well, this is why we looked at it as saying: we have a set of sub-distributions, and we assume we have training data from each of them, and so we can do SGD on each of them, or we can try to have a regularizer on top of it. But it's true that in many situations the formulations you can give to a problem like this are challenging for SGD. It's difficult to find the SGD procedure that computes the right solution, and you have people who do it with adversarial means and other things — it's very hot and interesting — but you have to realize that it's awfully slow, you know: just training a GAN is terrible compared to training a plain network. I couldn't show you a thirty-year-old demo of training
a substantial GAN, because it was totally out of reach; I can show SGD with a CNN on digit recognition, no problem. So the adversarial approaches, and having constraints on the distributions — saying that the distribution of the hidden layer and Y must satisfy certain relations — are always difficult to express with SGD, but that's normal, because we are going away from having a criterion that is a simple average over my data, where SGD optimizes the average over the data. Yeah, that's a problem.

Off-camera voice: So yeah, great talk — like,
I love the paper. For me, intuitively, it makes a lot of sense when you're doing classification: you want to find the invariant relationship, it's kind of like spot-the-difference — hey, there's always a cow in all these photos. But suppose I'm trying to do an inference task and I use your IRM predictor, and it's something like the expected outcome of Y conditional on, I don't know, treatment and covariates — what is the invariant relationship in that sense, when we're thinking about, you know, causal inference?

Leon: If you take causal inference
with graphs, and you take the typical situations of confounding and unconfounding: if you just collect conditional distributions on the basis of what you observe, without conditioning on the potentially confounding variables, the conditional you obtain is not invariant when you apply the intervention. On the other hand, if you condition with respect to all the potential confounders, the distribution you obtain is then invariant.

Off-camera voice: So IRM will just not work if
there's any, like, you know, unobserved confounding that you're not accounting for?

Leon: Well, first of all, IRM is not an end, it's a beginning, so it doesn't do much yet. We were happy to make Colored MNIST work, and we have a hard time, you know, making it bigger or something like this. But what's interesting is that it's a direction where you try to go away from the statistical problem and look for other properties of the solution that you believe should be important. In the situation we consider, we assume that
we have, let's say, rich data like images, and by censoring information you can reduce it to a set of variables that is sufficient to make a reasonable prediction that is invariant. Now, if you have a situation where you have hidden variables — let's say, in the case of causation, that would be confounders that you don't see — well, then you have a harder problem, because either you reconstruct the confounders, or you just say "there is a confounder that I don't see, so I can't conclude". And the machinery of, let's say, the do-calculus is going to tell you: be careful, there is something bizarre there — but you have to put that into your assumptions, essentially, and it is just going to tell you "I cannot do it; I cannot compute this do-probability on the basis of what you have; you need to make a different experiment." That's not the situation I consider, where
I have very rich data to start with — large images where maybe, somewhere in the image, there is everything I want to measure. But it's true that in physics, for instance, sometimes you cannot predict what's happening because of things you can't see, and through the history of mankind we have needed to find ways to see them. So yeah, no, I can't do that, and I don't know how to do it, and I don't know if anybody can, but it seems that, as a group of people, the scientific process has been able to do it in many interesting cases.

New off-camera voice: Hi, thank you for your
presentation. I was just wondering how the extrapolation
system works exactly, because in the MNIST example we have green and red digits, but what if you want to be invariant to color entirely and have images that are, say, purple for example? Even in the related work — for example, the domain adversarial approach — you train on a loss that knows that this example is coming from this domain, and so there's a predefined set of domains. So how does it generalize?

Leon: So in the case of this experiment, the
thing that's interesting is that it doesn't only generalize to color schemes that are in the same range as the ones I used for training; it generalizes to the completely opposite color scheme, which means that essentially the system became colorblind. Even though we had two data sets that were both biased, the fact that they were biased in slightly different ways was used to say that we don't want to use color at all, so if you introduce a new color it's still going to work. That's the good point. The bad point is that, first of all, it's not a very efficient process: when you have data sets that differ very little in the dependence between color and label, it takes a while to see it; and humans obviously do it in a much easier way, which I can't describe — I wish I could, but I can't.

Off-camera voice: Thank you. *Applause*