Invariant Risk Minimization

Captions
Let's figure out what that means. Hello everybody, thank you for the introduction. Today we're going to talk about a very interesting topic, invariant risk minimization, which was proposed by Martín Arjovsky and his co-authors in this article. Let's start with a simple example.

Imagine you have a simple classification task: you want to classify between images of cows and images of camels. What's the problem? I just take a convolutional neural network, a bunch of images of cows, a bunch of images of camels, I run the forward pass, backpropagate, and eventually I obtain a model with a very low training loss that seems to be able to tell cows from camels. So what can be the problem? The problem is in the images: the images of camels have this sandy, beige background, whereas the images of cows have these nice green, grassy landscapes. Basically, our network just cheated: it averaged the color of the background and predicted the output from that. This feature is really very strong, but it is spurious: it is not the natural course of things, not the reason why images actually are images of cows or images of camels.

Here we come to the classical correlation-versus-causation dilemma. You see, our modern machine learning algorithms tend to absorb all the correlations they can find in data, and in data we can find a lot of spurious correlations induced by various biases; those biases can be random factors, our way of collecting and processing the data, anything else. One way or another, we always have spurious correlations in the data that we don't want to learn. So we have a very non-trivial problem: we want to identify the properties of our data that describe spurious correlations and get rid of them, and to identify and use the useful properties that represent the phenomenon of interest, like animal shapes in our example.

What can we notice? Those spurious correlations are not stable: they tend to vary from dataset to dataset, or from environment to environment, whereas the true causal properties are invariant: they are stable, they are present in all environments, in all datasets. We can use this to build predictors that will generalize well to new testing environments.

By the way, the authors of this article also notice that the shuffling we commonly do before training is not that innocent, because by shuffling we destroy the information about how the data distribution changes from dataset to dataset, and hence we can no longer track which features are stable and which are not. (From the audience: doesn't shuffling just give us i.i.d. data?) Yes, we have this very strong assumption of independent, identically distributed data, but when you collect your data objects from different environments, they are not identically distributed, and when we mix them into one big dataset we destroy the information about how the environments differ. That is just a side point stated by the authors, not the main idea: you don't want to shuffle data from different environments into one big dataset.

Okay, so what will our strategy be? We will assume that our training data is collected from several distinct, separate environments, and we will promote learning algorithms that absorb only the correlations that are stable across these training environments, hoping that these will generalize well to new testing environments. In our example that would be very simple: we could, for instance, collect images of cows from
different countries. As you can see, cows in the Netherlands always graze on these juicy green grassy landscapes, whereas cows from Corsica actually have a background very similar to the images of camels. Using such separate environments with different landscapes for the cow images, we can hope that our invariance-promoting algorithm will notice this and get rid of the strong but spurious correlation.

Here we come to the spotlight of the paper: the invariant risk minimization principle. It says that in order to learn invariant predictors, we need to find a data representation such that a classifier on top of that representation is optimal in all environments at once. I understand that this is not clear at all by now, but by the end of this talk I hope things will get more apparent.

Let's be more concrete: the basic formulation. We consider several datasets, each collected from a different training environment $e \in \mathcal{E}_{tr}$, where $\mathcal{E}_{tr}$ is a subset of the set $\mathcal{E}_{all}$ of all possible environments. Our goal, as always, is to predict the output from the inputs, and we will promote the following objective: minimize the maximum risk across all environments,

$$R^{\text{OOD}}(f) = \max_{e \in \mathcal{E}_{all}} R^e(f),$$

where $R^e$ is just the empirical risk in environment $e$.

Why does that goal make sense? Consider the following example; I'll write it down because we will need it further on. We have two input variables: $X_1$, which is Gaussian with environment-dependent noise; the output $Y$, which causally depends on $X_1$ plus noise of the same magnitude; and the spurious variable $X_2$, which depends on the output plus some fixed noise:

$$X_1 \leftarrow \mathcal{N}(0, \sigma_e^2), \qquad Y \leftarrow X_1 + \mathcal{N}(0, \sigma_e^2), \qquad X_2 \leftarrow Y + \mathcal{N}(0, 1).$$

So the environments can vary, first of all, in the magnitude of the noise $\sigma_e^2$; for example, we can have two training environments with two different noise magnitudes. The authors also state that we can vary the structural equations of the input variables: for $X_2$, instead of this equation we could have something like $Y^2$ plus some Gaussian noise, or just the constant $10^6$, anything you want. The thing that remains stable is the causal relation between $X_1$ and the output $Y$.

So we have this setup, and imagine we simply fit linear least-squares regressions on top of it. We can regress on $X_1$ alone, the causal variable, and obtain $\hat{Y} = 1 \cdot X_1$, i.e. the coefficients $(1, 0)$. We can regress on $X_2$ alone, the spurious variable, and obtain $\hat{Y} = \frac{\sigma_e^2}{\sigma_e^2 + 1/2}\, X_2$. And we can regress on both and obtain $\hat{Y} = \frac{1}{\sigma_e^2 + 1}\, X_1 + \frac{\sigma_e^2}{\sigma_e^2 + 1}\, X_2$. You can clearly see that only the first regression is stable, even across our two training environments that differ only in the noise magnitude, whereas the other two are not. So it is the true causal dependency between $X_1$ and $Y$ that induces an invariant linear regression.

(From the audience: why $X_1$?) Because, as we'll see, the regression on the true causal variable is the only regressor with finite out-of-distribution risk, the maximum risk across all environments: as I said, we can vary the structural equation of $X_2$, and if there is a testing environment where $X_2$ takes very large values, any regression with a nonzero $X_2$ coefficient will have an arbitrarily large loss. Note that the out-of-distribution risk is the maximum across all environments, so within one particular environment a spurious regression can still look perfectly fine.
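As a quick numerical check, here is a minimal numpy sketch (my own illustration, not code from the paper) that samples two environments of this model and fits the three regressions; only the coefficient of the regression on $X_1$ stays the same across environments.

```python
import numpy as np

def sample_env(sigma, n=100_000, rng=None):
    """One environment of the toy SEM:
    X1 <- N(0, sigma^2), Y <- X1 + N(0, sigma^2), X2 <- Y + N(0, 1)."""
    rng = rng or np.random.default_rng(0)
    x1 = rng.normal(0.0, sigma, n)
    y = x1 + rng.normal(0.0, sigma, n)
    x2 = y + rng.normal(0.0, 1.0, n)
    return x1, x2, y

def ols(features, y):
    """Ordinary least squares without an intercept."""
    X = np.column_stack(features)
    return np.linalg.lstsq(X, y, rcond=None)[0]

for sigma in (1.0, 2.0):                          # two training environments
    x1, x2, y = sample_env(sigma, rng=np.random.default_rng(42))
    print(f"sigma = {sigma}:",
          "on X1:", ols([x1], y).round(2),        # ~ [1.00] in every environment
          "| on X2:", ols([x2], y).round(2),      # ~ sigma^2 / (sigma^2 + 1/2), varies
          "| on both:", ols([x1, x2], y).round(2))  # ~ [1, sigma^2] / (sigma^2 + 1), varies
```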
And indeed, in our training environments the regressions that use $X_2$ are much better than the regression on $X_1$, because the noise on the causal path is higher than the noise in the spurious correlation. (From the audience: so we cannot just demand more variation?) Yes, exactly: we cannot have as much variation as we want, we only have several environments. And we cannot fully solve this problem; this is what the article is about: even when we have not all possible environments but only a subset of training environments, we can still try to identify which dependencies are truly causal and which will vary from environment to environment and fail to generalize.

So let's make it clear: this example shows that the out-of-distribution risk is something that makes sense if we want to search for the true causal dependencies in data. We can also reason about its drawbacks: it is the maximum loss across all possible environments, so it is absolutely infeasible to minimize in practice. But in this example it does help us fetch the true causal regression, because all the regressions with nonzero $X_2$ coefficients would have infinite out-of-distribution risk, since we can vary the structural equation of $X_2$ arbitrarily.

(From the audience: better than what?) Better than, for example, mixing all the environments together and minimizing plain empirical risk across the union: for these two environments, minimizing pooled empirical risk gives a single regression dominated by the spurious variable. You have two datasets, each collected from its own environment; to minimize empirical risk you unite them, maybe shuffle, fit one regression, and obtain those pooled coefficients. No, we still have several environments to train on, but we need to use them more cleverly than plain empirical risk minimization does. (About the index $e$: you collect your datasets from different environments, so you have several datasets, each one collected in its own environment; if you mixed them, you would not have this index, but here we regress within each environment. Does that make sense, more or less?) Yes, exactly; thank you very much.

If you don't follow all the details by now, it's okay: we will go further, and things will, I hope, become more apparent. At the very least you can look at the paper; it's a very good paper, and if you really want to dive into it, I encourage you to download the article.

Okay, now about how the problem of generalization has been approached so far: prior work. The first approach is simple empirical risk minimization, but it doesn't work in our example: our training environments have very high noise on the causal variable $X_1$, so empirical risk will promote correlations with $X_2$; then we will encounter testing environments with very large values of $X_2$, produced by structural equations absolutely different from the ones we met during training, and we will have an infinite, or at least very large, empirical loss. So empirical risk minimization doesn't work here. Is it clear why it fails in this example?
To be precise, empirical risk minimization is the simple strategy where you just minimize the sum of the empirical risks over the training environments: the empirical risk in the first training environment plus the empirical risk in the second. Our truly desired goal, as proposed initially, is to minimize the maximum risk over all environments: if we had a predictor that minimizes the risk across all environments, it would be good everywhere, in all possible environments; but that is an infeasible optimization task for us for now.

So why exactly does empirical risk minimization fail? If you minimize empirical risk in this model with very high noise magnitudes, you promote learning from the spurious variable $X_2$. Then you get a testing environment with, say, $X_2$ replaced by an absolutely different structural equation, for instance the constant $10^6$, while the equation for $Y$ stays the same, or the same up to the noise magnitude. Then your predictor $\hat{Y} = X_2$ has an enormous loss there. (From the audience: but isn't such a test environment totally different?) It's not totally different: only the spurious part changes. We are trying to find causal dependencies: if we draw the Bayesian network of this model, we get exactly the picture $X_1 \to Y \to X_2$, and the main assumption in this article is that the causal edge into $Y$ does not vary from environment to environment; we can vary the noise and the mechanism behind $X_2$, but the causal mechanism of $Y$ remains stable. This is an inductive bias, of course; this is our assumption, but it seems to be a reasonable one.

Okay, so empirical risk minimization doesn't work, unfortunately; believe me. Next we have robust learning. Robust learning seems to be very close to the out-of-distribution risk: we also minimize a maximum of empirical risks, but now only across the training environments, and with each risk shifted by a so-called environment baseline $r_e$: we minimize $\max_{e \in \mathcal{E}_{tr}} \big( R^e(f) - r_e \big)$ over predictors $f$. This environment baseline helps us not to focus too much on noisy training environments, whatever that means. But it turns out, and we will show it very simply, that robust learning is actually a weighted version of empirical risk minimization.
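Let me put the objectives we have seen so far side by side:

$$
\begin{aligned}
\text{ERM:}\qquad & \min_f\ \sum_{e \in \mathcal{E}_{tr}} R^e(f)\\
\text{Robust learning:}\qquad & \min_f\ \max_{e \in \mathcal{E}_{tr}} \big(R^e(f) - r_e\big)\\
\text{Out-of-distribution risk:}\qquad & \min_f\ \max_{e \in \mathcal{E}_{all}} R^e(f)
\end{aligned}
$$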
(From the audience: what exactly is $r_e$?) It is called the environment baseline, and it is a scalar constant that depends on the environment: in each training environment it is just a fixed value that we subtract from the risk; and yes, there must be brackets around $R^e(f) - r_e$ under the maximum, I'll write them in. It is needed so that we don't focus too much on noisy training environments, but that's another paper, not invariant risk minimization. (From the audience: is it known in advance?) No, it is not known; basically we can try to approximate it, like the baselines we add in the REINFORCE algorithm in reinforcement learning, and there are several techniques for this, but I'm not familiar with that approach in detail. So robust learning tries to minimize, across all the predictors mapping inputs to outputs, the maximum baseline-subtracted risk.

Now I'm going to show that this approach is equivalent to weighted empirical risk minimization. We rewrite the min-max as a constrained optimization task: minimize an auxiliary scalar $m$ over $f$ and $m$, subject to $R^e(f) - r_e \le m$ for every training environment. Then I write the KKT conditions: I set the derivative of the Lagrangian with respect to the predictor $f$ to zero, which gives $\sum_e \lambda_e \nabla_f R^e(f) = 0$ for some multipliers $\lambda_e \ge 0$. We assume that our risk is a nice convex function, so a point where this weighted gradient vanishes is exactly an argmin: the $f$ found by robust learning is the argmin, across all predictors, of the weighted sum $\sum_e \lambda_e R^e(f)$. So basically this is a weighted version of empirical risk minimization: those are the empirical risks in the environments, and for plain empirical risk minimization the $\lambda_e$ would all be equal to each other. Being just a modification of empirical risk, it has the same drawbacks as empirical risk minimization for our task: it will also promote learning the spurious correlation. (From the audience: but an environment may get a much bigger weight?) Yes, it may have a much bigger weight, but that depends on how we choose the baselines; and note that we are not talking now about the imaginary example with cows and camels, we are talking about this very simple example, and as we see, we already have two methods that cannot cope with it.
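Here is that little derivation written out cleanly, under the convexity assumption above:

$$
\begin{aligned}
&\min_{f,\,m}\ m \qquad \text{s.t.}\quad R^e(f) - r_e \le m \quad \forall e \in \mathcal{E}_{tr},\\
&L(f, m, \lambda) = m + \sum_{e} \lambda_e \big(R^e(f) - r_e - m\big), \qquad \lambda_e \ge 0,\\
&\nabla_f L = \sum_{e} \lambda_e \nabla_f R^e(f) = 0
\quad\Longrightarrow\quad
f \in \arg\min_{\tilde{f}}\ \sum_{e} \lambda_e R^e(\tilde{f}) \ \ \text{(for convex } R^e\text{)}.
\end{aligned}
$$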
The third approach has been promoted by many different authors, for example by Victor Lempitsky: the domain adaptation principle. The formulation is very simple: we want to find a data representation that has the same distribution across all environments. But the authors of this paper claim that this may sometimes lead us into the wrong kind of invariance, and they actually dedicate a whole section of their appendix to explaining why domain adaptation is not what we need. In our example it is obvious: the true causal variable $X_1$ has a varying distribution across environments, so an invariant-in-distribution representation cannot be the one we want, $\Phi(x) = (x_1, 0)$, and domain adaptation will not work for us either.

And finally, the so-called invariant causal prediction. The idea is to find a subset of variables such that the linear regression on top of that subset has equally distributed residuals across environments. This was, in some sense, the state-of-the-art method before the one I am about to present, but it has several drawbacks. First, it always assumes linear regression. Second, it searches for a subset of variables, and searching for a subset of variables in a high-dimensional task is absolutely infeasible. And third, it also doesn't work for us here, because the noise varies from environment to environment, so invariant causal prediction will not help us learn the true causal dependency of $Y$ on $X_1$ either. So it seems that even for such a simple problem we cannot find the true causal solution. Is everything alright so far?

(From the audience: can't you rewrite the out-of-distribution risk the same way?) Yes, I can also transform my out-of-distribution risk into a weighted empirical risk, but across all environments. If we had all possible environments, that would be fine: the empirical risk minimization predictor would actually be optimal for us. But when we are constrained to several environments, we cannot afford this. So yes, even for such a simple example, the state-of-the-art methods in causal learning seem not to work.

And here we approach the formulation of invariant risk minimization; but first we need to define what invariant predictors actually are. Assume I have a data representation $\Phi \colon \mathcal{X} \to \mathcal{H}$, which maps my inputs into some space $\mathcal{H}$, and a classifier $w \colon \mathcal{H} \to \mathcal{Y}$ on top of that representation. I say that my predictor $w \circ \Phi$ is invariant across a set of environments $\mathcal{E}$ if and only if my classifier is the optimal one, among all classifiers on top of that representation, for each environment in $\mathcal{E}$. That is the definition of an invariant predictor proposed in this paper.

The authors claim this is something reasonable, because it actually reflects the concept of induction in science. You know that in physics, for example, we do not focus on the representation of concrete objects; we try to find abstractions. For the law of gravity it does not matter whether your objects are apples or stars or planets; we try to get rid of everything irrelevant and keep only the useful causal features. So, philosophically, this definition really does reflect what invariant predictors should be. Is this definition clear?
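Written out, the definition is:

$$
w \circ \Phi\ \text{ is invariant across }\ \mathcal{E}
\quad\Longleftrightarrow\quad
w \in \arg\min_{\bar{w}\,:\,\mathcal{H} \to \mathcal{Y}} R^e(\bar{w} \circ \Phi)
\quad \text{for all } e \in \mathcal{E}.
$$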
(From the audience: doesn't this reduce to domain adaptation?) No, it does not, although the two things are really similar; as I said, the authors provide a whole section discussing why domain adaptation and IRM are quite different. Simply put, here is why domain adaptation is something we don't want. Say we have one training environment and one testing environment, with inputs $X$ collected in the training environment; in domain adaptation you can also use the inputs from the testing dataset. (An audience member gives an example about adapting a speech recognizer between British and American accents.) Yes, that is one of the differences: IRM is a much more general thing. In domain adaptation you really want to adapt to a new, somehow known, domain of the testing set, but you don't actually care about the outputs: if your test and training environments have the same distribution of inputs but different distributions of outputs, then domain adaptation fails, because domain adaptation is basically built for covariate shift, whereas this is a more general setting.

So, the formulation of IRM. It is again a constrained optimization problem: we minimize, over all accessible representations $\Phi$ and classifiers $w$, the empirical risk summed over the training environments, but we are constrained to use only invariant predictors, invariant according to the definition above:

$$\min_{\Phi,\,w}\ \sum_{e \in \mathcal{E}_{tr}} R^e(w \circ \Phi) \quad \text{s.t.}\quad w \in \arg\min_{\bar{w}} R^e(\bar{w} \circ \Phi)\ \text{ for all } e \in \mathcal{E}_{tr}.$$

Why minimize the risk among invariant predictors? Because we could, for example, take the zero predictor, a constant: it would be invariant, quite intuitively so, but useless. So we are effectively solving a min-max problem: we try to get rid of everything irrelevant while keeping the features useful for prediction. (From the audience: but aren't the constraints themselves the hard part?) That is a very good point, and I thought about it too: the authors say this is the theoretical formulation of IRM, and it is impractical, because these very strong constraints are infeasible to enforce in practice; for practical purposes they have the version called IRMv1. (More on the zero predictor: if you map your inputs to zero, destroying all the information in them, then a simple constant classifier, which just takes the mean of the outputs, is invariant. The authors do not spell this point out in the article, but this is why the empirical risk term is kept: not all invariant predictors are good, and the good ones, those that also achieve low risk, are the ones we expect to be good in all environments.) As you can see, there is no inner minimization left in the practical version of IRM, and we are going to discuss how to get there; it's going to be fun, because the two formulations happen to be related, although at first sight it is absolutely not clear why.

So, the trip from the classical formulation of IRM to the practical one. The first step is very simple: we rewrite the constrained problem as a soft-constrained one, penalized with some regularizer $\mathbb{D}(w, \Phi, e)$ whose purpose is to penalize $w$ for not being optimal in environment $e$; we assume it is a nice differentiable function, and so on. (Yes, this is the same thing; and since the penalty is per-environment, there should be parentheses grouping the terms under the sum over $e$.)

Next, we assume that our classifiers $w$ are always linear. This is a very strong assumption, and we will talk about it a little later, but for now we deal only with linear classifiers on top of an arbitrary data representation. So consider simple linear least-squares regression: there we can write the per-environment optimum analytically,

$$w^e_\Phi = \mathbb{E}_{X^e}\big[\Phi(X)\,\Phi(X)^\top\big]^{-1}\,\mathbb{E}_{X^e,\,Y^e}\big[\Phi(X)\,Y\big]$$

(the authors of this paper do not like parentheses either, yes), and we can propose a very simple, very trivial penalty: the squared distance to the optimal solution,

$$\mathbb{D}_{\text{dist}}(w, \Phi, e) = \big\| w - w^e_\Phi \big\|^2.$$

But you see this matrix inverse, and it is not a very nice thing, so we multiply both vectors by this matrix and obtain a similar penalty with much nicer properties:

$$\mathbb{D}_{\text{lin}}(w, \Phi, e) = \big\| \mathbb{E}_{X^e}\big[\Phi(X)\,\Phi(X)^\top\big]\, w - \mathbb{E}_{X^e,\,Y^e}\big[\Phi(X)\,Y\big] \big\|^2.$$
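To see the difference concretely, here is a minimal numpy sketch (my own, not the paper's code) that evaluates both penalties at the fixed classifier $w = (1, 0)$ on top of the representation $\Phi_c(x) = (x_1, c\, x_2)$ for the toy SEM from before: as $c \to 0$, which is the representation we actually want, $\mathbb{D}_{\text{dist}}$ blows up because of the matrix inverse, while $\mathbb{D}_{\text{lin}}$ smoothly goes to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100_000, 1.0                      # one training environment of the toy SEM
x1 = rng.normal(0.0, sigma, n)
y = x1 + rng.normal(0.0, sigma, n)
x2 = y + rng.normal(0.0, 1.0, n)

w = np.array([1.0, 0.0])                     # the fixed "causal" classifier on top of Phi_c

for c in (1e-3, 1e-1, 1.0):
    phi = np.column_stack([x1, c * x2])      # representation Phi_c(x) = (x1, c * x2)
    A = phi.T @ phi / n                      # estimate of E[Phi Phi^T]
    b = phi.T @ y / n                        # estimate of E[Phi Y]
    w_opt = np.linalg.solve(A, b)            # per-environment least-squares optimum w^e_Phi
    d_dist = np.sum((w - w_opt) ** 2)        # D_dist: needs the inverse, blows up as c -> 0
    d_lin = np.sum((A @ w - b) ** 2)         # D_lin: stays smooth and goes to 0 as c -> 0
    print(f"c = {c:g}:  D_dist = {d_dist:.4f}   D_lin = {d_lin:.6f}")
```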
Those nicer properties are depicted in this figure. What is shown here? These are the plots of the two penalties for the same example we had: assume we are looking for a very simple representation that keeps $x_1$ as it is and multiplies the second component by some constant $c$, that is, $\Phi_c(x) = (x_1, c\, x_2)$, where $c$ is the parameter of our mapping. Remember that the optimal solution is $c = 0$ with the classifier $w_{\text{opt}} = (1, 0)$: this is the true causal predictor that we want to learn. The figure shows the plots of the two penalties for different values of $c$, that is, for different magnitudes of the second, spurious component; we measure how far the fixed classifier $(1, 0)$ is from the least-squares solution of empirical risk minimization for each given $c$. We can see that the second penalty behaves much better than the first one, and even better than the first one computed for a strongly regularized linear regression, so we will use it further on. Is it clear why this thing is better? Well, basically because we got rid of the matrix inverse, but the authors also provide some graphical intuition on that.

Okay, so now we have a regularized optimization problem with linear classifiers, and our penalty has this form. We go on. Notice that, assuming linear classifiers, we have an over-parameterization: for a given pair of a classifier and a representation, we can insert any invertible transform $\psi$ and receive the same predictor; the circle here is just composition of functions, you apply one function after another. So let's simply constrain ourselves: we fix some nonzero classifier $\tilde{w}$ and optimize only across representations $\Phi$ such that the optimal classifier on top of that representation equals this fixed $\tilde{w}$. We got rid of the optimization across classifiers. (From the audience: why nonzero?) We just fix some nonzero vector; that we can do so is clear from this equivalence, so it seems reasonable. In particular, that means we can take the first basis vector, so that effectively just the first component of our representation is used, and everything will be alright; we thus arrive at an optimization problem over representations alone.
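Let me write these two facts down: the over-parameterization identity, and the program we are left with after fixing the classifier (this is my condensed rendering of the step):

$$
\tilde{w} \circ \Phi = \big(\tilde{w} \circ \psi^{-1}\big) \circ \big(\psi \circ \Phi\big)\ \ \text{for any invertible } \psi,
\qquad\Longrightarrow\qquad
\min_{\Phi}\ \sum_{e \in \mathcal{E}_{tr}} \Big[ R^e(\tilde{w} \circ \Phi) + \lambda\, \mathbb{D}(\tilde{w}, \Phi, e) \Big],\ \ \tilde{w} \neq 0 \text{ fixed}.
$$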
And by now you probably ask yourself: what's going on, is everything alright, didn't we mess up somewhere? So the authors provide the following theorem, and it's a very simple theorem, don't worry; they have much, much worse theorems, but we will not touch them, don't be afraid. (From the audience: what is the classifier here?) It's a linear classifier on top of our representation: like in a neural network, you map your inputs to some final layer, and then there is just a linear regression on top of it; that's the intuition. The theorem actually makes things clearer for the linear case: it tells us what the invariant predictors are in the linear case.

Say our inputs $X$ belong to $\mathbb{R}^d$ and the outputs $Y$ are one-dimensional, so a linear predictor is just $\hat{Y} = v^\top X$. Consider the set of invariant linear predictors: the vectors $v$ that can be represented as $v = \Phi^\top w$, with $\Phi \in \mathbb{R}^{p \times d}$ and $w \in \mathbb{R}^p$, such that the classifier is optimal across all environments, that is, $w \in \arg\min_{\tilde{w} \in \mathbb{R}^p} R^e(\Phi^\top \tilde{w})$ for all environments $e$ in some set $\mathcal{E}$. The theorem says that this set is exactly the set of vectors $v \in \mathbb{R}^d$ that are orthogonal to the gradients of the risks, $G_e(v) := \nabla_v R^e(v)$, for all $e \in \mathcal{E}$:

$$\Big\{ v = \Phi^\top w \ :\ w \in \arg\min_{\tilde{w}} R^e(\Phi^\top \tilde{w}) \ \ \forall e \in \mathcal{E} \Big\} \;=\; \Big\{ v \in \mathbb{R}^d \ :\ v^\top \nabla_v R^e(v) = 0 \ \ \forall e \in \mathcal{E} \Big\}.$$

(From the audience: what is $p$ here?) It's some number, we will see; it can actually be one. It does not depend on the number of objects; it's just the dimension of the representation. And note the "there exists" in the statement: for such a $v$, there exist a number $p$, a matrix $\Phi$ and a vector $w$ such that $v = \Phi^\top w$ and $w$ is the argmin of the risks.

Let's prove the necessity direction, to make things actually apparent. What does it mean that $w$ is the argmin over $\mathbb{R}^p$? We assume that our risk is convex, so we can rewrite it as a gradient condition: the gradient with respect to $w$ of $R^e(\Phi^\top w)$ is zero. Taking the derivative,

$$\nabla_w R^e(\Phi^\top w) = \Phi\, \nabla_v R^e(v)\big|_{v = \Phi^\top w} = 0,$$

where $v = \Phi^\top w$, I remind you. So this is the zero vector; now I multiply it by $w^\top$ from the left and receive

$$w^\top \Phi\, \nabla_v R^e(v) = v^\top \nabla_v R^e(v) = 0,$$

since $w^\top \Phi = v^\top$; which is what we had to show for this direction. Now the sufficiency, the opposite way: we have a vector $v$ that is orthogonal to all the $G_e(v)$ for all $e$ from our set, which is equivalent to $v$ being orthogonal to the span of those vectors; but I don't think you would see it well from here, we will not actually need it anymore to explain the ideas, and it has some peculiarities, so I'll skip it.

So, you know, this theorem shows us that we can actually restrict the search to rank-one representations, and hence restrict our classifier to be just scalar multiplication by one, for the linear case; and we can generalize that to other cases. And this is the illustration of the theorem statement: the invariant linear predictors lie on the intersections of ellipsoids, where the ellipsoids are induced by this orthogonality equation; for the MSE loss the equation gives exactly ellipsoids, because the risk is a quadratic function. Well, never mind; I understand, I'm also tired, it's okay. But this actually shows a very big drawback of assuming linearity of our classifier: zero
is always a solution: indeed, $v = 0$ trivially satisfies $v^\top \nabla_v R^e(v) = 0$ for every environment, so the zero predictor is always invariant. We will not dive into the details; it is just a bad property, and the authors say that they will consider the nonlinear case in future work. (From the audience: so why doesn't the solution collapse to zero?) Well, basically the authors say that in IRMv1 the empirical risk term steers us away from the zero representation, so it's okay in practice, but they are going to consider this in more detail in future work. Okay, so in this work we assume linearity of our classifier.

And finally, we just rewrite the penalty that we introduced earlier for the general case; the two coincide for the MSE loss: if our risk is just the squared error, then the gradient with respect to the classifier gives exactly that equation, and by the same reasoning we can use it in general. Notice what happens to the classifier here: in the practical formulation of IRM we have a direct mapping from inputs to outputs, and we keep a dummy classifier, which is just scalar multiplication by one, only so we can take the derivative with respect to it; just to put it into PyTorch, so it can compute the derivative that forms our regularizer. So the practical objective, IRMv1, is

$$\min_{\Phi}\ \sum_{e \in \mathcal{E}_{tr}} \Big[ R^e(\Phi) + \lambda\, \big\| \nabla_{w \mid w = 1.0}\, R^e(w \cdot \Phi) \big\|^2 \Big],$$

which seems to be able to help us achieve invariant solutions.
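Since the dummy classifier exists, as the speaker puts it, "just to put it into PyTorch", here is a minimal PyTorch sketch of IRMv1 in that spirit; it paraphrases the idea rather than the paper's exact code, and `model`, the environment list, and the binary cross-entropy risk are assumptions of this example:

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, y):
    # Dummy classifier w = 1.0: a scalar multiplying the features, kept only
    # so that autograd can differentiate the per-environment risk w.r.t. it.
    w = torch.tensor(1.0, requires_grad=True)
    risk = F.binary_cross_entropy_with_logits(logits * w, y)
    grad = torch.autograd.grad(risk, [w], create_graph=True)[0]
    return grad.pow(2).sum()   # || grad_{w | w = 1.0} R^e(w * Phi) ||^2

def irmv1_loss(model, envs, lam=1.0):
    # envs: list of (x, y) batches, one per training environment;
    # the whole network plays the role of the representation Phi.
    total = 0.0
    for x, y in envs:
        logits = model(x).squeeze(-1)
        total = total + F.binary_cross_entropy_with_logits(logits, y) \
                      + lam * irmv1_penalty(logits, y)
    return total
```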
Okay, now we will skip this part, don't worry: "invariance, causality and out-of-distribution generalization". We can believe that IRM promotes low error and invariance across the training environments, but several questions remain: does it promote invariance in all environments? Does it promote low error in all environments? And how are those things, causation, statistical invariance, and out-of-distribution generalization, actually connected? Please see the article for the details; I can explain them to those who are interested after this talk. The details are really interesting, because the authors answer all those questions for the linear case, and even for the linear case they require differential topology to do it. (From the audience: and the answers?) The answers are these big theorems that say, well, "yes, sometimes"; something like that.

Okay, experiments, our favorite part. The first one is a little modification of the first example we had on the blackboard: a slightly more specific structural equation model, now with a hidden latent variable and a little more difficult equations. They consider eight different experimental setups, each encoded by three parameters: Scrambled or Unscrambled observations, meaning whether we observe the variables as they are or after some orthogonal transformation; Fully or Partially observed, which is basically whether hidden variables mess into our model or not; and homoskedastic or heteroskedastic, which is just about the position of the noise. And these are the results; let me explain what you see here. For each of the eight experiment types, for example FOU is fully observed, homoskedastic, unscrambled, they show two bar plots: the first one for the causal error and the second for the non-causal error, which are basically the errors on the coefficient of $X_1$ and on the spurious coefficient of $X_2$. As you can see, IRM, which is green, has the lowest causal error in all the experiments, though sometimes it is worse on the non-causal variable than invariant causal prediction; the authors say their method is still much better, and that ICP is just more conservative, since it is focused only on keeping causal predictors. So, believe me, it seems that IRM works.

And to show it even better, I would like to introduce the MNIST experiment; of course, we are coming to Colored MNIST, MNIST with a spuriously correlated color. We make a binary label, which is $0$ for the first five digits and $1$ for the remaining five, and then we flip the labels obtained this way with probability $1/4$; it's like the noise in our environment. Then we introduce the spurious correlation between the color and the output: we obtain a color index by flipping the label once more, with some probability that depends on the environment, and we color the image green if the color index is $0$ and red if it is $1$. The setup seems reasonable: in the training environments those flip probabilities are relatively low, so the correlation between color and label is very strong in our training environments; however, it changes from environment to environment, so we can hope that our invariant predictor will notice it and get rid of this spurious, varying correlation. For our testing environment, on the opposite, we set this probability to a high value, so the predictors that use this color feature for their prediction will fail.
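Here is a small numpy sketch of that construction, assuming `images` is the array of grayscale MNIST digits and `digits` the digit labels; the function name and the two-channel color encoding are my own, and the flip probabilities 0.2/0.1 for training and 0.9 for test are the ones used in the paper:

```python
import numpy as np

def make_cmnist_env(images, digits, color_flip_p, label_flip_p=0.25, seed=0):
    """One Colored MNIST environment.
    images: [n, 28, 28] float array in [0, 1]; digits: [n] ints in 0..9."""
    rng = np.random.default_rng(seed)
    y = (digits >= 5).astype(np.int64)                # binary label: 0 for digits 0-4, 1 for 5-9
    y = y ^ (rng.random(len(y)) < label_flip_p)       # label noise: flip with probability 1/4
    color = y ^ (rng.random(len(y)) < color_flip_p)   # color index: flip label with env-specific prob.
    x = np.stack([images, images], axis=1)            # two channels: (red, green)
    x[color == 0, 0] = 0.0                            # color index 0 -> green digit (zero the red channel)
    x[color == 1, 1] = 0.0                            # color index 1 -> red digit (zero the green channel)
    return x, y

# Two training environments with a strong color-label correlation, and a test
# environment where it is reversed (flip probabilities as in the paper):
# envs = [make_cmnist_env(images, digits, p, seed=i)
#         for i, p in enumerate([0.2, 0.1, 0.9])]
```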
the same distributions but I actually do not understand and I couldn't find it in their code how do they plot the probability of output being 1 against the prediction of the network against the law gets maybe some Montecarlo estimate but they didn't make it clear actually so those are just some more plots if you you're interested ok and finally information theoretic view that's literally 5 minutes so that that's the information theoretic view I encountered it's in this blog post and I found it really interesting idea really interesting viewpoint for for this concept so basically those the author of this work proposes to look at our tasks in general like this like generally this Basin Network where we have environment we have some latent variables or noise we have causal features and spurious features and of course output and our main assumption is that given inputs given causal inputs and noise we're independent of the environment so this is the main assumption in such a graph representation of the paper okay and our task is to learn such representation of our inputs that our output is as much as possible independent of environment and also this representation should be somehow informative about our outputs but that's very close to the informational bottleneck isn't it because in informational bottleneck we're trying to learn such a representation of our data which takes the list from the inputs it gives the most to the outputs right and for this case we can rewrite IRM principle in this form so basically again we're trying to maximize mutual information between our outputs and representation but we subtract the mutual information of our outputs and environment given our representation so something reasonable right it's very close to the penalized version of IRM that we already see that we have already seen but the author goes even further and says that we can actually receive something like this from the informational theoretic point of view the equation of this type and he does the following he represents the mutual information the conditional mutual information on the representation of our outputs and environment he does some math juggling good we will not dive into details but it seems to be reasonable he explains it in more detail in his blog but with some mathematical juggling we can approximate mutual information between our outputs and environment with this equation then instead of minimizing here overall the parameters of our predictor we can minimize only in some neighborhood of this value theta and this actually is the negative here we have we should we should have the two on top it should be squared but the others seem to mistyped it so this this seems to be like reasonable thing for small epsilon values right and finally yes and finally we add the empirical risk term to to this lower bound of our mutual information of our outputs and environment and add minimization by representation and the final thing would be to notice that every global minimum of this term is local minimum of this term because when the gradient is zero its local a vehicle risk minimum and then we can we can combine those two minimization minimization minimizations into one and achieve something really similar to the IRM so it's like empirical risk plus some gradient squared regularization so this is just like SitePoint this yes it's that's what very rough and the author also says that it's not optimal but he says that if we assume that they are like nice and convex then everything should be all right here 
(From the audience, on the gradient term.) Well, basically yes, I also think so, because this is just minus the gradient. Okay, so let's come to the summing up. We want to learn robust predictors that are based on the true causal properties of our data and do not pay attention to the spurious correlations present in it. For that, we noticed that invariance and causation are quite related things, and we can leverage their connection by promoting out-of-distribution generalization, as we have seen in the beginning. We also assume that our data are sampled from different, separate environments; these environments can vary, but all of them preserve the same causal structural equation for our outputs. And finally, the IRM principle once again: we try to find such a representation of our features that the classifier on top of it would be optimal simultaneously in all environments; and here is the practical version of IRM. Thank you very much.

(From the audience: why "v1"?) That's just version one, because there are so many things left as future directions; they leave a lot for future studies, like considering the nonlinear case, or answering those open questions for the nonlinear case. This is just the concept, the first one that was proposed, and it seems that this thing will develop further on.

(From the audience: what if the target carried more information rather than very little? Instead of just saying "cow" or "camel", you ask the system to learn about everything in the image: when I look at a picture of a cow, I can also say this is grass, this is the cow's skin, the mouth, whatever; you can decompose the image into a hierarchy of different things.) So you are saying that when you make your problem more complex, more specific, you will have fewer spurious correlations, and the model will not care about the background color. (The audience member continues, arguing that to label all those details the model would need to understand the true underlying process and reconstruct a more general world view from the details, instead of latching on to spurious features.) Well, you see, the spurious features are the ones that change, whereas features like the shape of the horns or the eyes of cows are stable: every cow has horns of a similar shape, so that is a stable feature. But yes, asking the model to do more things might help in a similar way.

(From the audience: is it actually shown that minimizing the out-of-distribution risk finds causal predictors?) I see; that is exactly the part we skipped. The authors shed some light on this, but again, specifically for their linear case, and it is highly non-trivial: they apparently show that yes, the out-of-distribution risk is something that actually helps us find the true causal predictors, but to prove it you need to really dive into the details; this presentation is just hand-waving in comparison.
(From the audience: who is behind this work?) I don't know; I just know that this is Facebook research. You see, this feels more like a philosophical paper. On the one hand, they have a lot of theorems, and those theorems are not trivial at all; that's the first point. And about the philosophical side: they constructed their discussion section as a dialogue between two students who have just read this paper and are eager to discuss it with each other. Actually, I have never seen students who would talk like this; well, I have seen one. And those two students discuss a lot of philosophical things: why this makes sense, when this makes sense, and how it is connected to the rest of the field. (From the audience: like Socratic dialogues?) Yes, that is what they are called. [Laughter]
Info
Channel: BayesGroup.ru
Views: 2,672
Rating: 4.7692308 out of 5
Id: iBlCpJmaBh0
Length: 90min 51sec (5451 seconds)
Published: Mon Oct 07 2019