Francesco Locatello (Amazon) - Towards Causal Representation Learning

Video Statistics and Information

Captions
Okay, so I'll get to it. Welcome to this MaLGa seminar series; this is also part of the ELLIS Genoa activities. It is a pleasure to have Francesco Locatello. As he somewhat reminded us, his trajectory in the last few years: he did a PhD in Zurich and then spent a good amount of time at Google, both in Zurich and Amsterdam. His trajectory started with more classical optimization, the kind of stuff that Silvia Villa here likes, greedy methods, conditional gradient and so on, but in the last few years he has been moving towards the problem of representation learning. Today he is going to talk about the intersection between representation learning and causal learning, and the setting he has in mind is largely that of computer vision, situations where you need effective, transferable and efficient representations for images, or more specifically for scenes. So it's a real pleasure to have Francesco. I forgot to say that he finished his PhD and is now at Amazon Research in Tübingen, a new lab very much geared towards research, and, as he was saying, they are looking for interaction, so hopefully this will be a chance to start a bit of that with the center here. I'll leave it to Francesco in a second; just two words of Zoom logistics. First, if you have a camera and you want to leave it open, please do: from personal experience, closed cameras make the life of the speaker slightly more lonely, you cannot really feel the vibe, but this way you can at least see some faces. The usual format is that if you want to ask a question and make it a bit more interactive, that is fine, and if instead you prefer to write questions, either on YouTube or in the chat, please do. Again, let's try to fight the Zoom format as much as we can. All right, Francesco, I leave it to you; it's a real pleasure to have you and I'm really curious to hear what you talk about.

Thank you very much for the invitation and the kind words. The title of my talk today is Towards Causal Representation Learning. This is meant to summarize what I've done in the last two years of my PhD, after I moved from more theoretical optimization and variational inference work to representation learning. This is actually a paper that I wrote, and when I decided I would speak about this topic here today I was 100% sure the paper would be out by now; of course that didn't happen, so I'm sorry about that. I hope the talk will still be interesting and that you'll be excited to look at the paper when it is actually released. Today I'm going to start by introducing what causality is and why we care about it. I'm a machine learning person, not a causality person, but I think causality is very exciting, so I'm going to try to explain why I care about it. Then I'm going to talk more about deep learning, in particular disentangled representations: I'll explain what this problem is about and how and why it is related to causality. We will see that the unsupervised learning of disentangled representations is rather challenging and has some critical problems, and that we have to go beyond unsupervised learning if we actually want to solve them; we're going to see semi-supervised and weakly supervised approaches.
Then I'm going to briefly talk about architectures and the importance of the representational format, and close with some hopefully inspiring conclusions on what I think is the way forward for bringing representation learning and causality closer together.

Generalization is a very core concept in artificial intelligence, and this was recognized already in the very early days of AI. However, when we build and deploy a machine learning system, the generalization desiderata we have actually have multiple facets. We clearly want to generalize from one sample to the next in the i.i.d. setting, but we also want to generalize across different settings: for example, if I train a neural network that is very good at recognizing chairs, I would want it to also be good when the chairs are presented with unusual backgrounds or from unusual viewpoints. Finally, I would argue that generalization is also about being able to reuse and repurpose knowledge and skills, so that if I have a physical system such as a robotic platform, and I train a model that learns to pick up a cube and then hit it with the third finger, then the knowledge of the environment that this model should have is the same knowledge needed to solve other tasks, such as inverting the order of a pile of cubes. Ideally we want to learn this structure and share it across many different tasks.

The important remark here is that the performance of a learning algorithm inherently depends on the assumptions you are willing to make, and the assumption I decided to make in my PhD is that the world is structured. If this statement is true, then it would be prudent to incorporate a corresponding structure in the solutions of our learning algorithms. That's why in the first two years I worked on convex optimization, in particular constrained optimization, where this structure is enforced as a constraint, and in the second half of the PhD I worked on trying to discover this structure from data, which is what I'm going to talk about today. The hope is that if we incorporate this assumption we facilitate learning: in particular, we can reduce the number of examples needed to train a model, it should also help with generalization, especially stronger forms of generalization, and finally this structure can be shared across many different tasks.

If we continue with this three-finger robot example, consider a task such as: where will the cube land, given some initial conditions? There are many different ways of solving this task, and they are all equally valid; here I want to focus on two extremes. On one hand we have physical modeling: we know physics, we know differential equations, and if we know the initial conditions we can pretty much compute exactly where the cube is going to land. At the other extreme we have statistical learning: let's hit the cube a million times and see what happens, then change the mass, change other properties, and train a big model. I would argue that both are valid approaches to this question, but maybe this is not the only question we care about for this particular system, and we don't even need to go to different tasks: once I have a model answering this question, I would also like to ask how I can land the cube in some particular position that I want for some reason, or what would happen if the cube were yellow.
This question seems easy, but it is actually pretty hard, because to really solve this type of problem we need to go beyond statistical learning. A nice example explaining the difference between statistical learning and causality, in particular interventional and counterfactual questions, comes from a paper that Matthews wrote in 2000, where he collected data about the number of stork breeding pairs across Europe, measured the human birth rate, and then showed that storks "deliver" babies with statistical significance. What does this tell us? It tells us that if data from a new country arrives, we can likely predict the human birth rate given the stork population, and this is quite reasonable to expect because the sampling process should be roughly i.i.d., or at least that's the hope: we are well within our assumptions, so our model is expected to generalize. The second tier of this causality ladder are interventional questions, such as: if I want to improve the human natality rate, should I increase the number of storks? This type of question is much harder to answer, because it requires an understanding of the mechanistic processes behind the statistical associations. Finally, we have counterfactual questions, the hardest level of the causality ladder, because counterfactual questions are about things that did not actually happen. An example: how would the natality rates be today if the Italian government had imported 10,000 storks in 1980? This is very hard from the machine learning perspective because we did not import storks in 1980, so we have no answer to this question in the data.

The difference between statistics and causality is very well explained by Reichenbach in 1956. He argues that if you have two observables X and Y, and these two observables are statistically dependent, then there exists a variable Z, which could coincide with X or Y, that causally influences both and explains all the dependence, in the sense of making them independent when conditioned on Z. This means that if we observe a correlation between the human birth rate and the number of storks, then either babies bring storks, or storks bring babies, or there is some other variable, such as economic development, that causally influences both.
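As a compact restatement of the principle just described (an editorial paraphrase, not a formula shown in the talk):

```latex
% Reichenbach's common cause principle (1956), paraphrased:
% if X and Y are statistically dependent, there exists a variable Z
% (possibly coinciding with X or Y) that causally influences both
% and screens off their dependence.
X \not\perp Y
\;\Longrightarrow\;
\exists\, Z \;\text{ such that }\; Z \to X,\; Z \to Y,
\;\text{ and }\; X \perp Y \mid Z .
```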
From the machine learning perspective, we are interested in causal models because they offer a much richer description of reality than a statistical model. One nice way to think about the difference is that, for a physical system, a statistical model tells you something about one particular configuration of the system, namely the configuration you actually observe, while a causal model gives you a much more complete description, because it tells you about all the possible states this physical system could enter as a result of interventions. Of course this is not as powerful as having a full physical description of the system through differential equations, for example, but the hope is that causal models can still be learned from data with some additional assumptions. The reason why, on the machine learning side, we are interested in causality is that we hope it will help in a variety of settings that nowadays correspond to open questions in machine learning. We have seen some preliminary but positive answers in the setting of semi-supervised learning; others are adversarial vulnerability and robustness, and strong generalization, which is something I particularly care about, but also common practices in deep learning such as pre-training, data augmentation and self-supervision, world models and offline reinforcement learning in RL, and applications in multi-task and continual learning.

So this is great: on an intuitive level it is easy to be convinced that causality sounds like a useful concept for machine learning. However, there is a big elephant in the room, which is the causal variables. Causality starts from the premise that you are given descriptors of your physical system in the form of causal variables; these variables could be observed or unobserved, but they are still part of your model. So the language causality operates on is inherently different from the language we typically use when we do machine learning, in particular deep learning. How can we, instead of relying on handcrafted causal variables, actually learn them? In representation learning we don't have these causal variables; we have images, for example. Here we have a very pretty picture of Switzerland, and the goal in representation learning is to learn a function that maps these observations into a low-dimensional vector representation. An underlying assumption we can make is that the observations are not truly high dimensional but are the manifestation of a set of low-dimensional ground-truth factors, often called factors of variation in the disentanglement literature. These ground-truth factors would correspond to the causal variables we would like to uncover, because we hope that these causal variables would have a set of properties useful for downstream tasks, such as generalization, maybe more robustness to distribution shifts, and so on.

So now I want to talk about disentangled representations, how they relate to causality, and how we can actually solve this inverse problem of recovering factors of variation from high-dimensional observations. Here is one slide connecting disentanglement and causality. In causality we have causal variables s_i and the very famous causal graph that encodes the causal relations between these variables, and these causal relations can also be written as structural equations: a variable s_i is a function of its parents in the graph and some noise variable u_i. The structural equations imply a causal factorization of the joint distribution: this set of random variables factorizes as the product of the conditionals of each s_i given its parents. You can think of disentanglement as trying to recover these factors of variation under the assumption that there is no confounder, so that they are actually independent: we have observations x that are a function of n independent variables, and we want to learn these variables. If you do not want to assume that these variables are independent, the problem is actually equivalent to recovering the noise variables of a structural causal model.
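To fix notation for the slide being described, here is a sketch of the standard structural-causal-model formulas the speaker refers to (PA_i denotes the parents of s_i in the causal graph; the symbols are the conventional ones, not necessarily those on the slide):

```latex
% Structural equations: each variable is a function of its parents and a noise term.
s_i := f_i(\mathrm{PA}_i, u_i), \qquad u_1, \dots, u_n \ \text{jointly independent},
% which implies the causal (Markov) factorization of the joint distribution:
p(s_1, \dots, s_n) = \prod_{i=1}^{n} p\!\left(s_i \mid \mathrm{PA}_i\right).
% Disentanglement, as framed here, assumes there are no confounders,
% i.e. the factors of variation are independent and generate the observations:
p(z_1, \dots, z_n) = \prod_{i=1}^{n} p(z_i), \qquad x = g(z_1, \dots, z_n).
```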
A bit more pragmatically, this means that we want to align our representation with the ground-truth factors of variation, so that when a factor of variation changes, the representation changes accordingly in a single dimension. In our paper "Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations" we first showed theoretically that, for arbitrary data, this task is actually impossible, and the intuition is pretty simple. Here you can see 16 examples from the Cars3D dataset, one of the datasets commonly used in the disentanglement literature, and on the right-hand side you see the representation we would want to learn. This is a synthetic dataset generated with three factors of variation, and we want to learn this specific representation precisely because the dataset was generated with those three factors. However, we can construct a second set of ground-truth factors which are fully entangled with the correct ones but give rise to the same observations, and now you can immediately appreciate where the problem is: a representation learning algorithm that only has access to the observations has no way to distinguish which generative model the observations came from, because from the observational perspective these two generative models are identical.

This is how the theorem looks: for any factorizing distribution over the ground-truth factors z, there exists a transformation f(z) whose Jacobian is non-zero almost everywhere, meaning that z and f(z) are completely entangled (if I change z in one dimension, all dimensions of f(z) change), and yet p(z) and p(f(z)) have the same marginal distribution. The intuition behind the proof is also simple. Consider the example where the ground-truth factors are just two scalars drawn from a normal distribution, and the observations are a rotated version of the ground-truth factors with some unknown rotation, so the rotation is my generating function. The question is: can we obtain a disentangled representation, which here means, can we undo this rotation? Clearly this is impossible, because Gaussians are rotationally invariant, so after observing the samples I have no way of knowing which were the correct axes before the rotation. The proof works exactly like this: we start from a factorizing distribution, map it to a Gaussian using the probability integral transform and the inverse Gaussian CDF, apply a transformation that preserves the marginal distribution, and then map back to the original space. This whole chain of transformations preserves the marginal distributions, but its Jacobian is non-zero almost everywhere, meaning that z and f(z) are completely entangled. So this tells us that disentangled representations are non-identifiable from observational data.
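A minimal numerical sketch of the rotation argument (illustrative only; the sample size and rotation angle are arbitrary choices, not taken from the paper): two independent Gaussian factors are mixed by an unknown rotation, and the rotated factors are statistically indistinguishable from the originals, so no learner that only sees samples can recover the original axes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth factors: two independent standard Gaussians.
z = rng.standard_normal((100_000, 2))

# Unknown generating rotation that entangles the two factors.
theta = 0.7  # arbitrary angle, unknown to the learner
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z_entangled = z @ R.T

# Isotropic Gaussians are rotation invariant, so both "factorizations"
# have the same distribution (up to sampling noise): same mean, same covariance.
print(np.allclose(z.mean(axis=0), z_entangled.mean(axis=0), atol=1e-2))  # expected: True
print(np.allclose(np.cov(z.T), np.cov(z_entangled.T), atol=1e-2))        # expected: True

# Yet every coordinate of z_entangled depends on every coordinate of z
# (the Jacobian R has no zero entries), i.e. the two are fully entangled.
```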
However, this theorem makes no assumptions about what the data actually is. In particular, when we approach the problem from the deep learning side, and computer vision in particular, we work with images and we have very strong inductive biases on the architecture side; for example, we use convolutional networks, which tend to work really well on images. So maybe inductive biases are sufficient to distinguish z from f(z), and the model just converges to the right set of ground-truth factors. To test this, we performed a large-scale experimental study trying to answer the question: can we learn disentangled representations without looking at the labels, i.e., fully unsupervised? To answer it we built a library to facilitate reproducible research on disentanglement. The library supports end-to-end training and evaluation of the prominent state-of-the-art approaches, and it has automatic visualization, so all the pretty GIFs you see in this talk are generated automatically for every model we train. We consider VAE-based methods, 14 of them by now (originally, in the first study, we had six). These methods are all variants of the vanilla variational autoencoder, and they basically all enrich the VAE objective with a regularizer that is supposed to encourage disentangled representations through some suitable bias. At test time we assume access to the ground-truth factors, which is the case because we only consider synthetic datasets, and we compute several disentanglement metrics. For the purpose of this talk it is not particularly important to know exactly what these methods do or what these metrics evaluate.
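As a rough illustration of the family of objectives being described, here is a minimal β-VAE-style loss sketch (the `encoder` and `decoder` modules and the weight `beta` are placeholders; the actual methods and hyperparameters in the study differ):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, encoder, decoder, beta=4.0):
    """Vanilla VAE objective plus a weighted KL regularizer.

    Most methods in the study follow this pattern: a reconstruction term
    plus a regularizer on the posterior meant to encourage disentanglement.
    """
    mu, logvar = encoder(x)                      # amortized Gaussian posterior q(z|x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)         # reparameterization trick
    x_rec = decoder(z)

    rec = F.mse_loss(x_rec, x, reduction="sum")  # reconstruction term
    # KL( q(z|x) || N(0, I) ), closed form for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl                       # beta > 1 strengthens the regularizer
```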
Now I want to show just two results, which relate to two questions. We have seen that there are multiple methods, so the natural question is: which one is the best? The second question is: if I have a pool of trained models, how can I pick the best one? Here you can see the distribution of performance of the six methods trained on the Cars3D dataset and evaluated with the FactorVAE score, one of the disentanglement metrics. As you can see, the performance of these methods overlaps heavily, indicating that the choice of the loss function does not matter as much as the random seed and the hyperparameters; in fact, the objective function only explains about 37 percent of the variance of the score. So the first question does not matter all that much: if you can select a good model, any of these loss functions will do. So how can we select that model? We tried many different things, and the TL;DR is that nothing we tried works reliably, but one result that really stood out to me is the following. Here you see the rank correlation between the training metrics and the test disentanglement metrics, and they are essentially uncorrelated; at least, there is no obvious trend telling you, for example, that a very low reconstruction error indicates that training was successful and therefore the model is going to be more disentangled. This does not happen. So the unsupervised training metrics do not seem useful for unsupervised model selection, and this question remains a key challenge.

This paper opened a bit of a crisis in the community, and from my perspective it raised two fundamental questions. The first is about supervision: we know that we need some supervision, both theoretically and empirically; the question is how much. The second is about the usefulness of these representations: it seems that disentanglement is not a trivial task, so is it actually worth learning this type of representation? To answer both questions we need to go beyond unsupervised learning.

The first result I want to show is from the paper "Disentangling Factors of Variation Using Few Labels". Here we ask a very simple question: what happens if we observe a few samples annotated with the ground-truth factors, say a hundred or a thousand, which is less than one percent of the size of these datasets? As soon as you have access to supervision you have two options: either you train without supervision and use all the labels for validation, or you use a regular training/validation split. What we discovered in this paper is that both approaches actually seem to work (you can see two GIFs of latent traversals at the bottom of the slide), and not only do they work, they are also surprisingly robust. Here you can see the rank correlation between the validation metrics computed with 100 examples and the test metrics computed with 10,000 examples, and this correlation is rather positive, indicating that we can indeed do model selection using a handful of validation examples. Of course we need to be a bit skeptical here, because factors of variation are actually pretty hard to label, even though we only ask for 100 examples, so we also explored different types of imprecision. For example, we can have coarse factors of variation, where we bin the factors into five categories (some of these factors are discrete, so the binning does not even make sense, but we do it anyway); or label noise, where ten percent of the labels are completely random; or partial observations, where only two factors are observed. What we see is that the correlation remains, which indicates that we don't really need much supervision, or very precise supervision: very little and very imprecise supervision actually seems to be enough. And of course, if you have more labeled data you can just attach a supervised objective to your favorite unsupervised method, and this works even better the more data you have, but that is pretty obvious.

So maybe this crisis is actually an opportunity in disguise, because we have seen that little and imprecise supervision seems to be enough to identify disentangled representations, and in some applications you may have a few labels, so it makes sense to consider semi-supervised algorithms as opposed to fully unsupervised ones. From the research side, though, this raises the question: how far can we push it? How about zero explicit labels, which would be much more elegant, relying only on weaker forms of supervision? For this we go back all the way to causality, and the critical assumption we add is what we call the sparse mechanism shift assumption. This assumption says that small distribution changes tend to manifest themselves in a sparse or local way in the causal factorization, so they usually do not affect all factors simultaneously. In a disentanglement setting this looks as follows: we change the setup, we no longer have i.i.d. samples from a generative model, but we assume we collect pairs of non-i.i.d. examples, and across each pair several factors of variation are fixed. In practice you do not want to assume which factors are fixed, or how many, so the algorithm knows nothing about this; it only knows that the change in the causal factorization should be sparse. An example of this setting: you have a robotic platform, the robot is doing some task, and a camera is taking a video of the scene; then you can automatically construct these pairs by taking nearby frames. Here we operate under the assumption that two adjacent frames typically have mostly the same content; in pixel space the changes can be a bit denser (for example, when an object moves, many pixels change value), but in the causal factorization almost everything remains the same.
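A sketch of how such pairs could be built from a video and how a weakly-supervised method might exploit them; the frame offset, the diagonal-Gaussian posterior interface, and the thresholding rule are illustrative assumptions rather than the exact procedure of the paper.

```python
import numpy as np

def make_pairs(frames, offset=1):
    """Pair nearby frames: across a pair, most ground-truth factors are
    assumed unchanged (sparse mechanism shift)."""
    return [(frames[t], frames[t + offset]) for t in range(len(frames) - offset)]

def gaussian_kl(mu1, var1, mu2, var2):
    """Per-dimension KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def guess_changed_dims(mu1, var1, mu2, var2):
    """Heuristic in the spirit of weakly-supervised disentanglement methods:
    latent dimensions whose posteriors differ most across the pair are treated
    as 'changed'; the remaining dimensions are treated as shared (and can be
    constrained to match during training)."""
    kl = gaussian_kl(mu1, var1, mu2, var2)
    threshold = 0.5 * (kl.max() + kl.min())  # simple adaptive cut-off
    return kl > threshold                    # boolean mask of changed dimensions
```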
From the theoretical perspective it is quite intuitive why weak supervision is sufficient. This is the slide I showed earlier, where I said that if you only have the observations you cannot distinguish which generative model the observations came from. However, if I now group the observations into pairs and tell you that across these two images only a sparse change happened, then clearly the second generative model could not have generated them. To give a formal statement we need more assumptions: we assume that we know the number of fixed factors, and we have a variability condition so that every factor can change; we also assume that z has a continuous distribution and that the generating function is a diffeomorphism. What we show is that if we do distribution matching and satisfy these constraints, then all plausible models have aggregate posteriors that are reparameterizations of the true aggregate posterior, up to a coordinate-wise change of variable and an index permutation. In other words, for every plausible generative model, its latent variable ẑ is a function of the original ground-truth factors z, and the Jacobian of this function is diagonal.

Of course, the assumptions in this theorem do not hold in practice at all: we do not know how many factors are shared, and this number is not constant, because assuming otherwise is unreasonable in practice; we have finite datasets, finite model capacity, a mix of continuous and discrete factors; and if we train VAEs we use the ELBO for distribution matching, so the distribution matching is only approximate. The good news is that, inspired by the proof technique, we can develop methods for which the training metrics actually correlate with disentanglement. This means that, first of all, we are finally optimizing for the right thing, and second, after training we can do model selection just by looking at the training metrics, for example taking the model with the best reconstruction error or the best ELBO (we typically use the reconstruction error). On the left-hand side you can see the traversals of the model achieving the best weakly-supervised reconstruction error, which also achieves almost perfect disentanglement. So on the supervision side, we have seen that under some conditions we can recover the factors of variation with no labels; these conditions are rather strict, but in practice at least the training metrics can correlate with disentanglement, so we can at least do model selection.

The second question is: should we do this at all? Here we consider three very different tasks. The first is abstract visual reasoning: we are given a set of panels in a grid, one panel is missing, and the panels are related to each other by some high-level logical rule; the neural network needs to find the missing panel out of a set of potential answers, the six panels on the right-hand side. This is extremely hard for neural networks, because you cannot solve the task just by looking at local features: you really need to reason about the similarities between these pictures in terms of factors of variation. The second task is strong generalization: after training the representation, you are given a training set and a task, for example classifying the shape of the object mounted on a robotic arm, but the training set is extremely biased, so all the objects are white, while in the test set objects can take any color and none of them is white. The third task is a fairness task. The setting we assume is slightly different from what is typical in the fairness literature: we have independent factors of variation, one of which is the target variable and one of which is the sensitive variable, and they get mixed into the observations by some unknown mixing function. Typically in fairness the sensitive and target variables are related to each other; in our case they are independent, although they become conditionally dependent given the observations. The other assumption we make, which makes the problem much harder, is that we do not want any label for the sensitive variable; we only observe the target variable. We want to train a classifier that accurately predicts the target variable but is also robust to changes in the sensitive variable, and we measure this with a notion of fairness called demographic parity (the definition is at the top of the slide). You can interpret it as follows: the probability that a classifier predicts that a certain candidate should get a job, given that the gender of the candidate is male, should be the same as if the gender were female.
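A minimal sketch of how the demographic parity gap can be measured for a binary sensitive attribute (variable names are illustrative; the talk's setting additionally assumes the sensitive factor is never labeled during training):

```python
import numpy as np

def demographic_parity_gap(y_pred, sensitive):
    """Absolute difference in positive-prediction rates between the two groups
    of a binary sensitive attribute; zero means demographic parity holds:
    P(y_hat = 1 | s = 0) == P(y_hat = 1 | s = 1)."""
    y_pred = np.asarray(y_pred)
    sensitive = np.asarray(sensitive)
    rate_0 = y_pred[sensitive == 0].mean()  # positive rate in group s = 0
    rate_1 = y_pred[sensitive == 1].mean()  # positive rate in group s = 1
    return abs(rate_0 - rate_1)
```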
In this panel you can see the rank correlation between disentanglement, our weakly-supervised reconstruction, and the unfairness on the different datasets we consider; here you can see the performance on the abstract visual reasoning task at different sample sizes; and here the strong generalization performance. What these three tasks have in common is that disentanglement is a useful property on all of them. However, the disentanglement metrics are a bit meaningless in this case, because we do not want to assume access to the ground-truth factors (otherwise we would also have to compare against supervised approaches). But with weak supervision we have our weakly-supervised reconstruction, which actually correlates with performance on these tasks. This means that the representation learned by a model achieving good reconstruction will be useful for all of these tasks at the same time. So overall, weakly-supervised disentanglement has some recovery guarantees, and the reconstruction loss identifies models that actually learn the sort of reusable abstractions we wanted to learn, in a causal sense, in a way. As for usefulness, we ticked that box: we can identify representations that are useful for diverse tasks in terms of performance, strong generalization, and sample efficiency.

To summarize what we have seen: disentangled representations are an active and important research area, but the use of inductive biases and supervision needs to be much more explicit, because in the unsupervised setting there are fundamental challenges, and the type of supervision needs to be data- and task-dependent. Finally, one takeaway from these studies is that it is very important to have sound and reproducible experimental setups with several datasets, because it is very easy to draw spurious conclusions if you only look at a subset of them.

So, are we done? I would argue that we are not, and a clear example is in this slide. I started talking about disentanglement by showing a very high-dimensional and complicated picture, and then the datasets I actually used were these super simple 64-by-64 pictures with a single object in the center.
The important difference between these two pictures is not just the resolution or the lack of textures, but also the fact that in the datasets we consider there is only a single object, while in reality we have multiple objects, and even worse, multiple instances of the same object. This is a problem. Say we have this robotic platform and we train one of the models I have just talked about using weak supervision; the hope is that the model learns something about the physics of the environment. But if I change the problem so that there are now three objects in the arena, then my representation is either not disentangled anymore, or it can only store a disentangled representation of a single object. The fundamental problem is that I have been talking about disentanglement for so long, yet I am still using a distributed representational format.

This is one way to interpret Slot Attention. Slot Attention is a differentiable interface, a layer between distributed and set representations of task-dependent high-level variables. It is very similar to capsule networks, but the main difference is that capsules specialize to always bind to a particular class, while in our case the slots can bind to any entity in the input scene, and this binding happens through an iterative and competitive attention procedure on top of a learned encoder. This competitive attention procedure can be thought of as a meta-learned clustering, and I will show you why. This is how Slot Attention works: it takes as input feature maps with position embeddings, and then runs an iterative procedure. First you initialize the slots by drawing from a normal distribution with learned mean and standard deviation. Then you iterate: you compute the dot product between projections of the inputs and the slots, you take the softmax as in regular attention, but now you normalize the softmax over the slot axis; you take a weighted mean, instead of a weighted sum, to compute the updates; and you update the slots using a gated recurrent unit. If you look at this pseudocode hard enough, you will realize that it is basically the same as soft k-means, with some minor differences. In soft k-means you initialize the centroids (in our case we call them slots), and then you iterate: you compute the Euclidean distance between each point and the centroids, take the softmax for the soft assignment, and then perform the update, which is the weighted mean of the inputs with the weights from the softmax (this recomputes the centroids), replacing the old values of the centroids with the new ones. So the slots in Slot Attention can be thought of as having to compete for explaining parts of the input, very much like in a clustering algorithm; in this case it is meta-learned, because the objective function is not just the Euclidean distance, we have learned projections, and we have a parameterized update with the GRU.
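A condensed sketch of the iterative update just described, in PyTorch (layer norms, the residual MLP, and other details of the published implementation are omitted; dimensions and defaults are illustrative):

```python
import torch
import torch.nn as nn

class SlotAttentionSketch(nn.Module):
    """Simplified Slot Attention: slots compete for input features via attention
    normalized over the slot axis, then are updated with a GRU."""

    def __init__(self, dim, num_slots=4, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        # Learned Gaussian used to initialize every slot (shared across slots).
        self.slot_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slot_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, inputs, num_slots=None):       # inputs: (batch, n_inputs, dim)
        b, n, d = inputs.shape
        s = num_slots or self.num_slots               # slot count can change at test time
        slots = self.slot_mu + self.slot_logsigma.exp() * torch.randn(
            b, s, d, device=inputs.device)
        k, v = self.to_k(inputs), self.to_v(inputs)
        for _ in range(self.iters):
            q = self.to_q(slots)
            logits = torch.einsum("bnd,bsd->bns", k, q) * self.scale
            attn = logits.softmax(dim=-1)             # softmax over the SLOT axis: slots compete
            attn = attn / attn.sum(dim=1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum("bns,bnd->bsd", attn, v)  # weighted MEAN of the values
            slots = self.gru(updates.reshape(-1, d),
                             slots.reshape(-1, d)).reshape(b, -1, d)
        return slots                                  # (batch, num_slots, dim)
```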
Slot Attention has two key properties. The first is permutation invariance with respect to the input: if I permute the input, the output of Slot Attention is the same, which means it is suitable for set-to-set processing (if I want to go from a grid to a set, i.e., from a distributed representation to a set, I need position embeddings). The second is permutation equivariance with respect to the slots: if I permute the slots, the output is permuted in the same way. This is useful because the slots really act as object files that learn a common representational format, and each slot can bind to any object in the input scene. This is how you use it: you have a convolutional neural network, you place Slot Attention in the middle, and whatever you do on top you apply with shared parameters across the slots. For example, if you want to do object discovery, you apply a decoder to each slot, which reconstructs one object at a time; if you want to do supervised learning, you apply an MLP head and then you need to match the predictions with the ground-truth labels using a matching algorithm.
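For the supervised set-prediction case, the matching step can be done with the Hungarian algorithm, as in the Slot Attention paper; here is a sketch using SciPy (the pairwise L2 cost is an illustrative choice):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_slots_to_targets(predictions, targets):
    """Match unordered slot predictions to unordered ground-truth targets so
    that the training loss is invariant to the arbitrary slot ordering.

    predictions: (num_slots, d) array; targets: (num_targets, d) array.
    """
    # Pairwise cost between every prediction and every target (here: L2 distance).
    cost = np.linalg.norm(predictions[:, None, :] - targets[None, :, :], axis=-1)
    row_ind, col_ind = linear_sum_assignment(cost)  # Hungarian matching
    return row_ind, col_ind, cost[row_ind, col_ind].sum()
```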
An important property of Slot Attention is that the number of slots can be changed at any time, in particular at test time, because the number of objects we can represent is arbitrary. It works fairly well: it is competitive with baselines, in particular the Deep Set Prediction Network, which is a proper set-prediction method, and it is already much better than using an MLP as if it were a set method, i.e., enforcing an arbitrary ordering of the slots in a distributed representation as you would do, for example, in disentanglement. Of course there are tricks to make the method better, but that is not particularly exciting. What is exciting is that the object masks, which are not supervised, actually emerge during training and segment the scene in a reasonable way, achieving a pretty decent ARI, which is a segmentation metric. At test time you can change the number of objects, and here this really means changing the architecture to increase the number of slots; the performance degrades, of course, because the problem is also harder, but it degrades gracefully. If you had a distributed representation, you would not be able to change the number of objects in this way. So I think slot-based architectures are very promising for learning and reasoning about objects, and attention is an extremely powerful mechanism that can act as an interface between perception and abstract representations. There are a bunch of limitations that will be interesting to explore: for example, adding hierarchy, so that you can still have some degree of specialization in the same way as in programming, where you have classes and then instantiate as many objects as you want from a given class. A second limitation is that these variables are only task-dependent in Slot Attention, and we do not really have strong biases from the data side, as you would have, for example, if you considered videos.

To conclude my talk: what I am interested in is bringing representation learning and causality a bit closer together, and in my opinion the fundamental limitation is that these two fields speak different languages. In causality, the dictionary, the starting point, are the causal variables; in representation learning you work with images, for example, but the desideratum is to perform causal reasoning on learned representations. We can think of images as a partial view on the state of a physical system with some underlying causal structure, and causal representation learning is about training a neural network that learns appropriate descriptors for this physical system, descriptors that support causal statements that are useful and relevant for a set of downstream tasks we actually care about. It is very difficult to talk about causal graphs for images, because there is no one true causal graph in nature: the granularity of the causal statements you can make depends on what you care about. If I care about moving these objects, the appropriate descriptors are likely the objects; if I care about how these objects are made, I should probably look at the molecules, and so on. The key question going forward is: why do we need any of this? The skeptical listener would say, why can't we just train big networks, because if any particular structure were useful at all, a network could implicitly learn it in a distributed format given enough data and enough episodes. The short answer is that it is unclear whether we need it or not, and GPT-3 is a great example of this. However, I would argue that generalization, and in general the performance of a learning algorithm, is related to the assumptions we make, and in deep learning we make plenty of assumptions too: on the architecture, on how we train the model, on which data we pre-train it. We need to talk about these assumptions explicitly, so that we can selectively incorporate the ones we truly want into the training framework. The example here is the inductive biases in disentanglement, which were supposed to turn the problem from impossible to possible and then in practice did not really deliver. Thank you, I hope I am on time.

Fantastic, no, you're good. Okay, so I guess we have time for questions; maybe I'll kick off while we see if there is any curiosity. One question I had: when you talk about disentanglement and the different kinds of variation, there is the idea that you can make strong assumptions about the kinds of variation you have. There are many different kinds, right? There could be geometric variations, viewpoints, and then there is the idea that you can take a group-theoretic point of view, where in some sense these changes are quite rigid, they don't have many degrees of freedom. But if you start to talk about more substantial things, people talk for example about small deformations, and once you allow deformations you can do all kinds of things, and even semantics can be seen as a sort of nuisance. I am not an expert on disentanglement, but in this landscape of potential changes you want to disentangle, how can you distinguish between what is more rigid and what is more flexible, and what the semantics are at the end of the day? It seems to me that it is very much task-based: depending on what you want to do, you will want to factor out something, and I don't quite see what the general way to think about this is and how it relates to what you were discussing.

I think this is a great question, and it really highlights why I think unsupervised learning is not good for this: if you don't put any constraints on what you want to model as factors of variation, it is really hard to learn something meaningful, because the only possible solution would be to learn the most fine-grained representation you can get in terms of factors of variation, since you have no other information.
I think it is in the weakly supervised case that things actually get interesting, because what counts as a factor of variation is of course induced by the task, if you have some sort of supervision, but also by how you construct the pairs. We said that factors of variation are the things that change sparsely, so if we never see something changing across two frames, that thing is not a factor of variation. To me this is very interesting, because it is somewhat misaligned with what I used to think factors of variation should be; for example, object class should be a factor of variation, why not, but it is also pretty rare that you see a video of a ball rolling and it suddenly turns into a car, so maybe it should not be one. I find it exciting that factors of variation are actually data-dependent, and that they depend on which interventions you can perform, because this opens up a whole lot of possibilities for machine learning, in my opinion.

So, as I was saying, there seems to be a trade-off between the universality of the representation and the task specificity, right? Exactly, there is no free lunch. And we already do this in deep learning: data augmentation is basically doing this. When you have an image and you algorithmically generate an intervened version of it, and then you ask your representation to be invariant to those transformations, you are essentially introducing a style variable and asking your representation to ignore it. Yes, absolutely.

Maybe, while we are waiting, there is a question on YouTube by Bashir Sadeghi, who asks: based on your representation, can we say anything about representation dimensionality? I think that with infinite data you could in principle learn the intrinsic dimensionality of the data manifold. More pragmatically, in our experiments we fixed it to a number; it is a hyperparameter, and we did not even sweep over it, we just kept it fixed to keep the computational budget in check.

Okay, other questions? Otherwise we can thank Francesco in the usual Zoom format. Thanks a lot, it was super interesting and super stimulating; we will probably get in touch so that we can have a bit more of a private conversation. Great, thanks a lot for today; we hope to have you here in person very soon. Thank you very much, have a good day.
Info
Channel: MaLGa - Machine Learning Genoa Center
Views: 1,460
Rating: 5 out of 5
Id: 0Zq1PbPpLug
Length: 54min 37sec (3277 seconds)
Published: Tue Feb 02 2021