WSAI Americas 2019 - Yoshua Bengio - Moving beyond supervised deep learning

Captions
All right. This talk will be a little bit technical, and it's about some basic research that we have started in my group that I'm really excited about. It tries to address some of the challenges that we see with current machine learning.

So what is deep learning about? Deep learning is about representations. It's about how we can transform, say, sensory data into a form that's more abstract, such that in those new spaces of representation, computers, or brains for that matter, can take better decisions, can better understand the world, can control the world, and so on. One of the things I have been thinking about for a long time is that it would be nice if the variables in that high-level representation could actually correspond to the kinds of variables that we humans manipulate with language. Typically those variables are what we call causal variables; in other words, they are the kinds of things that you can consider either a cause or an effect of something else, sometimes both. But up to now, I don't think we had a good way of building systems that can discover causal variables. So this is one thread of motivation for what I'll be telling you about.

From the beginning, the question we were asking was: how do we transform the data into a new space of representation where the variables corresponding to the underlying causes are separated from each other, are disentangled? That's a term I introduced a long time ago. In addition to separating those variables, of course, we want to find the relationships they have to each other: which is a cause of which, and what kind of effect does one variable have on another? These are some of the fundamental questions that scientists are trying to answer when they try to understand the world, but these are also the kinds of questions that we would like an AI to figure out by itself.

What I'll be arguing today is that this question is related to one that seems very different, which we've been asking in machine learning: how do we represent the knowledge that the learner is acquiring from the data in a way that separates it into pieces that can be easily reused? So there is a notion of modularization. For a few years, researchers in deep learning have been asking how to separate a neural net into modules that can be dynamically recombined in new ways when we consider, for example, new settings, changes in distribution, or new tasks. This is related to another important notion in machine learning called transfer learning: how do we build learning systems that don't only work well on the training data, but can then adapt quickly on new data that comes from a related distribution, so that you don't need tons and tons of data in the new setting in order to do a good job, and so that we can have systems that are robust when the world changes? There are non-stationarities: the world changes all the time, due to interventions of agents, due to our incomplete knowledge as we move around in the world, and so on.

That was a bit of a long introduction, but I think it's important to understand the motivations for this. Now, in classical machine learning, the learning theory is based on the idea that we are only thinking about one distribution, the training distribution: we assemble training data and we assemble some test data from that same distribution. But once you start thinking about the scenario where things change in the world, where the distribution changes, this whole theory is not sufficient anymore.
Maybe that explains a lot of the issues companies have when deploying machine learning products: you train on data from one country and then you apply the system in a different country, or to people with a different distribution of genders or races, as we've seen recently, and things don't work as well. That sort of robustness is something that current learning theory doesn't really handle, because the theory is about how we generalize from data drawn from the same distribution as the training data. It doesn't tell us anything about how to generalize to new distributions. Those new distributions must, of course, have some relationship to the old one, otherwise we can't say anything, but still, we would like to be able to handle these changes that happen in the world, which we call non-stationarities.

I'm hoping this kind of investigation will also help us build models of the world that are better, that have more common sense than current machine learning. If you look at the kinds of mistakes that current systems make, you realize that they don't really understand the world around us. Even a two-year-old understands things like physics, of course in an intuitive way, much better than current AI systems, and even a two-year-old understands human psychology and social interactions better than current AI systems. So it looks like we are maybe missing some fundamental ingredients. I don't think, for that matter, that we have arrived, in spite of the progress we've made and that has been celebrated recently; there is a lot more basic research that needs to be done.

When we ask how to make our current systems better models of the world, one of the ingredients that comes to mind, at least to my mind, one that we don't really master well, is causality. I mentioned causal variables earlier, but current machine learning systems, even those that try to model this kind of regime, don't explicitly try to deal with the notion of causality. And if you do a little bit of introspection, if you try to observe your own thoughts, which is a good exercise by the way, you will notice that very often your mind is focusing on trying to explain what is being observed, trying to find the causes of what you've seen. This is not something we are currently doing actively in our machine learning research, but I think it will be important for building good models of the world.

Now I'm going to propose a hypothesis to help us deal with how the distributions of the data change. That hypothesis is inspired by work by Bernhard Schölkopf and collaborators, who recently wrote a great book on causality. The hypothesis is that the changes in distribution are small if you represent the information about the distribution in the right space, in the right way. In the space of pixels, it may look like things change a lot: if I shut my eyes, pixel-wise things have changed a lot, but really just one little thing in the world changed, my eyes got closed. So we don't want to model the world in the space of pixels; we would like to model it in the space of causal variables, and the hypothesis is that in that space of representation, in the structure of relationships between those variables, the changes will be small, in fact focused on maybe just one variable.
For example, if I act in the world and move this around, I am only changing one thing at a time in the world, or rather one thing in the right space of representation, the space of these causal variables. This object is something I want to model at a high level; the pixels are not something I can act upon easily, so they are not the right space for that. So this idea, that in the right space of representation the non-stationarities involve changing just a few of the mechanisms that relate the variables to each other, is what we are going to be riding on. In this research we try to take advantage of this hypothesis to build systems that can adapt quickly to changes in the environment and, in fact, learn about causality.

Let me show how we can do that. If we have represented our knowledge in this space of causal variables, then this hypothesis of small change should mean that I can adapt to a change with very little data, because if just a few things change in my representation of the world, for example I am representing those objects and only one of them changed, I only need to gather data about that change, and I don't need a lot of examples. That is what we call sample complexity in the jargon of machine learning. Now the idea is that we can turn this upside down and say: if having the right representation of knowledge leads to fast adaptation under that hypothesis, then we can use fast adaptation as a training objective to find a good representation of knowledge. This is the heart of what we are trying to do.

We are going to study this, for now, in a very simple toy scenario. I like to do these kinds of very small, easy-to-understand experiments where we really control things and can understand what we are doing. The simplest possible example of this idea is to try to figure out, given two variables A and B and observations of the pair (A, B), which one is the cause and which one is the effect. For this, we assume that the distribution of those variables is going to change because of an intervention: some agent is acting on one of those variables. The claim I made just earlier, that we can adapt faster in this setting, comes down to the following, which is what is represented in the equation here. We represent the joint distribution as the product of two modules, one which captures the probability of A and the other which captures the probability of B given A, so P(A, B) = P(A) P(B | A), and each of these modules has its own parameters. Then only one of the two modules changes when the distribution changes: if, for example, P(A) changes, the gradient measured on the module which didn't change, P(B | A), will be zero; it's almost obvious. It means that the modules which didn't change after the change in distribution don't need to be adapted; you only need to adapt the ones that changed, and that is why we will observe faster adaptation.
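[Editor's note: to make the zero-gradient argument concrete, here is a minimal sketch in PyTorch. It is my own illustration, not the speaker's code, and it assumes a discrete setup with made-up probability tables: the joint is factored into a P(A) module and a P(B|A) module with separate parameters, the learner starts at the pre-intervention ground truth, and we then intervene on the marginal of the cause.]

    # A minimal sketch (not the speaker's code): two discrete variables,
    # with the joint factored into separate P(A) and P(B|A) modules.
    import torch
    import torch.nn.functional as F

    N = 10
    torch.manual_seed(0)

    # Ground truth before the intervention: A -> B.
    true_pA = F.softmax(torch.randn(N), dim=0)
    true_pBgA = F.softmax(torch.randn(N, N), dim=1)

    # Learner's modules, initialized at the ground truth, i.e. already
    # fully trained on the pre-intervention distribution.
    theta_A = torch.log(true_pA).clone().requires_grad_(True)
    theta_BgA = torch.log(true_pBgA).clone().requires_grad_(True)

    # Intervention: a new marginal over the cause A; P(B|A) is unchanged.
    new_pA = F.softmax(torch.randn(N), dim=0)
    a = torch.multinomial(new_pA, 100000, replacement=True)
    b = torch.multinomial(true_pBgA[a], 1).squeeze(1)

    # Negative log-likelihood of the post-intervention data.
    logp = (F.log_softmax(theta_A, dim=0)[a]
            + F.log_softmax(theta_BgA, dim=1)[a, b])
    (-logp.mean()).backward()

    print(theta_A.grad.abs().mean())    # large: this module must adapt
    print(theta_BgA.grad.abs().mean())  # near zero: nothing to relearn here

[This is exactly the sample-complexity point made above: with the correct factorization, only the small module that actually changed has to be re-estimated after the intervention.]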
I believe this is related to important current issues with neural nets. One thing people observed a long time ago is called catastrophic forgetting: when neural nets are trained on one task, then another, then another in sequence, they tend to forget the old ones and need to almost relearn the new ones from scratch, which apparently is not at all what humans do; we are able to reuse past knowledge even though the tasks, the examples we see, come in sequence. I believe this is because in current neural net architectures it is as if every parameter wants to participate in every job, in every part of the knowledge representation. So when something changes in the world, when one of these modules changes in the ground-truth model of the world, all of the parameters, all of the weights of the neural net want to change, and because there are many of them, it takes a lot of time and a lot of data to adapt them all to the change. But if we were able to factorize the knowledge, if we were able to break the neural net into modules that specialize on these different causal relationships, then we might get much faster adaptation.

We verified this empirically in the very simple setting I talked about, where you only have the two variables A and B, and the learner doesn't know whether A is a cause of B or B is a cause of A, but it is trying to figure out which is which. We consider two hypotheses: one where the learner assumes that the correct answer is that A is the cause of B (that's in blue), and one where B is the cause of A (that's in red). We then see how fast the learner adapts after the distribution of A, the cause, has changed from one distribution to the next. Looking at the second distribution, after the change, we measure how fast the log-likelihood of the data improves as the learner adapts to the change in distribution. What we see is that under the model with the correct causal direction, the adaptation is much faster than under the model with the wrong causal direction. In other words, if I have the right causal model in my head, I can adapt faster to changes in the world. This is of course a toy experiment, but that is what it is designed to validate.
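[Editor's note: here is a hedged re-creation of that toy experiment, again my own simplified version rather than the authors' code. Both hypotheses start as exact fits of the first distribution; we intervene on the marginal of the cause A and let both adapt by SGD on the new data. The A->B model recovers faster because only its P(A) table is wrong, while the B->A model must relearn both P(B) and P(A|B).]

    # A simplified re-creation (not the authors' code) of the
    # two-hypothesis adaptation-speed experiment.
    import torch
    import torch.nn.functional as F

    N, STEPS, BATCH, LR = 10, 200, 64, 0.5
    torch.manual_seed(0)

    true_pA = F.softmax(torch.randn(N), dim=0)
    true_pBgA = F.softmax(torch.randn(N, N), dim=1)
    joint = true_pA[:, None] * true_pBgA   # P(A, B) before the intervention

    def factored_model(joint, direction):
        # Fit a factored model exactly to `joint` in the given direction.
        if direction == "A->B":             # P(A) P(B|A)
            marg = joint.sum(dim=1)
            cond = joint / joint.sum(dim=1, keepdim=True)
        else:                               # P(B) P(A|B), table indexed [b, a]
            marg = joint.sum(dim=0)
            cond = (joint / joint.sum(dim=0, keepdim=True)).t()
        return [torch.log(marg).requires_grad_(True),
                torch.log(cond).requires_grad_(True)]

    models = {d: factored_model(joint, d) for d in ("A->B", "B->A")}
    new_pA = F.softmax(2 * torch.randn(N), dim=0)   # intervene on the cause

    for step in range(STEPS):
        a = torch.multinomial(new_pA, BATCH, replacement=True)
        b = torch.multinomial(true_pBgA[a], 1).squeeze(1)
        for name, (tm, tc) in models.items():
            first, second = (a, b) if name == "A->B" else (b, a)
            loss = -(F.log_softmax(tm, 0)[first]
                     + F.log_softmax(tc, 1)[first, second]).mean()
            loss.backward()
            with torch.no_grad():
                for t in (tm, tc):
                    t -= LR * t.grad
                    t.grad.zero_()
            if step % 50 == 0:
                print(step, name, round(loss.item(), 3))
    # The "A->B" loss drops to its floor within a few steps; the "B->A"
    # loss needs many more samples to catch up.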
Now what we are trying to do is turn this around: if a good causal model leads to faster adaptation, maybe we can use fast adaptation as a training objective to find a good causal model. So instead of thinking of change, of non-stationarity, as a hindrance, the way we usually think about it in machine learning (we don't have enough data for the new domain, or there are too many parameters to adapt compared to the amount of data we have, or an agent in some environment faces continuous changes that are difficult to track), we think of all these things as a source of signal, a source of information that can help the learner figure out the causal structure, which is something fundamental about how the world works.

So we ran experiments with the scenario I've been talking about, where we consider a sequence of two distributions, the second one obtained by modifying the first one on one of the variables, say the cause. Ideally, we would like to use this training signal to figure out two things. One is really fundamental in deep learning: the encoder, that is, how we map the observed data, which are not causal variables (they are like pixels), to the causal variables; in other words, a transformation that goes from pixels to, say, objects. The other is the relationship between those variables: which one is cause and which one is effect.

We use what is called meta-learning to learn these things. The idea of meta-learning is that we have two types of learning embedded in each other. There is the normal learning, where the learner adapts to the examples as they come and modifies its parameters. Then there is an outer loop, at a slower pace, where we learn something more generic that is true across all of those distributions, across all of those changes. These are what we call meta-parameters, and they are adapted at a slower pace, whereas for each change in distribution we have fast adaptation of the regular parameters. The meta-parameters are the things about the world which we consider to be true across all of the changes in distribution that we will be observing.

I don't have a lot of time left, but we ran these experiments (I'm sorry the equations here didn't come out right) where we consider the two hypotheses I talked about. I'm not going to go through the math, but we define an objective function, a meta-learning objective, corresponding to how fast the learner adapts. We showed some theory: if we do gradient descent on this meta-learning objective, then the learner can actually recover, in this very simple case with two hypotheses, the correct answer of which variable is cause and which is effect. And we verified that doing this meta-learning indeed allows the learner to figure out cause and effect. We can measure how fast it converges across episodes, where each episode corresponds to seeing one more change in distribution; you can think of an agent going through the world where things change from one episode to the next, and we can see how fast, at the level of these episodes, these meta-examples, the learner figures out the correct causal direction.
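[Editor's note: below is a minimal sketch of one plausible form of that outer loop, under my reading of the talk; it is not the speaker's published objective. A single structural meta-parameter gamma is kept, whose sigmoid is the belief that A causes B, and run_episode is a hypothetical stand-in for the inner adaptation loop: here it just simulates the correct hypothesis scoring a higher online log-likelihood on each new intervention.]

    # A minimal sketch of the outer (meta) loop; gamma is the only
    # meta-parameter, and sigmoid(gamma) is the belief that A causes B.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)

    def run_episode():
        # Hypothetical stand-in for the inner loop: in the real setup these
        # would be the accumulated online log-likelihoods of each hypothesis
        # while it adapts to one new intervention (see previous sketch).
        # Here we simulate the correct hypothesis (A->B) adapting faster.
        ll_ab = -torch.rand(1)
        ll_ba = ll_ab - 0.5 - torch.rand(1)
        return ll_ab, ll_ba

    gamma = torch.zeros(1, requires_grad=True)   # belief starts at 0.5
    opt = torch.optim.SGD([gamma], lr=1.0)

    for episode in range(300):
        ll_ab, ll_ba = run_episode()
        # Meta-objective: negative log of the belief-weighted mixture of
        # the two hypotheses' online likelihoods; gradient descent moves
        # the belief toward whichever hypothesis adapts faster.
        regret = -torch.logsumexp(
            torch.stack([F.logsigmoid(gamma) + ll_ab,
                         F.logsigmoid(-gamma) + ll_ba]), dim=0)
        opt.zero_grad()
        regret.backward()
        opt.step()

    print(torch.sigmoid(gamma).item())   # close to 1: "A causes B" wins

[Each episode nudges sigmoid(gamma) toward the hypothesis with the higher online likelihood, which is one way to realize the episode-level convergence described above.]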
We have done a lot of experiments playing with different kinds of relationships between the variables: linear, nonlinear, unimodal, multimodal; I'm not going to go through all of these. We have also played with a very simple version of the system where the learner has to discover the encoder, here a very simple linear transformation that relates the observed variables to the true causal variables, and it can recover those things. Currently we are extending this work, which was done with only two variables, to more variables, and we have slightly modified the math to handle that situation. We are again experimenting with neural nets, but now those neural nets represent the conditional distribution of one variable given all the others, or a subset of the others, and what we learn is which subset; that is the connectivity of the graph. For example, we can learn the graphs shown here with three variables, where instead of having everything connected to everything, we learn the right pattern of causal dependency.

With that I'm going to stop, and thank my collaborators on this project: Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. Thank you. [Applause]

Host: Thank you very much, Yoshua. We're going to have a little bit of Q&A, but let me throw the first one at you. I was just talking to a group of chief data officers from Global 2000 companies, and I asked them whether we should be paying attention to meta-learning, and they basically told me this is science fiction. But then again, five years ago we would have said the same thing about deep learning. So, given what you've just said, what is possible for companies, for governments, and for others, and how quickly should the implementers of AI in the world start paying attention to the research you just talked about?

Bengio: Well, some companies are paying attention. As you know, I am a co-founder of Element AI, and from the beginning one of the focuses of the research group has been transfer learning, and meta-learning is one of the key tools for that. The reason companies want to pay attention to this is that very often, in practice, you are dealing with situations where you have a new scenario and you don't have enough data, so standard techniques will just not generalize very well. But you might have data accumulated from other settings, and if you can take advantage of that to get very fast adaptation and good transfer to the new setting, you can create a lot of practical value.

Host: Wonderful. Let us go to our first question here in the audience. Richard? Yes.

Audience question: [inaudible]

Bengio: I'm extremely grateful for the work that people like Pearl and others have done, and it has really inspired me a lot. What's missing from that kind of work, from the point of view of deep learning, is how we figure out what those causal variables are. The typical applications that Pearl and company have looked at are in classical science, in medicine or astronomy or whatever: the scientist defines what the variables are, and the observations are directly the values of those variables. But if you are a robot, or an AI, or a system working in a world where we don't know what the right variables are, we would like to have something like deep learning figure out the right representation. This is what deep learning is about, so that's really what I'm after.

Host: Next question, right there; a microphone is coming.

Audience question: [inaudible]

Bengio: Yes, absolutely, right. We've already made a lot of progress with deep generative models like GANs in having computers generate new images, so they are creative, but right now they are creative in the pixel space. What we really want is machines that are creative in this abstract space I've been talking about. If you again do a bit of introspection about your own imagination, your own creativity, when you project yourself into the future or into the past, or think of new things, it's not at the sensory level; it's in this abstract space of high-level variables that you are gluing together to build something new. So we have technology to build new things at the pixel level, and now we have to extend it to the ability to build new things at this high level of abstraction, where we don't necessarily build an explanation for everything, just for the few things that you are consciously aware of at a particular moment.

Host: I can't see if there is another one, so I have one last question for you. What you just spoke about is something that I don't think has made its way out to the corners beyond academia: we think of AI as a prediction machine, but there are a lot of really cool generative approaches. What does that make possible that you couldn't do before using traditional supervised learning?

Bengio: Well, many things. Humans are very good at imagining, and GANs are already being used to synthesize images, but they are also being used to synthesize things like molecules; we have research on synthesizing new materials and new drugs. You could imagine this working for synthesizing text, so you could use it as a tool for generating text conditioned on some context. I think the possibilities of using these creative tools in new ways are immense.
Host: Well, thank you very much. Can we get a hand for Yoshua? [Applause]
Info
Channel: World Summit AI
Views: 3,695
Keywords: Yoshua Bengio, deep learning, World Summit AI Americas
Id: 0GsZ_LN9B24
Length: 25min 59sec (1559 seconds)
Published: Thu Oct 24 2019