From Deep Learning of Disentangled Representations to Higher-level Cognition

Captions
>> Okay. Good afternoon everyone, and welcome to the MSR AI distinguished lecture series. I'm delighted today to have Yoshua Bengio, from the University of Montreal, as our second in a long series of speakers. Yoshua is immediately recognizable as one of the key figures in the deep learning revolution that's taken place in the last five years. And for those of you who think he's new to the field, just jumped in this century, he's been at it for more than 25 years. In fact, I wrote down one of his earliest papers, something called "Data driven execution of multi-layer networks for automatic speech recognition," which was published in 1988, I believe at AAAI. >> That's 30 years. >> Oh, you're right. I'm not very good at math, sort of soft math. And many of the important advances that we see today in speech, vision, text and images, machine translation, are directly attributable to Yoshua's work and that of his students. And that recognition is evident in many ways. If you start with something like citations, his work has been cited, last year alone, more than 33,000 times. That's a career for dozens of people, and just a one-year sample of Yoshua's influence. The other way in which his work is felt is in the long stream of algorithmic innovations that he's had. Most recently, they come in the form of unsupervised learning, notably the work on generative adversarial networks, and attention models, such as the gating that's been used for machine translation, but really opens up whole other doors to using a variety of other data structures. And then perhaps the one that's a little more hidden, the form of influence that's a little more hidden, is his tremendous work in education and in supporting the community. It comes from his textbook, which is advertised there, a lovely volume. But you'll note that it also has all of the chapters online. He's worked tirelessly to promote new conferences in the area, like ICLR, to develop community, to develop tools that are broadly shared, and to educate a whole new generation of leaders, and I think that in many ways may be as lasting a legacy as all the technical contributions. And so today, what he's going to talk about is how we move on from all of the amazing breakthroughs that we've seen along perceptual dimensions, in speech, vision, and to some extent language, to a much higher-level cognition, which is a pursuit that he's been interested in for many years, since the early PDP models and perhaps before that. So please join me in welcoming Yoshua. >> Thank you, Susan. You hear me fine? Yeah? Okay. So. Yeah. You see me fine. So here, we have a little bug. We're just going to. So, thanks Susan for the kind words. And as you said, I'd like us to move away from what we're doing now, which is great and is giving us amazing industrial successes, to something closer to human-level AI. And in that respect, I think it's important to look at the kinds of mistakes and failures that our current systems have. And I've spent some time examining that. I'm sure most of you are aware of the adversarial examples issue, illustrated here from the work of my former student Ian Goodfellow and his collaborators at Google Brain, showing that if you just change an image a little bit in a very purposeful way, you can completely fool a classifier. But if we go beyond this sort of amazing thing, what can we conclude about the failures of our current systems?
And I would say, the strongest thing I see is that they are learning in a way that exploits superficial clues that help to do the task they're asked to do. But often, these are not the kind of clues that humans would consider to be the most important, and often these reveal that the models don't really capture the underlying explanations, the underlying nature of things like objects, for example, as we understand them by thinking about images as coming from physics and the 3-D world. So, a lot of what we're talking about is how we can move forward beyond this tendency of current models to sort of cheat by picking up on surface regularities, and why this is important. So we just put out, a couple of months ago, a paper illustrating one more time one of these failings. We take deep convolutional ResNets and we change something superficial in the data distribution, which is the spectral distribution in Fourier space. For example, by filtering the images in Fourier space: we take images like these, and in the Fourier domain we apply a mask like this, which just smooths things out. So in other words, we get rid of high frequencies, and you basically don't see much change. But unfortunately, a network trained with these kinds of images does very poorly on these kinds of images. And it gets even worse if you do this kind of filtering, where you randomly enhance or reduce some of the spatial frequencies, and now you get these images which humans would still recognize properly, but there are some weird colors that show up in some places and so on. And that completely throws off the neural nets. So if you train on images like these, and then you test on images like these, you get really bad errors. The error rates go, for example, from, say, 6.5 percent to 34 percent or something. I'm not going to explain all of these, but basically you've trained on one of the datasets that has one way or the other of changing the Fourier characteristics, and then you test on the others, and humans would still see the objects, which are the things that matter, very clearly, but those networks would get fooled. You can of course considerably reduce those effects by training the network on all of these types of data, just like you can reduce the effect of adversarial examples by training with adversarial examples. But of course, then there would be something else that shows up, because the network probably still didn't really capture the objectness that we have in mind. So why is this important? As we are deploying machine learning in the real world, what happens is that the kind of data on which those systems will be used is almost for sure going to be statistically different from the kind of data on which they were trained. And as an example of this, consider self-driving cars or vehicles, for which we would like them to behave well in these rare but dangerous states. And unfortunately, with the kind of models we have now, trained with supervised learning, if these examples are rare, they're probably not going to be learned very well. And [inaudible] to these special kinds of domains, like near-accident situations like in the picture, might be difficult. So, humans are actually pretty good at dealing with such cases. I only had one accident in my life, right? How do they do before the accident? And how do they do after that? We only need a single example. We don't need to die a thousand deaths to know how to prevent dying while we drive. What's the catch here?
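As a rough illustration of the kind of Fourier-domain perturbation described above, here is a minimal NumPy sketch that low-pass filters an image by masking out high spatial frequencies. The cutoff value and the radial mask are illustrative choices of mine, not the exact filtering protocol used in the paper.

```python
import numpy as np

def lowpass_filter(image, cutoff=0.25):
    """Smooth an image by zeroing high spatial frequencies in Fourier space.

    `image` is a 2-D float array (one channel); `cutoff` is the radius of the
    kept frequency band as a fraction of the Nyquist frequency. This is only a
    rough sketch of the kind of spectral change discussed in the talk, not the
    paper's exact protocol.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))         # centre the zero frequency
    fy = np.fft.fftshift(np.fft.fftfreq(h))                 # normalised frequencies in [-0.5, 0.5)
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)   # radial frequency of each bin
    mask = (radius <= cutoff * 0.5).astype(float)           # keep only the low frequencies
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return np.real(filtered)

# A train/test mismatch experiment of the kind described above would compare a
# classifier trained on `image` against its accuracy on `lowpass_filter(image)`.
image = np.random.rand(224, 224)    # stand-in for a real photo
smoothed = lowpass_filter(image, cutoff=0.25)
```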
My intuition about this is very simple. We have a mental model that captures the explanatory factors of our world, to some extent. It's not perfect. And we can generalize to new configurations of the existing factors. We already know about trucks and cars and the physics of objects and driving. And we already know about social behavior and all kinds of general knowledge that allow us to intuitively, quickly know what's the right thing to do in many new situations completely different from those we've seen, because they involve concepts that we already know, just combined in very new ways. Instead, our current machine learning would just tend to drop dead on these statistically very different situations. I think we need to make serious progress on the ability of our models to discover and understand the underlying explanations, the underlying causal relationships, so that they can make valid predictions in scenarios, in situations, that are very, very far from anything that's been seen. And that can be important also when you consider machines that plan. Because one of the reasons we plan is to avoid those situations, like dying in an accident. But if we don't have a good way to project ourselves into these situations that are very, very different from the ones we typically see, then we won't be able to do that. As I've been saying for more than a decade now, it's all about abstraction, and deep learning, as we conceived it more than a decade ago, really was about learning multiple levels of abstraction. And one of the ideas that we proposed pretty early is that it would be nice if we had algorithms that can separate out these explanatory factors. The phrase we use is to disentangle the factors of variation. At the last NIPS there was a workshop dedicated to this question, and it's still not clear exactly what it means. But different people have different ways of trying to formalize this idea, and I think we should continue to try to do that. This notion of disentangling is related to, but different from, the notion of invariance, which has been a very important notion in computer vision, speech recognition and so on, where we'd like to build detectors and features that are invariant to the things we don't care about, but sensitive to the things we do care about. But if we're trying to explain the world around us, if we're trying to build machines that explain the world around them, that understand their environment, we should be prudent about which aspects of the world we would like our systems to be invariant to. Maybe we want to capture everything. And the most important aspect isn't what we get rid of or not, but rather how we can separate the different factors from each other. If you're doing unsupervised learning, say with speech, you'd like to have features that capture the phonemes and you'd like to have features that capture the speaker. And even if at the end of the day you get rid of the speaker ID and you only care about what's being said, it works; if later you decide that, well, what you care about is the speaker ID, then that also works. The other thing that happens with disentangling is that it's going to help us deal with the curse of dimensionality. It is a good rule of thumb for understanding what a good representation is. The idea is that when we transform the data into that space, machine learning becomes easier. So, in particular, the kind of complex dependencies that we see in, say, the pixel space will become easy to model, maybe with linear models or factorized models, in that space.
And there are many instances of this. One of the early ideas, which has given rise to a number of unsupervised learning methods in deep learning, is that we are going to learn a two-way map, or probabilistic transformations, between the data space, pixel space or whatever, and some representation space. So, we have encoders and decoders, and by now everybody uses those terms. And so, what do we want from our encoder and what do we want from the decoder? Okay, so one is sort of the inverse of the other. But one of the early insights, from around 2010 or something, is that what the encoder does is take the data distribution, which is complicated, here represented as a spaghetti. So, the set of points corresponding, say, to images is along this curve. So we have this spaghetti and, of course, we don't know where the spaghetti is, but what we'd like to do is flatten the spaghetti into something that is going to be easy to model. Once it's flat, then it's maybe like a Gaussian or something. And predicting things in that space becomes very easy. Also, if you're able to do that, so you know this two-way transformation, generating becomes easy too, because if the manifold of points is now almost a straight line, then generating points on a straight line is very easy. And then I apply the inverse mapping, the decoder, and I get points in the image space that look like images. So, that's sort of one geometric interpretation. It also comes with this idea that in that high-level space we have what's called marginal independence. In other words, I can sample each of the dimensions independently, rather than having them all depend on each other, so it's very easy to model the distribution in that space, whereas modeling directly in the original space is kind of hard. So, that's one view that's kind of interesting that came up pretty early. And some of the algorithms tried to do that explicitly. One of the things we did also pretty early is think about how that affects sampling methods, because sampling in the flattened space is easy. But here, I'm showing a sort of toy experiment to illustrate what happens when you do this flattening with a very simple algorithm for representation learning, which in those days was stacks of denoising autoencoders. So, what we're going to be doing is, again, we're going to take the spaghetti here, which I've just represented as a curved manifold. And it's going to be transformed in the hidden unit space into something flat. And so, one interesting question is how do we know that it's flat? So, here's a very simple way that you can check that a manifold of points is flat. You take two points on the manifold, so two images, and you take their average. So, you interpolate between them. And if the thing in between looks like an image, then it means that the manifold is flat. So, if I take these two images and I take the average, and the thing in between is also on the manifold, then the manifold is flat. If it's true for every linear combination, then you know everything sits on that convex set. But if the manifold is curved, then the things in between will not be on the manifold. They will not look like images. So, we can do that experiment. So, we can take, like, this image of a nine and this image of a three. And if we work in pixel space and we just do averages, linear combinations, we get these images which obviously are just pasting a nine and a three together but don't look like natural digits.
However, if we take the same nine and the same three and we project them into the representation space of some unsupervised learner, then do the averaging in that H-space, and then take these linear combinations and project them back using the decoder, we can look at the corresponding images. And what we see is that the corresponding images look like natural digits all the way along the interpolation line. And the two rows here correspond to interpolating at the first layer and at the second layer. And there's more than that, which is that as we move, interpolating from this image to this image, so now imagine this curved manifold, just for illustration, just like in the picture, what we see is that it's going to be nine, nine, nine, nine, nine, nine, nine, nine, and suddenly there's a very quick transition where it becomes a three. And it's all threes up to this three. And at the junction there is something that's in between a nine and a three, but still preserves either one's or the other's identity. So, all of this gives us some nice intuitions about the geometry behind what we'd like to have in these higher-level spaces. And so, there's been a lot of work in applying these ideas to generative models. As Susan mentioned, one of the really hot families of algorithms are these GANs that we started three years ago. And they are pretty amazing; I'm not going to describe them. But I think we are still far off the mark. So, what's missing? As I've been saying, if you analyze the errors made by those systems, both for recognition and for generation, you can see that they clearly don't understand the world in the way that we do. And they are often missing the point, the crucial abstractions. So, what's needed? More abstract representations. One aspect of this, I think, is that our learning theories also have to be modified. Current machine learning theory is anchored in the IID assumption, assuming that the test distribution is going to be the same as the training distribution. And so we can get very confident about our models, but when we go out of that assumption, things can break down. Somehow, humans are able to do well, as I said earlier, so what's missing? So there's this idea of learning disentangled representations, and one thing I tried to do in a 2013 review paper on representation learning is present the idea that we're not going to get very, very good machine learning without introducing some priors, some assumptions about the world. Deep learning starts with the assumption that this composition of different layers is appropriate for the data we want to model, that there is some compositionality that can be captured with these families of functions. But there are many other assumptions that are as broad, as generic, that some of our models use, and I think we should continue adding more elements to this list of priors that we put in our models in order to improve generalization, but ideally try to do it in a way that remains fairly general and not too task-specific. So, these include things like exploiting the fact that there are different scales, both spatially and temporally, and we already have models that do that, but I think there's a lot more to exploit there. I already mentioned marginal independence, in other words the idea that, if you take the data and you transform it into the right space, then all of the factors become independent.
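The interpolation test just described is easy to sketch. In the snippet below, assuming you already have a trained encoder and decoder (the `encode`/`decode` functions here are identity placeholders so the sketch runs), you compare linear interpolation in pixel space with interpolation done in the representation space and mapped back through the decoder.

```python
import numpy as np

# Stand-ins for a trained model: in practice `encode`/`decode` would be the two
# halves of, e.g., a stacked denoising autoencoder. Here they are identity maps
# so the sketch runs, which of course makes the two interpolations coincide.
def encode(x):
    return x

def decode(h):
    return h

def interpolate(a, b, steps=9):
    """Linear interpolation between two vectors, endpoints included."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1 - t) * a + t * b for t in alphas])

x_nine = np.random.rand(784)    # stand-ins for flattened 28x28 digit images
x_three = np.random.rand(784)

# Interpolating in pixel space: intermediate points fall off the data manifold
# (they look like a nine and a three pasted on top of each other).
pixel_path = interpolate(x_nine, x_three)

# Interpolating in representation space and decoding: if the encoder has
# flattened the manifold, every decoded point should still look like a digit.
latent_path = decode(interpolate(encode(x_nine), encode(x_three)))
```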
Something I already mentioned earlier is this idea that, when we move the data into the abstract space, the dependencies between the factors, the variables in that space, become really simple. And later I'll talk to you about the consciousness prior, which is one way to implement this. And then, another idea I mentioned is that of causality. So, I think the whole area of causal machine learning is still very young and I wish there were more people exploring this, because I believe this is something that humans take advantage of, and it's one of the ingredients that allows us to generalize to these very different scenarios. And I'll mention some work we started, what I call controllable factors, in that direction. So if we look at how humans learn in a very autonomous way, like look at all the knowledge that babies have by the age of two, like intuitive physics, one thing that is interesting to consider is how much they interact with their environment in order to acquire that knowledge. And I now think that for upcoming progress towards human-level intelligence, machine learning needs to move more towards this notion that acting in the world to acquire information is a very, very important tool. Humans do it, and I think this is something that we need to pay more attention to. And of course, the basic tools for that come from reinforcement learning, but we may have to reinvent a bit of reinforcement learning to move further in that direction. So, one particular aspect that we explored is this idea of controllability. I'm looking for a prop here. Let's see. So, here's a prop. I just made up a policy here to control this sheet of paper. I can move it around in space, I can fold it. I can do all kinds of fun things with it and I just made it up. I've made up these policies to control different aspects of this object. So, what this illustrates is that our brain can come up with control policies that influence specific aspects of the world, like the position of this object and different, complicated geometrical attributes here, about its folding. And clearly my brain can represent that information, so I could report it to you, I can tell you where it is, I can have a mental model of it, I can plan based on it. And so, we have these two things happening in parallel. We have the representations and the policies, and they're matched in some way. So, I have policies to affect one aspect of the world. I can decide to move just this thing, and I can represent that thing in my mind. So I'll tell you more about this. But, clearly, in order to discover these aspects of the world that are controllable, which are not all the aspects of the world but I think give us very strong clues about how the world works, you need to be acting. You need to be trying things out. And I think as you do that, very important notions from cognitive science, like the notions of objects and agents, emerge almost naturally for the learner. Well, the notion of object emerges because objects are sort of aggregates of attributes that I'm controlling together. I'm controlling x, y and z and other attributes of this thing, and they are all sort of spatially coherent as well. And the sorts of things that can be controlled are attributes of objects. And agents are the entities that can do that. I'm an agent, but I can also see you do some things and imagine how I would be doing it if I were you, or something like this. And we do that all the time.
So, the notion of agents is very, very convenient for modeling the world once we incorporate, of course, actions and trying to control things in the world. Now, there's something particular which sort of blew my mind a little bit at the beginning about the kind of knowledge that an agent acquires by interacting in the world, which is that it's not universal knowledge, it's subjective knowledge. So, the policies I have depend on my body, and that's related to what people call affordances. There are things that I can do that maybe a baby can't do. And so we have a different vision of the world, a different understanding of what can be done. And so, this is different from the maybe simple-minded view of AI where there is sort of a universal truth that all of us see; how we can control our bodies and what we are able to do kind of conditions our understanding of the world. And it's kind of unfortunate, but we have to deal with that. So, last year, we started this project of trying to put into equations and algorithms and experiments this idea of controllable factors. The idea is that one way to discover good representations, or at least some of the factors in a good representation, is that we have clues about the existence of these factors from the fact that we can control those factors without changing too many other things in the world. So, we designed a term that would be added to the training objective of a learner, in which we say, "Let's look for a policy that is going to choose some actions given some state, such that after we apply the policy, the state changes, and some feature k of the world changes as much as possible." So, the k-th policy is going to control the k-th factor, in the simplest way to think about this. So I'm going to have policy number k control unit number k in my network, and it needs to change as much as possible while the other units are not changing, or changing as little as possible. So, that's the starting point of this. And we can add that criterion to whatever other criteria; here, it's a simple reconstruction error. So, we've been able to make this kind of idea work on very simple worlds for now, and we're still facing some optimization difficulties we don't completely understand. But in little toy problems like this, we learn both an encoder and a decoder that map from pixel space, so this is like the image space, into here, a very simple 2D space, corresponding to, I mean, it discovers that what matters in this world is the position of the ball, and the reason is that the only actions the agent can do here is move the ball around. And so, it learns that what matters is the position of the ball. And so, the encoder essentially just learns to take images and spit out the position of the ball in some funny space, and the decoder can map back. So if I play with the coordinates here and then decode, I get a new image where the ball is in a different place. And it turns out we can interpret the changes here directly in terms of position, and it has also separated the changes happening in one direction versus the other direction. So that's kind of nice. You can also do things like take two images in that world, of course, encode them, take the difference between their representations, and that will tell you basically what actions are needed in order to go from one to the other, because they correspond to how much of each of the factors you should apply in order to go from one to the other. So that was like one-step actions, one-step policies.
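To make the controllability idea concrete, here is a deliberately simplified PyTorch sketch, my own reading of the criterion rather than the exact formulation in the paper: each policy k is rewarded when latent unit k changes a lot relative to the total change across units, and this term is added to whatever other loss the learner uses (here a placeholder reconstruction term).

```python
import torch

def selectivity_loss(h_before, h_after, k, eps=1e-8):
    """Encourage the k-th latent unit to change while the others stay put.

    h_before, h_after: latent vectors (batch, num_factors) produced by the
    encoder before and after executing the k-th policy's action. Returns a loss
    to minimise (negative selectivity), a simplified stand-in for the criterion
    described in the talk.
    """
    change = (h_after - h_before).abs()                # per-unit change
    selectivity = change[:, k] / (change.sum(dim=1) + eps)
    return -selectivity.mean()

# Toy usage: a batch of 4 transitions, 6 latent factors, policy k = 2.
h_before = torch.randn(4, 6)
h_after = torch.randn(4, 6)
recon_loss = torch.tensor(0.0)                         # placeholder reconstruction term
total_loss = recon_loss + selectivity_loss(h_before, h_after, k=2)
```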
We've been extending this to multi-step policies, and also generalizing from a fixed, enumerated set of factors to a combinatorial set of factors. Because if the kinds of factors I'm talking about are things like positions of objects, the problem is, how many objects are there here? How many objects are there in the world? Do I need a different neuron for each possible object that you could ever see? That doesn't make sense, right? So instead of enumerating all the factors, what you'd like to do is to have a name for factors. And the name is going to be continuous, like a vector, an embedding. And so, yeah. So, we're playing with these kinds of ideas right now, and playing the same kinds of games. But now, the names are vectors, and what we're showing here are the embeddings of the names of factors that the system discovers. And here it's the same kind of game as before, but now there are just more positions, and it discovers that the important factors are the different positions corresponding to a grid, because the world here happens to have a grid structure. Anyways. Okay. Yeah, I talked about this already and time is flying. So, let me move on to a second type of exploration, which touches on something I think I'd like to revisit in our training objectives for unsupervised learning, and it's that they are all in pixel space rather than something like an abstract space. So if you look, of course, at likelihood, if you look at any kind of reconstruction error, if you look even at things like GAN training objectives, they're all focused on what is happening in what I call pixel space, but that could be acoustic space, video space, whatever the data space is. So why is that a problem? If we were able to map the data to this better representation, as I've been talking about, then, as I said, modeling in that space, planning in that space, reasoning in that space would be so much more convenient. But actually, it's a chicken-and-egg thing. We're not going to be given those representations magically, so we'd like to have an objective function that really focuses on the abstract space. And to see a little bit what can go wrong with our current methods, let me share a little bit of a couple of years of experience with trying to do speech synthesis with neural nets. We've made a lot of progress, to the point where now Google is putting those kinds of neural nets into their speech synthesizer. But when we started this research, we were very ambitious and we said, "Okay, let's do pure, hardcore unsupervised learning: take a hundred hours of speech, and train a huge, complicated recurrent network to model speech as sequences of 16,000 samples per second, for multiple seconds." And what happens with our current best models, if they're trained in a purely unsupervised way, in other words you don't give them words and phonemes, is that they produce speech that sounds like someone speaking some other European language, if it's trained on English. But there are no words; it's like gibberish. So these models capture the texture of speech, they capture how speech sounds, but they fail completely to capture the longer-term structure, the linguistic structure. Now, there's a very easy fix to this, and this is what you have in speech synthesis systems, in all of the systems that people use these days, which is to train the models separately rather than using pure unsupervised learning.
You separately train a model that goes from phonemes to acoustics, and a model that just captures the statistics of sequences of phonemes, in other words, a language model. And that works perfectly well. Now, we can generate unconditionally by just first sampling a sequence of words or a sequence of phonemes, and then, conditioning on that, generating sounds. So, besides the fact that we have a fix by using that knowledge, I think what this teaches us is that our unsupervised learning mechanisms have not been able to discover something that should be incredibly salient, which is the presence of phonemes as a characteristic of speech. If you do, like, k-means clustering on acoustic signals, you discover phonemes immediately, right? Or, at least broadly speaking, it's a very salient statistical structure. How is it that these models haven't been able to discover them, and then see that there's this really powerful part of the signal which is explained by the dependencies between phonemes? And the reason is, I think, simply that that part of the signal occupies very few bits in the total number of bits that are in the signal, right? So, the raw signal is 16,000 real numbers per second. How many phonemes per second do you get? Well, I don't know, 10, right? Or maybe 16. So there's a factor of a thousand in terms of how many bits of information are carried by the word-level, phoneme-level information versus the acoustic-level information. And the log-likelihood, or any other criterion we use, as I said earlier, is focusing on these details. And so, yeah. As our models get better and better every year, they approach something at the higher level, but it's very painful. Maybe we need to change how we train these things. So now, let me tell you about a direction of research that we've started which attempts to fix this, and it's very much inspired by cognitive psychology, and actually very old work about attention and consciousness. So one part of the idea is, we want to design objective functions where, say, we could have encoders, but we don't need decoders, where the objective function is going to be defined purely in the abstract space. That's part of it. But there's something else here which is, I think, fairly new and connects with classical work in AI, in knowledge representation and symbolic AI. So, think about your thoughts, think about your conscious thoughts more precisely. What happens is, at any particular moment, there is something that comes to your mind and it's very low dimensional, that's my claim. You can convert that into sounds, a sentence. Not everything is like this. For example, I can do visual imagery and that's hard to verbalize, but many other things I can verbalize. But even if it's visual, it's very low dimensional. It concerns very few aspects of the world. So, why do we have this thing which seems to focus on so few aspects of reality at a time? I used to think short-term memory is, like, crazy: why is it that we can only remember seven things at a time? Our brain is so big, why would we have this limitation? It sounds like we are underusing our computational capacity. Well, I'm claiming that this may actually be a prior that we use, and that potentially machine learning could use, in order to constrain representations.
And the prior is that, the assumption about the world is that, there are many important things that can be said about the world which can be expressed in one sentence, which can be expressed by a low-dimensional statement, which refers to just a few variables. And often they are discrete, those we express in language; sometimes they're not. We can, like, draw, or somehow use that to plan. But they are very, very low dimensional. And it's not obvious a priori that things could be said about the world that are true and low dimensional. So for example, again, I'm using a prop. If I try to predict the future here, there are many aspects of it that are hard for me to predict. Like, where is it going to land exactly? It's very, very hard to predict, right? It's a game. But, I could predict that it's going to be on the floor. It's one bit of information. And I can predict that with very, very high certainty, and a lot of what we talk about are these kinds of statements. Like, if I drop the object, it's going to end up on the floor, in this context. So, this is the assumption that we're trying to encapsulate in machine learning terms with this consciousness prior idea. So, one part of it is that we need a mechanism that's going to select a few relevant variables from all of the things that we could have access to with our consciousness. So, everything that we think we see, everything that we can talk about, the things that we have access to, from low-level perception to very abstract things explaining what we're seeing, can come to our consciousness. So, we need an attention mechanism, in order to just pick a few things that are going to go to our attention, go to our consciousness, and that's the attention mechanism. So, we're working with soft attention, soft content-based attention, which is precisely the kind of attention mechanism that we introduced for machine translation a few years ago, as Susan was mentioning, and it has been amazingly successful, not just for machine translation, but for all kinds of applications now. And I think we could use the same kind of mechanism. And so, I've been using the word consciousness, but the word consciousness is loaded with all kinds of meanings and we have to be careful here about what we mean. Different psychologists or philosophers use terms like access consciousness, or Dehaene calls it global availability, but basically it's just the aspect of consciousness that concerns the selection of elements on which we are focusing in order to make a prediction, in order to act, and that we can usually report verbally. So, how does content-based attention work in machine translation? We use it to select representations at different positions in a sequence, an input sequence that needs to be translated, so that when we decide on the next word, in an RNN that predicts the next word in the sequence for the translated sentence, we can focus on one or a few words which are likely to contain the most relevant information for the next word to produce. And we use an attention mechanism; you can think of it like a little neural net that takes two things as input. It takes one candidate location, so that's where we might want to focus, where some features have been computed, and it takes the current state, which is sort of the context in which we are going to take the decision of where to focus. It outputs a score that says how much we want to focus our attention at this location.
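Here is a minimal PyTorch sketch of this kind of soft content-based attention, a generic version of my own rather than the exact translation architecture: a small scoring network looks at each candidate element together with the current context, the scores are turned into weights with a softmax over all candidates, and the weighted average is the read-out. All names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAttention(nn.Module):
    """Score every candidate element against a context vector, then take a
    softmax-weighted average. This is the generic soft content-based attention
    pattern used in neural machine translation, sketched here as a way of
    selecting a few elements of a large state."""

    def __init__(self, elem_dim, context_dim, hidden_dim=64):
        super().__init__()
        self.scorer = nn.Sequential(          # the "little neural net" that outputs a score
            nn.Linear(elem_dim + context_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, elements, context):
        # elements: (num_elements, elem_dim), context: (context_dim,)
        expanded = context.unsqueeze(0).expand(elements.size(0), -1)
        scores = self.scorer(torch.cat([elements, expanded], dim=1)).squeeze(1)
        weights = F.softmax(scores, dim=0)    # attention weights over candidate elements
        summary = (weights.unsqueeze(1) * elements).sum(dim=0)
        return summary, weights

# Toy usage: 128 candidate elements of a rich state, summarised into one vector.
attn = ContentAttention(elem_dim=32, context_dim=16)
elements = torch.randn(128, 32)
context = torch.randn(16)
summary, weights = attn(elements, context)
```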
And we're going to compute such a score for all the possible locations. And then you can generalize this idea in all kinds of ways, but this is the heart of what we're doing with content-based attention. It's been used to greatly reduce the gap between classical machine translation, based on n-grams, and human-quality translation, at least according to human evaluation as a method to evaluate quality. So, going back to the consciousness prior. Whereas in traditional machine learning and representation learning we think of a top-level representation that captures all of the factors of interest, here we're going to have two levels of representation. We're going to have this very high-dimensional unconscious state, which contains all of the abstract representations, but it contains everything, and not just the ones that are coming to our mind, not just the ones that are coming to our attention. The ones that we are focusing on at a particular time will be somehow stored in this low-dimensional representation, the conscious state. And just like in the previous slide, we are going to have an attention mechanism which decides what to pick next to go into the conscious state. And you can think of it like it's kind of choosing something out of this soup of information. And presumably, they are all going to be recurrent, because everything is happening in time. And an important element of this is, the reason we're doing all this, is to put pressure on the mapping between the input and the representation, the unconscious representation, which is like the encoder I had before, so that the encoder will learn representations that have the property that if I pick just a few elements of them, I can make a true statement, or a very highly probable statement about the world, maybe a highly probable prediction. So yeah, the aim, the objective here is the same as I had at the beginning. We are only using all of this to impose this sort of pressure, this constraint, on the representation learner, so that it learns representations that have this property: this defines a language in which we can say things compactly that are true. And each of those things we're saying involves just a few variables at a time. Another interesting thing that happens here, and that connects to classical AI, is that we're now going to have to represent here not just values of variables but names of variables. This is something unusual for neural nets, and you start seeing things like this in models like the neural Turing machine, where we have these memories with keys and values. So, the reason we need to have names of things is that, for example, if I make a prediction about a future variable, then what I have to store here is the name of the variable on which I'm making a prediction, separate from its value, somehow, because later I'd like to be able to say, "Oh, I had made a prediction about this variable, and here's the observed value and here was the predicted value, and now I'm going to have to update my parameters." And so, if I were to mush the names and the values together, I wouldn't be able to do that. I need to be able to refer to things indirectly. We can do that with neural nets, but we just have to design them with that in mind. So one little scenario that one can look at is using the conscious state to make a prediction about a specific variable, using, of course, only a few variables that are part of the conscious state.
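Here is a deliberately simplified sketch of that little scenario, under assumptions of my own rather than any published formulation: variable names are embeddings, a low-capacity predictor maps the conscious state plus the name of the chosen variable to a predicted value, and a crude variance term stands in for keeping the representation informative. Everything here, from dimensions to the anti-collapse term, is illustrative.

```python
import torch
import torch.nn as nn

class ConsciousPredictor(nn.Module):
    """Predict the future value of one selected (named) variable from the
    current low-dimensional conscious state. A toy stand-in for the kind of
    objective discussed in the talk."""

    def __init__(self, conscious_dim, num_variables):
        super().__init__()
        self.name_embedding = nn.Embedding(num_variables, conscious_dim)  # "names" of variables
        self.net = nn.Linear(2 * conscious_dim, 1)                        # deliberately low capacity

    def forward(self, conscious_state, variable_id):
        name = self.name_embedding(variable_id)
        return self.net(torch.cat([conscious_state, name], dim=-1)).squeeze(-1)

def training_objective(predictor, conscious_now, conscious_future, variable_id,
                       observed_value, entropy_weight=0.1):
    # Prediction error on the one variable we chose to talk about.
    prediction = predictor(conscious_now, variable_id)
    prediction_loss = (prediction - observed_value) ** 2
    # Crude anti-collapse term: reward variance across the batch, a stand-in
    # for "keep the representation's entropy high".
    variance = conscious_future.var(dim=0).mean()
    return prediction_loss.mean() - entropy_weight * variance

# Toy usage: batch of 8, a 16-dimensional conscious state, 50 named variables.
predictor = ConsciousPredictor(conscious_dim=16, num_variables=50)
loss = training_objective(predictor,
                          conscious_now=torch.randn(8, 16),
                          conscious_future=torch.randn(8, 16),
                          variable_id=torch.randint(0, 50, (8,)),
                          observed_value=torch.randn(8))
```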
And then we could just use a kind of log-likelihood which tells us how good our prediction is, weighted by which variable we're making the prediction on, assuming we only predict one thing at a time in this very simple scenario. But just making this weighted prediction isn't going to be enough; otherwise, the system would just learn to extract variables that are easy to predict but are kind of meaningless or useless. And there are many such variables. So, we are currently exploring different training objectives for this. But one important idea is, we would like the representations to have high entropy, in other words to preserve a lot of information about the data. So that's the idea of maximizing the entropy of the representation. There's also a very old idea which we're going to be reusing, which is the idea of maximizing mutual information between past and future representations. So Sue Becker, who did her Ph.D. with Geoff Hinton around the same time as me, used these kinds of criteria to extract features in the spatial domain that had the property that there was high mutual information between the values of the features at nearby locations. And I think this is something that's relevant here, for what we're trying to do. We're trying to define a training objective for this consciousness thing that makes past conscious states highly predictive of future conscious states. But not just predictive: we also want high mutual information, so in other words, they also capture a lot of information together. Okay. Another potential source of training objective, which I would like to minimize as much as possible but maybe we have no choice, is not just a kind of predictability, but a sort of usefulness for reinforcement learning. And so, what do we use our thoughts for? Well, it's very clear that our thoughts have a very strong influence on our actions. So our thoughts are used to condition our actions and also to plan our actions. Very often we have mental imagery right before we do something, because it helps us figure out whether it's a good thing to do or not. Right? So we could use this consciousness mechanism as conditioning information for a policy, and allow it not just to make a single prediction, but to imagine a future, an unfolding over many time steps into the future. A few more things I want to close on. I mentioned that we don't really want to have a different neuron for each factor; we want to use this notion of a distributed representation for each factor. So you can think of each factor as like a concept that we can use language for, and so they're going to have an embedding. And there's been actually some earlier work done by Mike Mozer in the 90s about how one could represent discrete concepts in a neural net, using something like a Hopfield net. So if you imagine that you have a group of neurons that collaborate with each other to move together towards a sort of stable fixed point, a stable attractor, that sort of clean-up mechanism is a sort of discretization that makes a lot of sense, based on psychological experiments, to explain our ability to manipulate discrete concepts in our mind and take decisions. Right. There are of course connections between this work and classical AI. You can think of something like a statement about predicting a variable given conditioning variables as just a connectionist way of talking about a classical AI rule. Right?
It doesn't have to be a rule; it could be a fact, it could be something about the current scene that we know is true, or has high probability, that is also represented in the conscious thought. I think that this notion of having to refer to variables by a representation of the variable itself, the symbol has a name, is something that comes in handy if you want to implement some kind of recursive, compositional computation, which is very common in classical AI. And of course, this also makes a connection to language. I'm hoping that it's going to help to associate the perceptual world with natural language, to ground the natural language, and vice versa. What I'm hoping is that when learners that have this kind of consciousness prior are learning with language and perception, the natural language they're getting from humans is going to help the learners by giving them hints about the high-level abstractions. So I think when we talk to our children, we're giving them hints about what the relevant abstractions are that are useful in the world around them, and probably accelerate their learning in this way. Okay. I see that it's already 4:00. Let me have a last slide here about something very different. I think it's time the machine learning community, the research community, starts thinking beyond developing the next gadget. Of course those gadgets are very profitable, but there are other things that matter in the world. And it'd be nice if grad students and researchers around the world, instead of working on ImageNet, could work on datasets for which, if we could have good solutions, we could help millions of people with medical problems, or education problems, or environment problems. And I think that organizations like the Partnership on AI, to which Microsoft belongs as a founding partner, are the kind of organization that would fit very well with a mandate of making such things happen. But more than that, it's not just doing machine learning on these kinds of AI-for-good applications, it's also coordinating this kind of work across many companies and labs around the world. Because right now, different companies are doing things more or less in that direction, independently, each hoping to show off, "Hey look, we are good." But I think we would all be very much more productive if we coordinated those efforts, if we prioritized what could help people the most, if we talked, if we invested in talking to the people in those, say, poor countries who could benefit most from this, to understand better what could have the greatest impact. And if we didn't just do the science, but also made sure that the grad students in those poor countries also learned the machine learning that goes with it, and maybe came for internships in our big labs and went back with the state-of-the-art knowledge to their countries, to bring that future wealth there, instead of us trying to save them. So, yeah. I think there's a lot that we can do that's not too complicated, that would be not just the right thing to do, but also would, in the long run, help all of us. Thank you. >> We have time for questions. >> What do you think about this innateness debate? So, human brains solved the problem over billions of years. >> Yes. >> Of evolution, and presumably there are a bunch of primitive constructs that you can use as the base ingredients in your unconscious state, right? So the question is, do you envision that we should try to induce them again from data, or is there another way?
>> There are two things we're combining already. One is, we are using our ingenuity and math and insights to build in the kind of priors that evolution has optimized. And we're starting to use the same kinds of mechanisms as evolution used, like meta-learning and all that stuff. So actually, my brother and I started doing this kind of thing in the early 90s and it just didn't work because, I mean, it worked on such a tiny scale that we abandoned it, because it's too computationally expensive. But now, we're starting to have the computational power to do these things. But ultimately, we also want to understand what the underlying principles are. So I think we can use computing power to do some of that work of discovery, but also do it in a way that we understand the principles that give rise to good machine learning. >> So maybe let me follow up with sort of a concrete scenario. For example, biomedicine. So there is a lot of this sort of intuitive, or from experience, a little bit hand-wavy reasoning in medical decision making. And there is actually a very rich body of this kind of knowledge, like all the entities, relations and so forth. But, in the future, obviously, we will also have a lot of big medical data, like all the sensors and so forth. The question though is, do you envision that we learn all those kinds of phenotypes objectively from data and ignore all the prior knowledge? Or is there some kind of middle ground? >> So there's always a middle ground. >> Okay. >> But the equilibrium point is shifting towards more data and less human-engineered knowledge as we're collecting more data. So there's always a trade-off, and it depends on the quality of the knowledge we have. So some things are stronger and we should always use them, and some things are so-so and maybe data should be able to override them. If you look at the last 20 years of NIPS proceedings, it's all about specific knowledge that people are putting in their algorithms, right? Yes? >> My question is about what makes something part of the consciousness prior. Is it just- >> You mean, what things come into our consciousness, or? >> No. Like, there are these facts in the world, or the observations. >> Right. Right. >> Is it just another layer of abstraction you are thinking about to make the computation easier, or will it have some special properties, like being shared across agents, learning across different domains, kind of belonging to common sense, belonging to what agents share in terms of acting in the world, these cognitive special properties in your mind, shared across applications and agents? >> So I think, first of all, there is a lot of knowledge that we have and that we are consciously aware of, but that is still hard to communicate, but a lot of what comes to our consciousness is precisely the stuff that we tend to communicate. And it's interesting to ask, what about animals, which don't have language? Do they have something equivalent? I think some primitive version of that, yes. And we probably have a stronger type of consciousness, because it's enhanced through our learning to communicate with others using it. But a lot of the common-sense knowledge isn't something we are even aware of consciously. I think that's why the classical AI program failed, because we were trying to build, like, the roof of the house and we didn't have the scaffold. And the scaffold is perception and the low-level understanding of the world.
So this is sort of near the top of the house that I'm talking about, and it's not built yet either. And it's connected to reasoning and symbolic stuff and all that. Yes. >> So I'm interested in one example, the linearity idea in the first half of the talk. Like, in the feature space of [inaudible] we should have [inaudible]. So does that mean that somehow imposing linearity towards the end of the neural network can help us resolve some of the robustness issues? >> Yes. So that's precisely what the consciousness prior is doing. I mean, I didn't talk about linearity, but if you think about what this equation means, it says that we take the data, we bring it to this sort of consciousness level, this representation level, and in that space I'm going to have a predictor that's very sparse, because I just take a few variables. And it's also very simple. So the neural net that predicts this guy from this guy is sitting on top here and, hopefully, is very, very simple. It's just something simple, like a linear model or a small MLP. Very, very simple. It doesn't need to be linear, it can be whatever you want. But the point is, by constraining the capacity at this level, we are forcing the representation to somehow come up with those factors, those representations, those features, that have the property that we can now do very simple operations in order to predict stuff from other stuff. Yes. >> So this kind of relates to that question. Earlier you had a picture interpolating two images, can you go back to that? So you showed that in the pixel space, interpolating images doesn't give you anything that makes sense, but- >> Well, it does make sense, but it's not what we want. >> So you're trying to say we should operate in the abstract space. But another thing that will give you the same interpolation is, for example, when you interpolate with the Wasserstein distance; that will also give you something like on the top instead of something like on the bottom. So I'm just saying- >> What do you mean by interpolating the Wasserstein distance? >> There is a well-defined meaning, the barycenter in fact. >> So you mean define a metric that has the Wasserstein distance locally. >> So there is a well-defined notion to interpolate in- >> Wait, the Wasserstein distance is a distance between distributions. So it doesn't make sense, what you're saying. >> But in this case you can think of the picture as a histogram of pixels. >> No. >> No, it's well defined, it's well defined. There was a tutorial at the last NIPS on optimal transport. >> Yes. >> So I guess what I am saying is that the abstraction structure could be, in fact, a different structure from just an abstract space with a Euclidean structure on it. It could be, actually, a more complicated structure. >> Oh, you're saying we don't have to do linear interpolation at the top. We could do something different. >> Yeah. >> Sure. >> It doesn't actually have to be the linear interpolation. >> Yeah. I don't know what the right thing is, but hopefully it's something simple that can be learned quickly. That is the most important thing, right? It doesn't need many parameters. That's, I think, what allows us, for example, to do one-shot learning once you are presented things in this abstract space. Because relations are simple and sparse, a single example or a few examples are enough to sort of deduce relationships that you didn't know before. >> Yes. I agree. >> Yeah. That is the main characteristic. Patrice?
>> I want to challenge a little what you said about- >> I was sure you would. >> What you just said about the limitations of deep learning. But before I do that, I want to challenge the first thing that you said, that humans are very good at learning in an unsupervised way. So, let us take the concept of complexity. Did you learn complexity from the physical world, from your interaction? >> No. >> Where did you learn it? >> Not so many years ago. >> But you learned it in school, right? >> Yeah. >> And I think that's important, because we learn things gradually. >> Right. >> And I am going to claim that the problem with- >> I agree. >> Deep learning is that the hypothesis space is fixed. And the way humans learn is that they grow the hypothesis space gradually. >> I agree. >> And they grow it with help. >> I don't see why you're saying that deep learning has that limitation. I wrote a paper in 2009 called Curriculum Learning. >> Yes. >> That's related, and I think you should go back to that. >> No, this is all in line with that. There is no contradiction with what you are saying. >> Well. So the- >> So, one of the early ideas was that we gradually build new concepts thanks to the concepts we have already learned. >> Yes. >> And I do not see- So, here, you think of it like I am showing a snapshot, but in the evolution of the learner, presumably this space would get richer and more abstract. >> Right. But I think this gradual growth of the hypothesis space is where we need to focus, and I think this is a very interesting thing to model, but it requires- >> But that's your focus. You are going to solve that problem for us. Yes, I agree. >> I'd like to, just on that note, come to the idea of theory revision, right? As opposed to gradually growing things, you talked about children having causal theories. They often have causal theories that are wrong. >> Yes. >> And you have another example that completely changes the theory, right? It is not like gradual change, not repeated exposure. How does that fit into this notion of just massive revisions of what the underlying representation looks like? >> Well, I liked that question because I don't have the answer. >> Yeah, you know that one? >> Right. So, the good news. So, here, I mean, knowledge is at different locations here, but in particular, there's the mapping to representations. Then, there's this very compact representation of how the things are related in that space, sort of corresponding to the set of rules, if you want, right? And the good news is that the set of rules is very easy to change. That's where you can do one-shot learning. This is where you can completely change your view on something without having to rewire everything. And that's connected to the classical idea that, "Oh, I can keep all my rules except I change this one, and now my conclusion is going to be completely different wherever that rule is relevant." So, by factorizing the representation from the facts and rules, I think it makes it much easier to do what you are talking about. Whereas, if it is sort of hidden in the mush of one big neural net that does everything, it is kind of difficult to change your mind about something specific. And the representation is never wrong. It might be insufficient to generalize, but it's never wrong. It's just a representation. The actual facts are represented in this top little set of rules, if you want, so I think it helps here. Thanks for asking.
I hadn't thought about it. >> If you want to make a drastic change with a few examples, you need very low capacity. If you have high capacity- >> That's right. That's what I said. The top thing has very low capacity. It's sparse. It only uses very few variables, and hopefully it is something trivial, like linear. >> So, how is low capacity reconciled with deep learning? >> We all know that the deep part is the representation, right? I mean, everything is deep learning, but the part here that is traditional deep learning is the mapping from, say, images or image sequences to that representation space. And that is where most of the capacity is, and that is where it's hard to learn, and that's where we don't do the job yet, because I think we're not putting as much pressure as we should on the representation to be abstract and to have this property that we can make predictions in that space very easily. That is what this is saying. Yeah. >> So, we get that we have an unconscious state, and we pay attention and it goes to a conscious state, but maybe I am getting at a broader concept. So, does the unconscious state encode all the world's information? >> All the what? >> All the world's information. Or is it specific to the- >> All the world's information? >> Yeah. >> Like, for example, colors. So, I want to paint- >> Pretty much everything that you can name, right? So even low-level stuff. If you ask me, like, what is the color of that pixel, I can, like, pay attention to it and tell you. >> So the input has to be from different domains. >> Yeah. It's very rich, this unconscious state. But mostly it is interesting not because it has the pixels, but because it has the higher-level things. Yeah. >> That's it. This notion that we can only put maybe seven things in our memory, it's interesting that you see this as a good prior and not- >> Yes. >> A bug of our brain. >> That's right. >> But there are a bunch of other things where we're kind of irrational, or we're not good at handling, like, vastly different scales, and we act in kind of irrational ways, and I wonder if any of those would also be interesting priors for dealing with the real world, as opposed to seeing them as kind of failures of our brains. >> Well, maybe. But I think we have to try to think of how they could be useful from a machine learning point of view and then say, "Ah, maybe we could use this, it's a meaningful thing to add," and not just, "Oh, let's shoot ourselves in the legs because humans have that failure as well." >> Right. But I think it can be useful in that the world is really complicated and we need to build- as humans, we act in that world and we're trying to reproduce that, right? >> And so some of these things might be necessary sorts of priors, just to be able to learn quickly enough. >> Right. >> For you to assume or, you know, act as if you're never going to have to deal with things that are at ten orders of magnitude different scales at the same time, for instance. >> Yeah. But we also have to be careful not to try to reproduce everything that we know about humans and the brain, because some of these things might be side effects of our particular hardware or, you know, evolution is imperfect and, you know, we like to understand what we're doing. >> Yeah. >> Last question. >> So, I am curious about what you're saying about using the low-capacity representation of the world for causal reasoning, and reconciling that with the fact that, well, I mean, yes, we often have a very kind of abstract, approximate reasoning about the world.
We know birds can fly, for example, but then, oftentimes, when you're trying to really think through a specific situation, the devil is in the details, you know. >> Yeah. >> Do you think that a good model would only use a single layer of low-capacity information, or do you think there'd be some range of- >> So, I don't think that this abstract, low-dimensional thing is the only space that matters for reasoning. Each time we project ourselves into the future, when we think about something, all the low-level stuff is there, hidden, and influencing what's going on. So I think that's one reason why traditional rule-based systems fail: because they are an incomplete description of what's really going on, whereas we are able to use our intuition along the way, which is hard to do with pure symbolic rule-based systems. So, by connecting the low-level stuff with the high-level stuff and keeping them connected, I think we can avoid that trap. >> Please join me in thanking our speaker.
Info
Channel: Microsoft Research
Views: 65,210
Keywords: microsoft research, unsupervised learning, ai, deep learning, reinforcement learning
Id: Yr1mOzC93xs
Length: 77min 4sec (4624 seconds)
Published: Thu Feb 08 2018