Research in Focus: Deep Learning Research and the Future of AI

Video Statistics and Information

Captions
- Hi there, I'm Seth Juarez, and I'm privileged to be here with Dr. Yoshua Bengio of the University of Montreal, who is also the head of the Montreal Institute for Learning Algorithms and is an expert in deep learning. He has a textbook out called Deep Learning, which I could not find anywhere. It's really in demand, apparently. We're here to talk about his research and the state of the art in artificial intelligence. It's a pleasure to be with you, my friend.
- My pleasure.
- How are you doing?
- Very good.
- So tell me about deep learning. For those who have been living under a rock and haven't really thought about deep learning, and for researchers out there who are getting started in the field, tell me about deep learning.
- Well, deep learning is a particular branch or approach to machine learning. And of course, machine learning is about making computers more intelligent by learning from data. Deep learning focuses on learning representations, and is very much inspired by some of the things we know about the brain.
- So tell me about general AI, there's this thought about general AI. Is deep learning taking us along the path, or is it the end goal?
- General AI is certainly an end goal. And deep learning, among other machine-learning approaches, is particularly focused on looking for a very general-purpose learning procedure. Within machine learning there are approaches that are more domain-specific, and deep learning research tends to explore fairly broad principles that could be applied to all kinds of applications, which is exactly what we need for general artificial intelligence.
- So what makes deep learning different? When I went to school we studied SVMs and decision trees, et cetera. What makes deep learning different from those kinds of models? Because it feels different to me. What is it that makes it different?
- Yes, actually in the early 2000s, even at the end of the '90s, I was starting to ask myself what advantages neural nets, and later deep neural nets, could have over what was standard at that time, you know, kernel machines, SVMs, and so on. I had some intuitions about this, and I was able to formalize them and eventually write a number of papers showing that by having these, what we call distributed representations, where say an image or a sentence is represented by a pattern of activation in one layer of a network, and by having multiple levels of these representations, we could in fact have a sort of exponential gain in the ability of the machines to generalize well and to represent information in a fairly compact way. So that's something that becomes important when you think about the issue of what is called the curse of dimensionality, which is actually a big issue for kernel machines. When you try to learn a very complicated function, these deep neural nets have the potential to generalize much better if some assumptions about the world are satisfied. So there's no magic bullet in machine learning. There's this so-called no free lunch theorem.
- Right.
- And what it says is that, OK, there's no universal machine learning algorithm, but if we make some assumptions and we are right in those assumptions, then we can win big. And deep learning is making such assumptions. Well, they seem to work well for all kinds of tasks that humans are good at. So my hypothesis is that human brains are also making those assumptions. And so as a consequence, deep learning tends to be good at those things that humans are good at.
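[To make the idea of stacked, distributed representations concrete, here is a minimal sketch in PyTorch. The layer sizes, the 784-dimensional input, and the 10-class output are illustrative assumptions, not details from the interview; the point is simply that each hidden layer is a dense pattern of activations that serves as the representation fed to the next, more abstract level.]

# Minimal sketch of "distributed representations": each hidden layer is a
# pattern of activations that becomes the input to the next level.
# Layer sizes are arbitrary and purely illustrative.
import torch
import torch.nn as nn

deep_net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # first level of representation
    nn.Linear(256, 128), nn.ReLU(),   # second, more abstract level
    nn.Linear(128, 64),  nn.ReLU(),   # third level
    nn.Linear(64, 10),                # task-specific output (e.g. 10 classes)
)

x = torch.randn(32, 784)              # a batch of 32 fake inputs
logits = deep_net(x)                  # forward pass through all levels
print(logits.shape)                   # torch.Size([32, 10])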
- So one of the things that I've found, and I spoke to you about this earlier, is that when I was doing work with decision trees or SVMs, I was very cognizant of feature selection. And you said we don't do that with deep learning. Why is that?
- We don't need to. When you do feature selection, you take a hard decision: I take those features, and I don't want to look at those features. And of course there are cases where it's truly a good assumption about the world that only a small subset of features is relevant. In that case, feature selection is obviously the right thing to do. But in many more realistic settings, almost every feature contains some information that you care about. So you really don't want to get rid of any of them, because each of them could give you a little bit of a cue about the right answer. Traditionally, the reason people do feature selection is: well, I don't have enough data, I could be overfitting, so by eliminating features I could generalize better. But there are other ways of preventing overfitting, and deep learning exploits some of those ways. It's actually not completely understood why, but very large networks that have many more parameters than you think are necessary can actually generalize pretty well. So they are pretty robust to having more features than necessary. Of course, you do need enough data to get off the ground. But in general, practitioners with neural nets tend to keep all the features.
- Is this a reason why feature-dense problems are being used in conjunction with deep learning a lot? For example, image recognition, and text and speech?
- Yeah, yeah. And in fact, it's an advantage. It's also a computational advantage, because it doesn't cost much more to multiply two big matrices on a GPU than two small ones.
- Right.
- And so you might as well go and have fairly large input sizes.
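[A hedged illustration of "keep all the features and prevent overfitting some other way": the network below takes every input feature and relies on standard regularizers such as dropout and weight decay, which are common practice rather than techniques named in the interview. The feature count and layer sizes are made up for the example.]

# Illustrative sketch: feed *all* features to an over-parameterized network and
# control overfitting with dropout and weight decay instead of feature selection.
import torch
import torch.nn as nn

n_features = 10_000                     # keep every feature, even weakly informative ones
model = nn.Sequential(
    nn.Linear(n_features, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 2),                  # e.g. a binary classification head
)

# Weight decay (an L2 penalty) is another regularizer that replaces hard feature pruning.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# A normal training loop would follow; on a GPU the wide input layer adds little cost.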
- So we've talked mostly about the supervised case. I've heard that there is some research being done around using deep learning for the unsupervised case, and I'm having a hard time trying to wrap my brain around what that looks like and why it might be interesting.
- Right. So let me give you an example of unsupervised learning, which humans are very good at. A child, when she is born, doesn't know much about the physics of our world. She discovers it by interacting with, you know, toys, and dropping things, and she starts to understand gravity and liquids and all kinds of concepts. And her parents don't need to tell her about it.
- Right.
- She figures it out in an unsupervised way. She doesn't take classes about gravity. She just observes things around her and learns how things work. So that's unsupervised learning. And it's a particular form of unsupervised learning, because she's not just observing, she's also playing with the world. But that's something that we don't know how to do really well. Even now, even though there's a lot of research in unsupervised learning with deep learning, the biggest successes in industry have been with supervised learning. And we know that there's all of that data out there that we don't really take advantage of, because we need better unsupervised learning. So we have many reasons to explore that.
- So what does that look like? I mean, again, when I was studying, unsupervised learning was more about grouping things and clustering things and finding ways things belonged. But what does that look like in deep learning? Because when I studied neural networks, there was some kind of objective: you wanted to learn either classification or maybe regression. So what does that look like? Are they clusters? What is that representation?
- So we don't usually do clusters. I mean, clustering is a form of unsupervised learning, but it's a form of unsupervised learning that throws away a lot of information. Once you've decided that this image belongs to this cluster category, you throw away a lot of information about the image. So instead, in deep learning, as I was saying, the focus is on learning good representations that are very rich. In other words, I keep all of the information about the input, but transform it in such a way that it becomes easier to answer questions. So if I really understand the world around me well, like the child I was talking about earlier, and I figure out, you know, what are the elements of it, what are the objects and the attributes that explain what I'm seeing, then even if I have no task except understanding the world, I can learn good representations. So in fact, the early days of deep learning, in 2005, 2006, were about using unsupervised learning methods like RBMs and autoencoders to learn good representations without any supervised task. Just throw in a lot of images and find good representations that seem to capture things like edges and small shapes. And you can do that with unlabeled data, for which you don't know what the right answer should be. Once you've done that, you can use those initial representations as a starting point, or as inputs, for a supervised learning classifier. And that was the beginning. Now we're much more sophisticated, but the idea is the same. We can use unsupervised learning to discover good representations, and we can also use unsupervised learning for other things, like dealing with missing inputs, or actually generating, say, images given something else, like text or whatever. So there are lots of nice things that can be done with unsupervised learning that we didn't know how to do 15 years ago.
- So let's talk a little bit more about this generating images from text.
- Yes.
- I mean, if I were to talk about this maybe five or ten years ago, I'd be like, wow, the computer is thinking. What's actually happening when the text is generating images? How do you create a model that does that?
- So I would say it's thinking, but in a very primitive way. I don't think we have machines right now which understand the world nearly as well as we do, but they are able to extract these representations, which, you know, code for factors that control different aspects of the image. So we can learn, for example, with things called autoencoders, and there are many variants, a transformation from the image to this representation, and also from the representation to the image. Going from the image to the representation is useful because we can use the representations for, say, classification.
- Sure.
- But going from the representation to the image is also useful, because then, if I fix the representation, I can see what kind of image it corresponds to.
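[Here is a minimal sketch of the encoder/decoder idea just described: an autoencoder trained on unlabeled data learns a representation, which can then feed a supervised classifier. The architecture, sizes, and training loop are illustrative assumptions, not the specific models mentioned in the conversation.]

# Illustrative autoencoder: learn a representation from unlabeled data by
# reconstruction, then reuse the encoder's output as input to a classifier.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))  # image -> representation
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))  # representation -> image

ae_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
recon_loss = nn.MSELoss()

unlabeled = torch.rand(256, 784)            # stand-in for a batch of unlabeled images
for _ in range(100):                        # unsupervised phase: no labels used
    z = encoder(unlabeled)                  # rich representation, nothing thrown away
    x_hat = decoder(z)                      # try to reconstruct the original input
    loss = recon_loss(x_hat, unlabeled)
    ae_opt.zero_grad(); loss.backward(); ae_opt.step()

# Supervised phase: a small classifier on top of the learned representation.
classifier = nn.Linear(32, 10)
with torch.no_grad():
    features = encoder(unlabeled)           # representations as classifier inputs
logits = classifier(features)               # would be trained on whatever labels exist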
- And if I want to generate images, I can control which categories and which attributes. So, for example, voice. We have some students who have started a company where you can generate different voices. You can make Donald Trump say something, and you can play with his voice, so you can condition the generation using whatever you want. Say the output here is a sequence of sounds, and that's what you're generating, but you can control it with categories here, like sequences of letters and words and so on.
- Interesting, because now computers are starting to do more of this. For example, there are digital assistants. Say I have a digital assistant named Fred, and I don't want Fred to sound like a robot.
- You can use these representations to generate real speech in whatever voice or dialect you want. That's what we're trying to do. And the game here is that we can have many controls: we can change the voice, like who is speaking, we can change the emotions, and of course we can change the words that are being said. And with images, the same thing. You can play with the image, you can control factors in the image. For example, people have been using GANs, this new approach that we started here a few years ago, to do things like allowing people to just sketch something about an image, and then the system figures out a realistic image that corresponds to that sketch.
- Interesting. So talk a little bit more about GANs. It's not something I've heard about. It's probably an acronym, I'm guessing.
- Yes.
- What does it mean? And then the other thing is that the engineer in me is screaming out, like, well, I want to implement something like this. What does something like that look like? How is that model stored, or how is it encoded?
- OK, so GANs, that's indeed an acronym, for generative adversarial networks. And it's a pretty radical departure from how we were doing things for decades. You know, in machine learning, the standard way of training has been mostly reduced to maximum likelihood, or some variant of maximum likelihood. And GANs are really fundamentally different. So that's exciting, because in science we're looking for new ways of doing things, because maybe this is going to open new doors. Initially GANs didn't work that well, but they were exciting, and several researchers started to try to improve them. And eventually we started having images generated by those GANs that were very crisp and, you know, showed texture and details that were not thought possible before, at least not within a reasonable time frame. So now we have pretty amazing, almost photo-realistic images at reasonably low resolution, like 128 by 128, coming out of GANs.
- And that's, again, pretty impressive. It's a huge departure from what we were doing before, but in what way? What's the huge difference?
- OK, that would require a bit of a technical explanation. But I would say one important difference is this. The traditional way of modeling distributions, because unsupervised learning is about capturing the distribution of the data, is to write down an equation for the distribution function or density function, and this equation would have parameters. And then the idea of maximum likelihood is that we tweak the parameters so that the model produces a value of the density function that's larger on the data.
- Sure.
- Here we don't have a density function. There's no equation that gives us the density, right?
Instead, we just say: OK, for this application what we really want to do is to generate images, say, and so we're just going to train a neural net that takes in some information, maybe just pure noise, and then massages that information and outputs nice images. And we train it to do that by training another network that's going to learn whether the output of the first network looks like natural images or not. So that second network is a classifier, and it just learns: does this look like a fake image that was generated by a machine, or like a real image? And then the first network, the one that generates, is trying to fool the other guy. So it's a game-theoretical thing. It's very different from traditional machine learning in many ways, and one of them is that instead of having one objective function that you just optimize, you now have two, and you have two agents that are fighting each other. So it's really a different way of thinking about machine learning.
- It's like a Turing test, but one of them is actually a computer trying to see if the other one is human.
- Exactly. One is trying to defeat the other, right.
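[A toy sketch of the two-network game described above, assuming PyTorch; the architectures, stand-in data, and hyperparameters are invented for illustration and are not from the original GAN work. The generator maps noise to samples, the discriminator classifies real versus generated samples, and the two are trained with opposing objectives.]

# Toy GAN loop: generator G maps noise -> fake samples; discriminator D classifies
# real vs. fake; they are trained against each other.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784), nn.Sigmoid())
D = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1))

g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(64, 784)                      # stand-in for a batch of real images
for step in range(1000):
    # Discriminator step: push real samples toward label 1, fakes toward label 0.
    noise = torch.randn(64, 16)
    fake = G(noise).detach()                    # don't backprop into G on this step
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: try to make D label freshly generated samples as real.
    noise = torch.randn(64, 16)
    g_loss = bce(D(G(noise)), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()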
- And we now have a theory that helps us understand why a deeper network can potentially have a big advantage in terms of generalization ability, that is, how well it can perform on new data compared to a shallow network. And of course, the thing with classical machine learning, like kernel machines, is that they're shallow. Essentially, they are like a neural net with one hidden layer in terms of structure. And so the ability to have depth turns out to be quite important.
- So let's take those one at a time. The first is overfitting, because nowadays, with CNTK and TensorFlow, anyone can go out and start to do these things. What are some things that you suggest to prevent overfitting?
- Well, what's interesting is that the old way of dealing with overfitting, for example in statistics, is to have a smaller model with fewer parameters.
- Which is the opposite of what we're doing with deep learning.
- Exactly. We have these big models that over-parameterize things. In theory, there are enough parameters to learn everything by heart. And in fact, if you train them long enough, they will often go to zero training error. And you know, traditional machine learning wisdom says, oh, you must be overfitting.
- Right.
- But it's not happening that much. So there are a number of things that we do, like injecting noise, like augmenting the data by deforming the examples. And maybe one of the most interesting, which we don't completely understand, is the fact that we use gradient descent, which is actually in itself regularizing and preventing overfitting to some extent.
- I see.
- And so you can have very large networks and still get pretty good validation performance if you stop early, if you don't train to death.
- Right.
- And you use a validation set to decide, you know, when to stop training.
- So, just standard machine learning things.
- Very simple things, very simple things.
- The other thing you mentioned is that there is some theory behind the difference between shallow and deep networks, and why deep networks perform well. Can you speak to that a little bit? Is there some intuition as to why deep networks, mathematically, will work better than shallow networks?
- Yeah, yeah. So here's the intuition. What's happening with a deeper network is a little bit like what happens when you can write a program with more lines of code.
- I see.
- So a shallow network would be like a program with two lines of code. And what can you do with two lines of code? You can do a memory lookup, which is like a nearest-neighbor thing. In principle you can implement any function this way, but a memory lookup isn't very powerful in terms of generalization. But now if I allow you to write 20 lines of code, you can do much more complicated things.
- I see.
- And the part that's important is that the result of the computation, the state of the machine after the fifth line, becomes the input for the machine computing the sixth line, and so on. So each layer produces a new representation which becomes the input for the next level, sort of features or concepts that are going to be combined to build something a bit more abstract, a bit more complicated, at the next level, and so on. And this way you can build more complicated abstractions. And what the theory showed us is that it really boils down to this: take functions that look very complicated. In the case of a piecewise linear function, for example, you can count how many pieces there are in the function. That's kind of a measure of how complex it looks: it has many pieces, so you think it's a very complicated function. And if you were to represent a function like this with a shallow network or a kernel machine, you would basically need one unit per piece, right? Say, hey, there's this piece, and then there's this piece, and then this piece. So you just literally, you know, cut and paste all the pieces together. But what's happening with deep nets is that each level kind of folds on itself. If you had some number of pieces at one level, when you add the next level it's like you square the number of pieces. You can combine a piece here with a piece there, and now you have many more pieces. And the more you do this, the number of pieces you get after, say, K layers is exponential in the number of layers.
- So you're able to approximate more complicated functions?
- Well, not all of them. That's the thing. There are functions which look very complicated that really can be expressed in a much more compact form.
- Sure.
- And these are the functions where the deep neural nets are really thriving. If the pieces were placed completely arbitrarily, then we would not be better off than a shallow net or a kernel machine. But because there is some kind of structure that explains the shape of that function, the network can actually represent that function with a very, very small number of parameters. So we have a function that looks like it has exponential complexity, but actually it doesn't. It has a small number of parameters.
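[The "pieces multiply with depth" intuition can be checked numerically with a standard folding construction, not taken from the interview: composing a simple two-piece "tent" function with itself doubles the number of linear pieces at every layer, so K layers give on the order of 2^K pieces while the amount of parameters grows only linearly with depth.]

# Each application of a piecewise-linear "fold" doubles the number of linear pieces,
# so depth K yields about 2**K pieces from only about K layers' worth of parameters.
import numpy as np

def fold(x):
    # A two-piece "tent" map: linear on [0, 0.5] and on [0.5, 1].
    return 1.0 - 2.0 * np.abs(x - 0.5)

def deep_fn(x, depth):
    for _ in range(depth):          # compose the same cheap layer `depth` times
        x = fold(x)
    return x

xs = np.linspace(0.0, 1.0, 100001)
for depth in range(1, 6):
    ys = deep_fn(xs, depth)
    # Count slope sign changes as a rough count of linear pieces of this function.
    slopes = np.sign(np.diff(ys))
    pieces = 1 + int(np.sum(slopes[1:] != slopes[:-1]))
    print(depth, pieces)            # prints roughly 2, 4, 8, 16, 32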
- So what's new in the field of deep learning? What are the things that you're excited about? What's the next challenge?
- Well, I think one of the hottest areas right now in deep learning is the intersection between deep learning and reinforcement learning. If you look at the last ICLR conference, you have lots and lots of papers, you know, mixing these two things. And so one thing that I'm interested in is how a learning agent, which is not just observing the world but acting in the world, and that's where the reinforcement learning part becomes important, can discover good representations of the world. And in particular, I'm interested in how the agent can learn to control the world, to control aspects of its environment, and then use that information, at the same time as it's learning to do that, to build good representations in which the different aspects it can control correspond to different dimensions of its representation. I know what I'm saying is a bit abstract, but basically the idea is: think about babies. They're not just doing random movements, they're actually trying to control things around them, and then control their parents.
- Of course.
- But while they're doing that, this is how they are building their mental model of how the world works, of the causal relationships, and how to represent that in their brain. So we're trying to do something similar.
- Is that similar to the adversarial networks that you were talking about?
- No. But it's related in the sense that we're trying to learn good representations.
- I see.
- But now, whereas with traditional GANs we're just observing a bunch of images or sounds, here we allow the learner to interact with the world. So it's difficult, because you can't just take a big data set and learn on it. We have to build an environment. Think of it like a video game, where the agent can do things, get some rewards, move around, and see what the effects of its actions are, and so on.
- So effectively, you're making a little baby neural network.
- That's right.
- And you're putting it in this virtual environment, and you're saying, do stuff.
- Yeah.
- And then you either punish it or reward it.
- Right. Right.
- And what is it that you're expecting? What is it going to learn? What kind of representations?
- It learns how to control every aspect of its environment. Then once it's learned that, it can do anything. Once you know what the effects of all your actions will be and how things work in the world, I can ask you, can you please fetch this, and even though you've never been there, you know the map, you know the effect of your muscles, and you can just plan it and do it. So once you understand the world, you can act in it intelligently, and that's why it's so important to build machines that understand the world, and not just compute some kind of low-level statistics that give us cues about classification or something else.
- And this to me is really interesting, because just the amount of engineering discipline you would need, for example just to create a virtual environment, an environment where it learns and where you can give it feedback, is pretty impressive. Did you start research on that already, or is this the next thing?
- No, we are starting that. We're not the only ones. Our friends in other big machine learning labs are also exploring this kind of thing. So I think it's an important direction, and many people are recognizing that.
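[A skeletal sketch of the agent-environment loop being described; the environment, reward, and policy below are toy placeholders invented for illustration, not any particular lab's setup. The agent acts, observes the effect of its actions, and receives a reward signal it could learn from.]

# Skeletal agent-environment loop: the agent acts, observes the consequences of
# its actions, and receives rewards. Everything here is a toy placeholder.
import random

class ToyWorld:
    """A 1-D corridor; the agent tries to reach position 5."""
    def __init__(self):
        self.pos = 0
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):              # action: -1 (left) or +1 (right)
        self.pos += action
        reward = 1.0 if self.pos == 5 else 0.0
        done = self.pos == 5
        return self.pos, reward, done    # observation, reward, episode finished?

def policy(observation):
    # Placeholder policy: acts randomly; a learner would update this from rewards.
    return random.choice([-1, +1])

env = ToyWorld()
obs = env.reset()
for t in range(100):
    action = policy(obs)
    obs, reward, done = env.step(action)
    if done:
        print(f"reached the goal at step {t}")
        break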
- So where can we learn more about your work and what you're doing here?
- Well, besides the book, which gives you the building blocks and the mathematics to understand deep learning, there is a lot going on in the community. And everything that people are doing in this community, fortunately, is available online, very often on arXiv. A good place to look for recent papers in deep learning is the ICLR conference proceedings. And there's so much to read, I can't keep track. But you know, I think people should just spend more time reading and playing with those systems. There are a lot of software libraries that people can play with. That's the right thing to do.
- Well, thanks so much for spending some time with us. By the way, the book, Deep Learning, you should check it out. It has a really rigorous introduction, and it feels like it's the first soup-to-nuts book on deep learning. It starts with linear algebra, probability and statistics, and then it goes into optimization, which is pretty impressive for a book of that type. I always thought that with deep learning you needed this immense amount of knowledge before you even got into it, and it's cool to see that.
- So we wrote that book with the idea that any engineer who has basic math and knows how to program should be able to read the book and get into deep learning. And indeed, judging by the number of sales, I guess it's much more than the number of machine learning people in the world.
- Yes.
- So it must be that some of these engineers are buying the book and jumping into deep learning.
- Again, I could not find it in Redmond at all, and I looked very hard. Thanks so much for being with us.
- My pleasure.
Info
Channel: Microsoft Research
Views: 43,791
Rating: 4.9710145 out of 5
Keywords: microsoft research, faculty summit, edge of AI
Id: 5BrNt38OraE
Length: 26min 48sec (1608 seconds)
Published: Tue Jul 18 2017