- Hi there, I'm Seth Juarez,
and I'm privileged to be here with Dr. Yoshua Bengio of the University of Montreal, who is also the head of
the Montreal Institute for Learning Algorithms and is an expert
in deep learning. He has a textbook out called Deep Learning,
which I could not find anywhere. It's really in demand apparently. We're here to talk about his research
and the state of the art in artificial intelligence. It's a pleasure to be with you, my friend. - My pleasure.
- How are you doing? - Very good.
- So tell me about deep learning. For those that have been living under a rock,
that haven't really just thought about deep learning, for researchers out there
that are getting started in the field, tell me about deep learning. - Well, deep learning is a particular branch
or approach to machine learning. And of course, machine learning is about getting
computers more intelligent by learning from data. And deep learning is focusing
on learning representations, and is very much inspired by some of
the things we know about the brain. - So tell me about like general AI,
there's this thought about general AI. Is deep learning taking us along the path,
or is it the end goal? - General AI is certainly an end goal. And deep learning, among other
machine-learning approaches, is particularly focused on looking for a very general purpose
learning procedure. So you know, within machine learning there are
approaches that are more domain-specific, and deep learning research tends to explore fairly broad principles that could be applied
to all kinds of applications, which is exactly what we need
for general artificial intelligence. - So what makes deep learning,
because when I went to school we studied SVMs and decision trees, et cetera. What makes deep learning different
than those kinds of models? Because I feel like, it feels different to me. What is it that makes it different? - Yes, actually in the early 2000s, even at the end of the '90s,
I was starting to ask myself, what are the advantages neural nets and later deep neural nets could have against
what was the standard at that time,
you know, kernel machines, SVMs, and so on. And I had some intuitions about this
and I was able to formalize them and eventually write a number of papers showing that by having these, what we call distributed representations, where say an image
or a sentence is represented by a pattern of activation in one layer of a network, and by having multiple levels
of these representations, we could in fact have
a sort of exponential gain in terms of the ability
of the machines to generalize well and be able to represent information in a fairly compact way. So that becomes
important when you think about the issue of what is called the curse
of dimensionality, which is actually
a big issue for kernel machines. When you try to learn
a very complicated function, these deep neural nets have the potential to generalize much better if some assumptions
about the world are satisfied.
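To make that concrete, here is a toy numpy sketch (my own illustration, not something from the interview) of why a distributed representation can give this kind of exponential gain over, say, a clustering that assigns each input to one of k groups:

```python
# A toy numpy illustration of the "exponential gain" idea (my own sketch):
# k binary features (signs of k random projections) can distinguish up to
# 2^k regions of the input space, while a one-hot clustering with k
# clusters distinguishes only k.
import numpy as np

rng = np.random.default_rng(0)
k = 10                                   # number of features / clusters
X = rng.normal(size=(5000, 20))          # 5000 random points in 20 dimensions

W = rng.normal(size=(20, k))             # k random hyperplanes
b = rng.normal(size=k)
codes = X @ W + b > 0                    # distributed code: k binary features per point
n_patterns = len({tuple(row) for row in codes})

print(f"{k} binary features -> {n_patterns} distinct patterns observed "
      f"(at most 2**{k} = {2**k}); {k} clusters could only give {k}")
```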
So there's no magic bullet in machine learning. There's this so-called no free lunch theorem. - Right.
- And what it says is that, OK there's no universal
machine learning, but if we make some assumptions
and we are right in those assumptions, then we can win big. And deep learning is
making such assumptions. Well, they seem to work well
for all kinds of tasks that humans are good at. So my hypothesis is that human brains
are also making those assumptions. And so as a consequence, deep learning tends
to be good at those things that humans are good at. - So one of the things that I've found,
and I spoke to you about this earlier, is that generally when I was doing work
with decision trees or SVMs, I was very cognizant
of feature selection. And you said we don't do that
with deep learning. Why is that? - We don't need to. When you do feature selection,
you take a hard decision about, I take those features and I don't want to look
at those features. And of course there are cases where it's truly
a good assumption about the world to assume that there is only a small subset
of features that are relevant. And in that case, feature selection
is obviously the right thing to do. But in many more realistic settings,
well almost every feature contains some information
that you care about. So you really don't want
to get rid of all of them, because each of them could give you
a little bit of a cue about the right answer. The problem traditionally,
the reason people are doing feature selection is well, I don't have enough data,
I could be overfitting, so by eliminating features
I could generalize better. But there are other ways
of preventing overfitting, and deep learning exploits
some of those ways. And it's actually not
completely understood why, but very large networks that have many more parameters than you think are necessary can actually
generalize pretty well. So they are pretty robust to
having more features than necessary. And of course, you do need to have
enough data to get off the ground. But in general, practitioners
with neural nets tend to keep all the features.
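As a concrete illustration of that practice, here is a minimal sketch (my own, with made-up data and sizes) of feeding every feature to a small network and relying on regularizers such as dropout and weight decay rather than on feature selection:

```python
# A minimal sketch (my illustration, not Bengio's code): instead of selecting
# a subset of features, feed them all to a small network and control
# overfitting with dropout and weight decay.
import torch
import torch.nn as nn

n_features, n_classes = 200, 3            # hypothetical problem size
X = torch.randn(1024, n_features)         # fake data standing in for a real dataset
y = torch.randint(0, n_classes, (1024,))

model = nn.Sequential(
    nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, 128), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, n_classes),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(model(X), y)           # every feature stays in the input
    loss.backward()
    opt.step()
```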
- Is this a reason why feature-dense problems are being used in conjunction
with deep learning a lot? For example, like image recognition,
and text and speech? - Yeah, yeah. And in fact, it's an advantage. It's also a computational advantage,
because it doesn't cost much more to multiply two big matrices on a GPU than two small ones. - Right.
- And so you might as well go and have fairly large
input sizes. - So we've talked mostly
about the supervised case. I've heard that there is some research being done
around using deep learning for the unsupervised case, and I'm having a hard time
trying to wrap my brain around what that looks like and why it might be interesting. - Right. So let me give you an example
of unsupervised learning, which humans are very good at. A child, when she is born, doesn't know
much about the physics of our world. She discovers it by interacting with you know
toys and dropping things and she starts
to understand gravity and liquids and all kinds of concepts. And her parents don't need to tell her about it. - Right.
- She figures it out in an unsupervised way. She doesn't take classes about gravity. She just observes things around
her and learns how things work. So that's unsupervised learning. And it's a particular form
of unsupervised learning, because she's not just observing,
she's also playing with the world. But that's something that we
don't know how to do really well. And even now, even though there's a lot
of research in unsupervised learning with deep learning, the biggest successes in industry have been
with supervised learning. And we know that there's all of that data
out there that we don't really take advantage of, because we need better
unsupervised learning. So we have many reasons to explore that. - So what does that look like?
I mean, again, when I was studying, unsupervised learning was more about grouping things and clustering things
and finding ways things belonged. But what does that look like in deep learning? Because I mean, when I studied neural networks, it was primarily that
there was some kind of objective. If you wanted to learn either classification
or maybe even regression, if you could do that,
what does that look like? Are they clusters? What is that representation? - So we don't usually do clusters. I mean, clustering is a form of unsupervised learning, but it's a form of unsupervised learning that throws away a lot of information. Once you've decided that this image
belongs to this cluster category, you throw away a lot of information
about the image. So instead, in deep learning, as I was saying, the focus is on learning
good representations that are very rich. In other words, I keep all of the information
about the input, but transform it in such a way that it becomes easier
to answer questions. So if I really understand well the world around me,
like the child I was talking about earlier, and I figure out you know,
what are the elements of it? What are the objects and the attributes
that explain what I'm seeing? Even if I have no task except understanding
the world, I can learn good representations. So in fact, the early days
of deep learning in 2005, 2006, were about using
unsupervised learning methods like RBMs and autoencoders, to learn good representations
without any supervised task. Just throw in a lot of images
and find good representations that seem to capture things
like edges and small shapes. And once you've done that, you can do it
with unlabeled data for which you don't know
what the right answer should be. You can use those initial representations
as a starting point or as inputs for a supervised learning classifier. And that was the beginning.
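Here is a minimal modern sketch of that pretrain-then-classify recipe (my own illustration using an autoencoder and synthetic tensors; the 2006-era work used RBMs and layer-wise training):

```python
# A minimal sketch of unsupervised pretraining followed by supervised
# fine-tuning (my illustration, not the original setup).
import torch
import torch.nn as nn

X_unlabeled = torch.randn(2048, 784)            # stand-in for unlabeled images
X_labeled = torch.randn(256, 784)               # small labeled subset
y_labeled = torch.randint(0, 10, (256,))

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784))

# 1) Unsupervised phase: learn a representation by reconstructing inputs.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    recon = decoder(encoder(X_unlabeled))
    loss = nn.functional.mse_loss(recon, X_unlabeled)
    loss.backward()
    opt.step()

# 2) Supervised phase: use the learned representation as the starting
#    point for a classifier trained on the labeled data.
classifier = nn.Linear(64, 10)
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    logits = classifier(encoder(X_labeled))
    loss = nn.functional.cross_entropy(logits, y_labeled)
    loss.backward()
    opt.step()
```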
Now we're much more sophisticated, but the idea is the same. We can use unsupervised learning
to discover good representations, and we can also use unsupervised
learning for other things, like dealing with missing inputs or actually generating, for say images given something else,
like text or whatever. So there are lots of nice things
that can be done with unsupervised learning, which we didn't know how to do 15 years ago. - So let's talk a little bit more
about this generating images from text. - Yes. - I mean, that sounds like,
like if I were to talk about this maybe five years ago
or 10 years ago, I'd be like wow,
the computer is thinking. What's actually happening when the text
is used to generate images? How do you create a model that does that? - So I would say it's thinking,
but in a very primitive way. I don't think we have machines right now
which understand the world nearly as well as we do, but they are able to extract
these representations which, as we talked about, encode factors that control different
aspects of the image. So we can learn, for example, with things
called autoencoders, and there are many variants, a transformation from the image to this representation,
and also from the representation to the image. So going from the image
to the representation is useful because we can use the representations
for say classifications. - Sure. - But going from the representations
to the image is also useful, because then I could say,
if I fix the representation I could see what kind of image
this corresponds to. And if I want to generate images,
and I can control like which categories and which attributes, so for example voice. We have some students who have started this company
where you can generate different voices. So you can make Donald Trump say something,
and you can play with his voice, so you can condition the generation using whatever you want. So say the output here is a sequence of sounds,
and that's what you're generating, but you can control it
with categories here, like sequences of letters
and words and so on. - Interesting, because now computers
are starting to become more, like for example
there's digital assistants, so say I have a digital
assistant named Fred, and I don't want Fred
to sound like a robot. You can use these representations
to generate real speech in whatever dialect you want. - That's what we're trying to do. And the game here is that
we can have many controls, so we can change the voice
like who is speaking, we can change the emotions, and of course we can change what are the words
that are being said. And with images, the same thing. You can play with the image,
you can control factors in the image and, for example, people have been using GANs, this new approach that we started here
a few years ago, to do things like allowing people to just sketch something
about an image, and then the system
would figure out something, a realistic image
that would correspond to that sketch. - Interesting.
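Before getting into GANs, here is a toy sketch of the "conditioning the generation" idea just described (my own illustration; a real system would train this, for example adversarially or with a reconstruction loss):

```python
# A toy sketch of conditional generation (my illustration): the generator
# receives both random noise and a controllable attribute (here a class
# label), so fixing the label and varying the noise gives different samples
# of the same "category". The same idea applies to controlling voice
# identity, emotion, or image attributes.
import torch
import torch.nn as nn

n_classes, noise_dim, out_dim = 10, 32, 784

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, 16)         # attribute code
        self.net = nn.Sequential(
            nn.Linear(noise_dim + 16, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),
        )

    def forward(self, z, label):
        return self.net(torch.cat([z, self.embed(label)], dim=1))

gen = ConditionalGenerator()                             # untrained, for illustration only
z = torch.randn(4, noise_dim)
labels = torch.tensor([3, 3, 7, 7])                      # which "category" to generate
samples = gen(z, labels)                                 # 4 generated outputs, shape (4, 784)
```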
So talk a little bit more about GANs. It's not something
I've heard about. It's probably an acronym, I'm guessing. - Yes.
What does it mean? And then the other thing I'm having a hard time with:
the engineer in me is screaming out, like well I want to implement
something like this. What does something like that look like? How is that model stored or how is that encoded? - OK, so GANs, that's indeed an acronym
for generative adversarial networks. And it's a pretty radical departure
from how we were doing things for decades. You know, in machine learning,
the standard way of training
has been mostly reduced
to maximum likelihood, or some variant of maximum likelihood. And GANs are really fundamentally different. So that's exciting, because you know in science
we're looking for new ways of doing things because maybe this is
going to open new doors. And while initially our GANs
didn't work that well, they were exciting, and several researchers
started trying to improve them. And eventually we started having images
generated by those GANS that were very crisp
and, you know, were able to generate texture and details that were not thought possible before, at least not in the near future. So now we have pretty amazing,
almost photo-realistic images at reasonably low resolution, like 128 by 128,
coming out of GANs. - And that's, again, pretty impressive. It's a huge departure from what
we were doing before in what way? What's the huge difference? - OK, that would require
a bit of a technical explanation. But I would say one important difference
is that the traditional way of modeling distributions, because unsupervised learning
is about capturing the distribution of the data, is to write down an equation for the distribution
function or density function, and this equation
would have parameters. And then the idea of maximum likelihood
is that we tweak the parameters so that the model produces a value of the density function that's larger on the data. - Sure.
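For concreteness, here is a tiny worked example of that maximum-likelihood recipe (my own illustration): write down a parameterized density, then tweak the parameters so the observed data gets higher (log-)density under the model:

```python
# Maximum likelihood in miniature (my illustration): fit a Gaussian's
# parameters by pushing up the log-density of the data.
import torch

data = torch.randn(1000) * 2.0 + 5.0        # unknown "true" distribution: N(5, 2^2)

mu = torch.zeros(1, requires_grad=True)     # parameters of our model density
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    sigma = log_sigma.exp()
    # negative log-likelihood of a Gaussian, averaged over the data
    nll = (torch.log(sigma) + 0.5 * ((data - mu) / sigma) ** 2).mean()
    nll.backward()
    opt.step()

print(mu.item(), log_sigma.exp().item())    # should approach roughly 5 and 2
```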
- Here we don't have a density function. There's no equation that gives us the density, right. Instead we say, OK,
well for this application what we really want to do
is to generate images, say, and so we're just going
to train a neural net that takes in some information, maybe just pure noise, and then massages
the information and outputs nice images. And we just train it to do that
by training another network that's going to learn whether
the output of the first network looks like natural
images or not. And so that second network
is a classifier and it just learns: does this look like
a fake image that was generated by a machine, or like
a real image? And then the first network
which generates is trying to fool the other guy. So it's like, it's a game
theoretical thing. It's very different from machine learning,
traditional machine learning, in many ways and one of them is that instead of having
one objective function that you just optimize, you now have two, and you have two agents
that are like fighting each other, so it's really a different way of thinking about machine learning. - It's like a Turing test, but one of them
is actually a computer trying to see if the other one is human. - Exactly. One is trying to defeat the other, right.
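A heavily simplified GAN sketch in the spirit of that description (my own illustration on one-dimensional data, not the original paper's training details):

```python
# Minimal GAN sketch (my illustration): a generator turns noise into samples,
# a discriminator tries to tell real from generated, and the generator is
# trained to fool it.
import torch
import torch.nn as nn

real_data = torch.randn(10000, 1) * 0.5 + 3.0          # "real" distribution to imitate

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # noise -> sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = real_data[torch.randint(0, 10000, (64,))]
    fake = G(torch.randn(64, 8))

    # Discriminator: classify real samples as 1, generated samples as 0.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator say "real" on its fakes.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())   # should drift toward roughly 3.0
```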
And we now have a theory that helps us understand why a deeper network can potentially have
a big advantage in terms of the generalization ability, so how it can
perform well on new data compared
to a shallow network. And of course, the thing with classical machine
learning, like kernel machines, is that they're shallow. Essentially, they are like a neural net
with one hidden layer in terms of structure. And so the ability to have the depth
turns out to be quite important. - So let's talk about the first thing,
overfitting, because I mean nowadays with like CNTK and TensorFlow, anyone can go out and start
to do these things. What are some things that you suggest
to prevent overfitting? - Well, what's interesting is the old way
of dealing with overfitting, for example in statistics, is to have a smaller model
with few parameters. - Which is the opposite of what
we're doing with deep learning. - Exactly. We are having these big models
that are over-parameterizing things. In theory, there's enough parameters
to learn everything by heart. And in fact, if you train them sufficiently,
they will often go to zero training error. And you know, traditional machine learning wisdom
is oh, you must be overfitting. - Right.
- But it's not happening that much. So there are a number of things that we do,
like injecting noise, like augmenting the data
by deforming the examples. And maybe one of the most interesting ones,
which we don't completely understand, is the fact that we use gradient descent, which is actually in itself regularizing and preventing
overfitting to some extent. - I see.
- And so you can have very large networks and still get pretty good validation error
if you stop early, if you don't train to death. - Right. - And you use a validation set to decide you know
when to stop training. - So just standard machine learning things. - Very simple things, very simple things.
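A minimal sketch of that early-stopping recipe (my own illustration with synthetic data and an intentionally over-parameterized network):

```python
# Early stopping with a validation set (my illustration): train a network
# with far more parameters than needed, watch held-out loss, and stop when
# it stops improving.
import torch
import torch.nn as nn

X = torch.randn(600, 20)
y = (X[:, 0] + 0.3 * torch.randn(600) > 0).long()        # noisy synthetic labels
X_train, y_train, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

model = nn.Sequential(nn.Linear(20, 512), nn.ReLU(), nn.Linear(512, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 20, 0
for epoch in range(10000):
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()

    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0                # (in practice, also checkpoint the weights here)
    else:
        bad_epochs += 1
    if bad_epochs >= patience:
        print(f"stopping at epoch {epoch}, best validation loss {best_val:.3f}")
        break
```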
- The other thing you mentioned is that there is some theory behind the difference between shallow and deep networks, in that deep networks perform well. Can you speak a little bit about that? Is there some intuition as to why
the deep networks mathematically will just intuitively work
better than shallow networks? - Yeah, yeah.
So here's the intuition. What's happening with the deeper network
is a little bit like what is happening when you can write a program
with more lines of code. - I see. - So a shallow network would be like a program
with two lines of code. And so what can you do
with two lines of code? You can do a memory look up,
which is like a nearest neighbor thing. And in principle you can do
any function this way, but a memory look up isn't very powerful
in terms of generalization. But now if I allow you to write 20 lines
of code, you can do much more complicated things. - I see. - And the part that's important is
that the result of the computation, the state of the machine
after the fifth line becomes the input for the machine computing the sixth line,
and then so on. So each layer produces a new representation
which becomes the input, and sort of features or concepts
that are going to be combined to build something
a bit more abstract, a bit more complicated
for the next level, and so on. And this way you can build
more complicated abstractions. And what the theory showed us
is that really what it boils down to is that functions that look very complicated,
for example, in the case of a piecewise
linear function, you can count how many pieces
there are in the function. So this is kind of a measure
of how complex it looks. It has many pieces. You think it's a very
complicated function. And if you were to represent a function
like this with a shallow network or a kernel machine, you would basically need
one unit per piece, right? Say hey, there's this piece,
and then there's this piece and then this piece. OK, so you just literally, you know,
cut and paste all the pieces together. But what's happening with deep nets
is that each level kind of folds on itself. And if you had some number
of pieces at one level, when you put the next level it's like you square
the number of pieces. You can combine a piece here,
with a piece here, and now you have
like many more pieces. And the more you do this,
in fact the number of pieces you get after say K layers is exponential
in the number of layers. - So you're able to approximate
more complicated functions? - Well not all of them. That's the thing. So there are functions
which look very complicated that really can be expressed
in a much more compact form. - Sure.
- And these are the functions where the neural nets,
the deep neural nets are really thriving. If the pieces were put completely arbitrarily
then we would not be better off than a shallow net or a kernel machine. But because there is some kind of structure
that explains the shape of that function, the network can actually represent
that function with a very, very small number of parameters. So we have a function that looks like it
has an exponential complexity, but actually it doesn't. It has a small number of parameters.
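Here is a toy numpy illustration of that piece-counting argument (my own sketch, based on a standard "folding" construction rather than the exact theorems): each layer folds the interval onto itself, so the number of linear pieces doubles with depth, while a shallow network would need roughly one unit per piece:

```python
# Composing a simple "hat" function layer by layer doubles the number of
# linear pieces at every level (my illustration of the folding intuition).
import numpy as np

def hat(x):
    # One piecewise-linear "layer" with 2 pieces on [0, 1].
    return np.where(x < 0.5, 2 * x, 2 - 2 * x)

def count_linear_pieces(y, x):
    slopes = np.round(np.diff(y) / np.diff(x), 3)         # rounding guards against float noise
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))     # count slope changes

x = np.linspace(0.0, 1.0, 2**12 + 1)                      # dense grid on [0, 1]
y = x
for depth in range(1, 7):
    y = hat(y)                                            # compose one more layer
    print(f"depth {depth}: {count_linear_pieces(y, x)} linear pieces")
# The count grows as 2, 4, 8, 16, 32, 64: exponential in depth, while a
# shallow net would need about one unit per piece to match it.
```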
- So what's new in the field of deep learning? What are the things that you're excited about? What's the next challenge? - Well, I think one of the hottest areas right now in deep learning is the intersection between deep learning
and reinforcement learning. If you look at the last ICLR conference,
you have lots and lots of papers, you know, mixing these two things. And so one thing that I'm interested
in is how a learning agent, which is not just
observing the world but is acting in the world, that's where the reinforcement
learning part becomes important, can discover good representations
about the world. And in particular I'm interested
in how the agent can learn to control the world, to control aspects
of the environment, and then using that information
at the same time as it's learning to do that,
to build good representations where the different aspects
that it can control correspond to different dimensions
in its representation. So I know what I'm saying is a bit abstract,
but basically the idea is, think about babies. So they're not just doing random movements,
they're actually trying to control things around them and then control their parents. - Of course.
- But while they're doing that, this is how they are building their mental model
of how the world works, of the causal relationships and how to represent
that in their brain. So we're trying to do something similar. - Is that similar to the adversarial networks
that you were talking about? - No. But it's related in the sense that we're trying
to learn good representations. - I see. But now, whereas in the traditional GANS
we're just observing a bunch of images or sounds, here we allowed the learner
to interact with the world. So it's difficult because you can't just have now
a big data set and we learn on the data set. We have to build an environment. Like think of it like a video game
where the agent can now do things and then get some rewards and move around and see what the effect
of the actions are and so on. - So effectively, you're making
a little baby neural network. - That's right. - And you're putting it in this
virtual environment, and you're saying do stuff. - Yeah.
- And then you either punish it or reward it. - Right, right.
- And what is it that you're expecting? What is it going to learn? What kind of representations? - It learns how to control
every aspect of its environment. So then once it's learned that, it can do anything. So once you know what the effect of all your actions
will be and how things work in the world, I can ask you, can you please fetch this, even though you've never been there,
you know the map, you know the effect of your muscles. You can just plan it and do it. So once you understand the world,
you can act in it intelligently, and that's why it's so important to build machines
that understand the world and not just compute some kind
of low level statistics that give us cues about classification
or something else.
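For a sense of what "building an environment" can look like at its simplest, here is a minimal interaction-loop sketch (my own toy example, not any particular lab's setup):

```python
# A minimal interactive setting (my illustration): instead of a fixed
# dataset, the learner acts in a tiny hand-written environment, observes
# the effect of its actions, and collects (state, action, reward,
# next_state) transitions it could learn representations or policies from.
import random

class CorridorEnv:
    """A 10-cell corridor; the agent starts at cell 0 and is rewarded at cell 9."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                 # action: 0 = move left, 1 = move right
        self.pos = max(0, min(9, self.pos + (1 if action == 1 else -1)))
        reward = 1.0 if self.pos == 9 else 0.0
        done = self.pos == 9
        return self.pos, reward, done

env = CorridorEnv()
transitions = []
for episode in range(100):
    state, done = env.reset(), False
    while not done:
        action = random.randint(0, 1)                      # placeholder random policy
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state))
        state = next_state

print(len(transitions), "transitions collected from interaction")
```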
- And this to me is really interesting, because just the amount of engineering
discipline you would need, for example just to create
a virtual environment, just to create an environment where it learns, where you can give it feedback, is pretty impressive. Have you started research on that already,
or is this the next thing? - No, we are starting that. We're not the only ones. Our friends in other big machine learning labs
are also exploring this kind of thing. So I think it's an important direction
and many people are recognizing that. - So where can we learn more about your work
and what you're doing here? - Well, besides the book that gives us
the building blocks and the mathematics to understand deep learning, there is a lot going on
in the community. And everything that people
are doing in this community, fortunately, is available online,
very often on arXiv. A good place to look at the recent papers in
deep learning is the ICLR conference proceedings. And there's so much to read. I can't keep track. But you know, I think people should just spend
more time reading and playing with those systems. There's a lot of software libraries
that people can play with. That's the right thing to do. - Well thanks so much
for spending some time with us. By the way, the book, Deep Learning,
you should check it out. It has a really rigorous introduction,
because it feels like it's the first soup-to-nuts book on deep learning. It starts with linear algebra, probability and statistics,
and then it goes into optimization. Which is pretty impressive
for a book of that type. I always thought that with deep learning, you required
this immense amount of knowledge before you even got into it. And it's cool to see that. - So we wrote that book with the idea
that any engineer who has basic math and knows how to program should be able
to read the book and get into deep learning. And indeed, by the number of sales,
I guess it's much more than the number of machine
learning people in the world. - Yes.
- So that must mean that some of these engineers are buying the book
and jumping into deep learning. - Again, I could not find it in Redmond at all,
and I looked very hard. Thanks so much for being with us. - My pleasure.