ANNOUNCER: The following program is
brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we introduced the third
linear model, which is logistic regression. It has the same structure as the linear
models, where you have the inputs combined linearly using weights,
summed up into a signal, and then the signal passes through something. In this case, it passes through what
we refer to as a soft threshold. We labeled it theta. And the model is meant to implement
a probability that has a genuine probability interpretation. And because of that, the error measure
we derived was based on likelihood measure, which has a probabilistic
connotation, in which case we maximized the
probability that we would get the data set that we got-- the outputs
given the inputs-- based on the hypothesis that is
represented by the logistic regression, assumed to be identically the
target function. And this enables us to express the
probability in terms of the parameters that define the hypothesis, which
are the weights, w. And therefore, we have this quantity
that we want to maximize. And then we derived an error measure
that very much parallels the error measures that we had before, in terms of
the in-sample error for logistic regression that we will minimize. So this is a useful model, and it
complements the other models. One of them was for classification, one
of them for real-valued function regression. And this one is for a bounded real-valued
function that is interpreted as a probability. One of the key issues about logistic
regression is that, because the error measure is a little bit
more complicated than we had for example in linear regression, we were
unable to optimize it directly. And therefore, we introduced the method
that is meant to minimize an arbitrary nonlinear function that
is smooth enough, twice differentiable. And in the case of logistic
regression, although we don't have a closed-form solution, the
error measure actually has a very nice behavior. It's a convex function and, therefore,
when you apply a method like gradient descent or other methods, it is fairly
easy to optimize because you just fall into that minimum and stay there, rather
than have problems with local minima that we talked about briefly. So the algorithm for gradient descent is the same,
regardless of the error measure that you are trying to minimize. First, you initialize and, in the case
of logistic regression, initializing to all zeros was fine. We will find out that today in neural
networks, that will not be fine, and we'll make the point why. And then you keep iterating
until termination. And what you do is, you update your
weight gradually by going along the negative of the gradient. That would be
the steepest descent in the error-- the biggest gain you would get
for a fixed-size step. And in this case, we adjusted the
fixed-size step into a step whose size is proportional to the magnitude of
the gradient, with a fixed learning rate. We keep doing this, and then when we
arrive at termination we report that as our final hypothesis. And we talked a little bit in the Q&A
session about criteria for termination, and also about local minima that
will become an issue for today. So today when I modify the gradient
descent into the more practical version, which is called stochastic
gradient descent, we will talk a little bit about initialization and
we'll talk about other aspects that have to do with local
minima and whatnot. OK. Today's topic is neural networks. And historically, neural networks are
responsible for the revival of interest in machine learning. They have a biological link that
got people very excited and people pursued them. And they were very easy to implement
because of the algorithm that I'm going to describe today. And they met a lot of success
in practical applications, and got people going. Now, it is not necessarily the model
of choice nowadays, probably people will opt for support vector machines
or other models. Yet every now and then, neural networks would do the
job as well as the other models. And many industries use
it as a standard. For example, in banking and
credit approval, neural networks are often used. So the outline for today
is very simple. First, I'm going to extend gradient
descent into the special case of stochastic gradient descent that
is used in neural networks. And then, I'm going to talk about
neural network as a model. What is the hypothesis that
it is implementing? And I motivate it from a biological
point of view, and relate it to perceptrons. And then we'll talk about the
backpropagation algorithm, the efficient algorithm that goes with
neural networks that actually made that model particularly practical. So let's start with stochastic
gradient descent. What do we have? We have gradient descent, and gradient
descent minimizes an error function. That is function of w-- minimizes
it with respect to w. And that happens to be an in-sample
error in our mind. And it is the in-sample error. And the only thing I would notice here
that is particular to the derivation of stochastic gradient descent is that, in order
for you to compute the error or the gradient of the error, which
you need in order to implement gradient descent, you need to evaluate the hypothesis
at every point in your sample. So for n equals 1 to N, you
need to evaluate those, or you evaluate their gradient. And that will tell you what is the
error, or what is that direction you would go to, which is normal because
this is the error we are minimizing. You'd better compute it. So you take the case of logistic
regression and we had a very particular form for that. And now you can see that
it's an analytic form. In this case, friendly and smooth. And indeed, you can get the gradient
with respect to that vector and go down the error surface,
along the direction suggested by gradient descent. Now, the steps were iterative, and so
we take one step at a time. And one step is a full epoch, in the
sense-- we call it an epoch when you have considered all the examples
at once, which is the only choice we have so far. And we had this formula
that we have seen. The difference we are going to do now
is that, instead of having the movement in the w space
based on all of the examples, we are going to try to do it based
on one example at a time. That's what will make stochastic
gradient descent. So because now we are going to have
another method, we're going to label the standard gradient descent as
being batch gradient descent. It takes a batch of all the examples and
does a move at once, as opposed to the other mode. So the stochastic aspect
is as follows. You pick one example at a time. Think of it that you pick it at random. You have N examples,
each of them equally likely to be picked. You pick one of them at random. Now you apply gradient descent, not to the
in-sample error for all the examples, but the in-sample error on that point. That looks like a very meager thing to
do, because the other examples are not involved at all. But I think you have seen something
like that before. We took one example at a time and
worried about it, without worrying about what the other examples were doing, even
if we were interfering with them. Remember the perceptron
learning algorithm? That's exactly what it
did, and it worked. And in this case, it will also work. Now to argue that it will work, think
of the average direction that you're going to descend along. What does that mean? If you take the gradient of
the error measure that you are going to minimize, which in this case is just
for one example, and you take the expected value under
the experiment of picking the example from the entire training
set at random. In that case, if you want to get the
expected value with respect to the index n (the red n on the slide), which is now a random variable,
this is what you get. And if you evaluate it,
it's pretty easy. You simply take this value. Every example has a probability of
1 over N, so the expected value would be 1 over N times the summation of that. So this would be the
average direction. So you can think of it as: at every step, I'm going
along this direction plus noise. So this is the expected value,
but because it's one example or another, there is some
stochastic aspect. And if you look at the quantity on the
right hand side, this happens to be identically minus the gradient
of the total in-sample error. So it's as if, at least in expected
value, we are actually going along the direction we want, except that we
now involve one example in the computation, which is a big advantage,
and we have a stochastic aspect to the game. So this is the idea, and then
you keep repeating. And as you repeat, you always get the
expected value in that direction, and you get different noises
depending on which example. So the hope now is that by the time you
did it a lot of times, the noise will average out and you actually will
be going along the ideal direction. So it's a randomized version of gradient
descent, and it's called stochastic gradient descent,
SGD for short. Now let's look at the benefits of
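As an aside, the batch update and the one-example update just contrasted can be sketched in code. This is an illustrative sketch, not from the lecture; `grad_e` stands for any per-example gradient function.

```python
import random

def batch_step(w, X, y, grad_e, eta=0.1):
    """One batch GD step: average the N per-example gradients, then move."""
    N = len(X)
    g = [0.0] * len(w)
    for xn, yn in zip(X, y):
        for i, gi in enumerate(grad_e(w, xn, yn)):
            g[i] += gi / N
    return [wi - eta * gi for wi, gi in zip(w, g)]

def sgd_step(w, X, y, grad_e, eta=0.1):
    """One SGD step: pick one example at random, descend on it alone."""
    n = random.randrange(len(X))
    return [wi - eta * gi for wi, gi in zip(w, grad_e(w, X[n], y[n]))]
```

Averaged over the random pick (each example chosen with probability 1/N), the expected SGD move equals the batch move, which is exactly the argument made above.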
having that stochastic aspect. The main benefit by far-- that is the motivation
for having this-- is that it's a cheaper computation. Think of one step that you're going to
do using stochastic gradient descent. What do you need? You take one example, you put the input
and you get the output, and then you compute whatever the gradient
is for one example. If you're doing the batch gradient
descent, you will do this for all the examples before you can
declare a single move. Nevertheless, the expected value of your
move in the cheaper version is the same as the other one. So it looks like there's a little
bit of cheating here. On the other hand, it
looks attractive. If this actually works on average, this
is an attractive proposition. So this is number 1 advantage. The second advantage is randomization. There is an aspect of optimization
that makes randomization advantageous. So you don't want to be
extremely deterministic. You want to have an element of chance. Why would I want an element of chance
if I know my goal exactly? Well, because optimization
is not exact. It's not as if you are guaranteed to reach the
minimum. There are all kinds of traps that you
can go through, like local minima and whatnot. So let's look at cases where
randomization would help. This is an error surface,
and it is the typical error surface you will encounter. The one you encountered in logistic
regression, which was simply like this, that was a lucky
one, the convex one. In general, and in neural networks for
sure, you are going to get lots of hills and valleys in
your error surface. So depending on where you start,
you may end up in one local minimum or another. You may not get the best one, you
may get one or the other. Now, this is inevitable and there is
really no foolproof cure for it, as we discussed in the Q&A session. On the other hand, it will be quite
a shame if you get stuck in this fellow. You see this small fellow? Because it's really just
a shallow local minimum. But according to gradient descent, you
go here, the gradient is zero, everybody is happy, and you stop there. So you would love to have an added
element that will make you escape at least shallow valleys like that. And the idea now is that, because you
are not going in a direction that is deterministic-- in this case, there
is a random element, some fluctuation. So there is a chance as you go
here that you will escape from the local minima. Now this is a practical observation that
in reality, stochastic gradient descent does help with this. It definitely doesn't cure
it-- far from it. On the other hand, it does take
care of some aspect of escaping silly local minima. So this is an advantage that basically
is a side benefit. We did it for the cheap computation,
and we're getting this for free. The other one we also talked about
a little bit in the Q&A session, which was the flat regions. So the surface could be very,
very flat and then finally go down. So if your termination criterion tells
you that here you are done-- it looks flat,
nothing is happening-- then you will stop. Every now and then when you do the
random things, the fluctuation takes you up and down and the algorithm
is still alive. Still, termination is a tricky criterion,
because for termination you need to consider all the examples in
order to know exactly where you stand. But for some of the flat regions, just
the stochastic aspect also helps a little bit with it. So these are basically annoying
artifacts of the optimization surface that gradient descent will help
a little bit with, if you use the stochastic version. Now, the third advantage-- so randomization helps-- the third advantage you have
is that it's very simple. It is the simplest possible optimization
you can think of. You take one example, you do something,
and you're ready to go. And we will see an example in
a moment that applies it. And because it's simple, there are
lots of rules of thumb for it. So people have used it a lot, and
people use it in different applications. So you can find rules of thumb that
are actually very useful. So I'll give you one rule of thumb
that will be helpful in practice. Remember the learning rate? The learning rate was telling
us how far we go. And we talked about, you know, that if
it's too big then you lose the linear approximation. If it's too small,
you are moving too slowly. So sometimes you ask, what should
I use for eta, the learning rate? Obviously, the exact answer depends
on the situation, and it even depends on the scaling of the error up and
down. Mathematically, you can't really pin it down. From a practical point of view, if
you go for a very wide range of applications, you take a normal
application, a normal error function, mean squared or something, and
then you take eta equals 0.1. That actually works. So you can always start with this, and
then adjust it from there. That's for stochastic gradient descent. So this is a theorem, eta equals 0.1! And the proof is that. These are advantages, so we are now
motivated to look into stochastic gradient descent. And let's see it in action. I'll take an example far from
the linear models and neural networks. I'll take an example that we looked at
before in an informal way, and it will be very easy to formalize
and implement this way. Remember movie ratings? What was that? Oh, that was the example where you want
a user to look at a movie and do a rating. And you want to look at previous ratings
and predict. All of that. Now it looked like this, at least the proposed solution, that we
will describe the user by a number of factors, which are basically
their taste. They like comedy, they like action,
they hate this, et cetera. So there are some values here describing
their taste, a profile of the user if you will. And then a movie-- you describe the
content with the same factors. Does it have comedy? Does it have-- et cetera. And the idea now is that we are going
to reverse-engineer the ratings, the existing ratings in the training
set, into factors that explain why this rating is what it is. And hopefully by the time we do that,
we will be able to predict future guys. So I do this for the movies that this
user saw, and then I will take the factors of the user, the factors of
a movie that they haven't seen, and do the same combination that I did here,
and hopefully get a prediction for the rating. So all I want to do here is
show you this method using stochastic gradient descent,
which was actually the method that was used in this solution
in the million-dollar prize. So although it is very,
very simple, it is actually used. And if you are working for something
with the stakes that high, you probably will try your best
to get something right. So the fact that actually stochastic
gradient descent survived until that late stage tells you that it's
not a trivial algorithm. So in order to put some formality on
this, we need to give labels for the users and movies. So it would be user i, movie j, and the rating we will call r_ij. Very simple. Now there are factors for the users
and factors for the movie. So let's call them something. The factors for the user will be u_1,
u_2, u_3, u_K, so it's a vector of numbers that describe the
taste of that user. And the corresponding factors for a movie
would be v_1, v_2, v_3, up to v_K, which describe the content
of that movie. When we said we're going to match the
taste of the user to the content of the movie, what we were going to
do, we were going to simply take a coordinate, small k, from k equals 1 to K, and multiply these two. So we're taking an inner product
between these two guys. And then sum up. And that will tell us the level
of matching between the two. At least that is the quantity we are
trying to make replicate the rating. So we would like the difference between
the rating and this quantity to be small. That's the goal. Now in order to be accurate in the notation,
the factors u_1 up to u_K and v_1 up to v_K depend on which
user and which movie. Different users have different
factors, et cetera. So I'm going to add the label of the
user, and the label of the movie. And now, if you look at the picture,
i and j will appear. So it's a bit more elaborate notation,
but it's not a big deal. And we also introduce it
here in the sum. So this will be exactly the case. And for all of the users and all the
movies, you have a shuffle of different users rating
different movies. So the factors are reused for different
ratings that appear in your training set. And now your idea is, how do I make these
guys close to the ratings in the training set, hoping that
they will generalize. And the way you do it is, you define
an error on that particular rating, which is the difference between the
actual rating and what the current factors suggest. The factors now are your parameters,
and you're trying to find the value for the parameters that minimizes this.
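The per-rating gradient step just described can be sketched as follows. This is a minimal illustration in the lecture's notation (r for the rating, u and v for the two factor vectors); the learning rate and the squared-error form are the plain choices discussed here.

```python
def sgd_rating_step(u, v, r, eta=0.01):
    """One stochastic gradient step on a single rating r.

    u, v: factor vectors (taste of user i, content of movie j).
    Error on this rating: e = (r - u.v)^2, so
    de/du_k = -2 (r - u.v) v_k, and symmetrically for v_k.
    Each factor is nudged toward reproducing the rating.
    """
    dot = sum(uk * vk for uk, vk in zip(u, v))
    err = r - dot
    # both updates use the old factors, as in a single gradient step
    u_new = [uk + eta * 2 * err * vk for uk, vk in zip(u, v)]
    v_new = [vk + eta * 2 * err * uk for uk, vk in zip(u, v)]
    return u_new, v_new
```

Repeating this over randomly picked ratings is the whole algorithm; batch gradient descent would instead sum these terms over all ratings before moving.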
Because you're taking one example at a time, if you do descend on this one, it
will be stochastic gradient descent. If you wanted to do batch gradient
descent, you would have to take all the ratings, add up these terms for
all the ratings you have, and then descend on those. But the stochastic gradient descent
is the one which is used. Could there be anything simpler? You're going to take the partial derivative
with respect to every parameter that appears here. And remember in the first one, we said
that all we are doing is-- we take these factors and try to nudge them a little
bit towards creating the rating. And now we have a principled
way of nudging. The nudging will be proportional
to the partial derivative with respect to each factor. So I have a bunch of factors. Which factors do I modify
in order to get there? Now we have the formula, and the
formula will be, as a vector, I'm going to move in the space that now
has 2K parameters in this case. And I'm going to move in that very
high-dimensional space in a direction that makes me, with a certain size of
step, achieve the biggest drop in the error in estimating the rating. So you can implement this. And indeed, if you implement it, you
will get a pretty good score. Not a winning score, but a pretty good
score for the Netflix competition. And in this case, people started adding
terms, and obviously regularizing, which will be an important
issue that we'll come up with. But basically, the simplest stochastic
gradient descent with very plain squared error on something as simple
as that will get you somewhere. So now we know that stochastic
gradient descent is good. And stochastic gradient descent
is the one we're going to apply to the neural network model, so let's talk about
the neural network model. I am going to start with the
biological inspiration of neural networks, because it's
an important factor. That's where they got their name, and
that's how they got the initial excitement that got them to have
a critical mass of work. So biological inspiration is a method
that we use in engineering applications a number of times. And there is a little bit of a leap
of faith there which is: we are interested in replicating
the biological function. You know, humans learn. We want machines to learn. So in order to replicate the function,
our first attempt is to replicate the structure. That's what we do. We try to make it look like the
biological system, hoping that it will perform the same. It is a legitimate approach because
something is working-- there is an existence proof, and it
has this structure. Maybe the structure has
something to it. So in the case of neural networks,
this is the biological system. We have neurons connected
by synapses, and there are a large number of them. Each of them does a simple job. The job, the action of a particular
neuron, depends on the stimuli coming from different synapses. Synapses have weights. Very much similar, if you look at
a single neuron, to what we thought of the perceptron. Except, obviously, they are different
quantities and not as exact and whatnot, but this
is the principle. So the idea now is, maybe if we put a bunch
of perceptrons together in a big network, we will be able to achieve the
intelligence or the learning that a biological system does. And we get to replicate it, and get
something like that in engineering, a network of this sort. And indeed, this was the initial
thing that we did. Now I'm going to make a single comment
about the use of biological inspiration in this way. So I'm going to give you another
example, where we had biological inspiration. And we'll get a lesson from it. So the other example is the following. We want to fly. We look around. Birds fly. Let's try to get inspired by birds. And after a long chain of events,
we ended up with this. Now, there is no question
that the structure, which is what we are going
to use, made it. There are wings, there is
the tail, et cetera. But once you got the basic structure
going, it depends on your discipline. If you're in biology, your
goal is to understand why the structure performs the function,
and to know it. So you want to know how biology
does it, regardless. In engineering, you want
to do the job. You don't care how you do it. You're just using biology
as an inspiration. Completely legitimate approaches
to the problem from different perspectives. But once you did the initial thing, you
are no longer going for the bird and seeing what organs the bird has. No, no, no. What you went here, and all of a sudden
it's all partial differential equations and conformal mappings. And when you get the solution, you
get a plane that flies but doesn't flap its wings. So, imitating biology has a limit. You have to get an inspiration for what
is relevant, and then on your own derive what you need. So going back to our model here. We will get this. Now if I derive a way to learn, et
cetera, I don't need from an engineering point of view to
go back and see if it's biologically plausible. If I'm a biologist, I had better because
my job is to explain how the biological system is working. So if I tell you that it's doing
something that is not biologically plausible, I already violated
the premise. Here, as long as I get
the job done, I'm OK. So it is fine to take the inspiration,
but let's not get carried away. We are actually trying to build
something that does a job from an engineering point of view,
and whatever works, we will take it. And that is where the neural
network is going. So knowing that the building block is
the perceptron, and that we are putting perceptrons together in a neural
network, let us explore what we can do with combinations of perceptrons
rather than a single one. And I'm going to do this pictorially. I will save the math when we define
the neural network itself. So we'll just look at pictures of what
perceptrons do and how to combine them, and we will get the idea that
actually combining this very simple unit does achieve something. So let's look at the famous problem
where perceptrons failed. Remember the four points? With the diagonal +1 and -1. If you want something that is plus here
and plus here, and minus here and minus here, you're out of luck as far
as using a perceptron is concerned. Now we are exploring, can we do this
with more than one perceptron arranged in the right way? That's the goal. So we look at this and say, I can
get the first-- this thing-- with a perceptron I'm going to
call h_1. That's easy. I'm going to get the
second one as this. And maybe now I can take the outputs of
these perceptrons, and combine them in a way that achieves
this particular dependence. And you look at it and say, that's actually very plausible. And your building blocks for doing that
are your old-fashioned OR's and AND's. The logical OR and AND. So you think, let's say I have two
Boolean variables, 0 or 1. Or in this case, +1 or -1. Can I implement an AND, which
returns +1 if, and only if, both are +1? Or can I implement an OR, which
returns +1 if at least one of them is +1? That would be the AND and OR. Can I implement these
using perceptrons? Why? Because I am in the game of trying to
use perceptrons to build stuff, and I'm seeing where this can take me. Well, the OR is very simple. I can do this because I realize
that, because of the constant term that
has a weight 1.5, I'm already ahead of 0. So in order for this to actually go
negative, both of these guys have to be -1, right? And therefore, this actually does
implement the OR because if either of them is +1, I will
get the signal +1. For this one, I'm resisting
a negative bias already. So I'd better have both of them to
be +1, if I'm going to exceed 0 and report +1. So this actually implements the AND. So indeed I can implement the OR and
AND, using a simple perceptron. Now, you create layers of perceptrons
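The two gates just described can be written directly as perceptrons, using the bias weights 1.5 and -1.5 from the argument above.

```python
def sign(s):
    return 1 if s > 0 else -1

def perceptron_or(x1, x2):
    # bias +1.5: only two -1 inputs can drag the signal below zero
    return sign(1.5 + x1 + x2)

def perceptron_and(x1, x2):
    # bias -1.5: both inputs must be +1 to push the signal above zero
    return sign(-1.5 + x1 + x2)
```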
based on what you had. So in our case, we had h_1 and h_2 that
implemented the surfaces we wanted in the Euclidean space, and we just
want to combine them. The combination now, if you look at
it, is that you want the AND of h_1 and h_2 bar, the negative of
this, and h_1 bar and h_2. Basically, you are implementing
an XOR. An XOR wants one of them to be +1,
and the other one to be -1. So this is what you want to implement,
but that is easy because if this is a variable, if I have that ready-- I don't
know whether I have that ready. I know that I have h_1, and
I know that I have h_2. I don't know whether I have this
funny quantity with the bar, but likely I do. Then all I need to do is combine them
this way with the OR function, and then I will get the function I want. So let's expand the first layer, and
make it really layers. So now, you do have h_1 and h_2. We
already established that these are perceptrons. So what you do, when you have
a weight of -1, it's as if you are negating. And a weight of +1, you
are leaving it alone. So you have -1 and +1. And then you get the first layer to
do the AND. But not the AND of the thing itself, but the AND sometimes
of the thing, or sometimes of its negation, in order to implement
this guy that I want. So you end up with these. And these guys will be implementing
the functions you want here. And now you pass them on to the OR, and
you get the function you want. So let's plot the full multilayer
perceptron that implemented the function we want. It looks like this. This is your original input space. This is x_1 a real number, x_2 a real
number in the Euclidean space, and this is the x_0, the constant 1. This is the perceptron h_1 and h_2 that
you implemented in order to get the first picture. So these are the components, and I can
implement them using a perceptron. After I implement them using a perceptron,
I do the conjunction of one and the negation of the other,
in order to get here. And then I do the OR, and get here. So this multilayer perceptron
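Putting the pieces together, here is a self-contained sketch of the whole construction, with a weight of -1 acting as the negation, as described:

```python
def sign(s):
    return 1 if s > 0 else -1

def OR(a, b):
    return sign(1.5 + a + b)

def AND(a, b):
    return sign(-1.5 + a + b)

def xor_net(h1, h2):
    """(h1 AND h2-bar) OR (h1-bar AND h2); negation is a weight of -1."""
    return OR(AND(h1, -h2), AND(-h1, h2))
```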
implements the function that a single perceptron failed at. And we have layers. So each layer would be this fellow,
the inputs going into it and the neurons themselves, the perceptrons. And this is the second layer,
and this is the third layer. So in this case we have three layers. We have strict rules in the
construction, which is feedforward. So it's feedforward, that is, you don't
get the output and put it to a previous layer, and you also
don't jump layers. It is very hierarchical. You go from this layer to the next
layer, and then from the next layer to the next layer. This didn't restrict us very much, because
if you have done logic before, you realize that if
you can do the AND's and the OR's and the negations, you
can do anything. So I can have a very sophisticated
surface and just by having enough of those guys and combining them, I can get
a very sophisticated surface under the restriction of this
hierarchical thing. So that's pretty good. We now realize that we
have a powerful model. And to illustrate the powerful model
in a case, let's look at this case. Let's be ambitious, not only just the
XOR, I want to implement the circle-- which, you remember, required going
to a nonlinear transformation-- just using perceptrons. So you say, definitely that doesn't
look anything like a line. And I'm using lines, there's
no transformation here. So what am I going to do? Let me try 8 perceptrons. Just sort of cornering this. If I do this, each of them will be +1
somewhere, -1 somewhere. So I have a pattern of
+1's and -1's. And all I need to do is the logical
function that will give me where I am inside and where I'm outside. So I end up with a polygon, an octagon in this case, that approximates the circle, using 8. I can go for 16. And then I'm getting closer
and closer to the circle. And I can get as close as I want, by
having as many perceptrons as I want. And now I have a bigger task of
combining the logical results, in order to get the final thing I want. And indeed, you can prove that
multilayer perceptrons with enough neurons can approximate any function,
which is very good. And for us, being powerful is good,
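The octagon idea can be sketched as follows: 8 half-plane perceptrons arranged around the circle, combined by an AND-like perceptron that fires only when all 8 agree. The radius and the evenly spaced angles are illustrative choices, not from the lecture.

```python
import math

def sign(s):
    return 1 if s > 0 else -1

def inside_octagon(x1, x2, r=1.0):
    """Approximate the disk of radius r with 8 half-plane perceptrons.

    h_k(x) = sign(r - cos(t_k)*x1 - sin(t_k)*x2) is +1 on the inner side
    of one edge of a circumscribed octagon.  The AND of all 8 is itself
    a perceptron: the sum of eight +1's is 8, while any single -1 drops
    it to 6 or below, so a threshold at 7.5 fires only when all agree.
    """
    hs = [sign(r - math.cos(2 * math.pi * k / 8) * x1
                 - math.sin(2 * math.pi * k / 8) * x2)
          for k in range(8)]
    return sign(sum(hs) - 7.5)
```

Going from 8 to 16 half-planes tightens the polygon around the circle, exactly as described.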
but it raises two red flags. Once I say this
is a great model, everybody will be excited, except
people in machine learning. Wait a minute, I have
been there before. So what are the two red flags? One of them is generalization. I have a powerful model. I have so many perceptrons, so they have
so many weights, and the degrees of freedom, VC dimension. I'm in trouble. Well, you are in trouble, but at least
you know the trouble you are in now. That is, you can completely
evaluate this. I have this model. It has that VC
dimension. I need that many examples. Done deal. So this is not going to scare us. It is going to make us careful about
matching how sophisticated we can go, to the resources of data we have. So this is not really a deal breaker. The real deal breaker for using
multilayer perceptron is the optimization. Even for a single perceptron, we
were lucky enough to have this perceptron learning algorithm that applies
only in the case of separable data. And we said that in the case of
non-separable, it's a very hairy optimization problem. It's a combinatorial optimization, and
it is very difficult to solve. Can you imagine, now, the problem when
I take layers upon layers upon layers, and combine them? And now I'm trying to find what
is the combination of weights that matches a function. You don't know what the function is. Here, you looked at it. But I'm just giving you examples. I'm asking you to match. How are you going to adjust the weights,
in order to match that? That's an incredibly difficult
optimization problem. And that's what neural networks do. That's the only thing they do. They have a way of getting
that solution. And the way they are going to do it is
that, instead of having perceptrons which are hard-threshold, they are
going to soften the threshold. Not that they like soft thresholds, but
soft thresholds have the advantage of being smooth, twice
differentiable. Does that ring a bell? Oh, maybe we can apply
the general-purpose gradient descent in order to find the solution. And once you find the solution,
you can say I know the weights. Soft threshold is almost the
same as the hard threshold. Let me hard-threshold the answer,
and give you that answer. So that would be the approach. So let's look at neural networks. The neural network will look like this. It has the inputs-- same as inputs
before-- and it has layers. And each layer has a nonlinearity. I'm referring to the nonlinearity
generically as theta. Remember, theta was used in logistic
regression as very specifically the logistic function. I'm using it here generically for
any nonlinearity you want. It turns out the nonlinearity we are going
to use is very much like the logistic function except it goes
from -1 to +1, in order to replicate the hard
threshold which goes from -1 to +1. In the case of logistic regression,
we weren't replicating that. We were simulating a probability
that goes from 0 to 1. So it's very similar to this. And in principle, when you use a neural
network, each of these guys could be different. You can have your different
nonlinearities and you will see, when we talk about the algorithm, that there is
a very minor modification you do in order to accommodate these
nonlinearities. So I could have a label for each of
these depending on where it happens. And the most famous variation
you get to use is actually to make all of them
this soft threshold, and then, when you go to the
output, make that one linear. So this part would be as if
it was linear regression. This would be with a view to
implementing a real-valued function. So the intermediate things are doing
this thing, and then finally you combine them in order to get
a real-valued function. But for the purpose of this lecture
and the derivation, I'm going to consider all these thetas
to be the same. And all of them will be this function
that I'm going to describe mathematically in a moment. So this is the neural network. It has the same rules. It's feedforward. There is no going back, there
is no jumping forward. And the first column is the input x. So you are going to apply your input
x from an actual example to this, follow the rules of derivation from one
layer to another until you arrive at the end, and then you are going to
declare at the end that this is the value of my hypothesis, the neural-network
hypothesis, on that x. The intermediate values we are going to
call hidden layers, because the user doesn't see them. You put the input, there's a black
box, and then comes output. If you open the box, you'll find that
there are layers, and something interesting is happening in the
layers that I'm going to comment about later on. But these are the ones. And for a notation, we're going to
consider that we have L layers. So in this case, it will be three. This is the first layer
with its input. This is the second layer
with its input. This is the third layer
with its input. This is not really hidden,
it's an output layer. So this is the final layer, L.
And this is that. The notation here will persist
with us for that. Now I'm going to take this, and I'm
going to put the mathematical equations that go with it, in order
to be able to implement it. If you want to code this, the next
slide will be the one for you to implement. First thing, I'm going to define the
nonlinearity that I described. It's a soft threshold and we are going
to use the tanh, the hyperbolic tangent. And the hyperbolic tangent-- Well, the formula, theta(s) = tanh(s) = (e^s - e^-s) / (e^s + e^-s), looks more or less
like the one we had before for the logistic one. It's again based on e^s. And this one
happens to go from -1 to +1. At 0, it's exactly 0, with slope 1. It has very interesting properties. And you can see now why
we are using it. If you take it this way, it looks
like a hard threshold. And in the beginning it looks linear. So it has the combination
of both worlds. So if your signal, which is what you have here-- this is
the signal and this is the output. If your signal, which is the weighted
sum of your inputs, is very small, it's as if you are linear. If your signal is extremely large, it's as
if you are hard-threshold, and you get the benefit of one function that is
analytic and very well-behaved for the optimization. So this is the one we're going to use. Now what I'm going to do, I'm going to
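The properties of this soft threshold can be checked numerically. A minimal sketch in Python with NumPy (the function names are my own):

```python
import numpy as np

def theta(s):
    """Soft threshold: tanh(s) = (e^s - e^-s) / (e^s + e^-s)."""
    return np.tanh(s)

def theta_prime(s):
    """Derivative of tanh: 1 - tanh(s)^2."""
    return 1.0 - np.tanh(s) ** 2

# Near 0 the function is approximately linear with slope 1;
# for large |s| it saturates at -1 or +1, like a hard threshold.
```

For small signals it behaves like the identity, and for large signals like a hard threshold, which is exactly the combination of both worlds described above.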
introduce to you the notation of the neural network, because
it's all notation. Obviously the notation will be more
elaborate than a perceptron, because I have different layers. So I have an index for that. I have different neurons per layer. So I have an index for that. And inputs go to the output. And then the output becomes the
input to the next layer. So I just need to get my house in order,
to be able to implement this. So although this is mostly
a notational viewgraph, it's an important viewgraph to follow because
if you decide to implement neural networks, you just print this viewgraph
and code it, and you have your neural network. The parameters of a neural network
are called w, weights. The weights now happen to belong
to any layer to any neuron. And there are three indices
that change. One of them, the different layers, the
different inputs that feed, and the different outputs I get. I have different inputs and different
outputs for every layer. So the weight is connecting one
input to one output in a certain layer. So let's have a notation. I'm going to introduce a notation,
and then apply it to the w. So I'll denote the layer by l. And l, as you see, appears
as a superscript for w. That will be our standard notation. The layer is always a superscript
between parentheses, for the quantity we have. And then I have the inputs. The inputs we are going to
call i, as an index. And obviously, since the weight connects
an input to an output, the i should appear as an index. And the output will be called j. So now my parameters for the network
are w, superscript l, sub ij. Although it's more elaborate than
we had before, we understand where it came from. Now let's talk about the ranges of
values for these three indices. For l as we discussed, l will go
from 1 to L. So from the first layer to L, which would be the
output layer, the final layer. The outputs go from 1
to d, as if it was-- d as in dimension. So you have-- I'm going to the neuron 1,
neuron 2, and neuron d. And because I am in layer l, by
definition, then the dimension of the layer that I'm talking about
will have that superscript. So d superscript l, the number will
differ from one layer to another. And depending on which layer you have,
you'll have a different number of output units, and therefore the
j will depend on that. Now for the inputs, they come
from the previous layer. You take the outputs of
the previous layer to be the inputs in your layer. Therefore, the index for i will go
for the size of the previous layer, l minus 1. Now, I left this out because this
will not be 1, this will be 0. Does anybody know why? Yeah, yeah, it's that constant
x_0 that we always have. Every neuron will have that as an input,
and therefore we will have a generic one, which is x_0
to take care of that. So for every value in this array, you
will have w_ij superscript l, and these are the parameters you want to determine. Now, let's see the function
that is being implemented. What you do is, you get the x's in
layer l in terms of the x's in layer l minus 1. Right? And our notation will give this a generic
index j, so this is the j-th unit in this layer, and this was the
i-th unit in the previous layer. What do you do in order to get that? We do what perceptrons do,
or neurons in this case. You combine them with the weights. The weights are connecting the i to
the j, and they happen to be the weights of this layer. So when we talk about the weights,
the weights will correspond to where the output is. You sum these up. You sum them up from i equals 0, which
is the constant variable, up to the maximum, which would be the maximum for
that layer, which happens to be d superscript l minus 1. So this is the signal. Now you pass the signal through
a threshold, in this case a soft threshold. And you're ready to go. That will be the function
you're implementing. And indeed, this would be the value of the output x. And it happens to be theta of-- we are
going to call this quantity inside, we are going to call it the signal again. And now the signal corresponds
to the output. So the signal is of layer l and
the j-th signal in that layer. You pass it through the nonlinearity,
and what you get is the output of that. So that wasn't too bad. Now, when you use the network, this is a recursive definition. So you do this for the first layer,
second, third, et cetera. Every time you use it, you
get the new outputs. So the first, you get the outputs
of the first layer. These are the inputs to the second. You get the outputs of the second, these
are the inputs to the third. And you keep repeating, until
you get to the final. Now, how do you start this? You start this by applying your input,
the actual input you have, to the input variables of the network. The input variables happen to
be in layer 0, if you want. And they happen to be called x_1 up to
x subscript d superscript 0, by definition. Therefore, d superscript 0 is the same
as the dimensionality of your input. So this one actually has-- what is the x? x_1 up to x_d. So this guy matches this. Therefore, that is how you
construct the network. The number of inputs is the same
as the number of inputs you have. Once you leave that, it
could be anything. It could be expanding, shrinking,
whatever it is, anything it wants. And when it arrives, it should arrive
at the value of your output. You have a scalar output, and
therefore, after a long iteration, you will end up with one output, which
happens to be in layer L, and since I have one output,
the j is only 1. So this is my output of the network,
and I'm going to declare that my output of the network is the
value of my hypothesis. That is the entire operation of
a neural network when you feed it. So when you tell me what the weights
are, I am going to be able to compute what the hypothesis does. Now our job is to find the weights
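The forward computation just described can be sketched as follows. This is a minimal sketch, not the lecture's code; the weight-array layout, with the bias weight for the constant x_0 = 1 stored in row 0, is my own choice:

```python
import numpy as np

def forward(weights, x):
    """Forward pass of a feedforward tanh network.

    weights[l] has shape (d^(l-1) + 1, d^(l)); row 0 multiplies
    the constant x_0 = 1.  Returns the outputs x^(0), ..., x^(L).
    """
    xs = [np.asarray(x, dtype=float)]
    for W in weights:
        x_prev = np.concatenate(([1.0], xs[-1]))  # prepend x_0 = 1
        s = x_prev @ W                            # signals s^(l)
        xs.append(np.tanh(s))                     # x^(l) = theta(s^(l))
    return xs

def h(weights, x):
    """The network hypothesis: the single output x_1^(L)."""
    return forward(weights, x)[-1][0]
```

The recursion is exactly the one on the slide: each layer's outputs, with the constant prepended, become the next layer's inputs, until the single output x_1^(L) is declared to be h(x).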
through learning, so that we match a bunch of input-output examples. When I put those inputs and look at
the target outputs, I find that the network is replicating them well. That is the backpropagation algorithm. So let's do what? We are going to apply stochastic
gradient descent. So you take one example at a time,
apply it to the network, and then adjust all the weights of the
network in the direction of the negative of the gradient, according
to that single example. That's what makes it stochastic. So let's do it. Now the parameters are
all the weights. This array, which is a funny array, is not quite a complete matrix,
because you have a different number of neurons
in different layers. So this is just a funny array. But it's indexed by i, j, and l. It's a legitimate array. And this determines h. Therefore, what I'm doing here is
getting the error on a single example: (x_n, y_n). And I'm going to-- by definition, I have
some error measure. Let's call it e of h-- my error measure between the value
of the hypothesis, which is the neural network, and the target label. And this happens to be a function of
the weights in the network, why? y_n is a constant, x_n is a constant. This is part of the training example. h is determined by the w. That's why this is w, and I'm putting
it in purple, because this is the active quantity now when
we are learning. So to implement SGD, all you
need to do is implement the gradient of this quantity. And what is gradient of this quantity? Well, the gradient of this quantity
is a huge vector. Each component is partial the error,
by partial one of the parameters. So we put it down. All you need to do is compute
this for every i, j, and l. That's all you need to do! And then you take this entire vector
of stuff, and you move in the space along the negative
of that gradient. That is the game. There is nothing mysterious about this. If you never heard of backpropagation,
you will be able to do this, as we'll see in a moment. The idea is just to do it efficiently. And it makes a big difference when you
find an efficient algorithm to do something. For example, those of you who have
learned linear systems know FFT, the Fast Fourier Transform. Fast Fourier Transform is,
you implement the discrete Fourier transform. What's the big deal? The big deal is because it's faster. You get N logarithm N, instead
of the alternative. And that simple factor made the
field enormously active, just by that algorithm. And very similar here. Backpropagation, if you look at it, I
can brute-force-implement this for every i, j, and l. But now I have one thing that will get
me basically all of these guys at once, so to speak. And therefore it's efficient and people
get to use it, and that's why neural networks became quite popular. So let's try to compute this. Let me take part of the network. So this is in the layer l minus 1,
and this is in the layer l. I'm looking at the output of one neuron
in this layer, feeding through some weight, into this guy. So it is contributing to the signal
going into the next guy, and the signal goes into the nonlinearity
to produce the output. Now this quantity is not mysterious. If you look at it, we can evaluate
those one by one. That is for every single weight in the
network, we can ask ourselves: what is the error? Well, the error is sitting there. At the output, I have-- Here's my pointer. I have the output. I went further than I should! But the output is sitting
somewhere there. Therefore, there is an error. And that error will change,
if you change w. And that will tell you what
is partial e by partial w. So we can do this analytically. There is nothing mysterious. I can get the output as a function of
the previous layer, of the previous layer, of the previous layer,
until I arrive here. So I have this function that has tons
of weights in it, and I'm focusing on one of them. And I can say, what is partial
e by partial this fellow? Apply chain rule, get a number. Not a big deal. It's not your favorite activity,
but you can do that. Or even you can do it numerically. I can take this fellow, perturb it
just a little bit, and see what happens for the error of the output. And therefore, I can get numerical
estimate for partial by partial. The problem with those approaches
is that I have to do it for each one of them. What I'm going to do now, I'm going to
try to find something that will make me get the entire array, which is the
full gradient, in almost one shot. So here is the trick. The trick is the following. I'm going to express partial e by
partial w_ij, the change in e which is upstairs here, with respect
to this particular parameter. I'm going to get it in terms of partial the same quantity by partial
an intermediate quantity, this signal, times partial the intermediate
quantity by partial what I want. This is just chain rule. But chain rule with partial derivatives,
you need to be a little bit careful, because there may
be different ways your variable is affecting the output, and you need
to sum up all the effects. But here, if you're looking for how does
w affect the error? Well, w only affects this sum. w_ij affects only the sum s_j. So if I get partial by partial s_j,
this is the only link which w_ij affects the output, and therefore I'm
allowed to do this, and there is nothing to sum up with respect to. So I have this chain rule. That's nice. I can probably look at this and say, this
is a very simple quantity to get. How does the signal change
with the weight? We probably can get
an easy one there. But this one is almost as
bad as the original one. How does the error change
with this signal? It doesn't look like
great progress. But the great progress is that this
quantity can be computed recursively. That's the key. So what do we have in this equation? Well, we have the first one. Because if I take this guy, what
is partial s by partial w? s is simply the sum
of w's times x's. So partial s by partial w is the
coefficient, which happens to be the x, and this is the coefficient there. And that is readily available. I already computed all the
x's, so that I have. The other guy, this is a troublesome one,
so we're just going to call it a name, and see if we can get
something going for it. And the name we're going
to call it is delta. So now delta goes with a signal. There will be a delta sitting here,
if we can compute it. And the interesting is that the
derivative of the error with respect to this weight, which will determine how much you change
that weight, because when you get that gradient, you move along
the direction of the gradient. It means, in each component, you go in
proportion to the value of the partial derivative. So since this is the partial derivative,
the change in the w will be the product of these two guys--
proportional to that. One of them is x here, and one
of them is delta here. So we'll be changing the weight
according to two quantities that the weight is sandwiched between. And that's a pretty attractive one. If I get all of those, then I look at
the x's and the delta's, and the weight in between will change accordingly. Now, let's get delta
for the final layer. Why do I get delta for
the final layer? When we computed the thing, we got x's
for the first layer, the input. And then we propagated forward
until we got to the output. The reason why we're going to get it is
because the math will tell us that if you know delta later, you're going
to be able to detect delta earlier. So this will be propagating backwards,
and hence the name backpropagation. So we're going to start with
the delta at the output. And it's not a surprise, because I'm
trying to get the partial error by partial something. The closer I am to the action, to
the output, the easier it is to compute it. And indeed, for the output, it
will be very easy to compute. This is the definition of delta
for any value of j and l. And when you look at the final layer,
the final layer is not mysterious. It's l equals L, and j equals 1. I have a scalar function, so
that is the output layer. Therefore, the quantity I'm interested
in is exactly-- just substituting with this quantity. I want delta superscript L,
subscript 1. That's what I want to compute. Now, can I compute this? Let's look at it. This is e of w, the thing
I'm differentiating. What is e of w? e of w is the error measure, whatever
you have, between the value of your hypothesis, that is the value of the
neural network in its current state, with the weights frozen. You apply x_n. You go forward until you get
the output, that is h of x_n. You compare that to the target output,
which is the label of the example, y_n. And that error will be your e of w. Why is it e of w? Because h depends on w. So that is not mysterious
because h of x_n is what? It's the value of the output. Right? And that happens to be the variable in
layer L, variable number 1, that is your output. And, for example, let's say that you
are using mean squared error, just for the moment. This can apply for any analytic
error measure you put here. But if you're using mean squared
error, this would be it. That's a friendly quantity, because now
I want partial by partial. I have this, and this fellow is related to
the thing I'm differentiating with respect to. This is a constant. I can deal with the squared. So I'm getting closer to being able
to evaluate this explicitly. So let's look at x, the output. Well, the output is nothing but-- you
pass the signal through the nonlinearity, right? The nonlinearity is the tanh. Not mysterious. The signal is what I'm differentiating
with respect to. I'm almost done. So now, all I need to do is realize that
when I do this, I will have to know the derivative of theta, because
there is a chain rule and I'm differentiating with respect to this. And this is an intermediate quantity,
so I need to get theta dash. So what is theta dash? So what is the derivative of the tanh? Happens to be 1 minus
the tanh squared. This is for this particular one. If you have another nonlinearity, you
just compute what that is. This is good. So we have delta for the final layer. If I put the input, get the output,
I go through this and I have an explicit value: delta at
the output is the following. So now, the next item is to
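For the squared error e = (tanh(s) - y)^2, this output delta can be written down explicitly and, as suggested earlier in the lecture, checked by perturbing the signal a little. A sketch (the helper names are mine):

```python
import numpy as np

def output_delta(s, y):
    """delta_1^(L) = de/ds = 2 (tanh(s) - y) * (1 - tanh(s)^2)."""
    x = np.tanh(s)
    return 2.0 * (x - y) * (1.0 - x ** 2)

def numerical_delta(s, y, eps=1e-6):
    """Estimate de/ds numerically with a central difference."""
    e = lambda t: (np.tanh(t) - y) ** 2
    return (e(s + eps) - e(s - eps)) / (2 * eps)
```

The two should agree to within the precision of the finite difference, which is a cheap sanity check when implementing backpropagation.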
back-propagate delta down to get the other delta's. This is the essence
of the algorithm. So now, I'm taking the network, but
now I'm taking the network from-- I'm taking one unit here, and looking at all the units in
the next layer, because these guys happen to be affected by x, and
therefore, happen to be affected by s. Remember delta is partial
something by partial s. And I want to get partial this by
partial s_i, in terms of partial by partial the s's here. I'm going backwards. So I already computed up to here,
and now I want to go here. So now, I need to take into consideration
all the ways that this affects the output, so I'm drawing
the relevant part of the network. This is the quantity that I want. I want to evaluate partial
e by partial s_i. So I want to compute partial by
partial this fellow. So now I'm going to apply
the chain rule again. I will get partial e by partial these
fellows, which supposedly in my mind, I already know. That's the first part of the chain. Then I'm going to get partial
this guy by partial x. Fine. As long as I'm making progress towards
the destination, I'm OK. You can do it any way you want. And finally, I'm doing this. Partial x by partial s. So you go through this. This is partial e by partial the final
guy, and these guys happen to be intermediate. However, the way this fellow affects
the output, it affects all of those guys. So when I do the chain rule, I need to
sum up over all the routes that this happens through. And therefore, I need to sum
up over all the points here for this quantity. The way e is affected by this guy is
through the way e is affected by this fellow through here, or by this fellow
through here, et cetera. And therefore, the rule in this
case would be the sum. It looks like a very hairy one,
but it's no big deal. Now let's collapse it to
something very friendly. It's a sum of something. Let's take it one term at a time. We will take this. What is the partial derivative
of x_i by s_i? x_i simply happens to be the
nonlinearity applied to this one. So all I need to do is just
differentiate that nonlinearity, and apply it to the value at hand. So what do you get? You get theta dash applied to the signal. I can have that. How about the next guy? That's an easy one. What is the derivative
of this fellow by x_i? Yeah, this is just the sum. I get the coefficient, the coefficient
happens to be this thing, so that is what I get. Do I have all of this? Yes. The next guy is the interesting one. How do I get this? Well, you don't get it. You already
have it by recursion. This happens to be the old delta. So now I have the lower delta
in terms of the upper delta. And I have the top delta in hand. We are done. We just have to keep doing this,
and we'll get all the delta's. And the form for the delta
is interesting. So this fellow does not depend on
the summation index j, right? And this happens to be the derivative
of the tanh, so it's 1 minus that squared. So when you get 1 minus that
squared, you get this and you can factor it out. The rest of it depends on j and
you are summing this up and you're getting this. Now isn't it lovely to have
an equation like this? This looks exactly like
the forward pass. We're taking something, combining it
with the weights, summing up, and getting this. Instead of applying a nonlinearity,
which we did in the forward, we're multiplying by this funny guy. So it looks like a very much
similar situation. But when you are done, you are going
to get a bunch of delta's at every position where an s is. And from our previous experience, then
we're ready to go with the delta and the x, and adjust the weight that is
sandwiched between them accordingly. So you see the reverse,
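The backward recursion just derived can be sketched the same way. This assumes the layer outputs from a forward pass are stored as a list of arrays, weight arrays keep the bias weight in row 0, and the output error is squared error; these layout choices are mine, not from the lecture:

```python
import numpy as np

def backward(weights, xs, y):
    """Compute delta^(l) for l = L, ..., 1 by backpropagation.

    xs[l] holds the outputs x^(l); weights[l-1] has shape
    (d^(l-1) + 1, d^(l)), with the bias weight in row 0.
    Output error is e = (x_1^(L) - y)^2.
    """
    L = len(weights)
    deltas = [None] * (L + 1)
    x_out = xs[L]
    # delta at the output: 2 (x - y) * (1 - x^2)
    deltas[L] = 2.0 * (x_out - y) * (1.0 - x_out ** 2)
    for l in range(L, 1, -1):
        # delta_i^(l-1) = (1 - (x_i^(l-1))^2) * sum_j w_ij^(l) delta_j^(l)
        W = weights[l - 1][1:, :]      # drop the bias row: x_0 has no delta
        deltas[l - 1] = (1.0 - xs[l - 1] ** 2) * (W @ deltas[l])
    return deltas
```

Note how the loop mirrors the forward pass: delta's are combined with the weights and summed, but multiplied by the derivative factor 1 - x^2 instead of passed through the nonlinearity.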
now we are going down. It's delta's going down, the
arrows are going down. It used to go up. So let's do this. And then instead of having theta here,
we are multiplying by something, and what we're multiplying
by is this quantity. That's what you do in the
backward propagation. So here's the algorithm. First, the picture
of the algorithm. That's all you do. You take the input, you compute the x's
forward, you get the error, you compute the delta's backward. This is supposed to be delta-- delta
has disappeared for some reason. The delta and the x determine the weight in between. So if you put the algorithm this way,
you initialize the weights and then you pick n at random-- that's what makes
it stochastic gradient descent. You do the forward run I described. You do the backward run, and then you
update the weights according to the input and the delta that are
surrounding the weight. You keep this until it's time to stop,
and then you return the final weights, and that is your algorithm. Now there are obviously all the
questions: the termination criterion, the local minima, all of that. That's the
thing we discussed in the Q&A session. There's something specific here that
I want to emphasize, which is the initialization. Because it's very tempting to initialize
weights to 0, which worked actually very well with
logistic regression. If you initialize weights to 0 here,
bad things will happen. So let me describe why. First, the prescription is to
initialize them at random. Why is initializing 0 bad? If you follow the math, you realize that
if I have all the weights being 0, which is what that means, and you
have multiple layers, then either the x's or the delta's will be 0. For every possible weight, one
of the two guys that are sandwiching it will be 0. And therefore, the adjustment of the
weight according to your criterion would be 0. And therefore, nothing will happen. This is just because of the terrible
coincidence that you are perfectly at the top of a hill, unable
to break the symmetry. So you're not moving. If I just nudge you a little
bit, you will be slipping like there's no tomorrow. But as long as you're there,
you're not moving. Pretty much like you think of a donkey
that is hungry, so they put two sacks of food on top of it. All it needs to do is eat or eat. Unfortunately, it's perfectly symmetric,
and the donkey cannot break the symmetry, and it starves to death. So we just want to break the symmetry. So we introduce randomness: shake
the food a little bit, which is here to just start with
a random thing. Choose weights that are small and
random, and you will be OK. One final remark and we'll call it a day,
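You can see the deadlock directly: with all weights 0, every gradient entry is x_i times delta_j, and one of the two factors is 0. A small sketch (the network shapes are my own example):

```python
import numpy as np

# Two inputs -> two hidden tanh units -> one output, all weights 0.
W1 = np.zeros((3, 2))            # row 0 is the bias weight for x_0 = 1
W2 = np.zeros((3, 1))
x0 = np.array([1.0, -0.5])
y = 1.0

x1 = np.tanh(np.concatenate(([1.0], x0)) @ W1)   # hidden outputs: all 0
x2 = np.tanh(np.concatenate(([1.0], x1)) @ W2)   # output: 0
d2 = 2.0 * (x2 - y) * (1.0 - x2 ** 2)            # output delta: nonzero
d1 = (1.0 - x1 ** 2) * (W2[1:] @ d2)             # hidden deltas: 0, because
                                                 # the outgoing weights are 0
grad_W1 = np.outer(np.concatenate(([1.0], x0)), d1)
# grad_W1 is identically 0: the first-layer weights never move.
```

The first-layer gradient is exactly zero, so the symmetric starting point is a fixed point of the updates; a small random initialization breaks it.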
which is about the hidden layers. So let's look at the network again. This is the network. We have an understanding of this fellow,
and we have an understanding of the output. And the hidden layers were just a means
for us to get more sophisticated dependency. So, if you think what the hidden layers
do, they are actually doing a nonlinear transform, aren't they? I have these raw inputs, and I'm passing
them through this thing, so I can look at these guys and
consider them features. And because they are higher-order
features, I'm able to implement a better one. And this one will be features
of features, and so on. Now the only difference, and it's a big difference, between the nonlinear transform here and
the nonlinear transform we applied explicitly in the case of
linear models, is that these are learned features. Remember when I told you not to look at the data before
you choose the transform. The network is looking at
the data all it wants. It is actually adjusting the weights
to get the proper transform that fits the data. And this is not bothering me, because I
have already charged the network for the proper VC. The weights here that constitute that
guy contribute to the VC dimension. The VC dimension is more or less
the number of weights. That's the rule of thumb here. So it is completely fine to look at the
data, because it's not looking at the data that is bad, it is looking at
the data without accounting for it that is bad. And here it's built in that
it's accounted for. So this is nice, because now you can
see it's not a generic nonlinear transformation, it's a nonlinear
transformation with a view to matching very specifically the dependency
that I'm after. So that's the source of
efficiency there. Now comes the question,
can I interpret what the hidden layers are doing? So I'll tell you a story. Early in my career, I was doing
a consulting job for a bank, and they wanted to apply neural networks
to credit approval. Very easy. Give me the data, we'll do it. We'll take a fairly simple network. So one of the people in the bank that
I was dealing with came and asked me: can you please tell
me what the hidden layers are doing? So in my mind I think, is he doubting
my competence or something? He wants reassurance or
something like that? I mean, the performance is perfect,
and he can try it out of sample and whatnot. But then I realized that the reason he
is asking for the interpretation has absolutely nothing to do
with performance. It's legal. If you deny credit for someone,
you have to tell them why. And you cannot send a letter
to someone saying: sorry, we denied credit because lambda
is less than 0.5. [LAUGHTER] So that's the reason. But the fact that you are not able to
interpret what happens in machine learning is very, very common. Go back to the movie example. We get the factors. We predict the ratings. And let's say you apply the
system, and you keep recommending movies to someone. And the person is so impressed. You are
recommending movies that are on the spot every time. So they come and ask
you: how do you do it? You tell him, because factor number 7
is very important in your case. So he says: OK, great. So what is factor number 7? And then you say-- lots of hand waving. You have no idea what
factor number 7 is, but factor number 7 is important in your case. Very common in machine learning
because you remember, when the learning algorithm tried to learn, it
tried to produce the right hypothesis. It didn't try to explain to you
what the right hypothesis is. That was the goal. Let me stop here, and then take
questions after a short break. Let's start the Q&A. MODERATOR: The first question is: could you explain what people
mean by using a momentum term in neural networks? PROFESSOR: Momentum is used
as an enhancement for the batch gradient descent, in order to get some effect
of the 2nd order. So the idea is that if you use gradient
descent, gradient descent is using strictly 1st order,
just the slope. And if the surface is changing so
quickly, which means that the second order is important, you want to get
a glimpse of that without having to go through the trouble of computing the
Hessian, the 2nd-order quantities. So if you take what's called a momentum
term, which means that you take a little bit of the step that you
had previously, a bit less of the step before that, and so on, you end up accounting for
some aspect of this. Because if the surface is curved, this
goes one way, and if the surface is flat, it goes the other way. So I didn't-- There are lots of heuristics.
The momentum is one of them. For stochastic gradient descent,
the way I described it, actually works very nicely. And in all honesty, if I have to go
to 2nd order I will just go for conjugate gradient, because it's
so principled and really gets the bottom line. I'm not big on using momentum in my
own applications, but other people have found it to be useful. MODERATOR: Some people are asking about
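A common form of the momentum heuristic being described (this is the standard textbook formulation, not taken from the lecture; the names eta and alpha are mine):

```python
def momentum_step(w, v, grad, eta=0.1, alpha=0.9):
    """One momentum update: keep a decaying running sum of past steps.

    v <- alpha * v - eta * grad   (a little of each previous step)
    w <- w + v
    """
    v = alpha * v - eta * grad
    return w + v, v
```

The velocity v accumulates along directions where the slope is consistent and cancels where it flips sign, which is how this 1st-order rule gets a glimpse of the curvature.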
the popularity of neural networks, that it has had its ups and downs. So what's the state of the
art in neural networks research, if there's any? PROFESSOR: Initially, neural networks
were going to solve the problems of the universe. So the usual hype. And hype in some sense is not bad for
research, because it gets people excited and gets enough people to
work to get the real results. And then when it settles down, there's
a critical mass of work. So I don't think this was
a bad thing in hindsight. But what happened is that because of the
simplicity of the network and the simplicity of the algorithm, people
used them in many applications. And it became a standard tool. And there are lots of tools you will
find in all kinds of software, where you just apply a neural network. And until this very day, there
are companies that use them very, very regularly. So they are post-research, so to speak. There's very little done in terms of
research. The basic questions have been answered. But in terms of being used in
commerce and industry, they are used. They have very serious competitors,
like, for example, support vector machines and lots of other models,
but they're still in use. Not the top choice nowadays, but every now and then, someone will publish something where he did this and that, and he will have used a neural network and gotten good results. MODERATOR: OK. How to choose the number of layers? PROFESSOR: This is model selection. So neural networks are really a class of models-- a class of hypothesis sets-- and there are obviously a bunch
of things to choose. How many layers? And how many units per layer? So if you look from an approximation
point of view, because of the sum of products in logic, you can implement
anything using a fairly shallow network, provided that you have
enough neurons in that layer. But that's an approximation question; here, we are talking about a learning question. So the real question is, how
many weights can I afford? And then the question of organizing
them is less severe. So how many weights can I afford? Because they reflect directly on the
VC dimension and the number of examples I need. And there are actually methods that, given a particular architecture, try to kill some weights in order to reduce the number of parameters, as a method for regularization. And we'll allude to that, when
we get to regularization. But basically, this is a model selection
question, where model selection tools apply. The most profound of them will be validation, to which we will dedicate a lecture. MODERATOR: Can you-- why was tanh, the hyperbolic tangent, used? PROFESSOR: Why is it used? MODERATOR: Yes. PROFESSOR: OK. I want a soft threshold, and I want it
to go from -1 to +1. And I want it to have a nice analytic
property that I can differentiate it. These are basically the three reasons. In the other case, it was exactly the
same except that I didn't want something to go from
-1 to +1. I wanted something to go from 0 to 1,
because in logistic regression, I wanted a probability. Here, I wasn't really interested in
the continuity for its own sake. There I was, because it's
a probability. Here I was interested in the continuity
just because I wanted the analytic property of differentiation,
in order to apply gradient descent. But what I care about is going from -1 to +1, which is the hard-decision version. MODERATOR: Will the final weights
depend on the order, the way that the samples are being picked? PROFESSOR: Correct. They will depend on the
initial condition. They will depend on the order
of presentation. They will depend on that, but that
is inherent in the game. We are never assured of getting to the
perfect minimum, the global minimum. We will get to a local minimum,
and anything will affect us. But the whole idea is that you are going to arrive at a minimum. And you can do what we suggested in the last lecture's Q&A session: just start from different starting points, and use different randomizations for the presentation. This randomization could be, you
pick a point at random. You could pick a random permutation,
and then go through the examples according to that permutation,
and then change permutation from epoch to epoch. Or you could simply be lazy and
just do it sequentially. And all of these, more or less, get you there-- with different results. So if you try a variety of those, let's say 100 tries, and then pick the best minimum you have, you will get a pretty decent minimum, and will be fairly robust in terms of independence from the particular choices that you made in
any of the 100 cases. MODERATOR: Could you go
back to slide 12? There. So if you could review the two red
flags for generalization and optimization? PROFESSOR: OK. So the top part of the figure showed
that we are dealing with a sophisticated model, because
in spite of the fact that the unit of it is linear-- the perceptron-- you can implement a circle, just for
illustration. You can implement a pretty difficult surface by
combining those fellows. When you have a powerful model, it means you can express a lot of things,
and therefore the question of generalization comes in because, if you
can express a lot of things, you have a big hypothesis set. And then the question of zooming in
and generalization-- the stuff we handled in theory. But the comment here is that we are
going to have the VC dimension of whatever model we have, and the VC dimension summarizes all generalization considerations. We may decide that this is too
sophisticated a model, because we look at the VC dimension of it and the
resources of data we have, and we decide we just cannot generalize. But at least it's under control, because
we have the number that describes it. In terms of optimization, now it's not
like I have the target written here and I'm just designing perceptrons. I'm given a data set--
inputs, outputs-- and I have a multilayer perceptron,
each of which is computing a perceptron function, of a perceptron
function, of a perceptron function. And now I want to choose the weights
for the different layers, in order to get there. So obviously that's a very difficult
combinatorial optimization, because it was difficult even for one perceptron. That's why the optimization
here is a red flag. That's why we needed to go for
an approximation using a continuous function, where we have something like
gradient descent that can work for us. MODERATOR: You mentioned the VC
dimension is roughly the number of parameters. So they want you to comment on it. PROFESSOR: We are not going to be as lucky as in the case of perceptrons, where we got the VC dimension exactly. In this case, there are some analyses. And because the weights are not
completely independent in their impact-- you can play around with weights in different layers and compensate for one another, and there are some permutations and whatnot-- they don't each contribute a full degree of freedom. So you can upper-bound it by the number
of weights, and lower-bound it by something fairly close to the number
of weights, but smaller. So as a rule of thumb, you take
it as the number of weights being the VC dimension. And that has stood the test
of time, in terms of practice. MODERATOR: In terms of the interpretation,
by just looking at the first layer, it's not enough to interpret the-- PROFESSOR: Oh, if your
interpretation is to say: yeah, I understand perfectly what
the first layer does. It gives 0.3 weight to the first input,
and 0.7 weight to the second input, and minus 0.4 weight to the third
input, and sums them up and then compares to the threshold,
which is 0.23. If you take that as an interpretation,
then they are interpretable. But an interpretation here,
what people mean, is interpretation that makes sense. For example, your interpretation in the case of movies: you say this factor is the comedy content. People can relate to that. But what we are saying is that the
factor is relevant to the rating, but cannot be articulated in simple terms
that people would consider interpretation. And similarly for the hidden layer here. MODERATOR: Can you say what happened
in the end with the bank? What explanation was taken in the end? PROFESSOR: No, I can't. [LAUGHING] It's a private consultation.
I cannot comment in detail. But basically, the question was
raised and it made the point. MODERATOR: Can you explain again why in
the past lecture, you mentioned that data snooping is not a good practice? PROFESSOR: Data snooping is a bad
practice if you don't account for it. When we get to data snooping--
we will discuss it in one of the lectures, we will say that you either
avoid it or account for it. The problem is that if you data-snoop,
and you don't account for it in terms of its impact on generalization,
you end up with something that is extremely optimistic. You go to the bank-- if you do a private consulting job for a bank-- and tell them: I have something that predicts the stock market great. And then you give it to them. And when they go to the stock market,
it falls on its head, and that's the problem. Because you thought it would
generalize, and it didn't. So data snooping, in the way I presented
it, was the fact that we didn't account for that-- we learned in our mind, but we didn't
account for the VC dimension of the space we worked on. That was the problem, rather than
looking at the data in and of itself. But since the damage is almost
unavoidable, it's a very good practice not to look at the data, because the
accounting is difficult in this case. In the case of neural networks, there
was looking at the data in a very prescribed way. A learning algorithm was actually
trying to find the weights that constitute the hidden layer. So therefore, it is looking
at the data in abundance. On the other hand, the accounting has
already been taken into consideration, because, as I mentioned, the weights
have been counted as contributing to the VC dimension. So we know the impact on the
generalization behavior. MODERATOR: Does the range of the weights
alter the choice of eta? PROFESSOR: Which weight? Repeat the question, please. MODERATOR: Does the range of
the weights affect the value of eta? PROFESSOR: Let's say that you're making decisions. Eventually, you will take the output layer
and hard-threshold it, so there it is as if you are just scaling the weights. But the intermediate weights, the
actual value matters because the actual value of their output
will contribute to the next layer. So you cannot just say that I'm
scale-invariant or anything like that. But supposedly, the learning rate was
only a way to arrive at a minimum of the error function. And the minimum of the error function
will happen at a particular combination of the weights. So it shouldn't affect it in a predictable way. Obviously, if I change the rate, I may
end up in a different spot, but it's not like I will end up in
a better spot or a worse spot if I use a reasonable learning rate. Yes, it does affect it, but it affects it in an unpredictable way. Pretty much like you can ask: how does the initial condition affect the result? Well, it affects it, but it affects it in
a random way, and you're better off just averaging over a number of cases,
or picking from a number of cases, in order to immunize against
that type of variation. MODERATOR: Is there a relation between
neural networks and genetic algorithms? PROFESSOR: I guess both of
them appeal to someone who's interested in a biological reflection. Genetic algorithms are optimization techniques, based on getting a generation, having mating, and keeping the good genetic properties, so the connection doesn't go beyond that. Everything in machine learning has been
applied to everything. So there were actually people trying to
train neural networks using genetic algorithms. You'll find in the literature
all combinations. But at a basic level, neural network
is a model. Genetic algorithm is an optimization
technique. And therefore, there's really no
relationship between them. MODERATOR: Small confusion. Does in-sample training constitute
looking at the data? PROFESSOR: The strict answer is yes. You look at the data all too well. You are actually looking at the data, and you're trying to minimize the error on the data, and all of that, which again is fine as long as you have already taken into account that the weight space you're navigating has limited VC dimension. And therefore, when you do that and you
get to something, you still have a guarantee of generalization from
what you're arriving at to the out-of-sample. So the learning algorithm looks at
the data. That's all it does. It looks at the data. But it's already-- before we even
turn the learning algorithm loose on the data, we have already chosen
the hypothesis set. And we put the generalization
checks in place. MODERATOR: What do you recommend,
implementing your own neural network or using a package? PROFESSOR: It's-- Honestly, it's a borderline case. For example, if you're doing the
perceptron, you just write it down. It's so simple. In neural networks, it's a little bit
complicated, and there are some bugs that are typical. I used to have this as an exercise, and
then I decided that the logistics of doing it are not worth the benefit. So to answer your question,
I recommend using a package, for neural networks. MODERATOR: Does analyzing-- performing some sort of sensitivity
analysis on the weights give us some information about how the neural
network-- PROFESSOR: Yeah, there's
actually work on that. There are questions of regularization
based on that, on how effective the weight is, and the
disturbance, and whatnot. There are all kinds of analyses.
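One simple version of such a sensitivity analysis can be sketched by finite differences; the tiny network below is a hypothetical illustration, not a model from the lecture:

```python
import numpy as np

def forward(x, W1, w2):
    """A tiny one-hidden-layer tanh network with a scalar output
    (illustrative architecture, chosen for this sketch)."""
    return np.tanh(w2 @ np.tanh(W1 @ x))

def weight_sensitivity(x, W1, w2, eps=1e-4):
    """Finite-difference sensitivity of the output to each first-layer
    weight: nudge one weight by eps and see how much the output moves."""
    base = forward(x, W1, w2)
    sens = np.zeros_like(W1)
    for i in range(W1.shape[0]):
        for j in range(W1.shape[1]):
            Wp = W1.copy()
            Wp[i, j] += eps          # perturb a single weight
            sens[i, j] = abs(forward(x, Wp, w2) - base) / eps
    return sens

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))  # hidden layer: 3 units, 2 inputs
w2 = rng.standard_normal(3)       # output weights
s = weight_sensitivity(np.array([0.5, -1.0]), W1, w2)
```

Larger entries of s mark weights whose disturbance moves the output most; weights with near-zero sensitivity are natural candidates for the weight-killing regularization mentioned earlier.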
Neural networks have been studied to a great level of detail. And indeed, the choice of the weights,
the range of the weight, perturbation of the weight-- all
of these have been looked at. MODERATOR: Are there other models
that lend themselves more to interpretation? PROFESSOR: If you have a bunch
of parameters and the algorithm is going to choose them,
then interpreting those parameters is already not clear. You can artificially put constraints
in order to make sure, or you can start from an initial condition
that already has an interpretation, and you're just fine-tuning it. But that's if you are very keen
on the interpretation aspect. MODERATOR: Going back to
the first examples, where there were logic implementations
with the perceptrons. So there was a confusion. Are we still trying to learn weights
here, or we just have them fixed? PROFESSOR: This was an illustration
for the fact that when you combine perceptrons, you're able to
implement more interesting functions. This didn't touch on learning yet. After we do that, we found that the
structure that is multilayered is an interesting model to study. And from then on, it became
a learning question. We had a neural network. We are no
longer going to look at target functions, and try to
choose the neurons. We are just going to put it as a model,
and let the learning algorithm choose the weights, which is backpropagation
in this case. MODERATOR: Could you briefly
explain early stopping? PROFESSOR: OK. I think it is best described when
I talk about regularization and validation. It is basically a way to prevent
overfitting, which is the next topic. So I think it will be much better
understood in the context once we understand what overfitting is, and
what are the tools for dealing with overfitting-- regularization,
and validation in this case. And then early stopping will
be very easily explained. MODERATOR: A question on stochastic
gradient descent. When you go through an epoch, you choose
points randomly. Only points you have not selected, right? PROFESSOR: There are lots of choices. An epoch is one run. And it's a good idea to have all the examples contribute. So one way to get it to be random and
still guarantee that you get all of them is, instead of choosing the point
at random, you choose a random permutation from 1 to N, and then
go through that in order. And then for the next epoch,
you do another permutation. If you do it this way, eventually every example will contribute the same, but an epoch will be a little bit more difficult to define. You can define it simply as N
iterations, regardless of whether you covered all of them or not. That is valid. And some people simply do a sequential
version, no randomness at all. You just go through the examples, or you
have a fixed permutation, and you go through the examples in that
order, and keep repeating it. And there are some observations about
differences, but the differences are not that profound. MODERATOR: Does having layers
and no loops limit the power of the neural network? PROFESSOR: Loops as in feedback,
I'm assuming. Once you have feedback, even the
definition of what function I'm implementing becomes tricky, because
I'm feeding on myself. So it's a completely different type. There are recurrent neural networks, which is actually the model that started the work in neural networks. And they have completely different
mathematics and application domains. Here we're implementing a function,
and it is clean enough to do it in a layered way, in order to get
a nice algorithm like that. And since we showed that you can
basically implement anything if you have a big enough model, you
are not missing out on something by doing that. One could say: maybe I can get a smaller network if I can jump layers, which is possible. It's an interesting intellectual
curiosity, but in terms of practical impact, it has very little. MODERATOR: In terms of the VC dimension,
since it roughly depends on the number of parameters, if you had
a fixed number of nodes, but you arrange them in layers, what do you
gain or what do you lose in that? PROFESSOR: OK. If you believe in the rule of thumb,
and it's just a rule of thumb that is based on upper and lower bounds, then if I rearrange the nodes into different layers, the number of weights will change, because for the number of weights-- I say how many neurons here, and how many neurons there, and I multiply the numbers, and that gives me the number of weights. So as long as your guiding number, the bottom-line number, is how many weights I put in the network, you'll be more or less OK. Obviously, you can take extreme cases,
where I have one neuron feeding into one neuron, feeding into one neuron--
the example I gave last time. So you have tons of weights that are
really not contributing much. But within reason, if you have taken
general architectures that are reasonable, then the number of weights
is the operative quantity. MODERATOR: OK, we should quit. PROFESSOR: So, we'll see you
next week.