ANNOUNCER: The following program
is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we introduced neural
networks, and we started with multilayer perceptrons, and the idea
is to combine perceptrons using logical operations like OR's and AND's, in
order to be able to implement more sophisticated boundaries than the
simple linear boundary of a perceptron. And we took a final example, where we
were trying to implement a circle boundary in this case, and
we realized that we can actually do this-- at least approximate it-- if we have a sufficient
number of perceptrons. And we convinced ourselves that combining
perceptrons in a layered fashion will be able to implement
more interesting functionalities. And then we faced the simple problem
that, even for a single perceptron, when the data is not linearly separable,
the optimization-- finding the boundary based on data-- is a pretty difficult optimization
problem. It's combinatorial optimization. And therefore, it is next to hopeless
to try to do that for a network of perceptrons. And therefore, we introduced neural
networks that came in as a way of having a nice algorithm for multilayer
perceptrons, by simply softening the threshold. Instead of jumping from
-1 to +1, it goes from -1 to +1 gradually,
using a sigmoid function-- in this case, the tanh. And if the signal which is
given by this amount-- the usual signal that goes
into the perceptron-- is large, large negative
or large positive, the tanh approximates -1 or +1. So we get the decision
function we want. And if s is very small, this is almost
linear-- tanh(s) is approximately s. And the most important aspect about it
is that it's differentiable-- it's a smooth function, and therefore the dependency
of the error in the output on the parameters w_ij will be a well-behaved
function, for which we can apply things like gradient descent. And the neural network
looks like this. It starts with the input, followed by
a bunch of hidden layers, followed by the output layer. And we spent some time trying to argue
about the function of the hidden layers, and how they transform the
inputs into a particularly useful nonlinear transformation, as far as
implementing the output is concerned, and the question of interpretation. And then we introduced the
backpropagation algorithm, which is applying stochastic gradient descent
to neural networks. Very simply, it decides on the direction
along every coordinate in the w space, using the very simple
rule of gradient descent. And in this case, you only
need two quantities. One of them is x_i, that was implemented
using this formula, the forward formula, so to speak, going from
layer l minus 1 to layer l. And then there is another quantity
that we defined, which was called delta, that is computed backwards. You start from layer l, and then
go to layer l minus 1. And the formula is strikingly similar
to the formula in the forward direction, but instead of applying the
nonlinearity, you multiply by its derivative. And once you get all the delta's and x's
by a forward and a backward run, then you can simply decide on the move
in every weight, according to this very simple formula that involves
the x's and the delta's.
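For reference, since the slides themselves are not reproduced in this transcript, the three formulas being described are (in the notation of the previous lecture, with tanh as the nonlinearity):

$$ x_j^{(l)} = \tanh\Big( \sum_i w_{ij}^{(l)}\, x_i^{(l-1)} \Big) \quad \text{(forward)} $$

$$ \delta_i^{(l-1)} = \Big( 1 - \big(x_i^{(l-1)}\big)^2 \Big) \sum_j w_{ij}^{(l)}\, \delta_j^{(l)} \quad \text{(backward)} $$

$$ \Delta w_{ij}^{(l)} = -\,\eta\, x_i^{(l-1)}\, \delta_j^{(l)} \quad \text{(weight update)} $$

And the simplicity of the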
backpropagation algorithm, and its efficiency, are the reasons why neural
networks have become very popular as a standard tool of implementing functions
that need machine learning in industry, for quite some time now. Today, I'm going to start
a completely new topic. It's called overfitting, and it will
take us three full lectures to cover overfitting and the techniques
that go with it. And the techniques are very important,
because they apply to almost any machine learning problem that
you're going to see. And they are applied on top of any
algorithm or model you use. So you can use neural networks
or linear models, et cetera. But the techniques that we're going to
use here, which are regularization and validation, apply to all
of these models. So this is another layer of techniques
for machine learning. And overfitting is a very important topic. It is fair to say that
the ability to deal with overfitting is what separates professionals from
amateurs in machine learning. Everybody can fit, but if you know what
overfitting is, and how to deal with it, then you have an edge
over someone who doesn't know the fundamentals. So the outline today is: first, what is
the notion-- what is overfitting? And then we are going to identify
the main culprit for overfitting, which is noise. And after observing some experiments, we
will realize that noise covers more territory than we thought. There's actually another type of noise,
which we are going to call deterministic noise. It's a novel notion that is very
important for overfitting in machine learning, and we're going to
talk about it a little bit. And then, very briefly, I'm going to
give you a glimpse into the next two lectures by telling you how
to deal with overfitting. And then we will be ready, having diagnosed what the problem
is, to go for the cures-- regularization next time, and validation
the time after that. OK. Let's start by illustrating the
situation where overfitting occurs. So let's say we have a simple
target function. Let's take it to be a 2nd-order
target function, a parabola. So my input space is the real numbers. I have only a scalar input x. And there's a value y, and I have
this target that is 2nd-order. We are going to generate five data
points from that target, in order to learn from. This is an illustration. Let's look at the five data points. As you see, the data points look like
they belong to the curve, but they don't seem to belong perfectly
to the curve. So there must be noise, right? This is a noisy case, where
the target itself-- the deterministic part of the target
is a function, and then there is added noise. It's not a lot of noise, obviously--
very small amount. But nonetheless, it will
affect the outcome. So we do have a noisy
target in this case. Now, suppose I just give you the
five points, which is the case you face when you learn. The target disappears, you have five
points, and you want to fit them. Going back to your math, you realize:
I want to fit five points. Maybe a 4th-order
polynomial will do it, right? You have five parameters. So let's fit it with
a 4th-order polynomial. This is the guy who doesn't know
machine learning, by the way. So I say, I'm going to use
the 4th-order polynomial. And what will the fit look like? A perfect fit, in sample. And you measure your quantities. The first quantity is E_in. Success! We achieved zero training error.
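As a minimal sketch of this situation (the parabola, the noise level, and the input range are my assumptions-- the lecture does not specify them):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-ins for the lecture's example: a 2nd-order target
# plus a small amount of additive noise.
f = lambda x: 2 * x**2 - 1            # the parabola target
x = rng.uniform(-1, 1, 5)             # five data points
y = f(x) + 0.1 * rng.normal(size=5)   # noisy target values

# 4th-order polynomial: 5 parameters, 5 points -> exact interpolation.
g = np.polyfit(x, y, deg=4)
E_in = np.mean((np.polyval(g, x) - y) ** 2)   # essentially zero

# Out-of-sample: compare the fit to the target on fresh points.
x_test = rng.uniform(-1, 1, 10000)
E_out = np.mean((np.polyval(g, x_test) - f(x_test)) ** 2)  # huge
print(E_in, E_out)
```

And then when you go for the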
out-of-sample, you are comparing the red curve to the blue curve, and
the news is not good. I'm not even going to calculate
it, it's just huge. This is a familiar situation
for us, and we know what the deal is. The point I want to make here is that,
when you say overfitting, overfitting is a comparative term. It must be that one situation
is worse than another. You went further than you should have. And there is a distinction between
overfitting, and just bad generalization. So the reason I'm calling this
overfitting is because, if you use, let's say, 3rd-order polynomial, you
will not be able to achieve zero training error, in general. But you will get a better E_out. Therefore, the overfitting here happened
by using the 4th-order instead of the 3rd-order. You went further. That's the key. And that point is made even more clearly,
when you talk about neural networks and overfitting
within the same model. In the case of overfitting with
3rd-order polynomial versus 4th-order polynomial, you are comparing
two models. Here, I'm going to take just neural
networks, and I'll show you how overfitting can occur within
the same model. So let's say we have a neural network,
and it is fitting noisy data. That's a typical situation. So you run your backpropagation
algorithm with a number of epochs, and you plot what happens to E_in,
and you get this curve. Can you see this curve at all? Let me try to magnify it, hoping
that it will become clearer. A little bit better. This is the number of epochs. You start from an initial condition,
a random vector. And then you run stochastic gradient
descent, and evaluate the total E_in at the end of every epoch,
and you plot it. And it goes down. It doesn't go to zero. The data is noisy. You don't have enough parameters
to fit it perfectly. But this looks like a typical situation,
where E_in goes down. Now, because this is an experiment, you
have set aside a test set that you did not use in training. And what you are going to do, you are
going to take this test set and evaluate what happens out-of-sample. Not only at the end, but as you go. Just to see, as I train, am I making
progress out-of-sample or not? You're definitely making
progress in-sample. So you plot the out-of-sample,
and this is what you get. So this is estimated by a test set. Now, there are many things you can say
about this curve, and one of them is, in the beginning when you start
with a random w, in spite of the fact that you're using a full
neural network, when you evaluate on this point, you have only one
hypothesis that does not depend on the data set. This is the random w that you got. So it's not a surprise that E_in and E_out
are about the same value here. Because they are floating around. As you go down the road, and start
exploring the weight space by going from one iteration to the next, you're
exploring more and more of the space of weights. So you are getting the benefit, or the
harm, of having the full neural network model, gradually. In the beginning here, you are only
exploring a small part of the space. So if you can think of an effective
VC dimension as you go, if you can define that, then there is
an effective VC dimension that is growing with time. After you have explored the whole
space-- or at least potentially explored the whole space, if you
had different data sets-- the effective VC
dimension will be the total number of free parameters in the model. So the generalization error, which is
the difference between the red and green curve, is getting
worse and worse. That's not a surprise. But there is a point, which is
an important point here, which happens around here. Let me now shrink this back, now that
you know where the curves are. And let's look at where
overfitting occurs. Overfitting occurs when you knock down
E_in, so you get a smaller E_in, but E_out goes up. If you look at these curves, you will
realize that this is happening around here. Now there is very little, in terms of the
difference of generalization error, before the blue line and
after the blue line. Yet I am making a specific distinction,
that crossing this boundary went into overfitting. Why is that? Because up till here, I can always
reduce the E_in, and in spite of the fact that E_out is following suit with
very diminishing returns, it's still a good idea to minimize E_in. Because you are getting smaller E_out. The problems happen when you cross,
because now you think you're doing well, you are reducing E_in, and you are
actually harming the performance. That's what needs to be taken care of. So that's where overfitting occurs. In this situation, it might be a very
good idea to be able to detect when this happens, and simply stop at that
point and report that, instead of reporting the final hypothesis
you will get after all the iterations, right? Because in this case, you're going to get
this E_out instead of that E_out, which is better. And indeed, the algorithm that goes with
that is called early stopping. And it will be based on validation. And although it's based on validation,
it really is a regularization, in terms of putting on the brakes.
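Here is a minimal sketch of the idea. The train_one_epoch and evaluate callables are hypothetical placeholders supplied by the caller; the lecture does not specify an implementation.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, max_epochs):
    """Keep the weights that did best on a held-out validation set.

    train_one_epoch(model) runs one pass of SGD in place (hypothetical);
    evaluate(model) returns the validation error, an estimate of E_out.
    """
    best_error, best_model = float("inf"), copy.deepcopy(model)
    for _ in range(max_epochs):
        train_one_epoch(model)
        val_error = evaluate(model)
        if val_error < best_error:        # still making out-of-sample progress
            best_error, best_model = val_error, copy.deepcopy(model)
    return best_model                     # report this, not the final iterate
```

So now we can see the relative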
aspect of overfitting. And overfitting can happen when you compare
two things, whether the two things are two different models, or two
instances within the same model. And we look at this and say that if
there is overfitting, we'd better be able to detect it, in order to stop
earlier than we would otherwise, because otherwise we will
be harming ourselves. So this is the main story. Now let's look at what is overfitting
as a definition, and what is the culprit for it. Overfitting, as a criterion,
is the following. It's fitting the data more
than is warranted. And this is a little bit strange. What would be more than is warranted? I mean, we are in machine learning. We are the business of fitting data. So I can fit the data. I keep fitting it. But there comes a point, where
this is no longer good. Why does this happen? What is the culprit? The culprit, in this case, is that you're
actually fitting the noise. The data has noise in it, and you are
trying to look at the finite sample set that you got, and you're
trying to get it right. In trying to get it right, you are
inadvertently fitting the noise. This is understood. I can see that this is not good. At least, it's not useful at all. Fitting the noise, there's no pattern to
detect in the noise, so fitting the noise cannot possibly
help me out-of-sample. However, if it was only just useless,
we would be OK. We wouldn't be having this lecture. Because you think: I am given
the data, and the data has the signal and the noise. I cannot distinguish between them. I just get x and get y.
y has a component which is a signal, and a component which is noise, but I get just
one number. I cannot distinguish between the two. And I am fitting them. And now I'm going to fit the noise. Let's look at it this way. I'm in the business of fitting. I cannot distinguish the two. Fitting the noise is the
cost of doing business. If it's just useless, I wasted some
effort, but nothing bad happened. The problem really is
that it's harmful. It's not a question of being useless,
and that's a big difference. Because machine learning
is machine learning. If you fit the noise in-sample, the
learning algorithm gets a pattern. It imagines a pattern, and extrapolates
that out-of-sample. So based on the noise, it gives you something
out-of-sample and tells you this is the pattern in the data, obviously,
which it isn't. And that will obviously worsen your
out-of-sample, because it's taking you away from the correct solution. So you can think of the learning
algorithm in this case, when it detects a pattern that doesn't exist, as hallucinating. Oh, there's a great pattern, and this is
what it looks like, and it reports it, and eventually, obviously that
imaginary thing ends up hurting the performance. So let's look at a case study. And the main reason for the case study,
because we vaguely now understand that it's a problem of the noise, so
let's see how does the noise affect the situation? Can we get overfitting without noise? What is the deal? So I'm going to give you
a specific case. I'm going to start with
a 10th-order target. 10th-order target means
10th-order polynomial. I'm always working on
the real numbers. The input is a scalar, and I'm
defining polynomials based on that, and I'm going to take
10th-order target. The 10th-order target, one
of them looks like this. You choose the coefficient somehow,
and you get something like that. A fairly elaborate thing. And then you generate data, and the data
will be noisy, because we want to investigate the impact of
noise on overfitting. Let's say I'm going to generate
15 data points in this case. So this is what you get. Let's look at these points. The noise here is not trivial
the way it was last time. There's a difference. Obviously,
these are not lying on the curve. So there is a noise that is
contributing to that. Now the other guy, which is a 50th order, is noiseless. That is, I'm going to generate
a 50th-order polynomial, so it's obviously much more elaborate than the
blue curve here, but I'm not going to add noise to it. I'm going to generate also 15 points
from this guy, but the 15 points, as you will see, perfectly
lie on the curve. This is all of them here. So this is the data, this
is the target, and the data lies on the target. These are two interesting cases. One of them is a simple
target, so to speak. Added noise, that makes it complicated. This one is complicated
in a different way. It's a high-order target to begin
with, but there is no noise. These are the two cases that I'm
going to try to investigate overfitting in. We are going to have two different
fits for each target. We are in the business of overfitting. We have to have comparative models. So I'm going to have two models
to fit every case. And see if I get overfitting
here, and I get it here. This is the first guy
that we saw before. The simple target with noise. And this guy is the other one, which is
the complex target without noise. 10th-order, 50th-order. We'll just refer to them as a noisy
low-order target, and a noiseless high-order target. This is what we want to learn. Now, what are we going to learn with? We're going to learn with two models. One of them is the same thing--
we have a 2nd-order polynomial that we're going to use to
fit. That's our model. And we're going to have
a 10th-order polynomial These are the two guys that
we are going to use. Here's what happens with
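A sketch of this setup, with assumed stand-ins for the unspecified details (the target's coefficients, the noise level, and the input distribution):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_and_compare(f, sigma, N=15):
    """Fit H2 and H10 to N noisy points from target f; return the errors."""
    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.normal(size=N)
    x_test = rng.uniform(-1, 1, 100000)   # fresh points to estimate E_out
    results = {}
    for order in (2, 10):
        g = np.polyfit(x, y, deg=order)
        E_in = np.mean((np.polyval(g, x) - y) ** 2)
        E_out = np.mean((np.polyval(g, x_test) - f(x_test)) ** 2)
        results[order] = (E_in, E_out)
    return results

# An assumed 10th-order target with random coefficients; the lecture's
# exact target is not given.
f = np.polynomial.Polynomial(rng.normal(size=11))
print(fit_and_compare(f, sigma=0.3))
```

Here's what happens with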
the 2nd-order fit. You have the data points, and you fit
them, and it's not surprising. For the 2nd order, it's a simple
curve, and it tries to find a compromise. Here we are
applying mean squared error, so this is what you get. Now, let's analyze the performance
of this fellow. What I'm going to list here, as you see,
I'm going to say, what is the in-sample error, what is the out-of-sample
error, for the 2nd order which is already here, and the 10th order,
which I haven't shown yet. The in-sample error
in this case is 0.05. This is a number. Obviously, it depends
on the scale. It's some number. When you get the out-of-sample version,
not surprisingly, it's bigger, because this one fit the data. The other one is out-of-sample,
so it's going to be bigger. But the difference is not dramatic, and
this is the performance you get. Now let's apply the 10th-order fit. You can already foresee what
problem can arise here. The red curve sees the data, tries to
fit that, uses all the degrees of freedom it has-- it has 11 of them--
and then it gets this guy. And when you look at the in-sample
error, obviously the in-sample error must be smaller than
the in-sample error here. You have more to fit and you
fit it better, so you get smaller in-sample error. And what is out-of-sample error? Just terrible. So this is patently
a case of overfitting. When you went from 2nd order to
10th order, the in-sample error indeed went down. The out-of-sample error went up. Way up. So you say, this confirms
what we have said before. We are fitting the noise. And you can see here that you're
actually fitting the noise. You can see the red curve is trying to
go for these guys, and you know that these guys are off the target. Therefore, the red curve is bending
particularly, in order to capture something that is really noise. So this is the case. Here it's a little bit strange, because
here we don't have any noise. And we're going to take the same two models. We have 2nd order and 10th
order, fitting here. Let's see how they perform here. Well, this is the 2nd-order fit. Again, that's what you expect
from a 2nd-order fit. And you look at the in-sample error and
out-of-sample error, and they are OK-- ballpark fine. You get some error, and the other
one is bigger than it. Now we go for the 10th order, which
is the interesting one. This is the 10th order. You need to remember that the 10th
order is fitting a 50th order. So it really doesn't have enough
parameters to fit, if we had all the glory of the target function
in front of us. But we don't have all the glory
of the target function. We have only 15 points. So it does as good a job as
possible for fitting. And when we look at the in-sample error,
definitely the in-sample error is smaller than here. Because we have more. It's actually extremely small. It did it really, really, well. And then when you go for
the out-of-sample. Oh, no! You see, this is squared error. So these guys, when you go down
and when you go up, kill you. And indeed they did. So this is overfitting galore. And now you ask yourself, you just
told us about noise and no noise. This is noiseless, right? Why did we get overfitting here? We will find out that the reason
we are getting overfitting here is that this guy actually has noise. But it's not your usual noise.
important to understand the situations in practice, where you are going
to get overfitting. You could be facing a completely
noiseless, in the conventional sense, situation, and yet there is overfitting,
because you are fitting another type of noise. So let's look at the irony
in this example. Here is the first example-- the noisy simple target. So you are learning a 10th-order target,
and the target is noisy. And I'm not showing the target here, I'm
showing the data points together with the two fits. Now let's say that I tell you that
the target is 10th order, and you have two learners. One of them is O, and
one of them is R-- O for overfitting, and R is for
restricted, as it turns out. And you tell them, guys, I'm not going
to tell you what the target is, because if I tell you what
the target is, this is no longer machine learning. But let me help you out a little bit. The target is a 10th-order polynomial. And I'm going to give you 15 points. Choose your model. Fair enough? The information given does not
depend on the data set, so it's a fair thing. The first learner says, I know
that the target is 10th order. Why not pick a 10th-order model? Sounds like a good idea. And they do this, and they get the red
curve, and they cry and cry and cry! The other guy said, oh,
it's 10th-order model? Who cares? How many points do you have? 15. OK, 15. I am going to take a 2nd order, and I am
actually pushing my luck, because 2nd order is 3 parameters, I have
15 points, the ratio is 5. Someone told us a rule of thumb
that it should be 10. I'm flirting with danger. But I cannot use a line when you are
telling me the thing is 10th order, so let me try my luck with 2nd. That's what you do. And they win. So it's a rather interesting irony,
because there is a thought in people's mind that you try to get as much
information about the target function, and put it in the hypothesis set. In some sense this is true,
for certain properties. But if you are matching the complexity,
here the guy who actually took the 10th-order target, and decided
to put the information all too well in the hypothesis-- I'm taking a 10th-order hypothesis set-- lost. So again, we know all too well now.
The question is, you match the data resources, rather than the
target complexity. There will be other properties
of the target function, that we will take to heart. Symmetry and whatnot, there are a bunch
of hints that we can take. But the question of complexity is not
one of the things that you just apply the general idea of: let me match
the target function. That's not the case. In this case, you are looking at
generalization issues, and you know that generalization issues depend
on the size and the quality of the data set. Now, the example that I just gave you, we
have seen it before when we introduced learning curves, if you remember
what those were. Those were plots of how E_in and E_out change with the number of examples. And I gave you
something, and I told you that this is an actual situation we'll see later,
and this is the situation. So this is the case where you take the
2nd-order polynomial model, H_2, and the inevitable error, which is the black
line, comes now not only from the limitations of the model--
an inability for a 2nd order to replicate a 10th order, which is the
target in this case-- but also because there is noise added. Therefore, there's an amount
of error that is inevitable because of the noise. But the model is very limited. The generalization is not bad,
which is the difference between the two curves. And if you have more examples, the two
curves will converge, as they always do, but they converge to the inevitable
amount of error, which is dictated by the fact that you're using
such a simple model in this case. And when we looked at the other case,
also introduced at the time-- this was the 10th-order fellow. So for the 10th-order fellow, you can
fit a lot, so the in-sample error is always smaller than here. That is understood. The out-of-sample error starts by
being terrible, because you are overfitting. And then it goes down, and it converges
to something that is better, because that carries the ability
of H_10 to approximate a 10th order, which should be
perfect, except that we have noise. So all of this actually is due to
the noise added to the examples. And the gray area is the interesting
part for us. Because in the gray area, the in-sample
error for the more complex model is smaller. It's smaller always, but we
are observing it in this case. And the out-of-sample error is bigger. That's what defines the gray area. Therefore in this gray area,
very specifically, overfitting is happening. If you move from the simpler model to
the bigger model, you get better in-sample error and worse
out-of-sample error. Now we realize that this guy is not going
to lose forever. The guy who chose the correct complexity is
not going to lose forever. They lost only because of the number
of examples that was inadequate. If the number of examples is adequate,
they will win handily. Like here-- if you look here, you end
up with an out-of-sample error far better than you would ever get here. But now I have enough examples,
in order to be able to do that. Now, we understand overfitting. And we understand that overfitting will
not happen for all the numbers of examples, but for a small number of
examples where you cannot pin down the function, then you suffer from the usual
bad generalization that we saw. Now, we notice that we get overfitting
even without noise, and we want to pin it down a little bit. So let's look at this case. This is the case of the 50th-order
target, the higher-order target that doesn't have any noise-- conventional noise, at least. And these are the two fits. And there's still an irony, because
here are the two learners. The first guy chose the 10th order, the
second guy chose the 2nd order. And the idea here is the following. You told me that the target
now doesn't have noise. Right? That means I don't worry
about overfitting. Wrong. But we'll know why. So given the choices, I'm going to try
to get close to the 50th order, because I have a better chance. If I choose the 10th order, someone
else chooses 2nd order, I'm closer to the 50th, so I think
I will perform better. At least that's the concept. So you do this, and you know that there
is no noise, so you decide on this idea, and again you
get bad performance. And you ask yourself,
this is not my day. I tried everything, and I seem
to be making the wise choice, and I'm always losing. And why is this the case,
when there is no noise? And then you ask, is there
really no noise? And that will lead us to defining
that there is an actual noise in this case, and we'll analyze it and
understand what it is about. So I will take these two examples, and
then make a very elaborate experiment. And I will show you the results
of that experiment. I encourage you, if you
are interested in the subject, to simulate this experiment. All the parameters are given.
for overfitting, because now we are going to look at the figure, and have no
doubt in our mind that overfitting will occur whenever you actually
encounter a real problem. And therefore, you have to be careful. It's not like I constructed a particular
funny case. No, if you average over a huge
number of experiments, you will find that overfitting occurs in the
majority of the cases. So let's look at the detailed
experiment. I'm going to study the impact of two
things-- the noise level, which I have already conceptually convinced myself
is related to overfitting, and the target complexity, just because
it does seem to be related. Not sure why, but it seems like when
I took a complex target, albeit noiseless, I still got overfitting,
so let me see what the target complexity does. We are going to take, as
general target function-- I'm going to describe what it is, and
I'm going to add noise to it. The noise is a function of x. So I'm just getting it generically, and
as always, we have independence from one x to another. In spite of the fact that the
parameters of the noise distribution depend on x-- I can have different
noise for different points in the space-- the realization of epsilon is
independent from one x to another. That is always the assumption.
When we have different data points, they are independent. So this is the thing, and I'm going to
measure the level of noise by the energy in that noise, and we're going
to call it sigma squared. I'm taking the expected value
of epsilon to be 0. If there were a nonzero expected value, I would
put it in the target, so I am left with 0. And then there's fluctuation around it,
and the fluctuation either could be big, large sigma
squared, or small. And I'm quantifying it
with sigma squared. No particular distribution is needed. You can say Gaussian, and indeed I applied Gaussian
in the experiment. But for the statement, you just
need the energy of that. Now let's write it down. I want to make the target function
more complex, at will. So I'm going to make it
higher-order polynomial. Now I have another parameter, pretty
much like the sigma squared. I have another parameter which is
capital Q, the order of the polynomial. I'm calling it Q_f, because it describes
the target complexity of f, just to remember that
it's related to f. And what I do, I define a polynomial,
which is a sum of coefficients times powers of x, from q equals 0 to Q_f,
so it's indeed a polynomial of order Q_f, and I add the noise to it.
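Written out, since the slide is not reproduced here, the data is generated as

$$ y \;=\; f(x) + \epsilon(x) \;=\; \sum_{q=0}^{Q_f} \alpha_q\, x^q \;+\; \epsilon(x), \qquad \mathbb{E}\big[\epsilon(x)\big] = 0, \quad \mathbb{E}\big[\epsilon(x)^2\big] = \sigma^2. $$

Now, in order to run the experiment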
right, I'm going to normalize this quantity, such that the energy
here is always 1. And the reason I do that is
because I want the sigma squared to mean something. The signal-to-noise ratio is
what really means something. So if I normalize the signal to energy
1, then I can say sigma squared is really the amount of noise. And if you look at this, it is not
easy to generate interesting polynomials using this formula. Because if you pick these guys at
random-- let's say independent coefficients at random, in order
to generate a general target, these guys are the powers of x. So you start with the x, and then the
parabola, and then the 3rd order, and then the 4th order, and
then the 5th order. Very, very boring guys. One of them goes this way, and the
other one goes that way, and they get steeper and steeper. So if you combine them with random
coefficients, you will almost always get something that looks this way, or
something that looks this way. And the other guys don't play a role,
because this one dominates. The way to get interesting guys here
is, instead of generating the alpha_q's here as random, you go for
a standard set of polynomials, which are called Legendre polynomials. Legendre polynomials are just
polynomials with specific coefficients. There is nothing mysterious about them,
except that the choice of the coefficients is such that, from one
order to the next, they're orthogonal to each other. So it's like harmonics in
a sinusoidal expansion. If you take the 1st-order Legendre,
then the 2nd, and the 3rd, and the 4th, and you take the inner product,
you see they are 0. They are orthogonal to each other, and
you normalize them to get energy 1. Because of this, if you have
a combination of Legendre's with random coefficients, then you get
something interesting. All of a sudden, you get the shape. And when you are done, it
is just a polynomial. All you do, you collect the guys that
happen to be the coefficients of x, the coefficients of x squared,
coefficients of x cubed, and these will be your alpha's. Nothing changed in the fact
that I'm generating a polynomial. I was just generating the alpha's in
a very elaborate way, in order to make sure that I get interesting targets. That's all there is to it. As far as we are concerned, we generated
guys that have this form and happen to be interesting--
representative of different functionalities.
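Here is a sketch of that construction. The exact normalization used in the lecture's experiment is an assumption here; numpy's legval evaluates a linear combination of Legendre polynomials.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(2)

def random_target(Q_f):
    """A random order-Q_f target: a combination of Legendre polynomials
    with independent standard-normal coefficients, normalized so the
    signal has roughly unit energy on [-1, 1]."""
    coeffs = rng.normal(size=Q_f + 1)   # one per Legendre order 0..Q_f
    x = rng.uniform(-1, 1, 100000)
    scale = np.sqrt(np.mean(legendre.legval(x, coeffs) ** 2))
    return lambda x: legendre.legval(x, coeffs) / scale

f = random_target(20)   # e.g., a 20th-order target
```

So in this case we have the noise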
level. That's one parameter that affects overfitting. We have potentially-- the target complexity seems to
be affecting overfitting. At least we are conjecturing
that it is. And the final guy that affects
overfitting is the number of data points. If I give you more data points, you are
less susceptible to overfitting. Now I'd like to understand the
dependency between these. And if we go back to the experiment we
had, this is just one instance of those, where the target complexity
here is 10. I use the 10th-order polynomial,
so Q_f is 10. The noise is whatever the distance
between the points and the curve is. That's what captures sigma squared. And the data size here is 15. I have 15 data points. So this is one instance. I'm basically generating at will random
instances of that, in order to see if the observation of
overfitting persists. Now, how am I going
to measure overfitting? I'm going to define an overfit
measure, which is a pretty simple one. We're fitting a data set
from x_1, y_1 to x_N, y_N. And we are using two models, our usual two models. Nothing changed. We either use 2nd-order polynomials,
or the 10th-order polynomials. And if going from the 2nd-order
polynomial to the 10th-order polynomial gets us in trouble,
then we are overfitting. And we would like to quantify that. When you compare the out-of-sample
errors of the two models, you have a final hypothesis from H_2,
and this is the fit-- the green curve that you have seen. And another final hypothesis from the
other model, which is the red curve-- the wiggly guy. If you want to define an overfit
measure based on the two, what you do is you take the out-of-sample error for
the more complex guy, minus the out-of-sample error
for the simple guy. Why is this an overfit measure? Because if the more complex guy is
worse, it means its out-of-sample error is bigger, and you
get a positive number-- large positive if the overfitting
is terrible. And if this is negative, it means that
the more complex guy is actually doing better, so you are not overfitting. Zero means that they are the same.
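Reusing the fit_and_compare and random_target sketches from above, the overfit measure is just a difference of the two out-of-sample errors:

```python
def overfit_measure(results):
    """E_out of the H10 fit minus E_out of the H2 fit.

    Positive: the more complex model hurt you (overfitting).
    Negative: the more complex model actually did better.
    """
    E_out_2 = results[2][1]
    E_out_10 = results[10][1]
    return E_out_10 - E_out_2

print(overfit_measure(fit_and_compare(random_target(20), sigma=0.3)))
```

So now I have a number in my mind that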
measures the level of overfitting in any particular setting. And if you apply this to, again, the
same case we had before, you look here, and the out-of-sample error
for the red is terrible. The out-of-sample error of the green
is nothing to be proud of, but definitely better. And the overfit measure in this case
will be positive, so we have overfitting. Now let's look at the result of
running this for tens of millions of iterations. Not epochs-- complete runs. Generate the target, generate
the data set, fit both, look at the overfit measure. Repeat 10 million times, for
all kinds of parameters. So you get a pattern for
what is going on. This is what you get. First, the impact of sigma squared. I'm going to have a plot
in which you get N, the number of examples, and the level of noise,
sigma squared. And on the plot, I'm going to give
a color depending on the intensity of the overfit. That intensity will depend on
the number of points and the level of the noise that you have. And this is what you get. First let's look at the
color convention. So 0 is green. If you get redder, there's
more overfitting. If you get bluer, there
is less overfitting. Now I looked at the number
of examples, and I picked an interesting range. If you look, this is 80,
100, and 120 points. So what happens at 40? It's all dark red. Terrible overfitting. And if you go beyond that, you
have enough examples now not to overfit, so it's almost all blue. So I'm just giving you the
transition part of it. You look at it. There is a noise level. As I increase the noise level,
overfitting worsens. Why is that? Because pick any number
of examples, let's say 100. If I had 100 points with that little
noise, I'm doing fine. Doing fine in terms of
not overfitting. And as I go, I get into the red
region, and then I get deeply into the red region. So this tells me, indeed,
that overfitting worsens with sigma squared. By the way, for all of the targets here,
I picked a fixed complexity. 20. 20th-order polynomial. I fixed it because I just wanted
a number, and I wanted only to relate the noise to the overfitting. So that's what I'm doing here. When I change the complexity,
this will be the other plot. For this guy, we get something that
is nice, and it's really according to what we expect. As you increase the number of points,
the overfitting goes down. As you increase the level of noise,
the overfitting goes up. That is what we expect. Now let's go for the impact of Q_f,
because that was the mysterious part. There was no noise and we
are getting overfitting. Is this going to persist? What is the deal? This is what you get. So here, we fixed the level of noise. We fixed it at sigma
squared equals 0.1. Now we are increasing the target
complexity, from trivial to 100th-order polynomial. That's a pretty serious guy. And we are plotting the same range for
the number of points, from 80, 100, 120. That's where it happens. And you can see that overfitting
occurs significantly. And it worsens also with
the target complexity. Because let's say, you
look at this guy. If you look at this guy, you are here
in the green, and it gets red, and then it gets darker red. Not as pronounced as in the other case. But you do get the overfitting effect
by increasing the target complexity. And when the number of examples is
bigger, then there's less overfitting, as you expect it to be. But if you go high enough-- I guess it's getting lighter blue,
green, yellow. Eventually, it will get to red. And if you look at these two guys, the
main observation is that the red region is serious. Overfitting is real and here to stay,
and we have to deal with it. It's not like an individual
case there. Now, there are two things you can
derive from these two figures. The first thing is that there seems
to be another factor, other than conventional noise-- let's call it
conventional noise for the moment-- that affects overfitting. And we want to characterize that. That is the first thing we derive. The second thing we derive is
a nice logo for the course! That's where it came from. So now let's look at noise, and
look at the impact of noise. And you can notice that noise is
between quotation marks here, because now we're going to expand our horizon
about what constitutes noise. Here are the two guys. And in the first case, we are going
now to call it stochastic noise. Noise is stochastic, but obviously
we are calling it stochastic because the other guy will
not be stochastic. And there's absolutely
nothing to add here. This is what we expect. We're just giving it a name. Now, whatever effect
is caused by having a more complex target here, we are
also going to call noise. But it is going to be called
deterministic noise. Because there is nothing
stochastic about it. There's a particular target function. I just cannot capture it, so
it looks like noise to me. And we would like to understand what
deterministic noise is about. Now, if you look at it, speaking
in terms of stochastic noise and deterministic noise, you would
like to see what affects overfitting. So, we put it in a box. First observation: if I have more points, I
have less overfitting. If you move from here to
here, things get bluer. If you move from here to here,
things get bluer. I have less overfitting. Second thing: if I increase the stochastic noise-- increase the energy in the
stochastic noise-- the overfitting goes up. Indeed, if I go from here to
here, things get redder. And finally, with deterministic noise,
which is vaguely associated in my mind with the increase of target complexity,
I also increase the overfitting. If I go from here to here,
I am getting redder. Albeit I have to travel further, and
it's a bit more subtle, but the direction is that I get more
overfitting as I get more deterministic noise, whatever
that might be. So now, let's spend some time just
analyzing what deterministic noise is, and why it affects overfitting
the way it does. Let's start with the definition. What is it? It actually will be noise. If I ask you what
the stochastic noise is, you will say: here's my target, and
there is something on top of it. That is what I call stochastic noise. So the deterministic noise will be
the same thing, except that it captures something deterministic. It's the part of the target that your
hypothesis set cannot capture. So let's look at the picture. Here is the picture. This is your target, the blue guy. You take a hypothesis set-- let's
say a simple one-- and you look for the guy that best approximates f. Not in the learning sense. You actually try very hard to find
the best possible approximation. You're still not going to get f,
because your hypothesis set is limited, but the best guy will be
sitting there, and it will fail to pick up a certain part of the target. And that is the part we are labeling
the deterministic noise.
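In symbols, since the picture is not reproduced here: if h* is the best approximation to f within the hypothesis set H, the deterministic noise at a point x is

$$ f(x) - h^*(x), $$

the part of the target value that H cannot express. And if you think from an operational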
point of view, if you are that hypothesis, noise is all the same. It's something I cannot capture. Whether I couldn't capture it, because
there's nothing to capture-- as in stochastic noise-- or I couldn't capture it, because I'm
limited in capturing, and this I have to consider as out of my league. Both of them are noise, as
far as I'm concerned. Something I cannot deal with. This is how we define it. And then we ask, why are
we calling it noise? It's a little bit of
a philosophical issue. But let's say that you have
a young sibling-- your kid brother-- who has just learned fractions. So they used to have just
1, 2, 3, 4, 5, 6. They're not even into negative numbers,
and they learn fractions, and now they're very excited. They realize that there's more
to numbers than just 1, 2, 3. So you are the big brother. You are big Caltech guy. So you must know more about numbers. They come ask you, tell
me more about numbers. Now, in your mind, you can probably
explain negative numbers to them a little bit, as a deficiency. Real numbers-- just intuitively
continuous. You are not going to tell them about limits,
or anything like that. They're too young for that. But you probably are not going to tell
them about complex numbers, are you? Because their hypothesis set is so
limited that complex numbers, for them, would be completely noise. And the problem with explaining something
that people cannot capture is that they will create a pattern
that really doesn't exist. And then you tell them about complex numbers,
and they can't really comprehend them, but they got the notion. So now it's noise. They fit the
noise, and they come tell you: is 7.34521 a complex number? Because in their minds-- they just went off on a tangent. So you're better off just
killing that part. And giving them a simple thing that they
can learn, because the additional part will actually mislead them. Mislead them, as in noise. So this is our idea, that if I have
a hypothesis set, and there is part of the target that I cannot capture,
there's no point in trying to capture it, because when you try to capture it,
you are detecting a false pattern that you cannot extrapolate,
given your limitations. That's why it's called noise. Now the main differences between
deterministic noise and stochastic noise-- a realization of either one can be
plotted-- but the main differences are, the
first thing is that deterministic noise depends on your hypothesis set. For the same target function, if you
use a more sophisticated hypothesis set, the deterministic noise will be
smaller, because you were able to capture more. Obviously, the stochastic
noise will be the same. Nothing can capture it, so all
hypothesis sets are the same. We cannot capture it, and
therefore it's noise. The other thing is that, if I give you
a particular point x, deterministic noise is a fixed amount, which is the
difference between the value of the target at that point and the best
hypothesis approximation you have. If I gave you stochastic noise, then
you are generating this at random. And if I give you two instances
of x, the same x, the noise will change from one
occurrence to another, whereas here, it's the same. Nonetheless, they behave exactly the
same for machine learning, because invariably we have a given data set. Nobody changes the x's on us
and gives us another realization. We just have the x's given to
us together with the labels. So this doesn't make
a difference for us. And we settle on a hypothesis set. Once you settle on a hypothesis set, the
deterministic noise is as bad as the stochastic noise. It's something that we cannot capture,
and it depends only on something that we have already fixed, so
effectively it's fixed as well. So in a given learning situation,
they behave the same. Now, let's see the impact
on overfitting. This is what we have seen before. This is the case where we have
increasing target complexity, so increasing deterministic noise in the
terminology we just introduced, and the number of points. And red means
overfitting, so this is how much overfitting is there. And we are looking at deterministic
noise, as it relates to the target complexity. Because the quantitative thing
we had is target complexity. We defined what a realization of
deterministic noise is, but it's not yet clear to us what quantity we should
measure out of deterministic noise in order to tell us the
level of noise that results in overfitting. We have that quantity in the case of
stochastic noise very easily. We just take the energy of it. So here we realize that as you increase
the target complexity, the deterministic noise increases, and
the overfitting phenomenon that we observe increases. But you'll notice there's something
interesting here. It doesn't start until you get to 10. Because this was overfitting of what? The 10th order versus the 2nd order. So if you're going to start having
deterministic noise, you'd better go above 10, so that there is something
that you cannot approximate. This is the part where it's there. So here, I wouldn't say proportional, but it
definitely increases with the target complexity, and it decreases
with N as we expect. Now for the finite N, you suffer the
same way you suffer from the stochastic noise. We have declared that deterministic noise
is the part that your hypothesis set cannot capture. So what is the problem? If I cannot capture it, it won't
hurt me, because when I try to fit, I won't capture it anyway. No. You cannot capture it in its entirety. But if I give you only a finite sample,
then you only get a few points, and you may be able to capture
a little bit of the stochastic noise, or the deterministic
noise in this case. Again, compare 10 points with a million points: if you give me a million points,
then even if there is stochastic noise, there's nothing I can do to
capture the noise. Let me remind you of the example
we gave in linear regression. We took linear regression and said,
let's say that we are learning a linear function. So linear regression would
be perfect in this case. This is the target. And then we added noise to the examples,
so instead of getting the points perfectly on that line,
you get points right or left. And then we tried to use linear
regression to fit it. If you didn't have any noise,
linear regression would be perfect in this case. Now, since there's noise, and it doesn't
really see the line-- it only sees those guys, it eats a little bit
into the noise, and therefore gets deviated from the target. And that is
why you are getting worse performance than without the noise. Now, if I have 10 points, linear
regression will have easy time eating into that, because there
isn't much to fit. There are only 10 guys, and maybe
there's some linear pattern there. If I get a million points, the chances
are I won't be able to fit any of them at all, because they are noise all
over the place, and I cannot find a compromise using my few parameters, and
therefore I will end up really not being affected by them. In the infinite case, I
cannot get anything. They are noise, and I cannot fit them.
They are out of my ability. But the problem is that once you have
a finite sample, you're given the unfortunate ability to be able to fit
the noise, and you will indeed fit it. Whether it's stochastic-- where it just doesn't make sense-- or
deterministic-- where there is no point in fitting it, because you know that within your
hypothesis set there is no way to generalize it out-of-sample. It is out of your ability. So the problem here is that for the
finite N, you get to try to fit the noise, both stochastic
and deterministic. Now, let me go quickly through
a quantitative analysis that will put deterministic noise and stochastic noise
in the same equation, so that they become clear. Remember bias-variance? That was a few lectures ago. What was that about? We had a decomposition of the expected
out-of-sample error into two terms. And this is the expected value of
out-of-sample error. Remember, this is the hypothesis we get, and it
depends on the data set we were given. We compare it to the target function,
and we get the expected value with respect to those. And that ended up being a variance,
which tells me how far I am from the centroid within the hypothesis set, and
that means that there's a variety of things I get based on D. And the other one is how far the
centroid is from the target, which tells me the bias of my hypothesis
set from the target. And the leap of faith we had is that
this quantity, which is the average hypothesis you get over all data sets,
is about the same as the best hypothesis in the hypothesis set. So we had that. And in this case, f was noiseless
in this analysis. Now, I'd like to add noise to the
target, and see how this decomposition will go, because this will give us
a very good insight into the role of stochastic noise versus
deterministic noise. So we add noise. And we're going to plot it red, because
we want to pay attention to it, and because we are going
to get the expected values with respect to it. So y now is the realization,
the target plus epsilon. And I'm going to assume that the
expected value of the noise is 0. Again, if the expected value is
something else, we put that in the target, and leave the part which is
pure fluctuation outside, and call that epsilon. Now I would like to
repeat the analysis, more quickly, obviously, with the added noise. Here is the noise term. First, this is what we started with. So I'm comparing what you get in your
hypothesis, in a particular learning situation, to the target. But now the target is noisy. So the first thing is to replace
this fellow by the noisy version, which is y. I know that y is f of
x plus the noise. That's what I'm comparing to. And now, because y depends on the
noise, I'm not only getting the averaging with respect to the data set,
I'm also getting the average with respect to the realization
of the noise. So I'm getting expected value
with respect to D and epsilon-- epsilon affecting y. So you expand this, and this is just rewriting it. f of x
plus epsilon is y, so I'm writing it this way. And we do the same thing we did before,
but just carrying this around, until we see where it goes. So what did we do? We added and subtracted the centroid-- the average hypothesis, remember-- in preparation for getting squared
terms, and cross terms. And here we have the epsilon
added to the mix. And then we write it down. And in the first case, we get the
squared, so we put these together and put them as squared. We take these two guys together,
and put them as squared. And this guy by itself,
we put it as squared. We will still have cross terms, but
these are the ones that I'm going to focus on. And then we have more cross terms
than we had before, because there's epsilon in it. But the good news is that, if you get
the expected value of the cross terms, all of them will go to 0. The ones that used to go
to 0 will go to 0. The other ones will go to 0, because the
expected value of epsilon goes to 0, and epsilon is independent of
the other random thing here, which is the data set. The data set is generated, its noise is generated, and epsilon is generated on the test point
x, independently-- and therefore you will get 0. So it's very easy to argue that this is
0, and you will get basically the same decomposition with
this fellow added. So let's look at it. Well, we'll see that there
are actually two noise terms that come up. This is the variance term. This is the bias term. And this is the added term, which
is just sigma squared, the energy of the noise.
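Written out, since the slide is not reproduced here, the decomposition is

$$ \mathbb{E}_{D,\epsilon,x}\Big[ \big( g^{(D)}(x) - y(x) \big)^2 \Big] \;=\; \underbrace{\mathbb{E}_{D,x}\Big[ \big( g^{(D)}(x) - \bar{g}(x) \big)^2 \Big]}_{\textbf{var}} \;+\; \underbrace{\mathbb{E}_{x}\Big[ \big( \bar{g}(x) - f(x) \big)^2 \Big]}_{\textbf{bias}} \;+\; \underbrace{\mathbb{E}_{\epsilon,x}\Big[ \epsilon(x)^2 \Big]}_{\sigma^2} $$

where $\bar{g}$ is the average hypothesis over data sets. Let me just discuss this a little bit. We had the expected value with respect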
to D, and with respect to epsilon. And then, remember that we take the
expected value with respect to x, average over all the space, in order to
get just the bias and variance, rather than the bias of x-- of your
test point. So I did that already. So every expectation is with respect to
the data set, with respect to the input point, and with respect to the
realization of the noisy epsilon. But I'm keeping the guys that survive,
because the other guys-- epsilon doesn't appear here, so the
thing is constant with respect to it, so I take it out. Here, neither epsilon nor D appears,
so I just leave it, for simplicity. And here, D doesn't appear, but epsilon
and x appear, so I do it this way. I could put the more elaborate
notation, but I just wanted to keep it simple. Now, look at this decomposition. We have the move from your
hypothesis to the centroid, from the centroid to the target proper, and then
from the target proper to the actual output, which has
a noise aspect to it. So it's again the same thing of trying
to approximate something, and putting it in steps. Now if you look at the last quantity,
that is patently the stochastic noise. The interesting thing is that there
is another term here which corresponds to the
deterministic noise. And that is this fellow. That's another name for the bias. Why is that? Because our leap of faith told us
that this guy, the average, is about the same as the best hypothesis. So we are measuring how the best
hypothesis can approximate f. Well, this tells me the energy
of deterministic noise. And this is why it's deterministic
noise. And putting it this way gives
you the solid ground to treat them the same. Because if you increase the number of
examples, you may get a better variance. There are more examples,
so you don't float around fitting all of them. So the red region, which used to be the
variance, shrinks and shrinks. These guys are both inevitable. There is nothing you can do about this,
and there's nothing you can do about this given a hypothesis set. So these are fixed. But again, in the bias-variance,
remember the approximation was overall approximation. We took the entire target function,
and the entire hypothesis. We didn't look at particular
data points. We looked at approximation proper, and
that's why these are inevitable. You tell me what the hypothesis set is,
well, that's the best I can do. And this is the best I can do as far
as the noise, which is just not predicting anything in the noise. Now, both the deterministic noise and
the stochastic noise will have a finite version on the data points, and
the algorithm will try to fit them. And that's why this guy
gets a variety. Because depending on the particular fit of
those, you will get one or another. So these guys affect the variance, by
making the fit more susceptible to going in more places. Depending on what happens, I will go
this way and that way-- not because it's indicated by the target function
I want to learn, but just because there is a noise present in the sample
that I am blindly following, because I can't distinguish noise from signal,
and therefore I end up with more variety, and I end up with worse
variance and overfit. Now very briefly, I'm going
to give you a lead into the next two lectures. We understand what overfitting is, and we
understand that it's due to noise. And we understand that noise is in the
eye of the beholder, so to speak. There is stochastic noise, but there's
another noise which is not really noise, but depends on which hypothesis set looks at it. It looks like noise to some, and does not look like noise to others, and we call that deterministic noise. And we saw experimentally that
it affects overfitting. So how do we deal with overfitting? What does it mean to deal
with overfitting? We want to avoid it. We don't want to spend more energy
fitting, and get worse out-of-sample error, whether by choice of a model,
or by actually optimizing within a model, like we did with
neural networks. There are two cures. One of them is called regularization,
and that is best described as putting the brakes. So overfitting-- you are going,
going, going, going, and you hurt yourself. So all I'm doing here is, I'm
just making sure that you don't go all the way. And when you do that, I'm going
to avoid overfitting this way. The other one is called validation. What is the cure in this
case for overfitting? You check the bottom line, and make
sure that you don't overfit. It's a different philosophy. That is, the reason I'm overfitting is
because I'm going for E_in, and I'm minimizing it, and I'm
going all the way. I say, no, wait a minute. E_in is not a very good indication of what happens. Maybe there's another way to be able to
tell what is actually happening out of sample, and therefore avoid
overfitting, because you can check on what is happening in the real
quantity you care about. So these are the two approaches. I'll give you just an appetizer-- a very short appetizer for putting
the brakes-- the regularization part, which is the subject of the next lecture. Remember this curve? That's what we started with. We had the five points, we had the
4th-order polynomial, we fit, and we ended up in trouble. And we can describe this as free fit, that is, fit all you can. So fit all you can, five points, I'll
take 4th-order polynomial, go for it, I get this, and that's what happens. Now, putting the brakes means that
you're going to not allow yourself to go all the way, and you are going
to have a restrained fit. The reason I'm showing this is
because it's fairly dramatic. This curve is so incredibly bad that you think you really need to do something dramatic in order to avoid it. But here, what I'm going to do, I'm just
going to make you fit, and I'm actually going to make you fit
using a 4th-order polynomial. I'll give you that privilege. But I'm going to prevent you from
fitting the points perfectly. I'm going to put some
friction in it, such that you cannot get
exactly to the points. And the amount of brake I'm
going to put here is so minimal, it's laughable. When you go for your car service, they
measure the brake, and they tell you, oh, the brake is 70%, et cetera,
and then when it gets to 40%, they tell you you need to
do something about the brakes. The brakes here are about 1%. So if this were a car, you would be
braking here, and you would be stopping in Glendale! It's like completely ridiculous. But that little amount of brake
will result in this. Totally dramatic. Fantastic fit. The red curve is a 4th-order polynomial,
but we didn't allow it to fit all the way. And you can see that it's not fitting
all the way, because it really is not getting the points right. It's getting there, but not exactly.
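To make the restrained fit concrete, here is a minimal Python sketch: fit a 4th-order polynomial to five points, once freely and once with a tiny weight penalty playing the role of the brakes. The ridge-style penalty and the value lam = 0.01 are illustrative assumptions, not the exact scheme behind the slide.

```python
import numpy as np

# Five sample points from a noisy target (illustrative data).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 5)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(5)

# 4th-order polynomial features: columns 1, x, x^2, x^3, x^4.
Z = np.vander(x, N=5, increasing=True)

# Free fit: minimize the in-sample squared error alone (interpolates all 5 points).
w_free = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Restrained fit: add a tiny brake, lam * ||w||^2 (ridge style).
lam = 0.01
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(5), Z.T @ y)

print("free fit weights:      ", np.round(w_free, 2))
print("restrained fit weights:", np.round(w_reg, 2))
```

Even that laughable amount of friction keeps the weights from blowing up in order to pass through the points exactly.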
So we don't have to do much to prevent the overfitting. But we need to understand what regularization is, and how to choose it, et cetera. And this we'll talk about next time. And then the time after that, we're
going to talk about validation, which is the other prescription. I will stop here, and we will take
questions after a short break. Let's start the Q&A, and we'll start
with a question in house. STUDENT: So in the previous lecture, we spoke about stochastic gradient descent, and we said that we should choose points one by one, and move in the direction of the gradient of the error at that point. PROFESSOR: Negative
of the gradient, yes. STUDENT: So the question is, how important is it to choose the points randomly? I mean, can we choose them just from the list-- first point, second point, and so on? PROFESSOR: Yeah. Depending on the runs, it could be no
difference at all, or it could be a real difference. And the best way to think of
randomization in this case is that it's an insurance policy. There may be something about a fixed pattern that is detrimental in a particular case. You are always safe picking the points at random, because there's no chance that the random scheme will settle into a pattern, if you keep doing it. So in many cases, you just run through examples 1 through N, 1 through N, 1 through N, and you will be fine. In some cases, you take a random permutation. In some cases even, you stay true to picking each point at random, and you hope that every point will be represented about equally, in the long run. In my own experience, there is little
difference in a typical case. Every now and then, there's
a funny case. And therefore, you are safer using the stochastic presentation-- the random presentation of the examples-- in order not to fall into the trap in those cases.
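A minimal Python sketch of the three presentation schemes just mentioned-- fixed order, a fresh random permutation each epoch, and fully random picks. The per-example gradient grad_e and the learning rate eta are illustrative placeholders.

```python
import numpy as np

def sgd_epoch(w, X, y, grad_e, eta=0.1, order="permute", rng=None):
    """Run one SGD epoch under a chosen example-presentation scheme."""
    rng = rng or np.random.default_rng()
    N = len(y)
    if order == "fixed":          # 1 through N, always in the given order
        idx = np.arange(N)
    elif order == "permute":      # fresh random permutation each epoch
        idx = rng.permutation(N)
    else:                         # "random": N independent random picks
        idx = rng.integers(0, N, size=N)
    for i in idx:
        w = w - eta * grad_e(w, X[i], y[i])  # move along the negative gradient
    return w
```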
Yeah. There's another question in house. STUDENT: Hi, Professor. I have a question about slide 4. It's about neural networks. I don't understand-- how do you draw the
out-of-sample error on that plot? PROFESSOR: OK. In general, you cannot, obviously,
draw the out-of-sample error. If you could draw it, you would just pick it. This is a case where I give you
a data set, and you decide to set aside part of the data
set for testing. So you are not involving it
at all in the training. And what you do, you go about your
training, and at the end of every epoch, when you evaluate the in-sample
error on the entire batch, which is the green curve here, you also evaluate,
for that set of weights-- the frozen weights at the end of the
epoch-- you evaluate that on the test set, and you get a point. And because that point is not involved
in the training, it becomes an out-of-sample point, and that gives you the red point. And you go down. Now, there's an interesting tricky point
here, because if you decide at some point: maybe I'll look at the red curve, and I am going to stop where the red curve is at its minimum. STUDENT: Yes. PROFESSOR: OK? Now at that point, the set that used to be a test set is no longer a test set, because now it has just been involved in a decision regarding the training. It becomes slightly contaminated-- it becomes a validation set, which we're going to talk about when we talk about validation. But that is really the premise.
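A minimal sketch of the bookkeeping just described, with train_epoch and error as illustrative placeholders: at the end of each epoch, the frozen weights are scored both on the full training batch (the green curve) and on the held-out set (the red curve).

```python
def track_curves(w, train_epoch, error, train_set, test_set, n_epochs):
    """Record in-sample and held-out error at the end of every epoch."""
    E_in_curve, E_out_curve = [], []
    for _ in range(n_epochs):
        w = train_epoch(w, train_set)            # one pass of training updates
        E_in_curve.append(error(w, train_set))   # green point: entire batch
        E_out_curve.append(error(w, test_set))   # red point: frozen weights, held-out set
    return w, E_in_curve, E_out_curve
```

The moment you use E_out_curve to decide when to stop, the held-out set becomes a validation set in the sense just described.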
STUDENT: OK. I understand. Also, can I-- slide 16? PROFESSOR: Slide 16. STUDENT: I didn't follow that-- why are the two noises the same, for the same learning problem? PROFESSOR: They're the same in the sense that they are part of the outputs that I'm being given, or that I'm
trying to predict. And that part, I cannot predict
regardless of what I do. In the case of stochastic
noise, it's obvious. There's nothing to predict there,
so whatever I do, I miss it. In the case here, it's particular to
the hypothesis set that I have. So I take a hypothesis set, and in a non-learning scenario, I look at the target function and choose my best hypothesis-- the one we called h star here. If you look at the difference
between h star and f, the difference is a part which I cannot
capture, because the best I could do is h star. So the remaining part is what I'm
referring to as deterministic noise, and it is beyond my ability
given my hypothesis set. So that's why they are the same-- the same in the sense of being unreachable, as far as my resources are concerned.
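In symbols, following the definitions just given: with $h^*$ the best approximation to $f$ within the hypothesis set $\mathcal{H}$,

$$
h^* = \arg\min_{h \in \mathcal{H}} \; \mathbb{E}_x\Big[\big(h(x) - f(x)\big)^2\Big],
\qquad
\text{deterministic noise at } x = f(x) - h^*(x).
$$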
STUDENT: OK. In a real problem, do we know the complexity of the target function? PROFESSOR: In general, no. We also don't know the particulars of the
noise. We know that the problem is noisy, but we cannot identify the noise. We cannot, in most cases,
even measure the noise. So the purpose here is to understand
that, even in the case of a noiseless target in the conventional
sense, there is something that we can identify-- conceptually identify-- that does affect the overfitting. And even if we don't know the
particulars of it, we will have to put in place the guards, in order
to avoid overfitting. That was the goal here,
rather than try to-- Any time you see the target function drawn, an alarm bell should immediately go off that this is conceptual,
because you never actually see the target function in a real
learning situation. STUDENT: Oh. So, that's why the two
noises are equal, then. Because we don't know the target
function, so we don't know which part is deterministic. PROFESSOR: Yeah. If I knew the target,
and if I knew the noise, then the situation would be good, but
then I don't need machine learning. I already have that. STUDENT: Thank you. PROFESSOR: So we go for the
questions from the outside? MODERATOR: Yeah.
Quick conceptual question. Is it OK to say that the deterministic
noise is the part of reality that is too complex to be modeled? PROFESSOR: It is definitely
part of the reality-- that part. And basically, our failure to model it is what made it noise, as far as we are concerned. So obviously you can, in some sense,
model it by going to a bigger hypothesis set. The bigger hypothesis set will have
a closer h star to the target, and therefore the difference
will be small. But the situation pertains to the case
where you already chose the hypothesis set according to prescriptions of
VC dimension, number of examples, and other considerations. And given that hypothesis set, you
already concede that even if the target is noiseless, there is part of
it which behaves as noise, as far as I'm concerned. And I will have to treat it as such, when
I consider overfitting and the other considerations. MODERATOR: Also, is it fair to say that
over-training will cause overfitting? PROFESSOR: I think they
probably are synonymous. Overfitting is relative. Over-training would be relative within the same model, if I try to give it a definition: you over-train, meaning you have already settled on the model, and you train it too far. The neural network case would be over-training. The case of choosing the 3rd-order
polynomial versus the 4th-order polynomial will not really be
over-training, but it will be overfitting. It's all technicalities, but
just to answer the question. MODERATOR: Practically, when
you implement these algorithms, there's also some approximation, maybe due to floating-point numbers or something. So is this another source of error? Does it produce overfitting? Or is it-- PROFESSOR: Formally speaking, yes,
it's another source. But it is so minute with respect
to the other guys, that it's never mentioned. We have another in-house question. STUDENT: A couple of lectures ago, we
spoke about the third linear model, which is logistic regression. PROFESSOR: You said the third linear model? STUDENT: Yes. So the question is: suppose I have data which is completely linearly separable-- some points are marked -1, and some are +1, and there is a plane which separates them. Is it true that, applying this learning model, you never get stuck in a local minimum, and you get 0 in-sample error? PROFESSOR: OK. This is a very specific question
about logistic regression. If the thing is completely clean, then
you obviously can get closer and closer to having the probability
being perfect, by having bigger and bigger weights. So there is a minimum. And again, it's a unique minimum. Except that the minimum is
at infinity, in terms of the size of the weight. But this doesn't bother you, because you
are really going to stop at some point when the gradient is small,
according to your specification. And you can specify this
any way you want. So the goal is not necessarily to arrive at the minimum-- which hardly ever happens, even if the minimum is not at infinity-- but to get close enough, in the sense that the value is close to the minimum, and therefore you achieve the small error that you want.
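A minimal Python sketch of the behavior just described: gradient descent on the logistic-regression error for a tiny separable data set. The error keeps shrinking while the weight norm keeps growing, so the stopping rule is a small gradient, not the minimum itself. The data, learning rate, and threshold are illustrative.

```python
import numpy as np

# Tiny linearly separable data set (first coordinate is the bias).
X = np.array([[1.0, -2.0, -1.0],
              [1.0, -1.0,  0.5],
              [1.0,  1.5,  0.3],
              [1.0,  2.0,  1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = np.zeros(3)
eta = 0.5
for t in range(100_000):
    s = X @ w
    # Gradient of E(w) = (1/N) * sum ln(1 + exp(-y_n * w.x_n))
    g = -(X * (y / (1.0 + np.exp(y * s)))[:, None]).mean(axis=0)
    if np.linalg.norm(g) < 1e-3:   # stop when the gradient is small
        break
    w -= eta * g

E_in = np.mean(np.log(1.0 + np.exp(-y * (X @ w))))
print(f"stopped at t={t}: |w| = {np.linalg.norm(w):.2f}, E_in = {E_in:.4f}")
```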
MODERATOR: Can you resolve again the contradiction? When you increase the complexity of the model, you should be
reducing your bias, and hence your deterministic noise? So here we had an example where we had H-- well, H_10 had more error than H_2. PROFESSOR: H_10 had more total error than H_2. If we were doing the approximation game, H_10 would be better. We had three terms in the bias-variance-noise decomposition. If we were only going by these two, then there is no question that the bigger model, H_10, would win. Because this term is the same for all models, and this one will be better for H_10 than for H_2, because H_10 is closer to the target we want, and therefore we will be making a smaller error. This is not the source of the
problem of overfitting. This is just identifying terms in the
bias-variance decomposition, bias-variance-noise decomposition
in this case, that correspond to the different
types of noise. The problem of overfitting
happens here. And that happens because of
the finite-sample version of both. That is, I get N points in which there
is a contribution of noise coming from the stochastic and coming
from the deterministic. On those points, the algorithm will try to fit that noise, in spite of the fact that, if it knew better, it wouldn't, because it knows the noise is out of reach. But it gets a finite sample, and it can use its resources to try to fit part of that noise, and that is what is causing overfitting. And that ends up being harmful, and so
harmful in the H_10 case, that the harm offsets the fact that I'm closer
to the target function. That doesn't help me very much. Because-- same thing we said before-- let's say there's H_10, and the target function is sitting here. That doesn't do me much good if my algorithm, and the distraction of the noise, leads me to go in that direction. I will be further from the target function than another guy who, working only with the smaller model, remained within its confines and ended up being closer to the target function. It's the variance term that results in overfitting, not this guy, in spite of the fact that
these guys contain both types of noise contributing to their value. But their value is static. It doesn't change with N, and
it has nothing to do with the overfitting aspect. MODERATOR: In the case of polynomial
fitting, a way to avoid the overfitting could be to
use piecewise linear-- piecewise linear functions
around each point. So it is a method of regularization? Or is it-- PROFESSOR: OK. Depends on the number of degrees
of freedom you have. You can have piecewise linear,
which is really horrible-- something you can hardly even describe. It depends on how many pieces. If you have as many pieces as there
are points, you can see what the problem is. So it really is, what is the
VC dimension of your model? And I can take it-- if it's piecewise
linear, and I have only four parameters, then I don't worry too much
that it's piecewise linear. I only worry about the four
parameters aspect of it. The 10th-order polynomial was bad because of the 11 parameters, not because of any other factor. But anything you do to restrict your
model, in terms of the fitting, can be called regularization. And there are some good methods and
bad methods, but they are all regularization, in terms
of putting the brakes. MODERATOR: Some practical question is,
how do you usually get the profile of the out-of-sample error? Do you sacrifice points, or-- PROFESSOR: OK. This is obviously a good question. When we talk about validation--
validation has an impact on overfitting. It's used to do that. But it's also used in model
selection in general. And because of that, it's very tempting
to say, I'm going to use validation, and I'm going to set
aside a number of points. But obviously, the problem is that when
you set aside a number of points, you deprive yourself of a resource that you could have used for training, in order to arrive at a better hypothesis. So there's a tradeoff, and we'll
discuss that tradeoff in very specific terms, and find ways
to go around it, like cross-validation. But this will be the subject of the
lecture on validation, coming up soon. MODERATOR: In the example of the color
plots, here the order of the polynomial is a good indication
of the VC dimension, right? PROFESSOR: These are the plots. What is the question? MODERATOR: Here, Q_f is directly related
to the VC dimension, right? PROFESSOR: The target complexity
has nothing to do with the VC dimension. It's the target. I'm talking about different targets. The VC dimension has to do only with
the two fellows we are using. We are using H_2 and H_10, 2nd-order
polynomials and 10th-order polynomials. So if we take the degrees of freedom as a stand-in for the VC dimension, they will have different VC dimensions. And the discrepancy in the VC dimension, given the same number of examples, is the reason why we
have discrepancy in the out-of-sample error. But you also have a discrepancy
in the in-sample error. And the case of overfitting is such that
the in-sample error is moving in one direction, and the out-of-sample error is moving in another direction. So the only relevant thing in this plot
to the VC dimension is the fact that the two models have different
VC dimensions, H_2 and H_10. MODERATOR: I guess you never
really have a measure on the target complexity, like in practice? PROFESSOR: Correct. This was an illustration. And even in the case of the illustration,
when we had an explicit definition of the target complexity, it wasn't completely clear how to map this into the energy of the deterministic noise, a counterpart for sigma squared here. This one is completely clean, and as you can see, because of that, the plot is very regular. Here, first we defined this in a particular way, in order to be able to run an experiment. Second, even in those terms, it's not clear-- can you tell me what the energy of the deterministic noise is here? There's quite a bit of normalization that was done. So when we normalized the target in order to make sigma squared meaningful, we sacrificed something-- the target is now sandwiched into a limited range. And therefore the amount of energy of whatever the deterministic noise is will be limited, regardless of how complex the target is. So there is a compromise we had to make, in order to be able to produce these plots. However, the moral of the story here
is that there's something about the target complexity that behaved in the
same way, as far as overfitting is concerned, as noise. And we identified it as
deterministic noise. We didn't quantify it further. It is possible to quantify it. You can get the energy for this
and that, and you can do it. But these are research topics. As far as we are concerned, in a real
situation, we won't be able to identify either the stochastic noise
or the deterministic noise. We just know they're there. We know their impact on overfitting. And we will be able to find methods
in order to be able to cure the overfitting, without knowing all of the
specifics that we could possibly know about the noise involved. MODERATOR: Do you ever measure the-- is there some similar kind of measure
of the model complexity, of the target function? Do you ever use a VC dimension
for that? PROFESSOR: Not explicitly. One can apply it. You ask, what is the model that would include the target function? And then, based on that inclusion, you can take the complexity of that model as the complexity of the target. The analysis we use is such that the
complexity of the target function doesn't come in, in terms
of the VC analysis. But there are other approaches, other than the VC analysis, where
the target complexity matters. So I didn't particularly spend time
trying to capture the complexity of the target function until this moment,
where the complexity of the target function could translate to something
in the bias-variance decomposition, and that has an impact on overfitting
and generalization. MODERATOR: I think that's it. PROFESSOR: We will see you on Thursday.