ANNOUNCER: The following program
is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we talked about
regularization, which is a very important technique in
machine learning. And the main analytic step that we took was to go from a constrained form of regularization, where you explicitly forbid some of the hypotheses from being considered, thereby reducing the VC dimension and improving the generalization property, to an unconstrained version which creates an augmented error, in which no particular vector of weights is prohibited per se, but you have a preference among weights based on a penalty that has to do with the constraint. And that equivalence will make us focus on the augmented-error form of regularization in practically every case we have. And the argument for it was to take the
constrained version and look at it, either as a Lagrangian which would be
the formal way of solving it, or as we did it in a geometric way, to find
a condition that corresponds to minimization under a constraint, and
find that this would be locally equivalent to minimizing the augmented error
in an unconstrained way. Then we went to the general form of
a regularizer, and we called it Omega of h. And it depends on small h rather than
capital H, which was the other Omega that we used in the VC analysis. And in that case, we formed the
augmented error as the in-sample error, plus this term. And the idea now is that the
augmented error will be a better thing to minimize, if you want to minimize the
out-of-sample error, rather than that-- just minimizing E_in by itself. And there are two choices here. One of them is the regularizer Omega, weight decay or weight
elimination, or the other forms you may find. And the other one is lambda, which is
the regularization parameter-- the amount of regularization
you're going to put. And the long and short of it is that
the choice of Omega in a practical situation is really a heuristic choice,
guided by theory and guided by certain goals, but there is no
mathematical way in a given practical situation to come up with
a totally principled Omega. But we follow the guidelines,
and we do quite well. So we make a choice of Omega towards
smoother or simpler hypotheses. And then we leave the amount of
regularization to the determination of lambda, and lambda is a little
bit more principled. We'll find out that we will determine
lambda using validation, which is the subject of today's lecture. And when you do that, you will
get some benefit of Omega. If you choose a great Omega, you
will get a great benefit. If you choose an OK Omega, you
will get some benefit. If you choose a terrible Omega, you are
still safe, because lambda will tell you-- the validation
will tell you-- just take lambda equal to 0, and
therefore no harm done. And as you see, the choice of lambda is
indeed critical, because when you take the correct amount of lambda, which
happens to be very small in this case, the fit, which is the red curve,
is very close to the target, which is the blue. Whereas if you push your luck, and have
more of the regularization, you end up constraining the fit so much
that the red-- it wants to move toward the blue, but it can't because
of the penalty, and ends up being a poor fit for the blue curve. So that leads us to today's lecture,
which is about validation. Validation is another technique that
you will be using in almost every machine learning problem
you will encounter. And the outline is very simple. First, I'm going to talk about
the validation set. There are two aspects that
I'm going to talk about. The size of the validation
set is critical. So we'll spend some time looking at
the size of the validation set. And then we'll ask ourselves, why
did we call it validation in the first place? It looks exactly like the test
set that we looked at before. So why do we call it validation? And the distinction will
be pretty important. And then we'll go for model selection,
a very important subject in machine learning. And it is the main task of validation. That's what you use validation for. And we'll find that model selection
covers more territory than what the name may suggest to you. Finally, we will go to cross-validation,
which is a type of validation that is very interesting,
that allows you, if I give you a budget of N examples, to basically use
all of them for validation, and all of them for training, which looks like
cheating, because validation will look like a distinct activity from
training, as we will see. But with this trick, you will be able
to find a way to go around that. Now, let me contrast validation with
regularization, as far as control of overfitting is concerned. We have seen, in one form or another,
the following by-now-famous equation, or inequality or rule, where the out-of-sample error that you want is equal to the in-sample error, or at most the in-sample error plus some penalty. It could be a penalty for model complexity,
overfit complexity, a bunch of other ways of describing that. But basically, this tells us that
E_in is not exactly E_out. That, we know all too well. And there is a discrepancy, and the
discrepancy has to do with the complexity of something. An overfit penalty has to do with the
complexity of the model you are using to fit, and so on. So in terms of this equation, I'd like
to pose both regularization and validation as an activity that
deals with this equation. So what about regularization? We put the equation. What did regularization do? It tried to estimate this penalty. Basically, what we did is concoct
a term that we think captures the overfit penalty. And then, instead of minimizing the
in-sample, we minimize the in-sample plus that, and we call that
the augmented error. And hopefully, the augmented error
will be a better proxy for E_out. That was the deal. And we notice that we are very,
very inaccurate in the choice here. We just say, smooth, pick lambda,
you can use this, you can use that. So obviously, we are not satisfying
any equality by any chance. But we are basically getting
a quantity that has a monotonic property, that when you minimize this,
this gets minimized, which does the job for us. Now, to contrast this, let's look at
validation, when it's dealing with the same equation. What does validation do? Well, validation cuts to the chase. It just estimates the out-of-sample. Why bother with this analysis, and
overfit, and this and that? You want to minimize
the out-of-sample? Let's estimate the out-of-sample,
and minimize it. Obviously, it's too good to be true,
but it's not totally untrue. Validation does achieve something
in that direction. So let me spend a few slides just
describing the estimate. I'm trying to estimate the
out-of-sample error. This is not completely a foreign idea
to us, because we use a test set in order to do that. So let's focus on this, and see what
are the parameters involved in estimating the out-of-sample error. Let's look at the estimate. The starting point is to take
an out-of-sample point x, y. This is a point that was
not involved in training. We used to call it a test point. Now we are going to call it a validation point. It won't become clear for a while why we are giving it a different name, until we use the validation set for something, and then the distinction will become clear. But as far as you are concerned now,
this is just a test point. We are estimating E_out, and we will
just read the value of E_out and be happy with that, and not
do anything further. So you take this point, and the error
on it is the difference between what your hypothesis does on x, and what
the target value is, which is y. And what is the error? We have seen many forms of the error. Let's just mention two
to make it concrete. This could be a simple squared error. We have seen that in
linear regression. It could be the binary error. We have seen that in classification. So nothing foreign here. Now, we take this quantity and treat it as an estimate for E_out-- a poor estimate, but nonetheless an estimate. We call it an estimate because, if you
take the expected value of that with respect to the choice of x, with the
probability distribution over the input space that generates x,
what will that value be? Well, that is simply E_out. So indeed, this quantity, the random
variable here, has the correct expected value. It's an unbiased estimate of E_out. But unbiased means that it's as likely
to be here or here, in terms of expected value. But we could be this, and this would
be a good estimate, or we could be this, and this would be
a terrible estimate. Because you are not getting
all of them. You are just getting one of them. So if this guy swings very large, and
I tell you this is an estimate of E_out, and you get it here, this is
what you will think E_out is. So there is an error, but
the error is not biased. That's what this equation says. But we have to evaluate that swing, and
the swing is obviously evaluated by the usual quantity, the variance. And let's just call the variance
sigma squared. It depends on a number of things,
including what is your error measure and whatnot, but that is what
a single point does. So you get an estimate, but
the estimate is poor because it's one point, and therefore sigma squared
is likely to be large. So you are unlikely to use the estimate
on one point as your guide to E_out. What do you use? You move from one point, to a full set. So you get what? You get a validation set that you are
going to use to estimate E_out. Now, the notation we are going to have
is that the number of points in the validation set is K. Remember that
the number of points in the training set was N. So this will be K points, also generated
according to the same rules-- independently, according to the
probability distribution over the input space. And the error on that set we are
going to call it E_val, as in validation error. So we have E_in, and we have E_out. Now we are introducing another one,
E_val, the validation error. And the form for it is what
you expect it to be. You take the individual errors on the examples and take the average, like you did with the training set: E_val = (1/K) times the sum over the K validation points (x_k, y_k) of e(g(x_k), y_k). This is the validation error. The only difference is that this
is done out-of-sample. These guys were not used in training,
and therefore you would expect that this would be a good estimate for
the out-of-sample performance. Let's see if it is. What is the expected value of
E_val, the validation error? Well, you take the expected
value of this fellow. The expectation goes inside. So the main component is the expected
value of this fellow, which we have seen before-- expected value
on a single point. And you just average linearly,
as you did. Now, this quantity happens
to be E_out. The expected value on
one point is E_out. Therefore, when you do that,
you just get E_out again. So indeed, again, the validation error is
an unbiased estimate of the out-of-sample error, provided that all you did with
the validation set is just measure the out-of-sample error. You didn't use it in any way. Now, let's look at the variance, because
that was our problem with the single-point estimate, and let's
see if there's an improvement. When you get the variance, you are
going to take this formula. And then you are going to have a double
summation, and have all cross terms of e between different points. So you will have the covariance between
the value for k equals 1 and k equals 2, k equals 1 and
k equals 3, et cetera. And you also have that diagonal guys,
which is the variance in this case, with k equals 1 and
k equals 1 again. The main component you are going to
get are the variance, and a bunch of covariances. Actually, there are more covariances
than variances, because the variances are the diagonal, the covariances
are the off-diagonal. There are almost K squared of them. The good thing about the covariance
in this case is that it will be 0, because we picked the points
independently. And therefore, the covariance between quantities that depend on different points will be 0. So I'm only stuck with the diagonal
elements, which happen to have this form. I have the variance here. And when I put the summation,
something interesting happens. I have the summation again,
a double summation reduced to one, because I'm only summing the diagonal. But I still have the normalizing factor
with the number of elements. Because I had K squared elements, the
fact that many of them dropped out is just to my advantage. I still have the 1 over K squared, and
that gives me the better variance for the estimate based on E_val,
than on a single point. This is your typical analysis
of adding a bunch of independent estimates. So you get the sigma squared, which was the variance on a particular point, but now you divide it by K.
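As a rough numerical illustration of this 1 over square root of K behavior, here is a small Python sketch; the target, the fixed hypothesis, and the noise level are made up purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    def e_val(K):
        # validation error of a fixed (made-up) hypothesis h(x) = 0 on K fresh points
        x = rng.uniform(-1, 1, K)
        y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(K)   # made-up target plus noise
        return np.mean((0.0 - y) ** 2)                          # squared error, averaged over the K points

    for K in [1, 10, 100]:
        estimates = [e_val(K) for _ in range(10000)]
        print(K, np.std(estimates))   # the spread shrinks roughly like 1 / sqrt(K)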
Now we see a hope, because even if the original estimate was this way, maybe we can
have K big enough that we keep shrinking the error bar, such that the
E_val itself as a random variable becomes this, which is around
E_out-- what we want. And therefore, it becomes
a reliable estimate. This looks promising. Now we can write the E_val, which
is a random variable, to be E_out, which is the value we want, plus or
minus something that averages to 0, and happens to be the order of approximately
1 over square root of K. If the variance is 1 over K, then
the standard deviation is 1 over square root of K. I'm assuming here that sigma
squared is constant in the range that I'm using. And therefore, the dependency
on K only comes from here. Therefore, I have this quantity
that tells me this is what I'm estimating, and this is the error I'm
committing, and this is how the error is behaving as I increase
the number K. The interesting point now
is that K is not free. It's not like I tell you, it
looks like if I increase K, this is a good situation. So why don't we use more and
more validation points? Because the reality is, K is not given to
you on top of your training set. What I give you is a data set, N points,
and it's up to you to use how many to train, and how
many to validate. I'm not going to give you more, just
because you want to validate. So every time you take a point for
validation, you are taking it away from training, so to speak. Let's see the ramifications
of this regime. K is taken out of N. So let's
now have the notation. We are given a data set D, as
we always called it, and it has N points. What do we do with it? We are going to take K points,
and use them for validation. And you can take any K points, as long
as you don't look at the particular input and output. Let's say you pick K points at
random, from the N points. That will be a valid validation set for you. So I have the K points, and therefore
I'm left with N minus K for training. The ones I left for training,
I'm going to call D_train. I didn't have to use that when I
didn't have validation, because D all went to training,
so I didn't need to have the distinction. Now, because I have two utilities, I'm
going to take the guys that go into training and call that subset
D_train. And the guys that I hold for validation I'm going to call D_val. The union of them is D.
That's the setup. Now, we looked in the previous slide
at the reliability of the estimate of the validation set. And we found that this reliability, if we
measure it by the error bar on the fluctuation, it will be the order of
1 over square root of K, the number of validation points. Then our conclusion is that if you use
small K, you have a bad estimate. And the whole role we have for
validation so far is estimate, so we are not doing a good job. So we need to increase K. It looks like a good idea, just from
that point of view, to take large K. But there are ramifications for taking
large K, so we have a question mark. And let's try to be
quantitative about it. Remember this fellow? That was the learning curve. What did it do? It told us, as you increase the number
of training points, what is the expected value of E_out and what is the
expected value of E_in, for a given model, the model that I'm plotting
the learning curves of. Right? Now, the number of data points used to
be N. Here I'm writing it as N minus K. Why am I doing that? Because under the regime of
validation, this is what I'm using for training. Therefore if you increase K, you are
moving in this direction, right? I used to be here, and I used
to expect that level of E_out. Now I am here, and I'm expecting
that level of E_out. That doesn't look very promising. I may get a reliable estimate, because
I'm using bigger K, but I'm getting a reliable estimate of a worse quantity. If you want to take an extreme case, you
are going to take this estimate and go to your customer, and tell them what
you expect for the performance to be. So you don't only deliver
the final hypothesis. You deliver the final hypothesis, with
an estimate for how it will do when they test it on a point that
you haven't seen before. Now, we want the estimate to be very
reliable, and you forget about the quality of the hypothesis. So you keep increasing K, keep
increasing K, keep increasing K. You end up with a very, very
reliable estimate. The problem is that it's an estimate of
a very, very poor quantity, because you used 2 examples to train, and
you are basically in the noise. So the statement you are going to make
to your customer in this case is that, here is a system. I am very sure that it's terrible! That is unlikely to please a customer. So now, we realize that there is a price
to be paid for K. It turns out that we are going to have a trick that
will make us not pay that price. But still, the question of what happens
when K is big is a question mark in our mind. What I'm going to do now, I'm going
to tell you, you used K points to estimate the error. Now, why don't you restore the data set after you have the estimate? The estimate is already in your pocket, so train on the full set, and you get the better guy. Well, but I estimated on the smaller set. What are we doing here? Let's just do this systematically. Let's put K back into the pot. So here is the regime. I'm going to describe this figure, but let's take it piece by piece. We have the data set D, right? We separated it into
D_train and D_val. D itself has N points. We took N minus K to train,
K to validate. That's the game. What happened? If we used the full training set to
train, we would get a final hypothesis that we called g. This is just a matter of notation. But under the regime of validation,
you took out some guys. And therefore, you are using
only D_train to train. And this has N minus K; it doesn't have all the examples. Therefore, I am going to generically
label the final hypothesis that I get from training on a reduced set, D_train,
I am going to call it g minus. Just to remind ourselves that it's
not on the full training set. So now, here is the idea, if
you look at the figure. I have the D. Let me get it a bit
smaller so that we can get the output. If I use the training set by
itself, I would get g. What I am doing now is that I am going
to take D_train, which has fewer examples, and the rest
go to validation. I use D_train to get g minus, and then
I take g minus and evaluate it on D_val, the validation set, in
order to get an estimate. So the trick now is that instead of
reporting g minus as my final hypothesis, I know if I added the other
data points here to the pot, I am going to get a better
out-of-sample. I don't know what it is. I don't have an estimate for it. But I know it's going to be better
than the one for g minus, simply because of the learning curve. On average, I get more examples, I
get better out-of-sample error. So I put it back and then report g. So it's a funny situation. I'm giving you g, and I'm giving you
the validation estimate on g minus. Why? Because that's the only
estimate I have. I cannot give you the estimate on g,
because now if I get g, I don't have any guys to validate on. So you can see now the compromise. Under this scenario, I'm not really
losing in performance by taking a bigger validation set, because I'm going
to put them back when I get the final hypothesis. What I am losing here is that, the
validation error I'm reporting, is a validation error on a different
hypothesis than the one I am giving you. And if the difference is big, then
my estimate is bad, because I'm estimating on something other
than what I am giving you. And that's what happens
when you have large K. When you have large K, the discrepancy
between g minus and g is bigger. And I am giving you the
estimate on g minus. So that estimate is poor. And therefore, I get
a bad estimate again. Now, you see the subtlety here. This is the regime that is used
in validation, universally. After you do your thing, and you do your
estimates, and, as you will see further, you do your choices, you go and
put all the examples to train on, because this is your best bet of
getting a good hypothesis.
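To make the regime concrete, here is a minimal Python sketch of the split-validate-restore procedure. The functions train and error are hypothetical placeholders for whatever learning algorithm and error measure you are using; they are not from any particular library:

    import numpy as np

    def validate_and_restore(X, y, train, error, K):
        # split off K points, validate g-minus on them, then retrain on all N points
        N = len(y)
        idx = np.random.permutation(N)
        val, tr = idx[:K], idx[K:]                    # D_val gets K points, D_train gets N - K
        g_minus = train(X[tr], y[tr])                 # trained on the reduced set D_train
        E_val = np.mean([error(g_minus, x, t) for x, t in zip(X[val], y[val])])
        g = train(X, y)                               # put the K points back and retrain on all of D
        return g, E_val                               # report g, but the estimate refers to g_minus

The rule of thumb discussed next would set K at around N/5.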
If your K is small, the validation error is not reliable. It's a bad estimate, just because
the variance of it is big. I have small K, it's 1 over square
root of K, so I'm doing this. If you get big K, the problem is not
the reliability of the estimate. The problem is that the thing you are
estimating is getting further and further away from the thing
you are reporting. So now we have a compromise. We don't want K to be too small, in
order not to have fluctuations. We don't want K to be too big, in order
not to be too far from what we are reporting. And as usual in machine learning,
there is a rule of thumb. And the rule of thumb
is pretty simple. That's why it's a rule of thumb. It says, take one fifth
for validation. That usually gives you the
best of both worlds. Nothing proved. You can find counterexamples. I'm not going to argue with that. It's a rule of thumb. Use it in practice, and actually you
will be quite successful here. There's an argument with some
people, whether it should be N over 5 or N over 6. I'm not going to fret over that. It's a rule of thumb, after
all, for crying out loud! We'll just leave it at that. So we now have that. Let's go to the other aspect. We know what validation is, and we
understand how critical it is to choose the number, and we
have a rule of thumb. Now let's ask the question, why
are we calling this validation in the first place? So far, it's purely a test. We get an out-of-sample point. The estimate is unbiased. What is the deal? We call it validation, because
we use it to make choices. And this is a very important point,
so let's talk about it in detail. Once I make my estimate affect the
learning process, the set I am using is going to change nature. So let's look at a situation
that we have seen before. Remember this fellow? Yeah, this was early stopping
in neural networks. And let me magnify it for you
to see the green curve. Do you see the green curve now? OK, so there is a green curve. Now we'll scoot it back. So the in-sample goes down. Out-of-sample-- let's say that I have a general estimate
for the out-of-sample-- goes down with it until such a point
that it goes up, and we have the overfitting, and we talked about it. And in this case, it's a good
idea to have early stopping.
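As a reminder of what early stopping with a held-out set typically looks like in code, here is a minimal sketch; train_one_epoch and validation_error are hypothetical placeholders for your training step and your error measure on the held-out points:

    def early_stopping(model, train_one_epoch, validation_error, max_epochs=1000):
        # track the epoch at which the error on the held-out set is smallest
        best_error, best_epoch = float("inf"), 0
        for epoch in range(1, max_epochs + 1):
            train_one_epoch(model)                  # one more pass of in-sample minimization
            e_val = validation_error(model)         # error on the K points not used for training
            if e_val < best_error:
                best_error, best_epoch = e_val, epoch
        return best_epoch, best_error               # as discussed next, best_error is an optimistic estimate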
Now, let's say that you are using K points, that you did not use for training, in order to estimate E_out. That would be E_test, the test error, if
all you are doing is just plotting the red in order to look
at it and admire it. Oh, that's a nice curve. Oh, it's going up. But you're not going to take
any actions based on it. Now, if you decide that,
this is going up, I had better stop here. That changes the game dramatically. All of a sudden, this is
no longer a test error. Now it's a validation error. So you ask yourself, what the heck? It's just semantics. It's the same curve. Why am I calling it a different name? I'm calling it a different name, because
it used to be unbiased. That is, if this is an estimate of E_out,
not the actual E_out, there will be an error bar in estimating E_out. But it is as likely to be optimistic,
as pessimistic. Now, when you do early stopping, you
say, I'm going to stop here and I'm going to use this value as my estimate
for what you are getting. I claim that your
estimate is now biased. It's the same point. You told us it was unbiased before. What is the deal? Let's look at a very specific
simple case, in order to understand what happens. This is no longer a test set. It becomes, in red, a validation set. Fine, fine. Now convince us of the
substance of it. We know the name. So let's look at the difference, when
you actually make a choice. Very simple thing
that you can reason. Let's say I have the test set, which is
unbiased, and I'm claiming that the validation set has an optimistic bias. Optimism is good. But here, it's optimism followed
by disappointment. It's deception. We are just calling it optimistic, to
understand that it's always in the direction of thinking that the error
will be smaller than it will, actually, turn out to be. So let's say we have two hypotheses. And for simplicity, let's have them
both have the same E_out. So I have two hypotheses. Each of them has out-of-sample
error 0.5. Now, I'm using a point to
estimate that error. And I have two estimates, e_1 for hypothesis 1 and e_2 for hypothesis 2. I'm going to use-- because the
estimate has fluctuations in it, just again for simplicity, I'm going to
assume that both e_1 and e_2 are uniform between 0 and 1. So indeed, the expected value is half,
which is the expected value I want, which is the out-of-sample error. Now, I'm not going to assume strictly
that e_1 and e_2 are independent, but you can assume they are independent
for the sake of argument. But they can have some level of
correlation, and you'll still get the same effect. Let's think now that they are
independent variables, e_1 and e_2. Now, e_1 is an unbiased estimate of
its out-of-sample error, right? Right. e_2 is the same, right? Right. Unbiased means the expected value
is what it should be. And the expected value,
indeed in this case, is what it should be, 0.5. Now, let's take the game, where we
pick one of the hypotheses, either h_1 or h_2. How are we going to pick it? We are going to pick it according
to the value of the error. So now, the measurement we have
is applying to that choice. What I'm going to do,
I'm going to pick the smaller of e_1 and e_2. And whichever that one is, I'm going
to pick the hypothesis that corresponds to it. So this is mini learning. The error e is just the smaller of the two-- pick one, and that is the one. My question to you is very simple. What is the expected value of e? A naive thought would say, you told us
the expected value of e_1 is 1/2. You told us the expected
value of e_2 is 1/2. e has to be either e_1 or e_2. So the expected value should be 1/2? Of course not, because now the rules
of the game-- the probabilities that you're applying-- have changed, because
you are deliberately picking the minimum of the realization. And it's very easy to see that the
expected value of e is less than 0.5. The easiest thing to say is that if
I have two variables like that, the probability that the minimum will be
less than 1/2 is 75%, because all you need is for one of them to be less than 1/2. If the probability of being less than 1/2 is 75%, you expect the expected value to be less than 1/2. It's mostly there. The mass is mostly below. In fact, for two independent uniform variables, the expected value of the minimum is exactly 1/3.
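A tiny simulation confirms this; it is just an illustration of the argument, not something from the lecture's slides:

    import numpy as np

    rng = np.random.default_rng(1)
    e1 = rng.uniform(0, 1, 1_000_000)     # unbiased estimate for h_1, expected value 0.5
    e2 = rng.uniform(0, 1, 1_000_000)     # unbiased estimate for h_2, expected value 0.5
    e = np.minimum(e1, e2)                # pick whichever hypothesis looks better
    print(e.mean())                       # about 0.33, below the true out-of-sample error of 0.5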
So now you realize this is what? This is an optimistic bias. And that is exactly the same as what happened with the early stopping. We picked the point because it's minimum
on the realization, and that is what we reported. Because of that-- the thing
used to be this, but we wait. When it's there, we ignore it. When it's here, we take it. So now that introduces a bias,
and that bias is optimistic. And that will be true for
the validation set. Our discussion so far is based
on just looking at the E_out. Now we're going to use it, and we're
going to introduce a bias. Fortunately for us, the use we make of validation in machine learning is so light that we are going
to swallow the bias. Bias is minor. We are not going to push our luck. We are not going to estimate tons of
stuff, and keep adding bias until the validation error basically becomes
training error in disguise. We're just going to-- let's choose
a parameter, choose between models, and whatnot. And by and large, if you do that, and
you have a respectable-size validation set, you get a pretty reliable
estimate for the E_out, conceding that it's biased, but the bias
is not going to hurt us too much. So this is the general philosophy. Now with this understanding, let's
use validation set for model selection, which is what
validation sets do. That is the main use
of validation sets. And the choice of lambda, in the
case we saw, happens to be a manifestation of this. So let's talk about it. Basically, we are going to use the
validation set more than once. That's how we are going
to make the choice. So let's look. This is a diagram. I'm going to build it up. Let's build it up, and then I'll
focus on it, and look at how the diagram reflects the logic. We have M models that
we're going to choose from. When I say model, you are thinking of
one model versus another, but this is really talking more generally. I could be talking about models as in,
should I use linear models or neural networks or support vector machines? These are models. I could be using only
polynomial models. And I'm asking myself, should
I go for 2nd order, 5th order, or 10th order? That's a choice between models. I could be using 5th-order polynomials
throughout, and the only thing I'm choosing, should I choose lambda of the
regularization to be 0.01, 0.1, or 1? All of this lies under
model selection. There's a choice to be made, and I want
to make it in a principled way, based on the out-of-sample error,
because that's the bottom line. And I'm going to use the validation
set to do that. This is the game. So we'll call them, since they're
models, I have H_1 up to H_M. And we are going to use D_train to train-- not the whole set, as usual, since I left some for validation-- and as a result I'm going to get g minus. That is our convention for whenever
we train on something less than the full set. But because I'm getting a hypothesis
from each model, I am labeling it by the subscript m. So there is g_1 up
to g_M, with a minus, because they used D_train to train. So I get one for each model. And then I'm going to evaluate that
fellow, using the validation set. The validation set are the examples that
were left out from D, when I took the D_train. So now I'm going to do this. All I'm doing is exactly what I did
before, except I'm doing it M times, and introducing the notation
that goes with that. Let's look at the figure
now a little bit. Here is the situation. I have the data set. What do I do with it? I break it into two, validation and training. I apply the training part to each of these hypothesis sets, H_1 up to H_M. And when I train, I end up
with a final hypothesis. It is with a minus, a small
minus in this case, because I'm training on D_train. And they correspond to the hypotheses
they came from, so g_1, g_2, up to g_M. These are done without any
validation, just training on a reduced set. Once I get them, I'm going to
evaluate their performance. I'm going to evaluate their performance
using the validation set. So I take the validation
set and run it here. It's out-of-sample as far as they're concerned, because it's not part of D_train. And therefore, I'm going to get
estimates-- these are the validation errors. I'm just giving them a simple notation
as E_1, E_2, up to E_M. Now, your model selection is to look
at these errors, which supposedly reflect the out-of-sample performance if
you use this as your final product, and you pick the best. Now that you are picking one of them,
you immediately have alarm bells-- bias, bias, bias. Something is happening now, because
now we are going to be biased. Each of these guys was an unbiased
estimate of the out-of-sample error of the corresponding hypothesis. You pick the smallest of them,
and now you have a bias. So the smallest of them will give the
index m star, whichever that might be. So E_m star is the validation error on
the model we selected, and now we realize it has an optimistic bias. And we are not going to take g_m
star minus, which is the one that gave rise to this. We are now going to go back
to the full data set, as we said in our regime. We are going to train with it. And from that training, which is
training now on the model we chose, we are going to get the final hypothesis,
which is g_m star. So again, we are reporting the
validation error on a reduced hypothesis, if you will, but reporting
the hypothesis-- the best we can do, because we know that we get
better out-of-sample when we add the examples. So this is the regime. Let's complete the slide. E_m, that we introduced, happens to be the value of the validation error on the reduced hypothesis g_m minus, as we discussed. And this is true for all of them. And then you pick the model m star, that
happens to have the smallest E_m. And that is the one that you are going
to report, and you are going to restore your D, as we did before,
and this is what you have. This is the algorithm
for model selection.
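Here is a compact Python sketch of that algorithm. The models argument is assumed to be a list of objects with hypothetical fit and error methods, standing in for the hypothesis sets H_1 up to H_M; the interface is made up for illustration:

    import numpy as np

    def select_model(X, y, models, K):
        # pick the model with the smallest validation error, then retrain it on all the data
        N = len(y)
        idx = np.random.permutation(N)
        val, tr = idx[:K], idx[K:]
        E_val = []
        for model in models:
            g_minus = model.fit(X[tr], y[tr])                # train on D_train only
            E_val.append(g_minus.error(X[val], y[val]))      # validate on D_val
        m_star = int(np.argmin(E_val))                       # the winner
        g = models[m_star].fit(X, y)                         # restore D and retrain the chosen model
        return g, E_val[m_star]                              # the reported estimate is optimistically biased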
Now, let's look at the bias. I'm going to run an experiment to show you the bias. So let me put it here and
just build towards it. What is the bias now? We know we selected a particular
model, and we selected it based on D_val. That's the killer. When you use the estimate to choose,
the estimate is no longer reliable, because you particularly
chose it, so now it looks optimistic. Because by choice, it has a good
performance, not because it has an inherently good performance. Because
you looked for the one with the good performance. So the expected value of this fellow
is now a biased estimate of the ultimate quantity we want, which
is the out-of-sample error. So E_val, the sample quantity, is a biased estimate of that. And we would like to evaluate that bias. Here is the illustration on the
curve, and I'm going to ask you a question about it, so you have to pay
attention in order to be able to answer the question. Here is the experiment. I have a very simple situation. I have only two models
to choose between. One of them is 2nd-order polynomials,
and the other one is 5th-order polynomials. I'm generating a bunch of problems, and
in each of them, I make a choice based on validation set. And after that, I look at the
actual out-of-sample error. And I'm trying to find out whether there
is a systematic bias in the one I choose, with respect to
its out-of-sample error. So it's not a fixed choice of H_2 or H_5-- in each run, I may choose H_2 sometimes
and H_5 sometimes, whichever gave me the smaller E_val. And I'm taking the average over
an incredible number of runs. That's why you have a smooth curve. So this will give me an indication of
the typical bias you get when you make a choice between two models, under the
circumstances of the experiment. Now the experiment is done carefully,
with few examples. The total is 30-some examples. And I'm taking a validation set,
which is 5 examples, 15 examples, up to 25 examples. So at this point really, the
number of examples left for training is very small. And I'm plotting this, so this is what I
get for the average over the runs, of the validation error on the model I
chose-- the final hypothesis of the model I chose. And this is the out-of-sample
error of that guy. Now, I'd like to ask
you two questions. Think about them, and also
for the online audience, please think about them. First question-- why are the curves going up? This is K, the size of
the validation set. I'm evaluating it. It's not because I'm evaluating
on more points that the curves are going up. It's because when I use more for
validation, I'm inherently using less for training. So there's an N minus K that is
going the other direction. And what we are seeing here really
is the learning curve, backwards. This is E_out. I have more and more examples to train
as I go here, so the out-of-sample error goes down. So in the other direction, it goes up. And this, being an estimate for
it, goes up with it. So that makes sense. Second question-- why are the two curves getting
closer together? Whether they're going up or down, that's
not my concern at this point, just the fact that they are
converging to each other. Now, that has to do with
K proper, directly. The other had to do with K indirectly,
because I'm left with N minus K. But now, when I have bigger K, the
estimate is more and more reliable, and therefore I get closer
to what I'm estimating. So we understand this. This is definite evidence that, in every situation you will have, there will be a bias. How much bias depends on a number of
factors, but the bias is there. Let's try to find, analytically,
a guideline for the type of bias. Why is that? Because I'm using the validation set to
estimate the out-of-sample error, and I'm really claiming that it's close
to the out-of-sample error. And we realize that, if I don't
use it too much, I'll be OK. But what is too much? I want to be a little bit quantitative
about it, at least as a guideline. So I have M models, and you can see
that the M is in red. That should remind you when we had
M in red very early in the course, because M used
to make things worse. It was the number of hypotheses, when we
were talking about generalization. And it was really that, when you have
bigger M, you are in bigger trouble. So it seems like we are also going to
be in bigger trouble here, but the manifestation is different. We have now M models we
are choosing from-- models in the general sense. This could be M values of the
regularization parameter lambda in a fixed situation, but we are still
making one of M choices. Now, the way to look at it is to think
that the validation set is actually used for training, but training on
a very special hypothesis set, the hypothesis set of the finalists. What does that mean? So I have H_1 up to H_M. I'm going to run a full training
algorithm on each of them, in order to find a final hypothesis from
each, using D_train. Now, after I'm done, I am only
left with the finalists, g_1 up to g_M, with a minus sign because
they are trained on the reduced set. So the hypothesis set that I am training
on now is just those guys. As far as the validation set is
concerned, it didn't know what happened before. It doesn't relate to D_train. All you did, you gave it this hypothesis
set, which is the final hypotheses from your previous guy,
and you are asking it to choose. And what are you going to choose? You are going to choose
the minimum error. Well, that is simply training. If I just told you that this is your
hypothesis set, and that D_val is your training, what would you do? You will look for the hypothesis
with the smallest error. That's what you are doing here. So we can think of it now as if we are
actually training on this set. And this tells us, oh, we need to
estimate the discrepancy or the bias between this and that. Now it's between the validation error
and the out-of-sample error. But the validation error is really the
training error on this special set. So we can go back to our good old
Hoeffding and VC, and say that the out-of-sample error of the chosen hypothesis-- and now you can see that the choice here is m star, so I'm actually choosing one of those guys; this is my training, and the final, final hypothesis is this guy-- is less than or equal to the validation error, plus a penalty for the model complexity: roughly, E_out of g_m*-minus <= E_val of g_m*-minus + O(sqrt(ln M / K)). And the penalty, if you use even
the simple union bound, will have that form. You still have the 1 over square root
of K, so you can always make it better by having more examples. But then you have a contribution because
of the number of guys you are choosing from. If you are choosing between
10 guys, that's one thing. If you are choosing between
100 guys, that's another. It's worse. Well, benignly worse, because
it's logarithmic, but nonetheless, worse. And if you are choosing between
an infinite number of guys, we know better than to dismiss
the case off-hand. You say, infinite number
of guys, we can't do that. No, no, no. Because once you
go to the infinite choices, you don't count anymore. You go for a VC dimension
of what you are doing. That's what the effective
complexity goes with. And indeed, if you're looking for choice of
one parameter, let's say I'm picking the regularization parameter. When you are actually picking the
regularization parameter, and you haven't put a grid-- you don't
say, I'm choosing between 1, 0.1, and 0.01, et cetera-- a finite number. I'm actually choosing the numerical
value of lambda, whatever it would be. So I could end up with
lambda equal 0.127543. You are making a choice between
an infinite number of guys, but you don't look at it as an infinite
number of guys. You look at it as a single parameter. And we know a single parameter
goes with a VC dimension 1. That doesn't faze us. We dealt with VC dimensions
much bigger than that. And we know that if we have one
parameter, or maybe two parameters, and the VC dimension maybe is 2,
if you have a decent set-- in this case decent K, not decent N,
because that's the size of the set you are talking about-- then your estimate will not
be that far from E_out. This is the idea. So now you can apply this with the VC analysis. Instead of just going for
the number, which is the union bound, you go for the VC version, and
now apply it to this fellow. And you can ask yourself, if
I have a regularization parameter, what do I need? Or if I have another thing, which
is the early stopping. What is early stopping? I'm choosing how many epochs to train for. The number of epochs is an integer, but there is a continuity to it, so I'm choosing where to stop. All of those choices, where one parameter
is being chosen one way or the other, correspond
to one degree of freedom. So if I tell you the rule of thumb is
that, when you are using the validation set, if it's a reasonable-size
set, let's say 100 points, and you use those 100 points to choose a couple
of parameters, you are OK. You already can relate to that. You don't need me to tell you that. Because, 100 points,
VC dimension 2, yeah, I can get something. Now, if I give you the 100 points
and tell you you are choosing 20 parameters, you immediately
say, this is crazy. Your estimate will be completely
ruined, because you are now contaminating the thing. This is now genuinely training, because
the choice of the value of a parameter is what? Well, that's what training did. The training of a neural network tried to
choose the weights of the network, the parameters. There were just so many of them,
that we called it training. Now, when it's only one parameter or two,
we call it choice of a parameter by validation. So it's a gray area. If you push your luck in that direction,
the validation estimate will lose its main attraction, which
is the fact that it's a reasonable estimate of the out-of-sample,
that we can rely on. The reliability goes down. So there is this tradeoff. Now, regarding data contamination, let me summarize it as follows. We have error estimates. We have seen some of them. We looked at the in-sample error, the
out-of-sample error, or E_test, and then we have E_val, the
validation error. I'd like to describe those in terms of data contamination: if you use the data to make choices, you are contaminating it as far as its ability to estimate the real performance is concerned. That's the idea. So you can look at what
is contamination. It's the built-in optimistic bias-- better described as deceptive, because it's bad. You are going to go to the
bank and tell them, I can forecast the stock market. No, you can't. So that's bad. You were optimistic before
you went there. After that, you are in trouble. You are trying to get the bias in
estimating E_out, and you are trying to measure what is the level
of contamination. So let's look at the
three sets we used. We have the training set. This is just totally contaminated. Forget it. We took a neural network with 70 parameters,
and we did backpropagation, and we went back and forth, and we ended up
with something, and we have a great E_in, and we know that E_in
is no indication of E_out. This has been contaminated to death. So you cannot really rely on E_in,
as an estimate for E_out. When you go to the test set,
this is totally clean. It wasn't used in any decisions. It will give you an estimate. The estimate is unbiased. When you give that as your estimate,
your customer is as likely to be pleasantly surprised, as
unpleasantly surprised. And if your test set is big, they are
likely not to be surprised at all. It'll be very close to your estimate. So there is no bias there. Now, the validation set is in between. It's slightly contaminated, because
it made a few choices. And the wisdom here is: please keep it only slightly contaminated. Don't get carried away. Sometimes when you are in the middle
of a big problem, with lots of data, you choose this parameter. Then, oh, there's another parameter I
want to choose, so you use the same validation set-- alarm bells, alarm bells--
and you keep doing it. So you should have a regime to begin
with, that you should have not only one validation set. You could have a number of them, such
that when one of them gets dirty, contaminated, you move on to the other
one which hasn't been used for decisions, and therefore the
estimates will be reliable. Now we go to cross-validation. Very sweet regime, and it has to do with
the dilemma about K. So now we're not talking about biased versus
unbiased, because this is already behind us. Now we're looking at an estimate
and a variation of the estimate, as we did before. And we have the discipline to make
sure that we don't mess it up by making it biased. So that is taken for granted. Now I'm just looking at a regime of
validation as we described it, versus another regime which will get us
a better estimate, in terms of the error bar, the fluctuation around
the estimate we want. So we had the following
chain of reasoning. E_out of g, the hypothesis we are
actually going to report, is what we would like to know. If we know that, we are set. We don't have that, but that
is approximately the same as E_out of g minus. This is the out-of-sample error, the
proper out-of-sample error, but on the hypothesis that was trained
on a reduced set. Correct? And if I didn't take too
many examples, they are close to each other. This one happens to be close to
the validation estimate of it. So here, it is because it's a different
set that I'm training on. Here, it's because I am
making a finite-sample estimate of the quantity. Here, I could go up
and down from this. I'm looking at this chain. This
is really what I want, and this is what I'm working with. This is unknown to me. In order to get from here to
here, I need the following. I need K to be small, so that g
minus is fairly close to g. And therefore I can claim that their
out-of-sample error is close, because the bigger K is, the bigger the
discrepancy between the training set and the full set, and therefore the
bigger the discrepancy between the hypothesis I get here, and the
hypothesis I get here. So I'd like K to be small. But also, I'd like K to be large,
because the bigger K is, the more reliable this estimate is, for that. So I want K to have two conditions. It has to be small, and
it has to be large. We will achieve both. You'll see in a moment. New mathematics is going
to be introduced! So here is the dilemma. Can we have K to be both
small and large? The method looks like complete cheating,
when you look at it first, and then you realize,
this is actually valid. So what do we do? I'm going to describe one form of
cross-validation, which is the simplest to describe, which is called "leave
one out". Other methods will be "leave more out", that's all. But let's focus on "leave one out". Here is the idea. You give me a data set of N.
I am going to use N minus 1 of them for training. That's good, because now I am very close
to N, so the hypothesis g minus will be awfully close to g. That's great, wonderful,
except for one problem. You have one point to validate on. Your estimate will be completely
laughable, right? Not so fast. In terms of a notation, I'm going to
create a reduced data set from D, call it D_n, because I'm actually
going to repeat this exercise for different
indices n. What do I do? I take the full data set, and then take
one of the points, that happens to be n, and take it out. This will be the one that
is used for validation. And the rest of the guys are going
to be used for training. Nothing different, except that
it's a very small validation set. That's what is different. Now the final hypothesis, that we learn
from this particular set, we have to call g minus because it's
not on the full set. But now, because it depends on which guy
we left out, we give it the label of the guy we left out. So we know that this one is trained
on all the examples but n. Let's look at the validation error,
which has to be one point. This would be what? This would be E validation-- big
symbol of this and that-- but in reality, the validation set is
one point, so this is simply just the error on the point I left out. g_n minus did not involve the n-th example. It was taken out. And now that we froze it, we are going
to evaluate it on that example, so that example is indeed
out-of-sample for it. So I get this fellow. Now, I know that this guy is an unbiased
estimate, and I know that it's a crummy estimate. That much, I know. Now, here is the idea. What happens if I repeat this exercise
for different n? So I generate D_1, do all of this,
and end up with this estimate. Do D_2, all of this-- end up
with another estimate. Each estimate is out-of-sample with
respect to the hypothesis that it's used to evaluate. Now, the hypotheses are different. So I'm not really getting the
performance of a particular hypothesis. For this hypothesis, this
is the estimate. It's off. For this hypothesis, this
is the estimate. It's off. For this hypothesis,
this is the estimate. The common thread between all the
hypotheses is that they are hypotheses that were obtained by training on
N minus 1 data points. That is common between all of them. It's different N minus 1
data points, but nonetheless, it's N minus 1. Because of the learning curve,
I know there is a tendency. If I told you this is the number of
examples, you can tell me what is the expected out-of-sample error. So in spite of the fact that these are
different hypotheses, the fact that they come from the same
number of points, N minus 1, tells me that they are all realizations
of something that is the expected value of all of them. So each small error estimates the error of its own hypothesis, and together they estimate the expected out-of-sample error you get from training on N minus 1 examples, regardless of the
identity of the examples. So there is something common
between these guys. They are trying to estimate something. So now what I'm going to do, I am going
to define the cross-validation error, E_cv, to be the average of those guys: E_cv = (1/N) times the sum of e_1 through e_N. It's a funny situation now. These came from N full training
sessions, each of them followed by a single evaluation on
a point, and I get a number. And after I'm done with all of this, I
take these numbers and average them.
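In code, leave-one-out cross-validation is a short loop. As in the earlier sketches, train and error are hypothetical placeholders for your learning algorithm and your pointwise error measure:

    import numpy as np

    def cross_validation_error(X, y, train, error):
        # leave one out: N training sessions, each on N - 1 points
        N = len(y)
        e = np.empty(N)
        for n in range(N):
            mask = np.arange(N) != n                 # D_n: the data set with the n-th point left out
            g_n_minus = train(X[mask], y[mask])      # train on the remaining N - 1 points
            e[n] = error(g_n_minus, X[n], y[n])      # evaluate on the point that was left out
        return e.mean()                              # E_cv, the average of e_1 through e_N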
Now, if you think of it as a validation set, all of a sudden the validation set is
very respectable. It has N points. Never mind the fact that each of them
is evaluated on a different hypothesis. I was able to use N minus 1
points to train, and that will give me something very close to what
happens with N. And I'm using N points to validate. The catch, obviously-- these are not independent, because the
examples were used to create the hypotheses, and some example
was used to evaluate them. And you will see that each of them is
affected by the other, because the hypothesis either has the point you left
out, or you are evaluating on that. Let's say, e_1 and e_3. e_1 was used to evaluate the error on
a hypothesis that involved the third example, because the third example
was in, when I talk about e_1. Then e_3 was used to evaluate on the third example, but on a hypothesis that involved the first example. So you can see where
the correlation is. Surprisingly, the effective number, if
you use this, is very close to N. It's as if they were independent. If you do the variance analysis, then out of 100 examples, it's probably as if you were using 95 independent examples. So it's remarkably efficient,
in terms of getting that. So this is the algorithm. Now, let's illustrate it. If you understand this, you
understand cross-validation. I'm illustrating it for
the "leave one out". I have a case. I am trying to estimate a function. I actually generated this function
using a particular target. I'm not going to tell you yet what it is. I added some noise. And I am trying to use cross-validation,
in order to choose a model, or to just evaluate the
out-of-sample error. So let's evaluate the out-of-sample
error using the cross-validation method, for a linear model. So what do you do? First order of business, take a point
that you will leave out. Right? So now, this guy is the training set,
and this guy is the validation set. It's one point. Then you train. And you get a good fit. Then, you evaluate the validation
error on the point you left out. That will be that. That's one session. We are going to repeat this three times,
because we have three points. So this is the second time we do it. This time, this point was left out. These guys were the training. I connected them and
computed the error. Third one. You can see the pattern. After I am done, I'm going to compute
the cross-validation error to be simply the average
of the three errors. So let's say we are using
squared errors. e_1 is the square of this distance, et cetera, and you add them up and take one third. This will be the cross-validation error. What I am saying now is that you
are going to take this as an indication for how well the linear
model fits the data, out-of-sample. If you look in-sample, obviously
it fits the data perfectly. And if you use the three points, the
line will be something like that. It will fit it pretty decently. But you have no way to tell how you are
going to perform out-of-sample. Here, we created a mini out-of-sample, in
each case, and we took the average performance of those as an indication
of what will happen out-of-sample. Mind you, we are using
only 2 points here. And when we are done, we are going
to use it on 3 points. That's g minus versus g. It's a little bit dramatic here,
because 2 and 3-- the difference is 1, but
the ratio is huge. But think of 99 versus 100. Who cares? It's close enough. This is just for illustration. So let's use this for model selection. We did the linear model,
and we call it linear. So now let's go for the usual suspect,
the constant model, exactly with the same data set. Let's look at the first guy. These are the two points used for training, and this is the one left out for validation. You train on those. Here, instead of connecting them, you take the middle value-- it's a constant. And this would be your error here. Right? Second guy, you get the
idea? Third guy. Now, if your question is:
is the linear model better than the constant model in this case? then the only thing you look at in all of this
is the cross-validation error. So this guy, this guy, this guy,
averaged, is the grade-- negative grade, because it's error--
for the linear model. This guy, this guy, this guy, averaged,
is that grade for the constant model. And as you see, the constant
model wins. And it's a matter of record that these
three points were actually generated by a constant model. Of course, they could have been
generated by anything. But on average, they will give
you the correct decision. And they avoid a lot of funny heuristics
that you can apply. You can say-- wait a minute,
linear model, OK. Any two points I pick, the
slope here is positive. So there is a very strong indication
that there is a positive slope involved, and maybe it's a linear
model with a positive slope. Don't go there. You can fool yourself into
any pattern you want. Go about it in a systematic way. This is a quantity we know,
the cross-validation error. This is the way to compute it. We are going to take it as the
indication, notwithstanding that there is an error bar because it's a small
sample, in this case 3, and also because we are making the
decision for 2 points and using it for 3 points. These limitations are inherent, but at least it gives you something systematic. And indeed, it gives you the correct choice in this case.
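As a minimal sketch of that computation in Python -- the three coordinates here are made up, and numpy's polyfit stands in for the training step; with these particular numbers the constant model happens to come out ahead, as in the lecture's example:

```python
import numpy as np

# Three made-up data points, standing in for the three points in the illustration.
x = np.array([-1.0, 0.3, 1.0])
y = np.array([0.7, 0.8, 0.6])

def loo_cv_error(x, y, degree):
    """Leave-one-out cross-validation error of a polynomial fit of the
    given degree (degree 0 = constant model, degree 1 = linear model)."""
    errors = []
    for n in range(len(x)):
        keep = np.arange(len(x)) != n                  # leave point n out
        coeffs = np.polyfit(x[keep], y[keep], degree)  # train on the remaining points
        e_n = (np.polyval(coeffs, x[n]) - y[n]) ** 2   # squared error on the left-out point
        errors.append(e_n)
    return float(np.mean(errors))                      # E_cv = average of the e_n's

print("E_cv, linear model:  ", loo_cv_error(x, y, degree=1))
print("E_cv, constant model:", loo_cv_error(x, y, degree=0))
```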
So let's look at cross-validation in action. I'm going to go with
a familiar case. You remember this one? Oh, these were the handwritten
digits, and we extracted two features, symmetry and intensity. And we are plotting the different guys,
and we would like to find a separating surface. We are going to use a nonlinear
transform, as we always do. And in this case, what I'm going to
do, I'm going to sample 500 points from this set at random for training,
and use the rest for testing the hypothesis. What is the nonlinear transformation? It's huge-- 5th order. So I am going to take all
20 features, or 21 including the constant. And what am I going to
use validation for? This is the interesting part. What I'm going to use validation
for is, where do I cut off? So I'm comparing 20 models. The first model is,
just take this guy. Second model is, take x_1 and x_2. Third model, take x_1, x_2, and
x_1 squared, et cetera. Each of them is a model. I can definitely train on
it and see what happens. And I'm going to use cross-validation "leave one out", in order to choose where to stop. So if I have 500 examples, realize that every time I do this, I have to have 500 training sessions. Each training session has 499 points. It's quite an elaborate thing.
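As a rough sketch of that procedure -- not the lecture's actual experiment: placeholder random data stands in for the two digit features, ordinary least squares stands in for the training algorithm, and the feature ordering is one plausible way to enumerate the 5th-order monomials:

```python
import numpy as np
from itertools import combinations_with_replacement

def fifth_order_features(X):
    """The 20 monomials of degree 1 through 5 in (x1, x2); the constant
    term is added separately. The ordering is one plausible choice."""
    columns = []
    for degree in range(1, 6):
        for combo in combinations_with_replacement(range(2), degree):
            columns.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(columns)                    # shape (N, 20)

def loo_cv_error(Z, y):
    """Leave-one-out CV error of a least-squares linear fit on features Z."""
    N = len(y)
    errors = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n
        w, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
        errors[n] = (Z[n] @ w - y[n]) ** 2
    return float(errors.mean())

# Placeholder data: 100 random points with +/-1 labels (not the digits data).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = np.sign(rng.standard_normal(100))

Phi = fifth_order_features(X)
ones = np.ones((len(y), 1))
cv_errors = [loo_cv_error(np.hstack([ones, Phi[:, :k]]), y) for k in range(1, 21)]
print("chosen cutoff:", int(np.argmin(cv_errors)) + 1)  # the k with the smallest E_cv
```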
But when you do this, this is the curve you get. You get different errors. Let me magnify it. This is the number
of features used. This is the cutoff I talked about. You can go all the way
up to 20 features. When you look at the training error, not
surprisingly, the training error always goes down. What else is new? You have more, you fit better. The out-of-sample error, which I'm
evaluating on the points that were not involved at all in this process,
cross-validation or otherwise, just out of sample totally, I get this fellow. And the cross-validation error, which
I get from the 500 examples by excluding one point at a time and taking
the average, is remarkably similar to E_out. It tracks it very nicely. And if I use it as a criterion for model
choice, the minima are here. So if I take between 5 and 7,
let's say I take 6. I would say, let me cut off at 6
and see what the performance is like. Let's look at the result of that, without validation, and with validation. Without validation, I'm using
the full model, all 20. And you can see, we have seen
this before-- overfitting. I'm sweating bullets to include this
single point in the middle, and after I included it, guess what? None of the out-of-sample points
was red here. This was just an anomaly. So I didn't get anything for it. This is a typical thing. It's unregularized. Now, when you use the validation, and
you stop at the 6th because the cross-validation error told you so,
it's a nice, smooth surface. The in-sample error is not perfect, but it didn't put effort where it didn't belong. And when you look at the bottom line,
what is the in-sample error here? 0%. You got it perfect. We know that. And the out-of-sample error? 2.5%. For digits, that's OK-- OK, but not great. Now, with validation, the in-sample error is 0.8%. But we know better. We don't care about the in-sample
error going to 0. That's actually harmful in some cases. The out-of-sample error is 1.5%. Now, if you are in the range--
2.5% error means that you are performing at 97.5%. Here, you are performing at 98.5%. A 40% reduction in the error in that
range is a lot. There is a limit here that
you cannot exceed. So here, you are really doing great
by just doing that simple thing. Now you can see why validation is
considered, in this context, as similar to regularization. It does the same thing. It prevented overfitting, but it
prevented overfitting by estimating the out-of-sample error, rather than
estimating something else. Now, let me go and very quickly-- and I will close the lecture with it-- give you the more general form. We talked about "leave one out".
Seldom do you use "leave one out" in real problems, and you
can think of why. Because if I give you 100,000 data
points, and you want to leave one out, you are going to have 100,000 sessions
training on 99,999 for each, and you will be an old person before
the results are out. So when you have "leave one out",
you have N training sessions using N minus 1 points each, right? Now, let's consider taking
more points for validation. 1 point makes it great, because
N minus 1 is so close to N, that my g minus will
be so close to g. But hey, 100,000, if you decided
to take 100,000 minus 1,000, that's still 99,000. That's fairly close to 100,000. You don't have to make
the difference just 1. So what you do is, you take your
data set, and you break it into a number of folds. Let's say 10-fold. So this will be 10-fold
cross-validation. And each time, you take one of the guys
here, that is, 1/10 in this case, use it for validation, and the
9/10, you use them for training. And you change, from one run to another,
which one you take for validation. So "leave one out" is exactly the
same, except that here, you replace the 10 by N. I break the thing
into 1 example at a time, and then I validate on 1 example. Here, I'm taking a chunk. And therefore, you have fewer
training sessions, in this case 10 training sessions, with not that much of a difference, in
terms of the number of examples. If N is big, instead of taking 1,
you take a few more. Now, the reason I introduced this is
because this is what I actually recommend to you. Very specifically, 10-fold
cross-validation works very nicely in practice. So the rule is, you take the total
number of examples, divide them by 10, and that is the size of
your validation set. You repeat it 10 times, and you get an estimate, and you are ready to go. That's it.
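As a minimal sketch of that recipe in Python -- `train` and `error` are hypothetical stand-ins supplied by the caller (a function that returns a hypothesis, and a function that evaluates it on a set of points):

```python
import numpy as np

def kfold_cv_error(X, y, train, error, K=10, seed=0):
    """K-fold cross-validation: split the N examples into K folds at random,
    train on K-1 of them, validate on the held-out fold, and average."""
    N = len(y)
    indices = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(indices, K)                 # roughly N/K points per fold
    errors = []
    for k in range(K):
        val = folds[k]                                 # this fold is the validation set
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        g_minus = train(X[trn], y[trn])                # train on the other K-1 folds
        errors.append(error(g_minus, X[val], y[val]))  # validate on the held-out fold
    return float(np.mean(errors))
```

Setting K equal to N recovers "leave one out" as a special case.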
I will stop here, and we'll take questions after a short break. Let's start the Q&A. And we
have an in-house question. STUDENT: You talked about validation, and you said that we should restrict ourselves in the number of parameters we estimate. Do we have a rule of thumb about
the number of these parameters? So is, say, K over 20 parameters
reasonable for the maximum number? PROFESSOR: It obviously depends
on the number of data points. So the reason I didn't give a rule of thumb in this case is that it goes with the number of points. But let's say that if I have 100 points
for validation, so it's a small data set, I would say that a couple
of parameters would be fine. At least, that's my own experience. And you can afford more,
when you have more. And when you have more, you can even
afford more than one validation set, in which case, you use each of them
for a different estimate. But the simplest thing, I would say,
a couple of parameters for 100 points would be OK. MODERATOR: Can you clarify why model
choice by validation doesn't count as data snooping? PROFESSOR: For the same reason
that the answer is usually given for a question like that, because
it is accounted for. I took the validation set, the
validation set are patently out of sample, and I used them
to make a choice. And when I did that choice, I made
sure that the discrepancy between in-sample and out-of-sample on the
validation set is very little. So we had this discussion of how much
bias there is, and we want to make sure that the discrepancy
is very little. So because I have already done the
accounting, I can take it as a reasonable estimate for the
out-of-sample. That is why. In the other case, the problem with the
data snooping that I gave is that you use the data in order
to make choices, and in that case, huge choices. You looked at the data and you chose
between different models, and you didn't pay for it. You didn't account for it. That's where the problem was. MODERATOR: Some people recommend
using cross-validation 10 times. What does that add? PROFESSOR: In the regime I described, I only need to tell you 10-fold, 12-fold, 50-fold, and
then the rest is fixed. So if I use 10-fold, then
by definition I'm going to do this 10 times. It's not a choice, given the regime
that I described. In each run, I am choosing one of the 10 to
be my validation, and the rest for training, and taking the average. So the question is asking,
do I do this 10 times? Inherently, built in the method
is that you use it 10 times, if that's the question. MODERATOR: I think the question goes to,
since you chose your 10 subsets and then ran cross-validation, what if you choose 10 different subsets and repeat the process? PROFESSOR: There are variations. For example, even, let's say, with the
"leave one out", maybe I can take a point at random, and not necessarily
insist on going through all the examples-- do it like 50 times,
and take the average. Or I can take subsets, like in the
10-fold, but I take random subsets and stop at some point. So there are variations of those. The ones I described are
the most standard ones. But there are obviously
variations. And one can do an analysis
for them as well. MODERATOR: Is there any rule for
separating data among training, validation, and test? PROFESSOR: Random is
the only trustworthy thing. Because if you use your judgment
somehow, you may introduce a sampling bias, which we'll talk about
in a later lecture. And the best way to avoid that for sure is to sort of flip coins to choose your examples; then you know that you are safe.
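The coin-flip split itself is a one-liner; a minimal sketch, with made-up sizes:

```python
import numpy as np

def random_split(N, sizes, seed=0):
    """Randomly partition the indices 0..N-1 into consecutive groups of the
    given sizes -- e.g. training, validation, and test."""
    perm = np.random.default_rng(seed).permutation(N)
    return np.split(perm, np.cumsum(sizes)[:-1])

# Illustrative numbers only: 300 / 100 / 100 out of 500 examples.
train_idx, val_idx, test_idx = random_split(500, [300, 100, 100])
```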
MODERATOR: What's the computational complexity of adding cross-validation? PROFESSOR: I didn't give
the formula for it. Basically, for "leave one out", you
are doing N times as much training as you did before. The evaluation is trivial. Most of
the time goes for the training. So you can ask yourself, how many
training sessions do I have to do now that I'm using cross-validation, versus
what I had to do before? Before, you had to do one session. Here, you have to do as many sessions
as there are folds. So 10-fold will be 10 times. "Leave one out" would be N,
because it's really N-fold, if you want, and so on. MODERATOR: A clarification-- can you use both regularization
and cross-validation? PROFESSOR: Absolutely. In fact, one of the biggest utilities
for validation is to choose the regularization parameter. So inherently in those
cases, you do it. You can use it to choose the
regularization parameter. And then you can also use it on
the side, to do something else. So both of them are active
in the same problem. And in most of the practical cases you
will encounter, you will actually be using both. Very seldom can you get away without regularization, and very seldom can you get away without validation.
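A minimal sketch of that combination, assuming linear regression with the weight-decay regularizer, for which the regularized solution has a standard closed form; cross-validation then picks lambda from a grid (the grid values here are arbitrary):

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    """Linear regression with weight decay: w = (Z'Z + lambda*I)^-1 Z'y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def cv_error(Z, y, lam, K=10, seed=0):
    """K-fold cross-validation estimate of the squared error for a given lambda."""
    N = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)
    errors = []
    for k in range(K):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        w = weight_decay_fit(Z[trn], y[trn], lam)      # regularized fit on K-1 folds
        errors.append(np.mean((Z[val] @ w - y[val]) ** 2))
    return float(np.mean(errors))

# Usage sketch: pick the lambda with the smallest cross-validation error.
# Z, y = ...                            # your feature matrix and targets
# grid = [0.01, 0.1, 1.0, 10.0]         # arbitrary grid of lambda values
# best_lam = min(grid, key=lambda lam: cv_error(Z, y, lam))
```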
MODERATOR: Someone is asking that this seems to be a brute force method for model selection. Is there a way to branch and bound
how many hypotheses to consider? PROFESSOR: There are lots
of methods for model selection. This is the only one, at least among the
major ones, which does not require assumptions. I can do model selection based on,
I know my target function is symmetric, so I'm going to
choose a symmetric model. That can be considered
model selection. And there are a bunch of other logical
methods to choose the model. The great thing about validation is
that there are no assumptions whatsoever. You have M models. What are the models?
What assumptions do they have? How close they are to the target function, or not-- who cares? They are M models. I am going to take a validation set, and
I'm going to find this objective criterion, which is a validation or
cross-validation error, and I'm going to use it to choose. So it's extremely simple to implement,
and very immune to assumptions. Obviously, if you make assumptions and
you know that the assumptions are valid, then you would be doing
better than I am doing. But then you know that the
assumptions are valid. I'm taking a case where I don't want to
make assumptions that I don't know hold, and still want to make the
model selection. MODERATOR: In the case where the data
depends on time evolution, how can validation update the model? Is it used for that, or not? PROFESSOR: Validation makes
a principled choice, regardless of the nature of that choice. Let's say that I have a time series, and
one of the things in time series-- let's say they're for financial
forecasting-- is that, you can train, and then you
get a system, and then the world is not stationary. So a system that used to work,
doesn't work anymore. You can make choices about, let's
say I have a bunch of models, and I want to know which one of them works
at a particular time, given some conditions. You can make the model selection
based on validation, and then you take that model and apply it to the real
data, or there are a bunch of things you can do. But in terms of tracking the evolution
of systems, again, if you translate the problem into making a choice, then
you are ready to go with validation. So the answer is yes. And the method is to make it
spelled out as a choice. MODERATOR: Another clarification-- so with cross-validation,
there's still some bias. Can you quantify why it is better
than just regular validation? PROFESSOR: Both validation and
cross-validation will have bias for the same reasons. The only question is the reliability
of the estimate. Let's say that I use "leave
one out", so here's E_out. And the bias aside, if I use
"leave one out", I'm using all N of the examples eventually,
when I average them. So the error bar is small. Granted, it's not as small as it would
be if the N errors were independent of each other. But it's fairly close to being
as if they were independent. So I get that estimate. Therefore, anytime you have this
estimate, it becomes less vulnerable to bias. Because if I have this small play, and I'm pulling down, I'm not going to pull down too far, because I'm still within the small error bar. If I have the other one, which is swinging completely, it's very easy to pull it down, and I get a worse effect from the bias. So whenever you minimize the error bar,
you minimize the vulnerability to bias as well. That's the only thing that
cross-validation does. It allows you to use a lot of examples
to validate, while using a lot of examples to train. That's the key. MODERATOR: Going back to the previous
lecture, a question on that. Can you see the augmented error as
conceptually the same as a low-pass filtered version of the initial
error, or not? PROFESSOR: It can be translated
to that under the condition that the regularizer is a smoothness
regularizer, because that's what low-pass filters do. So as an intuition, it's not
a bad thing to consider in the case of something like weight
decay. It's not going to be strictly low-pass as in working in the Fourier
domain and cutting off, et cetera. But it will have the same
effect of being smooth. If you have a question, please step to the microphone,
and you can ask it. So there's a question in house. STUDENT: Yes. It seems that cross-validation is
a method to deal with a limited data set size. So is it possible in practice that
we have a data set so large that cross-validation is not needed or not
beneficial, or do people do it all the time in principle? PROFESSOR: It is possible, and
one of the cases is the Netflix case, where you had 100 million points. So you think at this point, nobody will
care about cross-validation. But it turned out that even in this
case, the 100 million points only had a very small subset which comes from the same distribution as the out-of-sample points. So with the 100 million-- again, it's the same question
as the time evolution. You have people making ratings,
and different people making different numbers of ratings, and this changes for a number of reasons. Even for the same user, after you rate for a while, you tend to drift from the initial ratings. Maybe you are initially
excited or something. So there are lots of
considerations like that. So eventually, the number of points
that were patently coming from the same distribution as the out-of-sample
was much smaller than 100 million. And these are the ones that were
used to make big decisions, like validation decisions. And in that case, even if we started
with 100 million, it might be a good idea to use cross-validation
at the end. And if you use something like 10-fold
cross-validation, then it's not that big a deal, because you are just
multiplying the effort by 10, which is, given what is involved,
not that big a deal. And you really get a dividend
in performance. And if you insist on performance,
then it becomes indicated. So the answer is yes, because it doesn't
cost that much, and because sometimes in a big data set, the
relevant part, or the most relevant part, is smaller than the whole set. MODERATOR: Say there's a scenario where
you find your model through cross-validation, and then you
test the out-of-sample error. But somehow you test a different model,
and it gives you a smaller out-of-sample error. Should you still keep the one you
found through cross-validation? PROFESSOR: So I went
through this learning and came up with a model. Someone else went through whatever
exercise they have and came up with a final hypothesis in this case. And I am declaring mine the winner
because of cross-validation, and now we are saying that there's further
statistical evidence. We get an out-of-sample error that tells
me that mine is not as good as the other one. Then it really is the question of,
I have two samples, and I'm doing an evaluation. And one of them tells me something, and
the other one tells me the other. So I need to consider first the size
of them. That will give me the relative size of the error bar.
And correlations, if any. And bias, which cross-validation may have, whereas the
other one, if it's truly out of sample, does not. If I go through the math, and maybe
the math won't go through-- it's not always the case-- I will get an indication about
which one I would favor. But basically, it's purely a statistical
question at this point. MODERATOR: When there are few points,
and cross-validation is going to be done, is it a good idea to re-sample
to enlarge the current sample, or not really? PROFESSOR: So I have a small data set. That's the premise? And I'm doing cross-validation. So what is the-- MODERATOR: So the problem is,
since you have few samples, do you want to re-sample? PROFESSOR: So instead of
breaking them into chunks, keep taking subsets at random? Well, I don't have, from my experience, anything that would indicate that one would win over the other. And I suspect that if you are close to
10-fold, you probably are close to the best performance you can get with
variations of these methods. And the problem is that all of these
things are not completely pinned down mathematically. There is a heuristic part of it, because
even with cross-validation, we don't know what the correlation
is, et cetera. So we cannot definitively answer the
question of which one is better. It's a question of trying in a number
of problems, after getting the theoretical guidelines, and
then choosing something. What is being reported here is that
the 10-fold cross-validation stood the test of time. That's the statement. MODERATOR: When there is a big class
size imbalance, does cross-validation become a problem? PROFESSOR: When there is an imbalance
between the classes-- that is, a bunch of +1's and fewer -1's--
there are certain things that need to be taken into consideration, in order
to make learning go through well-- in order to basically avoid the
learning algorithm going for the all +1 solution, because it's
a very attractive one. So there are a bunch of things that can
be taken into consideration, and I can see a possible role
for cross-validation. But it's not a strong component
as far as I can see. The question of balancing them, making
sure that you avoid the all-constant solution, and things like that, will probably play a bigger role.
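This is not something prescribed here, but as a hedged sketch: if you do run cross-validation on imbalanced classes, one common precaution is to stratify the folds so each one keeps roughly the same +1/-1 proportions:

```python
import numpy as np

def stratified_folds(y, K=10, seed=0):
    """Split example indices into K folds while roughly preserving the class
    proportions in each fold, so no fold is starved of the rare class."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(K)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))  # shuffle each class separately
        for k, chunk in enumerate(np.array_split(idx, K)): # deal it out across the folds
            folds[k].extend(chunk.tolist())
    return [np.array(fold) for fold in folds]
```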
MODERATOR: How does the bias behave when we increase the number of points that we leave out-- the size t, if we leave t points out? PROFESSOR: The points we
leave out are the validation points. And if we are using the 10-fold or
12-fold, et cetera, the total number that go into the summation will be
constant, because in spite of the fact that we're taking different numbers,
we go through all of them, and we add them up. So that number doesn't change. MODERATOR: So how does it change,
if instead of doing 10-fold, you use 20-fold? How does that-- PROFESSOR: How does it change? It doesn't change the number of total
points going into the estimate of cross-validation. But what was the original question? MODERATOR: So how does
the bias behave? PROFESSOR: Oh. Well, given that the total number will
give you the error bar, and given that the bias is really a function of how
you use it, rather than something inherent in the estimate, the error bar
will give you an indication of how vulnerable you are to bias. Say that, if you take two scenarios
where the error bar is comparable, you have no reason to think that one of them
will be more vulnerable to bias than the other. Now, you need a very detailed analysis
to see the difference between taking one at a time coming from N minus 1,
et cetera, and to consider the correlations, and then taking 1/10 at
a time and adding them up, to find out what the correlation is, what the effective number of examples is, and therefore what the error bar is. In any given situation, that would
be a pretty heavy task to do. So basically, the answer is that as
long as you do a number of folds, and you take every example to
appear in the cross-validation estimate exactly once, then there is no
preference between them as far as the bias is concerned. MODERATOR: I think that's it. PROFESSOR: Very good. We'll see you on Thursday.