ANNOUNCER: The following program
is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we finished the VC analysis. And that took us three full lectures. The end result was the definition of the
VC dimension of a hypothesis set. It was defined as the most points that
the hypothesis set can shatter. And we used the VC dimension in
establishing that learning is feasible, on one hand, and then in
estimating the example resources that are needed in order to learn. One of the important aspects of
the VC analysis is the scope. The VC inequality, and the generalization
bound that corresponds to it, describe the generalization
ability of the final hypothesis you are going to pick. It describes that in terms of the VC
dimension of the hypothesis set, and makes a statement that is true for
all but delta of the data sets that you might get. So this is where it applies. And the most important part of the application is the disappearing blocks-- the quantities the bound does not depend on-- because that is what gives the VC inequality its generality. So the VC bound is valid for any
learning algorithm, for any input distribution that may take place, and
also for any target function that you may be able to learn. So this is the most theoretical part. And then we went into a little bit of
a practical part, where we are asking about the utility of the VC
dimension in practice. You have a learning problem-- someone comes with a problem, and you would like to know how many examples-- what size of data set-- you need in order to achieve a certain level of performance. The way we did this analysis is by plotting the core quantity of the VC bound, delta-- the probability of the bad event. And we found that it's
behaving regularly. We focused on a certain aspect of these
curves, which correspond to different VC dimensions. And the main aspect is
below this line. This line designates
the probability 1. We want the probability of the bad event
to be small, so we are working in this region. And the x-axis here is the number
of examples-- the size of your data set. And we don't particularly care about
the shape of these guys. They could be a little bit
nonlinear, et cetera. But the quantity we are looking for is,
if we cut through this way, what is the behavior of the x-axis, the number
of examples, in terms of the VC dimension, which is the label
for the colored curves? And we realized that, given this analysis,
it is very much proportional. And we were able to say that,
theoretically, the bound will give us that the number of examples needed
would be proportional to the VC dimension, more or less. And although the constant of
proportionality, if you go for the bound, will be horrifically
pessimistic-- you will end up requiring tens of
thousands of examples for something for which you really need
only maybe 50 examples-- the good news is that the actual
quantity behaves in the same way as the bound. So the number of examples needed
is, in practice, as a practical observation, indeed proportional
to the VC dimension. And furthermore, as a rule of thumb,
in order to get to the interesting part-- interesting values of epsilon and delta-- you need the number of examples to be about 10 times the VC dimension. More will be better. Less might work. But the ballpark of it is that you have
a factor of 10, in order to start getting interesting generalization
properties. We ended with summarizing the entire
theoretical analysis into a very simple bound, which we are referring
to as the generalization bound, that tells us a bound on the out-of-sample
performance given the in-sample performance. And that involved adding
a term, capital Omega. And Omega captures all the
theoretical analysis we had. It's a function of N, function of the
hypothesis set through the VC dimension, function of your tolerance
for probability of error, which is delta. And although this is a bound, we keep saying that, in reality, E_out will be equal to E_in plus something that behaves like Omega. And we will take advantage of that when we get to a technique like regularization.
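Written out-- with the explicit form of Omega being the one quoted in the previous lecture, and with the exact constants not being the point here-- the bound says that, with probability at least 1 - delta:

```latex
% VC generalization bound: with probability at least 1 - \delta,
E_{\mathrm{out}}(g) \;\le\; E_{\mathrm{in}}(g) + \Omega(N, \mathcal{H}, \delta),
\qquad
\Omega(N, \mathcal{H}, \delta) = \sqrt{\frac{8}{N}\,\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}}
```

So that's the end of the VC analysis,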
which is the biggest part of the theory here. And we are going to switch today to
another approach, which is the bias-variance tradeoff. It's a stand-alone theory. It gives us a different angle
on generalization. And I am going to cover it, beginning
to end, during this lecture. This is the plan. The outline is very simple. We are going to talk about the bias
and variance, define them, see the tradeoff, take a very detailed example--
one particular example-- in order to demonstrate what the
bias and variance are. And then we are going to introduce
a very interesting tool for illustrating learning, which are called
learning curves. And we are going to contrast the
bias-variance analysis versus the VC analysis on these learning curves, and
then apply them to the linear regression case that we
are familiar with. So this is the plan. The first part is the
bias and variance. In the big picture, we have been trying
to characterize a tradeoff. And roughly speaking, the tradeoff
is between approximation and generalization. So let me discuss this for a moment,
before we put bias and variance into the picture. We would like to get to small E_out. That's the purpose of learning. If E_out is small, then
you have learned. You have a hypothesis that approximates
the target function well. There are two components to this, and
we are very familiar with them now. We are looking for a good
approximation of f. That's the approximation part. But we would like that approximation
to hold out-of-sample. We are not going to be happy if we
approximate f well in-sample, and we behave badly out-of-sample. These are the two components. In the case of a more complex
hypothesis set, you are going to have a better chance of approximating
f, obviously. I have more hypotheses to run around. I'll be able to find one of
them that is closer to the target function I want. The problem is that, if you have the
bigger hypothesis set, you are going to have a problem identifying
the good hypothesis. That is, if you have fewer hypotheses,
you have a better chance of generalization. And one way to look at it is
that, I'm trying to approximate the target function. You give me a hypothesis set. Now, say I tell you I have good news for you. The target function is actually
in the hypothesis set. You have the perfect approximation
under your control. Well, it's under your hand, but not
necessarily under your control. Because you still have to navigate
through the hypothesis set in order to find the good candidate. And the way you navigate is
through the data set. That is your only resource for finding
one hypothesis versus the other. So the target function could be
sitting there calling for you. Please, I am the target
function, come. But you can't see it. You're just navigating with the training
set, you have very limited resources, and you end up with something
that is really bad. Having f in the hypothesis set
is great for approximation. But having a big hypothesis set, that is
big enough to include that, may be bad news, because you will
not be able to get it. Now if you think about it, what is the
ideal hypothesis set for learning? If I only had a hypothesis set that has
a singleton hypothesis, which happens to be the target function, then
I have the best of both worlds. The perfect approximation. I will zoom in directly, because
it is only one. Well, you might as well go
and buy a lottery ticket. That's the equivalent. We don't know the target function, so
we will have to make the hypothesis set big enough to stand a chance. And once we do that, then the question
of generalization kicks in. This is this big picture. So let's try to fit the VC analysis in
it, and then fit the bias-variance analysis in it, before we even know what
the bias-variance analysis is, in order to see where we
are going with this. So we are quantifying this tradeoff. And the quantification, in the case
of the VC analysis, was what? Was the generalization bound. E_in is approximation. Because I am actually trying to fit the target function-- I am just fitting it on the sample. That's the restriction here. So if I get this well, then I'm
approximating f well, at least on some points. This guy is purely generalization. The question is, how do you generalize
from in-sample to out-of-sample? So this is a way of quantifying it. Now the bias-variance analysis
has another approach. It also decomposes E_out, as we did in the generalization bound. But it decomposes it into two different entities. The first one is an approximation entity: how well H can approximate f. Well, what is the difference then? The difference is that the bias-variance analysis asks the question, how well can H approximate f, overall? Not on your sample. In reality. As if
you had access to the target function, and you are stuck with this hypothesis
set, and you are eagerly looking for which hypothesis best describes
the target function. And then you quantify how well that best
hypothesis performs, and that is your measure of the approximation
ability. Then what is the other component? The other component is exactly
what I alluded to. Can you zoom in on it? So this is the best hypothesis, and it
has a certain approximation ability. Now I need to pick it, so I have to use
the examples in order to zoom in into the hypothesis set, and
pick this particular one. Can I zoom in on it? Or do I get something that is a poor
approximation of the approximation? And that decomposition will
give us the bias-variance. And we'll be able to put them at the
end of the lecture, side by side, in order to compare: here is what the VC
analysis does, and here is what bias-variance does. The analysis, from a mathematical
point of view, applies to real-valued targets. And that's good news. Because remember, in the VC analysis, we
were confined to binary functions in the particular analysis that I did. You can extend it, but
it's very technical. So it's a good idea to see the same
tradeoff, and the same generalization questions, apply to real-valued functions. Now we have regression, and we are
able to make a statement about generalization on regression, which we
will apply very specifically to linear regression, the model that we already
studied that has real-valued outputs. And we are going to confine the analysis
here to squared error. The reason we are doing this is that, for
the math to go through in such a way that these two guys decompose cleanly-- with no cross terms-- we will need the squared error. So this is a restriction
of the analysis. There are ways to extend it. They are not as clean, so this
is the simplest form that we are going to use. Let's start. Our starting point is E_out. So let me put it-- Don't worry about the gap. The gap here will be filled. What do we have? We have E_out. E_out depends on the hypothesis
you pick. E_out is E_out of your
final hypothesis. How does it perform on
the overall space? And in order to do that, since we are
talking about squared error, you are going to take the value of your
hypothesis, and compare it to the value of the target function,
and take that squared. And that will be your error. So this is the building block for
getting the out-of-sample performance. Now the gap here comes from the fact
that, if you look at the final hypothesis, the final hypothesis
depends on a number of things. Among other things, it does depend
on the data set that I'm going to give you, right? Because if I give you a different data
set, you'll find a different final hypothesis. That dependency is quite important
in the bias-variance analysis. Therefore, I am going to make
it explicit in the notation. It has always been there, but I didn't
need to carry ugly notation throughout, when I'm not using it. Here I'm using it, so we'll
have to live with it. So now I'll make that
dependency explicit. I'm having now a superscript, which
tells me that this g comes from that particular data set. If you give me another data set, this
will be a different g. And you take the same g, apply it to x,
compare it to f, and this is your error. And finally, in order for it to be
genuinely out-of-sample error, you need to get the expected value of that
error over the entire space. So this is what we have. Now what we would like to do, we would
like to see a decomposition of this quantity into the two conceptual
components, approximation and generalization, that we saw. So here's what we are going to do. We are going to take this quantity,
which equals this quantity, as I mentioned here, and then realize
that this depends on the particular data set. I would like to rid this from the
dependency on the specific data set that I give you. So I'm going to play
the following game. I am going to give you a budget of
N examples, training examples to learn from. If I give you that budget N, I could
generate one D and another D and another D, each of them
with N examples. Each of them will result in a different
hypothesis g, and each of them will result in a different
out-of-sample error. Correct? So if I want to get rid of the
dependency on the particular sample that I give you, and just know the
behavior-- if I give you N data points, what happens?-- then I would
like to integrate D out. So I am going to get the expected value
of that error, with respect to D. This is not a quantity that
you are going to encounter in any given situation. In any given situation, you have
a specific data set to work with. However, if I want to analyze the
general behavior-- someone comes to my door, how many
examples do you have, and they tell me 100. I haven't seen the examples yet. So it stands to reason that I say,
for 100 examples, the following behavior follows. So I must be taking an expected value
with respect to all possible realizations of 100 examples. And that is, indeed, what
I am going to do. I'm going to get the expected
value of that. And this is the quantity that
I am going to decompose. And this obviously happens to be the
expected value of the other guy, and we have that. Now I am going to take this quantity,
the expression for the quantity that I'm interested in, and keep deriving
stuff until I get to the decomposition I want. The first order of business,
I have two expectations. The first thing I'm going to do, I am
going to reverse the order of the expectations. Why can I do that? I am integrating. So now I change the order
of integration. I am allowed to do that, because the
integrand is strictly non-negative. So I get this. And the reason for that is because
I am really interested in the expectation with respect to D, and I'd
rather not carry the expectation with respect to x throughout. So I am going to get rid of that
expectation for a while, until I get a clean decomposition. And when I get the clean decomposition,
I'll go back and get the expectation, just to
keep the focus clear. You focus on the inside quantity. If I give you the expression for the inside quantity for any x, then all you need to do in order to get the quantity that you need is to get the expected value of what I said, with respect to x.
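In symbols, the quantity being set up-- writing the dependence on the data set explicitly-- is:

```latex
% Expected out-of-sample error over data sets, with the order of expectations reversed
\mathbb{E}_{D}\!\left[E_{\mathrm{out}}\!\left(g^{(D)}\right)\right]
= \mathbb{E}_{D}\!\left[\mathbb{E}_{x}\!\left[\left(g^{(D)}(x) - f(x)\right)^{2}\right]\right]
= \mathbb{E}_{x}\!\left[\,\mathbb{E}_{D}\!\left[\left(g^{(D)}(x) - f(x)\right)^{2}\right]\right]
```

The inner expectation over D, for a fixed x, is the piece that gets decomposed next. So this is the quantity that we are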
going to carry to the next slide. Let's do that. And the main notion, in order to evaluate
this quantity, is the notion of an average hypothesis. It's a pretty interesting idea. Here is the idea. You have a hypothesis set, and you are
learning from a particular data set. And I am going to define
a particular hypothesis. I am going to call it the
average hypothesis. And because it's average, I am going
to give it a bar notation. So what is this fellow? Well, this fellow is
defined as follows. You learn from a data set. You get a hypothesis. Someone else learns from
another data set. They get another hypothesis, et cetera. So how about getting the expected
value of these hypotheses? What does that formally mean? We have x fixed. So we actually are in a good position,
because g of x is really just a random variable at this point. It's a random variable, determined
by the choice of your data. The data is the randomization source. x is fixed, so you think I have one
test point in the space, that I'm interested in. Maybe you are playing the stock market,
and now you are only interested in what's going to happen tomorrow. So you take the inputs, and these are
the only inputs you're interested in performing on. That's your x. And all of the questions
now pertain to this. You are learning from other data. And then you ask yourself, how am
I performing on this point? That is the point x. Now you are looking at this point. And you say, if you give me a data set
versus another, I am going to get different values for the
hypothesis on that point. It stands to reason that, if I take the
average with respect to all possible data sets, that would be awesome. Because now I am getting the benefit
of an infinite number of data sets. I am using them in the capacity
of one data set at a time. But I am getting value. Maybe the correct value
should be here. But since I am getting fluctuations
because of the data set, sometimes I'm here, sometimes I'm here, et cetera. If you get the expected value,
you will get it right. So this looks like a great
quantity to have. And in reality, we will
never have that. Because if you give me an infinite
number of examples, I'm not going to divide them neatly into N and N
and N, and learn from these, and then take the average. I'm just going to take all your examples, learn from all of them at once, and get the target function almost perfectly. So this is just a conceptual tool
for us to do the analysis. But we understand what it is. If you now vary x, your test
point in general, then you take that random variable and the expected value
of it, and the function that is constituted by the expected values at
different points is your g bar. So this is understood. Why do I need this for the analysis? Because if you look at the top thing,
I have here squared, so I'm probably going to expand it. And in expanding it, I am just going
to get a linear term of this. And I have an expected value. So you can see that I'm going to
get something that requires me to define g bar. That's the technical utility here. But the conceptual utility
is very important. And if you want to tell someone
what g bar is, think that you have many, many data sets. And the game is such that you learn from
one data set at a time, and you want to make the most of
it after you learn. What do you do? You take votes. You take just the average. You have this formula, with a 1 over K in front, K being the number of data sets. So this is the average.
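As a formula-- keeping in mind that the expectation over D is the actual definition, and the finite average over K data sets is only the mental picture--

```latex
% The average hypothesis, for a fixed test point x
\bar{g}(x) \;=\; \mathbb{E}_{D}\!\left[g^{(D)}(x)\right]
\;\approx\; \frac{1}{K}\sum_{k=1}^{K} g^{(D_k)}(x)
```

Now let's see how we can use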
g bar, in order to get the decomposition we want. Here, this is again the quantity I'm
passing from one slide to another, in order not to forget. This is the quantity that I'd like
to decompose. The first thing I am going to do, I am
going to make it longer, by the simple trick of adding g bar
and subtracting it. I'm allowed to do that, right? Doing that, I am going to
consolidate these two terms, and I'm going to consolidate
these two terms. And then expand with the squared. So let's do that. You get this. This is the first consolidated
guy with the squared. This is the second consolidated
guy with the squared. Am I missing something? Yes, I am missing the cross terms. So let's add the cross terms. And I get twice the product. This equals that. So the expected value here applies
to the whole thing. The first order of business is to look
at the cross terms, because they are annoying, and see if I
can get rid of them. That's where the benefit of
the squared error comes in. I am getting the expected value
with respect to D, right? So this fellow is a constant. Therefore, when I get the expected value
of this whole thing, all I need to do is get the expected value of
this part, because this one will factor out. Now, if I get the expected value of this,
the expected value of the sum is the sum of the expected values-- one of
the few universal rules that you can apply, without asking any
other questions. So I get the expected value of this. What is the expected value of g^D? Wait a minute. That was g bar, by definition. That's how we defined it, right? So I get g bar, minus the expected
value of a constant, which happens to be g bar. So this goes to 0, and happily
this guy goes away. Now I have only these two guys. So let's write them. I have the expected value of this whole
thing, which again is the sum of the expected values. The first term is a genuine expected value over the data sets. The second term is just a constant with respect to D, so its expected value is itself, and I write it without the expectation. So this is what I have as the expression for the quantity that I want.
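Putting the steps of the last two slides together-- add and subtract g bar, expand the square, and use the fact that the cross term averages out to zero--

```latex
% The cross term 2\,\mathbb{E}_D[g^{(D)}(x) - \bar{g}(x)]\,(\bar{g}(x) - f(x)) vanishes,
% because \mathbb{E}_D[g^{(D)}(x)] = \bar{g}(x) by definition. What remains is:
\mathbb{E}_{D}\!\left[\left(g^{(D)}(x) - f(x)\right)^{2}\right]
= \mathbb{E}_{D}\!\left[\left(g^{(D)}(x) - \bar{g}(x)\right)^{2}\right]
+ \left(\bar{g}(x) - f(x)\right)^{2}
```

Now let's take this and look at it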
closely, because this will be the bias and variance. This is the quantity again,
and it equals this fellow. Now let's look beyond the math,
and understand what's going on. This measure-- this quantity-- tells you
how far your hypothesis, that you got from learning on a particular
data set, differs from the ultimate thing, the target. And we are decomposing
this into two steps. The first step is to ask you, how far
is your hypothesis that you got from that particular data set, from the best
possible you can get using your hypothesis set? Now there is a leap here, because
I don't know whether this is the best in the hypothesis set. I got it by the averaging. But since I'm averaging from several
data sets, it looks like a pretty good hypothesis. I am not even sure that it's actually
in the hypothesis set. It's the average of guys that came
from the hypothesis set. But I can definitely construct
hypothesis sets, where the average of hypotheses does not necessarily
belong there. So there is some funny business there. But just think of it this way: this
is an intermediate step. Instead of going all the way to the
target function, here is your hypothesis set. It restricts your resources. Now I am getting the best possible
out of it, based on some formula. I'm learning from an infinite
number of data sets. This is a pretty good hypothesis. So how far are you from
that hypothesis? That's the first step. The second step is how far that
hypothesis, that great hypothesis, is from the ultimate target function. So hopping from your guy to the target,
goes into a small hop from your guy to the best hypothesis, and another
hop from the best hypothesis to the target function. And the neat thing is that they
decomposed cleanly. And we found that they decomposed
cleanly because the cross term disappeared. That's the advantage of the particular
measure that we have. Now we need to give names
to these guys. They will be the bias and variance. I'd like you to think for five seconds,
and you don't have to even answer the question. Which will be the bias, and which
will be the variance? Just look at which would be
a better description of them. I'm not going to ask. This is not a quiz, like last time. This is the bias. Why is it the bias? Because what I'm saying is that,
learning or no learning, your hypothesis set is biased away
from the target function. Because this is the best I could
do, under a fictitious scenario. You have infinite data sets, and you
are doing all of this, and you're taking the average, and that's the
best you could come up with. And you are still far away
from the target. So it must be a limitation
of your hypothesis set. I'm going to measure that limitation and
say that your hypothesis set, which is represented at its best by
this guy, is biased away from the target function. So this is the bias term. And again, bias applies to that
particular point x, the test point in the input space that
I'm interested in. The other guy must be the variance. Why is that? Because if I knew everything, if I could
zoom in perfectly, I would zoom in onto the best, assuming
this is there, so I have this guy. But you don't. You have one data set at a time. When you get one data set,
you get this guy. You get another data set,
you get another guy. These are different from that. So you are away from that, and I'm
measuring how far you are away. And because the expected value of this fellow is g bar, and I am comparing the squared difference from it, it is properly called variance. This is the variance of what I am
getting, due to the fact that I get a finite data set. Every time I get a data set, I get
a different one, and I am measuring the distance from the core
that I get here. So this we call the bias, and
this we call the variance. This is very clean. Now let's go back, and put it
into the original form. Remember this guy? This is where we started. We got the other expression, and then
we neglected to take the expected value with respect to x, in order
to simplify the analysis. We would like to get that back,
so we'll look at this. This was the expected value, with respect
to x, of the quantity we just decomposed. Now I take the decomposition and put it
back, in order to get the expected value of the out-of-sample error, in
terms of the bias and variance. So this will be what? This will be the expected value with
respect to x, of bias plus variance at x. And the expected value of the bias with respect to x, I'm just going to call it bias. The expected value of the variance term with respect to x, I'm going to call it variance, and that's what you get. And this is the bias-variance decomposition.
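Collected in one place, the decomposition reads:

```latex
% Pointwise definitions at a test point x
\operatorname{bias}(x) = \bigl(\bar{g}(x) - f(x)\bigr)^{2},
\qquad
\operatorname{var}(x) = \mathbb{E}_{D}\bigl[\bigl(g^{(D)}(x) - \bar{g}(x)\bigr)^{2}\bigr]

% Taking the expectation over x gives the summary quantities
\mathbb{E}_{D}\bigl[E_{\mathrm{out}}\bigl(g^{(D)}\bigr)\bigr]
= \mathbb{E}_{x}\bigl[\operatorname{bias}(x) + \operatorname{var}(x)\bigr]
= \operatorname{bias} + \operatorname{var}
```

Now I have a single number that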
describes the expected out-of-sample error. So I give you a full learning situation. I give you a target function, and
an input distribution, and a hypothesis set, and a learning algorithm. And you have all the components. You go about it, and learn from a data set. Someone else learns from another data set, and so on. And you get the expected value of
the out-of-sample error. And I'm telling you if this
out-of-sample error is 0.3, well, 0.05 of it is because of bias, and
0.25 is because of variance. So 0.05 means that your hypothesis set
is pretty good in approximation, but maybe it's too big. Therefore, you have a lot of
variance, which is 0.25. This is the decomposition. Now let's look at the tradeoff of
generalization versus approximation, in terms of this decomposition. That was the purpose. Here is the bias, explicitly
written as a formula. And here is the variance. We would like to argue that there is
a tradeoff, that when you change your hypothesis set-- you make it bigger,
more complex, or smaller. One of these guys goes up, and
one of these guys goes down. So I'm going to argue
about it informally. And then we'll take a specific
example, where we are going to get exact numbers. But this is just to realize
that this decomposition actually captures the tradeoff of approximation
versus generalization. Why is that? Let's look at this picture. Here, I have a small hypothesis set. One function, if you want, but, in
general, let's call it small. This one, I have a huge
hypothesis set. The black points here are hypotheses-- the candidates. Someone gives me a data set, and
I learn, and choose something. Now the target function
is sitting here. If I use this guy, obviously I am far
away from the target function. And therefore, the bias is big. If I have a big hypothesis set-- this
is big enough that it actually includes the f. Then when I learn, on average,
I would be very close to f. Maybe I won't hit f exactly, because
of the nonlinearity of the regime-- the regime where I get N examples, learn and keep the result, another N examples, learn and keep it, and then
take the average. I might have lost some because
of the nonlinearity. I might not get f, but I'll
get pretty close. So the bias here is very,
very small, close to 0. In terms of the variance here,
there is no variance. If I have one target function, I don't
care what data set you give me. I will always give you that function. So there's nothing to lose here
in terms of variance. Here, I have so much variety that, depending on the examples you give me, I may pick this one, and with another example set, another one-- because I'm fitting your data-- so I get a red cloud around this. And their centroid will be g bar,
the one that is good, but I may get one or the other. And the size of this guy
measures the variance. This is the price I pay. Now you can see that if I go from
a small hypothesis to a bigger hypothesis, the bias goes down,
and the variance goes up. The idea here, if I make the hypothesis
set bigger, I am making the bias smaller, because I
am making this bigger. I'm getting it closer to f, and being
able to approximate it better, so the bias is diminishing. But I am making this-- so
the bias goes down. And here the variance goes up. Why is the variance going up? Because the red cloud becomes
bigger and bigger. If I have this thing, then I have more
variety to choose from, and I am getting bigger variance to work with. So this is the nature of the tradeoff. You may not believe this, because
I just drew a picture and argued very informally. So now let's take a very concrete
example, and we will solve it beginning to end. And if you understand this example
fully, you will have understood bias and variance perfectly. So let's see. I took the simplest possible example that I can solve fully. My target is a sinusoid. That's an easy function. And I just wanted to restrict myself to the interval from -1 to +1. So I'm going to take sine of pi x. Scaling the argument by pi, so that a full period fits between -1 and +1, gets me the whole action. Therefore, the target function, formally defined, goes from the interval [-1, +1] to the real numbers. The co-domain is the real numbers. But obviously, the function would
be restricted from -1 to +1, as a range. Now the target function is unknown. That's what we have been preaching
for several lectures now. And now I am just giving you
the target function. Again, this is an illustration. When we come to learning, we will try
to blank it out, so that it becomes unknown in our mind. But in order to understand the analysis
of the bias-variance, we would like to know what target
function we are working with. We're going to get things in terms of
it, and then you will understand why the tradeoff exists. So the function looks like this.
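For reference, the target in this example is simply:

```latex
% Target function for the example
f : [-1, +1] \to \mathbb{R}, \qquad f(x) = \sin(\pi x)
```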
Surprise-- just like a sinusoid. Now the catch is the following. You are going to learn this function. I am going to give you a data set. How big is the data set? I am not in a generous mood today,
so I am just going to give you two examples. And from the two examples, you need to
learn the whole target function. I'll try. N equals 2. The next item is to give
you the hypothesis set. I'm going to give you two hypothesis
sets to play with. So one of you gets one, and another gets
another, and you try to learn and then compare the results. Well, I have two examples. So I cannot give you
a 17th-order polynomial. So I am just going to give
you the following. The two models are H_0 and H_1. H_0 happens to be the constant model. Just give me a constant. I am going to approximate the sine
function with a constant. OK, this doesn't look good. But that's what we are working with. And the other one is far
more sophisticated. It's so elaborate, you will love it. It's linear. Looks good now, having seen the
constant already, right? These are your two hypothesis sets. And we would like to see
which one is better. Better for what? That's the key issue. Let's start to answer the question of
approximation first, and then go to the question of learning. Here is the question of approximation,
H_0 versus H_1. When I talk about approximation, I
am not talking about learning. I am giving you the target
function, outright. It's a sinusoid. If it's a sinusoid, why don't I
say it's just a sinusoid, and have E_out equal 0? Oh, because the rule of the game is that
you're using one of the models. You have to use either the constant
or the linear. Do your best. Use all the information you have. But if you use the constant,
return a constant. If you use the linear,
return a line. You are not going to be able to return
a bigger hypothesis than those. That's the game. OK? Let's see what happens with H_1. Here is the target. I am trying to fit it with
a line, an arbitrary line. Can you think of what it looks like? A line is not much, but at least I can
get something like this, right? Try to get part of the
slope, et cetera. I can solve this. I get a line in general, calculate
the mean squared error. It will be a function of 'a' and 'b'. Differentiate with respect to 'a'
and 'b', and get the optimal. It's not a big deal. So you end up with this. That's your best approximation. This is not a learning situation, but
this is the best you can do using the linear model. Under those conditions,
you made errors. And these are your errors. You didn't get it right, and these
regions tell you how far you are from the target. Let's do it with the other guy. Now I have a constant. I want to approximate this
guy with a constant. What is the constant? I guess I have to work with 0. That's the best I have. Remember, it's mean squared error. So if I move away from 0, the big errors will contribute a lot, because they are squared. So I just put it in the middle,
and this is your hypothesis. And how much is your error? Big. The whole thing is your error. Let's quantify it. If you get the expected values of mean
squared error, you'll get a number, which here will be 0.5, and here
will be approximately 0.2. So the linear model wins. Yeah, I'm approximating. I have more parameters, sure. If you give me third order, I
will be able to do better. If you give me 17th order, I'll
be able to do better. But that's the game. In terms of approximation,
the more the merrier. Because you have all the information.
There's no question of zooming in. Now let's go for learning. This course is about machine learning, right?
Not about approximation. So this is the important part for us. Let's play the same game with
a view to learning. You have two examples. You are going to learn from them. You are restricted to one hypothesis
set or the other. So let's start with H_1, and
I'll go to H_0 again. This is your target function. Now you get two examples. I'm going to, let's say, uniformly pick
two examples independently, and I get these two examples. I'd like you to fit the examples, and
we'll see how well you approximate the target function. The first item of business is to
get rid of the target function. Because you don't know it. You only know the examples. So in a learning situation,
this is what you get. Now I ask you to fit a line. Line, two points. I can do that. This is what you do. Now that you settled on the final
hypothesis, I'm going to grade you. So I'm going to bring back the target
function, and compare this to that, and give you what is your
out-of-sample error. Let's do it for H_0. You have the same two points. You're fitting them with a constant. How would you do that? Probably the midpoint will give
you the least squared error on these two points, right? So this would be your
final hypothesis. And you get back your target function,
in order to evaluate your out-of-sample error, and
this is what you get. Now you can see what the problem is. I can compute the error here,
and I can have the error regions, and all of that. But this depends on which
two points I gave you. If I give you another two points, I
give you another two points, et cetera, I am not sure how to really
compare them, because it does depend on your data set. That's why we needed the
bias-variance analysis. That's why we got the expected value
of the error, with respect to the choice of the data set, so that we
actually are talking inherently about a linear model learning a target using
two points, regardless of which two points I'm talking about. So let's do the bias and variance
decomposition for the constant guy. Here is the figure. It's an interesting figure. Here I am generating data sets, each
of size 2 points, and then fitting a constant-- a horizontal line at the midpoint of the two values. I keep repeating this exercise,
and I am showing you the final hypothesis you get. I repeated it a very large
number of times. This is a real simulation, and these
are the hypotheses you get. You can see that when you get this line,
it means that the two points were equally distant from here. Sometimes I get the points here. Sometimes I get them equal to
that, so I get here. The middle point is
a little bit heavier. Because, obviously, the chance of
getting them on the two lobes is there, and so on. So this is basically the
distribution you get. Each of them will give you
an out-of-sample error. And the interesting thing for us is
the expected out-of-sample error. That's what will grade the model. Now what we are going to do, we are
going to get the bias and variance decomposition based on that. And that is our next figure. Look at this carefully. The green guy, the very light
green guy, is g bar of x. This is the average hypothesis
you get. How did I get that? I simply added up all of these guys,
and divided by their number. And it is expected obviously, by the
symmetry, that on average, I will get something very close to 0. The interesting thing is that you can see
now that g bar, here, happens to be also the best approximation. If I keep repeating this, I will
actually get the 0 guy, which I was able to get when I had the full target
function I was approximating. Here I don't have the full
target function. I have one hypothesis at a time. I am getting the average,
but I am getting this. So there is a justification for saying
that g bar will be the best hypothesis. Because this game of getting one at
a time, and then getting the average, does get me somewhere. But do remember, this is not the output
of your learning process. I wish it were. It isn't. The output of your learning process
is one of those guys, and you don't know which. It just happens that, if you repeat
it, this will be your average. And because you are getting different
guys here, there will be a variance around this. And the variance, I'm describing it
basically by the standard deviation you are going to get. So the error between the green line
and the target function will give you the bias. And the width of the gray region
will give you the variance. Understood what the analysis is? So that takes care of H_0. Let's go to H_1. So to remember, the learning
situation for H_0 was this. This is when I had the constant model. What will happen if you are actually
fitting the two points, not with a constant, which you do at the midpoint,
but you are fitting them with a complete line? What will it look like? It will look like this. Wow. You can see where the problem is. Talk about variance. Take two points. You connect them. Wherever the two points, you
get this jungle of lines. This is for exactly the same data sets
that gave me the horizontal lines in the previous slide. So this is what I get. Now I ask myself, what on
average will I get? You can immediately say, on average,
you'd better get a positive slope. There is a tendency to
get a positive slope. Because when you get the points
split, you will get this. Sometimes you get a negative
slope here, here. But that is balanced by getting
a positive slope here. You can argue this, but
you can do the math. And then you get the bias-variance
decomposition. This will be your average. This is g bar. And this will be the variance you get. The variance depends on x. This is the way we defined it. And when you want one variance to
describe it, you get the expected value of the width squared
of that gray region. The width of this gray region is the standard deviation. Now you can see exactly that I am getting a better approximation
than the previous guy. But I sure am getting very bad variance,
which is expected here. Now you can see what the tradeoff is. And the question is, given these
two models, which one wins from a learning scenario? You need to ask the question,
to remember what it is. I am trying to approximate a sinusoid. Is it better to do it with
a constant or a general line? The answer to that question
is obvious. But that is not the question
I am asking in learning. The question I am asking in learning,
you have two points coming from something I don't know. Is it better to use
a constant or a line? You notice the difference. I am going to put them side by side,
and then see which is the winner. So this guy has a big bias
and a small variance. This guy has a small bias
and a big variance. Let's get quantitative. What is the bias here? It's actually 0.5. Exactly the same we got when we
were approximating outright. It's the 0. That's the expected value. You get 0.5, the mean squared error. What is the bias here? It's 0.21. Interestingly enough, when we did the
approximation, it was about 0.2. And indeed, this is not
exactly the best fit. Remember when I told you there
is a nonlinearity aspect. You are taking two points at a time,
and then taking a fit, and then taking the average. And it's conceivable that this will give
you something different from if you have the full curve, and you
are fitting it outright. The difference is usually
very small, and it is. But here you get something which is not
exactly perfect, but is very close to perfect. So obviously, here the
bias is much smaller. Let's look at the variance. What is the variance here? The variance here is 0.25. It's not too bad. The variance here, we expect it to be bigger. But is it big enough to kill us? It's a disaster, complete and utter disaster. And now, when you see what is the expected out-of-sample error, you add these two numbers. Here I'm going to get 0.75, and here you are going to get something much bigger. And the winner is-- Now you go to your friends, and tell them that I learned today that, in order to approximate a sine, I am better off approximating it with a constant than with a general line. And have a smile on your face. Of course you know what you're talking about, but they might not really appreciate the humor here. This is the game. I think we understand it well.
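If you want to check those numbers yourself, here is a minimal Monte Carlo sketch-- not from the lecture, and the function names, trial count, and random seed are arbitrary choices of mine. It generates many two-point data sets from the sinusoid, fits each with a constant and with a line, and estimates the bias and variance of each model on a test sample. The constant model should come out near 0.5 + 0.25, and the line should show the small bias but much larger variance just described.

```python
# A minimal Monte Carlo sketch (not from the lecture); names, trial count,
# and seed are arbitrary choices made for illustration.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)          # the target: sin(pi x) on [-1, +1]

def fit_constant(x, y):
    # H0: h(x) = b; least squares on two points gives the midpoint of the y's
    b = y.mean()
    return lambda x_test: np.full_like(x_test, b)

def fit_line(x, y):
    # H1: h(x) = a x + b; the line through the two points
    a, b = np.polyfit(x, y, 1)
    return lambda x_test: a * x_test + b

def bias_variance(fit, trials=20000, n_test=1000):
    x_test = rng.uniform(-1, 1, n_test)
    preds = np.empty((trials, n_test))
    for t in range(trials):
        x = rng.uniform(-1, 1, 2)        # one data set: N = 2 points
        preds[t] = fit(x, f(x))(x_test)
    g_bar = preds.mean(axis=0)           # the average hypothesis, pointwise
    bias = np.mean((g_bar - f(x_test)) ** 2)
    var = np.mean((preds - g_bar) ** 2)
    return bias, var

for name, fit in [("H0 (constant)", fit_constant), ("H1 (line)", fit_line)]:
    b, v = bias_variance(fit)
    print(f"{name}: bias ~ {b:.2f}, var ~ {v:.2f}, expected E_out ~ {b + v:.2f}")
```

So the lesson learned, if I want to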
articulate it, is that when you are in a learning situation always remember: you
are matching the model complexity to the data resources you have,
not to the target complexity. I don't know the target. And even if I knew the level of
complexity it has, I don't have the resources to match it. Because if I match it, I will have the
target in my hypothesis set, but I will never arrive at it. Pretty much like I'm sitting in my
office, and I want a document of some kind, an old letter. Someone has asked me for a letter of
recommendation, and I don't want to rewrite it for you. So I want to take the old guy and just
see what I wrote, and then add the update to that. Before everything was archived
in the computers, it used to be a piece of paper. So I know the letter of recommendation
is somewhere. Now I face the question, should I
write the letter of recommendation from scratch? Or should I look for the letter
of recommendation? The recommendation is there. It's much easier when I find it. However, finding it is a big deal. So the question is not that the
target function is there. The question is, can I find it? Therefore, when I give you 100 examples,
you choose the hypothesis set to match the 100 examples. If the 100 examples are terribly
noisy, that's even worse. Because their information
to guide you is worse. That's what I mean by the
data resources you have. The question of data resources is, what do
you have in order to navigate the hypothesis set? Let's pick a hypothesis set that
we can afford to navigate. That is the game in learning. Done with the bias and variance. Now we are going to take just
an illustrative tool, called the learning curves. And then we are going to put the bias
and variance versus the VC analysis on those curves. So what are the learning curves? They are related to what we think of
intuitively as a learning curve. But they are a technical term here. They are basically plotting the expected
value of E_out and E_in. We have done E_out already. But here we also plot the expected value
of E_in, as a function of N. Let's go through the details. I give you a data set of size N. We know
what the expected value of the out-of-sample error is. We have seen that already in the
bias-variance decomposition. And this is the quantity. I know this is the quantity that I will
get in any learning situation. It depends on the data set. If I want a quantity that describes
just the size of the set, I will integrate this out, and get the expected
value with respect to D. That's the quantity I have. And the other one is exactly the
same, except it's in-sample. We didn't use it in the bias-variance
analysis. This one, I am going to get the expected
value of the in-sample. So I want to get, given this situation,
if I give you N examples, how well are you going to fit them? Well, it depends on the examples. But on average, this is how well
you are going to fit them. And you ask yourself, how
do these vary with N? And here comes the learning curve. As you get more examples,
you learn better. So hopefully, the learning curve--
and we'll see what the learning curve looks like. Let's take a simple model first. So it's a simple model. And because it's a simple model, it
does not approximate your target function well. The best out-of-sample error
you can do is pretty high. When you learn, the in-sample will be
very close to the out-of-sample. So let's look first at the behavior as
you increase N. As you increase N, hopefully the out-of-sample
error is going down. I have more examples to learn from. I have a better chance of approximating
the target function. And indeed, it goes down. And it can go down and down, until it gets to the absolute limit of your hypothesis set. Your hypothesis set is very simple. It doesn't have a very good
approximation for your target. This is the best it can do. The best you can do is
the best you can do. So that's what you get. When you look at the in-sample, it
actually goes the other way around. Because here my task is
simpler than here. Here I am trying to fit 5 examples. Here I am trying to fit 20 examples. And I only have the examples to fit. I'm not looking at the target function,
or anything like that. So obviously, I can use my degrees of
freedom in the hypothesis set, and fit the 5 examples better, and get
a smaller in-sample error. Whereas if I increase N, I will
get a worse in-sample error. It doesn't bother me, because
the in-sample error is not the bottom line. The out-of-sample is. And as you can see, although I am
getting worse in-sample, I am getting better out-of-sample. And indeed, the discrepancy between
them, which is the generalization error, is getting tighter and tighter
as N increases. Completely logical. By the way, this is a real model, so when
we talk about overfitting, I will tell you what that model is, as the
simple model and the complex model. The complex model, exactly the same
behavior, except it's shifted. It's a complex model, so it has
a better approximation for your target function. So it can achieve, in principle,
a better out-of-sample error. You have so many degrees of freedom, that
you were able to fit the training set perfectly up to here. This corresponds, more or less,
to the VC dimension. Up to the VC dimension, you can shatter everything. So you can shatter these guys. You can fit them perfectly. So you get zero error. You start compromising when you have
more guys and you cannot shatter, so maybe you have to compromise. And you end up starting
to have in-sample error. And the in-sample error goes up, and the
out-of-sample error goes down. The interesting thing is that in
here, when you have this, I fit the examples perfectly. I'm so happy. What is out-of-sample? An utter disaster. Absolutely no information. We didn't learn anything. We just memorized the examples. So here, again, the out-of-sample
error goes down. The in-sample error goes up. Same argument exactly. They get closer together. But obviously the discrepancy between
them is bigger, because I have a more complex set. Therefore, the generalization
error is bigger. The bound on it is bigger
in the VC analysis. And the actual value is bigger. So this is the analysis. It's a very simple tool. And the reason I introduced it here is
that I want to illustrate the bias and variance analysis versus the VC analysis,
using the learning curves. It will be very illustrative to
understand how the two theories relate to each other. Let's start with the VC analysis
on learning curves. These are learning curves. The in-sample error goes
up, as promised. The out-of-sample error goes down. There is a best approximation that
corresponds to this level of out-of-sample error, if we
actually knew the thing. And what did we do in the VC analysis? We had the in-sample error, which is
this region, the height of this region, and then we had a bound on the
generalization error, which is Omega. And we said that the bound behaves the
same way as the quantity itself. So the bound actually will
not be this thing. It will be way bigger. But in proportionality, it will
give us the same proportion. So as you increase N, the generalization
error goes down. The bound on it goes down. Omega goes down, which
we already realized. And obviously, you can
take another model. And if the model is very complex, the
discrepancy between them becomes bigger, which agrees with that. So this is the decomposition of it. Now I took some liberties, in order
to be able to do that. The VC analysis doesn't
have expected values. So I took expected values
of everything there is. So there is some liberty taken, in order to fit it in that diagram, but the principle holds. The blue region is the in-sample
error, and the red region is basically the Omega. That is what happens in
the generalization bound. Think for a moment, which region will be
blue and which region will be red in the bias-variance analysis? I'll get exactly the same
curves, the same model. So what will it be? It will be this. That's the difference. In the bias-variance, I got the bias
based on the best approximation. I didn't look at how you
performed in-sample. I assumed hypothetically that you
could look for the best possible approximation. And I charged that to the bias. And this is the bias you have. So this is the best you can do. And this is the error you are making. Again, there is a liberty taken here. Because this is genuinely the best
approximation in your hypothesis set. The one I am using for the
bias-variance analysis is the error on g bar. And we said, g bar will be
close in error to this guy. It may not even be in
the hypothesis set. So there is some liberty, but
it's not a huge liberty. This is very much close to what you
are getting in the bias-variance. And the rest of it is the variance. Because you get the bias plus that, and
you will get the expected value of the out-of-sample error. Now you can see why they are both
talking about the same thing. Both of them are talking
about approximation. That's the blue part. Here it's approximation overall. And here it's approximation in-sample. And both of them take into consideration
what happens in terms of generalization. Well, the red region here
is maybe twice the size. Not twice the size in general. It will be twice
the size actually in the linear regression example. But basically, they have
the same behavior. They have just different scale. So they capture the same principle of
generalizing, or the uncertainty about which hypothesis to pick, or how much
do I lose from going in-sample to out-of-sample. So they have the same behavior. And the only difference here is that,
here the bias obviously is constant with respect to N. The bias depends
on the hypothesis set. Now this is also an assumption. Because it says, I have 2 examples
and take the average. I will get an error. If I have 10 examples and take
the average, I get an error. Is it the same? Well, in both cases, you effectively used
an infinite number of examples. Because the first one you used two
at a time, and you repeated it an infinite number of times,
and you took an average. This, you used 10 at a time,
and you took an average. I grant you maybe the 10 will
give you a better situation. But again, it's a little bit of
a license, in order to be able to attribute the bias and variance to this
line, which happens to be the best hypothesis proper within
your hypothesis set. So this is the contrast between the
two theoretical approaches we have covered, in this lecture and the
previous three lectures. I am going to end up with the analysis
for the linear regression case. So I'm going to basically go
through it fairly quickly. This is a very good exercise to do. And if you read the exercise and you
follow the steps, it will give you very good insight into the
linear regression. I'll try to explain the
highlights of it. Let's start with a reminder
of linear regression. So linear regression,
I'm using a target. For the purpose of simplification, I am
going to use a noisy target, which is linear plus noise. So I'm using linear regression to learn
something linear plus noise. If it weren't for the noise,
I would get it perfectly. It's already linear. But because of the noise, I will
be deviating a little bit. This is just to make the mathematics
that results easier to handle. Now you're given a data set. And the data set is a noisy data set. So each of these is picked
independently. This y depends on x, and the only
unknown here is the noise. So you get this value, it gives you
the average, and then you add a noise to get the y. Do you remember the linear
regression solution? Regardless of what the target function
is, you look at the data, and this is what you get for the solution. You take the input matrix X and the output vector y. You do this algebraic combination, and whatever comes out is the output of the linear regression. This is your final hypothesis. We have done that.
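For the record, the one-step solution being recalled here-- in the notation of the earlier linear regression lecture, where X stacks the input vectors as rows and y collects the target values-- is:

```latex
% Linear regression solution (the pseudo-inverse of X applied to y)
w = \left(X^{\mathsf{T}} X\right)^{-1} X^{\mathsf{T}} y = X^{\dagger} y,
\qquad
g^{(D)}(x) = w^{\mathsf{T}} x
```

And now we are going to think about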
a notion of the in-sample error, not in-sample error as a summary quantity,
but the in-sample error pattern. How much error do I get
in the first example? How much error do I get in the
second, third, et cetera? Just for our purposes. So what would that be? Well, that would be what I got
in the final hypothesis. I apply the final hypothesis
to the input points. I am going to get a pattern of values
that my hypothesis is predicting. I compare them to the actual targets,
which happen to be stored in the y. And that would be an error pattern. So it would be plus something
minus something, plus something minus something. And if I add the squared values here,
get the average of those, I will get what we call the in-sample error. For the out-of-sample error, I am going
to play a simplifying trick here, in order to get the learning
curve in the finite case. Here I am going to consider that, in
order to get the out-of-sample error, what I'm going to do is just
generate the same inputs, which is a complete no-no in out-of-sample. Supposedly in out-of-sample, you get
points that you haven't seen before. You have seen these x's before. But the redeeming value is that I'm
now going to give you fresh noise. So that's the unknown, and that is what
allows me to say that it plays the role of an out-of-sample. I'm going to generate another set of
points with different noises, but on the same inputs in order to
simplify the analysis. You see that the x's
here are involved. And if I use the same inputs,
things will simplify. And in that case, if you ask yourself
what is the out-of-sample error of those, it's exactly the same form. I evaluate on the same points. They happen to be the points for the out-of-sample. And I'm comparing with y-- I'm calling it y prime-- which is exactly the same thing, except with a fresh noise term, another realization
of the noise. This is the outline of the setup to
get us the learning curves we want. When you do the analysis-- it is not that difficult at all-- you will get this very interesting curve. This is the learning curve, and
it has very specific values. sigma squared, that's the
variance of the noise. This is the best you can do. I expect that, because you told
me the target is linear. So I can get that perfectly. But then, there is this added noise. I cannot capture the noise. What is the variance of the
noise? sigma squared. So this is the error
that is inevitable. Look at the in-sample error. Up to d plus 1, you were perfect. Yeah, of course I am perfect. Because I have d plus 1 parameters in
the linear model, and I am fitting fewer points than that, so I can fit them perfectly. It doesn't mean much for the
out-of-sample error, but that's what I get. I start compromising when
I get more points. And as I go to more points, I am fitting the noise less and less. The noise is averaging out. Now I'm getting very, very close,
to as if there was no noise. Because the pattern persists,
which is the linear guy. And the noise, if I get more
examples, more or less cancels out in the fitting. I don't have enough degrees of
freedom to fit them all. So I get to average, until
eventually I get to as if I am doing it perfectly. And out-of-sample goes down. There is a very specific formula that
you can get, which is interesting. So let me finish with this. The best approximation error
is sigma squared. That's the line, right? What is the expected in-sample error? It has a very simple formula, in which everything is scaled by sigma squared. So what you have here is, it's almost perfect. And you are doing better than perfect, by this amount, the ratio of d plus 1 to N. Remember what d plus 1 was? For the perceptron, it
was the VC dimension. Here it's also a VC dimension of sorts,
the degrees of freedom that linear regression has. So we divide the degrees of freedom
by the number of examples. That is the factor that you get. And you realize here that this
is the best you can do. And here you are doing
better than the best. Why is it better than the best? Because I'm not trying to
fit the whole function. I am only fitting the finite sample. So I'm doing very well, and I'm very
happy about it, little do I know that I'm actually harming myself. Because what I'm doing here,
I am fitting the noise. And as a result of that, I am deviating
from the optimal guy. And I am paying the price
in out-of-sample error. What is the price I am paying
in out-of-sample error? It is the mirror image. I lose exactly in out-of-sample
what I gained in-sample. And the most interesting quantity
is the summary quantity. What is the expected generalization error? Well, the generalization error is the difference between this and that. I have the formulas for both, so all I need to do is write the difference. Let me magnify this. This is the generalization error.
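For reference, the expected values being read off here take the standard form for this setup (valid once N exceeds d + 1):

```latex
\mathbb{E}\left[E_{\mathrm{in}}\right]  = \sigma^{2}\left(1 - \frac{d+1}{N}\right),
\qquad
\mathbb{E}\left[E_{\mathrm{out}}\right] = \sigma^{2}\left(1 + \frac{d+1}{N}\right),
\qquad
\mathbb{E}\left[E_{\mathrm{out}} - E_{\mathrm{in}}\right] = \frac{2\sigma^{2}(d+1)}{N}.
```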
It has the form of the VC dimension divided by the number of examples, and in this case it's exact. And this is what I promised last time. I told you that this rule of
proportionality between a VC dimension and a number of examples persists to
the level where sometimes, you just divide the VC dimension by the number of
examples, and that will give you the generalization error. This is the concrete version of it, in spite of the fact that what we have here is not a VC dimension, since linear regression is real-valued and the VC dimension doesn't directly apply. But it's degrees of freedom, so it plays the role. We could actually solve for it and
realize that this is indeed the compromise between the degrees of
freedom I have, in the case of linear regression, and the number
of examples I am using. So we will stop here. And we will go into questions and
answers after a short break. Let's go into the questions. MODERATOR: Right. The first question is if you
can go back to slide 19. PROFESSOR: 19. MODERATOR: The question is if you can
explain how complex models are better than simple models. PROFESSOR: OK. Better in something. I think the key issue in the theory
is, there is a tradeoff. Nothing is better on all fronts, and
nothing is worse on all fronts. So let's compare the simple model
and the complex model. In terms of the ability to approximate,
whether that ability to approximate is in-sample, or whether
the ability to approximate is absolute, what is the ability to
approximate in the absolute? Here is my hypothesis set, and
I have a target function. The horizontal line, that height gives
you the error of approximation. So if you go from a simple model to
a complex model, you will be able to approximate better. That is obvious. And that also is inherited if your
approximation is focused only on the training examples. In this case, you are comparing
not the horizontal lines, but the blue curves. This is the error you make in
approximating the sample you get. And again, the approximation for the
simple model is worse than the approximation for the complex model. So if your game is approximation, and
that's your purpose, then obviously the complex model is better. In this particular case, you
can also ask yourself about the generalization ability. The generalization ability will be the discrepancy between either the blue and red curves. That would be the VC analysis. This would be how much I lose from going
from in-sample to out-of-sample. Or how much I lose from a perfect
approximation, in the case of the bias-variance analysis, to getting E_out,
because of my inability to zoom in on the right hypothesis. This would be that area here. So whether you are taking the difference
between the blue and red curve, or the difference between the
red curve and the black line, that area is smaller here than here. Therefore, the simple model
is better, as far as the generalization is concerned. Now because it's a tradeoff, and I have
one of them better and one of them worse, then the question is, when
I put them together, which is better? Because the bottom line in learning
is the red curve. That's what I care about. This is the performance of the system
that I'm going to deliver to my customer, and they're going
to test it out-of-sample. And if they get it right,
they will be happy. So now because I have two quantities
that I'm adding, and one of them is going down, and one of them is going
up, then it is obvious that the sum could go either way. And in this case, you can see
that it is going either way. For example, if you have few examples,
then E_out here is OK. It's not great, but it's decent. If you have the same number of examples
here, E_out is a disaster. So if you have few examples,
you simply cannot afford the complex model. You are better off working with a simple
model, and you will get better out-of-sample error. If I give you a much bigger resource of examples-- if you are here, now this one is limited by the fact that it's simple. It cannot get any better. It has all the information. It zooms in perfectly, but
it cannot get any better. This guy now gets to use its degrees of
freedom properly, and gets you to a smaller value. So for larger number of points, you
get a better performance here. That's why we are saying that you should
match the complexity of the model to the data resources you have,
which in this case are represented by N. We're talking about different target
functions and different things. But in choosing this model or another,
what really dictates the performance is the number of examples versus
the complexity of the model. MODERATOR: OK. When you did the analysis for linear
regression, if you did it using the perceptron model, would you get
the same generalization error? PROFESSOR: Let's go for that. The bias-variance analysis, which is also inherited in the learning curves, is very clean when you use mean squared error. Obviously, you can use mean squared
error in the perceptron. There will be a correspondence here. But the ability to get such a clean
formula here really depends on the very particulars of linear regression. If you go back to the previous slide, it was very critical to make the assumption that the out-of-sample is constructed this way, and to make the target very specifically linear plus noise, in order to be able to simplify. The result, by the way, holds
in general, asymptotically. So if you take genuine out-of-sample,
which means that you pick different points, you will get
a different matrix X. So you will take the w that you got from in-sample and apply it to X dash in this case, which is this, and then y dash. And the problem is that when you plug
it in here, and try to get a formula for that, the formula will depend on
how the X dash relates to the X. When it's the same, they cancel
out neatly, and you get the formula that I had. But asymptotically, if you make certain
assumptions about how X is generated and you take the asymptotic
result, you will get the same thing. So the short answer is the following. The analysis in the exact form that I
gave, which gives me these very neat results, is very specific to linear
regression, very specific to the choice of out-of-sample as I did it, if you want to give the answer
exactly in a finite case. If you use a perceptron, you will be
able to find a parallel, but it may not be as neat. MODERATOR: Quick clarification. sigma squared is the variance
of the noise in the-- PROFESSOR: Yeah. I just realized that. I have been using bias-variance. The lecture is called bias-variance, and
now we have variance of the noise. So obviously, I am so used to these
things that I didn't notice. When I say the variance here, this has
absolutely nothing to do with the bias-variance analysis
that I talked about. It's noise, and I am trying to measure the energy of it. It's zero-mean noise, so its energy is just its variance. So I should have called it-- the energy of
the noise is sigma squared, in order not to confuse people. But I hope that I did not
confuse too many people. MODERATOR: Can the bias-variance
analysis be used for model selection? PROFESSOR: The bias-variance analysis, just because it is so specific, actually assumes that you know the target function, if you want to get the quantities explicitly. So, for example, in linear regression, I assumed the form is linear plus noise. For the sinusoidal case, we got the
answers, and we were able to choose. But you actually knew that
it was a sinusoid. So the bias-variance analysis
is taken as a guide. But it's a very important guide. Because I can ask myself how do I
affect-- I want to get E_out to be down. Now I know that there
are two contributing factors, bias and variance. Can I get the variance down, without
getting the bias up? That's a bunch of techniques. Regularization will belong
to that category. Can I get both of them down? That will be learning from hints. There will be something that affects
both of them, and so on. So you can map different techniques
to how they are affecting the bias and variance. I would say that, in terms of any application to a learning situation, it's a guideline rather than something that I'm going to plug in to tell you what the model is. The answer for model selection is
mostly through validation, which we're going to talk about in a few lectures. And this is the gold standard for the
choices you make in a learning situation, including
choosing the model. MODERATOR: I have a question
getting a little bit ahead. In ensemble methods, like boosting or
something, is there a reason under these analyses why those methods work? PROFESSOR: I almost included
this in the lecture, but I thought it was one too many. If you look at the idea of g bar. Let me try to get to its definition. This was just a theoretical
tool of analysis. I have g bar equals the expected
value of that. And if I want to do it with a finite number of data sets, I sum these up and normalize by 1 over K.
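In symbols, the finite-sample surrogate being described, with K data sets and g_k the hypothesis learned from the k-th one (the notation is mine):

```latex
\bar{g}(x) \;=\; \mathbb{E}_{\mathcal{D}}\left[\, g^{(\mathcal{D})}(x) \,\right]
\;\approx\; \frac{1}{K}\sum_{k=1}^{K} g_k(x).
```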
Although this was just a theoretical way of getting the bias-variance decomposition, and this is
a conceptual way of understanding what it is, there is an ensemble
learning method that builds exactly on this, which is called Bagging--
bootstrap aggregation. And the idea is, what do I need
in order to get g bar? We said g bar is great,
if I can get it. But it requires an infinite number
of data sets, and I have only one data set. So the idea of Bagging is that, I am
going to use my data set to generate a large number of different data sets. How am I going to do that? Well, that's bootstrapping. Bootstrapping always looks like magic. You know where the expression comes from? Bootstrapping, you try to lift
yourself by pulling on your bootstraps. Which obviously, you cannot do,
because you are pulling on it. But that's what you do. Here we are trying to create something that isn't there. And in this particular case, what you
do is sample randomly from your data set, in order to get different data sets, and then average.
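A minimal sketch of that recipe, with fit and predict standing in for whatever base learner you choose (the interface and names here are illustrative, not from the lecture):

```python
import numpy as np

def bagging_predict(X_train, y_train, x, fit, predict, K=50, seed=0):
    """Bootstrap aggregation: average K hypotheses, each trained on a
    bootstrap resample (drawn with replacement) of the one data set we have."""
    rng = np.random.default_rng(seed)
    N = len(y_train)
    predictions = []
    for _ in range(K):
        idx = rng.integers(0, N, size=N)         # bootstrap resample of the indices
        g_k = fit(X_train[idx], y_train[idx])    # train on the resampled data set
        predictions.append(predict(g_k, x))      # this hypothesis's prediction at x
    return np.mean(predictions, axis=0)          # the average plays the role of g bar
```

The averaging is what cancels out some of the fluctuation from one resample to the next.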
And believe it or not, that actually pays a dividend. It gives you some of the benefit of ensemble learning. There are other, obviously
more sophisticated, methods of ensemble learning. And one way or the other, they appeal to
the fact that you are reducing the variance by averaging
a bunch of stuff. So you can say the idea is either taken outright, as in Bagging, or serves as inspiration, in the sense that it's a good idea
to average because you cancel out fluctuations. MODERATOR: If we use the Bayesian
approach, does this bias-variance dilemma still appear? PROFESSOR: Repeat
the question, please. MODERATOR: If you use a Bayesian
approach, does this bias-variance still appear? PROFESSOR: OK. The bias-variance is there to stay. It's a fact. And we can take a particular approach,
and then we are going to perhaps find an explicit expression for
the bias, and an explicit expression for the variance. But nothing will change about the
nature of things because of the approach I have. Now the Bayesian approach is very
particular, because the Bayesian approach makes certain assumptions. And after you make these assumptions,
you can answer all questions perfectly. You can answer questions like that,
and other questions as well. And I will talk about the Bayesian
approach in the very last lecture of the course. So I will defer answers that are specific to that until that point. But basically, the answer to this very specific question is like asking, does the VC dimension change if you apply the Bayesian approach? Well, the Bayesian approach is just a bunch of assumptions. The VC dimension is still there. Maybe by using the Bayesian approach you'll be
able to find more direct quantities to predict what you want. But the VC dimension is there, because
it's defined in a general setup. MODERATOR: A question about relation
with numerical function approximation. In that field, there's interpolation
and extrapolation. When is there extrapolation
in machine learning? PROFESSOR: Function
approximation is one of the fields that is very much related. Because we are given a finite sample, the points are coming from a function, and you're trying to approximate it. And this is one of the applications. In general, interpolation is
easier than extrapolation, because you have a handle. And if you want to articulate that in
terms of the stuff we have, the variance in interpolation is smaller
than the variance in extrapolation in general. Remember, the lines in the sinusoid? They were all over the place. If you take between the two points-- so
I have the sinusoid, and I have the two points, I'm connecting
them with a line. Between the two points, I am very much
in good shape, because the sine is this way, and I am this way. So it's not that big a deal. The further out you go, then there
is a lot of fluctuation. And that is reflected in
the extrapolation. MODERATOR: OK. When the variance is big, we
know we're extrapolating. Is that the answer? PROFESSOR: No. I will say there is an association
between them. To answer this specifically, you need
to understand the particular case. There may be cases where the extrapolation doesn't have a lot of variance and whatnot. I'm just trying to map, in general, what the quantity here corresponds to in that setting. The problem with extrapolation can be
posed, in this case, in terms of more variance than interpolation. But I'm not making a mathematical
statement that this is guaranteed to be the case. MODERATOR: Could you explain what the
literature means by the bias-variance covariance dilemma? PROFESSOR: OK. You can pursue this analysis a little
bit further to the cases where you have cross terms. Particularly for boosting,
this is the case. And then there is a question of, I
am trying to get these guys that I'm going to average in order to
get the final hypothesis. That's my game. Now it would be nice if I could get them to be independent. Because when they are independent, adding them up reduces the variance
in a very good way. But then, in general, when you
actually apply some of these algorithms, there is a correlation
between one and another. So there's a covariance. So there's a question of the
balance between the two. But it really is, in terms of
application, related more to ensemble learning than to just the general
bias-variance analysis as I did it. Because in the bias-variance analysis,
I had the luxury of picking independently generated data sets,
generating independent guys, and then averaging them, because it's
a conceptual aspect. But when you actually are using
a technique, where you are constructing these guys based on variations of the
data set, then the covariance starts playing a role. MODERATOR: A question about,
I guess, naming the things. Is linear regression
actually learning? Or is it just fitting along the lines
of function approximation? PROFESSOR: Linear regression is a learning technique. And fitting is the first
part of learning. So you always fit, in order to learn. The only added thing is that you want
to make sure that, as you fit, you always perform well out-of-sample. That's what the theory was about. I've been spending four lectures trying
to make sure that when you do the intuitive thing, I give you data and you fit it, something good comes of it. You could do the fitting itself without taking
a machine learning course. Now I'm telling you that you have to
have the checks in place, such that when you fit in-sample, something good
happens in what you care about, which is out-of-sample. So that's the-- MODERATOR: All right. I think that's it. PROFESSOR: Very good. We'll see you next week.