ANNOUNCER: The following program
is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we introduced some important
concepts in our theoretical development. And the first concept was dichotomies. And the idea is that there is an input
space behind this opaque sheet, and there is a hypothesis that's
separating red regions from blue regions. But we don't get to see that. What we get to see are just the
data points, holes in that sheet if you will. And there could be very exciting stuff
happening behind that sheet, and all you get to see is when the boundary
crosses one of these points, and a blue point turns red or vice-versa. So if you think of the purpose for the
dichotomies, we had a problem with counting the number of hypotheses,
because we end up with a very large number. But if you restrict your attention to
dichotomies, which are the hypotheses restricted to a finite set of points,
the blue and red points here, then you don't have to count everything
that is happening outside. You only count it as different when
something different happens on those points. So a dichotomy is a mini-hypothesis,
if you will. And it counts the hypotheses only
on the finite set of points. This resulted in a definition that
parallels the number of hypotheses, which is the number of dichotomies
in this case. So we define the growth function. The growth function is-- you pick
the points x_1 up to x_N. You pick them wisely, with a view to
maximizing the dichotomies, such that the number you get will be more than any
number another person gets with N points. That's the purpose. So you take your hypothesis set, which
applies to the entire input space, and then apply it
only to x_1 up to x_N. This will result in a pattern of
+1's and -1's, N of them. And as you vary the hypothesis within
this set, you will get another pattern, another pattern,
another pattern. So you will get a set of different
patterns that are all the dichotomies that can be generated by this hypothesis
set, on this set of points. And the number of those guys is
what we are interested in. It will play the role of the
number of hypotheses. And that is the growth function.
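As an illustration (this sketch is mine, not part of the lecture), the growth function of a simple model can be computed by brute force. Here it is for the positive-rays model that shows up later in this lecture; the function name and the sample points are assumptions:

```python
def growth_positive_rays(points):
    """Brute-force count of the dichotomies that positive rays,
    h(x) = sign(x - a), generate on a fixed set of 1-D points."""
    xs = sorted(points)
    # One threshold below all points, one between each adjacent pair,
    # and one above all points covers every distinct behavior of a ray.
    thresholds = [xs[0] - 1.0]
    thresholds += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    thresholds += [xs[-1] + 1.0]
    patterns = {tuple(+1 if x > a else -1 for x in points) for a in thresholds}
    return len(patterns)

print(growth_positive_rays([0.5, 1.7, 3.2]))  # 4 dichotomies, i.e. N + 1 for N = 3
```

Now in principle, the growth function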
can be 2 to the N. You may be in an input space and a hypothesis set,
such that you can generate any pattern you want. However, in most of the cases, the
restriction of using hypotheses coming from H will result in missing
out on some of the patterns. Some patterns will simply
be impossible. And that led us to the idea
of a break point. For the case of a perceptron in
two dimensions, which is the case we studied, we realized that for four
points, there will always be a pattern that cannot be realized
by a perceptron. There is no way to have a line come
here, and separate those red points from the blue points. And any choice of four points will
also result in missing patterns. Therefore, the number k equals 4, in this
case, is defined as a break point for the perceptrons. And our theoretical goal is to take that
single number, which is the break point, and be able to characterize the
entire growth function for every N. And therefore, be able to characterize
the generalization, as we will see. We then talked about the maximum
number of dichotomies, under the constraint that there
is a break point. And we had an illustrative example to
tell you that, when you tell me that you cannot get all patterns on any-- in this case, two points-- that is a very strong restriction on
the number of dichotomies you can get on a larger number of points. So this is the simplest case. If you take any two columns, you
cannot get all four patterns. That's by decree. I'm telling you that the hypothesis
has a break point of 2. And then I'm asking you, under those
constraints, how many lines you can get, how many different
patterns you can get. And you go and you add them
up, and you end up in this case with only four. So you lost half of them. And you can see that if we have 10
points, and you apply the same restriction, there will be so many
lost, because now the restriction applies to any pair of points. Now, if you look at this table,
this does not appeal to any particulars of the hypothesis set or the
input space, other than the fact that the break point is 2. I could be in a situation, where the
hypothesis set cannot generate some of these guys for other reasons. But here, I'm abstracting only
a hypothesis set and an input space. I don't want to bother to
know more about them. Just tell me that they have a break
point, and I'm trying to find under that single constraint, how many
can I possibly have? And I already have, by that
combinatorial constraint, a restriction which is strong enough
to get me a good-enough result. That's good, because now I don't have
to worry about every hypothesis set, and every input space, you give me. I just ask you:
what is the break point? And I'm able to make a statement about
the growth function not being bigger than something. That is the key. We move on to today's lecture, and
the title is, properly, the Theory of Generalization. It's very theoretical. And today's lecture is the most
theoretical of the entire course. So fasten your seat belts,
and let's start! We have two items of business. The first one is to show that the growth
function, with a break point, is indeed polynomial. The second one is to show that we can
actually take that notion, the growth function, and put it in place of
M, the number of hypotheses, in Hoeffding's inequality. So basically, we are saying in the
first part: it's worthwhile to study the growth function. Because being polynomial will
be very advantageous. And then, the second one is: we can
actually do something good with it. We can do the replacement. These are the only two items. Let's start. We are going to bound the growth
function by a polynomial. And I just wanted to point some
of the aspects of that. If I say m_H of N is polynomial, it's not
that I am going to actually solve for the growth function, and show that
it is this particular polynomial, and the coefficients. All I am saying is that it is really
just bounded above by a polynomial. I don't have to get the particulars
of m_H of N, the growth function. I am going to just tell you that this
is less than something, less than something, less than a polynomial. That's all I need, because eventually
I am going to put this in the Hoeffding inequality. And as long as it's bounded by
a polynomial, I am in business. Because the negative exponential
will kill it, as we discussed, and we are OK. So we can be a bit loose, which
is very good in theory. Because now you leave a lot
of artifacts that you don't need to study. And just talk about the upper bound in
the general case, and still get what you want to get. The key quantity we are going to use,
which is a purely combinatorial quantity, we are going to
call it B of N and k. This is exactly the quantity we
were seeking in the puzzle. I give you N points. I tell you that k is a break point, and
ask you: how many different patterns can you get under those conditions? In that case, we had three points
and the break point was 2. And we answered this question
by construction. We played around with the patterns until
we got it, and then we said it's 4. Now, as I develop the theory,
the puzzle will come up in one of the results. I would like you to keep an eye and
say which slide, and which particular part of the slide, addresses the very
specific puzzle we talked about. The definition here is the maximum
number of dichotomies on N points, such that they have a break point k. So this is N and this is k.
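Written out, the definition is:

```latex
B(N,k) \;=\; \max\bigl\{\ \#\text{ of distinct } \pm 1 \text{ rows on } N \text{ columns}
\ :\ \text{no } k \text{ columns contain all } 2^{k} \text{ patterns}\ \bigr\}
```

And the good thing here is that I didn't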
appeal to any hypothesis set, or any input space. This is a purely combinatorial
quantity. And because it's a combinatorial quantity,
I am going to be able to pin it down exactly, as it turns out. And now, when I pin it down exactly, you
go and you find the fanciest input space, and the fanciest hypothesis set. You pick the break point for that, and
you use that here, ridding the problem of all the other aspects, and
you still are able to make an upper bound statement. You can say that the growth function,
for the particular case you talked about, is less than or equal
to-- and just go to this combinatorial quantity. The plan is clear. So let's look at the bound
for B of N, k. And we are going to do it recursively. It's a very cute argument, and I am
going to build it very carefully. So I want your attention. Consider the following table. Very much like the puzzle, we are
going to list x_1, x_2, up to x_N. N points, which used
to be three points. And I am going to try to put as many
patterns as I can, under a constraint that there is a break point. So I will be putting the first pattern
this way, and the second pattern, and so on, trying to fill this table. Now, I am going to do a structural
analysis of this, and this will happen through this division. Let's look at it. Still the same problem, x_1
up to x_N are my points. And I am trying to fill this with
as many rows as possible, under a constraint of a break point. But now I am going to isolate
the last point. Why am I isolating the last point? Because I want a recursion. I want to be able to relate this fellow,
to the same fellow applied to smaller quantities. And you have seen enough of that to
realize that, if I manage to do that, I might be able to actually
solve for B of N and k. That's why I'm isolating
the last point. After I do the isolation, I am going
to group the rows of the big matrix, into some groups. This is just my way
of looking at things. I haven't changed anything. What I am going to do, I am going to
shuffle the rows around, after you have constructed them. So we have a full matrix now, and I am
shuffling them, and putting some guys in the first group. And the first group I
am going to call S_1. Here is the definition
of the group S_1. These are the rows that appear
only once, as far as x_1 up to x_N-1 are concerned. Well, every row in its entirety appears
only once, because these are different rows. That's how I'm constructing
the matrix. But if you take out the last guy, it is
conceivable that the first N minus 1 coordinates happen twice, once with
extension -1, once with extension +1. So I am taking the guys that
go with only one extension, whatever it might be. Could be -1 or could be +1,
but not both, and putting them in this group. Fairly well defined. So you fill it up, and these are all the
rows that have a single extension. Now, you go under this, and you
define the number of rows in this group to be alpha. It is a number. I am just going to call it alpha. And you can see where this is going,
because now I'm going to claim that the B of N and k, which is the total
number of rows in the entire matrix, is alpha plus something. That is obvious. I have already taken care of alpha,
and I am going to add up the other stuff later on. So what is the other stuff? That is the stuff I am
going to call S_2. And you probably have a good
guess what these are. These are the guys that happen
with both patterns. That is, they happen with extension
+1 and with -1. That is disjoint from the first group. A typical member will
look like this. This is the same guy from x_1 up to
x_N-1, as it appears here. It just appears here with +1,
and appears here with -1. And I keep doing it. So what I'm doing, I just reorganize
the rows of the matrix to fall into these nice categories. The other guy? Exactly the same thing. So the second one corresponds to
the second one, and so on. Now, that covers all the rows. I look at x_1 up to x_N-1. I either have both extensions,
or one extension. That's it. One extension belongs to the first
group. Two extensions belong to the second group in both ways,
with +1 and -1. In terms of the counting, this has
beta rows, whatever beta might be. This also has beta rows, because
they're identical. And therefore, the number B of N and
k, which I'm interested in, is alpha plus 2 beta. That is complete. Just calling things names. So now, I am going to try to find
a handle on alphas and betas, so that I can find a recursion for the big
function B of N and k. B of N and k is the maximum
number of rows-- patterns I can get on N points,
such that no k columns have all possible patterns. That's the definition. I am going to relate that to the same
quantity on smaller numbers, smaller N and smaller k. So the first is to estimate
alpha and beta. I'd like to ask you to focus on
the x_1 up to x_N-1 columns. And I am going to help you visually
do that, by graying out the rest. Now for a moment, look at these. Are these rows different? They used to be different when
you had the extension. Well, let me see. The first group, I know they
are different, because they have one extension. If there is one which is repeated, then
it must be repeated with both extensions, in order to get different
rows all over, and that violates the condition for being here. They are here because they
have only one extension. These guys are the same. This one appears with -1, and
here appears with +1. But if you cut the last guy, this
guy is identical to this guy, right? This second guy is identical
to the second guy. So I cannot count these
as different rows. I can do that when I gray
out one of the groups. Now, these are patently different. Nothing here is repeated, because we said
they have only one extension, and they are all tucked in here. These two guys, there are no two guys
here that are equal, because they all have the same extension. And supposedly, the whole row
makes them different rows. Therefore, these guys are
different from each other. And these guys are different from here
because again, if they are equal, then I will have an extension. And then the guys here will belong to
a row that had both extensions. Very easy. Just a verbose argument,
but we end up with these guys being different. Now, I like the fact that these guys are
being different, because when they are different, I can relate
them to B of N and k. B of N and k was the maximum number of
patterns-- different rows, that's how I am counting them-- such that a condition occurs. So what is the condition
that is occurring here? I can say that alpha plus beta, which
is the total number of rows or patterns in this mini-matrix, can I say
something about a break point for this small matrix? Yeah. The original matrix, I could
not find all possible patterns on any k columns, right? So I cannot possibly find all possible
patterns on any k columns on this smaller set. Because if I find all possible patterns
on k columns here, they will serve as all possible patterns
in the big matrix. And I know, that doesn't exist. So I can now confidently say that alpha
plus beta, which is the number of different guys here, is less than or
equal to B of N minus 1, because I have only x_1 up to x_N-1, and k, because that is the break
point for these guys as well. Why am I saying less than
or equal to, not equal? When I constructed the original matrix,
it was equal by construction. I looked at the maximum number
of rows I get. And I told you this is
what I constructed. And therefore, by definition,
this is B of N and k. Here, I obtained this
in a special way. I took out a guy from the other
matrix, and did that. I am not sure that this is the best way
to maximize the number of rows. At least it's conceivably not. But for sure it's at most B of
N minus 1 and k, because that is the maximum number. I am safe saying that it's
less than or equal to. So I have the first one. Now, let's try to estimate
beta by itself. This is the more subtle argument. In this case, we are going to
focus now on the second part only, the S_2 part. The guys that appear twice
in the big matrix. So let's focus on them. Now, when I focus on them, these
guys are very easy to analyze. They are here and here
exactly the same. This block is identical
to this block. The interesting thing, when I look at
these guys, is that I am going to be able to argue that these guys have
a break point of k minus 1, not k. The argument is very cute. Let's say that you have all possible
patterns on k minus 1 guys, in this small matrix. First, I have to kill these. These are not different guys, because
these are identical to these. So let me reduce it to the guys
that are patently different. I'm now looking at this matrix. I am claiming that k minus
1 is a break point here. Why is that? Because if you had k minus 1 guys
here, where you get all possible patterns, then by adding both copies,
+1 and -1, and adding x_N, you will be getting k columns overall that
have all possible patterns, which you know you cannot have because k is
a break point for the whole thing. So now I'm taking advantage of the
fact that these guys repeat. It's very dangerous to have k minus 1
guys, because now I have the k that I know doesn't exist. Let's do it illustratively. Here is a pattern here. You add the +1 extension
and the -1 extension, by taking this column. If you get all possible patterns on k
minus 1, and you add this guy, then you have both patterns here, and you will
end up with all possible patterns on k points on the overall matrix. That enables me to actually count
this in terms of B of N and k, again, with the proper values of N and k. We can say that beta is less than
or equal to-- again, less than or equal to because I obtained this
matrix by lots of eliminations. I didn't do it deliberately to maximize
the number, so I don't know whether it's the maximum. But I sure know that it's less than or
equal to the maximum, by the definition of what a maximum is. And that would be of what? I have N minus 1 points and I argued
for a break point of k minus 1. So I end up with this fellow. Both arguments are very simple. Now, we pull the rabbit out of the hat! You put it together. What do we have? This is the full matrix. The first item was just calling things
names, the number of rows in the big matrix is B of N and k,
by definition-- by construction. I organized it such that there is alpha,
and there is beta, and there is another beta, so this one is the first
result I got, which is B of N and k equals alpha plus 2 beta. What else did I get? I got that alpha plus beta is at
most B of N minus 1 and k. That was the first slide
of the analysis. We have seen that. So this basically takes this matrix,
and does an analysis on it. And it has a break point k, because k
will be inherited when you go to the bigger one. That's what we did. The other one is, beta is at most
B of N minus 1 and k minus 1. And this is the case where I only looked
at this guy, and now I have to be more restrictive in terms of all
possible patterns, because I have an extension to add, and I would be
violating the big constraint. So I ended up with this being less
than or equal to B of N minus 1 and k minus 1. Anybody notice anything in this slide? How convenient! I have alpha plus 2 beta
there, and I have alpha plus beta on one, and beta on one. If I add them, I am in business. I can actually now relate B of N and k
to other B of N and k, and alpha and beta are gone. B of N and k, now I know, has
to be at most this fellow. So you can see where the
recursion is going.
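Putting the three counting facts together in symbols:

```latex
B(N,k) \;=\; \alpha + 2\beta \;=\; (\alpha + \beta) + \beta
       \;\le\; B(N-1,\,k) \;+\; B(N-1,\,k-1)
```

Now I know that this property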
holds for the B of N and k. And now all I need to do is solve it, in
order to find an actual numerical value for B of N and k. And that numerical value will serve as
an upper bound for any growth function of a hypothesis set that
has a break point k. Let's do the numerical
computation first. I have this recursion, and I can see
that, from smaller values of N and k, I can get bigger values, or I can get
an upper bound on bigger values. Let's do it in a table. Here is a table. Here is the value of N-- 1, 2. This is the number of points, the
number of columns in the matrix. And this is k. This is the break point
I am talking about. So this will be-- there's a break point
of 1, break point 2, break point 3, et cetera. And what I'd like to do here, I'd like
to fill this table with an upper bound on B of N and k. I'd like to put numbers here, that I know
that B of N and k can be at most that number. And we can construct this matrix very,
very easily, given this recursion. Here's what we do. First, I fill the boundary conditions. Let's look at this. Here it says that there
is a break point of 1. I cannot get all possible
patterns on one point. Well, what are all possible
patterns on one point? -1 and +1. It's one point. So I cannot get both
-1 and +1. That's a pretty heavy restriction. So I'm asking myself, let's say you
have now N columns in the matrix. How many different rows can you get in
that matrix, under that constraint? Well, I'm in trouble. Because if I have the first pattern, and
then I put a second pattern, the second pattern must be different from
the first one in at least one column. That's what makes it different. If it's identical in every column, then
it's not a different pattern, right? So you go to that point,
where it's different. And unfortunately, for that point
you get both possible patterns. So we are stuck. We can only have one pattern
under this constraint. Hence, the 1's-- 1, 1, 1, 1, 1. That's good. Now, in the other direction,
it's also easy. In this case, it's 2. It's very easy to argue. Now, I am taking the case where
I have only one column. So I'm asking myself, how many patterns
I can get for one column. Well, the most is 2. Why am I getting 2's here? Because in the upper diagonal of
this table, the constraint I am putting is vacuous. Here, for example, I am telling you how
many different patterns can you get on one point, such that no four
points have all possible patterns. Four points, what are
you talking about? You have only one point. So that's no constraint at all. Therefore, it doesn't restrict the
choices, and the maximum number is the maximum number I would get unrestricted,
which happens to be 2. If I have one point,
I get two patterns. That's why you have the
2's sitting here. Now, I covered the boundary conditions,
and that's really all I need to complete the entire table, given the
nature of the constraint I have. Why is that? Because that constraint looks like this. If you know the solid blue guys, I
will tell you the empty blue guy. Because this would be--
look at N and k. This is N and k. This would be N minus 1 and k. This would be N minus 1 and k minus 1. That's exactly what this says. So if I have these two points, I can
get a value here, which will be an upper bound on this fellow. Let us actually go through
this table, and fill it up. The first guy I'm going
to take is this 1 and 2. According to this shape, I might
be able to get this fellow. What would that fellow be? 3, right? You just add the two numbers. How about the next guy? Anybody have a guess here? OK, 4. And then? A bunch of 4's. I always get 2 to the N. I am actually happy about this because
you see that when k grows big, much bigger than N, as we said, the
constraint is vacuous. So I should be getting all
possible patterns on the number of points I have. And as you can see, for
1, I get the 2's. For 2, I will get eventually the 4's. And for 3, it will be the 8's. So that is very nice. Let's go over the next row. Can I solve this one? Now that I got this one, I can become
more sophisticated and get this one. See where this came from? How about the next one,
what would that be? That should be 7, right? 8. A bunch of 8's. This is kind of fun. And you can fill up the entire table. So we have it completely
solved, numerically.
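Here is a short sketch (mine, not from the lecture) that fills the same table from the boundary conditions and the recursion; the function name is an assumption:

```python
def b_upper(n_max, k_max):
    """Tabulate upper bounds on B(N, k) from the boundary conditions
    B(N, 1) = 1 and B(1, k) = 2 for k >= 2, plus the recursion
    B(N, k) <= B(N-1, k) + B(N-1, k-1)."""
    B = [[0] * (k_max + 1) for _ in range(n_max + 1)]
    for n in range(1, n_max + 1):
        B[n][1] = 1              # break point 1: only one row survives
    for k in range(2, k_max + 1):
        B[1][k] = 2              # one point, vacuous constraint: both patterns
    for n in range(2, n_max + 1):
        for k in range(2, k_max + 1):
            B[n][k] = B[n - 1][k] + B[n - 1][k - 1]
    return B

table = b_upper(6, 6)
print(table[3][2])  # 4 -- the answer to the three-points, break-point-2 puzzle
```

It would be nice to have a formula,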
which we will have in a minute. But numerically, we will have that. Now, let me highlight one guy. Do you see anything that
changed colors? I claim that you have
seen this before. That's the puzzle. You had three points. Your break point was 2. And now we know for a fact that the
maximum number you can get is 4, without having to go through
the entire torture we went through last time. Can we try this? Can we try that?
Oh, I am violating-- You don't have to do that. Here are the numbers, just by computing
a very simple recursion. Now, let's go for the analytic
version of that. What I'd like to do, I'd like to
find a formula that computes this number outright. I don't have to go through this
computation numerically. So let's do that. This is the analytic solution
for B of N and k. Again, this is the recursion. And now we have a theorem. Yeah, when you're doing mathematical
theory, you have to have theorems. Otherwise, you lose your
qualifications! What does the theorem say? It tells you that this is a formula
that is an upper bound for B of N and k. What is this formula? This is N choose i, the combinatorial
quantity. And you sum this up from
i equals 0 to k minus 1. So both N and k appear. N appears as the number here, and k
appears as the limit for the index of summation-- appears as k minus 1. This quantity will be an upper
bound for B of N and k. You can now, if you believe that,
which we will argue in a moment, you compute this number. And that will be an upper bound for the
growth function of any hypothesis set that has a break point k, without
asking any questions whatsoever about the hypothesis set or the input space.
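In symbols, the theorem asserts:

```latex
B(N,k) \;\le\; \sum_{i=0}^{k-1} \binom{N}{i}
```

It shouldn't come as a surprise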
that this quantity is right, because if you look at this, this is really
screaming for something binomial or combinatorial. Clearly, it will come
out one way or the other. But why is it this way? Well, what we are going to show, we are
going to show that this is exactly the quantity we computed numerically
for the other table. And we are going to do
it by induction. So the recursion we did, we are just
going to do it analytically. How do you do that? You start with boundary conditions. What were the boundary conditions? We argued that this is, indeed,
the value of B of N and k. And hence, an upper bound on
it, from the last slide. Now we want to verify that this
quantity actually returns those numbers, when you plug in the value
N equals 1 or k equals 1. How do I do that? You just do it. Just plug in, and it will come out. I'm not going even to bother doing it. It's a very simple formula. You just evaluate it, and you get that. The interesting part is the recursion. I would like to argue that if
this formula holds for the solid blue points, then it will also
hold for this guy. And then by induction, since it holds
for all of these guys, I can just do this step by step and fill the schedule,
with the truth of this being the correct value for the numbers
that appear here. Everybody is clear
about the logic of it? So let's do the induction. We have the induction step. We just want to make clear
what the induction step is. You are going to assume that
the formula holds for this point and this point. So indeed, if you plug in the values for
N and k, which here, N minus 1 and k minus 1, and here it would be N minus
1 and k, you plug it into that particular formula. Then the numbers will be correct. That's the assumption. And then you prove that, if this is true,
then the formula will hold here. That's the induction step. Fair enough. So let's do that. This is the formula for N and k. You just need to remember it. N appears here and k appears here. Minus 1 is an integral
part of the formula. This is the value for
k, not for k minus 1. The value for k happens to be the sum
from i equals 0 to k minus 1. So this is the formula that
is supposed to be here. And we would like to argue
that this is equal to-- what is this one? This one is for N minus 1, and still k. So this would be-- I moved from here to here. So this will be the value here. And what is the other guy? That will be the value for N minus 1. And now it's for k minus 1, because
you still take the other minus 1. It becomes k minus 2. This part belongs here. So this is the induction step. We don't have it yet. That's what we want to establish. So let me put a question mark
to make sure that we haven't established it yet. What I am going to do, I am going to
take the right-hand side, and keep reducing it, until the left-hand
side appears. That's all. And then we'll be done with
the induction step. And since we have the boundary
conditions, we will have proved the theorem we asserted. The first thing I am going to do,
I am going to look at this fellow. And I notice that the index goes
here from i equals 0 to k minus 1. Here, it goes from i equals 0
to k minus 2. I'd like to merge the two summations. So in order to merge the two summations,
I will make them the same number of terms, first. Very easy. I will just take the zeroth
term, which would be N minus 1 choose 0, which is 1, out. And now the summation goes from
i equals 1 to k minus 1. Now, I go to the other
guy and do this. What did I do? I just changed the name of
the dummy variable i. I wanted the index to go from 1
to k minus 1, in order to be able to merge it easily. Here, it goes from 0 to k minus 2. So what do I do? I just make this i, and
make this i minus 1. So i minus 1 goes from 0 to k
minus 2, as this i used to. Just changing the names. And now, having done that, I am ready
to merge the two summations. And they are merged.
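For reference, the merged form is (after taking the i = 0 term out of the first summation and re-indexing the second):

```latex
\sum_{i=0}^{k-1}\binom{N-1}{i} \;+\; \sum_{i=0}^{k-2}\binom{N-1}{i}
\;=\; 1 \;+\; \sum_{i=1}^{k-1}\left[\binom{N-1}{i} + \binom{N-1}{i-1}\right]
```

Now, I would like to be able to take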
this, and produce one quantity. And you can do it by brute force. This is no mysterious quantity. This is what? This is N minus 1 times N minus
2 times N minus 3, i terms, divided by i factorial. And this one applies the same thing. So you end up with something, and then
you do all kinds of algebra, and it looks familiar. And then you reduce it
to another quantity. So there's always an algebraic
way of reducing it. But I am going to reduce it with a very
simple combinatorial argument. I am going to claim that this is-- the 1 remains the same. And this actually, the whole thing
here reduces to N choose i. So these two guys become this one. Instead of doing the algebra,
I am going to give you a combinatorial argument. That is, this quantity is identical
to N choose i. Let's say that I am trying to choose
10 people from this room. And let's say that the room
has N people. There are N people. How many ways can you choose 10
people out of this room? That is N choose 10. Let's put this on the side. Here is another way of counting it. We can count the number of ways you can
pick 10 people, excluding me, plus the number of ways you can pick
10 people, including me. Right? These are disjoint, and they
cover all the cases. Let's look at excluding me. How many ways can you pick 10 people
from the room, excluding me? Well, then you are picking the
10 people from N minus 1. I am the minus 1. So that would be N minus 1 choose 10. Put this in the bank. How many ways can you pick
10 people, including me? Well, you already decided you are
including me, so you are only deciding on the 9 remaining guys. So that would be N minus 1 choose 9. So we have N minus 1 choose 10, plus
N minus 1 choose 9, that equals the original number, which
was N choose 10. Look at this. What do we have? This is excluding me. This is including me. And this is the original count. So it's a combinatorial identity, and
we don't have to go through the torture of the algebra in order to
prove that it's exactly the same.
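In symbols, the identity is just:

```latex
\binom{N-1}{i} \;+\; \binom{N-1}{i-1} \;=\; \binom{N}{i}
\qquad \text{(excluding me} \;+\; \text{including me} \;=\; \text{all the choices)}
```

Now, I go back. I look, this goes from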
i equals 1 to k minus 1. I have this 1, so I conveniently put
it back, and get this formula. Have you seen it before? Yeah, it looks familiar. Oh, this is the one we want to prove. So it means that we are done. That's it. We have an exact solution for the
upper bound on B of N and k. Since we spent some time developing
it, let's look at it and celebrate it, and be happy about it. First thing: yes, it's a polynomial,
because all of this torture was to get a polynomial, right? If we did all of this, and it's perfect
math, and the end result was not a polynomial, then we are in trouble. Because although the quantity is
correct, it's not going to be useful for the purpose that we are aiming at. So why is it polynomial? Remember that for a particular
hypothesis set, the break point is a fixed number. It doesn't grow with N. You
ask in a hypothesis set, can I get all possible dichotomies
on four points? That's a question for the perceptron. No. Then 4, in the perceptron,
is a break point. Now, I can ask myself what the
perceptron does on 100 points. And the break point is still
4, just a constant. You give me a hypothesis set,
I give you a break point. That's a fixed number. So according to our argument now, the
growth function for a hypothesis set that has a break point k is less than or
equal to the purely combinatorial quantity, B of N and k, which is defined
as the maximum such number of dichotomies you can get, under the
constraint that k is a break point. And that was less than or equal
to the nice formula we had. So we can now make this statement. You go into a real learning
situation. Let's say you have a neural network
making a decision, and you tell me the break point for that neural
network is 17. I don't ask you what is a neural
network, because we don't know yet, so you don't have to know. I don't ask you what is the
dimensionality of the Euclidean space you are working on. You told me 17. Your growth function of your neural
network that I don't know, in the space that I don't know, happens to
be less than or equal to that, and I know that I'm correct. So is this quantity polynomial in N? That's what we need. Because remember, in the Hoeffding, there
was a negative exponential in N. If we get this to be polynomial
in N, we are in business. Well, any one of those guys is what? N times N minus 1 times N minus 2,
i times, divided by i factorial. i factorial doesn't matter,
it's a constant. So you basically get N multiplied by
itself a number of times, i times, for the i-th term. The most that N will be multiplied by
itself is when you get to i equals k minus 1, the maximum. And then N will be multiplied
by itself k minus 1 times. Therefore, the maximum power in this
quantity is N to the k minus 1. This comes from N times N minus
1 times N minus 2 times-- k minus 1 times, which corresponds to the
case where i equals k minus 1. When you get N choose k minus
1, that's what you get. Anything else will give you a power
of N, but it's a smaller power. This is the most you will have. What do we know about this fellow k? We know it's just a number. It's a constant. It doesn't change with N. And
therefore, this is indeed a polynomial in N. And we have achieved
what we set out to do. That is pretty good.
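In one line, since k is a constant of the hypothesis set:

```latex
\sum_{i=0}^{k-1} \binom{N}{i} \;=\; O\!\left(N^{\,k-1}\right)
```

Let's take three examples, in order to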
make this relate to experiences we had before. This is the famous quantity by now. You know it by heart. I have the N. I remember k. I have to put minus 1. And that is the upper bound for anything
that has a break point k. Now, let's take hypothesis sets we
looked at before, for which we computed the growth function explicitly, and see
if they actually satisfy the bound. They had better, because this is math. We proved it. But just to see that this
is, indeed, the case. Positive rays. Oh, remember positive rays
from some time ago? We have one dimension,
so just the real line. Then, from a point on,
it goes to +1. And we said that the whole analysis
here is exactly to avoid what I just did. You don't have to tell
me what is the input. You don't have to tell me-- you just have to tell me what? What is the break point? That's all we want. You can call it positive rays. You can call it George, I don't care! It has a break point of 2. That's what I pull. We did compute the growth
function for the positive rays. We did it by brute force. We looked at it, and we see what
the patterns are, and did a combinatorial argument. And we found that the growth function
for this guy is N plus 1. Let us see if this satisfies
the bound. This is supposedly
less than or equal to. And you substitute here for
N, which is the number here. And the break point is k. So you're summing, from i equals
0 to 1, this quantity. You have N choose
0, also known as 1. Plus N choose 1, also known
as N. That's it. So you get this to be less
than or equal to-- wow! Look at the analysis we did
to get the N plus 1. And we get exactly the same. With all the bounds and-- we think that there is
a big slack here. But here, actually it's exactly tight. We get the same answer exactly, without
looking at anything of the geometry of what the hypothesis set was. Let's try another one. Maybe we'll continue to be lucky. Positive intervals. Yeah, I remember those were
the more sophisticated-- oh, I'm sorry. I am not supposed to ask any questions
about the hypothesis. I'm asking about the break point only. I remember now. So tell me what the break point is. That was k equals 3. And we did compute the
growth function. Remember, this one was a funny one. We're picking two segments out of
N plus 1, and then adding the 1. So we ended up-- this
would be the formula. What would be the bound according
to the result we just had? This would be again, this formula. And now k equals 3. So I have N choose 0 plus N
choose 1 plus N choose 2. I get 1 plus N plus something
that has squared terms. And I do the reduction
and what do I get? Boring, boring. I seem to be getting it all the time. It doesn't always happen this way. It will always happen
that it's true. But there will be a slack
in many cases. So, we verified it. We are very comfortable
now with the result. Let's apply it to something where we
could not get the growth function. Remember this old fellow? Well, in the two-dimensional perceptron,
we went through a full argument just to prove that
the break point is 4. But we didn't bother to go through
a general number of points N, and ask ourselves, how many hypotheses can the
perceptron generate on N points? Can you imagine the torture? We do this-- can I get this pattern-- And you have to do this for every
N. So we didn't do it. The growth function is unknown to us. We just know the break point. But using just that fact, we are
able to bound the growth function completely. And you substitute again
with k equals 4. You get another term, which is cubic. And you do the reduction. And lo and behold, you
have that statement. That statement holds for the perceptrons
in two dimensions. And you can see that this
will apply to anything.
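The three examples can be checked in a few lines of Python (a sketch of mine, not from the lecture; `vc_bound` is my name for the formula):

```python
from math import comb

def vc_bound(n, k):
    """The bound just derived: sum of C(N, i) for i = 0 .. k-1."""
    return sum(comb(n, i) for i in range(k))

for n in range(1, 11):
    # Positive rays: growth function N + 1, break point k = 2 -- the bound is tight.
    assert vc_bound(n, 2) == n + 1
    # Positive intervals: growth function C(N+1, 2) + 1, break point k = 3 -- also tight.
    assert vc_bound(n, 3) == comb(n + 1, 2) + 1
    # 2-D perceptron: break point k = 4; the exact growth function is unknown
    # to us, but it is bounded by this cubic in N.
    print(n, vc_bound(n, 4))
```

So now it was all worth the trouble,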
because now we have a very simple characterization of hypothesis sets. And we can take this, and
move to the other part. Remember, this part, which has now
disappeared, is proving that it's polynomial. Proving that we are interested
in the growth function. If it wasn't polynomial, we wouldn't
be interested in it. Now, this is an interesting
quantity. This one tells us that-- oh, and by the
way, it's not only interesting. You actually can use it. We can put it in the Hoeffding
inequality, and claim that the Hoeffding inequality is true
using the growth function. Now let's see what we want, to remind
you of the context of substituting for the total number of hypotheses
by the growth function. We wanted, instead of
having this fellow-- this is Hoeffding, and this is the number
of hypotheses using the union bound, which we said is next
to useless whenever M is big, or M is infinite.
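In symbols, the move we are after is from the union-bound version to the growth-function version (the question mark marks what still has to be justified):

```latex
\mathbb{P}\bigl[\,|E_{\text{in}} - E_{\text{out}}| > \epsilon\,\bigr] \;\le\; 2\,M\,e^{-2\epsilon^{2}N}
\qquad\longrightarrow\qquad
\mathbb{P}\bigl[\,|E_{\text{in}} - E_{\text{out}}| > \epsilon\,\bigr]
\;\overset{?}{\le}\; 2\,m_{\mathcal{H}}(N)\,e^{-2\epsilon^{2}N}
```

And instead of that, we wanted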
really to substitute for this by the growth function. So this is what we are trying to do. We are trying to justify that instead
of this, you can actually say that. Well, it turns out that this is not
exactly what we are going to say. We are going to modify the constants
here for technical reasons that will become clear. But the essence is the same. There would be the growth
function here. It will be polynomial in N, and it
will be killed by the negative exponential, provided that there is
a break point-- any break point. Now, how are we going
to prove that? We are going to have
a pictorial proof. What a relief, because I think
you are exhausted by now. So the formal proof is
in the Appendix. It's six pages. My purpose of the presentation here
is to give you the key points in the proof, so that you don't get
lost in epsilon's and delta's. There are basically certain things
you need to establish. And once you know that that's what you
are looking for, you can bite the bullet and go through it line by line. The two aspects are the following. Why did we do this growth function? We used the growth function because it's
smaller, so it will be helpful. But how could it possibly replace M? Because M was assuming no overlaps
at all in the bad regions. Remember? So now that we know that there are
overlaps, this will take care of it. The question is, how does the
growth function actually relate to the overlaps? You need to establish that. So this is the first one. And when we establish that, we find that
it's a completely clean argument at everything, except for
one annoying aspect. Growth function relates
to a finite sample. So we will get a perfect handle
on the E_in, the in-sample error part of the deal. But in the Hoeffding inequality,
there is this E_out. And E_out relates to the performance
over the entire space. So we are no longer talking about
dichotomies, we are talking about full hypotheses. We lose the benefit of
the growth function. So what do we do about E_out? That was a question that
was asked last time. What to do about E_out, in order
to get the argument to conform while we are just using a finite
sample, is the second step. After that, it's a technical
putting it together, in order to get the final result. That's the plan. But the proofs are pictures. So let's have a blank page. And let's say you are an artist
and this is your canvas. It's a very special canvas. It's the canvas of data sets. What is that? Every point here is an entire
data set, x_1, x_2, up to x_N. Fix N in your mind. So this is one vector. This is another vector. This is another vector. And this canvas covers the entire
set of possible data sets. Now, why am I doing this space? Well, I am doing this space because the
event of being good or bad, whether E_in goes to E_out, depends
on the sample. Depends on the data set. For some data sets, you will
be close to the E_out. For some data sets, you are
not going to be close. So I want to draw it here, in order to
look at the bad regions and the overlaps, and then argue why
the growth function is useful for the overlaps. Now, we assume that there's
a probability distribution. And for simplicity, let's say that the
area corresponds to the probability. So the total area of the canvas is 1. Now, you look at the event, the
bad event, that E_in is far away from E_out. And let's say that you paint
the points that correspond to that event red. So you pick-- is this data
set good or bad? What does it mean, good or bad? I look at E_in in that data set, compare
it to E_out on a particular hypothesis, and then paint
it red if it's bad. So I have a hypothesis in mind, and I am
painting the points here red or leaving them alone, according to whether
they violate Hoeffding's inequality or not. And I get this, just illustratively. And you realize that I didn't
paint a lot of area. And that is because of
the Hoeffding inequality. The Hoeffding inequality tells me
that that area is small. So I'm entitled to put a small patch. Now we went from one hypothesis, which
is this guy, to the case where we have multiple hypotheses, using
the union bound. So again, this is the space of data
sets, exactly the same one. And now, I am saying for the first
hypothesis, you get this bad region. What happens when you have
a second hypothesis? Because I am using the very pessimistic
union bound, I am hedging my bets and saying you get
a bad region that is disjoint. Another hypothesis-- two of them. More. More. Oh, no. We are in trouble. The colored area is the bad area. Now the canvas is the bad area. That's why we get the problem
with the union bound. Because obviously, having them disjoint
fills up the canvas very quickly. Each of them is small, but
I have so many of them. Infinity of them as a matter of fact. This will overflow. Well, no, it won't overflow. Just figuratively speaking. So that's what I'm going to have. What is the argument
we are applying now? We are not applying the union bound. We are going to a new canvas. And that canvas is called the VC bound,
as in Vapnik-Chervonenkis. We'll see it in a moment. So what do you do? Your first hypothesis, same thing. When you take the second hypothesis,
you take the overlaps into consideration. So it falls here. You get more. You get more. You get all of them. It's not as good as the first one. I never expected it to be. But definitely not as bad as the
second one, because now they are overlapping. And indeed, the total area, which is
the bad region-- something bad happening-- is a small fraction of
the whole thing. And I can learn. So we are trying to characterize
this overlap. That's the whole deal with
the growth function. One way to do it is the one that
I alluded to before. Study the hypothesis set. Study the probability distribution. Get the full joint probability
distribution of any two events involving two hypotheses, and
then characterize this. Well, good luck! We won't do that. The reason we introduced the growth
function, because it's an abstract quantity that is simple, and is going
to characterize the overlaps. The question is, how is the
growth function going to characterize the overlaps? Here is what is going to happen. I will tell you that if you look at
this canvas, if any point gets painted at all, it will get
painted over 100 times. Let's say that I have that guarantee. I don't know which hypotheses
will paint it again. But any point that gets a red, it will
get a blue, and a green-- 100 times. If I tell you that statement, what
do you know about the total area that is colored? Now it's, at most, one hundredth
of what it used to be. Because when I had them non-overlapping,
they filled it over. Now for every point that is colored,
I have to do it 100 times. So I am overusing these guys, and
these guys will have to shrink. And I will get one hundredth of that. That is basically the essence
of the argument. What the growth function tells you is
that-- what is the growth function? Number of dichotomies. If you take a dichotomy, this is not
the full hypothesis, but the hypothesis on a finite set of points. There are many, many hypotheses that
return the exact same dichotomy. Remember the gray sheet
with the holes. Lots of stuff can be happening
behind the sheet. And as far as I am concerned, they
are all the same dichotomy. So all of these guys will be behaving
exactly the same way. If one of them colored the
point, the others will. This tells me that the
redundancy is captured by the growth function. That would be a very clean argument. And it would have been a very simple
proof, except for one annoyance. That the point being colored doesn't
depend only on the sample, but depends also on the entire space. Because the point gets colored
because it's a bad point. What is a bad point? The frequency on the sample, that is
patently on the sample, deviates from E_out. Oh, E_out involves the
entire hypothesis. If I have the gray sheet and the
holes, I cannot compute E out. I have to peel it off, look, and get
the areas in order to get E out. So the argument is great, as long as you
can tell me how do I go around the presence of E_out? And that's the second
part of the proof. What to do about E_out. The simplest argument possible. That is really the breakthrough
that Vapnik and Chervonenkis did. Back to the bin, just because it's
an illustration of the binary case. So here, we have one hypothesis. And we have E_out, which is the
overall, in the entire space-- the error in the entire space. We pick a sample, and then we get E_in,
which is the value for the error on this one. So we have seen this before. And we said this tracks this, according
to the Hoeffding inequality. And the problem is that when you have
many, many bins, some of these guys will start deviating from E_out, to the
level if you pick according to the sample, you are no longer sure that
you picked the right one, because the deviation could be big. That was the argument. Now, I want to get rid of E_out. The way I am going to do it is this. Instead of picking one sample, I
am going to pick two samples, independently. So obviously, they are not
identical samples. Some of them are green
or red, et cetera. But they are coming from
the same distribution. Now, let's see what is going on. E_out and E_in track each other, because
E_in is generated from this distribution. Now, let's say I look at these two
samples and give them names. I am going to call them
E_in and E_in dash. They're both in-sample. It happens to be a different sample. So I have two samples. I am going to call this E_in,
and this E_in dash. My question is, does E_in track
E_in dash, if you have one bin? Well, each of them tracks
E_out, right? Because it was generated by it. So consequently, they
track each other. A bit more loosely, because
you have now two ways of getting the sample error. On the other hand, if I do
two presidential polls-- one poll asks 3000 people. Another asks 3000 people. These are different 3000 people. You fully expect that the results will
be close to each other, right? So these guys track each other. OK, fine. What is the advantage? The advantage is the following. If I now have multiple bins, the
problem I had here is reflected exactly in the new tracking. When I had multiple bins, the tie
between E_out and E_in became looser and looser. Because I'm looking for worst case, and
I might be unlucky enough, that the tracking now lost the tightness that one
bin with Hoeffding would dictate. If I am doing multiple bins, and not
looking at the bin at all, just looking at the two samples from each
bin, they track each other, but they also get loosely apart
as I go for more. Let's say, I tell you this experiment. You pick two samples. They are close in terms of
the fraction of red. If you keep repeating it, can you get
one sample to be mostly red, and the other sample to be mostly green? Yeah. If you are patient enough,
it will happen. Exactly for the same reason,
because you keep looking for it until it happens. So the mathematical ramifications of
multiple hypotheses happen here, exactly the same way they happen here. The finesse now is that, if I
characterize it using the two samples, then I am completely
in the realm of dichotomies. Because now I'm not appealing
to E_out at all. I am only appealing to what
happens in a sample. It's a bigger sample. I have 2N marbles now instead of N.
But still, I can define a growth function on them. And now the characterization is
full, and I am ready to go.
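Schematically, the two-sample step is the following lemma (I am quoting the standard symmetrization form; the exact constants are handled in the Appendix):

```latex
\mathbb{P}\Bigl[\,\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\Bigr]
\;\le\; 2\,\mathbb{P}\Bigl[\,\sup_{h\in\mathcal{H}} |E_{\text{in}}(h) - E_{\text{in}}'(h)| > \tfrac{\epsilon}{2}\,\Bigr]
```

These are the only two components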
you need to worry about as you read the proof. Now, let's put it together. This is what we wanted. This is not true. Don't hold this against me. And to make sure, this is
not quite what we have. This would be direct substitution of the
plain-vanilla growth function in terms of M. We are not going to have that, but we
are going instead to have this. Let's look and compare. These look the same, except
that this 2 became 4. Is this good or bad? Well, it's bad. We want this probability to be small. Bad, but not fatal. This one goes to here. I have twice the sample. You know why I have 2 now. Because now I use the bigger sample
for the argument, so I need 2N. Oh, but all of this was about
a polynomial and now I don't know whether this will be a polynomial. Yes, you do. If it's polynomial in N, it's
polynomial in N here. Because you get 2N to the
k, then you get 2 to the k. That's a constant. And you still get N to the k. So that remains a polynomial. A bigger polynomial. I don't like it, but you
don't have to like it. It just has to be true, and
do the job we want. And finally, you can see this is minus
2, which was a very helpful factor. This is in the exponent. A 2 in the exponent goes
a long way. And now we knock it down
all the way to 1/8. That's really, really bad news. The reason this is happening
is that, as we go through the technicalities of the proof, the epsilon
will become epsilon over 2. And then will become epsilon over 4, just
to take care of different steps. And when you plug in epsilon over 4
here, you get epsilon squared over 16. And so you get a factor of 1/8. That's the reason for it. So this is what we will end up with. And you can be finicky and try to
improve this constant a lot, but the basic message is that here is
a statement that holds true for any hypothesis set that has a break point. And this fellow is polynomial in N,
with the order of the polynomial decided by the break point. And you will eventually learn, because
if N is big enough-- if I give you enough examples-- using that hypothesis,
you will be able to claim that E_in tracks E_out correctly.
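In symbols:

```latex
\mathbb{P}\bigl[\,|E_{\text{in}} - E_{\text{out}}| > \epsilon\,\bigr]
\;\le\; 4\,m_{\mathcal{H}}(2N)\,e^{-\frac{1}{8}\epsilon^{2}N}
```

This result, which is called the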
Vapnik-Chervonenkis inequality, is the most important theoretical result
in machine learning. On that happy note, we will stop
here and take questions after a short break. Let's start the Q&A. MODERATOR: First, a few
clarifications from the beginning. In slide 5, when you choose the N points,
does it mean your data set is of N points, or you just chose
N points from the data set? PROFESSOR: When I apply this to an actual
hypothesis set in an input space, then these actually correspond
to a particular set of points in that space. However, in the abstraction that just
defines the function B, these are just abstract labels. These are labels for which
column I'm talking about. So although I call them x_1 up to
x_N-1, these are not really-- in the abstraction here, they don't
correspond to any particular input space in mind. But when they do, they will
correspond to a sample. And I am supposed to pick the sample
in that space that maximizes the number of dichotomies, et cetera, as
we defined the growth function. But it's a sample that I pick when I
apply this to a particular input space and a hypothesis set. MODERATOR: Also, there are
some people asking-- they didn't understand exactly why
alpha was different from beta. PROFESSOR: Alpha
is different from beta? MODERATOR: Yeah. Why? PROFESSOR: Well, the short
answer is that I never made the statement that alpha is
different from beta. I just didn't bother to ascertain any
relationship between alpha and beta. I just called them names. If they happen to be
equal, I am happy. If they happen to be
unequal, I am happy. So all I'm doing here is just calling
the guys that happen to have a single extension, the number of
them, calling it alpha. Calling the guys that happen to
have double extension beta. I don't know whether alpha is bigger
than beta, or smaller than beta, in any particular case. And it doesn't matter as far as
the analysis is concerned. If I call them this way, then it will
always be true that the total number of rows here is alpha plus beta plus
beta, which is alpha plus twice beta. So there is really no assertion about
the relative value of alpha and beta. MODERATOR: Moving on to the case
where you show the break points, and how it satisfies the bound. What happens if k equals infinity? No break points, basically. PROFESSOR: This is for the
positive rays and whatnot? MODERATOR: Yeah. So for example, if you
had the convex sets. PROFESSOR: k equals infinity
means there is no break point. In that case, you don't have to bother
with any of the analysis I did. No break points means what? Means the growth function is
2 to the N for every N, right? We just computed it exactly. If you want a bound for it, yes,
it's bounded by 2 to the N. Not a polynomial. So all of these cases, we're addressing
the case where there is a break point, because that is the case
where I can guarantee a polynomial. And therefore, I can
guarantee learning. That is the interesting case. If there is no break point, this
theoretical line of analysis will not guarantee learning. So if I have a hypothesis set that
happens to be able to shatter every set of points, I cannot make a statement
using this line of analysis that it will learn. And one example we had
was convex sets. So convex sets have a growth function
of 2 to the N. Well, it really is a very pessimistic estimate here, because
the points have to be really, really very funny. You have to build the pathological case,
in order not to be able to learn. And in many cases, you might be. But again, if I want a uniform statement
based only on the break point, this is the most I can say
using this line of analysis. MODERATOR: OK. Just a quick review. How is the break point calculated? PROFESSOR: Calculated. The break point is-- this is the only
time you actually need to visit the input space and the hypothesis set. You basically-- you are sitting in
a room with your hypothesis set. Someone gave you a problem
for credit approval. You decided to use perceptrons, and
you decided to use a nonlinear transformation. And you do all of that, and you
start programming it. And you would like to know some
prediction of the generalization performance that you are going to get. So you go into the room, and ask yourself:
for this hypothesis set, over this input space, what is the break point? So now you have to actually go and
study your hypothesis set. And then find out that using this
hypothesis set, I cannot separate, let's say, 10 points in
every possible way. Very much along the argument we used for
the perceptron in two dimensions. We found out that we cannot separate
four points in every possible way. But the good news is that, you don't have
to do it anew, because for most of the famous learning models, this
has already been done. For the perceptrons, we will
get an exact break point. For any-dimensional perceptron. So 20-dimensional perceptron,
here's the break point, and here's the growth function. Or, here's the bound. Similarly, for neural networks, there is a break point. Not an exact estimate of the break point, but a bound on it. And again, in most of these cases,
bounds work, because we are always trying to bound above. And we have room to play with, because
a polynomial is a polynomial is a polynomial. So if you become a little bit sloppy
and forget something, and the break point-- you say 10 instead of 7-- it's not going to break the back of
learning versus non-learning. It's just going to tell you more
pessimistically, how many resources you need in order to learn. Which is more benign damage
than deciding, oh, I cannot learn at all.
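As an illustration of what "going into the room" amounts to computationally, here is a minimal sketch, not from the lecture, that brute-forces a shattering check for the two-dimensional perceptron; it tests the linear separability of every dichotomy with a feasibility linear program. The function names are invented for this sketch, and it assumes scipy is available.

    from itertools import product
    from scipy.optimize import linprog

    def separable(points, labels):
        # Feasibility LP: is there (w1, w2, b) with y_i * (w . x_i + b) >= 1?
        A_ub = [[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)]
        b_ub = [-1.0] * len(points)
        res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * 3)
        return res.success

    def shattered(points):
        # Can the 2D perceptron realize all 2^N dichotomies on these points?
        return all(separable(points, labels)
                   for labels in product([-1, +1], repeat=len(points)))

    print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points in general position
    print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the XOR dichotomy fails

A break point requires that no choice of N points can be shattered; for the two-dimensional perceptron, four points in general position fail on the XOR-like dichotomy, and degenerate, collinear configurations fail even earlier, which is why k equals 4.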
MODERATOR: OK. Can you come up with an example where these bounds are not as tight as they are here? PROFESSOR: There's one case,
which I could have covered but I didn't, where you take positive
and negative rays. So positive rays, you
take the real line. And from a point on, it's +1; before that point, it's -1. Positive and negative rays means you
are also allowed to take rays that return +1 first, and
then -1 later. And the union of them is the model
called positive and negative rays. It's a good exercise to do. Take that home and try to find,
what is the break point? And you'll find that although the break
point for positive rays is 2, in this case the break
point is actually 3. And the reason is that for two points,
now you can get everything because you know the ray can be here. So they are both minus. The ray could be here. They are both plus. The ray could be here. It's minus plus. But now, use the negative ray
to get the +1, -1. So now you can shatter two points. And you would fail only for the three
points, when the middle guy is different, because you cannot
get it this way. So you will get-- and the
break point is 3. When you do the break point of
3, you will get the bound, the blue bound here. You will get that to be quadratic. Pretty much like here, because we have a break point that corresponds directly to a quadratic bound. I don't care whether the 3 is coming
from positive intervals, or coming from positive and negative rays. It's 3. Therefore, the blue bound
is quadratic. If you compute the number of dichotomies
you can get, which is the growth function, it will
actually be linear. So there will be a discrepancy between
linear, for the exact estimate of the growth function, and quadratic, for the bound. So there are cases that
you can come up with, easily. And as a matter of fact, tightness is the exception rather than the rule. In most of the cases, there will be some slack.
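A quick sketch, not from the lecture, that verifies this numerically: it enumerates the dichotomies of the positive-and-negative-rays model on N ordered points and compares the count to the quadratic bound B(N, 3). The helper names are invented for this sketch.

    from math import comb

    def growth_pn_rays(n):
        # Dichotomies realizable by positive and negative rays on n ordered points.
        patterns = set()
        for t in range(n + 1):                         # boundary sits after the t-th point
            patterns.add((-1,) * t + (+1,) * (n - t))  # positive ray: -1's, then +1's
            patterns.add((+1,) * t + (-1,) * (n - t))  # negative ray: +1's, then -1's
        return len(patterns)

    def bound(n, k=3):
        # B(n, k) = sum of C(n, i) for i < k; quadratic in n when k = 3.
        return sum(comb(n, i) for i in range(k))

    for n in range(2, 8):
        print(n, growth_pn_rays(n), bound(n))  # growth is 2n (linear); the bound is quadratic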
MODERATOR: And this question drives home the point of the whole lecture. It says, we have been focusing on
having E_in equal to E_out, or close to E_out, not on the actual value of E_in. So with our hypotheses, the percentage of errors on the training data is about the same as on the real data. Why is that? PROFESSOR: This goes back to
separating the question of learning into the two questions. There was one question which
was addressed now. We are trying to get E_in
to track E_out. Why do I need that? Because I don't know E_out, and
I will not know E_out. That is simply an unknown
quantity for me. And I want to work with
something to minimize. I cannot minimize something
that I don't know. So if the theoretical guarantees tell me
that E_in is a proxy for E_out, and that if I minimize E_in, E_out will follow suit, then I can work with a quantity that I know, and minimize it. That's the first part: that they are tracking each other. The second part is the practical one. Now, I am going to go and
try to minimize E_in. This is the second part of it.
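The two questions, side by side, in the notation of the course:

\[
E_{\text{out}}(g) \approx 0
\quad\Longleftarrow\quad
\underbrace{E_{\text{out}}(g) \approx E_{\text{in}}(g)}_{\text{generalization (the theory)}}
\;\;\text{and}\;\;
\underbrace{E_{\text{in}}(g) \approx 0}_{\text{training (the practice)}}
\]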
MODERATOR: Also, they're asking if-- can you clarify more why the VC dimension is useful? PROFESSOR: The VC dimension,
as of now, is an unknown quantity. I didn't say that word "VC
dimension" at all. I said every building block that
will result in the definition. However, the good news is, what is
the title of the next lecture? The VC dimension. You will be completely content with
everything you wanted to know about the VC dimension, and weren't
afraid to ask! OK? MODERATOR: Yeah, the crowd is
saying that they're still digesting the lecture. PROFESSOR: OK. As I mentioned before, if you didn't
follow this in real time, don't be discouraged. It's actually very sweet material. And you can look at the lecture again. And you can read the proof. And you can do all of the homework,
until it settles in your mind. This is the most abstract, or the
most theoretical, lecture of the entire course. And if you get through this one, and
you understand it well, you are in good shape as far as the
rest of the course. There will be mathematics, but it will
be more friendly mathematics. Friendly, as in less abstract. For someone who is not theoretically
inclined, the more abstract the mathematics is, the less they
can follow it, because they cannot relate to it. So this one has the abstraction. The other mathematics will be
much easier to relate to. MODERATOR: What was wrong with the
"not quite" expression on the last slide? PROFESSOR: OK. Basically, the top statement
is simply false. It was my way of relating
what I'm trying to do, to what has already happened. There used to be M in place
of the growth function. So the growth function is here. There used to be M. So the easiest
way for me to describe what is happening with the theory, is to tell
you that you are going to take M out, and replace it with this. As usual, it's not that easy. Remember, even with the Hoeffding inequality,
when I complained about the 2 here and the 2 here? Well, you have to have them, in order
for the statement to be true. So for the statement to be true, we
needed to do some technical stuff that really didn't change the essence of
the statement here, but made it a little bit different by changing
the constants. And therefore, we have a proof
for it. It holds. And it captures the essence of that. I just didn't want to bother telling you
this because, if I told you this in the first place, you would have
been completely lost. Why 4? Why 2? What is this 1/8? And you would forget about the essence. So the easiest way to do it is to say: we are replacing it. This is not the final form,
but we are replacing it. Until you get the idea: indeed,
I can replace it. But oh, in order to replace it,
I need to have the bigger sample that we argued for. So I need 2N. Oh, and now the two samples are not tracking each other as well as each of them is tracking the actual
out-of-sample error. So I need to modify these
values, and so on. So it becomes much easier to swallow. The technicalities will come in, in order
to make the proof go through.
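For reference, the true statement with the modified constants, the VC inequality that the course goes on to use, reads

\[
\mathbb{P}\left[\, \sup_{h \in \mathcal{H}} \big| E_{\text{in}}(h) - E_{\text{out}}(h) \big| > \epsilon \,\right]
\;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8}\epsilon^{2} N},
\]

where the 4, the 2N, and the 1/8 are exactly the constants mentioned above.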
MODERATOR: OK. Can you review the definition of B of N and k? PROFESSOR: B of N and k. Assume you have N points, and assume
that k is a break point. So you're assured that no
k points will have all possible patterns on them. After these two assumptions, make
no further assumptions. You don't know where this came from. You don't know what space
you are working with. You don't know what the
hypothesis set is. You just know that in your setup,
when you get N points-- that is the N here-- and the break point is
k-- that is k here. Under those conditions, can you
bound the growth function? Can you tell me that the growth
function can never be bigger than something? That something is what I am
calling B of N and k. So what am I doing? I'm taking the minimal conditions
you gave me. I have N points, and k
is a break point. And asking myself: what is the maximum
number of dichotomies you can possibly have, subject to no constraints other than these two? And I'm calling this B of N and k. Why did I do it? First, it's going to help, being an upper
bound for any hypothesis set that has a break point k, because
it is the maximum. The second one, it's a purely
combinatorial quantity. So I have a much better chance of
analyzing it, without going through the hairy details of input spaces, and
correlation between events, and so on. And that is indeed, what ended
up being the case. We had a very simple recursion on it,
and we found a formula for it. And that formula now serves as
an upper bound for the more hairy quantity, which is the growth function
that is very particular to a learning situation, an input space,
and a hypothesis set.
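A minimal sketch, not from the lecture, that computes the recursion with equality. The lecture only proves the "less than or equal" direction, but the binomial sum satisfies the recursion exactly, so this computation yields precisely the closed-form upper bound.

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def B(n, k):
        # Max number of dichotomies on n points if no k points are shattered.
        if k == 1:
            return 1   # no point may take both values: a single row survives
        if n == 1:
            return 2   # one point, k >= 2: both +1 and -1 are allowed
        return B(n - 1, k) + B(n - 1, k - 1)

    # Closed form from the lecture: B(n, k) = sum of C(n, i) for i = 0 .. k-1.
    assert all(B(n, k) == sum(comb(n, i) for i in range(k))
               for n in range(1, 12) for k in range(1, 12))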
MODERATOR: Also, a particular question on the proof of B of N and k, the recursion. Slide 5. The question is, why does k not
change when going back from N to N minus 1? PROFESSOR: OK. Here, if you look at x_1, x_2, up to
x_N, the disappearing x_N here, no k columns can have all
possible patterns. These k columns could involve the last
column, and could involve only the first N minus 1 columns. Just no k columns whatsoever can
have all possible patterns. So when I look at the reduced one, N
minus 1, I know for a fact that no k columns of these guys can have
all possible patterns. Because that would qualify as
k columns of the bigger one. So k doesn't really change. The only time I had a different k
is when I had a nice argument that, if you have k minus 1 points which
have all possible patterns on the smaller set, then adding the
last column will get us in trouble with k columns. So for that, I needed an argument. But in general, when I take the
statement on face value, k is fixed. And the k columns could be anything. Could be involving the last column, or
could be restricted to the first N minus 1 columns. Could be the first k columns,
for all I care. MODERATOR: How does this formalization
apply to, say, a regression problem? PROFESSOR: Again, this
is all binary functions. So the classification of
+1 and -1. And as I mentioned, the entire analysis,
the VC analysis, can be extended to real-valued functions. It's a very technical extension that, in
my humble opinion, does not add to the insight. And therefore, instead of doing that and
going very technical, in order to gain very little in terms of the
insight, I decided that when I get to regression functions, I am going to
apply a completely different approach, which is the bias-variance tradeoff. It will give us another insight into
the situation, and will tackle the real-valued functions directly,
the regression functions. And therefore, I think we'll have both
the benefit of having another way of looking at it, and covering
both types of functions. MODERATOR: There's this person that
says, I feel silly asking this, but is the bottom line that we can prove
learnability if the learning model cannot learn everything? PROFESSOR: OK. We proved learnability under a condition
about the hypothesis set. When you say learning everything,
you are really talking about the target function. So the target function is unknown. What I am telling you here is that, if
you tell me that there is a break point, I can tell you that if you have
enough examples, E_in will be close to E_out for the hypothesis you pick,
whichever way you pick it. It remains to be seen whether you are
going to be able to minimize E_in, to a level that will make you happy. I will never know that until
you start minimizing. So if the target function happens to be
extremely difficult, or completely random-- unlearnable, you are not
going to see this in the generalization question. The generalization question is
independent of the target function. I didn't bring it up here at all. It has to do with the
hypothesis set only. The target function will come in--
if I get E_in to be small, E_out will be small. I know that from the generalization
argument that I made. Can I get E_in to be small? If the target function is random, you
will get a sample that is extremely difficult to fit. And you are not going to be able
to get E_in to be small. But at least, you will realize that
you could not learn, in that particular case. And in another target function, you will
realize that you can learn, because E_in went down. So the question of whether I can learn
or not, the generalization part of it is independent of the target function. The second question is very much
dependent on the target function, but the good news is that it
happens in sample. I can observe it, and realize how
well, or not so well, I learned. MODERATOR: Also, going back to a previous
question, does this also generalize to multi-class problems? PROFESSOR: Basically, there
is no restriction on the inputs or the outputs. There is a counterpart. Instead of a break point, you ask: what plays the role of a break point? And the dichotomies are not really dichotomies anymore. You have more than two values, so there are technicalities to be done in order to be able to reduce
them to this case. But the same principle applies,
regardless of the type of function you have. MODERATOR: I think that's it. PROFESSOR: Very good. Thank you, and we'll
see you next week.