Lecture 06 - Theory of Generalization

ANNOUNCER: The following program is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we introduced some important concepts in our theoretical development. And the first concept was dichotomies. And the idea is that there is an input space behind this opaque sheet, and there is a hypothesis that's separating red regions from blue regions. But we don't get to see that. What we get to see are just the data points, holes in that sheet if you will. And there could be very exciting stuff happening behind that sheet, and all you get to see is when the boundary crosses one of these points, and a blue point turns red or vice-versa. So if you think of the purpose for the dichotomies, we had a problem with counting the number of hypotheses, because we end up with a very large number. But if you restrict your attention to dichotomies, which are the hypotheses restricted to a finite set of points, the blue and red points here, then you don't have to count everything that is happening outside. You only count it as different when something different happens only on those points. So a dichotomy is a mini-hypothesis, if you will. And it counts the hypotheses only on the finite set of points. This resulted in a definition that parallels the number of hypotheses, which is the number of dichotomies in this case. So we define the growth function. The growth function is-- you pick the points x_1 up to x_N. You pick them wisely, with a view to maximizing the dichotomies, such that the number you get will be more than any number another person gets with N points. That's the purpose. So you take your hypothesis set, which applies to the entire input space, and then apply it only to x_1 up to x_N. This will result in a pattern of +1 or -1's, N of them. And as you vary the hypothesis within this set, you will get another pattern, another pattern, another pattern. So you will get a set of different patterns that are all the dichotomies that can be generated by this hypothesis set, on this set of points. And the number of those guys is what we are interested in. It will play the role of the number of hypotheses. And that is the growth function. Now in principle, the growth function can be 2 to the N. You may be in an input space and a hypothesis set, such that you can generate any pattern you want. However, in most of the cases, the restriction of using hypotheses coming from H will result in missing out on some of the patterns. Some patterns will simply be impossible. And that led us to the idea of a break point. For the case of a perceptron in two dimensions, which is the case we studied, we realize that for four points, there will always be a pattern that cannot be realized by a perceptron. There is no way to have a line come here, and separate those red points from the blue points. And any choice of four points will also result in missing patterns. Therefore, the number k equals 4, in this case, is defined as a break point for the perceptrons. And our theoretical goal is to take that single number, which is the break point, and be able to characterize the entire growth function for every N. And therefore, be able to characterize the generalization, as we will see. We then talked about the maximum number of dichotomies, under the constraint that there is a break point. 
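For reference, here is the growth function written out, in the notation from last lecture (my LaTeX rendering of the slide):

$$ m_{\mathcal{H}}(N) \;=\; \max_{x_1,\dots,x_N \in \mathcal{X}} \big|\,\mathcal{H}(x_1,\dots,x_N)\,\big|, \qquad \mathcal{H}(x_1,\dots,x_N) \;=\; \big\{\, (h(x_1),\dots,h(x_N)) \;:\; h \in \mathcal{H} \,\big\}, $$

so $m_{\mathcal{H}}(N) \le 2^N$ always, and k is a break point if $m_{\mathcal{H}}(k) < 2^k$, that is, if no k points can be shattered.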
And we had an illustrative example to tell you that, when you tell me that you cannot get all patterns on any-- in this case, two points-- that is a very strong restriction on the number of dichotomies you can get on a larger number of points. So this is the simplest case. If you take any two columns, you cannot get all four patterns. That's by decree. I'm telling you that the hypothesis has a break point of 2. And then I'm asking you, under those constraints, how many lines you can get, how many different patterns you can get. And you go and you add them up, and you end up in this case with only four. So you lost half of them. And you can see that if we have 10 points, and you apply the same restriction, there will be so many lost, because now the restriction applies to any pair of points. Now, if you look at this schedule, this does not appeal to any particulars of the hypothesis set or the input space, other than the fact that the break point is 2. I could be in a situation where the hypothesis set cannot generate some of these guys for other reasons. But here, I'm abstracting only a hypothesis set and an input space. I don't want to bother to know more about them. Just tell me that they have a break point, and I'm trying to find, under that single constraint, how many can I possibly have? And I already have, by that combinatorial constraint, a restriction which is strong enough to get me a good-enough result. That's good, because now I don't have to worry about every hypothesis set and every input space you give me. I just ask you: what is the break point? And I'm able to make a statement about the growth function not being bigger than something. That is the key. We move on to today's lecture, and the title is, properly, the Theory of Generalization. It's very theoretical. And today's lecture is the most theoretical of the entire course. So fasten your seat belts, and let's start! We have two items of business. The first one is to show that the growth function, with a break point, is indeed polynomial. The second one is to show that we can actually take that notion, the growth function, and put it in place of M, the number of hypotheses, in Hoeffding's inequality. So basically, we are saying in the first part: it's worthwhile to study the growth function. Because being polynomial will be very advantageous. And then, the second one is: we can actually do something good with it. We can do the replacement. These are the only two items. Let's start. We are going to bound the growth function by a polynomial. And I just wanted to point out some of the aspects of that. If I say m_H of N is polynomial, it's not that I am going to actually solve for the growth function, and show that it is this particular polynomial, and the coefficients. All I am saying is that it is really just bounded above by a polynomial. I don't have to get the particulars of m_H of N, the growth function. I am going to just tell you that this is less than something, less than something, less than a polynomial. That's all I need, because eventually I am going to put this in the Hoeffding inequality. And as long as it's bounded by a polynomial, I am in business. Because the negative exponential will kill it, as we discussed, and we are OK. So we can be a bit loose, which is very good in theory. Because now you leave out a lot of artifacts that you don't need to study, and just talk about the upper bound in the general case, and still get what you want to get.
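As a quick check of the illustrative example, the claim that a break point of 2 caps you at four dichotomies on three points can be verified by exhaustive search (a sketch, not from the lecture; the function names and the brute-force loop are mine):

```python
from itertools import combinations, product

def shattered(rows, cols):
    # True if these rows realize all 2^|cols| patterns on the given columns
    return len({tuple(r[c] for c in cols) for r in rows}) == 2 ** len(cols)

def max_dichotomies(N, k):
    # Largest set of distinct +/-1 rows on N points such that no k columns
    # are shattered -- i.e., such that k is a break point
    all_rows = list(product([-1, +1], repeat=N))
    for size in range(len(all_rows), 0, -1):
        for rows in combinations(all_rows, size):
            if not any(shattered(rows, cols)
                       for cols in combinations(range(N), k)):
                return size
    return 0

print(max_dichotomies(3, 2))  # -> 4, the answer to last lecture's puzzle
```

This exhaustive search only scales to tiny N, which is exactly why the combinatorial analysis that follows is worth the trouble.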
The key quantity we are going to use, which is a purely combinatorial quantity, we are going to call it B of N and k. This is exactly the quantity we were seeking in the puzzle. I give you N points. I tell you that k is a break point, and ask you: how many different patterns can you get under those conditions? In that case, we had three points and the break point was 2. And we answered this question by construction. We played around with the patterns until we got it, and then we said it's 4. Now, as I develop the theory, the puzzle will come up in one of the results. I would like you to keep an eye out and say which slide, and which particular part of the slide, addresses the very specific puzzle we talked about. The definition here is the maximum number of dichotomies on N points, such that they have a break point k. So this is N and this is k. And the good thing here is that I didn't appeal to any hypothesis set, or any input space. This is a purely combinatorial quantity. And because it's a combinatorial quantity, I am going to be able to pin it down exactly, as it turns out. And now, when I pin it down exactly, you go and you find the fanciest input space, and the fanciest hypothesis set. You pick the break point for that, and you use that here, ridding the problem of all the other aspects, and you still are able to make an upper bound statement. You can say that the growth function, for the particular case you talked about, is less than or equal to-- and just go to this combinatorial quantity. The plan is clear. So let's look at the bound for B of N, k. And we are going to do it recursively. It's a very cute argument, and I am going to build it very carefully. So I want your attention. Consider the following table. Very much like the puzzle, we are going to list x_1, x_2, up to x_N. N points, which used to be three points. And I am going to try to put as many patterns as I can, under a constraint that there is a break point. So I will be putting the first pattern this way, and the second pattern, and so on, trying to fill this table. Now, I am going to do a structural analysis of this, and this will happen through this division. Let's look at it. Still the same problem, x_1 up to x_N is my vector. And I am trying to fill this with as many rows as possible, under a constraint of a break point. But now I am going to isolate the last point. Why am I isolating the last point? Because I want a recursion. I want to be able to relate this fellow, to the same fellow applied to smaller quantities. And you have seen enough of that to realize that, if I manage to do that, I might be able to actually solve for B of N and k. That's why I'm isolating the last point. After I do the isolation, I am going to group the rows of the big matrix, into some groups. This is just my way of looking at things. I haven't changed anything. What I am going to do, I am going to shuffle the rows around, after you have constructed them. So we have a full matrix now, and I am shuffling them, and putting some guys in the first group. And the first group I am going to call S_1. Here is the definition of the group S_1. These are the rows that appear only once, as far as x_1 up to x_N-1 are concerned. Well, every row in its entirety appears only once, because these are different rows. That's how I'm constructing the matrix. But if you take out the last guy, it is conceivable that the first N minus 1 coordinates happen twice, once with extension -1, once with extension +1.
So I am taking the guys that go with only one extension, whatever it might be. Could be -1 or could be +1, but not both, and putting them in this group. Fairly well defined. So you fill it up, and these are all the rows that have a single extension. Now, you go under this, and you define the number of rows in this group to be alpha. It is a number. I am just going to call it alpha. And you can see where this is going, because now I'm going to claim that the B of N and k, which is the total number of rows in the entire matrix, is alpha plus something. That is obvious. I have already taken care of alpha, and I am going to add up the other stuff later on. So what is the other stuff? That is the stuff I am going to call S_2. And you probably have a good guess what these are. These are the guys that happen with both patterns. That is, they happen with extension +1 and with -1. That is disjoint from the first group. A typical member will look like this. This is the same guy from x_1 up to x_N-1, as it appears here. It just appears here with +1, and appears here with -1. And I keep doing it. So what I'm doing, I just reorganize the rows of the matrix to fall into these nice categories. The other guy? Exactly the same thing. So the second one corresponds to the second one, and so on. Now, that covers all the rows. I look at x_1 up to x_N-1. I either have both extensions, or one extension. That's it. One extension belongs to the first group. Two extensions belong to the second group in both ways, with +1 and -1. In terms of the counting, this has beta rows, whatever beta might be. This also has beta rows, because they're identical. And therefore, the number B of N and k, which I'm interested in, is alpha plus 2 beta. That is complete. Just calling things names. So now, I am going to try to find a handle on alphas and betas, so that I can find a recursion for the big function B of N and k. B of N and k is the maximum number of rows-- patterns I can get on N points, such that no k columns have all possible patterns. That's the definition. I am going to relate that to the same quantity on smaller numbers, smaller N and smaller k. So the first thing is to estimate alpha and beta. I'd like to ask you to focus on the x_1 up to x_N-1 columns. And I am going to help you visually do that, by graying out the rest. Now for a moment, look at these. Are these rows different? They used to be different when you had the extension. Well, let me see. The first group, I know they are different, because they have one extension. If there is one which is repeated, then it must be repeated with both extensions, in order to get different rows all over, and that violates the condition for being here. They are here because they have only one extension. These guys are the same. This one appears with -1, and here appears with +1. But if you cut the last guy, this guy is identical to this guy, right? This second guy is identical to the second guy. So I cannot count these as different rows. I can do that when I gray out one of the groups. Now, these are patently different. Nothing here is repeated, because we said they have only one extension, and they are all tucked in here. These guys, too-- there are no two guys here that are equal, because they all have the same extension. And supposedly, the whole row is what makes them different rows. Therefore, these guys are different from each other. And these guys are different from here because again, if they are equal, then I will have an extension.
And then the guys here will belong to a row that had both extensions. Very easy. Just a verbose argument, but we end up with these guys being different. Now, I like the fact that these guys are being different, because when they are different, I can relate them to B of N and k. B of N and k was the maximum number of patterns-- different rows, that's how I am counting them-- such that a condition occurs. So what is the condition that is occurring here? I can say that alpha plus beta, which is the total number of rows or patterns in this mini-matrix, can I say something about a break point for this small matrix? Yeah. The original matrix, I could not find all possible patterns on any k columns, right? So I cannot possibly find all possible patterns on any k columns on this smaller set. Because if I find all possible patterns on k columns here, they will serve as all possible patterns in the big matrix. And I know, that doesn't exist. So I can now confidently say that alpha plus beta, which is the number of different guys here, is less than or equal to B of N minus 1, because I have only x_1 up to x_N-1, and k, because that is the break point for these guys as well. Why am I saying less than or equal to, not equal? When I constructed the original matrix, it was equal by construction. I looked at the maximum number of rows I get. And I told you this is what I constructed. And therefore, by definition, this is B of N and k. Here, I obtained this in a special way. I took out a guy from the other matrix, and did that. I am not sure that this is the best way to maximize the number of rows. At least it's conceivably not. But for sure it's at most B of N minus 1 and k, because that is the maximum number. I am safe saying that it's less than or equal to. So I have the first one. Now, let's try to estimate beta by itself. This is the more subtle argument. In this case, we are going to focus now on the second part only, the S_2 part. The guys that appear twice in the big matrix. So let's focus on them. Now, when I focus on them, these guys are very easy to analyze. They are here and here exactly the same. This block is identical to this block. The interesting thing, when I look at these guys, is that I am going to be able to argue that these guys have a break point of k minus 1, not k. The argument is very cute. Let's say that you have all possible patterns on k minus 1 guys, in this small matrix. First, I have to kill these. These are not different guys, because these are identical to these. So let me reduce it to the guys that are patently different. I'm now looking at this matrix. I am claiming that k minus 1 is a break point here. Why is that? Because if you had k minus 1 guys here, where you get all possible patterns, then by adding both copies, +1 and -1, and adding x_N, you will be getting k columns overall that have all possible patterns, which you know you cannot have because k is a break point for the whole thing. So now I'm taking advantage of the fact that these guys repeat. It's very dangerous to have k minus 1 guys, because now I have the k that I know doesn't exist. Let's do it illustratively. Here is a pattern here. You add the +1 extension and the -1 extension, by taking this column. If you get all possible patterns on k minus 1, and you add this guy, then you have both patterns here, and you will end up with all possible patterns on k points on the overall matrix. That enables me to actually count this in terms of B of N and k, again, with the proper values of N and k. 
We can say that beta is less than or equal to-- again, less than or equal to because I obtained this matrix by lots of eliminations. I didn't do it deliberately to maximize the number, so I don't know whether it's the maximum. But I sure know that it's less than or equal to the maximum, by the definition of what a maximum is. And that would be of what? I have N minus 1 points and I argued for a break point of k minus 1. So I end up with this fellow. Both arguments are very simple. Now, we pull the rabbit out of the hat! You put it together. What do we have? This is the full matrix. The first item was just calling things names, the number of rows in the big matrix is B of N and k, by definition-- by construction. I organized it such that there is alpha, and there is beta, and there is another beta, so this one is the first result I got, which is B of N and k equals alpha plus 2 beta. What else did I get? I got that alpha plus beta is at most B of N minus 1 and k. That was the first slide of the analysis. We have seen that. So this basically takes this matrix, and does an analysis on it. And it has a break point k, because k will be inherited when you go to the bigger one. That's what we did. The other one is, beta is less than B of N minus 1 and k minus 1. And this is the case where I only looked at this guy, and now I have to be more restrictive in terms of all possible patterns, because I have an extension to add, and I would be violating the big constraint. So I ended up with this being less than or equal to B of N minus 1 and k minus 1. Anybody notice anything in this slide? How convenient! I have alpha plus 2 beta there, and I have alpha plus beta on one, and beta on one. If I add them, I am in business. I can actually now relate B of N and k to other B of N and k, and alpha and beta are gone. B of N and k, now I know, has to be at most this fellow. So you can see where the recursion is going. Now I know that this property holds for the B of N and k. And now all I need to do is solve it, in order to find an actual numerical value for B of N and k. And that numerical value will serve as an upper bound for any growth function of a hypothesis set that has a break point k. Let's do the numerical computation first. I have this recursion, and I can see that, from smaller values of N and k, I can get bigger values, or I can get an upper bound on bigger values. Let's do it in a table. Here is a table. Here is the value of N-- 1, 2. This is the number of points, the number of columns in the matrix. And this is k. This is the break point I am talking about. So this will be-- there's a break point of 1, break point 2, break point 3, et cetera. And what I'd like to do here, I'd like to fill this table with an upper bound on B of N and k. I'd like to put numbers here, that I know that B of N and k can be at most that number. And we can construct this matrix very, very easily having this recursion. Here's what we do. First, I fill the boundary conditions. Let's look at this. Here it says that there is a break point of 1. I cannot get all possible patterns on one point. Well, what are all possible patterns on one point? -1 and +1. It's one point. So I cannot get both -1 and +1. That's a pretty heavy restriction. So I'm asking myself, let's say you have now N columns in the matrix. How many different rows can you get in that matrix, under that constraint? Well, I'm in trouble.
Because if I have the first pattern, and then I put a second pattern, the second pattern must be different from the first one in at least one column. That's what makes it different. If it's identical in every column, then it's not a different pattern, right? So you go to that point, where it's different. And unfortunately, for that point you get both possible patterns. So we are stuck. We can only have one pattern under this constraint. Hence, the 1's-- 1, 1, 1, 1, 1. That's good. Now, in the other direction, it's also easy. In this case, it's 2. It's very easy to argue. Now, I am taking the case where I have only one column. So I'm asking myself, how many patterns I can get for one column. Well, the most is 2. Why am I getting 2's here? Because in the upper diagonal of this table, the constraint I am putting is vacuous. Here, for example, I am telling you how many different patterns can you get on one point, such that no four points have all possible patterns. Four points, what are you talking about? You have only one point. So that's no constraint at all. Therefore, it doesn't restrict the choices, and the maximum number is the maximum number I would get unrestricted, which happens to be 2. If I have one point, I get two patterns. That's why you have the 2's sitting here. Now, I covered the boundary conditions, and that's really all I need to complete the entire table, given the nature of the constraint I have. Why is that? Because that constraint looks like this. If you know the solid blue guys, I will tell you the empty blue guy. Because this would be-- look at N and k. This is N and k. This would be N minus 1 and k. This would be N minus 1 and k minus 1. That's exactly what this says. So if I have these two points, I can get a value here, which will be an upper bound on this fellow. Let us actually go through this table, and fill it up. The first guy I'm going to take is this 1 and 2. According to this shape, I might be able to get this fellow. What would that fellow be? 3, right? You just add the two numbers. How about the next guy? Anybody have a guess here? OK, 4. And then? A bunch of 4's-- you always get 2 to the N when the constraint is vacuous. I am actually happy about this because you see that when k grows big, much bigger than N, as we said, the constraint is vacuous. So I should be getting all possible patterns on the number of points I have. And as you can see, for 1, I get the 2's. For 2, I will get eventually the 4's. And for 3, it will be the 8's. So that is very nice. Let's go over the next row. Can I solve this one? Now that I got this one, I can become more sophisticated and get this one. See where this came from? How about the next one, what would that be? That should be 7, right? 8. A bunch of 8's. This is kind of fun. And you can fill up the entire table. So we have it completely solved, numerically. It would be nice to have a formula, which we will have in a minute. But numerically, we will have that. Now, let me highlight one guy. Do you see anything that changed colors? I claim that you have seen this before. That's the puzzle. You had three points. Your break point was 2. And now we know for a fact that the maximum number you can get is 4, without having to go through the entire torture we went through last time. Can we try this? Can we try that? Oh, I am violating-- You don't have to do that. Here are the numbers, just by computing a very simple recursion. Now, let's go for the analytic version of that. What I'd like to do, I'd like to find a formula that computes this number outright.
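The table on the slide is easy to reproduce in code, directly from the boundary conditions and the recursion (a sketch; `B_bound` is my name for it, and note that it computes the upper-bound table, not B itself):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def B_bound(N, k):
    # Fill the table using the recursion B(N, k) <= B(N-1, k) + B(N-1, k-1)
    if k == 1:
        return 1   # break point 1: a second row would shatter some column
    if N == 1:
        return 2   # one point, k >= 2: the constraint is vacuous
    return B_bound(N - 1, k) + B_bound(N - 1, k - 1)

# Rows are N = 1, 2, ...; columns are k = 1, 2, ...
for N in range(1, 7):
    print([B_bound(N, k) for k in range(1, 7)])

print(B_bound(3, 2))  # -> 4: the highlighted cell, i.e., the puzzle's answer
```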
I don't have to go through this computation numerically. So let's do that. This is the analytic solution for B of N and k. Again, this is the recursion. And now we have a theorem. Yeah, when you're doing mathematical theory, you have to have theorems. Otherwise, you lose your qualifications! What does the theorem say? It tells you that this is a formula that is an upper bound for B of N and k. What is this formula? This is N choose i, the combinatorial quantity. And you sum this up from i equals 0 to k minus 1. So both N and k appear. N appears as the number here, and k appears as the limit for the index of summation-- appears as k minus 1. This quantity will be an upper bound for B of N and k. You can now, if you believe that-- which we will argue in a moment-- compute this number. And that will be an upper bound for the growth function of any hypothesis set that has a break point k, without asking any questions whatsoever about the hypothesis set or the input space. It shouldn't come as a surprise that this quantity is right, because if you look at this, this is really screaming for something binomial or combinatorial. Clearly, it will come out one way or the other. But why is it this way? Well, what we are going to show, we are going to show that this is exactly the quantity we computed numerically for the other table. And we are going to do it by induction. So the recursion we did, we are just going to do it analytically. How do you do that? You start with boundary conditions. What were the boundary conditions? We argued that this is, indeed, the value of B of N and k. And hence, an upper bound on it, from the last slide. Now we want to verify that this quantity actually returns those numbers, when you plug in the value N equals 1 or k equals 1. How do I do that? You just do it. Just plug in, and it will come out. I'm not going even to bother doing it. It's a very simple formula. You just evaluate it, and you get that. The interesting part is the recursion. I would like to argue that if this formula holds for the solid blue points, then it will also hold for this guy. And then by induction, since it holds for all of these guys, I can just do this step by step and fill the schedule, with the truth of this being the correct value for the numbers that appear here. Everybody is clear about the logic of it? So let's do the induction. We have the induction step. We just want to make clear what the induction step is. You are going to assume that the formula holds for this point and this point. So indeed, if you plug in the values for N and k-- which here are N minus 1 and k minus 1, and here N minus 1 and k-- you plug them into that particular formula. Then the numbers will be correct. That's the assumption. And then you prove that, if this is true, then the formula will hold here. That's the induction step. Fair enough. So let's do that. This is the formula for N and k. You just need to remember it. N appears here and k appears here. Minus 1 is an integral part of the formula. This is the value for k, not for k minus 1. The value, for k, happens to be the sum from i equals 0 to that k, minus 1. So this is the formula that is supposed to be here. And we would like to argue that this is equal to-- what is this one? This one is for N minus 1, and still k. So this would be-- I have moved from here to here. So this will be the value here. And what is the other guy? That will be the value for N minus 1. And now it's for k minus 1, because you still take the other minus 1.
It becomes k minus 2. This part belongs here. So this is the induction step. We don't have it yet. That's what we want to establish. So let me put a question mark to make sure that we haven't established it yet. What I am going to do, I am going to take the right-hand side, and keep reducing it, until the left-hand side appears. That's all. And then we'll be done with the induction step. And since we have the boundary conditions, we will have proved the theorem we asserted. The first thing I am going to do, I am going to look at this fellow. And I notice that the index goes here from i equals 0 to k minus 1. Here, it goes from i equals 0 to k minus 2. I'd like to merge the two summations. So in order to merge the two summations, I will make them the same number of terms, first. Very easy. I will just take the zeroth term, which would be N minus 1 choose 0, which is 1, out. And now the summation goes from i equals 1 to k minus 1. Now, I go to the other guy and do this. What did I do? I just changed the name of the dummy variable i. I wanted the index to go from 1 to k minus 1, in order to be able to merge it easily. Here, it goes from 0 to k minus 2. So what do I do? I just make this i, and make this i minus 1. So i minus 1 goes from 0 to k minus 2, as this i used to. Just changing the names. And now, having done that, I am ready to merge the two summations. And they are merged. Now, I would like to be able to take this, and produce one quantity. And you can do it by brute force. This is no mysterious quantity. This is what? This is N minus 1 times N minus 2 times N minus 3, i terms, divided by i factorial. And the same thing applies to this one. So you end up with something, and then you do all kinds of algebra, and it looks familiar. And then you reduce it to another quantity. So there's always an algebraic way of reducing it. But I am going to reduce it with a very simple combinatorial argument. I am going to claim that this is-- the 1 remains the same. And this actually, the whole thing here reduces to N choose i. So these two guys become this one. Instead of doing the algebra, I am going to give you a combinatorial argument. That is, this quantity is identical to N choose i. Let's say that I am trying to choose 10 people from this room. And let's say that the room has N people. There are N people. How many ways can you choose 10 people out of this room? That is N choose 10. Let's put this on the side. Here is another way of counting it. We can count the number of ways you can pick 10 people, excluding me, plus the number of ways you can pick 10 people, including me. Right? These are disjoint, and they cover all the cases. Let's look at excluding me. How many ways can you pick 10 people from the room, excluding me? Well, then you are picking the 10 people from N minus 1. I am the minus 1. So that would be N minus 1 choose 10. Put this in the bank. How many ways can you pick 10 people, including me? Well, you already decided you are including me, so you are only deciding on the 9 remaining guys. So that would be N minus 1 choose 9. So we have N minus 1 choose 10, plus N minus 1 choose 9, that equals the original number, which was N choose 10. Look at this. What do we have? This is excluding me. This is including me. And this is the original count. So it's a combinatorial identity, and we don't have to go through the torture of the algebra in order to prove that it's exactly the same. Now, I go back. I look, this goes from i equals 1 to k minus 1.
I have this 1, so I conveniently put it back, and get this formula. Have you seen it before? Yeah, it looks familiar. Oh, this is the one we want to prove. So it means that we are done. That's it. We have an exact solution for the upper bound on B of N and k. Since we spent some time developing it, let's look at it and celebrate it, and be happy about it. First thing: yes, it's a polynomial, because all of this torture was to get a polynomial, right? If we did all of this, and it's perfect math, and the end result was not a polynomial, then we are in trouble. Because although the quantity is correct, it's not going to be useful in the utility that we are aiming at. So why is it polynomial? Remember that for a particular hypothesis set, the break point is a fixed number. It doesn't grow with N. You ask in a hypothesis set, can I get all possible dichotomies on four points? That's a question for the perceptron. No. Then 4, in the perceptron, is a break point. Now, I can ask myself what the perceptron does on 100 points. And the break point is still 4, just a constant. You give me a hypothesis set, I give you a break point. That's a fixed number. So according to our argument now, the growth function for a hypothesis set that has a break point k is less than or equal to the purely combinatorial quantity, B of N and k, which is defined as the maximum such number of dichotomies you can get, under the constraint that k is a break point. And that was less than or equal to the nice formula we had. So we can now make this statement. You go in a real learning situation. Let's say you have a neural network making a decision, and you tell me the break point for that neural network is 17. I don't ask you what is a neural network, because we don't know yet, so you don't have to know. I don't ask you what is the dimensionality of the Euclidean space you are working on. You told me 17. Your growth function of your neural network that I don't know, in the space that I don't know, happens to be less than or equal to that, and I know that I'm correct. So is this quantity polynomial in N? That's what we need. Because remember, in the Hoeffding, there was a negative exponential in N. If we get this to be polynomial in N, we are in business. Well, any one of those guys is what? N times N minus 1 times N minus 2, i times, divided by i factorial. i factorial doesn't matter, it's a constant. So you basically get N multiplied by itself a number of times, i times, for the i-th term. The most that N will be multiplied by itself is when you get to i equals k minus 1, the maximum. And then N will be multiplied by itself k minus 1 times. Therefore, the maximum power in this quantity is N to the k minus 1. This comes from N times N minus 1 times N minus 2 times-- k minus 1 times, that corresponds to the case where i equals k minus 1. When you get N choose k minus 1, that's what you get. Anything else will give you a power of N, but it's a smaller power. This is the most you will have. What do we know about this fellow k? We know it's just a number. It's a constant. It doesn't change with N. And therefore, this is indeed a polynomial in N. And we have achieved what we set out to do. That is pretty good. Let's take three examples, in order to make this relate to experiences we had before. This is the famous quantity by now. You know it by heart. I have the N. I remember k. I have to put minus 1. And that is the upper bound for anything that has a break point k.
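To have the result of the last few slides in one place, here is the theorem and the induction step written out (my LaTeX rendering of what is on the slides):

$$ B(N,k) \;\le\; \sum_{i=0}^{k-1} \binom{N}{i}. $$

The induction step combines the recursion $B(N,k) \le B(N-1,k) + B(N-1,k-1)$ with the "including me / excluding me" identity $\binom{N-1}{i} + \binom{N-1}{i-1} = \binom{N}{i}$:

$$ \sum_{i=0}^{k-1} \binom{N-1}{i} \;+\; \sum_{i=0}^{k-2} \binom{N-1}{i} \;=\; 1 \;+\; \sum_{i=1}^{k-1} \left[ \binom{N-1}{i} + \binom{N-1}{i-1} \right] \;=\; \sum_{i=0}^{k-1} \binom{N}{i}, $$

and since the top term is $\binom{N}{k-1} = O(N^{k-1})$ with k a constant, the bound is a polynomial of degree k minus 1 in N.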
Now, let's take hypothesis sets we looked at before, for which we computed the growth function explicitly, and see if they actually satisfy the bound. They had better, because this is math. We proved it. But just to see that this is, indeed, the case. Positive rays. Oh, remember positive rays from some time ago? We have one dimension, so just the real line. Then we take from a point on, it goes to +1. And we said that the whole analysis here is exactly to avoid what I just did. You don't have to tell me what is the input. You don't have to tell me-- you just have to tell me what? What is the break point? That's all we want. You can call it positive rays. You can call it George, I don't care! It has a break point of 2. That's what I pull. We did compute the growth function for the positive rays. We did it by brute force. We looked at it, and we see what the patterns are, and did a combinatorial argument. And we ended up, that the growth function for this guy is N plus 1. Let us see if this satisfies the bound. This is supposedly less than or equal to. And you substitute here for N, which is the number here. And the break point is k. So you're summing, from i equals 0 to 1, this quantity. You have N choose 0, also known as 1. Plus N choose 1, also known as N. That's it. So you get this to be less than or equal to-- wow! Look at the analysis we did to get the N plus 1. And we get exactly the same. With all the bounds and-- we think that there is a big slack here. But here, actually it's exactly tight. We get the same answer exactly, without looking at anything of the geometry of what the hypothesis set was. Let's try another one. Maybe we'll continue to be lucky. Positive intervals. Yeah, I remember those were the more sophisticated-- oh, I'm sorry. I am not supposed to ask any questions about the hypothesis. I'm asking about the break point only. I remember now. So tell me what the break point is. That was k equals 3. And we did compute the growth function. Remember, this one was a funny one. We're picking two segments out of N plus 1, and then adding the 1. So we ended up-- this would be the formula. What would be the bound according to the result we just had? This would be again, this formula. And now k equals 3. So I have N choose 0 plus N choose 1 plus N choose 2. I get 1 plus N plus something that has squared terms. And I do the reduction and what do I get? Boring, boring. I seem to be getting it all the time. It doesn't always happen this way. It will always happen that it's true. But there will be a slack in many cases. So, we verified it. We are very comfortable now with the result. Let's apply it to something where we could not get the growth function. Remember this old fellow? Well, in the two-dimensional perceptron, we went through a full argument just to prove that the break point is 4. But we didn't bother to go through a general number of points N, and ask ourselves, how many hypotheses can the perceptron generate on N points? Can you imagine the torture? We do this-- can I get this pattern-- And you have to do this for every N. So we didn't do it. The growth function is unknown to us. We just know the break point. But using just that fact, we are able to bound the growth function completely. And you substitute again with k equals 4. You get another term, which is cubic. And you do the reduction. And lo and behold, you have that statement. That statement holds for the perceptrons in two dimensions. And you can see that this will apply to anything.
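These three checks can be scripted in a few lines (a sketch; `bound` is my name for the theorem's sum, and the closed-form growth functions are the ones quoted in the lecture):

```python
from math import comb

def bound(N, k):
    # The theorem's bound: sum_{i=0}^{k-1} C(N, i), a polynomial of degree k-1
    return sum(comb(N, i) for i in range(k))

for N in range(1, 11):
    # Positive rays: m_H(N) = N + 1, break point k = 2 -- the bound is exactly tight
    assert N + 1 == bound(N, 2)
    # Positive intervals: m_H(N) = C(N+1, 2) + 1, break point k = 3 -- also exactly tight
    assert comb(N + 1, 2) + 1 == bound(N, 3)

# 2D perceptron: break point k = 4; the growth function itself is unknown,
# but it is bounded by this cubic polynomial in N
print([bound(N, 4) for N in range(1, 6)])
```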
So now it was all worth the trouble, because now we have a very simple characterization of hypothesis sets. And we can take this, and move to the other part. Remember, this part, which has now disappeared, is proving that it's polynomial. Proving that we are interested in the growth function. If it wasn't polynomial, we wouldn't be interested in it. Now, this is an interesting quantity. This one tells us that-- oh, and by the way, it's not only interesting. You actually can use it. We can put it in the Hoeffding inequality, and claim that the Hoeffding inequality is true using the growth function. Now let's see what we want, to remind you of the context of substituting for the total number of hypotheses by the growth function. We wanted, instead of having this fellow-- this is Hoeffding, and this is the number of hypotheses using the union bound, which we said is next to useless whenever M is big, or M is infinite. And instead of that, we wanted really to substitute for this by the growth function. So this is what we are trying to do. We are trying to justify that instead of this, you can actually say that. Well, it turns out that this is not exactly what we are going to say. We are going to modify the constants here for technical reasons that will become clear. But the essence is the same. There would be the growth function here. It will be polynomial in N, and it will be killed by the negative exponential, provided that there is a break point-- any break point. Now, how are we going to prove that? We are going to have a pictorial proof. What a relief, because I think you are exhausted by now. So the formal proof is in the Appendix. It's six pages. My purpose of the presentation here is to give you the key points in the proof, so that you don't get lost in epsilon's and delta's. There are basically certain things you need to establish. And once you know that that's what you are looking for, you can bite the bullet and go through it line by line. The two aspects are the following. Why did we do this growth function? We used the growth function because it's smaller, so it will be helpful. But how could it possibly replace M? Because M was assuming no overlaps at all in the bad regions. Remember? So now that we know that there are overlaps, this will take care of it. The question is, how does the growth function actually relate to the overlaps? You need to establish that. So this is the first one. And when we establish that, we find that it's a completely clean argument at everything, except for one annoying aspect. Growth function relates to a finite sample. So we will get a perfect handle on the E_in, the in-sample error part of the deal. But in the Hoeffding inequality, there is this E_out. And E_out relates to the performance over the entire space. So we are no longer talking about dichotomies, we are talking about full hypotheses. We lose the benefit of the growth function. So what do we do about E_out? That was a question that was asked last time. What to do about E_out, in order to get the argument to conform while we are just using a finite sample, is the second step. After that, it's a technical putting it together, in order to get the final result. That's the plan. But the proofs are pictures. So let's have a blank page. And let's say you are an artist and this is your canvas. It's a very special canvas. It's the canvas of data sets. What is that? Every point here is an entire data set, x_1, x_2, up to x_N. Fix N in your mind. So this is one vector. This is another vector.
This is another vector. And this canvas covers the entire set of possible data sets. Now, why am I doing this space? Well, I am doing this space because the event of being good or bad, whether E_in goes to E_out, depends on the sample. Depends on the data set. For some data sets, you will be close to the E_out. For some data sets, you are not going to be close. So I want to draw it here, in order to look at the bad regions and the overlaps, and then argue why the growth function is useful for the overlaps. Now, we assume that there's a probability distribution. And for simplicity, let's say that the area corresponds to the probability. So the total area of the canvas is 1. Now, you look at the event, the bad event, that E_in is far away from E_out. And let's say that you paint the points that correspond to that event red. So you pick-- is this data set good or bad? What does it mean, good or bad? I look at E_in in that data set, compare it to E_out on a particular hypothesis, and then paint it red if it's bad. So I have a hypothesis in mind, and I am painting the points here red or leaving them alone, according to whether they violate Hoeffding's inequality or not. And I get this, just illustratively. And you realize that I didn't paint a lot of area. And that is because of the Hoeffding inequality. The Hoeffding inequality tells me that that area is small. So I'm entitled to put a small patch. Now we went from one hypothesis, which is this guy, to the case where we have multiple hypotheses, using the union bound. So again, this is the space of data sets, exactly the same one. And now, I am saying for the first hypothesis, you get this bad region. What happens when you have a second hypothesis? Because I am using the very pessimistic union bound, I am hedging my bets and saying you get a bad region that is disjoint. Another hypothesis-- two of them. More. More. Oh, no. We are in trouble. The colored area is the bad area. Now the canvas is the bad area. That's why we get the problem with the union bound. Because obviously, having them disjoint fills up the canvas very quickly. Each of them is small, but I have so many of them. Infinity of them as a matter of fact. This will overflow. Well, no, it won't overflow. Just figuratively speaking. So that's what I'm going to have. What is the argument we are applying now? We are not applying the union bound. We are going to a new canvas. And that canvas is called the VC bound, as in Vapnik-Chervonenkis. We'll see it in a moment. So what do you do? Your first hypothesis, same thing. When you take the second hypothesis, you take the overlaps into consideration. So it falls here. You get more. You get more. You get all of them. It's not as good as the first one. I never expected it to be. But definitely not as bad as the second one, because now they are overlapping. And indeed, the total area, which is the bad region-- something bad happening-- is a small fraction of the whole thing. And I can learn. So we are trying to characterize this overlap. That's the whole deal with the growth function. One way to do it is the one that I alluded to before. Study the hypothesis set. Study the probability distribution. Get the full joint probability distribution of any two events involving two hypotheses, and then characterize this. Well, good luck! We won't do that. The reason we introduced the growth function is that it's an abstract quantity that is simple, and is going to characterize the overlaps.
The question is, how is the growth function going to characterize the overlaps? Here is what is going to happen. I will tell you that if you look at this canvas, if any point gets painted at all, it will get painted over 100 times. Let's say that I have that guarantee. I don't know which hypotheses will paint it again. But any point that gets a red, it will get a blue, and a green-- 100 times. If I tell you that statement, what do you know about the total area that is colored? Now it's, at most, one hundredth of what it used to be. Because when I had them non-overlapping, they filled it up. Now for every point that is colored, I have to do it 100 times. So I am overusing these guys, and these guys will have to shrink. And I will get one hundredth of that. That is basically the essence of the argument. What the growth function tells you is that-- what is the growth function? Number of dichotomies. If you take a dichotomy, this is not the full hypothesis, but the hypothesis on a finite set of points. There are many, many hypotheses that return the exact same dichotomy. Remember the gray sheet with the holes. Lots of stuff can be happening behind the sheet. And as far as I am concerned, they are all the same dichotomy. So all of these guys will be behaving exactly the same way. If one of them colored the point, the others will. This tells me that the redundancy is captured by the growth function. That would be a very clean argument. And it would have been a very simple proof, except for one annoyance. That the point being colored doesn't depend only on the sample, but depends also on the entire space. Because the point gets colored because it's a bad point. What is a bad point? The frequency on the sample, that is patently on the sample, deviates from E_out. Oh, E_out involves the entire hypothesis. If I have the gray sheet and the holes, I cannot compute E_out. I have to peel it off, look, and get the areas in order to get E_out. So the argument is great, as long as you can tell me how do I go around the presence of E_out? And that's the second part of the proof. What to do about E_out. The simplest argument possible. That is really the breakthrough that Vapnik and Chervonenkis did. Back to the bin, just because it's an illustration of the binary case. So here, we have one hypothesis. And we have E_out, which is the overall, in the entire space-- the error in the entire space. We pick a sample, and then we get E_in, which is the value for the error on this one. So we have seen this before. And we said this tracks this, according to the Hoeffding inequality. And the problem is that when you have many, many bins, some of these guys will start deviating from E_out, to the level that if you pick according to the sample, you are no longer sure that you picked the right one, because the deviation could be big. That was the argument. Now, I want to get rid of E_out. The way I am going to do it is this. Instead of picking one sample, I am going to pick two samples, independently. So obviously, they are not identical samples. Some of them are green or red, et cetera. But they are coming from the same distribution. Now, let's see what is going on. E_out and E_in track each other, because E_in is generated from this distribution. Now, let's say I look at these two samples and give them names. I am going to call them E_in and E_in dash. They're both in-sample. It happens to be a different sample. So I have two samples. I am going to call this E_in, and this E_in dash.
My question is, does E_in track E_in dash, if you have one bin? Well, each of them tracks E_out, right? Because it was generated by it. So consequently, they track each other. A bit more loosely, because you have now two ways of getting the sample error. On the other hand, if I do two presidential polls-- one poll asks 3000 people. Another asks 3000 people. These are different 3000 people. You fully expect that the result will be close to each other, right? So these guys track each other. OK, fine. What is the advantage? The advantage is the following. If I now have multiple bins, the problem I had here is reflected exactly in the new tracking. When I had multiple bins, the tie between E_out and E_in became looser and looser. Because I'm looking for worst case, and I might be unlucky enough, that the tracking now lost the tightness that one bin with Hoeffding would dictate. If I am doing multiple bins, and not looking at the bin at all, just looking at the two samples from each bin, they track each other, but they also get loosely apart as I go for more. Let's say, I tell you this experiment. You pick two samples. They are close in terms of the fraction of red. If you keep repeating it, can you get one sample to be mostly red, and the other sample to be mostly green? Yeah. If you are patient enough, it will happen. Exactly for the same reason, because you keep looking for it until it happens. So the mathematical ramifications of multiple hypotheses happen here, exactly the same way they happen here. The finesse now is that, if I characterize it using the two samples, then I am completely in the realm of dichotomies. Because now I'm not appealing to E_out at all. I am only appealing to what happens in a sample. It's a bigger sample. I have 2N marbles now instead of N. But still, I can define a growth function on them. And now the characterization is full, and I am ready to go. These are the only two components you need to worry about as you read the proof. Now, let's put it together. This is what we wanted. This is not true. Don't hold this against me. And to make sure, this is not quite what we have. This would be direct substitution of the plain-vanilla growth function in place of M. We are not going to have that, but we are going instead to have this. Let's look and compare. These look the same, except that this 2 became 4. Is this good or bad? Well, it's bad. We want this probability to be small. Bad, but not fatal. This one goes to here. I have twice the sample. You know why I have 2 now. Because now I use the bigger sample for the argument, so I need 2N. Oh, but all of this was about a polynomial and now I don't know whether this will be a polynomial. Yes, you do. If it's polynomial in N, it's polynomial in N here. Because you get 2N to the k, then you get 2 to the k. That's a constant. And you still get N to the k. So that remains a polynomial. A bigger polynomial. I don't like it, but you don't have to like it. It just has to be true, and do the job we want. And finally, you can see this is minus 2, which was a very helpful factor. This is in the exponent. 2 in the exponent gets a lot of mileage. And now we knock it down all the way to 1/8. That's really, really bad news. The reason this is happening is that, as we go through the technicalities of the proof, the epsilon will become epsilon over 2. And then will become epsilon over 4, just to take care of different steps. And when you plug in epsilon over 4 here, you get epsilon squared over 16. And so you get a factor of 1/8.
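Written out, the final statement-- with the modified constants-- reads as follows (my LaTeX rendering of the slide):

$$ \mathbb{P}\left[\, \left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon \,\right] \;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8}\epsilon^2 N}, $$

as opposed to the "not quite" version $2\, m_{\mathcal{H}}(N)\, e^{-2\epsilon^2 N}$ that a plain substitution of the growth function for M would give.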
That is the reason for the 1/8. So this is what we will end up with. And you can be finicky and try to improve this constant a lot, but the basic message is that here is a statement that holds true for any hypothesis set that has a break point. And this fellow is polynomial in N, with the order of the polynomial decided by the break point. And you will eventually learn, because if N is big enough-- if I give you enough examples-- using that hypothesis, you will be able to claim that E_in tracks E_out correctly. This result, which is called the Vapnik-Chervonenkis inequality, is the most important theoretical result in machine learning. On that happy note, we will stop here and take questions after a short break. Let's start the Q&A. MODERATOR: First, a few clarifications from the beginning. In slide 5, when you choose the N points, does it mean your data set is of N points, or you just chose N points from the data set? PROFESSOR: When I apply this to an actual hypothesis set in an input space, then these actually correspond to a particular set of points in that space. However, in the abstraction that just defines the function B, these are just abstract labels. These are labels for which column I'm talking about. So although I call them x_1 up to x_N-1, these are not really-- in the abstraction here, they don't correspond to any particular input space in mind. But when they do, they will correspond to a sample. And I am supposed to pick the sample in that space that maximizes the number of dichotomies, et cetera, as we defined the growth function. But it's a sample that I pick when I apply this to a particular input space and a hypothesis set. MODERATOR: Also, there are some people asking-- they didn't understand exactly why alpha was different to beta. PROFESSOR: alpha is different from beta? MODERATOR: Yeah. Why? PROFESSOR: Well, the short answer is that I never made the statement that alpha is different from beta. I just didn't bother to ascertain any relationship between alpha and beta. I just called them names. If they happen to be equal, I am happy. If they happen to be unequal, I am happy. So all I'm doing here is just calling the guys that happen to have a single extension, the number of them, calling it alpha. Calling the guys that happen to have double extension beta. I don't know whether alpha is bigger than beta, or smaller than beta, in any particular case. And it doesn't matter as far as the analysis is concerned. If I call them this way, then it will always be true that the total number of rows here is alpha plus beta plus beta, which is alpha plus twice beta. So there is really no assertion about the relative value of alpha and beta. MODERATOR: Moving on to the case where you show the break points, and how it satisfies the bound. What happens if k equals infinity? No break points, basically. PROFESSOR: This is for the positive rays and whatnot? MODERATOR: Yeah. So for example, if you had the convex sets. PROFESSOR: k equals infinity means there is no break point. In that case, you don't have to bother with any of the analysis I did. No break points means what? Means the growth function is 2 to the N for every N, right? We just computed it exactly. If you want a bound for it, yes, it's bounded by 2 to the N. Not a polynomial. So all of these cases, we're addressing the case where there is a break point, because that is the case where I can guarantee a polynomial. And therefore, I can guarantee learning. That is the interesting case.
If there is no break point, this theoretical line of analysis will not guarantee learning. So if I have a hypothesis set that happens to be able to shatter every set of points, I cannot make a statement using this line of analysis that it will learn. And one example we had was convex sets. So convex sets have a growth function of 2 to the N. Well, it really is a very pessimistic estimate here, because the points have to be really, really very funny. You have to build the pathological case, in order not to be able to learn. And in many cases, you might be able to. But again, if I want a uniform statement based only on the break point, this is the most I can say using this line of analysis. MODERATOR: OK. Just a quick review. How is the break point calculated? PROFESSOR: Calculated. The break point is-- this is the only time you actually need to visit the input space and the hypothesis set. You basically-- you are sitting in a room with your hypothesis set. Someone gave you a problem for credit approval. You decided to use perceptrons, and you decided to use a nonlinear transformation. And you do all of that, and you start programming it. And you would like to know some prediction of the generalization performance that you are going to get. So you go into the room, and ask yourself: for this hypothesis set, over this input space, what is the break point? So now you have to actually go and study your hypothesis set. And then find out that using this hypothesis set, I cannot separate, let's say, 10 points in every possible way. Very much along the argument we used for the perceptron in two dimensions. We found out that we cannot separate four points in every possible way. But the good news is that, you don't have to do it anew, because for most of the famous learning models, this has already been done. For the perceptrons, we will get an exact break point. For any-dimensional perceptron. So 20-dimensional perceptron, here's the break point, and here's the growth function. Or, here's the bound. Similarly, for neural networks, there is a break point. Not an exact estimate of the break point, but a bound on the break point. And again, in most of these cases, bounds work, because we are always trying to bound above. And we have room to play with, because a polynomial is a polynomial is a polynomial. So if you become a little bit sloppy and forget something, and the break point-- you say 10 instead of 7-- it's not going to break the back of learning versus non-learning. It's just going to tell you more pessimistically, how many resources do you need in order to learn? Which is more benign damage than deciding, oh, I cannot learn at all. MODERATOR: OK. Can you come up with an example, where these bounds are not as tight as here? PROFESSOR: There's one case, which I could have covered but I didn't, where you take positive and negative rays. So positive rays, you take the real line. And from a point on it's +1. Before, it's -1. Positive and negative rays, it means you are also allowed to take rays that return +1 first, and then -1 later. And the union of them is the model called positive and negative rays. It's a good exercise to do. Take that home and try to find, what is the break point? And you'll find that although the break point for positive rays is 2, in this case the break point is actually 3. And the reason is that for two points, now you can get everything because you know the ray can be here. So they are both minus. The ray could be here. They are both plus. The ray could be here. It's minus plus.
MODERATOR: And this question drives at the point of the whole lecture. It says, we have been focusing on having E_in equal to E_out, or close to E_out, not on the actual value of E_in. So using our hypotheses, there are just as many percentage errors in the training data as in the real data. Why is that? PROFESSOR: This goes back to separating the question of learning into the two questions. There was one question, which was addressed now: we are trying to get E_in to track E_out. Why do I need that? Because I don't know E_out, and I will not know E_out. That is simply an unknown quantity for me. And I want to work with something to minimize. I cannot minimize something that I don't know. So if the theoretical guarantees tell me that E_in is a proxy for E_out, and that if I minimize E_in, E_out will follow suit, then I can now work with a quantity that I know, and do it. That's the first part, that they are tracking each other. The second part is the practical one: now I am going to go and try to minimize E_in. This is the second part of it. MODERATOR: Also, they're asking if-- can you clarify more, why is the VC dimension useful? PROFESSOR: The VC dimension, as of now, is an unknown quantity. I didn't say the words "VC dimension" at all. I said every building block that will result in the definition. However, the good news is, what is the title of the next lecture? The VC dimension. You will be completely content with everything you wanted to know about the VC dimension, and weren't afraid to ask! OK? MODERATOR: Yeah, the crowd is saying that they're still digesting the lecture. PROFESSOR: OK. As I mentioned before, if you didn't follow this in real time, don't be discouraged. It's actually very sweet material. And you can look at the lecture again. And you can read the proof. And you can do all of the homework, until it settles in your mind. This is the most abstract, or the most theoretical, lecture of the entire course. And if you get through this one, and you understand it well, you are in good shape as far as the rest of the course is concerned. There will be mathematics, but it will be more friendly mathematics. Friendly, as in less abstract. For someone who is not theoretically inclined, the more abstract the mathematics is, the less they can follow it, because they cannot relate to it. So this one has the abstraction. The other mathematics will be much easier to relate to. MODERATOR: What was wrong with the "not quite" expression on the last slide? PROFESSOR: OK. Basically, the top statement is simply false.
It was my way of relating what I'm trying to do to what has already happened. There used to be M in place of the growth function. So the growth function is here; there used to be M. So the easiest way for me to describe what is happening with the theory is to tell you that you are going to take M out, and replace it with this. As usual, it's not that easy. Remember even with the Hoeffding, when I complained about the 2 here and the 2 here? Well, you have to have them in order for the statement to be true. So for the statement to be true, we needed to do some technical stuff that really didn't change the essence of the statement here, but made it a little bit different by changing the constants. And therefore, we have a proof for it. It holds. And it captures the essence of that. I just didn't want to bother telling you this, because if I told you this in the first place, you would have been completely lost. Why 4? Why 2? What is this 1/8? And forget about the essence. So the easiest way to do it is to say: we are replacing it. This is not the final form, but we are replacing it. Until you get the idea: indeed, I can replace it. But oh, in order to replace it, I need to have the bigger sample that we argued for. So I need 2N. Oh, and now the bigger samples are not tracking each other as well as each of them is tracking the actual out-of-sample error. So I need to modify these values, and so on. So it becomes much easier to swallow. The technicalities will come in, in order to make the proof go through. MODERATOR: OK. Can you review the definition of B of N and k? PROFESSOR: B of N and k. Assume you have N points, and assume that k is a break point. So you're assured that no k points will have all possible patterns on them. After these two assumptions, make no further assumptions. You don't know where this came from. You don't know what space you are working with. You don't know what the hypothesis set is. You just know that in your setup, when you get N points-- that is the N here-- and the break point is k-- that is the k here. Under those conditions, can you bound the growth function? Can you tell me that the growth function can never be bigger than something? That something is what I am calling B of N and k. So what am I doing? I'm taking the minimal conditions you gave me, I have N points and k is a break point, and asking myself: what is the maximum number of dichotomies you can possibly have, under no constraints other than these two? And I'm calling this B of N and k. Why did I do it? First, it's going to help, being an upper bound for any hypothesis set that has a break point k, because it is the maximum. The second one, it's a purely combinatorial quantity. So I have a much better chance of analyzing it, without going through the hairy details of input spaces, and correlation between events, and so on. And that is indeed what ended up being the case. We had a very simple recursion on it, and we found a formula for it. And that formula now serves as an upper bound for the more hairy quantity, which is the growth function that is very particular to a learning situation, an input space, and a hypothesis set.
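Because B of N and k is purely combinatorial, it is also easy to compute. Here is a small sketch in Python, my illustration rather than the lecture's code, implementing the recursion from the alpha plus 2 beta argument, B(N, k) <= B(N-1, k) + B(N-1, k-1), and checking it against the closed-form polynomial bound, the sum of C(N, i) for i = 0 to k-1:

from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    # Upper bound on B(N, k) built from the recursion in the lecture.
    if k == 1:
        return 1   # break point 1: no point may show both +1 and -1
    if N == 1:
        return 2   # a single point can show both patterns when k >= 2
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

def closed_form(N, k):
    # The polynomial bound: sum of binomial coefficients C(N, i), i < k.
    return sum(comb(N, i) for i in range(k))

for N in range(1, 8):
    for k in range(1, 5):
        assert B_upper(N, k) == closed_form(N, k)
print("recursion and closed form agree")

The closed form is a polynomial of degree k minus 1 in N, which is exactly why a single break point is enough to make the growth function polynomial.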
MODERATOR: Also, a particular question on the proof of B of N and k, the recursion. Slide 5. The question is, why does k not change when going back from N to N minus 1? PROFESSOR: OK. Here, if you look at x_1, x_2, up to x_N, with the disappearing x_N here, no k columns can have all possible patterns. These k columns could involve the last column, or could involve only the first N minus 1 columns. Just no k columns whatsoever can have all possible patterns. So when I look at the reduced one, N minus 1, I know for a fact that no k columns of these guys can have all possible patterns, because that would qualify as k columns of the bigger one. So k doesn't really change. The only time I had a different k is when I had a nice argument that, if you have k minus 1 points which have all possible patterns on the smaller set, then adding the last column will get us in trouble with k columns. So for that, I needed an argument. But in general, when I take the statement at face value, k is fixed. And the k columns could be anything. They could involve the last column, or could be restricted to the first N minus 1 columns. Could be the first k columns, for all I care. MODERATOR: How does this formalization apply to, say, a regression problem? PROFESSOR: Again, this is all binary functions. So the classification of +1 and -1. And as I mentioned, the entire analysis, the VC analysis, can be extended to real-valued functions. It's a very technical extension that, in my humble opinion, does not add to the insight. And therefore, instead of doing that and going very technical, in order to gain very little in terms of the insight, I decided that when I get to regression functions, I am going to apply a completely different approach, which is the bias-variance tradeoff. It will give us another insight into the situation, and will tackle the real-valued functions directly, the regression functions. And therefore, I think we'll have both the benefit of having another way of looking at it, and covering both types of functions. MODERATOR: There's a person who says, I feel silly asking this, but is the bottom line that we can prove learnability if the learning model cannot learn everything? PROFESSOR: OK. We proved learnability under a condition about the hypothesis set. When you say learning everything, you are really talking about the target function. So the target function is unknown. What I am telling you here is that, if you tell me that there is a break point, I can tell you that if you have enough examples, E_in will be close to E_out for the hypothesis you pick, whichever way you pick it. It remains to be seen whether you are going to be able to minimize E_in to a level that will make you happy. I will never know that until you start minimizing. So if the target function happens to be extremely difficult, or completely random-- unlearnable-- you are not going to see this in the generalization question. The generalization question is independent of the target function. I didn't bring it up here at all. It has to do with the hypothesis set only. The target function will come in-- if I get E_in to be small, E_out will be small. I know that from the generalization argument that I made. Can I get E_in to be small? If the target function is random, you will get a sample that is extremely difficult to fit, and you are not going to be able to get E_in to be small. But at least you will realize that you could not learn, in that particular case. And with another target function, you will realize that you can learn, because E_in went down. So for the question of whether I can learn or not, the generalization part of it is independent of the target function. The second question is very much dependent on the target function, but the good news is that it happens in sample.
I can observe it, and realize how well, or not so well, I learned. MODERATOR: Also, going back to a previous question, does this also generalize to multi-class problems? PROFESSOR: Basically, there is no restriction on the inputs or the outputs. There is a counterpart: instead of a break point, there is an analogous notion, and the dichotomies are not really dichotomies anymore, because you have more than two values. So there are technicalities to be done in order to be able to reduce them to this case. But the same principle applies, regardless of the type of function you have. MODERATOR: I think that's it. PROFESSOR: Very good. Thank you, and we'll see you next week.