Lecture 05 - Training Versus Testing

ANNOUNCER: The following program is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we talked about error and noise. And these are two notions that relate the learning problem to practical situations. In the case of error measures, we realized that in order to specify the error that is caused by your hypothesis, we should try to estimate the cost of using h instead of f, which should have been used in the first place. And that is something the user can specify: the price to pay when they use h instead of f. And that is the principled way of defining an error measure. In the absence of that, which happens quite often, we go to plan B and resort to analytic properties, or practical properties of optimization, in order to choose the error measure. Once you define the error between the performance of your hypothesis versus the target function on a particular point, you can plug this in, into different error quantities, like the in-sample error and the out-of-sample error, and get those values in terms of the error measure by getting an average. In the case of the training set, you estimate the error on the training points, and then you average with respect to the N examples that you have. And in the case of out-of-sample, theoretically the definition would be that you also evaluate the error between h and f on a particular point x, give the weight of x according to its probability, and get the expected value with respect to this x. The notion of noisy targets came from the fact that what we are trying to learn may not be a deterministic function, the usual function in mathematics, where y is uniquely determined by the value of x. But rather, y is affected by x-- y is distributed according to a probability distribution, which gives you y given x. And we talked about, for example, in the case of credit application, two identical applications may lead to different credit behavior. Therefore, the credit behavior is a probabilistic thing, not a deterministic function of the credit application. You can go back to our first example, let's say, of the movie rentals. If you rate a movie, you may rate the same movie at different times differently, depending on your mood and other factors. So there's always a noise factor in these practical problems. And that is captured by the conditional probability of y given x. When we look at the diagram involving this probability-- so now we replace the target function, which used to be a function, by a probability distribution, which can be modeled as a target function plus noise. And these feed into the generation of the training examples. And when you look at the unknown input distribution, which we introduced technically in order to get the benefit of the Hoeffding inequality, that also feeds into the training examples. This determines x. And this determines y given x. And then you generate these examples independently, according to this distribution. So when we had x being the only probabilistic thing, and y being a deterministic function of x, then x_1 was independent of x_2, independent of x_N, and so on. And then you compute each y, according to the function, on the corresponding x. When you have the noisy version, then the pair x_1 and y_1 is generated according to the joint probability distribution, which is P of x, the original one, times P of y given x, the one we introduced to accommodate the noise. And then the independence lies between different pairs.
So x_1, y_1 would be independent of x_2, y_2, independent of x_3, y_3, and so on. And when you get the expected values for errors, you now have to take into consideration the probability with respect to both x and y. So what used to be the expected value with respect to x is now the expected value with respect to x and y. And then you plug x into h, and compare it to the probabilistic value of y that happened to occur. And that would be the out-of-sample error in this case. Now in this lecture, I'm going to start the theory track, which will last for three lectures, followed by another theory lecture on a related but different topic. And the idea is to relate training to testing, in-sample and out-of-sample, in a realistic way. So the outline will be the following. We'll spend some time talking about training versus testing, a very intuitive notion. But we'd like to put down the mathematical framework that describes what training versus testing is. And then we will introduce quantities that will be mathematically helpful in characterizing that relationship. And after I give you a number of examples to make sure that the notion is clear, we are going to introduce the key notion, the break point. And the break point is the one that will later result in the VC dimension, the main notion in the theory of learning. And finally, I end up with a puzzle. It's an interesting puzzle that will hopefully fix the ideas that we talked about in the lecture. So now let's talk about training versus testing. And I'm going to take a very simple example that you can relate to. Let's say that I'm giving you a final exam. Now I want to help you out. So before the final exam, I give you some practice problems and solutions, so you can work on them and prepare yourself for the final exam. That is very typical. Now if you look at the practice problems and solutions, this would be your training set, so to speak. You're going to look at the question. You're going to answer. You're going to compare it with the real answer. And then you are going to adjust your hypothesis, your understanding of the material, in order to do better, and go through them and perhaps go through them again, until you get them right or mostly right or figure out the material. And now you are more ready for the final exam. Now the reason I gave you the practice problems and solutions is to help you do better on the final, right? Why don't I just give you the problems on the final, then? Excellent idea, I can see! Now the problem is obvious. The problem is that doing well on the final is not the goal, in and of itself. The goal is for you to learn the material, to have a small E_out. The final exam is only a way of gauging how well you actually learned. And in order for it to gauge how well you actually learned, I have to give you the final at the point where you have already fixed your hypothesis. You prepared. You studied. You discussed with people. You now sit down to take the final exam. So you have one hypothesis. And you go through the exam. And therefore, your answers on, let's say, the 50 questions of the final-- hopefully, it's not going to be that long, if there is a final-- will reflect what your understanding will be outside. So the distinction is conceptual. And now, let's put down mathematically what training versus testing is. It will be an extremely simple distinction, although it's an important distinction. Here is what testing is, in terms of a mathematical description. You have seen this before.
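[For reference, here is the inequality on the slide, in the form used in the earlier lectures-- the single-hypothesis Hoeffding bound:]

\[
\mathbb{P}\left[\,\left|E_{\text{in}}(h) - E_{\text{out}}(h)\right| > \epsilon\,\right] \;\le\; 2\,e^{-2\epsilon^2 N}
\]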
This is the plain-vanilla Hoeffding. This part is how well you did on the final exam. This is how well you understand the material proper. And since you have only one hypothesis-- this is a final, you are fixed, and you just take the exam-- your performance on the exam tracks well how you understand the material. And therefore, the difference between them is small. And the probability that it's not small becomes smaller and smaller as the number of questions, in this case, goes up. So that is what testing is. How about training? Almost identical, except for one thing-- this fellow. Because in the case of training, this is how you performed on the practice problems. In the practice problems, you had the answers, and you modified your hypothesis. And you looked at it, and you got an answer wrong. So you modified your hypothesis again. You are learning better. That's all very nice. But now the practice set is contaminated. You pretty much almost memorized what it is. And there's a price to pay for that, in terms of how your performance on the practice, which is E_in in this case, tracks how well you understand the material, which is still E_out. And the price you pay is how much you explored. And that was reflected by the simple M, which was the number of hypotheses in the very simple derivation we did. So if you want an executive summary of this lecture, we are just going to try to get M to be replaced by something more friendly, because you realize that M-- if you just measure the complexity of your hypothesis set by the number of hypotheses-- is next to useless in almost all cases. Something as simple as the perceptron has M equals infinity. And therefore, this guarantee is no guarantee at all. If we can replace M with another quantity, and justify that, and that quantity is not infinite even if the hypothesis set is infinite, then we are in business. And we can start talking about the feasibility of learning in an actual model, and be able to establish the notion in a way that we can apply to a real situation. That's the plan. We're talking about M, so the first question to ask is, where did this M come from? If we are going to replace it, we need to understand where it came from, to understand the context for replacing it. Well, there are bad events that we have talked about. And the bad events are called B, because they are bad. That's good! These are the bad events. What is the bad event that we were trying to avoid? We were trying to avoid the situation where your in-sample performance does not track the out-of-sample performance. If their difference is bigger than epsilon, this is a bad situation. And we're trying to say that the probability of a bad situation is small. That was the starting point. Now we applied the union bound, and we got the probability of several bad events. This is the bad event for the first hypothesis. You can see here that there is m, a small m. m is 1, 2, 3, 4, up to M. So there are M hypotheses, capital M hypotheses, that I'm talking about. And I would like the probability of any of them happening to be small. Why is that? Because your learning algorithm is free to pick whichever hypothesis it wants, based on the examples. So if you tell me that the probability of any of the bad events is small, then whichever hypothesis your algorithm picks, it will be OK. And I want that guarantee to be there. So let's try to understand the probability of B_1 or B_2 or ... or B_M. What does it look like?
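[Again for reference, the two formulas under discussion: the training version of the bound, with the factor M coming from the union bound, and the union bound itself, where B_m denotes the bad event that |E_in(h_m) - E_out(h_m)| > epsilon:]

\[
\mathbb{P}\left[\,\left|E_{\text{in}}(g) - E_{\text{out}}(g)\right| > \epsilon\,\right] \;\le\; 2M\,e^{-2\epsilon^2 N},
\qquad
\mathbb{P}\left[B_1 \text{ or } B_2 \text{ or } \cdots \text{ or } B_M\right] \;\le\; \sum_{m=1}^{M} \mathbb{P}\left[B_m\right].
\]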
Well, if you look at a Venn diagram, and you place B_1 and B_2 and B_3 as areas here-- these are different events. They could be disjoint, in which case the circles will be far apart. Or they could be coincident, in which case they will be on top of each other. They could be independent, which means that they are proportionately overlapping. There could be many situations. Now the point of the bound is that we would like to make that statement regardless of the correlations between the events. And therefore, we use the union bound, which actually bounds it by the total area of the first one, plus the total area of the second one, et cetera, as if they were disjoint. Well, that will always hold regardless of the level of overlap. But you can see that this is a poor bound, because in this case, we are estimating it to be about three times the area, when it's actually closer to just the area, because the overlap is so significant. And therefore, we would like to be able to take into consideration the overlaps, because with no overlaps, you just get M terms. And you're stuck with M, which is infinity in almost all the interesting hypothesis sets. Now of course, in principle, you can go and-- I give you the hypothesis set, which is the perceptron. And you can try to formalize what this bad event is in terms of the perceptron, and what happens when you go to the other perceptron, and try to get the full joint distribution of all of these guys, and solve this exactly. Well, you can, in principle-- theoretically. It's a complete nightmare, completely undoable. And if we had to do this for every hypothesis set you propose, there wouldn't be learning theory around. People would just give up. So what we are going to do is try to abstract from the hypothesis set a quantity that is sufficient to characterize the overlaps, and get us a good bound, without having to go through the intricate details of analyzing how the bad events are correlated. That would be the goal. And we will achieve it, through a very simple argument. So that's where M comes from. Then we asked, can we improve on M? Maybe M is the best we can do. It's not like we wish to improve it, so it has to be improved. Maybe that's the best we can say. If you have an infinite hypothesis set, then you're stuck, and that's that. But it turns out that, no, the overlap situation we talked about is actually very common. Yes, we can improve on M. And the reason is that the bad events are extremely overlapping in a practical situation. Let's take the example we know, which is the perceptron, to understand what this is. I'm going through the example because now we have lots of binary things-- +1 versus -1 for the target, +1 versus -1 for the hypothesis, agreeing versus disagreeing, et cetera. I want to pin down exactly what the bad event is, in terms of this picture, so that we understand what we are talking about. Here is the target function for a perceptron. And it returns +1 for some guys, -1 for some guys. That's easy. And then you have a hypothesis, a perceptron. And this is not the final hypothesis. This is a badly performing hypothesis. But it is a general perceptron. If you pick any vector of weights, you'll get another blue line. So now in terms of this picture, could someone tell me what E_out is? What is the out-of-sample error for this hypothesis, when it's applied to this target function? It's not that difficult. It is actually just these areas, the differential areas. This is where they disagree. One is saying +1.
One is saying -1. So these two areas-- the total area if it's uniform, the total probability if it's not-- will give you the value of E_out. That's one quantity. How about E_in? For E_in, you need a sample. So first, you generate a sample. Here's a constellation of points. Some of these points, as you see, will fall into the bad region, here and here. And I color them red. So the fraction of red compared to all the sample gives you E_in. That is understood. This is E_in and E_out. And these are the guys that I want to track each other. OK, fine. I understand this part. Now let's look at: what is the change in E_in and E_out when you change your hypothesis? So here's your first hypothesis. Now take another perceptron. You probably already suspect that this is hugely overlapping. Whatever you're talking about, it must be overlapping, because they're so close to each other. But let's pin down the specific event that, we would like to argue, is overlapping. So the change in E_out when you go from, let's say, the blue hypothesis, this blue hypothesis, to the green hypothesis-- the change in E_out would be the area of this yellow thing, not very much. A very thin area. That's where E_out changed, right? So if you look at the area, that gives you delta E_out. If you look at delta E_in, the change of the labels of data points-- if one of the data points happens to fall in this yellow region, then its error status will change from one hypothesis to another, because one hypothesis got it right, and the other one got it wrong. Now the chances of a point falling here are small. So you can see why we are arguing that the change delta E_out and the change delta E_in are small. The area is small, and the probability of a point falling there is small. Moreover, they are actually moving in the same direction, because the change depends on the area of the yellow part. So let's say that this is increasing. If the area increases, they both increase, because I get a net positive area for delta E_out, and the probability of falling there also increases. Now, the reason I'm saying that is because what we care about are these. We would like to make the statement that how E_in tracks E_out for the first hypothesis, for the blue perceptron, is comparable to how E_in tracks E_out for the second one. Why are we interested in that? Because we would like to argue that this exceeding epsilon happens often when this exceeds epsilon. The events are overlapping. We are not looking for the absolute value of those. We are just saying that, if this exceeds epsilon, this also exceeds epsilon most of the time. And therefore, the picture we had last time is actually true. These guys are overlapping. The bad events are overlapping. And at least we stand a hope that we will get something better than just counting the number of hypotheses, for the complexity we are seeking. So we can improve M. That's good news. We can improve M. We're going to replace it with something. What are we going to replace it with? I'm going to introduce to you now the notion that will replace M. It is not going to be completely obvious that we can actually replace M with this quantity. That will require a proof. And that will take us into next lecture. The purpose here is to define the quantity, and make you understand it well, because this is the quantity that will end up characterizing the complexity of any model you use. So we want to understand it well.
And we are going to motivate that it can replace M. It will be plausible. It makes sense. It's not a crazy quantity. It also counts the number of hypotheses, of sorts. And therefore, let's define the quantity and become familiar with it. And then next time, we will like the quantity so much that we'll bite the bullet, and go through the proof that we can actually replace M with this quantity. So what is the quantity? The quantity is based on the following. When we count the number of hypotheses, we obviously take into consideration the entire input space. What does that mean? These are four different perceptrons. So I take the input space. And the reason these guys are different is because they are different on at least one point in the input space. That's what makes two functions different. And because the input space is infinite, continuous, that's why we get an infinite number of hypotheses. So let's say that, instead of counting the number of hypotheses on the entire input space, I'm going to restrict my attention only to the sample. So I generate only the input points, which are finitely many points, and put them on the diagram. So I have this constellation of points. And when I look at these points alone, regardless of the entire input space, those perceptrons will classify them. These guys will turn into red and blue, according to the regions they fall in. Now, in order to fully understand what it means to count only on these points, we have to wipe out the input space. So that's what I'm going to do. That's what you have. So you can imagine the perceptron is somewhere. And it's splitting the points. And now what I'm counting is-- for this constellation, which is a fixed constellation of points, how many patterns of red and blue can I get? Now when you do this, you're not counting the hypotheses proper, because the hypotheses are defined on the input space. You are counting them on a restricted set. But still, you're counting. You're counting the number of hypotheses. For example, if I give you a hypothesis set where you get all possible combinations of red and blue, that's a powerful hypothesis set. If I give you a hypothesis set where you get only a few, that's not so powerful a hypothesis set. So the count here also corresponds in our mind to the strength, or the power, of the hypothesis set, which in our mind is what we tried to capture by the crude M. So we are going to count the number of "hypotheses." I'm putting them between quotations. Why? Because now the hypotheses are defined only on a subset of the points. So I'm going to give them a different name, when I define them only on a subset of the points, in order not to confuse the hypotheses on the general input space with this case. I'm going to call them dichotomies. And the idea is that I give you N points. And there is a dichotomy between what goes into red, and what goes into blue. That's where the name came from. So when you look only at the points, and you ask which ones are blue and which ones are red-- that is a dichotomy. And if you want to understand it, let's look at this. Let's say that you're looking at the full input space. And this is your perceptron. And this is the function it's implementing. And then you put a constellation of points. The way to understand dichotomies is to think that I have an opaque sheet of paper that has holes in it. And you put that opaque sheet of paper on top of your input space. So you don't see the input space. You only see it through the eyes of those points.
So what do you see when you put this? You end up with this here. You don't see anything. You don't see where the hypothesis is. You just see that these guys turned blue, and these guys turned red or pink. Now as you vary the perceptron, as you vary the line here, you are not going to notice it here, until the line crosses one of the points. So I could be running around here, here, here, and here, and generating an infinite number of hypotheses, for which I'm charging a huge M. And this guy is sitting here, looking. Nothing happened. It's the same thing. I'm counting it as 1. And then when you cross, you end up with another pattern. So all of a sudden, these guys are blue. And these guys are red. That's when, let's say, this guy is horizontal here rather than vertical here. So you can always think that we reduced the situation to where we're going to look at the problem exactly as it is, except through this sheet that has only N holes. Let's put, in mathematical terms, the dichotomies which are the mini hypotheses, the hypotheses restricted to the data points. A hypothesis formally is a function. And the function takes the full input space X, and produces -1, +1. That's the blue and red region that we saw. A dichotomy, on the other hand, is also a hypothesis. We can even give it the same name, because it's returning the same values for the points it's allowed to return values on. But the domain of it is not the full input space, but very specifically, x_1 up to x_N. These are-- each one of these points belongs to X, to the input space. But now I'm restricting my function here. And again, the result is -1, +1, exactly as it was here. That's what a dichotomy is. Now if I ask you how many hypotheses there are, let's say for the perceptron case? Very easy. It can be infinite. In the case of the perceptron, it's infinite. Why? Because this guy is seriously infinite. So the number of functions is just infinite, by a margin! So that's fine. Now if you ask yourself, what is the number of dichotomies? Let's look at the notation first, and then answer the question. The dichotomy is a function h applied to one of those. So when I talk about it, the value, I would say h of x_1 or h of x_2, one value at a time. If I decide to use the fancy notation, I say I'm going to apply small h to the entire vector, x_1, x_2, up to x_N. I would be meaning that you tell me the values of h of x on each of them. So you return a vector of the values, h of x_1, h of x_2, up to h of x_N. That's not an unusual notation. Now if you apply the entire set of hypotheses H to that, what you are doing is that you are applying each member here, which is h, to the entire vector. Each time you apply one of those guys, you get -1, -1, +1, +1, -1, +1, -1, et cetera. So you get a full dichotomy. And then you apply another h, and you get another dichotomy, and so on. However, as you vary h, which has an infinite number of guys, many of these guys will return exactly the same dichotomy, because the dichotomies are very restricted. I have these N points only. And I'm returning +1 or -1 on them only. So how many different ones can I possibly get? At most, 2 to the N. If H is extremely expressive, it will get you all 2 to the N. If not, it will get you smaller than 2 to the N. So I can start with the most infinite type of hypothesis. And if I translate it into dichotomies, I have an upper bound of 2 to the N for the number of dichotomies I have. So this thing now becomes a candidate for replacing the number of hypotheses. 
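[In symbols, the two notions just contrasted: a hypothesis is a function h: X -> {-1, +1} on the whole input space, while a dichotomy is its restriction h: {x_1, x_2, ..., x_N} -> {-1, +1}. The set of dichotomies generated by the hypothesis set H on the points, and the bound just argued, are]

\[
\mathcal{H}(x_1, \ldots, x_N) \;=\; \left\{\, \big(h(x_1), \ldots, h(x_N)\big) \;:\; h \in \mathcal{H} \,\right\},
\qquad
\left|\mathcal{H}(x_1, \ldots, x_N)\right| \;\le\; 2^N .
\]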
Instead of the number of hypotheses, we're talking about the number of dichotomies. Now we define the actual quantity. Capital M is red. And I keep it red throughout. And we are going now to define small m, which I will also keep as red. That will hopefully, and provably as we will see next time, replace M. It's called the growth function. What is the idea of the growth function? The growth function counts the most dichotomies you can get, using your hypothesis set, on any N points. So here is the game. I give you a budget N. That's my decision. You choose where to place the points, x_1 up to x_N. Your choice is based on your attempt to find as many dichotomies as possible, on the N points, using the hypothesis set. So it would be silly, for example, to take the points and put them, let's say, on a line, because now you are restricted in separating them. You get the most if you put them in a general constellation. And then you count the number of dichotomies you are going to get. And what you're going to report to me is the value of the growth function on the N that I passed on to you. So I give you N, you go through this exercise, and you return a number that is the growth function. Let's put it formally. The growth function is going to be called m, in red as I promised. And it is the maximum. Maximum with respect to what? With respect to any choice of N points from the input space. That is your part. I gave you the N. So I told you what N is. And then you chose x_1 up to x_N with a view to maximizing something. What are you maximizing? Well, we had this funny notation. H applied to this entire vector is actually the set of dichotomies, the vectors -1, +1, -1, +1, and then the next guy and the next guy-- the actual vectors here. When you put this cardinality on top of them, you're just counting them. You're asking yourself: how many dichotomies do I get? So you're maximizing, with respect to the choice of x_1 up to x_N, this thing. That number will give you the most expressive facet of the hypothesis set on N points. I tell you 10. And you come back with the number 500. It means that by your choice of the x_1 up to x_10, you managed to generate 500 different guys, according to the hypothesis set that I gave you. Now because of this, you can see that there is an added notation here. It used to be m, but it actually depends on the hypothesis set, right? It's the growth function for your hypothesis set. So I'm making that dependency explicit, by putting a subscript H. Furthermore, this is a full-fledged function. M was a number. I give you a hypothesis set. It's a number. Well, it happens to be infinite, but it's a number. Here, I'm giving you a full function. That is, I tell you N, you tell me what the growth function is. So it's a little bit more complicated. And because it is this way, m_H is actually a function of N. That's the growth function. So that is the notion. Now what can we say about the growth function? Well, if the number of dichotomies is at most 2 to the N, because that's as many +1, -1 N-tuples as you can produce, then the maximum of them is also bounded by the same thing, at most 2 to the N. Well, if we are going to replace M with m, I would say 2 to the N is an improvement over infinity, if we can afford to do it. Maybe it's not a great improvement, but nonetheless an improvement. Now, let's apply the definition to the case of perceptrons, in order to give it flesh, so we understand what the notion is. It's not just an abstract quantity.
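[The definition just described, in symbols:]

\[
m_{\mathcal{H}}(N) \;=\; \max_{x_1, \ldots, x_N \in \mathcal{X}} \left|\mathcal{H}(x_1, \ldots, x_N)\right| \;\le\; 2^N .
\]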
We take the perceptrons, and we would like to get the growth function of the perceptrons. Well, getting the growth function of the perceptron is quite a task. If I ask you what M is for the perceptron? Infinity. And then you go home. What is the growth function of the perceptron? You have to tell me what the growth function is at N equals 1, at N equals 2, at N equals 3, at N equals 4. It's a whole function. So we say, 1 and 2 are easy. Let's start with N equals 3. So I'm choosing 3 points. And I chose them wisely, so that I can maximize the number of dichotomies. And now I'm asking myself, what is the value of the growth function for the perceptron for the value N equals 3? Well, it's not that difficult. You can see, I can actually get everything there is to get. Why? Because I can have my line here, or I can have my line here, or I can have my line here. That's 3 possibilities times 2, because I can make it +1 versus two -1's, or -1 versus two +1's. We are counting 6 so far. And then I can have my hypothesis sitting here. That will make them all +1. Or I can have it sitting here, which makes them all -1. That's 8. That's all of them. The perceptron hypothesis is as strong as you can get, if you only restrict your attention to 3 points. So the answer would be what? Is it already 8? Wait a minute. Someone else chose the points collinear, and then found out that if you want these guys to go to the -1 class, and this guy to go to the +1 class, there is no perceptron that is capable of doing this. Correct? You cannot pass a line that will make these two guys go to +1, and this guy go to -1, if these are collinear. Does this bother us? No. Because we are taking the maximum. So the quantity you computed here, since you got to 8-- you cannot go above 8. That defines it. And indeed, you can with authority answer the question that the growth function for this case, m at N equals 3, is 8. Now let's see if we are still in luck when we go to N equals 4. What is the growth function for 4 points? We'll choose the points in general position again. We are not going to have any collinearity, in order to maximize our chances. But then we are stuck with the following problem. Even if you choose the points in general position, there is this constellation-- there is this particular pattern on the constellation, which is -1, -1, and +1, +1. Can you generate this using a perceptron? No. And the opposite of it, you cannot either: if this was -1, -1, and this one, +1, +1. Can you find any other 4 points where you can generate everything? No. I can play around, and there are always 2 missing guys, or even worse. If I choose the points unwisely, I will be missing more of them. So the maximum you are getting is that you are missing 2 out of all the possibilities. And the growth function here is 14, not 16, as it might have been if you had the maximum. Now this is a very satisfactory result, because perceptrons are pretty limited models. We use them because they are simple, and there's a nice algorithm that goes with them. So we have to expect that the quantity we are measuring the sophistication of the perceptrons with, which is the growth function, had better not be the maximum possible. Because if it's the maximum possible, then we are declaring: perceptrons are as strong as can be. Now, they do break. And they are limited.
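[An aside for readers following along: a quick empirical check of the N = 4 count-- a sketch, not a proof. It samples random lines and records which dichotomies of four fixed points show up; the point placement, sample size, and seed are choices made here, not part of the lecture.]

```python
import itertools, random

# Four points in general position: the corners of the unit square.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

def sign(z):
    return 1 if z > 0 else -1

# Sample random lines w1*x + w2*y + b = 0 and record the dichotomy
# (sign pattern) each one induces on the four points.
random.seed(0)
found = set()
for _ in range(200_000):
    w1, w2 = random.gauss(0, 1), random.gauss(0, 1)
    b = random.uniform(-3, 3)
    found.add(tuple(sign(w1 * x + w2 * y + b) for x, y in pts))

print(len(found))  # 14, not 16
# The two missing dichotomies are the "diagonal" (XOR-like) labelings.
print(sorted(set(itertools.product((-1, 1), repeat=4)) - found))
```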
Now if I pick another model-- let's say, just for the extreme case, the set of all hypotheses. What would be the growth function for the set of all hypotheses? It would be 2 to the N, because I can generate anything. So now, according to this measure that I just introduced, the set of all hypotheses is stronger than the perceptrons. A satisfactory result, simple but satisfactory. Now what I'm going to do-- I'm going to take some examples in which we can compute the growth function completely, for all values of N. You can see that if I continued with this and said, let's go with the perceptron, 5 points: you put the 5 points, and then you try. Am I missing this? Or maybe if I change the position of the points. It's just a nightmare, just to get 5. And basically, if you just do it by brute force, it's not going to happen. So I'm taking examples where we can actually, by a simple counting argument, get the value of the growth function for the entire domain, N from 1 up to infinity, in order to get a better feel for the growth function. That's the purpose of this portion. Our first model, I'm going to call positive rays. Let's look at what positive rays look like. They are defined on the real line. So the input space is R, the real numbers. And they are very simple. From a point on, which we are going to call 'a'-- this is the parameter that decides one hypothesis versus the other in this particular hypothesis set-- all the points that are bigger go to +1. All the points that are smaller go to -1. And it's called a positive ray, because here is the ray-- a very simple hypothesis set. Now in order to define the growth function, I need a bunch of points. So I'm going to generate some points. I'm going to call them x_1 up to x_N. And I am going to choose them as general as possible. I guess there is very little generality when you're talking about a line. Just make sure that they don't fall on each other. If they fall on each other, you cannot really dichotomize them at all. If you put them separately, you'll be OK. So you have these N points. Now when you apply your hypothesis, the particular hypothesis that is drawn on the slide, to these points, you are going to get this pattern. True? And you're asking yourself, how many different patterns can I get on these N points by varying my hypothesis, which means that I'm varying the value of 'a'? That is the parameter that gives me one hypothesis versus the other. Formally, the hypothesis set is a set of functions from the real numbers to -1, +1. And I can actually find an analytic formula here. If you want an analytic formula, you remember the sign function? h of x is the sign of x minus a. If you apply it, that's exactly what I described. Now we ask ourselves, what is the growth function? Here is a very simple argument. If you have N points, the value of the dichotomy-- which ones go to blue and which ones go to red-- depends on which segment between the points 'a' will fall in. If 'a' falls here, you get this pattern. If 'a' falls here, this guy will be red. And the rest of the guys will be blue. So I get a different dichotomy. I get different dichotomies when I choose a different line segment. How many line segments are there to choose from? I have N points. I have N minus 1 sandwiched segments, one here when all of them are red, and one here when all of them are blue. Right? So I have N plus 1 choices. And that's exactly the number of dichotomies I'm going to get on N points, regardless of what N is. So I found that the growth function, for this thing, is exactly N plus 1.
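[A small sketch that carries out this count by enumeration-- the specific points below are arbitrary:]

```python
def ray_dichotomies(xs):
    """Distinct dichotomies that positive rays h(x) = sign(x - a) generate on xs."""
    xs = sorted(xs)
    # One candidate threshold per segment: before all the points, between
    # consecutive points, and after all the points -- N + 1 candidates in all.
    cands = ([xs[0] - 1.0]
             + [(u + v) / 2 for u, v in zip(xs, xs[1:])]
             + [xs[-1] + 1.0])
    return {tuple(1 if x > a else -1 for x in xs) for a in cands}

print(len(ray_dichotomies([0.3, 1.7, 2.2, 4.0])))  # N + 1 = 5
```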
Let's take a more sophisticated model, and see if we get a bigger growth function. Because that's the whole idea, right? The next guy is positive intervals. What are these? They're like the other guys, except they're a little bit more elaborate. Instead of having a ray, you have an interval. Again, you're talking about the real line. And you are going to define an interval from here to here. And anything that lies within it will map to +1 and become blue. And anything outside, whether it's right or left, will go to -1. That's obviously more powerful than the previous one, because you can think of the positive ray as an infinite interval. That's fine. So you put the points. We have done this before. And they get classified this way. And I'm asking myself, how many different dichotomies can I get now by choosing really 2 parameters, the beginning of the interval and the end of the interval? These are my 2 parameters, that will tell me one hypothesis versus the other. How many different patterns can I get? Again, the function is very simple. It's defined on the real numbers. And now the counting argument, which is an interesting one. The way you get a different dichotomy is by choosing 2 different line segments to put the ends of the interval in. If I start the interval here and end it here, I get something. If I start the interval here and end it here, I get something else. If I start the interval here and end here, I get something else. And there is exactly a one-to-one mapping between the dichotomies and the choice of 2 segments. So if this is the case, then I can very simply say that the growth function, in this case, is the number of ways to pick 2 segments out of the N plus 1 segments. And that would be N plus 1 choose 2. There is only 1 missing. When you count, there are 2 rules-- make sure that you count everything, and make sure that you don't count anything twice. Very simple. So we counted almost everything. But what is the missing guy here? Let's say that all of them are blue. Is this counted already? Yes, because I can choose this segment and this segment. And that is already counted in this. But if they're all red, what does that mean? It means that the beginning of the interval and the end of the interval happen to be within the same segment. So they didn't capture any point. And that, I didn't count. And it doesn't matter which segment they're in, because I will get just the all-reds. So it's one dichotomy. So all I need to do is just add 1. And that's the number. Do a little algebra, and you get this: one half N squared plus one half N plus 1. That is the growth function for this hypothesis set. And now I'm happy, because I see it's quadratic. It's more powerful than the previous guy, which was linear.
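[The same enumeration exercise for positive intervals-- again a sketch with arbitrary points, checked against the formula:]

```python
from itertools import combinations
from math import comb

def interval_dichotomies(xs):
    """Distinct dichotomies that positive intervals generate on xs."""
    xs = sorted(xs)
    cands = ([xs[0] - 1.0]
             + [(u + v) / 2 for u, v in zip(xs, xs[1:])]
             + [xs[-1] + 1.0])
    # Choose 2 of the N + 1 segments for the interval's two endpoints...
    dichos = {tuple(1 if lo < x < hi else -1 for x in xs)
              for lo, hi in combinations(cands, 2)}
    # ...plus the single all -1 dichotomy, where both endpoints share a segment.
    dichos.add(tuple([-1] * len(xs)))
    return dichos

xs = [0.5, 1.2, 2.9, 3.3, 4.8]        # N = 5
print(len(interval_dichotomies(xs)))   # 16
print(comb(len(xs) + 1, 2) + 1)        # the formula agrees: 16
```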
Now let's up the ante, and go to the third one: convex sets. This time, I'm taking the plane, rather than the line. So it's R squared. And my hypotheses are simply the convex regions. If you look at the values of x at which the hypothesis is +1, this has to be a convex region-- any convex region. A convex region is a region where, if you pick any 2 points within the region, the entirety of the line segment connecting them lies within the region. That's the definition. So this is my artwork for a convex region. You take any 2 points and-- so this is an example of that. The blue is the +1. And the red is the -1. That's the entire space. So this is a valid hypothesis. Now you can see that there is an enormous variety of convex sets that qualify as hypotheses. But there are some which don't qualify. For example, this one is not convex, because of this fellow. Here's the line segment, and it went out of the region. So that's not convex. We understand what the hypothesis set is. Now we come to the task. What is the growth function for this hypothesis set? In order to answer this, you put your points. I give you N, and you place them. So here is a cloud of points. I give you N, and you say, it seems like putting them in general position is a good idea. So let's put them in general position. And let's try to see how many patterns I can get out of these, using convex regions. Man, this is going to be tough, because I can see-- Let's see. First, I cannot get all of them, right? Because let's say I take the outermost points, and map them all to +1. This will force all the internal points to be +1, because I'm using a convex region. Therefore, I cannot get +1's for the outer guys, and any -1 whatsoever inside. So that excludes a lot of dichotomies. Now I have to do real counting. But wait a minute. The criterion for choosing the cloud of points was not to make them look good and general, but to maximize your growth function. Is there another choice for the points that gives me more dichotomies than these? As a matter of fact, is there another choice, for where I put the points, that will give me all possible dichotomies using convex regions? If you succeed in that, then you don't care about this cloud. The other one will count, because you are taking the maximum. Here is the way to do it. Take a circle, and put your points on the perimeter of that circle. Now I maintain that you can get any dichotomy you want on these points. What is the argument? Well, pick your favorite one. I have a bunch of blues and a bunch of reds. Can I realize this using a convex region? Yes. I just connect these guys. And the interior of this goes to +1. And whatever is outside goes to -1. And I am assured it's convex, because the points are on the perimeter of a circle. That means what? That means that the growth function is 2 to the N, notwithstanding the other guy. You realize now a weakness in defining the growth function as the maximum, because in a real learning situation, the chances are the points you're going to get are not going to end up on the perimeter of a circle. They are likely to be all over the place. And some of them will be interior points, in which case you're not going to get all possibilities. But we don't want to keep studying the particular probability distribution, and the particular data set you get, and so on. We would like to have a simple quantity. And therefore, we're taking the maximum overall, which will have a simple combinatorial property. The price we pay is that the chances are the bound we are going to get is not going to be as tight as possible. But that's a normal price. If you want a general result that applies to all situations, it's not going to be all that tight in any given situation. That is the normal tradeoff. But here, the growth function is indeed 2 to the N. Just as a term: when you get all possible dichotomies, you say that the hypothesis set shattered the points-- broke them in every possible way. So we can say, can we shatter this set, et cetera? That's what it means. You get all possible combinations on them.
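[A sketch verifying the circle argument for N = 8: a dichotomy is realizable by a convex region exactly when no -1 point lies inside the convex hull of the +1 points, and for points on a circle that never happens. The value of N here is an arbitrary choice.]

```python
import math
from itertools import product

# N points on the unit circle -- the constellation that achieves shattering.
N = 8
pts = [(math.cos(2 * math.pi * k / N), math.sin(2 * math.pi * k / N))
       for k in range(N)]

def inside_hull(p, hull):
    """Is p strictly inside the convex polygon 'hull'?

    For points on a circle, taking the +1 points in their angular order
    gives the hull vertices directly, so no hull computation is needed.
    """
    if len(hull) < 3:
        return False  # a single point or a segment has no interior
    sides = set()
    for i in range(len(hull)):
        (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
        sides.add((x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) > 0)
    return len(sides) == 1  # same side of every edge -> inside

shattered = all(
    not any(inside_hull(pts[i], [pts[j] for j in range(N) if lab[j] == 1])
            for i in range(N) if lab[i] == -1)
    for lab in product((-1, 1), repeat=N)
)
print(shattered)  # True: all 2^8 = 256 dichotomies are realizable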
Now let's look at the 3 growth functions on one slide, in order to be able to compare. We started with the positive rays, and we got a linear growth function. And then we went on to the positive intervals. And we had a quadratic function. And that is good, because we are getting more sophisticated and the growth function is getting bigger. And then we went to convex sets. Convex sets are powerful-- two-dimensional and all-- but not that powerful. And yet the growth function we got is not just bigger, it's inordinately bigger. Maybe we should have gotten N cubed. But that's what we have. At least it goes in the right direction. So sometimes this measure will be too much. But in general, you can see the trend that, with more sophistication, you get a bigger growth function. Now let's go back to the big picture, to see where that growth function will fit. Remember this inequality? Oh, yes. We have seen it. We have seen it often. We are tired of it! What we are trying to do is replace M. And we decided to replace it with the growth function m. M can be infinity. m is a finite number, at most 2 to the N, so that's good. What happens if we replace M with small m? Let's say that we can do that, which we'll establish in the next lecture. What will happen? If your growth function happens to be polynomial, you are in great shape. Why is that? Because if you look at this quantity, this is a negative exponential. epsilon can be very, very small. epsilon squared can be really, really, really small. But this remains a negative exponential in N. And for any choice of epsilon you wish, this will kill the heck out of any polynomial you put here, eventually. Right? I can put a 1000th-order polynomial, and have epsilon equal 10 to the minus 6. And if you're patient enough, or if your customer has enough data, which would be an enormous amount of data, you will eventually get this to win. And you will get the probability to be diminishingly small, which means that you can generalize. That's a very attractive observation, because now all you need to do is just declare that this is polynomial, and you're in business. We saw that it's not that easy to evaluate this explicitly. But maybe there is a trick that will make us able to declare that it is polynomial. And once you declare that a hypothesis set has a polynomial growth function, we can declare that learning is feasible using that hypothesis set, period. We may become finicky and ask ourselves, how many examples do you need for what, et cetera. But at least, we know we can do it. If you're given enough examples, you will be able to generalize from a finite set, albeit big, to the general space with a probability assurance. So that's pretty good. I'm happy that this is the case.
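[The observation in symbols-- assuming, as the lecture does for now, that M can simply be replaced by the growth function:]

\[
2\, m_{\mathcal{H}}(N)\, e^{-2\epsilon^2 N} \;\longrightarrow\; 0 \quad \text{as } N \to \infty
\]

for any fixed epsilon > 0, whenever m_H(N) is bounded by a polynomial, since N^k e^{-2 epsilon^2 N} goes to 0 for every fixed power k.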
So maybe we can, as I mentioned, just prove that m_H is polynomial-- that the growth function is polynomial. Can we do that? Maybe we can. Maybe we cannot. Here's the key notion that will enable us to do that. We are going to define what is called the break point. You give me a hypothesis set, and I tell you it has a break point. Perceptrons: 4. Another set: the break point is 7. Just one number. That's much better than giving me a full growth function for every N. Just one number. So what is the break point? The definition is the following. It's the point at which you fail to get all possible dichotomies. So you can see that, if the break point is 3, this is not a very fancy hypothesis set. I can't even generate all 8 possibilities on 3 points. If the break point is 100, well, that's a pretty respectable guy, because I can generate everything up to 99 points, all 2 to the 99 of them. And then I start failing at 100. So you can see that the break point also has a correspondence to the complexity of the hypothesis set. If no data set of size k can be shattered by H-- that is, if there is no choice of k points on which you are able to generate all possible dichotomies-- then you call k a break point for H. So that's what it means. You can't shatter, so you get fewer than 2 to the k dichotomies, which are all the possibilities for k data points. So for the 2D perceptron, can you think of what the break point is? We did it already. We didn't explicitly say it in those terms. But this is the hint. For 3, we did everything. For 4, we knew we cannot do everything. So it doesn't matter whether it's 14 or 15 or 12 or 5. As long as it breaks, it breaks. It's not 16. And therefore, in this case, the break point is 4. That number 4 will characterize the perceptrons. Picture the dialogue: I have a hypothesis set. And it is defined on-- I don't want to know the input space. Wait a minute. OK, I'm not going to tell you the input space. I'm going to tell you the hypotheses. The hypotheses are produced by the-- I don't want to hear it. Just tell me the break point, and I will tell you the learning behavior. Also, if you have a break point, every bigger point is also a break point. That is, if you cannot get all possibilities on 10 points, then you certainly cannot get all of them on 11. If you could get them on 11, just kill one. And you will have gotten them on 10. Let's look at the 3 examples, and find what the break points are. Positive rays had this guy: N plus 1. This is a formula. We can plug in for N. And we ask ourselves, when do I get to the point where I no longer get 2 to the N, numerically, for a particular value? What is the break point here? At N equals 1, I get 1 plus 1. That's 2, which also happens to be 2 to the 1. At N equals 2, N plus 1 is 3. Oh, that's less than 4. So 2 must be a break point. Since we invested in computing the function, we are just being lazy now and substituting. But you could go for the original thing, and say that's obvious. Because of this particular combination of points: if I want the rightmost point to be red, and the left one to be blue, there is no way for the positive ray to generate that. And therefore, 2 is a break point. There's something where I fail. Let's go for this one: one half N squared plus one half N plus 1. We need faster calculators now. When I put in 1, it gives me 2. It must be the correct formula. At 2, I get 4. And at 3, I get 7. So what is the break point? It must be bigger than the other guy, because this model is more elaborate. And you realize it's 3. If you put 2 points, you will get the 4. And if you put 3, you'll get 7, which is short of 8. Again, that's not a mystery. That's what you cannot get using the interval: you cannot get the middle point to be red while the other ones are blue. So you cannot get all possibilities on 3 points. Therefore, 3 is a break point. What is the break point for the convex sets? Tell me how many points where I can fail. Well, I'm never going to fail. So if you like, you can say it is infinity. Let's define it this way. So the break point too-- just a single number-- has the property we want. It gets bigger as the model gets more sophisticated.
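[In symbols: k is a break point for H if and only if m_H(k) < 2^k. For the running examples:]

\[
\begin{array}{lll}
\text{positive rays:} & m_{\mathcal{H}}(N) = N + 1, & \text{break point } k = 2 \\
\text{positive intervals:} & m_{\mathcal{H}}(N) = \tfrac{1}{2}N^2 + \tfrac{1}{2}N + 1, & \text{break point } k = 3 \\
\text{convex sets:} & m_{\mathcal{H}}(N) = 2^N, & \text{no break point} \\
\text{2D perceptron:} & m_{\mathcal{H}}(3) = 8,\; m_{\mathcal{H}}(4) = 14, & \text{break point } k = 4
\end{array}
\]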
So what is the main result? The first part would be: if you don't have a break point, I have news for you-- the growth function is 2 to the N. OK, yes. That's the definition. Thank you. So that cannot possibly be the main result. So what is the main result? The main result is that if you have a break point-- any break point: 1, 5, 7000. Just tell me that there is a break point. You don't even have to tell me what the break point is. We are going to make a statement about the growth function. The growth function is-- do I hear a drum roll? [MAKES DRUM SOUND] It's guaranteed to be polynomial in N. Wow, we have come a long way. I used to ask you what the hypotheses are, and count them. That was hopeless, because it's infinity. We defined the growth function, and we had to evaluate it. That was painful. Then we found the break point. Maybe it's easier to compute the break point-- I just want to find a clever way to say that there is something I cannot get. Now all I need to hear from you is that there is a break point. And I'm in business as far as the generalization is concerned, because I know that regardless of what polynomial you get, you will be able to learn eventually. I will become more particular, and ask you what the break point is, when I try to find the budget of examples you need in order to get a particular performance. But in principle, if I just want to say you can use this hypothesis set, and you can learn, I just want you to tell me I have a break point. That's all I want. This is a remarkable result. And I have to give you a puzzle to appreciate it. The idea of the puzzle is the following. If I just tell you that there's a break point, the constraint on the number of dichotomies you get, because there is a break point, is enormous. If I tell you the break point is, let's say, 3, how many can you get on 100 points? On those 100 points, for any choice of 3 guys, you cannot have all possible combinations-- at any 3 points, all 100 choose 3 of them. So the combinatorial restriction is enormous. And you will end up losing possible dichotomies in droves, because of that restriction. And therefore, the thing that used to be 2 to the N, if it's unrestricted, will collapse to polynomial. Let's take a puzzle, and try to compute this in a particular case. Here is the puzzle. We have only 3 points. And for this hypothesis set, I'm telling you that the break point is 2. So you cannot get all possible four dichotomies on any 2 points. If you put x_1 and x_2, you cannot get -1 -1, -1 +1, +1 -1, and +1 +1-- all of them. You cannot get it. One of them has to be missing. So I'm asking you, given that this is the constraint, how many dichotomies can you get on 3 points? You can see what I'm trying to do, because I'm telling you what the restriction on 2 points will do. If I didn't have the restriction, I would be putting eight. So how many do I get? For visual clarity, I'm going to express them as either black or white circles, instead of writing -1 or +1. This dichotomy is fine. It doesn't violate anything-- it's only one possibility. So we keep adding. Everything is fine. As a matter of fact, everything will remain fine until we get to four, because the whole idea is that I cannot get all four on any 2 points. So if I have fewer than four rows, I cannot possibly get four combinations. You see what the point is. This is still allowed. I'm going through them in binary order. So this is 0 0 0, 0 0 1, et cetera. I'm still OK, right? Am I still OK? [MAKES BUZZER SOUND] You have violated the constraint. You cannot put the last row, because it now violates the constraint. I have to take it out. So let's take it out. Try the next guy. Maybe we are in luck. Are we OK? OK. That's promising. So let's go for the next guy. Maybe we'll get it. Are we OK? [MAKES BUZZER SOUND] Tough.
So we have to take out the last row. How about this one? Nope. We take it out. We don't have too many options left, right? Actually, this is the last guy. It had better work. Does it work? No. So that's what we can do. We lost half of them. Now you may think, maybe you messed it up because you started very regularly-- you just started from all 0's, then 0 0 1. If I started differently, I might be able to achieve more. It's conceivable. Please don't lose sleep over it. The only row you are going to be able to add to this table is this one. This is indeed the solution. And you can verify it at home-- see the brute-force sketch below. Now we know that indeed the break point is a very good restriction. And we are going, in the next lecture, to prove that it actually leads to polynomial growth, which is the main result we want. Let me stop here. And we will take the questions after a short break. Let's start with the questions. MODERATOR: The first question is, what if the target or the hypotheses are not binary? PROFESSOR: There is a counterpart for the entire theory that I'm going to develop, for real-valued functions and other types of functions. The development of the theory is technical enough that I'm going to develop it only for the binary case, because it is manageable. And it carries all of the concepts that you need. The other case is more technical. And I don't find the value of going to that level of technicality useful, in terms of adding insight. What I'm going to do is, I'm going to apply a different approach to real-valued functions, which is the bias-variance tradeoff. And it's a completely different approach from this one, that will give us another angle on generalization that is particularly suitable for real-valued functions. But the short answer is that, if the function is not binary, there is a counterpart to what I'm saying that will work. But it is significantly more technical than the one I am developing. MODERATOR: Just as a sanity check. When the hypothesis set can shatter the points, this is a bad thing, right? PROFESSOR: OK. There is a tradeoff that will stay with us for the entire course. It's bad and good. If you shatter the points, it's good for fitting the data, because I know that if you give me the data, regardless of what your data is, I'm going to be able to fit them, because I have something that can generate a hypothesis for any particular combination. So if your question is, can I fit the data? Then shattering is good. When you go to generalization, shattering is bad, because basically you can get anything. So it doesn't mean anything that you fit the data. And therefore, you have less hope of generalization, which will be formalized through the theoretical results. And the correct answer is, what is the good balance between the two extremes? We'll find a value for which we are not exactly shattering the points, but we are not very restricted, in which we are getting some approximation, and we're getting some generalization. And that will come up. MODERATOR: Is there a similar trick to the one you used for convex sets in higher dimensions? PROFESSOR: The principle I explained, I explained in terms of two dimensions and perceptrons. If you look at the essence of it, the space is X. It could be anything. The only restriction I have is binary functions. So this could be a high-dimensional space. And the surfaces will be very sophisticated surfaces. And all I'm reading off, as far as this lecture is concerned, is how many patterns I get on N points.
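[The brute-force check of the puzzle mentioned above-- a sketch: it enumerates every subset of the 8 possible rows on 3 points and keeps those in which no pair of columns shows all four patterns, i.e., those respecting the break point of 2.]

```python
from itertools import combinations, product

rows = list(product((0, 1), repeat=3))   # all 8 possible rows on 3 points

def respects_break_point_2(subset):
    # No pair of columns may exhibit all four patterns 00, 01, 10, 11.
    return all(len({(r[i], r[j]) for r in subset}) < 4
               for i, j in combinations(range(3), 2))

best = max((s for n in range(len(rows) + 1)
            for s in combinations(rows, n)
            if respects_break_point_2(s)),
           key=len)
print(len(best), best)   # 4 -- at most half of the 8 rows survive
```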
MODERATOR: Also a question on the complexity. Why is polynomial time usually considered acceptable? PROFESSOR: OK. Polynomial, in this case, is polynomial growth in the number of points N. It just so happens that we are working with the Hoeffding inequality, which gives us a very helpful term, the negative exponential. And therefore, if you get a polynomial, as I mentioned-- any polynomial-- you are guaranteed that for a large enough N, the right-hand side of the Hoeffding, including the growth function, will be small. And therefore, the probability of something bad happening is small. Now obviously, there are other functions that also will be killed by the negative exponential. For example, if I had a growth function of the form, let's say, e to the square root of N, that's not a polynomial. But that will also be killed by the negative exponential, because it's the square root versus the other one. It just so happens that we are in the very fortunate situation that the growth function is either identically 2 to the N, or else it's polynomial. There is nothing in between. If you draw something that is super-polynomial and sub-exponential, and try to find the hypothesis set for which this is the growth function, you will fail. So I'm getting it for free. I'm just taking the simplicity of the polynomial because, lucky for me, the polynomials are the ones that come out. And they happen to serve the purpose. MODERATOR: OK. A few people are asking, could you repeat the constraints of the puzzle? Because they didn't get the-- PROFESSOR: OK. Let's look at the puzzle. I am putting 3 bits on every row. I'm trying to get as many different rows as possible, under the constraint that if you focus on any 2 of the points-- so if I focus on x_1 and x_2 and go down the columns-- it must be that one of the possible patterns for x_1 and x_2 is missing. Because I'm saying that 2 is a break point, so I cannot shatter any 2 points. Therefore, I cannot shatter x_1 and x_2, among others, meaning that I cannot get all possible patterns. There are only four possible patterns, which are, in binary: 0 0, 0 1, 1 0, 1 1. And I'm representing them using the circles. In this case, x_1 and x_2 get 0 0, so to speak. If I keep adding a pattern-- so let's look here. x_1 and x_2, how many patterns do they have? They have this pattern. They have it again-- that doesn't count. So there's only one pattern here, plus one, is two. So on x_1 and x_2, I have only two patterns. So I haven't violated anything, because I will only be violating if I get all four patterns. So I'm OK, and similarly for the other guys. Things become interesting when you get to the fourth row. Now again, if you look at the first 2 points, I get one pattern here and one pattern here. There are only two patterns. Nothing is violated as far as these 2 points are concerned. But the constraint has to be satisfied for any choice of 2 points. So if you particularly choose x_2 and x_3, and count the number of patterns, you realize: 0 0, 0 1, 1 0, 1 1. I am in trouble. That's why we put it in red. Because now these guys have all possible patterns. And I know, by the assumption of the problem, that I cannot get all four patterns on any 2 points. So I cannot get this. So I'm unable to add this row under those constraints. And therefore, I'm taking it away. And I'm going through the exercise. And every time I put a row, I keep an eye on all possible combinations. So here, I put-- let's look at x_1 and x_2. 1 pattern, 2, 3. I'm OK.
MODERATOR: OK. A few people are asking, could you repeat the constraints of the puzzle? Because they didn't get the-- PROFESSOR: OK. Let's look at the puzzle. I am putting 3 bits on every row. I'm trying to get as many different rows as possible, under the constraint that if you focus on any 2 of them-- so if I focus on x_1 and x_2 and go down the columns, it must be that one of the possible patterns for x_1 and x_2 is missing. Because I'm saying that 2 is a break point, so I cannot shatter any 2 points. Therefore, I cannot shatter x_1 and x_2, among others, meaning that I cannot get all possible patterns. There are only four possible patterns, which, read as binary, are 0 0, 0 1, 1 0, 1 1. And I'm representing them using the circles. In this case, x_1 and x_2 get 0 0, so to speak. If I keep adding a pattern-- so let's look here. x_1 and x_2, how many patterns do they have? They have this pattern. They have it again, which doesn't count. So there's only one pattern here, plus one, is two. So on x_1 and x_2, I have only two patterns. I haven't violated anything, because I would only be violating the constraint if I got all four patterns. So I'm OK, and similarly for the other pairs. Things become interesting when you start adding the fourth row. Now again, if you look at the first 2 points, I get one pattern here and one pattern here. There are only two patterns. Nothing is violated as far as these 2 points are concerned. But the constraint has to be satisfied for any choice of 2 points. So if you choose, in particular, x_2 and x_3, and count the number of patterns, you realize: 0 0, 0 1, 1 0, 1 1. I am in trouble. That's why we put it in red. Because now these points have all possible patterns. And I know, by the assumption of the problem, that I cannot get all four patterns on any 2 points. So I cannot get this. I'm unable to add this row under those constraints, and therefore I'm taking it away. And I keep going through the exercise. Every time I put down a row, I keep an eye on all possible combinations. So here, I put-- let's look at x_1 and x_2: 1 pattern, 2, 3. I'm OK. x_2 and x_3: 1 pattern, which is here and here, then 2, 3. I'm also OK. And then you take x_1, x_3. Here is a pattern. It repeats here, 0 0 and 0 0, so that's one. And then I get this one and this one, making 3. So this row is fine for every pair. Not perfect in any sense, except that I didn't violate anything, so I'm allowed to put that row in. Now, when I extend this further and start putting in the new rows, for this one, there is a violation. You can scan with your eyes and try to find the violation; I'm highlighting it in red. I'm showing you that for x_1 and x_3, all four patterns appear. Here's one pattern, then the second one. I didn't count this one, just because it already happened. So I just highlight four different ones, then the third one and the fourth one. So I cannot possibly add this row, because it violates the constraint on these 2 points. I take it out and keep adding. Another attempt, this is the next one. It still violates. Why does it violate? By the same argument: look at the red entries, and you find all possible patterns. So I cannot have it, and we take it away. And then the last remaining one is this row. That also doesn't work, because it causes a violation for those points; you can look at it and verify. And the conclusion here is that I cannot add anything. So that's what I'm stuck with. Therefore, the number of different rows I can get, under the constraint that 2 is a break point, is in this case 4. Obviously, as I remarked, instead of going gradually, 0 0 0, 0 0 1, and so on, maybe you can start more cleverly. But however you try it, the problem is sufficiently symmetric in the bits that it doesn't make a difference. You will be stuck with at most 4.
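The puzzle's conclusion is easy to verify exhaustively by machine. Here is a small brute-force check, an illustrative sketch rather than code from the lecture: it tries every set of distinct 3-bit rows and keeps only the sets in which no 2 columns exhibit all four patterns, confirming that 4 rows is the maximum no matter how cleverly you choose.

    from itertools import combinations, product

    rows = list(product([0, 1], repeat=3))      # all 8 possible rows of 3 bits

    def shatters_some_pair(subset):
        # True if some pair of columns shows all four patterns 00, 01, 10, 11.
        return any(len({(r[i], r[j]) for r in subset}) == 4
                   for i, j in combinations(range(3), 2))

    # Largest set of rows with break point 2 (no pair of columns shattered).
    best = max((s for k in range(len(rows) + 1)
                for s in combinations(rows, k)
                if not shatters_some_pair(s)), key=len)
    print(len(best))                            # prints 4, matching the lecture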
MODERATOR: OK. In the slide with the Hoeffding inequality, does anything change when you change-- specifically, does the probability measure change when you go from hypotheses to dichotomies? PROFESSOR: For this one? MODERATOR: Yeah. PROFESSOR: Yeah. The idea here is that M is the number of hypotheses, period. So it's infinite for perceptrons. We have to live with that. In our attempt to replace it with the growth function, we are going to replace it by something that is not infinite, something bounded above by 2 to the N. As you can see, 2 to the N is not really helpful, because then I have a positive exponential against a negative exponential, and that's not decisive. Therefore, I am trying to see if I can put the growth function there-- not only put the growth function there, but also show that the growth function is polynomial for the models of interest, and therefore get this to be a small quantity for a real learning model, like the perceptron, or neural networks, et cetera. All of these will have a polynomial growth function, as we will see. So that's where the number of hypotheses, which is M, goes to the number of dichotomies, which is the growth function. It is not a direct substitution, as we will see; there are some technicalities involved. But that is what gets me the right-hand side to be a manageable quantity, one that goes to 0 as N grows, which tells me that the probability of good generalization will be high. MODERATOR: OK. Is there a systematic way to find the break points? PROFESSOR: There is, but it's not one size fits all. There are arguments-- for example, you can make them for neural networks. Sometimes you find the break point by finding a particular number of points that you cannot shatter, and arguing that this is the break point. Sometimes you argue differently: let me try to find a crude estimate for the growth function, and say the growth function cannot be more than this. And then, as you go along, you realize that this estimate is not exponential. So there has to be a break point somewhere, because the estimate falls below 2 to the N at some point, and that point is a break point. In that case, the estimate for the break point is just that, an estimate. It will not be an exact value, but it will be an upper bound. We have a question in house. STUDENT: Hi. So in this slide, the top N is the number of testing points and the lower N is the number of training points? PROFESSOR: Yeah. N is always the size of the sample. And it's a question of interpretation between the two: whether that sample is used for testing, which means that you have already frozen your hypothesis and you are just verifying it, testing it; or, in the other case, you haven't frozen your hypothesis, and you are using the same sample to go around and find one. And you are charged for the going-around aspect by M. STUDENT: So let's say that our customer gives us k sample points. How do we decide how many of them to reserve for testing, and how many for training? PROFESSOR: This is a very good point. There will be a lecture down the road, called validation, in which this is addressed very specifically. There are some mathematical results, but mostly there are a few rules of thumb that I'm going to state without proof, rules that have stood the test of time. And one of them applies exactly here: how many do we reserve, so that we don't diminish the training set too much, while still keeping a big enough test set that the estimate is reliable? So this will come up. Thank you. There is another question. STUDENT: Hi, professor. I have one question. For 2 hypotheses that have the same dichotomy, is it true that the in-sample error is the same for the 2 hypotheses? PROFESSOR: OK. If they have the same dichotomy, that's an even stronger condition than needed, because they return exactly the same values on the sample. Now, the in-sample error is the fraction of points you get wrong. The target function is fixed, so that is not going to change. Obviously, I'm going to get the same pattern of errors, and if I get the same pattern of errors, then obviously I'm getting the same fraction of errors, among other things. Now, if you're asking, for these 2 hypotheses, what is the out-of-sample error-- that's a different story, because for the out-of-sample error, you take the hypothesis in its entirety. In spite of the fact that the two are the same on the sample points, they may not be the same on the entire input space-- and indeed they aren't, because they're different hypotheses. Therefore, you get a different E_out. But the answer is yes, you will get the same in-sample error. STUDENT: Oh, yes. I see. That's why I was asking. Because I think that the out-of-sample error is different for the 2 hypotheses. So can we replace the M with-- PROFESSOR: Exactly. And that is the biggest technicality in the proof. We were saying we're going to replace M by the growth function. That's a very helpful thing, but there has to be a proof. And I will argue for the proof, the overlapping aspects, and some of this. The key point is: what do I do about this fellow? Because when I consider the sample, this one is very much under control. As you said, if I have 2 hypotheses that are the same here, they are the same here. But they are not the same here. The statement here depends on E_out, which depends on the whole input space. So how am I going to get away with that? That's really the main technical contribution of the proof, and it will come up next time. STUDENT: Sure, thank you. PROFESSOR: Sure.
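As a concrete companion to that answer, here is a toy sketch. Everything in it, the 10-point input space, the target, and the two hypotheses, is made up for illustration: two different hypotheses that produce the same dichotomy on the sample must have the same E_in, yet their E_out can differ. E_out is computed exactly here only because this toy input space is finite and weighted uniformly.

    X = list(range(10))                  # a tiny finite input space (hypothetical)
    f = lambda x: x % 2                  # made-up target function
    h1 = lambda x: 1 if x in (1, 2, 3, 5, 6) else 0
    h2 = lambda x: 1 if x in (1, 2, 3, 5) else 0

    sample = [0, 1, 2, 3, 4, 5]
    assert all(h1(x) == h2(x) for x in sample)   # same dichotomy on the sample

    def error(h, points):
        # Fraction of points where h disagrees with the target f.
        return sum(h(x) != f(x) for x in points) / len(points)

    print(error(h1, sample), error(h2, sample))  # same E_in: both 1/6
    print(error(h1, X), error(h2, X))            # different E_out: 0.4 vs 0.3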
MODERATOR: So-- why is it called a growth function? PROFESSOR: A growth function. I really-- The person who introduced it called it a growth function. I guess he called it that because it grows as you increase N. I don't think there is any particular merit to the name. MODERATOR: Is there-- what is a real-life situation similar to the one in the puzzle, where you realize that this break point may be too small? PROFESSOR: OK. The first order of business is to get the break point question out of the way: if there is a break point, we are in business. The second is: how does the value of the break point relate to the learning situation? Do I need more examples when I have a bigger break point? The answer is yes. What is the estimate? There is a theoretical estimate, a bound, but maybe the bound is too loose. So we'll have to find practical rules of thumb that translate the break point into a number of examples. All of this is coming up. So the existence of the break point means learning is feasible, and the value of the break point tells us the resources needed to achieve a certain performance. That will be addressed. MODERATOR: Is there a probabilistic statement for the Hoeffding inequality that is an alternative to the case-by-case discussion of M's growth rate in N? PROFESSOR: There are alternatives to Hoeffding, and you can get different results or emphasize different things. I am sticking to Hoeffding, and I'm not indulging too much in its derivation or the alternatives, because this is a mathematical tool that I'm borrowing, and I'm taking it for granted. I picked the one that will help us the most, which is this one. So yes, there are variations, but I am deliberately not getting into them, in order not to dilute the message. I want people to become so incredibly familiar, and bored, with this one that they know it cold. Because when we get to modify it, including the growth function and the other technical points, I'd like the base point to be completely clear in people's minds, so that they don't get lost in the modifications. So that's why I'm sticking to this. MODERATOR: I think that's it. PROFESSOR: Very good. We'll see you next time.