Lecture 11 - Overfitting

ANNOUNCER: The following program is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we introduced neural networks, and we started with multilayer perceptrons. The idea is to combine perceptrons using logical operations like ORs and ANDs, in order to implement more sophisticated boundaries than the simple linear boundary of a perceptron. And we looked at a final example, where we were trying to implement a circular boundary, and we realized that we can actually do this-- at least approximate it-- if we have a sufficient number of perceptrons. And we convinced ourselves that combining perceptrons in a layered fashion will be able to implement more interesting functionalities. Then we faced the simple problem that, even for a single perceptron, when the data is not linearly separable, the optimization-- finding the boundary based on data-- is a pretty difficult optimization problem. It's combinatorial optimization. And therefore, it is next to hopeless to try to do that for a network of perceptrons. So we introduced neural networks, which came in as a way of having a nice algorithm for multilayer perceptrons, by simply softening the threshold. Instead of jumping from -1 to +1, it goes from -1 to +1 gradually, using a sigmoid function-- in this case, the tanh. If the signal-- the usual signal that goes into the perceptron-- is large, large negative or large positive, the tanh approximates -1 or +1, so we get the decision function we want. And if s is very small, tanh(s) is almost linear. The most important aspect is that it's differentiable-- it's a smooth function, and therefore the dependency of the error in the output on the parameters w_ij will be a well-behaved function, to which we can apply things like gradient descent.

And the neural network looks like this. It starts with the input, followed by a bunch of hidden layers, followed by the output layer. We spent some time arguing about the function of the hidden layers, how they transform the inputs into a particularly useful nonlinear transformation as far as implementing the output is concerned, and the question of interpretation. And then we introduced the backpropagation algorithm, which is stochastic gradient descent applied to neural networks. Very simply, it decides on the direction along every coordinate in the w space, using the very simple rule of gradient descent. In this case, you only need two quantities. One of them is x_i, which is computed using the forward formula, so to speak, going from layer l-1 to layer l. Then there is another quantity that we defined, called delta, which is computed backwards: you start from layer l, and go to layer l-1. And the formula is strikingly similar to the forward formula, except that instead of applying the nonlinearity, you multiply by its derivative. Once you get all the delta's and x's by a forward and a backward run, you can simply decide on the move in every weight, according to a very simple formula that involves the x's and the delta's. The simplicity of the backpropagation algorithm, and its efficiency, are the reasons why neural networks have been, for quite some time now, a very popular standard tool for implementing machine learning in industry.
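To make the recap concrete, here is a minimal sketch of the two passes for a network with one tanh hidden layer and squared error (the architecture, the learning rate eta, and the omission of bias terms are simplifications for illustration, not the lecture's exact setup):

```python
# A minimal sketch of the forward and backward passes, for one tanh
# hidden layer with squared error (bias terms omitted for brevity).
import numpy as np

def forward(x, W1, W2):
    x1 = np.tanh(W1 @ x)        # forward: x^(l) = tanh(signal from layer l-1)
    out = np.tanh(W2 @ x1)
    return x1, out

def sgd_step(x, y, W1, W2, eta=0.1):
    x1, out = forward(x, W1, W2)
    # backward: delta at the output, from the derivative of the error
    delta2 = 2 * (out - y) * (1 - out**2)       # tanh'(s) = 1 - tanh(s)^2
    # delta propagated to the previous layer: weights times delta,
    # multiplied by the derivative of the nonlinearity
    delta1 = (W2.T @ delta2) * (1 - x1**2)
    # the move in every weight: -eta * x (forward) * delta (backward)
    W2 -= eta * np.outer(delta2, x1)
    W1 -= eta * np.outer(delta1, x)
    return W1, W2
```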
Today, I'm going to start a completely new topic. It's called overfitting, and it will take us three full lectures to cover overfitting and the techniques that go with it. And the techniques are very important, because they apply to almost any machine learning problem that you're going to see, and they are applied on top of whatever algorithm or model you use. So you can use neural networks or linear models, et cetera, but the techniques that we're going to study here, which are regularization and validation, apply to all of these models. This is another layer of techniques for machine learning. And overfitting is a very important topic. It is fair to say that the ability to deal with overfitting is what separates professionals from amateurs in machine learning. Everybody can fit, but if you know what overfitting is, and how to deal with it, then you have an edge over someone who doesn't know the fundamentals.

So the outline today is: first, we are going to start with the notion itself-- what is overfitting? Then we are going to identify the main culprit for overfitting, which is noise. And after observing some experiments, we will realize that noise covers more territory than we thought. There's actually another type of noise, which we are going to call deterministic noise. It's a novel notion that is very important for overfitting in machine learning, and we're going to talk about it a little bit. Then, very briefly, I'm going to give you a glimpse into the next two lectures by telling you how to deal with overfitting. And then we will be ready, having diagnosed what the problem is, to go for the cures-- regularization next time, and validation the time after that.

OK. Let's start by illustrating the situation where overfitting occurs. Let's say we have a simple target function-- take it to be a 2nd-order target function, a parabola. My input space is the real numbers: I have only a scalar input x, there's a value y, and I have this target that is 2nd-order. We are going to generate five data points from that target, in order to learn from. This is an illustration. Let's look at the five data points. As you see, the data points look like they belong to the curve, but they don't seem to belong perfectly to the curve. So there must be noise, right? This is a noisy case, where the deterministic part of the target is a function, and then there is added noise. It's not a lot of noise, obviously-- a very small amount. But nonetheless, it will affect the outcome. So we do have a noisy target in this case. Now, suppose I just give you the five points, which is the case you face when you learn: the target disappears, you have five points, and you want to fit them. Going back to your math, you realize: I want to fit five points-- a 4th-order polynomial will do it, right? It has five parameters. So let's fit it with a 4th-order polynomial. (This is the guy who doesn't know machine learning, by the way.) So I use the 4th-order polynomial, and what will the fit look like? A perfect fit, in-sample. And you measure your quantities. The first quantity is E_in. Success! We achieved zero training error. And then when you go for the out-of-sample performance, you are comparing the red curve to the blue curve, and the news is not good. I'm not even going to calculate it-- it's just huge. This is a familiar situation for us, and we know what the deal is.
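Here is a small sketch of this situation, with an assumed parabola as the target and an assumed noise level (the lecture doesn't give the exact numbers):

```python
# Five slightly noisy points from a 2nd-order target, fit perfectly
# by a 4th-order polynomial: E_in is zero, E_out blows up.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2                           # 2nd-order target (assumed)
x = rng.uniform(-1, 1, 5)
y = f(x) + 0.05 * rng.standard_normal(5)     # five slightly noisy points

w = np.polyfit(x, y, 4)                      # 4th-order fit: 5 parameters, 5 points
E_in = np.mean((np.polyval(w, x) - y)**2)    # zero, up to round-off

x_test = rng.uniform(-1, 1, 10_000)
E_out = np.mean((np.polyval(w, x_test) - f(x_test))**2)   # huge
print(E_in, E_out)
```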
The point I want to make here is that overfitting is a comparative term. It must be that one situation is worse than another: you went further than you should have. And there is a distinction between overfitting and just bad generalization. The reason I'm calling this overfitting is that if you used, let's say, a 3rd-order polynomial, you would not be able to achieve zero training error in general, but you would get a better E_out. The overfitting here happened by using the 4th order instead of the 3rd order. You went further. That's the key.

And that point is made even more clearly when we talk about neural networks, and overfitting within the same model. In the case of overfitting with a 3rd-order polynomial versus a 4th-order polynomial, you are comparing two models. Here, I'm going to take just neural networks, and I'll show you how overfitting can occur within the same model. So let's say we have a neural network, and it is fitting noisy data-- a typical situation. You run your backpropagation algorithm for a number of epochs, and you plot what happens to E_in, and you get this curve. Can you see this curve at all? Let me try to magnify it, hoping that it will become clearer. A little bit better. This is the number of epochs. You start from an initial condition, a random vector. Then you run stochastic gradient descent, evaluate the total E_in at the end of every epoch, and plot it. And it goes down. It doesn't go to zero: the data is noisy, and you don't have enough parameters to fit it perfectly. But this looks like a typical situation, where E_in goes down. Now, because this is an experiment, you have set aside a test set that you did not use in training. And what you are going to do is take this test set and evaluate what happens out-of-sample-- not only at the end, but as you go. Just to see: as I train, am I making progress out-of-sample or not? You're definitely making progress in-sample. So you plot the out-of-sample error, estimated by the test set, and this is what you get.

Now, there are many things you can say about this curve. One of them is that in the beginning, when you start with a random w, in spite of the fact that you're using a full neural network, when you evaluate at this point you have only one hypothesis, and it does not depend on the data set-- it's just the random w that you got. So it's not a surprise that E_in and E_out are about the same value here. As you go down the road, going from one iteration to the next, you're exploring more and more of the space of weights. So you are getting the benefit, or the harm, of having the full neural network model, gradually. In the beginning, you are only exploring a small part of the space. So if you can think of an effective VC dimension as you go-- if you can define such a thing-- then that effective VC dimension grows with time until, after you have explored the whole space, or at least potentially explored it had you been given different data sets, it becomes the total number of free parameters in the model. So the generalization error, which is the difference between the red and green curves, is getting worse and worse. That's not a surprise. But there is an important point here, which happens around here. Let me now shrink this back, now that you know where the curves are, and let's look at where overfitting occurs.
Overfitting occurs when you knock down E_in-- you get a smaller E_in-- but E_out goes up. If you look at these curves, you will realize that this is happening around here. Now, there is very little difference in generalization error just before the blue line and just after it. Yet I am making a specific distinction: crossing this boundary gets you into overfitting. Why is that? Because up till here, I can always reduce E_in, and in spite of the fact that E_out is following suit with very diminishing returns, it's still a good idea to minimize E_in, because you are getting a smaller E_out. The problems happen when you cross, because now you think you're doing well-- you are reducing E_in-- while you are actually harming the performance. That's what needs to be taken care of. So that's where overfitting occurs. In this situation, it might be a very good idea to detect when this happens, simply stop at that point, and report that hypothesis, instead of the final hypothesis you would get after all the iterations. Because in this case, you're going to get this E_out instead of that E_out, which is better. And indeed, the algorithm that goes with this is called early stopping. It will be based on validation, and although it's based on validation, it really is a form of regularization, in the sense of putting on the brakes.
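A hedged sketch of that idea, on an assumed toy model (a 6-parameter polynomial fit trained by SGD; the data, model, and learning rate are all illustrative assumptions):

```python
# Early stopping: monitor the error on a held-out set at the end of
# every epoch, and keep the weights from the epoch where it was smallest.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2 * x + 0.3 * rng.standard_normal(200)       # noisy linear target (assumed)
Z = np.column_stack([x**k for k in range(6)])    # a flexible 6-parameter model
Z_tr, y_tr, Z_val, y_val = Z[:30], y[:30], Z[30:], y[30:]   # small training set

w = np.zeros(6)
best_err, best_w, eta = np.inf, w.copy(), 0.05
for epoch in range(1000):
    for i in rng.permutation(len(y_tr)):         # one epoch of SGD
        w -= eta * (Z_tr[i] @ w - y_tr[i]) * Z_tr[i]
    val_err = np.mean((Z_val @ w - y_val)**2)    # held-out estimate of E_out
    if val_err < best_err:                       # new minimum of the red curve
        best_err, best_w = val_err, w.copy()
# report best_w, not the final w after all the iterations
```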
So now we can see the relative aspect of overfitting. Overfitting can happen when you compare two things, whether those are two different models, or two instances within the same model. And we look at this and say: if there is overfitting, we had better be able to detect it, in order to stop earlier than we otherwise would, because otherwise we will be harming ourselves. So this is the main story.

Now let's look at overfitting as a definition, and at the culprit for it. Overfitting, as a criterion, is the following: fitting the data more than is warranted. And this is a little bit strange. What would be more than is warranted? I mean, we are in machine learning-- we are in the business of fitting data. So I can fit the data, and I keep fitting it, but there comes a point where this is no longer good. Why does this happen? What is the culprit? The culprit, in this case, is that you're actually fitting the noise. The data has noise in it, and you are looking at the finite sample that you got, trying to get it right. In trying to get it right, you are inadvertently fitting the noise. This is understood, and I can see that this is not good-- at the very least, it's not useful. There's no pattern to detect in the noise, so fitting the noise cannot possibly help me out-of-sample. However, if it were only useless, we would be OK-- we wouldn't be having this lecture. You might think: I am given the data; the data has signal and noise; I cannot distinguish between them. I just get x and get y. y has a component which is signal and a component which is noise, but I get just one number, and I am fitting it. I'm in the business of fitting; I cannot distinguish the two; fitting the noise is the cost of doing business. If it were just useless, I would have wasted some effort, but nothing bad would have happened. The problem really is that it's harmful. It's not a question of being useless, and that's a big difference. Because machine learning is machine learning: if you fit the noise in-sample, the learning algorithm detects a pattern-- it imagines a pattern-- and extrapolates it out-of-sample. Based on the noise, it gives you something out-of-sample and tells you: this is the pattern in the data. Which, obviously, it isn't. And that will obviously worsen your out-of-sample performance, because it's taking you away from the correct solution. You can think of the learning algorithm in this case, detecting a pattern that doesn't exist, as hallucinating: oh, there's a great pattern, and this is what it looks like-- and it reports it, and eventually that imaginary thing ends up hurting the performance.

So let's look at a case study. The main reason for the case study is that we now vaguely understand that this is a problem of noise, so let's see how the noise affects the situation. Can we get overfitting without noise? What is the deal? I'm going to give you a specific case. I'm going to start with a 10th-order target-- a 10th-order polynomial. I'm always working on the real numbers: the input is a scalar, and I'm defining polynomials based on it. One such 10th-order target looks like this. You choose the coefficients somehow, and you get something like that-- a fairly elaborate thing. Then you generate data, and the data will be noisy, because we want to investigate the impact of noise on overfitting. Let's say I generate 15 data points. This is what you get. Look at these points: the noise here is not as trivial as it was last time. Obviously, these points are not lying on the curve, so there is noise contributing to that. Now, the other guy, which is 50th-order, is noiseless. That is, I'm going to generate a 50th-order polynomial-- obviously much more elaborate than the blue curve here-- but I'm not going to add noise to it. I'm also going to generate 15 points from this guy, but these 15 points, as you will see, lie perfectly on the curve. This is all of them here. So this is the data, this is the target, and the data lies on the target.

These are two interesting cases. One of them is a simple target, so to speak, with added noise that makes it complicated. The other is complicated in a different way: it's a high-order target to begin with, but there is no noise. These are the two cases in which I'm going to investigate overfitting. We are going to have two different fits for each target. We are in the business of overfitting-- we have to have comparative models. So I'm going to have two models to fit each case, and see if I get overfitting here, and overfitting there. This is the first guy that we saw before, the simple target with noise. And this guy is the other one, the complex target without noise. 10th-order, 50th-order. We'll refer to them as a noisy low-order target, and a noiseless high-order target. This is what we want to learn. Now, what are we going to learn with? Two models. One of them is a 2nd-order polynomial that we're going to use to fit-- that's our model. And the other is a 10th-order polynomial. These are the two guys we are going to use. Here's what happens with the 2nd-order fit. You have the data points, and you fit them, and it's not surprising: for the 2nd order, it's a simple curve, and it tries to find a compromise.
Here we are applying mean squared error, so this is what you get. Now, let's analyze the performance of this fellow. What I'm going to list here, as you see, are the in-sample error and the out-of-sample error, for the 2nd order, which is already here, and for the 10th order, which I haven't shown yet. The in-sample error in this case is 0.05. This is a number; obviously, it depends on the scale. When you get the out-of-sample version, not surprisingly, it's bigger, because this one fit the data and the other is out-of-sample. But the difference is not dramatic, and this is the performance you get. Now let's apply the 10th-order fit. You can already foresee what problem might arise here. The red curve sees the data, tries to fit it, uses all the degrees of freedom it has-- it has 11 of them-- and gets this. When you look at the in-sample error, obviously it must be smaller than the in-sample error here: you have more parameters to fit with, and you fit better, so you get a smaller in-sample error. And what is the out-of-sample error? Just terrible. So this is patently a case of overfitting. When you went from 2nd order to 10th order, the in-sample error indeed went down, and the out-of-sample error went up. Way up. So you say: this confirms what we said before-- we are fitting the noise. And you can see here that you're actually fitting the noise. The red curve is trying to go for these points, and you know that these points are off the target. Therefore, the red curve is bending specifically in order to capture something that is really noise. So this is the first case.

The second case is a little bit strange, because here we don't have any noise. We take the same two models-- 2nd order and 10th order-- and fit here. Let's see how they perform. Well, this is the 2nd-order fit. Again, that's what you expect from a 2nd-order fit. You look at the in-sample and out-of-sample errors, and they are OK-- ballpark fine. You get some error, and the other one is bigger. Now we go for the 10th order, which is the interesting one. This is the 10th order. You need to remember that the 10th order is fitting a 50th order, so it really doesn't have enough parameters to fit it, even if we had all the glory of the target function in front of us. But we don't have all the glory of the target function-- we have only 15 points. So it does as good a job of fitting as possible. When we look at the in-sample error, it is definitely smaller than here, because we have more parameters. It's actually extremely small-- it did really, really well. And then you go for the out-of-sample error. Oh, no! You see, this is squared error, so these excursions, when the curve dives down here and shoots up there, kill you. And indeed they did. This is overfitting galore. And now you ask yourself: you just told us about noise and no noise. This is noiseless, right? Why did we get overfitting here? We will find out that the reason we are getting overfitting here is that this guy actually does have noise-- but it's not your usual noise. It's another type of noise. And getting that notion down is very important for understanding the situations in practice where you are going to get overfitting. You could be facing a situation that is completely noiseless in the conventional sense, and yet there is overfitting, because you are fitting another type of noise.
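To reproduce the flavor of this comparison, here is a hedged sketch (the target coefficients and the noise level are assumptions; the exact numbers in the table will differ):

```python
# Fit a noisy 10th-order target with both a 2nd-order and a 10th-order
# model, and compare the in-sample and out-of-sample errors.
import numpy as np

rng = np.random.default_rng(1)
coef = rng.standard_normal(11)               # some 10th-order target (assumed)
f = lambda x: np.polyval(coef, x)

x = rng.uniform(-1, 1, 15)
y = f(x) + 0.3 * rng.standard_normal(15)     # 15 noisy data points

x_test = rng.uniform(-1, 1, 100_000)
for deg in (2, 10):
    w = np.polyfit(x, y, deg)
    E_in = np.mean((np.polyval(w, x) - y)**2)
    E_out = np.mean((np.polyval(w, x_test) - f(x_test))**2)
    print(deg, E_in, E_out)   # deg 10: smaller E_in, typically far bigger E_out
```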
So let's look at the irony in this example. Here is the first case-- the noisy simple target. You are learning a 10th-order target, and the target is noisy. I'm not showing the target here; I'm showing the data points together with the two fits. Now let's say that I tell you the target is 10th order, and you have two learners. One of them is O, and one of them is R-- O for overfitting, and R for restricted, as it turns out. And you tell them: guys, I'm not going to tell you what the target is, because if I told you what the target is, this would no longer be machine learning. But let me help you out a little bit: the target is a 10th-order polynomial, and I'm going to give you 15 points. Choose your model. Fair enough? The information given does not depend on the data set, so it's fair. The first learner says: I know that the target is 10th order-- why not pick a 10th-order model? Sounds like a good idea. And they do this, and they get the red curve, and they cry and cry and cry! The other guy says: oh, it's a 10th-order target? Who cares? How many points do you have? 15. OK, 15. I am going to take 2nd order, and I am actually pushing my luck, because 2nd order is 3 parameters, I have 15 points, and the ratio is 5. Someone told us a rule of thumb that it should be 10. I'm flirting with danger. But I cannot use a line when you are telling me the thing is 10th order, so let me try my luck with 2nd order. That's what they do. And they win.

It's a rather interesting irony, because there is a notion in people's minds that you should get as much information about the target function as you can, and put it into the hypothesis set. In some sense this is true, for certain properties. But when it comes to matching the complexity, the learner who took the 10th-order target to heart, and put that information all too literally into the hypothesis set-- I'm taking a 10th-order hypothesis set-- lost. So, as we know all too well now, you match the data resources, rather than the target complexity. There are other properties of the target function that we will take to heart-- symmetry and whatnot; there are a bunch of hints that we can take. But complexity is not one of the things where you can just apply the general idea of: let me match the target function. In this case, you are looking at generalization issues, and you know that generalization issues depend on the size and quality of the data set.

Now, the example that I just gave you, we have seen before, when we introduced learning curves-- if you remember what those were. Those plotted how E_in and E_out change with the number of examples. I showed you one, and I told you it was an actual situation we'd see later. This is that situation. This is the case where you take the 2nd-order polynomial model, H_2, and the inevitable error, which is the black line, now comes not only from the limitations of the model-- the inability of a 2nd order to replicate a 10th order, which is the target in this case-- but also from the added noise. So there's an amount of error that is inevitable because of the noise, and the model is very limited. But the generalization, which is the difference between the two curves, is not bad. And if you have more examples, the two curves will converge, as they always do-- but they converge to the inevitable amount of error, which is dictated by the fact that you're using such a simple model.
And when we looked at the other case, also introduced at the time-- this was the 10th-order fellow. With the 10th-order fellow, you can fit a lot, so the in-sample error is always smaller than here. That is understood. The out-of-sample error starts out terrible, because you are overfitting. Then it goes down, and it converges to something better, because that reflects the ability of H_10 to approximate a 10th order-- which should be perfect, except that we have noise. So all of this is actually due to the noise added to the examples. And the gray area is the interesting part for us. In the gray area, the in-sample error of the more complex model is smaller-- it's always smaller, but we are observing it here-- and the out-of-sample error is bigger. That's what defines the gray area. Therefore, in this gray area very specifically, overfitting is happening: if you move from the simpler model to the bigger model, you get better in-sample error and worse out-of-sample error.

Now we realize that the guy who chose the correct complexity is not going to lose forever. They lost only because the number of examples was inadequate. If the number of examples is adequate, they will win handily. Like here-- if you look here, you end up with an out-of-sample error far better than you would ever get with the simple model. But now I have enough examples to be able to do that. So we understand overfitting, and we understand that it will not happen for every number of examples-- it happens for a small number of examples, where you cannot pin down the function, and you suffer from the usual bad generalization that we saw.

Now, we noticed that we get overfitting even without noise, and we want to pin that down a little bit. So let's look at this case. This is the case of the 50th-order target, the higher-order target that doesn't have any noise-- conventional noise, at least. And these are the two fits. There's still an irony here, because consider the two learners: the first chose the 10th order, and the second chose the 2nd order. The idea here is the following. You told me that the target has no noise, right? That means I don't have to worry about overfitting. Wrong-- but we'll see why. Given the choices, I'm going to try to get close to the 50th order, because I have a better chance. If I choose the 10th order and someone else chooses the 2nd order, I'm closer to the 50th, so I think I will perform better. At least that's the concept. So you do this, knowing that there is no noise, and again you get bad performance. And you say to yourself: this is not my day. I tried everything, I seem to be making the wise choice, and I'm always losing. Why is this the case, when there is no noise? And then you ask: is there really no noise? That will lead us to defining an actual noise in this case, and we'll analyze it and understand what it is about.

So I will take these two examples, and then run a very elaborate experiment, and I will show you the results of that experiment. I encourage you, if you are interested in the subject, to simulate this experiment yourself-- all the parameters are given. It will give you a very good feel for overfitting, because now we are going to look at the figure and have no doubt in our minds that overfitting will occur whenever you encounter a real problem. And therefore, you have to be careful.
It's not like I constructed a particular funny case. No-- if you average over a huge number of experiments, you will find that overfitting occurs in the majority of cases. So let's look at the detailed experiment. I'm going to study the impact of two things: the noise level, which I have already conceptually convinced myself is related to overfitting, and the target complexity, just because it does seem to be related. I'm not sure why, but it seems that when I took a complex target, albeit noiseless, I still got overfitting-- so let me see what the target complexity does.

We are going to take a general target function-- I'm going to describe what it is-- and add noise to it. The noise is a function of x, so I'm treating it generically, and as always, we have independence from one x to another. In spite of the fact that the parameters of the noise distribution may depend on x-- I can have different noise for different points in the space-- the realization of epsilon is independent from one x to another. That is always the assumption: different data points are independent. So this is the setup, and I'm going to measure the level of noise by the energy in that noise, which we will call sigma squared. I'm taking the expected value of epsilon to be 0-- if there were a nonzero expected value, I would put it in the target, so I remain with 0. Then there's fluctuation around it, and the fluctuation could be big-- large sigma squared-- or small, and I'm quantifying it with sigma squared. No particular distribution is needed. You can say Gaussian, and indeed I used a Gaussian in the experiment, but for the statement, you just need the energy.

Now let's write it down. I want to make the target function more complex, at will, so I'm going to make it a higher-order polynomial. Now I have another parameter, pretty much like sigma squared: capital Q, the order of the polynomial. I'm calling it Q_f because it describes the complexity of the target f-- just to remember that it's related to f. And what I do is define the target as a polynomial plus noise: y = sum from q = 0 to Q_f of alpha_q x^q, plus epsilon(x)-- so the deterministic part is indeed a Q_f-th-order polynomial. Now, in order to run the experiment right, I'm going to normalize this quantity such that the energy of the signal is always 1. The reason is that I want sigma squared to mean something: the signal-to-noise ratio is what matters, so if I normalize the signal to energy 1, I can say that sigma squared really is the amount of noise.

If you look at this, it is not easy to generate interesting polynomials using this formula. If you pick the coefficients at random-- say, independent random coefficients, in order to generate a general target-- remember that these terms are the powers of x. You start with x, then the parabola, then the 3rd order, then the 4th, then the 5th. Very, very boring guys: one bends this way, another bends that way, and they get steeper and steeper. So if you combine them with random coefficients, you will almost always get something that looks this way, or that way, and the lower-order terms don't play a role, because the highest order dominates. The way to get interesting targets is, instead of generating the alpha_q's at random, to go for a standard set of polynomials called the Legendre polynomials.
Legendre polynomials are just polynomials with specific coefficients. There is nothing mysterious about them, except that the coefficients are chosen such that, from one order to the next, the polynomials are orthogonal to each other-- like harmonics in a sinusoidal expansion. If you take the 1st-order Legendre polynomial, then the 2nd, the 3rd, the 4th, and take the inner product of any two of them, you get 0. They are orthogonal to each other, and you normalize them to energy 1. Because of this, if you take a combination of Legendre polynomials with random coefficients, you get something interesting-- all of a sudden, you get shape. And when you are done, it is still just a polynomial: all you do is collect the terms that happen to be the coefficients of x, of x squared, of x cubed, and those become your alpha's. Nothing has changed in the fact that I'm generating a polynomial; I was just generating the alpha's in an elaborate way, in order to make sure that I get interesting targets. That's all there is to it. As far as we are concerned, we have generated targets of this form that happen to be interesting-- representative of different functionalities.

So we have the noise level-- that's one parameter that affects overfitting. We have, potentially, the target complexity, which seems to affect overfitting-- at least we are conjecturing that it does. And the final quantity that affects overfitting is the number of data points: if I give you more data points, you are less susceptible to overfitting. Now I'd like to understand the dependency between these. If we go back to the experiment we had, that was just one instance, where the target complexity is 10-- I used a 10th-order polynomial, so Q_f is 10. The noise is whatever the distance between the points and the curve is; that's what sigma squared captures. And the data size is 15: I have 15 data points. So that is one instance, and I'm going to generate random instances at will, in order to see whether the observation of overfitting persists.

Now, how am I going to measure overfitting? I'm going to define an overfit measure, which is a pretty simple one. We're fitting a data set from (x_1, y_1) to (x_N, y_N), and we are using our usual two models-- nothing changed: either 2nd-order polynomials or 10th-order polynomials. If going from the 2nd-order polynomial to the 10th-order polynomial gets us into trouble, then we are overfitting, and we would like to quantify that. When you compare the out-of-sample errors of the two models, you have a final hypothesis from H_2-- the green curve that you have seen-- and another final hypothesis from the other model, the red curve, the wiggly guy. To define an overfit measure based on the two, you take the out-of-sample error of the more complex guy, minus the out-of-sample error of the simpler guy. Why is this an overfit measure? Because if the more complex guy is worse, its out-of-sample error is bigger, and you get a positive number-- a large positive number if the overfitting is terrible. If the measure is negative, the more complex guy is actually doing better, so you are not overfitting. Zero means they are the same. So now I have a number in mind that measures the level of overfitting in any particular setting. And if you apply this to the same case we had before, you look here, and the out-of-sample error of the red curve is terrible. The out-of-sample error of the green curve is nothing to be proud of, but it is definitely better. So the overfit measure in this case is positive, and we have overfitting.
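If you want to simulate the experiment, here is a hedged sketch of a single run (the test-set size, the seed, and the uniform input distribution on [-1, 1] are assumptions; the rest follows the parameters just described):

```python
# One run: generate a random Q_f-th-order target as a combination of
# Legendre polynomials with energy normalized to 1, add noise of
# variance sigma2, fit both models, and compute the overfit measure.
import numpy as np
from numpy.polynomial import legendre as L

def one_run(Q_f=20, N=100, sigma2=0.1, rng=np.random.default_rng()):
    a = rng.standard_normal(Q_f + 1)         # random Legendre coefficients
    # normalize the target energy to 1, using orthogonality:
    # E[P_q(x)^2] = 1 / (2q + 1) for x uniform on [-1, 1]
    a /= np.sqrt(np.sum(a**2 / (2 * np.arange(Q_f + 1) + 1)))
    f = lambda x: L.legval(x, a)

    x = rng.uniform(-1, 1, N)                # the noisy data set
    y = f(x) + np.sqrt(sigma2) * rng.standard_normal(N)

    x_test = rng.uniform(-1, 1, 10_000)
    E_out = {}
    for deg in (2, 10):                      # fit both models
        w = np.polyfit(x, y, deg)
        E_out[deg] = np.mean((np.polyval(w, x_test) - f(x_test))**2)
    return E_out[10] - E_out[2]              # the overfit measure

# averaging one_run over many random instances, for a grid of
# (Q_f, N, sigma2) values, reproduces the colored plots discussed next
```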
Now let's look at the result of running this for tens of millions of iterations-- not epochs, but complete runs: generate the target, generate the data set, fit both models, compute the overfit measure. Repeat 10 million times, for all kinds of parameters, so you get a pattern for what is going on. This is what you get.

First, the impact of sigma squared. I'm going to show a plot with N, the number of examples, on one axis, and the level of noise, sigma squared, on the other, and the color at each point indicates the intensity of the overfitting for that number of points and that noise level. This is what you get. First, the color convention: 0 is green; redder means more overfitting; bluer means less overfitting. I looked at the number of examples and picked an interesting range: this axis goes through 80, 100, and 120 points. So what happens at 40 points? Everything is dark red-- terrible overfitting. And if you go beyond this range, you have enough examples not to overfit, so it's almost all blue. I'm just showing you the transition region. Now look at the noise level: as I increase it, overfitting worsens. Why? Pick any number of examples, say 100. If I have 100 points and this little noise, I'm doing fine-- fine in terms of not overfitting. As the noise grows, I get into the red region, and then deep into the red region. So this tells me that overfitting indeed worsens with sigma squared. By the way, for all the targets in this plot, I picked a fixed complexity: 20-- a 20th-order polynomial. I fixed it because I wanted only to relate the noise to the overfitting. When I change the complexity, that will be the other plot. For this one, we get something nice, and it's really according to what we expect: as you increase the number of points, overfitting goes down; as you increase the level of noise, overfitting goes up.

Now let's go for the impact of Q_f, because that was the mysterious part: there was no noise, and we were still getting overfitting. Is this going to persist? What is the deal? This is what you get. Here, we fixed the level of noise at sigma squared equals 0.1, and we increase the target complexity from trivial up to a 100th-order polynomial-- a pretty serious guy. We plot the same range of the number of points, from 80 to 120, because that's where it happens. And you can see that overfitting occurs significantly, and it also worsens with the target complexity. Say you look at this column: you start in the green, it gets red, and then darker red. Not as pronounced as in the previous case, but you do get more overfitting by increasing the target complexity. And when the number of examples is bigger, there's less overfitting, as you expect. But if you go high enough in complexity-- it's getting lighter blue, green, yellow-- eventually it will get to red. Looking at these two plots, the main observation is that the red region is serious. Overfitting is real and here to stay, and we have to deal with it. It's not just an isolated case here or there.
Now, there are two things you can derive from these two figures. The first is that there seems to be another factor, other than conventional noise-- let's call it conventional noise for the moment-- that affects overfitting, and we want to characterize it. The second thing we derive is a nice logo for the course! That's where it came from.

So now let's look at noise, and at the impact of noise. You will notice that noise is between quotation marks here, because we're now going to expand our horizon about what constitutes noise. Here are the two cases. The first case we are now going to call stochastic noise. The noise is indeed stochastic, but obviously we are calling it that because the other one will not be stochastic. There's absolutely nothing to add here-- this is what we expect; we're just giving it a name. Now, whatever effect comes from having a more complex target, we are also going to call noise-- but deterministic noise, because there is nothing stochastic about it. There's a particular target function; I just cannot capture it, so it looks like noise to me. And we would like to understand what deterministic noise is about.

Speaking now in terms of stochastic noise and deterministic noise, let's see what affects overfitting, and put it in a box. First observation: if I have more points, I have less overfitting-- if you move from here to here, things get bluer, and the same in the other plot. Second: if I increase the stochastic noise-- the energy in the stochastic noise-- the overfitting goes up. Indeed, moving from here to here, things get redder. And finally, with deterministic noise, which is vaguely associated in my mind with increasing target complexity, I also increase the overfitting: moving from here to here, things get redder. Albeit I have to travel further, and it's a bit more subtle, but the direction is that I get more overfitting as I get more deterministic noise, whatever that might be.

So now let's spend some time analyzing what deterministic noise is, and why it affects overfitting the way it does. Let's start with the definition. If I ask you what stochastic noise is, you will say: here's my target, and there is something on top of it-- that is the stochastic noise. Deterministic noise is the same thing, except that it captures something deterministic: it is the part of the target that your hypothesis set cannot capture. Let's look at the picture. This is your target, the blue curve. You take a hypothesis set-- let's say a simple one-- and you look for the hypothesis that best approximates f. Not in the learning sense: you actually try as hard as possible to find the best approximation. You're still not going to get f, because your hypothesis set is limited, but the best hypothesis will be sitting there, and it will fail to capture a certain part of the target. That part is what we label the deterministic noise. And if you think of it from an operational point of view-- if you are that hypothesis-- noise is all the same: something I cannot capture. Whether I couldn't capture it because there's nothing to capture, as with stochastic noise, or because I'm limited in my ability to capture, I have to consider it out of my league.
Both of them are noise, as far as I'm concerned-- something I cannot deal with. This is how we define it. Then we ask: why are we calling it noise? It's a little bit of a philosophical issue, but let's say that your young sibling-- your kid brother-- has just learned fractions. They used to have just 1, 2, 3, 4, 5, 6; they're not even into negative numbers. They learn fractions, and now they're very excited-- they realize that there's more to numbers than just 1, 2, 3. You are the big brother, the big Caltech guy, so you must know more about numbers, and they come and ask you: tell me more about numbers. Now, you probably can explain negative numbers to them, a little bit, in terms of deficiency. Real numbers-- just intuitively, as a continuum; you are not going to tell them about limits or anything like that, they're too young for that. But you are probably not going to tell them about complex numbers, are you? Their hypothesis set is so limited that complex numbers, for them, would be complete noise. And the problem with explaining something that people cannot capture is that they will create a pattern that really doesn't exist. You tell them about complex numbers; they really can't comprehend them, but they latch onto the notion. So now it's noise: they fit the noise, and they come back and ask, is 7.34521 a complex number? In their minds, they have simply gone off on a tangent. You're better off leaving that part out, and giving them something simple that they can actually learn, because the additional part will mislead them. Mislead them-- as in noise.

So this is the idea: if there is a part of the target that my hypothesis set cannot capture, there's no point in trying to capture it, because in trying, you detect a false pattern that you cannot extrapolate, given your limitations. That's why it's called noise. Now, the main differences between deterministic noise and stochastic noise-- both of which can be plotted as a realization-- are the following. First, deterministic noise depends on your hypothesis set: for the same target function, a more sophisticated hypothesis set has smaller deterministic noise, because it captures more. The stochastic noise, obviously, stays the same-- nothing can capture it, so all hypothesis sets are alike in that regard. Second, for a given point x, the deterministic noise is a fixed amount-- the difference between the value of the target at that point and the best approximation your hypothesis set offers-- whereas stochastic noise is generated at random: give me two occurrences of the same x, and the stochastic noise changes from one to the other, while the deterministic noise stays the same. Nonetheless, they behave exactly the same way for machine learning, because invariably we have a given data set. Nobody changes the x's on us and hands us another realization; we just have the x's, given to us together with the labels. And we settle on a hypothesis set. Once you settle on a hypothesis set, the deterministic noise is as bad as the stochastic noise: it's something we cannot capture, and it depends only on things that are already fixed. So in a given learning situation, they behave the same.
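A hedged illustration of the definition, with an assumed target sin(pi x) and 2nd-order polynomials as the hypothesis set (the best hypothesis h* is approximated by least squares on a dense grid over the input space):

```python
# Deterministic noise: the part of the target that the best hypothesis
# h* in H (here, 2nd-order polynomials) cannot capture.
import numpy as np

f = lambda x: np.sin(np.pi * x)              # assumed target outside H
x = np.linspace(-1, 1, 10_001)               # dense grid ~ whole input space

w_star = np.polyfit(x, f(x), 2)              # h*: best 2nd-order approximation
det_noise = f(x) - np.polyval(w_star, x)     # deterministic noise, pointwise
print(np.mean(det_noise**2))                 # its energy: the bias term
```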
Now, let's see the impact on overfitting. This is what we have seen before: the case where we have increasing target complexity-- so increasing deterministic noise, in the terminology we just introduced-- against the number of points, with red meaning overfitting. We are looking at deterministic noise as it relates to the target complexity, because the target complexity is the quantitative handle we have. We defined what a realization of deterministic noise is, but it's not yet clear what quantity we should measure from it to tell us the level of noise that results in overfitting. In the case of stochastic noise, we had that quantity very easily: we just take its energy. So here we observe that as you increase the target complexity, the deterministic noise increases, and the overfitting we observe increases with it. But notice something interesting: it doesn't start until you get to 10. Why? Because this is overfitting of what-- the 10th order versus the 2nd order. So for deterministic noise to appear, you'd better go above 10, so that there is something the model cannot approximate. That is the region where it's present. So I wouldn't say the overfitting is proportional to the target complexity, but it definitely increases with it, and it decreases with N, as we expect.

Now, for finite N, you suffer from deterministic noise the same way you suffer from stochastic noise. But wait-- we have declared that deterministic noise is the part that your hypothesis set cannot capture. So what is the problem? If I cannot capture it, it shouldn't hurt me; when I try to fit, I won't capture it anyway. No: you cannot capture it in its entirety. But if I give you only a finite sample, you get only a few points, and you may be able to capture a little bit of the stochastic noise-- or the deterministic noise, in this case. Let me remind you of the example we gave in linear regression. We took linear regression and said: suppose we are learning a linear function, so linear regression would be perfect in this case. This is the target. Then we added noise to the examples, so instead of the points falling perfectly on the line, they land off it. Then we used linear regression to fit. Without any noise, linear regression would be perfect here. But since there's noise, and the algorithm doesn't really see the line-- it only sees those points-- it eats a little bit into the noise, and therefore deviates from the target. And that is why you get worse performance than without the noise. Now, if I have 10 points, linear regression has an easy time eating into the noise, because there isn't much to fit-- there are only 10 points, and maybe there's some spurious linear pattern in them. If I have a million points, chances are I won't be able to fit the noise at all: it points all over the place, I cannot find a compromise using my few parameters, and I end up essentially unaffected by it. In the infinite case, I cannot capture anything-- it's noise, and it's beyond my ability. But once you have a finite sample, you're given the unfortunate ability to fit the noise, and you will indeed fit it-- whether it's stochastic, where fitting makes no sense, or deterministic, where there is no point in fitting, because within your hypothesis set there is no way to generalize it out-of-sample. So the problem is that, for finite N, you get to try to fit the noise, both stochastic and deterministic.
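To make the linear-regression reminder concrete, here is a tiny sketch (the slope, intercept, and noise level are assumed):

```python
# With few points, the fit eats into the noise and deviates from the
# target; with many points, the noise averages out and it cannot.
import numpy as np

rng = np.random.default_rng(2)
for N in (10, 1_000_000):
    x = rng.uniform(-1, 1, N)
    y = 2 * x + 1 + 0.5 * rng.standard_normal(N)   # linear target plus noise
    slope, intercept = np.polyfit(x, y, 1)
    print(N, slope, intercept)   # approaches the true (2, 1) as N grows
```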
Now, let me go quickly through a quantitative analysis that puts deterministic noise and stochastic noise in the same equation, so that they become clear. Remember bias-variance? That was a few lectures ago. What was it about? We had a decomposition of the expected out-of-sample error into two terms. This is the expected value of the out-of-sample error: we take the hypothesis we get-- which depends on the data set that got us there-- compare it to the target function, and take the expected value with respect to all of that. It decomposed into a variance, which tells me how far I am from the centroid within the hypothesis set-- meaning there's a variety of hypotheses I can end up with, depending on D-- and a bias, which tells me how far the centroid is from the target. And the leap of faith we made is that this centroid-- the average hypothesis over all data sets-- is about the same as the best hypothesis in the hypothesis set. So we had that, and in that analysis, f was noiseless.

Now I'd like to add noise to the target, and see how the decomposition goes, because this will give us very good insight into the role of stochastic noise versus deterministic noise. So we add noise, and we write it in red, because we want to pay attention to it, and because we are going to take expected values with respect to it. So y is now the realization: the target plus epsilon. I'm going to assume that the expected value of the noise is 0-- again, if the expected value were something else, we would absorb it into the target, and leave the pure fluctuation outside and call that epsilon. Now I'd like to repeat the analysis-- more quickly, obviously-- with the added noise. Here is the noise term. First, this is what we started with: I'm comparing the hypothesis you get, in a particular learning situation, to the target. But now the target is noisy, so the first step is to replace this fellow by its noisy version, which is y. I know that y is f of x plus the noise; that's what I'm comparing against. And because y depends on the noise, I'm not only averaging with respect to the data set; I'm also averaging with respect to the realization of the noise. So I take the expected value with respect to D and epsilon-- epsilon entering through y. You expand this-- this is just rewriting: f of x plus epsilon is y, so I write it this way. And we do the same thing we did before, just carrying this extra term along until we see where it goes. What did we do before? We added and subtracted the centroid-- the average hypothesis, remember-- in preparation for getting squared terms and cross terms. And now we have epsilon added to the mix. Then we write it out: we group these into one square, those two into another square, and this one by itself into a third square. We still have cross terms-- indeed more of them than before, because epsilon is in the mix-- but the squares are the terms I'm going to focus on.
The good news is that, if you take the expected value of the cross terms, all of them go to 0. The ones that used to go to 0 still do, and the new ones go to 0 because the expected value of epsilon is 0, and epsilon is independent of the other random quantity here, the data set: the data set is generated with its own noise, while this epsilon is generated on the test point x, independently-- so you get 0. It's very easy to argue this, and you get essentially the same decomposition as before, with one fellow added.

So let's look at it. There are actually two noise terms that come up. This is the variance term. This is the bias term. And this is the added term, which is just sigma squared, the energy of the noise. Let me discuss this a little. We had the expected value with respect to D and with respect to epsilon. And remember that we also take the expected value with respect to x, averaging over the whole input space, in order to get the bias and variance proper, rather than their values at a particular test point. I have done that already. So every expectation is with respect to the data set, the input point, and the realization of the noise epsilon-- but in each term I keep only the subscripts that survive. Epsilon doesn't appear in the first term, so the term is constant with respect to it and I drop it. In the second term, neither epsilon nor D appears. In the third, D doesn't appear, but epsilon and x do. I could use the more elaborate notation everywhere, but I wanted to keep it simple.

Now, look at this decomposition. We move from your hypothesis to the centroid, from the centroid to the target proper, and from the target proper to the actual output, which has a noise component. It's the same idea of approximating something in steps. The last quantity is patently the stochastic noise. The interesting thing is that there is another term corresponding to the deterministic noise, and that is this fellow-- otherwise known as the bias. Why? Because our leap of faith told us that the average hypothesis is about the same as the best hypothesis. So this term measures how well the best hypothesis can approximate f, which is exactly the energy of the deterministic noise. That is why it's the deterministic noise term. And putting it this way gives you solid ground for treating the two noises the same.

If you increase the number of examples, you may get a better variance-- with more examples, you don't float around fitting all of them, so the red region, which used to be the variance, shrinks and shrinks. The two noise terms, however, are inevitable: there is nothing you can do about the stochastic one, and nothing you can do about the deterministic one, given a hypothesis set. So these are fixed. Remember that in the bias-variance analysis, the approximation was overall approximation: we took the entire target function and the entire hypothesis, without looking at particular data points. That is why these terms are inevitable-- you tell me the hypothesis set, and that bias is the best I can do; and as far as the noise goes, the best I can do is simply not to predict anything in it. However, both the deterministic noise and the stochastic noise have a finite-sample version on the data points, and the algorithm will try to fit them. That's why the final hypothesis varies so much: depending on the particular fit to those, you get one hypothesis or another. So the noise terms affect the variance, by making the fit more susceptible to going to more places-- I go this way or that way, not because the target function I want to learn indicates it, but because there is noise in the sample that I am blindly following, since I can't distinguish noise from signal. I end up with more variety, worse variance, and overfitting.
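In symbols, the decomposition just described reads as follows (a reconstruction from the verbal walkthrough, with the expectation over the input x then taken over the whole space):

```latex
\mathbb{E}_{\mathcal{D},\epsilon}\!\left[\big(g^{(\mathcal{D})}(x) - y(x)\big)^2\right]
= \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\big(g^{(\mathcal{D})}(x) - \bar{g}(x)\big)^2\right]}_{\text{variance}}
+ \underbrace{\big(\bar{g}(x) - f(x)\big)^2}_{\text{bias (deterministic noise)}}
+ \underbrace{\mathbb{E}_{\epsilon}\!\left[\epsilon(x)^2\right]}_{\sigma^2 \text{ (stochastic noise)}}
```

where y(x) = f(x) + epsilon(x), and g-bar is the centroid-- the average hypothesis over data sets.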
Now, very briefly, let me give you a lead into the next two lectures. We understand what overfitting is, and that it's due to noise. And we understand that noise is in the eye of the beholder, so to speak: there is stochastic noise, and there's another noise which is not really noise, but depends on which hypothesis set is looking at it-- it looks like noise to some and not to others-- and we call that deterministic noise. We saw experimentally that it affects overfitting. So how do we deal with overfitting? We want to avoid it: we don't want to spend more energy fitting and end up with worse out-of-sample error, whether through the choice of a model, or through optimizing within a model, as we did with neural networks. There are two cures. One of them is called regularization, and it is best described as putting on the brakes: you keep going, going, going, and you hurt yourself-- so all I do is make sure that you don't go all the way, and by doing that, I avoid overfitting. The other is called validation. What is the cure in that case? You check the bottom line, and make sure that you don't overfit. It's a different philosophy. The reason I'm overfitting is that I'm going for E_in, minimizing it all the way. Validation says: no, wait a minute-- E_in is not a very good indication of what happens. Maybe there's another way to tell what is actually happening out-of-sample, and thereby avoid overfitting, because you can check the real quantity you care about. So these are the two approaches.

I'll give you just an appetizer-- a very short appetizer for putting on the brakes, the regularization part, which is the subject of the next lecture. Remember this curve? That's what we started with: five points, a 4th-order polynomial, we fit, and we ended up in trouble. We can describe this as a free fit-- fit all you can. Five points, I'll take a 4th-order polynomial, go for it, and that's what happens. Putting on the brakes means you don't allow yourself to go all the way: you have a restrained fit. The reason I'm showing this is that it's fairly dramatic. The free-fit curve is so incredibly bad that you would think you need to do something drastic to avoid it. But here is what I'm going to do: I'll let you fit-- indeed, I'll give you the privilege of fitting with a 4th-order polynomial-- but I'm going to prevent you from fitting the points perfectly. I'm going to put in some friction, so that you cannot get exactly to the points. And the amount of brakes I'm going to apply is so minimal, it's laughable. When you go for your car service, they measure the brakes, and they tell you the brakes are at 70%, et cetera, and when they get to 40%, they tell you that you need to do something about them. The brakes here are at about 1%. If this were a car, you would hit the brakes here and come to a stop in Glendale! It's completely ridiculous. But that laughably small amount of braking results in this. Totally dramatic. A fantastic fit.
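Here is a hedged sketch of such a restrained fit, reusing the five points from the earlier sketch with a tiny weight-decay penalty (the penalty form and the value of lambda are assumptions; the lecture says only that the brakes are minimal):

```python
# The same 4th-order fit, but with a small ridge (weight-decay) term:
# minimize ||Zw - y||^2 + lam * ||w||^2 instead of the free fit.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2
x = rng.uniform(-1, 1, 5)
y = f(x) + 0.05 * rng.standard_normal(5)          # the five noisy points

Z = np.vander(x, 5)                               # 4th-order polynomial features
lam = 0.01                                        # the laughably small brakes
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(5), Z.T @ y)
# w_reg no longer interpolates the five points exactly, and the resulting
# curve typically tracks the parabola far better out-of-sample
```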
When you go for your car service, they measure the brakes, and they tell you, oh, the brakes are at 70%, et cetera, and when it gets to 40%, they tell you you need to do something about the brakes. The brakes here are at about 1%. So if this was a car, you would be braking here, and you would be stopping in Glendale! It's completely ridiculous. But that little amount of brake will result in this. Totally dramatic. Fantastic fit. The red curve is a 4th-order polynomial, but we didn't allow it to fit all the way. And you can see that it's not fitting all the way, because it really is not getting the points right. It's getting there, but not exactly. So we don't have to do much to prevent the overfitting. But we need to understand what regularization is, how to choose it, et cetera. And this we'll talk about next time. And then the time after that, we're going to talk about validation, which is the other prescription. I will stop here, and we will take questions after a short break. Let's start the Q&A, and we'll start with a question in house. STUDENT: In the previous lecture, we spoke about stochastic gradient descent, and we said that we should pick points one by one, and move in the direction of the gradient of the error on that point. PROFESSOR: The negative of the gradient, yes. STUDENT: So the question is, how important is it to choose the points randomly? I mean, can we choose them just from the list-- first point, second point, and so on? PROFESSOR: Yeah. Depending on the runs, it could be no difference at all, or it could be a real difference. And the best way to think of randomization in this case is that it's an insurance policy. There may be something about the pattern that is detrimental in a particular case. You are always safe by picking the points at random, because there's no chance that the random choice will have a pattern eventually, if you keep doing it. So in many cases, you just run through examples 1 through N, 1 through N, 1 through N, and you will be fine. In some cases, you take a random permutation. In some cases even, you stay true to picking the points at random, and you hope that the representation of each point will be the same in the long run. In my own experience, there is little difference in a typical case. Every now and then, there's a funny case. And therefore, you are safer using the stochastic presentation-- the random presentation of the examples-- in order not to fall into the trap in those cases. Yeah. There's another question in house. STUDENT: Hi, Professor. I have a question about slide 4. It's about neural networks. I don't understand-- how do you draw the out-of-sample error on that plot? PROFESSOR: OK. In general, you cannot, obviously, draw the out-of-sample error. If you could draw it, you would just pick its minimum. This is a case where I give you a data set, and you decide to set aside part of the data set for testing. So you are not involving it at all in the training. And what you do is, you go about your training, and at the end of every epoch, when you evaluate the in-sample error on the entire batch, which is the green curve here, you also evaluate, for that set of weights-- the frozen weights at the end of the epoch-- the error on the test set, and you get a point. And because that point is not involved in the training, it becomes an out-of-sample point, and that gives the red point. And you go down. Now, there's an interesting tricky point here, because you might decide at some point: let me look at the red curve.
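A minimal sketch of the mechanics just described-- reshuffling the presentation order every epoch, and evaluating a held-out test set on the frozen weights at the end of each epoch. The linear model, squared error, and learning rate here are assumptions for illustration:

```python
import numpy as np

def sgd_with_test_curve(X, y, X_test, y_test, lr=0.1, epochs=100, seed=0):
    """Stochastic gradient descent on squared error for a linear model.
    The example order is reshuffled each epoch (the 'insurance policy'),
    and the test error is recorded on the frozen weights at each epoch's end."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    test_curve = []
    for _ in range(epochs):
        for i in rng.permutation(len(X)):        # random presentation order
            grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of the error on one point
            w -= lr * grad                       # move along the negative gradient
        # The test set is never involved in training, so this traces the red curve.
        test_curve.append(np.mean((X_test @ w - y_test) ** 2))
    return w, test_curve
```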
Now I am going to stop where the red curve is at its minimum. STUDENT: Yes. PROFESSOR: OK? Now at that point, the set that used to be a test set is no longer a test set, because now it has been involved in a decision regarding the training. It becomes slightly contaminated-- it becomes a validation set, which we're going to talk about when we talk about validation. But that is really the premise. STUDENT: OK. I understand. Also, can I-- slide 16? PROFESSOR: Slide 16. STUDENT: I didn't follow that. Why are the two noises the same, for the same learning problem? PROFESSOR: They're the same in the sense that they are part of the output that I'm being given, or that I'm trying to predict, and that part I cannot predict regardless of what I do. In the case of stochastic noise, it's obvious. There's nothing to predict there, so whatever I do, I miss it. In the case here, it's particular to the hypothesis set that I have. So I take a hypothesis set, and, in a non-learning scenario, I look at the target function and choose my best hypothesis. I say, this is my best hypothesis, which we called h star. If you look at the difference between h star and f, that difference is a part which I cannot capture, because the best I could do is h star. So the remaining part is what I'm referring to as deterministic noise, and it is beyond my ability given my hypothesis set. So that's why they are the same-- the same in the sense of being unreachable as far as my resources are concerned. STUDENT: OK. In a real problem, do we know the complexity of the target function? PROFESSOR: In general, no. We also don't know the particulars of the noise. We know that the problem is noisy, but we cannot identify the noise. We cannot, in most cases, even measure the noise. So the purpose here is to understand that, even in the case of a noiseless target in the conventional sense, there is something that we can identify-- conceptually identify-- that does affect the overfitting. And even if we don't know the particulars of it, we will have to put in place the guards, in order to avoid overfitting. That was the goal here, rather than trying to-- Any time you see the target function drawn, you should immediately have an alarm bell that this is conceptual, because you never actually see the target function in a real learning situation. STUDENT: Oh. So that's why the two noises are equal, then. Because we don't know the target function, so we don't know which part is deterministic. PROFESSOR: Yeah. If I knew the target, and if I knew the noise, then the situation would be good, but then I wouldn't need machine learning. I would already have what I want. STUDENT: Thank you. PROFESSOR: So shall we go to the questions from the outside? MODERATOR: Yeah. A quick conceptual question. Is it OK to say that the deterministic noise is the part of reality that is too complex to be modeled? PROFESSOR: It is definitely that part of the reality. And basically, it's our failure to model it that made it noise, as far as we are concerned. So obviously you can, in some sense, model it by going to a bigger hypothesis set. The bigger hypothesis set will have an h star closer to the target, and therefore the difference will be smaller. But the situation pertains to the case where you already chose the hypothesis set according to the prescriptions of VC dimension, number of examples, and other considerations. And given that hypothesis set, you already concede that even if the target is noiseless, there is a part of it which behaves as noise, as far as I'm concerned.
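A minimal sketch of that idea: for a fixed hypothesis set, the energy of the deterministic noise can be estimated by finding the best hypothesis against the target directly, in a non-learning scenario. The target below is an assumption for illustration, and a least-squares fit over a dense grid stands in for h star:

```python
import numpy as np

# A stand-in "complex" target (an assumption for illustration).
x = np.linspace(-1, 1, 10001)            # dense grid approximates E_x[.]
f = np.sin(8 * x) + 0.5 * np.cos(20 * x)

# h*: the best H_2 hypothesis (2nd-order polynomial), fit against the
# target itself -- no data set involved, so this is pure approximation.
Z = np.vander(x, N=3, increasing=True)   # columns 1, x, x^2
w_star = np.linalg.lstsq(Z, f, rcond=None)[0]
h_star = Z @ w_star

# Energy of the deterministic noise: E_x[(h*(x) - f(x))^2].
print("deterministic noise energy:", np.mean((h_star - f) ** 2))
```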
And I will have to treat it as such, when I consider overfitting and the other considerations. MODERATOR: Also, is it fair to say that over-training will cause overfitting? PROFESSOR: I think they probably are synonymous. If I try to give it a definition, over-training would be overfitting within the same model. That is, you already settled on the model, and you're over-training it. The case of a neural network would be over-training. The case of choosing the 3rd-order polynomial versus the 4th-order polynomial would not really be over-training, but it would be overfitting. It's all technicalities, but just to answer the question. MODERATOR: Practically, when you implement these algorithms, there's also some approximation error, maybe due to floating-point numbers or something. So is this another source of error? Does it produce overfitting? Or is it-- PROFESSOR: It's-- Formally speaking, yes, it's another source. But it is so minute with respect to the other guys that it's never mentioned. We have another in-house question. STUDENT: A couple of lectures ago, we spoke about the third linear model, which is logistic regression. PROFESSOR: You said the third linear model? STUDENT: Yes. So the question is: suppose I have data which is completely linearly separable. So some points are marked -1, and some are +1, and there is a plane which separates them. Is it true that, applying this learning model, you never get stuck in a local minimum, and you get 0 in-sample error? PROFESSOR: OK. This is a very specific question about logistic regression. If the data is completely clean, then you obviously can get closer and closer to having the probability be perfect, by having bigger and bigger weights. So there is a minimum. And again, it's a unique minimum. Except that the minimum is at infinity, in terms of the size of the weights. But this doesn't bother you, because you are really going to stop at some point, when the gradient is small according to your specification. And you can specify this any way you want. So the goal is not necessarily to arrive at the minimum-- which hardly ever happens, even if the minimum is not at infinity-- but to get close enough, in the sense that the value is close to the minimum, and therefore you achieve the small error that you want. MODERATOR: Can you resolve again the apparent contradiction: when you increase the complexity of the model, you should be reducing your bias, and hence your deterministic noise? But here we had an example where H_10 had more error than H_2. PROFESSOR: H_10 had more total error than H_2. If we were doing the approximation game, H_10 would be better. We had three terms in the bias-variance decomposition. If we were only going by these two, then there is no question that the bigger model, H_10, will win. Because this one is the same for all models, and this one will be better for H_10 than for H_2, because H_10 is closer to the target we want, and therefore we will be making a smaller error. This is not the source of the problem of overfitting. This is just identifying terms in the bias-variance decomposition-- the bias-variance-noise decomposition in this case-- that correspond to the different types of noise. The problem of overfitting happens here. And that happens because of the finite-sample version of both. That is, I get N points in which there is a contribution of noise coming from the stochastic noise and from the deterministic noise.
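A minimal sketch of that finite-sample effect, under assumed data and target: fit H_2 and H_10 to N noisy points and compare E_in with E_out measured against the noiseless target:

```python
import numpy as np

rng = np.random.default_rng(2)

def target(t):
    # A stand-in target function (an assumption for illustration).
    return np.sin(np.pi * t)

N = 15                                        # a small, noisy sample
x = rng.uniform(-1, 1, N)
y = target(x) + 0.5 * rng.standard_normal(N)  # stochastic noise on the data points

x_test = np.linspace(-1, 1, 1000)
for degree in (2, 10):                        # H_2 versus H_10
    w = np.polyfit(x, y, degree)              # least-squares polynomial fit
    E_in = np.mean((np.polyval(w, x) - y) ** 2)
    E_out = np.mean((np.polyval(w, x_test) - target(x_test)) ** 2)
    print(f"H_{degree}: E_in = {E_in:.3f}  E_out = {E_out:.3f}")
# Typically H_10 achieves the smaller E_in but the much larger E_out:
# it is using its extra resources to fit the noise on the sample.
```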
On those points, the algorithm will try to fit that noise, in spite of the fact that, if it knew, it wouldn't, because it would know that those parts are out of reach. But it gets a finite sample, and it can use its resources to try to fit part of that noise, and that is what is causing the overfitting. And that ends up being harmful-- so harmful in the H_10 case that the harm offsets the fact that I'm closer to the target function. That doesn't help me very much. Because, same thing we said before: let's say there's H_10, and the target function is sitting here. That doesn't do me much good if my algorithm, and the distraction of the noise, leads me to go in that direction. I will be further from the target function than another guy who, working only with this, remained in its confines and ended up being closer to the target function. It's the variance term that results in overfitting, not this guy, in spite of the fact that these guys contain both types of noise contributing to their value. But their value is static. It doesn't change with N, and it has nothing to do with the overfitting aspect. MODERATOR: In the case of polynomial fitting, a way to avoid the overfitting could be to use piecewise linear functions around each point. So is that a method of regularization? Or is it-- PROFESSOR: OK. It depends on the number of degrees of freedom you have. You can have a piecewise linear fit which is really horrible-- it depends on how many pieces. If you have as many pieces as there are points, you can see what the problem is. So the real question is, what is the VC dimension of your model? If it's piecewise linear, and I have only four parameters, then I don't worry too much that it's piecewise linear. I only worry about the four-parameters aspect of it. The 10th-order polynomial was bad because of the 11 parameters, not because of any other factor. But anything you do to restrict your model, in terms of the fitting, can be called regularization. And there are some good methods and bad methods, but they are all regularization, in the sense of putting the brakes. MODERATOR: A practical question: how do you usually get the profile of the out-of-sample error? Do you sacrifice points, or-- PROFESSOR: OK. This is obviously a good question. When we talk about validation-- validation has an impact on overfitting. It's used to deal with that. But it's also used in model selection in general. And because of that, it's very tempting to say, I'm going to use validation, and I'm going to set aside a number of points. But obviously, the problem is that when you set aside a number of points, you deprive yourself of a resource that you could have used for training, in order to arrive at a better hypothesis. So there's a tradeoff, and we'll discuss that tradeoff in very specific terms, and find ways to get around it, like cross-validation. But this will be the subject of the lecture on validation, coming up soon. MODERATOR: In the example of the color plots, here the order of the polynomial is a good indication of the VC dimension, right? PROFESSOR: These are the plots. What is the question? MODERATOR: Here, Q_f is directly related to the VC dimension, right? PROFESSOR: The target complexity has nothing to do with the VC dimension. It's the target. I'm talking about different targets. The VC dimension has to do only with the two fellows we are using. We are using H_2 and H_10-- 2nd-order polynomials and 10th-order polynomials.
So if we take the degrees of freedom as a proxy for the VC dimension, they will have different VC dimensions. And the discrepancy in the VC dimension, given the same number of examples, is the reason why we have a discrepancy in the out-of-sample error. But you also have a discrepancy in the in-sample error. And the case of overfitting is such that the in-sample error is moving in one direction, and the out-of-sample error is moving in the other direction. So the only thing in this plot relevant to the VC dimension is the fact that the two models, H_2 and H_10, have different VC dimensions. MODERATOR: I guess you never really have a measure of the target complexity in practice? PROFESSOR: Correct. This was an illustration. And even in the case of the illustration, where we had explicitly a definition of the target complexity, it wasn't completely clear how to map this into an energy of deterministic noise-- a counterpart for sigma squared here. This one is completely clean, and as you can see, because of that, the plot is very regular. Here, first we defined this in a particular way, in order to be able to run an experiment. Second, in terms of that, it's not clear-- can you tell me what is the energy of the deterministic noise here? There's quite a bit of normalization that was done. When we normalized the target in order to make sigma squared meaningful, we sacrificed something-- the target now is sandwiched within a limited range. And therefore the energy of whatever the deterministic noise is will be limited, regardless of how complex the target is. So there is a compromise we had to make, in order to be able to produce these plots. However, the moral of the story here is that there's something about the target complexity that behaved in the same way, as far as overfitting is concerned, as noise. And we identified it as deterministic noise. We didn't quantify it further. It's possible to quantify it-- you can get the energy for this and that, and you can do it-- but these are research topics. As far as we are concerned, in a real situation, we won't be able to identify either the stochastic noise or the deterministic noise. We just know they're there. We know their impact on overfitting. And we will be able to find methods to cure the overfitting, without knowing all of the specifics that we could possibly know about the noise involved. MODERATOR: Do you ever measure the-- is there some similar kind of measure of the model complexity for the target function? Do you ever use a VC dimension for that? PROFESSOR: Not explicitly. One can apply it. You ask, what is the model that would include the target function? And then, based on the inclusion of the target function, you can say that this is the complexity of that model. The analysis we use is such that the complexity of the target function doesn't come in, in terms of the VC analysis. But there are other approaches, other than the VC analysis, where the target complexity matters. So I didn't particularly spend time trying to capture the complexity of the target function until this moment, where the complexity of the target function could translate to something in the bias-variance decomposition, and that has an impact on overfitting and generalization. MODERATOR: I think that's it. PROFESSOR: We will see you on Thursday.