Lecture 08 - Bias-Variance Tradeoff

ANNOUNCER: The following program is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we finished the VC analysis. And that took us three full lectures. The end result was the definition of the VC dimension of a hypothesis set. It was defined as the most points that the hypothesis set can shatter. And we used the VC dimension in establishing that learning is feasible, on one hand, and then in estimating the example resources that are needed in order to learn. One of the important aspects of the VC analysis is the scope. The VC inequality, and the generalization bound that corresponds to it, describe the generalization ability of the final hypothesis you are going to pick. It describes that in terms of the VC dimension of the hypothesis set, and makes a statement that is true for all but delta of the data sets that you might get. So this is where it applies. And the most important part of the application are the disappearing blocks, because it gives the generality that the VC inequality has. So the VC bound is valid for any learning algorithm, for any input distribution that may take place, and also for any target function that you may be able to learn. So this is the most theoretical part. And then we went into a little bit of a practical part, where we are asking about the utility of the VC dimension in practice. You have a learning problem-- someone comes with a problem, and you would like to know how many examples. What is the size of the data set you need, in order to be able to achieve a certain level of performance. The way we did this analysis is by plotting the core aspect of the delta, the probability of error in the VC bound. And we found that it's behaving regularly. We focused on a certain aspect of these curves, which correspond to different VC dimensions. And the main aspect is below this line. This line designates the probability 1. We want the probability of the bad event to be small, so we are working in this region. And the x-axis here is the number of examples-- the size of your data set. And we don't particularly care about the shape of these guys. They could be a little bit nonlinear, et cetera. But the quantity we are looking for is, if we cut through this way, what is the behavior of the x-axis, the number of examples, in terms of the VC dimension, which is the label for the colored curves? And we realized that, given this analysis, it is very much proportional. And we were able to say that, theoretically, the bound will give us that the number of examples needed would be proportional to the VC dimension, more or less. And although the constant of proportionality, if you go for the bound, will be horrifically pessimistic-- you will end up requiring tens of thousands of examples for something for which you really need only maybe 50 examples-- the good news is that the actual quantity behaves in the same way as the bound. So the number of examples needed is, in practice, as a practical observation, indeed proportional to the VC dimension. And furthermore, as a rule of thumb, in order to get to the interesting part, or interesting delta and epsilon, you need the number of examples to be 10 times the VC dimension. More will be better. Less might work. But the ballpark of it is that you have a factor of 10, in order to start getting interesting generalization properties. 
We ended with summarizing the entire theoretical analysis into a very simple bound, which we are referring to as the generalization bound, that tells us a bound on the out-of-sample performance given the in-sample performance. And that involved adding a term, capital Omega. And Omega captures all the theoretical analysis we had. It's a function of N, function of the hypothesis set through the VC dimension, function of your tolerance for probability of error, which is delta. And although this is a bound, we keep saying that, in reality, E_out will be equal to E_in plus something that behaves like Omega. And we will take advantage of that, when we get a technique like regularization. So that's the end of the VC analysis, which is the biggest part of the theory here. And we are going to switch today to another approach, which is the bias-variance tradeoff. It's a stand-alone theory. It gives us a different angle on generalization. And I am going to cover it, beginning to end, during this lecture. This is the plan. The outline is very simple. We are going to talk about the bias and variance, define them, see the tradeoff, take a very detailed example-- one particular example-- in order to demonstrate what the bias and variance are. And then we are going to introduce a very interesting tool for illustrating learning, which are called learning curves. And we are going to contrast the bias-variance analysis versus the VC analysis on these learning curves, and then apply them to the linear regression case that we are familiar with. So this is the plan. The first part is the bias and variance. In the big picture, we have been trying to characterize a tradeoff. And roughly speaking, the tradeoff is between approximation and generalization. So let me discuss this for a moment, before we put bias and variance into the picture. We would like to get to small E_out. That's the purpose of learning. If E_out is small, then you have learned. You have a hypothesis that approximates the target function well. There are two components to this, and we are very familiar with them now. We are looking for a good approximation of f. That's the approximation part. But we would like that approximation to hold out-of-sample. We are not going to be happy if we approximate f well in-sample, and we behave badly out-of-sample. These are the two components. In the case of a more complex hypothesis set, you are going to have a better chance of approximating f, obviously. I have more hypotheses to run around. I'll be able to find one of them that is closer to the target function I want. The problem is that, if you have the bigger hypothesis set, you are going to have a problem identifying the good hypothesis. That is, if you have fewer hypotheses, you have a better chance of generalization. And one way to look at it is that, I'm trying to approximate the target function. You give me a hypothesis set. Now, if I tell you I have good news for you. The target function is actually in the hypothesis set. You have the perfect approximation under your control. Well, it's under your hand, but not necessarily under your control. Because you still have to navigate through the hypothesis set in order to find the good candidate. And the way you navigate is through the data set. That is your only resource for finding one hypothesis versus the other. So the target function could be sitting there calling for you. Please, I am the target function, come. But you can't see it. 
You're just navigating with the training set, you have very limited resources, and you end up with something that is really bad. Having f in the hypothesis set is great for approximation. But having a big hypothesis set, that is big enough to include that, may be bad news, because you will not be able to get it. Now if you think about it, what is the ideal hypothesis set for learning? If I only had a hypothesis set that has a singleton hypothesis, which happens to be the target function, then I have the best of both worlds. The perfect approximation. I will zoom in directly, because it is only one. Well, you might as well go and buy a lottery ticket. That's the equivalent. We don't know the target function, so we will have to make the hypothesis set big enough to stand a chance. And once we do that, then the question of generalization kicks in. This is this big picture. So let's try to fit the VC analysis in it, and then fit the bias-variance analysis in it, before we even know what the bias-variance analysis is, in order to see where we are going with this. So we are quantifying this tradeoff. And the quantification, in the case of the VC analysis, was what? Was the generalization bound. E_in is approximation. Because I am actually trying to fit the target function-- I am just fitting them on the sample. That's the restriction here. So if I get this well, then I'm approximating f well, at least on some points. This guy is purely generalization. The question is, how do you generalize from in-sample to out-of-sample? So this is a way of quantifying it. Now the bias-variance analysis has another approach. It also decomposes E_out, as you did in the generalization bound. But it does decompose it into two different entities. The first one is an approximation entity, how well H can approximate f. Well, what is the difference then? The difference is that the bias-variance asks the question, how can H approximate f, overall? Not on your sample. In reality. As if you had access to the target function, and you are stuck with this hypothesis set, and you are eagerly looking for which hypothesis best describes the target function. And then you quantify how well that best hypothesis performs, and that is your measure of the approximation ability. Then what is the other component? The other component is exactly what I alluded to. Can you zoom in on it? So this is the best hypothesis, and it has a certain approximation ability. Now I need to pick it, so I have to use the examples in order to zoom in into the hypothesis set, and pick this particular one. Can I zoom in on it? Or do I get something that is a poor approximation of the approximation? And that decomposition will give us the bias-variance. And we'll be able to put them at the end of the lecture, side by side, in order to compare: here is what the VC analysis does, and here is what bias-variance does. The analysis, from a mathematical point of view, applies to real-valued targets. And that's good news. Because remember, in the VC analysis, we were confined to binary functions in the particular analysis that I did. You can extend it, but it's very technical. So it's a good idea to see the same tradeoff, and the same generalization questions, apply to real-valued functions. Now we have regression, and we are able to make a statement about generalization on regression, which we will apply very specifically to linear regression, the model that we already studied that has real-valued outputs. 
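To preview the side-by-side comparison that the end of the lecture builds toward, the two decompositions can be written as

\[
\text{VC analysis:}\quad E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) \;+\; \Omega(N, \mathcal{H}, \delta) \qquad \text{(with probability at least } 1-\delta\text{)},
\]
\[
\text{bias-variance analysis:}\quad \mathbb{E}_D\!\left[E_{\text{out}}\!\left(g^{(D)}\right)\right] \;=\; \text{bias} \;+\; \text{variance},
\]

where the bias and variance terms are defined precisely in what follows.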
And we are going to confine the analysis here to squared error. The reason we are doing this is that, for the math to go through in such a way that these two guys decompose cleanly-- there are no cross terms, we will need the squared error. So this is a restriction of the analysis. There are ways to extend it. They are not as clean, so this is the simplest form that we are going to use. Let's start. Our starting point is E_out. So let me put it-- Don't worry about the gap. The gap here will be filled. What do we have? We have E_out. E_out depends on the hypothesis you pick. E_out is E_out of your final hypothesis. How does it perform on the overall space? And in order to do that, since we are talking about squared error, you are going to take the value of your hypothesis, and compare it to the value of the target function, and take that squared. And that will be your error. So this is the building block for getting the out-of-sample performance. Now the gap here comes from the fact that, if you look at the final hypothesis, the final hypothesis depends on a number of things. Among other things, it does depend on the data set that I'm going to give you, right? Because if I give you a different data set, you'll find a different final hypothesis. That dependency is quite important in the bias-variance analysis. Therefore, I am going to make it explicit in the notation. It has always been there, but I didn't need to carry ugly notation throughout, when I'm not using it. Here I'm using it, so we'll have to live with it. So now I'll make that dependency explicit. I'm having now a superscript, which tells me that this g comes from that particular data set. If you give me another data set, this will be a different g. And you take the same g, apply it to x, compare it to f, and this is your error. And finally, in order for it to be genuinely out-of-sample error, you need to get the expected value of that error over the entire space. So this is what we have. Now what we would like to do, we would like to see a decomposition of this quantity into the two conceptual components, approximation and generalization, that we saw. So here's what we are going to do. We are going to take this quantity, which equals this quantity, as I mentioned here, and then realize that this depends on the particular data set. I would like to rid this from the dependency on the specific data set that I give you. So I'm going to play the following game. I am going to give you a budget of N examples, training examples to learn from. If I give you that budget N, I could generate one D and another D and another D, each of them with N examples. Each of them will result in a different hypothesis g, and each of them will result in a different out-of-sample error. Correct? So if I want to get rid of the dependency on the particular sample that I give you, and just know the behavior-- if I give you N data points, what happens?-- then I would like to integrate D out. So I am going to get the expected value of that error, with respect to D. This is not a quantity that you are going to encounter in any given situation. In any given situation, you have a specific data set to work with. However, if I want to analyze the general behavior-- someone comes to my door, how many examples do you have, and they tell me 100. I haven't seen the examples yet. So it stands to logic that I say, for 100 examples, the following behavior follows. So I must be taking an expected value with respect to all possible realizations of 100 examples. 
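In symbols, the two quantities just described are

\[
E_{\text{out}}\!\left(g^{(D)}\right) \;=\; \mathbb{E}_{x}\!\left[\left(g^{(D)}(x) - f(x)\right)^{2}\right],
\qquad
\mathbb{E}_{D}\!\left[E_{\text{out}}\!\left(g^{(D)}\right)\right] \;=\; \mathbb{E}_{D}\!\left[\mathbb{E}_{x}\!\left[\left(g^{(D)}(x) - f(x)\right)^{2}\right]\right].
\]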
And that is, indeed, what I am going to do. I'm going to get the expected value of that. And this is the quantity that I am going to decompose. And this obviously happens to be the expected value of the other guy, and we have that. Now I am going to take this quantity, the expression for the quantity that I'm interested in, and keep deriving stuff until I get to the decomposition I want. The first order of business, I have two expectations. The first thing I'm going to do, I am going to reverse the order of the expectations. Why can I do that? I am integrating. So now I change the order of integration. I am allowed to do that, because the integrand is strictly non-negative. So I get this. And the reason for that is because I am really interested in the expectation with respect to D, and I'd rather not carry the expectation with respect to x throughout. So I am going to get rid of that expectation for a while, until I get a clean decomposition. And when I get the clean decomposition, I'll go back and get the expectation, just to keep the focus clear. You focus on the inside quantity. If I give you the expression for the inside quantity for any x, then all you need to do in order to get the quantity that you need, is get the expected value of what I said, with respect to x. So this is the quantity that we are going to carry to the next slide. Let's do that. And the main notion, in order to evaluate this quantity, is the notion of an average hypothesis. It's a pretty interesting idea. Here is the idea. You have a hypothesis set, and you are learning from a particular data set. And I am going to define a particular hypothesis. I am going to call it the average hypothesis. And because it's average, I am going to give it a bar notation. So what is this fellow? Well, this fellow is defined as follows. You learn from a data set. You get a hypothesis. Someone else learns from another data set. They get another hypothesis, et cetera. So how about getting the expected value of these hypotheses? What does that formally mean? We have x fixed. So we actually are in a good position, because g of x is really just a random variable at this point. It's a random variable, determined by the choice of your data. The data is the randomization source. x is fixed, so you think I have one test point in the space, that I'm interested in. Maybe you are playing the stock market, and now you are only interested in what's going to happen tomorrow. So you take the inputs, and these are the only inputs you're interested in performing on. That's your x. And all of the questions now pertain to this. You are learning from other data. And then you ask yourself, how am I performing on this point? That is the point x. Now you are looking at this point. And you say, if you give me a data set versus another, I am going to get different values for the hypothesis on that point. It stands to logic that, if I take the average with respect to all possible data sets, that would be awesome. Because now I am getting the benefit of an infinite number of data sets. I am using them in the capacity of one data set at a time. But I am getting value. Maybe the correct value should be here. But since I am getting fluctuations because of the data set, sometimes I'm here, sometimes I'm here, et cetera. If you get the expected value, you will get it right. So this looks like a great quantity to have. And in reality, we will never have that. 
Because if you give me an infinite number of examples, I'm not going to divide them neatly into N and N and N, and learn from these, and then take the average. I'm just going to take all your examples, and learn all through and get the target function almost perfectly. So this is just a conceptual tool for us to do the analysis. But we understand what it is. If you now vary x, your test point in general, then you take that random variable and the expected value of it, and the function that is constituted by the expected values at different points is your g bar. So this is understood. Why do I need this for the analysis? Because if you look at the top thing, I have here squared, so I'm probably going to expand it. And in expanding it, I am just going to get a linear term of this. And I have an expected value. So you can see that I'm going to get something that requires me to define g bar. That's the technical utility here. But the conceptual utility is very important. And if you want to tell someone what g bar is, think that you have many, many data sets. And the game is such that you learn from one data set at a time, and you want to make the most of it after you learn. What do you do? You take votes. You take just the average. You have this. There is 1 over K here, the size of those. So this is the average. Now let's see how we can use g bar, in order to get the decomposition we want. Here, this is again the quantity I'm passing from one slide to another, in order not to forget. This is the quantity that I'd like to decompose. The first thing I am going to do, I am going to make it longer, by the simple trick of adding g bar and subtracting it. I'm allowed to do that, right? Doing that, I am going to consolidate these two terms, and I'm going to consolidate these two terms. And then expand with the squared. So let's do that. You get this. This is the first consolidated guy with the squared. This is the second consolidated guy with the squared. Am I missing something? Yes, I am missing the cross terms. So let's add the cross terms. And I get twice the product. This equals that. So the expected value here applies to the whole thing. The first order of business is to look at the cross terms, because they are annoying, and see if I can get rid of them. That's where the benefit of the squared error comes in. I am getting the expected value with respect to D, right? So this fellow is a constant. Therefore, when I get the expected value of this whole thing, all I need to do is get the expected value of this part, because this one will factor out. Now, if I get the expected value of this, the expected value of the sum is the sum of the expected values-- one of the few universal rules that you can apply, without asking any other questions. So I get the expected value of this. What is the expected value of g^D? Wait a minute. That was g bar, by definition. That's how we defined it, right? So I get g bar, minus the expected value of a constant, which happens to be g bar. So this goes to 0, and happily this guy goes away. Now I have only these two guys. So let's write them. I have the expected value of this whole thing, which again is the sum of the expected values. The first guy is a genuine expected value of these guys. When I apply expected value to this guy, again this is just a constant, so the expected value of it is itself. The second guy I add without bothering with the expected value, because it's just a constant. So this is what I have as the expression for the quantity that I want. 
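Written out, with x fixed, expectations taken over the data set D, and K data sets D_1, ..., D_K used for the averaging, the steps just described are

\[
\bar{g}(x) \;=\; \mathbb{E}_{D}\!\left[g^{(D)}(x)\right] \;\approx\; \frac{1}{K}\sum_{k=1}^{K} g^{(D_k)}(x),
\]
\[
\mathbb{E}_{D}\!\left[\left(g^{(D)}(x) - f(x)\right)^{2}\right]
= \mathbb{E}_{D}\!\left[\left(g^{(D)}(x) - \bar{g}(x)\right)^{2}\right]
+ \left(\bar{g}(x) - f(x)\right)^{2}
+ 2\,\underbrace{\mathbb{E}_{D}\!\left[g^{(D)}(x) - \bar{g}(x)\right]}_{=\,0}\left(\bar{g}(x) - f(x)\right),
\]

and since the cross term vanishes, what remains is

\[
\mathbb{E}_{D}\!\left[\left(g^{(D)}(x) - f(x)\right)^{2}\right]
= \mathbb{E}_{D}\!\left[\left(g^{(D)}(x) - \bar{g}(x)\right)^{2}\right] + \left(\bar{g}(x) - f(x)\right)^{2}.
\]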
Now let's take this and look at it closely, because this will be the bias and variance. This is the quantity again, and it equals this fellow. Now let's look beyond the math, and understand what's going on. This measure-- this quantity-- tells you how far your hypothesis, that you got from learning on a particular data set, differs from the ultimate thing, the target. And we are decomposing this into two steps. The first step is to ask you, how far is your hypothesis that you got from that particular data set, from the best possible you can get using your hypothesis set? Now there is a leap here, because I don't know whether this is the best in the hypothesis set. I got it by the averaging. But since I'm averaging from several data sets, it looks like a pretty good hypothesis. I am not even sure that it's actually in the hypothesis set. It's the average of guys that came from the hypothesis set. But I can definitely construct hypothesis sets, where the average of hypotheses does not necessarily belong there. So there is some funny stuff. But just think of it that this is an intermediate step. Instead of going all the way to the target function, here is your hypothesis set. It restricts your resources. Now I am getting the best possible out of it, based on some formula. I'm learning from an infinite number of data sets. This is a pretty good hypothesis. So how far are you from that hypothesis? That's the first step. The second step is how far that hypothesis, that great hypothesis, is from the ultimate target function. So hopping from your guy to the target goes into a small hop from your guy to the best hypothesis, and another hop from the best hypothesis to the target function. And the neat thing is that they decomposed cleanly. And we found that they decomposed cleanly because the cross term disappeared. That's the advantage of the particular measure that we have. Now we need to give names to these guys. They will be the bias and variance. I'd like you to think for five seconds, and you don't have to even answer the question. Which will be the bias, and which will be the variance? Just look at which would be a better description of them. I'm not going to ask. This is not a quiz, like last time. This is the bias. Why is it the bias? Because what I'm saying is that, learning or no learning, your hypothesis set is biased away from the target function. Because this is the best I could do, under a fictitious scenario. You have infinite data sets, and you are doing all of this, and you're taking the average, and that's the best you could come up with. And you are still far away from the target. So it must be a limitation of your hypothesis set. I'm going to measure that limitation and say that your hypothesis set, which is represented at its best by this guy, is biased away from the target function. So this is the bias term. And again, bias applies to that particular point x, the test point in the input space that I'm interested in. The other guy must be the variance. Why is that? Because if I knew everything, if I could zoom in perfectly, I would zoom in onto the best, assuming this is there, so I have this guy. But you don't. You have one data set at a time. When you get one data set, you get this guy. You get another data set, you get another guy. These are different from that. So you are away from that, and I'm measuring how far you are away. And because the expected value of this fellow is g bar, and I am comparing the squared difference with it, it is properly called variance.
This is the variance of what I am getting, due to the fact that I get a finite data set. Every time I get a data set, I get a different one, and I am measuring the distance from the core that I get here. So this we call the bias, and this we call the variance. This is very clean. Now let's go back, and put it into the original form. Remember this guy? This is where we started. We got the other expression, and then we neglected to take the expected value with respect to x, in order to simplify the analysis. We would like to get that back, so we'll look at this. This was the expected value, with respect to x, of the quantity we just decomposed. Now I take the decomposition and put it back, in order to get the expected value of the out-of-sample error, in terms of the bias and variance. So this will be what? This will be the expected value with respect to x, of bias plus variance with x. And the expected value of the bias with respect to x, I'm just going to call it bias. The expected here, I'm going to call it variance, and that's what you get. And this is the bias-variance decomposition. Now I have a single number that describes the expected out-of-sample. So I give you a full learning situation. I give you a target function, and an input distribution, and a hypothesis set, and a learning algorithm. And you have all the components. You go about, and learn for every data set. And you get-- someone else learned from another data set. And get the expected value of the out-of-sample error. And I'm telling you if this out-of-sample error is 0.3, well, 0.05 of it is because of bias, and 0.25 is because of variance. So 0.05 means that your hypothesis set is pretty good in approximation, but maybe it's too big. Therefore, you have a lot of variance, which is 0.25. This is the decomposition. Now let's look at the tradeoff of generalization versus approximation, in terms of this decomposition. That was the purpose. Here is the bias, explicitly written as a formula. And here is the variance. We would like to argue that there is a tradeoff, that when you change your hypothesis set-- you make it bigger, more complex, or smaller. One of these guys goes up, and one of these guys goes down. So I'm going to argue about it informally. And then we'll take a specific example, where we are going to get exact numbers. But this is just to realize that this decomposition actually captures the tradeoff of approximation versus generalization. Why is that? Let's look at this picture. Here, I have a small hypothesis set. One function, if you want, but, in general, let's call it small. This one, I have a huge hypothesis set. So I have here the black points are hypotheses, that are candidates. Someone gives me a data set, and I learn, and choose something. Now the target function is sitting here. If I use this guy, obviously I am far away from the target function. And therefore, the bias is big. If I have a big hypothesis set-- this is big enough that it actually includes the f. Then when I learn, on average, I would be very close to f. Maybe I won't hit f exactly, because of the nonlinearity of the regime. The regime, I get N examples, learn and keep it, another N example, learn and keep it, and then take the average. I might have lost some because of the nonlinearity. I might not get f, but I'll get pretty close. So the bias here is very, very small, close to 0. In terms of the variance here, there is no variance. If I have one target function, I don't care what data set you give me. 
I will always give you that function. So there's nothing to lose here in terms of variance. Here, I have so many varieties that, depending on the examples you give me, I may pick this. And in another example-- because I'm fitting your data, so I get a red cloud around this. And their centroid will be g bar, the one that is good, but I may get one or the other. And the size of this guy measures the variance. This is the price I pay. Now you can see that if I go from a small hypothesis to a bigger hypothesis, the bias goes down, and the variance goes up. The idea here, if I make the hypothesis set bigger, I am making the bias smaller, because I am making this bigger. I'm getting it closer to f, and being able to approximate it better, so the bias is diminishing. But I am making this-- so the bias goes down. And here the variance goes up. Why is the variance going up? Because the red cloud becomes bigger and bigger. If I have this thing, then I have more variety to choose from, and I am getting bigger variance to work with. So this is the nature of the tradeoff. You may not believe this, because I just drew a picture and argued very informally. So now let's take a very concrete example, and we will solve it beginning to end. And if you understand this example fully, you will have understood bias and variance perfectly. So let's see. I took that simplest possible example that I can get a solution of, fully. My target is a sinusoid. That's an easy function. And I just wanted to restrict myself to -1, +1. So I'm going to get sine pi x. Just to scale it so that it's from -1 to +1, gets me the whole action. Therefore, the target function formally defined, is from -1, +1, to the real numbers. The co-domain is the real numbers. But obviously, the function would be restricted from -1 to +1, as a range. Now the target function is unknown. That's what we have been preaching for several lectures now. And now I am just giving you the target function. Again, this is an illustration. When we come to learning, we will try to blank it out, so that it becomes unknown in our mind. But in order to understand the analysis of the bias-variance, we would like to know what target function we are working with. We're going to get things in terms of it, and then you will understand why the tradeoff exists. So the function looks like this. Surprise-- just like a sinusoid. Now the catch is the following. You are going to learn this function. I am going to give you a data set. How big is the data set? I am not in a generous mood today, so I am just going to give you two examples. And from the two examples, you need to learn the whole target function. I'll try. N equals 2. The next item is to give you the hypothesis set. I'm going to give you two hypothesis sets to play with. So one of you gets one, and another gets another, and you try to learn and then compare the results. Well, I have two examples. So I cannot give you a 17th-order polynomial. So I am just going to give you the following. The two models are H_0 and H_1. H_0 happens to be the constant model. Just give me a constant. I am going to approximate the sine function with a constant. OK, this doesn't look good. But that's what we are working with. And the other one is far more sophisticated. It's so elaborate, you will love it. It's linear. Looks good now, having seen the constant already, right? These are your two hypothesis sets. And we would like to see which one is better. Better for what? That's the key issue. 
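Before going through the slides, here is a minimal Monte-Carlo sketch of this experiment (a companion illustration, not from the lecture itself; it assumes NumPy). It repeatedly draws two-point data sets, fits each model with its two-point least-squares solution, and estimates the bias and variance that the rest of the lecture works out:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)            # target: f(x) = sin(pi x) on [-1, +1]

n_datasets = 10_000                         # number of data sets D, each with N = 2
x_test = rng.uniform(-1, 1, 1_000)          # points used to average over x

X = rng.uniform(-1, 1, (n_datasets, 2))     # each row is one data set of two inputs
Y = f(X)

# H_0: the best constant for two points is the midpoint of the two y values.
b0 = Y.mean(axis=1)
g0 = np.tile(b0[:, None], (1, x_test.size))

# H_1: the line through the two points (the least-squares fit when N = 2).
a1 = (Y[:, 1] - Y[:, 0]) / (X[:, 1] - X[:, 0])
b1 = Y[:, 0] - a1 * X[:, 0]
g1 = a1[:, None] * x_test[None, :] + b1[:, None]

for name, g in [("H_0 (constant)", g0), ("H_1 (line)", g1)]:
    g_bar = g.mean(axis=0)                              # the average hypothesis g bar
    bias = np.mean((g_bar - f(x_test)) ** 2)            # E_x[(g_bar(x) - f(x))^2]
    var = np.mean((g - g_bar[None, :]) ** 2)            # E_x[E_D[(g^(D)(x) - g_bar(x))^2]]
    print(f"{name}: bias = {bias:.2f}, variance = {var:.2f}, expected E_out = {bias + var:.2f}")

A run of this should land near the numbers quoted later in the lecture: roughly 0.5 plus 0.25 for the constant model, and a small bias but a much larger variance for the line.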
Let's start to answer the question of approximation first, and then go to the question of learning. Here is the question of approximation, H_0 versus H_1. When I talk about approximation, I am not talking about learning. I am giving you the target function, outright. It's a sinusoid. If it's a sinusoid, why don't I say it's just a sinusoid, and have E_out equal 0? Oh, because the rule of the game is that you're using one of the models. You have to use either the constant or the linear. Do your best. Use all the information you have. But if you use the constant, return a constant. If you use the linear, return a line. You are not going to be able to return a bigger hypothesis than those. That's the game. OK? Let's see what happens with H_1. Here is the target. I am trying to fit it with a line, an arbitrary line. Can you think of what it looks like? A line is not much, but at least I can get something like this, right? Try to get part of the slope, et cetera. I can solve this. I get a line in general, calculate the mean squared error. It will be a function of 'a' and 'b'. Differentiate with respect to 'a' and 'b', and get the optimal. It's not a big deal. So you end up with this. That's your best approximation. This is not a learning situation, but this is the best you can do using the linear model. Under those conditions, you made errors. And these are your errors. You didn't get it right, and these regions tell you how far you are from the target. Let's do it with the other guy. Now I have a constant. I want to approximate this guy with a constant. What is the constant? I guess I have to work with 0. That's the best I have. Remember, it's mean squared error. So if I move away from 0, the big error will contribute a lot, because it's squared. So I just put it in the middle, and this is your hypothesis. And how much is your error? Big. The whole thing is your error. Let's quantify it. If you get the expected values of the mean squared error, you'll get a number, which here will be 0.5, and here will be approximately 0.2. So the linear model wins. Yeah, I'm approximating. I have more parameters, sure. If you give me third order, I will be able to do better. If you give me 17th order, I'll be able to do better. But that's the game. In terms of approximation, the more the merrier. Because you have all the information. There's no question of zooming in. Now let's go for learning. This course is about machine learning, right? Not about approximation. So this is the important part for us. Let's play the same game with a view to learning. You have two examples. You are going to learn from them. You are restricted to one hypothesis set or the other. So let's start with H_1, and I'll go to H_0 again. This is your target function. Now you get two examples. I'm going to, let's say, uniformly pick two examples independently, and I get these two examples. I'd like you to fit the examples, and we'll see how well you approximate the target function. The first item of business is to get rid of the target function. Because you don't know it. You only know the examples. So in a learning situation, this is what you get. Now I ask you to fit a line. A line, two points. I can do that. This is what you do. Now that you settled on the final hypothesis, I'm going to grade you. So I'm going to bring back the target function, and compare this to that, and give you your out-of-sample error. Let's do it for H_0. You have the same two points. You're fitting them with a constant. How would you do that?
Probably the midpoint will give you the least squared error on these two points, right? So this would be your final hypothesis. And you get back your target function, in order to evaluate your out-of-sample error, and this is what you get. Now you can see what the problem is. I can compute the error here, and I can have the error regions, and all of that. But this depends on which two points I gave you. If I give you another two points, I give you another two points, et cetera, I am not sure how to really compare them, because it does depend on your data set. That's why we needed the bias-variance analysis. That's why we got the expected value of the error, with respect to the choice of the data set, so that we actually are talking inherently about a linear model learning a target using two points, regardless of which two points I'm talking about. So let's do the bias and variance decomposition for the constant guy. Here is the figure. It's an interesting figure. Here I am generating data sets, each of size 2 points, and then fitting a line. And the line would be the midpoint. I keep repeating this exercise, and I am showing you the final hypothesis you get. I repeated it a very large number of times. This is a real simulation, and these are the hypotheses you get. You can see that when you get this line, it means that the two points were equally distant from here. Sometimes I get the points here. Sometimes I get them equal to that, so I get here. The middle point is a little bit heavier. Because, obviously, the chance of getting them on the two lobes is there, and so on. So this is basically the distribution you get. Each of them will give you an out-of-sample error. And the interesting thing for us is the expected out-of-sample error. That's what will grade the model. Now what we are going to do, we are going to get the bias and variance decomposition based on that. And that is our next figure. Look at this carefully. The green guy, the very light green guy, is g bar of x. This is the average hypothesis you get. How did I get that? I simply added up all of these guys, and divided by their number. And it is expected obviously, by the symmetry, that on average, I will get something very close to 0. The interesting thing is that you can see now that g bar, here, happens to be also the best approximation. If I keep repeating this, I will actually get the 0 guy, which I was able to get when I had the full target function I was approximating. Here I don't have the full target function. I have one hypothesis at a time. I am getting the average, but I am getting this. So there is a justification for saying that g bar will be the best hypothesis. Because this game of getting one at a time, and then getting the average, does get me somewhere. But do remember, this is not the output of your learning process. I wish it were. It isn't. The output of your learning process is one of those guys, and you don't know which. It just happens that, if you repeat it, this will be your average. And because you are getting different guys here, there will be a variance around this. And the variance, I'm describing it basically by the standard deviation you are going to get. So the error between the green line and the target function will give you the bias. And the width of the gray region will give you the variance. Understood what the analysis is? So that takes care of H_0. Let's go to H_1. So to remember, the learning situation for H_0 was this. This is when I had the constant model. 
What will happen if you are actually fitting the two points, not with a constant, which you do at the midpoint, but you are fitting them with a complete line? What will it look like? It will look like this. Wow. You can see where the problem is. Talk about variance. Take two points. You connect them. Wherever the two points are, you get this jungle of lines. This is for exactly the same data sets that gave me the horizontal lines in the previous slide. So this is what I get. Now I ask myself, what on average will I get? You can immediately say, on average, you'd better get a positive slope. There is a tendency to get a positive slope. Because when you get the points split, you will get this. Sometimes you get a negative slope here, here. But that is balanced by getting a positive slope here. You can argue this, but you can do the math. And then you get the bias-variance decomposition. This will be your average. This is g bar. And this will be the variance you get. The variance depends on x. This is the way we defined it. And when you want one variance to describe it, you get the expected value of the squared width of that gray region. This gray region has the standard deviation. Now you can see exactly that I am getting a better approximation than the previous guy. But I sure am getting very bad variance, which is expected here. Now you can see what the tradeoff is. And the question is, given these two models, which one wins in a learning scenario? You need to ask the question, to remember what it is. I am trying to approximate a sinusoid. Is it better to do it with a constant or a general line? The answer to that question is obvious. But that is not the question I am asking in learning. The question I am asking in learning is: you have two points coming from something I don't know. Is it better to use a constant or a line? You notice the difference. I am going to put them side by side, and then see which is the winner. So this guy has a big bias and a small variance. This guy has a small bias and a big variance. Let's get quantitative. What is the bias here? It's actually 0.5. Exactly the same as we got when we were approximating outright. It's the 0. That's the expected value. You get 0.5, the mean squared error. What is the bias here? It's 0.21. Interestingly enough, when we did the approximation, it was about 0.2. And indeed, this is not exactly the best fit. Remember when I told you there is a nonlinearity aspect. You are taking two points at a time, and then taking a fit, and then taking the average. And it's conceivable that this will give you something different from what you would get if you had the full curve and were fitting it outright. The difference is usually very small, and it is. But here you get something which is not exactly perfect, but is very close to perfect. So obviously, here the bias is much smaller. Let's look at the variance. What is the variance here? The variance here is 0.25. It's not too bad. The variance here, we expect it to be bigger. But is it big enough to kill us? It's a disaster, complete and utter disaster. And now, when you see what the expected out-of-sample error is, you add these two numbers. Here I'm going to get 0.75, and here you are going to get something much bigger. And the winner is-- Now you go to your friends, and tell them that I learned today that in order to approximate a sine, I am better off approximating it with a constant than with a general line. And have a smile on your face.
Of course you know what you're talking about, but they might not really appreciate the humor here. This is the game. I think we understand it well. So the lesson learned, if I want to articulate it, is that when you are in a learning situation always remember: you are matching the model complexity to the data resources you have, not to the target complexity. I don't know the target. And even if I knew the level of complexity it has, I don't have the resources to match it. Because if I match it, I will have the target in my hypothesis set, but I will never arrive at it. Pretty much like I'm sitting in my office, and I want a document of some kind, an old letter. Someone has asked me for a letter of recommendation, and I don't want to rewrite it for you. So I want to take the old guy and just see what I wrote, and then add the update to that. Before everything was archived in the computers, it used to be a piece of paper. So I know the letter of recommendation is somewhere. Now I face the question, should I write the letter of recommendation from scratch? Or should I look for the letter of recommendation? The recommendation is there. It's much easier when I find it. However, finding it is a big deal. So the question is not that the target function is there. The question is, can I find it? Therefore, when I give you 100 examples, you choose the hypothesis set to match the 100 examples. If the 100 examples are terribly noisy, that's even worse. Because their information to guide you is worse. That's what I mean by the data resources you have. The data resources you have is, what do you have in order to navigate the hypothesis set? Let's pick a hypothesis set that we can afford to navigate. That is the game in learning. Done with the bias and variance. Now we are going to take just an illustrative tool, called the learning curves. And then we are going to put the bias and variance versus the VC analysis on those curves. So what are the learning curves? They are related to what we think of intuitively as a learning curve. But they are a technical term here. They are basically plotting the expected value of E_out and E_in. We have done E_out already. But here we also plot the expected value of E_in, as a function of N. Let's go through the details. I give you a data set of size N. We know what the expected value of the out-of-sample error is. We have seen that already in the bias-variance decomposition. And this is the quantity. I know this is the quantity that I will get in any learning situation. It depends on the data set. If I want a quantity that describes just the size of the set, I will integrate this out, and get the expected value with respect to D. That's the quantity I have. And the other one is exactly the same, except it's in-sample. We didn't use it in the bias-variance analysis. This one, I am going to get the expected value of the in-sample. So I want to get, given this situation, if I give you N examples, how well are you going to fit them? Well, it depends on the examples. But on average, this is how well you are going to fit them. And you ask yourself, how do these vary with N? And here comes the learning curve. As you get more examples, you learn better. So hopefully, the learning curve-- and we'll see what the learning curve looks like. Let's take a simple model first. So it's a simple model. And because it's a simple model, it does not approximate your target function well. The best out-of-sample error you can do is pretty high. 
When you learn, the in-sample will be very close to the out-of-sample. So let's look first at the behavior as you increase N. As you increase N, hopefully the out-of-sample error is going down. I have more examples to learn from. I have a better chance of approximating the target function. And indeed, it goes down. And it can go down and down, until it gets to the absolute limit of your hypothesis set. Your hypothesis set is very simple. It doesn't have a very good approximation for your target. This is the best it can do. The best you can do is the best you can do. So that's what you get. When you look at the in-sample, it actually goes the other way around. Because here my task is simpler than here. Here I am trying to fit 5 examples. Here I am trying to fit 20 examples. And I only have the examples to fit. I'm not looking at the target function, or anything like that. So obviously, I can use my degrees of freedom in the hypothesis set, and fit the 5 examples better, and get a smaller in-sample error. Whereas if I increase N, I will get a worse in-sample error. It doesn't bother me, because the in-sample error is not the bottom line. The out-of-sample is. And as you can see, although I am getting worse in-sample, I am getting better out-of-sample. And indeed, the discrepancy between them, which is the generalization error, is getting tighter and tighter as N increases. Completely logical. By the way, this is a real model, so when we talk about overfitting, I will tell you what that model is, as the simple model and the complex model. The complex model, exactly the same behavior, except it's shifted. It's a complex model, so it has a better approximation for your target function. So it can achieve, in principle, a better out-of-sample error. You have so many degrees of freedom that you were able to fit the training set perfectly up to here. This corresponds, more or less, to the VC dimension. Up to the VC dimension, you can shatter everything. So you can shatter these guys. You can fit them perfectly. So you get zero error. You start compromising when you have more guys and you cannot shatter, so maybe you have to compromise. And you end up starting to have in-sample error. And the in-sample error goes up, and the out-of-sample error goes down. The interesting thing is that in here, when you have this, I fit the examples perfectly. I'm so happy. What is out-of-sample? An utter disaster. Absolutely no information. We didn't learn anything. We just memorized the examples. So here, again, the out-of-sample error goes down. The in-sample error goes up. Same argument exactly. They get closer together. But obviously the discrepancy between them is bigger, because I have a more complex set. Therefore, the generalization error is bigger. The bound on it is bigger in the VC analysis. And the actual value is bigger. So this is the analysis. It's a very simple tool. And the reason I introduced it here is that I want to illustrate the bias-variance analysis versus the VC analysis, using the learning curves. It will be very illustrative to understand how the two theories relate to each other. Let's start with the VC analysis on learning curves. These are learning curves. The in-sample error goes up, as promised. The out-of-sample error goes down. There is a best approximation that corresponds to this level of out-of-sample error, if we actually knew the thing. And what did we do in the VC analysis?
We had the in-sample error, which is this region, the height of this region, and then we had a bound on the generalization error, which is Omega. And we said that the bound behaves the same way as the quantity itself. So the bound actually will not be this thing. It will be way bigger. But in proportionality, it will give us the same proportion. So as you increase N, the generalization error goes down. The bound on it goes down. Omega goes down, which we already realized. And obviously, you can take another model. And if the model is very complex, the discrepancy between them becomes bigger, which agrees with that. So this is the decomposition of it. Now I took some liberties, in order to be able to do that. The VC analysis doesn't have expected values. So I took expected values of everything there is. So there is some liberty taken, in order to put it to fit in that diagram, but the principle holds. The blue region is the in-sample error, and the red region is basically the Omega. That is what happens in the generalization bound. Think for a moment, which region will be blue and which region will be red in the bias-variance analysis? I'll get exactly the same curves, the same model. So what will it be? It will be this. That's the difference. In the bias-variance, I got the bias based on the best approximation. I didn't look at how you performed in-sample. I assumed hypothetically that you could look for the best possible approximation. And I charged the bias for that. And this is the bias you have. So this is the best you can do. And this is the error you are making. Again, there is a liberty taken here. Because this is genuinely the best approximation in your hypothesis set. The one I am using for the bias-variance analysis is the error on g bar. And we said, g bar will be close in error to this guy. It may not even be in the hypothesis set. So there is some liberty, but it's not a huge liberty. This is very much close to what you are getting in the bias-variance. And the rest of it is the variance. Because you get the bias plus that, and you will get the expected value of the out-of-sample error. Now you can see why they are both talking about the same thing. Both of them are talking about approximation. That's the blue part. Here it's approximation overall. And here it's approximation in-sample. And both of them take into consideration what happens in terms of generalization. Well, the red region here is maybe twice the size. Not twice the size in general. It will be twice the size actually in the linear regression example. But basically, they have the same behavior. They have just different scale. So they capture the same principle of generalizing, or the uncertainty about which hypothesis to pick, or how much do I lose from going in-sample to out-of-sample. So they have the same behavior. And the only difference here is that, here the bias obviously is constant with respect to N. The bias depends on the hypothesis set. Now this is also an assumption. Because it says, I have 2 examples and take the average. I will get an error. If I have 10 examples and take the average, I get an error. Is it the same? Well, in both cases, you effectively used an infinite number of examples. Because the first one you used two at a time, and you repeated it an infinite number of times, and you took an average. This, you used 10 at a time, and you took an average. I grant you maybe the 10 will give you a better situation. 
But again, it's a little bit of a license, in order to be able to attribute the bias and variance to this line, which happens to be the best hypothesis proper within your hypothesis set. So this is the contrast between the two theoretical approaches we have covered, in this lecture and the previous three lectures. I am going to end with the analysis for the linear regression case. So I'm going to basically go through it fairly quickly. This is a very good exercise to do. And if you read the exercise and you follow the steps, it will give you very good insight into linear regression. I'll try to explain the highlights of it. Let's start with a reminder of linear regression. So for linear regression, I'm using a target. For the purpose of simplification, I am going to use a noisy target, which is linear plus noise. So I'm using linear regression to learn something linear plus noise. If it weren't for the noise, I would get it perfectly. It's already linear. But because of the noise, I will be deviating a little bit. This is just to make the mathematics that results easier to handle. Now you're given a data set. And the data set is a noisy data set. So each of these is picked independently. This y depends on x, and the only unknown here is the noise. So you get this value, it gives you the average, and then you add noise to get the y. Do you remember the linear regression solution? Regardless of what the target function is, you look at the data, and this is what you get for the solution. You take the input data set, and the output data set. You do this algebraic combination, and whatever comes out is your output of the linear regression. This is your final hypothesis. We have done that. And now we are going to think about a notion of the in-sample error, not the in-sample error as a summary quantity, but the in-sample error pattern. How much error do I get on the first example? How much error do I get on the second, third, et cetera? Just for our purposes. So what would that be? Well, that would be what I got in the final hypothesis. I apply the final hypothesis to the input points. I am going to get a pattern of values that my hypothesis is predicting. I compare them to the actual targets, which happen to be stored in the y. And that would be an error pattern. So it would be plus something, minus something, plus something, minus something. And if I add the squared values here, and get the average of those, I will get what we call the in-sample error. For the out-of-sample error, I am going to play a simplifying trick here, in order to get the learning curve in the finite case. Here I am going to consider that, in order to get the out-of-sample error, what I'm going to do is just generate the same inputs, which is a complete no-no in out-of-sample. Supposedly in out-of-sample, you get points that you haven't seen before. You have seen these x's before. But the redeeming value is that I'm now going to give you fresh noise. So that's the unknown, and that is what allows me to say that it plays the role of an out-of-sample. I'm going to generate another set of points with different noise, but on the same inputs, in order to simplify the analysis. You see that the x's here are involved. And if I use the same inputs, things will simplify. And in that case, if you ask yourself what the out-of-sample error of those is, it's exactly the same. I evaluated on the points. They happen to be the points for the out-of-sample. And I'm comparing it with y.
I'm calling it y dash, which is exactly the same thing, except with noise dash, another realization of the noise. This is the outline of the setup to get us the learning curves we want. When you do the analysis, not that difficult at all, you will get this very interesting curve. This is the learning curve, and it has very specific values. Sigma squared, that's the variance of the noise. This is the best you can do. I expect that, because you told me the target is linear. So I can get that perfectly. But then, there is this added noise. I cannot capture the noise. What is the variance of the noise? Sigma squared. So this is the error that is inevitable. Look at the in-sample error. Up to N equals d plus 1, you were perfect. Yeah, of course I am perfect. Because I have d plus 1 parameters in the linear model, and I am fitting fewer points than that, so I can fit them perfectly. It doesn't mean much for the out-of-sample error, but that's what I get. I start compromising when I get more points. And as I go with more points, I am fitting the noise less and less. The noise is averaging out. Now I'm getting very, very close to as if there was no noise. Because the pattern persists, which is the linear guy. And the noise, if I get more examples, more or less cancels out in the fitting. I don't have enough degrees of freedom to fit them all. So I get to average, until eventually I get to as if I am doing it perfectly. And out-of-sample goes down. There is a very specific formula that you can get, which is interesting. So let me finish with this. The best approximation error is sigma squared. That's the line, right? What is the expected in-sample error? It has a very simple formula, which is-- everything is scaled by sigma squared. So what you have here is, it's almost perfect. And you are doing better than perfect, by this amount, the ratio of d plus 1 to N. Remember what d plus 1 was? For the perceptron, it was the VC dimension. Here it's also a VC dimension of sorts, the degrees of freedom that linear regression has. So we divide the degrees of freedom by the number of examples. That is the factor that you get. And you realize here that this is the best you can do. And here you are doing better than the best. Why is it better than the best? Because I'm not trying to fit the whole function. I am only fitting the finite sample. So I'm doing very well, and I'm very happy about it, little do I know that I'm actually harming myself. Because what I'm doing here, I am fitting the noise. And as a result of that, I am deviating from the optimal guy. And I am paying the price in out-of-sample error. What is the price I am paying in out-of-sample error? It is the mirror image. I lose exactly in out-of-sample what I gained in-sample. And the most interesting quantity is the summary quantity. What is the expected generalization error? Well, the generalization error is the difference between this and that. I have the formula for them. So all I need to do is write this. Let me magnify this. This is the generalization error. It has the form of the VC dimension divided by the number of examples. In this case, it's exact. And this is what I promised last time. I told you that this rule of proportionality between the VC dimension and the number of examples persists, to the level where sometimes you just divide the VC dimension by the number of examples, and that will give you a generalization error. This is the concrete version of it, in spite of the fact that what we have here is not a VC dimension. This is real-valued.
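As a rough check on the behavior just described, here is a small simulation sketch (again a companion illustration assuming NumPy, not part of the lecture): a noisy linear target, the least-squares solution, and the out-of-sample error measured on the same inputs with fresh noise, compared against sigma squared times (1 minus or plus (d+1)/N):

import numpy as np

rng = np.random.default_rng(1)
d, sigma, n_trials = 5, 1.0, 2000            # input dimension, noise level, data sets per N

w_star = rng.normal(size=d + 1)              # the (unknown) linear target, including a bias weight

for N in (10, 20, 40, 80, 160):
    e_in = e_out = 0.0
    for _ in range(n_trials):
        X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # add the constant coordinate x_0 = 1
        y = X @ w_star + sigma * rng.normal(size=N)                  # linear target plus noise
        w = np.linalg.lstsq(X, y, rcond=None)[0]                     # the linear regression solution
        e_in += np.mean((X @ w - y) ** 2)                            # in-sample squared error
        y_fresh = X @ w_star + sigma * rng.normal(size=N)            # same inputs, fresh noise: y dash
        e_out += np.mean((X @ w - y_fresh) ** 2)                     # "out-of-sample" error on y dash
    e_in, e_out = e_in / n_trials, e_out / n_trials
    print(f"N = {N:3d}: E_in = {e_in:.3f} (formula {sigma**2 * (1 - (d + 1) / N):.3f}), "
          f"E_out = {e_out:.3f} (formula {sigma**2 * (1 + (d + 1) / N):.3f})")

The gap between the two columns shrinks like 2 sigma squared (d+1)/N, which is the generalization error described above: the degrees of freedom divided by the number of examples.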
So we will stop here, and we will go into questions and answers after a short break.

Let's go into the questions.

MODERATOR: Right. The first question is if you can go back to slide 19.

PROFESSOR: 19.

MODERATOR: The question is if you can explain how complex models are better than simple models.

PROFESSOR: OK, better in something. I think the key issue in the theory is that there is a tradeoff. Nothing is better on all fronts, and nothing is worse on all fronts. So let's compare the simple model and the complex model.

First, the ability to approximate-- whether that ability to approximate is in-sample, or whether it is absolute. What is the ability to approximate in the absolute? Here is my hypothesis set, and I have a target function. The horizontal line, that height, gives you the error of approximation. So if you go from a simple model to a complex model, you will be able to approximate better. That is obvious. And that also is inherited if your approximation is focused only on the training examples. In this case, you are comparing not the horizontal lines, but the blue curves. This is the error you make in approximating the sample you get. And again, the approximation for the simple model is worse than the approximation for the complex model. So if your game is approximation, and that's your purpose, then obviously the complex model is better.

You can also ask yourself about the generalization ability. The generalization ability would be either the discrepancy between the blue and red curves-- that would be the VC analysis, how much I lose going from in-sample to out-of-sample-- or how much I lose from a perfect approximation, in the case of the bias-variance analysis, to getting E_out, because of my inability to zoom in on the right hypothesis. That would be this area here. So whether you are taking the difference between the blue and red curves, or the difference between the red curve and the black line, that area is smaller here than here. Therefore, the simple model is better as far as generalization is concerned.

Now, because it's a tradeoff, and I have one of them better and one of them worse, the question is, when I put them together, which is better? Because the bottom line in learning is the red curve. That's what I care about. This is the performance of the system that I'm going to deliver to my customer, and they're going to test it out-of-sample. And if they get it right, they will be happy. So now, because I have two quantities that I'm adding, and one of them is going down and one of them is going up, it is obvious that the sum could go either way. And in this case, you can see that it does go either way. For example, if you have few examples, then E_out here is OK. It's not great, but it's decent. If you have the same number of examples here, E_out is a disaster. So if you have few examples, you simply cannot afford the complex model. You are better off working with a simple model, and you will get better out-of-sample error. If I give you a much bigger resource of examples-- if you are here-- now this one is limited by the fact that it's simple. It cannot get any better. It has all the information, it zooms in perfectly, but it cannot get any better. The complex model now gets to use its degrees of freedom properly, and gets you to a smaller value. So for a larger number of points, you get better performance there. That's why we are saying that you should match the complexity of the model to the data resources you have, which in this case are represented by N. We're talking about different target functions and different things, but in choosing this model or another, what really dictates the performance is the number of examples versus the complexity of the model.
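As a rough numerical illustration of that last point-- my own sketch, not from the lecture, with an arbitrary sinusoidal target and noise level, and polynomial fits of degree 1 and degree 10 standing in for the simple and complex hypothesis sets:

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_E_out(N, degree, sigma=0.3, trials=500):
    """Average out-of-sample squared error of a degree-`degree` polynomial
    fit to N noisy samples of a fixed target, averaged over many data sets."""
    f = lambda x: np.sin(np.pi * x)           # an arbitrary target, purely for illustration
    x_test = np.linspace(-1, 1, 400)          # fresh test points
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, size=N)
        y = f(x) + sigma * rng.normal(size=N)
        coeffs = np.polyfit(x, y, degree)     # least-squares polynomial fit
        errs.append(np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2))
    return float(np.mean(errs))

for N in (15, 200):
    print(N, expected_E_out(N, degree=1), expected_E_out(N, degree=10))
# With few examples the simple model typically wins; with many examples the complex one does.
```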
MODERATOR: OK. When you did the analysis for linear regression, if you did it using the perceptron model, would you get the same generalization error?

PROFESSOR: Let's go for that. The bias-variance analysis-- and this is also inherited by the learning curves-- is very clean when you use mean squared error. Obviously, you can use mean squared error with the perceptron, and there will be a correspondence here. But the ability to get such a clean formula here really depends on the very particulars of linear regression. If you go back to the previous slide, it was very critical to make the assumption that the out-of-sample error is defined this way, and to make the target very specifically linear plus noise, in order to be able to simplify. The result, by the way, holds in general, asymptotically. If you take genuine out-of-sample points, which means that you pick different points, you will get a different matrix X. So you will take the w that you got from in-sample and apply it to X dash in this case, which is this, and then y dash. And the problem is that when you plug it in here and try to get a formula for that, the formula will depend on how the X dash relates to the X. When it's the same, they cancel out neatly, and you get the formula that I had. But asymptotically, if you make certain assumptions about how X is generated and you take the asymptotic result, you will get the same thing.

So the short answer is the following. The analysis in the exact form that I gave, which gives me these very neat results, is very specific to linear regression, and very specific to the choice of out-of-sample as I did it, if you want to give the answer exactly in the finite case. If you use a perceptron, you will be able to find a parallel, but it may not be as neat.

MODERATOR: Quick clarification. Sigma squared is the variance of the noise in the--

PROFESSOR: Yeah, I just realized that. I have been using bias-variance-- the lecture is called bias-variance-- and now we have the variance of the noise. Obviously, I am so used to these things that I didn't notice. When I say the variance here, it has absolutely nothing to do with the bias-variance analysis that I talked about. It's noise, and I am trying to measure the energy of it. It's zero-mean noise, so the energy of it is proportional to the variance. So I should have called it the energy of the noise-- the energy of the noise is sigma squared-- in order not to confuse people. But I hope that I did not confuse too many people.

MODERATOR: Can the bias-variance analysis be used for model selection?

PROFESSOR: The bias-variance analysis, just because it is so specific, actually assumes that you know the target function, if you want to get the quantities explicitly. So for example, in linear regression, I assumed the form is linear plus noise. For the sinusoidal case, we got the answers and we were able to choose, but you actually knew that it was a sinusoid. So the bias-variance analysis is taken as a guide. But it's a very important guide.
Because I can ask myself how to affect it-- I want to get E_out down. Now I know that there are two contributing factors, bias and variance. Can I get the variance down without getting the bias up? That's a bunch of techniques; regularization will belong to that category. Can I get both of them down? That will be learning from hints. There will be something that affects both of them, and so on. So you can map different techniques to how they affect the bias and variance. I would say that, in terms of any application to a learning situation, it's a guideline rather than something that I'm going to plug in and have it tell you what the model is. The answer for model selection is mostly through validation, which we're going to talk about in a few lectures. And that is the gold standard for the choices you make in a learning situation, including choosing the model.

MODERATOR: I have a question getting a little bit ahead. In ensemble methods, like boosting or something, is there a reason under these analyses why those methods work?

PROFESSOR: I almost included this in the lecture, but I thought it was one too many. Look at the idea of g bar. Let me try to get to its definition. This was just a theoretical tool of analysis. I have g bar equals the expected value of that, and if I want to do it with a finite number of data sets, I sum these up and normalize by 1 over K. Although this was just a theoretical way of getting the bias-variance decomposition, and a conceptual way of understanding what it is, there is an ensemble learning method that builds exactly on this, which is called Bagging-- bootstrap aggregation.

And the idea is, what do I need in order to get g bar? We said g bar is great, if I can get it. But it requires an infinite number of data sets, and I have only one data set. So the idea of Bagging is that I am going to use my data set to generate a large number of different data sets. How am I going to do that? Well, that's bootstrapping. Bootstrapping always looks like magic. You know where the expression comes from? In bootstrapping, you try to lift yourself by pulling on your bootstraps-- which obviously you cannot do, because you are pulling on them. But that's what you do. Here we are trying to create something that isn't there. And in this particular case, what you do is sample randomly from your data set, in order to get different data sets, and then average. And believe it or not, that actually gives you a dividend. It gives you something of the benefit of ensemble learning. There are other, obviously more sophisticated, methods of ensemble learning, and one way or the other, they appeal to the fact that you are reducing the variance by averaging a bunch of things. So you can say that the idea is either taken outright, as in Bagging, or it is the inspiration, in the sense that it's a good idea to average because you cancel out fluctuations.
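Here is a rough sketch of that idea-- my own illustration, not the lecture's code, with a plain least-squares fit standing in for whatever base learner you would actually bag (in practice it is usually a higher-variance model, such as a decision tree):

```python
import numpy as np

rng = np.random.default_rng(2)

def bagged_fit(X, y, num_bags=100):
    """Bootstrap aggregation in the spirit described above: resample the one
    data set we have (with replacement) to imitate having many data sets,
    fit a hypothesis on each resample, and average the fitted hypotheses.
    The average plays the role of g bar."""
    N = len(y)
    fits = []
    for _ in range(num_bags):
        idx = rng.integers(0, N, size=N)        # a bootstrap data set: N draws with replacement
        fits.append(np.linalg.pinv(X[idx]) @ y[idx])   # least-squares fit on this resample
    w_bar = np.mean(fits, axis=0)               # averaging linear fits = averaging their weight vectors
    return lambda X_new: X_new @ w_bar          # the aggregated hypothesis

# Usage (with hypothetical arrays X_train, y_train, X_test):
#   g = bagged_fit(X_train, y_train)
#   predictions = g(X_test)
```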
MODERATOR: If we use the Bayesian approach, does this bias-variance dilemma still appear?

PROFESSOR: Repeat the question, please.

MODERATOR: If you use a Bayesian approach, does this bias-variance still appear?

PROFESSOR: OK. The bias-variance is there to stay. It's a fact. We can take a particular approach, and then we are going to perhaps find an explicit expression for the bias, and an explicit expression for the variance. But nothing will change about the nature of things because of the approach I take. Now, the Bayesian approach is very particular, because the Bayesian approach makes certain assumptions. And after you make these assumptions, you can answer all questions perfectly. You can answer questions like that, and other questions as well. I will talk about the Bayesian approach in the very last lecture of the course, so I will defer answers that are specific to that until that point. But basically, the answer to this very specific question-- it's like asking, does the VC dimension change if you apply the Bayesian approach? Well, the Bayesian approach is just a bunch of assumptions. The VC dimension is there. Maybe by using the Bayesian approach you'll be able to find more direct quantities to predict what you want. But the VC dimension is there, because it's defined in a general setup.

MODERATOR: A question about the relation to numerical function approximation. In that field, there's interpolation and extrapolation. When is there extrapolation in machine learning?

PROFESSOR: Function approximation is one of the fields that is very much related, because we are given a finite sample, the points are coming from a function, and we're trying to approximate it. And this is one of the applications. In general, interpolation is easier than extrapolation, because you have a handle on it. And if you want to articulate that in terms of the stuff we have, the variance in interpolation is smaller than the variance in extrapolation, in general. Remember the lines in the sinusoid example? They were all over the place. If you take the region between the two points-- so I have the sinusoid, I have the two points, and I'm connecting them with a line-- between the two points, I am very much in good shape, because the sine is this way and I am this way, so it's not that big a deal. The further out you go, the more fluctuation there is. And that is reflected in the extrapolation.

MODERATOR: OK. When the variance is big, we know we're extrapolating. Is that the answer?

PROFESSOR: No. I will say there is an association between them. To answer this specifically, you need to understand the particular case. There may be cases where the extrapolation doesn't have a lot of variance and whatnot. I'm just trying to map, in general, what the quantity here corresponds to in that setting. The problem with extrapolation can be posed, in this case, in terms of more variance than interpolation. But I'm not making a mathematical statement that this is guaranteed to be the case.

MODERATOR: Could you explain what the literature means by the bias-variance-covariance dilemma?

PROFESSOR: OK. You can pursue this analysis a little bit further to the cases where you have cross terms. Particularly for boosting, this is the case. And then there is a question of-- I am trying to get these hypotheses that I'm going to average in order to get the final hypothesis. That's my game. Now, it would be nice if I could get them to be independent, because when I get them to be independent, then adding them up reduces the variance in a very good way. But in general, when you actually apply some of these algorithms, there is a correlation between one and another. So there's a covariance, and there's a question of the balance between the two. But in terms of application, it really relates more to ensemble learning than to the general bias-variance analysis as I did it. Because in the bias-variance analysis, I had the luxury of picking independently generated data sets, generating independent hypotheses, and then averaging them, because it's a conceptual tool.
But when you are actually using a technique where you construct these hypotheses based on variations of the data set, then the covariance starts playing a role.

MODERATOR: A question about, I guess, naming things. Is linear regression actually learning? Or is it just fitting, along the lines of function approximation?

PROFESSOR: Linear regression is a learning technique, and fitting is the first part of learning. You always fit in order to learn. The only added thing is that you want to make sure that, as you fit, you also perform well out-of-sample. That's what the theory was about. I've been spending four lectures trying to make sure of that. When you do the intuitive thing-- I give you data, you fit them-- you could do that without taking a machine learning course. Now I'm telling you that you have to have the checks in place, such that when you fit in-sample, something good happens in what you care about, which is out-of-sample. So that's the--

MODERATOR: All right. I think that's it.

PROFESSOR: Very good. We'll see you next week.
Info
Channel: caltech
Views: 136,118
Rating: 4.9567308 out of 5
Keywords: Machine Learning (Field Of Study), bias-variance, learning curve, Caltech, MOOC, data, computer, science, course, Data Mining (Technology Class), Big Data, Data Science, learning from data, linear regression, Technology (Professional Field), Computer Science (Industry), Learning (Quotation Subject), Lecture (Type Of Public Presentation), California Institute Of Technology (Organization), Theory (Quotation Subject), Abu-Mostafa, Yaser
Id: zrEyxfl2-a8
Length: 76min 50sec (4610 seconds)
Published: Sat Apr 28 2012