Lecture 03 - The Linear Model I

ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome back. Last time, we discussed the feasibility of learning. And we realized that learning is indeed feasible, but only in a probabilistic sense. And we modeled that probabilistic sense in terms of a bin, whose unknown contents we mapped to the out-of-sample performance. That performance we don't know. And in order to be able to tell what E_out of h is-- h is the hypothesis that corresponds to that particular bin-- we look at the in-sample. And we realize that the in-sample tracks the out-of-sample well through a mathematical relationship, the Hoeffding Inequality, which tells us that the probability that E_in deviates from E_out by more than our specified tolerance is a small number. And that small number is a negative exponential in N. So the bigger the sample, the more reliably E_in will track E_out. That was the basic building block.

But then we realized that this applies to a single bin. And a single bin corresponds to a single hypothesis. So now we go for a case where we have a full model, h_1 up to h_M. And we take the simple case of a finite hypothesis set. And we ask ourselves, what would apply in this case? We realized that the problem with having multiple hypotheses is that the probability of something bad happening could accumulate. Because if there is a 0.5% chance that the first hypothesis is bad, in the sense of bad generalization, and 0.5% for the second one, we could be so unlucky as to have these 0.5%'s accumulate, and end up with a significant probability that one of the hypotheses will be bad. And if one of the hypotheses is bad, and we are further unlucky, and this is the hypothesis we pick as our final hypothesis, then E_in will not track E_out for the hypothesis we pick. So we need to accommodate the case where we have multiple hypotheses.

And the argument was extremely simple. g is our notation for the final hypothesis. It is one of these guys that the algorithm will choose. Well, the event that E_in doesn't track E_out for g is obviously included in the event that E_in for h_1 doesn't track the out-of-sample for that one, or E_in for h_2 doesn't track, or E_in of h_M doesn't track. The reason is very simple. g is one of the guys. If something bad happens with g, it must happen with one of these guys at least, the one that was picked. So we can always say that this implies these things, which is this or this or this or this or this. And after that, we apply a very simple mathematical rule, which is the union bound. The probability of an event or another event or another event is at most the sum of the probabilities. That rule applies regardless of the correlation between these events, because it takes the worst-case scenario. If all the bad events happen disjointly, then you add up the probabilities. If there is some correlation, and they overlap, you will get a smaller number. In all of the cases, the probability of this big event will be less than or equal to the sum of the individual probabilities. And this is useful because in the coin flipping case, which started this argument, the events are independent. In the case of the hypotheses of a model, the events may not be independent, because we have the same sample. And we are only changing the hypotheses. So it could be that the deviation here is related to the deviation here. But the union bound doesn't care.
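In symbols, the two bounds just recapped are, for a single hypothesis h,

\[ \mathbb{P}\big[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\big] \;\le\; 2e^{-2\epsilon^2 N}, \]

and, for the final hypothesis g chosen from h_1, ..., h_M, the union bound gives

\[ \mathbb{P}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \;\le\; \sum_{m=1}^{M} \mathbb{P}\big[\,|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\,\big] \;\le\; 2Me^{-2\epsilon^2 N}. \]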
Regardless of such correlations, you will be able to get a bound on the probability of this event. And therefore, you will be able to relate the probability that you care about, which has to do with the generalization, to the individual Hoeffding bound applied to each of those. And since you have M of them, you have an added factor of M. So the final answer is that the probability of something bad happening after learning is less than or equal to this quantity, which is a hopefully small quantity, times M. And we realize that now we have a problem, because if you use a bigger hypothesis set, M will be bigger. And therefore, the right-hand side here will become bigger and bigger as M grows. And therefore, at some point, it will even become meaningless. And we are not even worried yet about M being infinity, which will be true for many hypothesis sets, in which case this is totally meaningless. However, we weren't establishing the final result in learning. We were establishing the principle that, through learning, you can generalize. And we have established that. It will take us a couple of weeks to get from that to the ability to say that a general learning model, an infinite one, will generalize. And we will get the bound on generalization. That's what the theory of generalization will address.

So today the subject is linear models. And as I mentioned at the beginning, this is out of sequence. If I were following the logical sequence, I would go immediately to the theory, take this M, which takes care of the finite case, and then generalize it to the more general case. However, as I mentioned, I decided to give you something concrete and practical to work with early on. And then we will go back to the theory after that. The linear model is one of the most important models in machine learning. And what we are going to do in this lecture, we're going to start with a practical data set that we are going to use over and over in this class. And then, if you remember the perceptron that we introduced in the first lecture, the perceptron is a linear model. So here is the sequence of the lecture. We are going to take the perceptron and generalize it to non-separable data. That's a relief, because we already admitted that separable data is very rare. And we would like to see what will happen when we have non-separable data. Then, we are going to generalize this further to the case where the target function is not a binary classification function, but a real-valued function. That also is a very important generalization. And linear regression, as you will see, is one of the most important techniques that is applied mostly in statistics and economics, and also in machine learning. Finally, as if we didn't do enough generalization already, we are going to take this and generalize it to a nonlinear case. All in a day's work, all in one lecture. It's a pretty simple model. And at the end of the lecture, you will be able to actually deal with very general situations. And you may ask yourself, why am I calling the lecture Linear Model when I'm going to talk about nonlinear transformation? Well, you'll realize that nonlinear transformation remains within the realm of linear models. That's not obvious. We will see how that materializes. So that's the plan.

Now, let's look at a real data set that we are going to use, and that will be available to you to try different ideas on. And it's very important to try your ideas on real data.
Regardless of how sure you are when you have a toy data set that you generate, you should always go for real data sets and see how the system that you thought of performs in reality. So here is the data set. It comes from ZIP codes at the post office. So people write the ZIP code. And you extract individual characters, individual digits. And you would like to take the image, which happens to be 16 by 16 gray-level pixels, and be able to decipher what is the number in it. Well, that looks easy, except that people write digits in so many different ways. And if you look at it, there will be some cases like this fellow. Is this a 1 or a 7? Is this a 0 or an 8? So you can see that there is a problem. And indeed, if you get a human operator to actually read these things and classify them, they will probably be making an error of about 2.5%. And we would like to see if machine learning can at least equal that, which means that we can automate the process, or maybe beat that. So this is a data set that we are going to work with.

Let's look at it a little bit more closely to see how we input it to our algorithm. We have one algorithm so far, which is the perceptron learning algorithm. We are going to try it on this. And then we are going to generalize it a little bit. The first item is the question of input representation. What do I mean? This is your input, the raw input, if you will. Now this is 16 pixels by 16 pixels. So there are 256 real numbers in that input. If you look at the raw input x, this would be x_1, x_2, x_3, dot, dot, dot, dot, and x_256. That's a very long input to encode such a simple object. And we add our mandatory x_0. Remember, in linear models, we have this constant coordinate, x_0 equals 1, which we add in order to take care of the threshold. So this will always be in the background, whether we mention it or not. If you take this raw input and try the perceptron directly on it, you realize that the linear model in this case, which has a bunch of parameters, has really just too many parameters. It has 257 parameters. If you are working in a 257-dimensional space, that is a huge space. And the poor algorithm is trying to simultaneously determine the values of all of these w's based on your data set. So the idea of input representation is to simplify the algorithm's life. We know something about the problem. We know that it's not really the individual pixels that matter. You can probably extract some features from the input, and then give those to the learning algorithm and let the learning algorithm figure out the pattern.

So this gives us the idea of features. What are features? Well, you extract the useful information. And as a suggestion, a very simple one, let's say that in this particular case, instead of giving the raw input with all of the pixel values, you extract some descriptors of what the image is like. For instance, you look at this. Depending on whether this is the digit 8 or the digit 1, et cetera, there is a question of the intensity, the average intensity. 1 doesn't have too many black pixels. 8 has a lot. 5 has some. So if you simply add up the intensity of all the pixels, you probably will get a number that is related to the identity. It doesn't uniquely determine it, but it's related. It's a higher-level representation of the raw information there. Same with symmetry-- if you think of the digit 1, 1 will be symmetric. If you flip it upside down, or you flip it right and left, you will get something that overlaps significantly with it.
So you can also define a symmetry measure, which means that you take the symmetric difference between something and its flipped versions, and you see what you get. If something is symmetric, things will cancel because it's symmetric. You'll get a very small value. And if something is not symmetric, let's say like the 5, you will get lots of values in the symmetric difference. And you will get a high value for that. So what you are measuring is the anti-symmetry. You take the negative of that, and you get the symmetry. So you get another guy, which is the symmetry. So now, x_1 is the intensity variable, x_2 is the symmetry variable. Now admittedly, you have lost information in that process. But the chances are that what you lost is mostly the irrelevant information. So this is a pretty good representation of the input, as far as the learning algorithm is concerned. And you went from 257-dimensional to 3-dimensional. That's a pretty good situation. And you probably realize that having 257 parameters is bad news for generalization, if you extrapolate from what we said. Having 3 is a much better situation. So this is what we are going to work with. When you take the linear model in this case, you just have w_0, w_1, and w_2. And that's what the perceptron algorithm, for example, needs to determine.

Now let's look at the illustration of these features. You have these as your inputs. And x_1 is the intensity, x_2 is the symmetry. What do they look like? They look like this. This is a scatter diagram. Every point here is a data point. It's one of the digits, one of the images you have. And I'm taking the simple case of just distinguishing the 1's from the 5's. So I'm only taking digits that are 1's or 5's. And you can always take other digits versus each other, and then combine the decision. If you can solve this unit problem, you can generalize it to the other problem. So when you put all the 1's and all the 5's in a scatter diagram, you realize for example that the intensity of the 5's is usually more than the intensity of the 1's. There are more pixels occupied by the 5's than the 1's. This is the coordinate which is the intensity. And indeed, the red guys, which happen to be the 5's, are tilted a little bit more to the right, corresponding to the intensity. If you look at the other coordinate, which is symmetry, the 1 is often more symmetric than the 5. Therefore, the guys that happen to be the 1's, which are the blue, tend to be higher on the vertical coordinate. And just by these two coordinates, you already see that this is almost linearly separable. Not quite, but it's separable enough that if you pass a boundary here, you will be getting most of them right. Now you realize that it's impossible really to ask to get all of them right because, believe it or not, this fellow is a 5, at least meant to be a 5 by the guy who wrote it. So we have to accept the fact that there will be stuff that is completely undoable. And we will accept an error. It's not a zero error. But hopefully, it's a small error. So this is what the features look like.

Now what does the perceptron learning algorithm do? What it does is shown in this complicated figure, which traces the evolution of E_in and E_out as a function of the iteration number. When you apply the perceptron learning algorithm, you apply it only to E_in. E_in is the only value you have. E_out is sitting out there. We don't know what it is. We just hope that E_in tracks it well. Let's look at the figure. These are the iteration numbers.
So this is the first misclassified example. You go and apply the perceptron learning algorithm again, again, again, for 1000 iterations. As you do that, E_in, which is the green curve, will go down and sometimes will go up. We realize that the perceptron learning algorithm takes care of one point at a time, and therefore may mess up other points while it's taking care of a point. So in general, it can go up or down. But the bad news here is that the data is not linearly separable. And we made the remark that the perceptron learning algorithm behaves very badly when the data is not linearly separable. It can go from something pretty good to something pretty bad, in just one iteration. So this is a very typical behavior of the perceptron learning algorithm. Because the data is not linearly separable, the perceptron learning algorithm will never converge. So what do we do? We force it to terminate at iteration 1000. That is, we stop at 1000 and take whatever weight vector we have. And we call this g, the final hypothesis of the perceptron learning algorithm. Now we obviously look at this, and we say, if only I took this guy. This is a better guy than the other. But you know, you're just applying the algorithm and cutting it off. Now, one of the things you observe from here-- I plotted E_out. You're not going to be able to plot E_out in a real problem that you deal with, if E_out is really an unknown function. You may be able to estimate it using some test examples. But all you need to know here is that E_out is drawn here for illustration, just to tell you what is happening in reality as you work on the in-sample error. And in this case, you find that E_out actually tracks E_in pretty well. There is a difference. So if we go from here to here, that's our epsilon. It's a big epsilon. But the good news is that it tracks it. When this goes down, this goes down. When this goes up, this goes up. So if you make your decision based on E_in, the decision based on E_out will also be good. That's good for generalization. And that is one of the advantages of something as simple as the perceptron learning algorithm. It doesn't have too many parameters. And because of our efforts in getting only three features, it has only three parameters now. So the chances are that it will generalize well, which it does.

Now what does the final boundary look like? This is only the illustration here, it's just-- this is the evolution. Eventually, you end up with a hypothesis. The hypothesis would separate the points in the scatter diagram you saw. So what does it look like? Well, it looks like this. This is your boundary. This is the final hypothesis, which corresponds to the hypothesis you got at the final iteration. Well, it's OK, but definitely not good. It's too deep into the blue region. You would have been better off doing this. And the chances are that maybe earlier guys that had a better in-sample error would have done that. But that's what you have to live with, if you apply the perceptron learning algorithm.

So now we go and try to modify the perceptron learning algorithm in a very simple way, the simplest modification you can ever imagine. So let's see what happens. This is what the PLA did, right? And when we looked at it, we said: if we only could keep this value. Well, this value is not a mystery. It happened in your algorithm. You can measure it explicitly. It's an in-sample error. And you know that it's better than the value you ended up with.
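The bookkeeping just hinted at, keeping the best intermediate hypothesis, is the pocket algorithm described next. Here is a minimal hypothetical sketch of it (not from the lecture); it assumes a data matrix X whose first column is the constant coordinate x_0 = 1, and a vector y of +/-1 labels:

```python
import numpy as np

def pocket_pla(X, y, w_init=None, max_iters=1000):
    """PLA with a 'pocket': keep the best-so-far weights on non-separable data.

    X: N x (d+1) data matrix whose first column is the constant 1 (x_0).
    y: length-N vector of +/-1 labels.
    """
    N, d1 = X.shape
    w = np.zeros(d1) if w_init is None else np.asarray(w_init, dtype=float).copy()

    def e_in(v):  # fraction of misclassified points
        return np.mean(np.sign(X @ v) != y)

    best_w, best_e = w.copy(), e_in(w)
    for _ in range(max_iters):           # forced termination, e.g. at 1000
        mis = np.where(np.sign(X @ w) != y)[0]
        if len(mis) == 0:                # separable after all: done
            break
        n = np.random.choice(mis)        # pick one misclassified point
        w = w + y[n] * X[n]              # standard PLA update
        e = e_in(w)
        if e < best_e:                   # better than the pocket? swap it in
            best_w, best_e = w.copy(), e
    return best_w                        # report the pocket, not the last w
```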
So in spite of the fact that you're doing these iterations according to the prescribed perceptron learning algorithm rule-- modify the weights according to one misclassified point-- you can keep track of the total in-sample error of the intermediate hypotheses you got. Right? And only keep the guy that happens to be the best throughout. So you're going to continue as if it's really the perceptron learning algorithm. But when you are at the end, you keep this guy and report it as the final hypothesis. What an ingenious idea! Now the reason the algorithm is called the pocket algorithm is because the whole idea is to put the best solution so far in your pocket. And when you get a better one, you take the better one, put it in your pocket, and throw away the old one. And when you are done, report the guy in your pocket. We can do that.

What does this diagram look like, when you are looking at the pocket algorithm? Much better. You can look at these values, and each is the best value so far. Here, we went down. And here, we indeed went down. Here, we went up. You see this green thing? Here, we didn't, because the good guy is in our pocket and that's what we're reporting the value for. And we continued with it until we dropped again. And we dropped again. And we never changed that, because there was never a better guy than this guy. So when we come to iteration 1000, we have this fellow. Now when you do that, you can use the perceptron learning algorithm with non-separable data, terminate it by force at some iteration, and report the pocket value. And that will be your pocket algorithm. And if you look at the classification boundary, PLA versus pocket, this is what we had with the perceptron learning algorithm. We complained a little bit that it's too deep in the blue region. And when you look at the other guy, which is the pocket algorithm, it looks better. It actually does what we thought it would do. It separates them better. Still, obviously, it cannot separate them perfectly. Nothing can, because they are not linearly separable. On the other hand, this is a good hypothesis to report. So with this very simple algorithm, you can actually deal with general inseparable data-- but inseparable data in the sense that it's basically separable. However, it really is-- this guy is bad, and this guy is bad. There's nothing we can do about them. But they are few, so we will just settle for this. We'll see that there are other cases of inseparable data that are truly inseparable, in which case we have to do something a little bit more drastic. So that's as far as the classification is concerned.

Now we go to linear regression. The word regression simply means real-valued output. There is absolutely no other connotation to it. It's a glorified way of saying my output is real-valued. And it comes from earlier work in statistics. And there's so much work on it that people could not get rid of that term. And it is now the standard term. Whenever you have a real-valued function, you call it a regression problem. So that is out of the way. Now, linear regression is used incredibly often in statistics and economics. Every time you say: are these variables related to that variable, the first thing that comes to mind is linear regression. Let me give an example. Let's say that you would like to relate your performance in different types of courses to your future earnings. This is what you do. You look at-- here are the courses I took. Here are the math, science, engineering, humanities, physical education, and other courses.
And you get your GPA in each of them. So here, I got 3.5. Here, I got 3.8. Here, I got 3.2. Here, I got 2.8. 2.8? No, no. That doesn't happen at Caltech! You go for the other one, et cetera. So you just have the GPA's for the different groups of courses. Now, you say-- someone graduates. I'm going to look 10 years after graduation, and see their annual income. So the inputs are the GPA's in the courses at the time they graduated. The output is how much money they make per year, 10 years after graduation. Now you ask yourself: how do these things affect the output? So you apply linear regression, as you will see it in detail, and you finally find that maybe the math and sciences are more important. Or maybe all of that is an illusion. It was actually the humanities that are important. You don't know. You will see the data, and the data will tell you what affects what. And in any other situation like that, people simply resort to linear regression.

So in order to build it up, we are going to use the credit example again, in order to be able to contrast it with the classification problem we have seen before. What do we have? In classification, we had the credit approval, yes or no. That's a classification function, a binary function, which says the output is +1 or -1. In the case of regression, we will have a real-valued function. And the interpretation in this case is that you're trying to predict the proper credit line for a customer. The customer applies. And it's not a question of approving the credit or not. Do you give them a credit limit of $800 or $1,200 or $30,000 or what, depending on their input? So this is a real-valued function. And we are going to apply regression. Now you take the input. This is the same input as we had before, data from the applicant that are related to the credit behavior, so the age, the salary. I suspect that the salary will figure very significantly now when you're trying to set the credit line, because if someone is making 30,000 a year, you probably are not going to give them a credit line of 200,000. So you can see that this will probably be affected. And there are other guys that merely have to do with the stability of the person. Years in residence. If the person has been in the same residence for 10 years, they are unlikely to skip town. On the other hand, if they have been there for only one month, well, you don't know-- that type of thing. So you have these variables. You encode them as the input x.

And then your output in this case, which is the linear regression output, is a hypothesis which takes this particular form. Let's spend some time with it to understand it. First, it's regression because the output is real. It's linear regression because the form, in terms of the input, is linear. Now, we have seen this before. We sum up from basically 1 to d. These are the genuine inputs, the weighted version of the input variables. And then we add the mandatory x_0, which is 1, which takes care of the threshold, which is w_0. This is the form we have seen before, except that when we saw it before, we took this as a signal where we only care about its sign. If it's plus, we approve credit. If it's minus, we don't approve credit. And we treated it as a credit score, per se, when you take out the threshold. Now in this case, this is the output. We don't threshold it. We don't say it's +1 or -1. There is a w_0 in it. But we don't take it as +1 or -1. We take it as a real number. And this is the dollar amount we are going to give you as a credit line.
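In symbols, the hypothesis just described is

\[ h(x) \;=\; \sum_{i=0}^{d} w_i x_i \;=\; w^{\mathsf{T}} x, \qquad x_0 = 1, \]

where the sum from 1 to d covers the genuine inputs, and the i = 0 term carries the threshold w_0.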
Now the signal here will play a very important role in all the linear algorithms. This is what makes the algorithm linear. And whether you leave it alone as in linear regression, you take a hard threshold as in classification or, as we will see later, you take a soft threshold, and you get a probability and all of that-- all of these are considered linear models. And the algorithm depends on this particular part, the signal, being linear. We also took the trouble to put it in vector form. And the vector form will simplify the calculus that we do in this lecture in order to derive the linear regression algorithm. But you can always-- if you hate the vector form, you can always go back to this. There is nothing mysterious about this. This simply has a bunch of parameters, w_0, w_1, up to w_d. And if I'm trying to minimize something, you can minimize it with respect to scalar variables, which requires only very primitive calculus. But we obviously will do it in the shorthand version, which is the vector or the matrix form, in order to be able to get the derivation in an easier way. So that's the problem.

What is the data set in this case? Well, it's historical data, but it's a different set of historical data. The credit line is decided by different officers. Someone sits down and evaluates your application, and decides that this person gets a 1000 limit, this person gets a 5000 limit, and whatnot. All we are trying to do in this particular example is to replicate what they're doing. We don't want the credit officer to do that. The credit officers sometimes are inconsistent with one another. They may have a good day or a bad day. So we'd like to figure out what pattern they collectively have in deciding the credit, and have an automated system decide that. That's what the linear regression system will do for us. The historical data here are again examples from previous customers. And for the previous customers-- this is x_1, and this is y_1. So this is the application that the customer gave. And this is the credit line that was given to them. No tracking of credit behavior-- we're just trying to replicate what the experts do in this case. And then you realize that each of these y's is actually a real number, which is the credit line that is given to customer x_n. And that real number will likely be a positive integer. It's a credit line. It's a dollar amount. And what we are doing is trying to replicate that. That's the statement of the problem.

So what does linear regression do? First, we have to measure the error. We didn't talk about that in the case of classification, because it was so simple. Here, it's a little bit less simple. And then, we'll be able to discuss the error function for classification as well. What do we mean by that? You will have an algorithm that tries to find the optimal weights. These are the weights you're going to have. These weights are going to determine what hypothesis you get. Some hypotheses will approximate f well. Some hypotheses will not. We would like to quantify that, to give guidance to the algorithm in order to move from one hypothesis to another. So we will define an error measure. And the algorithm will try to minimize the error measure by moving from one hypothesis to the next. If you take linear regression, the standard error function used there is the squared error. Let me write it down. Well, if you had classification, there is only a simple agreement or disagreement on a particular example. You either got it right or got it wrong. There is nothing else.
Therefore, in that case, we just defined a binary error. Did you get it right or wrong? And we found the frequency of getting it wrong. And we got the E_in and E_out. Here, you are estimating a credit line. So if the guy gets 1000, and you tell them 900, that's not too bad. If the guy gets 1000, and you tell them 5000, that's bad. So you need to measure how bad the situation is. And you define an error measure, and you define it by the simple squared error. Now, squared error doesn't have an inherent merit here. It just happens to be the standard error function used with linear regression. And its merit really is the simplicity of the analytic solution that we are going to get. But when we discuss error measures in the next lecture, we will go back to the principle: does the error measure matter? Why? How do we choose it? Et cetera. This will be answered in a principled way next time. But for this time, let's take this as the standard error measure we are going to use. When you look at the in-sample error, you use the error measure on each particular example n, n from 1 to N. For each example, this is the contribution of the error. Each of these is affected by the same w, because h depends on w. So as you change w, this value will change for every example. And this is the error on that example. And if you want to get the total in-sample error, you simply take the average of those. That will give me a snapshot of how my hypothesis is doing on the data set. And now, we are going to ask our algorithm to take this error and minimize it.

Let's actually just look at what happens, as an illustration. This is the simplest case for linear regression. The input is one-dimensional. I have only one relevant variable. I want to relate your overall GPA to your earnings 10 years from now. Your overall GPA is x. Your earnings 10 years from now is y. That's it. OK? [CHUCKLES] I would have properly called this x_1 according to our notation. And then there would be an x_0, which is the constant 1. But I didn't bother, because I have only one variable. But this is what we have. So you look at this. And you see that, for different x's, you have these guys. Wow. Your earnings are going down with-- Well, that may not have been the example that is drawn here. What linear regression does is it tries to produce a line, which is what you have here, that tries to fit this data according to the squared-error rule. So it may look like this. And in this case, the threshold here depends on w_0. The slope depends on w_1, which is the weight for x. And that is the solution you have. Now you didn't get it right, but what you got is some errors. And you realize that-- this is the error on the first example. This is the error on the second example. And if you sum up the squares of the lengths of these bars, that is the in-sample error that we defined in the previous viewgraph.

Well, linear regression can apply to more than one dimension. And I can plot 2 dimensions here, just to illustrate it. It's the same principle. What you have here is you have x_1-- if I can get the pointer-- OK, we'll leave it to rest. We have x_1 and x_2. And in this case, the linear thing is really a plane. And you're again not separating, but trying to estimate these guys. And you're making errors. And in general, when you go to a higher-dimensional space, the line-- which is the reason why we call it linear-- is not really a line. It's a hyperplane, one dimension short of the space you are working with.
And that's what you are trying to use to approximate the guys. Now let's look at the expression for E_in. That is the analytic expression we are going to try to minimize. And that will make us derive the linear regression algorithm. We wrote this before. And you have the value of the hypothesis minus y_n, squared. That is because it's a squared error. And because it's linear regression, this value, h of x_n, happens to be w transposed x_n. It's a linear function of x_n. Now let us try to write this down in vector form. I will explain this in detail. But let's look at this. Instead of the summation, all of a sudden, I have a norm squared of something. Capital X-- I haven't seen capital X before. I haven't seen the vector y before. Well, it's basically a consolidation of the different x_n's here. x_n is a vector. So you put the vectors in a matrix. You call it X. And you put the scalars, the y_n, in a vector. And you call it y. The definition of capital X and the vector y is as follows. For the matrix X, what you do-- you put your first example here. So this would be the constant coordinate 1, the first coordinate, second coordinate, up to the d-th coordinate, the last coordinate. And then you go for the second example, and do the same, and construct this matrix. And for y, you put the corresponding outputs. This is the output for the first example, the output for the second example, up to the output for the last example. Now one thing to realize about the matrix X is that it's pretty tall. The typical situation is that you have few parameters. We reduced them to three, for example, in the case of the classification of the digits. But you usually have many, many examples, in the thousands. So this will be a very, very long matrix. Now the way you take this-- well, the norm squared will be simply this vector transposed times itself. And when you do it, you realize that what you are doing is summing up contributions from the different components. And each component happens to be exactly what you are having here. So this becomes a shorthand for writing this expression.

Now, let's look at minimizing E_in. When you look at minimizing, you realize that the matrix X, which has the inputs of the data, and y, which has the outputs of the data, are, as far as we are concerned, constants. This is the data set someone gave me. The parameter I'm actually playing with, in order to get a good hypothesis, is w. So E_in is a function of w. And w appears here. And the rest are constants. If I do any calculus of minimization, it is with respect to w. So I try to minimize this. And what you do-- you get the derivative and equate it with 0, except here, it's a glorified derivative. You get the gradient, which is the derivative with respect to a bunch of variables all at once. And there is a formula for it, which is pretty simple in this case. I will explain it. By the way, if you hate this, and you want to make sure it's true, because linear regression is so important, you can always go for the scalar form: get partial E by partial every w-- partial w_0, partial w_1, up to partial w_d-- get a formula that is a pretty hairy one, and then try to reduce it. And-- surprise, surprise-- you will get the solution here that we have in matrix form, in two steps. Now if you look at this, deal with it in terms of calculus as if it were just a simple square. If this were a simple square, and w were the variable, what would the derivative be? You will get 2 sitting outside. Well, you've got it here.
And then you will get the same thing in a linear form. You got it here. And then you will get whatever constant was multiplied by w to sit outside, which you got here. You just got it here with a transpose, because this is really not a square. This is the transpose of this times itself. That's where you get the transpose. Pretty straightforward and standard matrix calculus. So that's what you have. And then you equate this to 0, but it's a fat 0. It's a vector of 0's. You want all the derivatives to be 0 all at once. And that will define a point where this achieves a minimum. Now, you would suspect that the solution will be simple, because this is a very simple quadratic form. And indeed, the solution is simple. And if you look at it, you realize that if I want this to be 0, then I want this to cancel out. So when I multiply X transposed X by w, I get the same thing as X transposed y. So they cancel out, and I get my 0. So you write this down, and you find that this is the situation. I want this term to be equal to this term. And that will give me the 0. The interesting thing is that in spite of the fact that the matrix X is a very tall matrix, definitely not square, hence not invertible, X transposed X is actually a square matrix, because X transposed is this way and X is this way. You multiply them, and you get a pretty small square matrix. And as we will see, the chances are overwhelming that it will be invertible. So you can actually solve this very simply, by inverting this. You multiply by the inverse in this direction. You multiply by this. This will disappear, and you will get an explicit formula for w, which you were trying to solve for. And when you do that, you will get w equals this funny symbol, X dagger. What is X dagger? This is simply a shorthand for writing this. So I got the inverse of that, and then multiplied it by this. So this is really what gets multiplied by y. I call it X dagger. And indeed, it gets multiplied by y to give me my w.

Now the X dagger is a pretty interesting notion. It's called the pseudo-inverse of X. X, being a non-square matrix, does not have an inverse. But it does have a pseudo-inverse. And the pseudo-inverse has interesting properties. For example, if you take this, the X dagger, and multiply it by X-- so X dagger times X-- what do you get? You add X here. You get X transposed X. Oh, I have (X transposed X) to the -1 here. So they cancel out, and I get an identity. So when I multiply X dagger by X, I get the identity. So it's OK to call it an inverse of sorts. It doesn't work the other way around. The other way around gives us an interesting matrix, which we'll talk about later. But basically, this is the essence of it. If we were in a trivial situation where X was square-- I have 3 parameters, and I have 3 examples to determine them-- that can be solved perfectly. I can actually get this to be 0. And how would you get it to be 0? You would just multiply by the proper inverse of X in this case, and you will get X inverse y. So this is pretty much similar, when X is a tall one. And we are not going to get a 0. We're just going to get a minimum, using the pseudo-inverse.

Now I would like you to appreciate the pseudo-inverse from a computational point of view. This is the formula for the pseudo-inverse that you will need to compute, in order to get the solution for linear regression. So let's look at it. Something is inverted. And when you see an inversion of a matrix, you say, oh, computation, computation.
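Collecting the derivation just sketched in symbols:

\[ E_{\text{in}}(w) = \frac{1}{N}\,\lVert Xw - y\rVert^2, \qquad \nabla E_{\text{in}}(w) = \frac{2}{N}\,X^{\mathsf{T}}(Xw - y) = 0, \]

which gives the normal equations and their solution,

\[ X^{\mathsf{T}}Xw = X^{\mathsf{T}}y \quad\Longrightarrow\quad w = X^{\dagger}y, \qquad X^{\dagger} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}, \]

assuming X transposed X is invertible.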
If this were a million by a million, I'm in trouble. If this is 5 by 5, I'm in good shape. So we'd like to know, what kind of matrix do we have here? Well, nothing mysterious about what's inside this. You have this fellow, which is X transposed. It's d plus 1: d is the dimension of your input, and 1 is the added constant coordinate. So these are the number of parameters. This would be 3 in the digit classification guy. We have only x_1 and x_2, so d equals 2. d plus 1 equals 3, which corresponds to x_0, x_1, x_2, or to w_0, w_1, w_2. So this is 3 times N. N is the scary one. That's the number of examples. That could be in the thousands. Now you multiply this by X, and that's what you have. The multiplication will be-- multiplication is not that difficult. Even if this is 10,000, I can multiply this by 10,000. But the good news is that when I go to this guy, I will be dealing with a simpler guy. Let's just complete the formula first. This is what you have. This is what you are computationally doing. And if you look at what's inside here, it completely shrinks. That is what the matrix inside is. It's just 3 by 3 in our case. You can invert that. Just accumulating it is the part where you have to go through all of the examples. And there's a very simple way of doing it. It's not that difficult to get this fellow. And you can see now that, oh, good thing that we had 3 parameters. If we had the 257 parameters to begin with, this would have been 257 by 257. Not that this would discourage us. But if you go for some raw inputs, you can get something really in the thousands, or sometimes even more than that. So the computational aspect of this is very simple. And there are so many packages for computing the pseudo-inverse, or outright getting the solution for linear regression, that you will never have to do that yourself, except if you're doing something very specialized. And if you do have something very specialized, it's not that bad. So that is the final matrix. And the final matrix will have the same dimensions as this guy. And if you look at it, this will be multiplied by what? Multiplied by y, which is y_1, y_2, y_3, up to y_N, corresponding to the different outputs. And then, as a result of that, you will get the w's-- w_0, w_1, up to w_d. Indeed, if you multiply this by an N-tall vector, you will get a (d plus 1)-tall vector, and that's what we expect.

Let's now flash the full linear regression algorithm here. That's a crowded slide. That is what you do. The first thing is you take the data that is given to you, and put them in the proper form. What is the proper form? You construct the matrix X and the vector y. And these are what we introduced before. This will be the input data matrix, and this will be the target vector. And once you construct them, you are basically done, because all you are going to do-- you plug this into a formula, which is the pseudo-inverse. And then you will return the value w, which is the multiplication of that pseudo-inverse with y. And you are done. Now you can call this one-step learning if you want. With the perceptron learning algorithm, it looked more like learning, because I have an initial hypothesis. And then I take one example at a time, and try to figure out what is going on, move this around, et cetera. And after 1000 iterations, I get something. It looks more like the way we learn. We learn in steps. This looks like cheating. You give me the thing, and [MAKES SOUND]. And you have the answer. Well, as far as we are concerned, we don't care how you got it.
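As a sketch (hypothetical code, not from the lecture), the whole algorithm is a few lines; in practice you would let a library routine such as numpy's pinv or lstsq do the same job:

```python
import numpy as np

def linear_regression(X, y):
    """One-step linear regression via the pseudo-inverse.

    X: N x (d+1) input data matrix, first column the constant 1 (x_0).
    y: length-N vector of real-valued targets.
    Returns the w that minimizes (1/N) * ||X w - y||^2.
    """
    # X_dagger = (X^T X)^{-1} X^T, assuming X^T X is invertible
    X_dagger = np.linalg.inv(X.T @ X) @ X.T
    return X_dagger @ y
    # numerically safer equivalent: np.linalg.pinv(X) @ y
```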
If it's correct and gives you a correct E_out, you have learned. And because this is so simple, this is a very popular algorithm that is used often, and used often as a building block for other guys. We can afford to use it as a building block, because the step here will be so simple that we can become more sophisticated in using it. Just one remark about the inversion-- this has to be invertible in order for this formula to hold. Now the chances that this will be invertible in a real application are close to 1. The reason is the following. Usually, you use very few parameters and tons of examples. You would have to be very, very, very unlucky to have these so dependent on each other that you cannot even capture the dimensionality, which is the number of columns. The number of columns is 3, 5, 10, and you have 10,000 of those. So the chances are overwhelming in a real problem that this will be invertible. Nonetheless, if it is not invertible, you can still define the pseudo-inverse. It will not be unique and has some elaborate features, but it's not a big deal. That is not a situation you will encounter in practice. So now we have linear regression.

I'm going to tell you that you can use linear regression not only for a real-valued function, for regression problems. You're also going to be able to use it for classification. Maybe the perceptron is now going out of business. It has a competitor now. And the competitor has a very simple algorithm. So let's see how this works. The idea is incredibly simple. Linear regression learns a real-valued function. Yeah, we know that. That is the real-valued function. The value belongs to the real numbers. Fine. Now the main observation, the ingenious observation, is that binary-valued functions, which are the classification functions, are also real-valued. +1 and -1, among other things, happen to be real numbers. So linear regression is not going to refuse to learn them as real numbers. Right? So what do we do? You use linear regression in order to get a solution, such that the solution is approximately y_n in the mean-squared sense. For every example, the actual value of the signal is close to the numerical +1 or the numerical -1. That's what linear regression does. Now, having done that with y_n equal to +1 or -1, you realize that in this case, if you take the classification version of it-- you take the sign of that signal in order to be able to classify as +1 or -1. If the value is genuinely close to +1 or -1 numerically, then the chances are that when it's +1, this would be positive. And when it's -1, it's negative. The chances are, in getting close to the right number, you will end up on the right side of zero. And if you are on the right side of zero, the classification will be correct. So if you take this, and then plug it in as weights for classification, you will likely get something that agrees with the +1 or -1 labels. That's a pretty simple trick, because it's almost free. All you need to do-- I have a classification problem. Let's run linear regression. It's almost for free. Do this one-step learning, get a solution, and use it for classification.

Now, let's see if this is as good as it sounds. Well, the weights are good for classification, so to speak, just by conjecture. But they also may serve as good initial weights for classification. Remember that the perceptron algorithm, or the pocket algorithm, are really very slow to get there. You start with a random guy. Half the guys are misclassified.
And it just goes around, tries to correct one, messes up the others, until it gets to the region of interest. And then it converges. Why not give it a jump start? Why not run linear regression first, and get the w's? We know that the w's are OK, but they are not really tailored toward classification. But they're a good initial condition. Feed those to the pocket algorithm, and let it run to the solution, which is a classification solution. That's a pretty nice idea.

So let's actually look at the linear regression boundary. Now, I take an example here. Again, I have the +1 class and the -1 class. And I applied-- we're trying to find, what is the linear regression solution? Now, we remember, the blue region and the pink region belong to classification. When you talk about linear regression, you have the value here. And the signal is 0 here. The signal is positive, more positive, more positive, more positive. And here, the signal is negative, more negative, more negative, more negative. There is a real-valued function that we are trying to interpret as a classification by taking the sign. Now, if you look at what linear regression is trying to do when you use it for classification, all of these guys have a target value of -1. It is actually trying to make the numerical value equal to -1 for all of them. So the chances are, these will be -1. This will be -2, -3. And the linear regression algorithm is very sad about that. It considers it an error, in spite of the fact that, when we plug it into the classification, it just has the correct sign. And that's all we care about. But we are applying linear regression. It is actually trying very hard to make all of them -1 at the same time, which obviously it cannot. And you can see now the problem with linear regression. In its attempt to make this -8 into a -1, it moved the boundary to the level where it's in the middle of the red region. And now, it's very happy, because it minimized its error function. But that's not really the classification. Nonetheless, it's a good starting point. And then you apply the classification algorithm, which forgets about the values and tries to adjust the boundary according to the classification. And you will get a good boundary. That's the contrast between applying linear regression for classification and linear classification outright.

Now we are done. I'm going to start on nonlinear transformation. And I'm going to give you a very interesting tool to play with. Here is the deal. You probably realized that, even when dealing with non-separable data, we are dealing with non-separable data that are really basically separable, with few exceptions. But in reality, when you take a real problem, a real-life problem, you will find that the data you are going to get could be anything. It could be, for example, something that looks like this. So you want to classify these as +1's and these as -1's. Let's take the classification paradigm here. Now I can put the line anywhere. And obviously, I'm in trouble, because this is not linearly separable, not even by a long shot. You can look at this and say: I can see the pattern here. Closer to the center, you have blues. Closer to the periphery, you have reds. So it would be very nice if I could apply a hypothesis that looks like this. Yes. The only problem is that that's not linear. We don't have the tools to deal with that, yet. Wouldn't it be nice if, in two viewgraphs, you could use linear regression and linear classification, the perceptron or the pocket, to apply to this guy? That's what will happen.
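The jump start described above, as a short hypothetical sketch reusing the linear_regression and pocket_pla functions from the earlier sketches (y holds the +/-1 labels; X carries the constant coordinate x_0 = 1):

```python
# assumes numpy as np and the sketches above are in scope
w_init = linear_regression(X, y)       # one-step fit to the +/-1 targets
w = pocket_pla(X, y, w_init=w_init)    # fine-tune the boundary for classification
predict = lambda x: np.sign(w @ x)     # classify an (already augmented) point x
```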
I told you this is a practical lecture. So we take another example of nonlinearity. We take the credit line. Now if you look at the credit line, the credit line is affected by years in residence. We argued that if someone has been in the same residence for a long time, there is stability and trustworthiness. And if someone has been there a short time, there's a question mark. Now it is one thing to say that this is a variable that affects the output. It is another thing to say that this is a variable that affects the output linearly. It would be strange, if I'm trying to determine a credit line, to decide that the credit line will be proportional to the time you have lived in residence. If you have been there 20 years instead of 10 years, I will give you twice the credit line? It doesn't make sense. Because stability is established probably by the time you get to 5 years. After that, it's diminishing returns. So it would be very nice if, instead of using the linear one, I could define nonlinear features, which are the following. Let's take the condition, the logical condition, that the years in residence are less than 1. And in my mind, I'm considering that this is not very stable. You haven't been there for very long. And another guy, which is x_i greater than 5: you have been there for more than 5 years, so you are stable. The notation here, when I put something between these brackets, means that this returns 1 if the condition is true, and returns 0 if the condition is false. So this is 1 or 0, and this is 1 or 0. Now if I had those as variables in my linear regression, they would be much more friendly to the linear formula in deciding the credit line than the crude input. But these are nonlinear functions of x_i. And again, we have the nonlinearity. And we wonder if we can apply the same techniques to a nonlinear case.

This is the question. Can we use linear models? The key question to ask is: linear in what? What do I mean? Look at linear regression. What does it implement? It implements this. This is indeed a linear formula. And when you look at the linear classification counterpart, it implements this. This is a linear formula, and the algorithm being simple depends on this part being linear. And then you just make a decision based on that signal. Now, these you would think are called linear because they are linear in the x's, which they are. Yeah, I get these inputs. And I combine them linearly. And I get my surface. That's why I'm calling it linear. However, you will realize that, more importantly, these guys are linear in w. Now when you go from the definition of a function to learning, the roles are reversed. The inputs, which are supposed to be the variables when you evaluate a function, are now constants. They are dictated by the training set. They're just a bunch of numbers someone gave me. The real variables, as far as learning is concerned, are the parameters. The fact that it's linear in the parameters is what matters in deriving the perceptron learning algorithm and the linear regression algorithm. If you go back to the derivation, it didn't matter what the x's were. The x's were sitting there as constants. And the linearity in w is what enabled the derivation. That is what makes the algorithms work: the linearity in the weights. Now that opens a fantastic possibility, because now I can take the inputs, which are just constants-- someone gives me data-- and I can do incredible nonlinear transformations to that data. And it will just remain more elaborate data, but constant.
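In symbols, the point just made: for any fixed nonlinear transform Phi of the inputs, the hypothesis

\[ h(x) \;=\; w^{\mathsf{T}}\,\Phi(x) \]

is still linear in the parameters w-- for example, Phi(x) could include coordinates like the brackets [[x_i < 1]] and [[x_i > 5]] above-- and linearity in w is all that the derivations of the perceptron and linear regression algorithms used.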
When I get to learn using the nonlinearly transformed data, I'm still in the realm of linear models, because the weight that will be given to the nonlinear feature will have a linear dependency. Let's look at an example. Let's say that you take x_1 and x_2. I omitted the constant x_0 here, for simplicity. And these are the guys that gave us trouble. These are the coordinates. This is x_1. This is x_2. These guys should map to +1. These guys should map to -1. I don't have a linear separator. OK, fine. These are data, right? So everything that appears within this box is just a bunch of constant x's and corresponding constant y's. Now I'm going to take a transformation. I'm going to call it phi. Every point in that space, I'm going to transform to another space. And my formula for the transformation will be this. I'm assuming here that the origin of the coordinate system is here. So I'm taking x_1 squared and x_2 squared. And you can see where I'm leading, because now I'm measuring distances from the origin. And that seems to be a helpful guy here. Now in doing this, all I did was take constants and produce other constants. Now, you can look at this and say: this is my training data. I take your original training data, do the transformation, and forget about the original one. Can you solve the problem in the new space? Oh, yes you can, because that's what they look like in the new space. All of a sudden, the red guys, which happen to be far away, will have bigger values for x_1 squared and x_2 squared. They will sit here. And the guys that are closer to the origin, by the time you transform them, will have smaller values here. So this is now your new data set. Can you separate this using a perceptron? Yes, I can. I can put a line going through here. Great. When you get a new point to classify, transform it the same way, classify it here, and then report that. That's the game. And there is really no limit, at least computationally, in terms of what you can do here. You can dream up really elaborate nonlinear transformations, transform the data, and then do the classification. There is a catch. And it's a big catch. I will stop here. And we'll continue with the nonlinear transformation at the beginning of the next lecture. And we'll take a short break now, before we go to the Q&A session. We have questions from the online audience.

MODERATOR: A popular question is how to figure out the nonlinear transformations in a systematic way, instead of from the data.

PROFESSOR: I said that the nonlinear transformation is a loaded question. And there will be two steps in dealing with it. I will talk about it a little bit more elaborately at the beginning of the next lecture. And then we are going to talk about the guidelines for the choice, and what you can do and what you cannot do, after we develop the theory of generalization, because it is very sensitive to the generalization issue. And that should not come as a surprise, because I can see that I can take the input, which is, let's say, two variables corresponding to two parameters. And I want the transformation to be as elaborate as possible, in order to stand a good chance of being able to separate them linearly. So I'm going to go all out. I'm just going to keep getting nonlinear coordinates-- x_1, x_1 squared, x_1 cubed, x_1 squared x_2, e to the x, just go on. Now at some point, you should smell a rat, because you realize that I have this very, very long vector and a corresponding number of parameters.
And generalization may become an issue, which it will. So there are guidelines for how far you can go. And also, there are guidelines for how you can choose them. Do I look at the data and figure out what is a good nonlinear transformation? Is this allowed? Is this not allowed? What are the ramifications? All of these will become clear only after you look at the theory part.

MODERATOR: OK. There's a question about slide 15, regarding the expression for E_in. How does the in-sample error here, or the out-of-sample error, relate to the probabilistic definition of last time?

PROFESSOR: OK. Here we dealt only with the in-sample error. So we decided on E_in. And in general in learning, you only have the in-sample error to deal with. You have on the side a guarantee that when you do well in-sample, you will do well out-of-sample. So you never handle the out-of-sample explicitly. You just handle the in-sample, and have the theoretical guarantee that what you are doing will help you out-of-sample. Now, the error measure here was a squared error. Therefore, when you define the in-sample error, you take the squared error and average it. And when you define the out-of-sample error, it's really the expected value of the squared error. Now in the case of binary classification, the error was binary. You're either right or wrong. So you can always define the in-sample error as also the average of the question: am I right or wrong on every point? So if you are right, there's no error and you get 0. If you are wrong, you get 1. So you ask yourself: what is the frequency of 1's in-sample? And that would give you the in-sample error. The expected value of that error happens to be the probability of error. That's why-- without going into expectations, and in-sample average versus out-of-sample expected value-- in the case of classification, we simply talked about frequency of error and probability of error; not because they are different, but just because they are simple to state. But in reality, the aspect of them that made them qualify as in-sample and out-of-sample is that the probability is the expected value of an error measure that happens to be a binary error measure. And the frequency of error happens to be the average value of that error measure.

STUDENT: So you showed us a very nice graph with a negative slope about the dependence of future income and--

PROFESSOR: This is unintentional. I didn't think of the income at the time I drew the graph. So any implication that you should really do worse in school in order to gain more money is-- I disown any such conclusion!

STUDENT: OK. But you mentioned the example of determining future income from grade point average, or at least finding some correlation. So the question I'm interested in is, where can we get data?

PROFESSOR: You can get-- obviously, the alumni association of every school keeps track of the alumni. And they send them questionnaires. And they have some of the inputs, and how much money they make. There are a number of parameters. So there will be a number of schools that have that. And this is actually used. If you realize that something is related to success or something, you can go back and revise your curriculum or revise your criteria. So the data is indeed available, if that's the question.

STUDENT: I mean, it's available in principle. But can we get it?

PROFESSOR: Oh, we get it. I thought it was a generic we. I don't-- obviously, the data will be anonymous after a while.
MODERATOR: A technical question. Why is w_0 included in the linear regression? There's some confusion about this. And on the same point, what do you do specifically in the binary case? How do you incorporate the +1's and -1's? Several people are asking about this. PROFESSOR: Let me answer one at a time. I'll talk about the threshold first-- why the threshold is there. Look at the linear regression line here. The linear regression line is not homogeneous. It doesn't pass through the origin. If I told you that you cannot use a threshold, then the constant part of the equation goes away, and the line you have would be forced to pass through the origin. Can you imagine trying to fit these points with such a line? Obviously it would end up down there with the negative slope, when you want to pass through the points up there. So obviously, I need the constant in order to get a proper model. And in general, there is an offset that depends on the values of the variables, and that offset is compensated for by the threshold. That's why we need the threshold in linear regression. What is the second question? MODERATOR: In the binary case, when you use y as +1 or -1, why does that just work? PROFESSOR: Well, if you apply linear regression, you have the following guarantee at the end: the hypothesis you get has the least squared error from the targets on the examples. That's what the linear regression algorithm has achieved. Now the outputs of the examples are +1 or -1, and we can put that together with the first statement. Then we realize that the output of my hypothesis is as close as possible to the values +1 or -1 in mean squared error. The leap of faith is that, if you are close to +1 versus -1, then the chances are that when you are close to +1, you are at least positive, and when you are close to -1, you are at least negative. If you accept that leap of faith, then the conclusion is that, when you take the threshold of the value of the signal from linear regression, you will get the classification right, because positive will give you +1 and negative will give you -1. This is not quite the case, because in the attempt to replicate all the points numerically, the signal from linear regression can become, let's say as I mentioned, +7 for some points and -7 for others. And the linear regression is trying to push the w, which is what will end up defining the boundary, in order to capture those numerical values. So in attempting to fit stuff that is irrelevant to the classification, it may mess up the classification. And that's why the suggestion is: don't use it as the final classifier. Just use it as an initial weight, and then use a proper classification algorithm, something as simple as the pocket algorithm, to fine-tune it further and get the classification part right, without having to suffer from the numerical angle. MODERATOR: So also on that, does it make a difference what you use? +1, -1, or something else? PROFESSOR: OK. If it's plus something and minus the same thing, it's a matter of scale. If it's plus one thing and minus another, and not symmetric, the difference will be absorbed in the threshold. So it really doesn't matter. It will just make things look different.
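Here is a minimal sketch of that two-step recipe on a toy data set (the data and all the constants are invented): linear regression on the +1/-1 targets via the pseudo-inverse, followed by pocket fine-tuning.

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy binary data: two overlapping blobs, labeled +1 and -1.
    X_raw = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
                       rng.normal(-1.0, 1.0, size=(50, 2))])
    y = np.array([+1] * 50 + [-1] * 50)
    X = np.column_stack([np.ones(100), X_raw])       # leading 1 carries the threshold w_0

    # Step 1: linear regression on the +/-1 targets, via the pseudo-inverse.
    w = np.linalg.pinv(X) @ y

    # Step 2: fine-tune for classification with the pocket algorithm: run PLA
    # updates, but keep ("pocket") the best weights seen so far, as measured
    # by in-sample classification error.
    best_w = w.copy()
    best_err = np.mean(np.sign(X @ best_w) != y)
    for _ in range(1000):
        mis = np.where(np.sign(X @ w) != y)[0]
        if len(mis) == 0:
            break                                    # perfectly separated; done
        i = rng.choice(mis)
        w = w + y[i] * X[i]                          # ordinary PLA update
        err = np.mean(np.sign(X @ w) != y)
        if err < best_err:
            best_w, best_err = w.copy(), err         # pocket the improvement

The regression step is cheap and gives the pocket algorithm a sensible starting point, rather than starting from zero weights.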
MODERATOR: Regarding the first part of the lecture, how do you usually come up with features? PROFESSOR: OK. The best approach is to look at the raw input and at the problem statement, and then try to infer what would be a meaningful feature for this problem. For example, take the case where I talked about the years in residence. It does make sense to derive some features that are closer to a linear dependency. There is no general algorithm for getting features. This is the part where you work with the problem, and you try to represent the input in a better way. And the only catch is that, if you look at the data in order to derive the features, there is a problem there that will become apparent when we come to the theory. But the bottom line is that, if you don't look at the data, and you study the problem and derive features based on that, it will almost always be helpful, provided you don't have too many of them. If you have too many of them, it starts becoming a problem. To first order, usually when I get a problem, I look at the problem statement, and I can probably think of fewer than a dozen variables that will be helpful. And I put in all of them. Usually, a dozen variables doesn't increase the input space by much-- these are big problems-- so I don't suffer much from the generalization issue. MODERATOR: Added to that, a short clarification: the nonlinear transformations-- do they become features? PROFESSOR: Yeah. We are going to use the word feature. There's a feature space, which is called Z. And anything you get by taking the input and transforming it into something else will be called a feature. And features of features will also be features. So if you take, for example, the classification of the digits, we had the pixel values. That's the raw input. And then we had the symmetry and the intensity. These were features. If you go further and find nonlinear transformations of those, these will also be called features. A feature is any higher-level representation of a raw input. MODERATOR: Another question is: how does this analysis change if we cannot assume that the data are independent? PROFESSOR: I'm not completely clear about the question, but I think I get it. The question is probably about independence versus dependence when we get the inputs. And the independence was used in getting the generalization bound. That's probably the direction of the question. The independence was from one data point to another. So I have N inputs, and I want these guys to be generated independently, according to a probability distribution. If they were originally independent, and I transformed one of them and transformed the other, the independence is inherited. There is no question of independence between coordinates of the same input. The independence was a question of independence between the different inputs. MODERATOR: So the different inputs. PROFESSOR: Different input points. MODERATOR: Another question is: are there methods that use different hyperplanes and intersections of them to separate data? PROFESSOR: Correct. The linear model that we have described is the building block of so many models in machine learning. You will find that if you take a linear model with a soft threshold, not the hard-threshold version, and you put a bunch of them together, you will get a neural network. If you take the linear model and you try to pick the separating boundary in a principled way, you get support vector machines. If you take the nonlinear transformation and you try to find a computationally efficient way of doing it, you get kernel methods. So there are lots of methods within machine learning that build on the linear model. The linear model is somewhat underutilized. It's not glorious, but it does the job. The interesting thing is that if you have a problem, there is a very good chance that a simple linear model will achieve what you want. You may not be able to brag about it, but it will do the job. And obviously, the other models will give you incremental performance in some cases.
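A minimal sketch of the first of those constructions, assuming tanh as the soft threshold: each unit below is just a linear model passed through a soft threshold, and wiring a few of them together gives a tiny one-hidden-layer neural network. The weights are arbitrary numbers, purely for illustration; in practice they would be learned from data.

    import numpy as np

    def soft_unit(w, x):
        # One linear model with a soft threshold: tanh(w . x) instead of sign(w . x).
        return np.tanh(w @ x)

    def tiny_network(x):
        # Two soft-threshold linear units feeding a third one: a small
        # one-hidden-layer neural network built entirely from linear models.
        x = np.concatenate([[1.0], x])               # constant coordinate for the threshold
        h1 = soft_unit(np.array([0.5, 1.0, -1.0]), x)
        h2 = soft_unit(np.array([-0.3, 0.7, 0.2]), x)
        h = np.array([1.0, h1, h2])                  # hidden layer, plus its own constant
        return soft_unit(np.array([0.1, 1.2, -0.8]), h)

    print(tiny_network(np.array([0.4, -0.2])))       # a value in (-1, 1)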
MODERATOR: A question, getting a little bit ahead: how do you assess the quality of E_in and E_out systematically? PROFESSOR: This is a theoretical question. E_in is very simple. I have the value of E_in; I can evaluate it at any given point. And this is what makes the algorithm able to pick the best in-sample hypothesis, by picking the one that has the smallest in-sample error. The out-of-sample error, I don't have access to. There will be some methods, described after the theory, that give us an explicit estimate of the out-of-sample error. But in general, I rely on the theory that guarantees that the in-sample error tracks the out-of-sample error, in order to go all out for the in-sample error and hope that the out-of-sample error follows. We have seen this in the graph when we were looking at the evolution of the perceptron: the in-sample error was going down and up, and the out-of-sample error was also going down and up, albeit with a discrepancy between the two. But they were tracking each other. MODERATOR: Here's a question that reflects a common confusion. If you want to fit a polynomial, is this still a linear regression case? PROFESSOR: Correct. Let's say we have a single input variable, x, like the case I gave. So you have x and y, and a line. If you use the nonlinear transformation, you can transform this x into x, x squared, x cubed, x to the fourth, x to the fifth, and then fit a line in the new space. And a line in the new space will be a polynomial in the old space. So this is covered through the nonlinear transformation. MODERATOR: What is the relation between linear regression least squares and maximum likelihood estimation? PROFESSOR: OK. When you look at linear regression in the statistics literature, there are many more assumptions about the probabilities and what the noise is, and you can actually get more results out of it. Under certain conditions, you can relate it to maximum likelihood. You can say that Gaussian noise goes with the squared error, and in that case, minimizing the squared error will correspond to maximum likelihood. So there is a relationship. On the other hand, I prefer to present linear regression in the context of machine learning, without making too many assumptions about distributions and whatnot, because I want it to apply to a general situation rather than a particular one. As a result, I will be able to say less in terms of what the probability of being right or wrong is. I just have the generalization from in-sample to out-of-sample. But that suffices for most machine learning situations. So there is a relationship, and it's studied fairly well in other disciplines. But it is not of particular interest to the line of logic that I'm following.
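A minimal sketch of the polynomial case just described, on made-up one-dimensional data: transform x into (1, x, x squared, ..., x to the fifth), then run ordinary least squares in the transformed space.

    import numpy as np

    # Synthetic one-dimensional data, invented for illustration.
    rng = np.random.default_rng(3)
    x = rng.uniform(-1.0, 1.0, size=50)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.1, size=50)   # nonlinear target plus noise

    # Nonlinear transformation: x -> (1, x, x^2, x^3, x^4, x^5).
    Z = np.column_stack([x ** k for k in range(6)])

    # Linear regression in the new space. A line in the Z space is a
    # degree-5 polynomial back in the original x space.
    w = np.linalg.pinv(Z) @ y

    def h(x_new):
        # Evaluate the fitted polynomial at a new point: same transformation,
        # then a dot product with the learned weights.
        return np.array([x_new ** k for k in range(6)]) @ w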
MODERATOR: A popular question is: can you give at least a set of commonly used nonlinear transformations? PROFESSOR: There will be many. When we get to support vector machines, we will be dealing with a number of transformations, some of them polynomials like the ones that were mentioned. One of the useful ones is referred to as radial basis functions. We will talk about those as well. So there will be transformations. And the main point is to understand what you can and cannot do, in terms of jeopardizing the generalization performance by taking a nonlinear transformation. After we are done with that theory, we will have a significant level of freedom in choosing which nonlinear transform to use, and we'll have some guidelines about the well-known ones. So this is coming up. MODERATOR: I think you already answered this question last time, but again, someone asks: is it impossible for machine learning to find the pattern of a pseudo-random number generator? PROFESSOR: Well, if it's pseudo-random, then in principle, if you get the seed, you can reproduce it. But the way it's usually used, you take a pseudo-random number, and then you take a few bits of it as the output for different inputs. So just looking at the inputs and trying to decipher it is next to impossible. So it's a practical question. Philosophically, yes, you can. Practically, it looks random for all intents and purposes. MODERATOR: What are the different treatments for continuous responses versus discrete responses? PROFESSOR: Obviously, this is dictated by the problem. If someone comes and they want to approve credit, yes or no, I'm going to use the classification hypothesis set. If someone wants to set a credit line or something else, then I will have to use regression. So it really depends on the problem. And the funny part is that real numbers look more sophisticated, yet the algorithm that goes with them, which is linear regression, is much easier than the other one. The reason is that the other one is combinatorial, and combinatorial optimization is pretty difficult in general. So the answer to the question is that it depends on the target function that the person comes up with. And when there is cross-fertilization between the techniques, it's just a way to use an analytic advantage from one method to give the other one a jump start, or to give it a reasonable solution. But that's a computational question. The distinction is really in the problem statement itself. MODERATOR: Can you say what makes a nonlinear transformation good? PROFESSOR: OK. I will be able to talk about this a little more intelligently after the theory. I would like to emphasize that the theory part will be very important in giving us the tools to talk, with authority, about all the issues that are being raised. So there is a reason for including the theory before we go into more details. This lecture was meant to give you a small set of standard tools that you can already use for many applications and many data sets, because now you can deal with non-separable data, you can deal with real-valued data, and you can even deal with some nonlinear situations. So it's just a toolbox for you to get your hands dirty. Then things will become more principled as we develop more material.
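As a rough illustration of the radial basis functions mentioned above, here is a minimal sketch of one common form, Gaussian bumps centered at a few chosen points; the centers and the width parameter gamma are made-up values for illustration.

    import numpy as np

    def rbf_features(X, centers, gamma=1.0):
        # Map each input to its similarity to each center:
        #     phi_j(x) = exp(-gamma * ||x - mu_j||^2)
        # The features are nonlinear in x, but the model that uses them is
        # still linear in the weights, so we stay in the realm of linear models.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return np.exp(-gamma * dists ** 2)

    # Example: three arbitrary centers in the plane.
    centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
    X = np.array([[0.2, -0.1], [0.9, 1.2]])
    Z = rbf_features(X, centers, gamma=2.0)          # each row: 3 radial-basis features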
MODERATOR: Yeah, I think that's it. PROFESSOR: OK, that's it. We will see you on Thursday.
Info
Channel: caltech
Views: 313,667
Rating: 4.93 out of 5
Keywords: Machine Learning (Field Of Study), Linear Model, linear regression, Caltech, MOOC, data, computer, science, course, Data Mining (Technology Class), Big Data, Data Science, learning from data, perceptron, pseudo-inverse, Technology (Professional Field), Computer Science (Industry), Learning (Quotation Subject), Lecture (Type Of Public Presentation), California Institute Of Technology (Organization), Abu-Mostafa, Yaser
Id: FIbVs5GbBlQ
Length: 79min 44sec (4784 seconds)
Published: Thu Apr 12 2012