YASER ABU-MOSTAFA: Welcome back. Last time, we talked about validation,
which is a very important technique in machine learning for estimating the
out-of-sample performance. And the idea is that we start from
the data set that is given to us, that has N points. We set aside K points for validation,
for just estimation, and we train with the remaining points, N
minus K. Because we are training with a subset, we end up with a hypothesis
that we are going to label g minus, instead of g. And it is on this g minus,
that we are going to get an estimate of the out-of-sample
performance by the validation error. And then there is a leap of faith, when
we put back all the examples in the pot in order to come up with the best
possible hypothesis-- to work with the most training examples. We are going to get g, and we are using
the validation error we had on the reduced hypothesis, if you will,
to estimate the out-of-sample performance on the hypothesis
we are actually delivering. And there is a question of how
accurate an estimate this would be for E_out. And we found out that K cannot be too
small, and cannot be too big, in order for this estimate to be reliable. And we ended up with a rule of thumb
of about 20% of the data set going to validation. That will give you
a reasonable estimate. Now this was an unbiased estimate. So E_val can come out better than E_out or worse than E_out in general, as far as estimating the performance of g minus is concerned. On the other hand, once you use the
validation error for model selection, which is the main utility for
validation, you end up with a little bit of an optimistic bias, because you
chose a model that performs well on that validation error. Therefore, the validation error is not
going to necessarily be an unbiased estimate of the out-of-sample error. It will have a slight positive,
or optimistic, bias. And we showed an experiment using very few examples, in order to exaggerate the effect. We can see the impact: the blue
curve is the validation error, and the red curve is the out-of-sample
error on the same hypothesis, just to pin down the bias. And we realize that, as we increase
the number of examples, the bias goes down. The difference between the
two curves goes down. And indeed, if you have a reasonable-size
validation set, you can afford to estimate a couple of parameters for
sure, without contaminating the data too much. So you can assume that the measurement
you're getting from the validation set is a reliable estimate. Then, because the number of examples
turned out to be an issue, we introduced the cross-validation, which
is, by and large, the method of validation you're going to be using
in a practical situation. Because it gets you the
best of both worlds. So in this case, we are illustrating a case where we have
10-fold cross-validation. So you divide the data
set into 10 parts. You train on nine, and validate
on the tenth, and keep that estimate of the error. And you keep repeating as you
choose the validation subset to be one of those. So you have 10 runs. And each of them gives you an estimate
on a small number of examples, 1/10 of the examples. And then by the time you average all of
these estimates, that will give you a general estimate of what the out-of-sample
error would be on 9/10 of the data, in spite of the fact that they
are different 9/10 each time. And in that case, the advantage is that training on 9/10 of the data is very close to training on all of it, so the estimate you are getting applies to a hypothesis very close to the final one. And furthermore, the number of examples
taken into consideration in getting an estimate of the validation
error is really N. You got all of them, albeit in different runs. So this is really the way to go in cross-validation.
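As a concrete illustration of the procedure, here is a minimal sketch of 10-fold cross-validation in Python. The train and error functions are placeholders of my own, not something from the lecture:

```python
import numpy as np

def cross_validation_error(X, y, train, error, folds=10):
    # Shuffle the N indices and split them into 10 parts.
    idx = np.random.permutation(len(y))
    parts = np.array_split(idx, folds)
    errors = []
    for k in range(folds):
        val = parts[k]                                  # validate on one part
        trn = np.concatenate([p for j, p in enumerate(parts) if j != k])
        g_minus = train(X[trn], y[trn])                 # train on the other nine
        errors.append(error(g_minus, X[val], y[val]))   # one estimate per run
    return np.mean(errors)                              # average the 10 estimates
```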
And invariably, in any learning situation, you will need to choose a model, a parameter, something--
to make a decision. And validation is the method of choice
in that case, in order to make that decision. OK. So we move on to today's lecture, which
is Support Vector Machines. Support vector machines are arguably
the most successful classification method in machine learning. And they are very nice, because
there is a principled derivation for the method. There is a very nice optimization
package that you can use in order to get the solution. And the solution also has a very
intuitive interpretation. So it's a very, very neat piece
of work for machine learning. So the outline will be the following. We are going to introduce the notion
of the margin, which is the main notion in support vector machines. And we'll ask a question of maximizing
the margin-- getting the best possible margin. And after formulating the problem, we
are going to go and get the solution. And we're going to do
that analytically. It will be a constrained
optimization problem. And we faced one before in
regularization, where we gave a geometrical solution, if you will. This time we are going to do it
analytically, because the formulation is simply too complicated to have
an intuitive geometric solution for. And finally, we are going to expand from
the linear case to the nonlinear case in the usual way, thus expanding
all of the machinery to a case where you can deal with nonlinear surfaces,
instead of just a line in a separable case, which is the main case
we are going to handle. So now let's talk about
linear separation. Let's say I have a linearly
separable data set. Just take four points, for example. There are lines that will separate
the red from the blue. Now, when you apply perceptron,
you will get a line. When you apply any algorithm, you will
get a line, and separate-- you get 0 training error. And everything is fine. And now there is a curious point
when you ask yourself: I can get different lines. Is there any advantage of choosing
one of the lines over the other? That is the new addition
to the problem. So let's look at it. Here is a line. So I chose this line to
separate the two. You may not think that this
is the best line. And we'll try to take our intuition
and understand why this is not the best line. So I'm going to think of a margin,
that is, if this line moves a little bit, when is it
going to cross over? When is it going to start
making an error? So in this case, let's put it as
a yellow region around it. That's the margin you have. So if you choose this line, this
is the margin of error. Sort of informal notion. Now you can look at this line. And it does seem to have
a better margin. And you can now look at the problem
closely and say: let me try to get the best possible margin. And then you get this line, which has
this margin, that is exactly right for the blue and red points. Now, let us ask ourselves
the following question. Which is the best line
for classification? As far as the in-sample error is
concerned, all of them give in-sample error 0. As far as generalization questions are
concerned, as far as our previous analysis goes, all of them are dealing with a linear model with four points. So generalization, as an estimate,
will be the same. Nonetheless, I think you will agree with
me that if you had your choice, you will choose the fat margin. Somehow it's intuitive. So let's ask two questions. The first one is: why is
a bigger margin better? Second one. If we are convinced that a bigger
margin is better, then you ask yourself: can I solve for w
that maximizes the margin? Now it is quite intuitive that the
bigger margin is better, because think of a process that is generating
the data. And let's say that there
is noise in it. If you have the bigger margin, the
chances are the new point will still be on correct side of the line. Whereas, if I use this one, there's
a chance that the next red point will be here, and it will be misclassified. Again, I'm not giving any proofs. I'm just giving you an intuition here. So it stands to reason that indeed,
the bigger margin is better. And now we're going to argue that the
bigger margin is better for a reason that relates to our VC
analysis before. So anybody remember the growth
function from ages ago? What was that? So we take the dichotomies of the
line on points in the plane. And let's say, we take three points. So on three points, you can get all
possible dichotomies by a line. The blue versus not-blue region. And you can see that by varying where
the line is, I can get all possible 2 to the 3 equals 8 dichotomies here. So you know that the growth
function is big. And we know that the growth function
being big is bad news for generalization. That was our take-home lesson. So now let's see if this is
affected by the margin. So now we are taking dichotomies, not only
the line, but also requiring that the dichotomies have a fat margin. Let's look at dichotomies,
and their margin. Now in this case, I'm putting
the same three points. And I'm putting a line that has the
biggest possible margin for the constellation of points I have. So you can see here, I put it so that it sandwiches the points. Every time, it touches the points. It cannot extend any further because
it will get beyond the points. And when you look at it, this
is a thin margin for this particular dichotomy. This is an intermediate one. This is a fat one. And this is a hugely fat one,
but that's the constant dichotomy. That's not a big deal. Now let's say that I told you that you
are allowed to use a classifier, but you have to have at least that
margin for me to accept it. So now I'm requiring the margin
to be at least something. All of a sudden, these guys that used to
be legitimate dichotomies using my model, are no longer allowed. So effectively by requiring the margin
to be at least something, I'm putting a restriction on the growth function. Fat margins imply fewer
dichotomies possible. And therefore, if we manage to separate
the points with a fat dichotomy, we can say that fat
dichotomies have a smaller VC dimension, smaller growth function than
if I didn't restrict them at all. And, although this is all informal, we
will come at the end of the lecture to a result that estimates the out-of-sample error based on the margin. And we will find out that indeed, when
you have a bigger margin, you will be able to achieve better out-of-sample
performance. So now that I have completely and irrevocably
convinced you that the fat margins are good, let us
try to solve for them. That is, find the w that not only
classifies the points correctly, but achieves so with the biggest
possible margin. So how are we going to do that? Well the margin is just the distance
from the plane to a point. So I'm going to take from the data set
the point x_n, which happens to be the nearest data point to the line, that we
have used in the previous example. And the line is given by the
linear equation-- equals 0. And since we're going to use a higher
dimensional thing, I'm not going to refer to it as a line. I'm going to refer to it as a plane-- hyperplane really-- but
just plane for short. So we're talking about d-dimensional
space and a hyperplane that separates the points. So we would like to compute that distance. And we ask ourselves: if I give you w
and the x's, can you plug them into a formula and give me the distance
between that plane, that is described by w, and the point x_n? I'm now taking the nearest point,
because then that distance will be the margin that I'm talking about. Now there are two preliminary
technicalities that I'm going to invoke here. And they will simplify the
analysis later on. So here is the first one. The first one is to normalize
w. What do I mean by that? For all the points in the data set,
near and far, when you take w transposed times x_n, you will get
a number that is different from 0. And indeed, it will agree with the
label y_n, because the points are linearly separable. So I can take the absolute value of this,
and claim that it's greater than 0 for every point. Now I would like to relate w to the
margin, or to the distance. But I realize that here, there is a minor
technicality that is annoying. Let's say that I multiply the
vector w by a million. Does the plane that I'm
talking about change? No. This is the equation of it. I can multiply by any positive number,
and I get the same plane. So the consequence of that is that any
formula that takes w and produces the margin will have to have, built
in it, scale invariance. We'll be dividing by something that
takes out that factor that does not affect which plane I'm talking about. So I'm going to do it now, in order
to simplify the analysis later. I'm going to consider all
representations of the same plane. And I'm going to pick one where this is
normalized, by requiring that for the minimum point, this fellow is 1. I can always do that. I can scale w up and down until
I get the closest one to have this equal to 1. There's obviously no
loss in generality. Because in this case, this is a plane. And I have not missed any
planes by doing that. Now the quantity w transposed x_n, which is
the signal, as we talked about it, is a pretty interesting thing. So let's look at it. I have the plane. So the plane has the signal equals 0. And it doesn't touch any points. The points are linearly separable. Now when you get the signal
to be positive, you are moving in one direction. You hit the closest point. And then you hit more points, the
interior points, so to speak. And when you go in the other direction
and it's negative, you hit the other points, the nearest point on the
negative side, and then the interior points which are further out. So indeed that signal actually relates
to the distance, but it's not the Euclidean distance. It just gives an ordering of the points,
according to which is nearest and which is furthest. But what I'd like to do, I would
like to actually get the Euclidean distance. Because I'm not comparing the
performance of this plane on different points. I'm comparing the performance of
different planes on the same point. So I have to have the same yardstick. And the yardstick I'm going to
use is the Euclidean distance. So I'm going to take this
as a constraint. And when I solve under it, I will find out that the problem becomes much easier to solve. And then I can get the plane. And the plane will be general
under this normalization. The second one is a pure technicality. Remember that we had x being in
Euclidean space R to the d. And then we added this artificial
coordinate x_0 in order to take care of w_0 that was the threshold, if you
think of it as comparing with a number, or a bias if you think
of it as adding a number. And that was convenient just to have
the nice vector and matrix representation and so on. Now it turns out that, when you solve for
the margin, the w_1 up to w_d will play a completely different role
from the role w_0 is playing. So it is no longer convenient to
have them as the same vector. So for the analysis of support vector
machines, we're going to pull w_0 out. So the vector w now is the
old vector w_1 up to w_d. And you take out w_0. And in order not to confuse it and call
it w, because it has a different role, we are going to call
it here b, for bias. OK? So now the equation for the plane is w,
our new w, times x plus b equals 0. And there is no x_0. x_0 used to be multiplied
by w_0, which we are now calling b. So every w you will see in this
lecture will belong to this convention. And now if you can look at this-- this
will be w transposed x_n plus b. Absolute value equals 1. And the plane will be w transposed
x plus b equals 0. Just a convention that will make
our math much more friendly. So these are the technicalities that
I wanted to get out of the way. Now, big box, because it's
an important thing. It will stay with us. And then we go for computing
the distance. So now, we would like to get
the distance between x_n-- we took x_n to be the nearest point, and therefore the distance
will be the margin. And we want to get the distance
from the plane. So let's look at the geometry
of the situation. I have this as the equation
for the plane. And I have the conditions
that I talked about. This is the geometry. I have a plane. And I have a point x_n. And I'd like to estimate the distance. First statement. The vector w is perpendicular
to the plane. That should be familiar if you have seen any geometry before, and it's not very difficult to argue. But remember now that the vector
w is in the X space. I'm not talking about
the weight space. I'm talking about w as you plug in
the values and you get a vector. And I'm looking at that vector
in the input space X. And I'm saying it's perpendicular
to the plane. Why is that? Because let's say that you
pick any two points-- call them x dash and x double
dash-- on the plane proper. So they are lying there. What do I know about these two points? Well, they are on the plane, so
they had better satisfy the equation of the plane. Right? So I can conclude that it must be that,
when I plug in x dash in that equation, I will get 0. And when I plug in x double
dash, I will get 0. Conclusion: If I take the difference between these
two equations, I will get w transposed times x dash minus
x double dash, equals 0. And now you can see that
good old b dropped out. And this is the reason why it has
a different treatment here. The other guys actually mattered. But the b plays a different role. So when you see an equation like
that, your conclusion is what? Your conclusion is that w, as a vector,
must be orthogonal to x dash minus x double dash, as a vector. So when you look at the plane, here is
the vector x dash minus x double dash. Let me magnify it. And this must be orthogonal
to the vector w. So the interesting thing is that we
didn't make any restrictions on x dash and x double dash. These could be any two points
on the plane, right? So now the conclusion is that w, which is
the same w-- the vector w that defines the plane, is orthogonal to
every vector on the plane. Right? Therefore, it is orthogonal
to the plane. So we got that much. We know that now w has
an interpretation. Now we can get the distance. Once you know they are orthogonal
to the plane, you probably can get the distance. Because what do we have? The distance between x_n and the plane,
and we put them here, is what? It can be computed as follows. Pick any point, one point,
on the plane. We just call it generic x. And then you take the projection of the
vector going from here to here. You project it on the direction which
is orthogonal to the plane. And that will be your distance. Right? So we just need to put the mathematics
that goes with that. So here's the vector. And here is the other vector, which we
know that is orthogonal to the plane. Now if you project this fellow on this
direction, that length will give you the distance. Now in order to get the projection,
what do you do? You get the unit vector
in the direction. So you take w, which is this vector--
could be of any length-- and you normalize it by its norm. And you get a unit vector under
which the projection would be simply a dot product. So now the w hat is a shorter w,
if the norm of w happens to be bigger than 1. And what you get-- you get the distance being
simply the inner product. You take the unit vector, dot that. And that is your distance. Except for one minor issue. This could be positive or negative
depending on whether w is facing x_n or facing the other direction. So in order
to get the distance proper, you need the absolute value. So we have a solution for it. Now we can write the distance as-- this is the formula. Now I multiply it by w hat. I know what the formula for w hat is. I write it down. And now I have it in this form. Now this can be simplified if I add
the missing term, plus b minus b. Why is that? Can someone tell me what is w^T x plus
b, which is this quantity being subtracted here? This is the value of the equation of
the plane, for a point on the plane. So this will happen to be 0. How about this quantity, w^T x_n
plus b, for my point x_n? Well, that was the quantity
that we insisted on being 1. Remember when we normalized the w,
because w's could go up and down. And we scaled them such that the
absolute value of this quantity is 1. So all of a sudden, this
thing is just 1. And you end up with the formula for the
distance, given that normalization, being simply 1 over the norm. That's a pretty easy thing to do. So if you take the plane and insist on
a canonical representation of w by making this part 1 for the nearest
point, then your margin will simply be 1 over the norm of the w you used. This I can use, in order now to choose the combination of w's that will give me the best possible margin, which is the next step.
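To pin down the algebra the slide carries out, here is the distance computation written out, using a generic point x on the plane (so w transposed x plus b is 0) and the normalization above:

$$\text{distance} \;=\; \bigl|\hat{\mathbf w}^{\mathsf T}(\mathbf x_n - \mathbf x)\bigr| \;=\; \frac{\bigl|\mathbf w^{\mathsf T}\mathbf x_n - \mathbf w^{\mathsf T}\mathbf x\bigr|}{\lVert \mathbf w\rVert} \;=\; \frac{\bigl|\mathbf w^{\mathsf T}\mathbf x_n + b - (\mathbf w^{\mathsf T}\mathbf x + b)\bigr|}{\lVert \mathbf w\rVert} \;=\; \frac{|1 - 0|}{\lVert \mathbf w\rVert} \;=\; \frac{1}{\lVert \mathbf w\rVert}.$$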
So let's now formulate the problem. Here is the optimization
problem that resulted. We are maximizing the margin. The margin happens to
be 1 over the norm. So that is what we are maximizing. Subject to what? Subject to the fact that for the nearest
point, which happens to have the smallest value of those guys-- so
the minimum over all points in the training set. I took the quantity here
and scaled w up or down in order to make that quantity 1. So I take this as a constraint. When you constrain yourself this way,
then you are maximizing 1 over the norm of w. And that is what you get. So what do we do with this? Well, this is not a friendly
optimization problem. Because if the constraints have
a minimum in them, that's bad news. Minimum is not a nice
function to have. So what we are going to do now, we are
going to try to find an equivalent problem that is more friendly. Completely equivalent, by very
simple observations. So the first observation is that I
want to get rid of the minimum. That's my biggest concern. So the first thing I notice that-- not to mention the absolute value. So the absolute value of this
happens to be equal to this fellow. Why is that? Well, every point is
classified correctly. I'm only considering the planes that separate the data set correctly. And I'm choosing between them, for the
one that maximizes the margin. Because they are classifying the points
correctly, it has to be that the signal agrees with the label. Therefore when you multiply, the label is
just +1 or -1, and therefore it takes care of the absolute value part. So now I can use this instead
of the absolute value. I still haven't gotten
rid of the minimum. And I don't particularly like dividing
1 over the norm, which has a square root in it. But that is very easily handled. Instead of maximizing 1 over the norm,
I'm going to minimize this friendly quadratic quantity, one half of w transposed w. Maximizing 1 over the norm is the same as minimizing that. Everybody sees that it's equivalent. So now we can see. Does anybody see quadratic programming
coming up on the horizon? There's our quadratic objective. The only thing I need to do is just have
the constraints being friendly constraints, not a minimum
and absolute value. Just inequality constraints
that are linear in nature. And I claim that you can do this by
simply taking subject to these. So this doesn't bother me, because I
already established that it deals with the absolute value. But here, I'm taking greater than
or equal to 1 for all points. I can see that if the minimum
is 1, then this is true. But it is conceivable that I do this
optimization, and I end up with a quantity for which all of these guys happen
to be strictly greater than 1. That is a feasible point, according
to the constraints. And if this by any chance gives me the
minimum, then that is the minimum I'm going to get. And the problem with that is that this
is a different statement from the statement I made here. That's the only difference. Well, is it possible that the minimum
will be achieved at a point where this is greater than 1 for all of them? A simple observation tells you: no,
this is impossible. Because let's say that you
got that solution. You tell me: this is the minimum I
can get for w transposed w, right? And I got it for values where this
is strictly greater than 1. Then what I'm going to do, I'm going
to ask you: give me your solution. And I'm going to give you
a better solution. What am I going to do? I'm going to scale w and b
proportionately down until they touch the 1. You have a slack, right? So I can just pull all of them, just
slightly, until one of them touches 1. Now under those conditions, definitely,
if the original constraints were satisfied, the new
constraints will be satisfied. All of them are just proportional. I can pull out the factor, which
is a positive factor. And indeed, if this is the case, this
will be the case for the other one. And the point is that the w I got is
smaller than yours because I scaled them down, right? So it must be that my solution
is better than yours. Conclusion: When you solve this, the w that you will
get necessarily satisfies these with at least one of those
guys with equality. Which means that the minimum is 1. And therefore, this problem is
equivalent to this problem. This is really very nice. So we started from a concept, and geometry,
and simplification, and now we end up with this very friendly
statement that we are going to solve. And when you solve it, you're going to
get the separating plane with the best possible margin. So let's look at the solution. Formally speaking, let's put it in
a constrained optimization problem. The constrained optimization here: you
minimize this objective function subject to these constraints. We have seen those. And the domain you're working on,
w happens to be in the Euclidean space R to the d. b happens to be a scalar, belongs
to the real numbers. That is the statement.
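Written out, the statement is (reconstructing the slide):

$$\min_{\mathbf w \in \mathbb R^{d},\; b \in \mathbb R}\; \tfrac12\,\mathbf w^{\mathsf T}\mathbf w \qquad \text{subject to} \qquad y_n\bigl(\mathbf w^{\mathsf T}\mathbf x_n + b\bigr) \;\ge\; 1 \quad \text{for } n = 1, \dots, N.$$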
Now when you have a constrained optimization with a bunch of constraints like these, we will need to go an analytic
route in order to solve it. Geometry won't help us very much. So what we're going to do here, we are
going to ask ourselves: oh, constrained optimization. I heard of Lagrange. You form a Lagrangian, and then all of
a sudden the constrained problem becomes unconstrained, and you solve it, and
you get the multipliers lambda. Lambda is pretty much what we got
in regularization before. We did it geometrically. We didn't do it explicitly
with Lagrange. But that's what you get. Now the problem here is that the
constraints you have are inequality constraints, not equality constraints. That changes the game a little
bit, but just a little bit. Because what people did is simply look
at these and realize that there is a slack here. If I call the slack s squared,
I can make this equality. And then I can solve the old Lagrangian,
with equality. I can comment on that in the
Q&A session, because it's a very nice approach. And that approach was derived
independently by two sets of people: Karush, which is the first K, and Kuhn and Tucker, which are the second K and the T. And the Lagrangian under the inequality
constraint is referred to as KKT. So now, let us try to solve this. And I'd like, before I actually go
through the mathematics of it, to remind you that we actually saw this
before in the constrained optimization we solved before under inequality constraints,
which was regularization. And it is good to look at that picture,
because it will put the analysis here in perspective. So in that case, you don't
have to go through the details. We were minimizing something-- you don't have to worry about the
formula exactly-- under a constraint. And the constraint is an inequality
constraint that resulted in weight decay, if you remember. And we had a picture
that went with it. And what we did was, we looked
at the picture and found a condition for the solution. And the condition for the solution
showed that the gradient of your objective function, of the thing you are
trying to minimize, becomes something that is related to
the constraint itself. In this case: normal. The most important aspect to realize is
that, when you solve the constrained problem here, the end result was
that the gradient is not 0. It would have been 0 if the
problem was unconstrained. If I asked you to minimize this, you
just go for gradient equals 0, and solve. So now, because of the constraint, the
constraint kicks in, and you have the gradient being something related
to the constraint. And that's what will happen
exactly when we have the Lagrangian in this case. But one of the benefits of having-- of reminding you of the regularization is that there's
a conceptual dichotomy, no pun intended, between the
regularization and the SVM. SVM is what we're doing here,
maximizing the margin; regularization we have already seen. So let's look at both cases and ask
ourselves: what are we optimizing, and what is the constraint? If you remember in regularization,
we already have the equation. What we are minimizing is
the in-sample error. So we are optimizing E_in, under the
constraints that are related to w transposed w, the size of the weights. That was weight decay. If you look at the equation we just
found out in order to maximize the margin, what we are actually optimizing
is w transposed w. That is what you're trying
to minimize. Right? And your constraint is that you're
getting all the points right. So your constraint is that E_in is 0. So it's the other way around. But again, because both of them will
blend in the Lagrangian, and you will end up doing something that is
a compromise, it's conceptually not a big shock that we are reversing roles
here, and minimizing what is in our mind a constraint, and constraining
what is in our mind an objective function to be minimized. Back to the formulation. So now, let's look at the
Lagrange formulation. And I would like you to pay
attention to this slide. Because once you get the formulation,
we're not going to do much beyond getting a clean version of the
Lagrangian, and then passing it on to a package of quadratic programming
to give us a solution. But at least, arriving
there is important. So let's look at it. We are minimizing-- this is our objective function--
subject to constraints of this form. First step, take the inequality
constraints and put them in the 0 form. So what do I mean by that? Instead of saying that's greater or
equal to 1, you put it as minus 1, and then require that this is greater
than or equal to 0. And now you see, it got multiplied
by a Lagrange multiplier. So think of this, since this should be
greater than 0, this is the slack. So the Lagrange multipliers get
multiplied by the slack. And then you add them up. And they become part of the objective. And they come out as a minus, simply
because the inequalities here are in the direction greater than or equal to. That's what goes with the minus here. I'm not proving any of that. I'm just motivating for you that this
formula makes sense, but there's mathematics that actually
pins it down exactly. And you're minimizing this. So now let's give it a name. It's a Lagrangian. It is dependent on the variables
that I used to minimize with respect to, w and b. And now I have a bunch of new variables
which are the Lagrange multipliers, the vector alpha, which
is called lambda in other cases. Here it's standard, alpha. And there are N of them. There's a Lagrange multiplier
for every point in the data set.
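Putting the pieces of this slide together, the Lagrangian is:

$$\mathcal L(\mathbf w, b, \boldsymbol\alpha) \;=\; \tfrac12\,\mathbf w^{\mathsf T}\mathbf w \;-\; \sum_{n=1}^{N}\alpha_n\,\bigl[\,y_n(\mathbf w^{\mathsf T}\mathbf x_n + b) - 1\,\bigr].$$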
We are minimizing this with respect to what? With respect to w and b. So that was the original thing. The interesting part, which you should
pay attention to, is that you're actually maximizing with
respect to alpha. Again, I'm not making a mathematical
proof that this method holds. But this is what you do. And it's interesting because when we
had equality, we didn't worry about maximization versus minimization. Because all you did was set the gradient equal to 0. So that applies for both
maximum and minimum. So we didn't necessarily
pay attention to it. Here you have to pay attention to it,
because you are maximizing with respect to alphas, but the alphas
have to be non-negative. Once you restrict the domain, you can't
just get the gradient to be 0, because the function-- if the function was defined all over and curved this way, you get the minimum. And the minimum has gradient 0. But if I tell you to stop here, the
function could be going this way. And this is the point you're
going to pick. And the gradient here
is definitely not 0. So the question of maximizing versus
minimizing, you need to pay attention here. We are not going to pay too much
attention to it, because we'll just tell the quadratic programming
guy, please maximize. And it will give us the solution. But that is the problem
we are solving. So now we do at least
the unconstrained part. With respect to w and b, you
are just minimizing this. So let's do it. We're going to take the gradient
of the Lagrangian with respect to w. So I'm getting partial by partial
for every weight that appears. And I get the equation here. How do I get that? I can differentiate. So I'm going to differentiate this. I get a w. The squared goes with the half. When I get this, I ask myself: what
is the coefficient of w? I get alpha_n, y_n, and x_n. Right? That one gets multiplied by w for
every n equals 1 to N. So I get that. And I have a minus sign
here, that comes here. Everything else drops out. So this is the formula. And what do I want the gradient to be? I want it to be the vector 0. So that's a condition. What is the other one? I now get the derivative with
respect to b. b is a scalar. That's the remaining parameter. And when I look at it,
can we do this? What gets multiplied by b? Oh I guess it's just the alphas. Everything else drops out. So-- oh, not just alphas! It's y_n. So here's the b. It gets multiplied
by y_n and alpha. And that's what I get. And you get this to be equal
to the scalar 0. So optimizing this with respect to w and b resulted in these two conditions.
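Explicitly, the two conditions are:

$$\nabla_{\mathbf w}\mathcal L \;=\; \mathbf w - \sum_{n=1}^{N}\alpha_n y_n\mathbf x_n \;=\; \mathbf 0 \quad\Longrightarrow\quad \mathbf w = \sum_{n=1}^{N}\alpha_n y_n\mathbf x_n,$$

$$\frac{\partial\mathcal L}{\partial b} \;=\; -\sum_{n=1}^{N}\alpha_n y_n \;=\; 0 \quad\Longrightarrow\quad \sum_{n=1}^{N}\alpha_n y_n = 0.$$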
Now what I'm going to do, I'm going to go back and substitute with these conditions in the original
Lagrangian, such that the maximization with respect to alpha-- which is the tricky part, because alpha
has a range-- will become free of w and b. And that formulation is referred to as
the dual formulation of the problem. So let's substitute. Here is what I got from
the last slide. This one I got from the gradient
with respect to w equals 0. So w has to be this. And this one from the partial
by partial b, equals 0. I get those. And now I'm going to substitute
them in the Lagrangian. And the Lagrangian has that form. Now let's do this carefully, because
things drop out nicely. And I get a very nice formula at the
end, which is function of alpha only. So this equals-- first part, I get the summation
of the Lagrange multipliers. Where did I get that? I got that because I
have -1 here. It gets multiplied by alpha_n
for all of those. Canceled with this minus, so
I get summation over that. So this part I got. So let me kill the part
that I already used. So I kill the -1. So that part I got. Next. I look at this and say: I have
+b here, right? So when I take +b, it gets
multiplied by y_n alpha_n, summed up from n equals 1 to N. Now, I look at
this and say: oh, the summation of alpha_n y_n from n
equals 1 to N is 0. So the guys that get multiplied
by b will add up to 0. And therefore, I can kill +b. Now when I have it down to this,
it's very easy to see. Because you look at the form for w, when
you have w transposed w, you are going to get a quadratic version of this. You get some double summation,
alpha alpha y y x x, right? With the proper name of the dummy
variable, to get it right. And when you have here, well, you have
already alpha_n y_n and x_n, and now when you substitute w by this, you're
going to get exactly the same thing. You're going to get another alpha,
another y, another x. So this will be exactly the same as
this, except that this one has a factor half, this has
a factor -1. So you add them up. And you end up with this. So we look at this: what
happened to w? What happened to b? All gone. We are now just a function of the Lagrange multipliers. And therefore, we can call this L of alpha.
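After the substitution, the dual problem reads:

$$\mathcal L(\boldsymbol\alpha) \;=\; \sum_{n=1}^{N}\alpha_n \;-\; \frac12\sum_{n=1}^{N}\sum_{m=1}^{N} y_n y_m\,\alpha_n\alpha_m\,\mathbf x_n^{\mathsf T}\mathbf x_m,$$

to be maximized with respect to alpha, subject to $\alpha_n \ge 0$ for all $n$ and $\sum_{n=1}^{N}\alpha_n y_n = 0$.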
Now this is a very nice quantity to have, because it is a very simple quadratic form in the vector alpha. Alpha appears linearly here, and quadratically there. That's all. Now I need to put back the constraints I took out. And let's look at the maximization
with respect to alpha, subject to non-negative ones. This is a KKT condition. I have to look for solutions
under these conditions. And I also have to consider the
conditions that I inherited from the first stage. So I have to satisfy this, and I
have to satisfy this, for the solution to be valid. So this one is a constraint over the
alphas, and therefore I have to take it as a constraint here. But I don't have to take the constraint
here, because that is vacuous as far as the alphas are concerned. This puts no constraint on the alphas whatsoever. You do your thing. You come up with alphas. And you call whatever that formula
is, the resulting w. Since w doesn't appear in the
optimization, I don't worry about it at all. So I end up with this thing. Now if I didn't have those annoying
constraints, I would be basically done. Because I look at this,
that's pretty easy. I can express one of the alphas in
terms of the rest of the alphas. Right? Factor it out. Substitute for that alpha here. And all of a sudden, I have a purely
unconstrained optimization for a quadratic one. I solve it. I get something, maybe a pseudo-inverse
or something, and I'm done. But I cannot do that simply because
I'm restricted to those choices. And therefore, I have to work with
a constrained optimization, albeit a very minor constrained optimization. Now let's look at the solution. The solution goes with quadratic
programming. So the purpose of the slide here is
to translate the objective and the constraints we had into the coefficients
that you're going to pass on to a package called quadratic
programming. So this is a practical slide. First, what we are doing is maximizing
with respect to alpha this quantity that we found, subject to
a bunch of constraints. Quadratic programming packages come
usually with minimization. So we need to translate this
into minimization. How are we going to do that? We're just going to get
the minus of that. So this would become this minus that. So let's do that. We got the minus, minimum of this. So now it's ready to go. Now the next step will
be pretty scary. Because what I'm going to do, I'm going
to expand this, isolating the coefficients from the alphas. The alphas are the parameters. You're not passing alphas to
the quadratic programming. Quadratic programming works
with a vector of variables that you called alpha. What you are passing are the
coefficients of your particular problem that are decided by these
numbers, that the quadratic programming will take, and then will
be able to give you the alphas that would minimize this quantity. So this is what it looks like. I have a quadratic term,
alpha transposed, times a matrix of coefficients, times alpha. And these are the coefficients
in the double summation. These are numbers that you read
off your training data. You give me x_1 and y_1. I'm going to compute these numbers
for all of these combinations. And I end up with a matrix. That matrix gets passed to
quadratic programming. And quadratic programming asks you for
the quadratic term, and asks you for the linear term. Where the linear term, just to be
formal, happens to be, since we are just taking minus alpha, it's -1 transposed
alpha, which is minus the sum of those guys. So this is the bunch of linear
coefficients that you pass. And then the constraints-- you put the constraints again
in the same way, subject to. So there's a part which asks
you for constraints. And here again, the constraints-- you
care about the coefficients of the constraints. So this is a linear equality
constraint. So we are going to pass the y
transposed, which are the coefficients here, as a vector. And it will ask you for, finally, the
range of alphas that you need. And the range of alphas that you need
happens to be between 0, so that would be the vector 0-- would
be your lower bound. Infinity will be your upper bound. So you read off this slide. You give it to the quadratic
programming. And the quadratic programming
gives you back an alpha. And if you're completely discouraged by
this, let me remind you that all of this is just to give you what
to pass to the package. This actually looks exactly like this. That's all you're doing. A very simple quadratic function,
with a linear term. You're minimizing it, subject to a linear equality constraint, plus a bunch of range constraints. And when you expand it, in terms of numbers, this is what you get. And that's what we're going to use.
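As a concrete illustration, here is a minimal sketch of this translation in Python, using the cvxopt package as the quadratic programming solver. The package choice and the helper names are my own; the lecture just refers to a generic quadratic programming package:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_hard_margin(X, y):
    # X: (N, d) array of inputs; y: (N,) array of +1/-1 labels.
    # Assumes the data is linearly separable.
    N = X.shape[0]
    P = matrix(np.outer(y, y) * (X @ X.T))      # quadratic coefficients y_n y_m x_n^T x_m
    q = matrix(-np.ones(N))                     # linear term: minus the sum of the alphas
    G = matrix(-np.eye(N))                      # -alpha_n <= 0, i.e. alpha_n >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))  # equality constraint: sum of alpha_n y_n = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                         # w = sum over n of alpha_n y_n x_n
    sv = alpha > 1e-6                           # support vectors have positive alpha
    bias = y[sv][0] - X[sv][0] @ w              # from y_n (w^T x_n + b) = 1 at any support vector
    return w, bias, alpha
```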
So now we are done. We have done the analysis. We knew what to optimize. It fit one of the standard optimization tools. It happens to be a convex function in
this case, so that the quadratic programming will be very successful. And then we pass it, and
we get a number back. Just a word of warning
before we go there. You look at the size of this matrix. And it's N by N. Right? So the dimension of the matrix depends
on the number of examples. Well, if you have a hundred
examples, no sweat. If you have 1000 examples, no sweat. If you have a million examples,
this is really trouble. Because this is really a dense matrix. These numbers could come out to be anything. So all the entries matter. And if you end up with a huge matrix, quadratic programming will have a pretty hard time finding the solution, to the level where there are tons of
heuristics to solve this problem when the number of examples is big. It's a practical consideration, but
it's an important consideration. But basically, if you're working with
problems-- the typical machine learning problem, where you have, let's
say, not more than 10,000 examples, then it's not formidable. 10,000 is flirting with danger,
but that's what it is. So pay attention to the fact that, in
spite of the fact that there's a standard way of solving it, and the
fact that it's convex, so it's friendly, it is not that easy when you
get a huge number of examples. And people have hierarchical methods
and whatnot, in order to deal with that case. So let's say we succeeded. We gave the matrix and the vectors
to quadratic programming. Back comes what? Back comes alpha. This is your solution. So now we want to take this solution,
and solve our original problem. What is w, what is b, what is the
surface, what is the margin? You answer the questions that
all of this formalization was meant to tackle. So the solution is vector of alphas. And the first thing is that it is very
easy to get the w because, luckily, the formula for w being this was one of
the constraints we got from solving the original one. When we got the gradient with respect to
w, we found out this is the thing. So you get the alphas, you plug them
in, and then you'll get the w. So you get the vector
of weights you want. Now I would like to tell you a condition
which is very important. And it will be the key to defining
support vectors in this case, which is another KKT condition that will
be satisfied at the minimum, which is the following. Quadratic programming hands you alpha. Let's say that-- alpha is the same length
as the number of examples-- let's say you have 1000 examples. So it gives you a vector of 1000 guys. You look at the vector,
and to your surprise-- you don't know yet whether it's pleasant
or unpleasant surprise-- a whole bunch of the alphas are just 0. The alphas are restricted
to be non-negative. They all have to be greater
than or equal to 0. If you find any one of them negative,
then you say quadratic programming made a mistake. But it won't make a mistake. It will give you numbers
that are non-negative. But the remarkable part, out of the
1000, more than 900 are 0's. So you say: something is wrong? Is there a bug in my
thing or something? No. Because of the following. The following condition holds. It looks like a big condition. But let's read it. This is the constraint in the 0 form. So this is greater than or equal to 1. So minus 1 would be greater
than or equal to 0. This is what we called the slack. So the condition that is guaranteed to
be satisfied, for the point you're going to get, is that either the slack is
0, or the Lagrange multiplier is 0. The product of them will definitely be 0.
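In symbols, this KKT (complementary slackness) condition is:

$$\alpha_n\,\bigl[\,y_n(\mathbf w^{\mathsf T}\mathbf x_n + b) - 1\,\bigr] \;=\; 0 \qquad \text{for } n = 1, \dots, N.$$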
So if there's a positive slack, which means that you are talking about
and I have a margin. And the margin touches
on the nearest point. And that is what defines the margin. Then there are interior points, where
the slack is bigger than 1. At those points, the
slack is exactly 1. No, not the slack. The slack is 0. The value is 1. The other ones, the slack
will be positive. So for all the interior points, you're
guaranteed that the corresponding Lagrange multiplier will be 0. OK? I claim that we saw this before, again
in the regularization case. Remember this fellow? We had a constraint which is to
be within the red circle. And we're trying to optimize a function
that has equi-potentials around this. So this is the absolute minimum. And it grows and grows and grows. And because we are in the constraint,
we couldn't get the absolute minimum when we went there. When we had the constraint being
vacuous, that is, the constraint doesn't really constrain us, and the
absolute optimal is inside, we ended up with no need for regularization,
if you remember? And the lambda for regularization
in that case was 0. That is the case, where you have
an interior point, and the multiplier is 0. And then when you got a genuine guy that
you have to actually compromise, you ended up with a condition that
requires lambda to be positive. So these are the guys where
the constraint is active. And therefore you get a positive lambda,
while this guy is by itself 0. So now we come to an interesting
definition. So alpha is largely 0's,
interior points. The most important points in the game
are the points that actually define the plane and the margin. And these are the ones for which
alpha_n's are positive. And these are called support vectors. So I have N points. And I classify them, and I
got the maximum margin. And because it's a maximum margin, it
touched on some of the +1 and some of the -1 points. Those points support the
plane, so to speak. And they're called support vectors. And the other guys are
interior points. And the mathematics of it tells us that
we can identify those, because we can go for lambdas that happen to be
positive, the alphas in this case. And the alpha greater than 0 will
identify a support vector. Again, when I put a box, it's
an important thing. So this is an important notion. So let's talk about support vectors. I have a bunch of points
here to classify. And I go through the entire machinery. I formulate the problem. I get the matrix. I pass it to quadratic programming. I get the alpha back. I compute the w. All of the above. And this is what I get. So where are the support
vectors this picture? They are the closest ones
to the plane, where the margin region touched. And they happen to be these three. This one, this one, this one. So all of these guys that are here, and
all of these guys are here will just contribute nothing to the solution. They will get alpha equals
0 in this case. And the support vectors achieve
the margin exactly. They are the critical points. The other guys-- their margin, if you will,
is bigger or much bigger. And for the support vectors, you
satisfy this with equal 1. So all of this is fine. Now, we used to compute w in terms of
the summation of alpha_n y_n x_n. Because we said that this is the
quantity we got, when we got the gradient with respect to w equals 0. So this is one of the equations. And this is our way to get the alphas
back, which is the currency we get back from quadratic programming, and
plug it in, in order to get the w. This goes from n equals 1 to N.
Now that I notice that many of the alphas are 0, and alpha is only positive
for support vectors, then I can say that I can sum this up over
only the support vectors. It looks like a minor technicality. So the other terms happen to
be 0, so you excluded them. You just made the notation
more clumsy in this case. But there's a very important point. Think of alphas now as the
parameters of your model. When they're 0s, they don't count. Just expect almost all
of them to be 0. What counts is the actual values of
the parameters that will be some number bigger than 0. So now, your weight vector-- it's a d-dimensional vector-- is
expressed in terms of the constants which are your data set, x_n
and their label. Plus few parameters, hopefully few
parameters, which is just the number of support vectors. So you have three support
vectors, then this-- let's say you're working at
20-dimensional space. So I'm looking at 20-dimensional space. I'm getting a weight. Well, it's 20-dimensional
space in disguise. Because of the constraint you put, you
got something that is effectively three-dimensional. And now you can realize
why there might be a generalization dividend here. Because I end up with fewer parameters
than the express parameters that are in the value I get. So, we can also-- now that we have it-- solve for the b. Because you want w and b-- b is the bias,
or corresponding to the threshold term, if you will. And it's very easy to do. Because all you need to do is take any
support vector, any one of them, and for any of them you know that
this equation holds. You already solved for w, by that. So you plug this in. And the only unknown in this
equation would be b. And as a check for you, take any
support vector and plug it in. And you have to find the
same b coming out. That was your check that everything
in the math went through. You take any of them,
and you solve for b.
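Written out: for any support vector x_n,

$$y_n\bigl(\mathbf w^{\mathsf T}\mathbf x_n + b\bigr) = 1 \quad\Longrightarrow\quad b \;=\; y_n - \mathbf w^{\mathsf T}\mathbf x_n,$$

using the fact that $1/y_n = y_n$ for $y_n \in \{-1, +1\}$.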
And now you have w and b, and you are ready with the classification line or hyperplane that you have. Now let me close with the nonlinear
transforms, which will be a very short presentation that has
an enormous impact. We are talking about
a linear boundary. And we are talking about the linearly separable case, at least in this lecture. In the next lecture, I'm going to go to the non-separable case. But a non-separable case could be handled here in the same way we handled the non-separable case with the perceptrons. Instead of working in the X space,
we went to the Z space. And I'd like to see what happens to the
problem of support vector machines, as we stated it and solved it, when
you actually move to the higher dimensional space. Is the problem becoming
more difficult? Does it hold? Et cetera. So let's look at it. So we're going to work
with z instead of x. And we're going to work in the Z space
instead of the X space. So let's first put what we are doing. Analytically, after doing all of the
stuff, and I even forgot what the details are, all I care about is that:
would you please maximize this with respect to alpha, subject to a couple
of sets of constraints. So you look here. And you can see, when I transform
from x to z, nothing happens to the y's. The labels are the same. And these are the guys that probably
will be changed, because now I'm working in a new space. So I'm putting them in
a different color. So if I work in the X space, that's
what I'm working with. And these are the guys that I'm going
to multiply in order to get the matrix that I pass on to
quadratic programming. Now let's take the usual
nonlinear transform. So this is your X space. And in X space, I give you this data. Well, this data is not separable, not
linearly separable, and definitely not nearly linearly separable. This is the case where you need
a nonlinear transformation. And we did this nonlinear
transformation before. Let's say you take just
x1 squared and x2 squared. And then you get this, and this
one is linearly separable. So all you're doing now is
working in the Z space. And instead of getting just a generic
separator, you're getting the best separator, according to SVM, and then
mapping it back, hoping that it will have dividends in terms
of the generalization. So you look at this. I'm moving from X to Z. So when I go back to here,
what do you do? All you need to do is replace
the x's with z's. And then you forget that there
was ever an X space. I have vector z. I do the inner product in order
to get these numbers. These numbers I'm going to pass on to quadratic programming.
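As a sketch of how little changes, here is the same matrix computation after a transform. The feature map phi below is the (x1 squared, x2 squared) transform from the example, and the data arrays are hypothetical stand-ins:

```python
import numpy as np

def phi(x):
    # The nonlinear transform from the example: z = (x1^2, x2^2).
    return np.array([x[0] ** 2, x[1] ** 2])

# Hypothetical data: X is (N, 2), y is (N,) with +1/-1 labels.
X = np.array([[0.5, 0.2], [-0.4, 0.3], [2.0, 1.5], [-1.8, -2.1]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

Z = np.array([phi(x) for x in X])   # work in the Z space from here on
Q = np.outer(y, y) * (Z @ Z.T)      # still N by N, whatever the dimension of Z
# Q, with the same linear term and the same constraints as before, goes to
# the same quadratic programming call; only the inner products have changed.
```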
And when I get the solution back, I have the separating plane or line in the Z space. And then when I want to know what
the surface is in the X space, I map it back. I get the pre-image of it. And that's what I get. The most important aspect to observe
here is that-- OK, the solution is easy. Let's say I move from two-dimensional
to two-dimensional here. Nothing happened. Let's say I move from two-dimensional
to a million-dimensional. Let's see how much more difficult
the problem became. What do I do? Now I have a million-dimensional vector,
inner product with a million dimensional vector. That doesn't faze me at all. Just an inner product. I get a number. But when I'm done, how many
alphas do I have? This is the dimensionality of the
problem that I'm passing to quadratic programming. Exactly the same thing. It's the number of data points. Has nothing to do with the
dimensionality of the space you're working in. So you can go to an enormous space,
without paying the price for it in terms of the optimization
you're going to do. You're going to get
a plane in that space. You can't even imagine, because
it's million-dimensional. It has a margin. The margin will look very
interesting in this case. And supposedly it has good
generalization property. And then you map it back here. But the difficulty of solving
the problem is identical. The only thing that is different is
just getting those coefficients. You'll be multiplying longer vectors. But that is the least of our concerns. The other one is that you're going
to get the full matrix of this. And quadratic programming will have
to manipulate the matrix. And that's where the price is paid. So that price is constant, as long
as you give it this number. It doesn't care whether it was inner
product of 2 by 2, or inner product of a million by million. It will just hand you the alphas. And then you interpret the alphas in the space that you created them from. So the w will belong to the Z space. Now let's look at, if I do the nonlinear
transformation, do I have support vectors? Yes, you have support vectors
for sure in the Z space. Because you're working exclusively in
the Z space, you get the plane there. You get the margin. The margin will touch some points. These are your support vectors
by definition. And you can identify them even without
looking geometrically at the Z space, because what are the support vectors? Oh, I look at the alphas I get. And the alphas that are positive, these
correspond to support vectors. So without even imagining what the Z
space is like, I can identify which guys happen to have the critical margin
in the Z space, just by looking at the alphas. So support vectors live in the space
you are doing the process in, in this case, the Z space. In the X space, there is
an interpretation. So let's look at the X space here. If I have these guys, not linearly
separable, and you decided to go to a high-dimensional Z space. I'm not going to tell you what. And you solved the support
vector machine. You got the alphas. You got the line, or the hyperplane
in that space. And then you are putting the boundary
here that corresponds to this guy. And this is what the boundary
looks like. Now, we have alarm bells-- overfitting, overfitting! Whenever you see something like
that, you say wait. That's the big advantage you
get out of support vectors. So I get this surface. This surface is simply what the line in the
Z space with the best margin got. That's all. So if I look at what the support vectors
are in the Z space, they happen to correspond to points here. They are just data points. Right? So let me identify them here, as
pre-images of support vectors. People will say they are
support vectors. But you need to be careful,
because the formal definition is in the Z space. So they may look like this. So let's look at it. This is one. This is another. This is another. This is another. And usually they are where the boundary turns. You would think that in the Z space,
this is being sandwiched. So this is what it's likely to be. Now the interesting aspect here is that
if this is true, then one, two, three, four-- I have only four support vectors. So I have only four parameters, really,
expressing w in the Z space. Because that's what we did. We said that w equals summation, over
the support vectors, of the alphas. Now that is remarkable, because I just
went to a million-dimensional space. w is a million-dimensional vector. And when I did the solution,
and if I get four-- only four, which is very lucky
if you are using a million-dimensional space, but just
for illustration. If I get four support vectors, then
effectively, in spite of the fact that I used the glory of the million-dimensional
space, I actually have four parameters. And the generalization behavior will
go with the four parameters. So this looks like a sophisticated surface, but it's a simple surface in disguise. It was so carefully chosen that--
there are lots of snakes that can go around and mess up
the generalization. This one will be the best of them. And you have a handle on how good the
generalization is, just by counting the number of support vectors. And that will get us-- Yeah, this is a good point
I forgot to mention. So the distance between the support
vectors and the surface here is not the margin. The margin lives in the linear Z space. These guys are likely to
be close to the surface. But the distance wouldn't be the same. And there are perhaps other points that
look like they should be support vectors, and they aren't. What makes them support vectors or
not is that they achieve the margin in the Z space. This is just an illustrative
version of it. And now we come to the generalization
result that makes this fly. And here is the deal. Generalization result: E_out is less
than or equal to something. So you're doing classification. And you are using the classification
error, the binary error. So this is the probability of error in
classifying an out-of-sample point. The statement here is very
much what you expect. You have the number of support vectors,
which happens to be the number of effective parameters-- the
alphas that survived. This is your guy. You divide it by N, well, N
minus 1 in this case. And that will give you
an upper bound on E_out. Now I wish this was exactly
the result. The result is very close to this. In order to get the correct result, you
need to run several versions and get an average in order
to guarantee this. So the real result has to do with
expected values of those guys. So for several runs,
the expected value. But if the expected value lives up to
its name, and you expect the expected value, then in that case, the E_out you
will get in a particular situation will be bounded above by this, which
is a very familiar type of a bound, number of parameters, degrees of
freedom, VC dimension, and so on, divided by the number of examples. We have seen this before.
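In symbols, the statement reads as follows, with the expectations taken over data sets, in the sense of the several runs just described:

$$
\mathbb{E}\left[E_{\text{out}}\right] \;\le\; \frac{\mathbb{E}\left[\#\,\text{support vectors}\right]}{N-1}
$$

And again, the most important aspect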
is that, pretty much like quadratic programming didn't worry about
the nature of the Z space. It could be million-dimensional. And that didn't figure in the computational difficulty. It doesn't figure in the generalization difficulty either. You didn't ask me about the
million-dimensional space. You asked me, after you were done with
this entire machinery, how many support vectors did you get? If you have 1000 data points, and you
get 10 support vectors, you're in pretty good shape regardless of
the dimensionality of the space that you visited. Because, then, 10 over 1000-- that's
a pretty good bound on E_out. On the other hand, it doesn't say that
now I can go to any dimensional space and things would be fine. Because you still are dependent on
the number of support vectors. If you go through this machinery, and
then the number of support vectors out of 1000 is 500, you know
you are in trouble. And trouble is understood in this
case, because that snake will be really a snake-- going around every point, going around every point, just trying to fit the data hopelessly, getting so many support vectors that the generalization bound now becomes useless. But this is the main theoretical result
that makes people use support vectors, and support vectors with
the nonlinear transformation. You don't pay for the computation of
going to the higher dimension. And you don't pay for it in the generalization either. And then when we go to kernel methods,
which is a modification of this next time, you're not even going to pay for
the simple computational price of getting the inner product. Remember when I told you take an inner
product between a million-vector and itself, and that was minor,
even if it's minor, we're going to get away without it. And when we get away without it, we will
be able to do something rather interesting. The Z space we're going to visit-- we
are now going to take Z spaces that happen to be infinite-dimensional. Something completely unthought
of when we dealt with generalization in the old way. Because obviously, in an infinite-dimensional
space, I'm not going to be able to actually computationally get the inner product. So there has to be another way. And the other way will be the kernel. But that will open another set of
possibilities of working in a set of spaces we never imagined touching, and
still getting not only the computation being the same, but also the
generalization being dependent on something that we can measure, which
is the number of support vectors. I will stop here and take questions
after a short break. Let's start the Q&A. MODERATOR: OK. Can you please first explain
again why you can normalize w transposed x plus b to be 1? PROFESSOR: OK. We would like to solve for the margin given w. The margin depends on the direction of w-- the angle is the relevant part. But w also has an inherent scale in it. So the problem is that the scale has
nothing to do with which plane you're talking about. When I take w, the full w and b, and
take 10 times that, they look like different vectors as far as the analysis
is concerned, but they are talking about the same plane. So if I'm going to solve without the
normalization, I will get a solution. But the solution, whatever I'm optimizing, will invariably have in its denominator something that takes out the scale, so that the thing is scale-invariant. It cannot possibly tell me that w has to be exactly this, when in fact any positive multiple of it will serve the same plane. So all I'm doing is simplifying
my life in the optimization. I want the optimization to be
as simple as possible. I don't want it to be something
over something. Because then I will have trouble
actually getting the solution. Therefore, I started by putting
a condition that does not result in loss of generality. Because if I restrict myself
to w's, not to planes-- all planes are admitted. But every plane is represented
by an infinite number of w's. And I'm picking one particular w
to represent them that happens to have that form. When I do that and put it as
a constraint, what I end up with, the thing that I'm optimizing happens to
be a friendly guy that goes with quadratic programming and
I get the solution. I could definitely have started
by not putting this condition. Except that I would run into
mathematical trouble later on. That's all there is to it. Similarly, I could have kept w0 in. And then, every time I wrote something down, I would have to tell you: take the norm of the first d components, w_1 up to w_d, and forget the first one. So all of this was just pure technical
preparation that does not alter the problem at all, that makes the
solution friendly later on.
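To summarize the algebra behind that answer in symbols: fixing the scale by requiring

$$
\min_{n}\,\bigl|\mathbf{w}^{\mathsf T}\mathbf{x}_n + b\bigr| = 1
\quad\Longrightarrow\quad
\text{margin} = \frac{1}{\lVert\mathbf{w}\rVert}
\quad\Longrightarrow\quad
\text{maximize margin} \;\equiv\; \text{minimize } \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w},
$$

which is the quadratic-programming-friendly objective. MODERATOR: Many people are curious. What happens when the points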
are not linearly separable? PROFESSOR: There are two cases. One of them: they are horribly not
linearly separable, like that. And in this case, you
go to a nonlinear transformation, as we have seen. And then there is a slightly
not linearly separable, as we've seen before. And in that case, you will see that the
method I described today is called hard-margin SVM. Hard-margin because the margin
is satisfied strictly. And then you're going to get another
version of it, which is called soft-margin, that allows for a few errors and penalizes them. And that will be covered next. But basically, it's very much in
parallel with the perceptron. The perceptron needs linearly separable data. If there are a few errors, then you apply something-- say, the pocket algorithm in that case. But if it's terribly not linearly
separable, then you go to a nonlinear transformation. And nonlinear transformation here is very
attractive because of the particular positive properties that we discussed. But in general, you actually use
a nonlinear transformation together with the soft version, because you don't want
the snake to go out of its way just to take care of an outlier. So we are better off just making
an error on the outlier, and making the snake a little bit less wiggly. And we will talk about that
when we get the details. MODERATOR: Could you explain once again
why in this case, just the number of support vectors gives an approximation
of the VC dimension, while in other cases the transform-- PROFESSOR: The explanation
I gave was intuitive. It's not a proof. There is a proof for these terms
that I didn't even touch on. And the idea is the following. We have come to the conclusion that the
number of parameters, independent parameters or effective parameters, is
the VC dimension in many cases. So to the extent that you can actually
accept that as a rule of thumb, then you look at the alphas. I have as
many alphas as data points. So if these were actually my parameters,
I'd be in deep trouble. Because I have as many parameters as
points, so I'm basically memorizing the points. But the particulars of the problem
result in the fact that, in almost all the cases, the vast majority of the
parameters will be identically 0. So in spite of the fact that they were
open to be non-zero, the fact that the expectation is that almost all of them
will be 0, makes it more or less that the effective number of parameters are the
ones that end up being non-zero. Again, this is not an accurate
statement, but it's a very reasonable statement. So the number of non-zero parameters,
which corresponds to the VC dimension, also happens to be the
number of the support vectors by definition. Because support vectors are the ones
that correspond to the non-zero Lagrange multipliers. And therefore, we get a rule, which
either counts the number of support vectors or the number of surviving
parameters, if you will. And this is the rule that we had at
the end, that I said that I didn't prove, but actually gives
you a bound on E_out. MODERATOR: Is there any advantage in
considering the margin, but using a different norm? PROFESSOR: So there
are variations of this. And indeed, some of the aggregation methods, like boosting, have a margin of their own. And then you can compare that. It's really a question of the
ease of solving the problem. And you may have a reason for using one norm or another for a practical problem. For example, if I see that the loss goes with the square, or with the absolute value, or whatever, and I design my margin accordingly, then we are back to the idea of a principled error measure-- in this case, a margin measure. On the other hand, in most of the cases,
there is really no preference. And it is the analytic considerations that make me choose one margin or another. But different measures for the margin,
with 1-norm, 2-norm, and other things, have been applied. And there is really no compelling reason
to prefer one over the other in terms of performance. So it really is the analytic
properties that usually dictate the choice. MODERATOR: Is there any pruning method
that can maybe get rid of some of the support vectors, or not really? PROFESSOR: So you're not happy
with even reducing it to support vectors? You want to get rid of some of them.
Well-- Offhand, I cannot think of a method
that I can directly translate into-- as if it's getting rid of
some of the support vectors. What happens for computational reasons
is that when you solve a problem with a huge data set, you cannot solve it all at once. So sometimes what happens is that
you take subsets, and you get the support vectors of each. And then you take the union of those support vectors and get the support vectors of the support vectors, and so on.
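A rough sketch of that chunking idea, continuing the earlier code (the chunk count and function name are assumptions for illustration):

```python
def chunked_support_vectors(Z, y, n_chunks=10):
    """Solve the SVM on each chunk of the data, pool the support
    vectors found, then solve once more on the pooled set."""
    pooled = []
    for idx in np.array_split(np.arange(Z.shape[0]), n_chunks):
        a = svm_dual_alphas(Z[idx], y[idx])
        pooled.extend(idx[a > 1e-8])           # keep each chunk's support vectors
    pooled = np.array(pooled)
    a = svm_dual_alphas(Z[pooled], y[pooled])  # support vectors of the support vectors
    return pooled[a > 1e-8]
```

So these are really computational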
considerations. But basically, the support vectors are
there to support the separating plane. So if you let one of them
go, the thing will fall! Obviously, I'm only half-joking. Because really, they are the ones that dictate the margin, so their existence tells you
that the margin is valid. And that's really why they are there. MODERATOR: Some people are worried that
a noisy data set would completely ruin the performance of the SVM. So how does it deal with this? PROFESSOR: It will hurt the SVM about as much as it will hurt any other method. It's not particularly susceptible
to noise. Except, obviously, when you have noise, the chances of getting cleanly linearly separable data are not there. And therefore, you're using the other methods. And if you're using strictly a nonlinear transformation, but with a hard margin, then I can see the point about ruining the performance. Because now the snake is
going around noise. And obviously that's not good, because
you're fitting the noise. But in those cases, and in almost all of
the cases, you use the soft version of this. It makes different assumptions, but the solution is remarkably similar. And therefore, in that case, you will be
as vulnerable or not vulnerable to noise as you would by
using other methods. MODERATOR: All right. I think
that's it. PROFESSOR: Very good. We will see you next week.