ANNOUNCER: The following program
is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we talked about kernel
methods, which is a generalization of the basic SVM algorithm to accommodate
feature spaces Z, which are possibly infinite and which we
don't have to explicitly know, or transform our inputs to, in order to be
able to carry out the support vector machinery. And the idea was to define a kernel
that captures the inner product in that space. And if you can compute that kernel, the
generalized inner product for the Z space, this is the only operation
you need in order to carry the algorithm, and in order to interpret
the solution after you get it. And we took an example, which is the
RBF kernel, suitable since we are going to talk about RBF's, radial
basis functions, today. And the kernel is very simple
to compute in terms of x. It's not that difficult. However, it corresponds to an infinite
dimensional space, Z space. And therefore, by doing this, it's as
if we transform every point in this space, which is two-dimensional, into
an infinite-dimensional space, carry out the SVM there, and then interpret
the solution back here. And this would be the separating surface
that corresponds to a plane, so to speak, in that infinite
dimensional space. So with this, we went into another way
to generalize SVM, not by having a nonlinear transform in this case, but
by having an allowance for errors. Errors in this case would be
violations of the margin. The margin is the currency
we use in SVM. And we added a term to the objective
function that allows us to violate the margin for different points, according
to the variable xi. And we have a total violation,
which is this summation. And then we have a degree to which
we allow those violations. If C is huge, then we don't really
allow the violations. And if C goes to infinity, we are
back to the hard-margin case. And if C is very small, then we are
more tolerant and would allow violations. And in that case, we might allow some
violations here and there, and then have a smaller w, which means a bigger
margin, a bigger yellow region that is violated by those guys. Think of it as-- it gives us another
degree of freedom in our design. And it might be the case that in some
data sets, there are a couple of outliers where it doesn't make sense
to shrink the margin just to accommodate them, or by going to
a higher-dimensional space with a nonlinear transformation to go around
that point and, therefore, generate so many support vectors. And therefore, it might be a good idea
to ignore them, and ignoring them means that we are going to commit
a violation of the margin. Could be an outright error. Could be just a violation of the margin
where we are here, but we haven't crossed the boundary, so to speak. And therefore, this gives us another
way of achieving the better generalization by allowing some in-sample
error, or margin error in this case, for the benefit of getting better
generalization prospects. Now the good news here is that,
in spite of this significant modification of the statement of the
problem, the solution was identical to what we had before. We are applying quadratic programming,
with the same objective, the same equality constraint, and almost the
same inequality constraint. The only difference is that it
used to be, alpha_n could be as big as it wants. Now it is limited by C.
And when you pass this to quadratic programming, you will
get your solution. Now C being a parameter-- and it is not clear how to choose it. There is a compromise
that I just described. The best way to pick C, and it is the
way used in practice, is to use cross-validation to choose it. So you apply different values of C, you
run this and see what is the out-of-sample error estimate, using your
cross-validation, and then pick the C that minimizes that. And that is the way you will
choose the parameter C.
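A minimal sketch of that selection procedure, in Python (not from the lecture; it assumes scikit-learn's SVC and cross_val_score, and the grid of C values, the gamma, and the fold count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pick_C(X, y, C_grid=(0.01, 0.1, 1, 10, 100), gamma=1.0, folds=10):
    """Choose the soft-margin parameter C by cross-validation (illustrative sketch)."""
    scores = []
    for C in C_grid:
        clf = SVC(C=C, kernel='rbf', gamma=gamma)                    # soft-margin SVM, RBF kernel
        scores.append(cross_val_score(clf, X, y, cv=folds).mean())   # CV estimate of accuracy
    return C_grid[int(np.argmax(scores))]                            # C with the best estimate
```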
So that ends the basic part of SVM, the hard margin, the soft margin, and the nonlinear transforms together
with the kernel version of them. Together they are a technique really
that is superb for classification. And it is, for many people,
the model of choice when it comes to classification. Very small overhead. There is a particular criterion that
makes it better than just choosing a random separating plane. And therefore, it does reflect on
the out-of-sample performance. Today's topic is a new model, which
is radial basis functions. Not so new, because we had a version of
it under SVM and we'll be able to relate to it. But it's an interesting model
in its own right. It captures a particular understanding
of the input space that we will talk about. But the most important aspect that the
radial basis functions provide for us is the fact that they relate to so many
facets of machine learning that we have already touched on, and other
aspects that we didn't touch on in pattern recognition, that it's worthwhile
to understand the model and see how it relates. It almost serves as a glue
between so many different topics in machine learning. And this is one of the important aspects
of studying the subject. So the outline here-- it's not like I'm
going to go through one item then the next according to this outline. What I'm going to do-- I'm going to define the model, define
the algorithms, and so on, as I would describe any model. In the course of doing that, I will be
able, at different stages, to relate RBF to, in the first case, nearest
neighbors, which is a standard model in pattern recognition. We will be able to relate it to neural
networks, which we have already studied, to kernel methods obviously-- it should relate
to the RBF kernel. And it will. And finally, it will relate to
regularization, which is actually the origin, in function approximation,
for the study of RBF's. So let's first describe the basic
radial basis function model. The idea here is that every point in
your data set will influence the value of the hypothesis at every point x. Well, that's nothing new. That's what happens when you
are doing machine learning. You learn from the data. And you choose a hypothesis. So obviously, that hypothesis
will be affected by the data. But here, it's affected
in a particular way. It's affected through the distance. So a point in the data set will affect
the nearby points, more than it affects the far-away points. That is the key component that makes
it a radial basis function. Let's look at a picture here. Imagine that the center of this bump
happens to be the data point. So this is x_n. And this shows you the influence
of x_n on the neighboring points in the space. So it's most influential nearby. And then the influence
goes down and dies. And the fact that this is symmetric
around x_n means that it's a function only of the distance, which is the
condition we have here. So let me give you, concretely, the
standard form of a radial basis function model. It starts from h of x being-- and here are the components
that build it. As promised, it depends
on the distance. And it depends on the distance such
that the closer you are to x_n, the bigger the influence is, as
seen in this picture. So if you take the norm
of x minus x_n squared, and you take minus gamma times that-- gamma is a positive
parameter, fixed for the moment-- you will see that this exponential really
reflects that picture. The further you are away, you go down. And you go down as a Gaussian. So this is the contribution to the
point x, at which we are evaluating the function, according to the data
point x_n, from the data set. Now we get an influence from every
point in the data set. And those influences will have
a parameter that reflects the value, as we will see in a moment,
of the target here. So it will be affected by y_n. That's what the influence is-- having
the value y_n propagate. So I'm not going to put it as y_n here. I'm just going to put it generically
as a weight to be determined. And we'll find that it's very
much correlated to y_n. And then we will sum up all of these
influences, from all the data points, and you have your model. So this is the standard model
for radial basis functions.
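Written out in the lecture's notation (a reconstruction of what the slide shows), the hypothesis is

    h(x) = \sum_{n=1}^{N} w_n \, \exp\!\left( -\gamma \, \lVert x - x_n \rVert^2 \right)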
Now let me, in terms of this slide, describe why it is called a radial basis function. It's radial because of this. And it's a basis function
because of this. This is your building block. You could use another basis function. So you could have another shape, that
is also symmetric around the center, and has the influence in a different way. And we will see an example later on. But this is basically the model
in its simplest form, and its most popular form. Most people will use
a Gaussian like this. And this will be the functional
form for the hypothesis. Now we have the model. The next question we normally ask is
what is the learning algorithm? So what is a learning algorithm
in general? You want to find the parameters. And we call the parameters
w_1 up to w_N. And they have this functional form. So I put them in purple now, because
they are the variables. Everything else is fixed. And we would like to find the w_n's
that minimize some sort of error. We base that error on the
training data, obviously. So what I'm going to do now, I'm going
to evaluate the hypothesis on the data points, and try to make them match the target
value on those points-- try to make them match y. As I said, w_n won't be exactly y_n,
but it will be affected by it. Now there is an interesting point of
notation, because the points appear explicitly in the model. x_n is the n-th training input. And now I'm going to evaluate this on
a training point, in order to evaluate the in-sample error. So because of this, there will
be an interesting notation. When we, let's say, ask ambitiously to
have the in-sample error be 0. I want to be exactly right
on the data points. I should expect that I will
be able to do that. Why? Because really I have quite a number
of parameters here, don't I? I have N data points. And I'm trying to learn
N parameters. Notwithstanding the generalization
ramifications of that statement, it should be easy to get parameters
that really knock down the in-sample error to 0. So in doing that, what I'm going to do,
I'm going to apply this to every point x_n, and ask that the output of
the hypothesis be equal to y_n. No error at all. So indeed, the in-sample
error will be 0. Let's substitute in
the equation here. And this is true for all n up to
N, and here is what you have. First, you realize that I changed
the name of the dummy variable, the index here. I changed it from n to m. And this goes with x_m here. The reason I did that, because I'm
going to evaluate this on x_n. And obviously, you shouldn't have
recycling of the dummy variable as a genuine variable. So in this case, you want this quantity,
which will in this case be the evaluation of h at the point x_n. You want this to be equal to y_n. That's the condition. And you want this to be true for
n equals 1 to N. Not that difficult to solve. So let's go for the solution. These are the equations. And we ask ourselves: how many equations
and how many unknowns? Well, I have N data points. I'm listing N of these equations. So indeed, I have N equations. How many unknowns do I have? Well, what are the unknowns? The unknowns are the w's. And I happen to have N unknowns. That's familiar territory. All I need to do is just solve it. Let's put it in matrix form,
which will make it easy. Here is the matrix form, with all
the coefficients for n and m. You can see that this
goes from 1 to N. And the second guy goes from 1 to N. These are the coefficients. You multiply this by a vector of w's. So I'm putting all the N equations
at once, as in matrix form. And I'm asking this to be equal
to the vector of y's. Let's call the matrices something. This matrix I'm going to call phi. And I am recycling the notation phi. phi used to be the nonlinear
transformation. And this is indeed a nonlinear
transformation of sorts. Slight difference that we'll discuss. But we can call it phi. And then these guys will be called
the standard name, the vector w and the vector y. What is the solution for this? All you ask for, in order to guarantee
a solution, is that phi be invertible, and under that condition, the
solution is very simply: w equals the inverse of phi times y. In that case, you interpret your
solution as exact interpolation, because what you are really doing is, on
the points that you know the value, which are the training points, you
are getting the value exactly. That's what you solved for. And now the kernel, which is the Gaussian
in this case, what it does is interpolate between the points to give
you the value on the other x's. And it's exact, because you get it
exactly right on those points.
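Here is a minimal numpy sketch of that exact interpolation (not the lecture's code; the function names are illustrative). It builds the N-by-N matrix phi with entries exp(-gamma ||x_n - x_m||^2), solves phi w = y with a linear solve rather than an explicit inverse, and assumes phi is invertible, as stated above.

```python
import numpy as np

def rbf_interpolate(X, y, gamma):
    """Exact RBF interpolation: solve Phi w = y, where
    Phi[n, m] = exp(-gamma * ||x_n - x_m||^2).  Assumes Phi is invertible."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # N x N squared distances
    Phi = np.exp(-gamma * sq_dist)
    return np.linalg.solve(Phi, y)        # same as w = Phi^{-1} y, but numerically safer

def rbf_value(X_train, w, gamma, x):
    """Evaluate h(x) = sum_n w_n exp(-gamma * ||x - x_n||^2) at a new point x."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)
    return np.dot(w, np.exp(-gamma * sq_dist))
```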
Now let's look at the effect of gamma. There was a gamma, a parameter, that I considered fixed from the very beginning. This guy-- so I'm highlighting it in red. When I give you a value of
gamma, you carry out the machinery that I just described. But you suspect that gamma
will affect the outcome. And indeed, it will. Let's look at two situations. Let's say that gamma is small. What happens when gamma is small? What happens is that this Gaussian
is wide, going this way. If gamma was large, then I
would be going this way. Now depending obviously on where the
points are, how sparse they are, it makes a big difference whether you are
interpolating with something that goes this way, or something
that goes this way. And it's reflected in this picture. Let's say you take this case. And I have three points
just for illustration. The total contribution of the three
interpolations passes exactly through the points, because this is what I solved
for. That's what I insisted on. But the small gray ones here
are the contribution according to each of them. So this would be w_1, w_2, w_3
if these are the points. And when you add w_1 times the Gaussian,
plus w_2 times the Gaussian, et cetera. You get a curve that gives you
exactly the y_1, y_2, and y_3. Now because of the width, there
is an interpolation here that is successful. Between two points, you can
see that there is a meaningful interpolation. If you go for a large gamma,
this is what you get. Now the Gaussians are still there. You may see them faintly. But they die out very quickly. And therefore, in spite of the fact
that you are still satisfying your equations, because that's what you solved
for, the interpolation here is very poor because the influence of this
point dies out, and the influence of this point dies out. So in between, you just get nothing. So clearly, gamma matters. And you probably, in your mind, think
that gamma matters also in relation to the distance between the points,
because that's what the interpolation is. And we will discuss the choice
of gamma towards the end. After we settle all the other
parameters, we will go and visit gamma and see how we can choose it wisely. With this in mind, we have a model. But that model, if you look at
it, is a regression model. I consider the output
to be real-valued. And I match the real-valued output
to the target output, which is also real-valued. Often, we will use RBF's
for classification. When you look at h of x, which used
to be regression this way-- it gives you a real number. Now we are going to take, as usual, the
sign of this quantity, +1 or -1, and interpret the output
as a yes/no decision. And we would like to ask ourselves: how
do we learn the w's under these conditions? That shouldn't be a very alien situation
to you, because you have seen before linear regression used
for classification. That is pretty much what we
are going to do here. We are going to focus on the inner part,
which is the signal before we take the sign. And we are going to try to make the
signal itself match the +1,-1 target, like we did when we used linear
regression for classification. And after we are done, since we are
trying hard to make it +1 or -1, and if we are successful--
we get an exact solution, then obviously the sign of it will
be +1 or -1 if you're successful. If you are not successful, and there is
an error, as will happen in other cases, then at least since you try to
make it close to +1, and you try to make the other one close to -1,
you would think that the sign, at least, will agree with
+1 or -1. So the signal here is what used to
be the whole hypothesis value. And what you're trying to do, you are
trying to minimize the mean squared error between that signal and
y, knowing that y actually-- on the training set-- knowing that
y is only +1 or -1. So you solve for that. And then when you get s, you report
the sign of that s as your value. So we are ready to use the solution we
had before in case we are using RBF's for classification. Now we come to the observation that
the radial basis functions are related to other models. And I'm going to start with
a model that we didn't cover. It's extremely simple to
cover in five minutes. And it shows an aspect of radial basis
functions that is important. This is the nearest-neighbor method. So let's look at it. The idea of nearest neighbor is
that I give you a data set. And each data point has a value y_n. Could be a label, if you're talking
about classification, could be a real value. And what you do for classifying other
points, or assigning values to other points, is very simple. You look at the closest point within
that training set, to the point that you are considering. So you have x. You look at what is x_n in the
training set that is closest to me, in Euclidean distance. And then you inherit the label, or
the value, that that point has. Very simplistic. Here is a case of classification. The data set are the red pluses
and the blue circles. And what I am doing is that I am
applying this rule of classifying every point on this plane, which
is X, the input space, according to the label of the nearest
point within the training set. As you can see, if I take a point
here, this is the closest. That's why this is pink. And here it's still the closest. Once I'm here, this guy
becomes the closest. And therefore, it gets blue. So you end up, as a result of
that, as if you are breaking the plane into cells. Each of them has the label of a point
in the training set that happens to be in the cell. And this tessellation of the plane,
into these cells, describes the boundary for your decisions. This is the nearest-neighbor method. Now, if you want to implement this
using radial basis functions, there is a way to implement it. It's not exactly this, but it has
a similar effect, where you basically are trying to take an influence
of a nearby point. And that is the only thing
you're considering. You are not considering other points. So let's say you take the basis
function, in this case, to look like this. Instead of a Gaussian, it's a cylinder. It's still symmetric--
depends on the radius. But the dependency is very simple. I am constant. And then I go to 0. So it's very abrupt. In that case, I am not
exactly getting this. But what I'm getting is a cylinder
around every one of those guys that inherits the value of that point. And obviously, there is the question of
overlaps and whatnot, and that is what makes a difference from here. In both of those cases,
it's fairly brittle. You go from here to here. You immediately change values. And if there are points in between, you
keep changing from blue, to red, to blue, and so on. In this case, it's even more
brittle, and so on. So in order to make it less abrupt, the
nearest neighbor is modified to become K nearest neighbors. That is,
instead of taking the value of the closest point, you look, let's say,
for the three closest points, or the five closest points, or the seven
closest points, and then take a vote. If most of them are +1, you
consider yourself +1. That helps even things
out a little bit.
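As a side illustration (not from the slides), here is a minimal numpy sketch of the nearest-neighbor rule and its K-nearest-neighbor version; with k equal to 1, you inherit the label of the closest point, and with an odd k greater than 1, you take the vote just described.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=1):
    """K-nearest-neighbor rule: find the k closest training points in
    Euclidean distance and take a majority vote of their +/-1 labels."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)   # squared distances to all training points
    nearest = np.argsort(sq_dist)[:k]              # indices of the k closest points
    return np.sign(np.sum(y_train[nearest]))       # vote; with k = 1, just inherit the label
```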
So an isolated guy in the middle, that doesn't belong, gets filtered out by this. This is a standard way
of smoothing, so to speak, the surface here. It will still be very abrupt going
from one point to another, but at least the number of fluctuations
will go down. The way you smoothen the radial basis
function is, instead of using a cylinder, you use a Gaussian. So now it's not like I have an influence,
I have an influence, I have an influence, I don't have any influence. No, you have an influence, you have
less influence, you have even less influence, and eventually you have
effectively no influence because the Gaussian went to 0. And in both of those cases, you can
consider the model, whether it's nearest neighbor or K nearest neighbors,
or a radial basis function with different bases. You can consider it as
a similarity-based method. You are classifying points according to
how similar they are to points in the training set. And the particular form of applying
the similarity is what defines the algorithm, whether it's this way or
that way, whether it's abrupt or smooth, and whatnot. Now let's look at the model we had,
which is the exact-interpolation model and modify it a little bit, in order to
deal with a problem that you probably already noticed, which
is the following. In the model, we have
N parameters, w-- should be w_1 up to w_N. And it is based on
N data points. I have N parameters. I have N data points. We have alarm bells that call for a red
color, because right now, you usually have the generalization in your mind
related to the ratio between data points and parameters, parameters being
more or less a VC dimension. And therefore, in this case, it's
pretty hopeless to generalize. It's not as hopeless as in other cases,
because the Gaussian is a pretty friendly guy. Nonetheless, you might consider the
idea that I'm going to use radial basis functions, so I'm
going to have an influence, symmetric and all of that. But I don't want to have every
point have its own influence. What I'm going to do, I'm going to elect
a number of important centers for the data, have these as my centers,
and have them influence the neighborhood around them. So what you do, you take K,
which is the number of centers in this case, and hopefully it's much smaller
than N, so that the generalization worry is mitigated. And you define the centers-- these are vectors, mu_1 up to mu_K-- as the centers of
the radial basis functions, instead of having x_1 up to x_N, the data points
themselves, being the centers. Now those guys live in the same
space, let's say in a d-dimensional Euclidean space. These are exactly in the same space,
except that they are not data points. They are not necessarily data points. We may have elected some of them as
being important points, or we may have elected points that are simply
representative, and don't coincide with any of those points. Generically, there will
be mu_1 up to mu_K. And in that case, the functional form
of the radial basis function changes form, and it becomes this. Let's look at it. Used to be that we are counting
from 1 to N, now from 1 to K. And we have w. So indeed, we have fewer parameters. And now we are comparing the x that we
are evaluating at, not with every point, but with every center. And according to the distance from that
center, the influence of that particular center, which is captured
by w_k is contributed. And you take the contribution of all
the centers, and you get the value. Exactly the same thing we did before
except, with this modification, that we are using centers instead of points.
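In symbols (a reconstruction in the lecture's notation), the modified hypothesis is

    h(x) = \sum_{k=1}^{K} w_k \, \exp\!\left( -\gamma \, \lVert x - \mu_k \rVert^2 \right)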
So the parameters here now are interesting, because the w_k's are parameters. And I'm supposedly going through this
entire exercise because I didn't like having N parameters. I want only K parameters. But look what we did. mu_k's now are parameters, right? I don't know what they are. And I have K of them. That's not a worry, because I already
said that K is much smaller than N. But each of them is a d-dimensional
vector, isn't it? So that's a lot of parameters. If I have to estimate those, et
cetera, I haven't made a lot of progress in this exercise. But it turns out that I will be able,
through a very simple algorithm, to estimate those without touching the
outputs of the training set, so without contaminating the data. That's the key. Two questions. How do I choose the
centers, which is an interesting question, because I have to choose it
now-- if I want to maintain that the number of parameters here is small-- I have to choose it without really
consulting the y_n's, the values of the output at the training set. And the other question is
how to choose the weights. Choosing the weights shouldn't be that
different from what we did before. It will be a minor modification, because it
has the same functional form. This one is the interesting part, at least the novel part. So let's talk about choosing
the centers. What we are going to do, we are going
to choose the centers as representative of the data inputs. I have N points. They are here, here, and here. And the whole idea is that I don't
want to assign a radial basis function for each of them. And therefore, what I'm going to do, I'm
going to have a representative. It would be nice, for every group of
points that are nearby, to have a center near to them, so that
it captures this cluster. This is the idea. So you are now going to take x_n, and
take a center which is the closest to it, and assign that point to it. Here is the idea. I have the points spread around. I am going to select centers. Not clear how do I choose the centers. But once you choose them, I'm going to
consider the neighborhood of the center within the data set, the
x_n's, as being the cluster that has that center. If I do that, then those points are
represented by that center, and therefore, I can say that their
influence will be propagated through the entire space by the radial
basis function that is centered around this one. So let's do this. It's called K-means clustering, because
the center for the points will end up being the mean of the points,
as we'll see in a moment. And here is the formalization. You split the data points, x_1 up to
x_N, into groups-- clusters, so to speak-- hopefully points that
are close to each other. And you call these S_1 up to S_K. Each cluster will have
a center that goes with it. And what you minimize, in order to make
this a good clustering, and these good representative centers, is
to try to make the points close to their centers. So you take this for every
point you have. But you sum up over the
points in the cluster. So you take the points in the cluster,
whose center is this guy. And you try to minimize the mean squared
error there, mean squared error in terms of Euclidean distance. So this takes care of one
cluster, S_k. You want this to be small
over all the data. So what you do is you sum this
up over all the clusters. That becomes your objective
function in clustering.
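Written out (a reconstruction in the lecture's notation), the objective is

    \min_{\mu_1, \dots, \mu_K; \; S_1, \dots, S_K} \;\; \sum_{k=1}^{K} \sum_{x_n \in S_k} \lVert x_n - \mu_k \rVert^2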
Someone gives you K. That is, the choice of the actual number of clusters is a different issue. But let's say K is 9. I give you 9 clusters. Then, I'm asking you to find the mu's,
and the break-up of the points into the S_k's, such that this value
assumes its minimum. If you succeed in that, then I can
claim that this is good clustering, and these are good representatives
of the clusters. Now I have some good news,
and some bad news. The good news is that, finally, we
have unsupervised learning. I did this without any reference
to the label y_n. I am taking the inputs, and producing
some organization of them, as we discussed the main goal of
unsupervised learning is. So we are happy about that. Now the bad news. The bad news is that the problem,
as I stated it, is NP-hard in general. It's a nice unsupervised problem,
but not so nice. It's intractable, if you want
to get the absolute minimum. So our goal now is to go around it. That sort of problem being NP-hard
never discouraged us. Remember, with neural networks, we said that the absolute minimum of
that error in the general case-- finding it would be NP-hard. And we ended up with saying we will
find some heuristic, which was gradient descent in this case. That led to backpropagation. We'll start with a random configuration
and then descend. And we'll get, not to the global minimum,
the finding of which is NP-hard, but a local minimum, hopefully
a decent local minimum. We'll do exactly the same thing here. Here is the iterative algorithm for
solving this problem, the K-means. And it's called Lloyd's algorithm. It is extremely simple, to the level
where the contrast between this algorithm-- not only in the
specification of it, but how quickly it converges-- and the fact that finding
the global minimum is NP-hard, is rather mind-boggling. So here is the algorithm. What you do is you iteratively
minimize this quantity. You start with some configuration,
and get a better configuration. And as you see, I have now two guys in
purple, which are my parameters here. mu's are parameters by definition. I am trying to find what they are. But also the sets S_k, the
clusters, are parameters. I want to know which
guys go into them. These are the two things
that I'm determining. So the way this algorithm does it is that
it fixes one of them, and tries to minimize the other. It tells you for this particular
membership of the clusters, could you find the optimal centers? Now that you found the
optimal centers-- forget about the clustering that
resulted in that-- these are centers, could you find the best clustering
for those centers? And keep repeating back and forth. Let's look at the steps. You are minimizing this
with respect to both, so you take one at a time. Now you update the value of mu. How do you do that? You take the fixed clustering that
you have-- so you have already a clustering that is inherited
from the last iteration. What you do is you take the
mean of that cluster. You take the points that belong
to that cluster. You add them up and divide
by their number. Now in our mind, you know that this must
be pretty good in minimizing the mean squared error, because the squared
error to the mean is the smallest of the squared errors to any point.
The mean happens to be the closest to the points collectively, in terms
of mean squared distance. So if I do that, I know that this is
a good representative, if this was the real cluster. So that's the first step. Now I have new mu_k's. So you freeze the mu_k's. And you completely forget about
the clustering you had before. Now you are creating new clusters. And the idea is the following. You take every point, and you measure
the distance between it and mu_k, the newly acquired mu_k. And you ask yourself: is the closest
of the mu's that I have? So you compare this with
all of the other guys. And if it happens to be smaller,
then you declare that this x_n belongs to S_k. You do this for all the points, and
you create a full clustering. Now, if you look at this step, we argued
that this reduces the error. It has to, because you picked the mean
for every one of them, and that will definitely not increase the error. This will also decrease the error,
because the worst that it can do is take a point from one cluster
and put it in another. But in doing that, what did it do? It picked the one that is closest. So the term that used to be here is
now smaller, because it went to the closer guy. So this one reduces the value. This one reduces the value. You go back and forth, and the
quantity is going down. Are we ever going to converge? Yes, we have to. Because by structure,
we are only dealing with a finite number of points. And there are a finite number of
possible values for the mu's, given the algorithm, because they have to be
the average of a subset of the points. So I have 100 points. There will be a finite,
but tremendously big, number of possible values. But it's finite. All I care about, it's a finite number. And as long as it's finite,
and I'm going down, I will definitely hit a minimum. It will not be the case that it's
a continuous thing, and I'm doing half, and then half again, and half
of half, and never arrive. Here, you will arrive perfectly
at a point.
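Here is a minimal numpy sketch of Lloyd's algorithm as just described (not the lecture's code; the initialization at random data points and the stopping rule are implementation choices for illustration). It alternates between assigning every point to its nearest center and moving each center to the mean of its cluster, and stops when the assignment no longer changes.

```python
import numpy as np

def lloyd(X, K, max_iter=100, seed=0):
    """Lloyd's algorithm for K-means clustering (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # random initial centers
    assign = None
    for _ in range(max_iter):
        # assign each x_n to its closest center (this defines the clusters S_k)
        sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)  # N x K
        new_assign = np.argmin(sq_dist, axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                    # assignments stopped changing: converged
        assign = new_assign
        # move each center to the mean of its cluster (an empty cluster keeps its center)
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign
```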
The catch is that you're converging to a good, old-fashioned local minimum. Depending on your initial configuration,
you will end up with one local minimum or another. But again, exactly the same situation
as we had with neural networks. We did converge to a local minimum
with backpropagation, right? And that minimum depended
on the initial weights. Here, it will depend on the initial
centers, or the initial clustering, whichever way you want to begin. And the way you do it is, try
different starting points. And you get different solutions. And you can evaluate which one is better
because you can definitely evaluate this objective function for
all of them, and pick one out of a number of runs. That usually works very nicely. It's not going to give
you the global one. But it's going to give you a very decent
clustering, and very decent representative mu's. Now, let's look at Lloyd's
algorithm in action. And I'm going to take the problem
that I showed you last time for the RBF kernel. This is the one we're going to
carry through, because we can relate to it now. And let's see how the algorithm works. The first step in the algorithm,
give me the data points. OK, thank you. Here are the data points. If you remember, this was the target. The target was slightly nonlinear. We had -1 and +1. And we have them with this color. And that is the data we have. First thing, I only want the inputs. I don't see the labels. And I don't see the target function. You probably don't see the
target function anyway. It's so faint! But really, you don't see it at all. So I'm going now to take away the
target function and the labels. I'm only going to keep the
position of the inputs. So this is what you get. Looks more formidable now, right? I have no idea what the function is. But now we realize one
interesting point. I'm going to cluster those, without
any benefit of the label. So I could have clusters that belong
to one category, +1 or -1. And I could, as well, have clusters
that happen to be on the boundary, half of them are +1, or
half of them -1. That's the price you pay when you
do unsupervised learning. You are trying to get similarity, but
the similarity is as far as the inputs are concerned, not as far
as the behavior with the target function is concerned. That is key. So I have the points. What do I do next? You need to initialize the centers. There are many ways of doing this. There are a number of methods. I'm going to keep it simple here. And I'm going to initialize
the centers at random. So I'm just going to pick 9 points. And I'm picking 9
for a good reason. Remember last lecture when we did
the support vector machines. We ended up with 9 support vectors. And I want to be able to compare them. So I'm fixing the number, in order to be
able to compare them head to head. So here are my initial centers. Totally random. Looks like a terribly stupid thing to
have three centers near each other, and have this entire area empty. But let's
hope that Lloyd's algorithm will place them a little bit more strategically. Now you iterate. So now I would like you
to stare at this. I will even make it bigger. Stare at it, because I'm going
to do a full iteration now. I am going to do re-clustering, and
re-evaluation of the mu, and then show you the new mu. One step at a time. This is the first step. Keep your eyes on the screen. They moved a little bit. And I am pleased to find that those
guys, that used to be crowded, are now serving different guys. They are moving away. Second iteration. I have to say, this is
not one iteration. These are a number of iterations. But I'm sampling it at a certain rate,
in order not to completely bore you. It would be-- clicking through
to the end of the lecture. And then we would have the clustering
at the end of the lecture, and nothing else! So next iteration. Look at the screen. The movement is becoming smaller. Third iteration. Uh. Just a touch. Fourth. Nothing happened. I actually flipped the slide. Nothing happened. Nothing happened. So we have converged. And these are your mu's. And it does converge very quickly. And you can see now the
centers make sense. These guys have a center. These guys have a center. These guys, and so on. I guess since it started here, it got
stuck here and is just serving two points, or something like that. But more or less, it's
a reasonable clustering. Notwithstanding the fact that
there was no natural clustering for the points. It's not like I generated these
guys from 9 centers. These were generated uniformly. So the clustering is incidental. But nonetheless, the clustering
here makes sense. Now this is a clustering, right? Surprise! We have to go back to this. And now, you look at the clustering
and see what happens. This guy takes points from
both +1 and -1. They look very similar to it, because
it only depended on x's. Many of them are deep inside and,
indeed, deal with points that are the same. The reason I'm making an issue of this is
because the center will serve as a center of influence for
affecting the value of the hypothesis. It will get a w_k, and then it will propagate that w_k
according to the distance from itself. So now the guys that happen to be the
center of positive and negative points will cause me a problem,
because what do I propagate? The +1 or the -1? But indeed, that is the price you pay
when you use unsupervised learning. So this is Lloyd's algorithm
in action. Now I'm going to do something
interesting. We had 9 points that are centers of
unsupervised learning, in order to be able to carry out the influence of
radial basis functions using the algorithm we will have. That's number one. Last lecture, we had
also 9 guys. They were support vectors. They were representative
of the data points. And since the 9 points were
representative of the data points, and the 9 centers here are representative
of the data points, it might be illustrative to put them next
to each other, to understand what is common, what is different, where
did they come from, and so on. Let's start with the RBF centers. Here they are. And I put them on the data that is
labeled, not that I got them from the labeled data, but just to have the
same picture right and left. So these are where the centers are. Everybody sees them clearly. Now let me remind you of what
the support vectors from last time looked like. Here are the support vectors. Very interesting, indeed. The support vectors obviously
are here, all around here. They had no interest whatsoever in
representing clusters of points. That was not their job. Here these guys have absolutely
nothing to do with the separating plane. They didn't even know that there
was a separating surface. They just looked at the data. And you basically get what
you set out to do. Here you were representing
the data inputs. And you've got a representation
of the data inputs. Here you were trying to capture
the separating surface. That's what support vectors do. They support the separating surface. And this is what you got. These guys are generic centers. They are all black. These guys, there are some blue and
some red, because they are support vectors that come with a label,
because of the value y_n. So some of them are on this side. Some of them are on this side. And indeed, they serve completely
different purposes. And it's rather remarkable that we get
two solutions using the same kernel, which is RBF kernel, using such
an incredibly different diversity of approaches. This was just to show you the
difference between when you do the choice of important points in
an unsupervised way, and here patently in a supervised way. Choosing the support vectors was
very much dependent on the value of the target. The other thing you need to notice is
that the support vectors have to be points from the data. The mu's here are not points
from the data. They are average of those points. But they end up anywhere. So if you actually look, for example,
at these three points: you go here, and one of them became a center, one
of them became a support vector. On the other hand, this point
doesn't exist here. It just is a center that happens
to be anywhere in the plane. So now we have the centers. I will give you the data. I tell you K equals 9. You go and you do your
Lloyd's algorithm. And you come up with the centers,
and half the problem of the choice is now solved. And it's the big half, because the
centers are vectors of d dimensions. And now I found the centers, without
even touching the labels. I didn't touch y_n. So I know that I didn't
contaminate anything. And indeed, I have only the weights,
which happen to be K weights, to determine using the labels. And therefore, I have good
hopes for generalization. Now I look here. I froze it-- it became black now, because
it has been chosen. And now I'm only trying
to choose these guys, w_k. This is y_n. And I ask myself the same question. I want this to be true for all
the data points if I can. And I ask myself: how many equations,
how many unknowns? I end up with N equations. Same thing. I want this to be true for
all the data points. I have N data points. So I have N equations. How many unknowns? The unknowns are the w's. And I have K of them. And oops, K is less than N. I have
more equations than unknowns. So something has to give. And this fellow is the
one that has to give. That's all I can hope for. I'm going to get it close, in a mean
squared sense, as we have done before. I don't think you'll be surprised
by anything in this slide. You have seen this before. So let's do it. This is the matrix phi now. It's a new phi. It has K columns and N rows. So according to our criteria that K is
smaller than N, this is a tall matrix. You multiply it by w, which
are K weights. And you should get approximately y. Can you solve this? Yes, we have done this before
in linear regression. All you need is to make sure that
phi transposed phi is invertible. And under those conditions, you
have one-step solution, which is the pseudo-inverse. You take phi transposed phi to the -1,
times phi transposed y. And that will give you the value of
w that minimizes the mean squared difference between these guys. So you have the pseudo-inverse,
instead of the exact interpolation. And in this case, you are not guaranteed
that you will get the correct value at every data point. So you are going to be making
an in-sample error. But we know that this
is not a bad thing. On the other hand, we are only
determining K weights. So the chances of generalization
are good.
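Putting the two halves together, here is a minimal numpy sketch of fitting the K-center model (not the lecture's code; the function name is illustrative). It builds the tall N-by-K matrix phi from the Lloyd centers and solves the least-squares problem, which is exactly the pseudo-inverse solution w = (phi^T phi)^{-1} phi^T y; lstsq is used instead of forming the inverse explicitly. For classification, you would then take the sign of the resulting signal, as described earlier.

```python
import numpy as np

def fit_rbf_weights(X, y, centers, gamma):
    """Least-squares fit of w for h(x) = sum_k w_k exp(-gamma ||x - mu_k||^2),
    i.e. the pseudo-inverse solution w = (Phi^T Phi)^{-1} Phi^T y."""
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)  # N x K
    Phi = np.exp(-gamma * sq_dist)
    # a bias column of 1's can be appended to Phi here (a bias term is discussed a bit later)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```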
Now, I would like to take this and put it as a graphical network. And this will help me relate
it to neural networks. This is the second link. We already related RBF to nearest
neighbor methods, similarity methods. Now we are going to relate
it to neural networks. Let me first put the diagram. Here's my illustration of it. I have x. I am computing the radial aspect, the
distance from mu_1 up to mu_K, and then handing it to a nonlinearity, in this
case the Gaussian nonlinearity. You can have other basis functions. Like we had the cylinder in one case. But cylinder is a bit extreme. But there are other functions. You get features that are combined
with weights, in order to give you the output. Now this one could be just passing the
sum if you're doing regression, could be hard threshold if you're doing
classification, could be something else. But what I care about is that this
configuration looks familiar to us. It's layers. I select features. And then I go to output. Let's look at the features. The features are these fellows, right? Now if you look at these features, they
depend on the mu's. And the mu's, in general, are parameters. If I didn't have this slick Lloyd's
algorithm, and K-means, and unsupervised thing, I need to determine
what these guys are. And once you determine them, the value
of the feature depends on the data set. And when the value of the feature
depends on the data set, all bets are off. It's no longer a linear model, pretty
much like a neural network doing the first layer, extracting the features. Now the good thing is that, because we
used only the inputs in order to compute mu, it's almost linear. We've got the benefit of the
pseudo-inverse because in this case, we didn't have to go back and adjust mu
just because we don't like the value of the output. These were frozen forever
based on inputs. And then, we only had to get the w's. And the w's now look like multiplicative
factors, in which case it's linear in those w's. And we get the solution. Now in radial basis functions, there
is often a bias term added. You don't only get those. You get either w_0 or b. And it enters the final layer. So you just add another weight that
is, this time, multiplied by 1. And everything remains the same. The phi matrix has another
column because of this. And you just do the machinery
you had before. Now let's compare it
to neural networks. Here is the RBF network. We just saw it. And I pointed x in red. This is what gets passed to this, gets
the features, and gets you the output. And here is a neural network that
is comparable in structure. So you start with the input. You start with the input. Now you compute features. And here you do. And the features here depend
on the distance. And they are such that, when the distance
is large, the influence dies. So if you look at this value, and this
value is huge, you know that this feature will have 0 contribution. Here this guy, big or small, is
going to go through a sigmoid. So it could be huge, small, negative. And it goes through this. So it always has a contribution. So one interpretation is that, what
radial basis function networks do, is look at local regions in the space and
worry about them, without worrying about the far-away points. I have a function that
is in this space. I look at this part, and
I want to learn it. So I get a basis function
that captures it, or a couple of them, et cetera. And I know that by the time I go to
another part of the space, whatever I have done here is not going to
interfere, whereas in the other case of neural networks, it did interfere
very, very much. And the way you actually got something
interesting, is making sure that the combinations of the guys you
got give you what you want. But it's not local as
it is in this case. So this is the first observation. The second observation is that here,
the nonlinearity we call phi. The corresponding nonlinearity
here is theta. And then you combine with
the w's, and you get h. So very much the same, except
the way you extract features here is different. And w here was a full-fledged parameter
that depended on the labels. We use backpropagation
in order to get those. So these are learned features,
which makes it completely not a linear model. This one, if we learned mu's based on
their effect on the output, which would be a pretty hairy algorithm,
that would be the case. But we didn't. And therefore, this is almost
linear in this part. And this is why we got
this part fixed. And then we got this one using
the pseudo-inverse. One last thing, this is
a two-layer network. And this is a two-layer network. And pretty much any two-layer network, of
this type of structure, lends itself to being a support vector machine. The first layer takes
care of the kernel. And the second one is the linear
combination that is built-in in support vector machines. So you can solve a support vector
machine by choosing a kernel. And you can picture in your mind that
I have one of those, where the first part is getting the kernel. And the second part is getting
the linear part. And indeed, you can implement
neural networks using support vector machines. There is a neural-network kernel
for support vector machines. But it deals only with two layers, as you
see here, not multiple layers as the general neural network would do. Now, the final parameter to choose here
was gamma, the width of the Gaussian. And we now treat it as
a genuine parameter. So we want to learn it. And because of that, it turned purple. So now mu is fixed, according
to Lloyd. Now I have parameters w_1 up to w_K. And then I have also gamma. And you can see this is actually pretty
important because, as you saw, if we choose it wrong, the
interpolation becomes very poor. And it does depend on the spacing
in the data set. So it might be a good idea to choose
gamma in order to also minimize the in-sample error--
get good performance. So of course, I could do that-- and I could do it for
w for all I care-- I could do it for all the parameters,
because here is the value. I am minimizing mean squared error. So I'm going to compare this with
the value of the y_n when I plug in x_n. And I get an in-sample error,
which is mean squared. I can always find parameters that
minimize that, using gradient descent, the most general one. Start with random values, and then
descend, and then you get a solution. However, it would be a shame to do that,
because these guys have such a simple algorithm that goes with them. If gamma is fixed, this is a snap. You do the pseudo-inverse, and
you get exactly that. So it is a good idea to treat gamma separately
from the w's. Gamma is inside the exponential, and this and that. I don't think I have any hope
of finding a shortcut for it. I probably will have to do gradient
descent for this guy. But I might as well do gradient
descent for this guy, not for these guys. And the way this is done is
by an iterative approach. You fix one, and solve for the others. This seems to be the theme
of the lecture. And in this case, it is a pretty
famous algorithm-- a variation of that algorithm. The algorithm is called EM,
Expectation-Maximization. And it is used for solving the case
of mixture of Gaussians, which we actually have, except that we are
not calling them probabilities. We are calling them bases that
are implementing a target. So here is the idea. Fix gamma. That we have done before. We have been fixing gamma all through. If you want to solve for w based on
fixing gamma, you just solve for it using the pseudo-inverse. Now we have w's. Now you fix them. They are frozen. And you minimize the error, the squared
error, with respect to gamma, one parameter. It would be pretty easy to gradient
descent with respect to one parameter. You find the minimum. You find gamma. Freeze it. And now, go back to step one
and find the new w's that go with the new gamma. Back and forth, converges
very, very quickly. And then you will get a combination
of both w's and gamma.
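A minimal sketch of that alternation (an implementation assumption, not the lecture's code): with gamma fixed, w comes from the pseudo-inverse as before; with w fixed, gamma is improved by a one-dimensional minimization of the in-sample squared error. The lecture suggests gradient descent for the gamma step; here a bounded scalar minimizer from scipy stands in for it, and the bounds and round count are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def design_matrix(X, centers, gamma):
    """Phi[n, k] = exp(-gamma * ||x_n - mu_k||^2)."""
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)

def fit_w_and_gamma(X, y, centers, gamma=1.0, rounds=10):
    """Alternate: solve for w with gamma fixed, then tune gamma with w fixed."""
    w = None
    for _ in range(rounds):
        Phi = design_matrix(X, centers, gamma)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                 # step 1: pseudo-inverse for w
        in_sample = lambda g: np.sum((design_matrix(X, centers, g) @ w - y) ** 2)
        gamma = minimize_scalar(in_sample, bounds=(1e-3, 1e3),
                                method='bounded').x                 # step 2: minimize over gamma
    return w, gamma
```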
And because it is so simple, you might be even encouraged to say: why do we have one gamma? I have data sets. It could be that these data points are
close to each other, and one data point is far away. Now if I have a center here that has to
reach out further, and a center here that doesn't have to reach out, it looks
like a good idea to have different gammas for those guys. Granted. And since this is so simple, all you
need to do now is have K parameters, gamma_k, so you doubled
the number of parameters. But the number of parameters
is small to begin with. And now you do the first step exactly. You fix the vector gamma,
and you get these guys. And now we are doing gradient descent
in a K-dimensional space. We have done that before. It's not a big deal. You find the minimum with respect
to those, freeze them, and go back and forth. And in that case, you adjust the width
of the Gaussian according to the region you are in the space. Now very quickly, I'm going to go
through two aspects of RBF, one of them relating it to kernel methods,
which we already have seen the beginning of. We have used it as a kernel. So we would like to compare
the performance. And then, I will relate
it to regularization. It's interesting that RBF's, as I
described them-- like intuitive, local, influence, all of that-- you will find
in a moment that they are completely based on regularization. And that's how they arose in the first
place in function approximation. So let's do the RBF versus
its kernel version. Last lecture we had a kernel,
which is the RBF kernel. And we had a solution with
9 support vectors. And therefore, we ended up with
a solution that implements this. Let's look at it. I am getting a sign that's a built-in
part of support vector machines. They are for classification. I had this guy after I expanded the z
transposed z, in terms of the kernel. So I am summing up over only
the support vector. There are 9 of them. This becomes my parameter, the weight. It happens to have the
sign of the label. That makes sense because if I want to
see the influence of x_n, it might as well be that the influence of x_n
agrees with the label of x_n. That's how I want it. If it's +1, I want the
+1 to propagate. So because the alphas are non-negative
by design, they get their sign from the label of the point. And now the centers are points
from the data set. They happen to be
the support vectors. And I have a bias there. So that's the solution we have. What did we have here? We had the straight RBF implementation,
with 9 centers. I am putting the sign in blue, because
this is not an integral part. I could have done a regression part. But since I'm comparing here, I'm going
to take the sign and consider this a classification. I also added a bias, also in blue, because
this is not an integral part. But I'm adding it in order to
be exactly comparable here. So the number of terms here is 9. The number of terms here is 9. I'm adding a bias. I'm adding a bias. Now the parameter here is called w,
which takes the place of this guy. And the centers here are general
centers, mu_k's. These do not have to be points from
the data set; indeed, they most likely are not. And they play the role here.
model in front of me. And in one of them I did what? Unsupervised learning of centers,
followed by a pseudo-inverse. And I used linear regression
for classification. That's one route. What did I do here? Maximize the margin, equate with a kernel, and
pass to quadratic programming. Completely different routes. And finally, I have a function
that is comparable. So let's see how they perform. Just to be fair to the poor straight RBF
implementation, the data doesn't cluster naturally. And I chose the 9 because
I got 9 here. So the SVM has the home
advantage here. This is just a comparison. I didn't optimize the number of
things, I didn't do anything. So if this guy ends up performing
better, OK, it's better. SVM is good. But it really has a little
bit of unfair advantage in this comparison. But let's look at what we have. This is the data. Let me magnify it, so that
you can see the surface. Now let's start with
the regular RBF. Both of them are RBF, but
this is the regular RBF. This is the surface you get after you
do everything I said, the Lloyd, and the pseudo-inverse, and whatnot. And the first thing you realize is that
the in-sample error is not zero. There are points that
are misclassified. Not a surprise. I had only K centers. And I'm trying to minimize
mean squared error. It is possible that some points,
close to the boundary, will go one way or the other. I'm interpreting the signal as being
closer to +1 or -1. Sometimes it will cross. And that's what I get. This is the guy that I get. Here is the guy that I got
last time from the SVM. Rather interesting. First, it's better-- because I have the
benefit of looking at the green, the faint green line, which is the target. And I am definitely closer to the green
one, in spite of the fact that I never used it explicitly
in the computation. I used only the data, the
same data for both. But this tracks it better. It achieves zero in-sample error. It's fairly close to this guy. So here are two solutions coming
from two different worlds, using the same kernel. And I think by the time you have done
a number of problems using these two approaches, you have it cold. You know exactly what is going on. You know the ramifications of doing
unsupervised learning, and what you miss out by choosing the centers without
knowing the label, versus the advantage of support vectors. The final items that I promised
was RBF versus regularization. It turns out that you can derive RBF's
entirely based on regularization. You are not talking about
influence of a point. You are not talking about anything. Here is the formulation from
function approximation that resulted in that. And that is why people consider RBF's to
be very principled, and they have a merit. It is modulo assumptions, as always. And we will see what the
assumptions are. Let's say that you have
a one-dimensional function. So you have a function. And you have a bunch of points,
the data points. And what you are doing now is you are
trying to interpolate and extrapolate between these points, in order to get the
whole function, which is what you do in function approximation-- what you
do in machine learning if your function happens to be one-dimensional. What do you do in this case? There are usually two terms. In one of them, you try to minimize the in-sample error. And the other one is regularization, to make sure that your function is not crazy outside. That's what we do. So look at the in-sample error. This is what you do for the in-sample error, notwithstanding the 1 over N, which I took out to simplify the form. You take the value of your hypothesis, compare it with the target value y, and square the difference; this is your in-sample error. Now we are going to add
a smoothness constraint. And in this approach, the smoothness
constraint is always taken, almost always taken, as a constraint
on the derivatives. If I have a function, and I tell you that the second derivative is very large, what does this mean? It means the function is wiggling around rapidly-- that's not smooth. And if I go to the third derivative, it will be the rate of change of that, and so on. So I can go for derivatives
in general. And if you can tell me that the
derivatives are not very large in general, that corresponds, in
my mind, to smoothness. The way they formulated the
smoothness is by taking, generically, the k-th derivative of your hypothesis-- the hypothesis now is a function of x. I can differentiate it. I can differentiate it k times, assuming
that it's parametrized in a way that is analytic. And now I'm squaring it, because
I'm only interested in the magnitude of it. And what I'm going to do, I'm going to
integrate this from minus infinity to plus infinity. This will be an estimate of the
size of the k-th derivative, notwithstanding that it's squared. If this is big, that's
bad for smoothness. If this is small, that's
good for smoothness. Now I'm going to up the ante, and combine
the contributions of different derivatives. I am going to combine all the
derivatives with coefficients. If you want only some of them, all you need to do is set the coefficients to zero for the ones you are not using. Typically, you will be using, let's say, the first derivative and the second derivative, and the rest of the coefficients are zero. And you get a condition like that. And now you multiply it by lambda. That's the regularization parameter. And you try to minimize the augmented error here. And the bigger lambda is, the more insistent you are on smoothness versus fitting. And we have seen all of that before.
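Putting the two pieces together, the quantity being minimized is the augmented error. The coefficients a_k weighting the derivative terms are introduced here just for the formula; the lecture does not name them.

\[
E_{\text{aug}}(h) \;=\; \sum_{n=1}^{N} \bigl(h(x_n) - y_n\bigr)^2
\;+\; \lambda \sum_{k} a_k \int_{-\infty}^{+\infty} \left(\frac{d^k h}{dx^k}\right)^{\!2} dx .
\]

The interesting thing is that, if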
you actually solve this under certain conditions and assumptions, and after the incredibly hairy mathematics that goes with it, you end up with radial basis functions. What does that mean? It really means: I'm looking for an interpolation. And I'm looking for as smooth
an interpolation as possible, in the sense of the sum of the squares of the
derivatives with these coefficients. It's not stunning that the best
interpolation happens to be Gaussian. That's all we are saying. So it comes out. And that's what gives it a bigger
credibility as being inherently self-regularized,
and whatnot. And what you get is the smoothest interpolation. And that is one interpretation
of radial basis functions. On that happy note, we will stop,
and I'll take questions after a short break. Let's start the Q&A. MODERATOR: First, can you explain again how an SVM simulates a two-level neural network? PROFESSOR: OK. Look at the RBF, in order
to get a hint. What does this feature do? It actually computes
the kernel, right? So think of what this guy is doing
as implementing the kernel. What is it implementing? It's implementing theta, the sigmoidal
function, the tanh in this case, of this guy. Now if you take this as your kernel,
and you verify that it is a valid kernel-- in the case of radial
basis functions, we had no problem with that. In the case of neural networks, believe
it or not, depending on your choice of parameters, that kernel could be a valid kernel corresponding to a legitimate Z space, or it could be an illegitimate kernel. But basically, you use
that as your kernel. And if it's a valid kernel, you carry
out the support vector machinery. So what are you going to get? You are going to get that value of the
kernel, evaluated at different data points, which happen to be
the support vectors. These become your units. And then you get to combine
them using the weights. And that is the second layer
of the neural network. So it will implement a two-layer
neural network this way.
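As a concrete sketch of that correspondence, using scikit-learn's "sigmoid" (tanh) kernel: the data, gamma, coef0, and C below are illustrative placeholders, not values from the lecture, and whether this kernel is a valid kernel depends on those parameters.

```python
import numpy as np
from sklearn.svm import SVC

gamma, coef0 = 1.0, -1.0          # illustrative kernel parameters
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] + X[:, 1])    # placeholder labels in {-1, +1}

# Train an SVM whose kernel is tanh(gamma * <x, x'> + coef0).
clf = SVC(kernel="sigmoid", gamma=gamma, coef0=coef0, C=10.0).fit(X, y)

def two_layer_net(x):
    # First layer: one tanh unit per support vector.
    hidden = np.tanh(gamma * clf.support_vectors_ @ x + coef0)
    # Second layer: combine the units with weights alpha_n * y_n, plus bias.
    return np.sign(clf.dual_coef_[0] @ hidden + clf.intercept_[0])

x_test = np.array([0.3, -0.2])
print(two_layer_net(x_test), clf.predict([x_test])[0])  # the two should agree
```

MODERATOR: In a real example, where you're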
not comparing to support vectors, how do you choose the number of centers? PROFESSOR: This is perhaps
the biggest question in clustering. There is no conclusive answer. There are lots of information
criteria, and this and that. But it really is an open question. That's probably the best
answer I can give. In many cases, there is a relatively
clear criterion. I'm looking at the minimization. And if I increase the number of clusters by one, supposedly the sum of the squared distances should go down, because I have one more parameter to play with. So if I increase the number of clusters by one, and the objective function goes down significantly, then it looks like it was warranted to add this center. And if it doesn't, then maybe it's not a good idea.
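One way to apply that heuristic, as a rough sketch assuming scikit-learn (the data here is a placeholder), is to run Lloyd's algorithm for a range of K and watch where the drop in the clustering objective levels off:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_objective_by_k(X, k_values):
    # Sum of squared distances to the nearest center (KMeans' "inertia")
    # for each candidate K; look for the K where the decrease flattens out.
    return {k: KMeans(n_clusters=k, n_init=10).fit(X).inertia_
            for k in k_values}

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 2))   # placeholder data
for k, obj in kmeans_objective_by_k(X, range(1, 11)).items():
    print(k, round(obj, 3))
```

There are tons of heuristics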
like that. But it is really a difficult question. And the good news is that if you
don't get it exactly, it's not the end of the world. If you get a reasonable number
of clusters, the rest of the machinery works. And you get a fairly comparable
performance. It is very seldom that there is an absolute hit, in terms of the number of clusters that are needed, if the goal is to plug
them in later on for the rest of the RBF machinery. MODERATOR: So cross-validation
would be useful for-- PROFESSOR: Validation would
be one way of doing it. There are so many things to validate with respect to, but this is definitely one of them. MODERATOR: Also, is RBF practical in
applications where there's a high dimensionality of the input space? I mean, does Lloyd's algorithm suffer from high-dimensionality problems? PROFESSOR: Yeah, it's a question of-- distances become funny, or sparsity becomes funny, in higher-dimensional space. So the question of the choice of gamma and other things becomes more critical. And if it's a really high-dimensional
space, and you have few points, then it becomes very difficult
to expect good interpolation. So there are difficulties. But the difficulties are inherent. The curse of dimensionality
is inherent in this case. And I think it's not
particular to RBF's. You use other methods. And you also suffer from
one problem or another. MODERATOR: Can you review
again how to choose gamma? PROFESSOR: OK. This is one way of doing it. Let me-- Here I am trying to take advantage of
the fact that determining a subset of the parameters is easy. If I didn't have that, I would have
treated all the parameters on equal footing, and I would have just used
a general nonlinear optimization, like gradient descent, in order to find all
of them at once, iteratively until I converge to a local minimum
with respect to all of them. But I realize that when gamma is fixed, there is a very simple way, in one step, to get to the w's, and I would like to take advantage of that. The way I'm going to take advantage of it is to separate the variables into two groups and alternate between them, in the spirit of the EM algorithm. And when I fix one of them, when I fix gamma, then I can solve for the w_k's directly. I get them. So that's one step. And then I fix the w's that I have, and then
try to optimize with respect to gamma, according to the
mean squared error. So I take this guy with w's being
constant, gamma being a variable, and I apply this to every point in the
training set, x_1 up to x_N, and take it minus y_n squared, sum them up. This is an objective function. And then get the gradient of that and
try to minimize it, until I get to a local minimum. And when I get to a local minimum, it's a local minimum with respect to this gamma, with the w_k's held constant. There's no question of varying the w_k's in that step. But I get a value of gamma at which I attain a minimum. Now I freeze it, and repeat the iteration. And going back and forth will be far more efficient than doing gradient descent on everything at once, just because the step that involves so many variables is a one-shot. And usually, the EM algorithm
converges very quickly to a very good result. It's a very successful algorithm
in practice.
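A minimal sketch of that alternation is below. The centers, the step size, and the use of a numerical gradient are my own simplifications for illustration, not specifics from the lecture; the centers would typically come from Lloyd's algorithm, as earlier.

```python
import numpy as np

def rbf_matrix(X, centers, gamma):
    # Phi[n, k] = exp(-gamma * ||x_n - mu_k||^2), with a bias column in front.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(X), 1)), np.exp(-gamma * d2)])

def solve_w(X, y, centers, gamma):
    # Step 1: for fixed gamma, the weights come from the pseudo-inverse in one shot.
    return np.linalg.pinv(rbf_matrix(X, centers, gamma)) @ y

def mse(X, y, centers, gamma, w):
    return np.mean((rbf_matrix(X, centers, gamma) @ w - y) ** 2)

def alternate(X, y, centers, gamma=1.0, outer=100, step=0.05, eps=1e-4):
    for _ in range(outer):
        w = solve_w(X, y, centers, gamma)
        # Step 2: with w frozen, take a small descent step on gamma,
        # using a numerical gradient of the mean squared error.
        grad = (mse(X, y, centers, gamma + eps, w)
                - mse(X, y, centers, gamma - eps, w)) / (2 * eps)
        gamma = max(gamma - step * grad, 1e-6)   # keep gamma positive
    return gamma, solve_w(X, y, centers, gamma)
```

MODERATOR: Going back to neural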
networks, now that you mentioned the relation with the SVM's. In practical
problems, is it necessary to have more than one hidden
layer, or is it-- PROFESSOR: Well, in terms
of the approximation, there is an approximation result that tells you you
can approximate everything using a two-layer neural network. And the argument is fairly similar to
the argument that we gave before. So it's not necessary. And if you look at people who are using
neural networks, I would say the minority use more than two layers. So I wouldn't consider the restriction
of two layers dictated by support vector machines as being a very
prohibitive restriction in this case. But there are cases where you need
more than two layers, and in that case, you go just for the
straightforward neural networks, and then you have an algorithm
that goes with that. There is an in-house question. STUDENT: Hi, professor. I have a question about slide one. Why do we come up with this radial basis function? You said it's because the hypothesis is affected by the data point which is closest to x. PROFESSOR: This is the slide
you are referring to, right? STUDENT: Yeah. This is the slide. So is it because you assume that the
target function should be smooth? So that's why we can use this. PROFESSOR: It turns out, in
hindsight, that this is the underlying assumption, because when we looked at
solving the approximation problem with smoothness, we ended up with those
radial basis functions. There is another motivation,
which I didn't refer to. It's a good opportunity to raise it. Let's say that I have a data
set, (x_1, y_1), (x_2, y_2), up to (x_N, y_N). And I'm going to assume
that there is noise. But it's a funny noise. It's not noise in the value y. It's noise in the value x. That is, I can't measure
the input exactly. And I want to take that into
consideration in my learning. The interesting ramification is the following. Assume that there is noise, and let's say that the noise is Gaussian, which is a typical assumption. Then although this is the x that was given to me, the real x could be here, or here, or here. And since I have the value y at that x, the value y itself I'm going to consider to be noiseless in that case. I just don't know which
x it corresponds to. Then you will find that when you solve
this, you realize that what you have to do is make the value of your hypothesis not change much when you change x, because you run the risk of missing it. And if you solve it, you actually end up with an interpolation which is Gaussian in this case. So you can arrive at the same thing
under different assumptions. There are many ways
of looking at this. But definitely smoothness comes in one way or another, whether from the observation here, from the regularization, from the input-noise interpretation, or from other interpretations. STUDENT: OK, I see. Another question is about slide
six, when we choose small gamma or large gamma. Yes, here. So actually here, just from this example,
can we say that definitely small gamma is better than
large gamma here? PROFESSOR: Well,
small is relative. So the question is-- this is related to
the distance between points in the space, because the value of the Gaussian
will decay in that space. And this guy looks great if
the two points are here. But the same guy looks terrible if the
two points are here because, by the time you get here, it
will have died out. So it's all relative. But relatively speaking, it's a good
idea to have the width of the Gaussian comparable to the distances between the
points so that there is a genuine interpolation. And the objective criterion for choosing
gamma will affect that, because when we solve for gamma,
we are using the K centers. So you have points that are the centers of the Gaussians. But you need to worry about each Gaussian covering the data points that are nearby. And therefore, you are going to adjust the width of that Gaussian up or down, and of the other ones, such that the influence gets to those points. So the good news is that there is
an objective criterion for choosing it. This slide was meant to make the
point that gamma matters. Now that we know it matters, we look for a principled way of choosing it. And the other approach I described was that principled way of choosing it. STUDENT: So does that mean that
choosing gamma makes sense when we have fewer clusters than the number of samples? Because in this case, we have three clusters and three samples. PROFESSOR: This was not meant
to be a utility for gamma. It was meant just to visually illustrate
that gamma matters. But the main utility, indeed,
is for the K centers. STUDENT: OK, I see. Here actually, in both cases,
the in-sample error is zero, same generalization behavior. PROFESSOR: You're
absolutely correct. STUDENT: So can we say that K, the
number of clusters, is a measure of the VC dimension, in this sense? PROFESSOR: Well, it's a cause and effect. When I decide on the number of
clusters, I decide on the number of parameters, and that will affect
the VC dimension. So this is the way it is, rather
than the other way around. I didn't want people to take the
question as: oh, we want to determine the number of clusters, so let's
look for the VC dimension. That would be the argument backwards. So the statement is correct. They are related. But the cause and effect is that your
choice of the number of clusters affects the complexity of
your hypothesis set. STUDENT: Not the reverse? Because I thought, for example, if you
have N data points, and we know what kind of VC dimension will give good
generalization, so based on that, can we kind of-- PROFESSOR: So this
is out of necessity. You're not saying that this is the
inherent number of clusters that are needed to do this. This is what I can afford. STUDENT: Yeah, that's what I mean. PROFESSOR: And then
in that case, it's true. But in this case, it's not the number
of clusters you can afford-- it is indirectly-- it is the number of parameters you can
afford, because of the VC dimension. And because I have that many parameters,
I have to settle for that number of clusters, whether they break up the data points correctly or not. The only thing I'm trying to avoid
is that I don't want people to think that this will carry an answer to
the optimal choice of clusters, from an unsupervised learning
point of view. That link is not there. STUDENT: I see. But because like in this example, we
deal with-- it seems there's no natural cluster in the input sample, it's
uniformly distributed in the input space. PROFESSOR: Correct. And in many cases, even if there is
clustering, you don't know the inherent number of clusters. But again, the saving grace here is that
we can do a half-cooked clustering, just to have a representative
of some points, and then let the supervised stage of learning
take care of getting the values right. So it is just a way to
think of clustering. Instead of using all the points, I'm trying to use K centers. And I want them to be as
representative as possible. And that will put me
ahead of the game. And then the real test would be when I plug it into the supervised stage. STUDENT: OK. Thank you, professor. MODERATOR: Are there cases when RBF's
are actually better than SVM's? PROFESSOR: There are cases. You can run them in a number of
cases, and if the data is clustered in a particular way, and the clusters happen to have a common value, then you would expect that doing the unsupervised learning will get you ahead, whereas the SVM's support vectors are now on the boundary, and they have to be such that the cancellations of the RBF's give you the right value. So you can definitely create cases where
one will win over the other. Most people will use the RBF kernels,
the SVM approach. MODERATOR: Then that's
it for today. PROFESSOR: Very good. Thank you. We'll see you next week.