ANNOUNCER: The following program
is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we talked about kernel
methods, which is a generalization of the basic SVM algorithm to accommodate
feature spaces Z, which are possibly infinite and which we
don't have to explicitly know, or transform our inputs to, in order to be
able to carry out the support vector machinery. And the idea was to define a kernel
that captures the inner product in that space. And if you can compute that kernel, the
generalized inner product for the Z space, this is the only operation
you need in order to carry the algorithm, and in order to interpret
the solution after you get it. And we took an example, which is the
RBF kernel, suitable since we are going to talk about RBF's, radial
basis functions, today. And the kernel is very simple
to compute in terms of x. It's not that difficult. However, it corresponds to an infinite
dimensional space, Z space. And therefore, by doing this, it's as
if we transform every point in this space, which is two-dimensional, into
an infinite-dimensional space, carry out the SVM there, and then interpret
the solution back here. And this would be the separating surface
that corresponds to a plane, so to speak, in that infinite
dimensional space. So with this, we went into another way
to generalize SVM, not by having a nonlinear transform in this case, but
by having an allowance for errors. Errors in this case would be
violations of the margin. The margin is the currency
we use in SVM. And we added a term to the objective
function that allows us to violate the margin for different points, according
to the variable xi. And we have a total violation,
which is this summation. And then we have a degree to which
we allow those violations. If C is huge, then we don't really
allow the violations. And if C goes to infinity, we are
back to the hard-margin case. And if C is very small, then we are
more tolerant and would allow violations. And in that case, we might allow some
violations here and there, and then have a smaller w, which means a bigger
margin, a bigger yellow region that is violated by those guys. Think of it as-- it gives us another
degree of freedom in our design. And it might be the case that in some
data sets, there are a couple of outliers where it doesn't make sense
to shrink the margin just to accommodate them, or by going to
a higher-dimensional space with a nonlinear transformation to go around
that point and, therefore, generate so many support vectors. And therefore, it might be a good idea
to ignore them, and ignoring them means that we are going to commit
a violation of the margin. Could be an outright error. Could be just a violation of the margin
where we are here, but we haven't crossed the boundary, so to speak. And therefore, this gives us another
way of achieving the better generalization by allowing some in-sample
error, or margin error in this case, for the benefit of getting better
generalization prospects. Now the good news here is that,
in spite of this significant modification of the statement of the
problem, the solution was identical to what we had before. We are applying quadratic programming,
with the same objective, the same equality constraint, and almost the
same inequality constraint. The only difference is that it
used to be, alpha_n could be as big as it wants. Now it is limited by C.
And when you pass this to quadratic programming, you will
get your solution. Now C being a parameter-- and it is not clear how to choose it. There is a compromise
that I just described. The best way to pick C, and it is the
way used in practice, is to use cross-validation to choose it. So you apply different values of C, you
run this and see what is the out-of-sample error estimate, using your
cross-validation, and then pick the C that minimizes that. And that is the way you will
choose the parameter C.
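A minimal sketch of that selection procedure, in Python (not from the lecture; it assumes scikit-learn's SVC and cross_val_score, and the grid of C values, the gamma, and the fold count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pick_C(X, y, C_grid=(0.01, 0.1, 1, 10, 100), gamma=1.0, folds=10):
    """Choose the soft-margin parameter C by cross-validation (illustrative sketch)."""
    scores = []
    for C in C_grid:
        clf = SVC(C=C, kernel='rbf', gamma=gamma)                    # soft-margin SVM, RBF kernel
        scores.append(cross_val_score(clf, X, y, cv=folds).mean())   # CV estimate of accuracy
    return C_grid[int(np.argmax(scores))]                            # C with the best estimate
```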
So that ends the basic part of SVM, the hard margin, the soft margin, and the nonlinear transforms together
with the kernel version of them. Together they are a technique really
that is superb for classification. And it is, for many people,
the model of choice when it comes to classification. Very small overhead. There is a particular criterion that
makes it better than just choosing a random separating plane. And therefore, it does reflect on
the out-of-sample performance. Today's topic is a new model, which
is radial basis functions. Not so new, because we had a version of
it under SVM and we'll be able to relate to it. But it's an interesting model
in its own right. It captures a particular understanding
of the input space that we will talk about. But the most important aspect that the
radial basis functions provide for us is the fact that they relate to so many
facets of machine learning that we have already touched on, and other
aspects that we didn't touch on in pattern recognition, that it's worthwhile
to understand the model and see how it relates. It almost serves as a glue
between so many different topics in machine learning. And this is one of the important aspects
of studying the subject. So the outline here-- it's not like I'm
going to go through one item then the next according to this outline. What I'm going to do-- I'm going to define the model, define
the algorithms, and so on, as I would describe any model. In the course of doing that, I will be
able, at different stages, to relate RBF to, in the first case, nearest
neighbors, which is a standard model in pattern recognition. We will be able to relate it to neural
networks, which we have already studied, to kernel methods obviously-- it should relate
to the RBF kernel. And it will. And finally, it will relate to
regularization, which is actually the origin, in function approximation,
for the study of RBF's. So let's first describe the basic
radial basis function model. The idea here is that every point in
your data set will influence the value of the hypothesis at every point x. Well, that's nothing new. That's what happens when you
are doing machine learning. You learn from the data. And you choose a hypothesis. So obviously, that hypothesis
will be affected by the data. But here, it's affected
in a particular way. It's affected through the distance. So a point in the data set will affect
the nearby points, more than it affects the far-away points. That is the key component that makes
it a radial basis function. Let's look at a picture here. Imagine that the center of this bump
happens to be the data point. So this is x_n. And this shows you the influence
of x_n on the neighboring points in the space. So it's most influential nearby. And then the influence
goes down and dies. And the fact that this is symmetric
around x_n means that it's a function only of the distance, which is the
condition we have here. So let me give you, concretely, the
standard form of a radial basis function model. It starts from h of x being-- and here are the components
that build it. As promised, it depends
on the distance. And it depends on the distance such
that the closer you are to x_n, the bigger the influence is, as
seen in this picture. So if you take the norm
of x minus x_n squared, and you take minus gamma times that-- gamma is a positive
parameter, fixed for the moment-- you will see that this exponential really
reflects that picture. The further you are away, you go down. And you go down as a Gaussian. So this is the contribution to the
point x, at which we are evaluating the function, according to the data
point x_n, from the data set. Now we get an influence from every
point in the data set. And those influences will have
a parameter that reflects the value, as we will see in a moment,
of the target here. So it will be affected by y_n. That's what the influence is-- having
the value y_n propagate. So I'm not going to put it as y_n here. I'm just going to put it generically
as a weight to be determined. And we'll find that it's very
much correlated to y_n. And then we will sum up all of these
influences, from all the data points, and you have your model. So this is the standard model
for radial basis functions.
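Written out in the lecture's notation (a reconstruction of what the slide shows), the hypothesis is

    h(x) = \sum_{n=1}^{N} w_n \, \exp\!\left( -\gamma \, \lVert x - x_n \rVert^2 \right)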
Now let me, in terms of this slide, describe why it is called a radial basis function. It's radial because of this. And it's a basis function
because of this. This is your building block. You could use another basis function. So you could have another shape, that
is also symmetric around the center, and has the influence in a different way. And we will see an example later on. But this is basically the model
in its simplest form, and its most popular form. Most people will use
a Gaussian like this. And this will be the functional
form for the hypothesis. Now we have the model. The next question we normally ask is
what is the learning algorithm? So what is a learning algorithm
in general? You want to find the parameters. And we call the parameters
w_1 up to w_N. And they have this functional form. So I put them in purple now, because
they are the variables. Everything else is fixed. And we would like to find the w_n's
that minimize some sort of error. We base that error on the
training data, obviously. So what I'm going to do now, I'm going
to evaluate the hypothesis on the data points, and try to make them match the target
value on those points-- try to make them match y. As I said, w_n won't be exactly y_n,
but it will be affected by it. Now there is an interesting point of
notation, because the points appear explicitly in the model. x_n is the n-th training input. And now I'm going to evaluate this on
a training point, in order to evaluate the in-sample error. So because of this, there will
be an interesting notation. When we, let's say, ask ambitiously to
have the in-sample error be 0. I want to be exactly right
on the data points. I should expect that I will
be able to do that. Why? Because really I have quite a number
of parameters here, don't I? I have N data points. And I'm trying to learn
N parameters. Notwithstanding the generalization
ramifications of that statement, it should be easy to get parameters
that really knock down the in-sample error to 0. So in doing that, what I'm going to do,
I'm going to apply this to every point x_n, and ask that the output of
the hypothesis be equal to y_n. No error at all. So indeed, the in-sample
error will be 0. Let's substitute in
the equation here. And this is true for all n up to
N, and here is what you have. First, you realize that I changed
the name of the dummy variable, the index here. I changed it from n to m. And this goes with x_m here. The reason I did that, because I'm
going to evaluate this on x_n. And obviously, you shouldn't have
recycling of the dummy variable as a genuine variable. So in this case, you want this quantity,
which will in this case be the evaluation of h at the point x_n. You want this to be equal to y_n. That's the condition. And you want this to be true for
n equals 1 to N. Not that difficult to solve. So let's go for the solution. These are the equations. And we ask ourselves: how many equations
and how many unknowns? Well, I have N data points. I'm listing N of these equations. So indeed, I have N equations. How many unknowns do I have? Well, what are the unknowns? The unknowns are the w's. And I happen to have N unknowns. That's familiar territory. All I need to do is just solve it. Let's put it in matrix form,
which will make it easy. Here is the matrix form, with all
the coefficients for n and m. You can see that this
goes from 1 to N. And the second guy goes from 1 to N. These are the coefficients. You multiply this by a vector of w's. So I'm putting all the N equations
at once, as in matrix form. And I'm asking this to be equal
to the vector of y's. Let's call the matrices something. This matrix I'm going to call phi. And I am recycling the notation phi. phi used to be the nonlinear
transformation. And this is indeed a nonlinear
transformation of sorts. Slight difference that we'll discuss. But we can call it phi. And then these guys will be called
the standard name, the vector w and the vector y. What is the solution for this? All you ask for, in order to guarantee
a solution, is that phi be invertible, and under that condition, the
solution is very simply: w equals the inverse of phi times y. In that case, you interpret your
solution as exact interpolation, because what you are really doing is, on
the points that you know the value, which are the training points, you
are getting the value exactly. That's what you solved for. And now the kernel, which is the Gaussian
in this case, what it does is interpolate between the points to give
you the value on the other x's. And it's exact, because you get it
exactly right on those points.
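Here is a minimal numpy sketch of that exact interpolation (not the lecture's code; the function names are illustrative). It builds the N-by-N matrix phi with entries exp(-gamma ||x_n - x_m||^2), solves phi w = y with a linear solve rather than an explicit inverse, and assumes phi is invertible, as stated above.

```python
import numpy as np

def rbf_interpolate(X, y, gamma):
    """Exact RBF interpolation: solve Phi w = y, where
    Phi[n, m] = exp(-gamma * ||x_n - x_m||^2).  Assumes Phi is invertible."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # N x N squared distances
    Phi = np.exp(-gamma * sq_dist)
    return np.linalg.solve(Phi, y)        # same as w = Phi^{-1} y, but numerically safer

def rbf_value(X_train, w, gamma, x):
    """Evaluate h(x) = sum_n w_n exp(-gamma * ||x - x_n||^2) at a new point x."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)
    return np.dot(w, np.exp(-gamma * sq_dist))
```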
Now let's look at the effect of gamma. There was a gamma, a parameter, that I considered fixed from the very beginning. This guy-- so I'm highlighting it in red. When I give you a value of
gamma, you carry out the machinery that I just described. But you suspect that gamma
will affect the outcome. And indeed, it will. Let's look at two situations. Let's say that gamma is small. What happens when gamma is small? What happens is that this Gaussian
is wide, going this way. If gamma was large, then I
would be going this way. Now depending obviously on where the
points are, how sparse they are, it makes a big difference whether you are
interpolating with something that goes this way, or something
that goes this way. And it's reflected in this picture. Let's say you take this case. And I have three points
just for illustration. The total contribution of the three
interpolations passes exactly through the points, because this is what I solved
for. That's what I insisted on. But the small gray ones here
are the contribution according to each of them. So this would be w_1, w_2, w_3
if these are the points. And when you add w_1 times the Gaussian,
plus w_2 times the Gaussian, et cetera. You get a curve that gives you
exactly the y_1, y_2, and y_3. Now because of the width, there
is an interpolation here that is successful. Between two points, you can
see that there is a meaningful interpolation. If you go for a large gamma,
this is what you get. Now the Gaussians are still there. You may see them faintly. But they die out very quickly. And therefore, in spite of the fact
that you are still satisfying your equations, because that's what you solved
for, the interpolation here is very poor because the influence of this
point dies out, and the influence of this point dies out. So in between, you just get nothing. So clearly, gamma matters. And you probably, in your mind, think
that gamma matters also in relation to the distance between the points,
because that's what the interpolation is. And we will discuss the choice
of gamma towards the end. After we settle all the other
parameters, we will go and visit gamma and see how we can choose it wisely. With this in mind, we have a model. But that model, if you look at
it, is a regression model. I consider the output
to be real-valued. And I match the real-valued output
to the target output, which is also real-valued. Often, we will use RBF's
for classification. When you look at h of x, which used
to be regression this way-- it gives you a real number. Now we are going to take, as usual, the
sign of this quantity, +1 or -1, and interpret the output
as a yes/no decision. And we would like to ask ourselves: how
do we learn the w's under these conditions? That shouldn't be a very alien situation
to you, because you have seen before linear regression used
for classification. That is pretty much what we
are going to do here. We are going to focus on the inner part,
which is the signal before we take the sign. And we are going to try to make the
signal itself match the +1,-1 target, like we did when we used linear
regression for classification. And after we are done, since we are
trying hard to make it +1 or -1, and if we are successful--
we get an exact solution, then obviously the sign of it will
be +1 or -1 if you're successful. If you are not successful, and there is
an error, as will happen in other cases, then at least since you try to
make it close to +1, and you try to make the other one close to -1,
you would think that the sign, at least, will agree with
+1 or -1. So the signal here is what used to
be the whole hypothesis value. And what you're trying to do, you are
trying to minimize the mean squared error between that signal and
y, knowing that y actually-- on the training set-- knowing that
y is only +1 or -1. So you solve for that. And then when you get s, you report
the sign of that s as your value. So we are ready to use the solution we
had before in case we are using RBF's for classification. Now we come to the observation that
the radial basis functions are related to other models. And I'm going to start with
a model that we didn't cover. It's extremely simple to
cover in five minutes. And it shows an aspect of radial basis
functions that is important. This is the nearest-neighbor method. So let's look at it. The idea of nearest neighbor is
that I give you a data set. And each data point has a value y_n. Could be a label, if you're talking
about classification, could be a real value. And what you do for classifying other
points, or assigning values to other points, is very simple. You look at the closest point within
that training set, to the point that you are considering. So you have x. You look at what is x_n in the
training set that is closest to me, in Euclidean distance. And then you inherit the label, or
the value, that that point has. Very simplistic. Here is a case of classification. The data set are the red pluses
and the blue circles. And what I am doing is that I am
applying this rule of classifying every point on this plane, which
is X, the input space, according to the label of the nearest
point within the training set. As you can see, if I take a point
here, this is the closest. That's why this is pink. And here it's still the closest. Once I'm here, this guy
becomes the closest. And therefore, it gets blue. So you end up, as a result of
that, as if you are breaking the plane into cells. Each of them has the label of a point
in the training set that happens to be in the cell. And this tessellation of the plane,
into these cells, describes the boundary for your decisions. This is the nearest-neighbor method. Now, if you want to implement this
using radial basis functions, there is a way to implement it. It's not exactly this, but it has
a similar effect, where you basically are trying to take an influence
of a nearby point. And that is the only thing
you're considering. You are not considering other points. So let's say you take the basis
function, in this case, to look like this. Instead of a Gaussian, it's a cylinder. It's still symmetric--
depends on the radius. But the dependency is very simple. I am constant. And then I go to 0. So it's very abrupt. In that case, I am not
exactly getting this. But what I'm getting is a cylinder
around every one of those guys that inherits the value of that point. And obviously, there is the question of
overlaps and whatnot, and that is what makes a difference from here. In both of those cases,
it's fairly brittle. You go from here to here. You immediately change values. And if there are points in between, you
keep changing from blue, to red, to blue, and so on. In this case, it's even more
brittle, and so on. So in order to make it less abrupt, the
nearest neighbor is modified to become K nearest neighbors. That is,
instead of taking the value of the closest point, you look, let's say,
for the three closest points, or the five closest points, or the seven
closest points, and then take a vote. If most of them are +1, you
consider yourself +1. That helps even things
out a little bit.
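As a side illustration (not from the slides), here is a minimal numpy sketch of the nearest-neighbor rule and its K-nearest-neighbor version; with k equal to 1, you inherit the label of the closest point, and with an odd k greater than 1, you take the vote just described.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=1):
    """K-nearest-neighbor rule: find the k closest training points in
    Euclidean distance and take a majority vote of their +/-1 labels."""
    sq_dist = np.sum((X_train - x) ** 2, axis=1)   # squared distances to all training points
    nearest = np.argsort(sq_dist)[:k]              # indices of the k closest points
    return np.sign(np.sum(y_train[nearest]))       # vote; with k = 1, just inherit the label
```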
So an isolated guy in the middle, that doesn't belong, gets filtered out by this. This is a standard way
of smoothing, so to speak, the surface here. It will still be very abrupt going
from one point to another, but at least the number of fluctuations
will go down. The way you smoothen the radial basis
function is, instead of using a cylinder, you use a Gaussian. So now it's not like I have an influence,
I have an influence, I have an influence, I don't have any influence. No, you have an influence, you have
less influence, you have even less influence, and eventually you have
effectively no influence because the Gaussian went to 0. And in both of those cases, you can
consider the model, whether it's nearest neighbor or K nearest neighbors,
or a radial basis function with different bases. You can consider it as
a similarity-based method. You are classifying points according to
how similar they are to points in the training set. And the particular form of applying
the similarity is what defines the algorithm, whether it's this way or
that way, whether it's abrupt or smooth, and whatnot. Now let's look at the model we had,
which is the exact-interpolation model and modify it a little bit, in order to
deal with a problem that you probably already noticed, which
is the following. In the model, we have
N parameters, w-- should be w_1 up to w_N. And it is based on
N data points. I have N parameters. I have N data points. We have alarm bells that call for a red
color, because right now, you usually have the generalization in your mind
related to the ratio between data points and parameters, parameters being
more or less a VC dimension. And therefore, in this case, it's
pretty hopeless to generalize. It's not as hopeless as in other cases,
because the Gaussian is a pretty friendly guy. Nonetheless, you might consider the
idea that I'm going to use radial basis functions, so I'm
going to have an influence, symmetric and all of that. But I don't want to have every
point have its own influence. What I'm going to do, I'm going to elect
a number of important centers for the data, have these as my centers,
and have them influence the neighborhood around them. So what you do, you take K,
which is the number of centers in this case, and hopefully it's much smaller
than N, so that the generalization worry is mitigated. And you define the centers-- these are vectors, mu_1 up to mu_K-- as the centers of
the radial basis functions, instead of having x_1 up to x_N, the data points
themselves, being the centers. Now those guys live in the same
space, let's say in a d-dimensional Euclidean space. These are exactly in the same space,
except that they are not data points. They are not necessarily data points. We may have elected some of them as
being important points, or we may have elected points that are simply
representative, and don't coincide with any of those points. Generically, there will
be mu_1 up to mu_K. And in that case, the functional form
of the radial basis function changes form, and it becomes this. Let's look at it. Used to be that we are counting
from 1 to N, now from 1 to K. And we have w. So indeed, we have fewer parameters. And now we are comparing the x that we
are evaluating at, not with every point, but with every center. And according to the distance from that
center, the influence of that particular center, which is captured
by w_k is contributed. And you take the contribution of all
the centers, and you get the value. Exactly the same thing we did before
except, with this modification, that we are using centers instead of points.
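In symbols (a reconstruction in the lecture's notation), the modified hypothesis is

    h(x) = \sum_{k=1}^{K} w_k \, \exp\!\left( -\gamma \, \lVert x - \mu_k \rVert^2 \right)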
So the parameters here now are interesting, because the w_k's are parameters. And I'm supposedly going through this
entire exercise because I didn't like having N parameters. I want only K parameters. But look what we did. mu_k's now are parameters, right? I don't know what they are. And I have K of them. That's not a worry, because I already
said that K is much smaller than N. But each of them is a d-dimensional
vector, isn't it? So that's a lot of parameters. If I have to estimate those, et
cetera, I haven't made a lot of progress in this exercise. But it turns out that I will be able,
through a very simple algorithm, to estimate those without touching the
outputs of the training set, so without contaminating the data. That's the key. Two questions. How do I choose the
centers, which is an interesting question, because I have to choose it
now-- if I want to maintain that the number of parameters here is small-- I have to choose it without really
consulting the y_n's, the values of the output at the training set. And the other question is
how to choose the weights. Choosing the weights shouldn't be that
different from what we did before. It will be a minor modification, because it
has the same functional form. This one is the interesting part, at least the novel part. So let's talk about choosing
the centers. What we are going to do, we are going
to choose the centers as representative of the data inputs. I have N points. They are here, here, and here. And the whole idea is that I don't
want to assign a radial basis function for each of them. And therefore, what I'm going to do, I'm
going to have a representative. It would be nice, for every group of
points that are nearby, to have a center near to them, so that
it captures this cluster. This is the idea. So you are now going to take x_n, and
take a center which is the closest to it, and assign that point to it. Here is the idea. I have the points spread around. I am going to select centers. Not clear how do I choose the centers. But once you choose them, I'm going to
consider the neighborhood of the center within the data set, the
x_n's, as being the cluster that has that center. If I do that, then those points are
represented by that center, and therefore, I can say that their
influence will be propagated through the entire space by the radial
basis function that is centered around this one. So let's do this. It's called K-means clustering, because
the center for the points will end up being the mean of the points,
as we'll see in a moment. And here is the formalization. You split the data points, x_1 up to
x_N, into groups-- clusters, so to speak-- hopefully points that
are close to each other. And you call these S_1 up to S_K. Each cluster will have
a center that goes with it. And what you minimize, in order to make
this a good clustering, and these good representative centers, is
to try to make the points close to their centers. So you take this for every
point you have. But you sum up over the
points in the cluster. So you take the points in the cluster,
whose center is this guy. And you try to minimize the mean squared
error there, mean squared error in terms of Euclidean distance. So this takes care of one
cluster, S_k. You want this to be small
over all the data. So what you do is you sum this
up over all the clusters. That becomes your objective
function in clustering.
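Written out (a reconstruction in the lecture's notation), the objective is

    \min_{\mu_1, \dots, \mu_K; \; S_1, \dots, S_K} \;\; \sum_{k=1}^{K} \sum_{x_n \in S_k} \lVert x_n - \mu_k \rVert^2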
Someone gives you K. That is, the choice of the actual number of clusters is a different issue. But let's say K is 9. I give you 9 clusters. Then, I'm asking you to find the mu's,
and the break-up of the points into the S_k's, such that this value
assumes its minimum. If you succeed in that, then I can
claim that this is good clustering, and these are good representatives
of the clusters. Now I have some good news,
and some bad news. The good news is that, finally, we
have unsupervised learning. I did this without any reference
to the label y_n. I am taking the inputs, and producing
some organization of them, as we discussed the main goal of
unsupervised learning is. So we are happy about that. Now the bad news. The bad news is that the problem,
as I stated it, is NP-hard in general. It's a nice unsupervised problem,
but not so nice. It's intractable, if you want
to get the absolute minimum. So our goal now is to go around it. That sort of problem being NP-hard
never discouraged us. Remember, with neural networks, we said that the absolute minimum of
that error in the general case-- finding it would be NP-hard. And we ended up with saying we will
find some heuristic, which was gradient descent in this case. That led to backpropagation. We'll start with a random configuration
and then descend. And we'll get, not to the global minimum,
the finding of which is NP-hard, but a local minimum, hopefully
a decent local minimum. We'll do exactly the same thing here. Here is the iterative algorithm for
solving this problem, the K-means. And it's called Lloyd's algorithm. It is extremely simple, to the level
where the contrast between this algorithm-- not only in the
specification of it, but how quickly it converges-- and the fact that finding
the global minimum is NP-hard, is rather mind-boggling. So here is the algorithm. What you do is you iteratively
minimize this quantity. You start with some configuration,
and get a better configuration. And as you see, I have now two guys in
purple, which are my parameters here. mu's are parameters by definition. I am trying to find what they are. But also the sets S_k, the
clusters, are parameters. I want to know which
guys go into them. These are the two things
that I'm determining. So the way this algorithm does it is that
it fixes one of them, and tries to minimize the other. It tells you for this particular
membership of the clusters, could you find the optimal centers? Now that you found the
optimal centers-- forget about the clustering that
resulted in that-- these are centers, could you find the best clustering
for those centers? And keep repeating back and forth. Let's look at the steps. You are minimizing this
with respect to both, so you take one at a time. Now you update the value of mu. How do you do that? You take the fixed clustering that
you have-- so you have already a clustering that is inherited
from the last iteration. What you do is you take the
mean of that cluster. You take the points that belong
to that cluster. You add them up and divide
by their number. Now in our mind, you know that this must
be pretty good in minimizing the mean squared error, because the squared
error to the mean is the smallest of the squared errors to any point.
The mean happens to be the closest to the points collectively, in terms
of mean squared distance. So if I do that, I know that this is
a good representative, if this was the real cluster. So that's the first step. Now I have new mu_k's. So you freeze the mu_k's. And you completely forget about
the clustering you had before. Now you are creating new clusters. And the idea is the following. You take every point, and you measure
the distance between it and mu_k, the newly acquired mu_k. And you ask yourself: is the closest
of the mu's that I have? So you compare this with
all of the other guys. And if it happens to be smaller,
then you declare that this x_n belongs to S_k. You do this for all the points, and
you create a full clustering. Now, if you look at this step, we argued
that this reduces the error. It has to, because you picked the mean
for every one of them, and that will definitely not increase the error. This will also decrease the error,
because the worst that it can do is take a point from one cluster
and put it in another. But in doing that, what did it do? It picked the one that is closest. So the term that used to be here is
now smaller, because it went to the closer guy. So this one reduces the value. This one reduces the value. You go back and forth, and the
quantity is going down. Are we ever going to converge? Yes, we have to. Because by structure,
we are only dealing with a finite number of points. And there are a finite number of
possible values for the mu's, given the algorithm, because they have to be
the average of a subset of the points. So I have 100 points. There will be a finite,
but tremendously big, number of possible values. But it's finite. All I care about, it's a finite number. And as long as it's finite,
and I'm going down, I will definitely hit a minimum. It will not be the case that it's
a continuous thing, and I'm doing half, and then half again, and half
of half, and never arrive. Here, you will arrive perfectly
at a point.
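Here is a minimal numpy sketch of Lloyd's algorithm as just described (not the lecture's code; the initialization at random data points and the stopping rule are implementation choices for illustration). It alternates between assigning every point to its nearest center and moving each center to the mean of its cluster, and stops when the assignment no longer changes.

```python
import numpy as np

def lloyd(X, K, max_iter=100, seed=0):
    """Lloyd's algorithm for K-means clustering (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # random initial centers
    assign = None
    for _ in range(max_iter):
        # assign each x_n to its closest center (this defines the clusters S_k)
        sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)  # N x K
        new_assign = np.argmin(sq_dist, axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                    # assignments stopped changing: converged
        assign = new_assign
        # move each center to the mean of its cluster (an empty cluster keeps its center)
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign
```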
The catch is that you're converging to a good, old-fashioned local minimum. Depending on your initial configuration,
you will end up with one local minimum or another. But again, exactly the same situation
as we had with neural networks. We did converge to a local minimum
with backpropagation, right? And that minimum depended
on the initial weights. Here, it will depend on the initial
centers, or the initial clustering, whichever way you want to begin. And the way you do it is, try
different starting points. And you get different solutions. And you can evaluate which one is better
because you can definitely evaluate this objective function for
all of them, and pick one out of a number of runs. That usually works very nicely. It's not going to give
you the global one. But it's going to give you a very decent
clustering, and very decent representative mu's. Now, let's look at Lloyd's
algorithm in action. And I'm going to take the problem
that I showed you last time for the RBF kernel. This is the one we're going to
carry through, because we can relate to it now. And let's see how the algorithm works. The first step in the algorithm,
give me the data points. OK, thank you. Here are the data points. If you remember, this was the target. The target was slightly nonlinear. We had -1 and +1. And we have them with this color. And that is the data we have. First thing, I only want the inputs. I don't see the labels. And I don't see the target function. You probably don't see the
target function anyway. It's so faint! But really, you don't see it at all. So I'm going now to take away the
target function and the labels. I'm only going to keep the
position of the inputs. So this is what you get. Looks more formidable now, right? I have no idea what the function is. But now we realize one
interesting point. I'm going to cluster those, without
any benefit of the label. So I could have clusters that belong
to one category, +1 or -1. And I could, as well, have clusters
that happen to be on the boundary, half of them are +1, or
half of them -1. That's the price you pay when you
do unsupervised learning. You are trying to get similarity, but
the similarity is as far as the inputs are concerned, not as far
as the behavior with the target function is concerned. That is key. So I have the points. What do I do next? You need to initialize the centers. There are many ways of doing this. There are a number of methods. I'm going to keep it simple here. And I'm going to initialize
the centers at random. So I'm just going to pick 9 points. And I'm picking 9
for a good reason. Remember last lecture when we did
the support vector machines. We ended up with 9 support vectors. And I want to be able to compare them. So I'm fixing the number, in order to be
able to compare them head to head. So here are my initial centers. Totally random. Looks like a terribly stupid thing to
have three centers near each other, and have this entire area empty. But let's
hope that Lloyd's algorithm will place them a little bit more strategically. Now you iterate. So now I would like you
to stare at this. I will even make it bigger. Stare at it, because I'm going
to do a full iteration now. I am going to do re-clustering, and
re-evaluation of the mu, and then show you the new mu. One step at a time. This is the first step. Keep your eyes on the screen. They moved a little bit. And I am pleased to find that those
guys, that used to be crowded, are now serving different guys. They are moving away. Second iteration. I have to say, this is
not one iteration. These are a number of iterations. But I'm sampling it at a certain rate,
in order not to completely bore you. It would be-- clicking through
to the end of the lecture. And then we would have the clustering
at the end of the lecture, and nothing else! So next iteration. Look at the screen. The movement is becoming smaller. Third iteration. Uh. Just a touch. Fourth. Nothing happened. I actually flipped the slide. Nothing happened. Nothing happened. So we have converged. And these are your mu's. And it does converge very quickly. And you can see now the
centers make sense. These guys have a center. These guys have a center. These guys, and so on. I guess since it started here, it got
stuck here and is just serving two points, or something like that. But more or less, it's
a reasonable clustering. Notwithstanding the fact that
there was no natural clustering for the points. It's not like I generated these
guys from 9 centers. These were generated uniformly. So the clustering is incidental. But nonetheless, the clustering
here makes sense. Now this is a clustering, right? Surprise! We have to go back to this. And now, you look at the clustering
and see what happens. This guy takes points from
both +1 and -1. They look very similar to it, because
it only depended on x's. Many of them are deep inside and,
indeed, deal with points that are the same. The reason I'm making an issue of this is
because the center will serve as a center of influence for
affecting the value of the hypothesis. It will get a w_k, and then it will propagate that w_k
according to the distance from itself. So now the guys that happen to be the
center of positive and negative points will cause me a problem,
because what do I propagate? The +1 or the -1? But indeed, that is the price you pay
when you use unsupervised learning. So this is Lloyd's algorithm
in action. Now I'm going to do something
interesting. We had 9 points that are centers of
unsupervised learning, in order to be able to carry out the influence of
radial basis functions using the algorithm we will have. That's number one. Last lecture, we had
also 9 guys. They were support vectors. They were representative
of the data points. And since the 9 points were
representative of the data points, and the 9 centers here are representative
of the data points, it might be illustrative to put them next
to each other, to understand what is common, what is different, where
did they come from, and so on. Let's start with the RBF centers. Here they are. And I put them on the data that is
labeled, not that I got them from the labeled data, but just to have the
same picture right and left. So these are where the centers are. Everybody sees them clearly. Now let me remind you of what
the support vectors from last time looked like. Here are the support vectors. Very interesting, indeed. The support vectors obviously
are here, all around here. They had no interest whatsoever in
representing clusters of points. That was not their job. Here these guys have absolutely
nothing to do with the separating plane. They didn't even know that there
was a separating surface. They just looked at the data. And you basically get what
you set out to do. Here you were representing
the data inputs. And you've got a representation
of the data inputs. Here you were trying to capture
the separating surface. That's what support vectors do. They support the separating surface. And this is what you got. These guys are generic centers. They are all black. These guys, there are some blue and
some red, because they are support vectors that come with a label,
because of the value y_n. So some of them are on this side. Some of them are on this side. And indeed, they serve completely
different purposes. And it's rather remarkable that we get
two solutions using the same kernel, which is RBF kernel, using such
an incredibly different diversity of approaches. This was just to show you the
difference between when you do the choice of important points in
an unsupervised way, and here patently in a supervised way. Choosing the support vectors was
very much dependent on the value of the target. The other thing you need to notice is
that the support vectors have to be points from the data. The mu's here are not points
from the data. They are average of those points. But they end up anywhere. So if you actually look, for example,
at these three points: you go here, and one of them became a center, one
of them became a support vector. On the other hand, this point
doesn't exist here. It just is a center that happens
to be anywhere in the plane. So now we have the centers. I will give you the data. I tell you K equals 9. You go and you do your
Lloyd's algorithm. And you come up with the centers,
and half the problem of the choice is now solved. And it's the big half, because the
centers are vectors of d dimensions. And now I found the centers, without
even touching the labels. I didn't touch y_n. So I know that I didn't
contaminate anything. And indeed, I have only the weights,
which happen to be K weights, to determine using the labels. And therefore, I have good
hopes for generalization. Now I look here. I froze it-- it became black now, because
it has been chosen. And now I'm only trying
to choose these guys, w_k. This is y_n. And I ask myself the same question. I want this to be true for all
the data points if I can. And I ask myself: how many equations,
how many unknowns? I end up with N equations. Same thing. I want this to be true for
all the data points. I have N data points. So I have N equations. How many unknowns? The unknowns are the w's. And I have K of them. And oops, K is less than N. I have
more equations than unknowns. So something has to give. And this fellow is the
one that has to give. That's all I can hope for. I'm going to get it close, in a mean
squared sense, as we have done before. I don't think you'll be surprised
by anything in this slide. You have seen this before. So let's do it. This is the matrix phi now. It's a new phi. It has K columns and N rows. So according to our criteria that K is
smaller than N, this is a tall matrix. You multiply it by w, which
are K weights. And you should get approximately y. Can you solve this? Yes, we have done this before
in linear regression. All you need is to make sure that
phi transposed phi is invertible. And under those conditions, you
have one-step solution, which is the pseudo-inverse. You take phi transposed phi to the -1,
times phi transposed y. And that will give you the value of
w that minimizes the mean squared difference between these guys. So you have the pseudo-inverse,
instead of the exact interpolation. And in this case, you are not guaranteed
that you will get the correct value at every data point. So you are going to be making
an in-sample error. But we know that this
is not a bad thing. On the other hand, we are only
determining K weights. So the chances of generalization
are good.
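Putting the two halves together, here is a minimal numpy sketch of fitting the K-center model (not the lecture's code; the function name is illustrative). It builds the tall N-by-K matrix phi from the Lloyd centers and solves the least-squares problem, which is exactly the pseudo-inverse solution w = (phi^T phi)^{-1} phi^T y; lstsq is used instead of forming the inverse explicitly. For classification, you would then take the sign of the resulting signal, as described earlier.

```python
import numpy as np

def fit_rbf_weights(X, y, centers, gamma):
    """Least-squares fit of w for h(x) = sum_k w_k exp(-gamma ||x - mu_k||^2),
    i.e. the pseudo-inverse solution w = (Phi^T Phi)^{-1} Phi^T y."""
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)  # N x K
    Phi = np.exp(-gamma * sq_dist)
    # a bias column of 1's can be appended to Phi here (a bias term is discussed a bit later)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```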
Now, I would like to take this and put it as a graphical network. And this will help me relate
it to neural networks. This is the second link. We already related RBF to nearest
neighbor methods, similarity methods. Now we are going to relate
it to neural networks. Let me first put the diagram. Here's my illustration of it. I have x. I am computing the radial aspect, the
distance from mu_1 up to mu_K, and then handing it to a nonlinearity, in this
case the Gaussian nonlinearity. You can have other basis functions. Like we had the cylinder in one case. But cylinder is a bit extreme. But there are other functions. You get features that are combined
with weights, in order to give you the output. Now this one could be just passing the
sum if you're doing regression, could be hard threshold if you're doing
classification, could be something else. But what I care about is that this
configuration looks familiar to us. It's layers. I select features. And then I go to output. Let's look at the features. The features are these fellows, right? Now if you look at these features, they
depend on the mu's. And the mu's, in general, are parameters. If I didn't have this slick Lloyd's
algorithm, and K-means, and unsupervised thing, I need to determine
what these guys are. And once you determine them, the value
of the feature depends on the data set. And when the value of the feature
depends on the data set, all bets are off. It's no longer a linear model, pretty
much like a neural network doing the first layer, extracting the features. Now the good thing is that, because we
used only the inputs in order to compute mu, it's almost linear. We've got the benefit of the
pseudo-inverse because in this case, we didn't have to go back and adjust mu
just because we don't like the value of the output. These were frozen forever
based on inputs. And then, we only had to get the w's. And the w's now look like multiplicative
factors, in which case it's linear in those w's. And we get the solution. Now in radial basis functions, there
is often a bias term added. You don't only get those. You get either w_0 or b. And it enters the final layer. So you just add another weight that
is, this time, multiplied by 1. And everything remains the same. The phi matrix has another
column because of this. And you just do the machinery
you had before. Now let's compare it
to neural networks. Here is the RBF network. We just saw it. And I pointed x in red. This is what gets passed to this, gets
the features, and gets you the output. And here is a neural network that
is comparable in structure. So you start with the input. You start with the input. Now you compute features. And here you do. And the features here depend
on the distance. And they are such that, when the distance
is large, the influence dies. So if you look at this value, and this
value is huge, you know that this feature will have 0 contribution. Here this guy, big or small, is
going to go through a sigmoid. So it could be huge, small, negative. And it goes through this. So it always has a contribution. So one interpretation is that, what
radial basis function networks do, is look at local regions in the space and
worry about them, without worrying about the far-away points. I have a function that
is in this space. I look at this part, and
I want to learn it. So I get a basis function
that captures it, or a couple of them, et cetera. And I know that by the time I go to
another part of the space, whatever I have done here is not going to
interfere, whereas in the other case of neural networks, it did interfere
very, very much. And the way you actually got something
interesting, is making sure that the combinations of the guys you
got give you what you want. But it's not local as
it is in this case. So this is the first observation. The second observation is that here,
the nonlinearity we call phi. The corresponding nonlinearity
here is theta. And then you combine with
the w's, and you get h. So very much the same, except
the way you extract features here is different. And w here was a full-fledged parameter
that depended on the labels. We use backpropagation
in order to get those. So these are learned features,
which makes it completely not a linear model. This one, if we learned mu's based on
their effect on the output, which would be a pretty hairy algorithm,
that would be the case. But we didn't. And therefore, this is almost
linear in this part. And this is why we got
this part fixed. And then we got this one using
the pseudo-inverse. One last thing, this is
a two-layer network. And this is a two-layer network. And pretty much any two-layer network, of
this type of structure, lends itself to being a support vector machine. The first layer takes
care of the kernel. And the second one is the linear
combination that is built-in in support vector machines. So you can solve a support vector
machine by choosing a kernel. And you can picture in your mind that
I have one of those, where the first part is getting the kernel. And the second part is getting
the linear part. And indeed, you can implement
neural networks using support vector machines. There is a neural-network kernel
for support vector machines. But it deals only with two layers, as you
see here, not multiple layers as the general neural network would do. Now, the final parameter to choose here
was gamma, the width of the Gaussian. And we now treat it as
a genuine parameter. So we want to learn it. And because of that, it turned purple. So now mu is fixed, according
to Lloyd. Now I have parameters w_1 up to w_K. And then I have also gamma. And you can see this is actually pretty
important because, as you saw, if we choose it wrong, the
interpolation becomes very poor. And it does depend on the spacing
in the data set. So it might be a good idea to choose
gamma in order to also minimize the in-sample error--
get good performance. So of course, I could do that-- and I could do it for
w for all I care-- I could do it for all the parameters,
because here is the value. I am minimizing mean squared error. So I'm going to compare this with
the value of the y_n when I plug in x_n. And I get an in-sample error,
which is mean squared. I can always find parameters that
minimize that, using gradient descent, the most general one. Start with random values, and then
descend, and then you get a solution. However, it would be a shame to do that,
because these guys have such a simple algorithm that goes with them. If gamma is fixed, this is a snap. You do the pseudo-inverse, and
you get exactly that. So it is a good idea to treat gamma separately
from the w's. Gamma is inside the exponential, and this and that. I don't think I have any hope
of finding a shortcut for it. I probably will have to do gradient
descent for this guy. But I might as well do gradient
descent for this guy, not for these guys. And the way this is done is
by an iterative approach. You fix one, and solve for the others. This seems to be the theme
of the lecture. And in this case, it is a pretty
famous algorithm-- a variation of that algorithm. The algorithm is called EM,
Expectation-Maximization. And it is used for solving the case
of mixture of Gaussians, which we actually have, except that we are
not calling them probabilities. We are calling them bases that
are implementing a target. So here is the idea. Fix gamma. That we have done before. We have been fixing gamma all through. If you want to solve for w based on
fixing gamma, you just solve for it using the pseudo-inverse. Now we have w's. Now you fix them. They are frozen. And you minimize the error, the squared
error, with respect to gamma, one parameter. It would be pretty easy to gradient
descent with respect to one parameter. You find the minimum. You find gamma. Freeze it. And now, go back to step one
and find the new w's that go with the new gamma. Back and forth, converges
very, very quickly. And then you will get a combination
of both w's and gamma.
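A minimal sketch of that alternation (an implementation assumption, not the lecture's code): with gamma fixed, w comes from the pseudo-inverse as before; with w fixed, gamma is improved by a one-dimensional minimization of the in-sample squared error. The lecture suggests gradient descent for the gamma step; here a bounded scalar minimizer from scipy stands in for it, and the bounds and round count are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def design_matrix(X, centers, gamma):
    """Phi[n, k] = exp(-gamma * ||x_n - mu_k||^2)."""
    sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)

def fit_w_and_gamma(X, y, centers, gamma=1.0, rounds=10):
    """Alternate: solve for w with gamma fixed, then tune gamma with w fixed."""
    w = None
    for _ in range(rounds):
        Phi = design_matrix(X, centers, gamma)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                 # step 1: pseudo-inverse for w
        in_sample = lambda g: np.sum((design_matrix(X, centers, g) @ w - y) ** 2)
        gamma = minimize_scalar(in_sample, bounds=(1e-3, 1e3),
                                method='bounded').x                 # step 2: minimize over gamma
    return w, gamma
```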
And because it is so simple, you might be even encouraged to say: why do we have one gamma? I have data sets. It could be that these data points are
close to each other, and one data point is far away. Now if I have a center here that has to
reach out further, and a center here that doesn't have to reach out, it looks
like a good idea to have different gammas for those guys. Granted. And since this is so simple, all you
need to do now is have K parameters, gamma_k, so you doubled
the number of parameters. But the number of parameters
is small to begin with. And now you do the first step exactly. You fix the vector gamma,
and you get these guys. And now we are doing gradient descent
in a K-dimensional space. We have done that before. It's not a big deal. You find the minimum with respect
to those, freeze them, and go back and forth. And in that case, you adjust the width
of the Gaussian according to the region you are in the space. Now very quickly, I'm going to go
through two aspects of RBF, one of them relating it to kernel methods,
which we already have seen the beginning of. We have used it as a kernel. So we would like to compare
the performance. And then, I will relate
it to regularization. It's interesting that RBF's, as I
described them-- like intuitive, local, influence, all of that-- you will find
in a moment that they are completely based on regularization. And that's how they arose in the first
place in function approximation. So let's do the RBF versus
its kernel version. Last lecture we had a kernel,
which is the RBF kernel. And we had a solution with
9 support vectors. And therefore, we ended up with
a solution that implements this. Let's look at it. I am getting a sign that's a built-in
part of support vector machines. They are for classification. I had this guy after I expanded the z
transposed z, in terms of the kernel. So I am summing up over only
the support vector. There are 9 of them. This becomes my parameter, the weight. It happens to have the
sign of the label. That makes sense because if I want to
see the influence of x_n, it might as well be that the influence of x_n
agrees with the label of x_n. That's how I want it. If it's +1, I want the
+1 to propagate. So because the alphas are non-negative
by design, they get their sign from the label of the point. And now the centers are points
from the data set. They happen to be
the support vectors. And I have a bias there. So that's the solution we have. What did we have here? We had the straight RBF implementation,
with 9 centers. I am putting the sign in blue, because
this is not an integral part. I could have done a regression part. But since I'm comparing here, I'm going
to take the sign and consider this a classification. I also added a bias, also in blue, because
this is not an integral part. But I'm adding it in order to
be exactly comparable here. So the number of terms here is 9. The number of terms here is 9. I'm adding a bias. I'm adding a bias. Now the parameter here is called w,
which takes the place of this guy. And the centers here are general
centers, mu_k's. These do not have to be points from
the data set; indeed, they most likely are not. And they play the role here.
model in front of me. And in one of them I did what? Unsupervised learning of centers,
followed by a pseudo-inverse. And I used linear regression
for classification. That's one route. What did I do here? Maximize the margin, equate with a kernel, and
pass to quadratic programming. Completely different routes. And finally, I have a function
that is comparable. So let's see how they perform. Just to be fair to the poor straight RBF
implementation, the data doesn't cluster naturally. And I chose the 9 because
I got 9 here. So the SVM has the home
advantage here. This is just a comparison. I didn't optimize the number of
things, I didn't do anything. So if this guy ends up performing
better, OK, it's better. SVM is good. But it really has a little
bit of unfair advantage in this comparison. But let's look at what we have. This is the data. Let me magnify it, so that
you can see the surface. Now let's start with
the regular RBF. Both of them are RBF, but
this is the regular RBF. This is the surface you get after you
do everything I said, the Lloyd, and the pseudo-inverse, and whatnot. And the first thing you realize is that
the in-sample error is not zero. There are points that
are misclassified. Not a surprise. I had only K centers. And I'm trying to minimize
mean squared error. It is possible that some points,
close to the boundary, will go one way or the other. I'm interpreting the signal as being
closer to +1 or -1. Sometimes it will cross. And that's what I get. This is the guy that I get. Here is the guy that I got
last time from the SVM. Rather interesting. First, it's better-- because I have the
benefit of looking at the green, the faint green line, which is the target. And I am definitely closer to the green
one, in spite of the fact that I never used it explicitly
in the computation. I used only the data, the
same data for both. But this tracks it better. It achieves zero in-sample error. It's fairly close to this guy. So here are two solutions coming
from two different worlds, using the same kernel. And I think by the time you have done
a number of problems using these two approaches, you have it cold. You know exactly what is going on. You know the ramifications of doing
unsupervised learning, and what you miss out by choosing the centers without
knowing the label, versus the advantage of support vectors. The final items that I promised
was RBF versus regularization. It turns out that you can derive RBF's
entirely based on regularization. You are not talking about
influence of a point. You are not talking about anything. Here is the formulation from
function approximation that resulted in that. And that is why people consider RBF's to
be very principled, and they have a merit. It is modulo assumptions, as always. And we will see what the
assumptions are. Let's say that you have
a one-dimensional function. So you have a function. And you have a bunch of points,
the data points. And what you are doing now is you are
trying to interpolate and extrapolate between these points, in order to get the
whole function, which is what you do in function approximation-- what you
do in machine learning if your function happens to be one-dimensional. What do you do in this case? There are usually two terms. In one of them, you try to minimize the in-sample error. And the other one is regularization, to make sure that your function is not crazy outside. That's what we do. So look at the in-sample error. This is what you do for the in-sample error, notwithstanding the 1 over N, which I took out to simplify the form. You take the value of your hypothesis, compare it with the target value y, and square the difference; this is your in-sample error. Now we are going to add
a smoothness constraint. And in this approach, the smoothness
constraint is always taken, almost always taken, as a constraint
on the derivatives. If I have a function, and I tell you that the second derivative is very large, what does this mean? It means the function is wiggling around rapidly-- that's not smooth. And if I go to the third derivative, it will be the rate of change of that, and so on. So I can go for derivatives
in general. And if you can tell me that the
derivatives are not very large in general, that corresponds, in
my mind, to smoothness. The way they formulated the
smoothness is by taking, generically, the k-th derivative of your hypothesis-- the hypothesis now is a function of x. I can differentiate it. I can differentiate it k times, assuming
that it's parametrized in a way that is analytic. And now I'm squaring it, because
I'm only interested in the magnitude of it. And what I'm going to do, I'm going to
integrate this from minus infinity to plus infinity. This will be an estimate of the
size of the k-th derivative, notwithstanding that it's squared. If this is big, that's
bad for smoothness. If this is small, that's
good for smoothness. Now I'm going to up the ante, and combine
the contributions of different derivatives. I am going to combine all the
derivatives with coefficients. If you want only some of them, all you need to do is set the coefficients to zero for the ones you are not using. Typically, you will be using, let's say, the first derivative and the second derivative, and the rest of the coefficients are zero. And you get a condition like that. And now you multiply it by lambda. That's the regularization parameter. And you try to minimize the augmented error here. And the bigger lambda is, the more insistent you are on smoothness versus fitting. And we have seen all of that before.
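Putting the two pieces together, the quantity being minimized is the augmented error. The coefficients a_k weighting the derivative terms are introduced here just for the formula; the lecture does not name them.

\[
E_{\text{aug}}(h) \;=\; \sum_{n=1}^{N} \bigl(h(x_n) - y_n\bigr)^2
\;+\; \lambda \sum_{k} a_k \int_{-\infty}^{+\infty} \left(\frac{d^k h}{dx^k}\right)^{\!2} dx .
\]

The interesting thing is that, if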
you actually solve this under certain conditions and assumptions, and after the incredibly hairy mathematics that goes with it, you end up with radial basis functions. What does that mean? It really means: I'm looking for an interpolation. And I'm looking for as smooth
an interpolation as possible, in the sense of the sum of the squares of the
derivatives with these coefficients. It's not stunning that the best
interpolation happens to be Gaussian. That's all we are saying. So it comes out. And that's what gives it a bigger
credibility as being inherently self-regularized,
and whatnot. And what you get is the smoothest interpolation. And that is one interpretation
of radial basis functions. On that happy note, we will stop,
and I'll take questions after a short break. Let's start the Q&A. MODERATOR: First, can you explain again how an SVM simulates a two-level neural network? PROFESSOR: OK. Look at the RBF, in order
to get a hint. What does this feature do? It actually computes
the kernel, right? So think of what this guy is doing
as implementing the kernel. What is it implementing? It's implementing theta, the sigmoidal
function, the tanh in this case, of this guy. Now if you take this as your kernel,
and you verify that it is a valid kernel-- in the case of radial
basis functions, we had no problem with that. In the case of neural networks, believe
it or not, depending on your choice of parameters, that kernel could be a valid kernel corresponding to a legitimate Z space, or it could be an illegitimate kernel. But basically, you use
that as your kernel. And if it's a valid kernel, you carry
out the support vector machinery. So what are you going to get? You are going to get that value of the
kernel, evaluated at different data points, which happen to be
the support vectors. These become your units. And then you get to combine
them using the weights. And that is the second layer
of the neural network. So it will implement a two-layer
neural network this way.
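As a concrete sketch of that correspondence, using scikit-learn's "sigmoid" (tanh) kernel: the data, gamma, coef0, and C below are illustrative placeholders, not values from the lecture, and whether this kernel is a valid kernel depends on those parameters.

```python
import numpy as np
from sklearn.svm import SVC

gamma, coef0 = 1.0, -1.0          # illustrative kernel parameters
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0] + X[:, 1])    # placeholder labels in {-1, +1}

# Train an SVM whose kernel is tanh(gamma * <x, x'> + coef0).
clf = SVC(kernel="sigmoid", gamma=gamma, coef0=coef0, C=10.0).fit(X, y)

def two_layer_net(x):
    # First layer: one tanh unit per support vector.
    hidden = np.tanh(gamma * clf.support_vectors_ @ x + coef0)
    # Second layer: combine the units with weights alpha_n * y_n, plus bias.
    return np.sign(clf.dual_coef_[0] @ hidden + clf.intercept_[0])

x_test = np.array([0.3, -0.2])
print(two_layer_net(x_test), clf.predict([x_test])[0])  # the two should agree
```

MODERATOR: In a real example, where you're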
not comparing to support vectors, how do you choose the number of centers? PROFESSOR: This is perhaps
the biggest question in clustering. There is no conclusive answer. There are lots of information
criteria, and this and that. But it really is an open question. That's probably the best
answer I can give. In many cases, there is a relatively
clear criterion. I'm looking at the minimization. And if I increase the number of clusters by one, supposedly the sum of the squared distances should go down, because I have one more parameter to play with. So if I increase the number of clusters by one, and the objective function goes down significantly, then it looks like it was warranted to add this center. And if it doesn't, then maybe it's not a good idea.
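One way to apply that heuristic, as a rough sketch assuming scikit-learn (the data here is a placeholder), is to run Lloyd's algorithm for a range of K and watch where the drop in the clustering objective levels off:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_objective_by_k(X, k_values):
    # Sum of squared distances to the nearest center (KMeans' "inertia")
    # for each candidate K; look for the K where the decrease flattens out.
    return {k: KMeans(n_clusters=k, n_init=10).fit(X).inertia_
            for k in k_values}

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 2))   # placeholder data
for k, obj in kmeans_objective_by_k(X, range(1, 11)).items():
    print(k, round(obj, 3))
```

There are tons of heuristics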
like that. But it is really a difficult question. And the good news is that if you
don't get it exactly, it's not the end of the world. If you get a reasonable number
of clusters, the rest of the machinery works. And you get a fairly comparable
performance. It is very seldom that there is an absolute hit, in terms of the number of clusters that are needed, if the goal is to plug
them in later on for the rest of the RBF machinery. MODERATOR: So cross-validation
would be useful for-- PROFESSOR: Validation would
be one way of doing it. There are so many things to validate with respect to, but this is definitely one of them. MODERATOR: Also, is RBF practical in
applications where there's a high dimensionality of the input space? I mean, does Lloyd's algorithm suffer from high-dimensionality problems? PROFESSOR: Yeah, it's a question of-- distances become funny, or sparsity becomes funny, in higher-dimensional space. So the question of the choice of gamma and other things becomes more critical. And if it's a really high-dimensional
space, and you have few points, then it becomes very difficult
to expect good interpolation. So there are difficulties. But the difficulties are inherent. The curse of dimensionality
is inherent in this case. And I think it's not
particular to RBF's. You use other methods. And you also suffer from
one problem or another. MODERATOR: Can you review
again how to choose gamma? PROFESSOR: OK. This is one way of doing it. Let me-- Here I am trying to take advantage of
the fact that determining a subset of the parameters is easy. If I didn't have that, I would have
treated all the parameters on equal footing, and I would have just used
a general nonlinear optimization, like gradient descent, in order to find all
of them at once, iteratively until I converge to a local minimum
with respect to all of them. But I realize that when gamma is fixed, there is a very simple way, in one step, to get to the w's, and I would like to take advantage of that. The way I'm going to take advantage of it is to separate the variables into two groups and alternate between them, in the spirit of the EM algorithm. And when I fix one of them, when I fix gamma, then I can solve for the w_k's directly. I get them. So that's one step. And then I fix the w's that I have, and then
try to optimize with respect to gamma, according to the
mean squared error. So I take this guy with w's being
constant, gamma being a variable, and I apply this to every point in the
training set, x_1 up to x_N, and take it minus y_n squared, sum them up. This is an objective function. And then get the gradient of that and
try to minimize it, until I get to a local minimum. And when I get to a local minimum, it's a local minimum with respect to this gamma, with the w_k's held constant. There's no question of varying the w_k's in that step. But I get a value of gamma at which I attain a minimum. Now I freeze it, and repeat the iteration. And going back and forth will be far more efficient than doing gradient descent on everything at once, just because the step that involves so many variables is a one-shot. And usually, the EM algorithm
converges very quickly to a very good result. It's a very successful algorithm
in practice.
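A minimal sketch of that alternation is below. The centers, the step size, and the use of a numerical gradient are my own simplifications for illustration, not specifics from the lecture; the centers would typically come from Lloyd's algorithm, as earlier.

```python
import numpy as np

def rbf_matrix(X, centers, gamma):
    # Phi[n, k] = exp(-gamma * ||x_n - mu_k||^2), with a bias column in front.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(X), 1)), np.exp(-gamma * d2)])

def solve_w(X, y, centers, gamma):
    # Step 1: for fixed gamma, the weights come from the pseudo-inverse in one shot.
    return np.linalg.pinv(rbf_matrix(X, centers, gamma)) @ y

def mse(X, y, centers, gamma, w):
    return np.mean((rbf_matrix(X, centers, gamma) @ w - y) ** 2)

def alternate(X, y, centers, gamma=1.0, outer=100, step=0.05, eps=1e-4):
    for _ in range(outer):
        w = solve_w(X, y, centers, gamma)
        # Step 2: with w frozen, take a small descent step on gamma,
        # using a numerical gradient of the mean squared error.
        grad = (mse(X, y, centers, gamma + eps, w)
                - mse(X, y, centers, gamma - eps, w)) / (2 * eps)
        gamma = max(gamma - step * grad, 1e-6)   # keep gamma positive
    return gamma, solve_w(X, y, centers, gamma)
```

MODERATOR: Going back to neural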
networks, now that you mentioned the relation with the SVM's. In practical
problems, is it necessary to have more than one hidden
layer, or is it-- PROFESSOR: Well, in terms
of the approximation, there is an approximation result that tells you you
can approximate everything using a two-layer neural network. And the argument is fairly similar to
the argument that we gave before. So it's not necessary. And if you look at people who are using
neural networks, I would say the minority use more than two layers. So I wouldn't consider the restriction
of two layers dictated by support vector machines as being a very
prohibitive restriction in this case. But there are cases where you need
more than two layers, and in that case, you go just for the
straightforward neural networks, and then you have an algorithm
that goes with that. There is an in-house question. STUDENT: Hi, professor. I have a question about slide one. Why do we come up with this radial basis function? You said it's because the hypothesis is affected by the data point which is closest to x. PROFESSOR: This is the slide
you are referring to, right? STUDENT: Yeah. This is the slide. So is it because you assume that the
target function should be smooth? So that's why we can use this. PROFESSOR: It turns out, in
hindsight, that this is the underlying assumption, because when we looked at
solving the approximation problem with smoothness, we ended up with those
radial basis functions. There is another motivation,
which I didn't refer to. It's a good opportunity to raise it. Let's say that I have a data
set, (x_1, y_1), (x_2, y_2), up to (x_N, y_N). And I'm going to assume
that there is noise. But it's a funny noise. It's not noise in the value y. It's noise in the value x. That is, I can't measure
the input exactly. And I want to take that into
consideration in my learning. The interesting ramification is the following. Assume that there is noise, and let's say that the noise is Gaussian, which is a typical assumption. Then although this is the x that was given to me, the real x could be here, or here, or here. And since I have the value y at that x, the value y itself I'm going to consider to be noiseless in that case. I just don't know which
x it corresponds to. Then you will find that when you solve
this, you realize that what you have to do is make the value of your hypothesis not change much when you change x, because you run the risk of missing it. And if you solve it, you actually end up with an interpolation which is Gaussian in this case. So you can arrive at the same thing
under different assumptions. There are many ways
of looking at this. But definitely smoothness comes in one way or another, whether from the observation here, from the regularization, from the input-noise interpretation, or from other interpretations. STUDENT: OK, I see. Another question is about slide
six, when we choose small gamma or large gamma. Yes, here. So actually here, just from this example,
can we say that definitely small gamma is better than
large gamma here? PROFESSOR: Well,
small is relative. So the question is-- this is related to
the distance between points in the space, because the value of the Gaussian
will decay in that space. And this guy looks great if
the two points are here. But the same guy looks terrible if the
two points are here because, by the time you get here, it
will have died out. So it's all relative. But relatively speaking, it's a good
idea to have the width of the Gaussian comparable to the distances between the
points so that there is a genuine interpolation. And the objective criterion for choosing
gamma will affect that, because when we solve for gamma,
we are using the K centers. So you have points that are the centers of the Gaussians. But you need to worry about each Gaussian covering the data points that are nearby. And therefore, you are going to adjust the width of that Gaussian up or down, and of the other ones, such that the influence gets to those points. So the good news is that there is
an objective criterion for choosing it. This slide was meant to make the
point that gamma matters. Now that we know it matters, we look for a principled way of choosing it. And the other approach I described was that principled way of choosing it. STUDENT: So does that mean that
choosing gamma makes sense when we have fewer clusters than the number of samples? Because in this case, we have three clusters and three samples. PROFESSOR: This was not meant
to be a utility for gamma. It was meant just to visually illustrate
that gamma matters. But the main utility, indeed,
is for the K centers. STUDENT: OK, I see. Here actually, in both cases,
the in-sample error is zero, same generalization behavior. PROFESSOR: You're
absolutely correct. STUDENT: So can we say that K, the
number of clusters, is a measure of the VC dimension, in this sense? PROFESSOR: Well, it's a cause and effect. When I decide on the number of
clusters, I decide on the number of parameters, and that will affect
the VC dimension. So this is the way it is, rather
than the other way around. I didn't want people to take the
question as: oh, we want to determine the number of clusters, so let's
look for the VC dimension. That would be the argument backwards. So the statement is correct. They are related. But the cause and effect is that your
choice of the number of clusters affects the complexity of
your hypothesis set. STUDENT: Not the reverse? Because I thought, for example, if you
have N data points, and we know what kind of VC dimension will give good
generalization, so based on that, can we kind of-- PROFESSOR: So this
is out of necessity. You're not saying that this is the
inherent number of clusters that are needed to do this. This is what I can afford. STUDENT: Yeah, that's what I mean. PROFESSOR: And then
in that case, it's true. But in this case, it's not the number
of clusters you can afford-- it is indirectly-- it is the number of parameters you can
afford, because of the VC dimension. And because I have that many parameters,
I have to settle for that number of clusters, whether they break up the data points correctly or not. The only thing I'm trying to avoid
is that I don't want people to think that this will carry an answer to
the optimal choice of clusters, from an unsupervised learning
point of view. That link is not there. STUDENT: I see. But because like in this example, we
deal with-- it seems there's no natural cluster in the input sample, it's
uniformly distributed in the input space. PROFESSOR: Correct. And in many cases, even if there is
clustering, you don't know the inherent number of clusters. But again, the saving grace here is that
we can do a half-cooked clustering, just to have a representative
of some points, and then let the supervised stage of learning
take care of getting the values right. So it is just a way to
think of clustering. Instead of using all the points, I'm trying to use K centers. And I want them to be as
representative as possible. And that will put me
ahead of the game. And then the real test would be when I plug it into the supervised stage. STUDENT: OK. Thank you, professor. MODERATOR: Are there cases when RBF's
are actually better than SVM's? PROFESSOR: There are cases. You can run them in a number of
cases, and if the data is clustered in a particular way, and the clusters happen to have a common value, then you would expect that doing the unsupervised learning will get you ahead, whereas the SVM's support vectors are now on the boundary, and they have to be such that the cancellations of the RBF's give you the right value. So you can definitely create cases where
one will win over the other. Most people will use the RBF kernels,
the SVM approach. MODERATOR: Then that's
it for today. PROFESSOR: Very good. Thank you. We'll see you next week.