Good morning. Today, we will look at non-linear SVM and the kernel functions that let us work with non-linear SVM. In the last class, we looked at linear SVM and we also saw how to deal with noise in linear SVM. So, when the data set is actually linearly separable, but there is some amount of noise at the boundary, we can approximate the boundary by a linear decision surface and allow a soft margin by incorporating slack variables. But what if the data set is truly non-linearly separable?
What we can do is this: suppose x is our input feature vector. We can use a mapping to map x to φ(x); that is, from the original feature space or attribute space we transform into a new feature space, and in some cases the training points become linearly separable in the transformed feature space. So, usually we will be mapping the data to a higher dimensional feature space, and in some cases it is found that when you map the data into an appropriate higher dimensional feature space, whose dimension may even be infinite, the points become linearly separable.
Now, if we do use a big set of features, then what about the computational cost?
We have seen earlier that while solving the SVM we need to compute Σ_{i,j} α_i α_j y_i y_j (x_i · x_j). So, we need the pairwise dot products: if we have m training examples, we need to do m^2 computations like this, and each time we need to find the dot product x_i · x_j. Now, if these vectors have dimensionality d, finding one dot product takes O(d) time, so in total about O(m^2 d) time is taken. Now, if we transform to a higher dimensional space, x_i · x_j becomes φ(x_i) · φ(x_j), and if that dimensionality D is much greater than the original small d, the time taken will be O(m^2 D), which can be quite large.
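To make this cost argument concrete, here is a minimal numpy sketch (the sizes and values are made up purely for illustration) that computes all m^2 pairwise dot products; if we first mapped every x_i to a much longer vector φ(x_i), each of those dot products would become correspondingly more expensive.

import numpy as np

m, d = 300, 20            # m training examples, each with d attributes
X = np.random.randn(m, d)

# Naive pairwise dot products: m*m entries, each an O(d) dot product,
# so roughly O(m^2 * d) work overall.
G = np.empty((m, m))
for i in range(m):
    for j in range(m):
        G[i, j] = X[i] @ X[j]

# The same Gram matrix in one vectorised call, as a sanity check.
assert np.allclose(G, X @ X.T)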
So, normally, if we transform the features to a higher dimensional feature space, it will lead to a higher computational cost, but we will see that by using kernel functions we can achieve this transformation without any major impact on the computational cost. So, first let us see why we would be able to convert to a different feature space where the points are linearly separable. Let us
look at the first picture; here we have the black points inside a circle and the red points outside the circle. Now, in the (x1, x2) dimensions these points are not separable by a linear classifier. However, if we go to a different feature space whose axes are x1^2 and x2^2, we can plot the black and red points there again, and in this case we find that they can be separated linearly in the transformed feature space (x1^2, x2^2).
A simpler example: suppose there is a single feature x1, and we have these red points and black points which are clearly not linearly separable. Now, we may map each point x1 to x1^2, and the points may become linearly separable. Suppose this is the origin, and say the points of one class lie between −3 and +3 while the points of the other class lie outside that interval. When we transform to x1^2, the points of the first class will lie between 0 and 9 and the points of the other class beyond 9, so they become linearly separable.
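As a small illustration, here is a minimal Python sketch of that one-dimensional example (the specific points are made up; the classes and the threshold 9 follow the description above):

import numpy as np

x = np.array([-4.0, -2.0, -1.0, 0.5, 2.5, 3.5, 5.0])   # single feature x1
y = np.where(np.abs(x) < 3, 1, -1)                      # class +1 inside (-3, 3), -1 outside

z = x ** 2            # transformed feature x1^2
threshold = 9.0       # a linear boundary in the transformed space
pred = np.where(z < threshold, 1, -1)

print(pred == y)      # every point is classified correctly after the mapping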
It is possible, as we saw in this picture, that in certain cases, when you convert the feature space to an appropriate higher dimensional feature space, the points may become linearly separable. Here is one more example: here we have red and blue points, and after the transformation, where originally we had 2 features and now we have 3 features, with these 3 features we are able to separate the points in the new feature space. So, this is one aspect: points which are not linearly separable in the original feature space may become separable in the new, non-linearly transformed feature space. The second thing we will look at is that there are certain cases, certain types of feature transformations, where after the transformation the SVM can still be solved efficiently. This brings us to the notion of kernel functions, or kernels.
Now, suppose we originally have 2 points
x_a and x_b, and by using the transformation φ the new inputs become φ(x_a) and φ(x_b). Now, as we saw in the solution of the SVM, we need to find x_a · x_b, and when we use φ we will instead have to compute φ(x_a) · φ(x_b).
Now, for certain φ's, not for all functions, we have something like this: φ(x_a) · φ(x_b) is a function of x_a and x_b, and in certain cases this function can be computed without expanding φ(x_a) and φ(x_b). So, this particular function k(x_a, x_b) is a function of x_a and x_b whose value is the same as the dot product of φ(x_a) and φ(x_b), but it may be easy to compute; more specifically, we may be able to compute it without expanding the individual φ's. Such functions are called kernel functions.
So, if we use kernel functions, we can let the kernel do the work. In the solution of the SVM, where we have to find φ(x_a) · φ(x_b), normally for every pair of examples we would expand φ(x_a), expand φ(x_b), and take the dot product; but for kernel functions we can simply use the function k(x_a, x_b). So, for certain φ's there is a simple operation k by which we can find the dot product of the φ's cheaply. This is called the kernel trick. And when we apply the kernel trick, as we saw
that when we use an SVM to classify a test point x, we apply w · φ(x) + b, which is equal to Σ_i α_i y_i φ(x_i) · φ(x) + b, where i ranges over the support vectors, that is, those points for which α_i is non-zero. Now, if we are using the kernel trick, this φ(x_i) · φ(x) which we have to compute can instead be computed as the kernel function of x_i and x.
So, we have seen that in the formulation of the SVM and in the solution of the dual problem, the feature vectors appear, both in training and in testing, only through dot products (scalar products); and we have also seen that a kernel function is a function which corresponds to the dot product of 2 feature vectors in some expanded feature space. So, if we are using a kernel function, what we can do is that instead of φ(x_i) · φ(x) we can use k(x_i, x), and this will make the computation simpler. This k(x_i, x) may be inexpensive to compute even when φ(x) is extremely high dimensional. Let us now look at an example of a kernel
function. So, let us assume a 2-dimensional vector x which has 2 components, x1 and x2; x1 and x2 are the 2 attributes of this vector x. Now, let us take k(x_i, x_j) = (1 + x_i · x_j)^2, where x_i and x_j are 2 vectors in the original attribute space. So, this is the kernel function: k(x_i, x_j) = (1 + x_i · x_j)^2.
Now, this function is easy to compute: you take the dot product in the original attribute space, add 1, and square the result. Now, we will show that this k(x_i, x_j) corresponds to a kernel function. We can expand k(x_i, x_j) = (1 + x_i · x_j)^2. So, x_i has 2 components, x_i1 and x_i2; similarly, x_j has 2 components, x_j1 and x_j2. Now, we can expand this function as
1 + x_i1^2 x_j1^2 + 2 x_i1 x_j1 x_i2 x_j2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 + 2 x_i2 x_j2.
Now, this algebraic expression can be rewritten as the dot product of 2 vectors, which you can easily verify:
(1, x_i1^2, √2 x_i1 x_i2, x_i2^2, √2 x_i1, √2 x_i2) · (1, x_j1^2, √2 x_j1 x_j2, x_j2^2, √2 x_j1, √2 x_j2).
Each of these is a 6-dimensional vector. So, this expression can be written as a dot product of these 2 vectors, as you can verify by expanding it. So, k(x_i, x_j), which is (1 + x_i · x_j)^2, can be written as a dot product of these 2 vectors.
So, we can write k(x_i, x_j) = (1 + x_i · x_j)^2 = φ(x_i) · φ(x_j). That is, if we transform x_i to the new space where φ(x_i) = (1, x_i1^2, √2 x_i1 x_i2, x_i2^2, √2 x_i1, √2 x_i2), instead of the original vector (x_i1, x_i2), then the dot product in this new space is equal to k(x_i, x_j) = (1 + x_i · x_j)^2.
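As a quick numerical check, here is a small numpy sketch (the two points are arbitrary) verifying that the kernel value (1 + x_i · x_j)^2 equals the dot product of the two 6-dimensional vectors φ(x_i) and φ(x_j):

import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel k(a, b) = (1 + a.b)^2 in 2 dimensions
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

def k(a, b):
    return (1.0 + a @ b) ** 2

xi = np.array([0.7, -1.2])
xj = np.array([2.0, 0.3])

print(k(xi, xj), phi(xi) @ phi(xj))          # the two numbers agree
assert np.isclose(k(xi, xj), phi(xi) @ phi(xj))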
So, corresponding to this transformation φ, the kernel function can be computed very easily like this, without expanding the individual features. So, this is an example of a kernel function. There are many other commonly used kernel functions, some of which we can see in this slide.
The linear SVM corresponds to the linear kernel, where φ(x_i) = x_i and φ(x_j) = x_j, that is, the identity mapping, so k(x_i, x_j) = x_i · x_j; this is called the linear kernel. For the general polynomial kernel, k(x_i, x_j) = (1 + x_i · x_j)^p. We saw the example of the quadratic kernel, where k(x_i, x_j) = (1 + x_i · x_j)^2; this time we generalise to the polynomial kernel of power p, (1 + x_i · x_j)^p. Then we have the Gaussian kernel, or the kernel corresponding to the radial basis function, where k(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)). So, this is the Gaussian kernel. Another popular kernel is the sigmoid kernel
which is given by k(x_i, x_j) = tanh(β_0 x_i · x_j + β_1), where β_0 and β_1 are parameters. So, these are some examples of kernel functions which are quite popular. A kernel function measures some similarity between instances x_i and x_j, and we can define many types of kernel functions depending on the domain and the notion of similarity that is appropriate.
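For concreteness, here is a minimal Python sketch of these four kernels written as plain functions of two numpy vectors (the parameter defaults p, σ, β_0, β_1 are arbitrary values chosen only for illustration):

import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    # p = 2 gives the quadratic kernel from the example above
    return (1.0 + xi @ xj) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    # Radial basis function kernel exp(-||xi - xj||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for kern in (linear_kernel, polynomial_kernel, gaussian_kernel, sigmoid_kernel):
    print(kern.__name__, kern(xi, xj))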
For example, you can use a kernel function for text classification based on the similarity of the words in the text, and so on. In general, functions that qualify as kernel functions satisfy Mercer's condition; those functions that satisfy Mercer's condition are called kernel functions.
We will not go into a lot of detail here, but we can say that kernel functions are similarity measures, and this similarity is usually symmetric, that is, k(x_i, x_j) = k(x_j, x_i); however, not all similarity measures are kernel functions. Mercer's condition states that if you write the similarities between the points as a matrix and this matrix is symmetric and positive semi-definite, then a corresponding kernel function will exist. More generally, Mercer's condition states that any symmetric function k(x, y) for which Σ_{i,j} k(x_i, x_j) c_i c_j ≥ 0 for any real numbers c_i, c_j can be considered a kernel function, because such functions can be expressed as a dot product in some high dimensional space.
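As an informal check of this condition, here is a small numpy sketch (for illustration only) that builds the Gram matrix of the Gaussian kernel on a few random points and verifies that it is symmetric with non-negative eigenvalues, i.e., positive semi-definite up to numerical error:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # 10 points in 3 dimensions

# Gram matrix G[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) with sigma = 1
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G = np.exp(-sq_dists / 2.0)

print(np.allclose(G, G.T))                        # symmetric
print(np.linalg.eigvalsh(G).min() >= -1e-10)      # eigenvalues non-negative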
We will not go into the details of this, but it gives us a general characterisation of kernel functions. We can also compose more than one kernel function to get a new kernel function, but we will not discuss that further here.
But I will show you certain cases where the decision surface is linear or non-linear, and the kernel functions which are appropriate in those cases. If we look at this picture, for this case a linear kernel with noise (a soft margin) will be quite appropriate, whereas in this case there is no good linear decision surface: you have a quadratic decision surface, and a second order polynomial kernel, that is a quadratic kernel, will be appropriate. In this case a fourth order polynomial gives a good decision surface, and here an eighth order polynomial; these examples can be handled with polynomial kernels of order 8, order 4, and so on. This picture shows points where red and blue are the 2 classes and they are not separable with a linear kernel. However, you can use a Gaussian kernel to separate them; a Gaussian kernel can be used to separate points where the points of a class lie within a Gaussian of each other. So, here Gaussian kernels will be appropriate.
Now, let us see how we can solve the SVM once we have a kernel function. The formulation is very similar: we have φ(x_i) and φ(x_j), and in the dual formulation the solution amounts to maximising
Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j φ(x_i) · φ(x_j),
where φ(x_i) · φ(x_j) can be written as k(x_i, x_j). The conditions are the same: each α_i lies between 0 and C, and Σ_i α_i y_i = 0. After you have solved for these α's, you get the solution.
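To show what that kernelised objective looks like in code, here is a minimal numpy sketch (the data, labels, and α values are made up; a real solver would optimise the α's subject to the constraints just stated):

import numpy as np

def dual_objective(alpha, X, y, kernel):
    # Sum(alpha_i) - 1/2 * Sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)   # Gaussian kernel, sigma = 1

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.5, 0.5, 0.5])    # toy values satisfying Sum(alpha_i * y_i) = 0

print(dual_objective(alpha, X, y, rbf))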
Then, when you get a test point x, its classification can be found from g(x) = Σ_i α_i y_i φ(x_i) · φ(x) + b, where i ranges over the support vectors, and this becomes Σ_i α_i y_i k(x_i, x) + b. So, the kernel values k(x_i, x) over the support vectors, together with b, give the prediction, and if k is easy to compute, then the computational cost of the solution in the training phase as well as the testing phase is not much different from that of the linear SVM.
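In practice, a library is usually used for all of this. As an illustration only, here is a short sketch using scikit-learn's SVC (assuming scikit-learn is available; the toy circular data set is made up), which solves exactly this kernelised dual formulation:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)   # class 1 inside the unit circle

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)             # Gaussian kernel; gamma = 1 / (2 sigma^2)
clf.fit(X, y)

x_test = np.array([[0.2, 0.1]])
print(clf.decision_function(x_test))   # the score g(x)
print(clf.predict(x_test))             # the predicted class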
In conclusion, we can say that support vector machines have been found to perform quite well in practice. If an appropriate kernel function exists to map the original attribute space to a feature space where the points are linearly separable, then the support vector machine works well; however, the user has to choose or define that appropriate kernel function. Support vector machines can also be expensive in time and space for big datasets.
This is because the solution looks at pairs of training examples, so the computation of the maximum margin hyperplane depends on the square of the number of training cases. Secondly, for the solution we need to store all the support vectors; if the number of support vectors is large, the solution time will also increase, since it depends on the number of support vectors and the number of training examples. The kernel trick can also be used to do principal component analysis in a higher dimensional space. We have already talked about principal component analysis, where we take the features and reduce them; we could do PCA in a higher dimensional space using kernel functions, but we will not talk about that today.
We will just conclude by noting that the support vector classifier we have seen is used primarily for classifying 2 classes, for separating 2 classes. What if you have multiple classes? The support vector machine by itself only looks at 2 classes, but we can use a combination of SVMs to deal with multiple classes. Suppose we have n classes; then we can learn n support vector machines. SVM 1 learns to classify class 1 versus the rest, SVM 2 to separate class 2 versus the rest, and so on, up to SVM n, which separates class n versus the rest. So, we develop n different support vector classifiers.
Now, to predict the output for a new test point, we apply all the support vector machines, SVM 1, SVM 2, ..., SVM n, and then we find out which one of the SVMs gives its prediction with the highest confidence. For the confidence, for each SVM i we look at the value w_i · x + b_i; the bigger this value is, the higher the confidence, because the point lies further from the decision boundary. So, out of these n outputs, we can choose the one for which the point is furthest from the margin on the positive side, and we take the positive classification corresponding to that class.
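Here is a minimal sketch of this one-versus-rest scheme, assuming scikit-learn and using its OneVsRestClassifier wrapper around a kernel SVM (the iris data set is used only as a convenient 3-class example):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # 3 classes, so 3 one-vs-rest SVMs are trained

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale", C=1.0))
ovr.fit(X, y)

x_test = X[:1]
scores = ovr.decision_function(x_test)     # one confidence score per class-vs-rest SVM
print(scores)
print(scores.argmax(), ovr.predict(x_test))  # the class with the highest score is chosen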
With this, we come to the end of kernel SVM. Thank you.