Mod-01 Lec-29 Support Vector Machine

Hello. In the last two classes, we discussed the radial basis function neural network, and we also compared the radial basis function neural network with the multilayer perceptron to see its advantages and disadvantages. We said that a radial basis function neural network consists of three different layers. One of them, present in every type of neural network, is the input layer, which accepts the input feature vector and forwards it to the layers above. In an RBF neural network, we then have one hidden layer and one output layer. The neurons in the output layer determine the class belongingness of the input feature vector. The function of the neurons in the hidden layer is to compute a radial basis function value; through these radial basis functions, what we effectively do is map the input feature vector from its original dimension to a higher dimensional space through a non-linear mapping or non-linear transformation. This non-linear mapping is done by the set of radial basis functions. The purpose of performing this non-linear transformation from a lower dimensional space to a higher dimensional space is that if you increase the dimensionality of the feature vectors, then feature vectors belonging to different classes that are not linearly separable in the lower dimensional space quite likely become linearly separable in the higher dimensional space. That was the basic motivation of the radial basis function neural network: by increasing the dimension of the input feature vectors, we increase the possibility of linear separability. The output layer of the radial basis function neural network simply computes a linear combination of the outputs of the middle layer nodes, and based on this linear combination it decides to which class the input feature vector x should be classified. As a result, the section of the network from the middle layer nodes to the output layer nodes is effectively a linear classifier. So, the advantage of the radial basis function neural network is that I know how many hidden layers I have to have: it is only one. In a multilayer perceptron, I cannot easily decide how many hidden layers I should have. Similarly, the number of nodes per hidden layer cannot be easily decided in a multilayer perceptron, whereas in an RBF network I can set how many nodes the hidden layer needs, because it is simply the dimensionality of the space to which I want to cast my input feature vectors. The other advantage is the interpretation of the functions of the hidden layer nodes, which is quite easy in an RBF neural network, while the functionality of the nodes in the hidden layers of a multilayer perceptron is not that clear. Also, training an RBF network is faster than training a multilayer perceptron.
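To make this recap concrete, here is a minimal sketch of the RBF-network idea, assuming Gaussian radial basis functions and hand-picked centres and weights; these particular choices are assumptions for illustration, not values from the lecture.

```python
# Minimal sketch of an RBF network: a hidden layer of Gaussian radial basis
# functions followed by a linear output layer. Centres, sigma and weights are
# hypothetical.
import numpy as np

def rbf_hidden_layer(x, centres, sigma=1.0):
    """Map input x to a higher-dimensional space: one Gaussian RBF per centre."""
    return np.exp(-np.sum((centres - x) ** 2, axis=1) / (2 * sigma ** 2))

def rbf_output(x, centres, weights, sigma=1.0):
    """Output layer: a linear combination of the hidden-layer responses."""
    return weights @ rbf_hidden_layer(x, centres, sigma)

# Example: a 2-D input mapped to a 4-D space via 4 centres.
centres = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = np.array([0.5, -0.2, -0.2, 0.7])   # hypothetical trained weights
print(rbf_output(np.array([0.4, 0.6]), centres, weights))
```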
The only apparent disadvantage of the RBF neural network is that the classification task takes more time than classification with a multilayer perceptron. So, this is what we discussed over the last two classes. Today we are going to discuss another kind of classifier, called the support vector machine, or SVM. Before we start our discussion on the support vector machine, let us recapitulate one of the linear classifiers we discussed earlier, the linear discriminant function. In one of the early lectures, we said that if we have an input feature vector x, we can define a linear discriminant function g(x). For simplicity, let us consider a two-class problem: we have two classes, class c1 and class c2, and an unknown feature vector x is to be classified as belonging either to class c1 or to class c2. For doing this, when we talked about the linear discriminant function, we defined g(x) = w transpose x + b, or, as we put it in the early lectures, g(x) = w transpose x + w0, where x is the input feature vector, w is the weight vector and b (or w0) is a bias term. If I set this linear function equal to 0, I get the linear relation w transpose x + b = 0. If the feature vector is 2-dimensional, this linear equation represents a straight line; if the feature vector is 3-dimensional, it represents a plane; and if the dimensionality of the feature vector is more than 3, it represents what is called a hyperplane, and w is nothing but a vector perpendicular to that hyperplane. So the vector w represents the orientation of the hyperplane in my d-dimensional space, where d is the dimensionality of the feature vector, whereas the constant term b (or w0) represents the position of that hyperplane in the d-dimensional space. That is why b is usually known as the bias term: it biases the position of the hyperplane in the d-dimensional space. Now, coming to our classification problem, for every feature vector x I have to compute this linear function w transpose x + b. Let us take a specific vector x1 and compute g(x1) = w transpose x1 + b. If x1 lies on the positive side of the hyperplane, then w transpose x1 + b will be greater than 0. If x1 lies on the negative side of the hyperplane, then w transpose x1 + b will be less than 0. And if x1 lies on the hyperplane, then w transpose x1 + b will be equal to 0.
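As a small illustration of the discriminant function just described, the sketch below computes g(x) = w transpose x + b and classifies by its sign. The particular w, b and test vectors are illustrative assumptions.

```python
# Two-class linear discriminant: g(x) = w . x + b, class c1 if positive,
# class c2 if negative. Values of w and b are hypothetical.
import numpy as np

def g(x, w, b):
    return np.dot(w, x) + b

def classify(x, w, b):
    return "c1" if g(x, w, b) > 0 else "c2"

w = np.array([1.0, -2.0])   # normal to the hyperplane (its orientation)
b = 0.5                     # bias: position of the hyperplane

print(classify(np.array([3.0, 1.0]), w, b))   # g = 1.5 > 0  -> c1
print(classify(np.array([0.0, 1.0]), w, b))   # g = -1.5 < 0 -> c2
```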
So, this hyperplane divides my d-dimensional space into two half spaces. If I take a vector x in one half space, g(x) for that vector will be positive; if I take a vector x in the other half space, g(x) for that vector will be negative. I can therefore have my classification rule: if w transpose x1 + b is greater than 0, then I infer that x1 belongs to class c1, whereas if w transpose x1 + b is negative, then I infer that x1 belongs to class c2. That is my classification rule. As we said, I first have to train this classifier, which means I have to find the weight vector w and the bias term b so that g(x), with the trained values of w and b, can be used as a classifier. For training we used supervised learning: we are given a large number of samples, some coming from class c1 and some coming from class c2. Using these training samples, training was done in an iterative fashion. We started with an initial value of w and b, and for every training sample belonging to class c1 we checked whether w transpose x + b is greater than 0, because the sample is taken from class c1, so it has to be greater than 0. If it is not, we modified w and b in such a way that the position as well as the orientation of the hyperplane changes so that this particular x, taken from class c1, moves to the positive side of the hyperplane. Similarly, if I take a vector from class c2, we checked whether w transpose x + b is negative; if it is not, then again we modified w and b so that this particular x moves to the negative side of the hyperplane. This is what we did in the case of the linear discriminant function; a short sketch of this iterative loop is given after this passage. We also said there that a sample lying just beyond the hyperplane w transpose x + b = 0 may still be correctly classified, but the classifier I obtain in that way may not be a good one. I can have a situation something like this. Let us take 2-dimensional feature vectors, with components x1 and x2, so these are the two components of the feature vector. Suppose I have one set of feature vectors belonging to class c1 and another set belonging to class c2, and these are the feature vectors given for supervised training. Now suppose my classifier, which in two dimensions is a straight line (in higher dimensions a hyperplane), passes very close to the class c1 samples. Even here you find that this line correctly classifies all the feature vectors given for training of the classifier; but is this type of classifier desirable? Obviously, a classifier of this nature is not desirable, because it gives a large bias in favor of class c2 while it puts a penalty against class c1: almost the entire space is given to class c2, whereas the margin left for class c1 is very small.
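Here is the promised sketch of the iterative training loop recalled above. The perceptron-style update w ← w + y x, b ← b + y is one standard correction rule, used here as an assumption, since the lecture does not fix the exact update.

```python
# Iterative training of a linear classifier: start from an initial w, b and
# correct them whenever a labelled sample falls on the wrong side of the
# hyperplane. The specific update rule below is a perceptron-style choice.
import numpy as np

def train_linear(samples, labels, epochs=100, lr=1.0):
    w = np.zeros(samples.shape[1])
    b = 0.0
    for _ in range(epochs):
        updated = False
        for x, y in zip(samples, labels):        # y is +1 for c1, -1 for c2
            if y * (np.dot(w, x) + b) <= 0:      # wrong side (or on the plane)
                w += lr * y * x                  # rotate the hyperplane
                b += lr * y                      # shift its position
                updated = True
        if not updated:                          # every sample correctly classified
            break
    return w, b

X = np.array([[2.0, 2.0], [-1.0, -1.0]])         # toy training data
y = np.array([1.0, -1.0])
print(train_linear(X, y))
```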
Rather than such a classifier, I would prefer to have a classifier placed somewhere in between, so that the training vectors from class c1 and from class c2 are both equally far from the plane separating the two classes, and the support vector machine actually tries to find a classifier positioned in exactly this way. Now, let us see how such a support vector machine can actually be designed. The basic intuition of the support vector machine is this: if I am standing on one side of a boundary, and there is a possibility that I may be pushed across to the other side, I will feel safer if my distance from the boundary is larger. If my distance from the boundary is very small, a little disturbance or a little noise can push me to the other side, in which case I will be misclassified. But if I am standing far from the boundary surface, the margin I have is quite large, and to push me to the other side, to make me misclassified, there has to be a large disturbance. So I feel safer if my distance from the boundary is large, and I am not safe if my distance from the boundary is very small. What the support vector machine does, the way it tries to design the classifier, is to maximize the distance of the separating boundary between the two classes from each of the feature vectors, whether the feature vector belongs to class c1 or to class c2. I will come to the support vectors themselves a bit later. Now, suppose I have a vector xi. If this xi belongs to class c1, then w transpose xi + b has to be greater than 0; this w transpose xi I can also write as a dot product, w · xi, because the two operations are the same. If xi belongs to class c2, then w · xi + b has to be less than 0. So w · xi + b is positive for class c1 and negative for class c2. Now, when I go about designing the classifier, I know to which class the sample xi belongs; I know whether xi belongs to class c1 or to class c2, so along with every xi I can assign its class belongingness. Along with every xi I also specify a yi, where this yi can be either +1 or -1. If xi belongs to class c1, the corresponding yi will be +1; if xi belongs to class c2, the corresponding yi will be -1. And if I feed the data in this form and compute yi (w · xi + b), this quantity will always be positive, irrespective of whether xi belongs to class c1 or class c2. Because if xi belongs to class c1, then w · xi + b is positive and yi is also positive, so the product is positive; and if xi belongs to class c2, then w · xi + b is negative and yi is -1, so again the product of the two is positive.
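A tiny check of the yi trick just described, with illustrative values: multiplying by yi makes the quantity positive for correctly classified samples of either class.

```python
# yi = +1 for class c1 samples, yi = -1 for class c2 samples, so that
# yi * (w . xi + b) is positive whenever xi lies on the correct side,
# whichever class it comes from. w, b and the samples are hypothetical.
import numpy as np

w, b = np.array([1.0, 1.0]), -1.0
samples = np.array([[2.0, 2.0],     # class c1 (positive side)
                    [-1.0, -1.0]])  # class c2 (negative side)
labels = np.array([1.0, -1.0])

for x, y in zip(samples, labels):
    print(y * (np.dot(w, x) + b))   # both values come out positive (3.0 and 3.0)
```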
So, this is what I will have, and using this concept I can go about designing the classifier and obtain w and b. Then, for an unknown feature vector, say p, which has to be classified, I have to put p into either c1 or c2 using the w and b obtained when designing the classifier. In that case I do not have any yi, because p is unknown; I simply compute w · p + b, where w and b have already been decided during the training process. If w · p + b is greater than 0, I classify p to class c1; if it is less than 0, I classify p to class c2. So this is the kind of situation I have. And as I said, in a support vector machine my aim is to maximize the distance of the hyperplane, the separating boundary, from each of the feature vectors, so that every feature vector is safe as far as this classifier is concerned. To do that, we make a modification to our linear classifier design. Earlier we only required w transpose xi + b to be greater than 0 for correct classification; now I want w transpose xi + b to be greater than some margin gamma, and this margin is nothing but a measure of the distance of xi from the separating plane. If I have a hyperplane whose equation is w · x + b = 0, the distance of a point x from this hyperplane is simply given by (w · x + b) divided by the norm of w; this is known from school-level geometry. What we want is for this distance to be greater than or equal to some margin, say gamma. My decision regarding the correct classification of the feature vectors is independent of scaling the vector w, because the vector w simply tells me the orientation of the plane; whatever scaling factor I apply to w, my decision remains the same. So I can simply require that w · x + b be greater than or equal to gamma times the norm of w, and by proper scaling I can set this right-hand side equal to 1; it is simply a matter of scaling. In that case, for every vector I have the situation that w · x + b has to be greater than or equal to 1 if x belongs to c1, and less than or equal to -1 if x belongs to class c2. And now, for a vector xi, if I multiply by the corresponding yi during the learning or training process, I have the situation that yi (w · xi + b) is always greater than or equal to 1. This equality, yi (w · xi + b) = 1, holds if xi is a support vector; the quantity will be greater than 1 if xi is not a support vector.
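A short numerical illustration of the distance formula quoted above; the particular w, b and point are illustrative.

```python
# Distance of a point x from the hyperplane w . x + b = 0 is |w . x + b| / ||w||.
import numpy as np

def distance_from_hyperplane(x, w, b):
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0                             # ||w|| = 5
print(distance_from_hyperplane(np.array([3.0, 4.0]), w, b))   # |9 + 16 - 5| / 5 = 4.0
```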
So the question becomes: what is a support vector? To explain what a support vector is, I simply go back to my 2-dimensional example. I have 2-dimensional vectors with components x1 and x2, a set of feature vectors from class c1, and another set of feature vectors coming from class c2. Now you can see that I can draw one straight line just touching the class c1 feature vectors closest to the other class, and another straight line just touching the class c2 feature vectors closest to class c1. The training feature vectors lying on these two lines are just on the boundary, and any disturbance to these feature vectors will greatly affect the classifier; if a feature vector lying well inside its class is slightly disturbed, my classification result is not going to vary that much. So I have to place my classifier somewhere in the middle of these two lines. This is a classifier which is more reliable, or, as it is called, more generalized, and with which my risk of misclassification will be less. And you find that the position of this classifier, of this hyperplane, depends on the positions of the feature vectors lying on the boundary. It does not depend on the positions of the other feature vectors: even if I remove one of those interior feature vectors from my training set, the position of the hyperplane remains the same, whereas if I remove one of the boundary feature vectors from my training set, the position of the hyperplane is going to be different. These boundary feature vectors are called support vectors; the other vectors are not support vectors. So the support vector machine is again a linear machine whose design is greatly influenced by the positions of the support vectors, and its position is not much influenced by the feature vectors which are not support vectors. That is why I said that yi (w · xi + b) will be equal to 1 when xi is a support vector; this equality holds only for the support vectors. Now, what I have to do is this: w · xi + b is a measure of the distance of the point xi from the plane w · x + b = 0, and I want this distance, this margin, to be as large as possible. From that expression, this can be done by minimizing the norm of w and, at the same time, by maximizing the bias b. So when I try to design this support vector machine, I will try to minimize w and simultaneously maximize b. Minimizing w is the same as minimizing a function phi(w), which is nothing but w transpose w, or w · w; these two are the same operation. For mathematical convenience, which will become clear later on, I put in a factor of half: so I want to minimize half of w · w. If I try to minimize this on its own, the trivial solution will obviously be w = 0, which is not the solution I am looking for. So for this minimization I have to look at the other constraint that I have: for the support vectors, yi (w · xi + b) has to be equal to 1. So I want to minimize phi(w) subject to the constraint that yi (w · xi + b) is equal to 1, or, more generally, that it is greater than or equal to 1; if xi is a support vector, it is equal to 1, and my support vector machine depends upon the support vectors.
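The constrained problem just set up, minimize half of w · w subject to yi (w · xi + b) ≥ 1 for every training sample, can be sketched numerically as below. Using scipy's general-purpose SLSQP solver on a toy data set is only an illustration of the formulation under these assumptions, not how SVMs are solved in practice.

```python
# Hard-margin primal problem on a toy 2-D data set:
#   minimise (1/2) w . w   subject to   y_i (w . x_i + b) >= 1 for all i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):               # params = [w1, w2, b]
    w = params[:-1]
    return 0.5 * np.dot(w, w)

# One inequality constraint per training sample: y_i (w . x_i + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:-1], xi) + p[-1]) - 1.0}
               for xi, yi in zip(X, y)]

result = minimize(objective, x0=np.zeros(X.shape[1] + 1),
                  method="SLSQP", constraints=constraints)
w, b = result.x[:-1], result.x[-1]
print(w, b)   # the maximum-margin separating hyperplane for this toy data
```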
So I will not take the inequality; rather, I will take the equality, yi (w · xi + b) = 1. That is the constraint, and under this constraint I have to minimize my weight vector w. Because this is a constrained optimization problem, it can be converted to an unconstrained optimization problem by using Lagrange multipliers. So I define a function L(w, b), because as I said I want to minimize w and maximize b, which is half of w · w minus the sum over i of alpha_i times [yi (w · xi + b) − 1], where alpha_i is the Lagrange multiplier, and I have to optimize this Lagrangian. From basic calculus, we know that this optimization has to be done by taking the derivative of the Lagrangian with respect to w and with respect to b and equating them to 0. So first I want to compute the partial derivative of L with respect to b and equate it to 0. To do this, let us expand the expression. In the expanded form, L(w, b) is half of w · w, minus the sum of alpha_i yi (w · xi), minus the sum of alpha_i yi b, plus the sum of alpha_i. Clearly, if I take the derivative with respect to b, the first term is independent of b, the second term is independent of b, and the last term is also independent of b; the only term that involves b is the third one. So the partial derivative of L with respect to b is simply minus the sum of alpha_i yi, and equating this to 0 gives me one constraint: the sum of alpha_i yi, for i varying from 1 to m, has to be equal to 0, where m is the number of feature vectors given for designing the classifier. This is one of the constraints that I have. Next, I take the derivative of the same Lagrangian, in the same expanded form, with respect to the weight vector w: L(w, b) equals half of w · w, minus the sum of alpha_i yi (w · xi), minus the sum of alpha_i yi b, plus the sum of alpha_i, where each summation runs from i = 1 to m, if m feature vectors are provided for designing the classifier. The derivative of the first term with respect to w is nothing but w, and the derivative of the second term with respect to w is the sum of alpha_i yi xi; the remaining terms are independent of w, so their derivatives are 0. Setting this derivative equal to 0 gives me w equal to the sum of alpha_i yi xi, with i running from 1 to m, where m is the number of training samples. Note that this is not a dot product: xi is a vector, while alpha_i and yi are scalars, so the whole term is a vector. So, through this optimization of the Lagrangian, I have obtained two equations.
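Stated compactly, the Lagrangian and the two stationarity conditions spoken above are the following (a restatement of the derivation, not new material):

```latex
L(w,b,\alpha) \;=\; \tfrac{1}{2}\, w \cdot w \;-\; \sum_{i=1}^{m} \alpha_i \bigl[\, y_i (w \cdot x_i + b) - 1 \,\bigr]

\frac{\partial L}{\partial b} \;=\; -\sum_{i=1}^{m} \alpha_i y_i \;=\; 0
  \quad\Longrightarrow\quad \sum_{i=1}^{m} \alpha_i y_i = 0

\frac{\partial L}{\partial w} \;=\; w - \sum_{i=1}^{m} \alpha_i y_i\, x_i \;=\; 0
  \quad\Longrightarrow\quad w = \sum_{i=1}^{m} \alpha_i y_i\, x_i
```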
One is that the sum of alpha_i yi, for i from 1 to m, is equal to 0, and the other is the expression for my weight vector, w equals the sum of alpha_i yi xi, where xi is the i-th training feature vector, for i varying from 1 to m. Now, if I substitute this expression for w back into my original Lagrangian, let us see what we get. In the expanded form, as we said, L is half of w · w, minus the sum of alpha_i yi (w · xi), minus the sum of alpha_i yi b, plus the sum of alpha_i. Here the sum of alpha_i yi is equal to 0, so the term containing b simply gets cancelled, and the expression I have is L equals half of w · w, minus the sum of alpha_i yi (w · xi), plus the sum of alpha_i. If I now put the value of w into this expression, the first term, which is the dot product of the vector w with itself, needs two different subscripts, one i and the other j, and becomes half of the double sum of alpha_i alpha_j yi yj (xi · xj). The second term, with w replaced by the sum of alpha_j yj xj, becomes the full double sum of alpha_i alpha_j yi yj (xi · xj). These two double sums are the same quantity, so half of it minus the whole of it leaves minus half, and what remains is L equals the sum of alpha_i, for i varying from 1 to m, minus half of the double sum of alpha_i alpha_j yi yj (xi · xj); the order xi · xj or xj · xi does not matter. So I have this Lagrangian expression, which is now a function of the alphas alone. For the design, what I have to do is maximize this Lagrangian over the values of alpha, which are my Lagrange multipliers, and Lagrange multipliers are always non-negative. So I have one constraint that alpha_i always has to be greater than or equal to 0, and the other constraint, from before, is that the sum of alpha_i yi, for i varying from 1 to m, has to be equal to 0. So I have to find the Lagrange multipliers which maximize this Lagrangian expression. When you try to maximize it, it is quite likely that some of the Lagrange multipliers will be equal to 0 and a few of them will be very high. If a Lagrange multiplier alpha_i is equal to 0, that indicates that the corresponding training feature vector xi is not a support vector. If a particular alpha_i is very high, that indicates that the corresponding feature vector xi has a high influence over the position of my decision surface, the hyperplane. There is also another possibility: if I get an alpha_i which is extraordinarily high, I can infer that the corresponding feature vector is a spurious point, a disturbed point, or an outlier. All these different interpretations I can have of alpha_i. So obviously, if alpha_i is equal to 0, the corresponding xi is not a support vector and it does not influence the position of the hyperplane. So what I have to do is work with these alpha_i's.
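As a hedged sketch, the quantities coming out of this dual optimization (the alpha_i, the support vectors, and b) can be inspected on a toy data set with scikit-learn's SVC. The lecture does not prescribe a particular solver; SVC implements the soft-margin formulation, so a large C is used here simply to approximate the hard margin discussed above.

```python
# Inspecting the dual solution of a linear SVM on toy data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)    # the x_i with alpha_i > 0 (the support vectors)
print(clf.dual_coef_)          # y_i * alpha_i for each support vector
print(clf.intercept_)          # the bias term b
# w recovered from the dual solution: w = sum_i alpha_i y_i x_i
print(clf.dual_coef_ @ clf.support_vectors_)
```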
The value of w that I compute is given by the expression w equals the sum of alpha_i yi xi, and this w goes into my decision-making process. As a result, my classification decision will be like this: for an unknown feature vector z, if I call the decision d(z), then d(z) is simply the sum over j, from 1 to m, of alpha_j yj (xj · z), plus b, because my w is nothing but the sum of alpha_j yj xj and what I have to compute is w · z; the subscript i is simply replaced by the subscript j, so it does not matter. Only the sign of this quantity is important for classification; I do not need the value of this function, I simply need its sign. If the sign is positive, then z is classified to class c1; if the sign is negative, then z is classified to class c2. Now, before I write down the steps of the support vector machine design, there is one more point I should mention. As we did in the case of the radial basis function network, if the original set of feature vectors is not linearly separable in the lower dimensional space, then I cannot have this support vector machine, because it assumes that the samples are linearly separable. So if they are not linearly separable in the lower dimensional space, I have to cast the feature vectors into a higher dimensional space by using functions like radial basis functions, which are called kernel functions. Once I cast them into the higher dimensional space, then by taking the samples in the higher dimensional space I can design the support vector machine. Then, for classification, before classifying the feature vector z, I have to cast it into the same higher dimensional space; after casting it into the higher dimensional space, I compute this term in the higher dimensional space and look at its sign. If the sign is positive, z belongs to class c1; if it is negative, z belongs to class c2. So I need two terms: one is w, which has been obtained by optimization as this particular expression, and the other is the value of b. The value of b can be computed as minus half of the following quantity: the minimum, over all i for which yi is equal to +1, of the sum over j of alpha_j yj (xj · xi), plus the maximum, over all i for which yi is equal to -1, of the same sum. That is how I get the value of b, and this w and this b go into the expression above for the classification of an unknown feature vector z. So this is how our support vector machine works. As I said before, the support vector machine is again nothing but a linear machine; the only thing is that it tells us where the separating boundary between the two classes, the hyperplane, should be placed, so that the classifier I get is more robust, or more generalized.
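The classification rule just stated can be sketched as below; the alpha, y, support-vector and b values are illustrative assumptions standing in for the quantities obtained during training.

```python
# Decision rule: for an unknown z, compute sum_j alpha_j y_j (x_j . z) + b over
# the (support) vectors and use only its sign.
import numpy as np

def svm_decision(z, alphas, labels, X, b):
    score = np.sum(alphas * labels * (X @ z)) + b
    return "c1" if score > 0 else "c2"

X = np.array([[2.0, 2.0], [-1.0, -1.0]])   # support vectors (hypothetical)
labels = np.array([1.0, -1.0])             # their class labels y_j
alphas = np.array([0.22, 0.22])            # their Lagrange multipliers (hypothetical)
b = -0.33                                  # bias obtained during training (hypothetical)

print(svm_decision(np.array([4.0, 1.0]), alphas, labels, X, b))   # positive score -> c1
print(svm_decision(np.array([-3.0, 0.0]), alphas, labels, X, b))  # negative score -> c2
```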
Now, instead of a two-class problem, if I have a multiclass problem, then I have to have multiple support vector machines: one support vector machine which tells me whether a sample belongs to class c1 or does not belong to class c1, then another which tells me whether a sample belongs to class c2 or does not belong to class c2, and so on. So I have multiple support vector machines for a multiclass problem, and it can be shown that if I have n classes, then what I need is n - 1 support vector machines, that is, n - 1 such linear classifiers. So, with this, I stop here today. Thank you very much.