Mod-01 Lec-20 Perceptron Criterion

Good morning. We are discussing the pattern recognition problem, and in particular the different discriminant functions that are helpful for classifying unknown patterns. Earlier we saw that, starting from the Bayes decision theory, if we know the probability density function, or at least the parametric form of the probability density function, we can design different discriminant functions for classification of pattern vectors or feature vectors. We discussed the specific case in which the probability density function is Gaussian, or normal, and we know the mean and covariance matrices of the different classes; then we can have different types of discriminant functions, or different types of decision boundaries, among the different classes. In particular, if the covariance matrices of the samples belonging to all the classes are the same, the discriminant function that we get is a linear discriminant function. Whereas, if the samples belonging to different classes have different, arbitrary covariance matrices, then in general the discriminant function that we get is a quadratic discriminant function, and in higher dimensions the resulting surfaces are called hyperquadrics. Accordingly, the decision surface between two classes is in the first case a hyperplane, and when the covariance matrices are arbitrary it becomes a hyperquadric surface.

However, to obtain the discriminant function by following the Bayes decision theory, we have to know the parametric form of the probability density function of the samples belonging to each class. Even when we know the parametric form, we may not know the parameter values exactly; in that case, as we saw in previous classes, we can compute the maximum likelihood estimates of those parameter values. The maximum likelihood estimates are the parameter values best supported by the samples available from that particular class. We also discussed the dimensionality problem, because not every feature has equal discriminating power. Following principal component analysis or multiple discriminant analysis, we can reduce the dimensionality of the feature vectors while ensuring that the features retained in the reduced-dimensional feature vector have better discriminating power than the features which are discarded.

Today we will discuss the same linear discriminant function, but unlike the case of the Bayes decision theory, where we had to know the parametric form of the probability density function, here the parametric form is not known. So what we have to do is design the discriminant function, that is, find the decision boundary between two classes, by using the samples which are available from the different classes. We do not assume any probability density function. Initially, we will discuss linearly separable classes.
So, we know that the classes are linearly separable, and because we are discussing a supervised learning process, we know a set of samples belonging to one class omega 1 and a set of samples belonging to another class omega 2. Using this information, and without any assumption about the nature of the probability density function, we have to find the linear discriminant function which discriminates between these two classes. We will assume that the classes are linearly separable, and since they are linearly separable we should be able to design a linear discriminant function in which no form of the probability density function is assumed.

Because the discriminant function is linear, it can always be written as g(X) = W^T X + W_0, where X is the feature vector. We assume that X is of dimension d, so it is a d-dimensional feature vector; W is called the weight vector and W_0 is called the bias or threshold weight. This expression is linear: X is a d-dimensional vector, W is also a d-dimensional vector, and the term W^T X is the inner product of the vector W with the vector X. Our decision rule for an unknown feature vector X is therefore: if g(X) is greater than 0, we classify X to class omega 1; if it is less than 0, X is classified to class omega 2; and if it is equal to 0, X lies on the decision boundary. So the decision boundary between the two classes omega 1 and omega 2 is given by the equation g(X) = 0, that is, W^T X + W_0 = 0.

Now let us see what the nature of this weight vector W is. To find out, take two points X_1 and X_2 on the decision boundary. Because X_1 and X_2 are on the decision boundary, W^T X_1 + W_0 and W^T X_2 + W_0 are both equal to 0, so W^T X_1 + W_0 = W^T X_2 + W_0, and therefore W^T (X_1 - X_2) = 0. Now, X_1 - X_2 is a vector lying on the decision surface, because X_1 and X_2 both lie on the decision surface. W^T (X_1 - X_2) is the inner product of W with the vector X_1 - X_2, and because this inner product is 0, the vector W is orthogonal to X_1 - X_2. Since X_1 and X_2 were chosen arbitrarily, W is orthogonal to any vector lying on the decision surface, which means W is orthogonal to the decision surface itself. This surface, being a planar or linear surface in d-dimensional space, is called a hyperplane.
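To make the decision rule concrete, here is a minimal sketch in Python with NumPy of the linear discriminant and the sign-based decision just described. The function names (g, classify) and the numerical values are illustrative choices, not something taken from the lecture.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0, boundary otherwise."""
    value = g(x, w, w0)
    if value > 0:
        return "omega_1"
    if value < 0:
        return "omega_2"
    return "on the decision boundary"

# Arbitrary illustrative 2-D weight vector, bias, and feature vector
w = np.array([1.0, -2.0])
w0 = 0.5
x = np.array([3.0, 1.0])
print(g(x, w, w0), classify(x, w, w0))
```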
Let us represent that hyperplane by H; so the vector W is orthogonal to the hyperplane H. Now, what does g(X) actually represent? You find that g(X) gives you a measure of the algebraic distance of the vector X from the decision surface, that is, from the hyperplane H. It is not the exact distance value, but it is proportional to the algebraic distance of X from the hyperplane. Let us see how.

Consider a case in three dimensions. Say I have a three-dimensional space with coordinate axes X_1, X_2 and X_3, and suppose I have a decision surface, a plane H, in this space. Let me take a point X somewhere off this plane, drop a perpendicular from X onto the hyperplane, and let X_p be the foot of the perpendicular on the hyperplane. Suppose the distance between X_p and X, the length of the vector joining them, is r. In that case I can write X = X_p + r W / |W|, because we have seen that the vector W is orthogonal to the hyperplane, so the direction of W is the same as the direction from X_p to X, and W / |W| is the unit vector in the direction perpendicular to the hyperplane.

If I write X in this form, then what is g(X)? g(X) = W^T X + W_0, which in this case becomes W^T X_p + W_0 + r W^T W / |W|; I am simply replacing X by X_p + r W / |W|. Now, the point X_p, the foot of the perpendicular from X onto the hyperplane, lies on the decision surface, so W^T X_p + W_0 = 0. What remains is r W^T W / |W|, and W^T W is nothing but |W| squared, so g(X) is simply r |W|. So you find that the value of the discriminant function g(X) is nothing but r, the perpendicular or orthogonal distance of the point X from the decision surface, scaled by the modulus of W. If I want the actual orthogonal distance, it is given by r = g(X) / |W|. That is why we say that the discriminant function g(X) gives an algebraic measure of the orthogonal distance of the vector X from the decision surface.
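As a small numerical illustration of r = g(X) / |W| and of the foot of the perpendicular X_p, here is a sketch; the function names and the 3-D values are assumptions made for the example, with NumPy as the only dependency.

```python
import numpy as np

def signed_distance(x, w, w0):
    """r = g(x) / ||w||: positive on the positive side of H, negative on the other."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

def foot_of_perpendicular(x, w, w0):
    """x_p = x - r * w/||w||, the orthogonal projection of x onto the hyperplane."""
    r = signed_distance(x, w, w0)
    return x - r * w / np.linalg.norm(w)

w = np.array([1.0, -2.0, 2.0])   # illustrative 3-D weight vector
w0 = -1.0
x = np.array([2.0, 0.5, 1.0])

x_p = foot_of_perpendicular(x, w, w0)
print(signed_distance(x, w, w0))   # r
print(np.dot(w, x_p) + w0)         # approximately 0: x_p lies on the hyperplane
```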
Now, why are we calling it an algebraic distance? [A student asks a question here.] Come back to our original expression: g(X) = W^T X + W_0 is your discriminant function. What we said is that if g(X) is greater than 0 then X belongs to omega 1, if it is less than 0 then X belongs to omega 2, and if it is equal to 0 then X is on the decision boundary; the boundary is g(X) = 0, not W^T X = 0, and g(X) is W^T X + W_0. Because X_p is lying on the decision surface, g(X_p) = 0, and g(X_p) is nothing but W^T X_p + W_0.

The reason we call it an algebraic distance is that if r is positive then X lies on the positive side of the hyperplane, and if r is negative then X lies on the negative side of the hyperplane. The expression r = g(X) / |W| is nothing new; this is what you have already done in your school-level mathematics. In coordinate geometry, if I have the equation of a straight line a x + b y + c = 0 and a point (x_1, y_1), the distance of the point (x_1, y_1) from this straight line is (a x_1 + b y_1 + c) / sqrt(a^2 + b^2). If you compare that expression with this one, they are exactly identical; that one is in two dimensions while this one is in d dimensions, where the dimensionality has increased. So what we have derived here is nothing new.

And what is the distance of the origin from the hyperplane? The distance of the origin from the hyperplane is simply W_0 / |W|. So if W_0 is positive, the origin lies on the positive side of the hyperplane; if W_0 is negative, the origin lies on the negative side of the hyperplane; and if W_0 is equal to 0, the hyperplane passes through the origin. Not only that, when W_0 = 0 the discriminant function takes a particular form, g(X) = W^T X, with no bias or threshold weight. This is what is called the homogeneous form, and in mathematics it is always convenient if we can represent an expression in a homogeneous form, because that avoids many of the problems we may face while analysing a given problem.

So we have discussed the nature of the decision surface: it is a hyperplane, and the hyperplane has two sides. The hyperplane divides your d-dimensional feature space into two half spaces; one half space is positive and the other is negative. If a feature vector falls in the positive half space, it is classified into one class; if it falls in the negative half space, it is classified into the other class. If it lies on the decision surface, we cannot really classify it, because it is equally likely to belong to class omega 1 or to class omega 2.
So, if a sample falls on the decision surface, on the hyperplane, we cannot really take any decision; it is an ambiguous case. Now, after discussing this, let us see how we can design the vector W. Ultimately, because it is a linear classifier, I have to find the discriminant function g(X), and if I can determine the weight vector W and the bias or threshold weight W_0, then my linear classifier is designed. So I have to decide W and W_0 based on the samples that are available, because this is supervised learning and we are given labelled samples from both classes omega 1 and omega 2.

So let us talk about the design of the weight vector. Again, for simplicity, we will initially assume the two-category linearly separable case; later on we will generalise this to multi-category problems where we have c classes. To start with, we have the two classes omega 1 and omega 2, and we know that the discriminant function has the form g(X) = W^T X + W_0. As we said, this expression is not in homogeneous form; however, if I can convert it into homogeneous form, my analysis and my design will be easier.

How can we convert this into homogeneous form? I can do it very easily if I write the expression in the form a^T y, where a is a modified weight vector which includes all the components of the weight vector W and also the bias or threshold weight W_0. If the weight vector a has to include all the components of W as well as W_0, then I have to do some augmentation of the feature vector X. So I take y to be the augmented feature vector: y contains all the components of x, that is, the first component x_1, the second component x_2, and so on up to x_d, and I make the last component equal to 1. If for every x I append a 1 to give an augmented feature vector y, then I can write the expression in homogeneous form as a^T y. Simply see what a^T y is: it is the row vector (w_1, w_2, ..., w_d, w_0), because I take the transpose, multiplied by the column vector (x_1, x_2, ..., x_d, 1). If you carry out this multiplication, it is nothing but the sum of w_i x_i for i from 1 to d, which takes care of x_1 w_1 + x_2 w_2 + ... + x_d w_d, plus w_0 times 1, that is, plus w_0. So I get exactly the same expression as W^T x + W_0. The only thing I have to do is append a 1 to the feature vector x, giving the augmented feature vector y. After having this homogeneous expression, my decision rule remains the same: if a^T y_i, where y_i is the augmented version of the i-th feature vector x_i, is greater than 0, I decide that the corresponding feature vector belongs to class omega 1, and if it is less than 0 I decide that it belongs to class omega 2.
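Here is a minimal sketch of the augmentation step and of the equality W^T x + W_0 = a^T y; the helper name augment and the numerical values are assumptions made for the example, with NumPy assumed.

```python
import numpy as np

def augment(x):
    """Append 1 to x so that g(x) = w^T x + w0 becomes the homogeneous form a^T y."""
    return np.append(x, 1.0)

# Check the equivalence on arbitrary illustrative values
w = np.array([0.5, -1.0, 2.0])
w0 = 0.3
x = np.array([1.0, 2.0, -1.0])

a = np.append(w, w0)   # a = (w_1, ..., w_d, w_0)
y = augment(x)         # y = (x_1, ..., x_d, 1)
print(np.dot(w, x) + w0, np.dot(a, y))   # identical values
```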
Do not confuse the two uses of the subscript here: when I write the components x_1, x_2, ..., x_d, the subscript means a component of the feature vector, whereas y_i means the i-th feature vector of the training set. So my decision rule remains the same; the only thing is that I have a modified formulation of the same problem.

Now let us see how we can really design this weight vector W, or in this case the weight vector a, because a consists of all the components of W and also the bias or threshold weight w_0. Our aim is to design the weight vector a by using the knowledge of the labelled samples from class omega 1 as well as class omega 2. We assume that we have n samples, and when I say n samples I mean n augmented samples, where a 1 has been appended to each feature vector. These n samples y_1, y_2, ..., y_n are called training samples, because they are the samples used to train the classifier. Each y_i is an augmented feature vector, obtained by appending a 1 to the original vector x. Some of these feature vectors are labelled as class omega 1, and some are labelled as class omega 2.

For a particular augmented feature vector y_i, as I said earlier, the decision rule is: a^T y_i greater than 0 indicates y_i is in class omega 1, less than 0 indicates y_i is in class omega 2, and if it is equal to 0 I cannot take any decision. So what I have to do is find the value of this weight vector a. I have some samples labelled as class omega 1 and some samples labelled as class omega 2. Given a weight vector a, if I take all the samples labelled omega 1 and find that a^T y_i is greater than 0 for each of them, that means the given weight vector a correctly classifies all the feature vectors labelled as class omega 1. If I also find that, for the same weight vector a, a^T y_i is less than 0 for all the samples labelled as class omega 2, then the weight vector a also correctly classifies the samples labelled as class omega 2. That weight vector is the correct one, because it correctly classifies all the samples labelled omega 1 and all the samples labelled omega 2. That is the weight vector we want; but how do we obtain such a weight vector?

To obtain such a weight vector, another modification can be made. You see that I currently have two decision criteria: in one case I check that a^T y_i is greater than 0, and in the other case I check that a^T y_i is less than 0. Instead of having these two different criteria, can we not have a single criterion, so that if one condition is true I say the sample is correctly classified, and if it is not true I say the sample is misclassified?
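One way to picture this condition of a correct weight vector is the small check below; a minimal sketch, assuming the augmented samples of each class are stored as rows of NumPy arrays, and with the function name being my own.

```python
import numpy as np

def correctly_classifies_all(a, Y1, Y2):
    """Check a candidate weight vector against labelled, augmented samples.

    Y1: rows are augmented samples labelled omega_1 (each needs a^T y > 0).
    Y2: rows are augmented samples labelled omega_2 (each needs a^T y < 0).
    """
    return bool(np.all(Y1 @ a > 0) and np.all(Y2 @ a < 0))

# Illustrative data: two augmented samples per class
Y1 = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0]])
Y2 = np.array([[-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]])
print(correctly_classifies_all(np.array([1.0, 1.0, 0.0]), Y1, Y2))   # True for this a
```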
So can I make something like this: if a^T y_i is greater than 0, I will say the sample is correctly classified, irrespective of whether the feature vector y_i is labelled as belonging to class omega 1 or to class omega 2. I do not look at the label of the feature vector y; whatever its label, if I can somehow convert the decision into this form, then whenever a^T y_i is greater than 0 I say that y_i is correctly classified, and otherwise I say that it is misclassified. Here otherwise includes less than 0 as well as equal to 0, because equal to 0 does not give me any information; I cannot decide the class of the sample. So whether it is less than 0 or equal to 0, I will say it is misclassified; if it is greater than 0, I will say it is correctly classified.

How can we do that? From before, if a^T y_i is greater than 0 then y_i belongs to class omega 1, and if a^T y_i is less than 0 then y_i belongs to class omega 2. So why not do one thing: take all the samples which are labelled as omega 2 and negate them; instead of considering y_i, consider minus y_i. I know all the training samples which are given from class omega 2, so I simply negate all of them. So when I try to find this weight vector a, the samples belonging to class omega 1 I take as they are, meaning after augmentation, after appending the 1; and the samples belonging to class omega 2 I augment by appending the 1 and then take the negative. If I take the negative, then a^T y_i, which was supposed to be less than 0, will now be greater than 0. So after negating all the samples belonging to class omega 2, I get a uniform decision criterion: in all cases I should have a^T y_i greater than 0, and whenever a^T y_i is greater than 0 the sample is correctly classified, irrespective of whether it was taken from class omega 1 or from class omega 2. So I take all the samples from class omega 1 as they are, and for the samples from class omega 2 I take the negative, and then use these to design, or to find, the weight vector a.

Then again, it is better if we can define some sort of criterion function. So we take a criterion function, say J(a), and J(a) has to be minimised; that means for the weight vector a which classifies all the training samples correctly, J(a) will be minimum. We will see what kind of criterion function we can have. Then my job will be, by varying a in an iterative way, to minimise this J(a), and J(a) has to take its minimum value when I reach a weight vector which properly classifies all the training samples. For minimisation of this criterion function J(a), we can make use of an approach which is called the gradient descent approach, or gradient descent procedure. What is the gradient descent procedure? Suppose at the k-th iteration I know a(k), the value of the weight vector a in the k-th iteration, and I have to update this weight vector.
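Before turning to the update rule itself, here is a minimal sketch of the sign-normalisation step just described, in which the class omega 2 samples are negated after augmentation; the function name is my own, NumPy assumed.

```python
import numpy as np

def sign_normalise(Y1, Y2):
    """Build one training matrix with the single decision rule a^T y > 0.

    Y1: augmented samples labelled omega_1, kept as they are.
    Y2: augmented samples labelled omega_2, negated row by row,
        so a correct weight vector a gives a^T y > 0 for every row.
    """
    return np.vstack([Y1, -Y2])

Y1 = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0]])
Y2 = np.array([[-1.0, -1.0, 1.0], [-2.0, 0.0, 1.0]])
Y = sign_normalise(Y1, Y2)
print(Y @ np.array([1.0, 1.0, 0.0]))   # all positive for a weight vector that separates the classes
```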
So from the k-th iteration I have to go to the (k+1)-th iteration. The value of a in the (k+1)-th iteration, that is a(k+1), will be equal to a(k) minus some eta(k) times the gradient of J at a(k); that is, a(k+1) = a(k) - eta(k) grad J(a(k)). This is the gradient descent procedure, and eta(k) specifies the learning rate.

What does this procedure mean? Suppose I have a criterion function J(a) plotted against a, with a minimum at some point. At the k-th iteration, suppose the current value is a(k) and the value of the criterion function is J(a(k)). The gradient points in the direction in which J increases, so I move a(k) in the direction negative to the gradient, that is, towards the minimum. In the (k+1)-th iteration, depending on the value of the learning rate eta(k), I may land closer to the minimum, still far from it, or even on the other side of it. If eta(k) is very large, then when I modify a(k), the new a(k+1) may overshoot the minimum; if eta(k) is very small, I move only slightly, which means the learning is very slow. If eta(k) is very large I may overshoot the solution, and this may lead to oscillation: I keep moving from one side of the minimum to the other, oscillating around the solution vector. So this is what the gradient procedure does: you start with an initial, arbitrary value of the weight vector, you find the value of the criterion function for that weight vector, and then you try to minimise the criterion function following the gradient descent procedure. This is also called the procedure of steepest descent, because you follow the path of the negative of the maximum gradient.

Now, one such criterion function, as I said, is the Perceptron criterion function. What is this Perceptron criterion? Let us assume that at the k-th iteration we have somehow obtained the weight vector a(k); I am not saying how we obtained it, but because it is the k-th iteration I call it a(k). With this a(k) I check whether it correctly classifies all the training samples or not. When I consider the training samples, as I said, the training samples from class omega 2 are negated after augmentation, and the training samples belonging to class omega 1 are simply augmented, not negated, so that if all the samples are correctly classified I must have a(k)^T y greater than 0 for every sample. If there is any sample for which a(k)^T y is not greater than 0, whether it is less than 0 or equal to 0, I say that particular training sample is not correctly classified.
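As an illustration of the generic descent loop a(k+1) = a(k) - eta(k) grad J(a(k)), here is a toy sketch with a fixed learning rate and a simple quadratic criterion; both of these choices, and the function names, are assumptions made only to show the mechanics.

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, n_iters=100):
    """Steepest-descent loop: a(k+1) = a(k) - eta * grad J(a(k)).

    A fixed learning rate eta is used for simplicity; in the lecture,
    eta(k) may vary with the iteration number k.
    """
    a = np.asarray(a0, dtype=float)
    for _ in range(n_iters):
        a = a - eta * grad_J(a)
    return a

# Toy criterion J(a) = ||a||^2, whose gradient is 2a; the minimum is at a = 0
print(gradient_descent(lambda a: 2 * a, a0=[4.0, -3.0]))   # close to [0, 0]
```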
So our aim is to find a weight vector a which classifies all the training samples correctly; I do not want even a single training sample that is misclassified. So I can design a criterion function which makes use of the training samples which are not correctly classified, because if a training sample is correctly classified I do not bother about it, my job is done for that sample, and if a sample is not correctly classified then I have to modify my weight vector. So the criterion function I design makes use of the training samples which are misclassified by that particular weight vector a(k).

Accordingly, I define the criterion function, which I call J_p(a) because it is the Perceptron criterion, as the sum of minus a^T y over all y which are misclassified. You find that if y is misclassified then a^T y is negative or zero, so minus a^T y is never negative. As a result, this Perceptron criterion function J_p(a) can never take a negative value; it is always non-negative, and its minimum value is 0. So I have a global minimum for this Perceptron criterion function, and I can find this minimum by the gradient descent procedure.

If I want to apply the gradient descent procedure, then I have to take the gradient of J_p(a) with respect to the weight vector a. The gradient of a^T y with respect to a is simply y, so grad J_p(a) is simply the sum of minus y over all misclassified y. My update rule, the algorithm to find this weight vector a, is therefore: I take an initial weight vector a(0) arbitrarily, and then a(k+1), the weight vector in the (k+1)-th iteration, is simply a(k) plus eta(k), the learning rate, times the summation of y over all y misclassified by a(k). This is the algorithm for the design of my weight vector. Once I have chosen a(0) arbitrarily, from a(0) I can find a(1), which is a(0) plus eta times the sum of all the vectors y which are misclassified by a(0); from a(1) I can find a(2), which is a(1) plus eta times the sum of all the samples misclassified by a(1); and so on. By doing this iteratively, I can finally find a weight vector which classifies all the training samples properly. The number of iterations may be large, depending on the initial choice, the number of feature vectors and their distribution, but if the feature vectors are linearly separable we will get a weight vector which properly classifies all the training samples. So we will stop here today; next day we will continue with this lecture.
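Putting the pieces together, here is a minimal sketch of the batch Perceptron update described in this lecture, assuming the training samples have already been augmented and the class omega 2 samples negated. The function name, the zero starting vector, the fixed learning rate and the toy data are my own illustrative choices.

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iters=1000):
    """Batch Perceptron on sign-normalised, augmented samples Y (one per row).

    Minimises J_p(a) = sum over misclassified y of (-a^T y) via the update
    a(k+1) = a(k) + eta * sum of the currently misclassified y.
    """
    a = np.zeros(Y.shape[1])           # a(0); an arbitrary start also works
    for _ in range(max_iters):
        misclassified = Y[Y @ a <= 0]  # a^T y <= 0 counts as not correctly classified
        if len(misclassified) == 0:
            break                      # every training sample satisfies a^T y > 0
        a = a + eta * misclassified.sum(axis=0)
    return a

# Tiny linearly separable example (already augmented; omega_2 samples already negated)
Y = np.array([[ 2.0,  1.0,  1.0],     # omega_1 sample, x = ( 2,  1)
              [ 1.0,  2.0,  1.0],     # omega_1 sample, x = ( 1,  2)
              [ 1.0,  1.0, -1.0],     # omega_2 sample, x = (-1, -1), negated
              [ 2.0,  0.0, -1.0]])    # omega_2 sample, x = (-2,  0), negated
a = batch_perceptron(Y)
print(a, Y @ a)                        # all entries of Y @ a should be positive
```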