Ali Ghodsi, Lec 2: Machine Learning. Classification, Linear and Quadratic Discriminant Analysis

Captions
Let's start the first lecture on classification. I gave you a sort of overview of the course we are going to cover this semester, but let me start with the definition of classification more formally.

In classification, as I told you, we are going to predict a discrete random variable Y from another random variable X. We have a set of data of this form: n data points. The x's are usually vectors in d-dimensional space; any object that you have, you can express it as a vector, whether it's an image, text, whatever. The y's are labels, and the labels come from a finite set. Classification means learning a function h such that the domain of this function is X and the range is Y, so we are trying to map elements of X, vectors, to labels. I give you this vector, you have to tell me it's a table; I give you another vector, you have to know it's a chair, not a table; this image is Ali, this image is John. So the y's are labels and the x's are vectors: you just represent your objects as vectors.

Another definition, a term that we are going to use a lot in this course, is the error rate. Suppose that we have learned a function h, our classifier, and we have a set of labels Y. What is the chance that what h predicts is not equal to the true label? The chance that my function predicts a table when it's a chair, the probability P(h(X) ≠ Y), is called the true error rate. Pay attention that X here is a random variable and Y is a random variable, so we are talking about the probability of this random variable not being predicted correctly.

We usually can't compute this probability, because what we have is only a sample of our random variables: you have a random variable X but only n samples of it, and a random variable Y but only n samples of it. So we can't compute this probability, but we can estimate it, and we estimate it as (1/n) times the sum over i of I(h(x_i) ≠ y_i), where I is just the indicator function: if h(x_i) is not equal to y_i, then the indicator of that bracket is 1, otherwise it's 0. So this summation simply means that when you have n data points and n labels, you compute h, count the number of points for which h(x_i) is not equal to the true y_i, and divide by the total number. If among 100 points ten of them are not correct, then 10 over 100 is my estimation of the true error rate. This is also called the training error.
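As a small illustration of this estimate (not from the lecture, just a sketch with made-up data and a hypothetical classifier), the empirical error rate is simply the fraction of mismatches:

    import numpy as np

    def training_error(h, X, y):
        """Empirical error rate: fraction of points where the classifier's
        prediction disagrees with the true label (the indicator-function sum)."""
        predictions = np.array([h(x) for x in X])
        return np.mean(predictions != y)

    # Toy example: a made-up 1-D classifier that thresholds at 0.
    X = np.array([[-2.0], [-1.0], [0.5], [1.5], [3.0]])
    y = np.array([0, 0, 1, 0, 1])          # true labels
    h = lambda x: int(x[0] > 0)            # hypothetical classifier
    print(training_error(h, X, y))         # 1 mistake out of 5 -> 0.2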
Now we come to a very important concept in this course, which is called the Bayes classifier. But before I tell you about the Bayes classifier, let me ask you a question. Suppose that I have labels and data points X, and I would like to use conditional probability as a way to do classification. Suppose that I have only two classes: you give me an image, and this image is either a picture of Ali or a picture of John. I can compute the probability that this image is Ali, P(Y = Ali | X): you give me this image, what's the chance that it's Ali? Or I can compute P(X | Y = Ali): given that this is Ali, what is the chance that I observe this image? Both of these are meaningful. Instead of two people, say the classes are Ali and elephant: given that this is Ali, the chance of observing an elephant-like image is very small; and in the other direction, given that I have observed this image, what is the chance that it is Ali?

Which one of these two scenarios do you go for if you want to do classification, the first one or the second one? Someone said the second; why? We can reach the first from the second using Bayes' formula, and vice versa, so computationally we can go back and forth between them. But conceptually, is there any advantage? Can we just choose the one that we like, or is one of them better than the other? One of them might be easier to compute, but besides being easy or hard, before we get to that stage we have to make sure which one is the right quantity to decide on. Eventually, which one are you going to base your decision on? You give me an image; I compute the chance of it being an elephant given this picture and the chance of it not being an elephant given this picture, I compute both and I go for the one which is larger. In either formulation I have to consider each possible label and compute a probability for it. We can have debates about these two in many different ways; I can even argue that Y is not really a random variable at all, because it's a label, I'm either Ali or I'm not, there is no randomness there, while the randomness is in the images. But these philosophical arguments aside, you have to go for the first one, P(Y | X), not the second.

There is a very famous mistake based on the second one, called the prosecutor's fallacy. It has happened in courts in the UK and in the US many times, and there are papers about it: courts decided based on the wrong probability. You have a person who is accused of a crime and you have some evidence. P(evidence | innocent) tells you the chance of observing this evidence given that the person is innocent; what you actually want is P(innocent | evidence), the chance that the person is innocent given that you have observed this evidence. According to Bayes' rule, P(innocent | evidence) = P(evidence | innocent) P(innocent) / P(evidence).

What happened in many cases, including a very famous court case in the 90s, was this: a lady was accused of killing her own children; one of them died at 16 weeks and the other died at 8 weeks, and an expert witness testified that the chance of two children in the same household dying like this is, I think, about one over 75 million. It's very, very small, and the court decided based on that. But you can see that even if the chance of observing the evidence given that the person is innocent is very small, the prior, the chance that a person is innocent, can be quite large: if you look at the whole population, only a very small proportion are criminals, most people are innocent. And the marginal probability in the denominator is also small: regardless of whether a person is guilty or not, the chance that two children die like this is small. So with a small denominator and a large prior, even if P(evidence | innocent) is small, the posterior P(innocent | evidence) can still be quite large. The chance that the person is innocent, even given a very rare piece of evidence, could be high. But many courts went for the wrong decision this way in the past.
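To make the fallacy concrete, here is a small numeric sketch; the numbers are invented purely for illustration and are not the figures from the actual case. Even with a one-in-a-million P(evidence | innocent), the posterior probability of innocence can stay high when innocence is overwhelmingly more common than guilt.

    # Hypothetical numbers, chosen only to illustrate the prosecutor's fallacy.
    p_evidence_given_innocent = 1e-6     # rare evidence under innocence
    p_evidence_given_guilty = 0.5        # evidence is common if actually guilty
    p_innocent = 0.999999                # almost everyone in the population is innocent
    p_guilty = 1 - p_innocent

    # Marginal probability of the evidence (law of total probability).
    p_evidence = (p_evidence_given_innocent * p_innocent
                  + p_evidence_given_guilty * p_guilty)

    # Bayes' rule: the quantity the court should actually look at.
    p_innocent_given_evidence = p_evidence_given_innocent * p_innocent / p_evidence
    print(p_innocent_given_evidence)     # roughly 0.67, not one in a million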
In your first assignment you are actually going to prove that this is the best thing you can do theoretically; you will see the mathematical argument. For the moment, accept and believe that this is what you have to do: if you want to do classification and you have, say, two classes, the posterior P(Y | X) is the probability you have to compute, and the resulting rule is called the Bayes classifier. The Bayes classifier means that with two classes I compute P(Y = 1 | X) and P(Y = 0 | X) and I go for the one which is larger: if P(Y = 1 | X) is more than one half I decide label 1, and if P(Y = 0 | X) is more than one half I decide label 0.

Again, a little bit of terminology. In Bayes' rule, P(Y | X) is called the posterior, P(X | Y) is called the class conditional, P(Y) is called the prior, and P(X) is called the marginal. Note that the marginal is in fact a sum over the classes; because we have only two classes, P(X = x) = P(X = x | Y = 1) P(Y = 1) + P(X = x | Y = 0) P(Y = 0). So it's not an independent quantity; it depends on the class conditionals and the priors. It's called the class conditional because it conditions on one class: given that I'm in the class with label 1, what is the distribution of the data? The marginal is the distribution of the data if I don't condition on any class, and the prior is my belief about being in class 1 or in class 0. Usually we use this notation: f_k(x) for the class conditional of class k, π_k for the prior of class k, and the marginal is just the sum of class conditional times prior over all possible values of y.
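As a minimal sketch of how these four quantities fit together (assuming, purely for illustration, that we somehow knew the class conditionals, which is exactly the hard part discussed below), the two-class Bayes classifier looks like this:

    import numpy as np
    from scipy.stats import norm

    # Assume (for illustration only) 1-D Gaussian class conditionals f0, f1
    # and known priors pi0, pi1; in practice these densities are unknown.
    f0 = norm(loc=0.0, scale=1.0).pdf      # class conditional for class 0
    f1 = norm(loc=2.0, scale=1.0).pdf      # class conditional for class 1
    pi0, pi1 = 0.5, 0.5                    # priors

    def posterior_class1(x):
        marginal = f1(x) * pi1 + f0(x) * pi0          # P(X = x)
        return f1(x) * pi1 / marginal                 # posterior P(Y = 1 | X = x)

    def bayes_classifier(x):
        # Label 1 whenever the posterior for class 1 exceeds one half.
        return int(posterior_class1(x) > 0.5)

    print(bayes_classifier(0.3), bayes_classifier(1.7))   # 0, then 1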
Okay, any questions? So this is the definition of the Bayes classifier. Suppose that I have two classes, as I said. You compute the conditional probability P(Y = 1 | X), and you can write this conditional probability using Bayes' rule. Then the Bayes classifier can be defined this way: if this probability is more than one half I label the point as class 1, and if it's less than one half I label it as class 0. Simply, I compute the conditional probability of Y given X for the two cases, Y = 1 and Y = 0, go for the one which is larger, and label the data that way.

Another definition: the decision boundary. The decision boundary is the set of points such that the probability of being in class 1 is the same as the probability of being in class 0. In the Bayes classifier you would like to assign label 1 to all points that are more likely to be in class 1 and label 0 to all points that are more likely to be in the other class, but there is a set of points for which the probability of being in class 1 is exactly the same as the probability of being in the other class. This set is important geometrically because it's the boundary at which I make my decision: any point on one side has larger probability for class 1, any point on the other side has larger probability for class 0, and anything on the boundary has the same probability for both classes. This is called the decision boundary, and given the decision boundary, the Bayes classifier assigns label 1 if you are on one side of it and label 0 otherwise.

There is a theorem, which you will prove in your first assignment, that the Bayes classifier is optimal. What does optimal mean? It means that if you compute the true error rate of the Bayes classifier h* and the true error rate of any other classifier h, then the true error rate of the Bayes classifier is at most that of the other classifier. That's the best thing you can do for classification.

So maybe that's my question: why do we need any other method? It seems we are done; the classification part of the course is finished. We learned about the Bayes classifier, we can prove that it is the best classifier, and you can't beat it, not because we don't know a better method at the moment, but because it's a mathematical bound: you can't beat this bound, not now and not ever. So why don't we just use the Bayes classifier for everything? Someone suggested time complexity; okay, any other ideas? The number of data points? You have seen many competitions, some of which even get news coverage these days, like the ImageNet competition: they have millions of data points, and still the winner is a model trained with a deep network, for example; no one uses the Bayes classifier directly. Is it that you don't have the priors in advance? Note that you can extend the Bayes classifier to multiple classes easily: if you have K classes, you just compute the posterior probability of being in class 1, 2, 3, up to K, and go for the one which is largest. So yes, you don't have the prior, but not only the prior: the real problem is that we don't have any of these quantities. We don't have the class conditional distribution, we don't have the marginal distribution, and the prior is actually the easiest one of these to estimate.
You can estimate the prior very easily: if I have 100 data points and 50 of them are class 1 and 50 of them are class 0, I can say my prior is one half for each class; my belief is that half of the time it's one class and half of the time the other, and that's a very good estimation. You can just assume there is a Bernoulli distribution on the label. But what about the marginal, or the class conditional? You give me a set of images: what is the distribution of these images, what is P(X)? I don't know. If I knew the distribution of these images, it would mean I could sample from it and generate new images that look like an elephant, look like Ali, or whatever. I don't have this distribution. The same goes for the class conditional: given that I'm in this class, can you tell me what the distribution of the points is? And imagine that these points are in a very high-dimensional space. The distributions we are familiar with, Bernoulli, uniform, Gaussian and so on, are pretty simple distributions, but who can say that the distribution of images is Gaussian, or uniform, or whatever? They are not that simple; they are much more complicated distributions. So the main problem is that we don't have these distributions. If we did, we would be done; we can't do better than Bayes.

Because we can't do this, there are many different methods. One idea would be: we don't know these densities, so let's estimate them. We can estimate the prior easily, just by counting the number of points in each class, and then try to estimate the class conditional and the marginal; there are techniques for density estimation, so why don't we use them? Because density estimation techniques are usually not good, especially in high-dimensional spaces. In a high-dimensional space there is no way you can do it well: you have images with a resolution of several megapixels, so your vectors are on the order of millions of dimensions, and you want to estimate a distribution in that space; there's no way you can get anything good. Because there is no good general way to estimate this density, there are different approaches to classification, and in this course we are going to see some techniques in each of these approaches.

Here is one approach: empirical risk minimization. In this approach we choose a set of classifiers, a class H, and we find one member of this set that minimizes some estimate of the error rate. What does that mean? It means I'm going to assume that my classifier comes from a particular class of functions. You have to make this kind of assumption all the time; there is a theorem in machine learning, the no free lunch theorem, and no free lunch here means no assumption, no learning. If someone tells you that without making any assumption, without any prior, they can learn something from data, don't believe it, because you can prove mathematically that it's not possible. You need to make assumptions. So I give you data for classification and you have to make an assumption about it. My assumption could be that there is a linear classifier, meaning that a line can separate the two classes; then, within the class H of all lines, or all half-spaces, I choose the one which minimizes the estimated error rate. Or I can assume that my classifier can be represented by a neural network; then among all neural networks with a given structure, say three nodes in the first layer, ten nodes in the second layer and so on, I will search for the one which minimizes the error rate. That's one approach to classification.
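As a toy sketch of empirical risk minimization (my own illustration, not code from the course): take a very small hypothesis class, here 1-D threshold classifiers, and pick the member with the lowest training error.

    import numpy as np

    # Made-up 1-D training data.
    x = np.array([-2.1, -1.3, -0.4, 0.2, 0.9, 1.8, 2.5])
    y = np.array([0, 0, 0, 1, 0, 1, 1])

    # Hypothesis class H: classifiers of the form h_t(x) = 1 if x > t else 0,
    # with the threshold t restricted to a small grid.
    thresholds = np.linspace(-3, 3, 61)

    def empirical_risk(t):
        predictions = (x > t).astype(int)
        return np.mean(predictions != y)          # training error of h_t

    # Empirical risk minimization: pick the member of H with smallest training error.
    best_t = min(thresholds, key=empirical_risk)
    print(best_t, empirical_risk(best_t))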
Another approach, which in some cases can overlap with the first approach, is the family of regression approaches. I told you the difference between regression and classification: in classification we have data points whose labels y come from a finite set; in regression you relax the assumption that y comes from a finite set and allow it to be real-valued. That's the difference between classification and regression, and regression is in a sense an easier problem because you don't have this additional constraint on y, and there are many methods to solve it. Sometimes it's hard to find a function h that maps a vector to only two distinct labels, where the output must be exactly 0 or 1, but it's easier to estimate a function which maps my vectors to a real value, say between 0 and 1. That is not classification anymore, it's regression, but that's fine: I solve the problem using regression methods and then I put a threshold on the output. My function maps everything to a value between 0 and 1, and I decide that if it's more than one half the label is 1, and if it's less than one half the label is 0. That's another approach; a small sketch of this thresholded-regression idea is below.
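Here is a minimal sketch of that idea (my own illustration): fit ordinary least squares to the 0/1 labels and threshold the fitted values at one half.

    import numpy as np

    # Toy 1-D data: the labels are 0/1 but we treat them as real values.
    x = np.array([-2.1, -1.3, -0.4, 0.2, 0.9, 1.8, 2.5])
    y = np.array([0, 0, 0, 1, 0, 1, 1])

    # Least-squares fit of y on x (regression on the labels).
    X = np.column_stack([np.ones_like(x), x])           # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    def classify(x_new):
        score = beta[0] + beta[1] * x_new               # real-valued regression output
        return int(score > 0.5)                         # threshold at one half

    print([classify(v) for v in (-1.0, 2.0)])           # expect [0, 1]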
And the third approach is the one that I just mentioned: density estimation. I can estimate the density for class 0, estimate the density for class 1, estimate the priors, and then apply Bayes' rule for classification. The first real classifier that we are going to learn is actually based on this third approach. Before I go through the details, let me note again that you can easily extend this to the multi-class problem: the examples I gave were for two classes, but if you have K classes you can compute P(Y = k | X) for each class k = 1, 2, ..., K and label according to the maximum posterior probability.

So we are going to use the third approach, density estimation, and learn the details of our first classifier. This first classifier is called LDA, linear discriminant analysis. Let's see how linear discriminant analysis works. We just learned that I can apply Bayes' rule, compute the posterior, and go for the class which has the larger posterior probability. Let's assume that we have a two-class problem for simplicity; we can see later how it extends to multiple classes. So my first assumption is that we have two classes, which means y comes from the set {0, 1}. Then I make another assumption. For density estimation I could look at the data and try one of the general-purpose density estimation techniques, or I can assume a parametric form for the density and then find its parameters from the data: I can assume it's uniform, or Gaussian, or whatever, and then look at the data and estimate the parameters of that form. In LDA our assumption is that the class conditional distribution is Gaussian; the Gaussian is one of the easiest distributions to work with. So this is my second assumption in LDA: assume f_k is multivariate Gaussian, which means

    f_k(x) = 1 / ( (2π)^(d/2) |Σ_k|^(1/2) ) · exp( -1/2 (x - μ_k)^T Σ_k^{-1} (x - μ_k) ),

where this π is not the prior, it's 3.14..., Σ_k is the covariance matrix of class k with |Σ_k| its determinant, and μ_k is the mean of class k. That's the multivariate Gaussian, and my second assumption is that the class conditional is Gaussian.

What would my decision boundary be? What was the definition of the decision boundary? It was the set of points such that the probability of being in one class is the same as the probability of being in the other class. So the decision boundary is the set of points such that P(Y = 1 | X = x) = P(Y = 0 | X = x), and I would like to compute it based on my assumption that the class conditionals are Gaussian: geometrically, what is the set of points where the posterior probabilities of the two classes are equal, so that on one side of the boundary one class has more probability and on the other side less? Is that clear? So basically I need to solve P(Y = 1 | X = x) = P(Y = 0 | X = x). Now P(Y = 1 | X = x) is f_1(x) π_1 divided by the marginal, and this should equal f_0(x) π_0 divided by the marginal. Happy so far? The marginal is the same on both sides, so I can cancel it, and what I need to solve is f_1(x) π_1 = f_0(x) π_0. But I have assumed that f is Gaussian, so I need to solve the equation obtained by substituting the two multivariate Gaussians, with covariance Σ_1 and mean μ_1 on one side and Σ_0 and μ_0 on the other. Here x is a d-dimensional vector and μ is also d-dimensional; that's my notation in the slides, when it's bold it's a vector. The priors π_1 and π_0 I haven't estimated yet, but assume for now that we know them; they are scalars, my belief about the chance of being in each class.

Now, for simplicity, I make my third assumption. So far, to do the Bayes classifier I had two assumptions: first I assumed that I have two classes, then I assumed that my class conditional distribution is multivariate Gaussian. Now my third assumption is that the covariance matrix of class 1 is the same as the covariance matrix of class 0, just for simplicity. That's how modeling works, right? The real thing seems too complicated, we can't handle it, so we make assumptions: images are Gaussian? No, but I'm just going to assume they are. The covariance matrices are the same? No, but I'm just going to assume they are. Why? Because otherwise I can't solve the problem; we make assumptions in order to be able to solve it. So my assumption is that they are the same, and I will call both of them Σ; in practice they are not the same.
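A small numeric sketch of this class-conditional density (my own illustration of the formula above, with an assumed mean and covariance):

    import numpy as np

    def gaussian_density(x, mu, Sigma):
        """Multivariate Gaussian density f_k(x) with mean mu and covariance Sigma."""
        d = len(mu)
        diff = x - mu
        norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        quad = diff @ np.linalg.inv(Sigma) @ diff      # (x - mu)^T Sigma^{-1} (x - mu)
        return norm_const * np.exp(-0.5 * quad)

    # Toy 2-D example with an assumed mean and covariance.
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 1.0]])
    print(gaussian_density(np.array([0.5, -0.2]), mu, Sigma))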
With this assumption, the Σ_1 in the equation and the Σ_0 both become Σ: this one is Σ, this one is Σ, this one is Σ and this one is Σ, because of the assumption I made. So the normalizing constant in the denominator on the left-hand side is the same as the one on the right-hand side, and I can cancel them. Now I have to solve what remains, and to do that let me take the log of both sides to get rid of the exponentials. On the left-hand side I have -1/2 (x - μ_1)^T Σ^{-1} (x - μ_1) + log π_1, and this is equal to -1/2 (x - μ_0)^T Σ^{-1} (x - μ_0) + log π_0. I can take everything to one side, so the first part stays with a plus, the second part gets a minus, and the whole thing equals zero.

Now let me expand these. Each of them is just a quadratic form, and if I expand a quadratic form I get the first term squared, the second term squared, and minus two times the cross term. So from the class-1 part I get -1/2 x^T Σ^{-1} x, that's the first term squared; I get the second term squared, μ_1^T Σ^{-1} μ_1, which with the -1/2 in front becomes -1/2 μ_1^T Σ^{-1} μ_1; and I get minus two times the cross term x^T Σ^{-1} μ_1, which multiplied by -1/2 becomes +x^T Σ^{-1} μ_1. Am I right? Let me warn you from the first lecture: I tend to make many numerical mistakes on the board, so be alert. Let's expand the other one as well: from the class-0 part, with the overall minus sign, I get +1/2 x^T Σ^{-1} x, then +1/2 μ_0^T Σ^{-1} μ_0, and then -x^T Σ^{-1} μ_0. And from the priors I have +log π_1 and -log π_0, which I will write as log(π_1 / π_0). All of this equals zero.

Now we can do a little bit of simplification. I have -1/2 x^T Σ^{-1} x from one expansion and +1/2 x^T Σ^{-1} x from the other, so these two terms cancel. After cancelling them, look at what's left. The term μ_1^T Σ^{-1} μ_1 is just a scalar which has nothing to do with x, so it's a constant: μ is a d-dimensional vector, so μ^T is 1 by d, Σ^{-1} is d by d, and μ is d by 1, so the product is a scalar that does not depend on x. The same goes for μ_0^T Σ^{-1} μ_0, and log(π_1 / π_0) is also a constant. Then I have a term x^T Σ^{-1} μ_1 that depends on x: Σ^{-1} μ_1 is a d-dimensional vector and x is a d-dimensional vector, so this term is linear in x; the term -x^T Σ^{-1} μ_0 is also linear in x. So if I collect them, I have the linear part x^T Σ^{-1} (μ_1 - μ_0) plus a constant made up of -1/2 μ_1^T Σ^{-1} μ_1, +1/2 μ_0^T Σ^{-1} μ_0 and log(π_1 / π_0). So I have a term linear in x plus a constant, equal to zero.
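Collecting those terms, the boundary from the derivation above can be written compactly as (this is just the board equation, cleaned up):

    x^{\top} \Sigma^{-1}(\mu_1 - \mu_0)
    \;-\; \tfrac{1}{2}\,\mu_1^{\top}\Sigma^{-1}\mu_1
    \;+\; \tfrac{1}{2}\,\mu_0^{\top}\Sigma^{-1}\mu_0
    \;+\; \log\frac{\pi_1}{\pi_0} \;=\; 0 .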
What does this equation look like geometrically? It's just linear, right? In two-dimensional space it's a line, and in d-dimensional space it's a hyperplane. It's a linear function. We wanted to compute the decision boundary, and it turned out that under these simplifying assumptions the decision boundary is just a line: Σ^{-1}(μ_1 - μ_0) is a vector, so it's as if I have x^T times a vector β plus a constant, equal to zero.

So if you want to do LDA, linear discriminant analysis, what you have to do is simply this: assume your two classes are Gaussian, estimate the covariance matrix, estimate the mean of each class, μ_1 and μ_0, and estimate the priors. The priors are easy to estimate, I will show you the details in a minute: just count the number of points in one class and the number of points in the other. You take all of these parameters, put them in this formula, and it gives you the equation of a line, a linear equation. This is your decision boundary: any point on this line has probability of being in class 1 equal to the probability of being in class 0, and anything on one side of the boundary has one of these probabilities larger than the other, so we can make decisions based on this line.

For example, say I want to do digit recognition and I have three types of digits; I gave you the example for two classes, but you can easily extend it to multi-class. My first assumption is that the distribution of the digits in each class is Gaussian. It's a wrong assumption; nevertheless I'm going to make it. I'm going to assume that each digit image has been sampled from a multivariate Gaussian. What does that mean? I observe an image of a two, and I assume that if I vectorize it, based on the resolution of the picture it lives in some d-dimensional space, and the pixel vector has been drawn from a multivariate Gaussian. So this class is Gaussian, this class is Gaussian and this class is Gaussian. Now I can estimate the mean of each of these classes, and these are the means of the three classes; I can estimate the covariance matrix of each of these classes. The covariances of the three classes are not really the same, but I'm just going to assume that they are, so that the three Gaussians have exactly the same shape. If I do so, the decision boundary is linear, and it gives me lines which separate these three digit classes. This is on some real handwritten digits, and as you can see, these two wrong assumptions can still give you a decent classifier.

Any questions? Someone asked how to estimate the mean: I will go through the details, but the mean is quite easy, you just take the average. For a Gaussian, the maximum likelihood estimate of the mean is the average: add all of the points in the class and divide by the number of points. And the covariance? You estimate the covariance matrix of each class and then take a weighted average; again, I'm going to go through the details in a minute.

So, since I don't want to keep opening and closing this screen: we had two simplifying assumptions here. One was that the class conditional is Gaussian, and the other was that the covariances of the classes are the same.
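Here is a compact sketch of that recipe in code (my own illustration, with made-up data; the function and variable names are not from the course): estimate the class means, a shared covariance, and the priors, then classify with the linear rule derived above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up 2-D training data: two roughly Gaussian clouds.
    X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # class 0
    X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))   # class 1

    # Parameter estimates.
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    pi0, pi1 = n0 / (n0 + n1), n1 / (n0 + n1)
    # Pooled (shared) covariance: weighted average of the per-class estimates.
    S0 = np.cov(X0, rowvar=False, bias=True)
    S1 = np.cov(X1, rowvar=False, bias=True)
    Sigma = (n0 * S0 + n1 * S1) / (n0 + n1)
    Sigma_inv = np.linalg.inv(Sigma)

    def lda_predict(x):
        # Sign of x^T Sigma^{-1}(mu1 - mu0) plus the constant term.
        linear = x @ Sigma_inv @ (mu1 - mu0)
        const = (-0.5 * mu1 @ Sigma_inv @ mu1
                 + 0.5 * mu0 @ Sigma_inv @ mu0
                 + np.log(pi1 / pi0))
        return int(linear + const > 0)

    print(lda_predict(np.array([0.2, -0.1])),   # near the class-0 mean -> 0
          lda_predict(np.array([2.1, 1.8])))    # near the class-1 mean -> 1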
If I relax the second assumption, that the covariances of the classes are the same, do you know what's going to happen? The shape of the boundary will change, but how? There was a part, now hidden behind this screen, where we did the simplification: there was a quadratic term x^T Σ^{-1} x from each class which we cancelled out, remember? We were able to cancel those terms only because we assumed that Σ is the same for both classes, that Σ_0 is the same as Σ_1. If I relax this assumption, those two terms are not the same and I cannot cancel them, so I'm left with a term which is quadratic in x. So if I relax the second assumption, my decision boundary is not linear anymore, it's quadratic, and this is called quadratic discriminant analysis. Linear discriminant analysis is based on the assumption that the class conditionals are Gaussian and the covariance matrices are the same; quadratic discriminant analysis is based on the assumption that the class conditionals are Gaussian but the covariance matrices are not the same. In that case you're going to have decision boundaries like the ones on this slide: compare them with the linear ones; relaxing the assumption changes the shape of your decision boundary to a quadratic shape. Any questions?

Okay, we went through how to do LDA on the blackboard, so I'm going to skip these slides; they just repeat the assumptions and the simplification that led to the linear boundary. Here, again, we had the term x^T Σ^{-1} x on both sides and we cancelled them out, and that's why we got a linear decision boundary; we could cancel them, as I said, because we assumed the two Σ's are the same for both classes. If I relax this assumption they don't cancel each other, so the decision boundary is not linear anymore, it's quadratic, and you have already seen the resulting shape, under the assumption that Σ_1 is not the same as Σ_0.

Now, in general form, here is what you can actually do. Suppose that you have K classes and that the class conditionals are Gaussian. Then your Bayes classifier is the argmax over k of the quantity

    δ_k(x) = -1/2 log |Σ_k| - 1/2 (x - μ_k)^T Σ_k^{-1} (x - μ_k) + log π_k.

You can do quadratic discriminant analysis for K classes just by computing this quantity for each of the K classes and comparing them. I think you can easily see where this formula comes from: if you take the log of the multivariate Gaussian, you get the -1/2 log of the determinant of Σ_k, which is here; the exponential goes away and you get the -1/2 quadratic term, which is here; and then your prior contributes plus log π_k (the -d/2 log 2π term is the same for every class, so it doesn't affect the comparison). So this quantity δ_k is just the log of the class conditional times the prior, up to a constant. You have K classes; the label belongs to the class which has the maximum posterior, or equivalently the maximum class conditional times prior, and instead of maximizing that product we can maximize its log. So compute this quantity for each class, compare them, and assign the label to the one which has the maximum value. That would be your Bayes classifier in the multi-class scenario.
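A small sketch of that multi-class rule (again my own illustration, assuming the per-class parameters have already been estimated):

    import numpy as np

    def qda_discriminant(x, mu, Sigma, pi):
        """delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k."""
        diff = x - mu
        return (-0.5 * np.log(np.linalg.det(Sigma))
                - 0.5 * diff @ np.linalg.inv(Sigma) @ diff
                + np.log(pi))

    def qda_classify(x, mus, Sigmas, pis):
        # Assign x to the class with the largest discriminant value.
        scores = [qda_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
        return int(np.argmax(scores))

    # Toy 2-D example with three classes and different (assumed) covariances.
    mus = [np.array([0., 0.]), np.array([3., 0.]), np.array([0., 3.])]
    Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]
    pis = [1/3, 1/3, 1/3]
    print(qda_classify(np.array([2.8, 0.2]), mus, Sigmas, pis))   # expect class 1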
When you have K classes and you don't assume that the Σ_k's are the same, that's QDA; if you assume that the covariances of all of the classes are the same, then it's LDA, otherwise it's QDA. In LDA, the Σ_k in that formula simplifies to a single shared Σ.

How can we estimate these parameters in practice? We made an assumption about the form of the distribution, we assumed it's Gaussian, but we still need to estimate the parameters of this Gaussian, the mean and the covariance, and we need to estimate the prior as well. Estimating the prior is very easy: say I have 100 points, 10 of them in class 1 and 90 of them in class 2; then π_k = n_k / n, where n is 100 and n_k is the number of points in class k, so 10 over 100, one over ten, is my prior for class 1 and 9 over 10 is my prior for class 2. Easy, based on the assumption that the label follows a Bernoulli distribution. How can I estimate the mean of each class? The maximum likelihood estimate of the mean of a Gaussian is just the average, so take the average of the points in that class. I give you two classes, say 10 points here and 90 points there: take the 10 points, add them up, divide by 10, that's your μ for this class; take the 90 points, add them up, divide by 90, that's the mean of the other class. Someone asked what this means for image data: I give you a set of images, say these are twos in the training set and those are threes; each of them can be represented as a vector, so each is a d-dimensional vector; add all of these d-dimensional vectors, pixel values added up, and divide by the number of them. You can then reshape the result and see what it looks like: a sort of blurred image which is the average of all the twos you have observed, or the average of all the threes.

What about the covariance matrix? The maximum likelihood estimate of the covariance matrix of class k is the sum over the points in class k of (x_i - μ_k)(x_i - μ_k)^T, divided by the number of points in that class. So if I have twos here, threes here and ones here, I can estimate a covariance matrix for each of them. But if I want to do LDA, the assumption is that the covariance matrices are the same; I have computed three different covariance matrices, so I just take their average, and this average should be weighted: my Σ is the sum of the per-class covariance matrices, weighted by the number of points in each class and divided by the total number of points. If I have 10 points here and 90 points there, I give more weight to the 90 points: 90 times this covariance plus 10 times the other one, divided by 100. Is that clear?
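Written out in symbols, the estimators just described are (with n_k the number of points in class k and n the total):

    \hat{\pi}_k = \frac{n_k}{n}, \qquad
    \hat{\mu}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i, \qquad
    \hat{\Sigma}_k = \frac{1}{n_k} \sum_{i:\, y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\top},
    \qquad
    \hat{\Sigma} = \frac{1}{n} \sum_k n_k\, \hat{\Sigma}_k .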
Now let me show you something which in terms of computation can sometimes be useful and which gives us pretty good insight into the nature of these techniques, the nature of LDA and QDA. Take the quantity δ_k(x) that I compute for multi-class classification, and assume that Σ is the identity. What does that mean? It means that the Gaussian in each class is spherical. Let's see what happens to my δ_k. It's going to be minus 1/2 log of the determinant of the identity, and what is the determinant of the identity? One, so its log is zero and the first term is zero. Then minus 1/2 (x - μ_k)^T (x - μ_k), because the inverse of the identity is the identity, plus log π_k. And what is (x - μ_k)^T (x - μ_k)? x is a vector, a point, and μ_k is another point; it's the squared Euclidean distance between these two points. So δ_k(x) = -1/2 ||x - μ_k||^2 + log π_k, a distance term plus the constant log π_k.

For simplicity, assume for the moment that the prior of each class is the same as the others, that the points are equally distributed: I have two classes, 100 points, 50 here and 50 there, so π_1 is the same as π_0. Then when I compare, say, δ_1 with δ_0, δ_1 involves the Euclidean distance between the point x and the mean of class 1, and δ_0 involves the Euclidean distance between x and the mean of class 0, each plus a constant that for the moment is the same for every class. What does that mean? It means that in this very simplified scenario LDA, or QDA, classifies points very simply: I have two classes, I assume each is Gaussian, this is the mean of one class and this is the mean of the other, and you give me a point to classify; all we are doing is looking at the distance between this point and the mean of each class, and deciding that the point belongs to the class whose mean is closer. You are closer to the mean of this class, go here; closer to the mean of the other class, go there. Intuitively: I have twos and threes, I take the average of the twos and the average of the threes, so on average a two looks like this and on average a three looks like that; you give me a new digit and I just check whether it is closer to this mean or to the other mean, and label it a two or a three accordingly. That's under the assumption that Σ is the identity, which means the Gaussians are spherical. Next lecture we are going to see the more general case, where Σ is not spherical and not the same across classes, and what is happening intuitively in QDA and LDA.
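That special case is just a nearest-mean (nearest-centroid) classifier; a tiny sketch of it, again with invented data:

    import numpy as np

    def nearest_mean_classify(x, means):
        # With Sigma = I and equal priors, delta_k reduces to -1/2 ||x - mu_k||^2,
        # so the rule picks the class whose mean is closest in Euclidean distance.
        distances = [np.linalg.norm(x - mu) for mu in means]
        return int(np.argmin(distances))

    # Assumed class means (e.g., the "average two" and the "average three").
    means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    print(nearest_mean_classify(np.array([0.5, 1.0]), means))   # closer to the first mean -> 0
    print(nearest_mean_classify(np.array([2.5, 3.5]), means))   # closer to the second mean -> 1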
Info
Channel: Data Science Courses
Views: 23,396
Rating: 4.9713264 out of 5
Keywords: Machine Learning (Software Genre), Linear Discriminant Analysis, Statistical Classification
Id: _m7TMkzZzus
Length: 74min 4sec (4444 seconds)
Published: Sun Sep 27 2015