Mod-02 Lec-23 Gaussian Mixture Model (GMM)

Video Statistics and Information

Captions
Gaussian mixture model using the EM method: where do you use it, and why do you need a mixture of Gaussians? We have seen in many of our analyses, specifically if you recollect the discussions we had on the Mahalanobis distance criterion and the covariance matrix under the Bayes paradigm, that when we convert a multivariate Gaussian function into a distance criterion we assumed in most of those analyses that the scatter of the data, in whatever dimension it may be, 1, 2, 3 or even higher, follows a Gaussian distribution. This sort of assumption may not hold in practice, but of course most scientists and engineers still use a Gaussian distribution for modelling and analysis in many different applications, including signal processing and communication theory. One advantage is that the Gaussian seems to be the one that is closest to natural distributions; the other main reason is that it is easy to do mathematical manipulations with Gaussian functions. Specifically, its derivatives exist up to as high an order as you need, in fact to infinite order, and it has various other advantages over other distribution functions. But there may be situations where a distribution is not strictly Gaussian in nature. In such cases there are methods which deal with multiple Gaussians: it is as if I want to cluster the data into several components or parts, and we assume that each of those clusters follows a Gaussian distribution. This leads us to the analysis based on the GMM, or the Gaussian mixture model. Let us look at the expression of the Gaussian mixture.

This is the univariate Gaussian distribution in 1D, where µ is the mean and σ is the standard deviation; σ² is the variance, which appears both in the normalizing part of the function and inside the exponential. We had seen this, and we also extended it to the multivariate Gaussian distribution, where µ becomes a vector of the same dimension as the data sample or instance x. I have changed the G to N here, indicating the normal distribution; in higher dimensions σ² is replaced by the covariance matrix, whose determinant appears under the square root in the normalizer, and the Mahalanobis distance appears inside the exponential term, involving the inverse of the covariance matrix. This is nothing new: we had this discussion earlier under the multivariate Gaussian distribution, we put it under the Bayes paradigm, we formulated distance functions, and we know under what properties of the covariance matrix we get linear between-class decision boundaries and when we get nonlinear boundaries; the covariance matrix and its inverse dictate the corresponding property of the decision boundary.

But now what we will do is extend this to a case where we have not just one multivariate Gaussian distribution in higher dimensions but several of them spread over the data. To do this we basically need to estimate the covariance matrix and µ for a particular distribution, and one such method is called maximum likelihood estimation.
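Since the slide expressions are not reproduced in the captions, here are the standard forms of the two densities being described, with the transcript's s and s2 written as σ and σ²:

N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)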
To estimate these parameters from the data samples we will look at the ML method, which is the simpler one to visualize. Take the log of the probability function in the previous expression; we did the same thing when we formed the discriminant function for a particular class and derived a distance criterion. If you do that, the resulting expression is nothing new to you: you have a quadratic term and certain constant terms. We need to take the derivative of this because we want to maximize it, so take the derivative with respect to the mean and the covariance term, because these are the two parameters which you need to estimate for that function. This gives you an expression to estimate the mean: the mean estimated by the maximum likelihood or ML method, where N is the number of sample points, is simply the sample average, and the covariance matrix is given by the overall scatter matrix. So the ML method for estimating the parameters, the mean and the covariance, gives you the same expressions which we have seen earlier.

What happens if you have multiple Gaussians, or what is called a mixture of Gaussians, or a Gaussian mixture? This is sometimes described as a linear superposition of a set of K Gaussians, where K is the total number of Gaussians. Typically K is more than 1, but of course in a very special case K can be equal to 1, where you just have one Gaussian. The overall probability is now a summation over all K Gaussians, where π_k is called the mixing coefficient for the kth Gaussian; the subscript k indicates the corresponding Gaussian. The other factor in each term is the multivariate normal distribution for the kth Gaussian, with a mean vector µ_k of the same dimension as the samples and a covariance matrix Σ_k for that Gaussian. The only constraint we put on the mixing coefficients is that each of them lies between 0 and 1 and their sum is equal to 1. These are basically considered as weights; they are also called the weights of the corresponding Gaussian functions. If you take the log likelihood of this overall function, which is a function of the means, the covariances and the weight coefficients, it is a sum over all the data samples, where N is the total number of samples and p(x) is the mixture density for a sample x. Replace p(x) by the mixture expression and that is what you will get.
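In standard notation, the ML estimates and the mixture model referred to here take the following forms (these are the textbook expressions, reconstructed since the slide itself is not in the captions):

\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^{\mathsf T}

p(x) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k), \qquad 0 \le \pi_k \le 1, \quad \sum_{k=1}^{K}\pi_k = 1

\ln p(X\mid\pi,\mu,\Sigma) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)\right\}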
Remember that inside the logarithm you have the sum over the K Gaussians, and then you have this term for each of the samples; k is the index indicating the kth Gaussian and n is the index of a particular instance or sample, the total number of samples is N and the total number of Gaussians is K. This is the expression of the log likelihood for a mixture of multivariate Gaussian distributions, which is what we are considering in this case. Maximum likelihood does not yield a closed-form solution here, so you need an iterative method of optimization, and that is what EM, or expectation maximization, will give you.

Let us take this example to show what we intend. This is a scatter of data samples obtained from a mixture of 3 Gaussians; you can see that the overall spread of the data does not follow a Gaussian distribution by itself, so we can cluster the data into three different components and say each of these components is a cluster following a Gaussian distribution. In fact this particular data has been synthetically generated from three different Gaussian distributions, shown by the three sets of iso-contours; these are asymmetric Gaussian distributions in 2D, with three different class means indicated by three different colors and their iso-contour lines. The next plot shows a surface plot in two dimensions, where the height of the surface at each point reflects the probability density of the corresponding cluster or Gaussian. This surface plot can be visualized as an extension of the contour plot: the contour plot shows iso-contour lines, curves of equal distance with respect to the class mean, but if at each point you compute the probability and translate that to a height, this is the surface plot you get, where the height indicates the probability density of the function.

So this is synthetic data obtained from 3 Gaussian functions, and the overall cluster density looks like this with respect to the clusters; it is difficult to model this under a single Gaussian distribution even in 2D. These are the sample points corresponding to the three clusters, and if you remove the cluster labels, or the colors, of this data, this is the data which you will have. You cannot model this well using a single Gaussian, and this example shows why you need multiple Gaussians, or a mixture of Gaussians, to model the data. Of course you could ask me the question: how do you know a priori how many Gaussian functions you need? That is something which is not within the scope of today's discussion, and it is left to individual researchers to find out, for a given data set, what the optimal number k is. There are methods to find the optimal k; if a data set is given to you a priori, you can often find the best number k of Gaussians to fit to it. In this case, since I know the data beforehand, I will say that the number of Gaussians is three, but if you are given an arbitrary distribution which is non-Gaussian in nature, whether k = 2, 3, 4, 10 or even more is very difficult to visualize.
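As a purely illustrative sketch of the kind of synthetic data the lecture describes, the snippet below draws 2D samples from a mixture of 3 Gaussians; the means, covariances and mixing weights are made up for illustration, not the values used in the lecture's figure.

# Generate a 2D data set from a mixture of 3 Gaussians (illustrative parameters only).
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.4, 0.35, 0.25])                       # mixing coefficients, sum to 1
means = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 5.0]])
covs = np.array([[[1.0, 0.6], [0.6, 1.0]],             # asymmetric (correlated) Gaussians
                 [[1.5, -0.7], [-0.7, 1.0]],
                 [[0.5, 0.0], [0.0, 2.0]]])

N = 500
z = rng.choice(3, size=N, p=pi)                         # latent component label per sample
X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])

# Once the labels z are discarded, X is exactly the kind of unlabelled scatter
# that a single Gaussian cannot model well but a 3-component GMM can.
print(X.shape)  # (500, 2)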
In practice, however, there are methods which people adopt to find out the ideal value of k for a given problem. We can think of the mixing coefficients as prior probabilities for the individual components, and for a given value of k we can evaluate the corresponding posterior probabilities, called responsibilities, which can be viewed as latent variables in the expression we saw a couple of slides back. Let us get back to the expression of the log likelihood, but before that we will look at the Bayes rule, in which we define an expression for the latent variable γ_k under the Bayes paradigm. This is nothing new: it is the posterior probability in terms of the class priors and so on, which we discussed earlier. What we are doing here is taking the class-conditional probability to be our normal distribution in the mixture of Gaussians, so the numerator is the mixing coefficient times the corresponding Gaussian, and the unconditional probability in the denominator is the sum over all the Gaussians. What is the mixing coefficient here? It is the number of samples for a particular class divided by the total number of samples. Interpret the number of samples for a particular class as the number of points assigned to a particular cluster, which is not assigned beforehand; we do not know how many samples belong to a particular cluster k, so that has to be found out, which in turn will give you the mixing coefficient π_k.

So what does the EM algorithm do? The EM algorithm is an iterative optimization technique which operates locally to find the set of values of the parameters. What are the parameters we need to estimate here in a Gaussian mixture? The mixing coefficients, the set of class means for the individual clusters, and the covariance matrices. Let us say somebody decides to fit k Gaussian mixtures, to be very precise, on the data, so k is known. If there are k Gaussian clusters which you want to fit, you have k different means, and each mean has the same dimension as the data. So there are k means, k covariance matrices and k mixing coefficients; that is 3k sets of parameters which you need to estimate. Of course each mixing coefficient is a scalar quantity; for µ_k there are k means, each of dimension D, and there are k covariance matrices, each with D² elements. These are the parameters one needs to estimate.

Let us see how EM does it. There are two steps in EM: one is called the expectation step, the other the maximization step, and they are applied iteratively one after the other. You have an expectation step, or an estimation step as it is also called, followed by a maximization step, and you repeat this pair of steps in sequence until a condition of convergence is satisfied. In the expectation step, for the given parameter values we compute the expected values of the latent variables, which is why it is called the expectation step. Then you have a maximization step which updates the parameters of the model based on the latent variables, calculated using the ML method.
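A minimal sketch of the responsibility computation described here, assuming the notation above: gamma[n, k] = π_k N(x_n | µ_k, Σ_k) divided by the sum of the same quantity over all components.

# E step: compute responsibilities for every sample and every component.
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X (N x D)."""
    D = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    maha = np.einsum('nd,de,ne->n', diff, inv, diff)    # squared Mahalanobis distances
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

def e_step(X, pi, means, covs):
    """Return the N x K matrix of responsibilities gamma."""
    K = len(pi)
    dens = np.stack([pi[k] * gaussian_pdf(X, means[k], covs[k]) for k in range(K)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)       # normalize over components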
So let us look at the EM algorithm now. Given a Gaussian mixture model, our goal is to maximize the likelihood function which we saw a few slides back with respect to the parameters, namely the means µ_k, the covariance matrices Σ_k of the components and the mixing coefficients π_k. The first step is to initialize the means, indexed from 1 to k, the covariance terms and the mixing coefficients, and evaluate the initial value of the log likelihood. You can start with an arbitrary set of random values here, but instead of being completely random, since you do not have any class information, you could take the mean of the entire data set and use the data mean as the initial mean of each cluster or Gaussian; the covariance matrices could be initialized similarly. In fact a fair amount of research has gone into finding a good starting point for any such method: the closer you start to the final solution, the faster you converge and the better the solution you will have. So instead of starting with absolutely random values it is possible to have better initial estimates, but I am not touching those aspects in this particular talk.

So initialize, let us say, with some initial values, which could even be random values, then go to the E step of EM, the expectation step, in which you compute the latent variable given by the expression here. This is for the kth Gaussian: in the denominator you sum up all the Gaussian distributions evaluated at the corresponding x using the current parameter values, and in the numerator you have the corresponding kth Gaussian.

Then go to the third step, which is the M step. In the second step, mind you, you have estimated γ_j, or γ_k, the index has changed but it is the same variable, and that goes inside the expression here to compute the corresponding mean and covariance. This is the same as the ML estimation done earlier, except that the latent variable acts as a weight, which gives more accurate values of the mean and covariance matrix. The mixing coefficients must also be recalculated using the variable computed in the E step, step 2, as given here. So once the γ values are available, you can see the three expressions here: the mixing coefficient, the covariance and the mean are all computed using the data sample points and the latent variable, together with the normal distribution expression. Then you obtain the Gaussian mixture using the parameters estimated in step 3.

Now what you need to do is find out whether the likelihood estimated here truly represents the data samples. If it does not, which is what will typically happen at first, you go back to step 2; you keep repeating steps 2, 3 and 4, which are essentially the E and the M parts, and that will help you converge to a better solution. We will have an illustration of this very soon in the next few slides. And you put a convergence condition and say: I will keep repeating this process till the log likelihood of the distribution satisfies some criterion of convergence.
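A minimal sketch of the M step under the same notation: the responsibilities gamma from the E step act as soft weights in the ML estimates of π_k, µ_k and Σ_k.

# M step: update mixing coefficients, means and covariances from the responsibilities.
import numpy as np

def m_step(X, gamma):
    """Update (pi, means, covs) from data X (N x D) and responsibilities gamma (N x K)."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                               # effective number of points per component
    pi = Nk / N                                          # new mixing coefficients
    means = (gamma.T @ X) / Nk[:, None]                  # responsibility-weighted means
    covs = np.empty((len(Nk), D, D))
    for k in range(len(Nk)):
        diff = X - means[k]
        covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted scatter matrix
    return pi, means, covs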
The convergence criterion could be that over successive iterations, or over the last few iterations, the parameters do not change. What are the parameters? The mixing coefficients π, the means µ and the covariances Σ: they do not change over successive iterations or over the last few iterations. A similar criterion, which you can use as given in step 4, is that the estimated likelihood itself does not change over a set of iterations: you compute and store the log likelihood value from the previous iteration, compare it with the current one, and if the change is negligible, below a threshold, you say that the convergence criterion is met and that you have estimated the corresponding Gaussians for the distribution you have.

So to wind up, let us take an example now in 2D. These images have also been obtained from the book by Bishop, Pattern Recognition and Machine Learning; the reference was given at the beginning of this lecture. What does it show? This is a scatter in which you can almost see that there are two different clusters of data, so it is possible to fit them with two different Gaussian distributions. Almost blindly I am selecting k to be 2, but you can select k to be 3 or 4 as well; there will be some amount of convergence, whether good or bad is another matter to be seen. In this case let us say the Gaussians have been initialized at these two places with the corresponding mean and scatter; the two ellipses show the Gaussians, k = 1 here and k = 2 there, with the corresponding means at the centres of the ellipses, and the scatter is in 2D.

As you keep proceeding, remember what we did in the EM algorithm: there were two typical steps. After the initial set of random values has been used to start the cycle for the parameters of the Gaussian mixture model, you use the E step to estimate the latent variables; then you go to the M step, where you estimate the parameters again using the latent variables, and you continue this process, keeping a watch as you proceed. The initial set of points marked in blue will be assigned to this particular Gaussian, because they are the ones which are closer to the corresponding Gaussian; remember this was the initial stage, and you start assigning points to the distributions because these are the points lying close to the blue cluster. The Gaussian marked with the blue ellipse will be closer to this set of points, hence they will be assigned to that Gaussian, and the points which are now labelled in red will be assigned to the other one. After this assignment you recompute the parameters, the mixing coefficients, the means and the scatter. Let us see how the recomputed means after the first iteration look: once you recompute them they will appear like this.
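Putting the pieces together, here is a compact EM loop with the log-likelihood convergence test described in step 4. It reuses the gaussian_pdf, e_step and m_step sketches above and assumes pi, means and covs have already been initialized, randomly or from the data mean as suggested in the lecture.

# Full EM iteration with a convergence check on the change in log likelihood.
import numpy as np

def log_likelihood(X, pi, means, covs):
    dens = np.stack([pi[k] * gaussian_pdf(X, means[k], covs[k]) for k in range(len(pi))], axis=1)
    return np.sum(np.log(dens.sum(axis=1)))

def fit_gmm(X, pi, means, covs, max_iter=100, tol=1e-6):
    prev_ll = log_likelihood(X, pi, means, covs)        # step 1: initial log likelihood
    for _ in range(max_iter):
        gamma = e_step(X, pi, means, covs)               # step 2: expectation
        pi, means, covs = m_step(X, gamma)               # step 3: maximization
        ll = log_likelihood(X, pi, means, covs)          # step 4: re-evaluate the log likelihood
        if abs(ll - prev_ll) < tol:                      # negligible change: convergence met
            break
        prev_ll = ll
    return pi, means, covs, ll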
This is the distribution of the points which have been assigned to each Gaussian after step one, and the corresponding state after the second iteration is marked here; this is how the fit starts converging as you reassign the points. By iteration number 5 we can see that one of the Gaussians has converged to the tight cluster of points marked in blue, while the points marked in red follow the other Gaussian distribution. After 20 iterations you have almost converged, and this is the final stage of iteration, where you no longer have a change in either the parameter values or the log likelihood criterion after you compute with this set of parameters. Let us have a look at this animation for a second. This is the starting point: you do not know where the clusters are or where you should put the Gaussians, so you can start almost anywhere, and then this is how the convergence takes place, at the second iteration, the 5th iteration and the 20th iteration. So this is the method by which you can fit a set of Gaussians to a scatter of data points which does not follow a single Gaussian. And this is a method which is used not only to form clusters but also to model data points, because after this you can apply all your methods of classification, or you can transform the clusters and group them under certain criteria and so on. So this is one method which is used for modelling as well as for clustering. Thank you very much.
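As a small usage note on the clustering interpretation mentioned at the end: once the parameters have converged, each point can be assigned to the component with the highest responsibility. This is a hypothetical continuation of the sketches above, not something shown in the lecture.

# Hard cluster assignment from the fitted mixture: pick, for each sample,
# the component with the largest responsibility.
gamma = e_step(X, pi, means, covs)
labels = gamma.argmax(axis=1)        # cluster index per sample, in {0, ..., K-1}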
Info
Channel: nptelhrd
Views: 43,489
Keywords: Gaussian Mixture Model (GMM)
Id: ZBLyXgjBx3Q
Length: 26min 5sec (1565 seconds)
Published: Wed Oct 08 2014