Gaussian mixture model using the EM method. Where do you use, and why do you need, a mixture of Gaussians? We have seen in many of our analyses, specifically if you recollect some of the discussions we had on the Mahalanobis distance criterion and the covariance matrix under the Bayes paradigm, that when we convert a multivariate Gaussian function into a distance criterion we make an assumption: that the distribution, or scatter, of the data, in whatever dimension it may be, 1, 2, 3 or even higher, is a Gaussian distribution. This sort of assumption may not hold in practice, but of course most scientists and engineers still use a Gaussian distribution for modeling and analysis in many different applications, including signal processing and communication theory. The advantage is that the Gaussian seems to be the one that is closest to many natural distributions. The other main reason is that it is easy to do mathematical manipulation with Gaussian functions: in particular, its derivatives exist up to as high an order as you need, in fact to infinite order, and it has various other advantages over other distribution functions. But there may be situations where a distribution is not strictly Gaussian in nature, or nature is strictly not Gaussian. In such cases there are methods which deal with multiple Gaussians: it is as if I want to cluster the data into several components or parts, and we assume that each of those clusters follows a Gaussian distribution. This leads us to an analysis based on the GMM, which is commonly called the Gaussian mixture model. Let us look at the expression of the Gaussian mixture.
This is the univariate Gaussian distribution in 1D, where µ is the mean and σ is the standard deviation; σ² is the variance, which appears here. Remember, σ is called the standard deviation and the variance is σ². This part is the normalizing factor of the function and this is the exponential part; we had seen this, and we had also extended it to the multivariate Gaussian distribution case, where µ becomes a vector of the same dimension as the data sample or instance x. I have changed the symbol G to N here: G indicated the univariate Gaussian in 1D, and in higher dimensions it is written as N. Here σ² is replaced by the covariance matrix, whose determinant appears under the square root in the normalizing factor, and the Mahalanobis distance function appears within the exponential term, which involves the inverse of the covariance matrix.
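For reference, the two densities being pointed to on the slide have the standard forms

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\,\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),$$

where D is the dimension of the data.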
This is nothing new; we had this discussion earlier under the multivariate Gaussian distribution. We put it under the Bayes paradigm, we formulated distance functions, and we know under what properties of the covariance matrix we are going to have linear between-class decision boundaries or nonlinear boundaries; the covariance matrix, and its inverse, dictate the corresponding property of the decision boundary. But now what we will do is extend this to a case where we have not just one multivariate Gaussian distribution in higher dimensions but multiples of them spread over the data. To do this we basically need to estimate the covariance matrix and µ for a particular distribution, and one such method is maximum likelihood estimation, which we use when we need to estimate these for particular data samples. So we will look at the ML method, which is the simpler one to visualize. You take the log of the probability function in the previous expression; let us go back to that expression. This is what we are taking the log of, and we did the same thing when we formed the discriminant function for a particular class, when we derived a distance criterion. So if you do that, the resulting expression is also nothing new: you have a nonlinear term here and certain constant terms in the expression. We need to take the derivative of this because we need to maximize it, so we take the derivative with respect to the mean and with respect to the covariance term, because these are the two parameters which you need to estimate for that function.
This will give you an expression to estimate the mean. The mean estimated by the maximum likelihood or ML method, where N is the number of sample points, is given by this, which is a trivial expression you can obtain from here, and the covariance matrix is given by the overall scatter matrix shown here. So the ML method for estimating the parameters, the mean and Σ, gives you the same expressions we have seen earlier: this is the way you estimate the covariance matrix and the corresponding mean for the data.
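In the usual notation, the ML estimates being referred to are

$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n-\hat{\boldsymbol{\mu}})(\mathbf{x}_n-\hat{\boldsymbol{\mu}})^{T},$$

that is, the sample mean and the sample covariance or scatter matrix.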
So what happens if you have multiple Gaussians, or what is called a mixture of Gaussians, or a Gaussian mixture? This is sometimes described as a linear superposition of a set of K Gaussians, where K is the total number of Gaussians. Typically K is more than 1, but of course in a very special case K can be equal to 1, where you just have one Gaussian. The overall probability is now a summation over all K Gaussians, where π_k is called the mixing coefficient for the kth Gaussian; the subscript k indicates the corresponding Gaussian. And this expression is the normal multivariate Gaussian distribution for the kth component, with µ_k a mean vector of the same dimension as the sample and Σ_k the covariance matrix for the kth Gaussian. Again, I repeat: this is the mixing coefficient for the kth Gaussian and this is the normal multivariate Gaussian distribution for the kth Gaussian. The only constraints we put on the mixing coefficients are that each of them lies between 0 and 1 and the sum of all of them is equal to 1.
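Written out, the mixture density being described is

$$p(\mathbf{x}) = \sum_{k=1}^{K}\pi_k\,N(\mathbf{x}\mid\boldsymbol{\mu}_k,\Sigma_k), \qquad 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K}\pi_k = 1.$$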
Since they sum to 1, the mixing coefficients are basically considered as weights; they are also called the weights of the corresponding Gaussian functions. If you take the log likelihood of this overall function, which is a function of the means, the covariances and the weight coefficients, it is written as a sum over all the data samples, where N is the total number of samples and p(x), given here, is the probability of a sample x under the distribution. So this is what you will get: you replace the expression p(x) by the mixture here and this is the result. Remember, inside the logarithm you have the sum over the K Gaussians, and then you have this term for every one of the samples. Here k is the index indicating the kth Gaussian and n is the index referring to a particular instance or sample; the total number of samples is N and the total number of Gaussians is K. This is the expression you have for the log likelihood of a mixture of multivariate Gaussians, which is what we are considering in this case.
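Explicitly, the log likelihood in question is

$$\ln p(\mathbf{X}\mid\boldsymbol{\pi},\boldsymbol{\mu},\Sigma) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,N(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\Sigma_k)\right\},$$

and it is the summation sitting inside the logarithm that prevents a simple closed-form maximization.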
The maximum likelihood approach may not yield a closed-form solution here, and you need an iterative method of optimization; that is what EM, or expectation maximization, will give you.
So let us take this example to show what we intend to do. Here is a scatter of a set of data samples obtained from a mixture of 3 Gaussians. You can see that the overall spread of the data does not follow a Gaussian distribution by itself, so we can cluster the data into three different components and say that each of these individual components is a cluster following a Gaussian distribution. In fact this particular data has been synthetically generated from three different Gaussian distributions, as shown by these three sets of iso-contours. These are asymmetric Gaussian distributions in 2D, with three different class means indicated by three different colors, and these are their iso-contour lines. And if you look at this particular plot, it shows a surface plot over the two dimensions, where the height of the surface at each individual point reflects the probability density of the corresponding cluster or Gaussian. I repeat: if you look back at the slide, this surface plot can be visualized as an extension of the contour plot, where the contour plots show iso-contour lines, curves of equal distance with respect to the class mean. If at each point you compute the probability, say you compute the probability at a point here and translate that into a height, this is the plot you will get: a surface plot where the height, on the right-hand side, indicates the probability density of the function at that point. So what I mean is that this is synthetic data obtained from 3 Gaussian functions, and the overall cluster density may look like this with respect to the clusters. It is difficult to model this with a single Gaussian distribution, even in 2D. These are the sample points corresponding to the three clusters; if you remove the cluster labels, that is, the colors of the data, this is the data you will have. So you cannot model this well using a single Gaussian, and this data shows by example why you need multiple Gaussians, or a mixture of Gaussians, to model it.
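As an illustration of how such a data set can be synthesized, here is a minimal Python sketch; the means, covariances and mixing weights below are arbitrary choices of mine for illustration, not the values behind the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters of three 2D Gaussian components (assumed values).
means = [np.array([0.0, 0.0]),
         np.array([3.0, 3.0]),
         np.array([-3.0, 3.0])]
covs = [np.array([[1.0, 0.6], [0.6, 1.0]]),
        np.array([[1.0, -0.4], [-0.4, 0.5]]),
        np.array([[0.5, 0.0], [0.0, 1.5]])]
weights = [0.4, 0.35, 0.25]

N = 500
# Pick a component for every sample according to the mixing weights,
# then draw the sample from the corresponding Gaussian.
labels = rng.choice(3, size=N, p=weights)
X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in labels])
# Plotting X coloured by `labels` gives the labelled scatter;
# dropping `labels` gives the unlabelled cloud the lecture refers to.
```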
Of course you could ask me the question: how do you know a priori how many Gaussian functions you need? That is something which is not within the scope of today's discussion, and it is a matter left to individual researchers to find, for a given data set, the optimal number K. There are methods to find the optimal K: if a data set is given to you a priori, you can often use some method to find the best number K of Gaussians to fit to it. In this case, of course, since I know the data beforehand, I will say that the number of Gaussians is three; but if you are given an arbitrary distribution which is non-Gaussian in nature, whether K should be 2, 3, 4, 10 or even more is very difficult to visualize in practice. In general, though, there are methods which people adopt to find out what the ideal value of K is.
So we can think of the mixing coefficients as prior probabilities for the individual components, and for a given value of K we can evaluate the corresponding posterior probabilities, called responsibilities, which can be visualized as latent variables in the expression we saw a couple of slides back. Let us get back to the expression of the log likelihood; but before that, we will look at this Bayes rule, in which we define an expression for γ_k as a latent variable under the Bayes paradigm. This is nothing new: it is the posterior probability written in terms of the class priors and so on, which we have discussed earlier. What we are doing here is taking the class-conditional probability to be our normal distribution in the mixture of Gaussians, so we have replaced this term here: the numerator is given by the mixing coefficient times the corresponding Gaussian, and the unconditional prior in the denominator is the sum over all the Gaussians. What is the mixing coefficient π_k here? It is the number of samples for a particular class divided by the total number of samples. Interpret the number of samples for a particular class as the number of points assigned to a particular cluster, which is not assigned beforehand: we do not know how many samples belong to a particular cluster k, so that has to be found out, and that in turn will give you the mixing coefficient π_k.
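In the usual notation, the responsibility of component k for sample x_n and the resulting mixing coefficient are

$$\gamma(z_{nk}) = \frac{\pi_k\,N(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,N(\mathbf{x}_n\mid\boldsymbol{\mu}_j,\Sigma_j)}, \qquad \pi_k = \frac{N_k}{N}, \quad N_k = \sum_{n=1}^{N}\gamma(z_{nk}),$$

where N_k is the effective number of points assigned to cluster k.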
So what does the EM algorithm do? The EM algorithm is an iterative optimization technique which operates locally to find the set of values of the parameters. What are the parameters we need to estimate here, in a Gaussian mixture? The mixing coefficients and the set of class means for the individual clusters. Let us say somebody decides: I want to fit a mixture of K Gaussians, a K-component GMM to be precise, to the data. So K is known. If there are K Gaussian clusters which you want to fit, you have K different means, and each mean has the same dimension as the data. So there are K means, K covariance matrices and K mixing coefficients; 3K seems to be the count in terms of the number of parameter sets you need to estimate. Of course each mixing coefficient is a scalar quantity, that is straightforward; for µ_k there are K means, each of dimension D; and there are K covariance matrices, each with D² elements. So these are the sets of parameters one needs to estimate.
Let us see how EM does it. There are two steps in EM: one is called the expectation step and the other is called the maximization step, and they are carried out iteratively, one after the other. You have an expectation step, or estimation step as it is also called, followed by a maximization step, and you repeat this pair of steps one after another in sequence until a condition of convergence is satisfied at the end. The estimation step is this: for the given parameter values we compute the expected values of the latent variables, and hence it is also called the expectation step. Then you have a maximization step, which updates the parameters of the model based on the latent variables, calculated using the ML method.
So let us look at the EM algorithm now. Given a Gaussian mixture model, our goal is to maximize the likelihood function, which we saw a few slides back, with respect to the parameters π, Σ and µ, comprising the means µ_k, the covariance matrices Σ_k of the components and the mixing coefficients π_k. The first step is to initialize the means µ_k, with k an index from 1 to K, the covariance terms and the mixing coefficients, and to evaluate the initial value of the log likelihood. You can start with an arbitrary set of random values here, but instead of being absolutely random, since you do not have anything like class information, what you could also do is take the mean of the entire data set and use that data mean as the initial mean of each of the clusters or Gaussians. The covariance matrices could also be started this way; in fact there has been some degree of research into what a good starting point for any such method should be, because the closer you are to the final solution, the faster you will converge and the better the solution you will end up with. So instead of starting with absolutely random values it is possible to have better initial estimates, but I am not touching those aspects in this particular talk. So, having initialized with some values, which could even be random values, we go to the E step of EM, which is the expectation step, in which you compute the latent variables given by the expression here.
This is for the kth Gaussian, and the corresponding expression is given here. In the denominator you sum up all the Gaussian distributions evaluated at the corresponding x, using the parameter values you currently have from the initial (random) guesses, and in the numerator you have the corresponding single Gaussian. After you have estimated this, let us go to the third step, which is the M step. In the second step, mind you, you estimated γ_j, or γ_k; the index has changed but it is the same variable, and it goes inside the expressions here to compute the corresponding mean and covariance. This is the same as the ML estimation done earlier, except that the latent variable now acts as a sort of weight, which gives a more accurate value of the mean and covariance matrix. The mixing coefficients are also recalculated using the latent variables computed in the E step, step 2 earlier, as given here. Correspondingly, once the γ values are available, you can see the three expressions here: the mixing coefficient, the covariance and the mean are all computed using the data sample points and the latent variables, and this is the normal distribution expression as given here. Then you obtain the Gaussian mixture using the parameters estimated in step number 3.
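In the standard notation, the M-step updates being described are

$$\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,\mathbf{x}_n, \qquad \Sigma_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,(\mathbf{x}_n-\boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n-\boldsymbol{\mu}_k^{\text{new}})^{T}, \qquad \pi_k^{\text{new}} = \frac{N_k}{N},$$

with $N_k = \sum_{n}\gamma(z_{nk})$ the effective number of points assigned to component k.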
Now what you need to do is find out whether the likelihood estimated with these parameters truly represents the data samples. If it does not, which is what will typically happen at first, you go back to step number 2; so you keep repeating steps 2, 3 and 4, which are essentially the E and M parts, and that will help you converge to a better solution. We will have an illustration of this very soon, in the next few slides. You put in a convergence test and say: I will keep repeating this process until the log likelihood of the distribution satisfies some criterion of convergence. The convergence criterion could be that over successive iterations, or over the last few iterations, the parameters do not change; the parameters being the mixing coefficients π, the means µ and the covariance or scatter matrices Σ. The other, similar criterion you can use is given in step number 4: when you estimate the likelihood, check whether it changes over a set of iterations. So you compute and store the log likelihood value from the previous iteration, compare it with the current one, and if the change is negligible, below a threshold, you say that you have met the criterion for convergence and that you have estimated the corresponding Gaussians for the distribution which we have.
So, to wind up, let us take an example now in 2D. These images have also been obtained from the book by Bishop, Pattern Recognition and Machine Learning; the reference to the book was given at the beginning of this lecture. What does it show? This is a scatter plot, and you can almost see that there are two different clusters of data, so it is possible to fit them with two different Gaussian distributions. So, almost blindly, I am selecting K to be 2; you could also select K to be 3 or 4, and there would be some sort of convergence, whether good or bad remains to be seen, but in this case let us take this example. Let us say the Gaussians have been initialized at these two places with the corresponding means and scatters: these two ellipses show the Gaussians, k = 1 here and k = 2 here, the corresponding means are at the centres of the ellipses, and the scatter of the distributions is in 2D.
So as you keep proceeding, remember what we did in the EM algorithm: there were two typical steps. One was the E step, in which, after the initial set of random values has been used to start the cycle for the parameters of the Gaussian mixture model, you estimate the latent variables; then, in the next step, you go to the M step, where you estimate the parameters again using the latent variables, and you continue this process, keeping a watch on convergence as you proceed. The initial set of points marked in blue will be assigned to this particular Gaussian. And what will happen to this set of points? They are the ones which are going to be closest to the corresponding Gaussian; remember, this was the initial stage, and you start assigning points to the distributions because these are the points lying close to the blue cluster. The Gaussian marked with the blue ellipse will be closer to this set of points, and hence they will be assigned to that Gaussian, while the points which are now labeled in red will be assigned to this one. What you do after this assignment is recompute the parameters: the mixing coefficients, the means and the scatters. Let us see how the recomputed means look after the first iteration; once you recompute them, they appear like this.
This is the distribution recomputed from the points which have been assigned to the first Gaussian, after step number one, and the corresponding one for the second Gaussian is marked here, after the second step. So this is how it starts converging: you reassign the points, and this is how the fit looks at iteration number 5, where we can see that one of the Gaussians has converged to this tight cluster of points marked in blue, while the points in red follow the other Gaussian distribution. After 20 iterations you have almost converged, and this is the final stage of iteration, where even if you continue you will not see a change in either the parameter values or the log likelihood criterion when you compute with this set of parameters. Let us have a look at this animation for a second. This is the starting point: you do not know where the clusters are or where you should place the Gaussians, so you can start almost anywhere, and then this is how the convergence takes place, at the second iteration, the 5th iteration and the 20th iteration. So this is the method by which you can fit a set of Gaussians to a scatter of data points which does not fit a single Gaussian. And this is the method which is actually used not only to form clusters but also to model data points, because after this you can apply all your methods of classification, or you can transform the clusters, group them under certain criteria, and so on. So this is one method which is used for modeling as well as for clustering. Thank you very much.