Good evening. Today's lecture is cluster analysis. Let us first draw an analogy with factor analysis and note the difference. If you recall factor analysis, there you had n observations on p manifest variables, and you created certain factors, say f1, f2, and so on. You tried to link some of the variables to factor 1, some other variables to factor 2, and the remaining variables to other factors. So in factor analysis you essentially grouped the manifest variables into different groups, the factors, and you did that with the help of multivariate observations on the p variables.
Now, let us think a little differently. Suppose I have n observations, and x1 to xp are the characteristic features of each observation. If, based on these characteristic features, you want to group the individuals into several groups, then this is known as grouping the items, or grouping the individuals, or grouping the observations; this is cluster analysis. The mathematics behind cluster analysis is different from the mathematics used in factor analysis; the purpose is different, and the objectives we pursue are also different. The only analogy is that it is also grouping: in factor analysis we group several variables into different factors, whereas here the grouping is done not with respect to the variables but with respect to the observations. With this background, let me start cluster analysis.
Today's discussion: we will first start with an example, then we will go through the criteria to be considered for clustering. Then I will show you some algorithms for clustering, known as clustering algorithms; two clustering algorithms will be discussed, one being hierarchical agglomerative clustering and the other k-means clustering. I will also show how you can run these two using SPSS as well as Minitab, in particular hierarchical clustering using SPSS and k-means clustering using Minitab. Then I will show you one case study and the conclusion; that is the totality of the presentation in this one-hour lecture. We will try to complete as much as possible; if something remains, it will be covered in the next class.

Let us define what a cluster is.
If you see the Chambers Dictionary (2005), the definition is: a cluster is a number of things of the same kind growing or joined together. For example, all of us know about the different species in biology; they are grouped together based on certain features, essentially because of their natural similarities. From the point of view of food habit, for instance, animals can be classified as herbivores, carnivores, and omnivores: three different kinds. So essentially, a group of homogeneous things is known as a cluster. The principle in grouping, as given in Kaufman and Rousseeuw (2005), is that objects in the same group are as similar to each other as possible, and objects in different groups are as dissimilar as possible.
That is the principle. So what we are saying is: clustering is based on certain characteristic features, x1, x2, up to xp. These features will be used to group several items or objects, and the grouping is based on similarity within a group and dissimilarity between groups. You want to make clusters in such a manner that, whatever objects you put in a particular cluster, say cluster 1, with another cluster 2, and perhaps a cluster 3, each containing several items, the items within a group are as similar as possible, and the items between groups are as dissimilar as possible. That is what I can call the principle, or the philosophy, of doing cluster analysis. Cluster analysis can be thought of as a process
model; in this case, consider the input-process-output model for cluster analysis. Under input we require two things: what is to be clustered, that is, the things to be clustered, and the criteria, one or several features, that will be used to cluster them. The process is to find out the way of clustering; that means you require a measure which will tell you that some of the objects or items are similar and some are dissimilar. The similar items will be joined in one group and the dissimilar items will go to different groups; that process is known as partitioning. The output of this cluster model will be several clusters or groups. Let us start with an example.
The example: the safety manager of an automobile company is interested in grouping the different departments based on their safety performance scores. Let there be 10 different departments, and let the variables of importance in measuring safety performance be the incident score, the severity score, and the equipment damage score. The manager analyzed the last 2 years' performance of the 10 departments and arrived at the performance figures given in Table 1. So what is essentially happening here: suppose you are the safety manager of a company and your duty is to keep people safe while they are at work.
And you have data on the safety performance of the different departments which are under your control; over the years you have been measuring it. Then, finally, what is your interest? There are many departments, and certain key features are measured for each department. Based on the given features you want to find out whether the safety performance of all the departments is similar, or whether there are some departments which can be grouped together, and some other departments which can also be grouped. And accordingly, what will be your benefit? Your benefit will be that you can take safety decisions group wise, or cluster wise.
So, consider the data collected for the 10 departments: you are measuring, with some measurement system, the mean incident score, the mean severity score, and the mean equipment damage score. The departments are named A, B, C, D, E and P, Q, R, S, T, and for each you have the mean incident score, mean severity score, and mean equipment damage score. So, what is your purpose? You want to see whether all these departments are performing similarly in terms of these 3 key features. Let us see what we can do with this data.
Immediately, what you can do with the data as a safety manager is look at each variable individually; this is what is shown here with dot plots. We have drawn dot plots for the mean incident score, the mean severity score, and the mean equipment damage score. Now, from the mean incident score point of view, if you see this dot plot you find 2 clusters, because some departments lie at the extreme left and some at the extreme right. So it is obvious that, from the incident score point of view, there are some departments which are performing similarly: 5 departments here and another 5 departments there. But if you compare any department from one side with one from the other side, you will find a huge difference in their mean incident scores. So you may be tempted to say: for these departments I want similar actions, and similar actions for those departments, but the actions will definitely differ from this group of departments to that group. But here one mistake is there: if I go by only MIS, then I am not considering MSS and MEDS. You are considering the performance individually, but the performance needs to be considered collectively.
So, that is possible if we consider all the 3 features at a time; we will see that a little later. First let us see the status of the departments from the MSS point of view. From the MSS point of view also you find 2 groups, but from the equipment damage score point of view it is difficult to tell whether there are several groups, many groups, or only one group. Now, as I told you, so far we have taken only one variable, one characteristic feature, at a time. If you take two at a time, you get a scatter plot like this; you have seen scatter plots earlier also, here mean incident score versus mean severity score. Now see: if I consider these 2 characteristics together, then you are getting 3 clusters. But please keep in mind, this clustering is purely visual; the circles I have drawn just come from seeing the locations of all the departments on this 2-dimensional scale. Still, it is obvious from here that there are 3 clusters.
Similarly, if I plot MIS, the mean incident score, versus the equipment damage score, you find 2 clear clusters, plus one department that we are not able to include in any cluster; it would form a unique cluster of its own, so in that sense it may again be 3 clusters. Now, if you go by the other pair, MSS and MEDS, you see 2 clusters, but you may question what the guarantee is that these points make 1 cluster, because the group is spread out: from the MSS point of view they are all at almost the same level, but from the MEDS point of view there is huge variability. Now, if you combine all the 3 together, then you see you are getting 3 clusters.
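Since we will soon move beyond what can be drawn by hand, here is a minimal sketch of this visual exploration in Python with pandas and matplotlib. The scores below are hypothetical stand-ins, since Table 1 itself is not reproduced in this transcript; the department names and the three column names (MIS, MSS, MEDS) follow the lecture.

```python
# Hypothetical safety scores for the 10 departments (stand-ins for Table 1).
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3-D projection)

data = pd.DataFrame({
    "dept": list("ABCDE") + list("PQRST"),
    "MIS":  [2.1, 2.3, 1.9, 2.0, 2.2, 7.8, 8.1, 7.5, 7.9, 8.0],  # mean incident score
    "MSS":  [1.2, 1.4, 1.1, 1.3, 1.2, 5.6, 5.9, 5.4, 5.8, 5.7],  # mean severity score
    "MEDS": [3.0, 6.5, 3.2, 6.8, 3.1, 3.3, 6.6, 3.4, 6.7, 3.2],  # mean equipment damage score
})

# One variable at a time: a dot plot per feature.
fig, axes = plt.subplots(3, 1, figsize=(6, 6))
for ax, col in zip(axes, ["MIS", "MSS", "MEDS"]):
    ax.plot(data[col], [0] * len(data), "o")
    ax.set_yticks([])
    ax.set_title(f"Dot plot of {col}")

# Two at a time (scatter plot), then all three at a time (3-D scatter).
fig2 = plt.figure(figsize=(10, 4))
ax1 = fig2.add_subplot(1, 2, 1)
ax1.scatter(data["MIS"], data["MSS"])
ax1.set_xlabel("MIS"); ax1.set_ylabel("MSS")
ax2 = fig2.add_subplot(1, 2, 2, projection="3d")
ax2.scatter(data["MIS"], data["MSS"], data["MEDS"])
ax2.set_xlabel("MIS"); ax2.set_ylabel("MSS"); ax2.set_zlabel("MEDS")
plt.tight_layout()
plt.show()
```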
Now, the question comes: what happens when there are more than 3 variables? Suppose the number of variables p is much, much bigger than 3; that is the dimension. Similarly, n, the number of observations, may be large, but since we are grouping the n observations based on the p features, it is p that is the key here. As p becomes large, this type of simple plot cannot be made: even the 3-dimensional case is difficult, and from 4 dimensions onwards you will not be able to visualize the data the way you have seen in the pictorial representations here. What we mean to say is that in practice we definitely go for more variables, more characteristics or features, for the items, objects, or individuals that we want to cluster. So, there are 2 considerations: first, what are the variables that you must consider, and second, what will be your similarity or dissimilarity measure. Under variables to be considered, it is clearly mentioned in several books, and several researchers have also pointed out, that you must be very careful while deciding which variables to include. Important variables are to be considered and trivial variables are to be discarded. Variables may be of different types based on the measurement scale: nominal, ordinal, interval, and ratio. This is a very important point, because although we will not discuss data types much, if all your data are measured on an interval or ratio scale then things become easier.
That means you will be able to get the similarity or dissimilarity measures in a much better manner and in a large number of ways. But when you have nominal and ordinal data, getting the distance is a little complicated, and you have to go for techniques like the chi-square contingency table and other measures where frequencies come into consideration; that is a different domain, of categorical data types. Things become even more complicated when you have mixed data types: some data nominal, some interval, some ratio. So when metric and non-metric data are mixed among the characteristic features that will be used in grouping the individuals, the problems multiply. In today's lecture we will basically be talking about characteristic features which are of interval or ratio type, that is, metric in nature. Now the question comes to the similarity and dissimilarity measure: it is usually a measure of distance between the objects to be clustered. For example, suppose I consider only 5 objects,
say A, B, C, D, and E, listed along both the rows and the columns. Now, what is required? If you go for distance: the distance between A and A will be 0, so the diagonal will be all 0s. Then you have d(A,B), d(A,C), d(A,D), d(A,E); then d(B,C), d(B,D), d(B,E); then d(C,D), d(C,E); and finally d(D,E). The upper half of the matrix is the same by symmetry. You must know the distance d in general: if I write d(P,Q), you must have a measure to compute it. Now, if you go for a similarity measure instead, the matrix changes accordingly. Suppose I use a 100-point scale for similarity; then the diagonal will be 100, 100, 100, 100, 100, or 1, 1, 1, 1, 1 if I say 1 is the most similar case, and the off-diagonal similarity values also have to be found. Dissimilarity is just the opposite of similarity; sometimes we may say 1 minus the similarity is a dissimilarity measure. There are several measures which can be used, and we will see them later on.
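As a minimal sketch of the two matrices just described, with hypothetical distance values for the five objects A to E:

```python
import numpy as np

labels = ["A", "B", "C", "D", "E"]
# Symmetric distance (dissimilarity) matrix: d(P, P) = 0 on the diagonal.
D = np.array([
    [0, 2, 5, 4, 6],
    [2, 0, 3, 7, 8],
    [5, 3, 0, 1, 9],
    [4, 7, 1, 0, 2],
    [6, 8, 9, 2, 0],
], dtype=float)

# One common conversion to a similarity on a 0-to-1 scale: scale the
# distances to [0, 1] and take s = 1 - d, so the diagonal becomes 1
# (every object is most similar to itself).
S = 1.0 - D / D.max()
print(S.round(2))
```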
Next, let me formally introduce the data matrix in terms of X: this is the data matrix over the different individuals, and for the case we are discussing, this is the data matrix; this is the data we plan to collect, and here is the data already collected. Our aim is not grouping the variables, as in factor analysis; our aim is grouping the objects with the help of the variables. What are the different distance measures?
All of us know the Euclidean distance. Suppose I have axes x and y, a point P at (x1, y1) and a point Q at (x2, y2). All of us know the distance d(P,Q):

d(P,Q) = [ (x1 - x2)^2 + (y1 - y2)^2 ]^(1/2).

Now, suppose there are many more dimensions, x1, x2, up to xp. Then d(P,Q) is simply extended to the p dimensions, still raised to the power one half. That is what is given here:

d_ij = [ (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 ]^(1/2),

which is the Euclidean distance between objects i and j. Many times you may be interested in giving more weight to large distances; then you may not go for Euclidean but for the squared Euclidean distance, which is similar to the first one except that the power one half is not there.
is not there. So, there is another distance which is known
as Manhattan distance, this manhattan distance is similar to like this. Suppose, you are here you want to come here
suppose this is my P, this is my Q let this one some value and this is also some co-ordinate
value to dimension case. Now, the shortest one is this, this is the shortest one, but
what you will do when you come here suppose, there is obstacle what you will do and this
is the only way to go you come here and then follow this. So, like this Manhattan city
and that roads, you just like Manhattan city roots you are coming here. That means if this
is O, you are first covering this distance then this distance.
Now, it is what is given that the you this distance is the mode value of these things
this one plus this plus similarly, there are more number of features or variables will
be adding up to P features. Minkowski distance is there is a same thing you just seen, but
it is every distance is first score power to m and then it is basically like geometric
mean. Then what you have done you have basically, again make it in the original scale so all
everything will be made that some will be made that root to the power m. Now, for the data said that what we have discussed
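As a minimal sketch, all four of these distance matrices can be computed with scipy's pdist; the small data matrix X below is hypothetical (rows are objects, columns are the p features):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([
    [2.1, 1.2, 3.0],
    [7.8, 5.6, 3.3],
    [2.0, 1.3, 6.8],
    [8.0, 5.7, 3.2],
])  # n x p data matrix

euclid    = squareform(pdist(X, metric="euclidean"))
sq_euclid = squareform(pdist(X, metric="sqeuclidean"))
manhattan = squareform(pdist(X, metric="cityblock"))        # Manhattan / city block
minkowski = squareform(pdist(X, metric="minkowski", p=2))   # m = 2; reduces to Euclidean
print(euclid.round(2))
```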
Now, for the data set we have discussed so far: if you use the Euclidean distance, you measure it for every pair of departments, 1 to 1, 2 to 2, and so on; we have computed the Euclidean distances and these are the values. The diagonal elements are 0; the off-diagonal elements have values, some high and some low. And what is required: we basically want to group these 10 departments based on this distance. If you go for the squared Euclidean distance, every entry here is basically squared, and you get these values. If you go for the Manhattan distance, as I told you, when an obstacle is there you have to go around, and ultimately you get these distance values, every object-to-object distance. Then there is the Minkowski distance matrix; here we are considering m equal to 2, written Minkowski(2), and this is the corresponding distance matrix.
So now, if I go back to the process model, you will find that we require things or objects to be clustered and characteristics to be measured; for the safety data case we have already seen these. The things to be clustered are the departments, the characteristics are the 3 features MIS, MSS, and MEDS, and then we want some similarity or dissimilarity measure; we are using distance measures, either Euclidean, squared Euclidean, Manhattan, or Minkowski distance. And then what do we require? We require a partitioning algorithm. There are hierarchical joining algorithms, such as agglomerative clustering, and nonhierarchical algorithms, like the k-means clustering algorithm. First we will discuss the hierarchical joining algorithms.
There are many ways to group objects in hierarchical joining algorithms; the names are single linkage (nearest neighbor), complete linkage (furthest neighbor), centroid linkage, average linkage, median linkage, and Ward linkage. These are the different types of algorithms for joining the objects into groups. Let us first understand: what is the single linkage algorithm?
In single linkage, the distance between 2 clusters is the distance between the 2 closest members of the 2 clusters. For example, think of a 2-dimensional case where all the objects are placed at their appropriate locations based on their characteristic features in the 2-dimensional data. Now, arbitrarily I have marked some groups: this is one group, this is another, and this is another, so we say this is cluster 1, this is cluster 2, and this is cluster 3. The question of how these groups are actually formed I will discuss a little later; for now, just accept the abstraction that different clusters are possible. If I talk about single linkage, then the distance between cluster 1 and cluster 3 is the distance between the 2 nearest objects of the 2 clusters, obviously this pair here. So the distance between 2 clusters in single linkage is the distance between the nearest neighbors.
When we talk about complete linkage, we are talking about the furthest neighbor: you find the member of cluster 3, for example, and the member of cluster 2 which are furthest apart, not nearest, and the distance between the clusters is that maximum distance. Then, when we talk about centroid linkage, the distance between 2 clusters is the distance between the multivariate means of the clusters. There are several variables (here actually 2); you take the mean value of each variable, which gives you the centroid, and then you find the distance between the centroids.
So let me write down the joining algorithms we have: single linkage, complete linkage, centroid linkage, then average linkage, median linkage, and Ward linkage. Now, for average linkage: the distance between 2 clusters is the average distance between all members of the 2 clusters. There are so many members here and so many there; you take all the pairwise distances between the members of the two clusters and average them. In the median case, it is the median distance between all members of the clusters: all of you know the median, so compute the pairwise distances, arrange them from smallest to largest, and take the median value. For Ward linkage, what is given is the average distance between all members of the two clusters, with an adjustment for covariances.
Actually, you can recall when we discussed statistical distance: if you consider Euclidean distance, all points on a circle are equidistant from the center. But suppose the data resemble an ellipse; then you cannot say, based on Euclidean distance, that the points are equidistant. If you use the Mahalanobis distance, however, all points on the ellipse are equidistant from the center, because the variability along x1 and the variability along x2 are weighted. That is why you get the form, using the Mahalanobis symbols,

d^2 = (x - mu)' Sigma^(-1) (x - mu),

and for the distance between two observations i and j, in the same manner,

d_ij^2 = (x_i - x_j)' Sigma^(-1) (x_i - x_j),

which is the straightforward multivariate derivation. So in the Ward case, the distance between the 2 clusters is computed with the adjustment of covariances; in terms of the symbols, it is the adjustment of variances and covariances. Ward's distance is similar to this statistical distance because of the adjustment for the covariances between one variable and another.
So, these linkage rules are nothing but ways to find distances, and they are applicable only once you have made groups. Suppose you have not made any groups yet: there are observations, and you know the pairwise distances. For example, take A, B, C along both sides; the diagonal distances are 0, 0, 0. Suppose d(A,B) = 2, d(A,C) = 5, and d(B,C) = 3. What do you do first? You find the smallest distance; here it is 2, so A and B can be grouped. Initially, when nothing is grouped, everything is a cluster: A is a cluster, B is a cluster, C is a cluster. After grouping, AB becomes one cluster and C remains, so the clusters are AB and C. Now AB to AB is 0 and C to C is 0, but what will be the distance between AB and C? There you have to choose one of the linkage methods and then compute it. If you use single linkage, known as nearest neighbor, then you find the distance between A and C and the distance between B and C, and take the minimum: d(A,C) = 5 and d(B,C) = 3, so the minimum is 3 and your distance is 3. If you go for complete linkage, the furthest neighbor, then ultimately the maximum of these two will be taken, which is 5. Similar rules apply for centroid, average, and the other criteria, and I will now show you how this works; see the sketch below.
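Here is a minimal sketch of that update rule for the small A, B, C example above:

```python
# Pairwise distances from the example: d(A,B) = 2, d(A,C) = 5, d(B,C) = 3.
d = {("A", "B"): 2, ("A", "C"): 5, ("B", "C"): 3}

# A and B are the closest pair, so they merge first (at distance 2).
# Distance from the new cluster {A, B} to C under the two linkages:
single   = min(d[("A", "C")], d[("B", "C")])  # nearest neighbour  -> 3
complete = max(d[("A", "C")], d[("B", "C")])  # furthest neighbour -> 5
print(single, complete)
```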
Now, what are the steps to follow in hierarchical clustering? Step 1: identify the variables and objects. Step 2: collect the data. Step 3: select the similarity or dissimilarity measure you want; usually we go for a dissimilarity measure, a distance. Step 4: obtain the distance matrix and start with n clusters. Please keep in mind that the individual items, objects, or things you are trying to group are each treated as unique, so with n objects the starting point is n clusters. Then you look at all the pairwise distances, find the minimum distance, and group: if these 2 objects are at the minimum distance, they will be grouped, and things proceed like this. So, in hierarchical agglomerative clustering, initially there are n objects; when you make the first group, only two of the original n are merged, and there are then n - 1 clusters. If the number of clusters here is n, after the first merge it is n - 1, then n - 2, and so on, until finally there is 1 cluster in which everything is grouped. That is what has to be done in hierarchical clustering: first you find the distances.
So, you take each object as a cluster, find the distances between the clusters, and consider every pair of clusters. Suppose the most similar clusters are P and Q; if this is P and this is Q, and they are the most similar, these two are grouped. When you group 2 similar clusters, your number of clusters reduces to n - 1; that is what is given here: merge clusters P and Q and label the newly formed cluster PQ, then update the entries of the distance matrix D by deleting the rows and columns corresponding to clusters P and Q.
Basically, if I have 5 items and you find that A and B are most similar at the first level (their distance is the least), then from n = 5 clusters these two are grouped: the clusters become AB, C, D, and E, so the number of clusters is n - 1, that is, 4. What about the entries of the distance matrix? The original matrix is n by n; now it will be (n - 1) by (n - 1). That is what is said here: delete the rows and columns corresponding to clusters P and Q, here A and B, and add one row and one column giving the distances of the new cluster PQ, here AB, to the remaining clusters. So you delete 2 rows and 2 columns and add 1 row and 1 column, and ultimately n - 2 + 1 = n - 1 clusters remain. Then you repeat steps 6 and 7, n - 1 times in all, until all the objects are grouped: you have to go down to the 1-cluster level, and when you reach there you stop. The final step, step 9, is to record the identity of the clusters that are merged and the levels at which each merger takes place. A minimal sketch of this whole loop follows below.
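Putting steps 4 to 9 together, here is a sketch of the agglomerative loop, assuming single linkage for the update (swap min for max to get complete linkage); `agglomerate` is a hypothetical helper written for illustration, not a library function:

```python
import numpy as np

def agglomerate(D, labels):
    """Merge the closest pair repeatedly; record who merged and at what level."""
    D = D.astype(float).copy()
    clusters = [frozenset([l]) for l in labels]   # start: n singleton clusters
    np.fill_diagonal(D, np.inf)                   # ignore self-distances
    history = []
    while len(clusters) > 1:
        i, j = np.unravel_index(np.argmin(D), D.shape)   # closest pair
        if i > j:
            i, j = j, i
        history.append((clusters[i], clusters[j], D[i, j]))
        # Single-linkage update: new row = element-wise min of rows i and j.
        new_row = np.minimum(D[i], D[j])
        D[i] = new_row; D[:, i] = new_row; D[i, i] = np.inf
        # Delete the row and column of j: (k x k) becomes (k-1 x k-1).
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]
    return history

D = np.array([[0, 2, 5], [2, 0, 3], [5, 3, 0]])
for a, b, level in agglomerate(D, ["A", "B", "C"]):
    print(sorted(a), "+", sorted(b), "merged at distance", level)
# ['A'] + ['B'] merged at distance 2.0
# ['A', 'B'] + ['C'] merged at distance 3.0
```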
So, starting from the n-cluster level, what happens ultimately? Suppose first you merge these two, then merge this one, then merge another two here; then perhaps you merge those two with another one, and finally the last merger takes place here: this end is the single cluster, and that end is the n-cluster level. Until you come all the way down you do not stop: first these are individual objects, then objects become groups, and finally the single cluster results. Now, where should you stop? That all depends on the distance between the objects within a group, say the maximum distance between objects in a group (the between-group distance comes from the linkage, say single linkage), and on whether you are willing to accept that within-group distance or not. What happens as you go toward a smaller number of clusters is that the distance between the objects within a cluster keeps increasing, for every cluster, and after a certain point you may not accept that distance. This is a little difficult to understand in terms of distance alone, so let us go by a similarity measure.
What I mean is: if I cut here, this makes one cluster, a second cluster, a third, and a fourth; if I come down here, these 3 are merged and you are making 2 clusters. Then you look at the similarity of all the objects within a cluster: at the very start it is 1, that is, 100 percent similarity within. As you merge, the similarity slowly reduces from 100 percent, and you have to decide how much similarity you want to keep: is it 70 percent, 50 percent, 10 percent, 0 percent? Suppose at this level it is 70 percent: are you happy with 70 percent similarity between the objects within a group? If you are happy, in this case you can keep 2 clusters; here things are such that all the items except this one are grouped under cluster 1, and this one is cluster 2. So that is step 9: record the identity of the clusters that are merged and the levels at which each merger takes place.
Now, based on this you can give each cluster some identity, preferably an interpretable one: maybe these groups are of different types. For example, in the safety example I have given, maybe these departments form the low-accident-prone group and those the high-accident-prone group; so the 2 clusters are the low-accident departments and the high-accident departments, or low, medium, and high, and so on; there are many ways you can label them. A sketch of this cutting step follows.
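In scipy, this "choose the level" step is the fcluster function: you cut the tree either at a chosen distance or at a chosen number of clusters. A minimal sketch, with a hypothetical 4-object distance matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([
    [0,  2,  6, 10],
    [2,  0,  5,  9],
    [6,  5,  0,  4],
    [10, 9,  4,  0],
], dtype=float)

Z = linkage(squareform(D), method="single")      # linkage expects the condensed form

print(fcluster(Z, t=3.0, criterion="distance"))  # cut at distance 3 -> 3 clusters
print(fcluster(Z, t=2,   criterion="maxclust"))  # ask for exactly 2 clusters
```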
Now I will explain how single linkage, the nearest neighbor method, actually works. It is similar to the example I have already given you, but more formal; just follow the example and try to understand it fully. There are 5 objects and this is the distance matrix. At the first instance, what you are required to do is find which pair of objects has the least distance. Here 2 is the least distance value among these values; forget about the 0s, because the distance of an item to itself is always 0, and we are talking about item-to-item distances between different items. So it is 2.
That 2 is the distance between objects 3 and 5, so 5 and 3 are grouped. What we have done: we deleted the rows and columns for objects 3 and 5 and added 1 row and 1 column for the new cluster (5,3). Now, the question is: what is the distance between the cluster (5,3) and object 1? By the nearest neighbor rule, you find the distance between 1 and 5 and between 1 and 3, and take the minimum: d(1,5) = 11 and d(1,3) = 3, so the minimum is 3, and you take 3. Similarly, for the rest of the items you find (5,3) versus 2 and (5,3) versus 4. For (5,3) versus 2, you look at d(2,3) = 7 and d(2,5) = 10; the minimum of 7 and 10 is 7, so it comes to 7. Similarly you get 8 for (5,3) versus 4. The rest of the entries are already available, because the distances 2 to 1, 2 to 4, and 1 to 4 are unchanged; for example, 1 to 2 is 9, and that 9 simply carries over.
So this is your new distance matrix. Now what you want is to find which members or objects can be grouped further; here (5,3) is one object, one group, so you treat it as one object. Again you find the minimum distance value: out of these, 3 is the minimum, between (5,3) and 1. That is why in this step (5,3,1) is grouped and 2 and 4 remain. So how many clusters are you making here? You are making (1,3,5) one group, 2 another group, and 4 another group, so ultimately it is 3 clusters.
If I call this cluster 1, then in this case the 3 objects 1, 3, 5 are grouped in it; cluster 2 has only 1 object, object 2; and cluster 3 has only object 4. Now, about the distances: we are using single linkage. In the first stage, when there are no groups, in the sense that no 2 objects share a group and every object represents its own group, the only zero distances are the diagonal elements. When I come to the 4-cluster stage, the second table, the minimum distance found is the distance between 3 and 5; there are two items, so you get one merging distance, which is 2.
Now come to the third stage, when the 3 members 5, 3, 1 are grouped under one cluster: the distance we report is 3, because that is the level at which (5,3,1) formed. It is not that we take the maximum; it all depends on which linkage you are using, and the maximum is what you would take if you went for complete linkage. After that, a similar table comes again and you find that 2 and 4 are grouped, giving two clusters, (1,3,5) and (2,4), at a distance of 5; and finally 1 cluster at a distance of 6, which is the minimum distance between the members of the two clusters.
Now, the same thing can be represented in a diagram like this. When the distance is 0, every object forms a cluster of its own; there are 5 objects, so 5 clusters. At distance level 1 no grouping is possible; only at distance level 2 are 5 and 3 grouped. At that level you are sacrificing a dissimilarity of 2 between these two members, but in the process you are gaining one thing: your number of clusters reduces from 5 to 4. At a distance of 3 your number of clusters reduces to 3, at the cost of sacrificing a distance of 3. At a distance of 4 nothing happens, but at a distance of 5 you come down to 2 clusters, and at a distance of 6 to 1 cluster. What is this diagram known as? This diagram is known as a dendrogram; it is just what I showed you before, rotated 90 degrees.
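Here is a minimal sketch reproducing this 5-object single-linkage example with scipy, including the dendrogram. Not every entry of the distance matrix is read out in the lecture; the values involving object 4 below are filled in so that the merge levels come out exactly as quoted (2, 3, 5, 6).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

D = np.array([
    #  1   2   3   4   5
    [  0,  9,  3,  6, 11],   # 1
    [  9,  0,  7,  5, 10],   # 2
    [  3,  7,  0,  9,  2],   # 3
    [  6,  5,  9,  0,  8],   # 4
    [ 11, 10,  2,  8,  0],   # 5
], dtype=float)

Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge levels: [2. 3. 5. 6.]

dendrogram(Z, labels=["1", "2", "3", "4", "5"])
plt.ylabel("distance")
plt.show()
```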
Next, let us go for complete linkage with the same example. The process steps are similar; the only difference is the calculation of the distance measure, which is now the maximum distance between the members of the two groups. In that case, if you go on grouping, your grouping is again 5 and 3 first, then 2 and 4, then 1 joins, and finally all are grouped. But look at the distance measure: for the final single group you are sacrificing a distance of 11, because here you have considered the maximum distance, and in this table the maximum distance is 11. If I compare this with the earlier result: at the early stage both methods put 5 and 3 together, but then a difference appears.
In single linkage, 1, 3, 5 were grouped and 2, 4 remained together as the other cluster; but here, after 3, 5 and 2, 4 are formed, the subsequent merging differs, so the dendrogram and the grouping change. It may well happen that you do not get the exact grouping (by exact grouping I mean the same grouping) when you go for different linkages: with single linkage you may get one type of grouping, with complete linkage some other grouping; see the comparison sketch below.
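A minimal sketch of that comparison, reusing the matrix D from the single-linkage sketch above:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

condensed = squareform(D)   # D as defined in the previous sketch
for method in ["single", "complete", "average"]:
    Z = linkage(condensed, method=method)
    print(method, fcluster(Z, t=2, criterion="maxclust"))
# On this matrix, single and average split the objects as {1, 3, 5} vs {2, 4},
# while complete linkage splits them as {3, 5} vs {1, 2, 4}.
```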
If you go for average linkage, here shown with a different example, you again get a different grouping. But please keep in mind, the message is this: you should be able to calculate the distance between the objects within a group and also between the groups, because we require similarity within and the highest level of dissimilarity between. Then, using this hierarchical clustering algorithm, you will be able to obtain the dendrogram, decide what level of similarity or distance you want to keep, and if you are satisfied with that amount of similarity, fine, you go for that level.
In the next class I will continue this cluster analysis: I will go through some other algorithms for hierarchical clustering, and then we will discuss k-means clustering. Then we will see one case study, and how it can be done using SPSS and using Minitab.
Thank you very much.