Good morning. Today we come to the last lecture of this week, the last lecture on clustering, and in fact the last lecture of this class. We will talk about hierarchical clustering; more specifically, about a bottom-up hierarchical clustering algorithm. Earlier we saw the k-means algorithm, which can be used as the basis of a top-down clustering method. Today we will look at a bottom-up clustering method. Now, hierarchical clustering is very common.
For example, when scientists look at the animal kingdom, they divide it hierarchically: animals are vertebrates or invertebrates; within vertebrates there are fish, reptiles, amphibians, mammals, and so on. This produces a nested sequence of clusters. In order to get a hierarchical clustering we can recursively use a partitional clustering algorithm like k-means. We discussed the k-means algorithm in the last class, and it can be applied recursively to do hierarchical clustering. But today we will talk about a bottom-up clustering
method called agglomerative clustering. In agglomerative clustering we build the clusters from the bottom level: given m objects, we initially treat every object as a cluster of its own. Then we find the distance between each pair of clusters, and the pair of clusters that is closest is merged, so that at the next step we have m minus 1 clusters. Again we find the closest pair of clusters and merge them, so that we have m minus 2 clusters. We go on like this until all the clusters have been merged into one. So, this is the gist of agglomerative hierarchical
clustering. We could also use divisive clustering to make hierarchical clusters: we can start with all the data in one cluster, take k equal to 2 and divide it into 2 clusters, and then recursively keep dividing each cluster. This gives top-down hierarchical clustering; however, we will not discuss it further in this class. Now, bottom-up hierarchical clustering
gives rise to a structure called the dendrogram. What is happening is this: among the different objects, suppose these two objects are the most similar to each other; they are merged at step 1. In step 2, the next most similar pair is merged; in step 3, the next most similar pair; and so on through steps 4, 5, 6, and 7, each time merging the closest remaining pair. This structure gives rise to a tree which we call a dendrogram. And, as we see, in the dendrogram the nodes represent
subsets of S. This node represents these 3 points, this node represents these 6 points, and this node represents these 3 points; so each node represents a subset of the points. The characteristic of the tree is that the root is the whole input set, the leaves are the individual elements, and the internal nodes are defined as the unions of their children. We can also represent this by this sort of
diagram: this is one cluster, this is another cluster, this is a nested cluster, and these are nested clusters. So the objects are hierarchically nested, which can be represented either by this sort of set diagram or by this sort of dendrogram.
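To make the nested-subset view of dendrogram nodes concrete, here is a minimal Python sketch, assuming NumPy and SciPy are available; the six one-dimensional points and the helper name leaves are my own illustration, not from the lecture.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, to_tree

    # Six illustrative one-dimensional points.
    X = np.array([[0.0], [0.1], [0.9], [1.0], [5.0], [5.1]])

    # Build a dendrogram (single-link, one of the measures discussed later) and walk it as a tree.
    root = to_tree(linkage(X, method="single"))

    def leaves(node):
        # The subset of the original points represented by a dendrogram node.
        return sorted(node.pre_order())

    print(leaves(root))                           # the root is the whole input set
    print(leaves(root.left), leaves(root.right))  # an internal node is the union of its children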
So, this tree may be cut at different levels. If you cut at the root you get only one cluster. If you cut at a particular level, the cut gives you a certain number of clusters. For example, this cut gives you 1, 2, 3, 4 clusters, and this cut gives you 1, 2, 3, 4, 5, 6 clusters. So by cutting the dendrogram at different levels you get different numbers of clusters. You need not decide the number of clusters a priori; you can construct the dendrogram and cut it at the desired level to get the number of clusters that you require. So, for different values of k you get different clusterings, as illustrated by this picture.
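As a quick illustration of cutting one dendrogram at different levels, here is a small sketch using SciPy's linkage and fcluster functions; the toy data points are made up for illustration, and single-link is one of the distance measures we discuss later in this lecture.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy two-dimensional data: three loose groups.
    X = np.array([[0, 0], [0, 1], [1, 0],
                  [5, 5], [5, 6], [6, 5],
                  [10, 0], [10, 1]], dtype=float)

    Z = linkage(X, method="single")   # build the full dendrogram once

    # Cut the same dendrogram at different levels to get different numbers of clusters.
    for k in (1, 2, 3, 4):
        labels = fcluster(Z, t=k, criterion="maxclust")
        print(k, labels)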
The basic algorithm is very simple. Initially, each data point forms a cluster. We compute the distance or proximity matrix between clusters, and then we repeat the following two steps: we merge the two closest clusters, and after merging we update the distance matrix. We continue these two steps until we are left with only a single cluster. Now, the main thing about this algorithm is the distance measure: different definitions of the distance between clusters give us different algorithms.
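The following is a naive from-scratch sketch of this basic loop, with the cluster-to-cluster distance left as a pluggable function; the names agglomerate, cluster_dist, and single_link are my own, and the code mirrors the steps above rather than being an efficient implementation.

    import numpy as np

    def agglomerate(X, cluster_dist, target_k=1):
        # Start with every point as a cluster of its own (lists of point indices).
        clusters = [[i] for i in range(len(X))]
        while len(clusters) > target_k:
            # Find the pair of clusters with the smallest distance.
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = cluster_dist(X[clusters[a]], X[clusters[b]])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            # Merge the closest pair; this is where the distance matrix would be updated.
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return clusters

    # One possible distance measure: single-link (closest cross-cluster pair of points).
    def single_link(A, B):
        return min(np.linalg.norm(a - b) for a in A for b in B)

    X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)
    print(agglomerate(X, single_link, target_k=2))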
In the initial step, every point is a cluster. This is a set of objects, and initially every object is a cluster by itself; so every point is a cluster, and if there are m points we have an m by m proximity matrix. There are m squared pairs, and for each pair we can compute the distance and fill it into this matrix. Then at an intermediate stage we would have
a number of clusters, and the proximity matrix will be defined over them; if there are k clusters, there will be k squared entries in the proximity matrix. Now, suppose in this matrix we compute the distances between all the pairs of clusters and find that C2 and C5 are the closest. What we will do is merge C2 and C5, so that these 2 rows and these 2 columns in the proximity matrix are collapsed into 1 row and 1 column. At the next step we will have 4 clusters, so 4 rows and 4 columns in the proximity matrix. After merging we get these 4 rows and 4 columns, and we have to find out how to fill in the row and column of the merged cluster. How we fill them in will depend on the distance measure that we use, and we will talk about certain common measures that are used for this.
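As a small illustration, here is one way the merged cluster's row and column of the distance matrix can be filled in, keeping the minimum for single-link or the maximum for complete-link (both discussed next); the matrix values and the helper name merge_rows are made up for this sketch.

    import numpy as np

    def merge_rows(D, i, j, method="single"):
        # Distance from the merged cluster (i union j) to every other cluster:
        # single-link keeps the minimum, complete-link keeps the maximum.
        combine = np.minimum if method == "single" else np.maximum
        new_row = combine(D[i], D[j])
        # Collapse rows/columns i and j into one row/column for the merged cluster.
        keep = [k for k in range(len(D)) if k not in (i, j)]
        D2 = D[np.ix_(keep, keep)]
        new_to_others = new_row[keep]
        D2 = np.vstack([D2, new_to_others])
        D2 = np.column_stack([D2, np.append(new_to_others, 0.0)])
        return D2

    # A 5x5 distance matrix for clusters C1..C5 (symmetric, zeros on the diagonal).
    D = np.array([[0, 4, 9, 7, 6],
                  [4, 0, 3, 8, 1],
                  [9, 3, 0, 2, 5],
                  [7, 8, 2, 0, 4],
                  [6, 1, 5, 4, 0]], dtype=float)

    # Here C2 and C5 (indices 1 and 4) are the closest pair, so we merge them.
    print(merge_rows(D, 1, 4, method="single"))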
Now, when we have to find the closest pair, as we said, there are different ways to measure the distance between 2 clusters; some popular methods are single-link, complete-link, average-link, and centroid. In single-link clustering, when we look at the similarity between 2 clusters, suppose we have cluster Ci comprising the points x1 and x2, and cluster Cj comprising x3, x4, and x5. To find the distance between Ci and Cj, we compute the similarity of x1 with x3, x1 with x4, x1 with x5, and x2 with x3, x4, and x5; among these we find the most similar pair, such that one point belongs to Ci and the other belongs to Cj. The similarity of the most similar pair we
take as the similarity between the 2 clusters in single-link clustering. In complete-link, which we will talk about next, we take the least similar pair as the measure of similarity between the 2 clusters. So, when we take the similarity between two clusters, in single-link we take the value of the similarity of the most similar pair; in complete-link we take the similarity as the value of the least similar pair. In average-link, we take the average of the pairwise similarities. And in the centroid method, we represent each cluster by its centroid and look at the similarity between the 2 centroids. So, in single-link the distance between clusters
Ci and Cj is the minimum distance between any object in Ci and any object in Cj, or equivalently the maximum similarity of one object in Ci with one object in Cj. This is given by the formula sim(Ci, Cj) = max over x in Ci, y in Cj of sim(x, y). For example, if these are the points, in single-link
we will take the most similar points, then the next most similar points, and so on, and combine them. As we can see, single-link clustering can result in straggly, that is, long and thin, clusters due to the chaining effect. In single-link clustering the similarity is determined by one pair of points, that is, one link in the proximity graph. If the proximity graph has these values, what we need to do is look at the most similar pair and merge them. In the complete-link method, on the other hand,
the distance between 2 clusters is the distance between the least similar pair, which is given by the formula sim(Ci, Cj) = min over x in Ci, y in Cj of sim(x, y). Complete-link makes tighter, more spherical clusters, unlike single-link where you can have long, thin clusters. However, complete-link clustering is sensitive to outliers, because they are far away: if there is one object in a cluster which is very far away from the other objects, it will dominate the similarity, which is not always desirable. Again you have this matrix and you
apply complete-link. In the same example as we saw before, we can show the different steps that we get when we apply complete-link. We notice that here the clusters are tighter than what we get in the single-link example. Now, let us discuss the computational complexity
of these clustering methods. In the hierarchical agglomerative clustering method, in the first step you need to find the similarities between the objects: for N objects you require N squared similarity computations at the beginning. Then you have N minus 1 steps, and in each step you are merging two clusters. While merging, you have to find the most similar pair and update the proximity matrix. How much time you spend will depend on the type of clustering method that you are using (single-link, complete-link, average-link, centroid-based) and it also depends on the type of data structure that you maintain. If you do this naively it is usually of order N cubed, but in certain cases you can also achieve N squared log N complexity. We will not describe the details. So, we have talked
about single-link and complete-link. In average-link clustering, average-link has actually been described in two different ways: you can average across all ordered pairs in the merged cluster, or over all pairs between the two original clusters. The second formulation is shown here, but there is the other formulation of average-link clustering as well. So here we define the similarity of clusters Ci and Cj as the average of sim(x, y), where x belongs to Ci and y belongs to Cj. It is a compromise between single-link and complete-link: it is less susceptible to noise and outliers, and it also produces more spherical clusters. The complexity we have already talked about.
With this we come to the close of today's lecture. There are other clustering methods, like density-based clustering, for example the DBSCAN algorithm, and several others, but we will not talk about them in this course. So, this brings us to the end of the class on clustering, and also the end of the course on machine learning. I hope you have all enjoyed this course. Thank you very much for staying with me through the entire course. All the best.