Introduction to Clustering

Captions
Good morning. Today, we will talk about clustering. In the classes so far in this course, we have mainly talked about supervised learning. In supervised learning, we have some data which are labeled, and we try to come up with a method to label unseen instances correctly. Today, we will look at unsupervised learning, and specifically at one type of unsupervised learning called clustering. We will introduce what clustering is, and in subsequent classes we will give illustrations of specific clustering methods.

In unsupervised learning, as we discussed in the first class of this course, we have data, but those data have no labels. We have unlabeled data instances, and we want to find hidden structure in the unlabeled data. We want to explore the data to find some structure in them, but there is no gold standard, no label, to tell us what we should get. Now, why would we be interested in this? When we have a large amount of data, unless we are able to group the data, we will not be able to do proper study and exploration of it. For example, take the plant and animal kingdoms. Biologists have studied the different species of plants and animals and come up with groupings of them, and these groups can be called clusters. They have grouped animals into vertebrates and invertebrates, and then vertebrates into mammals, fish, etcetera. These labels were not given to the early scientists; they came up with these groupings based on species which share similar attributes.

Now, clustering is the most popular type of unsupervised learning. There are a few other types of unsupervised learning which we will not talk about in this class. In clustering, the task is: given a set of data instances, group the instances into clusters. Earlier, our sample contained $(x_i, y_i)$ pairs; here the sample will contain only $x_1, x_2, x_3$, etcetera. So, only the data points are given, and no output label is given. Given a set of these instances, we want to find clusters $C_1, C_2, C_3$. So, $C_1$ will contain a number of these items: suppose $x_2, x_3, x_7$ go in $C_1$; $x_1, x_4, x_{10}, x_{11}$ go in $C_2$; and $x_5, x_6, x_8, x_9, x_{12}$ go in $C_3$. Suppose these are three different clusters. Given this sample, we come up with groups, and how do we come up with groups? Instances which belong to the same group will in some way be similar to each other. So, $x_2, x_3, x_7$ should have some similarity among themselves; $x_1, x_4, x_{10}, x_{11}$ should have some similarity with each other; and $x_3$ and $x_4$, two elements which belong to different clusters, will be in some way dissimilar to each other. Based on this, we can come up with groups.

Now, clustering is useful for many applications. For example, it can be used to automatically organize data, such as plant species, animal species, news documents, or books. It can be used for understanding hidden structure in data, and sometimes clustering is used as a preprocessing step for further analysis of the data.
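To make the task concrete, here is a minimal sketch in Python of grouping an unlabeled sample into clusters. The choice of scikit-learn and the toy data are my own illustration; the k-means algorithm it calls is only introduced in a later class.

```python
# A minimal sketch of the clustering task, assuming scikit-learn is available.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled sample: only the x_i values are given, no labels y_i.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one natural group
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])  # another natural group

# Ask for k = 2 clusters; the algorithm assigns each instance a cluster id.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1]: instances in the same cluster are similar
```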
Now, let us look at this slide for an application of news clustering. Some of you will have used Google News. In Google News, you will notice that the different news stories are grouped; for example, this is one news story about the bombing of Istanbul airport on 29th June 2016, and the related news stories are grouped under it. This is unsupervised clustering, because this particular news story was not given as a label; rather, the related news items were grouped together. So, Google News is an example of a system which does news clustering.

Then this is an example of clustering of gene expression data: the genes are clustered based on their gene expression. There are other applications. For example, as I have mentioned, in biology there is the classification of the plant and animal kingdoms given their features. In marketing, there is customer segmentation based on a database of customer data containing their properties and past buying records. This is very useful to marketing companies because, based on the grouping of the customers, they can decide what type of promotions to target to each customer. A third example is clustering of weblog data to discover groups of similar access patterns, and a fourth example is recognizing communities in social networks based on their similarities.

In order to talk about clustering algorithms, let us look at some example data. These are the data points; only the data points are given, their labels are not given. But in this particular data you can see visually that there are four natural clusters, and what a clustering algorithm is required to do is take this data and come up with these four clusters. Ideally, this is what the clustering algorithm will do.

So, there are different aspects of clustering. First of all, there is the clustering algorithm, and there are different types of clustering algorithms; we will discuss a few of them in the next few classes. There are algorithms which are partitional in nature, which take the data and divide it into a number of groups; k-means is an example of a partitional clustering algorithm. Then there are hierarchical clustering algorithms, which divide the data into clusters hierarchically; hierarchical algorithms may be based on a top-down method or a bottom-up method, the latter being what is done in agglomerative hierarchical clustering. Again, we will see this in that class. There are other methods, like model-based methods, for example mixtures of Gaussians, and density-based methods like DBSCAN, etcetera.

Secondly, there is a distance or similarity function which the clustering algorithm uses or tries to optimize. We said earlier that the elements in a cluster are similar to each other and the elements belonging to two different clusters are different from each other, and to measure how similar or dissimilar two elements are, we have to use a metric: a similarity or dissimilarity measure. Some possible measures are Euclidean distance, cosine distance, Pearson correlation coefficient, etcetera.

Thirdly, one has to have a way of evaluating how good the clustering is. For that, one can look at different methods; for example, one can try to minimize the intra-cluster distance of elements which belong to the same cluster and maximize the inter-cluster distance of elements that belong to different clusters. The quality of a clustering result depends on the algorithm, the distance function used, and the application for which you are using it.
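The intra-cluster and inter-cluster distances just mentioned are easy to compute directly. Here is a minimal NumPy sketch, with function names of my own choosing:

```python
# A small sketch of the two quantities a good clustering balances:
# intra-cluster distance (kept small) and inter-cluster distance (kept large).
import numpy as np

def mean_intra_cluster_distance(cluster):
    """Mean pairwise Euclidean distance between points of one cluster."""
    diffs = cluster[:, None, :] - cluster[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)
    n = len(cluster)
    return d.sum() / (n * (n - 1)) if n > 1 else 0.0

def inter_cluster_distance(c1, c2):
    """Euclidean distance between the centroids of two clusters."""
    return np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0))

c1 = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
c2 = np.array([[5.0, 5.0], [5.2, 4.8]])
print(mean_intra_cluster_distance(c1))  # small: members are similar
print(inter_cluster_distance(c1, c2))   # large: clusters are well separated
```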
So, these are some of the major clustering approaches: partitioning-based methods, which involve constructing various partitions; hierarchical methods, which create a hierarchical decomposition of the set of objects; model-based methods, which hypothesize a model for each cluster and find the best fit of the models to the data; density-based clustering algorithms, which are guided by connectivity and density functions; and graph-theoretic clustering, based on the construction of a graph and looking at some graph-theoretic measures, and so on. These are some of the different clustering approaches; a few of them we will talk about in this class.

So, partitioning algorithms construct a partition of a given data sample into $k$ clusters. So, $k$ is given to the algorithm, and it will find $k$ clusters that optimize the chosen criterion. The algorithm may come up with a global optimum of the chosen criterion, or use a heuristic method and come up with a local optimum. We will explore in detail the k-means algorithm, which comes up with a local optimum based on a certain criterion.

The second type of clustering algorithm is hierarchical clustering. For example, animals may be grouped into vertebrates and invertebrates; vertebrates may be broken into fish, reptiles, amphibians, mammals, etcetera. This is an example of using a tree, or hierarchical clustering. We will look at some methods for hierarchical clustering which produce a nested sequence of clusters. You can, for example, use a partitional algorithm and recursively apply it to get a hierarchical clustering, or you may do bottom-up clustering, where you start with each item in its own cluster and repeatedly merge clusters until you get one cluster; as a result you get a nested tree (a sketch of this bottom-up method follows this overview). We will talk more about this in a later class.

A third type of clustering is model-based clustering, where given the data points you hypothesize a model. For example, you can think of each cluster as being represented by a Gaussian distribution with a mean and a standard deviation, and you try to fit the data to the model; for example, you can come up with three clusters this way.

A fourth type of algorithm is called density-based clustering. We will not talk about it in this class if we do not have time, but it is based on the density of a region. The density of a region is the number of instances in that region of the feature space. It locates regions of high density and connects those points together. DBSCAN is a popular density-based clustering algorithm.

Then we have graph-theoretic clustering algorithms, which take nodes to represent the different items, with the weights of the edges based on the similarity of the items. Based on this a graph is constructed, and certain graph algorithms are used to find strongly connected components, for example by looking for a minimum cut in the graph. Again, we will not talk about these types of algorithms in this class.
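Here is the sketch of bottom-up (agglomerative) hierarchical clustering promised above. The lecture only describes the idea, so the choice of SciPy and of average linkage is my own:

```python
# A hedged sketch of bottom-up hierarchical clustering, assuming SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.1, 5.2], [9.0, 1.0]])

# Start with each point as its own cluster and repeatedly merge the two
# closest clusters; 'Z' encodes the resulting nested tree (dendrogram).
Z = linkage(X, method='average', metric='euclidean')

# Cut the tree to recover a flat clustering, e.g. into 3 clusters.
print(fcluster(Z, t=3, criterion='maxclust'))  # e.g. [1 1 2 2 3]
```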
The second aspect of a clustering algorithm is the metric that we use: a distance metric or a similarity metric. So, there are certain distance metrics, for example the Minkowski family of distance measures. Given two items $x_i$ and $x_j$, and letting $n$ be the number of input attributes, the Minkowski distance is

$$d(x_i, x_j) = \left( \sum_{s=1}^{n} |x_{is} - x_{js}|^p \right)^{1/p}.$$

Now, if you set $p = 2$ you get the Euclidean distance,

$$d(x_i, x_j) = \sqrt{\sum_{s=1}^{n} (x_{is} - x_{js})^2},$$

and if you set $p = 1$ you get the Manhattan distance,

$$d(x_i, x_j) = \sum_{s=1}^{n} |x_{is} - x_{js}|.$$

One can also use $p = 3, 4$, and so on. So, this is one type of distance metric.

A second metric, which is often used when you work with text data in the vector space model, is the cosine similarity. Given two objects whose vectors are $x_i$ and $x_j$, you find the cosine between these two vectors, which can be computed as the dot product of the vectors divided by the product of their norms:

$$\cos(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}.$$

Here $\|x_i\| \, \|x_j\|$ is the normalization factor and $x_i \cdot x_j$ is the dot product. So, basically what we are doing is finding the cosine of the angle between the two vectors.

Then, apart from such metrics, there are correlation coefficients, which are scale invariant, and there are different such measures, for example the Mahalanobis distance, the Pearson correlation coefficient, the Spearman correlation, and so on. The Mahalanobis distance is taken as

$$d(x_i, x_j) = \sqrt{(x_i - x_j)^\top \Sigma^{-1} (x_i - x_j)},$$

where $\Sigma$ is the covariance matrix; if $\Sigma$ is the identity matrix, this is equal to the Euclidean distance. Then we have the Pearson correlation coefficient, which is given by

$$\rho(x_i, x_j) = \frac{\operatorname{cov}(x_i, x_j)}{\sigma_{x_i} \sigma_{x_j}},$$

where $\sigma_{x_i}$ is the standard deviation of $x_i$ and $\operatorname{cov}(x_i, x_j)$ is the covariance. So, these are some of the similarity metrics that are used in clustering algorithms.
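These measures translate directly into NumPy. The following is a sketch under my own naming; libraries such as SciPy provide tested implementations of the same functions:

```python
# Direct NumPy translations of the distance and similarity measures above.
import numpy as np

def minkowski(xi, xj, p):
    """d(x_i, x_j) = (sum_s |x_is - x_js|^p)^(1/p)."""
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def cosine_similarity(xi, xj):
    """cos(x_i, x_j) = (x_i . x_j) / (|x_i| |x_j|)."""
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))

def mahalanobis(xi, xj, cov):
    """sqrt((x_i - x_j)^T Sigma^{-1} (x_i - x_j)) for covariance Sigma."""
    d = xi - xj
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 4.0])
print(minkowski(xi, xj, p=2))   # Euclidean distance (p = 2)
print(minkowski(xi, xj, p=1))   # Manhattan distance (p = 1)
print(cosine_similarity(xi, xj))
print(mahalanobis(xi, xj, np.eye(3)))  # identity covariance: same as Euclidean
```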
The next aspect which is of importance to us is how to measure the quality of a clustering. There are two types of quality measures: internal evaluation and external evaluation. In internal evaluation, you evaluate the cluster quality using only the clusters and the data that you have, whereas in external evaluation you use additional data. For example, internal evaluation may try to look at how close the points in a cluster are to each other and how far they are from the members of other clusters. There are several such measures; one of them is the Davies-Bouldin index, which is computed as

$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)},$$

where $n$ is the number of clusters, $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$, and $\sigma_i$ and $\sigma_j$ are the spreads of those clusters. So, this is one measure of the quality of a clustering; a lower value indicates compact, well-separated clusters.

Then, if you have some labeled data, but you ignore the labels and come up with a clustering of the data, you can compare the clusters that you have got with the labels, and this gives you external evaluation. There are different external evaluation metrics, like the F-measure, the Jaccard index, the Rand index, etcetera. Now, we have earlier defined what we mean by true positive, true negative, false positive, and false negative.

To refresh your memory, let me draw this table. Suppose this is the actual label of the data, which is either plus or minus, and this is what is predicted by an algorithm, predicted plus or predicted minus. Now, for a set of items, if the actual label is plus and your algorithm has predicted plus, they are true positives (TP). If the actual label is minus and your algorithm has predicted minus, they are called true negatives (TN). But if the actual label is plus and your algorithm has predicted minus, they are called false negatives (FN), and if the actual label is minus and your algorithm has predicted plus, they are called false positives (FP).

Now, TP and TN are the two regions where your algorithm has worked correctly. The Rand index is given by

$$RI = \frac{TP + TN}{TP + TN + FP + FN},$$

the fraction of examples for which your algorithm has made the correct decision. The Jaccard index of two sets $A$ and $B$ is given by $|A \cap B| / |A \cup B|$. In this case, if you take the intersection of the predicted positives and the actual positives, what you get is the true positives, and the union is everything that either is actually positive or was predicted positive, so

$$J = \frac{TP}{TP + FP + FN}.$$

So, these are some possible metrics. Then there is another common metric called the F-measure, which is the harmonic mean of precision and recall. Precision is $TP / (TP + FP)$, recall is $TP / (TP + FN)$, and the F-measure is the harmonic mean between them:

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

So, these are some measures that are used for external evaluation. With this, I stop today's lecture. In the next class we will start with k-means, which is an example of a partitional clustering algorithm. Then we will talk about one hierarchical clustering algorithm, and we may talk about one model-based clustering algorithm, the mixture of Gaussians. Thank you.
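For reference, the external evaluation measures defined above translate directly into code. This is a small sketch written from the TP/TN/FP/FN counts; the function names and example counts are my own:

```python
# A sketch of the external evaluation measures defined in the lecture.
def rand_index(tp, tn, fp, fn):
    """Fraction of decisions made correctly: (TP + TN) / all."""
    return (tp + tn) / (tp + tn + fp + fn)

def jaccard_index(tp, fp, fn):
    """|A intersection B| / |A union B| = TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision TP/(TP+FP) and recall TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(rand_index(tp=40, tn=30, fp=20, fn=10))  # 0.7
print(jaccard_index(tp=40, fp=20, fn=10))      # ~0.571
print(f_measure(tp=40, fp=20, fn=10))          # ~0.727
```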
Info
Channel: Machine Learning- Sudeshna Sarkar
Views: 35,401
Id: CwjLMV52tzI
Length: 23min 55sec (1435 seconds)
Published: Mon Sep 05 2016