Lecture 59 — Hierarchical Clustering | Stanford University

Video Statistics and Information

Captions
Welcome back to Mining of Massive Datasets. We continue our discussion of clustering by looking at hierarchical clustering methods. To refresh your memory, hierarchical clustering can go either bottom-up or top-down. In bottom-up (agglomerative) methods, each data point is initially in a cluster of its own; at each step we find the two closest clusters and combine them into a single cluster. In divisive (top-down) methods, all the data points start out in a single cluster, and we recursively split clusters as we go along. In this lecture we focus on the agglomerative, or bottom-up, approach, where we start with each data point as its own cluster and then combine clusters; the ideas can easily be adapted to divisive methods as well.

The key operation in hierarchical agglomerative clustering is to repeatedly combine the two nearest clusters into a larger cluster. There are three key questions we have to answer in order to build a hierarchical clustering algorithm. First, how do you represent a cluster of more than one data point? We need a representation of a cluster so that we can figure out which clusters are close to each other. That brings us to the second question: how do you determine the nearness of clusters, so that we can combine the two nearest ones? And the third question: as you keep combining clusters, when do you decide to stop and produce a final output? We will look at each of these three questions in turn.

Let's start with the simpler case of a Euclidean space. In a Euclidean space you can always average two points, and the average is also a point in the space. This gives a simple answer to the question of how to represent a cluster of many points: we represent a cluster by its centroid, the average of its points. To determine the nearness of clusters, we simply measure the distance between the centroids of the clusters.

An example will make this clear. Here we have six data points in a Euclidean space, and we are going to apply an agglomerative clustering method. Initially, suppose we determine that the points (1,2) and (2,1) are the closest pair. We combine them into a single cluster and represent this cluster by its centroid, which is the average of the two points along every dimension: the centroid of (1,2) and (2,1) is (1.5, 1.5). Remember that we started with each data point as its own cluster, and we have now combined two of the data points into one cluster, so we have five clusters. Next we find the nearest pair of clusters; as it turns out, this is the pair of points (4,1) and (5,0). Once we have created this cluster, we again represent it by its centroid, which is (4.5, 0.5): the 4.5 is obtained by averaging the x coordinates of the two points, and the 0.5 by averaging the y coordinates, which are 1 and 0. So now we have four clusters.
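To make the procedure concrete, here is a minimal sketch in plain Python (not the course's implementation) that runs centroid-based agglomerative clustering on the six points used in this example and reproduces the same sequence of merges:

def centroid(points):
    # Average the points coordinate-wise.
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(a, b):
    # Euclidean distance between two points (or centroids).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Start with every data point in a cluster of its own.
clusters = [[p] for p in [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]]

# Repeatedly merge the two clusters whose centroids are closest,
# stopping (arbitrarily, for this sketch) when two clusters remain.
while len(clusters) > 2:
    i, j = min(((a, b) for a in range(len(clusters))
                       for b in range(a + 1, len(clusters))),
               key=lambda ab: dist(centroid(clusters[ab[0]]),
                                   centroid(clusters[ab[1]])))
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    print("merged:", merged, "-> centroid", centroid(merged))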
As we go along merging clusters, we build an artifact called a dendrogram, which shows how the data points and clusters were merged: in the first step we merged the two blue data points, and in the second step the two green data points, and that is what the dendrogram indicates. In the next step we determine that the two closest clusters are the cluster with centroid (0,0) and the cluster with centroid (1.5, 1.5); we do this simply by measuring the distance between (0,0) and (1.5, 1.5) and finding it to be the shortest distance between any pair of clusters. We then combine the three points (1,2), (2,1), and (0,0) into a single cluster and represent it by its centroid, (1,1). To compute the centroid of this combined cluster we average all three points: along the x axis, (0 + 1 + 2)/3 = 1, and similarly along the y axis.

Now we have three clusters: one with centroid (1,1), another with centroid (5,3), and a third with centroid (4.5, 0.5). Measuring the distances between these three centroids, we find that the two closest are (5,3) and (4.5, 0.5), so we combine those; the new cluster has centroid (4.7, 1.3). We are now down to two clusters, and the only remaining merge is to combine them. We can either stop at this point and output these two clusters, or merge them further into a single large cluster. A single cluster doesn't make a lot of sense on its own, since our goal was to group the points into different clusters; but even if we do merge everything into one cluster, the dendrogram still records the order in which the mergers happened, and that is very useful information in many cases. For example, if the data points represent species of animals, the dendrogram represents a family tree of how those species evolved.

That was the easy, Euclidean case of hierarchical agglomerative clustering. But what if you have a non-Euclidean space? The problem with a non-Euclidean space is that it is not possible to average points to create a centroid: the centroid may not be a valid point in the space. The only locations we can talk about in a non-Euclidean space are the data points themselves; there is no concept of an average, so we cannot keep using the centroid to represent a cluster. Instead we use a different concept called a clustroid. A clustroid is a data point that is closest to the other points in the cluster. Once we use clustroids instead of centroids (we look at an example of clustroids in the next slide), we determine the nearness of clusters by treating the clustroid exactly as if it were the centroid: we measure the distance between two clusters by measuring the distance between their clustroids, instead of the distance between their centroids.

Here is a cluster of three data points in a Euclidean space. The centroid, which we saw how to compute, is just the average of the data points in the cluster; the X marks the centroid of the three data points shown. Notice that the X is not among the original three data points in the cluster: it is an artificial point that we created to represent the cluster. The problem in non-Euclidean spaces is that we cannot create this artificial point.
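Computing a clustroid requires nothing more than the ability to measure distances between the data points themselves. The following sketch is an illustration only: it uses Jaccard distance between sets as the non-Euclidean distance (not the example in the lecture) and picks as clustroid the member with the smallest total distance to the other members, one reasonable way of making "closest" precise:

def jaccard_distance(a, b):
    # 1 minus the Jaccard similarity of two sets.
    return 1.0 - len(a & b) / len(a | b)

def clustroid(cluster, dist):
    # The member of the cluster with the smallest total distance to the others.
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

cluster = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d", "e"}]
print(clustroid(cluster, jaccard_distance))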
Instead, we have to pick one of the three data points as a clustroid to represent the cluster, in place of the centroid X. In this case, for example, we might pick the highlighted point as the clustroid, because intuitively it sits in the middle of the cluster and is roughly equidistant from the other points. Notice that the clustroid is an actual existing data point, not an artificial centroid, but for all practical purposes we can treat it as the centroid in clustering algorithms.

We defined the clustroid to be the point that is closest to the other points in the cluster, but how exactly do we define "closest"? It turns out there are multiple ways to define it when picking the clustroid. We might pick the point with the smallest maximum distance to the other points: we measure the distance between every pair of points, and then find the point whose maximum distance to any other point in the cluster is as small as possible. Instead, we might pick the point with the smallest average distance to the other points, or the point with the smallest sum of squared distances to the other points. Depending on the application, one or another of these notions of closeness may make more sense than the others.

So far we have addressed the first two of the three key questions of clustering: how to represent a cluster, and how to determine the distance between clusters. We now turn to the third point, the termination condition: how do you know when to stop clustering and produce an output? The first approach is to pick a number k up front and stop when we have k clusters. This makes sense when you know in advance that the data falls naturally into k classes. For example, the data might be about galaxies and quasars; we know there are naturally two classes, galaxy and quasar, so once we have two clusters we stop. The second approach, for when we don't know the number k up front, is to keep clustering and stop when the next merge would create a bad cluster. How do we define a bad cluster? We define a notion called cohesion, which measures the goodness of a cluster, and we stop when the next merge would create a cluster whose cohesion falls below a certain level.

How exactly do we define cohesion? There are multiple approaches. The first is to use the diameter of the merged cluster: the maximum distance between any pair of points in the cluster. We might decide to stop clustering when the diameter of a newly merged cluster would exceed a preset threshold. The second approach is to use the radius: the maximum distance of any point from the centroid (or clustroid). We might stop when the next merge would produce a cluster whose radius exceeds a certain threshold. The third approach is density-based. The density of a cluster is the number of points per unit volume of the cluster; one way of defining it is simply to divide the number of points in the cluster by the diameter or the radius, or by a power of the radius, such as its square or cube. When the next merge would create a cluster with density lower than a certain preset threshold, we stop merging and produce the final output.
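As a rough sketch of how these cohesion tests might look in code (Euclidean distance and made-up threshold values are assumed here; they are not from the lecture), the functions below compute the diameter, radius, and a simple density of a proposed merged cluster and flag a merge that would violate any threshold:

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def diameter(cluster):
    # Maximum distance between any pair of points in the cluster.
    return max(euclidean(p, q) for p in cluster for q in cluster)

def radius(cluster, center):
    # Maximum distance of any point from the centroid (or clustroid).
    return max(euclidean(p, center) for p in cluster)

def density(cluster):
    # Points per unit "volume"; here simply count / diameter.
    # Dividing by the radius, or by a power of either, is also reasonable.
    return len(cluster) / diameter(cluster)

def bad_merge(merged, center, max_diam=5.0, max_rad=3.0, min_dens=0.5):
    # True if the proposed merged cluster violates any (illustrative) threshold,
    # in which case we stop instead of performing the merge.
    return (diameter(merged) > max_diam or
            radius(merged, center) > max_rad or
            density(merged) < min_dens)

merged = [(0, 0), (1, 2), (2, 1)]        # a candidate merged cluster
print(bad_merge(merged, center=(1, 1)))  # (1, 1) is its centroid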
Notice that in each case we set a predefined threshold, whether for diameter, radius, or density, and stop when the next merge would violate that threshold.

We turn now to the implementation of hierarchical agglomerative clustering. At each step we need to find the closest pair of clusters and merge them, and to find the closest pair we need to compute the pairwise distances between all pairs of clusters. Since we start with each point in its own cluster, there are initially on the order of n clusters, where n is the number of points, so computing the pairwise distances takes O(n²) time. Overall we may have to do on the order of n merge steps, so the overall complexity of hierarchical agglomerative clustering is O(n³). With a careful implementation using priority queues, we can reduce this to O(n² log n). But O(n² log n) is still too much for really big data sets that don't fit in memory: when n is on the order of millions, n² log n gets out of hand quickly. That is why hierarchical clustering is not commonly used for really big data sets that don't fit in memory; it is primarily used for smaller data sets that do fit in memory. For very large, disk-resident data sets we use other clustering methods, which we turn to next.
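To illustrate the priority-queue idea, here is a minimal sketch in plain Python (again, not the course's implementation): all candidate merges are kept in a heap keyed by centroid distance, and entries that refer to clusters that have already been merged away are discarded lazily when popped. A real implementation would also cache centroids instead of recomputing them.

import heapq

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hierarchical_cluster(points, k):
    # Merge clusters until only k remain, using a heap of candidate merges.
    clusters = {i: [p] for i, p in enumerate(points)}   # cluster id -> points
    next_id = len(points)
    heap = []
    ids = list(clusters)
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            i, j = ids[a], ids[b]
            d = dist(centroid(clusters[i]), centroid(clusters[j]))
            heapq.heappush(heap, (d, i, j))
    while len(clusters) > k:
        d, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue            # stale entry: a cluster was already merged away
        merged = clusters.pop(i) + clusters.pop(j)
        clusters[next_id] = merged
        for other in clusters:  # push distances from the new cluster to the rest
            if other != next_id:
                d = dist(centroid(merged), centroid(clusters[other]))
                heapq.heappush(heap, (d, next_id, other))
        next_id += 1
    return list(clusters.values())

print(hierarchical_cluster([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], 2))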
Info
Channel: Artificial Intelligence - All in One
Views: 133,066
Rating: 4.9181585 out of 5
Keywords: Mining of Massive Datasets, Data Mining, Information Retrieval, Coursera, Computer Science, Video Lecture, Video Tutorial, Video Course, Course, Data Science, Data Mining Video Lecture, Data Mining Video Course, Stanford University, University of Stanford, Stanford, Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining Data, Online Data Mining, Online Data Mining Course, Best Data Mining video course, Coursera Data Mining Video Lecture, Hierarchical Clustering
Id: rg2cjfMsCk4
Length: 14min 7sec (847 seconds)
Published: Wed Apr 13 2016