Leland McInnes, John Healy | Clustering: A Guide for the Perplexed

Video Statistics and Information

Reddit Comments

Really good video, does anyone have any good ones on artificial neural networks for visualising data?

👍︎︎ 1 👤︎︎ u/maruimaruiOIOI 📅︎︎ Oct 25 2016 🗫︎ replies
Captions
I'm going to talk about clustering data: a guide for the perplexed. I'm John Healy, and this is Leland McInnes; we're both mathematicians and researchers at the Tutte Institute for Mathematics and Computing in Ottawa, Canada. You can reach us on Gmail — our addresses are right there.

Let's start off: what is clustering? Clustering sounds straightforward — we find groups of data that are all similar. Easy to say, a little bit hard to do, and it tends to mean different things to different people, so over the course of this talk we're going to make that a more concrete statement. What it means typically depends on your application, because there are lots of goals people might use clustering for. You might want to partition your data: say you're on an HPC or distributed computing system and you want your data to live on separate drives or in separate locations, then you need to partition it, and you probably want balanced partitions across that space. Maybe you want to summarize your data, explore your data, embed it into a vector space so you can use other machine learning algorithms, or find patterns within your data. Lots of different reasons to cluster, which leads to lots of different clusterings and definitions. The key phrase was "we find groups of data that are similar": how do you define groups, and how do you define similar? If you're partitioning, you've got some pretty hard constraints — you want the partitions to be balanced, and that's maybe more important than everything within each cluster or partition being exactly the same. Easy in theory, hard in practice.

We're going to walk through some examples of what common clustering algorithms do on a two-dimensional data set. With two-dimensional data the human eye is really good at clustering: we can look at it and say "I know what those clusters are, I'm done, no problem." That lets your intuition line up with what these algorithms are doing. You've been looking at this for a few minutes, so you probably already know what the clusters are.

So let's look at k-means. It's the first algorithm everyone learns, and it's one of the most obvious. Great — except that's probably not what most of you were thinking the clusters were when you looked at this data. What k-means does is throw K centers down in your space and then iteratively move them around to minimize intra-cluster distance and maximize inter-cluster distance. That has a couple of disadvantages. The one everyone knows is that you have to choose K: "I have to pick my number of centers — what if I don't know how to pick it?" That's a little hard to do, although there's some work you can do to get around it. A slightly more subtle disadvantage is that it's a centroid-based technique: at the end of the day every point gets allocated to the center closest to it, and that's a Gaussian-ball assumption — it's saying my clusters are spherical. Spherical clusters might exist for some data, but they probably don't exist for data in general, and assuming all of your data is spherical is kind of like assuming spherical cows: not a great idea. Lastly, we've allocated every single point to a cluster, which is the assumption that you're partitioning rather than clustering. That's saying there's no noise, there's no background, all of my data is clean and lovely — taking a square peg and hammering it into a round hole.
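(A minimal k-means sketch in scikit-learn for anyone following along in code; the blob data and the choice of K = 6 are illustrative assumptions — the talk's demo data set isn't available here.)

```python
# k-means: you must pick K up front, every point is assigned to its nearest
# centroid, so there is no noise and clusters are implicitly "spherical".
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=1000, centers=6, random_state=0)  # stand-in data

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)   # (6, 2) -- the K centers
print(np.bincount(km.labels_))     # every point belongs to some cluster
```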
So let's move on and look at something else: affinity propagation. It looks surprisingly similar to k-means, and there's a reason for that — the last step of affinity propagation is one iteration of k-means. But affinity propagation is kind of interesting in that every point votes on the exemplar point it feels best represents it. That's a cool idea: it allows non-symmetric similarity measures — my similarity to you can be different from your similarity to me — which is a really nice feature, and you don't have to choose a number of clusters. The disadvantages: there's another parameter you do have to tweak, and the results can be very sensitive to it; it's slow; and you ran k-means at the end, which means the spherical cows and the square pegs came right back — you partitioned, and you've got spherical clusters.

Okay, moving on: mean shift. Mean shift is an interesting algorithm. You initialize your space with K centroids, then estimate the local density, and each of those centroids does a hill climb to a local maximum, so you find modes — regions of high density. Then you allocate the points that are dense and near a mode to that cluster. One advantage is that you get those black points, the points in the low-density areas, so you can say "I've got noise; I'm not going to assume all of my data has to be dense or has to be within a cluster." That's great — I got rid of my square peg. Unfortunately I still have my spherical-cow assumption: I have balls in my space.

All right, let's try spectral clustering instead, then. Spectral clustering is nice. It says: I've got really high dimensional data, but maybe it lives on a low dimensional manifold. Think of that as a two-dimensional surface floating through, say, a three dimensional space: if I'm going to compute distances, maybe I don't want to compute them in the three dimensional space — I want to measure them moving along that surface. To do that we typically approximate the data with a k-nearest-neighbour graph and take distances along the graph: look at the adjacency matrix of that graph, use a little linear algebra to embed it into a low dimensional vector space, and perform a clustering in that low dimensional space. The issue is that what people typically do in that low dimensional space is run k-means — and enter the spherical cows and the square pegs.

BIRCH: balanced iterative reducing and clustering using hierarchies. What BIRCH does is iteratively partition your data into a tree, a cluster feature tree — think of it as something like a B+ tree that is dynamically updated and grows. Once we've broken the data up into this tree, we look at the leaves, which represent small groupings of your data, put tiny spherical cows around each of those leaves, and then run another clustering algorithm, usually k-means, to cluster the small spherical cows into large spherical cows — and this is what you get. On the advantage side it's fast, it has low memory consumption, it's streaming, and you don't have to specify the number of clusters, though it does have a parameter that's a little bit hard to tweak.
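(For reference, all of these estimators are available in scikit-learn; a rough sketch running them side by side on synthetic data follows. The parameter values are illustrative guesses, not recommendations.)

```python
# Side-by-side run of the centroid-flavoured algorithms discussed above.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation, MeanShift, SpectralClustering, Birch

X, _ = make_blobs(n_samples=500, centers=6, random_state=0)

models = {
    "affinity_propagation": AffinityPropagation(damping=0.9),          # the extra parameter to tweak
    "mean_shift": MeanShift(bandwidth=2.0, cluster_all=False),         # orphan points get label -1
    "spectral": SpectralClustering(n_clusters=6, affinity="nearest_neighbors"),
    "birch": Birch(n_clusters=6, threshold=0.5),                       # threshold is the awkward parameter
}

for name, model in models.items():
    labels = model.fit_predict(X)
    print(f"{name}: {len(set(labels) - {-1})} clusters, {np.sum(labels == -1)} noise points")
```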
Okay, enough centroid-based clustering algorithms — I think we're all seeing a recurring theme of spherical cows here.

All right: hierarchical clustering, and specifically agglomerative clustering. This is where you say, I'm going to take all my data points, take the two nearest ones, merge them together, and repeat that over and over again. To do that meaningfully I have to come up with some definition of distance between groups of points, and however you define that distance determines what kind of clusters you're looking for. Single-linkage clustering will let you get long, spindly, snake-like clusters; complete-linkage clustering gives you spherical-cow clusters; Ward is a statistical method that generates statistically spherical cows. One advantage is that some of the linkage methods don't constrain the shape of your clusters, which is nice — you might not have to have cows. The other is that you don't need to choose K: this gives you all possible K in a dendrogram, a large tree, so it carries lots of hierarchical information about how your clusters merge and split. That's great, but if you have more than a few thousand points that dendrogram looks horrible — it's a black mess, as anyone who has tried it in Python or R knows — and choosing the cut of that tree, which is akin to choosing your number of clusters, can be quite difficult. That being said, if you've got a small or medium amount of data and you think there's a lot of hierarchical structure within it, this is an excellent technique to choose.

All right, DBSCAN is the last of the algorithms; I'll talk about it quickly. It's a density-based clustering algorithm: it tries to find regions of your space that are dense, surrounded by regions that are not dense. The way it does that is to put a ball of a fixed size around every one of your points and count the number of other points contained within the ball — that's how dense the local region around that individual point is — and then look for contiguous regions of very dense points. That allows me to have these grey and black points in the background; they're outliers, not contained in any of the dense regions. It also allows arbitrarily shaped clusters, because the dense regions can be any shape. You still have a couple of parameters to choose: one is the size of that ball, and you need to do a lot of work adjusting it to find the quote-unquote right solution. Here you can see we've merged some clusters we probably shouldn't have, but overall it looks fairly good.
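(A minimal scikit-learn DBSCAN sketch: eps is the ball radius and min_samples the density threshold described above. The two-moons data and the parameter values are illustrative guesses you would normally have to tune.)

```python
# DBSCAN: fixed-radius balls, a density threshold, noise labelled as -1.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

db = DBSCAN(eps=0.15, min_samples=5).fit(X)
labels = db.labels_                          # -1 marks noise / background points
print("clusters:", len(set(labels) - {-1}))
print("noise points:", np.sum(labels == -1))
```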
Now, we've done all this searching, and parameter searching seems to come up repeatedly. That's easy-ish to do in a low dimensional space, and it's going to get harder when you can't visualize your data and look at it. So what makes this all so hard? What makes a good clustering good — how do you tell? There are lots of different measures of clustering goodness out there; Christian Hennig gave a great talk at PyData London 2016, which I highly recommend. You can look at inter-cluster separation, intra-cluster homogeneity, density, uniform cluster sizes — lots and lots of different things. There's probably a different goodness measure for every clustering algorithm out there, and then you publish the paper that shows you're better than everyone else via your chosen distance measure. That's the secret: what counts as a good clustering algorithm, and what counts as a good measure, depends on what you need for your application.

Here are some slides stolen straight from Christian Hennig's talk. This is the best possible three-cluster clustering if you're trying to minimize intra-cluster distance. Some of you will look at that and say "yes, that's the best clustering, that's exactly what I want my clustering to be doing." Via a different distance measure, this other one is the best clustering. I'm not saying one is better than the other, but everyone in the room needs to decide which one they like better, and then figure out why they like it better for their application. I happen to like this one better, and I'll go into that in a little more detail soon.

All right, what is a cluster? Or rather, what do I mean by a cluster? That's maybe an easier question to answer. I tend to use clustering for EDA: finding interesting regions in my data, exploring it, summarizing what it contains. So I have a particular application in mind when I talk about this; you will probably have a different one. I'll start with what I don't want. I'm surprised we keep getting these spherical clusters: your clusters don't need to be balls — straight-up spherical cows — you don't need that. In the spectral clustering result you can see the region in the bottom right corner; that's all one red cluster. This is the partitioning problem again: not every point in your data needs to be in a cluster. Real data has noise — measurement error, sampling error, plain old data corruption; lots of things can put points outside any cluster. One of your applications might be partitioning, but if that's not an explicit requirement of your application, you probably shouldn't use this square-peg assumption. This one was the best of the bunch for my definition — and what was its definition of clustering? Clusters are dense areas separated by less dense areas. I can't think of another notion that jibes with my intuition as well, but everyone's data is special and unique, just like every snowflake. That's what I mean when I talk about clustering; you should think really hard about what you mean before you do any clustering whatsoever.

Back to my definition: can we be more specific about density-based clustering? We've hand-waved a definition; let's get something a little more detailed. A cluster is a connected component of a level set of the probability density function of the underlying (and unknown) distribution from which our data samples are drawn. ...What? All right, let's walk through that a little more slowly and build some intuition. This is the data set we've been looking at all day. I can do a kernel density estimate and estimate the probability density function from which the data was drawn — I've put a heat map over the top, showing denser regions in blue, which says that if I'm drawing data from this distribution I'm more likely to draw points from the blue areas. I can cut that into contour sets, which are just areas of equal probability density. Now I can grab one of those contours, make a cut, and look at all the points that fall within one of those lassos: no problem, there's my clustering, and it jibes well with my intuition.
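(A toy illustration — not the talk's algorithm — of "clusters as connected components of a level set": estimate a density, keep the points above a chosen level, and take connected components of a radius graph over them. The bandwidth, level and radius are hand-picked assumptions.)

```python
# Level-set clustering, the naive way: KDE density -> threshold -> components.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KernelDensity, radius_neighbors_graph
from scipy.sparse.csgraph import connected_components

X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=0.6, random_state=0)

log_density = KernelDensity(bandwidth=0.5).fit(X).score_samples(X)
level = np.quantile(log_density, 0.3)        # the "cut" through the density
dense = X[log_density > level]               # points inside the level set

graph = radius_neighbors_graph(dense, radius=0.5, mode="connectivity")
n_clusters, labels = connected_components(graph, directed=False)
print("connected components in this level set:", n_clusters)
```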
Is that process a clustering algorithm? Well, there are a couple of problems that suggest it might not be. We don't know the underlying probability density function that I magically hand-waved into a heat map at the beginning; we didn't know which level set to choose — I carefully went through and picked exactly the right level set to give you a nice intuition for what it was doing — and the computational complexity of doing either of those tasks is actually quite large. If I've got a great clustering algorithm that only works on a thousand data points, that might be useful for some applications, but in general we'd like clustering algorithms to scale to millions of data points and beyond. So what can we do? To talk about what we can do, practically speaking, I'm going to pass this over to my colleague Leland.

All right, thanks John. John handles the statistics and machine learning; I'm not really in that field as much — I do math and programming — so I'm interested in the practicalities of what we can actually do. First, let's dig into the approximation. Approximating density across the whole space is incredibly computationally expensive; I want to approximate density locally, just around points. Here's what DBSCAN does: it has two parameters, epsilon and min points. It draws balls of radius epsilon and declares any point whose ball contains at least min points points to be dense. So here I drew a whole bunch of circles of a fixed radius and coloured them red if they had more than three points in them, grey otherwise. You can see it's picking out the higher density areas nicely: the red dense regions are found quite well. That's also very fast to compute, and it's local to each point, so you can do it fairly efficiently.

But let's turn it on its head a little. What if we draw circles of whatever radius is needed to contain min points many points? In other words, for each point we take the radius that would be required for DBSCAN to declare that point dense. I've done that, and I've shaded those circles according to how large their radius is — the larger the radius, the lighter the colour, down into greens and yellows. I can translate that onto the points, and I seem to have coloured the points by their relative density. That's nice: I've got a very cheap estimate of the probability density function over the points.
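(A sketch of that "turn DBSCAN on its head" idea: the distance to the min-points-th nearest neighbour is the radius needed to declare a point dense, and its inverse serves as a cheap local density estimate. The value min_points = 15 and the blob data are illustrative.)

```python
# Core distances as a cheap local density estimate.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
min_points = 15

nn = NearestNeighbors(n_neighbors=min_points).fit(X)
distances, _ = nn.kneighbors(X)          # distances to the min_points nearest neighbours
core_distance = distances[:, -1]         # radius needed to capture min_points points
density_estimate = 1.0 / core_distance   # larger radius -> lower local density

print(core_distance[:5], density_estimate[:5])
```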
So now I have to figure out how to deal with level sets: I need to get some level sets and pick out the right ones. How am I going to do that? The first step is to step back a bit and realize that the level sets are actually going to form a tree. You can see this a little more clearly with an animation, so let's get this running. This is a univariate probability density function, and we're going to run a line down through decreasing density. Whenever it picks up a new area, I add a branch to my tree; whenever it hits a local minimum, I merge together the branches where they meet, until I eventually get the root of the tree. So, given a probability density function, you can build a tree from it. We can do the same thing with points. What I'm going to do here is grow circles of increasing radius about each point — remember, the size of the radius is related to how dense the point is — and I'm going to colour those circles in as soon as they capture enough points to be declared dense, then merge connected regions together with uniform colours as I go, building up the tree. So here we go: you can see colours popping up and then merging together; now we're down to just three colours; eventually the blue and the purple will merge once they connect up with each other, and finally the green pops in and we're at the root of the tree. So we can build up a tree of varying density.

Okay, if we can get a tree of density, can we play our game to select the clusters out of that tree? Yes — there's an excess-of-mass algorithm that will select clusters out of a tree. So this is not so bad now: I can select the level sets I want, and it manages to pull a flat clustering out of the tree. What does this look like in practice? Here's our familiar data set; if we run this on it, what sort of tree do we get? We get this. Now, that actually is a tree — there are lots and lots and lots of little branches in there — but it's just solid black because there are so many of them that it's hard to make any sense of it.

So, one more trick to make this easier. The trick we're going to play is that, in practice, almost all of the splits in this vast, complicated tree are one or two points dropping out of a cluster (or joining a cluster, depending on which way you want to look at it), and for many purposes that isn't really a split into two clusters — I don't really think one point is a separate cluster. I have some notion of granularity, of how big a cluster can be: a minimum cluster size. As long as a split involves fewer than that minimum cluster size number of points falling out, I'm going to say the cluster persists rather than splitting in two; it just lost some points. If I run that process over the whole tree, it simplifies the tree down, because most of those splits are no longer splitting the tree apart — they're just a single cluster losing some points. If we simplify the tree down, this is what we get, and this is a tree you can actually look at and understand, at least reasonably well. Our tree is much simpler, and the width of each branch is related to the number of points; I've also coloured it that way, so you can see points falling out of the clusters as they go down the page.

What you want to do to pick out the flat clustering is effectively maximize the amount of ink. The constraint is that if you choose a cluster you can't choose any of its children, and subject to that constraint you work through and choose the clusters that get you the most ink on the page, so to speak — the excess of mass. This is what that algorithm picks out as the clusters. Now, as it happens, there is a single flat cut that would pick out all of those clusters, but there's no reason for that to necessarily be the case. On a different data set, with clusters of wildly varying density, it can pick out one cluster way up high in the tree and another way down low, where no single flat cut would get you there; it handles variable-density clusters very nicely.
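(A sketch of that condensed tree in code, using the hdbscan library the speakers point to later in the talk; the blob data and min_cluster_size = 25 are illustrative, and plotting the tree requires matplotlib.)

```python
# Build and plot the condensed tree; min_cluster_size controls when a split
# counts as a real split rather than points falling out of a cluster.
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=6, random_state=0)

clusterer = hdbscan.HDBSCAN(min_cluster_size=25)
clusterer.fit(X)

# select_clusters=True highlights the clusters chosen by the
# excess-of-mass ("most ink") selection described above.
clusterer.condensed_tree_.plot(select_clusters=True)
print(clusterer.labels_.max() + 1, "clusters selected")
```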
And what's the result of doing this clustering? Here's what the algorithm provides — and this is what I wanted when I first stared at that data. The black points, again, are noise points — it's treating those as outliers — and for the rest it's picking out the clusters very, very tightly. This is the HDBSCAN clustering algorithm. It's quite recent, from two papers by Campello, Moulavi, Zimek and Sander in 2013 and 2015, and it's an excellent clustering algorithm for just getting things right off the bat.

So what about performance? That was a nice-looking clustering, and we made it at least computable — but how computable? How well can we scale? The first thing is that I don't want to run connected components for every level of that tree; that's too expensive. We can frame the problem as connected components in a graph, which improves things a little, but better still, we can find the minimum spanning tree of that graph: the minimum spanning tree is literally the set of edges that cause disconnections of components in the graph, so it contains all the information needed to rebuild the cluster tree. There's a very small number of edges in the graph that are actually the critical ones; we just have to figure out which ones those are, and then we can reconstruct the entire tree straight away.

The next catch is that the graph we'd be doing this on turns out to be a complete graph. Traditional minimum spanning tree algorithms such as Prim's and Kruskal's have O(n²) asymptotics on a complete graph; they're designed for graphs with relatively small numbers of edges compared to the number of nodes, and we're dealing with a graph that has n² edges for n nodes. That's not going to cut it — we can't really afford an O(n²) algorithm. So what can we do? We can use spatial indexing. Generally speaking, that's some sort of tree that divides up the space recursively so that you can ignore the effects of things that are a long way apart and only worry about things that are close together — things like quadtrees, octrees, k-d trees and ball trees; there's a whole slew of different ways of doing this. We generally use k-d trees and ball trees, but you can think in terms of whatever spatial tree structure you prefer. The main thing spatial indexing provides is that it's really fast for nearest-neighbour queries: if I want to find the nearest neighbour of a point, I can run through the tree and ignore all the points that are really far away, worrying only about the points closest to the query point. That means I can query for the nearest neighbour of a point in O(log n) instead of O(n) — I don't have to look at how far it is from every point in the data set, just the ones in the closest few leaves of the tree. If we use Borůvka's algorithm for minimum spanning trees, the upside is that we can pretty much turn the minimum spanning tree computation into a whole bunch of repeated nearest-neighbour queries. There are some caveats on that, but in practice they actually make the algorithm run a little faster. So we can take the dual-tree Borůvka algorithm — that was March, Ram and Gray in 2010 — and get O(n log n) performance for the minimum spanning tree computation, which is the bulk of the HDBSCAN algorithm. So we can convert HDBSCAN from the O(n²) algorithm as it was first published into an O(n log n) algorithm that we can run in practice.
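(A naive O(n²) sketch of the minimum-spanning-tree idea, for intuition only: build the mutual reachability distances and take the MST, whose n − 1 edges carry the hierarchy. The real implementation uses k-d/ball trees and dual-tree Borůvka to reach roughly O(n log n); the data and min_points here are illustrative.)

```python
# Mutual-reachability MST, the brute-force way.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)
min_points = 10

# core distance: radius needed to enclose min_points neighbours
core = NearestNeighbors(n_neighbors=min_points).fit(X).kneighbors(X)[0][:, -1]

dist = squareform(pdist(X))
# mutual reachability distance: max(core(a), core(b), d(a, b))
mreach = np.maximum(dist, np.maximum.outer(core, core))
np.fill_diagonal(mreach, 0.0)            # no self-edges

mst = minimum_spanning_tree(mreach)      # n - 1 critical edges out of ~n^2 / 2
print(mst.nnz, "edges are enough to rebuild the whole cluster tree")
```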
So what does that mean in terms of actual clustering performance? Here's a whole bunch of clustering algorithms, including many of the ones we looked at at the start, and as you can see they quickly fall into two categories: ones that are actually practicable on reasonably sized data sets, and the others. Let's worry about the ones that look practicable and extend out the x-axis, clustering more points — there are fewer left here, and HDBSCAN with the O(n log n) implementation is down there in dark green at the very bottom. We can extend that out one more time, and now you can see that HDBSCAN, despite effectively computing DBSCAN for every possible epsilon parameter value, is running faster than scikit-learn's DBSCAN. So we can do this extremely efficiently and scale to fairly large amounts of data. In practice, what can you do? If you want to work interactively, on the order of a hundred thousand points; if you're willing to run it overnight, you can get to five million, maybe even ten million points.
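(A rough way to reproduce the flavour of that benchmark yourself; the absolute numbers depend entirely on your machine, and the data and parameters here are guesses rather than the talk's benchmark setup.)

```python
# Crude timing comparison as the data set grows.
import time
import hdbscan
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

for n in (1_000, 10_000, 50_000):
    X, _ = make_blobs(n_samples=n, centers=10, random_state=0)

    t0 = time.perf_counter()
    DBSCAN(eps=0.5, min_samples=10).fit(X)
    t1 = time.perf_counter()
    hdbscan.HDBSCAN(min_cluster_size=10).fit(X)
    t2 = time.perf_counter()

    print(f"n={n}: DBSCAN {t1 - t0:.2f}s, HDBSCAN {t2 - t1:.2f}s")
```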
Okay, that was good — can we do better? The biggest annoyance I still have is that I still need to choose this min points value for the density estimate: how many points do I have to enclose within my epsilon ball before I can declare a point dense? Now, we went from DBSCAN to HDBSCAN by computing things for all possible epsilon values — we varied over the whole parameter range and then integrated everything together. Can we play the same trick with min points: vary it the same way we varied epsilon and integrate everything together? Well, no, there's no obvious way to carry that through. But — topology to the rescue, said no one until now, probably. You can think of building the initial HDBSCAN tree in terms of something called persistent homology. If you're familiar with that, great; if you're not, don't panic — it's just a tool from algebraic topology, and the main thing is that it gives us a very clean, simple mathematical description of what's going on. Even better, once we're embracing this topological framework (which you should do anyway), we get other algebraic topology tools to bring to bear on the problem. One of them is fairly recent: multidimensional persistent homology, which came out around 2008 and lets you do persistent homology over multiple variables — so we could persist over epsilon and min points at the same time. If we can frame our problem in terms of multidimensional persistent homology, maybe we can get what we want. The catch is that the result is not a nice structure like ordinary persistent homology would give you, where you get a tree; in fact you get, well, a really complicated thing that's hard to wrap your head around. So the whole little game we played with the tree to pick out the clusters at the end — that's not available to us anymore. The upside is that we can, if we want, completely rethink HDBSCAN and describe it all in terms of sheaves instead of trees. I'm not going to get into sheaves in this talk — please come check with me afterwards if you actually care — and we're getting into some deep theoretical waters here, which I like, because I finally get to use all that category theory and topos theory I learned (shout out to Grothendieck topologies) — they're actually finding a practical use in reality. The main thing is that this just gives us the right data structure, mentally, to solve the problem; the right data structure in practice is probably not going to be this, but that's okay — it's a way to think about the problem.

The key is that the resulting formulation of HDBSCAN in terms of sheaves does generalize to the multidimensional case. There are some caveats on that, because whenever you generalize from a simple case up to a more complicated one you inevitably find there are a couple of different ways the generalization could go — that's okay, we still have to figure out the details. When we get that done, that's going to be persistent density clustering, and we're going to get some advantages. The main ones are that it's even more robust and there are fewer parameters — you're just going to have to pick the minimum cluster size — and in practice we can get similar performance: the same asymptotics, with constants that are a little worse, but otherwise much the same.

So what are the conclusions, and what do I want you to take away from this talk — what does John want you to take away from this talk? The first one is that k-means should not be your first choice for clustering. We really want to emphasize that k-means shouldn't be your second choice for clustering either. The main thing is that we want you to think really hard about what "cluster" means for your application — what do you need clustering to do? — because that really influences which algorithms make sense for your problem. And while you're thinking about it, you may as well run HDBSCAN; it's probably what you want to do anyway. All right, thank you. Are there any questions?

Okay, so the question was about the curse of dimensionality. I think the short answer is that I don't think clustering is practical in really high dimensions, because I don't think it makes sense: distances cease to be meaningful in truly high dimensional spaces. So I would say the answer is that you need a dimension reduction technique before you can cluster, if your data lives in a high dimensional space. If the data is natively high dimensional and there isn't some low dimensional manifold you can pull out, there are no clusters there — there's no meaning there. If your data is in a high dimensional space but lies on a low dimensional manifold, dimension reduction of some form will work, and then you can cluster. Other questions?
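(One possible shape of that advice in code: reduce dimension first, then cluster. PCA and the digits data are illustrative stand-ins — the right reduction technique depends entirely on your data.)

```python
# Dimension reduction before density-based clustering.
import hdbscan
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                        # 64-dimensional digit images
X_low = PCA(n_components=10).fit_transform(X) # project to a lower-dimensional space

labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(X_low)
print("clusters found:", labels.max() + 1)
```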
[Audience question about the dimensionality of the data used for the timings.] In practice I think you can get to fifty or a hundred dimensions and it's not too bad, but HDBSCAN does scale badly with dimensionality: if you have truly high dimensional data it's going to take a lot longer to run.

[Audience question about the implementation on the slides.] That is very true — yes. The code base that's up on the slides, which you can pip install (or whatever you need), is all Cython inside, so it compiles down to C for the fundamental data structures.

Yes, it takes advantage — so the question is, does it take advantage of multiple cores? Ultimately, fundamentally, it's single-threaded. It does take advantage of more cores where it can, but the algorithm itself is hard to parallelize well. That would be an interesting area of research — I'd love to talk to you more if you have thoughts on that.

[Audience question about how this ties in with persistent homology and topological data analysis more broadly.] Well, certainly, I think persistent homology is a really interesting idea that's still trying to find a really, really good application, and personally I think this is potentially one of those good applications. Now, I'm not using anywhere near the full power of persistent homology — I'm only interested in the zeroth dimension — but it's a case where the tools of persistent homology have realized theoretical value in a practical setting, and I think there are going to be more cases of that. Time will tell.

What is your recommendation for handling situations where individual dimensions have wildly varying scales or normalizations — what's your take on that? Well, that's actually a pretty difficult problem, to be honest. John, do you have thoughts on that? Then you're getting into dimension reduction techniques: if you're looking at dimension reduction, the question is how to properly standardize your individual variables before collapsing down into a lower dimensional space. You can always do standard centering and the basic statistical techniques. There's some really interesting work coming out of the generalized low-rank models paper that came out of Stanford recently, from Udell et al., talking about the right centering methodology for dealing with heterogeneous data, and I would jump in that direction if you're looking for something cutting-edge and experimental.
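(A minimal sketch of the basic centering and scaling step mentioned in that answer, applied before any dimension reduction or clustering; the toy columns are invented purely for illustration.)

```python
# Standardize columns that live on wildly different scales.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 500),       # unit scale
    rng.normal(0, 1_000, 500),   # thousands
    rng.normal(0, 0.001, 500),   # tiny scale
])

X_scaled = StandardScaler().fit_transform(X)
print(X.std(axis=0).round(3), "->", X_scaled.std(axis=0).round(3))
```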
Info
Channel: PyData
Views: 7,399
Rating: 4.9340658 out of 5
Keywords:
Id: ayZQj4llUSU
Channel Id: undefined
Length: 34min 36sec (2076 seconds)
Published: Mon Oct 24 2016