HDBSCAN, Fast Density Based Clustering, the How and the Why - John Healy

Captions
I'm John Healy. This is joint work with Leland McInnes, who gave a dimension reduction talk earlier today; if you missed it, I highly recommend checking it out when it goes up on YouTube. I also want to give a quick shout-out to a pair of colleagues from the Tutte Institute who are giving a tutorial on reproducible data science tomorrow; I'd recommend you drop by that as well.

So who are we? We're mathematicians and researchers at the Tutte Institute for Mathematics and Computing, a small research institute in Ottawa, Canada. We've recently been focusing some of our research effort on the foundational mathematics underneath machine learning and data science techniques. To that end, a while back we came across a great clustering algorithm called HDBSCAN, figured out how to make it more efficient, and developed it into an open-source library that has since been contributed to scikit-learn-contrib. Today I'm going to talk about both how and why HDBSCAN works.

To start, let me roll back and talk about clustering in general, so we're all on the same page. Clustering, at its heart, is about grouping together data points that are very similar. I'm going to break the field down in a way that I find useful. This will be very high level, and I may lie a little, but I think it gives the right intuition.

First, divide clustering algorithms into two clumps: centroid-based (or parametric) algorithms and density-based algorithms. Centroid-based algorithms make explicit or implicit assumptions about the shape of your clusters, which lets them work very well on small amounts of data; that's a really nice feature. Density-based clustering lets the data define the shape of your clusters, which requires considerably more data to work well.

Hold on, though; maybe that isn't the right way to break things down. A better split might be flat versus hierarchical clustering. This is about resolution: flat clustering gives you a single partitioning of your data, while hierarchical clustering returns a set of nested relationships between your data points, which is a much richer way to think about the problem.

So what if we combine the two into a two-dimensional taxonomy: flat centroid-based, flat density-based, hierarchical centroid-based, and hierarchical density-based? That makes a nice taxonomy of clustering algorithms, and I'll give you a quick whirlwind tour through its four cells.

Start with flat, centroid-based (parametric) clustering. The two most common algorithms in this space are k-means, which most people have probably heard of, and Gaussian mixture models. As I mentioned, these make an explicit or implicit assumption about the shape of your clusters, so you need an a priori understanding of what you want those shapes to be before you choose the algorithm. This works really well on small amounts of data, and it works well in low dimensions, where you can visualize your data and get a feel for the shapes within it. But that's rarely the situation I'm in. I usually deal with higher dimensional data that I can't see and don't really understand: clusters of all sorts of shapes, variable densities, variable sizes. K-means and other parametric algorithms tend to perform poorly in those situations because they make the Gaussian ball assumption, that your clusters are essentially high-dimensional spheres. The Gaussian ball assumption is lovely in theory, but in my experience it's a bit like a spherical cow: theoretically appealing, rarely occurring in nature. On top of that, these algorithms have a number-of-clusters, or resolution, parameter that has to be set, and if you don't know much about your data a priori, that's very hard to do.
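For concreteness, here is a minimal sketch of the two flat, centroid-based methods just mentioned. The synthetic dataset and parameter values are my own illustrative assumptions, not from the talk; the point is simply that both methods require the number of clusters up front.

```python
# A minimal sketch of flat, centroid-based clustering; dataset and parameters
# are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

# Both methods need the number of clusters chosen a priori.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)
```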
That brings us to flat, density-based clustering. Examples here are DBSCAN and the mean shift algorithm. These eliminate the parametric distribution assumption made by centroid-based methods, which is nice; think of them as non-parametric, data-driven versions of the centroid-based methods. DBSCAN, which stands for density-based spatial clustering of applications with noise, is a popular density-based clustering algorithm with some nice advantages: it's fast, it's robust to noise, and it allows for variably shaped clusters. You also get a nice property here: all the points shown in grey in the background are not assigned to any cluster. It's a clustering algorithm, not a partitioning algorithm, which means it doesn't force every point into a cluster; that's really valuable if you think there might be genuine noise in your data. K-means doesn't have that property, and in high dimensions you can get wildly variable clusters without realizing what's happening. Unfortunately, DBSCAN shares a disadvantage with the previous algorithms: its resolution parameter, epsilon, is quite hard to set.
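A minimal DBSCAN sketch follows; the eps and min_samples values below are illustrative guesses on an assumed toy dataset, and eps is exactly the parameter just described as hard to set well.

```python
# A minimal DBSCAN sketch; eps and min_samples are illustrative guesses.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.08, random_state=0)

labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(X)

# Points labelled -1 are left as noise rather than forced into a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
```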
Which leads us to hierarchical centroid-based (parametric) algorithms. These are specifically designed to address the parameter selection problem: since we don't know a priori what the correct number of clusters is, we build a hierarchy over a whole range of clusterings and use it to try to divine the correct number. That hierarchy is often represented as a dendrogram, which can be useful but can also quickly overwhelm the user. The dendrogram is a picture of the clusters splitting apart as you move down it, fragmenting until every point sits in its own cluster at the bottom. From it you can stare at the tea leaves a little and say there should be four, or maybe six, or maybe eight clusters. I know the answer I want, so I'll pretend you can divine "six" from looking at this, and that gives you this clustering. But again we get hurt by the parametric assumption implicit in what I just did, which was Ward hierarchical clustering, and that assumption is essentially the Gaussian ball again; it's come back to bite us. At this point everyone in the room says: wait, John, you just told us how to deal with the Gaussian ball assumption, we should use density-based clustering. And you would be right.

That brings us to hierarchical density-based clustering. The big reveal behind the question mark in that last cell is HDBSCAN, which shouldn't surprise anyone who read the title of the talk before coming in. HDBSCAN is hierarchical density-based spatial clustering of applications with noise. This is the part of the space you want to be in: density-based clustering to be robust to variably shaped clusters, and hierarchical clustering to account for the resolution problem in your data. HDBSCAN is a good example of a hierarchical density-based clustering algorithm, so let's talk a little more about that family in general. I'll start with a high-level motivating description, an example of a natural way you might go about constructing a hierarchical density-based algorithm.

Let's go back to the dataset we were looking at before. This is DBSCAN, and as I said, the hard thing with DBSCAN is a slightly unintuitive parameter called epsilon; finding the epsilon that fits your data can be a lot of work. So look at what happens as we vary that epsilon parameter. For very small values of epsilon you get lots of tiny clusters and most of your data declared as background noise. Increase epsilon, and background noise gets sucked into your clusters and some clusters merge together. Increase it again, and more noise gets sucked in and clusters get over-merged, until eventually everything merges into one single massive cluster. This is easy to deal with in two dimensions: you apply the advanced eyeball exam, look at a lot of pretty pictures, and say, ah, I've found the right parameter. With high-dimensional data that gets a lot harder to do.

It turns out, though, that as epsilon decreases you have a guarantee that connected components can only fragment into smaller clusters or vanish entirely. That's nice, because it lets you build a hierarchical tree, which we can represent as a dendrogram. We could stop here, say "I have a hierarchical tree, I'll present it to the user and require them to make a cut through it", declare the rest an exercise for the reader, and walk away. But that's still a bit too complex to be useful: these dendrograms get very complicated very quickly for large data, and eventually just look like a big mass of black on your screen. So instead we simplify the cluster tree: we prune the dendrogram back by eliminating any cluster with fewer than min_points points in it, which gives a much simpler, condensed dendrogram. Then we select clusters from this tree, either via a flat cut, which we're comfortable with from before, or via a more advanced technique I'll talk about a little later. That gives us the clustering I wanted when I first looked at this data. It's almost exactly what I wanted: the black points are noise, the number of clusters looks about right, and the clusters are nicely shaped. The algorithm I just described is, very roughly, HDBSCAN.
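In code, that rough description corresponds to something like the following sketch with the hdbscan library the talk is about. The dataset and parameter values are illustrative assumptions, and the `cluster_selection_method` argument is the knob for the "more advanced technique" mentioned above.

```python
# A minimal sketch of the hdbscan library; dataset and parameters are
# illustrative assumptions, not recommendations.
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=6,
                  cluster_std=[0.5, 1.0, 1.5, 0.7, 2.0, 0.9], random_state=3)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=25,             # pruning threshold for the condensed tree
    cluster_selection_method="eom",  # stability-based selection; "leaf" is the alternative
)
labels = clusterer.fit_predict(X)    # -1 marks noise points

# The condensed tree the selection was made from can be inspected directly
# (plotting requires matplotlib/seaborn).
clusterer.condensed_tree_.plot()
```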
So why does density-based clustering work? Let's back up a moment and first define what we want a hierarchical density-based algorithm to do, which means defining a cluster. I'm going to define a cluster to mean a connected component of a level set of the probability density function of the underlying, unknown distribution from which your data samples are drawn. Sorry, let me back up and take that a little more slowly, with some intuition. We presume there is an underlying probability density function from which our data samples are drawn: some regions are less likely to produce data, others are more likely. You can colour that with a heat map of the probability density function, with denser regions more likely to see data show up. Then I can draw contour plots over it, level sets; it's like a topography map, with the level sets climbing as the regions get denser. Now select a level set, cut there, and look at the connected pieces that are enclosed. Each connected piece we call a cluster, and that gives our clustering; everything not enclosed we call background noise. Fine, I'm done, I have a great clustering algorithm.

Except I don't, quite. What's the difference between what I just described and an actual clustering algorithm? First, we don't know the probability density function I drew at the start; the underlying problem for any density-based algorithm is how you go about estimating that density. Second, how did I choose that level set? A hint: carefully, by hand, knowing the answer I wanted ahead of time so I could give a nice talk. In reality, choosing the level set is the problem the hierarchical component of our algorithm has to deal with. And lastly, we'd like the algorithm to be efficient. We can do crazy things for hundreds or thousands of points, but if we want to scale to millions of points we have to think about computational efficiency, and what I just described is, naively, not computationally efficient. So now that we know what we should be doing, let's see what's computationally feasible.

First, we have to locally approximate that density. "Fast and efficient" here means working locally, without looking at all of the data at any given time. Here's what DBSCAN does: it has two parameters, epsilon and min_points. It draws a ball of radius epsilon around each of your data points and says: if you have more than min_points points inside your epsilon ball, I declare you dense. That's very fast to compute; it's all local, just point counting in small regions. But what if, instead of a binary declaration of dense or not dense, we'd like an estimate of the density? Then turn it around: draw around each point the circle of whatever radius would enclose min_points points. In other words, ask what radius epsilon would have to be before DBSCAN declared this point dense. Draw those circles and colour-code them, with smaller, darker circles for denser regions of space and larger, lighter circles for less dense regions. Map those colours back onto the points, and what you have is a fast-to-compute, local point estimate of the probability density function at each point of your data set.
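That radius is usually called the core distance, and it is cheap to get from a k-nearest-neighbour query. Here is a minimal sketch, assuming an illustrative min_points and noting that implementations differ on whether a point counts as its own neighbour.

```python
# A sketch of the "radius that encloses min_points neighbours" density estimate
# (the core distance); min_points is an illustrative choice.
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.08, random_state=0)
min_points = 10

# Note: when querying the training points, each point is returned as its own
# nearest neighbour at distance 0; conventions differ on whether it counts.
nn = NearestNeighbors(n_neighbors=min_points).fit(X)
distances, _ = nn.kneighbors(X)       # shape (n_samples, min_points)
core_distance = distances[:, -1]      # radius needed to enclose min_points neighbours

# Small core distance means a dense region; its reciprocal acts as a crude
# local density estimate at each point.
density_estimate = 1.0 / core_distance
```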
So now we have a fast way to compute a density estimate; how do we compute the level sets? The intuition here is that the connected components of the level sets form a tree, and that's probably best seen with a picture, or even an animation. As the density required for inclusion drops, background points start getting added into your clusters, and as you encounter new branches things merge together. Think of the water level dropping: islands of your cluster appear, then new islands appear, and other islands merge together, until at the base, when the required density has dropped completely, all of the points sit under a single massive cluster. We can play the same game with our data points: picture circles slowly expanding around each point as the density threshold decreases; as points become dense at the current threshold we colour them in, and when components meet we merge them. You watch the tree form, and it ends in a single root.

That would be fine if we were dealing with pure Euclidean distance, but the key point is that we're not: HDBSCAN works with something called mutual reachability distance. All that says is that before two components can connect, they shouldn't only be close in Euclidean space, they should also be dense. We'd like to encode both of those conditions into a single distance, and that's the mutual reachability distance. It's essentially a new distance function given by the maximum of three things: the distance between the two points, and each point's core distance, that is, its distance to its k-th nearest neighbour. This has the effect of pushing sparse points away from the rest of the data, and it's also what makes a flat cut through the tree we're about to build equivalent to running DBSCAN for a particular parameter epsilon. A little foreshadowing: it turns out that mutual reachability distance satisfies the triangle inequality, which will come up later in the talk and bring us, or at least me, great joy.
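Written out, and roughly following the notation of the HDBSCAN papers (this is my transcription of the spoken definition, not a quotation), the two quantities are:

```latex
% core distance: the radius at which DBSCAN would declare the point dense
\operatorname{core}_k(a) = d\bigl(a,\ \text{the $k$-th nearest neighbour of } a\bigr)

% mutual reachability distance: points must be both close and dense to be "close"
d_{\mathrm{mreach},k}(a, b) = \max\bigl\{ \operatorname{core}_k(a),\ \operatorname{core}_k(b),\ d(a, b) \bigr\}
```

The property the talk leans on later is that this quantity still satisfies the triangle inequality, which is exactly what the dual-tree machinery described below needs.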
So now we've approximated the level sets with a tree, and we want to extract a flat clustering from it. Again, here's the dendrogram; we could do a flat cut across it, which would be equivalent to running DBSCAN for a given epsilon, and we could be done. But we'd like to do something a little smarter. So we build that condensed dendrogram again, pruning out any cluster of size less than min_points. The intuition is that as a cluster decays, any time one or two points, or any number fewer than min_points, fall out of it, I just say the cluster is shrinking; I don't let it spawn a new cluster. Clusters shrink until they hit a point that splits them into two sufficiently large chunks, and you get a much simpler representation of the dendrogram. Here I've coloured the dendrogram by cluster size, and the width of the bars also reflects cluster size, so you can see the clusters fading away and then vanishing.

Now we need to select a clustering from this tree. A flat cut would work, but we'd like something more natural, something that doesn't force you to go and stare at the dendrogram every time you want the natural clustering. So what we talk about is the persistence of the clusters: we want clusters that live for a long time in this density plot and that are large, containing lots of our data. That amounts to simply maximizing the amount of ink on the page, subject to the constraint that if you select a cluster you're not allowed to select any of its descendants. And we're done; a very natural clustering falls out immediately, and it's the clustering I showed you earlier. That's HDBSCAN.

So that is the HDBSCAN algorithm, described in a couple of papers by Campello, Moulavi, and Sander. It's a great advance in clustering; I can't say enough good things about it, and I think it should be among the first clustering algorithms anyone runs on their data. Unfortunately, even though it's a fantastic, cutting-edge algorithm, it's O(n²), which can limit its applicability if you have large amounts of data. So what can we do? We can try to make it fast.

How do we go about that? The key is to realize that under the hood this closely resembles a single linkage computation, which can be reduced to a minimum spanning tree computation. Prim's algorithm is one of the best algorithms around for computing minimum spanning trees, but that's only true when the number of edges is a small multiple of the number of vertices in your graph. Unfortunately, we have a complete graph with n² edges, so Prim's isn't going to be fast or efficient in this particular case. What alternative algorithms might we make use of? There's another algorithm called Borůvka's algorithm, which on a graph can conceptually be thought of as a parallel version of Prim's. We start with all the vertices in their own separate components; then, for each component, we find the shortest edge that touches the component but isn't inside it, and we merge the components those edges connect. Repeat the process over and over until we're left with a single component, and that component is the minimum spanning tree of your graph. The algorithm is O(E log V), which means we aren't doing any better than Prim's so far. But as I mentioned, this isn't a general graph; we have points in space, so it's a spatial graph. What does Borůvka's algorithm look like on a spatial graph? Start with every point in its own component, query each point for its nearest neighbour that is not within its component, add the smallest such edge for each component, merge, rinse, repeat, and we end up with a single big component that is the minimum spanning tree. That's almost exactly what I just said, except I used a slightly different word: I said "nearest neighbour" rather than "shortest edge". So if we can find a way to speed up nearest neighbour computations, we can speed up the whole algorithm, and we're good to go.
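Here is a minimal sketch of the graph version of Borůvka's algorithm on a dense, precomputed distance matrix. It is the conceptual O(E log V) version just described, not the accelerated dual-tree variant discussed next, and the union-find bookkeeping is my own illustrative implementation.

```python
# A sketch of Borůvka's algorithm on a dense distance matrix (the conceptual
# graph version, not the dual-tree version used in practice).
import numpy as np

def boruvka_mst(dist):
    n = dist.shape[0]
    parent = np.arange(n)          # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    mst_edges = []
    n_components = n
    while n_components > 1:
        # For each component, find the cheapest edge leaving it.
        cheapest = {}              # component root -> (weight, u, v)
        for u in range(n):
            ru = find(u)
            for v in range(n):
                if ru == find(v):
                    continue
                w = dist[u, v]
                if ru not in cheapest or w < cheapest[ru][0]:
                    cheapest[ru] = (w, u, v)
        # Merge components along their cheapest edges.
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst_edges.append((u, v, w))
                n_components -= 1
    return mst_edges
```

Run over mutual reachability distances rather than raw Euclidean distances, the spanning tree this produces is the one the single linkage hierarchy is built from; the all-pairs inner loop is exactly what the spatial-index tricks below replace with nearest-neighbour queries.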
Well, as everybody knows, if you want to speed up nearest neighbour queries you look at spatial indexing trees. The basic idea of a spatial indexing tree is that you recursively partition your space into small regions; when you want to query a new point, you drop it down the tree until you hit a leaf node, and then you only have to compare the distance from your new point to the points within that leaf, because the tree gives you bounds on how far you can be from any point in a given region. Spatial indexing trees are well known and do wonderful things for you.

The intuition I've just described has been put into practice in a great paper by March, Ram, and Gray on dual-tree Borůvka for Euclidean minimum spanning trees. It's a wonderful paper, it actually provides good complexity bounds, and it works for k-d trees and for cover trees. The complexity bound looks slightly intimidating, but those c's are just small data-dependent constants, and the alpha is the inverse Ackermann function, which takes values between roughly three and four for any n ranging from 60 up to 10 to the 80. So if you're clustering all of the data in the universe it's maybe four; otherwise it's a small constant. Let those vanish and you're looking at an algorithm that is essentially n log n instead of n². Except that thing from earlier comes up again: we're using mutual reachability distance, not Euclidean distance. That's okay, because it turns out the only thing the algorithm requires is that the triangle inequality holds, and it does hold for mutual reachability distance. So with a little finagling and a fair amount of technical finesse, you can make the dual-tree Borůvka algorithm work over mutual reachability distance instead of Euclidean distance, and we're good to go.

So now we have an algorithm that theoretically runs in n log n, and you say: sure, but those constants might be silly large and your algorithm could be garbage. So I ran experiments. These are scaling experiments comparing the Python implementation of HDBSCAN we developed, now in scikit-learn-contrib, with the reference Java implementation of HDBSCAN. Because we're comparing implementations as well as algorithms, you should take this with a giant grain of salt, maybe a boulder of salt: the difference might come down to the details of efficient coding rather than the algorithm itself. The real takeaway is that the lines have wildly different shapes. On the left you have the raw timings, on the right the log-log plots, and those shapes are very indicative of an n² algorithm versus an n log n one. So it looks like we have a practical, efficient n log n algorithm running on our data.
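If you want a feel for the scaling yourself, a rough benchmarking sketch along these lines is easy to put together. The dataset sizes, parameters, and the arbitrary DBSCAN eps below are my own assumptions, and absolute times are of course machine dependent.

```python
# A rough timing sketch; sizes, parameters and eps are illustrative, and the
# absolute numbers are machine dependent.
import time
import hdbscan
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

for n in (1_000, 5_000, 25_000):
    X, _ = make_blobs(n_samples=n, centers=10, n_features=10, random_state=0)
    models = [
        ("hdbscan", hdbscan.HDBSCAN(min_cluster_size=15)),
        ("dbscan", DBSCAN(eps=1.5, min_samples=15)),   # eps chosen arbitrarily here
        ("kmeans", KMeans(n_clusters=10, n_init=10)),
    ]
    for name, model in models:
        start = time.perf_counter()
        model.fit(X)
        print(f"{name:8s} n={n:6d} {time.perf_counter() - start:6.2f}s")
```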
So you say: okay, you're faster than the reference HDBSCAN, that's good, but I've got lots of other clustering algorithms I could use; how do you compare against those? That's an excellent question; let's run that experiment as well. Here we grab a whole slew of the algorithms in scikit-learn, partly to minimize any effect that comes from the implementation language rather than the algorithm, and run them all. You quickly see they group into three natural clumps: the very fast ones at the bottom, the medium-speed ones, and the quite slow ones near the top. Throw out everything but the fastest ones, scale the number of data points further out to the right, and you get a more interesting comparison. Here HDBSCAN is the dark green line, and the two k-means implementations, from scikit-learn and SciPy, are the bottom lines. The interesting one is DBSCAN, the blue line, because I said earlier that running HDBSCAN is equivalent to running DBSCAN for every value of epsilon, and yet here DBSCAN is run with a single value of epsilon and we've managed to be faster, thanks to the computational tricks we're using to speed things up. That's nice, but take it with a grain of salt too, because DBSCAN's performance can vary wildly depending on the epsilon you choose. It's a really good algorithm, it's fast and solid, but if you pick epsilon too small it declares every point noise, blindingly fast: it can do nothing really quickly. If you set epsilon too big it grinds to a halt. Neither of those cases is fair to DBSCAN, so if you want to compare against it you need to find a parameter for which it's doing something meaningful on your data and compare against that.

The way we got around this was a massive hyperparameter search. We ran HDBSCAN and extracted its persistent, natural clustering from the tree; then we searched across a whole slew of DBSCAN parameters for the clustering that best matched the HDBSCAN clustering, so we could guarantee DBSCAN was doing something meaningful with the data. Then we plotted the timings of those DBSCAN runs against the timing of an HDBSCAN run. The short takeaway: the green line is HDBSCAN, the blue line is DBSCAN, and they perform very similarly. In other words, with a single run of HDBSCAN you get meaningful clusterings for all of the epsilon parameters of DBSCAN. As a result, even if you like DBSCAN more than HDBSCAN and you want those flat cuts, the right way to go about it is to build the tree once and then choose the cut you want out of it. There's not really a reason to run DBSCAN any more: run HDBSCAN and pull the results out.

So we have an accelerated HDBSCAN, and it's very easy to install. Reproducibility matters, so all of these experiments are available as notebooks and Binders; you can load them up, run them, and play with them yourselves. Because the library is in scikit-learn-contrib, the interface matches scikit-learn, which means that if you have notebooks or data analyses using k-means, hierarchical clustering, or any of the other clusterers from scikit-learn, you can install this, swap those algorithms out for HDBSCAN immediately, and look at what it does on your data. I encourage you to do that, and whether you like what you see or not, give us a shout; we'd be very interested to hear about your experiments and your experiences clustering your data.
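A minimal sketch of that drop-in swap, including pulling a DBSCAN-style flat cut out of the tree you've already built, might look like the following. The dataset, parameter values, and cut distance are illustrative assumptions, and the attribute names are those of recent versions of the hdbscan library.

```python
# A sketch of swapping HDBSCAN into a scikit-learn-style workflow; values are
# illustrative, attribute names follow recent versions of the hdbscan library.
import hdbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=2000, noise=0.08, random_state=42)

clusterer = hdbscan.HDBSCAN(min_cluster_size=25)
labels = clusterer.fit_predict(X)      # same fit/fit_predict interface as scikit-learn

# If you still want a DBSCAN-style flat cut, take it from the tree you already
# built instead of re-running DBSCAN for each epsilon.
flat_labels = clusterer.single_linkage_tree_.get_clusters(
    cut_distance=0.1, min_cluster_size=25
)
```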
Any questions? [Applause]

Audience: If you go back to the metric you chose, the mutual reachability distance: have you considered putting weights on the first two terms, or just the last one, say a factor of five in front of the distance term? Would that change anything, or would it just be weird?

John: It would be weird, I agree. You would no longer be talking about density; you'd be talking about some other metric, because this is all about making sure a point is dense, as well as close, under DBSCAN's definition of density. So you'd be changing the definition of density, which you could totally do; the definition of density is very much a thing you could play with. Thanks.

Audience: You mentioned the triangle inequality a number of times; could you define it?

John: Sure. If you and I are a certain distance apart, and Leland and I are a certain distance apart, then the two of you have to be closer to each other than the distance of going through me: d(x, z) ≤ d(x, y) + d(y, z). That's part of the basic definition of what we call a metric space, and you need it to hold for your intuitions to match up: it's always at least as fast to go the direct route as to detour through a third point. If that doesn't hold, it breaks the ability to use the spatial trees we were talking about. Does that make sense? If not, we can grab a whiteboard afterwards and I can grind through more details.

Audience: Algorithmically, if my data is larger than memory, what is the right way to build a minimum spanning tree? Not naming tools, just how one builds the algorithm.

John: My short answer is to buy a bigger computer; that was our solution. My longer answer would be to talk to a Dask developer and see if you could work together to get this built on top of Dask. I may be pandering to the person who asked the question there. The very short answer is: it's hard.

Audience: Where does spectral clustering fit in your taxonomy?

John: Spectral clustering is flat and parametric. Traditional spectral clustering works by inducing the nearest neighbour graph, taking its Laplacian, doing an eigendecomposition, embedding the data into a low-dimensional space, and then running k-means on that low-dimensional data; so it's a spatial transform followed by k-means. You could do something sneakier: if you're willing to be "spectral-ish", you could embed down into a low-dimensional space and then run hierarchical clustering, or something else, across it, and that would move it out of that category. This is where dimension reduction and clustering start intersecting, which makes it a very natural move from clustering research into UMAP research, which means you should go see Leland's talk on UMAP. Thanks.
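For reference, here is a rough sketch of the "traditional spectral clustering" pipeline described in that answer; scikit-learn's SpectralClustering bundles the same steps, and every choice below (neighbour count, normalized Laplacian, dense eigensolver) is an illustrative simplification.

```python
# A rough sketch of the traditional spectral clustering pipeline: kNN graph ->
# Laplacian -> eigendecomposition -> k-means. All choices here are illustrative.
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=1000, noise=0.06, random_state=0)
n_clusters = 2

A = kneighbors_graph(X, n_neighbors=10, include_self=False)  # 1. nearest-neighbour graph
A = 0.5 * (A + A.T)                                          #    symmetrise it

L = laplacian(A, normed=True)                                # 2. normalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L.toarray())               #    dense solve is fine at this size
embedding = eigvecs[:, :n_clusters]                          #    eigenvectors of smallest eigenvalues

labels = KMeans(n_clusters=n_clusters, n_init=10,            # 3. k-means in the embedding
                random_state=0).fit_predict(embedding)
```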
Audience: What are your thoughts on using t-SNE to make data more visualizable, either before or after running HDBSCAN?

John: My short answer is that I would suggest you use UMAP instead.

Audience: And what is UMAP?

John: Uniform Manifold Approximation and Projection. You can pip install it; the easiest way to find out more is Leland's SciPy talk on UMAP. t-SNE is a fine embedding technique, and you can run it and look at the clusters that come out, but one issue I've had in practice with running t-SNE before HDBSCAN is that the embeddings I get out of t-SNE are often unstable: run the algorithm twice and you get different clusterings out. That typically makes my end users very uncomfortable, and they trust the clustering less. I don't want to inject that noise; I want very stable results, so I really want a stable embedding technique before running a clustering algorithm.

Audience: Is it easy to explain the main difference between UMAP and t-SNE?

John: No, but I will give you a pointer to the slides immediately afterwards; it's not something for the short time we have right now.

Host: I think we have time for one more question, if anyone has one. Okay, we can thank our speaker again.
Info
Channel: PyData
Views: 25,539
Id: dGsxd67IFiU
Length: 34min 8sec (2048 seconds)
Published: Fri Feb 01 2019