Brian Kent: Density Based Clustering in Python

Captions
Okay, I think I'm going to go ahead and get started — it's five to two. Thanks everyone for coming to my talk, Density Based Clustering in Python. My name is Brian Kent, I'm a data scientist at Dato. We're based in Seattle; hopefully you've seen our big fuchsia banners everywhere — if not, stop by our booth after the talk. This is not really a Dato talk; I'll mention one of the tools in Dato, but I really want to talk about the method today.

So I want to talk about density-based clustering, but before I get started on that, a little bit of a baseline on what clustering is. This might be pretty introductory for a lot of people, so I'll move quickly through it, but it's important to start here. The idea is that we have a bunch of data instances — data points. Here I'm using the two moons data, drawn with scikit-learn at random state 19. By the end of this talk you'll be really sick of this data set; I don't want to spend too much time explaining the data, so I'm going to keep using it over and over again — bear with me. The point of clustering is to group the data points so that similar instances are in the same group and dissimilar instances are in different groups. Sounds pretty easy, but we don't have a target variable, so we're not going to split this into training and testing phases — we need to do this without any supervision. And this is what we want the answer to be: I've colored the points based on the true groups they were drawn from, so we want some output like this blue and red split, where we assign each point to a cluster. So that's clustering.

Why do we do this? I came to clustering from an academic context, and academia sort of presents it as an axiomatic thing we would want to do — a foundational, fundamental task in statistics. But now that I'm at a company trying to sell a product, I've found it's actually a lot harder to sell clustering as something commercial users want to do. Things like recommender systems and classification and regression kind of sell themselves — churn prediction, etc. — everyone knows what those applications are. It's been a little harder to find selling points for clustering, so I made a list of the reasons people do clustering, and if there are others, awesome — let me know after the talk, I'd love to add to my list.

The first reason — and this was mentioned in a lightning talk yesterday — is exploring and visualizing complex data. Clustering is often one of the first things we do as humans when we start working with a new data set, to get some sense of its structure. Another thing people use clustering for is to reduce the scale of the data in some way; that can be done in either dimension — reducing the number of instances or reducing the dimensionality. A very common application, and one that does have a lot of commercial use, is detecting outliers, which to me is one flavor of anomaly; there are lots of other types of anomalies — small clusters, for example, or clusters with small mass but high density — all of these could be considered anomalies. Deduplication is a fascinating clustering example, because we usually think of clustering as finding a small number of groups — in the two moons example I have two groups — but with deduplication, if you have n records you might want n minus two clusters; there might only be two duplicates, or something like that. So it is a clustering problem, but it's a very specific form of clustering, very different from the usual applications.
Finally, people often talk about using clustering to segment a market or a user base. I've put this last because I'm not really sure what the purpose of that is — usually there's some other purpose — but it's the kind of thing people do when they don't know what else to do to get started. So that is a reason: if you don't know what else to do, try clustering.

There are a few key takeaways for today, now that we've got the basics down. The first is that k-means is not always a good option — I think that was the big point of the lightning talk yesterday, and I agree completely. There are reasons to use k-means, and we'll go into those, but it's not always a great option. The point of this talk is that density-based clustering is a particular flavor, or family, of methods that is sometimes a good alternative. DBSCAN is the most popular form, so I'll spend a lot of time explaining how that works. Level set trees are something I've worked on for a long time; I believe they're a more powerful method than DBSCAN. And finally, I'll go through some code fairly quickly, with scikit-learn, with GraphLab Create, and with my own package, DeBaCl.

K-means is kind of the default clustering method. Part of the origin of this talk is that I was talking to someone in DevOps who had done a clustering problem for some system, and when I asked him why he used k-means, he said it was the only thing he knew. That's why I wanted to talk about different kinds of clustering: k-means is the default, but there are other things out there. Why did it become the default? Well, it's a very simple algorithm, for one. There are lots of resources for learning how k-means works, and lots of implementations — pretty much any generic statistics or machine learning package has some form of k-means in it. And it scales very well: if you have truly massive data, k-means can be a good option, especially with some of the more recent variants — mini-batch in particular can be extremely fast and still yields pretty good results. As an aside: if you're doing k-means with the original algorithm, you're probably doing it wrong. I can't guarantee that — I'm sure there are good uses for it — but you're probably doing it wrong; if you want to know more, come see me afterward.

Okay, so here's an example where k-means is not the only answer. The first question we have to ask is how to choose K in k-means. There is the elbow heuristic, or the gap statistic, where we try k-means with a bunch of different values of K, find where the error stops decreasing quickly, and choose that as our value of K. That sounds great in principle, but in practice it doesn't work that well in my experience. Sometimes choosing K is impossible — deduplication is a great example. If K might be in the range from 2 to 20, it's one thing to try the gap heuristic; if K is in the range from 1 to n, and n is 10,000, that's pretty tough. And finally, k-means only finds spherical, convex clusters, which is why I chose the two moons example: k-means fails on it. Here we have clustered with k-means into red and blue points, and the answer is terrible.
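To make that failure concrete, here is a minimal sketch (my own, not the speaker's notebook) of running k-means on the two moons data with scikit-learn; the sample size and noise level are illustrative choices. Because k-means can only carve out convex, roughly spherical clusters, it slices each moon in half instead of separating them.

```python
# k-means on the two moons data: convex clusters, wrong answer.
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

X, true_labels = make_moons(n_samples=500, noise=0.05, random_state=19)

# Two clusters, as in the talk; k-means still cannot recover the moons.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels)
plt.title("k-means on two moons")
plt.show()
```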
Okay, so the next part of the talk is an overview of density-based clustering — what we want to do with density-based clustering. Don't think of this yet as a specific algorithm; this is a concept of what we would want to achieve. The premise is that the data is drawn from an unknown underlying probability density function. This is a statistical view of the world: there is some unknown PDF and we're drawing data from it. The PDF describes, for any point in our space, how likely it is that a data instance is drawn from that point. What we would want to do is take this data and estimate that PDF. I've drawn the estimated PDF here with color and contour lines, so the regions where the data lies in those two moons are where we have high estimated density. Then we threshold this estimated density function and find the upper level set: here I've cut it at 0.2 and taken the region of our space that has estimated density higher than 0.2. From there, what we would like to do is find the connected components — keep this in mind, I'll come back to it in a minute — in this case two connected components of that upper level set. Then we intersect our original data with that upper level set and use that intersection to label the data points themselves. That would be our final clustering with this ideal density-based clustering routine.

Pros and cons of this idea. First, we can recover much more complex cluster shapes — we get these moons, which k-means couldn't do; it can only recover spheres, or hyperspheres. We don't need to know K, which is a big benefit; of course we have to know some other parameters depending on which specific method we choose, but we don't need to know the number of clusters a priori. And we can automatically find outliers: on that last image we had some black points that weren't assigned to any cluster — those are good candidates for outliers. Those are the pros. The cons: we need some function to measure the dissimilarity, or distance, between a given pair of points. Some methods encode this automatically, others let you specify whatever you want, but we need some measure of dissimilarity. Incidentally, k-means does this implicitly, and that's one of its problems — you can't change the distance without changing the algorithm. It's generally not as scalable as k-means, because we have to compute a lot of those distances. And last but not least, it's impossible to do exactly what I just described, because computing those topologically connected components is computationally intractable. I don't want to get into the details of what I mean by "topologically" — this is just common-sense connected components: we have these regions in space and we need to decide whether there's any path between them, and doing that exactly is not computationally feasible.
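Here is a rough sketch of that idealized recipe, assuming a kernel density estimate stands in for the unknown PDF; the bandwidth and the 0.2 level are illustrative choices, not prescriptions. The hard step — finding the topologically connected components of the upper level set — is exactly what the practical algorithms below approximate.

```python
# The idealized density-based clustering recipe, up to the intractable step.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KernelDensity

X, _ = make_moons(n_samples=500, noise=0.05, random_state=19)

# Step 1: estimate the density at each data point.
kde = KernelDensity(kernel="gaussian", bandwidth=0.1).fit(X)
density = np.exp(kde.score_samples(X))

# Step 2: threshold to get the points lying in the upper level set {x : p(x) > t}.
t = 0.2
in_level_set = density > t

# Steps 3-4 (connect the components, then label the points) are what DBSCAN
# and level set trees actually make computable.
print("points above the density threshold:", in_level_set.sum())
```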
This is where we start getting into many different density-based clustering methods — specific flavors of this idea. Because there are several ways to measure the quality of a clustering, and none is nearly as clear-cut as a supervised metric, these methods have proliferated; but DBSCAN is far and away the most popular form of density-based clustering. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. People assume it has something to do with databases, and there is a loose connection there, but it really is not about databases. It's from Martin Ester and co-authors in 1996, so it's a relatively old algorithm now, but it won the test-of-time award just last year at the KDD conference, so people are still using it — in fact it has amassed some 7,400 citations, far beyond the scikit-learn threshold for implementing a model, which is something like 200 or 300 — okay, good — and isolation forest made it in anyway.

The main idea of DBSCAN — again at a very high level — is that we partition the data into three types of points. This isn't the clustering yet, but each point gets assigned to either the core group, the boundary group, or the noise group. The core points are the high-density points, the points at the center of clusters loosely speaking — they have lots of neighbors within a certain distance around them. The next step is to connect these core points into clusters, then assign the boundary points to a nearby cluster, and the noise points are left as noise.

Now there's a great visualization, if my internet is working — let me click out of here and see if we can get this going. This is not mine, I don't know the author, but it's a fantastic visualization of how this works on a local scale, in a breadth-first way, as opposed to k-means. We start with this very noisy data set; I'll bump the parameters up a little so it runs faster, and off we go. We start with an arbitrary point, expand the neighborhood around it to include other core points, and keep doing this in a breadth-first way until we run out of nearby core-point neighbors. This blue cluster keeps expanding around the outside of the smiley face, even though we're far beyond a convex shape at this point — it just keeps growing in a local fashion. The red cluster has already stopped; there are no more neighbors to be found around it, since that region is random noise. This isn't something I've rehearsed — it's always different, which is cool, but I don't know exactly how it will go. Hopefully we get some new clusters now with the mouth and the other eye... it's moving slowly, slowly... yeah, it's working well. These black points are the noise points, which won't be assigned to any cluster.

It's fairly robust to ordering. Order does matter, but only in a very small way: it matters in how the boundary points are assigned, because they go to the first core cluster that reaches them. If the algorithm picks a noise point and there aren't enough neighbors around it, it just marks it as visited and moves on to the next point. So it's pretty robust. Is that pretty clear? I love that illustration. Any questions about that? (Answering an audience question) I don't think that point was reassigned — I think it was on the edge, a weird case, but I don't think it was reassigned. It shouldn't be. I don't know whose visualization this is — Naftali Harris's, maybe? — so there could be a bug, but unless there's a bug, I don't think it could have been reassigned if it's done right.
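To make the three point types concrete, here is a toy sketch (my own, not from the talk) that classifies points as core, boundary, or noise using brute-force distances; eps and min_samples are illustrative values. The actual clusters then come from connecting core points whose neighborhoods overlap, which is what the visualization animates.

```python
# Partitioning points into DBSCAN's core / boundary / noise groups.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import pairwise_distances

X, _ = make_moons(n_samples=300, noise=0.05, random_state=19)
eps, min_samples = 0.3, 10

D = pairwise_distances(X)
neighbor_counts = (D <= eps).sum(axis=1)   # counts include the point itself

is_core = neighbor_counts >= min_samples
# A boundary point is not core but has at least one core point within eps.
has_core_neighbor = ((D <= eps) & is_core[np.newaxis, :]).any(axis=1)
is_boundary = ~is_core & has_core_neighbor
is_noise = ~is_core & ~is_boundary

print(is_core.sum(), "core,", is_boundary.sum(), "boundary,", is_noise.sum(), "noise")
```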
(Answering an audience question) The question is about high-dimensional spaces. I'm not sure about that specific point, but in general high-dimensional spaces are going to be a problem for these distance-based methods, because all the distances start to look similar as the dimension gets really high. There is a middle ground — part of the reason we built this into GraphLab Create is to extend the method to medium-dimensional data, like images — but once you get into really high dimensions it's not going to work that well; none of these methods will.

Okay, now I want to show how to do DBSCAN in scikit-learn, so let me go back to a Jupyter notebook. I'll bump the font size up a little — how's the font size, good? Okay, great. We import our libraries: matplotlib so I can show what's happening, the make_moons function from scikit-learn so I can generate the data, and the DBSCAN class from scikit-learn. And of course I'll use the ggplot style in matplotlib, because it's awesome. (Is the notebook posted? Not at the moment, but I'll post it — good idea.) Here are the top five rows of our data, just to see what it looks like: an n-by-2 matrix, very straightforward.

The scikit-learn code is really simple — this is not meant to be a tutorial on building super high-level, whiz-bang machine learning systems. It's just model = DBSCAN(...). The trick is the parameters, and there are two main ones. The first is epsilon (eps here), which defines the size of the neighborhood around each point. The second is min_samples, the number of points that have to be within that epsilon neighborhood for a point to be considered a core point — min_samples is basically our density-level threshold. When I picked 0.2 on that contour plot earlier, that's analogous to the min_samples value. So we make the model and print it. scikit-learn doesn't print much when you print the model, but we can see the parameters used to build it, and with fit chained on we've already fit it to the data.

What comes out of this? Pretty straightforward: the model labels. These are our cluster labels — it renders a little strangely here, but basically we have ones and zeros, because we have two clusters with the moons example. There's one negative one in there, which marks a noise point. And there are the core sample indices as well, which is useful: which points are the core points. We can check that this worked pretty well with a matplotlib plot — we get some strange warning, but the red point is the noise point and the gray and orange points are our clusters. So it works pretty well with those parameters, the epsilon and the min samples.

(Answering an audience question) The reason that point is noise is that it is not itself a core point — it doesn't have enough neighbors — and none of its neighbors are core points either, so it can't be a boundary point; there's no core point to assign it to. The point next to it is probably a boundary point, and this one too — that's fair.
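Here is roughly the scikit-learn snippet walked through above, reconstructed from the description (the exact eps and min_samples values used in the talk weren't stated, so these are illustrative).

```python
# DBSCAN on the two moons data with scikit-learn.
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

plt.style.use("ggplot")

X, _ = make_moons(n_samples=500, noise=0.05, random_state=19)

# eps is the neighborhood radius; min_samples is how many neighbors a point
# needs inside that radius to count as a core point (the density threshold).
model = DBSCAN(eps=0.3, min_samples=10).fit(X)
print(model)

labels = model.labels_                   # cluster labels; -1 marks noise
core_idx = model.core_sample_indices_    # indices of the core points

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()
```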
(Answering an audience question) Yeah, that's a great question — the second part of the talk is about level set trees, which are in part intended to get around that problem. And yes, it does support more distances, which is one of the nice features of these distance-based methods: you have to compute some distances, but you get to choose your distance, and there's a pretty wide range of them — I'll come back to that for the GraphLab Create version as well.

The complexity is a funny issue — it's kind of a funny story, because the original paper says it's n log n if you already have a nearest-neighbors indexing structure that lets you quickly look up the region around each point, which is wonderful if you have that, but I usually don't. What's funny is that a paper came out this year or last year that actually finds a mistake in the original paper and corrects the rate to worst-case n squared. In practice, n log n is not a terrible guess if you have an indexing structure; that's the bigger part of the work. Other questions? Okay, great.

Let me go back to the slides real quick and talk a little about DBSCAN in GraphLab Create. The question is, why did we do this if it's already in scikit-learn? The first reason is that we wanted to build on top of the GraphLab Create SFrame and SGraph data structures. If you caught the earlier talk or come by the booth, you know these are our scalable data structures — on-disk, parallel computation, etc., both open source, both awesome — and we wanted a tool that works with them natively.

The second reason is the distance question. We have this idea of a composite distance, because a lot of our customers and users want to work with features of different types: they might have string data, say a house address, and numeric data for lot size and square footage. They want some function that measures the dissimilarity between two points based on those different features, and Euclidean distance, for example, only works on the numeric data. So we have a composite distance, where you can combine different distance functions on different sets of features, and this can be passed to DBSCAN as well. That said, there's a big caveat: the DBSCAN algorithm doesn't explicitly say which properties a distance function needs in order to work well, so if you give it a really weird distance function, you might get really weird answers back.

The third reason is that the original algorithm — I didn't go into this in great detail — is basically doing a connected-components search on a graph, and we thought that by doing that search directly we could use a more efficient algorithm and speed things up a little. In fact, when we wrote it, it was faster than scikit-learn; scikit-learn has since made some great improvements, so it's pretty much on par now — I'd call that a wash. Lastly, DBSCAN in GraphLab Create is not open source, but it's free for non-commercial use and free to try, so check it out.

Now I want to show what that looks like in GraphLab Create. I don't want to dwell too much on the SFrame code here: I import graphlab, convert the numpy array into an SFrame, create a model — I get a little bit of output — and then we look at the summary of that model.
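For reference, a sketch of what that GraphLab Create version looks like. GraphLab Create has since been retired, so the function and parameter names here (gl.dbscan.create, radius, min_core_neighbors, the composite-distance list format, and the cluster_id output field) reflect my recollection of its API and should be treated as assumptions rather than a verified reference.

```python
# DBSCAN in GraphLab Create on the two moons data (API names are assumed).
import graphlab as gl
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=19)
sf = gl.SFrame({"x0": X[:, 0].tolist(), "x1": X[:, 1].tolist()})

# A composite distance: a list of (feature list, distance name, weight) terms,
# so numeric and string features can each get an appropriate distance.
composite = [[["x0", "x1"], "euclidean", 1.0]]

model = gl.dbscan.create(sf, distance=composite,
                         radius=0.3, min_core_neighbors=10)
model.summary()

# The output table (assumed to be model.cluster_id) holds, for each point,
# its cluster ID and its type: core, boundary, or noise.
print(model.cluster_id.head())
```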
That summary is the key thing I like to look at here: it gives some sense of what went into the model. We have the number of examples, the schema of the model so to speak — the number of columns, two in our original data set — and our parameters: the radius parameter (we call it radius instead of epsilon), the minimum number of neighbors to be a core point (our density-level threshold), training time, and so on. How do we use it? We get a table with the cluster ID for each point, and we also get a column indicating the type of each point, which is useful if you want to, say, plot your data by instance type — for example the boundary points, which are points where you have less certainty about which cluster they belong to, or the noise points. So I'll do that; the machinery isn't too important, let's ignore it for now — I just wanted to show how you'd use this result to make a plot with the boundary points. In this plot, the gray points are the boundary points according to the DBSCAN algorithm, the red points are the core points, and again we have this orange point, which is our noise. Sorry for the changing colors — I'm still getting used to the matplotlib color maps. Any questions at this point? Great.

Okay, there's one other example I wanted to show, which starts to get at that idea of different distances, and it's a sort of deduplication — I'm using the term deduplication loosely. Basically I take a small sample of Wikipedia articles — pardon the rendering, the titles have lost their spaces for some reason — with a title and then text: "Aquaculture" and then "output is reported from aquaculture worldwide...", "Ad hominem", and so on. So this is a random sample of Wikipedia article text. There's a little bit of feature engineering: I convert each article into a bag-of-words representation — the details aren't super critical; if you're interested, please talk to me afterwards — so instead of a long string of text per article we have a dictionary giving the number of times each word appears, and there's a step that removes stop words.

Then we create the DBSCAN model. I'm not going to run it here — it takes about a minute and I want to keep moving — so I'll load a pre-trained model, but the idea is that I'm using Jaccard distance rather than Euclidean distance, so I've changed things up (yes, this is GraphLab Create). My radius here is 0.6, and with Jaccard distance I actually have some intuition about what that means: roughly, the ratio of the intersection of the word sets to their union between two articles has to reach a certain level. So I load this already-trained model and show the summary: we have roughly fourteen and a half thousand articles and one hundred and twelve clusters. This isn't exactly deduplication, but it's closer to that than to the other applications.
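A hedged sketch of that bag-of-words plus Jaccard-distance workflow is below. The data path is a placeholder, and while graphlab.text_analytics.count_words, stopwords, and SArray.dict_trim_by_keys match my memory of the GraphLab Create text utilities, the exact names and the min_core_neighbors value are assumptions.

```python
# Near-deduplication of Wikipedia articles with Jaccard distance (sketch).
import graphlab as gl

wiki = gl.SFrame("wikipedia_sample")                 # placeholder path to the article SFrame
wiki["word_bag"] = gl.text_analytics.count_words(wiki["text"])
wiki["word_bag"] = wiki["word_bag"].dict_trim_by_keys(
    gl.text_analytics.stopwords(), exclude=True)     # drop stop words

# radius=0.6 with Jaccard distance: two articles are neighbors when the overlap
# of their word sets is large enough relative to their union.
model = gl.dbscan.create(wiki, features=["word_bag"], distance="jaccard",
                         radius=0.6, min_core_neighbors=5)
model.summary()
```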
One thing I wanted to touch on real quickly is the Canvas feature in GraphLab Create, to show the distribution over cluster IDs. First of all, about 14,000 of our fourteen and a half thousand articles are undefined — these are the noise points. The clustering routine found a lot of noise, which is okay in a deduplication setting: noise doesn't mean it's bad, it just means we didn't find anything to cluster those articles with. We can look at the distribution of clusters and see that it's pretty skewed: there's one set of articles that are really similar to each other, and the rest are kind of off in their own little worlds. Let me go back and explore the clusters now, just to show what they look like. I'll join the original data back to my cluster IDs and print some of them out, and you can see that this particular cluster, number ten, is national historic places: "Glencoe, Maryland ... was listed in the National Register of Historic Places", "Douglass Place ... was listed on the National Register of Historic Places", and so on. So we have some reason to believe this is working, even with a non-traditional distance on some fairly complex data. Questions at this point? Okay, cool.

(Answering an audience question about scaling) Good question. There are two big pieces: finding the nearest neighbors for each point — the all-points nearest-neighbors problem — and then finding the connected components in the graph. The nearest-neighbors part is its own well-studied problem; it can scale depending on the attributes of the data — it's not necessarily an all-pairs, n-squared problem — but it is computationally expensive. The connected-components part is not nearly as bad, but it also takes some time. Right now it's brute force: we don't have the dual-tree version enabled yet, partly because we want it to be as general as possible. One of the next steps is to make the choice of indexing structure for the nearest-neighbors part more automated; for now it's brute force because that's the most flexible.

Okay, I'm going to switch gears a little now and talk about why DBSCAN itself is not perfect and why level set trees are better than DBSCAN. That's a bit of an aggressive claim — I'm not really that aggressive about it, I'm just trying to get the message across — so I'm sure there are DBSCAN aficionados in the room and we could have a great argument about this, but that's not really the point. Again: how do we choose those parameters in DBSCAN, the density level in particular? Epsilon, okay — you might have some sense of what a radius means; with Jaccard distance you might have common-sense intuition about it. But how do you choose a density level, the min neighbors parameter? I don't know. And if you need to change the density level, you have to start from scratch with DBSCAN.

Level set trees are something I worked on quite a bit in grad school. They describe the entire hierarchy of density-based clusters, and they let you retrieve clusters in different ways, repeatedly, without starting from scratch: once the tree is computed, it exists and you can query it as many times as you want, in whatever way you want, without recomputing it. Each cluster can have a different density level — one of the issues with DBSCAN is that you have to pick a single flat level, whereas with these trees you can pick clusters at different levels, so if clusters have different densities you can retrieve all of them. And the tree is a very useful visualization of high-dimensional or complex data: with DBSCAN you can plot the points back in feature space, but you don't really get a sense of the structure of the data, whereas the hierarchy gives you a better sense of that topological structure.
Okay, so how do you build a level set tree? First some bullet points about the process, then a little illustration on the two moons data. First, we estimate the PDF at each point — just like the original density-based clustering concept started with. Then we construct a similarity graph on the data, similar to what DBSCAN is implicitly doing. In a similarity graph, if you're not familiar, each vertex stands for one of the data points, and we put an edge between a pair of data points if they are neighbors of each other. Then we filter through this graph: we remove vertices in order of estimated density — take the minimum-density point and remove it from the graph along with all of its edges — and as we do that, we compute the connected components. So far this is very similar to DBSCAN, but we keep track of these components between levels, and as it happens, they form a tree.

Here's what it looks like on the standard two moons data. We have the same data set, colored by estimated density: the warmer colors, the darker reds, are higher estimated density; the lighter ones are lower. That's the first step. Now I've constructed a similarity graph, represented by black lines between pairs of points, and the thing to notice is that we have one component — one cluster — at this level, at level zero. (Answering an audience question) Yes, I used a k-nearest-neighbor similarity graph: for a pair of points, if either one is among the other's k nearest neighbors, I drew an edge between them. Because there's one component, we have one cluster represented in our tree — there's no tree yet; this is the root of the tree.

Now here's the lowest-density point, circled in blue. To step back: one of several things can happen when a point is removed. Either nothing happens, which is the case here — we take it out, the one component stays the same, nothing changes — and we represent that with another dot on our tree, at the same place horizontally but a little higher, because that point had a particular density, 0.07 (it's very hard to see, but it was 0.07). So we put our dot there and keep going, and nothing happens until suddenly we remove one more point and something different happens: our first cluster has died and split into two clusters. The way we represent that on the tree is that I stop drawing individual dots and just represent that one cluster by a vertical line, and the split is represented by a horizontal line, so now we have two clusters growing from those branches. We keep going — remove points one by one, nothing happens in either cluster — until we have two points left and the tree looks like this. We remove one more point, and this is the third thing that can happen: a cluster disappears — it didn't split, it just vanished. Those are the things that can happen as we filter through the graph. We end up with this tree structure, which kind of looks like a field goal in the two moons case — it doesn't always look like that — but that's the basic idea.
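Here is a compact sketch of that filtration, using networkx for the connected-components bookkeeping. It only reports how many components exist as low-density points are stripped away; a real implementation (such as DeBaCl) also records which component each point fell out of, which is what forms the actual tree. The 1/(distance to the k-th neighbor) density proxy is my own simplification, not the estimator from the talk.

```python
# Filtering a k-nearest-neighbor graph by estimated density.
import numpy as np
import networkx as nx
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=19)
k = 10

# Crude density estimate: a closer k-th neighbor means higher density.
dist, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
density = 1.0 / dist[:, -1]

# k-nearest-neighbor similarity graph.
G = nx.Graph()
G.add_nodes_from(range(len(X)))
for i in range(len(X)):
    for j in idx[i, 1:]:              # skip idx[i, 0], the point itself
        G.add_edge(i, int(j))

# Filter: remove vertices from lowest to highest density, watching the components.
for step, i in enumerate(np.argsort(density)):
    G.remove_node(int(i))
    if step % 50 == 0 and G.number_of_nodes() > 0:
        print("removed %3d points, density level %.2f: %d component(s)"
              % (step + 1, density[i], nx.number_connected_components(G)))
```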
So what does this look like in code? Let's switch to DeBaCl. Again, we make the same data set. I've imported debacl as dcl, just for brevity, and I use the construct_tree function. It takes the data; it takes this parameter k, a connectivity parameter analogous to epsilon in DBSCAN, which I use to build the k-nearest-neighbor graph; and a pruning threshold. The pruning threshold isn't really related to the density level or the connectivity — basically, if we don't prune, we get a lot of tiny leaves with one or two points in them, so I'm saying that if a leaf has fewer than 10 points, merge it with its siblings, which makes the tree a little smoother. So we make the tree and print it — well, that's terrible, let me shrink it a little. The point is not to digest all the numbers; the point is that this is all of the information in the tree. It's a very compact data structure — a few lines and a table, that's all it is. Well, the size here hides the fact that we also keep the indices of the points belonging to each node. Then we plot it, and we get our field goal again.

To retrieve clusters, we use the get_clusters method, which takes a method parameter. First we do this with "leaf": leaf clusters are the ones at the leaves of the tree, the high-density modes. I printed the first five rows just to show what it looks like: the left column is the row index, the second column is the integer cluster label. Then I plot those, and we have the same output — life is good. Notice that not all of the points are plotted: the points that were pruned off before a split are not part of this clustering; those are the noise points. And of course we can go back and re-plot the tree so that our colors match: here the red and gray points came from the red and gray leaves of the tree.

The next thing we want to do is identify outliers, so let's go back and find points that are low-density, instead of identifying the high-density clusters. Again we can use the tree's get_clusters method, this time with the upper-level-set option. One thing you might want to do is look at the points in the bottom 5 percent of density, so that's what I'll do here: I cut at a single density level, much like DBSCAN does, and keep the lowest 5 percent of points by density. These are the low-density points, and we can make the same plot as before, but now with those points added in black. So now we've found our low-density points.
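Below is roughly that DeBaCl workflow, reconstructed from the description. Exact function names and keyword arguments vary across DeBaCl versions, so treat the signatures — construct_tree, prune_threshold, and especially the "upper-set" retrieval and its threshold argument — as assumptions from my reading of the docs rather than a verified reference.

```python
# Level set tree clustering with DeBaCl on the two moons data (sketch).
import debacl as dcl
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=19)

# k is the connectivity parameter for the k-nearest-neighbor graph (analogous
# to epsilon in DBSCAN); prune_threshold merges leaves with fewer than 10
# points into their siblings so the tree comes out smoother.
tree = dcl.construct_tree(X, k=20, prune_threshold=10)
print(tree)        # the whole tree: a few lines and a table
tree.plot()        # the "field goal" dendrogram

# High-density clusters: one cluster per leaf of the tree.
# Each row is (point index, integer cluster label).
leaf_clusters = tree.get_clusters(method="leaf")

# Low-density outliers: cut at a single level instead, and treat points that
# fall below the bottom-5% density level as the outliers.
upper_set = tree.get_clusters(method="upper-set", threshold=0.05)
```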
I'm sure there are a lot of questions — let me hold off, because I just have a couple more slides and one more concept. That's the basics, but you might say: okay, DBSCAN was already pretty good at finding two clusters — pretty much any parameters would have found two clusters there — so why do we really need level set trees? Well, here's an example with much more complex data. These are hurricane tracks in the Atlantic from roughly 1950 to 2014. I've plotted them according to some estimate of density — let's not get into the specifics of density functions for tracks — but even with coloring by density, it's still really hard to figure out what's going on in this data set. We cannot visualize this and see the structure; there's just too much overplotting going on. So we make the tree (upper left), we get those leaf clusters, we plot the clusters back in the original feature space, and we get a much better sense of the organization of these tracks.

That's pretty much it. DeBaCl is my own thing — pip install debacl — and help is definitely wanted if you're interested in playing with this stuff. To wrap up and reiterate: k-means is not always the best option — sometimes it is, but think twice before you use it, that's all I'm asking. Density-based clustering can be a good alternative; DBSCAN is the place to start if you're just getting started with density-based clustering — check out the code in scikit-learn and GraphLab Create — and check out level set trees and DeBaCl if you're interested. Thank you very much. Okay, a few minutes for questions.

(Answering an audience question) I'm not sure I quite understand the question... yeah, in fact I have to admit that right now the implementation is pretty dumb and does find connected components at every level. A union-find (disjoint-set) data structure would be a much better way to do this; there is prototype code for it on GitHub, I just haven't optimized it yet so that it's actually faster than the connected-components version, but it is a computationally more efficient approach.

(Answering a question about the hurricane-track features) Good question — it's a long answer and I completely glossed over it. I just used the points in space, the geo-coordinates of each point. Each hurricane track has somewhere between 10 and 100 or so points — a variable number, which makes it a little trickier — and I used a distance called "max average min", sometimes called the chamfer distance, to measure the distance between any pair of tracks (a rough sketch of this distance appears below).

(Answering a question about theory) There's a paper by Chaudhuri and Dasgupta from UC San Diego, I think around 2010, that has an algorithm, but it's a very theoretical one — it has continuously varying parameters from 0 to infinity and it's very computationally inefficient — so I modified it a bit to be more reasonable for practical use. No more questions? Great, thank you.
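For reference, here is a small sketch of the "max average min" (chamfer-style) track distance mentioned in the Q&A: for each point on one track, take the distance to the nearest point on the other track, average those, do the same in the other direction, and keep the larger of the two averages. Plain Euclidean distance on the coordinates is an assumption here; any point-to-point metric (for example a great-circle distance for geo-coordinates) could be swapped in.

```python
# "Max average min" / chamfer-style distance between two variable-length tracks.
import numpy as np

def chamfer_distance(track_a, track_b):
    """Max of the two directed average nearest-point distances."""
    A = np.asarray(track_a, dtype=float)     # shape (n_a, 2)
    B = np.asarray(track_b, dtype=float)     # shape (n_b, 2)
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    a_to_b = D.min(axis=1).mean()            # A's points to their nearest in B
    b_to_a = D.min(axis=0).mean()            # B's points to their nearest in A
    return max(a_to_b, b_to_a)

print(chamfer_distance([(0, 0), (1, 0)], [(0, 1), (1, 1), (2, 1)]))
```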
Info
Channel: PyData
Views: 29,076
Rating: 4.9425836 out of 5
Id: 5cOhL4B5waU
Length: 39min 24sec (2364 seconds)
Published: Fri Dec 04 2015