Clustering Using DBSCAN // Rahul Maddimsetty, Foursquare [FirstMark's Code Driven]

Captions
[Music] Evening, everyone. My name is Rahul Maddimsetty. I'm an engineering director at Foursquare here in New York, in charge of the Places database and API team. My team is basically responsible for the systems that produce our database of over 105 million places worldwide, from both first- and third-party data sources, as well as the platform that enables third-party developers to access this data, either via the API or through flat-file deliveries.

I assume most of you are familiar with the concept of clustering, but I'll state in a nutshell what it is. Clustering is essentially the process of grouping together similar objects in a data set. Sometimes this grouping is the end goal, as with market segmentation exercises, and at other times it's an intermediate step that enables further analysis of the data set. There are several well-known clustering algorithms; they tend to be classified as connectivity-based, centroid-based, distribution-based, or density-based, and I have an example of each kind: there's agglomerative clustering, k-means clustering, clustering using Gaussian mixture models, and then DBSCAN, which is what I'll be talking about today.

I have to ask: who over here has heard of DBSCAN at all? Some of you have. How many of you knew DBSCAN was an acronym? I didn't; I always assumed it meant something like scanning a database, which seems like the first conceptual step in order to accomplish anything. But no, it actually is an acronym, and unlike all those heavily contrived acronyms you come across, this one describes exactly what it does: Density-Based Spatial Clustering of Applications with Noise. Maybe the "applications" part was a bit forced, but still. It's been around a while; it was first proposed in 1996, in a paper that I believe has gone on to be among the most cited data mining papers of all time, and it received something called the Test of Time Award at the KDD conference in 2014.

We'll keep returning to this diagram quite a bit as I keep defining terms, but even as you look at it, I think it's really useful for setting up the intuition behind the DBSCAN algorithm, because as you look at it you think: everything but N should be part of a cluster, right? And that's probably because, without your realizing it, your brain is scanning for clusters using some measure of density, and that's exactly what DBSCAN does.

The first definition is something called a core point. In the DBSCAN algorithm, any point that has at least minPoints neighboring points, including itself, within distance radius is said to be a core point. So the DBSCAN algorithm takes two parameters, minPoints and radius, and these are the only two parameters that it takes. To return to the diagram for a bit: over here minPoints is four, and the red points are core points because they each have at least four points, including themselves, within some radius that I haven't marked on this slide. Now, although the slide shows the diagram in two-dimensional space, much of what we're discussing applies to any arbitrary n-dimensional space.

Next there is the notion of reachability, which in the original paper is defined fairly elaborately: there's a notion of reachability between a pair of points, which is not a symmetric relationship, and then there is something called density-reachability. We can skip over those details, though; suffice it to say that a reachable point is something that is within distance radius of some core point, but which doesn't qualify to be a core point itself.
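(As a quick illustration of that definition, here is a minimal sketch of the core-point check in Scala. The Point class and the plain Euclidean distance are illustrative stand-ins only, not Foursquare's code, which works with distances on the Earth's surface.)

case class Point(x: Double, y: Double)

// Straight-line distance between two 2-D points.
def distance(a: Point, b: Point): Double =
  math.hypot(a.x - b.x, a.y - b.y)

// A point is a core point if at least `minPoints` points in the data set,
// including itself, lie within `radius` of it.
def isCorePoint(p: Point, all: Seq[Point], radius: Double, minPoints: Int): Boolean =
  all.count(q => distance(p, q) <= radius) >= minPoints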
So over here, those are the points B and C, there in orange. By exclusion, any point that is neither core nor reachable is considered an outlier; outliers, again intuitively, shouldn't be part of any cluster, and over here that's the single blue point. Given these definitions, you can think of clusters as collections of core points plus the points that are reachable from them. Now, the diagram that I showed you had just one cluster in it, but it's fairly easy to see how this generalizes to a data set that contains several clusters.

This is actual Scala code from the Foursquare code base. I didn't obfuscate it, clean it up, or do anything of the sort; this is exactly what we have in our code base. It's an implementation of the DBSCAN algorithm, so we'll step through it. If you haven't read Scala before, this will be fun for you.

At the top are the inputs: a bunch of points, and the parameters radius and minPoints. Over here is what I'll refer to as the outer loop: we loop over the points, skipping over anything that we've already visited. As soon as you visit a point, you mark it as seen, and then you find all of its neighbors. In order to do that efficiently we have a k-d tree up here; specifically, we use a geo k-d tree, which is aware of distances on the Earth's surface, so it can take a distance in meters and find points, as opposed to just x/y coordinates. We then do a density check: having retrieved the neighbors, you just compare the number of neighbors to minPoints, and if that is greater than or equal to minPoints, it is a core point. Anything that isn't a core point you skip for now; there is the possibility of rescuing it afterwards, but for the moment you treat it as noise. If it is a core point, on the other hand, you enqueue it to be explored, and then you have this inner loop that keeps going while the queue is not empty. You take the first element in the queue, find its neighbors, and do another density check. If it is a core point, you enqueue all of its neighbors as well; if it isn't, you don't; but regardless, you add it to the current cluster if it isn't already part of a cluster. When you get to this point, it's because the queue was emptied, so you're done with the current cluster and you can move on to the next point in the outer loop. And when you get here, it's because you're done with the outer loop, so you just return the list of clusters that you've built up so far; anything that is not in a cluster at this point is an outlier, or noise, so you return that as well.
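(For readers following along without the slide, here is a minimal sketch of that same structure in Scala. It is not the Foursquare implementation: the neighbor lookup is brute force, whereas the production code uses a geo k-d tree and distances in meters, and all names here are illustrative.)

import scala.collection.mutable

object DBSCANSketch {
  case class Point(x: Double, y: Double)

  // Returns (clusters, noise). `radius` and `minPoints` are the only parameters.
  def dbscan(points: IndexedSeq[Point], radius: Double, minPoints: Int)
      : (Seq[Seq[Point]], Seq[Point]) = {

    // Brute-force neighbor lookup; a spatial index such as a k-d tree would go here.
    def neighbors(i: Int): Seq[Int] =
      points.indices.filter { j =>
        math.hypot(points(i).x - points(j).x, points(i).y - points(j).y) <= radius
      }

    val seen      = mutable.Set.empty[Int]
    val clustered = mutable.Set.empty[Int]
    val clusters  = mutable.Buffer.empty[Seq[Point]]

    // Outer loop: visit each point once, skipping anything already seen.
    for (i <- points.indices if !seen(i)) {
      seen += i
      val ns = neighbors(i)
      if (ns.size >= minPoints) {            // density check: i is a core point
        val cluster = mutable.Buffer(i)
        clustered += i
        val queue = mutable.Queue.empty[Int]
        queue ++= ns.filterNot(_ == i)
        // Inner loop: expand the cluster while the queue is not empty.
        while (queue.nonEmpty) {
          val j = queue.dequeue()
          if (!seen(j)) {
            seen += j
            val njs = neighbors(j)
            if (njs.size >= minPoints) queue ++= njs  // j is also a core point
          }
          // Regardless, add j to the current cluster if it isn't in one already.
          if (!clustered(j)) { cluster += j; clustered += j }
        }
        clusters += cluster.map(points).toSeq
      }
      // Non-core points are skipped here; they may still be rescued later
      // if they turn out to be reachable from some core point.
    }
    val noise = points.indices.filterNot(clustered).map(points)
    (clusters.toSeq, noise)
  }
}

(Calling DBSCANSketch.dbscan(points, radius = 1.0, minPoints = 4) returns the clusters plus the leftover noise points.)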
So DBSCAN is an intuitive but also a fairly versatile algorithm. It's great because, unlike k-means for instance, which some of you might be familiar with, you don't need to know beforehand how many clusters are present in the data. It's good because it can find arbitrarily shaped clusters. And because the notion of noise is so central and baked into the algorithm itself, it's especially well suited to cases where you don't want to include outliers inside a cluster. Finally, like I mentioned earlier, there are just two parameters to tune, which makes evaluating models a bit simpler. This is a quick illustration of some of the more arbitrarily shaped clusters it can find: the one on top is not linearly separable, but DBSCAN can detect those clusters, and you also see points that don't make it in; those are noise. The figure below shows, on the left, the output of k-means and, on the right, the output of DBSCAN, and you can see that DBSCAN does a better job.

It has its disadvantages as well. As we were stepping through the code, you probably realized that the actual clusters output by the algorithm can depend on the order in which the points are processed, and this makes DBSCAN non-deterministic in some cases. The core points will always be part of the same clusters regardless of order, but the reachable points, especially if they're reachable from more than one cluster, can end up in different clusters depending on which core points came first in the outer loop. This isn't always a bad thing, but it is something to be cognizant of. Another disadvantage of DBSCAN is that it doesn't work great on data sets where density varies greatly, and this is because there is no single value of minPoints and radius that would work well across all of the data set.

So this is where we end our generic, high-level treatment of DBSCAN and move on to how we use it at Foursquare. As the title slide said, if you had the time to read it, we use this to improve the accuracy of our venue lat/longs. By the way, venues are just what we call places in our database, so eBay NYC is a venue, as is probably that Marshalls in Manhattan that I see. Anything can be a venue, and venues have lat/longs, and we want those lat/longs to be as accurately placed as possible.

Historically, we would try to determine the best lat/long for a venue by periodically recomputing the centroid of the check-ins that venue had received over time, and then just hope for the best. The problem is that check-ins are inherently noisy, and I'm not talking about GPS accuracy, because that's a whole other problem. Even if GPS accuracy were perfect, which it is not, the behavior that generates the check-ins themselves is noisy. What do I mean by that? Most users check in when they're physically at the venue, but sometimes they check in from a parking lot, or from across the street as they're walking in or out of the venue. Less frequently, but I guess still often enough that this is a problem and I know about it, we have users who try to cheat at the game, and so they may check in from somewhere else entirely in order to earn coins, or unlock stickers, or become mayor of a place. We have simple checks to try and detect obvious cheat check-ins, but we also have some very motivated and persistent users who have been known to gradually check in from further and further away from the center, to thwart our cheat detection and gradually move the centroid closer to wherever they want it to be. Honestly, with the things that I've seen, there is nothing about user behavior that can surprise me anymore, so this is something we need to live with.

Regardless, the observation was that check-ins are inherently clustered because of the underlying patterns of user behavior that generate them. So the idea was: if we could find clusters of check-ins and then use the centroid of just the dominant, or largest, cluster, rather than the centroid of all check-ins, then maybe that would give us a better lat/long. The hope was that the dominant check-in cluster we found would correspond to the check-ins that originated from inside the venue. We also expected there would typically be smaller clusters corresponding to legitimate check-ins from the parking lot or from across the street, and anything else is noise for our purposes.
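(A minimal sketch of that idea, assuming some DBSCAN routine has already grouped a venue's check-ins into clusters. LatLng and the plain arithmetic mean are illustrative stand-ins, not the production code.)

case class LatLng(lat: Double, lng: Double)

// Pick the largest ("dominant") cluster of check-ins and use its centroid as the
// candidate lat/long for the venue. A plain arithmetic mean is a reasonable
// approximation at the scale of a single venue.
def dominantClusterCentroid(clusters: Seq[Seq[LatLng]]): Option[LatLng] =
  clusters.sortBy(c => -c.size).headOption.map { dominant =>
    LatLng(
      dominant.map(_.lat).sum / dominant.size,
      dominant.map(_.lng).sum / dominant.size
    )
  }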
There is a problem, though. The problem is essentially what we noted earlier as a disadvantage of DBSCAN: the distribution of check-in densities varies significantly across the world. It tends to vary as a function of venue shape and size, which in turn depend on category or chain, and it depends on where the venue is located in the world. For instance, if you were to visualize the check-ins at a suburban Target, that would look very different from a bar in Manhattan. So there is no single value of minPoints and radius that would work across our entire venue database. We worked around this by training a multi-output random forest to predict the best parameter values for each individual venue, given a feature vector that included, among several other things, the category and where the venue is located. We then fed these parameters to the DBSCAN algorithm to detect clusters, and recomputed the lat/longs of tens of millions of venues this way.

Before I jump ahead to the results of doing this, I want to establish how we define lat/long accuracy. For our purposes, a venue's lat/long is accurate if it is within 50 meters of its real-world location and on the same side of the street. That latter requirement is especially important if you want to avoid routing people who've asked for directions to a place on the wrong side of the street, or telling them to stop because the destination is on their left versus on their right; especially if it involves a U-turn, it's kind of painful. This is also the condition that, as it happens, our lat/longs were failing more frequently in the accuracy test.

So what we did is we built up a ground-truth data set from a sample of around a thousand venues from across the world, in a variety of categories, and for each of them we had a member of our internal ground crew team draw the true shape of the venue, as in the actual outline of the venue as a human sees it, as well as the shape of the street block that surrounds it. The first shape is used for the first part of the test: is a point within 50 meters of the real-world location, that is, inside a 50-meter buffer around the polygon? The street block shape is for the second condition: if you think about it, if a point is inside the street block that contains the venue, then by definition it must be on the correct side of all enclosing street segments.

To go through some actual examples: this is a hair salon in Greece, I believe. This green polygon is the actual shape that we determined for the venue, you can see the check-ins as blue dots, and this is our street block shape. And again, this is a car wash in Turkey, and this is the street block shape. Now, you might have noticed that both of the street block shapes seem to include some streets in them. That's actually because our definition is only concerned with public streets, so private streets, which might show up on a map, or pedestrian-only walkways, which also show up on a map, don't really count for our definition of being on the correct side of all streets. I guess our definition is literally car-directions or driving-directions specific.
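(A sketch of that accuracy test under simplifying assumptions: the 50-meter condition is checked against a single reference point for the venue rather than a buffered polygon, and the street block is treated as a simple lat/lng polygon. Illustrative only, not the production check.)

case class LatLng(lat: Double, lng: Double)

// Haversine distance in meters between two lat/longs.
def haversineMeters(a: LatLng, b: LatLng): Double = {
  val earthRadius = 6371000.0
  val dLat = math.toRadians(b.lat - a.lat)
  val dLng = math.toRadians(b.lng - a.lng)
  val h = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(a.lat)) * math.cos(math.toRadians(b.lat)) *
      math.pow(math.sin(dLng / 2), 2)
  2 * earthRadius * math.asin(math.sqrt(h))
}

// Ray-casting point-in-polygon test, treating lat/lng as planar coordinates,
// which is a fair approximation at street-block scale away from the poles.
def insidePolygon(p: LatLng, polygon: Seq[LatLng]): Boolean = {
  var inside = false
  var j = polygon.size - 1
  for (i <- polygon.indices) {
    val (a, b) = (polygon(i), polygon(j))
    val crosses = (a.lat > p.lat) != (b.lat > p.lat) &&
      p.lng < (b.lng - a.lng) * (p.lat - a.lat) / (b.lat - a.lat) + a.lng
    if (crosses) inside = !inside
    j = i
  }
  inside
}

// Accurate = within 50 meters of the real-world location AND inside the street
// block polygon (which guarantees the correct side of all public streets).
def isAccurate(candidate: LatLng, trueLocation: LatLng, streetBlock: Seq[LatLng]): Boolean =
  haversineMeters(candidate, trueLocation) <= 50.0 && insidePolygon(candidate, streetBlock)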
Regardless, once we'd done this, we needed to synthesize the training and validation data sets for the random forest models. The way we did this is by enumerating, for every venue in the ground-truth data set, all possible minPoints and radius values, using them to cluster, computing the dominant cluster centroid, and then only keeping rows that resulted in accurate centroids, per our definition earlier. This could result in potentially multiple rows per input feature vector.
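(A sketch of what that enumeration could look like, with hypothetical names throughout: GroundTruthVenue, TrainingRow, clusterFn, centroidFn, and accurateFn stand in for the feature vector, the DBSCAN run, the dominant-cluster centroid, and the 50-meter / street-block test sketched above.)

case class LatLng(lat: Double, lng: Double)
case class GroundTruthVenue(features: Map[String, Double], checkins: Seq[LatLng])
case class TrainingRow(features: Map[String, Double], minPoints: Int, radiusMeters: Double)

// For one ground-truth venue, enumerate a grid of (minPoints, radius) values and keep
// only the combinations whose dominant-cluster centroid passes the accuracy test.
def synthesizeRows(
    venue: GroundTruthVenue,
    minPointsGrid: Seq[Int],
    radiusGrid: Seq[Double],
    clusterFn: (Seq[LatLng], Double, Int) => Seq[Seq[LatLng]],
    centroidFn: Seq[Seq[LatLng]] => Option[LatLng],
    accurateFn: LatLng => Boolean
): Seq[TrainingRow] =
  for {
    minPoints <- minPointsGrid
    radius    <- radiusGrid
    clusters   = clusterFn(venue.checkins, radius, minPoints)
    centroid  <- centroidFn(clusters)
    if accurateFn(centroid)
  } yield TrainingRow(venue.features, minPoints, radiusMeters = radius)

(Stacking these rows across all ground-truth venues gives the feature vector inputs and the (minPoints, radius) targets for the multi-output model.)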
Once we did this, we did a 90/10 training/validation split, which I know sounds a little unconventional, so I'll explain myself a bit. The alternative would have been the more traditional 70/20/10 training/validation/test split, but if we'd done that, we would have had just about a hundred venues in our test set, and we figured that testing didn't necessarily need the actual shapes. The process of acquiring more shapes, and actually drawing them precisely using internal resources, is fairly expensive and slow, so we decided to use a CrowdFlower job instead, where we could just have people make the judgement of whether or not the new lat/long was better than the old lat/long, without having to draw a shape to determine that, and we were able to do that with far more venues. The job looks something like this: it shows you the old location in yellow, the new location in green, I believe, and asks, given these definitions, which of these is better.

So, final results. I believe we saw a net improvement in lat/long placement of 37%, or thereabouts, which is pretty significant, and it beats by a mile any previous attempt that we've made to make progress against this specific problem. These are a couple of venues where, again, you start to appreciate how check-in density factors into this problem: you've got some place that I guess looks somewhat suburban, and another place that's in a slightly more packed area, and the check-ins are a little more scattered in one case, a little more so in the other, but we still managed, using this approach, to pull the pin over to the correct side of the street.

So that's pretty much everything I had. I just wanted to acknowledge that all of the work I just presented was actually done by people on my team; I'm just presenting it. If you ask me difficult questions I may not be able to answer them, but I'll try. All right, thank you all.

[Audience question about the feature vector]

The feature vector for the venues comes, as it happens, from previous work that we've done to improve not just the quality of our lat/longs but also the quality of our categories, and a realness problem, which is that, because we are a crowdsourced database of places, we frequently have places created that are not real-world places; people create places as a joke, or miscategorize them as a joke. So there's been previous work done to extract a large number of features. We have tens of them, I want to say 50 or 60 maybe, and that includes things like how many people have checked into this venue, how many check-ins we've had in total, how many unique visitors we've had, and several geo-normalized variants of the same signals. For every top-level category we have one feature: is this a food or nightlife place, is this a shop or service, is this something else. Again, the random forest learns the significance of each of those features. We have features, I believe, that have to do with category, like I said; possibly whether or not a venue is part of a chain; stuff that has to do with... I'm trying to think of the features... I think when it was last edited, a couple of things like that. There are several features; I don't remember all of them now, but they're things that we hope would indicate inherent properties of a venue: is this a big venue or a small venue, is it visited a lot or not visited enough. The reason I said earlier that we recomputed tens of millions of venue lat/longs, versus the 105 million plus, is that in some cases we just don't have enough check-ins to be able to find clusters. So I'm not sure that answers your question; that's as many features as I can recollect right now.

[Audience question: how do you foresee that generalizing to venues where you don't have the actual ...?]

Right. Oh, I think I know what you're talking about, so maybe this answers your question. There's work that we've done previously, outside of my team even, to try and determine two-dimensional Gaussian distributions of check-ins, and I suppose what you're saying is that we could use, for instance, whatever shape 90 percent of our check-ins fall within. To be fair, we didn't try that exact thing in this case. We have considered using those shapes, the probabilistic shapes that we have, but not in this iteration. When I said earlier that this beats by far anything that we've done previously: what we did previously was along a couple of lines. We tried filtering out check-ins that seemed bad; I remember very specifically there was a point where bad lat/longs from old BlackBerry devices were throwing off our lat/longs by a huge amount, so we just filtered them out altogether. Similarly, we filtered out check-ins that were made before a manual edit to the venue's actual lat/long and only used check-ins that came afterwards. Suffice to say our general approach was to keep using the centroid of check-ins but filter out things that you think are outliers; I guess that's what we were trying to do, and DBSCAN turned out to be a better way of doing the same thing. That's what we'd tried, and we are, by some measure, 37% better. I guess... I think he was before you.

[Audience question, off-mic]

That is a good question. I'm not sure I'm the best person to answer it; there are other people at the company who are thinking about that problem. What I didn't mention is that when we get check-ins, we get a latitude and longitude, but we also get a horizontal accuracy, which for now we don't consider as part of any of this computation. We considered doing it: what if, rather than just a point, we used circles and tried to do something fancy with overlap? I guess it was one of those fancy things that we didn't get around to doing because the simple thing worked better. But yeah, looking ahead into the future, when GPS accuracy improves significantly, I think it solves one component of the noise in check-ins. It doesn't solve the fact that people don't necessarily all check in when they're actually at the venue, and the fact that people are going to cheat. So there's noise from legitimate check-ins and there's noise from not-so-legitimate check-ins, and regardless, I feel like having higher GPS accuracy only solves part of the problem.

[Audience question, off-mic]

Yeah, this is one of those questions where I wish somebody on my team were here, but the one thing that I do know is that there are cases where DBSCAN would not be able to detect clusters in the first place, because you specify a value of minPoints and radius and there aren't enough check-ins to meet that.
In that case, we do nothing; we just stick with the lat/long that we have already. And to be clear, our lat/longs were not terrible before; they had problems, and this helped us get better with them. So I don't know that it answers your question exactly, but I'm sure those insights are possible to extract; we've just not specifically noticed them. Of course, the greater the number of check-ins at a venue, the more likely there will be noise in it. But I also think that's one of the advantages of just throwing stuff at a random forest and not having to look on the inside to see what it learned. So the short answer is no, there's no pattern like that that sticks out in my mind, that I remember my team mentioning, but maybe they did.

[Audience question, off-mic]

Oh no, actually it's just, like I said, urban versus rural. One thing that happens a lot is that GPS signals bounce around buildings, so the kind of noise you'd see with signals in urban areas is very different from what you'd see in rural areas. That's one part of it. The other part is just that there are fewer venues in the first place in a rural area, so associating a lat/long with a venue becomes kind of easier, whereas in a city somebody could actually be at the venue next door, and the distances between venues are so small that it could make a difference. Your centroid could end up off, and, you know, streets are narrower, for instance, and that places you on the wrong side of the street. It has everything to do with density, so it comes down to whether it's urban or rural or something in between.

[Audience question, off-mic]

Oh no, not at all, this is from all over the world. We were mostly relying on satellite imagery and other forms of web research to try and establish where a place really was; in cases where we can't determine it, you just let it go and move on to the next thing. So a thousand venues is a lot for a small team to do; like I said, it's slow and expensive.

[Audience question, off-mic]

Yes, I could go back. This is what you're talking about, right? [inaudible] To be sure, I'm not positive I fully understand your concern, but here's the thing: the shape only factors in, or rather enters the picture, when we need to use this particular row of ground truth to inform what valid outputs are for the random forest. We would iterate over a number of possible values, and it would end up detecting different clusters as a result. And it's not throwing away anything outside of the shape, in case that wasn't clear; it's using every check-in regardless, just with different parameters. If the centroid of the largest cluster that results from that particular set of parameters happens to be within 50 meters of the real shape, which was on the previous slide, or inside of this shape, then we'd consider it good. So we're not throwing away anything outside of the shape. And again, like I said, with unseen venues we have no way of knowing what the actual shape is.

[Audience question, off-mic]

They were included, they were included in this. I mean, this is Turkey, and it was Greece before, but we've done this to include Vietnam and a bunch of others; they're from all over the world. This isn't just the developed world.

[Audience question, off-mic]

Yep, it still works. It still works conceptually. Again, it is reliant on having enough check-ins in the first place.
Maybe in some places we don't meet that minimum, but where we do, it does a better job than if we just used all the check-ins. So it's incremental, but it requires that we have enough check-ins in the first place to be able to detect clusters. All right, thank you. Thank you all. [Applause]
Info
Channel: Code Driven NYC
Views: 1,004
Rating: 5 out of 5
Keywords: software, development, engineering, code, programming, startups, product, design
Id: X-QxRZkessk
Length: 30min 3sec (1803 seconds)
Published: Tue Jan 02 2018