4 Basic Types of Cluster Analysis used in Data Analytics

Video Statistics and Information

Captions
[Music] Hello and welcome to Data Science Wednesday. My name is Tessa Jones, and I'm a data scientist with Decisive Data. Today we're going to be talking about cluster analysis and the basic techniques that we use to determine clusters. One of the more common reasons that we use clustering in a business setting is to understand which customers are related to each other, which is called customer segmentation. For the sake of simplicity, let's have an example to play with: let's pretend that we are pet store owners, and we want to know who our cat people are, who our dog people are, and who our fish people are. Let's have some fun with it.

Let's start with centroid clustering, which is one of the more common methodologies. So what is a centroid? What is this whole idea? Basically, in this methodology you choose the number of clusters that you want. Let's pretend that we want cat people and dog people, so we want two clusters. Then we determine a centroid for each cluster, and whether you belong to that cluster is determined by how far away you are from that centroid. For example, at the beginning of an algorithm you would randomly choose centroids, so we're just going to put one here and one here. These two points belong to this centroid because they're closer to this one than they are to that one, and likewise these three belong to this centroid because they're closer to this one than they are to that one. This one's kind of marginal; it's a little bit closer to this guy, so we're going to draw the line right there. Now we have two clusters, and we're going to recalculate where our centroids are based on the points that belong to our clusters. In this cluster, we're going to reposition the centroid right there, because it's between the two points that belong to that cluster, and for this cluster we're going to move it ever so slightly up here, so that when we redraw the line based on which centroid this point is closer to, it goes like this. Now we have two clusters that determine who our cat people are and who our dog people are. This is a pretty small data set; normally we would go through many iterations to determine the optimum groupings.

So this is a pretty simple data set. Let's pretend that our data set is a little bit more weird and complicated. That's really important to understand, because if your data is organized strangely, or your clusters wouldn't fall into this, you really want to apply the right technique.

Let's move on to discussing density clustering. In density clustering, the basic idea is that you group people based on how densely they're populated together. If you have a lot of people that are closely related, they're considered to be a single group, but the less dense they become, the less likely they are to belong to the same group. Usually the first thing the algorithm does is select a random point; let's say we pick this point. In most algorithms you would want a minimum number of points around this point that are close enough to it to say that they're related, that they are now a group. Around this point we have five other points, so now we have a group of six that are all saying, "OK, we're related." Now they want to know who else might be related to this group. This point is part of it, so it goes to the next point over and says, "Are you close enough to me to be related?" If the answer is yes, then that point is considered to be part of the group, and so on and so forth until it's hit every point that meets the criterion that the distance from one point to the next is close enough.

Then it goes through and says, "Well, we have all these other points, and we need to understand what clusters they belong to." Let's pretend for a minute that it picks this random point out here. As you can see, it's not very close to anything, so in most of these techniques it would just be considered an outlier and not part of any group. Then it would move on to the other points that have not yet been assigned. Let's pretend that we go to this point here, and we have four points around it that are close enough to the center to be considered a group. It goes through and does the same thing: these five points have now become a group, and each goes to the points next to it and says, "Hey, are we part of the same group?" If they are, they become part of the same group. So now we have this other cluster here, like this. This is how density clustering works, and I'd like to highlight that when you have these kinds of strange clusterings, this [centroid] methodology would never have picked this up. That's important to understand.

Let's move on to distribution clustering. Distribution clustering is a pretty interesting technique. It basically looks at the probability that a particular point belongs to a cluster. We know that we want three clusters: dog, fish, and cat people. We're going to fast-forward a little bit to help make the point of how this one really works, so let's pretend that we know that we have three groups and their centroids are like so. If we draw a map of the density distributions of these points, then if this centroid is this point and you're a person that lands right here in the data, there's a 100% probability that you're part of this group. The farther away you get from here, the lower the probability that you belong to this group. It works really well because it gives you a probability that you belong to a particular group, rather than saying you are definitely part of this group or that group. For example, you have this guy out here that's a little bit of an outlier, kind of close to the dog people and kind of close to the fish people. This guy ends up having a probability that it belongs to each of these distributions: it might have a 53 percent probability that it belongs to dogs, a 47 percent probability that it belongs to fish, and something like a zero percent probability that it belongs to cats. Then you can determine where you want it to go based on your business needs. Maybe you want to say, "Let's advertise for both dogs and cats," or something like that. You have a little bit more freedom in how to use that information.

Next we're going to talk about connectivity clustering, and this is another interesting one, because you basically start with individual clusters: each person is their own cluster. Rather than membership being determined by how close a person is to a particular centroid, we determine how closely that person is related to another individual. For example, let's pretend that person J here has a Labrador Retriever and person K here has a Golden Retriever. The products that they buy are probably very similar, and therefore they're going to be very related to each other. In contrast, person H has a poodle and person I has a Labradoodle; they're likewise going to be buying things that are very similar and can be clustered together. What this dendrogram is illustrating is that these two are independently related to each other, but they're also more related to each other as clusters than they are to G, which is a person that owns a bulldog and is going to be buying things that are different. So we would cluster these two together, and then all five of these are dogs, so we're going to go ahead and cluster all of these together. The same goes here: you have two different kinds of cats, and they're more or less related to each other. You keep going out and out more, and eventually I could put a circle around this whole thing, because they all come to a point where everybody's part of one group. So the big trick on this one is determining where you cut it off. How do you know how many clusters you want? It comes down to how many clusters you want again: do you want to leave G just as an outlier, or do you want to keep it as part of the group?

So that's a basic overview of how clustering works, the basic techniques that you would employ to cluster groups together, and why you might use them. And that's Data Science Wednesday. Thank you.
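
The centroid procedure described in the captions (choose the number of clusters, pick starting centroids, assign each point to its nearest centroid, recompute the centroids, repeat) is commonly known as k-means. Below is a minimal Python sketch of it, not code from the video. The function names, the made-up `customers` data, and the choice to start from randomly sampled data points are all assumptions made for illustration.

```python
import math
import random


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def recompute(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            updated.append(old)  # empty cluster: leave the centroid in place
            continue
        updated.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return updated


def k_means(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid updates until the labels stop changing."""
    rng = random.Random(seed)
    # Assumption: the starting centroids are k randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = recompute(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical customers as (dog-product spend, cat-product spend) pairs.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(customers, k=2)
print(labels)
print(centroids)
```

On a data set this small and well separated, the loop settles after a couple of passes. On real data the result can depend on the starting centroids, which is one reason the algorithm is often run several times with different seeds.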
Info
Channel: Decisive Data
Views: 104,809
Rating: 4.9559517 out of 5
Keywords: data science, data science wednesday, cluster analysis, centroid clustering, density clustering, distribution clustering, connectivity clustering, data analytics
Id: Se28XHI2_xE
Length: 8min 52sec (532 seconds)
Published: Mon Jul 09 2018