Validating a Hierarchical Cluster Analysis

Captions
In this video I'm going to show you one approach to validating a hierarchical cluster analysis, and I'll do it through the agglomeration schedule. So let's go to a cluster analysis real quick. I've already shown how to do this in a different video, so I'm not going to spend a lot of time explaining what each thing means.

I'm using the burger data, and I'm going to classify based on calories, fat, and sodium. In Statistics, I'll make sure the agglomeration schedule is checked; hit Continue. In Plots, I definitely want a dendrogram this time; hit Continue. In Method, I'm using the centroid clustering method and the squared Euclidean distance, which is most appropriate for centroid clustering. I'm also standardizing between negative and positive 1. Continue. I'm not saving anything, so let me hit OK. I'll also label cases by sandwich. Hit OK.

I have it split by restaurant right now, so let me just jump to maybe the simplest restaurant, which would be Chick-fil-A. Here's the agglomeration schedule, and let me jump down to the dendrogram. I'm actually just going to copy this dendrogram out: Copy, go over to PowerPoint, and paste it as a picture. There we go. Just set that aside for now and go back here.

Let's go to the agglomeration schedule. Now, if you double-click this, you have access to the cells. I'm going to highlight these coefficients, right-click, Create Graph, Line graph, and it creates something like a scree plot, like you get in an EFA, except it's backwards. Let me also copy this over, and it'll be easier to explain this way. Go back to PowerPoint and let me paste this in. It doesn't like it in that format. Oops. Paste it in as a picture. Here we go.

It would actually be easier if I overlaid some images here. Let's go like this: create a little box, like that, and make it see-through for effect. Okay, and also make sure its text is black. Let me just create a couple of these here. Yeah, that looks pretty good. Let me do one more and stick it right here. Okay, I'm going to just resize these a little bit to represent the degree of change in the coefficients in the agglomeration table, and you'll see why in a moment. Let me also label them as one, two, three, four, five.

Okay, now let me explain what all this means. The agglomeration schedule coefficients represent the amount of heterogeneity we observe in our cluster solution. We would like our clusters to be heterogeneous, distinct from each other, and we want to maximize that distinction, but only insofar as it is useful. The most distinct or heterogeneous solution would be a single cluster: all these sandwiches as a single cluster, compared to everything else, which is nothing, would provide perfect heterogeneity. The next best solution would be to just split it into two clusters, and you can observe that here: these two big lines represent these two main clusters. That's a very good, very distinct solution, but often two clusters is not the most ideal or meaningful solution.

So we look to the next, which would be three clusters. You can see that between 4 and 3 there's not a big jump in these two bars, so if we're going to choose between three and, excuse me, four clusters, there's not a big difference; we could go either way. We can observe the three clusters here: 1, 2, and 3. But which one is 4? If we're going to go with a four-cluster solution, it could be this one guy by itself, the grilled chicken nuggets, as its own cluster, and then this one, this one, and this one. There are a number of different ways we can break this out, but that's the most likely scenario. You can observe that the difference between three and four, splitting this one off on its own, is not a big difference, so if you've got to choose between the two, it doesn't really matter; choose what's more meaningful.

Now, if you're choosing between four and five clusters, you can observe there's a huge jump from the five to the four, so I would recommend going with the four-cluster solution. So where would five be? Maybe this is one, two, maybe three, or maybe three, four, or maybe four and then five. It's hard to distinguish where the clusters would be, hence the lack of difference in heterogeneity. So choosing a five-cluster solution isn't ideal; a four-cluster solution would be far better than a five-cluster solution.

Anyway, this is one way you can try to validate your hierarchical cluster analysis, or try to determine what number of clusters is appropriate. This is also called a stopping rule in hierarchical cluster analysis. I hope that's helpful and not too confusing. You might have to rewind and watch again in order to have this make sense. Also, if you want more information on this, go to the Hair book, Hair et al. (2010), Multivariate Data Analysis; in Chapter 9 he talks about the agglomeration schedule and provides a pretty cool example.
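The stopping rule described in the captions can be roughed out in a few lines of Python. This is only a sketch, not the SPSS procedure from the video: the function names and the example coefficients are hypothetical, and the numbers are made up to mimic the pattern discussed (a big jump at two clusters, another at four, little difference between three and four). It assumes the coefficients are listed in stage order, so that with n cases the coefficient at a given stage is the distance between the two closest clusters in the solution that exists just before that merge.

```python
def coefficient_jumps(coefficients):
    """Map each candidate cluster count k to the jump in the coefficient.

    coefficients: agglomeration coefficients in stage order (stage 1 first).
    With n cases there are n - 1 stages. The coefficient at 0-based index i
    belongs to the merge that reduces n - i clusters to n - i - 1, so it
    measures how far apart the two closest clusters are in the
    (n - i)-cluster solution.
    """
    n_cases = len(coefficients) + 1
    jumps = {}
    for i in range(1, len(coefficients)):
        k = n_cases - i  # clusters present just before this merge
        # How much more distinct the k-cluster solution is than the
        # (k + 1)-cluster solution.
        jumps[k] = coefficients[i] - coefficients[i - 1]
    return jumps


def suggest_clusters(coefficients, min_clusters=3):
    """Return the cluster count (>= min_clusters) with the largest jump.

    min_clusters defaults to 3 because, as the captions note, two clusters
    is usually very distinct but often not the most meaningful solution.
    """
    jumps = coefficient_jumps(coefficients)
    candidates = {k: j for k, j in jumps.items() if k >= min_clusters}
    if not candidates:
        raise ValueError("not enough stages for the requested min_clusters")
    return max(candidates, key=candidates.get)


# Hypothetical coefficients for 8 cases (7 stages); NOT the values
# from the video's Chick-fil-A output.
example_coefficients = [0.02, 0.05, 0.08, 0.10, 0.45, 0.50, 1.60]

for k, jump in sorted(coefficient_jumps(example_coefficients).items()):
    print(f"{k} clusters: jump = {jump:.2f}")
print("suggested number of clusters:", suggest_clusters(example_coefficients))
```

Treat the suggestion as a starting point only. As the captions say, when two neighbouring solutions show a similar jump, choose whichever is more meaningful. Also note that centroid clustering can produce coefficients that decrease from one stage to the next, in which case a jump can be negative.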
Info
Channel: James Gaskin
Views: 23,502
Rating: 4.826087 out of 5
Keywords: SPSS, Cluster Analysis, hierarchical, agglomeration schedule, Statistics, visualize clusters, anova
Id: mSzk2KrbNfs
Length: 5min 57sec (357 seconds)
Published: Thu Jun 25 2015