Hierarchical Cluster Analysis in SPSS (SPSS Tutorial Video #29) - Dendrogram

Captions

Welcome to Data Demystified. I'm Jeff Galak, and this is my series of tutorial videos on how to use SPSS to work with data. In this video, I'm going to show you how to conduct and interpret a hierarchical cluster analysis. As always, we'll be using the YouTube viewing habits survey that I created, and you can find both a link to the data file and a video tutorial of the data below.

Very often when we're dealing with data, we can compute central tendencies like means or medians. But sometimes what we want to know is this: if not every response is identical, are there groups of responses that are cohesive and stick together? That's where some form of cluster analysis comes in. There are a variety of tools that we can use, and in this video I'll focus on hierarchical cluster analysis. Now, I'll admit this is not my favorite form of cluster analysis, but it turns out to be really useful in one specific way, and that is providing us a good estimate of how many clusters our data has. We'll see how it does that in just a moment.

We're going to be using cluster analysis in this case to answer the following question. We have a number of dimensions right here, which describe how important each of these things is in determining whether somebody watches a YouTube video. Now, it's possible that all of those are exactly the same for everybody, but more likely than not there are groups of people who tend to respond differently on these dimensions, and we might be able to group those types of people. We can do that with this type of cluster analysis.

To run a hierarchical cluster analysis, we go up to Analyze > Classify > Hierarchical Cluster Analysis, and there are a few things that we have to do here. The very first is to put in the variables that we'll be analyzing, and that is all of our importance measures right here, so I'll put those into Variables. Under Plots, critically, we need to select the dendrogram. This is a diagram that's going to help us understand the 
relationship between our variables and help us decide how many clusters there actually are in our data. So we'll click Continue.

Under Method, there are a variety of approaches to determining what clusters to use. If we click on the clustering method, we see quite a few. Some of the more common ones are between-groups linkage, furthest neighbor, and centroid clustering. What I tend to prefer, though, is Ward's method. One of the things Ward's method does is help create equal-sized clusters. It's very possible that if we use another approach, we're going to find clusters where there are just a few responses in one group and lots and lots of responses in another, and practically speaking that's not very useful. Ward's method attempts to create clusters that are more evenly sized, so we'll select that.

Now, all of my data are coming from the same type of scale, so I don't need to worry about standardization. But if your data are coming from widely different sources in terms of the range of values that you're dealing with, you might want to consider standardizing your data using something like z-scores. I don't need to do that, so I'll skip that option and click Continue.

Now, if this were the final step, and I was just running this cluster analysis and being done with it, I could go to Save and save the cluster membership. When this is all done, it'll categorize each response in our data set as being a member of one cluster or another. Since I'm really going to be using this as an input to another cluster analysis technique, k-means, which is something I cover in the next video, I don't actually need to save the cluster membership, but I will, just to show you what that looks like. For this example I'll pick Single solution, meaning let's just assume that there are exactly three clusters. I don't know that to be true just yet, but we're going to go ahead with it just to see what happens. We'll click Continue, and we'll 
click OK.

Now, the cluster analysis might take a moment to run, and a lot of information comes out. What's important to note is that a lot of these tables are going to be very large, because we have a thousand data points in our data set. I'm going to skip over some of them, including this agglomeration table and this vertical icicle chart, as I don't find them particularly useful. What I am going to focus on is this dendrogram.

One small trick in SPSS: this dendrogram is going to be as large as the data set we have, and it's really hard to read in this format. A quick tip for making it a little more readable: if we double-click into it, we get the Chart Editor. If we right-click and select the Properties Window, then under Chart Size uncheck "Maintain aspect ratio" and set the height to something manageable, like 8 inches, we can click Apply, and what that will do is basically smush our chart. You can see that here. We can then exit out of it, and now this chart is a whole lot more readable than it was before.

So this is how this chart works. We can start on the left over here, next to what I'll call the origin, and what you have is every single response all the way down the list. This is all thousand individuals who completed our survey. It's a little hard to read because we smushed this chart together, but at the extreme, we could say that there are a thousand clusters in our data. That's not very useful, because it doesn't actually group anything, but of course we can assign each person their own cluster. At the other extreme, we can say that everyone is exactly the same, and if we were to go off this chart to the right over here, we'd just be computing, let's say, an average of everyone. That's also not useful.

So what this dendrogram lets us do is decide how many clusters we're going to have. This is a hierarchical branching diagram where each of these branches denotes connections, and 
the closer those connections are to each other on this diagram, the more related they are to one another. So, for example, right here is one cluster, and I know that because there's a branch right here that eventually splits into smaller groups, but ultimately this is some sort of sizeable cluster right here. This cluster is reasonably similar to this cluster here, because they are positioned close to one another on this chart. That's in contrast to this cluster compared to, say, this cluster down here: they're very different from one another, because they're quite far apart on this chart.

As we move up this branching diagram, the clusters become larger and more heterogeneous, meaning that there's more variation in what is contained within the cluster, as we create clusters based on branches that are further to the right in our diagram. Let me say that a bit differently. If I just pick two people that are right next to each other, they're probably very, very similar to one another. But as I expand what I consider to be a group, let's say I consider this one big cluster based on this one branch right here, well, these people are in fact more similar to one another than to people in this group, right? This group is going to be different from this group, but within this group there's a lot more variation.

So there's a trade-off we have to make, which is to say: where do we make the cut? Do we decide to have lots of little clusters, let's say here's a cluster, here's a cluster, here's another, another, another, and so on, where everyone in the cluster is very similar to one another, but now we're dealing with lots and lots of groups? Or do we choose really large clusters, let's say picking one right here and another right here? Well, now we've got a few clusters, which is nice because it's a little easier for us to handle from a mental perspective, but those clusters are more heterogeneous themselves. That's actually a subjective call, and the way we do this is 
looking at how much our grouping would change if we made small deviations in where we drew a hypothetical vertical line running through this chart. Let's just pretend I draw a vertical line right here. If I move that line a little to the left or a little to the right, I still conclude that there are two general clusters. How do I know that? Well, here's a cluster following this branching and encapsulating all of these people, and here's another cluster following this branching and encapsulating all of these people. So little deviations don't do much to change my solution, which is a good thing.

On the other hand, if I drew my line, let's say, right here, that would give three clusters: one right here, another right here, and a third right here. But the problem is that with a slight move this way, all of a sudden I generate many more clusters. In other words, tiny variations in where I draw that line change my conclusion, and that's not great. We want solutions that are relatively stable to small variations in our judgment call.

So if I were to look at this dendrogram, I would conclude firmly that there are two clusters in this grouping, one right here and one right here. And if you really pushed me, I might say, well, maybe there's a third cluster as well, so one big one here, one here, and one here. But practically speaking, the most robust solution is to say that there are two groups of people: a two-cluster solution. So if I were then to take this to the next step and look at something like a k-means algorithm, which is going to be a little more robust in how it identifies those clusters, I would feed that algorithm the solution from this dendrogram, which is two clusters.

Now, if I didn't want to do that and I wanted to rely solely on this particular hierarchical cluster analysis, if we go back to our data, we now have a new variable called CLU3_1. What this is is a solution where there are exactly three clusters, because that's what I defined 
in my options. If I go to the Data View, if I just double-click on this, we see that each row, each response, is categorized as being in one of three clusters: one, two, or three. So now, if I wanted to go back and identify who those people are, I can do that using this new variable. But again, I don't find this particularly intuitive or useful. Instead, I'm going to take the next step and plug the solution of two clusters that we saw a moment ago into a k-means algorithm, which is the topic of the next video, and I'll make sure to link to that below.

That's it for this video. I hope you found this useful, and if you have any questions, please comment below and I'll be sure to reply as quickly as I can. Aside from these tutorials, I'm on a mission to equip everyone with the information they need to thrive in our data-rich world. If you'd like to learn not just the mechanics of analysis, which these video tutorials focus on, but also the intuition behind the analysis you're performing, I strongly suggest you check out the other intuition-focused videos on this channel, where I take the jargon out of statistics and data science and help you build a deep, intuitive understanding of all the analysis that you're performing. I'll put a link below to a playlist of the videos that focus on just this. Finally, please take a moment to like the video, subscribe to this channel, and click that little bell icon so you don't miss out on any new content that I put out. Thanks for watching.
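
The video works entirely through SPSS menus. For readers who want to see the same logic outside SPSS, here is a minimal Python sketch. It is not the presenter's method and not SPSS's implementation. It assumes a tiny made-up set of responses on three hypothetical 1-to-7 importance items, merges them with a naive standard-library version of Ward's criterion, and then imitates the "vertical line" judgment by picking the cluster count that survives the widest range of cut positions. The names `ward_cluster`, `most_stable_k`, and `responses` are illustrative only, and the merge cost used here (the increase in within-cluster sum of squares) may be scaled differently from the distance axis SPSS draws.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["size"] * b["size"] / (a["size"] + b["size"]) * sq_dist


def ward_cluster(points):
    """Naive agglomerative clustering using Ward's criterion.

    Returns (heights, partitions): heights[m] is the cost of the m-th merge,
    and partitions[k] lists the members of each cluster in the k-cluster solution.
    """
    clusters = [{"members": [i], "size": 1, "centroid": tuple(p)}
                for i, p in enumerate(points)]
    heights = []
    partitions = {len(clusters): [list(c["members"]) for c in clusters]}
    while len(clusters) > 1:
        # Find the pair whose merge adds the least within-cluster variation.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]))
        a, b = clusters[i], clusters[j]
        heights.append(ward_cost(a, b))
        size = a["size"] + b["size"]
        merged = {
            "members": a["members"] + b["members"],
            "size": size,
            # Size-weighted mean of the two centroids.
            "centroid": tuple((a["size"] * x + b["size"] * y) / size
                              for x, y in zip(a["centroid"], b["centroid"])),
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        partitions[len(clusters)] = [sorted(c["members"]) for c in clusters]
    return heights, partitions


def most_stable_k(heights, max_k=6):
    """Pick the k whose solution spans the widest range of cut heights.

    The k-cluster solution exists between the merge that created it
    (heights[n - k - 1]) and the merge that ends it (heights[n - k]).
    """
    n = len(heights) + 1
    gaps = {k: heights[n - k] - heights[n - k - 1]
            for k in range(2, min(max_k, n - 1) + 1)}
    return max(gaps, key=gaps.get)


# Hypothetical responses: three importance items rated 1 to 7 (made-up data).
responses = [
    (6, 7, 2), (7, 6, 1), (6, 6, 2), (7, 7, 1), (6, 7, 1),
    (2, 1, 6), (1, 2, 7), (2, 2, 6), (1, 1, 7), (2, 1, 7),
]

heights, partitions = ward_cluster(responses)
k = most_stable_k(heights)
print("suggested number of clusters:", k)
print("members of each cluster:", sorted(partitions[k]))
```

On this toy data the widest gap belongs to the two-cluster solution, which mirrors the reasoning in the video: a cut that can slide left or right without changing the answer is the one to trust. The pairwise search is slow, so for a data set the size of the survey a library routine would be the practical choice.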

Info

Channel: Data Demystified

Views: 5,229

Rating: 5 out of 5

Keywords: Hierarchical Cluster Analysis, hierarchical cluster analysis spss, dendrogram clustering example, dendrogram, dendrogram spss, how to interpret a dendrogram in spss, how to interpret a dendrogram, ward's method, ward's linkage, ward's method spss, cluster analysis, cluster analysis spss, k-mean cluster analysis, how to pick the right number of clusters, cluster analysis tutorial, hierarchical cluster analysis tutorial, spss, spss how to, spss tutorial, data demystified

Id: 3BzfOLnIY9w

Length: 9min 14sec (554 seconds)

Published: Wed Dec 16 2020