Introduction to Cluster Analysis with R - an Example

Captions
We start by reading a CSV file from my desktop. The name of the file is utilities, and we store the result in utilities itself. It has 22 observations and 9 variables. If you look at the data, the first column is company, and the quantitative variables start from the second column. We can look at the structure: the first variable is a factor variable, and the other variables are either numeric or integer.

Next we do a scatter plot of fuel cost versus sales, using the utilities data. We can add the names of the companies next to the dots by adding text with labels equal to company, the first column. There are some overlaps, so we can probably make some adjustments to the graph. We can specify the position of the company names; let's use 4, so the name appears on the right-hand side (if you use 1, it will be at the bottom), and set the text size to 0.3. Now the overlap is less, but the names are very tiny, so we increase the size slightly to 0.4. This is much cleaner. You can see there are four companies with high sales but low fuel cost, a group of companies in the middle with medium sales and medium-to-lower fuel cost, and a group of companies with lower sales and higher fuel cost. So very broadly we can see three clusters, but these three clusters are based on only two variables.

We can normalize a variable by subtracting the mean and dividing by the standard deviation. Please remember that for cluster analysis we need quantitative data, so the first variable in the data set should be removed from further analysis. I'm going to create z, which takes all rows from utilities but drops the first variable, so it will not have company in it. If you type z, you will see that the company column is gone and we have only quantitative variables. Some variables are very low in value, like 0.6 or 1.5, while a variable like sales is in the thousands, like 9,000 or 5,000, so normalization is needed to give all the variables a level playing field. It should not happen that some variables dominate the whole show just because their values are very high; when the clusters are being formed, we don't want one variable to dominate just because its observations are on the higher side. When we normalize, the average of each variable becomes zero and the standard deviation is approximately one, and that creates the level playing field.

First we calculate the mean for all the variables: we apply this to the z data set, and we put 2 to indicate that we are doing this for columns, because our variables are in columns (1 would be for rows), and then we say mean. We store all these means in m. We do the same thing to store all the standard deviations in s, and then we calculate the normalized data set z.

Once we have z, we can calculate the Euclidean distance using the dist function and store the result in, say, distance. There are so many decimals, so let me make the output more compact using the print command. This gives the Euclidean distance among all the different records; we have 22 rows in the data set, and how close or how far each one is from the others is listed here. For example, there is a very high value, about 6, between the 7th row and the 11th company, so those two companies are very dissimilar in terms of these variables because the distance is quite high. But this one is 1.66, so companies 7 and 12 seem to be much closer in terms of Euclidean distance, meaning there is a lot of similarity between these two companies.
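In code, these steps might look roughly like the following. This is a minimal sketch: the file path and the column names (company, Fuel_Cost, Sales) are assumptions based on the narration, not confirmed from the video.

    # Read the data; file name and column names are assumed from the narration
    utilities <- read.csv("utilities.csv")
    str(utilities)                   # first variable is a factor, rest numeric/integer

    # Scatter plot of fuel cost versus sales, company names to the right of the dots
    plot(Fuel_Cost ~ Sales, data = utilities)
    with(utilities, text(Sales, Fuel_Cost, labels = company, pos = 4, cex = 0.4))

    # Drop the qualitative first column, then normalize each variable
    z <- utilities[, -1]
    m <- apply(z, 2, mean)           # column means (2 = columns, 1 = rows)
    s <- apply(z, 2, sd)             # column standard deviations
    z <- scale(z, center = m, scale = s)

    # Euclidean distance between every pair of rows, printed compactly
    distance <- dist(z)
    print(distance, digits = 3)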
Now we can make a cluster dendrogram; the default gives us complete linkage. For that we use hclust, h for hierarchical and clust for cluster, on distance, and we store the outcome in hc.c, with c to represent complete linkage. Then we can make the dendrogram using the plot command. Because this is hierarchical clustering, initially each company is treated as a single cluster, and then we find which one is closest in terms of distance. For example, companies 10 and 13 seem to be very close to each other, so they are grouped into one cluster first, and this process continues until all the companies form one big cluster. When the height is about seven, everything becomes one huge cluster. If you draw a horizontal line at, say, six, that may give rise to one big cluster here; a second cluster could involve companies 11, 8, and 16; and a third cluster would involve companies 17, 7, 15, 12, and 21, because they are really close to each other. If you look at 10 and 13 in the distance matrix, the value is 1.41, which means the two rows represented by 10 and 13 are very close to each other. We can indicate the names by setting labels equal to the company variable from the utilities data set, so the names of the companies are now listed on the cluster dendrogram, and at the bottom you can see this is the complete type of linkage. You can also align the lines with each other using hang, so that all the leaves are aligned horizontally.

We can similarly make a dendrogram with average linkage; let's call the output hc.a, with a for average. You can see the method is now average, and compared to the earlier dendrogram the formation of clusters is slightly different, because we are using a different linkage.

Now, coming to cluster membership, we store the membership information in member.c using the cutree function on hc.c, and let's say we are looking at three clusters; we have only 22 rows of data, so going for too many clusters may not be advisable. Similarly, we can find memberships based on average linkage, and then we can make a table. This table tells us that using the average linkage method there were 13 plus 5, that is, 18 observations that belong to cluster 1, one observation that belonged to cluster 2, and three observations that belong to cluster 3. Using the complete linkage method there were 13 plus 1, that is, 14 companies that belong to cluster 1, five that belong to cluster 2, and three that belong to cluster 3. If you compare average linkage versus complete linkage, there is a good match for 13 companies: both methods listed them in cluster 1. There were also three companies where both methods indicated cluster 3. But there is also some amount of mismatch: one company had membership in cluster 2 based on the average method but membership in cluster 1 with the complete linkage method, and similarly there are five companies that have membership in cluster 2 with complete linkage but belong to cluster 1 when we use average linkage. This table allows us to compare the two methods and see how cluster formation behaves when we use average linkage versus complete linkage.
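A sketch of the hierarchical clustering steps; the object names hc.c, hc.a, and member.c mirror the ones used in the narration, and the company column name is assumed:

    # Hierarchical clustering on the distance matrix
    hc.c <- hclust(distance)                           # default method is complete linkage
    plot(hc.c, labels = utilities$company, hang = -1)  # hang = -1 aligns the leaves

    hc.a <- hclust(distance, method = "average")
    plot(hc.a, hang = -1)

    # Membership for three clusters under each linkage, and a comparison table
    member.c <- cutree(hc.c, 3)
    member.a <- cutree(hc.a, 3)
    table(member.c, member.a)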
We can also calculate cluster means using the aggregate command. We do this for z, the list we use is the complete linkage membership, and we calculate the mean. We get average values of each variable for the three clusters; remember these are normalized values. This helps us characterize the three clusters. If we do not see much variation among the three averages for a variable, that variable is not really playing a very significant role in deciding cluster membership for the companies. If you look at sales, there is a very high value, 1.85, and a negative value, -0.67, so sales clearly seems to have a significant impact on cluster membership: companies that belong to cluster 3 have higher sales, whereas companies that belong to cluster 2 seem to have lower-than-average sales. On the other hand, if you look at fuel cost, companies in cluster 2 seem to have higher fuel cost, whereas companies in cluster 3 seem to have lower-than-average fuel cost. These averages indicate which variables are really playing an important role in characterizing the clusters, so this is very useful. You can also do this aggregation in the original units, and with the averages in original units the interpretation becomes easier.

We can also visualize the clustering using a silhouette plot, and for this we are going to use the library called cluster. If you run this line, where hc.c is our hierarchical clustering with the complete linkage method, we are using three clusters in this analysis, and the function we use is cutree; with that we develop the silhouette plot. If cluster formation has been good, meaning the members of a cluster are close to each other, the silhouette values will be high; if not, the silhouette values will be low. This kind of visualization helps to identify clusters visually. If a silhouette value is negative, that member is a sort of outlier and does not really belong to its group.

We can also develop what is called a scree plot. A scree plot requires calculating the within-group sum of squares, which I am representing here as wss; how it is calculated is included here. Let me run these three lines. The plot you get on the right side is called a scree plot. It gives you an overview of all possible numbers of clusters and the within-group sum of squares, where within-group means within-cluster variability, which we want to reduce. When you go from one cluster to two clusters, you can see the drop in the within-group sum of squares is very large, and similarly for the next step, but if you try to have five or more clusters, the improvement is not that significant; the drop in variability, or within-group sum of squares, is not really that much. The scree plot indicates in this case that we should go for a lower number of clusters, maybe two, maybe three; beyond that the gains are not very significant.
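The aggregation, silhouette, and scree plot steps might look like this; a sketch, where the 2:20 range of the scree loop is an assumption rather than something stated in the narration:

    # Cluster means per variable, in normalized and then in original units
    aggregate(z, by = list(member.c), FUN = mean)
    aggregate(utilities[, -1], by = list(member.c), FUN = mean)

    # Silhouette plot for the complete-linkage, three-cluster solution
    library(cluster)
    plot(silhouette(cutree(hc.c, 3), distance))

    # Scree plot: within-group sum of squares for increasing numbers of clusters
    wss <- (nrow(z) - 1) * sum(apply(z, 2, var))
    for (i in 2:20) wss[i] <- sum(kmeans(z, centers = i)$withinss)
    plot(1:20, wss, type = "b",
         xlab = "Number of clusters", ylab = "Within-group sum of squares")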
For k-means clustering we have to choose at the beginning how many clusters we are interested in. We can do k-means clustering using the kmeans function on z, our normalized data set, and let's say we go for three clusters. We can store this information in, say, kc, for k-means clustering. If you want to look at what this analysis gives us, simply type kc, and it gives a lot of information. For example, the top line is "k-means clustering with 3 clusters of sizes 12, 3, and 7", so the first cluster has 12 companies in it, the second cluster has 3, and the third cluster 7. It also gives us the cluster means, and then you have the cluster membership: this clustering vector tells us that the first company goes into cluster 1, the second company belongs to cluster 3 based on this analysis, and so on. Then you have the within-cluster sum of squares: the within-cluster variability is 58.01 for the first cluster, whereas for cluster 2, which has 3 members, the variability is lower, which means the members are closer to each other in terms of distance. It also gives you the between-cluster sum of squares divided by the total sum of squares, which here is about 39.5%. And there are various components of this cluster analysis available: if you want to look at the first one, for example, you can say kc$cluster, which gives us the cluster memberships, and similarly kc$centers gives you the averages.

We can also plot any two variables in the form of a scatter plot. Let's try the two variables demand and sales, so we plot sales versus D.demand from the utilities data set. If you simply run this plot, it gives you a scatter plot. You can then color-code the three clusters by setting the color equal to the cluster membership, so the membership is indicated with the help of a color. Remember that the first cluster had 12 observations, so the circles that are black in color are those 12 companies; similarly the second cluster has 3 companies, these 3, and the green ones are the third cluster. You can see that the second cluster's separation from the first and third clusters is very good, with hardly any overlap, but between the first and third clusters there is a lot of overlap. Clustering is good when the between-cluster distance is high and the within-cluster distance is low, so in this case we get good separation for the second cluster, but for these two variables the separation between the first and third clusters is not really that good.
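The k-means commands, again as a sketch; the column name D.demand is an assumption based on how the variable is spoken in the narration:

    # k-means on the normalized data with three clusters
    # (results vary with the random start; use set.seed() for reproducibility)
    kc <- kmeans(z, 3)
    kc            # sizes, cluster means, clustering vector, sums of squares
    kc$cluster    # cluster membership for each company
    kc$centers    # cluster centers (averages)

    # Scatter plot of two variables, color-coded by cluster membership
    plot(Sales ~ D.demand, data = utilities, col = kc$cluster)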
Info
Channel: Dr. Bharatendra Rai
Views: 176,760
Rating: 4.9481921 out of 5
Keywords: Cluster Analysis, working with r
Id: 5eDqRysaico
Length: 18min 11sec (1091 seconds)
Published: Thu Dec 03 2015