59. Cluster Analysis in Practice - I

Captions
Welcome everyone to the class on marketing research and analysis. So far we have covered almost all of the major techniques used in marketing research: hypothesis testing, regression, analysis of variance, discriminant analysis, logistic regression, structural equation modeling (SEM), path analysis and factor analysis. Today we take up one more technique, called cluster analysis. What is cluster analysis, what is it similar to, and how does it differ from the other techniques? To understand that, note first that cluster analysis is very close to factor analysis. The basic objective of factor analysis is data summarization and data reduction; similarly, cluster analysis helps you create groups of similar objects. The difference lies in the basis of the grouping. In factor analysis the grouping is done on the variables: a large number of variables is reduced to a few factors. In cluster analysis the objective is not to group variables but to group cases, that is, the respondents. The aim is to see whether a large number of respondents can be collapsed into a few groups, so that we can say one group shows one kind of behavior and another group shows a different kind of behavior. A marketer can then use that knowledge to his or her benefit. So let us get into it.
Cluster analysis is a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess. Like factor analysis, it is an interdependence technique: there is no dependent or independent variable as such. It is a means of grouping records based upon the attributes that make them similar, and which attributes create that similarity among respondents is exactly what interests the researcher. If plotted geometrically, the objects within a cluster will be close together, while the distance between clusters will be larger. If there are two clusters, say cluster 1 and cluster 2, the data points within each cluster are as homogeneous as possible and the two clusters are as far apart as possible, so there is a clear-cut distinction: homogeneous within the cluster and heterogeneous between clusters. Cluster analysis is also called classification analysis or numerical taxonomy. If you remember your childhood biology and zoology classes, you studied how different animals and plants are grouped into different species: reptiles, Homo sapiens (the human being), fungi, algae and so on. We classified them according to their characteristics, and such classification is done with the help of techniques like cluster analysis. As an example, consider a plot where the estimated number of clusters is 4, each shown in a different color.
One cluster is light blue, and the others are dark blue, green and red; each color depicts a particular cluster. One cluster is the farthest from the others, while the remaining three are very close to each other. This is an example of how clusters are formed: in the species analogy, one could be humans and the others could be fungi, algae and other kinds of plants. Ideal clustering looks like this, but nothing comes out so cleanly in real life; in practice the data points are spread out, so one picture is the ideal case and the other is the practical case. If the data points were so clearly and distinctly separated, nothing could be better, but that does not happen in real life, which is why some clusters show some superimposition: parts of them overlap each other because the data points are so close together, and the clusters end up extremely close to one another. Now, the difference between cluster analysis and factor analysis. Cluster analysis is a classification technique, while factor analysis is a dimension-reduction technique. The basis of the grouping also differs, and this is very important: in factor analysis the grouping of variables is based on the patterns of variation, using the correlations among them, whereas cluster analysis does not use correlation, it uses the distance between the objects. The statistics used also differ: in cluster analysis we use the dendrogram (a diagrammatic tree representation), the cluster centroids and the agglomeration schedule, which I will show you shortly; these help you separate and create the clusters.
In factor analysis, if you remember, we used the eigenvalue, the factor loading (which is nothing but the correlation of a variable with the factor), the factor score and the communality. These are some of the things that differentiate cluster analysis from factor analysis, and the simplest and most important difference is that cluster analysis is done on cases or respondents, while factor analysis is done on variables. Similarly, compare cluster analysis with discriminant analysis. Discriminant analysis, if you remember, is a technique in which the dependent variable is categorical: you want to decide whether somebody should be admitted to a class or not, or be given a bank loan or not. So what is the difference between the two? Cluster analysis does not require any prior information; it is essentially a subjective exercise, and no prior information about group membership is needed. In discriminant analysis, by contrast, prior knowledge of the group membership of each object or case is required in order to develop the classification rule: to discriminate between two groups, you need very clear information about those groups. Also, cluster analysis, as I said, is an interdependence technique, while discriminant analysis is a dependence technique, because there is a categorical dependent variable and one or more continuous independent variables, for example pass/fail as a function of hours of study, the student's IQ and so on. Where is cluster analysis applied? In psychiatry, for example, characterizing patients on the basis of clusters of symptoms can be useful in identifying an appropriate form of therapy.
In mental health, for example, there are different kinds of patients and everyone cannot be treated the same; clustering helps us decide at what level of mental condition a patient lies, which in turn helps us prescribe different kinds of treatment. In biology, it is used to group genes that have similar functions: some genes are associated with happiness, some with aggressiveness, and cluster analysis helps identify which genes go together. The applications of cluster analysis have grown enormously. The World Wide Web consists of billions of web pages, and clustering can be used to group search results into a small number of clusters, each of which captures a particular aspect of the query: everything linked to the sea in one group; the sky, the universe and the Milky Way in another; or perhaps sea and reptiles together as one cluster. Understanding the earth's climate requires finding patterns in the atmosphere and ocean, and cluster analysis has been applied to find patterns in the atmospheric pressure of polar regions and in areas of the ocean that have a significant impact on land climate. Cluster analysis has also been used very heavily in marketing. First of all, it is used for segmenting the market. What is segmentation? Dividing the market into several parts so that each part has some measurable value. For example, consumers may be clustered on the basis of the benefits sought from the purchase of a product or service.
For example, customers are sometimes divided by adoption pattern: some are innovators, who try the product for the very first time and want to test it right away, while others want to check the product only after it has already been used. What is the behavior of these people, and what factors affect them? Benefit-sought patterns of this kind help in segmentation, which can be done through cluster analysis. For understanding buyer behavior, cluster analysis can be used to identify homogeneous groups of buyers, and then the buying behavior of each group can be examined separately. Suppose I have created two clusters, 1 and 2, on behavioral characteristics such as happiness and sportsmanship (being a team person); by comparing the scores of cluster 1 and cluster 2 on these characteristics, we can see which cluster is the better one for us. For identifying new product opportunities, clustering brands and products lets us determine the competitive sets within the market: once you can create different clusters, you can target whichever cluster you prefer and would be more beneficial for the company. Sometimes test markets are selected on the basis of clustering: by grouping cities into homogeneous clusters, it is possible to select comparable cities for testing various marketing strategies. Cities that fall in the same cluster are similar, so if my target segment falls into cluster 1 and that cluster contains 5 cities, I can select any one of them for my test market. Finally, cluster analysis can be used to reduce data, just as in factor analysis.
How does it work? That is the most important question. The primary objective is to define the structure of the data by placing the most similar observations into groups, so it is a matter of similarity or dissimilarity. To accomplish this we must answer three questions: how do we measure similarity, how do we form the clusters, and how many groups should we form? Similarity represents the degree of correspondence among objects across all the characteristics used. Please remember there are two ways to understand similarity. One is through correlational measures, as in factor analysis, where variables that are very close to each other, i.e. highly correlated, fall into one factor. But in cluster analysis we do not use correlational measures; we use distance measures, the measures of similarity most often used, with higher values representing greater dissimilarity, because distance measures dissimilarity between cases, not similarity. Suppose there are three clusters, 1, 2 and 3. The greater the distance between two clusters, the farther apart they are: in layman's terms, if clusters 1 and 2 are separated by a larger distance than clusters 1 and 3, then 1 and 2 are more different from each other, while 1 and 3 form the closer pair. Now look at a similarity example with two graphs showing the same pattern: the line rises, then falls, then falls more, rises, falls, falls more; the same thing happens in both cases.
If you check it in terms of correlation, both graphs have a correlation of r = 1. But if you look at the distance between the corresponding points, you can easily see that in one case the data points are much farther from each other. That is why we say graph 1 represents the higher level of similarity. Take another example with two lines, A and C: looking at the pattern of the data, they are very similar to each other. But if I now add a third line, B, you can see that its pattern is different; measuring the point-to-point distances, at one point the distance is very small but at other points it is very high. So A and B, and B and C, are different from each other: their distances differ. What we are basically doing in cluster analysis is measuring distance: how far one cluster is from another, how far one data point is from another. That is what we want to see, not the correlation, not the trend; correlation captures the trend, the pattern, whereas here we care about the actual distance. Most often we measure the Euclidean distance, which is the straight-line distance. There are several distance measures: the Euclidean distance, which is the straight-line distance, and the squared Euclidean distance, which is the sum of the squared differences without taking the square root.
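The correlation-versus-distance point can be checked in a few lines of plain Python. The series values below are made up for illustration (they are not the lecture's plotted data); the second profile is just the first shifted upward, so the two have an identical pattern (r = 1) yet a large Euclidean distance between them.

```python
import math

# Two series with an identical pattern: y2 is y1 shifted up by 10 units.
# Illustrative values only, not the lecture's actual plotted data.
y1 = [1.0, 3.0, 2.0, 4.0, 3.0]
y2 = [11.0, 13.0, 12.0, 14.0, 13.0]

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def euclidean(a, b):
    """Straight-line distance, treating each series as a point in n-space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(pearson_r(y1, y2))   # ~1.0: identical pattern, perfect correlation
print(euclidean(y1, y2))   # ~22.36: yet the two profiles are far apart
```

Factor analysis, which groups by correlation, would treat these two profiles as the same; cluster analysis, which groups by distance, would not.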
For example, the Euclidean distance between two points is d = √((x₂ − x₁)² + (y₂ − y₁)²); if you simply remove the square root, you get the squared Euclidean distance. The city-block distance uses the sum of the absolute differences of the variables; I will show you in the next slide. The Chebychev distance is the maximum absolute difference across the clustering variables. Finally, there is the Mahalanobis distance, which is used when there is a high degree of correlation among the variables, in order to avoid that situation: it is a generalized distance measure that accounts for the correlations among variables in a way that weights each variable equally, and in effect it standardizes the values. The slide illustrates the Euclidean distance, the city-block distance (the sum of the absolute differences along each axis) and the Chebychev distance, so there are different distance measures. Now let us take an example: a market researcher wishes to determine tourism segments for a wellness destination based on patterns of tourist emotions. On the basis of emotions, he wants to create segments for the destination. A small sample of seven respondents is selected as a pilot test and cluster analysis is applied. There are two measures of emotion, V1 and V2, measured on each respondent on a 0 to 10 scale, and the data are given. When we plot the data with V1 and V2 as axes, the first case, A, is at (3, 2) and B is at (4, 5), and so on; but from the plot alone we cannot say much.
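The four simpler distance measures just listed can be sketched in a few lines of Python. The two points are A = (3, 2) and B = (4, 5) from the lecture's tourist-emotions example; the Mahalanobis distance is omitted because it needs a covariance matrix estimated from the full sample.

```python
import math

# Two respondents from the lecture's example, measured on V1 and V2.
a, b = (3.0, 2.0), (4.0, 5.0)

def euclidean(p, q):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def squared_euclidean(p, q):
    # Sum of squared differences, no square root.
    return sum((x - y) ** 2 for x, y in zip(p, q))

def city_block(p, q):
    # Sum of absolute differences along each variable.
    return sum(abs(x - y) for x, y in zip(p, q))

def chebychev(p, q):
    # Maximum absolute difference across the variables.
    return max(abs(x - y) for x, y in zip(p, q))

print(euclidean(a, b))          # ~3.162
print(squared_euclidean(a, b))  # 10.0
print(city_block(a, b))         # 4.0
print(chebychev(a, b))          # 3.0
```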
Now, how do we measure the similarity? That was one of the questions in our mind. For respondents A and B, taking the same table, the distance is

d(A, B) = √((V1(A) − V1(B))² + (V2(A) − V2(B))²) = √((3 − 4)² + (2 − 5)²) = √10 = 3.162.

Similarly you can compute A and C, A and D, A and E, A and F, A and G; the distance for B and A is the same 3.162. For A and C, d(A, C) = √((3 − 4)² + (2 − 7)²) = √26 = 5.099, and likewise for the rest: every pairwise value has to be written out. How do we form the clusters? Identify the two most similar observations not already in the same cluster and combine them. Once you have these values, you start the clustering: find the most similar pair, combine them, and keep doing this until you reach a single cluster. This process is termed a hierarchical procedure, hierarchical because it proceeds in order, moving in a stepwise fashion to form an entire range of cluster solutions. It is also an agglomerative method, because clusters are formed by combining existing clusters; methods can be agglomerative or divisive, as we will see. So we have computed these values for the initial solution and noted the closest pairs: E and F at 1.414, then E and G, C and D, B and C, B and E, and A and B (3.162, as computed above). In the initial solution A, B, C, D, E, F, G are all separate; no clusters have been combined yet. Then we start by combining the closest pair.
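The full pairwise distance table can be built in a few lines. The transcript gives A = (3, 2) and B = (4, 5) explicitly and implies the rest through the distances it computes; the coordinates for the remaining respondents below are an assumption, chosen so that every distance quoted in the lecture (d(A,C) = 5.099, d(E,F) = 1.414, d(F,G) = 3.162, and so on) is reproduced.

```python
import math
from itertools import combinations

# Seven respondents on V1 and V2.  A and B are stated in the lecture;
# the other coordinates are assumed values consistent with the quoted
# distances, not figures read from the original table.
points = {
    "A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
    "E": (6, 6), "F": (7, 7), "G": (6, 4),
}

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Pairwise distance table, keyed by unordered pair of respondent names.
dist = {frozenset(pair): euclidean(points[pair[0]], points[pair[1]])
        for pair in combinations(points, 2)}

# Print all 21 pairs from closest to farthest.
for pair in sorted(dist, key=dist.get):
    print("-".join(sorted(pair)), round(dist[pair], 3))
# The smallest entry is E-F at 1.414, so E and F are combined first.
```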
The merges are done in order. First we combine E and F, leaving six clusters, and the remaining values stay the same. Next we combine G with E and F, giving clusters A, B, C, D and (E, F, G). You may wonder how the new value for this cluster is obtained. Take the distance from E to F, 1.414, the distance from E to G, 2, and the distance from F to G, which you can compute as √((7 − 6)² + (7 − 4)²) = √10 = 3.162. Add these three and divide by 3, and you get the value shown; you can do the same for the other clusters. Combining step by step, you finally arrive at a single cluster. In steps 1, 2, 3 and 4 the overall similarity measure does not change substantially, which indicates that we are forming clusters with essentially the same heterogeneity as the existing ones. At step 5 we see a large increase, which indicates that joining clusters (B, C, D) and (E, F, G) resulted in a single cluster that was markedly less homogeneous. This is how the clusters are formed. So how many groups do we form? The three-cluster solution of step 4 seems the most appropriate. Look at where the change in the measure is greatest: through the early steps there is hardly any change, then there is one big jump. You can equally well read the schedule from the bottom up.
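The stepwise merging described above can be sketched as a small agglomerative routine. It uses the same assumed coordinate table as before and merges, at each step, the two clusters containing the closest pair of observations (single linkage). Three pairs are tied at distance 2.0, so the middle of the merge order is arbitrary and may differ from the lecture's table, but the sequence of merge distances is the same; the lecture's averaged value for cluster (E, F, G) is also verified at the end.

```python
import math
from itertools import combinations

# Same respondents; coordinates beyond A and B are assumed values chosen
# to reproduce the distances quoted in the lecture.
points = {"A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
          "E": (6, 6), "F": (7, 7), "G": (6, 4)}

def d(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(points[p], points[q])))

# Agglomerative (single-linkage) procedure: start with each observation in
# its own cluster, then repeatedly merge the two closest clusters.
clusters = [frozenset([name]) for name in points]
merges = []
while len(clusters) > 1:
    best = min(combinations(clusters, 2),
               key=lambda ab: min(d(p, q) for p in ab[0] for q in ab[1]))
    step_dist = min(d(p, q) for p in best[0] for q in best[1])
    merged = best[0] | best[1]
    clusters = [c for c in clusters if c not in best] + [merged]
    merges.append((set(merged), round(step_dist, 3)))

for members, step_dist in merges:
    print(sorted(members), step_dist)

# The lecture's value for cluster {E, F, G}: the average of all
# within-cluster pairwise distances, (1.414 + 2 + 3.162) / 3.
efg = (d("E", "F") + d("E", "G") + d("F", "G")) / 3
print(round(efg, 3))  # 2.192
```

The printed distances jump from 2.236 to 3.162 only at the last step, when the lone observation A is finally absorbed, which is the kind of "large increase" used above to decide where to stop.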
Going from the bottom, there is some change, then only a slight change, and then almost no change. That is how we decide how many clusters to form. Step 4 therefore seems the most appropriate final cluster solution, with two equally sized clusters, (B, C, D) and (E, F, G), and a single outlying observation, A: at the fourth step, B, C, D and E, F, G are grouped and A is alone. What is the dendrogram? It is a graphical representation, a tree graph, of the results of the hierarchical procedure, showing how the clusters were combined. Starting with each object as a separate cluster, the dendrogram shows graphically how the clusters are combined at each step of the procedure until all are contained in a single cluster: you add one object, then the next, and slowly build up. Say objects 1 and 2 are combined, then 3 and 4 are combined, then (3, 4) joins (1, 2), then 5 joins that group and finally 6, and so it goes. How do you derive the clusters? There are a number of different methods for carrying out a cluster analysis and deciding how many clusters should be made. One is hierarchical cluster analysis, another is non-hierarchical cluster analysis, and the third is a combination of both. Let me brief you on these, and we will carry on from here in the next lecture. Hierarchical cluster analysis, as you have seen, helps you indicate the number of clusters that need to be formed. Non-hierarchical clustering, which in most software appears as k-means clustering, is used to identify and analyze the behavior of each cluster.
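The non-hierarchical (k-means) step mentioned above can be sketched as a minimal Lloyd's-algorithm loop on the same seven respondents, with k = 3 as suggested by the hierarchical step. Seeding the centroids at A, C and F is an assumption made purely so this toy run is deterministic; real software (SPSS included) chooses starting points differently.

```python
import math

# Same assumed coordinate table as in the earlier sketches.
points = {"A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
          "E": (6, 6), "F": (7, 7), "G": (6, 4)}

def dist(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def kmeans(data, seeds, iters=20):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    centroids = [data[s] for s in seeds]
    for _ in range(iters):
        # Assignment step: each case goes to its nearest centroid.
        groups = [[] for _ in centroids]
        for name, p in data.items():
            j = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[j].append(name)
        # Update step: move each centroid to the mean of its cases.
        centroids = [tuple(sum(data[n][k] for n in g) / len(g) for k in (0, 1))
                     for g in groups]
    return groups

clusters = kmeans(points, seeds=["A", "C", "F"])
print([sorted(g) for g in clusters])  # [['A'], ['B', 'C', 'D'], ['E', 'F', 'G']]
```

With these seeds the run converges to the same three-group solution the hierarchical procedure suggested: (B, C, D), (E, F, G) and the outlier A; the group centroids can then be inspected to profile each cluster's behavior.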
When these two methods are used together, the combination gives a very appropriate solution: the researcher can derive the number of clusters and then understand each of them very critically. We will continue with this topic in the next lecture, where we will also see how it is done in SPSS. Thanks for today.
Info
Channel: IIT Roorkee July 2018
Views: 1,559
Rating: 4.8095236 out of 5
Keywords: Prof. J. K. Nayak, Department of Management Studies, Indian Institute of Technology Roorkee, cluster analysis, cluster vs factor analysis, cluster vs discriminant analysis, application of cluster analysis, how to form clusters
Id: 1MQ35D3EL0c
Length: 28min 27sec (1707 seconds)
Published: Thu Apr 11 2019