Lecture 30- Cluster Analysis- I

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
Welcome everyone to the class of marketing  research and analysis till now we have   discussed about a several tools and techniques  used in the you know in the marketing space how   companies utilize them and what do they  gain out of it how they make advantage   out of those by using this techniques we have  discuss some of them one of this which we had   recently discussed was a technique which was  an interdependence technique in which we try   to bring in large number of variables to a few  ones and create factors out of it which we said   as factor analysis right. So in that we had you  know reduce large number of variables may be 100   to a few meaningful ones which I always repeat the  word meaningful right once and then give a name to   those factors and then try to understand how this  factors are going to determine or effect may be   some other you know prediction some other kind  of a you know relationship. Today we are going   to discuss similarly another technique which is  equally important right very much utilized in the   market basically by marketers right. Let us say  there is a case a company wants to sell let us   say some you know some kind of phones right now  a days a lot of phones are coming mobile phones   so it wants to sell mobile phones and it has  got the target of about let us say 100000 okay   100000 pieces of you know phones to be sold. Now the company want to know how should I sell   them right so while selling he wants to target the  customers so while targeting the customers he has   got certain variables like age of the people let  us say income of the people kind of a you know   habit of a people some kind of habit let us say  so now on basis of these variables the company   wants to divide a particular may be state or place  because suppose this is a whole place let us say   and now they want to say are we going to target  the whole state or no if we because if we are   going to target the whole state for 1 lakh pieces  of mobile it might be little difficult right. So   is there a way that I can you know divide this  state in to several several groups or clusters   then what it does is he tries to divide the  state not a geographically may be on certain   other parameters taking these three may be on  some clusters may be right some clusters. So now   it says out of this clusters they find all these  clusters have got different characteristics okay.   so it finds that may be once I break this  into four clusters, cluster two is the one   which is very promising it seems to the company  that it is a very promising you know cluster,   why? Because the people in this cluster are once  who are would be interested to buy such kind of   mobile phones which has all the characteristics  that this companies phone has got right. So in   such a condition this becomes highly useful for  the company to understand how we could easily   segment and then target the market so that we  can achieve better results right. So to do this   we are using the technique which is called cluster  analysis, so cluster analysis as understand is to   divide or you know create cluster out of a large  pool of respondents right we say respondents cases   whatever you can say. So these respondents are  grouped into several similar clusters right so   that each cluster has got people inside or you  know who are highly similar or highly homogenous   in nature right. So what we understand is the  all the respondents within the clusters within   the clusters let us say these are the respondents  all the people or all the respondent what it say   there highly heterogeneous sorry homogenous in  nature that means their behaviors is more or   less highly similar they are same right but  we understand the two clusters two clusters   let us say cluster one and cluster four cluster  two and cluster three they all different from each   other right so this is the basic understanding  of clusters analysis that within the cluster   there is very minimal difference or distance  and two clusters between the clusters.   Let us say the distance is high okay so let us  see how it functions what happens in a cluster   analysis so as I started so what I cluster how  to define it says it is a group of multivariate   technique who is primary purpose is to group  objects right example respondents it could be   products it could be entities whatever right  so you are trying to group these respondents   or products are something into several clusters.  Several groups which have a similar nature right   the other thing if you remember in factor analysis  we said there were trying to group the variables   right so that was variables here respondents  or products or whatever other entities right,   so what is saying it is a means of grouping  records based up on certain attributes.   Now what are the attributes on which one it is  grouping now the attributes in this case are for   example age, income level, let s say habit. let us say habit means spend let us say spending   habit right or let us say may be technologically  savvy how technologically tech savvy are people so   maybe that is an indicator of measuring right, so  you can have certain criteria certain attributes   upon which you can create those similarity  right so and each similar group is a cluster   right so if plotted geometrical is objects  within the clusters will be close together so   that is what I was saying if you plot it then  the objects with the clusters are very close   to each other that is why it is said they are  homogeneous in nature and two clusters are the   variables the respondents within two different  clusters are highly heterogeneous in nature   right. Then we have said more highly different  right this is what cluster basically tells you   so cluster variate represents a mathematical  representation of the selected set of variables   which compares the objectives object similarities  now what it is saying so mathematically you are   trying to plot may be a plot or try to find out  some similarity so that it becomes easier for   a marketer to understand okay which are the  clusters he or she should cater to right. It   becomes very simple they say take so many example  companies of Maruti is coming up with a new car   now should Maruti target everybody or should  Maruti have a specific thing specific policy   in mid how to target whom to target so when it  does it has to take certain attributes on this   basically attributes they finally decides okay. So clusters versus factor analysis clustering is   based on the distance matrix and factor  we said was in a correlation matrix,   so if you remember in factor analysis you  said that the most important thing was.   We were trying to find out a correlation right so  we were trying to find a correlation right so when   we are trying to find a correlation and this  correlation was telling us how close were the   variables to each other but in cluster analysis  we do not take the correlation rather we take   the distance what we take we take the distance so  the grouping is based on the distance how far are   the variables from each other or how close are  the variables from each other the you know from   each other. So as it says factor analysis we form  groups of variables based on the several peoples   responses to those variables in contrast cluster  analysis we group people based on the responses to   several variables now how people have responded  to several variables what score they have given   so that tells basically their behavior that  their thinking pattern and all so by taking   those variables their thinking pattern in all  these things you can classify these respondents   into several groups right. Which as I said is  basically a nothing but a mental distance or a   distance that is measured right so this distance  is basically mental distance sometimes we feel   something are very close to us but they are  actually not where as some other things some   other places might be felt that very far of but  they are actually not so far this is actually   nothing but a mental perception a mental distance  right. We feel that it is very far of but actually   that might be closer right so that is what  happens so where the applications are some   application are other mentioned. Field of psychiatry, biology. right   information retrieval for example the world  wide web contains billions of web pages so   when you type maybe for example market research  right, so all those pages which are linked to   research or marketing research they will be  clubbed they will brought together right so   this is nothing but they are clustering in one  way okay the climate for example understanding   the earth s climate requires finding pattern in  the atmosphere and the ocean to that and cluster   analysis have apply to fact fine patterns in the  atmospheric pressure of polar regions and areas   of the ocean that are the significant impact so  how do you say that some places are very similar   climatic conditions because this is again where  we use cluster analysis to divide the you know as   per the all the distance by that two places might  be far of but the climatic at the type of climate   the weather all this things are very similar  in two different places, so they can be still   clubbed into one right so understanding these  things is very from more applications I can   show you market segmentation grouping people. Right so grouping people and with the willingness   purchasing power the authority to buy according  to the similarity in several dimensions states   segmentation can tell you what type of customers  buy what products, so customer a would buy what   kind of a product that is what a helps you know  there is how the clusters is helps the marketer   right so some examples are always there many  more examples city planning, insurance right   geographical you know examples so all these are  basically are used to group respondents. According   to certain behavior attributes and find those  clusters right once you find those clusters it   becomes easy for you to polishing may for making  policy making for selling something for you know   maybe understanding the kind of trend or any trait  and it behavioral trait maybe some genetics study   you are wanted to do, so everywhere in fact to  tell you the use of clusters analysis was not   started did not start with marketing or something  right it has it is roots here actually in taxonomy   in biology. Where different kinds of spaces where  to be classified into different show in groups   right so this is how it is all has started So common role the cluster analysis can play   right first is data reduction so as factor  analysis also was helpful in data reduction   similarly cluster analysis also helps you in data  reduction researcher maybe face it larger number   of observations that can be meaningless right  unless classify to a managerial groups so how many   groups suppose you have 10,000 respondents 1 lakh  respondents but 1 lakh respondents individually   try if you understand this nothing you can do so  if you either create 10 clusters out of it. Then   it makes more much of better is meaning out  of it right second is hypothesis generation   cluster analysis also useful when a researcher  wishes to develop hypothesis concern the nature   of the data or to examine previously stated  hypothesis so cluster analysis is also useful   in developing a hypothesis right to it because  it has it gives you insight the knowledge about   certain things so it helps you to develop a  hypothesis right and even finally test it.   So what are the objectives so taxonomy description  so identifying groups data simplification the   ability to analyze groups on similar you  know observations instead all individual   which I said is very, very taxing and difficult  impossible sometimes then finally his relationship   identification where it says the simplified  structure from cluster portraits relationships not   revealed otherwise sometimes we can find out okay  let say now I have said there are four clusters   right so are this four clusters okay. Fine there is no doubt that this four clusters are   different but this is a possibility that cluster  1 and cluster 3 are actually very closed to each   other sometimes it happens in your state where you  stay that maybe one distinct the state has got 10   districts out of which two districts are extremely  similar because of their maybe a language there   food habits are something or cultural habits  so they are very similar so they can sometimes   if you want you can even club them and make it  like one cluster right. You can make it so this   relationship identification is there important  job of cluster analysis how does work.   So three basic things are there, now what  are these three basic things so it says the   primary objective of cluster analysis is  to define the structure of the data right,   so by placing the most similar observation into  groups remember. If you are working with data   and trying to find out similarity so that you can  form groups then there must, there could be some   problems with it also. Suppose the biggest  problem that can affect cluster analysis is   suppose you have a data which is got there is  few outliers in the data. Now if you have few   outliers in the data then that could completely  change the you know the way the groups are formed,   that could be a very important thing one should  keep in mind okay. Three things that are very   important what, how do we measure the similarity  so when you are doing the cluster analysis the   question comes how do I know key which groups  are similar there are four clusters.   Now how do I know which are similar to each other,  one second thing now do I form the clusters,   how do we form the clusters it is not that  key the data is given to us and we just do it,   there must be a way right, so how do I form  the clusters. Third, how many so the question   is how do I form clusters, how do I form. How do I form, right, how many do I form,   right how many, how do I form, how many do I  form and how do I measure a similarity, how do   I measure the similarity so if I can measure the  similarity then only I can do this right, so let   us see but one thing is you have to keep in mind  that in the cluster analysis we are not looking at   the correlation and why we are not looking at the  correlation I will explain you, see correlation   could is like something like this you know suppose  two things are moving like this. The correlation   might be very high among them but suppose this  two verses let us say take this you know just   for understanding, understand this way. suppose  this two lines and these two lines if you look the   correlation between the two both the lines pairs  let us A, B,C,D, A,B and C and D suppose you these   two are one group, these two are one group is more  or less same the correlations more or less same.   But if you look at the distance is  actually not same so that is what is   the basic underlying difference between cluster  and the distance and the correlation, okay.   So what is similarity represents the degree  of correspondents among objects across all   the characteristics used in the analysis,  so I have several attributes as I have used   let us say I have ask several variables, there  are several variables which I have used in the   study some of them being like income, age and all  these things. Now taking them together when I am   trying to build a kind of a similarity matrix  right, so this all this variables together will   help me not one but together they will help  me in defining a cluster, right, okay.   So as I said correlation basis are less  frequently used only in case there is a   special case where correlation is used where  you know when there is something called this   is called a molecular distance which is used when  the certain variables do have a correlation we   use it in special case I will come to that  right, otherwise most often the similarity   is measured through the distance, okay. Now this is what I was trying to explain you see   if you look at it, if you look at the two charts,  the two graphs the graph 1 and graph 2 both the   R=1, now what is R the correlation value so the  coefficient of correlation is 1 that means they   are highly correlated right, so which implies  to have a same pattern right, but the distances   are not equal, the distances between these and  the distances between these are not equal so   these two will get a similar different you know  interpretation in cluster analysis but had not   been enough case of a factor it would have been  it would be very similar, right So that is the,   thus the basic differences okay, so graph 1  represents high level of similarity right,   and graph 1 because the distance is less, why  it is saying now because if the distance is   less that means they are close to each other  as good as that, right as that right if it is   close the similar there will be coming trending  towards each other right if the distance is high   or the correlation the trend is same but there  is a sufficient gap right so all this things are   very important to understand this right. So the distance now how do you measure the   distance several ways the basic way of measuring  the distance is the Euclidean distance now   Euclidean distance is the distance which we  say is the straight line right so how do we   calculate so D is equal to let say (x2-x1)2  + (y2-y1)2 right so this is what basically   how we measure right so the most commonly  recognized 12 straight line distance right.   This is how you measure so the other forms  also for example square Euclidean distance   which says that you take the sum of the squared  differences without taking the square root that   means you only omit the square root right okay. So  the other distance is like the city block distance   Manhattan distance which is not like the Euclidean  because this one takes the absolute value. And it   is sometimes it is it does not work well because  of this nature of absolute value it does not work   well right so one is to be very careful which you  are applying mostly if you do not understand much   simply you can go for the Euclidean distance  because that is another safest way until   the correlation among the variables right. Then you have two more like Chebychev distance   which is taking the maximum of the absolute  difference in the clustering variables value   for example maximum of x1 x2-x1 or y2-x1 so the  absolute value is only taken right and the last   which I am saying is the Mahalanobis distance  which measures accounts for the correlation   among the variables in a way that each variables  equally. Now Mahalanobis distance is also very   important tool which is used to also find  out, outliers let me tell you this is may   be not here but suppose you are doing the  simple regression and you want to find out,   outliers Mahalanobis distance is the technique  which is used to measure to find out those   outliers okay so this is one way of doing it.  You remember only the condition applied is when   the variables have a large correlation or high  correlation among them that time it is preferable   to use a Mahalanobis distance over the other  distances and these things you will find almost   every were in the software nowadays right so  you do not need to calculate it by hand okay.   Now this is how the Euclidean distance looks  like all the three one go it is showing you   now this is hypothesis measure right. So let us say this is an example now let s   starts with the example now this example if you  can go through a market researcher a marketing   researcher wishes to determine market segments  in a community based on patterns of loyalty to   brands and stores to stores a small sample of 7  respondents is selected as a pilot two measures of   loyalty store loyalty and brand loyalty V1 and V2  V1 and V2 right were measured for each respondents   on a 0-10 scale right. The scores are given to  you so these are the 7 respondents A B C D E F   G and the score that they are given for store  loyalty and brand loyalty is been given to you   on the scale of 0-10 right now from this data let  us see how we can get into the cluster analysis   okay so when I place the data on a graph So 3 2 4 5 4 7 so if you can see this A is 3,   2 right then B is 4, 5 right C is 4,  7 D is 2, 7 so we can see here D is 2,   7 right so we have just paste them on the graph  and we place them on the graph right   Now how do we measures similarity the first  question that we have how do we measure the   similarity so to do this right how do we  measure similarity I said we can use the   Euclidean distance as a way of doing it now  how it has done now for example let us take   the distances now let us ate two distances okay  so among between two variables now for example   this was 3 2 4 5 4 7 2 7 okay. So let us take  only this much so this ADCD I am taking it. Now   suppose if we want to find out the distance,  now how do we do it? Now do to do that? What   is the ways simply, for between A and B. so the distance between A and B = x2 x1 so   3- 4 right you can say 4- 3 also does not make  a difference obviously you will square right,   (4 3)2 + (5 2)2 sorry either you take it this  way so(4 3)2 + (5 2)2 so what it is coming,   so 1 + 9 = 10.That means it is coming something  around 3.162, similarly you can find the distance   for all the other variables right, now the  distance between A and B is 3.162.   Obviously A and A would be same 1, now let us  calculate one more A and C , these two we are   measuring, if you measure these two it will be 4  d, a, c (4- 3)2 + (7-2)2 so that is = 1+25 = v26   so that is = 5. Something right,  so it must be A and C IS 5.099,   so for that everything you have measure right. So after measuring the next question was how do   we form the clusters? So you have measured the  distances, now find out the distance between two   clusters, the minimum distance between the  two clusters, the minimum why I am saying   this is the closest, they are very close to each  other. Had the distance been more they would have   been far away from each other. So identify  the two most observations, who are already   in the same cluster and combine them. So what you can do is two similar clusters,   2 observation for example as I explained, this  is 1 cluster, so identify two values right and   through certain ways, there are certain ways,  I will explain that also, similar linkage,   average linkage, centroids method, there are  different ways how should I use right. So   identify the two most similar observations, which  are already in the same cluster and combine them   so that is what the objective. So we want to form the clusters,   so once we have identified this, so apply  this rule to generate the number of clusters,   starting with each observation as it is own  clusters right and then combining two clusters   at a time until all observations are in the single  clusters, that means what you are trying to add   up the closest once, now this is one 1 and 2 are  close, so next is say 5, so 5 is closest to them,   so you add up now 5 then let say 8. So you  add up 8 then let us say 6 you add up 6 so it   goes on adding the nearest variable in terms of  the distance okay this method of this process is   termed as hierarchical procedure why obviously  it will say hierarchical procedure because your   maintaining a hierarchy right you are following  a hierarchy so when you do this ultimately all   the different respondent will be clip together  to form a single cluster right single cluster.   But the question is if we have a single cluster  then the whole meaning is lost right suppose I   want to know to which state suppose the state  of let say or this country our country India   suppose you want to do it in India then if I say  whole market is your market then it becomes very   difficult for me to make an interpretation out  of it so in those cases you have to understand   well what should I do this is not sufficient  for me how do I break it into several clusters   first so how many clusters should India be broken  into for the marker so that it can easily scatter   to those clusters so we will see that right the  process is termed hierarchical procedure   Because it moves in a step wise fashion to  form an entire range of cluster solutions   right it is also agglomerative method because  clusters are formed by combining the existing   clusters so what it is saying so you are  trying to form the clusters because let   us say cluster one as I have drawn here. Let us say there was a cluster one there was   cluster two so cluster three so if I add up  1+2+3+4 then it becomes the whole place the   big country India now the question is how  do I add up should I add up like anybody or   everybody or should I have a mechanism now what  is the mechanism the mechanism is find out the   two clusters which are very similar or close to  each other. And once you can find out those two   clusters which are close to each other and you  start on adding and ultimately you will land up   to the whole market that is India right so this is  why it is called a agglomerative method okay now   let us look at this so the agglomerative process  is the process in which what we have done is if   you look at it this is the minimum distance. That you have to calculated by now we have done   this through the distance now after doing  this what were the distances now distance   we have arranged it so 1.414 to 2.236 now the  pairs of respondent which were related to it   are the first one is E and F the second  one is EFG I hope you can recall that.   Let go back and look at it so the lowest value is  at E and F okay so E and F is 1.414 is the lowest   so the next one is there are three next lowest 2  right in this line and then in this line again you   have a 2 and again you have 2 so now you can take  the closest one for example in this case what we   have done E and F for the first pair then second  pair we took E and G right because U was already   there so the closest to E we found out and then  C and D and finally we took we started doing each   right if you can see the cluster membership  here abcdefg so there are seven clusters.   So we have not done any grouping right now  after this what we did was we club E and F   together here to form a single cluster so they  become 6 clusters then right then we did abcdef   and g we added EFG together so 5 clusters then  we added ABCD together then EFG right so what we   have done is basically we have shorten we have  reduced the number of clusters to a few so that   we can finally land up into one right so by doing  this where I will explain you may be in the next   session how do we calculate this part also right  this oval similarity measure the average within   the cluster distance this also I will explain.  So we are identifying the number of cluster and   through this we can say finally by through this  value this data how many clusters in this case   we should have right so well this is what just  the introduction of the cluster analysis so we   will in the next session we will get into  more details rights how one should form the   clusters and how then one should interpret the  clusters okay thank you for this session.
Info
Channel: Marketing research and analysis
Views: 28,855
Rating: 4.8692808 out of 5
Keywords: Cluster, Analysis-I
Id: icRnW0o5haI
Channel Id: undefined
Length: 34min 22sec (2062 seconds)
Published: Sun Aug 27 2017
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.