Lecture 21- Hypothesis Testing: ANOVA & MANOVA

Captions
Welcome, everyone, to this session of Marketing Research and Analysis. Today we will discuss one of the most popular methods of hypothesis testing. It is used in all kinds of research, whether experimental, non-experimental, or quasi-experimental, such as surveys, and it is applied most heavily in experimental designs. So what is this method? Let us see.
If you remember, in the last session we began hypothesis testing with the test of means and proportions. There we calculated a Z score, which is the sample mean (x bar) minus the population mean, divided by the standard error, and the same approach holds for a proportion.
But these tests apply only when there are two levels, that is, two sample groups, group one and group two. We could compare them through an independent-samples t-test or, if there was only one sample measured two times, through a dependent-samples (paired-samples) t-test. The question that arises is: what happens when we have more than two groups, or more than two levels?
In such a condition the researcher could run multiple t-tests. But if you remember, I explained the logic for avoiding that. There is something called the Bonferroni inequality, which says that if you conduct multiple tests, the level of alpha, which we generally take as 0.01 or 0.05, gets inflated.
Suppose alpha is 0.05, that is 5%, and you run four tests: the chance of at least one Type I error can climb to around 20%. That is too high a possibility of a Type I error. To avoid this situation, Fisher, who developed the technique, proposed that instead of running multiple tests we study the variance.
To do this he developed the F-test, which we calculate through the F-ratio that I introduced in the last session. The F-ratio is the mean sum of squares between the groups divided by the mean sum of squares within the groups. So if there are k groups, you need the total variance, the variance between the groups, and the variance within the groups. Let us say the groups are different teams.
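The inflation of alpha described above can be sketched in a few lines. This is a minimal illustration, not part of the lecture: the function names are hypothetical, and the "exact" figure assumes the tests are independent, which pairwise t-tests on the same groups are not strictly.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_bound(alpha: float, n_tests: int) -> float:
    """Bonferroni upper bound on the familywise error rate, capped at 1."""
    return min(1.0, alpha * n_tests)


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test alpha that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    alpha, n_tests = 0.05, 4  # the lecture's example: 5% level, four tests
    print(f"independent tests:      {familywise_error(alpha, n_tests):.3f}")
    print(f"Bonferroni upper bound: {bonferroni_bound(alpha, n_tests):.3f}")
    print(f"corrected per-test alpha: {bonferroni_alpha(alpha, n_tests):.4f}")
```

The "around 20%" quoted in the lecture corresponds to the Bonferroni upper bound (4 x 0.05); under independence the rate is slightly lower.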
Across the teams, what is the variance, and within each team, what is the variance? Suppose there are 11 players in a cricket team: how much do they vary within the team? Once you find these variances you can calculate the F-ratio, and by comparing the calculated F-ratio with the corresponding table value of F you decide whether or not to reject the hypothesis. But what is the hypothesis? Let us go slowly and start with the definition.
Analysis of variance involves investigating the effects of one treatment variable on an interval-scaled dependent variable. This is why I said it is used in any kind of experimental study: there is a treatment variable. In agriculture, for example, if a farm is testing fertilizers on yield, the type of fertilizer is the treatment. Now, what the definition says about the dependent and the independent variable is important.
As I have said before, the dependent variable in analysis of variance is measured on a continuous scale, which may be interval or ratio. The independent variables, on the other hand, are non-metric: they are categorical in nature.
So the dependent variable is continuous and the independent variable is non-continuous, that is categorical, measured on something like a nominal scale. The purpose is to test the differences in means for statistical significance. Now what is the hypothesis? Suppose there are four groups, or in general k groups. The null hypothesis says there is no difference between the means, that is, the means of all the groups are equal.
So $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \dots = \mu_k$. The alternative says that at least one mean is different. Which one it is we do not know at the moment, but if at least one is different, I cannot claim that there is no difference between the means. ANOVA is used when we have one or more independent variables and only one dependent variable; the case we are talking about right now is one-way ANOVA.
You can have multiple independent variables. An independent variable is also called a factor, so with one factor it is one-way ANOVA, with two factors it is two-way ANOVA, and with n factors it is n-way ANOVA. Now the assumptions.
Random sampling: subjects are randomly sampled for the purpose of significance testing.
Interval data: the dependent variable is measured at the interval level, as we said earlier. The next assumption is interesting. If you remember, I explained homoscedasticity and heteroscedasticity. Homoscedasticity means the data are spread evenly around the regression line, so the variance of the data around the line is about the same throughout. The opposite is heteroscedasticity, where the scatter is uneven, which is an unwanted situation. For ANOVA this means the dependent variable should have the same variance in each category of the independent variable. If you go to any software, it reports the test under two conditions: variances equal and variances not equal.
Generally we take the case where the variances are equal, because only then do we assume that the groups, which are the levels of the factor, can actually be compared.
Here is an example, and I will solve the problem as well. A call center manager wants to know if there is a significant difference in the average handle times among three different call operators. The independent variable is the call operator, with three levels: operator 1, operator 2, and operator 3.
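As a rough illustration of the equal-variance assumption, the sample variance of the dependent variable can be computed for each level of the factor and the largest compared with the smallest. This is only an informal screen, not the formal test a software package reports, and the handle times below are made-up values, not the lecture's data.

```python
from statistics import variance


def group_variances(groups):
    """Sample variance of the dependent variable within each group."""
    return {name: variance(values) for name, values in groups.items()}


def variance_ratio(groups):
    """Largest group variance divided by the smallest (1.0 means identical spread)."""
    values = list(group_variances(groups).values())
    return max(values) / min(values)


if __name__ == "__main__":
    # Hypothetical handle times in seconds for three operators.
    handle_times = {
        "operator_1": [76.5, 76.0, 74.0, 75.5, 73.5],
        "operator_2": [74.0, 75.0, 73.5, 75.5, 74.5],
        "operator_3": [75.0, 74.0, 76.0, 73.0, 75.5],
    }
    for name, var in group_variances(handle_times).items():
        print(f"{name}: variance = {var:.2f}")
    print(f"max/min variance ratio = {variance_ratio(handle_times):.2f}")
```

A ratio close to 1 is consistent with the assumption; how large a ratio is tolerable is a judgment call that this sketch does not settle.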
The dependent variable is the average handle time, that is, how much time the operators take to handle customers. What will the data look like? Suppose the recorded times are 40 seconds, 20 seconds, 25, 30, 35, 42 seconds, whatever time each call actually took. Whether you record seconds or minutes is up to you; that is a different story.
The null hypothesis is $\mu_1 = \mu_2 = \mu_3$, because there are three operators: the average time taken by the first operator is equal to the average time taken by the second operator, which is equal to the average time taken by the third operator. What is my alternative? I sometimes give this example: a researcher is a fault finder by habit, and the alternative hypothesis is how he goes about it. He asks how there can be no difference; there has to be some difference. He is like Sherlock Holmes, a detective trying to find something out.
So he says at least one mean is different. Now let us see this example, where the time is given in seconds.
Operator 1's data, operator 2's data, and operator 3's data are given to you. You should also understand how the data will look in an actual data file. The operator column holds the code 1 ten times and then the code 2 ten times.
Then the code 3 also appears ten times, with the time values entered correspondingly. In any software package this is how the file will look: one column for the operator and one for the time. How you enter your data in the software file is very important, which is why I am showing you. Now, there are three groups and 10 participants in each group. This is a case of equal group sizes; it is also possible to have unequal group sizes.
So the mean of the first operator is 75.1, the mean of the second operator is 74.5, and the mean of the third operator is 74.7. And in case it is not visible, let me draw it again: X double bar is called the grand mean.
The grand mean is the overall mean. Either you add up all 30 observations and divide by 30, or you simply compute (75.1 x 10 + 74.5 x 10 + 74.7 x 10) / 30. Either way, the grand mean comes to 74.8.
Now, the F-test is used to determine whether there is more variability in the scores of one sample than in the scores of another, for example more variance in the scores of one operator than another. The F-ratio that I have written here is the variance between the groups divided by the variance within the groups. The mean sum of squares is nothing but a variance, calculated once between the groups and once within the groups.
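The grand-mean calculation above is a size-weighted average of the group means. A minimal sketch, using the three operator means and group sizes quoted in the lecture; the function name is an assumption for illustration only.

```python
def grand_mean(group_means, group_sizes):
    """Overall mean recovered from group means, weighting each by its group size."""
    weighted_total = sum(m * n for m, n in zip(group_means, group_sizes))
    return weighted_total / sum(group_sizes)


if __name__ == "__main__":
    means = [75.1, 74.5, 74.7]   # operator means from the lecture
    sizes = [10, 10, 10]         # ten observations per operator
    print(f"grand mean = {grand_mean(means, sizes):.1f}")

    # With unequal group sizes the weights matter: a plain average of the
    # group means would give 15.0 here, while the true overall mean is 17.5.
    print(grand_mean([10.0, 20.0], [1, 3]))
```

With equal group sizes the weighting cancels out, which is why the lecture's shortcut works for the 10-10-10 case.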
So it is written here: mean square between over mean square within. What is the within-group variance? As shown here, it is the variance of the observations in each group, weighted for the group size. This is important. Many times you will get groups of equal size, 10, 10, 10 as in this case, but the sizes might not be equal. If they are not, you have to take the group size into account as a weight; if you do not, your analysis will be wrong.
Now the between-group variance. This is the variance of the set of group means from the overall mean of all observations: how much do the group means vary from the grand mean? Let me show you how it looks.
The simplest way is that you do not have to remember anything. You have the three group means and the grand mean. The between-groups sum of squares takes each group mean minus the grand mean, squares it, multiplies it by the group size, and adds the terms up:
$SS_{between} = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + n_3(\bar{X}_3 - \bar{\bar{X}})^2$
Remember, please, that every term is squared: this is a variance, not a standard deviation, and the variance is the square of the standard deviation.
Similarly, the within-groups sum of squares takes each observation minus the mean of its own group, squared. Write each observation with its row first and its column second. The first column (operator 1) is X11, X21, X31, and so on down to the tenth row; the second column is X12, X22, X32, X42, X52, X62, and so on; the third column runs from X13 and X23 down to the tenth row. For the first group you add up the squared deviations of its ten values from the first group mean, for the second group you do the same with the second group mean, and so on. You have to find the within-group sum of squares for all three groups and add them:
$SS_{within} = \sum_{j=1}^{3} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$
Now, this is very simple. There are three quantities, and the third is the total. What is the total? For the total I take each value and subtract the grand mean from it, for example 76.5 minus 74.8; I think the grand mean was 74.8.
Yes, 74.8. So you take (76.5 - 74.8) squared, (76 - 74.8) squared, and so on through the first group, then start on the next one: every value in every group is deducted individually from the grand mean and squared.
So I have calculated SS total, SS within, and SS between. If I have the total and the within, I need not calculate the between separately, and if I have the total and the between, I need not calculate the within, because the two add up to the total: sum of squares total = sum of squares within + sum of squares between.
So once you have any two of them, the third can be found by simple addition or subtraction. Suppose the within is calculated as 22.5 and the between as 1.9. What is the total?
You just add them up: 22.5 + 1.9 = 24.4. So the sum of squares total is 24.4, of which 22.5 is within the groups and 1.9 is between the groups. Now, what is the mean sum of squares? It is a sum of squares divided by its degrees of freedom.
The degrees of freedom are the number of elements minus 1. For between the groups, we have three groups, so 3 - 1 = 2. For within the groups, you deduct 1 for each column of 10: (10 - 1) + (10 - 1) + (10 - 1) = 27, or in simple terms n - k.
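The three sums of squares and their degrees of freedom can be sketched as follows. The lecture's raw data are not reproduced in the transcript, so the handle times below are hypothetical and the function name is an assumption; the point is only to show the identity SS total = SS within + SS between.

```python
def mean(values):
    return sum(values) / len(values)


def anova_sums_of_squares(groups):
    """Between, within and total sums of squares for a one-way layout.

    `groups` is a list of lists, one inner list per level of the factor.
    Group sizes may be unequal; each group mean is weighted by its size.
    """
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_total = sum((x - grand) ** 2 for x in all_values)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "df_between": len(groups) - 1,               # k - 1
        "df_within": len(all_values) - len(groups),  # n - k
    }


if __name__ == "__main__":
    # Hypothetical handle times (seconds) for three operators.
    groups = [
        [76.5, 76.0, 74.0, 75.5],
        [74.0, 75.0, 73.5, 75.5],
        [75.0, 74.0, 76.0, 73.0],
    ]
    result = anova_sums_of_squares(groups)
    for key, value in result.items():
        print(f"{key}: {value:.3f}")
    # The decomposition holds up to floating-point rounding.
    gap = result["ss_total"] - (result["ss_between"] + result["ss_within"])
    print(f"total - (between + within) = {gap:.1e}")
```

Computing all three directly, rather than deriving one by subtraction, gives a cheap check on the arithmetic.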
Now the F. The mean square between is 1.9/2, which is about 1, and the mean square within is 22.5/27, which is about 0.8, so F comes to about 1.1. Let me show you how to check the table value of F at the 0.05 level for 2 and 27 degrees of freedom: 2 is the between-groups degrees of freedom and 27 is the within-groups degrees of freedom.
So go to 2 and 27 in the table: the value is 3.3541. An F of 3.35 would be required to reject the null hypothesis, but what have we got? 1.1. Can we reject the null hypothesis in this case?
The critical value is 3.35 and ours is 1.1. The critical value lies out here and your value lies here, well within the boundary, so you cannot reject the null hypothesis in this case. In some other case the value could cross the boundary.
If you look at the ANOVA table in Excel, SPSS, or similar, it looks something like this: the sum of squares between the groups is 1.9 and within the groups 22.5, with the total of 24.4 that I added up earlier; the degrees of freedom are 2 and 27, total 29; and the mean squares are about 1 and 0.8.
So we cannot reject the null hypothesis, and we therefore conclude that there is no statistically significant difference between the average handle times of operators 1, 2, and 3. But suppose there had been a difference. Then you would have said that at least one mean differs, whether the first from the second, the second from the third, or whichever it is.
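The F-ratio and the table lookup can be reproduced from the lecture's figures. The sketch below relies on a closed form that holds only when the numerator has exactly 2 degrees of freedom, as it does with three groups; for other cases a statistics library or an F table would be needed. The function names are assumptions for illustration.

```python
def f_upper_tail_df1_2(f_value: float, df_within: int) -> float:
    """P(F > f_value) for an F distribution with 2 and df_within degrees of freedom.

    Valid only for numerator df = 2, where the tail has a closed form.
    """
    return (1.0 + 2.0 * f_value / df_within) ** (-df_within / 2.0)


def f_critical_df1_2(alpha: float, df_within: int) -> float:
    """Critical F at level alpha for 2 and df_within degrees of freedom."""
    return (df_within / 2.0) * (alpha ** (-2.0 / df_within) - 1.0)


if __name__ == "__main__":
    ss_between, ss_within = 1.9, 22.5   # sums of squares from the lecture
    df_between, df_within = 2, 27       # k - 1 and n - k

    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    f_ratio = ms_between / ms_within

    critical = f_critical_df1_2(0.05, df_within)
    p_value = f_upper_tail_df1_2(f_ratio, df_within)

    print(f"MS between = {ms_between:.2f}, MS within = {ms_within:.2f}")
    print(f"F = {f_ratio:.2f}, critical F(0.05; 2, 27) = {critical:.4f}")
    print(f"p-value = {p_value:.2f}")
    print("reject H0" if f_ratio > critical else "cannot reject H0")
```

The computed critical value agrees with the 3.3541 read from the table, and the F-ratio of about 1.1 falls well short of it.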
To find out which ones differ, we use something called a post hoc test. We are not doing it manually here, but please remember it: software packages like SPSS offer post hoc tests. A post hoc test uses the group means and compares them with one another, so it can tell you which group has the highest value, which has the lowest, and which of the differences are significant.
One more important thing we measure in ANOVA is the interaction effect. I was going to come to it later, but let us explain it now.
Many times there are two or more factors in a study. In such a condition there are two kinds of effects: the main effect and the interaction effect, and the interaction is important to study. Suppose two things each have an effect of their own on the dependent variable.
But what happens when these two things come together? In English we speak of symbiosis or synergy, although sometimes the combined relationship becomes weaker instead. When two things come together they give a third kind of effect, one that arises only from the presence of both, like two different materials combined into a compound in a chemistry lab.
Take an example. Somebody is enjoying a party. When his friends are there he enjoys it a lot, and when he goes with his family he also enjoys it. But what happens when his family and his friends come together at the same party? Will the effect be similar? In such a condition the interaction comes into play, and an interaction effect can have a major bearing on any study.
So a researcher doing any kind of study, with an experimental design or otherwise, needs to test for the interaction effect and report it as a result in the research outcomes, the research paper, or the thesis. In real life the interaction sometimes has a larger effect than the main effects.
Now we come to multivariate analysis of variance. Earlier we were talking about one dependent variable and multiple independent variables: two, three, four, whatever. What if I have more than one dependent variable? This analysis involves investigating the main and interaction effects of categorical independent variables on multiple interval-scaled dependent variables. How would you make the study when there are multiple dependent interval variables? This is the case we call MANOVA.
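The party illustration of main and interaction effects can be made concrete with a 2 x 2 table of cell means. The enjoyment scores below are invented for illustration, and the function is a sketch of the usual difference-of-differences contrast, not a significance test.

```python
def main_and_interaction(cell_means):
    """Main effects and interaction contrast for a 2 x 2 table of cell means.

    cell_means[i][j] is the mean of the dependent variable at level i of
    factor A (rows) and level j of factor B (columns).
    """
    (a1b1, a1b2), (a2b1, a2b2) = cell_means
    main_a = (a2b1 + a2b2) / 2 - (a1b1 + a1b2) / 2   # average change across A
    main_b = (a1b2 + a2b2) / 2 - (a1b1 + a2b1) / 2   # average change across B
    interaction = (a2b2 - a2b1) - (a1b2 - a1b1)      # difference of differences
    return main_a, main_b, interaction


if __name__ == "__main__":
    # Hypothetical enjoyment scores. Rows: friends absent / present.
    # Columns: family absent / present.
    party = [[2, 6],
             [7, 5]]
    print(main_and_interaction(party))

    # Purely additive case: family adds the same amount with or without friends,
    # so the interaction contrast is zero.
    additive = [[2, 6],
                [7, 11]]
    print(main_and_interaction(additive))
```

In the first table both factors raise enjoyment on average, yet the combination is worse than friends alone, which is exactly the situation where reporting only main effects would mislead.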
Many researchers do not run this test because they are not aware of it, but it is not difficult, at least if you are using a software package. In SPSS, for example, you go to General Linear Model and run a MANOVA, which tells you how the independent variables affect two or more dependent variables taken together at the same time.
The purpose is to determine whether individual categorical independent variables have an effect on a group, or related set, of interval dependent variables. Take an example. We want to make a study in which we use two different textbooks. The textbook is the independent variable, because a change in textbook will produce a change in the dependent variables.
We are interested in the outcome, which is the students' improvement in math and physics scores. So the math score and the physics score are my two dependent variables. A score is obviously a continuous variable; we measure it as 50, 60, 65, 70, whatever the scores are. The hypothesis is that both scores together are affected by the difference in textbooks.
In other words, we are saying that the two textbooks have an impact on the dependent variables, the math and physics scores, considered jointly. Now, what are the assumptions?
The assumptions are these. First, the independent variables are categorical. Second, the multiple dependent variables are continuous and interval-scaled. Third, there is a relationship between the dependent variables. You cannot just put in any dependent variable that you like; there has to be a theoretical justification for using it as a dependent variable and for using MANOVA. If you believe there is a relationship between the two dependent variables, DV1 and DV2, then MANOVA fits the situation. Fourth, the number of observations for each combination of the factors is the same, that is, it is a balanced experiment.
Now let me show you the same example as before. The call center manager wants to know if the operator or the method of answering calls makes a difference to average handle time, wait time, and customer satisfaction. Earlier we were talking about only the average handle time. Now we have brought in two more things, the wait time and the customer satisfaction, so there are three dependent variables where earlier we had only one. Are they related? Yes, they have a relationship: the average handle time, how long the wait is, and finally how satisfied the customer is. Those are the dependent variables, and the independent variables are now two things: the call operator and the method of answering.
The first is the call operator. Just as when we give promotions we look at how effectively a person works and how well he performs his job, here we ask how the call operator is performing. The second is the method of answering, because the call operator might not be the only factor that affects dependent variables such as satisfaction.
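The balanced-experiment assumption listed above, an equal number of observations in every combination of factor levels, is easy to check before running a MANOVA. In this sketch the record layout, the field names (`operator`, `method`, `aht`, `wt`, `cs`) and all values are hypothetical.

```python
from collections import Counter
from itertools import product


def cell_counts(records, factors):
    """Number of observations in every combination of factor levels (0 if empty)."""
    observed = Counter(tuple(r[f] for f in factors) for r in records)
    levels = [sorted({r[f] for r in records}) for f in factors]
    return {cell: observed.get(cell, 0) for cell in product(*levels)}


def is_balanced(records, factors):
    """True when every factor combination has the same non-zero count."""
    counts = set(cell_counts(records, factors).values())
    return len(counts) == 1 and 0 not in counts


if __name__ == "__main__":
    # One row per call: two factors and three dependent variables.
    rows = [
        (1, 1, 75.0, 30.0, 4), (1, 1, 76.5, 28.0, 5),
        (1, 2, 74.0, 25.0, 4), (1, 2, 73.5, 27.0, 3),
        (2, 1, 75.5, 32.0, 3), (2, 1, 74.5, 29.0, 4),
        (2, 2, 73.0, 24.0, 5), (2, 2, 74.0, 26.0, 4),
    ]
    records = [
        {"operator": op, "method": m, "aht": aht, "wt": wt, "cs": cs}
        for op, m, aht, wt, cs in rows
    ]
    print(cell_counts(records, ["operator", "method"]))
    print(is_balanced(records, ["operator", "method"]))
```

The same long layout, one row per observation with a column per factor and per dependent variable, is what most statistical packages expect as input.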
It could also be the method of answering: is he answering through a device on which the sound is not clear, or through some other device or method that is clearer to the customer?
My first null hypothesis is that the average handle time (AHT), wait time (WT), and customer satisfaction (CS) are the same for operators 1 and 2. The second null hypothesis is that the average AHT, WT, and CS are the same whether method 1 or method 2 is used; there are two methods. Similarly, the alternatives are that they are not the same for operators 1 and 2, and not the same for methods 1 and 2.
This is how the data look: handle time, waiting time, customer satisfaction, operator (1 or 2), and method of answering (1 or 2). Here a single ANOVA will not fit. You might be asking yourself why I could not do separate ANOVAs, taking one dependent variable at a time. But see how many combinations arise: handle time with operator and with method of answering, then waiting time with the same two, and so on.
We would be doing the same thing we did with multiple t-tests. The number of combinations increases, and the more times you repeat the test individually, the more the errors go on increasing. So MANOVA becomes a very good technique. What I am showing you now is the output from a software package; I cannot do this manually.
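To see why separate ANOVAs multiply the error, one can count the dependent-variable and factor combinations in this example and compute the chance of at least one false rejection. This is a rough sketch: the variable names are illustrative and the calculation assumes the tests are independent, which related dependent variables are not.

```python
from itertools import product


def separate_tests(dependent_vars, factors):
    """Every (dependent variable, factor) pair that would need its own test."""
    return list(product(dependent_vars, factors))


def familywise_error(alpha, n_tests):
    """Chance of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    dependent_vars = ["handle_time", "wait_time", "satisfaction"]
    factors = ["operator", "method"]

    tests = separate_tests(dependent_vars, factors)
    for dv, factor in tests:
        print(f"{dv} by {factor}")
    print(f"{len(tests)} separate tests")
    print(f"familywise error at alpha = 0.05: {familywise_error(0.05, len(tests)):.3f}")
```

Three dependent variables and two factors already give six tests before any interaction is considered, which is the repetition the lecture warns against.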
If you look at this table, these are variables 4 and 5. Now look at the significance values, which are important for you to know. The significance value is what helps you reject or fail to reject a hypothesis. Here, 0.003 for variable 4 suggests that the null hypothesis is to be rejected, because it is less than 0.05 when we have taken 95% confidence, and similarly for variable 5.
But look at the third row. Here we have taken the interaction between the independent variables 4 and 5, that is, the interaction effect of operator and method of answering. For the interaction, the result is no longer significant. So what we are saying is that there is a main effect.
But there is no interaction effect in this case, and that is fine. What if the interaction effect had been significant? Then you would have said that the two things, only when combined together, do a better job, or sometimes an inferior job. So this is all for analysis of variance and multivariate analysis of variance. Thank you.
Info
Channel: Marketing research and analysis
Views: 26,685
Rating: 4.6654544 out of 5
Keywords: Hypothesis, Testing:, Anova, Manova
Id: UQBeh63Q-SM
Length: 35min 51sec (2151 seconds)
Published: Sun Aug 20 2017