Welcome everyone to this session of marketing research analysis. Today we will discuss one of the ways of hypothesis testing, a very popular method that is used widely in all kinds of research, be it experimental, non-experimental, or quasi-experimental such as surveys. It is applied in all of these cases, but its application is seen most of all in experimental designs.
So what is this way of testing? Let us see and discuss it. In the last session, if you remember, we discussed the beginning of hypothesis testing: we talked about tests of means and of proportions. There we calculated a Z score, which is the sample mean minus the population mean divided by the standard error, and we said we would compare it against the table value and draw our conclusion; the same logic also holds for a proportion.
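In symbols, the statistic recalled from the last session is (assuming the standard error takes its usual form of the population standard deviation over the square root of the sample size, since it was only described verbally):

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$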
This, however, was possible only when there were two levels, that is, two sample groups: group one and group two, which we could compare through an independent samples t-test, or, if there was only one sample measured twice, through a dependent or paired samples t-test. But the question arises: what happens when we have more than two groups, or more than two levels?
In such a condition the researcher could, in principle, run multiple t-tests, but if you remember I explained the logic behind not doing that and why one should avoid it. In fact, the Bonferroni inequality tells us that if you conduct multiple tests, the alpha which we generally take as 0.01 or 0.05 gets inflated. If alpha is 0.05, that is 5%, and you run the test 4 times, the overall error risk climbs to roughly 20%; that is far too large a chance of a Type I error.
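To see that inflation concretely, here is a minimal sketch in plain Python (assuming, as an approximation, that the pairwise tests are independent):

```python
# Familywise Type I error when several t-tests are each run at alpha = 0.05,
# assuming (as an approximation) that the tests are independent.
alpha = 0.05

for k in range(2, 6):                        # number of groups
    m = k * (k - 1) // 2                     # number of pairwise t-tests
    familywise = 1 - (1 - alpha) ** m        # P(at least one false rejection)
    print(f"{k} groups: {m} tests, familywise error ~ {familywise:.2f}")
```

For four groups that means six pairwise tests and an overall error of about 26%, in the same spirit as the rough 4 x 5% = 20% figure quoted above.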
To avoid this problem of inflation, Fisher, the one who developed this technique, came up with an alternative to running multiple tests: he said that if we study the variance, we can do it better.
To do this he developed the F-test, in which we calculate the F-ratio that I introduced at the end of the last session. The F-ratio is nothing but the mean sum of squares between the groups divided by the mean sum of squares within the groups. He said that if there are n groups, you need to calculate the variance for the entire data, that is the total variance, the variance between the groups, and the variance within the groups. Let us say the groups are different teams.
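Written out, the ratio being described is

$$ F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)} $$

where $k$ is the number of groups and $N$ is the total number of observations.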
So across the teams, what is the variance, and within each team, what is the variance? Suppose there are 11 players in a cricket team; what is the variance within that team? Once you find these variances you can calculate the F-ratio, and by comparing the calculated F-ratio with the corresponding table value of F we can say whether we reject our null hypothesis or not. But what is the hypothesis? Let us go slowly and see what the definition says.
It says that analysis of variance involves investigating the effects of one treatment variable, which is why I said it is used in basically any kind of experimental study, on an interval-scaled dependent variable. The treatment variable, for example in an agricultural study of yield, could be the type of fertilizer you are applying. Now that is important: there is the dependent variable and there is the independent variable.
The dependent variable, as I believe I also said earlier, in the case of analysis of variance is measured on a continuous scale, that is, an interval or ratio scale. On the other hand, the independent variables are non-parametric in nature: they are categorical. So the dependent variable is continuous, and the independent variable is non-continuous, categorical, typically in the form of a nominal scale.
So let us go and see: the purpose is to test the differences in means for statistical significance. Now, what is the hypothesis? Suppose there are four groups, or in general k groups; the null hypothesis says there is no difference between the means, that is, the means of all the groups are equal: mu 1 equals mu 2 equals mu 3 equals mu 4, and so on up to mu k. What is my alternative? The alternative says that at least one mean is different; which one it is we do not yet know, but at least one is different, which means I cannot claim that my null hypothesis of no difference between the means is correct.
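In symbols, the pair of hypotheses just described is

$$ H_0: \mu_1 = \mu_2 = \cdots = \mu_k \qquad H_1: \mu_i \neq \mu_j \ \text{for at least one pair } (i, j) $$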
ANOVA is used when we have one or more independent variables and only one dependent variable; the case we are talking about right now is a one-way ANOVA.
So we have one or more independent variables and one dependent variable. You can have multiple independent variables: if there is one independent variable, which is also called a factor, it is a one-way ANOVA; with two factors it is a two-way ANOVA; with n factors, an n-way ANOVA. Now, the assumptions.
First, random sampling: subjects are randomly sampled for the purpose of significance testing. Second, the data for the dependent variable are at the interval level, which we also said above. The third is interesting; if you remember, I had explained that there is something called homoscedasticity and heteroscedasticity. Homoscedasticity means that when the data are plotted around the regression line they lie close to it, so the spread of the data about the line is minimal; the opposite is heteroscedasticity, where the data are highly scattered, which is an unwanted and undesired situation. In ANOVA, the dependent variable should have the same variance in each category of the independent variable. If you go to any software, it reports the test under two conditions, equal variances assumed and equal variances not assumed, but generally we take the case where the variances are equal, because only then can we assume that the groups, which are basically the levels of the factor, can actually be compared.
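One hedged way to check this equal-variance assumption in software is Levene's test; the sketch below uses scipy, and the three handle-time lists are illustrative placeholders rather than the session's actual data:

```python
# Checking the equal-variance (homoscedasticity) assumption with
# Levene's test.  The three lists are placeholder handle times, one
# per call operator; they are not the data used in the session.
from scipy import stats

operator1 = [76.5, 75.0, 74.2, 75.8, 74.9, 75.3, 74.7, 75.6, 75.1, 74.9]
operator2 = [74.1, 74.8, 74.3, 74.9, 74.2, 74.6, 74.5, 74.8, 74.4, 74.4]
operator3 = [74.9, 74.5, 74.8, 74.6, 74.7, 74.9, 74.4, 74.8, 74.7, 74.7]

stat, p = stats.levene(operator1, operator2, operator3)
print(f"Levene statistic = {stat:.3f}, p = {p:.3f}")
# A p-value above 0.05 gives no evidence against equal variances,
# so comparing the groups with a one-way ANOVA is reasonable.
```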
Here is an example, and I will also solve a problem; let us see. A call-centre manager wants to know if there is a significant difference in the average handle time among three different call operators. So the independent variable is the call operator, with three levels: operator 1, operator 2 and operator 3.
The dependent variable is the average handle time, that is, how much time they take to handle the clients or customers. Now, what will the data look like? Something like this: let us say one call took 40 seconds, another 20 seconds, another 25 or 30 seconds, another 35 seconds, another 42 seconds, whatever time they actually took. Whether it is in seconds or minutes is up to your unit; that is a different story.
So the null hypothesis is mu 1 equals mu 2 equals mu 3, because there are three operators: the average time taken by the first operator equals the average time taken by the second operator equals the average time taken by the third operator. What is my alternative? As I have sometimes said by way of example, a researcher is a fault finder; it is generally his habit to work through an alternative hypothesis, asking how there can be no difference, surely there has to be some difference; he is like Sherlock Holmes, like a detective, trying to find it out.
So he is saying that at least one is different. Now let us see this example; the time is given in seconds.
Operator 1's data is given to you, operator 2's data is given to you, and operator 3's data is given to you. Now, suppose you go to an actual folder or file: how will the data look? You should understand this too. There will be a column of group codes: 1 repeated ten times for the first operator's ten observations, then 2 repeated ten times, then 3 repeated ten times, with the corresponding time values alongside. So in any software package this is how it will look, one column for the operator and one for the time; how you lay out your data in the software file is very important, which is why I am showing it. Now, first, let us take the means: there are three groups, and there are 10 participants in each group, so this is the case of equal group sizes; there could also be cases where the group sizes are not equal.
So X-bar 1 = 75.1, X-bar 2 = 74.5 and X-bar 3 = 74.7: the mean of the first operator, the mean of the second operator and the mean of the third operator. Then there is something which, in case it is not visible, I am drawing again: X double bar, called the grand mean. The grand mean is the overall mean, so either you can add up all 30 observations and divide by 30, or you can simply take (75.1 x 10 + 74.5 x 10 + 74.7 x 10) / 30. Either way you can find it, and this grand mean comes to about 74.8.
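As a small sketch of that arithmetic, using only the group means and group sizes stated here:

```python
# Grand mean as the group-size-weighted average of the group means.
import numpy as np

means = np.array([75.1, 74.5, 74.7])   # operator 1, 2, 3 mean handle times
ns    = np.array([10, 10, 10])         # calls observed per operator

grand_mean = (means * ns).sum() / ns.sum()
print(round(grand_mean, 2))            # ~74.77, i.e. about 74.8
```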
Now, the F-test is used to determine whether there is more variability in the scores of one sample than in the scores of another sample: is there more variance in the scores of one operator than in another? So let us see how F is used. The F-ratio, which I have written here, is nothing but the variance between the groups over the variance within the groups; the mean sum of squares is nothing but the variance you are calculating, between the groups in the numerator and within the groups in the denominator.
So, as written here, it is the mean sum of squares between over the mean sum of squares within. What is the within-group part? As shown here, it is the variance of the observations in each group, weighted for the group size. Now, this is important: many times you will get groups of equal size, 10, 10 and 10 as in this case, but sometimes the sizes are not equal, and then you have to take the group size into account; each group's contribution has to be weighted for its size, otherwise you will make a wrong analysis. So whatever the group sizes are, that has to be taken care of.
Now the between-group part: between this group and that one, between this one and another, and maybe that remaining pair as well, three possibilities. It is the variance of the set of group means about the overall mean of all observations; in other words, how much do the group means vary around the overall mean? Let me show you here how it will look.
How does it look? I have three things: the group means X-bar 1, X-bar 2 and X-bar 3, and the grand mean X double bar. The simplest way is this, you do not have to memorise anything. The between-groups sum of squares is nothing but each group mean's squared deviation from the grand mean, weighted by that group's size, added up:

$$ SS_{\text{between}} = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + n_3(\bar{X}_3 - \bar{\bar{X}})^2 $$

Similarly, the within-groups sum of squares takes every individual observation and squares its deviation from its own group mean; remember, please, these are squared deviations, variances, not standard deviations. For the first group the observations are x11, x21, up to x10,1; for the second group x12, x22, up to x10,2; for the third group x13, x23, up to x10,3, where the first subscript is the observation (the row) and the second is the operator (the column). So, group by group,

$$ SS_{\text{within}} = \sum_{i}(x_{i1} - \bar{X}_1)^2 + \sum_{i}(x_{i2} - \bar{X}_2)^2 + \sum_{i}(x_{i3} - \bar{X}_3)^2 $$

Once you have done this, the between part from the group means and the within part from each group individually, you have found the within-group sum of squares for all three groups.
Now, this part is very simple; the third thing you find is the total. What is the total? You take every single value, subtract the grand mean from it, square it, and add up: (76.5 - 74.8) squared, (76 - 74.8) squared, (75.1 - 74.8) squared and so on, every observation in every group deducted individually from the grand mean of 74.8.
So I have calculated SS total, SS within and SS between. If I have the total and the within, I can find the between as well; or if I have the total and the between, I can find the within; because these two automatically sum up to the total. That means the sum of squares total is equal to the sum of squares within plus the sum of squares between.
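Here is a minimal sketch of those three sums of squares and the identity between them; the raw handle times below are placeholders, since the session's data table is not reproduced here:

```python
# Sums of squares for a one-way ANOVA, computed directly from their
# definitions.  The handle times are illustrative placeholders.
import numpy as np

groups = [
    np.array([76.5, 75.3, 74.8, 75.6, 74.9, 75.0, 75.4, 74.7, 75.2, 74.6]),  # operator 1
    np.array([74.1, 74.6, 74.3, 74.8, 74.5, 74.4, 74.7, 74.2, 74.6, 74.8]),  # operator 2
    np.array([74.9, 74.5, 74.8, 74.6, 74.7, 74.9, 74.4, 74.8, 74.6, 74.8]),  # operator 3
]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within  = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total   = ((all_obs - grand_mean) ** 2).sum()

print(round(ss_between, 3), round(ss_within, 3), round(ss_total, 3))
# ss_total equals ss_between + ss_within (up to rounding).
```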
So in case you have the total and you have found one of the other two, you need not calculate the third; you can simply deduct and find it. Suppose, as calculated here, the within is 22.5 and the between is 1.9; then what is the total?
You can just add them up: 22.5 + 1.9 = 24.4, so the sum of squares total is 24.4, out of which 22.5 is within the groups and 1.9 is between the groups. Now let us see what the mean sum of squares is. The mean sum of squares is the corresponding sum of squares divided by its degrees of freedom.
Now, the degrees of freedom: the degrees of freedom equal the number of elements minus 1. For the degrees of freedom between the groups, there are three groups in this case, so it is 3 - 1 = 2. When you are doing the within-group degrees of freedom, there are 10 observations in each column and you deduct 1 for each, so (10 - 1) + (10 - 1) + (10 - 1) = 27, or in simple terms n - k.
So now F can be calculated: the mean sum of squares between is 1.9/2, about 0.95, roughly 1, and the mean sum of squares within is 22.5/27, about 0.83, roughly 0.8, so F is roughly 1/0.8, about 1.1. Now take the F table value at the 0.05 level for 2 and 27 degrees of freedom; let me show you how to check it. Did you understand the 2 and 27? The between has 2 degrees of freedom, and 27 is the within-group degrees of freedom.
So let us go to 2 and 27 in the table; here it is, 3.3541. I think it is visible: 3.3541. So a value of at least 3.35 would be required to reject the null hypothesis. But what have we got? 1.1. If we have got 1.1, can we reject the null hypothesis in this case?
The critical value is 3.35 and ours is 1.1; you have to understand that the critical value sits somewhere out here and your calculated value is well within the boundary, so you cannot reject the null hypothesis in this case. There could, of course, be cases where the calculated value crosses the boundary.
So this is how the ANOVA table sometimes looks if you are using Excel, SPSS or similar: the sum of squares between the groups is this much, within the groups this much, and the total is what I was adding up earlier, 24.4; the degrees of freedom are 2 and 27, so the total degrees of freedom are 29; and the mean sums of squares come to about 0.95 and 0.83, giving F of about 1.1.
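As a hedged sketch of that decision rule, using the figures worked out above (scipy is used only to look up the critical value):

```python
# F-ratio and the 5% critical value for 2 and 27 degrees of freedom.
from scipy import stats

ms_between = 1.9 / 2        # ~0.95
ms_within  = 22.5 / 27      # ~0.83
F = ms_between / ms_within
print(round(F, 2))          # ~1.14, i.e. about 1.1

f_crit = stats.f.ppf(0.95, dfn=2, dfd=27)
print(round(f_crit, 4))     # ~3.3541, the table value quoted above

print("reject H0" if F > f_crit else "cannot reject H0")
```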
So we cannot reject the null hypothesis, and we therefore conclude that there is no statistically significant difference between the average handle times of operators 1, 2 and 3; in this case we cannot say there is one. But suppose there had been a difference: then you would have said that at least one pair of means differs, the first and the second, or the second and the third, whichever it is.
How do you test that? To test it we use something called a post hoc test; although we are not doing it manually here, please remember this. If you go to a software package like SPSS and run a post hoc test, it basically uses the group means to find which of them has the highest value and which has the lowest, and tells you which pairs differ in a statistically significant manner: which one is actually the strongest or highest and which one is the lowest, as simple as that.
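A common post hoc procedure is Tukey's HSD; here is a minimal sketch with statsmodels, where the handle times are again illustrative placeholders rather than the session's data:

```python
# Pairwise post hoc comparison (Tukey's HSD) of the three operators.
# The handle-time values are placeholders, not the session's data.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

times = np.array([76.5, 75.3, 74.8, 75.6, 74.9,    # operator 1
                  74.1, 74.6, 74.3, 74.8, 74.5,    # operator 2
                  74.9, 74.5, 74.8, 74.6, 74.7])   # operator 3
operator = np.repeat(["op1", "op2", "op3"], 5)

result = pairwise_tukeyhsd(endog=times, groups=operator, alpha=0.05)
print(result)   # mean difference and a reject / do-not-reject flag per pair
```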
Now, there is one more important thing we measure in ANOVA: it is called the interaction effect. Let me explain the interaction as well.
Many times what happens is that there are, suppose, two or three groups or factors. In such a condition there are two kinds of effects: the main effect and the interaction effect, and the interaction is important to study. Suppose there are two things which individually each have an effect on the dependent variable.
But what happens when these two things come together? In English you might call it symbiosis or synergy, and sometimes the relationship becomes weaker instead. When two things come together they can give a third kind of effect, the way two different materials in a chemistry lab may behave differently in a compound than either does alone.
Take an example: somebody is enjoying a party. When his friends are there he enjoys it a lot; when he goes with his family alone he also enjoys it. But what if his family and his friends come together at the same party: will the effect be the same? In such a condition the interaction comes into play, and that is why one needs to study it: the interaction effect can have a major bearing on any study.
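Here is a minimal sketch of how main effects and an interaction are separated in a two-way ANOVA, in the spirit of the party example; statsmodels is used, and the small data frame is invented purely for illustration:

```python
# Two-way ANOVA with an interaction term.  The enjoyment scores and
# the friends/family design are invented for illustration only.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "enjoyment": [7, 8, 6, 7, 9, 8, 3, 4, 2, 3, 5, 4],
    "friends":   ["yes"] * 6 + ["no"] * 6,
    "family":    ["yes", "no"] * 6,
})

# 'C(friends) * C(family)' expands to both main effects plus the
# friends-by-family interaction.
model = smf.ols("enjoyment ~ C(friends) * C(family)", data=df).fit()
print(anova_lm(model, typ=2))   # F and p-value for each effect
```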
So if the researcher is doing any kind of study with an experimental design, they need to test the interaction effect and report it as a result in the research outcomes, in the paper or in the thesis, wherever it goes, because the interaction sometimes has a larger effect in real life than the main effects themselves. Now we come to a new situation.
There is multiple analysis of variance. Earlier we were talking about one dependent variable and multiple independent variables, two, three, four, whatever. Now what if I have more than one dependent variable? Let us see this case: an analysis involving the investigation of the main and interaction effects of categorical independent variables on multiple interval-scaled dependent variables. The independent variables are categorical, and there are multiple interval dependent variables. If multiple interval dependent variables are there, how would you conduct the study? This is the case we call MANOVA.
There are many suggestions and many studies here; in fact, most people generally do not run this test because they are not aware of it, but it is not difficult, at least if you are using a software package where everything is available. In SPSS, for example, you go to the general linear model and you can run a MANOVA, which can easily tell you, when two dependent variables are brought in together at the same time, how the independent variables affect them.
The purpose is to determine whether individual categorical independent variables have an effect on a group, or related set, of interval dependent variables. Take an example: we want to make a study where we use two different textbooks, so the textbook is the independent variable, because the change in textbook will affect the change in the dependent variables.
And we are interested in the outcome in the students' improvement in maths and physics, that is, in the maths and physics scores. In this case the maths score and the physics score become my two dependent variables, and a score is obviously a continuous variable, measured as 50, 60, 65, 70, whatever the scores are. The hypothesis is that both together are affected by the difference in textbooks.
In such conditions the interaction effect also comes into larger play; we are saying that the two textbooks have an impact on the dependent variables, which are the maths and physics scores.
Now, what are the assumptions? The independent variables are categorical; the multiple dependent variables are continuous and interval-scaled; and, third, there is a relationship between the dependent variables. That is an important assumption: you cannot just put in any dependent variable you like; there has to be a theoretical justification for why you are using it as a dependent variable and why you are using a MANOVA. If you feel there is a relationship between the two dependent variables, DV1 and DV2, then MANOVA fits the situation. Also, the number of observations for each combination of the factors is the same; it is a balanced experiment.
Now, the same example; I will just show you how it will look. The call-centre manager wants to know if the operator or the method of answering calls makes a difference to the average handle time, the wait time and the customer satisfaction. Earlier we were talking only about the average handle time; now we have brought in two more things, the wait time and the customer satisfaction, so there are basically three dependent variables now, where earlier we had only one. Are they not related? Yes, they are: the average handle time, the wait time and, finally, the customer satisfaction are the dependent variables, and the independent variables are now two things, the call operator and the method of answering.
Who is the call operator? Say, when we give promotions we find out how effectively a person works, how nicely he performs his job; so how is the call operator performing in this case? And then there is the method of answering, because sometimes the call operator may not be the only factor that affects the dependent variables such as satisfaction.
It could be the method of answering: is he answering through some device that is not very clear, where the sound does not carry well, or through some other device or method that is clearer to the customer? This is how I am setting it up.
So my null hypothesis now is that the average handle time, wait time and customer satisfaction are the same for both operators 1 and 2. What is the next null hypothesis? That the average AHT, WT and CS are the same whether method 1 or method 2 is used, because there are two methods. Similarly, the alternatives are that they are not the same for operators 1 and 2, and not the same for methods 1 and 2.
This is how it looks: the handle time, waiting time and customer satisfaction against operator 1 and 2 and method of answering 1 and 2. With this design an ANOVA will not fit. The question then is, why could you not run several ANOVAs? You might be asking yourself why I did not run separate ANOVAs, taking one dependent variable at a time, handle time against the operator or the method of answering, then waiting time again, and so on. But look at how many combinations come up each time.
Again we would be doing the same thing that we were doing in the case of multiple t-tests: the combinations increase, and the more repetitions you run individually, the more the errors go on increasing, so MANOVA becomes a very good technique. What I am showing you now is something you do in software; I have brought the output because I cannot do it manually here.
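For reference, here is a hedged sketch of how such a MANOVA could be set up with statsmodels rather than SPSS; the data frame is a stand-in, and the column names are assumptions made for illustration:

```python
# MANOVA with two categorical factors (operator, method of answering)
# and three related dependent variables.  The data are invented
# stand-ins; the column names are assumed for illustration.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "handle_time":  [75, 74, 76, 73, 75, 74, 72, 71, 73, 72, 74, 73],
    "wait_time":    [30, 28, 31, 27, 29, 30, 25, 24, 26, 25, 27, 26],
    "satisfaction": [4.1, 4.0, 3.9, 4.2, 4.0, 4.1, 4.4, 4.5, 4.3, 4.4, 4.2, 4.3],
    "operator":     ["op1", "op2"] * 6,
    "method":       ["m1"] * 6 + ["m2"] * 6,
})

m = MANOVA.from_formula(
    "handle_time + wait_time + satisfaction ~ operator * method", data=df)
print(m.mv_test())   # Wilks' lambda, Pillai's trace, etc. per effect
```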
If you look at this table, these are variables four and five. Now you have to look at the significance values, and I think this is also important for you to know: the significance value is what helps you reject or accept a hypothesis. Here 0.003 for variable 4 suggests that the null hypothesis is to be rejected, because it is less than 0.05 if we have taken 95% confidence, and similarly for the other one.
But look at the third row: there we have taken the interaction between independent variables 4 and 5, that is, the interaction effect of the operator and the method of answering. When I take the interaction effect, the result is no longer significant. So what we are saying is that there is a main effect,
but there is no interaction effect; in this case we cannot say there is one. It is fine if there is no interaction effect, but what if the interaction effect had been significant? Then you would have said that the two things, when combined, sometimes do a better job or sometimes an inferior one, whichever it is. So this is all for analysis of variance and multivariate analysis of variance. Thank you.