(light acoustic music) - [Brandon] Hello. Thanks for watching, and welcome to the next video in my
series on basic statistics. Now as usual, a few things
before we get started. Number one, if you're watching this video because you are struggling
in a class right now, I want you to stay positive
and keep your head up. If you're watching this, it means that you've accomplished
quite a bit already. You're very smart and talented, but you may have just hit
a temporary rough patch. Now I know with the right
amount of hard work, practice, and patience,
you can work through it. I have faith in you, many other people around you have faith in
you, and so should you. Number two, please feel free to follow me here on YouTube, on Twitter, on Google Plus, or on Linkedin. That way, when I upload a
new video, you know about it, and it's always nice to
connect with my viewers online. I feel that life is much too short and the world is much too large for us to miss the chance
to connect when we can. Number three, if you like the video, please give it a thumbs up, share it with classmates or colleagues
or put it on a playlist. That does encourage me to
keep making them for you. On the flip side, if you think there's something I can do better, please leave a constructive
comment below the video, and I will take those ideas into account when I make new ones. And finally, just keep in mind that these videos are meant for individuals who are
relatively new to stats, so I'm just going over basic concepts, and I will be doing so in
a slow, deliberate manner. Not only do I want you
to know what is going on, but also why, and how to apply it. So all that being said, let's
go ahead and get started. So this is the first
video about a new topic: the analysis of variance, more commonly known as ANOVA. In our last set of videos, we learned how to compare the variances
of two populations, using the F-ratio and F-distribution. In the context of ANOVA, it is important to remember that the F-ratio is simply a ratio of two variances. F-ratios are a central part of ANOVA, so if you are unsure what the F-ratio is, you may wanna go back
and watch those videos.
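If seeing that in code helps, here is a minimal sketch of the F-ratio as a ratio of two sample variances, in Python with numpy. The samples and seed are made up purely for illustration; they are not from the video.

```python
import numpy as np

# Two made-up samples, just for illustration.
rng = np.random.default_rng(seed=1)
sample_a = rng.normal(loc=50, scale=5, size=30)
sample_b = rng.normal(loc=50, scale=7, size=30)

# The F-ratio is simply one sample variance divided by another.
# ddof=1 gives the sample variance (dividing by n - 1).
f_ratio = sample_a.var(ddof=1) / sample_b.var(ddof=1)
print(f"F-ratio: {f_ratio:.3f}")
```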
ANOVA allows us to move beyond comparing just two populations. With ANOVA we can compare multiple populations, and even subgroups of those populations. We can investigate how two groups interact with each other quantitatively. Many experimental research designs use ANOVA for these very reasons. Now in this video, we will not be doing any calculations, looking at any formulas, or testing any hypotheses. Many students I've worked
with over the years learn ANOVA in class
without actually knowing what is is or why it even exists. So this video offers a solid, conceptual foundation about ANOVA using
illustrations and graphics. So if you are new to
ANOVA or are still trying to figure out exactly what
it is, this video is for you. So sit back, relax, and let's
go ahead and get to work. So the first and most obvious
question is, well, why ANOVA? So up to this point, we have
been comparing two populations. So the independent samples t-tests, which are two random samples we compare, or the matched sample t-test, where each measurement
is maybe the same person or same machine, something like that. But of course, limiting ourselves to the comparison of two
populations is, well, limiting. The world is much more
complex than just two things. What if we wish to compare the means of more than two populations? What if we wish to compare populations, each containing several
sublevels or groups? Well, enter ANOVA. So ANOVA, the acronym,
comes from the phrase analysis of variance,
so ANOVA greatly expands what we are able to do in statistics. So suppose we wanna
compare three sample means to see if a difference
exists somewhere among them. So our first sample mean is up here in the blue, X bar sub one. X bar sub two is our second sample mean here in the pink distribution, and then our third one is down here in the green. So each sample will have its own mean and its own distribution. So what we are asking is: do all three of these means come from a common population? Is one mean so far away from the other two that it is likely not from the same population as those other two, or are all three so far apart that they all likely come from unique populations?
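To make that setup concrete, here is a small sketch of three samples, each with its own mean, in Python with numpy. The group sizes, locations, and seed are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Three made-up samples -- think of the blue, pink, and green
# distributions from the graphic.
sample_1 = rng.normal(loc=50, scale=5, size=30)
sample_2 = rng.normal(loc=52, scale=5, size=30)
sample_3 = rng.normal(loc=49, scale=5, size=30)

# Each sample has its own mean: x-bar sub 1, 2, and 3.
for i, s in enumerate([sample_1, sample_2, sample_3], start=1):
    print(f"x-bar sub {i}: {s.mean():.2f}")

# ANOVA will ask whether all three of these means could plausibly
# have come from one common population.
```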
So you can see we're talking about sort of the relative distance between these means. Now the variance of these distributions is also important, we'll
talk about that later. But for right now, let's
focus on the relative distance between these means and whether or not we could conclude that they come from the same overall population. So here is our first mean, X bar sub one, with its distribution, then X bar sub two with its distribution, and then X bar sub three with its distribution. Now let's say we take all the data points in all three of those samples and we put all those data points into a common, larger distribution. So we'll put that behind these three. Now what we're asking ourselves is: where is each mean relative to the overall data set, sort of in the background? You can see that X bar sub one falls pretty much right down the middle. X bar sub two, that mean
is a bit to the right, so you can see the red arrow denotes how far it is away from the mean of the larger, sort of
combined population. What about X bar sub three? Well, it is a bit to the left, so you can see that red arrow denoting how far it is away from the mean of the overall sort of
combined population. So look at this example. So X bar sub one is where it
was, right at the middle. X bar sub two is a bit to the right. But now look where X bar sub three is, our third sample mean. It's way over to the left. So we might conclude that this mean, the third one in the green, is too far away from the others. It's too far away from the mean of the larger group to be considered as part of that larger population. It's kind of the oddball. So X bar sub two is about this distance, you can see the red
arrow, but X bar sub three is way over to the left. So is this sort of the
oddball distribution, sort of the weird one, sort of the one that doesn't belong in the same
population as the other two? Now look at this case. Here we have X bar sub
one, that's pretty much right down the middle again, but now X bar sub two,
the second sample mean, goes way over to the right, so you can see that distance is pretty far away. Now look at X bar sub three. It's way over to the left. So we might conclude that
each one of these sample means belongs to its own population. Only X bar sub one belongs to this distribution in the background. X bar sub two may belong to
one that's off to the right, and X bar sub three may belong to one that's off to the left. So you can see that the means are in very different locations relative to the overall mean
there in the background. So the null hypothesis in these types of problems and ANOVAs is whether or not these three sample means come from the same population. Now remember, the sample mean is a point estimator of the population mean, so our null hypothesis is that mu sub one equals mu sub two equals mu sub three, which is another way of expressing that these three means come from the same overall population. Now remember, we're not asking if they are exactly equal. We're asking if each mean likely came from the same larger overall population.
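As a preview of how that null hypothesis gets tested in software, here is a sketch using scipy's one-way ANOVA function on the same kind of made-up samples as before. The function call is real (scipy.stats.f_oneway), but the data are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
sample_1 = rng.normal(loc=50, scale=5, size=30)
sample_2 = rng.normal(loc=52, scale=5, size=30)
sample_3 = rng.normal(loc=49, scale=5, size=30)

# H0: mu1 = mu2 = mu3 -- all three means come from one population.
f_stat, p_value = stats.f_oneway(sample_1, sample_2, sample_3)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# A large F (and small p) would suggest at least one mean
# does not belong to the common population.
```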
So in ANOVA, this idea is a very specific and important idea. We call this the variability among or between the sample means. So each sample mean is a certain distance from the mean of the overall population in the background, and we know that that is
an expression of variance, sort of the distance of the sample mean from the overall mean in the back. So this variability
between the sample means is something I really want you to keep in your mind as we proceed. This is between variance. So we could test all of these sample means using pairwise t-tests. So here are our three sample means, so X bar sub one, X bar sub
two, and X bar sub three. Now we'll block out this part
of this little matrix here, because in this triangle, we have comparisons to themselves
and repeat comparisons, so we're only gonna have three
that we could actually do. So in this first one, we
could compare mean one here in the blue to mean two,
which is there in the pink. So our null hypothesis would be X bar sub one equals X bar sub two. We could have a t-test that tests that. We could also compare
mean one and mean three. So we'd have a t-test of X bar sub one equals X bar sub three. Now we could also test
mean two and mean three. So X bar sub two equals X bar sub three. Now notice each one of
those independent tests has its own alpha level, so
alpha level .05, .05, and .05. Now the problem with doing
all these pairwise comparisons is that the alpha level is the type one error rate, of course, which means 95% confidence. But the error compounds with each t-test. So if we compared each possible pair, we would have .95 times .95 times .95. So our 95% is now .857, or 85.7%. So of course our alpha is one minus that, so one minus .857 equals .143. Well, what is that .143? Well, that is our overall alpha level. That is our overall type one error rate. So our type one error rate went from 5% to 14.3%. That is why we do not conduct t-tests for every possible pair of means. The error rate compounds, and therefore the test of course has problems.
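That compounding is easy to verify for yourself; here is the arithmetic from above as a short Python sketch.

```python
# Type one error compounding across pairwise t-tests.
alpha = 0.05
num_tests = 3  # (1 vs 2), (1 vs 3), (2 vs 3)

# Chance of making no type one error across all three tests:
confidence = (1 - alpha) ** num_tests     # 0.95^3 = 0.857
# So the overall (familywise) type one error rate is:
overall_alpha = 1 - confidence            # about 0.143
print(f"confidence: {confidence:.3f}, overall alpha: {overall_alpha:.3f}")
```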
Now what's different? So if I stretch each one of these sample distributions out, what changes? Well, the spread, or the variance of each distribution. So we call this the variability around or within the distributions. So remember before, we talked about the variance between the distributions, and that was sort of the
distance of each mean from the overall population
in the background. So that was variability between. This is variability within, so within each sample distribution. So at its heart, ANOVA is
really a variability ratio. It is a ratio of the first
type of variance we saw, the variability between the means, over the variability
within the distributions. That's the second type we saw when we stretched each
sample distribution out. So it is just a ratio. It is between variance,
divided by within variance. So remember, in the first type, we had an overall mean,
and then the distance of each sample mean from that. So you can see that
represented on the top. In the bottom, we had the
variability within distributions. So their width or spread side to side. So on the top, we're talking about distance from the overall mean, and on the bottom, we're talking about each one's spread or width
within its own sample, so distance from the overall mean divided by sort of the internal spread. So we reduced this to
a very common fraction. It is the variance between
divided by the variance within. So remember, this is a ratio of variances. So of course, the F-distribution will come into play here in a bit.
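Here is that ratio computed by hand in a short sketch, again on made-up samples. The formulas are the standard one-way ANOVA mean squares; the data and group sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
samples = [rng.normal(loc=m, scale=5, size=30) for m in (50, 52, 49)]

k = len(samples)                      # number of groups
n = sum(len(s) for s in samples)      # total number of observations
grand_mean = np.concatenate(samples).mean()

# Variability BETWEEN: how far each sample mean is from the grand mean.
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ms_between = ss_between / (k - 1)

# Variability WITHIN: the spread inside each sample around its own mean.
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)
ms_within = ss_within / (n - k)

# The F-ratio: between variance divided by within variance.
print(f"F = {ms_between / ms_within:.3f}")
```

Run on the same data, this should agree with what scipy.stats.f_oneway reports.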
So just keep in mind this sort of visual tool. On the top, the distance
from the overall mean, and in the bottom, the internal spread of each sample distribution. So variance between
divided by variance within. Now if we put those together, those are the components
of the total variance. So for an ANOVA, we
have the total variance sort of split into two parts. We have the variance between the means, and then the variance within
each mean's distribution. So we put those together, and we have the total variance for
the entire data set. So this is called partitioning. So we're separating this total variance into its two component parts, and again, this is what's
called a one-way ANOVA, and I'm not gonna go into that right now, that's for the next video. But in this type of problem, sort of the basic problem, we have total variance, and it's made up of the variance between the means and the variance within each sample.
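You can check that partitioning numerically; in this sketch (same made-up samples as before), the between piece and the within piece add back up to the total.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
samples = [rng.normal(loc=m, scale=5, size=30) for m in (50, 52, 49)]

all_data = np.concatenate(samples)
grand_mean = all_data.mean()

# Total variability, and its two component parts.
ss_total = ((all_data - grand_mean) ** 2).sum()
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

print(np.isclose(ss_total, ss_between + ss_within))  # True
```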
So here is sort of the nuts-and-bolts summary of everything we have here. If the variability between the means, sort of the distance
from the overall mean, we saw that in the red before, in that numerator is relatively large compared to the variance
within the samples, so within the actual
individuals samples of spread, in that denominator, than this ratio will be much larger than one,
because the variance between will be quite large relative
to the variance within. That will make this ratio
much larger than one. If that's the case, then the samples most likely do not come
from a common population. So then we would reject
the null hypothesis that these means are equal, or come from the same population. So if that's the case, we may
have one oddball distribution out to the side, or all three of them may be so far apart that it would create a very high ratio,
because the variance between would be much higher
than the variance within. So what could this look like sort of in our test in general? So if the variance between is large relative to the variance
within being small, we would reject the null hypothesis that says all the means are equal. So at least one mean is an
outlier, sort of off to the side, or they may all three be spread far apart, and each distribution
is relatively narrow, so they don't sort of all melt together. They're distinct, so think of this as three distinct distributions
that are far apart, or one that is sort of by
itself off to the side. That would create a large
variance between the means. Now if the between variances
and the within variances are similar, then we would probably fail to reject that null hypothesis, so they would appear to be equal, or to come from the same overall population. So in this case, the means may be fairly close to each other, fairly close to that overall mean, and/or the distributions overlap a bit. So they may be a bit harder to distinguish from each other. So if they're very close together, or the variance within is very wide, then they'll sort of melt together and therefore they will not be distinct. Now the other case is where the between variance is very small and the variance within is very large. So you can think of this
as three distributions that are very spread out internally, and do not have a whole lot
of distance from each other. So they may be close together, and/or the distributions, because of very high variation, are sort of wide and spread out, so they sort of melt together, and you really can't
distinguish them from each other as coming from separate
or distinct populations. So again, this is a very general look at how we would interpret an ANOVA.
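For completeness, here is a sketch of how "much larger than one" gets judged formally: the computed F-ratio is compared against a critical value from the F-distribution. The degrees of freedom below assume three groups of 30 observations, matching the earlier made-up examples.

```python
from scipy import stats

alpha = 0.05
df_between = 3 - 1    # k - 1, for k = 3 groups
df_within = 90 - 3    # n - k, for n = 90 total observations

# If the computed F-ratio lands beyond this cutoff, we would reject
# the null hypothesis that all three means are equal.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
print(f"Reject H0 when F > {f_critical:.2f}")
```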
And it obviously gets more complicated than this, but I just wanted to give
you a very rough overview. So here is what we're left with. This F-ratio, remember, is a ratio of two variances. So we have the between variance, so again, the distance of each mean from the overall mean, or
the combined population in the background, as compared
to the variance within. So within each sample distribution. Now sometimes, this is called the among variance and
the around variance. So the thing about ANOVAs
is that I could have three different or four
different stats textbooks and they all call this
something different, but the most common way of expressing it is the between variance
and the within variance. But you may see it as the among variance, the variance among the means, and the around variance, which is the variance around each sample mean, but it means the same thing. So this is that partitioning. So the variance between,
there in the red arrows, plus the variance within,
there in the blue arrows, adds up to the total variance. Now you will also see a term called the error, or the error variance. Well, the error variance is another name for the within, there in
the blue, or the around. So again, you might see
that in the textbook, and that's one of the challenges of doing these types of videos, because stats books like to give different names to the exact same thing. Okay, so remember, why ANOVA? So up to this point, we had been comparing just two populations, so the independent samples t-test and the matched sample
t-test are two examples. But limiting ourselves to the comparison of two populations is, of course, limiting. What if we wish to compare the means of more than two populations? What if we wish to
compare the populations, each containing several
levels or subgroups? Well, that's what we have ANOVA for. Remember, ANOVA stands for
the analysis of variance. So, one more thing I wanna hit on before we go onto the
end slide and wrap up. Remember that we're looking at the variance between the means as compared to the variance
within each sample. So it's all about variance. That's why it's called
analysis of variance. Variance between as
compared to variance within. So we're always looking
at a ratio of two variances. Okay, so that wraps up our first video on the analysis of variance, or ANOVA. So again, I wanted to give you a graphical representation
of what's going on in ANOVA. So when you're working
with the table of data and you're working with your numbers and your F-ratios and things like that, you actually know what is
going on in the background. What we're really trying to do is find any distinctions
between sample means. So we may have sample means
that sort of all line up. Therefore, we could conclude that those probably come from the
same combined population. But we might have one of the means as sort of an oddball, out by itself. So in that case, it probably comes from a different
population off to the side. Or we could have three sample means that are very far apart from each other. Therefore, each sample mean may come from its own population, sort
of in the background. So what we're trying to look for here is distinction, or differences
among several means. And we are doing that by looking
at two types of variance, between variance and within variance. So a few last words and then we are done. If you're watching this video because you're struggling in a class, stay positive and keep your head up. I have faith in you, many other people around you have faith in
you, and so should you. If you like the video,
please give it a thumbs up, share it with classmates or colleagues. Feel free to follow me here on YouTube, on Twitter, on Google
Plus, or on Linkedin. It's always nice hearing from you. And finally, just keep in mind that the fact that you're
on here trying to learn, trying to improve yourself as a student or as a businessperson,
that is what really matters. I firmly believe if you have
the right learning process in place, the results will
take care of themselves. So thank you very much for watching, I wish you the best of luck in
your work and in your studies and I look forward to
seeing you again next time. (light acoustic music)