Statistics 101: ANOVA, A Visual Introduction

Video Statistics and Information

Captions
(light acoustic music) - [Brandon] Hello. Thanks for watching, and welcome to the next video in my series on basic statistics. Now as usual, a few things before we get started. Number one, if you're watching this video because you are struggling in a class right now, I want you to stay positive and keep your head up. If you're watching this, it means you've accomplished quite a bit already. You're very smart and talented; you may have just hit a temporary rough patch. Now I know that with the right amount of hard work, practice, and patience, you can work through it. I have faith in you, many other people around you have faith in you, and so should you. Number two, please feel free to follow me here on YouTube, on Twitter, on Google Plus, or on LinkedIn. That way, when I upload a new video, you'll know about it, and it's always nice to connect with my viewers online. I feel that life is much too short and the world is much too large for us to miss the chance to connect when we can. Number three, if you like the video, please give it a thumbs up, share it with classmates or colleagues, or put it on a playlist. That does encourage me to keep making them for you. On the flip side, if you think there's something I can do better, please leave a constructive comment below the video, and I will take those ideas into account when I make new ones. And finally, keep in mind that these videos are meant for individuals who are relatively new to stats, so I will just be going over basic concepts, and I will be doing so in a slow, deliberate manner. Not only do I want you to know what is going on, but also why, and how to apply it. So all that being said, let's go ahead and get started.

So this is the first video on a new topic: the analysis of variance, more commonly known as ANOVA. In our last set of videos, we learned how to compare the variances of two populations using the F-ratio and the F-distribution. In the context of ANOVA, it is important to remember that the F-ratio is simply a ratio of two variances. F-ratios are a central part of ANOVA, so if you are unsure what the F-ratio is, you may wanna go back and watch those videos. ANOVA allows us to move beyond comparing just two populations. With ANOVA we can compare multiple populations, and even subgroups of those populations, and we can investigate how groups interact with each other quantitatively. Many experimental research designs use ANOVA for these very reasons. Now in this video, we will not be doing any calculations, looking at any formulas, or testing any hypotheses. Many students I've worked with over the years learn ANOVA in class without actually knowing what it is or why it even exists. So this video offers a solid conceptual foundation for ANOVA using illustrations and graphics. If you are new to ANOVA, or are still trying to figure out exactly what it is, this video is for you. So sit back, relax, and let's go ahead and get to work.

So the first and most obvious question is, well, why ANOVA? Up to this point, we have been comparing two populations: the independent samples t-test, where we compare two random samples, or the matched sample t-test, where each pair of measurements comes from, say, the same person or the same machine, something like that. But of course, limiting ourselves to the comparison of two populations is, well, limiting. The world is much more complex than just two things. What if we wish to compare the means of more than two populations?
What if we wish to compare populations, each containing several sublevels or groups? Well, enter ANOVA. ANOVA, the acronym, comes from the phrase analysis of variance, and it greatly expands what we are able to do in statistics. So suppose we wanna compare three sample means to see if a difference exists somewhere among them. Our first sample mean is up here in the blue, X bar sub one; X bar sub two is our second sample mean, here in the pink distribution; and then our third one is down here in the green. So each sample will have its own mean and its own distribution. What we are asking is: do all three of these means come from a common population? Is one mean so far away from the other two that it is likely not from the same population as those other two? Or are all three so far apart that they all likely come from unique populations? So you can see we're talking about, sort of, the relative distance between these means. Now the variance of these distributions is also important; we'll talk about that later. But for right now, let's focus on the relative distance between these means and whether or not we could conclude that they come from the same overall population.

So here is our first mean, X bar sub one, with its distribution; then X bar sub two with its distribution; and then X bar sub three with its distribution. Now let's say we take all the data points in all three of those samples and put them into a common, larger distribution. So we'll put that behind these three. Now what we're asking ourselves is: where is each mean relative to the overall data set sitting in the background? You can see that X bar sub one falls pretty much right down the middle. X bar sub two is a bit to the right, and the red arrow denotes how far it is away from the mean of the larger, sort of combined, population. What about X bar sub three? Well, it is a bit to the left, and again the red arrow denotes how far it is away from the mean of the overall combined population.

Now look at this example. X bar sub one is where it was, right at the middle. X bar sub two is a bit to the right. But now look where X bar sub three, our third sample mean, is. It's way over to the left. So we might conclude that this mean, the third one in the green, is too far away from the others, too far away from the mean of the larger group to be considered part of that larger population. It's kind of the oddball. X bar sub two is about this distance, you can see the red arrow, but X bar sub three is way over to the left. So is this sort of the oddball distribution, the weird one, the one that doesn't belong in the same population as the other two?

Now look at this case. Here we have X bar sub one, pretty much right down the middle again, but now X bar sub two, the second sample mean, goes way over to the right, so you can see that distance is pretty far. Now look at X bar sub three: it's way over to the left. So we might conclude that each one of these sample means belongs to its own population. Only X bar sub one belongs to the distribution in the background; X bar sub two may belong to one that's off to the right, and X bar sub three may belong to one that's off to the left. So you can see that the means are in very different locations relative to the overall mean there in the background.
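To make that picture concrete in code, here is a minimal Python sketch (not from the video, with made-up sample values) that pools three samples, finds the grand mean of the combined distribution, and measures how far each sample mean sits from it, the "red arrows" described above.

```python
# Three hypothetical samples; values are invented purely for illustration.
import numpy as np

sample_1 = np.array([4.8, 5.1, 5.0, 4.9, 5.2])
sample_2 = np.array([5.3, 5.6, 5.4, 5.7, 5.5])
sample_3 = np.array([4.3, 4.6, 4.4, 4.5, 4.2])

# Pool all the data points into one common, larger distribution.
pooled = np.concatenate([sample_1, sample_2, sample_3])
grand_mean = pooled.mean()

# The "red arrow" for each sample: distance of its mean from the grand mean.
for i, s in enumerate([sample_1, sample_2, sample_3], start=1):
    print(f"x-bar {i} = {s.mean():.2f}, "
          f"distance from grand mean = {s.mean() - grand_mean:+.2f}")
```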
So the null hypothesis in these types of problems, in ANOVA, is whether or not these three sample means come from the same population. Now remember, the sample mean is a point estimator of the population mean, so our null hypothesis is that mu sub one equals mu sub two equals mu sub three, which is another way of expressing that these three means come from the same overall population. And remember, we're not asking if they are exactly equal. We're asking if each mean likely came from the same larger overall population. So in ANOVA, this is a very specific and important idea. We call this the variability among, or between, the sample means. Each sample mean is a certain distance from the mean of the overall population in the background, and we know that that is an expression of variance: sort of the distance of the sample mean from the overall mean in the back. So this variability between the sample means is something I really want you to keep in your mind as we proceed. This is between variance.

Now, we could test all of these sample means using pairwise t-tests. So here are our three sample means: X bar sub one, X bar sub two, and X bar sub three. We'll block out this part of this little matrix here, because in this triangle we have comparisons to themselves and repeat comparisons, so we're only gonna have three that we could actually do. In the first one, we could compare mean one here in the blue to mean two, which is there in the pink. So our null hypothesis would be X bar sub one equals X bar sub two, and we could have a t-test that tests that. We could also compare mean one and mean three, so we'd have a t-test of X bar sub one equals X bar sub three. And we could also test mean two and mean three: X bar sub two equals X bar sub three. Now notice each one of those independent tests has its own alpha level: .05, .05, and .05. The problem with doing all these pairwise comparisons is that the alpha level, the type one error rate of course, means 95% confidence for each test, but the error compounds with each t-test. So if we compared each possible pair, we would have .95 times .95 times .95. So our 95% is now .857, or 85.7%. And of course our alpha is one minus that, so one minus .857 equals .143. Well, what is that .143? That is our overall alpha level, our overall type one error rate. So our type one error rate went from 5% to 14.3%. That is why we do not conduct t-tests for every possible pair of means: the error rate compounds, and therefore the test, of course, has problems.
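That compounding arithmetic is easy to verify directly. A quick sketch, assuming (as the narration implicitly does) that the three pairwise tests are independent:

```python
# Compounding type one error across three independent pairwise t-tests.
alpha = 0.05
n_pairs = 3  # (1 vs 2), (1 vs 3), (2 vs 3)

confidence_all = (1 - alpha) ** n_pairs  # .95 * .95 * .95
overall_alpha = 1 - confidence_all

print(round(confidence_all, 3))  # 0.857 -> 85.7% overall confidence
print(round(overall_alpha, 3))   # 0.143 -> type one error rate jumps to 14.3%
```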
Now what's different? If I stretch each one of these sample distributions out, what changes? Well, the spread, or the variance, of each distribution. We call this the variability around, or within, the distributions. So remember, before we talked about the variance between the distributions, and that was sort of the distance of each mean from the overall population in the background. That was variability between. This is variability within, so within each sample distribution.

So at its heart, ANOVA is really a variability ratio. It is a ratio of the first type of variance we saw, the variability between the means, over the variability within the distributions, the second type we saw when we stretched each sample distribution out. So it is just a ratio: between variance divided by within variance. Remember, in the first type, we had an overall mean and then the distance of each sample mean from that; you can see that represented on the top. On the bottom, we had the variability within the distributions, so their width or spread side to side. So on the top, we're talking about distance from the overall mean, and on the bottom, we're talking about each one's spread or width within its own sample: distance from the overall mean divided by, sort of, the internal spread. So we've reduced this to a very common fraction. It is the variance between divided by the variance within. Remember, this is a ratio of variances, so of course the F-distribution will come into play here in a bit. Just keep in mind this sort of visual tool: on the top, the distance from the overall mean, and on the bottom, the internal spread of each sample distribution. Variance between divided by variance within.

Now if we put those together, those are the components of the total variance. So for an ANOVA, we have the total variance split into two parts: the variance between the means, and the variance within each mean's distribution. We put those together, and we have the total variance for the entire data set. This is called partitioning. We're separating the total variance into its two component parts. And again, this is what's called a one-way ANOVA; I'm not gonna go into that right now, that's for the next video. But in this type of problem, sort of the basic problem, we have total variance, and it's made up of the variance between the means and the variance within each sample.
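The video saves the formulas for later, but for readers who want a preview, here is a hedged sketch of that partitioning using the same hypothetical samples as before. The sums of squares are the raw ingredients of the between and within variances; dividing each by its degrees of freedom and taking the ratio gives the F-ratio, and scipy.stats.f_oneway computes the same one-way ANOVA statistic.

```python
# Partitioning: total variability = between + within (hypothetical data).
import numpy as np
from scipy import stats

samples = [np.array([4.8, 5.1, 5.0, 4.9, 5.2]),
           np.array([5.3, 5.6, 5.4, 5.7, 5.5]),
           np.array([4.3, 4.6, 4.4, 4.5, 4.2])]
pooled = np.concatenate(samples)
grand_mean = pooled.mean()

# Between: each sample mean's distance from the grand mean, weighted by n.
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
# Within: each sample's spread around its own mean.
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)
# Total: all data points' spread around the grand mean.
ss_total = ((pooled - grand_mean) ** 2).sum()
assert np.isclose(ss_total, ss_between + ss_within)  # the partition holds

k, n = len(samples), len(pooled)
f_ratio = (ss_between / (k - 1)) / (ss_within / (n - k))  # between / within

f_check, p_value = stats.f_oneway(*samples)  # scipy's one-way ANOVA
print(f_ratio, f_check)  # both F values agree
```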
So here is sort of the nuts-and-bolts summary of everything we have here. If the variability between the means, sort of the distance from the overall mean, which we saw in the red before, in that numerator, is relatively large compared to the variance within the samples, the spread within the actual individual samples, in that denominator, then this ratio will be much larger than one, because the variance between will be quite large relative to the variance within. If that's the case, then the samples most likely do not come from a common population, and we would reject the null hypothesis that these means are equal, or come from the same population. So we may have one oddball distribution out to the side, or all three of them may be so far apart that it creates a very high ratio, because the variance between would be much higher than the variance within.

So what could this look like in our test in general? If the variance between is large relative to the variance within, we would reject the null hypothesis that says all the means are equal. So at least one mean is an outlier, sort of off to the side, or all three may be spread far apart, and each distribution is relatively narrow, so they don't all melt together. They're distinct. Think of this as three distinct distributions that are far apart, or one that is sort of by itself off to the side. That would create a large variance between the means. Now if the between variance and the within variance are similar, then we would probably fail to reject the null hypothesis, so the means would appear to be equal, from the same overall population. In this case, the means may be fairly close to each other, fairly close to that overall mean, and/or the distributions overlap a bit, so they may be a bit harder to distinguish from each other. If they're very close together, or the variance within each is very wide, then they'll sort of melt together and therefore not be distinct. Now the other case is where the between variance is very small and the within variance is very large. You can think of this as three distributions that are very spread out internally and do not have a whole lot of distance from each other. So they may be close together, and/or the distributions, because of very high variation, are wide and spread out; they sort of melt together, and you really can't distinguish them from each other as coming from separate or distinct populations. So again, this is a very general look at how we would interpret an ANOVA. It obviously gets more complicated than this, but I just wanted to give you a very rough overview.
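Those two scenarios are easy to simulate. A small sketch with invented parameters (not from the video): three far-apart, narrow distributions versus three close, wide ones that melt together.

```python
# Distinct samples -> F much larger than 1; overlapping samples -> F near 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Far-apart means with narrow spread: large between, small within.
distinct = [rng.normal(loc=m, scale=0.5, size=30) for m in (0.0, 3.0, 6.0)]
# Close means with wide spread: the distributions melt together.
melted = [rng.normal(loc=m, scale=3.0, size=30) for m in (0.0, 0.3, 0.6)]

f1, p1 = stats.f_oneway(*distinct)
f2, p2 = stats.f_oneway(*melted)
print(f"distinct: F = {f1:.1f}, p = {p1:.3g}")  # huge F -> reject the null
print(f"melted:   F = {f2:.1f}, p = {p2:.3g}")  # F near 1 -> fail to reject
```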
So here is what we're left with. This F-ratio, remember, is a ratio of two variances. We have the between variance, again, the distance of each mean from the overall mean of the combined population in the background, as compared to the within variance, so within each sample distribution. Now sometimes this is called the among variance and the around variance. The thing about ANOVA is that I could have three or four different stats textbooks and they would all call this something different, but the most common way of expressing it is the between variance and the within variance. You may also see it as the among variance, the variance among the means, and the around variance, the variance around each sample mean, but it means the same thing. So this is that partitioning: the variance between, there in the red arrows, plus the variance within, there in the blue arrows, adds up to the total variance. Now you will also see a term called the error, or the error variance. Well, the error variance is another name for the within variance, there in the blue, or the around variance. So again, you might see that in your textbook, and that's one of the challenges of making these types of videos, because stats books like to give different names to the exact same thing.

Okay, so remember: why ANOVA? Up to this point, we had been comparing just two populations; the independent samples t-test and the matched sample t-test are two examples. But limiting ourselves to the comparison of two populations is, of course, limiting. What if we wish to compare the means of more than two populations? What if we wish to compare populations, each containing several levels or subgroups? Well, that's what we have ANOVA for. Remember, ANOVA stands for the analysis of variance.

So, one more thing I wanna hit on before we go on to the end slide and wrap up. Remember that we're looking at the variance between the means as compared to the variance within each sample. So it's all about variance; that's why it's called analysis of variance. Variance between as compared to variance within. We're always looking at a ratio of two variances.

Okay, so that wraps up our first video on the analysis of variance, or ANOVA. Again, I wanted to give you a graphical representation of what's going on in ANOVA, so that when you're working with a table of data and your numbers and your F-ratios and things like that, you actually know what is going on in the background. What we're really trying to do is find any distinctions between sample means. We may have sample means that sort of all line up; therefore, we could conclude that those probably come from the same combined population. But we might have one of the means as sort of an oddball, out by itself; in that case, it probably comes from a different population off to the side. Or we could have three sample means that are very far apart from each other; therefore, each sample mean may come from its own population, sort of in the background. So what we're looking for here is distinction, or differences, among several means. And we are doing that by looking at two types of variance: between variance and within variance.

So a few last words and then we are done. If you're watching this video because you're struggling in a class, stay positive and keep your head up. I have faith in you, many other people around you have faith in you, and so should you. If you like the video, please give it a thumbs up and share it with classmates or colleagues. Feel free to follow me here on YouTube, on Twitter, on Google Plus, or on LinkedIn. It's always nice hearing from you. And finally, just keep in mind that the fact that you're on here trying to learn, trying to improve yourself as a student or as a businessperson, that is what really matters. I firmly believe that if you have the right learning process in place, the results will take care of themselves. So thank you very much for watching. I wish you the best of luck in your work and in your studies, and I look forward to seeing you again next time. (light acoustic music)
Info
Channel: Brandon Foltz
Views: 718,688
Rating: 4.9591389 out of 5
Keywords: brandon foltz anova, statistics 101 anova, ANOVA introduction, anova brandon foltz, anova tutorial, introduction to anova, anova statistics tutorial, statistics anova, ANOVA, anova statistics, anova explained, what is anova in statistics, brandon foltz, ANOVA basics, what is anova, anova test, statistics 101, brandon c foltz, brandon c. foltz, analysis of variance, one way ANOVA, statistics, logistic regression, linear regression, simple linear regression, Machine learning
Id: 0Vj2V2qRU10
Length: 24min 17sec (1457 seconds)
Published: Sat Apr 27 2013