Introduction to the Central Limit Theorem

Captions
Let's talk about the central limit theorem, an extremely important concept in statistics. The gist of the central limit theorem is that the sample mean will be approximately normally distributed for large sample sizes, regardless of the distribution from which we are sampling, and I'm going to illustrate this using simulation in a little bit.

First, let's recall a few characteristics of the sampling distribution of the sample mean. Suppose we are sampling from a population with mean mu and standard deviation sigma, and let X bar be a random variable representing the sample mean of n independently drawn observations from this distribution. We have previously learned that the mean of the sampling distribution of the sample mean is equal to the population mean: the mean of the sampling distribution of X bar is equal to the mean of the population from which we are sampling. We've also learned that the standard deviation of the sampling distribution of X bar is equal to sigma over the square root of n. As previously discussed, if the population is normally distributed, then the sample mean X bar is also normally distributed. But what if the population is not normal? The central limit theorem addresses this question: the distribution of the sample mean tends toward the normal distribution as the sample size increases, regardless of the distribution from which we are sampling.

Let's illustrate this through simulation. Here's an exponential distribution, which is most definitely not normal. What I'm going to do is draw a sample of 2 observations (so n is 2), that is, draw 2 independent values from this distribution and calculate the mean. And I'm going to do that again and again and again, a million times, so we're going to get a million sample means where n is equal to 2. One thing to note right off the bat is that I'm allowing the y-axis scaling and the x-axis scaling to change; what we're interested in here is the shape of the distribution. The grey histogram is a histogram of those million sample means where n = 2, and this is going to be approximately the sampling distribution of X bar in this scenario. In this particular spot we could mathematically work out the exact sampling distribution, but here I've done it through simulation. We can see that this distribution retains some of the character of the original distribution: we've got some right skewness, and it's most definitely not normal. With this white line I've superimposed a normal curve with the appropriate mean and variance, and what we can see is that when n is 2 the sampling distribution of the sample mean is not normal.

Let's see what happens when I increase the sample size. Here again I have drawn a million samples, this time with a sample size of 4 in each one of those samples, and I've plotted those million sample means in a histogram; that is approximately the sampling distribution of X bar when n is 4. Here we still have some of that skewness and it's not quite normal, but it's getting there. It's still quite a bit different from the superimposed normal curve, though. When n is 10 we're getting a little bit closer, but we can still see some of that skewness. When n is 20 we're getting closer still. And when n is 50, this histogram of sample means, which is approximately the sampling distribution of X bar when n is 50, is pretty close to that superimposed normal curve. So for a sample size of 50 the sampling distribution of X bar is pretty close to normal. And we'd see that if I did this for larger and larger sample sizes, it would get more and more normally distributed, looking closer and closer to that superimposed normal curve.
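As a rough sketch of the kind of simulation described above: the video doesn't specify the exponential distribution's parameters, so an Exponential population with mean 1 is assumed here, and fewer replications than the video's one million are used to keep the run light. This is only an illustration of the idea, not the video's actual code.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
reps = 200_000                      # the video uses a million; fewer here to keep memory modest
sample_sizes = [2, 4, 10, 20, 50]

fig, axes = plt.subplots(1, len(sample_sizes), figsize=(18, 3))
for ax, n in zip(axes, sample_sizes):
    # Draw `reps` samples of size n from an Exponential(1) population
    # and compute each sample mean.
    means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

    # Histogram of the simulated means: an approximation to the
    # sampling distribution of X bar for this n.
    ax.hist(means, bins=150, density=True, color="grey")

    # Superimpose the normal curve with the appropriate mean and variance:
    # mean mu = 1 and standard deviation sigma / sqrt(n) = 1 / sqrt(n).
    x = np.linspace(means.min(), means.max(), 400)
    ax.plot(x, norm.pdf(x, loc=1.0, scale=1.0 / np.sqrt(n)))
    ax.set_title(f"n = {n}")

plt.tight_layout()
plt.show()
```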
What I was illustrating there is that when we are sampling from non-normal populations, the distribution of the sample mean tends toward the normal distribution as the sample size increases. As a very rough guideline, the sample mean can be considered to be approximately normally distributed if the sample size is at least 30, if our n is at least 30. Again, this is a very rough guideline; we can easily construct scenarios in which a sample size of a hundred trillion is not nearly enough to give us approximate normality, but in most practical situations, when our sample size starts getting up beyond 30, the distribution of the sample mean will be approximately normal.

Let's do another simulation for a different distribution. This one's a bit of a weird mixture type of distribution. It's the same scenario as last time: I'm going to randomly and independently draw 2 observations from this distribution, calculate the sample mean, draw another 2 observations, calculate the sample mean, and do that a million times, and plot a histogram. The grey histogram is approximately the sampling distribution of X bar when n = 2, and I've superimposed the normal curve again; obviously this distribution is not quite normal. Let's increase the sample size and see what happens. Note that in this series of plots I'm keeping the scaling on the x-axis the same and letting the y-axis scaling change from plot to plot. Here I've sampled from the original distribution a million times with a sample size of 4, and plotted the million resulting sample means in the histogram, so the grey histogram is approximately the sampling distribution of X bar when n is equal to 4, and I've superimposed the normal curve with the appropriate mean and variance. We can see that the sampling distribution of X bar is actually quite close to normal, and we'll see that when we let the sample size increase, it gets closer and closer to that superimposed normal curve. When n is 10 the sampling distribution of X bar looks quite normal. When n is 20 it's even closer to that normal curve. And when n is 50 it's looking very normal. The sampling distribution of X bar would get closer and closer to a normal distribution as the sample size increases.

Why is this important? The central limit theorem tells us that many statistics have distributions that are approximately normal for large sample sizes, even when we are sampling from a distribution that is not normal. This means that we can often use well-developed statistical inference procedures and probability calculations that are based on a normal distribution, even if we are sampling from a population that is not normal, provided we have a large sample size. A little more formally, the central limit theorem tells us that our usual z-score involving the sample mean tends in distribution to the standard normal distribution as the sample size tends to infinity. We do have a couple of technical restrictions, in that we need the mean and variance to be finite, but that's usually going to be the case for the things we're dealing with.
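Written out, the z-score statement described here takes the standard form below; the label Z_n is introduced only for readability and is not the video's notation.

```latex
% CLT: for n i.i.d. observations with finite mean \mu and finite variance \sigma^2,
% the standardized sample mean converges in distribution to a standard normal.
\[
  Z_n \;=\; \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
  \;\xrightarrow{\;d\;}\; N(0,\,1)
  \qquad \text{as } n \to \infty .
\]
```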
Let's see how the central limit theorem might help us carry out a probability calculation. Suppose salaries at a very large corporation have a mean of $62,000 and a standard deviation of $32,000. Our population mean mu is 62,000 and our population standard deviation sigma is 32,000. If a single employee is randomly selected, what is the probability their salary exceeds $66,000? Let's let the random variable X represent the salary of a randomly selected employee. What we want to find is the probability that X is greater than $66,000. We've previously done some probability calculations and said that Z is equal to X minus mu over sigma, so it might be tempting here to say that this is equal to the probability that Z is greater than 66,000 minus 62,000 over 32,000, which would be equal to the probability that Z is greater than 0.125. We haven't done anything wrong to this point, but if we were to find this probability by looking it up for the standard normal distribution, that would be an error. Nowhere in this question does it say that salaries are normally distributed, and we've learned previously that salaries are typically not normally distributed; salaries tend to have some right skewness. So salaries are not normally distributed, nowhere in this question does it say anything about normality, and so this random variable X is not going to have a normal distribution, which means that this random variable Z is not going to have a standard normal distribution. This question cannot be answered without further information about the distribution of X.

But suppose we changed the question a little bit. We still have the same premise, in that we're sampling from a population with a mean of 62,000 and a standard deviation of 32,000, but now the question is: if 100 employees are randomly selected, what is the probability their average salary exceeds $66,000? We're going to let X bar represent the average salary of those 100 employees, and we want to know the probability that X bar takes on a value greater than $66,000. Well, we previously had this notion that we can standardize that and call it a Z if we take X bar minus mu over sigma over the square root of n. The fundamental difference in this question, as opposed to the previous one, is that X bar is going to be approximately normally distributed, by the central limit theorem. The central limit theorem tells us that the sample mean will be approximately normally distributed in this spot, so we can come up with an approximate probability even though we don't know the real distribution of the salaries. And so this is going to be the probability that Z is greater than 66,000 minus 62,000 over sigma, which is 32,000, over the square root of the sample size, and this is the probability that Z is greater than 1.25. The central limit theorem tells me that X bar is approximately normal, which tells me that Z has approximately the standard normal distribution. For that probability we're going to go to our normal curve: here's zero, 1.25 is over here somewhere, and if we looked that up using software or a table we'd see that it is 0.106. So the central limit theorem has allowed me to say that this probability is approximately 0.106, even though we didn't know the distribution from which we were sampling. This is going to be an extremely helpful notion in a lot of spots. The world of statistics would be very, very different if there was no such thing as the central limit theorem.
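For reference, a minimal sketch of this calculation in Python, assuming SciPy is available; the variable names are just for illustration.

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 62_000, 32_000, 100
x_bar = 66_000

se = sigma / sqrt(n)     # standard error of the sample mean: 32,000 / 10 = 3,200
z = (x_bar - mu) / se    # (66,000 - 62,000) / 3,200 = 1.25

# By the CLT, Z is approximately standard normal, so use the normal survival function.
print(round(norm.sf(z), 3))   # 0.106
```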
Info
Channel: jbstatistics
Views: 458,694
Rating: 4.934175 out of 5
Keywords: central limit theorem, CLT, central, limit, theorem, sampling distribution of x bar, sampling distributions, histogram of sample means, jbstatistics, jb statistics, statistics, 8msl, 8 minute stats lectures, intro stats videos, intro stats help, stats help, stats tutor, jeremy balka, AP statistics, p value, p-value
Id: Pujol1yC1_A
Length: 13min 13sec (793 seconds)
Published: Fri Dec 28 2012