Let's talk about the central limit theorem,
an extremely important concept in statistics. The gist of the central limit theorem is that the sample mean will be
approximately normally distributed for large sample sizes, regardless of the distribution from which we are sampling. And I'm going to illustrate this using simulation in a little bit. First let's recall a few characteristics
of the sampling distribution of the sample mean. Suppose we are sampling from a population
with mean mu and standard deviation sigma. Let X bar be a random variable representing the sample mean of n independently drawn observations from this distribution. We have previously learned that the mean of the sampling distribution of the sample mean
is equal to the population mean. In other words, the mean of the sampling distribution of
the sample mean is equal to the mean of the population from which we are sampling. We've also learned that the standard
deviation of the sampling distribution of X bar is equal to sigma over the square root of n. As previously discussed, if the population is normally distributed, then the sample mean X bar is also
normally distributed. But what if the population is not normal? The central limit theorem addresses this question. The distribution of the sample mean
tends toward the normal distribution as the sample size increases, regardless of the distribution from which we are sampling. Let's illustrate this through simulation. Here's an exponential distribution which is most definitely not normal. And what I'm going to do is I'm going to
draw a sample of 2 observations, so n is 2, I'm going to draw 2 independent
values from this distribution and get the mean. And I'm going to do that again and again and again a million times, and so we're going to get a million sample means where n is equal to 2. One thing to note right off the bat is that I'm allowing the y-axis scaling and the x-axis scaling to change from plot to plot. What we're interested in here is the shape of the distribution. The grey histogram here is a histogram
of those million sample means where n=2, and this is going to be approximately
the sampling distribution of X bar in this scenario. In this particular case we can mathematically work out the exact sampling distribution, but here I've done it through simulation. We can see that this distribution retains some
of the shape of the original distribution here: we've got some right skewness. It's most definitely not normal. With this white line I've superimposed a normal curve with the appropriate mean and variance. And what we can see here is that when n is 2 the sampling distribution of the sample
mean is not normal. Let's see what happens when I increase
the sample size. Here again I have drawn a million samples, and this time the sample size in each one of
those samples is 4, and I've plotted out those million
sample means in a histogram, and that is approximately the sampling
distribution of X bar when n is 4. Here again we still have some of that skewness and it's not quite normal, but it's getting there. Still quite a bit different from this
superimposed normal curve though. When n is 10 we're getting a little bit closer but we can still see some of that skewness. When n is 20 we're getting closer still. And when n is 50 this histogram of sample means, which is approximately the sampling distribution of X bar when n is 50, is pretty close to that superimposed normal curve. So for a sample size of 50 the sampling distribution of X bar is pretty close to normal. And if I did this for larger and larger sample sizes, we'd see the histogram getting closer and closer to that superimposed normal curve. What I was illustrating there is that when we are sampling from non-normal populations the distribution of the sample mean
tends toward the normal distribution as the sample size increases. As a very rough guideline the sample mean can be considered to be
approximately normally distributed if the sample size is at least 30, that is, if our n is at least 30. Again, this is a very rough guideline. We can easily construct scenarios in
which a sample size of a hundred trillion is not nearly enough to give us approximate normality, but in most practical situations when our sample size starts getting up beyond 30 the distribution of the sample mean will be approximately normal. Let's do another simulation for a different distribution. This one's a bit of a weird mixture type of distribution. So the same scenario as the last time, I'm going to randomly and independently
draw 2 observations from this distribution, calculate the sample mean, draw another 2 observations, calculate the sample mean, and do that a million times, and plot out a histogram. Here the grey histogram is approximately
the sampling distribution of X bar when n=2, and I've superimposed the normal curve again and obviously this distribution is not quite normal. Let's increase the sample size and see what happens. Note that in this series of plots I'm
keeping the scaling on the x-axis the same and letting the y-axis scaling change from plot to plot. Here I've done this a million times,
sampled from the original distribution a million times, for a sample size of 4, and plotted out the million resulting
sample means in a histogram, so the grey histogram is approximately the sampling distribution of X bar when n is equal to 4. And I've superimposed the normal curve with the appropriate mean and variance. And we can see here that the sampling
distribution of X bar is actually quite close to normal. And we'll see that when we let the sample size increase it's going to get closer and closer and
closer to that superimposed normal curve. Here when n is 10 the sampling
distribution of X bar looks quite normal. When n is 20 it's even closer to that normal curve. And when n is 50 it's looking very normal. And the sampling distribution of X bar
would get closer and closer and closer to a normal distribution as the sample size increases. Why is this important? The central limit theorem tells us
that many statistics have distributions that are approximately normal for large sample sizes, even when we are sampling from a distribution that is not normal. And this means that we can often use well-developed statistical inference procedures and probability calculations that are based on a normal distribution, even if we are sampling from a
population that is not normal, provided we have a large sample size. A little more formally, the central limit theorem tells us that our usual z score involving the sample mean, X bar minus mu over sigma over the square root of n, tends in distribution to the standard normal distribution as the sample size tends to infinity. We do have a couple of technical restrictions in that we need the mean and variance to be finite, but that's usually going to be the case
for the things we're dealing with. Let's see how the central limit theorem
might help us carry out a probability calculation. Suppose salaries at a very large
corporation have a mean of $62,000 and a standard deviation of $32,000. Our population mean mu is 62,000 and our population standard deviation sigma is 32,000. If a single employee is randomly selected, what is the probability their salary exceeds $66,000? Let's let the random variable X represent the salary of a randomly selected employee. What we want to find is the probability that X is greater than $66,000. We've previously done some probability calculations and said that Z is equal to X minus mu over sigma. So it might be tempting here to say that
this is equal to the probability that Z is greater than 66,000 minus 62,000 over 32,000 and this would be equal to the probability that Z is greater than 0.125. We haven't done anything wrong to this point, but if we were to find this probability by looking it up for the standard normal distribution,
that would be an error. Nowhere in this question does it say
that salaries are normally distributed, and we've learned previously that
salaries are typically not normally distributed. Salaries have some right skewness to them. So salaries are not normally distributed, nowhere in this question does it say anything about normality, and so this random variable X
is not going to have a normal distribution, which means that this random variable Z is not going to have a standard normal distribution, and so this question cannot be answered without further information about the distribution of X. But suppose we changed the question a little bit. We still have the same premise in that
we're sampling from a population with a mean of 62,000 and a standard deviation of 32,000. And here it changes to: if 100 employees are randomly selected, what is the probability their average
salary exceeds $66,000? And we're going to let X bar represent the
average salary of those 100 employees. And we want to know the probability that X bar
takes on a value greater than $66,000. Well we previously had this notion that
we can standardize that and call it a Z if we go X bar minus mu
over sigma over the square root of n. And here the fundamental difference in
this question as opposed to the previous question is that X bar is going to be approximately normally distributed,
by the central limit theorem. The central limit theorem tells us that
the sample mean will be approximately normally distributed in this spot, so we can come up with an approximate probability even though we don't know the real distribution of the salaries. And so this is going to be the probability that
Z is greater than 66,000 minus 62,000 over sigma, which is 32,000, over the square root of the sample size, which is 100. That's 4,000 over 3,200, so this is the probability that Z is greater than 1.25. The central limit theorem tells me that X bar is approximately normal, which tells me that Z has approximately
the standard normal distribution. And so to find this probability we're going to go to our normal curve. Here's zero, 1.25 is over here somewhere, and if we looked up the area to the right of 1.25 using software or a table
we'd see that it is 0.106. So the central limit theorem has allowed me to say that this probability here is approximately 0.106, even though we didn't know the distribution
from which we were sampling. This is going to be an extremely helpful notion in a lot of spots. The world of statistics would be very very different if there was no such thing as the central limit theorem.
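
If you'd like to check that salary calculation in software, here is a minimal Python sketch. The first function just computes the z value and the area to the right of it under the standard normal curve. The second is a rough simulation check, and it rests on an assumption that is not part of the example: I've made up a gamma distribution for salaries, with mean 62,000 and standard deviation 32,000, purely to have something right-skewed to sample from. The function names, the number of repetitions, and the seed are also just illustrative choices.

```python
import math
import random
from statistics import NormalDist

MU = 62_000         # population mean salary (from the example)
SIGMA = 32_000      # population standard deviation (from the example)
N = 100             # number of employees in the sample
THRESHOLD = 66_000  # we want P(X bar > 66,000)


def clt_tail_probability(mu, sigma, n, threshold):
    """Approximate P(X bar > threshold) using the central limit theorem."""
    standard_error = sigma / math.sqrt(n)   # sd of the sampling distribution of X bar
    z = (threshold - mu) / standard_error   # standardized value
    return z, 1 - NormalDist().cdf(z)       # area to the right of z


def simulated_tail_probability(mu, sigma, n, threshold, reps=5_000, seed=1):
    """Estimate P(X bar > threshold) by repeated sampling.

    ASSUMPTION (not from the example): salaries follow a gamma distribution
    whose shape and scale are chosen to match the stated mean and sd.
    """
    rng = random.Random(seed)
    shape = (mu / sigma) ** 2   # gamma mean = shape * scale
    scale = sigma ** 2 / mu     # gamma variance = shape * scale**2
    exceed = 0
    for _ in range(reps):
        sample_mean = sum(rng.gammavariate(shape, scale) for _ in range(n)) / n
        if sample_mean > threshold:
            exceed += 1
    return exceed / reps


z, approx = clt_tail_probability(MU, SIGMA, N, THRESHOLD)
simulated = simulated_tail_probability(MU, SIGMA, N, THRESHOLD)

print(f"z = {z:.2f}")
print(f"CLT approximation: {approx:.3f}")
print(f"Simulated (gamma assumption): {simulated:.3f}")
```

The first two lines printed are z = 1.25 and a CLT approximation of 0.106, matching the calculation above. The simulated proportion depends on the assumed salary distribution and on the seed, but with a sample size of 100 it should land in the same neighbourhood as the normal approximation.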