Let's look at an introduction to the Student t distribution, often shortened to simply the t distribution. This video is a little light on mathematical details, so if you're looking for how the t
distribution arises mathematically, or its pdf, I go through that in another video. Suppose we are about to draw a
random sample of n observations from a normally distributed population. We've previously learned that the quantity X bar minus mu over sigma over the square root of n has
the standard normal distribution. And we typically label that with the letter Z. Previously, we've used this notion to construct a confidence interval for the population mean mu. But in practice we encounter a problem,
and that problem is that we don't know the value of the population standard deviation sigma. Sigma is a parameter, the standard
deviation for the entire population, and we don't typically know its value, so
we can't use that value in a formula. So we do the next best thing, and instead
of using the population standard deviation, we're going to use our sample standard deviation to estimate it. That gives us the statistic X bar minus mu over s over the square root of n,
where s is our sample standard deviation. But something very fundamental has changed here. Sigma is a constant, but we don't know its value, so we use s, which is a statistic.
This statistic s has a sampling distribution: its value varies from sample to sample. And so this quantity down here no longer has the standard normal distribution. We label this quantity t, because it has a t distribution. When we are sampling from a normally distributed population, the quantity X bar minus mu over s over the square root of n has the t distribution with n-1 degrees of freedom. The concept of degrees of freedom can be a bit of a tricky one, so I'm not going to get into the details here. But the degrees of freedom for the t are n-1, and if you recall, when we calculated our sample
variance s squared, we divided by n-1. Those two notions are very much tied together. What does the t distribution look like? We'll look at that in a moment, but if we look at this statistic, it looks very much like our Z statistic,
which has the standard normal distribution, except we've replaced the population standard deviation with the sample standard deviation. We are estimating a parameter with a statistic, so there is greater variability.
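If you'd like to see that extra variability for yourself, here is a small simulation sketch in Python. It is not something from the video: the population values (mu = 10, sigma = 2), the sample size of 5, the seed, and the function name simulate_tail_fractions are all assumptions chosen for illustration. It draws many samples from a normal population, computes both the Z statistic (using sigma) and the t statistic (using s), and counts how often each lands outside plus or minus 1.96.

```python
import math
import random
import statistics


def simulate_tail_fractions(n=5, reps=20000, mu=10.0, sigma=2.0, seed=1):
    """Fraction of simulated Z and t statistics falling outside +/-1.96.

    All defaults are illustrative assumptions, not values from the video.
    """
    rng = random.Random(seed)
    z_outside = 0
    t_outside = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)  # sample standard deviation (divides by n-1)
        z = (xbar - mu) / (sigma / math.sqrt(n))  # uses the known sigma
        t = (xbar - mu) / (s / math.sqrt(n))      # uses the estimate s
        z_outside += abs(z) > 1.96
        t_outside += abs(t) > 1.96
    return z_outside / reps, t_outside / reps


z_frac, t_frac = simulate_tail_fractions()
print(f"Z outside +/-1.96: {z_frac:.3f}")
print(f"t outside +/-1.96: {t_frac:.3f}")
```

With a sample size this small, the Z fraction should sit near 0.05, while the t fraction should come out noticeably larger, which is the greater variability showing up in the tails.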
So our t distribution is going to look a lot like the standard normal distribution,
except with greater variance. Here's a plot of the standard normal distribution in white and a t distribution with one degree of freedom in red. We can see that both distributions are
symmetric about zero and bell-shaped, but the t distribution has heavier tails and a lower peak. The exact shape of the t distribution depends on
the degrees of freedom. A very fundamental point here is that
as the degrees of freedom increase, the t distribution tends toward the standard normal distribution. So I'm going to let the degrees of freedom increase
and let's see what happens. As the degrees of freedom increase here, we see the red curve getting closer and closer and closer
to the white curve. Or in other words, as the degrees of freedom increase, the t distribution is tending towards the
standard normal distribution. I've stopped it here at 20 degrees of freedom, and the curves might look close, but if
we look very closely we would see that the t distribution still has slightly
heavier tails and a slightly lower peak. But if I let those degrees of freedom continue to increase, the t distribution is going to get closer and closer and closer to the standard normal distribution. This has some implications for us in statistical inference. Here I'm going to look at constructing a 95% confidence interval, but the same notion would hold in many other situations as well. If we are sampling from a normally distributed population, and we happen to know the value of the
population standard deviation sigma, then we've discussed previously that the appropriate formula
for our confidence interval is X bar plus or minus 1.96 times sigma over the square root of n. This 1.96 comes from the standard normal distribution, which I've drawn in down here. If we want a 95% confidence interval, then we put an area of 0.95 in the middle and divide the remaining area of 0.05 evenly between the two tails, putting 0.025 in the right tail and 0.025 in the left tail. We call the value with an area of 0.025 to its right z_.025. That value is 1.96, which we've encountered previously and which we can find from the standard normal table or software. But if sigma is not known, then we can't use it in our confidence interval formula, and we would have to replace it with the
sample standard deviation. But then we should no longer use 1.96. We shouldn't use a value based on the standard normal distribution; we need to use a value based on the t distribution. So down here I've drawn in a t distribution, and we use the same logic: we want to put 95% of the area in the middle
and split up the remaining area evenly into the two tails. So what we want to find from this t distribution is the t value that has an area of 0.025 to its right. Because the t distribution has greater area in the tails and greater variability than the standard normal distribution, this t value will be greater than 1.96. How much greater? Well, that depends on the degrees of freedom, because the shape of the t distribution depends on
the degrees of freedom. But let's look at a few values. Here I have a table with the appropriate t
value for various degrees of freedom. The first column has the sample size n. The second column has the degrees of freedom, which are n-1 for the case we're discussing today. The third column has the appropriate t value for a 95% confidence interval, which can be found from a t table or software. Take note that at infinite degrees of freedom we get our z value of 1.96, that is, our z_.025 value, and that's because a t distribution with
infinite degrees of freedom is the same as the standard normal distribution. But if we look up here with five degrees of freedom, we see that the t value is 2.571, which is quite a bit bigger than the 1.96
value from the standard normal distribution. As the degrees of freedom increase, the t distribution is getting closer and
closer and closer to the standard normal distribution, so these t values are getting closer and
closer and closer and closer to 1.96, the value from the standard normal distribution. Some sources go so far as to say that if
the sample size is greater than 30, just forget all about the t distribution
and use the standard normal distribution. But if you take statistics from me, forget you ever heard such a notion. If we look here at 30 degrees of freedom, we see that the t value is 2.042, which to me at least is quite a bit
bigger than the z value of 1.96. Even at 100 degrees of freedom, the t value is still a little bit different from 1.96. And so if we use this z value when we
should be using the t value, our calculated margin of error will be smaller than it should be. If we are sampling from a normally distributed population and we are using a standard deviation
that is based on our sample's data, then we should be using values from the t distribution and not the standard normal distribution, regardless of the sample size.
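The t values quoted above (2.571 at 5 degrees of freedom, 2.042 at 30) can be reproduced without a table. What follows is only a rough sketch in Python using the standard library, not how the table in the video was made. The helper names t_pdf, t_upper_tail and t_critical are made up for this illustration, and the approach (Simpson's rule on the standard t density, then bisection) is one assumed way of doing it; statistical software would normally give you these values directly.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """Area to the right of t (for t >= 0), via Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # By symmetry, half the area lies to the right of zero.
    return 0.5 - total * h / 3


def t_critical(df, tail_area=0.025):
    """t value with the given area to its right, found by bisection."""
    lo, hi = 0.0, 50.0  # assumed search bracket, wide enough for df >= 1
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > tail_area:
            lo = mid  # still too much area in the tail: move right
        else:
            hi = mid
    return (lo + hi) / 2


z = NormalDist().inv_cdf(0.975)  # z_.025 from the standard normal
for df in (5, 30, 100):
    print(f"df = {df:3d}: t = {t_critical(df):.3f}")
print(f"z_.025   : {z:.3f}")
```

Running this, the t values shrink toward the z value as the degrees of freedom grow but stay above it at every finite df, which is why swapping in 1.96 understates the margin of error.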