Let's look at t tests for the
population mean mu. These tests are appropriate when we
are sampling from a normally distributed population. Is there strong evidence that the population mean mu is different from some hypothesized value that is of interest to us? Here we'll be testing the null hypothesis
that the population mean mu is equal to some hypothesized value mu_0. The alternative hypothesis is that the
null hypothesis is wrong in some way, and we choose the alternative
from one of three possibilities: mu is less than mu_0, mu is greater than mu_0, or mu is not equal to mu_0. The first two are called one-sided alternatives, and the third is a two-sided alternative hypothesis. We should choose the two-sided alternative unless we have a strong reason to be
interested in only one of those sides. Suppose we are drawing a simple random sample of n observations from a normally distributed population. To test the null hypothesis that mu is equal to mu_0, we're going to use either a Z test or a t test. The choice of appropriate test statistic
boils down to whether the population standard deviation sigma is known or not. If sigma is known, which is very rare in reality, then we would use a Z test. The quantity sigma over the square root of n is the true standard deviation of the sampling distribution of X bar, and if the null hypothesis is true, the Z statistic
has the standard normal distribution. In practice, sigma is almost never known, and so we will typically use a t test. If sigma is not known, we estimate it with the sample standard deviation s. We call s divided by the square root of n the
standard error of the sample mean X bar, and the standard error of X bar is the
estimated standard deviation of the sampling distribution of the sample mean. When we replace sigma with s, something fundamental happens to the
underlying mathematics, and the resulting test statistic no longer has the standard normal distribution. If the null hypothesis is true, this test statistic
has a t distribution with n-1 degrees of freedom, and so in t tests, p-values and rejection regions are found from the t distribution with n-1 degrees of freedom.
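Written out, the two test statistics just described are

$$ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \qquad \text{and} \qquad t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}, $$

where under the null hypothesis Z has the standard normal distribution and t has a t distribution with n-1 degrees of freedom.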
Here I've plotted the standard normal distribution in white and a t distribution with 5 degrees of freedom in red. The t distribution is very similar to
the standard normal distribution, except that it has heavier tails and a lower peak. The exact shape depends on the degrees of freedom. Here we had five degrees of freedom,
but as the degrees of freedom increase, the t distribution tends toward the standard normal. So as the sample size increases, the t distribution gets closer and closer to the standard normal.
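To see that convergence numerically, here is a small Python sketch (using scipy, which isn't part of the original lesson) comparing the upper 2.5% point of the t distribution with the corresponding standard normal value as the degrees of freedom grow:

```python
from scipy import stats

# Upper 2.5% critical value of the t distribution for increasing degrees of freedom,
# compared with the standard normal value of roughly 1.96.
for df in (5, 10, 30, 100, 1000):
    print(df, round(stats.t.ppf(0.975, df), 3))
print("normal", round(stats.norm.ppf(0.975), 3))
```

The t critical value starts at about 2.571 for 5 degrees of freedom and shrinks toward 1.96 as the degrees of freedom increase.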
After we construct the appropriate test and get our test statistic, we're going to want to find a p-value. So let's go ahead and
look at a few examples of that. If we're testing the null hypothesis
that mu is equal to mu_0 using this test statistic and we're
sampling from a normally distributed population, then, if the null hypothesis is true, this test statistic has a t
distribution with n-1 degrees of freedom. The exact shape of the distribution
depends on the degrees of freedom, but I've plotted a t distribution here. Suppose we carry out our test and end up getting a
value of our test statistic of -1.31. So -1.31 is around there somewhere; that is the observed value of the test statistic. Since in this case our alternative hypothesis
is that mu is less than mu_0, values in the left tail of the distribution
give evidence against the null hypothesis in favor of that alternative, and so our p-value is going to be the
probability, under the null hypothesis, of getting what we observed,
or something even farther to the left. Or in other words, the area under the curve to the left of our test statistic is our p-value
for this alternative hypothesis. If on the other hand our alternative
hypothesis is that mu is greater than mu_0, then values in the right tail give us evidence against the null
in favor of that alternative. So suppose we again got a test statistic of -1.31 in our sample. Well, -1.31 is still over here somewhere, and since values in the right tail give
evidence against the null hypothesis, the p-value is the probability, under
the null hypothesis, of getting this value or something even
farther to the right. Or in other words, the area under the curve to the right
of our test statistic is our p-value. Suppose instead we had a two-sided
alternative hypothesis, and we get our sample data and end up finding a test statistic that is again -1.31. Well, -1.31 is over here somewhere, and so we are interested in the probability
of getting this value or something even more extreme. So this area is of interest to us, but we would have considered it just as much evidence
against the null hypothesis had we gotten a value on the other side, at +1.31, so that area is also of interest. And so the p-value for this two-sided alternative is going to be the area in the tail,
beyond the test statistic, doubled.
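As a rough sketch of those three p-value calculations in Python (the degrees of freedom below are an assumed value, since the sample size isn't specified in this illustration):

```python
from scipy import stats

t_obs = -1.31  # observed value of the test statistic
df = 15        # assumed degrees of freedom (n - 1 for a hypothetical sample of 16)

p_left = stats.t.cdf(t_obs, df)            # Ha: mu < mu_0 -- area to the left
p_right = stats.t.sf(t_obs, df)            # Ha: mu > mu_0 -- area to the right
p_two = 2 * stats.t.sf(abs(t_obs), df)     # Ha: mu != mu_0 -- tail area, doubled
print(p_left, p_right, p_two)
```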
Let's look at an example to illustrate. Does mild dehydration affect reaction times? Suppose it is known from a large body of past experience that young women have an average reaction time
of 0.95 seconds on a certain type of test. Twenty-five dehydrated young women take the test; they have an average reaction time of 1.00
seconds, with a standard deviation of 0.18 seconds. Does this yield strong evidence that the
true mean reaction time for dehydrated young women is different from 0.95 seconds? Here we might be interested in testing
the null hypothesis that the true mean reaction time for
dehydrated young women, mu, is equal to 0.95 seconds, against the alternative hypothesis that
it is different from 0.95. One could make the argument that
dehydration is likely to slow reaction times (pushing the mean above 0.95), but here let's play it safe and use a
two-sided alternative hypothesis. And suppose for the sake of argument that we feel an appropriate significance level here is 0.05.
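In symbols, the test being set up is

$$ H_0: \mu = 0.95 \text{ seconds} \qquad \text{versus} \qquad H_a: \mu \neq 0.95 \text{ seconds}, $$

carried out at the significance level alpha = 0.05.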
Before carrying out any of our
statistical inference procedures, we should plot our data and have a look. Here's a box plot of the reaction times for the 25 women. I've put in a red line representing
the hypothesized value of mu, 0.95 seconds. The observed mean in the sample, X bar, was 1.00, and we're going to carry out a test to
see whether this difference, between what was observed in the sample
and the hypothesized value, is a significant difference. Visually it doesn't seem like there's
much of a difference there, but let's see what the test says. The t test assumes that we are sampling
from a normally distributed population, and that assumption should be investigated. Here I've plotted a normal quantile-quantile plot, which
will be approximately a straight line if the data is approximately normally distributed. Here that's a pretty straight line, I'd say, so I'm going to give it the check mark, call it a reasonable normal quantile-quantile plot, and say it's okay to go ahead with the t test.
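If you want to draw that kind of plot yourself, here is a minimal Python sketch; the reaction times below are simulated placeholder values, not the study's actual 25 measurements, which aren't listed here:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated placeholder data standing in for the 25 reaction times (not the real measurements).
rng = np.random.default_rng(1)
times = rng.normal(loc=1.00, scale=0.18, size=25)

# Normal quantile-quantile plot: an approximately straight line suggests approximate normality.
stats.probplot(times, dist="norm", plot=plt)
plt.show()
```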
Recall that we are testing the null hypothesis that the population mean is 0.95 seconds, against the alternative hypothesis that
it's different from 0.95 seconds, at an alpha level of 0.05. The standard deviation s was calculated from the 25 observations in the sample, and since we are using the sample
standard deviation, we use a t test and not a Z test. The t test statistic is going to equal
the observed value of the sample mean, 1.00, minus the hypothesized value of 0.95 from the null hypothesis, divided by the standard error, s over the square root of n, which is 0.18 over the square root of 25. Rounded to three decimal places, this is equal to 1.389. Now we need to find the p-value, and the p-value is going to come from a t distribution with degrees of freedom equal to n-1,
which is going to be 24 in this case. Here's a t distribution with 24 degrees of freedom, and the observed value of our test statistic, 1.389,
is right around there somewhere. Our alternative hypothesis is two-sided, and so the p-value is this area in the tail,
beyond the test statistic, doubled. We can get that area using software or a t table. If we were to use software,
we could see that this area is 0.0888. If we didn't have access to software and had to use a t table,
we could only get an interval of values and say that the area is less than
0.10 but greater than 0.05. Using software, we would double this area and say that
the p-value is equal to 0.1776. Using a t table, we would
have to double the endpoints of that interval and say that the p-value lies between 0.10 and 0.20.
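Here is a short Python sketch (scipy again, not part of the original calculation) that reproduces these numbers from the summary statistics:

```python
from math import sqrt
from scipy import stats

xbar, mu0, s, n = 1.00, 0.95, 0.18, 25

t_stat = (xbar - mu0) / (s / sqrt(n))        # about 1.389
one_tail = stats.t.sf(t_stat, df=n - 1)      # upper-tail area, about 0.0888
p_value = 2 * one_tail                       # two-sided p-value, about 0.1776
print(t_stat, one_tail, p_value)
```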
If you recall, the smaller the p-value, the greater the
evidence against the null hypothesis. And the alpha level can be
considered a cut-off for significance. Since our p-value is greater than that alpha level, we'd say that the evidence against the null hypothesis is not significant at an alpha level of 0.05. But even if we did not have an alpha level given to us, we should still be able to come up with
a reasonable conclusion based on the p-value. A p-value of 0.1776 is not considered very small and does not give strong evidence against the null hypothesis. Now let's tie that into the problem at hand: there is not strong evidence (with a p-value of approximately 0.18) that the true mean reaction time
for dehydrated young women differs from 0.95 seconds. A few notes to finish. The young women
in this study were not a random sample. They were healthy young women who
volunteered for the study at a university. Anytime we don't have a random sample
from the population of interest, there could be strong biases introduced. And so drawing conclusions about
young women in general here is dubious at best. Also, to properly investigate the
effect of dehydration on reaction times, it would be far better to carry out a
comparative experiment. And in fact the authors of the study
that inspired this example did just that: they compared the reaction times of the
women when they were dehydrated to the reaction times of those same
women when they were not. On a related note, the one-sample t-test on means carried
out here is not typically as interesting as the two-sample t-test on means. t tests on means are most useful in comparing two groups and we will look in much greater detail
at the two-sample t-test a little later on.