The following content is
provided under a Creative Commons license. Your support will help
MIT OpenCourseWare continue to offer high quality
educational resources for free. To make a donation, or to
view additional materials from hundreds of MIT courses,
visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: Good
afternoon, everybody. Welcome to Lecture 8. So we're now more than
halfway through the lectures. All right, the topic
of today is sampling. I want to start by reminding
you about this whole business of inferential statistics. We make inferences
about populations by examining one or more
random samples drawn from that population. We used Monte Carlo simulation
over the last two lectures. And the key idea
there, as we saw in trying to find
the value of pi, was that we can generate
lots of random samples, and then use them to compute
confidence intervals. And then we use the
empirical rule to say, all right, we really
have good reason to believe that 95% of the
time we run this simulation, our answer will be
between here and here. Well, that's all well and good
when we're doing simulations. But what happens when you want to
actually sample something real? For example, you
run an experiment, and you get some data points. And it's too hard to do
it over and over again. Think about political polls. Here was an interesting poll. How were these created? Not by simulation. They didn't run 1,000
polls and then compute the confidence interval. They ran one poll-- of 835 people, in this case. And yet they claim to have
a confidence interval. That's what that
margin of error is. Obviously they needed that
large confidence interval. So how is this done? Backing up for a minute,
let's talk about how sampling is done when you are not
running a simulation. You want to do what's called
probability sampling, in which each member of the population
has a non-zero probability of being included in a sample. There are, roughly
speaking, two kinds. We'll spend, really,
all of our time on something called
simple random sampling. And the key idea here is that
each member of the population has an equal probability of
being chosen in the sample so there's no bias. Now, that's not
always appropriate. I do want to take a
minute to talk about why. So suppose we wanted to survey
MIT students to find out what fraction of them are nerds-- which, by the way, I
consider a compliment. So suppose we wanted to
consider a random sample of 100 students. We could walk around campus and
choose 100 people at random. And if 12% of them
were nerds, we would say 12% of the MIT
undergraduates are nerds-- if 98%, et cetera. Well, the problem with that
is, let's look at the majors by school. This is actually the
majors at MIT by school. And you can see that they're
not exactly evenly distributed. And so if you went
around and just sampled 100 students
at random, there'd be a reasonably high probability
that they would all be from engineering and science. And that might give
you a misleading notion of the fraction of MIT
students that were nerds, or it might not. In such situations we do
something called stratified sampling, where we partition
the population into subgroups, and then take a simple random
sample from each subgroup. And we do that proportional
to the size of the subgroups. So we would certainly
want to take more students from engineering than
from architecture. But we probably want to
make sure we got somebody from architecture in our sample. This, by the way, is the way
most political polls are done. They're stratified. They say, we want to get
so many rural people, so many city people, so many
minorities-- things like that. And in fact, that's probably
where the recent election polls all got messed up. They did, retrospectively at least, a bad job of stratifying. So we use stratified
sampling when there are small groups,
subgroups, that we want to make sure are represented. And we want to represent them
proportional to their size in the population. This can also be used to reduce
the needed size of the sample. If we wanted to make sure
we got some architecture students in our
sample, we'd need to get more than 100
people to start with. But if we stratify, we
can take fewer samples. It works well when
you do it properly. But it can be tricky
to do it properly. And we are going to stick to
simple random samples here. All right, let's
look at an example. So this is a map of temperatures
in the United States. And so our running
example today will be sampling to get information
about the average temperatures. And of course, as you can
see, they're highly variable. And we live in one
of the cooler areas. The data we're going
to use is real data-- and it's in the zip file
that I put up for the class-- from the US Centers for
Environmental Information. And it's got the daily
high and low temperatures for 21 different American
cities, every day from 1961 through 2015. So it's an
interesting data set-- a total of about 422,000
examples in the dataset. So a fairly good sized dataset. It's fun to play with. All right, so we're sort of
in the part of the course where the next series of
lectures, including today, is going to be about data
science, how to analyze data. I always like to start by
actually looking at the data-- not looking at all
422,000 examples, but giving a plot
to sort of give me a sense of what the
data looks like. I'm not going to walk
you through the code that does this plot. I do want to point out that
there are two things in it that we may not
have seen before. Simply enough, I'm going
to use numpy.std to get standard deviations instead
of my own code for it. And random.sample to take
simple random samples from the population. random.sample takes
two arguments. The first is some sort
of a sequence of values. And the second is an
integer telling you how many samples you want. And it returns a list
containing that many randomly chosen
distinct elements. Distinct elements is important,
because there are two ways that people do sampling. You can do sampling
without replacement, which is what's done here. You take a sample, and then
it's out of the population. So you won't draw
it the next time. Or you can do sampling
with replacement, which allows you to draw the
same sample multiple times-- the same example multiple times. We'll see later in
the term that there are good reasons that
we sometimes prefer sampling with replacement. But usually we're doing
sampling without replacement. And that's what we'll do here. So we won't get Boston on April
3rd multiple times-- or, not the same year, at least. All right. So here's the histogram
the code produces. You can run it yourself
now, if you want, or you can run it later. And here's what it looks like. The daily high temperatures, the
mean is 16.3 degrees Celsius. I sort of vaguely know
what that feels like. And as you can see, it's kind
of an interesting distribution. It's not normal. But it's not that far, right? We have a little tail of these
cold temperatures on the left. And it is what it is. It's not a normal distribution. And we'll later see that
doesn't really matter. OK, so this gives me a sense. The next thing I'll
get is some statistics. So we know the mean is 16.3
and the standard deviation is approximately 9.4 degrees. So if you look at it,
you can believe that. Well, here's a histogram of
one random sample of size 100. Looks pretty different,
as you might expect. Its standard deviation
is 10.4, its mean 17.7. So even though the figures look
a little different, in fact, the means and standard
deviations are pretty similar. If we look at the population
mean and the sample mean-- and I'll try and be careful
to use those terms-- they're not the same. But they're in
the same ballpark. And the same is true of the
two standard deviations. Well, that raises
the question, did we get lucky or is this
something we should expect? If we draw 100 random
examples, should we expect them to correspond to
the population as a whole? And the answer is sometimes
yeah and sometimes no. And that's one of the issues
I want to explore today. So one way to see whether
it's a happy accident is to try it 1,000 times. We can draw 1,000 samples of
size 100 and plot the results. Again, I'm not going
to go over the code. There's something in
that code, as well, that we haven't seen before. And that's the ax.vline
plotting command. V for vertical. It just, in this case,
will draw a red line-- because I've said
the color is r-- at population mean
on the x-axis. So just a vertical line. So that'll just show
us where the mean is. If we wanted to draw
a horizontal line, we'd use ax.hline. Just showing you a couple
of useful functions. When we try it 1,000 times,
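Here's a minimal sketch of those two calls, assuming they're used through pylab the way pylab.errorbar is later on; the numbers are placeholders:

import pylab

pylab.hist([1, 2, 2, 3, 3, 3, 4], bins=4)   # some toy data
pylab.axvline(x=2.5, color='r')             # vertical red line at x = 2.5
pylab.axhline(y=1.0, color='b')             # horizontal blue line at y = 1.0
pylab.show()

When we try it 1,000 times,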
here's what it looks like. So here we see what we had
originally, same picture I showed you before. And here's what we
get when we look at the means of 100 samples. So this plot on the
left looks a lot more like it's a normal distribution
than the one on the right. Should that surprise
us, or is there a reason we should have
expected that to happen? Well, what's the answer? Someone tell me why we
should have expected it. It's because of the central
limit theorem, right? That's exactly what the
central limit theorem promised us would happen. And, sure enough, it's
pretty close to normal. So that's a good thing. And now if we look
at it, we can see that the mean of the
sample means is 16.3, and the standard deviation
of the sample means is 0.94. So if we go back to
what we saw here, we see that, actually,
when we run it 1,000 times and look at the
means, we get very close to what we had initially. So, indeed, it's not
a happy accident. It's something we can
in general expect. All right, what's the 95%
confidence interval here? Well, it's going to be 16.28
plus or minus 1.96 times 0.94, the standard deviation
of the sample means. And so it tells us that
the confidence interval is, the mean high
temperature, is somewhere between 14.5 and 18.1. Well, that's actually a
pretty big range, right? It's sort of enough to
where you wear a sweater or where you don't
wear a sweater. So the good news is it
includes the population mean. That's nice. But the bad news is
it's pretty wide. Suppose we wanted
a tighter bound. I said, all right, sure enough,
the central limit theorem is going to tell me the
mean of the means is going to give me a good estimate
of the actual population mean. But I want a tighter bound. What can I do? Well, let's think about a
couple of things we could try. Well, one thing we could think
about is drawing more samples. Suppose instead
of 1,000 samples, I'd taken 2,000
or 3,000 samples. We can ask the question,
would that have given me a smaller standard deviation? For those of you who
have not looked ahead, what do you think? Who thinks it will give you
a smaller standard deviation? Who thinks it won't? And the rest of you
have either looked ahead or refused to think. I prefer to believe
you looked ahead. Well, we can run the experiment. You can go to the code. And you'll see that there is
a constant of 1,000, which you can easily change to 2,000. And lo and behold, the standard
deviation barely budges. It got a little bit
bigger, as it happens, but that's kind of an accident. It just, more or
less, doesn't change. And it won't change if I go
to 3,000 or 4,000 or 5,000. It'll wiggle around. But it won't help much. What we can see is doing
that more often is not going to help. Suppose we take larger samples? Is that going to help? Who thinks that will help? And who thinks it won't? OK. Well, we can again
run the experiment. I did run the experiment. I changed the sample
size from 100 to 200. And, again, you can
run this if you want. And if you run it,
you'll get a result-- maybe not exactly this, but
something very similar-- that, indeed, as I increase
the size of the sample rather than the
number of the samples, the standard deviation
drops fairly dramatically, in this case from 0.94 to 0.66. So that's a good thing. I now want to digress a little bit before we come back to this, because this is a technique you'll want to use as you write papers and things like that: how do we visualize the variability of the data? And it's usually done with
something called an error bar. You've all seen
these things here. And this is one I took
from the literature. This is plotting pulse
rate against how much exercise you do or how
frequently you exercise. And what you can see here
is there's definitely a downward trend suggesting
that the more you exercise, the lower your
average resting pulse. That's probably worth knowing. And these error bars give us
the 95% confidence intervals for different subpopulations. And what we can see here is
that some of them overlap. So, yes, once a fortnight-- two weeks for those of you
who don't speak British-- it does get a little bit
smaller than rarely or never. But the confidence
interval is very big. And so maybe we really
shouldn't feel very comfortable that it would actually help. The thing we can say is that if
the confidence intervals don't overlap, we can conclude
that the means are actually statistically
significantly different, in this case at the 95% level. So here we see that
the more than weekly does not overlap with
the rarely or never. And from that, we can conclude
that this is actually, statistically true-- that if you exercise
more than weekly, your pulse is likely to be
lower than if you don't. If confidence
intervals do overlap, you cannot conclude that there
is no statistically significant difference. There might be, and
you can use other tests to find out whether there are. When they don't overlap,
it's a good thing. We can conclude
something strong. When they do overlap, we
need to investigate further. All right, let's
look at the error bars for our temperatures. And again, we can plot
those using something called pylab.errorbar. So what it takes is two sequences of values, the usual x values and y values, and then
it takes another list of the same length, or
sequence of the same length, which is the y errors. And here I'm just
going to say 1.96 times the standard deviations. Where these variables
come from you can tell by looking at the code. And then I can say
the format, I want an o to show the mean,
and then a label. Fmt stands for format. errorbar has different
keyword arguments than plot. You'll find that when you look at different plot types, like histograms and bar plots and scatter plots, they all have different
available keyword arguments. So you have to look
up each individually. But other than this,
everything in the code should look very
familiar to you.
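Roughly, the call looks like this-- a sketch with made-up values, not the actual means and standard deviations from the experiment:

import pylab

xVals = [50, 100, 200, 400, 600]         # sample sizes
means = [17.1, 16.5, 16.2, 16.4, 16.3]   # hypothetical sample means
stdDevs = [1.4, 1.0, 0.7, 0.5, 0.4]      # hypothetical standard deviations

# one point per sample size, with an error bar of 1.96 standard deviations,
# and an 'o' marking each mean
pylab.errorbar(xVals, means, yerr=[1.96*sd for sd in stdDevs],
               fmt='o', label='Mean and 95% confidence interval')
pylab.legend()
pylab.show()

And when I run the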
code, I get this. And so what I've plotted here
is the mean against the sample size with errorbars. And 100 trials, in this case. So what you can see is that,
as the sample size gets bigger, the errorbars get smaller. The estimates of the mean don't
necessarily get any better. In fact, we can look
here, and this is actually a worse estimate,
relative to the true mean, than the previous two estimates. But we can have more
confidence in it. The same thing we
saw on Monday when we looked at estimating
pi, dropping more needles didn't necessarily give us
a more accurate estimate. But it gave us more
confidence in our estimate. And the same thing
is happening here. And we can see that,
steadily, we can get more and more confidence. So larger samples
seem to be better. That's a good thing. Going from a sample size of
50 to a sample size of 600 reduced the confidence
interval, as you can see, from a fairly large
confidence interval here, which ran from just below 14 to almost
19, as opposed to 15 and a half or so to 17. I said confidence interval here. I should not have. I should have said
standard deviations. That's an error on the slides. OK, what's the catch? Well, we're now looking at
1,000 samples, each of size 600. So we've looked at a total of 600,000 examples. What has this bought us? Absolutely nothing. The entire population only contained about 422,000 examples. We might as well have
looked at the whole thing, rather than take a few of them. So it's like, you might
as well hold an election rather than ask 800
people a million times who they're going to vote for. Sure, it's good. But it gave us nothing. Suppose we did it only once. Suppose we took only one sample,
as we see in political polls. What can we
conclude from that? And the answer is actually
kind of surprising, how much we can conclude, in
a real mathematical sense, from one sample. And, again, this is
thanks to our old friend, the central limit theorem. So if you recall the
theorem, it had three parts. Up till now, we've
exploited the first two. We've used the fact
that the means will be normally distributed so that
we could use the empirical rule to get confidence
intervals, and the fact that the mean of
the sample means would be close to the
mean of the population. Now I want to use the
third piece of it, which is that the variance
of the sample means will be close to the variance
of the population divided by the sample size. And we're going to use
that to compute something called the standard error-- formally, the standard
error of the mean. People often just call
it the standard error. And I will be,
alas, inconsistent. I sometimes call it one,
sometimes the other. It's an incredibly
simple formula. It says the standard
error is going to be equal to sigma, where
sigma is the population standard deviation divided by
the square root of n, which is going to be the
size of the sample. And then there's just
this very small function that implements it. So we can compute
this thing called the standard error of the mean
in a very straightforward way. We can compute it. But does it work? What do I mean by work? I mean, what's the relationship
of the standard error to the standard deviation? Because, remember,
that was our goal, was to understand the
standard deviation so we could use the empirical rule. Well, let's test the
standard error of the mean. So here's a slightly
longer piece of code. I'm going to look at a bunch
of different sample sizes, from 25 to 600, 50 trials each. So getHighs is just a function
that returns the temperatures. I'm going to get the
standard deviation of the whole population, then
the standard error of the means and the sample standard
deviations, both. And then I'm just going
to go through and run it. So for size in sample sizes, I'm going to append the standard
error of the mean. And remember, that uses the
population standard deviation and the size of the sample. So I'll compute all the SEMs. And then I'm going to compute
all the actual standard deviations, as well. And then we'll produce
a bunch of plots-- or a plot, actually.
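In rough outline, the experiment looks like this-- a sketch that assumes temps holds the roughly 422,000 daily high temperatures (what getHighs returns), and leaves out the plotting:

import random
import numpy

def sem(popSD, sampleSize):
    # standard error of the mean: population standard deviation / sqrt(sample size)
    return popSD / sampleSize**0.5

sampleSizes = range(25, 625, 25)
numTrials = 50
popSD = numpy.std(temps)          # temps is assumed to be the list returned by getHighs

sems, sampleSDs = [], []
for size in sampleSizes:
    sems.append(sem(popSD, size))
    means = []
    for t in range(numTrials):
        sample = random.sample(temps, size)
        means.append(sum(sample)/len(sample))
    # standard deviation of the 50 sample means, to compare against the SEM
    sampleSDs.append(numpy.std(means))

All right, so let's see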
what that plot looks like. Pretty striking. So we see the blue solid line
is the standard deviation of the 50 means. And the red dotted line is the
standard error of the mean. So we can see, quite strikingly
here, that they really track each other very well. And this is saying
that I can anticipate what the standard deviation
would be by computing the standard error. Which is really useful,
because now I have one sample. I computed standard error. And I get something
very similar to the standard deviation I would get if I took 50 samples and looked at the standard deviation of the means of those 50 samples. All right, so not obvious that
this would be true, right? That I could use
this simple formula, and the two things would
track each other so well. And it's not a
coincidence, by the way, that as I get out
here near the end, they're really lying
on top of each other. As the sample size
gets much larger, they really will coincide. So one, does everyone
understand the difference between the standard deviation
and the standard error? No. OK. So how do we compute
a standard deviation? To do that, we have to
look at many samples-- in this case 50-- and we compute
how much variation there is among the means of those 50 samples. For the standard error,
we look at one sample, and we compute this thing
called the standard error. And we argue that we get the
same number, more or less, that we would have gotten had we
taken 50 samples or 100 samples and computed the
standard deviation. So I can avoid
taking all 50 samples if my only reason
for doing it was to get the standard deviation. I can take one sample
instead and use the standard error of the mean. So going back to
my temperature-- instead of having to
look at lots of samples, I only have to look at one. And I can get a
confidence interval. That make sense? OK. There's a catch. Notice that the formula
for the standard error includes the standard
deviation of the population-- not the standard deviation of the sample. Well, that's kind of a bummer. Because how can I get
the standard deviation of the population
without looking at the whole population? And if we're going to look
at the whole population, then what's the point of
sampling in the first place? So we have a catch, that
we've got something that's a really good approximation, but
it uses a value we don't know. So what should we do about that? Well, what would be, really,
the only obvious thing to try? What's our best guess at
the standard deviation of the population if we have
only one sample to look at? What would you use? Somebody? I know I forgot to
bring the candy today, so no one wants to
answer any questions. AUDIENCE: The standard
deviation of the sample? PROFESSOR: The standard
deviation of the sample. It's all I got. So let's ask the question,
how good is that? Shockingly good. So I looked at our example
here for the temperatures. And I'm plotting the
sample standard deviation versus the population
standard deviation for different sample sizes,
ranging from 0 to 600 by one, I think. So what you can see here is
when the sample size is small, I'm pretty far off. I'm off by 14% here. And I think that's 25. But when the sample
size is larger, say 600, I'm off by about 2%. So what we see, at least for this data set of temperatures-- if the sample size is large enough, the sample standard deviation is a pretty good approximation of the population standard deviation.
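A sketch of the kind of comparison behind that plot, again assuming temps holds the daily highs, and averaging over a few trials at each size to smooth out the randomness:

import random
import numpy

popSD = numpy.std(temps)
for size in (25, 100, 300, 600):
    diffs = []
    for t in range(100):
        sample = random.sample(temps, size)
        # relative difference between the sample and population standard deviations
        diffs.append(abs(numpy.std(sample) - popSD)/popSD)
    print(size, 'examples:', round(100*sum(diffs)/len(diffs), 1), 'percent off, on average')

Well. Now we should ask the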
question, what good is this? Well, as I said, once the sample
reaches a reasonable size-- and we see here, reasonable is
probably somewhere around 500-- it becomes a good approximation. But is it true only
for this example? The fact that it
happened to work for high temperatures
in the US doesn't mean that it will always be true. So there are at least two
things we should consider in asking the question,
when will this be true, when won't it be true. One is, does the distribution
of the population matter? So here we saw, in
our very first plot, the distribution of
the high temperatures. And it was kind of symmetric
around a point-- not perfectly. But not everything
looks that way, right? So we should say,
well, suppose we have a different distribution. Would that change
this conclusion? And the other
thing we should ask is, well, suppose we had a
different sized population. Suppose instead of
400,000 temperatures I had 20 million temperatures. Would I need more than 600
samples for the two things to be about the same? Well, let's explore
both of those questions. First, let's look at
the distributions. And we'll look at three
common distributions-- a uniform distribution,
a normal distribution, and an exponential distribution. And we'll look at each of
them for, what is this, 100,000 points. So we know we can generate
a uniform distribution by calling random.random. Gives me a uniform
distribution of real numbers between 0 and 1. We know that we can generate
our normal distribution by calling random.gauss. In this case, I'm looking
at it with a mean of 0 and a standard deviation of 1. But as we saw in
the last lecture, the shape will be the same,
independent of these values. And, finally, an
exponential distribution, which we get by calling
random.expovariate. And this number, 0.5, is something called lambda, which has to do with how quickly the exponential either decays or goes up, depending upon which direction. And I'm not going to give you the formula for it at the moment. But we'll look at the pictures. And we'll plot each of these discrete approximations to these distributions.
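A minimal sketch of generating those three discrete approximations-- the styling of the actual plots in the handout differs:

import random
import pylab

numPoints = 100000
uniform = [random.random() for _ in range(numPoints)]              # uniform on [0, 1)
normal = [random.gauss(0, 1) for _ in range(numPoints)]            # mean 0, standard deviation 1
exponential = [random.expovariate(0.5) for _ in range(numPoints)]  # lambda = 0.5

for vals, name in ((uniform, 'Uniform'), (normal, 'Gaussian'), (exponential, 'Exponential')):
    pylab.figure()
    pylab.hist(vals, bins=100)
    pylab.title(name)
pylab.show()

So here's what they look like. Quite different, right? We've looked at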
uniform and we've looked at Gaussian before. And here we see an
exponential, which basically decays and will asymptote
towards zero, never quite getting there. But as you can see,
it is certainly not very symmetric around the mean. All right, so let's
see what happens. If we run the experiment on
these three distributions, each with 100,000 examples, and look at different sample sizes, we actually see that the difference between the sample standard deviation and the population standard deviation is not the same for each of them. We see, down here-- this looks kind of like
what we saw before. But the exponential one
is really quite different. You know, its worst
case is up here at 25. The normal is about 14. So that's not too surprising,
since our temperatures were kind of
normally distributed when we looked at it. And the uniform is, initially,
a much better approximation. And the reason
for this has to do with a fundamental difference
in these distributions, something called skew. Skew is a measure of the
asymmetry of a probability distribution. And what we can see here is
that skew actually matters. The more skew you
have, the more samples you're going to need to
get a good approximation. So if the population is
very skewed, very asymmetric in the distribution, you
need a lot of samples to figure out what's going on. If it's very uniform,
as in, for example, the uniform population, you
need many fewer samples. OK, so that's an
important thing. When we go about deciding
how many samples we need, we need to have some estimate
of the skew in our population. All right, how about size? Does size matter? Shockingly-- at least it
was to me the first time I looked at this--
the answer is no. If we look at this-- and I'm
looking just for the uniform distribution, but we'll see
the same thing for all three-- it more or less doesn't matter. Quite amazing, right? If you have a bigger population,
you don't need more samples. And it's really almost
counterintuitive to think that you don't need
any more samples to find out what's going to happen if you
have a million people or 100 million people. And that's why, when we look
at, say, political polls, they're amazingly small. They poll 1,000 people and
claim they're representative of Massachusetts. This is good news. So to estimate the mean
of a population, given a single sample, we
choose a sample size based upon some estimate
of skew in the population. This is important, because
if we get that wrong, we might choose a sample
size that is too small. And in some sense,
you always want to choose the smallest
sample size you can that will give you an
accurate answer, because it's more economical to have small
samples than big samples. And I've been
talking about polls, but the same is true
in an experiment. How many pieces of
data do you need to collect when you run
an experiment in a lab. And how much will depend,
again, on the skew of the data. And that will help you decide. When you know the size,
you choose a random sample from the population. Then you compute the mean
and the standard deviation of that sample. And then use the standard
deviation of that sample to estimate the standard error. And I want to emphasize that
what you're getting here is an estimate of the standard
error, not the standard error itself, which would require you
to know the population standard deviation. But if you've chosen the
sample size to be appropriate, this will turn out to
be a good estimate. And then once
we've done that, we use the estimated
standard error to generate confidence intervals
around the sample mean. And we're done. Now this works
great when we choose independent random samples. And, as we've seen
before, if you
independent samples, it doesn't work so well. And, again, this is an issue
where if you assume that, in an election, each
state is independent of every other state,
you'll get the wrong answer, because they're not. All right, let's go back
to our temperature example and pose a simple question. Are 200 samples enough? I don't know why I chose 200. I did. So we'll do an experiment here. This is similar to an
experiment we saw on Monday. So I'm starting with the
number of mistakes I make. For t in range of the number of trials, the sample will be random.sample of the temperatures, with the sample size as the second argument. This is a key step.
this, I messed it up. And instead of doing
this very simple thing, I did a more complicated
thing of just choosing some point in my
list of temperatures and taking the next
200 temperatures. Why did that give
me the wrong answer? Because it's organized by city. So if I happen to choose
the first day of Phoenix, all 200 temperatures
were Phoenix-- which is not a very
good approximation of the temperature in
the country as a whole. But this will work. I'm using random.sample. I'll then get the sample mean. Then I'll compute my estimate
of the standard error, using the standard deviation of the sample, as seen here. And then if the absolute value of the population mean minus the sample mean is more
than 1.96 standard errors, I'm going to say I messed up. It's outside. And then at the end,
I'm going to look at the fraction outside the 95% confidence intervals.
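Here is a sketch of that loop. It assumes temps (the temperatures), popMean (the population mean), sampleSize of 200, and numTrials have already been set up; the real code is in the handout:

import random
import numpy

numBad = 0
for t in range(numTrials):
    sample = random.sample(temps, sampleSize)
    sampleMean = sum(sample)/sampleSize
    # estimate the standard error from this one sample's standard deviation
    se = numpy.std(sample)/sampleSize**0.5
    if abs(popMean - sampleMean) > 1.96*se:
        numBad += 1
print('Fraction outside 95% confidence interval =', numBad/numTrials)

And what do I hope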
it should print? What would be the perfect
answer when I run this? What fraction should
lie outside that? It's a pretty
simple calculation. Five percent, right? Because if they all
were inside, then I'm being too conservative
in my interval, right? I want 5% of the tests to fall
outside the 95% confidence interval. If I wanted fewer,
then I would look at three standard deviations. Instead of 1.96, then I
would expect less than 1% to fall outside. So this is something we have
to always keep in mind when we do this kind of thing. If your answer is too
good, you've messed up. Shouldn't be too bad, but it
shouldn't be too good, either. That's what probabilities
are all about. If you called every
election correctly, then your math is wrong. Well, when we run this,
we get this lovely answer, that the fraction outside
the 95% confidence interval is 0.0511. That's exactly-- well,
close to what you want. It's almost exactly 5%. And if I run it
multiple times, I get slightly different numbers. But they're all in that
range, showing that, here, in fact, it really does work. So that's what I want to say,
and it's really important, this notion of the
standard error. When I talk to other
departments about what we should cover in 6.0002, about the only thing everybody agreed on was that we should
talk about standard error. So now I hope I have
made everyone happy. And we will talk
about fitting curves to experimental data
starting next week. All right, thanks a lot.