An Introduction to the Hypergeometric Distribution

Captions
Let's look at an introduction to the hypergeometric distribution, another important discrete probability distribution. I'm going to assume that you know the combinations formula, also known as the binomial coefficient, both its meaning and how to calculate it, because it's going to play a big role in the hypergeometric distribution. If you don't recognize this, you should look into it before watching this video. I'm also going to assume that you've previously been introduced to the binomial distribution, because I'm going to be comparing the binomial and hypergeometric distributions here. Let's look at an example to start. An urn contains 6 red balls and 14 yellow balls. (These types of urn and balls problems are classic hypergeometric problems.) Five balls are randomly drawn without replacement. What is the probability exactly 4 red balls are drawn? An important point to note here is that the sampling is done without replacement, and by that I mean once a ball is chosen we look at the color and set it aside and it cannot be chosen again. And that implies that the trials are not independent. Knowing what happens on one trial gives some information about the probabilities on other trials. Because the trials are not independent, the binomial distribution would not be appropriate here. Before we take a formal look at the hypergeometric distribution, let's calculate this probability by thinking through the underlying logic. If we are randomly selecting five balls, then any sample of five balls is equally likely. The probability of getting exactly four red balls is the number of samples that result in exactly four red balls and one yellow ball (since we're picking five balls and four must be red, one must be yellow), divided by the total number of possible samples of size 5. Recall that there were 6 red balls and 14 yellow balls, for 20 balls in total. The total number of possible samples then is 20 choose 5. 
This is the combinations formula, the number of ways of picking 5 balls from 20. In the numerator we need the number of ways of getting four red balls and one yellow ball. There are 6 red balls, and from those we must pick 4, and there are 14 yellow balls, and from those we must pick 1, and so 6 choose 4 times 14 choose 1 over 20 choose 5. If we use our combinations formula properly here we'll see that this works out to 15 times 14 over 15,504. To 5 decimal places this is 0.01354. It's not appropriate to use the binomial distribution here. Since the sampling is done without replacement, the trials are not independent. The probability of getting a red ball will change from trial to trial depending on what happened in other draws. For example, on the first draw, since there are 6 red balls and 14 yellow the probability of getting a red ball on that first draw is 6 out of 20 or 0.3. But suppose the first draw is a red ball. Then on the second draw the probability of getting a red ball is now only 5 out of 19. There's 5 red balls left out of 19 total, and that's a little less than 0.3. And so the probability of success on any individual trial depends on what has happened on the other trials. The trials are not independent and independence is one of the necessary conditions for the binomial distribution to hold. Suppose instead that the sampling had been done with replacement, meaning that when a ball is chosen we look at the color and count it, but place it back in the urn so that it might be chosen again. The probability of getting a red ball on any given trial is simply 6 out of 20, regardless of what happened on the other trials. And since the trials would be independent here, the binomial distribution would be appropriate. I'm not going to go into the details, but this would be the appropriate method of calculating the probability using the binomial formula. (You can see my video on the binomial distribution for more information.) 
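The two urn calculations above can be checked with a short Python sketch. The function names are illustrative and do not come from the video; the only assumption is Python 3.8 or later, for `math.comb`.

```python
from math import comb


def prob_without_replacement(red, yellow, draws, red_wanted):
    """Probability of exactly `red_wanted` red balls when drawing without replacement."""
    # Favourable samples: choose the red balls, then fill the rest with yellow balls.
    favourable = comb(red, red_wanted) * comb(yellow, draws - red_wanted)
    # Every sample of size `draws` from the whole urn is equally likely.
    return favourable / comb(red + yellow, draws)


def prob_with_replacement(red, yellow, draws, red_wanted):
    """Same question with replacement: independent trials, so the binomial formula applies."""
    p = red / (red + yellow)  # constant probability of red on every draw
    return comb(draws, red_wanted) * p**red_wanted * (1 - p) ** (draws - red_wanted)


without = prob_without_replacement(6, 14, 5, 4)
with_repl = prob_with_replacement(6, 14, 5, 4)
print(f"without replacement: {without:.5f}")  # 0.01354
print(f"with replacement:    {with_repl:.5f}")  # 0.02835
```

The first function is just the counting argument, 6 choose 4 times 14 choose 1 over 20 choose 5; the second uses p = 6/20 on every draw.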
To 5 decimal places this works out to 0.02835. Compare that to the probability we found previously using the hypergeometric distribution, when the sampling was done without replacement. There we found a probability of 0.01354. The probability found from the binomial distribution with replacement is actually quite a bit different from that. So if we were to mistakenly use the binomial distribution in the without replacement case our calculated probability would be quite a bit off. In our hypergeometric distribution example we simply thought through the problem and came up with the probability using some logic. But now let's give a little more formal introduction to the hypergeometric distribution. Suppose we are randomly sampling n objects without replacement from a source that contains a successes and capital N - a failures. There are capital N objects altogether. There are only two types of objects: the successes, and there are a of those, and the failures, and there are N-a of those. And we're going to let the random variable X represent the number of successes in the sample. Then the random variable X has the hypergeometric distribution, with this probability mass function. The probability the random variable X takes on the value little x, which I'll sometimes write as p(x), is equal to this quantity. In order to get little x successes from the total number of successes a, we must pick x of them, and from the N-a failures we must choose n-x of those, and the denominator is simply the total number of samples, the number of ways of picking little n objects from capital N objects total. What values can X take on here? What are the possible numbers of successes? Well, it's the number of successes, so it can only take on whole number values. The minimum and maximum numbers are a little bit messy.
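The formula itself appears on the slide and not in the captions. In the symbols used above (a successes, N - a failures, sample size n), the standard form of the hypergeometric probability mass function, together with the range of possible values, is:

```latex
P(X = x) = p(x) = \frac{\binom{a}{x}\,\binom{N-a}{n-x}}{\binom{N}{n}},
\qquad
x = \max\bigl(0,\; n-(N-a)\bigr),\ \ldots,\ \min(a,\; n)
```

For the urn example, a = 6, N = 20, n = 5 and x = 4, which gives 6 choose 4 times 14 choose 1 over 20 choose 5, the same expression found earlier by counting.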
The number of successes in the sample can't possibly take on a value bigger than the number of objects we are choosing, so it couldn't possibly be bigger than n, and it also can't take on a value bigger than the number of successes in the population, so the maximum value X can take on is going to be the minimum of a and n. As for the minimum, we know that the number of successes can't possibly be less than 0, but it also can't be less than this quantity. To make that a little easier to see I'm going to write this as little n - (N-a), and this is the number of objects we are sampling minus the total number of failures in the population. The number of successes has to be at least that quantity. So the minimum value X can take on is the maximum of 0 and this quantity. The mean of a hypergeometric distribution is equal to n times the number of successes a over the total number of objects, capital N. In other words, n times the proportion of successes in the population. Note that this looks a little bit like np, which is the mean of a binomial random variable, and it's pretty much the same thing here. There's also a formula for the variance, but it's a little ugly and I'm going to leave it out here. If you need it you can easily look it up. Let's look at a different example. Suppose a large high school has 1100 female students and 900 male students for 2000 students in total. A random sample of 10 students is drawn and we want to find the probability that exactly seven of the selected students are female. Here, although it doesn't state it explicitly, it's implied that the sampling is done without replacement. If, say, your boss asked you to get a sample of 10 people and you come back with 2 people and tell your boss you sampled one of them 6 times and the other one 4 times, you'll likely be looking for another job in the very near future.
If we let the random variable X represent the number of female students selected, then we need to find the probability that the random variable X takes on the value 7. Again, I strongly recommend in this type of problem that you don't try to put the values into the formula and you simply try to think it through logically. Most people find it easier to find the correct probabilities that way. Here the denominator is going to be the total number of possible samples, and we are picking 10 students from 2000, and so the denominator is going to be 2000 choose 10. The numerator is the number of ways of getting exactly 7 female students, and from those 1100 female students we must pick 7. But we're not done yet. In order to get exactly seven female students in a sample of 10, we must also pick 3 male students, so from the 900 male students we must choose 3. The probability of getting exactly 7 females is 1100 choose 7 times 900 choose 3 divided by 2000 choose 10. Note that 1100 plus 900 is equal to 2000, and 7 plus 3 is equal to 10. This is not a coincidence, and it will work out like that if done properly, and so that can be a useful double check on your calculations. To 6 decimal places this works out to 0.166490. If you feel the need to use the formula for the probability mass function on the previous slide, then capital N represents the total number of objects, or here the total number of students, and we had 2000 students. Little n represents the number of objects that we're sampling and that is 10. a is the total number of successes in the population, and since we're counting up the number of females we're calling getting a female student a success, and the total number of female students is 1100. If we put all of these into the formula for the probability mass function from the previous slide we'd get what we have over here. It might be informative to try that once, but most people find it easier if we just think it through logically and not rely on the formula. 
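For anyone who does want to try the formula once, here is a minimal Python sketch of the probability mass function applied to the school example. The function name `hypergeom_pmf` and its argument order are assumptions made for illustration; the symbols N, n, and a follow the transcript.

```python
from math import comb


def hypergeom_pmf(x, N, n, a):
    """P(X = x) for x successes in n draws without replacement
    from N objects, of which a are successes."""
    # Outside the possible range of values the probability is zero.
    if x < max(0, n - (N - a)) or x > min(a, n):
        return 0.0
    # Pick x of the a successes and n - x of the N - a failures,
    # divided by the number of ways of picking n objects from N.
    return comb(a, x) * comb(N - a, n - x) / comb(N, n)


# School example: N = 2000 students, n = 10 sampled, a = 1100 female students.
p_seven = hypergeom_pmf(7, N=2000, n=10, a=1100)
print(f"{p_seven:.6f}")  # the transcript reports 0.166490

# The mean n * a / N can be checked by summing x * p(x) over the possible values.
mean = sum(x * hypergeom_pmf(x, 2000, 10, 1100) for x in range(11))
print(f"{mean:.2f}")  # n * a / N = 10 * 1100 / 2000 = 5.50
```

Python integers are exact at any size, so `comb(2000, 10)` is computed without overflow and only the final division is done in floating point.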
What if we had ignored the fact that the sampling was done without replacement, and we used the binomial distribution instead? What if we simply said the probability of getting a female student on any given trial is simply the number of female students we had, 1100, over the total number of students, and we said that that was 0.55 on each trial, and we ignored the fact that that's changing from trial to trial. If we put this into the binomial formula we would see that this works out to 0.166478. But since the sampling was done without replacement, that is not the correct probability. Recall that when we used the hypergeometric distribution on the last page, we found that the correct probability was 0.166490. Wait a minute, these two probabilities are pretty close. The incorrect one calculated with the binomial distribution is pretty darn close to the correct probability for this example. And that leads us to this point: the binomial distribution can sometimes be used to provide a reasonable approximation to the hypergeometric distribution. In most cases it will provide a reasonable approximation if we're not sampling a very large proportion of the population. And as a very rough guideline if we are not sampling more than 5% of the population, the binomial distribution would provide a reasonable approximation. Why would we want to use the binomial distribution as an approximation? Why wouldn't we simply use the hypergeometric distribution if it's the appropriate distribution? Well it turns out that in some cases the binomial distribution is easier to work with. In some probability calculations and statistical inference scenarios, the true underlying reality might imply a hypergeometric distribution, but the binomial distribution might provide a very good approximation, and might be much easier to work with. If we look back at this example we were sampling only 10 people out of 2000 total which is 0.5%.
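The effect of the sampling fraction can be seen by putting the two examples from this video side by side. The following Python sketch is illustrative only; the helper names and the `examples` list are assumptions, and the 5% figure is the rough guideline quoted above, not a hard rule.

```python
from math import comb


def hypergeom_pmf(x, N, n, a):
    """Exact probability when sampling n of N objects without replacement."""
    return comb(a, x) * comb(N - a, n - x) / comb(N, n)


def binom_pmf(x, n, p):
    """Binomial probability, which assumes a constant success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# (label, N, n, a, x) for the two examples discussed in the transcript.
examples = [
    ("urn", 20, 5, 6, 4),
    ("school", 2000, 10, 1100, 7),
]

results = {}
for label, N, n, a, x in examples:
    exact = hypergeom_pmf(x, N, n, a)
    approx = binom_pmf(x, n, a / N)
    results[label] = (n / N, exact, approx)
    print(
        f"{label}: sampled {n / N:.1%} of the population, "
        f"hypergeometric {exact:.6f}, binomial {approx:.6f}"
    )
```

The urn example samples 25% of the population and the two answers differ noticeably (0.01354 versus 0.02835), while the school example samples 0.5% and the two answers agree to about four decimal places.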
The guideline tells us that the binomial distribution would provide a reasonable approximation here. Why is this? Here 55% of the population is female, so the probability the first student selected is female is 0.55. But as students are selected the probability of selecting a female student is going to change a little bit, depending on what students were selected before. But since we're only sampling a small proportion of the population, the probability is not going to change very much. For example, suppose the first three students selected were female, the probability the next student selected is female is 1097, the number of remaining female students, over 1997, the total number of students remaining. This is a little bit less than 0.55, but it's still pretty close. So this probability changes only a little bit and the binomial distribution, which assumes a constant probability of success regardless of what happens on the other trials, provides a very reasonable approximation in this situation. One last thing. These methods can be extended to more than two groups, and let's take a quick look at that. Suppose that in the US a business employs 12 Democrats, 24 Republicans, and 8 independents. If a random sample of 6 employees is drawn, and suppose without replacement again, what is the probability there are 3 Democrats, 2 Republicans, and 1 independent in the sample? If we didn't rely on the formula in the earlier examples, and we understood the underlying logic, we can extend those methods to this type of situation, where there are three or more groups instead of just two. Here there are 12+24+8 people, or 44 altogether. So when we're calculating our probability, the total number of possible samples, which we put in the denominator is going to be 44 choose 6, because we're picking 6 people from 44. In the numerator we need the number of ways of getting 3 Democrats, 2 Republicans, and 1 independent. 
From the 12 Democrats we must pick 3, and from the 24 Republicans we must pick 2, and from the 8 independents we must pick 1. And all of this works out to 0.0688. So the methods of the hypergeometric distribution can be extended to more than two types of objects. This is sometimes called the multivariate hypergeometric, and this example is simply a very quick introduction to that.
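The same counting logic for three or more groups can be sketched in Python as follows. The function name `multivariate_hypergeom_pmf` and its list-based interface are assumptions for illustration, not something taken from the video; `math.prod` requires Python 3.8 or later.

```python
from math import comb, prod


def multivariate_hypergeom_pmf(group_sizes, counts):
    """Probability of drawing exactly counts[i] objects from each group i
    when sampling sum(counts) objects without replacement."""
    if len(group_sizes) != len(counts):
        raise ValueError("group_sizes and counts must have the same length")
    # Numerator: number of ways of picking the required count from every group.
    favourable = prod(comb(size, k) for size, k in zip(group_sizes, counts))
    # Denominator: number of ways of picking the whole sample from everyone.
    return favourable / comb(sum(group_sizes), sum(counts))


# 12 Democrats, 24 Republicans, 8 independents; sample of 3, 2, and 1.
p = multivariate_hypergeom_pmf([12, 24, 8], [3, 2, 1])
print(f"{p:.4f}")  # 0.0688
```

With only two groups this reduces to the ordinary hypergeometric calculation, for example `multivariate_hypergeom_pmf([6, 14], [4, 1])` for the urn problem.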
Info
Channel: jbstatistics
Views: 226,218
Keywords: Hypergeometric, Hypergeometric distribution, with replacement, without replacement, probability, discrete probability distributions, hypergeometric example, probability examples, jbstatistics, jb statistics, statistics, introductory statistics, 8msl, 8 minute stats lectures, intro stats videos, intro stats help, stats help, stats tutor, jeremy balka, AP statistics
Id: L2KMttDm3aY
Length: 15min 34sec (934 seconds)
Published: Tue Sep 10 2013