Lecture 21: Covariance and Correlation | Statistics 110

Captions
So today is covariance day. Covariance is the long-awaited tool that will finally let us deal with the variance of a sum. We said that variance, unlike expectation, is not linear. That doesn't mean we don't need ways to deal with the variance of a sum; it just means we need to think harder rather than falsely applying linearity. So on the one hand, covariance is what we need to handle the variance of a sum; on the other hand, it's what we need when we want to study two random variables together instead of one. It's like variance, except with two of them, and that's why it's called covariance. So let's define it, do some properties, do some examples.

Start with the definition. It's analogous to how we defined variance, except now we have an X and a Y, because we're looking at joint distributions. We have X, we have Y, and we want their covariance. X and Y are any two random variables on the same probability space, and we define Cov(X, Y) = E[(X - EX)(Y - EY)]. That's just the definition, so you can't argue with it too much, but let's stare at it intuitively for a bit and see where this thing might have come from. Why define it this way instead of any other way? Well, first of all, it's a product, something times something, so we've brought the X stuff and the Y stuff together into one thing, because we're trying to see how they vary together. And we all know that a positive number times a positive number is positive, negative times negative is positive, and positive times negative is negative. Here we're looking at X relative to its mean and Y relative to its mean. Now imagine drawing a random sample: suppose we had a lot of i.i.d. pairs (Xi, Yi). The pairs are i.i.d., but within each pair, Xi and Yi have some joint distribution; they may not be independent. By the way, we did show before that if they are independent, then you can write this as the expectation of one factor times the expectation of the other, so we're really interested in what happens when they are not independent. Well, if in that random sample, most of the time when X is above its mean, Y is also above its mean, then you're getting positive times positive; and if X being below its mean tends to go with Y being below its mean, you get negative times negative, which is positive. So if X being above its mean tends to imply that Y is above its mean, and below goes with below, then we say they're positively correlated. And vice versa: they'd be negatively correlated if X being above its mean doesn't imply that Y is below its mean, but makes it more of a tendency for Y to be below its mean. Covariance is just a measure of that. We'll actually define correlation in a little while; correlation is a very familiar term, because people talk about correlation all the time, but mathematically it's defined in terms of covariance, so we'll get to that soon. There's a quick simulated illustration of that sign intuition below.
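(A minimal simulation sketch of that sign intuition, assuming Python with NumPy; the particular variables constructed here are illustrative, not from the lecture.)

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Positively dependent pair: Y tends to be above its mean when X is.
x = rng.normal(size=n)
y_pos = x + rng.normal(size=n)
# Negatively dependent pair: Y tends to be below its mean when X is above its.
y_neg = -x + rng.normal(size=n)

def sample_cov(a, b):
    """Average of (a - mean(a)) * (b - mean(b)), matching E[(X - EX)(Y - EY)]."""
    return np.mean((a - a.mean()) * (b - b.mean()))

print(sample_cov(x, y_pos))               # close to +1: positive covariance
print(sample_cov(x, y_neg))               # close to -1: negative covariance
print(sample_cov(x, rng.normal(size=n)))  # independent case: close to 0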
So that's just the definition of covariance. But just like with variance, there are two ways to write it. Remember, we defined variance as E[(X - EX)^2]; so if we let Y = X in the definition of covariance, we get exactly the variance. We've already proved the first theorem, so I'll just list these as properties. Property 1: Cov(X, X) = Var(X). The proof is just to let Y = X; then it's the definition of variance. That's a very useful fact to keep in mind. Property 2: it's symmetric, Cov(X, Y) = Cov(Y, X). Again, that's something you can see immediately: swap X and Y and it's the same thing. That's also a useful fact.

Here's the alternative way to write covariance, completely analogous to how we defined variance as E[(X - EX)^2] but then quickly showed we could also write it as E(X^2) - (EX)^2, parenthesized the other way. The analogue of that formula, which is in fact a generalization of it, is Cov(X, Y) = E(XY) - E(X)E(Y). In general these two terms are not equal; we proved they are equal if X and Y are independent, but in general they're not. Notice that if we let Y = X, as in Property 1, this is just E(X^2) - (EX)^2, so the variance formula is a special case. The proof is just to multiply out the definition and use linearity. Let's quickly do that for practice: expanding (X - EX)(Y - EY) gives four terms, and then by linearity, E[(X - EX)(Y - EY)] = E(XY) - E(X)E(Y) - E(X)E(Y) + E(X)E(Y) = E(XY) - E(X)E(Y). For the cross terms, notice that E(Y) is a constant, so E[X E(Y)] = E(X)E(Y), and similarly for the other one; the last term is the expectation of a constant, which is that constant. So it's minus two of them plus one of them, and we get the same thing. That's just an easy application of linearity of expectation. Most of the time, E(XY) - E(X)E(Y) is a little easier for computing a covariance, but, as with variance, the definition has a bit more intuitive appeal, since it's X relative to its mean times Y relative to its mean; they're the same thing.

All right, so we already have two properties; let's get some more. What's the covariance of X with a constant? So here Y is a constant c. Property 3: Cov(X, c) = 0 for any constant c, because c minus its expected value is c - c = 0, so it's immediately 0 from the definition. By symmetry, Cov(c, X) = 0 as well. Now what if we multiply by a constant instead of just having a constant there? Property 4: Cov(cX, Y) = c Cov(X, Y). To see this, just use E(XY) - E(X)E(Y): replace X by cX, and the c comes out of both terms, so it comes out of the whole thing. Constants come out. Again, c is any constant here, and similarly we could have a constant in each coordinate and take them both out. Very, very straightforward.
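(A quick numerical check, again just a sketch assuming NumPy, of the identity Cov(X, Y) = E(XY) - E(X)E(Y) and of Properties 3 and 4; the distributions chosen are arbitrary illustrations.)

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.exponential(size=n)
y = 0.5 * x + rng.normal(size=n)   # some dependent Y, just for illustration
c = 7.0

def cov_def(a, b):
    """Cov via the definition E[(X - EX)(Y - EY)]."""
    return np.mean((a - a.mean()) * (b - b.mean()))

def cov_alt(a, b):
    """Cov via the equivalent formula E(XY) - E(X)E(Y)."""
    return np.mean(a * b) - a.mean() * b.mean()

print(cov_def(x, y), cov_alt(x, y))            # the two formulas agree
print(cov_def(x, np.full(n, c)))               # Property 3: Cov(X, c) = 0
print(cov_def(c * x, y), c * cov_def(x, y))    # Property 4: constants come out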
All right, and now we want something that looks kind of like linearity. What happens if we take the covariance of X with Y + Z? Property 5: Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z). What that says to do is replace Y by Y + Z. As a quick bit of scratch work to see what's going on: in E(XY), replacing Y by Y + Z gives X(Y + Z) = XY + XZ, and by linearity the expectation of that is E(XY) + E(XZ). Similarly, E(Y + Z) = E(Y) + E(Z). So the terms you get are simply the sum of the two covariances, Cov(X, Y) + Cov(X, Z); just write down the four terms and you've added the two covariances. Again, all of these properties are basically immediate. I'm not writing out long proofs, because they all follow from plugging into either the definition or the equivalent formula and using linearity of expectation.

Properties 4 and 5 together are especially useful, and they're called bilinearity. It's not linearity, but bilinearity is just a fancy term that means: if you imagine treating one coordinate as fixed and you work with the other coordinate, it looks like linearity. So here, notice the Y just stayed as Y, and what happened to the cX? The constant came out, just like linearity. And here, the X stayed X throughout, but if you look at the Y + Z part, we split it into the Y and the Z. So it looks like linearity if you go one coordinate at a time. I happened to write it this way, but obviously I could have done Cov(X + Y, Z) and it would be analogous, or put a constant in either coordinate. These are really useful properties: if you use them, you can avoid a lot of ugly calculations, because you can just apply them rather than always going back to the definition. Just like linearity is incredibly useful, bilinearity is incredibly useful when working with covariances. An easy way to remember it is that it looks like the distributive property, x(y + z) = xy + xz. It's not literally multiplication, it's covariance, but I'm pairing this with this and this with this.

So let's extend that to what happens if we have more of them. Say we have Cov(X + Y, Z + W). This doesn't really need to be listed separately, but for practice let's do it: just apply Property 5 repeatedly, and we get the covariance of each term on the left with each term on the right, just like multiplying two polynomials. So we can immediately write down the four terms: Cov(X + Y, Z + W) = Cov(X, Z) + Cov(X, W) + Cov(Y, Z) + Cov(Y, W), and that follows immediately from using Property 5 repeatedly. More generally, let's write once and for all what happens with the covariance of one sum with another sum; I don't want to write out nine terms each time. Say we have a linear combination of random variables, the sum over i from 1 to m of a_i X_i, where the a_i are constants, and another one, the sum over j from 1 to n of b_j Y_j, and we want the covariance of those two sums. It looks like a complicated thing.
Okay, but as soon as you think about the structure of the problem, it's just the covariance of one sum with another sum. So apply Property 5 over and over and over again; we don't literally have to do that, but conceptually that's all we're using, together with Property 4 to take out the constants. Think about what you're going to get: a sum over all pairs (i, j) of the covariance of the individual terms, because it's just saying, take one term from here and co-vary it with one term from there, for all possible pairs. So Cov(sum_i a_i X_i, sum_j b_j Y_j) = sum over all i, j of a_i b_j Cov(X_i, Y_j). I know this looks complicated, but it's no different from Property 5; it just means we used it a lot of times instead of once. And a lot of the time it's easier to use this kind of thing, working directly with covariances, rather than going back to the definition and multiplying everything out in terms of expectations.

All right, so Property 1 says how covariance is related to variance, but it doesn't yet show how covariance is useful in actually computing the variance of a sum, and that's one of the main reasons we want covariance. So let's work out the variance of a sum. Start with Var(X1 + X2); then we can generalize to any number of terms by using this repeatedly. We already know how to do this: by Property 1, it's the covariance of X1 + X2 with itself, and by Property 6 that's four terms. We have Cov(X1, X1), which is just Var(X1); Cov(X2, X2), which is just Var(X2); and two cross terms, Cov(X1, X2) and Cov(X2, X1), which by the symmetry property are the same thing. So it's simpler to write Var(X1 + X2) = Var(X1) + Var(X2) + 2 Cov(X1, X2); call that Property 7. In particular, this says that the variance of the sum is the sum of the variances if and only if the covariance is 0. One case where that's true is if they're independent; we showed before that if they're independent, the covariance is 0, so that term is gone. And we'll also see examples where they're not independent but the covariance term is still zero, and then it's still true. But in general, you can't say the variance of the sum is the sum of the variances, because you have these covariance terms. Yeah, question? >> [INAUDIBLE] >> It's if and only if the covariance is 0 that the variance of the sum is the sum of the variances.

So let's write what happens if there are more than two of them: Var(X1 + ... + Xn). That's the covariance of this sum with itself, so we can just apply the general result. Again, we get the sum of all the variances, and then all the covariances: Cov(X1, X2), Cov(X2, X1), Cov(X1, X3), Cov(X3, X1), and so on. I think it's easiest to write it as Var(X1 + ... + Xn) = sum_i Var(Xi) + 2 sum over i < j of Cov(Xi, Xj); it's easy to forget the 2 here. I could also have written it as a sum over i not equal to j, in which case I would not put the 2; it's simply a question of whether you list Cov(X1, X2) separately from Cov(X2, X1) or group them together. It seems a little simpler to group them together, but then we need to remember the 2. Since I specified i < j, Cov(X1, X2) is listed here but Cov(X2, X1) is not, because it's already included. All right, so that's the general way to get the variance of a sum, and we'll see some examples of that in a few minutes; there's a quick numerical check of it sketched below first.
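(A minimal check of the variance-of-a-sum formula, assuming NumPy; the three dependent variables here are constructed only for illustration.)

import numpy as np

rng = np.random.default_rng(2)
n_samples = 300_000

# Three dependent random variables, built just for illustration.
z = rng.normal(size=(3, n_samples))
x1 = z[0]
x2 = 0.6 * z[0] + z[1]
x3 = -0.3 * z[1] + z[2]
xs = [x1, x2, x3]

cov = lambda a, b: np.mean((a - a.mean()) * (b - b.mean()))

lhs = np.var(x1 + x2 + x3)
rhs = sum(cov(x, x) for x in xs) \
    + 2 * sum(cov(xs[i], xs[j]) for i in range(3) for j in range(i + 1, 3))
print(lhs, rhs)   # Var of the sum equals sum of variances plus twice the covariances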
First, I just want to make sure the connection with independence is clear, and we also need to define correlation. So the theorem says that if X and Y are independent, then they are uncorrelated. The definition of uncorrelated is just that the covariance is 0; that is, Cov(X, Y) = 0. We actually proved this last time, when we just didn't have the terminology yet; at least we proved it in the continuous case, and the discrete case is analogous. We proved it using the two-dimensional LOTUS, showing that E(XY) = E(X)E(Y) in the independent case.

The converse is false, and a common mistake is to show that the covariance is 0 and then leap to the conclusion that the variables are independent. If the covariance is 0 and that's all we know, they may or may not be independent. To give a simple counterexample showing why uncorrelated does not imply independent, let's consider normal random variables. Let Z be standard normal, and let X = Z (slightly redundant notation, but I'm in the habit of using Z for standard normals) and Y = Z squared. So we're looking at a normal and its square. Now let's compute the covariance for this example: Cov(X, Y) = E(XY) - E(X)E(Y). In terms of Z, that's E(Z cubed) - E(Z)E(Z squared), but both terms are just 0, because we saw before that the odd moments of a standard normal are 0. That's an odd moment and that's an odd moment, so it's just 0 - 0 = 0. So they're uncorrelated, but they're clearly not independent; in fact, to avoid too many double negatives, they're very dependent. Y is a function of X: if you know X, you know Y, complete information. Dependent just means there's some information; it doesn't have to be complete information. In this case, knowing X gives complete information about Y. And going the other way, if we know Y, we don't know X, but we do know its magnitude: if we know Z squared, we can take the square root and get the absolute value, so we know X up to a plus or minus sign. So that also shows dependence going the other direction, which we didn't need to do, but it's nice to think about: Y determines X up to a sign, so it determines the magnitude of X. So that's an example showing the converse is false, and it's a handy counterexample to keep in mind for a lot of things. Here's what it looks like numerically.
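(A small simulation of that counterexample, assuming NumPy; the conditioning events used to exhibit the dependence are my own choice, just to make the point visible.)

import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=500_000)
x, y = z, z**2   # Y is a deterministic function of X, so they are highly dependent

cov_xy = np.mean(x * y) - x.mean() * y.mean()
print(cov_xy)    # close to 0: uncorrelated

print(np.mean(y[x > 1]), np.mean(y[np.abs(x) < 0.5]))
# The conditional mean of Y changes drastically depending on where X is,
# one way to see strong dependence despite zero covariance.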
So what's going wrong here, intuitively? There's nothing wrong with the computation, but the reason the definition doesn't capture this kind of dependence is that covariance, and correlation, are in a sense measures of linear association. Those of you who have taken Stat 100 or 104 have seen a lot of things like that, where you actually have a data set, a cloud of points, and you look at whether it generally slopes upward or downward; it's measuring linear trends in some sense. There's a theorem, which we're not going to prove, that says if every function of X is uncorrelated with every function of Y, then X and Y are independent. But just having the linear things be uncorrelated is not enough, as this example shows: here there's a quadratic relationship but no linear relationship. That's the intuition.

All right, so let's also define correlation, and then I'll do some examples of how to use this to compute variances of things we didn't already know. Once we have covariance, which we do, correlation is easy to define, and I'll tell you some of the intuition as well as the math. You can think of it as a standardized version of covariance. Correlation, which you write as Cor or, as I usually do, Corr (just because R's tend to look like V's if you're writing too fast), is usually defined as Corr(X, Y) = Cov(X, Y) / (SD(X) SD(Y)), the covariance divided by the product of the standard deviations. Remember, the standard deviation is just the square root of the variance, so take the covariance and divide by the square root of the product of the variances. That's the usual definition. I would actually prefer to define it a different way, and I'll show you why these are equivalent. I would prefer to define it as the covariance of the standardized versions. Remember standardization: if we have any normal, we subtract the mean and divide by the standard deviation, and that gives us a standard normal. Here I'm not assuming anything is normal, but the same standardization makes sense: take X, subtract its mean, divide by its standard deviation, do the same thing with Y, and take the covariance of those two standardized variables. So correlation means: standardize them first, then take the covariance.

The reason this is a useful thing to do is that covariance has a kind of annoying property when it comes to interpretation in terms of units. Imagine X and Y are distances; they're random variables, but they represent a distance quantity. If you measured X and Y in nanometers, and someone else working on the same problem measured them in light years instead, you'd get extremely different answers. So if I just tell you the covariance between my X and Y is 42, what does that tell you? You have to think really hard about the units, about whether 42 is a big number or a small number. It's the answer to life, the universe and everything, but is it big or small? I don't know, because of the units. Correlation, by contrast, is a dimensionless quantity, which basically means unitless: if X is measured in nanometers, subtracting off its mean is still in nanometers. Remember, that's also why we define standard deviation: it has a square root, which is mathematically annoying to deal with, and mathematically it's nicer to work with variance, but intuitively the variance would be in nanometers squared.
With the standard deviation we're back to nanometers, and dividing nanometers by nanometers gives a dimensionless quantity; that's a major advantage of correlation. And I should tell you briefly why the two definitions are the same. You should just think about those covariance properties; I'll say it quickly. First of all, subtracting the mean is just adding a constant, which doesn't affect the covariance at all, so I could have left that out, but it's useful to think in terms of standardizing: standardization takes X, which could have any mean and any variance, and makes it have mean 0 and variance 1, which is why it's called standardization. The part that actually affects what's going on is dividing by the standard deviations, and by one of the properties we wrote, we can just pull the constants 1/SD(X) and 1/SD(Y) out of the covariance, and we get exactly the usual formula. So they're exactly the same thing; I just think the standardized version is a little more intuitive to think about. There's a quick check below that the two ways of computing it agree.
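(A short sketch, assuming NumPy, checking that the two definitions of correlation coincide; the pair (x, y) is an arbitrary illustration.)

import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x = rng.gamma(shape=2.0, size=n)
y = 3.0 * x + rng.normal(scale=5.0, size=n)   # an arbitrary dependent pair

cov = lambda a, b: np.mean((a - a.mean()) * (b - b.mean()))

# Usual definition: covariance divided by the product of the standard deviations.
corr1 = cov(x, y) / (x.std() * y.std())
# Equivalent definition: covariance of the standardized variables.
corr2 = cov((x - x.mean()) / x.std(), (y - y.mean()) / y.std())
print(corr1, corr2)             # identical up to floating point
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in Pearson correlation agrees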
Okay, so one quick theorem about correlation: correlation can never equal 42. More generally, correlation is always between -1 and 1. So not only is it more interpretable in the sense that it doesn't depend on what system of units you used; it's also more interpretable in that if I say a correlation is 0.9, that's a pretty high correlation, because I know the largest it can be is 1. That's very useful. An interesting fact about this inequality is that it's essentially just Cauchy-Schwarz, for those of you who have seen the Cauchy-Schwarz inequality in linear algebra or elsewhere. Cauchy-Schwarz is one of the most important inequalities in all of mathematics, and if you rewrite this statement in a linear algebra setting, you can show it's essentially Cauchy-Schwarz. If you haven't seen Cauchy-Schwarz yet, we'll come back to it later in the semester, and you don't need to worry about it right now; but for those of you who have, I wanted to make the connection.

So let's prove this fact. One proof would just be to put it into the Cauchy-Schwarz framework and apply Cauchy-Schwarz, but that doesn't really show what's going on, and it assumes you're familiar with Cauchy-Schwarz, so let's just prove it directly. First of all, in math classes you'll often see the acronym WLOG, Without Loss Of Generality: we're going to assume X and Y are already standardized, meaning they have mean 0 and variance 1. If they weren't, I could just make up some new notation, X tilde and Y tilde, for the standardized ones; but we just said the correlation would be the same anyway, so we may as well assume from the start that they're standardized. Now let's compute the variance of the sum; this is actually good practice with Property 7. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). For some reason, statisticians often like to call the correlation rho, so I'll follow that trend and name Corr(X, Y) = rho. Since X and Y are standardized, each variance is 1, and the covariance is the correlation, so this is 1 + 1 + 2 rho = 2 + 2 rho. On the other hand, we could look at the variance of the difference; again, that's good practice with variances of sums and differences. A common mistake is to say it's Var(X) - Var(Y); we talked about that before with sums and differences of normals, and variances can't be negative. Think of X - Y not as a difference but as X + (-Y), so the variances still add, and just check, taking the covariance of X - Y with itself, that you get a minus sign on the covariance part but not on the variance terms: Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y) = 2 - 2 rho. Okay, we're running out of space on this board, and that's actually the end of the proof, because variance is non-negative: 2 + 2 rho >= 0 and 2 - 2 rho >= 0 say exactly that rho is between -1 and 1. So correlation is always between -1 and 1. In general it's easier to work with covariances than correlations, but correlations are more intuitive, standardized so that everything is between -1 and 1.

Okay, so for the rest of the time I want to do some examples with this, computing covariances for problems we might be interested in. Let's talk about the multinomial, since we were talking about that last time, and now we actually have the tools to deal with the covariances within a multinomial. This is just an example, but it's an important one, because multinomials come up a lot. So we want to compute covariances in a multinomial. The multinomial is a vector of counts: how many people are in category one, how many are in category two, and so on. You can take any two of those counts, say how many people are in category one and how many are in category five, and compute the covariance of those; that's a very natural thing to look at. I actually know four or five ways to derive this, and I really like this example, so I'll probably come back to it later with some of the other methods, but for now let's just do one method. Using the notation from last time, we have k categories, Xj is the number of people or objects in the jth category, and the vector (X1, ..., Xk) is Multinomial(n, p): there are n objects or people, and the vector p gives the probability of each category. We want to find Cov(Xi, Xj) for all i and j. First consider the case i = j: then we just have the covariance of Xi with itself, which is Var(Xi), and last time we talked about the fact that if we define success to be being in category i, we just have a binomial, so we use the variance of the binomial, n pi (1 - pi). So that's easy; the more interesting part is what happens when i is not equal to j. I think it's easier to think concretely in terms of the first two, so let's just find Cov(X1, X2); if we know how to do that, we can always relabel and get X5 and X12 or whatever we want, and it's easier than having so much notation going around. There are a lot of ways to do this, but let's think about it intuitively first: do you think this is positive, negative, or zero? >> [INAUDIBLE] >> Negative, why?
>> [INAUDIBLE] >> Exactly. So if you somehow computed this and got a positive number, you shouldn't just be happy that you're done with the problem and move on; you should stop and think, does a positive number make sense here? As you just said, if you knew that there were more people in the first category, tons of people in the first category, then there are fewer people left over who could be in the second category. It's like these categories are competing for membership: we have a fixed number of people, not like the chicken-and-egg problem where we had a Poisson number of eggs. With a fixed number competing for the different categories, more in one means you'd expect fewer in the other, so they should be negatively correlated.

So how do we do this? There are a bunch of ways, as I said, but one way I especially like relates it back to what we did last time, the lumping property of the multinomial, and to the variance of a sum. Normally you'd think of that formula as a way to find the variance of a sum; but if we know the variance of the sum and the two individual variances, then obviously we also know the covariance. So let's do it that way, and I'll probably do some of the other methods later, not today. Call the covariance C, just to have some notation; we're trying to solve for C. The variance of the sum is the sum of the variances plus twice the covariance: Var(X1 + X2) = n p1 (1 - p1) + n p2 (1 - p2) + 2C. The only thing we haven't got is Var(X1 + X2), and that follows immediately from what I was talking about last time with the lumping property: merge the first two categories into one bigger category. If we do that, it's still binomial; now success means being a member of category one or category two. So we can immediately write down Var(X1 + X2) = n (p1 + p2)(1 - p1 - p2). Now we know everything in this equation except C. To solve for C, multiply things out and simplify however you want; it's easy algebra at this point, so I'm not going to write it out. What you get is Cov(X1, X2) = -n p1 p2. That was just for X1 and X2, for concreteness; in general the result is Cov(Xi, Xj) = -n pi pj for i not equal to j. Notice it is a negative number, as it should be. So that's the covariance in a multinomial; there's a quick simulation check of it sketched below.
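(A minimal simulation sketch of Cov(X1, X2) = -n p1 p2, assuming NumPy; the values of n and p are illustrative, not from the lecture.)

import numpy as np

rng = np.random.default_rng(5)
n, p = 10, np.array([0.2, 0.3, 0.5])
counts = rng.multinomial(n, p, size=500_000)   # each row is one Multinomial(n, p) draw

x1, x2 = counts[:, 0], counts[:, 1]
sample_cov = np.mean(x1 * x2) - x1.mean() * x2.mean()
print(sample_cov, -n * p[0] * p[1])   # simulation vs. the formula -n p1 p2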
Now let's do a few variance examples. For example, the variance of the binomial: we did derive it before using indicator random variables directly, because we didn't have these tools available yet, so let's redo the variance of the binomial and then do one more example after that. The variance of the Binomial(n, p) is npq. Let's derive that really quickly. Let X be Binomial(n, p), and write it as we've done many times: X = X1 + ... + Xn, where the Xj are i.i.d. Bernoulli(p). Each Xj is a Bernoulli, but it's also an indicator random variable: it's the indicator of success on the jth trial. So let's do a quick indicator random variable review and state things in general. Let I_A be the indicator random variable of event A, where A is any event. A couple of quick, simple facts about indicator random variables. What's I_A squared? It's just I_A, because you're squaring 0 or 1. Similarly I_A cubed = I_A, and you can generalize this to any positive power if you want: nothing happens, because it's 0 or 1. That's a very, very simple fact, but I've seen it get overlooked many times, so I'm emphasizing it now. Now let's look at something else: I_A times I_B, where A and B are both events. How would you write that as one indicator random variable? It's the indicator of the intersection: I_A I_B = I_{A intersect B}. Extremely useful, simple fact that often gets overlooked. The product of the indicators is 0 or 1 times 0 or 1, so it's 0 or 1, and it's 1 if and only if both of them are 1, which is the definition of the intersection. So that's immediately true, and very useful.

Okay, coming back to the binomial: if we want Var(Xj), that's E(Xj squared) minus (E Xj) squared. But Xj squared is Xj, so the first term is just E(Xj) = p for a Bernoulli(p), and the second term is p squared. So Var(Xj) = p(1 - p); if we define q = 1 - p, that's pq. So it's extremely easy to get the variance of a Bernoulli: for Bernoulli(p), you get pq. Now we want the variance of the binomial, and it's just npq, done, because we're adding up n of these and they're independent for the binomial: we have independent Bernoulli trials. To write it out a little more: Cov(Xi, Xj) = 0 for i not equal to j because they're independent; they're not only uncorrelated, they're independent, so we don't have any covariance terms. We just add up the variances and get n times pq. So now you can do the variance of a binomial in your head; you don't need to memorize it, it's just n times the variance of one of these Bernoullis. There's a quick check of those indicator facts and the npq formula below.
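(A short sketch, assuming NumPy, checking the indicator facts and the npq formula; n and p here are arbitrary illustrative values.)

import numpy as np

rng = np.random.default_rng(6)
n, p = 12, 0.3
q = 1 - p
trials = rng.random((400_000, n)) < p    # each row: n independent Bernoulli(p) indicators

# Indicator facts: squaring an indicator changes nothing, and a product of
# indicators is the indicator of the intersection of the two events.
a = trials[:, 0].astype(int)
b = trials[:, 1].astype(int)
assert np.array_equal(a**2, a)
assert np.array_equal(a * b, (trials[:, 0] & trials[:, 1]).astype(int))

x = trials.sum(axis=1)                   # Binomial(n, p) as a sum of indicators
print(x.var(), n * p * q)                # simulated variance vs. the npq formula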
Well, let's talk about a more complicated one, though: the hypergeometric. Let X be Hypergeometric with parameters w, b, n, which we interpret as follows: we have a jar with w white balls and b black balls, we take a sample of size n, and we want the distribution of the number of white balls in the sample. Again, we can decompose X in terms of indicator random variables. Interpret this as drawing balls from the jar one at a time without replacement (with replacement we'd get a binomial; the hypergeometric is without replacement), and let Xj = 1 if the jth ball is white, 0 otherwise, so X = X1 + ... + Xn. The reason this is more difficult than the binomial is that these are dependent indicator random variables, because it's without replacement. So if we write out Var(X), we get all these variance terms and all these covariance terms, which sounds like it's going to be a nightmare, but there are some symmetries we can take advantage of.

First of all, the sum of all the variances is just n times Var(X1), by symmetry. This goes back to the homework problem where you picked two balls, and a lot of students were struggling with the fact that to consider the second ball, don't you have to consider the first ball? But when we're just looking at, say, X7, that depends on the seventh ball, and we're imagining this before we've done anything. The seventh ball is equally likely to be any of the balls; it isn't as if some balls like to be chosen seventh and others don't. It's completely symmetrical, so the variance terms give n Var(X1). Similarly, for the covariance terms, there are 2 times (n choose 2) of them, and by the same symmetry we may as well just consider Cov(X1, X2). So Var(X) = n Var(X1) + 2 (n choose 2) Cov(X1, X2). You should think this through to make sure you see why the symmetry holds here; if there's symmetry, you want to take advantage of it, but if there isn't, you don't want to falsely assume it. Symmetry is powerful, but that's the danger. Var(X1) is easy to get, since X1 is Bernoulli; so let's think a little about Cov(X1, X2) = E(X1 X2) - E(X1)E(X2). The second term is easy: by the fundamental bridge, it's the probability that the first ball is white times the probability that the second ball is white, and both of those are w/(w + b). For the first term, E(X1 X2), use the fact that the product of two indicator random variables is the indicator of the intersection: by the fundamental bridge, it's the probability that the first two balls are both white. The first ball is white with probability w/(w + b), and the second ball is white given that the first ball is white with probability (w - 1)/(w + b - 1), so E(X1 X2) = [w/(w + b)] [(w - 1)/(w + b - 1)]. So now we know both terms, and at this point we can just do some algebra and simplify everything together. I'll clean this up next time and give the final answer, but at this point we know the answer; it's just algebra. Okay, so see you on Friday.
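(As a supplement to the computation above: a minimal simulation sketch, assuming NumPy, checking the hypergeometric variance decomposition; the parameter values w, b, n are illustrative, not from the lecture.)

import numpy as np
from math import comb

# Check Var(X) = n Var(X1) + 2 * C(n, 2) * Cov(X1, X2), with
# Var(X1) = p(1 - p) and Cov(X1, X2) = p * (w - 1)/(w + b - 1) - p**2, p = w/(w + b).
w, b, n = 6, 4, 5
N = w + b
p = w / N

var_x1 = p * (1 - p)
cov_x1x2 = p * (w - 1) / (N - 1) - p**2
var_formula = n * var_x1 + 2 * comb(n, 2) * cov_x1x2

rng = np.random.default_rng(7)
x = rng.hypergeometric(ngood=w, nbad=b, nsample=n, size=500_000)
print(x.var(), var_formula)   # simulated variance agrees with the decomposition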
Info
Channel: Harvard University
Views: 93,552
Rating: 4.8507891 out of 5
Keywords: harvard, statistics, stat, math, probability, covariance, correlation, hypergeometric random variable
Id: IujCYxtpszU
Length: 49min 26sec (2966 seconds)
Published: Mon Apr 29 2013