Lecture 27: Conditional Expectation given an R.V. | Statistics 110

All right. So let's get started. So we're still talking about conditional expectation, right? And today we'll finish conditional expectation as a topic in its own right, which of course doesn't mean you can then forget it, because everything in this course is about thinking conditionally. But as its own topic, we'll finish that today. So I wanted to start with just a couple quick examples of conditional expectation where you're conditioning on a random variable. Last time we were talking about conditioning on an event versus conditioning on a random variable. So we'll do a couple quick examples, then derive some properties, and then do some more difficult examples. But just to start with a couple easy examples, to help get the notation and concepts in mind. So, here is a simple example. Let's just start with a normal: X is standard normal. And let's let Y = X squared, okay? And then suppose we want E(Y given X). This is just practice, you know, what does the concept mean? E(Y given X) is E(X squared given X). This notation means we get to treat X as known and then we try to give our best prediction for X squared. Best is in the sense of minimizing mean squared error. So in a certain sense, it's the best prediction. If we know X, we know X squared, so obviously our best prediction would be X squared, which equals Y. Okay, so that's a very easy calculation, but if we got anything other than X squared here, there would be something very suspicious about it, right? We get to know X, so predicting something else doesn't make sense. So this should be very, very clear. Now let's see what would happen if we went the other way around. Same example, but now let's do E(X given Y) instead of E(Y given X). So that's E(X given X squared). So we get to observe X squared. Now we treat X squared as known, but we don't know X. Okay, what do you think this is? Zero, why? By symmetry, yeah.
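As a quick sanity check (a Monte Carlo sketch in Python, not from the lecture; the sample size, seed, and band width are arbitrary choices), we can see that X and X squared are uncorrelated, and that among samples where X squared lands near a fixed value a, the average of X is near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # X ~ N(0, 1)
y = x**2                            # Y = X^2

# X and X^2 are uncorrelated: sample correlation should be near 0.
corr = np.corrcoef(x, y)[0, 1]

# But they are NOT independent: knowing X^2 = a pins down |X| = sqrt(a).
# Conditioning on X^2 landing in a narrow band around a = 1, the average
# of X is still ~0, since +sqrt(a) and -sqrt(a) are equally likely.
band = (y > 0.9) & (y < 1.1)
cond_mean = x[band].mean()
```

So the conditional mean is zero even though the conditional distribution (two spikes at plus or minus the square root of a) is far from the unconditional one.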
This is just 0, and you could do some big calculation, but you shouldn't have to, cuz you can just think about what the conditional distribution is. If we get to observe that in fact X squared = a, so we're treating it as known, I'm just going to call it a, then we know that X is plus or minus the square root of a. But by symmetry those are equally likely. This is only giving us information about the magnitude; it's not giving us any information about the sign. Since the normal is symmetric, X is equally likely to be square root of a or minus square root of a. And if you average square root of a and minus square root of a, you get zero. Now, this doesn't say that X and X squared are independent, right? We saw before that they're uncorrelated, but they're definitely not independent. It just says that X squared doesn't help with predicting X as a single number in this sense: we know the magnitude, but we don't know anything about the sign, so we just have to guess one number, and we may as well say zero. All right, let's do another example, just another quick example. Okay, so suppose we have a stick, one of these stick-breaking type problems. We have a stick, let's say it has length 1, and we break off a random piece, and by random here I mean uniform. So we break off a piece, throw out the other piece. So now we only have one random piece, then break that piece again, okay? So break off another piece. And suppose we want the expected value, or the conditional expectation, of the length of the second piece. So in terms of the picture, what we're doing is we're first picking X. I have to put it somewhere for the sake of the picture. That's X, but let's assume that X is uniform between 0 and 1. I'm just translating what I just said into probability notation. So that's the first break point.
So we break the stick here, and we keep this part, throw out the other piece. And then, now we just have this piece from 0 to X, then pick a random break point in this piece. Let's say there, that's Y. Okay, and the question is then, what's the length of this piece, right, or the distribution, or the conditional expectation, that kind of thing. So to write that out conditionally, we would just write Y given X is Uniform(0, X). This notation would not make any sense if I didn't write given X here, right, because I just wanna specify a distribution. But what this notation means, it's just shorthand for saying that if we know that big X equals little x, then Y is going to be uniform between 0 and little x. And you can always think of it back in terms of conditioning on big X equals little x. So that's just shorthand for saying that if we get to treat X as known, then we're picking a random point from here to here. All right, so that's the setup. Okay, now let's compute E(Y given X = x). So that's going to be a function of little x. This just says that we know the first break point is here; call this point little x. The second break point is anywhere from zero to little x uniformly, so on average it would be little x over 2. And so E(Y given capital X) = capital X / 2, because we just changed lowercase x to big X. And as I said, you can just think of this as shorthand for that. It's easier to write this and to work with it once you understand what it means, but it's not essentially a different concept. Okay, so that's a random variable, right? E(Y given X) is a random variable, and it's a function of capital X. And let's just quickly see what happens if we then take, well, since it's a random variable, we can ask, what's its expectation? So if we now take E(E(Y given X)), that makes perfect sense to do, right? Because that's a random variable, I can take its expectation. E(E(Y given X)) = E(X/2) = E(X)/2, and the expected value of X is one-half, cuz X is uniform on (0, 1).
So one-half of one-half is one-fourth. And I also said at the very end last time, we didn't prove it yet, but we'll prove it today, that in general, not just for this problem, E(E(Y given X)) is just E(Y). So that's a quick way to get the expected value of the second piece after we break twice. And one-fourth seems pretty intuitive, right? Because on average, you're taking half the stick and then half of that again. So that seems reasonable, and as we've seen many times our intuitions can be wrong, but in this case it's pretty intuitive. And that calculation actually proves that it's true, at least once we know that this equals this, which I stated but we haven't proven yet, okay? So those are just a couple of quick examples. So now let's talk about the general properties of conditional expectation. There are three or four main properties, and once you understand those few properties, you can derive all kinds of stuff about conditional expectation. So these are very, very useful properties. I'll even write "useful properties," although we wouldn't do them if they were useless. Okay. Property one, similar to what we were just doing over there, but I just want to write it as a general statement. Take E of h of X times Y, given X. Now we know that if we have a constant in front, we can take it out, right? Well, in this case we're treating X as known. So this h of X is a random variable; h is just any function. It could be X cubed, e to the X, whatever. It's a function of X, we're treating X as known, so we know h of X, so we can take it out, because we're treating it as a constant. So that just becomes h of X times E of Y given X. That's really what we implicitly were doing up here: I took out the X squared, and we're left with a 1 inside, and the expected value of 1 given anything is 1, cuz it's always 1, okay? So that's called taking out what's known.
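The stick-breaking calculation is easy to check numerically. Here's a short sketch (editor-added, not from the lecture; sample size and seed are arbitrary) confirming that E(Y) = 1/4:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# First break point: X ~ Unif(0, 1); we keep the piece [0, X].
x = rng.uniform(0, 1, n)
# Second break point: Y | X = x is Unif(0, x).
y = rng.uniform(0, x)

# E(Y | X) = X/2, so by iterated expectation E(Y) = E(X)/2 = 1/4.
mean_y = y.mean()
```

The simulated mean lands very close to 0.25, matching E(E(Y|X)) = E(X/2) = 1/4.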
Okay, so we use that a lot to simplify: when we see a function of X there and we're conditioning on X, we can take it out. That's good. Okay, and secondly, E of Y given X equals E of Y if X and Y are independent. This is not if and only if. We just saw an example over there where E of X given X squared is zero, and we know that X and X squared are not independent. But if they are independent, then we can just get rid of the condition. That's clear from the definition, right, because the definition says that we work with the conditional distribution of Y given X, and that's no different from the unconditional one, because they're independent. So being given X doesn't help at all for predicting Y. So then that's just true. Okay. The third one is the one we just stated: E of E of Y given X equals E of Y. So we take the conditional expectation, take its expectation, and we get the unconditional expectation. This one needs some proof. This one goes by different names depending on where you look, but I'll call it either iterated expectation or, in this department, we like to call it Adam's law, for reasons that we might get to later. Anyway, whatever you call it, it's an extremely useful fact. Its main use is not to read it left to right; it's more useful read the other way around, just like the law of total probability. Why do we care about conditional probability? Well, one reason is just that we gather evidence and we condition on the evidence, right? But the other reason is that even when we want an unconditional probability, we keep using the law of total probability to reduce it down to conditional ones, right? This is analogous to that; in fact, it's a generalization of the law of total probability.
So it says, if we want the expected value of Y but we don't know how to get it directly, we can try to cleverly choose X to make the problem simpler, where E of Y given X is simpler to work with, and then take the expected value of that. That's basically what we did over here: for E of Y given X, I could immediately just write down that it's X over 2, right, cuz we know that conditional distribution. It's harder to say right away what the unconditional distribution of Y is, right, cuz it has the conditional structure built in. So that's an extremely useful property. We'll prove this in a few minutes. I just want to state one more property. And that one is that if we take Y minus E of Y given X, which is a natural thing to look at because we're thinking of E of Y given X as the prediction, that is, we're using X to predict Y. So Y minus that is just how far off the prediction is, right? That's the actual value of Y minus the predicted value of Y, okay? And then the statement is that if we multiply this by any function h of X and take the expectation, you'll always get 0. In words, this says that this thing, Y minus E of Y given X, which in statistics is called a residual, it's just what's left over after you try to predict Y, is uncorrelated with any function of X. Because if we computed the covariance of these two things, I'll just write out the covariance of Y minus E of Y given X with h of X, using the definition of covariance: the expected value of this times this, minus E of this times E of this, right? The first term is exactly the thing we wrote here, E of (Y minus E of Y given X) times h of X. And the second term is minus E of (Y minus E of Y given X) times E of h of X. Looks complicated, but it will simplify: E of (Y minus E of Y given X) is 0.
We know that's 0, right, just by iterated expectation and linearity: that's E of Y minus E of Y, so that part is just 0, and we only have the first term. So in other words, what this shows is that the quantity in property (4) is actually the covariance of the residual with h of X, and the property says it's 0. We haven't shown that yet. So let me draw a picture to show geometrically what this says, and then I think we'll prove this property assuming the third one, and then we'll prove the third one, okay? All right, so first, here's a picture, and whether this picture makes sense or not kind of depends on how much linear algebra you've had. So if you haven't had much, then you can ignore the picture. But if you have, then this picture will help with your intuition for some of these properties, okay? So the picture is like this. We start with Y, and we're representing it as a point. A big part of the idea of linear algebra is that you start treating vectors in an abstract way, right? Where a vector could be a function. It could be a cow. It could be anything. It doesn't matter what it is; all that matters is the axioms, that you have certain operations, right? So if you have an operation on cows that satisfies the axioms of a vector space, then you treat them as vectors and you can just work with them, right? It's an axiomatic thing. And a big strength of that approach is that we all have at least some intuition for what goes on in R^2 and R^3, in Euclidean space, in the plane, and things like that. It's harder when you have infinite-dimensional spaces, or even four- or five-dimensional spaces, to figure out what's going on. But a lot of the geometric intuition still applies. Okay, so we're thinking of Y as a random variable, which, remember, formally speaking, is a function, but we are treating that function as if it's just a point or a vector, okay?
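The orthogonality claim in property (4) is also easy to check by simulation. Here's a small sketch (editor-added, using the stick-breaking example where E(Y|X) = X/2; the choices of h and the sample size are arbitrary) showing the residual is uncorrelated with a couple of functions of X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0, 1, n)
y = rng.uniform(0, x)        # Y | X ~ Unif(0, X), so E(Y|X) = X/2

residual = y - x / 2         # Y minus its projection E(Y|X)

# The residual should be uncorrelated with ANY function of X;
# try h(X) = X^2 and h(X) = e^X as two examples.
cov_with_x2 = np.cov(residual, x**2)[0, 1]
cov_with_exp = np.cov(residual, np.exp(x))[0, 1]
```

Both sample covariances come out essentially zero, consistent with the residual vector being perpendicular to the plane of functions of X.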
And then in our picture, I'm gonna draw a plane, and it's not literally a plane, but we're just visualizing it as a plane. This plane consists of all possible functions of X. So it's a collection of random variables. It's a plane through the origin. It's not really a plane, but it goes through the origin because one function of X is just zero. And every constant is contained in here. And X is in here somewhere, and X squared is in here, and e to the X is in here. Any function of X, that's this plane. Okay, now what we're doing geometrically when we do conditional expectation is a projection. So we're taking Y and we project it down into the plane, to E(Y|X). E(Y|X) is a function of X, right, I keep emphasizing that. So E(Y|X) is in this plane. E(Y|X) is defined to be the point in this plane that's closest to Y, okay? So that tells us why it's true that if Y is already a function of X, then E(Y|X) = Y. Because that says if it's already in the plane, you don't need to project it anywhere. But if Y is not already a function of X, then you're projecting it down to whatever function of X is closest in a certain sense. The inner product of X and Y is E(XY). That's just for those of you who've seen inner products before; that's what it is. An inner product is just a fancy word for a dot product, right? You've all seen dot products for R^2 and R^3, and this is just the generalization of that concept. You can check that this has the properties of an inner product. The only assumption here is that all our random variables, we want to assume, have finite variance. So we're implicitly assuming finite variance in this picture. Okay, so anyway, we project Y to E(Y|X), and then just thinking geometrically, if we want this residual vector, Y - E(Y|X), that's just gonna be the vector from here to here.
Okay, and that, just from projection, you know how if I have a point above this table and I wanna project it down, I'm gonna go perpendicularly down till I hit the table, right? So that's perpendicular. So all that number (4) says in this picture is that this residual vector is perpendicular to the plane, right? So take any function of X, and this residual is gonna be perpendicular to it. So that's what it says geometrically. And let's see, what does statement (3) say? E(E(Y|X)): this is a function of X, we're taking its average, and we say we get E(Y). I'll talk more about the intuition for this one later, when we also get to the version of this for variance. But first, let's prove number (4), assuming number (3). Then we'll prove number (3). So that's just a picture to keep in mind, okay? That's not a proof, though, so we still need to prove these things. So let's just calculate. For the proof of (4), I just wanna calculate it and see if we get 0. Hopefully, we will. Okay, proof of (4). So let's just take this thing and use linearity. So I'm not gonna rewrite that whole expression, but I'm just gonna use linearity, distributing this times this minus this times this. So it's E(Yh(X)) - E(E(Y|X)h(X)). I'm just rewriting the same thing, except splitting it into two terms using linearity. Okay, now let's just try to see what we could do with this to simplify it. The first term looks as simple as we can get it, so just leave that alone. Let's try to simplify the second part, E of E(Y|X) times h(X). Well, I really wanna apply Adam's law, cuz I have this E of E of something, but then there's this h(X) here, okay? So I can't directly apply it. So what do you think we should do with h(X)? >> [INAUDIBLE] >> Here we know X, so let's actually put it back inside.
So that's called putting back what's known; it follows from taking out what's known that you can put back what's known. Right, we're treating X as known, so I can write h(X) outside the inner expectation or write it inside, it's fine, okay? So now we have E(E(h(X)Y|X)), right? Now it's exactly of the form where we can apply Adam's law, right? So that's E(Yh(X)) - E(Yh(X)) = 0. Okay, so it's really just linearity, taking out what's known or putting back what's known, and iterated expectation. That proves property (4), assuming iterated expectation. Okay, so now we really need to check that this iterated expectation formula is correct. Okay, so let's do that. Just to simplify notation, let's do it in the discrete case; the continuous case is analogous. Proof of (3), discrete case, just to simplify our notation. So we're trying to find E of E of Y given X, so let's give E of Y given X a name: g of X. Remember, it's a function of X, as I keep saying, so we may as well call it g of X. So really all we're trying to do here is find E of g of X, and show that it reduces to E of Y, right? Okay, so let's just do that. E of g of X, well, we've dealt with things that look like E of g of X many times before, just LOTUS, right? So let's just write down discrete LOTUS; in the continuous case, we could write down continuous LOTUS. So by LOTUS, that's just the sum over x of g of little x times the probability that X = little x. Now, let's write down what g of little x is. g of little x is E of Y given X = little x; this is how we defined conditional expectation. We started by conditioning on big X = little x, called that g of little x, and then we changed little x to big X to get g of capital X. So that's just what g of little x is. Okay, so I just used the definition.
So far all I've done is used LOTUS and used the definition. Now, again, let's just use the definition of this, okay? Cuz I don't like memorizing proofs or anything, and I don't remember how to do this. So all I'm gonna do is just plug into the definition and hope it works, okay? So let's just see, what's the definition of this thing, E of Y given X = x? Well, again, I don't like memorizing definitions any more than I like memorizing proofs, but I know the definition of expectation, and conditional just means make it conditional. So I'm just gonna write down the definition: the sum over y of y times the conditional probability. Right, if we were unconditional, we would just put P of Y = y here, and that would just be E of Y, but it's conditional, so we put P of Y = y given X = x. All right, and then we have this probability that X = x, which is outside of that sum. But actually, if we want, we can bring it inside the sum, okay, because it depends only on x and we're summing over y. So now we're hoping that this will reduce down to just the expected value of Y, which means all of these x's somehow have to go away. So a very, very common trick when we're dealing with a double sum or a double integral is to swap the order of summation or swap the order of integration, okay? Especially in the discrete case, as long as the sum converges absolutely, it's a completely valid thing to do. You're just rearranging: a + b is b + a. So I'm gonna add up the same thing in a different order. I'm gonna sum over y first and then sum over x. That's the same thing as summing over x and then summing over y; we're just adding in a different order. Okay, and then inside, that's y times P of Y = y, X = x. I can write it this way because, remember, that's the joint PMF.
But remember, the joint PMF is the conditional PMF times the marginal PMF: that's the marginal PMF of X, and that's the conditional PMF of Y given X. So we multiply this thing times this thing, that's just the joint PMF, so we may as well write it that way. Okay, so now notice, since I swapped the order of summation, something good happens, which is that this y doesn't depend on x, so we can pull this y out. So that y goes right there, okay? So that's the sum over y of y times the sum over x of the joint PMF. Let's just stare at that inner sum. We have the joint PMF and we're summing over all x. Well, that's exactly how we got the marginal, right? To get a marginal from a joint, we just take the joint PMF and sum: if we sum over x, we get the marginal of Y; if we sum over y, we get the marginal of X. In this case, we're summing over x. Just like, remember those 2x2 tables we were drawing? Add up a row or add up a column and you get the marginal. So we're summing over x, and that gives us the marginal PMF of Y. So therefore, by definition, the whole thing is just E of Y. So really the only trick here was to write this as a double sum and then swap the order of summation, which is often a useful trick in proving things. Other than that, I just plugged into LOTUS, plugged into the definition, and used what conditional, marginal, and joint distributions are, which we talked about before, okay? So that is the proof of this property, and I want to do some more examples. First, one more definition of a conditional thing: the definition of conditional variance. We have conditional expectation, and I think this would be a good time to get to conditional variance. It's defined analogously. So, the variance of Y given X. Let's just try to think intuitively. One way we usually write variances is E of Y squared minus the square of E of Y.
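The discrete proof can be checked exactly on a small example. This sketch (editor-added; the joint PMF below is made up just for illustration) computes g(x) = E(Y | X = x), then E(g(X)) by LOTUS, and compares it with E(Y) computed directly from the marginal of Y:

```python
from fractions import Fraction as F

# A small made-up joint PMF p(x, y), using exact rational arithmetic.
pmf = {
    (0, 1): F(1, 8), (0, 2): F(1, 8),
    (1, 1): F(1, 4), (1, 2): F(1, 2),
}

# Marginal of X, and g(x) = E(Y | X = x) from the conditional PMF.
xs = {x for x, _ in pmf}
p_x = {x: sum(p for (xx, _), p in pmf.items() if xx == x) for x in xs}
g = {x: sum(y * p for (xx, y), p in pmf.items() if xx == x) / p_x[x]
     for x in xs}

# E(E(Y|X)) via LOTUS, versus E(Y) directly: Adam's law says they match.
lhs = sum(g[x] * p_x[x] for x in xs)
rhs = sum(y * p for (_, y), p in pmf.items())
```

Because everything is a `Fraction`, the two sides agree exactly, not just up to rounding; swapping the order of summation is what makes the x's disappear.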
So let's just write down the same thing, except make everything given X, right? Because this says we get to treat X as known, and given that information, what's the variance of Y, okay? So a natural thing to do would be E of Y squared given X, minus the square of E of Y given X, which is correct. But remember, we also defined variance a different way: the expected squared difference from the mean. So let's try to also write down that definition. If it were unconditional, we would do Y minus its mean and square that thing. We're trying to make it conditional, so we write Y minus, we get to treat X as known, so instead of E of Y we use E of Y given X to make it conditional, and we square this thing. Now, if I just closed the parentheses there, that would be wrong. And you can immediately see it would be wrong just by thinking about what kind of object it is. If I just put a closed parenthesis there, then that's just gonna be a number, cuz this is a random variable and taking its expectation gives a number. But this expression makes it clear that the variance of Y given X should be a function of X, since we're treating X as known, okay? So that means we need another given X here: the outer expectation also has to be given X. All this is saying is that throughout, everything is given X. We can't forget one of the given X's; everything is based on the assumption that we know X. Okay, so I just wrote that these two things are equal; we didn't prove that they're equal. It would be something kind of strange if they were not equal, cuz intuitively we're just doing variance except everything is given X. So it should work out, but it should still be checked. It's good practice, so I'll probably put this on the next strategic practice, cuz it is good practice to check that this equals this.
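As a numerical sanity check (editor-added; the value x = 0.7 and the sample size are arbitrary), we can condition on X = x in the stick-breaking example, where Y given X = x is Unif(0, x), and confirm that the two definitions of conditional variance agree and both match the known value x^2/12:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 0.7                             # condition on X = x (treated as known)
y = rng.uniform(0, x, 1_000_000)    # draws from Y | X = x, i.e. Unif(0, x)

# Definition 1: E(Y^2 | X=x) - (E(Y | X=x))^2
v1 = np.mean(y**2) - np.mean(y) ** 2
# Definition 2: E[(Y - E(Y|X=x))^2 | X=x]
v2 = np.mean((y - np.mean(y)) ** 2)

# Both should match Var(Unif(0, x)) = x^2 / 12.
exact = x**2 / 12
```

The two sample quantities agree to floating-point precision, and both sit right on x^2/12, as the algebra says they should.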
Just practice, because at this point we've reduced it to conditional expectation, so you can just use the properties of conditional expectation, all right? So that's conditional variance, and then, okay, now we have one more property, property 5. I'll write it up there, easier to see. Those were four properties of conditional expectation, and it would be sad not to get at least one property of conditional variance. So the property is that the variance of Y equals the expected value of the conditional variance, plus the variance of the conditional expectation. So it's a pretty cool-looking formula, right? This is the unconditional variance of Y. So imagine we have this quantity Y and we want its variance and we don't know how to get it. So we wanna condition on something to make the problem easier. We condition on X, but then should we take the variance of Y given X first? Or should we take the conditional expectation and then take the variance of that? Not so obvious, right? This says: do it both ways and add them together. This property is called Eve's law because of the letters EVVE, the E of the variance plus the variance of the E, which we abbreviate to Eve. And that explains some of the etymology here for Adam's law, especially when you see the proof of Eve's law, which is also very, very good practice. So I'm going to put this on the next strategic practice too. You should try it yourself first, then you can study the proof that I'll put in the strategic practice. Let me explain the intuition of this a little bit, and then we'll do an example of how to use Adam's law and Eve's law together to get a mean and variance. So here's kind of an intuitive picture. Imagine we have different groups of people, okay? Just to have a simple picture in mind, let's say we have three groups, okay? And then there are lots of people inside each group.
And then, just to have a concrete example in mind, maybe think of Y as height. So you have some population of people which consists of three subpopulations. You wanna know the mean and the variance of the heights of people in this population, or make up your own example. So these are the three subpopulations, and each subpopulation may have its own mean and variance, right? But you want the overall ones, okay? So it's kinda hard to think about this entire population all at once. It's much easier to think about each subpopulation, all right? So notice that there are two types of variability going on. One is that different subpopulations may have differences in height, right? So we have differences between populations. Then you have variability within each population, right? So within each population, unless everyone in that subpopulation is the same height, you have variability within each of these, and you have variability between them, okay? So I'm thinking of X in this case, it'll be like, this would be X = 1, X = 2, X = 3. So X takes three values; X just says, if you take a random person from this population, which subpopulation are they in, okay? So that would be the X. So if we do E of Y given X = 1, that would just be the mean for that subpopulation, right? So really, what this says is: one term is within, one is between. This term says, look within each population, take its variance, and then average those numbers; that's the within part. The other term says, replace each population by just its average height and take the variance of those numbers; that's really looking between populations. So it is pretty intuitive that there are those two types of variability, but what's kind of cool is that this says you can just add them, right?
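The within-plus-between decomposition can be verified exactly on a tiny made-up population (editor-added sketch; the three groups and their heights are invented for illustration). X is which group a uniformly random person belongs to, Y is their height:

```python
import numpy as np

# Three made-up subpopulations of heights (cm); Y is the height of a
# person drawn uniformly at random from the pooled population.
groups = [np.array([160., 162., 158.]),
          np.array([170., 175., 165., 170.]),
          np.array([180., 181., 179.])]
pooled = np.concatenate(groups)
sizes = np.array([len(g) for g in groups])
weights = sizes / sizes.sum()            # P(X = x) for a random person

# Within term: E[Var(Y|X)] = weighted average of within-group variances.
within = np.sum(weights * np.array([g.var() for g in groups]))

# Between term: Var(E(Y|X)) = weighted variance of the group means.
means = np.array([g.mean() for g in groups])
grand = np.sum(weights * means)
between = np.sum(weights * (means - grand) ** 2)

total = pooled.var()                     # unconditional Var(Y)
```

Eve's law holds exactly here: the pooled variance equals the within term plus the between term, with no approximation.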
Intuitively there are two types of variability, but it's pretty nice that to get the overall variance, you just add those two things. It sounds too good to be true, but that's how it works. Okay, so let me do an example. All right, so imagine that we're studying the prevalence of a certain disease, and you have some country, or let's say some state, that consists of different cities, and different cities have different prevalences of the disease, okay? So this is just an example. Here is how you do your sampling, and sometimes this is done in practice. If you were studying the entire state, ideally maybe you would get a simple random sample of people from the state and see how many of them have the disease. Or maybe rather than using a simple random sample you stratify in certain ways and so on; you can get into that in a sampling class, which is not our topic here. But sometimes, for practical or other reasons, the way these kinds of things work is: you pick a random city, right? And then go into the city and collect a sample from the city; it's easier, right? And then we wanna make some conclusions. All right, so just to formalize that, what I'm saying is: we pick a random city in some state, and then pick a random sample of people in that city. And then you test each of those people for the disease that you're studying. Let's say this is a random sample consisting of n people. So n is our sample size, which we treat as fixed. Okay, so pick a random city, then go to the city, get n people, test them all for the disease, and let X equal the number of people with the disease in the sample. And let's let Q be the true proportion of people infected in the random city.
So once we've selected the random city, Q is just literally how many people in that city have the disease, divided by the number of people in that city. But I'm using a capital letter cuz initially it's a random variable: different cities have different prevalences of the disease, and we're picking a random city, so that's a random variable. So you can think of this as a random probability, right? Cuz this is gonna be a number between 0 and 1, which is the proportion of people in the random city who have the disease. And by that I mean people with the disease, not cities with the disease. So there are a lot of questions we could ask; this setup is pretty general. You can see how this kind of setup has lots of applications in epidemiology. But it doesn't have to be disease; it could be political opinions or whatever you want. It's similar to what I was just saying over here, in that we're assuming there's variability between cities: different cities have different political opinions or different disease characteristics. And within each city there's also variation, right? So we have those two types of variation; how do we deal with that? Okay, so there are lots of things we could ask about this setup, but for right now, let's just find the mean and the variance of X. To do that, though, we need some assumption about the distribution of Q, okay? So the most commonly used choice in practice would be to pick a Beta distribution. So we're gonna assume that Q is Beta(a, b), where a and b are known. Because, as we were saying when we were doing the Beta, it's a very flexible family. It takes continuous values between 0 and 1, and we know Q has to be between 0 and 1. And we know the Beta is the conjugate prior for the binomial, so it has a lot of nice properties. So it's mathematically convenient, but it's also a pretty flexible family to work with.
By playing around with a and b, you can get a variety of distributions that hopefully would accurately reflect what the distribution of Q is like. So we'll assume a Beta. You can assume something else if you want and then do a similar calculation, but the Beta would be the most popular choice here, and it also happens to be convenient, okay. So that's Q. We're also implicitly assuming, I mean it's basically said in words here, that X given Q is Binomial(n, Q). That is, once we know the true value of what proportion of people in that city have the disease, then we have a Binomial. Hypergeometric might be a little bit better, but we're either assuming sampling with replacement or that n is small enough compared to the population size that it's essentially Binomial, okay. So now we're all set to find what we want. E(X): it's very, very natural here to use conditional expectation, right? Because the whole problem was set up in a way where, conditional on which city you're in, we have a good sense of what's going on. Unconditionally, you have to kinda combine all these different cities, and that's harder to think about; it's easier if you zoom in on one city first. So that suggests: just condition on Q. So this is gonna be E(E(X given Q)). Well, given Q, we just have a Binomial(n, Q), and the expected value of a Binomial(n, Q) is nQ. So that's just E(nQ), and n is just a constant, so then it's just na/(a+b), because a Beta(a, b) has expected value a/(a+b). So it's a quick calculation at that point, once you condition. All right, so now let's do the variance. Again, we're gonna do this by thinking conditionally: we have a between-cities and a within-city term. Eve's law says this is the expected value of the variance of X given Q, plus the variance of the expected value of X given Q.
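To sanity-check that E(X) = na/(a+b), here's a small Monte Carlo sketch of the two-stage experiment (my own check, not from the lecture, with made-up values a = 2, b = 5, n = 30).

```python
# Simulate the hierarchy: Q ~ Beta(a, b), then X | Q ~ Binomial(n, Q).
# Adam's law says E(X) = E(E(X|Q)) = E(nQ) = n * a/(a+b).
import random

random.seed(42)
a, b, n = 2.0, 5.0, 30
trials = 100_000

total = 0
for _ in range(trials):
    q = random.betavariate(a, b)                    # random city's prevalence
    x = sum(random.random() < q for _ in range(n))  # Binomial(n, q) draw
    total += x

mc_mean = total / trials
exact_mean = n * a / (a + b)
print(mc_mean, exact_mean)  # the two should agree closely
```

The point of the sketch is just that conditioning on Q first (the inner Binomial step) and then averaging over cities reproduces the unconditional mean.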
Then we just have to work out what these two things are. Okay, so for the first term: X given Q is just a Binomial, and we know that, if we treat Q as a constant, this has variance nQ(1-Q), right? So let's just write that down: the variance of X given Q is just nQ(1-Q); I just get that from the Binomial. For the other term, the variance of E(X given Q): well, we just said that E(X given Q) is nQ, so I just want the variance of nQ. The n comes out squared, so that just gives me n squared times the variance of Q. Now we just have to compute those two things. And for the Beta distribution, they're actually both pretty easy, because when we see this Q(1-Q), that kinda reminds you of what the Beta looks like, right? So let's just do those on the side; we just need to compute those two quantities and then we're done. The n comes out of the first term, so all we need is E(Q(1-Q)). Okay, so let's just do that, just for quick practice with a Beta and with LOTUS. Well, one thing we could do is just say this is E(Q) - E(Q squared), and we already know E(Q) and we could get E(Q squared); that would work fine. But let's just do it directly using LOTUS. So I'm just gonna write down LOTUS, changing capital Q to lowercase q: we integrate q(1-q) times the Beta density. And let's simplify: the Beta density has q to the a-1, but we're multiplying by q, so now it's q to the a. And it has (1-q) to the b-1, but we also have this (1-q), so that becomes (1-q) to the b, dq, times the normalizing constant of the Beta, which is gamma(a+b) over gamma(a) gamma(b). Well, it looks like a complicated thing until you realize that it's just another Beta integral, right? So this is just gamma(a+b) over gamma(a) gamma(b).
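As a hedged numerical sketch (mine, not from the lecture): the LOTUS integral E[Q(1-Q)] = ∫ q(1-q) f(q) dq can be checked against a plain Monte Carlo average of Q(1-Q), again with made-up values a = 2, b = 5.

```python
# Check LOTUS numerically: integrate q(1-q) against the Beta(a, b) density
# (midpoint Riemann sum) and compare with a Monte Carlo average of Q(1-Q).
import math
import random

random.seed(0)
a, b = 2.0, 5.0

def beta_pdf(q, a, b):
    # Beta density, with normalizing constant gamma(a+b) / (gamma(a) gamma(b))
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * q ** (a - 1) * (1 - q) ** (b - 1)

# LOTUS: E[g(Q)] = integral of g(q) * f(q) dq, here with g(q) = q(1-q)
m = 100_000
lotus = 0.0
for k in range(m):
    q = (k + 0.5) / m  # midpoint of the k-th subinterval
    lotus += q * (1 - q) * beta_pdf(q, a, b) / m

# Monte Carlo: just average g(Q) over draws of Q
draws = 100_000
mc = 0.0
for _ in range(draws):
    q = random.betavariate(a, b)
    mc += q * (1 - q) / draws

print(lotus, mc)  # both approximate E[Q(1-Q)]
```

Both routes approximate the same number, which is the point of LOTUS: you never need the distribution of Q(1-Q) itself, just the density of Q.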
And then we just have to multiply and divide by whatever we need in order to make this exactly the integral of a Beta PDF. So I'm imagining putting in the normalizing constant of the Beta, which in this case is a Beta(a+1, b+1). So this is gonna be gamma(a+1) gamma(b+1) divided by gamma(a+b+2), times 1, because I'm multiplying and dividing by this thing so that what's left is exactly the integral of a Beta density. Okay, so now let's simplify this thing that looks ugly with the gammas, using the fact that gamma(x+1) = x gamma(x) a bunch of times. Gamma(a+1) is a gamma(a), so the gamma(a) is going to cancel. Gamma(b+1) is b gamma(b), so there's going to be a b there. And in the denominator, gamma(a+b+2) is (a+b+1) gamma(a+b+1), but gamma(a+b+1) is (a+b) gamma(a+b), so the gamma(a+b) cancels too. So we just get an expression in terms of a and b: ab over (a+b)(a+b+1). And similarly, you can get the variance of Q. A nice way to write the variance of a Beta is mu(1-mu) over a+b+1, where mu is the mean of the Beta, so mu = a/(a+b); you check this in exactly the same way, okay? So then we're basically done; you can do some algebra to simplify, but if I just plug those two things in, that's the answer. Okay, so that's all for today.
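Putting the pieces together, a final sketch (again my own check, with made-up a = 2, b = 5, n = 30) compares the closed-form variance from Eve's law with a straight simulation of the two-stage experiment.

```python
# Eve's law: Var(X) = E(Var(X|Q)) + Var(E(X|Q))
#                   = n * E[Q(1-Q)] + n^2 * Var(Q)
#                   = n*ab/((a+b)(a+b+1)) + n^2 * mu*(1-mu)/(a+b+1),
# where mu = a/(a+b).
import random

random.seed(1)
a, b, n = 2.0, 5.0, 30
mu = a / (a + b)

exact_var = (n * a * b / ((a + b) * (a + b + 1))
             + n ** 2 * mu * (1 - mu) / (a + b + 1))

# simulate the two-stage experiment and compute the sample variance of X
trials = 100_000
xs = []
for _ in range(trials):
    q = random.betavariate(a, b)                         # random city
    xs.append(sum(random.random() < q for _ in range(n)))  # Binomial(n, q)

mean_x = sum(xs) / trials
sample_var = sum((x - mean_x) ** 2 for x in xs) / (trials - 1)
print(sample_var, exact_var)  # should be close
```

Notice that for these made-up parameters the between-city term n² Var(Q) contributes much more than the within-city term, which matches the intuition that most of the variability comes from which city you happened to pick.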
Info
Channel: Harvard University
Views: 54,262
Rating: 4.8600583 out of 5
Keywords: harvard, statistics, stat, math, probability, conditional expectation, adam's law, eve's law
Id: gjBvCiRt8QA
Length: 50min 33sec (3033 seconds)
Published: Mon Apr 29 2013