All right. So let's get started. So we're still talking about
conditional expectation, right? And so today we'll finish conditional
expectation as a topic in its own right, which of course, doesn't mean you can
then forget it because everything in this course is about
thinking conditionally. But as its own topic,
we'll finish that today. So I wanted to start with just
a couple quick examples of conditional expectation where you're
conditioning on a random variable. Last time we were talking about
conditioning on an event versus conditioning on a random variable. So we'll do a couple quick examples,
then derive some properties, and then do some more difficult examples. But just to start with a couple easy
examples just to help get the notation and concepts in mind. So, here is a simple example. Let's just start with a normal:
X is standard normal. And let's let Y = X squared, okay? And then suppose we want E(Y given X). This is just practice,
you know, what does the concept mean? E(Y given X) is E(X squared given X). This notation means we get
to treat X as known and then we try to give our best
prediction for X squared. Best is in the sense of
minimizing mean squared error. But in a certain sense,
it's the best prediction. If we know X, we know X squared, so obviously our best prediction would
be X squared, which equals Y. Okay, so now that's a very easy
calculation, but if we didn't get X squared here, then there'd be something
very suspicious going on, right? We get to know X, and somehow predicting
something else doesn't make sense. So this should be very very clear. Now let's see what would happen
if we went the other way around. Same example, but now let's do E(X
given Y) instead of E(Y given X). So that's E(X given X squared). So we get to observe X squared. Now we treat X squared as known,
but we don't know X. Okay what do you think this is? Zero, why? Negative, yeah. This is just 0 and
you can do some big calculation, but you shouldn't have to cuz you just think
about what's the conditional distribution. Since if we get to
observe that in fact X squared = a, so we're treating it as known, so
I'm just going to call it a, that is, we get to know a,
then we know that x is plus or minus the square root of a, but
by symmetry those are equally likely. All right, this is only giving us
information about the magnitude, it's not giving us any
information about the sign. Since the normal is symmetric,
then it's equally likely. So this is equally likely to be square
root of a or minus square root of a. Those are equally likely. If you average square root of a and
minus square root of a you'll get zero. So this doesn't say that X and
X squared are independent, right? We saw before that they're uncorrelated. But they're definitely not independent. But this just says that X squared doesn't
help very much with predicting X. Just as a number in this sense, right. We know the magnitude but
we don't know anything about the sign so we just have to guess one number and
we may as well say zero. All right, let's do another example,
just another quick example. Okay, so we have these stick
breaking type problems. We have a stick,
let's say it has length 1, and break off a random piece, and
by random here I mean uniform. So we break off a piece,
throw out the other piece. So now we only have one random piece
then break that piece again, okay? So break off another piece. And suppose we want the expected value, or the conditional expectation, for
the length of the second piece. So in terms of the picture,
what we're doing is we're first picking x. I have to put it somewhere for
the sake of the picture. That's x, but let's assume that
x is uniform between 0 and 1. I'm just translating what I just
said into probability notation. So that's the first break point. So we break the stick here, and we keep
this part, throw out the other piece. And then,
now we just have this piece from 0 to x, then pick a random break
point in this piece. Let's say there, that's y. Okay, and the question is then,
what's the length of this piece, right, or the distribution, or the conditional
expectation, that kind of thing. So to write that out conditionally,
we would just write y given x is uniform (0, x). So this notation would not make any
sense if I didn't write given x here, right, because I just wanna
specify a distribution. But what this notation means, it's just
shorthand for saying that if we know that big X equals little x, then it's going
to be uniform between 0 and little x. And this is just short hand for that. But you can always think of it back
in terms of conditioning on big X equals little x. So that's just short hand saying
if we get to treat x as known, then we're picking a random
point from here to here. All right, so that's the setup. Okay, now let's compute E(Y given X=x). So that's going to be
a function of little x. This just says that we know
the first break point is here, call this point little x. This one is anywhere from zero
to little x uniformly, so on average it would be little x over 2. And so E(Y given capital X)= capital X /2 because we just changed
lowercase x to big X. And as I said, you can just think
of this as short hand for this. It's easier to write this and to work with
it once you understand what it means. But it's not essentially
a different concept. Okay so that's a random variable, right? E(Y given X) is a random variable and
it's a function of capital X. And let's just quickly see
what happens if we then take, so this is a random variable,
we can then ask, what's its expectation? So if we now take E(E(Y given X)), that makes perfect sense to do,
right? Because that's a random variable,
I can take its expectation. Expected value of x is one-half,
cuz that's uniform zero-one. So one-half of one-half is one-fourth. And I also said at the very end last time,
we didn't prove it yet, but we'll prove it today that just in
general not just for this problem. E(E (Y given X)) is just E(Y). So that would be a quick way to
get the expected value of that second piece after we break twice. And one-fourth seems pretty intuitive,
right? Because on average, you're taking half
the stick and then half the stick again. So that seems reasonable, but as we've
seen many times our intuitions could be wrong, but
in this case it's pretty intuitive. And that actually proves that that's true,
at least once we know that this equals this, which I stated but
we haven't proven yet, okay? So those are just a couple
of quick examples. So now let's talk about
kind of the general properties of conditional expectation. There's like three or four main properties
and once you understand those few properties then you can derive all kinds
of stuff about conditional expectation. So these are very, very useful properties. I'll even write useful properties. Although we wouldn't do
them if they were useless. Okay. Property one, similar to what we
were just doing over there, but I just want to kind of write
that as a general statement. If we have E of, let's say,
h of X times Y, given X. Now we know that if we have a constant
in front, we can take it out. Right. Well, in this case we're
treating X as known, so from our point of view this h of X is known. So this h of X is a random variable. h is just any function. Could be X cubed, e to the X, whatever. It's a function of X, we're treating
X as known, so we know h of X, so we can take it out,
because we're treating it as a constant. So that just becomes h of X,
E of Y, given X. So that's really what we
implicitly were doing up here, I just took out the X squared,
and we're left with a 1 inside, the expected value of 1 given anything
is 1, cuz it's always 1, okay? So that's called taking out what's known. Okay, so we use that a lot to simplify
when we see a function of X there and we're conditioning on X. We can take it out. That's good. Okay, and secondly,
E of Y given X equals E of Y, if X and Y are independent. This is not if and only if. We just saw an example over there, where E of X given X squared is zero,
we know that they're not independent, but if they are independent then we
can just get rid of the condition. That's just clear from the definition,
right, because the definition of this says that we work with
the conditional distribution given X. The conditional distribution of
Y given X is no different from the unconditional one,
because they're independent. So being given X doesn't help at all for
predicting Y. So then, that's just true. Okay. Third one is the one we just stated,
E of E of Y given X equals E of Y. So we have the conditional expectation. Take its expectation and
just get the unconditional expectation. This one needs some proof. [COUGH]
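Before the proof, this identity is easy to sanity-check numerically. Here is a small Monte Carlo sketch (my own illustration, not from the lecture) using the earlier stick-breaking setup, where X is Uniform(0, 1) and Y given X is Uniform(0, X), so E(Y given X) = X/2 and both sides should come out near 1/4:

```python
import random

random.seed(0)
N = 200_000

# Stick-breaking setup from earlier: X ~ Unif(0,1), Y | X ~ Unif(0, X).
xs = [random.random() for _ in range(N)]
ys = [random.uniform(0, x) for x in xs]

# Left side: E(E(Y|X)). Here E(Y|X) = X/2, so average X/2 over the draws.
lhs = sum(x / 2 for x in xs) / N
# Right side: E(Y), the plain average of the Y draws.
rhs = sum(ys) / N

print(lhs, rhs)  # both close to 0.25
```

Both averages land near 0.25, which is exactly the E(E(Y given X)) = E(Y) claim on this example.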
This one goes by different names, depending on where you look. But I'll call it either iterated
expectation or, in this department, we like to call it Adam's law, for
reasons that we might get to later. Anyway, whatever you call it,
it's an extremely useful fact. Its main use is not to say
that this equals this, maybe I should have written it
the other way around, this equals this. It's more useful the other way around,
just like in the law of total probability. And why do we care about
conditional probability? Well, one reason is just
that we gather evidence and we condition on the evidence, right? But the other reason is even when we want
an unconditional probability, we keep using the law of total probability and
reduce it down to conditional, right? This is analogous to that, this is actually a generalization
of the law of total probability. So it says we want the expected value
of Y, but we don't know how to do it, we can try to cleverly choose X to
make the problem simpler, where E of Y given X is simpler to work with,
and then take the expected value of that. That's basically what we did over here,
E of Y given X, I could immediately just write down that's X over 2, right,
cuz we know that conditional distribution. It's harder to just say right away, what's
the unconditional distribution of Y, right, cuz it has the conditional
structure built in. So that's an extremely useful property. We'll prove this in a few minutes. Just want to state one more property,
I think. And that one is that if we
take Y minus E of Y given X, that's a natural thing to look
at because we're thinking of this as the prediction,
that is we're using X to predict Y. E of Y given X is our prediction. So, Y minus that is just how far
off is the prediction, right? That's the actual value of Y,
minus the predicted value of Y, okay? And then the statement is that if we
multiply this by any function of X, you'll always get 0, where h
is any function of X. In words, this says that this thing, Y minus E of Y given X,
which in statistics is called a residual. It's just what's left over
after you try to predict Y and then the difference is uncorrelated
with any function of X. Because if we computed
the covariance of these two things, I'll just write out the covariance of
Y minus E of Y given X, with h of X, just the definition of covariance. The covariance would be exactly
this thing that we wrote here: E of, I'm just writing the same thing again, E of (Y minus E of Y given X) times h of X, and then minus, by the definition of covariance, the expected value of this times this,
minus E of this times E of this, right? So it's minus E of (Y minus E of Y
given X), looks complicated, but
it will simplify, times E of h of X. Just writing down what the covariance is. But this thing is 0. We know that's 0, right, just by
iterated expectation and linearity. That's E of Y minus E of Y,
so this part is just 0. So we only have this term. So in other words, what that shows is
that this is actually the covariance of this and this and it says it's 0. We haven't shown that yet. So let me draw a picture to show
geometrically what this says, and then I think we'll prove this property assuming the third one,
then we'll prove the third one, okay? All right, so first, here's a picture,
and whether this picture makes sense or not kind of depends on how much
linear algebra you've had. So, if you haven't had much then
you can ignore the picture. But if you have, then this picture
will help with your intuition for some of these properties, okay? So the picture is like this. We start with Y, and
we're representing it as a point. So the whole, not the whole idea, but a
big part of the idea of linear algebra is you start treating vectors
in an abstract way, right? Where a vector could be a function,
It could be a cow. It could be anything. Well, it doesn't matter what it is. All that matters is the axioms that
you have certain operations, right? So if you have an operation on cows that
satisfies the axioms of a vector space, then you treat them as vectors and
you need to just work with them, right? And so it's an axiomatic thing. And a big strength of
that approach is that we all have at least some intuition for
what goes on in r2 and r3, right, in Euclidean space,
in the plane, and things like that. It's harder when you have infinite
dimensional spaces, or even four or five dimensional spaces, it's harder
to figure out what's going on, like. But a lot of the geometric
intuition still applies. Okay, so we're thinking of this
as a random variable, which, remember, formally speaking,
a random variable is a function, but we are treating that function as if
it's just a point or a vector, okay. And then in our picture,
I'm gonna draw a plane, and it's not literally a plane, but
we're just visualizing it as a plane. This plane consists of all
possible functions of x. So it's a collection of random variables. It's a plane through the origin. It's not really a plane, but it goes through the origin because
one function of X is just zero. So every constant is contained in here. And X is in here somewhere, and X squared
is in here, and E to the X is in here. Any function of X, that's this plane. Okay, now what we're doing geometrically
when we do conditional expectation is a projection. So we're taking Y and we're going,
project it down into the plane, E (Y|X). E (Y|X) is a function of X, right. I keep emphasizing that. So E (Y|X) is in this plane. E(Y|X) is defined to be the point in
this plane that's closest to Y, okay? So that tells us why is it true, if Y is already function of X,
then E(Y|X) = Y. Because that says if it's
already in the plane, you don't need to project it anywhere. But if Y is not already a function of X,
then you're projecting it down to whatever function of X is
closest in a certain sense. The inner product of X and Y is E(XY). That's just for those of you who've seen
inner products before, that's what it is. The inner product is just a fancy word for
a dot product, right? You've all seen dot products
for r2 and r3, and that's just the generalization
of that concept. You can check that this has
the properties of an inner product. The only assumption here is that
we're working with functions of X. All our random variables we want
to assume have finite variance. So implicitly assuming finite
variance in this picture. Okay, so anyway, we project Y to
E(Y) given X, and then just thinking geometrically if we want
this residual vector, Y-E(Y|X) that's just gonna
be a vector like this, right, the vector from here to here. Okay, and that, just from projection, you know how if I have a point above
this table, and I wanna project it down, I'm gonna go perpendicularly
down til I hit the table, right? So that's perpendicular. So all this is saying, all number (4)
says in this picture is that this vector from here to here is
perpendicular to the plane, right? So take any function of X and
this residual is gonna be perpendicular. So that's what it says geometrically. And let's see,
what does this statement say? E(E(Y|X)). So this is a function of X, and we're taking
its average, and we say we get E(Y). I'll talk more about the intuition for this one later when we also get to
the version of this for variance. But first, let's prove number (4),
assuming number (3). Then we'll prove number (3). So that's just a picture to keep in mind,
okay? That's not a proof, though, so
we still need to prove these things. So let's just calculate that. For the proof of (4), I just wanna
calculate it and see if we get 0. Hopefully, we will. Okay, proof of (4). So let's just take this thing and
use linearity. So I'm not gonna rewrite
that whole expression, but I'm just gonna use linearity. I'm just gonna look at
what I'm gonna distribute, this times this minus this times this. So it's E(Yh(X)) - E(E(Y|X)h(X)). I'm just rewriting the same
thing except splitting it into two terms using linearity. Okay, now let's just try to see what could we do with this to try to simplify it. This looks as simple as we can get it. So just leave that alone. Let's try to simplify this part. E of something. Well, I kind of like really wanna apply
Adam's law, cuz I have this E(E), but then there's like this h(X) here,
okay? So I can't directly apply it. So what do you think we
should do with h(X)? >> [INAUDIBLE]
>> Here we know X, so let's actually put it back. So that's called putting
back what's known, but it follows from taking out what's
known that you can put back what's known, Right, we're treating that as known so I can write it here,
write it there, it's fine, okay? So now we have E(E(h(X)Y|X)), right? Now it's exactly, well, I should
have put it back over here, right, move it over there. That should be Y times h(X). So that's of the form where we
can apply Adam's law now, right? So that's E(Yh(X)) - E(Yh(X)) = 0. Okay, so it's really just linearity,
taking out what's known or putting back what's known,
and iterated expectation. That proves property (4),
assuming iterated expectation. Okay, so now we really need to check that this iterated expectation
formula is correct. Okay, so let's do that. Okay, so just to simplify notation, let's do it in the discrete case. The continuous case is analogous. Proof of (3), discrete case, just to simplify our notation. So we're trying to find
E of E of Y given X, so let's give it a name: g of X, okay? So we're gonna let E of Y given X, remember, it's a function
of X as I keep saying, so we may as well give it a name, g of X. So really all we're trying to do here,
that's just E of g of X, that's all we're trying to do, okay. So we need to find the E of g of X, and show that that just reduces to E of Y,
right. Okay, so let's just do that, E of g of X,
well, we've dealt with things that looks like E of g of X many times
before already, just LOTUS, right. So lets just write down discrete LOTUS,
in the continuous case, we could write down continuous LOTUS. So by LOTUS,
that's just the sum over X of, g of X times the probability
that X = little x. Now, let's write down
what's g of little x. g of little x is E of Y given X = little x, this is
how we defined conditional expectation. We started by conditioning on
big X = little x, called that g of little x. Then we changed little x to
big X to get g of capital X. So that's just what g of little x is. Okay, so I just used the definition. So far all I've done is used LOTUS,
used the definition. Now, again, let's just use
the definition of this, okay? Cuz I don't like memorizing proofs or
anything, and I don't remember how to do this. So all I'm gonna do is just plug into
the definition and hope it works, okay? So, let's just see, what's the definition
of this thing, E of Y given X = x? Well, again, I don't like memorizing
definitions any more than I like memorizing proofs, but
I know the definition of expectation, and then conditional just
means make it conditional. So I'm just gonna write down
the definition: the sum over y of y. Right, if we're unconditional,
we would just do P of Y = y here, right? That would just be E of Y, but it's
conditional, so we just put given X = x. All right, and then we have this probability that X = x here,
which is outside of that sum. But if we want,
we can bring it inside of the sum, okay? Because it's
a function of x and we're summing over y. So somehow we're hoping that this
will reduce down to just the expected value of Y. So somehow we have to get rid of all of
these xs here somehow have to go away. So a very, very common trick when
we're dealing with a double sum or a double integral is to swap
the order of summation or swap the order of integration, okay? Especially in the discrete case that
as long as the sums converge absolutely, it's a completely valid thing to do. You're just rearranging,
the whole a + b is b + a thing. So I'm gonna add up the same
thing in a different order. So I'm just gonna say sum over y first and
then sum over x. That's the same thing as summing
over x and then summing over y. We're just rearranging terms. We're just adding in a different order. Okay, and then, that's y times P(Y = y, X = x). Let me write it this way. I can write it this way because
remember that's the joint PMF. But remember the joint PMF is the
conditional PMF times the marginal PMF, that's the marginal, right? That's the marginal PMF of X,
that's the conditional PMF of Y given X. So we multiply this thing times this
thing, that's just the joint PMF, so we may as well write it that way. Okay, so, Now, notice since I swapped the order of
summation, something good happens, which is that this y doesn't depend on x,
so we can pull this y out. So that y goes right there, okay? So that's the sum over y of y. So just imagine pulling this y out and
let's just stare at this sum here. We have the joint PMF and
we're summing up over all x. Well, that's exactly how
we got the marginal, right? To get a marginal from a joint, we just
took the joint PMF and we sum up over X, we'll get the marginal of Y, if we sum
over Y, we'll get the marginal of X. In this case, we're summing up over X. Just like remember those
2x2 tables we were drawing? Just add up a row or add up a column
to get the marginal? So we're summing up over x, that gives
us the marginal distribution of Y. So therefore,
by definition that's just E of Y. So really the only trick here was
to write this as a double sum and then swap the order of summation which is
often a useful trick in proving things. Other than that, I just plugged
into LOTUS, plugged into the definition, and just used what a conditional
distribution and marginal and
talked about before, okay? So that is the proof of this property,
and I want to do some more examples. First, one more, One more
definition of a conditional thing. So definition of conditional variance,
cuz we have conditional expectation, and I think this would be a good time
to get to conditional variance. It's defined analogously. So, the variance of Y given X. Let's just try to think intuitively. Either we could write
it as E of Y squared. Usually the way we do variances,
E of Y squared- the square of E of Y. So let's just write down the same thing,
except make everything given x, right. Because this says, we get to treat
X as known, given that information, what's the variance of Y, okay? So a natural thing to do
would be E of Y squared given X, minus the square of E of Y given X,
which is correct. But remember we also defined
variance a different way, that was the expected squared
difference from the mean. So let's try to also write
that down, that definition. If we define it the other way, we did E of (Y - E of Y) squared. If it was just unconditional, we would
do Y minus its mean and square that thing. However, we're trying to
make it conditional, so we're gonna put Y -, we get to treat X as
known, so instead of E of Y, we're using E of Y given X to make it conditional,
and we're squaring this thing. Now if I just close the parentheses here,
that would be wrong. And you could immediately see that would
be wrong just by thinking about what kind of object this is. If I just put a closed parenthesis here, then that's just gonna be a number,
cuz this is a random variable. Taking its expected value,
we'll get a number. But actually this equation or
just this expression makes it clear that the variance of Y given
X should be a function of x, as we're treating x as known, okay? And then what's the variance
of Y as a function of x? That means we need another given X here. So all this is saying is, like, throughout
this problem, everything is given x. So we can't forget one of the given xs, everything is based on
the assumption that we know x. Okay, so I just wrote that
these two things are equal, we didn't prove that these are equal. It would be
kind of strange if they were not equal. Cuz intuitively, we're just doing
variance except everything is given x. So it should work out, but
it should still be checked. It's good practice, so I'll probably
put this on the next strategic practice cuz it is good practice
to check that this equals this. Just practice with, because at this point
we reduced it to conditional expectation, so you can just use the properties of
conditional expectation, all right? So that's variance and then, okay, now we have one more property, property 5, I'll write it up there, easier to see. These are four properties of
conditional expectation, and it would be sad not to get at least
one property of conditional variance. So the property is that the variance
of y equals the expected value of the conditional variance, plus the
variance of the conditional expectation. So it's a pretty cool-looking formula,
right? This is the unconditional variance of Y. And somehow we want to like,
so imagine we're trying, we have this quantity y and we want it's
variance and we don't know how to get it. So we kinda want to do some condition on
something to make the problem easier. So the condition on x, but then should
we do the variance of y given x first? Or should we take
the conditional expectation and then take the variance of that? Not so obvious, right? This says two ways you could do this,
add them together. This property is called
Eve's law, because it's EVVE: E of the variance plus variance of the E. That should really be Evve, but
we abbreviate it to Eve, Eve's law, okay? So, that explains some of
the etymology here for Adam's law. Especially when you see
the proof of Eve's law, which is also very very good
practice to prove this. So, I'm going to put this on
the next strategic practice, too. You should try it yourself first, then you can study the proof that I
will put in the strategic practice. Let me explain the intuition of this
a little bit and then, we'll do an example of how to use Adam's law and Eve's law
together to get the mean and variance. So here's kind of an intuitive picture. Imagine we have different
groups of people, okay? Just to have a simple picture in mind,
let's say we have three groups, okay? And then there's lots of
people inside each group. And then, just have a concrete example
in mind, maybe think of y as height, so you have some population of people
which consists of three subpopulations. You wanna know the mean and
the variance for the heights of people in this population,
or make up your own example. So these are the three subpopulations and each subpopulation may have its
own mean and variance, right? But you want the overall, okay? So it's kinda hard to think about
this entire population all at once. It's much easier to think about
each subpopulation, all right? So notice that there are two
types of variability going on. One is that different subpopulations
may have differences in height, right? So we have differences
between populations, then you have variability
within each population, right? So within each population, unless everyone
in that subpopulation is the same height, you have variability
within each of these and you have variability between them, okay? So if you want, I'm thinking of
x in this case, it'll be like, this would be x = 1, x = 2, x = 3. So x takes three values, x just says
which, if you take a random person from this population,
which sub-population are they in, okay? So that would be the x. So if we do E of Y given X = 1, that would just be the mean for
this population, right? So really, what this says is,
this term here is saying look at the average
within each population. And then take the average,
take the variance of those numbers, right? So that's really looking
between populations. And this is saying look
within each population, this one's within, this one's between. This says look within each population,
take its variance and then average those numbers. This one says look, Replace each population by just its average height
and take the variance of those, okay? So it is pretty intuitive that there
are those two types of variabilities, but what's kind of cool is that this just
says you can just add them, right? Intuitively there's two
types of variability, but it's pretty nice that to get the overall
variance, you just add those two things. It sounds too good to be true,
but that's how it works. Okay, so let me do an example. All right, so imagine that we're studying
prevalence of a certain disease and the fraction of people, so
you have some country, or let's say some state that
consists of different cities, and different cities have different
prevalences of the disease, okay? So, this is just an example. So, suppose that you pick
a sample; basically, here is how you do your sampling. Sometimes this is done in practice cuz,
like, I guess if you were studying the entire state, ideally maybe you
would get a simple random sample of people from the state and you want to
see how many of them have the disease. Or maybe rather than using simple random
sample you stratify in certain ways and so on. You can get into that in a sampling
class, which is not our topic here. But sometimes, for practical or
other reasons, the way that these kinds of things work is,
you pick a random city, right? And then go into the city, collect the
sample from the city, it's easier, right? And then we wanna make some conclusions. All right, so
just to try to formalize that, what I'm saying is we pick
a random city in some state. And then pick a random sample
of people in that city. And then, you test each of those people for
the disease that you're studying. Let's say this is a random
sample consisting of n people. So n is our sample size
which we treat as fixed. Okay, so pick a random city, then go to
the city, get n people, test them all for the disease and let X equal number of
people with the disease in the sample. And let's let Q be the true proportion of people infected in the random city. So that is, once we've
selected the random city, then Q is just literally how many people in that city have the disease divided
by the number of people in that city. But I'm using a capital letter cuz
initially it's a random variable, because different cities have different
prevalences of the disease, and we're picking a random city,
so that's a random variable. So you can think of this as
a random probability, right? Cuz this is gonna be
a number between 0 and 1, which is the proportion of people who
have the disease, but the city is random. Okay, so, who have the disease. By the city with the disease,
I mean the people with the disease, not the cities with the disease. So there are a lot of questions we could ask,
this is a pretty general setup. So you can see how this kind of setup has
lots of applications in epidemiology. But it doesn't have to be disease. It could be political opinions or
whatever you want. It's similar to what
I was just saying here, in that we're assuming that we have
variability between cities, different cities have different political opinions or
different disease characteristics. And within each city there's
also variation, right? So we have those two types of variation,
how do we deal with that? Okay, so there are lots of things
we could ask about this setup. But for right now, let's just find
the mean and the variance of X. To do that, though, we need some assumption about the distribution of Q, okay. So the most commonly
used choice in practice would be to pick a beta distribution. So we're gonna assume that Q is Beta(a, b), where a and b are known. Because, as we were just saying when we were doing the Beta, it's a very flexible family. It takes continuous values between 0 and
1. We know Q has to be between 0 and 1. And we know the Beta is a conjugate
prior for the binomial, so it has a lot of nice properties. So it's mathematically convenient but it's also a pretty flexible
family to work with. By playing around with a and b,
you could get a variety of distributions that hopefully would accurately
reflect what you wanna do, as far as what
the distribution of Q is like. So we'll assume a beta. You can assume something else if you want
and then do a similar calculation but the Beta would be the most
popular choice here, and also happens to be convenient, okay. So now, so that's Q. We're also implicitly, I man it's
basically set in words here, but we're implicitly assuming x
given Q is binomial n, Q. That is once we know that
the true value of what proportion of people in that city have the disease,
then we're doing binomial. Hypergeometric might be
a little bit better. But we're either assuming sampling with replacement, or that n is small enough compared to the population size that
it is essentially binomial, okay. So now we're all set to find what we want. E of x, so it's very, very natural here
to use conditional expectation, right? Cuz the whole problem was set up in a way
where conditional on which city you're in. Then we have a good sense of what's going
on unconditionally then you have to kinda combine all these different cities
that's harder to think about. It's easier if you kind of
zoom in on one city first. So that suggests, okay,
just condition on Q. So this is gonna be E of E of X given Q. E of X given Q, well, given Q we just have a Binomial(n, Q), and the expected value of a Binomial(n, Q) is nQ. So that's just E(nQ), and n is just a constant, so it's nE(Q). So then it's just na/(a+b), because a Beta(a, b) has expected value a/(a+b). So it's a quick calculation at
that point once you condition. All right, so now let's do the variance. So again, we're gonna do this by thinking conditionally. We have a between-cities term and a within-city term. Eve's Law says this is the expected value of the variance of X given Q, plus the variance of the expected value of X given Q. Then we just have to work out
what these two things are. Okay, so for the first term, the variance of X given Q: X given Q is just a binomial, and we know that if we treat Q as a constant, this has variance nQ(1-Q), right? So let's just write that down. The variance of X given Q is just nQ(1-Q), so the first term is the expected value of nQ(1-Q). I just get that from the binomial, all right. For the other term,
the variance of E of X given Q: well, we just said that E of X given Q is nQ. So I just want the variance of nQ. The n comes out squared, so that just gives me n squared times the variance of Q. Now we just have to
compute those two things. And for the Beta distribution,
those things are actually both pretty easy because when
we see this Q(1-Q), that kinda reminds you of what
the Beta looks like, right? So to compute those two things, Let's just do those on the side, we just need to compute those two
quantities and then we're done. So we need to know the expected value; the n comes out here, so all we need is E of Q(1-Q). Okay, so let's just do that, just for a
quick practice with a Beta and with LOTUS. Well, one thing we could do is just
say this is E of Q minus E of Q squared. And we already know E of Q, and we could get E of Q squared. That part's really fine. But let's just do it directly using LOTUS. So I'm just gonna write down,
LOTUS, right? So this is just q(1-q), where I'm changing capital Q to lowercase q, and then we integrate against the beta density. And let's just simplify: the beta density has q to the a-1, but we are multiplying by q, so now it's q to the a. And then it has a (1-q) to the b-1, but we also have this 1-q, so that becomes (1-q) to the b. Then dq, times the normalizing constant of the beta, which is gamma of a+b over gamma of a, gamma of b. Well, it looks like this complicated thing until you realize that
is just another beta integral, right? So, this is just gamma of
a+b over gamma a gamma b. And then we just have to multiply and
divide by whatever we need in order to make this exactly
the integral of a beta PDF. So I'm just imagining putting in
the normalizing constant of the beta, which in this case is the one for a Beta(a+1, b+1), okay. So this is gonna be gamma of a+1, gamma of b+1, divided by gamma of a+b+2, times 1, because I'm multiplying and dividing by this thing so that what's left is exactly the integral of a beta density, which is 1. Okay, so now let's just simplify this
thing that looks ugly with the gammas. But hopefully we can simplify it. If we remember the fact that
gamma of x+1 = x gamma of x, just use that fact a bunch of times. So gamma of a+1 is a gamma of a, so this gamma of a is going to cancel. Gamma of b+1 is b gamma of b, so there's going to be a b there. And then, in the denominator, gamma of a+b+2 is a+b+1 times gamma of a+b+1. But gamma of a+b+1 is a+b times gamma of a+b, so that gamma of a+b cancels. So we just get this expression in terms of a and b: E of Q(1-Q) is ab over (a+b)(a+b+1). And similarly,
you can get the variance of Q. Kind of the nice way to write the variance of a beta is mu(1-mu) over (a+b+1), where mu is the mean of the beta, so mu = a/(a+b). You check this in exactly the same way,
okay? So then we're done; I mean, you can do some algebra to simplify, but if I just plug those two things in, then that's the answer. Okay, so that's all for today.
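For reference, since the board itself isn't visible in the transcript, here is the lecture's derivation written out in symbols (nothing new, just the spoken computation collected in one place):

```latex
% Adam's Law for the mean:
%   E(X) = E(E(X|Q)) = E(nQ) = n E(Q) = na/(a+b).
% Eve's Law for the variance:
\begin{align*}
\mathrm{Var}(X) &= E\big(\mathrm{Var}(X \mid Q)\big) + \mathrm{Var}\big(E(X \mid Q)\big) \\
  &= E\big(nQ(1-Q)\big) + \mathrm{Var}(nQ) \\
  &= n \cdot \frac{ab}{(a+b)(a+b+1)} + n^2 \cdot \frac{\mu(1-\mu)}{a+b+1},
  \qquad \mu = \frac{a}{a+b}.
\end{align*}
```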
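To make the setup concrete, here's a minimal simulation sketch of this beta-binomial model; the values a = 2, b = 5, n = 30 are illustrative choices of mine, not from the lecture:

```python
import numpy as np

# Simulation sketch of the model from the lecture:
# Q ~ Beta(a, b) is the random city's disease prevalence, and
# X | Q ~ Binomial(n, Q) is the number infected in a sample of n people.
# The values a = 2, b = 5, n = 30 are illustrative, not from the lecture.
rng = np.random.default_rng(0)
a, b, n = 2.0, 5.0, 30
trials = 1_000_000

q = rng.beta(a, b, size=trials)  # pick a random city, i.e. its prevalence Q
x = rng.binomial(n, q)           # sample n people from that city, count infected

# Closed-form answers from Adam's and Eve's laws, as derived in the lecture:
mu = a / (a + b)
mean_exact = n * mu                                 # E(X) = na/(a+b)
var_exact = (n * a * b / ((a + b) * (a + b + 1))    # E(nQ(1-Q))
             + n**2 * mu * (1 - mu) / (a + b + 1))  # n^2 Var(Q)

print(x.mean(), mean_exact)  # simulated and exact means should agree closely
print(x.var(), var_exact)    # likewise for the variances
```

Simulating Q first and then X given Q mirrors the two sources of variation in the problem: between cities, and within a city.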
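And as a quick sanity check on the LOTUS step, the beta integral E(Q(1-Q)) = ab/((a+b)(a+b+1)) can be verified numerically; again, a = 2 and b = 5 are arbitrary test values I picked:

```python
import math
import numpy as np

# Numerical check of the LOTUS computation from the lecture:
# E[Q(1-Q)] = integral of q(1-q) times the Beta(a, b) density,
# which the gamma-function identities reduced to ab / ((a+b)(a+b+1)).
a, b = 2.0, 5.0  # arbitrary test values

q = np.linspace(1e-6, 1 - 1e-6, 200_001)
pdf = (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
       * q ** (a - 1) * (1 - q) ** (b - 1))  # Beta(a, b) density
f = q * (1 - q) * pdf                        # LOTUS integrand

# Trapezoidal rule written out by hand, to avoid version-specific numpy helpers.
lotus = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(q)))
exact = a * b / ((a + b) * (a + b + 1))

print(lotus, exact)  # the two values should match to several decimal places
```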