Last time I left you with a cliffhanger,
right? So we better try to resolve that,
the two envelope problem. So, I'll remind you what the problem is,
it's very simple to state, but not so easy to resolve. So the problem is just, we have two
envelopes, envelope one, envelope two. They look identical, and suppose there are
X dollars in here and Y dollars in here. And you don't know anything
at all about X and Y, just they're random
variables right now. And all you know is that one envelope
has twice as much money as the other. So they each have a check for
some amount of money. And one is double the other. But you have no information other than
that, twice as much as the other, okay. So the argument last time was,
Let's say you get to pick one. So let's say you pick this one,
and this one's X. And you don't know what
X is unless you open it. But you do know,
the other one is either 2X or X over 2. And if you average 2X and X over 2,
you get a number bigger than X, which suggests that the other one is better. But you could apply the same
argument to this one. This one's a Y,
the other one's 2Y or one-half Y. If you average 2Y and one-half Y,
you get a number bigger than Y. So, all right. So, let's actually write that out
kinda formally as an argument. So, there are two
competing arguments here. Argument 1 is simply that E(Y) = E(X) by symmetry. So, if you were given some
piece of information that the person who put the money in
the envelopes is left-handed. And left-handed people subconsciously
wanna put more money on the left, or something like that, then you'd
have an asymmetry in the problem. But there was no asymmetry given.
The way it was stated, there's no reason that the left envelope
should have more than the right one, or the other way around, so it's a symmetrical situation. It's kind of hard to argue
against symmetry here, right? There is no asymmetry in
the statement of the problem, so if somehow the envelope on the left were
better, where would that have come from? It doesn't make sense, okay? So that's a pretty strong argument. But let's look at argument 2,
which also seems pretty strong. Argument 2 is to condition. How do we compute an expectation? Just like when we had a probability
we didn't know, we want to condition on
the thing we wish that we knew. And so this is just the law of total
probability, except the expectation version. We're gonna condition on whether Y = 2X or Y = X/2. So we could just write it this way: E(Y) = E(Y | Y = 2X) P(Y = 2X) + E(Y | Y = X/2) P(Y = X/2).
So I mean that's just a fact about
expectation, that you can condition. And the proof of this is basically just to use the law of total probability,
so that's true. So it's hard to argue with that. Let's compute this. So the argument goes: E(Y | Y = 2X),
well, then we know Y = 2X. So we're just gonna replace
that with E(2X). By symmetry, it's equally likely that Y = 2X and
Y = X/2, so that probability is just one-half. Plus, in the other case, Y is X/2,
so we're gonna get E(X/2) times the probability, and that's one-half again. And if you simplify this,
you can take out the 2, take out the one-half, and
you get E(Y) = (5/4) E(X). All right, so
then the question is how to resolve this? Well, first of all, is there any case in which both of
these statements could be true? Someone said 0. Well, kinda, yeah, if both envelopes have $0,
then that would be okay. Now, so let's assume there is a positive,
nonzero amount of money in both envelopes. Are there any other cases? Infinity: well, infinity equals infinity, and infinity equals five-fourths infinity. So one way out of it,
could we just say, well, on average, there's an infinite amount of money,
and then you're pretty happy, right? That's kinda like
the St. Petersburg paradox, where the expected value is infinity. But there are not many real world
scenarios that you could think of, probably, where your expected value of
how much you'll make is infinity dollars. So let's assume that these
numbers are not 0 or infinity. In that case,
this is just a direct contradiction, meaning that one of these
two things must be wrong. Well, symmetry, I can't find any way
to argue against the symmetry here. So, this has to be right. Symmetry takes precedence. So let's try to see what's
wrong with argument 2. Well, actually,
one of the most common and troublesome mistakes with conditioning is to use the information
and then forget about it. Look at what we did here:
we plugged in the information Y = 2X, but then the condition went away. What's the justification for that? There is none. It just looked good to
say that, but there's no justification. So actually this step is wrong; it's not equal.
Let's write the corrected version. It's perfectly valid to plug in 2X, but that doesn't mean we can
forget that we know Y = 2X. So it's E(2X | Y = 2X). And similarly over here,
we can plug in the information, but it doesn't mean we're not
conditioning on it anymore. So, we can do that. But we still have to then evaluate
this conditional expectation. We can't just say that we can
forget this information here. In fact, there's no more justification for forgetting this information here
than there would be up here. You just can't say that. So actually what we just showed is that, essentially,
E(Y | Y = 2X) = E(2X | Y = 2X) is not equal to E(2X). So what's going on there is I
like to think of this in terms of indicator random variables. So let's just let I be
the indicator random variable. I is the indicator of which envelope has
more money, so we could say I is 1 if Y = 2X, and 0 otherwise, or
we could define it the other way around. That's the event that says that the
envelope on the right has twice as much money as the one on the left, okay. So we have an indicator for that. Then, essentially what we just showed
is that X and I are dependent. We can either talk about X and
I or talk about Y and I. They're dependent, and
that seems surprising at first. So let's think about what that means. It says that if you got to observe X, then
somehow that gives information about I. So now we get to think
more about the case: what happens if you actually get
to open the envelope on the left? Okay, and you see $100 there. So you know the other one is 50 or
200, but the question is whether that changes your probabilities for I,
that is, is the other one 200 or 50? This is saying it can't be 50-50 for every value you might see. Which seems a little strange, because you're not given any information
about the scale of the problem. So let's say you open the envelope on the left
and there's a trillion dollars there. Well probably you're really happy,
but is a trillion a big number, or a small number? Well it's a lot of money, but in the grand
scheme of things, compared to the entire real line, from zero to infinity,
a trillion is minuscule, right? A trillion is nothing compared to other
numbers I could name if I wanted. So when you see that trillion dollars, does that give you information that
makes you think that the other one is probably only half a trillion
cuz a trillion is big? What's a trillion, right? It's nothing compared to 2 to the 2 to
the 2 to the trillion to the trillion or something like that. And if you were to observe that,
then that's nothing compared to other numbers I could name. So it seems surprising that they're
dependent, but essentially we just proved that they have to be dependent if
the expected values are finite. And there's a strategic practice
problem that's related to this too; you can look up
the problem if you haven't already. But let me just tell you the result of
that problem, which is also surprising. It also is a two envelope problem,
but in that problem it just assumes that there's two amounts
of money, two positive amounts of money. It's not assumed that
one is double the other. And the problem is to come up with
a strategy guaranteed to give you better than a 50% chance of getting
the envelope with more money. You get to observe one and
then you choose whether to switch. You can guarantee that your probability of
success is strictly greater than one half. Which again, at first sounds impossible. Because if it's a trillion dollars,
should you switch or not? But it says in a certain
sense that actually you can. You can make a measure of whether a trillion is
a big number or a small number. And the strategy in that problem
is to generate your own random threshold. That is, generate some value t, let's say from an exponential
distribution, but it doesn't have to be. You can pick some other distribution. You generate your own threshold value, and then you say you're happy, so you keep the envelope, if you got more
than t, and unhappy, so you switch, if you got less than t, and that gives you better
than a 50% chance of success. All right, so anyway,
that's the two envelope problem. There are many, many different articles and
debates about this. And some people try to take Bayesian
approaches to resolving this and so on. But I think that's fairly unnecessary, I think the key blunder
is just in this step. And this is kind of a strange problem, but this mistake comes up in a lot
of other contexts as well. So it's worth thinking carefully about. We plugged in the information, but that doesn't mean we can then
get rid of the information. The only time when we can get rid
of the stuff we're conditioning on is when we know we have independence. And here there was no justification for
independence and in fact, X and I, or Y and I can't be independent. Which is not obvious. It's not obvious that
they're not independent, but on the other hand you can't just say
they're independent without proving it. And if you try to prove it you'll find
they actually are not independent here. So that's what's going
on with that problem. All right, so let's do another
example of conditional expectation. The coin flipping problem. So I'll call this one,
Patterns in coin flips. Okay, so assume we have a fair coin,
all right? You have a related homework problem
where the coin may be biased. But right now,
we're assuming we have a fair coin and we do repeated fair coin flips. And we're waiting for
a certain pattern, okay? So
we want to know how long we wait: how many flips until we observe
the pattern heads followed by tails? So I'll just call that HT. That is, keep flipping the coin and
eventually, you will observe heads immediately
followed by tails, right? That particular pattern
will eventually show up. You wanna know how many flips does that
take, including the H and the T there. Okay, that's a natural
question we could ask. Similarly we could ask
how many flips until HH? That is, how long do you have to flip a coin until
the coin lands heads twice in a row? Okay, so let's find the average. So
I'll just call these W_HT and W_HH. By W_HT I just mean the random
variable representing how many flips it takes. Imagine this long sequence
of coin flips: how far into the sequence do you have to
get until you see HT for the first time? How far until you see HH for the first
time? Those are both random variables. But right now we're talking
about expectation, so the question is to find the average. So our problem is to find the expected
values of these two waiting times, E(W_HT) and E(W_HH). I'm using W for waiting time. Expected value, how long you have
to wait for these patterns, okay? Well, so we'll solve both of these, but
before I actually solve them, let's try to think more qualitatively about whether
there's an inequality or equality here. Okay, so we know there's only three
possibilities, either this number is bigger than this number or it's equal,
or this is less than this, okay? So why don't you take a few seconds to
think intuitively about whether you think they're equal or one is bigger and
if so which one is bigger. Then we'll vote. Then we'll see, okay. So there's three possibilities, right? Think about which one you think is true. Okay, so how many of you think
that this is greater than this? Raise your hand if you think
this is greater than this. Okay, and
how many of you think this equals this? Okay, and how many of you think
this is greater than this? Okay, so the dominant answer is equality,
and it's roughly equal numbers, saying this is greater than this or
that this is greater than this. Okay, well the answer
is that E(W_HT) equals 4 and E(W_HH) equals 6, which is 50% bigger. But you might say, well, by symmetry
don't they have to be the same? That would be a false use of symmetry.
Let's do a little symmetry
argument over here. Symmetry would say the expected value
of W_TT equals the expected value of W_HH. That has to be true by symmetry. Because the coin says heads on one
side and tails on the other, but you could have relabeled it, and it's the same problem
again, just with a different labeling. So that's symmetry,
where I'm interchanging heads and tails. Similarly, we know that
E(W_HT) equals E(W_TH). But neither of these
tells us that E(W_HT) and E(W_HH) are equal. It doesn't tell us that they're not equal, either. You just can't say they're equal by symmetry, because you have to swap heads and
tails everywhere, and HT and HH are not swaps of each other. All right, so let's do the calculation. Then I'll talk a little more
about the intuition for why E(W_HH) is bigger than E(W_HT). But let's just calculate it first. Okay, so first let's do E(W_HT). We don't actually need
conditional expectation for this. A nice way to think about this problem
is just to imagine a sequence of tosses. Maybe it starts out tails,
tails, tails, tails. Eventually the coin will land heads,
right? Now once this has happened you can see
we've made partial progress, right? Because now, well if the next
flip is tails, then we're done. But if not, that's okay. Heads, Heads, Heads,
eventually it'll be Tails. So then we're done, right, then we got it. So just by drawing a few examples like that,
making up your own little sequences to make it concrete, you'll see
what's going on here, which is that all we have to do is wait for
the first time the coin lands heads. Let's call that W1. That could be the first flip, but it could be however long it
takes until the first heads. And then, after that point, how long do you have to wait
additionally for the first tails? Let's call that W2. So no matter what happens, I can always
split it up into a W1 and a W2, right? Time to the first heads, time to the first
tails after the first heads, right? Well, okay, those are independent of each
other, because the coin is memoryless. But even if they were not independent,
we could still apply linearity and just say E(W_HT) = E(W1) + E(W2) = 2 + 2 = 4. Each Wj is basically a geometric;
the only thing we have to be careful about is that we defined the geometric
to not include the success. So actually Wj - 1, with our
convention, is Geometric(1/2). Right, it's just a waiting time for
success, where here we're defining success to be this heads and here
we're defining it to be this tails, okay? So Geometric(1/2) has expected
value 1, and 1 plus 1 is 2. So each of these on average will
take two flips: two flips to get to this stage,
two more flips to get to this stage,
and 2 plus 2 is 4 by linearity.
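
As a rough numerical check of the W1 plus W2 decomposition for HT, here is a minimal simulation sketch. The function names, trial count, and seed are illustrative assumptions, not anything from the lecture; the averages it prints should land near 2, 2, and 4.

```python
import random


def wait_for_ht(rng):
    """Flip a fair coin until HT appears and return (w1, w2).

    w1 = flips up to and including the first heads.
    w2 = additional flips up to and including the first tails after that.
    """
    w1 = 0
    while True:
        w1 += 1
        if rng.random() < 0.5:  # this flip is heads
            break
    w2 = 0
    while True:
        w2 += 1
        if rng.random() >= 0.5:  # this flip is tails
            break
    return w1, w2


def estimate_ht(trials=200_000, seed=0):
    """Return the sample means of W1 and W2 over many simulated sequences."""
    rng = random.Random(seed)
    total_w1 = 0
    total_w2 = 0
    for _ in range(trials):
        w1, w2 = wait_for_ht(rng)
        total_w1 += w1
        total_w2 += w2
    return total_w1 / trials, total_w2 / trials


if __name__ == "__main__":
    mean_w1, mean_w2 = estimate_ht()
    # The sum of the two means estimates E(W_HT).
    print(mean_w1, mean_w2, mean_w1 + mean_w2)
```

Splitting the wait into two loops mirrors the argument: once the first heads shows up, every later flip either finishes the pattern or leaves us still holding a heads.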
Let's try to do the same thing with heads, heads and see why that doesn't
give us the same thing there. All right, so
now we're gonna do E(W_HH), and again it really helps for
concreteness to draw some little examples. So, again, maybe the sequence
starts out Tails three times, four times, like here,
eventually the coin lands heads. Now at this point, so we could still call
that W1 if we want, but at this point either of two things can happen, either
the next flip is heads then we're done, we got it, but if the next toss is tails,
we have no partial progress anymore. That means that in this scenario
these tosses were all wasted, right. They do nothing for
us then the thing just restarts. Al right, I mean it took six tosses to get
to this point, but we're starting again with exact same problem we had,
right, no partial progress. In this case we have partial progress. Because once you get this heads for
the first time, then you're halfway there essentially,
right? That, that's the key distinction. All right, so here we have to be more
careful because in this case all these tosses are gone and
we just start over again. Okay, so now we're gonna use conditional
expectation to compute this. So it's kinda like the gambler's
ruin type of thing, where we condition on the first toss. So E(W_HH) is E(W_HH | first toss is heads); you can
make up some notation if you want, but I'll just write it out. That's times 1/2, the probability that
the first toss is heads, plus E(W_HH | first toss is tails)
times 1/2. Just conditioning on the two cases, that
this first toss is either heads or tails. And now let's expand it out further. The first term we need some space for, probably. The second term is
actually easier here, okay? So I wanna compute the expected value of W_HH given
that the first toss is tails, okay? If the first toss is tails,
what does that say? It says
that cost us one toss. We tossed the coin once,
the coin was tails, okay? Then it's the same problem again, right? Exact same problem. So that's just 1 + E(W_HH). We're gonna try to solve for
it in terms of itself. Which sounds circular, but
then we have an equation. We can just solve that equation for
E(W_HH), okay? So the first toss was a waste,
and then it's the same problem again. Now for this first term,
first toss is heads. We have to further subdivide it into
two cases based on the second toss. So here, when we're doing this
conditional expectation it means we're working within the world where we
now know the first toss is heads. That's the information we have. Now we look at the second toss. If the second toss is also heads, then
that means the first two tosses were H, H and then we're done and
it only took two flips. And that has probability 1/2. But, with probability
one-half within this case, the second toss is tails. In that case, the sequence started out HT,
which does nothing for us, right? That means it's a waste of two tosses and
then it's the same problem again, so that's 2 + E(W_HH). Right, two tosses for heads, tails, and
it's the same problem. It resets again to the same problem. All right, so now we have an equation for
E(W_HH) in terms of itself: E(W_HH) = (1/2) [2 (1/2) + (2 + E(W_HH)) (1/2)] + (1/2) (1 + E(W_HH)). So just multiply it out, 2 times 1/2 is 1, and
move things around, and you'll get that
the expected value is 6. But you can also check just
by plugging in 6 here. This is five-halves plus
seven-halves, which is twelve-halves, which is 6. So that works. All right, so okay, so
that seems a little strange. And to kind of explain a little
bit of the intuition here, let's imagine drawing like a long
sequence of coin tosses and look at where did the HHs appear and
where did the HTs appear? So maybe it looks like TTHHHT, blah, blah, blah, blah, blah, blah, blah. And then somewhere later on there's an
HHHTT, you can make up your own sequence. I'm just gonna try to illustrate
kinda what's the big picture of why this is true. In other words, the question I guess is
why doesn't this contradict the fact that, since it's a fair coin, if we flip the coin twice, then all
four possibilities are equally likely. So if I'm only looking at two particular
positions, let's say here and here, it's equally likely that this
will be HH as for it to be HT. Those are both one-fourth, okay, so I think that's where the intuition
comes from that it should be the same. When you look at any two positions,
it's equally likely. But we're not just
looking at two positions. We're looking at the entire sequence. And what happens with HH is that sometimes you'll get three heads in a row,
sometimes you'll get five heads in a row.
If you have three in a row,
then that means there's an HH there and an HH there; they're nested. So if we have five Hs in a row,
which will happen more often than most people would expect,
cuz coincidences happen a lot, look at all these HHs
within those five in a row. Try to do the same thing with HT. You can't, right? It doesn't nest inside
itself in the same way. So that means that when you get the HHs,
they're kind of clumped together. But on the other hand, HH and HT have the same expected total
number of appearances in a long sequence. So since the HHs are more clumped,
they appear in clumps, those clumps must be further apart. That's what's going on. Okay, and if you think this is just
kind of a curiosity of coin flipping, this kind of problem actually has
important applications in genetics. In genetics, we will not be looking
at sequences of heads and tails. But we'll look at a DNA sequence, which
is a sequence that's drawn from the alphabet A, C, T, G. And a lot of times,
in genetics these days, you want to study certain
patterns, which they call motifs, and where the pattern appears in the DNA
sequence, and you get similar problems. And, while I'm mentioning this,
I'll recommend a TED Talk, Peter Donnelly is a statistician
who works on genetics. And I don't know if you've
seen any of the TED Talks. But if you go to ted.com and
look for Peter Donnelly's talk, he mentions some interesting courtroom
examples, and statistics and probability in the courtroom. And he also talked about
an example similar to this. It's quite a nice talk. Okay, so, that's the waiting for
certain sequences. You can extend this in various directions. This was already a fair amount of work,
not too bad. You could also ask more
complicated questions, like longer sequences and
things like that. And there are a lot of other
methods that can be used for more complicated sequences. But that basic method
is just conditioning. So you can see that
conditional expectation, just like in the Gambler's Ruin, let us
simplify kind of this complicated problem, break it up into simpler pieces,
by conditioning. Essentially, what we did was to
condition on the first two tosses, but
I sort of did it in two steps.
We could have also done this by,
at the very beginning, breaking it up into four cases
based on the first two tosses. That might have been even easier, but
anyway, I did it the two-step way, okay?
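
To make the HH versus HT comparison concrete, here is a small sketch that solves the self-referential equation for E(W_HH) with exact fractions and also estimates waiting times by simulation. The helper names and the trial count are assumptions chosen for illustration; the simulated averages should come out near 6 for HH and near 4 for HT.

```python
import random
from fractions import Fraction


def waiting_time(pattern, rng):
    """Number of fair-coin flips until `pattern` (e.g. "HH") first appears."""
    recent = ""
    flips = 0
    while not recent.endswith(pattern):
        # Keep only as many recent flips as the pattern is long.
        recent = (recent + rng.choice("HT"))[-len(pattern):]
        flips += 1
    return flips


def mean_waiting_time(pattern, trials=100_000, seed=0):
    """Sample mean of the waiting time for `pattern`."""
    rng = random.Random(seed)
    return sum(waiting_time(pattern, rng) for _ in range(trials)) / trials


def solve_hh_exactly():
    """Solve E = (1/2)[(1/2)*2 + (1/2)*(2 + E)] + (1/2)*(1 + E) for E."""
    half = Fraction(1, 2)
    # Rewrite the right-hand side as constant + coeff * E.
    constant = half * (half * 2 + half * 2) + half * 1
    coeff = half * half + half
    return constant / (1 - coeff)


if __name__ == "__main__":
    print("exact E(W_HH):", solve_hh_exactly())
    for pattern in ("HT", "HH", "TH", "TT"):
        print(pattern, mean_waiting_time(pattern))
```

The same `waiting_time` function works for longer patterns, though the exact equation would have to be set up again for each one.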
of conditional expectation. And I guess to just state
it a little bit more, some of the general properties
of conditional expectation, and would be useful. And also we're leading into
conditioning on a random variable. This whole week is about
conditional expectation. So far, we've been doing conditional
expectation where we conditioned on an event, okay? But what we're gonna get to is what does
it mean to condition on a random variable? They're very, very closely related. But if you mix them up,
then it can be bad. So we have this thing with E(Y given X=x). I just wanna make sure everyone is
completely clear on what this means both intuitively and mathematically. What does this expression actually mean? Well, I've said over and over again that
big X equals little x is an event, right? I mean assuming capital X and Y are random
variables, little x is a number. We're conditioning on that event. And as I've been saying, all that means
to do is use the same definition of expectation except make it conditional. That's why it's called
conditional expectation, okay? So at this point, you should all be
able to just write down the definition, that this is for Y discrete. And we'll write down
the continuous case too. Okay, so in the discrete case,
to get the expected value, we just sum up the values times the PMF,
right? So the sum over y of y times the probability that
Y = y, that would just be the expectation of Y. The only difference is we make
this conditional, given X = x: E(Y | X = x) is the sum over y of y P(Y = y | X = x). That's all this means,
that we learned that X was equal to x, and so we condition on that information,
right? So it's our best prediction of
the value of Y given this information. Best in a certain sense, of minimizing the
square of how far off you are on average,
that is, it minimizes
the expected squared error. That's the discrete case. Let's also just write
down the continuous case. So if Y is continuous,
then by the basic definition of expectation, we'd integrate
y times the PDF of Y. And now we're just gonna use
a conditional PDF instead, which you might write as f_{Y|X}(y|x):
E(Y | X = x) is the integral of y f_{Y|X}(y|x) dy, okay? And if X is also continuous,
just to remind you of how we get this conditional PDF,
the definition of the conditional PDF is completely analogous
to conditional probability. That is, instead of a probability,
we have a density: the joint density of X and
Y divided by the density of X. So that's f_{Y|X}(y|x) = f_{X,Y}(x,y) / f_X(x), the joint PDF divided by the marginal PDF of X. This just says that
to get the joint PDF we could take the marginal PDF of X
times the conditional PDF of Y given X. So we could also write the integral using the joint PDF over the marginal. And notice that this marginal here
doesn't depend on y, so we could take
it out of the integral if we want, or we can leave it there, same thing. So this is just the analog; it's just
saying we're using the conditional PDF, which is defined analogously. So, I mean, that's how you
compute things using a sum or an integral, but that doesn't yet say
what properties this satisfies, okay? So let's let this thing be g(x). That's g of lowercase x = E(Y | X = x). I'm writing it as g(x) just
to emphasize the fact that when you compute this thing, it's a function of little x. I
can't even count the number of mistakes I've seen in these kinds of problems, where
I give a problem like this on an exam and then get an answer that
involves capital X or capital Y, which you should immediately know would
be completely wrong, okay? Because this is the expected value of Y, it can't depend on capital Y, because
you're taking the average of Y.
You can't base that on Y, right? That's your prediction
of Y; you can't use Y, okay? Given this information X = x, we're
conditioning on this event, and it can't involve capital X, it can't involve capital
Y; it's just a function of little x. It might be a constant. If X and Y are independent, then conditioning on X gives
us no information about Y. So if they're independent, this will just
reduce to E of Y, which is a constant function, but it has to be a function of
x, possibly a constant function, okay. So that's just to emphasize
that's a function of x. But that also suggests how to define conditional expectation
given a random variable. So we want to define E(Y | X)
as g(X), capital X now. So I mean this is not
a complicated looking equation, but
conceptually, this takes a lot
of thought. I'll explain it more
in the next lecture. But really, you just have to take some time to
think about what these things mean.
Think carefully about it,
because there are just so many mistakes you could fall into if you don't
fully understand what this notation means. All right, so what this thing means
is we have this function of little x, and we replace little x by big X,
after evaluating it, not before. So one trap you could
fall into here is to say,
well, this is g of little x, and I want g of capital X, so I should plug in big X
everywhere I see a little x. So I plug it in there and there, but in that case you would be saying
E(Y | X = X). And then you'd say, well,
I already knew capital X = capital X.
So that's just irrelevant information that
I'll cross out, and you'll get E(Y). That's not what this means. What it means is, think of this as a function of little
x that you've actually computed. Like, maybe it's x squared. I'll just write this as an example. So what it means is, if
g(x) = x squared, then g(X) = X squared. That's all it means,
that this is some function of X. And you replace little x by big X after
you actually have your hands on that function, okay? It's really just a notational trap
to think of plugging in the X here. But think of this as some function
where you're replacing little x by big X. Notice that that's a random variable. So in particular,
E(Y | X) is a random variable. It's a random variable
that's a function of X. It's not a function of Y,
it's a function of X. All right, so the intuition for
that is that, I mean this is the definition. But the intuitive interpretation
is that E of Y given X, what that means is X is a random variable. But let's pretend that we observed X,
and we know what X is, and we then get to treat X as if it's
a known constant, that's what this says. So, it says assuming we get
to pretend that X is known, then what's our best prediction of Y? So that's allowed to be
a function of capital X, okay? So that's what it means intuitively. It's not really that
different from this and in fact when you have a problem like this,
you can always at least in principle translate it
back into conditioning on an event. It's just that the notation can
get more unwieldy in this case. And we'll see examples where
it's just a lot more compact and convenient to write things this way than
always having to revert back to this. But if you start getting
confused about what this means, then you probably should revert
back to this kind of notation. Okay, so all right, so
let's do a couple of examples. So, let's do one with Poissons. So let's let X and Y be iid Poisson(lambda). So all right, so suppose we wanna find E(X + Y | X). So, as I said, if we want, we could
first find this given big X = little x. And then change little x to big X and
you'll get the same thing. But, just for practice, let's do this
directly using the interpretation I mentioned, that this says we get
to pretend that we know X now. All right, so linearity still holds. So this is E(X | X)
+ E(Y | X). We can still use linearity. The reason is that conditional
probabilities are probabilities, right?
So conditional probabilities
satisfy all the same rules and
properties as probability. And therefore, since conditional
expectation is defined in terms of conditional probability,
conditional expectations are expectations. They satisfy the same properties, such as
linearity. So linearity is still true. Okay, so now let's just think about,
what does this thing actually mean? E(X|X), that's easy, that's X,
right, what else could it be? I know X and I want to predict X,
I'm gonna use X as my prediction. So okay, it has to be that. >> No. >> [LAUGH]
>> Now let's think about E(Y|X). X and Y, I said they're iid,
so they're independent. Since they're independent, that means that getting to know X is
of no use at all in predicting Y. So that's just the same thing as E(Y). So this part is by independence, and
this part is X, X is a function of itself. In general, we'll talk more about this
later, but I may as well mention it now. If we have E(h(X) | X), this is true for
any problem, not just the Poisson, so I'm putting it above this. We'll come back to this later,
but since it's already coming up: h is any function, and if
we get to know X, well, then we know h(X), so we could just compute it. So E(h(X) | X) = h(X), right? Cuz we have X, so we have h(X), so
that's it, no uncertainty about it. Okay, so E(X | X) is X,
E(Y | X) is just E(Y), so E(X + Y | X) is just X + lambda
for the Poisson case. Actually, up to this point, I'm not using
anything in particular about Poisson, I'm only using the fact that
they're independent, okay? So that's just an example that shows,
okay, we can still use linearity. If we have independence, we can drop
the stuff we're conditioning on. And if we have something
that's completely known, then we can actually just take it out,
because it's known. All right,
now let's do the kind of reverse problem: suppose we were asked to find E(X|X+Y). And okay, so there's no form
of linearity that says this is E(X|X) + E(X|Y). And I've seen that mistake several times,
maybe in a time-pressure type of panic: "linearity is like this."
No, we can't do that here. Okay, so
we have to think about what this means. So I'll show you two ways to do this,
actually. One way is just straight
from the definition. We're gonna let T = X + Y and let's find
the conditional distribution, right? Because what this says is we
get to know the value of T, and then we use the conditional
distribution rather than the unconditional one, right? Without this it would just be
the unconditional expected value of X,
and we know that's lambda. But given this information, right, treating T as known,
then we need a conditional PMF. So, find the conditional PMF. Okay, so this is similar to
other problems we've seen, but let's just compute it for practice. We need the conditional PMF. We want the probability that X equals k,
let's say, given that T equals some number,
let's say n. Those are just dummy variables here,
I'm just finding the conditional PMF. That is, what's the distribution of X,
given that we know T, okay? So by Bayes' rule, that's just P(T = n | X = k) P(X = k), divided by P(T = n). And that's actually an easy calculation,
because this is another example of that plugging-in
thing from the two-envelope problem. T is X + Y, so we're gonna plug in X = k,
and so T = n means that Y = n - k. And then in this case, we do get to cross
out the conditioning on X = k at that point, because X and Y are independent. This would fail if
they were not independent. But in this case, X and Y are independent, so really this is just P(Y = n - k) P(X = k) divided by P(T = n). And sorry, we're going right to left here,
let's just write out what this is here. Y = n - k,
just straight from the Poisson, right? E to the -lambda,
lambda to the n - k over (n - k) factorial. X is also Poisson, e to the -lambda,
lambda to the k over k factorial. And in the denominator, we know that the
sum of independent Poissons is Poisson. So this is gonna be a Poisson
of 2 lambda in the denominator. So we know that's e to the -2 lambda, (2 lambda) to the n
divided by n factorial. Okay, and if we simplify this, we get a very familiar-looking
distribution, right, what distribution is this? >> Binomial. >> Binomial. Notice the e to
the -2 lambda factors cancel, right? And these powers of lambda cancel, right, there's lambda to the n
over lambda to the n. The 2 to the n in the denominator is left over, so we just have a one-half to the n here,
right? And this thing, n factorial,
put that on top, that's just n choose k. So this thing,
I'll simplify it one more step, it's just n choose k one-half to the n,
that's just a binomial. Okay, so what we just showed with that,
Bayes' rule, is that X|T = n is binomial
with parameters n and one-half. So now, let's use this to get
the conditional expectation. So what we want is E(X | X + Y), but let's first do the case of
conditioning on an event. So we're given T = n,
well that just says, given T = n, X is no longer Poisson now, it's binomial. The expected value of this
binomial is just n over 2. So in the other notation, E(X | T = n) = n/2;
this is conditioning on an event, right? But if we want to condition on T instead,
E(X|T), then all we have to do is replace this n by T,
just according to that definition there. So that's gonna be T over 2. So it says that if we
get to know the total, then our best prediction is 1/2 the total. That's actually a pretty intuitive result,
right? That we get to know the total of X and
Y, and they're iid, and we wanna predict
the average of one of them. Well, if I told you the total was 100,
and they're iid, you'd probably guess that they're each 50,
right? That would be a reasonable guess. So that's a mathematical proof that
that's kinda the correct guess. That's one way to prove it,
here's a way that I like even more, which is to notice the symmetry. E(X | X + Y) = E(Y | X + Y) is true by symmetry, because they're iid. This is always true with iid, it doesn't
assume anything about the Poisson. Let's add these two things: E(X | X + Y) + E(Y | X + Y), by linearity, that's E(X + Y | X + Y). But E(X + Y | X + Y) is X + Y, which is T. So I added something plus itself and
I got this, so that immediately implies the same result,
E(X|T) = T/2. And this didn't use anything
about the Poisson, so that's actually a more general result. All right, lastly let me just mention one
key property that this notation gives us, we'll prove it next time. So this is called iterated expectation, or it's also called Adam's law for
reasons I'll tell you next time. This is the single most important
property of conditional expectation. So it's good to be aware of it now, and then I'll talk in detail
about that next time. It just says, as I said, E(Y|X) is a
random variable, so we might want to know, what's the expected value
of that random variable? Well, if you do E(E(Y|X)), you get E(Y). And this is closely related to
the law of total probability. In a sense, it's a very compact way of
writing the law of total probability, extremely useful fact. All right, so
I'll see you all on Wednesday.
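
As a sanity check on the Poisson example above, here is a minimal Python sketch that is not part of the lecture itself. It assumes X and Y are i.i.d. Pois(lambda), computes P(X = k | T = n) exactly from Bayes' rule, compares it with the Bin(n, 1/2) PMF, and then checks E(X | T = n) = n/2 and Adam's law numerically. The helper names and the values lam = 3 and n = 10 are hypothetical, chosen only for illustration.

```python
from math import comb, exp, factorial, isclose


def poisson_pmf(k, lam):
    """P(N = k) for N ~ Pois(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def conditional_pmf(k, n, lam):
    """P(X = k | X + Y = n) for X, Y i.i.d. Pois(lam), via Bayes' rule."""
    # P(T = n | X = k) = P(Y = n - k) by independence, and T = X + Y ~ Pois(2 * lam).
    return poisson_pmf(n - k, lam) * poisson_pmf(k, lam) / poisson_pmf(n, 2 * lam)


def binomial_pmf(k, n, p):
    """P(B = k) for B ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def cond_mean_x_given_total(n, lam):
    """E(X | T = n), computed directly from the conditional PMF."""
    return sum(k * conditional_pmf(k, n, lam) for k in range(n + 1))


lam = 3.0  # hypothetical rate, chosen only for illustration
n = 10     # hypothetical observed total

# The conditional PMF should match Bin(n, 1/2), whatever lam is.
for k in range(n + 1):
    assert isclose(conditional_pmf(k, n, lam), binomial_pmf(k, n, 0.5))

print(cond_mean_x_given_total(n, lam))  # should be close to n / 2

# Adam's law: averaging E(X | T) over T ~ Pois(2 * lam) should recover E(X) = lam.
# The infinite sum is truncated at 100 terms, far into the negligible tail.
adam = sum(cond_mean_x_given_total(t, lam) * poisson_pmf(t, 2 * lam) for t in range(100))
print(adam)  # should be close to lam
```

Changing lam should leave the binomial comparison unaffected, which matches the point that lambda cancels out of the conditional PMF.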