Okay?
So last time we were talking about joint distributions. And just to kinda quickly remind everyone
I like the big theme right now is joint, conditional, and marginal distributions. And everyone needs to get comfortable
at how all those concepts relate. So there's three different
types of things. Joint, conditional, and marginal. And we were talking about joint and
marginal distributions last time. Not so
much about conditional distributions. But it's analogous to the stuff we've
already seen about conditioning. So those are the three key words. Joint, conditional, and
marginal distributions. So at this point in the course we pretty
much have all the tools we need for working with one random
variable at a time. But there's much much more that we need
to study about what happens when we have two random variables. Or a list, a sequence of random variables. Things like that.
A sum of a million random variables, and things like that. So we're gonna talk a lot about what
happens with lots of random variables at the same time. And that's why I keep emphasizing
that everything is cumulative here. Because if you have trouble
with one random variable and its CDF, then understanding two of them at
the same time is gonna be very difficult. So we always have a joint CDF, if there's two of them. I'll just write down what it looks like: F(x,y) = P(X ≤ x, Y ≤ y). So the joint CDF would be this,
for two random variables. But of course, if we had a million
random variables instead of two, I'm not gonna write this down. I could write F(x_1, ..., x_n) = P(X_1 ≤ x_1, ..., X_n ≤ x_n), with a million of them. So this extends to as many as you want. But it's just easier to write it down and
think about it for two of them. But it's more general than this. That's the joint CDF
that always makes sense. They can be discrete, continuous, mixtures
of discrete and continuous or anything. In the continuous case then we have a joint PDF which
I talked a little bit about. But I don't think I wrote down how to
get from the joint CDF to the joint PDF. So then we have a joint PDF. And it's analogous to the one-dimensional case, where we take
the derivative of the CDF to get the PDF. In this case, we take the derivative except that
it's a function of two variables. So we're gonna take two
partial derivatives. And so I would write it as f(x,y) = ∂²F(x,y)/∂x∂y. Which looks complicated, especially if
you haven't seen partial derivatives. But even if you haven't ever
done partial derivatives before there's nothing really to
worry about with this. All it means is take the derivative,
this is a function of two variables. Take the derivative with respect to y,
treating x as a constant, right? So if you can do derivatives, which I'm assuming you can do,
you can pretend x is a constant. And then take the derivative with
respect to x, holding y as a constant. And there's a theorem in
multi-variable calculus that says that under some mild conditions, it doesn't
actually matter if you take the partial with respect to y then with respect to x. Or with respect to x and then with
respect to y, you'll get the same thing. So this is again analogous
to the one-dimensional case. And the joint PDF is not a probability, it's a density. It's what we integrate to get a probability.
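To make the partial-derivative step concrete, here is a small sympy sketch. The joint CDF used here, two independent Exponential(1) coordinates, is my own made-up example, not one from the lecture.

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)

# Made-up joint CDF: two independent Exponential(1) random variables,
# so F(x, y) = (1 - e^(-x)) * (1 - e^(-y)) for x, y > 0.
F = (1 - sp.exp(-x)) * (1 - sp.exp(-y))

# The joint PDF is the mixed partial derivative of the joint CDF.
f = sp.diff(F, x, y)
print(sp.simplify(f))                      # exp(-x) * exp(-y)

# The order of the two partials doesn't matter here.
print(sp.simplify(sp.diff(F, y, x) - f))   # 0
```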
If we want to know what's the probability that (X, Y) is in some set A, then that's just gonna be the double integral
over that set A of the density. If you haven't done double integrals
before, again, it's no big deal. Just integrate with respect to x,
holding y constant, and then integrate it with respect to y. Basically, the only complicated thing is
figuring out the limits of integration. So I just wrote double interval over A,
cuz A could be any region in the plane. So if we had something like,
if A is this blob, it may be hard
to do this integral. What does it mean to
integrate over the blob? I mean, that turns into a nasty
multi-variable calculus problem. That's not something we care about for
this course. It's just a nasty calculus problem. It's not an interesting
probability problem. So the more interesting case for our purposes would be if it's,
let's call that A1. Down here's the A we actually want. If it's a rectangle, then this
double integral just means integrate x goes from here to here,
y goes from here to here. So it's just literally
the integral of the integral so it's no different from
doing two integrals. So you don't have to worry
too much about the blobs. There's only one case where we might
care about the blobs in this course. And that's when we have a uniform
distribution over some region. So I'll come back to this. So at the very end last time we were
talking about a distribution that's uniform over a square or
over a circle, that kinda thing. And in the uniform case, we can interpret
probability as proportional to area. So in the uniform case probability
is proportional to area. And then I could say well, I'm just going to do something
proportional to the area of the blob. And at least I can think
more geometrically. But anyway, conceptually it's analogous. The joint PDF is what we integrate
to get the probability of (X, Y) being in any particular set, right? In one dimension we'd say, what's
the probability of x is between -3 and 5, right? We want an interval. And here we want the probability
that it's in some region. But the rectangular case
is gonna be the nicest one. So, those are joint distributions. And I talked a little bit last time
about how to get the marginals. And it's very straightforward. To get the marginal PDF of x,
we just integrate out the y. So we just integrate: f_X(x) = integral from minus
infinity to infinity of f(x,y) dy.
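As a quick illustration, here is a sketch with a made-up joint PDF (f(x,y) = x + y on the unit square, my choice, not one from the lecture), just to show what integrating out y does:

```python
import sympy as sp

x, y = sp.symbols('x y')

# Made-up joint PDF: f(x, y) = x + y on [0, 1] x [0, 1], 0 outside.
f = x + y

# Marginal PDF of X: integrate out y over its range.
f_X = sp.integrate(f, (y, 0, 1))
print(f_X)                              # x + 1/2, a function of x only

# Sanity check: a valid marginal PDF integrates to 1.
print(sp.integrate(f_X, (x, 0, 1)))     # 1
```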
Notice that by doing this, we'll get something that's now a function of x. x is just treated as a constant here; we are integrating over all y, so the result no longer depends on y, because y becomes a dummy variable. Similarly you get the marginal
PDF of y by integrating dx. And this is just completely analogous
to doing a summing over the cases. It's just saying we want
x to be this little x. And y has to be something. So we just integrate
over all possibilities. So that's the marginal. So that's called marginalization. We marginalized out the y. Then we get the marginal of that. We integrate out the y,
we get the marginal of x. It's just terminology for
something very simple. Just integrate. If we did a double integral. So if we then took this thing and
integrate this dx, we should get 1. And what that says, one way to think of it
is to say if we let A be the entire plane, everything, we'd better get 1, right? Otherwise it wouldn't make any sense. The other way to think of it is this
is supposed to be the density of x, just viewed as x in its own. So if we integrate this dx,
we have to get 1, otherwise do not find
a valid marginal PDF. So that has to integrate to 1. May as well write that down just for
emphasis, the double integral equals 1. And it's always minus
infinity to infinity, minus infinity to infinity to start with. It might be that this is zero
outside of some region, and then we could restrict it further. But we could always write
it like this at first, and then we should be careful about where
is it zero or where is it non-zero. So that's gonna be our marginal,
let's do a conditional. Conditional distribution,
so we want conditional PDF. And this should be easy to understand and
remember, because it's analogous to
conditioning we've done before. So let's say we want
the conditional PDF of Y|X, well, we would just write that as f. Sometimes we'd put a subscript
of Y|X just for emphasis. And sometimes we may
leave out the subscript, just cuz it's clear from the context. The conditional PDF, just think of it as the PDF where we get
to pretend that we know what X is. We get to observe what x is, okay? Given that information, that we now know the value of x,
what is the appropriate PDF for y? Well, we could think of
that as being the joint density divided by
the marginal density of x: f_{Y|X}(y|x) = f_{X,Y}(x,y) / f_X(x). What I just wrote down just
looks like the definition of conditional probability, right? The probability of this given this
is the probability of this and this, divided by the probability of this thing. Now x and y are representing numbers,
not events, okay? But it looks the same as the definition
of conditional probability, and you can derive this from the definition
of conditional probability. Where basically what you would do is say, our event is that Y = y, or, if we are worried about probability
zero, that Y is extremely close to y. That is, we let capital Y be
in some tiny little interval around little y and find the conditional
probability of that, given the value of x. And it's completely analogous
to conditional probability. So this says that we can get
the conditional just by doing the joint distribution, joint density
divided by the marginal density. We could also do something
that looks like Bayes' rule. That is, what if we want
the conditional PDF of Y|X? Well, we can write f_{Y|X}(y|x) = f_{X|Y}(x|y) f_Y(y) / f_X(x). I'm just writing down something
that looks like Bayes' rule, right? I swapped the x and the y, but
instead of probability I'm doing density, completely analogous to Bayes' rule. The proof is really to use Bayes' rule and then take a limit, and so
this should be easy to remember. And the numerator is the same. Another
way to say this is that to get the joint density, we can take one of the marginals,
then times the other conditional, right? That's like, if we're pretending it's
probability rather than density, it's like the probability of this y value
times the probability of the x value, given that y value. So everything is analogous to
Bayes' rule in the discrete case.
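Continuing the same made-up density from before (f(x,y) = x + y on the unit square, my own example), here is a sketch checking that joint over marginal and the Bayes' rule version give the same conditional:

```python
import sympy as sp

x, y = sp.symbols('x y')

f = x + y                                # made-up joint PDF on [0,1] x [0,1]
f_X = sp.integrate(f, (y, 0, 1))         # marginal of X: x + 1/2
f_Y = sp.integrate(f, (x, 0, 1))         # marginal of Y: y + 1/2

# Conditional PDF of Y given X = x: joint divided by marginal.
f_Y_given_X = f / f_X

# Bayes' rule for densities: f(y|x) = f(x|y) * f_Y(y) / f_X(x).
f_X_given_Y = f / f_Y
bayes = f_X_given_Y * f_Y / f_X
print(sp.simplify(f_Y_given_X - bayes))  # 0, the two expressions agree
```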
All right, so those are just the basic concepts we need for that. And I should mention again
how to think of independence, so again, this is the continuous case. X and Y are independent if, well, the general definition
in terms of CDFs, but it's usually easier to work
with the PDF than the CDF. So usually, the best way to think
of it is independent means that the joint PDF is the product
of the marginal PDFs. And that has to hold for all x and y. It's not too hard to show that that's
equivalent to having the CDFs factor. Cuz basically, if the CDFs factor, you could take the derivative, this
derivative thing, and you'll get this. You could take this thing, and
integrate, and go back there, so it's basically equivalent. Intuitively, it should be equivalent. All right, so
let's come back to this uniform example. Because I wanted to write
what the conditional, we wrote down the joint PDF last time,
I'll remind you. That is, we have the distribution that was
uniform on a circle, or inside the disc. So uniform in the disc, which is x squared
+ y squared less than or equal to 1. We are picking a uniformly random point,
maybe there. Uniform means that probability of some
region is proportional to area, okay? So therefore, so one nice thing
when we have problems that involve a uniform distribution on
some region in the plane. We can actually think of
probability in terms of area, or at least it's proportional to area. So the joint PDF we did
last time is just 1 over pi; it's one over the area,
because that'll make it integrate to 1. Within the circle,
x squared + y squared less than or equal to 1, and 0 outside, okay? But just for practice,
let's get the marginal density of x and then the conditional density, x|y or y|x. And by the way, this may look like
they're independent, because it looks like it somehow factors
as a constant times a constant. But x and
y are not independent here, right? Because if x is very close to 1,
then it's constraining the values of y, so they're definitely dependent. You have to be careful
about things like that. Cuz if you only
look at the 1 over pi, it looks like they might be independent. But the key thing is that they
are constrained to be in the disc together, right? So this is saying that x and
y are actually closely related. But if you only look at this part and
ignore this part, you might think they're independent. All right, so let's get the marginal, f_X(x); all we have to
do is integrate out the y. So we're gonna integrate the joint PDF,
which is 1 over pi, as long as we're careful to,
we're gonna integrate this thing, dy. The only thing we have to be careful
about is the limits of integration. This is only valid when x squared +
y squared is less than or equal to 1. Which is the same thing as saying
that y squared is less than or equal to 1- x squared. And that tells us that y has to be
between minus square root of this and plus square root of this. So we're gonna integrate from
minus square root 1- x squared, to square root of 1- x squared. So the main mistake with
this kind of problem is messing up the limits
of integration somehow. We have to be very,
very careful with limits of integration. You're not actually ever gonna have to do
any difficult integral in this course. But sometimes, you have to think carefully
about the limits of integration, okay? So this is just saying,
these are the bounds on y for which I should have 1 over
pi rather than 0 here, okay? So if we get the limits
of integration wrong, then it's just completely wrong. All right, this is a very easy integral. The integral of a constant is just the
constant times the length of the interval. So that's just f_X(x) = (2/pi) sqrt(1 - x squared),
and that's valid for -1 ≤ x ≤ 1. As a check,
we could integrate this thing, dx and, How do you actually integrate
the square root of one minus x squared? You would do a trig substitution. I'm not gonna do that integral right now,
but you could integrate this thing from minus one to one, use a trig
substitution as he just suggested. That's basically gonna reduce it back down
to the fact that it's based on a circle, and you'll get one. So that does integrate to one.
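If you'd rather let a computer do the trig substitution, here is a quick symbolic check of that claim (a sketch, using sympy):

```python
import sympy as sp

x = sp.symbols('x')

# The marginal of X for the uniform-on-the-disc example.
f_X = 2 / sp.pi * sp.sqrt(1 - x**2)
print(sp.integrate(f_X, (x, -1, 1)))   # 1
```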
So that's the marginal; notice that this does not look like a uniform, so it's certainly false to say that
it's uniform between minus one and one. The point xy is uniform, but
the marginals are not uniform, right, and in fact you can see that this is largest
when x is 0 which kinda makes sense. Cuz if you imagine the random point here, then kind of near the center
seems like there's more space for stuff to happen and seems a little
less likely to be further out, okay? So let's get the conditional PDF now. All right so we can either do y given x or
x given y whichever we feel like. Notice that if you want the marginal
PDF of y, just change the letter x to y here by symmetry; no need
to repeat the same calculation. Okay, so let's do the PDF of y given x. So that's just gonna be the joint
PDF divided by the marginal PDF of x. So it's just gonna be (1/pi) / ((2/pi) sqrt(1 - x squared)), which is 1 / (2 sqrt(1 - x squared)). I just took the joint PDF
divided by the marginal PDF, and we have to be careful about
where is this non-zero. I'm thinking of y as fixed right now,
it's like we get to observe x and I wanna say well,
what are the possible values of y? Well, for each x,
we know that y has to be between minus square root of 1 - x squared and plus square root of 1 - x squared, and it's 0 otherwise. So the pi's cancel, and that looks kind of ugly. What would be another way to say
what this conditional density is? You're treating x as a constant. What would you call this thing? Uniform, because notice this
only has an x here, there's no y. In general you would have a y here. There's no letter y on the right
side of this equation. So a nicer way to write
this would be to say that y given X is uniform between
(- root 1- X squared, root 1- x squared) because this is
just a constant for each fixed x. I wrote this with capital X
here to clarify this notation. When you see this thing
like y given capital X. What does that mean? Intuitively that means just pretend that
capital X, we know x is a random variable but pretend capital X is a known
constant cuz we got to observe it. But you can just think of this as
shorthand for saying Y given X = x. This is kind of a more direct way to write
it that is we get to observe that X = x. And we're saying that if we know that
then we have a uniform distribution, between (- square root of 1- x squared,
square root of 1- x squared). But it's a little more cumbersome
to write it this way. So sometimes I'll write it
this way with capital X, but just treat that as shorthand for this. That just means given that we get to know
what x is, here's the distribution for y. So we're treating x as a constant here,
and here we're explicitly calling that
constant little x, it's just notation. So okay, so that says it's conditionally
uniform over some interval. Notice that that's
the appropriate interval. Cuz as soon as you specify what x is,
we know y has to be between here and here. This says it's uniform. So similarly you could do f of x given y. And you can see that
they're not independent: one way to see it is that f(x,y) does not equal the product of
the marginal PDFs here, right? Take this thing and
then the same thing with y; if you multiply them, you do
not get the joint PDF. So they're not independent. Another way to say they're not independent
is that the conditional distribution of y given x is not the same thing as
the unconditional distribution of y. That is, learning x gives us information, okay?
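Here is a small Monte Carlo sketch of this disc example (sampling the disc by rejection; the specific numbers are just my own sanity check, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample uniformly from the unit disc by rejection from the square.
n = 10**6
pts = rng.uniform(-1, 1, size=(3 * n, 2))
pts = pts[pts[:, 0]**2 + pts[:, 1]**2 <= 1][:n]
x, y = pts[:, 0], pts[:, 1]

# Marginal of X near x = 0.5, compared with (2/pi) * sqrt(1 - x^2).
near = np.abs(x - 0.5) < 0.01
print(near.mean() / 0.02, 2 / np.pi * np.sqrt(1 - 0.25))

# Conditional of Y given X near 0.5: roughly uniform on
# (-sqrt(0.75), sqrt(0.75)), so its standard deviation should be
# about (2 * sqrt(0.75)) / sqrt(12).
print(y[near].std(), 2 * np.sqrt(0.75) / np.sqrt(12))
```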
All right, so those are these basic concepts: joint, conditional, and marginal. I wanted to mention one more thing that's
analogous to the one-dimensional case. And that's what I call the 2D LOTUS. And it's completely analogous. So we wanna do LOTUS where we have
a function of more than one variable. So, let's let (x,y) have a joint PDF. I'll state it in the continuous case, but you could also do a discrete
2-D LOTUS if you want. So we have a joint PDF, f(x,y) okay, and then just let g be any function of xy. Let's say it's real valued. So this function g, takes two values
as input and outputs one value. For example it could just be x plus y, or
it could be x squared times sine of x,y, cubed or whatever. Just any function of x,y, okay? A real-valued function of x,y. And then we're gonna write down LOTUS. LOTUS tells us how to get
the expected value of g(X, Y), and it says, we do not need to try
to find the PDF of g(x, y), we can work directly in
terms of the joint PDF. And all we have to do is integrate: E[g(X,Y)] is the double integral, from minus infinity to
infinity and minus infinity to infinity, of g(x,y) f(x,y) dx dy. But possibly we can narrow
down that range to wherever the PDF is nonzero. I just changed capital
X, Y to lowercase x, y, and then I use the joint PDF. Completely analogous.
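As a quick sketch of 2D LOTUS in action, again with the made-up density f(x,y) = x + y on the unit square (my example, not from the board):

```python
import sympy as sp

x, y = sp.symbols('x y')

f = x + y                 # made-up joint PDF on the unit square
g = (x + y)**2            # some function of (X, Y)

# 2D LOTUS: E[g(X, Y)] is the double integral of g(x, y) * f(x, y),
# with no need to find the distribution of g(X, Y) itself.
E_g = sp.integrate(sp.integrate(g * f, (x, 0, 1)), (y, 0, 1))
print(E_g)                # 3/2
```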
So let's do a couple of examples of how this fact is useful. Here is an important fact that I
already needed this fact once and we didn't prove it yet, which was
when we were talking about the fact that the MGF of a sum of independent random
variables is the product of the MGFs. And at some point we need to say E
of something times something is E of the one thing times E of the other thing. That's true when they're independent,
that's what we need to show right now. So the theorem is that if X and
Y are independent. Then E of XY equals E of X E of Y,
that's a very useful fact. Well, we'll come back to this fact
later when we talk about correlation, the way we would say this in words is
that independent implies uncorrelated, that's just foreshadowing. Later we'll talk more about what
exactly does correlation mean. But, when we define correlation,
in a later lecture, we'll see that that actually
says that they're uncorrelated. And so this independent implies uncorrelated,
is the way to say it in words. So let's prove this fact. And this is always true. It doesn't matter if they're continuous or
discrete or whatever. But so we don't have to invent a lot
of notations or do a lot of cases, let's just do a continuous case for
practice. So, proof in the continuous case: well,
we're just gonna use the 2D LOTUS. That saves us a lot of effort, because
when you just see this thing E of XY. X times Y is a random
variable in its own right. So the first time you see this, you might think I need to
study that random variable. That takes a lot of work. 2D LOTUS just says that's a function
of X and Y, so I'm just gonna use LOTUS and then it's gonna be easy. So E of XY equals, how do we do this? Well, I'll just write down the double
integral minus infinity to infinity, minus infinity to infinity
xy times the joint PDF. But since we assumed
that they're independent, the joint PDF is just the product
of the marginal PDFs. So independence means the joint
PDF just factors like that. And that's what makes this,
actually, easy to deal with. Because this function is
just separated out like this: a function of x times a function of y. Function of x, function of y. Very nice. So, now, what do we actually do? Well, what this says to do is take this. I'll put parentheses here to
make it a little clearer what this double integral means. That's just the definition
of this double integral. It says do this inner integral, then do this
outer integral, so you work your way out. When you're doing this inner integral
you're treating y as a constant, so this y you're gonna stick it right there, and this f_Y(y), also stick that there. Both of those come out. So that looks a little messy,
let's rewrite that. All I did was to take out the y and
the marginal PDF of y. And what's left is the x and
the marginal PDF of x. So I just took them out. Now this whole thing here,
that's just a number. That just says we took this function and
we integrated it over x and we get a number. That's just a constant. And we know what constant that is,
that's E of X. That's just a number. So that constant you can pull
out of this entire integral. It's just a constant,
take it out of the integral. What's left? Integral of Y times the PDF of Y. That's just E of Y. So that's immediately just E of X E of Y. So basically this amounts
to E of X E of Y. All this amounts to doing is just taking
out things that you're treating as constant and then the factors. So that's a useful fact. And it would be a nightmare to try to
prove this without having LOTUS available. But with LOTUS then we can
do that pretty quickly.
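A quick numerical sanity check of this fact (the particular distributions here are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6

# Independent X and Y: E(XY) should match E(X) E(Y).
X = rng.exponential(2.0, n)
Y = rng.uniform(0, 1, n)
print((X * Y).mean(), X.mean() * Y.mean())   # both close to 1.0

# Contrast with a dependent pair (Y = X): E(X^2) is not E(X)^2.
print((X * X).mean(), X.mean() ** 2)         # about 8 vs about 4
```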
All right, there's another problem I like to do with the 2D LOTUS. And that's the expected
distance between two points. So let's start with the uniform case. I talked about this on the strategic
practice too, you can look at that later. But I think this is a useful point for
everyone to see this now. So we have, so this is an example,
where we take two uniforms, let's let them be X and Y, i.i.d. Uniform(0,1). And we wanna find the expected
distance between them, E|X - Y|. So, this kind of problem comes up a lot
in applications where you have two random points. And often you wanna know
how far apart they are. So, this is used for various applications. And so one approach would be,
try to study absolute value of X minus Y, find it's distribution and maybe for
some problems we need the distribution. But in this case,
we just want the mean, so LOTUS should suffice for that. So just write down LOTUS: it's a double integral of |x - y|. And since they're i.i.d. uniform,
the PDF is just 1; the joint PDF is just 1. So you just have to integrate this thing,
dxdy from 0 to 1, 0 to 1. So then the only question I guess is
how do we integrate the absolute value? Well usually if you wanna
integrate an absolute value, the best strategy would be to
split the integral into pieces such that you can get rid
of the absolute value. So we could split this up as one
piece where x is greater than y. I'll write it this way. X greater than y. That is I'm integrating over
the set of all points in 0, 1 where x is greater
than y of this function. Now if x is greater than y,
I can just drop the absolute value plus and now integrate over the piece
where x is less than or equal to y. And in that case,
it's y minus x not x minus y. Now if you think of
the symmetry of the problem, this problem is completely
symmetrical because of the i.i.d. And this is a symmetric function;
I could have changed this to y minus x. So, really there is no point
in doing two double integrals; let's just do one
double integral and double it. And then we have to do two integrals
instead of four, so that's much nicer. This is just gonna be 2 times
the first integral I wrote down. Okay, I am not gonna do a lot
of double integrals in class and you won't have to do many double
integrals in general in this course. We will have to do a couple of them. So just for practice, let's do this. Basically, the only thing you could
mess up is the limits of integration. So let's carefully say, how do we get
the correct limits of integration here? The outer limits, I could've done dydx and then it'll be different
limits of integration. Okay, but I chose, for no particular
reason, to just write it as dxdy. So if we write dxdy,
the outer limits must refer to y. And we know y goes from 0 to 1. Okay, now the inner limits, so these
outer limits have to just be numbers. But as you move inward, the limits can
start depending on other variables, so these inner limits can depend on y. In fact, they have to depend on y. Okay, it would not work to go 0 to 1 here. We know x has to be between 0 and 1. But we also know that we're only
integrating over x greater than y, so x has to be greater than y. So we go from y to 1. Right, because x is bigger than y so
it has to start at y, so that's all we have to do. We do have to be careful. It's easy to mess up
the limits of integration. Now this just says do 2 easy integrals,
okay? So it's integral 0 to 1,
this inner integral. For the inner integral, I integrate x - y dx, so I'm treating y as a constant, so I would just get x squared over 2 minus yx. All right, I'm treating y as a constant
and then evaluate this from y to 1. Okay, and so then we just plug in 1 and
subtract, plug in y. And then it's just a very,
very easy integral. And I won't bore you with
all the algebra for that. You just plug in 1, plug in y, and
it's integrating a very easy integral. If you simplify that, you get one-third.
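For what it's worth, here is a sketch that has sympy do exactly this double integral, plus a quick simulation check:

```python
import sympy as sp
import numpy as np

x, y = sp.symbols('x y')

# Outer integral over y from 0 to 1, inner over x from y to 1,
# doubled by symmetry, exactly as set up on the board.
inner = sp.integrate(x - y, (x, y, 1))
print(2 * sp.integrate(inner, (y, 0, 1)))        # 1/3

# Simulation check of the same answer.
rng = np.random.default_rng(2)
U, V = rng.uniform(size=(2, 10**6))
print(np.abs(U - V).mean())                      # about 0.333
```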
So the average distance between two uniforms is one-third. Let's draw a little picture to see
whether that makes intuitive sense to us. So we have this interval 0 to 1, okay? And we're picking 2 uniformly
random points in this interval. Let's say there and
there, completely random. But notice that the distance between them
is one-third, because that's at one-third, that's at two-thirds, so the distance is one-third. That sort of looks like
your stereotypical, if you had to guess something
what would it look like, that might be what you would guess,
right, and it works. This actually, for me at least, this
actually makes the result one-third easy to remember,
even though that's not a proof, obviously. >> [LAUGH]
>> But that actually does suggest another
way to look at this problem. Which is,
I'm picking these two random points, and there's gonna be a point on the left and
a point on the right. So that suggests reinterpreting this
in terms of the max and the min. So another way to look
at this would be to let, let's say M = max(X, Y). And you should think through for yourself why the maximum of random
variables is a random variable. That's just basic practice
with your random variables. L, this is something that always
annoys me is that the word maximum and the word minimum both start with m, so
it's hard to remember your notation. So I started using L for
the minimum because L stands for the least one or L stands for
the little one, but unfortunately, then I realized L
could also stand for the large one. >> [LAUGH]
>> It's just one annoying fact about English. >> [LAUGH]
>> Well, anyway, we'll let M be the max and L be the min. Here's a handy fact: |X - Y| is the same thing as M - L, right? Because you take the bigger
one minus the smaller one, that's the same thing as taking in the
absolute difference, same thing, right. That's how you do an absolute value, you just take the bigger
one minus the smaller one. So therefore,
what we've just shown is that E of M-L = one-third,
according to that calculation. So that says that E of
M-E of L = one-third. And on the other hand, sorry,
I should have written this up higher. I'm gonna go loop around to the top here. So the difference of
the expectations is one-third. Let's also look at the sum. If we look at E of M+L, Well, by linearity, that's E of M + E of L. But on the other hand,
what's M + L in terms of X and Y? It's just X + Y because if you add
the bigger number plus the smaller number, all you've done is add the two numbers,
right? M + L is the same thing as X + Y. But by linearity,
E of X + Y is E of X + E of Y and both of those are one-half cuz they're
uniform 0 to 1, so this must = 1. We just showed that that is = 1. So from this,
we actually now have an expression for the sum of these two expectations and
the difference. So therefore, we can just solve that and
we get E of M and E of L. So, E of M = two-thirds, and E of L,
I have a system of two equations and two unknowns,
just solve that the usual way, add the two equations, that kind of thing. E of L = one-third,
just like in this picture. That's L, that's M, on average. So on average, it looks like that.
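A quick simulation sketch of these two answers (and of their difference, which matches the one-third from before):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 10**6)
Y = rng.uniform(0, 1, 10**6)

# E(M) should be about 2/3, E(L) about 1/3, and E(M - L) about 1/3.
M = np.maximum(X, Y)
L = np.minimum(X, Y)
print(M.mean(), L.mean(), (M - L).mean())
```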
So another approach to this problem: here I used this result to derive that one, but we could instead have directly studied the max and
the min, okay. And you've seen examples like on
the strategic practice, like the very useful fact that the minimum of
independent exponentials is exponential with a larger rate. And we showed that on
the strategic practice problem. And on the new strategic practice,
there's something related with the min and the max. So another way to do this would have been
directly find the PDF of M, the PDF of L, and that would give us this result, right. So you could go in either direction. I actually don't like
doing double integrals, I'm not gonna do a lot of double
integrals, you won't have to do many. I felt I should do one for
practice with the 2D LOTUS. Okay, but in general,
I would rather think more in this way. Use linearity, use the CDFs, the things
like that and not do a lot of integrals. Okay, so those are continuous examples. I want to do one discrete example for
the rest of today. And then next time,
we'll also do some more discrete stuff and maybe some more continuous stuff too. This is one of my favorite
discrete problems. I call it the chicken and egg problem, chicken-egg. We already had a homework problem about chickens and
eggs, and hatchings and so on. But it's not exactly the same
as this problem, it's related. So here's the problem. I'll state the problem, then we'll solve
the problem, and then we'll be done. >> [LAUGH]
>> Okay, here's the problem. There are some eggs, some of them hatch,
some of them don't hatch. The eggs are independent. Let's assume there are N eggs. The twist to this problem is that
the number of eggs is random; a chicken doesn't always lay
exactly the same number of eggs. So let's assume that N is Poisson(lambda). That's the number of eggs.
Now each one either hatches or fails to hatch, so
each hatches with probability p, and independently, so you can think
of each egg as an independent Bernoulli p trial for
whether it hatches or not. Hatching is success. So independently, and
let X equal the number that hatch. So I would write that as:
X given N is Binomial(N, p). That's just a restatement
of things we already know. As I explained, this notation means
pretend that N is a known constant, actually N is Poisson, but pretend
that now we know the number of eggs. So we're treating N as a constant then
just binomial Np because I assumed independent Bernoulli trials. So still we know that. Okay, let's also let Y equal
the number that don't hatch. So we have an identity: X plus Y equals N. All right, well,
that's not the end yet, but we derived the theorem that X plus
Y equals N, because the number that hatched plus the number that don't
hatch equals the number of eggs. Now the problem is to find the joint PMF. It's discrete so
I could find the joint PMF of X and Y. And in particular we'd like to know,
are they independent? And intuitively, they seem extremely
dependent, because their sum must equal N. That's not a proof though, because that argument only shows that they're
conditionally dependent. That is, if we know N they are dependent,
and intuitively they seem pretty dependent. That is, if you have a lot of eggs
that hatch then there's not so many left that don't hatch. But we haven't yet proven whether they are
independent or not, cuz N is random. So now let's find the joint PMF. So just by definition the joint PMF is
the probability that X equals something, let's say i, and Y equals j,
could use little x little y. But I'm just using i and
j to remind us that they are integers. Now, to do that, somehow we have
to bring in this Poisson thing. So our strategy for solving this should be going back
to our early part of the course. You have a probability, if you don't
immediately know how to do it, try to find something to condition on. What do you condition on? What we wish that we knew. I wish I knew the number of eggs. Then it's an easy binomial problem. Conditional on the number of eggs,
just a binomial. So we're gonna condition on N. The law of total probability says we can
just write this as the sum over all n from zero to infinity of the probability that X equals i,
Y equals j, given that N equals n, times the probability that N equals n. That's just total probability. And the probability that N equals n,
we already know that from the Poisson. Well, okay, that looks a little scary like
we're gonna have to do an infinite sum. For similar problems I've seen a lot of
students get stuck at this point. And my suggestion is if you ever find
yourself getting stuck at a point like this is to try some simple examples,
make up some numbers, do some special cases so
you think about it more concretely rather than being intimidated
by this infinite series. If you actually think about it concretely,
you'll notice something very, very simple. That is, if I said,
what is the probability of this, this is just kind of some scratch work. What's the probability that X equals 3,
Y equals 5, given N equals 10? What's that? 0, because there's 10 eggs, 3 hatched, 5 didn't hatch,
someone stole the other two eggs; I mean it doesn't make any sense,
it's impossible, so it's 0. What's the probability that X equals 3,
Y equals 5, given N equals 2? 0, There's only two eggs and yet
you're claiming three hatched and five didn't hatch, that makes no sense. So as soon as you write down, I find writing down a few
simple numbers like that it becomes completely obvious that this
incident sums up actually only one term. The one term is the case when
in fact N equals i plus j. So we only have one term here. X equals i, Y equals j,
given N equals i plus j. Otherwise there's a mismatch. Times the probability
that N equals i plus j. And now we know everything we need
to know to just evaluate this. Notice there's now some redundancy,
because if I know there's i plus j eggs and
i hatched, I already know that j didn't hatch. You didn't have to tell me that. Redundant information,
we just cross that out. Now the probability that X equals i given N,
that's just from the binomial. Cuz given the value of N we're treating
X as binomial so we're just gonna take something from the binomial PMF times
something from the Poisson PMF. And so let's see, that board is broken, so we can do this
here still, just have a little more space. So we want to find
the probability that X equals i, given N equals i plus j, times
the probability that N equals i plus j. Okay, this is just, this is an easy calculation now,
but let's see what the answer is. For the first term we
just use the binomial. So i plus j choose i. I'll write that as i plus j,
that's a factorial, thank you, so that's (i plus j) factorial, over i factorial, j factorial. That's just i plus j choose i,
from the binomial. Times p to the i, because we're assuming Binomial(N, p). So p to the i, and q,
as usual, q is one minus p. So i successes, j failures,
q to the j, and then times the Poisson PMF,
e to the minus lambda, lambda to the i plus j
over i plus j factorial. Let's just simplify this quickly. i plus j factorials cancel, and let's try
to write this in a nicer looking form. Where we are going to try to split it up into a function of i
times a function of j. So we could write this as: lambda to
the i plus j splits up, so really we have (lambda p) to
the i over i factorial, and we have (lambda q) to
the j over j factorial. And the only thing left that we have to
deal with is this e to the minus lambda. But remember that p plus q equals 1, so I can think of it as having a p
plus q sitting up in the exponent. So this is e to the minus lambda p, times e to the minus lambda q. So actually it factored, and
that shows that they are independent. That says that X and Y are independent. And X is Poisson: X is Poisson(lambda p),
and Y is Poisson(lambda q). Which sounds impossible at
first: how could they be independent? And if your intuition was that
they're not independent, you shouldn't feel bad about that because it turns out
that this is only true for the Poisson. So this is actually a very
special property of the Poisson. If you change Poisson to anything
else they will become dependent. It happens to be true for the Poisson,
we just proved that they're independent. That is, you think, well,
if you have more eggs that hatched, there are fewer that didn't hatch;
but the number of eggs is random, and for the Poisson
that randomness exactly makes them independent.
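Here is a small simulation sketch of the chicken-egg result (the values of lambda and p are arbitrary choices of mine, and this is only a numerical check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, p = 10.0, 0.3
trials = 10**6

# N ~ Poisson(lambda) eggs; each hatches independently with probability p.
N = rng.poisson(lam, trials)
X = rng.binomial(N, p)    # number that hatch
Y = N - X                 # number that don't hatch

# X and Y should look Poisson(lambda * p) and Poisson(lambda * q).
print(X.mean(), X.var(), lam * p)          # all about 3
print(Y.mean(), Y.var(), lam * (1 - p))    # all about 7

# And they should be (nearly) uncorrelated, consistent with independence.
print(np.corrcoef(X, Y)[0, 1])             # about 0
```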
Well, that's just one example of a joint PMF. It's also a nice story. And have a good weekend.