Well let's get started. Thanks for coming despite the rain, but at least we can feel lucky that the sun rose
today cuz we have a lot more to do and it would be hard to do if
the sun stopped rising. So, okay, so
we were talking about MGFs last time. We've done all the theory that we need for
MGFs but I'm not sure that the intuition is clear enough yet
and there are some important examples. So I just want to start with
a few examples of MGFs. We already have all the theorems. Okay, especially,
how do we work with MGFs for some of the most important distributions
such as exponential, normal, and Poisson? Just to show you how
the MGFs are useful for some of those famous distributions, okay. So let's start with the exponential. And [COUGH] we talked before about,
So this is the Expo MGF. We talked before about the fact that if we
have exponential lambda we can always find a constant to multiply by
to make it exponential one. So let's just start with the exponential
one case cuz that's simpler, that is, lambda equals one. Let X be exponential one, and suppose that we wanna find the MGF and
find the moments, okay. Find moments. And this will really show you why it's
called the moment generating function. That doesn't actually, I didn't actually talk about where
did the word moment come from? It comes from physics. Those of you who've done
moment of inertia and stuff, there's actually a pretty strong analogy
between variance and moment of inertia. That doesn't answer the question for where
did the word moment come from in physics, but you can ask the physicist that. But it came into statistics via physics
because of this analogy with moment of inertia. Anyway, so we have an exponential,
okay, and so let's find the MGF. Well by lotus that's
a pretty easy calculation. M(t), remember just by definition it's
just the expected value of (e to the tx). And this is a perfectly
valid thing to write down. This e to the tx, that's just some random variable; we're taking its expectation, and then we're viewing
this as a function of t. And I pointed out last time, t is a dummy
variable, so I could just as well have said M(s) = expected value of e to the sx, or whatever
you wanna call it that doesn't clash. [COUGH] The interpretation
is just that this is a very, very useful bookkeeping device for
keeping track of moments, and it's another way to describe a distribution, rather than a CDF or a PDF. Okay, so let's just compute this thing. Well this is an easy LOTUS problem cuz by LOTUS we can just immediately say this is the integral 0 to infinity,
e to the tx, e to the -x dx. All right,
that's just immediate from LOTUS; combine the two exponentials,
so that's e to the -x(1-t) dx. So that's just an easy integral, right. So that integral, well actually one way
to do it is just to do the integral. Another way to do this integral is to
recognize this as another exponential PDF with a different parameter, up to its normalizing constant. And you'll get 1 over 1- t,
and this is for t less than 1. If t is bigger than 1,
we have some problems here. Cuz if you let t be 2 for example,
then 1 - t is 1 - 2, which is -1. You'd get e to the +x,
which would blow up. But as long as t is less than 1,
this will be okay. Exponential decay, not exponential growth. So we have to assume t is less than 1. But that's okay, cuz we talked last time
about the fact that we wanted to have some interval, I called it -a to
a on which this is finite. So in this case it's finite
everywhere to the left of 1, right. So in particular it could take
some interval like say -1 to 1 open interval on which it's finite. So this is a perfectly valid MGF. Okay, so
now we wanna get the moments, right. So from what I said last time,
we could take this thing 1 over 1-t and start taking derivatives,
so, and plug in 0. So it would be true that M'(0)
would be the mean and M''(0) would be the second moment. And once we have this and
this we could easily get the variance. We already talked about the mean and the
variance of the exponentials, so you could do this and check that it agrees with
what we did earlier through lotus, okay. And then third moment would be the third derivative evaluated at 0,
and so on, right. So we could do that, but that's kind of annoying in the sense that
you have to keep taking derivatives. Now for this function, taking a bunch
of derivatives is not too bad, okay. But it's still a much better way to do
this, Is to recognize the pattern, right. A lot of this is about
pattern recognition, okay. Where have we seen 1 over 1-t before? Geometric series, right. We keep using the Taylor series for
e to the x and the geometric series over and over again. It can go in both directions, right. You can have this closed-form expression and expand it as a geometric series, or you can have a geometric series and
simplify it to this. Anytime you see one over
one minus something, you should be thinking that may have
something to do with the geometric series. That may be a useful interpretation,
it may not, but at least the idea should pop into your
mind just cuz you see this pattern, okay. So if we do that we get 1 over 1-t
equals just a geometric series, a sum of t to the n, n=0 to infinity. And this is valid
for absolute value of t less than 1. That's when this converges. By writing it this way,
we don't actually have to do derivatives. We're just looking at this series,
okay, and then we're just gonna
read off the moments. So the only thing we have to be
careful about is the n factorial. Because I said with the MGF,
you take the Taylor expansion and the moment is whatever is in front
of t to the n over n factorial. I don't see an n factorial here,
but that's no problem, right. We just multiply and divide by n factorial,
cuz we need the n factorial there. So I'll multiply by n factorial
t to the n over n factorial. Now this matches exactly the pattern
that we talked about last time, about whatever's in front of the t to the
n over n factorial, that's the nth moment. So that's the nth moment. So we immediately know now that E(x
to the n)=n factorial for all n. So instead of taking derivatives over and over again, we simultaneously
get all the moments of x, okay. So that's nice, right. Didn't need to take any derivatives. So, by the way, that's kind of like
the coolest thing about MGFs is the fact that if you, just in general,
not necessarily for this example. If you wanna find the moments
of some distribution by lotus, you would think you have to integrate,
right. You want e of x to the n so you're going
to integrate x to the n times the PDF. That may be an incredibly
difficult integral. But the MGFs, once you have the MGF,
we're taking derivatives not integrals. So it's pretty surprising
to me at least that you can do derivatives of the MGF rather
than the integrals of powers of X. Derivatives are much easier usually than
integrals, so that can save a lot of work. So let's just quickly see what
happens if it's exponential lambda, where lambda is not necessarily 1. So now let's let Y be exponential lambda. And then, let's just convert it,
just to see how to apply this. Convert it, well, we talked before about the fact that
if you multiply or divide by lambda, it may be hard to remember, should you
multiply by lambda or divide by lambda? But there's an easy way to see that. Let's just let X = lambda Y. So I need to multiply by lambda
rather than dividing because we know the exponential lambda
has mean 1 over lambda. So if we multiply by lambda
now this has a mean 1. And we show that this is,
in fact, exponential of 1. So we've converted it to this case. In other words, Y = X over lambda,
and we can take nth powers. So now we immediately have the nth moment of Y. Expected value of Y to the n = expected
value of X to the n, which is n!, divided by lambda to the n. Okay, so I didn't do any calculus here. I only used the geometric series. We could have directly done something similar to this for Y, but I think it's easier working with the Exponential(1) case and then converting it back.
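As a quick optional check that isn't part of the lecture: you can read the Expo(1) moments off 1/(1 - t) in a few lines of Python with sympy, both by differentiating and by expanding the series; dividing the results by lambda to the n then gives the Expo(lambda) moments from the conversion above. The variable names are just illustrative.

```python
# Sketch: read the Expo(1) moments off the MGF M(t) = 1/(1 - t), two ways.
import sympy as sp

t = sp.symbols('t')
M = 1 / (1 - t)  # Expo(1) MGF, valid for |t| < 1

for n in range(1, 5):
    via_derivative = sp.diff(M, t, n).subs(t, 0)                    # M^(n)(0)
    via_series = sp.factorial(n) * M.series(t, 0, n + 1).removeO().coeff(t, n)
    print(n, via_derivative, via_series)  # both print n!: 1, 2, 6, 24
```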
Similarly, at the end of last time, we derived the MGF of the standard normal,
okay? Now if you want any
normal mu sigma squared, then you just write it as mu plus sigma Z,
right? Then you can get its MGF very easily. So a lot of times it's easier to
work with the standard normal. Okay, so speaking of standard normal,
let's actually get the normal moments now. We already know the odd moments. So the problem is let Z be standard
normal, and find all its moments. Okay? We already know that
the first moment is 0 and the second moment is 1 cuz
it's mean 0 variance 1. We already know that
the odd moments are all 0. That's just by symmetry, we mentioned
that fact before but you should check for yourself that that makes sense,
to practice using the symmetry. Because if you write down
the integral using LOTUS, you would be integrating an odd
function symmetrically about 0 so the negative area cancels
the positive area. So don't need to do any work to get this,
just use symmetry. Even moments, though,
that seems pretty hard. And we already know E of Z squared, and
we did that by doing some integrations. Now if we want E of Z to the fourth,
if we use LOTUS, you're gonna have to integrate Z to
the fourth times the normal PDF. How do you do that integral? I don't know, I mean, you can try doing
some substitutions, you can try doing integration by parts and you can easily
spend a couple hours doing that integral. And it's possible to do it, but
it's not easy, it'll be a lot of work. And that would just be the fourth moment,
and then you'll say well, what about the sixth moment? What about the eighth moment, right? So that's not a very
efficient way to do things, it's doing a lot of
nasty looking integrals. Okay, so let's use the MGF instead. The MGF that we derived last time is the function M of t = e
to the t squared over 2. So that at least gives us an approach to
getting the moments that doesn't involve having to figure out how to
do these integrals, okay? It's something more straightforward. Like for derivatives, we have the chain
rule, the product rule and so on. There's no chance that you can't do this
derivative if you know your chain rule and product rule, and stuff like that. Whereas for integration, you may just
not know how to do the integral, okay? So we could take the derivative of this,
use the chain rule. And we're gonna get a t that comes out
in front because of the chain rule. And then we take the second derivative, because then there's gonna be a t out
there after the first derivative. Then we will have to use the product rule,
okay. And then we take another derivative,
then we have 2 terms, and then terms start multiplying and we get more and more terms to deal with. And it'll get more and more tedious and ugly, the more derivatives we take. It's still something that you can do. It's pretty mechanical, but it's tedious,
and we wanna avoid tedious stuff, okay. So here's a much better
way to think about it. Over there with the exponential, I emphasized just the pattern
recognition geometric series. Let's apply the same thinking again. Pattern recognition,
this is e to a power, okay? Unlike the geometric series,
the Taylor series for e to the x converges everywhere. So I can immediately just write
down the Taylor series for this, without taking any derivatives. This is just the sum of (t squared over 2) to the n, over n!, right. Because the Taylor series for e to the x
is valid everywhere, so in particular, I can plug in t squared over 2, okay? So this is a much,
much better way to do it, than to start taking derivatives of this. So let's simplify this, this is the sum,
notice that we're only gonna get even powers of t, which makes sense
because this is an even function. So it's gonna be t to the 2n, and there's a 2 to the n in the denominator and there's an n!. Okay, so that's what it is. Now, same as over there,
we just have to read off the moments. The only thing you have to be careful
about is the fact that there's a 2n here in the exponent, and there's an n!, there. So there's kind of a mismatch right now,
okay? We want the 2n moment because 2n
is just an arbitrary even number. Okay, we want the 2n moment,
so for the 2n-th moment, we want the coefficient of t to the 2n over (2n)!. We don't have a (2n)! here. Well, that's okay, just put in a (2n)!. As long as we multiply by that too, it's okay. So I just multiplied and divided by (2n)!, and that immediately tells us the answer. The expected value of Z to the 2n, so that's just an arbitrary even moment (we already have the odd moments), is just the coefficient of t to the 2n over (2n)!. That's everything that's left. That's (2n)! over 2 to the n times n!. And let's just check whether this
makes sense in the cases we know. If n = 1, this is 2! over 2 to the 1 times 1!, so that's 2 divided by (2 times 1), which is 1. So E(Z squared) = 1, and that's what
we expected because the variance is 1. And let's just do a couple more. For n = 2 we get the fourth moment, E(Z to the fourth): 4! is 24, divided by 2 squared times 2!, which is 8, and 24 divided by 8 is 3, so the 4th moment is 3. And the next one, E(Z to the 6th), is gonna be 3 times 5 = 15, which you can write as 1 times 3 times 5. You'll see the pattern: it's 1,
1 times 3, 1 times 3 times 5, 1 times 3 times 5 times 7 and so on. And this is not the first time
that we've seen these numbers, or at least if you've done the strategic
practice problems going way back. That was the number of ways to break 2n people into n partnerships, and there's a story problem there. We could either write it this way,
or as a product of odd numbers. So kind of a surprising fact,
or at least I found it really surprising that the same expression
comes up for even moments of the normal. As it's the same number as breaking
up people into partnerships and counting number of ways to do that. And I thought that was kind of mysterious,
it turns out that it's not a coincidence. But there's this kind of a very
deep combinatorial explanation for that which I can't get into but
there is a reason for that. Anyway, that gives us all the moments
of the normal distribution now without doing any calculus. So that's nice, okay?
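Again purely as an optional check, not something from the lecture: a short sympy sketch comparing the formula (2n)! / (2^n n!) with the LOTUS integrals done symbolically, letting the computer grind through the integrals we avoided doing by hand.

```python
# Sketch: check E(Z^(2n)) = (2n)! / (2^n * n!) against the LOTUS integral.
import sympy as sp

z, n = sp.symbols('z n')
pdf = sp.exp(-z**2 / 2) / sp.sqrt(2 * sp.pi)          # standard normal PDF
formula = sp.factorial(2*n) / (2**n * sp.factorial(n))

for k in (1, 2, 3):
    lotus = sp.integrate(z**(2*k) * pdf, (z, -sp.oo, sp.oo))
    print(2*k, sp.simplify(lotus), formula.subs(n, k))  # 2: 1, 4: 3, 6: 15
```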
So one more MGF problem, one we haven't talked about yet in class: the MGF of the Poisson distribution, so let's do that. So for the Poisson lambda, we know it has mean lambda and variance lambda,
but we haven't computed any other moments. But mainly for the Poisson,
I wanted to show you the other reasons. Like I said last time, there are three reasons why the MGF is important, and those examples illustrate why it's called a moment generating function, cuz we generated all the moments. But for the Poisson I wanna show
you the other important reasons. So let's let x be Poisson lambda and
find its MGF. Again, let's just use LOTUS: the expected value of e to the tX = the sum, so Poisson takes non-negative integer values so I'll just say k equals 0 to infinity, of e to the tk, all right it's just LOTUS, e to the tk times the Poisson PMF, e to the minus lambda,
lambda to the k over k factorial. Okay, looks like a kind of ugly sum. But actually you'll find that this sum is
an example on the math review handout, so I was planning for this in advance. But you don't have to memorize that or
anything, this is just another example of pattern
recognition dealing with a series. It looks a little ugly
when you first see it but this is actually easy once you're
familiar with the pattern, right? So e to the minus lambda comes out cuz that's just a constant; look at what's left inside. We have e to the t to the k and lambda to the k, so together that's (lambda e to the t) to the k, right? So all that's left is the sum of something to the k over k factorial. That's just the Taylor series for
e to the x again. So this is very easy once you've mastered
the Taylor series for e to the x. So we can just immediately write that down; that is the Taylor series for e to the x evaluated at x = lambda e to the t, okay? So, together with the e to the minus lambda out front, we can simplify that a little bit: it's e to the lambda times (e to the t minus 1), and it's valid for all values of t
because the series converges for all t. Okay, so that's the Poisson MGF.
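If you want to convince yourself numerically (this isn't from the lecture), a couple of lines of Python with made-up values of lambda and t show the LOTUS sum collapsing to that closed form:

```python
# Sketch: the LOTUS sum for the Poisson MGF matches exp(lam * (e^t - 1)).
from math import exp, factorial

lam, t = 2.0, 0.7  # arbitrary illustrative values
lotus_sum = sum(exp(t*k) * exp(-lam) * lam**k / factorial(k) for k in range(50))
closed_form = exp(lam * (exp(t) - 1))
print(lotus_sum, closed_form)  # agree to many decimal places
```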
So one thing we could do with it is to start taking derivatives or
whatever to get the moments. But I'm not doing this example
because I wanna do moments; I wanna show you the other applications of MGFs. So now let's let Y be Poisson mu. So we have two Poissons now,
not one Poisson. And suppose that X and Y are independent. And the problem is find
the distribution of X + Y. So we wanna study the sum of
two independent Poissons. Okay, so that's called a convolution, and
we'll come back to convolutions later on in the semester, but
you know in general it can be nasty. But I pointed out last time that for MGFs you can just multiply the MGFs,
that's easy. Whereas, doing a sum or
an integral could be pretty nasty, okay? So all we have to do is multiply the MGFs. That is I'm just going to take
the MGF of X times the MGF of Y. So here's the MGF of X,
e to the lambda times (e to the t minus 1). The MGF of Y is going to be the same thing, except that the parameter is now called mu instead of lambda. So that's gonna be e to the mu times (e to the t minus 1). And let's just simplify the product: that's e to the (lambda + mu) times (e to the t minus 1), factoring that out. That immediately tells us that
X + Y is Poisson lambda + mu. Because of the fact,
we didn't prove this theorem, as I said that's a really
difficult theorem. But that is a theorem that this is
the Poisson lambda plus mu MGF. There's no other distribution
that has the same MGF. So this is, therefore,
the only possibility. By the way, it was obvious that the mean
had to be lambda + mu, by linearity. So this, we already knew. The interesting part is that it's Poisson, the sum of independent
Poissons is still Poisson. Most distributions don't
have such a nice property. Like you'll add independent
versions of them, and usually you get some other family. Here it's still within the Poisson
family of distributions, okay? So that's a very, very nice property of the Poisson.
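Here's a small simulation sketch, not part of the lecture, with arbitrary illustrative values of lambda and mu: the empirical PMF of X + Y lines up with the Poisson(lambda + mu) PMF.

```python
# Sketch: simulate independent Poissons and compare X + Y with Poisson(lam + mu).
from math import exp, factorial
import numpy as np

rng = np.random.default_rng(0)
lam, mu, N = 2.0, 3.0, 10**6   # arbitrary illustrative parameters

x = rng.poisson(lam, N)
y = rng.poisson(mu, N)
s = x + y

for k in range(8):
    pmf = exp(-(lam + mu)) * (lam + mu)**k / factorial(k)  # Poisson(lam+mu) PMF
    print(k, round(float(np.mean(s == k)), 4), round(pmf, 4))  # empirical vs exact
```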
And a common mistake with this is to ignore the assumption that X and Y are independent. To justify just multiplying the MGFs, we need X and Y to be independent. So just to see a quick counterexample,
if they're not independent. If X and Y are dependent,
well the most extreme case of dependence that I can
think of is when X = Y. Okay, so let's just see why
this doesn't work when X = Y. Well obviously if X = Y, then X + Y is 2X. And that's not Poisson. Why is that not Poisson? Yeah,
>> [INAUDIBLE]. >> Okay, so that's a good way
to think of it with the MGF, if we take the MGF of this thing
you're gonna get a 2 in there. And what you're actually gonna have, you're gonna take the expected value of
e to the 2tx, so you've replaced t by 2t. So you'd get 2t up there and
that doesn't look like a Poisson MGF, so that's close to a proof, but it's a little more complicated than I was thinking of. And you would still need to say, like, could
there be some miracle of algebra that would reduce that back down. It's not true right? If you put a 2 there
it's not of this form. But still, what if you just didn't
think of the brilliant algebraic way to simplify it down. Yeah.
>> [INAUDIBLE] >> Yeah that's the simplest way to see it. What she just said was
that this thing is even. So that's one good way to see it. A Poisson has to take on any
possible non-negative integer value. This thing is always an even number,
so it couldn't possibly be a Poisson. That's the simplest way to think about it, is just looking at one
of the possible values. Another way to see it, would be to compute
the variance, the mean and variance. So the expected value of x plus y,
which is 2x, would be 2 lambda. So if it were Poisson, it would have to
be Poisson 2 lambda cuz that's the mean. But the variance of 2x is 4 lambda
cuz the 2 comes out squared. For a Poisson the mean
always equals the variance. For this thing the variance is double. Intuitively that should make sense because
you're adding the same thing to itself. That increases the variance compared
to if you added independent things, then you might expect if one
thing happens to be very large, then the other thing might offset it,
right? But if you're adding the same thing
to itself and it happens to be large, then you're adding the same
large thing twice, okay?
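Just as an optional numerical illustration of that point, not from the lecture:

```python
# Sketch: with Y = X, the sum 2X has mean 2*lam but variance 4*lam,
# and it only takes even values, so it can't be Poisson.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
x = rng.poisson(lam, 10**6)
s = 2 * x                      # X + Y when Y = X

print(s.mean(), s.var())       # roughly 4 and 8: variance is double the mean
print(np.mean(s % 2 == 0))     # 1.0: every value is even
```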
I've seen similar mistakes, cuz this is like an easy counterexample, and I've seen more serious versions of it many, many times, where maybe we have something like a sum of X1 plus X2 plus X3, and they're i.i.d., and a student just replaces them all by X, so X plus X plus X, and then gets 3X. But X is not independent of itself, and you'll end up with
the same mistake as this. So I wanna mention that counterexample,
okay? So and there's other ways to see it,
too, but we just talked about three reasons why
this was not Poisson when using the MGF, one by looking at the possible values,
one by looking at the mean and variance. So hopefully you're convinced
now that that's not Poisson. Okay, so next major topic in this
course is joint distributions. That is, something we dealt
with a little bit before but just like bringing it in as its
own topic in its own right. So joint distributions just means, how do we work with the distribution
of more than one random variable, okay? So that's why everything in this
course is cumulative, right? Because if you don't fully
understand the CDF of one random variable then it's going to be really
hard to understand the joint CDF of more than one random variable, okay? So joint distributions: we already talked about independence versus dependence, right? If you have independent random variables, the joint distribution just means multiply the individual CDFs or the individual PDFs, and it's pretty straightforward. Remember the slogan: independent means multiply, okay? But in general we need to have
some tools and notation and so on for
dealing with dependent random variables. Maybe just two of them or
maybe a million of them, okay? So we're gonna talk about
joint distributions. And I think the best way to
start is in the simplest case, where we have two random variables. And let's even say there are two
binary random variables. So we can think of this in
terms of two by two tables. And this may seem really, really simple. I hope it seems pretty simple, cuz then if
you understand this simple case really, really well, it'll give you a lot of
intuition for the more complicated case. Okay, so I'll start with a simple one where x and y are Bernoullis. Possibly dependent, possibly independent,
and possibly the same p, possibly different p's. I'm not saying they're both
Bernoulli-p with the same p. Okay, then we can think of this
in terms of two by two tables. So we could draw an example like this with a table and
values where here is x = 0, x = 1, and y = 0, y = 1, okay? Okay and then to specify the joint
distribution all we have to do is put in four numbers here that
are non-negative and add up to 1, right? Any four numbers you want as long
as they are non-negative and add up to 1 that will be
a valid joint distribution. So remember for your know PMFs to be valid
a PMF just non-negative adds up to 1? Completely analogous it's just now in two
dimensions instead of one dimension okay? So we can just make up four
numbers that add up to one. I guess we can talk about some
of the general definitions here. So this is for this specific case. But let's also talk
about the general case. So if we have x and y, first of all, they're joint CDF. It's completely analogous
to the individual CDF. So the joint CDF is the function
of two variables now. F(x,y) = the probability X less than or
equal to x, Y less than or equal to Y. Similarly we have a joint
PMF in the discrete case. Which would just be the probability that
X equals little x, Y equals little y, all right. So we just add this part. That's the PMF. The joint PMF means we're considering
both of them together, okay? Now, in the case where they're
independent if x and y are independent. That means that this joint PMF is the
product, P of X = x times P of Y equals y. So we need, so
that's called the joint CDF. Joint PMF. And now, so this is when we're
considering them together, right? Because it's comma within the same P. Right, it's considering them jointly. Okay if they're independent, that's equivalent to independence
is you can split this up. Okay, so now there's another concept
that we need called marginal The marginal distribution,
marginal just means take them separately. So the marginal distribution for
x would be a probably x less than or equal x is called the marginal
distribution of x. Similarly marginal PMF would
just be just this part, okay? So therefore in words,
we could say that marginal Independence means that
the joint distribution, the joint CDF is the product
of the marginal CDFs. Okay, and similarly we have, we can continue this over here,
we have the notion of a joint PDF I'm doing kind of discrete and continuous together, because
they're analogous to each other and they're analogous to
the one dimensional case. So a join PDF, which we might
write as little f(x, y) such that, so this would be the continuous
case in two dimensions. What does it mean to be a joint PDF? Just as like in the one dimensional case, the PDF is what you integrate
to get a probability. Two dimensional case, same thing. If we wanna know what's
the probability that x and y are in some set,
let's say x,y is in some set B, where B is some region in the plane. Maybe it's a rectangle,
maybe it's a circle or something. Just imagine some area in the plane. Then what we do is integrate
over that region f of x, ydxdy. So that's the first time that we've
written down a double integral here, but as far as what we're concerned, for
the most part, double integrals, for this course, the double integrals,
we're not gonna need to do a lot of them. And when we do normally we can just
think of it as one single integral and another single integral so
just do two integrals. But the intuition should be clear, right? The PDF is what you integrate
to get a probability. So it's completely analogous. And so independence means that We've already talked about this before,
I'm just using new terminology for it. Independence means that the joint x and
y are independent. If and only if, The joint CDF is the product of the marginal CDFs. So, I'll call that, just for emphasis, it would be confusing to use the same
letter F here without any clarification. This is the marginal CDF of x. This is the marginal cd F of x,
this is the marginal cd F of y, this is the joint cd F, okay? So it says that instead of having to do
some kind of complicated joint thing, I can just find the probability of this
event times the probability of this event. So that's the definition of independence. But we've seen over and over again that
usually it's easier to use PDFs or PMFs. So it's equivalent,
it's not too difficult. It's a little bit tedious but
with some algebra, we can show that it's the same thing as saying the joint
pmf is the product of the marginal pmfs. That's in the discrete case. And in the continuous case,
that the joint PDF, is the product of the marginal PDFs. And I wanna emphasize that this
has to be for all x and y. Not just for x and
y that make this thing positive. You have to pay attention
to the zeros also. We'll see an example like that later. So for all real x and y,
we can't restrict it. All right, so coming back to this
little example, we can make up any four numbers we want as long as they're
non-negative and add up to one. So I made up 4 numbers,
just for the sake of example. Two-sixth, one-sixth,two-sixth, one-sixth. So I made up a simple little example here,
and I could ask the question,
are x and y independent? And to answer that we need to say well two ways we can think about it. One would be so I so I wrote this in terms
of, you know, joint CDFs, joint PMF. We could also write
something like conditional. That is independence means you
don't have the distribution of y given that x equals something. It doesn't actually depend on that x part. So it's the same as
the unconditional distribution. Okay, so, well anyway, so each number in this table is one
of the joint probabilities, right? So two-sixth is the probability that x and
y are both zero, one-sixth is the probability that
they're both one, and so on, okay? So to check that they're independent
from the definition, well, what that means is we first need to
find the marginal distributions and then check that this is true. Okay, now to get from
the marginal to the joint. Here's just quickly how do we
get marginal distributions? Getting marginals is actually pretty
easy from the joint distribution. Because Let's just do
the discrete case first. If we wanna know the marginal
distribution of x as the marginal PMF, then just by the action of probability, all we have to do is add up
the different possibilities for y. So that the sum of all y P of X = x,
Y = y, okay? Because just the axiom
of probability right? That we're adding up just
joint cases the union is this. You can also write it as a conditional. You can also think of this as the law of
probability, and write given Y equals y, times P of Y equals y,
it would be the same thing. Okay, that just says add up, X = x, but
Y could be anything so we sum over Y. That's called marginalizing over Y,
that we're just summing up. We start with this thing that's
a function of x and y, sum over all y, then we just get a function of x. And in the continuous case, let's get
the marginal so that's the discrete case. And the continuous case, let's say we want
the marginal distribution of y, similarly, you can get the marginal
distribution of y, I'm not gonna write the same thing again. If you want the marginal PDF? So this is the marginal PDF of y. Marginal, this means viewing it. On its own, as its own thing, right? Then all we have to do is integrate
completely analogous to this. Integrate the joint density, f of x,y, (x, y), integrate over all x. That's just the continuous analog of that. Here we're summing over all values. I swapped the x and
y here just for variety, here we are summing overall values of y. Here we're integrating overall values
of x, the joint density, okay? So, you can go in that direction, this is getting marginal distributions
from joint distributions. You can't go in the other direction. If we only know the marginal distributions
that doesn't tell us anything about how x and y are related to each other, right? So you can't go any other way. But you can go from the joint
distributions to the marginal distribution. So for this example,
let's get the marginal distributions. So what's the probability that y equals 0? Well, obviously, we're just adding
this case plus this case, right? Cuz those are the two
cases where y equals 0. So we add those two cases,
we get four-sixths. Add these two cases, we get two six's. And for the other way around if we
want the X = 0, just add this case and this case and
you get three six's or one half. This one plus this one, 3 / 6. And by the way,
one thing you have to be careful about, is the terminology in economics and
statistics is very different. And when you take an econ class you
always hear about marginal revenue and marginal cost and things like that. And usually in like, AP Econ,
then they don't want to use calculus, and so they explain everything is incremental,
if you do one more unit of something, then what happens? And then later when you actually
see what's going on with calculus, you realize that in Econ,
marginal means derivative and in statistics, marginal means integrate,
so it's completely opposite meaning and I don't know
where the Econ term came from but you can see here where
the statistics term came from. Cuz it's called marginal cuz we
write these numbers in the margins. So that's a marginal distribution. So, once you understand
this two by two table, you basically have the key
intuition into joint distributions. In this case,
here they are independent in this example. To check that they're independent
There are other ways to do it, but just to check it by the definition, what independence means is that
to compute any of these entries. Let's say 2/6ths Asl I need to do
is find the probability that X = 0 x the probability that y = 0 so
I'd multiple 3/6ths times 4/6ths. Which is 1/2 times 2/3 is 1/3 which
is this, so if you get this number I can multiply this times this and so
on so you check this four numbers. So each of these joint probabilities
is obtained by just multiplying two marginal probabilities. So that means they are independent. Or as you can make up your own examples,
if you just here is kind of an extreme example
it doesn't have to be this extreme. But I can pick whatever numbers I want
as long as they're nonnegative and add up to 1. For example, I just made one up here
where these nonnegative add up to 1. So this is a perfectly
valid joint distribution. But you can see right away that this
0 means that it's not gonna be true, that if you multiply,
you can't obtain it that way cuz if you do the marginal thing again,
1/2, 1/2, and this is 1/4, 3/4, and you multiply 1/4 x 1/2,
you don't get 0. So this one would be dependent. This one is dependent,
you can make up your own examples. It doesn't have to have a 0
in it to make it dependent, that was just an easy,
extreme case to see what's going on. Okay, so this is a simple two dimensional
discrete example to think about. Let's also do one simple
continuous example just to have some intuition on
what this all means. So the simplest way to start is I
think at the uniform distribution. What is uniform in two dimensions mean? So let's consider as an example what if we have uniform on the square that's all x y such that x and y are both between 0 and 1. So we just have this square here. We can draw our coordinates, and
have a square here where this is 1 and this is 1, okay? So we have this square, and we want a distribution that's
uniform over this square, so. Remember, in the one-dimensional case, uniform meant that the PDF was
constant on some interval, okay? So the analogous concept would be, we want
a PDF, which is gonna be a joint PDF, and we want it to be
constant on that square. And 0 outside the square, right? So, that just captures the notion
of being a completely random point. As we're picking a random point, x comma
y, we want a completely random point in the square, so we want the density
to be constant all over that square. So 0 outside, let's find the joint PDF. Well, the joint PDF, therefore from
what I just said is some constant c, if x and y are both between 0 and
1, and 0 otherwise. Now in one dimension, if you integrate
the constant 1 over some interval, you get the length of the interval. In two dimensions if you
integrate the constant one over some region you get
the area of the region so if we integrate this thing we get
the area, so the integral is area, So C = 1/area would normalize it, which = 1 because the area
of that square is 1. So the joint PDF would just be 1 inside the square and 0 outside. And if you want the marginal distributions, just integrate this dx or integrate this dy; you'll get 1, so marginally X and Y are independent Uniform(0,1). Which is pretty intuitive, right, because it just says if you pick a random point in the square, the x coordinate is uniform and the y coordinate is uniform. So that's pretty straightforward; that's an example of independence.
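As an optional simulation check, not from the lecture: sampling a uniform point in the square, the marginals behave like Uniform(0,1), and joint probabilities of rectangles factor into products (the cutoff values below are just illustrative).

```python
# Sketch: uniform on the unit square has Uniform(0,1) marginals and factors.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10**6)
y = rng.uniform(0, 1, 10**6)

for a in (0.25, 0.5, 0.75):
    print(a, np.mean(x <= a), np.mean(y <= a))   # both close to a

print(np.mean((x <= 0.5) & (y <= 0.5)))          # close to 0.25 = 0.5 * 0.5
```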
But I wanna contrast that with an example of dependence, where instead of a square,
let's use a circle. So, suppose we want uniform in
the circle; I'll say disc for clarity, since a circle might just mean the boundary, and we want everything inside. So on the disc, x squared plus y
squared less than or equal to 1. Okay, so let's see what that looks like so
we just draw a circle. Sorry, it doesn't look like
a very good circle, but pretend that that's a perfect
circle centered at 0 of radius 1. And we wanna be uniform in here, okay? We wanna write down what the joint PDF,
what are the marginal PDF's, okay? So first of all for
the Joint PDF by the same kinda reasoning. It's just because its uniform that that
means another way to say uniform is that the probability of some region must
be proportional to its area, right. So now in one dimension I said
probability is proportional to length for uniform distribution. Here probability is proportional to area,
so because of that the normalizing constant has to be 1 over the area of
the circle, Pi r squared, so that's Pi. So a joint PDF is 1 over pi
inside the circle and 0 outside. And a common mistake with this kind
of thing is to then think that that says that they're independent
because that's just a constant, so it looks like I can factor 1 over pi as a constant times a constant. It's just a constant, but they're not
independent because of this constraint. They're actually very dependent
because for example, if x is 0, then y could be anywhere from -1 to 1. But if x is close to one, then y has to
be in some tiny little interval, right? So, if we fix x to be here, then y
could be between here and here, right? So the values depend on where, that is knowing x constrains
the possible values of y. That says that they're not independent. So here x and y are dependent. And in fact,
we can show that, given that X equals x, we can actually say what Y can be: Y has to be between square root of
1 minus x squared and minus that. Because x squared plus y squared
is less than or equal to 1. So this depends on x,
this is the constraint. So we might guess that Y is
uniform between here and here. That is if X is here,
then we know it's between here and here, but within that range it could be anywhere, right? So a good guess would be uniform, but next time we'll do an integral to show that, for practice. But you can see right now that they're dependent.
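As a last optional sketch, not from the lecture: sampling uniformly in the disc (by rejection from the square) shows the squeeze directly. The window values 0.05 and 0.95 below are just illustrative choices.

```python
# Sketch: uniform on the unit disc.  The spread of Y given X near 0 is wide,
# but given X near 1 it is squeezed toward 0, which is the dependence above.
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(4 * 10**6, 2))
pts = pts[(pts**2).sum(axis=1) <= 1]          # keep only points inside the disc
x, y = pts[:, 0], pts[:, 1]

print(y[np.abs(x) < 0.05].std())              # large: y ranges over about (-1, 1)
print(y[np.abs(x - 0.95) < 0.05].std())       # small: y is confined near 0
```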
Okay, so see you on Friday.