So we're talking about
joint distributions, right? And there's a lot more to do with that,
so to just continue. So last time, we calculated the expected
distance between two iid uniforms, okay? So I wanted to do this analogous
problem for the normal. Because I think that's another nice
related example that has a different approach that makes it easier, okay? So last time,
we did expected absolute difference. This is just an example, but
I think it's a nice example. Expected absolute difference
between two uniforms, and what if we wanna do the same
thing with normals? So we wanna find
the expected value of say, let's call them Z1 and Z2. So, we did this with uniform last time,
now assume these are iid standard normal. Okay, so last time we did this for
uniform, using the 2D version of LOTUS, right? Completely analogous to LOTUS, except we had a double integral
instead of a single integral. So these are iid standard normal. So, we could write down the 2D LOTUS here,
and try to do that integral. And because they're iid,
the joint PDF of Z1 and Z2 is just the product of
the two marginal PDFs. And well,
we could just try to do that integral, and we could probably get it with some effort. But that's not a good way to do this
problem, it's better to stop and think about the structure of the problem,
okay? So in the case of the uniforms,
we never particularly studied the properties of the difference of two uniforms. On the other hand, the difference of normals is
something we've talked about before. So instead of jumping right into
this two-dimensional thing, let's see if we can actually
simplify the problem first. So in fact, we've mentioned before that
the sum of independent normals is normal. We haven't proven that yet, but we have all the tools to
be able to prove that now. So let's just do that quickly
to verify what I said before about the sum of normals,
so just a little theorem. This is gonna be easy now,
because we know MGFs. The sum of normals, so
we stated this before. If X is, let's say N(mu 1,
sigma 1 squared), and Y is N(mu 2, sigma 2 squared) and
they're independent, X has to be independent of Y,
otherwise this won't work. Then the sum, we talked about this before, by linearity the means just add, and also the variances add. And we talked about the fact
that if we took a difference, we would take the difference of means. But we would still add the variances,
not subtract. Because if this were -Y, you would
just think of it as plus -Y, okay? So anyway, let's just prove this fact now,
which we haven't done yet, and this is just an easy MGF calculation. So we just use the MGFs. So let's get the MGF of X + Y. Since they're independent, we talked about the fact that we can just multiply the MGF of X times the MGF of Y. The MGF of a normal, well, we derived
the MGF of a standard normal before. But it's very easy to get from
a standard normal to any normal, right? If we do this thing, mu + sigma z, we can
immediately get the MGF of any normal. And that's just gonna be e to the mu 1 t plus one-half sigma 1 squared t squared, that's the MGF of X. We multiply by the MGF of Y, which is the same thing, you just change the subscripts: e to the mu 2 t plus one-half sigma 2 squared t squared. Now let's just write this as one exponential and factor. So that's e to the (mu 1 + mu 2) t, just factor out the t, plus one-half (sigma 1 squared + sigma 2 squared) t squared, this is all up in the exponent. Okay well, I ran out of space on this
board, but that's the end of the proof. Because all we have to do is just say,
look, that's the MGF. I have a little more space,
that's the MGF of N(mu 1 + mu 2, sigma 1 squared + sigma 2 squared),
All right, so since the MGF determines the distribution, then that's the end,
we don't have to do anything else. So, it's a very easy calculation using MGFs.
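Just to record the board computation in one place (nothing new here, this is the same argument):

```latex
M_{X+Y}(t) = M_X(t)\,M_Y(t)
           = e^{\mu_1 t + \frac{1}{2}\sigma_1^2 t^2}\, e^{\mu_2 t + \frac{1}{2}\sigma_2^2 t^2}
           = e^{(\mu_1+\mu_2)t + \frac{1}{2}(\sigma_1^2+\sigma_2^2)t^2},
```

and the right-hand side is exactly the N(mu 1 + mu 2, sigma 1 squared + sigma 2 squared) MGF.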
Okay, so now that we've proven that fact, when we see this thing Z1 - Z2, rather than jumping into the 2D LOTUS,
let's just say, what is that? Well note that Z1- Z2 is N(0, 2), just add the variances. So really all we're asking is for
the expected value of the absolute value of an N(0, 2). Now when we say N(0, 2), let's once again think about that
as location and scale, right? We could take a standard normal, and
multiply by the square root of 2, and that would give us variance 2. So the easiest way to think
of this is as square root of 2 times Z,
where Z is standard normal, right? That's just the scale,
that gives it variance 2. Now this is just square root of 2 times E|Z|. Now it's just a one-dimensional LOTUS. And this is a LOTUS that
you've actually seen. If you studied strategic practice five,
we did this. But whether you remember
ever looking at that or not doesn't matter, this is an easy LOTUS. Whereas here, you have to do a double
integral, here I just write down LOTUS. So I'll do this quickly,
cuz it's on the strategic practice, it's just writing down LOTUS: the integral from minus infinity to infinity of |z| times 1 over root 2 pi e to the -z squared over 2 dz. And notice that this is an even function. That is, if we replace z by -z,
we get the same thing. So we can just multiply by 2 and
go from 0 to infinity. And once we go from zero to infinity,
we can drop the absolute values. Then it's just z e to
the minus z squared over 2. That's a really easy
u-substitution integral, right, cuz you can just let u equals z squared,
or u equals z squared over 2 if you like. And then you get exactly what you want,
so that's then an easy integral. And if you simplify it,
you get square root 2 over pi, which should be an easy calculation. It's also on the strategic practice, so I
won't write out more of that calculation. So then that becomes just
a simple one-dimensional LOTUS, that's a much better way to think of it. All right, so just an example that you
don't always have to jump into the 2D LOTUS, just cuz you have this
function of two variables. Okay, so that's a continuous example. I wanted to do some more discrete stuff.
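As a quick aside (not something we did in lecture), you can sanity-check that answer numerically: the claim is E|Z1 - Z2| = square root of 2 times square root of 2 over pi, which is 2 over square root of pi. A minimal simulation sketch, with an arbitrary seed and sample size:

```python
import numpy as np

# Simulate E|Z1 - Z2| for iid standard normals and compare to the exact value.
rng = np.random.default_rng(0)          # arbitrary seed
z1 = rng.standard_normal(10**6)
z2 = rng.standard_normal(10**6)

print(np.mean(np.abs(z1 - z2)))         # simulation estimate
print(2 / np.sqrt(np.pi))               # exact value, sqrt(2) * sqrt(2/pi)
```

Both numbers should come out around 1.13.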
In particular, I wanted to introduce the multinomial distribution, which is by far the most important discrete multivariate distribution, and I'll tell you what multivariate distribution means. So this is gonna be
called the multinomial. A multivariate distribution just
means that's a joint distribution for more than one random variable, right? So we have all these normals and
Poisson and geometric, and so on. Those are all univariate distributions,
cuz we have one random variable. Now we're working with more than
one random variable at once. And for this course, there's really only
two multivariate distributions that you need to know by name. One is the multinomial,
which we are about to do. The other one is the multivariate normal, which is the generalization of the normal distribution to higher dimensions, and we'll get to that one later, okay? So the multinomial, as the name might suggest, is a generalization of the binomial, right? Bi becomes multi, okay? So it's like a higher dimensional
version of the binomial, and let's just introduce it by its story. So this is the definition and
story, Of the multinomial, which I'll sometimes just
abbreviate to Mult(n, p). It has two parameters,
n and p like the binomial, except in this case,
this p is actually a vector. So p is a vector, let's say (p1, ..., pk), where we assume that that's a probability vector. And by probability vector, all I mean is that these are nonnegative and add up to 1. Cuz we're gonna think of them as
probabilities for disjoint cases, so that encompasses all possibilities. So we want pj greater than or equal to 0, and the sum of all pj's = 1. That's the assumption, okay? So the binomial would just be
if this is one dimensional and then we just have binomial np,
but now we have k of them. So the intuition is that in the binomial, we just talked about success and
failure, right? There are two possible outcomes,
there are two categories. Multinomial means instead of two
categories, we have k categories, okay? So it's a natural extension, right? And binomial, we have to classify
everything as either success or failure for each trial. Here we have more than two possible results, okay? So we say that X is Multinomial(n, p). We think of that as saying that, in this case, X is also a vector; this is a multivariate distribution, so X = (X1, ..., Xk). So like in the binomial,
we have n independent trials. But I'll just call them objects
instead of trials and each object, objects could be people, could be trials,
could be anything, so just very general. We have n objects that we
are categorizing, okay? We have n objects, which we are
independently putting into k categories. So there are k possible categories, and
the binomial is just success or failure, but now we have k categories. And for each object, it's independently determined which category it falls into, okay? Just like in the binomial,
we had independent Bernoulli trials. And if Pj is the probability of category j, by probability of category j, I mean that any one of these objects is in category j with probability Pj. And we interpret Xj as just the count, the number of objects in category j. All right, so that was a lot of writing,
but the concept is really simple. We just have n things that we're breaking into categories, and then we just see how many objects
are in each category, right? So it's very natural, you can make up
as many examples of this as you want, really easily, right? Just anytime you're classifying things into categories. It's very, very general.
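As a little aside, here's a sketch of that story in code (not from the lecture; the numbers for n and p are made up): put n objects independently into k categories with probabilities p, then count how many land in each category. That categorize-and-count procedure is exactly what a multinomial sampler does directly.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
n, p = 20, np.array([0.5, 0.3, 0.2])            # k = 3 categories, made-up probabilities

labels = rng.choice(len(p), size=n, p=p)        # which category each object falls into
counts = np.bincount(labels, minlength=len(p))  # X = (X1, ..., Xk), the category counts
print(counts, counts.sum())                     # the counts add up to n

print(rng.multinomial(n, p))                    # same distribution, sampled directly
```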
Okay, so let's find the PMF. This is gonna be a joint PMF, cuz it's a joint distribution. So we want the probability that X1 = n1, blah, blah, blah, Xk = nk, right? That's a joint PMF, we just need to say
what's the probability that there are n1 objects in the first category and
n2 in the second category and so on? And we can immediately write
down the answer just by thinking back to how we derived the binomial PMF. All we have to do is imagine
any particular sequence, it's gonna be P1 to the n1, P2 to the n2,
blah, blah, blah, Pk to the nk. Just to have a little
intuitive example in mind, let's just suppose this is very
similar to how we did the binomial. But just to quickly review and
generalize that. Suppose we just have three categories, just to have a little
mental picture in mind. We have three categories, and let's just say our sequence, let's just write 1, 2, 3, where 1 means category one and so on. So we might have a sequence like 23311112,
for example, okay? So let's put a couple more 2s, so that there are four 2s, two 3s, and four 1s, for example. This says that the first
object is category 2, right? We're just categorizing the objects one by one. So for any particular sequence like this, the probability would be P1, the probability of category one, raised to the power of how many 1s there are, right? I need to put another one there. Then P2 to the power of the number of 2s and
so on. That will be the probability
of any specific sequence that has the desired counts,
right? But then we can permute
this however we want, then it's just going back
to those counting problems. How many ways are there to permute
the letters in the word pepper, or the letters in the word Mississippi or
something like that. Where you start with n factorial, but that overcounts because the twos
could have been in any order. The threes could have been in any order, the ones could have been in any order,
and so on. So you have to adjust for
that overcounting. Exactly like we did for the binomial,
so we just divide by n1 factorial, n2 factorial, blah, blah,
blah n k factorial to account for all the ways you could permute the 3s,
permute the 1s, permute the 2s. Of course, there's a constraint here, this is if n1 plus blah,
blah, blah plus nk = n. Otherwise, it doesn't make sense, right? Cuz we have n objects. We're assuming that every object
is in exactly one category. So it wouldn't make sense if
we added up these counts and they had too many or
too few, makes no sense. So it's 0, otherwise. That is if the sum of
these n's is not this n, then it's impossible, so it's 0. So that didn't require a calculation. It just required thinking about an example like that and counting the different ways to permute things, right. So that's the joint PMF.
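Written out compactly, the joint PMF we just derived is:

```latex
P(X_1 = n_1, \dots, X_k = n_k) \;=\; \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k},
\qquad \text{if } n_1 + \cdots + n_k = n,
```

and 0 otherwise.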
It looks a lot like the binomial, all right. So it's a generalization of the binomial
when you have more than two categories. So we'll come back to some other
properties of the multinomial later, but just to do a couple quick
properties to think about. We could ask about the marginal
distribution, conditional distribution, things like that. So let's think about
the marginal distribution first. Okay, so we're letting X be Multinomial(n, p). Sometimes I'll subscript a k,
just to indicate what the dimension is, so the number of categories. And suppose we want the marginal,
find the marginal distribution of just one of these components,
let's say Xj. So Xj is just how many people or
how many objects are in category j. We want its marginal distribution. What do you think that is? Yeah, binomial, why did you say binomial? Exactly, each object is either in that category or it isn't. So I mean, if
you look at your notes, how do you get from joint distribution
to marginal distribution? I would say if you take this thing and do k- 1 sigma sign sum over all
possible things, do a lot of algebra. But that's not thinking about it, right? To marginalize we'd sum up the joint or
we integrate in the continuous case. We sum in the discrete case,
sum of everything we don't want, okay? But instead let's just think about
the story, think about what it means. As you just said, each of these objects,
either it's in category j or it isn't. We're assuming they're
all independent trials. So if we define success to
mean being in category j, the probability of success is pj for each object. So that's just immediate. I didn't write a justification for this, but it just follows from the story; it's a completely valid proof to just say it's binomial because we have independent Bernoulli trials, and that's the probability of success, okay?
particular that also gives us the mean and the variance without having to
do a calculation: E(Xj) = npj. And the variance, because we derived the variance of the binomial before, we don't need to re-derive that. We already know the variance of a binomial is np(1 - p), so this is npj(1 - pj), no additional work needed because we know it's binomial. Okay, so that's just immediate from thinking about what this means. So that's one property, that's the marginals.
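As a quick numerical aside (not from the lecture), you could check the marginal this way; n, p, the category index, and the number of simulation draws below are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
n, p = 50, np.array([0.2, 0.5, 0.3])           # made-up parameters
draws = rng.multinomial(n, p, size=10**5)      # each row is one multinomial vector

xj = draws[:, 1]                               # counts in category j = 2 (index 1)
print(xj.mean(), n * p[1])                     # should match n * p_j
print(xj.var(),  n * p[1] * (1 - p[1]))        # should match n * p_j * (1 - p_j)
```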
Now let's think about something kind of similar. Let's call this, well, I call this the lumping property. The question is,
we have all these categories, well what happens if we decide to merge
certain categories together, right? Okay, so just to have an example in mind, let's let K = 10, so
we're thinking of X as a vector. X1 through X10 and
just to have a concrete example in mind. So this is multinomial, let's say this
is multinomial, and, P1 through P10. And to have a concrete example in mind, well let's imagine we're in a country
that has ten political parties. Okay, and you take n people and
assume that the people are independent of each other, and
you wanna know how many people are in each party, and assume that everyone in this country is
a member of one of these ten parties. Okay, and then you take all
these people and you say, okay. Ask each person which party they're in. X1 is the number of people in
the first political party, X2 is the number in the second one,
and so on, right? So that that would be multinomial if these are the probabilities of the
different party memberships, all right? So now, what I call the lumping property
is what if it's a country where
all the other parties are much smaller? And so it might be kind of unwieldy to
deal with this ten dimensional vector. Maybe we wanna compress all the third parties, so suppose that the first two are kind
of the two dominant major parties and the rest of them are kind of minor, so
we may wanna just lump them together. So that's why I call it
the lumping property, lump all the other parties together. So what if we considered,
let's see, let Y = (X1, X2, X3 + blah, blah, blah + X10), where we group all these other ones together, so I'll just add them up, right. So this would be like party one,
party two, and then the other parties grouped together. Without doing any calculation or
algebra whatsoever, we can immediately write
down the distribution of Y. Y is just gonna be multinomial. Same n. And then all we've done is group
these categories together, but then it's the same problem again, it's just that the last category has a larger probability; you just lump together all those p's. Okay, so this should be
obvious from the story, right. It's the same problem again. So just like we emphasized with
the binomial we can define success and failure however we want. Here we can rearrange the categories and
whatever, the only thing we need to make sure of is that each
object is in exactly one category. So it would not work if you could
be in more than one category or be in no categories. But if you define your
categories such that it's true that each object is in exactly one, then you get a multinomial. We didn't need to do any algebra or
calculus to show that. So that's pretty nice.
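Here's a small simulation sketch of the lumping property, again just as an aside; the party probabilities are made up. Lumping the sampled counts and sampling from the lumped multinomial directly should behave the same way.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
n = 100
p = np.array([0.35, 0.30, 0.10, 0.08, 0.06, 0.04, 0.03, 0.02, 0.01, 0.01])  # made up

draws = rng.multinomial(n, p, size=10**5)       # Mult_10(n, p) samples
lumped = np.column_stack([draws[:, 0], draws[:, 1], draws[:, 2:].sum(axis=1)])

direct = rng.multinomial(n, [p[0], p[1], p[2:].sum()], size=10**5)  # Mult_3 directly
print(lumped.mean(axis=0))                      # both should be about n * (p1, p2, p3+...+p10)
print(direct.mean(axis=0))
```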
Similarly, let's get the conditional distribution. So again, X is multinomial. What if we want the conditional distribution where we get to learn what X1 is, and
of the rest given that we now know X1. So we want a conditional, you might call it a conditional
joint PMF because you're given X1. Let's say that we're given that X1 = n1,
okay? And then we want the conditional joint
distribution of everything else. So we know exactly how many
people are in the first category. But we don't know about the rest of them. Well, given that X1 = n1, we want the joint PMF of
the rest X2 through Xk. Still gonna be multinomial, but we have to be a little bit careful
with getting the parameters right. So now this is gonna be k- 1 dimensional, cuz we know how many people
are in the first category, but we're looking at the remaining
k - 1 categories. And the number of people, well, n1 have been allocated into the first party, okay? So we have n - n1 people left. And then we just have to get
the probability vector, right? Now if we just wrote p2 through pk,
that would be a common mistake, but it should be easy to see that that's
a mistake because those don't add up to 1. So it can't just be p2 through pk, right? I'm imagining that I've taken, and
it doesn't matter which people. I can imagine,
I'm conditioning on the count. But then I could further condition on
which specific people are in category one, and then use symmetry. So I guess, so I may as well just assume
that the first n1 people are in category one, okay, but to get these ps, well, then
we have to think conditionally, right? So let's call this vector, let's call it p2 prime through
pk prime where somehow we have to figure out what's p2 prime and
so on. Because without the primes, it doesn't add up to one, so it makes no sense. So let's find p2 prime, for example.
be proportional to P2, right? Cuz I know how many people
are in the first party, but that shouldn't kind of affect the relative
distribution of the rest of the parties. So basically you just have to renormalize. If I want to write that out mathematically, I would say p2 prime equals the probability of a random object being in category 2, given that it's not in category 1. Because we've already thrown out
the ones that are in category 1. So just by the definition of conditional
probability, being in category 2 I take the intersection of this and this,
but once you say you're in category 2, you know you're not in category 1,
so that's redundant. So the numerator is just P2. And the denominator is 1- P1, that is just the probability
of not being in category 1. Or we could also write it as P2
over P2 + blah, blah, blah + Pk. And similarly for the other ones, Pj prime equals Pj over P2 + blah, blah, blah + Pk. All this says is that we've renormalized, and it's still multinomial.
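Putting that conditional statement in one line:

```latex
(X_2, \dots, X_k) \mid X_1 = n_1 \;\sim\; \mathrm{Mult}_{k-1}\!\left(n - n_1,\; (p_2', \dots, p_k')\right),
\qquad p_j' = \frac{p_j}{1 - p_1} = \frac{p_j}{p_2 + \cdots + p_k}.
```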
Okay, so multinomials have really nice properties like this, and you can see these things just by thinking about
what it means without doing a calculation. So that's a very useful distribution for
lots of applications. Okay, so we'll say more about
the multinomial in the next lecture or the lecture after. But I wanna do one more
continuous example as well, an example where we actually
do need to do a calculation. And this is another kind of famous one, a good example of how we work with joint PDFs, which I think we need more practice with
or at least you need more practice with, then I'll try to help with that,
so this is a good example. I call this the Cauchy Interview Problem. I call the Cauchy Interview Problem
not because Cauchy asked this as an interview problem,
call it that than the Cauchy Problem. But actually for some reason, this doesn't seem like it should
be a common interview problem, but I've actually seen this on several
occasions asked as an interview problem, just I think to test whether you can work with joint PDFs and things like that. So okay, it is an interview problem, though it sort of
shouldn't be in some sense. Anyway, I have to tell you what
the Cauchy, I mean Cauchy was a famous mathematician, but in this context, Cauchy
is referring to a specific distribution. The Cauchy distribution. It's a famous distribution
that has a lot of kind of weird, scary properties. I just got some of these distribution
plushies that I found online. I might bring them, but they're
a little bit small to show you here. But if you come to my office hours,
you can see them in my office. But little pillows illustrating different
distributions, I have them in my office. The Cauchy is called the evil Cauchy and it looks pretty evil. And so let me first tell you
what the distribution is, and then tell you a little
bit about why is it evil. And then we'll try to find its PDF,
which as I said, has been a common interview problem,
find the PDF of a Cauchy. That's the problem, okay? So the Cauchy is
the distribution of let's say, X over Y with x and y iid standard normal. So it's a simple definition, just take a ratio of two iid standard
normals and we call that a Cauchy. And you can see why that could
be a useful distribution for a lot of different applications where
ratios are a pretty natural thing. So that's a Cauchy, and
the problem is find the PDF. Of this random variable, okay? Let's call this thing T. Find PDF of T. So we're defining T to be
the ratio of iid standard normals. We want to find its PDF. All right, so that doesn't yet
answer why this is evil. Well, some properties of the Cauchy
that we're not gonna prove right now, but just to kind of foreshadow
why is this thing so evil. First of all,
it does not have an expected value. If you try to compute e,
expected value, it'll blow up. No, no, no, that's not that evil. There are a lot of distributions where if
you try to compute the expected value, it blows up. So it does not have a mean,
it doesn't have a variance. The thing that's really evil about the Cauchy is that if you take iid Cauchys, so let's say we don't just have T, we have T1 through Tn, which are just iid ratios of normals. When we get to the law of large numbers later in the course, we'll see that when we average a bunch of iid random variables, the average should be close to their mean, right? You average a lot of iid things
that should be close to the mean. In this case there is no mean. But the weird fact is that if you average all these iid Cauchys, the distribution of that average is still Cauchy; averaging doesn't change the distribution. You can average a million iid Cauchys and it's still gonna be Cauchy. So in some sense that's kind of evil: you're hoping that as you collect more and more data, you converge to the truth in some sense. In this case, if all you do is average, then you're just not getting anywhere, the distribution doesn't change. Now, if you had Cauchy data, there
are other ways to work with it. It would be a bad idea to just naively average everything; there are other things you could do. Okay, so
anyway, that's the Cauchy distribution. Now let's find the PDF, just for
practice with our joint distributions. And there are several
ways we could do this. One way would be to use
the law of total probability, and condition on y to make things easier. And that's a perfectly good way to do it. But I think I wanna just start by
practicing just more directly how to just directly get the CDF. Let's find the CDF, and
take the derivative and get the PDF. So with the CDF we could use
the law of total probability, but let's just directly write down. It's going to be a double
integral because we have an X and a Y and let's just write down that double
integral and see if we can do it, okay? So let's find the CDF. So the probability that
x over y is less than or equal to some number, t,
that's what we need for the CDF. This is an event, it's an event that the ratio
is less than or equal to t. We want to find some probability of
an event where it's based on x and y, so unless we can think
of some clever trick for simplifying this we basically
have to do a double integral. Or else, we can use the law of total
probability and do a single integral, but I actually don't think
that's any easier here. So my first impulse would be to
multiply both sides by y here. But you have to be careful in doing
that because y could be negative, so we can simplify this
a little bit by using symmetry first, and putting absolute values. This follows from the symmetry
of the normal. And you can think through for yourself
exactly how I'm using symmetry here, but the basic idea is with the normal. If I have a standard normal and multiply
it by minus 1 it's still standard normal, if I multiply it, if I randomly chose say
with probably one half multiply it by minus 1, probably one half do
nothing It's still standard normal. Have the same symmetry in the denominator,
so sort of have two symmetric things. And we might as well just kind
of absorb the plusses and minuses and write it this way,
follows from symmetry. The reason I wanted to do that is just so
that I could write this as x less than or equal to t absolute y, without having to flip the inequality or worrying
about whether the inequality flips. Now let's just write this down
as a double integral, okay? We can do either dx dy or dy dx, but let's suppose that we are doing dx dy. And to get a probability, well, what we do is we integrate the joint PDF over whatever region we want, okay? So y goes from minus infinity to infinity. And the main thing, again, to be careful
about, is the limits of integration. For x, the inner limits can depend on y, and we're looking at the region
that goes up to t absolute y. So x goes from minus infinity
to t absolute y, and then what we're integrating
is just the joint PDF, right? So the joint PDF is 1 over root 2 pi,
e to the minus x squared over 2. And then same thing for the y,
1 over root 2 pi e to the minus y squared over 2, because they're iid standard normal. So the other term, e to the minus y
squared over 2, doesn't depend on x. So I could write it here but
I could immediately then pull it out here. So I may as well write it here so
that it's not interfering with this part. So it's e to the minus y squared over 2, and there's another 1 over root 2 pi, which just sticks over there. So all I did is write down the normal PDF for x and the normal PDF for y, and I pulled out the y part cuz that doesn't depend on x. That looks pretty ugly so
let's see if we can do it, well, one thing that we could simplify is just
recognizing what do we actually have here. So we have this integral,
minus infinity to infinity, e to the minus y squared over 2, and
then we have this inner integral. Okay?
Now in one sense we can't do this integral. Because that's the normal PDF and you can
prove that you can't do that integral. And in another sense,
not only can you do that integral, you already know what
that integral is right? That's just capital phi evaluated here,
that's just the normal CDF. So actually it's just phi,
so depending on whether you consider that doing the integral or
not, it's just that, dy. That's just the definition of
the standard, normal CDF, okay? Now these absolute value signs
are a little bit annoying. So, let's notice that we
have an even function, because y squared, absolute value y,
this is an even function. So we may as well go from 0 to
infinity instead, and multiply by 2. So then we'd have a square
root of 2 over pi. I just multiplied by 2, and then we're going from 0 to infinity, e to the -y squared over
two capital phi of ty dy. All right, and then you know, the clock
is ticking on our job interview, and we've gotten to this point and it's sort of possible we start to panic. And that capital phi is
an intractable integral, that's why we call it capital phi,
it's cuz we couldn't do it. Now, you are being asked in
your interview to integrate an integral that you can't do, which sounds pretty bad. However, one thing that might help
is that on the interview, we were asked to find the PDF,
not to find the CDF, that's the CDF. And we know that the PDF is
the derivative of the CDF, so the PDF is the derivative of the integral
of an integral that we can't do. So somehow maybe that will save us. So let's take the derivative. So here's the PDF,
PDF is the derivative of the CDF. This thing is capital F(t),
if we call the CDF capital F. The PDF is the derivative, F'(t). So we're taking the derivative
with respect to t, not with respect to y
which would make no sense. Notice that this y is a dummy variable,
okay? This is a function of t we're taking
the derivative with respect to t. Okay, now there's a theorem
in calculus that says, under some pretty mild conditions, if you
have a reasonably well-behaved thing that you're integrating, you can exchange
the derivative and the integral. This is a very,
very well behaved function. Capital Phi is just a continuous
differentiable thing between 0 and 1. And e to the -y squared over 2, that's infinitely differentiable. It decays to 0 very fast, so
this is a very, very nice function. So there's gonna be no technical problem
whatsoever with swapping the derivative and the integral. We're gonna take the derivative
of this with respect to t, and then we're gonna try to simplify it. So we take the derivative,
bring the derivative inside, okay? So we have the integral 0 to infinity,
e to the -y squared over 2. We're differentiating with respect to t,
we're bringing in a d/dt. So we're treating e to the -y squared over 2 as a constant, since we're differentiating with respect to t. Then we take the derivative of
capital Phi of ty, by the chain rule, y is gonna come out, because we
are differentiating with respect to t. So y is going to come out
from the chain rule, y. And then we just need
the derivative of this, but the derivative of
the standard normal CDF is the standard normal PDF,
which is 1 over root 2pi, e to the -z squared over 2
in general where z is ty. So it's e to the -t squared,
y squared over 2; I just squared this thing and divided by 2, dy. Now let's see if we can do it. So the square root of 2 here
cancels this square root of 2. We have square root of pi, square root
of pi, so we're gonna get 1 over pi. And then we just need to integrate from 0 to infinity of y e to the -t squared y squared over 2 dy. Now this looks like an integral we can do. >> [INAUDIBLE]
>> The other what? >> [INAUDIBLE]
>> Say that again. >> [INAUDIBLE]
>> There's another e to the -y squared over 2. Yeah, I forgot that one, thank you. There's another,
we'll just combine that one with this one. So that would be 1. Uh-oh, I guess I don't get hired, that's sad. There's another e to the -y squared over 2 that I forgot. But now, thank you, I put it back, okay? I haven't interviewed for any jobs since I came here five years ago, so I'm kinda rusty. So I put back the e to the -y squared over 2 that you helped me with, and now that should be okay, right? Now, this is an integral we can do, because we know that the derivative
of y squared is gonna be 2y, and that's gonna be taken care of there,
now it's an easy u-substitution again. So we can just let u = (1 + t squared) y squared over 2. Just make that substitution, so then this just becomes e to the -u, okay? So du = (1 + t squared) times, now we're treating t as a constant again, we're changing the variable y, transforming it to u. So the derivative of y squared over 2 is y, so we have y times (1 + t squared) dy. So we have the y dy, we're just
missing the 1 + t squared, okay? So I'll just multiply and
divide by 1 + t squared. Then we're just integrating e to the -u
du, which is a very, very easy integral. We know that that's 1,
either just by doing it or because it's the integral of
the exponential PDF again. Okay, so then we immediately have the answer, 1 over pi (1 + t squared), for all t. So that's the PDF.
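Putting the whole calculation together in one line (with the term we initially forgot included):

```latex
f(t) = F'(t)
     = \sqrt{\tfrac{2}{\pi}} \int_0^\infty e^{-y^2/2}\; y\; \tfrac{1}{\sqrt{2\pi}}\, e^{-t^2 y^2 / 2}\, dy
     = \frac{1}{\pi} \int_0^\infty y\, e^{-(1+t^2) y^2 / 2}\, dy
     = \frac{1}{\pi (1 + t^2)}.
```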
If we wanted the CDF, all we would have to do is integrate this, and then it's gonna be some arctangent thing. All right, so that's the Cauchy.
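As an aside (not part of the lecture), here's a quick simulation sketch, with an arbitrary seed and sample size, checking that the ratio of two iid standard normals really behaves this way; the CDF corresponding to this PDF is the arctangent thing, F(t) = 1/2 + arctan(t)/pi.

```python
import numpy as np

rng = np.random.default_rng(0)              # arbitrary seed
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)
t = x / y                                   # T = X/Y, which should be Cauchy

for c in [-2.0, 0.0, 1.0, 3.0]:
    empirical = np.mean(t <= c)             # P(T <= c) from the simulation
    exact = 0.5 + np.arctan(c) / np.pi      # integrating 1/(pi*(1 + t^2)) up to c
    print(c, empirical, exact)
```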
And let me just quickly show you how you would start the other method, which uses the law of total probability. I'm not gonna do the whole thing,
because at some point, that's just gonna reduce
back to this method. But just to show you
what it would look like. Just as a quick alternative without
going through the whole thing, cuz it's gonna be similar. But it's useful to have both methods. So this would be the method using the
double integral, okay, and that's the PDF. Which by the way,
we should check that that's a valid PDF. Does it integrate to 1? Well, if you integrate that thing,
you'll get an arctangent thing. And you can check that when you
evaluate the arctangent thing, you will get 1, okay? So just quickly, the alternative
using the law of total probability. x less than or
equal to t absolute value of y. You kind of just think to yourself,
what do we wish that we knew here? We could decide to condition on x, or
we could decide to condition on y. This is gonna be the integral,
let's say we condition on y. The probability x less than or equal to t, absolute value of Y,
given, let's say, Y = y. This would be the law of total
probability, right, just conditioning. We can choose whether to condition on x or
to condition on y, but I think I wanna condition on y. Okay, law of total probability,
remember in the discrete case we've seen, we sum over all cases, P of A given B times P of B, whatever, where we have a partition. And in this case, we're integrating instead of summing, so we're conditioning on y and then we're multiplying by phi of y, where lowercase phi is the standard normal PDF. All right, well,
let's see if this helps at all. This is saying to treat Y as just
being known to equal little y, okay? So I can plug in little y there. And then the tricky part here is that we need to use the fact
that X is independent of Y. Because if X were not independent of Y,
you could plug this thing in, but then you still have this condition, okay? But since they're independent,
you can plug in Y = y and then get rid of the condition,
because they're independent. So when we do that,
that's just gonna be phi. The probability that x is less than or
equal to t absolute value of y, is just capital Phi of t absolute value of y, just by definition, right? Because we're plugging in y, that's just the probability of X less than some constant. It's just the standard normal CDF evaluated there, times phi of y, dy, which I think is
the same as the integral we had. Does that look the same? Yeah, so over there, I just wrote out
what this is, but it's the same thing. And then proceed in the same way. So that would be a second way to do this. We'll see a third way later on, just because this is
a common interview question. So it's good to have three, or at least more than two, ways to do it. All right, so I'll stop for now. I'll see you Wednesday.