Today is a pretty sad day for Stat 110
because I just saw in the news that a judge in the UK has ruled against
the use of Bayes Rule in court. So at least it was not in the US. It was in the UK. I posted a link on Twitter. Well, that's a very disturbing case. I haven't read the full legal ruling yet,
but I intend to. But my impression is that
the judge kind of didn't like it. Maybe, and I haven't seen all the details of the case, the judge had a valid point, in that the so-called expert witness was just estimating some probabilities and throwing them into Bayes Rule, and as with anything else, it's garbage in, garbage out. If you put garbage probabilities into Bayes Rule, you get garbage out. But it sounds like he wrote the opinion
in this case in a kind of sweeping way that others could use as
a precedent against using Bayes Rule. So normally I don't wanna inject politics
in this class, especially not British politics, but I strongly oppose that and
I think that that should be overturned. So if any of you have any connections
in the British government or anything, maybe we can
do something about that. So, sorry I had to bring you that bad
news, but at least it's not a US case. So to cheer us up, let's talk more
about universality of the uniform. >> [LAUGH]
>> So, we were doing that at the end last time. I proved the theorem, but
I did not do an example yet. And it probably looked a little
bit mysterious last time. I mean, the math is perfectly correct and
you could review what we did last time. And hopefully, you can follow every step. And if you were confused by the proof,
you should review that. But it's a proof, so
you can't argue with it. But that doesn't explain what it really means, okay? So I'm gonna talk a little
bit more about universality. And a couple other odds and ends and then
we'll get to the normal distribution which is the other key continuous distribution
that we need at least before the mid term. So just to remind you of the statement,
universality of the uniform. We proved it last time,
I'll just remind you quickly what it said. It said, well that, let capital F be a continuous, strictly increasing CDF. So this is a little bit unusual
compared with what we're used to, because we've mostly been starting with
a random variable, then finding the CDF. But we did talk about the fact that any function that's right continuous and
increasing, and goes to zero as you go to the left, to minus infinity, and goes to one as you go to the right, to infinity, is a valid CDF. So here we're starting with the CDF,
not starting with the random variable, and I strengthened the assumptions just to make it easier to prove; this result can be generalized. I assumed it was continuous, remember in general we just need right continuous, and I assumed it was strictly increasing, which just makes things nicer, not
having to deal with flat regions. But you know in general a CDF could
be flat and then increase, flat and then increase. Here I'm just assuming it keeps increasing strictly. Those are the assumptions, and
then the statement of the theorem is that if we let X = F inverse of U, that's the definition of X, then X is gonna have CDF F, if U is uniform 0,1. So that's the reason I call this
universality cuz you just start with any uniform 0,1 random variable,
and at least in principle, we can synthetically create a random
variable with any distribution we want. So this is useful for simulation, right? It's saying, if we're interested in simulating random draws from the distribution with CDF F, one approach to doing that is to get some uniforms, which are typically easier to generate than other continuous distributions, then compute F inverse of U, and
then we're done. In practice, in some cases this is very easy. In many cases, it's gonna be very hard to
analytically write down this F inverse. But at least in principle, this is saying
from the uniform you can get everything. So that's conceptually quite nice. This is also the fundamental reason why
remember we talked about these three properties of the CDF, but we didn't
prove the fact that those three properties are enough to describe a CDF. If we extend this a little
bit it's saying that if we have any function F that satisfies
the properties of a CDF, then it is a CDF. Okay so I wanted to do an example and talk a little bit more about
the intuition behind this. We proved it last time. There is kind of a flip side to this. So that's statement one, there is another way to write this which
is kind of going the other way around. Here we started with a uniform 0,1. We computed F inverse of U, and
we claim that that has CDF F, okay. What if we want to go the other way around, and we already have X? So if we start with X and
we don't have a uniform yet, we just have X which has CDF capital F. And then, just to kinda get an idea of what's going on here, just apply F to both sides; it would say F(X) = U. So we're going in the opposite direction.
to F then if you compute F(X) that's gonna be
distributed as uniform 0,1. And this looks very mysterious to most
people the first time they see this, okay? Because there's something beautifully and
bizarrely self referential about this. What did we do here? We took a random variable X and
we plugged it into its own CDF. So that sounds like kind of
a strange thing to do, okay, but first make sure that it makes sense
to you, what does that actually mean? What it means is simply
X is a random variable. We've talked before about the fact
that a function of a random variable is a random variable. F is just a function, so a function of
a random variable is a random variable. So that's a perfectly
valid random variable. That doesn't explain why we
wanna plug X into its own CDF, but we can do it if we want. That's a random variable. And the theorem says that that's
actually gonna be a Unif(0,1). Notice, of course, that a CDF takes values between 0 and 1, so we know that this is gonna always take values between 0 and 1. That doesn't show it's uniform, but at least it shows it's going
from 0 to 1 which makes sense. So, notice also that there's some notational difficulty that
you could run into here. I just wanted to warn you about. Remember the CDF F(x) = P(X is
greater than or equal to x). So if you kind of just
blindly plug in X here, you would get F(x) = P(X < or equal to X). But, so X less than or
equal X, that's an event, but that's an event that always happens. All right? You don't need any kind of
not insider information or to do anything to know that it's true. Big X is less than a little big I
actually didn't have to tell me that that's actually one. But that would say that F of capital
X equals one which is not what we're trying to say this is so
this step is actually wrong. And it looked very natural to
just plug in capital X, okay? So that's why you have to
interpret this carefully. The interpretation is best
seen through an example. So, let's suppose that F of little x equals 1 - e to the minus little x. That's a little x here, I'm just writing it out so you can see: F(x) = 1 - e to the -x for x greater than 0. This is an important CDF called the exponential distribution, that we will get to later. Then if x is negative we set it equal to 0. So this is continuous, and it's strictly
increasing on the positive side, it's just flat at 0 on the negative side,
but the same argument is gonna work. Then the interpretation would be F(X) = 1 - e to the -X, right? So that's very natural, right, I just change little x to big X. This is the intended interpretation. So the interpretation of this expression is that we first write out F as a function, of some variable, whatever it is. Write it out as a function, then replace little x by big X, and then you won't run into trouble. Here it's just a notational issue, but it can be confusing if you haven't thought it through carefully.
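Here is a minimal sketch of that direction in code, assuming Python with numpy (the tooling and the use of numpy's exponential generator are my choices, not part of the lecture): draw X from the Expo(1) distribution, plug each draw into its own CDF, and the results should look Uniform(0,1).

```python
import numpy as np

rng = np.random.default_rng()

# Draw many X ~ Expo(1), then plug each draw into its own CDF, F(x) = 1 - e^(-x).
x = rng.exponential(scale=1.0, size=100_000)
fx = 1 - np.exp(-x)

# If F(X) really is Uniform(0,1), the mean should be about 1/2 and
# about 30% of the values should fall below 0.3.
print(fx.mean())          # roughly 0.5
print(np.mean(fx < 0.3))  # roughly 0.3
```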
All right, so this result kind of just looks like a curiosity at this point, but the reason I'm talking about it is
first of all it's just good practice with thinking about the difference between
a random variable and a distribution. Just to take the effort to understand
what this really means is worthwhile. Secondly, this result is
quite useful in 111 and statistical inference and
I'm not going to get into that much because I don't
want to steal 111's thunder. But the basic idea is simply that X may have some complicated or
maybe even unknown distribution. And if we wanna kind of be
testing a certain model or things like that, the uniform is just a very simple, known, standard distribution, and so it may be useful to be able to reduce things to uniformity. And if we generate a lot
of instances of this and we find that they don't look uniform, then we conclude maybe there's something
wrong with the model, that kind of thing. So it's actually quite useful,
but right now, I'm just thinking of it more as
a kind of conceptually neat thing. Similarly this one here. Well I just think that's a beautiful
result because it says from uniform you can get anything,
which is pretty surprising. But it's also quite useful for simulation. So just as an example of how
to use this for simulation Let's use this distribution here
which I introduced over there. So, let's consider that CDF,
I'll just rewrite it, cuz it's easy,
1- e to the minus x for x positive. And that's called the exponential
distribution with parameter 1, which will be an important distribution for later,
but you don't have to know it right now. And suppose that we have access
to a uniform between 0 and 1, and we're interested in simulating
from this distribution. So we wanna simulate X which
is distributed according to F. Well, all we have to do
according to this result is compute the inverse of this function,
so F inverse of U. Well this is just like high school algebra
finding the inverse function, right? So if I set this thing equal to little
u and then this is u in terms of x, we just solve for x in terms of u, right? Then we would get minus log one minus u
just by taking the inverse function, okay? So, therefore the universality theorem
immediately tells us that F inverse of capital U,
which is -ln(1- U) has the distribution F. So if we were doing it on a computer and
we wanted ten random draws from this distribution we would just
generate ten iid uniforms and then just compute, this is just an easy
function, compute this ten times. And then we'd have 10 iid random
draws from this distribution, so that's how it would be useful.
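In code, that recipe is just a couple of lines. Here is a minimal sketch, assuming Python with numpy (the tooling is my choice, not something specified in the lecture):

```python
import numpy as np

rng = np.random.default_rng()

# Ten iid Uniform(0,1) draws.
u = rng.uniform(size=10)

# Push them through F inverse: for the Expo(1) CDF F(x) = 1 - e^(-x),
# solving u = 1 - e^(-x) gives F_inverse(u) = -log(1 - u).
x = -np.log(1 - u)
print(x)  # ten iid draws with CDF F
```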
While we're talking about this particular function, this is a good time to
mention something about 1 - u. 1 - u is also uniform 0,1. So if we want to, we could have just done minus log of u rather than minus log of 1 - u; that's an important symmetry of the uniform. And you can check this easily for yourself, it's just simple good practice with CDFs and
PDFs. But in terms of a picture, if we're picking u uniform between 0 and 1, let's say it's there. So u is the distance from here to here, and it's uniform. And 1 minus u is from here to here, right? But there is a symmetry going on so
why do we care if we are measuring from the left to the right
or from the right to the left. It doesn't matter,
we just have a random point, okay? So that's the intuitive symmetry, but
you should check the calculation too. In general that's gonna be
true that if we take a + bu, where a and b are constants and
u is uniform between 0 and 1. So then we're just doing linear
stuff to it, that will be uniform still on some interval,
whatever the appropriate interval is. So if we start with the uniform 0,1 and
we want uniform 0 to 10, just multiply by 10,
that's very straightforward. But an important common mistake to be aware of is that this is a linear transformation; if we do something nonlinear, then it's usually not
gonna be uniform anymore. Nonlinear usually leads to nonuniform. For example, if we squared u, you can
check that it will not be uniform anymore. And to check that, just compute the CDF, it doesn't look like the CDF of a uniform,
so it's not uniform. So you have to be careful; you can't just say, well, u squared is between 0 and 1, so therefore it's uniform. You have to actually check it, and in that case, if you check it, it's not true, okay?
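To spell that check out (this is just the one-line CDF computation the argument calls for, with b > 0 assumed in the linear case):

$$P(U^2 \le x) = P(U \le \sqrt{x}) = \sqrt{x}, \qquad 0 \le x \le 1,$$

which is not the Uniform(0,1) CDF (that would be x), so U squared is not uniform. By contrast,

$$P(a + bU \le x) = P\!\left(U \le \tfrac{x-a}{b}\right) = \tfrac{x-a}{b}, \qquad a \le x \le a+b,$$

which is exactly the Uniform(a, a+b) CDF, so the linear transformation stays uniform.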
So that's what this theorem is doing. It's just saying, in this direction here's how we can simulate. And in this direction, it's saying how do
you go back from x to a uniform, okay? So you can kinda convert back and
forth between distributions this way. All right, so let's talk a little
more about independence just because I said that, sorry, these
boards are very squeaky in this room. I said that I'd say something
about independence, so I'll still say something about independence. We've talked already about independence of events. And I sent an email about this; I really wanted to elaborate a little
bit about what's the difference between independence of random variables
vs independence of events. So independence of random variables,
they're closely related, we just define it directly in
terms of independence of events. So if we have random
variables X1 through Xn, and we wanna say what does it mean for
these random variables to be independent? Then the definition is that. Definition is that they're independent. If we look at the probability that X1 is
less than or equal to little x1, blah, blah, blah, Xn less than or
equal to little xn. So that's an event right,
it's an intersection of n events, okay? Now intuitively, remember
independence intuitively means that to find the probability of
an intersection, we just have to multiply. So what this means is that
we can just multiply, notice these are just the CDFs. So to compute this probability we just
need to multiply the separate CDFs. This thing here that I wrote
down is called the joint CDF. Because this is the CDF of all
the random variables considered jointly as one probability thing. In here we separated it
out into the separate, these ones are called marginal, we'll
come back to that in a later lecture. But these are just the individual CDFs,
okay, and that has to be for all little x1 through little xn. So actually it looks simpler than
the definition of independence of events. Cuz remember for independence of events,
if we wanna say three events are independent, we talked about the fact
that we need the condition on the triple intersection. But that's not enough, right,
we also need all the pairwise ones. And here I didn't bother writing
any pairwise statements or three-way statements,
I just have everything all in one. So it looks simpler which would
seem strange at first, but the way to resolve that is that
this is for all little x1 through little xn. So even though it looks simpler, actually I've written down
uncountably many equations. And then in the case of just independence
of events, it's a large list, but a finite list of equations. Here we have infinitely many equations, it
just looks like one equation cuz it's for all little x1 through little xn.
of independence in general, but in the discrete case, usually it's
easier to work with PMFs rather than CDFs. And so then we work with what's
called the joint PMF, where, I'm just going to write
the same equation again. Except I'm gonna replace less than or
equal by equal. So that's just the product of the PMFs. So this thing here is
called the joint PMF. And in the discrete case,
these two things are equivalent. Proving they're equivalent,
there's nothing difficult about it, it's just a little bit
tedious to write down. Cuz you have to write up the appropriate
sums to convert between these things. And it works out, and the intuition that it should work
out should be very, very clear. Because what this is saying is,
what this statement says is that, knowing any collection,
any subcollection of these Xs, knowing their values tells us
nothing about the other ones, right? And this says the same thing. So this is stronger than
just pairwise independence. Pairwise independence would say
if you know one random variable, that doesn't tell you any information
about any other one random variable. Full independence means
knowing any of them, any collection of them tells you
nothing about ones that you don't know. No information whatsoever. So that's a stronger statement. And just to give you a quick example
where, pairwise independence doesn't imply independence, here's
the simplest example I know of that. Let's just let X1 and X2 be iid coin flips, that is Bernoulli
one half, fair coins. So if you want, just think of
flipping a fair coin twice, and then this is kind of like
an old game that people used to play called matching pennies, where each person, you know, pulls out a penny and sees whether they're the same or not: one person wins if they're the same, both heads or both tails, and the other person wins otherwise. I think that used to be a popular game but
not so popular anymore. However, that suggests another random variable which is a natural one to look
at, that is just are they the same or not? So let's let X3 equal 1 if X1
equals X2, and 0 otherwise. So that's just saying, 1 if the two pennies match, 0 otherwise; that's an indicator random variable, right? So okay, these are pairwise independent but not independent. It's obvious that they're not
independent because if I know X1 and X2 I immediately know X3,
X3 is a function of X1 and X2. So not only does it give me information,
it gives me total information. So knowing X1 and
X2 give us total information about X3. However, just knowing X1 tells us
nothing about X2 because those, I assumed, are independent. Just knowing X1 tells us nothing about
X3 because it's still 50-50, right? If we know that X1 is 1,
then this just reduces to saying X2 is 1 in order for X3 to be 1, but that's still 50-50. So similarly X2 is independent of X3. So they're pairwise independent but
they're not independent. So, I just said in words why that's true,
but you can try checking what happens with these
equations, and see why that agrees with this. So, pairwise independence isn't enough in general to give independence.
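If you'd like to see that check done mechanically, here is a small enumeration of the four equally likely outcomes. This is a sketch in Python; the code is an illustration I'm adding, not something from the lecture.

```python
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of two fair coin flips, with X3 = indicator of a match.
outcomes = [(x1, x2, int(x1 == x2)) for x1, x2 in product([0, 1], repeat=2)]
p_each = Fraction(1, 4)

def prob(event):
    # Probability of an event, given as a predicate on an outcome (x1, x2, x3).
    return sum((p_each for o in outcomes if event(o)), Fraction(0))

# Each of X1, X2, X3 is 1 with probability 1/2.
print([prob(lambda o, k=k: o[k] == 1) for k in range(3)])        # [1/2, 1/2, 1/2]

# Pairwise independence: e.g. P(X1 = 1, X3 = 1) = 1/4 = P(X1 = 1) * P(X3 = 1),
# and the same factorization holds for every pair and every choice of values.
print(prob(lambda o: o[0] == 1 and o[2] == 1))                   # 1/4

# But full independence fails: P(X1 = 1, X2 = 1, X3 = 1) = 1/4, not (1/2)^3 = 1/8.
print(prob(lambda o: o[0] == 1 and o[1] == 1 and o[2] == 1))     # 1/4
```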
All right, so, the other thing we were doing last time was LOTUS, and we are gonna talk more about LOTUS,
definitely on Wednesday. But I wanted to start on
the normal distribution first, which is both important and also a good distribution to have on
hand when we're doing more LOTUS stuff. Okay, so here's the normal distribution. Now it's also called the Gaussian but
I don't call it the Gaussian. Because, first of all, Gauss was one of
the greatest mathematicians of all time, he has enough stuff
named after him already. And secondly, he was not the first
person to use this distribution. So it is not really fair
to give him credit, so we call it the normal distribution. But I'm just mentioning that because
you may see the term Gaussian and it's the same thing. The normal distribution
is by far the most famous important distribution
in all of statistics. And There are many reasons for that. The most famous reason for that is
what's called the central limit theorem, which we're gonna do towards
the end of the semester. But I'm just gonna tell you in words,
intuitively, what it is now. Just to kind of foreshadow it, just so
you have a sense of why is this important. Then we'll go into the details of
it much much later in the course. Central limit theorem,
The central limit theorem is possibly the most famous and important
theorem in all of probability, so we're definitely gonna
talk about it later. And it's kind of a very,
very surprising result. What it says is that if you add up
a bunch of iid random variables, the distribution is gonna look
like a normal distribution. So this is just one distribution,
or one family of distributions, cuz you can also shift it and scale it. Aside from the shift and scale, it says that adding up a large number of iid random variables,
it's always gonna look normal. Which is kind of very, very shocking,
I think, because it seems like, I didn't say I was adding
continuous random variables. They could be continuous, they could
be discrete, they could be beautiful, they could be ugly. They could be anything,
you add them up and then it's always gonna
look like the same shape. And that shape is the one
you've all seen before, it's just the standard bell-shaped curve. But there are different curves you
could draw that look like bell curves. So why does this one
particular bell-shaped curve come up always as the only possibility,
that's what this theorem says. So sum of a lot,
I'll make this precise much later, when we actually state it as a theorem and
prove it. But just intuitively, sum of a lot of
iid random variables looks normal. And by looks, I mean if you look
at what's the distribution, it's gonna be approximately
a normal distribution. And there are even further generalizations
of this, going beyond the iid case. You need some technical assumptions, but there are generalizations
even beyond this, okay? So that's one reason that the normal is so
fundamental, and then there are others as well. Okay, so let's draw a little picture. So it's gonna look like,
the PDF is just gonna look like this nice symmetrical
bell-shaped curve, but there are many possible PDFs. So this thing is supposed to be symmetric; it's more or less symmetric the way I drew it, but it's meant to be exactly symmetric. As long as the area is 1, I mean, there's
millions of different possible PDFs that you can come up with that
look basically like this. The normal is this specific
one that's given by, let's start with what's called the standard
normal, which is written as N(0,1). This notation means that the mean is 0 and
the variance is 1, but we'll prove that later. Okay, so the normal has two parameters,
which are the mean and the variance. And we're gonna start
with the standard normal, which is mean 0, variance 1,
and let's write down its PDF. The PDF is f(z). It's kind of a tradition
of using the letter z for standard normals,
not that you have to do that. But often we'll use z for
standard normals, that's why I'm calling it z rather than x. So, well,
it's gonna be e to the -z squared over 2. If you plot this function,
just using your old calculus techniques: find the derivative, and the second derivative, and the points of inflection, and so on, and plot this. Or better yet, just use a graphing calculator,
you'll get something that looks like this. That's not yet a PDF, though,
because it doesn't integrate to 1. So I'm just going to put
some constant here, c. Where c is whatever constant we need,
that's called the normalizing constant. It's whatever constant we need
such that the area will be 1. So you can plot this thing and you see
it will look like this, but that doesn't yet explain why we choose this one. I mean,
it's a nice looking function, right? We can see that it's symmetrical, because
if we replace z by -z, nothing happens. We can see that it will go to 0 very,
very fast. Because exponential decay's already fast,
and here it's being squared
up in the exponent. So it's gonna decay to 0 very, very fast,
as z gets very positive or very negative. So it's a nice enough looking function,
but it will be much later when we see
why this is the most important PDF. Right now it's just a PDF with
an unknown normalizing constant, okay? So before we can get
much further with this, it would be useful to know the value of c. So let's try to get
the normalizing constant. So that might be a long calculation,
so I'll try to do that over here. So we need to know,
in order to make this integrate to 1, we just need to integrate this function. So let's try to do this integral. We wanna know the integral,
from minus infinity to infinity, of e to the -z squared over 2, dz. Okay, so that's actually
a famous integral in mathematics. Partly because of probability and
statistics, but partly just in its own right. This seems like something we
should be able to integrate. So of course, you can try substitution or some other kind of change of variables,
it will not work. You could try doing integration by parts, and you know there's many ways
to try integration by parts. You can try to split it up in some way,
it will not work. Anything else you could ever think of, say we want to find an antiderivative and use the fundamental theorem of calculus, the usual way we do integrals, I can guarantee you it won't work.
a theorem that says that this integral, as an indefinite integral,
that is without the limits of integration, is impossible to do it in closed form. And it's kind of pretty amazing, I think,
that someone was able to prove that. It's not just like,
no one's thought of a way to do it. Someone proved that you can't ever do it,
so don't even try. Well, to qualify that a bit, there is one thing we can exploit here: this is e to a power, and we've been using the Taylor series for e to a power over and over again, and that series converges everywhere, okay? So if we want, we could just expand
this out as a Taylor series. And the way to do that would not be
to start taking derivatives of this. It would be to take the Taylor series for
e to the x, and then plug in x is -z squared over 2. Well, we'll get an infinite series, and then with some analysis you can
justify integrating that term-by-term. That is, we replace this by an infinite
sum and then integrate each term. And those are all very,
very easy integrals, because that's just polynomial stuff. But then we just get
this infinite series and we wouldn't know what to do with it, okay? So when I say this integral is impossible, it means that it's impossible
to do it in closed form. That is, just as a finite sum in terms
of what are called elementary functions. By elementary we mean just the familiar
functions, sine and cosine and exponential and log and
polynomials and stuff. You can't do it with anything you can write down in terms of standard functions.
as a indefinite integral. But that doesn't rule out all hope
that we could do the definite integral right that is we might be
able to find the area under the curve without first finding anti derivative. We will not be able to find
an anti derivative enclosed for them because it's impossible. Okay but we can try to find an area. All right so
how do we find the area under the curve? So we have this function here
which looks kinda like that, and we want to know this area here from
minus infinity to infinity, and so we write down this integral and
we can't find an antiderivative, and at this point if I didn't already know
how to do this I would probably be stuck, because that seems very difficult. The way we're used to doing this would be to find an antiderivative, right? So we're kind of basically
stuck at this point. And so someone, and I wish I knew who,
someone came up with an incredibly stupid and an incredibly
brilliant way to solve this integral. This method does not usually work,
but in this problem it does. And that method is we have this
problem that we can't do, so we write it down a second time. [COUGH] Now that solution may have just
looked kind of like banging your head against the wall, like you can't do the integral so you
keep writing it down over and over again. That doesn't seem like it
would help the situation. Actually, this solves the whole problem. So let me show you why this trick of
just writing down the same thing twice solves this problem. Well, let's just change
the notation a little bit. This letter z here is what's
called a dummy variable. You can change z to whatever you want. This is just notation for the area under the curve, so just so I can keep track of which is my first integral that I can't do and which is my second integral that I can't do, I'm gonna change the notation to x and y so I can tell them apart. So this is e to the minus x squared over two dx, and this is the integral of e to the minus y squared
over two dy it's exact same thing. Now let's write this as one integral. So this is a double integral but if you
haven't dealt with double integrals much as they said on the first day of class. It's nothing to worry about just do
single integrals one after the other. This also can be written as the double
integral of e to the minus x squared plus y squared over 2 dxdy where the interpretation of this double Integral
is that first we do the inner integral, keeping y held constant, and
then do the outer integral. That's the exact same thing because if you
imagine rewriting this as a product again, when we're doing this inner integral,
we're holding y constant so we can just pull that out. And then what we have left here
is one of these integrals, so you can pull that integral out, it would be the exact same thing. So I've written it as a double integral. It still doesn't look so much easier, but then there's one thing
that is saving us here. That's the fact that we have
this X squared plus Y squared. Whenever you see an x
squared plus y squared, it should remind you of
the Pythagorean theorem, right? Sums of squares, so
we draw a simple little picture. Let's say, suppose x,
y is up here just for simplicity. Of course, they could be in any quadrant. So here's x, here's y. And if we let r^2=x^2+y^2, I'm using the letter r for radius, that
is just the distance from here to here. Right, it's just the distance formula, just basic geometry, and there's some angle theta. The fact that we have r squared sitting right there is a hint that it may
be useful to convert to polar coordinates. Polar coordinates just says,
represent points in terms of a radius and an angle rather than in terms of
the cartesian coordinates of x and y. So we're going to convert from x,y to
r theta, that is polar coordinates. So our limits of integration
are gonna change, so, first of all what is that thing
that's just e to the -r squared over 2. And we are gonna integrate as dr d theta. We can also do d theta dr,
and theta is this angle so that's gonna go from zero to two pi. And r is this length, so
r goes from zero to infinity. And then there's one
other thing we need here, which is something that, one of the very
few facts from multi-variable calculus that we need in this course is what
happens when you transform in more than one dimension: you need to multiply by something here called the Jacobian.
the math review handout. So if you're not familiar with
this you can review that handout. And I actually do this particular
case of this transformation. Cuz this is a very common transformation,
convert to polar coordinates. Okay, so if we do that, the Jacobian here works out to be r. That's just a little calculation. If you haven't seen it before, then you can read about it in the math
review handout or in a calculus book. So that's called the Jacobian. So we don't just replace
dxdy by dr d theta, we replace it by r dr d theta
is the correct way to do it. That is what now makes this go from being a very hard problem to a very easy problem. As soon as we have the r here, that just suggests, well look, the derivative of r squared over 2 is r, so we have kind of the derivative
sitting right here. The derivative of what's up in
the exponent is just sitting right there. Now it's just an easy
substitution integral. So this is the integral
from zero to two pi. Now let's just do this inner integral,
and then we have a d theta at the end. So just doing this inner integral, let's just let u = r squared over two. So du = r dr, okay? So r dr is what we have there, so that's just the integral from zero to infinity of e to the minus u, du. Now that's a really easy integral. The antiderivative of e to the minus u is minus e to the minus u, so this integral is just one. So what are we doing? We are integrating one from
zero to two pi, and that's two pi. And lastly, we just notice, well, that's not the integral we started with, that's the square of the integral we started with. So therefore, for the integral from minus infinity to infinity of e to the minus z squared over 2, dz: since we wrote it down twice we got 2 pi, so if we had written it down once we would get the square root of 2 pi. So that's what we need for the normalizing constant. All right. So now we know what the c is: c = 1 over the square root of 2 pi.
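As a quick numerical sanity check on that constant (a sketch assuming Python with scipy is available, which is a tooling assumption on my part):

```python
import numpy as np
from scipy import integrate

# Numerically integrate e^(-z^2/2) over the whole real line and compare with sqrt(2*pi).
value, abs_err = integrate.quad(lambda z: np.exp(-z ** 2 / 2), -np.inf, np.inf)
print(value, np.sqrt(2 * np.pi))  # both are about 2.5066
```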
Kind of amazing, first of all, that this trick worked. And secondly,
we're integrating an exponential, and suddenly a square root of pi shows up. Where did the pi come from? Pi makes you think of a circle, so where did the circle come from? The circle came from the fact that we're using polar coordinates. So that's how the pi appears. All right, so that's nice. Now we know the normalizing constant. So that's the standard
normal distribution. Let's compute its mean and variance and
then we can talk a little bit about the general normal as opposed
to the standard normal, okay? So that's the standard normal. Let's compute its mean and variance. I've already claimed
that the mean is zero. And the variance is one, but
we haven't checked that yet. So let's verify that, okay? So, first of all let's get the mean. The mean is easy. So we're gonna let z be standard normal,
and sorry for the pun. We're gonna compute EZ,
which I said is easy. Why is it easy? It's easy because of symmetry. By definition the mean is
the integral of z times the PDF, which we now know: it's 1 over the square root of 2 pi, which I can take out cuz that's a constant, times the integral of z e to the -z squared over 2 dz, and that equals 0. That was an easy integral. Why is it 0? Well, I just said, it's by symmetry. And the general type of
symmetry we're using here is that if we have an odd function. Let's say g, just as a general statement. If g(x) is an odd function, which, I'll remind you, means that it has the symmetry property that g(-x) = -g(x). That's the definition of an odd function; an even function would be if we
do not have a minus sign here. That's the definition of an odd function. Then if we integrate g symmetrically. Let's say, from -a to a,
where that's the same from -a to a, not from just any a to b, of course. We'll always get 0. And you can do that by splitting this
up into two integrals and check that. But the easiest way to see that
is just like, as an example. For example, sine is an odd function. And if you have something that looks
like that where this is symmetrical. And you say,
you integrate from there to there, then the negative area cancels
out the positive area. And so it's true. Or you can verify it by
splitting up into two integrals. The positive area cancels
the negative area. Well, this is an odd function, right? Because if I replace z by minus, this
thing I'm integrating, is an odd function. So if I replace z by -z, nothing
happens to the exponential part, and that becomes -z. So it's an odd function. So just by symmetry,
you can immediately say 0, without having to do
some nasty calculation. All right, so that's good. But let's try to get the variance. Variance is gonna take a little more work. So the Var(Z) = E(Z
squared) - (EZ) squared. But that second part, EZ, we just showed is zero. So that's just E(Z squared). So, now is where we're
starting to need LOTUS again. I'll remind you,
we haven't proven what LOTUS yet. But we'll talk about that on Wednesday. Here I just wanna show how to use it. Lotus says that if we want E(Z squared), we do not first need to
find the PDF of Z squared. We can directly work on
in terms of the PDF of Z. That's what we talked about last time. So we know immediately that
this is just the integral, I'll take out the 1 over
square root of 2pi again. Integral minus infinity to infinity
z squared e to the -z squared over 2 dz. So it was exactly the same integral
except I replaced z by z squared. That's why it's called the law
of the unconscious statistician, cuz that's just kind of an obvious
thing to do, just plug in z squared. Lotus says that, that is in fact valid. Okay, so
now we just have to do this integral. This integral,
this is now an even function. Which is nice but
not as nice as having an odd function. So with an even function if we want, we can integrate from
zero to infinity instead. And I think I'd rather do that. It's not necessary to convert it this way
but I'd rather integrate from zero to infinity just so that I can avoid thinking
about negative numbers for a while. Let's go from 0 to infinity and
multiply by 2. Since it's an even function,
that's perfectly correct, cuz we have a positive area, a positive area and so
by symmetry, it's the same thing twice. So just twice the area from 0
to infinity of the same thing. Z squared e (-Z squared)/2 dz, now here I think we need to
resort to integration by parts. Which usually I try to avoid, but
once in a while we can't avoid it. This integral, so
just in terms of the strategy for using integration by parts,
remember with your integration by parts, you need to split
the integrand into two pieces. One piece that's easy to integrate, and
one piece that's easy to differentiate. Now Z squared is easy to
do whatever we want with, the part we should focus on is this thing. And we don't know how to integrate that. We know how to integrate it from
minus infinity to infinity, but we don't know how to integrate
it in general over some interval. But we saw over in this
calculation we just did that if there were an extra z here,
then it's a really easy integral. So we're gonna split this z squared
into two zs, z times z there, okay? Now we're in good shape because this
thing, we could let this thing be u and this thing be dv, right? Because that's something
we know how to integrate. So in other words,
we're letting u = z, so du = dz. That's really simple. And we're letting dv =
z e to the -z squared over 2 dz, so v = minus e to the -z squared over 2; note the minus sign in front. To check that, if we take the derivative of this by the chain rule, we get that, right? So, okay. So, therefore, now we're in good
shape to do integration by parts. So this is just your usual
integration by parts thing. It's 2 over root od 2pi times
(uv) integrated from 0 to, evaluated from 0 to infinity. And then minus the integral of VDU, but
that's minus a minus because of that, so we're going to do minus Minus this plus the integral of e to the -z squared over 2 dz, from 0 to infinity. Okay, now we're actually
done with this calculation, because all we have to do is say,
that's the integral we just did. The only difference is that
we're going from 0 to infinity, instead of minus infinity to infinity. This part is just 0, because if you look
at what this is near 0, this part is 0 and this part is close to 1, or close to -1. When z is very large,
then this part's exponentially small, so this part is just 0. So we only have to concentrate on this
part, but that's what we just did. This is one half of the integral
that we just computed, right? So it's one-half of square root of 2 pi,
and we multiply one-half square
root of 2 pi by this, we get 1. So this whole thing just becomes 1,
cuz it's just this times its reciprocal. So what this showed was that
the variance is equal to 1, which is what I said here, so that's good. All right, so a couple more very
quick things about the normal, and then we'll continue next time. Just for an important piece of notation,
this is standard notation in statistics The notation is that, is for the CDF, capital Phi is the standard normal CDF. Because this distribution is so
important, and yet so hard to deal with in the sense that
it's a lot of work to do that integral. And now it's going from minus infinity to
infinity, then it deserves its own name. So in other words,
Phi(z) equals 1 over root 2 pi, Times the integral from minus infinity up
to z, of e to the -t squared over 2 dt. I just changed the letter to a t,
to avoid clashing with this z here. So that's just the CDF, right? Because this function is so important,
this CDF is very easy to calculate using various computers or calculators,
very easy to find tables of this. So in a sense, we got around the problem
that we couldn't do this integral by just saying, call this Phi(z), and now just
treat that as a standard function. And now we can do this integral,
it's just capital Phi, right, so that's standard notation for that. And one other remark about that is,
what happens if we get Phi(-z)? This is something you should check for
yourself, Phi(-z) equals 1- Phi(z), by symmetry. And just for practice in
the concept of symmetry and CDFs, you should check this for yourself. Just draw a picture and see why that's true; it's a useful fact.
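Numerically, capital Phi is exactly what standard-normal CDF routines compute, so the symmetry is easy to see. A sketch, assuming Python with scipy (norm.cdf is scipy's function, not notation from the lecture):

```python
from scipy.stats import norm

z = 1.5
print(norm.cdf(z))                    # capital Phi at 1.5, about 0.9332
print(norm.cdf(-z), 1 - norm.cdf(z))  # both about 0.0668, so Phi(-z) = 1 - Phi(z)
```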
All right, so next time we'll continue with the non-standard normal, building on the standard normal.