So we were talking about the exponential
distribution, and if I remember correctly we were talking about something
called the memoryless property, right? So we showed the last time that the
exponential distribution is memoryless, but at this point, as far as we know, there
could be infinitely many other memoryless distributions. So what I want to talk about now is
a fact that I find pretty amazing, which is that the exponential is
the only memoryless distribution in continuous time. In discrete time we have the geometric. So in a very deep sense, the geometric is
the discrete analog of the exponential. The exponential is the continuous
analog of the geometric. So those two distributions
are very closely related. So just to remind you what
the memoryless property said, and also just cuz I saw a
news article recently that completely misunderstood
the concept of life expectancy. And that's not the first
time that that happened. So basically it's a mistake of
not understanding the difference between expectation and
conditional expectation. And we haven't formally done
conditional expectation yet. But I claim that you already know
how to do conditional expectation. Because you just do expectation, except you use conditional
probabilities instead of probability. But we talked about the fact
that conditional probabilities are probabilities, so
it's completely analogous. So we will spend a lot of time
on conditional expectation later as a topic in its own right. But it's already something
that's familiar, right, just use conditional probability. Okay, so for this life expectancy thing, here's like the common misconception
that I've seen in various news articles. Last time I looked the life
expectancy in the US was 76 years for men, 81 years for women. And it's different in different
countries and whatever. And it's kind of an interesting
statistical problem. How do you actually come
up with those numbers? So I'm not vouching that those numbers
are exactly correct; I'm just saying those are the latest
numbers that I've seen reported. Now how do you get those numbers? Because in principle, you'd think, if you
wanna know the life expectancy of a baby who's born tomorrow, then I guess in principle what you would do
is take all the babies born tomorrow, wait until they all die, and
then take the average lifetime, okay? Well, first of all,
that's gonna take over 100 years, and you might want an answer now. But secondly, at some point in
time you want an answer, right? But if you only look at the ones
who've died up to that point and average those, that's gonna be
a very biased answer, right? Because you're ignoring all
the ones who have longer lifetimes. Okay, so that's an example of
what's called censored data. That's a good kind of censoring. It's censored because they're still alive. So anyway, that's a hard statistical
problem, and an interesting one. The reason I'm mentioning it now
is kind of like good news and bad news about life expectancy. So let's just assume it's 80 years for
simplicity. The mistake that I saw in this
news article is basically assuming that it's 80 years for everyone. It was about Social Security and Medicare,
stuff like that, still assuming 80 years even for people who
are already in their 50s and 60s, okay? But the fact is,
the longer you've lived, the longer your expected
total lifetime becomes, okay? So if I wanted to write that as
an equation, I would say, for example, if we let T be how
long someone's gonna live, then E(T | T >= 20), the expectation
given that that person lives to be at least 20, is gonna be greater than
just the unconditional expected value E(T). It's kind of intuitively clear that
that's the conditional expectation. It just means given this information. And we compute our expectation based on
conditional probabilities rather than unconditional probabilities. This should be pretty intuitive, right? The case where that inequality would not
be strict is if everyone lives exactly the same lifespan;
then this is irrelevant information. But as soon as there's
variability, the fact that you've lived this long
is a good thing, and that's the good news. The bad news is that human lifetimes
are not memoryless, right? People get older and decay with age. And so this is just to illustrate what
the memoryless property would say. If human lifetimes were memoryless
and the average is 80 years, then it would say, if you lived to be 20,
then your new expectation is 100, right? Because memoryless says
you're good as new, right? So no matter how long you live you
get an extra 80 years on average, and that's not true empirically. So I'll say if memoryless,
cuz that's not realistic for human beings, but it is realistic in some other applications. If memoryless, we would have, just translating what I
just said into an equation, E(T | T > 20) = 20 + E(T): you have
those 20 years, plus you're as good as new, so you get an extra E(T). That's what the memoryless
property would say, okay? So the truth is somewhere in between: we actually have upper and
lower bounds at this point. It's gonna be somewhere between E(T) and
E(T) + 20, okay? So that's the memoryless property.
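Those bounds are easy to check numerically. Here's a quick sketch of my own (not from the lecture): for an exponential lifetime with mean 80, the conditional mean given survival past 20 comes out near 100, while for a Weibull-type lifetime that ages (the parameters 88 and 2.5 are arbitrary illustrative choices), it lands strictly between E(T) and E(T) + 20.

```python
import random

random.seed(0)
N = 200_000

# Memoryless case: T ~ Expo with mean 80 years.
exp_draws = [random.expovariate(1 / 80) for _ in range(N)]
exp_survivors = [t for t in exp_draws if t > 20]
cond_mean_expo = sum(exp_survivors) / len(exp_survivors)  # near 20 + 80 = 100

# An "aging" lifetime: Weibull with shape > 1 (illustrative parameters),
# where the conditional mean lands strictly between E(T) and E(T) + 20.
wei_draws = [random.weibullvariate(88, 2.5) for _ in range(N)]
wei_mean = sum(wei_draws) / N
wei_survivors = [t for t in wei_draws if t > 20]
cond_mean_wei = sum(wei_survivors) / len(wei_survivors)
```

For the aging lifetime the conditional mean is only slightly above the unconditional one, which matches the intuition that someone who has "used up" part of a decaying lifetime gains less than a fresh 20 years.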
So of course, you could ask, since it's not very realistic for human beings, why do we care so much about it? Well, first of all,
it's used a lot in applications in chemistry, physics, and sometimes economics:
basically, it's realistic in problems where things
don't decay with age. Another way of thinking of it:
there's a homework problem about doing homework problems. And there are roughly two types of
homework problems. One is a type where, just like, you
have to do a certain calculation, and you do the calculation,
and then you're done. And it sort of takes
a fixed amount of time, or at least you can make
partial progress, right? And then there's another type of homework
problem where you could just stare at it for hours and
have absolutely nothing, right? And then, at some point, eventually,
you have to be very determined, very persistent. At some point, you get this a-ha moment,
you get the breakthrough, and you get it, okay? So memoryless is like that second kind:
you can't make partial progress. Either you get it eventually,
or you don't. Whereas the other type of problem
is more of a fixed, gradual progress, progress, progress until
you finish the problem. Okay, so
there are cases where it's realistic. But the other big reason for
studying it is that it's a building block. So if you go and look at what
distribution people actually use for
something like this survival time T, the most popular distribution that's used
in practice is what's called the Weibull. And you don't need to know Weibulls right
now, but just to mention it, a Weibull is obtained by
taking an exponential to a power. If you take an exponential
random variable and, say, cube it, that's not going to be exponential anymore,
and it's not going to be memoryless anymore; that's called a Weibull. And it actually turns out
to be extremely useful. So exponentials are a crucial
building block. But in some cases memoryless is
not an unreasonable assumption or it may be a reasonable approximation
even if it's not exactly true. Okay, so that's just the intuition
of the memoryless property. We proved that it was true last time for
the exponential distribution. But now, let's show that it's only
the exponential distribution that has that property. So I'll state that as a theorem. Suppose X is a positive continuous random variable; we're thinking about applications
like lifetime or survival time, so we're thinking of positive-valued
random variables, taking values from zero to
infinity. Suppose X has the memoryless property. Now, the memoryless property is
a property of the distribution, not of the random variable itself, per se; but we would say the random variable
has the memoryless property if its distribution does. Okay, and then the claim is just
that X has to be exponential, X ~ Expo(lambda) for
some lambda. So this is a characterization
of the exponential distribution. Okay, so
now let's try to prove this result. And it's kind of an unusual proof, compared to most that
you've probably seen. Because we're gonna write down
an equation, but we're solving for a function, not solving for the variable. So it's what's called
a functional equation, okay? So let's let F be the CDF, as usual. But as we saw last time with
the exponential distribution, it's easier to think in terms of
1- CDF for this kind of problem. Because that's the probability
of surviving longer than time t. So with F the CDF of X, let's say G(x) = P(X > x) = 1 - F(x). So it's easier to do this
problem in terms of G(x). Now the memoryless property, in terms of G, is easy to write. It's just the equation
G(s + t) = G(s) G(t). And we saw this, I'm not gonna
repeat the argument for this. Because, the same thing
as we did last time for the specific case of the exponential. The memoryless property is defined in
terms of a conditional probability; just write down the definition
of conditional probability, and in one line you can
rewrite it like this. Notice this is true for
the exponential, right, because e to the -(s + t) is (e
to the -s)(e to the -t), okay?
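As a quick numerical sanity check of my own (not from the lecture): the exponential's survival function satisfies this functional equation exactly, while a Weibull-type survival function, for example the one you get by cubing an exponential, does not.

```python
import math

lam = 0.5  # an arbitrary rate, for illustration

def G_expo(t):
    # survival function of Expo(lam): P(X > t) = e^(-lam * t)
    return math.exp(-lam * t)

def G_weibull(t):
    # survival function of the cube of an Expo(lam): P(X^3 > t) = e^(-lam * t^(1/3))
    return math.exp(-lam * t ** (1 / 3))

s, t = 1.3, 2.7
expo_ok = math.isclose(G_expo(s + t), G_expo(s) * G_expo(t))              # True
weibull_ok = math.isclose(G_weibull(s + t), G_weibull(s) * G_weibull(t))  # False
```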
We basically want to show that this is not like a usual equation, where we're trying to solve for s or
solve for t or something like that. We're trying to solve for G; we want to show that only exponential
functions can satisfy this identity, okay? So, the way to approach this
kind of equation is just to start plugging some stuff in, and
try to learn more and more about G. It's not like something where
you just plug into the quadratic formula and solve. We're trying to solve for
this function G, so we just gotta gradually learn
more and more stuff about G, right? Okay, so the first thing:
this has to be true for all positive numbers s and t. So, let's see, what can we learn about G? Well, one thing I can do is let s = t,
just to see what that says. Okay, so we can choose s and t to be whatever we want, so we may as
well derive some consequences of this. So one choice would be let s = t, and then that says that G(2t) = G(t) squared, I just rewrote that. Okay, so that's nice to know, what else
can we see, well, let's try G(3t). G(3t), well, I could replace s by 2t, that would be G(3t), G(3t) = G(2t) G(t). But we know G(2t) = G(t) squared,
so this is G(t) cubed. And you can keep repeating that,
formally by induction; just repeat it a few times and
you immediately see the pattern, okay? So we immediately have
that G(kt) = G(t) to the k if k is a positive integer, so
that seems like a useful property to know. Now what if we went the other way around? What if we want to know not G(2t),
but G(t/2)? Well actually, if I take this equation and
replace t by t/2, cuz this is true for all t's,
so I can plug in t/2 for t. That's G(t/2) and that's G(t),
take the square root of both sides. Then we get the G(t/2) is
the square root of G(t). Similarly, G(t/3) is the cube root of G(t), and so on. So now, we've figured out that
this equation is true if k is a positive integer, or if k is the
reciprocal of a positive integer, okay? Well then the next step would be,
what if k is a rational number? That is, a rational number by
definition is just a ratio of integers, so let's say we have G((m/n)t). Well, we have the two
properties G(kt) = G(t) to the k and
G(t/k) = G(t) to the 1/k, where k is a positive integer. If we apply these two properties,
then we immediately get that if we multiply t
by any rational number, the same thing holds:
G((m/n)t) = G(t) to the m/n, where this m over n is
any rational number, okay? Now, if we have any real number, we can always treat a real number as
a limit of rational numbers, right? Like pi, you could approximate pi,
you could say that pi is the limit of the sequence 3, 3.1,
3.14, and so on, all right? So you can pick a sequence of rational
numbers that converges to pi or any number you want. So just take the limit of both sides. Imagine, replace this m/n by
a sequence of rational numbers, take the limit of both sides. We're using the fact that
capital G is continuous. Continuous means that if we take
the limit of something like this, you can swap the limit and the G. So by continuity, this is true for
any positive real numbers. Let's say, G(xt) = G(t) to the x for any real x > 0. Just by taking the limit of rational
numbers, we can get real numbers. Okay, now we're basically done,
because this is true for all x and t. So to simplify it,
let's just let t equal 1. And if we let t = 1,
this just says that G(x) = G(1) to the x. That looks like an exponential function. In particular,
let's write it in terms of base e. That's the same thing
as e to the x log G(1). Now G(1) is a probability, so G(1) is clearly between 0 and 1. If you take the log of
a number between 0 and 1, you'll just get some negative real number,
so this is just some negative number here. So we could call this thing -lambda,
where lambda is a positive number. This is just a constant, right? I'm calling it -lambda because
it happens to be a negative constant. So that's just e to the -lambda x,
which is 1 minus the CDF of the exponential, so
that's the only possibility, okay?
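To recap, here is the whole chain of deductions in one place (a summary of the same steps just carried out):

```latex
G(s+t) = G(s)\,G(t)
\;\Longrightarrow\; G(kt) = G(t)^k,\quad G(t/k) = G(t)^{1/k},\quad G\!\left(\tfrac{m}{n}\,t\right) = G(t)^{m/n}
\;\Longrightarrow\; G(xt) = G(t)^x \ \text{for all real } x > 0 \ \text{(by continuity)}
\;\Longrightarrow\; G(x) = G(1)^x = e^{x \ln G(1)} = e^{-\lambda x},
\qquad \lambda = -\ln G(1) > 0.
```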
So the exponential is the only continuous memoryless distribution. That's the proof of that fact, okay? So, we'll use the memoryless property and more stuff with the exponential
distribution from time to time. But there's another kind of key
tool that we need at this point. You'll need it for the homework,
and you'll need it in general, and that's called the moment
generating function. So let's talk about moment
generating functions. Moment generating function,
which sometimes seems mysterious at first. But if you think carefully about what it
means, you'll see why it's useful and what it actually means. Moment generating function,
which we abbreviate to MGF, Is just another way of describing
a distribution rather than CDFs and PDF. MGF is another alternative way
to describe a distribution. So the definition is that, A random variable x has MGF M(t) = the expected value of (e to the tx). This is as a function of t. And we say that it exists,
this is not a useful concept unless this thing is actually finite
on some interval around 0. So we would just say if this
is finite on some interval, let's say, -a to a,
where a is greater than 0. It could be that this thing is finite for
all real numbers t, which is great. But we're not requiring that to be true;
we are requiring at least some tiny interval
about 0 on which this is finite. All right, well, this definition at first seems to come out of nowhere, I think.
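To make "exists" concrete, here is a standard example (not worked out in this lecture): for X ~ Expo(1),

```latex
M(t) = E\!\left(e^{tX}\right) = \int_0^\infty e^{tx}\, e^{-x}\,dx = \frac{1}{1-t} \quad \text{for } t < 1,
```

while the integral is infinite for t >= 1. Since M(t) is finite on an interval around 0, for instance (-1, 1), the MGF exists.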
And students sometimes really wonder:
what's t? What does t mean, okay? The first thing to understand about this
is that this letter t is just a dummy variable, right? Conventionally we call it t, but we
could have called it s or q or w or anything else that wouldn't clash
with the rest of the notation, right? I wouldn't have called it E or
M or capital X, but anything that doesn't clash is fine. So t is just a placeholder. Secondly,
what does this thing actually mean? Well, this is a well defined thing
because for any number t, we talked many times about the fact that a function
of a random variable is a random variable. So that's a function of a random variable,
that's a random variable. We can look at its expectation. That doesn't yet show why we would want
to do that, but we can do that, right? So this is a function of t,
that is for any number t, we could imagine computing this expectation,
so it's a well defined function of t. It might be that it's infinity for
some values of t, okay? But at least you know it's something
we can write down and study. Okay, so think of t is kind of
like a book keeping device. All the MGF is, is a fancy book keeping device for keeping
track of the moments of a distribution. So let's see why that's true. So why is it called moment generating? Well, to see why it's called that,
all we have to do is say, we have e to a power here, let's use
the Taylor series for e to a power, right, which we've been using many times. So if we expand this thing out,
the expected value of e to the tX is the expected value of
the Taylor series: the sum of X to the n, t to the n
over n factorial, n = 0 to infinity. This is always valid cuz the Taylor series
for e to the x converges everywhere. x is a random variable,
but this is always true. So that's a valid expansion. Now intuitively at this point,
we wanna swap the E and the sum. And that's where some technical
conditions come in; in particular, that's where it matters that we
have this interval around 0. So just suppose for a second that
we can swap the sum and the E. Then we would get this thing: the sum, n = 0 to infinity, of the expected value of x to the n,
times t to the n over n factorial. This thing here, E(x to the n),
is called the nth moment. So the first moment is the mean. The second moment is not the variance
unless the mean is 0, but the second moment and the first moment are what
we need to compute the variance, right? And then there are higher moments,
that's called the nth moment. And higher moments have different
interpretations that are more complicated than mean and variance, but they turn
out to be useful for a lot of reasons. So assuming that we can do this swap, bringing the E inside the sum, then
what we've really done is just capture all the moments of x into this
Taylor series, all right? But that's why it's called the moment
generating function cuz you see all the moments are just sitting
there in the Taylor series. As far as showing why you can swap the E
and the sum, if this were a finite sum, that would just be immediately
true by linearity, right? Since it's an infinite sum, that requires
more justification, and for that kind of justification you'd need a
certain real analysis course, or Stat 210. So we need some more analysis
to do that. But this is valid
under the fairly mild assumption that the MGF exists on some
interval like that. So it turns out that we can; it's kind of like an infinite
version of linearity, right? Swap the E and the sum,
and we get that thing, okay, so that's called
the moment generating function. All right, now, I guess this shows that it would be useful
if we were interested in the moments. But what if we don't
care about the moments? I mean, usually we might care about
the mean and the variance, but we haven't yet worried that much
about higher moments than that. So let me just tell you three
reasons why the MGF is important. Okay, so why is the MGF important? Well, the first reason is the moments; that's what we just talked about, cuz sometimes we do want the moments. So we're gonna let X have MGF M(t). If necessary for clarity,
we might subscript it as M_X, but right now we're just talking
about one random variable and its MGF, so
we don't need a subscript. So the nth moment.
there's two ways to think of it. The nicer way is that
it's the coefficient of t to the n over n factorial
in the Taylor expansion of M about zero,
the Maclaurin series if you like. That's what we just showed over there,
assuming we could do that swap. Another way to say this, just remember from your basic Taylor
series from integral calculus: if you have some function and you wanna
compute its Taylor series, what do you do? You take a sum like this, and in the circled place you put the nth derivative evaluated at 0,
right? That's just how you do Taylor series,
right, take derivatives evaluated at 0. So another way to say it is
that it's the nth derivative, which I'll write like that. So if we want the first moment, we could take the first derivative,
evaluate it at 0, and so on. Okay, so the nth moment is the coefficient,
and it's also this derivative; I'll just write that it equals E(x to
the n) again, just for emphasis. So to get the nth moment, we could take the nth derivative
of the MGF evaluated at 0. But as we'll see, sometimes it's
a lot easier to just directly work out the Taylor series
by some other method. For example, we already know the Taylor
series for E to the x, right? Rather than going through take derivative,
take derivative, take derivative. Just write down the Taylor series. Okay, so that's the first reason. Second reason,
which is probably even more important, the other two reasons are more important
even if we don't care about moments, okay? The second reason is that the MGF
determines the distribution. Another way to say that: if you have two random variables, X and
Y, and they both have the same MGF, then they must have the same distribution:
the same CDF, and if they're continuous,
the same PDF, and so on. This fact is very difficult to prove, so I'm not gonna try to prove it here. But it's useful to know. If you compute some MGF, and
you recognize, hey, that's a Poisson(3) MGF, then you can conclude that that's
a Poisson(3) random variable. There isn't some other distribution that
kind of pretends to be a Poisson(3) by having the same MGF, right? Once you know the MGF, you know
the distribution, at least in principle. Okay, and then the other reason why
they're important aside from 1 and 2, is that they make the sums much,
much easier to handle. So we've dealt a little bit with
sums of random variables before and we'll deal more with it later. In general, finding the distribution of
a sum of independent random variables is complicated,
that's called a convolution. But if we have access to MGFs,
things are a lot easier. Convolution, you have to
do this convolution sum or convolution integral which
we'll deal with somewhat later. We've done a little bit of it before. It's complicated. So suppose we have MGFs. So let's say, if x has MGF Mx and if Y has MGF My. And they're independent. Then we want the MGF of the sum. A lot of times, we're interested in
sums of random variables, right, just adding things up,
comes up all the time. Then the MGF of X+Y, by definition, is the expected
value of e to the t(X+Y). And we haven't proven this yet, but this is another fact
that we're gonna prove soon. That if we have expected value of
two things, a product of things that are independent, then they're
the product of the expectations, okay? So we haven't shown that yet. That would be false in general,
if they are not independent, so it's crucial that we have independence or some other condition here,
but we'll prove that later. I'm writing this as e to
the tX times e to the tY. Since X and Y are independent, e to
the tX and e to the tY are independent. And according to that fact, I can write the E of the product as
the product of the E's, using that assumption. But then notice that that's
just the product of MGFs: that's
Mx of t times My of t, by definition. So this is really simple in the sense
that if we have both of those MGFs, we just multiply them, right? We didn't have to do an integral. We didn't have to do some complicated sum. We just multiplied the two MGFs. So that's really convenient, okay? Let's do a couple quick examples. Of MGFs for specific distributions. So the easiest example to start
with is Bernoulli, right? Bernoulli is the easiest,
simplest distribution. So let's just start with the Bernoulli. If X is Bern(p), then the MGF is M(t) = E(e to the tX). Now we could use LOTUS, but we don't even
need LOTUS here, because X is just 0 or 1. So e to the tX is either e to the t or
1, right, only two possibilities. With probability p
it's e to the t, and with probability q
it's 1, where q = 1 - p as usual. So M(t) = p e to the t + q, just a weighted
average of the two possible values. So that's a really easy calculation. But because of that, now we can
immediately get the MGF of a binomial. Because, well, if we write down the definition, then we know we're gonna have
to do some big LOTUS thing. But we don't have to do that,
because if we think of the binomial as the sum of n iid Bern(p)'s
and use fact 3 there, then we immediately know that M(t) equals the MGF of
a Bernoulli to the nth power, (p e to the t + q) to the n. So we just write it down immediately just
by taking the nth power of that, okay? So for practice you could check this:
remember, the mean of the binomial
is np and the variance is npq. And if you wanna check that
statement 1 there is true, you could take this thing and check that
the first derivative, evaluated at 0, gives you the mean; the second derivative,
evaluated at 0, gives you the second moment;
and from there you get the variance. Okay, so that's the binomial.
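Here's that practice check carried out numerically (my own sketch; n = 10 and p = 0.3 are arbitrary illustrative values), differentiating M(t) = (p e^t + q)^n at 0 by finite differences:

```python
import math

# Binomial MGF with illustrative parameters n = 10, p = 0.3
n, p = 10, 0.3
q = 1 - p

def M(t):
    return (p * math.exp(t) + q) ** n

h = 1e-5
M1 = (M(h) - M(-h)) / (2 * h)             # ~ M'(0)  = E(X) = np
M2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2   # ~ M''(0) = E(X^2)
variance = M2 - M1 ** 2                   # should be close to npq
```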
One more distribution that we need right now is the normal, and that's a bit more
involved of a calculation. So let's let Z be standard normal. Notice that once we have the MGF of the
standard normal, then we know the MGF of any normal we want, cuz remember
this thing about location and scale. So once we know this MGF,
then we can take any normal we want, write it as mu plus sigma Z, and figure out
its MGF in a straightforward way. And that's good practice to make
sure you know how to do that. So let's just talk about the standard
normal first cuz that's simpler. We wanna compute the MGF, the expected value of e to the tZ. By LOTUS, that's 1 over square root of
2 pi, times the integral from minus infinity to infinity
of e to the tz, times
the standard
normal density, which is e to the minus z squared over 2; I already put the normalizing
constant out front. So the exponent is tz minus z squared over 2, with a dz at the end. Now this looks like a pretty nasty integral,
but we will bravely try to do it anyway
cuz you wanna know the answer. So that was kind of nasty. Now if t is 0, then that's just 1, because then it's just saying integrate
the standard normal PDF, and we'd get 1. So we know how to do this
without the linear term, right? We have a linear term,
we have a quadratic term. It's only the linear term
that's annoying here, right? Without the linear term,
this is just really easy. It's not easy, but it's one that we did. So it's easy given what we've done before. So the question is just how to
get rid of the linear term. Well then you have to think like all
the way, way back to algebra class. How do you solve quadratic equations, stuff like that:
completing the square, right? Something you probably thought you'd never
see again, because once you know like the quadratic formula, you rarely bother
to complete the square anymore, right? But that's what we need here
because if we complete the square, then it's gonna look like a quadratic and
not look like a linear thing anymore. Okay, so
we just need a little bit of algebra. So this is e to the minus one-half of something; let's see if I can complete
the square correctly. We factor
out a minus one-half and attempt to complete the square: inside, it's gonna be (z - t) squared, I think, and then I have to adjust it.
I'm just trying to do some algebra here, and it's hard to do algebra on a board,
but let's try. So we have this thing, and
then we need to fix it. Let's see if this gets at
least the beginning right. Expanding, we get
minus one-half z squared, that's good. Then we have minus 2tz times negative
one-half, so that matches the linear term. And then the only part that we
need to fix up is the t squared, which contributes
minus t squared over 2. So if we also multiply by
e to the t squared over 2, and I'll do that on the outside because
that's just a constant with respect to z, then that's gonna cancel out that part. So that's just completing the square. It's just algebra.
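Written out, the completing-the-square step being worked on the board is:

```latex
tz - \frac{z^2}{2}
= -\frac{1}{2}\left(z^2 - 2tz\right)
= -\frac{1}{2}\left[(z - t)^2 - t^2\right]
= -\frac{(z - t)^2}{2} + \frac{t^2}{2},
```

so e to the t squared over 2 factors out of the integral as a constant, and what remains inside is a Normal(t, 1) density.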
But by writing it this way, it's much nicer than before, because now we don't have the
linear term anymore, right? Now it's just something squared, okay? All right, so what's this integral? Yeah, root 2 pi. Because, well, and
if we include this part, then we get 1. That's just a normal. So this is e to the t squared over 2 times the integral, and if we include the 1 over root 2 pi,
the integrand is exactly a normal density, recentered at t. We didn't change the variance, okay? So we're integrating a normal density,
and we must get 1. So just by recognizing that's a normal,
except centered at t rather than at 0, we immediately know the integral is 1. So we get M(t) = e to the t squared over 2.
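You can confirm the answer numerically (a quick check of my own; t = 0.7 is an arbitrary choice), integrating e^(tz) against the standard normal density on a fine grid:

```python
import math

t = 0.7
dz = 0.001
# Riemann sum of e^(t z) * phi(z) over [-10, 10]; the tails beyond are negligible
mgf_numeric = sum(
    math.exp(t * (i * dz) - (i * dz) ** 2 / 2) / math.sqrt(2 * math.pi) * dz
    for i in range(-10_000, 10_001)
)
mgf_exact = math.exp(t ** 2 / 2)  # the answer derived above
```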
And from this you can derive the nonstandard normal MGF as well, which you should do for practice, okay? So we're gonna come back to MGFs later and
use them. But I want to do one more kind of
like famous probability example that's not exactly related to MGFs,
but which will be useful for the homework, and for
what we're doing next. And that's called Laplace's
rule of succession. A famous old problem, Laplace was
a great mathematician and physicist. And so
there's this calculation that he did. And he phrased it in terms of what's
the probability that the sun will rise tomorrow? So suppose that the sun has risen for
the last n days in a row; we've been alive for n days, and
every day the sun came up,
n times in a row, right? What's the probability that
the sun will rise tomorrow? And he kind of got ridiculed for
working on that problem, probably cuz it seemed like
kind of a crazy thing, and partly because of the assumptions. We're considering
random variables X1, X2, and so on, iid Bern(p),
where we think of those as the indicator random variables for the sun rising
on the first day, the second day, and so on. So suppose that each day the sun
either rises or it fails to rise, with probability p that the sun rises,
iid Bernoullis. The other thing that's kind of fishy
about this is that, well I have never experienced a day where the sun didn't
rise, but if I did, then I'd start thinking that's probably the end of the world, and
it's not gonna rise the next day either. So it doesn't seem very realistic
to assume that they're independent. But it's kind of fun to think
about the problem in those terms. That's just one story, and you can
easily think of other problems that have the exact same structure as this. So even if you wouldn't apply this to
the actual question of the sun rising, it's just a useful structure. So this is something we've dealt with
before, just iid Bernoulli trials. Okay, but the twist to this
problem is that Laplace is saying the probability that
the sun rises is unknown. So p is actually unknown, and the question is how do we
deal with that unknown p. So more precisely, let's say,
given p, this is true: if p were known, the trials are iid Bern(p), okay? So all of this is conditional;
this is conditional independence, okay? It's all given p that we're
assuming they're iid. But now we're saying we don't know p, right? We have evidence that the sun
has risen n times in a row, but we don't know for
sure what the value of p is, right? So we're going to treat p as unknown. And then kind of one of
the deep philosophical debates in statistics is how do
we deal with unknowns, right? And for many decades, there's been
this controversy between Bayesians and Frequentists. That's a big topic for 111, and I'm not
saying we'll talk much about it here. The question is, how do we deal with
the fact that this p is unknown? Well, the Bayesian point
of view is to say, well, since it's unknown,
we're gonna quantify its uncertainty by treating it as a random variable
that has some distribution, okay? Now, distribution is just
a reflection of our uncertainty. So the Bayesian approach is
treat p as a random variable. The reason it's called Bayesian is
because then we can use Bayes' rule to say what's the distribution of p given
all the evidence we have, right? So we start with some prior beliefs about
p, that is before we have any data or any evidence,
we have some prior uncertainty. Then we collect data and we use Bayes' rule; Bayes' rule is how
we update based on evidence, okay? So update using Bayes' rule and
then we have some new uncertainty, okay? So, that's the idea. So this is gonna look a little bit
strange because we're not used to treating lowercase p as a random variable,
but this is just good practice in thinking
carefully about what a random variable means. So now we're going to treat
p as a random variable, and we're going to say, as Laplace did,
let p be uniform. Of course you don't have to use
the uniform, but Laplace did. So this is called the prior,
and the choice of prior is a bit controversial,
right, like why uniform? Well Laplace basically said, well, uniform
should reflect complete uncertainty, just completely random,
we know nothing about p, okay? But there are definitely some
controversial issues about that. Well, let's assume that for now. So let's let
Sn be the sum of the first n of them, okay? And here's the structure of the problem. The structure is that
Sn given p is binomial. That is, conditional on p,
this just means treating p as a known constant.
most of the time, but not always. This relates back to the problem about the
random coins and things like that, right? The difference between independence and
conditional independence, so this is just another example of that. If we know which coin we have,
if we know the probability of the sun rising, then the assumption
here is that they're iid. And the sum of n iid Bernoulli(p)'s,
we know, is Binomial(n, p). But p itself is random with
the uniform distribution. That's the structure, okay? And the problem is, first of
all, find the posterior distribution. By definition, posterior distribution means the
distribution after we collect the data. This part is the prior, and
the posterior is p given Sn, since we assume we observe Sn. You could also assume that you observe
X1 through Xn, and it turns out that you would get the same answer; this is what's
called a sufficient statistic, which again is a 111 topic that you
don't have to worry about per se. But it turns out that just observing how
many of these are one is enough, so, we can condition on Sn or we can condition
on X1 through Xn and it won't matter. And the other question would be,
what's the probability, the more practical question is what's the probability
that the sun is gonna rise tomorrow? So that's the probability
that Xn+1 = 1 given Sn = n, let's say. So the sun has risen for the last n days, what's the probability that
it would rise tomorrow, okay? So that's what we wanna do. So how do we do that? Well, the answer is just Bayes' rule. It's just that it's an unfamiliar
form of Bayes' rule, but it's completely analogous, okay? So to find this posterior distribution, I'm just gonna write down something that
looks exactly like Bayes' rule, okay? So we wanna find f(p), I'm just using f to kind of
generically mean pdf, p is continuous. Before we have data, we're treating it as
uniform, then after we have data it's just gonna have some density, a PDF, but it's
conditional, okay, it's a conditional PDF. So I'm calling that f(p),
let's say, given that Sn = k, okay? So we're especially interested
in the case where k is n. That is, the sun has risen for
the last n days, but we may as well consider more generally. The sun has risen on k of the last n days,
okay? Given that information how do we
update our uncertainty about p? It's just Bayes' rule, it's just
a form that we haven't seen before because on the one hand, these are PDFs. PDFs are not probabilities, okay? But PDFs, we can think of it intuitively
as a probability or, at least, if we multiply it by if we took a PDF
times some little increment then that's going to be approximately the probability
of being in that little interval. So Baye's rule in this case,
looks just the same as if you ignored the fact that that's
a density and not a probability. So we're gonna swap those two things, I'm gonna say it's the probability
of Sn = k given p. The notation takes a little while to get
used to, because we try, when possible, to distinguish random variables and their
values using lowercase and capital letters. But that's a little hard to, you know,
to do here because we let p be, we let a lower case p be
a random variable and I don't want to start letting
capital p be a random variable. So you can make up some
new notation if you want. But it's easier to just think
about what things mean. So, this given p means we're just
treating p as a known constant, even though it's a random variable. So, Bayes' rule, we swap these things,
times f(p), that's the prior. And it's also just equal to one,
because we used a uniform prior, so that makes it easy. Divided by the probability that Sn = k; this thing here in the denominator
is unconditional, while this is conditional given p. This thing
does not depend on p, it's not allowed to. But this thing, in the numerator,
it's a function of p, in the denominator,
it does not depend on p. If we want to define it directly, we would use something that looks
like the Law of Total Probability. This is the continuous version
of the Law of Total Probability, which we haven't done yet, so I'm doing it now,
but it's completely analogous. That is, if we want the probability of this
event, then rather than doing a sum, we do an integral cuz it's continuous.
We just do the integral of P(Sn = k given p) times f(p) dp. So that's a continuous version
of the law of total probability. We don't actually need this, we don't
need it now, and I'll show you why. But I wanted to just tell you a little
bit more about what this denominator means, okay? This denominator is a constant
that doesn't depend on p. So let's just look at this thing
up to proportionality, right, that's a proportionality symbol. We're gonna ignore the denominator,
because it doesn't depend on p, right? And for the numerator, that's just
from the binomial, that's n choose k. n choose k is also a constant,
that is, it doesn't depend on p. So I'm gonna ignore the n choose k. That's just p to the k times (1-p) to the n-k, okay? And the prior part is just one, so
that's actually easy.
then we'd have to integrate this thing. And we're gonna do that
much later in the course. But now let's just do the easier case,
f(p) given Sn=n. So that's the case where the sun
did rise for the last n days. Now this is just p to the n. Now this one is easy to normalize, right? Because the antiderivative of p to the n
is p to the n+1 over n+1, so the integral from 0 to 1 is just 1 over n+1. So to normalize it,
we'll just stick an n + 1 there. Now this is a valid PDF, okay? And now lastly to get this thing, so in other words we got this thing
without evaluating the denominator. And then lastly, if we want P(Xn+1 = 1 given Sn = n),
well, just think of it this way: by the fundamental bridge,
we want the expected value of
a random variable with this distribution. So that's just going to be the integral from
0 to 1 of (n + 1) p times p to the n, dp. The integral of p to the n + 1 is
p to the n + 2 over n + 2, so we get n + 1 over n + 2. So according to Laplace,
if the sun rose 100 days in a row, then the probability
that it rises the next day would be 101 over 102.
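Since we now have a concrete answer, (n + 1)/(n + 2), it's easy to sanity-check it by simulation. Here's a quick sketch in Python (my own illustration, not from the lecture): draw p from the uniform prior, simulate n iid Bernoulli(p) days, condition on the sun having risen every day, and see how often it rises on day n + 1.

```python
# Monte Carlo check of Laplace's rule of succession (illustration only).
# Assumes the setup above: p ~ Uniform(0, 1), then iid Bernoulli(p) days given p.
import random

def rule_of_succession_mc(n, trials=200_000, seed=0):
    """Estimate P(X_{n+1} = 1 | S_n = n): draw p from the uniform prior,
    simulate n days, keep only runs where the sun rose every day,
    and record whether it also rises on day n + 1."""
    rng = random.Random(seed)
    kept = rose_next = 0
    for _ in range(trials):
        p = rng.random()                               # p ~ Uniform(0, 1)
        if all(rng.random() < p for _ in range(n)):    # condition on S_n = n
            kept += 1
            rose_next += rng.random() < p              # day n + 1
    return rose_next / kept

n = 5
print(rule_of_succession_mc(n))    # should be close to (n + 1)/(n + 2) = 6/7
print((n + 1) / (n + 2))
```

Note how the conditioning shifts the answer away from the naive 1/2 you'd get by averaging p over the prior: having seen n sunrises in a row, the runs that survive the `if` tend to have p near 1, exactly the posterior (n + 1) p^n derived above.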