Lecture 17: Moment Generating Functions | Statistics 110

So we were talking about the exponential distribution, and if I remember correctly we were talking about something called the memoryless property, right? We showed last time that the exponential distribution is memoryless, but at this point, as far as we know, there could be infinitely many other memoryless distributions. So what I want to talk about now is a fact that I find pretty amazing, which is that the exponential is the only memoryless distribution in continuous time. In discrete time we have the geometric. So in a very deep sense, the geometric is the discrete analog of the exponential, and the exponential is the continuous analog of the geometric. Those two distributions are very closely related. So just to remind you what the memoryless property said, and also because I saw a news article recently that completely misunderstood the concept of a life expectancy. And that's not the first time that's happened. Basically it's a mistake of not understanding the difference between expectation and conditional expectation. We haven't formally done conditional expectation yet, but I claim that you already know how to do it: you just do expectation, except you use conditional probabilities instead of probabilities. And we talked about the fact that conditional probabilities are probabilities, so it's completely analogous. We'll spend a lot of time on conditional expectation later as a topic in its own right, but it's already something that's familiar: just use conditional probability. Okay, so for this life expectancy thing, here's the common misconception that I've seen in various news articles. Last time I looked, the life expectancy in the US was 76 years for men and 81 years for women. It's different in different countries, and it's kind of an interesting statistical problem how you actually come up with those numbers. So I'm not saying those numbers are exactly correct.
I'm saying those are the latest numbers that I've seen reported. Now how do you get those numbers? In principle, if you want to know the life expectancy of a baby who's born tomorrow, then I guess what you would do is take all the babies born tomorrow, wait until they all die, and then take the average lifetime, okay? Well, first of all, that's gonna take over 100 years, and you might want an answer now. But secondly, if you only look at the ones who've died up to a given point in time and average those, that's gonna be a very biased answer, right? Because you're ignoring all the ones who have longer lifetimes. Okay, so that's an example of what's called censored data. It's a good kind of censoring: it's censored because they're still alive. So anyway, that's a hard statistical problem, and an interesting one. The reason I'm mentioning it now is kind of like good news and bad news about life expectancy. So let's just assume it's 80 years for simplicity. The mistake I saw in this news article was basically assuming that it's 80 years for everyone. It was about Social Security and Medicare, stuff like that, and it was still assuming 80 years even for people who are already in their 50s and 60s, okay? But the fact is, the longer you've lived, the longer your expected total lifetime becomes, okay? So if I wanted to write that as an equation, I would say, for example, if we let T be how long someone's gonna live, then E(T | T > 20), the expected lifetime given that the person lives to be at least 20, is greater than just E(T), the unconditional expected value of T. It's kind of intuitively clear what the conditional expectation is: it just means given this information, we compute our expectation based on conditional probabilities rather than unconditional probabilities. This should be pretty intuitive, right?
The case where that would not hold would be if everyone lived exactly the same lifespan; then this is irrelevant information. But as soon as there's variability, then given that you've lived this long, your conditional expectation goes up. That's the good news. The bad news is that human lifetimes are not memoryless, right? People get older and decay with age. So this is just to illustrate what the memoryless property would say. If human lifetimes were memoryless, and the average is 80 years, then it would say that if you lived to be 20, your new expectation is 100, right? Because memoryless says you're as good as new. So no matter how long you've lived, you get an extra 80 years on average, and that's not true empirically. So I'll say "if memoryless," because that's not realistic for human beings, but it is realistic in some other applications. If memoryless, then translating what I just said into an equation, we would have E(T | T > 20) = 20 + E(T): you have those 20 years, plus you're as good as new, so you get an extra E(T). That's what the memoryless property would say, okay? The truth is somewhere in between, so we actually have upper and lower bounds at this point: it's gonna be somewhere between E(T) and E(T) + 20, okay? So that's the memoryless property. Of course, you could ask, since it's not very realistic for human beings, why do we care so much about it? Well, first of all, it's used a lot in science, chemistry, physics types of applications, and sometimes in economics; basically it's realistic in problems where things don't decay with age. Or another way of thinking of it: there's a homework problem about doing homework problems. And there are roughly two sorts of homework problems. One is a type where you just have to do a certain calculation, and you do the calculation, and then you're done.
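To make the memoryless equation concrete, here's a Monte Carlo sketch (the mean of 80 is just the illustrative number from the lecture): for an exponential lifetime with mean 80, conditioning on surviving past 20 should push the expected total lifetime up to about 100.

```python
import random

# Monte Carlo sketch of the memoryless equation E(T | T > 20) = 20 + E(T)
# for an exponential lifetime with mean 80 (an illustrative number).
random.seed(0)
mean_life = 80.0
samples = [random.expovariate(1 / mean_life) for _ in range(200_000)]

overall = sum(samples) / len(samples)          # ~ E(T) = 80
survivors = [t for t in samples if t > 20]
conditional = sum(survivors) / len(survivors)  # ~ E(T | T > 20) = 100

print(round(overall, 1), round(conditional, 1))  # roughly 80 and 100
```

For actual human lifetimes, of course, the conditional mean would come out well below 100, which is exactly the point about memorylessness failing.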
And it sort of takes a fixed amount of time, or at least you can make partial progress, right? And then there's another type of homework problem where you could just stare at it for hours and have absolutely nothing. You have to be very determined, very persistent, and at some point you get this a-ha moment, the breakthrough, and you get it, okay? So memoryless is like that second type: you can't make partial progress; either you get it eventually, or you don't. Whereas the other type of problem is more like fixed, gradual progress until you finish the problem. Okay, so there are cases where it's realistic. But the other big reason for studying it is that it's a building block. If you go and look at what distribution people actually use in practice for something like this survival time T, the most popular distribution is what's called the Weibull. You don't need to know Weibulls right now, but just to mention it, a Weibull is obtained by taking an exponential to a power. If you take an exponential random variable and then cube it, that's not going to be exponential anymore, and it's not going to be memoryless anymore. That's called a Weibull, and it actually turns out to be extremely useful. So exponentials are a crucial building block. And in some cases memoryless is not an unreasonable assumption, or it may be a reasonable approximation even if it's not exactly true. Okay, so that's the intuition of the memoryless property. We proved last time that it holds for the exponential distribution. But now let's show that the exponential is the only distribution that has that property. I'll state that as a theorem. Suppose X is a continuous random variable, and we're thinking about applications like lifetimes and survival times.
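Here's a small sketch of the "cube an exponential" remark: the resulting Weibull visibly fails the memoryless identity P(Y > s + t | Y > s) = P(Y > t). The values of s and t are arbitrary test points; the comparison targets come from the exact survival function P(Y > y) = e^(-y^(1/3)).

```python
import random

# Y = X**3 with X ~ Expo(1) is a Weibull; check that the memoryless
# identity P(Y > s + t | Y > s) = P(Y > t) fails for Y.
random.seed(0)
n = 300_000
ys = [random.expovariate(1.0) ** 3 for _ in range(n)]

s, t = 1.0, 1.0
p_cond = sum(y > s + t for y in ys) / sum(y > s for y in ys)  # ~ e^(1 - 2^(1/3)) ~ 0.77
p_uncond = sum(y > t for y in ys) / n                         # ~ e^(-1) ~ 0.37

print(round(p_cond, 2), round(p_uncond, 2))  # clearly different: not memoryless
```

Having already survived past 1, the cubed exponential is much more likely to survive another unit of time than a fresh draw is, so it "remembers" its age.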
So we're thinking of positive random variables; that is, X takes values from zero to infinity, and we assume X has the memoryless property. The memoryless property is a property of the distribution, not of the random variable itself, per se, but we say the random variable has the memoryless property if its distribution does. Okay, and then the claim is that X has to be exponential: X ~ Expo(lambda) for some lambda. So this is a characterization of the exponential distribution. Okay, so now let's try to prove this result. It's kind of an unusual proof, compared to most that you've probably seen, because we're gonna write down an equation, but we're solving for a function, not solving for a variable. It's what's called a functional equation, okay? So let F be the CDF of X, as usual. But as we saw last time with the exponential distribution, it's easier to think in terms of 1 minus the CDF for this kind of problem, because that's the probability of surviving longer than a given time. So let G(x) = P(X > x) = 1 - F(x), and let's do the problem in terms of G. Now, in terms of G, the memoryless property is easy to write: it's just the equation G(s + t) = G(s)G(t). I'm not gonna repeat the argument for this, because it's the same thing we did last time for the specific case of the exponential. The memoryless property is defined in terms of a conditional probability, so just write down the definition of conditional probability, and in one line you can rewrite it like this. Notice this is true for the exponential, right, because e^(-(s + t)) = e^(-s) e^(-t), okay? This is not like a usual equation, where we're trying to solve for s or solve for t or something like that.
We're trying to solve for G; we want to show that only exponential functions can satisfy this identity, okay? So the way to approach this kind of equation is just to start plugging stuff in, and try to learn more and more about G. It's not something where you just plug into the quadratic formula and solve; we're trying to solve for the function G, so we've got to gradually learn more and more about it, right? Okay, so this has to be true for all positive numbers s and t. We can choose s and t to be whatever we want, so we may as well derive some consequences. One choice would be to let s = t, and that says G(2t) = G(t)^2. Okay, that's nice to know. What else can we see? Well, let's try G(3t): replacing s by 2t, we get G(3t) = G(2t)G(t), and we know G(2t) = G(t)^2, so this is G(t)^3. And you can keep repeating that, formally by induction; just repeat a few of them and you immediately see the pattern, okay? So we immediately have G(kt) = G(t)^k if k is a positive integer, which seems like a useful property to know. Now what if we went the other way around? What if we want to know not G(2t), but G(t/2)? Well, if I take the equation G(2t) = G(t)^2 and replace t by t/2, which I can do since it's true for all t, I get G(t) = G(t/2)^2; taking the square root of both sides, G(t/2) is the square root of G(t). Similarly, G(t/3) is the cube root of G(t), and so on. So now we've figured out that G(kt) = G(t)^k is true if k is a positive integer, or if k is the reciprocal of a positive integer, okay? Well then the next step would be: what if k is a rational number?
A rational number, by definition, is just a ratio of integers, so let's look at G((m/n)t). Well, if we just apply the two properties, G(kt) = G(t)^k and G(t/k) = G(t)^(1/k) for k a positive integer, then we immediately get that multiplying by any rational number works the same way: G((m/n)t) = G(t)^(m/n), where m/n is any rational number, okay? Now, any real number can be treated as a limit of rational numbers, right? Like pi: you could say pi is the limit of the sequence 3, 3.1, 3.14, and so on. So you can pick a sequence of rational numbers that converges to pi or any number you want. So just take the limit of both sides: replace m/n by a sequence of rational numbers and take the limit. We're using the fact that G is continuous; continuity means that we can swap the limit and the G. So by continuity, this is true for any positive real number: G(xt) = G(t)^x for any real x > 0. Just by taking limits of rational numbers, we get real numbers. Okay, now we're basically done, because this is true for all x and t. So to simplify it, let's let t = 1; then this says that G(x) = G(1)^x. That looks like an exponential function. In particular, let's write it in terms of base e: that's the same thing as e^(x ln G(1)). Now, G(1) is a probability, strictly between 0 and 1, and if you take the log of a number between 0 and 1, you get a negative real number. So we can call this constant -lambda, where lambda is a positive number.
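A tiny numeric sanity check of the steps above, using the exponential survival function G(t) = e^(-lambda*t) (lambda = 0.7 and the test points are arbitrary): it satisfies the original functional equation and the derived power and root properties.

```python
import math

# The exponential survival function satisfies G(s + t) = G(s) G(t),
# and hence G(kt) = G(t)**k and G(t/2) = sqrt(G(t)), as derived above.
lam = 0.7                        # arbitrary positive rate
G = lambda t: math.exp(-lam * t)

s, t = 1.3, 2.9
print(abs(G(s + t) - G(s) * G(t)) < 1e-12)      # True
print(abs(G(5 * t) - G(t) ** 5) < 1e-12)        # True
print(abs(G(t / 2) - math.sqrt(G(t))) < 1e-12)  # True
```

This doesn't prove uniqueness, of course; the proof above does that. It just confirms that the exponential is a solution of the functional equation.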
So G(x) = e^(-lambda x), which is 1 minus the CDF of the exponential, so that's the only possibility, okay? So the exponential is the only continuous memoryless distribution. That's the proof of that fact, okay? So, we'll use the memoryless property and more stuff with the exponential distribution from time to time. But there's another key tool that we need at this point. You'll need it for the homework, and you'll need it in general, and that's called the moment generating function. So let's talk about moment generating functions, which sometimes seem mysterious at first. But if you think carefully about what it means, you'll see why it's useful and what it actually means. The moment generating function, which we abbreviate MGF, is just another way of describing a distribution, an alternative to CDFs and PDFs. So the definition is that a random variable X has MGF M(t) = E(e^(tX)), as a function of t. And we say that the MGF exists if this is finite on some interval (-a, a), where a > 0; this is not a useful concept unless the expectation is actually finite on some interval around 0. It could be that it's finite for all t, which is great, but we're not requiring that to be true; we are requiring at least some tiny interval about 0 on which it's finite. All right, well, this definition at first seems to come out of nowhere, I think. And students sometimes really wonder: what's t? What does t mean, okay? The first thing to understand about this is that this letter t is just a dummy variable, right? Conventionally, we call it t, but we could have called it s or q or w or anything else that wouldn't clash with the rest of the notation, right? I wouldn't have called it E or M or capital X, but anything that doesn't clash is fine.
So t is just a placeholder. Secondly, what does this thing actually mean? Well, this is a well-defined thing, because for any number t, we've talked many times about the fact that a function of a random variable is a random variable. So e^(tX) is a function of a random variable, hence a random variable, and we can look at its expectation. That doesn't yet show why we would want to do that, but we can do it, right? So this is a function of t: for any number t, we could imagine computing this expectation, so it's a well-defined function of t. It might be infinity for some values of t, okay? But at least it's something we can write down and study. Okay, so think of t as kind of a bookkeeping device. All the MGF is, is a fancy bookkeeping device for keeping track of the moments of a distribution. So let's see why that's true. Why is it called "moment generating"? Well, to see why it's called that, all we have to do is notice we have e to a power here, so let's use the Taylor series for e to a power, which we've been using many times. If we expand this out, E(e^(tX)) is the expected value of the Taylor series for e^x: E(sum from n = 0 to infinity of X^n t^n / n!). This is always valid, because the Taylor series for e^x converges everywhere; X is a random variable, but the expansion is always true. So that's a valid expansion. Now, intuitively, at this point we wanna swap the E and the sum. And that's where some technical conditions come in; in particular, that's where it matters that the MGF is finite on an interval around 0. So just suppose for a second that we can swap the sum and the E. Then we would get sum from n = 0 to infinity of E(X^n) t^n / n!. This thing here, E(X^n), is called the nth moment. So the first moment is the mean.
The second moment is not the variance unless the mean is 0, but the second moment and the first moment are what we need to compute the variance, right? And then there are higher moments; that's called the nth moment. Higher moments have interpretations that are more complicated than mean and variance, but they turn out to be useful for a lot of reasons. So assuming that we can do this swap, bringing the E inside the sum, what we've really done is capture all the moments of X in this Taylor series, all right? That's why it's called the moment generating function: all the moments are just sitting there in the Taylor series. As for why you can swap the E and the sum: if this were a finite sum, that would be immediately true by linearity, right? Since it's an infinite sum, it requires more justification, and for that kind of justification, either you need to take a certain real analysis course, or Stat 210. So we need some more analysis and math to do that. But the swap is valid under the fairly mild assumption that the MGF exists on some interval like that. So it turns out that we can; it's kind of like an infinite version of linearity, right? Swap the E and the sum, and we get that expansion. Okay, so that's called the moment generating function. All right, now, I guess this shows that it would be useful if we were interested in the moments. But what if we don't care about the moments? I mean, usually we might care about the mean and the variance, but we haven't yet worried that much about higher moments than that. So let me just tell you three reasons why the MGF is important. Okay, so why is the MGF important? Well, the first reason is the moments. That's what we just talked about, because sometimes we do want the moments. So we're gonna let X have MGF M(t). If necessary for clarity, we might subscript the M with an X, but right now we're just talking about one random variable.
And here is its MGF, so we don't need a subscript. So the nth moment, that is, E(X^n): there are two ways to think of it. The nicer way is that it's the coefficient of t^n / n! in the Taylor expansion of M about zero, the Maclaurin series if you like. That's what we just showed over there, assuming we could do that swap. Another way to say this: just remember from your basic Taylor series in calculus that if you have some function and you wanna compute its Taylor series about 0, you take a sum like this, and in that spot you put the nth derivative evaluated at 0, right? That's just how you do Taylor series: take derivatives, evaluate at 0. So another way to say it is that the nth moment is the nth derivative of M evaluated at 0, which I'll write as M^(n)(0) = E(X^n), just for emphasis. So if we want the first moment, we can take the first derivative, evaluate it at 0, and so on. So to get the nth moment, we can take the nth derivative of the MGF evaluated at 0. But as we'll see, sometimes it's a lot easier to directly work out the Taylor series by some other method. For example, we already know the Taylor series for e^x, so rather than going through derivative after derivative, just write down the Taylor series. Okay, so that's the first reason. The second reason, which is probably even more important (the other two reasons matter even if we don't care about moments), is that the MGF determines the distribution. Another way to say that: if you have two random variables, X and Y, and they both have the same MGF, then they must have the same distribution, for example, the same CDF; if they're continuous, the same PDF, and so on.
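As an illustration of the derivative characterization, here's a sketch using the Expo(lambda) MGF M(t) = lambda/(lambda - t) for t < lambda, a standard closed form that isn't derived in this lecture: numerical derivatives at 0 recover E(X) = 1/lambda and E(X^2) = 2/lambda^2.

```python
# Recover the first two moments of Expo(lam) from its MGF
# M(t) = lam / (lam - t), valid for t < lam, by finite differences at t = 0.
lam = 2.0
M = lambda t: lam / (lam - t)

h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)             # M'(0)  ~ E(X)   = 1/lam   = 0.5
m2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2   # M''(0) ~ E(X^2) = 2/lam^2 = 0.5

print(round(m1, 4), round(m2, 4))  # both roughly 0.5
```

The Taylor-series route is even quicker here: lambda/(lambda - t) is a geometric series whose t^n coefficient is 1/lambda^n, so E(X^n) = n!/lambda^n, matching these two values.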
This fact is very difficult to prove, so I'm not gonna try to prove it, but it's useful to know. If you compute some MGF and you recognize that it's the Pois(3) MGF, then you can conclude that you have a Pois(3) random variable. There isn't some other distribution that kind of pretends to be a Pois(3) by having the same MGF, right? Once you know the MGF, you know the distribution, at least in principle. Okay, and then the other reason why they're important, aside from 1 and 2, is that they make sums much, much easier to handle. So we've dealt a little bit with sums of random variables before, and we'll deal more with them later. In general, finding the distribution of a sum of independent random variables is complicated; that's called a convolution. You have to do a convolution sum or convolution integral, which we'll deal with somewhat later; we've done a little bit of it before, and it's complicated. But if we have access to MGFs, things are a lot easier. So suppose X has MGF M_X, Y has MGF M_Y, and they're independent, and we want the MGF of the sum. A lot of times, we're interested in sums of random variables, right; adding things up comes up all the time. The MGF of X + Y, by definition, is E(e^(t(X+Y))) = E(e^(tX) e^(tY)). And we haven't proven this yet, but here's another fact that we're gonna prove soon: if we have the expected value of a product of independent random variables, then it's the product of the expectations, okay? That would be false in general if they are not independent, so it's crucial that we have independence or some other condition here, but we'll prove that later. Since X and Y are independent, e^(tX) and e^(tY) are independent.
And according to that fact, the expectation of the product is the product of the expectations. But notice that that's just the product of MGFs: it's M_X(t) M_Y(t), by definition. So this is really simple in the sense that if we have both of those MGFs, we just multiply them, right? We didn't have to do an integral. We didn't have to do some complicated sum. We just multiplied the two MGFs. So that's really convenient, okay? Let's do a couple of quick examples of MGFs for specific distributions. The easiest example to start with is the Bernoulli, right? The Bernoulli is the easiest, simplest distribution, so let's start with that. If X ~ Bern(p), then the MGF is M(t) = E(e^(tX)). Now, we could use LOTUS, but we don't even need LOTUS here, because X is just 0 or 1, so e^(tX) is either e^t or 1, right, only two possibilities. With probability p it's e^t, and with probability q it's 1, where q = 1 - p as usual. So just take a weighted average of the two possible values: M(t) = pe^t + q. That's a really easy calculation. But because of that, we can immediately get the MGF of a binomial. If we wrote down the definition, we'd have to do some big LOTUS thing, but we don't have to, because if we think of the binomial as the sum of n iid Bern(p)'s, then we can use fact 3 there. We immediately know that M(t) equals the MGF of a Bernoulli raised to the nth power: M(t) = (pe^t + q)^n. So we just write it down immediately by taking the nth power of that, okay? So for practice, remember the binomial has mean np and variance npq.
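As a quick numeric check of reason 1 applied to this binomial MGF (n = 10 and p = 0.3 are made-up values): differentiating M(t) = (pe^t + q)^n at 0 should recover the mean np = 3, and via the second moment, the variance npq = 2.1.

```python
import math

# Verify that M'(0) = np and M''(0) - M'(0)**2 = npq
# for the Binomial MGF M(t) = (p*e^t + q)**n, using finite differences.
n, p = 10, 0.3
q = 1 - p
M = lambda t: (p * math.exp(t) + q) ** n

h = 1e-4
m1 = (M(h) - M(-h)) / (2 * h)             # ~ E(X)   = np = 3
m2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2   # ~ E(X^2) = npq + (np)^2

print(round(m1, 3), round(m2 - m1 ** 2, 3))  # roughly 3 and 2.1
```

Finite differences stand in here for the symbolic derivatives; doing the calculus by hand gives M'(0) = np and M''(0) = np(q + np) exactly.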
And if you wanna check that statement 1 there is true, you could take this MGF and check that the first derivative evaluated at 0 gives the mean, and the second derivative evaluated at 0 gives the second moment; from there you get the variance. Okay, so that's the binomial. One more that we need right now is the normal, and that's a bit more involved of a calculation. So let Z be standard normal. Notice that once we have the MGF of the standard normal, we know the MGF of any normal we want, because of location and scale: once we know this MGF, we can take any normal mu + sigma Z and figure out its MGF in a straightforward way. And that's good practice, to make sure you know how to do that. So let's just do the standard normal first, since that's simpler, and we wanna compute the MGF, E(e^(tZ)). By LOTUS, this is 1 over the square root of 2 pi, times the integral from minus infinity to infinity of e^(tz) times the standard normal density part e^(-z^2/2) dz; I already put the normalizing constant out front. Now this looks like a pretty nasty integral, but we will bravely try to do it anyway, because we wanna know the answer. If t is 0, then it's just 1, because then it just says integrate the standard normal PDF, and we'd get 1. So we know how to do this without the linear term, right? We have a linear term and a quadratic term in the exponent, and it's only the linear term that's annoying here. Without the linear term, it's not exactly easy, but it's one that we did, so it's easy given what we've done before. So the question is just how to get rid of the linear term. Well, then you have to think all the way back to algebra class: how do you solve quadratic equations, stuff like that? Completing the square, right?
Something you probably thought you'd never see again, because once you know the quadratic formula, you rarely bother to complete the square anymore, right? But that's what we need here, because if we complete the square, then the exponent will look purely quadratic and not have a linear piece anymore. Okay, so we just need a little bit of algebra. Let's factor out a minus one-half and attempt to complete the square: tz - z^2/2 = -(1/2)(z^2 - 2tz) = -(1/2)(z - t)^2 + t^2/2. Let's check that this is right: if we expand -(1/2)(z - t)^2, we get -(1/2)z^2, that's good; then -2tz times negative one-half, so that matches the linear term; and the only part we need to fix up is an extra -(1/2)t^2, which we cancel by adding t^2/2. So e^(tz - z^2/2) = e^(t^2/2) e^(-(z - t)^2/2), and I'll pull the e^(t^2/2) outside the integral, because that's just a constant with respect to z. That's just completing the square; it's just algebra. But by writing it this way, it's much nicer, because now we don't have the linear term anymore, right? Now it's just something squared, okay? All right, so what's this integral of e^(-(z - t)^2/2) dz? It's the square root of 2 pi, because if we include the 1 over root 2 pi, this is exactly a normal density recentered at t: we recentered it at t and didn't change the variance, okay? So we're integrating a normal density, and we must get 1. Just by recognizing that it's a normal centered at t rather than at 0, we immediately know that part is 1, and we get M(t) = e^(t^2/2).
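Here's a Monte Carlo sketch of the result just derived: averaging e^(tZ) over many standard normal draws should land close to e^(t^2/2) (t = 0.5 is an arbitrary test point).

```python
import math
import random

# Monte Carlo check that E(e^(tZ)) = e^(t^2 / 2) for Z standard normal.
random.seed(0)
t = 0.5
n = 400_000
estimate = sum(math.exp(t * random.gauss(0, 1)) for _ in range(n)) / n

print(round(estimate, 3), round(math.exp(t ** 2 / 2), 3))  # both roughly 1.133
```

The same kind of check works for mu + sigma Z once you've worked out the general normal MGF for practice.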
And from this you can derive the MGF of the nonstandard normal as well, which you should do for practice, okay? So we're gonna come back to MGFs later and use them. But I want to do one more kind of famous probability example that's not exactly related to MGFs, but which will be useful for the homework and for what we're doing next. And that's called Laplace's rule of succession, a famous old problem. Laplace was a great mathematician and physicist, and there's this calculation that he did, which he phrased in terms of: what's the probability that the sun will rise tomorrow? So suppose that the sun has risen for the last n days in a row; suppose we've been alive for n days, and every day the sun came up, n times in a row, right? What's the probability that the sun will rise tomorrow? And he kind of got ridiculed for working on that problem, probably because it seemed like kind of a crazy thing, and partly because of the assumption: we're considering random variables X1, X2, and so on, iid Bern(p), where we think of those as the indicator random variables for the sun rising on the first day, the second day, and so on. So suppose that each day the sun either rises or fails to rise, with probability p that it rises, iid Bernoullis. The other thing that's kind of fishy about this is that, well, I have never experienced a day where the sun didn't rise, but if I did, then I'd start thinking that's probably the end of the world, and it's not gonna rise the next day either. So it doesn't seem very realistic to assume that they're independent. But it's kind of fun to think about the problem in those terms, and that's just one story; you can easily think of other problems that have the exact same structure as this. So even if you wouldn't apply this to the actual question of the sun rising, it's just a useful structure. This is something we've dealt with before: iid Bernoulli trials.
Okay, but the twist to this problem is that Laplace is saying the probability that the sun rises is unknown. So p is actually unknown, and the question is how we deal with that unknown p. More precisely, let's say that given p, the trials are iid Bern(p); if p were known, this is what we'd be assuming. So all of this is conditional; this is conditional independence, okay? All of this is given p: we're assuming they're iid given p. But now we're saying we don't know: we have evidence that the sun has risen n times in a row, but we don't know for sure what the value of p is, right? So we're going to treat p as unknown. And then one of the deep philosophical debates in statistics is how we deal with unknowns, right? For many decades, there's been this controversy between Bayesians and frequentists. That's a big topic for Stat 111, and we won't talk much about it here. The question is: how do we deal with the fact that this p is unknown? Well, the Bayesian point of view is to say, since it's unknown, we're gonna quantify our uncertainty by treating it as a random variable that has some distribution, okay? Now, the distribution is just a reflection of our uncertainty. So the Bayesian approach is: treat p as a random variable. The reason it's called Bayesian is that we can then use Bayes' rule to find the distribution of p given all the evidence we have, right? So we start with some prior beliefs about p; that is, before we have any data or any evidence, we have some prior uncertainty. Then we collect data, and we use Bayes' rule, which is how we update based on evidence, okay? So we update using Bayes' rule, and then we have some new uncertainty, okay? That's the idea. So this is gonna look a little bit strange, because we're not used to treating lowercase p as a random variable, but this is just good practice in thinking carefully about what a random variable means.
So now we're going to treat p as a random variable, and Laplace said, let p be uniform. Of course you don't have to use the uniform, but Laplace did, and this is also a bit controversial. So this is called the prior, and that's also a bit controversial, right, like why uniform? Well, Laplace basically said, well, uniform should reflect complete uncertainty, just completely random, we know nothing about p, okay? But there are definitely some controversial issues about that. Well, let's assume that for now. So here's the structure of the problem, and let's let Sn be the sum of the first n of them, okay? The structure is that Sn given p is binomial. That is, conditional on p, and this just means treating p as a known constant. Which is what we've been doing most of the time, but not always. This relates back to the problem about the random coins and things like that, right? The difference between independence and conditional independence, so this is just another example of that. If we know which coin we have, if we know the probability of the sun rising, then the assumption here is that they're iid. And the sum of n iid Bernoulli(p)'s, we know, is Binomial(n, p). But p itself is random with the uniform distribution. That's the structure, okay? And the problem is, first of all, find the posterior distribution. By definition, the posterior distribution means the distribution after we collect the data. This part is the prior, and the posterior is p given Sn, where we assume we observe Sn. You could also assume that you observe X1 through Xn, and it turns out that you would get the same answer; this is what's called a sufficient statistic, which again is a 111 topic that you don't have to worry about per se. But it turns out that just observing how many of these are one is enough, so we can condition on Sn or we can condition on X1 through Xn and it won't matter.
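That independence-versus-conditional-independence point can be checked numerically. In this sketch (my own, not the lecture's), X1 and X2 are iid Bernoulli(p) given p, with p ~ Uniform(0, 1); marginally, though, observing X1 = 1 makes X2 = 1 more likely, because it is evidence that p is large.

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility

def marginal_vs_conditional(trials=300_000):
    """With p ~ Uniform(0, 1) and X1, X2 iid Bernoulli(p) given p,
    estimate P(X2 = 1) and P(X2 = 1 | X1 = 1)."""
    x1_count = x2_count = both = 0
    for _ in range(trials):
        p = random.random()
        x1 = random.random() < p
        x2 = random.random() < p
        x1_count += x1
        x2_count += x2
        both += x1 and x2
    return x2_count / trials, both / x1_count

p_x2, p_x2_given_x1 = marginal_vs_conditional()
```

The exact values are P(X2 = 1) = E(p) = 1/2 and P(X2 = 1 | X1 = 1) = E(p²)/E(p) = 2/3, so the indicators are dependent unconditionally even though they are conditionally independent given p.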
And the other question, the more practical question, is what's the probability the sun is gonna rise tomorrow? So that's the probability that Xn + 1 = 1 given Sn = n, let's say. So the sun has risen for the last n days; what's the probability that it will rise tomorrow, okay? So that's what we wanna do. So how do we do that? Well, the answer is just Bayes' rule. It's just that it's an unfamiliar form of Bayes' rule, but it's completely analogous, okay? So to find this posterior distribution, I'm just gonna write down something that looks exactly like Bayes' rule, okay? So we wanna find f(p), and I'm just using f to kind of generically mean PDF, since p is continuous. Before we have data, we're treating p as uniform; then after we have data it's just gonna have some density, a PDF, but it's a conditional PDF. So I'm calling that f(p), let's say, given that Sn = k, okay? So we're especially interested in the case where k is n. That is, the sun has risen for the last n days, but we may as well consider it more generally: the sun has risen on k of the last n days, okay? Given that information, how do we update our uncertainty about p? It's just Bayes' rule; it's just a form that we haven't seen before, because on the one hand, these are PDFs, and PDFs are not probabilities, okay? But we can think of a PDF intuitively as a probability, or at least, if we took a PDF times some little increment, then that's going to be approximately the probability of being in that little interval. So Bayes' rule in this case looks just the same as if you ignored the fact that that's a density and not a probability. So we're gonna swap those two things: I'm gonna say it's the probability of Sn = k given p. The notation takes a little while to get used to, because we try, when possible, to distinguish random variables and their values with lowercase and capital letters.
But that's a little hard to do here, because we let lowercase p be a random variable, and I don't want to start letting capital P be a random variable. So you can make up some new notation if you want, but it's easier to just think about what things mean. So, this given p means we're just treating p as a known constant, even though it's a random variable. So, Bayes' rule: we swap these things, times f(p), that's the prior. And it's also just equal to one, because we used a uniform prior, so that makes it easy. Divided by the probability that Sn = k. This thing here in the denominator is unconditional; the numerator is conditional given p. The numerator is a function of p; the denominator does not depend on p, it's not allowed to. If we wanted to define the denominator directly, we would use something that looks like the Law of Total Probability. This is the continuous version of the Law of Total Probability, which we haven't done yet, so I'm doing it now, but it's completely analogous. If we want the probability of this event, then rather than doing a sum, we do an integral, cuz it's continuous. We just do the integral of P(Sn = k given p) times f(p) dp. So that's the continuous version of the law of total probability. We don't actually need it now, and I'll show you why. But I wanted to just tell you a little bit more about what this denominator means, okay? This denominator is a constant that doesn't depend on p. So let's just look at this thing up to proportionality, right, that's a proportionality symbol. We're gonna ignore the denominator, because it doesn't depend on p, right? And for the numerator, that's just from the binomial, that's n choose k times a function of p. n choose k is also a constant, that is, it doesn't depend on p. So I'm gonna ignore the n choose k.
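That continuous law of total probability can be evaluated numerically. A small sketch (my own, with a midpoint-rule integral and n = 10 chosen arbitrarily): P(Sn = k) = the integral over [0, 1] of (n choose k) p^k (1 − p)^(n − k) times f(p) dp, with f(p) = 1. It comes out to 1/(n + 1) for every k, i.e. Sn is marginally uniform on {0, 1, ..., n}.

```python
import math

def marginal_pmf(n, k, grid=20_000):
    """P(Sn = k) = integral over [0, 1] of C(n, k) p^k (1-p)^(n-k) f(p) dp,
    with the Uniform(0, 1) prior f(p) = 1, via the midpoint rule."""
    total = 0.0
    for i in range(grid):
        p = (i + 0.5) / grid  # midpoint of the i-th subinterval
        total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return total / grid

n = 10
probs = [marginal_pmf(n, k) for k in range(n + 1)]  # each is about 1/(n + 1)
```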
That's just p to the k times (1 − p) to the n − k, okay? And the f(p) part is just one, so that's actually easy. To get the constant in front, we'd have to integrate this thing, and we're gonna do that much later in the course. But for now let's just do the easier case, f(p | Sn = n). So that's the case where the sun did rise for the last n days. Now this is just p to the n, and this one is easy to normalize, right? Because the antiderivative of p to the n is p to the n + 1 over n + 1, so the integral from 0 to 1 is just 1 over n + 1. So to normalize it, we'll just stick an n + 1 in front. Now this is a valid PDF, okay? And so in other words, we got this without ever evaluating the denominator. And then lastly, if we want P(Xn + 1 = 1 | Sn = n), just think of it this way: by the fundamental bridge, we want the expected value of a random variable with this posterior distribution. So that's just going to be the integral from 0 to 1 of (n + 1) times p times p to the n, dp. The integral of p to the n + 1 is p to the n + 2 over n + 2, so we get n + 1 over n + 2. So according to Laplace, if the sun rose 100 days in a row, then the probability that it rises tomorrow would be 101 over 102.
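As a numerical sanity check (my own sketch, not part of the lecture), one can verify that the posterior f(p | Sn = n) = (n + 1) p^n integrates to 1, and that its mean, which by the fundamental bridge is P(Xn + 1 = 1 | Sn = n), equals (n + 1)/(n + 2):

```python
def posterior_checks(n, grid=100_000):
    """Midpoint-rule integration of the posterior f(p | Sn = n) = (n+1) p^n:
    the total mass should be 1, and the mean should be (n+1)/(n+2)."""
    h = 1.0 / grid
    mass = mean = 0.0
    for i in range(grid):
        p = (i + 0.5) * h          # midpoint of the i-th subinterval
        dens = (n + 1) * p**n      # posterior density after n sunrises in a row
        mass += dens * h
        mean += p * dens * h
    return mass, mean

mass, mean = posterior_checks(100)  # Laplace's example: 100 sunrises in a row
```

For n = 100 the mean comes out to 101/102, matching Laplace's answer.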
Info
Channel: Harvard University
Views: 122,370
Rating: 4.9145041 out of 5
Keywords: harvard, statistics, stat, math, probability, moment generating functions, laplace, rule of succession, bayes' rule
Id: N8O6zd6vTZ8
Length: 50min 44sec (3044 seconds)
Published: Mon Apr 29 2013