Okay, so we've been talking about
conditional expectation, right? And I want to do one more example of
conditional expectation, if I can. Okay, so one more example
of conditional expectation, then the main topic for today is
inequalities, that is, statistical inequalities. So okay, all right, so here's one
more conditional expectation problem. So all right, so
suppose you have a store and different customers show up at your store
and spend different amounts of money. Doesn't have to be a store and customers, just to have a concrete example,
I'll say it that way. That's clearly a very, very general type of setting that
would come up in a lot of cases. So I'll just say store with
a random number of customers. But you can make up your own story for
it, but just have something concrete. That's what I'm thinking of right now, random number of customers,
Which is just pretty realistic, right? You don't know how many customers
you're gonna have, and then each customer chooses to spend
some amount of money, maybe zero. And then you wanna know
how much money you got or how much profit or whatever, so
it's a very natural problem. And so let's let Xj
be the amount that the jth customer spends, okay? And then let's assume that there are,
why did I say topics? >> [LAUGH]
>> I have no idea. Random number of customers. The number of topics in
this class is fixed, which is to say
we'll do these topics. Actually it's not entirely fixed because
I might start rambling or something and not cover something. But actually it's been pretty much fixed,
we covered exactly what I want to cover for the most part since I've
been teaching this course. Okay, so random number of customers,
now let's say N, N = # of customers in a day or
in a week or in some time period. N is a random variable,
the number of customers. So maybe it's Poisson;
that would be reasonable. If you had to guess the distribution,
maybe you might use a Poisson. I don't need to specify
the distribution right now, I'll just say that's N and
Xj is the amount jth customer spends. And let's assume Xj has mean mu and variance sigma squared. So I'm assuming that they all have
the same mean and variance for how much the customers spend. Of course, you can generalize
this problem in different ways. But for now, let's assume they
all spend the same average. They may spend different actual amounts,
but they spend the same average amount and they have the same variance. And let's assume that N and this sequence of expenditures
are independent. Okay, so we're not necessarily
assuming that the Xs are iid, but we are assuming that
they're independent. So [LAUGH] it's not like the second
customer sees how much the first customer spent. And wants to spend more than that
person or something like that or they come in groups and families and
decide together or something like that. They're just independent customers and
the other important assumption here is, the number of customers is independent
of the individual choices for how much to spend. So I mean that sounds like
it maybe plusable, and I'm sure you can think of examples
where this would break down, right? Maybe N is so large that you can't
fit everyone in the store, and then maybe people start
leaving cuz it's too crowded. Or maybe then they're more
determined to buy a lot of stuff, cuz then they think this is a really good,
all kinds of things could happen. Or maybe N being very large
is an indication that they're having some really good sale that day and
so on. But anyway,
we're assuming they're independent. Okay, and then we wanna find the mean and variance of the total expenditure,
right? That's how much revenue you would take in. So let's call that X; X is just the sum of Xj, j = 1 to capital N. So it's just a sum, right? Just the total amount, but
what's unusual about this sum, compared to what we've seen generally
is that in upper index here, capital N is a random variable. So we're adding up a random
number of random variables, okay? So that's the setup, and
if we just tried to use linearity, your first thought about
linearity may be to just sum. Then you might write E(x) =,
well, there's N terms, and each one has mean mu,
so you just go N times mu. However if you did that, how should
you immediately know this is wrong? Yeah, the right hand side is
a random variable, not a number. So that would be a category error; E(x) is supposed to be a number, okay? It can be based on the various
constants we have, but it can't involve a random variable. Capital N is a random variable, so that's
completely wrong, it's a category error. However that category error actually
suggests something useful to us, which is that we kind of
wish that N were a constant. Because if N is a constant,
this is not a category error anymore, just saying a constant
equals another constant. It might be true or false, but at least
it's not a category error anymore. And if N is a constant,
then really that is just linearity, okay? So this terrible blunder actually
tells us what we should do, that we wish that we knew the value of N. So we could treat it like a constant,
therefore let's just condition on N. So condition on N, that just means,
well we write E(x), I'll do this in two different notations,
it's the same thing. We'll do E(x) equals conditioning, so this is
analogous to the law of total probability. The expected value of X
given big N = little n times the probability that big N = little n,
right? Just condition on the value of N. Now this is the sum, n = 0 to infinity. This conditional expectation means
we get to treat N as known to equal little n, so
we know we have little n customers now. In that case,
we really can just apply linearity, right? So we're gonna plug in
big N = little N and here's where this assumption is important. Cuz if we didn't have this assumption what
we would do, is plug in big N = little n. But just like in earlier examples we saw,
like the two envelope paradox, we can't then forget
the condition big N = little n. In this case, though we plug in big
N = little n, they're independent. Big N is independent of the Xs,
so we can forget the condition. And so then by linearity,
we can just write down that that is: you're adding up little n things each
with mean mu, so that's mu times n. So mu is just a constant that comes out,
and what's left in the sum is just the sum of n times the PMF, which by
definition is the expected value of N. So we know that that's
just mu times E of N. So the correction of that blunder is that the N
should be the expected value of N, not N itself. So let's also do this using Adam's Law, which is the same thing, just more compact notation. So we want E of X, and we want to condition on N because then
we can just apply linearity, right? It's more familiar to deal with
a fixed number of terms rather than a random number of terms. So Adam's law, or iterated expectation
just says we can do E of E of X given N. Okay, to get E of X given N,
we just take E of X given N = little n,
which was mu times little n. And we replace little n by big N. So that's mu times N. That is, this notation
means treat N as a known constant. Then by linearity it would just be mu N. And so again, this is mu times E of N,
so we get the same answer. So you can see that this
is more compact than that. Less writing, it's shorter and nicer. But in terms of the meaning and intuition,
both of these mean the same thing. So it's good to be comfortable kind of
writing it out longhand like this and in shorthand like that; both are useful. Okay, so that's the mean. Let's get the variance. So for
the variance we're gonna use Eve's Law. So Var of X = same idea, right? Condition on N, so the expected value of the variance of X given N + the variance of E of X given N, okay? Now, let's just evaluate these two
terms, the variance of X given N. To define that, we really just have to
understand what that notation means. We're treating n as known, and
then I say, what's the variance of that? Well, we know that for the variance of the sum of a fixed number of
independent random variables, you just add up their variances, right? We've proved that before. There are no covariance terms because
I assume they're independent. So if we're treating n as a constant, the variance is just n times
the variance of one term. So that's just N sigma squared +, we need the variance of
the conditional expectation. But we already found that
the conditional expectation, E of X given N, was mu times N, as above. Okay, then to simplify that a little bit,
we can just take out the sigma squared. So this is,
let's write it as sigma squared times E of N +, and then the mu comes out squared, so it's
mu squared times the variance of N. So that's the variance,
in terms of the mean and variance of N. Of course I could have said N is
Poisson or something like that, Poisson lambda, and
then we would just plug in lambda and lambda, but
this is more general. All right, so let's quickly check whether
these answers make intuitive sense. So for the mean, I think this result
is pretty intuitive cuz it says the average amount of money that the store
will take in is the average number of customers times the average
amount that each customer spends. So that's pretty intuitive. As we've seen many times, intuition can
be wrong in this class, but in this case, I think this is pretty intuitive. And then for this one, well, let's just do a quick check that this
even makes sense in terms of the units. Now, capital N or little n, we're
just talking about a number of people. It doesn't really have units;
it's just counting people, okay? People are not units in the sense
that meters, and inches, and seconds are units. Now on the other hand, mu,
well, we're measuring in some currency, dollars or euros or whatever; let's assume dollars. So mu is in dollars, and sigma squared is in
dollars squared, which is why we like to work with standard deviation rather
than variance when we try to interpret things, cuz I would rather
work with dollars than dollars squared. So this,
if we want the standard deviation, we just take the square root of that. Notice if we take the square root of this,
we're gonna get dollars. And if this were mu to the fourth or
sigma cubed or something, it wouldn't make any sense. You'd be trying to add dollars cubed to
dollars to the 4th or something like that, which wouldn't make much sense, okay? So it makes sense in terms of the units. Okay, so similarly, if you want the MGF of X, well,
again just condition on N. If we knew that N is five, then we're just adding up five
independent random variables, so we know that for the MGF we just
multiply those five MGFs, right? So it would be very, very straightforward. Assuming that we know
the MGF of each Xj, it will be very straightforward to get
the MGF of this if N is a constant. Okay, but that tells us we can get
the MGF in general by conditioning on N. Same idea, so you can work that one out for
yourself. Okay, all right, so now we move on to inequalities, right, statistical inequalities; there are four of them
that we need in Stat 110. So there's a sense in which inequalities deserve a lot more attention than
they usually get in most courses. And so, there are different
ways to explain that, but one I particularly like was I had
a conversation recently with one of the leading experts in the world on the
interface between statistics and the law. And he was making a point that if you're
in court as a statistical expert witness, which is a common thing for
statisticians to do. It's a lot easier if you have
an inequality than if you have an approximation. And I know a common mistake in
the past in this course has been to kinda confuse approximations
with inequalities. So I wanna make sure that
distinction is clear, then we'll go through the inequalities. The distinction is just that,
we did the Poisson approximation, right? That is under certain conditions, you can say that a certain distribution's
approximately Poisson, and that's gonna be a good approximation
under certain conditions. That's extremely useful because there are
a lot of problems where it's just too hard to do it exactly, but we can get a good
approximation without that much effort using Poisson approximation. For example, later in the course we'll do normal
approximation under some conditions, a lot of conditions that are pretty
realistic, where normal distributions give us good approximations, I think. Those are approximations. Right now we're talking
about inequalities. Now, of course, they're related. If I prove that a certain probability
is between .36 and .38, right? So then I have both upper and
lower bounds, right? And then I can say well, the probability
is somewhere between .36 and .38 so I would guess .37. But at least I have bounds
in both directions. But if all I say is that the probability
is less than .38, well, it could be .004 is less than .38, so that's not an
approximation, that's just a bound, okay? So that's the distinction. And the reason that this guy who I was
talking to was saying that you're much happier in court if you have an inequality
is that basically [COUGH] you can kind of imagine what would happen. But let's say I'm the expert witness and
I use my Poisson approximation on something and then you can just imagine
kind of being cross examined, right? Dr. Blitzstein,
you claim that this approximation is good, can you explain what you mean
by a good approximation. And then I'd say well,
good means that it's close to the truth. And then the lawyer could say, well, is there an accepted standard
about how close, close is? And do you know how close it is, right? And I'd have to say,
well if I knew exactly how close it is, then I'd actually know the answer, right? >> [LAUGH]
>> And there is not a standard for what does good mean. So what one person says
is a good approximation, another person could say is
a lousy approximation, right? You don't wanna get into that, right? And lawyers are good at kind of
tripping you up in that way. >> [LAUGH]
>> However, if I had an inequality, then I can just say the probability,
I've proven, is less than 0.37. And then there's basically not much
that can be said about that, right? I actually proved a theorem that says
the probability is less than 0.37, okay? It's kind of interesting, right,
because there's still randomness and uncertainty that's why we're
using probability, but we've proven a definite fact
about something random. So it's very advantageous a lot
of times to have inequalities. All right, so we're gonna talk about
the four most important inequalities arguably in statistics. The first one, we've already seen in
some forms, that's Cauchy-Schwarz. A lot of you have seen Cauchy-Schwarz
in a linear algebra or math class. For random variables,
Cauchy-Schwarz says that the expected value of X times Y is less than or
equal to the square root of E(X squared)
times E(Y squared). That's true. You can put absolute values
around it also if you want. Still true.
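The inequality just stated is easy to check numerically. Here's a quick Python sketch; building X and Y from a shared normal component Z is just an arbitrary way to make them correlated for illustration:

```python
import math
import random

# Check Cauchy-Schwarz for random variables: |E(XY)| <= sqrt(E(X^2) E(Y^2)).
# X and Y share a common normal component Z, so they are correlated.
random.seed(2)

n = 100_000
sum_xy = sum_x2 = sum_y2 = 0.0
for _ in range(n):
    z = random.gauss(0, 1)
    x = z + 0.5 * random.gauss(0, 1)
    y = z + 0.5 * random.gauss(0, 1)
    sum_xy += x * y
    sum_x2 += x * x
    sum_y2 += y * y

lhs = abs(sum_xy / n)                         # estimate of |E(XY)|, about 1.0
rhs = math.sqrt((sum_x2 / n) * (sum_y2 / n))  # estimate of the bound, about 1.25
print(lhs <= rhs)                             # True
```

Here E(XY) comes out around 1 while the bound is around 1.25, so the inequality holds with room to spare; in fact the sample version of Cauchy-Schwarz holds exactly for any data set, not just on average.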
When we're talking about that geometric interpretation of conditional expectation, I mentioned the fact that this E(XY)
is playing the role of the dot product. That is, you're familiar with
the dot product of vectors and this is kind of the analog
of a dot product. So for those of you familiar with
Cauchy-Schwarz just in general from linear algebra, this looks the same
once you interpret it that way. But even if you've never seen
Cauchy-Schwarz before, you can just think about what
this inequality says and understand it
in its own right. Notice that if X and Y are uncorrelated, then by definition of uncorrelated,
then E(XY) equals E(X) E(Y). That's just the definition of
what it means to be uncorrelated. So in that case,
it would be crazy to use this inequality because we have an exact equality, right? And this is just an inequality, you can
see the direction makes sense, right? Because E(X squared) is bigger than or
equal to the square of E(X), so it's true, but there would be no point. Just this equals this, okay? So where this is useful is
the case where they're correlated. I mean, it's good that it's true anyway. Just so we don't have to break things down
into separate cases, correlated case, uncorrelated case. It is always true. Okay, but to see why this is telling us
something interesting in the correlated case, kind of the cool thing about
this is if we want to compute E(XY) in general,
we'd have to use the 2D LOTUS, right? That is X and Y,
there's some joint distribution. Well, either we could do a Jacobian and find the distribution of X
times Y in the continuous case. And then so find the PDF of this or
we could use the 2D LOTUS. And that could be very,
very messy and difficult. So this is based on
the joint distribution. This is separating it out into
this is a marginal thing, and this is a marginal thing. That is,
this is the marginal second moment of X. That is, this is just the expected
value of X squared; there's no Y in this expectation and there's no X in this one,
so it separates them out. So that's nice. Okay, and the interpretation,
the statistical interpretation is easiest to see in the case
where they have 0 mean. And this is the case we've
talked about before. Because if X and Y have means 0, Then the correlation between x and y. Well, in general to get the correlation, we take the covariance divided by
the product of standard deviations. The covariance is E(XY) minus E(X) times E(Y),
but I'm assuming mean 0, so E(XY) is the covariance. And I divide by the product
of standard deviations, but the variance of X is just E(X squared)
because it has mean 0. So we just divide by E(X squared)
times E(Y squared), square rooted. That would be the correlation. And let's take the absolute
value of the correlation. When we introduce correlation, we prove
that it's always between -1 and 1, right? So we already showed that
correlation is between -1 and 1. But notice that this statement
is exactly the same as the statement of Cauchy-Schwarz, okay? So it's the same thing. So in statistics Cauchy-Schwarz means
the correlation is between -1 and 1. So I'm not gonna go through a different
proof of this cuz we already proved this fact. And this is just a small extension
that says this is still true even if they don't have mean 0, and
that's just a fact from linear algebra. But this is a very nice interpretation for
our purposes, okay? So that's Cauchy-Schwarz. And you can see why it would be kind
of nice: this joint thing may be hard to compute, and these marginal things may be much easier. This is an upper bound. It may not be a good approximation, right? It's probably pretty bad if you
try to use it as an approximation. It's an upper bound, and
the strengths are simplicity and generality, not that it
gives you an approximation. Okay, so our second inequality Second
famous inequality is Jensen's inequality. Which we've already seen versions of, but we have stated it in general or
talked about it as its own topic. So Jensen's inequality, Says that if lower case
g is a convex function, and I'll remind you of what that is, then for any random variable x, the expected value of g of x is greater
than or equal to g of the expected value of x. So it's pretty nice. When you have convexity, it tells you which way
the inequality's gonna go. Right, one of the biggest blunders in
probability is to move the E inside. Move the E everywhere,
you can't do things like that and this tells you specifically which
way it goes for convex functions. So, okay, just to make sure everyone
knows what a convex function means. If the second derivative exists, it just means that g''(x) is
greater than or equal to 0. That's usually the easiest way to
determine if a function is convex, just take the second derivative. So a simple example would be,
y equals x squared, and you can draw this U-shaped thing. For y = x squared, the second derivative is 2, which is positive, so this is convex. So at least when I took AP calculus
this was not called convex, it was called concave up which
was kind of a stupid terminology, at least no one actually uses that
once you get past AP calculus. This is convex, and
we also had mnemonic, so concave is the opposite,
if the second derivative is negative, or less than or equal to 0,
then we say it's concave. But we don't really need
to study that separately, because if we have a concave function,
let's just say h is concave, I'll write it that way. It just means the inequality
flips, and you can see that right away, because if it's concave just take
the negative of it, and that's gonna flip the second derivative from being less than
or equal to 0 to being greater than or equal to 0. Apply Jensen's inequality, and because
of the minus sign the inequality flips. So it just says it goes the other way for
a concave. So, anyway,
we used to have a mnemonic for this, which was that concave up holds water. Have any of you heard
that mnemonic before? It would be nice if it died out, so I
guess I shouldn't be repeating it; anyway, that's a very bad mnemonic. Because, first of all, I don't really
see why concave up holds water is more memorable than concave down holds water, so it doesn't actually tell you which
way it goes, and secondly it's wrong. So it was worth
having this mnemonic just so that some mathematicians could write a paper called
does concave up holds water hold water. And the answer was no, and it gave some
examples where that doesn't actually work. So the way I remember it is just
remember that this is convex. You just have to remember one simple
example of a function that's convex, okay? And then go back to this picture. And this one is an especially good example
to think about because we already knew that E of x squared is greater than or
equal to the square of E of x. We already knew that fact,
because variance is non-negative. So if you ever forget which
direction this inequality goes, just think back to your friendly old
parabola x squared is convex and inequality goes this way,
we already knew that, okay? So you shouldn't get confused about
which way the inequality goes. So that's an example, and we're gonna
prove that this inequality is true, just doing a couple of examples first. Ah, I should say what the definition is; taking the second derivative is usually the easiest way to figure
out whether a function is convex, but the definition of convex is
a little bit more general. For example if we had
an absolute value function, you know it looks like a v shape and
that's y equals absolute value of x. So the derivative does not exist at 0,
because it has a sharp corner, that's still a convex function though. So the definition is that if you
take any two points on the curve and connect them; let's say I just pick
two points and connect them. This line segment is above the curve,
that's what it means geometrically. Pick any two points you want,
you go like that and it's above the curve, it doesn't cross below the curve. That's what it means geometrically, so that's true for
the absolute value, as well, okay. So that's the geometric interpretation,
but if the second derivative
exists it's usually easiest to just take the second derivative and
see if that's non-negative, okay. So to do a couple other quick examples,
then we'll prove this theorem. What if we have the expected value of
one over x? Let's let x be a positive random variable for this part, so I don't have to worry about dividing by 0
or negative numbers and stuff like that. So that's x to the negative 1, so
the first derivative is minus x to the minus 2, and
the second derivative is 2 over x cubed. If x is positive,
then 2 over x cubed is positive. So this is convex as
long as x is positive. It's convex, so this is greater than or
equal to 1 over E of x. So let's keep x positive for
a couple of examples. Okay, so that's true, and then, what about
expected value of log x, again, I'm assuming x is positive, so I don't have to
worry about the log of a negative number. The derivative of log x is 1 over x, and then the second derivative is minus 1 over
x squared, which is negative, so it's concave. So we know this is gonna be less than or
equal to ln E(x). Okay, and so on. So it's pretty straight forward. So, okay,
let's prove that this is true now. And we should also discuss
when equality holds here. In this case, we know this equals this only
in the case when x is a constant. Right, because then the variance is 0,
which means you have a constant, okay? So let's talk about that,
all right, so proof of Jensen. So let's draw a little picture again. All right, well I could think of
a more creative convex function, but I'm just gonna draw our
familiar one again. That's what a convex function looks like. Now, kind of a geometric fact about
convex functions is what you can see in the picture I'll draw; you would prove
this formally in an analysis course. But just to see it geometrically, just
imagine we have this convex function and, take any point, let's say here,
and draw a tangent line. And that was a pretty bad tangent line, but anyway, it looked too thick. This is supposed to be tangent here,
and then it's below the curve, right. So, or try it over here. Take a point here,
draw a tangent line, and go like that. Or draw it at zero, where the function takes its
minimum; then the tangent line is just the x axis. And any of these tangent lines you draw,
it's gonna stay below the curve, right. So that's the whole, that's the only
fact essentially that we need for Jensen's inequality, for the proof. That if you draw this line, so
let's actually draw this line. Say this is the point (mu, g(mu)), okay. So that's a point on the curve. And suppose we draw a tangent line there. So it goes through there. Then what we're asserting is
that g(x) is greater than or equal to, let's say, a+bx, where
that's the equation of the line. So suppose that this
tangent line is the line y=a+bx. And the statement that this curve
stays above the line is just the statement that g(x) is greater than
or equal to a+bx for every x, okay. Once you've studied the geometry
enough to write down this inequality, then Jensen's inequality follows
very easily because this is true for every number little x, yeah,
in the domain that we're looking at. So that's also true as an inequality for
random variables, that is no matter what value x takes, here we're talking
about comparing random variables. I'm saying that this event, that this
random variable, is bigger than or equal to this one, always occurs. So we know for
sure that this is true for capital X, then just put the expectation
on both sides. And then we know that E(a+bX) is a+bE(x). I'm letting mu equal E(x),
that's the notation. So that's a+b mu, but we chose this line such that it intersects
the curve at that point. So at the point where x equals mu,
the value a+b mu
is the same thing as g(mu), which by definition is g(E(X)). So that's Jensen's inequality. Okay, so
it's a pretty short proof once you have the geometric picture in mind. You can also prove this by doing
a Taylor expansion argument, but you can look in the books for that. But I kind of like having a more geometric
perspective on it for various reasons. Okay, so
that just leaves two more inequalities. There are a lot of other inequalities
in statistics, but this is what I consider the top four, and these
are the only ones we need for this course. And you'll see why, later in this semester
you'll see why we need these ones, aside from the fact that they're
interesting in their own right. Okay, so the third one is
called Markov's inequality. The very last topic in Stat
110 is gonna be Markov chains. Same Markov, different idea. Markov's inequality says that the
probability that any random variable x, let's say absolute value of x
is greater than or equal to a, is less than or equal to expected
value of absolute value of x divided by a for
any constant a greater than 0. So, we're gonna prove this in a minute,
but the strength of this inequality is not that it gives a good
approximation; its strengths are simplicity and generality, that this is completely
general for any random variable. Of course, you could have a random variable
where this is infinity, and that's a pretty bad inequality, a probability
less than or equal to infinity. Okay, but it's still true. And in fact, in some cases,
the right-hand side is bigger than one. In which case this is true but
tells us absolutely nothing, okay. So this is a simple crude inequality,
and so let's prove it. Well the proof is basically
to use the fundamental bridge: I'm gonna convert
this probability of an event, which is the same as the expected value of
the indicator of that event, right. So that's the same thing
as the expected value of the indicator of x greater than or
equal to a. I'm just using this as notation for a one
if this event occurs, zero otherwise. That's the same thing as this,
right, fundamental bridge. And let's multiply by a. I'm just thinking of the same inequality
but with an a on the left, so I'm rewriting the left-hand side,
except put the a over there. And then let's see, how does this thing
compare with the absolute value of x? Okay, this inequality is always true,
let's just think why. Let's
do this without the expectation first, then we'll bring in the expectation. Okay, so I say that this
inequality is always true because, there are only two cases to consider,
right. Anytime you have an indicator of
a random variable, either it's zero or it's one, right. If it's zero, that just says zero less
than or equal to the absolute value of x. So of course that's true, all right. That's one case. The other case is the indicator is one. So this I sub whatever is one. So the left-hand side becomes a. Now in that case,
the indicator equaling one says that the absolute value of x is greater than or
equal to a. But that's what we just said, all right. Replace the indicator by one; it says the absolute value of x is
greater than or equal to a. That's what we just said. So this is always true. So I'll just write,
note that this is true. These are random variables, but this relationship always holds
between those random variables. Once you recognize that this
is less than or equal to this, then Markov's inequality just follows
just by putting E on both sides. So a times the expected value of this indicator, taking out the a, which is a constant,
is less than or equal to E of the absolute value of x. And by the fundamental bridge, that's
the same thing as Markov's inequality. So that proves Markov's inequality. Okay, so if you want a little
bit of intuition on Markov's inequality, let's think
of a simple example. So, then we'll do the last inequality. All right, so here's a simple
little example to think about. Suppose that we have 100 people. Okay, and let's just think intuitively
about a couple simple questions. And we just proved this, but that doesn't make it intuitively
obvious to most people, so we should think also
about the intuition. Okay, suppose we have 100 people and suppose we ask is it possible that, let's say 95% of the people, I'll even say at least 95% of the people are, let's say, younger than the average
person in the group. Average meaning mean. Is that possible? Yes. Why? You have 100 people,
they all have different ages. Usually I do income here, but
I'm trying to avoid that. Yeah?
>> [INAUDIBLE] >> Older, one person's much older. So one of these 100 people is really,
really, really old. That one person is gonna pull
up the average a lot, right? And so then it's easily possible
that 95 people could be younger than the average right, cuz one person
could pull up the average a lot, okay? So that's pretty intuitive,
but this is possible. If we talk about median
that's a different thing. But here I just mean mean. Okay, so that's possible. The answer is yes. Now, let me ask you a similar question. Is it possible, same question, that at least 50% of the people are older than twice the average age? Take the average age, double it, can more than half of
the people be older than that? No, why not? Yeah, so, because just taking
those 50% who are more than double the average age, you just compute
their average or the total. Let's think about the total, cuz if the average is mu for
100 people, then the total is 100 mu. Now, suppose you had 50 people who
are all older than double the average; those people alone would already make
the average bigger than what it is, which is impossible, right? Just those people would already pull
up the average from what it was, which doesn't make sense. Okay, so that's impossible. Similarly, you can't have more
than one-third of the people be more
than triple the average age, and so on. Right? It's impossible. That is exactly what
Markov's inequality says. So, that is the intuition. All right.
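Both halves of this intuition can be verified with concrete numbers. Here's a small Python sketch; the age lists are made up purely for illustration:

```python
import random

# 1) It IS possible for 95+ out of 100 people to be younger than the mean:
#    one extremely old person pulls the mean way up.
ages = [20] * 99 + [5000]                  # made-up ages
mean_age = sum(ages) / len(ages)           # 69.8
num_below = sum(a < mean_age for a in ages)
print(num_below)                           # 99

# 2) It is NOT possible for more than half to be MORE than double the mean:
#    that's Markov with a = 2*mu, giving P(X > 2*mu) <= mu/(2*mu) = 1/2.
random.seed(3)
for _ in range(1000):
    group = [random.randint(0, 100) for _ in range(100)]
    mu = sum(group) / len(group)
    frac = sum(a > 2 * mu for a in group) / len(group)
    assert frac <= 0.5                     # never violated
print("Markov bound held in every trial")
```

The first check shows that the mean, unlike the median, can be dragged arbitrarily far by a single person; the second is just Markov's inequality in disguise.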
So, our last inequality is Chebyshev's,
which is another famous inequality. Chebyshev's inequality follows almost
immediately from Markov's inequality. Which is kind of ironic because in real
life Chebyshev was Markov's adviser. Both of
them are famous mathematicians for other reasons, but these inequalities
are very useful but very simple; they're crude, general upper bounds. Chebyshev's basically says, well,
let me write down the inequality. It says that the probability that x minus its mean,
we're just letting mu equal E of X, in absolute value, is greater than something. So here, we're just looking at
differences from the mean: the probability that the difference exceeds some number a is less than or equal
to the variance divided by a squared. So mu is the mean. And a is just, again, any positive number. Okay?
So it's kind of similar in spirit. Except we're looking at
the difference from the mean. And we get variance and
a squared thing up here. Okay?
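Here's a quick Python sketch comparing this bound with an actual tail probability; the Exponential distribution with mean 1 (so variance 1) is an arbitrary illustrative choice:

```python
import random
import statistics

# Chebyshev: P(|X - mu| > a) <= Var(X) / a^2.
# For Exponential(1), the mean and variance are both 1.
random.seed(4)

samples = [random.expovariate(1.0) for _ in range(200_000)]
mu = statistics.fmean(samples)
var = statistics.pvariance(samples)

a = 2.0
tail = sum(abs(x - mu) > a for x in samples) / len(samples)
bound = var / a ** 2

print(tail <= bound)   # True: tail is about 0.05, bound is about 0.25
```

The bound (about 0.25) sits far above the actual tail (about e to the minus 3, roughly 0.05), which is the point: Chebyshev is a crude but completely general bound, not an approximation.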
And the other way to write this is that the probability that x minus mu is greater than, let's say,
c times the standard deviation is less than or equal to 1 over c squared, where, again, c is greater than 0. So this says that the probability
that x is more than two standard deviations away from
its mean is at most one quarter. Right? 0.25. So you can see why it's kind of cool: in the normal case we have
the 68, 95, 99.7% rule, remember? Part of it says that the probability
that a normal random variable is more than two standard deviations
away from its mean is about 0.05. And Chebyshev's inequality says that that
would always be true, except with 0.25, that's 1 over 2 squared, rather than 0.05,
so it's a crude upper bound. And the proof is very easy once
we have Markov's inequality. And this form is equivalent to the first
just by letting a equal c times the standard deviation;
then those are the same thing. So to prove the first form here,
let's just use Markov's inequality and just do one step first,
which is to square both sides, right? So, let's square both sides. And since we're dealing with
a non-negative random variable here, and a is a positive number, it's an equivalent event if
we just square both sides. So that's squared, and we can drop the absolute
value cuz we squared it, greater than a squared. Now let's use
Markov's inequality on this term. So by Markov's inequality,
this is less than or equal to the expected value
of this divided by this. Which is E(x-mu) squared,
divided by a squared. So that's just Markov,
we can put greater than or equal here, it's still true either way, but
I'll write it with greater than or equal. Markov's inequality,
so this is less than or equal to this, right, just immediately
applying Markov's inequality there. But the numerator, that's just
the definition of variance, right? So that's the variance of x divided by
a squared, which is what we wanted. Okay, so
that proves Chebyshev's inequality. And that's all for today,
so see you next time.