Okay, so we've been talking about
conditional expectation, right? And I want to do one more example of
conditional expectation, if I can. Okay, so one more example
of conditional expectation, then the main topic for today is
inequalities, that is, statistical inequalities. So okay, all right, so here's one
more conditional expectation problem. So all right, so
suppose you have a store and different customers show up at your store
and spend different amounts of money. Doesn't have to be a store and customers, just to have a concrete example,
I'll say it that way. That's clearly a very, very general type of setting that
would come up in a lot of cases. So I'll just say store with
a random number of customers. But you can make up your own story for
it, but just have something concrete. That's what I'm thinking of right now, random number of customers,
Which is just pretty realistic, right? You don't know how many customers
you're gonna have, and then each customer chooses to spend
some amount of money, maybe zero. And then you wanna know
how much money you got or how much profit or whatever, so
it's a very natural problem. And so let's let Xj
be the amount that the jth customer spends, okay? And then let's assume that there are,
why did I say topics? >> [LAUGH]
>> I have no idea. Random number of customers. The number of topics in
this class is fixed, which is to say
we'll do these topics. Actually it's not entirely fixed because
I might start rambling or something and not cover something. But actually it's been pretty much fixed,
we covered exactly what I want to cover for the most part since I've
been teaching this course. Okay, so random number of customers,
now let's say N, N = # of customers in a day or
in a week or in some time period. N is a random variable,
the number of customers. So maybe it's Poisson;
that would be reasonable. If you had to guess the distribution,
maybe you might use a Poisson. I don't need to specify
the distribution right now, I'll just say that's N and
Xj is the amount jth customer spends. And let's assume Xj has mean mu and variance sigma squared. So I'm assuming that they all have
the same mean and variance for how much the customers spend. Of course, you can generalize
this problem in different ways. But for now, let's assume they
all spend the same average. They may spend different actual amounts,
but they spend the same average amount and they have the same variance. And let's assume that N and this sequence of expenditures
are independent. Okay, so we're not necessarily
assuming that the Xs are iid, but we are assuming that
they're independent. So [LAUGH] it's not like the second
customer sees how much the first customer spent. And wants to spend more than that
person or something like that or they come in groups and families and
decide together or something like that. They're just independent customers and
the other important assumption here is, the number of customers is independent
of the individual choices for how much to spend. So I mean that sounds like
it maybe plusable, and I'm sure you can think of examples
where this would break down, right? Maybe N is so large that you can't
fit everyone in the store, and then maybe people start
leaving cuz it's too crowded. Or maybe then they're more
determined to buy a lot of stuff, cuz then they think this is a really good,
all kinds of things could happen. Or maybe N being very large
is an indication that they're having some really good sale that day and
so on. But anyway,
we're assuming they're independent. Okay, and then we wanna find the mean and variance of the total expenditure,
right? That's how much revenue you would take in. So let's call that X; X is just the sum of Xj, j = 1 to capital N. So it's just a sum, right? Just the total amount, but
what's unusual about this sum, compared to what we've seen generally
is that in upper index here, capital N is a random variable. So we're adding up a random
number of random variables, okay? So that's the setup, and
if we just tried to use linearity, your first thought about
linearity may be to just sum. Then you might write E(x) =,
well, there's N terms, and each one has mean mu,
so you just go N times mu. However if you did that, how should
you immediately know this is wrong? Yeah, the right hand side is
a random variable, not a number. So that would be a category error; E(x) is supposed to be a number, okay? It can be based on the various
constants we have, but it can't involve a random variable. Capital N is a random variable, so that's
completely wrong, it's a category error. However that category error actually
suggests something useful to us, which is that we kind of
wish that N were a constant. Because if N is a constant,
this is not a category error anymore, just saying a constant
equals another constant. It might be true or false, but at least
it's not a category error anymore. And if N is a constant,
then really that is just linearity, okay? So this terrible blunder actually
tells us what we should do, that we wish that we knew the value of N. So we could treat it like a constant,
therefore let's just condition on N. So condition on N, that just means,
well we write E(x), I'll do this in two different notations,
it's the same thing. We'll do E(x) equals conditioning, so this is
analogous to the law of total probability. The expected value of X
given big N = little n times the probability that big N = little n,
right? Just condition on the value of N. Now this is the sum, n = 0 to infinity. This conditional expectation means
we get to treat N as known to equal little n, so
we know we have little n customers now. In that case,
we really can just apply linearity, right? So we're gonna plug in
big N = little N and here's where this assumption is important. Cuz if we didn't have this assumption what
we would do, is plug in big N = little n. But just like in earlier examples we saw,
like the two envelope paradox, we can't then forget
the condition big N = little n. In this case, though we plug in big
N = little n, they're independent. Big N is independent of the Xs,
so we can forget the condition. And so then by linearity,
we can just write down that that is: you're adding up little n things each
with mean mu, so that's mu times n. So mu is just a constant that comes out,
and what's left in the sum is just the sum of n times the PMF, which by
definition is the expected value of N. So we know that that's
just mu times E of N. So the correction of that blunder is that the N
should be the expected value of N, not N itself. So let's also do this using Adam's Law, which is the same thing, just more compact notation. So we want E of X, and we want to condition on N because then
we can just apply linearity, right? It's more familiar to deal with
a fixed number of terms rather than a random number of terms. So Adam's law, or iterated expectation
just says we can do E of E of X given N. Okay, to get E of X given N,
we just take E of X given N = little n,
which was mu times little n. And we replace little n by big N. So that's mu times N. That is, this notation
means treat N as a known constant. Then by linearity it would just be mu N. And so again, this is mu times E of N,
so we get the same answer. So you can see that this
is more compact than that. Less writing, it's shorter and nicer. But in terms of the meaning and intuition,
both of these mean the same thing. So it's good to be comfortable kind of
writing it out longhand like this and in shorthand like that; both are useful. Okay, so that's the mean. Let's get the variance. So for
the variance we're gonna use Eve's Law. So Var of X = same idea, right? Condition on N, so the expected value of the variance of X given N + the variance of E of X given N, okay? Now, let's just evaluate these two
terms, the variance of X given N. To define that, we really just have to
understand what that notation means. We're treating n as known, and
then I say, what's the variance of that? Well, we know that for the variance of the sum of a fixed number of
independent random variables, you just add up their variances, right? We've proved that before. There are no covariance terms because
I assume they're independent. So if we're treating n as a constant, the variance is just n times
the variance of one term. So that's just N sigma squared +, we need the variance of
the conditional expectation. But we already found that
the conditional expectation, E of X given N, was mu times N, as above. Okay, then to simplify that a little bit,
we can just take out the sigma squared. So this is,
let's write it as sigma squared times E of N +, and then the mu comes out squared, so it's
mu squared times the variance of N. So that's the variance,
in terms of the mean and variance of N. Of course I could have said N is
Poisson or something like that, Poisson lambda, and
then we would just plug in lambda and lambda, but
this is more general. All right, so let's quickly check whether
these answers make intuitive sense. So for the mean, I think this result
is pretty intuitive cuz it says the average amount of money that the store
will take in is the average number of customers times the average
amount that each customer spends. So that's pretty intuitive. As we've seen many times, intuition can
be wrong in this class, but in this case, I think this is pretty intuitive. And then for this one, well, let's just do a quick check that this
even makes sense in terms of the units. Now, capital N or little n, we're
just talking about a number of people. It doesn't really have units;
it's just counting people, okay? People are not units in the sense
that meters, and inches, and seconds are units. Now on the other hand, mu,
well, we're measuring in some currency, dollars or euros or whatever; let's assume dollars. So mu is in dollars, and sigma squared is in
dollars squared, which is why we like to work with standard deviation rather
than variance when we try to interpret things, cuz I would rather
work with dollars than dollars squared. So this,
if we want the standard deviation, we just take the square root of that. Notice if we take the square root of this,
we're gonna get dollars. And if this were mu to the fourth or
sigma cubed or something, it wouldn't make any sense. You'd be trying to add dollars cubed to
dollars to the 4th or something like that, which wouldn't make much sense, okay? So it makes sense in terms of the units. Okay, so similarly, if you want the MGF of X, well,
again just condition on N. If we knew that N is five, then we're just adding up five
independent random variables, so we know that for the MGF we just
multiply those five MGFs, right? So it would be very, very straightforward. Assuming that we know
the MGF of each Xj, it will be very straightforward to get
the MGF of this if N is a constant. Okay, but that tells us we can get
the MGF in general by conditioning on N. Same idea, so you can work that one out for
yourself. Okay, all right, so now we move on to inequalities, right, statistical inequalities; there are four of them
that we need in Stat 110. So there's a sense in which inequalities deserve a lot more attention than
they usually get in most courses. And so, there are different
ways to explain that, but one I particularly like was I had
a conversation recently with one of the leading experts in the world on the
interface between statistics and the law. And he was making a point that if you're
in court as a statistical expert witness, which is a common thing for
statisticians to do. It's a lot easier if you have
an inequality than if you have an approximation. And I know a common mistake in
the past in this course has been to kinda confuse approximations
with inequalities. So I wanna make sure that
distinction is clear, then we'll go through the inequalities. The distinction is just that,
we did the Poisson approximation, right? That is under certain conditions, you can say that a certain distribution's
approximately Poisson, and that's gonna be a good approximation
under certain conditions. That's extremely useful because there are
a lot of problems where it's just too hard to do it exactly, but we can get a good
approximation without that much effort using Poisson approximation. For example, later in the course we'll do normal
approximation under some conditions, a lot of conditions that are pretty
realistic, where normal distributions give us good approximations, I think. Those are approximations. Right now we're talking
about inequalities. Now, of course, they're related. If I prove that a certain probability
is between .36 and .38, right? So then I have both upper and
lower bounds, right? And then I can say well, the probability
is somewhere between .36 and .38 so I would guess .37. But at least I have bounds
in both directions. But if all I say is that the probability
is less than .38, well, it could be .004 is less than .38, so that's not an
approximation, that's just a bound, okay? So that's the distinction. And the reason that this guy who I was
talking to was saying that you're much happier in court if you have an inequality
is that basically [COUGH] you can kind of imagine what would happen. But let's say I'm the expert witness and
I use my Poisson approximation on something and then you can just imagine
kind of being cross examined, right? Dr. Blitzstein,
you claim that this approximation is good, can you explain what you mean
by a good approximation. And then I'd say well,
good means that it's close to the truth. And then the lawyer could say, well, is there an accepted standard
about how close, close is? And do you know how close it is, right? And I'd have to say,
well if I knew exactly how close it is, then I'd actually know the answer, right? >> [LAUGH]
>> And there is not a standard for what does good mean. So what one person says
is a good approximation, another person could say is
a lousy approximation, right? You don't wanna get into that, right? And lawyers are good at kind of
tripping you up in that way. >> [LAUGH]
>> However, if I had an inequality, then I can just say the probability,
I've proven, is less than 0.37. And then there's basically not much
that can be said about that, right? I actually proved a theorem that says
the probability is less than 0.37, okay? It's kind of interesting, right,
because there's still randomness and uncertainty that's why we're
using probability, but we've proven a definite fact
about something random. So it's very advantageous a lot
of times to have inequalities. All right, so we're gonna talk about
the four most important inequalities arguably in statistics. The first one, we've already seen in
some forms, that's Cauchy-Schwarz. A lot of you have seen Cauchy-Schwarz
in a linear algebra or math class. For random variables,
Cauchy-Schwarz says that the expected value of X times Y is less than or
equal to the square root of E(X squared)
times E(Y squared). That's true. You can put absolute values
around it also if you want. Still true.
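The inequality just stated is easy to check numerically. Here's a quick Python sketch; building X and Y from a shared normal component Z is just an arbitrary way to make them correlated for illustration:

```python
import math
import random

# Check Cauchy-Schwarz for random variables: |E(XY)| <= sqrt(E(X^2) E(Y^2)).
# X and Y share a common normal component Z, so they are correlated.
random.seed(2)

n = 100_000
sum_xy = sum_x2 = sum_y2 = 0.0
for _ in range(n):
    z = random.gauss(0, 1)
    x = z + 0.5 * random.gauss(0, 1)
    y = z + 0.5 * random.gauss(0, 1)
    sum_xy += x * y
    sum_x2 += x * x
    sum_y2 += y * y

lhs = abs(sum_xy / n)                         # estimate of |E(XY)|, about 1.0
rhs = math.sqrt((sum_x2 / n) * (sum_y2 / n))  # estimate of the bound, about 1.25
print(lhs <= rhs)                             # True
```

Here E(XY) comes out around 1 while the bound is around 1.25, so the inequality holds with room to spare; in fact the sample version of Cauchy-Schwarz holds exactly for any data set, not just on average.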
When we're talking about that geometric interpretation of conditional expectation, I mentioned the fact that this E(XY)
is playing the role of the dot product. That is, you're familiar with
the dot product of vectors and this is kind of the analog
of a dot product. So for those of you familiar with
Cauchy-Schwarz just in general from linear algebra, this looks the same
once you interpret it that way. But even if you've never seen
Cauchy-Schwarz before, you can just think about what
this inequality says and understand it
in its own right. Notice that if X and Y are uncorrelated, then by definition of uncorrelated,
then E(XY) equals E(X) E(Y). That's just the definition of
what it means to be uncorrelated. So in that case,
it would be crazy to use this inequality because we have an exact equality, right? And this is just an inequality, you can
see the direction makes sense, right? Because E(X squared) is bigger than or
equal to the square of E(X), so it's true, but there would be no point. Just this equals this, okay? So where this is useful is
the case where they're correlated. I mean, it's good that it's true anyway. Just so we don't have to break things down
into separate cases, correlated case, uncorrelated case. It is always true. Okay, but to see why this is telling us
something interesting in the correlated case, kind of the cool thing about
this is if we want to compute E(XY) in general,
we'd have to use the 2D LOTUS, right? That is X and Y,
there's some joint distribution. Well, either we could do a Jacobian and find the distribution of X
times Y in the continuous case. And then so find the PDF of this or
we could use the 2D LOTUS. And that could be very,
very messy and difficult. So this is based on
the joint distribution. This is separating it out into
this is a marginal thing, and this is a marginal thing. That is,
this is the marginal second moment of X. That is, this is just the expected
value of X squared; there's no Y in this expectation and there's no X in this one,
so it separates them out. So that's nice. Okay, and the interpretation,
the statistical interpretation is easiest to see in the case
where they have 0 mean. And this is the case we've
talked about before. Because if X and Y have means 0, Then the correlation between x and y. Well, in general to get the correlation, we take the covariance divided by
the product of standard deviations. The covariance is E(XY) minus E(X) times E(Y),
but I'm assuming mean 0, so E(XY) is the covariance. And I divide by the product
of standard deviations, but the variance of X is just E(X squared)
because it has mean 0. So we just divide by E(X squared)
times E(Y squared), square rooted. That would be the correlation. And let's take the absolute
value of the correlation. When we introduce correlation, we prove
that it's always between -1 and 1, right? So we already showed that
correlation is between -1 and 1. But notice that this statement
is exactly the same as the statement of Cauchy-Schwarz, okay? So it's the same thing. So in statistics Cauchy-Schwarz means
the correlation is between -1 and 1. So I'm not gonna go through a different
proof of this cuz we already proved this fact. And this is just a small extension
that says this is still true even if they don't have mean 0, and
that's just a fact from linear algebra. But this is a very nice interpretation for
our purposes, okay? So that's Cauchy-Schwarz. And you can see why it would be kind
of nice: this joint thing may be hard to compute, and these marginal things may be much easier. This is an upper bound. It may not be a good approximation, right? It's probably pretty bad if you
try to use it as an approximation. It's an upper bound, and
the strengths are simplicity and generality, not that it
gives you an approximation. Okay, so our second inequality Second
famous inequality is Jensen's inequality. Which we've already seen versions of, but we have stated it in general or
talked about it as its own topic. So Jensen's inequality, Says that if lower case
g is a convex function, and I'll remind you of what that is, then for any random variable x, the expected value of g of x is greater
than or equal to g of the expected value of x. So it's pretty nice. When you have convexity, it tells you which way
the inequality's gonna go. Right, one of the biggest blunders in
probability is to move the E inside. Move the E everywhere,
you can't do things like that and this tells you specifically which
way it goes for convex functions. So, okay, just to make sure everyone
knows what a convex function means. If the second derivative exists, it just means that g''(x) is
greater than or equal to 0. That's usually the easiest way to
determine if a function is convex, just take the second derivative. So a simple example would be,
y equals x squared, and you can draw this U-shaped thing. For y = x squared, the second derivative is 2, which is positive, so this is convex. So at least when I took AP calculus
this was not called convex, it was called concave up which
was kind of a stupid terminology, at least no one actually uses that
once you get past AP calculus. This is convex, and
we also had mnemonic, so concave is the opposite,
if the second derivative is negative, or less than or equal to 0,
then we say it's concave. But we don't really need
to study that separately, because if we have a concave function,
let's just say h is concave, I'll write it that way. It just means the inequality
flips, and you can see that right away, because if it's concave just take
the negative of it, and that's gonna flip the second derivative from being less than
or equal to 0 to being greater than or equal to 0. Apply Jensen's inequality, and because
of the minus sign the inequality flips. So it just says it goes the other way for
a concave. So, anyway,
we used to have a mnemonic for this, which was that concave up holds water. Have any of you heard
that mnemonic before? It would be nice if it died out, so I
guess I shouldn't be repeating it; anyway, that's a very bad mnemonic. Because, first of all, I don't really
see why concave up holds water is more memorable than concave down holds water, so it doesn't actually tell you which
way it goes, and secondly it's wrong. So it was worth
having this mnemonic just so that some mathematicians could write a paper called
does concave up holds water hold water. And the answer was no, and it gave some
examples where that doesn't actually work. So the way I remember it is just
remember that this is convex. You just have to remember one simple
example of a function that's convex, okay? And then go back to this picture. And this one is an especially good example
to think about because we already knew that E of x squared is greater than or
equal to the square of E of x. We already knew that fact,
because variance is non-negative. So if you ever forget which
direction this inequality goes, just think back to your friendly old
parabola x squared is convex and inequality goes this way,
we already knew that, okay? So you shouldn't get confused about
which way the inequality goes. So that's an example, and we're gonna
prove that this inequality is true, just doing a couple of examples first. Ah, I should say what the definition is; taking the second derivative is usually the easiest way to figure
out whether a function is convex, but the definition of convex is
a little bit more general. For example if we had
an absolute value function, you know it looks like a v shape and
that's y equals absolute value of x. So the derivative does not exist at 0,
because it has a sharp corner, that's still a convex function though. So the definition is that if you
take any two points on the curve and connect them; let's say I just pick
two points and connect them. This line segment is above the curve,
that's what it means geometrically. Pick any two points you want,
you go like that and it's above the curve, it doesn't cross below the curve. That's what it means geometrically, so that's true for
the absolute value, as well, okay. So that's the geometric interpretation,
but if the second derivative
exists it's usually easiest to just take the second derivative and
see if that's non-negative, okay. So to do a couple other quick examples,
then we'll prove this theorem. What if we have the expected value of
one over x? Let's let x be a positive random variable for this part, so I don't have to worry about dividing by 0
or negative numbers and stuff like that. So that's x to the negative 1, so
the first derivative is minus x to the minus 2, and
the second derivative is 2 over x cubed. If x is positive,
then 2 over x cubed is positive. So this is convex as
long as x is positive. It's convex, so this is greater than or
equal to 1 over E of x. So let's keep x positive for
a couple of examples. Okay, so that's true, and then, what about
expected value of log x, again, I'm assuming x is positive, so I don't have to
worry about the log of a negative number. The derivative of log x is 1 over x, and then the second derivative is minus 1 over
x squared, which is negative, so it's concave. So we know this is gonna be less than or
equal to ln E(x). Okay, and so on. So it's pretty straight forward. So, okay,
let's prove that this is true now. And we should also discuss
when equality holds here. In this case, we know this equals this only
in the case when x is a constant. Right, because then the variance is 0,
which means you have a constant, okay? So let's talk about that,
all right, so proof of Jensen. So let's draw a little picture again. All right, well I could think of
a more creative convex function, but I'm just gonna draw our
familiar one again. That's what a convex function looks like. Now, kind of a geometric fact about
convex functions is what you can see in the picture I'll draw; you would prove
this formally in an analysis course. But just to see it geometrically, just
imagine we have this convex function and, take any point, let's say here,
and draw a tangent line. And that was a pretty bad tangent line, but anyway, it looked too thick. This is supposed to be tangent here,
and then it's below the curve, right. So, or try it over here. Take a point here,
draw a tangent line, and go like that. Or draw it at zero, where the function takes its
minimum; then the tangent line is just the x axis. And any of these tangent lines you draw,
it's gonna stay below the curve, right. So that's the whole, that's the only
fact essentially that we need for Jensen's inequality, for the proof. That if you draw this line, so
let's actually draw this line. Say this is the point (mu, g(mu)), okay. So that's a point on the curve. And suppose we draw a tangent line there. So it goes through there. Then what we're asserting is
that g(x) is greater than or equal to, let's say, a+bx, where
that's the equation of the line. So suppose that this
tangent line is the line y=a+bx. And the statement that this curve
stays above the line is just the statement that g(x) is greater than
or equal to a+bx for every x, okay. Once you've studied the geometry
enough to write down this inequality, then Jensen's inequality follows
very easily because this is true for every number little x, yeah,
in the domain that we're looking at. So that's also true as an inequality for
random variables, that is no matter what value x takes, here we're talking
about comparing random variables. I'm saying that this event, that this
random variable, is bigger than or equal to this one, always occurs. So we know for
sure that this is true for capital X, then just put the expectation
on both sides. And then we know that E(a+bX) is a+bE(x). I'm letting mu equal E(x),
that's the notation. So that's a+b mu, but we chose this line such that it intersects
the curve at that point. So at the point where x equals mu,
the value a+b mu
is the same thing as g(mu), which by definition is g(E(X)). So that's Jensen's inequality. Okay, so
it's a pretty short proof once you have the geometric picture in mind. You can also prove this by doing
a Taylor expansion argument, but you can look in the books for that. But I kind of like having a more geometric
perspective on it for various reasons. Okay, so
that just leaves two more inequalities. There are a lot of other inequalities
in statistics, but this is what I consider the top four, and these
are the only ones we need for this course. And you'll see why, later in this semester
you'll see why we need these ones, aside from the fact that they're
interesting in their own right. Okay, so the third one is
called Markov's inequality. The very last topic in Stat
110 is gonna be Markov chains. Same Markov, different idea. Markov's inequality says that the
probability that any random variable x, let's say absolute value of x
is greater than or equal to a, is less than or equal to expected
value of absolute value of x divided by a for
any constant a greater than 0. So, we're gonna prove this in a minute,
but the strength of this inequality is not that it gives a good
approximation; its strengths are simplicity and generality, that this is completely
general for any random variable. Of course, you could have a random variable
where this is infinity, and that's a pretty bad inequality, a probability
less than or equal to infinity. Okay, but it's still true. And in fact, in some cases,
the right-hand side is bigger than one. In which case this is true but
tells us absolutely nothing, okay. So this is a simple crude inequality,
and so let's prove it. Well the proof is basically
to use the fundamental bridge: I'm gonna convert
this probability of an event, which is the same as the expected value of
the indicator of that event, right. So that's the same thing
as the expected value of the indicator of x greater than or
equal to a. I'm just using this as notation for a one
if this event occurs, zero otherwise. That's the same thing as this,
right, fundamental bridge. And let's multiply by a. I'm just thinking of the same inequality
but with an a on the left, so I'm rewriting the left-hand side,
except put the a over there. And then let's see, how does this thing
compare with the absolute value of x? Okay, this inequality is always true,
let's just think why. Let's
do this without the expectation first, then we'll bring in the expectation. Okay, so I say that this
inequality is always true because, there are only two cases to consider,
right. Anytime you have an indicator of
a random variable, either it's zero or it's one, right. If it's zero, that just says zero less
than or equal to the absolute value of x. So of course that's true, all right. That's one case. The other case is the indicator is one. So this I sub whatever is one. So the left-hand side becomes a. Now in that case,
the indicator equaling one says that the absolute value of x is greater than or
equal to a. But that's what we just said, all right. Replace the indicator by one; it says the absolute value of x is
greater than or equal to a. That's what we just said. So this is always true. So I'll just write,
note that this is true. These are random variables, but this relationship always holds
between those random variables. Once you recognize that this
is less than or equal to this, then Markov's inequality just follows
just by putting E on both sides. So a times the expected value of this indicator, taking out the a, which is a constant,
is less than or equal to E of the absolute value of x. And by the fundamental bridge, that's
the same thing as Markov's inequality. So that proves Markov's inequality. Okay, so if you want a little
bit of intuition on Markov's inequality, let's think
of a simple example. So, then we'll do the last inequality. All right, so here's a simple
little example to think about. Suppose that we have 100 people. Okay, and let's just think intuitively
about a couple simple questions. And we just proved this, but that doesn't make it intuitively
obvious to most people, so we should think also
about the intuition. Okay, suppose we have 100 people and suppose we ask is it possible that, let's say 95% of the people, I'll even say at least 95% of the people are, let's say, younger than the average
person in the group. Average meaning mean. Is that possible? Yes. Why? You have 100 people,
they all have different ages. Usually I do income here, but
I'm trying to avoid that. Yeah?
>> [INAUDIBLE] >> Older, one person's much older. So one of these 100 people is really,
really, really old. That one person is gonna pull
up the average a lot, right? And so then it's easily possible
that 95 people could be younger than the average right, cuz one person
could pull up the average a lot, okay? So that's pretty intuitive,
but this is possible. If we talk about median
that's a different thing. But here I just mean mean. Okay, so that's possible. The answer is yes. Now, let me ask you a similar question. Is it possible, same question, that at least 50% of the people are older than twice the average age? Take the average age, double it, can more than half of
the people be older than that? No, why not? Yeah, so, because just taking
those 50% who are more than double the average age, you just compute
their average or the total. Let's think about the total, cuz if the average is mu for
100 people, then the total is 100 mu. Now, suppose you had 50 people who
are all older than double the average; those people alone would already make
the average bigger than what it is, which is impossible, right? Just those people would already pull
up the average from what it was, which doesn't make sense. Okay, so that's impossible. Similarly, you can't have more
than one-third of the people be more
than triple the average age, and so on. Right? It's impossible. That is exactly what
Markov's inequality says. So, that is the intuition. All right.
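Both halves of this intuition can be verified with concrete numbers. Here's a small Python sketch; the age lists are made up purely for illustration:

```python
import random

# 1) It IS possible for 95+ out of 100 people to be younger than the mean:
#    one extremely old person pulls the mean way up.
ages = [20] * 99 + [5000]                  # made-up ages
mean_age = sum(ages) / len(ages)           # 69.8
num_below = sum(a < mean_age for a in ages)
print(num_below)                           # 99

# 2) It is NOT possible for more than half to be MORE than double the mean:
#    that's Markov with a = 2*mu, giving P(X > 2*mu) <= mu/(2*mu) = 1/2.
random.seed(3)
for _ in range(1000):
    group = [random.randint(0, 100) for _ in range(100)]
    mu = sum(group) / len(group)
    frac = sum(a > 2 * mu for a in group) / len(group)
    assert frac <= 0.5                     # never violated
print("Markov bound held in every trial")
```

The first check shows that the mean, unlike the median, can be dragged arbitrarily far by a single person; the second is just Markov's inequality in disguise.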
So, our last inequality is Chebyshev's,
which is another famous inequality. Chebyshev's inequality follows almost
immediately from Markov's inequality. Which is kind of ironic because in real
life Chebyshev was Markov's adviser. Both of
them are famous mathematicians for other reasons, but these inequalities
are very useful but very simple; they're crude, general upper bounds. Chebyshev's basically says, well,
let me write down the inequality. It says that the probability that x minus its mean,
we're just letting mu equal E of X, in absolute value, is greater than something. So here, we're just looking at
differences from the mean: the probability that the difference exceeds some number a is less than or equal
to the variance divided by a squared. So mu is the mean. And a is just, again, any positive number. Okay?
So it's kind of similar in spirit. Except we're looking at
the difference from the mean. And we get variance and
a squared thing up here. Okay?
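Here's a quick Python sketch comparing this bound with an actual tail probability; the Exponential distribution with mean 1 (so variance 1) is an arbitrary illustrative choice:

```python
import random
import statistics

# Chebyshev: P(|X - mu| > a) <= Var(X) / a^2.
# For Exponential(1), the mean and variance are both 1.
random.seed(4)

samples = [random.expovariate(1.0) for _ in range(200_000)]
mu = statistics.fmean(samples)
var = statistics.pvariance(samples)

a = 2.0
tail = sum(abs(x - mu) > a for x in samples) / len(samples)
bound = var / a ** 2

print(tail <= bound)   # True: tail is about 0.05, bound is about 0.25
```

The bound (about 0.25) sits far above the actual tail (about e to the minus 3, roughly 0.05), which is the point: Chebyshev is a crude but completely general bound, not an approximation.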
And the other way to write this is that the probability that x minus mu is greater than, let's say,
c times the standard deviation is less than or equal to 1 over c squared, where, again, c is greater than 0. So this says that the probability
that x is more than two standard deviations away from
its mean is at most one quarter. Right? 0.25. So you can see why it's kind of cool: in the normal case we have
the 68, 95, 99.7% rule, remember? Part of it says that the probability
that a normal random variable is more than two standard deviations
away from its mean is about 0.05. And Chebyshev's inequality says that that
would always be true, except with 0.25, that's 1 over 2 squared, rather than 0.05,
so it's a crude upper bound. And the proof is very easy once
we have Markov's inequality. And this form is equivalent to the first
just by letting a equal c times the standard deviation;
then those are the same thing. So to prove the first form here,
let's just use Markov's inequality and just do one step first,
which is to square both sides, right? So, let's square both sides. And since we're dealing with
a non-negative random variable here, and a is a positive number, it's an equivalent event if
we just square both sides. So that's squared, and we can drop the absolute
value cuz we squared it, greater than a squared. Now let's use
Markov's inequality on this term. So by Markov's inequality,
this is less than or equal to the expected value
of this divided by this. Which is E(x-mu) squared,
divided by a squared. So that's just Markov,
we can put greater than or equal here, it's still true either way, but
I'll write it with greater than or equal. Markov's inequality,
so this is less than or equal to this, right, just immediately
applying Markov's inequality there. But the numerator, that's just
the definition of variance, right? So that's the variance of x divided by
a squared, which is what we wanted. Okay, so
that proves Chebyshev's inequality. And that's all for today,
so see you next time.