Now we are going to see Chernoff bounds. It is not just one bound but a family of bounds, tied together by the technique we use to derive them, and along the way we will try to understand an important concept: the notion of a moment generating function. Moment generating functions are quite nice because they capture all the moments. Recall that the nth moment of X is E[X^n]; the moment generating function is like a bag that collects all of these moments together. If you recall, Markov's inequality required the first moment and Chebyshev's required the second. Why is Chernoff so powerful? Because it includes information from all the moments, and as it turns out, if you have all the moments you uniquely determine the random variable. So the moment generating function actually characterizes a random variable uniquely; that is why this approach is significantly more powerful than Chebyshev's inequality. We will work out Chernoff bounds for the sum of Poisson trials, which is a generalization of the sum of Bernoulli trials (what we call the binomial distribution); it is a slightly more general setting for which we will derive the bound.
So here is the Chernoff bound technique at a very high level. Let us set it against the backdrop of Chebyshev's inequality. How did we prove Chebyshev's inequality? We took a random variable X and asked for the probability that |X - E[X]| is greater than or equal to a. We did not know how to handle that directly, so we squared it; if you recall, the squared event is essentially the same event, but now you are able to bring in the second moment. That is what we did for Chebyshev's.
We are going to push this idea as far as possible. Instead of squaring, we take the exponential. Say you want the probability that X exceeds some quantity a. We pick a suitable t > 0 and morph the event X >= a into the event e^{tX} >= e^{ta}; basically, we exponentiate both sides.
Since t > 0, exponentiation preserves the inequality, so the two events are really capturing the same thing and have equal probabilities. You can easily see the analogy with Chebyshev's inequality; it is essentially the same idea, but instead of just squaring we are pushing it to the limit with an exponential. What we have here is the right tail: we are asking for the probability that X exceeds some value a, and that is the right end of the distribution.
For the left tail, all you need to do is pick a t that is less than 0. With t < 0, you instead ask for the probability that X is less than or equal to a. How do you convert the "less than or equal to" into a "greater than or equal to"? You exponentiate both sides with this negative t; because the exponent is negative, the direction of the inequality flips, so the event becomes e^{tX} >= e^{ta} again, which means you can once more apply Markov's inequality and proceed as before. That is exactly what happens for the left tail. But for the moment, let us not worry about the left tail and continue to focus on the right tail.
What you can now do is apply Markov's inequality: we have a non-negative random variable e^{tX}, and we are asking for the probability that it exceeds the positive quantity e^{ta}, so Markov's inequality gives P(X >= a) = P(e^{tX} >= e^{ta}) <= E[e^{tX}] / e^{ta}. You see how Markov's inequality is still the basis from which all the other tail bounds are derived, but nevertheless we are able to do fancier things now.
Now we have to complete the story. We need a good bound for the numerator, the expectation E[e^{tX}], and that is what we will focus our efforts on. But once we have the right bound for the numerator, we can play with the value of t: the inequality holds for any t > 0, which means that to get the best bound you may need to choose the right value of t.
Then, hopefully, the terms cancel out nicely and you get a bound; that bound is a Chernoff bound. This is really a general technique for deriving these sorts of bounds: take the event that the random variable exceeds a particular value, exponentiate both sides of that event, apply Markov's inequality, and bound the numerator. It is a very standard technique, it is called the Chernoff bound technique, and any bound derived from it is called a Chernoff bound.
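To make the recipe concrete, here is a minimal numerical sketch of the technique (my own illustration with made-up parameters, not something from the slides): for X distributed Binomial(n, p), evaluate E[e^{tX}] / e^{ta} over a grid of t > 0, take the smallest value as the Chernoff bound, and compare it with the exact tail probability.

```python
import numpy as np
from math import comb

# Chernoff technique sketch for X ~ Binomial(n, p):
# P(X >= a) <= E[e^{tX}] / e^{ta} for every t > 0, so minimize over t.
n, p, a = 100, 0.5, 70

def mgf(t):
    # E[e^{tX}] for Binomial(n, p) is (1 - p + p*e^t)^n.
    return (1 - p + p * np.exp(t)) ** n

ts = np.linspace(0.01, 2.0, 500)            # grid of candidate t values
bounds = mgf(ts) / np.exp(ts * a)           # Markov applied to e^{tX}
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

print(f"best t         = {ts[np.argmin(bounds)]:.3f}")
print(f"Chernoff bound = {bounds.min():.3e}")
print(f"exact tail     = {exact:.3e}")
```

Playing with the threshold a shows the optimized bound tracking the exact tail much more closely than the crude Markov bound E[X]/a would.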
Just to recall, we now have to deal with this expectation E[e^{tX}]. As it turns out, it fits very nicely with a well-known notion called the moment generating function. What is the moment generating function of a random variable X? It is exactly what we want: it is denoted M_X(t), with a capital X in the subscript, and it is defined as M_X(t) = E[e^{tX}]. So the quantity we need is a well-defined, well-studied object. Let us try to understand this object, the moment generating function.
The first theorem about the moment generating function is that it captures all the moments. This is what we talked about a little while ago: why are Chernoff bounds so powerful? Because they use all the moments, not just the first and second. Why is that the case? The theorem holds under a condition: wherever expectation and differentiation appear together, we should be able to interchange the order in which we apply them. If we can interchange them, then we have the following theorem. What are we claiming? The nth moment E[X^n] is obtained simply by taking the nth derivative of the moment generating function and evaluating it at t = 0. By doing that, you get the nth moment of the random variable X, and that is the nice thing: the moment generating function is somehow capturing all the moments, and each derivative evaluated at 0 gives you the corresponding moment.
The proof is quite simple. Look at the nth derivative of the moment generating function, and note one important thing: you are taking the derivative not with respect to X but with respect to t, so it is d^n/dt^n of M_X(t). What is the moment generating function? It is E[e^{tX}]. We assumed that differentiation and expectation can be interchanged, so take the differentiation inside the expectation. The nth derivative of e^{tX} with respect to t is X^n e^{tX}. Now that you have this form, evaluate the expression at t = 0: the exponential term becomes e^{0*X}, which is just 1, so you are left with E[X^n]. So the nth derivative evaluated at 0 is E[X^n], the nth moment, and that completes the proof.
Let us look at one specific example where this is applied. Let us go back to our favourite Bernoulli random variable and just confirm the theorem: we take the nth derivative, evaluate it at 0, and we should get back the moments.
Let us test it on the simplest possible case you can think of. What is the moment generating function? It is E[e^{tX}]. What is this expectation? The variable e^{tX} is defined in terms of X, and X is 1 with probability p and 0 with probability 1 - p. When X is 1 you get e^{t*1}, which is just e^t, and that happens with probability p; when X is 0 you get e^0, which is 1, and that happens with probability 1 - p. That is how you get the expression M_X(t) = p e^t + (1 - p) for the moment generating function of a Bernoulli trial. You can then play with it a little and apply the fact that 1 + x <= e^x to get it into the form e^{p(e^t - 1)}, which will be useful later, but for now the form we want is the one shown in yellow on the slide.
So now we have the moment generating function; let us take the first derivative. Differentiating this expression with respect to t leaves you with p e^t, and when you evaluate it at t = 0, e^t becomes 1, so you are left with p, which is exactly E[X]. This fits exactly the theorem we just proved: take the first derivative, evaluate it at t = 0, and you get back E[X]. You can do the same thing with the second derivative; I will leave you to work that out on your own time, but essentially the second moment of the Bernoulli random variable is also p, so you will get the same answer.
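If you want to check this mechanically, here is a tiny symbolic sketch (my own illustration, assuming sympy is available): differentiate M_X(t) = p e^t + (1 - p) repeatedly and evaluate at t = 0; every derivative comes out to p, matching the fact that every moment of a Bernoulli(p) variable equals p.

```python
import sympy as sp

# Moment generating function of a Bernoulli(p) random variable.
t, p = sp.symbols('t p', positive=True)
M = p * sp.exp(t) + (1 - p)

# The n-th derivative at t = 0 should equal the n-th moment E[X^n] = p.
for n in range(1, 4):
    moment = sp.diff(M, t, n).subs(t, 0)
    print(f"E[X^{n}] =", sp.simplify(moment))   # prints p each time
```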
The other point is that the moment generating function not only captures all the moments; in doing so, it fully determines the random variable. We will just state this theorem and skip the proof. What does it say? If you have two random variables X and Y, and their moment generating functions agree in a vicinity of 0, meaning M_X(t) = M_Y(t) for all t in some interval (-delta, delta), then X and Y are guaranteed to have the same distribution. If anything, this tells you that the moment generating function really gets to the heart of what a random variable is; it characterizes the random variable. Now, one more nice theorem before we get back to the Chernoff bounds; this one is a way to exploit independence of random variables.
Suppose you have two random variables X and Y. Then X + Y is itself a random variable, so there is a well-defined notion of the moment generating function of X + Y, and that is the left-hand side, M_{X+Y}(t). The nice thing is that when X and Y are independent, it is just the product of the individual moment generating functions: M_{X+Y}(t) = M_X(t) * M_Y(t). This is not true in general when X and Y are not independent, but when they are independent you have this identity. It is very useful because later on, when we try to bound, say, the binomial random variable or the sum of Poisson trials, we will exploit independence, and this is the crucial ingredient for that.
Why is this true? Take the moment generating function of X + Y and apply the definition: it is nothing but E[e^{t(X + Y)}]. Expand it out and you get E[e^{tX} * e^{tY}]. Now e^{tX} and e^{tY} are themselves two random variables, and if X and Y are independent then e^{tX} and e^{tY} are also independent. If they are independent, the expectation of their product is the product of their expectations, and that is the step justified by independence. What is each of these multiplicative terms? Nothing but the moment generating function of X and of Y respectively. Is everything we have talked about so far clear? Any questions?
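Here is a quick Monte Carlo sanity check of this product rule (just an illustrative sketch with arbitrary parameters): estimate E[e^{t(X+Y)}] for two independent Bernoulli variables and compare it with the product of the two individual estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
t, N = 0.7, 200_000

# Two independent Bernoulli variables with different success probabilities.
X = (rng.random(N) < 0.3).astype(float)
Y = (rng.random(N) < 0.6).astype(float)

lhs = np.mean(np.exp(t * (X + Y)))                      # estimate of M_{X+Y}(t)
rhs = np.mean(np.exp(t * X)) * np.mean(np.exp(t * Y))   # M_X(t) * M_Y(t)

print(f"M_(X+Y)(t)   ~ {lhs:.4f}")
print(f"M_X(t)M_Y(t) ~ {rhs:.4f}")   # agrees up to sampling noise
```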
Now we are ready to get back to the Chernoff bounds. We will work out a particular example: the sum of Poisson trials. Recall that the binomial distribution is the sum of n Bernoulli trials; the sum of Poisson trials is just a generalization of that. So everything we say from now on about Poisson trials also applies to the binomial distribution, but it is one step more general. What is the general part? In the binomial distribution, each Bernoulli trial must have the same probability of success p. With Poisson trials we generalize by allowing each trial X_i to be 1 with some probability p_i and 0 with probability 1 - p_i, and these p_i can be different for each X_i. So X = X_1 + X_2 + ... + X_n, where each trial can have its own probability of success. That is the general nature of the sum of Poisson trials; of course, if all the p_i are equal, you get back the binomial distribution. Now that we have established that the sum of Poisson trials is essentially a binomial distribution generalized a little bit, let us work out Chernoff bounds for this X, the sum of Poisson trials.
Remember, to apply the Chernoff technique we will need E[e^{tX}], which means we need the moment generating function of X. The nice thing is that we can immediately apply the theorem based on independence: since the individual X_i are independent, the moment generating function of X is the product of the moment generating functions of the individual X_i. What is the moment generating function of an individual X_i? We worked this out a little while ago for a Bernoulli trial with success probability p_i: it is at most e^{p_i(e^t - 1)} (to be clear, the whole term p_i(e^t - 1) sits in the exponent). So we just apply that. Now you have a product of exponentials, and a product of exponentials is e raised to the sum of the exponents, so M_X(t) <= e^{sum_i p_i(e^t - 1)}. The factor (e^t - 1) does not depend on i, so it comes out of the summation, leaving sum_i p_i. What is sum_i p_i? Each p_i is E[X_i], so the sum over i of the p_i is E[X], which we denote by mu. So M_X(t) <= e^{(e^t - 1)mu}. We have the moment generating function, which means we can now go back to the Chernoff technique and apply it.
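As a sanity check on this step (my own sketch with arbitrary p_i values), we can compare the exact moment generating function, the product of the individual 1 + p_i(e^t - 1) terms, against the bound e^{(e^t - 1)mu} we just obtained:

```python
import numpy as np

# Exact MGF of a sum of independent Poisson trials vs the bound e^{(e^t - 1) mu}.
p = np.array([0.1, 0.35, 0.5, 0.8, 0.25])     # arbitrary success probabilities p_i
mu = p.sum()                                  # mu = E[X] = sum of the p_i

for t in [0.2, 0.5, 1.0]:
    exact = np.prod(1 + p * (np.exp(t) - 1))  # product of the individual MGFs
    bound = np.exp((np.exp(t) - 1) * mu)      # uses 1 + x <= e^x term by term
    print(f"t={t}: exact MGF = {exact:.4f}, bound = {bound:.4f}")
```

The bound is always at least the exact value, precisely because each factor was relaxed using 1 + x <= e^x.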
Before we start applying it, let me state the bounds we can get. I am going to prove only one of them; the rest you will have to prove on your own (the textbook has them), so take some time to work through them.
If you recall, I said there is one framework, the Chernoff bound technique, from which you can derive multiple bounds, and that is why we are going to state several of them now. The first bound: for any delta > 0, the probability that X exceeds (1 + delta)mu is at most (e^delta / (1 + delta)^{1+delta})^mu. To give yourself a picture of what is going on: you have a random variable X with E[X] = mu, you have some distribution, and you mark the point (1 + delta)mu. We are asking for the probability that X falls in the right tail beyond that point, and that is at most the right-hand side quantity: e^delta divided by (1 + delta) raised to the (1 + delta), the whole thing raised to the power mu. It looks complicated, but there are nicer forms, which is why it helps to have multiple Chernoff bounds.
So let us look at the second form. Here we need to restrict delta to be at most 1. For 0 < delta <= 1, the probability that X exceeds (1 + delta)mu, the same right-tail story, is at most e^{-mu delta^2 / 3}. This might look like just mathematical jargon, but let us pause for a moment to see why it is powerful. In the binomial distribution, what is mu? It is the mean. So if you have 10000 coin tosses, your mu is 5000, and where does mu appear in the tail probability? In the exponent, as e^{-mu ...}. So as the mean increases, the probability decreases exponentially. That is the power of these Chernoff bounds: the probability of deviating from the expectation, this tail probability, drops exponentially. In the coin flipping example we worked out in the past, it shows up as e^{-5000 delta^2 / 3}; the delta^2 and the 3 are relatively small quantities, but you can immediately see that as the number of coin flips grows, the probability of deviating significantly from the mean drops very, very quickly. Hopefully this intuition makes sense to you, because it is very important; it tells you about the power of the Chernoff bound. And there is one more convenient form.
(Answering a student's question:) Well, the numerator is positive and the denominator is positive too, and since the denominator dominates, the ratio again decays in the same way. Now, what is the third form? For any R that is at least 6mu, the probability that X exceeds R is at most 2^{-R}. This is useful when you want to ask for the probability of deviating beyond six times the expectation, which makes sense in some cases, and again you see the exponential drop.
(A student asks about measuring deviations in terms of standard deviations from the mean.) Yes, that makes sense particularly for things like the normal distribution. When you work with standard deviations, you are really working with Chebyshev's inequality or some related inequality, and you can do that; it is usually done because you do not have a handle on the moment generating function. What you are saying is true in a lot of fields, but in computer science we often define the algorithm ourselves, which means we control what the random variables are going to look like, which means we have more power in our hands. We can go to Chernoff bounds, and that is what happens very commonly: we end up being able to exploit Chernoff bounds because, implicitly, we have access to all the moments, even if we do not explicitly think about it all the time, and so we can apply the more powerful techniques. This is significantly more powerful than just asking how many standard deviations away from the mean you are. I think your question was also motivated by this, right?
Coming back to the third form: it is just one convenient form, and it might be useful in cases where the mean is small. If the mean is small, you can ask what the probability is of exceeding six times that small mean, and then this form may be the convenient one to use.
Again, all three bounds are derived using the same technique; it is just a matter of playing with the value of t (recall that a parameter t shows up in the derivation of Chernoff bounds) and some minor mathematical juggling, and you get these convenient forms. So when you think about applying them, you just choose the most appropriate one; typically the second form is the most commonly used.
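To get a feel for how the forms compare, here is a small numeric sketch (my own, with arbitrary parameters): for X ~ Binomial(1000, 1/2) and delta = 0.1, evaluate the first two right-tail forms against the exact tail probability.

```python
import numpy as np
from math import comb

# Compare the two right-tail Chernoff forms against the exact binomial tail.
n, p, delta = 1000, 0.5, 0.1
mu = n * p
a = int((1 + delta) * mu)                     # threshold (1 + delta) * mu = 550

exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))
form1 = (np.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
form2 = np.exp(-mu * delta**2 / 3)

print(f"exact P(X >= {a}) = {exact:.3e}")
print(f"first form        = {form1:.3e}")     # tighter of the two
print(f"second form       = {form2:.3e}")     # weaker but simpler to use
```

The first form comes out tighter, as the lecture says, but the second is the one you can actually read off and manipulate easily.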
also be ah applied in the in the left tail as well ok we will see that, but for now let
us actually look at the proof. So, ah just to be clear all these three have the proofs
in the textbook. I am only going to go through the first ah inequality. So, that we have
one inequality nailed down; we know how to do it, but the other things I still expect
you to go through it on your own ok so. We are now focusing on the first inequality.
the probability that X exceeds (1 + delta)mu, where delta is any positive value, and we need to prove that it is at most the right-hand side (e^delta / (1 + delta)^{1+delta})^mu. Again, we apply exactly the Chernoff bound technique we have already seen. We take the probability that X > (1 + delta)mu and exponentiate both sides, so it equals the probability that e^{tX} > e^{t(1+delta)mu}; that is the standard step. Then apply Markov's inequality, and you get E[e^{tX}] divided by e^{t(1+delta)mu}, the right-hand side of the event.
But we already know how to handle this numerator: a few slides ago we derived the bound on the moment generating function of X, namely E[e^{tX}] <= e^{(e^t - 1)mu}. So I am just applying that here, and you get the expression e^{(e^t - 1)mu} / e^{t(1+delta)mu}. Now comes the trick, because you have to choose a particular value of t that will be convenient for you. You see a lot of e^t terms, and if you want to cancel things out, what do you do? When you have e raised to the natural log of something, the exponential and the log cancel each other out. That is what we are going to exploit: we set t = ln(1 + delta), and because delta > 0, ln(1 + delta) is also greater than 0, so it is a valid choice of t.
to be greater than 0. So, which means we can apply it. So, going
back to this form; if you apply ah if you apply t equal to lon 1 plus delta; so essentially
here you will be ah you will be getting lon 1 plus delta. So, what will you get over here
you will get ah ah e to the e to the t will become lon 1 plus delta ah you know it will
it will e to the 1 lon 1 plus delta will become 1 plus delta minus 1. So, it will be 1 plus
delta minus 1 divided; now I M going to take the whole to the mu separately and here e
to the t will become lon you know 1 plus delta. So, and this 1 plus delta will be posted over
here; which is exactly the form that we; so this is the first technique. The other techniques
in similar fashion will be derived and at; in fact, in this case you look at the other
you basically play with this form. This is one of the best forms that you can get the
other ah bounds are obtained by displaying with;
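If you want to double-check the algebra of this substitution, here is a small symbolic sketch (my own check): with t = ln(1 + delta), the exponent (e^t - 1)mu - t(1 + delta)mu should equal mu(delta - (1 + delta)ln(1 + delta)), which is exactly the exponent of (e^delta / (1 + delta)^{1+delta})^mu.

```python
import sympy as sp

# Verify the substitution t = ln(1 + delta) in the exponent of the Chernoff expression.
delta, mu = sp.symbols('delta mu', positive=True)
t = sp.log(1 + delta)                       # the choice of t made in the proof

lhs = (sp.exp(t) - 1) * mu - t * (1 + delta) * mu       # exponent after Markov's step
rhs = mu * (delta - (1 + delta) * sp.log(1 + delta))    # exponent of the stated bound
print(sp.simplify(lhs - rhs))               # prints 0, so the two exponents agree
```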
Those derivations are left as homework for you. Any questions before I move on about how this derivation works out?
Here are two forms for the left tail. They look quite similar to what we had, because their derivations follow essentially the same pattern. Because it is the left tail, you need to restrict delta to the range 0 < delta < 1, and the event is X <= (1 - delta)mu. In the picture, this is your mu, and the point to its left is (1 - delta)mu.
So you are asking for the probability of this left tail, and that is again at most some quantity which, if you notice, looks very similar to the type of bound we had for the right tail. The second form is also there: the probability that X is at most (1 - delta)mu is at most e^{-mu delta^2 / 2}, and usually this form is the more useful one. In fact, you can take the second forms of the left and right tails, put them together, and apply the union bound to get a combined bound on both tails. Because it combines the right and left tails, it bounds the absolute deviation: the probability that |X - E[X]| exceeds delta mu is at most 2 e^{-mu delta^2 / 3}.
(A student asks: is it divided by 2 or by 3?) The left-tail form is divided by 2; that happens to not be a typo. When you apply the union bound you are just a little bit relaxed about how you apply it, so both tails get covered by the weaker exponent, the one with the 3. The first form is, I believe, the tightest, but it is really not so much about tightness in these cases, because the extra tightness usually only buys you constants. The more appropriate question is which form is the most convenient to use; of course, in some cases one form might be convenient but might not give you a good enough bound.
The first form is usually the tightest because it is pretty close to what you actually want. The reason I even went through the proof of the Chernoff technique is that there are cases where you have to redo the proof with a specific value of t that fits the application. Remember, here we applied a particular t value that cancels things out nicely, and that is great for a general form, but there are applications for which you really have to go through the derivation yourself and choose the appropriate t value. So it is good to know the general derivation technique.
We are now almost done; the only thing left to point out is how the Chernoff bounds we have seen fare on coin flips, the example we have seen before. Here X is the number of heads out of n coin flips, so this is essentially the binomial distribution, and since we know how to handle the sum of Poisson trials, we can handle the binomial distribution as just a special case.
case. Ah and so we want to ask what is the probability
that ah; so in the particular example that we had n was 10000. So, ah the mean would
have been something like 5000 right. And we are asking what is the probability there it
deviates away ah from ah the mean by this quantity ok? And so let us let us work that
out. So, this is this is just this is recall this is if you ignore the constants essentially
what you have is a square root of n lon n ok. So, it is basically ah if you for for
a rough understanding of what we are talking about here? What is the ah square root of
10000? Ah well it is roughly a 100 right. So, we are asking what is the probability
that X deviates from the mean by more than roughly a 100 ok. If you just you can ignore
the constants in the lon term; lon is small constants are also small and ah let us see.
How do you work this out? So, ah here on the right hand side you need it to be of the form
ah delta times mu right. So, you should work out the appropriate. So, we on the left or
what we have over here is; what is exactly the tail bound that we want? We want to understand;
what are these probabilities these tail probabilities? Ok, but now we have to fit it to the Chernoff
bounds form ok. So, now, ah what is your mu? Mu is n by 2. So, your delta if mu is n by
2; what you have to do? To get the mu; to get the delta is ah just multiply by n over
2 and also multiply by 2 over n. If you do that you get ah this will be your mu this
1 n o. So, basically let me write that a little bit more carefully over here what you do is
you multiply by n over 2 and you also multiply by 2 over n ok.
The n/2 factor is nothing but your mu, and whatever is left over is your delta. So now we have both delta and mu, and we can apply the formula, which if you recall is 2 e^{-mu delta^2 / 3}. When you apply it, you get 2/n. For a similar range, what did we get out of Chebyshev's? I think we got something like 1/9: the probability of deviating from 5000 by more than about 160 came out to roughly 1/9. But now we are getting 2/n, which is 2/10000. So you see how much tighter the tail probability is when you use the Chernoff bounds. That is generally the idea: wherever possible you want to use the Chernoff bound, and when it is not possible you live with Chebyshev's or Markov's inequality.
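Here is a small numeric sketch of this comparison (my own; I am assuming the deviation is (1/2)sqrt(6 n ln n), which is sqrt(n ln n) up to constants and makes mu delta^2 / 3 come out to exactly ln n):

```python
import numpy as np

# Two-sided Chernoff bound for n fair coin flips at a deviation of order sqrt(n ln n).
n = 10_000
mu = n / 2                                   # expected number of heads
dev = 0.5 * np.sqrt(6 * n * np.log(n))       # about 372 heads for n = 10000
delta = dev / mu                             # so that dev = delta * mu

chernoff = 2 * np.exp(-mu * delta**2 / 3)    # = 2 * exp(-ln n) = 2/n
print(f"deviation      ~ {dev:.0f} heads")
print(f"Chernoff bound = {chernoff:.5f}")    # 0.00020
print(f"2/n            = {2 / n:.5f}")
```

Compare this 2/10000 with the roughly 1/9 that Chebyshev's inequality gave for a comparable deviation, and the exponential advantage is plain to see.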
With that we can conclude our segment for today. We introduced the Chernoff bound technique, and along the way we picked up an understanding of moment generating functions: we looked at some of their properties, and we worked out the derivation of a Chernoff bound for the sum of Poisson trials, at least one form of it; there were other forms that we stated without going through the derivations. Then we applied it to coin flips and saw immediately how powerful Chernoff is compared to Chebyshev's; I hope you can appreciate the difference. That is what we have seen so far. In this module we have had two slightly larger segments: the analysis of the median algorithm in the earlier segment, which was again slightly longer, and this Chernoff bounds technique, also slightly longer, for a total of five segments in this module 3.
In the next module we are going to focus on applications. Now that we know all these bounds, Markov's inequality, Chebyshev's, Chernoff's, and so on, we are going to start looking at some algorithmic contexts and applying them there. Hopefully these bounds will come alive for you, and you will actually see how they help in computer science. That will be the topic for the next module. Thanks.