Tail Bounds I - Moment Generating Functions and Chernoff Bounds

Captions
Now we are going to see Chernoff bounds. This is a family of bounds rather than a single bound; they are tied together by the technique used to derive them, and along the way we will pick up an important concept called the moment generating function. Moment generating functions are quite nice because they capture all the moments. Recall that the nth moment is the expectation of X to the n; the moment generating function is like a bag that collects all of them together. Markov's inequality required the first moment and Chebyshev's required the second moment. Why is Chernoff so powerful? Because it uses information from all the moments, and as it turns out, if you have all the moments you uniquely define the random variable. So the moment generating function actually characterizes a random variable uniquely, which is why this approach is significantly more powerful than Chebyshev's inequality. We will work out Chernoff bounds for the sum of Poisson trials, which is a slight generalization of the sum of Bernoulli trials, that is, of the binomial distribution.

Here is the Chernoff bound technique at a very high level, set against the backdrop of Chebyshev's inequality. How did we prove Chebyshev's? We took a random variable X and asked for the probability that |X - E[X]| is at least a. We did not know how to handle that directly, so we squared it; the event stays essentially the same, but now we are able to apply the second moment. We are going to push this idea as far as possible: instead of squaring, we take the exponential. Say we want the probability that X exceeds a certain quantity a. We pick a suitable t > 0 and morph the event "X >= a" into "e^{tX} >= e^{ta}"; exponentiating both sides preserves the inequality, so the two events capture the same thing and have equal probability. You can easily see the analogy with Chebyshev's inequality: it is essentially the same idea, but pushed to the limit with an exponential instead of a square.

What we have here handles the right tail: we are asking for the probability that X exceeds some value a, which is the right end of the distribution. For the left tail, all you need to do is pick a t less than 0 and ask instead for the probability that X is at most a. Remember, for the left tail your t is going to be less than zero.
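In symbols, the event transformation just described (this is only a restatement of the spoken argument):

```latex
% Right tail, for any t > 0:
\Pr[X \ge a] \;=\; \Pr\!\big[e^{tX} \ge e^{ta}\big].
% Left tail, for any t < 0 (multiplying by a negative t flips the direction):
\Pr[X \le a] \;=\; \Pr\!\big[e^{tX} \ge e^{ta}\big].
```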
What happens when you apply the same step with t < 0? Exponentiating, you again get e^{tX} on the left, but because t is negative the inequality flips into a "greater than or equal to" event, which means you can once more apply Markov's inequality, and so on. That is exactly what we are doing here. For the moment, though, let us not worry about the left tail and continue to focus on the right tail.

What we can now do is apply Markov's inequality: we have a non-negative random variable, e^{tX}, and we are asking for the probability that it exceeds a positive quantity, so Markov's inequality applies. You can see how Markov's inequality is still the basis from which all the other tail bounds are derived; we are simply doing fancier things with it now.

To complete the story, we need a good bound for the numerator, the expectation of e^{tX}, and that is where we will focus our efforts. Once we have the right bound for the numerator, we can play with the value of t: the inequality is true for any t > 0, so for the best bound we may need to choose the right value of t, and then, hopefully, the terms cancel out nicely and we get a clean bound. This is a general technique for deriving these sorts of bounds: take the random variable, exponentiate it, ask for the probability that it exceeds a particular value, exponentiate both sides of that event, apply Markov's inequality, and bound the numerator. This standard technique is called the Chernoff bound technique, and any bound derived from it is a Chernoff bound.

So now we have to deal with this expectation of e^{tX}. As it turns out, it fits very nicely with a well-known notion called the moment generating function. The moment generating function of a random variable X, denoted M_X(t) (the subscript is the capital X, the random variable), is defined to be exactly the quantity we want: M_X(t) is defined as the expectation of e^{tX}. So the quantity we need is a well-defined, well-studied object; in symbols, the technique so far reads as follows.
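Written out, the right-tail version of the technique and the definition it leads to (again just restating the argument above):

```latex
\Pr[X \ge a] \;=\; \Pr\!\big[e^{tX} \ge e^{ta}\big]
\;\le\; \frac{\mathbb{E}\big[e^{tX}\big]}{e^{ta}}
\;=\; \frac{M_X(t)}{e^{ta}}
\qquad \text{for any } t > 0,
\qquad \text{where } M_X(t) \;:=\; \mathbb{E}\big[e^{tX}\big].
```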
Let us try to understand this object, the moment generating function. The first theorem about it is that it captures all the moments. This is what we talked about a little while ago: why are Chernoff bounds so powerful? Because they capture all the moments, not just the first and the second. The theorem holds under a condition: we must be able to interchange expectation and differentiation. If we can interchange them, then we have the following claim.

What are we claiming here? The expectation of X to the n, the nth moment, is obtained simply by taking the nth derivative of the moment generating function and evaluating it at 0. By doing that, you get the nth moment of the random variable X, which is the nice thing: the moment generating function is somehow capturing all the moments, and each successive derivative, evaluated at 0, gives you the corresponding moment.

The proof is quite simple. Look at the nth derivative of the moment generating function, and note one important thing: you are taking the derivative not with respect to X but with respect to t, so it is d^n/dt^n. What is the moment generating function? It is the expectation of e^{tX}. We have assumed that differentiation and expectation can be interchanged, so we interchange them and take the derivative inside the expectation. The nth derivative of e^{tX} with respect to t is X^n e^{tX}. Now that you have this form, you can evaluate the expression at t = 0: the factor e^{0 X} is just 1, so it vanishes, and you are left with the expectation of X^n. So the nth derivative evaluated at 0 is the expectation of X to the n, which is the nth moment.

Let us look at one specific example where this is applied. Let us go back to our favourite Bernoulli random variable; we just want to confirm the theorem — take the nth derivative, evaluate it at 0, and you should get the moments — in the simplest possible case you can think of. What is the moment generating function? It is the expectation of e^{tX}. The random variable e^{tX} is defined in terms of X, and X is 1 with probability p and 0 with probability 1 - p. When X is 1 you get e^{t·1}, which is just e^t, and that happens with probability p; when X is 0 you get e^0, which is 1, and that happens with probability 1 - p. So the moment generating function is p e^t + (1 - p). You can also play with it a little and apply the fact that 1 + x is at most e^x to get it into the form e^{p(e^t - 1)}, but for now the form we want is the one shown in yellow, p e^t + (1 - p). Now that we have the moment generating function, let us take the first derivative with respect to t: you are left with p e^t, and when you evaluate it at t = 0, e^t becomes 1, so you get p.
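A quick symbolic sanity check of this Bernoulli calculation (my own sketch, not part of the lecture; it uses sympy to differentiate the moment generating function):

```python
import sympy as sp

t, p = sp.symbols('t p')

# Moment generating function of a Bernoulli(p) random variable:
# M_X(t) = E[e^{tX}] = p*e^t + (1 - p)
M = p * sp.exp(t) + (1 - p)

# nth moment = nth derivative of M with respect to t, evaluated at t = 0
first_moment = sp.diff(M, t, 1).subs(t, 0)   # -> p, i.e. E[X]
second_moment = sp.diff(M, t, 2).subs(t, 0)  # -> p, i.e. E[X^2]

print(first_moment, second_moment)  # prints: p p
```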
That is exactly the expectation of X, and it fits the theorem we just proved: take the first derivative, evaluate it at 0, and you get back E[X]. You can do the same thing with the second derivative; I will leave you to work that out on your own time, but for the Bernoulli random variable the second moment continues to be p, and that is what you get.

The other thing is that the moment generating function does not only capture all the moments; in capturing all the moments, it fully defines the random variable. We are just going to state this theorem and skip the proof. What is it stating? If you have two random variables X and Y, and their moment generating functions agree in a vicinity of 0 — meaning that for every t between -delta and delta the two moment generating functions are exactly equal — then X and Y are guaranteed to have the same distribution. If anything, this tells you that the moment generating function really gets to the heart of what a random variable is: it characterizes the random variable.

One more nice theorem before we get back to Chernoff bounds. This one is a way to exploit independence of random variables. You have two random variables X and Y; X + Y is itself a random variable, so there is a well-defined notion of the moment generating function of X + Y, and that is the left-hand side. The nice thing is that, when X and Y are independent, it is just the product of the individual moment generating functions, M_{X+Y}(t) = M_X(t) M_Y(t); this is not true when X and Y are not independent. This is very useful, because later on, when we try to bound, say, a binomial random variable or a sum of Poisson trials, we will exploit independence, and this is the crucial ingredient. Why is it true? Take the moment generating function of X + Y and apply the definition: it is nothing but the expectation of e^{t(X+Y)}, and when you expand it out you get e^{tX} times e^{tY}. These are two random variables, and if X and Y are independent then e^{tX} and e^{tY} are also independent; and if they are independent, the expectation of their product is the product of their expectations, which is what we have here, by independence. What is each of these multiplicative terms? Nothing but the moment generating function of X, or of Y. Is what we have talked about so far clear? Any questions? A quick numerical illustration of this product rule is given below.
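The illustration (a Monte Carlo sketch of my own; the parameter choices 0.3, 0.8 and t = 0.7 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
t = 0.7                 # any fixed value of t
n = 1_000_000           # number of samples

# Two independent Bernoulli random variables with different success probabilities
X = (rng.random(n) < 0.3).astype(float)   # Bernoulli(0.3)
Y = (rng.random(n) < 0.8).astype(float)   # Bernoulli(0.8)

mgf_sum  = np.mean(np.exp(t * (X + Y)))                      # estimate of M_{X+Y}(t)
mgf_prod = np.mean(np.exp(t * X)) * np.mean(np.exp(t * Y))   # M_X(t) * M_Y(t)

print(mgf_sum, mgf_prod)   # the two estimates agree up to sampling noise
```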
Now we are ready to get back to Chernoff bounds. We will work out a particular example: the sum of Poisson trials. If you recall, the binomial distribution is the sum of n Bernoulli trials; the sum of Poisson trials is just a generalization, and the technique is the same, so everything we say from now on about Poisson trials applies to the binomial distribution as well — in fact, it is one step more general. What is the general part? In the binomial distribution, each Bernoulli trial must have the same probability of success p. When we go to a sum of Poisson trials we generalize: we allow each Poisson random variable X_i to be 1 with some probability p_i and 0 with probability 1 - p_i, and these p_i can be different for each X_i. Then X = X_1 + X_2 + ... + X_n, where each term can have its own probability of success. That is the general nature of a sum of Poisson trials; of course, if all the p_i are equal, you get back the binomial distribution.

Now that we have established that a sum of Poisson trials is essentially a binomial distribution generalized a little bit, let us work out Chernoff bounds for this X. Remember, to apply the Chernoff technique we will need E[e^{tX}], which means we need the moment generating function of X. The nice thing is that we can immediately start applying the theorem based on independence: since the individual X_i are independent, the moment generating function of X is the product of the moment generating functions of the individual X_i. What is the moment generating function of an individual X_i? We worked this out a few slides ago for the Bernoulli trial with success probability p: it can be bounded by something of the form e^{p_i (e^t - 1)}. We just apply that, and now we have a product of several exponentials, which is the exponential of the sum of their exponents. One thing needs to be slightly clarified, because on the slide it looks as if the (e^t - 1) sits in a subscript: each exponent is really p_i times (e^t - 1), and since the summation is over i, the factor (e^t - 1) comes out of the summation, leaving the sum over i of p_i. What is that sum? Each p_i is the expectation of the random variable X_i, so the sum over i of the p_i is the expectation of X, and we are going to denote that by mu. So the moment generating function of X is at most e^{(e^t - 1) mu}; the full chain of inequalities is written out below.
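The moment generating function computation just described, written out in one chain (a restatement of the spoken derivation, using 1 + x \le e^x in the middle step):

```latex
M_X(t) \;=\; \prod_{i=1}^{n} M_{X_i}(t)
\;=\; \prod_{i=1}^{n}\big(p_i e^{t} + (1 - p_i)\big)
\;\le\; \prod_{i=1}^{n} e^{\,p_i (e^{t}-1)}
\;=\; e^{\,(e^{t}-1)\sum_{i} p_i}
\;=\; e^{\,(e^{t}-1)\,\mu}.
```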
We have the moment generating function, which means we can go back to the Chernoff technique and apply it. Before we start applying it, let me state the bounds we can get. I am going to prove only one bound; the rest of them you will have to prove on your own — the textbook has them, so you will have to take some time to work through them. If you recall, I said there is one framework, the Chernoff bound technique, from which you can derive multiple bounds; that is why we are going to give multiple bounds now.

The first bound: for any delta > 0, consider the probability that X exceeds (1 + delta) times mu. Just to give yourself a picture of what is going on: you have a random variable X with E[X] = mu, you have some distribution, and the point we are looking at is (1 + delta) mu, to the right of the mean. We are asking for the probability that X falls in this right tail, and that is at most the right-hand-side quantity: e^delta divided by (1 + delta) raised to the (1 + delta), the whole thing raised to the mu. It looks complicated, but there are nicer forms; this is why it helps to have multiple Chernoff bounds.

So let us look at the second form. Here we need to restrict delta to be at most 1. What is the probability that X is greater than (1 + delta) mu? Again the same story: we are bounding the right tail, and the bound is e^{-mu delta^2 / 3}. Why is this powerful? It might look like just some mathematical jargon, but let us pause for a moment. In the binomial distribution, what is your mu? It is the mean. If you have 10,000 coin tosses, your mu is 5,000, and where does mu appear in the tail probability? It appears in the exponential, as e^{-mu ...}. So as the mean increases, the probability decreases exponentially. That is the power of these Chernoff bounds: the probability of deviating from the expectation — this tail bound — drops exponentially. In the coin-flipping example we worked out in the past, it will show up as e^{-5000 delta^2 / 3}; the delta^2 and the 3 are relatively mild factors, and you can immediately see that as the number of coin flips grows larger and larger, the probability of deviating significantly from the mean drops very, very quickly. I hope this intuition makes sense to you, because it is very important; it is what tells you about the power of the Chernoff bound. (As for the first, less convenient form: the numerator e^delta and the denominator (1 + delta)^{1+delta} are both positive, and the denominator dominates, which again gives the same kind of exponentially small quantity.)

What is the third form? For any R greater than 6 mu, the probability that X exceeds R is at most 2^{-R}. Basically, this is useful when you want to ask for the probability of deviating beyond six times the expectation, which in some cases makes sense; again you see the exponential drop. A small numerical sketch of these three forms appears after the following exchange.

[Student question, partly inaudible, about reasoning in terms of standard deviations.] Yes, that makes sense particularly for things like the normal distribution. When you work with the standard deviation you are effectively working with Chebyshev's inequality or some related inequality, and that is what is usually done, because you do not have a handle on the moment generating function. What you are saying is particularly true for a lot of fields, but in computer science we often define the algorithm, which means we control what the random variable is going to look like, which means we have more power in our hands: we can go to Chernoff bounds, and that is what will happen very commonly — we end up being able to exploit Chernoff bounds. Because we have access to all the moments (implicitly; we do not think about it explicitly all the time), we will be applying the more powerful techniques, and this is significantly more powerful than just asking how many standard deviations away from the mean you are. I think your question was also motivated by this.
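Here is the numerical sketch promised above (my own illustration; the example values mu = 5000 and delta = 0.1 are arbitrary):

```python
import math

def chernoff_right_tail(mu, delta):
    """Right-tail Chernoff bounds on Pr[X >= (1 + delta) * mu] for a sum of
    independent Poisson trials with mean mu, in the two forms stated above."""
    form1 = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu  # any delta > 0
    form2 = math.exp(-mu * delta ** 2 / 3)                        # 0 < delta <= 1
    return form1, form2

def chernoff_right_tail_form3(mu, R):
    """Form 3: for R >= 6 * mu, Pr[X >= R] <= 2^(-R)."""
    assert R >= 6 * mu
    return 2.0 ** (-R)

# Example: 10,000 fair coin flips, so mu = 5000, and a deviation of 10% of the mean
print(chernoff_right_tail(mu=5000, delta=0.1))
# form1 ~ 3e-11 and form2 = e^{-5000*0.01/3} ~ 6e-8: both tiny, form1 a bit tighter
```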
The third bound is just one convenient form; it might be useful in cases where the mean is small. If the mean is small, then you can ask for the probability of exceeding six times that small mean, and this might be a convenient form to use. And again, it is basically the same technique: all three bounds are derived using the same technique. It is just playing with the value of t — if you recall, there is a parameter t showing up in the derivation of Chernoff bounds — plus some minor mathematical manipulation to get these convenient forms. So when you think about applying them, you just choose the most appropriate one, and typically the second form is the most commonly used. And as I said, the same technique can also be applied to the left tail; we will see that, but for now let us actually look at the proof.

Just to be clear, all three bounds have proofs in the textbook. I am only going to go through the first inequality, so that we have one inequality nailed down and we know how to do it; the others I still expect you to go through on your own.

So we are focusing on the first inequality: the probability that X is at least (1 + delta) mu, where delta is any positive value, and we need to prove that it is at most the stated right-hand side. Again we apply the exact Chernoff bound technique we have already seen. Take the probability that X >= (1 + delta) mu; exponentiate both sides to get the probability that e^{tX} >= e^{t(1+delta) mu} — the standard technique — and apply Markov's inequality, which gives the expectation of e^{tX} divided by the right-hand side of the event, e^{t(1+delta) mu}. But we already know the numerator: just a few slides ago we derived that the moment generating function of X is at most e^{(e^t - 1) mu}, so we apply that here and get the corresponding form.

Now comes the trick, because you have to choose a particular value of t that will be convenient. You see a lot of e^t terms, and if you want to cancel things out, what do you do? When you have e raised to the log of something, the e and the log cancel each other out. That is what we are going to exploit: we set t = ln(1 + delta), and because delta > 0, ln(1 + delta) is also greater than 0, so it is a legitimate choice. Going back to the expression and applying t = ln(1 + delta): in the numerator, e^t becomes 1 + delta, so e^t - 1 becomes (1 + delta) - 1 = delta, and taking the exponent mu out separately the numerator becomes e^delta raised to the mu; in the denominator, e^t again becomes 1 + delta, so the denominator becomes (1 + delta)^{1+delta} raised to the mu. That is exactly the form we wanted; the full chain is written out below.
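The chain of steps in the proof just described, written out (a restatement of the derivation, with the substitution t = ln(1 + \delta) in the last two steps):

```latex
\Pr[X \ge (1+\delta)\mu]
\;=\; \Pr\!\big[e^{tX} \ge e^{t(1+\delta)\mu}\big]
\;\le\; \frac{\mathbb{E}\big[e^{tX}\big]}{e^{t(1+\delta)\mu}}
\;\le\; \frac{e^{(e^{t}-1)\mu}}{e^{t(1+\delta)\mu}}
\;=\; \frac{e^{\delta\mu}}{(1+\delta)^{(1+\delta)\mu}}
\;=\; \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\!\mu}.
```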
So that is the first derivation. This bound is one of the best forms you can get; the other bounds are obtained by playing with this form and with the value of t, and those are left as homework for you. Any questions on how this derivation works out before I move on?

Here are two forms for the left tail. Again they look quite similar, because the derivations follow essentially the same pattern. Because it is the left tail, you need to consider delta values in the range 0 to 1, and the event is X <= (1 - delta) mu: this time the marked point, (1 - delta) mu, is to the left of mu, and you are asking for the probability of this left tail. That is again at most some quantity which, if you notice, looks very similar to the type of bound we had for the right tail. There is also a second form: the probability that X is at most (1 - delta) mu is at most e^{-mu delta^2 / 2}, and usually this form is more useful.

In fact, you can take the second forms of the left and right tails, put them together, and apply the union bound to get a combination of both tails. Because it combines the right and left tails, it bounds the absolute deviation: the probability that |X - mu| exceeds delta mu is at most 2 e^{-mu delta^2 / 3}.

[Student: is that divided by 2 or by 3?] It is divided by 2 for the left tail; the 3 in the combined bound happens not to be a typo — when you apply the union bound you are a little bit relaxed about how you apply it, so the weaker of the two exponents, the one with the 3, is what survives. [Which form is the tightest?] The first one is, I believe, the tightest, but it is not really so much about tightness in these cases, because tightness usually just buys you constants; the more appropriate question is which one is the most convenient to use — though of course, in some cases the convenient one might not give you a good enough bound. The first one is usually the tightest because it is pretty close to what you want. And the reason I even went through the proof of the Chernoff technique is that there are cases where you actually have to redo the proof for a specific t value that fits the application. Remember that here we applied a particular t value that nicely cancels things out, which is great for a general form, but there are applications for which you really have to go through the derivation and choose the appropriate t value. So it is good to know the general derivation technique. The left-tail and combined forms are written out right after this passage.
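For reference, the left-tail and combined two-sided forms just discussed, written out for 0 < \delta < 1 (these are the standard textbook statements; the explicit formula of the first left-tail form was only gestured at in the lecture):

```latex
\Pr[X \le (1-\delta)\mu] \;\le\; \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^{\!\mu}
\;\le\; e^{-\mu\delta^{2}/2},
\qquad\quad
\Pr\big[\,|X-\mu| \ge \delta\mu\,\big] \;\le\; 2\,e^{-\mu\delta^{2}/3}.
```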
We are now almost done; the only thing I want to point out is how the Chernoff bounds we have seen so far fare with respect to coin flips. This is the example we have seen before: X is the number of heads out of n coin flips, which is essentially the binomial distribution, and since we know how to handle a sum of Poisson trials, we can handle the binomial distribution as just a special case. In the particular example we had, n was 10,000, so the mean would have been 5,000, and we are asking for the probability that X deviates from the mean by a certain quantity.

So let us work that out. Recall that, if you ignore the constants, the deviation we are considering is essentially of the order of the square root of n log n. For a rough understanding of what we are talking about: what is the square root of 10,000? Roughly 100. So we are asking for the probability that X deviates from the mean by more than roughly a hundred or so; the log term and the constants are small, so you can ignore them for intuition. What we have on the left is exactly the tail probability we want to understand, but now we have to fit it to the Chernoff bound form. What is your mu? It is n/2. To get the delta, multiply and divide the deviation by n/2: write it as (n/2) times (2/n) times the deviation; the n/2 is nothing but your mu, and whatever is left is your delta. Now we have both delta and mu, so we can apply the formula; if you recall, the two-sided formula is 2 e^{-mu delta^2 / 3}, and when you apply it, the bound works out to 2/n.

For a similar range, what did we get out of Chebyshev's? I think we got something like 1/9 for the probability of landing outside 5,000 plus or minus roughly 160. But now we are getting 2/n, which is 2 over 10,000. So you see how much tighter the tail probability is when you use Chernoff bounds. That is generally the idea: wherever possible you want to use Chernoff bounds, and when it is not possible you live with Chebyshev's or Markov's inequality. A small numerical check of this calculation is sketched below.
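The numerical check (my own sketch; the constant 1.5 inside the square root is chosen purely so that the bound comes out to exactly 2/n and may differ from the constant used on the slide, and the Chebyshev figure is computed at the same deviation rather than at the plus or minus 160 quoted above):

```python
import math

n = 10_000                     # number of fair coin flips (the lecture's example)
mu = n / 2                     # expected number of heads = 5000

# Deviation of order sqrt(n log n); with this particular constant the
# two-sided Chernoff bound works out to exactly 2/n.
a = math.sqrt(1.5 * n * math.log(n))   # about 372 for n = 10,000
delta = a / mu                         # write the deviation as delta * mu

chernoff = 2 * math.exp(-mu * delta ** 2 / 3)   # two-sided bound 2*e^{-mu*delta^2/3}
chebyshev = (n / 4) / a ** 2                    # Chebyshev: Var(X) / a^2 at the same a

print(round(a, 1), chernoff, 2 / n, chebyshev)
# Chernoff gives about 2e-4 (exactly 2/n), Chebyshev about 1.8e-2 at the same deviation
```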
With that we can conclude our segment for today. We introduced the Chernoff bound technique; along the way we picked up an understanding of moment generating functions and looked at some of their properties; we worked out the derivation of a Chernoff bound for a sum of Poisson trials — at least one form of it; there were other forms that we stated but whose derivations we did not actually go through — and we applied it to coin flips, where you could see immediately how powerful Chernoff is compared to Chebyshev's. I hope you can appreciate this difference. In this module we have seen two slightly larger segments: the analysis of the median algorithm in the previous segment, which was slightly longer, and this Chernoff bound technique, again slightly longer, five segments in this module. In the next module we are going to focus on applications: now that we know all these bounds — Markov's inequality, Chebyshev's, Chernoff's, and so on — we are going to look at some algorithmic contexts and start applying them, so that hopefully these bounds come alive for you and you actually see how they help in computer science. That will be the topic for the next module. Thanks.
Info
Channel: Probability And Computing - IITM
Views: 4,395
Id: dBuldoAzj20
Length: 39min 41sec (2381 seconds)
Published: Sun Feb 18 2018