Continuous Distributions: Beta and Dirichlet Distributions

Captions
We're going to continue our discussion of continuous distributions by talking about the beta distribution. The beta distribution is a little bit different from the normal distribution: the normal distribution gave you a distribution over all real values, but the beta distribution only gives you probabilities over the range 0 to 1, inclusive. This allows the beta distribution to model anything that looks like a probability, so this could be the proportion of people in a country with a disease, the probability of an unfair coin, or, say, a batting average.

The beta distribution has two parameters, alpha and beta; we'll talk about those in a second. The density function of the beta distribution is x to the alpha minus 1, times 1 minus x to the beta minus 1. This looks a lot like the density function of the Bernoulli distribution, but it's important to keep what we're modeling straight: in the beta distribution we have a density over a continuous value, so x can be anything between 0 and 1, while in the Bernoulli distribution, even though it looks a lot like this, x is either 0 or 1; it's over discrete outcomes. We'll talk more about why they look so similar in a second.

So now let's take a look at the beta distribution for various parameters alpha and beta. You get something that looks a little like a Gaussian distribution if you choose alpha equal to 2 and beta equal to 2; that corresponds to this purple line here. You have something that looks a little like a bell-shaped distribution, but you'll notice that the probability goes all the way down to 0 at the edges. There is no long tail: the probability goes not just close to 0, like in the normal distribution, but exactly to 0. The second you get beyond 0, the probability is 0, so anything to the left here is completely impossible as an outcome of the beta distribution. If you choose alpha equal to 2 and beta equal to 5, you get more of a skewed distribution, which means your outcome is more likely to be in this range close to zero. But again, as you get close to 0, the probability gets very, very small; it's not going to be exactly 0, but it's going to be a very small number.

Other parameter settings are more interesting. For example, take a look at the red line here, which corresponds to alpha and beta both being equal to 0.5. When alpha and beta are less than 1, you get a bowl-shaped probability distribution. This is basically saying that the outcome is going to be close to either 0 or 1, but either one of those outcomes is equally likely, so it's symmetric. You can also have distributions that are peaked at only one of the edges. For example, if you have alpha equal to 5 and beta equal to 1, you have this blue line here, which ramps up very quickly and peaks at 1, so you're very likely to have outcomes of the beta distribution that are 1 or very close to 1.

Now that we've seen the shape of the distribution, let's take a look at the function that generates that density curve. Here we have a normalization term, just like we saw for the normal distribution; we can pay attention to the rest of it for now and ignore the normalization term. So here we have x raised to the alpha minus 1, times 1 minus x raised to the beta minus 1. From this you should be able to tell that alpha and beta have to be greater than 0, and that you get special values when alpha or beta are equal to 1, because parts of the expression can go away. For example, with the blue line on the previous slide, where beta is equal to 1 and alpha is equal to 5, we basically have x to the fourth, and so you get a curve that goes up sharply and peaks at 1.

You should also see that if you have alpha equal to beta, you get a symmetric function. Say, for the sake of argument, that both are equal to 2; then you have x times 1 minus x, and you should see that this is a symmetric function around 1/2: at 1/2 the left-hand side will be equal to the right-hand side, and then they'll flip, so 1 minus x will look like the left-hand side and x will look like the right-hand side. So when you have alpha equal to beta, you have a symmetric function.
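To make these shapes concrete, here is a minimal sketch (Python with SciPy, not part of the lecture; the grid of evaluation points is an arbitrary choice) that evaluates the beta density for the four parameter settings discussed above:

```python
import numpy as np
from scipy.stats import beta

# Evaluate the beta density at a few points in (0, 1) for the
# parameter settings from the slide.
xs = np.linspace(0.05, 0.95, 7)
for a, b in [(2, 2), (2, 5), (0.5, 0.5), (5, 1)]:
    print(f"alpha={a}, beta={b}:", np.round(beta.pdf(xs, a, b), 3))

# Unlike the normal distribution, the density is exactly zero
# outside [0, 1]:
print(beta.pdf(-0.1, 2, 2), beta.pdf(1.1, 2, 2))  # -> 0.0 0.0
```

The (2, 2) row is symmetric and bell-like, the (2, 5) row leans toward zero, the (0.5, 0.5) row is largest near the edges, and the (5, 1) row ramps up toward one.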
Let's now return to that scary normalization term in front. You see the Greek letter gamma there; that is called the gamma function. The gamma function is a generalization of the factorial function that we talked about before, but unlike the factorial function, which is only defined for integers, the gamma function also works on the real numbers, which is why you can have arguments like 0.5, which are possible values of alpha and beta being passed into the gamma function. The expected value of a beta distribution is simply alpha over alpha plus beta, so if alpha and beta are the same, the expected value is going to be 1/2.
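As a quick check (my own sketch, not from the lecture), SciPy's gamma function shows the factorial connection and gives the normalization term and expected value for alpha = 2, beta = 5:

```python
import math
from scipy.special import gamma

# The gamma function generalizes the factorial: Gamma(n) = (n - 1)!
print(gamma(5), math.factorial(4))        # -> 24.0 24
# ...but it is also defined off the integers, e.g. Gamma(0.5) = sqrt(pi):
print(gamma(0.5), math.sqrt(math.pi))

# Normalization term of the beta density:
# Gamma(alpha + beta) / (Gamma(alpha) * Gamma(beta))
a, b = 2.0, 5.0
print(gamma(a + b) / (gamma(a) * gamma(b)))   # -> 30.0
print(a / (a + b))                            # expected value, 2/7
```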
Just as we generalized the Bernoulli distribution, we can generalize the beta distribution. The beta distribution can be thought of as a distribution over things that could be probabilities, and we can generalize that further. Recall that we had things like the categorical distribution, which gave probabilities over K different outcomes. Just as the beta distribution gives us a single probability, the Dirichlet distribution gives us K probabilities that define a probability distribution: a vector over K things that sum to one and are non-negative. If you write out the Dirichlet density function, it looks like this mess down at the bottom. You won't need to worry about this on any exam; you'll be given this distribution if I ever wanted you to use it. Just know that it exists and it provides a distribution over vectors.

So what are the kinds of things that you can draw out of a Dirichlet distribution? Let's get a little bit more intuition by seeing some examples. One way of visualizing a Dirichlet distribution is as a triangle. Here we have a distribution over three things, so think about this as a vector x1, x2, x3; a Dirichlet distribution will give vectors that correspond to some distribution over three possible outcomes. One possible draw might be, say, (1, 0, 0); this corresponds to a distribution that says only the first outcome is possible, and we can draw that as a point on the triangle, like so. Another possible outcome is a distribution that only gives probability to the second event, and similarly we can have a point here for the third event. These are the extreme outcomes of a Dirichlet distribution, but the Dirichlet distribution gives us anything inside this triangle as well: a point here at the middle of an edge corresponds to probability 0.5 on each of two outcomes, and the point in the very middle corresponds to (1/3, 1/3, 1/3). We can draw the probability density as height within this triangle. If we set all of the parameters of a Dirichlet distribution to be one-third, we get a distribution that looks a little like this: it is basically saying that the most probable outcomes are the corners of the triangle, that is, we're going to have a vector that is high in one coordinate and low in all the others, but it doesn't particularly prefer any one of those corners over the others.

One thing that often confuses people about the Dirichlet distribution is that the parameter of the Dirichlet distribution is a K-dimensional vector, and the output of a Dirichlet distribution is also a K-dimensional vector, one that is a probability distribution, say a discrete probability distribution. The parameter of a Dirichlet distribution, however, does not have to be a distribution itself; you can have any vector of positive numbers as the parameter of your Dirichlet distribution. For example, let's say that all the parameters of your Dirichlet distribution were two, so the sum of that vector is six. That is not a probability distribution; nonetheless, you can have it as the parameter of a Dirichlet distribution, and if the parameter of your Dirichlet distribution looks like that, this is the corresponding distribution. Notice that it bulges in the center. Remember what's in the center: that is the probability distribution (1/3, 1/3, 1/3). When all of your Dirichlet parameters are the same and greater than one, you're saying that the most likely outcome is going to be around the uniform distribution.
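Here is a small sketch (NumPy, not from the lecture; the seed and sample sizes are arbitrary) showing that Dirichlet draws live on the simplex while the parameter vector itself need not sum to one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draws from a Dirichlet are vectors on the simplex:
# non-negative entries that sum to one.
draws = rng.dirichlet([1/3, 1/3, 1/3], size=4)
print(draws)
print(draws.sum(axis=1))          # every row sums to 1

# The parameter [2, 2, 2] sums to 6, so it is not itself a
# distribution, but it is a valid parameter; draws now cluster
# around the uniform distribution (1/3, 1/3, 1/3).
print(rng.dirichlet([2, 2, 2], size=4))
```

With all parameters set to one-third, most draws put nearly all their mass on a single coordinate (the corners of the triangle); with all twos, the draws bulge toward the center.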
Just as we saw a connection between the beta and the Bernoulli distribution, there's also a connection between the Dirichlet distribution and the multinomial distribution: both have a similar form when you write down their probability mass function or their density function. Remember the analogy: beta is to the binomial distribution as the Dirichlet is to the multinomial distribution. One thing we'll see in this class is that it's useful that they have these similar forms, because it means we can chain them together; this is often called Bayesian data analysis. We can make the assumption that we have a parameter that comes from a Dirichlet distribution, and that parameter then gets used as the parameter of a multinomial distribution that provides us with observations. We'll see, for example, in naive Bayes that we'll make assumptions about our probability distributions, and those assumptions correspond to assuming a Dirichlet distribution. Don't worry about that for now; let's explain how we can tie these two distributions together.

In particular, what we're going to use is something that looks a lot like the chain rule. We have a Dirichlet distribution that gives us a multinomial parameter theta; that multinomial parameter is the output of the Dirichlet distribution, it gets fed into our multinomial distribution, and then we observe a bunch of counts from that multinomial distribution. So let's say you know what your counts are and what the Dirichlet parameter was; now you want to figure out what that little theta looks like. If you take those two nasty probability functions that we had and ignore the normalization terms, you get a formula that looks like this. You'll notice I've written the symbol for "proportional to" here. This is basically saying: we know this is a probability distribution, it has to sum to one, so let's just ignore all those nasty normalization terms and care only about the things that depend on what we've observed, in particular the counts and our parameters. If we look at the probability of our multinomial parameter given what we've observed, then the probability function looks like you are adding the Dirichlet parameter alpha to the counts that you've observed.

This is why it's useful that the beta distribution looks like the Bernoulli distribution and the Dirichlet distribution looks like the multinomial distribution. In statistics this is called conjugacy: you have one distribution, called the prior distribution, that gives you a parameter; that parameter gets fed into another distribution; and when you condition on data, the resulting distribution is called the posterior. If the posterior looks like the prior, that's called a conjugate distribution. You don't need to worry about that, since this is not a statistics class, but if you see it or hear me say it, that's what I'm talking about. Essentially, conjugacy means your life is easier.

Before, when we were talking about estimating, say, a distribution over words, I just told you to add one. Why did I tell you that, and why is it not so crazy an idea? If we assume that we have a multinomial distribution whose parameter came from a Dirichlet distribution, what does that mean in the case of adding one? Let's go back and look at this term again: if you take your Dirichlet parameter, add in the counts, and subtract one, that becomes the exponent in your new combined distribution. So what does that mean? Go back to the functional form of the Dirichlet distribution and think about it for a second: if all of your Dirichlet parameters are one, what happens? Then this term basically doesn't matter anymore, because you have theta k raised to the zeroth power. So in the add-one case we're basically assuming a uniform Dirichlet prior, because we take our counts and add one to them, and what we have added is the Dirichlet parameter. If the Dirichlet parameter is one, that means the Dirichlet distribution is uniform: it weights everything equally, no matter what theta is. If we go back to those pictures of the triangle I was showing you earlier, that corresponds to everything being flat on the triangle. That triangle is often called the simplex; it encodes probability distributions. If your Dirichlet parameter is one, then the exponent is zero, and the density has equal values for all of the thetas that you plug in. All of those thetas correspond to points within the triangle, and no point is more probable than any other point, because you're just going to take that vector, raise it to the zeroth power, which always gives you one, and then normalize with the normalization term out in front.

This concludes our whirlwind tour of distributions. We'll be putting these various pieces together as we build models to do fun and interesting tasks in data science. But keep in mind the intuitions we built about these distributions, because they will in turn help you understand the models, and why we're making the assumptions we make when we put them together for applications.
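As a closing sketch (my own, with made-up counts), here is the add-one story in a few lines of NumPy: a uniform Dirichlet prior plus observed counts gives the posterior, and the posterior mean is exactly add-one smoothing:

```python
import numpy as np

# Dirichlet-multinomial conjugacy: prior Dirichlet(alpha),
# observe counts, posterior is Dirichlet(alpha + counts).
alpha = np.ones(3)              # uniform prior over the simplex
counts = np.array([5, 2, 0])    # hypothetical observed counts
posterior = alpha + counts
print(posterior)                # -> [6. 3. 1.]

# The posterior mean reproduces add-one smoothing:
# (count_k + 1) / (N + K)
print(posterior / posterior.sum())                  # -> [0.6 0.3 0.1]
print((counts + 1) / (counts.sum() + len(counts)))  # -> same values
```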
Info
Channel: Jordan Boyd-Graber
Views: 27,623
Rating: 4.8790321 out of 5
Id: CEVELIz4WXM
Length: 18min 0sec (1080 seconds)
Published: Sat Feb 24 2018