Information Geometry

Captions
Each year Microsoft Research hosts hundreds of influential speakers from around the world, including leading scientists, renowned experts in technology, book authors, and leading academics, and makes videos of these lectures freely available.

So I'm going to switch to a completely different topic now, an area called information geometry. It might sound a little mysterious, but it's basically the foundation of probabilistic modeling, so this is something that's really fundamental. The problem I'm going to talk about is the problem of learning a distribution. As a toy example, let's say we want to model the distribution of English words in some corpus. So we want to learn a distribution, and the domain in this case is all English words: all we want to do is assign a probability to each word. Now, if that's the only information we had, then in the absence of any other knowledge the default choice would be to simply make it the uniform distribution. If we didn't know anything else, we'd have no reason to prefer one word over any other, so that would be the default choice. But now suppose we decide to make a few measurements. We take a look at our corpus, and one thing we check is word length: we notice that 30% of the words have length more than five, and that perhaps wouldn't be true under the uniform distribution. Then we choose something else to measure: how many of the words end in E? We find that 45% of them end in E, and again that's something that might not be true under the uniform distribution. In this way we make a few more measurements. So we have this additional information now, and the question is: how do we adjust our model, which initially was just flat, to take
this information into account? What distribution should we choose now? The answer to this question involves three things: entropy, exponential families, and information projection. It's a very basic distributional modeling question. So let's start with entropy. To be precise, we have this domain S, and we want a distribution over S. We can think of each word as having certain features that we've measured. For each word, one of the things we measured is: is its length more than five, yes or no, 1 or 0. Another feature we've measured is: does the word end in E, yes or no. These features can be arbitrary; they can be anything you like. Does the word contain the letters I, N, and G? Does the word have the same first letter and last letter? It doesn't matter, whatever feature you like, and each one is going to be either 0/1 or a real number. So we've taken a few of these features, and now we've made these measurements. We know that 30% of words have length more than 5, which means that the expected value of this feature is 0.3; the expected value of a 0/1 random variable is just the probability that it equals 1. The expected value of the second feature is 0.45, and so on. So all of the measurements we've made can be summarized as a few expected values. We have our domain, the set of all English words; we've measured these features; and what we've observed is these expected values. Now we want to adjust our distribution. What distribution should we choose? Well, we certainly want a distribution that has these expected values, but we have no other information, so other than that we would like to choose a distribution that is as random as possible, one that makes as few assumptions as possible. Does that setup make sense? So the main question,
then, or the first question, is: what do we mean by how random a distribution is? How do we measure the randomness content of a distribution? We can proceed a little intuitively. A fair coin: we toss a coin, how much randomness does it have? Intuitively we'd say it has one bit of randomness; it could be either zero or one. Now suppose we have a biased coin. That's a little less random, because it's more predictable: if there's a bias towards heads, that's a less random coin than a fair coin, since we have a better chance of guessing its outcome. Suppose we have two independent fair coins. We've said that one coin has one bit of randomness, so tossing two coins independently seems like it should be two bits. Now say we have two dependent fair coins: we toss the first coin, that's one bit right there, but based on the value of the first coin we can partly guess what the second one is going to be, so the overall randomness is less than two bits. And finally, here's something we'd like intuitively: suppose you have 32 possible outcomes and they're all equally likely. We'd like to say that's five bits of randomness, because the outcome can be encoded in five bits. So these are some properties that we would like a notion of randomness to possess. Now let's look at a particular notion, which turns out to be the standard notion of how random a distribution is: the entropy of a distribution. If you have a random variable X whose distribution is p, the entropy of that random variable, the entropy of the distribution, is the sum over all possible outcomes of the probability of that outcome times the log of 1 over the probability of that outcome: H(p) = sum over x of p(x) log(1/p(x)). This is kind of a weird formula, it's got a bunch of logs in there. Is this really a notion of randomness? What's going on here? Well, this is the formula, so let's go back to the examples I mentioned and see what answer it gives in each case. A fair coin has two possible outcomes, each with probability 1/2, so if we plug it into the formula we get 1/2 log 2 + 1/2 log 2, which is 1. So the entropy of a fair coin really is 1. Now a coin with bias 3/4, a coin that's biased towards heads with a 3/4 chance of coming up heads: the entropy is 3/4 log(4/3) + 1/4 log 4, and that turns out to be 0.81. Indeed, that's less random. Let's look at an even more biased coin, a coin that's almost always heads. We'd want its entropy to be even less, and indeed it is: if you plug it into the formula, the entropy turns out to be only 0.08, so it's only 8% as random as a fair coin. So at least for single coins, this formula is giving us something fairly sensible. Now suppose we have a uniform distribution over K outcomes. What would the entropy be? The probability of every outcome is 1/K, so each term is (1/K) log K, and summing up we get a total of log K. Again that makes sense: log K is how many bits we need to encode those K outcomes. So this formula, although it's a little weird, is basically doing the right thing on these few examples; but who knows, maybe it does strange things in other cases. Let's look at its properties a little more, because it turns out that this really is the standard notion of randomness. The first property is that entropy is a concave function. In fact, if you just look at a single coin as a function of its bias, this is what the entropy looks like: the most random coin is the one with bias a half, and as the bias goes to 0 or 1, the entropy becomes 0.
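These coin calculations are easy to check numerically. Here's a quick sketch, not from the lecture itself, using base-2 logarithms so the answers come out in bits (the helper name `entropy` is my own):

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit
print(entropy([0.75, 0.25]))   # bias-3/4 coin: about 0.81 bits
print(entropy([0.99, 0.01]))   # nearly deterministic coin: about 0.08 bits
print(entropy([1 / 32] * 32))  # uniform over 32 outcomes: 5.0 bits
```

The `if pi > 0` guard uses the standard convention that 0 · log(1/0) counts as 0.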
That's perfectly reasonable. Now let's look at some other properties of entropy; it turns out to have many properties that we would certainly want a notion of randomness to have. The first property is what we would call expansibility. Suppose you have some variable with n possible outcomes, each occurring with a certain probability, and another variable with n + 1 possible outcomes, where the last outcome never occurs and the other outcomes occur with the same probabilities. We would want these to be equally random, and indeed they are: if you plug into the formula for entropy, you find that the entropies really are the same. That's good. Next, symmetry: suppose you have a coin with bias 3/4, heads probability 3/4, and another coin with heads probability a quarter, so tails probability 3/4. We don't care about the labeling of heads and tails, so we'd want to say they are equally random, and they are. Here's a very nice property: if you have two variables that are independent, then the entropy of the joint distribution is just the sum of the individual entropies. That's what we were talking about with the two independent fair coins: the entropy of the two coins together is the sum of the individual entropies. Let's see why that's the case, because the formula had all these logs in it; why does it work out to this nice additive rule? Well, we have X and Y, so we plug into the formula for entropy. We look at all possible outcomes (x, y), so we get the sum of p(x, y) log(1/p(x, y)). Now we've said that X and Y are independent, so p(x, y) = p(x) p(y). But the log of a product is the sum of the logs, so you separate those out, and then in the summation we can just sum out the y in one term, and likewise sum out the x in the other, and we get the individual entropies of X and Y. So you just plug into the formula and you get this nice additive property. And you can check all these other properties too. Entropy is subadditive: if you have two variables that are possibly dependent, the joint entropy is at most the sum of the individual entropies. Very nice, very intuitive. We've normalized it so that a fair coin has entropy one; we could set that normalization to anything we like, but that's a nice way to normalize. And finally, as the bias of a coin goes to zero, that is, as the outcome of the coin becomes more and more certain, the entropy goes to zero: there's no randomness. So these are some nice properties of entropy, and the remarkable thing is that entropy is the only measure that satisfies these properties. These are all properties that we would really want a notion of randomness to have; we put up a formula for entropy that looks kind of weird, but it really does have these properties, and in fact it's the only notion that has them. That's why it is how we judge randomness. So entropy is our notion of randomness. Now I'll introduce another notion that probably most of you are pretty familiar with. How many of you know the KL divergence? A lot of you. Now, this is really the standard notion of distance between probability distributions: if we have two probability distributions and we want to measure how far apart they are, this is perhaps the most commonly used measure in statistics and in machine learning. There are some other choices, like L1 distance, that are very nice, but this is certainly a standard notion. The formula is K(p, q) = sum over x of p(x) log(p(x)/q(x)), and again it's a weird formula with lots of logs.
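As with entropy, the divergence is easy to compute directly. Here's a sketch of mine (the example distributions are made up, not from the lecture) showing both the asymmetry and the infinite case discussed next:

```python
import math

def kl(p, q):
    """KL divergence K(p, q) in bits; infinite when q puts no mass somewhere p does."""
    if any(px > 0 and qx == 0 for px, qx in zip(p, q)):
        return math.inf
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))  # two different numbers: the divergence is not symmetric

# Two fairly similar distributions can still be infinitely far apart:
print(kl([0.5, 0.25, 0.25], [0.5, 0.5, 0.0]))  # inf
```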
Now, there are lots of unpleasant things about this divergence. For example, it's not symmetric: the distance from p to q is different from the distance from q to p. That's not nice, right? And it can be infinite: imagine that q(x) is 0 for some x, q has no mass there, but p(x) is positive; then this thing is infinite. That's not nice either, because you can have two distributions that are really pretty similar, and yet the KL distance between them is infinite. The one nice thing we can say is that at least it's always nonnegative, it never becomes negative, and it's zero only if the two distributions are the same. So this is a very standard notion of distance between distributions. And what we'll see now is that the entropy is really closely related to how far a distribution is from uniform, which is another thing that we'd really like. We have some distribution p; let u be the uniform distribution, totally flat; and we want to know how far p is from uniform. Let's say the domain is S. We plug into the formula for the KL divergence: the uniform distribution puts the same mass 1/|S| on everything, and when the smoke clears, you get that it's just K(p, u) = log |S| - H(p). So the entropy really tells you how far a distribution is from uniform, and this is another nice feature of the entropy. And I'll give you one final justification of entropy, which sort of helps to understand exactly what it's doing: the asymptotic equipartition property. This looks at sequences of variables rather than a single variable. So instead of just having a single coin with some bias b, we toss the coin n times and look at the entire sequence X_1, ..., X_n; or it can be some other distribution, it doesn't have to be a coin toss. Now, when n is large, we will segregate these sequences into two groups. One is
the group of sequences whose probabilities are roughly 2^(-n H(X)). Each sequence has a certain probability: the probability of the sequence is p(x_1) times p(x_2) times p(x_3) and so on, so we can compute the probability of any sequence. If the probability of the sequence turns out to be roughly 2^(-n H(X)), put it in group 1; otherwise put it in group 2. And it turns out that essentially group 1 is everything. When you toss a coin many, many times, essentially every sequence you get is going to be in group 1: its probability is going to be roughly 2 to the minus n times the entropy of the individual variable. I'll make that precise in a second, because we're going to quickly prove this property too. So the picture here is that you have a space of possible sequences. Say you're doing coin tosses, so each outcome is 0 or 1, and if you do n tosses, the space of possible sequences is {0,1}^n. But we're going to look only at those sequences whose probability is roughly 2^(-nH), and what I'm saying is that this region has probability very, very close to 1: there's almost nothing outside it. And since the sequences in it all have about the same probability, the distribution over sequences is actually just like a uniform distribution over this region. So let's see exactly what this means. Look at a fair coin: the entropy of the coin is 1, so we are looking at sequences whose probability is about 2^(-n). Which sequences have probability 2^(-n)? All of them; every single sequence has probability 2^(-n). So for a fair coin this region is just the whole space, and indeed we have a uniform distribution over the whole space. No surprises there. Now let's look at a coin with bias 3/4, a coin that comes up heads 3/4 of the time, and we're going to toss it n times. What we would expect in a typical sequence is that roughly 75% of the outcomes are heads. If that's the case, then the probability of that sequence is (3/4)^(3n/4), and the rest are tails, so times (1/4)^(n/4), and in fact this is exactly 2^(-nH) for this coin. So the probability of a typical sequence really is 2 to the minus n times the entropy. In the case of a coin with bias 3/4, the typical sequences, with 3/4 heads and a quarter tails, lie right in that region. The distribution now really is concentrated: there's just this small region, and the overall distribution is roughly uniform in this region and basically zero everywhere else. So let's look at a quick proof of this. We have independent random variables, all with the same distribution, and for each of these coin tosses let Y_i = log(1/p(X_i)). The expectation of Y_i is the sum over all possible outcomes of p(x) log(1/p(x)), which is exactly the entropy. Now, by the law of large numbers, averages converge to their expected values, so the average of all the Y's converges to the entropy. Which means that for any epsilon, the probability that the average of these Y's is within epsilon of H goes to 1 as n goes to infinity. And therefore, with probability going to 1, the probability of any sequence you see lies in the range 2^(-n(H plus or minus epsilon)). So as n grows, as you get longer and longer sequences, with probability going to 1 every sequence you observe will have probability roughly 2^(-nH). That's another way to think of entropy when you look at sequences.
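The AEP is easy to watch happening in simulation. The sketch below (my own, not from the lecture) samples sequences from a bias-3/4 coin and checks that minus the log-probability per toss concentrates near the entropy H, which is about 0.811 bits:

```python
import math
import random

random.seed(0)
bias, n = 0.75, 10000
H = bias * math.log2(1 / bias) + (1 - bias) * math.log2(1 / (1 - bias))

rates = []
for _ in range(50):
    heads = sum(1 for _ in range(n) if random.random() < bias)
    # log2-probability of the exact sequence observed
    logp = heads * math.log2(bias) + (n - heads) * math.log2(1 - bias)
    rates.append(-logp / n)  # should be close to H for a "typical" sequence

print(H)                       # about 0.811
print(min(rates), max(rates))  # all 50 sequences land in a narrow band around H
```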
So now we have a little bit of intuition for entropy: despite the odd appearance of the formula, it really is something with a lot of nice properties. So let's go back to our main question. We had these English words and we wanted a distribution over English words, and initially we just said: make it the uniform distribution. Now we've measured some features. The first feature was: does the word have length more than five? The second was: does it end in an E? And based on that we came up with certain constraints. We found that 30% of the words have length more than five, so we want the expected value of that feature to be 0.3. What we want to do is find the distribution over words that has all of these expected values but is otherwise as random as possible, in other words has maximum entropy. This is the maximum entropy principle for modeling a distribution: if you want a distribution over words and all you've made are a few measurements of expected values, what you want to do is find the distribution that meets those criteria but is otherwise maximum entropy. And we can use any moments we like; if we want second-order moments, then we would have features corresponding to those. An example of a second-order moment: we said, does it have length greater than five, and does it end in an E; a second-order moment would be, does it both have length greater than five and end in an E. We could call that feature T_3. You can put anything you like into these features. So this is the optimization problem we're trying to solve. We're trying to learn a distribution, which is a vector of length |S|: it assigns a probability to every word in English, with x varying over English words, so we want to learn a probability for every one of those words. We want the distribution that maximizes the entropy, that is as random as possible, subject to these constraints. It has to be a bona fide distribution, so all these values are nonnegative and add up to one, and then the expected-value constraints are the observations we've made, the measurements. So this is the optimization problem that we're trying to solve. Any questions about that? Now, this is a very nice optimization problem. We're solving for the p(x), and all of these constraints are linear, and this objective, we've said, is a concave function. If you're maximizing a concave function over linear constraints, that's called a convex optimization problem, and it's the sort of problem that we can easily solve: you just give it to one of these packages and it returns the answer. Here's a slightly different variant, which is often the one that would be used in practice, where you also have some prior distribution over English words, say from some earlier modeling experiment, something you're reasonably pleased with; call it pi. Now you've got these new constraints, and unfortunately this distribution pi does not meet those constraints. So you want to find a new distribution p that is as close to pi as possible but otherwise meets those constraints. This is again a convex optimization problem. We can get the previous problem back by making pi the uniform distribution, but in general you can have any prior you like, any prior you're reasonably pleased with, and you want to find a distribution that's pretty close to your prior belief but meets the constraints that you have. Now here's the picture for what we're trying to do. We're trying to find a distribution over English words, which is a distribution over |S| outcomes: a vector of length |S| that adds up to 1. In other words, it lies in the probability simplex.
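Here's a toy version of that optimization, solved without any package. The vocabulary, features, and target expectations below are invented for illustration, and the solver is a simple iterative proportional fitting loop (repeatedly rescale the distribution to satisfy one expectation constraint at a time), which is one standard way to compute this kind of maximum-entropy solution:

```python
# Toy maximum-entropy problem: 6 "words", two binary features, target expectations.
# All names and numbers here are invented for illustration.
words = ["a", "be", "theme", "horse", "sentence", "e"]
f1 = [0, 0, 1, 1, 1, 0]  # invented feature: length > 4
f2 = [0, 1, 1, 1, 1, 1]  # invented feature: ends in "e"
targets = [(f1, 0.3), (f2, 0.45)]

def rescale(p, f, t):
    """Multiplicatively rescale p so the expectation of binary feature f equals t."""
    m = sum(pi for pi, fi in zip(p, f) if fi)
    return [pi * (t / m if fi else (1 - t) / (1 - m)) for pi, fi in zip(p, f)]

p = [1 / 6] * 6  # start from the uniform distribution
for _ in range(500):
    for f, t in targets:
        p = rescale(p, f, t)

print(sum(p))                                  # still a distribution: 1.0
print(sum(pi for pi, fi in zip(p, f1) if fi))  # expectation of f1: 0.3
print(sum(pi for pi, fi in zip(p, f2) if fi))  # expectation of f2: 0.45
```

Each rescaling step is itself a projection onto a single constraint; cycling through the constraints converges to the feasible distribution closest, in KL divergence, to the uniform starting point.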
You can think of this page as being that simplex of dimension |S|. We have some prior, and now we have a bunch of linear constraints, and what we're trying to do is find the model in this space that satisfies these constraints, in other words lies on this linear subspace, but is as close as possible to that prior. Now, usually the way we'd do this is to take the prior and project it at right angles onto that linear subspace; that's what we'd do for L2 distance. But we're not using L2 distance, we're using KL divergence, so it's a slightly different kind of projection: it's not a straight projection at right angles, it turns out to be this sort of curved thing. This is the closest point in KL divergence that satisfies those constraints, and this operation of projecting a prior onto a linear set of constraints is what we call an information projection. Now, here's how we can solve the problem using calculus. We want to minimize the distance from the prior to our distribution subject to these constraints, so let's just use Lagrange multipliers. We take our optimization, we create dual variables for each of the constraints, and then we take the derivative and set it to zero. It's fairly straightforward, and when you do this you find that the solution has a rather simple form: whatever your prior distribution was, times e to some linear function of the features. And this is a very familiar kind of distribution: an exponential family distribution. To get back to our example, what this says is that the solution to the problem looks something like this. We measured a few features, we want the distribution that matches the moments on those features but otherwise has maximum entropy, and the answer is a distribution proportional to e to some linear function of those features. For example, if the coefficient on the ends-in-E feature turns out to be 0.81, what that means is that any word that ends in an E is e^0.81, which is 2.25, times as likely as a word that does not end in E. So this is the kind of distribution we end up with. We started with the uniform distribution; we introduced these features, these measurements (does it have length more than five, does it end in E, is its length an odd number, is its length a prime number, any features we want); we introduced constraints on those features; and the solution is the uniform distribution times e to some constant times feature one, times e to some constant times feature two. It has this very simple form. And this kind of distribution is an exponential family. So let me talk a little bit about exponential families. All of the standard statistical distributions, the Gaussians, binomials, Poissons, Bernoullis, are different exponential families. They seem different: a Gaussian looks like a bell curve, a Poisson is always integer-valued, a Bernoulli is a coin flip. They all seem like very different things, but it turns out that there is an umbrella framework in which they can all be considered, and they turn out to have a lot of properties in common. These exponential families are the most basic statistical distributions, and this is a very powerful and unifying framework: it includes all of these common distributions, and it also includes, say, undirected graphical models, Markov random fields, all of these kinds of things. So let me just tell you exactly what these things are and why each of them is actually an exponential family.
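To make that solution form concrete, here's the one-feature case worked in closed form (my own illustration, with invented numbers), along with the lecture's e^0.81 is about 2.25 arithmetic:

```python
import math

print(math.exp(0.81))  # about 2.248, the lecture's "2.25 times as likely" factor

# Max-entropy distribution with one binary feature f and target E[f] = 0.3.
# Invented toy domain: N words, k of which have f(x) = 1.
N, k, target = 10, 4, 0.3
# Solution form: p(x) proportional to e^(eta * f(x)).
# Solving k * e^eta / (k * e^eta + (N - k)) = target for e^eta gives:
e_eta = (target / (1 - target)) * ((N - k) / k)
Z = k * e_eta + (N - k)
p_one, p_zero = e_eta / Z, 1 / Z  # probabilities of f=1 words and f=0 words

print(k * p_one)       # the constraint holds: E[f] = 0.3
print(p_one / p_zero)  # each f=1 word is exactly e^eta times as likely as an f=0 word
```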
These common distributions are exponential families in the following sense. For instance, the Gaussians are going to turn out to be an exponential family, and the members of that family are the different Gaussians. The coin flips are going to turn out to be an exponential family, and the members of that family are coins with different biases. Each of these is going to be an exponential family. And to define a family, you need to specify three things. First, you say what the input space is. Then you put some base measure on that space; a measure is just like a probability distribution, except that it doesn't have to add up to one. And then you say what features you're interested in. Once you've specified these three things, you have fully specified an exponential family. And what is the family? It consists of a whole bunch of distributions indexed by a k-dimensional parameter eta, where k is the number of features you've chosen to measure. The specific distribution indexed by eta is the base measure times e to the linear function of the features, where the linear coefficients are eta. For the time being, the main thing I'm trying to say is that there's this very broad framework for understanding probability distributions that includes all the common distributions, called exponential families, and you generate a particular exponential family by specifying these three things: specify them and you have the family. Now let's look at some common families and see, for example, how we specify Gaussians in this way, how we specify coin flips in this way, just to get a sense of it. One last thing I need to mention, though,
is the issue of normalization. You choose some features, and if you choose k features then you get a k-dimensional family, a family indexed by k parameters, by a k-dimensional linear function. And the probability distribution given by those parameters is proportional to this function, which means that it might not necessarily add up to one, so you have to put a normalizing factor in the front. Let me just say a word about that normalizing factor. As I said before, if you want to normalize this to add up to one, or to integrate to one, you have to put in a normalizing factor, and often we call it something like 1/Z and stick it in the front. Now, you don't have to put the normalizing factor in the front: you can also move it upstairs, into the exponent, and that's what we're going to do. We're going to take the normalizing factor and just stick it up there. We could have done it either way, but it's going to be really helpful to do things this way, and we're going to call it the log partition function. What is it? You just sum up all of these numbers, and since you've moved it up into the exponent, you take the log. So that's the normalizing factor. Now let's look at some examples. I said that coin flips are an exponential family; how? Well, we just have to specify three things. First, what are the possible outcomes? 0 and 1, the two possible outcomes. Then, what is the base measure? In this case you can make it whatever you like, but let's just say the base measure is 1. And what features are we going to measure? With outcomes 0 and 1, the only feature you can measure is x itself, so that's going to be the feature. Now, once we specify these three things, we get an exponential family, and this family is going to be a one-dimensional family where the eta member of the family has probability proportional to e^(eta x). And when we normalize, we want this to sum to one over x in {0, 1}, so the normalization factor is e^0 + e^eta, and since we're going to move it up into the exponent, we take the log. So this is what the distributions look like: the probability of a 1 is e^eta / (1 + e^eta), and the probability of a 0 is 1 / (1 + e^eta). This looks a little weird. When we talk about a coin flip, we normally index it just by the bias of the coin; we say, oh, this coin has a bias of 3/4. This is giving you exactly the same distributions, but it's indexing them in a different way: you can get all possible coin biases by varying eta. Here we're using a parameter that, instead of lying in the range 0 to 1, lies in the range minus infinity to plus infinity. When eta is minus infinity you get a coin of bias 0; when eta is plus infinity you get a coin of bias 1. So you get the entire range, but we're using a slightly different parameterization. These are called the natural parameters, even though for us they're not natural at all: they're natural in this formalism, but they're not the parameters that we would naturally use. So there's this natural parameter, an arbitrary real number, and it gives us a particular coin. The bias of the coin is the kind of parameter that we are used to, and it's the expected value of the coin; that's called the expectation parameter. So there's this mapping between these two types of parameters. Indeed we get all possible coins, which is nice, but the coin is being parameterized in this slightly different way: not by its bias, but by a number from minus infinity to infinity.
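In code, the natural and expectation parameterizations of the coin look like this (a small sketch of mine, not the lecture's notation):

```python
import math

def coin(eta):
    """Coin flip as an exponential family member: p(x) = e^(eta*x - G(eta)) for x in {0, 1},
    with log partition function G(eta) = log(e^0 + e^eta)."""
    G = math.log(1 + math.exp(eta))
    return [math.exp(-G), math.exp(eta - G)]  # [P(X=0), P(X=1)]

print(coin(0))             # eta = 0 is the fair coin: [0.5, 0.5]
print(coin(math.log(3)))   # eta = log 3 gives the bias-3/4 coin
# Mapping back from the expectation parameter (the bias b) to the natural one:
# eta = log(b / (1 - b)), so bias 3/4 recovers eta = log 3.
print(math.log(0.75 / 0.25))
```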
Now let's look at the Poisson. The Poisson is another exponential family. How do we get the Poisson? A Poisson distribution is a distribution over nonnegative integers, and this is its formula, so we have to somehow make it look like an exponential family. How do we do that? To specify an exponential family, we just have to specify three things. The domain: all nonnegative integers. The base measure: it looks like we should make it 1 over x factorial. The feature: let's take it to just be x. Once we specify these three things, we immediately have an exponential family, whose members have a single parameter eta, and the distribution indexed by eta is proportional to e to the eta x over x factorial. So let's normalize that. When you sum e to the eta x over x factorial, over all possible x, what is the sum? It's just the Taylor series for the exponential, so that's no problem at all: it comes out to e to the e to the eta, and when you take the log you get e to the eta. That's the normalizing factor. So we get the Poisson distributions, and again, we're used to thinking of a Poisson distribution in terms of its mean, but the parameterization we're using here is not the mean: it turns out to be the log of the mean. So again we get a standard family, but with a slightly different parameterization.

Let's move on to Gaussians. The Gaussians of a given dimension are again an exponential family, and this family is a little bit different: the previous two families were one-dimensional, they had only one parameter, while the Gaussians, we know, have two parameters, so we have a two-dimensional exponential family. This is the formula for a Gaussian with mean mu and variance sigma squared, and we can factor the exponent to make it look two-dimensional: (x, x squared) dotted with a particular vector, minus something else. So to generate the family, we just need to specify the domain, which for a Gaussian is the reals; a base measure, which we can make whatever we like, so let's set it to 1 over the square root of 2 pi; and the features we're measuring, which are x and x squared. The minute you measure those two features, you get a two-dimensional exponential family, which is exactly the Gaussians. And again, the parameters of this exponential family, the etas, are not the parameters we're used to. For a Gaussian, we're used to categorizing it by its mean and variance; the parameters we get here are something weird: one of them is the mean divided by the variance, and the other is negative 1 over 2 times the variance. But we can go back and forth between them. So it's a different parameterization, but under this different parameterization, the Gaussians once again fall into this umbrella framework.
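The back-and-forth between the familiar parameters and the natural parameters can be sketched as follows (a minimal illustration, not from the lecture): for the Poisson, the natural parameter is the log of the mean; for the Gaussian, the two natural parameters are built from the mean and variance as described above.

```python
import math

# Poisson: the natural parameter is the log of the mean.
def poisson_natural(lam):       # mean -> natural parameter
    return math.log(lam)

def poisson_mean(eta):          # natural parameter -> mean
    return math.exp(eta)        # also G'(eta), since G(eta) = e^eta

# Gaussian: two natural parameters, (mean / variance, -1 / (2 * variance)).
def gaussian_natural(mu, var):
    return (mu / var, -1.0 / (2.0 * var))

def gaussian_standard(eta1, eta2):
    var = -1.0 / (2.0 * eta2)
    return (eta1 * var, var)

# Round trip: the two parameterizations index the same distributions.
print(gaussian_standard(*gaussian_natural(3.0, 4.0)))   # back to (3.0, 4.0)
```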
make everything add up to one. But it actually turns out to contain a wealth of information; that normalizing factor is just pure gold. The normalizing factor is the thing we moved upstairs, the function we were calling G: the normalizing factor when the parameter is eta. The first thing is that it's strictly convex. The second thing is that if you take its derivative, you get the mean of the distribution. So the normalizing factor is not just any old number; it's a strictly convex function of the parameter. The normalizing factor is just some number for each parameter, but the parameters are K-dimensional, so when you take the derivative, del G of eta, you get a K-dimensional vector, and that vector is exactly the expected value of the features under that distribution. And if you take the second derivative of the normalizing factor, you get the variance of the distribution, a K by K covariance matrix. So this normalizing factor has a lot of information about the distribution built into it. The main thing to bear in mind is that its derivative is the mean of the distribution, and it's strictly convex, which means its derivative is one-to-one, monotonically increasing.

Now, we have these exponential families, and one of the nice things is this: suppose you have data and you want to find the maximum likelihood Gaussian. What do you do? You just find the mean and variance of the data, and that's the answer.
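The two derivative facts above can be checked numerically. A small sketch (my own, not the lecture's) for the coin-flip family, where G(eta) = log(1 + e^eta): the first finite difference should match the coin's bias, and the second should match its variance.

```python
import math

def G(eta):
    """Log partition function of the coin-flip family."""
    return math.log(1.0 + math.exp(eta))

eta, h = 0.7, 1e-4
# Central finite differences for the first and second derivatives of G.
dG  = (G(eta + h) - G(eta - h)) / (2 * h)
d2G = (G(eta + h) - 2 * G(eta) + G(eta - h)) / h**2

# The coin's bias (mean) and variance under this natural parameter.
p = math.exp(eta) / (1.0 + math.exp(eta))
print(abs(dG - p))            # tiny: G'(eta) is the mean
print(abs(d2G - p * (1 - p))) # tiny: G''(eta) is the variance
```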
Likewise, if you have data and you want to find the maximum likelihood Poisson distribution, you just find the mean of the data, and that's your Poisson parameter. Suppose you have a bunch of coin flips and you want the maximum likelihood coin bias: it's just the mean of the data. In all these cases, to find the maximum likelihood model, you just average the data, and that's your parameter right there. And that turns out to be true of all exponential families. Let's see why. An exponential family looks like this; you're given a bunch of data and you want to find the model that maximizes the log likelihood. You have these m data points; you take the log of the model, so the exponent comes down; you sum over the data points and factor out the eta. That's the log likelihood we want to maximize with respect to eta. Take the derivative with respect to eta: you get the sum of the features, minus m times the derivative of the normalizing factor. So the maximum likelihood eta is the eta for which the derivative of G equals the average feature value of the data. But we've already said that the derivative of G is just the mean of the distribution, so the maximum likelihood distribution is simply the one whose mean matches the data. And this holds for all exponential families; it's one of the nice things about these models.

Let's look at what this means specifically. Suppose we have a bunch of integer data and we want to find the maximum likelihood Poisson model. All we need to do is compute the mean of the data, say 7.5, and the model is the Poisson with mean 7.5. The one issue here is that for the Poisson with mean 7.5, the natural parameter eta is not 7.5: 7.5 is the mean of the distribution, and that's the way we usually
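The "just average the data" recipe, using the lecture's Poisson example, can be sketched as a couple of lines of Python (the data here is made up for illustration):

```python
import math

def fit_poisson(data):
    """Maximum-likelihood Poisson fit: match the mean of the data.
    Returns (mean parameter, natural parameter)."""
    mean = sum(data) / len(data)
    return mean, math.log(mean)   # natural parameter is the log of the mean

counts = [5, 9, 7, 8, 6, 10]      # made-up integer data with mean 7.5
mean, eta = fit_poisson(counts)
print(mean, eta)                   # 7.5 and log(7.5)
```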
And that's information projection. So let me just connect the two things we've talked about. First, we talked about randomness in general, and this notion of entropy, and we said that entropy really is the right notion of randomness: if you want to measure how random a variable or a distribution is, entropy is the right way to do it. That leads to the maximum entropy formulation of probabilistic modeling, which says: when you want to learn a distribution, learn the distribution that has maximum entropy subject to whatever constraints you have. You've made some measurements, and those give you some constraints; beyond that, find the distribution that is maximally random. Whatever the setting is, whether you're modeling distributions over words or sentences, find the distribution that has as much entropy as possible subject to whatever constraints you've observed. Now, what we found is that finding the maximum entropy distribution is just a convex problem. That's fine; we can hand it off to a package and get the answer back, and that's great. But we also found that not only is it a convex problem, the solution has a certain form: it's one of these exponential families. And exponential families are this wonderful framework that contains all these different distributions: the family of Gaussians, the family of Poissons, the family of coin flips, the family of graphical models, and then the family of distributions over English words that satisfy certain types of constraints. All of these are just different exponential families. So that's how the two things come together: maximum entropy distributions are exponential families.

And so, returning to our original question: if we're given some prior distribution over English words, and we're given a bunch of measurements and constraints, what is the distribution closest to the prior that satisfies those constraints? What we derived earlier is that it's a member of the exponential family generated by the prior and the features, and there's only one member of that family that satisfies those constraints; the answer is that particular member of the family.

So let's just do a quick example here. Suppose I tell you that I have some distribution over R. Not much information, right? Now I tell you two little things: the mean of the distribution is zero and the variance is ten. Given these two pieces of information, what distribution should you fit? There are lots of distributions over the reals that have mean 0 and variance 10, and you could legitimately answer with any of them. But under the maximum entropy principle, the distribution you want to choose is the one that has these properties but is otherwise as random as possible, and that's going to be an exponential family. We've got the domain R, we have the features x and x squared, and we've already seen which exponential family over R is generated by these features: it's exactly the Gaussians. That tells us that the distribution we want is a Gaussian, and if we're fitting a Gaussian, which Gaussian do we choose? The one that has the right mean and variance. So the answer is immediate. We had a problem that seemed grossly underconstrained, an arbitrary distribution over the reals with a certain mean and variance, but under the maximum entropy principle there is a unique answer, and it's the Gaussian.
That was just an example; let's look at the broader framework. We have some sample space, like English words. We've measured some features, like the length of the word, and based on those measurements we've come up with certain constraints that our distribution should satisfy. And we have some reference prior, a prior distribution over English words that we liked. We want to find the distribution P that satisfies these constraints but is otherwise as close to the prior as possible, and that distribution is simply the member of this exponential family that satisfies those constraints. And it's unique. So here's what we'd like to know. This is the maximum entropy distribution; for our problem, this is the answer. But there are lots of other distributions that also satisfy these constraints. We've only introduced a few constraints, so there's one maximum entropy distribution, and lots of others with lower entropy. How different are they from the prior? Let's just calculate that. So this is the maximum entropy distribution, the one that belongs to the exponential family, and let P be any other distribution that satisfies those constraints. If you work out the math, any other distribution is further away from the prior than the exponential family distribution, and it's further away by exactly this much: the KL divergence from P to the exponential family distribution. That's just the way the math works out. You've got a bunch of constraints, you have to find a distribution; there's the exponential family distribution, which we're calling P star, and any other distribution is further from the prior than P star. So this quantity is positive, and the difference between them is exactly that divergence. The weird thing here is that this is actually an equality, not just an inequality.
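The fact that this is an exact equality is easy to verify numerically. Below is a small sketch of my own (not from the lecture): a uniform prior on four outcomes, one linear constraint (a target mean), the exponential family member found by tilting the prior and solving for eta by bisection, and a check that KL(P || prior) = KL(P || P star) + KL(P star || prior) for some other distribution P satisfying the same constraint.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tilt(prior, f, eta):
    """Member of the exponential family generated by the prior and feature f."""
    w = [pr * math.exp(eta * fx) for pr, fx in zip(prior, f)]
    z = sum(w)
    return [wi / z for wi in w]

def project(prior, f, c):
    """I-projection of the prior onto the constraint E[f] = c (1-D bisection)."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        eta = (lo + hi) / 2
        p = tilt(prior, f, eta)
        mean = sum(pi * fx for pi, fx in zip(p, f))
        lo, hi = (eta, hi) if mean < c else (lo, eta)
    return p

prior = [0.25] * 4               # uniform prior on {0, 1, 2, 3}
f = [0.0, 1.0, 2.0, 3.0]         # feature: the outcome itself
p_star = project(prior, f, 1.2)  # maximum entropy solution with mean 1.2
p_other = [0.4, 0.2, 0.2, 0.2]   # another distribution with mean 1.2

# Pythagorean identity: an exact equality, not just an inequality.
lhs = kl(p_other, prior)
rhs = kl(p_other, p_star) + kl(p_star, prior)
print(lhs, rhs)                  # the two numbers agree
```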
Geometrically, here's the picture. Here's the prior distribution over English words, and this is the exponential family distribution, which we know is the maximum entropy distribution. These are the constraints, that is, the distributions that satisfy our constraints. The constraints are linear, so the distributions satisfying them lie on a linear subspace; that subspace need not go through the origin, so we call it an affine subspace. Any distribution on this affine subspace satisfies the constraints and is therefore fair game, but the maximum entropy distribution is this one right here, the exponential family member. Now look at any other distribution that satisfies the constraints: this KL divergence is equal to this divergence plus that divergence. It's an equality, and it's a Pythagorean theorem. Here we're using KL divergence, but imagine we had done a Euclidean projection instead of an information projection. For a Euclidean projection, we'd take the prior, we have a linear subspace, and we'd project onto it at right angles. Then for any other point in the subspace, this squared plus this squared equals the hypotenuse squared. We're getting exactly the same kind of formula, the same Pythagorean theorem, but with squared Euclidean distance replaced by KL divergence. So information projection is really like Euclidean projection in this way: it also has a Pythagorean theorem, but with respect to KL divergence instead of squared Euclidean distance. Any questions about this picture?

Okay, so let me give you another picture, in case that one wasn't clear. Again, we're looking for a distribution over words. Let this be all distributions over words; we have a bunch of constraints, and this subspace is the set of distributions that satisfy the constraints.
There's just one of these distributions that's part of the exponential family. What does the rest of the exponential family look like? If we had K features, the exponential family has K parameters. Its members are all distributions, so they all lie in the simplex, and as those K parameters vary, you get a K-dimensional manifold. So the members of this exponential family form a K-dimensional manifold in the simplex, and that manifold intersects the constraint subspace in a single point, which is P star. And you can see that the dimensions add up. If we have s possible outcomes, the dimension of the simplex is s minus 1, because the probabilities have to add up to one, which reduces the dimension by one. If we have K constraints, each linear constraint reduces the dimension by one, so the dimension of the constraint space L is s minus 1 minus K. Then we have the K-dimensional family of distributions Q, and indeed the dimensions of L and Q add up to the dimension of the simplex, which is what we'd expect. So here are the various objects: we're looking for a probability distribution, which means a point in the simplex; we have a prior distribution, a point in the simplex; we have a bunch of constraints, an affine subspace of the simplex; and we want to find the point on that affine subspace closest to our prior. It turns out to lie in an exponential family, and that exponential family is a K-dimensional manifold within the simplex that intersects the subspace in just one point, which is both the maximum likelihood solution and the maximum entropy solution. So basically we have several different ways of saying the same thing: given a bunch of data that yields certain constraints, that special point P star can be thought of as the information projection of our prior onto the constraints, or the maximum likelihood distribution within that exponential family, or the unique intersection point of the constraints and the exponential family. And this is the duality between maximum entropy and maximum likelihood.

So let me finish up by giving you an algorithm for information projection. Information projection, again, is the process of taking the prior and the constraints, projecting the prior onto the constraint set, and returning a distribution. I said that's just a convex optimization problem: you can hand it off to a package and get back the answer. So why am I giving you an algorithm? Just to demonstrate that you can do some super simple things in this case, because of these Pythagorean theorems and the geometry of the situation. This algorithm is due to Csiszár, one of the giants of information theory, and it's very simple. You have your prior and you have these K constraints: 30% of words have length more than five, 45% of words end in e, and so on. Take the prior and project it onto the first constraint. Then take whatever you get and project it onto the second constraint, then the third, then the fourth and the fifth, and then keep cycling through the constraints. So you start with your prior distribution and you just keep looping forever, projecting onto the next constraint. You project onto one constraint at a time, which is easy, because one constraint means just one parameter, so each projection is just a line search. Here's what's happening. Say you have two constraints: constraint one is L1, constraint two is L2. You project onto one constraint, so now you satisfy L1; then find the closest point in L2, then the closest point in L1, and so on, back and forth. Normally this is not the sort of thing you can do. In general, in optimization, if you have multiple constraints, you have to satisfy them together; you can't just find the closest way to satisfy the first constraint while forgetting about the rest, then the closest way to satisfy the second, and so on. That will usually give you the wrong answer; you have to solve it in a global way. But here, because of the geometry of the problem, this kind of local method works, and let me explain why. At any given time you've got some distribution, and you're going to try to satisfy just one of the constraints. The one thing we know is that the final answer you're hoping to reach, P star, does satisfy that constraint. So take your current distribution and project it onto that one constraint. Now use the Pythagorean theorem: the distance from P star to the new projected point is less than the distance from P star to your starting point, the hypotenuse. So with each projection you actually get closer and closer to the target distribution P star, and the amount by which you get closer is exactly the KL divergence of the projection step, how far you projected. Because you have this Pythagorean theorem, you can do one constraint at a time instead of solving all of them together. Let me not say much more, except that if you like this sort of thing, you should also look at an algorithm called iterative scaling, which is kind of an amazing algorithm.
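The cyclic projection scheme just described can be sketched in a few lines (my own toy example, not the lecture's): two overlapping linear constraints on a four-outcome space, each single-constraint projection done by tilting the current distribution and bisecting on the one-dimensional parameter, cycling until both constraints hold.

```python
import math

def tilt(p, f, eta):
    """Exponentially tilt p by feature f with parameter eta, then renormalize."""
    w = [pi * math.exp(eta * fx) for pi, fx in zip(p, f)]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p, f):
    return sum(pi * fx for pi, fx in zip(p, f))

def project_one(p, f, c):
    """I-project p onto the single constraint E[f] = c: a 1-D line search."""
    lo, hi = -50.0, 50.0
    for _ in range(100):
        eta = (lo + hi) / 2
        q = tilt(p, f, eta)
        lo, hi = (eta, hi) if mean(q, f) < c else (lo, eta)
    return q

# Two constraints on a four-outcome space; the features deliberately overlap,
# so neither projection alone settles both constraints.
f1, c1 = [0.0, 0.0, 1.0, 1.0], 0.3
f2, c2 = [0.0, 1.0, 1.0, 0.0], 0.45

p = [0.25] * 4                 # start from the uniform prior
for _ in range(5000):          # keep cycling through the constraints
    p = project_one(p, f1, c1)
    p = project_one(p, f2, c2)

print(mean(p, f1), mean(p, f2))   # both close to their targets
```

Each pass through the loop can only move the iterate closer to P star, by the Pythagorean argument above, which is why this purely local scheme converges.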
Iterative scaling tries to solve the same problem, but instead of updating one feature at a time, it updates all of them, in a way where each update operates independently, not looking at the others. And when it updates all of them, you move off the probability simplex altogether. You start from your prior, which is in the simplex; after the update you're not even remotely where you need to be, you're operating outside the simplex entirely, and yet it's guaranteed to converge to something that is actually on the simplex. So it's kind of an amazing method, and there are many variants of it.

Let me just finish by mentioning one last thing. We started with KL divergence, and we said it's really unpleasant in many ways: it's not symmetric, it can be infinite, it seems a lot more unwieldy and unruly than Euclidean distance. And yet it has this projection property: KL divergence has exactly the same projection properties as Euclidean distance. What's going on here? Just as there's an umbrella framework of distributions, there's also an umbrella framework of distance measures, the Bregman divergences. This framework includes KL divergence, squared Euclidean distance, and a whole bunch of other distance functions, and they all have these Pythagorean theorems. Each of these distance functions is associated with an exponential family; every exponential family has its own personal distance function. Squared Euclidean distance belongs to the spherical Gaussian: if you look at the contours of equal density of the spherical Gaussian, you get squared Euclidean distance. The multinomial distribution gives you KL divergence, and as you look at different exponential families, their contours of equal density give you different distance measures. At first glance they all seem kind of crazy, just like KL divergence: they're not symmetric, they have all this unpleasantness, but they do have these nice convexity properties and Pythagorean theorems. And these Bregman divergences have turned out to be very useful in generalizing things like clustering and component analysis. For clustering we have the k-means algorithm, which uses squared Euclidean distance; that's great for Gaussian-like data, but if you have data that would be better modeled by, say, a Poisson, what is the right distance measure to use there? Just use the Bregman divergence associated with the Poisson distribution, and there's a version of k-means built on it that has all the same properties as k-means. Likewise, you can get generalized versions of principal component analysis. Principal component analysis is easily justified for Gaussian data, but it's a little bit more of a stretch when the data is really non-Gaussian; if there's a better exponential family that fits your data, you can do a version of principal component analysis geared towards the Bregman divergence for that family. Okay, so that's what I was going to say. Do you have any questions about this?

[Audience question about convergence rates.] I actually don't know of any rates at all. That's a really nice open question: to figure out the rates of either of these methods, Csiszár's method or iterative scaling. As far as I know there are only convergence proofs and no rates. But in practice these are not what is used; people just use quasi-Newton methods or other approaches, so basically nothing is known.
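The umbrella framework of distance measures has a simple generic recipe: every strictly convex function phi defines a Bregman divergence. A minimal sketch (my own, assuming scalar arguments for simplicity): phi(t) = t squared recovers squared Euclidean distance, and phi(t) = t log t recovers the unnormalized KL term associated with the Poisson family.

```python
import math

def bregman(phi, dphi, x, y):
    """Bregman divergence D(x, y) = phi(x) - phi(y) - phi'(y) * (x - y)."""
    return phi(x) - phi(y) - dphi(y) * (x - y)

# phi(t) = t^2 gives squared Euclidean distance:
sq = bregman(lambda t: t * t, lambda t: 2 * t, 5.0, 2.0)
print(sq)   # 9.0, which is (5 - 2)^2

# phi(t) = t log t gives x log(x/y) - x + y, the generalized KL term
# associated with the Poisson family:
d = bregman(lambda t: t * math.log(t), lambda t: math.log(t) + 1.0, 4.0, 2.0)
print(d)    # equals 4*log(2) - 2
```

Note the asymmetry: in general D(x, y) differs from D(y, x), just as with KL divergence, yet both instances satisfy the Pythagorean projection property discussed above.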
Info
Channel: Microsoft Research
Views: 8,339
Rating: 4.5912409 out of 5
Keywords: microsoft research, data visualization, analytics and platform
Id: zmUMBLEHhZg
Length: 70min 57sec (4257 seconds)
Published: Tue Jun 21 2016