The Fisher Information

Captions
Welcome to Mutual Information. My name is DJ. In this video we're going to learn about the Fisher information. This thing is an absolutely foundational idea which quantifies the amount of information an observation carries about a parameter. More generally, it provides a framework for reasoning about parameter uncertainty. It is a frequentist concept, which is kind of a dirty term these days, but it has grown beyond that world: Bayesians and machine learning practitioners love to borrow it for new ideas. To name a few, it's used in the Jeffreys prior, natural gradient methods, experimental design, and probably others that I just haven't heard of. So, to empower you with this idea, my approach will be to explain the goal of the Fisher information and then show how it's a clever way to achieve that goal. For the sake of a gradual explanation, I'm going to start in the simple one-dimensional case and then generalize to higher dimensions. The higher-dimensional case is really what you want for applications; that's where all the cool stuff happens. That said, let's jump in.

First, let's say a random variable y is related to a parameter, which is just a single number here, by some known probability density function. We write that generally with this expression, which is here to represent any function that accepts both an observation of y and a parameter value and then gives us back a number: the likelihood of that observation according to that parameter. Now here's the question: according to whatever this thing is, how well does an observation of y locate the parameter value that produced it? In fact, the goal is to measure how well an observation locates that parameter. If you're like me, the vagueness of this goal makes it hard to imagine what a solution would even look like, but what does help me is to consider cases at two extremes. For the first case, let's use the normal distribution with a large variance. What I mean by that is p of y given theta is the normal density function, where theta is the mean of the normal and we assume we know the variance, which we'll set to a relatively high value of 25. For the other case, we'll consider the exact same thing except with a small variance of 1.

Very frequently, when we want to go from observations to parameter estimates, we consider the log likelihood function. To show that, I'm going to simulate data according to the large-variance case first and then, for each observation, draw the log likelihood function. Each one of these lines tells us how likely one of our observations is as we vary our mean parameter along the horizontal axis. Keep in mind the data is generated according to some parameter value which I haven't told you; we'll call that value the true parameter value, naturally. Let's consider the small-variance case as well. Okay, now take a step back and look at these two cases. I have not told you the true parameter value, but you tell me: in which case would you have an easier time guessing that parameter value? Isn't it obvious? It's clearly the right side, the small-variance case. Effectively, each line shows one observation's vote of where the true parameter is, and in this case all the observations overlap on a narrow range, making it easier to pinpoint that value. As you may be able to guess, that true parameter value is 5.
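To make that setup concrete, here is a minimal Python sketch of the simulation just described. The true mean of 5 comes from the video; the sample size of 20, the random seed, and the grid of candidate means are illustrative choices of mine, not values from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_star = 5.0                        # true mean (revealed in the video to be 5)
mu_grid = np.linspace(-15, 25, 401)  # candidate mean values (the horizontal axis)

def normal_loglik(y, mu, sigma2):
    """Log-likelihood of an observation y under N(mu, sigma2)."""
    return -0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2)

for sigma2 in (25.0, 1.0):           # large-variance case first, then small-variance
    y = rng.normal(mu_star, np.sqrt(sigma2), size=20)   # 20 simulated observations
    # one log-likelihood curve per observation, evaluated over the grid of means
    curves = normal_loglik(y[:, None], mu_grid[None, :], sigma2)
    # each curve peaks at its own observation; with sigma^2 = 1 the peaks cluster
    # tightly around mu_star, with sigma^2 = 25 they spread out
    peaks = mu_grid[np.argmax(curves, axis=1)]
    print(f"sigma^2 = {sigma2:>4}: curve peaks range from "
          f"{peaks.min():.1f} to {peaks.max():.1f}")
```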
Going forward, we'll refer to that true value as mu star, and so we might say that in the small-variance case our observations provide more information about where mu star is. In fact, I should emphasize: when we say information, it's this narrowness that we're after. Honestly, this view is all you need to understand the goal of the Fisher information. But now, separately, the question becomes: how does the Fisher information turn this narrowness into a number? How do we measure that information? For that, we'll need a new view.

First, let's redraw what we just saw; here we're showing the small-variance case. Next, let's pick a parameter value totally arbitrarily to plug into these functions. I'll call this evaluation point mu naught. If we plug mu naught in, we'll get back a big list of numbers, and over here we'll see an estimated distribution of those numbers. Finally, we're in a position to see the key idea of the Fisher information. The idea is to consider the slopes of these functions at our evaluation point. If we recall a smidge of calculus, we know the functions that generate these slopes are the derivatives, which we can draw as well. In general, the derivative of a log likelihood with respect to the parameters has a special name: we call them score functions. As you can see, in this case the score functions happen to be linear. As we'll see, all the action hangs out in the distribution of scores at our evaluation point, so just like we did below, let's show that distribution, along with its mean line.

To get a feel, let's move the evaluation point around a bit. The first thing I bet you'll notice is that the shape of the score distribution stays the same. Well, please ignore that observation; it comes from the fact that the normal distribution has linear score functions, and that's not true in other, non-normal cases. But there is one thing that generalizes to all other cases: when we evaluate these score functions at the true parameter value, the mean of the scores is zero. To provide some real simple intuition, the average score is zero at the true point because it has to be. If it weren't, the data would be suggesting a more likely parameter value is somewhere else, but in this setup that can't be: the most likely parameter value must be our true parameter value, because it generated our data.

And now we're just about ready to see the Fisher information, but I want you to guess what it is. To help, I'm going to move between our two cases. We're starting in the informative case with a small variance; now let's increase it. Hmm, getting any ideas? Anything jumping out? It seems to me that in the informative case there is a wider variance in the scores, while in the uninformative case there is a small variance, which means they are all hanging around zero.
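Here is a small numerical check of that comparison (a sketch under my own choice of sample size and seed, not the video's). For a normal with known variance, the score with respect to the mean works out to (y − μ)/σ², so we can evaluate it directly at the true mean in both cases and compare how widely the scores spread.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star = 5.0            # true mean, as in the video
n = 100_000              # illustrative sample size

for sigma2 in (1.0, 25.0):   # informative (small variance) vs. uninformative case
    y = rng.normal(mu_star, np.sqrt(sigma2), size=n)
    # score of N(mu, sigma2) with respect to mu, evaluated at the true mean:
    # d/dmu log p(y | mu) = (y - mu) / sigma2
    scores = (y - mu_star) / sigma2
    print(f"sigma^2 = {sigma2:>4}: mean of scores = {scores.mean():+.4f}, "
          f"variance of scores = {scores.var():.4f}")
```

The mean of the scores comes out near zero in both cases, and the scores spread far more widely in the small-variance, informative case, which is exactly the pattern being described.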
To gain some intuition, think about what a single small score value is telling you. If it's 0.001, that data point recommends shifting mu by a thousand if you'd like to increase the log likelihood by 1; if it's minus 0.001, it's saying you need to shift mu by a thousand in the other direction. So a lot of score values huddled around zero collectively tell you to move nowhere, but individually they provide wildly different recommendations about where the true value lives. In other words, a small variance of the scores implies a large range of possible mu values.

This is the key idea behind the Fisher information, and finally we can state its definition: the Fisher information is the variance of these score functions when we evaluate them at the true parameter value. In other words, it's a measure of the width of this distribution when it's centered on zero. To put it on the screen, I'll show two times the square root of the Fisher information as this horizontal width. Now look: as we move between the uninformative and informative cases, the Fisher information follows, and hopefully you can feel mechanically what's going on. Remember, the goal of the Fisher information is to measure narrowness, and this does precisely that. As we become more narrow, the slopes vary more widely and the Fisher information increases.

But there's more, as you may have guessed from this big empty space. Let's once again consider the slopes at our evaluation point, but this time the slopes of the score functions. In fact, in the exact same way the middle row relates to the bottom row, let's create a top row which relates to the middle row. That means we'll plot the second derivatives and an estimated density of the second derivatives at the evaluation point. As you can tell, in this normal distribution case the second derivative is constant. Okay, now I can state this result: the Fisher information, which is the variance of the scores, is also, if it exists, equal to the negative expected value of the second derivatives. Now, the normal distribution isn't a great case to show that, since the second derivative is always one value, so let's transform it into a weirder case where that isn't true. And now we can see it. You may reasonably ask why this is true. I will avoid the proof, because I do not know it, but I can offer intuition: you expect a wide variance in your slopes if your slopes are changing a lot, and slopes that are changing a lot have extreme slopes themselves, meaning the average second derivative is extreme. So it's not crazy to imagine the expected second derivative tracks the variance of the slopes. If that wordy explanation doesn't do it for you, just try to feel the mechanics between the extremes: when the observations make it easy to locate a parameter, your slopes vary a lot, and when they vary a lot, the average second derivative is extreme.

Now, the view I've given so far is a bit too simple to be useful; I don't think anyone really cares about the one-dimensional case. So let's generalize all the way to two dimensions. I'll do that by doing the exact same exercise, except using a 2D multivariate normal. The parameters in this case will be the vector of two means, and we'll assume the covariance matrix is known, just like we assumed the variance was known in the earlier case. Since two dimensions is a heavier lift visually, we can't show all six of these graphs, so we'll generalize only two of them: the first will be the one showing the log likelihood curves and their slopes, and the second will be the distribution of those slopes. Ready? All right, let's jump back into the black abyss.

Now let's think: in the 1D case, the x-axis was our one-dimensional parameter space. We need to generalize that to a 2D parameter space, so let's draw that, with the true parameter point called out. The next thing we need to do is draw a 2D version of those log likelihood curves we saw earlier. Let's think: those curves received a single parameter value as input and gave us back a log likelihood. Now they need to accept two parameter values, so we could think of the 2D version as domes that cover this 2D plane, and typically such things are represented with contour lines, like this. Next, we need a 2D version of a slope. Slopes told us how changing the parameter value at a point would change the log likelihood; the analogue in higher dimensions is the gradient vector, which looks like this. The gradient vector tells you the direction you should choose if you want to increase the function as much as possible, and the length of the gradient tells you how much the function will increase if you take that step. Earlier, we didn't play with just one likelihood function; we played with many, one for each observation sampled using the true parameter. So, to avoid a blizzard of contour lines, I'll represent each likelihood function with a single ellipse, and now we can represent many.

So, staring at this setup, can you guess what the 2D Fisher information will tell us? Seriously, pause the video and scratch your head on this one; it's worth thinking about. Just try to carry the analogy from 1D to 2D. Done? Okay, let's do it. In the 1D case, the Fisher information told us the variance of the log likelihood slopes; more generally, it describes the distribution of those slopes. So in this case, we need to describe the distribution of these gradient vectors. To do that, let's plot the 2D histogram of these vectors: if a square is bright, that means there are a lot of gradient vectors with the coordinates of that square. With that, I can say: whatever the 2D Fisher information is, it's going to describe this histogram. Before I tell you what it is, I should point something out. On the left, the gradients randomly point out from the center; they don't agree on any one direction. In the 1D case, we saw this behavior as the slopes averaging out to zero. To refresh, this is because we are evaluating our gradients at the true parameter value, from which all our observations are generated. If we were evaluating at a point that wasn't the true parameter vector, these gradients would favor one direction over another, and the Fisher information wouldn't apply. With that, let's think about what this Fisher information must do: we need to describe this 2D histogram centered at zero. In the 1D case, it was the variance, and some of you who know your stats may be able to guess the answer here: it's the covariance matrix, a square grid of numbers that describes how the elements of a vector sampled from a distribution vary together. So in this case, the higher-dimensional Fisher information is the covariance matrix of the gradient vectors.
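Here is a sketch of that statement in code. The particular covariance matrix, true mean, sample size, and seed below are illustrative choices of mine, not the video's; the analytic comparison uses the standard fact that, for a multivariate normal mean with known covariance Σ, the covariance of these gradients is Σ⁻¹.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_star = np.array([5.0, 5.0])            # true mean vector (illustrative)
Sigma = np.array([[1.0, 0.8],             # known covariance with strong
                  [0.8, 1.0]])            # correlation (illustrative values)
Sigma_inv = np.linalg.inv(Sigma)

# observations generated at the true parameter vector
y = rng.multivariate_normal(mu_star, Sigma, size=200_000)

# gradient of the log likelihood with respect to the mean vector,
# evaluated at the true mean: Sigma^{-1} (y - mu_star)
grads = (y - mu_star) @ Sigma_inv.T

# the 2D Fisher information: the covariance matrix of these gradient vectors
fisher_hat = np.cov(grads, rowvar=False)

print("covariance of the gradients:\n", fisher_hat.round(3))
print("Sigma^{-1}:\n", Sigma_inv.round(3))
```

The estimated covariance of the gradients matches Σ⁻¹ closely, and its off-diagonal entries capture exactly the kind of correlation the histogram is showing.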
I can just show you: the Fisher information tells us this ellipse. How it tells us that ellipse isn't important; think of it as a compact summary of this histogram, and then let's just focus on the histogram. To get a feel, let's consider the simplest case, where the gradient elements aren't correlated. In this case, the variance of each gradient element is all there is to know, and we can interpret each of those in much the same way we did in the 1D case: if a gradient vector element varies a lot, we can nail down the parameter associated with it very well. Now let's introduce some serious correlation. To intuit this, let's consider only parameter values that fall on this red strip. The question is: if you were only looking at parameter values here, could you easily determine which ones are likely to be near the truth? I mean, yeah, it looks like it would be easy; the observations only fall on a small piece of this diagonal line. And it's no accident that this is also a direction in which the histogram varies quite a bit. In other words, a direction in which the gradients vary widely is a direction where you can easily separate likely from unlikely true parameter values, and it's for exactly the same curvature reasons we saw in the 1D case. As you can guess, a direction like this one is where we have difficulty determining the true parameter values, and it's a direction along which the gradients don't vary much at all. This is the Fisher information: it's telling us how well observations will separate likely from unlikely true parameter values along any given direction, and it does that using the distribution of slopes at the true parameter values.

But what about the second-derivative stuff? Does that work here too? Yes, 100 percent. There's a higher-dimensional analogue of the second derivative called the Hessian, and the negative expected Hessian turns out to be equal to our gradients' covariance matrix. Showing that would be tricky, but stating it? That's easy. In fact, we should restate everything with the glory of math. After all, my goal is to help you apply this stuff; these visuals are nice and all, but to be useful they need to connect to the papers and textbooks that handle this beast. Ready? Okay. The Fisher information is a measure of the amount of information an observation of a random variable carries about a parameter, according to the probability function that relates them. It's defined with this expression, which I'll break down. First, we write it as a function of a true parameter value, theta star, to generalize it beyond one specific true parameter. Second, this means the variance of something that depends on y, where the random behavior of y is according to that theta star. Third, this means we'll be evaluating each score function at the true theta, giving us a bunch of numbers whose variance we're interested in. And, as we mentioned, that variance turns out to be equal to the negative expected value of the second derivative. These two expressions communicate everything we covered with those six panels we saw earlier. Sidebar: I just want to point out that this is why mathematical notation is so useful; with a very compact expression, we can communicate a big idea very precisely.
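The on-screen expression isn't reproduced in the captions; in standard notation (my reconstruction, which may differ cosmetically from the video's), the definition being described is:

```latex
\[
\mathcal{I}(\theta^{*})
  = \operatorname{Var}_{Y \sim p(y \mid \theta^{*})}
    \!\left[\left.\frac{\partial}{\partial\theta}\log p(Y \mid \theta)\right|_{\theta=\theta^{*}}\right]
  = -\,\mathbb{E}_{Y \sim p(y \mid \theta^{*})}
    \!\left[\left.\frac{\partial^{2}}{\partial\theta^{2}}\log p(Y \mid \theta)\right|_{\theta=\theta^{*}}\right]
\]
```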
Next, let's move on to the multi-dimensional case. I covered it in 2D, but just take it as given that those same ideas generalize to n dimensions. Ready? Okay. The higher-dimensional form is called the Fisher information matrix, which generalizes the Fisher information to handle a vector of parameters rather than a single one. In this case, the matrix is a square grid of numbers where the entry in the i-th row and j-th column is given by this expression. If this expression looks unfamiliar, that's because it's a different way to write covariance: since the expected partial derivative with respect to any parameter is zero, we can write the covariance as this expected product; it just falls out of the definition of covariance. If this algebra looks like a lot, that's fine; it's not important to internalize it. What is important is to know the overall statement, which I'm about to say: the Fisher information matrix is the covariance matrix of the log likelihood gradient with respect to the parameters, when we evaluate that gradient at the true parameter vector and the randomness comes from that true parameter vector. That's a lot. Okay, let's also state that last part regarding the Hessian: an element of the Fisher information matrix also turns out to be equal to the negative expected value of the second partial derivative for the two parameters associated with that element. If that's also hard to digest, just think in terms of that 1D graphic and trust that it works in higher dimensions.

And that's all of it. If you have any questions, please comment and I'll do my best to answer. Anything worth clarifying after the fact I'll include in the description, and there you'll also find my go-to sources on this topic in case you want to learn more. And finally, if you enjoyed this video, please like and subscribe. Content like this is the content I'll continue to make, especially if I can get your support.
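For reference, the matrix form described above, again in standard notation (my reconstruction): the (i, j) entry of the Fisher information matrix is

```latex
\[
\big[\mathcal{I}(\theta^{*})\big]_{ij}
  = \mathbb{E}_{Y \sim p(y \mid \theta^{*})}
    \!\left[\left.\frac{\partial \log p(Y \mid \theta)}{\partial \theta_{i}}\,
            \frac{\partial \log p(Y \mid \theta)}{\partial \theta_{j}}\right|_{\theta=\theta^{*}}\right]
  = -\,\mathbb{E}_{Y \sim p(y \mid \theta^{*})}
    \!\left[\left.\frac{\partial^{2} \log p(Y \mid \theta)}{\partial \theta_{i}\,\partial \theta_{j}}\right|_{\theta=\theta^{*}}\right]
\]
```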
Info
Channel: Mutual Information
Views: 5,380
Rating: 4.9430604 out of 5
Id: pneluWj-U-o
Length: 17min 28sec (1048 seconds)
Published: Tue May 04 2021