Frequentism and Bayesianism: What's the Big Deal? | SciPy 2014 | Jake VanderPlas

Reddit Comments

Does anybody else hate calling them -isms? It just sounds so ugly.

And I don't believe the frequentist approach is just to do all-likelihood stuff and ignore the prior. That's just being Bayesian with no prior (a perfectly valid approach btw)

And his example with confidence intervals ... You can most definitely say: "95% of the time, the value is inside the confidence interval", and be frequentist about it. The random variable is the interval and not the value, and you can definitely build intervals such that 95% of the time, the random interval will hold the true value, no?

1 point · u/Hairy_Hareng · Jun 23 2015
Captions
This talk is about frequentism and Bayesianism. Hello, thank you, I'm glad to be here. My name is Jake; I work at the University of Washington in the eScience Institute, and we'll have a table outside to talk about the kinds of things we're doing there. It's really some exciting work.

What I want to talk about today is this divide in statistics between frequentism and Bayesianism. This is something I find in my teaching that a lot of people have heard of but don't really know. From my own experience, I remember hearing these words as a grad student; I even took a course in computational physics where the professor spent a whole day telling us about the differences, and at the end of the day I had no idea what the differences were. So my goal in this talk is to go through the essential differences of what frequentist and Bayesian statistics are. It's this divide in the statistics world that lots of people talk about, but in my experience not a lot of people know exactly what the whole issue is. I also want to discuss some of the tools available that allow you to do frequentist and Bayesian analysis in Python. And, being who I am, it's all going to be a thinly veiled argument for why you should be a Bayesian, so get ready for that too.

What this talk is not: it's not a complete discussion of this by any means, and it's not a complete discussion of any of the examples. I'm anticipating lots of people raising their hands and talking about the things that I overlooked. If you want to know more, I think I did a good job in the proceedings paper that accompanies this, which should come out within the next week or so.

So, frequentism and Bayesianism. What it basically comes down to is a question of philosophy, and that philosophy is what you mean by probability. This whole thing is just a question of what probability is, and everything else is derived from there.

What is probability? For frequentists, probability is something related to the frequencies of repeated events. If we say there's a 50-50 chance that a coin toss will land heads or tails, the reason we know that is that if we toss a coin a thousand times, somewhere around five hundred of those will be heads and five hundred will be tails. For Bayesians it's a little bit different: probability is fundamentally related to our own certainty or uncertainty about events. When I say as a Bayesian that I'm 50 percent sure this coin will land heads, it's not because I've sat there and tossed that coin a thousand times, or even imagined doing that; it's because I've assigned some sort of probability to quantify the uncertainty in my knowledge. This fundamental difference boils up through all the methods that have been developed in both of these areas, so keep it in mind: what we're talking about is a difference in the definition of probability.

The immediate consequence is that we're analyzing different things. Since frequentists are talking about the frequencies of repeated events, they analyze the variation of data, and of quantities derived from the data: we measure something, we can make those measurements over and over again, and we can analyze how those measurements might vary given our model. Bayesians are analyzing the variation of beliefs about parameters. This is the core consequence: frequentists talk about models as being fixed, with the data varying around them; Bayesians talk about the observed data as being fixed (this is what we've observed) and the models varying around those. It's this opposite approach to what's going on in the world.
So I want to do some quick examples, and I'm going to run through them with a bit of mathematical formalism and a bit of code here and there; I'm hoping this will make some of these issues more clear.

For example, say we're looking at a star: we have a telescope, we want to know how bright that star is, and we measure its flux. Here we imagine measuring the flux 50 different times, and for each measurement i you have some value F_i and an error e_i on it. The question is: given these observed flux values, what's the best estimate of the true flux? You have these repeated measurements, and you want to figure out what underlies them.

The frequentist approach uses something called maximum likelihood. The equation for a single measurement is basically a Gaussian centered on the observed value, with a width given by the error:

P(D_i | F_true) = exp( -(F_i - F_true)^2 / (2 e_i^2) ) / sqrt(2 pi e_i^2)

This is the probability of a single measurement given the true flux, so the true flux is somewhere in there. When you have multiple measurements, you multiply those probabilities together into something called the likelihood: L(D | F_true) is the product over i of P(D_i | F_true). This gray normal curve here is a single measurement; if we take a second measurement, the likelihood is the product of the two, a tighter curve between them; and if we continue multiplying more and more data points together, that likelihood gets tighter and tighter around the central value. Once you put all the data points in, you have a likelihood that, by its nature, zeroes in on the true value. Here we generated our data with a true flux of 1000, which is right where that narrow red likelihood curve is centered.

In frequentism you can actually do this analytically, and essentially what you come up with is a weighted average. With weights w_i = 1 / e_i^2,

F_est = sum(w_i * F_i) / sum(w_i), with error sigma_est = sum(w_i)^(-1/2)

You sum up all the values times the weights and divide by the sum of the weights; you're basically taking a (weighted) mean of all the values to get your estimate. This is the kind of thing you'd do automatically: if you just wanted a common-sense answer, you'd maybe take the average of all the values you got, and from the likelihood approach you can say that's the correct thing to do in the frequentist regime. In Python it's just a couple of lines of code, and for our points we get 999 ± 4, so we basically recover our input. Is that all clear? We're doing this maximum likelihood thing: we're multiplying all those probabilities together.
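Those "couple of lines" look roughly like the following minimal sketch (the data generation, with photon-counting-style errors, is my assumption made to mimic the example, not the speaker's actual code):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 50 flux measurements of a star with true flux 1000.
# Assumed setup: photon-counting-like errors of roughly sqrt(flux).
F_true = 1000
N = 50
e = np.sqrt(F_true) * np.ones(N)           # per-measurement errors
F = F_true + e * rng.standard_normal(N)    # observed fluxes

# Maximum-likelihood estimate: inverse-variance weighted mean.
w = 1.0 / e ** 2
F_est = np.sum(w * F) / np.sum(w)
sigma_est = np.sum(w) ** -0.5

print(f"F = {F_est:.0f} +/- {sigma_est:.0f}")  # ~999 +/- 4, as in the talk
```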
For the Bayesian approach, it's a little bit different. What we're interested in fundamentally is a probability: when I write P(F_true | D), with that little bar in the middle, that says we're asking for the probability of the true flux given the data that we observed. That is what Bayesians are looking for, and in order to compute it they use something called Bayes' theorem, which is an identity of probability that you can prove and apply:

P(F_true | D) = P(D | F_true) * P(F_true) / P(D)

What Bayes' theorem says is that you have this posterior, P(F_true | D), and the posterior is the thing you're interested in: the probability of my model value given the data that I've observed. Bayes' theorem allows you to turn that around and ask about it in terms of other quantities: the likelihood P(D | F_true), the prior P(F_true), and the model evidence P(D). This is the vocabulary you need as a Bayesian. The posterior on the left is the value we're interested in, the brightness of our single star. The likelihood is the same thing we saw in the frequentist approach: essentially we're just multiplying all those probabilities together, and we get that narrow red curve in the middle.

The thing that gets more controversial is the prior. In order to flip these probabilities around mathematically, we need to put in a prior: we need to say what the probability distribution of the value we're interested in is before we take any data. You could build that prior based on all the other stars in the sky (how bright are all the stars in the sky? we know our star is among those); that's an empirical prior. Or we can use what's called an uninformative prior: we just say, well, I don't know anything, so let's make all fluxes equally probable. In practice that's what a lot of Bayesians do, and in practice that's what frequentists complain about the most. The last term is the denominator, the model evidence; it's interesting in some situations, but for our purposes we can treat it as a normalization and ignore it.

With a flat prior, the Bayesian is basically multiplying the likelihood by one, so we get the same result as the frequentist: 999 ± 4 for this flux.
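For completeness, here is a minimal sketch of the same computation done the Bayesian way, with the posterior evaluated on a grid. The flat prior matches the talk; the grid bounds and the grid evaluation itself are my assumptions (the talk and paper do this analytically or with MCMC):

```python
import numpy as np

rng = np.random.default_rng(42)

# Same simulated data as in the frequentist sketch above.
F_true, N = 1000, 50
e = np.sqrt(F_true) * np.ones(N)
F = F_true + e * rng.standard_normal(N)

# Posterior on a grid: with a flat prior, the posterior is proportional
# to the likelihood, i.e. the product of Gaussians over all measurements.
grid = np.linspace(980, 1020, 2001)
log_like = -0.5 * np.sum(
    ((F[:, None] - grid[None, :]) / e[:, None]) ** 2, axis=0
)
post = np.exp(log_like - log_like.max())
post /= post.sum()  # normalize over the grid (the "evidence" term)

# Posterior mean and standard deviation: should match 999 +/- 4.
mean = np.sum(grid * post)
std = np.sqrt(np.sum((grid - mean) ** 2 * post))
print(f"F = {mean:.0f} +/- {std:.0f}")
```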
So what we see is that in these extremely simple problems, frequentist and Bayesian results are often indistinguishable; we don't get different answers. But there are some cases, as things get more complicated, where the differences become apparent. I've listed a few of these: the handling of nuisance parameters, parameters that you don't really care about in the end but that are important for the analysis; the interpretation of uncertainty, which is a really important one; the incorporation of prior information (for example, if you're trying to learn about the expansion of the universe, and the Cosmic Microwave Background tells you something, and you want to add in extra information from supernovae, incorporating that prior information is something Bayesian analysis does well); and the comparison and evaluation of models. We're going to focus on two of these: nuisance parameters and uncertainty.

Looking at nuisance parameters, we'll go back to a situation that Thomas Bayes himself proposed back in the 1760s; this formulation is a more recent one by Eddy in 2004. Alice and Bob have a gambling problem, and their friend Carol designs a nice game for them to play. Carol takes a ball and rolls it down a table, and that ball settles somewhere. After that, depending on the position of that first ball, Bob and Alice each have a different area of the table where they can score. Carol keeps rolling balls down the table; if a ball lands in Bob's area he gets a point, and if it lands in Alice's area she gets a point, and the first person to get six points wins. It seems pretty simple: you can figure out where things are and who has the better odds of winning. Where it gets interesting is when you cover the table and make it a black box. Now all we know are the results: Alice wins, Bob wins, Alice wins, Alice wins, Bob wins, and we don't know anything about the model inside. This is an analogue of what we do as statisticians: we have a model that's a black box generating data, and all we get to look at are the data themselves.

Here's the question: in a certain game, Alice has five points and Bob has three points. What are the odds that Bob will go on to win? If you think about this a little, you'll say, well, you need to know the division of the table. That division of the table is an example of a nuisance parameter: it's really important for the calculation, but in the end we don't care about it. We're not really trying to estimate where that first ball sits on the table; we just want to know the result, how much money Alice should put down on this bet.

A frequentist approach, one that might occur to someone who's been working on this: we need to estimate the location of the ball, basically how probable Alice is to score. A quick maximum likelihood estimate gives p = 5/8, because five of the eight balls landed on Alice's side. We know that for Bob to win, he needs to win the next three rolls, so we compute (1 - p)^3 and get a probability of 0.053, which is about 18 to 1 odds against Bob. That seems reasonable, right?

The Bayesian approach is a little bit different: it involves marginalization. When you have a nuisance parameter, a parameter you don't really care about, in Bayesianism you integrate it out. It looks like this: the probability of Bob winning given the data is the integral, over all possible values of p, of the joint probability of Bob winning and that value of p:

P(B | D) = ∫ P(B, p | D) dp

This is what Bayesians are doing when they marginalize: they're just getting rid of parameters they don't care about. You do some algebraic manipulation (I'm not going to go into the details of how to solve this; that's in the paper), and you find that the odds are 10 to 1 against Bob winning after the marginalization.

So here's what you've got: the frequentist says 18 to 1 odds, the Bayesian says 10 to 1 odds, and the question is, who's right? I'm not going to tell you who's right; I'm going to tell you what's different. The main difference is that the Bayesian approach allows this nuisance parameter to vary, while the frequentist approach keeps the nuisance parameter fixed at its estimated value.
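As comes up in the question at the end of the talk, this disagreement is empirically checkable. Here is a minimal brute-force sketch (my own construction, not code from the talk) that simulates many covered-table games and conditions on the observed 5-3 score:

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000

# Carol's first roll fixes p, Alice's chance of scoring on each ball.
p = rng.uniform(0, 1, trials)

# Roll 11 more balls per game; first to six points wins, so a game in
# which Alice leads 5-3 after 8 rolls is decided within 3 more rolls.
rolls = rng.uniform(0, 1, (trials, 11)) < p[:, None]  # True = Alice scores

# Condition on the observed data: Alice 5, Bob 3 after eight rolls.
observed = rolls[:, :8].sum(axis=1) == 5

# Bob wins only if he takes all three of the remaining rolls.
bob_wins = (~rolls[:, 8:]).all(axis=1)

p_bob = bob_wins[observed].mean()
print(f"P(Bob wins) ~ {p_bob:.3f}, odds ~ {(1 - p_bob) / p_bob:.1f} to 1")
# ~0.09, i.e. roughly 10 to 1 against: the marginalized Bayesian answer.
```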
Now, there are probably some people in here thinking that I'm full of it, because frequentists can allow nuisance parameters to vary in certain ways: you can explore the sampling distribution, you can transform the problem, you can do various things like that. But what I would say is that the Bayesian approach offers a much more natural way to allow these nuisance parameters to vary. Essentially, the difference is that the frequentist takes a slice of this joint probability, at the estimated value of p, and gets a very narrow result; it's not really a posterior, but it is a very narrow result. The Bayesian takes the whole joint probability and squishes it together horizontally, and in the process gets a much wider result. That's the difference: taking a little slice, versus taking the whole thing and integrating (squishing) it together.

A second example, and this is the one I think is the most important difference between frequentism and Bayesianism: the handling of uncertainties. When someone gives you an answer and says the flux is 999 ± 4, what does that plus-or-minus mean? It turns out that because of the philosophical differences between these approaches, it means subtly different things, and it's different in a way that a lot of people miss.

Frequentists talk about something called a confidence interval: if this experiment is repeated many times, in 95% of those cases the computed interval will contain the true value. That's a 95% confidence interval. A Bayesian says: given our observed data, there is a 95% probability that the value lies within the credible region. These seem really similar, but notice that the things that vary and the things that are fixed are the opposite. The frequentist keeps the model parameter fixed and says the confidence interval itself varies: the interval is derived from the data, which is a random quantity generated by the model. The Bayesian keeps the credible region fixed and lets the value of the model parameter vary: it's our belief about the model parameter that moves through the space. So they're a little bit opposite, and this is why, for a long time, the Bayesian problem was called the inverse probability problem: you're basically taking the frequentist problem and turning it on its head.
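To make the frequentist statement concrete, here is a minimal simulation sketch (my own construction, reusing the flux setup from earlier) that repeats the experiment many times and checks how often the computed interval covers the true value:

```python
import numpy as np

rng = np.random.default_rng(1)
F_true, N, trials = 1000, 50, 10_000
e = np.sqrt(F_true)

# Repeat the 50-measurement flux experiment many times.
F = F_true + e * rng.standard_normal((trials, N))

# 95% confidence interval from the weighted mean (errors are equal here,
# so it is the plain mean +/- 1.96 standard errors).
F_est = F.mean(axis=1)
sigma = e / np.sqrt(N)
covered = np.abs(F_est - F_true) < 1.96 * sigma

print(f"coverage: {covered.mean():.3f}")  # ~0.95, by construction
```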
This ends up having some interesting consequences for certain problems, and I want to illustrate it with a truncated exponential example. This is something that Jaynes, a physicist from the mid-20th century, wrote about in 1976. We consider a model where nothing can fail before a certain time theta, and after that time failures occur with an exponential distribution. In the example Jaynes used, a chemical inhibitor keeps a device from failing; say that after ten minutes the inhibitor runs out, and the devices start failing with this exponential decrease. I put this in a blog post recently, and I thought it was pretty cool that I got an email from a guy at the Institute for Health Metrics and Evaluation at UW who said this is the exact model they use for mosquito nets in Africa. So it turns out this model is actually applicable; people are using this sort of thing.

The question is: you observe some failure times of this device (or you observe when the mosquitoes get through the nets), and you want to estimate from that how good the mosquito nets are, how long they last on average before they fail. We want 95 percent bounds on this parameter theta.

For the common-sense approach, we can look at this and say: it's impossible for these to fail while the inhibitor is still there; the failures start only after a certain time. So each point we observed has to be greater than that inhibitor time, and since the smallest observed point here is 10, we can immediately say that theta has to be less than 10. That's just common sense; we're not applying any statistics.

(Oh, really? I'm running over; I'm going to finish this quickly.) The frequentist approach here is the unbiased approach: you construct an unbiased estimator (I won't go into the details, but I can assure you it is the correct unbiased estimator), you compute its sampling distribution, and what you find is that the 95% confidence interval is between 10.2 and 12.2. So we've said by common sense that theta should be less than ten, but frequentism tells us it's between 10.2 and 12.2. Let's see what the Bayesian does. Again we use Bayes' theorem: we compute the likelihood with a flat prior, we get a posterior, we draw the limits, and Bayes says 9.0 to 10.0.

So immediately you're thinking frequentism is wrong. But actually frequentism isn't wrong; what we're seeing here is that frequentism is answering a different question than the one we expect. Bayesianism is making a probabilistic statement about the model parameter, that theta we're interested in, given the fixed region we've computed. Frequentism, on the other side, is making a probabilistic statement about the recipe for computing this bound, given a fixed model parameter.
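To pin down those numbers, here is a minimal sketch of both calculations. The data set is my assumption (three illustrative failure times with smallest point 10), and the frequentist interval here uses a normal approximation to the sampling distribution, which lands close to, but not exactly at, the quoted 10.2 to 12.2 (the talk uses the exact sampling distribution):

```python
import numpy as np

# Illustrative failure times (assumed data with smallest point 10).
# Model: p(x | theta) = exp(theta - x) for x >= theta, zero otherwise.
D = np.array([10.0, 12.0, 15.0])
N = len(D)

# Frequentist: E[x] = theta + 1 and Var[x] = 1, so mean(D) - 1 is an
# unbiased estimator; a normal approximation to its sampling
# distribution gives an approximate 95% confidence interval.
theta_hat = D.mean() - 1
ci = (theta_hat - 1.96 / np.sqrt(N), theta_hat + 1.96 / np.sqrt(N))
print(f"approximate 95% confidence interval: ({ci[0]:.1f}, {ci[1]:.1f})")

# Bayesian: with a flat prior the posterior is proportional to
# exp(N * theta) for theta <= min(D) and zero otherwise, so the 95%
# highest-density credible region ends at min(D).
lo = D.min() + np.log(0.05) / N
print(f"95% credible region: ({lo:.1f}, {D.min():.1f})")  # (9.0, 10.0)
```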
I planned to do way too much in here, so I'm going to end with a visual description of this, because I think it will hit home. The Bayesian credible region works like this: we create an interval from the observed data, and of all the parameter values consistent with our belief structure, ninety-five percent fall within that interval. The frequentist version says: we have one parameter, and we construct a whole ensemble of confidence intervals, and 95% of those intervals will contain the parameter; the problem is that right now we happen to have chosen this one. So we can't say that this value has a 95% chance of sitting in this particular interval, but the frequentist result is still correct in the sense of the long-term limiting frequency of this recipe for constructing intervals.

You've got to remember this: in general, when someone gives you a frequentist confidence interval, it is not 95% likely to contain the true value. That would be a Bayesian interpretation of a frequentist construct, and it happens all the time; you have to be really, really careful about this. I always imagine this sort of conversation happening. The statistician says: "95% of such confidence intervals in repeated experiments will contain the true value." The scientist says: "Aha, so there's a 95% chance that the value is in the interval?" The statistician says: "No, you see, parameters by definition can't vary, so referring to chance in that context is meaningless; the 95% refers to the interval itself." The scientist says: "Aha, so there's a 95% chance that the parameter is in the interval?" The statistician says: "No, it's that the long-term limiting frequency of the procedure for constructing this interval ensures that 95% of the resulting ensemble of intervals contains the value." The scientist says: "Aha, so there's a 95% chance that it's in the interval?" The statistician says: "No, it's... just write down what I said in the paper." And the scientist writes: "The value is 95% likely to be in the interval." This is what we have to be really careful about as scientists: that we don't misconstrue those frequentist constructs. I have some other slides, but I'm going to skip ahead; I obviously planned way too much. These sorts of things happen.

So, the conclusions. Frequentism and Bayesianism fundamentally differ in their definition of probability, and in simple problems the results are basically the same; you don't have to worry too much if you're computing the mean of a distribution. But what I think I've argued for here, and people might disagree with me, is that Bayesianism provides a more natural handling of nuisance parameters, those things that you have to fit in your analysis but that don't really affect the results except through the intermediate calculations. And I think Bayesianism provides a more natural interpretation of errors too. With our scientific results, we don't want to phrase things in terms of 95 percent of an ensemble of potentially calculated confidence intervals; we want to show an interval and say the parameter is 95 percent probable to be within it. So I would say Bayesianism is more natural for communicating our scientific results to the public. There are some philosophical issues with priors, and to give a fair treatment I probably should have gone over those a little more, but what I can say is that both paradigms are useful in the right situation, and I think it's on us as scientists to learn the situations where they are useful and to interpret those results carefully. Thanks very much.

Great talk; we have time for one lightning-quick question. In your first example, you said that frequentist and Bayesian interpretations give different odds. Isn't that something a frequentist would say is empirically testable, by running the game over and over again and finding out?

Yeah, the question was: in the first example, frequentism and Bayesianism give different odds, so can't we empirically test that in a frequentist manner? The answer is yes, and it confirms that the Bayesian result is correct, as it should be. But that Bayesian result is correct only if you have the prior correct. This is the detail that I sort of glossed over: the choice of prior in a Bayesian analysis can have some real subtleties, and if you don't choose it correctly, you might end up biasing your results. In the proceedings paper I have some references where you can read a little bit more, but that's the one thing you have to be careful about: the prior in Bayesianism.
Info
Channel: Enthought
Views: 72,208
Rating: 4.962862 out of 5
Keywords: Scipy 2014, Scipy, Python Programming Language
Id: KhAUfqhLakw
Length: 26min 31sec (1591 seconds)
Published: Wed Jul 09 2014