Bayesian Statistics without Frequentist Language

Captions
[Host introduction; the opening of the auto-generated captions is unintelligible.]

Thank you, Rasmus. I'm very happy to be here. This is my favorite kind of conference, because it's cozy. In thinking about the kind of talk to give, I figured most of the talks would be a bit technical, so I decided to give a fairly non-technical talk aimed at the confluence of my interests and Bayesian statistics.

Let me give you some background. I'm not really a statistician, at least not originally; I'm an anthropologist, and I come to statistics with a very definite topical interest. I study human evolution, in particular the evolution of human behavior, and one of the methods we use in anthropology to study that is ethnography. The ethnographic method, to the extent that it has one at all, is that you go places and live with people, so that you can get an inside view of their society and how it works. You want that inside view because it helps you develop a better outside, scientific understanding of variation among societies. This is me doing fieldwork; I spent a couple of years in Tanzania getting an inside view of the culture there.

When I came to study statistics, I applied the same ethnographic method to learning Bayesian statistics: take the inside view. I got that perspective from some of the famous writers in the field, and my commitment to Bayesian statistics grew out of the strengths I found in that inside perspective. It makes particular things easy, and it fits, in particular, with the kinds of difficulties in the data we collect in anthropology. So what I want to do today is use this as a launching-off point to talk about what I think are systemic problems in teaching Bayesian statistics, because most people encounter another paradigm first and are then taught Bayesian statistics from the outside view rather than the inside view. I think we've got a lot of work to do as a Bayesian community in developing an inside view that we can agree upon, because I don't think there is one, actually. So I'd like to start this conversation by proposing some elements of an inside view and illustrating the sense in which they provide pragmatic solutions to common data-modeling problems, at least the kind that I study.

First, a little bit about what the outside view is. The outside view is fine; I'm not going to say bad things about it. It's just that: the outside view. At least not today; give me some beers later and I'll say bad things. This is the Honourable Sir Ronald Fisher, who is certainly not the only person associated with the outside view, but he defined likelihood in the way that most people use the term. Maybe you've all read the definition by now; it's been up on the slide for a little while, and you don't need to read it all. Just to say that likelihood is defined in a very odd way in statistics: it means a very special thing. It's a function of the parameters, not a probability; it's not actually conditioned on the data, there's a semicolon instead of a conditioning bar, because you can't marginalize over it; weird stuff like that. And that's all fine within that paradigm.
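For reference, here is that contrast in symbols; this rendering is mine, not a slide from the talk:

```latex
% Fisher's likelihood: a function of theta indexed by the data x, written
% with a semicolon because it is not a conditional distribution of theta
% and cannot be marginalized or conditioned on like one.
L(\theta \,;\, x) \;=\; f(x \,;\, \theta)

% The Bayesian treatment: the same factor is an ordinary conditional
% probability inside a joint model, so the usual probability rules apply.
p(\theta \mid x) \;=\;
  \frac{p(x \mid \theta)\, p(\theta)}
       {\int p(x \mid \theta')\, p(\theta')\, \mathrm{d}\theta'}
```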
But then, when people encounter Bayesian statistics, they are taught that we use this likelihood and then we add priors to it, and now you're Bayesian. And that is incorrect, of course, and I think there's mental friction created by it.

Very quickly: the outside view has a bunch of elements, and the outside view is actually a lot of different views, but the common elements people come across before they learn Bayesian statistics include things like: the data have distributions, the parameters don't; there's a very important distinction between parameters and statistics, at least in the frequentist view; the likelihood is not a probability distribution (I remember being screamed at once for calling it a probability); and there's this imaginary population that serves as the device for creating uncertainty in statistics. This is the frequentist, sampling-theory view. Then, after you've learned all this and passed some exams on it, you learn that Bayes is all this stuff plus we add some priors, which lets us do Bayesian updating, and these priors are very subjective, sort of a problem. I'm not going to spend any time arguing against this outside view, although from the art I have on the slide you might get some idea of how I feel about it.

Let's say that teaching Bayesian statistics from the outside view is like the British going to Egypt: they disrupt the society quite severely. It's the colonial view of the statistical paradigm, and it's a failure to take the inside view, so it gives up some of the strengths of the perspective. The full strength of the Bayesian perspective is unleashed by taking an inside view of what goes on: not deriving it as sampling theory plus priors, but taking it on more fundamental terms. Of course, I'm not the first person to say this. Dennis Lindley put a complaint of this kind in probably every one of his papers. Here's probably the most succinct quotation from him: what most statisticians have is a parody of the Bayesian argument, a simplistic view that just adds a woolly prior to the sampling-theory paraphernalia; they look at the parody, see how absurd it is, and dismiss the coherent approach as well. Lindley has some very colorful papers, by the way, if you've ever looked through them; they're full of things like this.

The conceptual frictions, in my experience teaching statistics, that arise from "the outside view plus priors" rather than an inside view of Bayesian inference include things like the following. Students come to believe that the data must look like the likelihood function, or at least that the residuals need to look like the likelihood function; in the outside view maybe that's true, but on the inside view it's definitely not. The concept of degrees of freedom is something people are taught in an introductory stats course, and then they encounter Bayesian models where you have, say, a thousand parameters and ten data points, and they say "you can't fit that," and I say: watch me. Not that you're going to get much updating from that data set, but you can definitely fit it. A whole bunch of other concepts, like identifiability, are really non-Bayesian concepts, and when we use those words to describe Bayesian models we cause problems for understanding, among our students and ourselves. Sampling as the source of uncertainty: true in the non-Bayesian approach, definitely not true in the Bayesian approach, where you can have uncertainty without anything being stochastic at all; it's purely epistemic.
Defining random effects by the sampling design is another; I'll have a bit more to say about that later in the talk. And often, although it's not a necessary feature of the outside view, there's what I'd call neglected data uncertainty: when there's measurement error, people wave their hands a bit, say "yeah, I worry about that," and then fit a model that ignores it. In the insider view, I'm going to try to convince you, there are obvious solutions to common problems like uncertainty in measurement. And all of you probably have your own conceptual confusions that you first encountered when you started learning Bayesian statistics.

Now I have to admit that my book perpetuates this problem. I've just started trying to do the second edition, hacking away at it a bit, and I've had to fully engage with my guilt over this. I feel bad about many of the choices in the book, as all authors do, and foremost among those regrets is that it uses the outsider vocabulary: I use terms like "likelihood" and "parameter" and "estimate" in ways that really only have coherent definitions outside of Bayes. Nevertheless, people tell me they learn things from my book, so I guess it's not awful, but I think we can do better; I think I can do better on a second pass, and I want to start thinking about that in this talk. One of the problems is that this generates friction because, to continue my colonialist metaphor, it's like explaining Indian politics using British political parties: well, there are these things called castes, and there's the caste system, but also other stuff that matters, and none of that exists in Hogwarts or whatever. Inevitably there are things that just cannot be explained in terms of the other framework, and this perpetuates lasting confusion: people thinking, for example, that tilde means "sample." Who was I talking about this with recently? Yeah, with Rasmus. You might ask whether it's perhaps a historical necessity to use terms like "likelihood," because people still encounter non-Bayesian statistics first, but I'm at least willing to try, with all of you, to do better. So let me try to outline another path in the remainder of my time today.

The claim I want to entertain, and I'm not sure I'm convinced of it myself, is that Bayes is easier and more powerful when we understand it from the insider perspective. First, a disclaimer: there are lots of insider views on Bayes; Bayesians argue amongst themselves as well. There's a classic paper from 1971 by I. J. Good called "46,656 Varieties of Bayesians." Anybody else know this paper? It's cute; it's a two-page paper in which he lists eleven criteria and goes through the permutations of all the combinations of them. It's really nice; you can learn a lot about the epistemic possibilities from it. He noted he had artificially made some of the criteria discrete, or his heading would have been "on the infinite variety of Bayesians." So I'm going to pick a particular insider view that is useful to me and solves some particular problems, but I don't claim it's uniquely the insider view on Bayesian statistics.

Here's the insider perspective that I use most of the time. The key thing about the Bayesian approach that engages me as a scientist is that it's a joint generative model of all the variables. What do I mean by variables? I mean data and parameters; they're the same kind of thing in Bayesian statistics. This perspective has two key unifying ideas, where things that are distinct and must be treated differently in the outsider view are indistinct and treated the same, much of the time, in the insider view.
The first of these things is variables: what we usually call data and parameters in the non-Bayesian view. In the Bayesian view, data and parameters are fundamentally the same thing; they're just variables. Sometimes we get to observe them and sometimes we don't, but they all have distributions, and calculations are done the same way on them. Likewise, distributions: there's no fundamental distinction between likelihoods and priors, as there is in the outside view. I want to say again that there's nothing necessarily wrong with Fisher's definition of likelihood, but that's the outside view; this is the inside view of it. So I want to give you some examples to back this up and give you an intuition about why I think breaking down these distinctions can be useful, and then hopefully we can have some conversations about exactly what terms we might want to use to refine this.

Here's a typical line from a Bayesian model definition. I've used Nordic runes, in honor of the workshop, instead of Roman or Greek characters, to obscure whatever convention you would normally use to decide whether something was data or a parameter; you probably don't have stereotypes about whether a particular rune is a parameter or data. So: something is distributed Normal(something, something). Now I might ask you: is the symbol on the far left data, something that was measured, in which case this is a likelihood? Or is it instead a parameter, which would imply this is a prior? You can't tell, right? There is absolutely nothing about this statement that reveals which of those two cases it is, and that's because in a Bayesian model it's the same kind of statement either way; parameters and data are fundamentally the same kind of thing. In a common data-generating model, a joint model of data and parameters, some variable B at one stage or another might be observed or it might not be. When it's observed, we treat it as data and we call the statement a likelihood; when it's not observed, we treat it as a parameter and we call the statement a prior. But it's the same statement about the underlying process. Does that make some sense?

So I want to show you three kinds of models today. They are simple toy examples, but they're real, working statistical models, where this collapsing of definitions between data and parameters, likelihoods and priors, reveals some of the unity of the Bayesian approach and why it behaves the way it does. The cases I'm using are not necessarily the most common kinds of statistical modeling problems people come across in, say, the experimental sciences. In the experimental sciences you're lucky to have clean data: you can set up your factorial experiment and fill all your cells; recruit more students, make it work, grow more yeast, whatever it is you need to do. I'm an anthropologist, and in anthropology we go to war with the data we have, not the data we wish we had, so we deal with lots of inconvenient sorts of models, and I'm going to show you those. You might think of these as corner cases in your fields, and that's fine. In these corner cases, the distinction between data and parameters is often very hard to make. This includes things like generalized linear mixed models, missing-data models, and measurement-error models; many other kinds of strange machines, like occupancy models and joint species distribution models, have features like this as well.
OK, so let me introduce the toy example, and then I'll go through three varieties of it in which the collapsing of definitions can teach something useful, whether you're just learning Bayesian statistics or have practiced it for a long time. Let's imagine a simple kind of observational experiment. There's a room in which there's a bird and a cat. The bird likes to sing, and when the cat is present, it scares the bird a bit, so the bird tends to sing less; when the cat is absent or asleep, the bird tends to sing more. There are four variables in this study that we're interested in, because we want to estimate the effect of the cat on the bird's singing. There's the count of notes in some interval; there's the presence or absence of the cat; and then there are two unobserved variables, which are rates: the rate of singing when the cat is present, and the rate of singing when the cat is absent. I hope there are people here who like cats; that's why I chose cats, because people tend to like cats. Put cats on slides.

To summarize: two of these variables are observed and two are unobserved. In the simplest treatment you would call the first two "data" and the second two "parameters." As we move through the examples, I'd like to make you question that distinction a bit, but for the time being, let's start with the initial joint model of these four variables. Again, the thing about the Bayesian insider view, to me, is that the model is a joint probability distribution of all the variables, all of them. What does that mean? It means a joint distribution over the notes, the cat, and the two rates. (I should have just put a P there; we were talking at lunch about how frustrating it is that in statistics every function is called P.) I don't know about you, but I have a problem visualizing a four-dimensional probability distribution; I struggled to put one on the slide and didn't come up with anything that looked great, so I'm going to go straight to how we would define this in the conventional statistical framework.

Here's a simple version that we can start working with. The notes at time t are distributed as a Poisson variable with rate lambda-t. Lambda-t just switches between two rates, alpha and beta: alpha for when the cat is absent, beta for when the cat is present. So we've got our two observed variables, notes and cat, in here; you see how they're data, and they're connected to the rates. And then we've got our two unobserved variables, and we need priors for them: we have to say what the distributions of these things are, unconditional on the data, before we see it. I'm going to assert these for the moment and justify them in a couple of slides: they are exponential with a mean of 10. With me so far? I'm assuming, and I apologize if it's not true, that everybody is reasonably familiar with this way of writing statistical models. If not, you'll become familiar with it; you learn to love it. It's Stockholm syndrome. Or, here, Lund syndrome.
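The talk's actual implementation lives in the gist mentioned below; as a stand-in, here is a minimal Stan sketch of the model just described. The variable names (notes, cat) and the data layout are my assumptions:

```stan
data {
  int<lower=1> N;                        // number of observation intervals
  array[N] int<lower=0> notes;           // observed note counts
  array[N] int<lower=0, upper=1> cat;    // observed cat presence (0/1)
}
parameters {
  real<lower=0> alpha;   // singing rate when the cat is absent
  real<lower=0> beta;    // singing rate when the cat is present
}
model {
  // maximum-entropy priors: positive with mean 10, hence exponential(0.1)
  alpha ~ exponential(0.1);
  beta ~ exponential(0.1);
  for (t in 1:N) {
    // the rate switches between alpha and beta with the cat's presence
    real lambda = (1 - cat[t]) * alpha + cat[t] * beta;
    notes[t] ~ poisson(lambda);
  }
}
```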
"How is prior formed?" you might ask, as they say on the internet. (Ha, someone understood that joke.) There are many ways Bayesians go about forming priors. I tend to come from a school where we don't talk about beliefs, ever; it's almost a taboo in anthropology to talk about anybody's beliefs. But we talk about other things. You can ask: what pre-data information do you have about the unobserved variables? Let me walk you through what I think of as the worst-case scenario for determining priors; this is leading up to a pivot, so bear with me for a second. What do we know about these parameters before we have any data with which to inform them? Well, we know that they're non-zero positive real values. Why? Because they're rates, and rates are by definition non-zero positive real values. With me? That's got to be true. And we assert that all we're interested in is the average: we want the expected rate when the cat is absent and when the cat is present, so we're going to track one thing about them. If those are the two things we know prior to getting any information about the rates, then there's this fun argument called maximum entropy, which gives you the most conservative distribution that embodies that information and no other information. The solution to this maximum entropy problem is to use an exponential distribution for the priors. You still have to pick the mean, so you do need some information additionally, but the argument leads inexorably to the exponential. You can use something else if you feel motivated to do so, but this is a maximally conservative approach that spreads probability as evenly as possible while remaining consistent with what you said. If the two things you know are that it's positive and that it has a mean, then it's exponential.
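For reference, the maximum entropy problem he's describing can be written out in a few lines; this reconstruction is mine, not from the slides:

```latex
% Find the density p on [0, infinity) that maximizes entropy, subject to
% normalization and a fixed mean mu:
\max_{p}\; -\int_{0}^{\infty} p(x)\,\log p(x)\,\mathrm{d}x
\quad\text{subject to}\quad
\int_{0}^{\infty} p(x)\,\mathrm{d}x = 1,
\qquad
\int_{0}^{\infty} x\,p(x)\,\mathrm{d}x = \mu .

% Lagrange multipliers give p(x) proportional to exp(-lambda * x), and the
% two constraints pin down lambda = 1/mu, i.e. the exponential distribution:
p(x) = \frac{1}{\mu}\, e^{-x/\mu} .
```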
The fun thing about this argument, whether you like it for priors or not, is that applied to likelihoods it gives you GLMs. It is the quickest and most conservative route to specifying all the likelihood families that Fisher would have used, and did use, in his lifetime: the same argument. If you know the metadata on the outcome variable before you've seen the values, and you apply the maximum entropy argument, you end up with the exponential-family likelihoods. This doesn't mean you have to do it this way, but what I'm showing is that in the Bayesian perspective, or at least in my Bayesian perspective, the way we derive likelihoods, or choose to justify them, is exactly the same argument as taking priors from maximum entropy. What it gives you is very conservative, flat distributions that blanket as much of the space of possibilities, prior to data, as possible. In this case, then, like the priors, the likelihood is a pre-data distribution. Likelihoods don't tell you how the data have to look; they just give pre-data expectations about the blanket of possibilities. The data are free not to look like the likelihood, because it's a prior distribution and you're going to update. Nobody thinks that the posterior has to look like the prior, but lots of people think the residuals have to look like the likelihood, and it's not true: you will still estimate the means even if you use some other distribution. Now, p-values do depend on the residuals having a particular shape, but the posterior being calibrated does not.

So in this case: what do we know about the notes before we actually know the values? We know they're zero or positive integers, because it's a count variable, and we know that all we're going to keep track of is the expected value. Again, maximum entropy leads to a unique solution, and here that's the Poisson distribution, the maximum entropy distribution in this case. If you have other information, you can put it in and end up with some other kind of likelihood, but this kind of argument gives you all the conventional likelihoods of non-Bayesian analysis as well, and these are maximally conservative distributions. The point of this is the unity of the interpretation and derivation of likelihoods and priors, even in the simplest kind of what we call regression models, a generalized linear model, before we've added anything like mixed effects. And that unity of interpretation and construction is incredibly useful for heading off misunderstandings, like thinking that the residuals have to look like the likelihood.

Here's how you implement this model. To prep you: this is not a complicated model, but for every model in this talk I'm going to show you a slide like this. I don't intend to walk through the code, but I've got a gist of all of the code examples in the talk already up on GitHub, so you can pull it up, go through it, and run it later. I'll fly through these slides, just pointing out some key features of how you could implement the models. The key reason to do this is that, whatever conceptual unity and harmony I may lead you to believe in from my other slides, I want to convince you from the implementation slides that there are real computational challenges, always, in getting this stuff to work. I don't want you to walk away thinking "oh, Bayes solves everything." It solves some things. The insider view doesn't make your code work; it might help you understand and build the code, but there are real challenges in making it happen. On the left is the statistical model; on the right is the Stan code. I show the full Stan code partly to point out that Stan commits the sin of having blocks named "data" and "parameters"; we could relabel those "observed variables" and "unobserved variables," and that would make me happy. Then we have a model block where we define the distributions for the unobserved variables, compute the lambda thing from the other variables, and define the distribution for notes in terms of it. With map2stan, the tool that comes with my book, that's what the model would look like: you specify the model, and map2stan basically guesses all the other stuff Stan needs in order to build the Stan model. With me? Yeah? OK.

So, the first example, building on that: let's think about a generalized linear mixed model of birds. These are toy examples, but they're all chosen to teach one little bit about the unification of data and parameters, likelihoods and priors. Now we're going to imagine that every bird is a unique snowflake: there are a bunch of different birds in different rooms, some birds are more fearful than others, and they react differently to cats than other birds do. We've got some repeat structure in the data, and we're going to take advantage of it. This is a conventional hierarchical model of that, except I've made it as simple as possible by just using exponential distributions for all the random effects. Very quickly: it's the same model as before, except now we've got bird i at time t, the lambda for bird i at time t is associated with the cat for bird i at time t, and there's a unique alpha and beta for each bird.
Now we have to define distributions for the new unobserved variables, the alpha-i and beta-i, and the means of these groups: the vectors of alphas and betas have means alpha-bar and beta-bar. So, new unobserved variables: we've got the alphas and betas for each bird, which are analogous to the previous ones, and now unobserved means of the populations of alphas and betas. A typical hierarchical model; good times. Same justifications as before, on down.

Very quickly, before I plumb the lesson about unification here, let me draw your attention to a great paper by Andrew Gelman. This is a paper that I think isn't read as often as it should be, because it has a really boring title: "Analysis of Variance: Why It Is More Important Than Ever." I don't know about you, but when I see the words "analysis" and "variance" anywhere near one another, I run in the other direction. I had a really traumatic math class in graduate school that was nothing but sums of squares upon sums of squares; I blacked out, woke up a semester later, and swore never to analyze variance again. But this is a really good paper, and in the second half of it there's a great list of the definitions of random effects that you might come across in the literature, and it's just maddening. Very quickly, and not verbatim: what's the distinction between fixed and random effects? Fixed effects are constant across individuals; random effects vary. Effects are fixed if they are interesting in themselves, random if there is interest in the underlying population. When a sample exhausts the population, the corresponding variable is fixed; when the sample is a small or negligible part of the population, the corresponding variable is random. If an effect is assumed to be a realized value of a random variable, it is called a random effect. Fixed effects are estimated using least squares, and random effects are estimated with shrinkage; that one is an algorithmic definition. These are all incompatible with one another, and there are other possibilities in there as well. So what's my point? I sympathize with the student who is frustrated on encountering random effects and wondering what they are, because from paper to paper, even within the same person's work, they can be defined in incompatible ways.

What I want to say is that, for me, what we usually talk about as random effects is just that they exhibit shrinkage. What does that mean? There's some mean of a group of parameters that share a family resemblance; they belong to the same kind of cluster, in this case a bird. There's repetition of the parameters across birds, there's a family of them, and we model the mean of that family, and this results in shrinkage of the differences towards the mean. If there's a bird with a really extreme observed singing rate and there's not a lot of data, then its estimate will be shrunk towards the population mean, and that gives you a better estimate. This is a famous argument that I think is familiar to most of you. Non-Bayesians use shrinkage estimators too; not the same estimator, but the same shrinkage phenomenon, all the time. This is not a Bayesian-versus-non-Bayesian thing: shrinkage happens everywhere you've got a distribution that is a function of parameters. There is nothing about random effects, in a Bayesian model or in a non-Bayesian model, that uniquely creates shrinkage. Ordinary likelihoods create shrinkage; in that case you call it regression to the mean. There's a whole famous argument, actually from an anthropologist, Francis Galton, about regression to the mean, which produces the same phenomenon as shrinkage.

Two quick examples to remind you. There's the empirical Bayes version, the shrinkage estimators, the Stein estimators; there's a great paper from Efron, I think, on Stein estimation of the batting averages of American baseball players, where the best estimates are shrunk towards a common mean because of variation. You can also think about it as a time series: you're trying to predict a player's performance in the next season, and you want to shrink extreme values towards the mean. The same phenomenon, of course, happens in Galton's famous attempt to predict children's heights from their parents' heights; there it's called regression to the mean. There's no random effect or hierarchical structure in that model, but you get shrinkage anyway; it just wasn't called that at the time. It's the same statistical phenomenon, and it arises from exactly the same mechanism inside a Bayesian model or inside a non-Bayesian model: there are distributions, those distributions are functions of parameters, and those parameters create a gravity that attracts the family of things, whether they're residuals in one case or random effects in the hierarchical case, towards the mean. That's the same fundamental phenomenon. I have found this helps a lot in explaining to students what random effects are about: it's just regression to the mean, and they already understand regression to the mean, at least the ones I used to teach in California did. They're like, "oh yeah," they understood regression to the mean. This is just regression to the mean, but now among parameters rather than among data points. So there's some conceptual dividend from the unity, I hope.

Here's how you implement this model; I think it will be familiar to a lot of you. Again we've got the naughty words, "data" and "parameters," and we add in varying-effects vectors alpha and beta and the bar parameters. The model doesn't change very much.
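Again the talk's actual code is in the linked gist; a minimal Stan sketch of the hierarchical version might look like this. Parameterizing the exponential varying effects by rate 1/mean, and the priors on the bar parameters, are my guesses at details not recoverable from the captions:

```stan
data {
  int<lower=1> N;                          // observations
  int<lower=1> B;                          // number of birds
  array[N] int<lower=0> notes;
  array[N] int<lower=0, upper=1> cat;
  array[N] int<lower=1, upper=B> bird;     // which bird each row belongs to
}
parameters {
  vector<lower=0>[B] alpha;   // per-bird rate, cat absent
  vector<lower=0>[B] beta;    // per-bird rate, cat present
  real<lower=0> alpha_bar;    // population mean rate, cat absent
  real<lower=0> beta_bar;     // population mean rate, cat present
}
model {
  alpha_bar ~ exponential(0.1);
  beta_bar ~ exponential(0.1);
  // exponential varying effects with means alpha_bar and beta_bar
  // (an exponential with mean m has rate 1/m)
  alpha ~ exponential(1 / alpha_bar);
  beta ~ exponential(1 / beta_bar);
  for (n in 1:N) {
    real lambda = (1 - cat[n]) * alpha[bird[n]] + cat[n] * beta[bird[n]];
    notes[n] ~ poisson(lambda);
  }
}
```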
OK, plowing forward. Now let's get into a couple more examples that have the flavor of the kinds of data problems I work with in my own research. These are cases where we don't have complete measurement control over the things we normally call data. Sometimes that's just the nature of the phenomenon; there's irreducible uncertainty. And sometimes it's that the study could have been better, but it's what happened, and we want to get what is there.

Before I get into that, let's revisit the previous model and just add a couple of lines to it, to say that at the same time we model the birds' singing, we jointly model what the cats are doing. Well, the cats are entering and leaving the room; that's a great life for a cat, in this toy example. So we want to model the cat's behavior: how often is the cat around? We can write down a distribution for that too. Now the cat at time t is distributed as a Bernoulli variable with some probability kappa, and we give kappa a prior, a Beta(4,4) distribution, which has some regularization built in: the (4,4) makes the density go to zero at the ends; it's shaped like the curve I've got on the slide. So far this is just a bigger joint model; we're jointly modeling both animals, but if we observe all the variables, it's just like simultaneously running two regressions. Nothing too special about that.
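As a sketch, the addition amounts to two lines in the model block; here is the full (non-hierarchical) joint model for concreteness, with my variable names as before:

```stan
data {
  int<lower=1> N;
  array[N] int<lower=0> notes;
  array[N] int<lower=0, upper=1> cat;
}
parameters {
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0, upper=1> kappa;   // probability the cat is in the room
}
model {
  alpha ~ exponential(0.1);
  beta ~ exponential(0.1);
  kappa ~ beta(4, 4);             // regularizing: zero density at 0 and 1
  cat ~ bernoulli(kappa);         // the cat now has a distribution of its own
  for (t in 1:N) {
    real lambda = (1 - cat[t]) * alpha + cat[t] * beta;
    notes[t] ~ poisson(lambda);
  }
}
```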
The real value of doing something like this comes when you don't always observe the cat. Let's start with the simplest version: sometimes there are missing values for the cat. I'll blame it on the cat in this case: the cat steps on the keyboard occasionally and puts an NA right in your data set. Or it's your research assistant; blame whoever you want. In the sort of data I work on, this happens for a whole variety of reasons. Sometimes one of the people collecting the data forgot to record a variable, so you get a whole day where a variable is missing. It happens, and you can't send them back to the field, because the field is far away and it's fairly expensive to send them back; that's how it goes.

Now, the nice thing is that this model we've defined, the joint distribution of all the variables, automatically lets us handle the missingness. But here's the thing: cat is now data or parameter depending on whether the value is missing at that spot. It is both things. And the Bernoulli distribution on cat-sub-t is both a likelihood and a prior in the same model, and it solves the problem for us. That doesn't mean the model actually tells us exactly whether the cat was there or not; that depends on the data. But it gives a principled solution to the missing-data problem.

Here's how you define this in Stan to deal with the missingness. I'm not going to walk through this code in detail; it's on the gist and you can go through it. In Stan you marginalize over the discrete missing values. You ask: if the cat value is minus one, which is just our internal code for missing, then we do a mixture over the two possibilities; otherwise we observed it, and we just use the two distributions directly. There's also a shortcut: if you use the experimental branch of my package that's up on GitHub, it will take this model definition, recognize the binary missing variable, and build the mixture model for you. But you've got to use the experimental branch, and I make no promises that there are no bugs in it; that's why it's called experimental. But I use it every day; that's all I can tell you. You can also get the posterior probabilities that the cat was present or not, in the cases where the cat data are missing, by using the generated quantities trick in Stan. So, stepping past the computational challenge: depending on the data, you can get information, sometimes fairly reliable information, about whether the cat was present. In my toy example there are cases where the data on the cat were missing. In one of those cases the cat is probably present, but we're not sure: the amount of singing by the bird leans in the direction of the cat being present, but it's not a slam dunk. And the other case reads much the same. With me? Yeah? There will be a summary at the end; we're getting there.
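Here is a hedged Stan sketch of what such a marginalization can look like, with -1 coding a missing cat value as described; the generated quantities block implements the posterior-probability trick he mentions. Details beyond the captions, such as names and exact priors, are mine:

```stan
data {
  int<lower=1> N;
  array[N] int<lower=0> notes;
  array[N] int<lower=-1, upper=1> cat;   // -1 codes a missing value
}
parameters {
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0, upper=1> kappa;
}
model {
  alpha ~ exponential(0.1);
  beta ~ exponential(0.1);
  kappa ~ beta(4, 4);
  for (t in 1:N) {
    if (cat[t] == -1) {
      // cat unobserved: average (marginalize) over both possible states
      target += log_mix(kappa,
                        poisson_lpmf(notes[t] | beta),    // cat present
                        poisson_lpmf(notes[t] | alpha));  // cat absent
    } else {
      cat[t] ~ bernoulli(kappa);
      notes[t] ~ poisson((1 - cat[t]) * alpha + cat[t] * beta);
    }
  }
}
generated quantities {
  // posterior probability the cat was present at each missing slot
  vector[N] cat_prob;
  for (t in 1:N) {
    if (cat[t] == -1) {
      real lp1 = log(kappa) + poisson_lpmf(notes[t] | beta);
      real lp0 = log1m(kappa) + poisson_lpmf(notes[t] | alpha);
      cat_prob[t] = exp(lp1 - log_sum_exp(lp1, lp0));
    } else {
      cat_prob[t] = cat[t];
    }
  }
}
```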
So, the final example: one that has elements of all the things we've done so far. It's the same joint model, but now the cat has not stepped on the keyboard, so there are no actual NAs, no missing values, in the data set. What is true of every cat observation, though, is that you can't necessarily trust it, because cats are good at hiding. Sometimes the cat is in the room but the person with the data sheet couldn't find it; perhaps the cat is waiting to jump out at the bird, or something like that. Let's imagine, again in the toy example, that the bird always knows when the cat is there (birds are smarter than people), but the person doing the data logging doesn't always know whether the cat is really there or not. So you can trust a one: when the cat variable says one, the cat is there; there are no phantom cat presences observed. But when cat is zero, you can't believe it. Now, the zeros are data of a kind, but they're not the variable you're interested in. There's a latent variable that you actually want but can't observe, and that's the true state of the cat. Make your Schrödinger's cat jokes now: is the cat in the room or not? We're going to do the statistical version of Schrödinger's cat.

Here's the statistical version of that argument. Now we have cat-true at time t, the true state of the cat; it's in the box, and we can't see it. What we get to see is cat-observed at time t, which has a Bernoulli distribution where the probability of observing the cat is the product of the true state of the cat (a zero/one indicator of whether the cat is there) and a detection probability, delta. So when the cat is absent, the product is always zero and you never observe the cat; when the cat is present, you see it only delta of the time. Those of you who have cats, this will resonate: you don't always know. The rest of the model is the same as before, except we have a prior, or I should say a distribution on an unobserved variable, for delta. In ecology these are called occupancy models; they do a lot of heavy lifting in field ecology, and they've become really important in endangered-species studies as well. The implementation of this, before I get to the key thing about it, is even more complicated. I'll show you the model block; there's lots of commentary in the gist if you want to read it later and understand how it works. Like the previous one, this is a mixture: there are multiple likelihood terms, one for each of the possible detection states, and you have to sum over them inside Stan. But it works great, and you can get estimates for both the frequency with which cats are present and the detection probability of cats out of things like this. This is useful.
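A minimal Stan sketch of this detection (occupancy-style) model might look as follows; the Beta(4,4) prior on delta is my placeholder, since the talk's choice isn't recoverable from the captions:

```stan
data {
  int<lower=1> N;
  array[N] int<lower=0> notes;
  array[N] int<lower=0, upper=1> cat_obs;  // a zero may hide a real cat
}
parameters {
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0, upper=1> kappa;   // probability the cat is truly present
  real<lower=0, upper=1> delta;   // detection probability given presence
}
model {
  alpha ~ exponential(0.1);
  beta ~ exponential(0.1);
  kappa ~ beta(4, 4);
  delta ~ beta(4, 4);
  for (t in 1:N) {
    if (cat_obs[t] == 1) {
      // no false positives: a sighting means present AND detected
      target += log(kappa) + log(delta) + poisson_lpmf(notes[t] | beta);
    } else {
      // a zero is ambiguous: present-but-missed, or truly absent
      target += log_sum_exp(
        log(kappa) + log1m(delta) + poisson_lpmf(notes[t] | beta),
        log1m(kappa) + poisson_lpmf(notes[t] | alpha));
    }
  }
}
```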
At my institute we use models almost exactly like these, which I've given you today as toy examples, in real research; not on cats and birds, but on chimpanzees. There's a big project called the Pan African project, based at my institute, which has almost a thousand camera traps across equatorial Africa taking videos of anything that walks in front of them. There are hundreds of thousands of videos; most of them are not apes, but there are thousands of videos of apes doing things. We're interested in the distributions of behaviors among the apes, but we simultaneously need to estimate the population densities of the apes, and both the behaviors and the population counts are subject to the same uncertainties as observing Schrödinger's cat. So we use these models in analyzing this camera-trap data as well, and you need to do it, because ignoring the detection probability is a disaster: you get the wrong answer; you undercount things.

OK, so let me try to summarize. The general argument is that there is virtue in taking the insider view on Bayes, in unifying concepts that are split in the outsider view of Bayes: the distinction between data and parameters, and the distinction between likelihoods and priors. Of course there are times when it is useful to distinguish these things, absolutely, but there's also a lot of conceptual value in seeing them as fundamentally the same thing inside a Bayesian model.

In the first example, the point I wanted to get across was that both likelihoods and priors are distributional assumptions on variables, observed or unobserved respectively, and these distributions can be derived from the same informational perspective: there's an information state, the metadata on the variable before you see its value. And of course, inside the computer, when you run the calculations, you treat them the same way. In the outsider view a likelihood is not a probability distribution, but of course it's calculated exactly as if it were one, and we write it in the mathematical model as if it were one. That's because in Bayes it is a distribution, just over data rather than over parameters.

Second, both likelihoods and priors induce the same inferential force: they cause shrinkage. When it comes from a likelihood, the outside view calls it regression to the mean; when it comes from random effects, you call it shrinkage; but it's the same basic phenomenon. The distributional assumption induces skepticism in inference about extreme values, and it takes more evidence to overcome that skepticism; that's what causes the shrinkage, and it's good: it gives you better estimates. Regression to the mean is a good thing statistically; it improves your predictions, just as shrinkage on random effects improves your predictions.

In the third example, what I wanted to get across is that distributions do double duty in models: the same distribution can be, simultaneously, inside the same model, both a likelihood and a prior, and the same variable can be both a parameter and an observed variable inside the same model, depending on the details. You start with a joint generative model of the system, and then things happen, and the Bayesian framework takes care of the conceptual difficulties of sorting data from parameters and such. It does not take care of the computational challenges, which are substantial at times, absolutely substantial. And fourth, I think I already said it: even inside the same analysis, the same symbol can be both data and parameter, and the same distribution can be invoked as likelihood or prior.

I want to say, before I move past this slide, that there are of course cases where it's very important to distinguish data from parameters, or rather observed from unobserved variables, absolutely. If you write your own Markov chain, of course, you have to make proposals for one of these kinds of things and not the other, and there's all the bookkeeping that goes with that, so the difference matters a lot there. All I'm arguing is that for teaching, and for understanding model construction and interpretation, the unifying perspective is very important.
This is the kind of slide I put up at the end of talks to serve as a summary, so it's got way more text than I would normally put on a slide; my apologies, but I've learned over time that people like summary slides with a lot of stuff on them. Let me summarize very quickly; you'll have access to these slides later if you want to study them with a glass of wine sometime.

The benefits of taking the insider view: the insider view is not necessary; philosophy in general is not necessary, but it is useful. What I find useful about it is that it helps me to think scientifically, not statistically. It makes me think about the joint model of the system and how the data are produced, and I can engage with that, and I get a model that will work for all kinds of combinations of missingness and uncertainty. I can build off of it and make a model that isn't statistically ad hoc: you can take the data sets, having already got a model, and see what you can infer given the data you have at hand. Many solutions to common scientific problems arise directly from this approach. I'd call them measurement problems, and they're true of all the projects I work on; they're measurement-error problems, because it's often field data collection, and it's very important for me, and for all of us of course, to propagate uncertainty through the analysis, not to shed the noise around an estimate as we move through a project. The Bayesian approach makes that a lot easier to do. But yes, the computational challenges are very real; sometimes it's difficult or impossible to fit the model we'd like to fit, and we have to make compromises. Still, it's good to get your philosophy organized first and then make wise choices about those compromises. And there's a unified approach to the construction of both likelihoods and priors, as I said before.

What I like about all of this is that I personally find it demystifying and deflationary. Statistics is overhyped, not so much by statisticians as outside of statistics, and the deflationary view says: look, this is a garbage-in, garbage-out project. You define the joint model of all the variables, and all that inference can do for you is tell you what the data say about that joint model, and that's all. There are no guarantees; a guarantee will not be offered. And I like that: a humble perspective that is nevertheless extremely powerful.

OK, so this is the final slide, which I mean as a conversation starter. I want to make a modest proposal, with all the literary implications of that title. There are a bunch of conventional terms that we use in teaching Bayesian statistics, which appear even in my own book, and I suspect it's a mistake to continue using them, at least without qualification. It would be nice, as a community of people interested in Bayesian statistics, applied or theoretical, to think about alternatives, or families of alternatives, to better teach this material, because the population of people who want to use Bayesian methods is growing very fast. This is the time to get in front of the problem and think about developing new teaching materials. So, very quickly, and then I'll end my talk: "data" could be "observed variable," and "parameter" could be "unobserved variable." A "likelihood" is just a distribution, and likewise a "prior" is just a distribution. I'm not entirely happy with that, because sometimes you do want to make a distinction, and I'm not happy with the term "distribution" either, because people have mythical ideas about what it means.
I think people think distributions are sampled from, and that's what makes me nervous. "State of information" would be an alternative, but no one knows what information is either, including me. For "posterior": can we call it a "conditional distribution"? If we're going to get rid of "prior," then there's nothing to be posterior to, right? But that doesn't roll off the tongue, so we need some solution here, and "conditional state of information" would be even worse. Then we have terms like "estimate" and "random," which I'd like to vote for banishing. These are terms that seem to do nothing useful for us except cause problems. We don't have estimates in Bayesian inference; we have posterior distributions, and there are things we can do with those posterior distributions that lead to behavior changes, given some decision model, but we don't have estimators in the traditional sense. And the word "random", yeah, it just causes a lot of problems. Anyway, thank you for your indulgence, and I hope that was useful. [Applause]
Info
Channel: Richard McElreath
Views: 23,437
Rating: 4.919517 out of 5
Keywords: Bayesian statistics
Id: yakg94HyWdE
Length: 50min 43sec (3043 seconds)
Published: Sat Apr 22 2017