Introduction to Bayesian data analysis - part 1: What is Bayes?

Captions
Hello, I'm Rasmus Bååth, and welcome to part one of this three-part introduction to Bayesian data analysis. This is an introduction that I have given before, for example at the 2015 UseR! conference, and it's targeted at you who isn't necessarily that well versed in probability theory and statistics, but who does know your way around a programming language such as R or Python. Even though it is in three parts, it is going to be quite brief, and I'm going to warn you that it is also going to be quite hand-wavy in parts. But I do hope it will give you some intuition about what Bayesian data analysis is, why it is useful, and how you can perform Bayesian data analysis yourself. So this part one is about what Bayesian data analysis is.

But before we go into that, I'm going to fade myself out, and we're going to start by looking at some famous people. This is Nate Silver. He's one of the more famous statisticians around, not least because he did a very good job predicting the outcome of the two Obama elections, and he wasn't completely off in the Trump election. He's currently the editor-in-chief of the well-known data-driven news site FiveThirtyEight. And here is Sebastian Thrun, who won the 2005 DARPA Grand Challenge, which was about building a self-driving car that could drive through over 200 kilometres of rough terrain. After that, he worked on Google's self-driving car. Finally, here is Alan Turing, a giant in computer science, who helped crack the German Enigma cipher during the Second World War, which helped secure Allied victory and likely shortened the war significantly.

So what do these three people have in common? Well, they all worked on complex problems where there was a large inherent uncertainty that needed to be quantified, and which required efficient integration of many sources of information. And they all used Bayesian data analysis. That's because Bayesian data analysis is a great tool, and R and Python are great tools for doing Bayesian data analysis. But if you google "Bayesian", there's a good chance you won't find articles about how this tool could be used. Instead you might get the philosophy: you'll find articles discussing whether statistics should be subjective or objective (whatever that means), or whether statisticians should adhere to frequentism or Bayesianism, as if these were different religions within statistics, and you will find heated arguments about whether one should or should not use subjective probabilities rather than p-values. In this tutorial I won't talk about any of this. I will just talk about Bayesian data analysis as one good tool among many that you should have in your data science tool belt.

So this tutorial is about the what, the why, and the how of Bayesian data analysis. Part one, which you are watching right now, tries to answer what Bayesian data analysis is; part two touches on why you should want to use Bayesian data analysis; and part three has some hints on how to actually perform a Bayesian data analysis in practice.

So let's start part one proper: what is Bayesian data analysis? Well, this can be characterized in a number of ways, some more helpful than others. One characterization that isn't too helpful, but that is correct, is that Bayesian statistics is when you use probability to represent uncertainty in all parts of a statistical model. So if you use probability to represent all uncertainty, then you are by definition using a Bayesian model. You could also see Bayesian data analysis as a flexible extension of maximum likelihood, maximum likelihood being perhaps the most common way of fitting models in classical statistics. You can also argue that Bayesian data analysis is potentially the most information-efficient method to fit a statistical model, but that it's also the most computationally intensive method. The characterization that we're going to run with in this tutorial is the following: Bayesian data analysis is a method for figuring out unknowns, often called parameters, that requires three things: one, data; two, something called a generative model; and three, priors — what information the model has before seeing the data.

So what is a generative model here? Well, it's a very simple concept: it's any kind of computer program, or mathematical expression, or set of rules, that you can feed fixed parameter values and that will generate simulated data. A typical example of a generative model is a probability distribution, like the normal distribution, which you can use to simulate data — but any kind of function that you can whip up in R or Python that simulates data counts as a generative model, too. So a generative model is great if you know what parameter values you want, but you're interested in how much the data could vary given those parameters, because then you can simply plug in those parameter values, run your generative model a large number of times, and look at how much the data jumps around — that is, what's classically called a Monte Carlo simulation. But often we are in the completely opposite situation: we know what the data is — it's not uncertain — and we want to know what the reasonable parameter values are that could have given rise to this data. That is, we want to work our way backwards from the data that we know, and learn about the parameter values that we don't know. And it is this step that Bayesian inference helps you with.

So now I'm going to explain how Bayesian inference works with a motivating example. Well, it's up to you if you think it's motivating, but it's about fish, and who doesn't like fish? The motivating example is called "Bayesian A/B testing for Swedish Fish Incorporated". This is an example of A/B testing, where you compare the performance of two methods or treatments — one of the classic uses of inferential statistics. But to keep things simple, we're actually going to start by just estimating the performance of one method — so it's just "A testing" — but we will come around to the B later. Now, Swedish Fish Incorporated is a company that makes money by selling fish subscriptions: you know, you sign up for a year, and every month you get a frozen salmon in the mail.
They are huge in Sweden, but now they want to break into the lucrative Danish market. So how should Swedish Fish Incorporated enter the Danish market? Well, the CEO has already come up with a plan — let's call it method A. He put together this colorful brochure, which advertises the one-year salmon subscription plan, and marketing has actually already tried this out on 16 randomly chosen Danes. Out of the 16 Danes that got a brochure, six signed up for one year of salmon. So what we want to know now is: how good is method A? What should we expect the percentage of signups to be if we start sending brochures out on a large scale? Well, we could of course calculate the percentage of signups in our sample — that's just 6 divided by 16, which equals 38 percent — and maybe that is a good guess, but surely this guess is quite uncertain, especially since we have such a small sample. So not only do we want to know what a good guess for the percentage of signups is, we also want to know how uncertain this percentage is. And that's what we're going to use Bayesian data analysis for.

So remember that Bayesian data analysis requires three things. Data — and we have data, so check on that. Then we need a generative model, which we don't have, so let's come up with that: let's come up with a generative model of people signing up for fish. There are of course many ways of doing this; I'm just going to go with something simple here. First, let's assume that there is one underlying rate at which people sign up — for now, let's just pick a number, say 55%. Then we "ask" a number of people, where the chance of each person signing up is 55%. "Ask" is in quotes here because we're not actually going to ask anybody: this generative model is something that we could implement in R or Python, and asking here just means we use some random number function where there's a 55% chance of getting a yes and a 45% chance of getting a no. So how could this look? Let's "ask" 16 people, because that's how many were asked in our data set, and let's see how many of those 16 people sign up. So the first person didn't sign up, the second person didn't sign up, the third person signed up, and so on for all our 16 people — and finally, we count how many signed up. In this instance, 7 out of 16 people signed up, so this time, 7 out of 16 is our simulated data. Great — so now we have a generative model: a model where we can plug in fixed parameter values and generate simulated data.

The problem is, of course, that this is the opposite of what we really want. We don't want to simulate data — we know what the data is. We know that when marketing asked sixteen randomly selected Danes, six of them signed up, and what we want to do is the opposite: we want to go backwards from the data that we know and figure out what reasonable parameter values could have resulted in this data. That is, what is likely the rate of signups that resulted in six out of sixteen signing up? The good news is that we are almost there — we can almost do this. We just need one more thing: we have data, we have a generative model, but we also need priors. We need to specify what information the model has before seeing the data, and when you do Bayesian data analysis, that is the same as saying that we need to use probability to represent uncertainty in all parts of the model. Right now we're not doing that. Here is the model we have so far: we have the generative model, and we have one parameter, one unknown — the overall rate of signup. If I look at this model, I see uncertainty in two places. There is uncertainty in the generative model — we don't know what the simulated data will be each time we run it — but that uncertainty is already implicitly represented by probability, as we're using a random number function to simulate data. And if you know your probability distributions, you might have already recognized that the generative model is actually the same as a binomial probability distribution.
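The generative model just described can be sketched in a few lines of Python. This is a minimal sketch, not the talk's exercise code, and the function name is my own:

```python
import random

def simulate_signups(rate, n=16):
    """Generative model: 'ask' n people, each signing up with probability rate."""
    return sum(random.random() < rate for _ in range(n))

# Plug in a fixed parameter value (a 55% signup rate) and simulate data
# several times to see how much the data jumps around: a Monte Carlo simulation.
random.seed(42)
draws = [simulate_signups(0.55) for _ in range(5)]
```

Each call returns one simulated data point — a count of signups out of 16 — and repeated calls with the same rate give different counts, which is exactly the uncertainty the random number function represents.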
But there is also uncertainty in what the parameter value — the overall rate of signup — is, and that uncertainty is not yet represented by any probability or probability distribution. So let's do that. There are many different ways you could go about this, but I'm just going to use an easy, off-the-shelf solution: I'm going to represent the uncertainty regarding the overall rate of signup by a uniform probability distribution from 0 to 1. That is, by using this probability distribution we're stating that, prior to seeing any data, the model assumes that any rate of signup between 0 and 1 is equally likely. The probability distributions that are used in this way, to represent what the model knows prior to seeing the data, are often called prior distributions, or just priors.

All right — so now we have specified the prior, and all uncertainty in the model is represented by probability. If we now look at our checklist, we see that we have data, a generative model, and priors, and we should be ready to roll. Now we just need to fit the model somehow. Again, there are many different ways of doing this, but here is one that is actually pretty simple. First, we start with our prior, and we draw a random parameter value from it — this time it happened to be 0.21, that is, a signup rate of 21%. Then we take this parameter draw and plug it into our generative model, and we use it to simulate some data — this time, when we ran our generative model, four out of 16 signed up, and so on. And now we do what we just did — drawing from the prior and simulating data — many, many times, say a hundred thousand times. So we've drawn from the prior and simulated data again and again, a hundred thousand times — here we're just looking at the first four draws, but really, there are a hundred thousand.

Now it's time to bring in what we actually know; now it's time to bring in the data. Because if there is something we know, it is that when marketing did this for real, 6 out of 16 people signed up. So now we're going to filter away all those parameter draws that didn't result in data consistent with what we actually observed. That's because we're interested in reality, and what a reasonable rate of signup could be in reality — and in reality, 6 out of 16 people signed up. So we remove the first parameter draw, because it didn't result in 6 people signing up; we keep the second, as that parameter draw actually resulted in 6 people signing up; we remove the third; and we keep the fourth, because it resulted in 6 people signing up — and so on, for all the hundred thousand parameter draws. Note that sometimes we keep a certain parameter value, and sometimes we filter the very same value away: it all depends on whether the generative model simulated matching data that specific time. So, for example, here we toss the first 21% parameter draw, but we keep the fourth.

So what did all this work give us? Well, this is what we started with: the distribution of the hundred thousand draws from the prior, before we did the filtering step. As the prior was a uniform distribution between zero and one, you shouldn't be too surprised to see that the hundred thousand random draws also form a pretty uniform distribution — though if you look carefully, you'll see that the bars actually are slightly different. Then, after having done the filtering step, where we removed all the parameter draws that didn't result in matching data, this is the distribution we ended up with. And this blue, post-filtering distribution is actually the answer to our original question about what a likely value of the signup rate is, because a parameter value that is more likely to generate the data we collected is going to be proportionally more common in this blue distribution. That is, a parameter value that is twice as likely as some other parameter value to generate the data we actually saw is roughly going to be twice as common in this blue distribution.
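The whole fit-by-filtering procedure just described can be sketched like this — again a minimal sketch under my own naming, not the exercise's code:

```python
import random

def simulate_signups(rate, n=16):
    """Generative model: each of n people signs up with probability rate."""
    return sum(random.random() < rate for _ in range(n))

random.seed(1)
n_draws = 100_000
observed = 6  # in reality, 6 out of 16 Danes signed up

# Draw parameter values from the uniform(0, 1) prior, simulate data for
# each draw, and keep only the draws whose simulated data matches the
# data we actually observed.
prior_draws = [random.uniform(0, 1) for _ in range(n_draws)]
posterior = [p for p in prior_draws if simulate_signups(p) == observed]
```

The surviving draws in `posterior` form the blue post-filtering distribution: parameter values that generate the observed data more often survive proportionally more often.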
So right away we can see, just by looking at the distribution, that parameter values below 0.1 and above 0.8 almost never resulted in the data we observed — so it should be very unlikely that the signup rate for the salmon subscription is below 10% or above 80%. We see that the subscription rate is likely somewhere between 20 and 60 percent, with it most likely being between 30 and 40 percent. But, importantly, we see that the distribution is pretty wide, which means that even after using our not-so-impressive data set of 16 data points, the signup rate is still very uncertain.

Now, Bayesian data analysis was all about representing uncertainty with probability, but we still haven't calculated any probabilities. Since we have a distribution of samples, though, it's easy to do. Say we want to calculate the probability that the signup rate is between 30 and 40 percent. Then we first count up how many parameter draws are between 0.3 and 0.4 — here it was 1,900 — and then we divide by the total number of draws that survived the filtering step, which was 5,700. And if we divide 1,900 by 5,700, we get that there is a 33% probability that the signup rate is between 30 and 40 percent. We can of course do the same calculation for all the bars of the distribution, and what we end up with is a probability distribution over likely signup rates. Now, it's important to note that this probability distribution doesn't stand on its own: it's always given the assumptions and the data we used.

In Bayesian jargon, these two distributions are called the prior and the posterior distribution. That's because the prior distribution is what the model knows about the parameters before — prior to — using the information in the data, and the posterior distribution is what the model knows about the parameters after — posterior to — having used the data in the filtering step. Now, the posterior distribution is really the end product of a Bayesian analysis: it contains both the information from the model and the information from the data, and we can use it to answer all kinds of questions. For example: what signup rate is the most probable? This is just the parameter value with the highest probability in the posterior. Since the plot of the posterior is binned into ten-percent bins, it's a little difficult to read from the plot, but the signup rate with the highest probability is actually 38 percent. So if we just had to report a single number, 38 percent could be given as a best guess. Now, as we used a uniform prior, this is actually also the parameter value that is the most likely to generate the data we actually observed, and in classical statistics this type of estimate is well known under a specific name — do you want to guess? It's the parameter value with the maximum likelihood of generating the data we observed — yeah, it's the so-called maximum likelihood estimate. Maximum likelihood estimation is one of the most common ways of fitting models in classical statistics, and this is the reason why Bayesian data analysis can be seen as an extension of maximum likelihood estimation: as long as you use flat priors, you'll always get the maximum likelihood estimate for free when you fit a Bayesian model.

But there are other ways you can summarize a posterior distribution besides the maximum likelihood estimate. You can take the mean of the posterior distribution — the posterior mean — as another best guess of the rate of signup; in this case it's almost the same as the maximum likelihood estimate, but that's not always the case. Or you might want to summarize the uncertainty of the signup rate as an interval, and then you can find the shortest interval that covers, say, 90% of the probability. This is often called a credible interval, and here we can see that the 90% credible interval goes from 0.30 to 0.54 — so we can state that the signup rate is between thirty and fifty-four percent, with 90 percent probability.

All right — that was a simple example of Bayesian data analysis, and at this point in the tutorial we usually do a small exercise, where we replicate the analysis I just described.
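If you want to compute summaries like the ones above yourself, they all fall straight out of the list of surviving draws. A sketch, assuming `posterior` holds the draws that survived the filtering step (the function and its shortest-interval scan are my own, not from the talk):

```python
def summarize(posterior):
    """Summarize a list of posterior draws of the signup rate."""
    n = len(posterior)

    # Probability that the rate is between 30% and 40%:
    # count the draws in that range and divide by the total.
    p_30_40 = sum(0.3 <= p <= 0.4 for p in posterior) / n

    # Posterior mean as a best guess for the rate:
    post_mean = sum(posterior) / n

    # Shortest 90% credible interval: slide a window covering 90% of the
    # sorted draws and keep the narrowest one.
    s = sorted(posterior)
    k = int(0.9 * n)
    lo, hi = min(((s[i], s[i + k - 1]) for i in range(n - k + 1)),
                 key=lambda iv: iv[1] - iv[0])
    return p_30_40, post_mean, (lo, hi)
```

With the roughly 5,700 surviving draws from the salmon example, this would give numbers close to the 33%, posterior-mean, and 0.30–0.54 figures quoted above, varying a little from run to run.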
So if you are in front of a computer and have either R or Python installed, I recommend that you pause this video and try out this exercise, by following either of the links here. It's a really useful exercise for helping you understand Bayesian data analysis — so pause the video now and do it. I'll be waiting for you when you come back. (Humming while waiting.)

All right — welcome back! I hope the exercise went well, and if it didn't, you should take a look at the solution at the bottom of the exercise page. Now we're going to go through what we just did — so, nothing new here — but this time we're going to use a tiny bit of math notation. The green distribution here is the prior — that's what we started with — and the blue distribution is the posterior — that's what we ended up with after having used the data. So how did we go from the prior to the posterior? Take a signup rate of 35% as an example. First, a signup rate of 35% had to be drawn from the prior, and that it did, with some probability — here, the p with parentheses stands for the probability of drawing a parameter value of 35 percent from the prior. Then, in order to not be thrown away, this parameter draw had to simulate data which matched the data we actually observed, and that it also did, with some probability — here, the p with the vertical bar stands for the probability of generating six signups given (the bar should be read as "given") a parameter value of 35 percent. By multiplying these two probabilities together, we get the probability of first generating a signup rate of 35 percent and then simulating data that matched the data we observed, and this will be proportional to the probability of 35 percent being the true parameter value, given the data. This is the same as before, where the 1,900 parameter draws in the posterior between 30 and 40 percent were proportional to the probability of the signup rate being in that range — but 1,900 parameter draws wasn't a probability until we divided by the total number of draws. In the same way, we here have to divide by the total sum of the probabilities of generating the data, over all the parameter values, and this gives us the probability of a signup rate of 35 percent, given the data. And of course, we could use the same procedure for all the other signup rates, to retrieve a full probability distribution. Again, this was nothing new — this is just what we did before, but now using a little bit of probability notation.

So what have we done? We have specified prior information and a generative model, and we have calculated the probability of different parameter values given the data. In this example we used a binomial model with one rate parameter, but the cool thing here is that the general method works for any generative model with any number of parameters. That is, you can have any number of unknowns — also known as parameters — that you plug into any generative model that you can implement; the data can be multivariate, or it can consist of completely different data sets; and the Bayesian machinery that we used in the simple case works in the same way there. Now, the equation down at the bottom here is just a generalized version of the one we used in the salmon subscription problem, where D is the data and θ (theta) stands for the parameters. So this equation isn't anything new — it's just what we did before — and that equation is what's usually called Bayes' theorem. So there it is.

Now, I need to mention that the specific computational method we used in the salmon subscription problem only works in rare cases. It's called approximate Bayesian computation, and what's specific to this method is that you code up a generative model, then simulate from it, and only keep the parameter draws that match the data. It's a good method because it's conceptually simple, but it's a bad method because it can be incredibly slow.
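For reference, the generalized equation described verbally above — Bayes' theorem, with D the data and θ the parameters — can be written out as:

```latex
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{\sum_{\theta'} p(D \mid \theta')\, p(\theta')}
```

The numerator is the two multiplied probabilities from the walkthrough — drawing θ from the prior, then simulating matching data given θ — and the denominator is the normalizing sum over all parameter values, analogous to dividing the 1,900 matching draws by the 5,700 that survived.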
It also scales horribly with larger data, if used naively. But there are many, many faster methods — faster because they can take computational shortcuts — and the important part is that, if one of those faster methods works, the end result will be the same as if you had used approximate Bayesian computation; you will just get the answer much, much faster. So if you hear about a cool method like Hamiltonian Monte Carlo, you don't need to be too nervous if you don't know how it works, because it's just another method of fitting Bayesian models, and when it works, the output will be the same as if you had fitted the model using approximate Bayesian computation — but Hamiltonian Monte Carlo will probably get you the result today, rather than in a hundred years, say.

So, now that we've talked about what Bayesian data analysis is, let's talk a little bit about what it is not. First, it's not a category of models. It's not like you have regression models, decision trees, neural networks, and Bayesian models. Bayesian data analysis is more a way of thinking about and constructing models, and many parts of statistics and machine learning can be done in a Bayesian framework: you can have Bayesian regression models, Bayesian decision trees, Bayesian neural networks, and so on. Bayesian data analysis is also not more subjective than other types of statistics — as far as I can see. Or rather, all statistics is equally subjective: classical methods also require that you make some assumptions, and the result always has to be interpreted in the light of those assumptions. It's the same in Bayesian statistics and in classical statistics. Also, even though Bayesian methods have become more popular quite recently — starting in the 90s, when suddenly everybody got a PC — it's important to remember that this is not anything new. It's actually called Bayesian because of this guy, Thomas Bayes, who lived in the 1700s. It's named after him because he failed to publish an essay on solving a specific probability problem, which was then published after his death. But Bayesian statistics should really be named after this guy, Pierre-Simon Laplace, who was the first to describe the general theory of Bayesian data analysis. He didn't call it Bayesian, though — rather, he called it "inverse probability", which sort of makes sense, as we use probability to go backwards from what we know (the data) to figure out what we don't know (the parameters). The first known person to have used the word "Bayesian" was actually Ronald Fisher — the guy who popularized the p-value — and he didn't mean it as a compliment, because he famously loathed Bayesian statistics. So maybe Bayesian data analysis is not the best of names, and a better name would actually just be "probabilistic modeling", because that's really just what it is.

All right — that concludes part 1 of this three-part introduction to Bayesian data analysis. Now that we know what Bayesian data analysis is, in part two we're going to take a look at why you would want to use Bayesian data analysis. Again, I'm Rasmus Bååth, and thanks for staying with me to the end.
Info
Channel: rasmusab
Views: 230,722
Keywords: Bayesian, Statistics, Tutorial, Probability
Id: 3OJEae7Qb_o
Length: 29min 29sec (1769 seconds)
Published: Sun Feb 12 2017