And the Bayesians and the frequentists shall lie down together...

Captions
thank you very much, thank you Andrew, thank you all for having me. This talk is about two warring tribes in data science and statistics, the Bayesians and the frequentists. Who here is familiar with these two views on how you should analyze data? And who belongs to one of these tribes — who is a Bayesian? Okay, who's a frequentist? All right. So I'm here to tell you it doesn't have to be like this. These are families of techniques that both have mathematical validity. They don't have to be warring tribes; you don't have to belong to one or the other, and that's the message of this talk: how the Bayesians and the frequentists can lie down together.

So let's talk about what everyone agrees on. Whether you're a Bayesian or a frequentist, everyone should agree on the math. Who here is familiar with the axioms of probability? There's some stuff that can happen, and then there's some subset of that which is a particular event. The stuff that can happen is that someone is going to be elected president, and the subset is: it's going to be Hillary Clinton, it's going to be Donald Trump, it's going to be Gary Johnson. Given these events, the subsets, and the sample space, we can define the probability of some particular thing happening. This probability is a function of the subset, and it obeys some common-sense rules: it's got to be a real number between 0 and 1; the probability of everything is 1 — someone's going to be president; and if the subsets are disjoint, if they can't both happen, then the probability of the union of them is the sum of their probabilities. So the probability that Hillary or Trump becomes president is just the probability of Hillary plus the probability of Trump — they're not both going to be president.

These are the axioms of probability. They were first written down in 1933 by Kolmogorov, and everyone agrees on them, the Bayesians and the frequentists. This is the part of the Bible we all agree on. There are some theorems — this is math, so you can prove things: the probability of not-Hillary equals one minus the probability of Hillary, the probability of nothing is zero, et cetera. The theorems are all non-controversial.

And we can make some definitions. For any two events A and B, we call the probability that A and B both happen the joint probability. We can also define the conditional probability: if we know that B happens, what's the probability that A will happen? We write it as the probability of A given B — the probability that if Hillary is elected, then something else will happen — and it's defined to be the probability of A and B divided by the probability of B. We divide by the thing that we know happened: we know B happened, so we're interested in the probability that both happen relative to the probability that B happened. And we say that two events are independent if the joint probability is just the product of the two separate probabilities — telling you about one gives you no information about the other. If they're independent, then the probability of A given B is just the same as the probability of A: the probability that the lottery number will be 13 tomorrow, given that Hillary is elected, is just the same as the probability that the lottery number will be 13 tomorrow. It doesn't depend on the election.

All right, still fine. Now let's talk about a nice theorem, from a religious man, the Reverend Bayes, in England. He died, and in his writings they discovered this beautiful theorem.
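These definitions can be checked mechanically. Here's a minimal sketch (not from the talk) on a small finite sample space — two fair dice — illustrating conditional probability, independence, and the identity that becomes Bayes' theorem:

```python
# Illustrative sketch (not from the talk): probability on a finite sample
# space of two fair dice, checking the definitions given above.
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    """P(E) under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond(a, b):
    """Conditional probability P(A | B) = P(A and B) / P(B)."""
    return prob(lambda w: a(w) and b(w)) / prob(b)

first_six = lambda w: w[0] == 6
second_six = lambda w: w[1] == 6
total_11_up = lambda w: sum(w) >= 11

# Independence: conditioning on the first die says nothing about the second.
assert cond(second_six, first_six) == prob(second_six)

# Dependence: conditioning on the first die does change this probability.
assert cond(total_11_up, first_six) > prob(total_11_up)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
assert cond(first_six, total_11_up) == \
    cond(total_11_up, first_six) * prob(first_six) / prob(total_11_up)
```

Using exact `Fraction` arithmetic keeps the equalities exact rather than approximate.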
Take the definition of conditional probability: the probability of A given B is the probability of A and B divided by the probability of B. We can flip the symbols around and get the same thing for B given A. Manipulate the symbols a little — multiply both sides by the probability of B — and we get that the probability of A and B equals the probability of A given B times the probability of B, which is also the probability of B given A times the probability of A. Flip that around and you get this nice thing: the probability of A given B expressed in terms of the probability of B given A. That's very interesting. If we wanted to know, given that Hillary is elected, the probability that the country explodes, we can express it in terms of the reverse conditional probability — the probability that Hillary is elected given that the country explodes.

All right, so let's talk about why this is interesting. I'm a big sailor — it wasn't mentioned in my biography, but I love to sail — and I was in the Caribbean recently, where there's an archipelago: a country with two islands. There's Frequentist Island, which used to be a French colony, and just next to it is Bayesian Island — the same country, but two islands. I was down there — this is me — and I visited the local authorities, and it was actually a tragic time for them, because they told me their king had been poisoned. The king of both islands had been poisoned. That's a problem down there; they don't like poisonings. So a letter went out from the central government to the governor of each island, Frequentist and Bayesian. It said: Dear Governor, attached is a blood test for proximity to the poison. It has a zero percent rate of false negatives and a one percent rate of false positives. Jail the responsible parties, but remember the nationwide law: you must be 95 percent certain to send a citizen to jail. They believe in civil rights down there in the French Caribbean.

I happened to be on Frequentist Island when they got the letter, and the question is: how do you interpret this in the language of probability? The letter is in French, but they have to turn it into mathematics. So how do we do it? It's pretty easy. What is the conditional probability of a negative result given that someone's guilty? Zero — that's what it means to have a zero percent rate of false negatives: if someone's guilty, a negative test would be a false negative, and there's a zero percent chance of that. So the probability of a positive test given that you're guilty is one. And we have a one percent rate of false positives, so given that a person is innocent, the probability of a positive result — a false positive — is one percent, and the probability of a negative result is ninety-nine percent. That's how we translate these statements into math.

Now, how about the second sentence — you must be 95 percent certain to send a citizen to jail? How do we translate that into math? It's obvious: the probability that we send someone to jail, given that they're actually innocent, has got to be less than 5 percent. We have to be 95 percent certain of their guilt to send them to jail; that's the obvious translation. So can you just take everyone with a positive result and send them to jail? Well, the probability of jail given innocent has to be less than 5 percent, and if we jail everyone with a positive result, then what matters is the probability of a positive result given innocent. And what is that? One percent, which is less than 5 percent. So yes — you can jail everyone with a positive result. Great.

Then I traveled to Bayesian Island — a very simple sailboat trip — and they have the same problem: interpret this letter in French and turn it into math. The first sentence they interpret exactly the same way as their friends on Frequentist Island: zero percent, one hundred percent, ninety-nine percent, one percent. But then they have to interpret "we must be 95 percent certain." How do they turn that into math? Well, it's obvious: the probability that a person is innocent, given that you're sending them to jail, has got to be less than 5 percent. If we're going to send someone to jail, there has to be less than a five percent chance they're actually innocent. That's the obvious interpretation, right? Who thinks that's obvious? Okay. And on the last slide we had the other one: given that someone's innocent, the probability that we mistakenly send them to jail has to be less than 5 percent. Who thinks that one is obvious? More people — so you're frequentists, even if you didn't think you were. It's not so clear how to interpret this sentence.

So on Bayesian Island they say: given that someone's going to jail, the probability that they're actually innocent has got to be less than 5 percent. Same question: can you take everyone with a positive test and send them to jail? Here it's harder to answer, because we can substitute jail for positive, but then we have to figure out the probability of innocent given positive. We don't have that information — we have positive given innocent, but what we need is innocent given positive. How do we flip it around? Bayes' theorem, exactly — and that's a theorem everyone agrees on. So let's do the math. Here's the thing we know, the probability of positive given innocent, and here's the thing we want, the probability of innocent given positive. We substitute jail for positive, fill in the probability of a positive result given innocent — point oh one — and expand the denominator by the law of total probability. And then we need one more term: the probability that a given islander is innocent, prior to any test. Maybe there's a conspiracy; we don't know how many people are involved. Do we really need that information? We do. Isn't that strange? How come the frequentists didn't need it? To do this calculation we need information that's not given; we need a further assumption: knowing nothing about someone, before any test, what's the probability that they're innocent? It has to come from somewhere external to this problem. That's what's different about Bayesian Island: they have to assume a prior probability of innocence before they do any test.

So let's say they make a very aggressive assumption — it's a real law-and-order island down there. Say the probability that someone's innocent is only ninety percent: the assumption is that 10 percent of the people are guilty. A conspiracy — the king was not a well-liked figure. If they make this very aggressive assumption, can they jail all the people with a positive test? It turns out that even with this assumption — the kind an evil dictator might make — you still can't do it. If you plug and chug, the probability of innocent given jail is still greater than 5 percent. You would have to make an even more evil assumption to be able to send anybody to jail.

So on Bayesian Island nobody goes to jail, whereas on Frequentist Island at least 1 percent of the population goes to jail — that's the false-positive rate, and that's just among the innocent people. On Bayesian Island, even assuming that 10 percent of the people are guilty — a crazy aggressive assumption — still nobody goes to jail. Despite this evil assumption, they end up being much more protective of the innocent. But what is the disagreement here about? Is it about philosophy, the meaning of probability? No. Is it about the theorems, the math? No. Is it about the facts? I don't think they disagree on the facts — they have the same false-positive rate and the same false-negative rate; they agree on all of that. It's about how you interpret "we must be 95 percent certain." On Bayesian Island they interpret it to mean: given that we're sending you to jail, the probability that you're innocent has to be less than 5 percent. On Frequentist Island they interpret it to mean: given that you're innocent, the probability that we mistakenly send you to jail has got to be less than 5 percent. The disagreement between these two islands is in how they translate this French phrase into math — not philosophy, not the meaning of probability, but what we mean by certainty.

And they care about different things. On Frequentist Island they care about the overall rate of false positives — the rate of jailings among the innocent.
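The Bayesian governor's plug-and-chug can be reproduced in a few lines, using the numbers from the letter and the hypothetical 90-percent-innocent prior from the talk:

```python
# Bayes' theorem with the island's numbers. The prior is the "aggressive"
# assumption from the talk that only 90% of islanders are innocent.
from fractions import Fraction

p_pos_given_guilty = Fraction(1)         # 0% false-negative rate
p_pos_given_innocent = Fraction(1, 100)  # 1% false-positive rate
p_innocent = Fraction(9, 10)             # assumed prior probability of innocence
p_guilty = 1 - p_innocent

# Law of total probability: P(positive) over guilty and innocent citizens.
p_pos = p_pos_given_guilty * p_guilty + p_pos_given_innocent * p_innocent

# Bayes' theorem: P(innocent | positive) -- what the Bayesian law constrains.
p_innocent_given_pos = (p_pos_given_innocent * p_innocent) / p_pos

print(float(p_innocent_given_pos))  # about 0.083 -- still above the 5% bar
assert p_innocent_given_pos > Fraction(5, 100)
```

The result, 9/109, is roughly 8.3 percent: even with the evil prior, jailing everyone who tests positive violates the 5 percent rule.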
If you're sitting in your house and you know you're innocent, you have a 1 percent chance of going to jail by mistake — that's what they care about on Frequentist Island. On Bayesian Island, they care about what percent of the jail inmates are actually innocent. Those are different questions: on one island you're looking at the whole population — what percent of these people will we mistakenly send to jail? — and on Bayesian Island, among the jail inmates, what percentage are actually innocent? They care about different things, and only the Bayesians had to make an assumption about the overall rate of innocence. That's the difference. In general, it's this notion of certainty — what we mean by certainty — that delineates the difference between the two schools of thought about data analysis.

In data science in general, there's a paradigm where there's some underlying truth, some fact: you're innocent or you're guilty; Oreos cause cancer or they don't; or, for a data-science problem, maybe this advertisement converts more readily than some other advertisement. You have some hypothesis, and that's the truth. You're never going to be able to exactly know the truth, but you can perform an experiment to probe it, and the output of the experiment is an observation — some reading, like the number of people who clicked on this ad. We put that into some inference procedure, and its output is an expression of our uncertain knowledge, like an interval on the parameters: between 4 percent and 9 percent of people will buy the product if they see this ad. That's an expression of uncertainty — some sort of confidence interval. So this is the general paradigm.

Now I'm going to tell you about something that happened to me as a kid, and then give you some case studies from when I was a newspaper reporter, all in this general paradigm. I grew up in Chicago, actually with Andrew — you remember Jewel, the supermarket? We were not that wealthy; we didn't get the name-brand Oreo cookies, we got the President's Choice store-brand cookies. My mom would do the grocery shopping for the family, and they had four different assortments of these generic chocolate-chip cookies. Each assortment had a hundred cookies in it, but the distribution of chocolate chips per cookie would vary, and you would never know if you were getting the A-type, the B-type, the C-type, or the D-type cookie jar. Given that you had an A-type jar, for example, if you reach in and pull out one cookie uniformly at random, there's a 70 percent probability your cookie has two chips on it and a 1 percent probability it has no chips, whereas with the D-type jar there's a 70 percent probability your cookie has only one chip. The way these cookie jars work, every column of the table represents a probability distribution over 100 cookies, so each column adds up to 100 percent.

I used to play this game where I would try to identify which cookie jar my mom brought home. The experiment looked like this: the underlying parameter is the name of the cookie jar, A, B, C, or D. I wouldn't know it, but it would be some fact — what kind of jar is it? The experiment was to sample one cookie, the observation was the number of chips on that cookie — 0, 1, 2, 3, or 4 — and then I would express some uncertainty interval over which jars it could be, like: oh, it could be an A, B, or C jar. That was my game as a kid. "Broad lawns and narrow minds" is what Hemingway said about our hometown; this is how we entertained ourselves.

I grew up in the 80s, so it was really all about the frequentist methodology. We expressed our certainty with something called a confidence interval — who's heard that phrase? Here's the definition: a 70 percent confidence interval procedure includes the correct jar with at least 70 percent probability in the worst case. No matter which jar it is, worst case, the procedure has a 70 percent chance of including the correct jar. Let me take you through what I mean. Again, every column is a probability distribution, and we're going to build a confidence interval procedure for the scenario where we pick a cookie at random. That means that for each type of jar separately, we have to make sure the resulting cookie produces an interval that includes the right jar often enough.

Say it's the A-type jar. What do we have to ensure? It's very common that I'm going to get the cookie with two chips, so we have to include that: if I pull a cookie with two chips, the resulting interval has to include the A-type jar. The interval procedure is a mapping from the number of chips — the observation — to an interval on the underlying parameter, A, B, C, or D, and we have to make sure that within every column, the cells included in the interval sum to at least 70 percent. For the A-type jar this is actually enough to guarantee a valid confidence interval: if I reach in and see two chips, the resulting interval includes A, and that works at least 70 percent of the time, because 70 percent of the time I'll reach in, the cookie will have two chips, I'll say A, and I'll be right.

Let's move to the next column. I highlight the biggest number — we don't quite have 70 percent yet — so I highlight the second biggest, still not enough, then the third biggest, still not quite, then the fourth biggest — okay, we've got it. That meets the criterion for a B-type jar. We have to do this worst case, for every possible column. So if my mom brings home a B-type jar and I get one chip, two chips, three chips, or four chips, the resulting interval will include B, and at least 70 percent of the time I'll be correct. If I pull out two chips, for example, I'll say it's A or B. For the C column, the biggest and second biggest numbers get us there, and for D, one cell does it. Great.

So this is a valid 70 percent confidence interval procedure for the cookie-jar scenario. If it's two chips, I say A or B; if it's three chips, I say it's definitely B; if it's one chip, I say B, C, or D. In the worst case, for any type of jar, I will be correct at least 70 percent of the time. That's how I learned it.

Now, my sister came along about eight years later — she's younger; she was in grade school when I was in high school — and she was raised in the new way, the Bayesian way. The Bayesians don't use these confidence intervals; they use something called a credible interval, and the definition is different, just like on those islands. A 70 percent credible interval has at least 70 percent conditional probability of including the correct jar, given the observation and given the prior assumptions.
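The column-by-column confidence construction just described can be sketched in code. The likelihood table below is hypothetical — the talk specifies only a few entries (for instance, P(2 chips | A) = 0.70 and P(1 chip | D) = 0.70), so the rest are made up, and the resulting intervals differ in detail from the talk's — but the greedy construction and the worst-case guarantee are the same:

```python
# Greedy construction of 70% confidence intervals for the cookie-jar game.
# The likelihood table is hypothetical: only a few entries come from the talk.
LIKELIHOOD = {  # P(chip count 0..4 | jar); each column sums to 1
    "A": [0.01, 0.05, 0.70, 0.14, 0.10],
    "B": [0.05, 0.15, 0.25, 0.30, 0.25],
    "C": [0.05, 0.40, 0.35, 0.15, 0.05],
    "D": [0.10, 0.70, 0.10, 0.05, 0.05],
}

# For each jar (column), mark the most likely chip counts until at least 70%
# of that column is covered; the interval for an observation is then the set
# of jars that marked it.
interval = {chips: set() for chips in range(5)}
for jar, col in LIKELIHOOD.items():
    covered = 0.0
    for chips in sorted(range(5), key=lambda c: -col[c]):
        if covered >= 0.70:
            break
        interval[chips].add(jar)
        covered += col[chips]

# Worst-case guarantee: whichever jar is true, the reported interval
# contains it with probability at least 70%.
for jar, col in LIKELIHOOD.items():
    coverage = sum(col[c] for c in range(5) if jar in interval[c])
    assert coverage >= 0.70
```

With these made-up numbers, a zero-chip cookie maps to the empty interval — the "hippopotamus" answer that comes up later in the talk.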
That is, after we do the experiment, we ask: what is the probability that it was this jar, given my observation? And we want to include at least 70 percent of that probability mass. So let's construct those intervals. First we have to make the prior assumption, so let's assume the jars are equally probable: you go into Jewel, and it's just random whether they have the A's, the B's, the C's, or the D's in stock. I don't know if that's true — when I was a kid, the danger in going to the grocery store was that your teacher might be there, and I was always too shy to go, so I don't know the procedure for stocking these cookie jars. But maybe it's uniformly random. We have to make that assumption if we're going to do Bayesian inference, so we'll make it.

Now we calculate. Remember, we're starting with the probability, given that we know the jar — A, B, C, or D — of the number of chips we'll get. That's not what we need; we want the opposite: given that I got a certain number of chips, say three, what is the probability that it was a particular jar? So we have to flip it around, and we do that by applying Bayes' rule. Here's how. First we take the columns and multiply everything by 1/4 — that was the uniform prior — which gives us the joint probability that it's a certain jar and we get a certain number of chips. All I did was multiply all the numbers by 1/4, so now, instead of each column being a probability distribution, the whole table is a probability distribution, and the sum of the whole table is 100 percent. This is the joint probability distribution. The probability of the event that my mom brings home a D-type jar and I pull out a cookie with three chips on it is one quarter of what that column entry said, and in total they all add up to 100 percent.

All right, that's applying the prior. Now we go through every row and say: okay, what if I've actually done the experiment? I've reached in and got zero chips on the cookie. What does that tell me? It tells me which row I'm in: I'm not in this row, not in this row, not in this row — I've got to be in that top row. Prior to looking at the number of chips, I only had a 13 percent chance of being in that row, but now that I've looked, I know I'm in it, so it shouldn't say 13 percent anymore. What should it say? A hundred percent, exactly. So I just inflate these numbers until the row adds up to 100 percent — keep multiplying until it gets there. Now this row is a probability distribution. We started with the columns, then we had the whole table, now we have the rows. I do that on every row — boom, 100; boom, 100; boom, 100; boom, 100. What we have now is, given that I know how many chips I got, the probability of which jar it is, and every row is a probability distribution adding up to 100 percent. We just applied Bayes' theorem.

Okay, now we just have to make the intervals — this is how my sister was taught to do it. We take these conditional probabilities, also known as posterior probabilities — this is after we've done the experiment: the probability that it was a particular jar, given the number of chips we saw — and we circle at least 70 percent of the probability mass in each row. In the first row, the biggest number and then the next biggest — done with that one. Then the biggest and the next biggest; then the biggest — good enough; and the next one; and the next one — done. Remember, earlier we cared about the column sums being at least 70 percent; now we're talking about the row sums being at least 70 percent.
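My sister's recipe — multiply by the uniform prior, renormalize each row, then circle at least 70 percent of each row — can be sketched the same way. As before, the likelihood table is hypothetical (the talk gives only a few entries), so the particular intervals are illustrative:

```python
# Greedy construction of 70% credible intervals under a uniform prior.
# Hypothetical likelihood table; only a few entries come from the talk.
LIKELIHOOD = {  # P(chip count 0..4 | jar)
    "A": [0.01, 0.05, 0.70, 0.14, 0.10],
    "B": [0.05, 0.15, 0.25, 0.30, 0.25],
    "C": [0.05, 0.40, 0.35, 0.15, 0.05],
    "D": [0.10, 0.70, 0.10, 0.05, 0.05],
}
PRIOR = {jar: 0.25 for jar in LIKELIHOOD}  # assume jars equally likely

credible = {}
for chips in range(5):
    # Bayes' theorem, row by row: joint = prior * likelihood, then
    # renormalize the row so it sums to 100%.
    joint = {jar: PRIOR[jar] * LIKELIHOOD[jar][chips] for jar in LIKELIHOOD}
    total = sum(joint.values())
    posterior = {jar: p / total for jar, p in joint.items()}

    # Circle the most probable jars until at least 70% of the row's
    # posterior mass is included.
    chosen, mass = set(), 0.0
    for jar in sorted(posterior, key=posterior.get, reverse=True):
        if mass >= 0.70:
            break
        chosen.add(jar)
        mass += posterior[jar]
    credible[chips] = chosen
    assert mass >= 0.70  # each interval really holds 70% posterior mass
```

With these numbers, for example, a three-chip cookie gives the credible interval {B, C}, holding about 70 percent of that row's posterior mass.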
These are my sister's credible intervals. What they mean is: if you get, for example, a cookie with two chips, you say that with 74 percent confidence — credibility — it's got to be type A.

So one day my sister and I compared notes. At the top are my intervals, the confidence intervals, and at the bottom are hers, the credible intervals. And I looked at her intervals and said, as one does to a younger sister: what are you doing? This is crazy. Look what happens if our parents bring home a type-B jar: you're going to be correct only 20 percent of the time — wrong 80 percent of the time — because the only time your resulting interval includes B is when you happen to pull out a cookie with three chips, and that only happens 20 percent of the time. The rest of the time you're wrong. How can you be wrong 80 percent of the time while claiming 70 percent confidence? That's insane; it doesn't make sense. And what does she say? She says: that's okay, because I've made the assumption that type-B jars only show up 25 percent of the time. The fact that I'm wrong there averages out, because I'm right so often on the other types of jars. Yes, every time they bring home a type-B jar I'll be wrong 80 percent of the time and still claim 70 percent confidence, which seems perverse, but it's okay because I'm assuming those jars only happen 25 percent of the time. And I say: well, you're staking a lot on that assumption.

But she says: wait a minute — look at your intervals. Your intervals are even more insane. What happens if you reach in and pull out a cookie with no chips on it? What do you say? And I say: oh, I say it came from a hippopotamus. It came from the empty set, because I just don't have a mapping there. I say, with 70 percent confidence, it came from a hippopotamus. And she says: that's insane, because you know it didn't come from a hippopotamus — it had to come from some jar. How can you claim with 70 percent confidence that it came from no jar at all? And how do I say that's okay? It's okay because that's going to happen so rarely — I make up for it with my good coverage on the one-, two-, three-, and four-chip cookies. So I say it's okay because other outcomes will happen often enough, and she says it's okay because other jars, by her assumption, will happen often enough.

So who's right here? Which set of intervals makes more sense? Everyone at the beginning said they were a Bayesian — is anyone flirting with frequentism now? And notice that the Bayesian argument only holds up if the Bayesian is correct about the prior assumption.

Now let me tell you one more problem with the Bayesian approach. Number one, they had to make that assumption, which is staking a lot. But number two: say I get a jar, I reach in, I get zero chips, and I give some ridiculous answer. What if I just repeat the experiment? I reach in, get a different cookie, do the same thing again, and give a different interval. I can keep sampling the same jar, and I'll be correct 70 percent of the time. My sister can't say the same thing, because my sister's errors are correlated with the identity of the jar. If it's a type-B jar, she can keep sampling as many times as she wants — she's still going to be wrong 80 percent of the time. If she sends a hundred robots to her jar, and each samples one cookie and comes up with a belief state, eighty percent of those robots will be wrong, and they'll all have 70 percent confidence. Whereas if I send in a hundred robots, yes, some of them will say crazy things, but the majority will at least be correct. The Bayesian errors are correlated with the truth, if the truth takes on an inconvenient value.
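This complaint about correlated errors can be checked directly: build both kinds of intervals from the same table and compare their coverage conditional on each jar. The table below is hypothetical (only a few entries come from the talk), so the exact figure differs from the talk's 80-percent-wrong example, but the shape of the result is the same:

```python
# Sketch with a hypothetical likelihood table: frequentist coverage holds
# for every jar, while credible-interval coverage can dip below 70% for an
# "inconvenient" jar.
LIKELIHOOD = {  # P(chip count 0..4 | jar)
    "A": [0.01, 0.05, 0.70, 0.14, 0.10],
    "B": [0.05, 0.15, 0.25, 0.30, 0.25],
    "C": [0.05, 0.40, 0.35, 0.15, 0.05],
    "D": [0.10, 0.70, 0.10, 0.05, 0.05],
}

def greedy(weights, level=0.70):
    """Pick labels in decreasing weight until `level` total mass is covered."""
    chosen, mass = set(), 0.0
    for label in sorted(weights, key=weights.get, reverse=True):
        if mass >= level:
            break
        chosen.add(label)
        mass += weights[label]
    return chosen

# Confidence intervals: greedy over each jar's column of the table.
confidence = {chips: set() for chips in range(5)}
for jar in LIKELIHOOD:
    for chips in greedy({c: LIKELIHOOD[jar][c] for c in range(5)}):
        confidence[chips].add(jar)

# Credible intervals: greedy over each row's posterior (uniform prior).
credible = {}
for chips in range(5):
    row = {jar: 0.25 * LIKELIHOOD[jar][chips] for jar in LIKELIHOOD}
    total = sum(row.values())
    credible[chips] = greedy({jar: w / total for jar, w in row.items()})

def coverage(intervals, jar):
    """P(reported interval contains `jar` | the true jar is `jar`)."""
    return sum(LIKELIHOOD[jar][c] for c in range(5) if jar in intervals[c])

# The frequentist guarantee is worst-case over all jars...
assert all(coverage(confidence, jar) >= 0.70 for jar in LIKELIHOOD)
# ...but the credible intervals' coverage collapses for the type-B jar,
# even though every interval claims 70% credibility.
assert coverage(credible, "B") < 0.70
print(coverage(credible, "B"))  # about 0.60 with these made-up numbers
```

With these invented numbers, type-B coverage comes out around 60 percent rather than the talk's 20 percent, but the qualitative point — conditional coverage below the claimed 70 percent for an inconvenient jar — survives.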
The Bayesians — or at least the vast majority of Bayesians — can be wrong, and wrong with high confidence, whereas the frequentist errors are only correlated with the observation. So the frequentists can repeat the experiment: yes, they'll sometimes be wrong, but they can repeat the experiment and fix it, whereas for the Bayesians, repeating the experiment doesn't help — the error is correlated with inconvenient values of the truth. The Bayesians really have this notion of an inconvenient truth, whereas the frequentists just have an inconvenient observation. [Audience question] Well, it depends: if you can bring all the information to one place, like a dictator, and the dictator makes the inference, then yes — let me show you.

These critiques aren't something I invented; critiques of the frequentist and Bayesian schools of thought have a long history in the literature. Here's a famous critique of the Bayesian — excuse me, of the frequentist — school of thought, a famous paper by my colleague: "Why Most Published Research Findings Are False." The argument is very simple. To get a paper published in the biomedical literature, you have to meet the frequentist standard of less than a 5 percent false-positive rate — 95 percent certainty, just like on that island. The problem is this: say only 1 percent of the hypotheses you investigate are actually true, but you have a 5 percent rate of falsely finding hypotheses to be true. Then of all the experiments you do, about 6 percent will be publishable, and of those, five parts will be wrong and one part will be right. You end up with five-sixths of the published literature being false positives. Just like on that island, where you end up sending 1 percent of the people to jail even though all of those people might be innocent. That's the critique of the frequentist way.
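The arithmetic behind that five-sixths figure is a one-liner. Here's a sketch with the talk's illustrative numbers, assuming (as the talk implicitly does) that every true effect is detected:

```python
# Back-of-the-envelope version of the "most published findings are false"
# argument, using the talk's illustrative numbers.
p_true = 0.01   # fraction of investigated hypotheses that are actually true
alpha = 0.05    # false-positive rate permitted by the 95% certainty standard
power = 1.0     # assume every true effect is detected (0% false negatives)

# Fraction of all experiments that clear the publication bar.
p_published = p_true * power + (1 - p_true) * alpha
# Among published findings, the share that are false positives.
false_share = (1 - p_true) * alpha / p_published

print(round(false_share, 3))  # about 0.832 -- roughly five-sixths
assert false_share > 0.8
```

Lowering the assumed power makes the published false share even worse, since fewer true findings dilute the false positives.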
But here's the corresponding critique of the Bayesian way. This is in a fancy-name journal, so you know it's true: the impossibility of Bayesian group decision-making with separate aggregation of beliefs. This is just what we were talking about with the robot problem: if every robot comes to its own conclusion based on its observation, you can have 80 percent of them come to the wrong conclusion, and yet have 70 percent confidence in their wrong conclusion. The only way to make consistent decisions is for them all to agree to pool their information with one guy, and that person does the inference and just informs everyone else what the truth is. You need sort of a dictator; you can't separately aggregate the beliefs and come up with the same solution. So, you know, if you want one Google that incorporates all the world's data and comes to all the conclusions, maybe that can work, but there's no way to do it in a decentralized way. That's sort of an uncomfortable result. Data analysis, it's hard, you know. Yes? Prior and posterior, and if they kind of divide out... then I think you need the observation. Well, I'm sure you can share a ratio or something, but I think your belief state alone is not enough. Yes, this is hard, though. That's sort of the theoretical dispute. In my view, the difference between the Bayesian and frequentist schools of thought is explained by those sort of cookie-jar examples, and there's sort of no free lunch. So now I'm going to shift gears to the second half of the talk and ask: do these Bayesian versus frequentist issues actually account for the statistical disagreements that we see in the real world? I think often you read an article in the New York Times or something, there's some study that seems improbable, like the one about ESP a few years ago, where a researcher found that you can, you know, manipulate things with your mind, and the critique in the New York Times was, oh well, that's frequentist reasoning, but if we did it the Bayesian way we wouldn't have this problem.
Has anyone ever read anyone say something like this, like, oh, we just need to have a more enlightened approach? Anyone ever seen that? Yeah. So my view is that this is actually very, very rare. Although this is a real issue of disagreement, this Bayesian versus frequentist, as far as I can tell the times when it causes a real dispute in the real world are very rare. So I'm going to take you through some cases where it might have been the case, and we'll look at whether it really is a Bayesian versus frequentist issue or something else. I'll give you two case studies; these are real-world case studies that both happened when I was a newspaper reporter. So, you know, diabetes is one of the major diseases in the world. It's dramatically increasing here in the developed world; something like a third of the US population has diabetes or prediabetes, which is just crazy. And the number one drug for diabetes was this drug Avandia, rosiglitazone. It was approved in 1999, it's sold by a major drug company, and there were three billion dollars of sales of this pill in 2006. And there's been increasing pressure on the pharmaceutical industry to release everything they know about these drugs; they used to just release sort of the studies that they wanted you to know about, and you can see why there are problems with that. So there's been a lot of pressure on them to just release everything they know, and GSK, to their credit, tried to do the right thing. In 2004 they said, okay, fine, if you want all these tiny little lame studies we've been doing, you can have them. So they published them, and this is what the results look like. They say, in this internal study we gave Avandia to 391 people and two of them got heart attacks, and we gave a control group of 207 people and one got a heart attack. Or in this study, 110 and 5, and 114 and 2; or 561 and 0, and 276 and 2. This is what the raw data look like.
And it's pretty messy, because, you know, the control group for example sometimes is no drug at all and sometimes it's metformin, which is a different diabetes drug, and the durations of these studies are different, and the definition of a heart attack is different. So it's pretty messy, but they published, you know, 42 of these small studies. I don't think they thought anything would come of it, because their data scientists, we didn't call them that then, their biostatisticians, had concluded that there was nothing more to be inferred from these data. But in 2007 a prominent cardiologist at the Cleveland Clinic, Steven Nissen, published a paper where they agglomerated all the studies together. They did something called a meta-analysis, where they tried to sort of average them all together, and they found bad news: the effect of Avandia on the risk of heart attack and death from cardiovascular causes. They said, we conducted searches of the published literature, you know, they found 42 trials, and they said in the Avandia group, compared to the control group, the odds ratio for heart attack was 1.43, so basically a 43 percent increase in heart attacks, and the 95 percent confidence interval was between 1.03 and 1.98. So what would 1.0 mean? No effect, yeah. So the lower end of their confidence interval is 1.03; they were flirting with no effect here, but they were able to exclude the possibility of no effect with this 95 percent confidence, though it seems pretty close, you know. So what do you think was the reaction to this, you know, from the real world? That is true. Well, let me show you what the data look like. So these are the actual data: they took each study individually and they came up with a confidence interval on it. We're looking at the relative risk, so if it's one, that means no difference; if it's below one, that means Avandia reduces heart attacks.
If it's above one, that means Avandia increases heart attacks. And what they found, so for each study there's this little tick mark, which is sort of the best, yes, the point estimate, and then there's this confidence interval, and you can see they're all over the place. But the meta-analytic confidence interval at the bottom here just barely excludes no effect. It's pretty wide, but it does exclude no effect. So these are the actual results, and the reaction was dramatic. This is the front page of the Wall Street Journal, above the fold, very top, May 22nd, 2007, by my colleague Anna Wilde Mathews. Medical detective: this guy had already helped bring down a different blockbuster drug, Vioxx. Critic's attack on diabetes pill; Glaxo shares plunge; Dr. Nissen sees risk to heart from Avandia. He gets his picture, the coveted picture. I never got a picture when I was working at the Wall Street Journal, but my friends got me one for my birthday, a treasured present; it takes like five hours to make one of these pictures. Okay, and you can see the sales going up and up and up. So guess what happened to Avandia after this study, the almost-no-effect study where 1.03 was the lower end of the confidence interval, on the front page of the Wall Street Journal? Guess what happened to this three billion dollar drug? These are the sales, and the FDA has now strongly restricted Avandia. Yes? Well, I don't know about most likely, but the center of this confidence interval is a 43 percent increase. I don't know, what if I told you a 43 percent increase is our best guess, but it really could be anywhere between a 3 percent increase and a 98 percent increase, then what would you do? Well, the frequentist method doesn't let us talk about the probability of the heart attack risk given the observation; that's the flip that you want to do, but we don't have that, right? We just have the interval. Well, it varies based on the population, and it varied among those different studies; there were healthy people and not healthy people. So they, it's a very good question.
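A pooled estimate like the one at the bottom of that forest plot can be sketched with a generic inverse-variance fixed-effect meta-analysis. The paper itself used a different fixed-effect method, and the counts below are rough numbers read off the talk, so this is illustrative, not a replication:

```python
import math

# Generic inverse-variance fixed-effect meta-analysis of odds ratios.
# Counts are rough numbers from the talk, with a standard 0.5 continuity
# correction for zero cells -- illustrative only, not the paper's method.
trials = [
    (2, 391, 1, 207),   # (events_drug, n_drug, events_control, n_control)
    (5, 110, 2, 114),
    (0, 561, 2, 276),
]

def pooled_odds_ratio(trials, z=1.96):
    num = den = 0.0
    for a, n1, c, n2 in trials:
        b, d = n1 - a, n2 - c
        if 0 in (a, b, c, d):                   # continuity correction
            a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
        log_or = math.log(a * d / (b * c))      # per-trial log odds ratio
        weight = 1 / (1/a + 1/b + 1/c + 1/d)    # inverse of its variance
        num += weight * log_or
        den += weight
    center, se = num / den, math.sqrt(1 / den)
    return math.exp(center), math.exp(center - z*se), math.exp(center + z*se)

or_hat, lo, hi = pooled_odds_ratio(trials)
```

With so few events per trial, the pooled confidence interval stays wide, which is exactly why the real analysis needed all 42 studies to get anywhere near significance.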
They modeled a multiplicative effect. I mean, if I were a doctor deciding whether to prescribe, or a patient deciding whether to take it... well, this is a good question. You know, there are a few answers. You could imagine that doctors might say, you know what, Avandia's benefits are so good that what we're going to do is not give this to people with a high pre-existing risk of heart attack, but we'll keep giving it to people with a low pre-existing risk of heart attack. They might do that, and this is the kind of conversation that the regulators and, you know, the medical professions have to have. It's hard to know, though; I mean, what is your risk of a heart attack in the next five years, do you know? Yeah. So this is the kind of conversation that they have, and they decide: is there some population where it's maybe safe to keep giving them Avandia, or can we not say that? And it also depends on what other drugs are available that might not have this increased risk but have comparable benefits. But that's the kind of conversation they have. And they do model this multiplicative risk: if you go into the study with a 1 percent chance of heart attack, this multiplies it by 1.43, and in a different study, if you go in with a 5 percent risk, it multiplies that 5 percent by 1.43. That's what an odds ratio or relative risk means. So, you know, I said, this study is pretty close to the line; I wonder if I'd get the same result with a Bayesian method. So I tried that with my colleague Josh Mandel. We looked at what would happen if you did a Bayesian estimate. Now this is the flip that you want to do: what is the actual probability, in this case a probability density function, of Avandia's relative risk, its multiplicative effect, given the observation? So now, because it's Bayesian, we actually can look at those probabilities, and this is what we got. So what does this mean for Avandia? Yeah, it lowers... I mean, the bulk of the probability mass is actually below one.
The frequentist one says above one, and the Bayesian one says below one. So did everyone get it wrong, and we should all be taking Avandia? Well, we also looked at one other thing. We said, what if we assume it's not a multiplicative risk, what if it's an additive or subtractive risk? Because, you know, it doesn't always make sense as a multiplier. Let's say there's a pill that had a tiny bomb in it that just exploded. You wouldn't ask, what is the risk multiplier of this pill, you know, I went in with a 0.1 percent chance of exploding and after taking the pill my risk was multiplied by an extra 43 percent. Right, so the multiplicative risk doesn't really make sense there; maybe it just adds to your risk. So if you do the Bayesian analysis that way, you get this. Now higher is worse, because it's a higher risk, and it's pretty close, but it does look like, to the extent it has an effect, it does add to and not subtract from your risk. So this was the additive-subtractive model, and this is the multiply-divide model. Yes? Well, very little. I mean, you know, we do this with a Monte Carlo. We say, let's say someone goes into the study with this risk; we have some prior distribution on the risk of heart attacks, and if it's multiplicative we say, okay, they come in with this risk drawn from a random distribution, and then we apply the drug's effect. So the drug could add or subtract, that's one kind of model, or it could multiply or divide, and we see where they end up, and then we keep the outcomes that are consistent with the observation. That's called rejection sampling, and yeah, this is Bayesian inference. So it's really just about how you do that Monte Carlo: do you add or subtract, or do you multiply or divide? Yes, this one? Yeah, well, you know, this is the raw data, a tiny bit. Well, I don't think so; this is one for 207, and if you just multiplied it by 2 you'd get 2 for 414.
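That Monte Carlo might look something like the sketch below. The priors, the single trial's counts, and the acceptance rule are all my own illustrative choices, not the authors' actual model:

```python
import random

# Rejection-sampling sketch of the multiplicative-risk model: draw a baseline
# risk and a risk multiplier from priors, simulate both arms of one small
# trial, and keep only the draws that reproduce the observed event counts.
def sample_risk_ratio(n_treat, k_treat, n_ctrl, k_ctrl, draws=10_000):
    kept = []
    for _ in range(draws):
        base = random.uniform(0.0, 0.05)     # prior on baseline heart-attack risk
        ratio = random.uniform(0.25, 4.0)    # prior on the drug's multiplier
        sim_ctrl = sum(random.random() < base for _ in range(n_ctrl))
        sim_treat = sum(random.random() < min(base * ratio, 1.0)
                        for _ in range(n_treat))
        if sim_ctrl == k_ctrl and sim_treat == k_treat:
            kept.append(ratio)               # this draw explains the data
    return kept

# One of the small trials from the talk: 2/391 on the drug, 1/207 on control.
posterior = sample_risk_ratio(391, 2, 207, 1)
```

For the additive-subtractive model you would replace `base * ratio` with `base + delta` for some prior on `delta`; the rest of the machinery is unchanged.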
Yeah, well, I'm not sure, let's be honest. This is the point: there's a deep subtlety here in how you model the effect of these things, and the difference between multiplicative and additive makes a difference. It's not necessarily about Bayes versus frequentist; all of our assumptions go into the data analysis, and there are many ways to sort of modify the conclusion. Yeah, I tried that too; it's confusing. You can sort of keep twiddling the data and get it to come out many different ways. Yes? Uniform, uniform over some range. All right, so that's one case study. Let's look at a different case study, one that I experienced pretty directly as a reporter. I used to write about cardiology, and there's a company in Boston, Boston Scientific, a major medical device company, and they were at the time the leading maker of coronary stents. These are, you know, if the arteries that feed blood to the heart muscle get clogged up with these plaques, you start to feel chest pain when you're climbing the stairs, and so they can go in through your wrist or through your groin, and they can thread all the way up your vasculature inside the artery itself, and they can inflate a balloon to sort of prop the artery open again and get blood flowing, and then they can leave a little wire mesh behind, a scaffold, to hold the artery open. It's pretty amazing. And so we as a country spend something like 20 billion dollars a year on this procedure and these devices, and Boston Scientific was at the time the leading maker of these coronary stents. And they were trying to get a new stent on the market: they had the market leader, and they were trying to get a newer one approved, and the government would require them to show with 95 percent confidence that it met certain statistical conditions, namely that it's what they call non-inferior to the old stent. They had to demonstrate with 95 percent confidence that the rate of a certain bad thing was not increased by more than three percentage points.
What we're talking about is the risk that after you have this procedure, it somehow doesn't stick and you have to go back to the hospital within nine months: target vessel revascularization. We're trying to make sure that that isn't bad. So they had to show it was within three percentage points of the old stent. So if we knew the old stent was seven percent, that seven percent of people have to go back to the hospital, then if the new stent is ten and a half percent, that's bad; if the new stent is nine and a half percent, that's okay and they can put it on the market. You know, the company thinks, of course, that they're exactly the same, but there's no way to show statistically that they're exactly the same, so you just say with 95 percent confidence it's no worse than three percentage points worse than the old thing. That makes sense? That's called non-inferiority. And of course we don't know the rate in the old group, and we don't know the rate in the new group either, so they're both sort of uncertain estimates. So instead of cookie jars, the truth is the difference in the rate of this target vessel revascularization within nine months, and the study takes, you know, about a thousand patients in each group, with the new stent and the old stent. The old stent was called Express, the new one was called Liberté; these are multi-billion-dollar devices. And they decide if each person has, you know, that event or not, and then they run this confidence interval, and then they decide if the 95 percent confidence interval excludes the possibility of being three percentage points worse or more. Does that make sense? Okay. So they put out a press release in 2006 from the company, dated May 16th, Natick, Massachusetts and Paris. Now if it were me, I would have put Paris first, but it's not their choice. Boston Scientific Corporation today announces nine-month data from the trial: the trial met its primary endpoint. That means that they won.
The 95 percent confidence interval excluded the possibility of inferiority, and, you know, the stock goes up, everything's happy, because that's the condition the government had required them to show. So a year later they actually published the paper, the scientific paper, in the Journal of the American College of Cardiology, and here's what they wrote. They said the primary endpoint was met: with the one-sided 95 percent confidence bound, it was at most 2.98 percentage points worse than the old stent, and that's less than the pre-specified margin of three percentage points worse. So they had to make sure it was under three, and they got it, because they got 2.98. I'm not kidding, this is how the American healthcare system works. And they have a p-value of 0.0487; that's kind of like the inverse of the confidence interval. If you have 95 percent confidence, that means you have p less than five percent, and they had p of 0.0487, just less than five percent. And they talk about their statistical methods; they said, you know, chi-square or Fisher's exact test was used to compare proportions. They had everything there. So I had seen a debate about which stent was best when I was in Chicago for their big cardiology conference, it was very interesting, and I wondered if I could replicate this with a Bayesian analysis. Let's see what happens. You know, I'm sort of like a recovering Bayesian, so I tried to do the analysis. Here's the problem. These are the actual results from the study: you know, they had in the control group this many people, and 67 had the bad thing happen, and the treatment group had fewer people, and one more person had the bad thing happen. So these are the results. So let's say you just take a uniform prior on the rate of this event, could be 0 percent, could be 100 percent, in the control group, same thing on the treatment group, and you plug into Bayes' theorem, and you say, what is the probability now of inferiority? And if you do that, you get 5.1 percent.
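With a uniform prior, the posterior for each arm's event rate is a Beta distribution, so that calculation can be sketched in a few lines. The patient counts here are placeholders in the spirit of the talk (about a thousand per arm, 67 versus 68 events), not the trial's actual numbers, so the result only roughly echoes the 5.1 percent figure:

```python
import random

# Uniform Beta(1,1) priors: after k events in n patients, the posterior on
# the event rate is Beta(k + 1, n - k + 1). Draw both rates and count how
# often the new stent is worse by more than the 3-percentage-point margin.
def prob_inferior(k_ctrl, n_ctrl, k_treat, n_treat, margin=0.03, draws=100_000):
    worse = 0
    for _ in range(draws):
        p_ctrl = random.betavariate(k_ctrl + 1, n_ctrl - k_ctrl + 1)
        p_treat = random.betavariate(k_treat + 1, n_treat - k_treat + 1)
        if p_treat - p_ctrl > margin:
            worse += 1
    return worse / draws

# Placeholder counts: 67/1000 in the control arm, 68/900 in the treatment arm.
p = prob_inferior(67, 1000, 68, 900)
```

Because it is Bayesian, `p` really is a conditional probability of inferiority given the data, which is the flip the frequentist interval cannot give you.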
Well, with the Bayesian method there is a slightly greater than five percent, you know, conditional probability of the thing you know is bad. So, alarming. But does that mean that there's a disagreement between the frequentist and the Bayesian approach? You know, I'm always on the hunt for these disagreements. Well, no, it doesn't mean that, because how did they calculate this p less than 0.05, this 95 percent confidence interval? They said that they used this Fisher exact test in the paper, but it turns out the way these papers are written is, like, Dr. Turco, actually I've met Dr. Turco, he did not have anything to do with writing the paper. These medical studies, they find a famous cardiologist kind of at the end; I mean, the company writes the paper. The company has a lot of expert writers and statisticians; they designed the study, they ran the study, they wrote the paper, they did the analysis, and then Dr. Turco kind of comes at the end, or at least his name does. But I met him, he's a nice guy. So it turns out they did not use Fisher's exact test. They made a mistake: they actually used a different method, a much less precise method, called a Wald interval, which is not an exact method, it's one of these approximate methods. Because, you know, throughout the 20th century we didn't have computers that could do millions of computations, we used these approximations; this is sort of how statistics grew up. So they approximate these distributions, the chance of a patient having a certain event, with these Gaussian distributions, and they actually do a pretty lousy job of approximating. The Wald interval is much criticized in the sort of statistical theory literature, but that criticism had not yet reached the biostatistics practitioners. So what they do is this integral of this normal distribution, and in fact they get this 0.0487, so I was eventually able to replicate the exact number that they had.
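The one-sided Wald calculation he's describing, a normal approximation to the difference of two proportions against a 3-point margin, can be sketched like this. The counts are the same placeholders as before, not the trial's real numbers, so the p-value comes out near, but not exactly at, 0.0487:

```python
import math

# One-sided Wald test of non-inferiority. Null hypothesis: the new stent's
# event rate exceeds the old one's by at least the margin (3 points).
# A p-value below 0.05 means non-inferiority is declared.
def wald_noninferiority_p(k_ctrl, n_ctrl, k_treat, n_treat, margin=0.03):
    p1, p2 = k_ctrl / n_ctrl, k_treat / n_treat
    se = math.sqrt(p1 * (1 - p1) / n_ctrl + p2 * (1 - p2) / n_treat)
    z = (p2 - p1 - margin) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

p = wald_noninferiority_p(67, 1000, 68, 900)
```

Plugging the observed rates straight into the standard-error formula is exactly the shortcut that makes the Wald method anticonservative near the boundary.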
So here's actually the result of the trial. You can see every possible result that could have happened: in the control group they could have had 65, 66, 67, 68 bad events; in the treatment group they could have had 66, 67, 68. All the black outcomes are ones where they would have failed the trial by their own method, and the outcome they actually got was this one right here, a p-value of, you know, 4.9 percent. If they had one more event in the new stent, definitely flunked. If they had one fewer event in the control group, definitely flunked. If they had one more event in both groups, they would have flunked. But they somehow were right on the line here. I don't have any evidence that they cheated, but these are the results. Yes? I would say the company proposes the three percent goal and the FDA approves or doesn't approve it, at the beginning, before the trial is run. Yeah, they are supposed to send the FDA their complete source code; the FDA is very vague about this, but they're supposed to include their complete source code at the time. I was not able to get that, and the company wouldn't share it with me, but they promised me that the method they used was exactly the one they had committed to. They weren't willing to prove that to me, but they said it. So it was very close to the line. And so, you know, we can look at this coverage here, or we can look at the false positive rate, which is just sort of one minus the coverage. So it turns out, if you look at this, you know, what is a valid 70 percent confidence interval? It's one where the false positive rate is always less than or equal to 30 percent. Does that make sense? The coverage has got to be greater than 70 percent, which means the false positive rate has got to be less than 30 percent. So for a 95 percent confidence interval, the coverage has got to be greater than 95 percent, which means the false positive rate has to be less than 5 percent. Okay, so here is the actual false positive rate of the Wald interval in their regime: here's five percent, and here are all the possible outcomes of the trial.
This line is supposed to be below 5 percent, this is just the same as looking at the coverage, but the line is always above 5 percent. So this means that their test is actually too easy: it's not a 95 percent confidence interval, it's more like a 94.8 percent confidence interval. So they set a test for themselves that was too easy, and then they passed it. So I called the journal, the Journal of the American College of Cardiology, and I said, I think there may be a problem with this study, and I'm a reporter. And they said, we'll have our statistician try to replicate the result. And then they said, we're having trouble, because they say they used Fisher's exact test, but that doesn't seem to work, because they didn't, so we'll try some other test, the most sensible one. And this is the one that the journal's statistician tried; this is also an approximate method, so you can see it's not always below the line, but it's much closer, it's a more common method. But with this method, they fail. So the journal said, oh, we have to call the company and find out what method they actually used. So it turns out, with the Wald interval, which is the one the company used, they pass; with the one the journal assumed they used, they fail; and actually, with every other method you can conceivably imagine that has ever been in the literature, they fail. These are all the different methods for this particular scenario; in statistics there are many different ways to do things, but with all of them except one, they fail, and the one that they did use, I was able to demonstrate, just doesn't work, because that line is always above 5 percent. So I told the company about this. You know, I wrote them up a little report in LaTeX, which Andy here taught me, and I now use those powers for evil, and I sent it to the company. And they were very polite; they invited me out, and I had a whole conversation.
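The false-positive curves he's comparing can be computed exactly, with no Monte Carlo, by enumerating every pair of outcome counts under a null where the new device is worse by exactly the margin. This sketch plugs in my own Wald-style decision rule and a small hypothetical trial size so the double loop stays cheap; the real trial had about a thousand patients per arm:

```python
import math

def binom_pmf(n, k, p):
    # Exact binomial probability of k events in n patients.
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def declares_noninferior(k_ctrl, n_ctrl, k_treat, n_treat, margin):
    # Wald-style rule: one-sided normal-approximation p-value below 0.05.
    p1, p2 = k_ctrl / n_ctrl, k_treat / n_treat
    se = math.sqrt(p1*(1 - p1)/n_ctrl + p2*(1 - p2)/n_treat)
    if se == 0:                  # degenerate corners (0 or n events in both arms)
        return p2 - p1 < margin
    z = (p2 - p1 - margin) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2))) < 0.05

def false_positive_rate(n_ctrl, n_treat, p_ctrl, margin=0.03):
    # The null is exactly true: the new device really is `margin` worse.
    p_treat = p_ctrl + margin
    return sum(binom_pmf(n_ctrl, kc, p_ctrl) * binom_pmf(n_treat, kt, p_treat)
               for kc in range(n_ctrl + 1)
               for kt in range(n_treat + 1)
               if declares_noninferior(kc, n_ctrl, kt, n_treat, margin))

rate = false_positive_rate(100, 100, 0.07)  # a valid 95% test needs rate <= 0.05
```

Sweeping `p_ctrl` over a range of baseline rates traces out the curve that is supposed to stay below the 5 percent line.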
They said, you know, we think the problem is, we initially replicated your results in SAS, but then we hired an outside expert to try to confirm, and that expert didn't confirm, and we think the difference is, you know, you were using double-precision floating point and internally we were using some higher-precision floating point, so we think the problem is your floating point. I said, this is the only time in my whole career as a reporter that we discussed the precision of floating point. They had done a Monte Carlo; I did the actual calculation, so I said, I think maybe you just didn't run the Monte Carlo long enough. They said, well, we think it's the floating point. So I went back to my office and I got Mathematica, and I did the calculation with exact rational numbers, no floating point, just rational numbers. And we're talking numbers like, you know, 1000 choose 600; this is a very big number, with all the digits, like, I filled up the swap space, but I was able to get it. And it turns out double-precision floating point was good to about eleven decimal places; it's very good floating point. So I emailed them back my Mathematica notebook, and then they asked if they could appeal my decision to the editors. Normally the Wall Street Journal would not agree to do that; they would only hear an appeal if the allegation is that the reporter took a bribe or was somehow partial. If it's just a disagreement on judgment, they don't like to do it. But we thought it would be prudent. Well, they came into our office, and they brought the chief medical officer and the chief biostatistician and the PR person, who had been the chief PR person for Ted Kennedy, so he was experienced dealing with various crisis issues. And then, you know, they made their case. And anyway, we printed the article. This figure actually appeared in the Wall Street Journal, the actual newspaper: they printed Miettinen and Nurminen, and the score test.
I was very proud that this incredibly obscure stuff was actually printed in the Wall Street Journal, source: WSJ research. This is probably the nerdiest article I was able to write while I was there. Here's the actual story on the front page of the business section: Boston Scientific stent study flawed; heart stent manufactured by Boston Scientific won approval backed by flawed research. So I was proud of the story, and the result is: that stent was approved by the government, but the competitors to Boston Scientific all used my article in their literature, and now people don't use the Wald interval anymore. This is not a case of Bayesian versus frequentist, but this is how statistics is in the real world, in my experience. I'm going to skip this sort of technical section here, but the point is, the Bayesian and frequentist schools have much in common. The difference was this question between the islands that we talked about, between the cookie jars; there's a lot that unites them, and they have this sort of one difference in method, about whether you make an assumption or not, and, you know, which kind of probability you care about: the chips given the jar, or the jar given the chips. That, to me, is the central issue. This battle has been going on a long time, but most of the time you see a statistical disagreement, it's not about Bayesian versus frequentist, it's about something else. In my experience, in the real world it's almost never about Bayesian versus frequentist; much more common is this kind of issue, and it happens all the time. Here's the New York Times in 2008, front page: aches, a sneeze, a Google search; data on web may warn of outbreaks of flu. There is a new common symptom of the flu, in addition to the usual aches, coughs, and sore throats: everyone's searching for flu symptoms in Google. So Google had announced, and they had a scientific study in the journal Nature, that they can predict the flu based on searches.
And they had training data with 90 percent median correlation, and then they had a held-out verification set where they got 97 percent median correlation. Now, who does better on the held-out testing set than on the training set? That's amazing. And then they launched it in 2008, and immediately, in real life, it had 29 percent median correlation. And they said, well, that was the swine flu, that was unusual. They fixed it, they relaunched; then it broke again, and again, and now they've given up on it. So this was, like, it didn't work, and that to me is very interesting: why did we think it would work? It's not about Bayesian versus frequentist, but it is about the fragility of statistics. Here's another study, this was in 2014, a serious study, and to explain why this study is statistically BS is quite difficult; it's very subtle. I mean, if you look at the data, it really depends how you look at it; it's kind of a certain risk-difference versus risk-ratio issue, and it's not easy to explain these things. These are all around us. Here's one that was just published in 2016: Microsoft finds cancer clues in search queries. Microsoft scientists have demonstrated that by analyzing large samples of search engine queries they can identify internet users who are suffering from pancreatic cancer. But it's not that. I have dived deep into this study, I have corresponded with the authors, and these claims are greatly exaggerated. You know, these companies are under a lot of pressure to show exciting results, Google Flu Trends, Microsoft, they're both under pressure to show exciting results, and this causes us to get a lot of results in the literature that are just greatly exaggerated, including this one. And these are companies that don't even live or die based on the statistics; you know, no one at Microsoft is going to get fired based on whether they can or can't predict pancreatic cancer.
But the drug company really does live or die based on whether p is less than 0.05, so it's hard to imagine the things that they won't be willing to do in that sphere. It's sensitive. These problems of statistics and probability have been with us, and they've always involved the most important problems that we have, sort of as humanity. Now it's about drugs and pancreatic cancer, those are important subjects, but 100 years ago, or 300 years ago, it wasn't about drugs, it was about God. Those were the important subjects that learned people talked about all the time. So this is the dissertation of John Maynard Keynes, the famous economist; this was his doctoral dissertation, published in 1921, and he's talking about the ancient problems of probability and religion. This is one of the most ancient problems of probability: it's concerned with the gradual decrease in the probability of a past event as the length of the tradition by which it's established increases. Perhaps the most famous solution is that propounded by Craig in his Theologiae Christianae Principia Mathematica, so, like, the Christian principles of mathematics, published in 1699. Craig concluded that faith in the gospel, so far as it depended on oral tradition, expired around the year 880. What are they talking about here? If I tell you about the Gospels, and you tell someone about the Gospels, and they tell someone about the Gospels, even if each person has 99.9 percent credibility that what the person told them is true, it's still a geometric series; it's going to go to zero. So, you know, the theologians were very concerned about this, and this is what Reverend Bayes probably thought about when he came up with Bayes' theorem. So far as the gospel depended on written tradition, it would expire in the year 3150. Some other guy, by adopting a different law of decrease, concluded that faith would expire in 1789. That's a pretty good guess, you know, the French Revolution, the American Constitution; that's not a bad guess.
But why was this so important, the logic of probability and posterior probability? Why is it important? Well, if we look at the footnote, here's why. In A Budget of Paradoxes, De Morgan, the famous British mathematician, quotes a Cambridge Orientalist to the effect that Muslim writers, in reply to the argument that the Quran has not the evidence derived from Christian miracles, contend that, as the evidence of the Christian miracles is daily growing weaker, 0.99 times 0.99 times 0.99, a time must at last arrive when it will fail of affording assurance that there were miracles at all, whence the necessity of another prophet and other miracles. So, you know, eventually the evidence for the miracles is going to be 0.99 to the n, and the Muslim writers knew it. That was the concern of the, you know, 17th-century theologians; now it's about pancreatic cancer, but we still have this issue where probability is essential to the questions most important to us as a species. Anyway, these Bayesian and frequentist schools of thought, they're not tribes you have to belong to, and I think that tribalism just sort of clouds our thinking about the necessary ambiguity of data. Reasoning under uncertainty means compromise, and there's no free lunch. Thank you. You can bring this up with your rabbi. Yes? Yes, for meeting its requirement. And if we are aware, as a community, that certain methods are not terribly reliable, or whatever the case may be, why don't we have standards set up, like, in what cases would you consider this test or that one acceptable? Yeah, oh, sure. The question is, you know, if there are so many tests, shouldn't there be standards about when we pick one or the other? How do you even decide when to pick one or the other, and shouldn't the government just allow some but not others? I think there are a few answers to that. One is that, as you recall, when we built these confidence intervals, we had discretion, we had choice; we didn't have to circle the biggest number.
numbers; we had options about what to circle. That wasn't true when we made the Bayesian credible intervals: those were just determined. Well, actually, the posterior probability that the Bayesians have is determined by the prior, your model, and the experiment; those together make the posterior. But then, in deciding which interval to report, there's still discretion. Do you want the most central interval? Do you want the one that's shortest? There's actually still some wiggle room in exactly how you define the Bayesian credible interval; there's not just one right answer.

And the government's position, so, there are a few things. Number one, the government's position has historically been: as long as you set out the test before you do the experiment, and as long as the test makes sure that, if the drug doesn't work, there's less than a five percent chance it'll be falsely approved, then you can pick any test you want. Because that's the thing they care about: if the drug doesn't work, if the null hypothesis is true, there has to be less than a five percent chance that you mistakenly approve the drug. So you can do any test that meets that criterion, and, formally speaking, that criterion is that this line has to stay below the dotted line.

The other thing I'll say is that there are fancier tests now that use computers, that are not from the 19th century. So here's a very fancy test in StatXact that actually meets this criterion. This is what happens if you use the computer: it never goes above the line, but on the other hand it goes way below the line. And statistics is not just about calculation, it's about persuasion; it's a language. So the biostatisticians who work for the company are trying to persuade the statisticians who work for the government, and the doctors who write the medical guidelines; they're trying to persuade them to prescribe a certain
thing. So if they pick some exotic test that no one's ever heard of, yes, it might be mathematically better in some sense, but it might be less persuasive, because people are less familiar with it. That's probably why they chose this Wald interval: because it's the one everyone had been using in the past. And why not worry about some persnickety reporter? Well, first of all, they don't know they're going to get very unlucky and be right on the line, and that some persnickety reporter is going to rerun the calculations and find the slight divergence, that it doesn't quite work mathematically. Yes?

Well, the question is: maybe you want to weight different outcomes differently. You might weight the possibility of an increase in heart attacks one way, and you might weight the benefit from a decrease in diabetes a different way. Yes, I agree with you, but historically the FDA has not done cost-benefit analysis; that's for doctors to do. The FDA does not regulate the practice of medicine; they regulate the marketing of pharmaceuticals, and they make sure that, for the purpose they're marketed for, the indication, they have to be safe and they have to be effective. That's the legal mandate of the government. But the government is not really supposed to do the cost-benefit calculation; that's for trained medical professionals to do.

Well, they sort of avoid doing this utilitarian calculation, but of course they do it; they do it kind of under the table. So, for example, I think when people wanted to get approval for automated defibrillators: these automated defibrillators do not meet the reliability standards of conventional defibrillators, and so you can imagine a government agency saying, well, we've defined what it means for a defibrillator to be effective, and your plastic defibrillator that costs 300 bucks is not effective, or it's not as reliable as we've required of a real defibrillator, so it can't be licensed as a defibrillator. And I think the
manufacturers were able to persuade the FDA: look, there's a benefit to having these things be cheap, because we can put them everywhere, and what really matters is proximity to the defibrillator, because with sudden cardiac arrest you want one that's close. Even if it's only 99.9 percent reliable, if it's close, that's better than having to wait for the ambulance to get to you in Times Square with one that's 99.999 percent reliable. So there is a trade-off here, and the FDA should lower its reliability standards for these automated defibrillators, and the FDA did that. So that is making the kind of utilitarian decision you're talking about, but within this sort of regulatory framework of the law.

[Audience question, partly inaudible] Do any medical companies wrestle with these kinds of issues? Yes... they might be wrestling with how you slice the data, how you find correlations.

Thanks, Keith. All right, thank you very much.
Info
Channel: Keith Winstein
Views: 1,769
Rating: 5 out of 5
Keywords: statistics, bayesian, frequentist, p-value, confidence interval, hypothesis test
Id: ID7J-LFSp3c
Length: 67min 11sec (4031 seconds)
Published: Sat Oct 22 2016