Everything wrong with statistics (and how to fix it)

Captions
Well, thank you all very much for showing up this morning. I think we're in for a very exciting talk, and it's my pleasure and privilege this morning to introduce a colleague of mine, Dr. Kristin Lennox. She's presently the director of the statistical consulting service here at Livermore; she will have been here five years next month. She came to us after completing her graduate work at Texas A&M, and in the time she's been here she's touched almost every part of the Laboratory. Notably, about two and a half years ago she reestablished a statistical consulting service here, which is something the Laboratory hadn't had in about 30 years. Over the course of the last two and a half years she's been engaged in about 50 different projects with researchers from around the Laboratory, and she's managed a team of eight to ten statisticians and data scientists involved in these engagements. So the talk you'll hear today pulls on a lot of things she's observed in the course of that process. Let me go ahead and turn it over. Kristin?

Thank you so much, Paul, for that introduction, and thank you all for coming out here today. I have a confession to make: I'm not actually going to cover everything wrong with statistics this morning. It's a 90-minute talk, and we could only get the room for an hour. What I'm going to talk about today is, as Paul said, based on my experience doing statistical consulting. It's what I like to call statistics in the wild: it's basically what happens to statistics when there are no statisticians in the room, and if you've ever practiced statistics, you know there's typically not a statistician in the room.

You might well ask how I know what's going on when I'm not there. Well, I've been doing statistical consulting off and on for about ten years now, and something that happens a lot to statistical consultants is that some subject matter expert, a scientist or an engineer or what have you, will go out and collect some data, sometimes from experiments, sometimes not. They will then do statistics to it. They will then bring it to me and ask me what I think of what they did. This is a suboptimal way of utilizing a statistical consultant, because quite often I think they ought to have talked to me sooner. However, since coming to the Laboratory in particular, there has been a regular stream of people who show me the statistics they did, and I say, "Great job, that's what I would have done." In some cases, it's better than what I would have done.

So the question I've been thinking about for the last few years, and the point I'm going to try to make in this talk, is that there are certain habits of mind, if you will, that allow people who perhaps don't have two to five years of postgraduate studies in statistics to nonetheless be extremely successful in applying statistical analysis to their own work. This isn't a talk that's trying to scare people off of statistics. I'm trying to make every single person in this room, and beyond, the best statistician they can be, and I believe that's actually going to be quite good. I will get to how that's going to happen on slide 7, but in the meantime: in the abstract I did promise you a crisis, and we do in fact have one.

I believe this is a two-part problem. Part of the issue is a PR situation we're suffering from right now, where there's this hot new field called data science. It encompasses statistics, predictive analytics, machine learning, a lot of these buzzword professions.
Data scientists don't tend to speak highly of statisticians. We're viewed as the troglodytes of the field: we're obsessed with limit theorems for some reason, and we only use boring methods like linear discriminant analysis and linear regression (lots of linear stuff, which is old), not the cool new methods like random forests or kernel density estimation, which were both invented by statisticians, by the way. So the situation isn't that statistics is being left behind, or that we don't have exciting new methods being used for exciting new problems. It's that when people hear "statistics," they only think of the boring old stuff. That's something of a personal problem that comes up among statisticians when we're invited to dinner parties.

There's a much more serious problem going on in science at large, which was very well summarized in a 2005 paper by John Ioannidis. He's an MD-PhD, and he was trying to address the problem of non-reproducible results, in particular in biomedical science, which is his field, though it's a lot broader than that. We are seeing a large number of papers (his argument is a majority of papers, even ones published in highly reputable journals by highly reputable groups) where, when someone comes back and tries to replicate the results, they can't do it, because the results were spurious. They were false positives. This paper was very well received in the technical community at the time, and it got a lot more press in 2010 when an article came out in The Atlantic called "Lies, Damned Lies, and Medical Science." As you might guess from that title, a lot of the blame for this problem is laid upon statistics.

At a very high level, there's a method called null hypothesis testing, and its harbinger, the p-value, which you might have heard of before. The caricature of how p-values work is that if you have a p-value greater than 0.05, your paper can't be published, and if you have a p-value less than 0.05, congratulations, you get tenure. So we're in a situation where true negative results, these high p-values, are never published, and because there is a known false positive rate when you're using these methods, a lot of the positive results that are being published turn out not to be real. This is a very serious problem, and it has prompted responses both serious and somewhat less serious.

There's actually a journal called Basic and Applied Social Psychology that has banned statistics. Technically it's only banned frequentist statistics, but it went a lot farther than saying you can't use p-values in null hypothesis testing. They said you're not supposed to use power calculations, you're not supposed to use confidence intervals, you're not supposed to use any of these tools, which have not, by the way, been implicated in the larger problem. But it's still okay to use Bayesian stuff, because we haven't read anything bad about it in The Atlantic so far. The mailing list I subscribe to, where I first heard about this, summed it up very nicely: they've admitted they've been using a crutch, and now they've put on a blindfold.

There have also been more constructive, but still in my view suboptimal, responses to this problem. The journals Science and Nature have both implemented a statistical review process: when a paper comes in with a substantial statistical analytical component, they've partnered with the American Statistical Association to ensure that there will be a professional statistician on the review team for that paper.
This is a good way of making sure that really egregious statistics doesn't make its way into very fine journals, but it's not the real solution to the problem. As I mentioned in my preamble, when you come to a statistician after you've done the experiment, well, I think R. A. Fisher put it best when he said it's a little bit like performing an autopsy: we can perhaps tell you what the experiment died of. So this is hopefully going to lead to a reduction in poorly practiced statistics, but it's also going to lead to an increase in scientists who really don't like statisticians very much.

You can imagine this is a serious problem, and statisticians and scientists in general need to look at how we can address it. I think it's very important to start with how we got to this point. The beginning of the story is actually very happy: the statisticians won. Over the course of the 20th century, people became convinced that the right way to learn about the natural world, the right way to make engineering decisions or business decisions or all kinds of decisions, is to use data, and the right way to use data is to use statistics. So all of a sudden we've created this enormous demand for statisticians and other people trained in these methods to analyze all the data that's coming in. And if you look at what's been going on in the last decade or two, you know the amount of data available is increasing at an exponential rate. So everyone's looking for statisticians, and no one can find them anywhere.

There are a variety of reasons why this happened. One is that there's a comparatively high bar to becoming a statistician. Up until recently, undergraduate degrees in statistics were very uncommon. It was thought that you should first get your undergraduate degree in something else, preferably mathematics, and only then could you be introduced to the higher mysteries for your master's degree and, later, your PhD. So we're already setting a high bar, because you have to be a mathematician before you can even start, and we didn't have that many mathematicians to begin with.

Secondly, we have an issue with incentives, particularly in the academic statistics community, where the way you acquire glory and a highly regarded statistical name is by developing amazing new statistics and publishing in the Journal of the American Statistical Association or the Annals of Applied Probability. This is one of the few fields where a Nature paper isn't going to impress anybody, because that just means your collaborators are good. You don't get a whole lot of points for doing really good statistics on really important problems if those statistics were invented in the 1950s, and unfortunately they invented a lot of good things in the 1950s. So if you happen to have a serious statistical question, but you don't happen to have the latest, greatest, newest data that's going to get somebody a JASA paper, and you go looking for a statistician, that's going to be a very frustrating exercise. It's not quite that bad, but it's not a productive enterprise in many cases. So the non-statisticians are intensely frustrated because they can't find anybody to help them, and the statisticians are intensely frustrated because they just want to go back to their whiteboards, but people keep pounding on their doors asking them to do boring stuff that Fisher invented.

The solution people came up with was: okay, we're going to train non-statisticians to do just enough statistics to solve their problems, to get them to leave us alone.
Unfortunately, that statistics training, which is remarkably standard across the country, is not working. I'm going to refer to the standard first, and frequently only, statistics course that people take as Stat 101. Please raise your hand if you took Stat 101; it would have involved tests of a single proportion, maybe two proportions, p-values, confidence intervals, that kind of thing. Okay, that's the majority of people in the room. How many of you made it all the way to Stat 102, which would have involved experimental design? You might have heard the words "randomized block." I see one person in the back; my group leader is now raising his hand, but still many fewer people. How many people made it farther than that? Okay, not very many at all.

The point is, most people receive this standardized introductory training and then nothing else, and I think the problems we're having in statistics, both the PR problems, where people fundamentally misunderstand what we do, and the larger scientific-culture problems, where statistics seems to be getting a lot of journals and a lot of researchers in trouble, are caused by the way we teach that one class. I used to teach Stat 101. I am part of the problem, so I'm now trying to become part of the solution.

Stat 101 can be taught in a variety of ways, some of which are quite good. If you actually read the textbooks, or you have a really good professor, they will touch on how statistics really works. But if you're an undergraduate and you want to pass the class, you may or may not pay attention to that. The way you pass tests in Stat 101, certainly the way my students were passing tests, was very procedural. You could basically make yourself a decision tree or a checklist. You'd read through the test problem and check your data type: okay, I have two numeric variables; I only know one thing you can do with two numeric variables; this is a linear regression problem. Second, you read through and figure out the inference method being asked for. There are a few more steps; hopefully somewhere in that dot-dot-dot there's some assumption checking. And then the final step is you play statistical Mad Libs, and you say something like, "Because the p-value is 0.06, we fail to reject the null hypothesis at the 0.05 level, and we do not conclude that blah blah blah." It's really boring grading these tests, by the way, as well as taking them, I'm sure.

The problem with this approach to statistics, which we reinforce by grading you on your ability to do it, is that real statistics doesn't work that way at all. Thankfully, my job is not to retreat to my office, look up an answer in the Handbook of Statistics, volumes 1 through 57, and work my way through an agonizingly long checklist. My job is so much more interesting than that, because what statistics really is, is mathematical modeling under uncertainty. On the one hand, that's a lot simpler than dealing with the Handbook of Statistics, volumes 1 through 57. On the other hand, you can't ever turn your brain off when you're doing statistics. You have to be thinking about your problem all the time. But as long as you know how to approach the problem and how to think about it, you can avoid the pitfalls we're seeing throughout the scientific and statistical literature today.
I'd like to raise another point about Stat 101. We have all these high-profile articles about problems with statistical applications in science, the banning of p-values, and all this stuff, and absolutely nothing stated in any of them is news to statisticians. The problems with p-values have been known pretty much since they were invented, and they've regularly appeared in the statistical literature ever since. Somehow we are failing to communicate that to the vast majority of scientists who are trying to use them in their day-to-day research.

So how do we fix this? Well, I have made a plan. It has three steps, because plans with more than three steps are never fully executed. This talk is intended to address steps one and two. First, I want to show you why you shouldn't be doing cookbook statistics, using real examples from the Lab and the wider world: why does this automated, checklist-cookbook approach not work? Where does it break down? Second, I want to show you a better way to approach these problems, which I'm calling statistical thinking. This is what statisticians really do. And finally, in situations where you don't feel you can quite handle it on your own, or you want a wider view of the statistics literature, or you're just not quite sure how something works, we do have statistical help available, and I will give you contact information for that at the end, along with some ideas of the kinds of things we can help you do.

Back to number two: what is statistical thinking? I think you can understand it best by going through the golden rules of statistics, the first of which is: know thy problem. I don't think I mentioned this before: this is an applied statistics talk, so I'm not going to help you improve your limit theorems. I assume you have some problem in the real world that you're trying to solve. You're trying to understand some natural phenomenon, make a decision, assess risk, but there's always some goal in the real, physical world that you are trying to achieve. You have to keep that goal in mind at all times, preferably at the front of your mind, at worst at the back of your mind, but never take your eye off the ball, so to speak.

Second: know thy tools. This goes a lot farther than the Stat 101 checklist approach, where you figure out which method you're going to use and then check a bunch of assumptions. I have really bad news for you about statistical assumptions: they are very nearly always violated. So it's not enough to know how your tools work; you have to know how they break. You have to figure out when you can still use them even when the assumptions that underlie them are violated. Fortunately, we're the kinds of people here who use models in other contexts, so we know assumptions are always violated, and you can figure out how to use the models anyway.

Finally: know thy data. This one is for all my friends in data science who have argued that the problem with statistical methods is that they're all parametric, and parametric models of course rely on parametric assumptions, but if you can just get enough data, you can let the data speak for itself, and the data won't lie to you. Unless you have the wrong data, which is a problem that comes up a lot more often than people think it will, and I'll walk through a number of high-profile examples where that's been the case.

But back to number one: know thy problem.
I want you to think back, long ago, to your undergraduate days when you were taking Stat 101. How did you determine the right kind of analysis to do when you were taking an exam or doing the homework? Easy: you looked at the data. As I said before, you have two numeric variables, you only had time to learn one thing to do with them, so you're doing linear regression. That's not quite how it works in practice. In the real world, the right data to collect and the right methods to use depend on the goal you're trying to accomplish. You have to watch out for data myopia: just because you have some data doesn't necessarily mean it will help you.

I'd like to highlight this point with an example I call the million-dollar binomial distribution. This was one of my very first consulting projects at the Laboratory, and it related to the National Ignition Facility. For those of you visiting the Laboratory who maybe aren't familiar with NIF, it is the world's largest and most energetic laser. I could talk about it for an extended period of time, but for the purposes of this example you only need to know two things about NIF. The first is that the kind of science we do there cannot be performed anyplace else on Earth; it allows us to answer questions that can't be answered any other way. The second is that it was very, very expensive. We want to take good care of it, because it cost a lot of money to build and it costs a lot of money to run, but that has to be balanced against the fact that, again, we need answers to questions we can't get any other way. So any problem that NIF has that impacts its ability to do science is, by definition, an expensive problem. And a couple of years ago, NIF developed a very expensive statistics problem.

It was during a regular maintenance period. NIF was shut down, and they were installing a particular component, a fail-safe device. One of the engineers involved in the installation determined that the failsafe itself could fail in a very unsafe way. It was described as a catastrophic damage scenario, which doesn't actually mean flames leaping out of the laser bays or the building falling down or anything, but it could create extremely severe damage to one third of the optics at NIF, and everyone agreed: yes, that is a very, very serious problem. (You've probably figured out that I'm understating the cost impact in my title.) So it's very serious, and we want to avoid having that happen, but again, we don't have the option of not running NIF just to keep it safe. There's always some aspect of risk when you're doing big science, when you're doing something no one's ever done before, and NIF actually has protocols that allow them to weigh the risks to the system against the value of the science. They performed a calculation against their standard protocol to determine the maximum risk they could tolerate for the failure of this component, and they determined that they could continue to run the system if they could prove that the failure rate was less than one in 10 billion.

As I said, it's a very expensive system, and we want to keep it very safe. So they went to their engineering reliability textbook, and they did exactly what they were trained to do. They looked up the right formula, they put it in a spreadsheet, and they said: okay, one in 10 billion, and we need to certify this many components; how many tests do we have to run to achieve that certification? Now, I can tell you, if this were anything other than an electronic component, this testing would not be possible. If you want to certify something at a failure rate of one in 10 billion, you're going to be doing more than 10 billion tests.
However, the testing rate for this component could be measured in hertz, so it was technically possible. It was just going to require NIF to be shut down for three months. This was not popular. Of course, there was Plan B, which was to just operate NIF outside the agreed-upon risk bounds for three months. This was also not popular. So they're facing two options they don't like very much, and then someone has the brilliant idea: none of us do statistics for a living; maybe we can find someone who does, who will come in and look at this and hopefully tell us we've done something wrong. This led to a chain of phone calls that culminated in my boss's boss's boss pulling me out of a classified meeting and informing me that NIF had a statistical emergency and I had to be there in an hour. Up until that moment, I hadn't believed there was such a thing as a statistical emergency. It always seemed like something that could wait. I learned otherwise.

So I'm at this meeting, sitting next to the person who performed the sample size calculation that produced the three-month time frame, and they say: okay, this is how we came up with the one-in-10-billion bound. That looked pretty good. And this is the formula we were using to calculate the sample size. Unfortunately, that also looked correct to me. So the popularity of statistics at NIF is really tanking at this point. Very fortunately, the engineer sitting across the table then speaks up and says, "I think we should be taking credit for the other component." My ears perk up, and I say, "What other component?" It turns out this failsafe device is in series with another component at NIF, and in order for disaster to occur, both have to fail simultaneously and independently. And it turns out that bounding two things at less than 10 to the minus 5 is a lot easier than bounding one thing at less than 10 to the minus 10. Five orders of magnitude matters. We were able to repeat the calculation for both components simultaneously, and it turned out they'd already done that testing in the course of regular maintenance. I think one of them took a few hours; the other one was slower, it took a few days. But again: absolutely no need to shut down NIF, absolutely no need to operate outside the risk parameters. The statistician is a hero, and you'd better believe that's what I wrote on my performance appraisal for that year.

Although it wasn't actually true. The hero in the room was the engineer across the table, because I would never have known this was a two-component problem. What that person did, that no one else in the room was doing, certainly not me, was recognizing that we don't actually have to worry about this one failsafe component; we have to worry about all of NIF. When you're able to take a step back, when you're able to escape that data myopia, you can solve the problem in a much better, much less expensive, and equally credible way. That's what I mean when I say know thy problem: don't lose focus just because you have certain data and you know certain methods. If they're not addressing your problem, you need to look a little deeper and a little farther afield to figure out what you're going to do.
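[Editor's note: to make the arithmetic concrete, here is a minimal sketch using the textbook zero-failure binomial demonstration test. Only the two bounds above come from the talk; the formula choice and confidence level are my own illustrative assumptions, not necessarily NIF's actual protocol. If the true failure rate were p, the chance of n consecutive failure-free tests is (1 - p)^n, so demonstrating the rate is below p at confidence 1 - alpha requires roughly n >= ln(alpha) / ln(1 - p) tests.]

```python
import math

def zero_failure_tests(p_bound: float, confidence: float = 0.95) -> float:
    """Failure-free tests needed to demonstrate a per-trial failure rate
    below p_bound at the given confidence, assuming independent trials:
    solve (1 - p_bound)**n <= 1 - confidence for n."""
    return math.log(1.0 - confidence) / math.log1p(-p_bound)

# One component bounded at 1 in 10 billion:
print(f"{zero_failure_tests(1e-10):.2e}")  # ~3.0e10 tests

# Two independent components in series, each bounded at 1e-5, so the
# joint failure rate is bounded by 1e-5 * 1e-5 = 1e-10:
print(f"{zero_failure_tests(1e-5):.2e}")   # ~3.0e5 tests per component
```

Five orders of magnitude in the bound is five orders of magnitude in the required test count, which is the difference between a long shutdown and testing that fits inside routine maintenance.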
All right, commandment number two: know thy tools. Way back in the day, in Stat 101, you were told that we pick our methods according to the data we've got and according to the correctness of the underlying assumptions, and you already know what I'm going to tell you about that. The second thing you learn in Stat 101 is that statistical procedures, as long as you follow all the rules, are going to yield unambiguous results. It's a lot like baking: you do the same thing every time, you get a desirable result. I wish this were true. Unfortunately, statistical models work the exact same way every other model in science or engineering works: their validity depends on the context, and the results can be open to interpretation.

I underlined the word "models" there, because I'm going to argue that statistical methods are all models. Sometimes this is fairly intuitive. The equation I have up there is the linear regression model, y = b0 + b1*x + e: you have your response y, it's a linear function of your explanatory variable x, and then you have some error model epsilon. The plot up there shows independent, identically distributed normal errors, the classic Stat 101 error model. People look at that and it looks like a model to them.

What about this one? That is x-bar, the shorthand notation for the arithmetic mean. The arithmetic mean is what I like to call a reflexive statistic: if you've got some data, why not take the average? People do this all the time, often in Excel. But what do they mean by taking the average? What is that number representing to them? Usually, when people take the average, they say this is a typical member of my population, a representative example; I would expect some of the data, actually a good amount of the data, to look like this. And that's true, as long as your data looks kind of normal, or at least unimodal and symmetric. So when you're using the mean as a representative member of the population, you are actually using a statistical model. You're making assumptions about the real world that will not hold if, for example, your data really looks like that bimodal distribution. When people calculate the mean of that distribution, it still works for some things: if you had a bar and that's its weight distribution, the mean still gives you the balance point. But if you say, "I want a typical member of this population," you're actually going to get something in that trough, which is not usually what people want. If I wanted to talk about a typical example from this population, I'd probably use two numbers; I'd look at the peaks of the two clusters. So keep in mind: any time you do statistics, even very simple statistics, in order for it to be useful to you, in order for it to solve your problem, you are probably making some kind of assumption about how the world works.
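[Editor's note: a few lines of code make the bimodal point concrete. The two-cluster distribution below is invented for illustration: the mean is still the balance point, but almost no observation looks anything like it.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bimodal population: two well-separated clusters.
data = np.concatenate([rng.normal(-3.0, 0.5, 5000),
                       rng.normal(+3.0, 0.5, 5000)])

print(np.mean(data))                # ~0.0: the balance point, in the trough
print(np.mean(np.abs(data) < 1.0))  # ~0.0: almost no data near the mean
```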
So I've told you that statisticians view our methods as models, even the very simple ones. Now I'm going to tell you how statisticians view their own models. George Box, a very, very famous statistician, is perhaps most famous outside of statistics for the quote that is essentially "all models are wrong, but some models are useful." We never actually believe the assumptions we put behind these models. They don't have to hold exactly; they just have to hold well enough for us to address whatever our concern is. That's a lot less comforting than the Stat 101 approach, where all you have to do is look at two plots, maybe run a test, and if everything comes out fine you're home free. But that's not how it works in real life.

I'm going to give you an example of how seriously statisticians really take their own models. This is an explosives safety example. The device you see in the photograph is called a drop hammer. It's located at the High Explosives Applications Facility on site, and it's used to test the impact sensitivity of explosives. What you do is take a very tiny sample of explosive, crank the drop hammer up to a particular height, place the sample underneath it, drop the hammer, and then check for a very tiny explosion. We know they're very tiny because we've been using the same drop hammer for years. They actually look for a flash or some kind of acoustic signature, but it's not a particularly dramatic effect.

What's interesting about the drop hammer is that it lends itself to one of my favorite areas of statistics, which is adaptive, or sequential, experimental design. You can't test more than one sample at a time; you have to do them one at a time. So why not use everything you've learned so far in your test series to pick the next point you're going to look at? One of the earliest examples of a drop hammer experimental design specifically was a paper by Dixon and Mood, published in the Journal of the American Statistical Association in 1948. So this paper has impeccable statistical pedigree.

The way the method works is that you assume your data has a normal distribution. So what's our data here? You say every single sample of explosive has some critical height: hit it above that height and it's going to pop; below that height you won't get any reaction. Actually, they don't assume the heights themselves are normally distributed; they assume it's a transformed height, specifically the log transformation in this case. You never actually observe the height in question, but we're assuming it's normal. You also assume you have some reasonable guess of the mean of that distribution (it doesn't have to be exactly right) and some reasonable starting guess of the standard deviation (again, doesn't have to be exactly right). Then the way you proceed with testing, and I'm going to follow the example plot I have up there, is you start with the drop hammer at your mean height and you drop it. In this case we saw a reaction, so we move the hammer one standard deviation lower; we put a little less energy into the system and see what happens. We drop it again. In this case it pops again, so we again move one step lower. We drop it again; nothing happened. Okay, raise it a step. And up and down and up and down and up and down, for usually 50 to 60 tests. This method is often called the Bruceton up-and-down test, for reasons that I think are fairly obvious from this description.
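[Editor's note: the procedure is simple enough to simulate in a few lines. This is a toy sketch with invented values; the real Dixon-Mood estimator pools the go/no-go counts in a specific way, and averaging the test heights below is just a crude stand-in.]

```python
import numpy as np

rng = np.random.default_rng(1)

def up_and_down(start, step, true_mu, true_sigma, n_tests=50):
    """Toy Bruceton up-and-down test: each specimen has a latent
    (log-)critical height ~ Normal(true_mu, true_sigma) and reacts
    iff tested at or above that height.  Step down after a reaction,
    up after a non-reaction."""
    h, results = start, []
    for _ in range(n_tests):
        reacted = h >= rng.normal(true_mu, true_sigma)
        results.append((h, reacted))
        h += -step if reacted else +step
    return results

runs = up_and_down(start=0.3, step=0.1, true_mu=0.0, true_sigma=0.1)
print(np.mean([h for h, _ in runs]))  # hovers near the true mean, 0.0
```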
Now, I mentioned that there's a normality assumption underlying this, and there's something critical to think about with that assumption: you can never check it. You don't have a whole lot of data to start with, maybe 50 or 60 replicates, and none of that data directly samples the distribution. All you have are a bunch of zeros and ones, and for that matter, you never actually move that far from the center of the distribution; you're typically staying within two standard deviations either way. You cannot check normality for this data. So what do Dixon and Mood have to say about that? They say that's okay. Your transformed data doesn't have to be normal; it has to be reasonably normal in the neighborhood of the mean. They intend for this method to give you information about the mean of the distribution, and it'll work as long as your data is roughly symmetric, unimodal, and the tails aren't too horrible. You don't have to prove the normality assumption, because you don't really need it. So again, this model is wrong, but it's very useful for this application, as evinced by the fact that it was published in the premier statistical journal in the United States, and a statistician is standing in front of you today saying, "I have no problem with this."

Now I'm going to show you statistics gone wrong, and it looks a whole lot like statistics gone right, doesn't it? We have the exact same data, we have the exact same method, but something has gone awry. I'll give you a hint. In Dixon and Mood's paper, they have the following quote: the up-and-down method is particularly effective for estimating the mean (as I discussed before, it's fairly robust there), but it is not a good method for estimating small or large percentage points, for example the height at which 99% of specimens explode, unless normality of the distribution is *assured*. I added the emphasis, but in the very next sentence they basically say you're never going to assure normality, so just seriously don't do it. Statisticians believe in the normality assumption the same way we believe in asymptotics: you don't have a normal distribution any more than you have an infinite sample size, so you cannot rely on the assumption when it matters.

To give you an idea of what people do with this: Dixon and Mood said don't go out to 1%, to 10 to the minus 2. In a normal distribution, that's between two and three standard deviations from the mean, and that's where they stop trusting their method. People routinely go out five and six standard deviations from the mean. This is extrapolation in a very bad way. The data you're collecting is all within about two standard deviations. If we go out about six, your data is in the figure and your inference is in the text box. Your data has nothing to say about what's going on down there, except that it's below the median of the distribution, and you absolutely should not trust it. But people do this all the time.

I not only work with the explosives testing folks here, I also do peer review for other laboratories, and I had a very interesting experience reviewing a paper for another laboratory, which is going to remain nameless, but it's located in New Mexico. You still don't know who they are. I was reading through the paper, and it was a testing situation, not the Bruceton up-and-down test, but a similar situation where they had a very limited sample size, and they were going out five standard deviations from the mean, assuming normality, and quoting those numbers like they were real. I knew these people, and I knew they were smart people, so I gave them a phone call and said, "You cannot possibly believe this." They said, "Of course we don't, but we have to put it in the paper, because everybody else does and people are expecting to see it." I did not view this as a credible defense. The compromise we came up with at the end of the day was that the paper went out with the five-sigma numbers still in it, along with a disclaimer saying: by the way, we don't actually believe these numbers, because obviously the data isn't normally distributed, but we're putting this here because it's standard in the literature. Not perfect, but it was better.
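[Editor's note: a sketch of my own, not from the paper in question, of why five-sigma extrapolation is so fragile. The three unit-variance, symmetric, unimodal distributions below would be nearly indistinguishable within a couple of standard deviations of the mean, which is all a 50-shot test series ever sees, yet their five-sigma tail probabilities disagree by three orders of magnitude.]

```python
import numpy as np
from scipy import stats

# P(exceed the mean by 5 standard deviations) for three unit-variance,
# symmetric, unimodal distributions that agree closely near the center:
z = 5.0
print(stats.norm.sf(z))                               # ~2.9e-07
print(stats.logistic(scale=np.sqrt(3) / np.pi).sf(z))  # ~1e-04
print(stats.t(df=5, scale=np.sqrt(3 / 5)).sf(z))       # ~7e-04
```

Data that could never tell these three apart yields wildly different answers once you quote the tail.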
The point is, when you're dealing with a method, you have to make sure it's robust in the regime where you're using it. If you really want to talk about tail distributions, there are different testing methods that are better for that. When you're talking about extreme tails, unless you're NIF, you're never going to get enough data, but you have to be very honest about the assumptions you're making, why you believe them, and whether you really need to believe them when you quote those kinds of results.

All right, you can't have a talk about everything wrong with statistics without talking about null hypothesis testing and p-values. Please note the disclaimer: I speak only for myself today. I do not speak for Lawrence Livermore National Laboratory, certainly not the Applied Statistics Group or my group leader, and especially not my adviser, who is in fact a Bayesian. I don't think there's anything wrong with p-values. I use them in my statistical practice. But however you use them, you need to recognize that a p-value of 0.0501 is the same as a p-value of 0.0499. One of these is not a cause for despair and the other a cause for joy. So be careful of that. I don't have a problem with statistical hypothesis testing; I use it in my statistical practice. But it is not the correct tool for making all possible decisions. Complaining that null hypothesis testing doesn't work for making all decisions and all scientific inference is a little like saying your screwdriver is broken because it doesn't work on nails or put out fires. That's not what it's supposed to do. It works in a very limited context, and neither of these is really an automatic method; you always have to be thinking about the implications of using them. Overall, they're not broken; they're just misused. That said, do not go back to your office and say that the director of statistical consulting said it's okay for you to use p-values and null hypothesis testing. That's not what I'm saying. I'm saying they are valuable in a particular context; be willing to invest the time to understand them if you want to use them.

All right, last topic of the hour: know thy data. This statement addresses what you hear a lot from people in data science, predictive analytics, and machine learning, outside the statistical community, basically the big data community: of course parametric models are vulnerable to violations of their assumptions, but if you can get enough data, and in some contexts you can, then you can use robust models that aren't sensitive to those assumptions, and all of a sudden the cookbook approaches, the very standard software approaches, Google's robot statistician, that kind of thing, will all work in a way they didn't in the small-data statistical context. I don't buy it. There are a lot of reasons why. There are actually a lot of caveats and sensitivities in the kinds of models used for big data that really do affect their performance on real-world problems. But the number one issue with big data approaches is the same as a huge issue with small data approaches: you have to start with the right data, and that is harder than people believe.

There's a very famous probability problem called the Monty Hall problem. Lots of people encounter it in probability or statistics courses, and the basic premise behind it is that conditional probability is harder than you think. I'm trying to promote a new problem to go into probability and statistics courses, which I'm calling Jackie's Improbable Sister, to show you that sampling is harder than you think.
This is not a new example. It's been written up in The American Statistician and Scientific American, and Marilyn vos Savant wrote an article on it. I first encountered it in a John Allen Paulos book called Innumeracy, where it was entitled "Mr. Smith's children." So clearly the first step to popularizing it is giving it a better name.

To set the stage: this story takes place in the town of Assumptopia. In Assumptopia, every family has exactly two children. These children have equal probability of being male or female, and the genders of the two children in the same family are independent of each other, so if you know one of them is a brother, that doesn't give you information about the other one. And of course, because this is the statistician's version of Lake Wobegon, all of these children are exactly average. Jackie is a girl in Assumptopia, and Jackie has a sibling. What is the probability that Jackie has a sister?

You ask a statistician this, and they'll say: obviously, the probability is one half. You ask John Allen Paulos, in his book Innumeracy, and he says one third, without explanation. I found this so upsetting when I was reading the book that I actually had to put it down, go to the Internet, and look for errata, because I couldn't figure out how such a smart man could be so wrong. And when I thought on the problem a little more, I realized Jackie either has a sister or a brother; the fact that we don't know which it is clearly doesn't make it a probability problem. But that's not the point I want to make today. No laughs? No frequentists in the audience.

What I actually found when I went to the Internet is that this problem is famous for being ill-posed. Which is to say: I can generate two sampling schemes by which to find my Jackies. Under one of them, the girls who are sampled have a 50/50 chance of having a sister; under the other, they have a one-third versus two-thirds chance. And the two methods are equally consistent with the description of the problem. How did this happen? It all comes down to how we found Jackie.

In statistician world (these, by the way, are all the children of Assumptopia) you pick a two-child family at random, then pick a child from that family at random. You can actually skip step one and just pick a kid at random. If you picked a girl, well, two of those girls have brothers and two of those girls have sisters. This is how we get a probability of 0.5. So what happened in Paulos's world? Basically, he thought the key aspect of the description is that Jackie's family has at least one girl in it. If I restrict the families of Assumptopia to those with at least one girl, I'm down to three. I pick a family at random, and all of a sudden I have a two-thirds chance of getting a family where the girl has a brother and a one-third chance of getting a family where the girl has a sister.

This is a contrived example, obviously, but it illustrates the point that you can draw incorrect conclusions if your sample didn't come from where you thought it came from. For example, if I'm wandering around blithely thinking I'm in one-half land, and I'm given the statistics from one-third land, I will think something nefarious might be going on in Assumptopia, when in truth the sampling was simply performed in a different way than I thought. This is a very real and very expensive problem.
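[Editor's note: both schemes take only a few lines to simulate, which makes the ambiguity concrete. This is a toy sketch of the two schemes just described.]

```python
import random

random.seed(0)
FAMILIES = [("G", "G"), ("G", "B"), ("B", "G"), ("B", "B")]

def scheme_statistician():
    """Pick a random child until you get a girl; check her sibling."""
    while True:
        kids = random.choice(FAMILIES)
        i = random.randrange(2)
        if kids[i] == "G":
            return kids[1 - i] == "G"

def scheme_paulos():
    """Pick a random family with at least one girl; the girl has a
    sister only if the family is girl-girl."""
    while True:
        kids = random.choice(FAMILIES)
        if "G" in kids:
            return kids == ("G", "G")

n = 100_000
print(sum(scheme_statistician() for _ in range(n)) / n)  # ~0.50
print(sum(scheme_paulos() for _ in range(n)) / n)        # ~0.33
```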
One of the places where it appears most often is public opinion polling. You could probably give an entire hour-long talk on this topic alone, but I'm just going to give you one very famous example. In the presidential election of 1948, the Chicago Tribune was having a dispute with its printers' union, and it had to go to press before the polls had closed on the East Coast. If you were paying attention in 2000, you know it's not a great idea to call a presidential election before the polls close, but they felt pretty good about it, because all of the polls were in agreement that Harry Truman was going to get crushed: Dewey, the Republican, was going to win by a very significant margin. So they printed that headline for their first edition and published it, and that led to the iconic photograph where somebody hands a copy of the newspaper to President-elect Harry Truman, and he says, "That ain't the way I heard it." Harry Truman was very pithy.

So what happened here? It turns out the polling organizations were using the latest and greatest in opinion polling technology, something called quota sampling, which had allowed them to successfully predict several prior presidential elections. The way quota sampling works is that you slice and dice your population into different demographic groups. For example, one group could be women between the ages of 25 and 35 who wear glasses and like math. You assign each of your pollsters a certain group and say: okay, go out and ask X number of people in this group who they're going to vote for for president. They go out, they find these people, they ask them, and then you perform some mathematical magic to recombine all the different quota categories into one unified estimate of what the popular vote is going to look like. And this, as I said, appeared to work great, up until 1948.

But there was a magic word I didn't say in my description of quota sampling, and that magic word is "random." When pollsters were picking members of these classes, they got to pick whoever they wanted, and for whatever reason, they kept picking Republicans. In fact, when people went back and looked at the polling results from previous elections, they found that quota sampling had been over-sampling Republicans for years; it just hadn't mattered until that point. So we abandoned quota sampling, and thankfully we've solved public opinion polling; that's not a problem anymore. Unless you're in the United Kingdom, following the recent parliamentary election, when, as you know, the Conservatives were going to lose badly enough that they might not be able to form a coalition government. Then we wake up the next morning, and it turns out the Conservatives had crushed everyone and were going to form a government all by themselves. So clearly we still have a lot to learn about asking people who they're going to vote for.
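[Editor's note: the quota-sampling failure mode is easy to caricature in a simulation. All numbers here are invented; this is a sketch of the mechanism, not the 1948 data. Quotas control who gets counted by demographic cell, but not how respondents are chosen within each cell, so a consistent interviewer lean passes straight through to the combined estimate, while random sampling averages it away.]

```python
import numpy as np

rng = np.random.default_rng(1948)

true_support = 0.50       # the candidate's true share in every cell
interviewer_lean = 0.06   # within-cell selection shortfall (invented)
cells, per_cell = 40, 50  # demographic quota cells x interviews per cell

# Quota sampling: every cell quota is filled, but each cell is drawn
# with the same selection lean, so the bias survives recombination.
quota = rng.binomial(per_cell, true_support - interviewer_lean, cells)
print(quota.sum() / (cells * per_cell))  # ~0.44: a landslide that isn't there

# Simple random sampling of the same total size has no such bias.
print(rng.binomial(cells * per_cell, true_support) / (cells * per_cell))  # ~0.50
```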
Here's an experimental example. This photograph is from the administration of the Salk polio vaccine. For those of you born after the 1950s, you might not be terribly familiar with what polio was or why it was so scary. Polio was the impetus behind the formation of the March of Dimes. It was a terrifying childhood disease that was becoming increasingly prevalent in the United States in the early 20th century. This is the opposite of every other public health threat we face: where we typically see less and less of a disease, with polio we were seeing more and more. It primarily affected children, and it caused incurable paralysis. In cases where it affected the torso, people could no longer breathe on their own, and they were stuck in machines called iron lungs that breathed for them. And no one knew how it spread. People knew polio was contagious, but they didn't know the vector. In fact, if you want another example of statistics leading people astray: at one point the prevalence of polio was linked to consumption of ice cream, and people stopped eating ice cream because they thought it might cause polio. It doesn't, by the way.

The point is, when the Salk vaccine came out, it was heralded as the first weapon we had against this terrifying disease, and the Salk polio vaccine trial is legendary in both biomedical testing and statistics as a fantastic, gold-standard, randomized, double-blind controlled trial. And it almost didn't happen. Back in the 1950s, people said there's no way parents are going to agree to enroll their children in a randomized controlled trial; they're not going to let you inject their kid unless they know they're getting the vaccine; we can't use this method. There was great hue and cry and Sturm und Drang, and what ultimately happened was they split the trial into two parts. Half of it was non-randomized: they vaccinated second graders and watched first and third graders as the control group. For the other half, thankfully, they did do a randomized controlled trial. I could give you another talk on what was wrong with the non-randomized half, but for now let's focus on the randomized controlled sample.

I mentioned randomness is magic. There were actually three groups in this gold-standard study: the people who enrolled in the trial and received the vaccine, the people who enrolled in the trial and received the placebo injection, and the people who did not enroll in the trial because their parents would not consent. There were about 200,000 in each of the enrolled groups and 300,000 in the unenrolled group. Remember, I said we didn't know at that time how polio was transmitted. Which of these groups do you think had the highest incidence of paralytic polio? It was the placebo group, yeah. And it turns out this is not because saline injections are actually what causes polio. It was because of an uncontrolled third factor, tied to both enrollment in the trial and susceptibility to paralytic polio, which no one suspected at the time: polio is a disease of affluence.

The vast majority of polio cases are actually completely asymptomatic. An estimated 90% of people who contract polio never show any symptoms at all. Of the remaining 10%, most of the time it's transient, flu-like effects. It's only in half a percent of people that you actually see the crippling damage and the horrible persistent disease, and the older you are when you get it, the more likely you are to have the severe effects. So what was happening was that higher-income groups were living in more hygienic environments. They were not being exposed to polio as infants, when they still had maternal immunity to protect them, so by the time they were eventually exposed, they were more likely to catch the severe disease. This is why we were seeing more and more polio as time went on in this country, and it's also why the more highly educated, more affluent parents, who were more likely to agree to the randomized controlled trial, had children who were more susceptible. As I said, this was a near miss; we actually did do the randomization here.
If we had been directly comparing the group that was not vaccinated because their parents refused to the group whose parents consented and who were vaccinated, we would have underestimated the efficacy of the vaccine. Randomization is pretty much the only thing we have to protect us from unknown unknowns. It's the only thing that will defend us against the aspects of our experiment we haven't even thought to worry about yet. And sometimes we can't use it at all.

Here's a contemporary example. Google came up with a tool, or an applet, whatever you want to call it, in I think 2008 or 2009, called Google Flu Trends, and it was heralded as a really neat method. What they were doing was taking people's Google search behavior and correlating it with the reported number of influenza cases the CDC had in various regions of the country. They found that people who are sick with flu are disproportionately likely to search for flu symptoms on the Internet, and that allowed them to correlate the search results they had access to with the CDC data and predict where the flu season was going to go next and what its peak would be. This was viewed as a really amazing achievement at the time.

Then 2010 rolls around, and Google Flu Trends breaks. It was under-predicting, by a significant amount, the flu epidemic in this country. So what had happened? The flu changed. That was the H1N1 epidemic, the first large-scale flu season to appear since Google started fielding this tool. Google Flu Trends had never seen a highly epidemic flu strain, so it clearly couldn't predict its behavior. They said: okay, now we've seen this, we won't make that mistake again. Google Flu Trends is fixed. Until 2013. So what happened in 2013? Google Flu Trends starts over-predicting, by a factor of two, the number of flu cases being seen throughout the country. What happened there was a swine flu scare in Asia, which never actually translated into additional flu cases in the United States. You have a lot of media attention, and a lot of people googling flu-like symptoms who, it turns out, don't actually have the flu.

So Google Flu Trends is once again fixed, and I now tell you with perfect confidence: it's going to break again. The people at Google would not disagree with me, because they are very clever, and because they know they have a hard problem. They are chasing two moving targets: they are trying to correlate the behavior of the flu, which is always changing, with the behavior of humans on the Internet, which is also always changing. So they are not worried about whether their method breaks; they're worried about when it breaks and how they're going to fix it in time. This is a fairly low-consequence example, but whenever you're dealing with changing behavior, which happens with a lot of human data, you have to anticipate when your model will stop working. And as I said, Google can't fix this. I'm sure their X division is working very hard on a procedure for randomly sampling all times past and future, but until they get there, we will not be protected by the magic of randomization in cases like this.

The point I want to make with all of these examples is: consider the source of your data, whether it's sampled or from a controlled experiment, in the same way you consider all the other factors that tie into your experiments. We know that if you're not very careful with experimental setup, you can get erroneous results.
We have published plant genomes with human DNA in them: obviously lab contamination. If you set your clocks wrong, you think neutrinos can move faster than light (and I'm oversimplifying that one a lot). This is just telling you: think about where your data comes from, and think about how you're assigning your treatments, in the same way you think about all the other concerns you have when you conduct any kind of experimental or observational study.

So, in summary: please don't treat statistics like a black box. Don't treat it like a cookbook. Don't treat it like magic. If you receive a pitch from a software vendor saying you don't need to understand statistics to use their methods, don't believe them. Statistics is at once much simpler than this and, at the same time, much harder to execute, because statistics isn't memorizing enormous checklists; it's using your brain in fairly reasonable ways, and actually in ways that are very similar to how you already have to think about problems in your own area of expertise. So when you are doing statistics: think about your problem, and make sure you always keep your end goal in mind, because it can affect the data you need and the methods you can use. Make sure you understand your tools. It's not enough to know how they work; you have to know how they break, because they will break. Assumptions are always violated; you just need to know when that's consequential. And finally, know your data. Make sure that the way you're collecting or assembling data will reflect what you need about reality in order to solve your problem, and recognize that that is not as simple as it often appears at first glance.

As I said at the beginning of this talk, I'm not trying to scare people off of doing statistics. I think most people can be successful at statistics when they approach it this way, and one of the reasons I think that is that you're already two-thirds of the way there. You already understand your problem better than anybody else. You already understand your data better than anybody else. If you were to call me and ask me for help, you would first have to explain those things to me, and that could take a very long time. The piece of the puzzle that might be missing, just because you did not spend two to five years of postgraduate education learning statistical methods and then a subsequent five years reading the literature, is that you may not be aware of all the methods out there that could potentially help you, and you may not be aware of all the pitfalls of the methods you do know. And that's why we are here to help.

LLNL has a statistical consulting service. It will provide assistance for Lab projects free of charge for up to half a day, and within that time we are willing to do whatever you want us to do to help you with statistics. That can involve going through papers and translating statistics into English, helping you write a proposal, helping you scope a study, or giving you ideas for methods you could try. It's whatever would be most useful to you.

On a personal note, the reason I run this consulting center, and the reason I give this talk, is that I am deeply concerned about the problems I was speaking about at the beginning. I think it's tragic that we are having this crisis in reproducibility in science, and statistics is getting the blame, not entirely undeservedly.
One of the things I love about the Laboratory is that I think if any place can fix everything wrong with statistics, it's here, because we have a ton of smart people, and because we're highly collaborative; we naturally work together on these projects. My goal is for every piece of statistical analysis that comes out of this laboratory to look like a trained statistician did it. I want to enable each and every one of you to be successful statisticians in your own area of expertise, and I am more than happy to help you in any way that I can. On that note, I want to thank you all so much for being here today, and I look forward to working with you in the future.
Info
Channel: Lawrence Livermore National Laboratory
Views: 24,689
Rating: 4.9075146 out of 5
Keywords: Statistics (Field Of Study), data analysis, data analytics, research, science, big data, Computer Science (Field Of Study), computation, computing
Id: be2wuOaglFY
Length: 55min 52sec (3352 seconds)
Published: Thu Aug 20 2015