Statistical Inception: The Bootstrap (#SoME3)

Video Statistics and Information

Captions

Welcome to Very Normal Therapeutics! We're so excited to have you as the newest statistician in our company. There's a lot of exciting work going on at our company, and you'll get to do it all because you're the only statistician here. But don't worry, I'll still be here to help you out.

Your first task is to help analyze some data that's coming from a clinical trial we just finished. Should be easy for you, right? This is what the data looks like. The sample size isn't too big, but in clinical trials you take what you can get. I need you to figure out if the data provides evidence of a positive treatment effect. If you're all good with that, just send me the report when you're done.

Okay, but before I go, let me just make sure that you'll be on the right track for this data and your task. What would you do? Okay, so you want to use the average value of the distribution to estimate the treatment effect. That's a good start, but I think the median would be a better choice here, since there are some outliers in the data. So how will you know for sure if there's actually a treatment effect, and that it's not actually an artifact of the data? No, we can't just go out and repeat the clinical trial. Clinical trials are expensive. [Music]

A confidence interval? Okay, good idea, but there's a slight problem with that. Do you know what the sampling distribution is for the sample median? How are you going to calculate the bounds for the interval? The central limit theorem can work here, but it's a little shaky. Our sample size isn't that big, so I'm not sure if asymptotics can kick in. You're not in your PhD program anymore. You can't just use asymptotics to solve all your problems.

Okay, okay, I think you're gonna need a little bit more help. Don't worry, it's your first day. I don't mind helping you out. There's a method I use all the time, and I'll teach it to you today to solve this problem using our computers. Today, you're learning the bootstrap.

The bootstrap was invented by statistician Bradley
Efron, professor emeritus of statistics at Stanford University. Professor Efron introduced the bootstrap in a 1979 paper and developed the idea further in later papers and textbooks. It's a deceptively simple technique for estimating the sampling distribution. Here's how it works.

We want the sampling distribution, but we don't have the time, money, or energy to collect multiple data sets to get multiple estimates. All we have is just a single data set. Here's the key idea: each person in our sample is assumed to come from some population of interest. Even though this is the data we observed, it's not absurd to think that we might have observed a data set like this, or even this. That is, conceivably, we might be able to mimic new data sets from our existing data set, since we assume they all come from the same population anyway.

The bootstrap takes this idea to its logical extreme. It's sort of a statistical Inception. In the movie Inception, they move through different levels of the dream world to try to insert ideas in other people. In the bootstrap, the first level is the population world, and after we collect data, we go into the sample world, where we actually observe data. The bootstrap takes it a level further and tries to extract data from the sample itself. Instead of collecting more data sets from the population, what if we collect data sets from the sample itself?

From each of these bootstrap data sets, we compute a bootstrap estimate of the treatment effect, or whatever statistic you want to get the distribution for. For these bootstrap data sets, we'll sample with replacement from the original data, meaning that it's possible that we'll have many duplicate observations in the data. For this reason, the bootstrap is what we call a resampling technique. Then we use the distribution of these bootstrap statistics as an estimate for the sampling distribution itself. And that's it. You might be surprised to learn that it actually works.

The bootstrap gets its name from the phrase to pull
oneself up by one's bootstraps, referring to the idea of starting something without outside help.

I first learned about the bootstrap in my first semester as a master's student in biostatistics. But unlike other concepts at the time, the bootstrap didn't get an explanation or even a proof. I was just told that it works, and of course that was very unsatisfying to me. So I tried to look it up, but I could only ever find bootstrap tutorials, which I didn't need, or advanced textbooks, which I didn't understand. I bet you're just as curious as I was to know how and why it works, but unlike me, you can get the answer immediately. Now, in the present day, as a PhD student, I have the understanding and the vocab to understand the machinery of the bootstrap and explain it to you. Little did I know, the simplicity of the bootstrap belies the mathematical complexity needed to prove that it works. Let's have a look.

It's well known that randomness in the data will translate into randomness in the estimator. Statisticians model the structure of this randomness with mathematical objects. More precisely, statisticians describe both data and estimators as random variables. One way that randomness can be described in a random variable is in a function called the probability density function, or the PDF. The PDF describes the likelihood that a random variable will take a particular value. Similarly, we can describe a random variable using the cumulative distribution function, or CDF. The CDF describes how much probability is stored behind a given value of a random variable. The CDF and PDF both convey the distribution of the randomness in a random variable, but in different forms. For this video, we'll focus on the CDF, and the reason will become clearer later.

Our goal is to estimate the sampling distribution of the sample median. More specifically, we'd like to get its CDF. The CDF is special because we can actually derive many important statistics from it. Rather than view the sample mean, median, or
even the variance as a function of the data, we can also view them as functions of the CDF. A function that takes in a function as input and returns a number or vector is called a functional, so we can view the sample median as a functional of the CDF.

In practice, we won't ever know the true CDF, so by extension we won't ever be able to know values like the population mean. But there's this idea in statistics known as the plug-in principle: if I plug in an estimate for the CDF into the functional instead, I should get a good guess for the population value. Therefore, it's important that we find a good estimator for the CDF.

The bootstrap hinges on the idea that we can somehow approximate the CDF of our statistic. We'll denote the CDF of the sampling distribution like this. We need data to create an estimator, so the CDF will actually change shape depending on how much data we collect. We denote this dependence with the subscript n here. This dependence will come up again later, so just keep it in mind. Our candidate estimator for this function is the bootstrap distribution. We denote the bootstrap distribution like this. In statistics, it's common to denote an estimator with this little hat.

In general, for an estimator to be good, it must be close to the thing that it wants to estimate. Theorems like the law of large numbers say that point estimates like the sample mean will be close to the population mean with enough data. But both of these things are functions. What does it mean, mathematically, for two functions to be close? That is, how do we quantify a distance between these two functions? There are lots of ways to do this, but the one way we'll focus on is the uniform norm. We denote the uniform norm like this. This word right here means supremum. It's similar to a maximum, but not quite the same, and the technical details aren't important here. This supremum is over all the possible values that the two functions can take. You can think of the uniform norm as the widest
absolute difference between these two functions. If this widest difference were to go to zero, then it follows that the two functions are close, or essentially the same. In other words, to prove that the bootstrap works, we need to demonstrate that the uniform norm between the sampling distribution and the bootstrap distribution either is zero or will go to zero upon some condition. This is called uniform convergence, and uniform convergence is no joke to prove. Dealing with the behavior of suprema is much more difficult than dealing with the sample mean or the variance.

But before we can even start, we have to address two obstacles in our way. One is the fact that the sampling distribution depends on sample size. How can we show that two functions uniformly converge if the target is always changing with sample size? Second, it's actually infeasible to get the full bootstrap distribution. When you sample with replacement, the total number of possible bootstrap data sets is given by this equation, but many of these will be duplicates. The number of unique bootstrap data sets is given by this equation. Even with medium-sized data sets, this number of bootstrap data sets can grow too large to handle with even the most powerful computers.

In practice, we only use a subset of these bootstrap data sets and find what's called a Monte Carlo estimate of the bootstrap distribution. We denote the Monte Carlo estimate as this. It's called Monte Carlo because we're picking multiple bootstrap samples at random. Monte Carlo methods are very important to statistics, but we'll have to wait for another video to discuss them. This Monte Carlo estimate is an estimate for the bootstrap estimator. There are two degrees of separation from what we have in practice to what we actually want.

While I don't have the time or mathematical chops to prove the actual results here, I know enough to give a rough sketch of the proof. I'll lay out the path here. It turns out that you actually don't need many bootstrap data sets for a Monte Carlo
distribution to uniformly converge to the bootstrap distribution. We'll use this link to describe uniform convergence between these two distributions. We'll have uniform convergence when the number of bootstrap data sets is large enough. This solves our second problem, but we still have the thorny issue of the sampling distribution's dependence on the sample size.

If we could somehow remove this dependence on sample size for both of these distributions, that would be great. One way that we can move away from this dependence is to turn to asymptotics: make sure your sample size is infinite, or so big that your sampling distributions turn into limiting distributions. We'll denote this limiting distribution as L. For example, in the case of the sample mean, we know that the central limit theorem will tell us that L will be a normal distribution with some population variance. Using a similar argument, the same thing will happen to the bootstrap distribution. It too will have a limiting distribution, which we'll call L hat. Instead of linking the sampling distribution directly to the bootstrap distribution, we'll go through their limiting distributions.

The final link we need is to demonstrate that the limiting distributions will uniformly converge. These limiting distributions often come from parametric families. That is, if the parameters of these two distributions converge, then the PDFs will also uniformly converge. This is usually much easier to show and can also be done through asymptotics. As it turns out, the bootstrap variance will converge to the population variance with a large enough sample size, which means that L hat will also uniformly converge to L. And that's it. We've established the necessary links that allow us to say that the bootstrap distribution uniformly converges to the sampling distribution.

To sum it all up: first, we needed to take a large enough number of bootstrap data sets for the Monte Carlo estimate to converge to the bootstrap estimate. We needed the sample size of the
original sample to be so large that three things could happen: one, for the sampling distribution to go to its limiting distribution; two, for the bootstrap distribution to go to its limiting distribution; and three, for the bootstrap variance to converge to the population variance, such that the limiting distributions also uniformly converge.

There's a lot of technical detail glossed over in this video to focus on the bigger picture. For example, I've ignored the specific theorems used to prove that uniform convergence exists between the distributions. The theorems themselves are important, but I wanted to convey that the path between the bootstrap estimator and the target CDF is more complicated than one might expect.

In this video, we learned about the bootstrap, a technique for approximating the distribution of an estimator. The bootstrap is simple to implement but a beast to understand. Since its invention, many different types of bootstrap have been created to accommodate different types of data, and it's still an actively researched topic today.

The bootstrap also exemplifies one of the important turning points in the field of statistics. In the dark times before we all had a powerful computer in our pocket, statistics was done with pen, paper, and brain power. To understand the sampling distribution, statisticians often needed to derive it using math, and as I learned during the first year of my PhD, math is hard. But as Andrew Gelman notes in his paper on the most important statistical ideas in the last 50 years, computers have come to have a more prominent role in statistics. Rather than being mathematically derived, the bootstrap is computationally defined. The only technology you need to perform the bootstrap is a for loop. Modern statistics is done with the keyboard now.

I'd like to thank Professor Larry Wasserman's notes on the bootstrap for such a clear explanation on the topic. This video actually would not have been possible without these notes. Thanks for watching. Like the video and subscribe if
you'd like to see more. I'll see you in the next one. Now get back to work and get me that report. [Music]
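
Editor's sketch (not part of the video): the captions describe the resampling procedure and the "for loop" only in words, so the following is a minimal Python illustration of a Monte Carlo bootstrap for the sample median. The data values, the number of resamples (2000), the seed, and the helper names `bootstrap_medians` and `percentile_interval` are all assumptions made for illustration. The percentile interval is one common way to turn bootstrap statistics into a confidence interval; the video does not say which interval it uses. The on-screen formulas for the number of bootstrap data sets are not in the captions; the last two lines print the standard counts for a sample of size n, namely n^n ordered resamples and C(2n - 1, n) distinct ones.

```python
import math
import random
import statistics


def bootstrap_medians(data, n_boot=2000, seed=0):
    """Monte Carlo estimate of the bootstrap distribution of the sample median."""
    rng = random.Random(seed)
    n = len(data)
    medians = []
    for _ in range(n_boot):
        # Resample n observations with replacement from the original sample.
        resample = rng.choices(data, k=n)
        medians.append(statistics.median(resample))
    return medians


def percentile_interval(values, level=0.95):
    """Percentile interval: trim (1 - level) / 2 of the values from each tail."""
    ordered = sorted(values)
    alpha = (1 - level) / 2
    last = len(ordered) - 1
    lower = ordered[math.floor(alpha * last)]
    upper = ordered[math.ceil((1 - alpha) * last)]
    return lower, upper


# Hypothetical treatment-effect measurements with two outliers (made-up data).
sample = [1.2, 0.8, 2.1, 1.5, 0.3, 1.9, 9.7, 1.1, 0.6, 1.4, -6.2, 1.7]

medians = bootstrap_medians(sample)
low, high = percentile_interval(medians)
print("sample median:", statistics.median(sample))
print("95% percentile interval:", (low, high))

# Why the full bootstrap distribution is infeasible: counts for sample size n.
n = len(sample)
print("ordered resamples (n^n):", n ** n)
print("distinct resamples, C(2n - 1, n):", math.comb(2 * n - 1, n))
```

If the whole interval sits above zero, that would be read as evidence of a positive treatment effect under this sketch's assumptions. Fixing the seed makes the Monte Carlo estimate reproducible; raising `n_boot` trades run time for a smoother estimate of the bootstrap distribution.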

Info

Channel: Very Normal

Views: 28,240

Keywords: statistics, biostatistics

Id: BiNcdYbyiWw

Length: 13min 49sec (829 seconds)

Published: Fri Aug 18 2023