The most important skill in statistics

Captions

Hello, this is Christian, and welcome to Very Normal, a channel for making you better at statistics, because I know that on average, you need it. Especially you. You know who you are. In this video, I'm going to talk about what I think is the most important skill that every statistician should know. This skill is important for lots of different reasons, and it's useful whether you're a beginner in your first stats class or a principal statistician earning crazy money at a pharma company. You already saw the thumbnail: today we're learning about Monte Carlo simulation.

The whole game of statistics

Before I dive into Monte Carlo stuff, let me explain what statistics is all about. I'll make a diagram to explain this. On one hand, we have the real world. The real world produces data according to mysterious, complex mechanisms, and we mere humans want to understand these mechanisms more. However, we usually don't know what these mechanisms are or how they work. The only thing we see is the small bit of data we observe. The data is random, but we assume that this randomness has predictable structure to it. Therefore, you can think of the world as being ruled by the laws and objects of probability.

On the other hand, there's us, who observe the data. We take the data we observe and approximate the unknown probability distribution with a statistical model. But estimating generic functions is really hard, so we often further approximate the true distribution with a parametric family. This simplifies the problem from estimating a single function to estimating a single number or a few numbers called parameters, and these parameters represent aspects of the data generation process we want to know more about. This side can be thought of as being governed by statistics. Based on the data we collect, we create educated guesses about the parameters. We use these estimates to learn more about the data-generating mechanism. Based on what we learned, we can perform more experiments and gather more data and continually refine our
understanding of the world. The cycle of gathering data and making inferences on it is the whole game of statistics, and it sets the stage for why Monte Carlo simulations are so important.

Defining Monte Carlo simulations

Monte Carlo simulations are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. This definition is correct, but it's so vague that it won't mean anything to someone who's never heard about it. This general definition is meant to encompass all the diverse ways Monte Carlo simulations can be used. We'll be focusing on it from a statistical lens.

The key phrase in the definition is "repeated random sampling." Sampling refers to the fact that we only have enough resources to observe a small subset of a large population we want to study. We call the small subset a sample. Random refers to the idea that we want to pick people or items in a way that doesn't prefer or exclude any specific subgroup within this population. We'll get back to the repeated aspect in a bit.

Simulation refers to the idea of using a random number generator to create data. These random number generators aren't random in a chaotic way. Their randomness is usually generated from a known probability distribution, like the normal or binomial distribution. In R, these random number generators come from the r set of functions in the d/p/q/r family of functions. Random number generators allow us to generate data quickly, conveniently, and cheaply.

This is what the "Monte Carlo" in Monte Carlo simulations refers to. The essence is in the repetition of the random number generation. The repetition allows me to better characterize how the inherent randomness in the data generation affects the actual numerical result I want to observe. This idea can be counterintuitive for people who haven't encountered it before. Most things we see in the real world arise from incredibly complex processes. For example, cancer is the result of the interaction of multiple genes, lifestyle,
and experiences. It's so complex that we may feel like it's an insurmountable challenge to figure out. But by modeling these processes with simpler probability distributions, we can learn a little bit more about cancer. Simulations are almost always approximations, and that's something we have to keep in mind, but it's better than throwing up our hands and not trying at all.

Now I'll explain how Monte Carlo simulations may be used by statisticians of varying levels of expertise. No matter how skilled a statistician is, a simulation study has the following workflow. One, figure out a way to generate data from a random number generator in a way that approximates a more complex process. Two, generate a data set from this process and calculate a value of interest. And three, repeat steps one and two many, many times to get the probability distribution of this numerical result. Some of you might notice that this workflow almost perfectly matches the diagram I showed you earlier, but instead of relying on the real world to generate data for us, we take control of that.

Level one: beginner

Statistical knowledge is incredibly valuable, but it can be difficult for beginners to approach. There are lots of reasons for this, but the one I'd like to focus on in particular is the widespread use of infinities and assumptions in important theorems. One of the most important theorems a beginner student should know is the law of large numbers. The law of large numbers is a theorem about the relationship between the sample mean and the population mean. It states that when we gather data from some population with some average, then as we use more and more data to calculate a sample average, the value of the sample mean will approach the true population mean.

The law of large numbers is what's called an asymptotic theorem. For many starting students, including myself, it can be difficult to grasp the use of infinities, which are essential for asymptotic theorems. What does it even mean to have infinite amounts of data? Wouldn't the
sum of infinitely many numbers eventually explode to infinity too? These are all questions I had when I first learned about the law of large numbers. Had I known how to use simulation studies then, it would have been an effective tool for allowing me to actually interact with this theorem. Computers may not be able to generate infinite amounts of data, but they can easily generate thousands or tens of thousands of random data points with just one line of code.

Here's a quick and dirty simulation study to do this. I want to see how the sample mean behaves as the sample size increases. Let's see what happens. And that's the law of large numbers in action. We could just replace infinity with a stupidly large number instead.

But wait, there's more. Many theorems in statistics come with various assumptions. For the law of large numbers to work, the mean for that population must be finite. In other words, it can't be infinite or undefined. But it's hard to appreciate why assumptions are necessary for a theorem. If you learn this in class, it's easy just to treat the assumptions as something to memorize. To reach that next level of understanding, we can actually use simulation studies to challenge these assumptions and allow us to see what happens to the theorems when the assumptions are violated.

Here's a second Monte Carlo simulation. Everything is mostly the same, except that the data is being generated in a different way. Instead of a normal, I'm generating it from a standard Cauchy distribution. A standard Cauchy distribution has a bell shape centered at zero, but its shape is not quite the same as the normal. What's weird about the Cauchy distribution is that it actually doesn't have a population mean, so it's a population that the law of large numbers does not apply to. Let's see what happens when we run this version of the simulation. No matter how large the sample size was, the sample mean never really settled at zero, which is the true center of the Cauchy. We can see that it started to approach zero for a little bit, and
then suddenly it would break away, and these breakaways are common enough that the sample mean can never really settle down. So what happens when we violate the finite mean assumption of the law of large numbers? Not having a finite mean means that extreme values, or outliers, are common enough that the sample mean cannot converge to a single value. This assumption on the population mean allows us to circumvent the problem of outliers. Statistics is usually associated with the analysis part of the cycle, but in simulation studies we can control the data generation process as well, so being able to control it gives us another knob to play around with when we're learning.

Level two: intermediate

To me, someone with an intermediate level of statistics is someone who has gone through all of the introductory topics and has started to move into more advanced topics. They may have a working knowledge, but not enough to produce new kinds of statistical models. As you get into a specific area in statistics, it's common that you'll learn several ways to approach the same problem. This was actually the topic of a summer internship I had. My project is what we'd call a methodological comparison. You start to weigh the advantages and disadvantages of different models, but you may not know ahead of time which models will perform best in what situation.

Most statistical models come with a set of assumptions. When these assumptions are met, or are assumed to be met, then they should have good type I error control and power. But we'll never actually know if they're reasonable or not. When we're choosing a model to use for a clinical trial, we'll want it to perform well even in contexts where its assumptions aren't true. To figure this out, we need to test different models in different data contexts. I compared these models along with different sample sizes, different effect sizes, and many other trial parameters. Since we're in control of the data generation process in simulation studies, we actually
have control over which hypothesis is true. We can generate data where the null hypothesis is true or where some specific alternative hypothesis is true. Then, when we conduct our hypothesis test based on our different models, we can record whether or not the null hypothesis was rejected or if we failed to reject. Taking the sample mean over all these simulations allows us to estimate type I error and power.

Naturally, there's nowhere near enough resources to collect real data, so it all had to be done through simulation. The amount of simulation work was so high I had to learn how to work with a high-performance computer to generate all the different data permutations and run all of the models on these data sets. And this is one of the weaknesses of Monte Carlo simulations. Since a large number of simulations are needed to invoke the law of large numbers, our need for computation explodes when there are multiple models and data sets to consider. But given that access to large computation power has been getting cheaper and more accessible, it's not so bad.

Level three: advanced

At the most advanced level, people are no longer students but practitioners and leaders. Think professors or senior statisticians. These people have thought long and deeply about statistical problems, and they know about specific areas so well that they're at the bleeding edge of research. At this level, you're creating new statistical models or improving on current ones. If you were to look at a research manuscript in any journal in statistics, you're almost guaranteed to run into a simulation study. Simulation studies are kind of like the training wheels for new models. The data you generate from simulation studies can be tuned to your exact specifications, so your new model should perform well on it. But what does it mean to perform well? In statistics, it could mean a lot of things. It could mean that the model parameters are estimated well, or it could mean that the new estimation process has some desirable property
like type I error control or robustness to violated assumptions.

As an example, I'll pick an article from the open access articles in the Journal of the American Statistical Association at the time of me recording: "Model-Robust and Efficient Covariate Adjustment for Cluster-Randomized Experiments." Nice. Shout out to these guys for making this article open source. Doing that's definitely not free. If we scroll down a bit, going past all the theorems and remarks, you'll see an entire section dedicated to simulation experiments. In this section, they first describe the models they're using, the metrics they'll use to compare the models against each other, and how they generate their data. The purpose of doing this is to allow other people to try to recreate what they did, for reproducibility.

The authors summarize their findings in a table. This particular paper is interested in estimating a parameter, so the metrics are concerned with bias in the estimation process and coverage of the estimated confidence interval. These three models act as the standard for the paper. These are the models that the authors think one would reasonably use in this particular data problem. These standard models are being compared against two new models that the authors have created. We can see that the proposed models have smaller estimation bias than these two standard models, and they have smaller average standard deviation than the unadjusted model. Both these results tell us that the new models perform better than the current standards in ideal conditions and should be considered for real-world analyses. As you can see, even experts need simulation studies to push the boundaries of statistics.

In this video, I talked about Monte Carlo simulation and how it can be used by people across different levels of statistical expertise. I think that beginning students are not taught simulation studies soon enough, but it can be immensely helpful for them in learning statistics. But I'd like to stress that simulations are a tool and
should be treated as such. You can use these simulations to stress test your understanding of theory, but it's not actually a replacement for sitting down and learning it. The theory is what makes the simulations work in the first place, so don't neglect it.

If you liked the video and like statistical content, then subscribe for more. If you'd like to get my videos directly in your inbox, then subscribe to my newsletter too. Thanks for watching. I'll see you in the next one. [Music]
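
The two law of large numbers simulations described under level one appear on screen in the video but not in the captions. What follows is a minimal sketch of the same idea, not the author's code: the video uses R, while this version assumes Python's standard library, and the names running_means, draw_normal, and draw_cauchy, the seed, and the sample size of 100,000 are all illustrative choices. It tracks the sample mean as the sample size grows, once for standard normal data and once for standard Cauchy data.

```python
import math
import random


def running_means(draw, n, rng):
    """Return the sample mean after each of n draws from draw(rng)."""
    total = 0.0
    means = []
    for i in range(1, n + 1):
        total += draw(rng)
        means.append(total / i)
    return means


def draw_normal(rng):
    # Standard normal: the population mean exists and equals 0.
    return rng.gauss(0.0, 1.0)


def draw_cauchy(rng):
    # Standard Cauchy via its inverse CDF: the population mean is undefined.
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(2024)  # fixed seed (arbitrary) so reruns match
normal_means = running_means(draw_normal, 100_000, rng)
cauchy_means = running_means(draw_cauchy, 100_000, rng)

# Print the running sample mean at a few sample sizes for both populations.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}  normal={normal_means[n - 1]: .4f}  cauchy={cauchy_means[n - 1]: .4f}")
```

Under these assumptions, the normal column should drift toward zero as n grows, while the Cauchy column can jump at any sample size whenever an extreme draw arrives. Changing the seed changes where the jumps occur.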
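
The level two discussion of estimating type I error and power can be sketched in the same way. This is an illustrative toy, not the internship project: it assumes a two-sided one-sample z-test with a known standard deviation of 1, and the helper names rejects_null and rejection_rate, the sample size of 30, the effect size of 0.5, and the 5,000 repetitions are all hypothetical choices. Because we control the data generation, we decide which hypothesis is true, and the average of the reject/fail-to-reject outcomes estimates the rejection probability.

```python
import random
from statistics import NormalDist, fmean


def rejects_null(sample, alpha=0.05):
    """Two-sided z-test of H0: mean = 0, assuming a known standard deviation of 1."""
    n = len(sample)
    z = fmean(sample) * n ** 0.5
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical


def rejection_rate(true_mean, n, n_sims, rng):
    """Share of simulated data sets (each of size n) in which H0 is rejected."""
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if rejects_null(sample):
            rejections += 1
    return rejections / n_sims


rng = random.Random(7)  # fixed seed (arbitrary) so reruns match

# Data generated with the null true: the rejection rate estimates type I error.
type_1_error = rejection_rate(true_mean=0.0, n=30, n_sims=5_000, rng=rng)

# Data generated under a specific alternative: the rejection rate estimates power.
power = rejection_rate(true_mean=0.5, n=30, n_sims=5_000, rng=rng)

print(f"estimated type I error: {type_1_error:.3f}")
print(f"estimated power:        {power:.3f}")
```

With these settings the first estimate should land near the nominal 0.05, and the second should be much higher. Swapping in a different test or a data generator that violates the test's assumptions is how a methodological comparison would extend this sketch.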

Info

Channel: Very Normal

Views: 310,417

Keywords: biostatistics, statistics

Id: r7cn3WS5x9c

Length: 13min 34sec (814 seconds)

Published: Wed Jan 31 2024