The most important ideas in modern statistics

Captions
Most people only ever interact with statistics for a limited part of their lives, but statistics is a field of research. Like other areas, statistics has evolved: influential ideas have come along and changed the trajectory of the field. As a student of biostatistics, it's my responsibility to be familiar with these revolutionary ideas. In this video we'll talk about eight innovations in statistics that have shaped how we know it today, and I'll do my best to explain what these innovations are and why they were so impactful in a way that makes sense to a general audience. If you're new to the channel, welcome. My name is Christian, and my goal is to make statistics accessible to more people so that they can apply it to their daily lives.

In 2021, Andrew Gelman and Aki Vehtari published an article in the Journal of the American Statistical Association, or JASA. JASA is one of the most prestigious journals in the field of statistics, so publishing here is a big deal. But instead of a research manuscript, Gelman and Vehtari published an essay titled "What are the most important statistical ideas of the past 50 years?", and this article is what motivates this video. But two statisticians do not make up an entire field of statistics, so what gives these two authors the authority to answer such a question? The essay was meant to be thought-provoking, not authoritative, though I would argue that both Andrew Gelman and Aki Vehtari are in fact authorities in the field. They are widely known among practitioners of Bayesian statistics, since they basically wrote the Bible on it: Bayesian Data Analysis. As of the writing of this video, Andrew Gelman is a professor at Columbia University in both statistics and political science, and Aki Vehtari is a professor of computational probabilistic modeling at Aalto University in Finland. Andrew Gelman also maintains a fantastic blog on statistics, political science, and their intersection, which I highly recommend. The article considers statistical innovations that happened from around 1970 to 2021, so this is the time period I'm calling "modern statistics." Without further ado, let's have a look at the list.

In an ideal world, all data comes from experiments, where a researcher can control who receives an intervention and who doesn't. When we can do this in a carefully controlled manner, such as in a randomized controlled trial (RCT), we can claim cause and effect between an intervention and some outcome of interest. But we live in the real world, and the real world sometimes gives us observational data, where we can't control who receives a treatment and who doesn't. We can still perform statistical analyses on observational data, but we cannot make the same causal claims about them, only correlational claims. That was until counterfactual causal inference came onto the scene. This framework allows us to take observational data and make adjustments in a way that gets us closer to causal statements. How this works is the topic of an entire other video, so I'll give you the basic breakdown. Let's consider a world where I have an upcoming test: I can choose to study a bit more, or I can choose not to. In this reality I choose to study, and I get some score on the test later, which I'll denote Y1. If a supernatural statistician wanted to know whether this decision caused a change in my score, they would have to examine another reality: the one where I didn't choose to study, and measure the test score of that version of me who didn't study. I'll call that outcome Y0. The only difference between these two versions of me is that I chose to study in one but not in the other. This unobserved version of myself is called the counterfactual, because it is counter to what actually, or factually, happened. The causal effect of studying on my test score is then the difference between Y1 and Y0. The fundamental problem of causal inference is that we can only ever observe one reality, and therefore one outcome; in essence, it's a missing data problem.
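To make the missing data framing concrete, here is a minimal R sketch of my own (not code from the video), with made-up numbers: it simulates both potential outcomes for a group of hypothetical students and then hides the one we never get to observe.

```r
set.seed(1)
n <- 100
# Hypothetical potential outcomes: Y0 = score without studying, Y1 = score with studying
y0 <- rnorm(n, mean = 70, sd = 10)
y1 <- y0 + 5                   # the "true" causal effect of studying is 5 points
studied <- rbinom(n, 1, 0.5)   # who actually chose to study in our single reality

# In the real world we only ever observe one potential outcome per person
y_obs <- ifelse(studied == 1, y1, y0)

mean(y1 - y0)   # true average causal effect, only knowable because we simulated both worlds
mean(y_obs[studied == 1]) - mean(y_obs[studied == 0])   # what we can estimate from observed data
```

Because `studied` is assigned at random here, the simple difference in observed means recovers the causal effect; with observational data, confounding would break that link, which is exactly the gap the counterfactual framework is designed to address.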
The counterfactual framework is important because it gave statisticians a way to formalize causal effects in mathematical models. This is significant because several fields of study, such as economics and psychology, are prone to having mostly observational data.

The next idea is the bootstrap. If you've been with my channel for a while, you may be familiar with this one already; that video delves into more technical detail, but I'll briefly explain what it is here. The bootstrap is a general algorithm for estimating the sampling distribution of a statistic. Ordinarily, this would require gathering multiple data sets, which no one has time for, or it would require a mathematical derivation, which I don't have time for. Rather than do either of these, the bootstrap takes the interesting approach of reusing data from a single data set. The bootstrap generates several bootstrap data sets by sampling with replacement from the original. For each of these bootstrap data sets, a statistic of interest is calculated, and its distribution can be derived from this entire collection. This is incredibly valuable not only because it's super simple, and therefore easy for more people to use, but because it's applicable to many kinds of statistics. We can use the bootstrap to create confidence intervals for point parameters, like a regression coefficient, or we can create confidence bands for coefficient functions, like we might see in functional data analysis. The bootstrap is significant not only because of its usefulness but because it highlights the importance of computation in statistics. A quote from one of my heroes is very relevant here: "You see, killbots have a preset kill limit. Knowing their weakness, I sent wave after wave of my own men at them until they reached their limit and shut down." Instead of human lives, statisticians can do a lot just by using wave after wave of our own computers' processing power.
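As a minimal sketch of the idea (my own example with simulated data, not code from the video), here is a percentile bootstrap confidence interval for a regression coefficient using nothing but base R:

```r
set.seed(2)
# Hypothetical data: an outcome that depends linearly on a treatment indicator
n <- 200
treatment <- rbinom(n, 1, 0.5)
outcome <- 2 + 1.5 * treatment + rnorm(n)
dat <- data.frame(outcome, treatment)

# Resample rows with replacement and refit the model many times
B <- 2000
boot_coefs <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)
  coef(lm(outcome ~ treatment, data = dat[idx, ]))["treatment"]
})

# Percentile bootstrap 95% confidence interval for the treatment coefficient
quantile(boot_coefs, c(0.025, 0.975))
```

The same recipe works for almost any statistic you can compute from a data set, which is what makes the bootstrap so widely applicable.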
The rise of computational power has also made it easier to perform simulations, and simulated data allows us to assess experiments and new statistical models. For example, simulations can be used to assess the power and type I error of experimental designs for clinical trials without actually needing to run them, which means a lot of money and effort is saved for pharmaceutical companies. Another example of simulation-based inference comes from Bayesian statistics. Bayesians encode knowledge in the form of prior probability distributions on parameters. Using these priors, we can actually simulate data from the prior distribution and check whether the data we collected makes sense under it; this is called a prior predictive check. The same can be done with the posterior distribution of a parameter, which makes it a posterior predictive check, and these are incredibly useful for validating our models.

The next idea is overparameterized models and regularization. To understand it, we need some context on statistical parameters. One way to view parameters is that they are representations, within statistical models, of ideas that are important to us. In a two-sample t-test, the mean parameter represents the difference between two groups, such as a placebo group and a treatment group. In linear regression, we're interested in the coefficient associated with treatment, which represents the change in the outcome associated with the treatment. Statistical models are approximations of the real world, but we can actually change our models to match the real world a little better. One way to do this is by increasing the number of parameters in the model. Consider simple linear regression: it tells you that the distribution of an outcome shifts according to this coefficient. But what if we expect this change to vary over time? In the current model there is no parameter for time, so the model simply can't capture this complexity. We can move up a level by incorporating more parameters into the model, adding a coefficient for time and another for the interaction between time and treatment. What if we suspect that each individual in the study will react differently to the treatment? The current model says that a single parameter explains the change for the population on average. To give everyone their own subject-specific effect, we can make the model even more complex and turn it into a mixed effects model. More parameters, more flexibility. Overparameterized models take this idea to the extreme: make the model extremely flexible by adding tons and tons of parameters. Neural networks are a prime example. Each edge in a neural network is associated with a parameter, or weight, along with some extra bias parameters. We can easily overparameterize by making these networks very large, and by doing so, the universal approximation theorem tells us that these networks can approximate a wide variety of functions. This extra flexibility is important because it lets us model a wider range of phenomena that simpler models just can't handle. One problem with extremely flexible models is that they may start to approximate the data itself rather than the more general phenomenon we want to learn about. Statisticians employ regularization techniques, which balance out this complexity by enforcing that these models maintain some degree of simplicity.
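As a rough illustration of that trade-off (a sketch of my own with made-up data, not from the video or the article), here is an overparameterized polynomial regression in base R, with a ridge penalty added through its closed-form solution to shrink the coefficients back toward something simpler:

```r
set.seed(3)
# Hypothetical example: a high-degree polynomial is flexible enough to chase noise
n <- 50
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Overparameterized design matrix: intercept plus a degree-20 polynomial basis
X <- cbind(1, poly(x, degree = 20))

# Ordinary least squares: very flexible, free to fit the noise
beta_ols <- solve(crossprod(X), crossprod(X, y))

# Ridge regression: adding lambda * I penalizes large coefficients
# (for simplicity the intercept is penalized here too)
lambda <- 0.1
beta_ridge <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))

# The ridge fit uses noticeably smaller coefficients than the unpenalized fit
c(ols = sum(beta_ols^2), ridge = sum(beta_ridge^2))
```

In practice the penalty strength lambda would be chosen by something like cross-validation rather than fixed by hand; the point here is only that the penalty keeps an extremely flexible model from memorizing the noise.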
Multilevel models, also known as hierarchical or mixed effects models, are models that assume additional structure over the parameters. For example, multilevel models are commonly used to aggregate several N-of-1 trials together. Each individual is associated with their own treatment effect, which we'll denote theta_j to indicate that each individual has their own effect; these individuals form the second level of the model. The first level can be thought of as describing the distribution, or structure, of these individual effects. In the N-of-1 context, the first level might be a normal distribution centered at some population treatment effect theta with some variance sigma squared. In other contexts, the units of the second level could be different things. In a study taking place over many locations, they may be different hospitals or cities, something that indicates a cluster of related units. In a basket trial, each second-level unit is a specific disease, and we suspect that their treatment effects will be similar because they share a common mutation. In meta-analyses, the second-level units could be estimated effects from individual research studies. Andrew Gelman says that he used multilevel models as a way to combine different sources of information into a single analysis. This kind of structure is incredibly common in statistics, and that's why multilevel models take a spot on the list. Multilevel models can be both frequentist and Bayesian, so why is Bayesian specifically mentioned in the article? My guess is that the Bayesian framework allows us to incorporate prior knowledge into the models. This is especially helpful when deciding on priors for the first-level parameters, especially the variance. If you choose a wide, uninformative prior, it encourages the resulting model to treat second-level units as being independent of each other. On the other hand, choosing a narrow, informative prior allows us to pool data together, which can help us estimate treatment effects for second-level units with small sample sizes. Being able to choose different priors gives statisticians much more flexibility in the modeling process.

A recurring theme among the top eight ideas is the importance of computers and computational power to the development of statistics. Advances in technology have allowed more complex models to be invented for harder problems, and several important statistical algorithms have been invented to help solve them. An algorithm is just a set of steps that can be followed, so a statistical algorithm is an algorithm designed to help solve some statistical problem. There are so many types of statistical problems out there that it's hard to get an appreciation for how useful these algorithms are, so I'll explain two to give you a taste. The expectation-maximization algorithm, or EM algorithm, is famously known from a 1977 paper in the Journal of the Royal Statistical Society, another heavy-hitting journal in statistics. The EM algorithm solves an estimation problem, which is where we need to use data to compute educated guesses about the values of parameters in a model; maximum likelihood estimation is another example of an estimation approach. What makes the EM algorithm distinct is that it tries to estimate the parameters of a model that we can't solve directly. One instance where this happens is in mixture models with so-called latent classes. In this type of model, we have data that may come from one of several groups, but we don't have the group labels to tell us who belongs where. Without delving into the details, the EM algorithm gives us a way to estimate the parameters of this model despite not knowing these classes. The second example is the Metropolis algorithm and its more modern descendants. The Metropolis algorithm is interesting because its roots actually stem from physics rather than statistics. It's significant because it lets us generate samples from very complex probability distributions. Random number generation according to some distribution may seem like a strange goal, but it's important for statisticians to be able to do it. The posterior distribution that comes from Bayes' rule can turn ugly if we turn away from conveniences like conjugate families; the posterior can be so ugly that we can't even derive an equation for it. Despite this, we can still generate samples from a complicated posterior thanks to the Metropolis algorithm. Even without a formula for the posterior distribution, we can use the generated samples to recover important quantities about it, such as the mean, quantiles, and credible intervals. These are just two examples mentioned in the article; there are many I couldn't cover, and still more that have been developed since the article was written.
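To give a flavor of how little machinery this takes, here is a minimal random-walk Metropolis sampler in base R (my own sketch with simulated data, not code from the video or the article), drawing from a posterior we only know up to a constant:

```r
set.seed(4)
# Hypothetical setup: normal likelihood with unknown mean mu and a Cauchy prior on mu,
# a combination without a convenient conjugate (closed-form) posterior
y <- rnorm(30, mean = 2, sd = 1)
log_post <- function(mu) {
  sum(dnorm(y, mean = mu, sd = 1, log = TRUE)) + dcauchy(mu, log = TRUE)
}

n_iter <- 10000
draws <- numeric(n_iter)
mu_current <- 0
for (i in seq_len(n_iter)) {
  mu_prop <- mu_current + rnorm(1, sd = 0.5)            # propose a small random step
  log_ratio <- log_post(mu_prop) - log_post(mu_current)
  if (log(runif(1)) < log_ratio) mu_current <- mu_prop  # accept with probability min(1, ratio)
  draws[i] <- mu_current
}

# The accepted draws approximate the posterior, so we can summarize it directly
keep <- draws[-(1:1000)]                 # discard a burn-in period
mean(keep)                               # posterior mean
quantile(keep, c(0.025, 0.975))          # 95% credible interval
```

Modern descendants such as Hamiltonian Monte Carlo refine the proposal step, but the accept-or-reject logic above is the core idea.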
When statisticians designed experiments, it used to be a set-and-forget type of thing: figure out the sample size and just run the experiment to completion. But midway through the experiment, we might need to stop it. Under a frequentist framework, this would hurt our power and our p-value interpretation, but in modern times we have a way to account for it. Adaptive decision analysis is the idea that maybe we don't have to wait for the entire experiment to finish; instead, we can adapt our experiment based on data we collect in the interim. In the context of clinical trials, we may decide to stop a trial early if preliminary evidence suggests that the treatment sucks; conversely, if a treatment shows early promise, we can even stop based on efficacy. These changes still have to be decided ahead of time, to make sure that we make good decisions overall and that the trial is well designed.

Statisticians have to make a lot of assumptions. If these assumptions are right, or at least plausible, then we can feel comfortable trusting the results of statistical analyses, things like confidence intervals or estimated values. But of course, assumptions won't always be right, and it's often hard to even know whether they are or not. That's where robust inference comes in. Robust statistics still provides trustworthy statistical analyses even in the face of violated assumptions; if we have a robust model, then we don't have to be so reliant on possibly shaky assumptions. The sample median is often cited as a robust estimator of a typical value in a distribution, compared to the mean. We often hear that the mean is unduly influenced by outliers in a data set, and this is true, but what assumption do outliers violate? Many times we assume a distribution to be normal. Normal distributions have the property that most of their probability is concentrated near the mean; you often hear this phrased as the 68-95-99.7 rule. Outliers challenge this concentration. If there can be many outliers, it poses a danger that the data may come from a so-called heavy-tailed distribution, where extreme events are more likely, and this would violate the normality assumption. In causal inference, there's a technique called propensity score matching, which is used to match people in a treatment group to people in a control group who are very similar to them; by doing this, you can produce estimates that better resemble a causal effect. Propensity score matching requires two models: one model to estimate the effect of the treatment on the outcome, and another to produce a score that is used to match people together. Both of these models have to be correctly specified for the results to be useful. Correct specification essentially means that we chose the right model for its purpose, but this is almost never the case. To account for this, there are robust versions of propensity score matching that allow one of these models to be wrong. The fewer assumptions we have to make, the better.
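Going back to the median-versus-mean point above, here is a quick base R simulation of my own (not from the video) showing that when data come from a heavy-tailed distribution, the sample mean is far less stable than the sample median:

```r
set.seed(5)
# Draw many samples from a heavy-tailed distribution (t with 2 degrees of freedom)
# and compare how stable the mean and the median are across samples
sims <- replicate(5000, {
  x <- rt(100, df = 2)   # heavy tails: occasional extreme "outliers" are expected
  c(mean = mean(x), median = median(x))
})

# Both estimators target the center (0), but the mean varies far more from sample to sample
apply(sims, 1, sd)
```

That extra stability under violated assumptions is exactly what "robust" is meant to capture.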
The last idea is exploratory data analysis. Yes, you read that right: we're done with the theory, we're done with the computation, and we're going back to plots and visuals. Plots give us a way to examine our data and assess our statistical models; it's just easier to learn from your data if you can look at it rather than leave it sitting in a CSV. This skill is undeniably an important part of any statistician's or data scientist's toolkit. There's even an entire paradigm of R programming dedicated to formalizing exploratory data analysis: there are people who code in boring base R, and then there are people who code using the tidyverse framework popularized by the god Hadley Wickham. The tidyverse set of packages makes it extremely easy to get your data into R, clean it, and visualize it. I highly recommend learning it, and I hope to have a more in-depth video on it in the future.

What does it mean for an idea to be important? At first I thought a statistical idea would be important if the paper that introduced it was cited many times, but this was not the case: the authors specifically mention avoiding citation counts. Rather, they view important ideas as those that have influenced the development of later statistical ideas and influenced statistical practice. I highly recommend reading the original article. It's free to read online and on Andrew Gelman's blog; you can just Google "most important ideas in statistics" and look for his name. This video only covers part of the article, and it's full of citations, so readers can pick it up and read more about any particular bullet point they're interested in. Other articles have even performed actual statistical analyses to answer this question. If you think the authors missed a cool idea, tell me about it in the comments. I hope I've shown you that statistics didn't stop with the two-sample t-test and linear regression: new technologies create new types of data, so statistics needs to innovate to keep up. If you think I've earned it, please like the video and subscribe to the channel for more. I've also started a newsletter to accompany the YouTube channel, so that people can get my videos delivered straight to their inbox. I'll see you all in the next one.
Info
Channel: Very Normal
Views: 106,284
Keywords: statistics, biostatistics
Id: nCyGhqQWj2g
Length: 18min 25sec (1105 seconds)
Published: Sun Oct 29 2023