The biggest prize in statistics

Captions
Some people make a lasting mark on the world, and we thank them by honoring their achievements. Every year we honor six groups of people with the Nobel Prize: there are Nobel Prizes for physics, chemistry, medicine, literature, economics, and even peace. But I guess no one told Alfred Nobel that fields like mathematics or computer science also exist. That's all right, though; people in these fields started their own awards and just asked that everyone think of them like a Nobel Prize. The Fields Medal is considered to be the Nobel Prize of mathematics, and the Turing Award is the Nobel Prize of computer science. That's cool and all, but you clicked on this video because you wanted to learn about the Nobel Prize of statistics. I could be a little biased here, but I think that statistics is just as important as math or computer science. All areas of science produce their own types of data, so at some point everyone is going to need statistics in some form or another. So is there a Nobel Prize of statistics or not? And if there is, what does it take to win it? Today you're going to learn the answers to both of these questions.

If you're new here, welcome to the channel. My name is Christian, and I'm a PhD candidate in biostatistics. You're watching Very Normal, a channel for making you better at statistics; maybe not as good as the people in this video, but still better than your friends. Let's see what this award is all about.

The Nobel Prize of statistics is called the International Prize in Statistics. They couldn't think of a famous person to name the prize after, so they just gave it a complicated name, which, if you think about it, is pretty par for the course for statisticians. Don't feel bad if you've never heard of it before: the prize only just started in 2016, and it's only awarded every other year, so at the time I'm editing this video, only four people have received it. People who win the Nobel Prize win about a million dollars, a 24-karat green gold medal, and a diploma. Winners of the Fields Medal also get $15,000, and winners of the Turing Award get a few hundred grand. So what do the winners of the International Prize in Statistics get? A cool $80,000 and what appears to be a nice piece of paper. Wow.

To be considered for the prize, your peers have to submit a nomination packet in your name. The prize can be awarded to teams or organizations, but so far it has gone to individual people. These people are recognized for a single work or a body of work that has significantly shaped the field of statistics, and finally, the winner needs to be alive at the time of selection. All nominations are considered by a selection committee formed by the five biggest statistical associations in the world, and a winner is selected near the end of the year. So that's the award, but what exactly does it take to win it? To answer that, let's look at the past winners and delve into the work that earned them a Nobel Prize equivalent.

The inaugural prize was given to Sir David Cox in 2016 for his groundbreaking paper on the proportional hazards model. Sir David Cox was a professor of statistics at several universities, including Imperial College London and Oxford University, and I'm not calling him "Sir" to be polite: he was actually knighted in 1985. Unfortunately, he passed away in January of 2022, but he leaves behind a vast legacy that statistics students still learn from today. In 1972, Sir David Cox published a paper in the Journal of the Royal Statistical Society, Series B, titled "Regression Models and Life-Tables." The paper deals with the problem of failure times, also known as survival or time-to-event data.
Time-to-event data looks at the time it takes for some pre-specified event to happen, and it's distinct from a regular continuous outcome. For example, this time could be how long it takes for your new laptop to break, or the time it takes for you to find a parking space at the mall. It also includes the ultimate outcome for us all: our own mortality. Cox's paper was significant because it gave us a tool to examine the effect of a variable on the time to an event, whether it will shorten or extend this time. It doesn't take a statistician to recognize that this type of model is a really important tool to have.

Cox's model is a regression, meaning it models the relationship between some independent variable X and some time-to-event variable T using a well-defined formula. It has two components. The first is a function of time, commonly denoted as lambda of t. The second component is an exponential function that contains both X and a parameter. This type of model is also known as semiparametric, because it contains both a nonparametric and a parametric component. The part we're actually interested in is the parameter, and I'll explain why in a bit.

This lambda of t is called the hazard function, and it's one of the ways that we can describe the time to an event. There's no time to get into the gritty details of survival analysis, so you'll just get the bare essentials here. The expression shown on screen is the formal definition of the hazard function. I know it looks a little crazy, but I'll walk you through each of the bits. The numerator is expressed as a conditional probability: given that a part of the population hasn't experienced the event yet, this is the probability that they'll experience it in the following window of time, which is the delta t. That delta t also appears in the denominator, and the limit indicates that this window of time should approach zero. If it's been a while since your last calculus class, you would interpret this whole expression as a rate: the hazard function describes the so-called instantaneous rate at which the event is happening at time t.
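The formulas being pointed to here appear on screen in the video rather than in the captions; presumably they are the standard definition of the hazard function and the standard form of the Cox model, which look like this:

```latex
% Presumed on-screen formulas (standard notation, reconstructed rather than taken from the captions).
% The hazard function: the instantaneous event rate at time t,
\lambda(t) = \lim_{\Delta t \to 0}
  \frac{P\left(t \le T < t + \Delta t \mid T \ge t\right)}{\Delta t}
% The Cox proportional hazards model: a baseline hazard \lambda_0(t)
% scaled by an exponential term in the covariate x and parameter \beta,
\lambda(t \mid x) = \lambda_0(t)\, e^{\beta x}
```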
I know that's a lot, so it's worth looking at a specific hazard function so that we can work out the logic of what it's supposed to represent. Let's say that the event we're studying is death in some population. This means that T represents the time until someone passes away, and a plausible hazard function for this event could look like this. Let's see what happens as a person gets older in this population. When we're born, we're really fragile; there could be a lot of things that are fatal to newborns, so we would expect more mortality at the start of this population's life. But after some point we're good, with healthy, robust bodies. The rate of death is really low in the population at this time, so the hazard function is barely above zero here. Of course, that doesn't last forever: at an advanced enough age there are more and more things that can harm us as our health and bodies get more fragile, so we also expect the rate of death to increase as older people get older. This shape is so common that it has a name: the bathtub curve. You can also have hazard functions that are purely increasing or decreasing, which can happen with different types of events.

Looking back at the Cox model, the lambda term represents the hazard function for the reference group, who have an X value of zero; for example, this could be a placebo group in a clinical trial. The only way that this hazard function can be changed is through the exponential term. If someone is on active treatment, then X takes a value of one, and we get the product of the original hazard function multiplied by a constant. Depending on the value of the parameter, it can change the shape of the hazard function in different ways. For simplicity, we'll say that the baseline hazard function is increasing. If the treatment is beneficial, the multiplier will be less than one, and this will stretch the hazard function, causing it to rise more slowly: it would take a longer time for someone on treatment to reach the same hazard as someone in the placebo group. Conversely, if the treatment is actually harmful, the hazard function will contract, leading the hazard to rise faster for the treatment group. The structure of Cox's model assumes that the hazard functions for the two groups can be written in this proportional form, and that's where the model gets its name: the proportional hazards model. I described how this model might work for a single binary variable, but the same logic extends to continuous variables or multiple predictors. Cox's paper is so important that it appears on Nature's list of the top 100 most cited papers of all time, and it has allowed us to identify factors that shorten or extend our life here on Earth.
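To make this concrete, here's a minimal sketch of fitting a proportional hazards model in R with the survival package. The simulated data set and variable names are made up for illustration; the coxph() call is the standard way this model is fit in practice.

```r
# Minimal sketch: fitting a Cox proportional hazards model in R.
# The data frame and variable names below are hypothetical.
library(survival)

set.seed(1)
n <- 200
treatment <- rbinom(n, 1, 0.5)
# Treatment halves the event rate in this simulated data (hazard ratio ~ 0.5).
event_time  <- rexp(n, rate = 0.10 * exp(log(0.5) * treatment))
censor_time <- rexp(n, rate = 0.05)
df <- data.frame(
  time      = pmin(event_time, censor_time),
  status    = as.numeric(event_time <= censor_time),  # 1 = event observed, 0 = censored
  treatment = treatment
)

# Fit the model; exp(coef) in the summary is the estimated hazard ratio,
# with values below one meaning the event is slowed for the treated group.
fit <- coxph(Surv(time, status) ~ treatment, data = df)
summary(fit)
```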
The second prize was awarded to Bradley Efron in 2018 for his work on the bootstrap. He's currently a professor emeritus in statistics and biostatistics at Stanford University. One of my first breakthrough videos was on the bootstrap, so his method will always have a special place on this channel. The bootstrap was first published in 1977 in a manuscript titled "Bootstrap Methods: Another Look at the Jackknife," and this paper gave rise to a simple but deceptively powerful idea.

When we're interested in studying a population, we collect data on them. For example, you might want to know how long they typically live, or what proportion of them have a chronic disease. From the data, we create an educated guess, or an estimate, of the population characteristic we want to study. Often we only collect one data set, but statisticians know that a slightly different data set would create a slightly different estimate. It's not enough just to get that single estimate; it's important that we can characterize how this estimate might change with different data sets. In other words, we also need to understand the uncertainty, or variance, in our estimate. To understand this uncertainty, we need to know the probability distribution of the estimate, which usually means we need to collect different data sets. But collecting data can be very time-consuming or expensive, so most of the time we only have one data set, and that's it. You can't study uncertainty with just one data set, so we have to turn to statistical theory. For example, we often rely on the central limit theorem to give us the probability distribution of our estimate. One problem with theorems is that they often come with assumptions and requirements, so if these can't be met, then we can't safely use the distribution guaranteed by the theorem.

This is where the bootstrap comes in. Instead of trying to gather different independent data sets, Efron's idea was to create new data sets from the original data itself; this is where the bootstrap gets its name. We create so-called bootstrap data sets by sampling from the original data with replacement. This means that it's possible to have duplicate observations in the same bootstrap data set, but that's not really a problem: from the point of view of the bootstrap, if one observation came from the population, it's not impossible to observe a second observation that's close to, or even the same as, the value we just got, and this can extend to multiple observations as well. From each bootstrap data set we can calculate a bootstrap estimate. This results in a collection of estimates that are all slightly different, since they use different data. We can then use this collection of estimates as a stand-in for the true probability distribution of the estimator and, by extension, use it to characterize its uncertainty. You can calculate other useful values, like a bootstrap p-value or a bootstrap confidence interval, and use these to conduct valid hypothesis tests. The bootstrap can even work for more complicated estimators, like functions; you can see an example of this in Netflix's technical blog, where they use an advanced version of the bootstrap to characterize the uncertainty in a quantile function. The bootstrap is useful not only because of what it can do, but because of how accessible it is to non-statisticians. Statisticians come up with fancy models all the time, but they're only really useful if the statistician is there to teach other people how to use them, or if the method is simple enough for a general audience to use. With the proliferation of powerful computers and laptops, the bootstrap is now in everyone's hands.
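Here's a minimal sketch of the nonparametric bootstrap in base R, using the sample median of a made-up data set as the estimator; the numbers are simulated purely for illustration.

```r
# Minimal sketch: the nonparametric bootstrap in base R.
set.seed(1)
x <- rexp(50, rate = 0.2)   # the one (hypothetical) data set we actually observed
estimate <- median(x)       # the estimate we want uncertainty for

# Resample the data with replacement many times and recompute the estimate.
B <- 2000
boot_estimates <- replicate(B, median(sample(x, size = length(x), replace = TRUE)))

# The spread of the bootstrap estimates stands in for the sampling distribution.
boot_se <- sd(boot_estimates)                         # bootstrap standard error
boot_ci <- quantile(boot_estimates, c(0.025, 0.975))  # percentile confidence interval
estimate; boot_se; boot_ci
```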
Before I get into the next winner, let me ask you a question: without looking it up, can you name at least one famous woman in statistics? I'll give you a few seconds. If you couldn't, that's too bad, but after this video you should have an answer. The third prize was awarded to Nan Laird in 2021 for her work on the random effects model. She's currently a professor emerita in biostatistics at Harvard University. Laird's seminal paper on the random effects model was published in Biometrics in December 1982. It was framed to tackle the problem of longitudinal data, data that's observed repeatedly over time, but as I'll explain later, longitudinal data is a problem that's much more common than you might expect.

Let's say that we're studying the effect of pollution on a child's lung function. It's thought that this effect will change over time, so we'll also need to gather data from these children as they grow up. The end result is multiple observations per child in your data, and this is a problem. Since we're looking at the relationship between pollution and lung function, the problem calls for a regression, but a linear regression assumes that the errors in your data are independent and identically distributed, the well-known IID assumption. If two observations come from the same person, do you think it's plausible that their errors are independent of each other? No: data from the same person is very likely to be correlated, so longitudinal data inherently violates the IID assumption, and therefore linear regression is not a good solution here.

Cue the random effects model. It has a form similar to a standard linear regression, but it has some additions that make it much more interesting. The random effects model is what's known as a two-stage model, or a hierarchical model, and to make sense of it we have to understand what these two stages represent. In a standard linear regression, all the children are modeled to have the same treatment effect, represented by a single parameter. Think of this as a population-level effect: the change in lung function that this population of children will experience thanks to pollution. The random effects model goes a step further: not only do the children experience an overall change as a group, they're also allowed to vary slightly from this fixed change. In the model, you can represent this using a sum: the fixed effect plus some subject-specific deviation, which in total equals the child-specific effect. It's not just for the treatment effect, either; the children can also be modeled to differ in their baseline lung function. These deviations are also known as random effects, and they give the model its name. The model states that there's an average effect that all the children will experience, but each child will have a specific effect that deviates slightly from this average. These deviations themselves are assumed to have a normal distribution with its own variance. While we don't directly see what each child's specific deviation is, we do observe repeated outcomes for each of them. The treatment effects form the first level of the hierarchy, while the distribution of each child's data forms the second level. Without getting into the technical details, by specifying a distribution on the child-specific effects, the model can actually account for the correlation that comes from repeated measurements.

Like I mentioned earlier, Nan Laird's method was originally framed for longitudinal data. With this type of data, observations are correlated because they come from the same person; think of an individual person here as a cluster. Depending on the context, what defines a cluster can change: clusters could be people, locations, or even something abstract like an individual research study. This is the idea behind the meta-analysis, widely considered to be the highest form of scientific evidence, even beyond that of the hallowed randomized controlled trial.
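As a rough sketch, here's how a random intercept and slope model for the lung-function example might be fit in R with the lme4 package. The data below are simulated and the variable names are hypothetical; the key piece is the (1 + age | child) term, which gives each child its own deviations from the population-level effects.

```r
# Minimal sketch: a random effects (mixed) model with lme4 on hypothetical
# longitudinal data: repeated lung-function measurements per child.
library(lme4)

set.seed(1)
n_child <- 50
n_visit <- 4
child <- factor(rep(seq_len(n_child), each = n_visit))
age   <- rep(0:(n_visit - 1), times = n_child)
pollution <- rep(runif(n_child, 0, 10), each = n_visit)  # one exposure level per child

# Child-specific deviations (the "random effects") around the population effects.
b0 <- rep(rnorm(n_child, 0, 2.0), each = n_visit)   # deviation in baseline lung function
b1 <- rep(rnorm(n_child, 0, 0.5), each = n_visit)   # deviation in the trajectory over age
lung <- 100 + b0 + (1 + b1) * age - 0.8 * pollution + rnorm(n_child * n_visit, 0, 1)
df <- data.frame(child, age, pollution, lung)

# (1 + age | child) gives each child its own intercept and slope, which is what
# accounts for the correlation among repeated measurements on the same child.
fit <- lmer(lung ~ age + pollution + (1 + age | child), data = df)
summary(fit)
```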
The fourth prize was given to Calyampudi Radhakrishna Rao, also known as C.R. Rao, in 2023. Unfortunately, Rao passed away in August 2023, but thankfully he was still able to accept the award before his passing. Before his retirement in 1991, Rao was a professor at the University at Buffalo and at Penn State. Sir David Cox, Bradley Efron, and Nan Laird each contributed a model or a method that researchers of all kinds can use, and for some people this can feel more concrete, since a model is something you can use with just a few lines of code. In contrast, the works that Rao is honored for are more on the theoretical side. Most people are probably not aware that they're benefiting from the implications of Rao's work, but I'm going to try to change that today. For brevity, I'm only going to talk about one specific work in this video: the Cramér-Rao lower bound.

When I was explaining the motivation behind Bradley Efron's work, I talked about how we take data and construct educated guesses about specific aspects of a population. The bootstrap answers the question of uncertainty behind this estimate, but there's an even deeper question behind all of this: how do you even make an educated guess in the first place? Remember that the only thing we actually see is the data, just a bunch of numbers. For students learning statistics, this point is often glossed over: they're told a formula to use for things like the sample mean and the sample variance, but they don't stop to question why we use them in the first place. As you might guess, Rao's work lies at the very heart of this question. We know that data is inherently random, so the estimates we derive from it are also random, and because of this we need to characterize the variance in our guess. As it turns out, one way to judge how good an estimate is, is its variance. We would much prefer our guesses to have small variance, so that they stay clustered together; even if the estimates vary around the true, unknown quantity that we want to know about, smaller variance helps us home in on what it could be. Conversely, if an estimate has high variance, we might get wildly different or even conflicting guesses about our quantity of interest. The key idea here is that lower variance is better.

In 1945, Rao published a short article titled "Information and the Accuracy Attainable in the Estimation of Statistical Parameters." This paper establishes that there's an absolute lower bound on the variance that any (unbiased) guess can have. This is the famous Cramér-Rao lower bound, discovered independently by Rao and the Swedish statistician Harald Cramér. The quantity appearing in the bound is called the Fisher information, which I definitely don't have time to cover here. Just know that no matter what your method is for creating estimators, the variance can't be better, i.e. lower, than the Cramér-Rao lower bound. Some of you might ask: okay, there's a lower bound on the variance for any estimator, so what? The implication of Cramér and Rao's work is that if you do have a way to create an educated guess and its variance equals the lower bound, then in a sense you have a good guess, and not just a good guess but an optimal one. Do we have ways to make these so-called optimal guesses? The answer is yes: it's maximum likelihood estimation. Any model that uses maximum likelihood estimation achieves the Cramér-Rao lower bound, at least asymptotically. Anytime you run a linear or logistic regression, you're using maximum likelihood. Cox's proportional hazards model? Maximum likelihood. The random effects model? Maximum likelihood. You might not know it, because R hides this process in the background: you get to benefit from the properties of maximum likelihood while avoiding the work of calculating the guess in the first place. Textbooks and classes often stop at the minimum variance result of maximum likelihood and don't explain why it's important: if our estimator has the smallest possible variance, it also means that we get the smallest possible confidence intervals, even before considering other factors like experimental design or sample size.
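A small simulation sketches the idea, assuming Normal(mu, sigma^2) data: the Cramér-Rao bound for estimating the mean is sigma^2 / n, the maximum likelihood estimator (the sample mean) essentially attains it, while another reasonable guess like the sample median has a noticeably larger variance. The numbers below are arbitrary choices for illustration.

```r
# Minimal sketch: the Cramér-Rao lower bound for the mean of a Normal(mu, sigma^2).
set.seed(1)
n <- 25
mu <- 5
sigma <- 2

# For n iid Normal(mu, sigma^2) observations, the Fisher information for mu is
# n / sigma^2, so the lower bound on the variance of any unbiased estimator of mu
# is sigma^2 / n.
crlb <- sigma^2 / n

# Simulate many data sets and compare two estimators of mu.
sims <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  c(mle = mean(x), median = median(x))  # the MLE of mu here is the sample mean
})

crlb                   # theoretical lower bound on the variance
var(sims["mle", ])     # close to the bound: the MLE is efficient
var(sims["median", ])  # larger: a valid guess, but not an optimal one
```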
In October of this year, nominations for the 2025 prize will open, and since I run a statistics channel, I thought it would be fun to make my own prediction. This channel has so far mostly focused on statistical inference, but we can't forget that the problem of prediction is also a statistical problem, even though it's been given many names: statistical learning, machine learning, data science, artificial intelligence. All of these deal with prediction. Given that generative AI like ChatGPT has entered the public mind since C.R. Rao was awarded the prize, I don't think it's impossible to imagine that the next award will be given to someone who has supported machine learning. My prediction for the 2025 prize is Vladimir Vapnik, one of the masterminds behind Vapnik-Chervonenkis theory, or VC theory. Machine learning and artificial intelligence today wouldn't be the same without VC theory. So that's my guess, but I can also see the committee choosing someone from causal inference or MCMC methods. If I'm wrong, I'm wrong, but if I'm right, I totally called it. I'd love to know what your predictions are for the prize, so let me know in the comments who you think it might be.

I like to think of statistics as a quiet hero of science. When it's done right, we expand our knowledge and create new technologies and medicines that help us out, but at the same time we often focus on the achievement rather than the method that made it possible. This is where the prize comes in: to honor those who have helped make these achievements possible. Even if the general public doesn't appreciate what statistics has done, a little recognition never hurt anyone. Thanks for watching the video. If you'd like to get more statistics content, then consider subscribing to the channel. You can also get notified right in your inbox if you subscribe to the channel's Substack. That's it for this one; I'll see you next time. Peace.
Info
Channel: Very Normal
Views: 26,757
Keywords: biostatistics, statistics
Id: Gh-oD6i1Zds
Length: 21min 14sec (1274 seconds)
Published: Mon Jun 24 2024