The Z Factor - Numberphile

Captions
Who would win in a basketball game between Michael Jordan's 1996-97 Chicago Bulls and Steph Curry and the 2015-16 record-breaking Golden State Warriors? We can give you the answer using maths and something called the Zed factor.

The Zed factor uses a statistical distribution called the normal distribution. A normal distribution maps a set of data which has this kind of shape: we have this bell-shaped curve, sometimes called a Gaussian. If we say this is x and this is y, then the center is at what we call mu, which is going to be the mean, so the average value of our distribution. The peak's going to be there, most data points will be near the mean, and then as we move away things get less and less likely. A really common example would be the height of people in the UK, because I'm in the UK now. There's an average height, and then you've got a few people who are really small and a few people who are really tall, but most of them are grouped around the mean.

The power of the normal distribution is that it amazingly works for sets of data that you wouldn't even begin to think it would possibly work for. There's all kinds of mathematical theory behind this. There's something called the central limit theorem; not going to talk about that today, but, homework, it's really cool. It allows you to use a normal distribution in all kinds of really interesting situations.

So we've got the peak, the central bit in the middle, at the mean, and then we have a measure of the spread called the standard deviation, and we call this sigma. If sigma, the standard deviation, is big, the data is nice and spread out. If I were to draw one with a small standard deviation but the same mean, then it might look a bit more like this. So this one would have a smaller sigma, a smaller standard deviation; it's more clustered around that mean, whereas for the one in black the standard deviation is bigger and it's more spread out. Given a data set you can work out the standard deviation. There's a formula,
right, that measures the distance of each data point from the mean.

When we plot this graph for a normal distribution, we are plotting a mathematical function. It is important to remember that this is centered on the mean, so this is not x equals nought; this is centered on the average value of the data. What's actually being plotted here is the following. We're plotting a function of x which is rather complicated: 1 over the square root of 2 pi, we also have a sigma on the bottom there, and then we do the exponential, e to the power of minus 1 over 2 sigma squared, times (x minus mu) squared. So f(x) = 1/(σ√(2π)) · e^(−(x − μ)²/(2σ²)). That is that graph, where the center is at mu and the spread is controlled by sigma.

This represents a probability. It's a statistical distribution, so the area under the curve has to have a total of one, because the total probability is one. What this means is, if I took a value here, let's call this x1: suppose this is height, and the average is like 5'8", and suppose this is like six foot. If I wanted to know the probability in the UK population of being smaller than six foot, I need to add up all of this area up to that point. So the area under your distribution gives you the probability; you add up all of that area. That's why for the red one I drew the peak going taller, because again the total area under it, even though it's narrower, still has to total one.

So if we want to know the probability up to this point x1, we know the total area is one. If we wanted to know the probability that X was less than or equal to x1, so we can be anywhere all the way down to minus infinity, up to this x1 value, we have to integrate, because that's how we find the area under a curve. We know the function f(x), so we would be integrating from minus infinity, all of these values are allowed, all the way up to a maximum of x1, and we want the area under this curve, so we integrate f(x) between those two points. The key relationship is: we want to work out the probability up to a certain
point. This is called the CDF, the cumulative distribution function. That's going to be important. We just simply integrate up to this point. What it also means is, if we have the CDF and we differentiate it, the reverse of integrating, we get back this one, which is called the PDF, the probability density function. So you integrate or differentiate to move between the two of them. [Music]

So what does all of this have to do with basketball? The idea is we're going to take the data for that particular season, the number of wins or win percentage of a team in that particular season, and we're going to try and create a statistical distribution. This is an assumption, this is a mathematical model, but if you look at the distribution of wins in a regular NBA season, or any other sport, there's an average number of wins that a lot of teams are clustered around, there's the champions who win loads down here, and there's the really bad teams who win very few. So it's actually really, really good: most sports and most data, as long as you have a lot of games, fit really nicely into this normal distribution model. So this is what we're going to use to allow us to construct a distribution for each season and then compare them.

So the trick is, we construct a normal distribution for the 1996-97 NBA season and find, you know, the Chicago Bulls are somewhere down here. We construct a separate, different normal distribution for the Golden State Warriors in 2015-16; they're also somewhere over here. But how do we compare them when they're both on different graphs, across the different eras? Who knows, maybe the Bulls were playing weaker teams. Maybe all the other teams were rubbish in 1996-97, so who really cared, they were obviously going to win all their games, right? Again, you can have an opinion on this, but we want to get it down mathematically. So we've got two separate distributions, one for each team in each season, but what we can do is reduce each of them to what we call the
standardized normal distribution, and that then allows us to compare what position they are in on the standard one. In order to be able to do that, we need to figure out how we turn a general normal distribution into the standardized one, and what the standardized one is.

Take the normal distribution we have here, with this complicated PDF: my random variable X is normally distributed with mean mu and variance sigma squared, so the variance is just the standard deviation squared. Now the standard normal, often referred to as Zed, and this is where I think the Zed factor comes from, is a normal distribution with mean zero and standard deviation one. So the question now becomes: how do we turn this graph, the most general possible normal distribution, into an N(0, 1) distribution? The way we do it is you say Z is equal to X minus mu, so we subtract the mean away, which hopefully you can see would turn this center point to zero; X minus mu shifts the graph to the left by mu. And then we divide by sigma, so we rescale by one over sigma. So Z = (X − μ)/σ; this is the standardized normal.

So you take your data point from whatever season it might be, the 96-97 Bulls or the 15-16 Golden State, you subtract the mean from that season and divide by the standard deviation of that season, and then you've got a value on the normal N(0, 1), which allows you to actually then compare. And that is exactly what the Zed factor is: it's your data point minus the average from the whole set, divided by the standard deviation for the whole set. This is comparing everything on the same distribution, on the same graph, and it removes this idea of there being different eras and different strength opponents. That is all factored in, because we've reduced everything to the standardized normal.

So given we have our formula for the PDF when we have mu and sigma squared, we substitute into this formula where sigma is 1 and mu is zero. What I'm going to get is 1 over the square root of 2 pi, sigma disappears because it's one, we've then got the
exponential of minus 1/2, and then mu is zero, and I've called this z instead of x, so it'll just be z squared: 1/√(2π) · e^(−z²/2). So now if we were to plot this it has the same shape as this, but comparing what's happened here, we've shifted it by mu and we've rescaled a little bit by sigma. So what does this all mean? Well, we can take our distribution, our data point, from any time period in history, for any team in any sport, and we can then reduce it to the standard normal and get the Zed factor. We can then compare those values: which one is bigger, in theory, you could argue, says which team is better.

Okay, so for a standard normal, and by this I mean the normal N(0, 1), the one we're reducing everything to: if we are one standard deviation away on either side, so this is plus or minus one, then you can work out this probability, because we know the probability of being between these two values is just the area underneath this curve. We know what the curve looks like, so we just integrate between these two values. What you will find is that being between minus one sigma and zero is a 34.13-ish percent probability, and on the other side this is the same 34.13% probability. Then if we go up to the next ones along, 2 sigma or minus 2 sigma, so plus two or minus two, this adds an additional 13.59%. Then we go up to 3 sigma and minus 3 sigma, and this gives us an additional 2.14%, and then being beyond that point you're actually at just 0.13%.

We're going to be interested in the good performers, because you can do the same for the bad ones down here, right? If you're in this little tiny thing, you're more than three sigma below the average, you are historically the worst team ever.

And we're talking about here number of wins?

Yes, in the season, and you could also do, like, number of goals, number of home runs, number of three-pointers. You just need to make sure that you define what it is, and you need to have a large data set. That is important, so you want at least 20 data points really, in
terms of teams in the league, to be able to have 20 points along the curve, ideally 30. But most competitions have about 20 at least, when it comes to football, my favorite, and I guess you've got 30 NBA teams. You also need to have a pretty good idea of that team's performance. So for example I did not do this for the NFL, American football, because they only play 16 games; it's a very small sample size to see if a team's actually good over 16 games. But in an 80-plus game NBA season you've got a lot of data points, which gives you a feel for just how good a team actually is.

So once you start getting above two sigma beyond the average, you're really seeing something quite rare. I wouldn't quite say once in a generation, but you're a good team if you're getting above two as your Z factor value; you probably won the league, put it that way, if you're getting a value up there. And then if you go to three and beyond, you're greater than 99.87% of the data, so a Z factor of over three is like a 1 in 800 team. If you've got 20 teams in the league, that's like one in 40 years, so that, you would argue, is very much a generational performance from that team, or maybe individual performance. So we're going to be looking at things from two onwards, because that's when you're getting into the really good teams.

Occasionally you will see values above three. I've never seen a value above four, despite doing this for all kinds of different sports and different data sets, but there are a few that go above three. One example that comes to mind was Pelé, the Brazilian football player. He had a season when he scored like 70-something goals in like 20 league games, which is clearly ridiculous, and that was like 3.8 on the Zed factor. So again, regardless of his competition, that was just a once-in-a-century level of performance. So you do sometimes see those, but this is the
setup. We're interested in what this Zed value is, and the bigger it is, the more of an outlier that particular team is.

So we started by asking the question: Michael Jordan's Bulls or Steph Curry's Golden State? I took the best season of both of them. The Bulls in 1996-97 had a Zed factor of 2.06, very, very good for sure. The Golden State Warriors had a Zed factor in 2015-16, which is when they set the league record for the most wins in a regular season, that came out at 2.54. If we're going to interpret this as who would win in a match, I think the maths would tell you Golden State of 2015-16 were more dominant compared to the Chicago Bulls of 1996-97.

But this doesn't take into account many, many things: obviously the way the teams match up, the way the teams perform under certain conditions. But it also doesn't take into account how much they won their games by. The Chicago Bulls could have won every game by 40 points and the Warriors squeaked a few by one or two points.

Oh, absolutely. That's a very good question and a very interesting point you raise, because the 2016-17 Golden State Warriors are deemed by many basketball analysts to be better, because the 15-16 team actually lost in the championship game, right, whereas obviously the Bulls won. So even though they set the regular season record, they lost when it mattered, right? The one game you want to win, they ended up losing.

Or best of seven games?

Well, yeah, they lost it 4-3, to Cleveland it was. But the year after, in 2016-17, they won it, and also in that season they, I think, set the record for the most points they won by on average; it was over 10 points for their average victory. So the 2016-17 team could be argued to be a stronger team, but their Zed factor is actually lower, because they won fewer games across the season.

It's a really important point about all of statistics: what data did I use to draw this conclusion? So, you know, me saying this means Golden State would beat Michael Jordan's Bulls, I
don't know that. I'm just saying, based on how many wins each of those teams had in their peak season, their highest-winning season, the Golden State Warriors' total was, from a mathematical perspective, more of an outlier than the number of wins the Bulls managed. So you can't really use it to conclude who would win, right? I was kind of joking a little bit when I said that. But we do have a measure that argues it was more impressive, it was a rarer event, for the Golden State Warriors to win that many games in 2015-16, and that does take into account the level of competition, than it was for the Bulls and their winning record in 1996-97.

I guess the Holy Grail of sports statistics, and I know this is often sought, is to come up with one number or figure that encapsulates the strength of a team and how good it is. But then you would take that across the generations and use this method to compare generations?

Absolutely, yeah. The question might be, then, what is the suitable number or suitable data set to compare teams across eras? I've just used number of wins, win percentage, because it seemed the most straightforward.

That's what sport's about.

Yeah, ultimately that is what sport is about. But people would argue it's the number of titles you win, the number of trophies; who cares if you won the league and then lost the title game, no one cares, is some people's opinion. But I just based this on wins because that allowed me to compare win percentage across different sports.

So we've been focusing on basketball, but I have Z factors for other sports and other teams across different eras. Liverpool in 2020 finally won the league after a long wait, the Premier League football team. Brady's a Liverpool fan; I'm not a Liverpool fan. Their Zed factor was 2.62, which is really high, so even higher than the Golden State one from 15-16. So that was a really impressive season from Liverpool, based on the level of competition at
that time. Then other famous Premier League football teams: you've got the Arsenal Invincibles, who went the whole season without losing a game, the only team who have ever done this in the Premier League, and that was a 2.53, so actually a little bit less. Man City got 100 points, they were known as the Centurions, in 2017-18; that was 2.5. So again around that kind of 2.5 level, but a little bit below Liverpool. And then my favorite team, Man United: their highest Zed factor actually occurred in 1992-93, so going back to one of the first few, or maybe even the first, year of the Premier League, when they had a Z factor of 2.59. Still not quite as good as Liverpool's, but the interesting thing there is that Man United team lost six games, which is a lot higher than you see in all of these other ones I've mentioned. But back in that time period, the early 90s, the league was way more condensed, right? I think United won it with like 80-something and you had so many teams with like 70, 60, 50, whereas take Liverpool in 2020: they were on like 99, and then it was like 81 for Man City, and then it was like 50, 40, 20, 20. So the standard deviation here plays a part, and that's because it's comparing the particular performance to the strength of opposition in that time period.

Other ones in football that I found interesting: I looked at La Liga in Spain. Real Madrid's best season according to Zed factors was 2011-12. They scored 2.85, which is actually the highest I found for any football team. I think they got a hundred and something points, 102 points, that year. We also had Barcelona the year Messi scored all the goals, which is the season after that; they scored 2.66, so again really high.

And then outside of football I also looked at baseball. I thought that would be an interesting one because you have more data points, you play more games. I'll admit I'm not a huge follower of baseball, but some research online told
me that there's an argument around the team with the most wins. The 1906 Chicago Cubs had 116 wins, and then the 2001 Seattle Mariners had 116 wins, which as far as I'm aware is the record number of wins in a season. The Cubs had a higher win percentage, it was 76.3%; they won more of their games because you played fewer games in 1906 compared to 2001. However, Zed factors: 1906 Cubs 2.05, 2001 Seattle Mariners 2.68. So there's a really big difference in the Zed factor. Even though the Cubs had a higher win percentage, the level of competition the Mariners faced clearly was tougher, which meant that according to this particular statistic, the Zed factor, the 2001 Mariners actually were more of an outlier than the 1906 Chicago Cubs.

I love this stuff, I love sports statistics. The question that's coming into my head is: what is the lowest Zed factor a team has won the league with?

Oh, I thought you were going to say, like, the worst team ever.

No, the lowest Z factor where you still won the league.

I can figure that out. You can all figure this out. So we started with basketball, talked about football, my favorite sport, and a little bit on baseball. They're just the sports, and, as we've discussed, the win percentage or number of games won, that's just the particular data sets that I decided to apply this theory to. The joy of this, and let's call it a homework exercise for the interested viewer, to throw out to you all, is: try this for yourselves. Pick your favorite sport, your favorite team, get the data. You just need to figure out what it is you want to measure, whether it's number of goals scored by a certain player, number of home runs, three-pointers, number of wins of a team, whatever you want. Pick what statistic you want to compare across different eras, different time periods, get all the data, work out the mean, relatively straightforward, the average value, work out the standard deviation, again there's
a formula, and then figure out your Z factor: it's just your data point minus the mean, divided by the standard deviation. And you can figure out the Zed factor for your favorite team in your favorite sport and see how it compares to all of the ones that we've looked at here.

You could even compare players from different sports: Michael Jordan's three-pointers versus Don Bradman's runs.

Absolutely, yeah. Who was more exceptional? That's what the Z factor is telling you. It's like, how much of a surprise was this particular statistic, this particular performance of this particular person?

Check out this great puzzle from episode sponsor Jane Street. The numbers represent the heights of Manhattan skyscrapers, and you need to place them in a very special configuration. It's all to do with which skyscrapers block other ones. Now, Jane Street is a trading firm with offices all around the world, including New York. They've made this puzzle, well, for fun really, they love making puzzles, but it's also to draw some attention to its upcoming Academy of Math and Programming summer event in New York. This is a chance for recent high school graduates who sometimes face barriers in their education journey to come to New York, amongst the skyscrapers, with all their expenses paid, and to learn about game theory, data analysis, programming, all that good stuff. For more details on just what a great opportunity this is, have a look in the video description. And by the way, you don't need to complete the skyscraper puzzle to apply for this program, it's just for fun, and you don't need to be interested in the program to just go and do the puzzle as well. It's for everyone to have a go. There's the link, and there's one you can click on in all the usual places. [Music]
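The homework exercise described in the transcript (work out the mean, work out the standard deviation, then take the data point minus the mean, divided by the standard deviation) can be sketched in a few lines of Python. This is only an illustrative sketch, not the presenter's own calculation: the function names `z_factor` and `standard_normal_cdf` are made up here, the win totals are invented rather than real league data, and the choice of the population standard deviation (dividing by N) is an assumption, since the video does not say which version of the formula was used.

```python
import math
import statistics


def z_factor(value, season_values):
    """Return (value - mean) / standard deviation for one season's data."""
    mean = statistics.fmean(season_values)
    # Assumption: population standard deviation (divide by N), treating the
    # season's teams as the whole data set. The video does not specify this.
    sigma = statistics.pstdev(season_values)
    if sigma == 0:
        raise ValueError("all values are identical; the Z factor is undefined")
    return (value - mean) / sigma


def standard_normal_cdf(z):
    """P(Z <= z) for the standard normal N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Invented win totals for a hypothetical 20-team league (NOT real data).
wins = [67, 61, 58, 55, 53, 51, 49, 47, 45, 43,
        41, 40, 38, 36, 34, 32, 30, 27, 24, 19]

top_team_z = z_factor(max(wins), wins)
print(f"Z factor of the top team: {top_team_z:.2f}")
print(f"Share of a normal distribution below it: {standard_normal_cdf(top_team_z):.4f}")
```

The same `standard_normal_cdf` helper can be used to check the band percentages quoted in the video: `standard_normal_cdf(1) - standard_normal_cdf(0)` is about 0.3413, and `1 - standard_normal_cdf(3)` is about 0.0013. To compare two eras, call `z_factor` once per season with that season's own data and compare the two results.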
Info
Channel: Numberphile
Views: 116,845
Keywords: numberphile
Id: -PGrIXlFq4E
Length: 22min 35sec (1355 seconds)
Published: Thu Feb 01 2024