Explaining Probability Distributions

Video Statistics and Information

Captions
Welcome to a Very Normal Therapeutics employee training video. In this video we'll talk about probability distributions. We'll talk about what they are and why they're important, and we'll use a little bit of code along the way. These are the main concepts and ideas we'll use throughout the video. At the end, we'll weave each of them into a simple mental map.

The world is a random place. Sometimes you win, sometimes you lose. Sometimes your Uber driver talks to you, and other times they'll talk to you even when you paid a little extra for them not to say anything. We'll define randomness as the presence of uncertainty or, even better, the absence of predictability. Randomness means that no matter how hard you try, you can't perfectly predict what will happen next.

This unpredictability kind of sucks. Our brains are hardwired to search for patterns, and it doesn't take much to convince us that something is happening when it actually isn't. Think of lucky streaks in the casino or the supposed hot hand in basketball. Our pattern-finding brain sees these trends and believes that some force is causing these things to happen. In reality, both of these things are just a string of random events that we interpret to be non-random. This brings up a problem: if we can see patterns in the world that actually aren't there, how do we distinguish these illusions from actual trends that exist? Statisticians use mathematical models to impose structure on this randomness. To understand what I mean by structure, we need to understand the probability distribution. But before we get to that, we need to learn about the random variable.

One of the basic elements of a statistical model is the random variable. If you were to read about random variables in a textbook, you would see that a random variable is a function. This function takes elements in a sample space of an event and outputs a number. That's it. There are more formal definitions of a random variable, but those definitions aren't that helpful to us statisticians. Instead,
we'll ground our understanding in a few examples.

As a first example, let's say that we're playing a game of Catan, a board game that's been the source of a lot of bitterness in my life. You don't need to understand the rules of Catan, only that the main source of randomness in the game is the roll of two six-sided dice. After you roll, you take the sum of the two dice. The outcome of the sum decides whether or not you get resources, which let you do stuff in the game. This matrix shows the possible sums from the two dice. We can see that their sum can range from 2 to 12, and due to randomness, we can't know which one we'll see ahead of time. Some of you may already see which sums are more likely and which ones aren't. We'll get to that.

As a second example, let's say the thing I want to observe is the hours of watch time that this YouTube channel gets in a day. Hours is a measure of time, so the range of possible values is theoretically infinite, but only infinite in the positive direction.

As a final example, let's say that the thing I want to observe is whether or not you finish this YouTube video. Since I know you will, the only value this random variable can take is one, which is the number I use to represent that idea. Theoretically, some of you won't finish the video, so this random variable might also be zero. But that won't happen, right?

The central takeaway is that the random variable is a mathematical representation of something in the world that can take a range of numerical values, but we can't predict what value we'll see in advance. These values can be discrete, as we saw in the first and third examples, or continuous, as we saw in the second example.

You'll need to be familiar with a bit of notation. When a random variable is a capital letter, we are conveying that this object can take on different values. When we see a lowercase version of that same letter, we've actually observed a value from the random variable. It's no longer random, since we've actually observed it.

We now know that a
random variable can take on different values, but just knowing the range of values isn't that useful. We might also be interested in knowing how likely we are to see these different values relative to each other. For example, in Catan there's a map of hexagons, and most of these hexagons have numbers on them. If you have property on a hexagon, you can collect resources, but only if the sum of the dice matches the number on that hexagon. Therefore, it's in your best interest to build properties on hexagons with numbers that are more likely to come up. Otherwise, you're just sitting there eating Doritos while everyone else is gathering sheep.

To be more precise, what we want is a function. This function will take in a number and output another number. This output number will describe to us how likely we are to see the input number. If the output is higher, that means that we're more likely to observe that input number. If the output is zero, then we can interpret this as it being impossible to observe that particular input. Luckily, all random variables have such a function, which we refer to as the probability distribution. It can also be called the probability density function (PDF) or the probability mass function (PMF), depending on whether the values we observe are continuous or discrete. From here on out, I'll use the phrase probability distribution, and I'll denote this function with the lowercase f.

The probability distribution is important because it describes the structure in the randomness in a random variable. To understand what I mean by this, let's look at the probability distribution of the sum of the double dice roll. The distribution takes on this triangle shape. When many people think of randomness, they think of this. This randomness is chaotic, totally unpredictable. By looking at the entire function, you can see which values are likely and which aren't. If you want to win in Catan, choose hexagons with an eight or a nine. If you want to have a bad time, choose 2 or 12.

Some of you may have noticed
this, but I've avoided using the word probability to describe the output of a probability distribution. That word does apply in the case of the PMF: you can get a probability from both point values and ranges of values. If you want to get a probability from a PDF, you can only talk about that in terms of ranges of values. If you take a point value from a PDF, it's not a probability. It's technically a probability density. It can still tell us that one value is more or less likely than another, but it's not a probability.

Even though it's impossible to predict values in a random variable ahead of time, the probability distribution tells us that over many observations, the frequency with which they appear will be predictable. And that's what I mean by structure.

It's worth mentioning there's an alternative way to describe the structure of randomness in a random variable. It's a close cousin to the probability distribution, and it's called the cumulative distribution function, or CDF. Note that the D in PDF and the D in CDF aren't the same word. Like the PDF and PMF, the CDF takes in a number and outputs a particular kind of probability. The CDF gets its name from the fact that it outputs a cumulative probability: the probability that a random variable will be less than or equal to a given value. We usually denote the CDF with an uppercase F. Many people prefer to work with the probability distribution, but the CDF enjoys the benefit of always outputting a probability.

The CDF conveys the exact same information as the PDF, but it just takes some effort to see. I'll demonstrate here with the double dice roll example, since this relationship is easier to see with discrete random variables. The CDF of a double dice roll looks like this. CDFs of discrete random variables always take on the staircase appearance. For reference, I also have the PMF to the side. Before the number two, the CDF always outputs a value of zero. Well, once we reach two, it jumps to 1/36. When we get to three, we jump to 3/36. At this point we need to
add the probabilities of the sum being two and the sum being three, since we're outputting a cumulative probability. Therefore, the jumps in the CDF indicate how much probability is allocated to that number.

Let's talk about the connection between probability and statistics. They're very often paired together, but the distinction between them is not always made clear. When people first start learning about probability distributions, they're grounded in real-world examples like the dice I mentioned. These are good for picking up the concept of random variables, but I feel like it doesn't help explain why we statisticians would bother to learn about them. Let me ask: what are the random variables we deal with in statistics?

The first major one is the data set. Data is a really general term, but I'll define it as just observed information in numerical form. A data set is just a collection of observed random variables, really. A data set is observed, so we usually denote it like this. The number of observations, or sample size, is usually denoted with an n, and if we're just talking about the general idea of a data set, it would look like this.

But we don't collect data just to do nothing with it. The sample size n could be really large, so we often want to summarize the information contained in the data set into a single value. This single value is called a statistic. For some reason, statistics are often denoted with the capital T. If you know why, let me know in the comments. A statistic is simple: it takes a data set of n observations and outputs a single number. There are many common statistics that we care about and learn about in statistics classes: the sample mean, the sample variance. They're all functions of data that produce a single number, and they have meaning that's relevant to the data. There are many other interesting statistics other than these, so it's helpful to have a general definition to cover all of them.

In probability, you learn that a function of random variables is also a random
variable. Since a statistic is a function of data, a statistic is also a random variable. This can be potentially confusing, because many times we only ever collect one data set and, by extension, only ever see one sample statistic in the end. So how could it be random? You just need to think about what happens when you collect another data set. It's very unlikely that that second data set will produce the exact same value for the statistic. If statistics are all random variables, then by extension they also have probability distributions. This is a foundational piece of knowledge in statistics that's often missed by beginning students. If you internalize this, it makes other concepts like hypothesis testing much easier to understand.

Before we wrap up this video, we'll do a little code demonstration. Let's say that I'm a tryhard at Catan. When I play Catan, I want to destroy my opponents. To be able to do this, I'm going to use R to simulate the game and test out different strategies. I would like to know the number of times I would expect to see different sums over the course of an entire game of Catan. For simplicity, I'll stick to the number eight, but you can repeat this for other numbers. I read in an interesting blog post that the average game of Catan is about 70 dice rolls, so my data set is going to be 70 double dice rolls, which I'll simulate in R.

To simulate a double dice roll, I'll randomly generate two numbers from 1 to 6. To get these to be integers, I'll round up. I'm assuming that an average game of Catan is 70 dice rolls, so I need to repeat this process 70 times. After simulating all those dice rolls, I count the number of eights among these rolls, which turns out to be 10 in this case. That was just one data set and one statistic. To demonstrate that this statistic has a distribution, I need to replicate this experiment many times, so I'll repeat it 10,000 times. This code here does just that. Now that it's done running, let's look at the histogram for the sum statistic. The
histogram is actually an estimate of the probability distribution. And there we have it. You can see that this count statistic takes on a range of different values. It almost looks like a normal distribution, even though the distribution of the original dice roll is triangular. The peak of this probability distribution looks to be around 13 or 14, so let's check. In an average Catan game of 70 dice rolls, we'll see about 14 eights on average. Knowing that, I can try to shift my strategies to work around this number somehow. You can find this code on GitHub, and I've also put a link in the description.

To wrap up this video, let's develop a map that relates all the concepts we learned. It won't be a complete picture, but it can help jumpstart your own mental model for these concepts. Randomness in the world is the presence of uncertainty and prevents us from predicting future events with perfect accuracy. We represent this idea with a random variable. To understand what values are more or less likely, we use a probability distribution. This is a function that takes in a possible value that a random variable can take and outputs a number that expresses how likely we are to see it. In other words, it describes the structure in the randomness in the random variable. Depending on whether the random variable is discrete or continuous, it can have another name: the probability mass function or the probability density function. There's another function that describes the structure of randomness, the cumulative distribution function, or CDF. This function outputs the probability that a random variable will be less than or equal to a given value.

In statistics, the most common random variables we deal with are data and statistics. An observed data set is just a collection of observed random variables, while a statistic is a function of data that produces a number. The sample mean and variance are classic examples of statistics that we deal with.

This has been a Very Normal Therapeutics training video on
probability distributions. Now get back to work or you're fired. [Music]
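
The captions describe the PMF and CDF of the two-dice sum and a simulation of counting eights over 70 rolls, repeated 10,000 times, but the code itself is not part of this transcript. The following is a minimal sketch in Python rather than the R used in the video. The names `pmf`, `cdf`, `count_eights`, and the seed value are assumptions made for illustration, and the dice are assumed to be fair and drawn directly as integers 1 through 6.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import product

# Exact PMF of the sum of two fair six-sided dice (36 equally likely pairs).
outcomes = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {s: Fraction(n, 36) for s, n in sorted(outcomes.items())}

# CDF: running total of the PMF, i.e. P(sum <= s).
cdf = {}
running = Fraction(0)
for s, p in pmf.items():
    running += p
    cdf[s] = running


def count_eights(n_rolls, rng):
    """Simulate n_rolls double dice rolls and count how many sum to 8."""
    return sum(
        1 for _ in range(n_rolls)
        if rng.randint(1, 6) + rng.randint(1, 6) == 8
    )


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
N_ROLLS = 70            # rolls per game, the figure quoted in the captions
N_GAMES = 10_000        # number of replications, as in the captions

# One count statistic per simulated game; together they approximate
# the probability distribution of the statistic.
counts = [count_eights(N_ROLLS, rng) for _ in range(N_GAMES)]
mean_count = sum(counts) / N_GAMES

print("P(sum = 8)  =", pmf[8])
print("P(sum <= 3) =", cdf[3])
print("theoretical mean count:", float(N_ROLLS * pmf[8]))
print("simulated mean count:  ", mean_count)
```

Under these assumptions, P(sum = 8) is 5/36, so the expected number of eights in 70 rolls is 70 × 5/36 ≈ 9.7, which is lower than the peak of 13 or 14 mentioned in the captions. A value near 14 would arise if each die were drawn as a continuous number between 1 and 6 and then rounded up, because a face of 1 would then almost never occur and P(sum = 8) would become 5/25. Whether that explains the figure in the video is only a guess, since the original code is not shown here.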
Info
Channel: Very Normal
Views: 14,953
Keywords: statistics, biostatistics
Id: k5sbE1_MDwU
Length: 12min 54sec (774 seconds)
Published: Sat Oct 14 2023