What are p-values?? Seriously.

Captions
Hi team, welcome to this video on p-values. Before we begin, I want to show you that you already possess the required intuition for p-values, and I'm going to show you that using this Australian two-dollar coin, which, yes, is smaller than the one-dollar coin. I can't explain it; move past it. If I flip this coin 100 times, how many heads do you think I'm going to flip? Could I flip 50 heads? What about 52 heads? Could I flip 56 heads out of 100? What if I told you that I flipped 90 heads out of 100 coin flips? What does your gut tell you about what must be going on with this coin for that to occur?

Now, I reckon most people watching this video will tell me, "Justin, there's something wrong with that coin. Realistically, there's no way you're going to get 90 heads out of 100 coin flips." But what you've really done, without knowing it, is ask yourself this question: if the coin was fair, how likely is it that I would get 90 heads out of a hundred? Or, in other words, how extreme is our sample, given the coin is fair? That, my friends, is the correct line of thinking for p-values. Let's explore that concept a little bit more. My name is Justin Zeltzer, and this is Z Statistics.

All right, so as you can see, this video forms part of a series on health statistics called Health Stats IQ. You can find all of the other videos up on zedstatistics.com, but for now let's dive straight into p-values. I'm going to finish off the explanation around the coin example we saw in the intro, after which we'll look at the history of p-values and how they actually came about. We'll then look at the difference between one-tailed and two-tailed p-values, and finally I've found a really cool paper that shows how p-values tend to get used and how, realistically, you're going to be analyzing them.

So let's go back to our little example about the coin. Before we can assess what a p-value is, we need something to test, and in statistics this is called a null hypothesis. It's our default position, which we take hoping to see whether there's enough evidence to reject it. Our default position here is that the coin is a fair one. In the example I gave we had 100 coin flips, and quite clearly, if the coin is fair (that is, under the null hypothesis) we'd be expecting 50 heads. Of course, due to the randomness of sampling we're not necessarily going to get exactly 50 heads from a fair coin; we might get more than 50, we might get less than 50, but the distribution is going to be centered on 50, with the horizontal axis being our number of heads.
Now, the reason this distribution is a bell curve is a little bit beyond the scope of this video. If you'd like to know why it looks like a normal distribution, check out my video on the binomial distribution. For the moment, just appreciate that you're more likely to get numbers close to 50 than numbers further away from 50, so it becomes less and less likely to get, say, 60 heads, 70 heads or 80 heads out of 100 coin flips.

Now let's take one of those scenarios again, where we had 56 heads. Say I did have a sample of 100 coin flips and I got 56 heads out of 100. That sits over here in the distribution, to the right of our midpoint of 50, somewhere in the upper tail. Here's the official definition: under the null hypothesis, the p-value is the probability of getting a sample as or more extreme than our own. So here's 56; the p-value would be represented by all the possible samples from 56 heads upwards, in other words at least as extreme as 56, if we're considering the coin to be fair. Under the null hypothesis, where the coin is fair, a more extreme sample is one that's further away from where we'd expect the sample to be, which is at 50.

Now here's the interesting thing: it's not just this region that's as or more extreme than our sample, it's also the mirrored region on the other side, from 44 heads downwards. Don't forget that 44 heads actually means 56 tails, so it's just as extreme a sample to say you got 56 tails as it is to say you got 56 heads. All of those samples down here, where we get 44 heads, 43 heads, 42 heads and so on, all the way down to zero, are as or more extreme than our current sample of 56. These two regions together become our p-value, and while the calculation is beyond the scope of this video, that region works out to be 0.193. Thinking graphically, that's 19.3 percent of the area under the curve. In other words, with p = 0.193 we can say that if the coin was fair, the probability of getting 56 heads or a sample more extreme than that is 19.3 percent; I'm just turning that proportion into a percentage.

That gives us some idea of how extreme our sample of 56 heads was, and it turns out it's not really that extreme. It's quite possible to get a sample of 56 heads when the null hypothesis is still true, i.e. the coin is fair. So it really shouldn't alarm us, or make us think the coin is rigged, if we get 56 heads in our sample.
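If you want to check figures like these yourself, here is a minimal Python sketch (not something shown in the video) using scipy's exact binomial test, covering the 56-head case we just looked at plus the 60- and 90-head cases coming up next. Be aware that the exact number you get depends on convention, in particular whether the observed count itself is included in "as or more extreme" and whether any approximation is used, so these outputs won't necessarily match the video's rounded figures; the pattern, though, is the same: the further the sample is from 50 heads, the smaller the p-value.

```python
# Sketch only: exact two-sided binomial tests for "is this coin fair?",
# assuming scipy is available. These are illustrative calculations,
# not the ones performed in the video.
from scipy.stats import binom, binomtest

n = 100       # number of coin flips
p_null = 0.5  # probability of heads under the null hypothesis (fair coin)

for heads in (56, 60, 90):
    # Exact two-sided test: counts every outcome at least as unlikely as the
    # observed one, in both tails of the distribution.
    exact = binomtest(heads, n=n, p=p_null, alternative="two-sided").pvalue

    # Alternative convention: only samples strictly more extreme than the
    # observed count, mirrored across both tails.
    strict = 2 * binom.sf(heads, n, p_null)  # sf(k) = P(X > k)

    # For 90 heads both versions round to 0.000, i.e. practically zero.
    print(f"{heads} heads: two-sided p = {exact:.3f} "
          f"(strictly more extreme: {strict:.3f})")
```

Either way you count it, 56 heads is unremarkable, 60 heads starts to look suspicious, and 90 heads is essentially impossible from a fair coin.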
But let's see what happens if we got 60 heads. There's 60 on the graph, a little bit further away from 50, so our p-value is now a lot smaller, because the area in the more extreme sections of the curve is a lot smaller. It turns out the p-value is now 0.032. So again: if the coin was fair, the probability of getting 60 heads or a sample more extreme than that is 3.2 percent, which is quite small now. And as we'll see, usually when the p-value gets below about 0.05, which this one is, we tend to start casting aspersions on the null hypothesis. We start thinking, you know what, that null hypothesis might not be true anymore; it seems quite unlikely that we'd get 60 heads from a fair coin.

Now, the example I gave in the intro was in fact not 60 heads, it was 90 heads. I can't even put that on the graph, because it's so far to the right of everything else, but 90 would be way out on the right-hand side somewhere, and you can appreciate that the area represented by the more extreme parts of the curve is going to be pretty much zero. I've written p = 0.000 here; realistically it won't be exactly zero, but to three decimal places it certainly will be. So we can say that if the coin was fair, the probability of getting 90 heads from flipping it a hundred times is practically zero, and if I told you that I got 90 heads out of a hundred flips, you'd rightfully say there's something wrong with that coin, my friend.

Now, you're probably thinking: look, this is a health stats video, why are we talking about flipping a coin? Funnily enough, the history of p-values in health goes all the way back to 1710, and it's actually very similar to that coin-flip example. John Arbuthnot, a Scottish physician writing in the 1700s, reviewed 82 years of birth data in his paper "An Argument for Divine Providence", and what he found was that in every year the data was collected, more males were born than females. While he didn't actually use the phrase "p-value", he came to this conclusion: if male and female babies were equally likely, the chance that male babies would be more numerous 82 years in a row is a very, very small number, 2.01 x 10^-25, which written out is a decimal point, 24 zeros, and then 201.

I hope you can see how this is very much like the coin-flip example. He basically had a null hypothesis that both sexes are equally likely at birth, so that in any given year it's like a coin flip as to which sex is more numerous. Over 82 such years we'd expect about 41 years with more male babies born than female, and about 41 years with more female babies than male; that would be the likely outcome if both sexes really were equally likely. So it's a very similar situation: there's our expected value of 41, and we ended up with 82 years out of 82 with more male babies. It's like flipping heads 82 times in a row. What's the chance that your coin is fair if you've just flipped 82 heads in a row? Well, as we found out, it's 2.01 x 10^-25. I can't even draw the point for 82 on the graph, but the p-value is the small region more extreme than 82, which in this case is pretty much nil.
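As a quick sanity check (not Arbuthnot's own working, just the assumption that his calculation is simply the chance of 82 "more males" years in a row under a fair 50/50 split):

```python
# The chance of the "more males" outcome occurring 82 years running,
# if each year were a fair 50/50 coin flip. Illustrative check only.
p = 0.5 ** 82
print(p)  # roughly 2.1e-25, the same order of magnitude as the figure quoted
```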
So this paper was instrumental not only for telling us that, in humans, male babies are born with a slightly higher probability than female babies (it's certainly slight; I think it's around 51 percent), but also for introducing the concept, or the logic, that would soon become the p-value.

Simeon Poisson, in 1837, in his "Recherches sur la probabilite des jugements", investigated criminal trials, in particular the juries of criminal trials, and he did something interesting. He made two particular comparisons in that work. One of them had a probability of occurring by chance of 0.0897, and the other comparison, he said, had a probability of occurring by chance of 0.00468. When it was 0.0897 he said, you know what, that seems reasonable, perhaps this occurred by chance; but when it was down to 0.00468 he said, that's now too small, I don't think this occurred by chance, I think there's something else going on, something structural, with these two jury groups. That was really the first introduction of this somewhat vague idea of a critical value beyond which we consider p-values significant. And it was finally in 1925 that Ronald Fisher, in his book Statistical Methods for Research Workers, defined the p-value as we've been dealing with it in this video, and he also proposed 0.05 as that useful cut point, which happens to sit roughly halfway between the two probabilities Poisson compared. And thus we have the p-value.

All right, so let's now look at the difference between a one-tailed and a two-tailed p-value, starting with the two-tailed p-value, because it's essentially the one we've been looking at already. Here are three hypotheses we might test, each of which would give us a two-tailed p-value. We've already seen one, where the null hypothesis is that the coin is fair. But in the health sciences you'll also get a lot of studies looking at differences between the sexes, between men and women. If our null hypothesis, our default position, is that there is no difference between the sexes, then this will again give us a two-tailed p-value when we take a sample of men and women and assess the difference in the score of whatever we're testing. And finally, another example I can think of: we might be looking at an anxiety medication and its side effects, and in particular we might have a null hypothesis that the medication has no effect on someone's weight.

Now, the thing that makes each of these scenarios two-tailed is that we can reject the null hypothesis in either direction. For the coin, we can reject that statement if we get many more heads than tails, but we can also reject it if we get many more tails than heads, and that's where the second statement, which we call the alternate hypothesis, comes in. Because the alternate hypothesis is simply that the coin is biased, in either direction, this ends up being a two-tailed p-value. Same with the difference between sexes: if our alternate hypothesis is merely that a difference exists, that can happen where men score considerably higher than women or where women score considerably higher than men; it works two different ways.
Same with the anxiety medication: it might increase or decrease the person's weight, and either of those situations would be relevant.

So let's have a look at what that might look like. This distribution, just like the one we saw before, is the distribution of samples if the null hypothesis is true. For our coin example, the expected percentage of heads is 50, and if we got a sample up here we'd construct the p-value out of both of these regions, because all of that shaded area represents samples as or more extreme than the one we got. In other words, for the purposes of these hypotheses, the values down on the left-hand side are just as extreme as the ones on the positive side. The same goes for the sample difference between sexes: the expected difference is zero, and say our sample result happened to land up here, showing that maybe women scored higher than men on whatever we're testing; again it would be a two-tailed p-value, so you'd add up both regions to calculate it, as all the sample results in the shaded regions are as or more extreme than the one we got. And similarly with the weight change on the anxiety medication, it's the same deal: if our sample landed up here, showing an increase in weight, you'd again have two regions to add up to get your p-value.

So what does a one-tailed p-value look like? Well, it's not so much about what it looks like as about what kind of hypotheses lend themselves to a one-tailed p-value. Here's an example I thought of that might make sense. Say we're looking at a particular medication whose objective is to reduce swelling after an injury. We might start with the null hypothesis that it has no effect on swelling, but the alternate hypothesis here is that the medication reduces swelling, specifically. We're only really interested in one direction when it comes to rejecting this null hypothesis. So if the distribution of the change in swelling looks a bit like this, under the null hypothesis we'll have zero in the middle, because that's the change in swelling we'd expect if the medication has no effect; in the sample we receive, we'd expect no change in the person's swelling after they take the medication. But if our sample result happens to be down here, on the negative side, showing that the swelling has reduced, the p-value is again the region representing results as or more extreme than the sample we got, which is this yellow region, and only that yellow region. Think about it: if the medication was actually increasing the swelling, and your sample was up here somewhere, so the poor person who took this medication trying to reduce swelling actually found it increased, that's not going to tell us the medication is effective. Because we're only interested in rejecting this null hypothesis one way, the p-value reflects that, and the values as or more extreme than our sample are all on the left-hand side.

Here's another example I thought of: coronavirus mortality being equal between the sexes. If we're trying to show that coronavirus mortality is greater for men, specifically greater for men, then we could look at the mortality difference, men minus women, and if our sample result is again on the upper side, the p-value is just the region above our sample. Because if our sample turned out to show that the women in the sample had a higher mortality than men, that's not going to help us with this test; we want to show that specifically men have a greater mortality than women.
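To make the one-tailed versus two-tailed distinction concrete, here is a small Python sketch (my own illustration with made-up swelling data, not anything from the video or a real trial). It runs the same one-sample t-test on a hypothetical set of changes in swelling, once with a two-sided alternative and once with the one-sided alternative that the medication reduces swelling.

```python
# Hypothetical example only: changes in swelling (negative = reduction) for a
# made-up sample of patients. A one-sample t-test against 0 shows how the
# choice of alternate hypothesis changes the p-value.
from scipy.stats import ttest_1samp

swelling_change = [-1.2, -0.4, 0.3, -0.8, -1.5, 0.1, -0.6, -0.9, -0.2, -1.1]

# Two-tailed: alternate hypothesis is "the medication changes swelling, either way"
two_sided = ttest_1samp(swelling_change, popmean=0, alternative="two-sided")

# One-tailed: alternate hypothesis is "the medication reduces swelling"
one_sided = ttest_1samp(swelling_change, popmean=0, alternative="less")

print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided  p = {one_sided.pvalue:.4f}")
# Because the sample mean here is below zero (in the direction of the one-sided
# alternative), the one-sided p-value is half the two-sided one.
```

The same idea applies to the coin test: scipy's binomtest also accepts alternative="greater" or "less" if you only care about one direction.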
Now, that distinction might be quite subtle, because in the earlier example I gave, a difference between sexes, that was a two-tailed p-value, and yes, the difference between those two situations is pretty technical. But I hope you can see that whether a p-value is one-tailed or two-tailed is determined by what we're actually trying to test: whether it's one-directional, or whether we don't care which group is higher and just want to show a difference. Realistically, if you're just interpreting p-values from research, you don't have to worry about whether it's a one- or two-tailed p-value, as the stats boffins behind the scenes have done that for you; but if you're running the research yourself, you might have to get on top of what kind of thing you're testing before you can start quoting p-values.

So, that said, let's have a look at a piece of research that uses a lot of p-values and try to figure out how they're used. This comes from the Medical Journal of Australia this year, 2020, and it's called "Optimizing epilepsy management with a smartphone application", an RCT, by Yang C et al. What they were trying to do in this study was see whether people with epilepsy could manage their episodes better by using a smartphone application, as opposed to not using one. So here we go: this is the Medical Journal of Australia, here's the paper we're looking at, and I'm going to go straight down to the tables, or "boxes" as they call them, from the study.

What p-values tend to get used for is comparison between groups. In this particular table we've got the group that was assigned the app and the control group, and you can see there were 990 people enrolled in each of the two arms of the study. This is common to any study where you're trying to assess outcome differences between groups: you first want to make sure that the baseline characteristics of the people in each group are quite similar. You can see we have 54 men in the app group and 54 men in the control group (as in, not using the app; presumably all these people have epilepsy), so the p-value for that is 1.0, which tells us that's exactly what we'd expect if there's no difference in sex between the two groups. For age it's slightly higher in the app group, only very slightly, and the p-value is 0.94. That tells us that if there was no difference in age between the groups, the chance of getting this sample difference, which is only 0.1, or one more extreme than that, is 0.94; so it's very likely we could see a difference like this even if the ages really were the same between the groups.

What's going to happen is that as the differences between these factors get bigger, the p-values get lower. Here's an example looking at the unemployment rate: again very similar, 23 percent versus 22 percent, so our p-value is quite high, which is good, because we want the groups to be similar. But when we look at residency in urban areas, you can see the app group is slightly higher: 53 percent versus 46 percent for the control group. So there is a slight difference in the tendency to live in urban areas between the app group and the control group, and you can see that the p-value reflects that: it's 0.18, which is not concerning, but it's much lower than all the other p-values we've seen so far. It tells us that if there really were no difference in urban residency between the groups, there'd be an 18 percent chance of seeing a difference at least this large purely by random chance. As I said, 0.18 is not concerning, but it's interesting to note that it's much lower than the others. What a paper tends to want is for all of these baseline p-values to be reasonably high, certainly above 0.05; you don't want any of them below 0.05. And if we scroll down, you can see that yes, all of them are above 0.05, which is good, because it tells us there isn't a significant difference between the two groups in terms of their baseline characteristics.
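As an aside, the paper doesn't spell out which test sits behind each of these baseline p-values. For a categorical characteristic like urban residency, a common choice is a chi-squared test on the 2x2 table of counts; for a continuous one like age, something like a two-sample t-test. Here is a sketch with entirely made-up counts, so the p-value it prints will not match the 0.18 in the paper's table; it's only there to show the mechanics of where a number like that can come from.

```python
# Hypothetical 2x2 table of counts: rows are the two study arms,
# columns are urban vs non-urban residency. These counts are invented
# for illustration and are NOT the study's actual data.
from scipy.stats import chi2_contingency

table = [
    [53, 47],  # app group: 53 urban, 47 non-urban (hypothetical 100 people)
    [46, 54],  # control group: 46 urban, 54 non-urban (hypothetical 100 people)
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")
# A small p-value would suggest the arms differ in urban residency more than
# chance alone would explain; a larger one, as here, does not.
```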
But then we go to the next box, which shows the outcomes for the two groups. Now, I'm no expert on epilepsy, but these are all the possible outcome variables. You can see that the app group scored 144 as their total score versus 125 for the control group, so that difference seems quite large, and the p-value tells us it is a significant difference: if there were really no difference between the groups, the probability of seeing a difference like this purely by random chance would be less than 0.001. It seems much more likely that a real difference exists between these two groups. You can see that a lot of the other outcome measures also have p-values less than 0.01, except for these two, seizure management ITT and seizure management PP, whose p-values are a bit higher. That tells us we don't have a significant difference between the two groups on those outcomes, but for all the others there is a significant difference.

So it's interesting: you've actually got two different scenarios here. In the first box we looked at, we were hoping to see high p-values, to show that there's no difference between the groups on the baseline characteristics; then we wanted to see low p-values when looking at the outcomes between the groups, and indeed that's what we found.

So that is it, team. Thanks for sticking with me. If you got something out of this video, tell your friends about it so I can grow the audience a little bit and make more of these videos. You can subscribe and like the video, do all those lovely things if you can, and if you want to see any of the other videos, they're up on zedstatistics.com. I'll catch you later.
Info
Channel: zedstatistics
Views: 32,354
Rating: 4.975503 out of 5
Keywords: p values, p-values, what are p-values?, what are p values?, health statistics, justin zeltzer, zedstatistics, zstatistics
Id: 4XfTpkGe1Kc
Length: 25min 59sec (1559 seconds)
Published: Wed Dec 23 2020