Null Hypothesis Significance Testing

Captions
This video contains a pink hippo puppet, examples about hearing loss, dating and teddy bear therapy. I am physically assaulted by the hippo again.

So before we begin today, I thought we should all have a bit of spaniel love, because today's lecture is on null hypothesis significance testing. Not the most fun topic in the world, but if you've got a spaniel it's all fine. In today's lecture we're going to look at null hypothesis significance testing, or, for most people, the p-value.

As we've seen in the previous couple of lectures, there are five key concepts to understanding the statistical methods typically applied in the social sciences: the standard error, parameters, confidence intervals (interval estimates of the parameter), and estimation, all of which we covered in previous lectures. That leaves one final topic, null hypothesis significance testing, and that's what we're going to cover today.

Where does this fit into our model? We've talked about this "spine of statistics": we always start with some kind of scientific question we're interested in answering; we sample some data and visualize it before we do anything; then we fit a statistical model, and those models are invariably variations on the general linear model. We saw that when we fit a model there are two things we can think about. One is estimating the effect, which we can do with a single value (a parameter estimate) or with an interval constructed around that estimate. The second is applying some kind of significance test to that parameter. Finally, we need to test the assumptions of our model and possibly fit a variant of the model that is robust if any of those assumptions are violated. Today we're focusing on the significance-testing part of the diagram: we've fitted a model, estimated our parameters and constructed confidence intervals, and now we want to apply a significance test to them. I also want to draw your attention to the fact that the standard error, which we talked about in a previous lecture, is intrinsic to this process: it feeds into both confidence intervals and hypothesis tests.

By the end of this lecture, hopefully you'll have some understanding of the process of significance-testing parameters: what it means and what process we need to go through. I'll spend some time explaining what a p-value represents, and also what it does not represent, because it's another thing in statistics about which there are a lot of misconceptions, so we'll try to dispel some of those. We'll look at some limitations of significance testing, and we'll also talk briefly about effect sizes: how looking at and interpreting the parameters themselves provides useful context for significance tests.

So what is null hypothesis significance testing? This flow diagram illustrates the process that happens if you're applying frequentist models. There are other types of model you can apply (Bayesian models, for example) and they have a different process, but for most of what you read in psychology and social science journals, this is the process that people are supposed to go through.
You start off with some kind of hypothesis that you want to test, and having generated that hypothesis you split it into two alternative realities: a null hypothesis, which effectively says there is no effect, and an alternative hypothesis, which is the thing you're actually hypothesizing. For example, we might hypothesize that online learning leads to reductions in happiness (not a great prediction for me to be making, but let's say you predict it). Your alternative hypothesis is that the effect exists: engaging in online learning reduces your happiness. The null hypothesis is that there is no effect of online learning on your happiness; it doesn't do anything.

Having generated this hypothesis, you specify a significance level, which is basically the threshold you apply at the end of this process. What you're supposed to do is look at the research literature, know the area you're working in, and generate a bespoke alpha level appropriate to the research you're doing. You are not supposed to blindly apply a single value in all circumstances. Fisher, who was responsible for the p-value part of this process (but not the hypothesis-testing part), said something like, and I'm paraphrasing slightly because I haven't actually memorized his writing, only a fool would use the same threshold for alpha all the time. So what do psychologists actually do? They typically apply 0.05 and don't think about it. Rather than thinking about a bespoke alpha, they go: "What does everyone else do? Oh, they use a 0.05 level of significance. OK, I'll do that too." That's what you should not do, but it's what lots of people do anyway.

Having decided on your bespoke alpha, you should then use it to calculate the sample size necessary to detect the effect of interest, and you would also pick an appropriate sampling distribution for the parameters you're interested in estimating. Typically, at least for most of the variations of the general linear model we deal with, that sampling distribution will be a normal distribution, but this step is in the process to point out that you should actually be thinking about these things. Having calculated the sample you need to detect the effect, let's say 98 participants, you then randomly sample 98 participants, no more, no fewer, and you estimate the parameters: you fit the model, go through the estimation procedure, and get estimates of your betas. Based on those betas you can compute something known as a test statistic, which we'll talk about in a minute; it puts those betas to some kind of test. Having done that (I say "you", but obviously you get computers to do it, because that's what computers are for), you compute a long-run probability, p, that the observed test statistic would be at least as big as the one you have observed if the null hypothesis were true. We'll come back to this in a lot of detail, but essentially you calculate a p-value.

Then you compare your p-value to the alpha level you set at the beginning. If your p is less than or equal to that alpha, people typically use that as evidence that they can reject the null hypothesis. In other words, if the test statistic they obtained was very unlikely if the null hypothesis were true (the p was very small), they say: this was unlikely if the null were true, so we'll reject the null. However, if p is greater than the alpha you set at the beginning, so the test statistic is reasonably likely if the null hypothesis were true, you use that as evidence to accept the null hypothesis: the test statistic I've observed is reasonably common when there's no effect, so I'm going to assume there's no effect here. That's the process in a nutshell; a very long, convoluted nutshell that probably contains a really rancid, rotten, disgusting nut in the middle.
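To make that flow concrete, here's a minimal sketch of the procedure in Python for a two-group design. The expected effect size, the 0.8 power target, and the simulated data are illustrative assumptions, not values from the lecture; scipy and statsmodels do the sample-size and test-statistic arithmetic.

```python
# A minimal sketch of the NHST procedure described above for an
# independent-groups design. All numbers here are illustrative.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

alpha = 0.05           # 1. significance level chosen in advance
expected_d = 0.4       # expected effect size (an assumption, from the literature)

# 2. Use alpha (and a desired power) to calculate the required sample size.
n_per_group = TTestIndPower().solve_power(effect_size=expected_d,
                                          alpha=alpha, power=0.8)
n_per_group = int(np.ceil(n_per_group))   # roughly 99 per group here

# 3. Collect exactly that many observations (simulated in this sketch).
rng = np.random.default_rng(42)
control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
treatment = rng.normal(loc=0.4, scale=1.0, size=n_per_group)

# 4. Compute the test statistic and its long-run probability under the null.
t_stat, p_value = stats.ttest_ind(treatment, control)

# 5. Compare p to the pre-specified alpha.
if p_value <= alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: reject the null")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: fail to reject the null")
```

The point of the sketch is the order of operations: alpha and the sample size are fixed before any data are collected, and the data never change the threshold.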
Let's recap the example we've been using throughout these lectures, which is to do with ears ringing: predicting the length of time your ears ring for as a function of the volume of the concerts you've attended. This model should be familiar to you; we've seen it a number of times. I just want to illustrate how the beta, the parameter attached to a predictor variable, can tell us something about a hypothesis. If we're hypothesizing that there is an effect of the volume of the concert on how long your ears ring for, then essentially we are hypothesizing that the beta attached to volume, which, bear in mind, represents the slope of that red line (you can also think of it as the rate of change of the outcome as volume increases), is not equal to zero. If that beta is not zero, there is some kind of relationship between the volume of the concert and ears ringing, and the degree to which it differs from zero tells us the degree to which the volume of the concert affects ear ringing. So the beta very directly quantifies the effect of interest. Under the null hypothesis that there is no effect, the beta would be zero and the line would be flat: as the volume of the concert increases, nothing happens to ear ringing at all. So the beta attached to volume can be used to test a hypothesis: if that beta is zero, the effect is basically zero, and if it's different from zero, the effect is non-zero. We can test whether the beta is different from zero, and that test tells us whether the effect is different from zero, because the beta very directly quantifies the effect of interest.

We also looked at a variation on this example: whether you attend concerts or not, and how long your ears ring for. This is the equivalent of comparing two groups: people who attended a concert and people who didn't. We've seen before that the beta 1 in this scenario represents the difference between the two group means. So the beta for attendance again very directly quantifies the effect of interest: whether there's a difference in ear ringing between the two groups. If that beta is different from zero, the difference between the means is not equal to zero; there's a slope to the line, so ear ringing changes as a function of group. Under the null, the gradient of that line, the beta, would be exactly zero: we'd see a flat line, which basically means the means in the two groups are the same.
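As a quick check on that claim, here's a small sketch (with made-up numbers) showing that when the predictor is a 0/1 group indicator, the fitted slope really is the difference between the group means:

```python
# Sketch: with a 0/1 "attended a concert" predictor, the fitted slope b1
# equals the difference between the two group means. Data are invented.
import numpy as np

hours_ringing = np.array([1.0, 2.0, 1.5, 5.0, 6.0, 5.5])   # outcome
attended      = np.array([0.,  0.,  0.,  1.,  1.,  1. ])   # 0 = no, 1 = yes

# Fit ringing = b0 + b1 * attended by least squares.
X = np.column_stack([np.ones_like(attended), attended])
b0, b1 = np.linalg.lstsq(X, hours_ringing, rcond=None)[0]

mean_diff = (hours_ringing[attended == 1].mean()
             - hours_ringing[attended == 0].mean())
print(b1, mean_diff)   # both 4.0: the slope IS the difference in means
```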
All I'm trying to illustrate here is that because the beta attached to a predictor quantifies the effect of interest, we can apply a significance test to that beta. Parameters represent effects: that could be the relationship between a predictor and an outcome, or a difference between group means, and both of those can be quantified with a beta, a b. Parameters therefore reflect hypotheses, which means we can express the null and alternative hypotheses in terms of these parameters. The null hypothesis, which we normally write as H0, effectively says that the b in the model, the parameter, is equal to zero. There may also be situations where we want to compare two parameters (the parameters of two variables, say), so the null could also be that parameter one equals parameter two: two of the betas in the model are the same. The alternative hypothesis, that there is some kind of effect, is normally denoted H1, and we can express it in terms of the beta not being equal to zero (being bigger or smaller than zero), or two betas in the model not being equal to each other, if for some reason we want to compare the betas attached to different predictors.

We know from previous lectures that all parameters, all bs, have an associated sampling distribution, which means that every parameter we estimate has a standard error attached to it. The standard error tells us how variable those parameter estimates will be across different samples. Hopefully you remember that small standard errors mean different samples will give us pretty similar estimates of a parameter, whereas a large standard error means different samples could give us quite different estimates. So the standard error represents the noise in the sampling process, and the beta, the estimate in your sample, represents the signal. We can construct a signal-to-noise ratio: the size of the parameter (our signal) relative to the noise in sampling (how much variability there is in parameters across samples). If we divide the estimate by its standard error, we get a signal-to-noise ratio of the parameter relative to the uncertainty in its value. When we do that, we get a test statistic, typically a t-statistic in fact.

Once we've calculated that test statistic, we can, based on its properties, work out the likely values of that test statistic if the null hypothesis were true: if a beta were zero, what values of the test statistic would we be likely to get? We can express this as a probability distribution. The plot on the slide shows one such distribution: values of the test statistic are on the x-axis, and the y-axis is the probability of them occurring. You can see that for very extreme values of the test statistic, in either direction, the probability of occurring under the null is very low, whereas around zero, the value under the null, we get the highest probability. So we can look at values of the test statistic and work out how likely they are to have occurred if the null were true.
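In code, the signal-to-noise calculation is one line, and the p-value is just the tail area of the t distribution. The estimate, standard error, and degrees of freedom below are illustrative numbers:

```python
# Sketch: the test statistic is the parameter estimate divided by its
# standard error, and p is the tail area of the t distribution under H0.
from scipy import stats

b_hat = 2.5    # parameter estimate (illustrative)
se_b  = 1.1    # its standard error (illustrative)
df    = 96     # residual degrees of freedom (illustrative)

t = b_hat / se_b                            # signal-to-noise ratio
p_two_tailed = 2 * stats.t.sf(abs(t), df)   # area in both tails
print(t, p_two_tailed)                      # roughly t = 2.27, p = 0.025
```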
Having worked out that probability value, we then apply whatever threshold we decided on, typically 0.05, and use it to make a decision. If we're using a 0.05 threshold and our alternative hypothesis just says that our beta is not zero, that's what's known as a two-tailed hypothesis: we're not specifically predicting that it will be bigger than zero or smaller than zero, just that it won't be zero. In that case the 0.05 probability gets split over the two tails of the distribution, because we're prepared to accept that the effect could be in either direction. But there are other scenarios where you make a very specific, directional prediction: you're saying your beta is not only not zero, it's going to be positive, for example. In that scenario you can put all of your 0.05 probability in the upper tail, and that gives you a different threshold for determining whether the test statistic is significant or not. So that's your standard process for calculating p.
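A short sketch of that difference: the same alpha of 0.05 produces different critical values depending on whether the probability is split across both tails or concentrated in one (the degrees of freedom here are illustrative):

```python
# Sketch: alpha = 0.05 split over two tails vs. placed in one tail.
from scipy import stats

alpha, df = 0.05, 96
two_tailed_crit = stats.t.ppf(1 - alpha / 2, df)  # ~1.985: 0.025 in each tail
one_tailed_crit = stats.t.ppf(1 - alpha, df)      # ~1.661: all 0.05 in the upper tail
print(two_tailed_crit, one_tailed_crit)
```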
Now, what is a p-value? What does that p actually represent? To explain this, we're going to think about a rather outlandish scenario. Imagine a university with a very weird system where all students rate all other students on various personality dimensions. Every day you get a little notification from the university saying: please rate one of your fellow students. You have to rate them along dimensions such as how funny they are (a humor rating), how kind they are, how conscientious they are, things like that. And at the end there's a little tick box: would you date this person? Because this is a really weird university that likes to get involved in people's dating.

So imagine you're at this university: every day you rate one of your fellow students along personality dimensions, and you also indicate whether you would date that person. Now imagine two students. The first is called Alice. She is brilliant, very conscientious, works very hard, gets good grades, very studious. Because she's so conscientious, she diligently completes these ratings every day when she gets them, even though she thinks, what's the point of this? Imagine another student called Zach. He's a very nice guy, not so diligent; he's in a band, a bit more laid-back about his studies. But he really likes Alice, because she's so clever. He's a bit shy, though, and not quite sure whether she would be interested in him; his self-esteem is a bit low: "I'm really into my music, but I'm not so academic, and she's brilliant, so maybe she wouldn't be interested in someone like me." He thinks she's amazing, but he's not confident enough to ask her out for a date. He chats with her, says hi, but he gets nervous and can't bring himself to ask her out. He knows, though, that there's all this rating going on, so he thinks: well, maybe she's rated me; it's possible.

Anyway, one day he's in the library, sitting doing some studying, getting a bit confused about what the p-value represents, and Alice is a few tables away, very diligently doing her statistics coursework. He's got a bit bored of looking forlornly at her from a distance, so he decides to go and get a coffee. As he walks out of the library he has to pass her table, and he notices that the university's rating app is up on her laptop, with his photo: she's about to rate him for humor. Just as he goes past (he doesn't want to stand and stare at her computer, that would be weird), he notices that she's clicked a five on the rating scale, five out of ten for humor, and then she submits it and moves on to the next rating. He carries on walking, but he's got some useful information: that's his test statistic. He knows Alice has given him a rating of five out of ten on humor.

In isolation, though, that test statistic is not particularly useful. Five out of ten is not exactly putting him in stand-up-comic territory, is it? He needs some context for it, and the context is how Alice rates other people, in particular how Alice rates people she doesn't want to date. So he has a friend in computer science hack into the database (at which point the story goes quite unrealistic, but stick with me), and he finds all of Alice's ratings, apart from his, darn it, because his hasn't been committed to the database yet. What he discovers is that for the question "would you like to date this person?" she has answered no to everyone. So he has a bunch of information about how Alice rates people when she doesn't want to date them. That's the null hypothesis: Alice does not want to date Zach. In particular, looking at the humor ratings, he knows how Alice rates people she doesn't want to date on humor. This is the distribution under the null: the distribution of humor ratings under the null. And he's also got a test statistic: he knows she rated him a five.

Let's imagine one scenario where this is the distribution of Alice's ratings of people's funniness when she doesn't want to date them, her distribution of humor ratings under the null. In this scenario her ratings average around six: when she doesn't want to date people, on average she gives a rating of about six for humor, sort of midway along the scale. High ratings, say above an eight, are very rare when she doesn't want to date someone, and because she's a nice person she rarely rates people's humor below about a four either; she's too nice to give people zeros and ones. Now, Zach knows he got a rating of five. What this information tells him is the probability that she would give a rating of five or more if she didn't want to date someone, and that probability is the area under the curve. You can see it's a big area: she quite often gives a rating of five or more to people she doesn't want to date. So should he accept the null in this situation? His rating of five is quite likely, and entirely consistent with the null hypothesis, consistent with her not wanting to date him, because she quite often gives people a five or more when she doesn't want to date them. So really, this distribution provides a context for his test statistic.

Now let's imagine a different scenario, where her ratings look like this. Here, Alice hardly ever gives a rating of more than six when she doesn't want to date someone; basically never. How does Zach's rating of five shape up against this distribution of humor ratings? In this scenario it's looking a bit better for Zach, to be honest, because she very rarely gives a rating of five or more when she doesn't want to date someone. His rating of five is fairly inconsistent with how she rates people she doesn't want to date. So this would give him some hope: maybe the null is not true; maybe he can reject the null and instead assume that she might want to date him, because his rating is quite high for someone she supposedly doesn't want to date. So maybe she does want to date him.
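The two scenarios can be mimicked numerically. The "doesn't want to date" distributions below are invented normal curves (the lecture only shows them as plots), but the tail-area logic is exactly that of a p-value:

```python
# Sketch of the two scenarios: the p-value analogue is the area at or above
# Zach's rating of 5 under Alice's "doesn't want to date" distribution.
# Both distributions are invented for illustration.
from scipy import stats

rating = 5

# Scenario 1: ratings centred on 6 -- a rating of 5+ is common under the null.
p1 = stats.norm.sf(rating, loc=6, scale=1)   # ~0.84: consistent with the null
# Scenario 2: ratings centred on 3 -- a rating of 5+ is rare under the null.
p2 = stats.norm.sf(rating, loc=3, scale=1)   # ~0.02: inconsistent with the null
print(p1, p2)
```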
So what is the p-value? As I said, it's very misunderstood. It is the probability of getting a test statistic at least as big as the one we have observed, given that the null hypothesis is true. We can also look at some things it is not. I'm going to enlist my friend Professor Hippo again here and imagine some scenarios. "How are you, Professor?" "Yeah, not too bad, not too bad." "What have you been up to?" "I've been doing a little bit of anger management." "That's good. So we're going to do a little scenario where I give you some definitions of a p-value and see what you make of them, OK? But don't just hit me this time; that kind of hurt last time." "No, no, you're all good. I've been doing meditation and everything. I'm good. I'm calm. Super calm. I'm so calm." "Nice. OK: the p-value is the probability of a chance result... I thought you said you were calm?" "Yeah, that threw me over the edge, sorry." "OK: the p-value is the probability that the null hypothesis is true." "Seriously?" "The p-value is the probability that the alternative hypothesis is true." "..." "So it's not any of them, then?" "No. It's the probability of getting a test statistic at least as big as the one we have, given the null hypothesis is true. Don't forget it." "Oh, thanks for that, Professor."

So bear in mind: the p-value is not the probability of a chance result; it is not the probability that the null is true; and it is not the probability that the alternative hypothesis is true.
Now, some constructs related to the p-value. First, a Type I error: this is the probability of rejecting the null when it is in fact true, which is like believing in effects that don't exist. It's generally not good to believe in things that don't exist (well, unicorns are probably OK, and if it makes you happy it's kind of fine, isn't it? What am I on about...). This is a bit like Zach believing Alice wants to date him when in fact she doesn't, which is awkward. Type I errors: awkward. A Type II error is accepting the null hypothesis when in fact it's false: not believing in effects that do exist. This is a bit like Zach believing Alice doesn't want to date him when in fact she does. Sad story. Type II errors: sad story; Type I errors: awkward. And statistical power is the probability of a test avoiding a Type II error; in other words, the probability that the test detects an effect that is really there, or, to put it another way, the probability of rejecting the null when the alternative is in fact true.

OK, let's move on to an example we're going to use throughout the rest of the lecture, based on teddy bear therapy. To illustrate it, this is my eldest son helping me out (the photos are a couple of years old). He's my wife's eldest son as well, and frankly she had a much bigger contribution to his production than I did. Imagine a scenario where we look at teddy bear therapy: whether hugging a teddy bear can improve your self-esteem. We need some kind of control group, and the control we might choose is hugging something else, like a statistics textbook. Like his father, my son does appear to be happier when cradling a statistics textbook than when cradling a teddy bear; hopefully there's enough time for environmental influences to drum that out of him, but he's getting on the statistical ladder early. Good work, or bad work, depending on how you look at it. So we've got two conditions: a group of people who have teddy bear therapy, and a group of people in a textbook-hugging control group (I'm not sure you'd get giving people one of my books to read as a control past ethics, but anyway). We're going to use this example to illustrate some of the problems with null hypothesis significance testing.

The first problem is that hypothesis testing doesn't tell us anything about the importance of the effect, despite using the word "significant", because the p-value depends on the sample size. To illustrate this, imagine the teddy bear study was done three times. In the first study, the estimate column shows minus five: that's the difference in mean self-esteem between the two conditions, so (phrasing it the easier way around) having a book to cuddle rather than a teddy bear reduces self-esteem by five, and the p for that is zero to a few decimal places: a very, very significant effect. In study two we have exactly the same effect, self-esteem changes by five on our self-esteem scale, but now the p-value is above the 0.05 threshold, so we'd conclude a non-significant result. Here are two studies with the same effect size, the value of the parameter is identical, but different p-values. That's weird. In study three we have roughly a zero effect: the size of the parameter is 0.052, so there's basically no difference in self-esteem between the book condition and the teddy bear condition, but the p-value is 0.046, below our 0.05 threshold, so it's significant. That's weird too: basically no difference between the groups, but it comes out as significant.

Why? It's to do with sample sizes. Study one had 200 participants; study two had only 20. The reason those studies have the same effects but different ps is that they're based on different sample sizes. What that illustrates is that the p-value has a relationship with sample size: in big samples, p-values can be small even when the effect is trivial, and in small samples, big effects can be non-significant. Taking this to its logical extreme: how did study three manage to get a basically-zero effect that was significant? Because its sample size was massive, 200,000, so it had enough power to detect even a trivially trivial effect. The moral of the story is that whenever you look at a p-value, you should always interpret it within the context of the sample size. If the sample size is huge, a p-value can be significant even when the effect is trivially small.
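You can reproduce this pattern from summary statistics alone. The standard deviation of 10 and the exact mean differences below are illustrative stand-ins, not the slide's values:

```python
# Sketch: the same mean difference gives different p-values at different
# sample sizes, and a trivial difference becomes "significant" when n is
# huge. SDs and differences here are illustrative.
import numpy as np
from scipy import stats

def p_two_group(mean_diff, sd, n_per_group):
    se = sd * np.sqrt(2 / n_per_group)   # standard error of the difference
    t = mean_diff / se                   # signal-to-noise ratio
    df = 2 * n_per_group - 2
    return 2 * stats.t.sf(abs(t), df)    # two-tailed p

print(p_two_group(5.0, 10.0, 200))      # big n:   p < .001, "significant"
print(p_two_group(5.0, 10.0, 20))       # small n: p ~ .12, same effect, not significant
print(p_two_group(0.1, 10.0, 200_000))  # huge n:  p < .01 for a trivial effect
```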
Another problem with null hypothesis significance testing is that it provides little evidence about the null or alternative hypothesis. Firstly, it assumes that the null is true: remember, we calculate the p-value under the assumption that the null is true, using the distribution of the test statistic when the null is true. If we get a p-value greater than our threshold, say greater than 0.05, all this means is that the effect was not big enough to be found given our sample size. It does not mean that the effect is zero; it does not mean that the null is true. And if we get a p less than 0.05, or less than whatever threshold we've set, it just means that the observed test statistic is unlikely given that the null is true; it doesn't mean that the alternative hypothesis is true. Going back to our example of Alice and Zach, and Zach wanting to know whether Alice was likely to want to date him: he could know that his humor rating of five was an unlikely rating if she didn't want to date him, but that doesn't mean she actually wants to date him. There's a leap of faith you're making there.

It's also based on flawed logic. To explain the logical flaw, we're going to have a bit of Iron Maiden. The logic of significance testing goes something like this: if the null hypothesis is true, then it is highly unlikely to get the test statistic I have observed; I have observed this test statistic; therefore the null hypothesis is highly unlikely. At face value this seems OK, but I'm going to switch some of the words and show that it's not actually very logical. I'll replace the phrase "null hypothesis" with "person plays guitar", and "test statistic" with "plays in the heavy metal band Iron Maiden". Rephrased, null hypothesis significance testing is a bit like saying: if "person plays guitar" is true, then it is highly unlikely that he or she plays in Iron Maiden. That's a completely true statement: there are about 50 million people on the planet who play guitar (I'm not counting bass guitar, because that's not a proper guitar), and only three of them play guitar in Iron Maiden, so out of the 50 million guitarists, it's very unlikely that the one you happen to have picked plays guitar in Iron Maiden. Next: this person plays in Iron Maiden; the person we happen to have observed is one of the guitarists in Iron Maiden. Therefore "person plays guitar" is highly unlikely. Is that true? Actually, no. There are six members of Iron Maiden, three of whom play a proper guitar, so fifty percent of them play guitar: if a person plays in Iron Maiden, it's actually pretty likely that they play guitar, because half of them do. Hopefully you can see that the logical statements do not follow: by replacing "null hypothesis" with "person plays guitar" and "test statistic" with "plays in Iron Maiden", and following the statements through, you reach a nonsensical result. It's not a logically coherent process.
So: it tells us nothing about the importance of the effect, because p depends on sample size, and it provides little evidence about the null or alternative hypothesis. I'm really selling it to you here. The other thing is that it encourages all-or-nothing thinking. What do I mean by this? Well, one of my hobbies when I'm not lecturing statistics is to set up webcams in the offices of my colleagues around the psychology department. I record hours and hours of footage, work through it, and look at what they get up to in their offices. In particular, I'm interested in when they're analyzing data, so I home in on the moment when they've fitted their model, the results come up on their screen, and they're scanning through them. I've got a little montage here of some of those recordings, for when members of faculty have fitted their model and the key effect they were interested in has yielded a p-value greater than 0.05, or greater than whatever threshold they were using: they've scanned through and seen that their effect was non-significant. This is what happens. Feel free to work out which animal... er, which member of faculty is which; I couldn't possibly comment. Basically, everyone is really distressed: they've spent all this time doing the research, their p-value is not significant, and they all start screaming and making horrific noises. And here's a clip I found of someone who, when they fitted their model and scanned through their results, found a significance value, a p-value, less than 0.05, their threshold: a significant result. They are happy with what's happened.
So a big problem is that focusing on p-values tends to encourage all-or-nothing thinking: either the effect is significant (yay) or it's non-significant (sad face), with nothing in between. And, of course, unfortunately it's much easier to publish significant results in journals, so there's a whole incentive structure set up in science whereby you are reinforced for significant results and penalized for non-significant results, when actually non-significant results can be interesting, and there's a wider context around the significance value.

Going back to our teddy bear therapy example, where a group of people cuddled teddy bears, a group cuddled statistics textbooks, and we measured their self-esteem: imagine nine other research groups around the world did basically the same study. Here are the results of those nine studies plus our original one. Along the x-axis is the difference between the group means; for each study there's a dot representing that difference, and each dot has a 95% confidence interval around it. Notice the vertical dashed line at zero: that represents the null, the fact that there's no effect of teddy therapy, no difference between the means of the teddy therapy group and the book group. Anything on one side of that line says that self-esteem was higher in the book condition than the teddy condition, in other words that cuddling a book was more effective than cuddling a teddy, and anything on the right-hand side is the effect we were expecting: cuddling a teddy yields higher self-esteem scores than cuddling a book.

Now let's look at the p-values for each study, the p for this difference. Using a threshold of 0.05, some are significant: the first study is significant, the second one's not, the third one's not, and so on. Just take a moment to think about whether you believe teddy therapy has an effect based on what's in front of you. What you might have concluded is that teddy therapy probably doesn't have an effect, because the results seem quite inconsistent: study 10 is significant, study 7 is significant, study 6 is significant, and study 4 is significant, so four of the ten studies get a significant effect and the other six do not. You might think, well, it probably doesn't work: six out of ten studies show a non-significant effect. However, by focusing on the p-values you're ignoring a lot of the consistency in the data. For example, all of the point estimates in every study fall on the same side of zero, so every study shows an effect where self-esteem is higher in the teddy condition than the book condition. And all the effects fall within a fairly narrow range: the differences between means are all between about one and about four. So there's actually a lot of consistency here, but by focusing on the p-values our attention is drawn towards differences, and we start saying four studies say it works and six say it doesn't, so it probably doesn't work. If you look at the actual estimates, the parameters themselves, you see a lot of consistency in what those studies are showing.
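A simulation makes the same point. Below, ten hypothetical replications share a true mean difference of 2.5 but differ in sample size, so the p-values bounce above and below 0.05 while the estimates cluster around the same value. All the numbers are invented:

```python
# Sketch: ten simulated replications of the same study. The p-values are a
# mix of "significant" and "not", but the estimates tend to cluster around
# the true difference of 2.5: the consistency is in the effects, not the ps.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
for i in range(1, 11):
    n = rng.integers(15, 60)           # studies vary in sample size
    book  = rng.normal(30.0, 5.0, n)
    teddy = rng.normal(32.5, 5.0, n)   # true difference = 2.5
    diff = teddy.mean() - book.mean()
    t, p = stats.ttest_ind(teddy, book)
    print(f"study {i:2d}: n = {n:2d} per group, diff = {diff:5.2f}, p = {p:.3f}")
```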
So: it tells us nothing about importance, because p depends on sample size; it provides little evidence about the null hypothesis; it encourages all-or-nothing thinking. And the other thing is that it's based on long-run probabilities. Its job is to control the error rate in the long term: p is the frequency of the observed test statistic relative to all test statistics from an infinite number of identical experiments with exactly the same a priori sample size. It's telling us a long-run probability: in the long run, if we interpret significant results as significant, we will only be wrong five percent of the time. The p-value is not a probability attached to our particular study; it's an error rate attached to all the studies you might conceivably do. If you're going to have a long academic career doing lots and lots of research, and what you really care about is controlling the Type I error rate over the course of that career, that's what the p-value will do; that's its job. But if what you're interested in is whether there is a significant effect in this particular study that you have done, it doesn't tell you that. The Type I error rate in a given study is either zero or one: you've either made a Type I error or you haven't, but you don't know which. What you do know is that in the long term, over lots and lots of studies, you will only make a Type I error, say, five percent of the time if you use a 0.05 threshold. So be clear about what the p-value does: it provides long-term error control; it doesn't give you a probability for your particular study.

Rather than obsessing over p-values, you can also interpret the effects themselves, and this provides useful context. There are a few ways to do this. We talked in previous lectures about raw effect sizes, actually interpreting the betas, the bs; I'm a big fan of this. You'll also see people using standardized effect sizes, which are effect sizes expressed in standard-deviation units or some other standardized unit. A few common ones are Cohen's d, Pearson's correlation, and the standardized b, which we came across in a previous lecture (that's the model parameter expressed in standard-deviation units). To give you an example of Cohen's d, because we haven't talked about that one yet: you take the difference between the two means and divide it by some estimate of the standard deviation, so d = (mean 1 - mean 2) / s. The top half of that equation, the difference between means, is just the raw effect size, and you standardize it by dividing by an estimate of the standard deviation of scores. There are a few ways to get that estimate. In clinical studies it's quite common to just use the control group's standard deviation, because that would be a good baseline of variability in scores, but you also sometimes see the pooled standard deviation, which is basically a weighted average of the standard deviations across the two groups you're comparing.
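Here's a minimal sketch of that calculation using the pooled estimate (the self-esteem scores are made up):

```python
# Sketch of Cohen's d: the raw difference in means standardized by the
# pooled SD, a weighted average of the two group SDs. Data are invented.
import numpy as np

teddy = np.array([34., 40., 38., 45., 42., 39.])   # self-esteem scores
book  = np.array([30., 33., 29., 36., 31., 35.])

def cohens_d(x1, x2):
    n1, n2 = len(x1), len(x2)
    s_pooled = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
                       / (n1 + n2 - 2))
    return (x1.mean() - x2.mean()) / s_pooled

print(cohens_d(teddy, book))   # the difference in means, in SD units
```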
Let's have a look at how effect sizes can contextualize a p-value, going back to our three teddy bear therapy studies. Remember, the first two had the same parameter, the same difference between means: in both studies it was five. But in one study the p-value was basically zero, and in the other it was 0.1. If you look at the p-values in these two studies, you conclude opposite things: in one study teddy therapy appears to have a significant effect, and in the other it doesn't. If we focus on the parameter itself, the raw effect, we reach the same conclusion from both studies: the effect on self-esteem of teddy therapy relative to a book is a change of five on the self-esteem scale. By focusing on the effect size we contextualize the p-value, and we're drawn away from all-or-nothing thinking: the effect is the same in the two studies. Similarly, in the study where the difference between means was basically zero but the p-value was significant, we can get very excited about our significant p, but when we look at the effect size, the raw difference between means, and see that it's essentially zero, that contextualizes the p-value for us: OK, this is significant, but the effect is trivial, so let's not get too excited about our p. Effect sizes are a really useful way to contextualize a p-value.

Going back to an example from previous lectures, the research on whether playing hard to get predicted romantic interest in someone: this slide is from a previous lecture, and it's really just to highlight that we have looked at raw effect sizes before, so they're a familiar thing; there's nothing new here. We saw, for example, a parameter estimate of 0.294. That's a raw effect size, which told us that as the perception that the other person was hard to get increased by one (on a scale from one to five), it was associated with 0.294 more expressions of interest. And if you remember, we talked about that being a very small effect, because perceptions of playing hard to get would need to increase by 3.4 on a five-point scale to get even one additional expression of interest. So raw effect sizes are familiar.

Now, what happens if we go back to the diagram with our ten studies of teddy bear therapy, where looking at the p-values we concluded the results were inconsistent (six studies say there's a non-significant effect, four say there's a significant effect), and where, as I said, focusing on p draws your attention to inconsistency? What happens if we look at the effect sizes? I've got the effect sizes here expressed as Cohen's d, the difference between means expressed in standard deviations; for example, in study 10 the difference between means was 0.7 of a standard deviation. The main thing to notice is how, again, effect sizes help us to focus on consistency. All of the Cohen's ds are positive values, so every study shows that teddy bear therapy yielded higher self-esteem scores than the book control. And if we look at the magnitude of d, it's also fairly consistent: most of the ds fall between 0.45 and 0.7, a reasonably narrow range of effect sizes. So the effect size has become a useful contextual factor here: although there's inconsistency in the p-values, the effect sizes show that the studies are actually finding quite similar things, bearing in mind that the p-values in the different studies will have been affected by sample size and things like that.
Just to summarize: model parameters, the bs, typically represent hypotheses, and we can test whether those parameters are different from zero as a way of testing whether the effect is different from zero. We do this by computing a p-value, which is the probability of observing a test statistic at least as large as the one we have, given that the null hypothesis is true. This process is somewhat problematic. In a way it addresses the wrong question, because what most people actually want to know is the probability of the effect being real in their particular sample, and that's not what p tells you. Ps are dependent on sample size, and they do encourage all-or-nothing thinking. So although it's fine to use p-values, and they can be useful, they should always be interpreted in the context of the sample size and in the context of the effects: the raw effect size or a standardized effect size. OK, that's it. Thank you very much.
Info
Channel: Andy Field
Views: 7,011
Rating: 4.95 out of 5
Keywords: Statistics, NHST, p-value, effect size
Id: IXYDMMBisr8
Length: 54min 8sec (3248 seconds)
Published: Thu Oct 01 2020