Statistics made easy!!! Learn about the t-test, the chi-square test, the p-value and more

Video Statistics and Information

Captions
Learning statistics does not need to be difficult. Now, instead of bombarding you with a complicated formula and statistical theory, I'm gonna walk you through a way of thinking, and that's gonna enable you to address the most common statistical questions.

When we look at sample data, for the most part we see two things. We see differences between groups, so men are taller than women, and we see relationships between variables, like taller people weigh more than shorter people. And the big question is: are those differences, and are those associations or relationships, real? And I'm going to talk you through what it is that we mean by the term "real".

Over the next few minutes we're going to take a look at a very simple data set, and we're gonna see how, by looking at various combinations of variables and variable types, we can identify very specific differences between groups and very specific relationships between variables. And I'm gonna walk you through when and how to use statistical tests and how to interpret your results.

Now let's imagine that we have a research question, and it's about the height and the weight of people living in Ireland. Of course we can't measure the height and the weight of the entire population, so instead we take a random sample of the population and we measure the weight and the height of that sample. And we collect some additional information, like gender and age group, from each of the people in our sample, and we arrange these data in a spreadsheet or data set with the various attributes in columns. These are called variables, and these variables will be the object of our inquiry.

[Music]

Now, most data sets that you work with will contain two types of variables: categorical and numeric variables. Categorical variables, like gender, contain categories, as the name suggests. Think of them as groups or buckets that the data can be arranged into, in this case males and females. Numeric variables, like height, are numbers, as the name suggests, and can be arranged on
a number line.

Now, to better understand our data and to make sense of it, we summarize it and we visualize it. In the case of categorical data, we can count up the number of observations in any given category, and we can represent them in a table and on a bar chart. And to summarize numeric data, we're firstly interested in the spread, the distribution of the data, so we might describe the range of the data and the interquartile range. We could also include the standard deviation. To get a sense of the middle of the data, we use the median, which divides the data into two equal halves, and we use the mean, which is the average. The mean is probably the most commonly used summary value to represent this kind of data. We can visualize that data using a box plot, which is a visual representation of the range, the interquartile range and the median, and of course we can create a histogram, and this gives us the shape of the data.

So I hope you can see that this process of summarizing and visualizing the data takes it from being just numbers and words on a spreadsheet and turns it into something that is meaningful to us, something that we can get our heads around, something that we can think about.

Now, in this very simple data set we've got two categorical and two numeric variables, and things start to get interesting when we start looking at combinations of variables. So, for example, we can take a look at a categorical and a numeric variable, like gender and height. We can group the data by gender, which is the categorical variable, and create a summary of the numeric variable, in this case height, that is separated out into those two groups. And looking at the summary, we can see that in our sample data men are, on average, taller than women.

What I want you to see here is that we've looked at a combination of a categorical and a numeric variable, but as you can imagine, there are other possible combinations of variables that we could have looked at. We could have looked at height and weight, which are
both numeric; we could have looked at gender and age group, both categorical. And in each case we might see either differences between groups or relationships between variables, and in each of these cases there are specific statistical tests that we can apply to see if what we are seeing in the sample data has implications for what we think about the wider population. Can we infer anything? Is what we are seeing statistically significant?

So let's take a quick look at the five most important combinations of data that we have, and we'll look at, firstly, what we might observe in our sample data given that sort of combination of data types, and secondly, what statistical test we might apply to determine whether or not we can infer anything about the wider population. So we might look at a single categorical variable, like gender, and we could do a one-sample proportion test. For two categorical variables, we would do a chi-square test. For a single numeric variable, we do a t-test. If we have a categorical and a numeric variable, we do a t-test, or an analysis of variance (ANOVA) if there are more than two categories in the categorical variable. And for two numeric variables, we do a correlation test.

Now, I'm going to come back to each of these scenarios and each of these tests, so don't panic at this point. What I want you to see is how the data can be divided up, and in just a few minutes we're going to take each of these scenarios and work through exactly what questions you can ask, how it is that you can apply statistical tests and, importantly, how to interpret your results.

Now, before we carry on, I just want to say a big thank you to BioMed Central, or BMC, for sponsoring this video. BMC are a publishing company that publish open access journals, and that means that the full text of all of the papers published is available for free to anyone in the world. I'm the editor-in-chief of one of the journals that they publish, called Globalization and Health, and I'm genuinely impressed with them as a company. I
believe that they have integrity, and I honestly believe that they are making the world a better place. They have a portfolio of over 300 journals that they publish, so check them out at biomedcentral.com. I'll put a link in the description below.

At this point I want to say this: it's not good science to take a data set and just randomly stab around blindly, hoping to find something that's statistically significant. Before you interrogate the data, you start off by defining your question and your hypothesis, you define your null hypothesis, you identify the alpha value that you're going to use, and then you analyze the data.

So let's look at what we can do with just one categorical variable, like gender. We might ask the question: is there a difference in the number of men and women in the population? Now, we could state that as a hypothesis, which is that there is a difference between the number of men and women in the population, and we could check to see whether or not we think that that is the case. And when we look at our sample data, well, we do in fact see that there's a difference in the proportion of men and women. So should we get excited? Well, no, not yet. Remember, this is just sample data. We could have, by chance, selected a sample that just happened to show a difference.

So let's consider the possibility that, in actual fact, there is no difference in the number of men and women in the population, and we call that our null hypothesis. And if that were true, how likely would it be, what are the chances, what is the probability that we would see the difference that we have observed, or a greater difference for that matter? If we can show that that probability is low, then we can have a degree of confidence that the null hypothesis is wrong, and we can reject it. But before we calculate this probability, which we're going to call our p-value, we must be clear about how small is small enough. Below what value of p would we reject the null? And we must decide on that cutoff before we calculate the p-value
and we call that cutoff the alpha value. For the rest of the examples in this video we're going to use an alpha value of 0.05, or 5%.

So we've really got two scenarios. We've got the null hypothesis, which is that there's no difference, and the alternative hypothesis, which is that there is a difference. The next step is to apply a statistical test, and in this case we're doing a one-sample proportion test, and we generate a p-value. If the p-value is less than the alpha, then we can reject the null hypothesis and state that the difference that we observe is statistically significant.

If we add another categorical variable, in this case age group, we may have a research question like: does the proportion of males and females differ across these groups? So our hypothesis is that the number of men and women that we observe is dependent on the age category that we look at. In other words, the proportions change, or depend on, or are dependent on the age category. Now we collect our sample data, we look at it, and we can see that yes, in fact, the proportions do change across the age groups. In other words, in our sample data the proportions are dependent on age category. Now, is that due to chance? Well, let's test the idea that the proportions are all the same, or that they are independent of age category. That's our null hypothesis. Here we can conduct a chi-square test, and that gives us a p-value, and if the p-value is less than the alpha, we can reject the null hypothesis and state that our observation is statistically significant.

If we want to look at just one numeric variable on its own, like height, then we don't have any groups to look for differences between, and we don't have another numeric variable to look for some sort of association or relationship with. So what questions can we ask? Well, we might have some theoretical value that we want to compare our data to. For example, in the case of average height, we might have some historic data. We might wonder if the current
population is significantly different from that historic data. So our question might be: is the average height different from a previously established height? Let's imagine that the previously established height was 1.4 meters. We want to know if the average height in our current population is different to that. Our hypothesis is that there is a difference. Again, we collect some sample data, and we find that the average height is indeed different from the historic height. Is that statistically significant? Well, if there were no difference, what would the chances be that we observed the difference that we do, or a greater difference? We conduct a t-test comparing the averages, and if the p-value is less than the alpha, then we can reject the null hypothesis and state that the observed difference is statistically significant.

Now let's consider a categorical and a numeric variable, and ask the question: is there a difference between the average height of men and women? In this case our hypothesis is that there is a difference. In our sample we do observe a difference. Let's assume that there's no difference. We conduct a t-test, which gives us a p-value. If the p-value is less than the alpha, we'll reject the null, and we state that the observation is statistically significant. If we had a categorical variable with more than two categories, like age group, which has three categories, then instead of doing a t-test we would do an analysis of variance, or ANOVA.

Now let's look at the last combination of variable types in this data set: two numeric variables, height and weight. Here we might start with the question: is there a relationship between height and weight? Our hypothesis is that there is a relationship. We collect sample data, we look at it, and look, we do see some sort of relationship. Is it real? Well, let's assume that it's not. Let's assume that there's no correlation between the two variables. And if it weren't real, then what are the chances that we'd see the relationship that we do? And
here we conduct a correlation test. Now, a correlation test is going to give us two things. Firstly, it's going to give you a correlation coefficient, which tells us something about the nature of the association between the two variables, and I'm going to talk about that in just a minute. But of course it also gives us a p-value, and again, if the p-value is less than the alpha, we can reject the null hypothesis and state that the correlation that we see is statistically significant.

And the correlation that we see can be represented by a number that we call the correlation coefficient, so let's talk about that for a second. The correlation coefficient is a number between negative 1 and 1, and it looks at the relationship between two numeric variables. If, as the X variable gets larger, the Y variable gets smaller, we say that they are negatively correlated. If they are perfectly negatively correlated, then the correlation coefficient will be negative 1. If there's no relationship between the two variables, then the correlation coefficient will be 0. And if there's a perfectly positive correlation, as X goes up Y goes up, then the correlation coefficient will be 1. And of course you can have any value in between. And by the way, it doesn't matter which of your variables is on the x-axis and which is on the y-axis; the correlation coefficient will be the same.

Of course, we've only just been able to scratch the surface in terms of what there is to learn about statistical analysis. If you want to learn more, then go to learnmore365.com, and I've got some courses there that you'll love. If you'd like to learn about R programming, which is a programming language that gets used for statistical analysis, and it's free, it's very powerful, it's easy to use, it's absolutely fantastic, I have a YouTube channel that focuses specifically on that. So that's R Programming 101. I'll put a link in the description below; go and check it out. Otherwise, please subscribe to this channel, hit the bell notification if you want notification of
future videos, leave your comments below, and share this video with anyone that you think might find it useful. Until next time, take care.
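
As a companion to the transcript, here is a minimal Python sketch of two ideas from the video: comparing a p-value to a pre-chosen alpha for a single categorical variable, and computing a correlation coefficient for two numeric variables. It is not the presenter's own code. The video does not say how the one-sample proportion test is computed, so the sketch assumes an exact two-sided binomial test as one common way to do it. The function names and all sample numbers are hypothetical and chosen only for illustration.

```python
from math import comb, sqrt

ALPHA = 0.05  # the cutoff used in the video, fixed before computing any p-value


def proportion_p_value(count, n, p0=0.5):
    """Exact two-sided binomial p-value for a one-sample proportion test.

    Null hypothesis: the true proportion is p0 (0.5 means "no difference").
    The p-value adds up the probability of every outcome that is at most as
    likely as the observed one, i.e. the observed difference or a greater one.
    """
    probs = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    cutoff = probs[count] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(p for p in probs if p <= cutoff))


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient, a number between -1 and 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical sample: 60 men in a random sample of 100 people.
p_value = proportion_p_value(60, 100)
verdict = "reject" if p_value < ALPHA else "fail to reject"
print(f"p = {p_value:.4f}: {verdict} the null hypothesis at alpha = {ALPHA}")

# Hypothetical heights (m) and weights (kg) for six people.
heights = [1.55, 1.62, 1.68, 1.74, 1.80, 1.88]
weights = [52.0, 61.0, 64.0, 72.0, 79.0, 90.0]
r = correlation_coefficient(heights, weights)
print(f"r = {r:.3f}")
# Swapping x and y leaves the coefficient unchanged.
print(f"r (axes swapped) = {correlation_coefficient(weights, heights):.3f}")
```

With these made-up numbers, the 60/40 split gives a p-value a little above 0.05, so the null hypothesis of equal proportions is not rejected even though the sample shows a difference. The sketch only computes the correlation coefficient, not its p-value; in practice a statistics package such as R or SciPy would report both.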
Info
Channel: Global Health with Greg Martin
Views: 986,765
Rating: 4.9554343 out of 5
Keywords: statistics, stats, statistical analysis, data, data science, analysis, research, quantitative, p value, hypothesis testing, statistical tests, t-test, chi squared, ANOVA, correlation, data types, categorical, numeric, method, methodology, public health, global health, epidemiology
Id: I10q6fjPxJ0
Length: 12min 50sec (770 seconds)
Published: Mon Jun 10 2019