Statistics 101: Introduction to the Chi-square Test

Video Statistics and Information

Captions
Hello, and welcome to my video series on basic statistics. Now, two notes before we get going. Number one: I will most often just use the word "stats." There are fewer S's and T's crammed together in "stats," and therefore I am less likely to trip over my own tongue, which happens often. Number two: these videos are geared towards individuals who are relatively new, or who just need to review the basic concepts in stats. So if you have advanced study in quantitative methods, these videos are probably a bit below what you would need. Also, if you do have advanced study in quantitative methods, just keep in mind that I am simplifying some of the concepts for those who are new to the topic. So, all that being said, let's go ahead and dive right in.

In this video we will be doing an introduction to the chi-square test. This is one of the most often misunderstood tests in hypothesis testing. So we're going to do a couple of things. We're going to set up a more complex problem that we will actually solve in the next video, but we'll talk about it in this video. After we talk about that data, we'll look at some graphs that help us understand that data better. And then we will actually do a simple chi-square test, step by step, so you can see exactly how the numbers are calculated, which is actually fairly simple, and then how we interpret that.

So let's go ahead and talk about our problem. Now again, this problem will be solved in the next video. In this video I will actually be doing a simple example that we will then apply to this more complex problem, so we'll see this again in part two. So, you work in the Office of Institutional Research at a small but growing regional four-year university. Over the past five years, the number of undergraduate students at each level (freshman, sophomore, junior, and senior, and then we have unclassified too, which are sometimes high school students or others) has changed. So we have had variation in our student headcount over this five-year period. Now, here are our questions. Now, even
though some headcount random variation is inevitable, is that variation beyond what we would expect due to chance alone? There is a lot packed into that question, so I want to explain a couple of things. Just out in the world, when we count the occurrences of things, if we count an occurrence today, and then we count the same thing tomorrow, and then the day after that, and the day after that, there's going to be a natural random variation in the number that we count. So maybe I go out to a busy stoplight and I count the number of cars that go through it during a 15-minute interval, say during rush hour. Well, if I do the same thing tomorrow, I'm going to get a different number. If I do the same thing the day after that, I'm probably going to get a different number. But those numbers are probably going to be close together, even though they vary a little bit just, you know, randomly. So what we're trying to ask here is: is that variation beyond what we would expect just due to the normal random chance variation?

Now, what types of graphs can we use to better visualize our data? And how can a chi-square test help us rule out that variation due to chance alone? So where is the threshold by which we can say, "Wait a minute, that change is just beyond what we'd expect due to chance alone"?

So here is our student headcount. Now, this is actual data, by the way; I did not make this up. So we have the years 2007 through 2011, then we have the class levels (freshman, sophomore, junior, senior, and then we have the unclassified), and this tracks the student headcount, or enrollment, you can think of it, during the fall semester of each one of these years. So go ahead and take a look at that and we'll talk about what we see. One thing I notice is that for junior and senior, the headcount goes up quite a bit over that amount of time, and the same thing for unclassified. If I look at freshman and sophomore, it kind of goes up and down; there is no real, you know, pattern as far as straight up or straight
down.

Now let's go ahead and do some graphing options so we can visualize this data better. Now, one of my credos is: graphs are your friend. Graphs are awesome. Use more graphs. Take advantage of our ability to understand data visually. A column of numbers is one thing, but making it visual is a whole nother thing, and graphs can really help with your classmates, or your instructor, or your co-workers, or your dean, or whoever else you might be presenting this to. Take advantage of our ability to understand data visually. So in this problem we're going to consider a simple line graph, a stacked bar chart, a stacked percentage bar chart, a stacked area chart, a stacked percentage area chart, and then a spider or radar diagram, and we're going to talk about what each one does for us as far as interpreting or understanding our data better.

So here is our simple line graph, and we have our five class levels over on the right-hand side, denoted by each line. Now, as you can see, if you start at the bottom, the special category seems to go up over time. The junior, which is the green line, goes up. The senior line, which is the purple, goes up over time. But then we have sophomore, which kind of goes up and down, and then the same thing for the freshman: it starts up, goes down, comes up, and then kind of goes down again. So when we talk about variation in our data across the years, this is what we're talking about. Now, because of the natural way enrollments work, and other things in, you know, society and nature work, there's going to be some natural random variation. We cannot expect, unless we do some serious quota filling, to have the exact same enrollment every year. So we're going to have natural variation. Now, what we're trying to figure out is: is that variation within what we would expect by just sort of random chance alone?

Now here is a stacked bar chart, which is another way of looking at our data. So within each year we have the number of students in each class level, and then they're stacked on top of each other.
So what does this do for us, you know, that's new? Well, it helps us see proportional enrollment. So you can see the freshman, which is there in the blue bar: not only can we see its pattern over time (you know, down a little bit, up, and then down a little bit), but we can also see its size relative to the other class levels. So it seems to be about twice as much as sophomore, which is about twice as much as junior, depending on the year you're looking at, and then senior and our special category. So it helps us see relative, in this case, enrollments, you know, one class level as compared to another.

Here is our stacked percentage chart. Of course, in this case each class level is described by the percentage of the total enrollment it occupies for any given year. So if we look in 2007, we can see that freshmen were approximately 38% of our total undergraduate enrollment, or headcount. In 2008 that went down to about maybe 33%. By the time we got to 2011, we were almost all the way down to 30%. Of course, the entire enrollment takes up 100%, so it's all relative here again. And you can see that the special category there at the very top gets bigger, so it takes up a larger percent of our enrollment. Same thing for seniors, it appears, and juniors. And then sophomores seem to narrow in their percentage as we go across time. So this helps us look at the relative percent for each year.

Now here is a stacked area chart. This is very similar to the stacked bar chart, except we take that data and we go all the way across the graph with it. So this tells us a few things. If we look at the very top of the graph, we can see that it increases. So what we can say is that our overall headcount, or overall student enrollment, increased over this time period from, you know, the mid-1400s up to above 1,600. Now, as far as the individual bands, of course those represent each class level. So if we look, the senior, the purple, seems to widen over time; the junior, the green, seems to widen over time; the sophomore
level seems to narrow a bit, and the freshman seems to narrow a bit. So again, with this visual information, we can sort of form, you know, some ideas in our head about how this has changed over time. It seems that our freshman and sophomore enrollments went down a bit, but our junior and senior and special category enrollments went up over this time. And actually, with this university, I have a hypothesis, at least, for why that happened, but maybe we'll talk about that next video.

Now here is our stacked percentage area chart, and this is very similar to our stacked percentage bar chart, where each band represents a percentage of the total. So again you can see that overall, the freshman band seems to narrow as a percentage; same thing with the sophomore percentage. And the junior: you've got to look at two things here, does it get wider, and sort of its direction. So it does seem to get wider, and then same thing with the senior there in the purple, and the special category there on top. So what can we say about this? Well, it seems that our freshmen and sophomores combined are taking up a smaller percentage of our overall undergraduate enrollment, and then our junior, senior, and special categories are taking up a larger percent over this time. So again, we have variation in our enrollment, but our question is: is it within just random chance variation, or is there something else going on?

Now the last one we're going to look at is the spider diagram, and again, this is an often underused diagram that I think can be very helpful. Now, of course, each grade level is represented by a different color, and in the center, we can call that a hub, kind of like the hub of a wheel if you'd like, and then radiating out are the years. So each spoke coming out of that center is a year: 2007, '08, '09, 2010, and 2011. Then of course we plot the number of students in each grade level along that spoke. Now what does this tell us that's new? Well, if you notice, as we swing around the spider diagram, as we get towards 2010 and 2011, we have
like a bulge in the special, the juniors, and the seniors, so the lighter blue, the green, and the purple. If you remember the other graphs, that was apparent in our area and stacked bar charts, because the juniors, the seniors, and the special category are becoming a greater part of our overall enrollment. Then if you look at the red, it doesn't change a whole lot over time; sometimes it comes in a bit and goes back out, and comes in a bit and goes back out, but not by a whole lot. And then the blue starts way out, almost all the way to 600 students, comes back in to 500 in 2008, goes back out to the middle, stays in the middle, and then comes back in again at 2011. So it helps us sort of see bulges in our spider or radar diagram, to see where our changes have been.

Okay, so let's actually get to the heart of this video, and that is: what is a chi-square test? Now, first and foremost, make sure you pronounce it correctly. It is chi-square, "chi" as in "kite." Not "chee" as in "cheetah," it's not "chee-square"; or "chai" as in "chai tea," it's not "chai-square." It is chi-square. So I've been in about ten different stats classes between my undergraduate and my graduate work. Every class, every one of them, someone has said, "I don't understand the chee-square," or "I have a question about the chai-square." It's chi, chi-square, as in "kite." So don't be that person in your class.

Okay, what does it do? It helps us understand the relationship between two categorical variables, and that's very important: they have to be categorical variables. So what do I mean by that? Well, grade level, that's one example in this case, so we have freshman, sophomore, junior, senior, and then special or unclassified. Sex, male or female, if we think of it as a binary category. Age group, so we could have, you know, you've probably seen them: are you in the age group 18 to 25, or 26 to 35, or 36 to 45, whatever. So those are categories if they're put in groups. Years, of course we have years in this example, et cetera. So the important thing here is it has to be categorical variables. Now, chi-squares
involve the frequency of events, or the count. So we're only dealing with counting things; we are counting members of these categories. We're not dealing in percents, we're not dealing in anything like that. We're dealing in frequencies, counting. Now, it helps us compare what we actually observed with what we expected. Okay: observed versus expected, oftentimes using population data. I won't go into all that right now, but, you know, that's every member of a certain category; we denote that as population or theoretical data. And actually, when we do our example, we're going to be using a theoretical data event, I guess you could say. Now, chi-squares assist us in determining the role of random chance variation between these categorical variables. So the relationship is going to change, but the question is: is that change within a certain limit we set that would account for just random variation? And finally, we use the chi-square distribution. Now, if that just went over your head, do not worry; for this video it's not important. Just know that we use it, and I put it in here just to be technically correct. And within that, we use what's called a critical value, which I'll explain here in a little bit, to accept or reject our hypothesis. Okay, so if all this talk about hypotheses and the chi-square distribution and critical values has your mind going in crazy directions right now, don't worry. The example we're going to do is going to be so crystal clear, step by step, that you'll have it down pat.

Now let's look at our headcount changes over time again. So we can see we have a couple of categories that seem to go up over time, almost in like a very flat straight line, and then we have a couple, sophomore and freshman, that kind of bounce all over the place. So we have variation. We just want to know: across these categories, grade level and year, is the variation that occurs due to random chance alone, or is there something else going on in this data?

Now, in this video we're going to use a very simple experiment;
it's very common when talking about the chi-square, and that is the dice experiment. And I'm going to set it up maybe a little bit differently than other people have. So here is our example. Let's say I have two dice in my hand, okay? And just in case you're not familiar, you know, dice are the six-sided cubes that are often used in games, especially like gambling games, and we have one through six on each side. So I have two dice in my hand. One is fair and the other is 1-5-6 loaded. That means it favors the numbers 1, 5, and 6 due to alterations in its weight. So some people that cheat at casinos swap out the actual casino dice with weighted dice to get the numbers they want. Okay, so I give you one of them; one is fair and one is loaded. Now I ask you to determine if it's the fair die or the loaded die I just gave you, and I want you to be 95% confident in your conclusion.

Now, to do that, what I'm going to ask you to do is, over the next six days, I want you to roll that die 100 times each day, for a total of 600 rolls, okay? And then record how many times each number occurs over those 600 rolls. So you're doing 100 rolls each day for six days, 600 total rolls, keeping count, a frequency, of how many times each number comes up.

Now let's assume, okay, that the die I gave you is fair. Let's assume that. What would we expect to happen? What would we theoretically expect to happen over these 600 rolls? Now, if the die is fair, if we rolled it 600 times and we have six numbers on the die, we would theoretically expect each number to come up 100 times. So: six numbers, 600 rolls, each one has the same probability of coming up, so theoretically we would expect 100 of each number to occur.

So here's how we're going to state this hypothesis, and again, this is one of the more complicated slides in this, so hang with me. First we have what's called the null hypothesis, and that's represented by H sub zero. So if you've been in
a stats class, you've probably seen something like this. Now, our null hypothesis is that the die is fair. Then we have our alternative hypothesis, which is denoted by H sub 1, and our alternative hypothesis is that the die is not fair. Okay, so we have the null that says the die is fair; we have the alternative that says the die is not fair. Pretty straightforward. Now, what is the everyday, sort of English way of saying this? Is the variation in our observed data simply due to chance, or is the variation beyond what random chance should allow? Or: how far can our data vary before we have to reject the null hypothesis and conclude that the die is not fair, which is our alternative? So we're going to have some variation, but we need to know if that variation occurs within limits we set.

Now, I asked you to be 95% confident, so that creates what's called a p-value of 0.05. So again, if that p-value concept kind of goes over your head, don't worry about it too much. Now, another way of thinking about the p-value is: what level of tolerance are we willing to put on this variation? If our tolerance is pretty loose, we might have a p-value of 0.1, or sort of 10%. If we want the tolerance for the variation in our data to be very narrow, if we want to be very strict, we might choose a p-value of 0.01, or 1%. So I've sort of picked the medium, which is 0.05, which is the most commonly used in a lot of social science research.

Okay, so degrees of freedom. Oh goodness. This is one of those concepts that gets flung around in stats classes and never gets explained, at least in my experience, very well. And guess what? I'm not going to explain it in this video either. Now, for this test, our degrees of freedom, df, are simply the number of categories we have, which is six (we have six numbers), minus one. Okay, so in this example, just kind of take it as it is that our degrees of freedom are six minus one, which equals five. Now, in the next video, when we have more
complex categories, the degrees of freedom will be figured a bit differently, but for this example it's just six minus one, or five.

Now we have a concept called the chi-square critical value. Well, what is that? Well, the chi-square critical value is sort of the threshold. It is the point where we just have to conclude that our variation is too great to be explained by chance alone, and therefore we would have to reject our null hypothesis over there. So the easiest way to find this, actually, is in Excel. Excel has a built-in function, sort of the chi inverse, or CHIINV, and you just give it two inputs. You give it your p-value, which in our case is 0.05, then you give it your degrees of freedom, which in our case is five. Then it spits back a value, a critical value, of 11.07. So when we do this, kind of keep that number in the back of your mind: our threshold, our chi-square critical value, is going to be 11.07. So what that means is that if our die's chi-square is greater than 11.07, then we have to reject our null hypothesis and claim the die is not fair; the variation is just beyond what we would expect by normal random chance, or normal random variation. So if we get a chi-square that's greater than 11.07, we've got to throw the null hypothesis in the garbage and just accept the alternative hypothesis, which states the die is not fair.

Let's go ahead and do this step by step. Okay, so here is our expected frequency, which we talked about. Now, on the right-hand side is what our data actually produced; these are our actual observations. So six days later you come to me and say, "Here are my observations." So the number one came up 111 times, number two came up 90, number three 81, et cetera, and of course that adds up to 600 total rolls. So those are our observed frequencies when we actually did the experiment. So here's the first step in figuring out our chi-square. The math is very, very easy, okay? So I know you're smart and you can do it, so let's go ahead
and just do it step by step. The first step is we take our observed, or observation, minus what we expected. So as you can see on the right-hand side, it's simple subtraction. We take our observed, which in the first case is 111, minus our expected, which was 100 because we expect it to be a fair die. So 111 minus 100 is 11, and then we just do that all the way down that column. That's it. That's step one: observed minus expected.

In the next step, we take that observed minus expected and square it. That's it. So for number one, remember, we had 111 minus 100, which was 11, and then in this step we just square it, which is 121. Then we do that for each of our numbers. So step one we subtract, step two we square it. That's it.

Now in step three, we take what we got in step two, which was the squaring, and we divide that by what we expected, which is E. Okay, so in all of our cases we expected 100. This is actually very simple division; it's just moving the decimal place. So for the number one, we had 121 divided by 100; that's 1.21. For number two, it's just, well, 1. For number three we had 3.61, and for number four we had 0.04, and on down, five and six. So that's step three. Just remember: step one is subtraction, step two we square that, and then step three we divide that by our expected, which in this case was 100. So it's very easy.

Now in step four, we just add all those up. So in our right-hand column again we had 1.21, 1, 3.61, 0.04, et cetera. We just add all those up; that's what the summation sign at the top of the slide means. And guess what, folks? We just did our chi-square. We're done, at least with the math part. So our chi-square value for this experiment was 12.26. Of course, that doesn't mean anything yet, until we actually interpret it, but the chi-square value for this is 12.26. Now remember, our critical chi-square value was 11.07. And guess what? 12.26 is greater than 11.07, so
therefore our die's chi-square value is greater than 11.07, so we have to reject our null hypothesis, which said the die is fair, and claim that the die is in fact not fair. So we have to accept our alternative hypothesis, because our chi-square was greater than our critical chi-square based on our Excel formula.

So how do we interpret that result? Now, if we actually use the APA format, which I encourage you to do, depending on your discipline of course, what we would say is that the observed frequency of each number on the die differed significantly from what would be expected on a theoretically fair die. Of course we have our chi-square; 5 is our degrees of freedom, and N is the number of times we rolled the die, and that equaled 12.26, and then our p-value was .05. So if you were writing this in a journal, that's exactly how it would look in APA format. Now, our problem chi-square was 12.26 and our critical chi-square was 11.07; therefore our variation was too great to be explained by chance alone. Therefore we must reject our null hypothesis, which is that the die is fair, and accept H1, which was our alternative hypothesis, and say the die is not fair. So we are 95% confident that you have the loaded die in your hand.

Now I want to talk about just the effect of choosing a p-value. Remember, I said the p-value is sort of how strict we're willing to be on accepting random variation. If we change the p-value, we're going to sort of change the threshold of what we're willing to accept as random chance. Now, this is the same slide with a few numbers changed. So the null hypothesis is still that the die is fair; the alternative is that the die is not fair. Same interpretation, okay, talking about variation here. Here's what we changed: instead of being 95% confident, I want you to be 99% confident. So we're going to have a p-value of not 0.05; now it's going to be .01. So we're going to be much, much more strict on interpreting our variation. Now, what that means
is that we're going to need a lot more variation in our observations in order to reject our null hypothesis. We're going to need much more variation in our observations to reject our null hypothesis, because we've selected a much more strict p-value. Now, the degrees of freedom are the same. Now, for the chi-square, we changed our p-value, so we're going to have a new critical value. So when we put this into Excel, we change that to 0.01; our degrees of freedom are five. Now our critical value is 15.09. Remember, last time it was a little bit over 11; now it's over 15. So you can see that it requires much more variation to go over that threshold, because we've picked such a strict p-value. So therefore, if our die's chi-square is greater than 15.09, then we reject our null and claim the die is not fair.

So again, in APA format we would say the observed frequency of each number on the die did not differ significantly from what we would expect on a theoretically fair die. And everything there is the same: five degrees of freedom, 600 rolls, our problem chi-square was 12.26. That didn't change, but our p-value did, so we have a p of .01. So our problem chi-square was 12.26; our critical chi-square, with a p-value of 0.01, was now 15.09. We did not cross that threshold; therefore our variation was not too great to be explained by chance alone. Therefore, in this case, we have to accept our null hypothesis and conclude the die is fair, and we are 99% confident that you have the fair die.

Now wait a minute. You have the same die in your hand, but in the first case, when p was .05, we concluded you had the loaded die. Now, with a p of .01, we conclude you have the fair die. What? Now, this is my point on changing the p-value, or changing the strictness with which we are willing to explain variation. So with the p of 0.05, we did not need as much variation to overcome the critical value. As you noticed, the critical value went up significantly when we
changed the p to .01. So it's just much more strict; we have to have more variation to cross that sort of 99% confidence threshold, because we selected such a low p-value.

All right, let's quickly review: what is a chi-square test? Remember, it is chi-square, not "chee-square" or "chai-square"; it's "chi" as in "kite." It helps us understand the relationship between two categorical variables; that's important. Chi-squares involve the frequency of events, the count, so in this case we were counting the number of times each number comes up. It helps us compare what we actually observed with what we expected. Chi-squares assist us in rejecting, or ruling out to some extent, random chance variation between categorical variables. And we use the chi-square distribution, which we didn't talk about (it's not important to what we're doing), to accept or reject our hypothesis regarding random chance. Now, basically what that means there is that we were able to put into Excel our degrees of freedom and our p-value, and it generated a critical chi-square value that we have to surpass in order to reject our null. That's all that little thing means down there.

Okay, so that's our review. Just a reminder: in our next video we will actually be looking at the data we started with. So in this case, instead of having a die with six numbers on it, we have five class level categorizations, freshman through unclassified, and then we have five years. So we have five grade levels and five years for our categories, and then we are going to determine whether or not the variation present in this data can be explained just by, you know, random chance, just the random chance that comes with enrollment.

All right, so that is our in-depth introduction to the chi-square. Hopefully you learned a lot, and I look forward to seeing you again in our next video, when we look at that enrollment data in a more complex example. Again, thank you very much for watching, and I look forward to seeing you again next time.
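
The four steps walked through in the captions (subtract, square, divide by expected, sum) amount to the statistic chi-square = sum of (O - E)^2 / E. The following is a minimal Python sketch of the dice example, not the presenter's own material. The count for face 1 (111) and the per-face terms 1.21, 1, 3.61, and 0.04 come from the captions; the remaining counts are assumptions, back-solved so that they agree with those terms, the 600-roll total, and the final chi-square of 12.26. The function name is hypothetical. The critical values 11.07 and 15.09 are taken as quoted in the video rather than computed; if SciPy is available, scipy.stats.chi2.ppf(1 - alpha, df) should reproduce them.

```python
# Observed counts for faces 1..6 over 600 rolls.
# ASSUMPTION: only face 1 (111) and the terms 1.21, 1, 3.61, 0.04 are stated
# in the captions. Faces 2-6 are back-solved to match those terms, the
# 600-roll total, and the final chi-square of 12.26; the order of the last
# two counts (124 and 92) is a guess and does not affect the statistic.
observed = [111, 90, 81, 102, 124, 92]

# A fair die rolled 600 times is expected to show each face 100 times.
expected = [sum(observed) / len(observed)] * len(observed)


def chi_square_statistic(observed, expected):
    """Steps 1-4: subtract, square, divide by expected, then sum."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Critical values for df = 5 as quoted in the video (from Excel's CHIINV).
CRITICAL = {0.05: 11.07, 0.01: 15.09}

statistic = chi_square_statistic(observed, expected)
df = len(observed) - 1  # six categories minus one

print(f"chi-square = {statistic:.2f}, df = {df}")
for alpha, critical in CRITICAL.items():
    if statistic > critical:
        decision = "reject the null (die is not fair)"
    else:
        decision = "do not reject the null (die is fair)"
    print(f"p = {alpha}: critical value {critical} -> {decision}")
```

Under these assumed counts, the statistic clears the 11.07 threshold but not the 15.09 threshold, which is the same flip in conclusion that the video uses to show how the chosen p-value changes the outcome.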
Info
Channel: Brandon Foltz
Views: 344,048
Rating: 4.9107966 out of 5
Keywords: brandon foltz chi square, statistics 101 chi square, chi square statistics, chi square tutorial, what is chi square test, chi square test tutorial, chi test, chi-square test, chi square test, chi-squared test, chi-square, chi square, chi square test explained, chi square distribution, chi-squared distribution, brandon c. foltz, brandon c foltz, statistics 101, fair die, anova, hypothesis testing, logistic regression, chi, square, brandon foltz, statistics, Linear regression
Id: SvKv375sacA
Length: 37min 39sec (2259 seconds)
Published: Tue Jul 31 2012