Statistics 101: Is My Data Normal?

Video Statistics and Information

Captions
Hello, thank you for watching, and welcome to the next video in my series on basic statistics. Now, a few things before we get started. Number one: if you're watching this video because you are struggling in a class right now, I want you to stay positive and keep your head up. If you're watching this, it means you've accomplished quite a bit up to this point. You're very smart and talented, and you may have just hit a temporary rough patch. I know that with the right amount of hard work, practice, and patience, you can get through it. I have faith in you, many other people around you have faith in you, and so should you. Number two: please feel free to follow me here on YouTube or on Twitter, or connect with me on LinkedIn. That way, when I upload a new video, you know about it. And on the topic of the video, if you liked it, please give it a thumbs up, put it on a playlist, or share it with colleagues or classmates, because that does encourage me to keep making them. On the flip side, if you think there is something I can do better, please leave a constructive comment below the video, and I will try to take those ideas into account when I make new ones. And finally, just keep in mind that these videos are meant for individuals who are relatively new to stats, so I'm just going over very basic concepts, and I will be doing so in a slow, deliberate manner. Not only do I want you to understand what's going on, but also why. So, all that being said, let's go ahead and get started.

So this video is the next in our series on probability distributions. In previous videos we talked about things like the uniform distribution, the Poisson distribution, the binomial distribution, and, probably most importantly, the normal distribution. Now, one question we really have not asked yet is: does the data we have fit the normal distribution? Because the truth is, not all data does fit the normal distribution. There are all kinds of other distributions that exist, but in many cases advanced techniques in statistics assume that your data is
normally distributed. So that's the question we're looking at here, and we're going to talk about a few visual tools that don't involve any calculation, so you can begin to see whether or not your data is normally distributed. Let's go ahead and dive right in.

Now, we'll begin this video with a little sermon. I could have done this with all my other stats videos, but it seems applicable to this one as we get into more advanced techniques. Here it is: always look at your data graphically first, before starting all of the fun, cool, whiz-bang analysis. So many students I work with are really gung-ho about putting their data into Excel or SPSS or SAS or whatever else that might be and running their statistical tests. What they never did was look at the data graphically first, so their data may have violated some assumptions that were required for that test. So get to know the data. Any statistician or researcher or operations person in business or market researcher or whatever else will tell you that if you can get to know the data inside and out, how it's structured, sort of its personality, I guess you could say, it really helps you as you do your analysis, write up your reports, and then of course usually present it to either colleagues or your classmates, whatever else it might be.

So get to know the data, and the first way to get to know the data is to look at it graphically. It helps us find patterns if they exist, it highlights potential problems in our data, which we'll talk about here in a minute, and we can garner some of the initial relationships in our data. This is all before we start doing some, you know, crazy sophisticated analysis. So number one, look at your data graphically first, because as the saying goes: garbage in, garbage out. If you use data that violates an assumption to do some statistical test, then you're going to get garbage on the other end. So always do this step first.

Let's talk about some graphical data exploration techniques. Now, the way we're
doing this is that, by using a few simple visual tools, we can learn a tremendous amount of information about our data. Our data may have excess skew, which is a fancy way of saying it's lopsided. It may have excess kurtosis, which is another way of saying it has very fat tails: the distribution on the ends has a lot of probability in it. It could be bimodal, meaning it has two humps in it, sort of like a camel's back. Or it could follow a distribution other than the normal distribution; as I said, there are many other types of distributions that data can follow.

Now, in this presentation we will briefly discuss the following tools to determine if our data is, quote, normal. I'm going to point out here that I'm just giving you a brief overview of each one of these tools. I'm not going to go in depth into how you would calculate them in certain software packages or anything like that. I'm just going to talk about what they are and how they are used, and then probably in other videos we'll talk about how to calculate them or how to make them more beneficial. But for this video it's sort of a brief overview of each one. So we're going to talk about histograms, which you've probably heard of; stem-and-leaf plots; box plots, which are also called box-and-whisker plots; P-P plots; and Q-Q plots. So we're going to talk about these five visual tools that you can use, sort of at the preliminary stage, to get a better feel as to whether or not your data fits the normal distribution.

And here's why, and I said this before: many statistical techniques assume that data fits a normal distribution. So if your data has high skew or high kurtosis or something else, and you put it into some sort of statistical technique, that's going to be a problem, because those techniques assume a normal distribution. I'm sure you've seen that a normal curve looks like this, and the question is: how can we tell if our data fits this shape? Does our data have what's called goodness of fit relative to this normal distribution? And one problem we can
have is called excess kurtosis, and no, that's not a disease, even though it kind of sounds like it. So we have a distribution, and what happens is that we have more probability than expected in the tails of the distribution, due to extreme values away from the mean there in the middle. So the probability values are pushed away from the mean and out towards the ends. Now, it's kind of crude, but a way to think about this is like a hamburger or something with too much, say, ketchup on it, and when you bite down it squishes out the sides. That's kind of how to think about kurtosis: we have excess out on the ends of our distribution, and if there's too much probability out there on the ends, that could violate the test as to whether or not this is normally distributed.

Another problem we can have is called excess skewness. It might look like this: there's more probability than expected on one side of the distribution versus the other. It's lopsided. So here you can see we have a lot of probability, a lot of values, over on the right-hand side, and then this long tail over on the left-hand side, and we call that skewness.

Now, another reality we have is that oftentimes data fits another type of distribution altogether. There's the log-normal distribution, the Weibull, the exponential, and the uniform, among others. So not all data is going to fit the normal distribution; it depends on what it is. Some data in biology might fit the exponential distribution better; data in, sort of, statistical quality control might fit the log-normal better. Again, it just depends on the data, so you have to get a feel for it, and don't assume that every data set fits the normal distribution.

Okay, so let's talk about the histogram. This is some data I have from previous videos that I use in a lot of examples, and this histogram is a frequency count of the daily returns for General Electric stock during 2012. So there are like 250 or 252 days or so in this data set, and you can see that there in the middle, that's a return day to day of
zero, zero percent, and then of course we have a little bit above zero and some below zero. To find the percentage, you would just move the decimal over two spots. So to the right we have two percent and, on the very end, four percent, and then to the left we would have a two percent loss and then a four percent loss. And then in the middle, the actual bars show the frequency with which we had returns in that range. Now, the intervals over which we count the frequency of values are called bins. So that's sort of the bar width: the width of each bar is called a bin, and then we simply count the number of occurrences that are inside that bin. Now, there's a whole sort of branch of statistics that deals with how wide our bins should be, and again, like I said before, I'm not going into the nuts and bolts of the histogram and bin size and stuff like that, but if you run it through a statistical software package, it will probably do it for you based on those guidelines.

So does this histogram look roughly like the normal curve? Well, I think so. SPSS, which I use, will actually do a histogram with the normal curve overlaid onto it, and you can see that the histogram fits pretty well within that normal curve. Now, there is sort of a warning here, and it kind of alludes to the previous slide: histograms can be misleading. The look of a histogram is largely dependent on the bin size, so sort of the width of the bars or the space between those tick marks. So again, you have to be careful when looking at a histogram that the bin size, or the width of the bars, is appropriate. But again, most software packages will do that automatically for you, the lone exception being Excel, unfortunately. So just keep that in mind when you're looking at histograms.

The next technique we're going to look at is the stem-and-leaf plot. So here is a stem-and-leaf plot from SPSS. It's sort of a sideways histogram, and again, I'm not going to go into all the specifics of a stem-and-leaf, but basically it breaks your data down into two component
parts: the stem, which you can think of as sort of a larger unit, and then the leaves, which break those larger units down into their component parts. So just for an example here, if you go down to the second row from the bottom, it has a frequency of five, and it says the stem is 2 and the leaves are 1, 3, 3, 3, 4. Well, notice at the bottom it says our stem width is 0.01, so we would take our stem width times our stem, and that would be 0.02, and then the leaves are the next digit. So we would have 0.021, 0.023, 0.023, 0.023, and then 0.024. See how that works? So again, this data is kind of weird when using stem-and-leaf plots, but another data set might have the tens digit as the stem and the ones digit as the leaf. Again, it just depends. So just think of it as a sideways histogram. If we were to take this stem-and-leaf plot and turn it on its side, it would look kind of like a normal curve. You can kind of see that; it's kind of a normal curve on its side. So again, we can use this graphical technique to see whether or not our data seems to fit the normal distribution, which in this case it does.

The next technique is called the box plot, and I'm sure you've seen one. It kind of looks like this. Box plots are relatively simple graphical tools for looking at the distribution of data. There in the middle, the black line is the median. Remember, the median is the middle value: if you put all your values in ascending order from lowest to highest, the median is the one in the middle, or, if there is an even number of values, it's the average of the two in the middle. So it's the middle value. One side of the box is called the first quartile, and then the other side of the box is the third quartile, and we call that, all together, the interquartile range. So that's sort of the middle 50% of our data.

So what should you look for in this box plot? Well, number one: is the box plot sort of symmetrical overall? Well, if you look at this one, the top half from the median looks pretty
much like a mirror image of the bottom half, so it does look pretty symmetrical. Are quartile 1 and quartile 3 approximately the same distance from the median? The median up to quartile 3 and the median down to quartile 1 are about the same size. The median to quartile 3 is a little bit bigger, but not much. Are the whiskers, the lines extending out from quartile one and two, I'm sorry, quartile one and three, approximately the same length? Again, the whiskers here, going from quartile three all the way up and from quartile one all the way down, do appear to be about the same length. So we can say that, again, this data set appears to be normally distributed. It's symmetrical overall, it's symmetrical in the interquartile range, and things of that nature.

Now, in SPSS, the little dots there on the ends are just points that are more than 1.5 interquartile ranges out from the box. Again, that may be over your head, but those are not outliers; those just mean that those are sort of the data values at the very extreme. And if you notice, they're about the same number: up above you probably have, what, five there, and down below you have four, so that's not out of the ordinary. Again, it's symmetrical even sort of at the extremes, which is a good sign when we're trying to determine whether or not it fits a normal distribution.

Now, the P-P plot. This is one that people come into stats classes having never seen before, so it's kind of new, and that's good. So this is the normal P-P plot of our stock return data, or daily stock returns. Now, what this is, is that in a P-P plot we compare the cumulative probability of our empirical data, which in this case is the stock returns, with an ideal test distribution. In this case we are testing our data against the normal distribution in terms of the cumulative probabilities. So in this example we're comparing our stock return data with the ideal normal distribution. Now, I will do videos on the P-P and Q-Q plots, so again, I'm just introducing you to these ideas.
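To make the P-P comparison concrete, here is a minimal Python sketch; it is not something shown in the video. It pairs each sorted value's empirical cumulative probability with the cumulative probability from a normal distribution fitted to the sample mean and standard deviation. The two data sets are simulated stand-ins, and the names `returns` and `skewed`, the seed, and the (i + 0.5)/n plotting position are all assumptions; SPSS may use a different plotting position.

```python
import random
from statistics import NormalDist, mean, stdev


def pp_points(data):
    """Return (theoretical, empirical) cumulative-probability pairs
    for a normal P-P plot of `data`."""
    xs = sorted(data)
    n = len(xs)
    # Normal distribution fitted to the sample (assumed test distribution).
    fitted = NormalDist(mean(xs), stdev(xs))
    # (i + 0.5) / n is one common plotting position; an assumption here.
    return [(fitted.cdf(x), (i + 0.5) / n) for i, x in enumerate(xs)]


def max_gap(points):
    """Largest vertical distance between the points and the 45-degree line."""
    return max(abs(theoretical - empirical) for theoretical, empirical in points)


rng = random.Random(0)
# Hypothetical stand-in for about 250 roughly normal daily returns.
returns = [rng.gauss(0.0, 0.012) for _ in range(250)]
# Hypothetical stand-in for a right-skewed (log-normal) data set.
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(250)]

print("max gap, normal-like data:", round(max_gap(pp_points(returns)), 3))
print("max gap, skewed data:     ", round(max_gap(pp_points(skewed)), 3))
```

Plotting each pair against the diagonal gives the P-P plot itself. The largest gap is only a rough numeric summary of how far the points stray from the line, not a formal test.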
Here are the questions to ask when you look at the P-P plot. Do the data points fall in a straight line? If our data matches the test distribution, which in this case is the normal distribution, they should. If you look at all those little individual dots, they fall almost in a perfect straight line. You can actually see the line right below the data points, and that's what we're looking for in a normal P-P plot.

The Q-Q plot is very similar. In the Q-Q plot we're comparing the quantiles of our empirical data with the ideal, in this case the normal. So in this case we're comparing our stock return data with the theoretically ideal normal distribution, and we would ask the same question: do the points fall in a straight line? Now, notice that in the extremes, sort of on the lower left and the upper right, they fall a little bit off the line. In a Q-Q plot that's not that uncommon, and there are some reasons for that, but those few points on the ends that fall off the line are not enough to say, nope, this does not fit the normal distribution. In many ways we're concerned with sort of the middle part of this line, and unless the extremes on the ends are really far off, we really would not let that concern us.

Okay, let's look at another quick example and then we'll be done. So, is this data normal? Here is our histogram without the normal curve, and here's our histogram with the normal curve, and again, this is a different data set. Well, this should give you pause. Look on the left-hand side: almost all of our data, all the probability, all the counts, are on the left-hand side. It's very lopsided over there on the left-hand side, so it looks like it has some skew to it, and we have a longer right tail that goes off towards the end. So that should worry us as to whether or not this fits the normal distribution. On the right-hand side, you can see that when we try to put the normal distribution over top of it, you know, a good, you
know, third of the normal distribution curve is off the chart, and that should be a problem.

Let's look at our stem-and-leaf. So here's our stem-and-leaf plot. Now, what do we notice about this? If we look at the top of our stem-and-leaf plot, which is actually the lower part of our data, look at all the data that's there. You can see that we have 2, 6, 10; that's 18 data points in that lower part. And if you look at the box plot, that's on the bottom, so you've got to think of this sort of upside down: in the stem-and-leaf, the beginning of our data is at the top, and in the box plot, the beginning of our data is on the bottom. So you can see that we have a lot of our data sort of in that part, and what this does is pull the median towards this lower end. Because think about it: if the vast majority of our data is at the beginning, that means our median is going to be somewhere in there, so it's going to pull that median down towards the beginning. On the flip side, the tail is really long, so we have extreme values that stretch the distribution over to the right, if we were looking at this as the normal curve, and we have an extreme outlier up there at the top. So again, you can see that this gives us a representation that's very similar to our histogram. Think about our histogram: all the probability was on the left-hand side. Well, look at the top of the stem-and-leaf. You can see that all of our values are there on the top, which is the same as saying it's on the left-hand side of the histogram.

Now the P-P and Q-Q plots. Well, here's the P-P plot, and there's the Q-Q plot. So what does your mind tell you? Are these linear? Do the points fall along the line? Well, no, they do not. That's not even close enough to say that those fall on that line. So what we can say is that this data does not fit a normal distribution. It actually fits a different distribution, which I know it to be, which is the log-normal. So again, you can use those very few simple techniques to
visually look at whether or not the data fits a normal distribution before you go on to do anything else, because if we used this second data set in certain tests that assumed a normal distribution, we're going to have problems.

Okay, a quick review and then we are done. Remember, by using a few simple visual tools we can learn a tremendous amount of information about our data. The data may have excess skew, which means it's lopsided. It may have excess kurtosis, which means it's sort of squished in the middle and has fat tails, or more probability in the tails. It could even be bimodal, which I did not graph in this video, but it could have two humps, sort of like a camel's back. Or, as in our second example, it can follow a completely different distribution other than the normal. So in this presentation we looked at the histogram, the stem-and-leaf plot, box plots, P-P plots, and Q-Q plots. Now, some of those I will make separate videos on, but again, I just wanted to give you a brief overview of some techniques you can use to tell whether or not your data does or does not fit the normal distribution.

Okay, so that wraps up this video on these visual techniques. Now, just keep in mind a few things. I can't stress this enough: always look at your data visually first. That can be anything from these types of visual analysis to maybe a scatter plot matrix, where you can look for correlations between certain variables if you have multiple variables you're looking at, or all kinds of things. So before you do anything more advanced, take a look at your data.

Okay, just a few reminders and then we'll wrap this video up. If you're watching this because you are struggling in a class, I want you to stay positive and keep your head up. You're smart and you're talented, and this may just be a temporary rough patch. If you liked the video, please give it a thumbs up, put it on a playlist, or share it with colleagues or classmates; that does encourage me to keep making them. Connect with me here on YouTube, on Twitter, and/or
on LinkedIn. That way, when I upload a new video, you know about it. And just keep in mind that the most important thing is that you're on here committing yourself to learning, and trust me, if you have the right process of learning in place, the results will take care of themselves. So thank you very much for watching. I wish you all the best of luck in your studies and your work, and I look forward to seeing you again next time.
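As a companion to the Q-Q plots discussed in the captions above, the following Python sketch shows one way the plotted points could be computed by hand. It is an illustration under stated assumptions, not the method SPSS uses: the data are simulated, the names `normal_like` and `log_normal` are hypothetical, the plotting position (i + 0.5)/n is one common choice, and the correlation is only a rough numeric stand-in for judging the line by eye.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    # Normal distribution fitted to the sample mean and standard deviation.
    fitted = NormalDist(mean(xs), stdev(xs))
    # (i + 0.5) / n is an assumed plotting position, always strictly in (0, 1).
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]


def straightness(points):
    """Pearson correlation between theoretical and observed quantiles.
    Values near 1 mean the points lie close to a straight line."""
    theo = [t for t, _ in points]
    obs = [o for _, o in points]
    mt, mo = mean(theo), mean(obs)
    cov = sum((t - mt) * (o - mo) for t, o in zip(theo, obs))
    var_t = sum((t - mt) ** 2 for t in theo)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / (var_t * var_o) ** 0.5


rng = random.Random(1)
# Simulated stand-ins for the two data sets; not the data from the video.
normal_like = [rng.gauss(0.0, 0.012) for _ in range(250)]
log_normal = [rng.lognormvariate(0.0, 1.0) for _ in range(250)]

print("straightness, normal-like:", round(straightness(qq_points(normal_like)), 3))
print("straightness, log-normal: ", round(straightness(qq_points(log_normal)), 3))
```

Plotting the pairs from `qq_points` gives the Q-Q plot. A few stray points at the ends will lower the correlation only slightly, which matches the advice in the captions to focus on the middle of the line unless the extremes are far off.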
Info
Channel: Brandon Foltz
Views: 250,556
Rating: 4.9550071 out of 5
Keywords: is my data normal, is my data normally distributed, normality test, qq plots, normal probability plot, qq plot, q-q plot, q q plot, p-p plot, pp plot, p p plot, quantile quantile plot, probability plot, qq plot excel, normal quantile plot, p-p plot spss, statistics, box plot, anova, logistic regression, linear regression, anova statistics, brandon foltz, p-value, normal distribution, statistics 101, box and whisker, data visualization, Statistics for data science
Id: 9IcaQwQkE9I
Length: 25min 30sec (1530 seconds)
Published: Sun Jan 27 2013