Statistics 101: Understanding Correlation

Captions

(upbeat music) - [Brandon] Hello, welcome to the next video in my series on basic statistics. Now a few things before we get started, number one, you might notice by my voice that I'm just getting over a cold, so I apologize if I am hard to understand at times, I will try to dictate and speak very clearly for you. Number two, if you are watching this video because you are struggling in a class right now, I want you to stay positive and keep your head up. If you're watching this, it means you've accomplished quite a bit in your educational career up to this point. You're very smart and you may have just hit a temporary rough patch. Now I know with the right amount of hard work, practice, and patience, you can get through it. I have faith in you, so so should you. Number three, please feel free to follow me here on YouTube and/or on Twitter. That way when I upload a new video, you know about it. And on the topic of the video, if you like it, please give it a thumbs up. Share it with classmates or colleagues, put it on a playlist 'cause that does encourage me to keep making them. And finally, just keep in mind that these videos are meant for individuals who are relatively new to stats, so I'm just going over basic concepts. I will be doing so in a very slow deliberate manner. Not only do I want you to understand what's going on but also why. So all that being said, let's go ahead and get started. So this video is the next in our series on bivariate relationships. So in previous videos, we talked about covariance and we looked at scatterplots and really tried to understand what covariance was. So this video is sort of the next topic in that series, correlation. So correlation is a standard topic in introductory stats courses, and you often hear about it on the news or, of course, when reading scientific academic journals. So we're really gonna break it down, understand what it is, and how it relates to other measures of bivariate relationships. 
So here was the example we used in the covariance video. Now, this scattergram here in the middle is a chart that looks at the monthly returns for the S&P 500 and the Dow Jones Industrial Average during the year 2012. So it's the monthly return for each index. So over on the right, you can see the returns in decimal form, so you just have to move the decimal over two places to get the percentage. So again, this is just a scatterplot and I always recommend looking at a scatterplot of your data when you're looking at bivariate relationships. Now, the first question we wanna ask ourselves is how would we describe the shape or the pattern of these data points? Are they in a line? Or are they in a curve shape? Is there no real pattern at all? Now in this case, they seem to follow a linear pattern. So you can see if we put a line over the points, they seem to follow that line pretty closely. Now what this means is that when one stock index rises, the other one also rises. And this is very important to understanding covariance and its cousin correlation. So of course, they follow a linear pattern. Now, in a real world practical sense, this had better be the case, well why? Well, the S&P 500 and the Dow Jones Industrial Average claim to measure basically the exact same thing and that is the performance of the overall stock market. So if they are not behaving in similar ways, then we have a problem. Because they are claiming to measure the same thing. So this relationship should not surprise us at all. Now we say that when two variables show this type of pattern, they have a positive linear relationship. When one variable moves in a certain direction, the other tends to move in that same direction. We call this the covariance. It's how they co-vary, or vary together. Now, thinking about linear relationships in general, the covariance, and this is just a quick review, is one of a family of statistical measures used to analyze the linear relationship between two variables. 
How do two variables behave as a pair? Now we have the covariance which we've talked about in previous videos. We have the correlation which we're gonna discuss in this video and linear regression which of course, is related to the covariance and the correlation. So all of these measures are ways to look at the relationship between two variables. So let's talk about the difference between the covariance and correlation. Now the covariance provides the direction, so a positive, a negative, or near zero. It provides the direction of the linear relationship between two variables. While the correlation provides direction and strength. So remember, the covariance does not tell us anything about the strength of that relationship, just the direction. The correlation overcomes that limitation by describing the direction and the strength. Now the covariance has no upper or lower boundary and its size is dependent on the scale of the variables. So for the covariance, you're gonna get a number that is related to the scale or the measure of the variables themselves. However, the correlation is always between -1 and positive 1 and its scale is independent of the scale of the variables themselves which is very handy because it allows us to compare variables that are measured in different ways. So we can look at the correlation between the temperature outside and our energy usage in kilowatts which are obviously in different measures. We can look at various relationships across the variable pairs regardless of the units they're in. That's why correlation is very handy. Now another way of saying that is that covariance is not standardized while correlation is standardized, in the same way that a z-score is a standardized measure of the variation in one variable. So you can think of a z-score, a z-score of one or a z-score of negative 2.3; the standardization in the z-score allows us to compare variables that are measured using different scales. 
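The point about scale can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the temperature and energy numbers are made up (they are not from the video), and the helper names `sample_covariance` and `pearson_r` are hypothetical. The same energy readings are expressed in two different units to show that the covariance changes with the unit while the correlation does not.

```python
from math import isclose
from statistics import mean, stdev


def sample_covariance(x, y):
    """Sample covariance: sum of paired deviations divided by n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    return sample_covariance(x, y) / (stdev(x) * stdev(y))


# Made-up heating-season readings, for illustration only.
temps_f = [30.0, 40.0, 50.0, 60.0, 70.0]
usage_kwh = [52.0, 47.0, 40.0, 36.0, 29.0]
usage_wh = [u * 1000 for u in usage_kwh]  # same readings, different unit

cov_kwh = sample_covariance(temps_f, usage_kwh)
cov_wh = sample_covariance(temps_f, usage_wh)
r_kwh = pearson_r(temps_f, usage_kwh)
r_wh = pearson_r(temps_f, usage_wh)

print("covariance (kWh):", cov_kwh)  # depends on the unit
print("covariance (Wh): ", cov_wh)   # 1000 times larger in size
print("correlation (kWh):", r_kwh)   # unchanged by the unit
print("correlation (Wh): ", r_wh)
```

Changing the unit multiplies the covariance by 1000, which is why its size is hard to interpret on its own, while the correlation stays the same and stays between -1 and 1.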
The correlation behaves exactly the same way. It's a standardized measure of the relationship between two variables. Now there are a couple of other things to keep in mind when you're interpreting correlation measures. Number one, before going crazy computing correlations, look at a scatterplot of your data. What pattern if any does it exhibit? So a lot of students I work with are always gung-ho about calculating or putting the data into a statistical software package or Microsoft Excel and calculating the correlation without actually looking at a scatterplot of the data. Always look at a scatterplot first. Number two, correlation is only applicable to linear relationships. There are many other types of relationships that can exist between two variables. And we'll look at those here in a minute. For example, if we were gonna look at a scatterplot of energy usage versus the temperature outside and I live here in the Midwest, so our seasons vary quite a bit. What you'll notice is that when the temperature's very low, the energy usage goes up. Now as the temperature rises to a moderate amount, the energy usage goes down, but then when the temperature gets really hot, energy usage goes up again. So the scatterplot will probably look like a U shape and of course, that's a curve shape, it's not linear. So you always wanna use the scatterplot to look at the shape of your data and realize the correlation is only applicable to linear relationships. Number three and I'm sure this is hammered into your mind in your classes and that is correlation is not causation. 'Cause there's an idea called spurious correlation. And that's when two completely unrelated factors may have a mathematical correlation but have no sensible correlation in real life. So maybe we measure the number of times our dog barks and we keep that data in relation to whatever phase the moon is in. 
So maybe our dog barks more when there's a full moon and it barks a little bit less when there's a three-quarter moon, it barks a little bit less when there's a half moon, so this relationship, this correlation exists but it doesn't mean anything in real life. So the moon is not causing our dog to bark more or less and our dog barking is certainly not causing the phase of the moon. So correlation is not causation. And finally, a correlation strength does not necessarily mean the correlation is statistically significant, now we're not gonna talk about statistical significance with respect to correlation in this video, but just keep in mind it just depends on the sample size that we're working with. So whether or not a correlation is statistically significant will depend on its sample size but we'll save that for another video. I just wanted to point it out. So here are some general correlation patterns and again, I'm sure these are probably in your stats book, you've seen these before. So in the left hand side we have a correlation where all the points seem to follow a positive upward trend along a straight line. So this correlation on the left hand side is probably near positive 1. In the middle we have the points that follow a negative relationship, so starts at the upper left and goes down to the lower right in a straight line pattern. So this correlation is probably near -1. And then on the right hand side there seems to be no linear relationship at all. So the correlation of these points is probably near 0. Now in real life our data is probably gonna be somewhere between these extremes. So we're not gonna have data that looks like the positive 1 and -1 general patterns and data that, well we probably will have data that does not have a correlation that will look like the one on the right but when data does have a correlation it's gonna be somewhere between the plus one and minus one extremes and the zero example. 
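The three general patterns, and the earlier warning about U-shaped data, can be checked numerically. The sketch below uses made-up points and a hypothetical helper named `pearson_r`; it is an illustration only, not the data shown in the video. A perfectly rising line, a perfectly falling line, and a symmetric U shape are each run through the same calculation.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Sample covariance divided by the product of the standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


# Made-up x values, symmetric around zero.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

rising = [2 * v + 1 for v in x]    # points exactly on an upward line
falling = [-3 * v + 5 for v in x]  # points exactly on a downward line
u_shape = [v ** 2 for v in x]      # a perfect U: a clear pattern, but not linear

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_u = pearson_r(x, u_shape)

print("rising line: ", r_rising)
print("falling line:", r_falling)
print("U shape:     ", r_u)
```

The U-shaped data has an obvious pattern, yet its correlation comes out at zero, which is the reason to look at a scatterplot before trusting the number.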
And again, that just comes with looking at the scatterplots of your data. Now data, like I said before, does sometimes follow non-linear relationships. So in the left hand side we call this a quadratic relationship. And this might be a good example when I was discussing energy usage and the temperature outside. So if the x-axis down there at the bottom is the temperature outside, when the temperature is really low, we might use a lot of energy to warm our houses. As the temperature becomes more moderate there in the middle, we might actually turn the air conditioning or the heat off completely, open the windows and therefore we're using very little energy. But when the temperature rises, and becomes uncomfortable, we're likely to turn the air conditioning on and then our energy usage goes up again. So in that case, it might follow this U shape which of course, is non-linear. Then we have an exponential pattern which comes into play in some areas of biology and demographic data and things like that, but again, it's not linear. It follows this curve that gets steeper as it goes up. And then we have a polynomial relationship so this kinda has an S, kinda squiggly curve shape and we sometimes see that in data. So again, these are not linear. So again, you have to look at a scatterplot of your data to make sure correlation is applicable to it. So again, here is our example of the monthly returns of the S&P 500 versus the Dow Jones. So when the S&P goes up, the Dow Jones tends to go up and again, that should make sense. Now I ran these numbers through SPSS which is one of several statistical software packages. You could do this easily in Microsoft Excel, you could do it in S-A-S, SAS, you could do it in Minitab, whatever software you have at your disposal. You could also do it on a TI calculator. Now what this tells us is that the correlation is .974. So you can see that falls in the diagonals there from lower left to upper right. 
Now you can see that the correlation between the S&P 500 and itself of course is 1 and the same holds true for the Dow Jones and itself, the correlation is 1. 'Cause anything correlated with itself is one. Now between the two variables the correlation is .974. And I did not point out yet that we use the letter r, lowercase r, to denote the sample correlation coefficient. So again, in this case it's .974. But we'll talk about the correlation coefficient formula here in a second. Now it should make sense that the correlation is this strong 'cause remember the highest it can possibly be is positive one. Well .974 is pretty close to one. And that should make sense given the way our data looks and what we're measuring in the real world. Let's talk about the correlation formula a bit. And I'm not one to really harp on memorizing formulas, but I just wanna point out sort of where this comes from and how it relates to other things we've talked about. Now r is called the Pearson correlation coefficient. So in your book or in your class if you hear the term Pearson correlation, that's what we're doing here. There are other types of correlations but the vast majority of the time, especially in an introductory stats class, you're going to be using the Pearson correlation coefficient and it's named that after the gentleman who actually invented this correlation coefficient. Well, what is it? It's the covariance between the two variables divided by the product of their standard deviations. So on the top of this fraction we have the covariance of x and y. In the denominator we have the product. So the standard deviation of x times the standard deviation of y. Now it's also written like this. So covariance of x and y divided by the standard deviation of x times the standard deviation of y. So we just take the covariance between the two variables and divide it by the product of their standard deviations. 
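Written out, the formula described above probably looks like the following on the slide. The exact symbols are an assumption, since the transcript only describes the formula in words: here s with a subscript xy stands for the sample covariance, and s with a single subscript stands for a sample standard deviation.

```latex
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y} = \frac{s_{xy}}{s_x \, s_y}
```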
Now on a side note here, because we have four terms in this expression, let's say you know the correlation r and you know the standard deviations which are there in the denominator. Because you know those three things, you can actually find the covariance very easily. So and that would hold true for any of the other three. So if you have any three of these, you can actually find the fourth. And that's just again, very simple algebra, but it can come in handy. I just wanted to point that out to you. Let's go ahead and look at an example problem. So this is the same data I used in the covariance video, but we're gonna use it to find some correlations. So Rising Hills Manufacturing wishes to study the relationship between the number of workers, x, and the number of tables produced, y, in its plant. To do so it obtained 10 samples, each sample was one hour in length, from the production floor. So x is the number of workers and y is the number of tables produced. Now I went ahead and found out for you the standard deviations, so the standard deviation for the number of workers in each sample was 6.48. And the standard deviation for the number of tables produced was 16.69. And again, I just rounded those to two digits. So remember when we're calculating the correlation coefficient, we have to have the two standard deviations so there they are. You might wanna go ahead and write them down because we'll need them when we do the calculation. So here is a scatterplot of our data. Now, how would you describe this relationship? What correlation are you expecting? Is it gonna be positive? Or negative? Is it gonna be strong? Very close to one or negative one? Or is it gonna be weak, around zero, or non-existent, exactly zero? So this is just the way I computed the covariance if you're interested in the covariance please check out the previous videos on covariance where we actually learn how to calculate it by hand. 
So the covariance of x and y, also written as s sub x y was 962.4 divided by n minus 1. And again, you don't really need to know this for this video, just check out the previous one on covariance. So we go ahead and calculate that out and the covariance between x and y comes out to 106.93. Now of course, the covariance along with the standard deviations are what we need to calculate the correlation. So we have our correlation coefficient r equal to the covariance of x and y divided by the product of the standard deviations. It's also written like this. So you may see it one or both ways in your class work. So we go and substitute the numbers in so we have 106.93 in the numerator divided by 6.48 times 16.69 in the denominator. So that is 106.93 divided by 108.15 and that gives us a correlation coefficient of .989. So it is a very strong positive correlation. Now do note that this is a bit higher than the SPSS output gave us due to rounding because when we work it in SPSS, it doesn't do rounding until the very end so we're gonna be off a digit or two, no big deal. So how would you describe this relationship? Well, with a correlation coefficient of .989 or .973 I think it was in SPSS, it's a very strong positive linear relationship. Now again, in this case, we're dealing with the number of workers and the number of tables. So can we say that the number of workers causes the number of tables that are produced? Now in this case, I think it's okay to say we have a preliminary causation here and that's because the variables mean something. But I'm just saying that correlation does not necessarily mean causation but in this case because of the relationship between the two variables in the real world, we could probably go ahead and say that. Now, how could we more objectively state whether or not a relationship exists between two variables? There's actually a rule of thumb that we can apply. 
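The arithmetic worked through above can be checked with a short Python sketch. It uses only the figures quoted in the video (962.4, 10 samples, and the two rounded standard deviations); the variable names are hypothetical, and treating 962.4 as the sum of the paired deviation products is an assumption based on how the covariance is described. The last step also runs the formula backwards, recovering the covariance from r and the two standard deviations.

```python
from math import isclose

n = 10                 # number of one-hour samples
sum_products = 962.4   # numerator quoted in the video (assumed sum of paired deviation products)
s_x = 6.48             # standard deviation of workers, rounded as in the video
s_y = 16.69            # standard deviation of tables produced, rounded as in the video

# Covariance: divide by n - 1.
cov_xy = sum_products / (n - 1)

# Correlation: covariance over the product of the standard deviations.
r = cov_xy / (s_x * s_y)

# Going the other way: knowing r and both standard deviations gives the covariance back.
cov_recovered = r * s_x * s_y

print("covariance:", round(cov_xy, 2))
print("correlation:", round(r, 3))
print("covariance recovered from r:", round(cov_recovered, 2))
```

Because the standard deviations are already rounded to two digits, the result can differ in the last digit or two from what a software package reports, as mentioned above.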
So all the students I work with we come up with the correlation coefficient but sometimes in the classes that they are in, this sort of rule of thumb is not taught or it's taught and they forget it or don't write it down which is completely possible, but there is sort of a general rule of thumb way to figure out whether or not a relationship exists. So here is that rule of thumb. If the absolute value of our correlation coefficient is greater than 2 divided by the square root of our sample size, then a relationship exists. So again, for our example, we had 10 samples, so if the absolute value of our correlation coefficient is greater than 2 divided by the square root of 10 because we had 10 samples and that comes out to .632, if our correlation coefficient is greater than that, the absolute value of it is greater than that, then we say a relationship does exist between these two variables. So again, that's just a general rule of thumb to use if you have to make the conclusion whether or not a relationship does or does not exist. Now just keep in mind that even if our relationship was -.989, we would say that the relationship does exist in that case too because we're interested in the absolute value of our correlation coefficient which removes the sign or makes the sign positive you can think of it that way and as long as it's above this .632 then we'll go ahead and say relationship does exist. Okay, so just a quick review and then we'll wrap this video up. Just remember that covariance provides the direction while correlation provides the direction and the strength. The covariance has no upper or lower bound and its size is dependent on the scale of the variables. So when we figured out the covariance of 106.93, well what does that mean? The reality is it really doesn't mean a whole lot. In covariance, it really depends on the variables we're using. Now correlation is always between -1 and positive 1 and its scale is independent of the scale of the variables. 
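The rule of thumb described above is easy to turn into a small check. This is a minimal sketch; the function names `rule_of_thumb_threshold` and `relationship_exists` are hypothetical, and the rule itself is only the rough guideline from the video, not a formal significance test.

```python
from math import sqrt


def rule_of_thumb_threshold(n):
    """Cutoff from the rule of thumb: 2 divided by the square root of the sample size."""
    return 2 / sqrt(n)


def relationship_exists(r, n):
    """True when the absolute value of r is greater than the cutoff."""
    return abs(r) > rule_of_thumb_threshold(n)


threshold = rule_of_thumb_threshold(10)

print("threshold for n = 10:", round(threshold, 3))
print("r = 0.989: ", relationship_exists(0.989, 10))
print("r = -0.989:", relationship_exists(-0.989, 10))
print("r = 0.5:   ", relationship_exists(0.5, 10))
```

Taking the absolute value means a strong negative correlation passes the check just like a strong positive one, while a value such as 0.5 (a made-up number for comparison) falls below the cutoff for 10 samples.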
That's another way of saying covariance is not standardized while correlation is standardized. Again, think of z-scores, so z-scores allow us to explain variation in individual variables regardless of what unit they're measured in, the correlation works the exact same way. Okay, so that wraps up our first video at least in understanding correlation. So again, my goal in this video was just to show you the tie ins between covariance and correlation 'cause obviously they are highly related in how we calculate them, how we interpret scatterplots and things of that nature. I just wanted to point out that correlation does have some caveats, so correlation does not necessarily mean causation which is very important. Correlation is a standardized measure of the relationship between two variables and correlation is only really applicable to variables that exhibit a linear relationship. So always look at your scatterplots first. Okay, just a few reminders, if you're watching this video because you are struggling in a class, I want you to stay positive and keep your head up. I have faith in you, many other people around you have faith in you, so so should you. Please feel free to follow me here on YouTube and or on Twitter, that way when I upload a new video you know about it. If you like the video, please give it a thumbs up, share it with classmates or colleagues, or put it on a playlist. That does encourage me to keep making them and the same holds true for other screen casters here on YouTube, so if you like a video on YouTube, please give it a thumbs up because that's encouraging to us. On the flip side if you think there is something I can do better, please leave a constructive comment below the video and I'll try to incorporate those ideas into future ones. Just keep in mind the most important thing is that you're here on YouTube, or on the web somewhere trying to learn. And that's the important thing. 
If you're on here trying to learn, trying to improve yourself as a student or a businessperson, or in some other field, that's what really matters. If you have the process of learning, the results will take care of themselves. So, thank you very much for watching, I wish you the best of luck in your studies or in your work and look forward to seeing you again next time. (upbeat music)

Info

Channel: Brandon Foltz

Views: 623,757

Rating: 4.9600749 out of 5

Keywords: correlation statistics, statistics correlation, correlation, correlation analysis, correlation and regression, correlation and regression in statistics, correlation coefficient, statistics 101, brandon foltz, brandon c. foltz, brandon c foltz, Pearson correlation, covariance, regression analysis, linear regression, understanding, bivariate, regression, 101, anova, logistic regression, anova statistics, statistics, data science, Covariance and correlation

Id: 4EXNedimDMs

Length: 27min 5sec (1625 seconds)

Published: Fri Jan 25 2013