(upbeat music) - [Brandon] Hello,
welcome to the next video in my series on basic statistics. Now a few things before we get started, number one, you might notice by my voice that I'm just getting over a cold, so I apologize if I am hard
to understand at times, I will try to dictate and
speak very clearly for you. Number two, if you are watching this video because you are struggling
in a class right now, I want you to stay positive
and keep your head up. If you're watching this, it means you've accomplished quite a bit in your educational
career up to this point. You're very smart and
you may have just hit a temporary rough patch. Now I know with the right
amount of hard work, practice, and patience,
you can get through it. I have faith in you, and so should you. Number three, please feel free to follow me here on
YouTube and/or on Twitter. That way when I upload a new video, you know about it. And on the topic of the video, if you like it, please give it a thumbs up. Share it with classmates or colleagues, put it on a playlist 'cause
that does encourage me to keep making them. And finally, just keep in mind that these videos are
meant for individuals who are relatively new to stats, so I'm just going over basic concepts. I will be doing so in a
very slow deliberate manner. Not only do I want you to understand what's going on but also why. So all that being said, let's go ahead and get started. So this video is the next in our series on bivariate relationships. So in previous videos, we talked about covariance and we
looked at scatterplots and really tried to understand
what covariance was. So this video is sort of the next topic in that series, correlation. So correlation is a standard topic in introductory stats courses, and you often hear about it on the news or when reading, of course,
scientific academic journals. So we're really gonna break it down, understand what it is, and how it relates to other measures of
bivariate relationships. So here was the example we used in the covariance video. Now, this scattergram here in the middle is a chart that looks
at the monthly returns for the S&P 500 and the Dow
Jones Industrial Average during the year 2012. So it's the monthly return for each index. So over on the right, you can see the returns in decimal form, so you just have to move
the decimal over two places to get the percentage. So again, this is just a scatterplot and I always recommend
looking at a scatterplot of your data when you're looking at bivariate relationships. Now, first question we wanna ask ourselves is how would we describe the shape or the pattern of these data points? Are they in a line? Or are they in a curve shape? Is there no real pattern at all? Now in this case, they seem
to follow a linear pattern. So you can see if we put
a line over the points, they seem to follow that
line pretty closely. Now what this means is that when one stock index rises,
the other one also rises. And this is very
important to understanding covariance and its cousin correlation. So of course, they
follow a linear pattern. Now, in a real world practical sense, this had better be the case, well why? Well, the S&P 500 and the
Dow Jones Industrial Average claim to measure basically
the exact same thing and that is the performance of the overall stock market. So if they are not
behaving in similar ways, then we have a problem. Because they are claiming
to measure the same thing. So this relationship should
not surprise us at all. Now we say that when two
variables show this type of pattern, they have a
positive linear relationship. When one variable moves
in a certain direction, the other tends to move
in that same direction. We call this the covariance. It's how they co-vary, or vary together. Now, think about linear
relationships in general. Now the covariance, and this
is just a quick review, is one of a family of statistical measures used to analyze the linear relationship between two variables. How do two variables behave as a pair? Now we have the covariance which we've talked about
in previous videos. We have the correlation which we're gonna discuss in this video and linear regression which of course, is related to the covariance
and the correlation. So all of these measures are ways to look at the relationship
between two variables. So let's talk about the difference between the covariance and correlation. Now the covariance provides the direction, so a positive, a negative, or near zero. It provides the direction
of the linear relationship between two variables. While the correlation provides
direction and strength. So remember, the covariance does not tell us anything about the strength of that relationship, just the direction. The correlation overcomes that limitation by describing the
direction and the strength. Now the covariance has no
upper or lower boundary and its size is dependent on
the scale of the variables. So for the covariance, you're gonna get a number that is related to the
scale or the measure of the variables themselves. However, the correlation is always between -1 and positive 1
and its scale is independent of the scale of the variables themselves which is very handy because it allows us to compare variables that are
measured in different ways. So we can look at the correlation between the temperature outside and our energy usage in kilowatts which are obviously in different measures. We can look at various relationships across the variable pairs regardless of the units they're in. That's why correlation is very handy. Now another way of saying that is that covariance is not standardized while correlation is standardized. In the same way that a
z-score is a standardized measure of the variation in one variable. So think of a z-score, say a z-score of one or a
z-score of negative 2.3: the standardization in
the z-score allows us to compare variables that are measured using different scales. The correlation behaves
exactly the same way. It's a standardized
measure of the relationship between two variables. Now there are a couple
of other things to keep in mind when you're interpreting
correlation measures. Number one, before going
crazy computing correlations, look at a scatterplot of your data. What pattern if any does it exhibit? So a lot of students I work with are always gung-ho about calculating or putting the data into a
statistical software package or Microsoft Excel and
calculating the correlation without actually looking at a scatterplot of the data. Always look at a scatterplot first. Number two, correlation is only applicable to linear relationships. There are many other
types of relationships that can exist between two variables. And we'll look at those here in a minute. For example, if we were gonna look at a scatterplot of energy usage versus the temperature outside and I live here in the Midwest, so our seasons vary quite a bit. What you'll notice is that
when the temperature's very low, the energy usage goes up. Now as the temperature
rises to a moderate amount, the energy usage goes down, but then when the
temperature gets really hot, energy usage goes up again. So the scatterplot will probably look like a U shape and of course, that's a curve shape, it's not linear. So you always wanna use the scatterplot to look at the shape of your data and realize the correlation is only applicable to linear relationships. Number three and I'm sure this is hammered into your mind in your classes and that is correlation is not causation. 'Cause there's an idea
called spurious correlation. And that's when two
completely unrelated factors have a mathematical correlation but have no sensible
correlation in real life. So maybe we measure the number of times our dog barks and we keep that data in relation to whatever phase the moon is in. So maybe our dog barks more
when there's a full moon and it barks a little bit less when there's a three-quarter moon, it barks a little bit less
when there's a half moon, so this relationship,
this correlation exists but it doesn't mean anything in real life. So the moon is not causing
our dog to bark more or less and our dog barking is
certainly not causing the phase of the moon. So correlation is not causation. And finally, a correlation's strength does not necessarily mean the correlation is statistically significant. Now we're not gonna talk
about statistical significance with respect to correlation in this video, but just keep in mind it just depends on the sample
size that we're working with. So whether or not a correlation is statistically significant will depend on its sample size but we'll
save that for another video. I just wanted to point it out. So here are some general
correlation patterns and again, I'm sure these are probably in your stats book,
you've seen these before. So in the left hand side we have a correlation where all the points seem to follow a positive upward trend along a straight line. So this correlation on the left hand side is probably near positive 1. In the middle we have
the points that follow a negative relationship, so starts at the upper left and goes down to the lower right in a
straight line pattern. So this correlation is probably near -1. And then on the right hand side there seems to be no
linear relationship at all. So the correlation of these
points is probably near 0. Now in real life our data is probably gonna be somewhere between these extremes. So we're not gonna have data that looks exactly like the positive
1 and -1 general patterns. We probably will have some data with no correlation that looks like the one on the right, but when data does have a correlation, it's gonna be somewhere between
the plus one and minus one extremes and the zero example. And again, that just comes with looking at the scatterplots of your data. Now data, like I said before, does sometimes follow
non-linear relationships. So in the left hand side we call this a quadratic relationship. And this might be a good example when I was discussing energy usage and the temperature outside. So if the x-axis down there at the bottom is the temperature outside, when the temperature is really low, we might use a lot of
energy to warm our houses. As the temperature becomes more moderate there in the middle, we might actually turn
the air conditioning or the heat off completely,
open the windows and therefore we're
using very little energy. But when the temperature rises, and becomes uncomfortable, we're likely to turn
the air conditioning on and then our energy usage goes up again. So in that case, it
might follow this U shape which of course, is non-linear. Then we have an exponential pattern which comes into play
in some areas of biology and demographic data and things like that, but again, it's not linear. It follows this curve that
gets steeper as it goes up. And then we have a polynomial relationship so this kinda has an S,
kinda squiggly curve shape and we sometimes see that in data. So again, these are not linear. So again, you have to look
at a scatterplot of your data to make sure it is applicable
to doing correlations. So again, here is our example of the monthly returns of the S&P 500 versus the Dow Jones. So when the S&P goes up, the Dow Jones tends to go up and again,
that should make sense. Now I ran these numbers through SPSS which is one of several
statistical software packages. You could do this easily
in Microsoft Excel, you could do it in S-A-S, SAS,
you could do it in Minitab, whatever software you
have at your disposal. You could also do it on a TI calculator. Now what this tells us is
that the correlation is .974. So you can see that falls in the diagonals there from lower left to upper right. Now you can see that the correlation between the S&P 500 and itself of course is 1 and the same holds true for the Dow Jones and
itself, the correlation is 1. 'Cause anything correlated
with itself is one. Now between the two variables
the correlation is .974. And I have not yet pointed out that we use the letter r, lowercase r, to denote the sample correlation coefficient. So again, in this case it's .974. But we'll talk about the
correlation coefficient formula here in a second. Now it should make sense
that the correlation is this strong 'cause remember the highest it can possibly
be is positive one. Well .974 is pretty close to one. And that should make sense given the way our data looks and what we're
measuring in the real world. Let's talk about the
correlation formula a bit. And I'm not one to really harp on memorizing formulas,
but I just wanna point out sort of where this comes from and how it relates to other
things we've talked about. Now r is called the Pearson
correlation coefficient. So in your book or in your class if you hear the term Pearson correlation, that's what we're doing here. There are other types of correlations but the vast majority of the time, especially in an introductory stats class, you're going to be using the
Pearson correlation coefficient and it's named that after the gentleman who actually invented this
correlation coefficient. Well, what is it? It's the covariance
between the two variables divided by the product of
their standard deviations. So on the top of this fraction we have the covariance of x and y. In the denominator we have the product. So the standard deviation of x times the standard deviation of y. Now it's also written like this. So covariance of x and y divided by the standard deviation of x times the standard deviation of y. So we just take the covariance
between the two variables and divide it by the product
of their standard deviations. Now on a side note here, because we have four
terms in this expression, let's say you know the correlation r and you know the standard deviations which are there in the denominator. Because you know those three things, you can actually find the
covariance very easily. So and that would hold true
for any of the other three. So if you have any three of these, you can actually find the fourth. And that's just again,
very simple algebra, but it can come in handy. I just wanted to point that out to you. Let's go ahead and look
at an example problem. So this is the same data I
used in the covariance video, but we're gonna use it to
find some correlations. So Rising Hills Manufacturing wishes to study the relationship
between the number of workers, x, and the number of tables
produced, y, in its plant. To do so it obtained 10 samples, each sample was one hour in length, from the production floor. So x is the number of workers and y is the number of tables produced. Now I went ahead and found out for you the standard deviations, so the standard deviation
for the number of workers in each sample was 6.48. And the standard deviation
for the number of tables produced was 16.69. And again, I just rounded
those to two digits. So remember when we're calculating the correlation coefficient, we have to have the
two standard deviations so there they are. You might wanna go ahead
and write them down because we'll need them
when we do the calculation. So here is a scatterplot of our data. Now, how would you
describe this relationship? What correlation are you expecting? Is it gonna be positive? Or negative? Is it gonna be strong? Very close to one or negative one? Or is it gonna be weak, around zero, or non-existent, exactly zero? So this is just the way
I computed the covariance. If you're interested in the covariance, please check out the
previous videos on covariance where we actually learn how
to calculate it by hand. So the covariance of x and y, also written as s sub x y was 962.4 divided by n minus 1. And again, you don't really need to know this for this video, just check out the
previous one on covariance. So we go ahead and calculate that out and the covariance between x and y comes out to 106.93. Now of course, the covariance along with the standard deviations are what we need to
calculate the correlation. So we have our correlation coefficient r equal to the covariance of x and y divided by the product of
the standard deviations. It's also written like this. So you may see it one or
both ways in your class work. So we go and substitute the numbers in so we have 106.93 in the numerator divided by 6.48 times
16.69 in the denominator. So that is 106.93 divided by 108.15 and that gives us a correlation
coefficient of .989. So it is a very strong
positive correlation. Now do note that this is a bit higher than the SPSS output
gave us due to rounding because when we work it in SPSS, it doesn't do rounding until the very end so we're gonna be off a
digit or two, no big deal. So how would you describe
this relationship? Well, with a correlation coefficient of .989 or .973 I think it was in SPSS, it's a very strong positive
linear relationship. Now again, in this case, we're dealing with the number of workers and the number of tables. So can we say that the number of workers causes the number of tables that are produced? Now in this case, I think it's okay to say we have a
preliminary causation here and that's because the
variables mean something. But I'm just saying that correlation does not necessarily mean causation but in this case because
of the relationship between the two variables
in the real world, we could probably go ahead and say that. Now, how could we more objectively state whether or not a relationship
exists between two variables? There's actually a rule of thumb that we can apply. So all the students I work with we come up with the
correlation coefficient but sometimes in the
classes that they are in, this sort of rule of thumb is not taught or it's taught and they forget it or don't write it down which
is completely possible, but there is sort of a
general rule of thumb way to figure out whether or
not a relationship exists. So here is that rule of thumb. If the absolute value of
our correlation coefficient is greater than 2 divided
by the square root of our sample size, then
a relationship exists. So again, for our example, we had 10 samples, so
if the absolute value of our correlation coefficient is greater than 2 divided by the square root of 10 because we had 10 samples and that comes out to .632, if our correlation coefficient is greater than that,
the absolute value of it is greater than that, then we
say a relationship does exist between these two variables. So again, that's just
a general rule of thumb to use if you have to make the conclusion whether or not a relationship
does or does not exist. Now just keep in mind that
even if our correlation was -.989, we would say that
the relationship does exist in that case too because we're interested in the absolute value of
our correlation coefficient which removes the sign or
makes the sign positive, you can think of it that way, and as long as it's above this .632 then we'll go ahead and say a
relationship does exist. Okay, so just a quick review and then we'll wrap this video up. Just remember that covariance
provides the direction while correlation provides the
direction and the strength. The covariance has no upper or lower bound and its size is dependent on
the scale of the variables. So when we figured out the covariance of 106.93, well what does that mean? The reality is it really
doesn't mean a whole lot. In covariance, it really depends on the variables we're using. Now correlation is always
between -1 and positive 1 and its scale is independent of the scale of the variables. That's another way of saying covariance is not standardized while
correlation is standardized. Again, think of z-scores, so z-scores allow us to explain variation in individual variables regardless of what unit they're measured in, the correlation works the exact same way. Okay, so that wraps up our first video at least in understanding correlation. So again, my goal in this video was just to show you the tie ins between covariance and correlation 'cause obviously they are highly related in how we calculate them, how we interpret scatterplots and things of that nature. I just wanted to point out that correlation does have some caveats, so correlation does not
necessarily mean causation which is very important. Correlation is a standardized measure of the relationship between two variables and correlation is only really applicable to variables that exhibit
a linear relationship. So always look at your scatterplots first. Okay, just a few reminders, if you're watching this video because you are struggling in a class, I want you to stay positive
and keep your head up. I have faith in you, many
other people around you have faith in you, and so should you. Please feel free to follow me here on YouTube and/or on Twitter, that way when I upload a new video you know about it. If you like the video,
please give it a thumbs up, share it with classmates or colleagues, or put it on a playlist. That does encourage me to keep making them and the same holds true
for other screen casters here on YouTube, so if you
like a video on YouTube, please give it a thumbs up because that's encouraging to us. On the flip side if you
think there is something I can do better, please
leave a constructive comment below the video and
I'll try to incorporate those ideas into future ones. Just keep in mind the most important thing is that you're here on YouTube, or on the web somewhere trying to learn. And that's the important thing. If you're on here trying to learn, trying to improve yourself as a student or a businessperson,
or in some other field, that's what really matters. If you have the process of learning, the results will take care of themselves. So, thank you very much for watching, I wish you the best of luck in your studies or in your work and look forward to seeing
you again next time. (upbeat music)
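
For anyone who wants to check the arithmetic from the worked example, here is a minimal Python sketch. It is not something shown in the video, which used SPSS, and the function names are illustrative only. It recomputes r from the covariance and standard deviations quoted in the table example, applies the rule of thumb of 2 divided by the square root of the sample size, and then uses a small made-up data set (the `hours` and `units` lists are hypothetical, not from the video) to illustrate that changing a variable's units changes the covariance but leaves the correlation alone.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_sd(xs):
    """Sample standard deviation (square root of the sample variance)."""
    return math.sqrt(sample_covariance(xs, xs))


def pearson_r(xs, ys):
    """Pearson r from raw data: covariance over the product of the SDs."""
    return sample_covariance(xs, ys) / (sample_sd(xs) * sample_sd(ys))


def r_from_summary(cov_xy, sd_x, sd_y):
    """Pearson r when the covariance and both SDs are already known."""
    return cov_xy / (sd_x * sd_y)


def rule_of_thumb_cutoff(n):
    """Cutoff from the video's rule of thumb: 2 / sqrt(n)."""
    return 2 / math.sqrt(n)


# Numbers quoted in the video for the workers/tables example (n = 10).
cov_xy = 962.4 / (10 - 1)
r = r_from_summary(cov_xy, 6.48, 16.69)
cutoff = rule_of_thumb_cutoff(10)

print(round(cov_xy, 2))   # 106.93
print(round(r, 3))        # 0.989
print(round(cutoff, 3))   # 0.632
print(abs(r) > cutoff)    # True, so the rule of thumb says a relationship exists

# Hypothetical data (not from the video): the same durations in two units.
hours = [1, 2, 3, 4, 5]
units = [12, 15, 21, 24, 30]
minutes = [h * 60 for h in hours]

# The covariance grows by the unit factor of 60; the correlation stays the same.
print(sample_covariance(hours, units), sample_covariance(minutes, units))
print(pearson_r(hours, units), pearson_r(minutes, units))
```

Rearranging `r_from_summary` gives the point made in the video about knowing any three of the four quantities: for example, the covariance is r times the two standard deviations multiplied together.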