- So what is partial
correlation and how can we use simple techniques in R
to learn more about it? Stick around and let's find out. (upbeat music) Hello and namaste. My name is Brandon and
welcome to the channel. So on this channel, you will find lessons and tutorials for statistics, data science, and related fields that can help you get
higher grades in your class, prepare you for that next job, or just sort of quench
the natural curiosity you might have around these topics. Now, when all is said and
done, if you like the video, please give it a thumbs up,
share it with classmates, colleagues or friends or others you think might benefit from watching. And of course, if you haven't already please hit that subscribe button
and the bell notification. In the description below, you
will find two important links. One is to sort of a table of
contents for all of my videos. So if you wanna study something
else, just follow that link and you'll see that
table of contents there. The second link goes to where
you can download the file to follow along with this video. So again, look for those
in the description below. Now that we're introduced and up to speed, let's go ahead and hop into R and learn about partial correlation. Now I want to reiterate that you do not need to
have advanced knowledge of R to watch this video. In fact, I'm gonna assume
you know very little. Now, many of my users have
quite a bit of R experience. Whereas another subset of my users do not. They're typically just business students or university students that
are taking a stats class for their major or just for
their general requirements. So I'm gonna assume a
low level of R knowledge. And actually we are only going to stay in the R Studio interface
here for a few minutes. We will hop out to an HTML file created from this R Studio environment, and then follow along with the
actual lesson in the video. When it comes to this R
environment that we have here, I'm running an R project. So an R project kinda has all
of your stuff in one place. On the file that you see
over here on the left is actually an R Notebook. It's very similar to the
Jupyter notebook you might see when using Python. And the language inside this R Notebook is called R Markdown. So you can see the hash signs. So the one hash is actually an H1 heading, two hashes is H2 and so on and so forth. So if you want to know more
about how this document is put together, just do a
Google search or YouTube search for R Markdown or R Notebooks. And of course you can download
this project file as a zip in the link in the description below. Finally and very quickly, what makes this R Notebook
environment so flexible is that you can run chunks of code. So here in this load
library chunk right here, on the right you'll see a green arrow, it looks like a play button. So if I hit that play
button, what it will do is execute this chunk of code. And then I could go
down to the next chunk, which you would see down here. So if I click this button, it will load in the dataset
we have up here on the right, and then it also opens the
help file for the dataset we are using, that's the
second line of this code chunk. So you can see that R
Notebooks are pretty cool. They allow you to run your code in chunks. You can using R Markdown, put
in instructions and images, which we'll see and so on and so forth. But I don't wanna go into all that here. I just wanted to kinda explain how this is all put together. So what I'm going to do is
go up to this Knit button. What that will do is allow me to create an HTML file based off the R Markdown in the notebook. So I hit Knit to HTML and it runs. And the first thing it does is open our MTCARS dataset help file, so we'll always have that available. Now, if I go down here,
another window opened up, this is the browser inside of R, but I wanna make this even bigger. So I'm gonna open it in my actual browser. And now we have everything
blown up nice and big. Let's go ahead and actually
get to the meat of the video, which is partial correlation. So first things first, I have all my YouTube
resources up here at the top. So my main channel, all my playlists, this playlist that we're in right now, which is model building playlists 20, and then the previous
video in this playlist, which is a visual guide
to partial correlation. And I do recommend you watch
that before doing this one. So when we go into the numbers and do all the regression
models and stuff, you'll actually know why we're doing them. To understand the gist of this video, what should you know, to start with? So basic statistics like
correlation and linear regression, you should know what the R-square
is in a regression model. You should know what residuals
are in a regression model. 'Cause they will play a very
important role in this video. And it will help, like I said, if you've watched the previous video that I've listed up here above. So the first thing we have to do is load in libraries for R. So those are add-ons to R,
that extend its functionality and make some things easier to use. So we may not need all of these libraries, but they're good to have handy. If we were writing super optimized code, we would be meticulous
about package management. But for this small example, this is fine. Now, I also want to note here
that I am not an R expert, and this is not a tutorial on R. There are many ways to
do the same thing in R, and probably many of you watching this are much better at R than I am. So I just wanna get it out of the way that I'm using R here as a tool, not teaching people how to use R, and I've kept everything very simple. So what I do here is strictly
for teaching purposes. So just sort of keep that in mind. So we load our libraries in. The datasets package contains the built-in datasets that come with R. knitr is what we use to create this HTML document. ggplot2 creates plots. psych is a fantastic package
for statistical analysis. It is massive. It has all kinds of tools in it. It's well supported. So I definitely recommend
you learn more about psych, if you have not learned about it already. And then lmSupport is a package that helps with linear models. So LM, linear model support, and it gives us some
additional functionality we'll use towards the end. So those are the libraries that we will pull in.
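As a rough sketch (the exact chunk is in the downloadable notebook), that load-library chunk looks something like this:

```r
# Load-library chunk (a sketch; the exact code is in the downloadable project)
library(datasets)   # built-in datasets, including mtcars
library(knitr)      # used to knit this notebook to HTML
library(ggplot2)    # plotting
library(psych)      # statistics toolbox: cor.plot, partial.r, lowerMat, etc.
library(lmSupport)  # linear model helpers, e.g. modelEffectSizes
```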
Next we will load in the MTCARS dataset. So MTCARS is a very famous
dataset that's built into R, you'll see it used in
many tutorials and books and things you might read. So we will stick with that. And most of us are familiar
with how cars work. So it's also practically useful. So we'll load in MTCARS into a data frame. So an R data frame, if you're not familiar is sort of similar to
an Excel spreadsheet. Of course there are differences, but if you're looking for something that's analogous to
your everyday experience an R data frame is similar
to that Excel spreadsheet. So it's easy to understand
the structure of named columns and rows of observations
and so on and so forth. So what we'll do here is we
will take the MTCARS dataset, and then we will assign that, using that sort of arrow operator, to another object called df_cars. So we're telling R, "Hey, take the MTCARS dataset and put it in this object named df_cars." And then the second line will open up, like I showed you before, the help file for the MTCARS dataset that shows you what the data actually is.
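In code, that chunk is roughly the following (a sketch based on the narration):

```r
# Copy the built-in mtcars data into our own data frame, then open its help page
df_cars <- mtcars   # the arrow operator assigns mtcars to the object df_cars
?mtcars             # second line: opens the help file describing the dataset
```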
So if I go over here, let me make this a little bit bigger. So you can see where the data comes from; it's from the 1974 Motor Trend magazine. It looks at 10 aspects of
automobile design for 32 cars. This lists all of the variables
that are in the dataset and then their position
within the columns. So MPG is the first variable, number of cylinders is the second variable, and so on and so forth. That can be very handy
when you're trying to just work with certain variables in any notes and things like that. So that's what this little line here does: it opens that help file for you. So the next thing we do is we're gonna do some initial correlations just to get a feel for our data, to get a sense for it. So this is simple
exploratory data analysis. Now we want to pay special attention to the first column in this correlation matrix, which is miles per gallon. That's gonna be our overall dependent or target variable. So we are particularly interested in that variable. So to generate this plot, we use cor.plot from the psych package. So cor.plot(mtcars) will give us our full correlation plot.
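A minimal version of that exploration step, reusing the df_cars object from above, might be:

```r
# Numeric correlation matrix (rounded) plus psych's correlation plot
round(cor(df_cars), 2)   # the mpg column shows cyl and disp near -0.85, hp near -0.78
cor.plot(df_cars)        # newer psych versions also expose this as corPlot()
```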
And this is what that looks like. So you can take a look at that for a second; pause the video if you want. And again, we're interested primarily in this first column with MPG. Now, a few things we should keep in mind. So we can see that the variables with the strongest negative correlation with miles per gallon are
cylinders, so number of cylinders, we can see that here at negative 0.85. Engine displacement
which is negative 0.85. Horsepower, car weight
and number of carburetors. Number of cylinders are just like, you've heard the terms like
V8, V6, flat-4 stuff like that. Those are just the number
of cylinders in the car. Displacement is sort of the
volume of those cylinders. So if the displacement is larger, the volume inside those
cylinders is larger and that allows more fuel to be burned. Engine horsepower, I think
we all know what that is. Car weight, obviously just
the weight of the car. And then back in the 70s,
engines used carburetors, which basically pull in
and mix the air and fuel. Now, nowadays we primarily
use fuel injection, but carburetors were sort
of the fuel injection of 30, 40, or more years ago. Now this should all make sense because all of these
contribute to using more fuel and therefore should logically tend to have a negative
correlation with miles per gallon. So more cylinders means more fuel can be burned. Larger displacement, that
means more fuel can be burned therefore miles per
gallon would be negative. Horsepower, more powerful
cars tend to burn more fuel. Therefore miles per gallon is negative. Heavier cars tend to burn more fuel, again and so on and so forth. Now, the variables with the
strongest positive correlation with miles per gallon are DRAT, which is the rear drive gear ratio, and qsec, which is the quarter mile time in seconds. So have you ever seen a drag strip or a drag car race? We have those here in the US; I'm not sure whether you have those overseas, for those of you watching from overseas. But the car starts at a standstill, the light turns green, and the car goes as fast as it can for a quarter of a mile. And then we time how long that takes. VS is the shape of the engine. So some engines are in a V-shape. So you've probably heard of
a V8 engine or a V6 engine. Some engines are inline,
some engines are flat. So my car, a Subaru, has what's called a boxer engine. So the engine is flat. So VS is just the shape of the engine. AM is automatic or manual transmission. So a manual transmission
you're probably familiar with, it's the gears usually in a stick shift in the console middle of the car. And then gear is the
number of forward gears. A lot of automatic transmissions are four speed transmissions. My car is a five speed transmission. Other cars maybe have six
speed transmissions, and so on. So that's what we mean by gear. And as we go up, those are
all positively correlated with miles per gallon. So if a car has a higher,
what we call a taller gear in the rear end, that usually means that for the wheels to spin
the engine turns less. And therefore, if the
engine is turning less, you're using less fuel. Quarter second time, that makes sense: cars that are slower in their quarter second time tend to be less powerful
and therefore use less fuel. And of course, actually, if
we take a look over here, we can see that quarter second time has a strong negative
correlation with horsepower, that tells us that higher horsepower cars have lower quarter second
times and vice versa. Automatic and manual, that's the gearbox. So if we go down here, we
can see that one is manual. So manual cars with a stick shift tend to get better gas mileage, actually. And then number of gears,
again, like I said, if you have more gears
that allows the engine to run more efficiently, so
those are our two sets here. So these first sets tend to
decrease miles per gallon. And this other set down here tend to increase miles per gallon. Next thing we'll do is go
for data frame preparation. So what we're gonna do
for this simple example, we're just gonna focus on
one dependent variable, miles per gallon, and just two independent variables: DRAT, which is the rear end gear ratio, and then horsepower. So we're just looking
at those two variables out of all the ones above. We are interested in
the partial correlation between miles per gallon and DRAT, while controlling for
the effect of horsepower and the partial correlation
between miles per gallon and horsepower while controlling
for the effect of DRAT. So we are going to conduct
two partial correlations while controlling for the third variable. Therefore let's create a data frame that just has what we need. So let's scroll down a little bit more. So this code creates a
list of three variable names we wanna keep and then places them into a second, smaller data frame here. So it's a subset of the larger MTCARS. So we'll call this object keeps, and we will assign to that a list of the variables we want to keep: miles per gallon, DRAT, and horsepower. But keep in mind the quotes here. And they have to be
spelled exactly the same, including capitalization; in the data frame the column names are lowercase, so "mpg", "drat", and "hp". Then we create a second data frame, df_cars2, and we assign to that MTCARS, which is our larger dataset, but just the three variables we listed above. So we go into MTCARS, get just the variables that we listed out in keeps, and put that into a second data frame called df_cars2.
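A sketch of that subsetting chunk (the notebook may equally use df_cars here; it holds the same data as mtcars):

```r
# Keep only the three columns we need, using their exact (lowercase) names
keeps    <- c("mpg", "drat", "hp")
df_cars2 <- mtcars[keeps]   # a smaller data frame with just those variables
head(df_cars2)
```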
Now to make everything clear, we will place each variable into its own object, so we have explicit names for each variable. This will also allow us to
manipulate each variable easily. Technically you do not have to do this, but I'm doing this again
just for teaching purposes. So we're taking the first variable here: MTCARS, then the dollar sign, and then miles per gallon. So we're telling R, go into the MTCARS dataset and grab the column, or the variable, of miles per gallon. Then assign that to an object here called Y_cars, because that's gonna be our overall dependent variable, our target variable. Then do the same thing for DRAT: we're gonna assign that to a variable X1_cars, and then horsepower to X2_cars. So we have our Y, X1 and X2.
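Roughly, those assignments look like this:

```r
# Pull each column into its own, explicitly named object
Y_cars  <- mtcars$mpg    # dependent / target variable: miles per gallon
X1_cars <- mtcars$drat   # rear axle (drive gear) ratio
X2_cars <- mtcars$hp     # gross horsepower
```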
We next need to create some basic linear models and then place the models into objects. The first listed variable
after the lm command (lm means linear model), which you see here, is the target of that model. And the variables that
follow this tilde sign are the independent variables we are putting into that model. So a consequence of this step is that we will also obtain the
important R-square values we used in the previous video. So let's look at these one by one. We are creating an object called cars_YX1, and we're going to assign
to that this linear model. So LM, then we have Y_cars that in this case is our
target variable, tilde and then X1_cars, that's
our independent variable. So here we are doing a
simple regression model of using X1 cars, which is also
DRAT, remember, by the way, to predict the miles
per gallon for the cars, then below that we do the same
thing, but for horsepower. So now it's X2. So we have Y_cars and then X2_cars. Then the next two might
seem a little bit odd, but we need these to get residuals. So in these next two, we're actually taking
the independent variables and regressing them with each other. So in this first example, X1 is the target variable,
remember that's DRAT. And then horsepower is the
independent variable here, below that we've flipped them. So X2 is the target
variable that's horsepower, and then X1 which is DRAT
is the independent variable. And then below that we
have the full model. So Y, X1, X2. So Y_cars regressed with
the independent variables of X1 cars and X2 cars. So we have two single
variable regressions, two regressions, where we're regressing the independent variables
on each other. And then we have the
full model down below. So this will give us all of our R-squares and will give us the residuals that we will need here in a minute.
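As a sketch, the five models look something like this (cars_YX1 is named in the video; the other object names are assumed to follow the same pattern):

```r
# Five linear models: two simple regressions on mpg, the two independent
# variables regressed on each other (both directions), and the full model
cars_YX1   <- lm(Y_cars  ~ X1_cars)             # mpg ~ drat
cars_YX2   <- lm(Y_cars  ~ X2_cars)             # mpg ~ hp
cars_X1X2  <- lm(X1_cars ~ X2_cars)             # drat ~ hp
cars_X2X1  <- lm(X2_cars ~ X1_cars)             # hp ~ drat
cars_YX1X2 <- lm(Y_cars  ~ X1_cars + X2_cars)   # full model: mpg ~ drat + hp
```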
So very quickly, here are our model summaries. We actually did this
in the previous video, but just want to show you the output. So you kind of see where to find it in R. So the summary.lm function gives
us all of this information. So if you ever create a linear model and you wanna get the residuals, you wanna get the coefficients, the significance values, and then all of the standard
errors and stuff like that, just use this summary.lm function.
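For example, using the model objects sketched above:

```r
# summary() dispatches to summary.lm for lm objects; summary.lm() also works directly
summary(cars_YX1)    # mpg ~ drat:       Multiple R-squared ~ 0.464
summary(cars_YX2)    # mpg ~ hp:         Multiple R-squared ~ 0.602
summary(cars_X1X2)   # drat ~ hp:        Multiple R-squared ~ 0.201
summary(cars_X2X1)   # hp ~ drat:        Multiple R-squared ~ 0.201
summary(cars_YX1X2)  # mpg ~ drat + hp:  Multiple R-squared ~ 0.741
```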
So we can see on here this multiple R-squared of 0.464. That's the same thing we
found in the previous video. So when we regressed DRAT with miles per gallon, we got an R-squared of 0.464. So we keep scrolling down. So here's the same idea,
but with horsepower. Our R-square here is 0.6024. Again, that's the same value
we found in the previous video. Now, when we regressed
the independent variables on each other, we're actually
gonna get the same R-squared. However, the residuals
will not be the same. They're related, but they're not the same. Here the R-square is 0.2014; same thing down here, 0.2014. Now our full model with miles per gallon,
variables together simultaneously we have the
multiple R-square of 0.7412. And again, that's the
same R-square we found in the previous video. So we just did a series
of simple regressions, linear regressions using the lm function and the variables we created from above. Next, we need to capture the residuals because under the hood, partial correlation is actually about the correlation among residuals. That's why we're grabbing them. So all we are doing here is taking the five regression models
we did in the step above, and then grabbing the residuals using this function here, resid. So if we do resid, then we put in our first regression model, we can assign those residuals to a different object over here, and we just call it cars_YX1_resid. So we're just grabbing the
residuals off that model and putting them in their own object. And we do that for all five models: miles per gallon and DRAT, miles per gallon and horsepower, the two independent variables regressed with each other, and then the full model. So now we have five sets of residuals.
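A sketch of that step, with residual object names assumed from the narration:

```r
# Grab the residuals off each model with resid()
cars_YX1_resid   <- resid(cars_YX1)    # mpg with drat's effect removed
cars_YX2_resid   <- resid(cars_YX2)    # mpg with hp's effect removed
cars_X1X2_resid  <- resid(cars_X1X2)   # drat with hp's effect removed
cars_X2X1_resid  <- resid(cars_X2X1)   # hp with drat's effect removed
cars_YX1X2_resid <- resid(cars_YX1X2)  # residuals of the full model
```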
So now the partial correlation part. So think of it this way. I want you to put this question in your mind, in this fashion: what is the variable
we want to partial out? So if we're looking for the
partial correlation in this case between miles per gallon and DRAT, what's the variable we
want to partial out? Well, the answer is horsepower. So what is the implication of that? Well, horsepower will be
the independent variable and miles per gallon, and
DRAT will be the targets for two regression models. And again, you'll see how that looks here obviously in a second. To get the partial correlation
between miles per gallon and DRAT, we need to take
out the effect of horsepower. Now regression is how we take
out the effect of horsepower. So once we do that for both
miles per gallon and DRAT, we have two sets of
residuals, i.e. what's left. So think about that. To get the partial correlation between miles per gallon and DRAT, we take out the effect of horsepower. Regression is how we take out the effect of horsepower. Remember, we're gonna grab up all the variance that we can using these regression models, and then we're gonna have residuals, i.e. what's left. Now the correlation between what's left is the partial correlation between miles per gallon and DRAT. And as a diagram, it looks like this. So in these two simple regressions here, on the left we have the two variables we want the partial correlation for. So in this case, it's
miles per gallon and DRAT, then what we're gonna do
is create two regressions with the variables we want to partial out, which is on the right. So in the second box here, we have the variables to partial out of the two variables on the left. And again, we can have more than one variable to partial out; here we just have horsepower, but if we were doing partial correlations with the entire MTCARS dataset, and we wanted the partial correlation between miles per gallon and DRAT, we would put all of the other variables to the right of that tilde sign, and we would still get residuals the same way, and so forth. So we have two regressions, and we get two sets of residuals out of those models. Then we do the correlation
between those residuals that gives us the partial correlation between the two variables
that we were interested in in the first place, in this
case miles per gallon and DRAT. See how that works. So this means two regressions where horsepower is the
independent variable, and DRAT and then miles per gallon are the target variables, respectively. So we already did this actually; this is repetitive code, I just put it down here again so it sits with this diagram. So we create two linear regression models. Then we grab the residuals off those two linear regression models. And then we go down here, and what we do is we create an object, parcor_MPG_DRAT, and we're gonna assign to that the correlation between those residuals. So what we're doing here is two regression models, grab the residuals, now we're doing a correlation between those two residuals, and then we print it out.
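Under the assumed object names from above, that step is:

```r
# Partial correlation of mpg and drat, controlling for hp:
# correlate what's left of mpg after hp with what's left of drat after hp
parcor_MPG_DRAT <- cor(cars_YX2_resid, cars_X1X2_resid)
parcor_MPG_DRAT   # about 0.5907
```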
and DRAT is 0.5907. So you see how that works. Partial correlation is a correlation between two sets of residuals. Now that was 0.5907301. And the bi-variate correlation that we did at the very beginning with all of our variables was 0.6811719. So we can see that the partial correlation is quite a bit lower than the bi-variate correlation we had above. And we'll talk about what that means in the grand scheme of
things here in a few minutes. Now, next we have the partial correlation between miles per gallon and horsepower; this is the exact same process. Again, think in your mind, the question to ask: what is the variable
we want to partial out? In this case the answer is DRAT because that's the one that's
not in the two variables that we want to get the
partial correlation for. So the implication is that DRAT is the independent variable, and then miles per gallon and horsepower are the targets for our
two regression models. So to get the partial correlation
between miles per gallon and horsepower, we've gotta
take out the effect of DRAT. And again, regression is how we take out the effect of DRAT, we go
in and we use the models of DRAT predicting miles per gallon and DRAT predicting horsepower to grab up all the variance that we can in each of those two. So once we do that for both
miles per gallon and horsepower, we have two sets of
residuals, i.e. what's left. And again, the correlation between what's left, the residuals, is the partial correlation between miles per gallon and horsepower. So the diagram looks like
this, similar to above. So we're interested in the partial correlation between miles per gallon and horsepower. We'll create two regression models where DRAT is the independent variable. We get two sets of residuals. We correlate the residuals, and that gives us the partial correlation of miles per gallon and horsepower. So this was the same process; I'm not gonna go through this in detail, but we have our two regression models, our two sets of residuals, we correlate the residuals, and print it out.
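Again with the assumed object names:

```r
# Partial correlation of mpg and hp, controlling for drat
parcor_MPG_HP <- cor(cars_YX1_resid, cars_X2X1_resid)
parcor_MPG_HP   # about -0.7191
```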
So the partial correlation between miles per gallon and horsepower is negative 0.7191. And the bi-variate correlation
we had at the very beginning was negative 0.7762, approximately. So you can see here that the correlation didn't change as much from
the bi-variate correlation in the beginning to the
partial correlation here. So it would appear that
the relationship between miles per gallon and horsepower, at least in this scenario, is stronger, because the effect of DRAT was minimal compared to the first situation, where using horsepower as the independent variable lowered the bi-variate correlation of DRAT with miles per gallon a little bit more. So I think it went down to about 0.59, yeah, here, 0.59. So it appears here that the relationship between miles per gallon
and horsepower is stronger, or at least it's not affected by DRAT as much as the reverse. So two more steps here that can make this a bit easier in the long run. We went through all that
during this video to show you conceptually, what's going on, actually. So you know, what's actually
going under the hood because my videos are
largely about conceptual, not just copy some code,
paste it and there you go. So I want you to understand
what time actually going on. So we can use the psych package for sort of one stop shopping. The psych package has
a partial.R function, and it has something called lowerMat, which actually makes our
correlation matrix look prettier. So we can use this command here: partial.r of our df_cars2 data frame. Remember, that's the data frame we created at the beginning that has just our three variables; that's why we needed it here. So partial.r of that data frame, and then we put it inside lowerMat, which makes it easier to read.
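That one-liner is roughly:

```r
# psych one-stop shopping: partial correlations for every pair of variables
# in df_cars2, each controlling for the remaining variable
lowerMat(partial.r(df_cars2))
# mpg-drat about 0.59, mpg-hp about -0.72, matching the residual approach above
```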
So if we look here, this is what we get. First, the lowerMat part. If you notice here, we only
have sort of this wedge shape, because everything above that is repeated. That's what lowerMat does: it gives us the lower triangle of all this output. And then partial.r, of course, gives us our partial correlations. Now, if you notice, these
numbers look familiar. So the partial correlation between miles per gallon
and DRAT was 0.59. That's exactly what we got above. And then the partial correlation between miles per gallon and horsepower, which is what we just did right here, was negative 0.719, which rounds to negative 0.72. So that is a one-stop shopping command inside of the psych package that you can use to find
partial correlations. And finally, there's a
package called lmSupport, so linear model support (I misspelled it, my apologies). Now, note that this takes in the full model, so both of our independent variables, and then produces this partial eta squared, the effect size value for the variance associated with the effect or variable. Now, we could go into effect size and do a whole other video on that, which I probably will at some point. But I'm just kind of explaining
what output it gives you, and how you have to interpret it, to get the partial correlations
that we're interested in. So therefore you will have
to take the square root of the partial eta squared to obtain the partial correlation. If you look down here, we're gonna use this modelEffectSizes command, and we put in the full model.
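Assuming the full-model object sketched earlier (cars_YX1X2), that call is roughly:

```r
# lmSupport's effect-size helper, run on the full model
modelEffectSizes(cars_YX1X2)
# partial eta-squared: drat (X1_cars) about 0.3490, hp (X2_cars) about 0.5171
```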
Down below, we get this partial eta squared column right here. So this is our effect size, partial eta squared. This column here is the effect size, and it's the proportion of the variance in the dependent variable explained by the independent variables in the sample. So the way to interpret that
is that X1 cars which is DRAT explains about 35% of the
variance in miles per gallon. And then X2 which is horsepower explains about 52% of the
variance in miles per gallon. Now notice that these are squared, so we take the square
root of these values. Then we get our partial correlations. So the square root of 0.3490 is 0.5907, which is the partial correlation between miles per gallon and DRAT. And then the square root of 0.5171 is 0.7191, or in this case negative 0.7191, remember, because the correlation between miles per gallon and horsepower is negative; that's the partial correlation between miles per gallon and horsepower. So it's just the square root of the partial eta squared: the square root of the partial eta squared is the partial correlation.
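A quick worked check of that square-root step:

```r
# Square roots of the partial eta-squared values recover the partial correlations
sqrt(0.3490)        # about 0.591: partial correlation of mpg and drat
-1 * sqrt(0.5171)   # about -0.719: partial correlation of mpg and hp
                    # (negative sign taken from the raw mpg-hp correlation)
```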
Now, this last column here is actually very interesting as well, and that's gonna lead into our next series of videos. So I don't wanna spoil that fun. However, if you look at those
values, so 0.1387 and 0.2772: if you remember from our previous video where we solved the Venn diagram puzzle, that first one is the value for a, 0.1388, just off because of rounding. So that's this area here in the purple, and then c is 0.2772, that's
this area here in the yellow. And again, they show up here in this model effect
sizes column here at the end, and we'll get into what that
means a little bit later. And finally, just to reinforce that we're doing this the right way, what we're gonna do here
is create a correlation of all the residuals together. So I'm gonna create a data frame. So data.frame, and then
in that data frame, I'm gonna put all the residual objects that we made: the YX1 residuals, the YX2 residuals, and so on and so forth. I'm gonna round this to three digits to make it easier to read.
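A sketch of that last chunk, with the residual objects and column labels assumed:

```r
# Put all five residual vectors side by side and correlate them
resid_df <- data.frame(
  YX1   = cars_YX1_resid,    # mpg with drat removed
  YX2   = cars_YX2_resid,    # mpg with hp removed
  X1X2  = cars_X1X2_resid,   # drat with hp removed
  X2X1  = cars_X2X1_resid,   # hp with drat removed
  YX1X2 = cars_YX1X2_resid   # full-model residuals
)
round(cor(resid_df), 3)
```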
And then we get this little matrix down here, and you will see that
these numbers reappear. So here is our negative 0.719, which is the partial correlation between, look what we're doing here, look at the variables we
have YX1, we have X2X1. Remember the X1s we're partialing out, so this negative 0.719 is
the partial correlation between Y and X2 which is miles
per gallon and horsepower. And then for the other
one is right here, 0.591. Again, look at our
variables we have stored, we're looking at YX2 and X1X2. So that's the partial correlation between Y and X1 while partialing out X2. So 0.591 is the partial correlation between miles per gallon, which is Y and DRAT, which is X1. So we can say that again down here. All right, so that wraps
up this tour de force of partial correlation
using R, just a tool with some visuals to kind of see how everything flows together. Some basic understanding of how R works and then what's actually
going on under the hood in partial correlation. So I hope you found this video helpful. I hope you find the files you can download and follow along helpful. And again, my information is here. If you wanna look at other videos of mine and this playlist and
the table of contents for all my videos, but I will stop rambling now. And thank you very much for watching. I wish you all the best of
luck in your future studies and in your work. And I look forward to
seeing you again next time. Take care, bye bye.