Statistics 101: Model Building, Partial Correlation Concepts in R

Video Statistics and Information

Captions
- So what is partial correlation, and how can we use simple techniques in R to learn more about it? Stick around and let's find out. (upbeat music) Hello and namaste. My name is Brandon and welcome to the channel. On this channel, you will find lessons and tutorials for statistics, data science, and related fields that can help you get higher grades in your class, prepare you for that next job, or just quench the natural curiosity you might have around these topics. When all is said and done, if you like the video, please give it a thumbs up and share it with classmates, colleagues, friends, or others you think might benefit from watching. And of course, if you haven't already, please hit that subscribe button and the bell notification. In the description below, you will find two important links. One is to a table of contents for all of my videos, so if you wanna study something else, just follow that link. The second link goes to where you can download the file to follow along with this video. So again, look for those in the description below.
Now that we are introduced and up to speed, let's hop into R and learn about partial correlation. I want to reiterate that you do not need advanced knowledge of R to watch this video. In fact, I'm gonna assume you know very little. Many of my viewers have quite a bit of R experience, whereas another subset do not; they're typically business students or university students taking a stats class for their major or for their general requirements. So I'm gonna assume a low level of R knowledge. And actually, we are only going to stay in the RStudio interface for a few minutes. We will then hop out to an HTML file created from this RStudio environment and follow along with the actual lesson there. As for the R environment that we have here, I'm running an R project.
An R project keeps all of your stuff in one place. The file that you see over here on the left is an R Notebook. It's very similar to the Jupyter notebook you might see when using Python. The language inside this R Notebook is called R Markdown. You can see the hash signs: one hash is an H1 heading, two hashes is H2, and so on. If you want to know more about how this document is put together, just do a Google or YouTube search for R Markdown or R Notebooks. And of course you can download this project file as a zip from the link in the description below.
Finally, and very quickly, what makes this R Notebook environment so flexible is that you can run chunks of code. Here in this load-library chunk, on the right, you'll see a green arrow that looks like a play button. If I hit that play button, it executes this chunk of code. Then I can go down to the next chunk. If I click its button, it loads the dataset, which we have up here on the right, and it also opens the help file for the dataset we are using; that's the second line of this code chunk. So you can see that R Notebooks are pretty cool. They allow you to run your code in chunks, and using R Markdown you can put in instructions and images, which we'll see. But I don't wanna go into all that here; I just wanted to explain how this is all put together.
So what I'm going to do is go up to this Knit button, which lets me create an HTML file from the R Markdown in the notebook. I choose Knit to HTML, and it runs. The first thing it does is open our mtcars dataset help, so that we'll always have that available. Another window also opened up down here; this is the browser inside of RStudio, but I wanna make this even bigger, so I'm gonna open it in my actual browser. And now we have everything blown up nice and big.
Let's get to the meat of the video, which is partial correlation. First things first, I have all my YouTube resources up here at the top: my main channel, all my playlists, the playlist we're in right now, which is model building, playlist 20, and the previous video in this playlist, which is a visual guide to partial correlation. I do recommend you watch that before this one, so that when we go into the numbers and do all the regression models, you'll know why we're doing them.
What should you know to start with? Basic statistics like correlation and linear regression. You should know what the R-square is in a regression model, and you should know what residuals are in a regression model, 'cause they will play a very important role in this video. And like I said, it will help if you've watched the previous video that I've listed above.
The first thing we have to do is load in libraries for R. Those are add-ons to R that extend its functionality and make some things easier to use. We may not need all of these libraries, but they're good to have handy. If we were writing super optimized code, we would be meticulous about package management, but for this small example, this is fine. I also want to note that I am not an R expert, and this is not a tutorial on R. There are many ways to do the same thing in R, and probably many of you watching are much better at R than I am. I'm using R here as a tool, not teaching people how to use R, and I've kept everything very simple. What I do here is strictly for teaching purposes, so keep that in mind.
So we load our libraries in. datasets holds the built-in datasets that come with R. knitr is what we use to create this HTML document. ggplot2 creates plots. psych is a fantastic package for statistical analysis. It is massive.
It has all kinds of tools in it, and it's well supported. So I definitely recommend you learn more about psych if you have not already. And then lmSupport is a package that helps with linear models, so LM, linear model support, and it gives us some additional functionality we'll use towards the end. Those are the libraries that we will pull in.
Next we will load in the mtcars dataset. mtcars is a very famous dataset that's built into R; you'll see it used in many tutorials and books. So we will stick with that. Most of us are familiar with how cars work, so it's also practically useful. We'll load mtcars into a data frame. An R data frame, if you're not familiar, is similar to an Excel spreadsheet. Of course there are differences, but if you're looking for something analogous to your everyday experience, that's it. So it's easy to understand the structure of named columns and rows of observations. What we'll do here is take the mtcars dataset and assign it, using that arrow operator, to another object called df_cars. We're telling R, "Hey, take the mtcars dataset and put it in this object named df_cars." The second line will open up, like I showed you before, the help file for the mtcars dataset, which shows you what the data actually is.
If I go over here, let me make this a little bit bigger. You can see where the data comes from: it's from the 1974 Motor Trend magazine. It looks at 10 aspects of automobile design for 32 cars. The help file lists all of the variables in the dataset and their position within the columns. So mpg is the first variable, number of cylinders is the second variable, and so on. That can be very handy when you're trying to work with just certain variables, in any notes, and things like that.
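As a minimal sketch of the loading step just described: the object name df_cars comes from the video, while the inspection calls (dim, names, head) are additions for illustration and are not shown in the source.

```r
# mtcars ships with base R (package 'datasets'), so nothing needs downloading.
df_cars <- mtcars            # copy the built-in data into our own object

dim(df_cars)                 # rows (cars) and columns (variables)
names(df_cars)               # column names, in the order the help file lists them
head(df_cars, 3)             # first three cars

# Open the help page only when someone is at the console;
# in a non-interactive run this line is skipped.
if (interactive()) help("mtcars")
```

The interactive() guard is a design choice for this sketch, so the same lines can run in a script without trying to open a help window.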
So that's what this little line here does: it opens that help file for you.
The next thing we do is some initial correlations, just to get a feel for our data. This is simple exploratory data analysis. We want to pay special attention to the first column in this correlation matrix, which is miles per gallon. That's gonna be our overall dependent, or target, variable, so we are particularly interested in it. To generate this plot, we use cor.plot from the psych package: cor.plot(mtcars) gives us our full correlation plot. And this is what that looks like. Take a look at it for a second, and pause the video if you want. Again, we're interested primarily in this first column, mpg.
A few things to keep in mind. The variables with the strongest negative correlation with miles per gallon are number of cylinders, which we can see here at negative 0.85; engine displacement, which is also negative 0.85; horsepower; car weight; and number of carburetors. Number of cylinders is what you've heard in terms like V8, V6, or flat-4; that's just the number of cylinders in the engine. Displacement is the volume of those cylinders. If the displacement is larger, the volume inside those cylinders is larger, and that allows more fuel to be burned. Engine horsepower, I think we all know what that is. Car weight is obviously just the weight of the car. And back in the 70s, engines used carburetors, which pull in and mix the air and fuel. Nowadays we primarily use fuel injection, but carburetors were the fuel injection of 30, 40, or more years ago. This should all make sense, because all of these contribute to using more fuel and therefore should logically tend to have a negative correlation with miles per gallon. More cylinders means more fuel can be burned.
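The first column of that correlation plot can also be pulled out as plain numbers. This is a sketch using base R's cor() as a stand-in for the psych plot used in the video; the object names cor_mat and mpg_cors are made up for illustration.

```r
# Full correlation matrix of all 11 variables (Pearson by default).
cor_mat <- cor(mtcars)

# Keep only the mpg column, dropping mpg's correlation with itself.
mpg_cors <- cor_mat[rownames(cor_mat) != "mpg", "mpg"]

# Sort from most negative to most positive and round for reading.
round(sort(mpg_cors), 2)

# Which variables move against mpg, and which move with it?
negative_vars <- names(mpg_cors)[mpg_cors < 0]
positive_vars <- names(mpg_cors)[mpg_cors > 0]
negative_vars
positive_vars
```

Sorting the column makes the two groups discussed in the video (the fuel-hungry variables and the fuel-saving ones) easy to see at a glance.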
Larger displacement means more fuel can be burned, so the correlation with miles per gallon is negative. More powerful cars tend to burn more fuel, so again the correlation is negative. Heavier cars tend to burn more fuel, and so on.
Now, the variables with the strongest positive correlation with miles per gallon are drat, which is the rear drive gear ratio, and qsec, which is the quarter-mile time in seconds. Have you ever seen a drag strip or a drag race? We have those here in the US; I'm not sure where else you might have them if you're watching from overseas. The car is at a standstill, the light turns green, and the car goes as fast as it can for a quarter of a mile. Then we time how long that takes. vs is the shape of the engine. Some engines are in a V shape, so you've probably heard of a V8 or a V6 engine. Some engines are inline, and some are flat. My car, a Subaru, has what's called a boxer engine, so the engine is flat. So vs is just the shape of the engine. am is automatic or manual transmission. A manual transmission you're probably familiar with: it's the gears, usually with a stick shift in the console in the middle of the car. And then gear is the number of forward gears. A lot of automatic transmissions are four-speed transmissions. My car has a five-speed transmission. Other cars may have six-speed transmissions. That's what we mean by gear.
Those are all positively correlated with miles per gallon. If a car has a higher, what we call a taller, gear in the rear end, that usually means the engine turns less for the wheels to spin, and if the engine is turning less, you're using less fuel. Quarter-mile time makes sense too: cars that are slower in the quarter mile tend to be less powerful and therefore use less fuel.
And actually, if we take a look over here, we can see that quarter-mile time has a strong negative correlation with horsepower. That tells us that higher-horsepower cars have lower quarter-mile times, and vice versa. Automatic and manual, that's the gearbox. If we go down here to the help file, we can see that a value of one is manual. So manual cars with a stick shift actually tend to get better gas mileage. And then number of gears: like I said, more gears allow the engine to run more efficiently. So those are our two sets. The first set tends to decrease miles per gallon, and the other set down here tends to increase miles per gallon.
The next thing we'll do is data frame preparation. For this simple example, we're gonna focus on one dependent variable, miles per gallon, and just two of those independent variables: drat, which is the rear end gear ratio, and horsepower. So we're just looking at those two variables out of all the ones above. We are interested in the partial correlation between miles per gallon and drat while controlling for the effect of horsepower, and in the partial correlation between miles per gallon and horsepower while controlling for the effect of drat. So we are going to conduct two partial correlations, each controlling for the third variable.
Therefore let's create a data frame that has just what we need. Let's scroll down a little bit more. This code creates a list of the three variables we wanna keep and then places them into a second, smaller data frame. So it's a subset of the larger mtcars. We'll call this object keeps, and we assign to it a list of the variables we want to keep: miles per gallon, drat, and horsepower. Keep in mind the quotes here, and the names have to be spelled exactly as they are in the dataset, including capitalization: "mpg", "drat", and "hp". Then we create a second data frame.
That's df_cars2, and we assign to it mtcars, which is our larger dataset, but with just the three variables we listed above. So we go into mtcars, get just the variables we listed in keeps, and put them into a second data frame called df_cars2.
Now, to make everything clear, we will place each variable into its own object, so we have an explicit name for each variable. This will also allow us to manipulate each variable easily. Technically you do not have to do this; again, I'm doing it just for teaching purposes. We take mtcars, then the dollar sign, then mpg. So we're telling R to go into the mtcars dataset and grab the column, or variable, for miles per gallon. Then we assign that to an object called Y_cars, because that's gonna be our overall dependent, or target, variable. Then we do the same thing for drat, which we assign to X1_cars, and for horsepower, which we assign to X2_cars. So we have our Y, X1, and X2.
We next need to create some basic linear models and place the models into objects. The first variable listed inside the lm command, which means linear model, is the target of that model, and the variables that follow the tilde sign are the independent variables we are putting into that model. A consequence of this step is that we will also obtain the important R-square values we used in the previous video. Let's look at these one by one. We create an object called cars_YX1, and we assign to it this linear model: lm, then Y_cars, which is our target variable, then the tilde, then X1_cars, which is our independent variable. So here we are doing a simple regression that uses X1_cars, which is drat, remember, to predict miles per gallon. Below that we do the same thing for horsepower, so now it's X2: we have Y_cars and then X2_cars.
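The preparation steps described above might look like the following sketch. The names keeps, df_cars2, Y_cars, X1_cars, X2_cars, and cars_YX1 are taken from the video; cars_YX2 is an inferred name for the second model, and the exact code in the downloadable notebook may differ.

```r
# Character vector of the three columns to keep; the names must match
# the dataset exactly, including lowercase.
keeps <- c("mpg", "drat", "hp")

# Subset of mtcars with only those columns.
df_cars2 <- mtcars[, keeps]

# One object per variable, purely for readability.
Y_cars  <- mtcars$mpg    # target: miles per gallon
X1_cars <- mtcars$drat   # rear axle (drive gear) ratio
X2_cars <- mtcars$hp     # horsepower

# Two simple regressions: target on the left of ~, predictor on the right.
cars_YX1 <- lm(Y_cars ~ X1_cars)   # mpg predicted from drat
cars_YX2 <- lm(Y_cars ~ X2_cars)   # mpg predicted from hp

coef(cars_YX1)   # intercept and slope for drat
coef(cars_YX2)   # intercept and slope for hp
```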
The next two might seem a little bit odd, but we need them to get residuals. In these next two, we take the independent variables and regress them on each other. In the first, X1 is the target variable, remember that's drat, and horsepower is the independent variable. Below that we've flipped them, so X2, horsepower, is the target variable, and X1, drat, is the independent variable. And below that we have the full model: Y_cars regressed on the independent variables X1_cars and X2_cars together. So we have two single-variable regressions, two regressions where the independent variables are regressed on each other, and then the full model. This will give us all of our R-squares and the residuals that we will need in a minute.
Very quickly, here are our model summaries. We actually did this in the previous video, but I want to show you the output so you see where to find it in R. The summary.lm function gives us all of this information. If you ever create a linear model and you wanna get the residuals, the coefficients, the significance values, the standard errors, and so on, just use this summary.lm function. We can see here the multiple R-squared of 0.464. That's the same thing we found in the previous video: when we regressed miles per gallon on drat, we got an R-squared of 0.464. We keep scrolling down. Here's the same idea, but with horsepower. Our R-square here is 0.6024. Again, that's the same value we found in the previous video. Now, when we regress the independent variables on each other, we get the same R-squared both ways. However, the residuals will not be the same. They're related, but they're not the same. Here is an R-square of 0.2014. Same thing down here, 0.2014.
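The five regressions described above can be sketched as follows. The model names cars_X1X2, cars_X2X1, and especially cars_YX1X2 are inferred or hypothetical; only cars_YX1 is named explicitly in the video.

```r
Y_cars  <- mtcars$mpg
X1_cars <- mtcars$drat
X2_cars <- mtcars$hp

# Five models, stored in a named list so they can be summarised together.
models <- list(
  YX1   = lm(Y_cars  ~ X1_cars),            # mpg  on drat
  YX2   = lm(Y_cars  ~ X2_cars),            # mpg  on hp
  X1X2  = lm(X1_cars ~ X2_cars),            # drat on hp
  X2X1  = lm(X2_cars ~ X1_cars),            # hp   on drat
  YX1X2 = lm(Y_cars  ~ X1_cars + X2_cars)   # full model
)

# summary() on an lm object dispatches to summary.lm; pull out R-squared.
r_squared <- sapply(models, function(m) summary(m)$r.squared)
round(r_squared, 4)

# Swapping target and predictor leaves R-squared unchanged...
all.equal(r_squared[["X1X2"]], r_squared[["X2X1"]])
# ...but the residuals differ (they are even on different scales).
range(resid(models$X1X2))
range(resid(models$X2X1))
```

Collecting the models in a list is a convenience of this sketch; calling summary() on each model one at a time, as in the video, gives the same values along with coefficients and standard errors.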
Now our full model, with miles per gallon as the target, or dependent, variable and the two independent variables together simultaneously, has a multiple R-square of 0.7412. Again, that's the same R-square we found in the previous video. So we just did a series of simple linear regressions using the lm function and the variables we created above.
Next, we need to capture the residuals, because under the hood, partial correlation is actually about the correlation among residuals. That's why we're grabbing them. All we are doing here is taking the five regression models from the step above and grabbing the residuals using this function here, resid. If we call resid on our first regression model, we can assign those residuals to a different object, which we call cars_YX1_resid. So we're just grabbing the residuals off that model and putting them in their own object. We do that for all five models: miles per gallon on drat, miles per gallon on horsepower, the two independent variables regressed on each other, and then the full model. So now we have five sets of residuals.
Now the partial correlation part. I want you to put this question in your mind: what is the variable we want to partial out? If we're looking for the partial correlation between miles per gallon and drat, what's the variable we want to partial out? The answer is horsepower. What is the implication of that? Horsepower will be the independent variable, and miles per gallon and drat will be the targets, in two regression models. You'll see how that looks in a second. To get the partial correlation between miles per gallon and drat, we need to take out the effect of horsepower, and regression is how we take out the effect of horsepower. Once we do that for both miles per gallon and drat, we have two sets of residuals, i.e., what's left.
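To make "what's left" concrete, here is a small sketch of the residual step for the two models that have horsepower as the predictor. The _resid object names follow the naming pattern mentioned in the video; the two checks at the end are additions for illustration.

```r
Y_cars  <- mtcars$mpg
X1_cars <- mtcars$drat
X2_cars <- mtcars$hp

cars_YX2  <- lm(Y_cars  ~ X2_cars)   # mpg  on hp
cars_X1X2 <- lm(X1_cars ~ X2_cars)   # drat on hp

# resid() returns one residual per car: observed value minus fitted value.
cars_YX2_resid  <- resid(cars_YX2)
cars_X1X2_resid <- resid(cars_X1X2)

# Check 1: fitted values plus residuals rebuild the original variable.
rebuilds <- all.equal(unname(fitted(cars_YX2) + cars_YX2_resid), Y_cars)
rebuilds

# Check 2: the residuals carry no linear relationship with hp any more,
# which is what "taking out the effect of horsepower" means.
leftover_cor_Y  <- cor(cars_YX2_resid,  X2_cars)
leftover_cor_X1 <- cor(cars_X1X2_resid, X2_cars)
c(leftover_cor_Y, leftover_cor_X1)   # both zero up to floating-point error
```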
So think about that. To get the partial correlation between miles per gallon and drat, we take out the effect of horsepower, and regression is how we do that. Remember, we grab up all the variance that we can using these regression models, and then we have residuals, i.e., what's left. The correlation between what's left is the partial correlation between miles per gallon and drat.
As a diagram, it looks like this. On the left, we have the two variables we want the partial correlation for; in this case, miles per gallon and drat. In the second box, on the right, we have the variables to partial out of the two variables on the left. We create two regressions using those. And again, we can have more than one variable to partial out. Here we just have horsepower, but if we were doing partial correlations with the entire mtcars dataset and we wanted the partial correlation between miles per gallon and drat, we would put all of the other variables to the right of that tilde sign, and we would get residuals in the same way. So we have two regressions, and we get two sets of residuals out of those models. Then we take the correlation between those residuals, and that gives us the partial correlation between the two variables we were interested in in the first place, in this case miles per gallon and drat. See how that works?
So this means two regressions where horsepower is the independent variable, and drat and miles per gallon are the target variables, respectively. We already did this, so this is actually repeated code; I just put it down here again so it sits with this diagram. We create two linear regression models, then we grab the residuals off those two models. And then we go down here and create an object.
Here it's parcor_MPG_DRAT. We assign to that the correlation between those residuals. So what we're doing here is two regression models, grab the residuals, take the correlation between those two sets of residuals, and then print it out. In this case, the partial correlation between miles per gallon and drat is 0.5907. So you see how that works: partial correlation is a correlation between two sets of residuals. That was 0.5907301, and the bivariate correlation that we did at the very beginning with all of our variables was 0.6811719. So the partial correlation is quite a bit lower than the bivariate correlation we had above. We'll talk about what that means in the grand scheme of things in a few minutes.
Next we have the partial correlation between miles per gallon and horsepower. This is the exact same process. Again, the question to ask is, what is the variable we want to partial out? In this case the answer is drat, because that's the one that's not among the two variables we want the partial correlation for. So the implication is that drat is the independent variable, and miles per gallon and horsepower are the targets for our two regression models. To get the partial correlation between miles per gallon and horsepower, we've gotta take out the effect of drat. And again, regression is how we take out the effect of drat: we use the models of miles per gallon on drat and horsepower on drat to grab up all the variance that we can in each of those two. Once we do that for both miles per gallon and horsepower, we have two sets of residuals, i.e., what's left. And the correlation between what's left, the residuals, is the partial correlation between miles per gallon and horsepower. The diagram looks similar to the one above. We're interested in the partial correlation between miles per gallon and horsepower.
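The residual recipe is the same for both partial correlations, so it can be wrapped in a small helper. This is a sketch: partial_cor and parcor_MPG_HP are hypothetical names introduced here, and only parcor_MPG_DRAT appears in the video.

```r
Y_cars  <- mtcars$mpg
X1_cars <- mtcars$drat
X2_cars <- mtcars$hp

# Hypothetical helper: regress a and b on the control variable,
# then correlate what is left of each.
partial_cor <- function(a, b, control) {
  cor(resid(lm(a ~ control)), resid(lm(b ~ control)))
}

parcor_MPG_DRAT <- partial_cor(Y_cars, X1_cars, control = X2_cars)  # hp partialled out
parcor_MPG_HP   <- partial_cor(Y_cars, X2_cars, control = X1_cars)  # drat partialled out

# Side-by-side with the ordinary (bivariate) correlations.
round(rbind(
  mpg_drat = c(bivariate = cor(Y_cars, X1_cars), partial = parcor_MPG_DRAT),
  mpg_hp   = c(bivariate = cor(Y_cars, X2_cars), partial = parcor_MPG_HP)
), 4)

# Cross-check against the textbook first-order formula
# r_yx1.x2 = (r_y1 - r_y2 * r_12) / sqrt((1 - r_y2^2) * (1 - r_12^2)).
r_y1 <- cor(Y_cars, X1_cars)
r_y2 <- cor(Y_cars, X2_cars)
r_12 <- cor(X1_cars, X2_cars)
formula_value <- (r_y1 - r_y2 * r_12) / sqrt((1 - r_y2^2) * (1 - r_12^2))
all.equal(formula_value, parcor_MPG_DRAT)
```

The closed-form check is not part of the video; it is included only to show that the residual route and the standard formula agree.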
We'll create two regression models where drat is the independent variable. We get two sets of residuals. We correlate the residuals, and that gives us the partial correlation of miles per gallon and horsepower. This is the same process, so I'm not gonna go through it in detail: we have our two regression models and our two sets of residuals, we correlate the residuals, and we print it out. The partial correlation between miles per gallon and horsepower is negative 0.7191, and the bivariate correlation we had at the very beginning was approximately negative 0.7762. So you can see that the correlation didn't change as much from the bivariate correlation to the partial correlation here. It would appear that the relationship between miles per gallon and horsepower, at least in this scenario, is stronger, because the effect of drat was minimal. Compare that to the first situation, where controlling for horsepower lowered the correlation of drat with miles per gallon a bit more, from about 0.68 down to about 0.59. So it appears that the relationship between miles per gallon and horsepower is stronger, or at least it's not affected by drat as much as the reverse.
Two more steps here can make this a bit easier in the long run. We went through all of that to show you conceptually what's going on under the hood, because my videos are largely about concepts, not just copy some code, paste it, and there you go. I want you to understand what is actually going on. We can use the psych package for one-stop shopping. The psych package has a partial.r function, and it has something called lowerMat, which makes our correlation matrix look prettier. So we can use this command here: inside, we have partial.r of our df_cars2 data frame.
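A sketch of the one-stop approach follows. Because psych may not be installed everywhere, the psych call is guarded, and a base R route is shown first: it uses the standard identity that partial correlations, each controlling for all remaining variables, come from the rescaled, sign-flipped inverse of the correlation matrix. That base R route and the name pcor_mat are additions here, not something shown in the video.

```r
df_cars2 <- mtcars[, c("mpg", "drat", "hp")]

# Base R route: invert the correlation matrix, rescale it to unit
# diagonal, and flip the sign of the off-diagonal entries.
prec     <- solve(cor(df_cars2))
pcor_mat <- -cov2cor(prec)
diag(pcor_mat) <- 1
round(pcor_mat, 2)

# psych route, as described in the video, run only if psych is available.
if (requireNamespace("psych", quietly = TRUE)) {
  # partial.r() with no x/y arguments partials each pair on all the
  # other columns; lowerMat() prints just the lower triangle.
  psych::lowerMat(psych::partial.r(df_cars2))
}
```

With only three variables, "all the other columns" is a single variable, so each entry matches a residual-based partial correlation of one pair controlling for the third.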
Remember, that's the data frame we created at the beginning that has just our three variables. That's why we needed it. So we take partial.r of that data frame and put it inside lowerMat, which makes it easier to read. If we look here, this is what we get. Notice that we only have this wedge shape, because everything above it is repeated. That's what lowerMat does: it gives us the lower triangle of the output. And partial.r, of course, gives us our partial correlations. These numbers should look familiar. The partial correlation between miles per gallon and drat was 0.59, which is exactly what we got above. And the partial correlation between miles per gallon and horsepower, which we just did, was negative 0.719; it rounds to negative 0.72. So that is a one-stop-shopping command inside the psych package that you can use to find partial correlations.
And finally, there's a package called lmSupport, for linear model support; I misspelled it here, my apologies. Note that this takes in the full model, so both of our independent variables, and produces partial eta squared, the effect size values, for the variance associated with each effect or variable. We could go into effect size and do a whole other video on that, which I probably will at some point. For now I'm just explaining what output it gives you and how you have to interpret it to get the partial correlations we're interested in. You will have to take the square root of the partial eta squared to obtain the partial correlation. If you look down here, we use this command, modelEffectSizes, and we put in the full model. Down below, we get this partial eta squared column right here. So this is our effect size, partial eta squared.
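Since lmSupport may not be installed (an assumption of this sketch), the two effect-size columns can be rebuilt in base R from nested models. All object names below are hypothetical; the definitions used are the usual ones: partial eta squared is the drop in residual sum of squares from adding a predictor, divided by the residual sum of squares before adding it, and delta R-squared is the gain in R-squared from adding it.

```r
full    <- lm(mpg ~ drat + hp, data = mtcars)   # both predictors
no_drat <- lm(mpg ~ hp,        data = mtcars)   # drat left out
no_hp   <- lm(mpg ~ drat,      data = mtcars)   # hp left out

sse <- function(m) sum(resid(m)^2)              # residual sum of squares
r2  <- function(m) summary(m)$r.squared

# Partial eta squared: share of the previously unexplained variance
# that the added predictor picks up.
pEta2_drat <- (sse(no_drat) - sse(full)) / sse(no_drat)
pEta2_hp   <- (sse(no_hp)   - sse(full)) / sse(no_hp)

# Delta R-squared: gain in total explained variance from the predictor.
dR2_drat <- r2(full) - r2(no_drat)
dR2_hp   <- r2(full) - r2(no_hp)

round(c(pEta2_drat = pEta2_drat, pEta2_hp = pEta2_hp,
        dR2_drat = dR2_drat, dR2_hp = dR2_hp), 4)

# Square root gives the size of the partial correlation; the sign has
# to be taken from the regression coefficient.
parcor_drat <- unname(sign(coef(full)["drat"]) * sqrt(pEta2_drat))
parcor_hp   <- unname(sign(coef(full)["hp"])   * sqrt(pEta2_hp))
round(c(parcor_drat, parcor_hp), 4)
```

Taking the sign from the full-model coefficient is a design choice of this sketch; it works because the partial correlation and the corresponding regression slope always share a sign.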
This column here is effect size, and it's the percentage of the variance in the dependent variable explained by the independent variables in the sample. How I interpret that is that X1_cars, which is drat, explains about 35% of the variance in miles per gallon, and X2, which is horsepower, explains about 52% of the variance in miles per gallon. Notice that these are squared, so we take the square root of these values to get our partial correlations. The square root of 0.3490 is about 0.5907, which is the partial correlation between miles per gallon and drat. And the square root of 0.5171 is 0.7191, which is the partial correlation between miles per gallon and horsepower; in this case it's negative, remember, because the correlation between miles per gallon and horsepower is negative. So the square root of the partial eta squared is the partial correlation.
This last column here is actually very interesting as well, and it's gonna lead into our next series of videos, so I don't wanna spoil that fun. However, look at those values: 0.1387 and 0.2772. If you remember from our previous video, where we solved the Venn diagram puzzle, that's the value for a, 0.1388 there just because of rounding. That's this area here in the purple. And c is 0.2772; that's this area here in the yellow. They show up here in the last column of the modelEffectSizes output, and we'll get into what that means a little bit later.
Finally, just to reinforce that we're doing this the right way, we're gonna create a correlation matrix of all the residuals together. I'm gonna create a data frame with data.frame, and in that data frame I'm gonna put all the residual objects that we made: the YX1 residuals, the YX2 residuals, and so on. I'm gonna round this to three digits to make it easier to read.
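The residual cross-check just described can be sketched as below. The column names mirror the model labels used in the video (YX1, YX2, X1X2, X2X1); resid_df, resid_cor, and the full-model label YX1X2 are hypothetical names.

```r
Y  <- mtcars$mpg
X1 <- mtcars$drat
X2 <- mtcars$hp

# One column of residuals per model; "YX1" means Y regressed on X1, etc.
resid_df <- data.frame(
  YX1   = resid(lm(Y  ~ X1)),        # mpg  with drat removed
  YX2   = resid(lm(Y  ~ X2)),        # mpg  with hp removed
  X1X2  = resid(lm(X1 ~ X2)),        # drat with hp removed
  X2X1  = resid(lm(X2 ~ X1)),        # hp   with drat removed
  YX1X2 = resid(lm(Y  ~ X1 + X2))    # full model
)

# Correlate every set of residuals with every other, rounded to 3 digits.
resid_cor <- round(cor(resid_df), 3)
resid_cor

# The two cells that pair residuals with the same variable removed
# are the partial correlations.
resid_cor["YX1", "X2X1"]   # mpg and hp, drat partialled out
resid_cor["YX2", "X1X2"]   # mpg and drat, hp partialled out
```

Only cells whose two columns end in the same removed variable are partial correlations; the other cells mix different controls and are not interpreted in the video.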
And then we get this little matrix down here and you will see that these numbers reappear. So here is our negative 0.719, which is the partial correlation between, look what we're doing here, look at the variables we have YX1, we have X2X1. Remember the X1s we're partialing out, so this negative 0.719 is the partial correlation between Y and X2 which is miles per gallon and horsepower. And then for the other one is right here, 0.591. Again, look at our variables we have stored, we're looking at YX2 and X1X2. So that's the partial correlation between Y and X1 while partialing out X2. So 0.591 is the partial correlation between miles per gallon, which is Y and DRAT, which is X1. So we can say that again down here. All right, so that wraps up this tour de force of partial correlation using R, just a tool with some visuals to kind of see how everything flows together. Some basic understanding of how R works and then what's actually going on under the hood in partial correlation. So I hope you found this video helpful. I hope you find the files you can download and follow along helpful. And again, my information is here. If you wanna look at other videos of mine and this playlist and the table of contents for all my videos, but I will stop rambling now. And thank you very much for watching. I wish you all the best of luck in your future studies and in your work. And I look forward to seeing you again next time. Take care, bye bye.
Info
Channel: Brandon Foltz
Views: 2,686
Rating: 4.9555554 out of 5
Keywords: brandon foltz, statistics 101, machine learning, linear regression, linear regression model, data analysis, data science, partial correlation, correlation, correlation and regression, semi partial correlation, partial correlation coefficient, partial correlation analysis, partial correlation example problems, statistics for data science, r squared regression analysis, partial correlation in r, correlation in r, rstudio, effect size, correlation coefficient
Id: bA_gUflzSyY
Length: 32min 57sec (1977 seconds)
Published: Tue Sep 08 2020