Multiple Linear Regression using R (All About It)

Captions
Hey folks, welcome back to my channel. In this video I'm going to show you how to build a linear regression model using R. This is going to be a comprehensive video covering all of your linear regression results: the R-squared, the p-value, the residual standard error, the F-statistic, all of that. Plus I'm also going to show you how to predict values from your model and then analyze whether the prediction is good or not.

First, like my previous modeling videos, I'm going to use the Boston housing data set, so let's read the data in and take a look. This data has 14 different variables and 506 observations; it describes the housing market in the Boston area in the United States. In this video, medv is going to be my output variable of interest, and I'll use lstat to predict medv, so lstat is my input variable, or X, and medv is my Y. medv stands for the median value of the houses, and lstat stands for the lower status of the population. I'm going to use these two to model medv; in fact, I can also add age or any other column if I want to do multiple regression, but we'll use lstat and medv as our main variables, and you can always add and drop other variables as needed.

First things first, let's look at the summary to see how the data looks. These two columns are where my interest is, plus I can also choose age for my prediction. For lstat, the minimum is 1.73, the maximum is 37.97, and the median is around 11. For medv, the minimum is 5, the maximum is 50, and the median is around 21. We'll also look at age, since we'll do a multiple
regression. For age, the minimum is 2.90, the maximum is 100, and the median is around 77.

Apart from that, I also want to show you a plot of this, so let's plot dataIn$medv against dataIn$lstat, the y and x variables. This gives us the full scatter plot of lstat versus medv. What I'm going to do next is build the model and show you where the linear regression line sits. As you might know, linear regression is basically y = mx + c: y is medv here, x is lstat, m is the slope that determines how the line is drawn, and c is the intercept. I'll show you how this line corresponds to the data and how it gets plotted over these x and y variables.

All right, first things first in model building: data partitioning. In my previous video on logistic regression I used stratified random sampling; you can take a look at that, where I used the caret library and its createDataPartition function. In this one I'm just going to use simple random sampling, since medv is numerical. It's actually super easy to partition; I'll show you right now. First you create a helper variable, call it p1; these aren't model variables, just helpers for the partition. p1 is created with runif, the random-uniform function, and I pass it the total number of rows, so runif(nrow(dataIn)): create that many random numbers. You can also just say 506; you can put 506 in place of nrow(dataIn); it's
all the same thing, though nrow keeps it general. Either way it creates 506 random numbers, and if I print p1 you see exactly that: 506 random numbers.

Now I'm going to use another helper variable called p2, in which I order p1. Why am I doing that? I'll show you in a second. If you look at p2, it just stores indexes: order(p1) returns the positions of the p1 values in sorted order. For example, 0.7454 might be the 240th value once ordered, 0.5767 the 352nd, and the first entry might be 299, meaning the 299th record. I'm doing this because I'm going to use these random indexes to shuffle the rows of dataIn, to create randomness in the selection.

Let's create the training data set. The training data comes from dataIn, and its row indexes are chosen from p2. We have 506 records, and I'm going to take, say, the first 350: p2[1:350], starting from 1 and going all the way to 350. So it takes the first 350 shuffled indexes from p2 and pulls only those rows from dataIn: the 299th record of dataIn will be taken, then the 271st
will be taken, then the 87th, the 414th, and so on, so it's a completely random selection from dataIn. A comma after the row index then gives me all the columns. My training data is created; it has 350 observations, in random order, exactly matching p2: the 299th record was chosen, then the 271st, and so on. That's the advantage of using runif.

Likewise, I'm also going to create the testing data set the same way: from dataIn, choosing indexes from p2, but this time the records from 351 all the way to 506. Let's take a look at the test set: 156 records, again a random selection just like before.

All right, next: now that we have partitioned the data set, which is a primary step in model building, let's go and build the model. I'm going to call it Boston_housing_regression; that's the name of my model object. The function is lm, for linear model; super simple to keep in mind since you're building a linear regression. I use medv as my output variable, a tilde to separate the output from the inputs, lstat as my input variable, and then, as we discussed, I'm also going to add age. If you use just lstat it's simple regression; since I'm adding age it becomes multiple regression. The data this time is the training data set: first we train the model so the engine learns the patterns in the data, and then when we test it, it should give us good prediction results. That's the
expectation. All right, now that the model has been built, before I jump into the summary and explain what R-squared, RMSE and all of that are, I want to show you how it looks when you plot the fitted line over the scatter plot. Let's use abline and plot this model. You see the line has a negative slope; let me add a little more clarity to that. This is the linear regression line: the y-intercept looks like it's around 33, and the slope for lstat is negative. That's what we have here.

All right, now let's look at the summary of the model. The first thing you see is the function call, and then the residuals. What are residuals? They're the differences between the predicted (fitted) values and the actual values. The expectation is that the residuals should be symmetrical: the min and max shouldn't differ wildly in magnitude, the first and third quartiles should be similar, and the median should be close to zero. That's what we have here.

Next, let's jump to the coefficients. As we saw, the line has a negative slope, which is why you have about -1.01 as the slope for lstat; the y-intercept is about +33.98, and the slope for age is about 0.0359.
Remember, this is a multiple regression with more than one input, and that's why abline threw a warning when I plotted it: the plot has only two dimensions, while the model has more than one predictor. I can always plot the other variable separately; it's all the same idea.

Let's come back to the coefficients. We have the intercept, the slope for lstat, and the slope for age, so now you can form the equation. Your y = mx + c looks something like this: medv = 33.98 + 0.0359 × age − 1.01 × lstat. That is our linear regression line; those numbers are the estimates.

Next to the estimates is the standard error, which is the uncertainty in estimating each coefficient, so obviously the smaller, the better. Then you have the t-value, which is basically the estimate divided by its standard error; you want the t-value to be large in magnitude, because a large estimate relative to a small standard error means the coefficient is well determined. Then you have the p-value, the probability of observing a |t| at least this large if the true coefficient were zero. The significance stars map onto p-value ranges; for example, three stars mean the p-value is below 0.001, which shows that
there is very strong evidence the estimate isn't just noise. In rough terms, there is less than a 1-in-1000 chance of seeing such an lstat slope by chance alone; the age slope here has fewer stars, so its p-value is larger, and the intercept again comes in below 0.001. That's what the p-values mean, and we obviously want them as low as possible, so that the estimates are trustworthy. Put another way, if you form the confidence level as 1 minus these values, whenever it's 0.95 or above it's great; if it's below 0.95 you may have to do some transformations on your data, like taking the logarithm of your X variables, for example log(age) or log(lstat), whichever needs to be treated.

Now let's jump to the residual standard error. What does this mean? This residual error goes by different names: root mean squared error, regression error, typical error; people call it different things, but it's essentially the same idea. It's the typical size of the difference between y and y-hat, between the actual medv and the predicted medv: take those differences, square them, average the squares, and take the square root, which (up to a degrees-of-freedom adjustment) gives you the residual standard error.

Then you have the degrees of freedom. There is a formula, n − k − 1, where n
is the number of observations. My training set has 350 observations, and k is the number of predictors, which is 2 here (lstat and age), so 350 − 2 − 1 = 347 degrees of freedom.

Next is the R-squared value. The multiple R-squared here is 0.57 and the adjusted R-squared is 0.56. Since this is multiple regression, it's better to look at the adjusted R-squared; if this were simple regression with only one variable, I would just look at the multiple R-squared. I'll explain the difference between the two in a second. So here I look at the adjusted R-squared, and what it says is that lstat and age combined explain about 56 percent of the variation in medv.

Now, the difference between multiple R-squared and adjusted R-squared, and why I use the adjusted version for multiple regression: the adjusted R-squared is scaled (penalized) for the number of parameters in the model. With a single predictor there is nothing to penalize, so the multiple R-squared is fair enough, no problem. But whenever there are more variables, you want their combined effect on the Y variable, medv, without being rewarded just for adding predictors, and that's why you
have the adjusted R-squared, which is always less than or equal to the multiple R-squared: equal in some cases, otherwise smaller. That's the adjusted R-squared.

Finally we have the F-statistic. The F-statistic tests whether the model as a whole explains a significant amount of the variation; a large F-statistic with a small p-value means the predictors are doing real work. The R-squared itself always ranges between 0 and 1, and the closer to 1 the better: if it's, say, 0.90, then 90% of the variation has been explained by your input variables. To improve the model you can add and drop different variables from the data set; you have 14 variables, so you can add crim, for example, or any of the others, put them into the formula, and then compare: look at the p-values (the lower the better), the R-squared, and the F-statistic, and based on those decide which variables to keep before you build your final model. That's the F-statistic, and then you have the model's overall p-value again here.

So this is basically the meat of the whole linear regression output; it's a really important piece of model building. Everything else is just basic programming, not a big deal, but understanding this is the important part. Also, since we're doing multiple regression, you have the residual plots, which I have explained in my other linear regression video:
you have residuals versus fitted, and so on. In linear regression there are assumptions that need to hold. First, homoscedasticity: as the fitted values increase, the spread of the residuals should not grow with them; there should be no funneling effect. Second, the residuals should be roughly normally distributed. Third, looking at Cook's distance, there should not be outliers with an outsized effect on the fit. Please take a look at my earlier linear regression video where I explain the residual plots in detail; checking them is necessary whenever you build multiple regression models.

All right, now that we've built the model successfully, let's look at the predicted values. I'm going to create a variable called predicted_values using predict, a built-in function: I pass it the model I built on the training data, and then I pass the new data, which is the testing data set. As soon as I do that, the predictions are ready to look at, and here you go: these are the predictions.

Now what I want to show you is the predictions side by side with the actual values, so I'll add a new column to testD, predicted_medv, set to the predicted values, and then look at testD. You have the actual values and the predicted values next to each other. In some places the prediction is really close, for example the first one, and in some places it's not as close, so
this is what you see when your R-squared is only around 0.56: only about 56% of the variation in the data has been explained, and the rest is not well explained. Most of these look good, but there are rows like 7.4 versus 2.8, or 18.2 versus 26, which are not well predicted. So that's your predicted output.

Anyway, to recap what we covered today: we took a look at the data set; we partitioned it into training and testing sets using simple random sampling with runif, very straightforward; we created the model from the training data; once the model was created we took a detailed look at the summary, the R-squared, the p-values, the F-statistic and all of that; and then we also looked at the predicted values. That's it for today, guys; let me know if you have any questions. Thank you and have a good day.
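Putting the steps from the video together, here is a minimal R sketch. It assumes the Boston data from the MASS package (the video reads the same data from a file instead); the helper names p1, p2, trainD, and testD mirror the ones used above, and set.seed is an addition for reproducibility, so the exact coefficients and predictions will differ from the video's.

```r
# Minimal sketch of the video's workflow, assuming MASS::Boston,
# whose columns (medv, lstat, age) match the ones used above.
library(MASS)
dataIn <- Boston                      # 506 observations, 14 variables

set.seed(42)                          # reproducibility (not in the video)
p1 <- runif(nrow(dataIn))             # one random number per row
p2 <- order(p1)                       # shuffled row indexes

trainD <- dataIn[p2[1:350], ]         # first 350 shuffled rows -> training
testD  <- dataIn[p2[351:506], ]       # remaining 156 rows      -> testing

# Multiple regression: medv explained by lstat and age
Boston_housing_regression <- lm(medv ~ lstat + age, data = trainD)
summary(Boston_housing_regression)    # coefficients, R-squared, F-statistic

plot(trainD$lstat, trainD$medv)       # scatter of the training data
abline(Boston_housing_regression)     # warns: only first 2 coefficients used

# Predict on the held-out test set and compare side by side
testD$predicted_medv <- predict(Boston_housing_regression, newdata = testD)
head(testD[, c("medv", "predicted_medv")])
```

The abline call draws the line from the intercept and the first slope only, which is why R warns about it for a multiple regression, exactly as described in the video.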
Info
Channel: DataExplained
Views: 4,438
Rating: 5 out of 5
Keywords: Linear regression using R, regression analysis using r, random sampling, random sampling in r, predictive analysis, linear regression model, linear regression machine learning, linear regression explained, linear regression equation, y=mx+c, r programming
Id: _ymR-FFG44c
Channel Id: undefined
Length: 25min 28sec (1528 seconds)
Published: Sun Jun 21 2020