R Stats: Multiple Regression - Variable Selection

Video Statistics and Information

Captions
Hi, welcome to R Stats. I'm Jakob Cebulski. In this series of videos I'm going to introduce you to some predictive statistical models, such as naive Bayes and regression, with plenty of data visualization.

Welcome back to this series of lessons about linear regression. This time I'm going to talk about multiple linear regression. We're going to use the same data as in the simple linear regression lesson: a dataset about automobiles from the UCI repository. We're going to try to predict price based on the car specifications. I'm going to rely on three packages: Hmisc for eliminating missing values, psych for plotting correlation charts, and car for a variety of additional regression-related functions.

Okay, let's load all three packages, go to the working directory, read the data, and quickly check the summary. It shows that, among the many variables describing each of the cars, there are missing values in many variables. We have variables which could be used for prediction, such as normalized losses, a variety of factor (categorical) variables describing the cars, and lots of numerical variables, such as the car's width, height and weight, the number of cylinders (which is a factor), engine size, and others. One of them is price, which is what we'll be trying to predict.

The first thing in regression is to eliminate missing values; regression cannot deal with them. We'll take a very simplistic approach to eliminating missing values: substituting them with the mean or the median, depending on whether the variable is a continuous numeric variable or a categorical variable. A quick check confirms that no more NAs are listed against any of the variables.

Next, I'm going to pick a few potential candidate variables which we could hypothesize to be good price predictors. I picked the following: horsepower, miles per gallon in the city, peak rotations per minute, curb weight, number of doors, and price, which is our target.
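The video uses Hmisc's impute() on the UCI automobile data, which is not bundled with R. As a minimal base-R sketch of the same simplistic idea, on a toy frame with illustrative column names (here using the median for numeric columns and the most frequent level for factors, a choice of this sketch):

```r
# Simplistic missing-value handling on a toy stand-in for the
# UCI automobile data (column names are illustrative only).
cars_df <- data.frame(
  horsepower = c(111, 102, NA, 115, 110),
  num_doors  = factor(c("two", "four", "four", NA, "two")),
  price      = c(13495, 16500, 16500, NA, 17450)
)

impute_simple <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)    # median for numeric columns
  } else {
    x[is.na(x)] <- names(which.max(table(x))) # most frequent level for factors
  }
  x
}

cars_df[] <- lapply(cars_df, impute_simple)
sum(is.na(cars_df))   # 0: no NAs remain
```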
Some of these variables, like horsepower and price, are skewed, and normally we should be dealing with that. If you want a good regression model, all variables should have a normal distribution, and they should be independent, i.e. not correlated. Horsepower and curb weight clearly have a high correlation, and city miles per gallon and weight are negatively correlated, so we should be dealing with that as well. We should also be looking at extreme values, the observations which have a huge error relative to the model. And once the variables are plotted against the model, the residuals should have an even distribution. We're not dealing with any of this at this stage: the purpose of this lesson is to show how to construct the model by selecting variables suitable for it, and future lessons will show a more sophisticated approach.

As is the case in all modelling, whether in R or any other software, it is important to have training of the model, validation, and possibly testing of the model. So I'm going to split the data into a training set and a validation set: about 80% of the data will be used for training and 20% for validation. Some people argue that a regression model can be assessed internally, based on the properties of that model, to indicate its quality, and we will rely on those quality measures; however, at the end I still want to test the model on the data provided. Ideally this would be cross-validation style testing of the model.

We're going to create a simple linear model using multiple variables, and the formula here describes such a model: price is the target on the y-axis, and here are just three variables: horsepower, curb weight and city MPG. We'll be trying to estimate the coefficients of those variables in this equation so that the overall error made in price estimation is minimized.
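The 80/20 split described above can be sketched as follows; the seed and the frame name are choices of this sketch, not from the video:

```r
# 80/20 train/validation split (cars_df stands in for the imputed
# automobile data; the seed is arbitrary but makes the split repeatable).
set.seed(123)
cars_df <- data.frame(horsepower = rnorm(100, 100, 30))
cars_df$price <- 150 * cars_df$horsepower + rnorm(100, 0, 2000)

train_idx <- sample(nrow(cars_df), size = round(0.8 * nrow(cars_df)))
train <- cars_df[train_idx, ]   # 80% used to fit the model
valid <- cars_df[-train_idx, ]  # 20% held out for validation
nrow(train); nrow(valid)        # 80 and 20
```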
So let's create the first linear model. In this first fit, price, the target, is on the left of the tilde operator, followed by a number of candidate variables, fitted on the training sample. Now let's look at the summary of that fit. The linear model remembers how it was created, and the summary provides some simple statistics about the residuals, that is, the errors we're trying to minimize in this model: the minimum, the maximum, the median, and the first and third quartiles. Then we have a collection of estimates for the formula: the estimate for the intercept (the b0), and the coefficients for horsepower, city MPG, peak RPM, curb weight, and number of doors.

Each of those values is estimated with some degree of confidence: a probability of whether or not we can trust the value. Three stars mean that the probability that we should reject this value is quite low, so we should keep it. However, some of the variables have coefficients with a high probability of being wrong, which probably means we should reject them.

We also have the R-squared measure, which tells us how well the model explains the non-random variation in the data; this model explains around 73 percent of it. Adjusted R-squared is a very similar measure, but it is usually a bit lower because it takes into consideration the number of variables used to build the model: the more variables we use, the higher R-squared will be, and adjusted R-squared always lowers it to correct for this.

So we're going to eliminate the worst variable, the variable we trust the least; that is backward elimination of variables, one at a time. We eliminated number of doors, and looking at the fit again, R-squared is almost identical and adjusted R-squared slightly improved, although the change is almost insignificant. But we eliminated a variable which does not contribute much to the final formula.
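The fit and the first elimination step can be sketched as below; synthetic data stands in for the automobile set, and the variable names are assumptions about how the columns are labelled:

```r
# Fit the multiple linear model, inspect the summary, then refit
# without the least trustworthy predictor (backward elimination).
set.seed(1)
n <- 160
train <- data.frame(
  horsepower  = rnorm(n, 100, 30),
  city_mpg    = rnorm(n, 25, 6),
  peak_rpm    = rnorm(n, 5100, 400),
  curb_weight = rnorm(n, 2500, 500),
  num_doors   = factor(sample(c("two", "four"), n, replace = TRUE))
)
train$price <- 100 * train$horsepower + 4 * train$curb_weight +
  rnorm(n, 0, 2000)

fit <- lm(price ~ horsepower + city_mpg + peak_rpm + curb_weight + num_doors,
          data = train)
summary(fit)   # coefficient estimates, significance stars, R-squared

# Drop the weakest variable (number of doors in the video) and compare.
fit2 <- update(fit, . ~ . - num_doors)
summary(fit2)$adj.r.squared
```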
By eliminating that variable, the model did not get any worse. The next potential candidate for elimination is peak RPM, so let's eliminate this variable and again look at the summary of the fitted model. R-squared actually got slightly worse and adjusted R-squared slightly better, but the changes are almost insignificant: the model performs theoretically as well as before, but with fewer variables it's a simpler model. The next candidate for elimination is city MPG. The model now has only two variables; multiple R-squared got a bit lower, though not much, and adjusted R-squared is insignificantly lower. The model is now such that we can trust every single variable used to build it, which is a good thing.

Now let's do a few plots of the model we arrived at. The first one considers the range of prices of those vehicles; zero here signifies the linear model itself, which is a line in the case of a two-dimensional model, or a hyperplane when we have multiple variables. It shows that some of the data points are just above this plane and some below it, and if we fit a curve to the distribution of the points around the model, we can see that it is nonlinear. So the model has a small degree of non-linearity, most likely because we didn't normalize the data in the first place. It also shows about three data points which are very far away from the model, so the model does not capture them. If we considered them against the mean of the sample, we would call these observations outliers; but since they are judged in relation to the model, we call them extreme values, or extreme cases. Normally we would be deleting them as well, and the model would improve after that. Let's look at the next chart.
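The four charts discussed in this lesson correspond to R's standard diagnostic plots for a fitted lm object; a sketch, using a built-in dataset as a stand-in for the automobile model:

```r
# R's standard lm diagnostics; `fit` stands in for the model in the video.
fit <- lm(dist ~ speed, data = cars)   # built-in `cars` data as a stand-in

plot(fit, which = 1)   # residuals vs fitted values (non-linearity check)
plot(fit, which = 2)   # normal Q-Q plot of the residuals
plot(fit, which = 3)   # scale-location: spread of residuals across the fit
plot(fit, which = 4)   # Cook's distances (influential extreme values)

# The most influential observations can also be listed directly:
head(sort(cooks.distance(fit), decreasing = TRUE), 3)
```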
This next chart compares the observed residuals against their theoretical values under a normal distribution, and it shows that, yep, we have a bit of a problem at both ends of the spectrum; ideally, all of the observations should go nicely and neatly along the line, which signifies the model itself. The third chart again shows the distribution of residuals around the linear model, in relation to the price of the vehicles, and it basically shows that the higher the price, the fewer observations we have and the greater the scatter of observations around the model; in the medium and lower price range they follow the line very densely. Finally, the last chart shows the so-called Cook's distances. It identifies the extreme values and shows that some of them have a huge impact on the model: the construction of the model is hugely influenced by those extreme values. This means that if we eliminated observation 167, most likely the model would be more linear. That is one of the things we will be doing in the next lesson: watching those charts, watching extreme values, and watching the way the errors are distributed around the model.

Okay, now we have a model fit that's about as good as it gets; let's see whether it can predict anything, and to what degree. Normally, we measure the model first against the data it was trained on, and then against data it has never seen during model construction. So I'm going to predict prices using the model and the data we used while constructing it, and I actually added an extra variable, the predicted price, to the training sample. I'm going to do the same for the validation sample, which is data the system has never seen before, creating a new predicted-price variable for it too. We've seen this pattern before; I'm interested in the adjusted R-squared values and in how well the predictions track the actual prices.
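Scoring both samples with predict() can be sketched as below; the frames and the model are stand-ins for those in the video:

```r
# Predict prices on the training data and on held-out validation data,
# attaching the predictions as a new column in each frame.
set.seed(7)
cars_df <- data.frame(horsepower = rnorm(100, 100, 30))
cars_df$price <- 150 * cars_df$horsepower + rnorm(100, 0, 2000)
train <- cars_df[1:80, ]
valid <- cars_df[81:100, ]

fit <- lm(price ~ horsepower, data = train)

train$predicted_price <- predict(fit, newdata = train)  # seen data
valid$predicted_price <- predict(fit, newdata = valid)  # unseen data
```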
Let's see whether the predicted price and the real price are in any way correlated, and to what degree. Ideally, for the training sample, the correlation between prediction and price should be roughly of the magnitude suggested by the model itself. So I'm going to compute that, and I'm also going to calculate the root mean square error between the predicted price and the real price for the training sample, and the mean absolute error for the same two vectors. The result basically says that the correlation between the predicted price and the actual price for the training set is 72 percent, which is very close to the theoretical R-squared of the model, so that's good news. The second thing, which is more of a guide, is the pair of values giving the amount of error in price units, i.e. dollars: on average, each estimate was about four thousand dollars away from what it should be, and the more optimistic estimate of that error is around two and a half thousand dollars. This means that, on average, on the set of observations the model has previously seen, we make between two and a half and four thousand dollars of difference when estimating.

Now let's do the same for the data the system has never seen before: correlation, root mean square error and mean absolute error for the validation set. And here are the results: it looks like about 69 percent correlation between the prediction and the price. That's not bad; it's not far from the training set result, the training set being almost like a recall test, since the system has already seen that data. As for the price error, on the validation set the error in the price estimate was on average between $3,200 and $5,000. Considering the price range of these vehicles, that's roughly a 10 percent error in estimation, which is not bad; the system performed well. You can improve those results by normalizing the data, eliminating extreme values, and generally satisfying all the requirements for constructing a linear model.
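The three quality measures used here, correlation, RMSE and MAE, can be sketched as below on a stand-in model (names and data are illustrative; by construction, MAE is never larger than RMSE, which is why it reads as the "more optimistic" estimate):

```r
# Correlation, root mean square error and mean absolute error
# between predicted and actual price on a stand-in training set.
set.seed(7)
train <- data.frame(horsepower = rnorm(80, 100, 30))
train$price <- 150 * train$horsepower + rnorm(80, 0, 2000)
fit <- lm(price ~ horsepower, data = train)
train$predicted_price <- predict(fit, newdata = train)

correlation <- cor(train$predicted_price, train$price)
rmse <- sqrt(mean((train$predicted_price - train$price)^2))
mae  <- mean(abs(train$predicted_price - train$price))
c(correlation = correlation, rmse = rmse, mae = mae)
```

The same three lines, applied to the validation frame instead of the training frame, give the held-out figures discussed above.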
In other words, by doing everything that we are asked to do as requirements for the construction of a linear model. So thank you, and come back for the next lesson, where we're going to build a much more complex model, doing the right things while creating it. Thank you.
Info
Channel: ironfrown
Views: 83,210
Rating: 4.9286871 out of 5
Keywords: Data Analytics, Data Visualization, R Statistical Software, Multiple Regression, Stepwise Regression, Backward Elimination, Model Performance
Id: HP3RhjLhRjY
Length: 18min 46sec (1126 seconds)
Published: Mon Apr 18 2016