Video 6: Variable Selection

Video Statistics and Information

Captions
Welcome to this video on multiple linear regression. This video is about variable selection, and we will start by discussing the problem at hand. When we were performing simple linear regression, where we only had one X, we were trying to explain the variance in the dependent variable using a single independent variable, in this case x1, and it was very easy to assess whether that variable was worth having in the model or not, because essentially it was our only variable. In the context of multiple linear regression, however, we have many independent variables, and now it is hard to tell which of these variables are really worth having in our models. After analyzing the performance of each variable, we could decide that some of them are simply not worth keeping in the model because they are not helping us explain the variance in the dependent variable.

There are two potential strategies we can use to approach the issue of selecting variables. The first, which is fairly intuitive, is to check whether each individual independent variable has a statistically significant relationship with the dependent variable, the Y; basically, we check whether each individual X, be its correlation positive or negative, has a strong relationship with the Y in the model. Our second strategy focuses instead on the overall performance of the model, and tests how much each individual variable contributes to the model's performance in explaining the variance and to the model's ability to forecast unobserved values of the dependent variable. In this video we will discuss both strategies.

Let's now describe the data we will use for the example. The data is usually available in a CSV file called Auto.csv, and it has data on vehicles' miles per gallon. In fact, miles per gallon, or mpg, is going to be the continuous dependent variable in all our models. We then have a series of variables that could explain a car's performance: the number of cylinders, which is a discrete variable; displacement, a feature of the engine that quantifies how much volume is displaced by the pistons, a continuous variable measured in cubic inches; horsepower, a continuous measure traditionally used to indicate how strong or potent an engine is; the weight of the entire vehicle, measured in pounds, also continuous; acceleration, measured as the number of seconds the car needs to reach a certain speed; model year, the year in which the car was manufactured; and finally origin, a discrete integer that serves as an ID of where the car comes from.

Let's discuss our first strategy in more depth. As we mentioned, it is very popular and somewhat intuitive. The key point of this strategy is to discard any independent variable that does not have a statistically significant relationship with the dependent variable. The next question is how we determine whether there is a statistically significant relationship, and for this we must first decide what confidence level we need in order to establish that there is one. For example, if we decide on a 5% level, then we can use a statistic such as the p-value and say that if a variable's p-value is higher than 5%, we will discard it. The end result of this approach is a "clean" model, quote unquote, where all the variables have statistically significant coefficients at the chosen level, here 5%.
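As a rough illustration of this first pass, here is a minimal sketch in Python with pandas and statsmodels. The video itself does not show code, so the library choice, the exact column names (for instance year and origin), and the assumption that Auto.csv parses cleanly into numeric columns are assumptions of this sketch, not something taken from the video.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Column names are assumed from the description above; the real file may differ.
auto = pd.read_csv("Auto.csv")

# Regress mpg on every candidate predictor at once.
full_model = smf.ols(
    "mpg ~ cylinders + displacement + horsepower + weight"
    " + acceleration + year + origin",
    data=auto,
).fit()

# Each coefficient's p-value, sorted so the least significant variables come first.
print(full_model.pvalues.sort_values(ascending=False))
print(full_model.summary())
```

The pvalues attribute holds the p-value of each coefficient, which is exactly what the first strategy compares against the chosen 5% threshold.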
So now let's throw all the variables we have at our disposal into the model, with mpg, miles per gallon, as the dependent variable. The regression output shows that some variables have statistically significant relationships, but some do not: in particular, cylinders, horsepower, and acceleration do not have small p-values; their p-values are roughly 12%, 21%, and 41%. If we had to choose which ones to remove, we would remove all three. However, we don't want to remove all three at once, for reasons we will discuss in class; rather, we start by removing a single one, in this case the one with the highest p-value. So we run the model again, this time removing acceleration.

I am now showing you the model with all the previous variables except acceleration. Note that, quite interestingly, horsepower, which previously had a very high p-value, now has a low p-value of 2.8%, so we realize that this variable might actually be worth keeping in our model. Meanwhile, cylinders is still insignificant, with a p-value greater than 11%. We now rerun the model with cylinders removed as well. Once we remove cylinders, we have a model where all the variables have p-values below 5%, so we can say that each individual variable on its own has a statistically significant relationship with the dependent variable. If we follow the first strategy, this is the best model we can reach.
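The step-by-step removal described above, dropping one variable at a time starting with the highest p-value, is usually called backward elimination. A minimal sketch of that loop, under the same assumptions about Auto.csv and its column names as before:

```python
import pandas as pd
import statsmodels.formula.api as smf

auto = pd.read_csv("Auto.csv")

predictors = ["cylinders", "displacement", "horsepower", "weight",
              "acceleration", "year", "origin"]
threshold = 0.05  # the 5% level chosen earlier

while True:
    fit = smf.ols("mpg ~ " + " + ".join(predictors), data=auto).fit()
    pvalues = fit.pvalues.drop("Intercept")   # ignore the intercept term
    worst = pvalues.idxmax()                  # least significant remaining variable
    if pvalues[worst] <= threshold:
        break                                 # everything left is significant
    predictors.remove(worst)                  # drop it and refit

print(f"kept: {predictors}")
print(fit.summary())
```

On data behaving as described in the video, a loop like this should reproduce the sequence above: acceleration is dropped first, then cylinders.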
However, there are several problems with this approach, which we discuss next. The underlying problem with this strategy is that we are being too strict, and somewhat arbitrary, in choosing which variables stay in our models. Remember that it was us who chose a 5% level to decide whether a variable stayed or not. If you think about it objectively, there really isn't much difference between a variable with a p-value of 5.1% and one with a p-value of 4.9%; such a small difference could be due to random factors that we do not observe, so using such a strict criterion may not be beneficial. Moreover, by dropping variables we might be losing information that could be useful in explaining the variance of the dependent variable, and that we could eventually use to forecast unobserved values of it.

Remember that regression has two objectives. The first is to establish that there is a relationship between variables and, in particular, whether that relationship is statistically significant; this coincides with our first strategy, where we only keep variables with statistically significant relationships. The second objective is to forecast new observations, using what we know about the variance in the X's and its relationship to the variance of the Y's. So could we develop a strategy that focuses instead on the model's ability to accurately forecast unobserved values of Y?

An initial approach would be to focus on the R-squared. Remember that the R-squared is a measure of model fit: it captures the proportion of the variation of the dependent variable that is explained by the variation of the observed independent variables, the X's, and their coefficients, the betas. In other words, we could assume that the higher the R-squared, the better our model will be at forecasting unobserved values. But is this correct? I'll leave you a couple of seconds to think about whether this is correct.

There is a big problem in relying solely on the R-squared to establish whether our model is good at explaining the variance of the dependent variable: regardless of how useful or useless a variable is, the R-squared can only increase as we keep adding variables to our model. In other words, if we relied only on the R-squared, we could find ourselves throwing every variable we find into the model, since in the worst case the R-squared will still increase, if only minimally. So we want to fix this metric in some way.

In thinking about how to improve the R-squared as our measure, recall that the R-squared grows whenever we add a variable that is relevant in explaining the variance of Y, but it grows only minimally, only a little bit, when we add a variable that is not really that relevant. What we would like is an adjusted version of the R-squared that only grows when a new X is genuinely meaningful in explaining the variance of Y, and that even penalizes us when we add useless variables. That is exactly the intuition behind the adjusted R-squared, which we generally denote as an R-squared with a bar on top. The adjusted R-squared is a metric that grows with model fit, so it grows as the R-squared grows, but at the same time it decreases as the number of variables increases; we can think of it as a function of both the R-squared and the number of variables in the model. When working with the adjusted R-squared, adding a new variable can even decrease its value, to the point where the adjusted R-squared becomes negative, because of the penalty for adding useless variables. The adjusted R-squared grows when we add a new variable only if that variable contributes to explaining the dependent variable more than what could happen by purely random factors. So if some variable has nothing to do with the dependent variable but adding it increases the R-squared just by coincidence, by chance, by random factors, the adjusted R-squared will detect this and penalize us.

Let's take another look at our models. Our first model used all the variables; it has an R-squared of 0.8214 and an adjusted R-squared, which is lower, of 0.8182. The adjusted R-squared will always be lower than the standard R-squared. Let me note the adjusted R-squared on the right-hand side of your screen so we can compare it with the adjusted R-squareds of the other models. Following the first strategy, we then removed acceleration, the variable with the least significant relationship with the dependent variable, mpg. We see a very minor improvement: the new adjusted R-squared, rounded, is 0.8184. So both by the first strategy and by the second one, this is the better model; it has a higher adjusted R-squared. The next step we had taken was to remove cylinders, the remaining variable without a statistically significant relationship with the dependent variable. Let's see what happens: in this case the adjusted R-squared is 0.8177, which is actually lower than the previous result. This means that having cylinders in our model wasn't so bad after all; cylinders, even though on its own it does not appear to have a statistically significant relationship with a car's miles per gallon, does contribute to the
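For reference, the penalty described above is built into the standard formula: adjusted R-squared = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors, so a variable that barely raises R-squared can lower the adjusted value. A minimal sketch comparing the three models from the example, under the same assumptions about Auto.csv and column names as before (statsmodels reports this quantity as rsquared_adj):

```python
import pandas as pd
import statsmodels.formula.api as smf

auto = pd.read_csv("Auto.csv")

models = {
    "all variables":
        "mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin",
    "minus acceleration":
        "mpg ~ cylinders + displacement + horsepower + weight + year + origin",
    "minus acceleration, cylinders":
        "mpg ~ displacement + horsepower + weight + year + origin",
}

for name, formula in models.items():
    fit = smf.ols(formula, data=auto).fit()
    # rsquared_adj applies the 1 - (1 - R^2)(n - 1)/(n - p - 1) penalty for us.
    print(f"{name:32s} R^2 = {fit.rsquared:.4f}  adj. R^2 = {fit.rsquared_adj:.4f}")
```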
model's performance in explaining the variance of the dependent variable. We then go back to the second model and, since it has the highest adjusted R-squared, choose it as the best model we have found so far.

This video's objective was to explain the intuition behind the adjusted R-squared and why you should use the adjusted R-squared to choose which variables go into your models. However, there is still a lot of room for improvement in what we have done so far. In particular, this example was developed in a very clumsy manner, without paying attention to several important aspects. For instance, when we ran these models we did not even check whether the different variables we used were normally distributed; some of them may have a strong positive skew, and we would need to transform them before we can appropriately use them in our linear regression model. Similarly, we did not check whether the relationships are linear; it could be that a car's mpg does not increase proportionally with a particular independent variable, and for many of the variables a nonlinear relationship might be more appropriate for explaining the variance of the dependent variable. Finally, we did not check whether all the variables can really be interpreted as numbers. In particular, if you recall, when we described the variable origin we said that it was a numerical variable holding an ID for the origin of the vehicle: it takes the values 1, 2, and 3 to indicate, respectively, whether the car comes from the US, Europe, or Japan. In our models we treated it as a regular continuous variable, which in turn gives us erroneous results.

So your job is now to fix this. I encourage you to use a visualization tool, such as Tableau, to understand what the relationships look like. Once you understand the relationships, you should develop the appropriate models, and once you have the models, you should use the adjusted R-squared and choose the model with the highest adjusted R-squared as your best model. Thank you very much.
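As a starting point for that exercise, one way to encode origin as a categorical label rather than a number is sketched below. C(origin) is the statsmodels formula syntax for expanding a column into per-category indicator terms; the particular set of other predictors here is only an illustration, not the model the video asks you to build.

```python
import pandas as pd
import statsmodels.formula.api as smf

auto = pd.read_csv("Auto.csv")

# C(origin) treats the 1/2/3 country codes (US, Europe, Japan) as categories
# instead of a quantity, so each region gets its own coefficient relative to
# the baseline category.
fit = smf.ols(
    "mpg ~ displacement + horsepower + weight + year + C(origin)",
    data=auto,
).fit()

print(fit.summary())
print(f"adjusted R^2 = {fit.rsquared_adj:.4f}")
```

Encoded this way, the model estimates a separate shift for Europe and Japan relative to the US baseline instead of forcing the 1-2-3 codes to behave like a measured quantity.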
Info
Channel: dataminingincae
Views: 61,218
Rating: 4.9497647 out of 5
Keywords: multiple linear regression, Statistics (Field Of Study), variable selection
Id: tPykSMHpgHw
Length: 13min 47sec (827 seconds)
Published: Sun Sep 14 2014