SAS - Multiple Linear Regression

Video Statistics and Information

Captions
Hello YouTube. This video will discuss multiple linear regression. First, what is the difference between multiple linear regression and simple linear regression? In simple linear regression we are trying to predict y with a single predictor, and we check whether x is useful in predicting the value of y. In multiple linear regression we have multiple predictors, so now we try to use x1, x2, and x3 to predict the value of y. When we finish the analysis we should be able to say, for example, that x1 was useful but x2 and x3 were not, or that none of them are useful, or that all of them are.

A useful first step is to create a scatter plot matrix. To do this in SAS it's just proc sgscatter; on the next line it's matrix followed by whatever variables you want included. I want all of them, so I'll list y x1 x2 x3 and run. Now that we've created the scatter plot matrix, what we're looking for are linear relationships between y and each of x1, x2, and x3. Starting with y and x1, we look at the two corresponding scatter plots, and yes, it does look like there is a linear relationship between y and x1, so that's good. Looking at x2 and y, it does not look like there's a linear relationship between them. So going off the scatter plot matrix alone, we might say that x1 will be useful in predicting y but x2 will not. Similarly, it does not look like there's a linear relationship between y and x3.

Let's also quickly talk about how to read an individual scatter plot inside the matrix: what exactly is on the y axis and what is on the x axis? Whatever variable heads the column is what's on your x axis, so if x1 is in the column, x1 makes up the x axis.
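The scatter plot matrix step described above might be sketched as follows, assuming the data set is named demo (the name used later in the video):

```sas
/* Scatter plot matrix of y against each candidate predictor */
proc sgscatter data=demo;
    matrix y x1 x2 x3;
run;
```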
Because x1 is in the column here, x1 makes up the x axis, and we can see this easily because we also have the labels 1 through 7: if we go back to our data and look at x1, those are exactly the values in that column. So for this scatter plot, x1 is on the x axis and y is on the y axis. When we go to the mirrored scatter plot it's reversed: since y is in the column, y is on your x axis and x1 is on your y axis.

Judging from the scatter plot matrix, then, x1 was the only predictor that showed a linear relationship with y. Now let's use some actual statistical tests. The first is the overall F test, which determines whether at least one of the variables is useful in predicting y. The code is just proc reg, like in simple linear regression, and we can specify our data set if we want; on the next line it's model y = x1 x2 x3, then run. When you run this you get a long list of results, but if you scroll up you should find the ANOVA table we used to analyze simple linear regression, and it should look very familiar.

Before we analyze it, let's go back and talk about the model used to represent this scenario. Given that we have three x variables, our model is y-hat (the estimated regression line) = beta0 + beta1*x1 + beta2*x2 + beta3*x3. The numbers are subscripts, so beta0 really means beta sub 0, which is our intercept; beta1 is the parameter for x1 multiplied by whatever the x1 value is, then the parameter for x2, and so on. So now that we have the model, what does the ANOVA table tell us? The first thing we can actually pull from this printout is the parameter estimates.
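A minimal sketch of the regression fit described above, again assuming the data set is named demo:

```sas
/* Fit y = b0 + b1*x1 + b2*x2 + b3*x3.
   The output includes the ANOVA table (overall F test),
   R-square / adjusted R-square, the parameter estimates,
   and the individual t-tests. */
proc reg data=demo;
    model y = x1 x2 x3;
run;
```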
If you just wanted an estimated regression line and you definitely wanted to use all of the x variables, here it is: y-hat = -0.06 + 3.29*x1 + 0.315*x2 - 0.55*x3. That would be your estimated regression line if you know you want to keep every x.

While we're here, we should talk about the R-squared values. In multiple linear regression we want to focus on the adjusted R-square, which here is 0.9556. That's very good, and it implies that about 95.6% of the variation in y is explained by x1, x2, and x3. Why focus on adjusted R-square rather than plain R-square? Each predictor comes at the cost of a degree of freedom: x1, x2, and x3 (not counting the intercept) give the model three degrees of freedom, which we can see in the ANOVA table. Say we added a fourth predictor, x4. Plain R-square would probably go up, but that doesn't take into account that we're adding another degree of freedom; with x4 the model would have four, and adjusted R-square takes that fourth degree of freedom into account. So if x4 really isn't a good predictor, adjusted R-square could actually go down.

All right, now we're ready to analyze the overall F test, so let's look at the null and alternate hypotheses. The purpose is to determine whether at least one x is useful in predicting the value of y. Our null hypothesis is that they are all utterly useless: beta1 = beta2 = beta3 = 0, meaning none of the x variables is useful in predicting y. Notice that I didn't include the intercept beta0 here; we're only testing the parameters on the x variables.
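The adjustment the video alludes to can be written out explicitly. With n observations and p predictors (here p = 3), a standard form is:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```

Adding a predictor raises R-square but also raises p, so adjusted R-square only increases when the new predictor improves the fit by more than the cost of the lost degree of freedom.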
Our alternate hypothesis is that at least one of them is useful. So if we reject the null, we know at least one is useful in predicting y, but we don't necessarily know which one it is. Determining whether to reject the null is a very simple process. I'm going to assume alpha = 0.05 for all of these tests, and we have a p-value of 0.005, which means our p-value is less than alpha, so we can go ahead and reject the null hypothesis. The conclusion is that the alternate hypothesis holds: at least one of beta1, beta2, beta3 is not zero and is useful in predicting y.

Now that we know at least one of those betas is useful, let's go back to our results and talk about the individual t-tests. In simple linear regression we talked about the F test and the t-test, and I said they were the same; that's true in simple linear regression only. In multiple linear regression, the individual t-test tests whether a predictor is useful given that everything else is included in the model. Take x1 for example: the null hypothesis is beta1 = 0, meaning x1 is useless, and the alternative hypothesis is that it is not useless, i.e. it is useful in predicting y. If we look at the p-value, it is very small, less than our alpha of 0.05, so we would say that with everything else included in this model, x1 is still useful, or beta1 should be included in the model. We can run the exact same test for x2 and x3; all we have to do is change the hypotheses. For x2, the null hypothesis is beta2 = 0, and analyzing its p-value we see it is larger than alpha, which means we fail to reject the null.
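The two sets of hypotheses discussed above can be written compactly:

```latex
\text{Overall } F:\quad H_0:\ \beta_1 = \beta_2 = \beta_3 = 0
\quad\text{vs.}\quad H_a:\ \text{at least one } \beta_j \neq 0
\\[6pt]
\text{Individual } t \text{ (for predictor } x_j\text{)}:\quad
H_0:\ \beta_j = 0 \quad\text{vs.}\quad H_a:\ \beta_j \neq 0
```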
With everything else remaining in the model, we conclude beta2 = 0; that is, beta2 is not useful. We can do the same thing for x3: it also has a p-value larger than alpha, so we conclude that beta3 is not useful in predicting y.

So we've covered the overall F test, which can tell us whether all the predictors are useless, and the individual t-test, which can tell us whether an individual predictor is not useful. But what happens if we want to test just two of them at the same time? Say we have a hunch that beta2 and beta3 are both not useful in the model. How do we test that beta2 = 0 and beta3 = 0 simultaneously, rather than running individual t-tests that each assume everything else is included? For this we can do a partial F test, which is very similar to the overall F test except that now we omit beta1. Our null hypothesis is beta2 = beta3 = 0, and our alternate hypothesis is that at least one of them is not zero.

Now let's actually perform this test. It's very similar to the overall F test, so I'll copy and paste that code and keep the same model, but add a test statement named test1, listing whatever we want to equal zero: test1: test x2 = 0, x3 = 0; and then a semicolon. So we're saying this is test1, and we want to test whether those two parameters are zero. Running this gives a nice little printout, not nearly as big as the overall F test output, and the analysis is really easy: the p-value here is 0.2758, so assuming our alpha is still 0.05, we fail to reject the null hypothesis. Remember, the null hypothesis was that beta2 and beta3 are equal to zero, so failing to reject means we can proceed as if they are both zero.
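The partial F test as described might look like this in SAS, using the data set demo and the label test1 from the video:

```sas
/* Partial F test: jointly test whether beta2 and beta3 are both zero,
   while beta1 stays in the model */
proc reg data=demo;
    model y = x1 x2 x3;
    test1: test x2 = 0, x3 = 0;
run;
```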
Since we assume both are zero, the ultimate model we could actually end up using for this data just omits beta2 and beta3, and we can use that model as our estimated regression line.

Now I just want to cover a few more things, the first being confidence intervals for the predictors. By default alpha is set to 0.05; if you want to change it, you can set it in the proc reg statement, say to 0.1 to get 90% confidence intervals for all of the parameters. Again, we'll leave all the predictors in, even though we just showed that beta2 and beta3 are useless. Adding /clb to the model statement gives us the confidence intervals for our three predictors; after running and scrolling up, the only thing we've added to the output is the 90% confidence limits, as desired.

The last thing I want to cover is how to predict the value of y from these three predictors in SAS. Let's say we know x1 will be 4, x2 will be 5, and x3 will be 6, and this combination isn't anywhere in our data set, so that's good. Using 4, 5, and 6 we want to predict the value of y, and furthermore we want a 90% confidence interval for that prediction. How do we get that in SAS? The first thing we need to do is manipulate our data a bit: we're going to add the values 4, 5, and 6 for x1, x2, and x3, and in the place of y we put a lone decimal point, which is a missing value. To add those values to the data set we're actually first going to create a whole second data set. Could I just add an extra observation . 4 5 6 directly? Yes, I certainly could, and that would be the much shorter way of doing it, but in case you're reading from a text file or from Excel and don't have that luxury, we'll do it this way.
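The confidence-interval step might be sketched as follows, assuming the data set demo:

```sas
/* 90% confidence limits (clb) for the parameter estimates;
   alpha=0.1 overrides the default of 0.05 */
proc reg data=demo alpha=0.1;
    model y = x1 x2 x3 / clb;
run;
```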
First we'll create a whole second data set, data demo2, and specify the values: y = . (the decimal point), then x1 = 4, x2 = 5, and x3 = 6. Now let's just print it so we can verify; there should be one observation with those values. It looks good: observation one has y missing, then 4, 5, 6.

Now that we have two data sets, demo and demo2, how do we go about combining them? What I'm going to do is really just add demo2 into demo, so we'll still be using the original data set name. To do this I go back to data demo and then say set demo demo2. This doesn't erase everything that's in demo; it just takes whatever's in demo2 and adds it to the original demo data set. Let's print this out and run: everything should be the same except we should have the added observation 4, 5, 6 with the decimal point for y. Everything looks good.

Now that our data is set up and ready to roll, let's get a confidence interval for y, specifically a 90% confidence interval for the prediction of y given that x1 = 4, x2 = 5, and x3 = 6. We used clb before to get the confidence intervals for the betas; now we're going to use cli with the same model. Going to our results, here's the decimal point we put in for y, which means the corresponding x values are 4, 5, and 6, and here is our 90% confidence limit for the prediction of y. So we can say with 90% confidence that when those x values are 4, 5, and 6, y will be between 6.84 and 15.91. Now that we have our 90% confidence interval for a prediction, what if we want a 90% confidence interval for a mean? We can get that too, with a very similar procedure.
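The data-step manipulation just described might look like this, with demo and demo2 as in the video:

```sas
/* One-observation data set holding the new x values; y is
   missing (.) so the observation is scored but not used in fitting */
data demo2;
    y = .; x1 = 4; x2 = 5; x3 = 6;
run;

/* Append demo2 onto the original demo data set */
data demo;
    set demo demo2;
run;

/* 90% prediction interval (cli) for y at each observation,
   including the new one */
proc reg data=demo alpha=0.1;
    model y = x1 x2 x3 / cli;
run;
```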
We go back to the model statement where we put cli and add clm to it; we could replace cli with clm, but we'll just add it for now. Let's go ahead and run and scroll up: we should still have the 90 percent confidence limits for a prediction, which we do, and now we have the 90 percent confidence limits for the mean as well. The mean should be easier to predict, so its interval should be narrower, and it is: we see that the 90 percent confidence limit for the mean is between 7.9 and 14.8.

Okay, that's all I want to cover in multiple linear regression; I hope this helped. One procedure that I did not go over is proc glm, the general linear model procedure, which you might also find very useful in your study of multiple linear regression. Thanks for watching.
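The final step, requesting the mean interval alongside the prediction interval, might be sketched as:

```sas
/* 90% intervals for both the individual prediction (cli)
   and the mean response (clm) at each observation */
proc reg data=demo alpha=0.1;
    model y = x1 x2 x3 / cli clm;
run;
```

The clm interval is narrower than the cli interval because it covers only the uncertainty in the estimated mean, not the additional scatter of an individual observation around that mean.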
Info
Channel: Krohn - Education
Views: 25,889
Rating: 4.9473686 out of 5
Keywords: statistics, programming, sas
Id: _ojaa1dTHWI
Length: 16min 8sec (968 seconds)
Published: Wed Nov 09 2016