Tutorial: Multiple Regression

Captions
A tutorial on multiple regression. We'll start by checking the linear relationships between the predictor variables.

We have five variables in our dataset: our ID variable, three quizzes, and the final grade in a stats course. We're interested in examining whether the three quizzes predict the final grades in the course, and in order to do this we need to run a multiple regression. So we have three predictor variables and one outcome variable: three IVs, one DV. Before we can actually run that analysis we need to check our assumptions. We can assume at this point that the data has already been cleaned and screened for outliers, impossible values, and response sets, so we'll just focus on the assumptions related to the analysis specifically.

The first assumption that we're going to check is linearity. What we're checking is that there's a linear relationship between the quizzes and the final grade in the course. To do this we're going to create a scatter matrix, which is essentially a whole bunch of scatter plots at once, to look at all these different relationships visually. So we'll go down here to scatter plot, and then instead of clicking on this one here we're going to click on the matrix down here and drag it up. What this will give us is a series of scatter plots representing all the relationships we choose. We're going to put all four variables in here. I don't think we can move them all at once, so we'll do them one at a time: quiz 1, quiz 2, quiz 3, and final grade. Once everything's in here we can move on, so we click OK and now we have our scatter matrix. This is a lot like a correlation table, where essentially this top part is going to be the mirror image of this part, except that it flips the x and y axes, so essentially it's the inverse of what's up here. So we'll just focus on the top, and essentially we're looking for any unusual
patterns that suggest a nonlinear relationship. A cloud is fine. A line like this is fine; it just means there's a strong linear relationship there. However, if we were to see something like a V, an inverted V, a U, or something that took an S shape, that would be a problem. As long as we can be pretty confident that it's not nonlinear, we're good to proceed. In this case we're interested in stats grade with each of the quizzes, so we can just look at quiz 1 with stats grade, quiz 2 with stats grade, and quiz 3 with stats grade. We have linear relationships in all three, so we're good to go with our linearity.

Next we'll examine multicollinearity between predictor variables.

The next thing we need to check is our assumption about multicollinearity. Although we're looking for a linear relationship between each of the IVs and the DV, we do want to make sure that none of these variables are too closely related to each other; otherwise it would be impossible to distinguish what the contribution of each of these variables was independently on the dependent variable. So we do need the variables to be a little bit different from each other. To check for this, again we're going to go to Analyze, and this time we're going to go to the Regression tab and select Linear. It's not a long process, but we need to run this analysis for each of the independent variables that we're looking at. In our case we have three of them, quiz 1, 2, and 3, so we're going to do this three times. What we do first is assign one of them randomly as the dependent variable, and the other two become the independent variables. Then we go up here into Statistics, unselect Estimates, unselect Model fit, and select Collinearity diagnostics. What this is going to do is look at the relationship between quiz 2 and quiz 1, and quiz 3 and quiz 1, and make sure that there aren't any issues. So we click OK, and the
only table we need to worry about is the table right here. What we're looking for in the tolerance statistic is that the values are higher than 0.2. We have 0.778 for both, so we have no issues here; both are above 0.2. For the VIF, we want the value to be no higher than 3 ideally; 5 is kind of the borderline, and 10 is the absolute max. So if you're between 5 and 10 there's a bit of an issue, and over 10 is definitely an issue. Between 3 and 5 isn't great, but under 3 is perfect, so in this case we are good. Now we need to repeat this two more times for the other two variables. We go Analyze, Regression, Linear; this time we'll take out quiz 1 and put it down here, then take quiz 2 and put it up here, and rerun the same thing. Again, over 0.2, under 10, and even under 3, which is perfect. Then we'll do it one last time for the remaining quiz, and again we're good to go: over 0.2 and under 3. So we can be confident that we do not have any issues of multicollinearity between the scores on the three quizzes.

Next we'll look at homoscedasticity.

Next we need to confirm that our predictor variables have the same impact on our predicted variable for all levels of the variables. In this example, what that means is that we need to ensure that quiz 1, 2, and 3 have the same influence on final grades regardless of whether someone did well or poorly on the quizzes. To test for this we have to confirm that we have no violations of homoscedasticity, and we do this through the Regression tab. So we go Analyze, Regression, Linear, and here we just need to set this up as if we're actually going to run the model. We'll add all of our quizzes as our independent predictor variables and our stats final grade as the dependent variable. To check for this we only need to look at one thing, and that is the plot of the relationship between these two variables. In order to see the influence of everything at once, we're going to use the standardized version, so the z-scores
of the predicted values and the residuals, so essentially the error. To do this we're going to take ZPRED here and put it on our x-axis, and ZRESID is going to go on the y-axis. We click Continue, and that's everything we need to set up for this test. From here we'll get our output, and we can ignore everything except for this graph here. What we need to do first is add a line of best fit, so essentially the regression line for this graph. If we double click and then click on this, we'll have the linear line here; we can close it, and now we can look at it. What we're looking for is for the line to be pretty straight and for the scores to be clustered on both sides of the line at both the high and low ends. What we see here is a pretty good distribution on both sides. This way there's a little bit more at this end and that end; however, we do see a nice cloud of data, so we don't have anything to worry about. If we saw some patterns where, for example, there was something like this going on, a lot of scores here and a lot of scores there, that would be a problem. If we had a big U, or a lot of scores at this end, a gap, and a lot of scores at that end, those would all be problems. But this really does just look like a cloud, and the line goes right through it, so for the purposes of our analyses we are good to go.

Finally, we'll check for independent observations.

Before we can proceed with our multiple regression analysis, we just need to confirm that we have independent observations on our predictor variables. That means no underlying relationship between quiz 1 and quiz 2, or quiz 1 and quiz 3, or quiz 2 and quiz 3. To do this it's actually very straightforward. We'll go to Analyze, Regression, Linear. This time we don't need anything selected in Plots; we just need to click on Statistics and then select Durbin-Watson. This will add an extra little output to our results, and we'll be able to interpret it quite
quickly. We click Continue, then OK, and now the Durbin-Watson result ends up in this table here under Model Summary. We'll just look right here. Our values will always be under four, and anything close to two is considered good, where one and three are the cutoffs. So anything above one and under three is considered okay. In our case our value of 1.697, or 1.7, is close to two and within that range of one to three, so we can assume that we have independence of observations.

Now we'll conduct a forced entry regression.

Now that we've tested all of our assumptions, we can move forward with our multiple regression analysis. The first thing we're going to check is whether these three quizzes, quiz 1, quiz 2, and quiz 3, predict final grades in the statistics course. Since we don't have any underlying reason or past research to suggest that one of these quizzes already predicts final grades, we're going to treat all of them equally and do a forced entry multiple regression. That means all of these will be entered at the same time, and we'll be able to compare them at that point to see whether they have a significant relationship with final grade in the stats course. So we'll go to Analyze, Regression, Linear again, keeping the model set up the same way, where we have quiz 1, 2, and 3 and we are predicting our final grade. Now we can add a few things in Statistics. We're interested in the Model fit and the R square change; we can unselect Durbin-Watson and leave everything else the same. We don't need anything under Plots. Under Save, this is where you can do a lot of the work for identifying outliers and multivariate outliers, so we won't need that for this course; however, if you are looking to do slightly more advanced things, there are a lot of different functions here. Under Options we will not use anything, and Bootstrapping we will also skip. So we can click OK and
proceed with our results. The first table we have is just a summary of what was run. We ran one model, and we can see here that there were three variables entered. Underneath the table we have the dependent variable, which is stats grade. This little b here just says that all the variables we requested were entered; if there had been a problem it would have notified us here. The next table gives us a summary: the correlation coefficient for the three variables, and the variance explained, or the effect size, so the R square here is the variance explained by these three variables. We have an adjusted R square, the standard error of the estimate, and then this section here is essentially comparing how much better the model got compared to no model. In this case, when we're just running a model like this, the R square change is always going to be equivalent to the actual R square value, because we're starting from zero, and the F change is going to be the same as this F here, because again we're starting from zero. It'll have the same degrees of freedom as our model, and therefore this will match this. So when you're just running a forced entry regression, these change statistics don't actually mean anything; when we look at hierarchical regression, that's when this will come into play. In this table, what you're looking for with the adjusted R square is that the value is similar to the regular R square. This takes into consideration some error and whether the model would generalize, and if you were to see a large discrepancy here it might mean there are some problems with the error, but a couple of points is nothing major. So we can suggest that this is a good indicator of our effect size. Next we have our ANOVA table. We see here that we have an F value of 386.47 with 3 and 222 degrees of freedom. This value is extremely significant, so p is much
smaller than .001. In fact, it's so significant we can't even see where the next value is. This next table gives us a summary of each of the predictor variables' contribution in our regression model. What we have here is quiz 1, quiz 2, and quiz 3; we have their unstandardized coefficients, then the standardized coefficients, and then the t-test examining how well each of these predictors predicts the outcome variable. If we look here, we can see that in the unstandardized coefficients the values vary considerably, but if we want to actually compare these with each other we need to look at the standardized ones. Whichever one is the largest is going to have the biggest influence; however, the others could have a significant influence if these values here are significant. In our case the standardized beta coefficients for quiz 1 and 2 are actually quite small. We have a standardized coefficient for quiz 1 of negative 0.02, and for quiz 2 it's practically zero, which means that these don't really have any kind of influence on the dependent variable. If we look at their significance values we see that this has been confirmed: these are not smaller than .05, so quiz 1 and quiz 2 don't have any real influence on the dependent variable. Quiz 3, however, has a very significant relationship. We have a value of 0.93 here, and this is a significant positive relationship given the positive value there. If you wanted to create the equation predicting final grades, this would be the constant, so you'd have y = 50.34, and then, since these are not significant predictors you wouldn't necessarily have to include them in the equation, but you would add in the slopes for each of the predictor variables. That's everything we need to look at for this particular analysis. So in our case we
do have a significant model where our quizzes do predict the final grades; however, looking at each of the coefficients separately, we see that quiz 3 is actually the only one having a significant influence, and the others are not.

And finally, we will conduct a hierarchical regression.

This time we're interested in examining again the relationship between quiz 1, quiz 2, and quiz 3 and the final grades in the stats course. However, in this particular example we already know from previous research that quiz 2 is a significant predictor of the final grades. We know that from maybe the previous semester's course, since quiz 2 was administered to the students. But let's say that in this course quiz 1 and 3 were added for this semester, so we're not sure yet whether they will also have an influence on stats grades above and beyond what we already know will come from quiz 2. If we want to test this, we need to run it as a hierarchical linear regression, because we need to enter quiz 2 first and then add quizzes 1 and 3. To do this, again we're going back to Regression and Linear, and this time we'll just click Reset to restart everything. Our dependent variable stays the same; however, this time we need to set up the analysis so that SPSS tests quiz 2 and then tests all three. To do this we enter quiz 2 as an independent variable, and then you'll notice right here it says Block 1 of 1, so right now we're asking SPSS to run one model. However, if we add a second block here by clicking Next, we can add quiz 1 and quiz 3, and now SPSS will run what's in the first block at the same time as what's in this second block. So we have block 1, which is quiz 2, and then block 2 is going to be quiz 1, quiz 3, and quiz 2 from block 1. It's going to run two separate analyses, and this is how we conduct our hierarchical regression. So we go back to Statistics, keep Model fit and Estimates, and
we'll reselect R square change right here, and we can click Continue. Then, like in the forced entry regression, we don't need to select any of the other options, so we're good to go from here. We click OK and SPSS generates our results. What's changed now is that SPSS has run two models, so this is block 1 and this is block 2. We can see in block 1 that we have just quiz 2 as the independent variable, and block 2 indicates that we've added quiz 3 and quiz 1, but it's automatically going to run whatever comes before it up here as well. So everything has been entered, and again stats grade remains our dependent variable. This is where the change statistics get interesting. You will notice that we have two values for absolutely everything. These change statistics for the first model, so block 1, represent the difference between no model and the model we ran, so this value here is going to match that value there, this F here is going to match this F there, and so on. However, model 2 is now comparing the model that we ran with the new model, so this value is going to be the difference between these two here, and this value is the difference between these ones here, however with a different error term. So it's not going to calculate exactly like we see it; it has to take into account that the error term changes. Then we're looking at different degrees of freedom for both values, and an additional test of whether or not this is a significant difference. When you're looking at these two models you might be wondering, okay, which model do we use? Everything comes down to this value here under Sig. F Change. If we have a significant value for our lowest model here, which is model 2, it means that model 2 is significantly better than model 1, just like when we looked at only one of these, having a significant value here meant that our model was significantly better than no model. So what we wanted to
confirm in the first round is that our first variable predicted our outcome variable, which it did, by this value here. However, we added the additional variables, and now we want to confirm that this model predicts the outcome above and beyond what the first model did. Since we have a significant difference here, we can conclude that that's true, and these change statistics show us by how much the model has changed. We notice here that our correlation coefficient goes from 0.43 to 0.92, and our effect size, so our R square, goes from 0.18 to 0.84, and the adjusted R square again is similar. So we can see that our second model accounts for a lot more variance in the dependent variable than our first model did. Since this is the model we're going to go with, for the rest of the results we can ignore everything under model 1 and just look at model 2. Let's scroll down here. What we see is essentially the same sort of thing: we have two ANOVA tables, except that the degrees of freedom have changed, and that's because this value depends on the number of predictors in the model. Our first model had only one; this one has three. Since this one is significantly better, we don't need to look at the first one. We have an F of 386.47 with 3 and 222 degrees of freedom. This is a good model, since p is much smaller than .05. We can now look at the actual coefficients, and what we see here is that quiz 2 is a significant predictor of final grades when we look at quiz 2 alone. This is the unstandardized value, this is the standardized value, the t-test, and the significance right here, showing that yes, quiz 2 significantly predicts final grades. However, once we add quiz 1 and 3 we see that this relationship changes, and essentially we're back to the relationship we had before, where quiz
2 and quiz 1 do not have a significant relationship and do not predict the final grades once we take into consideration quiz 3, which is actually accounting for most of the variance in final grades. This is the model that we would have to interpret, so we cannot conclude that quiz 2 has a significant influence on final grades. We have to go with this one and conclude that it's actually only quiz 3 that has a significant and positive influence on final grades in the course. Finally, Excluded Variables here is just a summary of what was not included in the first model, which makes sense: we didn't have quiz 1 and 3, we were only looking at quiz 2.
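
The checks walked through in the captions can also be sketched outside SPSS. The following is a minimal Python illustration using only NumPy, not a reproduction of the video's output. The data are synthetic and generated here purely for illustration, and the helper names (fit_ols, tolerance_and_vif, durbin_watson, f_change) are hypothetical, not part of SPSS or any library. It assumes the usual definitions: tolerance is 1 minus the R square from regressing one predictor on the others, VIF is 1 divided by tolerance, Durbin-Watson is the sum of squared successive residual differences divided by the sum of squared residuals, and F change compares the R square of two nested models.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, residuals, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, resid, r2

def tolerance_and_vif(X):
    """Regress each predictor on the remaining ones; tolerance = 1 - R^2, VIF = 1 / tolerance."""
    results = []
    for j in range(X.shape[1]):
        _, _, r2 = fit_ols(np.delete(X, j, axis=1), X[:, j])
        tol = 1.0 - r2
        results.append((tol, 1.0 / tol))
    return results

def durbin_watson(resid):
    """Sum of squared successive differences over sum of squared residuals (between 0 and 4)."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def f_change(r2_small, r2_big, n, k_small, k_big):
    """F statistic for the R^2 gain when going from k_small to k_big predictors."""
    numerator = (r2_big - r2_small) / (k_big - k_small)
    denominator = (1.0 - r2_big) / (n - k_big - 1)
    return numerator / denominator

# Synthetic data (an assumption for illustration only, not the video's dataset).
rng = np.random.default_rng(0)
n = 226
quiz1 = rng.normal(7.0, 1.5, n)
quiz2 = 0.4 * quiz1 + rng.normal(4.0, 1.5, n)
quiz3 = 0.4 * quiz2 + rng.normal(4.0, 1.5, n)
grade = 50.0 + 4.0 * quiz3 + rng.normal(0.0, 2.0, n)
X = np.column_stack([quiz1, quiz2, quiz3])

diagnostics = tolerance_and_vif(X)              # one (tolerance, VIF) pair per quiz
_, resid_full, r2_full = fit_ols(X, grade)      # all three quizzes entered together
_, _, r2_block1 = fit_ols(X[:, [1]], grade)     # block 1: quiz 2 only
dw = durbin_watson(resid_full)
f_chg = f_change(r2_block1, r2_full, n, 1, 3)   # block 2 adds quiz 1 and quiz 3

for name, (tol, vif) in zip(["quiz1", "quiz2", "quiz3"], diagnostics):
    print(f"{name}: tolerance={tol:.3f}, VIF={vif:.3f}")
print(f"Durbin-Watson={dw:.3f}")
print(f"R^2 block 1={r2_block1:.3f}, R^2 block 2={r2_full:.3f}, F change={f_chg:.2f}")
```

The rules of thumb from the captions would then be applied to the printed values: tolerance above 0.2, VIF ideally under 3, Durbin-Watson between 1 and 3, and an F change large enough to be significant at its degrees of freedom (here 2 and n - 4).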
Info
Channel: Meredith Rocchi
Views: 35,762
Rating: 4.8198199 out of 5
Id: 4ERLK7F9nWc
Length: 23min 49sec (1429 seconds)
Published: Mon Sep 08 2014