Multiple and hierarchical linear regression in SPSS

Captions
Hello everybody. In this video I'm going to demonstrate how to run a multiple linear regression and a hierarchical regression in SPSS. The data I'm using for this demonstration is partially adapted from a study called CORE, spearheaded by Dr. Melvin Chan of the National Institute of Education, Nanyang Technological University, Singapore. The data set consists of three variables; I have also included gender and stream, but I'm not going to use them in this analysis — they are there just for demonstration. The dependent variable in this analysis is comprehension, and I want to see whether, using grammar and vocabulary as two independent variables, I can predict the comprehension scores of the students who participated in the study, and if so, with what level of accuracy.

To run a regression analysis, we need to keep the assumptions of multiple linear regression in mind. The analysis is "multiple" because we have two or more independent variables; with only one independent variable, it would be referred to as simple linear regression. That is already the second assumption: there should be two or more independent variables. These can be either continuous, like my vocabulary and grammar scores, or categorical, like the gender or stream variables I just showed you. There should also be exactly one dependent variable — regression can only be run with a single DV. The other assumptions are linear relationships between the DV and the IVs, lack of multicollinearity, absence of influential outliers, and normality of the residuals. I'm going to walk you step by step through the analysis and show you how these assumptions can be checked and how to make sense of the output. In SPSS, we go to Analyze, then Regression, then Linear.
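Outside SPSS, the same kind of model can be fitted by hand, which helps make sense of the output we are about to look at. Below is a minimal Python sketch — the scores are invented for illustration, not the CORE data — that estimates the coefficients from the normal equations and also computes R-squared and the standardized (beta) coefficients discussed later in the video.

```python
# A from-scratch sketch of multiple linear regression in pure Python.
# The scores below are invented for illustration -- NOT the CORE data.

grammar       = [12, 15,  9, 14, 10, 13, 16, 11]
vocabulary    = [20, 14, 21, 15, 23, 17, 22, 16]
comprehension = [31, 28, 29, 28, 32, 29, 37, 27]
n = len(comprehension)

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    m = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (M[r][m] - sum(M[r][c] * x[c] for c in range(r + 1, m))) / M[r][r]
    return x

# Design matrix with an intercept column; normal equations (X'X) b = X'y.
X = [[1.0, g, v] for g, v in zip(grammar, vocabulary)]
XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(3)] for r in range(3)]
Xty = [sum(X[i][r] * comprehension[i] for i in range(n)) for r in range(3)]
b0, b_gram, b_vocab = solve(XtX, Xty)

# R-squared: the share of DV variance explained by the predictors.
y_hat = [b0 + b_gram * g + b_vocab * v for g, v in zip(grammar, vocabulary)]
y_bar = sum(comprehension) / n
ss_res = sum((y - p) ** 2 for y, p in zip(comprehension, y_hat))
ss_tot = sum((y - y_bar) ** 2 for y in comprehension)
r_squared = 1 - ss_res / ss_tot

# Standardized (beta) coefficients: b_j rescaled by sd(x_j) / sd(y).
def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

beta_gram  = b_gram  * sd(grammar)    / sd(comprehension)
beta_vocab = b_vocab * sd(vocabulary) / sd(comprehension)

print(f"b0={b0:.3f}, b_gram={b_gram:.3f}, b_vocab={b_vocab:.3f}")
print(f"R^2={r_squared:.3f}, beta_gram={beta_gram:.3f}, beta_vocab={beta_vocab:.3f}")
```

SPSS produces the same estimates through the menus; the sketch is only meant to show what the Coefficients table and the R-squared value are computing.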
In the Linear Regression dialog box, we move the dependent variable, comprehension, into the Dependent slot, and grammar and vocabulary into the Independent(s) slot. For this first analysis I'm not going to use the Next button, because Next is only used in hierarchical (sequential) regression. Now let's go through the tabs to see what options are available to us.

Under Statistics, Estimates is already selected, so we leave it as is. We check Confidence intervals, R squared change, Descriptives, Part and partial correlations, and Collinearity diagnostics, which is one way to diagnose whether the data are affected by multicollinearity. In some sources you may read that the Durbin-Watson test of the residuals is also necessary for multiple linear regression, but that test is only appropriate for data with a serial, time-series structure; since my data are not time series, I'm not selecting it. Click Continue.

Under Plots, we need to test the assumption about the distribution of the residuals. To do so, we plot standardized residuals against standardized predicted values. *ZRESID stands for the standardized residuals; it can go on either axis, so let's move it to Y. *ZPRED stands for the standardized predicted values, so I'll move it to X, and that will be enough. I would also like the histogram and the normal probability plot, which will allow me to examine the normality assumption mentioned earlier.

Under Save, I check Standardized residuals, because the residuals relate to one of the regression assumptions, and Cook's distance, in order to check whether we have large outliers. Cook's distance is a statistic that is estimated from leverage measures together with the residual values; leverage is an attribute of outliers. Click Continue. Under Options we don't really want to change anything, and the rest can remain at the defaults.

SPSS offers quite a few methods of regression analysis. Enter, the first one, is a procedure in which all variables in a block are entered in a single step. Stepwise is another approach: at each step, the independent variable not yet in the equation that has the smallest probability of F is entered, provided that probability is sufficiently small (by default, smaller than 0.05); variables already in the equation are removed if their probability of F — the p-value — becomes sufficiently large (by default, larger than 0.10). The method terminates when no more variables are eligible for inclusion or removal. Next is Remove, a procedure in which all variables in a block are removed in a single step. Backward elimination is a bit more complex: all variables are entered into the equation and then removed sequentially, one at a time. The variable with the smallest partial correlation with the dependent variable is considered first; if it meets the criterion for elimination, it is removed. After the first variable is removed, the remaining variable with the smallest partial correlation is considered next, and this continues until only significant variables are left. Finally, Forward selection is a stepwise sort of procedure in which the first variable considered for entry is the one with the largest positive or negative correlation with the dependent variable; it is entered into the
equation only if it satisfies the entry criterion. Once the first variable is entered, the independent variable not in the equation with the largest partial correlation is considered next, and the procedure stops when no remaining variable meets the entry criterion. That is just a brief explanation of how these methods work; for the current analysis I'm choosing Enter, the most straightforward one, and clicking OK to get the results.

Here are the results, so let's go through the assumptions again, starting with the linear relationship between the DV and the IVs, since the first two assumptions are already met by the design. To check linearity, we look at the Correlations box. In this matrix you can see that the correlations between the comprehension score and the grammar and vocabulary scores are moderate — neither too high nor too low: 0.490 and 0.555 are quite good, so the linearity assumption seems to be met.

For the lack of multicollinearity, we can keep looking at the correlations and also consult the tolerance and the VIF, the variance inflation factor. For the first check, the correlation between the grammar and vocabulary scores is smaller than 0.7, which gives us reasonably strong evidence that the data are not affected by multicollinearity. Another way to check is to scroll down to the Coefficients box, which reports two collinearity statistics: tolerance and VIF. The lowest possible VIF is 1, and that is what we hope to get: a VIF of exactly 1 indicates no multicollinearity at all, a value between 1 and 5 indicates moderate multicollinearity, and a value above 5 indicates a high degree of multicollinearity. Our value is well within the tolerable range, so this also shows there is no problematic multicollinearity in the data. The tolerance statistic is 0.724, a third piece of evidence pointing the same way: the rule of thumb is that tolerance should be larger than 0.4, and 0.724 is well above that. Together, the VIF, the tolerance, and the correlation analysis provide strong evidence that the assumption of no multicollinearity is met.

Now for influential outliers and residuals. For influential outliers we look at Cook's distance, which, as I mentioned, has already been saved for us in the data sheet under the column COO_1. We right-click on this column and sort it — descending order is best here. The rule of thumb is that if any Cook's distance is larger than 1, that case is a serious outlier and should be removed from the analysis. As you can see, these values are all well below 1, so luckily we have no extreme outliers. For the residuals, we can look at the saved standardized-residuals column: right-click and sort it in descending order first. We have two criteria here: anything falling outside −3 to +3 can be eliminated, and a more stringent, and perhaps more defensible, criterion is −2 to +2.
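The collinearity cut-offs just described are easy to reproduce outside SPSS as well. Here is a small Python sketch using invented scores (not the CORE data): with only two predictors, the R-squared from regressing one predictor on the other is simply the squared correlation between them, so tolerance is 1 − r² and VIF is its reciprocal.

```python
# Tolerance and VIF computed by hand for a two-predictor model.
# The scores are invented for illustration (not the CORE data).

grammar    = [12, 15,  9, 14, 10, 13, 16, 11]
vocabulary = [20, 14, 21, 15, 23, 17, 22, 16]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

# With two predictors, regressing one on the other gives R^2 = r^2,
# so tolerance = 1 - r^2 and VIF = 1 / tolerance.
r = pearson_r(grammar, vocabulary)
tolerance = 1 - r ** 2
vif = 1 / tolerance
print(f"r = {r:.3f}, tolerance = {tolerance:.3f}, VIF = {vif:.3f}")
# Rules of thumb from the video: |r| < 0.7, tolerance > 0.4,
# VIF between 1 and 5 is moderate, above 5 is problematic.
```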
If anything falls outside −2 to +2, those cases can be removed: you can Clear those rows. I'm not going to do that in this analysis — I just want to show you how it can be done — and after deleting them you should rerun the analysis. So I right-click on the column and this time sort it in ascending order, so that the large negative residuals come to the top. You can see quite a few residuals falling outside −2 to +2 on the negative side; these are smaller than −2 and could be eliminated too, after which you would rerun the analysis once more. My data set is actually quite large, so deleting this number of cases will not affect it much, but with a smaller data set this might not be a wise procedure, and you might want to stick with the more liberal −3 to +3 range.

Another way of investigating the residuals is to look at the plots. The normal P-P plot here is a probability plot of the regression standardized residuals. The diagonal represents a perfect normal distribution, and the dots are so densely scattered that you cannot make out individual points — the thick band along the line is made of dots. The rule of thumb is that if the dots deviate markedly from the diagonal, you have some departure from normality. This can also be checked in the scatterplot of standardized residuals against standardized predicted values: the points should fall between −3 and +3 on both the horizontal and the vertical axes, and they do here, so this looks fine to me. Finally, it is also a good idea to look at the residual statistics table in the output.
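The two screening rules above — Cook's distance greater than 1, and standardized residuals outside ±2 or ±3 — can be sketched in a few lines of Python. The ZRE_1 and COO_1 values below are invented stand-ins for the columns SPSS saves:

```python
# Outlier and residual screening, as described in the video.
# zre_1 / coo_1 mimic the columns SPSS saves (ZRE_1: standardized
# residuals, COO_1: Cook's distance); the values are made up.

zre_1 = [0.31, -1.85, 2.43, 0.02, -0.77, -2.96, 1.10, 0.58]
coo_1 = [0.01,  0.09, 0.22, 0.00,  0.03,  0.41, 0.05, 0.02]

# Rule 1: any Cook's distance > 1 flags a highly influential outlier.
influential = [i for i, d in enumerate(coo_1) if d > 1]

# Rule 2: residuals outside the liberal (+/-3) or stringent (+/-2) range.
outside_3 = [i for i, z in enumerate(zre_1) if abs(z) > 3]
outside_2 = [i for i, z in enumerate(zre_1) if abs(z) > 2]

print("Cook's D > 1:", influential)  # []      -> no huge outliers
print("|ZRE| > 3:", outside_3)       # []      -> fine by the liberal rule
print("|ZRE| > 2:", outside_2)       # [2, 5]  -> cases a strict analyst might drop
```

Cases flagged by the stricter rule would be removed and the analysis rerun, exactly as described for the sorted columns in SPSS.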
The residual statistics table reports the standardized residuals and their range. As we already saw in the saved variable, the minimum and maximum fall within −3 to +3 but outside −2 to +2, so if you want to go by the stricter criterion, you can adjust your data by discarding the cases that violate the −2 to +2 rule.

Now that everything looks reasonably fine, we can turn to the results of the regression analysis itself. The first thing to look at is the goodness of fit, measured by the R-squared value: the proportion of variance in the dependent variable that is explained by our independent variables, grammar score and vocabulary score. Here, 36.2 percent of the variance is explained by the two independent variables. We also have the adjusted R-squared, which corrects for sample size and the number of predictors and therefore represents the population a bit better; in large samples, as is the case here, the adjusted and unadjusted values are more or less the same, so it doesn't much matter which one you report. The right-hand side of this table becomes more meaningful in a hierarchical, or sequential, regression, so for now we can stick with the left-hand side: 36.2 percent of the variance is explained.

Next we need to know whether both independent variables have a significant impact on the dependent variable, and the Coefficients table is where we look. The standardized coefficient, beta, quantifies the magnitude of the influence of an independent variable on the dependent variable. A standardized beta of 0.411 indicates that if the grammar score increases by one standard deviation, the comprehension score increases by 0.411 standard deviations — under the circumstances, of course, where the vocabulary score is held constant. In the same way, if we hold the effect of the grammar score constant, that is,
if we partial out its effect, then when the vocabulary score goes up by one standard deviation, the comprehension score increases by 0.274 standard deviations. Both of these effects are statistically significant, as you can see from the p-values. So the two variables together explain about 36.2 percent of the variance in the comprehension score, and they have these quantified impacts on it.

One last thing is the ANOVA table. It simply tells us whether the slope of the regression line is significantly different from zero. We do not want a regression line with a slope of zero, so we would like to see a significant p-value here — and we do. The slope differs significantly from zero, and as a result our regression model is able to predict the dependent variable from the independent variables in the analysis.

Now I would like to go back to the Linear Regression dialog and this time move the variables into the Independent(s) box one at a time. The first variable I enter into the analysis is vocabulary; then I click Next, which creates a hierarchical regression, and enter grammar in the second block. That will do; I'm not changing anything else in this analysis, and I click OK. This time we get two models in the output, as you will see, and we can compare them to see which one fits better and choose that. I've already talked about the correlations and the collinearity statistics; they remain the same. The variables entered into the analysis are vocabulary and grammar, and this is the model summary. As I said, we have two models: model one has an R-squared value of 0.240, and the second one is pretty
much like the previous model we estimated, with an R-squared value of 0.362, so the second model seems to be the better one on R-squared alone. The right-hand side of the model summary is more meaningful now, because R-squared changes as we move from model one to model two. This change has an associated F statistic, which is quite large — 330.413 — for an R-squared change of 0.122, and it is statistically significant. In other words, according to this p-value, model two has a statistically significantly better fit than model one. The ANOVA, which as I mentioned indicates whether the slope is significantly different from zero, is also statistically significant for both models, which is good news.

Ultimately we need to choose one of these two models. Looking at the VIF and tolerance statistics, they are not too bad; with only a single predictor variable they would be better, but these are still acceptable. What matters here is that we look at the standardized coefficients. Together, the two predictors explain more of the variance than the single predictor does, and both are statistically significant: the standardized regression coefficient is 0.274 for vocabulary and 0.411 for grammar, whereas in the first model, with vocabulary as the only predictor, its standardized coefficient is 0.490. Overall I would prefer the second model, because it is theoretically more justifiable to include both the vocabulary and the grammar score, not to mention that its R-squared value — the goodness of fit — was considerably better than that of the first model. The rest of the output is more or less the same as what I discussed earlier.
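For reference, the F statistic for the R-squared change can be reproduced by hand. The R-squared values below are the ones read off the SPSS output in the video, but the sample size n is an assumption (the video does not state it), so the resulting F is illustrative rather than the exact 330.413 SPSS reports:

```python
# The R-squared-change F test from the Model Summary, sketched by hand.
# R^2 values come from the video's output; n = 1000 is ASSUMED (the
# actual sample size is not stated), so F here is only illustrative.

r2_model1 = 0.240   # step 1: vocabulary only
r2_model2 = 0.362   # step 2: vocabulary + grammar
n = 1000            # assumed number of cases
k_full = 2          # predictors in the full model
m_added = 1         # predictors added in step 2

delta_r2 = r2_model2 - r2_model1
f_change = (delta_r2 / m_added) / ((1 - r2_model2) / (n - k_full - 1))
print(f"Delta R^2 = {delta_r2:.3f}, F({m_added}, {n - k_full - 1}) = {f_change:.2f}")
```

SPSS additionally reports Sig. F Change, the p-value for this F on (m_added, n − k_full − 1) degrees of freedom; a significant value means the added predictor improves the fit, which is the basis for preferring model two.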
This brings me to the end of the discussion of regression analysis, both multiple linear regression and hierarchical linear regression. One last thing before I close this presentation: so far we have only had one part of the picture, in which grammar and vocabulary predict comprehension. Now the question is, what if we throw in another variable — say, a writing score — and want to examine a more complicated set of relationships, in which grammar and vocabulary predict comprehension, comprehension itself predicts writing, and writing is also predicted directly by grammar and vocabulary? That would mean grammar and vocabulary have a direct influence on comprehension, a direct influence on writing, and an indirect influence on writing through comprehension. This is not a question that can be addressed by regression; it calls for path analysis or structural equation modeling, which is the topic of a video I'm going to create very soon. Thank you very much for your attention, stay tuned, and I will make sure to create and upload that video soon. Have a good day.
Info
Channel: Statistics & Theory
Views: 1,591
Rating: 5 out of 5
Keywords: SPSS, hierarchical linear regression, multiple linear regression, influential outliers, Cook's Distance, variance inflation factor, tolerance, residuals, Normal P-P Plot
Id: 3sQnO02f8Z0
Length: 26min 9sec (1569 seconds)
Published: Sun Dec 20 2020