How to Use SPSS: Standard Multiple Regression

Captions
In this video I demonstrate how to perform a standard multiple regression. In standard multiple regression we typically take two or more predictor (independent) variables and use them to predict a single quantitative, or numeric, outcome.

In standard multiple regression, all of the independent variables are entered into the model simultaneously, so each independent variable is evaluated in terms of its own predictive power over and above that offered by all the other predictors. This is probably the most commonly used form of multiple regression. You would typically use this approach if you had a set of variables, for example personality scales or risk factors, and wanted to know how much variance in an outcome variable (such as level of eating disorder or level of muscle strength) they were able to explain as a group or block. The approach also lets us figure out how much unique variance in the dependent variable each of the independent variables explains.

There are several assumptions in multiple regression. The first concerns sample size, and the rule of thumb here is the larger the sample size the better: because we are trying to make predictions about outcomes, we need enough variability in our predictors, as well as in our outcome, to make a good prediction. A common rule of thumb is about 15 participants per predictor variable, so with two predictor variables the absolute minimum number of participants would be 30, and as the number of predictors increases, so does the minimum number of subjects. If the dependent variable is normally distributed (one of the assumptions we will discuss shortly), 15 per predictor is acceptable; if it is somewhat skewed, or if we have concerns about its distribution, we should increase that number.

The next assumption concerns multicollinearity, which refers to the relationships among the independent variables. When independent or predictor variables are highly correlated with each other, typically r = 0.9 or higher, we have multicollinearity. We don't want multicollinearity in multiple regression because it means we have redundant predictor variables, which makes the model more cumbersome and less accurate, since more than one variable is measuring the same thing. So we always check for multicollinearity.

Next is outliers. Multiple regression is very sensitive to outliers, whether very high or very low scores, so checking for extreme scores should be part of our initial data screening, along with making sure we have accurate and complete data. We should do this for both the dependent and independent variables in the analysis. Outliers can be deleted from the data set, or we can generate replacement scores through multiple imputation or other techniques used for missing data.

Finally, the outcome variable needs to be normally distributed, the relationship between each independent variable and the dependent variable needs to be linear, and we need similar variance on the dependent variable across levels of the predictors. Making sure the dependent variable has a normal distribution and is not skewed usually takes care of this, but we also want to confirm a linear relationship between the predictors and the outcome.
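The screening above is done through SPSS menus in this video, but for readers who prefer a script, here is a minimal sketch of the same initial screening in Python. The file name survey.csv and the column names tpstress, tmast, and tpcoiss are assumptions based on the variables described later in the video, not part of the original demonstration.

```python
# Minimal data-screening sketch (assumed file and column names).
import pandas as pd

df = pd.read_csv("survey.csv")            # hypothetical data file
cols = ["tpstress", "tmast", "tpcoiss"]   # assumed variable names

# Descriptives: check means, ranges, and missing counts
print(df[cols].describe())

# Flag potential outliers: any case with |z| > 3 on any variable
z = (df[cols] - df[cols].mean()) / df[cols].std()
print(df[(z.abs() > 3).any(axis=1)])
```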
The example we'll run for standard multiple regression looks at whether some predictor variables can predict a person's level of perceived stress. Our outcome variable is total perceived stress (tpstress), a numeric measure of stress level where a higher score indicates a higher stress level. For predictor variables we'll use two measures of control. The first is total mastery (tmast), a measure of how strongly someone feels they are in control, their perceived control over events and circumstances. The second is perceived control of internal states (tpcoiss), where higher scores indicate the person feels they have greater control over internal states such as their emotions, thoughts, and physical reactions. To summarize, we'll see whether a measure of perceived mastery and a measure of perceived control of internal states can together predict someone's level of perceived stress.

Our first step is to set up the procedure. As we go through the steps I'll demonstrate how we tell SPSS what we want to do: first we'll check the assumptions, and then we'll look at the regression model that is produced. Go to the Analyze menu, then select Regression and Linear. Move the numeric dependent variable, total perceived stress, into the Dependent box, then move the two independent variables, total mastery and the internal states measure, into the Independent(s) box to indicate these are our two predictors. Just below that, where it says Method, make sure Enter is selected; this is standard multiple regression, where all the variables (however many you have) are entered into the model at once. We'll talk about the difference between this and hierarchical or stepwise regression in a different video.

Next, click the Statistics button and make sure the following are selected: Estimates, Confidence intervals, Model fit, Descriptives, Part and partial correlations, and Collinearity diagnostics. These will help us determine whether we meet some of the assumptions. In the Residuals section, select Casewise diagnostics, make sure "Outliers outside" is checked, and leave 3 as the number of standard deviations; this will help us check for outliers beyond three standard deviations from the mean.
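For comparison, the same model specification, with all predictors entered at once as in SPSS's Enter method, can be sketched with statsmodels in Python. This is an illustrative equivalent, not the video's procedure; the data file and column names are the assumptions noted earlier.

```python
# Rough Python equivalent of Method = Enter (standard multiple regression).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey.csv")  # hypothetical data file

# Both predictors enter the model simultaneously
model = smf.ols("tpstress ~ tmast + tpcoiss", data=df).fit()
print(model.summary())  # coefficients, confidence intervals, R-squared, F-test
```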
Once that is done, click Continue, then click the Options button. In the Missing Values section, check "Exclude cases pairwise"; this means that any subject missing one of the variables, whether a predictor or the outcome, will be excluded from the analysis. Click Continue, then click the Plots button. Move the variable labeled *ZRESID into the Y box, and move the variable labeled *ZPRED (the one just above it) into the X box. In the section labeled Standardized Residual Plots, tick Normal probability plot and click Continue. Then click the Save button, and in the section labeled Distances select the first choice, Mahalanobis distance, and also Cook's distance. Click Continue, then click OK.

Here is our output. The first step is to check our assumptions, starting with multicollinearity. To check it, find the table labeled Correlations; clicking Correlations in the output navigation pane takes us to that table. What we want to see is that our independent variables show at least some relationship with the dependent variable, in other words an r value greater than 0.3. Looking at total perceived stress as our outcome and following that row across to our two predictors, we can see that both predictors have negative correlations with the outcome, but their magnitudes are greater than 0.3, so both scales correlate quite strongly with total perceived stress.

We can also check that the correlation between the independent variables themselves is not too high. The correlation between the PCOISS score and the total mastery score is 0.521, so our two predictors are correlated with each other, which is fine, but we don't want them correlated above about 0.7. Too strong a correlation would suggest multicollinearity, some redundancy in the two predictors; if we had a correlation greater than 0.7, we would probably omit one of the variables or form a composite variable from the two highly correlated predictors. In this example the correlation is 0.521, which is less than 0.7, so we can use both variables as predictors.
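The same correlation checks can be scripted; here is a sketch, again assuming the DataFrame and column names from the earlier snippets:

```python
# Correlation checks for the multicollinearity assumption.
corr = df[["tpstress", "tmast", "tpcoiss"]].corr()
print(corr)

# Each predictor should correlate with the outcome at |r| > 0.3 ...
print(corr.loc[["tmast", "tpcoiss"], "tpstress"].abs() > 0.3)

# ... while the predictors should not correlate with each other above ~0.7
print(abs(corr.loc["tmast", "tpcoiss"]) < 0.7)
```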
SPSS also performs collinearity diagnostics for us, which again address one of our assumptions; these can pick up problems with multicollinearity that may not show up in the correlations we just examined. Find the table labeled Coefficients and click on it. Two values related to collinearity diagnostics are given in this table: Tolerance and VIF (variance inflation factor). Tolerance, to take that first, is an indicator of how much of the variability of a given predictor is not explained by the other predictors in the model. If this value is very small, less than 0.10, it indicates that the multiple correlation with the other variables is high, suggesting multicollinearity. Our value for both independent variables is 0.729, well above 0.10, so by this measure we do not have multicollinearity. The variance inflation factor is basically just the inverse of the tolerance value, and VIF values above 10 would be a concern indicating multicollinearity. Our VIF values are 1.37, well below 10. Both statistics indicate that we do not have multicollinearity, so we have met this assumption (that is, the assumption of not having multicollinearity). If you do have multicollinearity, with tolerance or VIF beyond the thresholds just described, you will probably want to consider removing one of the predictor variables or finding a different predictor.

The next assumptions to check are normality, linearity, and outliers, and one way to check them is by inspecting the normal probability plot we requested. Scroll down to the charts section of the output; the first thing we see is the P-P plot. In a normal P-P plot we want our points to lie reasonably close to the straight diagonal line that bisects the chart. The points might deviate a little, but here we have very little deviation from that line, so it appears we have a good fit on the P-P plot, with no major deviations from normality.

Just below that we can look at the scatterplot of the standardized residuals. We want to see a roughly rectangular distribution: we should pretty much be able to draw a rectangle around all the dots, with most of the scores clustered in the center. If so, we've met the assumption of linearity. What we don't want to see is a clear or systematic pattern to the dots, for example higher on one side than the other; deviations from this centralized rectangular shape suggest a violation of the assumptions.

Outliers can also be detected from the scatterplot. We typically define outliers as cases with a standardized residual, as displayed in the scatterplot, of more than 3.3 or less than -3.3. With large samples it's not unusual to find a few outlying residuals, and if you find only a few it's probably not something you need to worry about or act on. Here we do have some standardized residuals approaching -4 and +4, but only a small number in that neighborhood, so we probably don't need to worry.
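Tolerance and VIF are easy to reproduce by hand, since tolerance is 1 minus the R-squared from regressing one predictor on the others, and VIF is its inverse. A sketch with our two assumed predictors:

```python
# Tolerance and VIF by hand: regress each predictor on the other(s).
import statsmodels.formula.api as smf

r2 = smf.ols("tmast ~ tpcoiss", data=df).fit().rsquared
tolerance = 1 - r2    # flag multicollinearity if below 0.10
vif = 1 / tolerance   # flag multicollinearity if above 10
print(tolerance, vif)
```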
We can also check for outliers by inspecting the Mahalanobis distances produced by the multiple regression procedure. These don't appear in the output but are added to the data file as an extra variable in the rightmost column; we'll look at the data file in a moment. To identify which cases are outliers, you need to determine the critical value of chi-square, using the number of independent variables as the degrees of freedom; scores exceeding this critical value are considered outliers. First determine how many independent variables are in the analysis, which we know is two. Two independent variables correspond to a critical value of 13.82, three independent variables to a critical value of 16.27, and so on. Because we have two independent variables, the critical value for determining outliers is 13.82.

To find out whether any case has a Mahalanobis distance exceeding 13.82, go to the Residuals Statistics table and look at the row labeled Mahal. Distance. The maximum value in our data file is 13.897, which just slightly exceeds the critical value of 13.82. To find out which case has this value, go back to the Data Editor and choose Sort Cases. Sort by the new variable at the bottom of the list, MAH_1, moving it into the Sort By box, and sort in descending order so the highest values are listed first. Click OK, and back in the data window the highest distance score is now listed at the top. We have one individual with a distance score above the critical value. With a data set of this size it's not unusual to see a few outliers, and we have only one case that is only very slightly beyond the critical value, so in this situation I really wouldn't worry. If you find cases much higher than the critical value, you may need to consider removing them, and if more than about two percent of cases exceed the critical Mahalanobis distance, I would probably consider removing those.
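The Mahalanobis check can also be scripted. The critical values quoted above are chi-square critical values at alpha = .001, with degrees of freedom equal to the number of predictors (chi2.ppf(0.999, 2) is about 13.82). A sketch using the assumed column names:

```python
# Mahalanobis distances for the predictors, compared to the chi-square
# critical value (df = number of predictors, alpha = .001).
import numpy as np
from scipy.stats import chi2

X = df[["tmast", "tpcoiss"]].dropna().to_numpy()
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared distances

critical = chi2.ppf(0.999, df=2)  # ~13.82 for two predictors
print(d2.max(), critical, int((d2 > critical).sum()))
```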
Back in the output file, the next place to look for unusual cases is the table called Casewise Diagnostics, just above Residuals Statistics. This presents information about cases with a standardized residual value above 3.0 or below -3.0; in a normally distributed sample we would expect only about one percent of cases to fall outside this range. In our example we have one such case, case number 55. Looking at the table, this person has a total perceived stress score of 14, but our model predicted a score of 28. Clearly the model did not predict this person's score very well; they are actually a lot less stressed than predicted, and that might be a problem.

To check whether this individual case, case 55, is having any undue or oversized influence on the results for the model as a whole, we can check the value for Cook's distance, given toward the bottom of the Residuals Statistics table. Cases with values larger than 1 are a potential problem. In our example the maximum Cook's distance is 0.091, obviously much less than 1, which indicates that this individual case is not having an undue influence on our ability to predict the outcome. If you obtain a maximum value above 1, you should go back to the data file and sort cases by the new variable SPSS creates, COO_1 (Cook's distance), in descending order, just as we did with the Mahalanobis distance, and look at each case above 1; you may need to consider removing any case with a Cook's distance greater than 1.

We have now examined all of our assumptions and appear to have met them, so we can move forward with evaluating the model we created: how effective it is, whether it's statistically significant, and how accurate its predictions will be.

The next step, then, is to evaluate the model. Click on Model Summary in the output and check the value given under the heading R Square. This value tells us how much of the variance in the dependent variable, perceived stress, is explained by the model, that is, by our two predictors, mastery and the PCOISS variable. In this case the value is 0.468. Expressed as a percentage (just multiply by 100), our model, using our two predictor variables, explains about 46.8 percent of the variance in perceived stress. Whether someone has a high or low stress score, about 47 percent of that variance is explained by these two predictors, which also means that more than 50 percent of the variance in stress is explained by other things. Even so, this is a respectable result: explaining close to 50 percent of the variance with two predictors is pretty good.

You will also see an adjusted R-square value next to the R-square value we just looked at. With a small sample, the R-square value tends to be a somewhat optimistic overestimate of what is really happening in the population, and the adjusted R-square statistic corrects this to provide a better estimate of the population value. If you have a small sample size, perhaps right at the limit of 15 subjects per predictor variable, you may wish to report the adjusted R-square rather than the ordinary R-square.
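If you ran the statsmodels sketch from earlier, the same model-evaluation quantities are available from the fitted result, and Cook's distances come from the influence diagnostics:

```python
# Model-evaluation quantities from the fitted statsmodels result
# (`model` comes from the earlier sketch).
print(model.rsquared)       # R-squared (about 0.468 in this example)
print(model.rsquared_adj)   # adjusted R-squared, preferable for small samples

cooks = model.get_influence().cooks_distance[0]
print(cooks.max())          # values above 1 flag unduly influential cases
```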
The next step is to assess the statistical significance of the result: is the model a statistically significant predictor of the outcome, such that we can say its predictions reflect what would happen in the population? Look in the table labeled ANOVA. This tests the null hypothesis that multiple R in the population equals 0, in other words that the model cannot predict the outcome. The model in this example has a p-value less than 0.05, so we can say the model is statistically significant; it predicts the outcome better than chance.

Our third step is to evaluate each of the independent variables. We want to know which of the variables in the model contributed most to the prediction of the outcome, and we find this information in the output box labeled Coefficients, right below the ANOVA box. Look in the column labeled Beta under Standardized Coefficients, and compare the variables on their beta values. Make sure you are looking at the standardized coefficients, not the unstandardized coefficients: standardized means the values for the different variables have been converted to the same scale so we can easily compare them. If instead you were interested in constructing a regression equation, so that knowing the PCOISS value and the total mastery value you could predict total stress, you would use the unstandardized coefficients in the B column to build the multiple regression equation. In this case we are interested in comparing the contributions of the independent variables, so we use the standardized beta values.

Look down the Beta column and find the largest beta value, ignoring the sign (it doesn't matter whether it's positive or negative). Here the largest beta coefficient is 0.424, from the total mastery predictor. This means that mastery makes the strongest unique contribution to explaining the outcome when the variance explained by all the other variables in the model is controlled for. The beta for our other predictor, the PCOISS score, is slightly lower, so it made less of a contribution, but still a fairly large one.

For each of these variables we can also check the statistical significance of its contribution in the column labeled Sig. This tells you whether the variable makes a statistically significant unique contribution to the prediction model, and it is very dependent on which variables are included in the equation and how much overlap, how much collinearity, there is among the independent variables. If the Sig. value for a predictor is less than 0.05 (or 0.01, depending on how stringent you want to be), the variable is making a significant unique contribution to the prediction of the outcome; if it is greater than 0.05, you can conclude it is not making a significant unique contribution, which again might be due to overlap among the predictors, some multicollinearity. In this case, both total mastery and total PCOISS make unique, statistically significant contributions to the prediction of the outcome.
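statsmodels reports unstandardized B coefficients by default; one common way to obtain standardized betas, sketched below, is to z-score all the variables and refit:

```python
# Standardized (beta) coefficients: z-score every variable, then refit.
cols = ["tpstress", "tmast", "tpcoiss"]
zdf = (df[cols] - df[cols].mean()) / df[cols].std()

beta_model = smf.ols("tpstress ~ tmast + tpcoiss", data=zdf).fit()
print(beta_model.params)    # betas, directly comparable across predictors
print(beta_model.pvalues)   # significance of each unique contribution
```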
The other potentially useful piece of information in the Coefficients table is the part correlation coefficients, shown under Correlations in the column labeled Part. These are sometimes referred to in the literature as semipartial correlation coefficients, which can be a little confusing. If you square a part correlation value, you get an indication of that individual variable's contribution to the total R-square: how much of the total variance in the outcome is uniquely explained by that variable, and how much R-square would drop if the variable were removed. Our R-square, remember, was 0.468. The part correlation for total mastery is 0.362; squaring that value (multiplying it by itself) gives about 0.13, indicating that mastery uniquely explains about 13 percent of the variance in total perceived stress scores, and that R-square would drop by about 0.13 (to roughly 0.34) if we removed total mastery from the model.

You may have noticed that the R-square value is not the total of the two squared part correlations added together. This is because the part correlation values represent only the unique contribution of each variable, with any overlap, or shared variance, removed or partialled out, while the total R-square includes both the unique variance explained by each variable and the variance that is shared. In this case our two independent variables are reasonably strongly correlated, so there is a fair amount of shared variance that is statistically removed when both are included in the model.

One last thing to consider when developing a regression equation to actually predict someone's total stress level is the standard error of the estimate. This gives us an idea of how much our prediction might be off, a plus-or-minus around the prediction. If we predicted someone's stress level from their total PCOISS and total mastery scores, the prediction of their total stress might be off by about 4.247 points. That doesn't seem like a lot of variability, but the larger that number is, the more variability there is in the equation; in general, the more statistically significant the equation and the higher the R-square, the smaller the standard error of the estimate.
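The squared part correlation equals the drop in R-squared when a predictor is removed, so it can be verified by fitting reduced models; a sketch reusing the assumed data:

```python
# Squared part (semipartial) correlations as the R-squared drop when each
# predictor is removed from the full model.
full_r2 = smf.ols("tpstress ~ tmast + tpcoiss", data=df).fit().rsquared

for kept, dropped in [("tpcoiss", "tmast"), ("tmast", "tpcoiss")]:
    reduced_r2 = smf.ols(f"tpstress ~ {kept}", data=df).fit().rsquared
    print(f"{dropped} uniquely explains {full_r2 - reduced_r2:.3f} of variance")
```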
To summarize, the results of this analysis allow us to answer a couple of questions about our model. The model, which includes a mastery score (how well people feel they can control external events) and a measure of control over internal states (their emotions and reactions), explains about 47 percent of the variance in perceived stress level. Of the two variables, mastery makes the largest unique contribution, and PCOISS also makes a statistically significant contribution. So we have a statistically significant model that lets us predict perceived stress from two predictor variables, both contributing significantly, with the mastery variable having the larger contribution of the two.

This is a good example of how to run a standard multiple regression. We could obviously use more than two predictor variables; the process is much the same, it just gets a little more complex, as there are more variables to examine when checking assumptions and when determining which variables make the greatest contribution. Hopefully you learned something from this presentation, and good luck with this technique in your own research.
Info
Channel: TheRMUoHP Biostatistics Resource Channel
Views: 314,487
Rating: 4.8853502 out of 5
Keywords: statistics
Id: f8n3Kt9cvSI
Length: 36min 54sec (2214 seconds)
Published: Thu May 02 2013