How to Use SPSS: Discriminant Analysis

Video Statistics and Information

Captions
In this video I will demonstrate how to perform a discriminant analysis. Discriminant analysis has two basic purposes: to describe major differences among groups following a MANOVA, and to classify subjects into groups based upon a combination of measures. The idea is to take a group of predictor variables, which are typically quantitative in nature (interval or ratio), and use them to predict group membership: whether a subject will or will not have a certain characteristic, or will or will not belong to a certain group.

Examples of this might be trying to find variables that could predict whether or not someone would pass an exam, whether or not someone would develop a disease, an illness, or an injury, or whether or not someone might have a certain characteristic or trait in psychology or education. So we can use this to figure out which variables help differentiate group membership, which is somewhat the opposite of ANOVA and MANOVA. There we use groups to predict differences among multiple variables; now we're using quantitative variables to predict who will or will not belong to a certain group. As the name suggests, what we're trying to do is gather a group of predictor variables that can help discriminate, or differentiate, why subject A might belong to group one and why subject B might belong to group two.

The logic of discriminant analysis is to identify uncorrelated linear combinations of these predictor variables, and these uncorrelated linear combinations are referred to as discriminant functions. If we can identify these uncorrelated combinations, we can identify which variables discriminate between the groups. This is very similar in some respects to logistic regression and how we go about using variables
to predict an outcome that is categorical.

How we interpret these results is pretty straightforward, since they tend to parallel what we've seen with some of the other types of regression or prediction analyses that we've done. The main result obtained from a discriminant analysis is the summary of what we refer to as the discriminant functions, which are similar to factors or components when we talk about multiple regression or factor analysis. Now, in the case when there's more than one predictor variable, or more than one uncorrelated linear combination of predictors, these combinations basically consist of regression equations: raw scores on each original variable are multiplied by assigned weights and then summed together in order to obtain a discriminant score, which is analogous to a factor score or component score in factor analysis.

The analysis returns several indices, many of which we have talked about before in some of the other multiple regression analyses. Eigenvalues and percentage of variance explained are provided for each discriminant function, and these values are interpreted very similarly to what we would see in factor analysis or principal component analysis. We also see a value reported called the canonical correlation, and this value is equivalent to the correlation between the discriminant scores and the levels of the dependent variable, or the outcome.

Traditionally we have only two levels of the outcome (subjects either pass or fail; they either do or don't have the outcome of interest), but we can evaluate three, four, or more levels of outcome. Typically, though, this is used when we have a dichotomous outcome. A high value for the canonical correlation indicates a function that discriminates well between subjects; in other words, it will likely perform well in terms of classifying subjects into the dependent variable groups. So it's important to
realize that when the canonical correlation is less than perfect, in other words not equal to 1.0, there's going to be some degree of error reflected in the ability to predict group membership. That's an important consideration as we go forward and evaluate some of the outcomes in our example, and we'll talk a little bit about that.

Another portion of the results that's vital to our interpretation involves the actual coefficients for each discriminant function. Similar to regression coefficients in multiple regression, these coefficients serve as the weights assigned by the formula to the various original variables in the analysis.

How we determine the accuracy of the prediction is the number of correct classifications made using the predictor variables, and this is known as the hit rate. These predictions of group membership are compared to the actual group membership of subjects in our original sample data. As we'll see when we look at the output in a few minutes, a table showing the actual group membership and the predicted group membership is given to us. This table includes the percentage of correct classifications based upon the predictors, along with the actual percentage of people in one group versus the other. Even though there's no rule of thumb regarding an acceptable rate of correct classifications, we certainly want to achieve a high percentage; when we get up into the 90s or above, that's considered excellent prediction.

So one consideration is that we want our hit rate to be high: we want our analysis to produce a high rate of correct classifications. Another consideration is the possible cost of misclassification.
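
The hit rate described above can be computed directly from the actual and predicted group codes. What follows is a minimal Python sketch, assuming groups are coded 0 (fail) and 1 (pass) as in the video's example. The ten cases are invented for illustration, and the function names `classification_table` and `percent_correct_by_group` are hypothetical, not part of SPSS.

```python
from collections import Counter

def classification_table(actual, predicted):
    """Cross-tabulate actual vs. predicted group and return (table, hit_rate)."""
    counts = Counter(zip(actual, predicted))
    groups = sorted(set(actual) | set(predicted))
    # table[a][p] = number of cases actually in group a and predicted as group p
    table = {a: {p: counts[(a, p)] for p in groups} for a in groups}
    # Hit rate = correctly classified cases (the diagonal) / all cases
    hit_rate = sum(table[g][g] for g in groups) / len(actual)
    return table, hit_rate

def percent_correct_by_group(table):
    """Percentage of each actual group that was classified correctly."""
    return {g: 100 * row[g] / sum(row.values())
            for g, row in table.items() if sum(row.values()) > 0}

# Invented data: 0 = fail, 1 = pass
actual    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

table, hit_rate = classification_table(actual, predicted)
print(table)
print(hit_rate)
print(percent_correct_by_group(table))
```

The off-diagonal cells are kept separate on purpose: the table shows not only how many cases were misclassified but in which direction.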
To a researcher, a hit rate of 90 to 95 percent may initially look really good, but what about the small percentage of cases that are classified incorrectly? There can be some pretty large ramifications if misclassifying an individual from group A into group B carries a greater cost than misclassifying someone in the other direction. For example, assume we wanted to classify subjects as either high or low risk in terms of developing heart disease, based upon family history and maybe some health behaviors. Obviously, labeling a person low risk when in fact they're high risk has a much greater cost than identifying that person as high risk when they are actually low risk. So it's important to realize that there are issues that can't necessarily be explained through our analysis but must be assessed and interpreted based upon your experience and knowledge as a researcher.

Before we get into our actual example, we need to talk a little bit about some of the assumptions of discriminant analysis. The first assumption, which is not unusual for inferential analyses, is that the measurements on the predictor variables have been randomly sampled and are independent of one another: it can't be the same person being measured multiple times; they have to be independent measurements. The predictor variables have to be normally distributed. We're also assuming that the variances of the predictors are equal across the groups, and that the relationships among all pairs of predictors within each group are linear.

In order to run the actual analysis in SPSS, we first need to talk about the initial step in this process, which is screening our data to make sure that it is normally distributed, making sure it's
complete and accurate, and we also want to screen it for any outliers. We can screen for outliers using techniques like the outlier labeling technique, or, similar to what we did with multiple regression, we can use Mahalanobis distances to identify outliers and then deal with them. That's an important first step in this process: making sure our data meets those basic assumptions of normal distribution and equal variances.

To run the analysis, then, assuming we have done those first steps and made sure the data is clean, accurate, and meets the assumptions, we go to the Analyze menu, then Classify, then Discriminant. First we move our outcome variable into the Grouping Variable box; this is where we're differentiating whether or not someone passed or failed the exam. Then we have our predictor variables, which include grade point average during the courses prior to taking the licensing or certification exam, and scores from some standardized exams; we want to determine whether those can help us predict the outcome of passing or failing. We move those predictor variables into the Independents box.

We have two options here: we can include all the predictor variables in the analysis at once, or we can have them entered step by step. Under Independents you can see "Enter independents together", which is the option we're choosing here, or you could choose the stepwise method, where the variables are entered one at a time.

The next thing we need to do is define the range of our grouping variable in order to tell SPSS which numbers we're using to define our groups. For a two-group situation like we have here, we're using 0 and 1, but if you
have more than two groups, you might have a larger range of possible values associated with the outcome. Once we've defined the range of our outcome, we click Continue.

Next we click Statistics. We want to see the means of each of the groups, the univariate ANOVAs, and Box's M. Box's M is going to help us determine whether we've met the assumption of equal variance among our groups. One more thing: under Function Coefficients we want to make sure we choose Fisher's as well. Fisher's is going to give us an idea of which function coefficients maximize discrimination between the groups, so that's a good thing to report as far as which variables help us maximize that prediction. Click Continue, then click Classify.

If the group proportions in the population (the proportions of people passing or failing this exam) are fairly equal, we can check "All groups equal" under Prior Probabilities. If they're unequal, we can pick "Compute from group sizes". This is the most common option, and we're going to choose it because we know that in the population there's not a 50/50 pass rate on this exam, so we want to make sure we account for that.

Under Display we want to choose "Summary table" and then "Leave-one-out classification", which will reclassify each case based upon the functions derived from all the other cases, excluding that case. Under Use Covariance Matrix we select the default, which is "Within-groups". Then, depending on the number of levels in your outcome, you may select one or more of the options under Plots. The combined-groups plot creates a histogram for two groups or a scatterplot for three or more groups, so this option is useful when the outcome has three or more levels. The separate-groups plot creates a separate plot for each group, and the
territorial map charts centroids and boundaries and is used only when the outcome has three or more levels. Because we only have two levels in our outcome, we're going to choose the combined-groups plot and click Continue.

Then we click Save so that SPSS will save those results for us. This last dialog box gives us options for saving some output for future analysis, and we've got three: predicted group membership, discriminant scores, and probabilities of group membership. It's very common to save the discriminant scores and the predicted group membership, so we're going to check those two and then click Continue. Now we're ready to run the analysis, so we click OK.

The interpretation of the output basically has four parts. First, we look at whether or not we've met the assumptions of the analysis; we're going to assume we've already done that, except for Box's test. Second, we look at significance tests and strength-of-relationship statistics for each discriminant function. Third, we look at the discriminant function coefficients. Lastly, we look at the group classification and how accurately we're able to classify subjects into specific groups.

The first table we can look at is labeled Group Statistics, and this gives us the descriptives for each of the groups on each of the predictor variables. For group 0, the people who did not pass the exam, we can see the mean scores for each of the predictor variables; we can see the same thing for group 1, the subjects who passed the exam; and then we can see the totals for the two groups combined.

The next table we can look at is labeled Tests of Equality of Group Means, and it also gives us some indication of whether the
people in the pass group versus the people in the fail group were significantly different: do they have mean scores that differ significantly on each of these predictor variables? So did the people who passed the exam have a greater or lesser GPA than the people who failed, at a statistically significant level? We can look at that for each of our five predictor variables. The column we want to look at is the Sig. column, which gives us the statistical significance of the difference in group means for each predictor. We can see here that for all the predictor variables except MAT, the pass group is significantly different from the fail group. That gives us some indication that these predictor variables do seem to be discriminating. For example, people who passed the exam had a statistically significantly higher GPA than people who failed, and we see the same for four of the five predictor variables; the MAT variable is trending, very close to statistical significance.

The next thing we can look at is Box's test. Box's M will give us an idea of whether or not we have equal variance among our groups. We look at the Sig. value within the Box's M test results: if it is greater than .001, we conclude that the group variances are equal; in other words, we've met the assumption of equal group variance. If this value is less than .001, we have unequal group variances. We can continue with the analysis, but we need to be very careful in how we interpret it and realize this now becomes a potential limitation of our analysis.

The next table we can look at is the Summary of Canonical Discriminant Functions, and this gives us an idea of how strong the relationship is between the predictor variables and the outcome we're trying to predict. One useful value here is
the actual canonical correlation for our predictor variables. We can take this value, 0.673, and square it, and that becomes an effect size similar to an R-squared value in multiple regression. Squaring it gives us an idea of the magnitude of the effect of the predictors on the outcome, so that's a useful thing to report.

The next box we can look at is labeled Wilks' Lambda, and this gives us the statistical significance of the prediction model: do the predictor variables predict the outcome at a statistically significant level? We look at the Sig. value again; if it's less than .05, we can say that this group of predictor variables makes predictions that are statistically significant in their accuracy. So again, that's a good thing, and we can say that we have a very strong model here.

The next table we can evaluate is labeled Standardized Canonical Discriminant Function Coefficients, and this gives us an idea of which of the individual predictor variables has the highest loading, or the greatest capability of predicting group membership. As we can see here, GPA has the highest value, followed by MAT and AR score, so those three seem to have fairly high loadings, especially GPA. These are the predictors that seem to have the most effect, or the best ability to predict group membership.

Below this is the next table, labeled Structure Matrix, and what we look for is consistency between the values here and the values in the table above. In the upper table GPA has a very high discriminant function coefficient, and here it also has a fairly high correlation coefficient. We can see that AR also has a fairly high value, and that GRE-V and GRE-Q also have fairly high values. So at least for these first two variables, GPA and AR, there's some
consistency between the discriminant function coefficient and the pooled correlation between these individual variables and the discriminant function. So there's a very strong consistency here: GPA and AR score seem to be the best predictors of group membership, and the others have less, but still fairly strong, predictive ability.

The last table we want to examine is labeled Classification Results, and this gives us an idea of how accurately our model was able to predict the actual results. We want to look at the cross-validated area of the table. The first percentage here is for the people who did not pass the exam, cross-referenced with the people we predicted would not pass: we're able to correctly predict about 78 percent of the people who did not pass using our predictor variables. In the next column over, we're comparing the people who passed the exam with the people we predicted would pass, and we're able to correctly predict about 87 percent of the people who actually passed.

This indicates a fairly high prediction rate. It appears we're a little better able to predict the people who pass than the people who fail, but all in all, using these five predictor variables, GPA and AR score appear to be the strong predictors, and this is a fairly accurate, statistically significant model for predicting group membership. If we know these five predictor variables, that will help us predict the outcome of passing or failing the exam.

To summarize, discriminant analysis is used when we want to predict membership in a group using multiple
predictor variables that tend to be quantitative. We then look for these uncorrelated combinations of variables that can help discriminate, or help us predict, group membership. Hopefully you were able to learn something from this video, and good luck using this technique in your own research.
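
The classification steps described in the captions (priors computed from group sizes, linear classification functions built from a pooled within-groups variance, and leave-one-out reclassification) can be sketched in a few lines of Python. This is only an illustration, not SPSS's implementation. It assumes a single predictor and two groups, the GPA values are made up, and the function names `fit`, `predict`, and `leave_one_out_hit_rate` are hypothetical.

```python
import math

def fit(xs, ys):
    """Build one-predictor linear classification functions, one per group.

    Returns {group: (coefficient, constant)}; a case is assigned to the group
    whose function coefficient * x + constant is largest.
    """
    groups = sorted(set(ys))
    n = len(xs)
    means, priors, ss_within = {}, {}, 0.0
    for g in groups:
        vals = [x for x, y in zip(xs, ys) if y == g]
        means[g] = sum(vals) / len(vals)
        ss_within += sum((v - means[g]) ** 2 for v in vals)
        priors[g] = len(vals) / n            # priors computed from group sizes
    pooled_var = ss_within / (n - len(groups))  # pooled within-groups variance
    return {g: (means[g] / pooled_var,
                -means[g] ** 2 / (2 * pooled_var) + math.log(priors[g]))
            for g in groups}

def predict(funcs, x):
    """Assign x to the group with the highest classification score."""
    return max(funcs, key=lambda g: funcs[g][0] * x + funcs[g][1])

def leave_one_out_hit_rate(xs, ys):
    """Classify each case using functions fitted on all the other cases."""
    hits = 0
    for i in range(len(xs)):
        funcs = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if predict(funcs, xs[i]) == ys[i]:
            hits += 1
    return hits / len(xs)

# Made-up GPA values: 0 = failed the exam, 1 = passed
gpa    = [2.1, 2.4, 2.6, 2.8, 3.3, 2.7, 3.0, 3.2, 3.4, 3.5, 3.6, 3.8]
passed = [0,   0,   0,   0,   0,   1,   1,   1,   1,   1,   1,   1]

print(fit(gpa, passed))
print(leave_one_out_hit_rate(gpa, passed))
```

With several predictors the pooled variance becomes a pooled within-groups covariance matrix, which is the step a statistics package handles for you.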
Info
Channel: TheRMUoHP Biostatistics Resource Channel
Views: 155,532
Rating: 4.8240919 out of 5
Keywords: statistic
Id: 7zYcMZ-61c4
Length: 24min 4sec (1444 seconds)
Published: Mon May 06 2013