How to Use SPSS: Replacing Missing Data Using Multiple Imputation (Regression Method)

Captions
In this video I will give a brief overview and a demonstration of how to perform a technique for dealing with missing data called multiple imputation. This technique is used when we have data that may be missing due to non-response or dropout of subjects and we want to maintain sample size, or when the data is missing in a systematic way that may introduce bias. In those situations it's in our best interest, if we want to make a sound hypothesis decision, to replace the data in some way.

In other videos I talked about techniques for dealing with missing data called trimming and winsorizing, in which we basically remove subjects that have missing data. That's certainly appropriate when we have small amounts of missing data, usually less than 5%, but with large amounts of missing data trimming or winsorizing is not appropriate, because it may significantly reduce the sample size. Instead, we can try to replace the missing data, and the more common and probably more accepted technique these days is multiple imputation. It is also used a lot with longitudinal data, where we have multiple data points and may have subject mortality or dropout, and we want to do an intention-to-treat analysis that incorporates the missing data.

What multiple imputation does is run simulations on the missing data, relative to the data that is available, in an attempt to replace the missing values with values that would most likely be similar to the observed data. It looks at patterns in the available data, makes a probability judgment about what the missing values would most likely be, and then replaces those missing values with imputed values in order to create a full data set.

The data set we're going to work with in this demonstration is hypothetical. It has several variables, both numerical and categorical, and it contains missing values. We're going to go through the step-by-step process of first analyzing the missing data to determine whether there are patterns or whether the missingness is random, and then actually computing the replacement values using a given method; we'll talk about the different options as we go.

Our first step, then, is to evaluate the missing values, in other words to determine whether there is some kind of pattern to the missing data points. Go to the Analyze menu, click on Multiple Imputation, and then click on Analyze Patterns. We're going to pretty much leave the defaults alone. As I mentioned, this will look for patterns in the data to try to determine whether the missing data is systematic or random; if it's systematic, one multiple imputation technique will be used, and if it's random, another will be, and SPSS will help us make that decision. So our first step is to select all of the variables except the ID variable, since that's just an identifier that we don't want analyzed.
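For anyone who prefers syntax over the menus, the Analyze Patterns dialog pastes a command along these lines. This is a minimal sketch, assuming hypothetical variable names v1 through v17 and the 0.01% minimum-percent-missing setting we choose in a moment:

  * Pattern analysis only: METHOD=NONE requests missing-value summaries without imputing.
  * v1 TO v17 are placeholder names for the 17 analysis variables.
  MULTIPLE IMPUTATION v1 TO v17
    /IMPUTE METHOD=NONE
    /MISSINGSUMMARIES OVERALL VARIABLES(MAXVARS=25 MINPCTMISSING=0.01) PATTERNS.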
However many variables you have, select them all and move them over into the Analyze Across Variables box. In the output, SPSS will display, or we can designate, how many variables it shows simultaneously; we only have 17 variables and the default is to display 25 simultaneously, so we leave that alone. The next thing we need to tell SPSS is to display a variable with missing data only if it meets a certain criterion; the default is that only variables with 10% or more missing values will be analyzed and displayed in the output. It's generally good practice to review all the patterns of missing data. Some people will say that if a variable is missing less than 5% of its values they don't need to worry about it; that's one rule of thumb, because with less than 5% missing we would just trim and move on. But as I mentioned, it's good practice to evaluate all the missing data, even below 5%, so we can see whether there is some sort of pattern. So we change this minimum percentage missing to 0.01 percent; even if a variable is missing just a small handful of data points, it will still be analyzed, and we can then decide how to deal with it. Once we've made that change, go ahead and click the OK button.

Now we'll see the output from this analysis, and there are several things to look at. First we see a series of three pie charts. The left-hand pie chart displays the number and percentage of variables with missing values, in other words how many variables have missing data; the green represents incomplete data, and here all 17 variables have some missing data. The center pie chart shows the cases, that is, how many subjects are missing at least one value, and it appears that about 860, a little more than half of the subjects, are missing at least one value. The right-hand pie chart, labeled Values, indicates that about 5% of all values are missing: if we take our 17 variables and multiply by our 1,500 cases, there are 25,500 potential values, and according to this chart about 1,246 values are missing throughout the entire sample, just a little short of 5%.

So it looks as though all the variables have some missing data, more than half of the individual subjects are missing at least one data point, and of all the possible cells or values available to us, a little less than 5% are incomplete. Some people might say that with less than 5% missing they don't really need to do anything about it and would just trim and move on, and with a sample of 1,500, trimming a certain percentage might not seem like a lot. But when we think about the values, about 1,200 of them are missing, which could indicate some sort of pattern, because that's a large number of pieces of data. And even though it may seem like a small percentage of
pieces of data, that missingness still touches more than half of our subjects, so that's a problem; we're probably going to have some level of bias here.

The next thing to do is move on to the variable summary table and examine it to get a feel for whether there's some sort of pattern. The variable summary lists every variable with at least 0.01 percent missing values, and for each of our 17 variables it shows the number of missing values, the percentage missing, the number of valid values, and the mean and standard deviation based on the valid values. You'll notice the variables are ordered by percentage missing: extraversion is listed first because it has the highest percentage of missing values, about five and a half percent, and neuroticism is listed at the bottom of the table because it has the lowest. This gives you an idea of which variables are missing the most values, along with some descriptive information on the means and variances of the valid values.

The next thing to examine is the missing value patterns chart, which is displayed to give us a chance to examine whether there is some pattern to the missing data. Each row represents a pattern, in other words a group of cases with the same pattern of missing values; the patterns, or groups of cases, are displayed based on where the missing values are located relative to each variable. The variables along the bottom, the x-axis, are ordered by the amount of missing values each one contains: neuroticism has the lowest percentage of missing values and is therefore listed first, on the left-hand side of the x-axis, while extraversion, with the largest percentage of missing values, is listed last, to the right. For example, the first pattern, the one listed at the top, is pattern one, which contains no missing values, and the second pattern reflects only cases with missing values on the neuroticism variable.

What we're looking for here is something called monotonicity, in other words a rigid pattern of decreasing or increasing missingness across the sequence, which would show up as a pattern in the organization of the red lines that indicate missing values. If we had a concentration of red lines in the upper left and another concentration in the lower right, all touching, that would indicate some sort of pattern, in other words monotonicity. Here, although we do have some concentration of red to the upper left and some to the lower right, we also have little patches, islands, of non-missing values in between. Those islands of non-missing values, and the fact that the red lines look randomly arranged in the center of the graph with no real organization, indicate that the missing values are probably missing in a random pattern; there is no systematic pattern to how these values are
missing. That's a good thing, because it minimizes the chance of bias in our missing values; in other words, there isn't one question that most of the subjects didn't answer, or a series of questions that subjects were not responding to. So there is randomness to the missing values. If there were non-randomness, that would push us in one direction as far as the options we choose for the multiple imputation; since it's random, that moves us toward another option for how we do the imputation.

The next graph as we move down the output is the pattern frequencies graph. It shows that the first pattern, the one in which no values are missing on any variable, is the most prevalent; that's what's indicated by the large bar. In other words, the most common pattern is that no values are missing across all the variables. The other patterns, the particular combinations of missing values across the variables, are much less prevalent, but they're roughly equal to one another, in other words fairly consistent.

So we've analyzed our missing values and determined a couple of things. First, there are obviously missing values, and a fairly large number of them: more than half of our subjects are missing at least one value, and there are missing values in all 17 of our variables. Second, there doesn't appear to be a pattern to how the values are missing; it appears to be a random arrangement across all the variables, and by far the most common pattern is no missing values, though a number of patterns show missing values across multiple variables. That tells us we should probably move on to multiple imputation and replace those missing values so we can fill out our data set.

Our second step, then, is to impute the missing data values. What multiple imputation actually does is go through a process of simulating the missing data so that it matches up, or best fits, with the data that is available. It goes through several iterations, producing several candidate sets of new data to replace the missing data, and tries to find an iteration that creates the best fit; it might do this three times, five times, or a hundred times, until it finds values that seem to fit best with the values that are present. In order for it to run these multiple simulations, or iterations, we need to give SPSS some instructions on how to generate the random draws. This is referred to as setting the random seed, in other words giving it guidelines and parameters for how to randomly generate the numbers that create these iterations for the missing values. So we go to the Transform menu and choose Random Number Generators; this dialog tells SPSS how we want it to develop these iterations, these models, for the missing data.
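For reference, the seed setup we're about to do in the dialog can also be done in syntax. A minimal sketch, assuming the Mersenne Twister generator and the dialog's usual fixed starting value of 2000000 (if your dialog shows a different default, use that instead):

  * Activate the Mersenne Twister and fix its starting point
  * so the imputation draws are reproducible from run to run.
  SET RNG=MT MTINDEX=2000000.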
The first thing we want to do in this dialog is check the box that says Set Active Generator, and then choose the Mersenne Twister, which is a random number generator program. The next thing to do is Set Starting Point, choose Fixed Value, and keep the value at its default. Once that's done, we click OK.

Now we can actually conduct the multiple imputation. We start by going back to the Analyze menu, choosing Multiple Imputation, and then choosing Impute Missing Data Values. We'll see a dialog box similar to others we've seen in SPSS. The first thing we need to do is select the variables. We know that all 17 variables have missing values, so we want to do multiple imputation for all 17. If only a handful of variables had missing values, we would move just those over, but in this case we move all 17, again except the ID variable, because it's not data, just an identifier. So we highlight all of these and move them into the Variables in Model box.

Now we want to create a new data set from this imputation, so we make sure to click Create a New Dataset and name it; this data set will contain the newly generated, or imputed, values in place of the missing values. You'll notice the Imputations box lists 5, which is the default, meaning five imputations, or five simulations, will be performed in sequence. During each imputation the missing values are imputed, in other words generated, to create a model that fits, and at the end of the imputations, in this case five, the results are pooled to take into account the variance in the imputed values. This is why the procedure is called multiple imputation: we end up with one set of results, but those results are aggregates of multiple imputed values. Imagine that the fourth case, the fourth subject, is missing the extraversion score. That score will be imputed, or simulated, five times and stored in the data sets, and those five values are then combined, with the resulting pooled estimate standing in for the missing value in the primary analysis for the study. Each subject gets that process for each variable on which they are missing data. It's important to realize that multiple imputation is a strategy, or a process; there are many methods of going about it, and the one shown here is not necessarily the best one. We'll talk about that when we actually choose our method.

What we want to do now is name our data set, so we'll call it "new imputed data," and then click on the Method tab, which is where we choose the method. SPSS has two methods available: the Markov chain Monte Carlo method and the monotone method. We're going to use the Automatic setting, which scans the data for monotonicity; it will pretty much do what we did with the missing value patterns chart, where we looked at the red
bars and the white bars. If the automatic scan discovers monotonicity, in other words a pattern to the missing data, it will use the monotone method; otherwise, if there is randomness to the missing values, it will default to the Markov chain Monte Carlo method. You can choose the Custom method if you prefer; with the Monte Carlo method you can change the number of iterations to try to increase the likelihood of getting the correct model, in other words attaining convergence, meaning the estimates are no longer fluctuating more than a small amount from iteration to iteration. In most cases, with randomly missing values, we can get there in five iterations or so. You can also choose the model you wish to use, the most common being a linear regression model for the Monte Carlo iterations, or, if you feel you have monotonicity, you can choose the monotone method. We're going to let SPSS do the heavy lifting here: we choose the Automatic setting and let SPSS decide whether the Monte Carlo method or the monotone method should be used.

The next thing to do is click on the Constraints tab, and the first thing to do there is click the Scan Data button. This gives us descriptives for each of our variables: the percent missing and the observed minimum and maximum scores in our data. We can use this information in the Define Constraints table. For some variables there is a finite range of valid values; if we're measuring something like body fat percentage, we know only certain values are plausible, and when the process is done we want SPSS to impute only valid values. We don't want to end up with a body fat percentage of 105, so we can indicate what the minimum and maximum values should be. Categorical variables are automatically dummy coded, so setting a minimum or maximum for them is not really necessary.

What we need to do is go through the table and make sure the observed minimum and maximum values are within a range considered acceptable and normal. If we happen to find minimum or maximum values outside an acceptable range, we can go down to the Define Constraints box and define what we want the minimum and maximum to be. We can also have the program round values: if we've got fractional values, we can tell it to round in a certain way. In the case of some variables in the Define Constraints box here, like how many drinks someone has per week or how many absences someone has from school or work, we know the minimum could actually be zero, so we want to make sure SPSS recognizes that zero is the minimum possible score.

Another thing we can do is define the role of each variable, in other words whether we want to use it as a predictor or as an outcome to be imputed. That's typically only going to matter when we're using the linear regression model, so we don't really need to worry about it for the automatic option; we can leave it as it is, and it doesn't change the process very much.
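Pulling the choices on the Variables, Method, and Constraints tabs together, the dialog's Paste button would produce a command roughly along these lines. This is a sketch only: the variable names and constraint values are hypothetical, and the two output dataset names anticipate the Output-tab step described next:

  * Impute all 17 variables, automatic method, 5 imputations; names are placeholders.
  MULTIPLE IMPUTATION v1 TO v17
    /IMPUTE METHOD=AUTO NIMPUTATIONS=5
    /CONSTRAINTS drinksperweek (MIN=0 RND=1)
    /CONSTRAINTS bodyfat (MIN=0 MAX=60)
    /IMPUTATIONSUMMARIES MODELS DESCRIPTIVES
    /OUTFILE IMPUTATIONS=newImputedData FCSITERATIONS=iterationHistory.

The FCSITERATIONS dataset holds the iteration history and is written when the fully conditional specification (MCMC) method ends up being used.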
Our next step is to choose the Output tab, where we get a chance to specify what is displayed in the output window. We want to make sure we choose Descriptive Statistics for Variables with Imputed Values, and we also want to Create Iteration History. This will give us descriptive information for each iteration the process goes through: since we have five imputations, we're going to see what the imputed values look like, in terms of mean and variance, for each iteration, and we'll also see the pooled means and standard deviations. It's instructive to look at these iterations to see how much they vary, not only from the original data but from the pooled data; it gives us an idea of how much each iteration varies from the others and how accurate the iterations may be. If we see very little difference from iteration to iteration, that indicates the pooled values we come up with at the end of the process are probably going to be quite accurate, with very little error in the imputed values relative to what the actual values might have been had they not been missing.

We also want to create a new data set for the iteration history, so we make sure that's selected and name it; in this case we'll name it "iteration history." Again, this gives us output showing the means and variances for each of the different iterations. Once you've completed that, go ahead and click the OK button. If you look down at the lower right, just above the status bar, you can see that it's running the multiple imputations, and next to it you'll see which imputation it's on as it runs. It's going through multiple simulations, potentially many iterations, for each of the five imputations, so it may take a while to finish and produce the output.

When the process finishes, let's go back and look at our data. You'll notice we have two new data files, one called "new imputed data" and one called "iteration history," and if you look at the output, you can see quite a lot of new output. First let's take a look at the imputed data set, so click on "new imputed data." You'll notice several differences in the Data Editor in this file compared with the original data; three things stand out. First, the missing values we saw before are now filled in. Second, there's a new variable on the far left called Imputation_, which identifies whether a row belongs to the original data or to a particular imputation: 0 represents the original
data, 1 represents what the data looked like in the first imputation, 2 the second, and so on. Third, there's a little cube icon, yellow and white, with a drop-down menu that lets us move back and forth between the original data and each of our five imputations. If I choose 1, the editor jumps to the data from the first imputation, and the yellow-shaded cells are where the missing data was replaced; this is the first imputation of that missing data. We have age replaced here for this subject, and class standing, as well as a rating in this cell. Now the data is filled in, and we can move through each of the imputations and see how the data differs slightly in each one. The Imputation_ variable labels each set of data: the first set, labeled 0, is the original 1,500 cases with missing values still present, and each set of imputed data follows. As I mentioned, SPSS marks the cells containing imputed values by highlighting them, so we can quickly identify which set of data we're looking at.

Another important thing to notice about this imputed data file is that the Data Editor is aware it is an imputed data file. When you click Analyze and run an analysis, which we'll do in a second, you'll notice that many of the analysis options are compatible with imputed data, but some are not. Each analysis or function that shows an icon looking like a cube with a swirl next to it will automatically be run using the aggregated imputed data. In other words, SPSS knows we have five sets of imputed data, and it will aggregate them to produce pooled estimates and run the analysis on those.

Let's do a quick example to show what I mean by running a t-test on this data. Go to Analyze, then Compare Means, and you'll see the little grid with the swirl next to it, which means this imputed data can be analyzed with these techniques. Let's run an independent-samples t-test comparing males and females: sex is our grouping variable, our independent variable, using 1 and 2 as the grouping codes, and the outcome we'll test them on is number grade, which we find at the bottom of the list. We click OK, and in the output we see something a little unique: group statistics for the original data, so here are the group means for the original data; then the group means for each of the different imputations, one through five; and lastly the pooled data. This is where SPSS took the five different imputations and pooled the means and variances to come up with the pooled mean and pooled standard error, and these are the actual values SPSS will use in the independent-samples t-test. As you page down, you'll see that it does the same thing with the t-test itself, running one for each imputation (the equivalent syntax is sketched below).
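A minimal syntax sketch for the test, assuming sex is coded 1 and 2 and the outcome is a variable named numbergrade (both hypothetical names), run while the imputed dataset is active so the per-imputation and pooled results are produced:

  * Independent-samples t-test; SPSS pools across imputations
  * automatically because the active dataset is split by Imputation_.
  T-TEST GROUPS=sex(1 2)
    /VARIABLES=numbergrade
    /CRITERIA=CI(.95).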
Paging down, we see a t-test for each of the five imputations, but the one we need to focus on and report is the one for the pooled imputations. Looking at the t-value for this pooled imputed data, assuming equal variances, we have a t of -1.948, which gives a p-value greater than 0.05, so depending on how we're doing our hypothesis testing we would say these two groups are not significantly different. Now, if we look at the original data, with all those missing values, the result actually would have been considered statistically significant. So replacing the missing data with imputed data made a difference in this case in whether we accept or reject the null hypothesis, which is a good example of how imputed data can change how we interpret the outcome of a study. That, again, is how we would actually use the imputed data in an analysis.

Now let's go ahead and review the output created by the imputation process. The first thing we'll see, labeled Multiple Imputation, is the imputation specifications; this is basically a summary of what we told SPSS to do, in other words how we wanted it to analyze and impute the data. The second table, the imputation constraints, simply lists what we specified on the Constraints tab before running the process. Together these give us a record of how we set up the imputation, so we can keep track of it, and if we ran additional imputations with slightly changed parameters, we'd have a record of what we did or didn't do.

The next two tables, the imputation results and the imputation models, display what occurred during the imputation process; they give us a record of what SPSS did in each imputation sequence. Looking at the imputation models, neuroticism, for example, had 58 missing values and the process produced 290 imputed values, which makes sense: 58 values imputed in each of the 5 imputations gives 58 × 5 = 290. We can also see what the model looks like for each variable. For the sex variable, for instance, it ran a linear regression using the other variables as possible predictors of what the imputed value for sex should be; it looks at the relationships among those variables and comes up with a prediction of the missing value for an individual who has values on the other variables. It makes that prediction five times, producing five different values of what the missing value could be, and the results are then pooled for further analysis.

The next thing we can look at is the descriptive information; I'll page down to it. Here are the descriptive statistics for each of our variables, in this case sex and age. For sex, we can see the number of missing values in each of the imputations, and the proportions of males and females in the complete data after the imputations; you can see the proportions actually change very little from imputation to imputation, and these are what get pooled in any analysis. If we look at age, which is a numeric variable, we see the same thing: we have a mean and
a standard deviation, with the mean shown for the original data and after each of our imputations. Once we run an analysis, SPSS will give us a pooled mean or pooled frequency to use, whether we're doing a chi-square, a t-test, or an ANOVA, and those are the values we would report.

So that's pretty much it; that's the process. Many analyses in SPSS are able to handle imputed data and offer pooled output, in other words showing the results of the analysis for each imputation set along with a pooled result, as we saw in our t-test example.

To summarize, we can use multiple imputation to replace missing values in our data. If we want to maintain a certain sample size, or if we feel that trimming subjects with missing values from our data set might create bias or a non-representative sample, we can use multiple imputation to replace those missing values. It gives us reasonable assurance that the values replaced through the imputation process are appropriate and consistent with the data that isn't missing, and it does a fairly good job of predicting what the actual value may have been for that person before replacing the missing value with the imputed one.

It's a fairly straightforward process. It obviously has multiple steps we have to go through, but it can be done, and it is becoming well accepted. It does take a little more documentation; you do have to report quite a bit more when you use this process. I'll also include some published references that go along with multiple imputation if you'd like to do some more in-depth reading. I think you'll find it a great way to salvage missing data, as well as to determine whether there are any patterns to the missing data, which can be important as you describe your sample or the population you may be studying. Hopefully you learned something in this video, and please let me know if you have any questions or comments about this presentation.
Info
Channel: Biostatistics Resource Channel
Views: 337,099
Keywords: statistics
Id: ytQedMywOjQ
Length: 45min 1sec (2701 seconds)
Published: Thu Mar 28 2013