IBM SPSS - Missing data: How to identify and handle missing data in SPSS?

Captions
Hi everyone! In this video, we will discuss how to handle missing data using SPSS. In many cases, our survey may come back with missing values. There are many reasons for missing values: questions are not applicable to the respondents, respondents skip questions, respondents do not want to reveal sensitive information, and so on. In any case, we need to examine and take care of missing values before performing data analysis. Missing values can cause loss of information or skew the data. Some analysis techniques, such as multiple regression, require that missing values be handled before we can run the analysis. In other words, they require complete data, and missing values will cause errors in the analysis process.

First, we need to examine the missing values carefully to make the necessary decisions. In some cases, only a very small percentage of the dataset is missing. If, after examining the cases with missing values, we find that they are random and will not affect the analysis, then it could be safe to delete them. In other cases, deleting a variable may help solve the problem. For instance, if most missing values occur in one specific variable, then we can remove that variable. On the other hand, when deleting cases with missing values would reduce the sample size significantly, imputation is needed to keep the cases. Either way, it is important to always examine the missing values carefully before deciding what to do with them.

In this video, we will examine the missing values in the HBAT dataset. This dataset has 14 variables (we will not consider the id variable); 9 of them are Scale and 5 of them are Nominal. Looking at the data, we see several missing values. We will use SPSS to examine the missing values and decide whether to delete or impute them. SPSS has a very useful tool for missing value analysis.
We select Analyze from the menu, and then select Missing Value Analysis. First, we move ID to the Case Labels box; this allows SPSS to identify each case by its ID. The first 9 variables are Scale variables, so we move them to the Quantitative Variables box. Then we move the other 5 variables to the Categorical Variables box. We select the Patterns button. In the Patterns window, all boxes are unchecked by default. We check the Tabulated Cases box and the Cases with Missing Values box. These will show us the patterns of missing values. Click Continue. Then we select Descriptives and check the Univariate Statistics box. Click Continue, and then click OK.

Here we have the Missing Value Analysis output. First, Univariate Statistics shows us a summary of the missing values for each variable. For example, variable v1 has 21 missing values, accounting for 30% of the data; variable v2 has 13 missing values, accounting for 18.6% of the data; variable v3 has 17 missing values, accounting for 24.3% of the data; and so on. So we can conclude that v1 has the most missing values, followed by v3 and then v2.

Let’s look at the Missing Patterns table. This table shows the individual cases with missing values for each variable. We can see clearly in this table that v1 has the most missing values, followed by v3. The table also shows which specific cases (by case ID) have missing values. We can see that six cases (245, 233, 261, 210, 263, and 214) each have 7 missing values, meaning that 50% of their 14 variables are missing. So we can consider deleting these cases.

Next, let’s examine the Tabulated Patterns table, which is about variables. We will look at the variables and what happens if we delete any of them. How do we interpret the patterns? The first row shows 26 cases with no missing values in any variable. The last column shows the number of complete cases we would have if the variables missing in that pattern were deleted. In other words, if we do not delete any variables, we have 26 complete cases. Let’s examine row 2.
If we delete v3, we will have 27 complete cases (one more than in the first row). Let’s examine the pattern in the next row. If we delete both v1 and v3, we will have 37 complete cases, which is 11 more than in the first pattern, but we have to delete two variables. In the next row, if we delete only v1, we will have 32 complete cases. Thus, deleting this one variable gets us 6 more cases than the first pattern in row 1. That means it is not a good choice to delete both v1 and v3: doing so will affect the factor structure and the model, but gives us only 5 more cases than deleting v1 alone. The better choice is to delete v1 only. That is how to interpret this table.

Thus, by examining these two tables, we conclude that we can delete the six cases with a high number of missing values, and variable v1, which has the most missing values. For this example, we will make these changes. Here is the data after deleting the six cases and variable v1. As you can see, v1 is not in the data anymore. Similarly, the six cases are deleted from the data, and we have 64 cases left.

Let’s examine the missing value patterns again to see whether they have improved. We run the Missing Value Analysis again with this new dataset, keeping all options the same. The new Univariate Statistics output shows that variable v3 has the most missing values, accounting for 21.9%, followed by v2 with 10 missing values, or 15.6%. That is a great improvement over the original data. Then we examine the Missing Patterns table. Most cases have only 1 or 2 missing values, and the highest percentage is 15.4 percent. This looks better. We also need to examine Tabulated Patterns. We have 32 complete cases with no missing values. If we delete v3, we will have 37 complete cases, which is only 5 more. So we decide NOT to delete any more variables. Thus, we have improved the missing value situation by deleting six cases and one variable.
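The three summaries discussed above (missing values per variable, missing values per case, and complete cases if a variable is dropped) can be sketched outside SPSS in a few lines of Python. This is only an illustration under stated assumptions: the rows below are invented toy data, not the HBAT dataset, and the function names are hypothetical helpers, not SPSS features.

```python
# Toy data only: None marks a missing value. These rows are invented for
# illustration and are NOT the HBAT dataset from the video.
rows = [
    {"id": 1, "v1": 4.0, "v2": 2.5, "v3": 7.1},
    {"id": 2, "v1": None, "v2": 3.0, "v3": 6.4},
    {"id": 3, "v1": None, "v2": None, "v3": None},
    {"id": 4, "v1": 5.2, "v2": 2.8, "v3": None},
    {"id": 5, "v1": None, "v2": 3.3, "v3": None},
    {"id": 6, "v1": 4.7, "v2": 3.1, "v3": 6.9},
]
variables = ["v1", "v2", "v3"]


def missing_by_variable(rows, variables):
    """Return {variable: (count missing, percent missing)}."""
    n = len(rows)
    summary = {}
    for v in variables:
        count = sum(r[v] is None for r in rows)
        summary[v] = (count, 100.0 * count / n)
    return summary


def missing_by_case(rows, variables):
    """Return {case id: number of missing values in that case}."""
    return {r["id"]: sum(r[v] is None for v in variables) for r in rows}


def complete_cases(rows, variables, dropped=()):
    """Count cases with no missing values once `dropped` variables are ignored."""
    kept = [v for v in variables if v not in dropped]
    return sum(all(r[v] is not None for v in kept) for r in rows)


print(missing_by_variable(rows, variables))
print(missing_by_case(rows, variables))
print(complete_cases(rows, variables))                  # no variable dropped
print(complete_cases(rows, variables, dropped=("v1",)))  # drop v1 only
```

Comparing `complete_cases` with and without a dropped variable mirrors the reasoning used with the Tabulated Patterns table: a variable is worth deleting only if it recovers enough complete cases to justify losing it.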
The extent of missing values has decreased. But we still have missing values, and we need to handle them. Let’s discuss the imputation process. I would suggest reading the more detailed guidelines on this process in the textbook (Hair et al.). Basically, we should only impute missing values if the missing value process is random. To determine whether it is random, we use the Missing Completely At Random (MCAR) test. This test will tell us whether it is safe to impute the missing values without affecting the results of the analysis.

To run this test, we go back to the Missing Value Analysis window in SPSS. Select Descriptives, check the t-test box, and click Continue. Then check the EM box, which will give us the MCAR test, and click OK. This is the output. Some information is the same as before. Let’s scroll down.

The Separate Variance t Tests table shows the mean values for cases where a variable is present (valid) and for cases where it is missing. The t-value compares the means of these two groups on each of the other variables. It allows us to evaluate the randomness of the missing data through a group comparison between missing and valid data. We need to examine the p-value of each t-test. If the t-test is not significant, there is no difference in means between the valid cases and the cases with missing values. The results show some concern with v2, since the p-value is less than 0.05 in three comparisons, with v4, v5, and v6, indicating a group difference on these variables. All t-tests for v3 are non-significant, indicating no group difference for this variable. That is also true for the remaining variables. So the only variable of concern is v2.

But before deciding whether to delete v2, we need to examine the MCAR test. We scroll down further to EM Estimated Statistics, where the MCAR test is shown. The result shows that the MCAR test is not significant (p-value = 0.583).
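The group comparison behind the separate-variance t-test can be sketched as follows: split the cases by whether one variable (here v2) is missing, then compare the two groups' means on another variable (here v4) with a Welch t statistic. This is a minimal sketch, not SPSS's implementation; the data are invented, and `split_by_missingness` and `welch_t` are hypothetical helper names.

```python
import math
from statistics import mean, variance

# Invented toy data: None marks a missing value. Not the HBAT dataset.
rows = [
    {"v2": 3.1, "v4": 1.0},
    {"v2": None, "v4": 2.0},
    {"v2": 2.7, "v4": 2.0},
    {"v2": 3.4, "v4": 3.0},
    {"v2": None, "v4": 4.0},
    {"v2": 2.9, "v4": 4.0},
    {"v2": None, "v4": 6.0},
    {"v2": 3.0, "v4": 5.0},
    {"v2": 3.2, "v4": None},  # v4 itself missing: excluded from both groups
]


def split_by_missingness(rows, indicator_var, test_var):
    """Split observed test_var values by whether indicator_var is present."""
    present, missing = [], []
    for r in rows:
        if r[test_var] is None:
            continue
        (missing if r[indicator_var] is None else present).append(r[test_var])
    return present, missing


def welch_t(group_a, group_b):
    """Welch (unequal-variance) t statistic and Welch-Satterthwaite df."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances
    se2 = va / na + vb / nb
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


present, missing = split_by_missingness(rows, "v2", "v4")
t, df = welch_t(present, missing)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The p-value would then come from the t distribution with `df` degrees of freedom; the standard library has no t distribution, so a statistics package is needed for that last step (for example, `scipy.stats.ttest_ind` with `equal_var=False` performs the whole Welch test).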
Basically, a non-significant MCAR test indicates that the missing value process is random, and we can perform data imputation. On the contrary, if the MCAR test is significant, imputation is not recommended, because the missing values are not random and may affect the results. In that case, other methods, such as modeling, may have to be used. In this example, the MCAR test is non-significant, indicating that the missing values are random. So we decide NOT to delete v2, and we proceed to the imputation process.

Let’s perform some data imputation with this dataset. We select Transform from the menu, then select Replace Missing Values. First, we must select the imputation method. It could be Series Mean, Mean of nearby points, or Linear regression. Note that we can only perform imputation with quantitative variables. The most common imputation method is replacing by the mean. So we select Series Mean, then highlight v2 to v9 and move them to the New Variables box, and click OK. If we go back to our data, we see that new variables have been added, and they are labeled with SMEAN, indicating that we used the mean method. We can also try another method, such as Linear trend or regression. And here are the new variables for that method. Please read the textbook to learn more about the differences among these methods.

I want to show you another tool for data imputation provided by SPSS. It is Multiple Imputation. This tool allows us to choose a specific imputation method or let SPSS choose for us, and then generate and compare multiple sets of imputed values. To do that, we select Analyze from the menu, then select Multiple Imputation, and choose Impute Missing Data Values. We create a new dataset called Imputed and select 5 imputations. Under the Method tab, we check Automatic. Any constraints can be set under the Constraints tab. Under the Output tab, we check Imputation Model and Descriptive Statistics, and click OK.
The output shows that SPSS selected Linear regression as the imputation method and created 5 imputation models. The Descriptive Statistics table shows information on these 5 models for each variable. Let’s go back to the data. Five sets of imputations were created, and SPSS added a new variable, Imputation_. This is a categorical variable identifying the five sets of imputations; each of them has 64 cases. The purpose is that when we run our analysis, we can run it on all five sets of data and compare the results across these sets. This method is time consuming and is not very common in academic research. Typically, for datasets with many missing values, this tool can be used to make sure we get similar results across the imputed datasets. You are not expected to use this method in this course, but I wanted to introduce this tool to you.

That is how to examine and handle missing values using SPSS. Thank you and bye now.
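For reference, the Series Mean replacement used in the video amounts to filling each missing value with the mean of the observed values of the same variable. A minimal Python sketch follows, assuming missing values are stored as `None`; the function name `impute_series_mean` and the example values are hypothetical and are not taken from the HBAT data.

```python
from statistics import mean


def impute_series_mean(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [x for x in values if x is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    fill = mean(observed)
    return [fill if x is None else x for x in values]


# Invented example values for one quantitative variable.
v2 = [2.0, None, 4.0, 6.0, None]
v2_filled = impute_series_mean(v2)
print(v2_filled)  # [2.0, 4.0, 4.0, 6.0, 4.0]
```

Like the SPSS dialog, the sketch returns a new series and leaves the original untouched, so the imputed and original values can be compared side by side.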
Info
Channel: Dothang Truong
Views: 11,462
Id: 7RtA9TKVPno
Length: 16min 9sec (969 seconds)
Published: Sun Jun 05 2022