Hi everyone! In this video, we will discuss how to handle
missing data using SPSS. In many cases, our survey may come back with
missing values. There are many reasons for missing values,
such as: questions are not applicable to the respondents, respondents skip questions,
or respondents do not want to reveal sensitive information. In any case, we need to examine and take
care of missing values before performing data analysis. Missing values can cause loss of information
or skew the data. Some analysis techniques, such as multiple
regression, require that missing values be handled before we can run the analysis. In other words, they require complete data,
and missing values will cause errors in the analysis process. First, we need to examine the missing values
carefully to make the necessary decisions. In some cases, only a very small percentage
of the entire dataset is missing. If, after examining the cases with missing values,
we find that they are random and will not affect the analysis, then it could be
safe to delete them. In other cases, deleting a variable
may help solve the problem. For instance, if most missing values occur
in a specific variable, then we can remove that variable. On the other hand, when
deleting cases with missing values would reduce the sample size significantly, imputation is needed
to keep the cases. Nonetheless, it is important to always examine
cases of missing values carefully before making a decision on what to do with them. In this video, we will examine the missing
values in the HBAT dataset. This dataset has 14 variables (we will not
consider the ID variable); 9 of them are Scale and 5 are Nominal. Looking at the data, we see several missing
values here. We will use SPSS to examine the missing values
and handle them. We will make decisions on deleting or imputing
the data. SPSS has a very useful tool for missing value
analysis. We select Analyze from the menu, and then
select Missing Value Analysis. First, we move ID to the Case Labels box; this
allows SPSS to identify each case by its ID. The first 9 variables are Scale variables, so
we move them to the Quantitative Variables box. Then, we move the other 5 variables to the Categorical
Variables box. We select the Patterns button. In the Patterns window, by default all boxes
are unchecked. We check the Tabulated Cases box and the Cases with
Missing Values box. These will show us the patterns of missing
values. Click Continue. Then, we select Descriptives, and check the Univariate
Statistics box. Click Continue, and then click OK. Here we have the Missing Value Analysis output. First, Univariate Statistics shows us the
summary of missing values for each variable. For example, variable v1 has 21 missing values,
accounting for 30% of the data; variable v2 has 13 missing values, accounting for 18.6% of the
data; variable v3 has 17 missing values, accounting for 24.3% of the data, etc. So we can conclude that v1 has the most missing
values, followed by v3 and then v2. Let’s look at the Missing Patterns table. This table shows the individual cases with missing
values for each variable. We can see clearly in this table that v1 has
the most missing values, followed by v3. The table also shows the specific cases (by case
ID) that have missing values. We can see that six cases (245,
233, 261, 210, 263, and 214) each have 7 missing values, which is 50% of the 14
variables. So we can consider deleting these cases. Next, let’s examine the Tabulated Patterns table,
which is about variables. We will look at the variables and what happens
if we delete any of them. How do we interpret the patterns? The first row shows 26 cases with no missing
values in any variable. The last column shows the number of complete
cases if the variables marked in that row are deleted. In other words, if we do not delete any variables,
we have 26 complete cases. Let’s examine row 2. If we delete v3, we will have 27 complete
cases (one more than in the first row). Let’s examine the pattern in the next row. If we delete both v1 and v3, we will have
37 complete cases, which is 11 more than in the first pattern. But we have to delete two variables. In the next row, if we delete only v1,
we will have 32 complete cases. Thus, deleting this variable will get us 6
more cases compared to the first pattern in row 1. That means it is not a good choice to delete
both v1 and v3, because this will affect the factor structure and the model, but only give
us 5 more cases compared with deleting only v1. The better choice is to delete v1 only. That is how to interpret this table. Thus, by examining these two tables, we conclude
that we can delete the six cases with a high number of missing values, and variable v1,
which has the most missing values. For this example, we will make these changes. Here is the data after deleting the six cases
and variable v1. As you can see, v1 is not in the data anymore. Similarly, these six cases are deleted from
the data, and we have 64 cases left. Let’s examine the missing value pattern
again to see if it improves. We run the Missing Value Analysis again with
this new dataset. All options stay the same. The new output, Univariate Statistics, shows
that variable v3 has the most missing values, accounting for 21.9%, followed by v2 with 10
missing values, or 15.6%. That is a great improvement over the original
data. Then we examine the Missing Patterns table. Most cases have only 1 or 2 missing values. The highest percentage is 15.4 percent. This looks better. We also need to examine the Tabulated Patterns table. We have 32 complete cases with no missing
values. If we delete v3, we will have 37 complete
cases, which is only 5 more. So we decide NOT to delete any more variables. Thus, we have improved the missing value situation
by deleting cases and one variable. The extent of missing values has been decreased. But we still have missing values and we need
to handle them. Let’s discuss the imputation process. I would suggest reading more detailed guidelines
on this process from the textbook (Hair et al.). Basically, we should only perform missing
value imputation if the missing value process is random. In order to determine whether the missing
value process is random, we use the so-called Missing Completely At Random
(MCAR) test. This test will tell us whether it is safe
to impute the missing values without affecting the results of the analysis. To run this test, we go back to the Missing Value
Analysis window in SPSS. Select Descriptives, check the t tests box, and
click Continue. Then check the EM box, which will give us
the MCAR test, and click OK. This is the output. Some information is the same as before. Let’s scroll down. The Separate Variance t Tests table shows
the mean values of each variable for valid (present) cases and for cases with missing values. The t-value compares the means
of these two groups across variables, without assuming equal variances. It allows us to evaluate the randomness of
missing data through a group comparison between missing and valid data. We need to examine the p-value of each t-test. If the t-test is not significant, that means
there is no difference in means between valid cases and cases with missing values. The results show some concern with v2, since
the p-value is less than 0.05 in three comparisons with v4, v5, and v6, indicating a group difference
on these variables. All t-tests for v3 are non-significant, indicating
no group difference for this variable. That is also true for the remaining variables. So the only variable of concern is v2. But before deciding whether to delete v2,
we need to examine the MCAR test. We scroll down further to EM Estimated
Statistics, where the MCAR test is shown. The result shows that the MCAR test is not significant
(p-value = 0.583). Basically, a non-significant MCAR test indicates
that the missing value process is random, and we can perform data imputation. Conversely, if the MCAR test is significant,
imputation is not recommended because the missing values are not random and may
affect the results. In that case, other methods such as modeling
may have to be used. In this example, the MCAR test is non-significant,
indicating the randomness of missing values. So we decide NOT to delete v2,
and will proceed to the imputation process. Let’s perform some data imputation with
this dataset. We select Transform from the menu, then select
Replace Missing Values. First, we must select the imputation method. It could be Series mean, Mean of nearby points,
or Linear trend at point, among others. Note that we can only perform imputation with
quantitative variables. A common imputation method is replacing missing values
with the mean. So we select Series mean, and then highlight
v2 to v9, move them to the New Variables box, and click OK. If we go back to our data, we see that new
variables are added, and they are labeled with SMEAN, indicating we used the mean method. We can also try another method, such as Linear
trend at point, which is based on regression. And here are the new variables for that method. Please read the textbook to learn more about
the differences among these methods. I want to show you another tool for handling
data imputation provided by SPSS. It is Multiple Imputation. This tool allows us to choose a specific imputation
method or let SPSS choose for us, and then generate and compare multiple sets of imputed values. To do that we select Analyze from the menu,
then select Multiple Imputation, and choose Impute Missing Data Values. We create a new dataset called Imputed and select
5 imputations. Under the Method tab, we check the Automatic method. Any constraints can be set under the Constraints
tab. Under the Output tab, we check Imputation Model
and Descriptive Statistics, and click OK. The output shows that SPSS selected Linear
regression as the imputation method, and created 5 imputation models. The Descriptive Statistics output shows information
on these 5 models for each variable. Let’s go back to the data. Five sets of imputations were created, and
SPSS added a new variable, Imputation_. This is a categorical variable identifying
the five sets of imputations; each of them has 64 cases. The purpose is that when we run our analysis,
we can run it with the five sets of data and compare the results across these sets. This method is time-consuming and is not very
common in academic research. Typically, for datasets with many missing
values, this tool can be used to make sure we achieve similar results across the imputed
datasets. You are not expected to use this method in
this course, but I wanted to introduce this tool to you. That is how to examine and handle missing
values using SPSS. Thank you and bye now.
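
For viewers who want to see the logic behind the Univariate Statistics, Missing Patterns, and Tabulated Patterns tables outside SPSS, here is a minimal Python sketch. It is not what SPSS runs internally. The data are made-up toy values (not the HBAT dataset), `None` stands for a missing value, and the function names are hypothetical.

```python
# Toy data (hypothetical values, NOT the HBAT dataset); None marks a missing value.
rows = {
    101: {"v1": None, "v2": 3.0, "v3": 5.1},
    102: {"v1": 4.2, "v2": None, "v3": 4.8},
    103: {"v1": None, "v2": 2.5, "v3": None},
    104: {"v1": 3.9, "v2": 3.1, "v3": 5.0},
    105: {"v1": None, "v2": 2.8, "v3": 4.7},
}
variables = ["v1", "v2", "v3"]


def missing_by_variable(rows, variables):
    """Count and percentage of missing values per variable."""
    n = len(rows)
    out = {}
    for var in variables:
        count = sum(1 for r in rows.values() if r[var] is None)
        out[var] = (count, round(100 * count / n, 1))
    return out


def missing_by_case(rows, variables):
    """Percentage of variables missing, for each case that has any missing value."""
    out = {}
    for case_id, r in rows.items():
        count = sum(1 for var in variables if r[var] is None)
        if count:
            out[case_id] = round(100 * count / len(variables), 1)
    return out


def complete_cases(rows, variables, dropped=()):
    """Number of cases with no missing values once the `dropped` variables are ignored."""
    kept = [v for v in variables if v not in dropped]
    return sum(1 for r in rows.values() if all(r[v] is not None for v in kept))


print(missing_by_variable(rows, variables))
print(missing_by_case(rows, variables))
for dropped in [(), ("v1",), ("v1", "v3")]:
    print(dropped, complete_cases(rows, variables, dropped))
```

The last loop mirrors how we read the Tabulated Patterns table: compare the number of complete cases with no variable deleted against the number we would have after deleting one or more variables, and weigh the gain in cases against the loss of the variable.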
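
The separate-variance t-test that we examined before the MCAR test can also be sketched by hand. The idea is to split the values of one variable into two groups, depending on whether another variable is present or missing, and compare the group means without assuming equal variances (Welch's formula). The sketch below uses made-up toy values and hypothetical function names, and it returns only the t-value and degrees of freedom; getting a p-value would need a t-distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def split_by_missingness(indicator, outcome):
    """Split `outcome` values by whether `indicator` is present or missing (None)."""
    present, missing = [], []
    for x, y in zip(indicator, outcome):
        if y is None:
            continue  # the outcome itself must be observed to be compared
        (missing if x is None else present).append(y)
    return present, missing


def welch_t(group_a, group_b):
    """Return (t, df) for a separate-variance (Welch) comparison of two means."""
    na, nb = len(group_a), len(group_b)
    va = variance(group_a) / na  # squared standard error of each group mean
    vb = variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Toy values (hypothetical): does v4 differ between cases with v2 present vs. missing?
v2 = [3.0, None, 2.5, None, 3.5, 2.9, None, 3.1]
v4 = [5.0, 6.0, 4.0, 7.0, 5.0, 6.0, 8.0, 4.0]

present, missing = split_by_missingness(v2, v4)
t, df = welch_t(present, missing)
print(round(t, 3), round(df, 2))
```

A large absolute t-value here would suggest that cases missing v2 differ on v4 from cases with v2 present, which is the kind of group difference that raised a concern about v2 in the video.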
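
Finally, the Series mean option in Replace Missing Values boils down to a simple rule: every missing value in a variable is replaced by the mean of that variable's observed values. Here is a minimal sketch under that assumption, with made-up toy values and a hypothetical function name; it is an illustration of the rule, not SPSS's own code.

```python
from statistics import mean, variance


def impute_series_mean(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Toy values (hypothetical), with None marking the missing entries.
v2 = [3.0, None, 2.5, 3.5, None]
v2_smean = impute_series_mean(v2)

observed = [v for v in v2 if v is not None]
print(v2_smean)
print(mean(observed), mean(v2_smean))          # the mean is unchanged
print(variance(observed), variance(v2_smean))  # the variance shrinks
```

The two comparison lines show a known side effect of mean replacement: the variable's mean stays the same, but its variance gets smaller because the filled-in values all sit exactly at the mean. That is one reason the textbook discusses alternative methods.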