Data Screening, Cleaning and How to Replace Missing Values in SPSS

Video Statistics and Information

Captions
In this session we are going to talk about data screening and imputation.

Introduction: before you can assess whether one construct is influencing another, you need to make sure you are actually capturing the construct of interest via observed variables, or indicators. In this session I will go over how to initially screen your data for problems before you even start your analysis.

Data screening: the first step before analyzing your model is to examine your data to make sure that there are no errors, outliers or respondent misconduct. You also need to assess whether you have any missing data and how to sort out problems related to it. Once your data has been keyed into a data software program like Excel, SAS or SPSS, the first thing you need to do is set up an ID column. I usually do this in the first column of the data, and it is simply an increasing number from one up to the last row of the data. This is done to make it easier to find a specific case, especially if you have sorted on different columns. After forming an ID column, it is a good idea to initially examine whether you have any respondent misconduct.

Another important thing, if you are collecting data on paper questionnaires: as you collect the hard copies, number each questionnaire. Then, if you find errors while keying in the data or afterwards, you can go directly to that particular paper questionnaire and see whether you entered the data incorrectly and what the correct value should be.

The quickest and easiest way to see whether a respondent abandoned your questionnaire is to sort the last few columns of the data
in ascending order; you can then see whether the respondent dropped out of the survey and stopped answering the questions. These incomplete rows are subject to deletion. If the respondent failed to answer the last few questions, you need to determine whether this amount of missing data is small enough to retain the respondent's other answers. For example, if they only skipped the last one or two questions, there is no point deleting the whole response. But if they missed 40 to 50% of the questionnaire, you can delete that particular response.

After deciding whether respondents who failed to complete the survey should be deleted, the next thing to assess is respondent misconduct. Let's say your survey uses a Likert scale from 1 to 7. You want to assess whether the respondent simply marked the same answer for every question. The likelihood that a respondent feels exactly the same way about every question is small, so such a response is subject to deletion because of respondent misconduct. We do not normally give the same answer to each question, so there is bound to be variation in the responses. If a respondent answers all the questions with the same response, it is highly likely that this is misconduct and that he or she did not actually read the questions. Sometimes you will also hear this called yea-saying, where the respondent is not reading the questions and is just marking the answers.

An additional step you can take to assess whether respondent misconduct is taking place is to add attention check measures to your survey. These questions are added simply to make sure the respondent is paying attention, and they may ask the respondent to select a specific number on the 1 to 7 scale. For example, you may also add reverse questions to your questionnaire. This will help you assess respondent misconduct and keep a
check on the respondent as well.

To see whether you have a problem in your data set, examining the standard deviation of the answers for each specific respondent is a very good way to assess whether respondent misconduct is present. While SPSS is a great tool for analysis, calculating this standard deviation in SPSS is quite a job, and Excel is a bit quicker, so let's use Excel. Go to the last column that is blank, input the standard deviation function, and highlight only the Likert scale items; do not include your ID column. This will let you see the standard deviation of each row, that is, of each respondent. Anything with a standard deviation of less than 0.25 is a candidate for deletion, because there is little or no variance among the responses across the survey.

How do we do this? Let's say here are my responses. We go to the last column and type =STDEV.P, select all our Likert scale items, and press Enter. To extend it to all the responses, select the first cell with the function and double-click the plus sign at its corner; the function is extended to all your responses. Are there any responses less than 0.25? Let's sort to find out: we select this column, choose Sort & Filter, Smallest to Largest, and expand the selection. We have got only one that is less than 0.25; the next one is greater than 0.25. So this one may be deleted, because there is hardly any variance, and you can see this for yourself as well.

That said, a standard deviation under 0.25 does not mean you must automatically delete the record. As the researcher, you need to determine what is an acceptable level of agreement, or disagreement, within the questions, and this can also depend on how large or small the survey is, so the cutoff might change. This is a rule of thumb; there is no golden rule that applies to every situation. But if a respondent's standard deviation is less than 0.25, you need to strongly consider whether that respondent's answers are valid before you move forward.

Screening for impermissible values in the data: there are times when a respondent simply keys in a value wrong or lists an invalid response to an inquiry. To test whether an answer is outside the acceptable range, go to your SPSS file and check the minimum and maximum values under Analyze, Descriptive Statistics, Descriptives.

How do we find out whether there are impermissible values? We go to SPSS and import the data: open SPSS, choose to open another file, and click Open. My data is in Excel format; there it is, so open it, click OK, and here is the data. To find out whether there are any values that shouldn't be here, go to Analyze, Descriptive Statistics, and then Descriptives. Here are the variables; let's say I'm interested in these four, so I move them across. Go to Options, select Minimum and Maximum, make sure they are selected,
and click Continue, then press OK. Now look here: 1 to 5, 1 to 5, 1 to 5, 1 to 5. Yes, 1 was the minimum value and 5 was bound to be the maximum value, so there are no issues with impermissible values.

Moving on: how do I assess whether I have missing data? We have already addressed how to find respondent abandonment, but finding missing data that occurs in a random manner can be more challenging. To initially see whether we have any missing data, let's use SPSS: go to Analyze, Descriptive Statistics, and Frequencies. Let's use these five variables, or indicators. Go to Statistics; let's say you are interested in the minimum and maximum values. There are other statistics as well: range, variance, standard deviation, skewness, kurtosis, mean and others. Let's use the mean as well, press Continue, and press OK. Are there any missing values here? No, there are no missing values in these variables.

If you do have missing values, how do you sort out the problem? There are two prominent ways to handle missing data: one is listwise or pairwise deletion, and the other is imputation. I do not encourage deletion, because you throw away a lot of data by doing it: with listwise deletion, if a respondent misses one question, the whole survey is dropped from the analysis. So if you use listwise or pairwise deletion, you are throwing away very important data. Previous research has shown that you can remedy up to 20 to 30% missing data with an imputation technique and still have good parameter estimates, so imputation is often a better option if you do not have an excessive amount of missing data. Imputation is where your software program replaces each missing value with a numeric guess. The
most popular imputation method is replacing a missing value with the series mean of the indicator. This is usually done for its ease of use, but it has the drawback of reducing the variance of the variables involved, and it also fails to account for the individual differences of specific respondents.

A second way to impute data is the linear interpolation option. This method examines the last valid value before the missing data and the next valid value after the missing data, and imputes a value between those two values. Linear interpolation imputes on the assumption that your data is linear.

Both series mean imputation and linear interpolation can be easily accomplished in SPSS. To replace missing values, go to Transform and then Replace Missing Values. A pop-up window will appear where you select which indicators have missing values and need to be imputed. When you select the indicators to impute, the default imputation method is the series mean, labeled SMEAN. SPSS will impute the series mean for these indicators and create a new variable with an underscore and a one added to the variable name; for example, A1 will become A1_1.

How do you do this in SPSS? Let's say I've got some missing values here; let's delete a few values. We go to Transform, Replace Missing Values, and add A1; you can add multiple variables as well. Series mean is already the method, so nothing else is required: just press OK, and your values are replaced by the series mean. Where is your new variable? Here it is, and this is the series mean: see this one, this one, this one and this one.

Now the other method, linear interpolation. It is very similar. After selecting Transform and then Replace Missing Values, you need to
select each indicator for imputation. As stated earlier, the default method is the series mean, but there are other options, so select the Linear interpolation option and make sure that you hit the Change button to change the method.

How do you do this in SPSS? Since we used the series mean before, let's now use linear interpolation. Let's delete a few values from A2, then go to Transform, Replace Missing Values. I'm interested in A2; the earlier variable can be removed, and you can add other variables as well if you want. The new variable is A2_1. I'm not interested in the series mean but in linear interpolation, so under Method I click Linear interpolation. Nothing happens until you click Change; now it is changed. Click OK, and the creating function is LINT, all well. Look here: your missing values are replaced. In A2 the second value was replaced; look at the second value, five. With the first method it was the series mean; here it is a discrete value. So this is how you can use both these methods to replace your missing values.

To know more about missing values, this is a very good book. I hope the video has helped you understand the concepts of data screening and imputation. Thank you very much.
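
Note: the screening steps described in the captions (per-respondent standard deviation, minimum and maximum checks, and a count of missing values) can also be sketched outside Excel and SPSS. The following is a minimal illustration in Python with pandas, not the workflow shown in the video. The data, the column names (ID, A1 to A4) and the 1 to 5 response range are assumptions made up for the example, and the population standard deviation (ddof=0) is an assumed choice.

```python
import pandas as pd

# Hypothetical survey data: an ID column plus four Likert items scored 1 to 5.
data = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "A1": [4, 3, 5, 2, 9],      # the 9 is an impermissible value
    "A2": [5, 3, 4, None, 4],   # None marks a missing answer
    "A3": [3, 3, 5, 1, 4],
    "A4": [4, 3, 2, 2, 5],
})
items = ["A1", "A2", "A3", "A4"]

# Row-wise standard deviation across the Likert items only (ID excluded).
# ddof=0 gives the population standard deviation; ddof=1 would give the sample version.
data["row_sd"] = data[items].std(axis=1, ddof=0)

# Respondents below the 0.25 rule of thumb are flagged for review, not deleted automatically.
low_variance_ids = data.loc[data["row_sd"] < 0.25, "ID"].tolist()

# Impermissible values: anything outside the assumed 1 to 5 response range.
out_of_range = (data[items] < 1) | (data[items] > 5)
out_of_range_ids = data.loc[out_of_range.any(axis=1), "ID"].tolist()

# Number of missing answers per item.
missing_counts = data[items].isna().sum().to_dict()

print("Low variance:", low_variance_ids)
print("Out of range:", out_of_range_ids)
print("Missing per item:", missing_counts)
```

Flagged IDs can then be checked against the original questionnaires before deciding whether to correct or drop a case.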
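
Note: the two imputation methods from the captions, series mean and linear interpolation, can be illustrated in the same way. This is a hedged sketch in Python with pandas rather than the SPSS procedure itself. The values are invented, the new column names A1_1 and A2_1 simply mirror the underscore-and-one naming described in the captions, and leaving leading or trailing gaps unfilled (limit_area="inside") is an assumption of this example.

```python
import pandas as pd

# Hypothetical Likert items with a few missing answers (None).
data = pd.DataFrame({
    "A1": [4, None, 5, 2, 4],
    "A2": [4, None, 6, 3, None],
})

# Series-mean imputation: every gap in A1 gets the mean of the observed A1 values.
data["A1_1"] = data["A1"].fillna(data["A1"].mean())

# Linear interpolation: a gap in A2 gets a value on the straight line between the
# last valid value before it and the next valid value after it.
# limit_area="inside" leaves leading or trailing gaps unfilled, because they have
# no valid neighbour on one side.
data["A2_1"] = data["A2"].interpolate(method="linear", limit_area="inside")

print(data)
```

In this made-up data the gap in A1 is filled with 3.75, a value no respondent could have chosen, while the inner gap in A2 is filled with 5, halfway between its neighbours 4 and 6. This reflects the contrast the captions draw between a series mean and a discrete value.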
Info
Channel: Research With Fawad
Views: 16,409
Keywords: SPSS Data Analysis, SPSS Tutorial, Handling Missing Data, Series Mean Method, Linear Interpolation Method, Handing Missing Data using Standard Deviation, SPSS
Id: dZdIiEsgHWE
Length: 16min 38sec (998 seconds)
Published: Sat Oct 23 2021