7 data quality issues and how to clean them in SPSS

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

you probably have heard the term garbage in garbage out well that's the truth about working with data if you use garbage data you get garbage results and no one wants to make decisions based on trashed data this is why data cleaning is an extremely important step in data analytics data quality issues are very common in data that has been collected through surveys or imported from other formats for example databases or microsoft excel worksheets in this video i'll point you to the data quality issues you need to look out for and how you can fix them using spss first what is data cleaning well data cleaning is a process of preparing data for analysis by removing or modifying data that is incorrect incomplete irrelevant duplicated or improperly formatted with that definition you should have an idea already about what is involved in data cleaning that's what i'm going to be showing you in this video starting with data quality issue number one irrelevant data data can be irrelevant if it is not of interest to the analysis you are trying to do so this could be either irrelevant variables or irrelevant cases i have here this data from a fictitious survey that i conducted using a data collection platform called cobot2box already you can see that we have variables that were automatically generated by the platform for example the start and end time of the interview here if you scroll to the end you also notice that we have more of these for example the id the uuid submission time validation status and so on dealing with irrelevant variables is simple just delete them i'll scroll to the beginning right click on the variable name and select clear that's it let's do the same thing with the end time variable as well so right click clear we can also delete multiple variables at the same time let's select from id to tags there are several ways you can do that but the easiest and most consistent way is to click the first variable which is in this case that's id then press the ctrl key on the keyboard and hold it now while the control key is too pressed go ahead and click all the other columns then right click and select clear i left the index variable deliberately because we may need it to uniquely identify each case in the data set while we are talking about irrelevant data let's look at how we may remove irrelevant cases let's suppose that we are only interested in analyzing data for those who completed teacher education there are multiple ways to do this but the easiest is to simply sort the data by the said variable and then delete the rest of the cases that are not of interest to your analysis i must say that it's wise to back up the data before you can delete anything from it so in this case we'll right click what is your highest level of education column choose sort ascending now quickly click and drag from case number one to case number ten now right click select clear now in this case the variable what is your highest education level is useless since we only have one value so let's go ahead and delete it data quality issue number two is incorrect data formats it is very common for data that was imported from other software to being properly formatted in spss the most common one is having numeric values being designated as string or text values this causes problems when you're trying to use such variables to create charts or analysis that require numeric variables to be run switching between data formats in spss is pretty straightforward first switch the variable view here i'll widen the label column a bit so that it's easy to identify the variables the variable how older you is obviously numeric but here it's being treated as string so we click on the variable type here then click this little button here select numeric make sure the width and the decimals are correct here the width is just the maximum number of digits to store so for age it's going to be two and we don't have any decimals that's fine click ok let's do the same thing with the weight variable but this time weight is going to have one decimal place then we'll do the same thing for height that was perfect let's not jump to data quality issue number three duplicates the best treatment for duplicates is just remove them all together but first we have to check whether we have duplicate data or not we can do this by checking values that are supposed to be unique across the records in the data or unique identifiers such as emails phone numbers or ids generated or used when data was being collected we're in luck because we actually have the index variable which has unique values for each of the cases in spss there's a special functionality for identifying duplicates on the menu at the top click data identify duplicate cases we need to drag and drop the variable we're using to uniquely identify the cases into the box at the top if you have a combination of variables that must be unique you can drag them all into the box here i'm going to be using the index variable here's showing us that it will create a new variable called primal last by default it would treat the last case from a list of duplicate cases as the primary case this means that if you have three cases with exactly the same index value the first two cases will be regarded as duplicates and the last one being the one to keep you can change that by choosing first case in each group is primary and here it says a value of 0 in the primary last variable will mean that's a duplicate and we can delete it that's fine let's click ok the easiest way to now remove the duplicates is to sort the data by the new variable primary last our sort ascending then we can delete all the cases with zero one primary last which if you remember means these are duplicates we don't need the primarist variable so i'm also going to delete that let's call the data called the issue number four structural issues structural issues in data include typos spelling errors inconsistent capitalization and extra spaces these issues are very hard to detect by just looking at the data itself the best way to find out if your data has any structure issues is to summarize it for example i suspect that the variable are you male or female has some issues so step number one will be summarizing it using frequencies this will show us a list of all the unique values in the variable the easiest way to do frequencies on string variables is just right clicking on the variable name here then select descriptive statistics as you can see here we have two entries for the value female it's clear the reason we have that issue is because for some three cases we typed female with a lowercase f instead of an uppercase f there are multiple ways you can fix those issues for example you can sort the variable ascending that'll bring all the values that have the same issues together and then you can manually fix them there but perhaps the fastest way to fix structure issues is to use the find and replace functionality first make sure the variable is selected here now click edit on the menu click replace so here it's showing us that we'll be making edits in there are you male or female variable that's great in the find box we'll type female with a lowercase letter f in the replace with box we'll type female but this time with an uppercase letter f here we'll check the option to match the case in certain cases you may need to turn on some options here so for example you may need to ensure that the string you are trying to find matches all the contents of the cell to avoid replacing values that may start with the weight female but then ends in another way this is fine let's click replace all now all the three instances we have found have been replaced click ok and close this window in software like microsoft excel you can analyze categorical variables that are stored as string variables without a problem but in spss there are certain statistical procedures that do not work with string variables in this case you need to convert the string variables into categorical variables with numeric codes that's data called the issue number five for example on the variable are you male or female who assign a numeric value say one to male and two to female the functionality that does this in spss is called automatic record automatic record converts string variables into numeric variables by assigning numeric values to each of the unique values in the string variable okay let's click transform automatic record we'll drop both are you male or female and the marital status variables on the right hand side when we run this procedure new variables will be created so we need to specify variable names for those new variables one by one select marital status let's set the new name here m status then click add new name now let's click the second variable let's call it sex click add new name now below here we are prompted to choose whether recording will start with the lowest value in our case it's going to record alphabetically that's fine we'll leave it like that let's click ok minimize the output now as you can see we have two new variables at the end let's click the value labels button at the top so that instead of seeing ones and tools we should be seeing the actual values instead perfect so now we can clear the two string variables data called the issue number six dealing with outliers outliers are values that are extremely lower or higher than the majority of the data outliers may be invalid data for example finding 250 as a value of age however they can also be valid data for example having a household earning hundred thousand dollars per month or the majority of the households in the data set only get up to ten thousand dollars the problem with having outliers in the data set is that they tend to skew the results towards the direction this will paint a false picture that the majority of the data is either on the much higher side or much lower side depending on the size of the outlier so the first thing is of course to detect whether we have any outliers in the data set you can do this in several ways the easiest of them being to just sort the column where you think you may have outliers for example let's sort the height this time descending so that the largest values are at the top there are some obvious outliers that point to invalid data here there are many ways you can fix these outliers the first way would be to find out if at all these were simply mistyped information if you take a look at the 172 and 170 here you probably realize that these are in fact 1.72 and 1.7 so let's just correct that by typing the correct values then you have these other values that are really just too extreme you can either delete the whole case or just delete the values that are clearly outliers here but also you can simply filter out the values when you are conducting the specific analysis that involves this variable i will cover more about filtering in a later video on the channel so make sure you have subscribed to the channel in some cases it's not very easy to see that you have outliers in this case the best way to detect them is to create a box plot but before you can create a box plot you must assign correct measurement types of the variables i have talked in length about measurement types in the video exploring relationships between variables the link is right at the top corner but also it should be in the description below let's switch the variable view variables how old are you bodyweight and height are all skill variables so i'll assign them as such now let's create a box plot for the variable body weight go to graphs chat builder now on the chat gallery at the bottom select the box plot category grab the face box plot type the simple box plot and drop it on the canvas then grab the body weight variable onto the y-axis click ok this chart is showing us that cases number 84 and 89 outliers let's go to the data set scrolling down to cast number 84 you'll notice that indeed cases number 84 and 89 have crossed the 100 mark these may well be valid values but of course there are two isolated cases that are extremely higher than the majority of the cases in the data set so how do you treat them we can either filter them out when running some analysis or delete the cases completely from the data set finally data quality issue number seven missing values missing values can cause biased results from your analysis unfortunately there's no one best method for fixing missing data you have the choice of deleting all observations that have missing data especially on variables that are the most important to your analysis unfortunately this means that you may lose a lot of data in the process the other option is to fill in the missing values using either statistically calculated values such as the average or patterns based on other observations in your data set the downside to this approach is that you risk losing the integrity of the data since you are using data that you are just assuming are not exact observations the other option is to leave the missing values in the data set and just find a way to report your output independent of them for example filtering them when you're doing analysis your choice of the way to deal with missing values largely depends on your individual use case let's say that you choose to impute missing values using the average i'll use bodyweight as an example first we need to calculate the average of body weight and that's easy just right click the variable and select descriptive statistics all right great so the average is 72.27 then you can simply sort the data in bodyweight ascending to show the missing values at the top we see that the first case at the top is missing just type our mean value 72.27 on that spot that's sorted so that's it on how to clean data using spss if you found the video very valuable to you then you will love the other great videos i'll be publishing very soon and you don't want to miss out so go ahead and click the subscribe button

Info

Channel: Data for Development

Views: 31,905

Rating: undefined out of 5

Keywords: Data quality, Data analysis, Data cleaning, SPSS, Datafordev, Data for development, removing duplicates, levels of measurement, ordinal, nominal, scale

Id: -xYXsiMgjA4

Channel Id: undefined

Length: 16min 7sec (967 seconds)

Published: Tue Mar 23 2021