Imputation of missing data - Multiple imputation using SPSS

Video Statistics and Information

Captions
In this video we will learn how to deal with missing data using multiple imputation in SPSS. We will divide it into three main steps. In step number one we will check the pattern of missing data. Step number two is to impute the data according to the pattern, or lack of pattern. The third and final step is to pool the imputed data into a complete data set.

But before you begin, you need to decide whether imputation is necessary or even appropriate. If you have too little missing data, you may not need to use a multiple imputation method, and if you have too much missing data, then it might not be appropriate to use an imputation method.

Once you have your data set complete in Excel, you will realize that perhaps some countries have quite a bit of missing data and that some of the variables are quite incomplete. If a particular country has too much missing data, and that country is not really crucial as a potential destination for your firm, then it might be best simply to delete that country. If there is a variable with too much missing data, again it might be better to delete that particular variable. If the variable is crucial, then you may need to look for an alternative, some sort of proxy that gives you roughly the same information.

So, provided that the missing data is not too much for any particular country or variable, there are ways to try to recover that missing data by following some imputation techniques and methods, and that's precisely what we're going to be doing now. As you can see in your Excel file, there are variables here, such as wine consumption for the population, with missing values. There is also alcoholic consumption, GDP per capita, wine imports, and a few other variables for which you have missing data as well. Here is another one, and another one. So, as you can see, there's quite a bit of missing data throughout the entire data set, and we will see how we can recover that missing data by
following that imputation technique. Fortunately, we don't seem to have a particular country with too much missing data, or a particular variable with too many missing values. Therefore we will not delete any of the countries or variables, and we will try to recover them all.

So what you have to do is import the data set that you've created in your Excel file into SPSS, and we will see how to do that. Open SPSS, go to File and then Import Data, click on Excel, and then select the document where you have that data. As you can see, it gives me the option to choose the particular sheet from the Excel file that I want to import. In this case I will import "missing data", where I have the data set with the missing data. I will also select the option to read variable names from the first row of data. I've been careful enough to keep the variable names in the first row, and that's why I'm able to select this option when doing the import in SPSS. Then I click OK. There it goes: the data has been imported into SPSS, nicely organized the same way we had it in our Excel document.

Let's go to the Variable View; that's important. When you go to the Variable View, you will notice that there are various columns, in which you have the name of the variable and then some characteristics such as the type and the width. Something which I haven't done here, but which I strongly recommend you do, is to take a bit more time and actually label the various variables. Remember, you will be working in groups, so some of you may be handling some variables and others other variables, and after a while you will no longer remember what these names mean. Therefore, for instance,
obviously country is an easy one that you would almost not need to label, but I will still label "above age of 65". There it goes. Then you go on and label all those various variables. Be careful in doing that. Something else which is important is the Measure column here. As you can see, country is a nominal variable and the other variables are scale variables, so I've set them all accordingly.

Going back to our data set, it's nicely organized here, and it is now that we have to talk about multiple imputation for the missing values that we've got here and throughout the entire data set.

So let's start with step number one: checking the pattern of missing data. To check whether there is a pattern of missingness, go to the Analyze menu, move down to where it says Multiple Imputation, which is precisely what we are attempting to do, and at this stage choose Analyze Patterns. Once you're there, you will obviously not select country code or country name, but rather all the other variables, and you will move them into the Analyze Across Variables window. There are several options already pre-selected, which you will keep. You just need to be careful if you have more than 25 variables. In this case I do not have more than 25, so I'll keep that at 25. This other option here is kept at the 10 percent level, but I will reduce it to 0.01 to ensure that we display the missing data for all variables, not only those above a certain threshold. Once you've done that, click OK.

SPSS will compute and produce an output document. As you can see, this document is separate from the data. It's connected and related to the data document that you created before, but this is an output document, and that's the one we are going to be
looking at now: the result of the pattern analysis you've just done. This diagram gives you a brief summary, and as you can see you have an indication of what's missing in the variables. There are 46 percent, which is 12 variables, with some missing data. In terms of cases, which here means countries, there are 16 countries that have some missing data. More important is this pie chart here showing the overall missing values, where the total is the number of variables multiplied by the number of cases. As you can see, only about 1.2 percent, roughly 24 values out of all values, are missing. Therefore it's okay, and we can go ahead with the multiple imputation technique.

The summary table is not that important. It just gives you an overview of which variables have missing cases, and the percentage as well. This chart, however, is the most important output that we are looking for right now: it shows whether there is any pattern in the missing data. As you can see, there is no clear pattern. There are some random missing values throughout, rather than all clustering around here and around the bottom right-hand corner.

We are now ready to move on to the next phase, step number two, which is to impute the data according to the pattern or, in this case, no pattern, as we've seen in the previous phase. Go back to your data set and go to Transform. Once you are in Transform, you're going to be generating random numbers, so the choice you pick is Random Number Generators. What it does is generate multiple iterations until it finds suitable data for your set through various regressions. So let's select Random Number Generators. There it goes, and it opens up this window. So we've got
this option here, Set Active Generator, and in that case we will pick the Mersenne Twister as the random number generator. Remember, when we did the pattern analysis we found out that there was no pattern, so we will pick that option there. In terms of the Active Generator Initialization, we will pick Fixed Value and keep it as it is, and then we click OK.

Almost nothing happens in the output document that we saw before. As you can see, we basically have what we had before, and it seems as if nothing has happened. However, you have indeed set the scene for the actual imputation exercise.

Once you've done that, go back to your data set document and click on Analyze. This is where we actually do the imputation itself. Go down to Multiple Imputation. We are no longer analyzing the patterns, because we've already done that; we are now going to Impute Missing Data Values. Here you select all the variables in the Variables list and move them all to Variables in Model. Keep the number of imputations at five. You can choose between three and ten, but I think five imputations is pretty robust, so we'll keep it at five. Then there is the choice to create a new data set, and we will do that. Here we type a name of your choosing. In this case I will say "country multi-imputed data"; that's the name I'll give to this new data set.

Once you've done that, go to Method. In the Method tab there is the Automatic option, and there is also the option to customize. If you had selected Custom, having realized before that there was no pattern, you would pick this one, which is the Markov chain Monte Carlo method, and if there was a pattern you would pick the
Monotone method. But because SPSS is quite clever, we will let it choose for us, though we know that this one was probably the appropriate method for our missing data, since there was no particular pattern.

Then we move on to the next tab, which is Constraints. Here we click Scan Data, and as you can see the scan shows what percentage is missing for each of the variables. If you scroll down, you can see some of the variables; alcoholic consumption growth rate is the one we had seen before with more missing values. Importantly, it also gives you the observed minimum and observed maximum. You can go to each one of those variables and type in the minimum and maximum values. Those constraints are really important, because the software will be generating numbers for you through regressions, and it helps if you provide reference points for the highest and lowest figures of a particular variable. Wine consumption, for example, has four percent missing, so we enter 0.1 as the minimum value and 4.11 as the maximum value, and we do the same for all the other variables with missing values. There it goes: we've now entered all those maximum and minimum values.

Next we go to Output. In the Output tab you need to select Descriptive Statistics and Create Iteration History. Here we will call it "country iteration history", and then you click OK.

The third and final step is to pool the imputed data into a complete data set. We now go to the newly created data set, which we called "country imputation data". As you can see, this column shows the imputation number. Zero means it's the original data, with no imputation, so the missing values are still missing. You can see it here, and here; they are still missing. If you go to
imputation 1, you will see, highlighted in yellow, that the previously missing values have now been imputed, and this is all in imputation 1. If you go to imputation 2, it did the same thing, and likewise up to imputations four and five; we did set five multiple imputations. So there it goes: the complete data set with the newly imputed data.

Now you have to delete the original data. Go to the beginning of the file, select all the original data, and delete it. So let's select all 75 countries, which is the number of countries that we have; all the rows with imputation 0 will be deleted. Right-click and Clear. There it goes.

Once we've done that, the next step is to sort the data. Go to Sort Cases, and here we choose to sort first by country code and then by imputation number, in ascending order, and click OK. What you can see now is that the data is sorted: for instance, the first five cases are the five imputations for country one, which is Antigua and Barbuda, and the same holds for all the other countries.

What we need to do next is go to Utilities, and in Utilities go to the OMS Control Panel; OMS stands for Output Management System. There you first select Tables, then select Frequencies, and then in the table subtypes select Statistics. Now we go to the Output Destinations area, where we will create a new data set. This new data set is going to be our final one, and we will call it "country imputed pooled data set", because we are pooling all the various imputations into one single document. Once we've typed this new name, we go to Options, and in Options we select "All dimensions in single row".
Once you've selected that, click Continue. Still in Output Destinations, tick Exclude from Viewer. Once you've done that, you are ready to click Add, which adds this to the requests. So you've made the request here, and then you click OK.

Now that you have made the request for the OMS procedure, you will be splitting the file. Go back to the data set, country imputed data, and go to Data. In Data, select Split File. Be careful to select Split File and not Split into Files. There it goes. Here you choose to organize the output by groups, and you split by country ID, which in this case is country code, so ensure that you move country code into Groups Based on. Once you've done that, click OK. It is processing, and you will not see anything because it's doing the calculations in a virtual file.

Now that's done, let's go back to our country imputed data and go to Analyze. In Analyze, go to Descriptive Statistics and select Frequencies. Select all the variables excluding the imputation number, country code and country name, and move them to the Variables window. Once we've selected the variables, go to Statistics, select Mean and Mode, and then Continue. SPSS is processing the output, and it will show you the various frequency tables, with frequencies and percentages for the various variables. At the moment this is not critical.

Let's then go to Utilities, where we need to lock and close the OMS procedure that we initiated before. We go back to the OMS Control Panel. There it is: if you remember, you put that request in before. Now you click End All, and this is important, to lock that procedure.
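
The split-by-country Frequencies run described above computes, for each country, the mean of every variable across the five imputations. Outside SPSS this amounts to a group-by average. A minimal Python sketch, assuming the stacked imputed data is available as a list of rows; the field names `country_code`, `imputation` and `wine_consumption` and all the numbers are made up for illustration and do not come from the video:

```python
from collections import defaultdict
from statistics import mean

def pool_by_country(rows, id_field="country_code", imp_field="imputation"):
    """Average every variable across imputations, separately per country."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        if row[imp_field] == 0:
            continue  # imputation 0 is the original data, still containing gaps
        for name, value in row.items():
            if name not in (id_field, imp_field):
                grouped[row[id_field]][name].append(value)
    # Iterating in sorted key order mirrors the "sort by country code" step.
    return {
        country: {name: mean(values) for name, values in variables.items()}
        for country, variables in sorted(grouped.items())
    }

# Hypothetical stacked data: one original row (imputation 0) plus five imputations.
rows = [{"country_code": 1, "imputation": 0, "wine_consumption": None}]
rows += [
    {"country_code": 1, "imputation": m, "wine_consumption": v}
    for m, v in enumerate([1.0, 1.5, 2.0, 2.5, 3.0], start=1)
]
rows += [
    {"country_code": 2, "imputation": m, "wine_consumption": 0.5}
    for m in range(0, 6)
]

pooled = pool_by_country(rows)
print(pooled)
```

For a country whose value was observed, all five imputations carry the same number, so the average simply returns it. A plain mean is only a sensible summary for scale variables; nominal or ordinal variables would need the mode instead.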
So you click End All and then OK. There it goes. Now it's processing, and it will create the new final pooled data set, which is the most important data set: the imputed, pooled data. And here it is. This is the last data set that it has created, named as we called it before: country imputed pooled data set. Wonderful, we are making good progress and getting there.

Now, in the final pooled imputed data set, we have to clean up and delete some of the variables. So let's do that. These variables here, Command, Subtype and Label, and then all the ones that show "Valid"; there's quite a few of them, such as political risk and more. Then also the ones showing "Missing", for all those same variables. There's quite a few, and we need to select them all, so be careful and be patient. Let's move along, up until the last "Missing", and then delete those variables.

In this case all our variables are scale variables, so we will select the modes and keep only the means. Let's move a bit to the left. There you go: the means, starting with mean population, and then we get to the modes. We select all the modes and delete them. If we had not just scale variables but also ordinal and nominal ones, then we could have kept the modes rather than just the means. Because we don't, we clear them and keep only the means. There it goes: now it is our clean data set. Well done.

Just one more detail before you complete the whole data set: you need to go to the Variable View of this newly created imputed data set. As you can see, we have lost the properties that we had assigned in the original file. What we'll do now is copy the properties to this document and ensure that we have the same properties. Let's do that. So, for that purpose,
we go to our original document, which we called "country ranking imputation"; that was the original document. Here it is. We will select those variables and copy their properties. Now we don't have country name, just country code, so we first select all these other variables, with their respective properties, in the Variable View, and copy. We go to our newly created pooled imputed data set, select the same variables there, and paste. Perfect. As you can see, it has changed: the measures are already there, and whatever characteristics we had before have been copied across. Let's do the same for country code. Back to the original, here is country code, Control-C, and back to the final pooled data to paste. There you go.

And that is it, job done. We no longer have missing values, and through a multiple imputation method we have completed the data set. You do not need to delete any countries or variables for your analysis, and you may proceed with your regressions, country ranking, or country clustering exercises. That was a lot of work, so make sure that you save this final data set. Save, and job done.
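
As a recap of step one, the three figures in the SPSS overall summary (variables, cases and individual values with missing data) are simple counts. A minimal Python sketch, assuming missing entries are stored as `None`; the tiny table and its field names are invented for illustration and are not the data set from the video:

```python
def missing_summary(rows, variables):
    """Count variables, cases and individual values that are missing."""
    vars_missing = sum(
        any(row[v] is None for row in rows) for v in variables
    )
    cases_missing = sum(
        any(row[v] is None for v in variables) for row in rows
    )
    values_missing = sum(row[v] is None for row in rows for v in variables)
    total_values = len(rows) * len(variables)  # variables multiplied by cases
    return {
        "variables_pct": 100 * vars_missing / len(variables),
        "cases_pct": 100 * cases_missing / len(rows),
        "values_pct": 100 * values_missing / total_values,
        "values_missing": values_missing,
    }

# Hypothetical data: 4 countries, 3 variables, 2 missing values.
variables = ["wine_consumption", "gdp_per_capita", "wine_imports"]
rows = [
    {"wine_consumption": 1.2, "gdp_per_capita": 30.0, "wine_imports": 5.0},
    {"wine_consumption": None, "gdp_per_capita": 12.0, "wine_imports": 2.0},
    {"wine_consumption": 0.4, "gdp_per_capita": 8.0, "wine_imports": None},
    {"wine_consumption": 2.9, "gdp_per_capita": 41.0, "wine_imports": 7.5},
]

summary = missing_summary(rows, variables)
print(summary)
```

The share of missing values is usually much smaller than the share of affected variables or cases, which is why the video relies on the values figure (about 1.2 percent) when deciding that imputation is reasonable.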
Info
Channel: Roger Go
Views: 28,742
Id: ujf0ifOmvwg
Length: 28min 46sec (1726 seconds)
Published: Fri Sep 25 2020