Multiple Imputation: A Righteous Approach to Handling Missing Data

Captions
Hello, everyone, and welcome to the Stats for the Masses webinar. Today we're covering multiple imputation, a righteous approach to handling missing data. My name is Elaine Eisenbeisz; I'm the owner and principal statistician here at Omega Statistics. We've had a large number of registrants for this presentation, and I see many repeat attendees with us, so thank you for taking the time to join.

Most of us encounter missing values in our datasets, so let's just get into it and start talking about it. I very rarely see a data set come through that's complete and beautiful. Today we're going to present a rather simple example of multiple imputation using SPSS. It's just an overview, but I hope the information you receive today gives you some ideas and guidance on how to handle missing data in ways other than just deleting records — you don't want to do that if you don't have to. As I say in all of our webinars, take what you learn today and try to apply it in a holistic way to what you're doing. My hope is that, if nothing else, you'll come away from today's presentation feeling a bit more in control of your capabilities as a researcher. Too often we're afraid of making adjustments to data because we're under the impression that if we make changes we'll invalidate something; more often than not, though, using the data as-is — without any regard to its structure, its missingness, or the requirements of the analysis you're performing — is a lot worse. So make changes when they're necessary.

I anticipate about 40 minutes of presentation, which leaves about 20 minutes for Q&A at the end. I've muted the lines so only my voice can be heard, but my assistant, Heather Crutcher, is here monitoring the chat board, so feel free to type any questions you have there and Heather will choose some to ask me at the end of the presentation. If you have any technical issues, type them to Heather on the chat board as well. Hopefully we won't have any today — we're notorious for always having something — but I often find that logging out and back into the webinar clears up just about everything. When in doubt, reboot.

A little about Omega Statistics before we begin: the Stats for the Masses webinars, newsletters, and upcoming workshops. Statistics can be difficult, even for seasoned researchers, and textbooks, classes, and training programs often make things a little more confusing. My goal with the Stats for the Masses series, and for Omega Statistics as a company, is to inform and educate about statistical concepts in a way that we can all understand. That's not always easy, but we always do our best so that everyone can see the elegance and usefulness of statistics — I think it's just beautiful. Here are some things about us. We are a private-practice consultancy based in Murrieta, California. We do medical and clinical research, all phases of clinical trials, and we help independent researchers with experimental design, epidemiology, and journal article assistance. We help with exploratory design and predictive modeling — probably one of my most favorite things to do — and with dissertation assistance, the methods and results chapters, which is also very rewarding, helping researchers meet that
PhD dissertation goal. We do survey design, and projects of any size: we've helped individual researchers, we help large companies, we do government work. Basically we have our hands in just about everything, which is what a private-practice consultancy does, and I really enjoy that, and so does the staff — we learn something every day. Our motto is to provide elegant solutions, effectively applied.

Let's see what we have coming up. I was trying to think about what to do for the rest of the year, and I thought: regressions. Everybody is interested in regressions — I think they're among the most beautiful statistical models there are. You can do predictions, you can see how much your predictors affect your outcomes; you get so much information out of a regression. So we're going to have a Regression Palooza, focusing on a few of the many, many types of regression models for the rest of this year. On September 17th I'm going to start with correlational analysis — it's always nice to know about correlations before you jump into other regressions, and it really helps you understand multiple regression. It's not so useful for logistic regression, which is another beast altogether, but we'll be covering that in November. Then on December 17th I'll do my best to tackle hierarchical linear models. A lot of social science and educational research uses these nested types of models — students within schools, things like that — and they're being used more and more as the software gets better and all of this becomes easier to do. You can go to our website, shown here, where there's a webinar page with more information and registration, or you can contact Heather Crutcher — here's her email and phone number — and she'll sign you up. I still need to post those pages, so give me until later today or tomorrow morning before you go and look. And while you're on the website, sign up for the mailing list (or shoot Heather an email and she'll add you); that way you can keep in touch with us and know everything we're doing — we're doing more things all the time.

Okay, let's get busy. What is multiple imputation? Good question. I always like to start with a textbook definition and then make it a little easier with our Stats for the Masses definition. The textbook definition traces back to Donald Rubin, probably in the 1970s. He's the one who started talking about the different types of missingness — missing completely at random, missing at random, missing not at random — which sound the same in a way but have some differences we'll go over briefly, and that led into the multiple imputation procedures, which he was also instrumental in formalizing. The 1970s and 80s are longer ago than I'd like to think — I have a hard time believing the 1980s were about thirty years ago, but I guess they were — so, relatively speaking, this is still rather new. Multiple imputation replaces each missing value with a set of plausible values that represent the uncertainty about what to impute, and the definition goes on to say that, no matter which complete-data analysis is used, the process of combining results from the different data sets is
essentially the same. The way I like to look at it: it's a process used to fill in the blanks of a data set so that you have a full and complete set of data to use for your analysis. Nice and easy — we're just going to look at the blanks and fill them in the best way we can.

Here are the types of missing data according to Donald Rubin — three classifications, defined in order of severity. First there's missing completely at random, which we call MCAR: the missingness is random, nothing is affecting the reason it's missing, it just is. Maybe somebody forgot to answer a question; it could be any number of reasons. The thing about missing data is that you never really know — you might know sometimes, but for the most part you don't know the mechanism behind why it's missing, so you have to look at the relationships and the patterns and see whether you can determine which of the three definitions best fits your missingness. When data is missing completely at random, there is no relationship between the pattern of missingness and either the observed data or the data you don't have — you have some data, you're missing some data, and there's no pattern connecting them. In practice, a data set with less than 5% missing data on each variable can be considered MCAR; you can also look at the data set overall, and if 5% or less of all the data points are missing, a lot of times you can consider that MCAR too. With MCAR data, a lot of researchers will just take out the incomplete records, and that's pretty legitimate most of the time if you're under that 5%. But often you're missing more than 5%, and then you want to know whether it's missing at random or missing not at random.

Missing at random (MAR) says the missingness may be due to a relationship between how the data is missing and the observed data, but we can't see it for sure; often data is missing because of a variable we haven't even included in our data set. So the pattern of missingness has a relationship to the observed data, but not to the missing data. You're probably asking: how in the world do I know that? A lot of statistics is guessing — a lot of statistics is estimation; in fact, that's all it is. As I always say, if we knew the truth, we wouldn't need statistics. So in a little bit we'll look at some patterns and try to make a good guess about how to handle it, but you never know for sure — if you're hoping for an exact test or procedure that tells you whether it's MCAR or MAR, there just isn't one.

Probably the worst severity is missing not at random (MNAR): the missingness may be due to a relationship between how the data is missing and both the observed data and the missing data, and again we can't see it for sure — we'll never know, because we don't have the missing values to check against the observed values, so we can't examine any relationship between them. In theory, data that is missing not at random has a pattern of missingness related to both the observed and the missing data. And again, as I've been saying, we never know for sure about the true nature of the missing data.
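For readers who want to try that informal "5% per variable" screen outside of SPSS, here is a minimal sketch in Python/pandas. The file name and layout are assumptions that mirror the webinar's country data set; they are not part of the original materials.

```python
import pandas as pd

# Hypothetical file mirroring the webinar example: obesity rate, percent
# aged 65+, and health cost per capita for roughly 29 countries.
df = pd.read_csv("obesity_health_2009_2011.csv")

# Percent missing per variable -- the informal "5% per variable" screen.
per_variable = df.isna().mean() * 100
print(per_variable.round(1))

# Percent missing over the entire data matrix.
overall = df.isna().to_numpy().mean() * 100
print(f"Overall missing: {overall:.1f}%")
```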
But we can make assumptions about how to treat data sets with missing data, and we can adjust the data accordingly. There are many ways of dealing with missing data — some are even better than multiple imputation in certain cases — and some not-so-great ways, too. The not-so-great ways mostly involve deleting records. Listwise deletion is just a fancy name for taking out every record that has any missing data: if a case in your data set is missing data on any variable, it goes out — you don't use it, you delete it. That's fine to use, for the most part, if you have less than 5% missing data across all the variables, because that's your MCAR situation — easy-peasy, just take them out. However, when more data is missing, taking out entire records reduces your sample size, which also reduces your power, so you won't be able to see significance even if it's truly there. There could also be systematic bias in the data — that's your missing at random or missing not at random. Remember, we never really know, but with 5% or less missing you can usually safely treat the data as missing completely at random.

SPSS also has a nice option for pairwise deletion. It's like listwise deletion except a record is only excluded from an analysis if it's missing data needed for that particular analysis; if another analysis has all the data it needs, the record is kept. Say you have an ANOVA and you have everything you need for it — the group membership and the outcome — but for a regression there are variables where that record has no information. With pairwise deletion, SPSS will include the case in your ANOVA but take it out of the multiple regression. It's a nice option, and we use it quite a bit whenever there's less than 5% missing; again, you don't want to use it otherwise.

Some other not-so-great ways are the single imputation methods, where you replace the missing values on a variable with one specific value. Mean substitution — which you may have covered in a basic statistics class — replaces all the missing values on a variable with the mean score, so everything that's missing gets the same value. Median substitution is the same thing with the median, which you're more likely to use when you don't have a normal distribution — means and standard deviations go with the normal, bell-shaped distribution, so if you have something skewed, or something like a Poisson distribution, you might want the median instead. Long ago, back when computers ran on punch cards and for census data — I don't know if they still do it with census data, but they used to — there was hot deck imputation and cold deck imputation. Hot deck means you take the observed values that are already in the data set and randomly select from them to replace the missing values on the variable — you look at what other cases have on that variable and randomly choose from them to fill in the missing ones.
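To make the listwise-versus-pairwise distinction concrete, here is a small hedged sketch in pandas, reusing the hypothetical file above. Pandas happens to compute correlations from pairwise-complete observations, which is analogous to the SPSS pairwise option; nothing here is SPSS itself.

```python
import pandas as pd

df = pd.read_csv("obesity_health_2009_2011.csv")   # hypothetical file name
num = df.select_dtypes(include="number")           # drop the country ID text

# Listwise deletion: any row with a missing value on any variable is dropped.
complete_cases = num.dropna()
print(f"{len(num)} rows before, {len(complete_cases)} after listwise deletion")

# Pairwise "deletion": each correlation is computed from the rows where both
# variables happen to be observed, so different cells of the matrix can rest
# on different subsets of cases.
pairwise_corr = num.corr()
listwise_corr = complete_cases.corr()
```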
Cold deck is similar, except you use values from an entirely different data set — not the exact data set you have, a different one.

Why use multiple imputation? Multiple imputation involves creating multiple predictions for each missing value. Instead of going in and replacing everything with one value, it takes the data, accounts for the uncertainty, and makes a best guess — sometimes with a regression model, sometimes by other means — and it simulates this over and over, so you get several different imputations, several candidate values; it then pools everything together, and that pooled value is what ends up in your pooled results, as you'll see when we go through it. That isn't the case with single imputation, where one value goes into all the missing cells. With single imputation you also get smaller standard errors: think about replacing, say, 20 percent of your data with a single number, the mean — everything stacks up on the mean and you lose variability, because everything you filled in is essentially a constant. That makes your standard errors much smaller, which makes your confidence intervals much narrower, and you get Type I errors — seeing significance that isn't really there — which is a very bad thing. You don't want to claim significant differences that don't exist just because your standard errors are too small. Multiple imputation gives you better estimates of your variances and standard errors, so you can be more confident that any significance you see is real.

Are there other ways of dealing with missing data? Absolutely, and depending on the circumstances they may even be better — listwise deletion might be perfectly fine, and it's easy. In your spare time, check out maximum likelihood estimation and the expectation-maximization (EM) algorithms. In many cases these methods make better use of the actual data and the distributions it comes from, and they usually work better than multiple imputation when you have smaller data sets; with larger data sets, multiple imputation often does better. In practice, maximum likelihood and EM are computationally intensive — they use numerical methods rather than simulation — but some of the major software packages have good implementations, and as computers and software get better they're becoming pretty easy to use. Multiple imputation, though, is easy, it's effective, and many studies have shown similar results across all of these methods. I like it because it's simple and it still allows your research to have rigor, and that's important.

So here are the steps. First, we check the pattern of missingness as best we can — remember, we never know for sure — take a look, and make a good estimate. Then we impute the data set according to the pattern of missingness.
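The standard-error shrinkage from single imputation is easy to see in a toy simulation. This is a minimal NumPy sketch on simulated data, not the webinar's data set.

```python
import numpy as np

rng = np.random.default_rng(20140813)

# Simulated variable with roughly 30% of values missing completely at random.
x = rng.normal(loc=50, scale=10, size=200)
mask = rng.random(200) < 0.30
x_obs = x[~mask]

# Mean substitution: every missing value becomes the observed mean.
x_mean_imputed = np.where(mask, x_obs.mean(), x)

def std_error(a):
    return a.std(ddof=1) / np.sqrt(len(a))

print("SE, observed cases only:   ", round(std_error(x_obs), 3))
print("SE, after mean substitution:", round(std_error(x_mean_imputed), 3))
# The mean-imputed SE is artificially small: the filled-in constants add
# cases without adding variability, which narrows confidence intervals and
# inflates the risk of Type I errors.
```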
Then we take all those imputed values and pool them together — basically averaging across them — and we derive our parameter estimates and the standard errors of those estimates from the pooled data.

So let's give it a try in SPSS. You do need access to the Missing Values add-on — SPSS likes to charge for every little piece you add — so you'll need that if you're going to use this process in SPSS. Let's open a data set. Here we have about 29 countries, with obesity rates for 2009 through 2011 (in percentages), the percentage of people over 65 for each year, and the health cost per capita — per person — for each year in each country. As you can see, we have a lot of holes, especially over in the obesity columns; not so many elsewhere, but there appear to be holes in the data across the board. We'll check to be sure.

Step one: check the pattern of missingness. Go to Analyze > Multiple Imputation > Analyze Patterns. I want to look at all my variables except country name — that's just an ID, we don't care about it — so I'll select the rest and move them over. We'll keep the patterns and the summary options as they are. The minimum percentage missing for a variable to be displayed defaults to 10%, which is large; I want to see everything that's missing, so I'll set it to a very low number, 0.01%, so it includes everything. Click OK.

Here are some nice little pie charts that give you an overview. The first one tells you how many variables have missing data: all of them are incomplete, so 100% of the variables are missing at least one value. The second is how many cases are missing at least one value — this is record by record — and 27 of the 29 countries are missing something; only a couple look complete, so almost all of them are incomplete. The third is the values: of the roughly 261 data points all together, 181 are observed and about 80 are missing, which is about 31 percent. We can't treat this as MCAR, since it's well over 5%, so multiple imputation is probably the way to go. The variable summary tells you, for each variable, how many of the 29 records are missing — you can see the missing count and the count of values present — so for the obesity rate in 2009, 20 values were missing and nine countries had it, along with the mean and standard deviation. Those means and standard deviations are based only on the values we do have, the observed data; you can't say anything about the missing data from them. Further down is where you can start looking at patterns: the red cells are missing values, the white cells are non-missing.
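Outside of SPSS, the same missing-value counts and patterns can be tabulated directly. A rough pandas sketch follows; the file name and the "Country" ID column are assumptions carried over from the earlier sketches.

```python
import pandas as pd

df = pd.read_csv("obesity_health_2009_2011.csv")   # hypothetical file name
analysis_vars = df.columns.drop("Country")         # assumed ID column name

# Missing count per variable, like the SPSS variable summary.
print(df[analysis_vars].isna().sum().sort_values())

# Missingness patterns: each unique row of True/False flags is one pattern,
# and the count shows how prevalent it is. The all-False row corresponds to
# SPSS's "pattern 1", i.e. nothing missing.
patterns = df[analysis_vars].isna().value_counts()
print(patterns)
```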
What you want to do with this chart is see whether there's some kind of pattern to the missing data. Sometimes you'll have a lot of red up in one area and a lot of red down in another — that would be a pattern. If the red and white are just scattered all over the place, you don't have a pattern. I can see from here that we do have something of a pattern: the obesity variables are missing a lot of data, which is what we saw in the data editor, and it's showing up right here. So I'd say this may be a systematic kind of missingness — it could be missing at random, maybe missing not at random; we've definitely ruled out missing completely at random.

Over here is the pattern chart. I should mention that the variables are listed from the one with the least missing data to the one with the most: the obesity rate for 2009 had the most missing, and health cost per capita for 2011 had the least. The very top row is pattern 1, which is always the pattern with no missing data at all — I think that happened for about two records here, so not much. As you go down, each numbered pattern shows, in red, which variables are missing for the cases with that pattern. If you look at the chart of pattern frequencies, the biggest bar is pattern 13 — it was the most prevalent, with about 25% of the cases showing that pattern. Going back up to the grid, pattern 13 had missing data on the 2009, 2010, and 2011 obesity rates, so most of the cases had that kind of missingness. And remember, pattern 1 — nothing missing — is way over at the other end with hardly anything. So we're probably looking at missing at random or missing not at random, and that gives us an idea of where to go next: we don't want to delete anything, we want to impute.

That's step two, but first, before we do the imputation, it's always nice to set a seed. Multiple imputation uses simulation — Markov chain Monte Carlo methods; we'll just call it MCMC — so it's good to have a starting point for the simulation. Set a seed so that if you come back later and need to rerun it for whatever reason, you can start with the same seed and get exactly the same results; if you don't set a seed, you can get a different set of results every time you run it. Go to Transform > Random Number Generators. Set an active generator — I'm going to choose the Mersenne Twister, a pseudo-random number generator and one of the better ones (I'm pretty sure its authors are from Japan); it's used quite a bit. Then set a starting point. You could leave it random, but then why set it at all? We came in here to fix it, so choose a fixed value. You can use the default, and that's fine — just make sure that every time you come in and do this it's the same value, if you're using the same data and want the same results.
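For anyone following along outside SPSS, fixing the generator and the seed looks like this in NumPy, which also ships a Mersenne Twister. The numeric seed below is just one way you might encode the date used in the demo; it is an assumption, not part of the original materials.

```python
import numpy as np

# Mersenne Twister with a fixed seed: rerunning any simulation driven by
# this generator reproduces the same draws, which is the whole point of
# setting a seed before imputing.
rng = np.random.Generator(np.random.MT19937(seed=20140813))
print(rng.normal(size=3))   # same three numbers on every run
```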
Sometimes you run these quite a bit, though, so you may want to keep track of your seeds somehow. Today is the 13th of August, so maybe I'll just put in today's date, August 13, 2014 — that will be my seed. I'll click OK, and nothing much happens — you just get a little note in the output — but the seed is now set.

Now let's do our imputation; this is so easy it's almost scary. Go into Analyze > Multiple Imputation > Impute Missing Data Values. First we'll take all our variables, just to be safe. Five imputations is all you need. People get uneasy about this — it just doesn't seem like enough; should we do thousands? — but three to ten is the usual standard, and a lot of people use five. I want to create a data set for the results, so I'll make one and call it imputed data. Now the Method tab. You can go with a custom method: fully conditional specification, which is the MCMC approach, is suitable when the missingness is arbitrary, and monotone is what you use whenever you have a monotone pattern, which is probably what we'd check in this case. However, SPSS also offers Automatic and lets SPSS do all the work for you, and it's pretty good, so let's do that. If you did use a custom method, you can also specify down here whether you want to run the imputations with linear regression or with mean matching; regression often works just fine, and it would work well with this data because it's all scale, continuous data. So let's click Automatic and make it easy on ourselves; you can always look the other options up another time.

Now Constraints. The first thing to do is click Scan Data: it goes back through the data file and tells you, for each variable, how much is missing, the minimum, and the maximum. Take a look and make sure these make sense. We know the obesity rates and the aged-over-65 rates are percentages, so they should be between 0 and 100 — they all look good. The health cost per person can be anywhere from 0 to, theoretically, infinity, so you just want to make sure there are no negative numbers — that looks good too. If you want, you can define constraints on what it imputes, because if you let it go by itself you can sometimes get negative values when you shouldn't, or values over a boundary that's just impossible: percentages should be between 0 and 100, you don't want 105 or negative 12, and BMI would be another example — you'd very rarely see a person with a BMI over 70 or 80. So you want to be careful with that. What I'm going to do is set the role to Impute only, right down the board, and then here I'm
going to set the minimum to 0 and the maximum to 100. Rounding — I don't want to round; I want to keep my decimal places. So I'll just plug those numbers in down the list. Software is great — you just plug in your numbers — but it can be dangerous too, because unless you know what you should be getting, you can get yourself into trouble; trust me, I know what I'm doing here. Now for health cost per capita: I want the imputed minimum to be 0, and I'll leave the maximum open — let it be whatever it needs to be — but I definitely don't want it negative, so the zero goes in. Maximum case draws and maximum parameter draws: you don't need to worry about those, just leave them as they are; sometimes you need to adjust them to make the algorithm run, but right now they're fine. Under Output I want the imputation model, the descriptive statistics, and an iteration history kept in a file in case I ever need to go back and check it, so I'll give it a name — I'll just call it iteration. Click OK.

This isn't so bad. There's a warning that it's an intercept-only model, because we didn't choose these variables as predictors in the imputation, but we did get our imputations; sometimes these warnings are more bark than bite. Here's the imputation method — fully conditional specification — and the variables it imputed on; it looks like all of them were imputed, so they all had something missing. And here's the model: linear regression was used to do the imputation for the different variables. One variable was missing two values but shows 10 imputed — why 10 when only two were missing? Because we chose 5 imputations, five different imputed data sets, so 2 × 5 = 10; you'll notice everything here comes out multiplied by 5. Then come the descriptive statistics; you just want to make sure they make sense to you. For the original data there were 9 records with an obesity rate in 2009, with their mean, standard deviation, minimum, and maximum. For the imputed values there are five different imputations, each with its own mean, standard deviation, minimum, and maximum — all between 0 and 100, but pretty low. And after imputation there are the completed data sets: five full data sets, 29 records in each, again with a mean, standard deviation, minimum, and maximum for each. It's always nice to compare: the original mean was around 13, and imputation brought it down quite a bit. You'll find that happens a lot — imputed values often look smaller, because the algorithm is working only from the information that's there — but they're within the ranges they could be in, so it looks okay. Look at the aged-over-65 variable: the original mean is around 14, the imputed-only values are very small, but the completed sets are pretty close.
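SPSS drives all of this through menus. For a rough Python analogue of regression-based, chained imputation with five imputations and non-negativity constraints, here is a hedged sketch using scikit-learn's IterativeImputer. The file name, the "Country" ID column, and the rule for spotting the percentage columns are assumptions; this mirrors the general idea of the workflow, not the exact SPSS algorithm.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("obesity_health_2009_2011.csv")   # hypothetical file name
num = df.drop(columns=["Country"])                 # assumed ID column name

m = 5                                              # number of imputations
imputed_sets = []
for i in range(m):
    imputer = IterativeImputer(
        sample_posterior=True,       # draw from a predictive distribution
        random_state=20140813 + i,   # fixed seeds -> reproducible results
        min_value=0,                 # constraint: no negative rates or costs
        max_iter=10,
    )
    filled = pd.DataFrame(imputer.fit_transform(num), columns=num.columns)

    # Extra cap at 100 for the percentage variables (assumed column names:
    # anything that is not a health-cost column is a percentage here).
    pct_cols = [c for c in num.columns if not c.startswith("HealthCost")]
    filled[pct_cols] = filled[pct_cols].clip(upper=100)
    imputed_sets.append(filled)
```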
Let's look down a little further — you have these tables for everything. Take health cost per capita for 2011: here's the original data, and here's the imputed-only data. In one case only two values were imputed, so you're getting a mean from two values — again smaller — but when you put them all together and look at the completed, imputed data sets, they're right up there, pretty close. So you just want to look those over and make sure they're sensible.

We now have three files: the working data we started with, the imputed data, and the iteration file. Let's go into the imputed data file and take a look. It has the raw data, which is what we're looking at now, plus the five imputations, and a new variable has been added called Imputation. If you scroll down to imputation 1, the little yellow cells are the values that have been imputed. Up at the top, Australia didn't have an obesity rate for 2009, and in the first imputed data set, there it is — the value has been filled in. You can also use the little drop-down box at the top, where it says Original data, to move to the first or second imputation set; it tells you which imputation you're in. Australia got 3.36 in the second imputation, and here are all the other records. Everything that's supposed to be a percentage looks like a percentage, and I don't see any negative numbers, so that's good; you can page all the way through the data sets.

Now we're going to use this imputed data to run a model — a regression. That's step 3, running the analysis. Go to Analyze > Regression, and notice the little snail-shell-looking icon next to some procedures — I call it a snail shell; call it whatever you want, a little curlicue — which means the procedure will use your imputed data set, and that's what you want. Click on it. I'm just going to look at 2011 today: the dependent variable will be health cost per capita, and the independent variables will be the obesity rate for 2011 and the percentage of people over 65 for 2011. Let's choose some statistics — I'll take confidence intervals — then go into Plots: I always like to look at the residual plots, so we put the predicted values on X and the residuals on Y, and I'll choose histograms and normal probability plots. I don't think we need anything else right now; we'll keep it simple, though of course you have all the other options. Click OK and it runs the regression — I don't know if you can see it on your screen, but it looks like it ran. Go into the output and click on Regression. Here's our original data, with the means and standard deviations for all
of them — and notice there are only seven complete cases. We would have no power at all if we only used those seven, so listwise deletion would definitely have been a bad idea. Then the imputed data: 29 records for everything. That's still a small sample size — remember, this is just a demonstration; for a regression with two predictors you'd probably want at least 64 cases — but we'll work with what we have. The difference is that you get five different models, and then the pooled data, and the pooled data is what you report. The pooled results take those five imputed data sets and combine them — essentially averaging across them — to come up with one set of values. You can see the mean health cost per capita is about three thousand in the original data and quite a bit more in the pooled data; the obesity rate is lower in the original than in the pooled data; aged over 65 is about the same. This looks pretty good.

Now the correlations. In the original data, the obesity rate was pretty strongly correlated with health cost per capita — as the obesity rate goes up, so does health care cost. Another correlation looks like it's going negative, but it's not significant at all because it's very small. And one is significant: a negative correlation between obesity and age, so as the percentage of people over 65 goes up, the percentage of people who are obese goes down, or vice versa — they move in opposite directions. That's the original data. In the pooled data there's not much significance, and the correlations are a little weaker — not as tight — and that often happens: you have more data, it's been imputed, and it's supposed to give you better estimates of your standard errors.

The model summary gives the R squared — the percentage of variance in the outcome accounted for by the predictors. In the original data, obesity and the percentage of people over 65 account for almost 30 percent of the variance in health care cost per capita. Below that are the five imputations: one of them is not a very good-looking model compared to the others, but the rest are pretty good. You don't see a pooled value in the model summary — that's a glitch in SPSS; for some reason they don't pool it, so you either have to do it by hand or find some other means, and supposedly they're working on fixing that — and the same holds for the ANOVA table: no pooled value. Basically, the overall model wasn't significant in the original data, but it was in the first imputation and a couple of the others at the 0.05 level of significance, which is the standard. Now look down at the coefficients, where we do get pooled data. In the original data nothing is significant — with only seven records I wouldn't expect it; there's no power. But if you look at the pooled data, which is what we'd report, and you're working at the 90 percent level of significance — rejecting at p less
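What SPSS reports as the pooled coefficients comes from Rubin's combining rules: average the estimates across the m fits, and combine within-imputation and between-imputation variance for the pooled standard errors. A minimal sketch follows, continuing from the imputation code earlier (it reuses the hypothetical imputed_sets list, and the 2011 column names are assumed for illustration).

```python
import numpy as np
import statsmodels.api as sm

m = len(imputed_sets)
coefs, variances = [], []
for filled in imputed_sets:
    X = sm.add_constant(filled[["ObesityRate2011", "Aged65_2011"]])
    y = filled["HealthCost2011"]
    fit = sm.OLS(y, X).fit()
    coefs.append(fit.params.to_numpy())
    variances.append(fit.bse.to_numpy() ** 2)

coefs = np.asarray(coefs)            # shape (m, n_params)
variances = np.asarray(variances)

q_bar = coefs.mean(axis=0)           # pooled estimates (simple average)
w_bar = variances.mean(axis=0)       # within-imputation variance
b = coefs.var(axis=0, ddof=1)        # between-imputation variance
total = w_bar + (1 + 1 / m) * b      # Rubin's total variance
pooled_se = np.sqrt(total)
print(np.round(q_bar, 2), np.round(pooled_se, 2))
```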
than 0.1 — then obesity would be significant; in the original data above, of course, it's not. At the 0.05 level — the 95 percent level of significance — there's still no significance, but if we were looking at the 90 percent level, then we could say yes, obesity is significant for that outcome, health care cost per capita. In that case you'd go back and say: for every 1 percent increase in the obesity rate, health care costs go up about 93 dollars, and the constant — roughly the average — is about 11,600 per person across the countries. That's pretty nice.

Here are the residuals; I like to look at the residuals, since everybody talks about the normal distribution and such. With a regression you want to check your residuals and make sure there's a nice scatter, roughly following the distribution you'd expect. In the raw data there's a big hole, but in the imputation models the holes are filled in; one is a little skewed, with a couple of points off by themselves, but they look better — the holes are basically patched. The probability plots: in the raw data you can count the seven data points, one, two, three, four — not much to look at — but they line up really nicely for all five imputations. The scatter plots around zero don't look too bad either; with only seven data points you can't really see a pattern anyway, but now, with 29, you get a better scatter, and they all look pretty good. So remember, you want to use the pooled results — and I guess that's about it; I ran a little bit over.

Let's go back into the PowerPoint for some more references. I didn't get a chance to do the SAS or Stata demos, but there's information here: seminars from the UCLA Institute for Digital Research and Education, a great online resource I mention every single time — go there, look at the annotated output, it's awesome. Here's multiple imputation in Stata, and here's some in SAS. And this is a nice easy read if you just want an overview to read at your leisure: a paper by Jeffrey Wayman. Check out our blog, Stats Chat, soon for more information — I'm planning on putting a forum on there too so we can talk back and forth. Here's my contact information, and you'll be getting a PDF handout with this information and the data set so you can play with it.

All right — questions. Let me see if I can make Heather live. Heather, are you there? ("I'm here — can everybody hear me?") Yes, people are saying they can hear you. ("Okay. I actually had a few questions. Jenny asked: she was looking at the imputed values that were produced, and they seemed really low to her compared to the data set. She was wondering, am I looking at this wrong?") No — they will look low a lot of times, in any kind of imputation procedure. With a single imputation, where there's just one value, of
course everything is going to bunch up at the mean. What multiple imputation does, though, is create the imputed values with the end result in mind: it adjusts them so that the pooled result comes out close to where the mean should be. So the imputed data sets on their own might seem small, but your pooled data — the average of everything once the imputed values are in there with the original data — should be closer. Let me scroll up and find an example (and Jenny is right — Jenny's always right). As you can see here, the imputed values are way lower, but in the completed data they come back up — not quite to where I'd like to see them, but remember there are only nine observed values for this variable. The algorithm — we used linear regression to figure this — takes into account what it does know and what it doesn't, so the completed value is probably much closer to what the real mean would be once everything is in. The algorithms account for missingness we can't see, so your imputed-only values will often be much smaller, but when you put them all together in the complete data they come back up. It's not true of every variable: here's aged over 65, where we had about 24 of the 29 records to begin with — almost the full set — so the imputed values land much closer. It's a function of how the procedure accounts for that missingness; I don't know the exact mechanics off the top of my head — I could find out, but it would be very technical, very labor-intensive, very math-heavy. The point is, if you have a lot of missing data going in, you'll probably get some slightly strange values in the imputed sets, but it should all come back together in your complete set with the pooled data.

("I have another question, from Rebecca. When the imputed values are actually created, what information is it taking from? Is it taking the data that's already in the set and using means? How is it actually working?") There are different algorithms. The one we used today was regression: we took all the data that was in the data set, threw it into regression models, and that helped us find the values that were missing on the variables. You can do it by hand with regression — the old-school way, where you run regression after regression and settle on a value — but SPSS does it for you. There are other ways to do it: if you go into R or some other programs you can choose other types of distributions, and that's where the expectation-maximization algorithms come in handy, because you can use a lot of different distributions; that's their advantage over multiple imputation. But in this case we used regression, we took all the data we did have, and it adjusts for the data we don't have — which is why you get those somewhat freaky-looking estimates on the imputed sets, not the pooled set; like I said, it all comes back together in the pooled data.
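For completeness, the whole chain — regression-based, chained-equations imputation plus Rubin-style pooling — is also packaged in statsmodels. This is a hedged sketch under the same assumed file and column names as the earlier examples; it mirrors the idea of the SPSS workflow, not its exact algorithm.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

df = pd.read_csv("obesity_health_2009_2011.csv")   # hypothetical file name
num = df.drop(columns=["Country"])                 # assumed ID column name

imp = mice.MICEData(num)                           # chained-equations imputer
model = mice.MICE("HealthCost2011 ~ ObesityRate2011 + Aged65_2011",
                  sm.OLS, imp)
results = model.fit(n_burnin=10, n_imputations=5)  # 5 imputed fits, pooled
print(results.summary())
```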
("It sounds like cheating — you only need to do five of these simulations, pool it all together, and you actually get something better than if you had left it missing?") Yes, actually, you do. It sounds like the numbers are coming out of nowhere, but we are taking into account the data that is there. ("Rebecca was also asking: if you have data that's missing not at random, will this still work?") Yes, and that's the nice thing about it, because, like I said, you can't tell what kind of missing data you have. The only assumption you can really make is that if less than 5 percent is missing, you treat it as missing completely at random; it could be missing at random, it could be a very systematic problem, but it's not going to hurt you much, because you're only getting rid of a little bit of data if you do decide to delete. Multiple imputation, though, will work for MCAR, MAR, and MNAR — missing not at random included; it works across the board.

("So it's not good just to delete cases that are missing — it would be best to try imputation?") The rule of thumb is that if you have less than 5 percent missing, you're most likely okay. Nothing's absolute in statistics — right, Heather? — and as soon as I say you can always delete when it's under 5 percent, somebody will prove me wrong. But for the most part, 5 percent or less missing isn't going to impact the data, unless your variability is huge and the data is all over the place. So yes, with 5 percent or less you could do listwise deletion and just pull those records out. If you have more than 5 percent, you can almost always assume there's something going on besides randomness, and that's when multiple imputation is useful. You could use multiple imputation even if you're only missing one or two records; I just don't know whether people would want to go through the trouble when they'd get similar results — although, as you saw, it's not that hard to do. ("Because if you were to keep only the complete cases and delete all the rest, you're losing a lot of data.") You can be, and that's the thing: if you don't have many records to begin with — say, this data set, with only 29 — the last thing I want to do is pull anybody out unless I have to. And to tell you the truth, the procedure won't impute a record that's missing across the board: Brazil was in this data set with no data on any variable; it was useless and had to come out. But as long as a record has a value on at least one variable, you should be able to impute.

("We only had five imputations in this presentation — why five, and not 50 or 100?") You only need a few: the literature, and Rubin's own work, has shown that anywhere from three to ten is usually optimal. You could do more, but there's a formula — a pretty simple one; I can't write it here, I'd need something to sketch on — where basically you take the rate of missing information, divide it by the number of imputations, raise it to a negative power, and add it to one.
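The formula being described from memory is Rubin's relative-efficiency result; written out in its usual form, with gamma the fraction of missing information and m the number of imputations:

```latex
% Rubin's relative efficiency of an estimate based on m imputations,
% where \gamma is the fraction of missing information:
\mathrm{RE} \;=\; \left(1 + \frac{\gamma}{m}\right)^{-1}
% Worked example: with \gamma = 0.30 and m = 5,
% (1 + 0.30/5)^{-1} \approx 0.94, i.e. about 94% of the efficiency of an
% infinite number of imputations.
```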
Okay, there's your math — it's hard to explain without showing it. But whatever your rate of missing information and your number of imputations, you can go from three to twenty to a hundred and you're going to get about the same precision, so there's no reason to do 100. You easily could — the software will happily run a hundred imputations — but you'll get very similar results to doing five and pooling them; the precision is that good.

("Jenny was asking, too, since the imputed values turn out so low: is there something else you would try — say, if they drift out of range, would you try to constrain the imputation further?") Right. I constrained those percentages to 0 to 100, but I doubt the obesity rate in many countries is over 50, so you might want to constrain it a little further. Check your research, check your literature, check what's plausible. I left every probable outcome in there, but it might be that you need to constrain it a little more. The big thing, though, Jenny, is not to pay too much attention to the imputed values themselves — you do want to check the imputed data set to make sure there's nothing weird in the actual values, but look mostly at the pooled data, the complete data after the imputation. If that looks okay, or not too far off, you should be all right. There may be other things you could try, but the main point of doing the imputation is to get the right standard-error estimates and the right variances; it takes the variance in your data into account, and in order to do that it sometimes has to give you slightly wonky numbers. If the complete data afterward looks legitimate and reasonable, keep it. The other option is to also run the model on the raw data — it shows you the raw-data results anyway — and compare your raw data to your pooled data to see how different the significance and the parameter estimates come out. In this case, though, the constants were very, very low in the original data — there's even a negative one, and the constant is supposed to be roughly your mean expenditure per capita, so that's probably not right. So the raw data didn't look good and the pooled data looked much better; sometimes your raw data will give you rotten results too. You want to make sure you're giving your model enough power and being reasonable about the variability — that's the big thing.

And we're about out of time; do we have any other questions? ("Nope, those were the only ones.") Okay, I guess that's it for today. If you have any other questions, send me an email or send Heather an email, and hopefully I'll get that
forum up soon, which would be great. We're going to send out the handouts and the data set, so you can play with this and probably come up with more questions. It was just meant to be an overview, but hopefully it was informative and whets your appetite for more of this — I think it's a great, nifty little thing to know. Thank you, everybody, for joining us, and thank you, Heather. ("Thank you.") And we will talk to everybody soon.
Info
Channel: Omega Statistics
Views: 28,371
Rating: 4.8550725 out of 5
Keywords: multiple imputation, missing data, missingness, missing at random, omega statistics, elaine eisenbeisz, dissertation help, clinical research, study design, data analysis, statistics, statistician, stats for the masses
Id: 27NSGTcWaPI
Length: 59min 33sec (3573 seconds)
Published: Sat Apr 29 2017