Dealing With Missing Data - Multiple Imputation

Captions
In this video we'll be looking at a much more powerful way to deal with missing data, called multiple imputation. Before we get into the nitty-gritty of what multiple imputation is, let's talk about why the methods we've been using so far are actually types of single imputation. They're called single imputation methods because no matter what we do, whether it's mean imputation, median imputation, or hot-deck methods, in the end, for each of these missing values here (the question marks), we end up with a single value. We've computed that value in different ways depending on the method, but we just have a single value for each of those missing data points. The issue is that that value might be biased: it might just be a coincidence of the data we used to calculate it, and might not be representative of what that value truly is.

If that's a little bit confusing, a good way to think about it is why we take the mean of things instead of just using a single value as the representative. For example, if you're trying to figure out people's height, you wouldn't just go up to the first person you see, take their height, and say that's the average for all human beings. You would take a pretty big sample of different people, take their average, and say that's a better approximation of the average height of a human being. In the same way, instead of filling in a single answer for this missing value, multiple imputation methods fill in multiple values over here, run the analysis with a completed data set in each of those cases, and then average the final results to get a less biased estimate.

Now that's enough talking; let's get into how to actually use this method. What we'll go through in this video is just one type of multiple imputation. By no means is what we're going to do here the way to do multiple imputation; it's just a way to do it, and I'll explain generally what multiple imputation is. Here we have a slightly different example from our
previous missing data videos. Here we have a data set of distance from the library, so each household's distance from the closest library, and how much it owes in fines to that library. We might expect that the farther a house is from the library, the more it would owe, because they can't find the time to drive over there and their fines just keep increasing and increasing. This data set continues; let's say there are thousands, maybe one or two thousand rows, just lots of households. And we have a certain degree of missingness. We see here that this household 1.7 miles away owes 11 dollars, and so on, but we get to this first missing value here, where this household is 6.1 miles away and we don't have a fine amount. We would like to fill that in with something that makes sense.

So what we'll do is something a little bit more intelligent than even the hot-deck method we did in the previous video. We'll run a regression of fine amount against distance from the library for a subsample of these 1,000 or 2,000 points. Let's say we just take 50 of them, so I'll write that here: we take 50 data points and we plot them, and just for example let's say this is the kind of shape we're getting. Generally we see that the greater your distance from the library, the higher your fines. Given these 50 points, something we can obviously do is fit a regression line, an ordinary least squares regression, and get a best approximation line for these 50 data points. Keep in mind that these are only 50, a random sample of 50 from all these data points here, and of course these are only rows where we have both values, the distance and the fine amount, or else we couldn't plot a point on the y-axis.

Now you probably see where this is going. Let's say we want to figure out, if you're 6.1 miles away from the library, how much do you probably owe? We can go ahead and just find 6.1
miles, wherever that might be on the x-axis, trace that up to our best approximation line, and read off how much we think you owe based on the regression. Let's just say that's going to be something like 12 dollars. If we stopped here, this would still be a single imputation method, because we just have a single value, and if we go ahead and run all of our analyses, we're only doing it based on this single filled-in value for each of our missing values.

Here's where we get into why this is called multiple imputation: we don't just do this regression once, we do it over and over and over again. So I'm going to do one more example here. Remember how we took a random sample of 50? Let's say we take a different sample of 50 points. These are different points; some of them might be repeated, but they're not all going to be repeated, so the data is going to be somewhat different and we're going to get a different regression line. Let's say these are our 50 points, and this regression line, as we can see in orange, is a little bit shallower than the previous one. Of course that means we're going to get a different value than 12 over here. When we plug in 6.1 as the distance from the library, we get something less than 12 because of the shallower line; let's say we get something like 10. So we go ahead and fill in 10 here.

We just continue this on and on, and as the literature says, you don't really need too many of those. You don't need 50 or 100 different regressions; generally 5 to 10 is considered enough. The more you have, the less your final result depends on the luck of any one sample, but the con is that you're running a bunch of regressions, which might take a lot of time. So for example, we had 12, we have 10, and then let's say we get something higher like 15, and then we might have 13, and so on. We just have all these different values. The next step of multiple imputation is to treat each of these
regressions we run, and the results of them, as a separate complete data set. Here the first one we calculated was 12 dollars if you're 6.1 miles away from the library; treat that as one complete data set. Of course there are other missing values down here. For example, if someone was 7.1 miles away from the library and had a missing value, we would go over here, find 7.1, and trace up to however much they owe based on the regression line. So we fill in all the missing values based on that regression, and that's one complete data set.

Then, based on that complete data set, which used this one regression to fill in our missing values, we go ahead and do the analysis we want to do, whatever that is. Let's say in this case we're doing something as simple as finding the mean of all these fines. Of course this is super simple, and you can do much more complicated analyses and get more complicated results, but to keep this simple we're going to say we're just interested in the mean fine amount for all these people. Let's say after we do that analysis we get mu 1, the mean of the first filled-in data set based on this top regression up here, and let's say that mean is 10.9 dollars.

Now we do it again. With the second regression, we go ahead and fill in all new values for the missing values, forgetting the first regression ever existed, and we get a new mean for all of our fine amounts. Let's say here we get something lower, because generally it's going to be a lower line, so let's say we get 9.1 dollars for mu 2. And we do that again and again, so we have mu 3, mu 4, mu 5. Each of them has its own regression line, its own filled-in values for our missing values, and its own mean of the fine amount. That is the second-to-last step of multiple imputation, where we calculate the values we want to calculate. Here it's the mean, but you can do whatever else you want,
standard deviation, or even fancier things based on multiple columns, and you end up with these five values here, if you used five different regression lines to fill in your missing values.

The last step of multiple imputation is twofold. The first part is to take some kind of aggregate of all of your analyses. Here we're just taking a mean of these five means, so I'm going to call that mu bar, the mean of mu 1, mu 2, mu 3, mu 4, and mu 5. That gives us a less biased estimate of what the total fine amount would be, or rather what the mean fine amount would be, I should say.

The second part of this last step is to analyze the spread. Already we see that these means are not all the same, because they're based on different complete data sets from different regressions. So we want to analyze how far apart these means are on average. If they cluster pretty tightly together, they have a low standard deviation. If they're really far apart, we want to know that too, because it means that different regressions are giving us pretty wildly different values for our missing values, which ends up producing really different means in each case. So the first part of the last step of multiple imputation is to calculate some aggregate of all of your analyses, here the mean of the means, and the second part is to analyze how different your analyses are with some kind of standard deviation metric. That's the theme of multiple imputation.

Now I just want to point out some places where it's not necessary to do what I did, and you can do something else instead. Here I'm using an ordinary least squares regression as my single imputation method, which I later aggregate. You do not have to use a regression; you can use any kind of method you want to get these complete data sets. It's just that once you have all five to ten of these complete
data sets, you go ahead and find your values of interest, and then go ahead with the analysis step where you aggregate them and look at their standard deviation.

Before we end this video, I just want to list some of the pros and cons of multiple imputation as opposed to single imputation. The pros we've been talking about the whole video: it's less biased, and you can see that clearly, because instead of taking just one set of filled-in values for your missing values you're taking multiple ones, and you're kind of washing away the coincidence that you might have. The cons are that this is a more complicated method, so it might take a lot of computer time, and of course it will take more thinking on your part than the simple single imputation methods we used before this one.

Hopefully that helps you understand the gist of multiple imputation. The main point I want to get across is that by no means is every step I did something you have to do. The gist of multiple imputation is: use some kind of single imputation method to generate several complete data sets, do the analysis in each of those complete data sets, and then aggregate your results and look at the spread of your analyses to see if they're really different or really similar.
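
The procedure described in the captions can be sketched in Python. This is a minimal illustration under stated assumptions, not code from the video: the data is synthetic and generated only so the example runs, the function names `impute_once` and `multiple_imputation` are hypothetical, and NumPy's `polyfit` stands in for the ordinary least squares fit. Each round draws a random subsample of 50 complete rows, fits a line, fills in every missing fine from that line, and records the mean; the final step reports the mean of those means and their standard deviation.

```python
import numpy as np


def impute_once(distance, fine, rng, sample_size=50):
    """Fill missing fines using a line fitted to one random subsample."""
    complete = np.flatnonzero(~np.isnan(fine))
    # Random subsample of rows where both distance and fine are present.
    rows = rng.choice(complete, size=min(sample_size, complete.size), replace=False)
    # Degree-1 least squares fit: fine is approximated by slope * distance + intercept.
    slope, intercept = np.polyfit(distance[rows], fine[rows], deg=1)
    filled = fine.copy()
    missing = np.isnan(fine)
    filled[missing] = slope * distance[missing] + intercept
    return filled


def multiple_imputation(distance, fine, analysis=np.mean,
                        n_imputations=5, sample_size=50, seed=0):
    """Run the analysis on several completed data sets, then pool the results."""
    rng = np.random.default_rng(seed)
    results = np.array([
        analysis(impute_once(distance, fine, rng, sample_size))
        for _ in range(n_imputations)
    ])
    # Pooled estimate (mean of the results) and their spread (sample std).
    return results.mean(), results.std(ddof=1), results


# Synthetic data, made up purely for illustration (not from the video).
rng = np.random.default_rng(42)
n = 1000
distance = rng.uniform(0.0, 10.0, size=n)
true_fine = 2.0 + 1.5 * distance + rng.normal(0.0, 2.0, size=n)
fine = true_fine.copy()
fine[rng.random(n) < 0.2] = np.nan  # knock out roughly 20% of the fines

mu_bar, spread, means = multiple_imputation(distance, fine)
print("means of each completed data set:", np.round(means, 2))
print(f"mu bar = {mu_bar:.2f}, spread = {spread:.3f}")
```

Passing a different function as `analysis` (for example `np.std`) swaps the statistic being pooled. The spread here is only the standard deviation across the completed data sets, as in the walkthrough; fuller treatments also account for the uncertainty within each completed data set.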
Info
Channel: ritvikmath
Views: 26,080
Rating: 4.9622641 out of 5
Keywords: data science, missing data, machine learning, multiple imputation
Id: LMsULWGtP2c
Length: 11min 2sec (662 seconds)
Published: Mon Nov 05 2018