Dealing with MISSING Data! Data Imputation in R (Mean, Median, MICE!)

Video Statistics and Information

Captions
Hey everyone, welcome back to my channel, where I talk about all things tech and finance. In this video I'm going to be going over data imputation. In the data analytics realm you'll often come across sparse data sets, meaning that many of your observations have NA or null values, and you want a way to replace those nulls with actual values. That is where data imputation comes into play.

A really easy method is mean imputation: for each of your features, you take the average of that feature and replace all of its null values with that mean. The mean method is really simple, it preserves your sample size, and it won't shift the overall mean of that feature. The other very popular method is median imputation, which is a good choice when your data set has a lot of outliers; replacing the missing values with the median can clean the data up a little more.
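Both methods take one line each in base R. Here is a minimal sketch, assuming a hypothetical data frame df with a numeric column age (names made up for illustration):

# Toy data: a numeric feature with a couple of missing values.
df <- data.frame(age = c(25, 31, NA, 40, NA, 29))

# Mean imputation: replace each NA with the mean of the observed values.
df$age_mean <- ifelse(is.na(df$age), mean(df$age, na.rm = TRUE), df$age)

# Median imputation: same idea, but more robust to outliers.
df$age_median <- ifelse(is.na(df$age), median(df$age, na.rm = TRUE), df$age)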
Both of these methods are actually pretty limited, though. For one, they don't take into consideration the relationships among your other independent features, and they're also restricted in what data they can use: they're really good on numerical data but not so good on categorical data or strings.

One thing you should take into consideration when working with observations that have missing data: never remove observations just for the sake of having a quote-unquote clean data set. That usually leads to poor results, and you need a really good reason for removing observations, because simply deleting data from your data set hampers the true values within your overall data.

Now let's go over some of the more advanced techniques for imputing your data. K-nearest neighbors is based on the distance between observations: you group observations together and average the values of the neighbors. This method assumes numerical data; if you're using categorical data, you can instead build a machine learning model, such as a neural network, to impute the data, and that generally works well for both categorical and numerical data. The other option is to use a form of regression, where we find relationships between the features with missing values and the other features within our data set.

A really popular method is MICE, also known as multiple imputation by chained equations. It's commonly used in psychiatric research, and it can handle categorical or numerical features. The primary feature of MICE is that it executes multiple imputations instead of a single imputation, thereby maintaining the relationships among the features. A really neat thing about MICE is that the technique provides a statistical uncertainty measurement for the imputation, whereas mean or median imputation gives no uncertainty measurement at all, which can lead to biased data.

There are six main steps to MICE:
1. A simple imputation, such as the mean or median, is applied to every missing value in the data set. You can think of these values as placeholders.
2. Starting with the first feature that has imputed values, the placeholders for that feature are set back to missing.
3. The observed values of the feature from step 2 are regressed on all the other features: that feature becomes the dependent variable, and the simply imputed versions of the other features are the independent variables. The assumptions follow whichever regression technique you use.
4. The missing values of that dependent feature are replaced with the predictions of your regression model. Note that as you go on to impute other features, you use these updated observations as independent variables for the next imputation.
5. Repeat steps 2 through 4 for each variable that needs imputation; one cycle is a single pass through all of the features.
6. Keep repeating step 5 until your predefined number of cycles has been completed, with the imputations being updated at each cycle. The number of cycles varies among data sets, but a general rule of thumb is ten. The main point of the cycles is to make sure your predictions converge.
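To make those six steps concrete, here is a toy single-imputation version of the chained-equations loop, restricted to numeric columns. This is an illustrative sketch only, not the mice package's implementation: the real algorithm adds random draws, supports categorical models, and produces several completed data sets. All names here (chained_impute, dat, vars_with_na) are made up.

chained_impute <- function(dat, vars_with_na, n_cycles = 10) {
  # Remember which cells were originally missing.
  na_idx <- lapply(dat[vars_with_na], is.na)
  # Step 1: fill every NA with a simple placeholder (here, the column mean).
  for (v in vars_with_na) {
    dat[[v]][na_idx[[v]]] <- mean(dat[[v]], na.rm = TRUE)
  }
  # Step 6: repeat the whole pass for a fixed number of cycles.
  for (cycle in seq_len(n_cycles)) {
    # Steps 2-5: one pass through every feature that needs imputation.
    for (v in vars_with_na) {
      dat[[v]][na_idx[[v]]] <- NA                  # step 2: reset this feature's placeholders
      fit <- lm(reformulate(setdiff(names(dat), v), response = v),
                data = dat[!na_idx[[v]], ])        # step 3: regress it on the other features
      dat[[v]][na_idx[[v]]] <- predict(fit, newdata = dat[na_idx[[v]], ])  # step 4: predictions
    }
  }
  dat
}

# Example: two numeric columns with scattered NAs (toy data).
d <- data.frame(x = c(1, 2, NA, 4, 5), y = c(2.1, NA, 6.2, 8.0, 9.9))
chained_impute(d, c("x", "y"), n_cycles = 10)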
Now let's go real quick to the demonstration to see how these methods work.

Okay, so I've already loaded in my data. There are two very important packages we'll be working with: mice and tidyverse. mice has the imputation methods, and tidyverse is our data wrangler. If you're more interested in mice, I recommend checking out the package URL; the link is in the description. Essentially, what we're trying to do is predict what these NA values should be, given all of the other features we have. Counting the NAs, we'll be imputing 1,836 observations for workclass, 1,843 for occupation, and 583 for country. A general rule of thumb when working with imputation is that your missing data should represent only about five percent of your entire data set; it's a good threshold to abide by.

These are the three categorical variables we'll be working with. For workclass, these specific levels exist, and we can see an NA value right under "Private"; we're trying to convert each NA into one of the other eight levels. Same thing for occupation: there are 14 unique levels, and each NA needs to become one of them. For country, 42 values are shown, really 41 levels, since one of them is the NA. These are the categorical variables we'll be predicting. One thing to note is that all of the features here are categorical variables, in this case factors, but the same approach works for numerical data, since MICE will be regressing on these terms.

As for mean- and median-type imputation, it's really simple to do: to impute your NAs, you drop them when computing the statistic, then go to that specific column and replace the missing values with the mean or median, and that's really all you need to do for that kind of imputation. However, as I said earlier, this is not always the best use of our data, and there is of course a better way: MICE.

Just to keep in mind, I've subsetted the observations that have NA values for each affected feature, so we can track which observations are NA and what they become once we actually start imputing values; it's a double check on what our imputations look like relative to what we had before.

So let's begin the MICE process. The first thing to do is execute md.pattern, a function from the mice package. The way to interpret it: we have 15 features in total, and the blue squares represent observations that don't have any NAs. As we can see, about 30,000 of our 32,560 observations have no NAs, so there are roughly 2,560 observations with at least one NA; for example, one pattern has seven observations with NAs in occupation, another has 1,809 NA observations, and so on and so forth. From this you can determine where all of the empty or null values exist within the features we're working with. Do note that only three features have NA values, country, workclass, and occupation, and the red squares are in line with what we got earlier when we ran the query summing the NAs for workclass, occupation, and country.

An additional plot comes from the VIM library. It's a really nice visualization that gives you extra statistics on how many observations out of the total are empty or have a null value. The general rule of thumb, as I said earlier, is five percent, which is why the reference line is at five percent here; it just gives you an additional perspective on where the null values exist within your data set. Probably the most important point is that the missing counts here are roughly five percent, around 5.6 percent for two of the features, so I'll keep those features in the data set and not worry about them as we go. Note that if a feature were missing 20 percent or so of its values, it would probably be a good idea to remove that entire feature.

Now that we have a good general understanding of the data and which features need to be imputed, we can check which regression-type methods are available for generating the imputed values. There are about 31 imputation functions we can utilize. In this case I'll be using random forest ("rf"), because all of the values I'm imputing are categorical variables rather than numerical, and random forest is really good for this use case. It's really simple to do: you just execute the mice function and pass in your data along with m = 5. The m value is the number of completed data sets MICE will generate, and it defaults to five (the number of cycles per imputation is a separate argument, maxit, which also defaults to five); you can easily increase m to 10, 50, or 100, but the larger it is, and the larger your data set, the longer the run will take. For the method, all you have to include is "rf", though you could also choose quadratic regression, linear modeling, the mean, binary logistic regression, or any of the other available methods to pass into the overall mice call.
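Here is what that workflow looks like in code. It's a sketch under assumptions: the data frame name adult is hypothetical, and the column names (workclass, occupation, country) are the ones mentioned in the video.

library(mice)
library(VIM)

# Where are the NAs? Blue cells = observed, red cells = missing, one row per pattern.
md.pattern(adult)

# The VIM plot adds the proportion of missing values per feature.
aggr(adult, numbers = TRUE, sortVars = TRUE)

# The imputation methods that ship with mice (pmm, logreg, polyreg, rf, ...).
ls("package:mice", pattern = "mice.impute")

# Random-forest imputation handles factors and numeric columns alike.
# m = 5 completed data sets; maxit = 5 cycles per data set; seed for reproducibility.
imputed <- mice(adult, m = 5, method = "rf", maxit = 5, seed = 123)
summary(imputed)
imputed$imp$workclass   # five candidate values for every originally missing workclass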
In this case it took about three to five minutes. This is what the summary of our imputed data looks like: each feature that was imputed shows the term "rf", which is the regression-type technique that was applied, and the features with nothing between the quotation marks are the ones that had no missing values to impute.

Within the imputed object we can check out what the imputed values are, and we can go a step further to a specific feature. Checking workclass, these are the predicted values for this imputation method, and there are five columns, one per completed data set; every row inside is an imputed value for one originally missing observation. Looking at observation 27, a missing workclass value, each of the five data sets proposes a value, and the values differ a bit across data sets two, three, four, and five, so there's going to be some variety. You can choose whichever of these data sets you want, use it to complete your entire data set, and then go forth and start predicting based on that imputed data set. You can do the exact same thing for country and for occupation; it's pretty much the same, just a different set of imputed values for each feature.

But the real kicker is that you can then complete the data set. The way to do this is with the complete function: you pass in your imputed object and the index of the imputed data set to use; in this case I'll use the very first one. If we view the finished imputed data, these are the observations we're working with, and a quick check of the dimensions shows 32,560 rows. Since I used the first imputed data set, let's look at the 27th observation for occupation: it now reads "Exec-managerial". We can also check that it was imputed in the right place: in the original data, the 27th occupation value is NA. So there it is; that's how you produce the finished imputed data set, and you can then use it for whatever machine learning algorithm you want to apply.
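A minimal sketch of that completion step, reusing the hypothetical imputed object from the previous sketch:

# Fill in the NAs using the first of the five imputed data sets.
finished <- complete(imputed, 1)

dim(finished)               # all rows are kept; nothing is dropped
finished[27, "occupation"]  # previously NA, now an imputed factor level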
A really, really neat check that we don't have any other missing values is to use the sapply function: pass in the completed data set and count the NAs in each feature. As we can see, the NA count within each of our features is zero, so it seems like we are good to go.

If you're not really satisfied with using just one data set, you can technically use all of the data sets and get a standard error associated with the pooled result. To do this, you pass your imputed object into the with() function. I'm going to be fitting a generalized linear model, in this case a binomial (logistic) model, to predict our target variable. If we go to the data, the target variable indicates whether a given income is greater than 50,000 or less than or equal to 50,000; that's all I'm trying to predict, so make sure the family is set to binomial. I'll use only the three features that have imputed values, just to see how well these specific variables contribute to the overall model, and with() will fit the model on all five of the imputed data sets, which is really neat. Let it run; it should only take about 15 seconds.

There it is. We can check out the model, and each of our variables is split out by level, workclass Never-worked, workclass Self-employed, and so on, so there's some one-hot (dummy) encoding involved. All of the features are provided with their coefficients, and an AIC is reported. There are five fitted models, one per imputed data set; to make sure we're utilizing every one of them, we can pull up, for example, the fourth model and run summary() on it. Zooming out so everything fits on one screen, we get a p-value for each coefficient, telling us how much that particular level contributes to the overall model, and some of the individual levels are statistically significant, with p-values at or near zero.

One recommendation: duplicate your original data set and deliberately set some values you already know to NA. For instance, where we know an observation's fnlwgt weight is actually 83,311, we can make it an NA; if we randomly select a few additional observations and null them out while retaining their true values, we can determine how well our MICE method is actually doing at predicting them. I recommend doing that in addition to all the other steps you've done up to this point. And lastly, since you've created that duplicated data set, you can compare accuracy scores and get a great sense of the quality of the imputed data you're plugging into a given machine learning model.
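A sketch of that pooled analysis and the suggested accuracy check, again under assumptions: the income, workclass, occupation, and country column names come from the video, and finished, imputed, and adult are the hypothetical objects from the earlier sketches.

# No NAs should remain in the completed data set.
sapply(finished, function(x) sum(is.na(x)))

# Fit the logistic model on each of the five imputed data sets...
fit <- with(imputed, glm(income ~ workclass + occupation + country,
                         family = binomial))
summary(fit$analyses[[4]])  # ...inspect one of the five fits...
summary(pool(fit))          # ...and pool coefficients and standard errors (Rubin's rules).

# Suggested sanity check: blank out values you actually know, re-impute, compare.
adult_check <- adult
known_idx   <- sample(which(!is.na(adult_check$occupation)), 100)
truth       <- adult_check$occupation[known_idx]
adult_check$occupation[known_idx] <- NA
re_done <- complete(mice(adult_check, m = 1, method = "rf", seed = 42), 1)
mean(re_done$occupation[known_idx] == truth)  # share of correctly recovered values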
I hope you enjoyed this video. Make sure you leave a like and hit that subscribe button with notifications on, and I hope to see you in my next video. Thank you so much for watching!
Info
Channel: Spencer Pao
Views: 2,277
Rating: 4.891892 out of 5
Keywords: Data Imputation, Replacing NA, Rstudio, MICE, imputation, Mean, Median, Distribution
Id: MpnxwNXGV-E
Length: 19min 1sec (1141 seconds)
Published: Sun Mar 21 2021