Handling Missing Values using R

Captions
So today we'll go over how to handle missing data in R. When you are working with a lot of data and a lot of variables, it is very common to have a situation where some of the values are missing.

One way to handle missing data is to simply delete the row, but here is the problem: say we have 20 variables and one data point is missing. When we delete the entire row, we are not only deleting that one missing data point, we are also throwing away 19 good data points. That is not a very smart idea, and if there are many missing values and we start deleting every row where any value is missing, we very quickly end up throwing away a lot of good data. A second problem has to do with bias. Say we have data on vehicle failures, and the only data points that are missing are those where the vehicle is quite old. If we throw away all of that data, we lose information on the kinds of failures that occur in older vehicles, which obviously introduces some amount of bias.

Another way to handle missing data is to take the average of the non-missing values and replace the missing value with that average. If we have lots of data, maybe hundreds of thousands of rows, and we are changing just one value using the average, it may not have a big impact. But in many situations that is not the best or smartest way to handle missing data. For example, say we have data on five vehicle failures and the missing data point is the mileage at which the car failed. Suppose the average of the four mileage values we do have comes to 50,000 miles and we replace the missing value with 50,000. If that missing value belongs to a vehicle that failed after one month in service, a value of 50,000 miles is very unlikely; it would be very rare to see somebody drive 50,000 miles in one month. So obviously that is not a very smart approach either.

We are going to use a vehicle failure data file. This is a data frame with 1624 observations, so that many rows, and seven variables, so seven columns. The first column is just the vehicle number: 1, 2, 3, and so on. Then we have fm, the failure month; Mileage, the mileage at the time of the car failure; lh, the labor hours, so how much time it took to correct the problem; lc, the labor cost; mc, the material cost; and State, the state where the failure occurred.

Let's look at the summary. For numeric variables we get the minimum, first quartile, median, mean, third quartile, and maximum, and for a categorical variable like State we get frequencies: there were 290 failures reported in Texas, 199 in California, and so on. Apart from these summaries we also get counts of NA's, not available, which is basically missing data. For the Mileage variable there are thirteen data points missing, for labor hours six, for labor cost eight, and for State we have 15 missing values.

Let's calculate what percentage of data is missing for each variable. I'm going to save this in p, for proportion or percentage. We'll use function(x) and specify what this function does: we sum the missing values by checking is.na(x), divide by length(x), and then multiply by 100 to get a percentage. Then we wrap this in apply over our data; since the values we want to summarize are in columns, the margin is 2 (1 would be for rows), and then we print p.
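A minimal sketch of these first steps; the file name and the column names (fm, Mileage, lh, lc, mc, State) are assumptions, not confirmed in the video.

    # Load the vehicle failure data (hypothetical file and column names)
    data <- read.csv("vehicle_failures.csv", stringsAsFactors = TRUE)

    str(data)      # 1624 observations of 7 variables
    summary(data)  # NA counts reported for Mileage, lh, lc, and State

    # Percentage of missing values in each column
    p <- apply(data, 2, function(x) sum(is.na(x)) / length(x) * 100)
    p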
So we see that we don't have any missing data for the Vehicle or fm columns, but in the Mileage column about 0.8% of the values are missing; for lh it is about 0.369%, for lc about half a percent, and the largest share of missing values is in State, about 0.923%.

We can also make use of md.pattern, for missing data pattern, on the data. This gives us a table as well as a plot. The first row here is 1586 with a value of zero, which means there are 1586 rows with no missing data. Then there are 11 rows with exactly one data point missing, and that data point is in the State column; 13 rows where the Mileage value is missing; similarly, six rows with a data point missing from the lc, or labor cost, column; and then there are two rows with exactly two data points missing, one from lc and one from State. Reading column-wise, Vehicle has zero, so no missing data; fm has no missing data; mc has no missing data; and then there are six data points missing for labor hours, 8 for labor cost, 13 for Mileage, and 15 for State, so the largest number of missing data points is in the State column. In all, 42 data points are missing. This information is also depicted in the form of a heat map.

We can get another view of the missing data pattern by using md.pairs. This gives four tables. The first one is rr; these numbers indicate how many data points are observed for both variables in a pair. For example, for Vehicle all 1624 data points are observed, whereas for the Vehicle and State pair 1609 data points are observed. The second table is rm, observed versus missing, followed by mr, missing versus observed, and then mm, missing versus missing: two data points missing for the State and lh pair, two for the State and lc pair, and so on.

We can also do a marginal plot using marginplot. We'll use the data and, within it, all rows; for the columns, let's specify two: Mileage and lc, the labor cost. Blue dots are observed values and red dots mark missing values. There are 13 missing values for Mileage, and one box plot is based on the labor cost of the rows that do not have a value for Mileage; similarly, there are 8 values missing for labor cost, and the other box plot is based on the Mileage of the data points where labor cost is missing.

Now let's store the imputed data in impute. The function we use is mice, and within the data we will only use columns two to seven, because the first column is Vehicle, which is simply a vehicle number and has no predictive power, so I'm going to ignore it; for that we use square brackets and then a comma, which means all the rows but only columns two to seven. We can also specify how many imputations we want: I'm going to set the number of imputations, captured by m, to three. The default is five, but I'm going to go with only three. We can also set a random seed; let's use 123. You can see that it has done five iterations, and in each iteration it has done three different imputations, so imp is for imputation.

Let's do print(impute). This gives us more information: the number of multiple imputations, which we chose as three, and the imputation methods used for the missing values. fm, the failure month, doesn't have any missing values, so no method has been used. Mileage is a numeric variable, and the default method for numeric missing values is predictive mean matching, pmm, so that is what has been used for Mileage, labor hours, and labor cost. Material cost has no missing values, so nothing has been used, and State is a factor, or categorical variable, so the method used for imputation is polyreg, which is basically multinomial logistic regression.
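A sketch of these exploration and imputation calls, assuming the mice and VIM packages and the hypothetical column names used above:

    library(mice)
    library(VIM)   # provides marginplot()

    md.pattern(data)   # missing-data pattern table plus heat-map style plot
    md.pairs(data)     # rr, rm, mr, mm tables for every pair of variables

    # Margin plot of Mileage vs labor cost: blue = observed, red = missing
    marginplot(data[, c("Mileage", "lc")])

    # Impute columns 2 to 7 (drop the vehicle number); three imputations, fixed seed
    impute <- mice(data[, 2:7], m = 3, seed = 123)
    print(impute)      # pmm for numeric variables, polyreg for State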
Let's look at some of the imputed values; let's choose Mileage here. It lists all the rows where the Mileage value was missing and gives the estimates from the first, second, and third imputations. Let's look at a few of these values to understand what it has done. Take the twentieth row: within the data we look at the twentieth row and all columns, that is, the 20th vehicle that failed, which has a missing Mileage. The estimates from the three imputations range from 5777 to more than 15,000, and the first imputation suggests a value of 7726 miles. Let's also look at summary(data$Mileage): the average is about twenty thousand five hundred, so if we had replaced all missing values by the average, this value would have been close to 20,500, which is not impossible considering eight months in service; maybe there is a customer who drives a lot and accumulates 20,500 miles within eight months.

But now look at the 253rd car failure. Note that this car failed after one month in service, and if we replaced its missing value with 20,559, that obviously does not seem reasonable; it would be very rare for somebody to drive 20,000 miles within a month. The method we used suggests imputed values from about 45 at the minimum to 8370 at the maximum. If we go with the first imputation, the value is 45, which looks reasonable for one month, and even if we go with the second one, which says 8370, that is still much better than 20,559.

Once we have done the imputations, we can get a complete data set using the complete function, and we can specify which imputation we want to use: first, second, or third. Say we want the first one; this gives us a complete data set where missing values are replaced by the first imputation. For example, the first missing Mileage was for the 19th car failure, and the imputed value is 72,355; if you look at the 19th failure, you see the missing value is now replaced by 72,355. Similarly, if you want to see which values were filled in for the State variable, the first missing data point is for the 68th failure, and that value has been replaced by Texas, so for row 68 we now have Texas instead of a missing value. Once we decide which imputation we want to use, say the first one, we can save the result in, say, new data, and that can be used for developing prediction or classification models.

We can also look at the distributions using a strip plot; let's use pch = 20 and size cex = 1.2. Blue dots are the actually observed values, which are available; in fact, for fm everything is blue. For Mileage, the first column of points is based on the original data, and the next one is for the first imputation, where the red dots indicate the mileage values estimated by the method used in that imputation; similarly, the next is for the second imputation and the last for the third, and we do not see any unusual pattern here.
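A sketch of inspecting the imputations and building the completed data set; the column names and the row numbers in the comments are taken from the narration and should be treated as assumptions.

    # Imputed mileage values for every row where Mileage was missing
    impute$imp$Mileage

    # Inspect the individual failures discussed above
    data[20, ]              # 20th failure: eight months in service, Mileage missing
    data[253, ]             # 253rd failure: one month in service, Mileage missing
    summary(data$Mileage)   # mean mileage is roughly 20,500

    # Complete data set using the first imputation, saved for later modeling
    new_data <- complete(impute, 1)
    new_data[19, ]          # first missing Mileage, now filled in
    new_data[68, ]          # first missing State, now filled in (Texas)

    # Distributions of observed (blue) vs imputed (red) values
    stripplot(impute, pch = 20, cex = 1.2)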
We can also do a scatter plot of two variables; let's do labor cost versus labor hours. One panel is for the original data, and the next shows labor cost versus labor hours based on the first imputation, where the red points are the missing values that have now been estimated; you can see that they fit in very well with the rest of the data set, and we don't see anything unusual. The next panel is for the second imputation and the last one for the third. (A sketch of this plot call follows below.)

Okay, so we went over an example of how we can use the mice package to handle missing values. Even if we just use the default settings, in most cases it works very smoothly, and it is able to handle missing data whether the variable is a factor or numeric. So it can handle missing values more smartly than simply replacing them with the average or the most frequent value, or just deleting them.
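For reference, a sketch of the scatter-plot comparison described above, using the lattice-style xyplot method provided by mice (the column names lc and lh are assumptions):

    # Labor cost vs labor hours, one panel per imputation (.imp = 0 is the observed data)
    xyplot(impute, lc ~ lh | .imp, pch = 20, cex = 1.2)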
Info
Channel: Dr. Bharatendra Rai
Views: 34,834
Rating: 4.9612722 out of 5
Keywords: data science, machine learning
Id: An7nPLJ0fsg
Length: 16min 6sec (966 seconds)
Published: Sun Aug 26 2018