Dealing with Missing Data in R

Captions
Hello, good afternoon everyone, welcome to LiquidBrain. Today I want to talk about data imputation in R, specifically with the mice package. There will be a few chapters in this video. First I'll go through some slides on the when, what and why of imputation. Then we'll go into RStudio, where there is an Rmd file for you to follow along with some simple imputation methods: mean imputation, LOCF and NOCB, and k-nearest neighbours. After that we'll move on to more advanced imputation with the mice package, where I'll showcase two different types of imputation, predictive mean matching (PMM) and random forest (RF), and compare their performance on a sample data set. Finally we'll do a practical: download data from The Cancer Genome Atlas, extract and manipulate it, impute the gene expression values, and evaluate how the different imputation algorithms perform.

First of all, what is imputation? Imputation is the process of replacing missing values with substitute data. It is usually done as a pre-processing step for downstream analyses that cannot accept missing values. When data is missing from a data set, we do not want to immediately throw those records away, because that is wasteful. Instead we try to fill in the data based on the trends we observe, so that even though some values are ones we predicted ourselves, they should not change the narrative or the overall trend of the data set.

There are a few ways data can be missing from a data set: missing completely at random (MCAR), missing at random (MAR), and non-random missing data. Let's first separate random from non-random. Random missingness happens with things like data rot, where some values are lost from your hard drive; a paper survey that got rained on; or a patient who forgot to come in for a checkup on a particular day, so that day's data is missing. Non-random missing data is more complicated: a whole group of data is gone, for example you are comparing cancer and normal patients and suddenly all the cancer patients are missing because they were on a thumb drive you dropped into a drain. Non-random missing data we can't really impute, because it is very difficult to recover the trend of an entire missing group. For data missing at random or completely at random, such as a patient missing a single visit, we can interpolate from the data before and after, which is a lot easier. Missing completely at random is a bit more random than missing at random: completely at random is like rain falling on random spots of the survey, while missing at random is more like running two analyses on different days so that certain days have gaps. Either way, those two are much easier to deal with; non-random missing data is very difficult to impute.

So what is the effectiveness of imputation, and how do you measure the performance of an imputation method?
There is no single answer, because no method is universally good; interpolating data is usually a lot easier than extrapolating it, for example. In this video specifically, we will measure effectiveness by artificially removing data and then imputing it back, so we can compare the original data against the imputed data, check the accuracy, and define that as the effectiveness of the method. It is much harder in the real world, where you do not have the original values to compare against.

Mean imputation will be the first method because it is the easiest: fill the NAs with the mean for continuous data, or the mode or median for categorical data. It simply fills every missing value with the mean, so that when you aggregate the data the overall mean does not change. It is simple, but it is usually not suitable unless the mean is the only thing your next analysis cares about.

Random forest imputation is slightly more advanced: we build a random forest model on the data set and then use that model to predict the missing values. You can think of it like a linear model, where you fit a line y = mx + c between two variables and can then easily use that regression line to predict x from y; it is the same concept, except we use a random forest model instead of a linear model. Everything else is roughly the same idea.

Next are LOCF and NOCB, last observation carried forward and next observation carried backward, which are sometimes run together as a fallback: you run LOCF first and then NOCB, so that nothing is missed out. This is particularly useful for time series data. For example, if a patient forgets to come in for a weekly checkup, we can just carry forward their last checkup values to that day, because between checkups the values are unlikely to differ much. If your weather station loses power for a day, you can fill in that day's weather with LOCF and NOCB as well. Usually it works, but it depends on the situation: how large is the time gap, and would you rather use the mean of the two neighbouring values, or a moving average? There are many things you can choose from.

The next one is k-nearest neighbours. kNN is a clustering/classification algorithm that was not originally designed for imputation, but similar to what we did with random forest, kNN builds a model on the non-missing data and then uses it to predict the missing values, and we can do that for every row. In particular, if only one column of your data is missing, you can easily use kNN to predict all of those values back. I will go into a bit more detail on how to run it in R, but these figures are just some material I found online that helped me a lot in understanding how it works.
The idea is something like this: you build a model on the existing data, and then for a missing point you look at, say, the five nearest neighbours around it and use them to predict, or impute, the value.

Within mice there are many different ways to do imputation, from predictive mean matching to lasso linear regression to multilevel methods such as level-1 normal homoscedastic models or class means. There is no single correct choice: you have to think about whether you are imputing categorical, numerical or binary data, and so on, and usually people read a few papers and run some tests to see what happens. For example, we talked about mean imputation just now, but there is a more advanced version called predictive mean matching, which involves fitting a linear regression model, predicting a distribution for the missing value, picking a few candidate points from that distribution, and using the mean of those points as the newly imputed value. It is more complicated, but it is also more accurate and more suitable for most data sets. I also have a diagram of how random forest imputation works: same thing, it builds a model and uses it to predict the missing data, although the diagram itself is admittedly quite confusing.

The last part of the R script imputes data from TCGA, The Cancer Genome Atlas program, which contains data from a lot of patients, so it is great for running tests like this where you need a big raw data set. With that, let's move on to R. You can download the R script from the link in the video description: go to GitHub, download it, and you can follow the rest of the video from there.

Once you have downloaded the script and opened the Rmd file, it should look something like this. If you are not familiar with Rmd files, everything inside the shaded chunks is R code and everything outside is just text, which is not executable. For example, here I have included the chapter information for what we will go through later in the Rmd, and the yellow lines are the chapter titles or headers of each section. We will go through what we just talked about: mean imputation, LOCF, kNN, then mice imputation with PMM and random forest, and we will finish the video by downloading TCGA patient data, extracting and inspecting it, doing some data manipulation, running advanced imputation, and comparing the effectiveness of a few different imputation methods.

So let's start with the first one, mean imputation. Mean imputation is about the oldest technique in the book: we just fill in the missing values with the mean. For example, here we create a very simple vector that contains a few numbers and a couple of NAs. We calculate the mean of the observed values, which in this case is 4.75, and fill each NA with that mean, so both NAs become 4.75. That is a simple way to do it.
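To make this concrete, here is a minimal sketch of mean imputation in R, first on a small vector (my own example values, not the exact vector in the video) and then wrapped as a reusable function, the way the next step applies it to an iris column; the choice of Sepal.Width is an assumption, since the transcript does not name the column clearly.

    x <- c(4, NA, 7, 5, 7, 1, NA)
    x[is.na(x)] <- mean(x, na.rm = TRUE)   # every NA becomes the mean of the observed values
    x

    impute_mean <- function(v) {
      v[is.na(v)] <- mean(v, na.rm = TRUE)
      v
    }
    iris_na <- iris
    iris_na$Sepal.Width[c(2, 4, 6, 8, 10)] <- NA            # rows blanked out, as in the walkthrough
    iris_na$Sepal.Width <- impute_mean(iris_na$Sepal.Width)
    iris_na$Sepal.Width[c(2, 4, 6, 8, 10)]                  # all now equal to the column mean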
You can also do it with the iris data; it is exactly the same method, you just change the data frame, and just make sure that the function we are using here is given one-dimensional data. Instead of doing that column by column, we can write a simple function and apply it to, say, the Sepal.Width column, which simplifies the whole calculation to a single line. What I do here is set rows 2, 4, 6, 8 and 10 to NA and then impute them; when I print those rows afterwards you can see they all hold the same value, which is the mean of the observed values in that column. So that is mean imputation: very straightforward, very easy, but not very useful most of the time, because the mean by itself does not represent very much.

For contrast, let's look at the second method: LOCF and NOCB, last observation carried forward and next observation carried backward. We will use a package for this, which you can install from the Packages tab: press Install, type the package name into the box, and it will be loaded into your environment. Looking at LOCF first, you can see that the 2 is carried forward over the NA next to it, the 3 is carried forward to the next position, and the 4 as well, while NOCB is exactly the same but in the opposite direction, so the 1, the 3 and the 4 are carried backwards instead. Of course, either direction on its own can leave NAs at one end, which is why we usually run the fallback: LOCF first and then NOCB. Everything is carried forward first and then carried backward, so the leading NA is eliminated by the backward pass and you are left with no trailing NAs at the start or the end of the data.

We can compare the performance of LOCF and NOCB in the situation here, where I randomly select 10 values from a data set, in this case the airquality data. We create a new object called air-quality-missing, randomly replace 10 of the temperature values with NA, and use the fallback to impute them back in. Next we extract the affected positions as original data and imputed data: the original data is what was there before we filled it with NA, and the imputed data is what came out afterwards. After converting them to numeric we can plot them against each other. In a perfect scenario the original and imputed values would be exactly the same and would lie on the diagonal line here, but as you can see they sometimes do not; the fallback algorithm does not perform well all the time, but in this case it is good enough. Even though some of the imputed values are quite different from the originals, the data still falls within the overall trend of the data set.
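Here is a rough sketch of LOCF, NOCB and the LOCF-then-NOCB fallback. The package name is not clear from the audio, so this uses zoo::na.locf as one common implementation (imputeTS offers equivalent functions); the small example vector and the airquality comparison are simplified reconstructions of the steps described above.

    library(zoo)

    x <- c(NA, 2, NA, 3, NA, 4, NA)
    na.locf(x, na.rm = FALSE)                    # LOCF: carry the last observation forward
    na.locf(x, fromLast = TRUE, na.rm = FALSE)   # NOCB: carry the next observation backward
    na.locf(na.locf(x, na.rm = FALSE),           # fallback: LOCF first, then NOCB to
            fromLast = TRUE, na.rm = FALSE)      #   mop up any leading NAs

    # Hide 10 temperature values in airquality, impute with the fallback, and compare
    set.seed(1)
    aq_miss <- airquality
    idx <- sample(nrow(aq_miss), 10)
    truth <- aq_miss$Temp[idx]
    aq_miss$Temp[idx] <- NA
    aq_miss$Temp <- na.locf(na.locf(aq_miss$Temp, na.rm = FALSE),
                            fromLast = TRUE, na.rm = FALSE)
    plot(truth, aq_miss$Temp[idx]); abline(0, 1)  # perfect imputation would lie on the diagonal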
I don't think it would cause a problem for your downstream analysis, for example if you are feeding it into a model in the next step. Personally I might use something like a moving average instead, but the fallback of LOCF and NOCB makes sense for this case. Of course, run it on your own data and see how it performs; there is no single correct answer to how to do things here.

The next one is kNN: how kNN works, and then how kNN imputation works. We will run through kNN classification first, and from that it should be easy to see how kNN imputation works, because the imputation itself is again a single line. First we use iris as our sample data set and create a data frame called df, which is the same as iris. We then write a min-max normalization function, which maps the largest value to 1 and the smallest to 0, and apply it to columns 1 to 4. That gives us a new object, df_norm, with the four numeric columns (the sepal and petal lengths and widths) scaled so the largest value is 1 and the smallest is 0; it is min-max normalized. Once that is done, we sample a training set and a testing set. Remember, we are not running imputation here, we are running a classification algorithm: the point is that kNN should be able to predict the species in the testing set based on the training set data. I first used a k of 30, which is a bit too much, so let's just run it with 3, since we know there are three species. As you can see, the training accuracy of kNN is pretty good, something like 98%, meaning we can predict the species almost perfectly with the model we have. Remember, kNN here is a classification model: we train a model on the training set and use it to predict the testing set, and the accuracy is pretty high just from the petal and sepal measurements. We then write an accuracy function and see that on the test set the model is about 93% accurate, which is what we want; when you are fitting a clustering or classification model you do not actually want 100% accuracy, because that would point to an overfitting problem.

kNN imputation works very similarly, although the packaged function is not as transparent about how it runs. Here we do the same kind of thing and create df_missing, where some of the Sepal.Length values are set to NA. Running kNN imputation is then literally a single line: it uses the data from the other three columns to build a model and uses that model to rebuild the missing values in the first column, Sepal.Length. That is basically the concept; the rest is just performance evaluation, so let's run it and see what happens.
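A sketch of the kNN classification step, assuming the class package's knn() (the transcript does not name the package); the 70/30 split and the seed are my own choices. For the imputation itself, a one-line call such as VIM::kNN(df_missing) is one common option, though again the exact function used in the video is not specified.

    library(class)

    min_max <- function(v) (v - min(v)) / (max(v) - min(v))   # scale a column to [0, 1]
    df_norm <- as.data.frame(lapply(iris[, 1:4], min_max))

    set.seed(1)
    train_idx <- sample(nrow(df_norm), round(0.7 * nrow(df_norm)))
    pred <- knn(train = df_norm[train_idx, ],
                test  = df_norm[-train_idx, ],
                cl    = iris$Species[train_idx],
                k     = 3)
    mean(pred == iris$Species[-train_idx])    # classification accuracy on the held-out set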
You can see that with kNN imputation the original and the imputed values are pretty close together; we can't really tell much of a difference. If you look at the percentage difference between them, though, some of them are actually pretty big, which means kNN might not be the best approach here for imputing this data. I also tried varying the number of neighbours, and most settings perform fairly similarly; I would say the best is about three neighbours, which matches the three species, and with three the imputation errors are reasonably balanced between positive and negative. With one neighbour the negative errors are quite strong, and with ten the results are very similar to three. It depends on how you want to define accuracy, but in this case I would probably not use kNN to impute something like Sepal.Length.

That concludes our simple imputation with mean, LOCF and NOCB, and kNN. Now we move on to a library called mice, which stands for Multivariate Imputation by Chained Equations. We will not go too deep into the algorithm, and will focus more on the execution and the practical side of how it is run. The actual call is just a single line, but let's go from the beginning. First we create a sample data set from airquality and set the first seven values of column 3 (Wind) and the first five values of column 4 (Temp) to NA, so column 3 has some missing wind data and column 4 has some missing temperature data. There is also some missing data that comes with the original data set; we are not going to look at it, just keep in mind that it is there. Then we just run mice. The only argument you really need to know about here is m = 3, which gives you three sets of imputed values, and we can use the complete() command to turn the result back into a filled-in data frame. If we take a look at the imputation object for the temperature data, it contains quite a lot, including the original input data and the imputed values, so I will not go through all of it, but what you need to know is that every missing value has been imputed with three different candidate values. You can also see which imputation technique was used for each column, and why Month and Day have no imputation method assigned: there is no missing data in those columns. There is also the predictor matrix showing which variables are used to predict which. In the end, you combine the data back with the complete() function, and in the completed data you can see that the first seven values of column 3 and the first five values of column 4 have now been filled in, together with the ozone, solar radiation and everything else.
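A sketch of the mice PMM run just described, remembering that column 3 of airquality is Wind and column 4 is Temp; the seed is my own addition for reproducibility.

    library(mice)

    aq <- airquality
    aq[1:7, 3] <- NA     # first seven Wind values
    aq[1:5, 4] <- NA     # first five Temp values

    imp <- mice(aq, m = 3, method = "pmm", seed = 1, printFlag = FALSE)
    imp$imp$Temp          # the three candidate values drawn for each missing Temp entry
    imp$method            # which imputation technique was used for each column
    imp$predictorMatrix   # which variables are used to predict which
    aq_pmm <- complete(imp)   # a filled-in copy of the data (the first imputed set)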
So how do we evaluate the difference? First of all, I don't like to trust a single number, so here are the raw values side by side: the first column is the original airquality data before we removed anything, and the second is the data imputed using PMM, the predictive mean matching method. You can see they are close enough, although column 3 (Wind) might not be that well suited; some pairs are quite different, while others are good enough and quite close. Now let's look at column 4, the temperature; remember the temperature here is in Fahrenheit, which is why the values are so high. Column 4 is a bit of an issue, because some of the predicted values are not that similar to the originals. That may be because the other columns have missing values as well, so there is relatively little data to predict from. I would expect the wind and temperature columns to be a problem, because the model cannot fully use ozone and solar radiation to predict them, so I would predict their errors to be slightly bigger, which is roughly true, though not in every case. So that is predictive mean matching: mice builds its models, the magic happens in the background, you get imputed data, and this is the performance.

The great thing about mice is that it is not limited to predictive mean matching; it has a large range of methods. For example, the method I am using now is random forest. We do exactly the same thing: impute three sets of data and join them into one with the complete() function again. Be aware that on a big data set this can take quite a while; if you are running it on ten thousand genes it could take an hour or two, depending on your computer. This time, instead of just looking at the raw numbers like we did with PMM, we will plot the performance of our two imputation algorithms. The blue line is the original data from columns 3 and 4, the red line is predictive mean matching (PMM), and the green line is random forest (RF). For column 3 they are actually not that far apart, although the green line has a big spike here. Overall the red line performs better when the values are lower, and when the values go higher I would say the red line still stays closer to the trend of the data, while the green line at times runs against the trend. Neither performs outstandingly well, though, probably because there is simply not enough data in the other columns to make the prediction in the first place. But this is one way we can compare whether an imputation algorithm actually works or not.
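A sketch of the random forest run and the comparison plot. It repeats the setup so it can run on its own; mice's "rf" method needs the randomForest package installed, and the plot is a simplified stand-in for the figure described above.

    library(mice)

    aq <- airquality
    aq[1:7, 3] <- NA
    aq[1:5, 4] <- NA
    imp_pmm <- mice(aq, m = 3, method = "pmm", seed = 1, printFlag = FALSE)
    imp_rf  <- mice(aq, m = 3, method = "rf",  seed = 1, printFlag = FALSE)

    orig <- airquality$Temp[1:5]           # the true temperatures we blanked out
    pmm  <- complete(imp_pmm)$Temp[1:5]
    rf   <- complete(imp_rf)$Temp[1:5]

    plot(orig, type = "b", col = "blue", ylim = range(c(orig, pmm, rf)),
         xlab = "blanked-out position", ylab = "Temp (F)")
    lines(pmm, type = "b", col = "red")    # predictive mean matching
    lines(rf,  type = "b", col = "green")  # random forest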
Now, using the same concept and the same mechanism, we can move on to TCGA and run an imputation on gene expression data. So let's look at how to download the TCGA data and then simulate a situation where data is missing, impute it, and measure the performance. We download TCGA data using the TCGAbiolinks package, and these are the exact calls you can use. I am targeting the gene expression data, which is why the query asks for transcriptome profiling, gene expression quantification and RNA sequencing. I am also subsetting by barcode, because I want enough columns to impute from later, but I do not want the data to be so big that you run out of RAM or it takes far too long to run. Once that is set up, run the query, run the download, run GDCprepare, and you get a SummarizedExperiment object. I am not going to run that live; I will load it directly from an RDS file I prepared earlier, because GDCprepare takes quite a while to run, so do not be surprised if it appears to freeze up.

If you are not familiar with the SummarizedExperiment object, the TCGA data has a few parts: rowRanges contains all the information about the genes, colData contains all the information about the patients (their age, their treatment, which mutations they have, and so on), and the assay is the part we are after, containing the gene expression data. So we extract the gene expression matrix here, using the TPM assay (tpm_unstrand), which gives us the raw values: over 60,000 genes across 36 samples, which is a lot. We no longer need to drop any extra columns at this point, so the first thing we do is take the log10 of the values, because raw gene expression data is heavily skewed and the algorithms have a hard time with it otherwise.
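A sketch of the TCGAbiolinks download and the extraction of the expression matrix. The project and the barcode subset are placeholders (the video does not show which project or barcodes it uses here), and the workflow type and assay name may differ between TCGAbiolinks versions.

    library(TCGAbiolinks)
    library(SummarizedExperiment)

    query <- GDCquery(
      project       = "TCGA-BRCA",                      # hypothetical example project
      data.category = "Transcriptome Profiling",
      data.type     = "Gene Expression Quantification",
      workflow.type = "STAR - Counts"
      # , barcode   = c(...)                            # optionally restrict to a few samples
    )
    GDCdownload(query)
    se <- GDCprepare(query)       # a SummarizedExperiment: rowRanges, colData, assays
    # saveRDS(se, "tcga_se.rds")  # worth caching, since GDCprepare is slow

    expr     <- assay(se, "tpm_unstrand")   # TPM matrix: genes in rows, samples in columns
    expr_log <- log10(expr + 1)             # +1 pseudocount (my addition) to avoid log10(0)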
Next we remove the genes with low counts. I start with a cutoff of 1 just to limit how much we have, but with 60,000 genes that does not reduce things enough, so I try 10, and after a little trial and error settle on a threshold that takes us from about 60,000 genes down to about 18,000. Then I keep only the first thousand of them. Why? Because I am making a video and do not have time to show you a run on ten thousand genes. If you are doing this for your own research, definitely include all of them in the imputation, because I do not know what the effect of selecting only a thousand is, but I cannot wait two hours for the imputation to run here.

What I do next is select 20% of the values using the sample() function and fill them with NA. Comparing the raw data with the missing data, the raw data has everything, while the missing data has 20% missing in the first column. I did it on the first column just to keep things easy, but remember that the imputation we ran on airquality worked across all the columns and imputed all of them; just make sure you do not have a row or column that is entirely missing, or the imputation will have a hard time. If only a single column is missing, it is usually easier.

The first imputation method I am going to use is random forest, because it obviously did quite well just now compared with PMM. Again, this is going to take some time, so do not be surprised if it runs for a while, especially if you are using a lot of genes. You may also see messages about logged events; mice produces quite a bit of output, which I suppress here with the print flag, so do not be too worried if that happens. The next method I am going to run is different: lasso select plus linear regression, which I feel is going to work better on my gene expression data, although because of the way I subsetted it I cannot be too confident, since I may have broken the trend by selecting a thousand genes out of 18,000. Still, it is one of the algorithms you can use, and I remember that elastic-net-style regularization is used a lot for gene expression normalization, so this is what I want to try.

The last part is similar to what I did above: visualizing the performance. I plot everything out, with blue as the original, red as the first imputation method (random forest), and green as the second (the lasso select plus linear regression model). The problem is that they all overlap, so it is really hard to tell them apart, but I can already tell they work quite well. Remember, only 20% of the data here is imputed; the rest of the thousand points is original data. Even so, I cannot really tell the imputed values apart from the original ones, so the imputation more or less works. Of course, eyeballing the overlap is not enough on its own.
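A sketch of the gene-expression experiment: blank out 20% of the first sample, impute with mice using the "rf" and "lasso.select.norm" methods (the latter needs glmnet installed), and compute a percentage error against the originals. It assumes the expr_log matrix from the sketch above; the low-count threshold and the seed are my own, and the random forest run can take a while on a thousand genes.

    library(mice)

    set.seed(1)
    gexp <- as.data.frame(expr_log[rowMeans(expr_log) > 1, ][1:1000, ])  # ~1,000 genes x 36 samples
    miss_idx <- sample(nrow(gexp), 0.2 * nrow(gexp))   # 20% of the genes
    truth <- gexp[miss_idx, 1]
    gexp_miss <- gexp
    gexp_miss[miss_idx, 1] <- NA                       # missing values in the first sample only

    imp_rf    <- mice(gexp_miss, m = 3, method = "rf", seed = 1, printFlag = FALSE)
    imp_lasso <- mice(gexp_miss, m = 3, method = "lasso.select.norm",
                      seed = 1, printFlag = FALSE)
    rf_vals    <- complete(imp_rf)[miss_idx, 1]
    lasso_vals <- complete(imp_lasso)[miss_idx, 1]

    # Percentage error against the original value, ordered by expression level
    ord <- order(truth)
    plot(truth[ord], 100 * (rf_vals[ord] - truth[ord]) / truth[ord],
         type = "l", col = "red", xlab = "original value (log10 TPM)", ylab = "% error")
    lines(truth[ord], 100 * (lasso_vals[ord] - truth[ord]) / truth[ord], col = "green")
    abline(h = 0, lty = 2)   # a perfect imputation would sit on this line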
What we want to do as well is cross-compare the performance of the algorithms against the size of the counts, because obviously if the count is large, the imputation has more to work with. So what I do here is calculate a percentage error for each imputed value against the original (roughly as in the sketch above): if the original count is large, being a certain amount off is not a big difference, but if the original is 10 and the imputed value is 100 away, that is a huge difference. This gives a kind of residual plot of the percentage difference of each imputed value against its original, across the thousand genes we included. I then draw two lines, where the x-axis is the original count size on a log10 scale, the red line represents random forest and the green line represents lasso select plus linear regression. You can see on the plot that the green line, the lasso select method, works much better than the random forest, which is what I anticipated, and that everything with a low count has a huge difference from the original; remember that if there were no difference from the original, the line would sit at zero, with some variation up and down. I am glad both algorithms show a fairly balanced profile, not drifting all the way up or all the way down, so they work okay, but the green one follows zero much more closely than the red one, apart from one big spike here. Overall I am happy with the result, and if I were to impute gene expression data I might actually go for the lasso select plus linear regression method.

That is basically what I wanted to say about imputation today. I am glad it worked out okay, and I hope you learned something about imputation. Do like and subscribe; there will be a lot more videos like this coming up. That's all, bye!
Info
Channel: LiquidBrain Bioinformatics
Views: 5,074
Keywords: Rstudio, Rstudio Tutorial, Bioinformatic, Machine Learning, R Programming, Statistical Analysis
Id: _BFMS1IefzE
Length: 33min 33sec (2013 seconds)
Published: Fri Jun 03 2022