All about missing value imputation techniques | missing value imputation in machine learning

Captions
Let me ask you a simple question. Suppose you encounter some missing values in the data and you want to train an ML model using this data. What is the first thing that comes to your mind? Is it "how do I impute this missing value, how do I treat it?" or is it "why is this missing value here?" If your answer is "how do I impute this," then that approach is probably the less scientific one. The scientific approach would be to ask why this missing value is here: is there something I can observe about it before even going on to the treatment or imputation phase? Can I understand a little more about this missing value?

Welcome to Unfold Data Science. My name is Aman and I am a data scientist. In this video I am going to tell you all the details of missing value treatment in machine learning and deep learning. I am going to tell you what the different categories of missing values are, which kind of imputation or treatment will work on each of these categories one by one, and how you work out the category of your missing value by looking at the other features of your data. I am going to discuss many things in this video, guys, about missing value treatment and missing value observation. It is a very important video, so do not skip it; watch till the end. No need to put in much brainpower: it's not a hard video to understand, it's a simple one. Let me take you to the whiteboard and try to explain what can happen in your data with respect to missing values.

Okay, so guys, I am showing you one data set here, and with the help of this data we will try to understand all the categories of missing value, the different techniques to treat missing values, and the pros and cons of each technique. So let us understand the data first. What is this data? This is employee data, basically: you have employee name, you have gender, you have age, you have net worth for the employee. Let's say there is an
organization; this is the data for different employees of that organization. Fine.

Now, first we will understand the different types of missing value, the different categories or buckets of missing value that can be there in the data. I will first list them here. One category is known as missing completely at random, called MCAR. Another category is missing at random. The third category is not missing at random. And the fourth category is structurally missing, or structured missing. Fine.

Let us try to understand the first category, missing completely at random. What it means is that some data points are randomly missing, and that is completely random, which means you have no clue why a value is missing; you can't figure it out. For example, let me take the eraser and delete something from here, and something from here, and something from here. Whatever I am doing right now, I'm just deleting at random. I don't have any pattern, I don't have any reason behind what I'm deleting. This is a less common scenario in the industry: most of the time when there is a missing value, there will be some logical reason why it is missing. But if it is missing completely at random, then we call it a completely random missing value. One more property of completely random missingness is that if you try to impute it using some technique, for example if I put the next value in this box, or use any technique such as mean or median, then we are not sure how well that will work, because we do not know the reason the value is missing. So whatever I showed you just now is missing completely at random: we have no clue why it is happening, it is a less common occurrence in the real world, and we don't know which technique will work on
this and which will not. Fine.

Another category that you need to understand here is called missing at random. If you see, the word "completely" is gone: we are not saying it is completely random, but it is missing at random. How is it different from the previous one? Let me delete the age of this employee. Now, guys, how many female employees are here? You can see there are three female employees: one, two and three. Suppose that among these three, one female employee's age is missing. One logical approach would be: why don't we impute this value using the information from the other female employees, like 28 and 30? Can we use this information to impute the missing value here? If you are getting some idea of what a logical approach to imputing that value would be, then although the value is missing in a random kind of way, it is not completely random. You can say, "there is a female employee whose age is missing; we have the other female employees' ages, so maybe we can use those, and that will make sense." If you are getting into that mindset, this category of missing value is called missing at random. It is not completely random, where you are totally clueless; here you are able to think something about the missing value. So that is your missing at random category.

Let us come to the third category, which is known as not missing at random. What can be an example of this? Let me come here and delete these two entries. I am deleting them for a purpose; I will tell you why. Now suppose you come here and see that net worth is blank for these two people. What is one thing that you can observe about them? Age: these two people are in a higher age group. This person is in a high age group, this person is in a high age group, and this person is in a high age group too, but for
him net worth is not missing, while for these two people net worth is missing. So one possible reason can be that people in a higher age group are not very comfortable showing their net worth. How many people are in the higher age group? Three. Two of them are not showing it, hence their value is blank; one person is showing it. So can I logically take an approach where I take reference from people in the higher age group and impute these two values with a somewhat higher amount of confidence? This is not a random miss; it is a kind of logical miss, where you are able to think about why these values are missing. If you are able to think of a business reason why a value is missing, maybe it falls into the category of not missing at random. So this is your third category, not missing at random, because the missingness may be intentional.

And then comes structured missing. This is a type of missing value where you are 100 percent sure why the value is missing. I will take a good example here: I will go ahead and delete the ages of all the female employees. Delete this, delete this, and this. Now for all the female employees you don't have age. Here we can say with almost 100 percent confidence, "I have a reason for these missing values: female employees are not very willing to put their age on the table, or in the database." That can be a logical reason of which you are more or less 100 percent sure.

So you understand the difference, guys: when you move from here to here, your confidence about why the value is missing is increasing. In the first one you have no clue. In the second one you have some understanding of how you can treat it in a way that makes a little sense. In the third one you have more understanding: "okay, this might be the probable reason." And in the fourth one you are almost
hundred percent sure what the reason for the missing value can be. The more sure you are, the better you can treat the missing value. If you know the exact reason why a value is missing, then you can treat it very well. Another example of structured missing: suppose in census data you have one column saying whether the person is married or not, and another column for how many kids that person has. Every time the marriage column says unmarried, the number of kids may be missing. So you have 100 percent visibility into why that column is missing. That kind of thing is called structured missing. The more understanding you have of why your data is missing, the better the imputation you can do.

Now let us go to the imputation part. What are the different types of imputation? Guys, I am not sure which imputations you have heard of, but I will tell you a few that you should know. The first distinction I want to tell you about is univariate versus multivariate. What is the difference? Suppose I want to impute these three values here, just for example. Which columns will I include in imputing them? If I include only one column, that is age, then whether I do a mean imputation, a median imputation or whatever, it is a univariate imputation. If I include multiple columns, it becomes a multivariate imputation. That is the difference between univariate and multivariate.

Now, what are the different techniques, and how can we use them? Let us go one by one. The first imputation technique (I will just look at my notes once, because I do not want to miss the name) is listwise deletion. A simple one. What listwise deletion means is that in your data set, suppose any
one value is empty, for any of the columns. So I will take this first record and see if any value is missing: employee name is missing, or gender is missing, or age is missing, or net worth is missing. If so, I will remove that complete record. It is a very simple, straightforward thing: if any box, any entry, is missing, delete that complete record. That is called listwise deletion.

Now there are certain advantages and disadvantages as well, obviously. What disadvantages can you think of, guys? First of all, you are deleting the entire record, which means you are losing information. Machine learning is all about model training, so you are losing some of the training power of your algorithm. Second, you may be introducing some bias into the data: maybe you are deleting something important, and the things that remain are of less importance. So it may introduce bias. That is listwise deletion.

The second imputation technique all of you know: mean, median and mode imputation. I will not go into much detail on this. You take the mean of the column, or the median, or the mode, depending on whether the variable is categorical or numerical, and you put it in the gap. It is a simple technique that is taught in the first classes of data science training.

The third imputation technique that I want to talk about is called deck imputation. What is deck imputation, guys? There are basically two types: one is called cold deck imputation and the other is called hot deck imputation. What deck imputation will do is come here and find objects similar to the one you want to impute. Suppose in this record the value is not there and in these records the value is there. Let us put some values here: say 29 and 28, and this one value is not there. So what deck imputation will do is it will
find the similar-looking objects and try to use one of their values. There are two varieties, cold and hot. In hot deck imputation, it will go ahead and see which objects are similar to this last object. Say these two objects are similar: it will look at their ages, pick one age at random, and put it here. That is hot deck imputation. Cold deck imputation will not pick at random; it will follow some system. For example, it will take the third age, or the second, or the first; whatever system it follows, it will put that value in. This is called deck imputation.

The next imputation that I want you to learn is known as model-based imputation. There are different varieties of model-based imputation: you can use a KNN imputer, you can use an expectation-maximization imputer, you can use a maximum likelihood function, you can use regression. What I mean to say is that whatever you want to impute, you basically predict it using the other features. Suppose I want to impute the age of this employee. I will predict this age using the other features as independent features. How will I do that? I will train a machine learning model, of a kind that depends on whether I want to predict a numeric or a categorical data type. So all of these are different modeling techniques through which you can get what should be imputed in place of the missing value.

The next technique that I want to discuss is your subject-area or prior knowledge. You can also do some imputation from your prior knowledge, if you are a business expert in that domain. You don't need to depend on system-given imputation techniques; whatever you know is sensible, whatever makes more sense, you can put that
here. For example, if net worth is missing for this employee and you think a person of this age working in this company will have this much net worth, you can just put that value here. Prior knowledge or business knowledge, you can say.

So these are the different techniques, guys. Now, from your understanding, which technique will make sense for which missing bucket? Just a rule of thumb here: the more knowledge you have about why a value is missing, the more sure you can be about what will make sense. For example, mean, median and mode is a simple imputation technique, but I will not use it if this is a structured missing value. Why? Because I know why it is missing; I have 100 percent confidence about why it is missing, so I will want a better imputation technique. Maybe KNN will work better, maybe hot deck imputation will work better. I don't know; I will have to check that. And the less knowledge you have about the type of missingness, for example with missing completely at random, the more any of these techniques may work equally well for you; you don't know which one will work and which one will not. So what we have to do is first try to understand what type of missingness it is, then put some logic into what will and will not work, create a smaller set of imputation techniques, and try those techniques.

One more topic that I want to cover here. Whatever we have discussed till now is called single imputation. Why? Because we are imputing one value in place of each missing value. There is also something called multiple imputation. What multiple imputation will do is not impute the value only once; it will impute the value many times and then create a combined imputation. In R and Python both there are dedicated packages to do that, and it's a very
interesting concept. If you want me to discuss this in more detail, guys, I can cover this topic alone with some examples and show you in Python as well. You have to drop me comments saying you want me to discuss this, or any of these topics. If you think you will understand better if I explain, then I can go into more detail on model-based imputation, or deck imputation, or any of the others. Whatever topic gets 20-plus comments, I will create a video on that. Meanwhile, please give me a thumbs up, guys, if you think this video was useful, and please press the subscribe button if you have not done so yet. Also press the bell icon so that you receive all the notifications. See you all in the next video, guys. Wherever you are, stay safe and take care.
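
As a rough illustration of the listwise deletion, mean/median imputation and univariate-versus-multivariate ideas from the captions, here is a minimal Python sketch. The employee records, the function names and the choice of gender as the second column are all assumptions made for this example; the whiteboard values are not given in the captions.

```python
from statistics import mean, median

# Hypothetical employee records (values invented for illustration);
# None marks a missing value.
employees = [
    {"name": "A", "gender": "F", "age": 28, "net_worth": 50},
    {"name": "B", "gender": "F", "age": 30, "net_worth": 60},
    {"name": "C", "gender": "F", "age": None, "net_worth": 55},
    {"name": "D", "gender": "M", "age": 45, "net_worth": None},
    {"name": "E", "gender": "M", "age": 52, "net_worth": 120},
]


def listwise_delete(rows):
    """Drop every record that has at least one missing field."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_univariate(rows, column, stat=mean):
    """Fill gaps in `column` using only that column's observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = stat(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def impute_by_group(rows, column, group, stat=mean):
    """Fill gaps in `column` from observed values that share the same `group`.

    Uses a second column, so this is a (very small) multivariate imputation.
    Assumes each group has at least one observed value.
    """
    filled = []
    for r in rows:
        if r[column] is None:
            donors = [d[column] for d in rows
                      if d[group] == r[group] and d[column] is not None]
            r = {**r, column: stat(donors)}
        filled.append(r)
    return filled


complete = listwise_delete(employees)                    # records C and D are dropped
overall = impute_univariate(employees, "age", median)    # median of 28, 30, 45, 52
by_gender = impute_by_group(employees, "age", "gender")  # mean of 28 and 30

print(len(complete), overall[2]["age"], by_gender[2]["age"])
```

With tabular libraries the same steps are usually one-liners, for example pandas `dropna` and `fillna`; the plain-Python version is used here only to keep the logic visible.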
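
The hot deck and cold deck description in the captions can be sketched as follows. This follows the video's wording, where hot deck draws a donor value at random from similar records and cold deck picks one by a fixed rule; the records, the use of gender as the similarity criterion and the function names are assumptions for illustration.

```python
import random

# Hypothetical records; None marks the missing age to be imputed.
records = [
    {"gender": "F", "age": 29},
    {"gender": "F", "age": 28},
    {"gender": "M", "age": 41},
    {"gender": "F", "age": None},
]


def donors_for(target, rows, match_on="gender", column="age"):
    """Observed values from records that look similar to the target."""
    return [r[column] for r in rows
            if r[match_on] == target[match_on] and r[column] is not None]


def hot_deck(target, rows, rng):
    """Pick one donor value at random."""
    return rng.choice(donors_for(target, rows))


def cold_deck(target, rows, position=0):
    """Pick a donor value by a fixed rule: here, the donor at `position`."""
    return donors_for(target, rows)[position]


target = records[3]
rng = random.Random(0)  # seeded only so the sketch is repeatable
hot_value = hot_deck(target, records, rng)   # either 29 or 28
cold_value = cold_deck(target, records)      # always the first donor, 29

print(hot_value, cold_value)
```

In practice "similar" would be defined on more than one column; a single matching column keeps the sketch short.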
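
For the model-based imputation mentioned in the captions, a nearest-neighbour version is probably the easiest to sketch by hand. The code below is a simplified stand-in for a KNN imputer, not any library's implementation: it predicts a missing age from net worth using the mean of the `k` closest complete rows. The data, the single feature column and `k=2` are assumptions for illustration.

```python
from statistics import mean

# Hypothetical rows; None marks the age we want to predict.
rows = [
    {"age": 25, "net_worth": 20},
    {"age": 30, "net_worth": 40},
    {"age": 45, "net_worth": 90},
    {"age": 50, "net_worth": 110},
    {"age": None, "net_worth": 100},
]


def knn_impute(rows, target_col, feature_cols, k=2):
    """Fill gaps in `target_col` with the mean over the k nearest complete rows.

    Distance is Euclidean over `feature_cols`, which are assumed to have
    no missing values themselves.
    """
    complete = [r for r in rows if r[target_col] is not None]

    def distance(a, b):
        return sum((a[c] - b[c]) ** 2 for c in feature_cols) ** 0.5

    filled = []
    for r in rows:
        if r[target_col] is None:
            nearest = sorted(complete, key=lambda d: distance(r, d))[:k]
            r = {**r, target_col: mean([d[target_col] for d in nearest])}
        filled.append(r)
    return filled


filled = knn_impute(rows, "age", ["net_worth"])
print(filled[4]["age"])  # mean of the two nearest ages, 45 and 50
```

scikit-learn ships a `KNNImputer` class for real use; features on different scales would normally be standardized before computing distances.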
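
The "impute many times, then combine" idea behind multiple imputation can be illustrated with a deliberately simplified sketch. Each round fills the gaps with random draws from the observed values, an estimate (here the mean age) is computed per round, and the estimates are averaged. The data, the number of rounds `m` and the function names are assumptions; real multiple imputation procedures use a proper imputation model and also pool the variances, which this sketch does not attempt.

```python
import random
from statistics import mean

# Hypothetical age column; None marks a missing value.
ages = [28, 30, None, 45, None, 52]


def impute_once(values, rng):
    """One imputed copy: each gap gets a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


def pooled_mean(values, m=5, seed=0):
    """Impute m times, estimate the mean on each copy, then average the estimates."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    estimates = [mean(impute_once(values, rng)) for _ in range(m)]
    return mean(estimates), estimates


pooled, per_round = pooled_mean(ages)
print(pooled, per_round)
```

The spread of `per_round` gives a rough feel for how much the estimate depends on the imputed values, which is the information a single imputation hides.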
Info
Channel: Unfold Data Science
Views: 17,923
Keywords: All about missing value imputation techniques, Missing value imputation in machine learning, Missing value imputation in excel, Missing value imputation algorithm, Missing value imputation types, Types of Missing value, Missing value imputation decision tree, missing value imputation in r, missing value imputation, how to impute missing, missing value imputation using linera regression, missing value imputation techniques, missing value imputation - part 2, unfold data science
Id: -uC79UTOye8
Length: 18min 32sec (1112 seconds)
Published: Thu Mar 10 2022