5 ways to work with imbalanced data | Imbalanced dataset machine learning | Imbalanced data

Video Statistics and Information

Captions
If you have been working on data science projects for some time now, then you must have encountered a problem known as the imbalanced data problem, or imbalanced class problem. Let me ask you this, guys: what is the typical approach you take to tackle imbalanced classes? How do you handle that problem? Are there some methods you always use, or do you do some research first? Do you think there should be more methods to choose from when working with imbalanced data? Do you think the approach can be slightly different from just using the built-in Python functions, so that we can get more meaning out of the data? If your answer is yes, this video is for you.

Welcome to Unfold Data Science. My name is Aman and I am a data scientist. In this video I am going to show you five ways in which you can approach the imbalanced class problem. I am talking about five ways, guys, not five methods or five Python functions: five ways in which you can approach the problem, and within each way I will show you different methods. I am also going to cover methods that are specific to deep learning models, show you a Python demo of all of these, and explain the theory behind them as well. So let's go to my screen, guys, and I will show you these one by one.

First of all, let's understand what an imbalanced class scenario means with a simple example. I have credit card data here: you have transaction time, transaction amount, and Is Fraud. Is Fraud is the target column in this case; that is what we have to predict. Transaction time is, for example, 10 am, 8 am, 7 am; transaction amounts are something like 20, 21, 30, 13, 500. Now look at the target column carefully, guys: how many No values do you have? Four No values. And how many Yes values? Only one. So if I ask you the ratio of No and Yes in your target column, No becomes 80% (four out of five) and Yes becomes 20% (one out of five). This kind of scenario is called an imbalanced class problem.

And it is a problem because your model may get biased towards the majority class. What is the majority class here? The No class. The thing to understand is that we are more interested in learning the pattern of the fraud transactions, the Yes rows, but since we don't have sufficient samples, the model may not be able to learn the pattern of that class at all. This is a very common problem in the industry. Here I am talking about 80/20; you will see ratios like 90/10, sometimes 95/5, and I have even seen 99% to 1%. All of these are examples of imbalanced data. If you train your model on this original data, your model will be biased and may not be of much use in the real world. How to tackle this problem is what this video is all about.
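For illustration, here is a minimal sketch of that toy fraud table and the class-ratio check. The column names and values simply mirror the example above; this is not a real dataset.

```python
import pandas as pd

# Toy credit card data mirroring the example: four "No" rows and one "Yes" row
df = pd.DataFrame({
    "transaction_time": ["10am", "8am", "7am", "11am", "9pm"],
    "transaction_amount": [20, 21, 30, 13, 500],
    "is_fraud": ["No", "No", "No", "Yes", "No"],
})

# Class ratio of the target column: roughly 80% No vs 20% Yes
print(df["is_fraud"].value_counts(normalize=True))
```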
So guys, there are several routes you can take for this imbalanced class problem; I am going to talk about five. Let me write them down one by one.

The first is called under-sampling. Under-sampling means that by doing something (I will talk about what that something is) you bring that 80% down, probably close to 20%, which means out of those four samples you take only one into model training. It is not as straightforward as I am saying; I will tell you the methods. This is called under-sampling the majority class, and there are positives and negatives with every approach, which I will get to.

The second approach is called over-sampling the minority class. I will use these words again and again, guys: minority class means the class that has fewer samples, majority class means the class that has more. So with over-sampling, can I raise that 20% up towards the 80% by doing something? I will talk about the methods, but that is way two. One thing to understand here: the ideal situation is a 50/50 ratio, 50% No and 50% Yes; that is what we are aiming for.

The third way is a combination of under-sampling and over-sampling. The fourth way is an ensemble approach; this is a modeling approach, using ensemble models like random forest, AdaBoost and so on. And the fifth approach is a batch approach, batch selection for model training; this one applies to TensorFlow and Keras when you are training deep learning models. For all of these I will show you the different techniques.

So let me go to Python and show you one notebook here: how to work with imbalanced data. I have taken a small dataset, the German credit data, which is again a financial dataset. I call dataframe.head, and this Creditability column is our target variable: if Creditability is 1 the customer is a good customer, 0 means the customer is not a good customer. Then you have Account Balance, Duration of Credit, Payment Status, Purpose, Credit Amount and many other columns like Occupation, Number of Dependents, Telephone. I am not bothered about all of them; I just want to see how the target variable is distributed, meaning its percentage distribution. So I look at the counts with dataframe["Creditability"].value_counts(), and you can see 700 and 300, which means a 70/30 ratio. I gave you an example of 80/20; here I have 70/30. If I plot it with a count plot, I see a chart like this: class 1, the reliable or credible customers, is 70%, and the non-reliable customers are 30%. Now I want to understand the pattern of what makes a customer non-reliable in my model training. As I told you, in an ideal world the scenario would be a 50/50 distribution.

So let's try the approaches I discussed. What was the first one? Under-sampling. Let's go ahead and try under-sampling first. What am I doing here, guys? I am importing a package called imblearn, importing Counter from the collections module, separating my target variable and independent variables, and plotting the original distribution of the target column. Let me run this from here. You will see that I am plotting two columns on x and y.
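A rough sketch of that setup, assuming the German credit data has been loaded into a DataFrame. The file name and the two plotted column names are placeholders here; the actual notebook's names may differ.

```python
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file name; assumes the German credit data is available locally as a CSV
df = pd.read_csv("german_credit.csv")

X = df.drop(columns=["Creditability"])   # independent variables
y = df["Creditability"]                  # target: 1 = good customer, 0 = not good

print(Counter(y))  # e.g. Counter({1: 700, 0: 300}) -> a 70/30 split

# Baseline chart: two numeric columns on x and y, coloured by the target class
plt.scatter(X["Duration_of_Credit"], X["Credit_Amount"], c=y, cmap="coolwarm", s=10)
plt.xlabel("Duration of Credit")
plt.ylabel("Credit Amount")
plt.title("Original class distribution")
plt.show()
```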
Duration of Credit is one column and Credit Amount is another; these two columns come from the data itself. Don't worry too much about which two columns they are, I just need something to put on the x axis and y axis. You will see a plot like this: the orange dots are all the zeros and the blue dots are all the ones. Obviously the orange dots are fewer in number, because the zeros are fewer. This is your actual distribution, the actual data. For the benefit of comparison I have saved it as a PNG, so whatever technique I apply, I will compare the result against this original picture.

Now, what is the first technique, guys? The very first and most common technique in under-sampling is called random under-sampling. Let me write it here: this is the random method. What we do in the random method is, out of that 80% of the data we randomly take a subset so that it matches the size of the 20%. So out of those four observations I want only one observation, because I want to come down to the size of the minority class. Suppose here we have four records and there we have one record: I just choose one record at random from those four. That is random under-sampling, no fundamental reason behind it, no mathematics behind it, I am choosing records at random.

How do we do that? From imblearn.under_sampling I import RandomUnderSampler, you see, and I call fit_resample on the original data; the resampled X and y become my new data after resampling. Then I plot it to compare with my original plot and see how many observations are left. You can see that after under-sampling, class 0 has 300 and class 1 has 300; previously class 1 was 700, remember, and now it becomes 300, a randomly selected 300 observations. This is the new chart. If you compare the new chart with the saved image of the original data, the original looks more dense and this one looks less dense, which means some of the orange data points, the majority class, have been deleted. Look at the data density where I am hovering my cursor, below 50: come here, and below 50 it is gone. All over the plot the concentration of orange points is reduced, because 300 records have been selected at random. Very simple to understand, and it is done, but it is a very naive approach: it does not give a lot of benefit, we are not drawing any extra meaning from the data, and in fact we are losing information. So it is not always a good approach.
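A minimal sketch of that random under-sampling step with imblearn, continuing with the X and y from the setup above; the counts in the comment are the ones quoted in the video.

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes have the same count
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print(Counter(y_res))  # roughly Counter({0: 300, 1: 300}) for the 700/300 credit data
```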
The second approach is called the centroid-based approach. What happens in centroid-based approaches? Suppose in your data there are some majority-class points and some minority-class points: say these are your majority class and these are your minority class. Now let's say I have to pick four data points from the majority class. What happens is that four clusters are created in the majority class, call them cluster one, cluster two, cluster three and cluster four. How many data points do I want from the majority class? Four. So how many clusters am I creating? Four. Then I take the centroid of each cluster; you can think of it as k-means clustering and taking the centroids. My target was to get four points from the majority class, and this way I get four.

Let me show you that in Python. You come here, import ClusterCentroids, and the rest of the method stays the same. If you run this, you will not see a huge difference from the previous result, because the previous one was random, but this looks a little more meaningful. Think about it with common sense: cluster centres will only be created where there is more concentration of points, so it makes a bit more sense. If you compare with the original chart, you will see that the data points have been reduced, but not randomly; they have been reduced with the cluster centroid method.

Now, these two methods we discussed are known as selection-based methods, because we are selecting some data points from the majority class. There is also something called a deletion-based approach, where we delete something from the majority class. Let's see how. There is an approach in under-sampling known as ENN, or Edited Nearest Neighbours. Let me tell you what that is. Let's draw some data points, from any class, majority or minority, and pick one data point at random, say this point x. Now look at its five nearest neighbours: neighbour one through neighbour five. What is the mode of those five neighbours, meaning do most of them fall into category zero or category one? If that mode matches the category of the chosen sample, keep the sample; otherwise delete it. So this is a deletion-based approach: you will delete some samples. If I come back to Python, you will see that some of the data points get deleted; it is not a selection method. Let me run it. Ideally you might expect 300 and 300, but here one class becomes 271, because in this pass more values of class 1 have been deleted. So this is also a kind of under-sampling approach, but a different one. A rough sketch of both methods in code is below.

I have spoken about just three approaches, guys, but you need to study some other important ones as well: one is Condensed Nearest Neighbour, another is called NearMiss, and a third is called Tomek links. You must study these; they are also under-sampling approaches. Now let's move on to the next approach for dealing with imbalanced data: over-sampling.
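A minimal sketch of those two under-sampling methods with imblearn, again assuming the X and y from the notebook setup; the exact counts you get depend on the data.

```python
from collections import Counter
from imblearn.under_sampling import ClusterCentroids, EditedNearestNeighbours

# Centroid-based under-sampling: replace majority-class points with k-means centroids
cc = ClusterCentroids(random_state=42)
X_cc, y_cc = cc.fit_resample(X, y)
print("ClusterCentroids:", Counter(y_cc))

# Deletion-based under-sampling: drop samples whose neighbourhood mode disagrees with their label
enn = EditedNearestNeighbours(n_neighbors=5)
X_enn, y_enn = enn.fit_resample(X, y)
print("EditedNearestNeighbours:", Counter(y_enn))
```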
In over-sampling, what do we do? You see those 20% of records: by doing something, can I bring that number up to match the majority? That one record, if I duplicate the same record four times, becomes four, and the No records were also four. This is random over-sampling: duplicating the same records many times. So let me go ahead in Python and run RandomOverSampler; the approach is the same, so I am not walking you through the code every time. Run it, and you will see that in the original data class 1 was 700 and class 0 was 300; afterwards class 1 is 700 and class 0 is also 700, so records have been duplicated. If you compare with the original chart, you will not observe any difference. Can you tell me why? The reason is that the dots are plotted on top of each other, because they are duplicate records, so you cannot see any difference between this chart and that one. But this is not a good approach either, because you are duplicating the data, your data will carry some bias, and you are unnecessarily giving the same information to the model again.

What is the other over-sampling approach? The second one is known as SMOTE, which stands for Synthetic Minority Over-sampling TEchnique. Here you don't duplicate records, rather you create new records. How? Let me tell you. This is an over-sampling technique for the minority class; remember, the minority class is that 20% we are trying to raise up. Suppose I have a first, second, third, fourth and fifth data point, all minority samples. Take one data point: look at its nearest neighbours, say three of them. Using these neighbours, a new synthetic data point is generated, synthetic data, not real data. For simplicity, if I have to generate a point between this one and this one, the new point lands somewhere in between. To keep it very simple, say each point has two values, age and salary: one point has age 20 and salary 11, the other has age 25 and salary 8 (just an example, don't take it literally). If I have to generate a new data point here, maybe I take something like the average of the two, so the new point would be roughly ((20 + 25) / 2, (11 + 8) / 2). In reality there is some mathematics inside: a random fraction of the vector joining the two points is used rather than a plain average, and that is how the synthetic data is generated.
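A minimal sketch of both over-sampling routes with imblearn, continuing with the same X and y; the k_neighbors value for SMOTE is simply the library default spelled out.

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Plain duplication of minority-class rows until the classes match
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)
print("RandomOverSampler:", Counter(y_ros))  # e.g. Counter({1: 700, 0: 700})

# SMOTE: interpolate new synthetic minority points between a sample and its neighbours
sm = SMOTE(k_neighbors=5, random_state=42)
X_sm, y_sm = sm.fit_resample(X, y)
print("SMOTE:", Counter(y_sm))
```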
But remember, it is synthetic data that gets generated, not real data. What are the pluses and minuses of this approach, guys? On the plus side, you are not just duplicating your original data, so it is a little better than random over-sampling; but since the new points are created from the existing data, a little bias still gets introduced. Now compare this with your original chart and you will see how the data has been generated: in the original PNG you don't see orange dots in certain places, but here there are orange dots, which means new data has been generated, synthetic data, not the original data. That is SMOTE, synthetic data generation. You also need to learn about the many varieties of SMOTE, which I cannot cover in this video because it would become very long: KMeansSMOTE, BorderlineSMOTE, SVMSMOTE and ADASYN. I will show you where to learn all of these in a moment.

Then you can have a mix of under-sampling and over-sampling. One example I am giving here is SMOTEENN: I explained ENN and I explained SMOTE, and this is a combination, so from one side it under-samples and from the other side it over-samples. Let's see how. If I run SMOTEENN, you will see the data point counts change: originally it was 700 and 300, and now class 0 becomes 307 and class 1 becomes 209, so it is working from both sides, SMOTE on one side and ENN on the other. If you compare this with the original PNG, it looks a little different, because both things are acting at once. There is one more technique to learn in this category: SMOTETomek. Tomek links is one of the under-sampling approaches I listed, and its combination with SMOTE is important, so you have to learn it. That is the mix of under-sampling and over-sampling.

Now you have to learn about the ensemble method, the fourth approach I listed. In the ensemble approach, multiple models get created, and every model is trained on balanced data: a balanced subset is chosen by random under-sampling from your data, a bag is created and a model is fitted, then another bag and another model, and so on. This is called the EasyEnsembleClassifier. I am fitting one as an example, and the results do not look that bad. Note that EasyEnsembleClassifier is model fitting, not data creation; you use the model itself directly here.

Where to learn about the other methods? I will show you the imblearn page, because I am using everything from there. In the under-sampling section I told you about ClusterCentroids and explained EditedNearestNeighbours; see how many methods there are: NearMiss, CondensedNearestNeighbour, OneSidedSelection, TomekLinks and more, which you need to understand. Then there are the over-sampling methods, SMOTE and all its varieties; the combination methods, where I showed SMOTEENN and you have to understand SMOTETomek; and the ensemble methods, where I showed EasyEnsembleClassifier, and there is also a balanced random forest classifier, a balanced bagging classifier and a boosting classifier that you should understand as well.
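A minimal sketch of the combined and ensemble routes with imblearn; the train/test split is added here only to make the classifier example runnable and is not part of the original walkthrough.

```python
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN
from imblearn.ensemble import EasyEnsembleClassifier

# SMOTE + ENN: over-sample the minority class, then clean noisy points from both sides
sme = SMOTEENN(random_state=42)
X_sme, y_sme = sme.fit_resample(X, y)
print("SMOTEENN:", Counter(y_sme))  # e.g. roughly Counter({0: 307, 1: 209})

# EasyEnsembleClassifier: a bag of boosted learners, each trained on a
# randomly under-sampled (balanced) subset of the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
eec = EasyEnsembleClassifier(n_estimators=10, random_state=42)
eec.fit(X_train, y_train)
print("Test accuracy:", eec.score(X_test, y_test))
```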
Finally, the balanced batch generator for Keras. What does it do? This one is important for your deep learning models: it creates balanced batches when training a Keras model. When you train a Keras model you feed the data in batches, and with this the batches are automatically created in a balanced way; that is where you use the balanced batch generator. There is a balanced batch generator for TensorFlow as well: you can use imblearn.tensorflow.balanced_batch_generator, and for the Keras-specific case you can use imblearn.keras.BalancedBatchGenerator.

But remember, guys, none of these techniques is a sure-shot guarantee that your model will do well. Take the centroid-based method, for example: in which situation centroids will work fine, we don't know; it depends on the data distribution. But in a war, if you have more weapons in your hand, you can use more weapons. I am sure you learned something new in this video: you can combine under-sampling and over-sampling, you can use ensembles and many other things. I will upload this notebook to my Google Drive, guys. Let me know if you have any doubts. I'll see you all in the next video; give me a thumbs up if you liked this video, wherever you are. Till then, stay safe and take care.
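For reference, a minimal sketch of the balanced-batch idea mentioned above, assuming a small Keras model; the synthetic data, the layer sizes, and the use of RandomUnderSampler as the per-batch sampler are illustrative choices, not from the video.

```python
import numpy as np
from tensorflow import keras
from imblearn.keras import BalancedBatchGenerator
from imblearn.under_sampling import RandomUnderSampler

# Hypothetical imbalanced feature matrix and binary target (900 vs 100)
X = np.random.rand(1000, 20).astype("float32")
y = np.array([0] * 900 + [1] * 100)

# Tiny illustrative model; the architecture is a placeholder
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Each batch is re-balanced by the sampler before being fed to the model
generator = BalancedBatchGenerator(
    X, y, sampler=RandomUnderSampler(), batch_size=32, random_state=42
)
model.fit(generator, epochs=5)
```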
Info
Channel: Unfold Data Science
Views: 15,991
Keywords: 5 ways to work with imbalanced data, Imbalanced dataset machine learning, Imbalanced data in classification, Undersample and oversample, Undersample majority class, smote meaning, smote in python, smote oversampling, undersampling techniques in machine learning, undersampling technique, oversampling technique, Deep learning batch sampling, imbalanced data classification, imbalanced data, imbalanced dataset, imbalanced data machine learning, unfold data science
Id: JisESsmQDS8
Length: 23min 9sec (1389 seconds)
Published: Thu Apr 14 2022