Handling Imbalanced Dataset in Machine Learning: Easy Explanation for Data Science Interviews

Video Statistics and Information

Captions
hey guys, it's Emma, welcome back to my channel. In this video I want to focus on imbalanced datasets. Imbalanced datasets are common in real-world applications, so being able to deal with them will help you tackle interview questions as well as real data science problems. I personally do not have a lot of experience on this topic, so I did some research to prepare for this video. I found some helpful online resources as well as two books I want to recommend to you: Machine Learning Design Patterns and Designing Machine Learning Systems. Both books have great discussions on this topic, so if you want to learn how to deal with other practical machine learning problems, such as feature engineering, model deployment, and hyperparameter tuning, you might find these two books helpful. In this video I'm going to summarize what I learned through my research and hopefully give you some ideas for dealing with imbalanced datasets. I will also share all the resources I found helpful in the video description. Now let's get started.

Here's an outline of this lesson. We are going to look at what an imbalanced dataset is, why it causes problems, and finally different approaches for dealing with imbalanced datasets. We'll look at three kinds of methods: data-level methods, model-level methods, and evaluation metrics. Here are some interview questions to give you a sense of what you might encounter in an interview: What is the disadvantage of imbalanced datasets? How do you handle imbalanced data? How do you deal with an imbalanced dataset when the data contains only one percent of the minority class?

All right, now let's start with understanding imbalanced datasets. An imbalanced dataset is a dataset in which one or more labels make up the majority of the dataset, leaving far fewer examples with other labels. Imbalanced datasets occur in both classification and regression tasks. In classification, imbalance can happen in binary classification, multi-class
classification, and multi-label classification. For example, in a binary classification problem, 95 percent of the labels might be in one class (label 0) while the remaining five percent are in the other class (label 1). In a regression problem, imbalance refers to a situation where some examples have outlier values that are much lower or higher than the median or mean of the data. For example, if we build a regression model to predict house prices, houses worth over 10 million dollars are much rarer than other houses on the market.

When faced with an imbalanced dataset, the first thing we want to do is get as many samples of the minority class as possible. However, in many situations, more data for the minority class is impractical or hard to acquire because the data is inherently imbalanced. For example, in fraud detection, the number of fraudulent cases is much smaller than the number of legitimate cases. Another example is disease detection: the number of people who have a rare disease, such as a particular cancer, is much smaller than the number of people who do not.

So why does an imbalanced dataset cause problems? Why do we need to care about it? The reason is that the model cannot learn to predict the minority class well. Because of class imbalance, most of the time the model only learns a simple heuristic, for example always predicting the dominant class, and can get stuck in a sub-optimal solution. An accuracy of over 90 percent can be misleading, because the model may have no predictive power on the rare class. Using the example mentioned earlier: if 95 percent of the labels are in class 0 and the model always predicts class 0, the model still has over 90 percent accuracy, yet it has no predictive power on examples in class 1. What's more, the minority class is often more important than the majority class: in most cases, a wrong prediction on an example of the minority class is more costly than a wrong prediction on an example of the majority class.
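To make the accuracy paradox concrete, here is a minimal sketch in Python. The function name and the 950/50 counts are my own, chosen to mirror the 95/5 split above; the "model" is just a baseline that always predicts the majority class:

```python
# A minimal sketch of the accuracy paradox on a 95/5 imbalanced split.
# The "model" here is a trivial baseline that always predicts the
# majority class (label 0); names and counts are illustrative only.

def always_majority(n_majority: int, n_minority: int):
    """Return (accuracy, minority_recall) for a majority-class baseline."""
    y_true = [0] * n_majority + [1] * n_minority
    y_pred = [0] * len(y_true)  # always predict the dominant class

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Recall on the minority class: fraction of true 1s predicted as 1.
    minority_recall = sum(
        p == 1 for t, p in zip(y_true, y_pred) if t == 1
    ) / n_minority
    return accuracy, minority_recall

acc, rec = always_majority(950, 50)
print(acc, rec)  # 95% accuracy, yet zero recall on the minority class
```

Even though this baseline has no predictive power at all on class 1, its overall accuracy looks excellent, which is exactly why accuracy alone is misleading here.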
For example, missing a fraudulent transaction may be a hundred times more costly than misclassifying a legitimate transaction as fraud.

OK, now let's go over different ways to deal with imbalanced datasets. There are data-level methods, model-level methods, and metric-level methods we can use; let's cover them one by one.

The first category is data-level methods. Resampling directly changes the distribution of the training data to reduce the level of class imbalance. One simple method is oversampling, or upsampling: we simply add more examples to the minority class. We can do this in different ways. We can use random oversampling, randomly making copies of minority-class examples until a desired ratio is reached. Here's a diagram showing random oversampling: we take an imbalanced dataset and simply make copies of the minority class so that both classes have a similar number of examples. The downside of this approach is that simply making replicas may cause the model to overfit to the few examples in the minority class, because we are not changing the data; we are simply duplicating the data the model learns from. Another approach to oversampling the minority class is to generate synthetic examples. One popular method is called SMOTE (synthetic minority oversampling technique). It creates synthetic examples of the rare class by combining original examples, using a nearest-neighbors approach. In this diagram, you can see that we use four nearest neighbors to create a new synthetic example for the minority class, and then we add these examples to the training data for the model to learn from. The advantage of this approach is that it can prevent the overfitting caused by random oversampling: it does not reuse original examples, but instead creates new examples that are similar to the original ones.
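Here is a minimal sketch of both oversampling ideas, using 1-D toy features so the nearest-neighbor step stays readable. The function names and interpolation details are simplified assumptions of mine, not the exact SMOTE algorithm (in practice you would use a library such as imbalanced-learn on full feature vectors):

```python
import random

def random_oversample(minority, majority, seed=0):
    """Duplicate minority examples at random until the classes balance."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra

def smote_like(minority, k=4, n_new=5, seed=0):
    """SMOTE-style sketch: interpolate between a minority point and one
    of its k nearest minority neighbors to create a synthetic example."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x within the minority class (1-D here)
        neighbors = sorted((p for p in minority if p != x),
                           key=lambda p: abs(p - x))[:k]
        nb = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(x + lam * (nb - x))
    return synthetic
```

Note the key difference: `random_oversample` only repeats points it has already seen, while `smote_like` produces new points that lie between existing minority examples, which is what helps reduce the overfitting risk.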
Another resampling method is undersampling, or downsampling: we simply remove examples from the majority class. The first method is random undersampling; the idea is similar to random oversampling. Random undersampling means that we randomly remove samples of the majority class until a desired ratio is reached. In this diagram, you can see that we remove samples from the majority class of the original dataset, and the resulting dataset has a similar number of examples in both classes. One thing worth noting is that random undersampling may make the resulting dataset too small for the model to learn from, so it only works when we have a sufficient number of examples, at least thousands, in the majority class. Another popular undersampling method is called Tomek links: we find pairs of examples from opposite classes that are close in proximity and remove the majority-class sample in each pair. In this diagram, you can see that we find pairs of examples from the two classes in which the examples are close to each other, and then we remove the majority-class example (shown as blue dots) in each pair. The model is then trained on this undersampled version of the majority class. The advantage of this method is that it may make the decision boundary clearer and help the model learn the boundary better, but the downside is that the model may not learn the subtleties of the true decision boundary, because we have removed examples of the majority class.

To summarize resampling methods: resampling is a good starting point for dealing with imbalanced datasets, but it runs the risk of overfitting the training data if we use oversampling, and of losing important information if we use undersampling.

Now let's move on to some model-level methods. The idea of model-level methods is to make the model more robust to class imbalance without changing the distribution of the training data. One commonly used method is to update the loss function of the model.
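Before moving on, the Tomek-links idea above can be sketched as follows. This uses 1-D toy features and helper names of my own; a real implementation (for example `TomekLinks` in imbalanced-learn) works on full feature vectors with a proper distance metric:

```python
def tomek_undersample(points, labels, majority_label=0):
    """Tomek-link sketch (1-D features): a pair from opposite classes
    forms a Tomek link when each point is the other's nearest neighbor;
    we drop the majority-class member of each such pair.
    Returns the surviving data as (point, label) pairs."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: abs(points[j] - points[i]))

    to_drop = set()
    for i in range(len(points)):
        j = nearest(i)
        if labels[i] != labels[j] and nearest(j) == i:
            # (i, j) is a Tomek link; remove the majority-class point
            to_drop.add(i if labels[i] == majority_label else j)
    return [(p, l) for k, (p, l) in enumerate(zip(points, labels))
            if k not in to_drop]
```

The effect is exactly what the diagram shows: only majority points that sit right up against a minority point are removed, which tends to sharpen the decision boundary rather than shrink the whole majority class.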
Specifically, we design a loss function that penalizes misclassifications of the minority class more heavily than misclassifications of the majority class. This forces the model to treat specific classes with more weight than others during training. One commonly used loss is a class-dependent loss: we simply make the weight of each class inversely proportional to the number of samples in that class. The weight of class i is w_i = N / N_i, where N is the total number of examples in the training data and N_i is the number of examples in class i. Here's an example with two classes, A and B, where class A has 1,000 examples and class B has 10. The weight for class A will be 1.01, while the weight for class B will be much higher, 101. Now, when we calculate the loss caused by example x of true class i, we use w_i as a multiplier of the regular loss function: the total loss is w_i times loss(x, j), where loss(x, j) is the loss when x is classified as class j.

Other than changing the loss function, we can also select algorithms that handle class imbalance well. Tree-based models work well on tasks involving small and imbalanced datasets. Logistic regression is another algorithm to consider: it works relatively well in a standalone manner, and we can adjust the probability threshold to improve the accuracy of predicting the minority class.

Besides updating the loss function and selecting appropriate algorithms, we can also combine multiple techniques to deal with imbalanced datasets. We talked about resampling methods; one example is to combine undersampling with an ensemble learning algorithm. The idea is to use all samples of the minority class and a subset of the majority class to train each model, and then ensemble those models. For example, in a binary classification problem, suppose class A has 1,000 examples and class B has 100. We can divide the examples in class A into 10 groups with 100 examples each.
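As a quick aside, the class-dependent weighting described a moment ago can be sketched in a few lines. The counts mirror the class A/B example above; the function names are my own, purely illustrative:

```python
def class_weights(counts):
    """w_i = N / N_i: weight each class inversely to its size."""
    total = sum(counts.values())
    return {label: total / n for label, n in counts.items()}

def weighted_loss(base_loss, true_label, weights):
    """Scale the regular loss by the weight of the example's true class."""
    return weights[true_label] * base_loss

w = class_weights({"A": 1000, "B": 10})
print(w)  # {'A': 1.01, 'B': 101.0}
```

A misclassified class-B example now contributes about 100 times more to the loss than a misclassified class-A example, so the optimizer can no longer afford to ignore the minority class. Many libraries expose this directly, for example the `class_weight` parameter in scikit-learn estimators.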
We then use all the examples in class B plus the examples in one group of class A to train each model. That gives us 10 classifiers, and we ensemble those classifiers into the final model. Another method is to combine upsampling with an updated loss function: we can upsample the minority class until a desired ratio is reached, then calculate new weights for both classes and pass the new weights to the loss function of the model. These are two examples of combining multiple techniques to deal with class imbalance, but we can definitely try other combinations; the general idea is to use multiple techniques together.

Finally, let's look at choosing the right evaluation metrics for imbalanced datasets. Before we talk about which metrics are appropriate, there's an important thing to keep in mind: we should use the original, unresampled data rather than the resampled data to evaluate the model, because evaluating on resampled data measures performance on the resampled distribution, not the true one. The test data we use for evaluation should provide an accurate representation of the original dataset. As I mentioned earlier, accuracy is misleading when classes are imbalanced: the performance on the majority class will dominate the overall accuracy. A better choice is to consider accuracy for each class individually; in particular, we should measure accuracy on the minority class by itself. Other metrics such as precision, recall, and F1 score are all helpful for measuring a model's performance with respect to the positive class in a binary classification problem. We can also use the precision-recall curve to identify a threshold that works best for the dataset. The precision-recall curve gives more importance to the positive class: it puts more emphasis on how many predictions the model got right out of the total number it predicted to be positive, which is helpful for dealing with imbalanced datasets.
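A minimal sketch of precision and recall for the positive class is below. The function name is my own; in practice libraries such as scikit-learn (`precision_score`, `recall_score`, `precision_recall_curve`) provide these:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (minority) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct positives / predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # correct positives / actual positives
    return precision, recall
```

Unlike overall accuracy, both numbers collapse to zero for the always-predict-majority baseline, so they directly expose a model that ignores the minority class.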
Another commonly used metric is the area under the curve (AUC) of the ROC curve. We can tune thresholds to increase recall and decrease the false positive rate. However, the problem with the ROC curve is that it treats both classes equally and is less sensitive to model improvements on the minority class, so it's less helpful than the precision-recall curve. All right, those are the metrics we can consider for measuring model performance on imbalanced datasets.

We have talked about data-level methods using resampling, model-level methods, and choosing the right evaluation metrics to deal with imbalanced datasets. I hope you now have a good idea of why imbalanced datasets cause problems and how to handle them properly. I will see you in the next video.
Info
Channel: Emma Ding
Views: 18,932
Keywords: Data Science, Data Science Interview, Emma Ding, Data Interview Pro, imbalanced data machine learning, imbalanced dataset machine learning, data science interview questions
Id: GR-OW5asKlk
Length: 13min 43sec (823 seconds)
Published: Mon Dec 05 2022