Handling Imbalanced Datasets SMOTE Technique

Video Statistics and Information

Captions
Hello everyone, my name is Ashok. Welcome to another video on machine learning. In this video we will discuss how to handle an imbalanced dataset. First, what is an imbalanced dataset? Let me explain that, and then we will go ahead and see how to handle it. Say I have a dataset of cancer patients, and I am going to use it to build a predictive model that takes an input record and says whether the patient is diagnosed with cancer or not. Suppose you have 1,000 records, of which 900 are non-cancer and 100 are cancer. This is an example of an imbalanced dataset: the non-cancer class is your majority class, the class with many more records, the cancer class is your minority class, and here the majority class is about nine times bigger than the minority class. That is the definition: an imbalanced dataset is one where the majority class is much larger than the minority class. How much larger? Even if it is only twice or three times larger, it can still be treated as an imbalanced dataset. So what is the problem with an imbalanced dataset? Most machine learning algorithms tend to bias towards the majority class. Take the same example: 1,000 records, 900 non-cancer, only 100 cancer. Even if your algorithm says "non-cancer" for every single record you give it, which means it is fully biased towards non-cancer, it still gets 90% accuracy, because that is the share of the majority class. So most of your algorithms tend towards the majority class, which is a problem, because most of the time
we are actually more interested in the minority class. What do I mean by that? In this example, I am interested in predicting the cancer condition. It is the same story in other cases as well. For example, if you want to predict spam versus non-spam: the non-spam mails, which we call ham, are normally the larger class and the spam mails the smaller one (nowadays the situation is slightly different, but generally that holds), yet it is the spam we are more interested in predicting. In most business problems, your minority class is usually your focus class, which means you want the model to focus on the minority class; but because you are using accuracy as your metric, your algorithm tends to bias towards the majority class. How do you handle that? We have several techniques. The first is undersampling. Out of the 1,000 records, 900 non-cancer and 100 cancer, we keep the 100 cancer records as they are, and we randomly sample 100 from the 900 non-cancer records. It is called undersampling because the sample size is much smaller than the actual data. Now you have 200 records, and they are perfectly balanced: 100 non-cancer and 100 cancer. However, this is not very popular, because as you can see, from 1,000 records we ended up with only 200, which means we are losing about 800 data points. In machine learning, data is golden, and losing data is not recommended. So this balances your classes, but it might decrease your accuracy as well, since you are losing a lot of data that might contain valuable information.
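As a concrete sketch of this random undersampling on the 900/100 example (toy data generated here for illustration, not the video's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the example: 900 "non-cancer" (0) and 100 "cancer" (1) rows.
X = rng.normal(size=(1000, 3))
y = np.array([0] * 900 + [1] * 100)

# Keep all 100 minority rows; randomly pick 100 of the 900 majority rows.
minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.choice(np.flatnonzero(y == 0), size=len(minority_idx), replace=False)
keep = np.concatenate([majority_idx, minority_idx])

X_under, y_under = X[keep], y[keep]
print(len(y_under), (y_under == 0).sum(), (y_under == 1).sum())  # 200 100 100
```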
The second technique is oversampling. Take the same example: 1,000 records, of which 900 are non-cancer and 100 are cancer. This time we keep the 900 non-cancer records untouched, and we grow the 100 cancer records into 900 as well. How do you do that? By duplication, but not by simply copying the 100 records nine times. Oversampling here means random sampling with replacement: say you draw 30 records at a time from the 100 cancer records. "With replacement" means the pool always remains 100, so you keep drawing 30 random records again and again; do that 30 times and you end up with 900. Essentially it is duplication, but random duplication: some records might be repeated only a few times, others 20 or 30 times. The number of records you end up with is 1,800 instead of 1,000, and the dataset is perfectly balanced. We do say that duplicates are not good for models, but in this case balancing is even more important, because with an imbalanced dataset your minority class is not really getting focus; by doing this, the minority class gets focus. Of the 1,800 records, 800 are duplicates, but your dataset is balanced, and this is one of the popular oversampling techniques.
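A minimal sketch of this random oversampling with replacement, on the same toy 900/100 setup (illustrative data, not the video's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy setup: 900 majority (0) and 100 minority (1) rows.
X = rng.normal(size=(1000, 3))
y = np.array([0] * 900 + [1] * 100)

# Random oversampling: draw 900 minority indices WITH replacement
# (the pool of 100 always stays 100), then keep the majority untouched.
minority_idx = np.flatnonzero(y == 1)
drawn = rng.choice(minority_idx, size=900, replace=True)
keep = np.concatenate([np.flatnonzero(y == 0), drawn])

X_over, y_over = X[keep], y[keep]
print(len(y_over), (y_over == 0).sum(), (y_over == 1).sum())  # 1800 900 900
```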
The other technique is SMOTE. In SMOTE, instead of duplicating records, we generate new data. Picture your data points plotted in two colours: one class in black, say non-cancer, and the other in red, cancer; as you can see, the red cancer class is much smaller than the black class. What we do is take a red point and its nearest red neighbours and average them, essentially creating a new point between them. By synthetically taking such averages of the red points, we keep creating new points until the number of records equals the majority class. This also grows your minority class, cancer in this case, from 100 to 900; the majority class remains 900, of course, but this time the extra 800 records are not duplicates, they are synthetic records created by interpolating between existing cancer data points. That is why it is called the Synthetic Minority Oversampling TEchnique; take the first letters and it becomes SMOTE. SMOTE is one of the very popular techniques for handling an imbalanced dataset, and it is so commonly used that it has become a verb: when you say you "smoted" a dataset, you are basically saying you balanced it. So that is SMOTE, and that is how you handle an imbalanced dataset. Now I am going to quickly show how to do it with a real dataset in Python. I have the Car Evaluation dataset, which has 1,728 records. This data is from the UCI repository; you can Google "UCI Car Evaluation", and I shall leave the link in the description of this video. To quickly explain it: the dataset describes the outcome of evaluating a car model based on six input attributes, namely buying cost, maintenance cost, number of doors, number of persons, luggage boot space, and safety, each rated with values such as low, medium, and high; the outcome says whether the car model is unacceptable, acceptable, good, or very good.
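Before the demo, the interpolation step SMOTE performs can be sketched by hand (a toy illustration with made-up 2-D points, not the actual imbalanced-learn implementation): pick a minority point, find a nearby minority neighbour, and place a new point at a random spot on the segment between them.

```python
import numpy as np

rng = np.random.default_rng(42)

# A tiny minority class of 2-D points (made-up data for illustration).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def smote_like_point(points, rng):
    """One synthetic sample: interpolate between a random minority point
    and its nearest minority neighbour."""
    i = rng.integers(len(points))
    p = points[i]
    dists = np.linalg.norm(points - p, axis=1)
    dists[i] = np.inf                      # exclude the point itself
    neighbour = points[np.argmin(dists)]
    gap = rng.random()                     # random position on the segment
    return p + gap * (neighbour - p)

synthetic = np.array([smote_like_point(minority, rng) for _ in range(5)])
print(synthetic)  # five new points lying between existing minority points
```

Because every synthetic point is a convex combination of two existing minority points, it always lands inside the minority class's region rather than being an exact duplicate.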
If you look at the number of records in each of the classes, you see that out of 1,728 car models, 1,210 records are in the unacceptable class, 384 are acceptable, 69 are good cars, and 65 are very good cars. This is clearly an imbalanced dataset: the smallest minority class here, very good with 65 records, is about 19 times smaller than the unacceptable majority class; in other words, your majority class is about 19 times bigger than your minority class, and we have more than one minority class here. If I just go ahead and fit an algorithm, let's say K-nearest neighbours, I import KNeighborsClassifier from the sklearn.neighbors subpackage and create the model. Before that I need to get X and y: X is all the columns except the last one, and y is the last one. One more thing: your X variables are strings, or labels, so we need to convert them into numbers using label encoding. I import LabelEncoder from sklearn.preprocessing and apply it to the columns that require it, such as buying cost and maintenance cost; doors and persons may not need it where they are already numeric, and luggage boot and safety get encoded as well. I replace each of those columns by applying fit_transform, and now all of the variables are label encoded. Then I split the dataset, because to evaluate the model we need to take 70% as training and 30% as testing; I use train_test_split from sklearn.model_selection to get X_train, X_test, y_train, and y_test.
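The preprocessing steps above can be sketched end to end on a tiny stand-in for the car data (these eight rows are hypothetical; the video loads the full 1,728-row UCI file, so adjust the loading step to your copy of `car.data`):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up stand-in with the Car Evaluation column layout.
df = pd.DataFrame({
    "buying":   ["low", "high", "med", "low", "high", "med", "low", "med"],
    "maint":    ["low", "high", "med", "med", "low", "high", "low", "med"],
    "doors":    ["2", "3", "4", "2", "3", "4", "2", "4"],
    "persons":  ["2", "4", "4", "2", "4", "4", "2", "4"],
    "lug_boot": ["small", "big", "med", "small", "big", "med", "small", "med"],
    "safety":   ["low", "high", "med", "low", "high", "med", "low", "med"],
    "class":    ["unacc", "vgood", "acc", "unacc", "good", "acc", "unacc", "acc"],
})

X = df.iloc[:, :-1].copy()   # all columns except the last
y = df.iloc[:, -1]           # the outcome column

# Label-encode every categorical input column into integers.
for col in X.columns:
    X[col] = LabelEncoder().fit_transform(X[col])

# 70/30 split with a fixed random_state so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)  # small k for the tiny toy set
model.fit(X_train, y_train)
print(model.predict(X_test))
```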
I pass a test size of 0.3 and set a random_state so that the split is fixed. Then I fit the model with model.fit(X_train, y_train); fitting is done, and I predict on X_test. Now let's look at the accuracy of the model: I import accuracy_score from sklearn.metrics and print it for y_test against y_pred, and we get about 92% accuracy. Well, that alone is not enough for us to analyse, so let's look at the crosstab of y_test against y_pred. In this crosstab the rows are your actuals and the columns are your predictions, and the diagonal is what was rightly classified: about 366 unacceptable records are classified as unacceptable, 89 of the acceptable class are predicted as acceptable, 10 of the good cars are classified as good, and 17 very good cars are classified as very good. Everything off the diagonal is misclassification; for example, a number of acceptable cars are being classified as unacceptable. Let me do the per-class math quickly with Python: out of 102 acceptable records, 89 were classified correctly, about 87%; out of 21 good cars, only 10, about 47%; out of 371 unacceptable, 366, about 98%; and out of 25 very good, 17, about 68%. So you can clearly see that the majority class gets a huge accuracy on a class-by-class basis, whereas your good and very good cars have really poor accuracy, 47% and 68%. This is because of the imbalanced dataset.
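The per-class arithmetic above can be reproduced with a small self-contained example (the labels here are made up, not the video's actual predictions): overall accuracy hides the per-class picture, and the crosstab's diagonal divided by each row total reveals it.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical actuals and predictions for a 4-class problem.
y_test = pd.Series(["unacc"] * 6 + ["acc"] * 3 + ["good"] * 2 + ["vgood"],
                   name="actual")
y_pred = pd.Series(["unacc"] * 6 + ["acc", "acc", "unacc"] + ["good", "unacc"] + ["vgood"],
                   name="predicted")

acc = accuracy_score(y_test, y_pred)
print(acc)   # overall accuracy: 10 of 12 correct

# Rows = actual class, columns = predicted class; the diagonal holds
# the correctly classified count for each class.
ct = pd.crosstab(y_test, y_pred)
print(ct)

# Per-class accuracy: diagonal count divided by the row total.
per_class = {}
for cls in ct.index:
    correct = ct.loc[cls, cls] if cls in ct.columns else 0
    per_class[cls] = correct / ct.loc[cls].sum()
print(per_class)
```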
So let's apply SMOTE and see if we can get better accuracy. For SMOTE you need to install a package called imbalanced-learn (imported as imblearn); you can pip install it, and I have already installed it. I import SMOTE from imblearn.over_sampling; SMOTE is the technique we were talking about. Then I simply define a SMOTE object first, and I transform my training data only, this is very important, not the testing data: I create new variables, the smoted variables, with smote.fit_resample, passing X_train and y_train. That's it. Now, if you look at the class counts using Counter from the collections package before smoting and after smoting: before smoting, in the training dataset, the unacceptable class, the majority class, had 839 records, acceptable had 282, good had 48, and very good had 40. After smoting, all of your classes have the same number of records as your majority class, 839. How did this happen? The new records were created by interpolating within each individual class: the acceptable class had 282 records, so SMOTE basically took averages of neighbouring data points to create new data until it reached 839, and the same thing happened with the good and very good classes as well. So let's do the same classification again, but this time, rather than using the regular training data, which is imbalanced, I use the balanced, smoted data. I run it, and the first observation is that the overall accuracy itself went up. That's wonderful.
Let's look at the crosstab again. The accuracy went up, which is notable, because usually accuracy dips slightly when you duplicate or synthesize data; in this case it went up, and you can see very noticeable changes in all of the classes. For example, the acceptable class, which had 89 correct predictions before, now has 92 out of the same 102 records, about 90%. The good class, which had 47% accuracy, has jumped from 47% to 100%; that is wonderful. The majority class took a bit of a hit, because that is how the other classes get up and running: the unacceptable class went from 366 to 354 correct out of 371, about 95.4%, so it reduced by around three percent; not bad. And the last one, very good, is 23 out of 25, 92%. So you can see that all of your class accuracies improved, 90%, 100%, 92%, except the majority class, which dipped a little. That is what SMOTE does to your dataset. So if you have an imbalanced dataset, please go ahead and apply SMOTE, and you will see that it drastically improves the accuracy of your minority classes, and sometimes even the overall accuracy. As I said before, in most cases your focus class, the class you are actually interested in, is your minority class, so in those cases SMOTE is what you have to do to get proper accuracy for the class of interest. Okay, so that's SMOTE. I shall leave this notebook I have just written here, and I will also put a link for you to play with the dataset; you can download it from the UCI repository. That's it; that's SMOTE. I hope it was helpful, and if you find the channel valuable, please subscribe
to the channel and click the notification button. Thank you very much, and if you have any comments or questions, leave them in the comments; I will surely answer them. See you in the next video. Thank you.
Info
Channel: DataMites
Views: 23,264
Rating: 4.9075737 out of 5
Keywords: Datamites Institute, Data Science Courses
Id: dkXB8HH_4-k
Length: 24min 32sec (1472 seconds)
Published: Mon Feb 10 2020