Machine Learning Tutorial Python - 21: Ensemble Learning - Bagging

Video Statistics and Information

Captions
Once I wanted to buy a Nest thermostat and I wasn't sure whether I should buy it or not. I called four of my friends who already have that device and asked for their opinion. Three of them told me I should buy it, one told me I shouldn't, so I took the majority vote and went ahead and installed a Nest thermostat in my home. We use ensemble learning in our real life: to make a decision, we take opinions from different people.

Similarly, in machine learning, if you have just one model and you train it on your entire dataset, that model might overfit, or it might suffer from a high-variance problem. If you don't know about bias and variance, I have another video which you should check before continuing with this one. To tackle this high-variance problem, we can use ensemble learning. In the case of my Nest thermostat, why didn't I call just one friend? Because that one person might be biased. I wanted to make a good decision, so I talked to multiple people and took the majority vote. In ensemble learning, we train multiple models on the same dataset, and when we do prediction we run it through all the models and combine their outputs somehow to get the final result. Bagging and boosting are the two main techniques used in ensemble learning, and in this video we are going to talk about bagging. We'll also write some Python code. Let's get started.

Let's say I have a dataset of 100 samples. When I train a machine learning model, one of the problems I might encounter is overfitting. Overfitting happens due to the nature of the dataset, your machine learning methodology, and so on, and an overfit model usually has a high-variance problem. To tackle this, one thing you can do is create, out of the 100 samples, a smaller dataset of, say, 70 samples. The way you create this subset is by resampling with replacement. What is that, exactly? Say I have 10 samples, out of which I want to create a smaller dataset of 4 samples. In resampling with replacement, we randomly pick any data point, say 4. When we go to pick the second data point, we again pick randomly from 1 to 10 with equal probability; we don't look at what is already in our subset. So the second time I might get, say, 8. The third time we again pick randomly from 1 to 10, and this time I might get the same sample again. That is resampling with replacement: the same data sample can appear in my subset multiple times.
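Here is a minimal sketch of resampling with replacement using NumPy, mirroring the 10-sample / 4-sample example above (the variable names and the seed are just for illustration):

```python
import numpy as np

# Original dataset: 10 samples labelled 1..10, as in the example above
data = np.arange(1, 11)

# Resampling with replacement: each draw picks any of the 10 points with
# equal probability, so the same sample can show up more than once
rng = np.random.default_rng(42)
subset = rng.choice(data, size=4, replace=True)
print(subset)  # duplicates are possible, e.g. the same label appearing twice
```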
So from my original dataset I created a subset of 70 samples. I might create n such smaller datasets from the original dataset using resampling with replacement, and then train a machine learning model on each individual dataset. Say I'm trying to classify whether a person should buy insurance or not, and I'm using logistic regression. Then m1, m2, and m3 are all logistic regression models, but each is trained on a different dataset. Once they are trained and I have to perform prediction, or inference, I run that prediction on all three models individually, in parallel, and take a majority vote on the results. Here m1 and m3 say the person should buy insurance and m2 says they should not, so the majority vote is clearly 1, and that is my final outcome. The benefit is that these individual models are weak learners: they are trained on a subset of the data, so they are less likely to overfit and more likely to generalize well, and when you combine the results of these weak learners you get a good-quality result overall. That was a classification case; the same thing applies to regression. Say you're doing housing price prediction: there you take the average of the individual models' predictions.

This technique is also called bootstrap aggregation, because creating the small subsets using resampling with replacement is called bootstrapping, and combining the results using an average or a majority vote is called aggregation. Hence bagging is also known as bootstrap aggregation. Many times you hear these terms and jargon and get worried about what they mean, but these concepts are really quite easy; you already understand what bootstrap aggregation means.

Now, random forest is one bagging technique, with one distinction: you not only sample your data rows, you also sample your features. So you sample your rows as well as your columns. Look at our classic housing price prediction example, where town, area, bedrooms, and so on are features and price, the green column, is the target variable. Here you sample both rows and columns. You can see the first subset has no bedrooms or plot column; I randomly sampled and got only four of the seven columns. The second time I again sampled the columns randomly, so that dataset, for example, did not get bedrooms, and the third one did not get school rating. So you are randomly picking rows as well as columns, then training a decision tree model on each individual dataset, and then aggregating the results. Here I show decision tree regression, but you can use this for a classification problem as well. The point is simple: random forest is basically a bagging technique, except we randomly pick features as well. The difference between the terms bagging and bagged trees is that in bagging, the individual model can be SVM, KNN, logistic regression, pretty much any model, whereas in bagged trees (random forest is a bagged-trees example) every model you train is a tree.
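To make the bootstrap-plus-majority-vote idea concrete, here is a rough sketch of bagging by hand with logistic regression base models. It assumes NumPy arrays X_train, y_train, X_test with binary 0/1 labels; the function name bagged_predict and its parameters are illustrative, not from the video:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bagged_predict(X_train, y_train, X_test, n_models=3, sample_frac=0.7, seed=0):
    """Train several logistic regressions on bootstrap subsets of the
    training data and combine their predictions with a majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_preds = []
    for _ in range(n_models):
        # Bootstrap: sample row indices with replacement
        idx = rng.choice(n, size=int(sample_frac * n), replace=True)
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[idx], y_train[idx])
        all_preds.append(model.predict(X_test))
    # Aggregation: majority vote across the weak learners (binary 0/1 labels)
    votes = np.mean(all_preds, axis=0)
    return (votes >= 0.5).astype(int)
```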
All right, I hope the theory is clear; let's move on to Python coding using scikit-learn. I will be using this dataset for the coding today: the Pima Indian diabetes dataset, where based on certain features you have to predict whether a person has diabetes or not. By clicking on this link I downloaded the CSV file, which looks something like this. The features are pregnancies, glucose, blood pressure, and so on; these are all contributing factors for diabetes, and based on them you decide whether the person has diabetes or not.

I have loaded this CSV file into a pandas DataFrame, as you can see here, and as soon as I load data into a DataFrame I like to do some basic exploration. Let's first find out whether any column has null values. The way to do that is df.isnull().sum(); if the number for a column is, say, 5, that column has 5 null values. We are lucky here: there are no null values, so nothing to worry about. The next thing I do is df.describe(), which gives the basic statistics for each column. For pregnancies, look at min and max; a max of 17 is a little high, but it's not unrealistic. When I examine the min and max values in all these columns they look normal, and I don't feel we need any outlier detection or removal, so I will assume there are no outliers. The next thing I check is whether there is any imbalance in the dataset, based on the outcome column. If you do value_counts you will find 500 samples that say no diabetes and 268 samples that say the person has diabetes, a ratio of about 0.53. So there is some imbalance, but it is not major; a major imbalance would be something like 10 to 1 or 100 to 1, while this is more like 2 to 1. There is a slight imbalance, but it's not something to worry about, so I will move on to the next stage, which is creating X and y. My X comes from df.drop: since Outcome is my target column I need to drop it, and you do that with the drop function in pandas, passing axis as "columns". y will be df.Outcome. Okay, so X and y are ready.

Now I will do scaling, because the values are on different scales; you can see 0 to 17 versus 0 to 199. It's not a huge difference in scale, but just to be on the safe side I will use standard scaling (you could use MinMaxScaler as well). Let's create our scaler object, and if you have followed my previous videos you know how to use it: call fit_transform(X) and what you get back is X_scaled. It will be a NumPy array, so I print just the first three rows; you can see the values are scaled. If you use MinMaxScaler you will get a different set of values, but both work.

Now this should be muscle memory: when you have X_scaled and y ready, you do train_test_split, and for that I import the method. What do you supply to train_test_split? X and y; our X is X_scaled, because we want to use the scaled values. The standard output is X_train, X_test, y_train, and y_test. Here I will also use the stratify argument: because there is a slight imbalance, I want to make sure the train and test datasets keep roughly the same class proportion, so I will say stratify=y.
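The steps above look roughly like this in code; it's a sketch assuming the CSV is saved locally as diabetes.csv with a target column named Outcome:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the Pima Indian diabetes dataset (file name assumed)
df = pd.read_csv("diabetes.csv")

print(df.isnull().sum())          # null values per column
print(df.describe())              # min/max etc. for a quick outlier sanity check
print(df.Outcome.value_counts())  # class balance (roughly 500 vs 268)

X = df.drop("Outcome", axis="columns")
y = df.Outcome

# Scale features since their ranges differ (e.g. 0-17 vs 0-199)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# stratify=y keeps the class ratio similar in train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, stratify=y, random_state=10)
```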
The split won't be perfectly equal, but at least the ratio will be maintained. I will also set random_state to 10, or maybe 20; let's go with 10. That's just an arbitrary number, and it gives you reproducibility: every time you run this method you get the same X_train and y_train. If you check the shapes, there are 576 samples in the training dataset and 192 in the test dataset, and if you look at y_train.value_counts() you see roughly the same class ratio we saw earlier.

Now we will use a decision tree classifier to train a standalone model. You could use SVM, K-nearest neighbors, or any classification model; I am using a decision tree to demonstrate that a decision tree is a relatively unstable classifier: it can overfit and produce a high-variance model. So let's train the decision tree first, and I will use cross_val_score here to try it out on different splits of the data rather than just one X_train and X_test. If you don't know about cross-validation, I have a separate video on k-fold cross-validation which you should watch first, otherwise this will bounce off your head. cross_val_score expects a classifier, which is my decision tree classifier, then X and y, and then cv; I'll give cv=5. What this does is take my 768 samples, divide them into five folds, and try different combinations of train and test folds to train and evaluate the model. It performs five training runs, and all those results end up in the scores array. This is a NumPy array, so just take its mean and you'll find the model gives about 71 percent accuracy, which is okay.
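In code, the baseline evaluation looks roughly like this, continuing with the variables defined above (the exact accuracy will vary with the split):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Standalone decision tree evaluated with 5-fold cross-validation
scores = cross_val_score(DecisionTreeClassifier(), X_scaled, y, cv=5)
print(scores)         # five accuracy values, one per fold
print(scores.mean())  # around 0.71 in the video
```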
But now let me use a bagging classifier. The first thing you can do is ask your friend about sklearn's BaggingClassifier; your friend is Google, and Google will tell you which API to use. Then I use the most important tool for any programmer, which is copy-paste, and create a BaggingClassifier. You can read about all its arguments, but I'm only going to use a couple of them. First, which estimator are you using? I am using a decision tree. How many estimators, that is, how many subsets of the dataset? 100; that's trial and error, so try 10, 20, and figure out which gives the best performance. This 100 is nothing but what you saw in my presentation: there I used 3 models, here I'm doing 100 models on 100 subsets of the dataset, trained in parallel. And how many samples? In the presentation we used 70 out of 100; here I'll use 80 percent of my samples.

There is another argument called oob_score. What is that? OOB means out of bag. When you are sampling randomly, by the law of probability you are not going to cover all the samples in your subsets. Let's say sample number 29 did not appear in any of the subsets. Then none of the trained models has seen data point 29, so you can use it to test the accuracy of those models; you are effectively treating sample 29 as test data. Ideally, you take your dataset and split it into train and test, so the block in this diagram is actually your X_train; your X_test is kept separate and is what you use to evaluate the final model before deploying it into the wild. But even within X_train, because of our sampling strategy, you might miss some samples. Say 20 data samples did not appear in any subset; after the models are trained, you can run prediction on those 20 samples, take the majority vote, and measure the accuracy. That accuracy is your OOB score.

So let me set oob_score=True, set random_state again for reproducibility, and call this bag_model. Once I have bag_model, I call fit on X_train and y_train, and then I can read bag_model.oob_score_. Notice that I have not even touched X_test and y_test yet: within the training data, because each of the 100 classifiers was trained on 80 percent of the samples, some training samples were missed, and the predictions on those missed samples give the accuracy stored in the OOB score. Now I can also do the regular scoring with X_test and y_test, and you see an improvement: about 77 percent versus the standalone model's 71 percent.

Now, I agree you will tell me that I used cross-validation for the standalone model but not here, so let me use cross-validation for the bagged model too. I'll do some copy-paste magic, create the same bagging model, and pass it to cross_val_score: you supply the model, then X, then y, and how many folds, say five. You get the scores back and take their mean, and you will find that the base model gives about 71 percent accuracy while the bagged model gives about 75 percent. So for an unstable classifier like a decision tree, bagging helps. Sometimes you have an unstable classifier like a decision tree, and sometimes your dataset is such that there are many null values or the columns are such that the resulting model has high variance; whenever you have this high variance, it makes sense to use a bagging classifier.

We also talked about random forest, so let me try a random forest on this dataset as well. I pass RandomForestClassifier, X, y, and cv=5 to cross_val_score, get the scores, and take the mean; it's pretty straightforward, and the random forest classifier gives slightly better performance. Underneath, it uses bagging, and it samples not only your data rows but your feature columns as well, as we saw in the presentation.
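Put together, the bagged and random-forest models look roughly like this, continuing with the variables from the earlier sketch. Note that older scikit-learn versions call the first BaggingClassifier argument base_estimator instead of estimator, the random_state value is just illustrative, and your scores will differ somewhat from the video:

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Bagging: 100 decision trees, each trained on 80% of the rows sampled
# with replacement; oob_score=True evaluates on the out-of-bag samples
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base_estimator= on older sklearn
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,
    random_state=0,                      # any fixed value, for reproducibility
)
bag_model.fit(X_train, y_train)
print(bag_model.oob_score_)             # accuracy on out-of-bag samples
print(bag_model.score(X_test, y_test))  # around 0.77 in the video

# Cross-validated comparison of the bagged model (around 0.75 in the video)
print(cross_val_score(bag_model, X_scaled, y, cv=5).mean())

# Random forest samples feature columns as well as rows under the hood
print(cross_val_score(RandomForestClassifier(), X_scaled, y, cv=5).mean())
```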
Now comes the most important part of this video, which is an exercise. Learning coding or data science is like learning swimming: you are not going to learn swimming by watching a swimming video. Similarly, you need to work on this exercise, otherwise it will be hard to grasp the concepts I just taught you. Here I'm giving a CSV file, which I took from Kaggle by the way, about heart disease prediction. You have to load the dataset and apply some outlier removal; I have given all the information here. Work on this exercise, and only once you have put in your sincere effort should you click on the solution link, because I have an AI technology built into this video: if you click on the link without trying it on your own, your laptop or computer will get a fever and will not recover for the next 10 days, and you will miss all the fun. So better to try it on your own first and then click on the solution link. I hope you liked this video; if you did, give it a thumbs up at least and share it with your friends. I wish you all the best. If you have any questions, there is a comment section below. Thank you.
Info
Channel: codebasics
Views: 9,008
Keywords: yt:cc=on, ensemble learning, python machine learning for beginners, machine learning tutorial for beginners, machine learning with python, python machine learning, ensemble learning in machine learning, ensemble learning tutorial, ensemble learning bagging, bagging, bagging data science, ensemble learning data science, ensemble learning python, ensemble learning deep learning
Id: RtrBtAKwcxQ
Length: 23min 37sec (1417 seconds)
Published: Fri Oct 22 2021