XGBoost in Python (Hyperparameter Tuning)

Video Statistics and Information

Captions
Hello. If you are a machine learning practitioner, or are learning machine learning, you will have heard about the XGBoost algorithm (eXtreme Gradient Boosting). It is one of the most popular algorithms for classification as well as regression, heavily used in hackathons and in production. I have been using XGBoost for all of my client projects; wherever you have structured, tabular data, XGBoost is usually the preferred algorithm for production. This video is about XGBoost hyperparameter fine-tuning. You have probably done tuning yourself by trial and error, but if you understand what these parameters are and how to tune them in a systematic manner, it helps you tune not just XGBoost but any machine learning algorithm much better. Let's get started.

I am going to open up Python 3 and first look at the parameters of XGBoost. As you might have realized, not every parameter is important; some of the parameters do not really affect the model. Only certain parameters have a significant effect, so we will focus on those core parameters.

Before going into fine-tuning, I would like to give you a quick idea of what overfitting and underfitting are, because when you fine-tune the parameters you are looking at underfitting and overfitting and trying to find the optimal point. So what is overfitting? Let's take a quick graph. When you train a model you also evaluate it: usually we keep a certain part of the data for training and a certain part for testing, that is, for evaluation. The model gives you a training accuracy as well as a testing accuracy. As you increase model complexity, that is, as you tune the parameters towards fitting the data more closely, the training accuracy keeps increasing and gradually tends towards 100 percent, or near it. The testing accuracy increases along with the training accuracy up to a point, and then it starts falling. The reason is that your model is learning the training data too hard; when it learns too hard, that is called overfitting. The model predicts everything correctly for the data used in training, but for the test data, which it has never seen, the accuracy falls because the model does not generalize well. That region of the curve is called overfitting, the region before it is called underfitting, and the point in between is the optimal model complexity. Your goal in modeling is to find this optimal model, and for that we use the hyperparameters. There are certain hyperparameters which, when you increase them, make the model overfit, and there are others which, when you increase them, make the model generalize, or regularize; regularization is essentially the opposite of overfitting. We have to balance the overfitting and the regularization parameters to get the optimal model.

So let's look at the parameters one by one. I'll start with max_depth. If you know decision trees, or any of these scikit-learn tree models,
maximum depth is how deep the model is allowed to grow each tree; for example, a tree with three levels has a maximum depth of three. When you increase the maximum depth, the model tends to overfit. Let me make a quick table to give you a rough idea of the ranges; obviously all of these are thumb rules, so you are free to explore other values as well. max_depth is usually taken from, say, 2 up to 20 or 30, depending on your number of predictors (X variables, input variables, dimensions): if you have a lot of columns, the maximum depth can be a little higher, because the tree can usefully grow a bit more. So increasing max_depth overfits the model.

In a similar manner you have subsample. This hyperparameter is the fraction of the data you take for building each tree (estimator). The range is roughly 0.1 to 1, and the default value is 1, which means you are taking 100 percent of the data for your model building. If you set it to 0.5, it reduces the amount of data taken for each tree, which gives you a regularization effect. So increasing subsample towards 1 overfits; subsample is usually used as a regularization parameter, because decreasing it means each tree is built on only part of your data, which regularizes and generalizes better.

Another set of parameters you would like to look at is the column sampling: colsample_bytree, colsample_bylevel and colsample_bynode. colsample_bynode is too granular, so I would focus on colsample_bylevel and colsample_bytree. For colsample_bylevel the default is again 1, and you can give any value between 0 and 1, but 0 would mean no columns at all, and even 0.01, though valid, is really small, so 0.1 to 1 is a practical range. Columns here are nothing but features. Say your model has 20 features (predictors, X variables, input variables, all the same thing): if you set colsample_bylevel to 0.5, then at every level of the tree only 50 percent of those 20 variables, that is 10 variables, are considered. The default is 1, and if you reduce it, it gives you a regularization effect. The purpose of these three column-sampling parameters is regularization: reduce them and the model regularizes more; at 1 they add no extra regularization.

Apart from that we have two more parameters. The first is min_child_weight, whose value can range from 1 up to a few hundred, and you can
actually fine-tune it and see. This is one of the parameters which I have personally observed changes the accuracy over a wide range, so it is an important hyperparameter, and if you increase it, it regularizes. We also have lambda and alpha. All three of these are regularization parameters: when you increase them, they regularize the model, which basically means they make it generalize.

We also have two more important parameters: n_estimators and learning_rate. n_estimators is the number of trees. If you know how the gradient boosting algorithm works, it builds estimators serially; each estimator (or simply a tree, they are the same thing) boosts the next one. So if you say n_estimators = 10, you get ten estimators. The default value, I believe, is 100; defaults keep changing between versions, but for n_estimators it is 100. You can set n_estimators anywhere from 10 up to as many as you want, but as you increase the number of estimators the training (modeling) speed comes down, because the algorithm has to build that many trees. So it is better to keep it as small as possible, given that after a certain point increasing the estimators does not really help the fit. Of course, if you increase n_estimators the model fits better, or rather overfits, and beyond a certain level it does not increase your model accuracy at all.

The last one is the learning rate. This is the fraction of the residual that XGBoost adds back from each boosting tree to the previous ones. If you increase the learning rate, the algorithm learns faster, but sometimes it does not converge, which means you end up with poorer accuracy. So increasing the learning rate makes the algorithm faster but may not give proper results; keeping the learning rate small is good, but if it is too small it takes ages to fit, and sometimes that is not optimal either. Finding the right learning rate is important. It can be anywhere between 0.01 and, say, 0.5 or 0.8; you can even go below 0.01 for certain data, but generally 0.01 is a good lower bound, so 0.01 to 1 is a nice training range, and occasionally you can go beyond 1. You cannot strictly call the learning rate an overfitting or a regularization parameter, but roughly, increasing it makes the model converge faster and decreasing it generalizes better.

To put it simply, then, these are the important parameters: max_depth, subsample, colsample_bylevel, colsample_bytree, min_child_weight, lambda and alpha (these last three being regularization parameters), plus n_estimators and the learning rate.
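As a quick reference for the parameters just listed, here is a minimal sketch of how they appear in the xgboost scikit-learn wrapper. The values shown are only rule-of-thumb starting points taken from the ranges above, not recommendations for any particular dataset.

from xgboost import XGBClassifier

# Increasing max_depth / n_estimators pushes the model towards overfitting;
# lowering subsample / colsample_* or raising min_child_weight, reg_lambda
# or reg_alpha pushes it towards regularization.
model = XGBClassifier(
    max_depth=6,            # tree depth, typical search range ~2-30
    subsample=1.0,          # fraction of rows per tree, ~0.1-1.0
    colsample_bytree=1.0,   # fraction of features per tree, ~0.1-1.0
    colsample_bylevel=1.0,  # fraction of features per tree level, ~0.1-1.0
    min_child_weight=1,     # larger values regularize
    reg_lambda=1.0,         # L2 regularization (lambda)
    reg_alpha=0.0,          # L1 regularization (alpha)
    n_estimators=100,       # number of boosted trees
    learning_rate=0.1,      # smaller is slower but usually generalizes better
)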
OK, so I'm going to take these parameters and fine-tune them on a default dataset, breast cancer, and see how it works. Breast cancer is a standard dataset, so I can load it from sklearn.datasets with load_breast_cancer: data = load_breast_cancer(). I haven't been using the default datasets recently, so let me check how it comes: X is data.data, and let me look at the keys, the target names, the description and the feature names; the feature names are there, but somehow they don't come up when I press Tab. To make it nicer I'm going to convert this into a DataFrame, so let me import pandas (I'm more comfortable with that anyway) and put the column names in as well, using the feature names. Yes, this looks nice. These are all X: checking the shape, we have 30 variables (30 X variables) and 569 records, and my y is the target. So that's a standard dataset; I just imported it and converted it to a DataFrame because it's easier and it has header names, as you can see.

Now I'm going to fit XGBoost and see how it does. I have already imported xgboost, so I don't have to import it again. I'm going to go straight ahead and split this dataset into training and testing; I could use a cross-validator, but I'll keep it simple so that you can see the effect clearly. I pass X and y, set random_state = 0, and because it's a small dataset I'll take 20 percent as the test size; that's pretty good. Then I define the model as a classifier and fit it on X_train and y_train, and the fitting is done. Right now I am not setting any hyperparameters; I'm just going to fit first and see how it looks, and then we will get into the hyperparameters. So I predict on X_test and print the accuracy score against y_test: we got 97 percent already, which is pretty good. That's one of the benefits of XGBoost, or even random forests: the default parameters already give you a decent accuracy even without any tuning. But let us try tuning anyway.

Let me also look at the crosstab, which is the confusion matrix, and print the classification report; I'm assuming basic machine learning knowledge, so I'm not explaining each and every thing. We get about 99 percent recall, which is still not good enough here: 0 is non-cancer and 1 is cancer, and if you look at it, there are two patients who don't have cancer but are predicted as cancer, which is okay, but there is one patient who has cancer and is predicted as non-cancer, and that is not good; it should also be caught. But for our discussion let's keep that aside, it's too much detail, and focus only on the accuracy score, which is 97.37 percent.
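The baseline steps above, sketched as a minimal script. The variable names are my own and the exact numbers will differ from the notebook, but the calls are the standard scikit-learn and xgboost ones.

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier

# Load the standard breast cancer dataset (569 records, 30 features)
# and wrap the features in a DataFrame so the column names are visible.
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Simple hold-out split: 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit with default hyperparameters first as a baseline.
model = XGBClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))
print(pd.crosstab(y_test, y_pred))            # confusion matrix
print(classification_report(y_test, y_pred))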
Now I'm going to copy-paste the same code, and this time I'm going to start fine-tuning, going back to my table of parameters, starting with max_depth. Let me also print the training accuracy so you see both the training and the testing accuracy. Oh, I did not predict for the training data, so I'll go back and add a prediction on X_train as well; now I'm giving both the training data and the test data to the model, predicting on both, and printing them with small labels, "train accuracy" and "test accuracy".

Your training accuracy is already perfect, so the model has overfitted the training data; there is nothing much to gain there, but let's see how we can bring it back towards regularization. Remember that increasing max_depth overfits, so decreasing it should help. Let me directly set max_depth = 2 and start with that; it is still showing roughly the same. If you set max_depth = 1, you can see that the model accuracy actually drops, because the tree is basically not allowed to grow; on this dataset a depth of 1 or 2 is not really helping. But if you only reduce it a bit, to 3 say, the training accuracy is still 1 while the testing accuracy goes up a little: when I decrease max_depth the model generalizes more and the test accuracy increases a bit. So that explains max_depth.

Next, subsample. I should be cautious here because this is a really small dataset, 569 records, of which I have taken only about 80 percent for training, so I cannot demonstrate every parameter's effect on it, but I will try my best to explain. The default is 1; if you put 0.5 you see the model coming out of overfitting. Basically, keeping subsample at 1 overfits, and as you keep decreasing it, the training accuracy reduces and sometimes the testing accuracy improves because the model generalizes better; in this case both actually reduce, so I'll leave it at 1 and not disturb it. You can also play with colsample_bylevel and colsample_bytree, which have a very similar effect to subsample, so I'm not going to touch them.

Now let me look at n_estimators; that's very interesting because it's one of the very important parameters. If I put n_estimators = 2, I obviously have only two trees, so you expect a reduction in both training and testing accuracy, and indeed both are reduced; we have moved into the underfitting region. Putting n_estimators = 1 is a bit crazy, because with only one tree there is no concept of gradient boosting at all. Put 10 and you see both accuracies increasing; put 100 and the training accuracy is 100 percent while the testing accuracy comes to about 98.24 percent. So that's n_estimators.

Learning rate again: if you put 0.001 or 0.01, the algorithm takes very, very small steps, so it takes some time and is not very efficient; put 0.1, which is actually the default for XGBoost, and you get what it was already giving you.
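A minimal sketch of these one-parameter-at-a-time experiments, reusing the imports and the X_train/X_test split from the previous sketch; the helper function and the values passed in are only illustrative.

def fit_and_report(**params):
    # Refit an XGBClassifier with the given hyperparameters and compare
    # training vs. testing accuracy to see whether the change moves the
    # model towards overfitting or towards regularization.
    m = XGBClassifier(**params)
    m.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, m.predict(X_train))
    test_acc = accuracy_score(y_test, m.predict(X_test))
    print(params, "train:", round(train_acc, 4), "test:", round(test_acc, 4))

fit_and_report(max_depth=3)          # shallower trees regularize
fit_and_report(subsample=0.5)        # row subsampling regularizes
fit_and_report(n_estimators=2)       # too few trees underfit
fit_and_report(n_estimators=100)     # default number of trees
fit_and_report(learning_rate=0.01)   # smaller steps, slower convergence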
There is a small trick you can do with the learning rate. First you find the smallest learning rate that gives a good accuracy; then, if you want to squeeze out a little more, the trick is pretty straightforward: you decrease (divide) the learning rate by some factor and increase (multiply) the number of estimators by the same factor, and you usually see a slight improvement in accuracy. Let me give it a try on this dataset: I divide the learning rate by 2, so reduce it by 50 percent, and multiply the number of estimators by 2 as well. It didn't have any effect here, or only a very minor one, because this is not really a big dataset; let me apply it again after I fine-tune the other parameters. The trick does work when you have a significant amount of data.

What other parameters do we have? There is random_state, which I don't really treat as a tuning parameter; let me come back to it last. There is min_child_weight, which regularizes and is one of the important parameters; it usually works very well, so let me put 1 and see what happens, and play with it a bit. That's not really helping here, and neither is reducing subsample further; I also don't really suggest pushing the learning rate up to 0.2. So 98.24 percent is the maximum we are getting with these; nothing more is happening, so maybe this is just not a great dataset for playing around. But look, the learning-rate trick actually works now: with the parameters I have, I had already reached roughly the maximum accuracy, but you can get a small extra bump by reducing the learning rate by 50 percent and doubling the number of estimators; I multiply n_estimators by two and divide the learning rate by two, and you see we get 99.12 percent.

So these are the parameters, and they are good enough for your tuning. Of course you can also play with lambda and alpha, the regularization parameters reg_lambda and reg_alpha; you can experiment with those as well, but my model has already reached about the maximum it can, so I'm not going to play with them here.

And to be more rigorous, do not rely on a particular random_state. You can use it this way: do all the hyperparameter tuning first, and then start changing the random_state. If changing the random_state makes the accuracy swing over a wide range, your model is not good; after you have tuned all the parameters, changing the random_state should not affect the model much. A small change of around 0.01 percent is okay, but if the accuracy changes widely, your model is probably not good enough. So the random_state in XGBoost should not affect your model accuracy, and you can use it in this way to understand how robust your model is.

All right, I hope this video is helpful. I shall put this file in one of my git repositories and share the link below so you can download it and play with it. If you found this video helpful, please subscribe to the channel, and if you have any questions you can drop them in the comments or send me an email; I have mentioned my email in the description. Thank you for watching this video; I'll see you in another one.
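Finally, a sketch of the two closing ideas from the captions, reusing the names from the earlier sketches: the divide-the-learning-rate / multiply-the-estimators trick, and the random_state stability check. The specific numbers are placeholders rather than the values tuned in the video, and subsample is set below 1 here only so that the seed has something to randomize.

# Trick: divide the learning rate by a factor and multiply the number of
# estimators by the same factor.
factor = 2
base_lr, base_trees = 0.1, 100
tuned = XGBClassifier(max_depth=3,
                      learning_rate=base_lr / factor,
                      n_estimators=base_trees * factor)
tuned.fit(X_train, y_train)
print("tuned test accuracy:",
      accuracy_score(y_test, tuned.predict(X_test)))

# Robustness check: after tuning, changing random_state should not swing
# the accuracy over a wide range.
for rs in range(5):
    m = XGBClassifier(max_depth=3, subsample=0.8,
                      learning_rate=base_lr / factor,
                      n_estimators=base_trees * factor,
                      random_state=rs)
    m.fit(X_train, y_train)
    print("random_state", rs, "->",
          round(accuracy_score(y_test, m.predict(X_test)), 4))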
Info
Channel: DataMites
Views: 28,568
Rating: 4.7701778 out of 5
Keywords: Datamites Institute, Data Science Courses, xgboost, XGBOOST in Python, XGBoost in Machine learning, XGBoost Algorithm, XGBoost Data Science, Data Science Tutorials, Data Science and Machine Learning Courses, Machine learning tutorials, data science with python, python for data science, python tutorials
Id: AvWfL1Us3Kg
Length: 31min 10sec (1870 seconds)
Published: Tue Dec 31 2019