How to train XGBoost models in Python

Captions
How do you train an XGBoost model in Python? This tutorial will get you started with an example, step by step. I'll show you what XGBoost is and how to implement it in Python, including scikit-learn pipeline setup, hyperparameter tuning, model evaluation, and a feature importance plot. By the end, you'll be able to build your own XGBoost model for your prediction tasks. Let's dive in.

Hi everyone, I'm Justin. Welcome to Just Into Data, where data science materials are shared and made simpler for you.

Before jumping into the example in Python, let's answer the question: what is XGBoost? XGBoost stands for eXtreme Gradient Boosting. It's an optimized implementation of the gradient boosting algorithm, and it became well known because of its outstanding accuracy and efficiency compared to other algorithms in machine learning competitions. You can easily apply XGBoost to supervised learning problems to make predictions.

In case you're not familiar with gradient boosting, let's briefly explain it. Gradient boosting is a machine learning algorithm that sequentially ensembles weak predictive models into a single stronger predictive model. We can apply it to both supervised regression and classification problems. If you're interested in learning more about the gradient boosting algorithm, we've covered its fundamentals in a written tutorial; please check out "What is gradient boosting in machine learning: fundamentals explained". I'll put the link in the description.

The most common choice of weak models to ensemble in gradient boosting is decision trees. Tree-based gradient boosting methods are performant and easy to use, but despite these advantages the model tends to overfit and can demand a lot of computing power. XGBoost, also a tree-based gradient boosting implementation, overcomes some of these disadvantages with its optimizations. Let's look at some key improvements of XGBoost versus traditional gradient boosting.

First, it is more regularized. XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance; you can consider XGBoost a more regularized version of gradient tree boosting. For example, the objective function of XGBoost has a regularization term added to the loss function. We'll use some regularization parameters in our XGBoost Python example.

Second, it is more scalable. XGBoost was designed to be scalable and implements practices such as memory optimization, cache optimization, and distributed computing that can handle large datasets. Overall, XGBoost is a faster framework that can build better models.

The XGBoost framework has an open-source Python package, built for easy integration with the popular machine learning library scikit-learn. If you're familiar with scikit-learn, you may find XGBoost easier to use.

All right, now we're ready to build an XGBoost model in Python. I'll use JupyterLab to demonstrate this tutorial, and in this notebook I've divided the process into different steps.

Step 1: explore and prep the data. We'll use some bank marketing data as an example; you can download the dataset from the UCI page linked in the description. This dataset records the telemarketing campaigns of a Portuguese bank. Based on the client information and campaign activities, we'll predict whether a client will subscribe to a term deposit, yes or no, so we're dealing with a supervised classification task. The project comes with different datasets; we'll use bank-additional-full.csv. Click the data folder and download the bank-additional.zip file. Don't forget to extract it to your desired location, since it's zipped; the CSV file is contained in that folder. Back in the notebook, let's load the dataset: as usual, we import pandas as pd and use the read_csv function to load the dataset as df.
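To make that step concrete, here is a minimal loading sketch. The file path assumes the zip was extracted next to the notebook, and the UCI bank-additional files use a semicolon separator; adjust both to match your setup.

```python
import pandas as pd

# Assumed path: bank-additional.zip extracted next to the notebook.
# The UCI bank-additional CSV files are semicolon-separated.
df = pd.read_csv("bank-additional/bank-additional-full.csv", sep=";")

print(df.shape)  # number of rows and columns in the raw dataset
```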
In this XGBoost Python tutorial, I assume you already know the basics of Python, including pandas. If you need help, please check out our course "Python for Data Analysis: step by step with projects"; it teaches pandas, which is necessary for transforming your dataset before modeling, and much more. Again, the link is in the description below.

Back to our code. In reality, we'd explore the dataset after loading it and before transforming it, but for simplicity I'll skip that process here and change the dataset with the following code. The variable cols_to_drop stores the columns to drop, the ones that are less related to the target based on my judgment. We then transform the dataset df by dropping this list of columns and, at the same time, renaming the remaining columns so they're more understandable: for example, we rename the column job to job_type, default to default_status, and so on. For the detailed definitions of these columns, please read the UCI page for the dataset shown earlier. Lastly, we convert the target column, result, to numerical values: when the client accepts the term deposit (yes), the value is 1, otherwise 0.

Now let's look at our clean dataset. We can print out the first five rows to have a quick look at the head of the dataset, and we can also print its info summary. As you can see, we have 14 features plus the target result, with no missing data; note that XGBoost can handle missing values internally even if there are some. If we look at the value counts of the result column, it shows that most of the customers rejected the offer from the bank, while 4,640 accepted it. All right, I won't spend too much time looking at the dataset.

Next, let's split the dataset into training and test sets. As in the usual machine learning process, we first separate the features from the target as X and y, then split them into training and test sets using the train_test_split function from sklearn. We also set the sampling to be stratified based on the target's values and fix a random seed so that we get a reproducible result. Now we have the training set as X_train and y_train, which is 80 percent of the original dataset, and the test set as X_test and y_test, which is 20 percent of the original dataset. That's all the data prep we'll do for this tutorial.
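Putting Step 1 together, here is a sketch of the cleaning and splitting code. The exact list of dropped columns and most of the renames are assumptions on my part (the video only spells out job to job_type, default to default_status, and the target result), as is the seed value, so adapt them to your own notebook.

```python
from sklearn.model_selection import train_test_split

# Assumed list of columns judged less related to the target; adjust as needed.
cols_to_drop = ["duration", "emp.var.rate", "cons.price.idx",
                "cons.conf.idx", "euribor3m", "nr.employed"]

# Drop the columns and rename the rest to clearer names.
df = (df.drop(columns=cols_to_drop)
        .rename(columns={"job": "job_type",
                         "default": "default_status",
                         "y": "result"}))  # remaining renames omitted here

# Convert the target to numeric: 1 if the client subscribed, 0 otherwise.
df["result"] = (df["result"] == "yes").astype(int)

# Quick look at the cleaned dataset.
print(df.head())
df.info()
print(df["result"].value_counts())

# Separate features and target, then make a stratified 80/20 split.
X = df.drop(columns="result")
y = df["result"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```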
Next, in Step 2, we'll set up a pipeline for training using the scikit-learn package. A scikit-learn pipeline sequentially applies a list of transforms and a final estimator, and it conveniently assembles several steps that can be cross-validated together while training. Building a pipeline is much easier, and ensures more consistency, than setting up the process manually, so it's a good practice to follow. If you're not familiar with pipelines, you can find a link to the documentation below and read more about them.

Back to our example. We set up a pipeline called pipe, passing the parameter steps as a variable called estimators. The estimators variable is a list of tuples in sequential order. First comes an encoder, a target encoder: encoding like this is a standard preprocessing procedure in classification problems, and it transforms the categorical features in our dataset into numeric ones. You can read more about target encoders; I'll put a link below. Then comes an estimator, clf, an XGBClassifier; this is the scikit-learn wrapper implementation of the XGBoost classifier. Again, we set the random state here to get reproducible results. If you have a regression problem, please use XGBRegressor instead.

If we run the code and print out pipe, you'll see some warning messages, but they're nothing we need to worry about for now, so we'll just ignore them. Below, in the output, you can see that we've assigned the pipeline steps as the encoder followed by clf. In the following steps, we'll train the model by calling this pipeline, which ensures the dataset is always encoded by the target encoder before the XGBoost classifier is fitted.

Moving on to Step 3, one more very important step before training our XGBoost model in Python. The XGBoost model contains many hyperparameters, and we should tune them to get a better model. As you might know, there are different ways of doing hyperparameter tuning, such as grid search and random search; this tutorial will use a different approach. We'll use a package called scikit-optimize (skopt) for hyperparameter tuning; it's easy to use and integrates well with scikit-learn. Within the package, we'll use an object called BayesSearchCV, the scikit-learn hyperparameter search wrapper. In short, it uses Bayesian optimization, where a predictive model is used to model the search space of hyperparameter values and arrive at a good combination based on cross-validation performance. So it's an efficient yet effective approach to hyperparameter tuning.

To use BayesSearchCV, we need to define a search space of hyperparameter values. Back in our notebook, within the variable search_space we set up the ranges of the selected hyperparameter values to be searched, as a dictionary. The keys of the dictionary are parameter names. Since we're using a pipeline, we first specify the name of the XGBClassifier estimator, clf, that we set up earlier, followed by two underscores and the hyperparameter name; this is the structure for referring to nested parameters within a pipeline. For example, clf__max_depth refers to the parameter max_depth within the estimator called clf, and likewise for the learning rate and so on. The values of the dictionary are the type and range of each hyperparameter, defined with the space module of scikit-optimize; the options are Integer, Real, or Categorical. As a result, only these hyperparameter values will be considered during tuning.

This list of hyperparameters is not exhaustive. We're tuning max_depth within the XGBClassifier, the maximum tree depth for base learners, as an integer between 2 and 8; learning_rate, the boosting learning rate, as a number between 0.001 and 1.0 with a log transform; and subsample, and so on, down to the gamma parameter. You can remove or include more hyperparameters by reading their definitions in the XGBClassifier documentation, which again is linked in the description, and you can change their search values as well.

After that, we set up a variable opt as a BayesSearchCV and feed it with: pipe, the pipeline we set up earlier; search_space, the search space of hyperparameters we just defined; cv, the number of cross-validation folds, as 3; n_iter, the number of hyperparameter settings that are sampled, as 10; scoring, the evaluation metric, as ROC AUC; and, of course, a random state number. In reality, you may consider setting cv and n_iter to higher values to get a better result; we've set them lower so the training process is faster. Note that it's necessary to use a scikit-learn pipeline with BayesSearchCV, because this ensures the target encoding is applied to the correct subset of data during cross-validation.
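Here is a sketch of Steps 2 and 3 under a few assumptions: the target encoder comes from the category_encoders package, the search ranges for subsample and gamma are placeholders (only the max_depth and learning_rate ranges are stated in the video), and the random seeds are arbitrary.

```python
from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Step 2: pipeline, encode categorical features, then fit the classifier.
estimators = [
    ("encoder", TargetEncoder()),
    ("clf", XGBClassifier(random_state=42)),
]
pipe = Pipeline(steps=estimators)
print(pipe)

# Step 3: search space. Keys use the <step name>__<parameter> pattern so
# BayesSearchCV can reach the nested XGBClassifier parameters.
search_space = {
    "clf__max_depth": Integer(2, 8),                              # max tree depth
    "clf__learning_rate": Real(0.001, 1.0, prior="log-uniform"),  # boosting learning rate
    "clf__subsample": Real(0.5, 1.0),                             # assumed range
    "clf__gamma": Real(0.0, 10.0),                                # assumed range
    # add or remove hyperparameters here as you see fit
}

opt = BayesSearchCV(
    estimator=pipe,
    search_spaces=search_space,
    cv=3,               # cross-validation folds
    n_iter=10,          # hyperparameter settings sampled
    scoring="roc_auc",  # evaluation metric
    random_state=42,
)
```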
Finally, we've got everything set up for training. Step 4: train the XGBoost model. opt includes both the pipeline and the hyperparameter tuning settings, so we simply call its fit method on the training set. Again, there are some warnings that we can ignore for now, and after waiting a while, our XGBoost model is trained. It's done; we'll just click to minimize this long output.

Moving on to Step 5: evaluate the model and make predictions. Let's look at the chosen pipeline, the best estimator. You can see the columns the target encoder has encoded, and here is the best XGBClassifier with its parameter values. Now let's evaluate this estimator. Recall that we set the scoring within opt to ROC AUC, so we can call the best score of opt to see the ROC AUC score on the training set; the closer the score is to 1, the better the predictions the model can make, and this is a fair score. By calling the score method on the test dataset, we get the same metric for the test set, and we can see that the scores on the training and test sets are close. To make predictions, we use the predict or predict_proba methods on the test set, the same process as with other scikit-learn estimators.

All right, we're pretty much done; the last step is optional. In Step 6, we measure feature importance, which is useful if you want to interpret the model better. The xgboost package offers a plotting function based on the fitted model, so first we need to extract the fitted XGBoost model from opt. As you can see, the XGBClassifier is printed with this code, and we can use basic Python indexing to grab it: within the list of pipeline steps, we use index 1 to grab the clf step, and then within that tuple we use index 1 again to grab the model itself. With the fitted XGBClassifier in hand, we can apply the plot_importance function to plot feature importance; by default, it calculates importance as the number of times each feature appears in the trees. We can see the feature importance plot; please investigate further if you're interested.
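And a sketch of Steps 4 to 6, given the opt object defined above:

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# Step 4: run the Bayesian search; each sampled setting is cross-validated.
opt.fit(X_train, y_train)

# Step 5: evaluate. best_score_ is the best mean cross-validated ROC AUC on
# the training data; score() reports the same metric on the test set.
print(opt.best_estimator_)
print("Training ROC AUC:", opt.best_score_)
print("Test ROC AUC:", opt.score(X_test, y_test))

# Predictions work like any other scikit-learn estimator.
y_pred = opt.predict(X_test)
y_prob = opt.predict_proba(X_test)

# Step 6: extract the fitted XGBClassifier from the best pipeline
# (steps[1] is the ("clf", model) tuple, [1] is the model itself),
# then plot feature importance. The default importance type counts how
# many times each feature is used to split across the trees.
xgb_model = opt.best_estimator_.steps[1][1]
plot_importance(xgb_model)
plt.show()
```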
And that's it; as you can see, building XGBoost models in Python is easy. In this tutorial, you successfully built an XGBoost model and made predictions in Python. Did you learn something new in this video? If so, make sure to subscribe to our YouTube channel; just click the Subscribe button below this video right now. If you're interested in more data science tutorials and courses, please head over to our website, justintodata.com. Thank you, and see you in the next video!
Info
Channel: Lianne and Justin
Views: 25,560
Keywords: xgboost, xgboost python, gradient boosting, python, sklearn, machine learning, data science, xgboost classifier
Id: aLOQD66Sj0g
Length: 18min 56sec (1136 seconds)
Published: Mon Jan 16 2023