XGBoost in Python from Start to Finish

Captions
XGBoost is extreme, but so is this webinar. It's totally extreme! Yes! StatQuest! [Music] Hooray! I'm Josh Starmer, and welcome to the StatQuest webinar on XGBoost in Python from start to finish.

This is the Jupyter Notebook we're going to go through today. We're going to use XGBoost to build a collection of boosted trees, one of which is illustrated below, and use continuous and categorical data from the IBM Base Samples website to predict whether or not a customer will stop using a company's service. In business lingo, this is called Customer Churn. You can download the Telco Churn dataset, or use the file provided with the Jupyter Notebook. If you want to learn more about the Telco Churn dataset, you can click on the link in the Jupyter Notebook; it's live. Or, if you just want to learn more about the Base Samples, there are all kinds of datasets there that you can use for other machine learning experiments, so it's a great resource for testing out these models in general.

XGBoost is an exceptionally useful machine learning method when you don't want to sacrifice the ability to correctly classify observations, but you still want a model that is fairly easy to understand and interpret. In this lesson you will learn about: importing data from a file; dealing with missing data, XGBoost style, which is relatively unique; formatting the data for XGBoost, including One-Hot Encoding (and, because of the way XGBoost handles missing data, we're going to have a mini StatQuest in the middle of this webinar to explain the specifics of One-Hot Encoding and how it relates to how missing data is handled); building a preliminary XGBoost model; optimizing parameters with cross validation and grid search; and, lastly, building, drawing, interpreting, and evaluating the optimized XGBoost model.

NOTE: This tutorial assumes that you already know the basics of Python and are familiar with the theory behind XGBoost, cross validation, and confusion matrices. If not, check out the StatQuests by clicking on the links for each topic. Also, I strongly encourage you to play around with the code; playing with the code is the best way to learn from it.

The very first thing we do is load in a bunch of Python modules. Python itself just gives us a basic programming language; these modules give us extra functionality to import the data, clean it up and format it, and then build, evaluate, and draw the XGBoost model. NOTE: You will need Python 3 and at least the listed versions of the following modules: pandas, numpy, scikit-learn, and xgboost. I've got instructions on how to install all of this stuff, and, if you want to actually draw that beautiful tree, instructions on how to install graphviz as well.

Since this is a Jupyter Notebook, we can run the Python code by clicking the play button, selecting "Run Selected Cells", or using the keyboard shortcut. NOTE: I'm using a Macintosh, so on your computer the keyboard shortcut may be different. When I run a cell, a number appears next to it that tells us the code has run. By the way, if Python is still cranking away on something a little more complicated than loading in some modules, you'll see a star there instead until it finishes.
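For reference, here's a minimal sketch of the kinds of imports the notebook relies on. The exact list is an assumption based on what's used later in the webinar; note that plot_confusion_matrix comes from older versions of scikit-learn (roughly 0.22 through 1.0) and was later replaced by ConfusionMatrixDisplay.from_estimator():

```python
import pandas as pd    # load and manipulate data, and for One-Hot Encoding
import numpy as np     # basic numerics
import xgboost as xgb  # XGBoost itself

from sklearn.model_selection import train_test_split  # split data into training and testing sets
from sklearn.model_selection import GridSearchCV      # cross validation over a grid of parameters
from sklearn.metrics import plot_confusion_matrix     # draw a confusion matrix (scikit-learn < 1.2)
```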
Okay, now we're ready to import the data. We're going to load in a dataset from the IBM Base Samples. Specifically, we're going to use the Telco Churn dataset, which allows us to predict if someone is going to stop using Telco's services or not, using a variety of continuous and categorical data types. NOTE: When pandas reads in data, it returns a DataFrame, which is a lot like a spreadsheet: the data are organized in rows and columns, and each row can contain a mixture of text and numbers. The standard variable name for a DataFrame is the initials "df", and that's what we're going to use here. So we're creating a new DataFrame called df and setting it to the output of the pandas function read_csv(), since this is a CSV file we're loading in. Let's do that. Bam!

Now that we've loaded the data into a DataFrame called df, we look at the first five rows using the head() function. It says "5 rows x 33 columns", and when we scroll over we see all these columns. Not all of them have been printed to the screen; there's a "..." between Gender and Contract, which just means that even though there are 33 columns, not all of them were printed.

NOTE: The last few columns, like Churn Reason, CLTV, and Churn Score, are "exit interview" data that were collected from people who left Telco, and only people who left provided answers here. We don't want to use this data for our predictions because, generally speaking, someone isn't going to do the exit interview before they leave the company, and since these columns would give us perfect predictive ability, we want to remove them from the dataset. We do that using the drop() function: with our DataFrame, we call drop() and list the columns we want to drop. We set axis=1 to specify that we're dropping columns instead of rows, and we set inplace=True, which means we want to modify df directly rather than make a copy that has those columns dropped. After that, we print the first five rows again, just to verify that we did everything correctly. When we scroll over to the right, we see that Churn Reason and CLTV are gone.

Some of the other columns in this dataset only contain a single value and will not be useful for classification. For example, in the Count column we just see a bunch of 1s. We can verify that the only value in this column is 1: with our DataFrame, we specify that we want to look at the Count column, and then we call the unique() function to print out all the unique values in that column. When we run that, we see that the only unique value is 1, and that makes this column useless for making predictions. Likewise, just looking at the Country column, we see "United States" a bunch of times, and if we print out the unique values in the Country column, we see that the only value is United States.
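A sketch of that loading and dropping step, under the assumption that the file name and column names match the IBM Telco dataset (adjust them to whatever your download actually uses):

```python
# load the Telco Churn dataset into a DataFrame
df = pd.read_csv('Telco_customer_churn.csv')
df.head()  # print the first 5 rows

# drop the "exit interview" columns, which only have values for people
# who already left and would make the predictions trivially perfect
df.drop(['Churn Label', 'Churn Score', 'CLTV', 'Churn Reason'],
        axis=1,        # axis=1 means drop columns, not rows
        inplace=True)  # modify df directly instead of returning a copy
df.head()

# check for columns that only contain a single value
df['Count'].unique()    # array([1])
df['Country'].unique()  # array(['United States'])
```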
Similarly, in the State column we see a bunch of entries for California, and when we print out all the unique values in State, we again see just one value: California. That means we can omit Count, Country, and State from the analysis, because they're not going to help with predictions. In contrast, the City column contains a bunch of different city names; we see Los Angeles, Beverly Hills, Huntington Park, Standish, all kinds of stuff. So we're going to leave City in, because it may help us make predictions.

However, we're also going to remove Customer ID, because there's a different value for every single person, so it's not going to be helpful for predictions. There's also a column called Lat Long, which contains both the latitude and longitude of each customer, but we already have separate Latitude and Longitude columns, so we don't need the column that merges those two things. We drop these columns just like we did before: with our DataFrame, we call the drop() function, pass in an array of the columns we want to drop, specify that we're dropping columns instead of rows, and, again, use inplace=True so we modify the DataFrame df itself. As always, we print out the first five rows to make sure we did everything correctly. Hooray, looks like we did! Now we're down to just 24 columns.

NOTE: Although it's okay to have white space in city names (you can see we've got "Los Angeles" with a blank space between "Los" and "Angeles", and, up where we printed the unique city names, blank spaces between "Beverly" and "Hills" and between "Huntington" and "Park"), those blank spaces are perfectly fine for XGBoost itself, partially because we're going to use One-Hot Encoding. However, we can't have any white space if we want to draw the actual tree at the very end, like the one all the way up at the top. So we need to remove the white spaces.

Since we know there are a lot of white spaces in the City column, we specify our DataFrame, specify that we're interested in the City column, and then use the replace() function. This is just a search-and-replace, like you'd find anywhere else: we're searching for blank spaces and replacing each one with an underscore. We set regex=True; regex is short for "regular expression", and if you're not familiar with that term, don't worry, just think of it as advanced features for search-and-replace. Again, just like before, we make these modifications in place, modifying df directly, and then print out the first five rows to make sure we did it right. There we go: we see an underscore in Los_Angeles. We can also print out the first 10 unique city names and verify that we've got underscores between Beverly and Hills, between Huntington and Park, and in Marina_del_Rey. We've eliminated those white spaces.

We also need to eliminate the white spaces in the column names, so we're going to replace those with underscores as well.
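A sketch of those two steps, with column names again assumed to match the IBM dataset:

```python
# drop the single-value columns, Customer ID (unique per person), and
# Lat Long (redundant with the separate Latitude and Longitude columns)
df.drop(['Count', 'Country', 'State', 'CustomerID', 'Lat Long'],
        axis=1, inplace=True)
df.head()

# replace the white space in city names with underscores so that we can
# draw a tree later; regex=True enables pattern-based search-and-replace
df['City'].replace(' ', '_', regex=True, inplace=True)
df.head()
df['City'].unique()[0:10]  # spot-check the first 10 unique city names
```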
We do this slightly differently: with our DataFrame df, we specify that we're interested in the columns, not a specific column but the column names themselves. We convert those column names to strings and then call the replace() function; again, this is just a search-and-replace, searching for white space and replacing it with underscores. Just like before, we print out the first five rows. Bam! Now we see underscores between Zip and Code, Senior and Citizen, Tenure and Months, etc. Hooray! We've removed all of the data that will not help us create an effective XGBoost model, and we've reformatted the column names and city names so that we can draw a tree later.

Okay, now we're ready to identify and deal with missing data. Unfortunately, the biggest part of any data analysis project is making sure that the data are correctly formatted, and fixing them when they're not. The first part of this process is identifying missing data. Missing data is simply a blank space, or a surrogate value like "NA", that indicates we failed to collect data for one or more features. For example, if we forgot to ask someone's age, or forgot to write it down, then we would have a blank space next to that person's age in this dataset.

One thing that's relatively unique about XGBoost is that it has default behavior for missing data: it knows how to handle it, it's expecting it. So all we have to do is identify the missing values and make sure they are set to 0. Now, in the webinars I've given so far, this has confused a lot of people: what happens if there is already a 0 in your dataset? I'm going to have a little mini StatQuest about halfway through that shows how we can have 0s that code for real values while also using 0 to code for missing data, and how it works out. I'm only going to show you one scenario, but even if there's some scenario where it doesn't work out the way I demonstrate, I'll throw this out as well: the author of XGBoost has said that even when 0 codes for something real and you also use 0 to mean missing data, XGBoost still does a great job. For some reason, it doesn't really interfere with how well it performs.

The first thing we do is see what sort of data is in each column. This is what I always do when I'm looking for missing data: I print out the data types for each column, because that can tell us if something is messed up. We see that a lot of the columns are "object", and that's okay, because when we ran head() above, we saw that Senior Citizen has a bunch of Nos (and probably some Yeses in there), Partner has Yeses and Nos, and Dependents has Yeses and Nos, so it makes sense that a lot of these columns are "object": they have text responses like Yes and No.

However, we should always verify that we're getting what we expect in each column. For example, we'll look at the Phone_Service column and use our handy unique() function to check what we're getting, and what we get is Yes and No. That's perfect, because it means there are no question marks in this column and no "NA" placeholders for missing data, so we can verify that this column only contains Yeses and Nos. Now, in practice, we should do that for every single column, verifying that each one has the type of data we're expecting and only the responses we're expecting, and, trust me, I did this.
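A sketch of those checks; the column-name cleanup follows the transcript's description of converting the names to strings and replacing the spaces:

```python
# replace white space in all of the column names with underscores
df.columns = df.columns.str.replace(' ', '_')
df.head()

# print the data type of each column; 'object' columns deserve a closer
# look because they can hide a mixture of numbers and text
df.dtypes

# spot-check a categorical column for unexpected values
df['Phone_Service'].unique()  # array(['No', 'Yes'])
```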
But right now we'll focus on one specific column that looks like it could be a problem: Total Charges. If we look at the output, Total_Charges looks like a bunch of numbers. However, if we look at its data type, we see that it's "object", and we usually get "object" when a column contains a mixture of numbers and characters. One thing we can do is print out the unique values in Total_Charges and see what we see. Bam! There are too many values to print, we've got that "..." right in the middle, but what we can see looks like a bunch of numbers. However, if we try to convert the column to numeric values, we get an error when we run the code, so I've commented it out in case you want to run all the code at once. The nice thing about this error, and the reason I wanted to show it to you, is that it actually tells you what's wrong with the data. It says: unable to parse string " " (a blank space). That tells us there are blank spaces in the Total_Charges column, and we need to deal with them.

Okay, so now we're ready to deal with missing data, XGBoost style. Like I've said before, one thing that's relatively unique about XGBoost is that it has default behavior for missing data, so all we have to do is identify the missing values and make sure they are set to 0. However, before we do that, let's see how many rows are missing data. If it's a lot, then we might have a problem on our hands that is bigger than what XGBoost can handle on its own; if it's not that many, we'll just set them to 0.

We do this using loc[]: with our DataFrame, we say "give me the rows where this is true", namely where the value in the Total_Charges column is a blank space, and then we wrap all of that in the len() (length) function, which counts the number of rows that have blank spaces in Total_Charges. We see there are only 11 such rows, and since it's only 11, we can print them out and look at them. Bam! We see that in the Total_Charges column these rows have no values, and we also see that in the Tenure_Months column everyone has 0. That means the reason these people have blank Total Charges is that they haven't been charged for anything yet: they just subscribed, they're on a plan, we expect money from them, but they haven't paid us anything yet.

Since it's just a handful of people, we're going to set their Total Charges to 0. The way we do that is a lot like what we did before, only this time, instead of having loc[] return the entire row, we specify that we just want Total_Charges. Again, we're interested only in the locations where Total_Charges is a blank space, and we set that value in the Total_Charges column to 0. Let's do that. We can verify that we modified Total_Charges correctly by looking at everyone who had Tenure_Months set to 0.
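A sketch of that step, following the loc[] calls described above:

```python
# count the rows where Total_Charges is a blank space
len(df.loc[df['Total_Charges'] == ' '])  # 11

# look at those rows; everyone has Tenure_Months == 0 (brand-new customers)
df.loc[df['Total_Charges'] == ' ']

# set the blank Total_Charges to 0 so XGBoost can treat them as missing
df.loc[(df['Total_Charges'] == ' '), 'Total_Charges'] = 0
```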
NOTE: I'm not looking at rows where Total_Charges equals 0, because there could be other people who haven't paid a dime even though they've been on board for a couple of months; they might already have 0s there. But I'm pretty certain that everyone who had Tenure_Months equal to 0 had a blank space, because they just signed up and hadn't paid a bill yet. So we run that, we see the rows with Tenure_Months equal to 0, and when we scroll over to the right, Total_Charges equals 0, so that worked. Bam! We have verified that our DataFrame df contains 0s instead of blank spaces for missing values.

NOTE: Total_Charges still has the "object" data type, and that's no good, because XGBoost only allows int, float, and boolean data types. We fix this by converting the column with to_numeric(). There are multiple ways to convert columns from one type to another; this is just one of them. to_numeric() is a pandas function; I specify that I want to convert the Total_Charges column, save the result back into the original Total_Charges slot, and, when I'm done, print out the data types to verify that we did it correctly. Bam! We go down the list and see that Total_Charges is now float64. Hooray!

Now that we've dealt with the missing data, we're going to replace all of the other white spaces in all of the columns with underscores, all at once, since there could be other columns with white space. We do it just like before, with our DataFrame and the replace() function, except this time we don't specify a particular column; we do it DataFrame-wide, replacing blank spaces with underscores, and then print out the first five rows to verify we did it correctly. Bam! If we scroll over to the right, we see that one of the things we fixed was in the Payment_Method column: instead of blank spaces in "Mailed check" and "Electronic check", we now have underscores. And remember, just to clarify: the only reason we're replacing all these blank spaces is so that we can print out a nice, pretty-looking tree. XGBoost itself doesn't really care, partially because we're going to One-Hot Encode these columns later anyway.

Okay, now that we've dealt with all those issues, we can start formatting the data for an XGBoost model, and the first step is to break it into two parts: we want to separate the columns that we will use to make classifications from the column that we want to predict. We're going to use the conventional notation of uppercase X to represent the columns of data we use to make classifications, and lowercase y to represent the thing we want to predict. In this case, we want to predict Churn_Value, which is 1 for people who left the company and 0 for people who did not leave. So we create uppercase X: with our DataFrame, just like before, we call the drop() function and specify that we want to drop Churn_Value. However, this time we're not using inplace=True; instead, we save the result in a new variable, and we print out the first five rows of that new variable. If we scroll over, we see that Churn_Value is missing, but we've got everything else, which is exactly what we wanted.
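A sketch of those three steps (the .copy() is a common precaution, so that modifying X later doesn't raise warnings about views of df):

```python
# convert Total_Charges from 'object' to numeric, since XGBoost only
# accepts int, float, and boolean columns
df['Total_Charges'] = pd.to_numeric(df['Total_Charges'])
df.dtypes  # Total_Charges should now be float64

# replace any remaining white space, DataFrame-wide, with underscores
df.replace(' ', '_', regex=True, inplace=True)
df.head()

# uppercase X: everything we will use to make predictions
X = df.drop('Churn_Value', axis=1).copy()
X.head()
```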
Now we make the lowercase y variable, which is just the column in the DataFrame called Churn_Value, and then we print out the first five rows to verify that it looks the way it should. There it is. Bam! Now that we've created uppercase X, which has the data we want to use to make predictions, and lowercase y, which has the data we want to predict, we're ready to continue formatting X so that it is suitable for making a model with XGBoost. And that brings us to One-Hot Encoding.

Okay, now that we've split the DataFrame into two pieces, we need to take a closer look at the variables within X. The list below, which I got from the IBM website for this dataset, tells us whether each column should be a float or categorical. We see that City is a category, Longitude is a float, Gender is a category, Senior Citizen is a category; we've got a bunch of categories, and Tenure Months is a float. So we've got lots of columns. Just to review, we look at the data types in X to remember how Python is seeing the data right now. Latitude, Longitude, Monthly_Charges, and Total_Charges are all float64, which is great; that's exactly what we want. However, all of the other columns are "object", and those need two things: first, we need to inspect each one to make sure it only contains reasonable values, and second, we need to modify them with One-Hot Encoding.

One-Hot Encoding is a trick for taking categorical data and splitting it up into a format that XGBoost, and a lot of other algorithms, can use. The problem is that XGBoost and a lot of other machine learning algorithms natively support continuous data, like Monthly Charges and Total Charges, but they do not natively support categorical data, like Phone Service, which has two different categories. So if we want to use categorical data with our model, we have to convert it with One-Hot Encoding to get around this limitation.

At this point you may be wondering: what's wrong with treating categorical data like continuous data? Can't we just convert the categories to numbers and be done with it? To answer this question, let's look at an example, and I've chosen Payment Method. It has a bunch of options: Mailed check, Electronic check, Bank transfer, and Credit card. If we converted those categories to the numbers 1, 2, 3, and 4 and treated them like continuous data, then we would be assuming that 4, which means Credit card, is more similar to 3, which means Bank transfer, than it is to 1 or 2, the other forms of payment. That means the XGBoost trees would be more likely to cluster the people with 4s and 3s together than the people with 4s and 1s. In contrast, if we treat these payment methods like categorical data, then each one is a separate category that is no more or less similar to any other category. Thus, the likelihood of clustering people who pay by Mailed check with people who pay by Electronic check is the same as the likelihood of clustering them with people who pay by Credit card. This approach seems more reasonable to me.

NOTE: There are many different ways to do One-Hot Encoding in Python; two of them are especially popular, and I describe the pros and cons of those two approaches in this paragraph of the Jupyter Notebook.
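A sketch of the y variable and, before committing to anything, a dry run of One-Hot Encoding on a single column, matching the demonstration described below:

```python
# lowercase y: the thing we want to predict
y = df['Churn_Value'].copy()
y.head()

# One-Hot Encode just the Payment_Method column to see what get_dummies()
# does; we don't save the result, we just print the first five rows
pd.get_dummies(X, columns=['Payment_Method']).head()
```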
However, for the purpose of this webinar, we're going to use get_dummies(), because I feel it's the best way to teach what One-Hot Encoding does. So we use the pandas function get_dummies() (pd is short for pandas), and we pass in the DataFrame that we're interested in processing and the columns we want to One-Hot Encode. Now, I'm not saving the results; I just want to see what happens, so I print out the first five rows to show you how the Payment_Method column gets One-Hot Encoded.

We run this code, and we see that all the columns we did not modify are on the left side of the DataFrame, but when we scroll to the right, we see a Payment_Method_Bank_transfer column, another column for Payment_Method_Credit_card, another for Payment_Method_Electronic_check, and another for Payment_Method_Mailed_check. In the Bank transfer column, we've got a 1 if the person used bank transfer and a 0 for any other option; for Credit card, a 1 in that column and 0 for any other option; for Electronic check, a 1 in that column and 0 for any other option; and lastly, for Mailed check, a 1 in that column and 0 for any others.

NOTE: If you're not familiar with linear regression or logistic regression, don't worry about what I'm about to say. But if you are: One-Hot Encoding gives us a result that is different from how we would encode the same data for linear or logistic regression. So just keep in mind that this One-Hot Encoding is not for linear and logistic regression, but it works great for XGBoost.

Now that we know what get_dummies() does, and we know that it works, we're going to use it on all of the categorical columns and save the result. NOTE: In a real situation, and not a tutorial like this, we would go through each individual column and make sure it only contains reasonable data. Since this is just a tutorial, and I've already done all that work, we'll skip to the next step: we run the pandas function get_dummies(), pass in our DataFrame and all of the categorical columns, and save the result in a new DataFrame called X_encoded. When we're done, we print out the first five rows. Bam! Now we see that we've encoded a bunch of columns. There's still a "..." because there are too many columns to print, and we've still got five rows because head() only prints five rows, but we now have 1,178 columns. Dang! A lot of those are because we've got a different column for each city name; you can see all these city names that just bleed into the "...". We've got a lot of different city names, and a column for each one.

One last thing before we build an XGBoost model: we verify that y only contains 1s and 0s with the unique() function. We run y.unique(), it only has 1 and 0, and that's a Double Bam! We've finally finished formatting the data for making an XGBoost model.
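A sketch of the full encoding step. The column list here is an assumption based on the IBM Telco dataset; in practice, pass every categorical column you verified above:

```python
X_encoded = pd.get_dummies(X, columns=['City',
                                       'Gender',
                                       'Senior_Citizen',
                                       'Partner',
                                       'Dependents',
                                       'Phone_Service',
                                       'Multiple_Lines',
                                       'Internet_Service',
                                       'Online_Security',
                                       'Online_Backup',
                                       'Device_Protection',
                                       'Tech_Support',
                                       'Streaming_TV',
                                       'Streaming_Movies',
                                       'Contract',
                                       'Paperless_Billing',
                                       'Payment_Method'])
X_encoded.head()

# verify that y only contains 1s (left the company) and 0s (stayed)
y.unique()  # array([0, 1])
```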
However, before we build it, I want to do a mini StatQuest to show how One-Hot Encoding works, especially when we're coding missing values with 0.

Imagine Favorite Color was a column (or feature, or variable) in our dataset: two people loved the color blue, two people loved the color green, and two people had missing data. Just like we did in the Jupyter Notebook, we replace the missing data with 0s. Now we convert Favorite Color with One-Hot Encoding, just like we just did. There are 1s in the Blue column for the two people who liked blue, and 0s in the Green column, because those people did not like green. Likewise, there are 1s in the Green column for the two people who liked green, and 0s in the Blue column, because those people did not like blue. Lastly, both the Blue and Green columns get 0s for the people with missing data.

Let's move this table to the left side. Now, the question is: should the people with missing data be clustered with the people who like blue, or should they be clustered with the people who like green? XGBoost answers this question by comparing two different ways to split the data. On the left side, we split the people who like blue from everyone else, which means we cluster the people who like green with the people with missing data. On the right side, we split the people who like green from everyone else, which means we cluster the people who like blue with the people who have missing data. XGBoost then chooses the split that gives the best value for Gain.

"Okay, I get how XGBoost deals with missing data, but doesn't keeping track of all these 0s take up a lot of memory?" Because XGBoost uses sparse matrices, it only keeps track of the 1s; it doesn't allocate memory for the 0s. That means this branch is really just checking to see if memory is allocated for Blue: if memory is allocated for Blue, then we go to the left, and if memory is not allocated for Blue, we go to the right. Likewise, the other branch only asks if memory is allocated for Green: if it is, we go to the left, and if not, we go to the right. This is how XGBoost deals with missing data and is memory efficient at the same time. Bam!

All right, now let's return to the Jupyter Notebook and build our preliminary XGBoost model. Okay, I know I just said we were going to build the XGBoost model, but the first thing we need to do is split the data into training and testing datasets. However, let's first observe that this data is imbalanced, by dividing the number of people who left the company (where y equals 1) by the total number of people in the dataset. Since the lowercase y column only contains 0s and 1s, and it contains 1s exactly for the people who left, we can add up all the values in the column to get the number of people who left, and if we divide that by the length of the column, we get the proportion of people who left. We see that only 27% of the people in the dataset left the company. Because of this, when we split the data into training and testing datasets, we will split using stratification, in order to maintain the same percentage of people who left the company in both the training and testing datasets.

So we use a function called train_test_split(), and we pass in X_encoded and lowercase y. We set random_state to 42 so that, hope against hope, you'll get the same results that I got, and we set stratify=y.
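A sketch of the imbalance check and the stratified split:

```python
# only ~27% of customers left the company, so the data are imbalanced
sum(y) / len(y)  # ~0.27

# split into training and testing sets; stratify=y keeps the ~27% churn
# rate the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded,
                                                    y,
                                                    random_state=42,
                                                    stratify=y)
```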
This returns four datasets, which we store in X_train, X_test, y_train, and y_test. So let's run this. Bam! Now let's verify that stratification worked as expected. We do the same math we did before, only this time on the training dataset, and we see that 27% of the training dataset is people who left. Then we look at the testing dataset, and 27% of the people in the testing dataset left. Bam! Stratification worked as expected.

So now we've got our training dataset and our testing dataset, and we're ready to build our XGBoost model. The way we do that is with xgb, the XGBoost module, and a function called XGBClassifier(). We specify that the objective is 'binary:logistic', which is for classification, because XGBoost uses a logistic-regression-style approach to evaluating how good it is at classifying the observations. We set missing=None. The default value for missing is None, so I don't actually need to set this, and in the webinars it confused a few people. The purpose of this argument is to tell XGBoost what character we're using to represent missing values. With missing=None, the default behavior is to use 0s in that sparse matrix, just like we saw in the mini StatQuest: it uses 0s to represent missing data so it doesn't have to allocate memory for those fields. That's the default behavior, and that's what we're specifying. If we had used question marks to represent missing data, we could specify that here. And again, we set the seed to 42 in order to hopefully give you the same results as me.

We store this in a new variable; this basically creates a shell within which we're going to create our forest of extreme gradient boosted trees. We then create those trees by running fit(). We pass in the training data, and we set verbose=True, which says "tell me what you're doing while you're doing it". One thing we're doing that's kind of special is early stopping: we build trees, and at some point the predictions will stop improving. When that happens, XGBoost builds 10 more trees, and if none of those 10 trees improves the predictions, it stops. We're using the AUC to evaluate how well the predictions are being made, and we pass in the testing dataset, because while the trees are trained on the training dataset, XGBoost evaluates how many trees to build using the testing dataset. This is something we would normally do by hand, using cross validation and a variety of other approaches, but XGBoost will do it for us.

So you'll see there's a star by the code, meaning it's running, and down here we're printing out the results of each tree, and it's going, going, going, and it finally stops after building 55 trees. However, that means the previous 10 trees did not improve the classification, and it says the best iteration was actually after building 45 trees. So we've only created 45 trees for our model.
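A sketch of that preliminary model, using xgboost's older (pre-2.0) scikit-learn API, where early_stopping_rounds and eval_metric are passed to fit(); in recent xgboost releases they are passed to the XGBClassifier() constructor instead:

```python
# build the shell for the boosted trees
clf_xgb = xgb.XGBClassifier(objective='binary:logistic',
                            missing=None,  # the default: unallocated sparse entries mean "missing"
                            seed=42)

# fit, with early stopping: if 10 consecutive new trees fail to improve
# the AUC on the testing data, stop adding trees
clf_xgb.fit(X_train,
            y_train,
            verbose=True,
            early_stopping_rounds=10,
            eval_metric='auc',
            eval_set=[(X_test, y_test)])
```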
Okay, so we've built an XGBoost model for classification. Now let's see how well it performs on the testing dataset, by running the testing dataset down the model and drawing a confusion matrix. We do this with the function plot_confusion_matrix(): we pass in the model (the extreme gradient boosted trees) and the testing datasets, and the last two arguments just make the confusion matrix look pretty.

In the confusion matrix, the top row represents the people who did not leave the company. There are 1,294 people in this row, and of those, 1,178 (91%) were correctly classified, so that's awesome. However, the second row is the people who left the company. There are 467 people in this row, and we see in the bottom right-hand corner that only 239, or 51%, were correctly classified. So XGBoost was not awesome. Part of the problem is that the data is imbalanced; we saw that earlier, and we see it in the confusion matrix right now. Because people leaving the company cost the company a lot of money, I'm going to try to optimize the XGBoost model so it does a better job predicting the people who are leaving, whether for a competitor or for some other reason. The good news is that XGBoost has a parameter called scale_pos_weight that helps deal with imbalanced data. Basically, it adds a penalty for incorrectly classifying the minority class, which in this case is the people who left the company, and we want to increase that penalty so the trees try harder to classify them correctly. So we're going to try to improve our predictions by using cross validation to optimize these parameters.

Okay, XGBoost has a lot of hyperparameters, which are parameters we have to configure manually because they are not determined by XGBoost itself. These include max_depth, which is how deep a tree can go: if it's just a stump, we only go down one level, and if we allow more branching, we can go down two, three, four, five, or six more levels. We can optimize the learning rate, eta, and if you've watched the StatQuest videos on XGBoost, you know all about that. We can also optimize gamma, the parameter that encourages pruning, and the regularization parameter, lambda, which is like Ridge Regression. So let's try to find the optimal values for these hyperparameters, in hopes that we can improve the accuracy with the testing dataset.

Since we have a lot of parameters, we're going to use GridSearchCV(), a grid search with cross validation. However, because this is just a tutorial, I've commented all of this out; it takes about 10 minutes to run. There are a few notes here I want you to be aware of: the XGBoost manual says that if you have imbalanced data, you should use the AUC to evaluate the performance of the fit, and that you should also try to optimize the scale_pos_weight parameter.
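A sketch of the confusion matrix call (again, this is the scikit-learn < 1.2 API; newer versions use ConfusionMatrixDisplay.from_estimator() with the same arguments):

```python
# run the testing data down the model and draw a confusion matrix;
# the last two arguments just make it look pretty
plot_confusion_matrix(clf_xgb,
                      X_test,
                      y_test,
                      values_format='d',
                      display_labels=['Did not leave', 'Left'])
```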
I did the optimization in two separate rounds, because optimizing everything all at once just took too long. For max_depth, I gave it three different values: 3, 4, and 5, so a tree could have three, four, or five levels. I also gave it three different values for the learning rate, three different values for gamma, three different values for the regularization parameter, and three different values for scale_pos_weight, and then I optimized.

Oh, and here's something I need to point out: in order to speed up the cross validation, each tree uses a random subset of the actual data. We're not using all the data, just 90%, randomly selected per tree. We're also selecting, per tree, only 50% of the columns in the dataset, so every tree we make uses a different 50% of the columns. That helps with overfitting, and it also speeds things up considerably. Other than that, we use the AUC for scoring, and we're not doing a lot of cross validation: not 10-fold, just 3-fold. But you get the idea of how we're doing it.

When I ran the first round, it gave me a max_depth of 4, which is the middle value; it could have gone as low as 3 or as high as 5. Because I got the middle value, I kept it fixed in the second round. For the learning rate, however, I got a value on the edge of the range: it was on the high end of the values I tried, but it could go higher, so, because it's possible that it could go higher, I continued to explore in that direction the next time through cross validation. Likewise for gamma: the first time, I got the middle value, so I just set gamma to 0.25. But for the regularization parameter, I got 10, so I continued to explore larger values of lambda. It did settle on scale_pos_weight equal to 3, so we're just going to stick with that. Then I ran the second round, and ultimately I got gamma = 0.25, learning rate = 0.1, max_depth = 4, and the regularization parameter = 10.
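For reference, a sketch of what round one of that grid search could look like. The max_depth values, the 90%/50% row and column sampling, the AUC scoring, and the 3-fold cross validation come from the description above; the specific candidate values for the other parameters are placeholders, chosen only to be consistent with the winners reported above:

```python
# round 1: candidate values for each hyperparameter
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'gamma': [0, 0.25, 1.0],
    'reg_lambda': [0, 1.0, 10.0],
    'scale_pos_weight': [1, 3, 5],
}

optimal_params = GridSearchCV(
    estimator=xgb.XGBClassifier(objective='binary:logistic',
                                seed=42,
                                subsample=0.9,          # each tree sees a random 90% of the rows
                                colsample_bytree=0.5),  # ...and a random 50% of the columns
    param_grid=param_grid,
    scoring='roc_auc',  # AUC, as recommended for imbalanced data
    cv=3)               # 3-fold cross validation to keep the runtime down

optimal_params.fit(X_train, y_train)
print(optimal_params.best_params_)
```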
Okay, now that we've optimized the parameters, we can build the final XGBoost model. This call to XGBClassifier() is a little more complicated, because we're specifying a lot more of the parameters: we're not just using default settings, we're using the ones we optimized. We're using early stopping just like we did before, which means we don't have to optimize the number of trees in our collection. So we run this, and it's going, going, going... [Music] and there it is. It stopped after making 65 trees, but remember, that means there were 10 trees that did not improve before that, so it says the best iteration was the 55th: it kept 55 trees, and that's where it ends.

Now let's plot the confusion matrix, just like before. We see that we're doing a much better job classifying the people who left the company: now we've captured 390, which is 84%, where before we were only getting 51%. However, this improvement comes at the expense of correctly classifying the people who did not leave. The company may feel differently about that trade-off, and that's fine, but from my perspective, since this is my tutorial, the way I see it is this: when people leave the company, they take their money with them, and that's money the company does not get. So it would be nice to catch those people before they leave and send them a coupon for a milkshake or an ice cream cone, and maybe, if we do that, they'll stick around and continue to pay us every month for their internet and telecommunications needs. Now, like I said, we're not doing such a hot job predicting the people who aren't going to leave, but that's an error I'm willing to make. What it means is that maybe I'll send a free coupon to someone who wasn't going to leave the company; they'll get a free ice cream cone or a free milkshake, and that's just going to make them feel better about the company. They'll be like, "Hey, I really like this Telco company, they give me milkshakes every now and then." So we're still making errors, of course, but we're making better errors than we were making before.

Okay, the last thing we do is draw the tree. This code, where we create the XGBoost container and then train everything, is the same as before, except for one key argument, which tells XGBoost that we only want to build one tree. We don't want to build 65 trees or 55 trees; we just want one, because all we want to do is draw that first tree. Now, why would we want to draw this first tree? One reason is that if we don't have an idea of reasonable values to try in cross validation for those parameters, for the regularization parameter or for gamma, printing out the first tree can give us a sense of what a good ballpark is: we'll see what the Gain is, what the Weight is, what the Cover is, all these things, and those give us a starting point for optimizing the parameters. We've already optimized the tree, but it's still kind of cool to look at. Down here, I've got stuff that makes the tree pretty, and I've also got stuff for printing out, numerically, how that tree performed.
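A sketch that puts this final section together, using the optimized values reported above. The tree-drawing part follows the steps described in the webinar (one tree via n_estimators=1, node statistics, then graphviz), with the purely cosmetic styling arguments omitted:

```python
# the final model, with the optimized hyperparameters
clf_xgb = xgb.XGBClassifier(objective='binary:logistic',
                            gamma=0.25,
                            learning_rate=0.1,
                            max_depth=4,
                            reg_lambda=10,
                            scale_pos_weight=3,
                            subsample=0.9,
                            colsample_bytree=0.5,
                            seed=42)
clf_xgb.fit(X_train,
            y_train,
            verbose=True,
            early_stopping_rounds=10,
            eval_metric='auc',
            eval_set=[(X_test, y_test)])

# then plot_confusion_matrix(clf_xgb, X_test, y_test, ...) exactly as before

# to draw a tree, rebuild the model with n_estimators=1 so only one tree
# is built
clf_xgb = xgb.XGBClassifier(objective='binary:logistic',
                            gamma=0.25,
                            learning_rate=0.1,
                            max_depth=4,
                            reg_lambda=10,
                            scale_pos_weight=3,
                            n_estimators=1,
                            seed=42)
clf_xgb.fit(X_train, y_train)

# print how the tree performed, numerically: weight, gain, cover, etc.
bst = clf_xgb.get_booster()
for importance_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print('%s:' % importance_type, bst.get_score(importance_type=importance_type))

# draw the first tree (num_trees=0); this is the step that needs graphviz
xgb.to_graphviz(clf_xgb, num_trees=0)
```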
So here's what we're going to do: we run the code, and here are the metrics for how the tree did; these are some of the values we can use to help figure out what values to try when we're optimizing the tree. And here's the tree itself. I know it's really hard to see the individual values inside these boxes, so I'll just tell you what they are. In each of the green nodes, we have a column name and a threshold for splitting the observations. In the root node, we have "Contract_Month-to-month < 1", which means all of the people for whom that statement is true (the value in that column is less than 1) go to the left, and all of the people for whom that statement is false go to the right. And actually, when we draw this tree out, it's hard to see, but there's a little "no" on the branch that goes to the right, and a little "yes, missing" on the branch that goes to the left. That means the people who have values less than 1 for Contract_Month-to-month, plus the people with missing data, as we saw before, go to the left. So that's how to interpret the tree. The leaves don't give us classifications; remember, this is XGBoost, so each leaf just gives us a small, incremental piece of probability that we add together across all of the trees, and that gives us the final probability that an observation belongs to one classification or the other: they're either going to leave the company or not.

In conclusion, we have: loaded data from a file; identified and dealt with missing data; formatted the data for XGBoost using One-Hot Encoding; built an XGBoost model for classification; optimized the XGBoost parameters with cross validation and grid search; and built, drawn, interpreted, and evaluated the optimized XGBoost model. And that gets us to the Triple Bam! Hooray, we've made it to the end of the Jupyter Notebook. I want to thank all of you for supporting StatQuest; it means a lot to me that you'd be willing to show up for a webinar. I hope you're all safe, and, until next time, quest on!
Info
Channel: StatQuest with Josh Starmer
Views: 73,869
Rating: 4.9831867 out of 5
Keywords: Josh Starmer, StatQuest, Machine Learning, Statistics, Data Science, XGBoost, Python
Id: GrJP9FLV3FE
Length: 56min 43sec (3403 seconds)
Published: Sat Aug 01 2020