Tune xgboost with early stopping to predict shelter animal status

Video Statistics and Information

Captions
Hi, my name is Julia Silge, and I'm a data scientist and software engineer at RStudio. Today in this screencast we're going to walk through this week's dataset from SLICED, the competitive data science show live-streamed on Twitch, and we're going to use it as an opportunity to talk about early stopping in xgboost and how to do that with tidymodels. I've shown how to fit and tune xgboost a couple of times, but I don't think I have talked about how to use early stopping. Early stopping is a good way to avoid overfitting and to be more effective with your tuning, so we're going to talk through how to tune the best value for early stopping and how it can help you if you are using xgboost. The dataset is about shelter animals: animals in animal shelters, what happens to them, and whether we can predict if they'll be adopted, transferred, or have no outcome (which is sad). So it's a multi-class classification problem, which makes it interesting in that sense as well. Let's talk some about xgboost, some about early stopping, and some about multi-class evaluation metrics.

All right, let's learn about xgboost, early stopping, and shelter animals. My own two cats are shelter animals, so naturally I think the outcome for all of these animals should be adoption, but sadly it is not; most of them are adopted, but we also have "no outcome" and "transfer". "No outcome" we generally understand to mean something not good happening to the animal, and "transfer" means going somewhere else. We've got information about the animal itself, including some pretty interesting features like breed, color, name, and animal type. During this week's episode of SLICED I did quite a bit of analysis of the color, the breed, and the name, but in this screencast we are really just going to focus on training a model with some basic features.
We won't get into too much of that this time. To get started, let's do a couple of exploratory plots to give some context for the features we are going to use, which are these more basic, getting-started kinds of features. There's a feature called age upon outcome, but it would be more usable if it were a numeric value, like age in weeks. We have the date of birth and the date-time of the outcome, so we can compute it. Let's make a new age upon outcome using lubridate: take the outcome date-time minus the date of birth. One of these is a date-time and the other is a date, so we convert with as_date() first. The resulting column is a period, which is a concept from lubridate, and then one more transformation with time_length(), with units of weeks, gives us a numeric value. So the cat we used to say was two years old now has its age expressed in weeks, the one-year-old is 52 weeks old, and the two-month-old is nine weeks old; this age in weeks is more useful for modeling. Now let's do a visualization: put age upon outcome on the x-axis, fill by outcome type, and make a histogram with a modest number of bins and alpha = 0.5. This gives a stacked histogram.
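The lubridate steps just described might look like the sketch below; the column names `datetime`, `date_of_birth`, and `outcome_type`, and the data frame name `train_raw`, are assumptions about this dataset, not confirmed from the source.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Assumed columns: datetime (date-time of outcome), date_of_birth (date)
train_raw <- train_raw %>%
  mutate(
    # datetime is a date-time but date_of_birth is a date, so convert first
    age_upon_outcome = as.period(as_date(datetime) - date_of_birth),
    # turn the period into a numeric number of weeks
    age_upon_outcome = time_length(age_upon_outcome, unit = "weeks")
  )

# Histogram of age in weeks, filled by outcome type
train_raw %>%
  ggplot(aes(age_upon_outcome, fill = outcome_type)) +
  geom_histogram(alpha = 0.5, bins = 15) +
  labs(x = "Age in weeks", fill = NULL)
```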
The distributions are stacked on top of each other; if instead we want them overlaid, as if they were in front of each other, we can use position = "identity". These are counts, so we see that "no outcome" is less frequent than "transfer", which is less frequent than "adoption". That is useful, but another way to visualize this is as a density, so we can see where the differences between the groups are. If we want to change the labels, we can say "Age in weeks" on the x-axis, and we don't need a title on the legend. Pretty good.

Okay, let's make one more exploratory plot before we get started with our modeling. Take the raw training data and the outcome type; it has three categories, but for now let's just look at adoption, so instead of "adoption", "no outcome", "transfer" it's TRUE/FALSE in terms of who was adopted and who wasn't. Then group by week, calling week() on the date-time (week of the year), and weekday, calling wday() (day of the week), and find the mean adoption rate: for every week of the year and every day of the week, what is the mean adoption rate? We can turn that into a heat map with week and weekday on the axes, fill mapped to the adoption rate, using geom_tile(), and scale the fill with my favorite, viridis; since the fill is continuous, we use the continuous version of the scale. Then let's change those labels: the fill is the percent adoption rate, and the axes are week of the year and day of the week. I like this a lot, because we can really see some seasonal effects.
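The heat map construction described above could be sketched as follows; again the column names `datetime` and `outcome_type` and the label "adoption" are assumptions about this dataset.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

train_raw %>%
  # TRUE/FALSE: was this animal adopted?
  mutate(outcome_type = outcome_type == "adoption") %>%
  # week of the year and (labeled) day of the week
  group_by(week = week(datetime), wday = wday(datetime, label = TRUE)) %>%
  summarise(outcome_type = mean(outcome_type), .groups = "drop") %>%
  ggplot(aes(week, wday, fill = outcome_type)) +
  geom_tile() +
  scale_fill_viridis_c(labels = scales::percent) +
  labs(fill = "% adoption", x = "Week of the year", y = NULL)
```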
On the vertical axis, as we move up we go from Sunday to Saturday, so we can see that on the weekends adoptions are high. As we go from left to right we move through the year, and it looks like the rate is darker, i.e. higher, in the middle, maybe around the holidays; perhaps more people adopt around the holidays. This kind of heat map lets us see these seasonal patterns, which I think is pretty nice.

There's lots more we could do with this data, but let's go ahead and get started on building a model so we can talk about early stopping. To get started, let's load tidymodels and spend our data budget. I'll pass the data into initial_split(), using stratified resampling on the outcome, and call the result shelter_split; then shelter_train calls training() on the split and shelter_test calls testing() on it. Then I'll make some resampling folds: cross-validation folds on the training data, also stratified on outcome type, called shelter_folds. While I'm here, let me set up a metric set. The default metrics in tidymodels for a multi-class classification problem would be accuracy and ROC AUC, but let's add mean log loss, because that is what the challenge during SLICED was being evaluated on; you can always add in something else that you care about. This is spending our data budget: we have a certain amount of data that has labels on it, and we need to decide how we are going to spend it.

Next let's talk about feature engineering. We are predicting outcome type, and we are going to do it with the other features we have in the data.
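Spending the data budget as described might look like this minimal sketch; the seed values are arbitrary, and `train_raw` is an assumed name for the labeled training data.

```r
library(tidymodels)

set.seed(123)
shelter_split <- initial_split(train_raw, strata = outcome_type)
shelter_train <- training(shelter_split)
shelter_test  <- testing(shelter_split)

set.seed(234)
shelter_folds <- vfold_cv(shelter_train, strata = outcome_type)

# accuracy and roc_auc are the defaults for multi-class; add mean log loss
shelter_metrics <- metric_set(accuracy, roc_auc, mn_log_loss)
```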
For predictors, let's use that age-upon-outcome feature we created, the animal type, the date-time (when this outcome event, the adoption or the transfer or whatever, happened), and the spay/neuter status, and call that good enough. I'm sure we could do better by incorporating breed information and maybe color information and so on, but we are going to stay with these. We put in the training data and then start the feature engineering. For example, step_date(): we take that date-time and create features like day of the week and week of the year, and let's also put in the year, in case there's some kind of long-term change, so we're looking for features at different levels of time. With keep_original_cols = FALSE we remove the original date-time column. Next, let's create dummy or indicator variables for all the nominal predictors, things like spay/neuter and animal type. If we do one_hot = TRUE, it will keep all the levels; for example, for spay/neuter I think the levels are intact, neutered, and unknown, and this keeps all three instead of removing the base level. In a linear model you want to remove the base level or the model will fail, but for a tree-based model like a boosted tree it can sometimes help to have all of them, because of the way it makes splits, so we will keep them all. Then a filter to remove anything that has zero variance, i.e. any column that is entirely the same; I think that's probably unlikely here, but I'll put it there to be safe. Finally, I'm going to prep() the recipe, just to make sure that it doesn't fail and that nothing is working differently from the way I think it will.
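That recipe could be sketched as below; the predictor names (`age_upon_outcome`, `animal_type`, `datetime`, `spay_neuter`) are assumptions about how the columns are named in this dataset.

```r
shelter_rec <- recipe(
  outcome_type ~ age_upon_outcome + animal_type + datetime + spay_neuter,
  data = shelter_train
) %>%
  # date features at several time scales; drop the raw date-time afterwards
  step_date(datetime, features = c("week", "dow", "year"),
            keep_original_cols = FALSE) %>%
  # one-hot indicators: keep all levels, which is fine for tree-based models
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  # remove any zero-variance columns, just to be safe
  step_zv(all_predictors())

# prep() only to confirm the recipe runs without errors
prep(shelter_rec)
```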
So we created those dummy variables, and it all looks good; now it is time to talk about early stopping. Let's make a boosted tree model. I'm not going to tune the number of trees; I'm going to set it to a medium value (it might be better if I set it higher). I am going to tune mtry, which is how many predictors we sample, i.e. show the learner, at each split. I'm also going to tune the learning rate, and the other thing I'm going to tune is stop_iter, the early stopping parameter: how many boosting iterations can go by without improvement before stopping. So I'm not saying "stop early after 10 iterations"; I'm actually going to tune and find the best early stopping parameter. Then I set the engine to xgboost, with the engine-specific parameter validation, which is a proportion of the data: of whatever data gets passed through to xgboost, hold out 20% as validation data and use it to decide whether the model is still getting better. If it's not, stop; don't keep training and boosting forever. This is a classification problem, and that completes my early-stopping model specification.

The next thing I want to do is make a grid of possible parameters to try. I'm going to use not a regular grid but one of those irregular, space-filling grids that can cover the space a bit more efficiently, and I want to control the ranges a little more than the defaults.
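The model specification described above might look like this; the value `trees = 500` is an assumption standing in for the "medium" number of trees mentioned, and `validation` is the xgboost-engine-specific argument in parsnip.

```r
stopping_spec <- boost_tree(
  trees = 500,        # fixed at a medium value (assumed), not tuned
  mtry = tune(),      # predictors sampled at each split
  learn_rate = tune(),
  stop_iter = tune()  # the early stopping parameter
) %>%
  # hold out 20% of the data xgboost sees to monitor early stopping
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("classification")
```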
For mtry the default range starts at one predictor; let's start at five and go up to 20. If we prep the recipe and bake with new_data = NULL, we can count the columns: there are 22, so I could go all the way up to 21, but let's go up to 20. For the learning rate I don't need to go so small, so let's use minus five to minus one (on the log scale). For stop_iter, instead of something like 3 to 20, I'm going to do 10 to 50, letting it go longer, and set the grid size to 10. So we're going to try 10 possible sets of parameters; these are hyperparameters of xgboost, and we're going to try 10 different combinations, chosen to cover this three-dimensional space pretty efficiently without making a regular grid.

Okay, now let's put it together: let's make a workflow called early_stop_wf, with the recipe as the first argument and the model specification as the second. Then I set up my parallel backend, set a seed, and tune the grid. That means that for each of my 10 folds I'm going to try each set of parameters, so I'm going to tune 100 xgboost models. We pass the workflow, shelter_folds, the stopping grid, and those shelter metrics I set up, call the result stopping_rs, and start it.

So what is this doing? For every resample it goes to the analysis set and fits the ten possible models, but when it goes to xgboost it doesn't actually train on all that data: xgboost trains on 80% of the data, and the other 20% it holds back.
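The grid and tuning steps might be sketched like this; the ranges match those described above, with `learn_rate` specified on the log10 scale as dials expects.

```r
stopping_grid <- grid_latin_hypercube(
  mtry(range = c(5L, 20L)),
  learn_rate(range = c(-5, -1)),   # log10 scale, i.e. 1e-5 to 1e-1
  stop_iter(range = c(10L, 50L)),
  size = 10
)

early_stop_wf <- workflow(shelter_rec, stopping_spec)

doParallel::registerDoParallel()
set.seed(345)
stopping_rs <- tune_grid(
  early_stop_wf,
  shelter_folds,
  grid = stopping_grid,
  metrics = shelter_metrics
)
```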
It uses that held-out 20% to check how the model is doing: is it getting better or not? When it stops getting better, it stops training; that's the early stopping, based on what we told it. Then, back in tidymodels, the assessment data is used to evaluate how each of these candidate models did. The stop_iter values in the grid start at around 12 and go up to about 50, so for each candidate it trains on a subset, monitors the validation data, and decides whether to stop early or keep going, over and over. In xgboost, this corresponds to the number of boosting iterations, and early stopping decides whether it needs to go through all of them or not. This is going to take a while; even with early stopping, which means it can stop early, this is still a lot of models, so let's let it go, and I'll pause the video and come back when it is done.

All right, the model has finished tuning; let's take a look at it. Things look good, so let's start by visualizing the results. I'm going to change to theme_minimal(), since the plot doesn't look good with the theme the rest is based on, and zoom in. We had three metrics that we were tuning with and three tuning parameters, so we've got a 3x3 grid of panels. We can see that we're definitely doing better with the really big step sizes in the learning rate, and we can see how everything else did here. And if we look at the best result, we get a log loss of around 0.502.
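Inspecting the tuning results as described could look like this short sketch:

```r
# one panel per metric/parameter combination
autoplot(stopping_rs) + theme_minimal()

# best candidates by mean log loss
show_best(stopping_rs, metric = "mn_log_loss")
```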
During SLICED that would have been pretty competitive; it would have been second place, which is what I actually got. To do better than this, we would probably want to bring in some information from the breed and the color, maybe even whether the name exists or not; and this is just a single model, so we could also ensemble models together to get better results.

Okay, once we have a model that we're happy with, we can take the workflow, which remember is tunable, and finalize it: finalize_workflow() with the optimal result from our tuning, and I'm going to choose the candidate that has the best log loss. Before, these were tunable parameters, and now they have specific values: this value for the learning rate, this value for mtry, this value for the early stopping. Then I pipe this to last_fit() with the data split, the training/testing split, call it stopping_fit, and run it. What's happening now is that we go to the whole training set and say: okay, xgboost, fit on my whole training set, but stop training if there's no improvement after this many iterations, where I found that value by tuning on the resamples.

It has now finished. From a last_fit() result we can do things like get out the metrics, and these are metrics on the testing set, which we can compare to what we got on all our resamples (I forgot to put the log loss in there). It looks pretty good, about the same, which is about what we would expect to get. We can also do other things with these results; for example, let's look at variable importance with the vip package.
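Finalizing and fitting to the train/test split, as described, might be sketched like this:

```r
stopping_fit <- early_stop_wf %>%
  # replace tune() placeholders with the best values found on the resamples
  finalize_workflow(select_best(stopping_rs, metric = "mn_log_loss")) %>%
  # fit once on the full training set, evaluate once on the test set
  last_fit(shelter_split)

collect_metrics(stopping_fit)   # metrics computed on the testing set
```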
I extract the workflow from the last_fit object. This object I could save, maybe as an RDS, load back in, and use for prediction later. From it I extract the parsnip fit inside and call vip(); let's say we want to see 15 features, using points because I like how that looks. What we're asking is: what are the most important features for this xgboost model? The age of the animal is the most important, then whether they're spayed or neutered, the animal type, that date-time, the week of the year (remember the seasonal effect, where there were more adoptions around the holidays), cat versus dog, and the day of the week, Saturday (we saw more adoptions on the weekends). These are the features xgboost says are most important.

So we've looked at metrics and at feature importance; next let's look at the predictions on the test set. We have the predicted probabilities of adoption, no outcome, and transfer, the predicted class, and then the real value. We can do something like an ROC curve: pass in the truth and then, since this is multi-class, all of the predicted probability columns, and we get ROC curves for all three classes; pass that to autoplot() and we see the three curves. They have different shapes, which I think is pretty interesting: at different thresholds the classes behave differently. There were a lot more adoptions than anything else, so that's probably why this happens; we could maybe have tried to balance the classes if we wanted. If you want to see this all in one plot, we have the underlying data there, so we could build that ourselves.
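The variable importance and ROC curve steps could look like the sketch below; the `.pred_*` probability column names are assumptions based on tidymodels' naming of the outcome levels.

```r
library(vip)

stopping_fit %>%
  extract_workflow() %>%        # this trained workflow could be saved for later prediction
  extract_fit_parsnip() %>%
  vip(num_features = 15, geom = "point")

collect_predictions(stopping_fit) %>%
  # multi-class ROC: truth plus all predicted-probability columns
  roc_curve(outcome_type, .pred_adoption:.pred_transfer) %>%
  autoplot()
```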
I'll save that as an exercise for the reader who has the ggplot2 skills. Another way we might see why it's harder to predict one class than another is with a confusion matrix: we pass in the truth and then the predicted class. We see that, of course, there are a lot more adoptions than other classes, and that we are much more successful at predicting that majority class, which is so common; visualizing the confusion matrix really brings it home that we learned very well how to predict the majority class. We might want to try some downsampling or upsampling to see whether that would help us do a better job on the other classes; actually, I think that might be a pretty interesting thing to try next.

All right, we did it: we used xgboost for predicting the outcome for animals in animal shelters (obviously, the right outcome is for them all to be adopted). In particular, we used early stopping as a way to avoid overfitting with this boosted tree algorithm and to avoid just boosting on forever when the model isn't getting any better. Early stopping is really useful in a lot of situations, and I'm actually going to be on SLICED again this coming week, in the final four, so I'd probably plan to use early stopping again, because it is useful in those kinds of circumstances. I hope this was helpful, and I will see you next time.
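The confusion matrix step mentioned above could be computed like this, under the same assumed column names:

```r
collect_predictions(stopping_fit) %>%
  conf_mat(outcome_type, .pred_class)
```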
Info
Channel: Julia Silge
Views: 2,017
Rating: 5 out of 5
Id: aXAafzOFyjk
Length: 30min 38sec (1838 seconds)
Published: Sat Aug 07 2021