TidyTuesday: Ensembling Tidymodels with Stacks

Captions
Hey y'all, it's Andrew Couch here, and in this TidyTuesday video we're not going to look at this week's dataset. Instead, we're going to build on our tidymodels videos and do some more machine learning. I was going through the tidymodels GitHub and saw that there's a stacks package, a package for stacking your tidymodels, or doing some ensembling. You'll usually see stacking, or model ensembling, in competitions like Kaggle, where you essentially combine the predictions of separate machine learning models and feed them into another model. There's a nice little diagram of this: you use the sub-models' outputs to predict the outcome, and those models can basically be represented as coefficients in something like a lasso regression model. It's cool that there's a package to make this a little tidier and neater, so I thought it'd be fun to do a short video creating an ensemble model using tidymodels and the stacks package. I will also note that this is the unstable version, so you can't use install.packages(); you have to do a remotes install. Additionally, some of the functions may not be working or may change, so this is just a basic intro to what the stacks package is about and what we're trying to accomplish.

I'm going to load up an R Markdown sheet and call it "Tidymodels Ensemble." Since we're using this new package, I thought it'd be pretty smart to use the Ames housing dataset, so I'll load tidymodels, the tidyverse, AmesHousing, and the stacks package; paste these in and install whatever you haven't installed already. Then we'll call make_ames() and take a basic glance at it with summary(). Since this dataset is basically already cleaned for us, we don't have to worry about it much.

First, the train and test sets. We set the seed (it's the 13th), then create a training/test split from our data frame with prop = 0.8 for an 80/20 split. From the split we make tidy_train with training() and tidy_test with testing(), and then we create our k-folds data with vfold_cv(), using the training set as our cross-validation data. That's basic data partitioning, nothing crazy.
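In code, that setup looks roughly like this (a sketch, since the captions don't show the actual code being typed):

    # Packages used throughout the video; at the time, stacks was not on
    # CRAN and was installed with remotes::install_github("tidymodels/stacks")
    library(tidymodels)
    library(tidyverse)
    library(AmesHousing)
    library(stacks)

    df <- make_ames()
    summary(df)

    # Train/test split and cross-validation folds
    set.seed(13)
    tidy_split   <- initial_split(df, prop = 0.8)
    tidy_train   <- training(tidy_split)
    tidy_test    <- testing(tidy_split)
    k_folds_data <- vfold_cv(tidy_train)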
Now we'll go over the main approaches for ensemble modeling. One of the main things with ensembling is that you want your models to be different; it generally doesn't make sense to ensemble what are basically the same XGBoost models. You want some diversity in there. So I figured we'd start with a basic preprocessing function, and a common, pretty simple model is PCA regression: a linear regression model that uses PCA as a preprocessor.

We create our recipe predicting Sale_Price (it's the Ames housing dataset) and give it our tidy_train data. Then we do some basic preprocessing: if we look at our training data, we have 81 columns with a bunch of factors and a bunch of continuous variables, so we should remove some of these columns and do some dimensionality reduction. First, step_nzv() on all predictors, so any predictors with essentially no variance are removed. Then step_corr() on all numeric variables minus the outcome, which removes highly correlated variables. We'll also do step_lincomb(), which removes any linear combinations, hopefully dropping a few more variables. That's our dimensionality reduction for the continuous variables.

I also want to reduce some of the factor levels. If we look at something like MS_SubClass, it has a bunch of levels, and I figure we want to collapse those, especially since we're doing one-hot encoding. So, step_other() on all nominal variables: any minority level, one that only shows up, say, three out of 500 times, gets lumped into an "other" level. Then step_normalize() on all numeric variables minus the outcome, because we're about to do principal component analysis; then step_dummy() on all nominal (categorical) variables; and lastly step_pca() on all predictors, with five components.

If we prep() the PCA recipe, we can see all of our preprocessing steps: the dummy variables, the linear combinations (I guess it didn't find any), the sparse variables it removed; it didn't remove any correlated variables, which is fine. If we prep() and juice() it, we can see how it condenses everything down to five principal components to predict Sale_Price. We're going to throw this into a linear regression model, use it to predict Sale_Price, and eventually blend it with some more models.
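A sketch of that PCA recipe (the step functions are from the recipes package; num_comp = 5 matches the five components mentioned):

    # PCA regression preprocessing: drop near-zero-variance, highly
    # correlated, and linearly dependent predictors, pool rare factor
    # levels, then normalize, dummy-code, and reduce to 5 components
    pca_rec <- recipe(Sale_Price ~ ., data = tidy_train) %>%
      step_nzv(all_predictors()) %>%
      step_corr(all_numeric(), -all_outcomes()) %>%
      step_lincomb(all_numeric(), -all_outcomes()) %>%
      step_other(all_nominal()) %>%
      step_normalize(all_numeric(), -all_outcomes()) %>%
      step_dummy(all_nominal()) %>%
      step_pca(all_predictors(), num_comp = 5)

    # Inspect the preprocessed training data
    pca_rec %>% prep() %>% juice()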
Next, I'm going to create a spline model. The spline recipe again predicts Sale_Price with data = tidy_train, and we do the same step_nzv(), step_corr(), and step_lincomb(); I know some of those won't do anything and I could remove them, but I figure we'll leave them in to keep things consistent. Then we remove all of the nominal variables, so we're only dealing with continuous, numerical data; then step_bs() for basis splines on all predictors; and lastly step_YeoJohnson() on all predictors. So we have this spline recipe, kind of in the style of a spline or MARS model. If we prep() it (oh, there's a fly here), we can see it doing all of this, and if we juice() it we see a decent number of variables, which might be bad for a plain linear regression, but we're ensembling, so this won't be our last model.

Finally, we make a basic recipe: step_normalize() on all numeric variables minus the outcome, then step_dummy() on all nominal variables. So those are our three recipes. With the tidy recipe we can prep() and juice() as well; this baseline recipe will be used for the stronger black-box models, like the random forest and XGBoost, while the spline and PCA recipes will be used with the simpler linear regression models. We basically have three types of preprocessing that we're going to apply across a few models.

First, the PCA regression model: linear_reg() with set_mode("regression"), and set the engine to lm (which I think is the default). So a basic linear model, not glmnet, lasso, or ridge regression, just plain linear regression. We'll create another one of those for the spline regression and call it the spline model. Then we create a random forest model with rand_forest(), tuning min_n and trees, with mode regression and the randomForest engine. Finally, our fourth model is an XGBoost model: boost_tree(), tuning learn_rate, trees, and tree_depth, with the engine set to xgboost.

One more thing we have to do is set our model control. With the stacks package, we need to save the predictions, because it's going to take each model's predictions, compare them to the actual values, and model those predictions as coefficients. So model_control is control_grid() with save_pred = TRUE, and we also have to set save_workflow = TRUE, because stacks needs each workflow's preprocessing: when we use the final stacked model, it should be able to apply the unique preprocessing for the PCA, spline, random forest, and XGBoost models inside the model. I'm also going to define our model metrics, the classic regression metrics: RMSE, MAE, and R-squared. So we've defined our models, our tuning control, and our preprocessing.
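Here's roughly what those recipes, model specs, and control settings look like (a sketch; the "randomForest" engine name and the exact step order are my reading of the narration):

    # Spline preprocessing: keep only numeric predictors, expand them
    # with basis splines, then apply a Yeo-Johnson transformation
    spline_rec <- recipe(Sale_Price ~ ., data = tidy_train) %>%
      step_nzv(all_predictors()) %>%
      step_corr(all_numeric(), -all_outcomes()) %>%
      step_lincomb(all_numeric(), -all_outcomes()) %>%
      step_rm(all_nominal()) %>%
      step_bs(all_predictors()) %>%
      step_YeoJohnson(all_predictors())

    # Baseline preprocessing for the black-box models
    tidy_rec <- recipe(Sale_Price ~ ., data = tidy_train) %>%
      step_normalize(all_numeric(), -all_outcomes()) %>%
      step_dummy(all_nominal())

    # Two plain linear regressions, a random forest, and XGBoost
    pca_regression_model <- linear_reg() %>%
      set_mode("regression") %>%
      set_engine("lm")

    spline_model <- linear_reg() %>%
      set_mode("regression") %>%
      set_engine("lm")

    rand_forest_model <- rand_forest(min_n = tune(), trees = tune()) %>%
      set_mode("regression") %>%
      set_engine("randomForest")

    xgboost_model <- boost_tree(learn_rate = tune(), trees = tune(),
                                tree_depth = tune()) %>%
      set_mode("regression") %>%
      set_engine("xgboost")

    # stacks needs the out-of-fold predictions and the workflows saved
    model_control <- control_grid(save_pred = TRUE, save_workflow = TRUE)
    model_metrics <- metric_set(rmse, mae, rsq)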
Finally, we define our grids. For the random forest we do grid_regular() on the model's parameters with levels = 3. Looking at the result, I don't like that some candidates use only one tree, so I'll filter to trees greater than one; that way we're fitting one thousand or two thousand trees. Call that the random forest grid. Then we do the same thing for XGBoost: grid_regular() on the boost model's parameters with levels = 3, again filtered to trees greater than one, so the minimum number of trees is about a thousand. That's the XGBoost grid.

Now we have our grids, and the last piece of setup is defining our workflows, where we basically combine all of this: the models and the preprocessing. Then we'll tune everything and pass it along into a stacked ensemble model, which we'll evaluate on the test set. The PCA workflow adds the PCA regression model and the PCA recipe; the spline workflow adds the spline model and the spline recipe; the random forest workflow gets its model plus the baseline tidy recipe; and the XGBoost workflow does the same. You might notice something: our linear models don't have any tuning parameters, so they'll be very easy to fit, while the random forest and XGBoost models will have to be tuned.

So finally we fit the models. For the PCA and spline workflows we just do fit_resamples(): pass the workflow, our k-folds data as the resamples, our model metrics, and our model control; I'll copy and paste that for the spline results. For the random forest and XGBoost results it's not fit_resamples() but tune_grid(), with grid set to the random forest grid and the XGBoost grid, respectively. The linear models have no parameters to tune, so fit_resamples() just fits a model and evaluates it on the k-folds data; tune_grid() does the same thing, but it also tries every parameter combination in the grid we gave it. We could run this, but I have a file I already tuned last night, so we'll load in the .RData; if you look at pca_res, the results are right there.
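Putting the grids, workflows, and tuning together, roughly (one caveat: fit_resamples() strictly expects control_resamples(); current versions of tune condense the control_grid() used here, but on older versions you may need control_resamples(save_pred = TRUE, save_workflow = TRUE)):

    # Regular tuning grids; drop the grid points that use a single tree
    # (parameters() was current in late 2020; newer dials prefers
    # extract_parameter_set_dials())
    rand_forest_grid <- grid_regular(parameters(rand_forest_model), levels = 3) %>%
      filter(trees > 1)

    xgboost_grid <- grid_regular(parameters(xgboost_model), levels = 3) %>%
      filter(trees > 1)

    # One workflow per model/preprocessor pairing
    pca_wf         <- workflow() %>% add_model(pca_regression_model) %>% add_recipe(pca_rec)
    spline_wf      <- workflow() %>% add_model(spline_model) %>% add_recipe(spline_rec)
    rand_forest_wf <- workflow() %>% add_model(rand_forest_model) %>% add_recipe(tidy_rec)
    xgboost_wf     <- workflow() %>% add_model(xgboost_model) %>% add_recipe(tidy_rec)

    # The linear models have nothing to tune, so they are resampled;
    # the random forest and XGBoost models are tuned over their grids
    pca_res <- fit_resamples(pca_wf, resamples = k_folds_data,
                             metrics = model_metrics, control = model_control)
    spline_res <- fit_resamples(spline_wf, resamples = k_folds_data,
                                metrics = model_metrics, control = model_control)
    rand_forest_res <- tune_grid(rand_forest_wf, resamples = k_folds_data,
                                 grid = rand_forest_grid, metrics = model_metrics,
                                 control = model_control)
    xgboost_res <- tune_grid(xgboost_wf, resamples = k_folds_data,
                             grid = xgboost_grid, metrics = model_metrics,
                             control = model_control)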
Now let's work on the actual stack, and with stacks it's pretty easy: we just initialize with stacks() and start adding our candidates. Since we saved our predictions, we can call collect_predictions() on pca_res and see the predictions along with the row ids, so it's easy to line up what PCA predicted against the actual value, combine all of the model predictions together, and build another model that predicts Sale_Price from those predictions. So we build our ames_stack: stacks(), then add_candidates() for pca_res, spline_res, rand_forest_res, and xgboost_res. This actually runs into an error, specifically with the XGBoost results, and I'm not sure why; it says something about list columns. This is one of the problems with a new package that's still being developed: sometimes you'll run into bugs. If I remove XGBoost, it fits perfectly, and we get our ames_stack: a data stack with three model definitions and eight candidate members, because our random forest model was tuned over six sets of parameters.

With our ames_stack, the next step is to blend the predictions, which is basically figuring out what combination of models we should use for the final ensemble. It's saying we should heavily weight our random forest model and lower the weight on our spline results. Once we have that, we take the ames_stack, blend_predictions(), and also fit_members(). So now we're fitting each member and modeling their predictions against the sale price: the sub-models saw the actual data, things like house size, rooms, and number of bathrooms, while the ames_stack only sees the models' predictions of sale price. It doesn't see as many features; it just sees these summarized features, what each model believes the house value is.

While it's running, I'll say this: since we have this problem where we can't add the XGBoost model, I think it'd be interesting to do the ensembling ourselves using tidymodels without the stacks package. The stacks package makes it way easier, with much less code, but it's still pretty easy to build an ensemble with just base tidyverse functions. Okay, training finished, and we can look at our ames_stack. It says that out of eight possible blending coefficients, the ensemble used two, with the lasso penalty shown. So it only used two of our models, the random forest model and the spline model, and it ignored the PCA model, which is kind of interesting. So now we have our final ames_stack model.
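The stacks pipeline itself is only a few lines; a sketch using the development-version API from the time of the video (xgboost_res is left out because of the bug just mentioned):

    # Collect the saved out-of-fold predictions into a data stack
    ames_stack <- stacks() %>%
      add_candidates(pca_res) %>%
      add_candidates(spline_res) %>%
      add_candidates(rand_forest_res)

    # Choose blending weights with a lasso meta-learner, then refit
    # the retained member models on the full training set
    ames_stack <- ames_stack %>%
      blend_predictions() %>%
      fit_members()

    ames_stack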
Now let's do it the tidyverse way, without the stacks package, in case you want to do it yourself and don't want to work with a package that still has a decent number of bugs. First, from our random forest and XGBoost results we collect the best hyperparameters: show_best() on RMSE, then slice(1) and select trees and min_n for the random forest, and trees through learn_rate for XGBoost. Those are our final params.

Now we start assembling the predictions. For the XGBoost results we collect_predictions(), but there's a problem: we tuned multiple models, so we do an inner_join() with the XGBoost final params (it's joining by trees, tree_depth, and learn_rate), which keeps only the final model's predictions. Then we select the id, .row, .pred, and Sale_Price columns; that's the XGBoost stack. We do the same for the random forest results: collect_predictions(), inner_join() with the random forest final params (make sure it's joining correctly, by trees and min_n), and select id, .row, .pred, and Sale_Price. I'm also going to rename the prediction columns to the model names, so .pred becomes xgboost in one and rand_forest in the other; that way each column says which model made the prediction. Then the PCA results: collect_predictions(), select id, .row, and pca = .pred; call that the PCA stack. And finally the spline results, same thing: select id, .row, and spline = .pred. I know I could just use rename(), but I figure this makes it a little cleaner to see what I'm actually doing. We also don't need Sale_Price repeated in all of them, so I'll remove it from all but one.

Now let's join these together: the XGBoost stack, left_join() the random forest stack, left_join() the PCA stack, left_join() the spline stack, joining by id and .row so the predictions line up for the same sale price. That's our stack_df; we've created the stacking data. With the stack_df we can fit a generic lasso regression: linear_reg() with penalty = 0.5 and mixture = 1 (mixture = 1 makes it a lasso rather than a ridge), mode regression, engine glmnet, fit to Sale_Price using stack_df, dropping the id and .row columns.

Now we have our stack model; let's look at the coefficients with tidy(). It did shrink our PCA model, but it kept the XGBoost model, and we have our splines. We could change the penalty to zero if we didn't want any model to be removed entirely, and that's one thing you can definitely play with: lasso, ridge, or elastic net for the meta-model. Some other common models for stacking are decision trees and linear models. Generally you don't want to go too complex; you can use a random forest, but that leaves you more prone to overfitting, which is very easy to do when you're modeling on top of models and dealing with a lot of potential data leakage.
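Pieced together, the manual version looks something like this (a sketch following the narration; inner_join() and left_join() rely on their default join columns, and the renamed prediction columns are the model names used below):

    # Best hyperparameters from each tuned model
    rand_forest_final_param <- rand_forest_res %>%
      show_best(metric = "rmse") %>%
      slice(1) %>%
      select(trees, min_n)

    xgboost_final_param <- xgboost_res %>%
      show_best(metric = "rmse") %>%
      slice(1) %>%
      select(trees, tree_depth, learn_rate)

    # Keep only the out-of-fold predictions for the best submodels,
    # naming each prediction column after its model
    xgboost_stack <- xgboost_res %>%
      collect_predictions() %>%
      inner_join(xgboost_final_param) %>%
      select(id, .row, xgboost = .pred, Sale_Price)

    rand_forest_stack <- rand_forest_res %>%
      collect_predictions() %>%
      inner_join(rand_forest_final_param) %>%
      select(id, .row, rand_forest = .pred)

    pca_stack <- pca_res %>%
      collect_predictions() %>%
      select(id, .row, pca = .pred)

    spline_stack <- spline_res %>%
      collect_predictions() %>%
      select(id, .row, spline = .pred)

    # One row per training observation: submodel predictions plus truth
    stack_df <- xgboost_stack %>%
      left_join(rand_forest_stack) %>%
      left_join(pca_stack) %>%
      left_join(spline_stack)

    # Lasso meta-learner over the submodel predictions
    stack_model <- linear_reg(penalty = 0.5, mixture = 1) %>%
      set_mode("regression") %>%
      set_engine("glmnet") %>%
      fit(Sale_Price ~ ., data = stack_df %>% select(-id, -.row))

    stack_model %>% tidy()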
Now that we have our final model, let's do the final workflows and finalize the sub-models. We already have the stack model; we just need to combine the tuned parameters with our XGBoost and random forest workflows and then do our last fits. So the XGBoost workflow gets finalize_workflow() with the XGBoost final params, then last_fit() with our tidy_split. While that's training, we do the random forest workflow: finalize_workflow() with the random forest final params, then last_fit(tidy_split). We do the same for the PCA workflow, last_fit(tidy_split), and then the spline workflow, last_fit(tidy_split). These models should fit very fast.

While that's running, I'll extract the predictions from the sub-models. I can create a tibble with a list column holding the four fitted workflows, a model_names column of "xgboost", "rand_forest", "pca", and "spline", and then mutate() a predictions column by mapping collect_predictions() over the models; that's the final df. Wow, that's taking a little longer than I expected; let's see... yep, it finished right when I checked, and the remaining ones are just PCA and spline, so they'll be fine. So we have our final predictions from the sub-models. We select the model names and the predictions, unnest, pivot_wider() with names from model_names and values from .pred, and drop id and .row.

Now we can use our stack model to predict on it: predict() with the stack model on the final df, then bind_cols() back onto it. So we have our stack prediction alongside Sale_Price, XGBoost, spline, and the rest. We rename .pred to stack, then pivot_longer() everything except Sale_Price to get it into a tidy format, group_by() the model name, compute our model metrics with truth = Sale_Price and estimate = value, ungroup, and pivot_wider() on the metrics just so we can read it.

So what do we see? Arranging by RMSE, our stack has a lower RMSE, a lower MAE, and a slightly higher R-squared: it beats all of our individual models on all of the metrics. That's good, because in predictive modeling challenges that extra one percent of predictive power can make you win the competition. In reality, when you stack models you're not going to get a 50% increase in performance, but you will get some marginal improvement.
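The last-fit-and-compare step, roughly (a sketch; final_df and stack_final_df are my names for the intermediate tables, and the last block is the test-set check of the ames_stack that comes up below):

    # Finalize the tuned workflows, then fit on the training set and
    # predict the test set with last_fit() (following the video, the
    # workflow objects are overwritten with the last_fit results)
    xgboost_wf <- xgboost_wf %>%
      finalize_workflow(xgboost_final_param) %>%
      last_fit(tidy_split)

    rand_forest_wf <- rand_forest_wf %>%
      finalize_workflow(rand_forest_final_param) %>%
      last_fit(tidy_split)

    pca_wf    <- pca_wf %>% last_fit(tidy_split)
    spline_wf <- spline_wf %>% last_fit(tidy_split)

    # Gather each submodel's test-set predictions into one wide table
    final_df <- tibble(model = list(xgboost_wf, rand_forest_wf, pca_wf, spline_wf),
                       model_names = c("xgboost", "rand_forest", "pca", "spline")) %>%
      mutate(pred = map(model, collect_predictions)) %>%
      select(model_names, pred) %>%
      unnest(pred) %>%
      select(model_names, .row, .pred, Sale_Price) %>%
      pivot_wider(names_from = model_names, values_from = .pred)

    # Predict with the meta-learner, then compare all models per metric
    stack_final_df <- predict(stack_model, final_df) %>%
      bind_cols(final_df) %>%
      rename(stack = .pred)

    stack_final_df %>%
      select(-.row) %>%
      pivot_longer(-Sale_Price) %>%
      group_by(name) %>%
      model_metrics(truth = Sale_Price, estimate = value) %>%
      ungroup() %>%
      pivot_wider(names_from = .metric, values_from = .estimate) %>%
      arrange(rmse)

    # Test-set metrics for the stacks-package ensemble (discussed below)
    predict(ames_stack, tidy_test) %>%
      bind_cols(tidy_test %>% select(Sale_Price)) %>%
      model_metrics(truth = Sale_Price, estimate = .pred) %>%
      pivot_wider(names_from = .metric, values_from = .estimate)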
And that's what many people do in Kaggle competitions: generally they're not just using a single XGBoost model, they're using multiple models and stacking them together to make a final output, because you don't have to worry about long prediction times the way you would in a production setting; you just have to worry about exporting a highly accurate CSV.

Finally, let's look at our ames_stack model. We predict with the ames_stack on our test data, tidy_test, bind the columns from tidy_test (selecting Sale_Price), and then compute the model metrics with truth = Sale_Price and estimate = .pred, pivoting wider with names from .metric and values from .estimate. What we see, comparing against our other stacked model, is that it actually has a higher RMSE than our XGBoost model, so it doesn't perform as well as our custom stack or the custom XGBoost model, but it still has some pretty decent performance. And you can see how building the ames_stack with the stacks package was much easier, with less data munging: all of that manual work could be condensed into a couple of lines, because stacks is built around piped functions. I'm hoping the stacks package improves so we can start adding more models, or maybe I'll just figure out what I was doing wrong, but you can see that stacks is a nice framework that really follows the tidy approach to modeling, and stacking definitely tends to improve your performance across the board. It may not increase it by a lot, but if twenty more minutes of work increases your predictive power by two percent, that's worth it in a lot of predictive competitions.

I know I went over a package that may still be changing, but I thought it was pretty interesting when I was doing my research last weekend. Hopefully you guys can do some more modeling and improve your models using stacking, or win some Kaggle competitions. In the meantime, I'll see you guys next week, and tidy on.
Info
Channel: Andrew Couch
Views: 1,580
Rating: 5 out of 5
Keywords: Rstudio, Tidyverse, R Programming, Data Science, Analytics, Statistics, Data Visualization, Data Viz, EDA, TidyTuesday, RStats, Data, Data Modeling, Tidy Tuesday, Learn Data Science, Learn Statistics, Learn Machine Learning, Machine Learning, ML, R Stats, R Studio, R Shiny, RShiny, Ensemble Learning, Tidy Models, Predictive Modeling, ML Competitions, Kaggle, Model Stacking, Model Blending, XGBoost, PCA, Splines, Random Forest, RandomForest, Pricing Models, Lasso Models, PCA Regression
Id: 44rINyxp220
Length: 39min 7sec (2347 seconds)
Published: Tue Oct 13 2020