TidyTuesday: Comparing TidyModels with Caret

Captions
Hey, it's Andrew Couch here, and today for this TidyTuesday I'm actually not going to be analyzing the TidyTuesday project. Instead, we're going to go over the tidymodels package. Recently the tidyverse developers and data science team released an ensemble of packages called tidymodels, and it's essentially like the tidyverse, except for machine learning: you can do all of your modeling procedures in one framework. It's the successor to the caret package, which was kind of the default black-box modeling package where you could train multiple models in a single framework, and Max Kuhn, the developer of caret, is on the tidymodels team. So I think it would be interesting to go over the basics of tidymodels and do some comparing and contrasting between tidymodels and caret.

We're going to use the abalone dataset, which is a pretty popular toy machine learning dataset from the UC Irvine Machine Learning Repository; you just go to the data folder and download the abalone data file. So we'll open up an R Markdown file, rename it to "TidyTuesday caret and tidymodels", and make sure the working directory is set. Then I'll load the tidyverse, tidymodels, and caret packages and read the abalone data in with read_csv. The abalone data file doesn't actually have column names, so we have to specify them ourselves; I have the list saved, so I'll just copy and paste it to not waste any time. So now we have the abalone dataset.
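A minimal sketch of that setup (the column names follow the UCI abalone data dictionary; the local file name `abalone.data` and the exact name spellings are my assumptions):

```r
library(tidyverse)
library(tidymodels)
library(caret)

# abalone.data has no header row, so supply the column names
# from the UCI data dictionary ourselves.
abalone_names <- c("Sex", "Length", "Diameter", "Height",
                   "WholeWeight", "ShuckedWeight", "VisceraWeight",
                   "ShellWeight", "Rings")

df <- read_csv("abalone.data", col_names = abalone_names)
```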
This is a regression dataset: we're trying to predict the number of rings on the abalone, and the rings essentially indicate the age of the abalone (an abalone is like a sea snail). We'll do some basic preprocessing: Sex needs to be a factor, so we'll convert that. Additionally, there are about 4,000 rows, and since I'm going to be doing some modeling I'll use set.seed and sample down to 800 rows. That's not really a sample size I'd run a bunch of machine learning models on without some extra procedures, but this is just for comparison, so we won't lose sleep over it. Basic summary statistics show Length, Diameter, Height, WholeWeight, ShuckedWeight, VisceraWeight, ShellWeight, and then Rings with a min of 2 and a max of 24. We also have relatively even groups for Sex: female, male, and "I", which is infant. I bet infant is related to rings, but that's not really what we're trying to go over.

Modeling procedure 101 is that you first create your training and test sets: you train a model off the training data and then evaluate it on the test data. First we'll do it the caret way. With caret you use the createDataPartition function: caret_split is createDataPartition with y = df$Rings, p = 0.8 for an 80/20 split, and list = FALSE. (There's also some resampling stuff going on under the hood that we don't need to go into.) To create the train and test sets, we index: caret_train is df with the rows in caret_split, and caret_test is whatever rows we didn't use for training, so df indexed with minus caret_split. If we check the dimensions, caret_train is 641 rows by nine columns, which makes sense, and caret_test is 159 by nine. Cool, so that's the caret way of splitting train and test sets.

Then we have the tidymodels way, which uses initial_split. What's interesting about tidymodels is that the individual steps of the modeling process live in different packages: there's a package just for data partitioning, a package for tuning your parameters, a package for specifying the models, and another package for evaluation. But you don't have to worry about which package goes with what, because it's all loaded by the central tidymodels package. So we'll say tidy_split <- initial_split(df, prop = 0.8). What's cool about this is it's a little easier to understand what's going on: to create the train set you just call training() on the split, and for the test set you call testing() on the same split. If we look at tidy_train, it's 641 rows, same thing. So that's the tidy way of splitting. For the splitting step nothing is crazy; it's basically the same, pretty efficient on both ends, and honestly I wouldn't lose sleep over which one you use.
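The two splits side by side, as a sketch (assuming `df` is the data frame loaded above; the seed value is arbitrary):

```r
set.seed(18)
df <- df %>%
  mutate(Sex = as.factor(Sex)) %>%   # Sex as a factor
  sample_n(800)                      # sample down to 800 rows

# The caret way: an index of training-row positions.
caret_split <- createDataPartition(y = df$Rings, p = 0.8, list = FALSE)
caret_train <- df[caret_split, ]
caret_test  <- df[-caret_split, ]

# The tidy way: one split object that training()/testing() unpack.
tidy_split <- initial_split(df, prop = 0.8)
tidy_train <- training(tidy_split)
tidy_test  <- testing(tidy_split)
```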
I do kind of like the tidy split, though, because you can reuse that split object for other things in the tidy framework. Once we have our training and test sets, we need to apply some preprocessing. For most models it really doesn't hurt to center and scale (normalize) your data so everything is on similar dimensions; this data is pretty much centered already, but we'll do it anyway.

The caret way: we apply centering and scaling to all variables besides our target, Rings. We also have to create dummy variables for Sex, so a column like "female" becomes 0 or 1. We're going to do one-hot encoding because we're fitting a random forest model; normally, if you were doing something like linear regression, you'd use the dummy-variable scheme that preserves degrees of freedom: with three categorical values you'd only create two columns, say female and male, and if both are zero, that implies infant. But here we'll keep it very simple: 0/1 columns for female, infant, and male. So: preprocessing is center, scale, and one-hot encode.

In caret we use the preProcess function, which is pretty simple. Since we'll be modeling on the training data, we fit the preprocessing on the training set only so we can then apply it to the test set; we don't want any data leakage, so we shouldn't compute the preprocessing values on the entire dataset. Ideally the split makes it so it wouldn't matter much, but it's better practice, and you should never really use the test set for any modeling decisions. We deselect Rings, because we don't want to apply the preprocessing to the target, and set method = c("center", "scale"). The object that preProcess returns can then be used with predict() to actually apply the transformation, so caret_train becomes predict(caret_preprocess, caret_train), and then we do the same thing to the test set. Boom: we've applied preprocessing fit on our training data to both our train and test sets for the caret modeling.

If we look at caret_train, we still have the categorical (factor) variable, so we still need the one-hot encoding. For that caret uses the dummyVars function: dummyVars(~ ., data = caret_train), which says "if you see a factor, turn it into dummy variables." There's an option (the fullRank argument, if I remember right) to drop a level so it's not one-hot encoding, but we don't need to worry about that since we're not doing it. We then predict() with the dummyVars object on all of our data; after that, everything in caret_train is one-hot encoded, centered, and scaled, with Rings still untouched, and we apply the same thing to caret_test. So that's the preprocessing we're going to do for modeling. For tidymodels, the tidy way uses the recipes package, which holds all of these kinds of preprocessing steps.
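The caret preprocessing described above, sketched out (assuming the `caret_train`/`caret_test` objects from the split step; note that predict.preProcess leaves columns it wasn't fit on, like Rings, unchanged):

```r
# Fit centering/scaling on the training set only (no data leakage),
# excluding the target column Rings.
caret_preprocess <- preProcess(caret_train %>% select(-Rings),
                               method = c("center", "scale"))

caret_train <- predict(caret_preprocess, caret_train)
caret_test  <- predict(caret_preprocess, caret_test)

# One-hot encode the Sex factor with dummyVars, then apply it.
caret_one_hot <- dummyVars(~ ., data = caret_train)
caret_train <- predict(caret_one_hot, caret_train) %>% as_tibble()
caret_test  <- predict(caret_one_hot, caret_test) %>% as_tibble()
```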
Recipes are built in steps; essentially, the preprocessing is a recipe where you add different ingredients to create your final modeling data. The recipe starts with recipe(), where we specify our formula, predicting Rings using everything else, with tidy_train as the data. And this is what's really cool: since it's tidymodels, it's made to be piped into things. So we add step_center(all_predictors(), -all_nominal()); all_nominal() means all the categorical variables, so we don't apply the centering to our dummy variables. Then step_scale, same thing, all predictors minus all nominal. Then step_dummy(all_nominal()) with one_hot = TRUE. Cool, so now we have a recipe that says: centering, scaling, dummies.

Then we have to prep it. At this point the recipe isn't really doing anything; it's just a description of what we want to do. Prepping it, tidy_prep <- prep(tidy_rec), is where it actually looks at the data and figures out how to apply the recipe: which columns are the predictors, which are nominal, what everything is. But if you notice, tidy_prep still doesn't hold any processed data; it just says outcome, predictor, and you're left asking, where's my actual processed data? That's when we have to juice it: juice(tidy_prep) applies all of it, and boom, we have our one-hot encoded variables and our centered-and-scaled data with Rings untouched.

Now you might be wondering: okay, it's cool that this can be piped, but what's the point? How is the tidy way better than caret? Well, for example, when we apply the centering and scaling with caret, we don't really control how each variable is treated. We might want to only center, say, Height, and only scale Diameter; with the tidy way we don't have to use all_predictors(), we can use selectors like starts_with() or contains(), or all_nominal(). So there's more fine-tuning in a more intuitive fashion, whereas the caret way is almost a black box: it just does it, and you don't see how it happens.

Plus, our recipe just has centering and scaling, but there are much cooler steps available. If we look at the recipes reference, there are imputation steps: lower-bound imputation, mean imputation, mode imputation, rolling-window imputation. You can do more complex transformations: Box-Cox, inverse, inverse logit, log, a lot of really cool transformations. And there's more: you can create date features and counts, and you can add interaction effects with step_interact, so if I wanted interactions between, say, the sex dummies and Diameter, we could add that with a terms formula.
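The recipe pipeline above, as a sketch (assuming `tidy_train` from the split step; object names are the ones used in the walkthrough):

```r
# A recipe is only a description of preprocessing steps;
# nothing is computed until it is prepped.
tidy_rec <- recipe(Rings ~ ., data = tidy_train) %>%
  step_center(all_predictors(), -all_nominal()) %>%
  step_scale(all_predictors(), -all_nominal()) %>%
  step_dummy(all_nominal(), one_hot = TRUE)

# prep() estimates the step parameters (means, sds, factor levels)
# from the training data; juice() returns the processed training set.
tidy_prep  <- prep(tidy_rec)
tidy_juice <- juice(tidy_prep)
```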
And there's a ton of other really cool stuff in there; you could go through a month-long course on it, but those are the basics. Okay, now that we have our preprocessed data for both caret and tidymodels, we'll set up our resampling procedures. caret's main framework is basically: create a train and test set, and then, on the training set, use k-fold cross-validation to figure out the best tuning parameters. If you've taken a modeling or data mining class you might have heard of the train/validation/test split; that's essentially what k-fold cross-validation is doing, and it's why we only split into train and test: the folds act as the validation set. It's also a little better, because you can repeat it ten times, so you have a more robust estimate of your error for hyperparameter tuning.

So we'll set up our resampling using k-folds. For caret it's pretty simple: trainControl(method = "repeatedcv", number = 10, repeats = 10), which we'll save as caret_k_folds. If we look at caret_k_folds, there's a lot of stuff in it, kind of dirty to read. The tidy way uses vfold_cv, which is basically the same thing: vfold_cv(tidy_train, v = 10, strata = Rings), stratifying on Rings because that's our target, saved as tidy_k_folds. If you look at it, it has already made and partitioned the folds. So again, nothing crazy; they're doing the same thing, except the caret object doesn't actually hold the fold data (it does that during the modeling procedure itself), whereas the tidy way creates the fold data up front and we pass it into the actual tuning step. But again, it's pretty simple.
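Both resampling setups, sketched (assuming the training objects from earlier):

```r
# caret: a control object describing 10-fold CV repeated 10 times;
# the folds themselves are created later, inside train().
caret_k_folds <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 10)

# tidymodels: the folds are materialized now, stratified on Rings.
tidy_k_folds <- vfold_cv(tidy_train, v = 10, strata = Rings)
```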
Nothing's changed too much so far; however, for the actual modeling we get into some different waters. Okay, so: tuning models. We'll do caret first: caret_rf <- train(Rings ~ ., data = caret_train, method = "rf", trControl = caret_k_folds, tuneLength = 3). Rings ~ . is our formula, we're using our preprocessed training data, the method "rf" is the randomForest package (we're not using ranger or anything like that), our train control is the caret_k_folds object, and tuneLength = 3 means it will try three different values of mtry. This might take a while because I didn't enable parallel processing, but that's fine.

While it's training, we'll start writing out the tidy version. Honestly, I kind of like the caret version better in one way, because it's very straightforward: I think most people could read it and say, okay, I get what you're doing. But it's not as flexible. The tidy way is: we choose rand_forest(), and for the tuning we say mtry = tune(), letting tidymodels tune it; we'll also give it trees and fix that at, say, 1000; and min_n = tune() as well, so tidymodels tunes that too. Then we have to set the mode, and since Rings is a continuous variable, it's set_mode("regression"). For our engine, we can do set_engine("randomForest"). What's cool is we don't actually have to choose the randomForest engine; tidymodels lets you choose different engines, which is really nice when you'd prefer, say, ranger instead of randomForest. If you look at the parsnip reference, there's a models list that's pretty easy to reference: for random forests you can say ranger or randomForest, and you can also use spark. Going back through it, there are other cool things too, like a Stan back end for linear regression, so if you're into Bayesian statistics you can do cool stuff like that. And as you can see there's a decent amount of models: neural nets, k-nearest neighbors, multivariate adaptive regression splines, GLMs, and obviously your boosted models too.

Okay, let's see, it's still running; I'm going to pause the video and resume when it finishes training. ... It finished at about the same time. If we look at it, it trained using three different mtry values, and it says the best value was mtry = 6. So on the tidy side we specify our random forest model and call it tidy_rf, and if you look at it, you might think: wait, we didn't actually train anything. Right: we just have a model skeleton. And this is the thing that's a little odd, that I'm going to have to get used to: to actually fit it, you have to set up a tidy workflow.
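The two model definitions described above, as a sketch (assuming the preprocessed `caret_train` and the fold objects from earlier):

```r
# caret: preprocessing is already applied, so train on caret_train directly.
caret_rf <- train(Rings ~ .,
                  data = caret_train,
                  method = "rf",              # the randomForest engine
                  trControl = caret_k_folds,
                  tuneLength = 3)             # try 3 mtry values

# tidymodels: a model "skeleton" -- nothing is fit yet; mtry and
# min_n are left for the tuning step to fill in.
tidy_rf <- rand_forest(mtry = tune(),
                       trees = 1000,
                       min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("randomForest")
```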
The workflow is basically saying: since we're giving it all these data splits, it needs to understand how to model this stuff. So we build a workflow: we add a recipe, our preprocessing, which is tidy_rec, and we add a model, which is our tidy_rf spec. Now we have a tidy workflow; that's the tidy way of specifying the modeling procedure. Then, finally, to actually tune it, we use tune_grid: we give it our tidy_workflow, our resamples (the cross-validation folds, tidy_k_folds), and for the grid search we say grid = 5, so it will try five different parameter combinations. We'll save that as tidy_tune, and now it's tuning the models over a grid using the k-folds.

While it trains, one cool thing about the tidymodels ecosystem is that it comes with other packages, such as textrecipes, which makes it pretty easy to do text mining. If you look at it, there's step_tokenize, which does the tokenization for you; it can remove stop words, you can cap the maximum number of tokens, and there's a tf-idf step right there. That's because the people who created this also created the tidytext package. And if you notice, you can do word embeddings, which is very, very cool; you can get toward state-of-the-art modeling if you give it word embeddings, and you never know, they might do some cooler things with embeddings down the road, like the neural network, Hugging Face kind of stuff.

Okay, our model has been trained. If we look at tidy_tune we have all of our splits, all of our folds, with metrics and some notes, and it says we used 10-fold cross-validation with stratification. Cool. First, let's evaluate the caret model: with caret_rf we can just plot it, and the best value is mtry = 6, the one with the lowest root mean squared error. You can see that in the tuning plot. Looking at the MAE: on average, with that parameter of 6, it's off by about 1.7. Cool. Now let's evaluate the tidy model. With tidy_tune there are some cool helpers: select_best with metric "rmse" gives the best tune, and it says an mtry of 7 with a min_n of 33.

Now we want to finalize our models. For caret we don't really have to worry about it: caret_rf already stores the best model (mtry = 6 and so on) as its final model automatically. For the tidy way, though, we have to do something different: if we look at our tidy_rf spec, it doesn't actually have tuning parameter values, since we left those for tune_grid to find.
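The workflow-and-tuning steps above, sketched (assuming the recipe, model spec, and folds defined earlier):

```r
# Combine preprocessing and model spec into one workflow.
tidy_workflow <- workflow() %>%
  add_recipe(tidy_rec) %>%
  add_model(tidy_rf)

# Tune mtry and min_n over 5 grid candidates, resampled on the folds.
tidy_tune <- tune_grid(tidy_workflow,
                       resamples = tidy_k_folds,
                       grid = 5)

# The best hyperparameters by RMSE.
tidy_best_tune <- select_best(tidy_tune, metric = "rmse")
```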
We have to add those new tuning parameters from our tidy_best_tune, which holds the mtry and min_n we need. The tidymodels team were smart about this: they knew people would be tuning models and would want to apply the final tuning parameters back to their model, so there's a function called finalize_model. We give it our original model spec and then the new parameters, tidy_best_tune, and boom, it's updated to 7 and 33. We'll call that tidy_final_model.

Once we have the final model, we have to make a final workflow too: since we trained through a workflow and our model is different now, we need to create another workflow, which is a little confusing, to be honest, and I'm sure there's a reason for it. With tidymodels there's so much to read about, and since they just released it, I think it's going to take a few months for everyone to figure out the process; they'll probably make some fixes, maybe even change some function names. Anyway, tidy_final_workflow is the workflow() function with the same recipe, but this time the model we add is tidy_final_model. If you look at it, it says: we're going to preprocess with the given recipe, centering, scaling, and dummies; our model is a random forest; and here are its parameters.

So now we need to actually evaluate the test sets using the final models. For caret it's a little awkward: predict(caret_rf, caret_test) gives you a bare vector of predictions, so we convert it to a tibble, bind in the observed Rings from caret_test as obs, and name the prediction column pred. We'll call that caret_final_results, and we'll evaluate with caret's RMSE function, giving it pred = caret_final_results$pred and obs = caret_final_results$obs, which comes out to 2.29. Does that make sense? Realistically, a good model should have similar validation and test accuracy, and if we look, caret_rf's cross-validated RMSE was 2.34 versus 2.29 on the test set; I think that's pretty solid, in fact a little lower. So that's the caret way.

Finally, the tidy way. We still have our tidy_final_workflow, and what's cool is they knew you'd eventually have to evaluate on the test set, so there's a function called last_fit: we did all of our tuning, and this is our last fit, where we finally put it to the test set. We give it the tidy_split object, and it actually trains on the training portion and evaluates on the test portion. Then we can do collect_metrics, and boom, we have the metrics from our test set. It's pretty straightforward, and we see it's 2.44, versus caret's 2.29.
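The final-model evaluation on both sides, as a sketch (assuming the objects built in the previous steps):

```r
# tidymodels: plug the best hyperparameters back into the spec,
# rebuild the workflow, then fit on the train set and score the
# test set in one shot with last_fit().
tidy_final_model <- finalize_model(tidy_rf, tidy_best_tune)

tidy_final_workflow <- workflow() %>%
  add_recipe(tidy_rec) %>%
  add_model(tidy_final_model)

tidy_final_workflow %>%
  last_fit(tidy_split) %>%
  collect_metrics()

# caret: predict on the held-out set and score by hand.
caret_final_results <- tibble(pred = predict(caret_rf, caret_test),
                              obs  = caret_test$Rings)

RMSE(pred = caret_final_results$pred, obs = caret_final_results$obs)
```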
So caret is a little bit better here, but we didn't give the tidy model as many tuning chances; the caret run actually took a lot longer than the tidy one, and we can always do more tuning. We fixed the trees, and if we'd tuned the trees too, I'm sure it would find a better model. If you know the framework, it's definitely pretty easy to evaluate models the tidy way, but the modeling procedure itself seems a little more complex, and I think that's because it's really made to train multiple models at the same time, with the same procedure, on the same data. Additionally, I think one problem with the tidymodels packages is that there are a lot of functions that each make sense, but there are so many that you're not sure which one you're really looking for, so there's going to be a lot of referencing the packages; that's fine, because they've made some great documentation. Also, tune_grid with a workflow is a little confusing, and I still need to really understand how the workflow works. But the really good thing I like about the tidymodels framework is that they make it so you can't really mess with the test set; they're trying to make it so you're not cheating in your modeling procedures.

So yeah, this was basically a short video going over the basics of the tidymodels and caret packages. One thing I think would be a good way to dip your toes in is to really learn the recipes package: do all your preprocessing with recipes and juice it like we did here, and then, if you don't feel comfortable with the tidymodels modeling part yet, just use the juiced data in caret and train your models through caret. I think that's a good way to start getting the hang of how tidymodels works. I really like the tidymodels package and I'm probably going to keep working with it; I'll probably also make a video on the Keras package, because tidymodels really reminds me of Keras too, so I'll likely be messing with that in the future. I'll see you all next week for the next TidyTuesday.
Info
Channel: Andrew Couch
Views: 4,818
Keywords: Rstudio, Tidyverse, R Programming, Data Science, Analytics, University of Iowa, Statistics, ggplot2, ggplot, data wrangling, Dplyr, Data Visualizaiton, Data Viz, EDA, TidyTuesday, RStats, TidyModels, Caret, ML, Machine Learning, Data, Data Modeling, Black-Box, Recipes, Abalone, TidyModels Package, Caret Package, CRISP-DM
Id: hAMjhbPJTkA
Length: 41min 35sec (2495 seconds)
Published: Tue Apr 28 2020