ML Monday live screencast: Predicting board game ratings in R

Captions
Hi, I'm Dave Robinson. Welcome to a screencast where I'm going to be using R and tidymodels to build a predictive model on a new dataset. The data comes from the SLICED competition, which is a really great competition: a live stream of several contestants building a predictive model on a dataset. So this is a little different from my usual screencasts. I'm basically going to be competing against the clock, and the goal isn't necessarily to get insight from this dataset; the goal is to make the predictive model as good as I can, measured by root mean squared error, on this dataset from BoardGameGeek. The reason I'm doing this is that I'll be competing in the competition in a couple of weeks, and I'm interested in practicing my tidymodels skills, so I thought you might like to join in and we can all learn about tidymodels together.

The goal for today is to take this dataset and predict geek ratings for board games on boardgamegeek.com. I'm going to download the data and see how good a predictive model I can build with tidymodels. One note: I've actually looked at this data once before; it was a Tidy Tuesday project about two years ago. I did build a predictive model back then, but not with tidymodels, so I'm excited to see how the experience differs. Without further ado, let me join the competition and get to work.

I'm going to download the train and test sets and work from a fresh session (I should really write up some kind of template for these modeling sessions). I'll bring in a couple of packages I find useful: the tidyverse, scales, a theme_set() call for plotting, and tidymodels. I also want parallel processing, so I'll call doParallel::registerDoParallel(); I have four cores on this computer, which will make some of these steps a little faster. Then I'll read in the competition's training and test files. I'm going to read the training file in as raw data, and call the competition's test file the holdout, because we're going to have our own train and test split within this data. I'll do an initial split of the raw data: 80% into training and 20% into testing. Basically, I want three datasets. There's a training set, which is what I generally tune parameters on; a test set, which I try the model on while I'm working to make sure it isn't overfit; and a holdout set, which I actually predict on and upload. I'll only have 10 chances in two hours to upload predictions during the actual competition, so I really want to keep the holdout data separate and only try it a couple of times. So it's good to have a train and test split: the idea is that I'm splitting this data and only doing EDA and such on the train set.
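For reference, here is a rough sketch of the setup just described; the file names, object names, and seed below are assumptions rather than a transcription of what's on screen:

```r
library(tidyverse)
library(scales)
library(tidymodels)
theme_set(theme_light())   # exact theme choice is an assumption

doParallel::registerDoParallel(cores = 4)  # use all four cores for the tuning steps later

full_data <- read_csv("train.csv")   # competition training data (path is an assumption)
holdout   <- read_csv("test.csv")    # competition test set; predictions on this get uploaded

set.seed(2021)                                 # seed value is an assumption
spl   <- initial_split(full_data, prop = 0.8)  # 80/20 split within the training data
train <- training(spl)
test  <- testing(spl)
```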
The test set stays separate, and then there's a holdout set of about 1,500 more games. Most of the work is going to be training the algorithm on the training set and then checking it on the test set. I'm not going to call the object "raw data"; I'll call it full_data. And it's good to set a seed (I should really set up some kind of template for this later).

All right, I'm going to work with this training data first: I'll do ten minutes of exploratory data analysis, and then dive into modeling. We're about five minutes in. First, we want to predict geek_rating, and we're being scored with root mean squared error. The standard deviation of geek_rating is basically how well you'd do by predicting the overall average for everything (it's the square root of the average squared difference from the mean), so a model shouldn't do any worse than about 0.48. That's good to know.

Then I try a few quick looks. What about min_players versus geek_rating as a box plot, with group = min_players? I just want to see the relationship; maybe ratings drop as min_players gets really large. I really should have started with a histogram of geek_rating. One thing I notice is that the data is truncated: everything is above about 5.5, even though values right around there are very common. I don't know whether that's true in the holdout set, but I'm going to assume it is, and a quick check of my test set shows the same thing. I wonder what it would look like on a log scale: what about the log of the rating minus 5.5, or minus the minimum geek_rating? I'm just thinking about what might make it look a little more normal. Neither is perfect, but I'll keep those in my back pocket as possible transformations.

Let's look at a couple more; I always start with a little EDA before getting into the models. Games with min_players of zero are a weird sign, and they do seem to rate a little lower, but that's probably not a big predictor. What about num_votes? It wouldn't be strange if num_votes were predictive; actually it's not strange at all, because we'd expect popular games to have higher ratings, and with scale_x_log10() the relationship is clear. I'll definitely want a log transform there; popular games tend to have higher ratings, so that's going to be a good predictor. Let's also look at min_time and max_time. min_time makes sense on a log scale, but it may not be related to the rating, and max_time might not be related either. Good to know.
What if I try year? I'm curious whether these are linear effects, which is one of the things I want to know. So I'll group_by(year), summarize the median geek_rating, also grab n, and filter for years with at least 20 games (that threshold might be too low, but it's not bad). There might be a slight trend, with median ratings sitting between five and six, so I'm going to include year as a predictor.

Before I get into prediction, one more thing: I want to look at some of the categorical variables. Oh, and there's an age column I missed. I'm curious whether games aimed at older or younger players rate differently; the trouble is it's hard to tell from a plot because some ages are far more common, so it's easier to check a correlation. I can do cor() on age and geek_rating in the training data, and there does seem to be a correlation, though I probably should have used log2(age + 1); with that, cor.test() gives a significant p-value. I just wanted to look at a couple of those, but I suspect a lot of the value is going to come out of the categorical fields.

So let's look at mechanic. One thing we can see is that it's comma-separated, so I can do something like separate_rows() and then count(mechanic). I'm basically going to use text-mining tools, but with a special tokenizer that splits on the commas, so that each mechanic, like hand management, can have its own positive or negative effect.

What about owned? I don't know what "owned" means; is it documented anywhere, is there a data dictionary? Looking at a histogram, I bet it's the number of people who say they own the game. That will be correlated with num_votes, but it's also clearly associated with geek_rating, which is interesting.

It's been about ten minutes, so I'm going to move on, but first a quick look at the rest: I hadn't yet looked at the category columns (category 1, 2, 3, and so on), and there's designer, which is another comma-separated field. So things can be spread across multiple category columns, and I'm probably going to want to combine them; the number of categories could itself be a predictor. All good things to know.
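A sketch of roughly what those EDA checks look like in code; the column names (geek_rating, num_votes, year, age, mechanic, owned) are assumptions based on the narration:

```r
sd(train$geek_rating)          # ~0.48: the RMSE you'd get predicting the overall mean

train %>%
  ggplot(aes(geek_rating)) +
  geom_histogram()             # truncated: nothing much below ~5.5

train %>%
  ggplot(aes(num_votes, geek_rating)) +
  geom_point() +
  scale_x_log10()              # popular games tend to be rated higher

train %>%
  group_by(year) %>%
  summarize(median_rating = median(geek_rating), n = n()) %>%
  filter(n >= 20) %>%
  ggplot(aes(year, median_rating)) +
  geom_line()                  # a slight trend over the years

cor.test(log2(train$age + 1), train$geek_rating)   # small but significant correlation

train %>%
  separate_rows(mechanic, sep = ", ") %>%
  count(mechanic, sort = TRUE) # mechanic is a comma-separated, multi-valued field

train %>%
  ggplot(aes(owned)) +
  geom_histogram() +
  scale_x_log10()              # "owned" is heavily skewed, like num_votes
```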
OK, time to start building models. Notice that I'm not actually writing code to clean these fields yet; in the tidymodels approach, that cleaning gets done in what we call a recipe. So, modeling. 0.48 is about what we'd get if we did absolutely nothing; let's find out how much better we can do. First I'm going to create a metric set, using rmse from yardstick, that says we're always measuring these models with root mean squared error.

I'll start with a linear regression: I set the engine to "lm", which is a non-penalized linear regression, and that gives me the linear model spec. The way we work in parsnip is that we create a spec of the model, and then we do our data cleaning, and the data cleaning is really key: we're not going to use dplyr for it, we're going to use a recipe. That way we can keep applying it to each of our training resamples individually, and it's a really powerful approach. So I'll say geek_rating explained by, and I'm going to add predictors one by one, starting with owned, which we saw was pretty predictive, with data = train. Now, what transformations do I want? One I already said: a log transformation on owned. I'll use base 2 (I really like base 2) and an offset of 1, so it's log2(owned + 1). The offset is in case there are zeros; there aren't any here and it won't make a big difference to the results, but later I'm going to include variables that do have zeros. So that's my linear recipe.

I want to know how this does on cross-validated data, so I'll create one more dataset up top: train_fold, from vfold_cv() on the training data. Let's do 10 folds (I'd use 5 if I wanted to be a little faster). This splits the training data into ten training and assessment sets, so I can train the model on one part and test it on the other, using the resampling functions in tidymodels. Oh, and I haven't saved yet; I've created a folder called ml-practice, and I'll save this script there as board-games.

So let's fit our model. The third step, now that I have my spec and my recipe, is to make a workflow: add_recipe() with the linear recipe, add_model() with the linear spec, and that's my linear workflow. That's the setup for an approach; it's a good amount of code compared to just calling lm(), but it's really extensible and powerful once you start setting up your analysis this way; I've found it really effective. The reason is I can take this workflow, call fit_resamples() on my folded training data, and save the result as cv, for cross-validation. What that did was train ten models, fit them, and now the metrics column has the root mean squared error for each of them, which I can view with collect_metrics(). Boom: a linear model on just log2(owned) gets us to 0.265. That's a lot better than 0.48; heck, we could just go ahead and submit that. But of course I want to see how much better I can get.
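Here is a sketch of that first parsnip/recipes/workflow pipeline as narrated; object names are assumptions:

```r
mset <- metric_set(rmse)            # always score with root mean squared error

train_fold <- vfold_cv(train, v = 10)

lin_spec <- linear_reg() %>%
  set_engine("lm")                  # plain, non-penalized linear regression

lin_rec <- recipe(geek_rating ~ owned, data = train) %>%
  step_log(owned, base = 2, offset = 1)   # log2(owned + 1), in case of zeros

lin_wflow <- workflow() %>%
  add_recipe(lin_rec) %>%
  add_model(lin_spec)

lin_cv <- lin_wflow %>%
  fit_resamples(train_fold, metrics = mset)

collect_metrics(lin_cv)   # RMSE around 0.265 with just log2(owned)
```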
Why do I want to beat an RMSE of 0.265? Well, one thing I can do is look at the leaderboard from when this was run last week. I don't know that I'm going to beat people; I'm still practicing. It looks like the winning score was around 0.16, about as good as people have generally done, which honestly seems surprisingly accurate to me, but roughly 0.16 to 0.18 is where a lot of people ended up. So I want to do a lot better than 0.265.

I'm going to keep editing this recipe. What else do we want to include? I'll add avg_time (I'll skip min_time and max_time and just use the average), and keep the log transform, since I saw earlier it was skewed. Does that help? Barely, maybe a tiny bit. What if I also throw in num_votes alongside owned? After a moment of fumbling with the column name (it is called num_votes; the real problem was a missing plus sign in the formula), adding the votes gets us a little better. Still just a linear model.

Let's throw in min_players and max_players. Looking at a histogram of max_players first: who's playing with 200 players? That's strange. Rather than logging it, I'll add a step_mutate() that changes max_players to pmin(max_players, 10); I just don't want values scattered all over the place. Adding min_players and max_players didn't add much: it was 0.237 before and 0.234 after, so the cross-validated error is barely improving.

Now let's get a little more real about what we can include: year. Not much better on its own, but we saw there might be a nonlinear trend with year, so let's add step_ns(), which adds what we call a spline term, on year. I'll start with a guess of two degrees of freedom, and we'll do some tuning later. That adds a little flexibility, but, oddly, it actually made things worse; I reran it just in case, and it's the same both times. With three degrees of freedom it got better; not a lot better, it's still around 0.22, which is still a lot higher than it will need to be. So I'm going to tune this so I can visualize it more easily. This is what's really powerful: I just put tune() in place of the degrees of freedom, and instead of fit_resamples() I call tune_grid().
tune_grid() takes a second argument, grid, where I pass crossing() of degrees of freedom from one to four for this nonlinear spline; this lets the model represent different kinds of curved shapes with respect to year. Now I've got four different root mean squared errors for the different degrees of freedom, and there's a wonderful function in the tune package called autoplot() that shows how the cross-validated RMSE changes as I vary the degrees of freedom. It looks like I might even want more; I'm not going to go higher than seven, which is already an awful lot. (I'm introducing these ideas one at a time; we're definitely going to get more sophisticated with the model shortly.) Oh, and I neglected to pass metrics = mset; remember I created that metric set earlier so we'd always track root mean squared error. OK: there are what we call diminishing returns. You can get down to about 0.215 with a spline term on year, but not any lower than that. So I'm going to go ahead and pick five degrees of freedom; later I'll show the proper finalize step, but for now I'll just hard-code it, since there isn't much gain after that point and the values are pretty close together. So that was the story of tuning with tune_grid(). I'm going to comment the tuning grid out so I can keep adding things to the same code.
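A sketch of what tuning the spline with tune_grid() might look like, reusing the spec, folds, and metric set from above; the exact predictor list and deg_free values are assumptions based on the narration:

```r
lin_rec <- recipe(geek_rating ~ owned + num_votes + avg_time +
                    min_players + max_players + year,
                  data = train) %>%
  step_log(owned, num_votes, avg_time, base = 2, offset = 1) %>%
  step_mutate(max_players = pmin(max_players, 10)) %>%   # cap the weird 200-player values
  step_ns(year, deg_free = tune())                       # spline on year, tuned

lin_tune <- workflow() %>%
  add_recipe(lin_rec) %>%
  add_model(lin_spec) %>%
  tune_grid(train_fold,
            grid = crossing(deg_free = 1:7),
            metrics = mset)

autoplot(lin_tune)   # RMSE levels off around 0.215 once the spline has a few degrees of freedom
```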
We still haven't touched mechanic, category, or designer, so let's talk about those, starting with mechanic. The cool thing about mechanic is that, as we noticed, it's comma-separated, and I want each value to be its own categorical variable: one for auction/bidding, one for variable player powers, and so on. That takes a little more thought. I'm going to use tokenization from the textrecipes package, so library(textrecipes), and in this recipe I'll add an extra step called step_tokenize() on mechanic. What does tokenize do? With this recipe I can prep() it and then juice() it, which means actually applying these cleaning steps to the training data and looking at what pops out: the mechanic column is now a token list rather than a number or a single categorical value. Why is that powerful? Well, we saw before how many distinct mechanics there were, so next I add step_tokenfilter() on mechanic to keep only the top tokens; I'll start with just 20. Now some of them get filtered out, so fewer tokens are included. And once I've done that, there's one last step: step_tf() on mechanic, for term frequency. Now I have a column for each mechanic, with how many times it occurred; it's basically a count and a spread, a one-hot encoding of the 20 most common mechanics. Why 20? Later we'll tune it; I was just picking something. I'm going to keep this prep/juice viewing code around; it's really helpful.

Now let's add mechanic in. The problem is that I tokenized based on words, which I don't want; I want token = "regex", and I believe the pattern is a comma followed by a space. After sorting out the syntax (it goes in options = list(pattern = ...); I'd forgotten an equals sign), I rerun this, and now I have full mechanic names like hex-and-counter, modular board, and partnerships. I like that much more: I don't want "action point allowance system" broken into words, I want it to be one mechanic. So that's how I can customize the tokenization to split on a regex instead of words.

Let's bring this back in: the linear workflow, fit_resamples() on the 10-fold training data. I'm still working out the most agile way to organize these objects, so I'll keep the tuning in a separate one. And I forgot to note it down earlier: adding the extra terms with the spline on year got us to 0.215 with five degrees of freedom, and adding mechanic gets us to about 0.211. Now, instead of max_tokens being 20, let's make it tunable: I'll try something like 3, 10, 30, and 100, and rename the result object to "tuned" so I stop calling it cv.
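Roughly what the textrecipes tokenization steps look like, based on the narration; the predictor list is abbreviated and this is a sketch rather than the exact on-screen code:

```r
library(textrecipes)

lin_rec <- recipe(geek_rating ~ owned + num_votes + avg_time + year + mechanic,
                  data = train) %>%
  step_log(owned, num_votes, avg_time, base = 2, offset = 1) %>%
  step_ns(year, deg_free = 5) %>%
  step_tokenize(mechanic, token = "regex",
                options = list(pattern = ", ")) %>%      # split on ", ", not into words
  step_tokenfilter(mechanic, max_tokens = tune()) %>%    # how many mechanics to keep
  step_tf(mechanic)                                      # one column per retained mechanic

tuned <- workflow() %>%
  add_recipe(lin_rec) %>%
  add_model(lin_spec) %>%
  tune_grid(train_fold,
            grid = crossing(max_tokens = c(3, 10, 30, 100)),
            metrics = mset)

autoplot(tuned)

# To inspect what the recipe produces, swap tune() for a fixed value and run:
# lin_rec %>% prep() %>% juice()
```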
Cool. One thing I can see is that more tokens is generally better; we're going to hit overfitting at some point for sure, but it looks like we haven't quite hit it. We have hit diminishing returns, but there may still be value there, and what's great is that this tokenize/tokenfilter/tf approach will work for the other fields too. At 300 tokens it's not any better than 100; that could mean there's just no more value there, but it could also be overfitting. So adding mechanics, with 100 tokens, gets us to about 0.207.

We're going to do a lot better than this, but first, instead of lm, let's use glmnet, which is what we call penalized regression, and set penalty = tune(), because I want to try a bunch of different penalties. The fun thing about the penalty is that trying additional values doesn't add time: glmnet fits all the penalties at once, so it won't slow me down to try a ton of them. So let's write 10 to the power of a sequence; that gives a whole range of penalties between about 10^-6 and 10^-0.5. I'll start with just a few values to check that this works. The story is that I'm trying different lambda values for the penalty parameter; glmnet's penalized regression helps prevent overfitting. What this shows is that there isn't really any overfitting happening at 3, 10, or 30 tokens: as the penalty shrinks, the RMSE keeps dropping and never really starts going back up, so having too low a penalty isn't hurting here, or only a tiny amount. What if I add 100 and 300 tokens, what happens then? I'm also going to fix that exponent range so it runs to -1 rather than -0.1.
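A sketch of the switch to glmnet with a tuned penalty; the grid values below approximate the ranges mentioned in the narration rather than reproducing them exactly:

```r
lin_spec <- linear_reg(penalty = tune()) %>%
  set_engine("glmnet")                 # penalized (regularized) regression

tuned <- workflow() %>%
  add_recipe(lin_rec) %>%
  add_model(lin_spec) %>%
  tune_grid(train_fold,
            grid = crossing(max_tokens = c(30, 100, 300),
                            penalty = 10 ^ seq(-6, -1, by = 0.5)),  # glmnet fits all penalties in one pass
            metrics = mset)

autoplot(tuned)
collect_metrics(tuned) %>% arrange(mean)   # lowest cross-validated RMSE first
```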
This is going to be a little slow, since with 100 or 300 tokens it's training bigger models. I'm not really seeing any sign that too low a penalty hurts here, maybe a tiny amount, so I'll adjust the penalty range a bit. Actually, wait: how many mechanics are there in total? Only 52. Silly me, we don't even need a token filter for mechanic; that's why 100 or 300 tokens doesn't get any worse. I should have checked, but I'll leave the filter in anyway.

I'm really curious about designer, so I'm going to try the same approach there. There are 1,938 designers. I'll put designer into the same step_tokenize() call, not bother filtering mechanic, and filter designer instead, then try out a bunch of different max_tokens values for designer along with the penalty. Designers are also comma-separated, so the approach from mechanic just works for designer as well. The first attempt fails; what happened is I didn't add designer to the recipe formula. (I'm going to get to tree models soon, I promise; I just like to experiment with linear models first, because it helps me understand how important each of these steps is.)

Aha: one thing I see is that you can do better with a penalty term and get down to about 0.205 with designer. You can do better with 100 designer tokens than with 30, but even better when you also set a penalty parameter. I'll extend the penalties up to 10^-2 and try 200 and 300 tokens, then kill that run; 200 or 300 doesn't seem any worse, and maybe not any better, while 3 or 10 doesn't seem very good. What I'm asking now is: what threshold should I set for how many designers to include, and what penalty should I use? And look at that: keeping about 100 designers plus a penalty parameter beats simply adding more designers. This is where tuning pays off: I can see the advantage comes at around 100 designers. I could fiddle with 80 or 120, but I won't bother. Let's look at the metrics: collect_metrics(), arranged by the mean (not the estimate). So adding designer, with a penalty, gets us to about 0.205. Good to know.

That's those two, but I still haven't added in the categories. That's actually the last step, and I think it's the most complicated, because I have to make a choice: do I treat category 1, category 2, category 3 as separate columns, or do I pool them all together? I'm going to try pooling them all together.
One quick check first. What I really want to do in this training set is combine all of the category columns together; I don't like that they're separate, because it makes them hard to work with. (By the way, I totally forgot to include age. What happens when I include it in my best recipe so far? Should I log it? It doesn't seem to improve things much either way, so I'll leave it in as is; we're still at about 0.205. Tree models are coming soon, I promise.) So I'm going to choose 100 designer tokens (I could do a proper finalize step, but I'll just hard-code 100) and leave in the penalty term; we'll always want to tune a penalty when we're doing a regularized linear model.

All right, category. I'm going to do something I haven't done yet, which is take my full data and operate on it directly: unite() all the columns that start with "category" into a single column called category, with sep = ", " and (after hunting for the argument) na.rm = TRUE. Why do that? Because now the categories are in the same format as my other fields like mechanic. That's a preparation step I can't easily do inside a recipe (maybe with a long step_mutate), so I'll do the same thing to the holdout and re-run the whole pipeline. I'm calling it category, singular, because I've used singular names for mechanic and designer everywhere else. Counting with separate_rows() shows 84 categories, so I might not even need a max_tokens threshold there; I'll leave it off and hopefully penalization takes care of it. I will still use the token filter on designer. So: add category to the formula, tokenize it, and, after fixing an error (I'd forgotten the step_tf for it), here's what the data looks like. It is wide: I'm constructing features for mechanic, for designer, and for category (things like category world war ii), and using all of these terms. After fixing a leftover "categories" name and re-running the fold set (I should really set a seed before doing that), I've added 84 more features to the model.
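A sketch of the unite() preparation step and the re-split, as narrated; the seed and object names are assumptions:

```r
# unite() isn't a recipe step, so it's applied to the raw data before splitting
full_data <- full_data %>%
  unite(category, starts_with("category"), sep = ", ", na.rm = TRUE)
holdout <- holdout %>%
  unite(category, starts_with("category"), sep = ", ", na.rm = TRUE)

# how many distinct categories are there? (~84 in the narration)
full_data %>%
  separate_rows(category, sep = ", ") %>%
  count(category, sort = TRUE)

# re-create the split and folds after modifying full_data
set.seed(2021)
spl        <- initial_split(full_data, prop = 0.8)
train      <- training(spl)
test       <- testing(spl)
train_fold <- vfold_cv(train, v = 10)
```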
With those, we get down to about 0.202. At this point I'm using the vast majority of the data: technically I'm using avg_time rather than min or max time, but I don't think that makes a difference, and I'm using a spline term on year. I've still got the game names, which I could parse, but I'm not going to go any further with this data. The obvious next move is to change models, because the truth is a linear model only gets one term per feature. I'm going to start using a random forest, my favorite second model.

Any questions on the linear part? I see a few; I might jump back to some later. One: is there a way to avoid data leakage when splitting and standardizing? Yes; that's exactly what I've been doing here with the fit_resamples() and tune functions, which re-apply the recipe within each resample. Another: why tidymodels rather than caret? tidymodels is newer, and they're largely created by the same person: Max Kuhn, who was the original developer of caret, now leads the tidymodels team at RStudio. It's meant to be the next generation, caret in a tidy framework, and it's what's generally recommended these days; in the long term it will replace caret.

All right, random forests. Oh, I'd put this note down here: adding category got us to about 0.201. It's getting pretty hard to keep improving; I'm sure we could keep transforming things and adding splines, but instead we're going to jump ahead to trees. For that we use rand_forest(), in regression mode, not classification. There are two parameters I'm going to want to tune right off the bat, so here we go with the random forest spec. I'll start with a similar recipe, though I'm not going to pipe from the linear one. You generally don't need to log-transform predictors before throwing them into a random forest; because it's tree-based, it doesn't care about relative sizes. (This isn't the video where I explain how random forests work; here we're focusing on how to use them in tidymodels, and there are a lot of resources out there if you're interested.) Similarly, the spline isn't something I need to care about for a random forest, and I don't have missing data to impute. I threw age in too. One thing I will say, though: I've found that with random forests you don't want to start with tons and tons of features, because they tend to get diluted, so we might want a token filter on mechanic and designer. Useless features aren't a big problem for linear models, especially with a penalty; the worst a useless feature can do is make you overfit a little, and a lasso-style penalty will tend to set it to zero anyway. But they are a big deal in random forests.
A lot of your randomly chosen trees are going to include them, and it makes the whole thing slower. So in the random forest recipe (not the linear one) I'll keep a token filter. If I prep and juice the random forest recipe, here are the terms I'm working with: owned, num_votes, avg_time, and so on. Now let's apply a random forest: the rf workflow is workflow() plus add_recipe() with the rf recipe plus add_model() with the rf spec. Then tune_grid() on the training folds; oh, and I'd neglected to call set_engine("ranger"). For the grid I pass crossing() of trees and mtry, with metrics = mset so we keep tracking RMSE. I like to start with not a lot of trees, just to get a first sense of it, and bump it up later; for mtry, the number of randomly chosen predictors for each split, I'll start with two to five. I'll call this rf_tune, and that's the code for fitting a random forest: I specify geek_rating explained by all of these and then autoplot(rf_tune). It is slower than a regression; I'm considering switching to five-fold cross-validation instead of ten-fold after this. That's not ideal, because each fold has less data and you're averaging over fewer resamples, but sometimes it's worth it when you want to do a lot of tuning on a time budget.

Here's what I'm seeing: more trees isn't giving a lot of benefit, but a higher mtry really is. So I'm going to zero in on a slightly different setting, say mtry from three to eight; I can probably drop the two. Running through that, notice it already gets better than the 0.21 we had before. I'm really considering that five-fold switch; this can be so slow, and I've only got four cores in this computer (a different computer with more cores might be the way to go). A good question from the chat: are you using multiple cores? Yes; that's why we ran doParallel::registerDoParallel() at the top, and once that's registered, tune_grid gets the parallelism essentially for free (I did lose one run to a restarted session and had to run it again). And look at that: it just keeps getting better and better. That's not crazy, because this is a wide dataset, remember, something like 38 columns, so there may be a lot of terms that each cover a different slice of the data, and more trees might help too. We're already in much better territory than the linear model. I want to push further, but I'm worried it will be really slow, so I'm going to switch the folds to vfold_cv with v = 5.
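A sketch of the random forest spec, recipe, and tuning grid described above; the predictor list and grid values are approximations of what's narrated, and column names are assumptions:

```r
rf_spec <- rand_forest(mode = "regression",
                       mtry  = tune(),
                       trees = tune()) %>%
  set_engine("ranger")

rf_rec <- recipe(geek_rating ~ owned + num_votes + avg_time + year + age +
                   min_players + max_players + mechanic + designer + category,
                 data = train) %>%
  step_tokenize(mechanic, designer, category,
                token = "regex", options = list(pattern = ", ")) %>%
  step_tokenfilter(mechanic, designer, category, max_tokens = 10) %>%  # keep the feature set narrow
  step_tf(mechanic, designer, category)
  # no log or spline steps: trees don't care about monotone transformations

rf_wflow <- workflow() %>%
  add_recipe(rf_rec) %>%
  add_model(rf_spec)

rf_tune <- rf_wflow %>%
  tune_grid(train_fold,
            grid = crossing(trees = c(100, 200),
                            mtry  = 2:5),
            metrics = mset)

autoplot(rf_tune)
```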
Notice that with five folds each analysis set has around 2,200 rows and we're averaging across fewer resamples, so it's a little less data, which means the estimate is a little farther from our true out-of-sample accuracy, but it will be roughly twice as fast, and I feel like I need that. I'm going to go all the way up to mtry = 12, adding extra mtry values. This is how great tune is: I can just keep adding values, say I want to increase the number of randomly selected predictors, and get a really pretty graph out of it.

And a great suggestion from the chat: I might be interested in which variables are most important. I can get that by adding an importance argument for ranger; the way to do that, I'm reminding myself, is set_args(), and I think it's importance = "impurity", the Gini impurity, which is a way to measure which splits are the most valuable. Thanks for that suggestion; I'll take a look at the importances later.

Here we go. With a lot of trees it's kind of leveling off, but not entirely; we're still getting better, so much so that I'm starting to wonder whether 100 trees is enough and whether mtry should run from six up to fourteen. Part of the trick with a clock running, I've found (I've tried timing myself, though I haven't live-streamed it before), is that you always want to have something running while you talk about the plan for the next step. So now we're south of 0.18, which is cool; that would pop us into maybe the top half of the leaderboard, with a random forest at mtry 12, though this is still cross-validation within the training set. I'm curious what the difference between 100 and 200 trees will be, and remember there's a lot of random noise here: random forest has "random" in the name, so it won't always give the same fit; that's something else to watch out for. I'm also thinking about tuning max_tokens along with mtry, which is definitely a bit slower; remember I've only taken the top 10 mechanics and designers (plus the categories), and we definitely saw there was value in a larger set than that. So I wait, and: we're below 0.175 with mtry = 14 and 200 trees. Let's check collect_metrics(), arranged ascending (lowest is best); the best we have is around 0.17 in cross-validation. First thing to note: I'm going to spend two hours on this today (you're welcome not to stay for the whole thing) because I want to simulate the SLICED experience, and I still haven't uploaded any submissions.
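Roughly what the five-fold switch and the importance argument look like; passing importance through to ranger via set_args() is how the narration describes it, and the grid values are approximations:

```r
train_fold5 <- vfold_cv(train, v = 5)   # fewer folds: faster, slightly noisier estimates

rf_spec <- rf_spec %>%
  set_args(importance = "impurity")     # engine argument passed through to ranger

rf_tune <- rf_wflow %>%
  update_model(rf_spec) %>%
  tune_grid(train_fold5,
            grid = crossing(trees = 200, mtry = 6:14),
            metrics = mset)

autoplot(rf_tune)
```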
So this would be a good time to submit. But do I want to tune a little more first? I go back and forth, but yes, I'll tune a bit more. With mtry at 14 it just keeps getting better, even with only 10 max tokens. So let's fix 200 trees and try mtry at 10, 12, 14, 16, and 18; I'm really curious how good this can get. Remember these differences are pretty small, 0.185, 0.180, 0.175, but that is enough to move you into the top 10 or so. While that runs, I'll talk about what I'd do next. It really is remarkable: when I'm practicing, I usually see that an mtry between two and five is optimal, but here I think it's because we have all these binary category columns, with the levels spread across many dummies. And it just keeps getting better; these are small gains, but they aren't tailing off yet, so let's try mtry at 10, 15, 20, 25, and 30 and find out where it levels off. To get to something like 0.16 we're going to have to try something else beyond this, because each step here is only gaining us a little.

While that runs, a question from Juwon in the chat: would I consider streaming while doing SLICED? Absolutely. Two things: SLICED itself is going to be live-streamed on Nick's channel, and, ah, finally we hit a point of diminishing returns: after mtry = 20 we don't see real value. I'll stick with 20 for now, leave that as the only thing being tuned, and also ask about 100, 200, and 300 trees at mtry = 20. (There's a function called tune_bayes(); I've gotten mixed results out of it, though I'll probably try it again soon; it does iterative tuning rather than a fixed grid.) So: does it matter how many trees we have once mtry is 20? It looks like not much, so I'll keep it at 200; it's also faster that way.
Sorry, and thank you for the correction: it's Nick Wan's channel; thanks, Nick, and hi Megan, thanks for joining. What I was saying is that while Nick streams it, you won't hear the contestants' voices; Meg and Nick will be going around showing each person's screen and giving commentary. I'm also planning to live-stream my own thought process, like this, though probably with less explanation and more fast movement. So if you'd like to watch the channel in a couple of weeks, you absolutely can.

OK, we're at 0.174, and we still haven't tuned this really important max_tokens. For now I'm going to tune it for all three fields at the same time, even though that might not be quite the right approach. What if I try 3, 10, and 30 max tokens, along with the mtry values I've been considering? First I just want to know how three compares: keeping only the top three mechanics, designers, and categories versus the top 10 or top 30. That's what I really love about tune: notice I changed almost no code to get this. One thing that is a shame with this approach is that I'm changing my code rather than adding to it, which means I lose my old results. Aha: retaining more tokens is actually losing signal here; look at that, 30 tokens is worse than 10, and the best was three tokens. Can I do zero? Let's find out whether it gives me an error. Either way, I'm not going to use 30.
So, as I mentioned before, the trouble with random forests is that adding lots of low-signal variables dilutes the trees and makes them less valuable. 0.173 is about as good as we've gotten, and look: zero retained tokens is actually the best among these. They're very, very close, but the tokens certainly aren't adding anything to the random forest; you actually do best when you ignore them completely. I have two thoughts on that. One is that these 0/1 token columns might just not be the right representation for a random forest. But what might be right, now that I think about it, is the first category: trees are a little better at handling one categorical variable than handling many 0/1 dummies. So what I'm going to do is leave category 1 in as its own column. Check this out: in the unite() call I'll add remove = FALSE so category 1 survives alongside the combined category field, then re-run the split and my folds while I'm at it. Why? Because now I keep category 1, category 2, category 3, and so on. I think nearly everything has a category 1, but I'll add a step_other() for it, to lump rare values into an "other" level. Counting category 1 (not the combined category field; I grabbed the wrong one at first) shows 77 distinct values.
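A sketch of this (ultimately abandoned) category-1 experiment; the column name category_1, the predictor list, and the reuse of the earlier spec and folds are all assumptions based on the narration:

```r
# keep the original category columns (remove = FALSE) so the first one stays available;
# as noted just below, it then has to be excluded from the united field to avoid duplication
full_data <- full_data %>%
  unite(category, starts_with("category"), sep = ", ", na.rm = TRUE, remove = FALSE)

rf_rec2 <- recipe(geek_rating ~ owned + num_votes + avg_time + year + age + category_1,
                  data = train) %>%
  step_string2factor(category_1) %>%           # ranger wants a factor, not a character column
  step_other(category_1, threshold = 0.001)    # lump the rarest first-categories into "other"

rf_tune2 <- rf_wflow %>%
  update_recipe(rf_rec2) %>%
  tune_grid(train_fold5,
            grid = crossing(trees = 200, mtry = 2:6),
            metrics = mset)
```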
I'm going to throw in that step_other() anyway, with a threshold of 0.001, so anything rarer than that gets lumped, but category 1 stays a single categorical variable. The idea is that I'm throwing in category 1 because I wonder whether it's important; we'll find out later from the importance scores. For now I'm basically filtering the tokens out completely, ignoring mechanic, designer, and category. Oh, and I just realized something: if I'm keeping category 1 separate, I need to leave it out of the united column, otherwise it's redundant; so unite everything that starts with "category" except category 1. That was a complicated little maneuver, but the point is to keep category 1 on its own and have the combined category field cover the other columns.

So we're at about 0.173, and now let's find out what I did wrong, because I'm getting an error around category 1. Is there missing data? Is it step_other()? Checking the notes in rf_tune... I've hit this error before and it's always puzzling. I think I might need step_string2factor() on category 1 if we're going to keep it. Nope, still failing. Do I need to turn it into a dummy? But if I make it a dummy it's no better than the token columns. OK, at least I now know the problem is in the recipe, and it still happens even if I drop category. Oh, there it is: it doesn't like max_tokens = 0. Fine, I'll leave max_tokens at 1; that removes most things anyway and just leaves in one extra term. Thanks for your patience. And category 1 "does not exist" because I didn't add it to the formula.

One hour over; we're halfway through, at about 0.173. Let's see where we can get in the next hour. With category 1 thrown in it's still around 0.174, so it didn't really seem to help. You know what, I'm just going to drop it; I think that was an over-complicated idea. Oh, wait, I'd left in that threshold, which may be why it wasn't useful. Let me re-run with that fixed: 0.175, still not gaining.
I could tune that threshold on category 1, and actually I will for a second. I do have some additional ideas; I just wanted to explore this one, though I should get out of the habit of digging into one thing for too long if I want to do well in this competition. And, yeah: the higher the threshold, the better it does, which means this idea was a dead end. Don't keep category 1 alongside everything else; I can't say it's necessarily worse, but it certainly doesn't look better, and I can drop the threshold.

So the story is that the best random forest we have so far is that 0.173 or 0.174 kind of situation. Now let's actually check it against the test set; remember, I have this test split that I haven't touched once, no hyperparameters tuned against it, nothing. I can do finalize_workflow() on the random forest workflow with trees = 200 and mtry = 20, and then there's a function called last_fit() that trains on the training set and evaluates on the test set; you give it the split object, the one that holds both training and testing, and then collect_metrics(). Hmm: "20 columns were requested but there are 10 predictors in the data." What does that mean? And the test RMSE is worse; well, it's around 0.18 versus 0.174, so not that much worse, but it is worse. That's an indication I should improve this before I upload anything.

So what is "20 columns requested"? Julia, if you're still on, any idea? (And to answer a chat question: is this normal speed? Yes, this is happening live; I just talk this fast.) Let me dig in. If I finalize the workflow and fit() it on the training data, I get the same warning: 20 columns requested, 10 predictors. What if I prep my recipe and juice it to see what the processed data looks like? There are 11 columns, one of which is the outcome, so 10 predictors. So where does 20 come from? If I fit the rand_forest spec directly on the prepped training data, with geek_rating explained by everything, it's the same thing. Oh, thank you, yes, Joe, that's exactly right: I set mtry = 20.
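A sketch of the finalize/last_fit check on the held-back test split, with the parameter values from the narration (and assuming the recipe at this point keeps only one token per field, as narrated):

```r
rf_final <- rf_wflow %>%
  finalize_workflow(list(trees = 200, mtry = 20))

rf_final %>%
  last_fit(spl, metrics = mset) %>%   # fit on the training split, evaluate on the held-back test split
  collect_metrics()

# ranger warns "20 columns were requested but there were only 10 predictors":
# with max_tokens = 1 the prepped data has ~10 predictors, so mtry = 20 can't be honored;
# mtry needs retuning whenever max_tokens changes
```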
Oh wow, and what's really interesting is that here I was asking for 20 variables, but in fact decreasing the number of tokens made the model even better. Okay, so this is really cool. Here's something I'm realizing: max_tokens should probably be changed along with mtry, so let's try tuning this one more time. I'll say mtry of 5, 10, 15, thank you, that's exactly what's happening, and I'll also change max_tokens to 1, 3, 10 and see what the models look like. What I'm realizing is that mtry and max_tokens are closely tied together. I hadn't seen that warning before; I didn't see it earlier because I'd had a large number of tokens and I was choosing mtry based on that larger predictor count. Maybe it turns out it's even better to have fewer. Now, I said before I was going to have a different approach, and the approach is that I think I'm probably going to go to an ensemble model, combining the linear regression (glmnet) and the random forest. The reason is that they seem to be taking advantage of different things: the random forest is a lot better than the linear model even when it doesn't touch the tokens at all. Yeah, 0.17... aha, there it is: when you don't take any tokens, it doesn't matter how many you asked for. There's a relationship between these two, and that's why when I used more tokens I needed a larger mtry. Okay, this is actually really refreshing, because it means I can do max_tokens of 1, since 10 is pretty bad, so I'll try 1 or 4 tokens, and mtry from 2 to 6. I'm noticing mtry 3 with a lot of tokens beats it by a little bit, so maybe there's some trend, maybe there's a nice middle ground with a few tokens and a certain mtry; it makes sense that these two would be linked. Notice that I'm using max_tokens across mechanic, designer, and category together, which might not be the right approach: a particular designer is really sparse and the top few don't appear much, while mechanic and category had a lot more. But what I'm doing is either 1 or 4 tokens and mtry from 2 to 6, and let's see what that curve looks like; we might be able to get this a little bit better. So far it looks like the best we have is mtry 5 with one retained token; heck, I bet zero retained tokens would be even better. So what if I try dropping all three of the token columns? Actually, let me first see mtry 2 to 6: the best is 0.175, and so much better than four tokens, wow. Okay, what if I ignore the tokens completely? I try to add things one at a time but I keep forgetting. Let me leave age in here and try completely dropping the tokens; it's also going to be a little bit faster since it has fewer terms. Now mtry 2 to 6; we're trying on a smaller set, so all parameters are up for grabs. So I'm just trying it without tokens.
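A sketch of tuning mtry and max_tokens jointly, which is the idea described here; it assumes both were marked tune() in the model spec and the recipe, and that `folds` is the cross-validation object from earlier in the session:

```r
# Joint grid over a recipe parameter (max_tokens) and a model parameter (mtry),
# since the useful mtry depends on how many token columns survive
rf_grid <- crossing(mtry = 2:6, max_tokens = c(1, 4))

rf_tune <- rf_workflow %>%
  tune_grid(resamples = folds,
            grid = rf_grid,
            metrics = metric_set(rmse))

autoplot(rf_tune)   # one curve per max_tokens value, RMSE versus mtry
```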
Now, here's what I was starting to say: the reason I want to do an ensemble model is that there will be advantages to each side. The linear model gets to take all these tokens into account, and we saw that improved the linear model, so by averaging, by combining these two predictions, maybe we can do better than either model by itself. That's going to be my next thing. There are two additional tricks I'm thinking of for today, and one is an ensemble method combining linear and random forest. That's one reason it's nice to create a linear model first: even if the random forest is a lot better, I sometimes get extra value by combining linear and tree models. What I'm seeing is that the best I got is something like mtry 4 and 300 trees; sure, we'll go with that, and let's try it on the test set. Sorry, I have a new version that doesn't use the workflow, and I have four predictors and 300 trees; I think the exact values are random chance at this point. So let's do last_fit; this will be mtry 4 and 300 trees as maybe our final model. Oh man, even worse than my previous one. Oh well, sometimes you overfit your parameters; that's fine, it's worth knowing. So it does worse on the test set than it did in cross-validation. I'm not supposed to do tuning here, but I'm really curious whether it gets any better if I change mtry to 6; I'm really not supposed to be doing what I'm doing right now. It doesn't get better, and there are only seven predictors, so I think it's random chance around these values; I'm not going to do any better. So mtry 4 with 300 trees is maybe a reasonable model: I get an out-of-sample RMSE of about 0.182 on the test set. It was better in cross-validation, but that might be because I was over-optimizing on that one data set. I could upload it and check, and I will be uploading later, but first let's try an ensemble method between these models. What I'm going to do is pick the best random forest and the best linear model. So rf_chosen is rf_workflow, and I already figured out what the best configuration was on the original tuning. Where was I... I was here, and I chose max_tokens = 100, and sorry, it was tuned, so this was the best linear model, which was a little above 0.2; notice that's a worse model than the random forest, which got 0.18. So the chosen linear model, lin_chosen, is going to be lin_workflow, then finalize_workflow, passing in select_best to pick it automatically. Okay, and what is the performance of that one? Well, I already did last_fit on rf_workflow, but I'll do it here anyway: the chosen random forest, last_fit from train to test, gets 0.182; the chosen linear model on the test set gets 0.204, a worse model. Now I'm going to ask what happens if I blend the two together. This is done with stacks, the area where I have the least practice, so I'd be excited if people have improvements to suggest.
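Roughly, pinning down the two "chosen" models and comparing them on the same split might look like the sketch below; lin_tune, lin_workflow, and split are assumed names following the narration:

```r
# Chosen glmnet: take whichever penalty did best in cross-validation
lin_chosen <- lin_workflow %>%
  finalize_workflow(select_best(lin_tune, metric = "rmse"))

# Chosen random forest: values picked by hand from the tuning curves
rf_chosen <- rf_workflow %>%
  finalize_workflow(list(mtry = 4, trees = 300))

# Evaluate both on the same held-back test set
bind_rows(
  last_fit(rf_chosen, split)  %>% collect_metrics() %>% mutate(model = "random forest"),
  last_fit(lin_chosen, split) %>% collect_metrics() %>% mutate(model = "glmnet")
)
```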
Okay, so what we're going to do: the stacks package has a stacks() function that creates a stack, then add_candidates(lin_chosen)... nope, it wants tuning results, or a workflow set, and I don't have a workflow set. Oh, I see, a workflow set does all the combinations; that's going to be too much to mess with. So what I actually want is to take the tuned results for each model: I need to take rf_tune and call filter_parameters on it. I remember this now, I need to create the tuned object for it, and it also hit me that I forgot something else that I'll bring in in a second. Here I'll pass parameters as a list with mtry = 4 and trees = 300; I don't know why the list form works and the other doesn't, but whatever. And the thing I realized I need to do in these tuning calls is pass control = control_stack_grid(), which keeps the predictions from each of these models, so I'll add that to rf_tune too. I think it's a good example of how you can get really deep into improving parameters down to every 0.001, but the truth is there's a lot of noise here, especially since I'm doing five-fold cross-validation; I'd prefer ten-fold, and on another run it maybe seemed better, maybe worse. You can really get too in the weeds on that. So here's what I'm doing: I choose some tuned sets, a linear model and an RF model, and now I can call add_candidates for each. add_candidates(rf_chosen) is taking a minute; it's rerunning that combination, and I'm choosing the candidates manually in this case. And I remember what comes next: I usually call the result "ensemble", and then there are a few more steps. All right, so here's my ensemble: if we run this, the ensemble now has these two model configurations that I went in and chose. Then I take the ensemble and call blend_predictions(), which decides how much weight to give each of these, and combines them. So if I take a look at this, here it is: they each get a weight, the random forest about 0.75 and lin_chosen about 0.28. That actually makes sense: the random forest gets more weight because it's the better model, but it looks like we're getting some value out of lin_chosen too. And now I do need to fit the individual members; fit_members() is the function that does that. Just work, just work; sometimes I run into a couple of problems with this function. All right, now here's my ensemble fit, and the great thing is that it acts like a workflow: I can actually use it to predict things. I can predict the ensemble fit on our test set and get out my set of predictions.
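Here is a minimal sketch of that stacking flow, assuming the workflows, grids, and folds from earlier; the key detail mentioned above is that add_candidates() needs tuning results whose out-of-fold predictions were saved via control_stack_grid():

```r
library(stacks)

# Re-run the tuning with the stack-friendly control so predictions are kept
rf_tune  <- tune_grid(rf_workflow,  resamples = folds, grid = rf_grid,
                      control = control_stack_grid())
lin_tune <- tune_grid(lin_workflow, resamples = folds, grid = 10,
                      control = control_stack_grid())

ensemble_fit <- stacks() %>%
  add_candidates(rf_tune) %>%
  add_candidates(lin_tune) %>%
  blend_predictions() %>%   # learns a non-negative weight for each candidate
  fit_members()             # refits the retained members on the training data
```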
Even better... is there an augment() method? Nope. So I'll bind_cols with the test set, and now I've got the prediction next to the data, and I compute rmse, which takes the truth first, the true value being geek_rating, and the estimate being .pred. And notice that when you combine them it actually did better: the random forest alone had done 0.182, and the blend does 0.176, so let's combine the two. Okay, now I'm going to upload it and see how it actually does on the holdout data. Before I upload, though, here's a bit of a hack: I want to fit these members not just on the training data, which is only 80% of the data, but on the test data as well, so that when I upload I'm using as much data as I can. So, I think I remember this: there's the training data inside the blended object, so I'll make ensemble_blended_train, which is kind of ridiculous, equal to the full data. Actually, no, I don't want to do it that way, because then the next step gets weird. What I'm going to do is make ensemble_fit_full: the point is I want to train on all the observations I have, I don't want to give up twenty percent of them. So I'll take the blended ensemble, replace its training data with the full data, and call fit_members() on that. Sorry, I'm getting a little lost; the story is I'm going to fit on the full training data instead of the eighty percent. Here we go: now here's my full fit, and now I'm going to predict on the holdout set. For the holdout predictions, bind_rows... no, bind_cols is what I meant. What does the sample submission look like? I didn't even check: game_id and geek_rating. Okay, so I'm going to select game_id and set geek_rating equal to .pred; those are my predictions. All right, I have 1,500 observations, and that seems right. So attempt one is an ensemble combination of a lasso (glmnet) model with a random forest; let's see how we do. I'll write_csv, just putting it on the desktop; I need to get a little more organized. I really want to use the kaggle R package for this, but for now, here we go: "an ensemble combination of rf and glmnet." All right, so the score was actually a lot worse than I expected, worse than it was doing on the training set, and I wonder why. Where am I... oh, it's a late submission, so it's not going to count, but it would have put me in the bottom half. Hmm, I wonder if I should try just the random forest, or adjust the linear regression. No, I don't think I'm going to try that. All right, let's think about what other ways we might go about improving this, because that was really a lot worse than I originally expected.
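As a sketch, scoring the ensemble on the held-back test set and writing a submission file might look like this; the holdout column names (game_id, geek_rating) follow the sample submission described above, and the file name is just an example:

```r
library(readr)

# Out-of-sample check on the 20% test set
ensemble_fit %>%
  predict(test) %>%
  bind_cols(test) %>%
  rmse(truth = geek_rating, estimate = .pred)

# Predictions on the competition holdout, shaped like the sample submission
ensemble_fit %>%
  predict(holdout) %>%
  bind_cols(holdout %>% select(game_id)) %>%
  select(game_id, geek_rating = .pred) %>%
  write_csv("attempt1.csv")
```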
All right, the next thing is actually xgboost. Boosted trees are like random forests, and the random forest code is a good starting place, so I'm going to copy some of it: here we go, xg_spec is boost_tree(), and I'll set mtry, starting as always with 4, and tune the number of trees; we're probably not going to change both of these at once. Something I forgot to do earlier was look at the importance of the variables; I'll get back to that in a bit. So let's look at our xg_spec and the xg recipe, which includes owned, num_votes, average time, and so on. I wonder whether removing min and max players will help, because we know those aren't that important; actually, there's a good way to check that, which is to ask the model about variable importance. The way to do that is to fit a model, so I'm going to fit my full one on the training set to get rf_fit and see which variables were most important. The fun thing about ranger is that there's an importance function I can apply to the fit. Oh wait, it's not the workflow fit itself, it's the underlying ranger object; ranger::importance(), and you have to dig several levels deep into the fit, which is weird. In any case, what this shows is that num_votes is very important, owned is important, and min and max players are not really important at all. min_players not being important seems pretty plausible, so much so that I'm going to try removing it and see how we do. It's not going to make a big difference, but let me try removing min_players while keeping the rest. Did I choose 200 trees? No, I chose 300.
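Here is a sketch of pulling variable importance out of the fitted random forest, assuming the ranger spec in rf_workflow was created with importance enabled (e.g. set_engine("ranger", importance = "impurity")); otherwise ranger returns nothing to rank:

```r
# Fit the finalized random forest workflow on the training data
rf_fit <- fit(rf_workflow, data = train)

# Dig down to the underlying ranger object and rank the predictors
rf_fit %>%
  extract_fit_engine() %>%     # the raw ranger fit inside the workflow
  ranger::importance() %>%
  sort(decreasing = TRUE)
```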
I think I did 300 for the chosen one. I'm curious what happens if I drop min_players there too, because remember, dropping unimportant variables can actually improve a random forest; just curious. But for xgboost I'm definitely going to drop min_players. So for the xg recipe, what I want to do next is take the spec and apply the same tuning code; let's find it. I'll talk about what gradient boosting actually is while the setup is running. trees is tuned over 100 and 150, and I also need to tune the learning rate, which is going to be important; I'm just trying to get an idea of the landscape here. xg_workflow... oops, I forgot that next step: the workflow is workflow() plus add_recipe plus add_model. For the learning rate, is the tuning parameter not called learning_rate? Let's check... it's learn_rate, that's better. All right, let's see what we got. One thing I'm learning from this experience is that I should try uploading a bit earlier, because I only have 20 minutes left and nine submissions I could still use if I wanted to. So this plot shows that I definitely need a higher learning rate, otherwise it's an absolute disaster. We'll try 0.1, 0.2, 0.3, and the trees... it's hard to tell whether they matter because the plot is so zoomed out. Okay, it looks like when the learning rate was a little lower, more trees were better, which makes a bit of sense. Now I'll try 0.01, kind of feeling my way out, with 100, 200, 300 trees. So what does boosting do? It's as if you could say "here's what the previous trees got wrong" and then train just on that: each time it fits a tree, it trains on the errors the previous trees made. One thing I'm seeing is that more trees with a lower learning rate is the best so far, so let's lean into that: I'll do a sequence from 0.02 to 0.1 for the learning rate, and make sure I have enough trees, say 200, 400, 600. The story here is that the number of trees is how long we keep running the algorithm, how many rounds of this correction we do. Oops, I totally forgot to set the second argument. It does look like more trees are better. What's happening is it trains a tree, then trains on the errors from that tree, then the errors from that model, and so on; learn_rate is how much the model changes with every iteration. So if your number of trees is low, you might need a higher learn rate, and our number of trees is not very high yet, which is probably why we've been leaning on relatively high learning rates; I really just wanted to get a feel for this. Here we go: 600 trees with a low learning rate is pretty good. So I'm going to zoom in: learning rates from 0.01 to 0.03, and trees of 500, 750, 1000; that's going to be kind of slow. Okay, let's see.
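For reference, a rough sketch of the boosted-tree spec and the trees-by-learn_rate grid search described here; xg_recipe and folds are assumed from earlier in the session, and note the parsnip argument is learn_rate:

```r
# Boosted tree spec: fix mtry, tune the number of rounds and the learning rate
xg_spec <- boost_tree(mtry = 4, trees = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xg_workflow <- workflow() %>%
  add_recipe(xg_recipe) %>%
  add_model(xg_spec)

# More trees generally pair with a lower learning rate, so tune them together
xg_tune <- xg_workflow %>%
  tune_grid(resamples = folds,
            grid = crossing(trees = c(500, 750, 1000),
                            learn_rate = c(0.01, 0.02, 0.03)))
```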
While this runs: it's clear that 200 trees with a low learning rate just isn't going to have time to learn. And notice what happens with even a low learning rate: you can end up overfitting your model; you can actually see it in this curve, the point where it starts overfitting. All right, so we know the learning rate is probably better not at 0.01 unless you have a lot of trees. We're south of 0.1745 now, but that's still just the cross-validated estimate. I'm noticing that a low learning rate with a large number of trees seems like it might be exciting, so I'm going to keep going, but with 0.003, 0.01, and 0.03 as the learning rates. What I'm trying to see is the trend, to catch this curve: generally, the lower you set the learning rate, the closer the model can eventually converge on the right answer, though it might start overfitting, and the more trees you'll need. This is very different from random forests. Okay, while this is running, a question from zdax: how many years does it take to get this fast and efficient at ML? I've been using R very fluently for over 10 years, and the tidyverse for about six of those years, dplyr, ggplot2, et cetera, but I'm pretty new to tidymodels, so I think I could be doing a lot better with a little more practice. That's exactly why I'm doing this; I'm still learning some of the ways to approach this kind of Kaggle competition. And here we go: we can see that the lowest learning rate is just an absolute train wreck for a low number of trees, which makes the plot hard to read. So I'm going to add coord_cartesian with y limits of 0.15 to 0.2; not ylim, I want coord_cartesian, because I just want to cut off everything above that point without dropping data. Okay, see what's happening: at 0.01 it's kind of leveling off right here, and at 0.03 it's already going upward. So let me try 1,000, 2,000, and 5,000 trees... actually, let's do 3,000.
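The zooming trick mentioned above, sketched out; coord_cartesian() clips the view without dropping the underlying points, unlike setting the y scale limits directly:

```r
# Zoom in on the useful part of the tuning curves without discarding data
autoplot(xg_tune) +
  coord_cartesian(ylim = c(0.15, 0.20))
```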
Let me see... I know I don't want this; it's definitely getting slower here. By the way, what people say is that xgboost these days is winning a lot of these supervised-learning competitions, so we'll see how well we can do. I wonder if xgboost would do all right even when we start throwing the tokens back in, because I do miss the tokens; I feel like you can do better with them, but I guess the sparseness is the problem. Carl, that's right, I'm using xgboost now; the difference is that the spec is boost_tree, and I'm looking at this little gradient-descent-style curve over the learning rate. The smaller the learning rate, generally the more I can hone in on exactly the "right" model, but I'm also a little more likely to overfit. The other thing I worry about is robustness; when I set the y-limit here, the green learning-rate curve is kind of flat-lining down at the bottom. Here's an interesting question from Andrew: maybe catboost with treesnip (the parsnip backend) would help with the tokens. I haven't heard of that; I'm interested in learning more about it. Here we go, we can see that at 5,000 trees we're finally doing a little better, but I don't know that it's worth that many; there's barely any benefit from 3,000 to 5,000, even if I set the limits to 0.17 to 0.18. Notice there's not much gain there. It looks like 3,000 trees at a learning rate of 0.01 might be about the best cross-validated error we've got, so let's actually try that on the test set, because I haven't been using that enough: xg_workflow, finalize_workflow with trees = 3000 and learn_rate = 0.01, and last_fit on the train/test split. It looks like the learning rate mattered a whole lot here, so I'm not too worried. This isn't going to be too fast, but let's see what it looks like, and then maybe I'll try uploading it. All right: 0.176, still better than the 0.182 we got before. Let's upload the xgboost model, because we're near the end of our time and I want to make sure we try out a couple. So let me take my code: xg_fit, trained on the full data, then predict on the holdout set, and then do my combination; where was it, where the heck was it... there it is. Attempt two, then I'll try an ensemble and see how we do, but I want to get a little more data first. Yep, game_id and geek_rating. All right: late submission, attempt two, xgboost with 3,000 trees and a 0.01 learning rate. A lot better there. So at 0.16, I actually think I got lucky on this one, because on the holdout it's doing a lot better than on my test set; it's going to be somewhat random how it lands. Let me see where I am on the leaderboard.
Oh, I'm not on the leaderboard, right, it's a late submission; I figured that out earlier. But this would have put me around third place, maybe even second. The truth is I think it's pretty lucky, but still, that's good. I wonder if I can combine in the linear model and do even better, so let's try that. What I'm going to do is pull in this ensemble code, since the xgboost model is certainly a good one, and blend it for attempt three. The chosen configuration here is the one with a 0.01 learning rate. I might as well consider mixing all three together, but I'm going to first mix in just the linear model, and then the RF as well; I like that approach a little more. So xg_chosen is going to be xg_tune with filter_parameters, trees = 3000 and, no, not mtry, learn_rate = 0.01, and let's build ensemble_xg; I'll put "_xg" at the end of each of these names to keep the old models around, and see how it does. I'm just adding in the linear model; this fit is going to be slower... oh, that surprised me, it's not slower. That looks pretty similar to the previous blend. Let's take a look at the blend... whoa, there's the problem: I didn't change the variable names. A question from Andrew: what's with the first-place submission? I actually have a suspicion here. I don't like to accuse people of cheating, but this data is available online; the full board game data was a Tidy Tuesday data set, so it would be very easy to cheat and get an arbitrarily good score. I certainly could be wrong, but I am a little suspicious, just because it's so much better than the other scores. All right, now let me try this fit. It did do better than both of my individual fits on the test set, so I'm going to try this as attempt three, a blend of xgboost and glmnet: ensemble_blended_xg trained on the full data, fit the members, and call the output attempt three. Let me make sure this is different from attempt two, otherwise I've made a mistake; yeah, they're a little bit different. Okay, let's see how it does: "blend of attempt 2 and glmnet." No, that did way worse. Wow, that did so much worse; is that worse than my original model? Something's up with the linear model. It's interesting; maybe the linear model is just really poorly suited here somehow, I really do wonder. What happens if I mix in the random forest instead, xgboost and random forest? Actually, no, I want to try something different: given how well xgboost was doing, let's do a little more parameter fitting on xgboost rather than working with the ensemble. It's like as soon as we brought the categories in, the blends just did way worse.
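A hedged sketch of what narrowing the xgboost results to one candidate for the stack might look like; filter_parameters() is from the tune package, the tuning results must have been run with control_stack_grid() so their predictions are available, and the object names here (xg_tune, lin_candidates) are assumptions:

```r
# Keep only the single xgboost configuration we want the stack to consider
xg_chosen <- xg_tune %>%
  filter_parameters(parameters = tibble(trees = 3000, learn_rate = 0.01))

ensemble_xg <- stacks() %>%
  add_candidates(xg_chosen) %>%
  add_candidates(lin_candidates) %>%  # filtered glmnet tuning results (assumed name)
  blend_predictions() %>%
  fit_members()
```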
Maybe the holdout set is different somehow: the categories are different, the mechanics are different, or something. Oh, someone points out the Kaggle holdout is sorted; okay, thanks. So where was I? Yes, I was going to do another tune. I'm going to stick to the two values I had, 3,000 trees and a 0.01 learning rate, and add in a tune for mtry from 2 to 6; I'm not even touching mechanic or anything here. The reason I'm doing this is I want to see whether the model is still learning as mtry grows, or conversely whether it's overfitting, that kind of thing, so I'll also pass a vector of trees between 1,000 and 3,000, keeping the learning rate the same, because I haven't touched mtry since we set up xg_tune. There are a couple of other xgboost parameters we could use, like the maximum tree depth; I think the default is 6, which means if mtry gives us fewer than six predictors it's probably not affecting things all that much, but I'm not sure. There are other parameters we could tune, but given that the best model was xgboost, and both times I tried adding a linear model it got way worse, I'm going to stay away from mixing the linear model in. I really wish I could bring some tokens back in, because I was curious about them; the thing to do is probably to bring them in one at a time, figure out which is most important, and go from there. One thing is I didn't do anything special to bring the tokens into the tree models; I really do feel like the lasso is the natural way to use them, because I think of them as a linear trend: if you are in a particular category, your rating goes up or down by some amount. This is taking its time; I'm doing mtry of 2, 3, 4, 5, 6 and two tree values, so ten models, on five-fold cross-validation. Let's see. I think the best we'd had was around 0.16, so let's keep that in mind. Aha, okay: more mtry is better, and at 3,000 trees we're seeing some overfitting; that's pretty cool. Let's get a final model set up and use the best version here, blended for attempt four... no, the earlier result was attempt three. I'm almost at time; I have about two or three minutes past the hour if I want to keep myself to two hours. So: xg_workflow finalized to 1,000 trees and mtry = 6.
It could easily be overfitting here, but it does look like a potential benefit. Interesting: it looks a little worse on our original test set; let me try a couple of things real quick, which is exactly why I keep an original test set around, so I can mess with it a little. Wait, it was mtry 4 and 3,000 trees before; I'm just experimenting now because I don't need to keep the test set pristine for anything anymore. All right, what if I try 3,000 trees and mtry 6? The only change from the old model is mtry = 6, and it does look like it's overfitting, a little bit worse. I don't know; I'm going to go with mtry 6 just to be a bit different from my last model. Oh, what did I do... sorry, for xg_fit I need to set mtry to 6, and the attempt... the last one was attempt three, so this is attempt four. This will probably be my last submission; I could maybe squeeze in another one or two, but I don't see the purpose. So: "xgboost, changed mtry to 6." It did a little bit better, around 0.16. Okay, so what did I learn from this experience? That's the end of the session, and it looks like in the end we'd have been near the top. What have we learned? A couple of things. First, I put a lot of work into working with tokens, and in the end the best model didn't use tokens at all. That's a shame, because the tokens clearly made the linear model better, but it was hard to get them into the tree model effectively; I'd really like to read about methods for that. One option is clustering, though I don't have a lot to cluster on to reduce the dimensionality; I could use clustering or something like PCA for dimensionality reduction of something like mechanics. I think that's what I would have done with a little more time: dimensionality reduction on the mechanics, then throw that into xgboost. So one learning: raw tokens don't mix that well with xgboost; don't over-focus on them early. The second thing I learned is to upload early and often. In the end I only had time for four uploads; I probably could have fit in a few more, and if I had uploaded earlier I might have seen sooner that the linear model did worse than expected. Plus there's random noise, so if I have 10 uploads and I'm trying to do as well as I can, I should use them. Third, xgboost really is, as people say, overpowered: how many predictors was I using here, one, two, three, four, five, six? It was really effective at finding a good predictor from just those. What else did I learn? Stacks takes a lot of code; I should write helper functions, because doing those ensembles involved so much code, and then fitting on the full data was a bit of a hack, and so was running on the holdout set.
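As a sketch of that dimensionality-reduction idea (not something actually done in the session): tokenize the mechanics field, turn it into term-frequency columns, and collapse them into a few principal components that a tree model can use. Column names and the comma-separated format of `mechanic` are assumptions:

```r
# Reduce a sparse bag-of-mechanics into a handful of dense components
pca_recipe <- recipe(geek_rating ~ mechanic + owned + num_votes, data = train) %>%
  step_tokenize(mechanic, token = "regex", options = list(pattern = ", ")) %>%
  step_tokenfilter(mechanic, max_tokens = 30) %>%
  step_tf(mechanic) %>%                           # creates tf_mechanic_* count columns
  step_normalize(starts_with("tf_mechanic")) %>%  # PCA needs comparable scales
  step_pca(starts_with("tf_mechanic"), num_comp = 5)
```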
A helper that takes a workflow, trains on the full data, and writes the submission file would be a pretty good helper function. And yes, good question from Megan: the public leaderboard is scored on just one percent of the test data, and you won't see the private score until the end, so it might be overfitting to that one percent a little; that's a really good point. In the real competition I'd have to pick which submission counts without having seen the private score, so this score is only on one percent, and I don't know whether there's any way for me to find out how it would have done overall. What do I think in the end? I think I would have gone with the pure xgboost model. Anything else I learned? Oh yes: look at the importance scores of the random forest earlier; I kind of neglected to do that. Okay, so that's it for today; let's review what we did. We read in the full data, did a little bit of cleaning and a train/test split, did a little exploration to understand the distributions, thought about how to tokenize, then fit a penalized linear regression with glmnet, which goes pretty far. Oh, and good call: one thing you can see here is that the public leaderboard scores were a lot better than the private leaderboard, which is a good indication; if I had gotten 0.16 on the public leaderboard, I would have been down here on the private one, so I probably would have done even worse on the full data. That's one thought on how I'd go about competing in this; thanks for that, Megan. What else did we do? We learned to use a recipe to do a bunch of transformations; in the end the random forest needed less feature engineering, and the work was more about not including things that weren't predictive, so I dropped the tokens. We did lots and lots of tuning, lots of curves like this, where you aim toward the curves on the lower part of the graph. We learned to do ensembles with stacks, which turned out to do worse, at least on the public scores reported here, and then we did some xgboost. All right, that's it for today. I'm planning to do ML Monday for the next couple of weeks; I might go with the SLICED data sets, maybe code along while watching tomorrow's SLICED episode, or try it afterward and do a screencast, I haven't decided yet. But the plan is that next Monday I'd love to have some data set I haven't seen before and try this experience with y'all again; I feel like I learned a lot. Thanks everyone for joining, I hope you had fun, I certainly did, and I'll see you tomorrow for a regular old Tidy Tuesday.
Info
Channel: David Robinson
Views: 3,524
Rating: 4.968504 out of 5
Id: HBZyqkVjUgY
Length: 125min 4sec (7504 seconds)
Published: Mon Jun 07 2021