ML Monday live screencast: Predicting Chopped ratings in R

Captions
Hi, I'm Dave Robinson, and welcome to another screencast where I practice using the tidymodels packages in R to do predictive modeling on a dataset I haven't worked with before. This is the second screencast in what I'm calling ML (Machine Learning) Mondays, my preparation for the SLICED competition, a great project that will be running on Twitch: next week, on Tuesday June 22nd, I'll be live-screencasting an analysis of a dataset I haven't seen, along with a couple of other competitors, trying to build as good a predictive model as I can. Today, as part of that practice, I'm using the dataset the SLICED creators put up on Kaggle as a playground for testing out the competition format. If you'd like to code along, or try competing with me, you're very much welcome to; I'll put a note with the link in the chat. This is just a test, so I don't know how similar it will be to the final competition. There's also a test competition tomorrow that I may or may not join, and if I do, I may not screencast it.

I accept the competition rules, and here's the data: we're going to be predicting episode ratings. The training set has columns like id, season, season_episode, series_episode, episode_notes, air_date, information about the judges, and what the appetizer (and the other courses) were. These are episodes of Chopped, and I should admit this isn't strictly data I've never seen: it was a Tidy Tuesday dataset, I think last year, though I never fit a predictive model on it. I download all the files, open the project, and read the data in. I load tidymodels and the tidyverse, set theme_light() as my default ggplot2 theme, load scales, and since I'll be working with text I expect to use textrecipes today. I read the training file into a variable I call dataset, and the test file into one I call holdout, because I tend to keep the competition test set separate from my own train/test split (I really should have a template for this). How big is the holdout? About 150 episodes. And the public leaderboard is scored on only about 1 percent of the test data, which works out to roughly one or two observations, so I really can't trust the public leaderboard at all; the final results are based on the rest of the data. That means I'm really going to have to trust my own holdout set.
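Roughly what that setup looks like in code — a minimal sketch, with the file names being my guess at what the playground provides:

```r
library(tidyverse)
library(tidymodels)
library(textrecipes)   # text preprocessing steps used later
library(scales)

theme_set(theme_light())

# File names are assumptions; use whatever the playground actually provides
dataset <- read_csv("train.csv")   # training data (has the rating column)
holdout <- read_csv("test.csv")    # competition holdout (no rating column)
```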
With that in mind, I'll keep 25 percent of the data for my own test set. I set a seed, set.seed(2021), then split <- initial_split(dataset), train <- training(split), and test <- testing(split). One thing I checked quickly first: the holdout is made up of random episodes scattered through the series, not episodes from the end, so I don't have to predict the future here — I'm predicting episodes that fall in the middle, which would be a different project otherwise. The full dataset is 351 episodes and the holdout is another 150, and a quick semi_join of the training data against the holdout by id returns nothing, so there's no overlap — good. I end up training on 263 episodes and testing on 88, roughly a 75/25 split.

Now some exploratory analysis. One of the clearest things to look at is series_episode: the rating by the episode's position in the entire run of the show. I expected a bit more of a time trend; instead it goes up and down and up and down. A geom_smooth with method = "loess" shows a little bit of a trend — the series maybe ends strong — and then I wanted a spline: the method for that is "gam", which fits something like a natural cubic basis, and it shows that same up-and-down shape with some number of degrees of freedom. The takeaway is that I'll probably use series_episode with a spline. I also tried coloring by factor(season), but there turn out to be 45 seasons — I'd assumed there were five or so — so I'm not going to treat season as a factor. So there could be some time trend across the series; it's a little hard to say, but that's one way to see it. Season and season_episode basically line up with series_episode: arranging by series_episode gives 1, 2, 4, and so on (3 is missing because it must be in the holdout), so it's essentially sequential. That gives me two ways to look at it: average rating by season, with expand_limits(y = 0), if I believe there are clear season boundaries, or a smooth over the whole series — and smoothing isn't going to be any worse. But I'm actually interested in something else, and it gives me an idea.
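A sketch of the split, the overlap check, and that first plot, following the narration above (the gam formula is spelled out explicitly):

```r
set.seed(2021)

split <- initial_split(dataset)   # default prop = 3/4
train <- training(split)
test  <- testing(split)

# Confirm the training data doesn't overlap the competition holdout
train %>% semi_join(holdout, by = "id")   # expect zero rows

# Rating across the run of the series, with a flexible spline-style smooth
train %>%
  ggplot(aes(series_episode, rating)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))
```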
What I want is a reusable summarize_ratings() helper: a tibble with avg_rating = mean(rating) and n = n(), so I can keep applying it to different groupings. Applying it by season_episode — the episode's position within its season (first, second, third, and so on) — the question is whether your location within a season matters. It doesn't look like it: sizing the points by n shows that seasons do have different numbers of episodes, but the averages are pretty level, and I don't see a trend like seasons starting with a really strong episode. Dropping the expand_limits and testing that a different way — a boxplot of rating by season_episode (group = season_episode) instead of the summary — there's a lot of overlap, and a geom_smooth with method = "lm" shows no linear trend beyond the margin of error. I doubt we're going to see any "premieres start strong" effect; I think it's all random noise. I just wanted to check that while I'm doing my feature summaries. The graph I was really interested in is the one over the whole series: it does look like there are ups and downs. That could absolutely be noise, but these runs of slightly lower and slightly higher episodes could also be real, so I might want a spline on the episode number within the entire series.

Okay, that covers season, season_episode, and series_episode. There's lots of other information. For air_date, I parse it with mdy() from the lubridate package (month, day, year), and I'd bet it lines up pretty well with series_episode. Graphing series_episode against air_date, I'm looking for giant gaps or anything else that might add information — not really; it doesn't quite line up in a few places, which is a little odd, but I don't think I'll gain anything from air_date, and I already saw that position within a season doesn't really matter.

Now let's look at the judges. What I really want to know is whether they repeat — and yes, at least some judges appear a lot. Looking at judge2, there are judges who are sometimes in position one and sometimes in position two: Scott Conant shows up five times here, Marc Murphy too. So I'll probably want to combine the three judge columns into one bucket, because I'm not really interested in whether you were judge 1 versus judge 2, just whether you were a judge. I did something similar with categories in the board-games ML screencast I posted last week. So yes, I'll bucket the judges together; the judges are a feature.
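The helper and the within-season check, sketched out; summarize_ratings() is just the convenience function described above:

```r
# Average rating and count for whatever grouping is currently active
summarize_ratings <- function(tbl) {
  tbl %>%
    summarize(avg_rating = mean(rating),
              n = n(),
              .groups = "drop")
}

# Does an episode's position within its season matter? It doesn't look like it.
train %>%
  group_by(season_episode) %>%
  summarize_ratings() %>%
  ggplot(aes(season_episode, avg_rating, size = n)) +
  geom_point()
```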
That makes them a categorical, many-to-one feature, so I'll have to figure out how to represent that, and I'll show how I go about it. I think the same is true of the appetizer: selecting that column, each episode lists several ingredients, so I can use separate_rows() on appetizer and count the ingredients. Repeats do happen, but not often — none appears more than six times — so this may not be a feature I can do much with. I do wonder whether parsing it into words, so that "potato" lands in the same bucket wherever it appears, would give a little more information. Actually, this is ringing a bell — I remember doing something like this before, so I'm not doing it completely from scratch. I gather appetizer, entree, and dessert into two columns I'll call phase and food, separate the rows on food, and count by phase and food with sort = TRUE (after remembering to actually separate the food). Cream cheese shows up seven times as a dessert, asparagus six times as an appetizer, and so on. I still feel like this isn't going to give much information; dropping the phase, everything appears only a handful of times. Is the rating really based on the ingredients? And even if it is, can I distinguish that from random noise with only 263 training observations?

Having said that, I'm still going to try. Grouping by food, summarizing the ratings, and arranging by descending n: watermelon appears seven times with an average rating of 8.76, while at the low end avocado, which also appears seven times, averages 8.14. Is that statistically significant? We could test it, or just see what the machine learning model does with it. Do I want to include the foods? I guess so, just because I don't have a lot else to go on, but I have a feeling they won't be very informative. We'll try including food and see whether watermelon gets a positive signal and avocado a negative one; it might just be overfitting.

So that covers the foods — I'll bundle all three courses together — and I can bundle the judges together right now too: gather(position, judge, contains("judge")) lets me count a single judge column, where it no longer matters whether a judge appeared in position one, two, or three. We're in good shape there. There's also a votes column; I'll come back to that in a second. Next is the contestant data, where I can do the same kind of thing I just did with gather, except the contestants also come with contestant info, so I'm going to use pivot_longer (I said pivot_wider at first, sorry) on the contestant and contestant-info columns.
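The reshaping described above, sketched with pivot_longer() and separate_rows() (gather() works the same way; the ingredient separator is my guess):

```r
# One row per (episode, judge), regardless of judge position
train %>%
  pivot_longer(contains("judge"), names_to = "position", values_to = "judge") %>%
  count(judge, sort = TRUE)

# One row per (episode, course, ingredient)
train %>%
  pivot_longer(c(appetizer, entree, dessert),
               names_to = "phase", values_to = "food") %>%
  separate_rows(food, sep = ", ") %>%
  count(phase, food, sort = TRUE)
```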
For a second I thought I'd break these all down properly, which could be a little fiddly, so here's what I did instead: gather the columns containing "contestant", keep id, key, and value, and mutate a new column with if_else on whether the key contains "info", because what I'd really want is to spread the data so each contestant's name sits next to their info. Actually, I don't even need to gather the info for this check — I just want to count the contestants. And the conclusion is that I'm not going to use contestant names: nobody pops up more than a handful of times, so even if the episodes featuring one particular contestant are very highly rated, I'm not getting information out of that.

Let's look at contestant info for a second, though. There's a lot of text here — one episode is a bit of a mess — but contestants have titles like "chef", "manager", "director of culinary operations", things like that. Maybe the more senior ones mean a higher rating; it's possible. So I am going to tokenize the info, and we'll see how to use textrecipes for that.

The last thing to investigate: does the number of votes predict the rating? There really aren't a lot of votes, and a quick scatterplot with a method = "lm" smooth says no, it doesn't look predictive. I could include it if I were doing something like a random forest, but I have a feeling it won't matter.

So what does that leave us with? The features for this predictive model: the judges, definitely (though I wonder whether judges are correlated with time); definitely series_episode as a spline, for the trend over time; the episode notes, which I'd totally ignored so far — we'll tokenize both the episode notes and the combined contestant info, so that's our text data (again, with so little data it will be hard to pull out signal); and the foods, collapsed together, included as features. That list roughly goes in the order I'd think of for predicting these ratings. To review: air_date isn't important, and the episode's position within the whole series is probably our best shot at a prediction — a curve like the one we plotted, as long as it isn't overfit, and we could consider other approaches like k-nearest neighbors for that too. We're not going to model by season: that up-and-down shape is already captured by the spline I just showed.
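A sketch of those two checks; the contestant column names and the votes column are my guesses at the dataset's naming:

```r
# Does any contestant repeat often enough to matter? (No.)
train %>%
  pivot_longer(contains("contestant") & !contains("info"),
               names_to = "position", values_to = "contestant") %>%
  count(contestant, sort = TRUE)

# Does the number of votes relate to the rating? A quick look suggests not.
train %>%
  ggplot(aes(votes, rating)) +
  geom_point() +
  geom_smooth(method = "lm")
```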
And position within a season I think is random — the fact that the third episode of a season rates worse than the second is noise. Still, having said that, let me do one really simple check: filter the training data to season_episode 2 or 3 and run t.test(rating ~ season_episode) on it. Even this pair, which I picked out for looking like an extreme difference, isn't statistically significant, so forget about it — we're not going to use position within the season. Oh, and I hadn't saved yet; okay.

Now let's do a linear model; I generally like to start with one. What's the evaluation metric? It doesn't say, but I'd guess RMSE, root mean squared error, since we have a numeric outcome. Looking at the leaderboard, there are entries around 0.69 and the sample solution scores 0.53. Again, I'm not going to trust that output much, because I think it's scored on only one observation, but what would a dummy model get? Roughly the standard deviation of the ratings, which on the training data is about 0.44 — so those numbers are at least in the ballpark.

So here's the plan. The recipe is rating explained by series_episode — I'll start with just that — with data = train, plus one pretty important step: add a natural spline basis to series_episode with step_ns, where deg_free is something I'll want to tune, because our data is really small. I also completely neglected an important step: doParallel::registerDoParallel() — I have four cores on this computer, so let's keep them all running. Then train_fold is a set of cross-validation folds of my training data; I said 20-fold at first, but I'll do 10-fold until I'm sure it's fast enough. The model, lin_mod, is linear_reg() with set_engine("lm"); the workflow adds the recipe and the model; and tune_grid runs on the folds with grid = crossing(deg_free = 1:10). I'll separate these pieces out and call them lin_rec and lin_mod — this isn't so fast that I want to keep re-running it inline. Did I run the line that registers the cores? Yes, right there.
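Putting that together — a sketch of the first tuned model as narrated, with the grid values taken from above:

```r
library(doParallel)
registerDoParallel(cores = 4)

set.seed(2021)
train_fold <- vfold_cv(train, v = 10)
mset <- metric_set(rmse)

lin_rec <- recipe(rating ~ series_episode, data = train) %>%
  step_ns(series_episode, deg_free = tune())

lin_mod <- linear_reg() %>%
  set_engine("lm")

lin_wf <- workflow() %>%
  add_recipe(lin_rec) %>%
  add_model(lin_mod)

tuned <- lin_wf %>%
  tune_grid(train_fold,
            grid = crossing(deg_free = 1:10),
            metrics = mset)

autoplot(tuned)
```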
Cool. Separating the tuning out like this is what lets us choose the number of degrees of freedom. And check this out — although first, I really should have chosen a metric set early on, so I'll do that up top: mset <- metric_set(rmse), passed as metrics = mset, which is nice for focusing on one number. This runs cross-validation on my 10 folds for each of the ten degrees of freedom, and the RMSE — remember the dummy was around 0.44 — drops noticeably once the spline gets six or seven degrees of freedom. I can't say it's perfect, but I like it, so I'll set deg_free to seven; it's even fairly clear that the improvement stops at seven and then gets a little worse. I'll also move set.seed(2021) up top. I'm not going to keep the tuning code for this in place, since I'll be going back and forth, so I'll just hard-code the seven. We're about half an hour in.

Next up is the judges. I can't quite use unite() inside a recipe, but here's what I can do: treat the judges as a text column. step_mutate(judges = paste(judge1, judge2, judge3, sep = ";")) — a semicolon separator, why not — and then step_tokenize on judges with token = "regex" and options = list(pattern = ";"). How does that work out? There's a nice trick: if I prep() and then juice() the recipe, I can see the processed data — oh, I need to include judge1, judge2, and judge3 in the formula first. Now I've got the three judge columns plus a judges list-column with three tokens per row. A couple more things: a step to drop judge1, judge2, and judge3 (I reached for a select-style step; we'll find out whether that causes problems), a step_tokenfilter on judges to keep only, say, the top 10 judges — I'm not explaining every step here, just pointing in the right direction, filtering out the judges that are rare — and then step_tf on judges, which turns the tokens into frequency columns, effectively binary yes/no columns, one per judge: tf_judge_... and so on. I'll also rename the combined column to judge; it reads a little better. So now I've got a spline term with seven degrees of freedom plus my top ten judges.

From here I'm branching out: next I'll add contestant info, which we're going to tokenize — a more interesting tokenization — and it follows the same combine-then-parse pattern as the judges.
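Here's roughly what that judge handling looks like as recipe steps; step_rm() stands in for the "select"-style removal I mention above:

```r
lin_rec <- recipe(rating ~ series_episode + judge1 + judge2 + judge3,
                  data = train) %>%
  step_ns(series_episode, deg_free = 7) %>%
  step_mutate(judge = paste(judge1, judge2, judge3, sep = ";")) %>%
  step_rm(judge1, judge2, judge3) %>%
  step_tokenize(judge, token = "regex", options = list(pattern = ";")) %>%
  step_tokenfilter(judge, max_tokens = 10) %>%
  step_tf(judge)

# Peek at the processed training data
lin_rec %>% prep() %>% juice()
```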
If there were 20 contestants per episode I'd structure this differently, but it's not bad with four. I don't care about the contestants themselves, since they mostly don't appear more than once; I care about their info. I thought about dropping the original columns with ends_with() or matches() on names ending in a number, but that won't work because the spline columns end in numbers too, so never mind — I'll remove contestant_info1 and so on explicitly. There are other ways to do this, but the key is: I combine the info columns into one big contestant_info text field, tokenize it with regular word tokenization, remove stop words with step_stopwords (I didn't need that for the judges — judge names aren't stop words), keep only the top tokens with step_tokenfilter (say 10 or 20), and then step_tf. Taking a look at the prepped data: first an error, because I'd written info1 where it should have been contestant_info1 and hadn't removed all four columns. (There's also another way to set this up, using add_variables() to assign particular roles, but I won't do that today.) Now I can see columns like tf_contestant_info_restaurateur, _owner, _ny, _new, _york, and so on, and I can use those as features. We are really at risk of overfitting here: 29 features and only a few hundred observations. I also had something odd where the judge columns got duplicated — judge was still hanging around as a factor alongside the tf_judge columns; I don't know how that happened — but after cleaning it up, this is what the data looks like, and it's really nice to see it laid out: a bunch of columns for time, a bunch for the judges, a bunch for contestant info. I haven't added the food yet. I have a suspicion something will break when I run this through a linear regression, and I'll also need to tune both of these max_tokens parameters, so I'm going to add things one at a time. It was nice to throw everything in, but now I'll comment out the contestant-info lines and try with just the judges — I probably should have done that first: lin_rec with judges. I was missing a plus sign somewhere; I thought something was going to break here, but if I just fit the workflow on train, it works — I had a weird suspicion the column-removal step would break something, but no.
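And the contestant-info branch, following the same pattern; the names of the four info columns are my guess, and paste()'s default space separator is fine here:

```r
lin_rec <- recipe(rating ~ series_episode + contestant_info1 +
                    contestant_info2 + contestant_info3 + contestant_info4,
                  data = train) %>%
  step_ns(series_episode, deg_free = 7) %>%
  step_mutate(contestant_info = paste(contestant_info1, contestant_info2,
                                      contestant_info3, contestant_info4)) %>%
  step_rm(contestant_info1, contestant_info2,
          contestant_info3, contestant_info4) %>%
  step_tokenize(contestant_info) %>%
  step_stopwords(contestant_info) %>%
  step_tokenfilter(contestant_info, max_tokens = tune()) %>%
  step_tf(contestant_info)
```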
That looks pretty solid. If I fit it on train and summarize it — I have to extract the underlying model first, then call summary() — what it looks like is: none of the judges really pop up, except one that's barely significant. And that's the thing: when you have ten judges and one of them comes out as marginally significant, that's a sign this is just going to lead to overfitting. What happens with max_tokens = 3? The commented-out lines are confusing things a little, but training the model with the three most common judges included, maybe one coefficient looks interesting — though the more terms you add, the harder it is for anything to be significant, so this may not mean literally anything. So the judges may or may not add something; we'll find out with cross-validation. Let me remind myself what the best tuned model was: cross-validated on train with seven degrees of freedom, roughly 0.405 at best — I won't be more precise than that.

So now I set max_tokens = tune() and tune it over just one to six judges, keeping deg_free at seven. I have a suspicion that none of these will beat simply not including the judges, and if so, we don't have to add them — we have so little data. And yes: you can kind of see that every judge you add makes the RMSE a little worse. Not by much — it's not a significant difference — but that's the pattern.

There's one thing I haven't done that's slightly important: I could be using a penalty term. So I'll change the engine to glmnet and tune the penalty — I like to set the penalty grid myself, something like 10 to the power of seq(-1, -0.5, 0.05) — with penalty = tune() in the model spec. The reason is that regularized regression doesn't effectively add features unless they're worth the improvement to the model, and with so few observations, maybe that helps. You can see a tiny little dip in the curve, but it's still best with only one judge, and I'd bet it's even better with zero. So at least in a linear model, I'm just not seeing anything from the judges; I'm not convinced they're an improvement. That tells us something, so I'll comment those steps out — actually, I'll keep them in a whole separate place, because I don't want the recipe to get too packed — but I'm just not convinced the judges add information.

Okay, next one: contestant info. This kind of check is good because sometimes you surprise yourself. One second — I'm combining the text, and I'm going to tune max_tokens over something like 1, 3, 5, 10, 20 — a few more tokens this time.
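The penalized version, sketched; the penalty grid follows the values mentioned above:

```r
lin_mod <- linear_reg(penalty = tune()) %>%
  set_engine("glmnet")

lin_wf <- workflow() %>%
  add_recipe(lin_rec) %>%
  add_model(lin_mod)

tuned <- lin_wf %>%
  tune_grid(train_fold,
            grid = crossing(max_tokens = c(1, 3, 5, 10, 20),
                            penalty = 10 ^ seq(-1, -0.5, 0.05)),
            metrics = mset)

autoplot(tuned)
```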
One more change: this time I won't remove all the contestant-info columns afterward — wait, yes I will. Let's see what happened by checking the tuning notes: it doesn't like my paste() — a problem in the step_mutate for the info — ah, there it is: I'd referenced info1 instead of contestant_info1, and there's a leftover step I can drop. I have a vague suspicion that with 263 observations in the training set, we're going to have a hard time really distinguishing anything. Autoplotting the results: there's a slight improvement with three retained tokens, but beyond that, 10 is worse and 20 is worse — not terrible, just worse than fewer. What the different lines show, by the way, is the amount of regularization: I can add as many penalty values as I want and it stays fast, because glmnet isn't fitting a separate model for each penalty — that's why I include so many. So three tokens might be the best. I don't know how I feel about that; it sounds like overfitting, but I'll tune from one to five and see where we land. Mostly what I'm starting to feel is that we're getting very little benefit out of contestant info; when that happens, you change the grid size and run it again — and yes, there's still a slight improvement at three.

So what is the third most common contestant-info token? Let me rerun the recipe — finalize it with max_tokens = 3, prep, juice (and it's lin_rec, not the other one) — and it's "chef", "new", and "ny". I suspect either "new" or "new york" has some impact. Let's check quickly by looking at one of the fitted models: I take the best parameters from the tuning results, lin_wf <- finalize_workflow(lin_wf, select_best(tuned)), fit that on the entire training data, extract the model, and tidy() it — does that work? Yes, and it looks pretty good. Then filter with str_detect on the term to keep just the contestant-info terms. There's a great graph to make here: plot the estimate against the penalty (lambda), on a log scale, colored by term, to see which terms pick up a positive or negative effect. "new" is positive; "chef" and "ny" are negative. I don't know why "new" would be positive and "ny" negative, but all right. And where did the tuning land? I can add a geom_vline at the chosen penalty: it stopped before "ny" even entered the model, so the chosen fit has "chef" negative and "new" positive. The term "new" probably refers to New York most of the time, but it could be something else — a "new strategy", say. So that's what we see when we add three tokens. It is a little weird to me that the more the word "chef" appears, the more negative the effect, but that could be absolutely nothing — random chance.
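Pulling the coefficient paths out of the chosen fit, roughly as described (extract_fit_engine() gets at the underlying glmnet object; older workflows versions use pull_workflow_fit()):

```r
best_params <- select_best(tuned, metric = "rmse")

lin_fit <- lin_wf %>%
  finalize_workflow(best_params) %>%
  fit(train)

# Coefficient paths for the contestant-info terms across the penalty (lambda)
lin_fit %>%
  extract_fit_engine() %>%
  tidy() %>%
  filter(str_detect(term, "contestant_info")) %>%
  ggplot(aes(lambda, estimate, color = term)) +
  geom_line() +
  scale_x_log10() +
  geom_vline(xintercept = best_params$penalty, lty = 2)
```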
So: a tiny improvement at best from three tokens. I can still consider leaving them in, but the spline was definitely the important part. That was bringing in some text data, and I'll leave it in the mix; now let's bring in the foods. No individual food is very common, but I'll still try, using the same combination pattern we keep using: a step_mutate that pastes appetizer, entree, and dessert into a food column, then tokenization. I considered token = "regex" with a semicolon pattern to keep whole ingredients separate, but actually I'll tokenize by word instead of by ingredient, so that "blackberry" and "blackberry jam" count together — I'll just treat it as one big bundle of words. I don't think there are many stop words in food names, but if there's an "of" or whatever, I'll drop it. I'll keep contestant info at three tokens — I could see an argument for dropping it too — and tune the number of food words we keep, then remove appetizer, entree, and dessert as raw predictors (and add them to the formula up top; sorry, I'm sounding a little confusing here). Also, I don't need token = "words" — that's the default anyway. Then an error: contestant_info doesn't exist — I was pretty sure I step_mutate'd it — ah, I'd created it twice, and the real problem was that I wasn't doing a step_tf on food; I'd been blaming a stray newline. All right, now I've got my recipe, I juice it, and I see "cheese", "chocolate", and "cream" as potential word features.

Then it's the same thing as before: tune max_tokens over seq(1, 11, 2) — up to the eleven most common food words — and see what that does to our predictive power. While this runs, a question from the chat: the data lists some judges under different names, like Chris Santo versus Chris Santos — that's some good potential cleaning; I'll check that out. And hey, check this out: as I add up to 11 food tokens, the model just gets worse and worse — the curves stack up and stack up. So adding words from the food columns doesn't improve anything, and I don't think adding whole ingredients would either. I'm pretty skeptical; I think I'd just be overfitting — so many foods are so rare that including a few more isn't going to carry good signal. I think I'll give up on the foods too.
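The food version of the recipe, as a sketch — everything combined into one bag of words, with the contestant info held at three tokens and the food tokens left to tune:

```r
lin_rec <- recipe(rating ~ series_episode +
                    contestant_info1 + contestant_info2 +
                    contestant_info3 + contestant_info4 +
                    appetizer + entree + dessert,
                  data = train) %>%
  step_ns(series_episode, deg_free = 7) %>%
  step_mutate(contestant_info = paste(contestant_info1, contestant_info2,
                                      contestant_info3, contestant_info4),
              food = paste(appetizer, entree, dessert)) %>%
  step_rm(contestant_info1, contestant_info2, contestant_info3,
          contestant_info4, appetizer, entree, dessert) %>%
  step_tokenize(contestant_info, food) %>%
  step_stopwords(contestant_info, food) %>%
  step_tokenfilter(contestant_info, max_tokens = 3) %>%
  step_tokenfilter(food, max_tokens = tune()) %>%
  step_tf(contestant_info, food)

# Then tune the food words, e.g. grid = crossing(max_tokens = seq(1, 11, 2), ...)
```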
Just to make absolutely sure, I'll tune max_tokens over seq(1, 30, 4) and confirm that the more food words you add, the worse the model consistently gets — I want to be sure there isn't some threshold where, say, the 20th food suddenly helps. I don't think there is; I think we're ruling out food, and even contestant info wasn't very informative. This is just a hard one: it's hard to beat the average here, and if we try too hard to beat it, we'll find ourselves overfitting our hyperparameters. And indeed, the more food words we add, the worse the model, almost monotonically. So no food — into the graveyard of features it goes, along with the judges.

What else did I list? Episode notes — that sounds plausible. Let's do the same tokenizing we did on contestant info, but on episode_notes; it's a little easier because I don't need the paste step: tokenize, stop words, token filter (tuned), then tf. Taking a look with three tokens, the kept words are things like "episode", "part", and "round" — pretty uninformative — and with more: "first", "part", "round", "tournament", plus "chopped", "chef", "chefs". A lot of these feel like they should basically be stop words for this dataset, and I'm deciding whether to drop them. My feeling is that these aren't going to be very interesting: there are words that I bet appear a whole lot, and I wonder if I should use a feature selection approach instead. Let me think for a second about how I'd do that. There are recipe steps for this kind of thing: for instance, step_corr removes variables beyond a certain level of correlation with each other, so if two words always appear together, it keeps only one — not a bad move — but what I actually want is something more like step_nzv, which removes near-zero-variance predictors that don't add much variation.

What does the distribution of words actually look like? Let me glance at that with tidytext: take the training data, unnest_tokens on episode_notes, group by word, summarize the average rating and n_episodes = n_distinct(id), and anti_join the stop_words table. I was curious whether there are words that appear in nearly all 263 episodes — I was worried there would be — but no, they just get rarer and rarer. I think the best I can do here is hand-pick a few words that seem interesting, like "theme" and "tournament" — and I wondered about "basket", but I think the basket is just the set of ingredients they give the contestants, so it's probably everywhere.
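The quick tidytext look at how the episode-note words are distributed, as a sketch (unnest_tokens() and the stop_words table come from the tidytext package):

```r
library(tidytext)

train %>%
  unnest_tokens(word, episode_notes) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(word) %>%
  summarize(avg_rating = mean(rating),
            n_episodes = n_distinct(id)) %>%
  arrange(desc(n_episodes))
```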
With this few observations, the best I can do is look at how each candidate word is actually used, with str_detect on episode_notes. When you have tons of data you can just let the machine learn; in a situation like this, you want to look a little. Searching for "theme": "New Orleans theme", "the theme of this episode was every dish in a bowl", and so on, versus episodes without it. Then tournaments and parts — I think "part" might be kind of interesting; or was that removed by the stop word list? It must have been. So instead of relying only on the stop word list, I'm going to explicitly keep "part", along with "round" and "tournament". A quick look at "basket", and at "champion" and "champions", and themed parts and rounds; let me also look for "featured" — oh, I dropped the negation — and it doesn't seem that interesting. I'm just going by feel here. What about "cut" — does that mean someone got cut from the episode? No: someone cut themselves. Wow, "cut himself" — that sounds like fun; I'll throw it in. So let's go with this little list: part, round, tournament, theme, and cut. Why not.

This is a manual word list, and something just hit me: I don't know how to filter for only particular tokens in textrecipes. I know how to remove stop words, but what's the way to say I only want to keep certain words? There's step_tokenmerge for merging multiple token lists — that would have been handy for some of the earlier steps — and there's hashing, but I don't see an obvious keep-list. Let me try something quick: before the step_tf, episode_notes is a token list-column in the juiced data, so what if I step_mutate(episode_notes = map(episode_notes, intersect, word_selected))? Does intersect work on token lists? ... No, I bet it just doesn't work on these token objects. Oh man, I was looking forward to that — what I want is a whitelist. Let me look one more time at step_stopwords — aha, there's a keep argument: whether to keep the stop words or discard them. That's it. So I define word_selected, use step_stopwords(episode_notes, custom_stopword_source = word_selected, keep = TRUE), and drop the token filter. At first nothing at all survived from episode_notes, even though I thought I was using it the right way — oh wait, there it is, it works; I'd just forgotten the step_tf on episode_notes.
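The keep-list trick in recipe form: step_stopwords() with keep = TRUE retains only the words in the custom source instead of removing them. A sketch using the hand-picked list above:

```r
word_selected <- c("part", "round", "theme", "tournament", "cut")

notes_rec <- recipe(rating ~ series_episode + episode_notes, data = train) %>%
  step_ns(series_episode, deg_free = 7) %>%
  step_tokenize(episode_notes) %>%
  step_stopwords(episode_notes,
                 custom_stopword_source = word_selected,
                 keep = TRUE) %>%
  step_tf(episode_notes)

# Check which word columns survived
notes_rec %>% prep() %>% juice() %>% select(starts_with("tf_"))
```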
Here we go: part, round, theme, tournament. I also wonder about stemming first, so that "tournament" and "tournaments", or "theme" and "themed", collapse together — if I add step_stem on episode_notes, what do they end up looking like? Well, I can't tell whether it changed anything, but I tried it and I still have all the right tokens, so I'll keep it and look more closely eventually. Okay: part, round, theme, tournament — do they give us literally anything? I don't have a max_tokens to tune anymore — actually I do: I only have four columns coming from these, so I can use max_tokens to add them one at a time (I can't easily do zero, but one token should make almost no difference) and ask whether adding these terms helps at all, so we have a way to compare them. So what we've learned here is how to use a custom, hand-selected word list based on intuition — though picking words by looking at the data could itself lead to overfitting. And we do get an improvement at two or three tokens. Which word was that? Setting max_tokens = 2: token one was "round", token two was "part". I could include just "part" and nothing else, and it really does look like there's a benefit. So I'll note that "part" helps — that could be overfitting, but it kind of makes sense. Actually, I don't even know whether its effect is positive or negative.

As a comparison, what if I skip the custom list and just tune over all the episode-note words, stop words removed — seq(1, 15, 2), or let's try seq(1, 30, 3)? Do I get anything? This also looks worse than the original, so I may just include the word "part", and if so, I can simplify this model down a lot: just one feature for whether the notes include the word "part". This is not an easy dataset — not enough data and not enough features that work. Oh — whoa, what's this at around 28 tokens? That's the only point that kind of catches up to the other models. So let me try what is basically lasso for feature selection: crank max_tokens up to around a hundred. It probably won't be a good model itself, but the regularization path might tell us which terms matter. The results are kind of all over the place, but I can then take the best of that hundred-token fit with finalize_workflow(select_best(...)), fit it, and apply the same approach as before: which episode-notes terms popped up, if any? "Too many" is the main answer, so I'll tidy() the fit and filter for the episode-notes terms at the right penalty.
Specifically, filter for lambdas at or above the chosen penalty and take the minimum — the value right where the path crosses over — then arrange by the estimates and ask which terms are large at that level. "five" and "star" pop up, and "theme" pops up; those are ones I kind of like. "cook" is negative — I don't know how I feel about that one. And where's "part", the one we had before? I don't see it; maybe it's redundant with "round". I just don't believe I'm going to get much out of this kind of selection. Oh man, this is tough. I'm going to go back to the previous version and just include "part".

So the question I have is: count(train, str_detect(episode_notes, "part")) — there are 46 episodes with the word "part". Let me look more carefully: mutate a has_part column using str_detect with word boundaries around "part", so I don't match "apart" or "party" (without the boundary, there was at least one of those). Then a quick t.test of rating by has_part — it's always nice to run the simple statistical test, because if this doesn't work, there's just no way the feature is real — and it's not even significant. Ain't that a kick in the head. You'd expect maybe this has some impact, but I think it's just random chance. Skip it. This is a tough one to get anywhere above our baseline expectation.

So where are we? The only thing that's worked so far is time. Food hasn't worked, the episode notes haven't worked, the judges haven't worked, and we got just a tiny bit from three tokens of contestant info. And that's kind of it for the training columns — that's the trouble with having only a couple hundred observations: contestants don't repeat, and votes didn't seem to have any impact (they're so low anyway that I can't imagine what impact they would have). So I've got my spline, a little bit from the contestants, and I'm honestly going to drop the episode notes — I don't want to overfit. I'll leave in the three contestant-info words, why not, I already have them set up, and run my cross-validation.
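That quick check on "part" looks roughly like this sketch, with word boundaries so "apart" and "party" don't match:

```r
train %>%
  count(str_detect(episode_notes, "\\bpart\\b"))

train %>%
  mutate(has_part = str_detect(episode_notes, "\\bpart\\b")) %>%
  t.test(rating ~ has_part, data = .)   # not significant
```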
Now I'll spend about 15 minutes trying one more idea — I'm aiming to wrap up at around 90 minutes total. By adding features we haven't been able to get past roughly 0.405 with a regularized regression, but here's a thought: I used splines for the time trend, but I could use k-nearest neighbors on the series episode number instead. That's another approach you might want to try, and I've actually never used k-nearest neighbors in one of these screencasts before. So I'll grab the recipe, call it knn_rec, and skip basically everything: what if I predict from series_episode alone? The model, knn_mod, is nearest_neighbor() with neighbors = tune(), and there's a second argument, weight_func — the weighting function, which can be rectangular, triangular, and so on. From a little experimentation I've seen triangular be good and I've seen "optimal" be good (it certainly sounds promising, doesn't it), and I'll throw in triweight too; these fits won't take long since it's one-dimensional nearest neighbors. Then knn_wf is a workflow with knn_rec and knn_mod. What this does is take the average of the surrounding episodes' ratings — or really, apply one of these weight functions to them. Let's really make time work for us: the grid is neighbors from 1 to 40 by 2 (another parameter that's very fast to try at many levels) crossed with the weight functions, with metrics = mset. So tune_grid finds the nearest 1, or 10, or 30 episodes and averages them, weighted; then autoplot(knn_tuned), and the question is: can we beat 0.405?

Well — triweight is really bad with one neighbor; actually, all of them are bad with one neighbor, which makes sense, so let's never use one neighbor and start the plot at five. I didn't pick these four weight functions for any deep reason, just ones I'd played with a little. The result is similar to the regular spline model in terms of getting the RMSE down to the same range, with the triangular weight doing well. This is very easy to overfit, but I'm going to go for it anyway and pick the number of neighbors I like most: what I like about triangular is that it's fairly flat between about 20 and 25 neighbors, so picking triangular with around 22 feels robust-ish — though really, we're still fitting to this small training set. (This is also a model whose predictions are easy to show directly.)

So the kNN tuning works, and what I'm interested in now is whether I can improve it by adding kNN on something like the ingredients or the episode description. Individual tokens didn't work in the linear model, so it's unlikely a linear treatment is the answer — but what if I did a dimensionality reduction on the description? The plan: bring episode_notes into the recipe and tokenize it.
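The k-nearest-neighbors alternative on the episode number alone, sketched with the kknn engine:

```r
knn_rec <- recipe(rating ~ series_episode, data = train)

knn_mod <- nearest_neighbor(neighbors = tune(), weight_func = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_wf <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(knn_mod)

knn_tuned <- knn_wf %>%
  tune_grid(train_fold,
            grid = crossing(neighbors = seq(1, 40, 2),
                            weight_func = c("rectangular", "triangular",
                                            "triweight", "optimal")),
            metrics = mset)

autoplot(knn_tuned)
```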
What I'm interested in next is whether I can improve this by adding some k-NN on something like the ingredients or the episode description. Individual tokens weren't working in the linear model, but what if I did a dimensionality reduction on the description? So let's include episode_notes. Linear terms on tokens are unlikely to work; instead, in the recipe, I do step_tokenize on episode_notes, step_stopwords on episode_notes, step_tokenfilter on episode_notes, the same pattern over and over, with max_tokens, which we could tune, but I'm going to try 100 and then maybe something smaller, because it's kind of a slow step. Then step_pca; actually, first step_tf. If I take a look at the juiced data and select the columns that start with "tf", I've got these word-count columns, and the issue is that it's a lot of dimensions, which is really going to mess up any k-NN, but I can do a dimensionality reduction on them. So I could do something like as.matrix, and just to try it out, run svd for a singular value decomposition (I'll go back and use step_pca inside the recipe for this), then tidy the result to pull out the d matrix of singular values. Oh, I didn't center it; I didn't do the things I'm supposed to. Let me throw in step_center, no, step_normalize, which I think both centers and scales, on all the tf columns (starts_with("tf")), so they end up with standard deviation one and mean zero. Then I prep it, pull out the matrix, and look at the singular values. Huh, this kind of indicates there could be a smaller-dimensional space: the first couple of principal components explain a lot of the variation across words. Let me increase max_tokens to a hundred; maybe there are things we can see in there. I do wonder if scaling those columns is a bad idea; I'm actually starting to think that scaling the top hundred token counts is a bad idea, because there'd be meaningful variation across them. It's probably a bad idea, but I can't stop myself from doing it. Okay, what this is showing is that I'm looking for an elbow: the first seven or so principal components have a drop-off, explaining more of the variation than the next bunch, and that's, I don't know, maybe a little less than 20 percent.
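That exploration, roughly, in code. Same caveats about assumed names, and on stream the recipe was built up incrementally rather than written in one go.

```r
library(tidymodels)
library(textrecipes)

text_rec <- recipe(rating ~ episode_notes, data = train) %>%
  step_tokenize(episode_notes) %>%
  step_stopwords(episode_notes) %>%
  step_tokenfilter(episode_notes, max_tokens = 100) %>%
  step_tf(episode_notes) %>%
  step_normalize(starts_with("tf"))   # center and scale the tf_* counts

# Eyeball an SVD of the prepared term matrix to look for an elbow
tf_matrix <- text_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(starts_with("tf")) %>%
  as.matrix()

svd_fit <- svd(tf_matrix)

tibble(component = seq_along(svd_fit$d), d = svd_fit$d) %>%
  ggplot(aes(component, d)) +
  geom_point() +
  labs(x = "component", y = "singular value")
```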
I'm just eyeballing it here, but this gives a sense that I could try seven principal components, or some number smaller than that, include them in the k-NN, and see if that gets anywhere in terms of prediction. My guess is that it's going to be bad, or at least worse than time alone: the more dimensions we add, the worse k-NN can get, and the one dimension I'm already using is just pretty good. But let's try it, since we're already here. What if I add step_pca on everything that starts with "tf", with num_comp = tune()? We already have the grid over neighbors; I'll leave in the multiple neighbor values, because that doesn't slow anything down, but I am going to keep only my favorite weight function, which was triangular. Then I'll say num_comp is one to five. The previous best was 0.404, so let's see whether it gets better or worse when I throw in a dimensionally reduced set. Actually, let's do one, three, five, seven, nine, why not. What this does is add the principal components as terms instead of the words themselves, so if words tend to appear together, they get bundled into one dimension. That could, in principle, make the k-NN better, but I have a suspicion it's not going to work; still, it's worth a shot. Okay, yeah, I can see it's not really working: the best was one component, and even that was worse than our original, because it flattens out at 0.42. I wonder about the scaling, though; does step_pca scale? Let me check the documentation. It says it is advisable to scale, and that this can be changed using the options, yet the argument defaults are set with center FALSE; I honestly don't understand why it's saying that. I'm going to try step_center but not step_normalize, and also try passing scale in the options list; I'm confused about exactly what the options are, but let's see if any of this does better. I have a weird suspicion the scaling is causing problems here. That doesn't mean this is going to work; I kind of suspect that just taking the plain average of your surrounding episodes in time is going to remain the best approach. And the plot: slightly different, but not better, and it's still the case that the best choice is as few components as possible. Yeah, this just doesn't work, especially mixed in with the time term. Actually, I'm curious about something: if we didn't include series_episode at all, would the text work by itself? series_episode is such a better predictor that it dominates the others; a text-only model will be worse overall, but maybe there's some signal where more components help it a little bit. Let's find out. I'm just curious whether, if I use only the dimensionality-reduced words, I get anywhere.
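A sketch of the tunable-PCA variant, building on the text recipe above. On stream this was first tried with series_episode still in the model; here I'm only showing the text side, and `train_fold` and `knn_mod` are the assumed objects from before.

```r
knn_text_rec <- text_rec %>%
  step_pca(starts_with("tf"), num_comp = tune())

knn_text_wf <- workflow() %>%
  add_recipe(knn_text_rec) %>%
  add_model(knn_mod)

knn_text_tuned <- knn_text_wf %>%
  tune_grid(
    resamples = train_fold,
    grid = crossing(
      neighbors   = seq(5, 40, by = 5),
      weight_func = "triangular",
      num_comp    = c(1, 3, 5, 7, 9)
    ),
    metrics = metric_set(rmse)
  )

autoplot(knn_text_tuned)
```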
You know, it kind of is getting somewhere, though I'm not sure about it: it gets worse with five components and then better again, and I've seen that they all flatten out around 30 or 40 nearest neighbors. I can actually graph this a little differently: collect_metrics(), filter to one value of neighbors, and plot num_comp against the RMSE. It's bad, but it looks like maybe with a few more dimensions it would get better. I'm not confident in that, but I think it's at least worth a shot to try one to twenty components and ask: is there some dimensionality where we're getting something meaningful out of the text? I don't know. As I said, it's not going to do well when it's competing against the time term, but I wonder if there's some signal here. There might just not be; there's no reason it has to be true that the text is useful for figuring out the rating. I could mix in the contestant info, I could try all these things; actually, why don't I try that merging approach for a second. So I'm going to step_tokenize episode_notes along with everything that contains "info", then step_tokenmerge. I haven't used this one before; it merges those token columns, and I don't know offhand what it merges them into, so let's juice the prepped recipe and look. The merged column is called "tokenmerge", so that's what I'll refer to. num_comp is still tuned, and then I'll go from there. So I'm trying the same thing, but mixing in the contestant one, two, three, and four info: tokenizing all of them, merging them into one column, removing stop words, token filtering, and trying out some max_tokens values. I think this is going to be the last thing I try; I'm not going to try blending this with the original, but it's nice to try out a variety sometimes. Yes, I'm spending plenty of time here. One caveat: this doesn't separate the words in the episode notes from the words in the contestant info. I think that's probably fine, more or less; the question is whether I can find some dimensional space of these tokens. Let me look at the plot, and there are going to be a lot of facets, but it might have a little bit of information. From an ensemble perspective I'm not going to use stacks, it would take a bit of effort to get working, but it kind of looks like, while this isn't as good as the time model, it might be adding independent information, in the sense that the more components I have, the better. Let me look at this again; 40 neighbors was about where it was leveling off. All right, what I'm going to try is, I don't know, 12 components, and that gets an RMSE of about 0.43. Not as good as time by itself, but maybe I can combine these two.
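The merged-text version of the recipe, sketched out. Here contains("info") stands in for the four contestant info columns, whose exact names I'm not reproducing, and the merged token column really is named tokenmerge by default.

```r
merged_rec <- recipe(rating ~ .,
                     data = train %>% select(rating, episode_notes, contains("info"))) %>%
  step_tokenize(episode_notes, contains("info")) %>%
  step_tokenmerge(episode_notes, contains("info")) %>%  # collapses them into one `tokenmerge` column
  step_stopwords(tokenmerge) %>%
  step_tokenfilter(tokenmerge, max_tokens = 100) %>%
  step_tf(tokenmerge) %>%
  step_normalize(starts_with("tf")) %>%
  step_pca(starts_with("tf"), num_comp = tune())
```

After that, the same nearest_neighbor workflow can be tuned over num_comp and neighbors exactly as before.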
So this is k-NN on text. We saw that with lasso regression we couldn't get anything out of the text, but maybe if we bring the text into a low-dimensional space and do k-NN on it, we're getting somewhere. Let's find out. What I'll do is take my linear workflow, lin_workflow, finalize_workflow with select_best on the tuned results, and fit it on my full training data; call that lin_mod. That's the best I was able to do. Let me make sure it has the contestant info with three tokens, which was about the best we tried, and the spline term. Okay. Then I'll do the same for the k-NN that's just on text, which we know isn't as good, but it's better than it was with only one dimension, which suggests it might be picking up signal that's independent of time, and that's really what we're aiming for here. So knn_mod is knn_workflow, finalize_workflow with select_best on knn_tuned, then fit on train. Hmm, it does not like my scale option; "scale." is FALSE, something in step_pca. I'll just skip that for now. So these are the best models I've come up with. Now, first I'm going to compare them on the test set. So, predict; actually, can I call augment on a workflow? No, I can't. What I'll do is predict on the holdout. Wait, why would it need rating? Why is it asking me for rating? Of course the holdout doesn't have the rating; that's what we're predicting. Is there some column or step where rating gets pulled in? I'm getting a little stuck here. Okay, another way to do this: last_fit on my split object, which trains on the training set and evaluates on the test set. I don't know why the other way wasn't working a second ago, but this works, and collect_metrics gives an RMSE of 0.43. So it's a little worse than I was doing with the spline in cross-validation, but that's all right; that kind of thing happens, it's a bit of overfitting. Let's ask the same of knn_workflow with finalize_workflow. Oh, that's odd; ah, I need to set the mode to regression on the nearest-neighbor model. Once I do that, the two are very similar: the RMSE of my spline model is about as good as my k-NN. And they're not identical models or anything, right? The k-NN workflow is a nearest neighbor with a number of neighbors and a weight function, fit on only the text, and the other is a spline on time. That's so interesting; maybe an ensemble is a useful way to approach this. I could use the stacks package. Yeah, maybe I'll use the stacks package; let me try it out and ensemble the two, since I only want to submit one guess.
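In code, the comparison step looks roughly like this. Here lin_workflow and lin_tuned are the earlier regularized-spline workflow and its tuning results, split is the initial train/test split object, and all of those names are assumptions.

```r
lin_finalized <- lin_workflow %>%
  finalize_workflow(select_best(lin_tuned, metric = "rmse"))

knn_finalized <- knn_text_wf %>%
  finalize_workflow(select_best(knn_text_tuned, metric = "rmse"))

# last_fit() trains each finalized workflow on the training portion of the
# split and evaluates it once on the held-back test portion
lin_finalized %>% last_fit(split) %>% collect_metrics()
knn_finalized %>% last_fit(split) %>% collect_metrics()
```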
It looks like there's one model that's linear and mostly based on the spline of time, lin_finalized, and one based on text, knn_finalized. I last_fit each of them on the train/test split, collect_metrics, and look at the whole thing: the R-squared is a little better on the first one, but the k-NN on dimensionally reduced text might have done okay; it might have added a little information. So this is a decent example of where this kind of text method can be useful, at least once it combined in the contestant data. Okay, the last step: I'm going to create a stack, stacks(), then add_candidates of lin_finalized. Nope, that doesn't work; add_candidates expects the tuned objects, not fitted workflows. There's filter_parameters, there are parameters, there's actually a whole situation here. For knn_tuned, is there just a "filter best"? Wouldn't that be nice; there's select_best. I thought I'd be done by now, but I'm still experimenting. Let me set up knn_best and lin_best, ignore the scale warning, and rerun this real quick. (By the way, a shout-out to the author of textrecipes; textrecipes has been a real blast today, especially textrecipes plus a dimensionality reduction.) I don't like to include all the candidate configurations in the stack, because it ends up being a lot slower, so I'll trim things down and call the result "tuned". add_candidates takes these tuning objects; it's meant to build an ensemble, and we're going to give it two. And, oh boy, I totally forgot a necessary step: I need control = control_stack_grid() in the tune_grid calls, otherwise the stack won't have the saved predictions it needs. While I'm at it I'm going to make the grid a little smaller, something like neighbors from 9 to 20 by 3.
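As an aside, one way to slim the candidates down without re-tuning is tune::filter_parameters(), which keeps only chosen parameter combinations in a tuning result. A sketch; on stream the fix was simply to re-run tune_grid() with a smaller grid and control_stack_grid().

```r
knn_best <- select_best(knn_text_tuned, metric = "rmse")

# Keep only the results for the best parameter combination
knn_slim <- filter_parameters(knn_text_tuned, parameters = knn_best)
```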
That way it can pick the best from that smaller set and run a little faster, and I'll do the same thing for the linear side, which I'm now going to call lin_tuned. If I don't have control_stack_grid() in there, the next step of ensembling the methods is not going to work. And here we go: knn_tuned and lin_tuned, so now I have my configurations, and now I get to call blend_predictions, which works out the best combination. It's always been slow when I run this; I think it's got a bootstrap in there somewhere. What it ends up doing is combining the two together, and then I call fit_members to train those two member models. I'll call the result lin_knn_blended: we're combining a spline (the linear model also has a couple of other terms, but mostly it's a spline on time) with a nearest neighbor on the dimensionality-reduced word set. So now I've got this and it's been fit, and I want to evaluate it: last_fit on the split, collect_metrics. Nope, that doesn't seem to work on a stack. That's okay, I'll just take it and fit it. Oh, it already is fit; what am I doing? It is fit, but it didn't fit with the whole data. This is a thing I've run into again and again: it fit with the training data but didn't include the test set, and including that honestly could make it a little better. But first, let's take a quick look at the performance if I predict on the test set. And this keeps happening: why am I not able to predict? Of course the new data doesn't have rating in it. The two individual workflows work with last_fit; I don't understand. Hold on: step_pca, starts_with, is there a step that's supposed to be skipped in one of these? I just don't know why it's requiring rating. Let me narrow it down: if I take the finalized linear workflow and try to fit it on the full training data, the linear model has a problem; does the k-NN have a problem? No. So the problem is somewhere in my linear model. I have a weird guess. I'm going to try getting rid of the contestant data, because I have the weirdest hunch that it's causing the problem, and then I don't even need glmnet, since I'm not including any tokens; I'm doing nothing except the episode spline, so let's keep it as simple as possible and just train it. Hmm, this might be too much work. What if, instead of step_select, I use step_rm? Let me try putting back all the stuff I got rid of. Nope, it doesn't like step_rm with the negative selection; something about this recipe is not behaving. Okay, with step_rm and without the negation, I have my terms and it trains. So thank you, Emil: step_select was causing the problems. I think it was the "everything except" selection, which is a little curious, but good to know: if I want to remove a couple of columns, use step_rm, not step_select. I guess skip = TRUE on the step might also have worked; no idea.
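Pulling the ensemble pieces together, the stacking step looks roughly like this. It assumes lin_tuned and knn_tuned are the re-run tuning results, each produced by a tune_grid() call that passed control = control_stack_grid() so the out-of-fold candidate predictions were saved.

```r
library(stacks)

# Both tune_grid() calls above must have used control = control_stack_grid();
# without the saved predictions, add_candidates() refuses the objects.
lin_knn_blended <- stacks() %>%
  add_candidates(lin_tuned) %>%
  add_candidates(knn_tuned) %>%
  blend_predictions() %>%   # chooses non-negative stacking weights via a bootstrapped, penalized fit
  fit_members()             # refits the selected member models on the training data

lin_knn_blended
```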
Now that I've got this set up, here it is predicting the holdout, but let me first predict on the test set: bind the predicted columns onto the test data and find the RMSE between the truth, which is rating, and .pred. All right, it did a little bit better than either of the models did by themselves; I remember they each did about 0.437 on their own. So the blended model is the best I've got so far. What I'm going to do is bring it in on the holdout set: predict it, bind_rows, pardon me, bind_cols onto the holdout, select just id and the prediction renamed to rating, and put them in the right order, so here we go, id and rating, and that's my set of predictions. Let's make sure it really was the holdout that went in there. This would be the thing to submit, except I'm going to make one change that I think is actually a little bit important: I'm going to take this blended object and change the training data before the member fit. There's an element in there called train, which right now is just the training portion; I want it to be the whole data set, because I want to train these member models on the full data, otherwise I'm losing 25 percent of my data. It's not perfect, but I'm only predicting the holdout once. So I take it, change it, and fit it; it fits the same kind of weights and everything, and it comes out slightly different than it was earlier, barely. Fingers crossed that it's still essentially the same model; technically we have a little more data here and some of the hyperparameters are effectively different, but this is the best I can do, and I want to do as well on the holdout as I can. So we take this, write_csv to, I don't know, the desktop, and call it linear_knn_blend; I like to name my models.
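The scoring-and-submission step, as a rough sketch: test and holdout are the split's test portion and the competition test csv read in earlier, and the output path is just an example.

```r
# How does the blend do on our held-back test set?
test %>%
  bind_cols(predict(lin_knn_blended, test)) %>%
  rmse(truth = rating, estimate = .pred)

# Predictions on the competition holdout, in id/rating submission format
holdout %>%
  bind_cols(predict(lin_knn_blended, holdout)) %>%
  select(id, rating = .pred) %>%
  write_csv("~/Desktop/linear_knn_blend.csv")
```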
I really should have submitted something earlier, in case I ran out of time; I always end up taking two hours even though I say I won't. One thing bugs me, though: this is a blend of a linear model, a spline on time, plus three tokens, which probably shouldn't be in there. I don't like it; I've decided I'm going to fix that, because it's just so much neater. I'm going to completely drop everything except the spline on time, still leave in glmnet, why not, and this time tune deg_free, moving these models around a little bit, one more run, one last time, with degrees of freedom from one to ten. The story here, by the way, is that even though I'm only using one predictor, the spline breaks it apart into, say, seven components, so my instinct was that regularization isn't going to help; I could drop glmnet and just use lm. But wait, regularization helps when I have too many spline terms? Maybe it does help; look at that, I didn't really expect that. Maybe it's nothing, but okay: the best we can do with time by itself is around ten degrees of freedom along with some regularization. It's a lot better at seven, but then it gets even better down here, so sure, we'll do it by series_episode with some regularization and pick the best model. I'm a little suspicious of this model, but I can kind of see it: seven was better, then this dips under, fine, why not. That's the best model with just time; it just felt off to do it the other way. So now: lin_wf, select_best on lin_tuned, the same for the k-NN, and here are my finalized workflows; oh, this should be lin_tuned. You know what, this is a little better than my last one was, around 0.436. No wait, this one just got worse; I have no idea how that happened, maybe some random something. Wait, what did I change about this model? You know what, I'm done; I'm done fiddling. All right: linear plus k-NN on text, whatever, I'm going to make my submission. The public score is not amazing; it's kind of in the middle, around the same level as others, but that's based on only one or two observations, so I'm not too worried about it. We'll see when the full scoring happens, maybe three months from now; it'll be a long time until we know, but this was the submission. All right, let's talk a little about what we learned and what we did. We brought in the data, and the first thing we saw, trying a bunch of things on ratings, is that we don't have a lot of data, and it really did look like the most interesting thing we could do was something with the effect of time, the effect of series_episode. We explored a few others: there are judges that occur multiple times, there are foods that occur multiple times, contestants mostly don't, and we saw that votes didn't seem to be related to the rating. We tried a few ideas: a flat model gives about 0.44, and then we brought in some linear modeling; we tried it with the time term and got down to 0.405, cross-validated. We tried throwing in a few other things, but they all basically stayed the same; none of them seemed any better. We kept trying to add features, and we learned a lot about tokenizing, about stemming, about which stop words to keep, and then we tried it as a regular spline. The real key here is that you do need the spline term: you can't use a linear term and expect it to be nearly as good. But you could also do nearest neighbor on time, and instead we did nearest neighbor on a dimensionality-reduced text space: we tokenized all the episode notes and the contestant info, merged them together, removed the stop words, kept the hundred most common tokens, and then, instead of using those hundred as linear terms, reduced the dimensionality using step_pca. Once we did that, we actually did see, and I thought this was kind of nifty, I didn't really expect it, that there looks to be some signal in finding the nearest episodes in text space.
So, a combination of k-NN and PCA: you don't want to mix different kinds of features in it, because then it usually won't compare to something like time, which is really strong, but it can be good in an ensemble model to apply it just to the text. That's an idea I'll keep in my back pocket for when we're not seeing success with text through other terms. Then we combined the two models into one using the stacks package, fit it on everything we could, blended it, saved it, and set it up for submission. I don't know how we're going to do, but we'll find out in, I guess, a couple of months. All right, that's it for today. We took a look at Chopped and learned about dimensionality reduction, k-NN, and some other things. I'm not going to do a Tidy Tuesday tomorrow, but I may screencast, or at least join along in the coding with SLICED; I'm not sure yet whether I'll screencast that, and I'll decide a little closer to it. I hope you had fun, I certainly did, and I'll see you next week.
Info
Channel: David Robinson
Views: 1,757
Rating: 5 out of 5
Id: VJzdOacghnU
Length: 115min 4sec (6904 seconds)
Published: Mon Jun 14 2021