ML Monday live screencast: Predicting box office performance with R

Captions
Hi, I'm Dave Robinson, and welcome to another screencast where I'm going to be building a predictive model on a Kaggle dataset as preparation for SLICED. SLICED is a really fun competition happening this summer, starting tomorrow, June 22nd at 8:30 PM Eastern, where I'll be live screencasting myself attacking a dataset I've never seen before and building predictive models, so please do tune in tomorrow for that. In the meantime, as part of this practice, I'm going to be analyzing a dataset I haven't done machine learning on before: a box office prediction dataset. I picked an old Kaggle competition that I thought looked fun. Actually, looking at it now, I think I have analyzed this data once before on a Tidy Tuesday (I can't remember when; I might even have done a predictive model with lasso), but I'm pretty certain I didn't use tidymodels on it, so this is going to be a new approach for me: using tidymodels on this box office data. This will be as close to a dress rehearsal as I can manage.

I've actually got a template set up for kicking off an analysis, loading my packages and such; I'll try posting that tonight or tomorrow morning, just to be transparent about how I set up my workflow. But let's start by entering the competition: I'm going to download the TMDB box office data and open it up. Feel free to follow along if you search for "TMDB Box Office Prediction"; in fact, let me put the link in the chat real quick if you'd like to try competing against me. We certainly won't be the only ones on the leaderboard (actually, we won't be on the leaderboard at all; this is a very old competition), but if you'd like to try predicting yourself, please do join.

What we're going to look at is a dataset from The Movie Database that includes budget, genres, the homepage, the IMDB ID, the original language, the title, an overview, and its popularity. There are 22 columns and we're not seeing all of them, so we'll look at more once it's loaded in, but we're going to try to predict each movie's box office revenue. And it's worth checking the evaluation metric: it's root mean squared logarithmic error, so RMSE after we take the log of the revenue. We'll have to exponentiate the predictions again once we pull them out, but it means this is a regression problem, not a classification problem.

So let's get started. I've loaded up my packages and told it the metric set is RMSE. (Tomorrow I'm going to spend very little time explaining what I'm doing, since I'll be live screencasting; today I'll narrate more.) I load in my data, run this whole setup, and I now have a five-fold cross-validated training set, as well as a test set to use locally to evaluate my models.
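Here is a minimal sketch of that setup, assuming the Kaggle file is train.csv and using object names (mset, grid_control, train_fold) that the rest of the session refers to; the exact template wasn't shown on screen.

    library(tidyverse)
    library(tidymodels)

    # Judged on RMSE of the logged revenue
    mset <- metric_set(rmse)

    # Saved predictions/workflows are needed later for stacking
    grid_control <- control_grid(save_pred = TRUE, save_workflow = TRUE)

    dataset <- read_csv("train.csv")

    # Local train/test split, plus five-fold CV within the training set
    set.seed(2021)
    spl <- initial_split(dataset, prop = 0.75)
    train <- training(spl)
    test <- testing(spl)
    train_fold <- vfold_cv(train, v = 5)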
Let's see what we can explore from this dataset; we always want to start with the training data. There are lots of columns here, definitely stuff I didn't see last time. OK, so belongs_to_collection is a JSON column, so there are bundles of collections in here, but let's start with some simpler ones and go with budget. That's roughly log-normal and on a dollar scale: some very low budget films, but mostly between about $100K and $100 million. Is it predictive of the box office, that is, of revenue? I can plot budget against revenue with both x and y on dollar log scales. One interesting thing is that there's a lot of zero-budget data that doesn't seem to mean all that much, because it spans the entire range of revenue; we'd have to think about that zero budget when building a model. If I add a filter to make it a little cleaner, there's generally a linear trend between the two.

OK, then we can look at genres, and genres is going to be interesting because it's a JSON column. (Have you ever worked with JSON columns before?) The trick is jsonlite::fromJSON, but it doesn't care for this column; I thought it would be JSON. Oh, that's odd: it's using single quotes, and I guess fromJSON only accepts double quotes, so there's some data cleaning to do first: replace the single quotes with double quotes. There are also missing values, so filter out rows where genres is NA. All right, that one worked, but each value becomes a data frame, which actually isn't necessarily bad: I could unnest_wider(genres), but that's not going to work, it'll have id duplicated (can I say names_repair?). Actually, I don't like that approach, because each movie can have multiple genres. The trick is to notice that it's got Adventure, Action... OK, now that I see that, that seems to be how it works. Let's take a quick look: I don't need to parse everything, so if I pull ten random genres values, it looks like it's always an id and a name. And what about collection? Because if this is what I think it is... belongs_to_collection has a name, an id, a poster, a backdrop; I don't need the poster or the backdrop, just the name.

So what this is telling me is that I can extract these out a little more easily than I originally expected. (I'm guessing the Tidy Tuesday screencast already did something like this, because I can't remember doing it before, if I did work with this dataset.) What I can do is str_extract_all on genres, looking for 'name' and then a greedy match. Let me try this on one example of genres first.
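A quick sketch of that single-value experiment: swapping the single quotes for double quotes so jsonlite::fromJSON will accept it. This is the attempt being described, not the final approach; names containing apostrophes would break it, which is part of why a regex wins out below.

    library(jsonlite)

    # One genres value, e.g. "[{'id': 12, 'name': 'Adventure'}, ...]"
    train$genres[1] %>%
      str_replace_all("'", "\"") %>%
      fromJSON()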
I need to build a function called extract_json_names, since I'm going to use this in a couple of places: from a string, it does str_extract_all to take all the things that match the pattern. So extract_json_names on that example... yep, it grabs Adventure, Action, Family. But I only want to grab the capturing group; how do I do that? Oh yeah, str_match_all. That creates a matrix per string, and then on each item I just grab the second column, which is the capturing group. So now this creates a list column of Adventure, Action, Family, but I actually want map_chr: I want to combine these into, let's say, a semicolon-delimited string, which is going to be easier for me to parse later than a list column. So on each of these I paste with collapse ";", and, oops... there we go, now I actually get "Adventure;Family" and so on.

So why am I doing all that? So that I can mutate belongs_to_collection = extract_json_names(belongs_to_collection), and now any of those collection names are pulled out of their JSON column, and then do the same thing to genres, because I don't need to keep any of the ids; most of these columns I don't need much from. Are there any other JSON columns with multiple values in them? Probably cast does. Let's see: overview, popularity, poster_path, production_companies... yeah, I can grab just the names out of those too. So in fact I'll mutate across each of belongs_to_collection, genres, and production_companies, applying extract_json_names; that way I can easily add additional columns to apply it across, which is super handy. production_countries also, spoken_languages, Keywords, cast, crew: all of these are one-to-many, and lots of text. This is going to be a blast. (Keywords... it's not lowercase; I did it twice and it still didn't work. Keywords has a capital K. That's odd, but I can roll with it.) All of these are now semicolon-delimited. Hopefully nobody has semicolons in the middle of their name; I'm not going to check on that.

OK, so we have tons and tons of categorical things. That means for some we'll be doing text parsing and for some we'll be doing semicolon parsing, but there's definitely a lot of one-to-many categorical setup here.

All right, let's see what else we have to work with. I have a couple of categorical things, like original_language, and besides budget I've got popularity. (Don't put popularity on a dollar scale.) In fact, I wonder if popularity goes on a log axis: it does. It's kind of like budget in that there are a few really small values, though here not so much. And by the way, are there any zero revenues? No, there aren't. OK, that makes sense, since we're judging by RMSE on the log scale.
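Here is one way to write extract_json_names and the cleaning wrapper as described; the 'name': '...' regex and the clean_data() wrapper (which the session settles on a little further down, along with renaming Keywords) are reconstructions, and NA values would need the filtering mentioned above.

    # Pull every 'name': '...' field out of a JSON-ish string and collapse
    # the matches into a semicolon-delimited string
    extract_json_names <- function(s) {
      str_match_all(s, "'name': '([^']+)'") %>%
        map(~ .x[, 2]) %>%              # column 2 is the capturing group
        map_chr(paste, collapse = ";")
    }

    # Applied once, up front, to every one-to-many JSON column
    clean_data <- function(tbl) {
      tbl %>%
        mutate(across(c(belongs_to_collection, genres, production_companies,
                        production_countries, spoken_languages, Keywords,
                        cast, crew),
                      extract_json_names)) %>%
        rename(keywords = Keywords)
    }

    train <- clean_data(train)
    test <- clean_data(test)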
All right, so popularity might not have such a strong correlation with revenue. Let's look at what else we have to work with. We have original_language, and I'm going to create a little function, summarize_revenue, because I find myself using this a lot: take whatever grouped table and summarize the average revenue as mean(revenue)... which is actually not how I want to work with it; I want median revenue, since I'll mostly be using that, and I could use the geometric mean too, and in fact I will: geom_mean_revenue is exp of the mean of log(revenue). I don't need to add one, since there are no zeros here. And I always include n = n(). Why? Because then I can group by original_language, arrange by descending n, and get back English, French, Russian, and so on. Then I can fct_lump the language before summarize_revenue, lumping into, say, the top 10 (mostly they're English, but sure), and plot immediately: mutate original_language to fct_reorder by median revenue, and plot median revenue against original_language with geom_point, size = n (see the sketch below). A nifty little graph: I'll leave x on a continuous scale but with labels = dollar. Whatever the "zh" original language is, it's definitely not very common, but it's got a higher median revenue, so maybe that'll turn out to have a positive impact. Besides that, there are not a ton of categorical variables here.

There are some we're not going to use at all, like homepage, though "does it have a homepage" is something I can definitely transform: I'm going to throw that into my little data cleaning step while we're feature selecting, since it's a bit of information. Then we've got text like overview; the poster I'm just not going to use; we've got production_companies and production_countries; and release_date. For release_date I'm going to want mdy(release_date) to turn it into a date, because I'm probably going to want to create things like year, and there's both a year and a month I can get out of this, which might be kind of interesting.

There's also runtime; let's take a quick look at runtime as its own predictor, first without a log scale. There are some short things, some long; I'm not seeing a trend, but with geom_smooth(method = "lm"), you know, there is a trend, and I wonder if it's a little clearer with scale_x_log10; that at least got rid of those. It is plausible that some of these shorter movies run lower than the longer ones, so I'm certainly going to want to include runtime in terms of its relationship with revenue, as well as budget and such. My minimal intro model would have budget and runtime, and then I'd start adding from there, but really I think it's going to be the categorical ones that matter. Somebody in chat is quickly noting the numeric predictors: definitely budget, also runtime, and popularity (hard to see); all of them go on a log scale, and I'll log them with an offset of plus one.
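A sketch of the summarize_revenue helper and the language plot just described, assuming revenue is still on the raw (not-yet-logged) dollar scale at this point in the session:

    library(scales)

    # Median, geometric mean (exp of mean log; no offset needed since
    # revenue is never zero), and a count for sizing points
    summarize_revenue <- function(tbl) {
      tbl %>%
        summarize(n = n(),
                  median_revenue = median(revenue),
                  geom_mean_revenue = exp(mean(log(revenue)))) %>%
        arrange(desc(n))
    }

    train %>%
      group_by(original_language = fct_lump(original_language, 10)) %>%
      summarize_revenue() %>%
      mutate(original_language = fct_reorder(original_language, median_revenue)) %>%
      ggplot(aes(median_revenue, original_language)) +
      geom_point(aes(size = n)) +
      scale_x_continuous(labels = dollar)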
That said, a budget of zero might really be more like NA; that's worth noting. Oh, and status: do we have a status column? I already have my little setup here, so group by status: "Rumored"? There's revenue for rumored ones; I don't even understand how that works. Sure, though: there are only four, so rumored is not going to be important; only four things in that category.

OK, and let's note all the text columns while we're at it. There are the categorical, multi-value ones: genres, production_companies, production_countries, spoken_languages, keywords (for some reason capitalized), cast, crew. I think that's everything that's multiple-out-of-one; these are some of the things we can add as categorical predictors. And then there's actual free text, which we could include: original_title, overview, tagline, title. I don't know the difference between original_title and title; maybe the difference is about translation, I'm not sure. I'm probably just going to go with one of them. Overview has the plot. Yeah, this is going to be fun.

All right, the one thing I want to look at before categorical and text: I think we're going to do a linear model on some numeric predictors, but first I really want to understand the role of date and time. The way to do this is to start with group by year. Oops, I did not use the cleaned training data; I'm doing the cleaning in a separate step here, though later I'll mess with it a little more. Summarize revenue, filter for n of at least 10, and plot year against median revenue. This is probably adjusted for inflation, because we can see that it's not really going up over time. Maybe it's changing a little, but the low point for median revenue was around 2010, and maybe the median isn't changing even where the max is; we can't say that much about this. What if I only keep years with at least 20 movies? I did hear about these years being the age of the blockbuster; I don't know. But these figures probably are adjusted for inflation, which makes sense: you'd otherwise get a lot of power just out of correcting for year.

Instead of year, let's try looking at decade, with truncated division. (Where am I on time? Not quite half an hour; that's good for exploratory analysis.) And geom_point... all right, so we have some movies in the distant future? Something's up with that. What if I arrange by descending release_date? What's going on here: I have 2068, 2067, 2066... I bet... oh, look at this: Rosemary's Baby. That's not 2066, that's 1966.
Thunderball: these are movies that came out in the '60s. OK, so for any movie that's after the year 2040, or heck, 2030, I probably want to subtract 100 years. (Can I do years(100)? Oh yeah, I can; this is fun.) So what I want for a bit of data cleaning is: release_date = if_else(release_date is past the cutoff, release_date - years(100), release_date); see the sketch below. And now I'll try this again... there you go, they're where they belong. That years(100) time-delta is very handy.

OK, so it does look like there is a trend from year, but a non-linear one. Look at that (that's year rounded to every five years): there's definitely a trend, it just goes up and down rather than in spikes. (Oh, something's up with my year down here... there we go.) So: there is a nonlinear trend from year.

And what about time of year? Let's look at month = month(release_date, label = TRUE). Is it true that movies released in the summer tend to have higher revenue? It's true, and I guess that's when the blockbusters come out. (I'll need group = 1 to draw the line.) Oh yeah, look at this, check this out. And a quick scale_y_continuous(labels = dollar)... there's a September dip that's really pronounced, so much so that I wonder: if I did week instead of month, is that also impactful? Is month going to end up as something like a linear term? Wow, look at this: group by week = week(release_date), which I think gives the week within the year; let's find out. Yeah, look at that, you can't even see it all; there are a couple of spike weeks within the year, probably the start of summer and the start of December, that kind of thing.

I think this is cool, but I'm actually really interested in looking at this over time, so I'm going to group by both: some effect of day-of-year or week-of-year, but the main one I'm interested in is how this interacts with year, so year = year(release_date). I'll break it up by five, say color = factor(year) and group = year... too many years; I'm going to need to change this to decade for one thing, and for another, filter for year of at least 1980. Yeah, so in the 1990s the median revenue in June just spiked up a whole bunch. Huh. And 2020 is a weird year, with the pandemic, it looks like... wait, but nothing should be in 2020: this dataset is from two years ago, so it's not the pandemic. Those should be the 1920s; I think I messed up my cutoff.
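The date parsing and the two-digit-year fix, as settled on after this back-and-forth; the 2020 cutoff is the corrected one (1920s films, not pandemic-era ones):

    library(lubridate)

    train <- train %>%
      mutate(release_date = mdy(release_date),
             # Two-digit years parsed into the future belong a century back
             release_date = if_else(release_date > ymd("2020-01-01"),
                                    release_date - years(100),
                                    release_date))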
If it's greater than 2020, then subtract a hundred years. I could check, but I think those movies were from different eras entirely. So this had nothing to do with the pandemic; the dataset is from several years ago. OK, so the point is that there is a bit of an interaction between month and year, not enormous, maybe it's mostly just that one month, but we can see some things going on here. All right: I'm going to want a term for time within the year, and a non-linear term for year. Let's get to building a linear model.

OK, one of the first things to do is put all these cleaning steps into a recipe. And now I'm thinking for a second about how to do that... yeah, recipes. I remember this being upsetting for me: how do I set the formula of a recipe? The reason I'm asking is that I can add a formula, I can select, I can step_rm, whatever, but I'm just not happy with this setup, because I'm going to need to pick the formula up here and work with it. I'll come back to this later, but it annoys me to be doing all these cleaning steps in it. In any case, I'm going to set this up once; this does a lot of my cleaning, and I'm also going to step_rename keywords = Keywords, because I don't like that capital K. I'm trying to decide... these steps will apply to all the training and all the testing data at the same time. So, here we go: the recipe takes my training data (oops, not that one), and I'm going to be predicting revenue based on everything, but I want to start just one by one... of course, I can't start one by one with that formula. All right, here's what I'm going to do: I'm going to move this up out of the recipe into a clean_data() function and just apply it once to all my datasets. Huh, "keywords does not exist"? I just saw Keywords a second ago... there it is: that's because I didn't change this. OK: now the dataset is cleaned, and the held-out set is also cleaned, with everything semicolon-delimited. I just wanted that initial cleaning to happen up front; now I don't need this anymore and can do everything based on train. (Nope, I just said something that's not true... oh, because I didn't rerun this entire block, the steps that split it up. OK.) So here's my train, here's my revenue over month, here's my by-month-and-year, and here's my week of release date.

What I'm going to do is create my recipe. I'll call it the linear recipe, and I'll start really simple with just my linear terms: I know budget is important, I know runtime is at least a little important, and I'm not sure popularity is important, but it doesn't really hurt to add it. And I'm going to use release_date: I'll step_mutate release_year = year(release_date) (you know what, I'll do that feature engineering now) and release_week = week(release_date), the week within the year, then step_rm(release_date). To see what this does, I prep() and juice() it: here's the clean data we have so far. Then I can fit a model.
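A sketch of that starting recipe; it assumes revenue has already been log2-transformed during cleaning, which is where the session eventually moves that step:

    lin_rec <- recipe(revenue ~ budget + runtime + popularity + release_date,
                      data = train) %>%
      step_mutate(release_year = year(release_date),
                  release_week = week(release_date)) %>%
      step_rm(release_date) %>%
      # log2(x + 1) the skewed numeric predictors
      step_log(budget, runtime, popularity, offset = 1, base = 2) %>%
      step_impute_mean(runtime)    # runtime has a few NAs (found below)

    lin_rec %>% prep() %>% juice()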
The way I go about doing that is with a workflow: add_recipe with the linear recipe, add_model, and I'll just do a linear_reg() here with set_engine("lm"), regular lm. Oh, but one thing I forgot: it's very important that we do a step_log on the outcome. Actually, I'm going to put that in the cleaning step, because we never want to forget it... no, I'll describe it here first: step_log on revenue, with skip = TRUE, because it should only be applied to the training data; log the revenue with base 2 (I like base 2), and we don't need an offset. Then I'll add another step_log without skip, because this is predictor data and does need to run on the eventual test data: runtime and popularity... did I miss any? Those were the ones I remembered... oh, budget too. So budget, runtime, popularity, with an offset of one (meaning plus one) and a base of 2. And now if I prep and juice the linear recipe, I've logged each of those, and I've got my release_year and my release_week.

The other things I'll add here are a step_ns for release_year and a step_ns for release_week: a natural spline, where each of them gets degrees of freedom deg_free = tune(), tuned separately. So I'm going to tune a spline on year and a spline on the week within the year, plus my two other terms. Now I've got my linear recipe (I usually call it lin_rec) and my workflow, so let's set up the tuning: lin_tune is lin_wf, tune_grid on the five-fold training resamples, metrics = mset (did I already create that? Yep, I did), and the control is grid_control, which I created earlier; it's useful for a couple of approaches. For the grid I like crossing(): deg_free for year from 1 to 4, and deg_free for week, 1 to 4 as well. So I'm adding a couple of spline terms on this week term and this year term and seeing what we get from five-fold cross-validation. Remember, we're adding no categorical terms yet; that's also why I'm not doing any penalty, since there aren't a lot of degrees of freedom in play. Then autoplot() it. (I did set it to use multiple cores, which is useful.) Boom.

One thing I see is... wow, something's up with this RMSE; I'm doing something wrong. Is it not taking the log on the eventual data? OK, yeah: I'm going to put the log in the cleaning steps, revenue = log2(revenue), right in there, and remember to exponentiate it later (it's not in the holdout set anyway). All right, rerun the training set... I don't need this step anymore... try this one more time. This is supposed to say: as we change the degrees of freedom on year and on week, how does the RMSE change? We can mostly ignore that first run: I think it was predicting on a log scale and then comparing on the raw scale.
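The spline-tuning setup as narrated, with my own tune IDs for the two deg_free parameters:

    lin_rec_splines <- lin_rec %>%
      step_ns(release_year, deg_free = tune("year_df")) %>%
      step_ns(release_week, deg_free = tune("week_df"))

    lin_wf <- workflow() %>%
      add_recipe(lin_rec_splines) %>%
      add_model(linear_reg() %>% set_engine("lm"))

    lin_tune <- lin_wf %>%
      tune_grid(train_fold,
                metrics = mset,
                control = grid_control,
                grid = crossing(year_df = 1:4, week_df = 1:4))

    autoplot(lin_tune)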
That's not doing any good for anybody. Oh, that's what happened: I did a step_log on the outcome inside the recipe, which is no good; I'd need a post-processing step to exponentiate, and I don't think I can do that. So disregard that run. The key here is that we need two degrees of freedom for year, and it looks like not really more, while every degree of freedom for week was kind of helping. So I'm going to do one or two for year and one to eight for week, and let's find out how deep this rabbit hole goes: can we keep getting the RMSE down?

This is a good time to go to the leaderboard and ask: what is a good RMSE? This isn't a competitive competition, you don't win anything for being at the top, but it still gives us a sense of where we'd fall. We want to be in roughly the 1-to-2 range to be in a typical spot; something like 3.5 would not land us very well, well under the top thousand. That's all right; we just got started.

So let's autoplot: here we go. It looks like six degrees of freedom for week might be the best, with two degrees of freedom for year. Sure, let's say six for week. We can finalize that; normally you'd keep the tuning code, but I'm just going to keep editing in place, which is how you go fast with this. Two for year, and maybe I'll be conservative and do four for week; I don't like to overfit, but I'm going to be adding penalties later, so I won't worry about it too much.

All right, let's start adding other terms. We've got numeric predictors and date-time; the next key is the categorical ones, and for that we can start with step_tokenize on genres, where we say token = "regex" and options = list(pattern = ";"). That way we split them up on the semicolons, and we've got genres split into tokens now. With those tokens, I'm generally going to follow with step_tokenfilter on genres to only keep the common ones, max_tokens = let's say, I don't know, 50.
But let's actually talk for a second about that limit. What we want to ask is: what is the effect of genres? I can use separate_rows on genres and then count(genres, sort = TRUE) to see the most common genres, but I don't actually want a count; I want to group by genres and summarize_revenue (sketched below), and ask about the median... oh, this is on a log scale now. You know what, I'm going to be silly and take two to the power of it; that's why the geometric mean is so handy, and this way I can keep this fun function going for these explorations. There are 23 genres, so I'll just try all of them (soon we'll be writing a new function for working with these): genres, geom_point. So there is kind of a gap, and you can see there are missing values. Having no genre is bad; is having a documentary bad for your revenue? And foreign? All of these are seemingly... wait, are "TV" and "Movie" the same category? Let's check real quick: filter genres with str_detect for "tv", pull genres... oh, OK, the problem was that in separate_rows I didn't actually tell it where to separate: I didn't pass sep = ";". With that, "TV Movie" comes out much better, and there's only one of them, which sounds like it's the Doctor Who movie.

Then I'm curious: if I throw on geom_text with label = n, rather than making size its own aesthetic, just to show how many there are in each... vjust = 1, hjust = 1... there it is. Now it's clear there's only one TV Movie and not very many missing. We could filter for at least 20 here, but even the NAs are kind of useful... though NA isn't a token, is it? I still find them interesting; I don't really want to throw in a filter, but I'm going to use something like 22 tokens, which keeps all but one, I suppose. Does step_tokenfilter allow a minimum percentage? Hold on, let's look at this: min_times and max_tokens. Oh, min_times, all right, there it is: min_times = 5. Why do I like that? Because now I can prep and juice it... oh, "max_features was set to 100".
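The genre exploration as a sketch, using the summarize_revenue helper from earlier (revenue still on the raw dollar scale at the point that helper was written):

    train %>%
      separate_rows(genres, sep = ";") %>%
      group_by(genres) %>%
      summarize_revenue() %>%
      mutate(genres = fct_reorder(genres, median_revenue)) %>%
      ggplot(aes(median_revenue, genres)) +
      geom_point() +
      geom_text(aes(label = n), vjust = 1, hjust = 1) +  # show counts per genre
      scale_x_continuous(labels = dollar)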
That's fine. Then, after the token filter, I do step_tf(genres), and now I've got a column for each genre: tf_genres Western, Thriller, Science Fiction, Romance, Horror, Music... So now I've got a one-hot encoding of this, and it can actually fit into something like a linear model. I think this is going to be pretty impactful, so let's try it right away. And nope, I'm not going to tune the splines any more, but instead of a plain linear_reg I'm going to use linear_reg with glmnet and penalty = tune(). The trick is that I can try a whole bunch of penalties at once... except I can't, probably because I'm missing something. Let's see what I did; missing data is always a good thing to check. (Oh, the "max features" message was just a warning, not the problem.) What missing data do I have? filter(!complete.cases(.)) is a good way to find any: runtime has NAs. All right, so I'll do a quick imputation step. What is it called... step_impute_mean, that's a good one: impute runtime with the average. (Oh, it's step_impute_mean now; good to know.) Let's see how it does once I do that, and let's look at my fun autoplot again: oh, look at this, regularization helped a little bit. So it added a little value relative to my last model, but not a ton; still, this dip means I do want some regularization.

OK, so that's genres. Let's add a couple more. Oh, I forgot about the plain categorical one, original_language. I'm going to organize these in levels: here are the numeric ones, here's the date one, and here's the categorical one, original_language. For original_language you want step_other: if a language is under one percent, it just gets lumped into an "other" category. (Yeah, I know about that one, and I know about the warning.) Heck, here I'll just do max_tokens = 20 to get rid of it.
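Pulling the categorical pipeline so far into one sketch: the semicolon tokenization, the frequency filter, term-frequency columns, the language lumping and dummies, plus a tuned lasso. The spline degrees of freedom are the finalized ones from above, revenue is assumed already log2-transformed in cleaning, and this overwrites lin_tune in the same edit-in-place spirit as the session:

    library(textrecipes)

    rec_cat <- recipe(revenue ~ budget + runtime + popularity + release_date +
                        genres + original_language,
                      data = train) %>%
      step_mutate(release_year = year(release_date),
                  release_week = week(release_date)) %>%
      step_rm(release_date) %>%
      step_log(budget, runtime, popularity, offset = 1, base = 2) %>%
      step_impute_mean(runtime) %>%
      step_ns(release_year, deg_free = 2) %>%
      step_ns(release_week, deg_free = 4) %>%
      # semicolon-delimited genres -> one tf_ column per common genre
      step_tokenize(genres, token = "regex", options = list(pattern = ";")) %>%
      step_tokenfilter(genres, min_times = 5, max_tokens = 20) %>%
      step_tf(genres) %>%
      step_other(original_language, threshold = 0.01) %>%
      step_dummy(all_nominal_predictors())

    lasso_wf <- workflow() %>%
      add_recipe(rec_cat) %>%
      add_model(linear_reg(penalty = tune()) %>% set_engine("glmnet"))

    lin_tune <- lasso_wf %>%
      tune_grid(train_fold,
                metrics = mset,
                control = grid_control,
                grid = crossing(penalty = 10 ^ seq(-7, -0.5, 0.5)))

    autoplot(lin_tune)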
All right, so I've got my original_language, and when I count it, that adds a couple of languages. But I'm also going to throw in a couple more while I'm at it: production_companies and production_countries. Let's see what we get... something might have had an NA... let's see what I did. What happens when I prep this and filter for incomplete cases... oh, silly me, I forgot something important: for these categorical ones, I believe I have to do step_dummy on all the remaining nominal predictors, after the text steps. Very important: it doesn't like my factor predictor and wants it to become dummy variables, those zeros and ones. (Fun fact: tomorrow I'll be screencasting this live process.) And oh yeah, look at this regularization curve: we'd be overfitting a lot, but regularization gets it down to... still 3.3. Not really cracking 3.3 with a linear model; maybe that gets better when we get to xgboost and friends.

But let's see... I should also do step_tokenfilter on production_companies and production_countries: max_tokens = I don't know, keep 50 of each; actually, instead of max_tokens I'll do min_times = 20. Oh, what was I doing... oh, that's what happened: I did not tokenize production_companies and production_countries. One had a filter and the other didn't, and they weren't being split on semicolons at all; that's what happened with the overfitting. The model actually got a little better once I did it right, just a little, like 3.2-something.

So what else do we have that's categorical? spoken_languages, cast, crew, keywords... oh, stupid, stupid me, I totally forgot: I had turned them all into dummy variables; that's a huge nightmare. What I need to tokenize (see the sketch below) is genres, production_companies, production_countries, spoken_languages, cast, crew, and keywords. (I wish I had an all_tokens() selector; that would be a nice addition.) Genres I treat specially, because I want to keep the ones that popped up only five times, though that probably doesn't help us much. So I tokenize them all, and let me check: any missing data? Nope. (These imputation messages are not a big deal, and I wish it would listen to min_times and understand that I don't want the max_features thing.) OK: min_times = 20, only include values that appear at least 20 times, and otherwise this gets pretty wide. I'll run this through... all right, we're getting better, but we're not really on the board yet.

But I'm really interested in this model, so let's take a quick look at it. (These warnings are not problems.) With .extracts I can try to grab one of the fitted models... the challenge is that grabbing one of them is kind of an adventure. Actually, I can just take the workflow and fit it directly.
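All the one-to-many columns run through the same tokenize/filter/tf treatment, as the session settles on; the thresholds follow the narration, and the recipe is shown standalone for completeness:

    rec_multi <- recipe(revenue ~ budget + runtime + popularity +
                          genres + production_companies + production_countries +
                          spoken_languages + keywords + cast + crew,
                        data = train) %>%
      step_log(budget, runtime, popularity, offset = 1, base = 2) %>%
      step_impute_mean(runtime) %>%
      step_tokenize(genres, production_companies, production_countries,
                    spoken_languages, keywords, cast, crew,
                    token = "regex", options = list(pattern = ";")) %>%
      # genres keeps rare levels; everything else needs 20+ appearances
      step_tokenfilter(genres, min_times = 5, max_tokens = 20) %>%
      step_tokenfilter(production_companies, production_countries,
                       spoken_languages, keywords, cast, crew,
                       min_times = 20, max_tokens = 100) %>%
      step_tf(genres, production_companies, production_countries,
              spoken_languages, keywords, cast, crew)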
But I'm going to add the text first, real quickly. So what else do I have in here? I have overview, tagline, and title. Those get tokenized as natural language: step_tokenize on overview, tagline, title, this time by words; then a step_tokenfilter on overview, tagline, title (I can tune this, but I'll set a max); I'll remove stop words from overview, tagline, title with step_stopwords; and then a step_tf of these three. Run them through the whole gamut again: lots and lots of terms thrown into this linear model. So what would fail? If I juice this, what do I get... oh, that's a confusing error. When does this error happen? Let's experiment a bit: what if I just had overview, tagline, title... what if I just had title? It does not like my attempt to tokenize title. (Is the column called title? It is.) If I remove the stop words step, is it good? Something's broken about the stop words... or is it the min_times mixed with the stop words? We're debugging on the fly here; it's something about the combination. All right: step_stopwords, then the token filter, on title, overview, and tagline, checking once that it works. And then, how big is this dataset? 507 features. Woof; that'll get you up in the morning. We can mess with a lot of these filters.

The reason I'm doing all this is that I want to take this linear workflow, finalize_workflow with select_best on the tuning results, and fit it on train. But first I'll do a little more tuning on things like max_tokens; actually, I'll do that now. The stuff I might be interested in: genres are simple, because there aren't a lot of genres, but here I might want min_times = tune(), leaving the word token filter at a hundred of each, whatever. So let's try min_times = 3... what if we require at least 10 appearances? 30? 50? 100? I just want to see what happens as we change these, so this is where I tune min_times across production_companies, production_countries, spoken_languages and such, setting different minimums and wondering whether that minimum is actually helpful in this model. So I'm trying four values... 10 is really good; I might even bring it down to 3 or something like that. (That error is not a problem; just kind of a palate cleanser.) Oops, why does crossing... wait, is min_times not a tunable parameter? Maybe not. If so, I'll do max_tokens = tune() and try some different values, though I prefer tuning by the minimum rather than max_tokens, because it's applied to each of these columns, and the right max_tokens for each probably differs based on cardinality. But I should really be moving on to xgboost in my second half hour, so let's take a look. Here we go, that's a little more like it... but see, if I do 100 tokens it's not even as good as what I had a second ago.
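The free-text steps in one standalone sketch, in the order that survived the debugging above (tokenize, stop words, then the frequency filter); max_tokens = 100 stands in for the values being tuned:

    rec_text <- recipe(revenue ~ overview + tagline + title, data = train) %>%
      step_tokenize(overview, tagline, title) %>%        # plain word tokens
      step_stopwords(overview, tagline, title) %>%       # drop stop words
      step_tokenfilter(overview, tagline, title, max_tokens = 100) %>%
      step_tf(overview, tagline, title)

    rec_text %>% prep() %>% juice()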
The one where min_times was 10, that is; this isn't even breaking that, so I'm just going to do min_times = 20 on this one. What else can I tune here? I could tune the year spline a little more. (As for a chat question: I didn't use glimpse to check out the data, I used View, and I'd spend a little time looking across it.) Oh no, I was wrong: I thought I was beating 3.1, but I was doing something off. Oops: max_times versus min_times; that's a big difference. Yes, with that fixed, this is the best model I've had so far. OK, I'm going to use this model for a second; I think it's reasonable, even if there are probably a few more tweaks I could make. Oh, and I neglected something: let me throw in has_homepage real quick. Remember, I added that early on: does it have a homepage or not.

So the to-do is: take this whole workflow, fit it, and work with the result. Let's load it in and ask: what are some of the coefficients? Now, some of these are numeric coefficients, so they can't be directly interpreted; the rest are generally term frequencies, so they can be interpreted more easily. I still might want to scale them... no, not right now; but the next step will be a scaling that gives a different kind of interpretation. So, at our best penalty: take the top 30 by absolute value of the estimate, geom_col of estimate by term, mutate term = fct_reorder(term, estimate), and filter out the intercept (sketched below). I just want to see what's positive and what's negative. This shows some impact, but... yeah, let me scale them after all. So: step_scale on all numeric predictors; rather, I'll do step_normalize, which also centers them (mean zero, standard deviation one). The lasso does this internally anyway, so it won't make the model any better, but it will make the fitted model more interpretable in terms of which terms matter most.

So here it's like: popularity and budget (remember, on a log scale) are really important; has_homepage turns out to be moderately important, which is cool; and then a lot of tf terms, where tf is always a term frequency: does it contain this word, or technically how many times it contains it. When I allowed a hundred tokens there was too much "independent film" on one side; now it's popularity, budget, Adventure, Universal Pictures, Paramount Pictures, and so on. I could parse this apart quite a bit, but I won't do much more, because I do want to get to the other models; I just wanted a quick look at which terms are important.

Now I'm going to go ahead to random forests and xgboost.
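A sketch of fitting at the best penalty and plotting the largest coefficients. glmnet's tidy() output contains the whole regularization path, so this picks the path entry nearest the selected penalty; that trick, and the object names, are my own:

    best_penalty <- select_best(lin_tune, metric = "rmse")

    lin_fit <- lasso_wf %>%
      finalize_workflow(best_penalty) %>%
      fit(train)

    lin_fit %>%
      extract_fit_parsnip() %>%
      pluck("fit") %>%
      broom::tidy() %>%                 # full glmnet path: term, estimate, lambda
      filter(term != "(Intercept)") %>%
      group_by(term) %>%
      slice_min(abs(lambda - best_penalty$penalty), n = 1, with_ties = FALSE) %>%
      ungroup() %>%
      slice_max(abs(estimate), n = 30) %>%
      mutate(term = fct_reorder(term, estimate)) %>%
      ggplot(aes(estimate, term)) +
      geom_col()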
I like to do a forest first, because the forest tells me how many trees I need, and then xgboost builds on top of that. These are tokenization steps I may need to redo, but I'm going to start with just a few linear terms in a forest. So here's what I'll do: grab all my cleaning, decide what I need and what I don't (I can bring things back later). I don't need these; I'm going to drop all of the tokenization for a minute, and I don't need to log anything anymore. I'll leave the dummy step, keep the year and week terms, and leave in original_language, popularity, and has_homepage, which we saw was kind of nice. (Oh, I'm missing a parenthesis somewhere... there it is.) I'll call this rf_rec, and rf_wf is workflow, add_recipe(rf_rec), add_model(rand_forest(mtry = tune(), trees = tune())), set_engine("ranger"). I've started writing these inline like this because it saves so many lines of code. Now tune_grid, with mtry from 2 to... how many columns do I have if I prep and bake this? Actually, you know what, I'm going to drop language for a bit: I think random forests don't do as well with categorical variables that have rare levels, so I'll drop those for a second. So: rf_wf, mtry, and trees of 100 or 300, let's say; try a couple of these.

So one thing we're seeing is that the RMSE is actually worse the more predictors I use, which is kind of a surprise to me among budget, runtime, popularity, has_homepage, and release date; I don't see that a lot. Just for kicks I'll throw in mtry = 1. I don't think this is the problem, and mtry = 1 winning really would surprise me; the forest already is kind of competing with the best linear model we had. This makes me wonder a bit why more parameters hurt; it really seems like there aren't interaction terms going on here, which is not the kind of thing I'm accustomed to making this much of a difference. OK: you do need two, not one, but not more; mtry = 2, good to know. And more trees was a little better. All right, so that's a bit of a start with random forests.

What would I throw in next? Genres, I think. Well, I can throw in a lot as long as I add some harsh filters, so here's what I'm going to do: throw all of these back in, which normally would make the feature space enormous, and then add step_nzv, near-zero variance, on all predictors. Then if I take rf_rec and prep and juice... (oh, and I need to uncomment these two lines; "genres doesn't exist", let's see where I went wrong... there it is). What step_nzv does is remove the low-variance columns, generally the ones that are mostly zero. If I change freq_cut to something like 95/5, or say 90/10, that's nine to one, I remove even more, so then I don't have a lot of terms in here. I'm going to leave it at 95 to 5.
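The random-forest pass as a sketch. It assumes has_homepage = !is.na(homepage) was added during cleaning (recreated here for completeness), and it folds in the genre tokenization plus the near-zero-variance filter discussed here:

    train <- train %>% mutate(has_homepage = !is.na(homepage))

    rf_rec <- recipe(revenue ~ budget + runtime + popularity + has_homepage +
                       release_date + genres,
                     data = train) %>%
      step_mutate(release_year = year(release_date),
                  release_week = week(release_date)) %>%
      step_rm(release_date) %>%
      step_impute_mean(runtime) %>%
      step_tokenize(genres, token = "regex", options = list(pattern = ";")) %>%
      step_tf(genres) %>%
      step_nzv(all_predictors(), freq_cut = 95/5)   # drop mostly-zero columns

    rf_wf <- workflow() %>%
      add_recipe(rf_rec) %>%
      add_model(rand_forest(mtry = tune(), trees = tune()) %>%
                  set_engine("ranger") %>%
                  set_mode("regression"))

    rf_tune <- rf_wf %>%
      tune_grid(train_fold,
                metrics = mset,
                control = grid_control,
                grid = crossing(mtry = seq(2, 10, 2), trees = c(100, 300)))

    autoplot(rf_tune)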
That's the default. The key is that this removes any columns that don't have at least a decent share of ones in them, so it kind of applies that "othering" across all of them. Let's give it a quick shot. The best version we had, with just the numerics, country, and release date, was about 3.12; let's see what we get when we add this. (Later I'll tune it, but for now 300 trees shall do. No, wait, I do want this back... OK.) So I'm adding a couple of terms, and I could tune freq_cut too. I've found that adding these kinds of rare categorical variables to a random forest doesn't go all that well; I usually want some kind of dimensionality reduction or feature selection first. What I see here is: OK, you're going to need more splits, which makes a little sense; seq(2, 10, by = 2) for mtry, and let me try trees = 200. Soon I should get to xgboost, because xgboost does tend to win here, but I like getting a sense of how I might winnow down the set of features. Usually I might use xgboost on all the big stuff and then combine it with a linear model; that kind of thing tends to work well. OK, so mtry = 10 is best here, and that's still not really competing all that well.

So let me jump ahead to xgboost for a second, because I'm really curious. xg_rec: I can grab all this stuff back later, but I'm going to drop all of these and keep just this set (leaving the nzv in doesn't matter, and it's so nice to look at the dataset every once in a while). Then I'll do an xgboost version: boost_tree with mtry = tune(), trees = tune(), and learn_rate = tune(), set_engine("xgboost"). What I love about xgboost is that it keeps all the trees, so I can easily say trees from 100 to 1,000 by 100, and for the learning rate I often start with 0.01, but I'm also going to try 0.03; let those two duke it out. And mtry: we said 2 was best for the forest, and 1 is usually not going to win, so I'll try 2 to 4. Heck, let's go to 1,200 trees.

What we're seeing here is that I don't need to tune trees so carefully, because I can make a graph of how it learns over time (by "over time" I mean as the trees accumulate) and see where we hit diminishing returns. This so far is just with this feature set; again, I may want dimensionality reduction on the remainder, on genre and everything else, or some kind of feature selection could be handy, but we're starting here. One thing I see is that we dip very quickly, so much so that I'll drop the faster learning rate in the future. If I look at the tuning results... ah, it's so hard to even see; I'll collect_metrics and filter for trees greater than 300, or you know what, I'll autoplot with coord_cartesian(ylim = c(3, 4)). All right: still not quite leveling off, but in this case I only needed around 500 trees, and two predictors was doing best, pretty similar to what we saw from the random forest.
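The xgboost setup as narrated, reusing the slimmed recipe above for brevity:

    xg_wf <- workflow() %>%
      add_recipe(rf_rec) %>%
      add_model(boost_tree(mtry = tune(), trees = tune(),
                           learn_rate = tune()) %>%
                  set_engine("xgboost") %>%
                  set_mode("regression"))

    xg_tune <- xg_wf %>%
      tune_grid(train_fold,
                metrics = mset,
                control = grid_control,
                grid = crossing(mtry = 2:4,
                                trees = seq(100, 1200, 100),
                                learn_rate = c(0.01, 0.03)))

    # Zoom in to see where the trees curve levels off
    autoplot(xg_tune) + coord_cartesian(ylim = c(3, 4))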
OK, let's add in a little more data; let's bring in the whole recipe, keeping the 95/5 near-zero-variance filter. And what was this in terms of learning rate? Right: 0.03 would fit way too quickly, so start with just 0.01 and maybe 0.003 later. When I added these terms in I really needed more mtry, so seq(2, 14, by = 2). Notice I'm still keeping all this tokenization stuff (oops, I neglected to put this one back in). The trouble is, I think we're losing all the really exciting features, all these rare things like "independent"; I think that's where we'd see the real benefit.

OK, so: more trees were better here, or at least it leveled off more, and more predictors were better, but we're still not beating 3. Let me try 98/2 on the frequency cut, and... I don't think I need this many; let me do a couple of finicky things, mtry 3 to 15 by 3, or 3 to 18 by 3. I'm just searching for better parameters. For some reason I like controlling them myself a little more than trusting tune-generated or Bayesian grids; someone suggested simulated annealing for hyperparameter selection. However you tune, with something like the trees parameter you can really see it level off, and you just want to aim for where it stops leveling off, which I think you can find a little more easily by eye.

I wonder whether, if I combined some PCA, some dimensionality reduction, across all these terms, I might get to a better spot; that might be what's next for me, because I do think more categories just add a lot of columns here. (This nzv setting says the binary ones must have at least about two percent ones, not less.) And this learning rate is too low: even at 700 trees it's still going down, so let's go to a thousand; this learning rate might be too slow for what I'm aiming for today. It's still gaining from more trees, but not by a lot. 98/2, or mtry 4 to 20 by 4.
I'm just trying to squeeze the best I can out of this set of tokenizations. Well, first of all, when I require at least one percent, how high-dimensional is this? That's my first question: I juice it and get 285 columns. My goodness, that's a lot to select from, so much so that I'm going to filter even more aggressively. (I could change the tree depth too; not sure I will in this case, so I'm not trying deeper trees.) Actually, you know what, I changed my mind: I'm not going to go any farther than a thousand trees; otherwise I'm spending a lot of time running and not gaining much. So: I'm going to try dimensionality reduction with xgboost. See, we are beating our old models by a little, but not really beating the random forest by much, and I haven't cracked this 3 RMSLE yet. (In fairness, I'm doing this in two hours, and the people on the leaderboard presumably took longer, but I'd still like to do better.) Here we go: it levels off really quickly, and mtry 20 versus 25 is not a big difference. So we're not seeing a lot of benefit from just adding more terms here; it's actually pretty similar to the original random forest.

I could blend this with the linear model, and I will be doing that for sure, but first let's try a PCA recipe. (This is getting so long; I need to shorten things.) Leaving in original_language this time: what if I did step_pca on everything that starts with "tf", with some num_comp? Actually, let me look first: let me prep and juice this once and do the decomposition directly, sticking with the 285 columns, because otherwise it's going to be tough to pick num_comp, at least without doing it sparsely. So: remove revenue, select starts_with("tf"), as.matrix. This is on the full data, but it'll at least give a sense. Actually, let me step_normalize(starts_with("tf")) first, so the columns are already normalized, and now I can run an SVD on the tf matrix. What I'm looking for is: are there a few principal components that really matter? The first three are important, and then it trails off around 16. OK, so I could do 16, or I could do 4; I'll start with both. So here's what I'll do: keep all the nzv removal, then step_pca(starts_with("tf"), num_comp = tune()). I'll call this the xgboost PCA recipe, make an xg_pca_tune with it, and add num_comp as another tuning parameter. The learning rate is all right at 0.01, and here I'll go back to mtry of 2, 4, 6 and just 600 trees. The key is that num_comp is either 4 or 16.
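The PCA experiment as a sketch: normalize the term-frequency columns, compress them into a tuned number of components, and swap the recipe into the xgboost workflow:

    xg_rec_pca <- rf_rec %>%
      step_normalize(starts_with("tf_")) %>%
      step_pca(starts_with("tf_"), num_comp = tune())

    xg_pca_tune <- xg_wf %>%
      update_recipe(xg_rec_pca) %>%
      tune_grid(train_fold,
                metrics = mset,
                control = grid_control,
                grid = crossing(mtry = c(2, 4, 6),
                                trees = 600,
                                learn_rate = 0.01,
                                num_comp = c(4, 16)))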
First I want to check that this step isn't so big that it's too slow to do PCA on. I'll finalize num_comp at 4; I just want to see whether it's going to crash or be really slow. No, it's not too bad. Okay, so I'm going to try a few values of num_comp: four, then three, then one through six, just to get a sense of whether adding principal components helps at all. That is, if I add principal components of all my term-frequency features to my original set (budget, popularity, and so on), do any of them help? If I had more time I really would explore what some of these principal components are. Let's see how this looks. I'm happy enough with this that I'd like to spend some of the remaining time doing some fun visualization, because in the competition you do get points for visualization and I want to practice what good visualizations of this would look like. But first I want to see: can I beat the random forest on these five terms? Not by much, if at all. See, with six components, adding components kind of just hurt it. Yeah, look at that: adding components actually looks like it's reducing performance. Four versus six: the best seems to be four, with very few components, and it's not doing much better. Did I need to step_normalize all the PCA inputs first? I thought I already did that, but I'm not 100% sure, and I do want those normalized. What happens if I don't normalize them? I'm actually really curious... huh, the dimension is zero. Okay, xg_pca_rec finalized; let me just set this up now. I don't necessarily need to, I was just thinking: what if I do one principal component, then two, three, four, instead of adding more and more? Actually, that's a little bit limited. So what would this look like if I took xg_pca_rec, prepped it, finalized it... I still have all my language columns, which, now that I think of it, raises a question: could I have used PCA earlier, without the tokenization filtering, to fold language into a single feature? I'm not too worried about the tokenization filtering, which just removes the rarest tokens, but I do need the tokenization itself, because otherwise I'm just keeping original_language as-is; this way language gets combined into that one giant PCA. One, two, three, four: if I did four, it's just these columns plus the four components. Okay, I'll try this one more time, num_comp 2 to 8, and a couple more trees just in case something's going on here. Now, it could be that the lower principal components are the meaningful ones, but I'm not sure I'd bet on it. There are actually a couple more things I want to try here; let me list them quickly. Here's what I'm going to do: I'm going to fit an xgboost model and create my submission now, because I like to make sure I have something uploaded well before the end. In particular, I'm going to do a stacked model with stacks, and since I find that having too many tuned parameters makes it really slow, what I'll do is filter_parameters with select_best on the tuning results.
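The stacking plan sketched here would look roughly like the following with the stacks package; it assumes both tuning results were run with control = control_stack_grid() so their predictions and workflows were saved, and lin_tuned and xg_tuned are assumed names:

```r
library(tidymodels)
library(stacks)

# Keep only the best hyperparameter combination from each tuned object;
# with every candidate included, blending gets very slow
lin_best <- lin_tuned %>%
  filter_parameters(parameters = select_best(lin_tuned, metric = "rmse"))

xg_best <- xg_tuned %>%
  filter_parameters(parameters = select_best(xg_tuned, metric = "rmse"))

movie_stack <- stacks() %>%
  add_candidates(lin_best) %>%
  add_candidates(xg_best) %>%
  blend_predictions()
```

A quick sanity check on the num_comp question is the same pattern: finalize the recipe at a single value, such as finalize_recipe(xg_rec_pca, tibble(num_comp = 4)), prep it once, and see how long it takes before committing to a full tuning run.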
I'm probably going to stack the regular one, though, not the principal-component one. What I want is to combine the linear model with the xg model; I bet it will out-compete both of them. Because... yeah, see, more components is worse at basically every level here, so I don't think PCA on text is going to be very promising. Okay, I'm going to start with just this version, and I'll set the trees pretty high. Now, this was the xgboost from before the PCA; take a quick look: xg_tuned bottomed out at 3.09. Should I keep tuning it, or just try a different cut? Let's try the five percent threshold, a slightly lower learning rate, 2,000 trees, fewer terms, trying to find the best version of a relatively small model I can get. Or what if I removed everything that was... yeah, I'm going to move quickly. All right, now this throws in some extra terms, and I don't like it; I don't like all these terms. You know what, I'm just going to leave in genres, because I know they're so impactful and maybe there are some relationships there, and I'm going to kill everything else; I'll leave in a little bit of language. Actually, I bet these genres aren't even helpful in the xgboost; let's find out. I'm removing everything except genres, and then I bet the right mtry is relatively small, so I'll try 2 to 4. Generally a lower learning rate and more trees is better. Oh, I added genres and now I need more trees than this; more predictors than this, that is. I wouldn't be surprised if the effect of genre turns out to be mostly linear, in which case I can just combine the xgboost with the linear model. So much waiting in this game. This thing's not done yet; it's still improving a little, so I'd go past a thousand trees on this. It's still pretty good, but 5 versus 6 is not a big difference. So with genres included I'd do 3 to 8, and I'd go a little farther on the trees, with 4 to 8 at a lower learning rate; we're going to get a slightly better RMSE. Then I'll try removing genre and see if we can beat that with xgboost, just wondering whether maybe it's a waste to even look at these. Oops. Yeah, it's really leveling off there, but 8 is still sort of the best; maybe it even gains a little more. Okay, and what is the best version of that? The best version was 3.05. Okay, sure: 3.05 versus, let's call it version 2, with no genres, where the grid is more like 2 to 4. Whichever one of these is better, I go with that as the xgboost entry. Oh, this should be version 2; I'm keeping the original around in case I want it, so I don't need to re-tune it. And in the meantime... see, it's similar, but the genres helped just a little bit. Okay, so I'll use the original: the genres help just a little, so I'm going to keep 4 to 8 and leave in genres and all those. So I have my original code, and then I'm going to make xg_best: take the xgboost tuning results, filter_parameters, select_best. This is still not the part of the workflow I'm best at.
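Picking between the genre and no-genre variants comes down to comparing their best cross-validated scores, something like this (xg_tuned and xg_tuned2 are assumed names for the two tuning runs):

```r
library(tidymodels)

# Best CV result from each variant; in the session the genres-included
# run came out slightly ahead, around 3.05
show_best(xg_tuned, metric = "rmse", n = 1)
show_best(xg_tuned2, metric = "rmse", n = 1)
```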
And lin_best... is this the right autoplot? Yep, that's a good version. Then stacks: add_candidates(lin_best). I like to only add the best from each model; it makes things a lot faster. And... oh boy, what just happened? Well, let's hope this doesn't happen tomorrow. What's going on here: these two are different training sets somehow? They shouldn't be. Okay, let me try prepping this step first; that worked. Now let me just look at this for a second. This is xg_best, and its predictions look right: each fold has 450 predictions, and 450 times 5 checks out across the five folds. These look pretty good. Then why is this failing? So something is up between... there it is: two of the models failed on lin_best. They failed; that's so interesting. All right, let's take a look at the linear one and try re-running it one more time. Remember, this whole linear model takes a while; it's been about 15 minutes since I actually started, three minutes after the hour. This worked before! And look at that: two of them failed, only 18 candidates were available. I wonder if those warnings somehow caused two of the folds to just up and fail. That's strange. Oh wait, hold on, here it is: "length of 'dimnames' not equal to array extent". Something happened on this one. I don't have time to fully explore it, so let me see what happens if I just remove one of the... yeah, two of these models failed, that's what happened. Still failed. What happens if I remove all three of my text features? I don't want to be debugging a model at the last minute. Nope, two of them failed again. Okay, then I'm going to need to test this out for a second. What I'll do is take the five-fold splits of the training set; it failed on folds two and three, so take the training data for fold two, run my lin_rec prep on it, and yeah, there's where the failure is. I really don't know why. Let me see the names after prepping... is there some missing data in here? Of course there's missing data, but still. All right, I'll need to just keep dropping steps out and see what happens. Oh hey, I think that helped! Okay, so it was one of those steps. Considering just how much worse that makes the model, realistically I might just have to go with a workflow that's a lot worse. Let me try putting things back. What can I put back if I drop the keywords? step_rm(contains("keywords"))... nope, it doesn't like it. Oh right, of course, step_rm contains "keywords", not... still doesn't work. Oh, now everything's failing all the time... oh, because I left this line commented out. I don't know what is failing here, but it wasn't keywords. This is not my favorite part of this process. I could just take the xgboost model, but I really want an ensemble. Nope, still failing. Minus production_companies, minus production_countries... I'm trying not to just randomly remove these, because I do not know where that error popped up. Still failing. All right, I'm going to have to go with commenting them all out, which I had a second ago. Yep, here we go. Oh, well, actually, can I leave just genres? The genres work, and I like the genres. The problem is... what if I just kept a different one? Maybe it's the min_times argument that's causing problems. Can I just add one back? Here's cast... I do not like this approach. I don't like this debugging, I don't like anything about this. All right, oh well: I'll keep the others dropped, but something's going on.
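One way to reproduce the fold-by-fold debugging described here is to prep the recipe on each fold's analysis set by hand and see which folds error. This is a sketch, with lin_rec and train_fold assumed from earlier:

```r
library(tidymodels)

# Prep the linear recipe on each fold's analysis set and report which
# folds error out
for (i in seq_along(train_fold$splits)) {
  result <- try(
    prep(lin_rec, training = analysis(train_fold$splits[[i]])),
    silent = TRUE
  )
  if (inherits(result, "try-error")) {
    message("fold ", i, " failed: ", attr(result, "condition")$message)
  }
}
# In the session, folds 2 and 3 errored with "length of 'dimnames' not
# equal to array extent"; from there, commenting recipe steps out one
# at a time narrows down the offending step.
```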
All right, it's still not that bad a model. Okay, now I go back down to lin_best... here we go. Nope, still failing. I thought I removed... oh, I still have genres, right. Oh, this one worked! Okay. And... zero? The best candidates look good, this looks good, but I don't know why it's saying zero members. I'll do blend_predictions with all the models... see, it's the oddest thing; I'm really lost here. It keeps happening. Well, at least I'm getting all my mistakes out today, before tomorrow's big day. So why did it say zero? I added this one and it's one; I added that one and it's zero; add all of them and... what just happened? Why does the best one work and then this one get removed? Stacks has been so frustrating to me; I don't know what's happening here. All right, here's what I'm going to do. I wouldn't make it if this were the actual competition, I guess, but I'm going to do a quick thing where I blend them myself. I'll unnest the predictions, so I've got both models' predictions and the actual revenue; then select .row, .pred, revenue, and model, and spread by model, and now I've got a prediction from each model for every movie. Then I'll fit a linear model where linear plus xgboost explains revenue, so the combined prediction is this coefficient times the linear prediction plus that coefficient times the xgboost one. Here's my combination; this is my blending model. Then lin_best_fit is lin_wf plus finalize_workflow (every time I use this I'm just a little bit off with it), and I fit my two models. Then I can do bind_cols of predict from lin_best_fit and predict from xg_best_fit on my holdout, and give them names with rename. Here I go; my id column I can bring in later. And now what I'll do is augment the holdout predictions with my combination model (oops, new_data), and transmute: id is the holdout id, and revenue is 2 to the power of .fitted. What I did is assemble the blended combination myself. Okay, this is for their holdout set, but I'm going to run it on my local test set first, the one I held out originally, and find my accuracy there: the RMSE of log2 predictions versus log2 of actual revenue is 3.19, which is not far from what I had in cross-validation. Then I'll run it on the holdout, write_csv, and do a quick late submission. I think it's going to be not great; I'm guessing somewhere above 3, which puts me in not a great place. Oops, look at me go... revenue... what if I got an NA here? Let's see where I got one: one NA on the linear model, in one row. I have a vague guess why. Well, I can actually just patch it: replace_na on linear with, I don't know, whatever the average of linear is, maybe about 20. Whatever; it's only one observation.
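The hand-rolled blend could look roughly like this. It assumes the models were trained on log2(revenue), and that lin_wf, xg_wf, lin_tuned, xg_tuned, train, test, and holdout are the objects set up earlier in the session; none of this is the exact on-screen code:

```r
library(tidymodels)
library(readr)

# Fit the two finalized workflows on the full local training data
lin_fit <- lin_wf %>%
  finalize_workflow(select_best(lin_tuned, metric = "rmse")) %>%
  fit(train)

xg_fit <- xg_wf %>%
  finalize_workflow(select_best(xg_tuned, metric = "rmse")) %>%
  fit(train)

# Both models predict log2(revenue); blend them with a plain lm()
blend_data <- bind_cols(
  predict(lin_fit, test) %>% rename(linear = .pred),
  predict(xg_fit, test) %>% rename(xgboost = .pred)
) %>%
  mutate(log_revenue = log2(test$revenue))

blend_mod <- lm(log_revenue ~ linear + xgboost, data = blend_data)

# Predict on the Kaggle holdout, patch the single NA from the linear
# model with the column mean, and exponentiate back to dollars
holdout_preds <- bind_cols(
  predict(lin_fit, holdout) %>% rename(linear = .pred),
  predict(xg_fit, holdout) %>% rename(xgboost = .pred)
) %>%
  mutate(linear = replace_na(linear, mean(linear, na.rm = TRUE)))

augment(blend_mod, newdata = holdout_preds) %>%
  transmute(id = holdout$id, revenue = 2 ^ .fitted) %>%
  write_csv("submission.csv")
```

Fitting the blend on held-out predictions rather than in-sample fits matters here: coefficients estimated on training-set predictions would over-weight whichever model overfits more.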
Okay, so one good bit of news: while it was doing worse on my local test set, it actually seemed to do a lot better on the holdout; I wonder if that has something to do with the zeros. In the end it was 2.1, so I just about made it within the two hours since I started working. But yeah, I really shouldn't push it this close; before tomorrow I'm going to take another look at the workflow. I'm not going to be on the leaderboard, because it's a late submission, but let's see where it would have ended up: around the middle, somewhere around 600th. I'm curious how it would have done compared to just the linear model or the xgboost model by itself, but I'm not going to check that right now; I'm feeling like taking a break. All right, so what did we do today? This one took more data cleaning than most we've tried so far: we had to do some extracting of JSON names. We did the boilerplate for bringing in the data and splitting it apart. We did lots of EDA, looking at popularity by revenue, runtime by revenue, budget by revenue (which is probably the strongest predictor), and language by revenue. Then we took a look at the differences over year and over month, seeing those summer blockbusters, then a little bit by year and month together. Oh wait, what did I do wrong here? Oops, something's up... oh well, this one worked, but the revenue was still logged; something strange in the neighborhood. Oh, that's right, I kept it on train; this is the old version. So then we said okay, we can look at a combination of date within the year and the year itself, but really we want a non-linear term based on where you are in the year, maybe with a spline but ideally with some kind of tree. Then we took a look at the genres and the revenue you make in each of them; that's going to be one of the most impactful features. Then we did a giant linear model and a smaller xgboost model, and we plotted a little bit of interpretation. I could have used that more: looking at it now, I should have kept production companies in the xgboost, because so many of the positive terms are production companies, and you can see some of these positive and negative terms; I probably could have used this for a bit more feature selection. Then I took a look at random forest versus xgboost and saw the forest didn't do that much worse; it was actually pretty similar to xgboost in this case, and the cross-validated performance was pretty close, so I probably could have just used the forest. Oh, and it was 3.12, a little different from what I said before. And then we did some more xgboost; I tried to use stacks and it kept failing on me; I tried doing a PCA ahead of an xgboost and found it wasn't helpful, so I ditched that approach; and then I finalized with this home-cooked blend approach to combine the linear and the xgboost models.
Okay, that's it from me. If you joined today, please be sure to join tomorrow for the actual SLICED episode: 8:30 PM Eastern on Nick Wan's data science Twitch channel. All right then, time to go get a good night's rest. Hope you had fun; I certainly did. See you tomorrow!
Info
Channel: David Robinson
Views: 1,483
Rating: 5 out of 5
Id: IkTfKnUoYf0
Length: 123min 50sec (7430 seconds)
Published: Mon Jun 21 2021