Tidy Tuesday screencast: analyzing board games and predicting ratings in R

Captions
Hi, I'm Dave Robinson, and welcome to another screencast where I'm going to be analyzing data I've never seen before using the R programming language. As always, this dataset comes from the Tidy Tuesday project, run for the R for Data Science learning community. Let's see what we have this week. Looks like board games. This is awesome, because I really like board games, so I'm excited about this. It's from the BoardGameGeek database: more than 90,000 games with crowdsourced ratings, limited to games with at least 50 ratings published between 1950 and 2016, which leaves about 10,000 games. That's a lot of variables, which is really exciting. There's also a script, which looks like the one that produced the board game data.

Okay, so I'm going to bring this data into R. I load library(tidyverse), save the data as board_games, and view it. 10,000 rows. There's a game ID and a description. I hope there's a name somewhere... there's a name, all right. There's how long the game is expected to take to play, and a picture (hmm, I could think of some things to do with that). There's the year it was published, an artist, and a category. A column like this is one I always look for: one where rows would need to be separated because they hold multiple values. Compilation looks like it's usually empty. Designer has multiple values, though I'm unlikely to focus much on that. Then expansion, family, and mechanic; these are categories. Oh wow, there's lots of data here, and we're going to have commas. In family it looks like there's "Admin: Better Description Needed!", which we probably don't need. Some of them will be countries, some of them will be series, and so on. Mechanic describes what the mechanics are. This is going to be fun, because I'm going to be separating these columns a little bit. We have a publisher (I wonder how many publishers there are), the average rating, and the number of ratings. Makes loads of sense. All right, this is going to be a blast. For example, we can try fitting some models
to predict, based on some combination of family, mechanic, category, and probably year, what the average rating is going to be.

Okay, we'll start with a little exploratory data analysis. I want to find out some things about these columns. There are some I know I'm going to look at, like category, family, and mechanic, but I'm curious about some of the others. If I count publisher, how many publishers are there? About 5,000. Some publishers have more than a hundred games, but it's mostly a long tail. I could make a bar plot, but I'm not going to do that right now.

I'm also curious what I get if I count year published: how many games are there per year? So I count year_published and plot n as a line plot across years. It's really clear that most of the data is recent. It looks like it peaks in 2015. The data was probably collected in 2016, so maybe that's a partial year, which is why it's a little lower. In general we have more recent data than older data. We often aggregate by decade; here I can see that the early decades especially are going to be hard to examine.

I'm going to start with some histograms. Let's do a histogram of average rating. Oh, it looks quite normally distributed. Not perfectly normal: it can't go very low, and it's limited at the top by 10, but it looks roughly normally distributed around 6 (what is that, 6.25?). That's good news if we're going to fit a model, because it means we can have a linear output. It's not going to be a classification, yes or no, and it doesn't have a really skewed distribution where we'd need to get a little creative. I love things that are roughly normal. That's cool. I'm also
curious about the distribution of users_rated. That's definitely going to be a long-tailed distribution; the cutoff was set at 50, I think. What if I only used games with at least 200 ratings? Would that really change the distribution of average rating? If I filter for users_rated greater than 200: not really. There's still a long tail on the lower end, and there are some of what I'd probably call bad games. Even if I use 500 it changes the shape only a little bit. I guess this is mostly because people don't buy, and therefore don't rate, games that are pretty bad, and that's why we see this carving out of the distribution. Mostly, though, I'll treat average rating as the thing I want to predict, and look at something like the mean squared error when I'm predicting it.

All right... oops, these two are the same. Where's my histogram? Here it is. Okay, so then we want to answer some questions. I'm probably not going to spend too much time on things like publisher or author, but I am interested in max play time as well. Remember, I'm just looking through to see what each of the distributions is. I'm working my way towards a predictive model here, and I probably want to predict average rating.

So first I want to get a sense of the distribution of max play time. Ooh, that's pretty weird. I'd say that maybe it's a data quality thing: some values are in the thousands of hours, or whatever. I assume this is in minutes: thirty minutes, sixty minutes, 240 minutes. Let's say less than a thousand minutes. Still a bit of a funky distribution. I could add scale_x_log10 to this, and I could probably leave it like that. Okay, so most things fall between ten minutes and a couple of hours. So I'm
going to quickly divide that by 60. That doesn't really help. Sometimes on a log scale with time I like to have a minute, an hour, and 60 hours each evenly spaced, but here that's not really going to help. I'm probably going to set the breaks in hours. Let me see how those breaks look: one, two, four, eight, with play time over 60. Really I can say 2 to the power of a sequence from negative two to four, something like that. It's not perfect, but I'm fine with it. This shows a quarter of an hour, fifteen minutes, all the way to... even a little less: ten minutes to eight hours is really where we're seeing our range. And on a log scale it's not too far off from normal, so I'll probably use log time as a predictor when I try to predict average rating.

I'm going to quickly filter for max play time greater than five and max play time less than a thousand. The bin width... it's always hard to get the bin width right on a log scale; it's not very intuitive. You've really got to be careful with your bin widths, since some can be misleading. But this gives a general sense that, on a log scale, we peak around an hour.

Okay, so I've got the distribution of year published, but I haven't looked at categorical variables. I like to look at every variable I'm going to be feeding into a predictive model, to see if I can figure out the average rating. I noticed I've got family, and I had mechanic, and I had category. First of all, let me quickly check: is name unique? Oh, not unique. Isn't that annoying. Okay, I'm going to quickly select game_id and name (it's not id, it's game_id), and also family, expansion, and category. Here's what I'm doing: I realize that family, expansion, and category are each categorical variables, and I want to
treat them in a very similar way. In particular, I need to split all of them around their commas. I could do it on each individually, but that would be a lot of repetitive work. All three of these are categorical variables, so I gather them into type and value (gather family, expansion, category), and now I've got type and value columns. Now I can use separate_rows, which comes from the tidyr package, on value, separating by commas. By the way, I can also filter for not is.na(value); dropping the NAs could be helpful here. Here we go. I'm going to arrange this by game_id, and now you can see "Die Macher" (I guess it's a German game) has two families and three categories, the categories being economic, negotiation, and political. "Dragonmaster" has one family and two categories.

So now I can call this categorical_variables. It's going to be really handy once I start fitting the model that I have put these categorical variables in a tidy form. Honestly, I could have done it with other ones like publisher or artist. I decided against it because I thought those were probably a little rarer, but... hmm. All right, I'm going to try that out. Actually, I take that back about publisher; I think publisher only has one value. I'll throw in artist and designer, gathering everything besides game_id and name, and now I've got my categorical variables with a few extra things like the artist and the designer: 88,000 pieces of information about roughly ten thousand games.

So now I can count type and value and ask what the most common combinations are. I can see we have loads of card games, loads of war games, and fantasy
games, and we also see a whole collection of Kickstarter games. We also have ones about fighting, and more as well. Generally it's categories that are really common. I'm going to quickly view this. It's mostly category, which makes sense: categories, I figure, are bigger than families. This one is probably not very... well, maybe it's indicative for a particular game. All right, so we have humor, and we have "(Uncredited)"; maybe that actually has some predictive power, though I don't know if it's meaningful. Lots and lots of categories, and a few families that fit in among the common ones. Now I also see that we have some really common artists, so this artist is more common than most family tags are. Family: 3D Games; family: Napoleonic; category: American West. We have some artists, and I see a couple of designers popping up.

Okay, so these are some categorical variables. Let's call this categorical_counts and start with that. If I filter for type equals category, I could make a solid little bar plot with value, n, geom_col, and coord_flip. We do this bar plot in almost every screencast. I say value is fct_reorder of value by n, value meaning category in this case. Oops, we have too many; I want just the most common categories. I'm a little tired of the default theme. I like theme_light, so I'm going to switch to that now. Then I can set the title to "Most common categories".

Hmm. Now that I'm looking at this: I've got category, I've got family, and I might want the others too, but what about expansion? I don't even know what that was. Oh, it's a list of expansion sets. I didn't mean to include that; that's not really what I'm
looking for in this case. The number of expansions could be a little interesting. Hmm. So I'm going to go with family, category, artist, and designer. I was going to start just with category, but I should include all of them, and instead of head(20) I'll do group_by type and top_n 20. What I want is a faceted graph where I see the top values within each type, similar to what I was doing before by browsing through. So: facet by type. Here's a fun trick: fill is type and show.legend equals FALSE, so I get a prettier graph. I just realized I'm going to have a huge amount of trouble if I don't set scales to free (oh, and I forgot to remove expansion). Oh, and I have to ungroup. This is really important: you have to ungroup before you fct_reorder.

Some artists also show up as designers. I can tell because these orders are really funky. That can be fixed with the drlib package, which is a package of my personal useful functions; I use the function reorder_within, reordering within type. See how Oscar Arevalo is popping up? He's here as a designer and here as an artist, and that's messing with the whole order. So I'm going to change this to reorder_within and then add scale_x_reordered. On the facet it's free_y, which is annoying. And I said category but I meant type. If I don't free these scales I'm going to have a really... oh, top_n by value should be top_n by n. Come on, David, get it together. Yes, so here are the twenty most common artists, designers, categories, and families.

I'd probably want to order this a little differently. I can order these facets: I can say type is fct_reorder of type by n (the function could be sum, though it doesn't actually matter), and I probably want descending
is TRUE, because I think I'd like... yeah, I kind of like this graph, except it has too many items on it. Okay, so that was some quick iteration to show the most common values of the categorical variables we're looking at: category, family, artist, and designer, each of which can have multiple values per game. If I want to predict whether someone's new game will be good, this is what I'm going to be looking at.

So that was some exploration of the categorical variables. Notice it was pretty easy to do all of these together because I did my tidying up front: there's one row for each name, type, and value. The names type and value can be a little misleading: this is a type of categorical variable, and this is just one possible value of that categorical variable.

So why have I been doing this? The reason is I'm really interested in which categories tend to have higher average ratings. But, now that I'm mulling this over, I'm going to do this differently. I'm going to start by trying to predict average rating with a linear model (or variations on a linear model), and then we'll bring the category data back in. I need to think about that for a second.

Okay, here are my board games. What could I use to predict the average rating? I'm going to use a linear model, lm, and say average_rating explained by 1, which means nothing except an intercept. If I run summary on this intercept-only model, I find that the average is 6.37, with a residual standard error of 0.85. That's where I'm going to start. Oh, I just realized, what am
I doing? I'm not cross-validating anything. I'm going to have a holdout set. So I'm going to say game_id mod 5 equals 0: take every fifth game and call that the holdout set. I'm not going to bring it in until the very end. There are a few ways to do this; I'm just keeping this one separate right now. The rest is training data, which I get by filtering board_games. Okay, here we go, we have a summary.

Now I want to add something. The rating could depend on the number of players, so I'm going to add max_players: intercept plus max_players. It barely made the residual standard error go down at all. It looks like a slightly significant effect (wow, that's really close to not significant): games with more players tend to be a little bit lower rated, but by very little. I bet if we plotted max players by average rating it would be nonsense. But we're warming up with some of our predictors. I also noticed that it goes up to a thousand. So I'm going to use log2 of max_players plus 1, since some of them are 0, which is weird. Ah, this is a much bigger effect. Notice that's why it's useful to understand a variable rather than just throwing it in. If I plot it with scale_x_log10 and add geom_smooth, there is a general downward trend according to this graph. So, for starters, games with more players are likely to get a lower rating. It's weird for there to be games with a hundred players, so maybe there's a data quality issue. But that's at least one place we're going to start.

Max players is one of the numerical variables. We also have max play time, and we already know that we want to use log2 of max play time
plus 1, since some values are zero (rather than filtering out NAs). What this generally says is that every time you double the maximum number of players, you expect the average rating to go down by 0.15, and every time you double the max play time, so with a longer game, you generally expect the average rating to go up by 0.13. So generally, longer games for fewer players rate higher. This is a very crude look at a linear model, but already we've brought down the residual standard error: we got closer to being able to predict average rating. Not a lot closer, though. So that's a start for a model.

What else can we use that's predictive? There's min play time as well as max play time, but if I include both they're probably correlated, so I'm not going to bother. I also don't think those are that useful. One that could be quite useful is time, which is really critical because time almost always has an effect. I want to explore it a little bit first. I'm going to leave out the holdout set (cross-validating we can do later), take board_games, and mutate decade as 10 times year_published integer-divided by 10. Is this right? Let me check. Yes, this turns into 1950, 1960, 1970, 1980, et cetera. Then I group by decade and summarize the average rating. It looks like it's been going up over time, and not by a small amount. That's quite a bit, especially recently. The trend is not quite linear, but it's not completely crazy to predict it linearly. I could use
splines, but I'm just going to throw it in as another term: year_published. So all three of these are quite significant predictors. The summary is getting a little hard to read, but we can tell that for every year later a game was published, you expect the average rating to be up by 0.02. Instead of summary I'm going to tidy this using broom. I like that depiction a good deal more. Wow, these are significant p-values. That's because we have a lot of data and it's a real trend. So every year later, you expect the rating to go up by 0.02. Notice that as soon as we added that, our residual standard error dropped by a whole bunch.

So these are a couple of predictors we're going to need. We're trying to predict the average rating of a game, and one of the reasons could be to ask: if we launched a new game, would we expect it to be popular? Now, we're never going to launch a new game in 1980, but we still need to control for year. When we train this model to predict average rating, we need to know that an older game with the same categories is going to look a bit different: it's going to have a lower rating. So that's really useful. That covers a good amount of our non-categorical data, and we've got down to a residual standard error of about 0.74. Okay, cool.

So let's move on to trying to predict the average rating based on categorical data. I'm going to start with an approach that's not model-based. I take my board_games data and join it with categorical_variables by game_id and name. I guess we had both of those in there; I didn't really need the name in categorical_variables. And now notice that I do have a type, I have
value, and I also have average rating, and that's the exciting one. I get, say, family "Country: Germany", and here's the average rating. Why is that useful? Because I can group by type and value and summarize the number of games (always useful) but also the average rating of those games. It's almost a predictive model. Shoot, that comma: I just noticed that there are sometimes commas in artist names, which split them. Well, we can't deal with that right now; they're still pretty rare. The result really should be arranged descending, by the ones that are in the most games. So here we can see the average war game has an average rating of 6.77, and the average card game a rating of only 6.25. That's a big difference. Party games are less popular on BoardGameGeek, and children's games really aren't popular.

I could create a graph here. I'm going to name this as being by categorical variable, not by game. If I filter for category and take the twenty most common, that's a solid graph. Since I'm doing averages, I probably want a box plot. So I filter for type equals category and use fct_lump. I said type, but it's actually value (my naming conventions are not the best): lump all but the top 15. Then plot value against average rating, with coord_flip. I usually like a reorder in a case like this, so value is fct_reorder by average rating. It'll do it by median, which is really nice for a box plot. There it is: children's games get lower ratings, and World War II games get higher
ratings. So that's excellent. I could have faceted it, but it's going to become a crowded plot, so I'm going to copy and paste. Not just category: let's look at family, the top 20 most common families. We notice that solitaire wargames are unusually popular, 3D games are unusually unpopular, and "Admin: Better Description Needed!" is right in the middle. All right. Similarly, I could look at designer. These are the 20 designers who designed the most games. For the uncredited ones, if a game is uncredited, it tends to have a lower rating. That doesn't seem causal to me; it could just mean it's not a popular game. We also find problems with our data, like the fact that the word "Jr." gets treated as a designer. I don't love that. It's the problem with these commas. Let me select designer: do we actually have multiple? Yeah, we do have multiple, like Wolfgang Kramer and Richard Ulrich. Well, it's frustrating. We could spend time cleaning it, but I'm not going to spend that time now; we've done enough feature engineering. Okay, here's a bunch of people. Yes, some artists definitely have higher-rated games. That doesn't mean the artist caused the game to be better; maybe they work with a good publisher or something like that. But it is notable.

So the conclusion is that categorical variables can be correlated with higher- or lower-rated games (I'm going to use just category and family). If all these box plots were really flat, we wouldn't have anything to work from, and we would probably stop right about here. So why am I doing this? Because these categorical variables are features of our data. In fact, I can unite type and value into a feature column (unite is a super handy function in tidyr). Then we can see which features we have: family "Country: Germany", category
economic, category political, artist Marcos (I'm not going to be able to pronounce that), and so on. We could now say what the most common features are, but we can do more than that: we can build a sparse matrix from our data. So I'm going to use add_count, a dplyr function that lets me find how common each feature is. I don't want to include any feature that isn't in at least, I don't know, a hundred games, so n greater than 100. How many distinct features does that leave? 83. Hmm. 109. With at least 50: 141 features. That's a place to start: 141 features, all of which affect at least 50 games out of the 8,000 we're working from.

I said 8,000, but I forgot that I should be working without the holdout set, so I'm going to go change how I do this a little bit. I'll call the original data board_games_raw, because I want to avoid ever working with the holdout set. I'll try to pull it in at the end, but even if I don't have time, you can do it yourself; it's good to evaluate a model on the holdout set. Then board_games is built from board_games_raw, and for everything from here on I can just use board_games. I'm removing that old one. So I've decided to do everything without the holdout set, which is one fifth of the data. It's good practice in machine learning to make sure you're not overfitting your conclusions to the data. Right, so this was my box plot, and my other box plot for family. And categorical_variables now has fewer games. So I'm looking only at features that appear in at least fifty games out of the roughly eight thousand that I'm keeping around. Okay, that wasn't the cleanest way to
go about that, but it works. So I'm going to keep these features. Here's my trick. I want to do a regression: if you're economic, how much does that add? If you're negotiation, how much does that add? If you're political, how much does that add? The challenge, which I run into a lot, is that with this many variables, many of them are going to be highly correlated: if a game has one, it's more or less likely to have another. For example, maybe card games aren't usually likely to be fantasy. Having many features, many of which are correlated, is really not a good setting for ordinary linear regression. So I can't just do a linear regression asking whether your average rating goes up if you're economic. What I'm going to do instead is lasso regression. I've used it before in one session, I think on Medium text posts, and I'm going to show it again: lasso regression for the average rating. That's in the glmnet package.

To do this I need to create a sparse matrix. The tools to work with sparse matrices are actually in the tidytext package, which Julia Silge and I created, because this is a really common need with text data. I call cast_sparse with game_id on the rows and feature on the columns. And no, n isn't a count I want here; I just want it to be binary, yes or no. I'll call the result feature_matrix. You also usually need library(Matrix) if you want to do anything with a sparse matrix. I just created a matrix with 8,300 rows and 113 features. The columns of that feature matrix are each of our features, and it's a binary matrix. Can I view the feature matrix by itself? It's going to be big, but I guess I can. I created this binary matrix where the row names actually
are the game IDs. Notice we skip every fifth because those are in the holdout set. This is how we'd want to approach a regression: based on this matrix of features, which features contribute to a higher or lower average rating?

So I have my features; now I need the vector that goes with them. Maybe there's a shortcut to this, but I still take the same lame approach. Here are all my row names, and what I do (it's certainly less than impressive) is take my original data, board_games, and pull the average rating out of it using the match function, which is a handy old-school R trick: wherever every item in the row names of the features matches the board_games game_id. I write this same kind of code often. I could have done it with a join, but it wouldn't have been a lot simpler.

I notice a problem, though. Why do I have features that aren't in the dataset? Or is it that average rating is sometimes NA? No, that's not it. So which row names of features don't match a game ID? I meant to do setdiff of this versus the board_games game_id. Why are there so many? Oh, these are the holdout set. I thought I removed that. Well, not all of them are... I'm missing something. One moment, please. board_games went into my categorical_variables; did I not rerun this? There are still plenty of missing ones. Sorry about this; this is not as clear as I was hoping it would be. How am I still missing every fifth one? One, two, three, four, six: it skips right over five. Nuts. This is really lame, I'm sorry. So if I do setdiff... wait, where is
row name five in the row names of features? I'm sorry, it's probably going to be something really simple. Where is game ID five? It doesn't appear. Okay, I think my cast_sparse has a problem. Some of this is really frustrating. cast_sparse is supposed to... features... oh, oops. Okay, so that's what happened: I meant to use feature_matrix, my bad. The row names of feature_matrix. Wow, did anyone catch that? I needed to use its row names. All right, what a waste of your time, sorry about that. Okay, I now have the binary matrix and I have rating data associated with each row. That was not very helpful. All right: I have a binary matrix, this is my predictor, and this is what I'm predicting, the goal. What's the name for that? It's on the tip of my tongue. These are the predictors, the regressors I suppose, and this is what I'm predicting. Whatever, it's my X and my Y. The way glmnet works is that I give it the feature matrix and I give it the ratings. I'm going to start with a glmnet fit, so this is my lasso regression fit. It's super fast. It comes with a built-in plotting method, which is not very helpful. Well, it's a little helpful; I'm going to show a bit more about it. Okay, the output is enormous, which is not very helpful, but broom allows me to tidy it, and what that says is, at every step, here's what term gets added and here's what its value is. What is a step? A step means a successively smaller value of the penalty parameter lambda. So lasso regression is all about... it's useful to have the equation here, so there's going to be a brief math interlude. Here we go, I like this one. What this is saying is that normally in regression we're trying to minimize the sum of squares, to get the prediction as close as we can. What this is doing is adding a penalty term. It says: but I also
don't want to make my coefficients too big. If my coefficients get too big, then I might be overfitting. So we're fighting overfitting, which particularly happens if you have many correlated predictors. What this is saying is, as I keep decreasing this lambda value, I keep adding parameters. In this one I actually add wargame as a positive parameter. Notice that as the L1 norm grows, which I think is basically as lambda is going down, each of our coefficients starts increasing and then flattens. So if I view this, what it's saying is: we add wargame, uncredited, and children's game as a negative effect. It starts adding coefficients one by one. That's not that helpful by itself, since we still need to choose a lambda, but notice that this is how it works: it keeps decreasing lambda, I'm sorry, yes, decreasing lambda, and adding more and more terms that are either positive or negative. It turns out that once you've got a lower lambda, Rodger B. MacGowan is pretty positive, whereas humor is a little negative, and Crowdfunding: Kickstarter is a big positive. The challenge is that we need to choose a value of lambda. I could choose a random one: I could say step equals 30 and, boom, now I have my coefficients. I can say, if the designer is dnsa or the family is solitaire games, add a huge amount to your score. I can also say, if your family is Monopoly or your magazine is Strategy & Tactics, subtract. Monopoly is not popular on BoardGameGeek, it looks like. But rather than choosing a random value, we need to choose one in a principled way, and the way to do that is cross-validation. I'm going to call this cv_lasso. It's built into the glmnet package as cv.glmnet. Ah, there's good news here, sort of. What this is saying is that as I decrease the lambda value, my mean squared error goes down and basically flattens out. We weren't at much risk of overfitting. That actually means, back
here when I said let's only use ones that occur at least 50 times, I probably should have decreased that threshold a little bit; I could include a few more features. Oops, oh right, of course: this now exists in glmnet.fit, so the same kind of output exists. But here's my cv_lasso, and I can plot it. Yeah, we're still not overfitting. Generally on this plot you'll see a curve that goes down and then back up. This is saying that as we decrease lambda we add more and more parameters, but it does show that once you've added about 177 parameters you've got about all the gain you're going to get. That happens at this dashed line, which tells us what lambda value to choose: cv_lasso$lambda.1se, within one standard error of the minimum. When lambda is here, you don't gain much by adding anything. In fact, it looks like you might start overfitting, with your mean squared error going back up. So this is the lambda value we're going to choose. Why is that useful? Because I can take this and say filter lambda equals this, and now I have a model. I like to arrange in descending order of estimate. These are now my coefficients, so I can now say there's an intercept of 6.23, and then, if it's in the family, the general group, like tableau building, or if it has this artist, bring it up by a lot; whereas if it's an uncredited designer, or it's Monopoly, or the magazine Strategy & Tactics, bring it down by a lot. So this gives us some ways of understanding how these features from these different groups, family, artist, everything, get turned into a predictive model. One thing to realize is that we talked about controlling for a few factors, including the number of players, and we didn't do anything with that. So let's actually include those. Let's...
what we do is say, let me see. Okay, what we need to define is non-categorical variables. Those are board_games, select, and keep the game ID and the name; I just like having the name around even though it's not actually important. And which were the ones I looked at? Certainly the publication year, year_published. I'm going to change this to a transmute and say I care about the year published, I also care about the log2 of the max players, and I care about the log2 of the max play time; remember, we decided earlier on that if we're going to be linear we need to treat them as log2. What was it called? max_playtime. So: year published, playtime equals log2 of max playtime, max players equals log2 of max players plus 1. We add 1 to each. These are non-categorical; I should really call them features. These are some features we want to add in. Was there another one I was forgetting? I feel like there was one. I also want to say year published minus 1950. I don't like that the intercept term will end up really negative just because of the year. Years since 1950: I'll say year is year_published minus 1950, and then I just get a more intuitive intercept term. I feel like I'm missing one; maybe I'm not. Okay, here are my non-categorical features. The trick is I need to gather them: gather everything besides game ID and name into feature and value. I'm doing it backwards; here's the way gather works. I have a column called feature and a column called value, just as I do here. Oh, here I don't quite have that: I need to give value as 1, since you either have it or you don't. Then bind_rows with the non-categorical features. What I just did is I added all my features together: all the categorical ones have a value of one in every place, and then I stuck on the end the feature log2 max play time. I don't need the count anymore. So I've added a couple of features back here in this step. All right, and now they're right there in my columns: year, log2 max players, log2 max play time. I've combined these features with these,
that is, these features that are numerical with these features that are yes or no. Okay, now keep this rating, and now I can view it. I need to change one more thing I just remembered: when I unite them, I want the separator to have a colon in it. It looks a bit different, but it's going to look way better on the resulting graph that I'm going to make. So my question now is: I'm now controlling for year, so where is year? We thought it would be here somewhere. Did I rerun? I reran this, and this, and this. And if I say filter, str_detect term, year... hmm, I'm doing something wrong. Column names... if I just tidy it and filter... I am running cv_lasso, I am taking this, and even if it wasn't significant it should be showing up. What about my log2? I am missing something. It's like this data is just missing; that's somewhat embarrassing. It ends with 292; it's like it adds everything except for these. Okay, let me see the tail of features. Oh, that's it, there it is: I needed to include the value in cast_sparse, my bad. What I was missing here... this is as.matrix. I know I'm thinking through this quickly and I'm not able to explain every step. If I go all the way to the right, these don't line up, but this shows it: there it is, year, max players. It needs to have the values in there; that's the step that I was missing. So now if I run it, here it is: year does have an impact. So I'm going to try this again. Okay, it is a small coefficient, because it's one for every year, not just present or not present. If I also filter for str_detect term, log2, I bet I see the other ones. Yes, so max play time still has an effect, and log2 max players still has a negative effect for every time that you double it. Okay, so this was my predictive model. What I would usually do at this point, since I really like these understandable models, is filter out the intercept. I'd take the top and
say, take the top 20 estimates that go in either direction. I'd plot term against estimate, flip the coordinates, and do a fct_reorder. These are the largest coefficients in our predictive model, so I can say the title is "Largest coefficients in our predictive model" and the subtitle is "From a lasso regression". I usually don't need this axis label. It's a coefficient; I'm not going to write that in the graph, but the story here is that the coefficient is the amount, sorry, not the average, the amount you'd add for a particular category or designer or family being present. So it's saying, if it's tableau building, add a coefficient; if it's a combinatorial game, I don't know what that means, add 0.4; if it's a Monopoly game, subtract something like 0.6 or 0.7. So this would be the way we fit a predictive model. We could probably even fit a few more coefficients in there. That's a pretty good graph, that's pretty solid. Unfortunately we're out of time for today; otherwise I would go into evaluating the model. I really like using the yardstick package and comparing it to other models. This would be a really interesting one to do some follow-ups on. This is a lasso regression model. It's a linear model: every single term just gets added in. So we were doing some feature engineering, but in the end we just said, let's add a certain number for every category that it has. It doesn't allow for nonlinear relationships and it doesn't allow for interactions. It can't say, if you're both a campaign game and you have this designer, that's an especially good combination. So that's something we didn't look at. What I really love about lasso regression, as I've mentioned before, is how intuitive the end result can be, even if we didn't
understand every step we were doing through here, where we were matching this vector of average ratings or doing the sparse casting, and even if we don't understand exactly how we chose a value of lambda, we do understand the end coefficients that it comes up with. We can say these are the quote-unquote good designers, families and categories, and these are the bad ones, and that itself is a pretty exciting thing to do with a model. Okay, so we did some exploratory analysis and we did some machine learning for predicting the average rating based on some features. Besides evaluating models and trying other approaches, one thing I would have done if I'd had more time would have been to look at the words used in the descriptions. There are probably common words in descriptions that might not be in the category or the family and that might give even more information about average rating. But for now, this is a way we can predict average rating for a collection of games. Okay, well, thanks very much for joining. I'm David Robinson, I had fun exploring this board game dataset with you, and I'll see you next week.
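For reference, the penalized objective described in the math interlude can be written out. The captions do not capture the equation shown on screen, so what follows is the standard form that glmnet documents for a Gaussian response with its default lasso penalty (alpha = 1). The symbols N, p, x_i, y_i, \beta_0 and \beta are notation assumed here, not taken from the video.

$$
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 \;+\; \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
$$

The first term is the usual sum of squares. The second grows with the size of the coefficients, so a large \lambda forces most of them to exactly zero, and lowering \lambda lets more terms enter the model one by one.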
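The modelling steps in this part of the screencast (lining ratings up with the matrix rows via match(), passing the value column into the sparse matrix, and choosing lambda by cross-validation) can be sketched on simulated data. This is a rough illustration, not the code from the video: the data, the tag names and every object name other than cv.glmnet and lambda.1se are hypothetical, and it swaps in Matrix::sparseMatrix() and coef() where the video uses cast_sparse and broom's tidy, only to keep the dependencies small.

```r
# Toy sketch with simulated data; none of these numbers come from the video.
library(Matrix)   # sparseMatrix()
library(glmnet)   # cv.glmnet()

set.seed(2019)

# Hypothetical stand-in for the board_games table
n_games <- 300
games <- data.frame(
  game_id = seq_len(n_games),
  max_players = sample(2:8, n_games, replace = TRUE)
)

# Yes/no features in long form: one row per (game, tag) pair, value = 1
tags <- paste0("category: tag", 1:10)
categorical <- expand.grid(game_id = games$game_id, feature = tags,
                           stringsAsFactors = FALSE)
categorical <- categorical[runif(nrow(categorical)) < 0.3, ]
categorical$value <- 1

# Numeric feature in the same long form; here value is NOT just 1
numeric_features <- data.frame(
  game_id = games$game_id,
  feature = "log2_max_players",
  value = log2(games$max_players + 1)
)

# Simulated ratings: tag1 helps, tag2 hurts, more players hurts a little
has_tag <- function(tag) {
  games$game_id %in% categorical$game_id[categorical$feature == tag]
}
games$average_rating <- 6 + 0.8 * has_tag(tags[1]) - 0.6 * has_tag(tags[2]) -
  0.3 * log2(games$max_players + 1) + rnorm(n_games, sd = 0.3)

# Stack both kinds of feature, then hold out every fifth game
features <- rbind(categorical, numeric_features)
features <- features[features$game_id %% 5 != 0, ]

# Build the sparse matrix, passing the value column through
row_ids <- sort(unique(features$game_id))
col_ids <- unique(features$feature)
feature_matrix <- sparseMatrix(
  i = match(features$game_id, row_ids),
  j = match(features$feature, col_ids),
  x = features$value,
  dimnames = list(as.character(row_ids), col_ids)
)

# Line the ratings up with the matrix rows using match()
ratings <- games$average_rating[
  match(as.integer(rownames(feature_matrix)), games$game_id)
]

# Cross-validated lasso, then the non-zero coefficients at lambda.1se
cv_lasso <- cv.glmnet(feature_matrix, ratings)
coefs <- coef(cv_lasso, s = "lambda.1se")
coef_table <- data.frame(term = rownames(coefs),
                         estimate = as.numeric(coefs[, 1]))
coef_table <- coef_table[coef_table$estimate != 0, ]
coef_table <- coef_table[order(-coef_table$estimate), ]
print(coef_table)
```

Taking the row names from the same matrix that is handed to the model is what keeps the ratings aligned; the bug in the transcript came from reading row names off a different object. Likewise, leaving out the value column would silently turn the numeric feature into a 0/1 indicator.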
Info
Channel: David Robinson
Views: 4,734
Rating: 5 out of 5
Keywords: tidytuesday, rstats, data science
Id: qirKGdQvy9U
Length: 63min 23sec (3803 seconds)
Published: Fri Mar 15 2019