Predictive modeling in R with tidymodels and NFL attendance

Captions
Hi, my name is Julia Silge, and I'm a data scientist and software engineer at RStudio, where I work on tools for modeling and machine learning. In this video we're going to use a data set from Tidy Tuesday about weekly attendance at NFL football games, and we're going to build predictive models — supervised machine learning models — using the tidymodels framework. This will be a good video if you're looking to get started using tidymodels for the modeling and machine learning that you need to do. So let's get started and build a couple of models. The data set we're using today is from this week's Tidy Tuesday; it's about weekly attendance for NFL games, and we have a lot of other data about the teams and how they're doing each year. I have an R Markdown file open here, and I'm going to run the first code chunk to get a few things set up. I've got a link to where the data is from, and I've copied over the instructions for how to get it. I load the tidyverse metapackage so I have access to packages like dplyr and ggplot2 for data munging and visualization, and then I run these other lines of code so that I have the attendance data set — which, from the data dictionary, is weekly, so it has a row for every game, and the weekly attendance number tells us how many people were at each game — and the standings data set, which has one row per team per year with information about their record. The first thing I'm going to do is join those up: I'm going to take attendance and left join it with standings, and the three things these data sets have in common are the team, the team name, and the year, so I'm going to join on those.
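The loading and join steps described here might look like the sketch below; the URLs and the join columns (`year`, `team_name`, `team`) follow the TidyTuesday 2020-02-04 data dictionary, so adjust them if the repo layout has changed.

```r
# Load the TidyTuesday data (2020-02-04) and join weekly attendance
# to the yearly standings on the shared columns
library(tidyverse)

attendance <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv")
standings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv")

attendance_joined <- attendance %>%
  left_join(standings, by = c("year", "team_name", "team"))
```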
Now I have something called attendance_joined where I have that all together: a data set with almost 11,000 rows, one row per team per week, plus other information about what's going on with that team that week or year, like their eventual record. Before we go on to train a model, I'm going to spend a little time on exploratory data analysis. The people who participate in Tidy Tuesday do a fantastic job of this, so I won't do anything too fancy with visualization, but before I start modeling I want to demonstrate a little EDA, because understanding your data is such an important part of building a good model. First, let's look at the teams. If we put the team name on the x-axis and the weekly attendance on the y-axis, and distinguish years when teams made it to the playoffs from years when they didn't, we can make a box plot — I usually like to make the outliers a little more see-through. This is going to be a vertical plot, so let's flip it so that we can read the team names. Great, now I can read the words. Let's also put the teams in order, say by median, instead of alphabetical order, so they're ordered by who had the highest weekly attendance: we use the function fct_reorder, and what goes on the axis is team_name, reordered by weekly_attendance.
That didn't seem to do anything, and it's because there are NA values in weekly attendance, so taking the median returns NA. Let's just filter those out — we're plotting that variable anyway — and now this makes a nice plot where everything is in order: at the top the Giants and the Jets, at the bottom the Raiders and the Bears. So we have a nice plot showing there are some differences in weekly attendance between the teams; if we're going to build a predictive model, this would be something we'd want to put into it. It also looks like for almost all of the teams there's a difference between playoffs and no playoffs in weekly attendance: in the years they did well and made the playoffs, more people came — for most teams, not all — so that would probably be something else to put into the model. Let's explore something else in the data. If we look at the column names, there's quite a lot about each team's record for the year: wins, losses, points; the points differential is the points they scored versus points they allowed; the margin of victory, I think, is related to that differential and how many games they played; the simple rating, I think, is related to the margin of victory and the strength of schedule; and so forth. I would imagine that margin of victory is related to the playoffs, but we'd like to see: are these measuring the exact same thing, and how much overlap is there? We could either go back to standings, or we could use distinct here — distinct on team_name, year, margin_of_victory, and playoffs.
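Putting the box-plot steps just described together — including the NA filter — a sketch, assuming the column names `team_name`, `weekly_attendance`, and `playoffs` from the data dictionary:

```r
# Box plot of weekly attendance by team, reordered by median attendance;
# bye weeks (NA attendance) are filtered out so fct_reorder can
# compute the medians
attendance_joined %>%
  filter(!is.na(weekly_attendance)) %>%
  ggplot(aes(fct_reorder(team_name, weekly_attendance),
             weekly_attendance, fill = playoffs)) +
  geom_boxplot(outlier.alpha = 0.5) +
  coord_flip() +
  labs(x = NULL, y = "Weekly NFL game attendance", fill = NULL)
```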
That gives us, for every team and year, the margin of victory and whether they made it into the playoffs or not. Let's make a histogram of margin of victory, which is a numeric value, with two histograms on top of each other in different colors so we can see them. We want them both to go down to the x-axis, so we say position equals identity, and let's make them transparent so we can see them on top of each other. Here we see the playoffs versus no-playoffs distributions for margin of victory. Something to notice when we're talking about building a predictive model: playoffs is about evenly divided — it's not a super minority class — so fortunately we don't have to deal with anything complicated like class imbalance in this variable we're going to use as a predictor, and margin of victory doesn't look distributed super weirdly. This is all looking good; I think we're going to be able to do this. Let's look at just one more thing: we've got weeks — the week of the season — and I'll admit I'm actually not much of a football fan, so I don't know whether attendance changes throughout the season; does it go up or down? Let's make week a factor, put week on the x-axis and weekly attendance — the thing we're going to predict — on the y-axis, make it prettily colored so we can see it more easily, and make another box plot (I always like my outliers to be faint). Let's see what that looks like: not huge differences during the season, maybe even drifting down, so maybe more people come at the beginning of the season than at the end.
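The two exploratory plots described above could be sketched like this, again assuming the data-dictionary column names (`margin_of_victory`, `playoffs`, `week`):

```r
# Overlapping histograms of margin of victory, playoffs vs. no playoffs
attendance_joined %>%
  distinct(team_name, year, margin_of_victory, playoffs) %>%
  ggplot(aes(margin_of_victory, fill = playoffs)) +
  geom_histogram(position = "identity", alpha = 0.7)

# Weekly attendance across the weeks of the season
attendance_joined %>%
  mutate(week = factor(week)) %>%
  ggplot(aes(week, weekly_attendance, fill = week)) +
  geom_boxplot(show.legend = FALSE, outlier.alpha = 0.4)
```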
But not a huge change here; we can try putting this into the model and see what happens. Since we're recording and want to get to the models, I won't keep going — these are just some brief examples of doing exploratory data analysis on the data we're getting ready to build a model for, and again, it's such an important part of the modeling workflow. Now let's build the data set we'll use for modeling. We're modeling weekly attendance, and I think each season every team has a week off; we don't need those weeks — we don't want to impute anything, because they just didn't play that week — so let's take them out. Then let's pick some columns to keep. We definitely want weekly attendance, because that's what we're modeling; team name, because we think there are some differences there; year and week, in case there are some effects from time; and then we have all this information about the teams' records. There would be various ways of deciding what to include — we could try throwing it all in, we could try different approaches — but for the purposes of this video, let's try three: the margin of victory, the strength of schedule, and the playoffs. We'll call this attendance_df, and this is our modeling data set right here. Everything we've done so far has been with the tidyverse metapackage, and now it's time to load the tidymodels metapackage: just as tidyverse contains things like dplyr, tidyr, and ggplot2, tidymodels contains packages built for building models.
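The modeling data set built above might look like this sketch:

```r
# Modeling data set: drop bye weeks (NA attendance) and keep the
# columns discussed above
attendance_df <- attendance_joined %>%
  filter(!is.na(weekly_attendance)) %>%
  select(weekly_attendance, team_name, year, week,
         margin_of_victory, strength_of_schedule, playoffs)
```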
The first one we'll use needs some functions from a package called rsample. What we're going to do is take attendance_df and split it into a training set and a testing set, using a function called initial_split. You can see the proportion default is 3/4, so we'll keep 75 percent of our data in training and put 25 percent in testing. We can also give it a strata argument, which says: when you split this data, make it about even according to something I have in there — we'll use playoffs, so we have about an even balance between teams that made the playoffs and teams that didn't. Let's look at this thing: it's a split — underneath it's like a list, but it's of a split type — and it keeps track of which bits of data belong in training and which belong in testing. To actually get the training data out, we call the function training() on the split; we'll need it, so let's assign it a name, and while we're here let's make the testing set with the function testing(). So all together we have this split object, and we call these functions to get out the data we'll use for training and testing. Notice that nfl_train has about three-quarters of the data in it and nfl_test has one quarter, and we divided evenly by playoffs so that we have about the same proportion of playoff teams in both sets. All right, now we have our training and testing sets — maybe I'll move this up so it's easier to see on the video — and the next thing for us to do is to build some simple models.
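A sketch of the splitting step just described (the seed value here is an arbitrary choice, added for reproducibility):

```r
library(tidymodels)

set.seed(1234)  # so the split is reproducible
attendance_split <- attendance_df %>%
  initial_split(strata = playoffs)  # default prop is 3/4

nfl_train <- training(attendance_split)
nfl_test <- testing(attendance_split)
```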
First, an ordinary least-squares linear regression — what you'd get by calling lm() — and then a random forest. It's going to take a little more code to train the linear regression this way than just typing lm(), but the point is for us to understand how the tidymodels framework works, because it is composable, consistent, and extensible, and allows us to train so many different kinds of models. The first thing you do in tidymodels is set up a model specification, saying what kind of model we're going to fit. For the linear regression we say: hey, I want to train a linear regression model — we're not doing any regularization or anything, just plain vanilla linear regression — so we call it like this. Then we call set_engine(): we could set the engine to something like Stan to use a different package underneath, but here we say set_engine("lm"), which does exactly what it says — train this with lm. That's all this is doing right here, but this framework can be used, in a composable, flexible way, to say all the different kinds of models you'll want to train. Let's call this lm_spec and see what it looks like: it prints as a linear regression model specification, telling us what the thing we made is. Now that we've made a model specification, we can fit with it: we pipe to the fit() function, and the first thing we give fit() is a formula — weekly_attendance explained by the tilde-dot, meaning everything else in the data set is used as a predictor, and the thing we're predicting is weekly attendance.
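The specification and fit just described, as a sketch:

```r
# Model specification: plain OLS linear regression, lm engine underneath
lm_spec <- linear_reg() %>%
  set_engine(engine = "lm")

lm_fit <- lm_spec %>%
  fit(weekly_attendance ~ ., data = nfl_train)

tidy(lm_fit)  # coefficient estimates: what pushes attendance up or down
```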
The data argument here is nfl_train, our training data set. Let's save this as lm_fit — and we have a model. This is in fact exactly the same as if we had called lm() directly; it even says so right there in the output. The reason we're walking through it this way is to see how we go about specifying a model and then fitting a model, and because everything is composable we can change pieces, such as the engine, for all the different ways we might want to do this. For this particular model we can, for example, tidy() it and see what's going on: what are the biggest things that push the weekly attendance highest or lowest? We can see those results there. So that's our first model — we did it, we trained our first model using the tidymodels framework. Now let's train our second model, a random forest. Here is the random forest model specification: random forests can be used for either classification or regression, so we have to tell it that we want regression here, and then we need set_engine(), because as you probably know there are several packages for random forests in R — let's use the ranger package. Let's call this rf_spec: again, a model specification, saying here is the kind of model I want to fit. Run it and look at it: we've specified the model we want to fit, and now that we've specified it, we can fit it.
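The random forest specification and fit described here, sketched out:

```r
# Random forest specification: regression mode, ranger engine
rf_spec <- rand_forest(mode = "regression") %>%
  set_engine(engine = "ranger")

# Same fit() call as for the linear model, with the other spec piped in
rf_fit <- rf_spec %>%
  fit(weekly_attendance ~ ., data = nfl_train)
```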
We pipe the specification into the fit() function — and actually I'm just going to copy and paste, because it's exactly the same call. One of the important design choices in tidymodels is to build composable, reusable pieces that are consistent with each other; if you're interested in using tidymodels, it's probably because you've felt some of the pain of not having those options elsewhere in the R ecosystem. Here is the output of the random forest fit — it fits, there we go. All right, we did it, we trained our models, congratulations to us. Now let's evaluate them, starting with the training data: how did our two models, the linear model and the random forest, do on the training set? We start with the fit we made and use the predict() function on it, saying new_data = nfl_train. I'll just run it for you: it gives me a tibble with a column of predicted values. Let's build onto this data frame: first let's add the true values in a column called truth — this is the training data, so that's weekly_attendance — and now that we've got the predicted values and the true values, since we're about to start binding things together, let's add another column called model that tells me which model this was. Then we bind_rows: I'll copy all of this, paste it in, put rf_fit instead of lm_fit and "rf" instead of "lm", and other than that everything is the same. Let's call this results_train: one set of rows for the linear model and one set of rows for the random forest.
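The copy-paste-and-bind step just described could be sketched as:

```r
# Predictions on the training set from both models, stacked together
results_train <- lm_fit %>%
  predict(new_data = nfl_train) %>%
  mutate(truth = nfl_train$weekly_attendance,
         model = "lm") %>%
  bind_rows(rf_fit %>%
              predict(new_data = nfl_train) %>%
              mutate(truth = nfl_train$weekly_attendance,
                     model = "rf"))
```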
We've got the predicted values and the true values, and you can probably guess what's happening next: I'll copy that, paste it, and make results for the test set — everywhere it says nfl_train, it becomes nfl_test — and that should do it. Let's run those. My results_test has fewer rows, of course, because there are fewer examples in the test data, but it has the same structure: the predicted values, the true values, and which model they came from. Now it's time to measure how these models did, using functions from a package in tidymodels called yardstick. Starting with the training data, we can group_by(model) — remember, that's the model column we made — and then call a function that gives us a metric; for this regression question, let's use RMSE. The things we need to tell rmse() are which column has the truth, the true value, and which column has the estimate from our model. Let's run that: the root mean squared error is significantly lower for the random forest than for the linear model, so evaluated this way on the training data, it looks like the random forest did better. Now let's look at the testing data — ooh, okay. For the linear model, the training data and the testing data have about the same root mean squared error, which means we didn't overfit: it's doing about the same, and we'd expect our model to perform about as well on any new data. I'm sad to say the same cannot be said for the random forest model the way we have set it up right now: its root mean squared error is much higher on the testing data than on the training data. A random forest is a pretty powerful machine learning algorithm, and it can essentially memorize the features of the training data.
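The test-set results and the yardstick metrics just described, as a sketch:

```r
# Same structure for the test set
results_test <- lm_fit %>%
  predict(new_data = nfl_test) %>%
  mutate(truth = nfl_test$weekly_attendance, model = "lm") %>%
  bind_rows(rf_fit %>%
              predict(new_data = nfl_test) %>%
              mutate(truth = nfl_test$weekly_attendance, model = "rf"))

# Root mean squared error per model (yardstick::rmse)
results_train %>%
  group_by(model) %>%
  rmse(truth = truth, estimate = .pred)

results_test %>%
  group_by(model) %>%
  rmse(truth = truth, estimate = .pred)
```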
What we've done so far, I'm sad to say, has not served us well for that kind of model with the data we have. Let's confirm that by making a quick visualization: take the testing data and label it with a column, then bind it to the training data labeled train = "training", so we have these things together in one data frame. We'll put the true value on the x-axis and the predicted value on the y-axis, make the two kinds of models different colors, and — just because it's always nice to see — put the one-to-one line on there, dashed and gray. Then let's put the points on; there are going to be a lot of them, so let's make them a little transparent, and facet_wrap by that train column so we can see the two panels next to each other. The line is a little too hard to see, so let's make it a little darker. Okay: training is over here, the random forest is the blue one, and the lm is the coral-colored one. There are significant differences in the shapes of these distributions, especially for the random forest: in the testing data, the linear model and the random forest are performing about the same — they're acting about the same — but on the training data they look quite different, with the random forest doing quite a bit better. This just confirms, using visualization so we can see with our eyes, the patterns we saw before in our summarized metrics. So, gosh, bad news, huh? This is sad. Fortunately, we have some options: this is a thing that happens with more powerful machine learning algorithms, and an option we have here is to use resampling.
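The diagnostic plot described above could be sketched like this:

```r
# Predicted vs. true attendance, faceted into training and testing panels
results_test %>%
  mutate(train = "testing") %>%
  bind_rows(results_train %>%
              mutate(train = "training")) %>%
  ggplot(aes(truth, .pred, color = model)) +
  geom_abline(lty = 2, color = "gray50") +  # the one-to-one line
  geom_point(alpha = 0.4) +
  facet_wrap(~train) +
  labs(x = "Truth", y = "Predicted attendance", color = "Model")
```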
is we will use resampling on the training data, and the point of resampling on the training data is to get a better estimate of how our model will perform on new data. Instead of the section up here, where we evaluated only one time on the whole training set, we'll use what we know about how random forests work to measure this in a more reliable way. And don't forget to set a seed — if you want this to be reproducible, set a seed before you split things or make resamples. We're going to use a function called vfold_cv, which makes folds for cross-validation; the default is 10 folds, and that's good for us. We make the folds on the training data — to be clear — and we can again use stratification to make sure each of the folds is divided up evenly by something, in this case playoffs: who made the playoffs and who did not. Let's make this and look at it. This is a data frame of splits, with an ID that tells us which split it is — different from attendance_split, which is a single split, one of these. This cross-validation object, nfl_folds, is a tibble with splits in one column and IDs in another so we can keep track of them. The way these cross-validation splits work is that they keep track of which rows, which observations, belong in each split and which don't. The data set is now split into ten pieces: we're going to fit a model to nine pieces and evaluate on the tenth, then shift — fit on another nine and evaluate on a new tenth that's been held out — then shift again: fit on nine, evaluate on one, and so forth, until we've moved through the whole set.
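A sketch of the cross-validation folds just described (seed value again an arbitrary choice):

```r
set.seed(1234)  # make the resamples reproducible
nfl_folds <- vfold_cv(nfl_train, strata = playoffs)  # 10 folds by default
```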
We can do this with a function called fit_resamples(). fit_resamples() takes at least three arguments; we're using four here. The first thing we send it is the same formula we've been using this whole time — weekly attendance explained by everything else. The next is the model specification; we have that saved, remember, we made it up here, and it's the same specification used the same way, except there we fit it to one thing and here we're using it to fit a whole bunch of things. Then we send it the resamples, nfl_folds. And finally, because I want to end with a little plot, we send a control argument: control_resamples(save_pred = TRUE). This allows it to save the predictions each time the model is evaluated — each time it trains on 9/10 and evaluates on the held-out 1/10 — so we can make a little plot. Let's call this rf_res and run it. This is, I think, the only thing in this video that will take a moment to run, because it's training a random forest model ten times — let's see how long we can gaze upon this sad plot, which made us sad, while it happens, and look forward to better results in a moment. While it's running, I'll talk about what we'll do next: we take rf_res — oh, it's done, that's good — and pipe it to a function called collect_metrics(). collect_metrics() takes something like the output of fit_resamples, where we fit something a whole bunch of times, gets the metrics out, and summarizes them, which is perfect, because that's exactly what we want.
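The resampling fit described here could be sketched as below; note that in current versions of the tune package, the model specification comes first and the formula second, the reverse of the order described in the video.

```r
# Fit the random forest spec to each of the 10 resamples,
# keeping the held-out predictions for plotting
rf_res <- fit_resamples(
  rf_spec,
  weekly_attendance ~ .,
  nfl_folds,
  control = control_resamples(save_pred = TRUE)
)

rf_res %>%
  collect_metrics()  # metrics averaged over the folds
```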
Notice what we have now — time for a celebration: the RMSE is now about the same as the RMSE from the linear model. Let's remind ourselves: it's also much higher than before — it's now about the same as what we saw for the testing data. We're no longer, you know, two-thirds as small or whatever it was; it's now much closer to those other values. Let's step back: what does that mean? Remember which data we're doing this with — the training data. What we did here is use our training data, with resampling, to estimate how well our model is performing, and then we can evaluate on our testing set and ask how we think we're doing. So we're now able to understand how our model will perform on new data much better: we made some sad, not-great choices up here, but now we're able to make better choices here. That's good. Let's wrap up with a quick visualization. If we look at what's in the result, we have the predictions, because we asked fit_resamples to keep them for us, and we can unnest them: we have the true value of the weekly attendance that week, the predicted value, and which row it came from in the original data — remember, these are randomly resampled. So we can make a visualization much like the one right here, but showing how the resamples did: the true value on the x-axis, the predicted value on the y-axis, color by fold ID so we can see the different resamples in different colors, the same slope-equals-one line, and the points with some transparency. So here we have all the folds, and notice how they're distributed: I mean, this model is doing medium-okay, but the way these are distributed is much happier than what we had before.
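The final plot described here, sketched with `collect_predictions()`, the current equivalent of unnesting the `.predictions` column:

```r
# Predicted vs. true attendance for the held-out part of each resample
rf_res %>%
  collect_predictions() %>%
  ggplot(aes(weekly_attendance, .pred, color = id)) +
  geom_abline(lty = 2, color = "gray50") +  # slope-one reference line
  geom_point(alpha = 0.2) +
  labs(x = "Truth", y = "Predicted attendance", color = "Resample")
```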
So that is that — we did it. We took this Tidy Tuesday data set about NFL game attendance, performed exploratory data analysis to understand a little about what's in our data, trained two predictive models, evaluated those models, and learned that for the random forest model we needed to use resampling to get an accurate estimate of how that model will perform on new data. We've just scratched the surface of what the tidymodels framework can do, and can help you do, for the modeling and machine learning questions and problems that you need to solve. Some next steps you might want to explore: learn how tidymodels can help you preprocess your data — the preprocessing steps you need to do before modeling — and also model tuning; often we need to learn hyperparameters from our data, and we can do that with tidymodels as well. Those would be some great next steps.
Info
Channel: Julia Silge
Views: 13,583
Rating: 4.9779005 out of 5
Id: LPptRkGoYMg
Length: 40min 43sec (2443 seconds)
Published: Wed Feb 05 2020