Using Machine Learning for Predicting NFL Games | Data Dialogs 2016

Thank you a lot. It's great to be here, it's great to meet the students and the faculty for the program, and hopefully it's a fun talk. So this is a standard quote that gets thrown around a lot. You know, the NFL is a very unpredictable league where a lot of times the better team wins, but generally speaking lots of crazy stuff happens. I don't know how many people are avid football fans or minor fans, but even if you're not, this talk is designed to show how you can use machine learning for this particular problem, and it's also designed to show you that machine learning is something you can actually use yourself. Most of the time when we talk about machine learning, Netflix is giving you recommendations, Google is searching things for you, Amazon wants you to buy things, but it feels like other people are doing it for you. What if you actually have a real-life problem yourself? For me, my problem was that I'm in a fantasy football league with my brothers-in-law and I want to win, right? So and I've ... 
I was doing it without machine learning for years. I was just thinking, oh, I'll pick this team, I'll pick that team, and every year I was doing it I'm like, I could probably do better, I should get the computer to make some picks for me. I waited many years and then finally I'm like, I'm just gonna do it. So this is sort of an example for you guys also, that you can actually use this stuff in real life. This is joint work with a good friend of mine. We came up with the idea together, we coded it together, and he teaches at another university in a financial engineering program. Also, honorable mention to my 12-year-old daughter. She also helps with this process: she helps make picks, and she insists on taking some cut of the winnings. Fifty dollars is a lot for a twelve-year-old, so she's pretty happy. Anyways, okay, so how do fantasy football leagues generally work? There are all different varieties of leagues. The one that we're looking at today is a relatively simple league; we're not picking individual players. But let's go back one step. Basically, as a football fan or a sports fan, you watch every week and you think you know better: the coach should have done this, or that team should have won. And then there are also all the people on TV saying this should happen or that should happen, this team's gonna win. Is that all really true? The other thing is that we have some market information, in the sense that before the games there is a betting line. You can go to Las Vegas, you can bet on these games, and you can see this thing called the point spread, which is the amount that a certain team is supposed to beat another team by. That probably encapsulates a lot of information, right ... 
People are sort of irrational, people always vote for their own team, but you imagine lots of people voting, and they're actually putting their real money at stake, so they're not gonna, on average, make stupid decisions. So one of the ideas for this league is that we start with the point spreads as a simple way to get started and see if we can really do any better than that using machine learning techniques. Okay, so how does our league work? Our league is, like I said, relatively simple. It's called a pick'em league, meaning that on average every week there are 16 games, sometimes 14 on bye weeks and stuff like that, but rather than worry about the point spreads, you're just supposed to pick who wins. So for example, I forgot all the games this week, but I think Denver was playing Oakland this week, right? You're supposed to pick outright. You're not supposed to worry about who's favored and who's not favored, you just pick who you think is the winner. And then the way you accumulate points in this league is that you have to assign points, 16 all the way down to one. If you get your top pick right, you get 16 points; if you get your second pick right, you get 15 points; if you miss your 14th pick, you get zero points for that one. And then the way you win this league is you accumulate points over the course of the year, and the person with the most points at the end of the year wins the league. Generally speaking, you can win an individual week by just going crazy, picking all the right upsets and getting everything just right, but if you're gonna win over the course of the season, it's gonna be better to be steady and consistent and just not make mistakes. So let's look a little bit at how this looks. When I go to the website, this is how I make my picks. I don't even remember, this was probably
like a long time ago, but this is just kind of how it works. You pick these two teams. This week the model, or I, or whoever, decided that Indianapolis was the top pick, and so we assigned a weight of 16 to them, a weight of 15 to them, and then you go and you just enter your picks and it locks them in, and then you're competing against everybody else. Okay, so what are the various strategies? One thing I already alluded to is, let's pick the simplest strategy that requires no brain power. Well, I mean, a little brain power; no brain power would be to just guess randomly, but that's not what we're trying to do. So we take exactly what Las Vegas is telling us, and we basically take the team with the highest spread. If a certain team is favored to win by ten and that's the highest that week, we put them at sixteen, and then the next team, maybe they're only favored to win by seven, so we put them second, and then we go down the line and order them in that ... 
sort of way. And then we have some various tie breakers: if two teams are both favored to win by four, we just pick the one that's the home team, or if there's still a tie, then we pick whichever one has the better record. But on average those little differences don't make much of a difference. On the other hand, you could just do it ad hoc. You could not care about what Las Vegas says and do your own thing. You could look at the win-loss records of the teams. You could look at whether they're playing a good team, whether they're playing an away game or a home game, whether they're playing a division game or a non-division game. This is a little nuanced depending on how familiar you are with the NFL. Basically the league is broken up into six divisions, or I think it's eight divisions nowadays, and you play all the teams in your division multiple times during the season, whereas you don't play everybody else. So you have a much more familiar relationship with the teams in your division: you play them much more, there are more heated rivalries, there's more competition, you tend to play a little bit differently. And then other things you could look into: you could look into injury reports, or you could just have personal preference and intuition. For example, I know my brother-in-law is a giant Steelers fan. He kind of can't physically bet against them even though they might be favored to lose, but he'll ... 
he'll pick them, but he'll put them at the bottom for one point, even if it means that he's picking the wrong thing. But I'm not going to do that; I want the machine to tell me what's the right thing to do. The other thing to remember is that, aside from the personal preference and intuition part, ideally the point spread encapsulates a lot of what's out there in the world. If some major player got injured, the point spread will reflect that. If some team doesn't play well on artificial turf or in bad weather, it should encapsulate that. So our data set is actually relatively clean. If we just look back historically at this league, what happens? It turns out that this spread guessing strategy, which I say requires no brain power, wins ... straight up will win this league half the time. I compiled, by year, the winning score of whoever won that year, and then what this spread method, using some back testing, would have gotten us. You can see that basically four out of the eight years that I looked at, just using no smartness, no machines, no intuition, you would have won this league. So now you've already set a pretty high bar, right? Because all these people, maybe 50 people in the league, are all doing their best to try to win, and this really simple method that requires no guessing is already outperforming them. So how can you do better? This is where I decided to start my machine learning project and see if I could do better. So, just some machine learning basics. We're gonna use a technique called supervised learning. Supervised learning is where you give the computer some training data: you give it what we call features, which are the known variables, and then you actually give it a known result, and then the computer extracts a model out of
that, and then using that model it can now predict what's going to happen with new examples that it's never seen before. And how good your model is comes down to how well you trained the model. Okay, so one quick thing. I think we've all seen linear regression before. This is not what we're gonna use. Linear regression is good for predicting some Y variable when you have a bunch of X variables, and we've all done this before, we've minimized things, but that's not necessarily going to help us with this problem, because we're trying to predict wins and losses. On the other hand, a technique called logistic regression is good for classifying things. I think this picture is from the Coursera machine learning course, but basically you're trying to discriminate between two groups of people. You can see this blue line is what's called the decision boundary, and you have people that are these yellow dots and you have those black pluses, and the decision boundary more or less does a good job of discriminating between the two, but it doesn't get everything quite right. And maybe that's okay, because you can't expect that your machine learning algorithm will get it a hundred percent right, and you're willing to live with what Ed mentioned, some googliness, right? It's okay, it doesn't have to be perfect. In fact, if you had drawn the perfect line that exactly discriminates between the two different data sets here, you get into a problem area that's called overfitting: you fit your data exactly, but you're not very good at making predictions, you're only good at memorizing what happened in the past. So you don't want to get into that problem. When you're doing logistic regression, because you're basically doing a binary classification, you want to use a function that helps you sort stuff out. So you can see the standard thing that goes into a logistic
regression is this thing called a sigmoid function. It has a nice feature that it's smooth, and if you're above zero you very quickly go up to 1, and if you're below zero you very quickly go down to zero. Basically the answer you get is related to the probability, the confidence in that pick. So if your output is close to 0.99, you have a very high probability of being classified as a 1; if it's close to 0.1, then you're closer to zero, so you have a low probability of being classified as a 1 and a very high probability of being classified as a zero. And this works for us because in the end what we're trying to do is classify, based on our history of these NFL games, did the team that was favored win the game that they were supposed to win or not. So now we're getting a little bit closer to solving our problem. In the simplest form, the logistic regression has a set of inputs called features and it has a single output for a binary classifier. In our case we have to figure out what are the relevant features that I want to include in the model, and I also have to think carefully about exactly what I'm going to get the computer to predict, because I want that probability to be meaningful when I go to make my picks, 1 through 16. So what are the things that we picked? We picked a very small number of features. We didn't look at a ton of data. We just looked at your current year's and last year's win-loss record. We looked at what week of the season it is, because let's say you have a hundred percent winning record, that's ... 
much different if you're 1 and 0 versus 10 and 0; the latter is much more meaningful. Also we looked to see if it was a home game, because there's a clear advantage to playing at home, and we looked to see if it's a division game, because this is one of the things that I've noticed over the years: teams in general will play each other well in these division games, and even if you're an underdog, you tend to play much, much better against your division opponents because of the familiarity, especially at home. I don't know how many people are fans, but all sorts of crazy stuff happens. The Jets, for example, are not good, but they'll beat New England at home and nobody's surprised. And then the spread: this is also one of the key pieces of information that goes into it. The idea here is that we're using the spread and we're using these other features to augment the model to see if we can do better. And then the binary classifier, the final thing we're trying to predict, is did the team that was favored win the game or not, so just zeros and ones. Okay, so it's a data science talk, we're gonna do some Python. Python has a really nice machine learning package. I'm sure if you're taking the machine learning course you've run into what's called scikit-learn. It's actually pretty straightforward, like the actual ... 
like Ed said, 80% of this work was getting the data formatted correctly so that it could actually go into three lines of code. Literally three lines of code. You have X's, which are your features, you have Y, which is your classifier, the known result, and you fit the model, you score the model, and you predict, and that's it. All the other stuff I'm going to show you is the 80%, which goes into making sure that the ones and the zeros and the numbers all look good together. All right, and then how do we do this in Python? There are these things called IPython notebooks. Normally on a weekly basis I just have scripts that run automatically and spit out the right answer, but when I'm doing what's called exploratory data analysis, or looking at results and trying to visualize results, we try to use these notebooks. So let's see if this doesn't break completely ... oh look at that, nice use of technology. All right, so here's my notebook. We'll go through it relatively quickly. First of all there's just some setup: you import some directories, you import some packages, turn off warnings. Let's see here. I'm not going to run it live, because I'm sure if I try to do that it would break, but I did run it not too long ago, so you should believe me it's not completely canned. Okay, so first of all we have some reference data ... 
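The fit, score, and predict steps described above can be sketched with scikit-learn's LogisticRegression. This is a minimal illustration and not the speaker's notebook: the column names and every number in the arrays are made-up placeholders, and the feature order simply follows the list given in the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy rows, NOT real NFL data. Assumed column order:
# favored_record, favored_prev_record, week, abs_line, favored_home, division_game
X_train = np.array([
    [0.75, 0.62, 5, 7.0, 1, 0],
    [0.50, 0.44, 5, 3.0, 0, 1],
    [0.80, 0.69, 6, 10.0, 1, 0],
    [0.33, 0.50, 6, 1.5, 0, 1],
    [0.67, 0.56, 7, 6.5, 1, 1],
    [0.43, 0.38, 7, 2.5, 0, 0],
    [0.86, 0.75, 8, 9.0, 1, 0],
    [0.50, 0.50, 8, 1.0, 0, 1],
])
# y = 1 if the favored team won, 0 if the underdog won (made-up labels).
y_train = np.array([1, 0, 1, 0, 1, 1, 1, 0])

# Two hypothetical upcoming games to predict.
X_new = np.array([
    [0.71, 0.62, 9, 7.5, 1, 0],
    [0.50, 0.44, 9, 1.0, 0, 1],
])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                      # 1. fit the model
train_accuracy = clf.score(X_train, y_train)   # 2. score the model
predictions = clf.predict(X_new)               # 3. predict (0 or 1 per game)

# Column 1 of predict_proba is P(class 1), i.e. P(favored team wins).
p_favorite_wins = clf.predict_proba(X_new)[:, 1]
```

The probabilities in `p_favorite_wins` are what a ranking of picks would be built from; scoring on the training rows, as here, only shows the mechanics and says nothing about how well the model generalizes.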
Here's the teams, what league they're in, what division they're in. This is important aside from the historical data. The next thing we're going to do is define what we call the test and training sets. Anytime you're doing machine learning, you're trying to make predictions and you're trying to see how good your predictions are, so you don't want to validate your model on stuff that it has memorized. You want to hold out some data that it hasn't seen before, and then you want to see how well your model works on that. Luckily for us, because we have a lot of historical data, what we chose to do is pick three years of data. So let's say we took the data from 2008, 9 and 10, and then we predict what we think would have happened in 2011, and since 2011 has passed already, we can test to see if our model was any good or not. So this is how we tested the model. But this is actually live, where I'm gonna show you what we do on a weekly basis to make the predictions for this week. Right now the test year is 2016. We don't know what's gonna happen, we want to predict for 2016, and we're gonna train based on these three years, 2013 through 15. We messed with different ideas of how many years to use. Five years seems like a good idea, but it ended up incorporating information that was a little too old; one season was not enough information to get the statistics robust. And then the other thing I would remind you is that this is mostly a fun project, and you guys can ask a ton of questions, like did I do this and did I do that. We thought about some things and we didn't think about others, but I think the idea is that you can use this as a starting point in your explorations using machine learning and see how far you want to go. I'm happy for suggestions, though, because I do want the model to get
better. Okay, so this is the part that's the 80%, basically getting all the training data. You read in all the games, you look at the records of the teams, you have to compute all these metrics for who's in what division, who won, who lost. I do it for the training set, I do it for the test set. Not that exciting. Okay, so right before I'm about to send the data to the model, what does it look like? The computer doesn't care whether Baltimore's playing Pittsburgh, it's just a name to it. The things that the computer cares about are the features that I talked about. So this is what the features look like. The favored record: this is the first week of the season, so clearly your ... your current record for everybody is 0%. And this is why the previous year's record is somewhat important, because in the first game of the season who knows who's gonna win, there's just the spread. But hopefully if the Super Bowl winner is playing somebody who was terrible last year, that's some indication of who might be better. So we have the previous record, we have which week of the season you're at, we have the line. We take the absolute value of it because we have another field here that says favored home game, so that automatically accounts for the minus sign or the plus sign as to who might be favored. There's this flag for a division game, and then this is the classifier. It's not that exciting, it's just zeros and ones: in that week, did the favored team win? So I send this all to the scikit-learn classifier and it's pretty straightforward. This is all wrapped so that we can run it over and over again, but what I showed you before about running the classifier and predicting, inside it really is just those three lines. So we set up the classifier, and then we can predict week 9, which is the week that just happened. We're gonna look to see what happens, and then we look at what does the
prediction data look like. Basically what's happening is, these were the games this last week, and I ranked them by the probability that the favored team would win. The nice thing here is that if I'm above 50%, that's telling me that the favored team should win, and there's only one upset pick this week. Turned out it didn't work. But everything that's above 50% should be that the favored team wins, and this also gives me a way to rank the teams from 16 all the way down to one. There are only 14 games this week, so it goes 16 down to three. So this is what I need in order to enter my picks into the system. And that's pretty much it, we can see what the model would have predicted. So let's jump back to the other thing here, we'll present, and then we'll show some results now that we've shown how we use this. Okay, so back testing. We trained over multiple sets of three-year periods, looking forward another year each time, and we looked to see how the spread strategy would have done against the person who won the league that year, and we also extrapolated how the machine learning strategy would have done that year. It looks pretty good. And this is back testing, so we just have to remember that back testing is never as good as forward testing. I don't know if anyone's ever traded on Wall Street at a hedge fund: you have all these great ideas, you're gonna make money, you try to put it in action in real life, and it doesn't work. But still, you have to do your back testing, and you have to convince yourself that you went through some reasonable amount of effort to make sure that you think the strategy is gonna work going forward, and then you tweak it along the way as things break or you come up with more information. One thing that I'll mention is that I keep referring to this moderate strategy over here. There's a bunch
of different ways that you could actually make the picks in this particular league. One particular way, which I call the conservative strategy, is to just always pick the favorite regardless, and only use the numbers to reshuffle the order. That would be very similar to the spread strategy, it would just change the order of some of them. The other thing is to actually pick the predicted team. So for example, I don't know if you remember, at the bottom it said Pittsburgh had a 44 percent chance of winning, which means Baltimore, the underdog, should be favored to win. So we're gonna actually pick Baltimore to win, but we're gonna put them at the bottom of the pile just because it's an upset. The other thing we could do, which I call the aggressive strategy, is to figure out what's the relation to the 0.5. Because what if the probability of Pittsburgh winning was zero? That means there's a hundred percent chance that Baltimore is gonna win, so then I should actually take Baltimore and put it way at the top at sixteen. But we did some back testing on that, and it turned out that the aggressive strategy tends to have a very high standard deviation. It wins some years by a hundred and forty points and it loses other years by a hundred and forty points. So in an effort to be a little bit more conservative and to see if we could win more consistently, we decided to pick this moderately conservative strategy. And then live testing, right? Live testing: does it work any good at all or not? 2014 was the first year that we ran the strategy. The spread strategy actually won that year. My daughter was happy, because she's the one who puts in the picks for the spread strategy, because she's pretty sure that that's the best one. The moderate strategy did not do so well that year. Last year was pretty ideal: the moderate strategy
came in first place and the spread strategy came in third place, and the second-place person was just barely above the spread strategy. So that was actually kind of nice, and it was a little bit of validation of the model and how it works, and we were happy to see that happen. Currently we're not doing so hot, but I will say that because it's a slow and steady strategy, about two-thirds of the way through the season is when it really starts to build up, and the consistency of it starts to outperform the people that are just making random guesses on a weekly basis. So hopefully good things will happen. And then, depending on how much football you watch on Sundays and Monday nights, this is what we had picked for this current week. You can see that the spread strategy, which is the far corner, if you see favored win, only got two wrong, whereas the algorithm with the moderate strategy actually got three wrong, because it wrongly picked the upset of Baltimore over Pittsburgh, and Pittsburgh actually won. And then the Seattle game, which is why I'm wearing a Seattle t-shirt, is gonna happen tonight, and they're predicted to win. I'm not necessarily a fan, but it's fun to root for the algorithm ... 
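The three ways of turning probabilities into picks described above (conservative, moderate, aggressive) and the league's confidence-point scoring can be sketched as follows. This is one reading of the talk and not the speaker's code: the function names, the tuple layout, and the use of distance from 0.5 as the aggressive ranking key are assumptions. Only the Pittsburgh figure of 0.44 comes from the talk; the other two games are placeholders.

```python
def make_picks(games, strategy="moderate", top_points=16):
    """Turn model probabilities into ranked picks.

    games: list of (favorite, underdog, p_favorite_wins) tuples.
    Returns a list of (picked_team, points), highest confidence first.
    """
    if strategy == "aggressive":
        # Rank by distance from a coin flip, so a strong upset call can rank high.
        ordered = sorted(games, key=lambda g: abs(g[2] - 0.5), reverse=True)
    else:
        # Rank by P(favorite wins); predicted upsets sink to the bottom.
        ordered = sorted(games, key=lambda g: g[2], reverse=True)

    picks = []
    for rank, (favorite, underdog, p) in enumerate(ordered):
        if strategy == "conservative" or p >= 0.5:
            team = favorite   # conservative never picks against the favorite
        else:
            team = underdog   # moderate and aggressive follow the model
        picks.append((team, top_points - rank))
    return picks


def score(picks, winners):
    """Sum the points of every pick whose team is in the set of winners."""
    return sum(points for team, points in picks if team in winners)


# Placeholder games, plus the Pittsburgh/Baltimore probability from the talk.
games = [
    ("Home A", "Away B", 0.81),
    ("Home C", "Away D", 0.62),
    ("Pittsburgh", "Baltimore", 0.44),
]
moderate = make_picks(games, "moderate")
conservative = make_picks(games, "conservative")
```

With these inputs the moderate strategy picks Baltimore with the lowest point value, while the conservative strategy keeps Pittsburgh in the same slot. Because points always count down from 16, a 14-game week ends at 3, as in the talk.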
And that's all. I believe we have a little time for questions. Thanks very much. We've got a hand right up there at the back straight away. If anyone's got a mic, we'll go to the back. Thank you.

Hi, thanks so much for your presentation. One immediate question I have: I'm a sports fan, I like a lot of different sports, and if I were gonna do something like this, football would not be first on my list because of the very limited number of games. That seems like at least one thing that would make this a little less conducive; your training sets and dev sets and all that just can't be as large as with, you know, baseball. So did that go into it? I mean, are you just a huge football fan? What were the contributors?

No, I'm a big sports fan all around, and the idea is to use this as a starting point. We definitely want to look into baseball and even, I don't know, pro cycling, which is one of my favorite things. It doesn't seem like a team sport, but it is. So yeah, I agree, with baseball there's definitely a lot more, like 162 games over the course of the season, lots of individual player statistics. And the other thing is, this was just literally to get started, to win this particular league, but you can imagine starting to look at player statistics and how you would do a player-oriented fantasy league. But yeah, it's certainly a good point; it's not limited to football at all.

Thanks. Okay, let's go right across at the end. We're just down to one mic at the moment. Okay, the gentleman in the white shirt there, and then you can pass it back for the next question after that. Thank you.

Hello, thank you for the presentation. We can hear you? Yeah, it's good. So my question is that it sounds like your model relies a lot on the Vegas spread. Mm-hmm. I was thinking, why do you use the Vegas spread, and have you thought about replacing it, say,
with, I don't know, the predictions from 538, for instance, instead of the Vegas spread?

Yeah, so that's a good point. I think the thing is, what 538 does is some version of this, right? They do something else that is also model driven. One of the lessons that I got from the years of working on Wall Street is that there's market information. So 538 is model information, and then there's market information, and there's a difference between what the bank says is the valuation of a security according to the model and what the market says, and companies have gone bankrupt and credit crises have happened. So the idea was to take market information. What we do is look at 538 to see what they're predicting. I think even Bing, Microsoft's search engine, if you just type in NFL games, gives a probability of winning, and we've matched up ours to see, are we totally off base? We don't really know what they're doing, but it must be something along these lines, though probably they're using a much larger, richer data set. This is meant to be relatively simple, literally about five inputs into the model, to see if we could do something that's interesting and effective. But yeah, great question.

Hi, my question is, can you talk a little bit about why you picked logistic regression versus any other classifier?

Yeah, so one of the things I didn't show here is that we do like to run a third strategy and ... 
using support vector machines, and that one also has really high volatility, so we haven't gotten that to work. There is a certain amount of, what's the word, machine learning know-how, really understanding some of those algorithms on a much more theoretical basis. Logistic regression I feel is the simplest to understand because of the binary classifier and because we're doing a binary output. For example, with the support vector machines, when it gives the probability, it's not usually visualizable in this sigmoid-function-oriented way. So for illustrative purposes logistic regression is also a great idea. We are trying to test other things, decision trees and random forests, to see if something would do better or not. But that being said, so far over the years the logistic regression has actually performed the best in terms of going up against live competition.

Okay, thank you. Okay, we've got a hand right at the back there in the corner. Yes, mic's coming to you.

Thanks. Sorry, since it's a dialogue, I feel obligated to talk for a minute, not ask you a question. Is that fair? I'm trying to connect what Ed talked about and what you talked about with the question of why start with a domain that doesn't have that much data, not sports but this particular sport. I'd like to tell you that I think statistics and machine learning started with small data, not big data. I know it's very common to think about big data as the challenge, processing the data and doing all that, and it's much more intensive when you have big data, but the challenge with small data is actually a very important one. I know sports may not be life-threatening, and I personally don't think about it as being as important as other things in life, but I think that this raises a really good point in data: more data is better than no data, and
it's a really big, important topic. There are a lot of domains that don't have big data and are very, very important to tackle, and the algorithms, not all of them but a lot of them, especially the stuff that you were talking about, are applicable, and you should not shy away from them just because they're smaller data sets. I'm going to talk a little bit about agriculture, where life isn't as pretty with big data sometimes, and so I'm really a big advocate of taking something with small data and showcasing it, trying it, and even failing and learning from it. So I appreciate the effort to go not into baseball, which I personally dislike. Thank you for trying to tackle something different ... 

I've got a mic, so I'm going to talk. Going on the same line of thinking, do you think adding additional features to those initial five would improve the quality of your predictions, given your experience, or do you feel like the simplest method is the best, given the relative success of the spread?

So I think definitely more data would enhance the model. But the nice thing about machine learning models is that you don't presuppose. I put in this thing for the division games that I think is important, right? But then when you actually run the model, it spits back out at you what the relative weight of that factor is, and if it thought it was useless, it would be zero. Another thing you can do, with what's called feature engineering in machine learning, is take out that feature and see if your training accuracy goes up or the results are more stable. We did that with at least the features that we picked. We are trying to figure out how to add more data, but it is also an 80% problem, right? It's just work, and we just haven't gotten there. But I think definitely there's lots of other data in the
world where you can look at defensive statistics and offensive statistics. You know, do you care about individual players and injuries and stuff like that? Like, if a star quarterback is not playing, is that the difference? I guess the thing is, the whole point of this exercise is that the spread is already telling you something, and can the model uncover, say, does a home game mean more than what the spread is already telling you? Because generally speaking you hear anecdotally that, all things being equal, the home team has a three-point advantage in the spread, right? So it's already baked in.

So it's an efficient market, I thought.

What's that?

Like the efficient market hypothesis.

Right, exactly. So it's only: can the model discover something that the spread is not taking into account as much as it should, or is overdoing? So yeah, that's an open question. Who knows which things will be relevant or not? But we're definitely trying with other data sets. Volunteers are welcome. ...
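The two checks described in the answer above, reading the weight the model assigns to each feature and removing a feature to see whether accuracy changes, can be sketched roughly as follows. This is a minimal illustration and not the speaker's code: the data is synthetic, the feature names (`spread`, `division`) and the helper functions are hypothetical, held-out accuracy is used for the comparison as a choice made here, and a small hand-written logistic regression stands in for scikit-learn so the snippet runs with the standard library alone.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Model's probability that the favored team wins."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            err = predict_proba(w, b, x) - target
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of games where the 0.5-thresholded prediction matches the outcome."""
    hits = sum((predict_proba(w, b, x) >= 0.5) == (target == 1)
               for x, target in zip(X, y))
    return hits / len(y)


def make_games(n, seed):
    """Synthetic games (an assumption): the outcome depends on the spread only,
    and the division-game flag is pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        spread = rng.uniform(-1.0, 1.0)                 # scaled point spread
        division = 1.0 if rng.random() < 0.4 else 0.0   # hypothetical flag
        p_win = sigmoid(3.0 * spread)
        X.append([spread, division])
        y.append(1 if rng.random() < p_win else 0)
    return X, y


X_train, y_train = make_games(400, seed=1)
X_test, y_test = make_games(200, seed=2)

# Full model: inspect the learned weight of each feature.
w_full, b_full = fit_logistic(X_train, y_train)
acc_full = accuracy(w_full, b_full, X_test, y_test)

# Ablation: drop the division flag and refit on the spread alone.
X_train_ab = [[x[0]] for x in X_train]
X_test_ab = [[x[0]] for x in X_test]
w_ab, b_ab = fit_logistic(X_train_ab, y_train)
acc_ablated = accuracy(w_ab, b_ab, X_test_ab, y_test)

print("weights (spread, division):", [round(v, 2) for v in w_full])
print("held-out accuracy, full vs. ablated:", acc_full, acc_ablated)
```

Because the synthetic division flag carries no signal, its weight should come out small next to the spread weight, and dropping it should leave accuracy roughly unchanged. With real game data the comparison could of course go either way.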
We have time for a couple more questions. Let's take, yeah, the mic is getting passed to you there, and then we'll go to the back for the final question.

So has anyone in your league become more data-driven as a result of you doing this project?

I'm not sure they all know I'm doing it. I even tell them that the person who wins the league just straight up uses spreads. But I think the thing is, for everybody it's a fun thing, right? I think what happens with a lot of people is they start with the spreads and then they make tweaks; they just adjust based on personal preference and intuition and what they happen to know. So I think maybe some people are doing that a little bit more, but it still seems to be just kind of a fun thing: let me see if I can outwit the algorithm. And that's not to say the algorithm is amazing all the time. It does pick bizarre things, and even I question it, and I'm like, who knows? But, you know, we just go with it and we root for the algorithm.

Yeah, question at the back there.

Yeah, all right. So when you were talking about the spread, because there's a lot of interesting information baked into it, have you ever actually thought of trying to predict the spread, and seeing what kind of inputs actually go into the spread itself, to try to understand that a little bit more deeply?

So we haven't tried to predict the spread, but one thought that we had was to take out the spread and see if the model could rank, you know, independent of the spread, right? That would be an interesting thing, because then you're almost recreating the spread, or doing an agnostic thing where you're not taking in this market information. So that's one thing we thought of. We haven't tried to predict the spread; that's an interesting idea. I guess what you'd have to do is take these probabilities and map them historically to what that means, right? Like, is a 90% probability a 10-point spread or not,
you know, then you would do some linear regression, probably.

Yeah, seems like it'd be fun. And the other thing I was thinking of: have you thought of incorporating, because we have these beautiful simulators that we've been building for 12 years now, as in Madden 2016, actually running a bunch of games on that and then adding that as another feature?

The one thing I have thought of, though: so this is obviously writing a bunch of Python code using scikit-learn, et cetera. There are starting to be drag-and-drop machine learning tools for the non-programmer. I think Microsoft has something, there's something called BigML, and there's this new thing I just ran across the other day called Orange, I forgot what it was called, Orange something or other. But basically there are tools for the data-savvy person who is not necessarily a Python programmer to pull in your data, do a little bit of cleansing, do a little bit of that 80 percent, and say, these are the features, this is the variable, now predict for me. I actually tried running this on BigML, and it more or less gives the same answer that the model was giving, short of not knowing what the tiebreakers were. So that was kind of cool. I thought, this is almost achievable for the masses, or my brother-in-law, if he wanted to, you know.

Great. Amit, thanks so much for sharing your fantasy football work with us.
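The idea raised in the final exchange, taking model win probabilities and mapping them historically onto point spreads with a linear regression, might look something like the sketch below. The speaker says this was never tried, so nothing here reflects real results: the `history` pairs are made up purely for illustration, and `fit_line` and `implied_spread` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up (model win probability, historical point spread) pairs.
# These numbers are illustrative assumptions, not data from the talk.
history = [
    (0.50, 0.0), (0.55, 1.0), (0.60, 2.5), (0.65, 3.0), (0.70, 4.5),
    (0.75, 6.0), (0.80, 7.0), (0.85, 8.5), (0.90, 10.0),
]
probs = [p for p, _ in history]
spreads = [s for _, s in history]

intercept, slope = fit_line(probs, spreads)


def implied_spread(prob):
    """Point spread the fitted line associates with a given win probability."""
    return intercept + slope * prob


# Does a 90% probability correspond to roughly a 10-point spread on this toy data?
print(round(implied_spread(0.90), 1))
```

A straight line in probability is the simplest reading of "some linear regression"; regressing the spread on the log-odds of the probability would be a natural alternative to try if the relationship bends near 0 or 1.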
Info
Channel: Berkeley School of Information
Views: 37,369
Rating: 4.928287 out of 5
Keywords: UC Berkeley, ischool, school, of, information
Id: 8emUyzczThY
Length: 37min 15sec (2235 seconds)
Published: Thu Dec 08 2016