Fortune-Telling with Python: An Intro to Facebook Prophet

Video Statistics and Information

Captions
I'm not big on complicating things just for the sake of complicating them, so whenever I'm doing data science I'm always checking out the new packages, the new tools and functionality that's available, to see if I can refine and simplify my analysis. One of the tools that came on my radar recently was Facebook Prophet.

Now, for those of you that are familiar with time series forecasting, you'll know that R is usually the platform, the foundation, of good time series forecasting, as a lot of packages have had a lot of history with it. Python, not so much, right? Python is very new to the game, even in terms of statsmodels and other packages, but we do see a lot of momentum and a lot of brand-new packages that are coming to the fore, including this open source package by Facebook.

So just by a show of hands, so I can get to know you guys, get to know what your backgrounds are, but also find out how quickly I can get through this so I can jump into that conga line for the office tour: who here is a beginner when it comes to time series? So either you know nothing, or you may have played around with it a little bit but you didn't really know what you were doing. OK, great, so easily half of the room. What about intermediate? So you use it fairly regularly, you might be dangerous enough to tweak your models, things like that. All right, good. Any experts in the room? You are an expert at time series modeling? Put your hand down, I know you're not. Any real experts in the room? You can tell by the way they smile at you that they're full of it. All right, so that's good to know.

I wish I had time to go through this, like, 8,000-line Jupyter notebook here, but I focused just on Facebook Prophet, and we'll go through quickly because I know you guys want to socialize. A couple of key things about this package. It is a piecewise, either linear or logistic, growth curve trend; I'll talk about what that means in a little bit. It's got a yearly seasonal component. I hope we all
know what seasonality is, but it's these fluctuations across the year. So holidays, or Q4 for retail, is huge because people are buying more stuff to give gifts. There are also seasons, months of the year, when business is really slow, and depending on your industry it could be the summer or it could be the winter, right? We also have weekly fluctuations, or weekly seasonality, and we definitely see peaks in certain activities on the weekends and lulls during the weekdays, or vice versa. You can have other seasonality too, right? You can have monthly, you could have multi-year, you could have quarterly, but Facebook decided to really focus on these because these are very pertinent to them. What's nice is it translates very well to most people's businesses: some weekly seasonality, some yearly.

Another cool thing about this is it's got holidays that you can actually provide yourself, and basically customize when it should be taking an extra look at those specific dates, as in, you know, we expect interesting things to be happening on these dates.

It's built in Stan. So who here is familiar with Stan? OK, so if you know Stan you're probably a fan, right? Stan is very fast, and because of this we can actually fit models to time series data extremely quickly. Another nice thing is the code translates very easily between R and Python, so whenever they make an update for one, the other will probably be instantaneous, and it's very easy to update. It's just not going to be as much overhead for the team and for all those who are contributing.

One more question before we actually jump into the demo. The foundation of this time series model is called a generalized additive model. Anybody here familiar with additive models? Show of hands. Very few of you, that's kind of interesting. So additive models are very powerful. Most people have no idea what they are, and I'm gonna try to teach you guys in 30 seconds what these things are. So if you know what a linear
regression model is, right, you've got these beta coefficients: beta zero, which is my y-intercept, and then a beta for every single input. That kind of limits these models, because when you're talking about a beta you basically have a number, you have a scalar, that's all you can have. We go to additive models and these things are much more sophisticated. They actually have custom functions for y with respect to each input. So interactions, right, when we multiply x1 times x2, that's basically a basic example of what these additive models are doing, but they're a lot more sophisticated than that. You may ask, holy crap, how do you even solve these things? How do you figure out what the best fit is? It's much more challenging mathematically, but there's actually a technique called backfitting, which should remind you of the term backpropagation with neural networks, so there's a lot of similarities between these things. You guys comfortable with it? Yes? Yeah. So additive models: very complex, but in a nutshell they use custom functions, which really allow them to be much more powerful than a simple linear model.

So we'll jump straight into this. Well, I'm taking a look at Peyton Manning's Wikipedia page views, and as we see here we have some serious spikes going on. So this is basically days from the first day, just the raw data, and then the actual number of views on the left. Does anybody have any ideas what we may want to do to make this data look cleaner? Group by days of the week? No, we want to keep it on day. Just like a linear trend? So normalize it? What do you mean by that? OK, what's a great way to normalize this data? Maybe anything more simple? Take the log, that's right. Whenever you've got right-skewed outliers, what do you do with your distribution? You take the log, and we basically normalize all those right skews. So that's what we're doing here, we're doing a NumPy log. Let's take a look at it again. Boom, we've got something that looks a
lot better now. Not only does it look better to us, but it's better for time series forecasting, because stationary data is generally a requirement for lots and lots of models, almost all of them. So this data looks stationary. Basically, in a nutshell, that means that the average is not changing over time, so if you looked at different windows of the data the average wouldn't be changing, and the variance also is not changing. Now, it is to an extent, but it's generally stationary. It sits on a spectrum.

Here's another cool thing. I told you this would be pretty simple, right? So they followed the sklearn syntax here, and you basically just declare a model and say .fit to my data. Boom, I'm out. I'm not gonna drop this mic 'cause it's not mine. No, no, we'll keep going, we'll keep going. Thank you, I'll be here all night.

So the next thing we do for forecasting into the future is a pretty simple hack. What we're doing is we're just creating a data frame and we're populating the dates. So it can be a date, in other words your year, month, day, or it can be a datetime that includes hours and minutes and things like that. Now keep in mind this model is designed to work on a daily level. It can work on different levels, but it does best on daily. Even so, it's really great, it's very flexible and does a good job with other inputs too. So what I'm doing here is I want to look a year into the future and see what future page views might look like for Peyton Manning's site, and I just do a predict on future, so I'm taking that data in.

Oh, by the way, a little bit of housekeeping there: they need you to explicitly label your column headers, so your dates need to be called ds and your outcome is y. I actually did that up here at the very beginning somewhere. Maybe I didn't, though, because this one was already formatted as such, but you'll see what I did in the next example.

So the forecast matrix that comes out of this has a lot of really good information. We're gonna subset it
just so we can take a look at the stuff that matters most. So we've got future dates, we've got a yhat, which is our prediction, and then we've got a confidence interval, which is always a nice feature to have for any model.

Let's take a look at our forecast, and that is what it looks like. So here we are on the right, and we are predicting basically after Super Bowl 2016 for the next year. Anybody see any trend or maybe general pattern in this data? OK, so there's strong seasonality. When we get into the playoffs we get spikes on Peyton Manning's page. Now if you had another quarterback who may be as talented but is on a terrible team, I'm not going to give any names, right, they're probably never making the playoffs and you're not gonna see those spikes. You might see them peaking in the regular season, but of course Peyton Manning, he's a pretty winning quarterback, so we see those spikes.

Anything else? Let's look bigger picture. Ooh, you're starting to see this guy get into the sunset years of his career. So you can see that it kind of peaked around 2010, '11, '12, '13, '14, let's say, and then we do still have some outliers up here. Of course these are the actual data points in blue, sorry, not in blue, in black, the black points. The prediction is in blue. And by the way, we can actually have our model predict historical points, because that's always a good habit to have, just to see how well it's doing previously, especially if you don't do like a train/test split. It lets you know, hey, here are some actual observed values, I can see how well this did. And then we have the confidence intervals in the lighter blue shading. So yeah, he is sunsetting, it seems like.

Check this out, though. We can actually break the raw data down through mathematical analysis into the subcomponents of time series, and this is just mind-blowing if you deal with time series data in general. We can split these components into a trend,
into a seasonal factor (by the way, there's two of them for Facebook Prophet), and then we can also look at errors, so random fluctuations, which of course are not really random but they're attributed to lots of little things that are too small to track, things like that.

So here's the trend: downward, of course. Weekly: is this what you guys expected? OK, Sunday is kind of expected, right, but Monday is actually the peak day that people are going in and checking out the Wikipedia page. That's interesting. Maybe it's a combination of, hey, I actually have time to geek out a little bit now that I'm at work, I should be working, but of course what we all do is we check our fantasy football stats and see how we did, and maybe Wikipedia is quick at updating it, things like that. OK, Monday morning quarterbacks, there you go. Yeah, armchair quarterbacks, of course.

And then we have the yearly. As many of you have mentioned, we see this yearly seasonality where nobody really cares about Peyton in May, June, July, and then all of a sudden they're really interested with the start of the season, which we're about to get to here, and of course it peaks up here because he usually goes pretty deep into the playoffs.

We'll keep going here. There's one key thing missing here, and it's holidays. These are not Christmas with Peyton or Thanksgiving with Peyton and his family. These holidays are the playoff games and the Super Bowls that he did showcase in or feature in, and we can actually split these up separately, so we've got the playoffs here and we've got the Super Bowl here. Quick note on this guy, lower window and upper window: I can actually designate spillover. So if I did like a negative two on the lower and a plus two on the upper, that holiday is so impactful, maybe it's like Fourth of July or something for your business, that it's gonna impact the third and the fifth of July, and we would expect to see something like that for big holidays like Christmas and
Fourth of July. So we'll bring these guys in, and then we're gonna take a look again, and once we actually do our forecast we can see the effect that various Super Bowls have on Peyton Manning's page. The cool thing is here, so check this out. What are we looking at here? What do you guys think the lower bars are? Or, to rephrase this, did Peyton Manning make the playoffs in 2012? He did not. OK, which Super Bowls do you think he featured in? Yeah. So he's played in four Super Bowls, he's won two of them. So he played in 2007, we don't have this data, 2010, 2014, and of course he won in 2016. All right, and then weekly looks the same, and yearly looks the same too.

So on to something more existential here. Not Tom Brady, more existential than that: predicting carbon dioxide emissions. This data comes from Mauna Kea in Hawaii, a beautiful place that I've actually climbed a number of times, and on top you've got all of these satellite observatories and a bunch of contraptions for collecting atmospheric data, including CO2. So we're gonna bring this in, we're gonna actually bring this in from statsmodels, which is nice, rename the columns to get everything into the right format, and go to town.

So fit our model, take a look at the monthly frequency. So if you're not a daily person, your business is not daily, then put in a capital W or a capital M or a capital A (confusingly, right, for annual) if you wanted to do your analysis on larger timeframes. We can include the history; this basically means, hey, take this model and try it out on historical data also. We will follow the same procedure here and we will forecast our future. Again, this is on the monthly level for 120 periods, basically ten years into the future from, I believe it was, 2003.

So this is what CO2 emissions look like. The trend is not very good. There's the yearly, interesting, right? We've got a peak around May, June, July and then a lull in the autumn months.

OK, last thing, and then we're gonna join this conga line, right? So
the reality of growth and change with data is that some of it is limited, right? For Facebook, their growth is limited. It's not very limited, you may think, wow, it's kind of unlimited growth, but it is limited by the population of the world, and, let's clarify, the population of the world that has access to the Internet. That's why they're actually putting these balloons up in the sky to give everybody Wi-Fi, so that they can all browse Facebook. But we don't want to use a linear time series forecast for this, because it's a line, right? We don't want to use a line. We want to use something that's very natural, the way we see things growing and tapering off asymptotically in nature, and that is a logistic growth curve. It's well defined by, or fits really nicely with, a sigmoid curve, if you guys are familiar with that. Anybody that's done logistic regression, remember the sigmoid curve is bounded between 0 and 1. In this case we naturally bound the sigmoid curve with a cap on the top, so this is saturation. For the city of Chicago, if we are building an app, this is the limit of the number of people that could theoretically use our application in the city of Chicago, Midwest, United States, you name it.

So that's what I did here, I set a cap. Here's a cool thing: because we can actually manually do this, and we just literally create a column in a data frame, we can change that cap over time. So as more and more people have access to the Internet, Facebook can just go in and algorithmically add one every single day, or some sort of function that just slowly increases that cap over time. So you can do that also.

So we do this, we include history equals false, in other words, hey, just really focus on future predictions. By the way, that doesn't negatively affect your model, it's just a different kind of view when you plot it. And that's what it looks like. So I wish this was the case, right, that there was some natural cap with CO2. It doesn't seem like there's a
negative feedback loop there, but if you do have situations where there's either negative feedback loops or caps, this is a great way to use that.

So the question again: hey, why should I use this thing? Maybe I'm familiar with ARIMA models, for example, I use statsmodels, or R with the stuff that already exists. Why would I use this? Again, it's built in Stan, so it's very fast. It usually fits much faster than most other approaches. Out of the box, its performance is typically on par with an ARIMA model that an expert has taken considerable time to [tune]. And all of this comes from that GAM foundation, the generalized additive model. It just de-limits the model in what it can do. So again, it's not for everything, but it does super great for weekly, or sorry, daily and annual seasonality, and I expect to see updates in the future also with the functionality there. So thank you guys, and I'd love to answer any questions you have.
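
For reference, the contrast the talk draws between linear regression and additive models can be written out as below. This is the usual textbook notation, not something shown in the talk: the smooth functions f_j and the symbols g, s, h are assumptions of this note, and the last line is the trend + seasonality + holidays + error decomposition as described in the Prophet documentation.

```latex
\begin{align*}
\text{linear regression:} \quad & y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon \\
\text{additive model:}    \quad & y = \beta_0 + f_1(x_1) + \dots + f_p(x_p) + \varepsilon \\
\text{Prophet:}           \quad & y(t) = g(t) + s(t) + h(t) + \varepsilon_t
\end{align*}
```

Each scalar coefficient is replaced by a fitted function of its input. Backfitting is the classic way of estimating the f_j in a general additive model; as far as the documentation describes it, Prophet itself is fitted through Stan.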
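
The basic workflow walked through in the talk (log transform, `ds`/`y` column names, fit, one-year future frame, predict) can be sketched roughly as follows. This is a minimal sketch, not the speaker's notebook: the helper names `to_prophet_rows` and `forecast_one_year` and the toy input are hypothetical, and the second function assumes `pandas` and the `prophet` package (imported as `fbprophet` in older releases) are installed.

```python
import math


def to_prophet_rows(daily_views):
    """Turn (date_string, view_count) pairs into Prophet-style records.

    Prophet expects the date column to be named 'ds' and the target 'y'.
    The log transform tames large right-skewed spikes (counts must be > 0).
    """
    return [{"ds": day, "y": math.log(views)} for day, views in daily_views]


def forecast_one_year(rows):
    """Fit Prophet and predict 365 days ahead (needs pandas and prophet)."""
    import pandas as pd
    from prophet import Prophet  # assumption: older installs use `fbprophet`

    df = pd.DataFrame(rows)
    df["ds"] = pd.to_datetime(df["ds"])

    model = Prophet()
    model.fit(df)

    # History plus 365 future days, so fitted values for the past come back too.
    future = model.make_future_dataframe(periods=365)
    forecast = model.predict(future)
    return model, forecast


# Hypothetical toy input; the talk used Peyton Manning's Wikipedia page views.
sample = [("2016-01-01", 1000), ("2016-01-02", 20000), ("2016-01-03", 1500)]
rows = to_prophet_rows(sample)
print(rows[0])
```

With a real history loaded, `forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]` gives the subset mentioned in the talk, and `model.plot(forecast)` and `model.plot_components(forecast)` draw the forecast and the trend/weekly/yearly panels. For the monthly example, `make_future_dataframe(periods=120, freq="M", include_history=False)` is the corresponding call; recent pandas versions may warn about the `"M"` alias.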
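
To make the lower window / upper window spillover concrete, here is a small sketch of which days a holiday ends up covering. It assumes the convention in the Prophet documentation, where the holiday is extended to every day from `lower_window` through `upper_window` around the date. Under that reading, a spill onto just July 3 and July 5 corresponds to -1 and +1, while -2 and +2 reaches from July 2 to July 6. The function name `affected_days` is hypothetical, and the rows at the end are illustrative only.

```python
from datetime import date, timedelta


def affected_days(holiday, lower_window, upper_window):
    """Days covered by a holiday: every offset from lower_window to
    upper_window (inclusive) around the holiday date."""
    return [holiday + timedelta(days=offset)
            for offset in range(lower_window, upper_window + 1)]


july4 = date(2017, 7, 4)
one_day_spill = affected_days(july4, -1, 1)  # July 3, 4 and 5
two_day_spill = affected_days(july4, -2, 2)  # July 2 through July 6

# The same idea as rows of a Prophet holidays table (columns: holiday, ds,
# lower_window, upper_window). Illustrative rows, not the talk's full list.
holiday_rows = [
    {"holiday": "playoff", "ds": "2016-01-17", "lower_window": 0, "upper_window": 1},
    {"holiday": "superbowl", "ds": "2016-02-07", "lower_window": 0, "upper_window": 1},
]
# With pandas and prophet installed, these would be passed as:
#   Prophet(holidays=pd.DataFrame(holiday_rows))
```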
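
The capped growth idea from the last part of the talk can be illustrated with a plain logistic curve. This is a simplified sketch: the numbers are hypothetical, and Prophet's own logistic trend is piecewise with changepoints rather than this single curve.

```python
import math


def logistic_trend(t, cap, k, m):
    """Logistic growth: approaches `cap` as t grows and equals cap / 2 at t == m."""
    return cap / (1.0 + math.exp(-k * (t - m)))


cap = 2_700_000  # hypothetical saturation level, e.g. potential users in a city
k = 0.05         # hypothetical growth rate
m = 100          # hypothetical midpoint, where growth is fastest

midpoint = logistic_trend(m, cap, k, m)
late = logistic_trend(1000, cap, k, m)
curve = [logistic_trend(t, cap, k, m) for t in range(0, 300, 50)]
```

In Prophet, the equivalent setup is a `cap` column on both the history and the future data frames together with `Prophet(growth="logistic")`. Because `cap` is just a column, it can hold a different value on each row, which is how the cap can be raised over time.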
Info
Channel: Chicago Python Users Group
Views: 45,704
Keywords: python, chicago
Id: 95-HMzxsghY
Length: 20min 36sec (1236 seconds)
Published: Fri Sep 08 2017