Nathaniel Cook - Forecasting Time Series Data at scale with the TICK stack

Captions
My name is Nathaniel Cook. Before I even get started on my talk: who was at the talk earlier this afternoon at McKinley on forecasting automation robustly? Okay, so some of you were at that talk. It turns out my talk is highly related to theirs, so I'll probably be referencing their talk a few times today, on the fly; we'll see how that goes. Totally unplanned, but I enjoyed their talk.

Today I'll be talking about forecasting time series data, specifically with the TICK stack (I'll explain what that is) and at scale (I'll explain what I mean by "at scale"). Like I said, I'm Nathaniel Cook. I work for a company called InfluxData, and we build a set of open source products that we call the TICK stack. It's a time series data processing platform; hence time series forecasting on a time series product.

So why am I here? A couple of months ago Facebook released their Prophet algorithm, which the earlier speakers talked a little bit about. I read up on it and thought, that's interesting, that's kind of fun. I tried it out, had a good experience, and I'm here to share that with you.

An overview of what we'll talk about today: first, a quick review of what I mean when I say time series data forecasting, really simple; then the challenges of scale, and what I mean when I say scale; then we'll introduce, in a little bit of depth, the Facebook Prophet algorithm and procedure and how they work. The majority of the talk will be focused on an example walkthrough, an experiment I did using their algorithm and procedure and how it turned out for me on some example data. Then we'll present some learnings and wrap up.

So what is time series forecasting? Well, assuming the past has any relation to the future, we can predict future values based off past values. The idea is that you develop a model (it doesn't matter what the model is) that says: based on these past values, what should my future values be? That's what I mean when I say time series data forecasting. Then you need to compute the accuracy, and the earlier talk gave us a nice huge list of a dozen different metrics you can use to compute the accuracy of your forecast. Today we'll focus on just one, the MAPE, mean absolute percentage error, but you could swap in any of the others for everything I talk about today. Finally, another method is to use baseline models: you compare your predictive model against other, simpler models to make sure it hasn't just gone totally crazy and done something weird.

So here's a simple hand-drawn time series. On the left you've got the red line, which is your raw data, and you predict it's going to do something like the blue line. The black tick represents the current time. Later, real time passes and you actually get the raw data for the period you predicted, and the two overlap because they cover the same period of time. You compare them and ask: how good was the forecast? This is where the different metrics come in. The MAPE metric is basically: you take the percentage error between the two at each point, take the mean across the entire interval, and you get a number where zero means you predicted perfectly, because you had zero error, and one means the values were a hundred percent off. You can even go above one if you get really, really bad error; we'll actually see an example of that today.
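To make the metric concrete, here's a minimal sketch of MAPE in Python (my own illustration, not code from the talk; the array names are made up):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error between two aligned series.

    0.0 means a perfect forecast; 1.0 means the predictions were off
    by 100% on average, and a really bad forecast can exceed 1.0.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))

# mape([100, 110, 120], [90, 115, 130]) ~= 0.076, i.e. ~7.6% mean error
```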
Okay, so what are the challenges of forecasting at scale? It's actually simple: it takes a lot of time to get a model to work right. Most of the time, when you put a model in production, it's one model doing something awesome, and you can spend a lot of time tuning it, optimizing it, and making that model work for you. In this case we have lots and lots of models, so we can't spend time on each one. We have to make sure, at a big scale, that all the models are individually working well, but we don't have time to sit there and babysit each one. So we need to design a workflow and a system where we minimize, or in fact eliminate, the time we spend on any one model.

This is where Facebook's Prophet procedure comes in. Today's talk isn't just about the technology; it's also about the people involved: the teams, the interactions, how communication happens, and a workflow that makes that communication effective and efficient. The Facebook paper on Prophet didn't just introduce a model or an algorithm; it also introduced a workflow, a procedure, to go with it. So we'll talk about that here.

First, what was the algorithm? It's actually really simple, and it was intentionally very simple: a generalized additive model. Your predicted value is the y on the left of the equation, and you have three terms that you just add together:

    y(t) = g(t) + s(t) + h(t)

So what are those three terms? The first is the growth term, g(t), which predicts the overall trend, up or down, of whatever series you're forecasting. The second is the seasonality term, s(t). The domain in which this algorithm tends to work well is business metrics: things that have daily, weekly, monthly, or yearly cycles, and the seasonality term learns those. And lastly you can add in a holiday term, h(t), which says: hey, on July 1st (or, July 4th) everybody's shooting off fireworks and nobody's doing any work, so we can expect something different on that day. So you can account for those kinds of things.
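For reference, here's roughly what fitting that model looks like with the Python package (a sketch; the library genuinely requires the ds/y column names, but the file name and parameter values are illustrative). The package shipped as fbprophet at the time of this talk; newer releases are named prophet:

```python
import pandas as pd
from fbprophet import Prophet  # `from prophet import Prophet` in newer releases

# Prophet expects a DataFrame with a datestamp column `ds` and a value column `y`.
df = pd.read_csv('repo_stars.csv')  # hypothetical history for one repo

m = Prophet(
    changepoint_prior_scale=0.05,  # how aggressively to hunt for trend changepoints
    interval_width=0.80,           # width of the returned confidence interval
)
m.fit(df)  # fits g(t), s(t), h(t) by optimization

future = m.make_future_dataframe(periods=90)  # 90 days past the end of history
forecast = m.predict(future)  # columns include yhat, yhat_lower, yhat_upper
```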
Okay, so that was the model, the algorithm. What's the workflow? Facebook labeled it "analyst in the loop." Basically, you put an analyst in there: someone who understands basic statistics but isn't necessarily a statistician, and who isn't necessarily a DevOps person either, running around setting up servers and code. It's just someone who understands the domain of the data and wants to make sure the models are working well. It's a four-step cycle. First, you create a model, in this case a model based on the Prophet algorithm. Second, you evaluate that model based on some metric; like I said, we'll use the MAPE metric today. Third, at the bottom, you surface all of the forecasts that have problems (we'll define what "have problems" means in a minute), and you bring them forward so you know those are the ones you need to spend time on, as opposed to all of them. And fourth, you visually inspect the model, look at it and go, ah, this is what's wrong with it, and you fix it; then you're back at the top. The analyst only needs to participate in the purple steps; the green steps are fully automated and happen in the background. You don't touch them or think about them, they just run. So the analysts simply spend their time working through the problematic models, fixing and tuning them, and you get this nice workflow where the problems surface on their own and you can spend your time just whacking the moles as they pop up. It sounds boring, but it's a lot better than not knowing where the mole is. Horrible analogy; anyway.

Okay, great. Now we're going to jump right into the example walkthrough. Before I do that, any questions so far? I'd like this to be a little more interactive, so if you've got questions, go ahead. (An audience member asks if I'm Canadian, presumably because I said July 1st earlier; no, I'm not Canadian.)

Great. So for this example I went and got a dataset to play with. I went to GitHub and found an awesome Python list. Most of you are probably familiar with these: people compile these lists of awesome things, and there's an awesome Python list out there with about 400 Python projects listed under different categories. I grabbed their GitHub stars, which we can use as a relative proxy for the popularity of a project, and it's nice time series data, so we're going to play with it.

The cool thing about grabbing an awesome list, as opposed to just grabbing the top 100 or 400 repos by stars, is that it's a very heterogeneous list of projects that do very different things, and that's going to be important in a minute. Some of these projects are really small; there's a dozen projects on there with only about 30 stars, but they're on the awesome list because they're awesome somehow, even if only some people agree. And there are some really large projects on the list as well, projects with 30,000-plus stars. The projects also span very different use cases: some are simple CLI tools that have nothing to do with data science and just happen to be written in Python, and some are things like NumPy, important Python projects that matter to us. So they're very, very diverse. My point in calling out the diversity of these projects is that you can expect different growth patterns from them: the time series growth of each project is not going to be the same; they'll have different characteristics and be unique in their own way, and we're going to learn about that. Likewise, we have new and old projects: some projects were there on day one of GitHub, so they have data going back to 2008, and some started earlier this year and only have a few months of data. So we've got this very heterogeneous set of projects, but the data is all the same in that it's just a simple growing curve of stars over time.
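The talk doesn't show the collection code, but star timestamps are available from the GitHub API using the star+json media type, which adds a starred_at field to each stargazer record; a minimal sketch (the pagination handling and repo names are illustrative):

```python
import requests

def star_times(owner, repo, token=None):
    """Yield the starred_at timestamp of every star on one repository."""
    url = 'https://api.github.com/repos/%s/%s/stargazers' % (owner, repo)
    # This media type adds `starred_at` to each stargazer record.
    headers = {'Accept': 'application/vnd.github.v3.star+json'}
    if token:
        headers['Authorization'] = 'token ' + token
    while url:
        resp = requests.get(url, headers=headers, params={'per_page': 100})
        resp.raise_for_status()
        for record in resp.json():
            yield record['starred_at']
        url = resp.links.get('next', {}).get('url')  # follow Link-header pagination

# e.g. timestamps = list(star_times('vinta', 'awesome-python'))
```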
Okay, so what are the goals of the example? We want to forecast the future stars of all 400 of these repositories; we want to be able to fix any forecast we set up initially that is broken or just really bad; and we want to do it in a repeatable manner such that it costs very little overhead, very little time and resources from an analyst or developer, so that you can do this at scale. Now, 400 is pretty small, but the idea is that these could be, say, different communities. The example Facebook uses is that they have lots of communities, think Facebook groups and things like that; each one has its own growth, and they're managing many, many of these. That gives a little insight into how this could be used on a bigger scale.

Okay, so to do this in a scalable manner we need a nice platform, or it would be nice to have one. This is where the TICK stack comes in. Let's take a minute and talk about the TICK stack. First, TICK is an acronym for the four products that make up the stack: Telegraf, InfluxDB, Chronograf, and Kapacitor. Telegraf is the first piece, the tiger up here on the slide. Telegraf is a collection agent; it's the thing that goes and gets your data. We're not using it in this example, per se, just because I'd already collected the data by scraping the GitHub APIs, but in a more real-time environment you need a way to ingest your data in a real-time way, and that's what Telegraf can do for you. The next piece in the stack is InfluxDB, and that's the database that stores the time series data. It's optimized for storing time series data, and it has a SQL-like query interface so you can store your data and query it back out, doing the kinds of operations you typically do on time series: aggregation, selection, et cetera. The next piece is Chronograf, up here on the slide. Chronograf is the visualization engine of the TICK stack: it lets you visualize and graph your data, and it's basically the GUI for the rest of the stack. And last is Kapacitor. I'm actually the developer of Kapacitor, and it's the real-time processing engine for the data. The database lets you write data in and query it back out with some basic aggregations; when you want to start running machine learning models or other complex things on your data, that's when you bring in Kapacitor to do the processing. As an analogy, and it's a very rough one, Kapacitor is kind of like Apache Spark in this sense, obviously built specifically for the time series use case and for this stack. Okay, so that's the TICK stack, and we're going to use it today to model all 400 of these GitHub repos and forecast their stars.

So how are we going to do that? First we just need to store the data in InfluxDB. That was relatively straightforward, "relatively" meaning all the data cleaning takes a little while, but I downloaded all the data from the GitHub API, rejiggered it, and stored it in InfluxDB. So now in the database we have a set of timestamps for when a user starred a repo, and, in InfluxDB terminology, we've tagged that data with the repository the star belongs to; a tag just means an indexed column on your data.
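A sketch of what that write looks like with the Python InfluxDB client; the database, measurement, and tag names here are my guesses at a reasonable schema, not the talk's actual one:

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='github')

# One point per star event, tagged (i.e. indexed) by repository.
points = [{
    'measurement': 'stars',
    'tags': {'project': 'vinta/awesome-python'},
    'time': '2017-07-01T00:00:00Z',
    'fields': {'value': 1},
}]
client.write_points(points)

# The SQL-like query language can then roll the events up into a curve:
#   SELECT cumulative_sum(count("value")) FROM "stars"
#   WHERE "project" = 'vinta/awesome-python' GROUP BY time(1d)
```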
Then we use Kapacitor. Kapacitor runs tasks; it's like a job scheduler that runs tasks and does work on your data. We're going to use Kapacitor tasks to do the automation steps of this workflow: the second and third steps are the automated ones, and Kapacitor runs those tasks for us, evaluating the models and surfacing problems with them. Then we can use Chronograf to inspect the models visually, just look at them; that's the fourth step.

Okay, so here's a graph, and hopefully it looks horrible to you, because it is horrible: this is a graph of all 400 repos and their stars. Very useful; great insights, right? I mean, you can pick out a few things. You can see there are about three projects way high above everyone else. You can see some exhibit really sharp growth periods, basically vertical, and then kind of linear or exponential growth after that. But for the most part you're not going to glean a whole lot from it. You could go look at the projects one by one, but that would take all day. So, not a very helpful graph, but there it is.

Step one: we need to create some tasks in Kapacitor to do our modeling and evaluation. The plan is to create three tasks that model and forecast the data. Two of them we'll label baseline models: really, really simple forecasting models. By simple I mean the mean one literally takes the mean of the data and says the future value is going to be the current mean. A very, very simple model, but it actually turns out to be quite useful later for sanity-checking our other models. The other is exponential smoothing, also known as Holt-Winters: a recursive algorithm where you take the past history, build up state on it, and predict out the future. Pretty straightforward, not very complex, not nearly as complex as, for example, the Prophet model, and it gives us another baseline, just to make sure things are within tolerance. (A quick sketch of both baselines follows at the end of this step.)

Then we have our Prophet model task. Kapacitor is its own system, and I'll talk a little about how it works in a second, but the Prophet model is written in Python, and Kapacitor is not written in Python (it's written in Go, if you care). So how do we talk to it? Kapacitor has a plug-in system, UDFs, user-defined functions, so we can call out to Python code and run the Prophet model from there, and it's efficient: it uses protocol buffers for the data transfer, et cetera, et cetera.

So we have these three tasks, and they're actually task templates, meaning we set up a template for the mean task, the Holt-Winters task, and the Prophet task, and then we instantiate one of those tasks for each of the GitHub repos. So we have roughly 1,200 tasks. The reason we're creating all these tasks — and it sounds like, wow, 1,200 tasks, why am I managing so many when I only had 400 repos to start with, why are we blowing this up — is that it gives us the ability to set different parameters per task, per repo. And it doesn't end up being difficult to manage, because you don't touch a task unless it has issues, and then you treat just that one. So again, the templating allows custom parameters per model, which we'll see in a second is useful. That's step one. I just used GNU parallel and a little CLI command and said boom, define all these tasks. And the way Kapacitor's template system works, any update to the template updates the rest of the tasks, so you never really end up interacting with the tasks after you define them.
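As promised, the gist of both baselines in plain Python (hand-rolled to show the idea; Kapacitor has its own built-in implementations of these):

```python
def mean_forecast(history, horizon):
    """Baseline 1: every future value is just the mean of the history."""
    mean = sum(history) / len(history)
    return [mean] * horizon

def holt_forecast(history, horizon, alpha=0.5, beta=0.1):
    """Baseline 2: Holt's double exponential smoothing (level + trend).

    A recursive pass builds up level/trend state from the history, then
    projects the trend forward. Full Holt-Winters adds a third, seasonal
    component on top of this.
    """
    level, trend = history[0], history[1] - history[0]
    for x in history[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (step + 1) * trend for step in range(horizon)]
```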
Okay, so, BAM, here's some code. It probably looks really unfamiliar to you; it's definitely not Python. Let me tell you what it is. Who here was at the Dask tutorial yesterday? Okay, a few of you. The way Dask does its computation is it models the work as a DAG, a directed acyclic graph. Kapacitor is the same: you describe a pipeline, like first you're going to sum the data, then take the mean, then do whatever, and it builds a DAG of your data flow and executes that DAG on the time series data flowing through it. The way you describe your DAG is through a DSL inside Kapacitor called TICKscript, and that's what you're looking at here. Each of the pipe characters (and the @ character, which just means it's a UDF; think of it as a pipe character too) represents an edge in that graph, where data is transferred from node to node, and the node itself is what we do at that step in the graph.

So here we're querying the data out of the database. We select the value; there's a little bit of weird formatting on the slide (there's supposed to be a space between FROM and the source, and between WHERE and the project), but you get the idea; this is pseudocode anyway, so don't expect it to actually run, and there's a little bit of boilerplate above and below that I omitted for brevity. We select our data, grab it over a certain historical amount of time, and do that every so often: these tasks run continuously, they're always scheduled, so this is just a process that's online and running. Then we group by dimensions, in this case the project. So we select all the data, group it by our list of projects, and shove it into the Prophet algorithm; that step calls out to the UDF and runs the Python code.

The Prophet algorithm has a couple of properties you can pass in, which I've exposed here on the UDF. You give it a bunch of historical data and you tell it how far into the future to predict. That's what this bit is: we've got a parameter to the entire script for how far we should forecast, and an interval the data points are divided into, which gives a count of how many periods into the future to forecast. Then the changepoint prior scale: the talk earlier today discussed using Bayesian inference to detect change points, and the Prophet algorithm uses a similar method to find change points, inflection points in growth, in the data. You give it a value from 0 to 1 that says how aggressively to look for change points, and you can pass that in. The Prophet algorithm also returns a confidence interval, so there's another value between 0 and 1, typically 80 or 90 percent, that says how wide you want those confidence intervals to be.
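Reconstructed from that description, the template looked roughly like this. This is a sketch in TICKscript syntax, but the UDF name and its option methods (forecast, changepointPriorScale, and so on) are whatever the Python UDF defined, so treat those as illustrative:

```
// Per-task parameters; instantiating the template fills these in per repo.
var horizon = 90d
var changepoint_prior_scale = 0.05

batch
    |query('SELECT "value" FROM "github"."autogen"."stars"')
        .period(365d)      // how much history to hand to the model
        .every(1d)         // re-run the forecast continuously, once a day
        .groupBy('project')
    @prophet()             // @ calls out to the Python user-defined function
        .forecast(horizon)
        .changepointPriorScale(changepoint_prior_scale)
        .intervalWidth(0.8)
    |influxDBOut()         // write the forecast back to the database
        .database('github')
        .measurement('forecast')
```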
So there's our template. We can instantiate it 400 times and give each instance different values, but by default we give all 400 the same defaults, and that works out for most of them.

Okay, step two: we need to evaluate these models, and we need to do it for all the historical data we have and for each of the different models. We've got the different projects and the different models (the mean, the Holt-Winters, and the Prophet models), so this is just a single task in Kapacitor where we group the data by both the project dimension and the model dimension and compare the forecasts with what actually happened. Obviously we hold out some of the data: we pass in, say, the first year of data, have it predict out a little bit of time, then go look at that little bit of time and see how well it did. We compute the error metric; again, we're talking about the mean absolute percentage error, so we're computing the MAPE metric.

Here's the task that computes it, and it's really simple. We've got our source data and our forecasted data, and we join them together; they're already grouped by project and model, so we join them on those dimensions. Then we compute the difference and divide by the current value. In this case we don't have problems with values of zero, because there's always at least one star or the data doesn't exist, so we never have a divide-by-zero issue, and we get a percentage error. But we want the mean percentage error, so we sum and count all of the errors in the window of time we were given, bring those back together, and compute the mean, and BAM, we've got our MAPE value, and we write it back to the database. So now we have a database with the source data, the forecasted data, and the errors, all of them over time, so we can track them over time.

Step three: we need to surface the problematic models. This is really easy now that the data is sitting in a database; it's just a simple SELECT query: go get me the top 10 projects with really high error over the recent data where the model is Prophet, our worst performers. If you want, you can do the inverse and get your best performing models. Really simple: you've got the set of all your models and their errors over time in the database, and you can just query any of it back out.

You can even go a step further than querying them into a table: you can plot a distribution. What I'm plotting here is the accuracy: the x-axis is the percentage error on a log scale, and the y-axis is a normalized histogram of the distribution. You can see the mean model, here in red, didn't do so great, but overall its errors are somewhere between 0.01 and 0.1, so between one percent and ten percent. Not great, but it gives you a good baseline. Then the Holt-Winters model actually has this nice slope over here, where some of its forecasts are a lot better than the mean model's, but overall it's still not that much better, and in some cases quite a bit worse; still somewhere between one and ten percent error. Then we have the Prophet model, which on the whole gets a lot more of its distribution down toward the smaller errors, but it's got some that are really bad: this one over here has two hundred fifty percent error. It just totally blew it; the model's default values were horrible for it and didn't work at all. So hey, that's our problem child; it's going to be the first one that comes back on that SELECT statement, and we're going to go fix it.
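Both automated steps, the join plus MAPE and the top-10 query, happen inside the stack in the talk, but the same evaluation is easy to picture in pandas terms (a sketch; the frame and column names are mine):

```python
import pandas as pd

# Hypothetical exports: actual stars and forecasted stars over time.
actual = pd.read_csv('actual.csv', parse_dates=['time'])      # project, time, value
forecast = pd.read_csv('forecast.csv', parse_dates=['time'])  # project, model, time, value

# Join on the group dimensions, like Kapacitor's join node.
joined = actual.merge(forecast, on=['project', 'time'],
                      suffixes=('_actual', '_forecast'))

# Percentage error per point; stars are always >= 1, so no divide-by-zero.
joined['ape'] = ((joined['value_actual'] - joined['value_forecast']).abs()
                 / joined['value_actual'])

# MAPE per project and model, then surface the ten worst Prophet forecasts.
mape = joined.groupby(['project', 'model'])['ape'].mean()
print(mape.xs('prophet', level='model').sort_values(ascending=False).head(10))
```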
Okay, so here's a graph of how you would fix that model. The top panel is the raw data plotted with the predicted values for all three models, and the bottom panel is the error; you can see the scale over there is hundreds of percent error. This blue line is the Prophet model: it just blew up and got a ton of error, and it over-predicts so drastically that the growth curve of the entire project looks flat in comparison. So this model horribly predicted the data. This is for lptrace, which is an strace-for-Python type of thing. We've got a table here (I'm only showing the first entry, but it has all 10 entries) of the worst performers, and this is the one on top; we've selected it and we're reviewing that project's data. So: we've got a problem, we've visually inspected the data, we understand how it's broken. How do we fix it?

Facebook's Prophet paper gives simple guidelines, and this is the whole point of the setup and of putting all this together: you get a very simple checklist, three questions you can ask yourself, with proposed solutions, so the process is really quick and simple. First question: does the model have large errors compared to the baseline models? Here, that was definitely the issue (we'll look back at it in a second). In that case the problem model is probably misconfigured: update the model, fix its parameters. The second possibility is that all the models went bad, so even the simple Holt-Winters and mean models went bad too. In that case there's likely some outlier in the data throwing all the models off; go find it and remove it. That's really simple to do. Third case: the errors were historically good, historically very low, but they just recently spiked. That probably means there was a change in the underlying process generating the data, and we need to mark that as a change point. The Prophet algorithm will auto-detect change points, but you can also manually specify them: you can say, hey, something happened, the data is different now, think of it differently, here's the time it changed. So you get these three cases, and there may be other edge cases, but this covers a very wide breadth of the ways your forecasting can fail, and then you update.

So looking back real quick: the mean and Holt-Winters models have very small errors and basically follow the raw data, which is in red. The Prophet model is the only one having issues, so we're definitely in case one, where the Prophet model is misconfigured and we need to update it. I went and looked at this one. It also has a very small amount of data: the project started in October of last year, so it doesn't have a long history to predict from, and that was my hypothesis for why it was getting messed up. You can clearly see it thinks there are really drastic change points, and it's really over-predicting. So that initial parameter, the changepoint prior scale: I just pulled it down. Nope, you are not allowed to be that flexible in deciding where the change points are; let's just pull that down.
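In Prophet terms that fix is a one-parameter change; a sketch of re-tuning the one bad model (parameter values are illustrative):

```python
from fbprophet import Prophet

# Case 1 fix: the model was over-fitting changepoints on a short history,
# so make the trend less flexible than the default.
m = Prophet(changepoint_prior_scale=0.01)

# Case 3 fix, for comparison: if the underlying process genuinely changed,
# tell the model where, instead of letting it hunt on its own:
#   m = Prophet(changepoints=['2017-03-01'])

m.fit(df)  # df: this one project's history, columns ds and y
```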
Okay, so pull it down, and here's the updated model. BAM. The model still isn't great, but at least it's not 250 percent error; now we're down to about 5 percent. That was a ten-second change, and then I kicked off the task to rerun the model, which took another 30 seconds, and bam, we fixed this one. Our list of worst performers now has a new leader sitting at 7%, so we know our worst performer is at 7% error. Still not great, but it's not 250%, and we can now go tackle that one, and that's only going to take a minute or two, hopefully.

So that's the general idea: you wrap around and come back to step one. You start with a bunch of data, model all of it, compare against baselines, compare against whatever metric you want (the mean absolute percentage error or any of the others), evaluate those error values, and find the bad ones. You could even do multiple metrics if you wanted: you could have multiple tables down here, not just the MAPE value. Then you get this simple workflow where you've got a table of really badly performing models, you iterate on them and fix them, you keep the upper bound on your worst performers low, you don't end up spending a ton of time managing them, and you can trust that the rest of your models are doing well. Here's the final distribution of the Prophet model; the Holt-Winters one and the mean one stayed the same, since I didn't need to change those. That point that was way out here at 250% is gone, back down in the middle somewhere. It would be nice to take everybody above 10% and shove them back down too, but I didn't spend the time to do that; the point is you can now start pushing your distribution down.

Okay, so, learnings, and where we're at. We used the TICK stack: the database, the processing engine, and the visualization engine. Some of my screenshots are actually from Grafana, just because I was messing with stuff there; if you're familiar with Grafana, Chronograf is very similar, so you'd probably recognize some of my screenshots from Grafana. But anyway, that doesn't matter: use a visualization engine.

Forecasting at scale is about reducing the cost per forecast, making that overhead go way down so you don't have to worry about it, and you do that by creating a simple workflow with good tooling that lets the analyst not worry about the complexities of deploying or running the model. That's what the TICK stack provides: a platform on which you can run many, many models, track and rate those models over time, watch how they're performing, and surface the problems.

There are also a couple of really cool extensions here. Kapacitor actually has its origins in DevOps, meaning it was first built as an alerting platform: hey, on these servers the CPU spiked, alert me about the spike. Which means Kapacitor has built into it all the systems and tools to alert you proactively about issues. What that means here is that you don't even have to go to a Grafana or Chronograf dashboard: if one of your models starts to go haywire, you just get paged on your phone, and then you go look at it. You don't spend your day worrying about this thing; if you're getting silence, things are good, and if you hear something, you go look at it.
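A sketch of what that paging looks like in TICKscript, hung off the measurement the evaluation task writes to (the measurement name, thresholds, and handler are placeholders):

```
// Alert on forecast quality instead of eyeballing dashboards.
stream
    |from()
        .measurement('forecast_error')
        .groupBy('project', 'model')
    |alert()
        .warn(lambda: "mape" > 0.10)  // worth a look
        .crit(lambda: "mape" > 0.50)  // the forecast has gone haywire
        .slack()                      // or .pagerDuty(), .email(), ...
```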
And then it can also integrate on the other side of things. For example, there was a talk here before about using Kubernetes to scale a large system: Kapacitor can talk to Kubernetes and scale out your pods, et cetera, autoscaling based on these metrics. So if you're forecasting your data, you could forecast the utilization of your system and autoscale ahead of demand: instead of being reactive to demand, you could proactively autoscale. That's where the power of the TICK stack really comes in: it's got these composable bits, four nicely decomposed products that interact to collect, store, visualize, and process your data, and they integrate with so many different systems that you can put together a big-picture solution. On one hand you can use it to analyze your models and ensure they're staying accurate, and then, in the same system, you can take proactive action based on those models. Here are some resources from the talk today: the link to the awesome list, the link to the GitHub Prophet paper, and influxdata.com if you want to see more about the TICK stack and learn about it. That's what I have today. Thank you. Questions?

Q: You said the equation was growth plus seasonality plus holidays, and you talked about the change points it detected and how it got off by 250%. How does it actually predict the graph? And secondly, how much data do you need to start using something like this?

A: Yeah, I kind of skipped that part. Prophet has that simple general additive form, the growth term, the seasonal term, and the holiday term, added together, and it does an optimization. L-BFGS is the acronym; I don't remember offhand what it stands for. It takes the data and fits it to those parameters using that optimization, and it can handle constraints and things. Look at the Facebook Prophet paper; there's a ton of math behind it that I skipped over because it wasn't the important part for this talk. And it fits pretty quickly, so it doesn't need a ton of data. I only have one data point per day for this stars data, and that one project was short, like four months, so a couple hundred points, not too many. It can predict with even less data if you're trying to predict less time into the future. Okay, thank you. Other questions?

Q: Does the Prophet model learn across the trend lines of time series data? Like, does it try to learn whether one awesome project is similar to another and predict the time series according to that, or is it just per single series?

A: It's just per single series; there are no cross-correlations or anything. You can do that kind of thing in something like the TICK stack, which can start doing those correlations for you, but the Prophet algorithm itself is simply time series forecasting for a single series.

Q: Just a question on Kapacitor: is it tightly integrated with InfluxDB, or is it a separate processing framework?

A: Kapacitor is a separate process. It's tightly integrated with InfluxDB in that it can consume from InfluxDB, but it can also consume data from other sources. It has an API where you write data to it, because it does stream processing, or it can go query the data.
So either way you want to do it: it can have the data streamed to it live, or it can query the data back out of InfluxDB, and it can run independently of InfluxDB if you want it to, that kind of thing.

Q: Can it do aggregations as well?

A: Absolutely, yeah. I showed you some of the more complex features of Kapacitor, where it calls out to a Python model for the Prophet algorithm, but it can do a lot of basic things too, like simple aggregations by dimension and time, those kinds of things.

Well, thank you very much. [Applause]
Info
Channel: PyData
Views: 9,701
Rating: 4.879518 out of 5
Id: raEyZEryC0k
Length: 37min 14sec (2234 seconds)
Published: Mon Jul 24 2017