Time Series with Driverless AI - Marios Michailidis and Mathias Müller - H2O AI World London 2018

Video Statistics and Information

Captions
Thank you for coming. I know it's very late, so we'll be as quick as possible; hopefully you will find it useful. I would like to discuss something which is very dear to us: the time series recipe we have designed and implemented with my colleague and co-partner on Kaggle. For people that don't know my background, I used to be ranked number one on Kaggle, and Mathias used to be ranked 4th. This is a snapshot of our profiles; it may not be very beautiful, but we have actually participated together in more than 150 competitions, which is quite a lot. We spent a lot of time trying to optimize for different problems, we spent quite a lot of time on time series, and we tried to take that knowledge and feed it in. It is also a great opportunity to thank Mathias again, because I don't think I would have done it without his help, so thank you very much.

Although I believe most of you have already seen how Driverless AI works, very quickly: it's very simple. You have some input data, you have some variables, and you have a target which you want to predict, and for time series the setup is actually the same, so you try to predict this target variable. Then you define an objective function, something you want to optimize, let's say minimize some type of error or maximize some form of accuracy. Once you have defined this metric, you just allocate some resources, you put some time constraints, and, based on the hardware available on your computer, Driverless AI will run for a certain number of iterations and then start giving you outputs. These outputs may be some visualizations about the data, and the interesting thing about the visualization is that it capitalizes on patterns which are interesting, so it's not going to show you everything; it's going to show you graphs which have something to tell. It automates the feature engineering process and feature selection, and that, as in many other problems, is extremely important in time series too; the feature engineering is slightly different from the one you've seen so far. Obviously there is an automated selection of models and tuning of the hyperparameters, and then we include a machine learning interpretability module, which is essentially an explanation, often visual or with other means, that tries to tell you why the model is giving the predictions it does. Finally, you can export this whole pipeline into a single, for example, Python package and you can implement it all in one go.

So what is a time series problem? This is a bit repetitive with what I said yesterday, but let me go through it quickly. It's essentially when the target variable demonstrates a significant, important relationship with time. That relationship might be very straightforward, like the one you see here, something very linear, so the more time goes by the more sales increase or decrease. Or it could be something very seasonal or nonlinear, something that goes up and down through time, which might be a bit more tricky. But more realistically it will quite often look something like the graph I'll show you now.

So that we are all on the same page: our recipe capitalizes on trying to identify the time groups which are embedded within your data. Sometimes you get a dataset which might have multiple values for the same dates, and that's because you have different overlapping groups, for example different stores or different products that show sales for the same day. It is extremely important to be able to clean this up and clearly define these groups, because then we can extract features which are specific to each one of these essentially unique time series. Obviously, as you can see, all the groups show slightly different behavior, but at the same time we can find common ground to combine information from all these time series in order to improve the ones which have slightly less information.
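As an illustration of what these time groups look like in practice, here is a minimal pandas sketch (not Driverless AI code) that splits a panel dataset into its individual time series; the column names date, store, product and sales are hypothetical.

```python
import pandas as pd

# Hypothetical panel data: two stores selling the same product, so each date
# appears more than once until we group by the time-group columns.
df = pd.DataFrame({
    "date":    pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"]),
    "store":   [1, 2, 1, 2],
    "product": ["A", "A", "A", "A"],
    "sales":   [10.0, 7.0, 12.0, 6.0],
})

group_cols = ["store", "product"]            # the time-group columns
for key, part in df.groupby(group_cols):
    # each `part` is one clean time series with a single value per date
    ts = part.sort_values("date").set_index("date")["sales"]
    print(key, ts.to_dict())
```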
Our general modeling framework takes into account certain hyperparameters. We assume that the training data span over a period of time, which you can think of as some time units; those time units could be days, minutes, seconds, weeks, years or months, it doesn't matter. In this example it could be months, so the training data spans over a period of, let's say, nine months. Then quite often you have a gap period, a period where you don't have access to the data, and after this you have the period that you try to predict; this is your test period. Sometimes you do not have that gap in the data, but our recipe is comfortable taking into account that there will be a gap of information between the period where you have data and the period where you actually try to predict. The whole period you try to predict is essentially the eleventh and twelfth month in this example, and we call it the forecast horizon. It is important to be able to define this forecast horizon, because not all points in the future are equal: for the eleventh month we can see back two months, since we have available data from month nine, but for month twelve the most recent available data is three months back. We don't know what happened in eleven, we don't know what happened in ten, but we do know what happened in nine. So it is important to specify this horizon so that the model can optimize for it and pick for you the correct ranges to find the right features and optimize for your problem.

The way we construct our data is essentially that we take the test window and try to replicate it within our data: we take the most recent window with the same forecast horizon, in this case two months, we apply a gap, and then we start generating features that look into the past from before that point. So essentially we try to replicate what you try to predict, and we do that multiple times, because we want to make certain that we optimize and pick the right models that come as close as possible to what the models are really being tested on.

We have different validation schemas that can help us achieve that goal of getting a very generalizable model, and here I've just picked a time series from a specific store that shows its sales. A very typical way, the simplest actually, to construct the validation is to just take the most recent window that has the same size as the test data and use the remaining data for training. Again, here you make the assumption that the most recent data will be closest to the actual test data, because they are the nearest in time. A slightly different approach, which may be more robust, is to have rolling windows of adjusting size from the training data. The first window is actually the same as the one used before, but then you keep rolling your validation window back by equal sizes, each time using less training data but applying exactly the same feature engineering approach. A model that is able to achieve good results in all these windows, using different and less training data each time while still giving robust results, is very likely to be a very generalizable and strong model.
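A minimal sketch of this kind of rolling validation, assuming a simple integer time index; this is my own illustration of the idea, not the Driverless AI implementation, and the parameter names (horizon, gap, n_folds) are chosen for clarity.

```python
import numpy as np

def rolling_time_splits(n_periods, horizon, gap, n_folds):
    """Yield (train_idx, valid_idx) pairs over 0..n_periods-1, newest fold first.

    Each validation window has the same length as the forecast horizon, sits
    `gap` periods after the end of its training data, and rolls backwards.
    """
    splits = []
    valid_end = n_periods                      # exclusive end of the validation window
    for _ in range(n_folds):
        valid_start = valid_end - horizon
        train_end = valid_start - gap
        if train_end <= 0:
            break                              # not enough history left
        splits.append((np.arange(0, train_end), np.arange(valid_start, valid_end)))
        valid_end = valid_start                # roll the window backwards in time
    return splits

# e.g. 12 months of data, a 2-month horizon after a 1-month gap, 3 folds
for train_idx, valid_idx in rolling_time_splits(12, horizon=2, gap=1, n_folds=3):
    print("train:", train_idx, "valid:", valid_idx)
```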
Again, a different way to do this is to use equal-sized rolling windows: not using all available information for training, but constraining it so that the training window also has a fixed, equal size each time, and you keep moving both windows forward. It is similar to the previous one, but the training size is adjusted to match the validation setup. Another approach is to keep taking different random validation and training windows, which are obviously still sequential, in order to make a model that is robust against any timeframe. We use these different approaches in Driverless and try to pick the best one depending on the situation: the type of your data, how far your test period spans, and some other parameters. And now it's good to move into the feature engineering, which is a very exciting part, and my colleague Mathias is going to take over regarding the feature engineering.

So what does Driverless AI do? That's basically what I'm talking about. First of all we have some basic feature engineering: if you have a date, we extract simple things like what's the day, what's the month, what's the year. We also have, as a new addition, a feature that extracts whether it's a holiday or not, because on certain types of data that's really valuable information, and it's also information we can create for our test datasets, so that's a good feature.

Another thing we create is lags. Our lag-based recipe is of course heavily based on creating lags and creating features on past information. We never look into the future, but we look a lot into the past and extract lags. For instance, if you have date and sales, we look back: what were the sales like last day, two days ago, and so on. With those lags we can also create moving averages; here is a simple example where you just take the mean of two lags, but in Driverless AI we have weighted moving averages for various window sizes. We can also create lags on differenced data, so the pipeline takes the lags, creates some differences, and only after that creates the averages. We can do a lot of aggregations of lags, such as means, standard deviations, sums and counts of the past. You can create interactions of lags; the simplest example would be to just subtract two specific ones, but other functions are included, not only subtraction. And another thing is regression on lags: you create a bunch of lags, fit a linear model on them, and take something like the slope and the intercept as new features for the current point.
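A minimal sketch of these lag-based transformations on a single toy series; this is my own illustration, not the internal recipe, and the data, lag sizes and column names are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# toy series with a weekly pattern plus noise
sales = pd.Series(100 + 10 * np.sin(np.arange(120) * 2 * np.pi / 7) + rng.normal(0, 2, 120))

df = pd.DataFrame({"sales": sales})
for lag in (1, 2, 7):
    df[f"lag_{lag}"] = df["sales"].shift(lag)           # only look into the past
df["ma_lag_1_2"] = df[["lag_1", "lag_2"]].mean(axis=1)  # moving average of two lags
df["diff_lag_1_7"] = df["lag_1"] - df["lag_7"]          # simple lag interaction

# "regression on lags": fit a line over the last few known values and keep its
# slope and intercept as features describing the recent trend
recent = df["sales"].iloc[-7:].to_numpy()
slope, intercept = np.polyfit(np.arange(len(recent)), recent, 1)
print(df.tail(3).round(2))
print("recent trend slope:", round(slope, 3), "intercept:", round(intercept, 2))
```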
One question that we often get asked is: okay, how do we create the candidates for all the lag sizes, how do we choose which lag sizes to look at? The first method is that we create a ranking based on autocorrelation: we compute the autocorrelation for each lag size which is possible, and we know what is possible because of the length of the training dataset and because eventually we have the test dataset, or the user inputs a specific forecast horizon. So we have this information, we can compute it for every possible value and rank the lag sizes by autocorrelation. Another thing is that we have predefined intervals based on the estimated data frequency. When you put a time series into Driverless, it identifies the main frequency of the timestamps of the individual time series, and based on that we have simple lookup tables of intervals: for daily data it makes sense to look at every week or every two weeks and so on, and the same holds for weekly data, where you can say, okay, maybe let's look at every two weeks or every four weeks. Based on those predefined intervals we create subsets and aggregates and simply apply all the other feature engineering and summaries on those subsets, and so we can pick up seasonal patterns, for instance. That's just an indicator we have set for many different frequencies, and if you have a very uncommon frequency, like maybe gigahertz or something, then it will just fall back to all possible lag sizes and the genetic algorithm will still try to figure out what good lag sizes are. So we can use the autocorrelations, or apply several strategies, or combine them, and the genetic algorithm figures out what's useful.

Another thing is that sometimes, if you create lags, they are too powerful: in general we have much more information in the training data than we have for validation or test. The first strategy to counter overfitting is to lower-bound the considered lag sizes. Let's say we have a forecast horizon of four weeks and we want to create a lag of one week; we can only do that for the very first week in our test set, so 75 percent of our lags in the test set would be NaN, no information. What we can do is say, okay, that lag size is too small, don't create it at all. That's the first strategy. Another one is: okay, I have now created the lags, I have one for my test set with 75 percent NaN, and I do the same process for my training dataset, but because it is much longer, a lot more rows can actually look one week back, so we have much fewer NaNs there, which is a bad thing because train and test no longer match. So we do dropout and align the frequencies, so that we have exactly the same relative frequency of non-available information in train as in test. And the third strategy is to do target binning before we lag: instead of lagging the actual target we lag binned versions of it, and that decreases the possible model splits GBM can perform. Currently Driverless heavily relies on GBM, also for time series, and that's why we have such a strategy implemented.

Another quick note: we are working on MLI (machine learning interpretability) for time series, and that's especially helpful if you want to explain your predictions. You can click on your prediction, you see the actuals as well, so you can compare how good it is, and then you see the Shapley values of those predictions, so you can actually see what has driven this prediction: was it a holiday, was it a specific lag, something like that. In the new version 1.4 it's an alpha version, it's in a really early state, but you can already have a look.
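To make the lag-size selection and the lower-bound strategy above concrete, here is a minimal sketch (my own illustration, not Driverless AI internals) that ranks candidate lag sizes by autocorrelation and shows why a lag shorter than the forecast horizon leaves mostly NaN values in the test window; all numbers and names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# toy weekly series with a 4-week seasonal pattern
history = pd.Series(100 + 10 * np.sin(np.arange(52) * 2 * np.pi / 4) + rng.normal(0, 1, 52))

# Rank candidate lag sizes by autocorrelation (higher absolute value = more promising)
candidates = range(1, 13)
ranking = sorted(candidates, key=lambda k: -abs(history.autocorr(lag=k)))
print("lag sizes ranked by autocorrelation:", ranking[:5])

# Lower-bound strategy: with a 4-week horizon, a lag shorter than the horizon
# is mostly NaN inside the test window, so it can simply be dropped.
horizon = 4
full = pd.concat([history, pd.Series([np.nan] * horizon)], ignore_index=True)
for lag in (1, 2, 4, 8):
    nan_share = full.shift(lag).iloc[-horizon:].isna().mean()
    print(f"lag={lag}: {nan_share:.0%} NaN in the test window, kept={lag >= horizon}")
```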
I want to finish my talk with the roadmap, because there is a lot we plan for the future. One is data augmentation at time of prediction. What it actually means is: consider you have a dataset with a lot of stores, for whatever reason you have millions of them, and you want to train on only a subset of them, but you still want to predict all of them. The plan is that you can provide that data at prediction time, data that may not have been part of the training, and then you can still forecast stores which haven't been seen yet in train.

Another one is signal classification; that's actually another problem domain, if you want, where instead of having a label for each time point you have one label for each time group. In this example it would be groups A and B, where all time points of a group share the same target, so the group per se has a target; let's say A is a bad signal and B is a good signal. In order to make a model out of it, we have to transform the data and create features on each time series individually: let's say we go into the frequency domain and compute some features there, do sliding-window aggregates, count zero crossings, extract some peak parameters and so on. There is a lot you can do there, it's a big domain, and I think that will also be really useful for some users.

Another one is that we will add more algorithms, like Prophet from Facebook. We haven't implemented it yet, we're just waiting for the license to change; that will happen soon, hopefully, and then we can put it out. And also some classical methods, like ARIMA for instance, so more traditional approaches, because currently, as I mentioned, we rely on gradient boosting, which is fine for a lot of datasets but not optimal for all, so this just brings them together. And if you have several models, we can of course also do ensembling, we are always a fan of that, so that will be an option as well.

And finally, iterative learning. It's another way to deal with the problem that, if you have long forecast horizons and you train only one holistic model, you cannot create enough lags, or short enough lags, for the long forecast horizon. What we can do instead is train iterative models that only predict one step ahead and then use the predictions as lags, so predictions can enhance the predictions up to the very last point in time. And there is much more; it's always changing, we are listening to customer feedback and trying to figure out what's actually needed out there, so if you have any suggestions, if you want anything, tell us. And yeah, thank you.
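The iterative-learning idea above, predicting one step ahead and feeding each prediction back in as a lag, can be sketched as follows; this is a minimal, hypothetical illustration (the NaiveModel stand-in is not part of Driverless AI).

```python
import numpy as np

class NaiveModel:
    """Stand-in one-step model: predicts the mean of the last three values."""
    def predict(self, values):
        return float(np.mean(values[-3:]))

def recursive_forecast(history, model, horizon):
    """Forecast `horizon` steps by repeatedly predicting one step ahead."""
    values = list(history)
    forecasts = []
    for _ in range(horizon):
        y_hat = model.predict(values)   # predict a single step ahead
        forecasts.append(y_hat)
        values.append(y_hat)            # the prediction becomes a lag for the next step
    return forecasts

print(recursive_forecast([10.0, 11.0, 12.0, 13.0], NaiveModel(), horizon=4))
```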
Thank you, Mathias. A few questions came up.

"Do you reverse temporal ordering, time based?" I'm guessing, as in using the future to predict the past. That's very interesting, it actually sometimes works, but no, we don't do it currently; it might be a bit cheeky, but it is an interesting concept.

"Can you specify the gap and horizon in Driverless AI manually?" Yes, you can, but if you provide the test dataset they will be picked up automatically for you; you still have the option to set them yourself.

"Why is there a gap period at all? Why not use all the data available of the past to predict the future?" That's interesting, it was asked yesterday too. That's just because it happens that sometimes you don't have all the latest data available. If you have it, then yes, you don't need a gap, and it is always better not to have one, but sometimes that's not the case: you don't have the latest data available at prediction time. That's why you need to incorporate that gap in your data, so the model learns to expect that you might have some recent missing information. It just happens; it might be expensive or costly to have the most updated data available at scoring time.

"Does the model understand the meaning of months, or are these just column names?" Yes, it does understand. I'm not sure actually what you mean here, maybe you can find me afterwards, but from a date you can derive this attribute and understand the concept.

"Do you also use cross-sectional regressions to generate features for panel data?" Okay, that's an interesting one; the short answer is no. I mean, we essentially use panel regressions to generate features, and we take some features from these panel regressions to use as features, but I wouldn't say we strictly take certain periods to run every such regression. It may be good to find me afterwards, actually I'm not 100 percent certain about what you mean here.

"Can you turn the music off in the room next to us?" I wish I had this power.

"Are you going to add LSTMs to Driverless AI at some time?" Yes, it is actually on the roadmap, it will be added sooner or later.

"How much data do you need to create a reasonable time series forecast model? For example, is one year of daily data enough?" Very tough question, it depends on the problem. Generally it's always good to have your observations spanning into the thousands, but at the end of the day you have to do the best that you can with what you have, so if it is one year, then one year; I wouldn't say it's ideal, but it works. It's tough to comment on this because it depends on the problem.

"How do you account for seasonality within time series forecasts?" We hope we capture this by selecting the right lag sizes, and we also introduce some other variables in the form of, you know, what day of the month it is, whether it is a holiday, so we have some features which may help us point out periods where we might expect some outburst in the target variable.

"In your example the SKUs or stores you mentioned have history; how about new SKUs and stores with no history, will the forecast be zero?" No. The way we generate lags, we generate them for all possible groups, but we also generate them for coarser groups, so we do create features which may be the overall mean, or the overall mean of an SKU, or the overall mean of a store and SKU, so at the very worst that prediction will have as a feature the overall mean and not just a zero. We also take into account other features, as I said, which might be a bit more static, like the day of the week or the month, or holiday information, which can be available when you try to predict into the future, so your predictions will be saved by these features too.

I'm a bit cautious about the time, so it's good if we stop it here, and you can always find me afterwards for the remaining questions, if that makes sense. Thank you very much for your time.
Info
Channel: H2O.ai
Views: 2,193
Rating: 5 out of 5
Id: EGVY7-Spv8E
Length: 25min 18sec (1518 seconds)
Published: Tue Nov 06 2018