PyCon.DE 2017 Nils Braun - Time series feature extraction with tsfresh - "get rich or die overfitting"

Captions
Nils did his PhD in physics here in Karlsruhe, and he also worked for Blue Yonder. He will present us a tool called tsfresh for time series feature extraction, and with this tool you can either get rich or die overfitting. So give a warm welcome to Nils Braun.

Yeah, thank you for this nice introduction. So, I hope you are all here because you want to get rich? Yeah, of course. First of all, I have to admit you will not learn how to get rich during this talk, so you could leave now; no, it's fine. But you can also stay and learn the tools you need, the ingredients, to extract features from time series, and this will then lead you towards a way to get rich. OK.

So I will talk about time series. What is a time series? A time series is just some value recorded over time. For example, you can think of recording my heart rate during this talk, so you know when it peaks. And you can think of other things: for example, you have a device in the IoT business and you record its temperature over time. The question then is: how can you use this time series, this recorded value over time, for machine learning? There are quite a few different domains your time series can come from, for example the IoT, precision medicine, connected cars, or even physics. What you want to do is feed this time series into your machine learning model, for example to predict whether the device where you measured the temperature will break within the next day or the next few days. But you cannot just feed the raw time series into a machine learning model, because we have a dynamic problem here, not a static one: the model would, for example, have to learn that a peak after 5 minutes is the same as a peak after 10 minutes, because it is the maximum temperature itself that breaks the device.

So what we do is map the structured input of the time series into a lower-dimensional space. What does this mean? We extract features. OK, so what is a feature? A feature is a value that you can calculate from your time series. A very simple one is the global maximum, which you can see here for this x. So this would be a feature, but there are other features like the median or the mean of the time series, the global minimum, and so forth; for example, you can count the number of peaks in the time series. You can think of each of these features as a Python function: you feed in your time series as an array and you get back one value.

With these features we can now feed our data science or machine learning model and predict the target we want. The next step in this process is of course to automate the feature extraction. Automating the feature extraction not only leads to a faster and easier feature engineering process; you also do not need any knowledge of signal processing libraries outside the domain you are working in. You only need your own domain-specific knowledge, and you rely on people with domain-specific knowledge in other domains having already implemented these features the correct way. Automated feature extraction also plays a crucial role in the whole stack of automated data analytics, for example live data analytics, streaming data, or data analysis as a service, and it helps you to efficiently reduce the sample size if you want to transport your data from your sensors to your computing center.
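As described above, each feature extractor is just a function from an array to a single number. A minimal sketch in plain NumPy/SciPy (the function names here are illustrative, not tsfresh's actual extractor names):

```python
import numpy as np
from scipy.signal import find_peaks

def global_maximum(x: np.ndarray) -> float:
    # One time series in, one number out: that is all a feature is.
    return float(np.max(x))

def number_of_peaks(x: np.ndarray) -> float:
    # Count the local maxima in the series.
    peaks, _ = find_peaks(x)
    return float(len(peaks))

temperature = np.array([20.1, 20.3, 25.7, 21.0, 20.2, 26.5, 20.0])
features = {
    "temperature__maximum": global_maximum(temperature),
    "temperature__number_of_peaks": number_of_peaks(temperature),
}
```

tsfresh's value is that it ships hundreds of such functions and runs them all for you, rather than you writing them one by one.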
OK, so now we know we want to automate feature extraction. But there is one drawback: if you just extract something like 500 features from a time series, you end up with many that are irrelevant to your target. The problem here is not only computing time, it is also the curse of dimensionality. I think you all know this: if you have one dimension and do a cut-based analysis, maybe with a cut in the middle, you have plenty of statistics in both bins. But if you increase this to two, three, or n dimensions (with 500 features you have a 500-dimensional feature space), the number of entries in each bin becomes low, so you are prone to overfitting.

So how cool would it be if there were an automated feature extraction library, plus feature selection, all written in Python and all open source? Well, let me introduce you to tsfresh. tsfresh stands for Time Series FeatuRe Extraction based on Scalable Hypothesis tests. This is what we do: we extract the features from the raw time series, and then we select them.

Let me walk you through the details. We start with the raw time series data, in our case in the pandas DataFrame format, and then we do the feature extraction. We have a library of more than 60 feature extractors with different parameters, which are all run in parallel on the time series you feed in. You end up with features like the global maximum or the standard deviation, but also more complicated ones like features from the fast Fourier transform, a wavelet transform, and things like that. For each of these features we then do a hypothesis test: we calculate a p-value for whether this feature is relevant to the target you feed in or not. After we have calculated a p-value for each feature, we perform the Benjamini-Yekutieli procedure, which throws out all irrelevant features while controlling a global false discovery rate that you can choose, and you end up with the features that are relevant to your target.

During the development of tsfresh we thought a lot about speed. The framework itself is optimized for speed, it uses multiprocessing and runs in parallel, and the feature extractors of course all rely on well-known binary libraries such as numpy and pandas. With the more than sixty feature extractors we extract in total more than 500 features per time series, and the selection is ready for classification or regression tasks, so whenever you have a real-valued or binary target we can handle a large number of use cases. Still, we try to make it as easy as possible to use, and we have interfaces to the scikit-learn transformers, so you can use it in your pipelines.

To show you how easy it is to use, I want to walk you through a short example. We start with a pandas DataFrame like this. Again, imagine you have a device measuring the temperature and the pressure over time. You can have multiple devices, which is why I have this ID column: device one, device two, whatever, and then the temperature for the different time steps. All you have to do is feed this into our extract_features method. You give the name of the ID column and the name of the column your sorting order is written in, in this case the time, but it can be anything. What you end up with is this kind of pandas DataFrame: one row for each device, and then the different features, in this case features for the temperature and features for the pressure; here we have extracted the minimum and the maximum.
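A minimal sketch of this call (extract_features, column_id, and column_sort are tsfresh's real API; the toy data is made up):

```python
import pandas as pd
from tsfresh import extract_features

# Long-format input: one row per measurement, one column per recorded quantity.
df = pd.DataFrame({
    "id":          [1, 1, 1, 2, 2, 2],   # device 1 and device 2
    "time":        [0, 1, 2, 0, 1, 2],   # sorting order within each device
    "temperature": [20.1, 25.7, 20.2, 19.8, 26.5, 20.0],
    "pressure":    [1.01, 1.05, 1.02, 0.99, 1.10, 1.00],
})

# One output row per device, one column per (quantity, feature) combination,
# e.g. "temperature__maximum" or "pressure__standard_deviation".
X = extract_features(df, column_id="id", column_sort="time")
```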
You can then feed this, if you want, into our select_features method. You give the target you want to train on and the false discovery level you want to achieve, and you end up with more or less the same kind of DataFrame, but with all irrelevant features ruled out.

All right, I remember that you wanted to get rich, so I want to show you how you can apply tsfresh to finance data. This is just a toy example, so I hope you will not now invest all your money into this algorithm, but we will see how well it performs. What will I do? I want to do stock market prediction, which is quite a relevant problem. We are looking at, for example, the Google and the Apple stock over time, roughly from 2010 to 2017, and I am looking at the adjusted closing price: adjusted because it is corrected for things like stock splits and so on, and the closing price is the price at the close of each day. What I want to do is use the data up to a given day, for example the first of January 2015: I take all historic data, train my machine learning model on it, and then try to predict whether the stock will be higher on the next day or not. If I say it will be higher, I buy stocks today and sell them tomorrow. So my target looks like this at the top: on all red bars the stock was lower on the next day, on green bars it was higher. This is what I want to predict, not how much it will increase, just whether it will increase or not.

All right, let's do this. I am using pandas-datareader here, a nice tool to get stock data, fetching the data from Yahoo for symbols like Apple, Google, Microsoft, or whatever you want, and I end up with pandas DataFrames for all those stocks. [Audience remark] That was, I don't know, two or three weeks ago; with the current version I think you have to change the code a bit, I would have to look into the rest of the code. And Google also changed their API, so you can only get one year back, not the whole history, but it is still possible to get some data. You can also use tools like Zipline; they have their own library for this. OK, we can discuss that in the Q&A maybe later.

What I then want to do is simulate going back in time. I take one day, use the data of the last 100 days, train my model on that, and shift this window through time. This is a typical thing you have to do, rolling in blocks of 100 through your time series, and it is already implemented in tsfresh: there is a utility function for it called roll_time_series, which does exactly this. You take all your data and roll through it in packages of 100 days. Before that, I want to treat all stocks in more or less the same way, so I melt the wide DataFrame, where each column is one stock, into a longer DataFrame where all stocks are in one column; this is done with the melt function of pandas. Then I feed this into tsfresh and extract features, with more or less the same function call as before. The only things I changed: I want to use multiprocessing, so I am using six cores, because this ran on a small server, and I only want to extract a subset of features, not the whole feature set we have, but only what we call the efficient parameters.
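A hedged sketch of this pipeline (roll_time_series, EfficientFCParameters, extract_features, select_features, and impute are tsfresh's real API, though signatures have shifted between releases; the prices DataFrame, the target y, and the window size are assumptions of this sketch):

```python
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.feature_extraction import EfficientFCParameters
from tsfresh.utilities.dataframe_functions import roll_time_series, impute

# prices: a wide DataFrame with a "date" index, one column per stock symbol
# (roughly what pandas-datareader delivers). Melt it so all stocks live in
# one value column, with the symbol as the series ID.
long_df = prices.reset_index().melt(
    id_vars="date", var_name="id", value_name="price"
)

# Roll windows of up to 100 days through each stock's series; each window
# becomes its own sub-series, identified by a (symbol, window-end) pair.
rolled = roll_time_series(
    long_df, column_id="id", column_sort="date", max_timeshift=100
)

# Extract only the cheaper-to-compute feature subset, on six cores.
X = extract_features(
    rolled,
    column_id="id", column_sort="date", column_value="price",
    default_fc_parameters=EfficientFCParameters(),
    n_jobs=6,
)
X = impute(X)  # the selection step cannot handle NaN/inf values

# y: a binary target per window (e.g. "price rose the next day"),
# indexed like X; its construction is omitted here, as in the talk.
X_selected = select_features(X, y, fdr_level=0.05)
```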
This is a subset of all the features, just to save some time. As I said, I end up with this kind of DataFrame, where for each package of 100 days I have the features extracted for a given symbol. Then I just have to feed this, day by day, into my XGBoost model, and I can do what I said: if my XGBoost model predicts the stock will be higher the next day, I buy some stocks and sell them tomorrow.

This is now the performance. It goes up; well, OK, this is good. You can see in blue the performance of the algorithm I described, and in orange the benchmark, which is the Standard & Poor's 500, a typical benchmark you show to compare your algorithm against. First of all, it is comparable to the S&P; well, this is kind of expected, since I am using typical stocks, and you have seen I used tech stocks, so it goes up. But it is not just randomly going up: I also tried this with picking random stocks every day, and while some results come out at something like plus 500 percent, most of them are bad. And, as I said, I am excluding fees here: I am buying stocks every day and selling them every day, so I would have a lot of fees. The reason I am showing you this is not that I want you to take all your money and buy stocks with this algorithm. I just want to show you that with a small amount of code (I have not shown you the target calculation, but not much else) you can already gain some useful features and a reasonable result with tsfresh.

All right, after this introduction into what we do and how to use tsfresh, I want to show you some tricks, some things we learned during the development of tsfresh that may help you with your next library. First of all: we love pandas, pandas is great. The only thing is that if you use it in your library without thinking about it, which is unfortunately what we did, it can be quite slow. In particular, the creation of pandas Series and DataFrames should not happen often. In our case we have this huge groupby-and-apply step, where we group by the IDs of the time series and then apply our feature extraction to each group, and in these apply functions we actually returned pandas Series, because we thought it would be nice. This was a huge bottleneck for us; now we just go with normal Python lists and tuples, and we are fine. The second thing, which we all know but sometimes forget: appending to pandas DataFrames can be slow. If you do it one row at a time, it is really slow, so do it in blocks if you can.

So this is what we did to speed up tsfresh. We started with good profiling; you should always do this, for example with tools like line_profiler, and there is plenty of stuff around for profiling Python applications. In our case we saw that we had to reduce the number of pandas objects, especially when using random access, masking, or basic arithmetic on them, and we tried not to copy or move them around all the time. We gained a speedup of 50 percent, which sounds good, but then again we did not use pandas the way it is supposed to be used in the beginning. And always have a consistent speed test ready, so you can compare different git commits and check whether things actually got better or worse.
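A quick illustration of the row-by-row versus in-blocks point (plain pandas; the numbers are arbitrary, but the shape of the fix is general):

```python
import pandas as pd

rows = [{"id": i, "value": i * 0.5} for i in range(10_000)]

# Slow: every concat allocates a brand-new DataFrame, so building the
# frame one row at a time is quadratic in the number of rows.
df_slow = pd.DataFrame(columns=["id", "value"])
for row in rows:
    df_slow = pd.concat([df_slow, pd.DataFrame([row])], ignore_index=True)

# Fast: collect plain Python objects first, create the DataFrame once.
df_fast = pd.DataFrame(rows)
```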
The second thing I want to mention is the settings object. Our top-level functions extract_features and select_features take a large number of parameters; for example, you can choose exactly which features you want to extract, maybe based on a selection you did before. What we started with was creating our own settings objects with some logic inside, so users could change and access things easily. The day we did this, our number of issues on GitHub actually increased, and we got many questions on Twitter about it. The problem was that we had reinvented Python: things like dictionaries and parameter lists are already built into Python, and people are used to them. So try not to build your own things around the standard library, because Python's standard library is perfect.

After these two small tips, I want to show you the latest things we have in tsfresh. Maybe you have the problem that you have too many time series to do this locally, and maybe also too many to do it on one worker machine. Now I can say: no problem, we have brought tsfresh to the cloud, so we can try again with more power. Thanks to the nice talk this morning we now all know about Dask, and we have implemented Dask support in tsfresh. It is just one extra line: before the feature extraction you choose what we call a distributor, in this case the ClusterDaskDistributor. You connect it to your Dask scheduler (you already know how to create a Dask cluster) and then you can use tsfresh with Dask.
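A sketch of that one extra line (ClusterDaskDistributor lives in tsfresh.utilities.distribution; the scheduler address is a placeholder, and df is the same kind of DataFrame as in the earlier example):

```python
from tsfresh import extract_features
from tsfresh.utilities.distribution import ClusterDaskDistributor

# Point the distributor at an already-running Dask scheduler.
distributor = ClusterDaskDistributor(address="tcp://127.0.0.1:8786")

# Same call as before; the distributor ships the per-time-series chunks
# of work to the Dask workers instead of to local processes.
X = extract_features(
    df, column_id="id", column_sort="time",
    distributor=distributor,
)
```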
There are more things to come. As I said, Dask distributed is already built in, so you can now run wherever you want, maybe on Amazon EC2. What we want to try next is to run on AWS Lambdas, so we will build another distributor for that case. We have a first test in a private instance, and we did this with the help of Flask, a nice package for web services, and Zappa, an even better package if you want to manage Lambdas. You can find more information in Chris's blog post about bringing tsfresh to the AutoML cloud, and also on mine.

All right, let me summarize. Now you are not rich, but maybe you know how to get rich. I have presented tsfresh, a Python package for time series feature extraction and selection. It can handle time series of different lengths, for example; it is ready for distributed processing and big data because of the cloud computing support; it is field-tested, because it was developed for consulting projects, and it now has more than two thousand stars on GitHub; and it is robust against overfitting because of the feature selection. You can find more information on the docs page and the GitHub repository. I want to thank all the contributors to tsfresh, especially the top contributors Max and Andreas, and Michael Feindt for giving us the idea for this project. Thank you.

Q: Quick question, maybe I missed this: for the prediction in the trading strategy, was it purely in-sample, or did it have some out-of-sample component?

A: What I did is this: when I want to predict, say, the second of January 2015, I take all the data up to this date, train only on this data, and then predict the next day. On the next day I can reuse the data from the day before, so I am only ever learning on data I already have. It is kind of like a rolling window, but instead of just the mean or the median or something, I calculate all my 500 features in the rolling window. As I said, it is not what you would want to do for finance data in production, but it is something you can start off with.

Q: Thank you very much. I am just wondering how you would incorporate or control for type 2 error rates here, because I have seen that you are using a rescaled, corrected p-value; I assume you try to limit the error with some multiple-testing procedure, the false discovery rate?

A: Yes. We have seen that the feature selection is not perfect in every case, that is for sure, so you still need to test on your validation sample. Correction for false negatives, that is, the statistical power, is not in the library at the moment. As I said, what is implemented is the Benjamini-Yekutieli procedure, which controls the false discovery rate.

Q: Great talk and great package, thank you. Are you working on multi-time-series features, like the divergence between two series and things like that?

A: We have thought about this. The problem is that it would be a major refactoring of our framework. We do have plans for it, of course, because there are lots of features you would want to use there, but it will not come with our next release. I do hope we will have it someday; I cannot give you a date, though.

Q: Thank you for your talk. I am working with time series for predictive maintenance and condition monitoring, and my question is: how do you do the feature selection? You said something about p-values; how do you know which of the features are useful?

A: What we do, as I said, is calculate a p-value, and the p-value compares two hypotheses: this feature is relevant, or it is not. Depending on the types involved we have different statistical methods: for a binary target with a binary feature we have one method, for a binary target with a real-valued feature another, where you can go with things like the Kolmogorov-Smirnov test. So we end up with a p-value that tells you whether this feature is relevant or not. Then, if you are looking at a set of 500 features and their p-values, you have to think about the look-elsewhere effect: you can always find something, because a p-value is just a probability. This is all included in the Benjamini-Yekutieli procedure, which scales with the number of features you have and adjusts for the factor you have to apply because you looked at that many features.
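The adjustment described here can be sketched with statsmodels, whose multipletests function implements the Benjamini-Yekutieli correction as method "fdr_by" (the p-values below are made up; tsfresh wires this logic into select_features internally, this is just the idea):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# One p-value per extracted feature, from the per-feature relevance tests.
p_values = np.array([0.0001, 0.004, 0.03, 0.2, 0.6, 0.9])

# Benjamini-Yekutieli: controls the false discovery rate at 10%, even
# under arbitrary dependence between the individual tests.
relevant, p_adjusted, _, _ = multipletests(p_values, alpha=0.10, method="fdr_by")

print(relevant)     # boolean mask: which features to keep
print(p_adjusted)   # BY-adjusted p-values
```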
Q: If you check with a p-value how important a feature is, you need some target to check against. But what if you don't have labels? In my case, for example, I often just have time series and no labels: I have data from one month and I have to predict an anomaly, but I only know that at the end there was an anomaly, and I want to predict it three days before, two hours before, and so on. So I don't have labels; how do I check in this case?

A: So you have no target you can train on? Then indeed we have no target with which we can test whether a feature is relevant. But you can also just use tsfresh up to that point and only do the feature extraction; by looking into the extracted features you already gain some knowledge about your data. You can simply use it as a large feature-extractor library.

Host: Very good. More questions? Maybe I have a question: what is it like running an open source project that has 2,000 stars? How did you get there, did you do some promotion?

A: Actually, we were quite surprised that we got 2,000 stars. We started the project, as I said, for consulting work; I worked for Blue Yonder during this time, and then we open-sourced it, which was great of Blue Yonder, and we are really glad that it happened. We promoted it a bit, maybe with some answers on Stack Overflow and here and there, but not on a very large scale. People liked it, we got featured in different online magazines, and then we reached 2,000 GitHub stars. Now it is kind of stagnating, so please star it on GitHub! We are glad that people want to use it, and I think people had simply been missing such an easy library of feature extractors. We have not reinvented those feature extractors; we use them from other libraries, because they know how to do this, we just bring these things together.

Host: More questions? If that is not the case, you can go to lunch. Let's thank Nils again. [Applause]
Info
Channel: PyConDE
Views: 10,654
Rating: 4.9613528 out of 5
Keywords: PyCon.DE2017, Python, PyCon.DE, PyData
Id: Fm8zcOMJ-9E
Length: 27min 10sec (1630 seconds)
Published: Fri Dec 01 2017