Automatically Find Patterns & Anomalies from Time Series or Sequential Data - Sean Law

Video Statistics and Information

Captions
First off, I'd like to thank the conference organizers for having me, and a shout-out to Hector Garcia for giving me the opportunity to talk here. Thanks also to my employer, TD Ameritrade, for letting me come out into the wild and have a chat with all of you. For those of you who don't know, TD Ameritrade is a financial investment firm and an online and mobile brokerage; we manage just over a trillion dollars in client assets across the United States, and last year we were selected as the number one overall broker by Kiplinger. If you're a long-term investor thinking about retirement and goal planning, or you want to trade online or on mobile, we have all of these different products and we'd love for you to use them. If you're an options, forex, or futures trader, we also have a wonderful platform called thinkorswim that you should go check out too. Additionally, we were recently chosen for the Forbes 2019 list of America's best large employers. We have an office, which is where I work, in downtown Ann Arbor, right above Bar Louie, so come check us out; you can find openings on the TD Ameritrade jobs site.

In addition to that, maybe just a little bit of culture: this was just a couple of weeks ago, right behind our building, where we had some special guests serving as baristas downstairs, namely our CIO Vijay Sankaran and our CEO Tim Hockey. You'll see me down at the bottom here getting a picture, so that gives you an idea. More importantly, these cups carry a very special message related to our company purpose, which is transforming lives and investing for the better; the key thing here is transforming lives first. To a large extent, I do some of this in collaboration with my PyData co-organizers,
Dr. Patricia Schuster and Ben Zeitlin. Together we organize the PyData Ann Arbor data science meetup. It happens monthly, and all of our talks are recorded and posted on YouTube, so if you look up PyData Ann Arbor you'll find all of our past talks. We have one coming up on Monday at 6 pm, so go check it out. Shout-out as well to NumFOCUS, MIDAS, and TD Ameritrade for sponsoring our events. With that, let's get started.

So who am I, and why am I standing here? I'm Sean. I like to say I'm a reformed scientist turned applied researcher. I work at TD Ameritrade on something called the Exploration Lab team, which is the technology research and development arm of the company. I serve as a senior applied researcher and data scientist, and really what we're doing is exploring new and emerging technologies, or in my case often new methodologies, that might help us differentiate ourselves from our competitors.

Who is this talk for? You might fit into one of these categories: maybe you're a data scientist, an engineer, or a people manager. Hopefully you'll take away one of three things. First, I guarantee you'll likely learn something new, a new technique for exploring time series data; or, number two, you'll be highly entertained; or, number three, you'll watch an individual rip through just over a hundred slides in less than 25 minutes.

Okay, so what's the problem? We've all been through this if we've ever had some manager or business person telling us, "Hey, I've got some time series data, now perform some magic, Dr. McFancy-pants." Your data might look like this: if you've got 10 data points, small data, that's great, you can visualize it and look at it. What if we had 100 data points? It's still manageable, but a bit nasty. A thousand data points? Forget about visualizing it. And then this, forget about it, and it's only 10,000 data points. We've all felt this dread, and this is what I tend to look like: just overwhelming dread. Oftentimes we wish there were some magic wand we could wave, say "Riddikulus!", and take this scary thing and make it not so scary.

Now, there are some typical ways to analyze time series data. Obviously the time series domain tends to be a very visual thing, so we can visualize the data; that's the first thing you want to do. Another very nice thing we can do is statistics: take the mean, median, mode, maxes and mins. But typically with time series data we're not really interested in those global metrics. You can also approach it using some Bayesian methods, or everybody's favorite, ARIMA models, where you start building autoregressive models. The next thing you might want to do is anomaly detection or outlier detection, but again, that's not perfect, because in order to know what is an outlier or an anomaly you have to figure out what is normal first before you can say what is abnormal; and when you see an anomaly a second time, is it normal or abnormal now that it's a repeat? Now, if you really trust your data, we can get into forecasting: you believe your data is good enough to help you predict, or at least guess at, the future. And of course, everybody's favorite when all else fails: more data. I don't really recommend this, but you pick your poison.
Now, when I say more data for time series, it could mean a longer time series, adding more data at the end of it; or, if you already have a reasonable amount of data, more dimensions, adding additional, different time series and looking at the relationships through some form of clustering. Sometimes you might even stretch and compress your data using dynamic time warping; these are very classical methods. And then, obviously, it's hammer time: we're looking for a nail to hammer, so let's choose deep learning, another very common method these days.

I took this quote directly from the conference website. It says that there are lots of conferences these days and almost all of them promise new and magical insights that will surely revolutionize the way that you work, and right after that it says: this is not one of those conferences. I tend to agree that there's no silver bullet here, no magic. For those of you who are machine learning practitioners, it's what we call the "no free lunch" theorem: everything is a compromise; you choose one method, it's good for one thing, but it's not good for everything. So for time series data analysis, what's the goal? What are we trying to achieve here?

Let's take a step back and leverage an illustrative example. This is a time series with just thirteen data points, n = 13, where n is the length of your time series. If I asked you, do you see anything interesting about this, do you see any patterns here? I will tell you that nothing repeats itself exactly. We can plot it and take a look; again, that's the kind of thing we do with time series, and we might see some things here and there. But when I approach time series, or basically any problem, I try not to go in with a solution in mind, and I follow this person. Does anybody recognize who this is? You might recognize him as William of Occam. He's known for something called the law of parsimony, more commonly referred to as Occam's razor, which simply states that simpler solutions are more likely to be correct than complex ones. This is what drives me as a reformed scientist: how do we approach things, and how do we approach time series analysis? So I ask myself one question: what is the most simple and intuitive approach that I can take to analyze a time series?

Going back to our illustrative example, we have to make a definition first, because we're not interested in the full time series but in a smaller section of it. The definition is a subsequence: a subsequence is a part or section of the full time series. For example, this is a subsequence; it's not the full time series, just a chunk of it. This one can also be a subsequence, or seven data points could also be a subsequence. We then have to choose a subsequence length; in this case, let's say m = 4, where m is the subsequence length. I can now ask, how does this subsequence compare to somewhere else along the time series? Because we want to know where we should start looking, where we should start exploring. In order to compare subsequences, we use something called the Euclidean distance, and the definition of the Euclidean distance is the straight-line distance between two points.
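To make the subsequence idea concrete before working out the distance itself, here is a minimal sketch that slices every length-m subsequence out of a short series with a sliding window; the values are made up for illustration and are not the actual data points from the slide.

```python
import numpy as np

# A toy 13-point time series (illustrative values only)
T = np.array([0.0, 1.0, 3.0, 2.0, 9.0, 1.0, 14.0, 15.0, 1.0, 2.0, 2.0, 10.0, 7.0])
n = len(T)   # n = 13, the length of the time series
m = 4        # m = 4, the subsequence length

# Every subsequence of length m, obtained by sliding a window across T
subsequences = np.array([T[i:i + m] for i in range(n - m + 1)])
print(subsequences.shape)  # (n - m + 1, m) -> (10, 4)
```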
What does that mean? Well, I'm not that smart, so let's make this simpler and use just two points, m = 2. I can plot them: that's one of my points, the green box, and I can plot the other one. The definition said that the Euclidean distance between them is simply the straight-line distance. I don't have to give it a fancy name; I can just call it h for simplicity, draw some lines, and add some variables. Now we know how to calculate that Euclidean distance: h² = a² + b², take the square root of both sides, and we get h. We all recognize this as the Pythagorean theorem in disguise; very simple, high school or maybe even junior high school math. So we plug our points in, and we can very quickly calculate the distance between these two subsequences: it's root 2. What happens if we add an extra data point? In that case all we have to do is add a c term under the square root, and we get root 3. Moving on to a subsequence of length 4 works very similarly.

But that's only comparing two subsequences. What if I want to do all pairwise comparisons? What I then have to do is repeat my sequence, and I want to store the distances that I'm calculating in another vector. Just so you can visualize this: I compare these two subsequences, they're identical, so the distance is 0; I slide the second one over by one, calculate the distance again, and I keep doing this all the way across. Done; those are the pairwise Euclidean distances. Then I can ask a question: what is the other subsequence that best matches the green one? Now, there's what we call the trivial match, which is just matching yourself; that's not interesting because it's obvious. But what we can do is look at the next best match, which is just over here at 6.9, and we simply blank out that trivial match. We can repeat this process by taking our original subsequence, sliding it over, and performing that same calculation again; in this case we find that the best, or rather second-best, match is all the way at the end. And we can do this for the entire thing: we slide the green box all the way across, the sliding window, and calculate all the pairwise matches. The vector of distances at the bottom is called a distance profile, a very simple name. If we stack all these distance profiles on top of each other, that's called a distance matrix, the pairwise Euclidean distances, and I've highlighted all of the places where the smallest match was found.

So great, we have a simple, intuitive approach, but the question is: is this solution scalable? Let's take a look. It turns out that if you were to plug this into your favorite computer language, in my case Python, it's only a handful of lines of code, and to understand the computational complexity we just have to examine the three for loops; that's really all it is. That gives us O(n²m), where again n is the length of your time series and m is the subsequence length, and the space complexity is O(n²). Nasty, actually. And what if we were to do a brute-force calculation on a time series slightly longer than 13 elements?
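For reference, here is a minimal sketch of what that handful of lines with three for loops might look like; it is a naive illustration of the brute-force idea, not the speaker's actual code, and the helper name is made up.

```python
import numpy as np

def brute_force_distance_matrix(T, m):
    """All pairwise Euclidean distances between length-m subsequences of T.

    The three nested loops give O(n^2 * m) time, and holding the full matrix
    costs O(n^2) memory, which is exactly why this approach does not scale.
    """
    n = len(T)
    k = n - m + 1                       # number of subsequences
    D = np.zeros((k, k))
    for i in range(k):                  # reference subsequence
        for j in range(k):              # candidate subsequence
            s = 0.0
            for l in range(m):          # point-wise squared differences
                s += (T[i + l] - T[j + l]) ** 2
            D[i, j] = np.sqrt(s)
    return D

T = np.random.rand(13)
D = brute_force_distance_matrix(T, m=4)
# Row i of D is the distance profile of subsequence i against the whole series
print(D.shape)  # (10, 10)
```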
Here is what that looks like, with literally a back-of-the-envelope calculation for you. If I were collecting data 20 times a minute, so basically every three seconds I collect a data point, 60 minutes per hour, 24 hours a day, 365 days a year, for five years, perhaps some sensor data or whatnot, this amounts to a time series of just over fifty-two and a half million data points. Now, if I were to use that brute-force method to calculate the pairwise Euclidean distances, it would take me roughly 4.4 years, and not only that, but about 11.1 petabytes of memory would be required to store it, or you can write it to disk if you're crazy. It's an unenviable task. So is the brute-force solution scalable? Not really. And that really explains why, at the beginning, people were using all of those other methods: perhaps because they're more interpretable, but mostly because they're more scalable for large amounts of data, especially when it comes to deep learning. Until now.

Recently, in 2016, a group of researchers at the University of California, Riverside, as well as the University of New Mexico, published a series of back-to-back papers, and I only stumbled upon them, believe it or not, on a subreddit. I should point out that all of the underlying research you're seeing here today was not done by me; it was done by these researchers, and they're not collaborators of mine. I've reached out to them and exchanged emails for further clarification, but all the credit really goes to them. It's a beautiful body of work, and since 2016 they've published about 13 papers built upon this fundamental idea of something called a matrix profile.

So what is a matrix profile? A matrix profile is simply a vector that stores the distance between each subsequence within a time series and its nearest neighbor. What does that mean? Going back to our picture, all this means is that we don't store the entire n-by-n array; all we're interested in are just those minimum numbers, the vector of each subsequence's minimum distance to its nearest neighbor. That's all we care about, and if we only store that, the space complexity is O(n), which is very manageable.

So why is this useful? If we take our original illustrative example and compute its matrix profile, we can actually plot it and annotate our original time series right underneath it, and this is what we get. This is our matrix profile, and if we're asking what the best match is, all we need to look for are the minima: this subsequence right here, 1, 3, 2, 9, has its best match with the one over here, 1, 2, 2, 10. Similarly, with the matrix profile you can look at outliers too: whatever doesn't have a good match has a large distance. And there's a whole bunch of other things that you can extract out of the matrix profile; a lot of times you get things for free.
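As a rough sketch, assuming you already have a pairwise distance matrix like the brute-force one above, extracting a matrix profile amounts to keeping only each row's smallest non-trivial entry; the exclusion-zone width used here is a simplifying assumption, not the papers' exact definition.

```python
import numpy as np

def naive_matrix_profile(D, m):
    """Reduce a full (k x k) distance matrix D to its matrix profile.

    For each subsequence i we blank out the trivial matches near i and keep
    only the distance to (and index of) its nearest neighbor, so the result
    is just two length-k vectors: O(n) space instead of O(n^2).
    """
    k = D.shape[0]
    excl = max(1, m // 2)              # simple exclusion zone (an assumption)
    P = np.empty(k)                    # matrix profile values
    I = np.empty(k, dtype=int)         # matrix profile indices
    for i in range(k):
        row = D[i].copy()
        row[max(0, i - excl):min(k, i + excl + 1)] = np.inf  # ignore self-matches
        I[i] = np.argmin(row)
        P[i] = row[I[i]]
    return P, I
```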
Now, again, this is the research that was done. The researchers compared against the brute-force analysis, which we just worked through. The first algorithm they came out with is called STAMP, and it takes all of that math and applies a very nice fast Fourier transform to it; they went from a time complexity of O(n²m) down to O(n² log n), which is significantly better. Then, in the next paper they published, called STOMP, they threw away the fast Fourier transform, and just from the math itself they were able to get it down to O(n²). And this is an exact solution, no approximations. A lot of times when you have these kinds of Euclidean distance calculations, people apply approximations or heuristics; this is an exact solution, and you will not get any false positives. Finally, they ported this over to GPUs, throwing hardware at it to make the computation even faster. Where I was looking at about fifty-two and a half million data points, they went crazy and went for a hundred million: they took n = 100 million, used their GPU-STOMP calculation, and published that it took 12.13 days, instead of, I think it says down there, just over fifteen hundred years if you did it the brute-force way. Think about that for a second.

This is one of the corresponding authors, Eamonn Keogh. He says that given the matrix profile, most time series data mining problems are trivial or easy to solve in a few lines of extra code after you've computed the matrix profile. He also goes on to say that these are the best ideas in time series data mining in the last two decades, and to me, anybody who says that must be a gangster. But of course, bringing it back down to earth, there's another scientific gangster who popularized the phrase "extraordinary claims require extraordinary evidence," and that's the late Carl Sagan. So that's where I embarked, with TD Ameritrade, on validating whether or not this was for real.

I built an initial prototype. I'm a Python guy, using NumPy for the numerical computing, and I compared, not a hundred million data points, but simply two million data points. Their GPU implementation takes about nine minutes for that; I did it on a single CPU, and it took so long that it doesn't even register on this bar chart. Then I asked, well, does it work if I just compute on smaller data sets? Sorry, it makes me a sad panda. So I thought, what about parallelizing? I have multiple threads and multiple cores, and I'm using another fantastic open source piece of software called Numba, which takes your Python and NumPy code and compiles it so you get nearly C-level performance out of the box. Within two days of taking my original code, I was able to parallelize my algorithm across 16 CPUs and brought it down to about three hours. Still not great, but progress. And that's when I realized there's a space here that's missing: we went from single-CPU STOMP to GPU-STOMP, and what's missing is some sort of multi-core, distributed computation. That's where I leveraged Dask, which came up earlier; I leveraged Dask out of the box to distribute my code and my computation, and I was able to go from 16 CPUs to 128 CPUs in five days. The thing about this is that, traditionally, writing distributed code becomes very, very painful; I did this in five days, and I'm not that good, I'm not a software engineer. And I brought it down to about 13 minutes on 128 CPUs. So, not bad.
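To give a flavor of the Numba step, here is a minimal sketch, assuming made-up data and an arbitrary window length, of the general pattern of JIT-compiling a nested-loop nearest-neighbor search and parallelizing its outer loop across cores; it only illustrates the Numba idiom and is not STUMPY's actual algorithm.

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def nearest_neighbor_distances(T, m):
    """For every length-m subsequence of T, compute the Euclidean distance to
    its nearest non-trivial neighbor. Numba JIT-compiles the loops and spreads
    the outer prange loop across all available CPU cores."""
    n = T.shape[0]
    k = n - m + 1
    P = np.empty(k)
    for i in prange(k):                 # parallel outer loop
        best = np.inf
        for j in range(k):
            if abs(i - j) < m:          # skip trivial (self) matches
                continue
            s = 0.0
            for l in range(m):
                d = T[i + l] - T[j + l]
                s += d * d
            if s < best:
                best = s
        P[i] = np.sqrt(best)
    return P

P = nearest_neighbor_distances(np.random.rand(10_000), 50)
```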
But being very competitive, I asked, what if we just went up to 256 CPUs? This was on the full 100 million data points, and we were actually able to get it done in just under 10 days, beating the GPU-STOMP implementation. Now, let me be very clear, because I've been harassed about this: I'm not at all saying that CPUs are faster than GPUs; I would be an idiot to say that. The point here is that with open source tools such as Numba and Dask, we're able to do things very quickly. If it wasn't for these open source tools, I wouldn't be able to give this talk. So, going back to the earlier question, is the solution scalable? I think Eamonn was correct that there is something here. Our implementation, playing along with the play on words of STAMP and STOMP, is called STUMP, because we're in Tree Town, Ann Arbor, and our distributed version is called STUMPED.

More importantly, what I'm excited to announce today is that we recently open sourced all of this software, and the open source library is called STUMPY, or STUMP-I if you will. So what is STUMPY? STUMPY is a powerful and scalable Python library that efficiently computes the matrix profile, which can then be used for a variety of time series data mining tasks. These are some of the things you can do: motif discovery, which is finding patterns within your time series data; anomaly detection, essentially for free; something very interesting called time series chains, where, if you have a repeatable pattern that trends in a certain direction, there's a way for you to identify that just by looking at your matrix profile, and that comes out of the box with STUMPY. Also think about multiple dimensions: multiple time series, maybe collected from different sensors; you can leverage STUMPY for this as well. Further, you can segment your data if you're trying to find different change points within your data. There's also clustering: the original authors have described something called MPdist, a distance metric; and snippets, where, given many, many subsequences or patterns that look similar, how do you choose one representative one to show? That's what snippets are. And there are many, many more things; I'm only scratching the surface. I recommend you go and Google UCR, the University of California, Riverside, and "matrix profile," and you'll find a ton of information.

STUMPY is open source; you can follow us at @stumpy_dev. It's fully unit tested with one hundred percent code coverage, whatever that means, has minimal dependencies, and only supports modern Python, not legacy Python. If you have Python 3.5 or newer installed, you can just do a pip install stumpy and that should work; otherwise, find me in the hallways, or if you're cruel to yourself, you can download the source code and install from there. With that, I would like to conclude. I will be outside for coffee and I'll also be here for the reception, so please come and chat, use the code, contribute, file issues; any feedback would be great. Thank you for your time.
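As a minimal sketch of the pip-install-and-go usage described above, assuming random data and an arbitrary window length, the matrix profile's minima point at motifs and its maxima at potential anomalies:

```python
# pip install stumpy
import numpy as np
import stumpy

T = np.random.rand(10_000)              # any 1-D time series as a float array
m = 50                                  # subsequence (window) length

mp = stumpy.stump(T, m)                 # column 0: matrix profile, column 1: nearest-neighbor index

motif_idx = np.argsort(mp[:, 0])[0]     # smallest distance -> best repeated pattern (motif)
discord_idx = np.argsort(mp[:, 0])[-1]  # largest distance -> most anomalous subsequence (discord)

print("motif at index", motif_idx, "matches index", mp[motif_idx, 1])
print("potential anomaly at index", discord_idx)
```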
Info
Channel: Criteo Eng
Views: 14,593
Rating: 4.8846154 out of 5
Id: WvaBPSeA_JA
Length: 23min 43sec (1423 seconds)
Published: Wed Jul 03 2019