From Fourier to Koopman: Spectral Methods for Long-term Time Series Prediction

Captions
My name is Henning Lange, and I am going to present a paper that I recently finished with my collaborators Steve Brunton and Nathan Kutz. The arXiv identifier is shown here, and you can also find a link in the description below.

What I am going to talk to you about today is very fundamental and based on a very simple idea. I will introduce algorithms that do the following: you present the algorithm with temporal data, snapshots from time point t = 1 to t = T, and the algorithms predict these temporal snapshots into the future. The goal is to go very far into the future, on the order of 10,000 steps ahead, which is approximately one year at an hourly sampling rate. This obviously does not work without making assumptions about the data, so we make the following assumption: the temporal snapshots are produced by a quasi-periodic system.

I am going to introduce two algorithms. The first one we call Fourier forecast; it is similar to the Fourier transform, but it does not make the implicit periodicity assumptions that would be harmful for forecasting. The second algorithm, which we call Koopman forecast, is based on Koopman theory, and you can understand it as the Fourier transform in a nonlinear and data-dependent basis. The optimization objectives of these algorithms are pretty nasty: the Fourier forecast algorithm has a non-convex objective, and the Koopman forecasting algorithm has a nonlinear and non-convex objective. Even though the objective functions are hard to solve, we can find global optima in the direction of some of the parameters by using the fast Fourier transform.

On a meta level, what I am going to do in this talk is explain how we compute the parameters that govern the temporal evolution of the algorithms. The algorithms have hard-to-compute and easy-to-compute parameters, and I am going to focus on how to obtain the hard-to-compute ones; specifically, I will show how to obtain a single one of them, and to optimize all of them we simply apply coordinate descent. So what these algorithms do, on a meta level, is alternately optimize hard and easy-to-compute quantities, but in this talk I will only cover the hard-to-optimize quantities.

Let's get to it and talk about the first algorithm, which we call Fourier forecast. What we are trying to do is the following: we have our snapshots x_t, the latest measurements, and we are trying to fit a linear dynamical system y_t to our temporal snapshots x_t. We introduce two constraints. The first constraint is that y_t evolves according to a linear dynamical system, and the second is that the real part of the eigenvalues of the matrix B that describes this linear dynamical system is zero, which basically means that the system neither crashes into the origin nor blows up to infinity. It is a stable system, which fits the data that we are trying to model, namely quasi-periodic systems.
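In symbols, a minimal sketch of this setup (the notation below is chosen for illustration and is not copied verbatim from the paper):

\min_{A,\,B,\,y_{1:T}} \; \sum_{t=1}^{T} \bigl\| x_t - A\,y_t \bigr\|^2
\quad \text{s.t.} \quad y_{t+1} = B\,y_t, \qquad \operatorname{Re}\bigl(\lambda_i(B)\bigr) = 0 \ \text{for all eigenvalues } \lambda_i \text{ of } B.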
When you look at this optimization objective, you can rewrite it: the constraint that the eigenvalues of the matrix B are purely imaginary turns y_t into a vector that only contains sines and cosines evaluated at specific frequencies. In this case, the easy-to-compute quantity is the matrix A, and the hard-to-compute quantities are the frequencies ω_i. To avoid notational clutter, we call the vector that contains the sines and cosines capital Omega. The nice thing is that because A enters linearly and Omega describes a linear dynamical system, we can actually find an analytic solution, and what comes out when you solve this optimization objective for ω_i is a very nice symmetry relationship with the Fourier transform. The orange graph shows the error function and the blue graph shows the Fourier transform: colloquially speaking, if you take the Fourier transform, shift it up on the y-axis and then flip it across the x-axis, you get the error surface.

Now let's go back and look at the assumptions we make about the data, namely that the snapshots x_t stem from a quasi-periodic system. It is known in electrical engineering that the Fourier transform of a quasi-periodic system is a superposition of sinc functions. So how do we optimize our objective function? If we were to use the fast Fourier transform, which evaluates the Fourier transform at integer multiples of 1/T, we would always get frequencies that are periodic in T, and that is not a good property if you are trying to forecast. What would happen if we used gradient descent instead? Because the superposition of sinc functions is highly non-convex, gradient descent will very quickly get stuck in local minima. So what we end up doing is combining the strengths of the two optimization procedures: we use the fast Fourier transform to locate the valley in which the global optimum resides, which gives us a frequency that is periodic in T, and then we use gradient descent to improve this initial guess and break the implicit periodicity assumption. Graphically, this looks as follows: the blue line is the data we were provided with, and the orange line below it is the error surface that we can compute. We use the fast Fourier transform to locate the valley, place our initial guess for gradient descent right there, and then, using stochastic gradient descent, improve this guess and wander down the error surface. That is how we improve a single ω_i.
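As a rough sketch of the procedure just described, assuming one-dimensional data and a single frequency (the function name fit_single_frequency, the learning rate, and the step count are illustrative assumptions, not the paper's implementation):

import numpy as np

def fit_single_frequency(x, n_steps=500, lr=1e-4):
    # The FFT locates the valley containing the global optimum (its frequencies
    # are integer multiples of 1/T, i.e. implicitly periodic in T); gradient
    # descent then refines the guess to break that implicit periodicity.
    T = len(x)
    t = np.arange(T)
    # Coarse guess: the dominant FFT bin (skipping the DC component).
    k = np.argmax(np.abs(np.fft.rfft(x))[1:]) + 1
    omega = 2 * np.pi * k / T
    for _ in range(n_steps):
        # Given omega, the linear coefficients have a closed-form least-squares solution.
        basis = np.stack([np.cos(omega * t), np.sin(omega * t)], axis=1)
        a, *_ = np.linalg.lstsq(basis, x, rcond=None)
        resid = basis @ a - x
        # Gradient of the squared error with respect to omega.
        dbasis = np.stack([-t * np.sin(omega * t), t * np.cos(omega * t)], axis=1)
        grad = 2.0 * np.sum(resid * (dbasis @ a))
        omega = omega - lr * grad / T
    return omega, a

For the full problem, one would run such an update for each ω_i in turn, alternating with the closed-form update of A, i.e. the coordinate descent mentioned earlier.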
This brings us to the next algorithm, the Koopman forecasting algorithm. The Koopman algorithm is based on Koopman theory: in 1931, Bernard Koopman showed that you can take any nonlinear dynamical system, apply a nonlinear but time-invariant function to the measurements of this system, and lift it into a space where its time evolution can be described by linear methods. There is a very strong analogy to Cover's theorem, which is arguably the theoretical underpinning of kernel methods and deep learning. Koopman theory tells us that you can take a nonlinear dynamical system, apply a function f, and in this transformed space describe the data by linear methods. The same is true for Cover's theorem: it says that you can take any data set that has two classes, project it into a high-dimensional space, and there will always be a space in which the data can be linearly separated, or described with linear methods.

As a quick reminder, the Omega_t that I introduced earlier is a stable linear dynamical system; we go into the reasons for this in more detail in the paper. The objective we are trying to solve for the Koopman algorithm is essentially the same as for the Fourier algorithm, except for one difference: instead of a linear link between the oscillator y_t and x_t, we assume there is a nonlinear function f, parameterized by theta, that links the data and the oscillator. Why do we do that? One reason is that it gives us noise robustness, which I will go into in a little more detail later, and another reason is that it is more expressive: if you have a square wave, you would need infinitely many Fourier modes to describe it, but in the Koopman framework you only need a single oscillator and a nonlinear function. That is the whole idea. In this case, f_theta is usually a neural network parameterized by theta, and theta is the easy-to-compute quantity because we can just use stochastic gradient descent. Because of the nonlinearity, however, the analytic solution that we exploited for the previous algorithm no longer works; the symmetry between the Fourier transform and the error surface simply does not exist anymore. So now we have a problem: an optimization objective that is not only nonlinear but also non-convex. However, we can still compute the global optimum in the direction of ω_i, and we do that by exploiting periodicities in the error function.

Let's take a step back. The error function we are trying to minimize is a sum of what we call temporally local loss functions: we sum a loss l_t for every time step. Now let's look at the loss function at a specific time step in a little more detail. It is very easy to show that this temporally local loss function is periodic in ω with period 2π/t, which follows almost directly from properties of the sine and cosine functions. Graphically, this looks as follows: the first graph shows the temporally local loss function for t = 1, which makes one revolution over the span from 0 to 2π; the loss function at t = 2 makes two revolutions; the loss function at t = 3 makes three revolutions, and so on.

So how do we exploit the periodicity of these temporally local loss functions in order to solve for ω? We simply sample each loss function within its first period of length 2π/t, and then we have all the information we need to solve globally for ω_i, because once we have the loss function computed within the first period, we can just copy and paste it over and over. What we need to do is repeat the loss function t times, but because we want to sum all of these loss functions up, we need to resample them to a common sampling interval, and the last step is just to sum all of them up. The result usually looks something like this; it looks very similar to the regular Fourier transform. The nice thing is that this repeating and resampling is very easy and efficient to implement in the frequency domain; it is basically one for-loop.
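A minimal sketch of this sample-repeat-resample-sum procedure, assuming a callable local_loss(t, omegas) that evaluates the temporally local loss l_t on an array of candidate frequencies (the paper performs the repeating and resampling in the frequency domain; plain interpolation is used here only for illustration):

import numpy as np

def solve_omega_globally(local_loss, T, K=4096):
    # l_t is periodic in omega with period 2*pi/t, so sampling it within its
    # first period is enough; the period is then tiled t times and resampled
    # onto a common grid before the losses of all time steps are summed.
    grid = np.linspace(0.0, 2 * np.pi, K, endpoint=False)   # common omega grid
    total = np.zeros(K)
    for t in range(1, T + 1):
        period = 2 * np.pi / t
        # Sample l_t only within its first period ...
        omegas = np.linspace(0.0, period, K, endpoint=False)
        values = local_loss(t, omegas)
        # ... then copy-and-paste that period t times and resample onto the
        # common grid before adding it to the running sum.
        tiled_omegas = np.arange(K * t) * (period / K)
        total += np.interp(grid, tiled_omegas, np.tile(values, t))
    return grid[np.argmin(total)]

For the linear case, local_loss could for instance evaluate l_t(ω) = ||x_t − A·Ω(ωt)||² for the current estimate of A.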
OK, so now I have shown, for both algorithms, how to compute the global optimum with respect to ω. Let's have a look at what these algorithms can do. We have a couple of results, some theoretical and some practical, and I will start with the theoretical ones. One result is that the linear Fourier algorithm has universal approximation properties for finite data sets: if you give me a data set from 0 to T, I can always overfit it, which follows very simply from basic properties of the Fourier transform. This is also analogous to Cover's theorem, because it likewise requires projecting the data into an N-dimensional space, and just like with Cover's theorem, this trivial solution does not generalize, so it is not a very useful property in practice. If we look at infinite data streams instead, then the Koopman algorithm is more expressive than its Fourier counterpart. This is due to a fact I alluded to earlier, namely that for some signals the Fourier algorithm would need an infinitely big parameterization where the Koopman algorithm requires only a finite-dimensional one.

The algorithms I introduced today have a very close relationship to Bayesian spectral analysis, and because of this relationship we can prove error bounds fairly easily. What we can show is that the prediction error grows linearly with the prediction horizon t: the further out you go in your prediction, the higher the error will be, but it grows only linearly, which is a very nice property. Another nice property is that the performance degrades linearly with the noise variance, while the error shrinks super-linearly with the amount of data provided, capital T. So the error grows linearly with the prediction horizon but shrinks super-linearly with the amount of data.

Let's try to predict a very simple nonlinear oscillator. This is a nonlinear oscillator with artificial temporal patterns that mimic temporal patterns at an hourly rate, and we compare the Fourier algorithm and the Koopman algorithm to LSTMs. Because LSTMs are usually not meant for long-term predictions, we take the prediction of the LSTM and feed it recursively back, integrating out the LSTM. On the x-axis you can see the prediction horizon, and on the y-axis the cumulative error. The LSTMs do pretty badly, and the Koopman algorithm does better than the Fourier algorithm. Let's try to understand why this is the case. These are the actual predictions in the time domain over the prediction horizon. In the beginning, all of the algorithms do fairly well, but if you go out pretty far into the future, things start to break, and they break differently. LSTMs seem to be biased frequency estimators: they get the nonlinear shape right, but they slowly go out of phase, which is catastrophic if you are evaluating your algorithm with a mean squared error; it would be better for the LSTM to just predict zero.
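For reference, a minimal sketch of the recursive baseline described above (predict_next stands for any trained one-step-ahead model, e.g. an LSTM wrapped in a function, and is an assumption here rather than the talk's exact setup):

def roll_out(predict_next, history, horizon):
    # Integrate out a one-step model: each prediction is fed back as the
    # newest input so the model can be unrolled far beyond its training range.
    window = list(history)
    predictions = []
    for _ in range(horizon):
        y = predict_next(window)        # one-step-ahead prediction
        predictions.append(y)
        window = window[1:] + [y]       # slide the window, feeding the prediction back
    return predictions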
So why does the Koopman algorithm do better than the Fourier algorithm? The reason is that the Fourier algorithm needs many more Fourier modes to appropriately model the nonlinearity, and the small Fourier modes get drowned out by noise; you can see the small Fourier modes slowly drifting out of phase. The Koopman algorithm, because it can explain the nonlinear shape with its extra nonlinearity, does a lot better: it is more noise robust.

That was the synthetic data set, but something very similar happens when you look at real data. We took a data set that contains the energy consumption of a node in the German electrical grid at an hourly rate; it is the demand of a node in the distribution network. This data has lots of temporal patterns: energy consumption is usually lower on weekends than on weekdays, so there is a weekly cycle; there is a typical daily cycle because consumption usually peaks around 2 p.m.; and there are obvious seasonal patterns because energy consumption in Germany is usually higher in winter than in summer. So we have all of these temporal patterns, and there is also a lot of structural noise, because energy consumption depends on the weather, which is hard to predict. These are the results in the frequency domain, and, supporting the claim I made earlier, you can see that the LSTMs seem to be biased frequency estimators: they are close, but not perfect, in terms of getting the frequencies right. Why does the Koopman algorithm in this case do better than the Fourier algorithm? I am not sure how well visible this is, but the Koopman algorithm can explain the very small harmonics with just a pinch of nonlinearity, and if we tried to model these with the Fourier algorithm it would not work, because they would get drowned out by noise. So the Koopman algorithm gives us noise resilience.

A lot of fluid flows are nonlinear oscillators, and we can model nonlinear oscillators with our algorithms, so we took a simulation of a flow around a cavity, I think at a Reynolds number of around a thousand, and we can almost perfectly model what is going on.

Let me summarize what I presented today. I introduced two algorithms to fit linear and nonlinear oscillators to data; their objectives are non-convex, and nonlinear and non-convex, respectively. There is a ton of real-world phenomena that are quasi-periodic and fit the inductive bias of the algorithms introduced today: gait, for example, weather or space weather, tons of fluid flows that are quasi-periodic systems, epidemiological data, because it is called the flu season for a reason, and lots of other applications such as power systems and sales; the occupancy of a room is also somewhat periodic. If you have a phenomenon that fits this quasi-periodic characteristic, then these algorithms will work well. The code is available on GitHub, and if you have any questions please feel free to email me. Thank you very much for your attention.
Info
Channel: Nathan Kutz
Views: 5,303
Keywords: Fourier forecasting, Koopman theory, time-series prediction, time-series forecasting, machine learning, LSTM, Lange, Brunton, Kutz, spectral methods
Id: RBYFsFr4soo
Length: 22min 28sec (1348 seconds)
Published: Sun Apr 19 2020