Mike Mull | Forecasting with the Kalman Filter

Captions
Looks pretty interesting — so I appreciate you coming over here. This is basically the stuff I want to start with, the things you should take away from this. If you're not already familiar with this material, 30 minutes is probably too little time to absorb it all, but these are the main takeaways I wanted to mention.

Rudolf Kálmán is the man who invented this algorithm back in the 1960s — a very famous and distinguished scientist and inventor, who actually passed away about two months ago. The Kalman filter sounds like something you would use on Star Trek, but it's actually an algorithm, a recursive algorithm, and it's very strongly connected to the idea of state space models of time series. One of the key ideas is that it combines sources of information to give you a better estimate than you could get from either of them independently, and a lot of different time series models can be put into this form. The other thing is that it's ubiquitous: it's used in all sorts of applications. Most of my examples will focus on econometrics applications, but — you know, beyond spaceships — there's probably some implementation of it in the laptop you're using right now.

Some preliminary notes: I'm using a development version of statsmodels. I don't think the release version has any of the state space stuff in it yet — I just build from source, but I think this is the 0.8 release candidate. If you look at the development version of the statsmodels source, you'll actually see there are two versions of Kalman filters. One is an older version, which I think is in the actual current release, used for parameter estimation on what are called autoregressive models. Also, the terminology I'm using for Kalman filters is from a book by Durbin and Koopman called Time Series Analysis by State Space Methods. There's a lot of information on these out there, and they're used in a lot of different domains, so you'll see different variables in the equations, different terminologies, and different ways of explaining how this thing works. This is kind of the econometrics view of the world.

The basic example I'm going to start with is just a very simple time series: a random walk. Say we then have a second time series which is just that first time series plus some noise. The original time series is the blue line; the series with the noise is the green line. What you're asking with the Kalman filter, in a very simple sense, is: if we don't have that original blue time series, what can we infer about it from just this noisy green time series?

This example is actually a very simple version of what's called a state space model — this one is called the local level model. Basically, all state space models have this idea of there being a state equation and an observation equation, sometimes called a measurement equation. The first equation, which described the random walk, is our state equation, and the second equation, which describes that series plus the noise, is the observation equation. It's called the observation equation because the y values are the things we expect to be able to observe, but we're trying to get back to information about the alphas, the actual states.
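In equations, the local level model being described is alpha_{t+1} = alpha_t + eta_t (the state equation, a random walk) and y_t = alpha_t + epsilon_t (the observation equation). Here is a minimal sketch of simulating that pair of series — my reconstruction, not the talk's notebook, with illustrative noise variances:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.RandomState(0)
    n = 100
    sigma_eta, sigma_eps = 1.0, 1.5   # state / observation noise (assumed values)

    alpha = np.cumsum(sigma_eta * rng.standard_normal(n))  # state: a random walk
    y = alpha + sigma_eps * rng.standard_normal(n)         # observation: state + noise

    plt.plot(alpha, "b-", label="state (the blue line)")
    plt.plot(y, "g-", label="state + noise (the green line)")
    plt.legend()
    plt.show()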
These are just some quick examples of what that might mean. There's the generic idea of having a signal plus noise — that's actually where the idea of this being called a "filter" comes from. You can think of it in terms of signal processing: you have a noisy signal that you can measure, and what you really want to do is filter out the noise and recover the original signal. Or the state may be the position, or position and velocity, of a vehicle — a spaceship or a robot — and what you can actually measure is a reading from a sensor, say a satellite or GPS or something like that. Another example I've seen: you want to anticipate what a commodity price is going to be, but those don't change quickly enough, because you only get information when people bid on that commodity — but people bid on futures more frequently, so if you have futures prices you might be able to infer the commodity price from them. And, as in my trackpad example, there may be a Kalman filter in certain computing devices just to keep track of where your cursor should be based on where you're moving your finger.

This is what a linear state space model looks like in general. I'll cover this in more detail later, but basically all we've added are some terms that make this more generic. In the state equation we now have this matrix T, sometimes called the transition matrix, which describes the dynamics from one step to the next; and the matrix Z in the observation equation, sometimes called the design matrix, relates how the state values are connected to the observation values.

So I'm going to jump into the way statsmodels does this — in particular, the development version that Chad Fulton developed. This slide is hopelessly old-school, much like myself: it's basically a UML class diagram. On the right-hand side is the series of classes in statsmodels that define the time series machinery, at least as it's used in the state space code, and on the left-hand side is another class hierarchy that describes the state space model. The way they're connected is through this class called MLEModel — the maximum likelihood estimation model. What that allows you to do is use the Kalman smoother as a way to determine the parameters given a time series: if you have a time series as input and you need to estimate means and variances, MLEModel will let you do that using the machinery of statsmodels.

What I'm going to do now is not something you should do in general: I'm going to create a KalmanFilter instance that specifies that first simple model, just by manually plugging in the various matrices of the state space model. In this case they're literally trivial, because they're all matrices of ones or constants, but this is enough to define a state space model with the KalmanFilter class. Once I've defined it, I'm going to simulate it, so you'll see a simulation here that's fairly similar to the pair of time series I showed in the first couple of slides.
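A sketch of what that manual setup might look like against the 0.8-era statsmodels state space API; the module path and the item-assignment style are real, but the variance values are assumptions carried over from the simulation sketch above:

    from statsmodels.tsa.statespace.kalman_filter import KalmanFilter

    # Local level model: one observed series, one state, all system
    # matrices trivially equal to one.
    kf = KalmanFilter(k_endog=1, k_states=1)
    kf['design', 0, 0] = 1.0          # Z: observation = state + noise
    kf['transition', 0, 0] = 1.0      # T: random-walk dynamics
    kf['selection', 0, 0] = 1.0       # R: the state disturbance enters directly
    kf['obs_cov', 0, 0] = 1.5 ** 2    # sigma_eps^2 (assumed)
    kf['state_cov', 0, 0] = 1.0 ** 2  # sigma_eta^2 (assumed)

Every system matrix here is just a one, which is what makes this model "trivial" to plug in by hand.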
This next visualization tries to explain some of the terminology of the Kalman filter and state space methods. When you get a new observation and you're trying to determine the best estimate for the current state, that process is called filtering — I've tried to denote that with a vertical line here. If you're trying to predict points into the future, that's forecasting, or prediction. There's also another process called smoothing. Smoothing goes backwards in time: to smooth the states, you run the Kalman filter forward until you've gotten all of the points in the time series, and then you run a process backwards to, in effect, interpolate new states. The green triangles here are examples of what's called the filtered state. Typically, as you get new observations, you estimate a state — the filtered state — which you hope is closer to the actual state than the observation you got. The vertical distance between the last observed point on the red line and the state is sometimes called the forecast error; it's also called the innovation. That's basically the difference between your best prediction when you don't yet know the next observation and your best prediction once you do.

I won't expect you to digest this right away, but this is one recursion of the filter for this particular simple model; each line corresponds to one step in the process. In the first step we determine what's called the innovation — the forecast error — from the new observation; then we calculate its variance; and then we calculate the filtered state and the forecast state. What we're really trying to determine is a new conditional distribution: a distribution of the state given the observations that we have. We start off with some estimate of the state prior to that observation, and we roughly know the distribution of the possible next states and the distribution of possible errors in that next state. What these histograms are trying to show is that if I know the current state, I can sort of predict the next state just from the distribution I know at that time, which is the blue histogram; the green histogram is the distribution of possible forecast errors for the next state. When we get the next observation, which we're calling y_t, we can calculate the actual value of that forecast error, v_t, and we also, in a sense, know the relationship between the next state and that forecast error.

If this slide seems out of place, or made your head spin a little, I apologize — but roll with me here for a second. There are many ways to explain this idea; this is the one that Durbin and Koopman use, and it seems to make sense to me. It's a simple mathematical lemma that has to do with linear regression — it applies to least squares regression as well. You've got two variables a and b which are jointly distributed — a bivariate normal distribution — and the lemma says that the mean and variance of a given b correspond to the two equations at the bottom. This is important for a couple of reasons: we can use our next state, alpha_t, and the innovation, v_t, as a and b. I'm going to focus on the covariance between the two. If you work through this, it turns out the covariance is this value P_t, which is actually the variance of the state at the current time period.
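For reference, here is a reconstruction of that recursion for the local level model in Durbin and Koopman's notation (this is standard material, not copied from the slide):

    v_t = y_t - a_t                      (innovation / forecast error)
    F_t = P_t + \sigma_\varepsilon^2     (innovation variance)
    a_{t|t} = a_t + (P_t / F_t) v_t      (filtered state)
    P_{t|t} = P_t (1 - P_t / F_t)        (filtered state variance)
    a_{t+1} = a_{t|t},  P_{t+1} = P_{t|t} + \sigma_\eta^2   (forecast step)

and the lemma itself, for jointly normal a and b:

    E(a | b)   = E(a) + Cov(a,b) Var(b)^{-1} (b - E(b))
    Var(a | b) = Var(a) - Cov(a,b) Var(b)^{-1} Cov(b,a)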
You'll see that this particular ratio, P_t / (P_t + sigma_epsilon^2), comes up twice in these equations, and that ratio is called the Kalman gain. I put a box around it because it's kind of the key idea in the Kalman filter. Intuitively, what it means is that if the noise in the observation is small, this ratio is going to be close to 1, which means the value you observe is probably pretty close to the actual state, so you want to move your estimate closer to that observation. If the noise in the measurement is large, on the other hand, this ratio gets progressively smaller, which means you're probably better off staying closer to the state you estimated without the observation. It's also worth noting, using that lemma on the previous page and what's shown here, that this is actually the optimal estimate given a linear system and Gaussian distributions — much like least squares regression.

On that earlier slide we talked about filtering; again, filtering is the idea of trying to find a better estimate for the current state given the latest observation. What you're really trying to find is this conditional distribution. So, using the Kalman filter model we built before, I'm going to cheat and initialize the initial state to something I already know, then bind y — the observations — to the Kalman filter, and run the filter forward, and you end up with something like this: the y values are the observations, the red dashed line is the actual states — which in normal cases we don't know — and the blue dots are the filtered states.

Forecasting is a very similar process, except what we're looking for is the next value in the time series, and the trick is that we don't have an observation for it: we're just trying to estimate the next point in the time series given what we know about previous observations and the state. Again, we're really calculating a conditional distribution. This is a forecast out for, I think, ten time periods given our so-called local level model, and you'll notice it's basically a flat line running from the last estimated state. That's pretty typical of this particular model: once you no longer have observations, the forecasts being carried forward are basically the mean of the very last state you were able to determine through filtering. And again, this is the simplest case of forecasting.

Just to back up a little: there are a lot of things we've ignored in this discussion so far. One is that I assumed I already knew these parameters — the variances of the time series I'm starting with — which is generally not the case. Normally you need to take the data you have and estimate these parameters from it; that's essentially what the Kalman smoother does, and that's why, when you're using the models in the statsmodels state space package, you typically want something derived from the MLEModel class, because that class will do this process for you in the normal statsmodels time series fashion. We also typically don't really know what the first state is, so there needs to be some way to initialize that first state.
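Picking the filtering and forecasting demonstration back up in code: initialize_known, bind, and filter are real methods on the statsmodels KalmanFilter, but the initial values here and the NaN-padding trick for forecasting are my assumptions about how to reproduce the plots:

    import numpy as np

    # "Cheat" by initializing to a known state, bind the noisy series
    # (y from the simulation sketch earlier), and run the filter forward.
    kf.initialize_known(np.array([0.0]), np.array([[1.0]]))
    kf.bind(y)
    res = kf.filter()
    filtered = res.filtered_state[0]   # the blue dots: filtered state estimates

    # Forecasting with the raw filter: appending NaNs, which the state space
    # machinery treats as missing observations, carries the state forward --
    # producing the flat-line forecast shown on the slide.
    kf.bind(np.concatenate([y, np.full(10, np.nan)]))
    forecast = kf.filter().predicted_state[0, -10:]

(For the initialization discussed next, kf.initialize_approximate_diffuse() is the statsmodels counterpart of a diffuse prior.)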
A fairly common method, and one that seems to work pretty well in a lot of cases, is what's called a diffuse prior, which is kind of a fancy term for guessing: basically you say the mean is 0 and the variance is ridiculously large, and you start from there.

When you move on to more complicated state space models, the recursion for the Kalman filter gets a little more complicated. Again, I realize this is not super easy to read, but the same idea holds: you've got three lines of equations at the top, one of which is calculating the next mean value and, on the right-hand side, one of which is calculating the variance of the distribution. Mainly what's happening is that you're introducing the T matrix — the transition matrix — to make the dynamics a little more interesting, and the Z matrix to make the relationship between the state and the observation more interesting.

The next example I'm going to show is what statsmodels calls an unobserved components model. If you work with time series models you might also call this a structural time series model. It's a type of model where the time series is divided up into a bunch of components — a random level, a trend, possibly a seasonal aspect, and so forth. I'm going to do a really simple structural model which just has a random trend. The data I'm starting with here is industrial production data from FRED, the Federal Reserve's data service, and the model is what's called a local level model with trend, or local linear trend model. This is what it looks like in state space form: the three equations at the top describe the model, and the ones at the bottom describe its state space form. And this is basically how you set it up in statsmodels: I'm saying I have a level and I have a trend, and both of them are stochastic — random — and then I fit that data set to get the parameters I need. I'm not showing the summary of the model because it fills up way too much space on the slide.

I'm not sure if you can actually see it here, but when you forecast with this local linear trend model, what you basically end up with is a line with a more or less constant slope. It's kind of hard to see the slope here, but there is a bit of one. Since our T and Z matrices don't vary with time, the Kalman filter will have reached a steady state by this point, so every forecast after the observations end is just going to give us this constant trend. The gray part is the uncertainty — the confidence interval — of the forecast, and the basic idea it's trying to convey is that the further you get from having an actual observation, the more uncertainty you have. I think that's fairly intuitive, but it's really well spelled out in the math of the Kalman filter.
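The local linear trend model here has the standard form mu_{t+1} = mu_t + nu_t + xi_t and nu_{t+1} = nu_t + zeta_t for the state, with observation y_t = mu_t + epsilon_t. Below is a sketch of the statsmodels setup being described; pulling INDPRO with pandas_datareader is my assumption about how the FRED data was obtained:

    import statsmodels.api as sm
    from pandas_datareader.data import DataReader

    indprod = DataReader('INDPRO', 'fred')   # industrial production (assumed series code)

    # Local linear trend: a level and a trend, both stochastic.
    mod = sm.tsa.UnobservedComponents(indprod, level=True, trend=True,
                                      stochastic_level=True,
                                      stochastic_trend=True)
    res = mod.fit()
    fcast = res.get_forecast(steps=24)   # mean: the roughly constant-slope line
    ci = fcast.conf_int()                # the gray uncertainty band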
Another reason state space models are popular with economists in particular is that a lot of time series models can be put into state space form, including the whole zoo of what are called ARIMA models. This is a simple example with what's called an autoregressive model: the top equation is the normal autoregressive time series model, and the stuff at the bottom is its state space form — the design and transition matrices, plus the R matrix, which is called the selection matrix. Basically, the R matrix maps the shocks — the random parts of the series, which now come from a multivariate distribution — into the state equation. Since there's only a one and a zero here, it says I just need one random component: if you look at the previous slide, we just have a single epsilon there, so the R matrix is selecting one element from this bivariate distribution to add as an error term.

Now I'm going to build a model using what's called a SARIMAX model, which again is one of the implementations in the statsmodels state space library. What I'm trying to accomplish is to forecast electricity demand in the industrial sector. Having read some papers on this, the common idea is that demand is affected by industrial production and by the price of competing energy sources. This is what industrial electricity demand looks like, and probably the first thing you notice is that there's a fairly prominent seasonal aspect to it. So the first thing you want to do with this type of model is figure out what that seasonality is; this is monthly data, so typically that means a 12-month season. This is the statsmodels seasonal_decompose function, which does a lot of the work for you to show the seasonal parts and the trend in your data. The top series is the original data; the line in the second row down is the trend; the third row down is the seasonal component; and the part at the bottom is the residuals. Ideally the residuals would look like white noise; here they look like they probably still have a little bit of structure to them.

I don't want to go into this too deeply, but if you're doing these types of ARIMA models, the thing you have to do is figure out where the autocorrelations are in your data. Typically you difference the time series to find the seasonal aspect, difference it again to find any other autocorrelations, and then plot the autocorrelation and partial autocorrelation diagrams. These are also features in statsmodels — the plot_acf and plot_pacf functions will plot these autocorrelations for you. Roughly, what this shows me is that there are significant components at the 12-month lag in both the partial autocorrelation and the autocorrelation, and possibly a significant lag at one month. It turns out the model works better if I include the autoregressive part at that lag but not the moving-average part.

Again, I think it got cut off, but this is how you set up this type of SARIMAX model in statsmodels. In this particular example all I'm using is the original time series — there's no so-called exogenous data, no regression variables in here. And this is what a forecast of the series looks like using the state space model and the Kalman filter: you can see that it does pick up the seasonal part and the autoregressive part of the time series, and the gray part is the confidence interval of the forecast.

The next model is basically the same thing, except in this case I am including the extra, so-called exogenous variables — the industrial production variable and the energy price variables. The difference with this one is that to forecast into the future, you also need future values of the exogenous variables.
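A sketch of that SARIMAX workflow — decomposition, ACF/PACF inspection, then the seasonal model. The series name elec is a placeholder for the monthly industrial electricity demand data, and the exact orders are my assumptions, chosen to be consistent with the description (an AR term at lag 1, seasonal behavior at lag 12, no moving-average part):

    from statsmodels.tsa.seasonal import seasonal_decompose
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # elec: a pandas Series of monthly industrial electricity demand (placeholder).
    seasonal_decompose(elec, freq=12).plot()   # trend / seasonal / residual panels

    # Inspect autocorrelations after a seasonal difference.
    plot_acf(elec.diff(12).dropna(), lags=36)
    plot_pacf(elec.diff(12).dropna(), lags=36)

    # Autoregressive term at lag 1 plus a seasonal AR term.
    res = SARIMAX(elec, order=(1, 0, 0),
                  seasonal_order=(1, 1, 0, 12)).fit(disp=False)
    fcast = res.get_forecast(steps=24)

    # The exogenous version: pass the regressors, and supply their future
    # values when forecasting -- which is why you need them in advance.
    # res = SARIMAX(elec, exog=X, order=(1, 0, 0),
    #               seasonal_order=(1, 1, 0, 12)).fit(disp=False)
    # fcast = res.get_forecast(steps=24, exog=X_future)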
In fact, that's probably why you would build a model like this: those exogenous variables may be easier to obtain than the variable you're trying to forecast. But in this case, what I've done is cheat and leave out the last two years, treating those as the data I'm predicting. What this shows is that, again, the green line is the forecast, the gray area is the uncertainty in the forecast, and the black dots are the original series.

Since there's enough time, I just want to go over a couple of extensions to the Kalman filter real quick. One common one is the extended Kalman filter. One thing I probably didn't make clear in the earlier slides is that this is essentially linear dynamics — you're multiplying the previous time step by a matrix, so it's a linear relationship. A lot of the time that doesn't work, so people created the extended Kalman filter, which allows you to have nonlinear dynamics. There's a variation called the unscented Kalman filter, and recently there's been a lot of interest in what are called particle filters. They're much more computationally intensive and more difficult to explain, but they take a lot of the assumptions out of the process: there's no assumption about the distribution of the data and no assumption about things being linear.

And that's about it. I recommend Durbin and Koopman's book if you want to learn about this; Koopman also wrote another book on state space analysis which is a little simpler. And there's a blog post called "How Kalman Filters Work" on the Uncommon Lab website — if you can only read one thing about these, I highly recommend it. It's a very long and involved post, but it covers everything from the basic Kalman filter to particle filters, and if you can work through it, it's a really good piece of work. That's it for me.

[Audience question, partly inaudible, about covariance estimation for an ensemble Kalman filter and the width of the uncertainty band.] Yeah, I think part of the reason is that this is not a particularly good model. It turns out that some of these energy variables, which are supposed to be good predictors of electricity demand, don't seem to be — for instance, coal price does not really seem to be that good an indicator of industrial electricity demand, but I left it in just to see what would happen. So I think the issue of not having a very good model is more to blame here than not having more observations.

[Audience question about whether the filter keeps updating over time.] I'm not entirely sure that's true — the Kalman filter should, to some extent, update as you go. One issue I did not have time to explore here, just for time constraints, is that the dynamics in the state equation can be time-dependent. In most of these models the filter will reach a steady state at some point, and after that it doesn't really update anymore, but you can have situations where you make those matrices vary by time, and then it will update them continuously as you go.

[Audience question about missing observations.] I personally have not had situations like that, but it is actually one of the key features of the Kalman filter that it deals really well with missing observations. In fact, forecasting with the Kalman filter is effectively the exact same thing as predicting the state without an observation. For instance, one of the variables you can use for this electricity model is gasoline prices, and, for reasons I don't understand, when
you download that from the federal government, sometimes they just leave out months. But it's relatively easy — much easier with a Kalman filter and state space models — to handle that kind of data than it is with normal time series estimation methods.

[Audience question about implementations in other languages.] That's probably the most developed area for Kalman filters. I believe that Uncommon Lab blog post does most of its work in MATLAB as well, and I don't know of any particle filter implementations in Python yet.

[Audience question about the slides.] Yeah, I put them up about an hour ago. The catch is that this is an actual IPython notebook, so when you look at it on GitHub, or if you download it, it won't look like the slides unless you install RISE. There's a link to it from my Twitter handle, kwikstep ("k-w-i-k-step").

[Audience question about running just the filter, and performance.] You can, and the base implementation is in Cython and C, so even if you have a lot of data it's still pretty fast.

[Audience question about cross-validation.] I guess I think of them as being in slightly different realms, but you could conceivably use cross-validation to do an out-of-sample validation of a Kalman filter. It's kind of hard to do cross-validation on time series — it's sort of a black art. But I guess the places where you would choose Kalman filters over more traditional time series techniques are, one, when you have non-stationary data — and if you don't know what non-stationary means, you probably don't care — and two, when you have missing observations. Those are really the two cases where the Kalman filter is superior to normal time series models.
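To make that missing-observations point concrete, here is a small toy sketch (my own example, not from the talk): the statsmodels state space models accept NaN entries in the series and simply propagate the state through the gaps, so missing months need no special handling.

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    y = np.array([1.2, 1.3, np.nan, 1.5, 1.6, np.nan, np.nan, 1.9, 2.0, 2.1])
    res = SARIMAX(y, order=(1, 0, 0), trend='c').fit(disp=False)
    print(res.predict())   # in-sample predictions cover the missing points too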
Info
Channel: PyData
Views: 29,751
Id: GmSXhmbv5Zg
Length: 38min 49sec (2329 seconds)
Published: Fri Sep 23 2016