Elizabeth Ramirez - Kalman Filters for non-rocket science - PyCon 2016.mp4

Captions
Can you please help me welcome Elizabeth for a talk on Kalman filters. She works at the New York Times. Thank you. (Applause)

Ok, so my name is Elizabeth. I work as a software engineer at the New York Times, and today we're going to talk about Kalman filters for non-rocket science. So, a little background. The Kalman filter is an algorithm named after Rudolf Kalman, and it's basically a predictor-corrector technique by which we recursively calculate the system state at time t k using only the state at the previous time step, u k minus 1 (we're going to explain that later), plus the new information that comes into the system. If you're interested in the details, you can read the original paper, published in 1960, and nothing has really changed much since 1960. This is a picture of Rudolf Kalman receiving the National Medal of Science on October 7, 2009 from President Barack Obama at the White House, just to show how important the Kalman filter has been for the development of science.

Ok, so first, let's talk about the Kalman filter for rocket science. The Kalman filter was first applied to the problem of trajectory estimation for NASA's Apollo space program in the 1960s, and then it was incorporated into the Apollo space navigation computer. It has also been used in the guidance and navigation system of the NASA Space Shuttle and in the attitude control and navigation systems of the International Space Station. In other words, the Kalman filter is mostly used for positioning and navigation systems, AKA rocket science. There is a transcription of the original Kalman filter code used for the Apollo 11 guidance computer, and it is in the public domain; just check it, you are allowed. It was implemented in the 1960s in AGC assembly language, where AGC stands for Apollo Guidance Computer. So it was using very low-level assembly instructions like Count, Compare, and Skip (CCS); Transfer to Storage; Clear and Add; and so on. This is what the code for the original Kalman filter looks like. It's assembler; if you understand it, good for you. (Laughter)

Then, let's talk about Kalman filters for non-rocket science. The Kalman filter can also be used for some types of forecasting problems, for specific time series that can be modeled as a time-varying mean with additive noise. This is because the Kalman filter is a generalization of the least squares model. So let's start with the formulation of the least squares model and the normal equation. Say we have a linear system represented by Au equals b, where A is the matrix of equations, u is the vector of unknowns, and b is a vector representing our measurements. We need to solve for u. I think most of us know from college the case where A is a square matrix, where we have the same number of unknowns as equations. Under certain conditions we can invert A and solve for u, so u equals A inverse b. But what if A is not square? What if it's rectangular, because we have more equations than unknowns, so the number of rows m is bigger than the number of columns n? Then we can't solve the system the way we normally would, because we can't invert a rectangular matrix. We say the system is overdetermined: we have too many equations and too few unknowns, and we're trying to fit m measurements with a small number of parameters n.
I think this is better illustrated by an example. Let's say we have a hundred points that fit a straight line, but the equation for a line is just Cx plus D, so we only need two parameters, C and D, yet we have a hundred equations with two unknowns. That's a problem. What we can do for this overdetermined system is find the best estimate for u. We cannot find an exact solution, but we can find an estimate by minimizing the squared error. Each equation introduces an error, and the total squared error is E, the squared norm of b minus A times the estimate, which we denote u hat. So let's try something: let's multiply both sides of the original equation by A transpose. On the left hand side we get A transpose A u hat equals A transpose b, and this is known as the normal equation. The good thing is that A transpose A is now a square matrix. We can check the dimensions: A transpose is n by m, A is m by n, so A transpose A is n by n. If the original A has independent columns, meaning the columns are not linear combinations of each other, then A transpose A is invertible and we can solve for the estimate u hat that minimizes the squared error. Our solution is u hat equals (A transpose A) inverse, A transpose b. So this is how least squares works.

We need to talk about errors, because when we take measurements we get errors. So let's talk a little bit about the covariance matrix. Say we run an experiment and take several measurements. The errors of the measurements might be independent, or there might be some correlation between them, simply because the measurements were taken by the same device, or something like that. We're going to consider the case where the errors are independent, so the covariance matrix, represented by a big sigma here, is diagonal, because the expected value of the product of different errors e i and e j is zero. We have a diagonal matrix, and the entries on the diagonal are sigma i squared, the expected value of the square of each error. This matrix is always symmetric, and it's always positive definite because each variance sigma i squared is necessarily positive, and we know that positive definite matrices have nice properties. Because of this, it can be shown that the choice C equals big sigma inverse minimizes the expected error in the estimate of u hat. C is called the weighting matrix, and the normal equation, now including the weighting matrix, becomes the weighted normal equation: A transpose C A u hat equals A transpose C b. Now we can solve for the estimate u hat. In the case where sigma i squared equals 1 and sigma i j equals 0, that is, when our errors have zero mean and unit variance, the problem becomes just the ordinary least squares that we saw before.
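As a minimal sketch of the line-fitting example above, with made-up data and the variable names C and D assumed for illustration, the normal equation can be solved with NumPy like this:

    import numpy as np

    # 100 noisy samples of a line b = C*x + D, with true C = 2 and D = -1
    x = np.linspace(0, 10, 100)
    b = 2 * x - 1 + np.random.randn(100)

    # Overdetermined system A u = b, with unknowns u = [C, D]
    A = np.column_stack([x, np.ones_like(x)])

    # Normal equation: (A^T A) u_hat = A^T b
    u_hat = np.linalg.solve(A.T @ A, A.T @ b)
    print(u_hat)  # roughly [2, -1]; np.linalg.lstsq(A, b) gives the same answer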
Well, we need a little bit more background for the formulation of the Kalman filter, so we need to talk about recursive least squares. Let's use an example to illustrate this. Say we have the average of 99 numbers, b1 through b99, and that average is u hat 99. Now a new number, b100, arrives. How do we find the new average, u hat 100, without adding everything up all over again? I'm pretty sure this problem came up when I took the GRE test and I didn't know how to solve it, but, ok.

We want to use only the old average, u hat 99, and the new data point, b100. There are two ways to express this. The first way is: u hat 100 equals 99 over 100 times the old average, plus 1 over 100 times the new measurement. The second way is: u hat 99 plus 1 over 100 times (b100 minus u hat 99). We like the second form better because it's presented as an update to the previous u hat 99. The expression b100 minus u hat 99 is called the innovation, because it tells us how much information is contained in the new measurement b100. You can see that if b100 is equal to u hat 99 there's no new information in the new data, so the innovation is zero and the average doesn't change. The other important part is the fraction 1 over 100, which is called the gain factor, and in part that's where the filter terminology comes from, because it's an input modulated by a gain factor.

Okay, just a little bit more math. Let's generalize the example of the average to the same linear system we saw before, Au equals b. Let's start from an old estimate for the equation A old u equals b old. When new information arrives, call it A new and b new, we add a new row to A old and a new row to b old, and the solution of this new system, which you can see in the second row, leads to a new estimate u new. But we don't want to solve the entire system every time new information comes in. So how can we update u old to u new using only A new and b new? Let's go back to the normal equation and form, with this new information, A transpose A. You can see that A transpose is this horizontal matrix, and if we calculate the product of the two, it's A old transpose A old plus A new transpose A new, an old part plus a new part. Now, if we work on the right hand side of the equation, A transpose b is A old transpose b old plus A new transpose b new, but we know from the original equation that b old equals A old u hat old, so we substitute that back in. Using those two expressions in the normal equation and simplifying, we get the recursive least squares in the form we already said we like: u hat new equals u hat old plus (A transpose A) inverse A new transpose times (b new minus A new u hat old). This is the expression for recursive least squares. The gain matrix here is (A transpose A) inverse A new transpose, and it's often denoted by K, for Kalman, obviously.

If we go back to the average problem, the least squares solution for 99 equations and one unknown is u, the average. The A matrix in this case is a column vector of ones, and when the 100th equation comes in (a new measurement, u equals b100), the only thing we need to do is add a new row to A, which is just another one. If we apply the recursive least squares expression to this, we get the same expression we found before. In this case we could also have had a weighting matrix to measure the reliability of u hat, but in this example the hundred equations were equally reliable and had the same unit variance, which is why C, the covariance matrix, doesn't show up here. So now, finally, we have enough elements to discuss the Kalman filter formulation.
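A tiny sketch of that recursive update, using the running-average example (the data here is made up for illustration):

    import numpy as np

    b = np.random.randn(100)      # one hundred measurements
    u_old = b[:99].mean()         # old average of the first 99 numbers

    # innovation: how much new information the 100th measurement carries
    innovation = b[99] - u_old
    # the gain factor 1/100 modulates the innovation
    u_new = u_old + (1 / 100) * innovation

    assert np.isclose(u_new, b.mean())   # matches recomputing the full average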
So let's see how this works for forecasting in time series. The Kalman filter is basically a time-varying least squares problem: in discrete time, we produce an estimate u hat k at each time t k. The whole idea of the Kalman filter is to update our best least squares estimate of the state vector u hat after a new observation comes in, so we want to compute the change that updates what we predicted. This works if we can express the new estimate as a linear combination of the old estimate, u old, and the new observation, b new. You can see here how the linear combination should look: u new equals some L times u old plus some K times b new.

There are some nice features of the Kalman filter that are important to our implementation. First, the Kalman filter is recursive. We don't store all the old observations b old, because those measurements are already used in the estimate u hat old. Normally the state vector u is much shorter than the measurement vector b, which grows in length with each measurement, so the filter is efficient in this sense. Second, we can write the linear combination for the Kalman filter as: u new equals u old plus a gain matrix that multiplies the innovation. So again, the innovation between the old estimate and the new measurement is modulated by the gain matrix. The reliability of our estimate u hat is given by the error covariance matrix P, which describes the statistical properties of u hat based on the statistical properties of the measurements b. We see here that the covariance matrix P is (A transpose C A) inverse. In the Kalman filter implementation we also need to update the covariance matrix when a new measurement b new arrives, because it brings new information for the covariance.

Let's present the algorithm. There are some other steps before arriving at this prediction-correction algorithm; it relies on the property that A transpose C A is a block tridiagonal matrix, so the forward elimination that solves the system is one recursion and the back substitution is another recursion. The forward recursion finds the best estimate of the final state, and very often that's all we need, so we don't calculate the back substitution. We're not going to do back substitution here, because back substitution just adjusts earlier estimates to account for later measurements. That process is called smoothing, and it produces the full solution to the normal equation, but we're not going to do smoothing today, just the forward recursion.

The forward recursion is a two-step process: prediction and correction. For the prediction we use all the information we have through time k minus 1 to generate the prediction; then, when a new measurement comes in at time k, we add a correction. Putting all this together, the Kalman filter produces the final state u hat k given k. You can see here, for the prediction, how we predict u hat k given k minus 1, that is, the prediction taking into account all the information we have up to time k minus 1. The F matrix here is called the state transition matrix; it establishes how the state vector changes from one time step to the next, and it is really particular to the specific system we're trying to solve. And then we have the expression for the estimate of the covariance matrix, given the old information.
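Written out in the notation of the talk, the prediction step is usually given as:

    \hat{u}_{k|k-1} = F \, \hat{u}_{k-1|k-1}
    P_{k|k-1} = F \, P_{k-1|k-1} \, F^T + Q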
We have here a Q, which is the covariance matrix of the system error. It appears because every time we add a new equation we add an error that is different from the error of the measurement; that system error is Q. For the correction, we need to calculate the gain matrix K using the prediction of the covariance matrix, and then we use K to update our prediction to u hat k given k, because we now have the new measurement at time k, and to update the covariance matrix at time k.

So finally we're getting closer to the code. As mentioned in the abstract for this talk, NumPy provides all the core linear algebra we need for predictions and updates, and also the data structures to hold all the equations of the state. Let's implement two functions, one for the prediction step and one for the correction step. We want to calculate the following variables at each step: the predicted mean and covariance of the state (the predicted u and predicted P, before the measurement comes in), the estimated mean and covariance of the state after the measurement (the corrected ones), the innovation, and the filter gain.

This is the data we require for making a prediction: the previous state vector u, the previous covariance matrix P, the state transition matrix F, and the process noise covariance matrix Q. And this is extremely simple. Is it clear? Okay. The predict step just calculates the dot product between the transition matrix and the old estimate of the state, then calculates the covariance matrix using the state error and the F matrix, and returns both. The correction step requires the predicted state and covariance, the A matrix, which is our matrix of observation equations, and now we account for b, which can be a scalar or a vector of observations at time k, plus the covariance matrix of the system and the covariance matrix of the error in the observations. Again, this is really simple. The correct step first calculates this C, given by the Kalman filter update formula, then computes the Kalman gain matrix, which is this K here, and finally adds the correction to the predicted state and to the covariance matrix. So this is really simple.

So how do we simulate this? We want to initialize all our data structures and run the prediction-correction loop for some number of iterations, and at the end compare the predicted and corrected state with the actual measurement that came into the system. Here I'm using a time step of 0.1, an A matrix representing some equations, and a state vector initialized with zeros, and we're going to use random measurements centered at the predicted state value. We're using random numbers sampled from the standard normal distribution, consistent with the unit variance we're using for Q and R, just for the sake of simplicity. We run this a hundred times, storing our predictions, corrections, and measurements in these lists; in each iteration we just run predict, correct, and generate a new measurement at the end of the loop. Then we print the predicted final estimate, the corrected final estimate after the new measurement came in, and the actual measured state. For this system we predicted minus 23.41, then we corrected and compared it with the actual measurement.
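The code on the slides is not included in these captions, so here is a minimal NumPy sketch along the lines the speaker describes. The function and variable names, the two-dimensional constant-velocity state, and the single observed component are assumptions made for illustration, not necessarily what was shown on screen.

    import numpy as np

    def predict(u, P, F, Q):
        # Predict the state and its covariance from the previous estimate
        u_pred = F @ u
        P_pred = F @ P @ F.T + Q
        return u_pred, P_pred

    def correct(u_pred, P_pred, A, b, R):
        # Innovation covariance and Kalman gain matrix K
        S = A @ P_pred @ A.T + R
        K = P_pred @ A.T @ np.linalg.inv(S)
        # Add the correction to the predicted state and covariance
        u_corr = u_pred + K @ (b - A @ u_pred)
        P_corr = P_pred - K @ A @ P_pred
        return u_corr, P_corr

    dt = 0.1
    F = np.array([[1.0, dt], [0.0, 1.0]])   # assumed constant-velocity transition
    A = np.array([[1.0, 0.0]])              # assume we observe the first component
    Q = np.eye(2)                            # unit-variance process noise
    R = np.eye(1)                            # unit-variance measurement noise

    u = np.zeros((2, 1))                     # state vector initialized with zeros
    P = np.eye(2)

    predictions, corrections, measurements = [], [], []
    for _ in range(100):
        u, P = predict(u, P, F, Q)
        predictions.append(u.copy())
        # random measurement centered at the predicted state, unit variance
        b = A @ u + np.random.randn(1, 1)
        u, P = correct(u, P, A, b, R)
        corrections.append(u.copy())
        measurements.append(b.copy())

    print("predicted:", predictions[-1].ravel())
    print("corrected:", corrections[-1].ravel())
    print("measured: ", measurements[-1].ravel())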
It's not that bad, so it's kind of useful. This is really useful, for example, when you're using your GPS and you go through a tunnel and lose the signal, but you keep seeing the GPS working; that's just because the system is predicting the next state. In this plot you can see the predictions as circles, the corrections as x's, and the actual measurements as triangles. It's interesting that the correction always lands between the prediction and the measurement. I don't remember exactly why, but that's how it should behave.

So, conclusions. We have seen that the Kalman filter is a viable forecasting technique for time series, for some specific time series that can be modeled in a certain way, and that it is not limited to rocket science. The Kalman filter is really similar to the least squares method, but it has some computational advantages in terms of efficiency. Almost everything in this talk has been extracted from the book Computational Science and Engineering by Gilbert Strang; Gilbert Strang is the guy in the picture. If you feel like going deeper into Kalman filters, I recommend his book or the videos available on MIT OpenCourseWare. There are also openly available packages for Kalman filters that take into account other things, like correlated errors and some other features, for example pykalman. So again, if you want to check this out and understand it better, just download pykalman or go to Gilbert Strang's book, whatever you want. Okay, I think that's it. Thank you very much. I hope this wasn't too heavy on the math and too light on the Python, but I hope you enjoyed the talk. Let me know afterwards if you have any questions. This is my email address and my Twitter account, so just feel free to get in touch. That's it. (Applause)
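For reference, a minimal pykalman sketch, assuming a simple one-dimensional random-walk state and illustrative made-up data, might look like this:

    import numpy as np
    from pykalman import KalmanFilter

    measurements = np.random.randn(100)        # illustrative 1-D observations

    kf = KalmanFilter(transition_matrices=[[1.0]],
                      observation_matrices=[[1.0]],
                      initial_state_mean=[0.0])

    # filter() runs the forward predict/correct recursion;
    # smooth() would also run the backward (smoothing) pass mentioned in the talk
    state_means, state_covariances = kf.filter(measurements)
    print(state_means[-1], state_covariances[-1])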
Info
Channel: PyCon 2016
Views: 9,057
Id: k_MpfzMc9PU
Length: 30min 2sec (1802 seconds)
Published: Fri Jun 17 2016