How autocorrelation works

Captions
At this point we need to take a little bit of a detour to talk about autocorrelation. It's a tool that we're gonna need to dig into the temperature data and to find the patterns that we need to make predictions. It's not as complicated as it sounds, but it does bear some explanation.

So to start with, let's talk about linear regression. We're gonna use some Python code here as an example. Let's assume that we start with an array of things that look like temperatures. It's a potentially long array, but it has a certain number of days in it, and we also create an array that represents those days. We can plot this now using days on the x-axis and temperature on the y-axis. The first day, day zero, has the 68.2 associated with it. We can plot that, and so forth for all the rest of our temperatures across all the rest of our days.

Linear regression is just the process of taking a line and fitting it to all of those points. To be a little bit more specific, this line is described by where it crosses the y-axis, the temperature axis: on day zero, what's the estimated temperature? This is called the intercept. And then, for each day that we change, how much does the temperature change? That's the slope, the rise over the run, the temperature change per day. These two things fully describe the line. Once you know the slope and you know where it crosses the y-axis, you know exactly what that line is.

Once you have that line (for now we're just going to assume that we have it to start with; we'll come back to how we get it in a minute), there is a deviation. Each measured data point won't lie exactly on that line but will be off by just a little bit, so we'll just call this the temperature error. If we write a little function to calculate the temperature error, it would look something like this, where first we take the slope, which is this temp change per day, and then our estimate of the temperature on any given day is just this equation of
the line: y = b + mx, or a + bx, or however you write it. The estimate equals the intercept plus the slope times your x value. In our case our x value is days, so our estimated temperature will be whatever the result of that equation is. Those will be points lying exactly on that line. And then the measured temperature on any given day, temps, will be off from that by temp error. So this is how we can calculate the deviation of our measured points from our line.

Now, it is a characteristic of the best-fit line that it takes all of those deviations, all of those temp errors, and minimizes the sum of the squares of them. If you take them all, square them all, and add them up, that quantity is minimized by this best-fit line. If we were to change the intercept at all, if we were to change the slope at all, that sum of squared errors would go up. So we minimize that, and this is the Python for doing that. We use polyfit, which fits a polynomial to our data described by x and y, in our case days and temps. The last number is the order of the polynomial: first order is a straight line, second order would be quadratic, third order would be cubic. Since we are fitting a line, we've put a 1 there, and it returns the slope and the intercept of that line. And now, by passing that slope and intercept in to the find temp error function, we can actually find what our errors are.

Correlation is closely related to linear regression. Once we fit that line and minimize the square of those errors, the square of those deviations, then correlation helps us to put a number on how closely the data points hug that line. If they lie exactly on the line, then correlation, often given by r, is 1. If they're close to the line it can still be quite high, like 0.9 or above. If they're in the neighborhood of the line, that would be more moderate correlation, somewhere in the 0.5 range, plus or minus. And then if the points don't make any attempt to hug the line at all,
that's very low. The lowest it can possibly get is zero, but anything lower than 0.2 or 0.3 is often considered quite low. Now, I say the lowest you can get is zero; that's actually not true. If the line is slanted downward as it goes to the right, then the slope is negative and the correlation is negative, so measured points lying exactly on a downward-pointing line would have a correlation of minus 1. So r equals 0 is no correlation at all, r equals 1 is perfect correlation in a positive direction, and r equals minus 1 is perfect correlation in a negative direction.

When we go to calculate this in Python, the way we do it is we take our data, temps and days, and we use the corrcoef call, and it spits out something that looks like this. What it actually does is take our two quantities, days and temps, and find the correlation between days and days, days and temps, temps and days, and temps and temps. It's a symmetric operation, so the upper right-hand triangle of this square will be the same as the lower left-hand triangle: days versus temps is the same as temps versus days. In addition, the correlation of anything with itself is 1, so the diagonal will be ones all the way down. So the number we really want is row 0, column 1, and that pulls off the correlation between days and temps. This is the number that represents the correlation of the data that we see on the right.

Now, so far we've been correlating temperature versus days. We can actually correlate temperature with itself. As we saw just now, if we just straight-up correlate temperature with itself, the correlation is 1; it's perfect. That's good. Because temperature is a time series, we can do this trick where we shift it by one day and we correlate whatever the temperature is on day i with whatever it was on day i minus 1, so we're always comparing the temperature on one day to the temperature on the day before. We would expect this to be correlated in the case of temperatures
because the temperature from one day to the next does change, but usually not as much as it changes from one month to the next or from one season to the next. So we would expect this to have a reasonable amount of correlation. This correlating a time series with itself, with some amount of shift, is what autocorrelation is. To calculate this, we would just use the corrcoef call, put in the original data set and then the shifted data set, and find the coefficient of correlation between them.

Now, we don't have to just do a single-day shift. We could, for instance, shift it by four days and calculate that. The full autocorrelation function takes a sequence of these shifts. We know that at a shift of zero the correlation is 1, but we can cycle through these and, for each shift, calculate the autocorrelation and make a sequence of these. With a shift of zero the correlation is 1, perfect. When we shift by one day it goes down quite a bit. There's still quite a bit of correlation, but it's not 1. And then with each successive day that correlation gets a little less, until about day five, six, seven it gets down and it's bouncing around in the noise; that's just random amounts of variation due to happenstance. This is intuitively what happens in a time series like weather, where the weather one day is reasonably similar to the weather the next day, but the further out you get, the larger the swing you would expect. And so this is what it looks like in an autocorrelation plot: the further away you get, the lower the correlation.

There's also a very useful term called partial autocorrelation. What that is, is we start again with our temperatures, and we think of them now as being errors, residuals that we haven't been able to fit yet. So in our very first correlation, with a shift of one, we take the temperature on day i versus the temperature on day i minus 1, and then we find the
residuals, what's left over after we fit: what are those errors, what are those deviations from the best-fit line? Now, with autocorrelation we would next find the correlation between the original temperatures and the temperature on day i minus 2. Instead, what we are going to do is take the residuals after fitting day i minus 1, the leftover errors, and we're gonna plot those against the temperature on day i minus 2, the day before that. Then we'll fit a line to that, find the error, find the residuals, and then we'll repeat this. So each day we try to fit the leftover error, we take it out, and then we pass the residuals on to the next day to be fit.

What this does is it lets us see, as we plot that: we go each day, and after we calculate the coefficient of correlation between that day and the day before, we fit the line to it, we calculate the estimate along all the points on that line, and then we subtract that estimate from what we're still trying to fit, and we get updated residuals. So what that means is on day one we get our original shift-one autocorrelation, because the residuals that we're fitting there are actually the original temperatures. But after that it falls off very quickly. That means that what we're doing is using the temperatures from two days ago to try to correct for any errors that we missed by using yesterday as an estimate. So in the case of temperature, once we use the previous day to estimate today's temperature, there's actually not a lot of information left in the days that come before, so it falls down to the noise very quickly.
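The code shown on screen is not part of the captions, so here is a minimal sketch of the regression, correlation, and autocorrelation steps described in them. Only `polyfit` and `corrcoef` are named in the captions. The synthetic temperature series, the helper names `make_temps`, `find_temp_error`, and `autocorr`, and all the parameter values are assumptions made for illustration.

```python
import numpy as np


def make_temps(n_days=2000, seed=0):
    """Made-up temperatures (an assumption): each day leans toward the day before."""
    rng = np.random.default_rng(seed)
    temps = np.empty(n_days)
    temps[0] = 68.2
    for i in range(1, n_days):
        temps[i] = 68.0 + 0.7 * (temps[i - 1] - 68.0) + rng.normal(0.0, 2.0)
    return temps


def find_temp_error(days, temps, slope, intercept):
    """Deviation of each measured temperature from the line."""
    temp_estimate = intercept + slope * days
    return temps - temp_estimate


def autocorr(x, shift):
    """Correlation between x on day i and x on day i - shift."""
    if shift == 0:
        return 1.0
    return np.corrcoef(x[shift:], x[:-shift])[0, 1]


temps = make_temps()
days = np.arange(temps.size)

# Fit a first-order polynomial (a straight line); polyfit returns slope, then intercept.
slope, intercept = np.polyfit(days, temps, 1)
temp_error = find_temp_error(days, temps, slope, intercept)

# corrcoef returns a 2x2 matrix; row 0, column 1 is days versus temps.
r = np.corrcoef(days, temps)[0, 1]

# Autocorrelation function: one correlation per shift, starting at shift 0.
acf = [autocorr(temps, shift) for shift in range(8)]
print("slope:", slope, "intercept:", intercept, "r:", r)
print("autocorrelation by shift:", np.round(acf, 3))
```

With real temperature data you would replace `make_temps()` with your own array. The shape of the result should be similar: 1 at a shift of zero, then a decline as the shift grows.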
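The residual-passing procedure described for partial autocorrelation can be sketched as follows. This follows the steps as spoken in the captions, not any particular library. Statistics packages usually compute partial autocorrelation from a regression on all the intermediate shifts, so their numbers can differ from this sketch. The function name `residual_passing_pacf`, the helper `make_temps`, and the synthetic data are assumptions for illustration.

```python
import numpy as np


def make_temps(n_days=2000, seed=0):
    """Made-up temperatures (an assumption): each day leans toward the day before."""
    rng = np.random.default_rng(seed)
    temps = np.empty(n_days)
    temps[0] = 68.2
    for i in range(1, n_days):
        temps[i] = 68.0 + 0.7 * (temps[i - 1] - 68.0) + rng.normal(0.0, 2.0)
    return temps


def residual_passing_pacf(x, max_shift):
    """Correlate leftover residuals with the series shifted 1, 2, ... days back."""
    n = x.size
    residuals = x.astype(float)  # at the first shift, the "residuals" are the raw data
    coefficients = []
    for shift in range(1, max_shift + 1):
        # Drop the earliest day so residuals on day i line up with x on day i - shift.
        target = residuals[1:]
        predictor = x[: n - shift]
        coefficients.append(np.corrcoef(predictor, target)[0, 1])
        # Fit a line, then pass what it could not explain on to the next shift.
        slope, intercept = np.polyfit(predictor, target, 1)
        residuals = target - (intercept + slope * predictor)
    return coefficients


temps = make_temps()
pacf = residual_passing_pacf(temps, 5)
print("residual-passing coefficients by shift:", np.round(pacf, 3))
```

For a series like this one, where each day depends mostly on the day before, the first value matches the shift-one autocorrelation and the later values should sit near zero.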
Info
Channel: Brandon Rohrer
Views: 155,844
Rating: 4.9204969 out of 5
Keywords: autocorrelation, correlation, statistics, machine learning, signal processing, time series, feature engineering, data science, prediction
Id: ZjaBn93YPWo
Length: 12min 29sec (749 seconds)
Published: Mon Apr 09 2018