Integration, Cointegration, and Stationarity

Video Statistics and Information

Captions
Hello everyone, and welcome. This is the video for the Integration, Cointegration, and Stationarity lecture, part of the Quantopian Lecture Series. We can get started by going over to Learn and Support, clicking Learn, then going to the Lectures link right here. Click that, scroll down to lecture 19, and we'll clone that right into our research environment with that link and get right into it.

So what is the point of this lecture? What we're going to be covering here is some very basic statistical assumptions, starting with this idea of stationarity. We're going to use that to talk about orders of integration and, finally, cointegration, and how this all ties together in algorithmic trading contexts.

Now, what is stationarity? There are a few different definitions of stationarity, but what we're going to be working with is something called wide-sense stationarity. What that means is that the mean and variance, or standard deviation (I'm going to be using these interchangeably), are constant throughout the entire life of the time series. So if we look at any two points in time, no matter where you are, the mean and standard deviation are the same as they were before. What this basically boils down to is that a stationary time series is all drawn from the same exact probability distribution, all with the same parameters. So in the case of it being drawn from the normal distribution, it has the same mean and variance no matter where it is.

Now, why is this important? What it really comes down to is that this is a fundamental assumption that underlies a lot of our statistical tests and a lot of our statistical models. For example, in autoregressive models, where we build a regression model of the future using terms of the time series in the past, or moving average models, where we build a forecast of the future using coefficients that are a moving average of values of the time series' past, all of these techniques are based on this idea of stationarity.
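As a rough illustration of the wide-sense definition given above, the following Python sketch draws one series with a constant mean and one whose mean grows with t, then compares the mean and standard deviation of the first and second halves of each. This is not the lecture notebook's code: `generate_series` and `half_stats` are hypothetical helper names, and the seed, sample size, and trend slope are arbitrary assumptions.

```python
import random
import statistics


def generate_series(n, mean_fn, sigma=1.0, seed=0):
    """Draw n normal samples whose mean at step t is mean_fn(t) (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.gauss(mean_fn(t), sigma) for t in range(n)]


def half_stats(series):
    """Return ((mean, std) of the first half, (mean, std) of the second half)."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (
        (statistics.mean(first), statistics.stdev(first)),
        (statistics.mean(second), statistics.stdev(second)),
    )


n = 2000  # assumed sample size, chosen only to make the comparison stable

# Constant mean: every sample comes from the same normal distribution.
stationary = generate_series(n, lambda t: 0.0)
# Mean depends on t: the distribution shifts as time goes on.
trending = generate_series(n, lambda t: 0.01 * t)

(s_m1, s_sd1), (s_m2, s_sd2) = half_stats(stationary)
(t_m1, t_sd1), (t_m2, t_sd2) = half_stats(trending)

print(f"stationary: first half mean={s_m1:.2f} std={s_sd1:.2f}, "
      f"second half mean={s_m2:.2f} std={s_sd2:.2f}")
print(f"trending:   first half mean={t_m1:.2f} std={t_sd1:.2f}, "
      f"second half mean={t_m2:.2f} std={t_sd2:.2f}")
```

For the constant-mean series the two halves should report roughly the same mean and standard deviation, while for the trending series the means differ by a wide margin. This is only an eyeball check, not a hypothesis test.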
And what it comes down to is that you're fitting to a probability distribution whenever you build a model. You're fitting to some underlying characteristics of the data, and you want those characteristics to stay the same; you want your assumptions to continue to hold into the future. So if there is some sort of shift, some trend in the mean or standard deviation or any of the other parameters of the probability distribution that your process is drawn from, then your model is going to have been built on something entirely different from the current regime that you're in. It's based on entirely different assumptions, so there's no reason to believe that it's going to continue to hold. That's why we need to know about stationarity and to test for it in time series analysis.

So let's just import packages here and define this generate_datapoint function, which is just going to draw a sample from the normal distribution for a given mean and standard deviation. We'll just be using this to generate time series. Then we have series A here. We're just going to generate a stationary time series from the normal distribution with parameters 0 and 1, and we'll just plot that so we can see what a stationary time series looks like. And we see, all right, we've got some white noise. That's exactly what this is: white noise is the quintessential stationary time series. It's oscillating all around the same point, it's staying within basically the same range, there's nothing weird going on here. This is fine, and this is just a sanity check really, because we know that this is going to be stationary, since everything is drawn from the same normal distribution.

Now, what does a non-stationary time series look like? Well, as I said before, it's going to be anything that has a trend in the mean or standard deviation or other parameters of the underlying data. So if we generate that right here, we'll say that this mean now depends on t as we generate these hundred
different samples, so the mean is going to be gradually increasing as we go. We have that, and it looks right here like this is very clearly non-stationary. It's a completely different beast from this guy up here, and this is just important to give you an idea of what isn't stationary.

Now, why do we care so much about stationarity, besides the assumptions? Even getting away from the models and statistical tests and everything, some of our most fundamental statistical techniques are useless for us when we don't have stationarity. If we plot this non-stationary time series again and try to take the mean, just a point estimate of the mean indicated by this red line here, we see that this isn't going to be informative at all. Our typical descriptive statistics just don't mean anything. We're not going to be able to use this to look at the future or the past; we're only going to be able to look at it exactly in this middle period here, and it doesn't really make sense to look at something that isolated when we want a mean of the entire time series. That's the extra information that we're trying to glean.

We could do something like a moving average, where we have a moving window across the entire length of the time series. We take the mean of stuff here and gradually calculate the mean as the window moves, and it'll smooth out this curve a little bit. But this introduces a period of lag equal to essentially how long your moving average window is, so we're going to be getting the information that we want a lot later than we actually want it, compared to if we had just taken a point estimate. This is one of the major disadvantages of non-stationary time series.

Now, how do we actually figure out whether a time series is stationary or not? Luckily, we do have statistical hypothesis tests designed to check for stationarity, and we're just going to make use of one of the most popular ones here, the augmented Dickey-Fuller test, and just define
this check_for_stationarity function and use it to test whether A and B are stationary. We'll run these two cells, and we see, okay, A is likely stationary and B is likely non-stationary. That makes total sense given the definitions; it's just a sanity check confirming it.

The thing is, with a statistical hypothesis test we do have a risk of false positives or false negatives. That's what is signified by this cutoff here: we have a 1% chance of getting a false positive. So let's try something with a more subtle trend in the mean. It's so obvious that this one is going up, so let's add some deterministic function like sine, which we know is going to oscillate between -1 and 1. It's bounded and it's deterministic, but it's a trend that we might not be able to pick up with this statistical test. Let's see if we can get that.

Looking at it, well, it looks a lot like the white noise process from before. It's just going up and down, it's oscillating all around the same area, but we know by definition that this is a non-stationary time series, that there's a trend. This illustrates why it's dangerous to just look at graphs to tell you whether you're right or not. Ideally, we should be looking at graphs to tell us whether we are wrong, whether there is something that is obviously not characteristic of a stationary time series. Looking at this we would probably say it was stationary, but we know it isn't.

Unfortunately, or fortunately I guess, our statistical test fails in this case. We know that the series isn't stationary, but the test is telling us that it is. This illustrates the limitations of these tests. The ideal thing to do would be to try a few more tests, given we have the advantage of knowing that this is non-stationary. If we were working with real data, we would likely just end up moving forward with it regardless; it's difficult to pick up on these kinds of things without knowing otherwise. We would proceed forward, and that's totally fine. More often than not
this statistical hypothesis test is going to give you the right answer, so it's okay to place some faith in it, especially when we don't really understand the underlying relationship, and especially if it's something more subtle like this.

So let's have a look at an integrated of order zero time series. We have A from before, which was stationary, and if we plot that here, well, we know it's integrated of order zero because it's stationary. So here's an integrated of order zero time series. If we want to build a time series that's integrated of order one, what we do is cumulatively sum an integrated of order zero time series, basically a poor man's integration, which is where the name "orders of integration" comes from. If we do that here, plot it, and call it A1, we see that we're getting a little bit smoother. This is starting to look a lot more like a price process, so we can kind of see how financial time series tie into this idea. If we want something that's integrated of order two, we just take an integrated of order one time series and cumulatively sum it. I'll call that A2 right here, and we see we're getting smoother and smoother. We can continue in this fashion until we get a time series that's integrated of order n, just by cumulatively summing an integrated of order zero time series n times.

Now, to go in the other direction, what we need to do is take the first-order differences. First we're going to introduce a little bit of mathematical notation, namely this backshift operator L. If L is applied to any element of a time series, it outputs the previous element of that time series. So here we see L applied to x_t yields x_{t-1}, and we can represent the first-order differences x_t - x_{t-1} as (1 - L) x_t. Now, this isn't super important; it's somewhat obtuse mathematical notation. What's really important to take away from this is this notion of first-order differencing. So if we have an integrated
of order n time series and we take the first-order differences n times, then we're going to be left with a series that's integrated of order 0.

So let's try this out on some real-life data. Let's pull some Microsoft stock prices from 2014 to 2015 using our built-in Quantopian functions. We'll run that here and check it for stationarity. If you've ever seen a price process before, you know that it's not going to be stationary just by looking at it. We can plot that right here, and we see that obviously this is a non-stationary time series. But this looks a little bit like an integrated of order 1 time series, so if we take the first-order differences, maybe we'll be left with a stationary time series at the end of it. We'll use the pandas functions here to take the first-order difference of this time series, and we'll check for stationarity using our function from before. We see that these returns, these additive returns, are indeed likely stationary. As a sanity check, let's also do the multiplicative returns using the percent change. If we do that here, we see that these are also likely stationary, and these again look like white noise.

Now, something to be concerned about when looking at real-life data is that if we've tested and determined that at some point in the past the return series was stationary, or that the price series was integrated of order 1, that's not necessarily going to be true in the future. This is why we need to constantly be checking whether our base assumptions hold when we build a model, because things can shift, especially when we're not really expecting them to. Generally we assume that return streams are going to be stationary. This is a base assumption of a lot of financial mathematics, because it's rooted in the idea that prices are log-normally distributed, and therefore the return series is going to be stationary, or normally distributed, as a result.

Now that we've covered stationarity and orders
of integration, we can finally get into cointegration. But first we have to define what a linear combination is. If we have some set of time series X1, X2, and so on and so forth up to Xk, we can construct a linear combination of these separate time series by multiplying them by a variety of different constants and adding them up. For example, if we have Y = 5 X1 + 2 X2, then Y is a linear combination of X1 and X2. It can be as long as we like: we can use all k time series, as here in the theory, or we can use a smaller example with only a few time series, like X1 and X2.

Now, if we want to have a cointegrated set of time series, we need a set of integrated of order 1 time series X1 through Xk, and if there is some linear combination of all these X's such that the linear combination is integrated of order 0, then that set of time series is cointegrated. So if X1, X2, and X3 are all integrated of order 1, and this linear combination is integrated of order 0, then these time series are cointegrated.

Cointegration is a useful idea because when we're dealing with real-life data, or any data in general, and we're trying to build a model on it, we don't have a lot of information to go on. Our statistical techniques can tease out a little bit of the meaning, but there's only so much we can go on. If we have a stationary time series, that's all we really know besides its parameters, and we can test to figure out what sort of distribution it's drawn from. However, if we have a set of cointegrated time series, this indicates that there is some sort of subtle relationship between the time series involved. So if X1 and X2 are cointegrated, then there is some underlying driving factor between X1 and X2 that could clue us in to this greater relationship between them. It gives us one more bit of data to go on.

So let's just simulate what a cointegrated set of time
series looks like. If we have a hundred samples of this random normal distribution, we'll cumulatively sum them here, and that's going to be an integrated of order 1 time series. That'll be our X1, and we'll say that X2 is equal to X1 plus some additional random noise. We plot them here and we see, all right, it looks like they're moving together. This makes sense: X2 basically looks like X1 with a little bit of noise on top.

Now, we know that X1 is integrated of order 1 by definition, because we took a stationary time series and cumulatively summed it to get it. But let's just double-check to make sure that X2 is integrated of order 1. We'll take the first-order differences here and check the result for stationarity. If we do that, we see that Z, the differences of X2, is likely stationary. Awesome. By construction we already kind of know what this linear combination is going to be, so we just say that Z is equal to X2 minus X1. We'll plot that and check it for stationarity here, and we see, all right, Z is likely stationary. That means that X1 and X2 are cointegrated. Fantastic.

Now again, real-life data isn't going to be as nice as simulated data. We're not going to know specifically what the linear combination is between the things we're trying to test, so we need some tool that we can use to figure out the coefficients of a linear combination. In this case, what we're going to use is linear regression. It's a simple, quick-and-dirty way to try to get this beta out, which we can then use as part of the linear combination. So let's pull some financial data, ABGB and FSLR from 2014 to 2015, right here, and then let's plot it. We see that, all right, it looks like they could be moving together. We're not really sure; FSLR seems like it's moving around a lot more than ABGB is. So let's calculate the linear regression between these and see if the beta will allow us to construct a linear combination that's
integrated of order zero. We'll use statsmodels, a Python package, here to do a simple ordinary least-squares regression, and we'll pull out what beta is. We get that beta in this case is 1.536, so about 1.54. We'll define Z, our linear combination, as X2 minus beta times X1, check it for stationarity, and plot Z. We see that, all right, Z is likely stationary. Awesome. If we plot Z, it looks a little funky here, but the test is telling us it's stationary, and we're relatively confident in the test, so we can conclude that during this period of 2014 to 2015, ABGB and FSLR are cointegrated.

Now again, similar to the issue with stationarity, this is only a forecast. We can't be sure that two stocks that were cointegrated in the past are going to continue to be cointegrated in the future, especially since we have more things going on; we have an entire other relationship to consider. Generally, when we're looking for cointegrated stocks, we're going to be looking within some universe where we think that there's already going to be an underlying relationship. So we'll look at stocks within the same industry, or we'll look at stocks in closely related industries. We'll see if individual shipping companies are cointegrated, or if a manufacturer and a supplier of steel are cointegrated, that kind of thing. But if there's some sort of regime change, if one of these companies gets completely new management and decides to start switching what it's doing, then a pair that was previously cointegrated may become non-cointegrated in the future. We'll just get a lot of nonsense, so if we try to make any bets based on cointegration, we may run into issues.

Now, cointegration is the core idea behind pairs trading, which we cover in another one of the lectures in the lecture series. Pairs trading is basically based on this idea of cointegration, where if we have some linear combination of the stocks that is integrated of order 0, it's going to follow some probability distribution. So we can bet on this spread between stocks based on the linear combination and use that to inform our stock purchases.

Now, in the real world we don't necessarily have to deal with this whole process of constructing a linear regression, then making a linear combination, and testing to see if that's stationary. We do have built-in tests of cointegration, and they basically amount to that. In the statsmodels package, again, we have this coint function, and that'll tell us whether X1 and X2 are cointegrated. Here's the p-value, and we can conclude from this test that yes, they are cointegrated. So if we want to apply the cointegration test to more independent pairs of stocks, we can use the coint function instead and make our lives easier.

Hey everybody, thanks for watching the Quantopian Lecture Series. I just wanted to let you know where you can get more content if you're interested. Here's the Quantopian Lecture Series page; it's available at www.quantopian.com/lectures. It's easiest just to Google "Quantopian lectures", and if you're already on the Quantopian website you can get to it via Learn and Support, then Learn. Every lecture has a notebook. Most lectures will have videos you can watch to follow along, just like the one you're watching right now, and in addition some lectures also have sample algorithms you can clone and play around with, and maybe even use to start developing some of your own trading strategies. In addition to the lecture series we have GitHub, in case you're interested in checking that out; that's github.com/quantopian/research_public. We also have my Twitter account, @thestreetquant. Finally, we also have services for schools and academics, and that's quantopian.com/academia. You can see here some of the offers that we have for professors. Everything is free, but we offer a
little bit more help to educators who want to use the platform. Finally, you can always email me at delaney@quantopian.com; again, that's D-E-L-A-N-E-Y at quantopian.com. Feel free to shoot me any questions. We really appreciate feedback on anything we're doing here. Thanks very much.
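The construction walked through in the lecture (cumulatively sum white noise to get an integrated of order 1 series, difference it to get back, then regress one series on the other to form a spread) can be sketched roughly as below. This is a standard-library-only sketch under stated assumptions, not the notebook's code: `first_difference` and `ols_slope` are hypothetical helpers, the seed and sample size are arbitrary, and no stationarity test is run here. The lecture itself uses statsmodels for the regression and for the augmented Dickey-Fuller and coint tests.

```python
import random
import statistics
from itertools import accumulate


def first_difference(series):
    """Return x_t - x_{t-1} for t >= 1, i.e. (1 - L) applied to the series."""
    return [b - a for a, b in zip(series, series[1:])]


def ols_slope(x, y):
    """Closed-form slope beta of the least-squares line y = alpha + beta * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(42)  # arbitrary seed, only for reproducibility
n = 500                  # assumed sample size

# White noise: integrated of order 0.
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Cumulative sum of white noise: integrated of order 1.
x1 = list(accumulate(noise))
# x2 follows x1 with extra noise on top, so the pair is cointegrated by construction.
x2 = [v + rng.gauss(0.0, 1.0) for v in x1]

# Differencing once undoes the cumulative sum (up to floating-point error).
recovered = first_difference(x1)
max_err = max(abs(r - e) for r, e in zip(recovered, noise[1:]))

# Estimate the coefficient of the linear combination by regression,
# then form the spread z = x2 - beta * x1.
beta = ols_slope(x1, x2)
z = [b - beta * a for a, b in zip(x1, x2)]

print(f"max reconstruction error after differencing: {max_err:.2e}")
print(f"estimated beta: {beta:.3f}")
print(f"std of x1: {statistics.stdev(x1):.2f}, std of spread z: {statistics.stdev(z):.2f}")
```

In this simulated case beta should come out close to 1, matching the Z = X2 - X1 combination used in the lecture. To decide whether the spread is likely stationary, one would still pass `z` to a proper test such as the augmented Dickey-Fuller test rather than rely on the printed statistics.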
Info
Channel: Quantopian
Views: 39,249
Keywords: finance, quantitative finance, statistics, stationarity, integration, orders of integration, cointegration, pairs trading, math, time series, risk, risk analysis, algorithms, algorithmic trading
Id: Pn_RiDbK82M
Length: 21min 23sec (1283 seconds)
Published: Thu Jul 14 2016