9. Volatility Modeling

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: All right. Let's see. We're going to start today with a wrap up of our discussion of univariate time series analysis. And last time we went through the Wold representation theorem, which applies to covariance stationary processes, a very powerful theorem. And implementations of covariance stationary processes with ARMA models. And we discussed estimation of those models with maximum likelihood. And here in this slide I just wanted to highlight how when we estimate models with maximum likelihood we need to have an assumption of a probability distribution for what's random, and in the ARMA structure we consider the simple case where the innovations, the eta_t, are normally distributed white noise. So they're independent and identically distributed normal random variables. And the likelihood function can be maximized to obtain the maximum likelihood parameters. And it's simple to implement the limited information maximum likelihood where one conditions on the first few observations in the time series. If you look at the likelihood structure for ARMA models, the density of an outcome at a given time point depends on lags of that dependent variable. So if those are unavailable, then that can be a problem. One can implement limited information maximum likelihood where you're just conditioning on those initial values, or there are full information maximum likelihood methods that you can apply as well. Generally though the limited information case is what's applied.

Then the issue is model selection. And with model selection the issues that arise with time series are issues that arise in fitting any kind of statistical model. Ordinarily one will have multiple candidates for the model you want to fit to data. And the issue is how do you judge which ones are better than others. Why would you prefer one over the other? And if we're considering a collection of different ARMA models then we could say, fit all ARMA models of order p,q with p and q varying over some range, p from 0 up to p_max, q from 0 up to q_max. And evaluate those different (p, q) models. And if we consider sigma tilde squared of p, q being the MLE of the error variance, then there are these model selection criteria that are very popular: the Akaike information criterion, the Bayes information criterion, and Hannan-Quinn. Now these criteria all share the same first term, the log of the MLE of the error variance. So they don't differ at all in that term; they differ only in the second, penalty term. But let's focus first on the AIC criterion. A given model is going to be better if the log of the MLE for the error variance is smaller. Now is that a good thing? Meaning, what is the interpretation of that practically when you're fitting different models? Well, the practical interpretation is that the variability of the model about what you're predicting, our estimate of the error variance, is smaller. So essentially a model with a smaller error variance is better. So we're trying to minimize the log of that variance. Minimizing that is a good thing. Now what happens when you have many sort of independent variables to include in a model?
Well, if you were doing a Taylor series approximation of a continuous function, eventually you'd sort of get to probably the smooth function with enough terms, but suppose that the actual model, it does have a finite number of parameters. And you're considering new factors, new lags of independent variables in the autoregressions. As you add more and more variables, well, there really should be a penalty for adding extra variables that aren't adding real value to the model in terms of reducing the error variance. So the Akaike information criterion is penalizing different models by a factor that depends on the size of the model in terms of the dimensionality of the model parameters. So p plus q is the dimensionality of the autoregression model. So let's see. With the BIC criterion the difference between that and the AIC criterion is that this factor two is replaced by log n. So rather than having a sort of unit increment of penalty for adding an extra parameter, the Bayes information criterion is adding a log n penalty times the number of parameters. And so as the sample size gets larger and larger, that penalty gets higher and higher. Now the practical interpretation of the Akaike information criterion is that it is very similar to applying a rule which says, we're going to include variables in our model if the square of the t statistic for estimating the additional parameter in the model is greater than 2 or not. So in terms of when does the Akaike information criterion become lower from adding additional terms to a model? If you're considering two models that differ by just one factor, it's basically if the t statistic for the model coefficient on that factor is a squared value greater than two or not. Now many of you who have seen regression models before and applied them, in particular applications would probably say, I really don't believe in the value of an additional factor unless the t statistic is greater than 1.96, or 2 or something. But the Akaike information criterion says the t statistic should be greater than the square root of 2. So it's sort of a weaker constraint for adding variables into the model. And now why is it called an information criterion? I won't go into this in the lecture. I am happy to go into it during office hours, but there's notions of information theory and Kullback-Leibler information of the model versus the true model, and trying to basically maximize the closeness of our fitted model to that. Now the Hannan-Quinn criterion, let's just look at how that differs. Well, that basically has a penalty midway between the log n and two. It's 2*log(log n). So this has a penalty that's increasing with size n, but not as fast as log n. This becomes relevant when we have models that get to be very large because we have a lot of data. Basically the more data you have, the more parameters you should be able to incorporate in the model if they're sort of statistically valid factors, important factors. And the Hannan-Quinn criterion basically allows for modeling processes where really an infinite number of variables might be appropriate, but you need larger and larger sample sizes to effectively estimate those. So those are the criteria that can be applied with time series models. And I should point out that, let's see, if you took sort of this factor 2 over n and inverted it to n over two log sigma squared, that term is basically one of the terms in the likelihood function of the fitted model. 
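As a concrete sketch of how these criteria can be computed, here is a minimal R example on a simulated AR(2) series; the data, the candidate orders, and the use of base R's arima() are illustrative assumptions, and the penalties follow the form on the slide with k equal to p plus q.

set.seed(1)
n <- 500
x <- arima.sim(model = list(ar = c(0.5, -0.3)), n = n)   # simulate an AR(2) series

# Fit AR(p) candidates and compute the three criteria from the slide by hand;
# base R's arima() reports its own AIC, but here we use log(sigma2.hat) + penalty.
crit <- t(sapply(0:5, function(p) {
  fit <- arima(x, order = c(p, 0, 0), method = "ML")
  k   <- p                               # p + q with q = 0, the slide's dimensionality
  s2  <- fit$sigma2                      # MLE of the innovation variance
  c(p   = p,
    AIC = log(s2) + 2 * k / n,
    BIC = log(s2) + log(n) * k / n,
    HQ  = log(s2) + 2 * log(log(n)) * k / n)
}))
crit                                     # the smallest value in each column is preferred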
So you can see how this criterion is basically manipulating the maximum likelihood value by adjusting it for a penalty for extra parameters. Let's see. OK.

Next topic is tests for stationarity and non-stationarity. There's a famous test called the Dickey-Fuller test, which essentially evaluates the time series to see if it's consistent with a random walk. We know, we've been discussing lecture after lecture, how simple random walks are non-stationary. And the simple random walk is given by the model up here, x_t equals phi x_(t-1) plus eta_t. If phi is equal to 1, right, that is a non-stationary process. Well, in the Dickey-Fuller test we want to test whether phi equals 1 or not. And so we can fit the AR(1) model by least squares and define the test statistic to be the estimate of phi minus 1 over its standard error, where phi hat is the least squares estimate and the standard error is the least squares standard error of that estimate. If our coefficient phi is less than 1 in modulus, so this really is a stationary series, then root n times the estimation error, phi hat minus phi, converges in distribution to a normal with mean 0 and variance 1 minus phi squared. And let's see. But if phi is equal to 1, OK, so just to recap, that second to last bullet point is basically the property that when norm phi is less than 1, then our least squares estimate, centered at the true value and scaled by root n, is asymptotically normally distributed with mean 0 and variance 1 minus phi squared. If phi is equal to 1, then it turns out that phi hat is super-consistent with rate 1 over T.

Now this super-consistency is related to statistics converging to some value, and what is the rate of convergence of those statistics to different values. So in normal samples we can estimate the mean by the sample mean, and that will converge to the true mean at a rate of 1 over root n. When we have a non-stationary random walk, the independent variables matrix is such that X transpose X over n grows without bound. So if we have y equal to X beta plus epsilon, and beta hat equal to X transpose X inverse times X transpose y, then beta hat is distributed approximately normal with mean beta and variance sigma squared times X transpose X inverse. This X transpose X matrix, when the process is non-stationary, a random walk, grows very quickly. X transpose X over n actually grows to infinity in magnitude, it becomes unbounded, whereas X transpose X over n, when the process is stationary, is bounded. So anyway, that leads to the super-consistency, meaning that the estimate converges to the true value much faster, and so this normal distribution isn't appropriate. And it turns out there's the Dickey-Fuller distribution for this test statistic, which is based on integrals of diffusions, and one can read about that in the literature on unit roots and tests for non-stationarity. So there's a very rich literature on this problem. If you're into econometrics, basically a lot of time's been spent in that field on this topic. And the mathematics gets very, very involved, but good results are available.

So let's see an application of some of these time series methods. Let me go to the desktop here if I can. In this supplemental material that'll be on the website, I just wanted you to be able to work with time series, real time series, and implement these autoregressive moving average fits and understand basically how things work.
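Here is a small base-R sketch of that Dickey-Fuller statistic on a simulated random walk; the data are made up, and the packaged augmented test from the optional tseries package is shown only as a comment.

set.seed(2)
n  <- 250
rw <- cumsum(rnorm(n))                 # a pure random walk, so phi = 1

# Least-squares AR(1) fit; the Dickey-Fuller statistic is (phi.hat - 1) / SE(phi.hat)
fit     <- lm(rw[-1] ~ rw[-n])
phi.hat <- coef(fit)[2]
se.hat  <- coef(summary(fit))[2, "Std. Error"]
df.stat <- (phi.hat - 1) / se.hat
df.stat                                # referred to the Dickey-Fuller, not the normal, distribution

# A packaged augmented version, if the optional tseries package is installed:
# library(tseries); adf.test(rw)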
So in this supplemental material, it introduces loading the R libraries and Federal Reserve data into R, basically collecting it off the web. Creating weekly and monthly time series from a daily series, which sounds like a trivial thing to do, but when you sit down and try to do it, it gets involved. So there are some nice tools that are available. There's the ACF and the PACF, the auto-correlation function and the partial auto-correlation function, which are used for interpreting series. Then we conduct the Dickey-Fuller test for unit roots and evaluate stationarity or non-stationarity of the 10-year yield. And then we evaluate stationarity and cyclicality in the fitted autoregressive model of order 2 to monthly data. And actually 1.7 there, that cyclicality issue, relates to one of the problems on the problem set for time series, which is looking at, with second order autoregressive models, is there cyclicality in the process? And then finally looking at identifying the best autoregressive model using the AIC criterion.

So let me just page through and show you a couple of plots here. OK. Well, there's the original 10-year yield collected directly from the Federal Reserve website over a 10 year period. And, oh, here we go. This is nice. OK. OK. Let's see, this section 1.4 conducts the Dickey-Fuller test. And it basically determines that the null hypothesis of non-stationarity, a unit root, is not rejected. And so, with the augmented Dickey-Fuller test, the test statistic is computed. Its significance is evaluated by the distribution for that statistic. And the p-value tells you how extreme the value of the statistic is, meaning how unusual it is. The smaller the p-value, the more unlikely the value is. The p-value is the probability, under the null hypothesis, of getting as extreme or more extreme a value of the test statistic, and an extreme test statistic is evidence against the null hypothesis. So in this case the p-value is basically 0.2726 for the monthly data, which says that a unit root in the process is not rejected, so the data are consistent with a unit root. Let's see. OK.

There's a section on understanding partial auto-correlation coefficients. And let me just state what the partial correlation coefficients are. You have the auto-correlation functions, which are simply the correlations of the time series with lags of its values. The partial auto-correlation coefficient is the correlation between the time series and, say, its p-th lag that is not explained by all lags lower than p. So it's basically the incremental correlation of the time series variable with the p-th lag after controlling for the others. And then let's see. With this, in section eight here there's a function in R called ar, for autoregressive, which basically will fit all autoregressive models up to a given order and provide diagnostic statistics for that. And here is a plot of the relative AIC statistic for models of the monthly data. And you can see that basically it takes all the AIC statistics and subtracts the smallest one from all the others. So one can see that according to the AIC statistic a model of order seven is suggested for this treasury yield data. OK. Then finally, because these autoregressive models are implemented with regression models, one can apply the regression diagnostics that we had introduced earlier to look at those data as well.

All right. So let's go down now. [INAUDIBLE] OK. [INAUDIBLE] Full screen. Here we go. All right. So let's move on to the topic of volatility modeling. The discussion in this section is going to begin with just defining volatility, so we know what we're talking about.
And then measuring volatility with historical data, where we don't really apply statistical models so much, but we're concerned with just historical measures of volatility and their prediction. Then there are formal models. We'll introduce Geometric Brownian Motion, of course. That's one of the standard models in finance. But also Poisson jump-diffusions, which are an extension of Geometric Brownian Motion to allow for discontinuities. And then there's a property of these Brownian motion and jump-diffusion models, which is that they are models with independent increments. Basically, disjoint increments of the process are independent of each other, which is a key property to keep in mind when we turn to time dependence in the models. There can be time dependence actually in the volatility. And ARCH models were introduced initially to try and capture that. And were extended to GARCH models, and these are the simplest cases of time-dependent volatility models that we can work with and introduce. And in all of these, the mathematical framework for defining these models and the statistical framework for estimating their parameters is going to be highlighted. And while it's a very simple setting in terms of what these models are, the issues that we'll be covering relate to virtually all statistical modeling as well.

So let's define volatility. OK. In finance it's defined as the annualized standard deviation of the change in price or value of a financial security, or an index. So we're interested in the variability of this process, a price process or a value process. And we consider it on an annualized time scale. Now because of that, when you talk about volatility it really is meaningful to communicate levels like 10%. If you think of at what level absolute bond yields vary over a year, it's probably less than 5%. When you think of currencies, how much do those vary over a year? Maybe 10%. With equity markets, how do those vary? Well, maybe 30%, 40% or more. With the estimation and prediction approaches, OK, these are what we'll be discussing. There are different cases.

So let's go on to historical volatility. In terms of computing the historical volatility we'll be considering basically a price series of T plus 1 points. And then we can get T period returns corresponding to those prices, which are the differences in the logs of the prices, or the logs of the price relatives. So R_t is going to be the return for the asset. And one could use other definitions, like the absolute return, not taking logs. It's convenient in much empirical analysis, I guess, to work with the logs because if you sum logs you get the log of the product. And so total cumulative returns can be computed easily with sums of logs. But anyway, we'll work with that scale for now. OK. Now the process R_t, the return series process, is going to be assumed to be covariance stationary, meaning that it does have a finite variance. And the sample estimate of the volatility is just given by the square root of the sample variance, and we're considering an unbiased estimate of that variance. And if we want to convert these to annualized values so that we're dealing with a volatility, then if we have daily prices, well, in the US financial markets are open roughly 252 days a year on average, so we multiply that sigma hat by the square root of 252. And for weekly data, root 52, and root 12 for monthly data.
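As a minimal sketch of that computation, assuming a daily price series (simulated here), the annualized historical volatility can be computed as follows.

set.seed(3)
price <- 100 * exp(cumsum(rnorm(252, mean = 0, sd = 0.01)))   # one year of simulated daily closes

r       <- diff(log(price))       # log returns, R_t = log(P_t / P_(t-1))
sigma.d <- sd(r)                  # unbiased sample standard deviation on the daily scale
vol.ann <- sigma.d * sqrt(252)    # annualize: sqrt(252) daily, sqrt(52) weekly, sqrt(12) monthly
vol.ann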
So regardless of the periodicity of our original data we can get them onto that volatility scale. Now in terms of prediction methods one can apply with historical volatility, there's a lot of work done in finance by people who aren't trained as econometricians or statisticians; they basically just work with the data. And there's a standard for risk analysis called the RiskMetrics approach, where the approach defines volatility and volatility estimates, historical estimates, just using simple methodologies. So let's just go through what those are here. Basically, for any period t, one can define the sample volatility just to be the sample standard deviation of the period-t returns. And so with daily data that might just be the squared daily return. With monthly data it could be the sample standard deviation of the returns over the month, and with yearly data it would be the sample standard deviation over the year. Also with intraday data, it could be the sample standard deviation over intraday periods of, say, half hours or hours. And the historical average is simply the mean of those estimates, which uses all the available data. One can consider the simple moving average of these realized volatilities. And so that basically is using the last m values, for some finite m, to average. And one could also consider an exponential moving average of these sample volatilities, where our estimate of the volatility is 1 minus beta times the current period's sample volatility plus beta times the previous estimate. And these exponential moving averages are really very nice ways to estimate processes that change over time. They're able to track the changes quite well and they will tend to come up again and again. This exponential moving average actually uses all available data. And there can be versions of those where you say, well, let's use not an equal-weighted average like the simple moving average, but exponentially declining weights on the last m values. And that's the exponentially weighted moving average that uses the last m. OK. There we go. OK.

Well, with these different measures of sample volatility, one can basically build models to estimate them with regression models and evaluate them. And in terms of the RiskMetrics benchmark, they consider a variety of different methodologies for estimating volatility, and determine what methods are best for different kinds of financial instruments and different financial indexes. And there are different performance measures one can apply, such as mean squared error of prediction, mean absolute error of prediction, and so forth, to evaluate different methodologies. And on the web you can actually look at the technical documents for RiskMetrics, and they go through these analyses, and if your interest is in a particular area of finance, whether it's fixed income or equities, commodities, or currencies, reviewing their work there is very interesting because it does highlight different aspects of those markets. And it turns out that basically the exponential moving average is generally a very good method for many instruments. And the discounting of the values over time corresponds to having roughly between, I guess, a 45 and a 90 day period in estimating your volatility. And these approaches are, I guess, a bit ad hoc. There isn't much formalism; defining them is basically a matter of what has worked empirically in the past. Let's see.
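A minimal sketch of the exponential moving average idea on simulated daily returns; the recursion is applied to squared returns, RiskMetrics-style, and the decay weight beta = 0.94 is an assumed, conventional daily value rather than anything specified in the lecture.

set.seed(4)
r    <- rnorm(500, sd = 0.01)     # stand-in daily returns
beta <- 0.94                      # decay weight; 0.94 is a common daily choice (an assumption here)

ewma.var    <- numeric(length(r))
ewma.var[1] <- r[1]^2             # initialize with the first squared return
for (t in 2:length(r)) {
  ewma.var[t] <- (1 - beta) * r[t]^2 + beta * ewma.var[t - 1]
}
tail(sqrt(ewma.var) * sqrt(252))  # annualized EWMA volatility path (last few values)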
While these things are ad hoc, they actually have been very, very effective. So let's move on to formal statistical models of volatility. And the first model is Geometric Brownian Motion. So here we have basically a stochastic differential equation defining the model for Geometric Brownian Motion. And Choongbum will be going into some detail about stochastic differential equations and stochastic calculus for representing different processes, continuous processes. And the formulation is basically that the increment of the price process, d S of t, is equal to mu S of t dt, sort of a drift term, plus sigma S of t times d W of t, where sigma is the volatility of the security price, mu is the mean return per unit time, and d W of t is the increment of a standard Brownian motion process, or Wiener process. And this W process is such that its increments, basically the change in value of the process between two time points, are normally distributed with mean 0 and variance equal to the length of the interval. And increments on disjoint time intervals are independent. And well, if you divide both sides of that equation by S of t then you have d S of t over S of t is equal to mu dt plus sigma d W of t. And so the increments d S of t normalized by S of t follow a Brownian motion with drift mu and volatility sigma.

Now with sample data from this process, suppose we have prices observed at times t_0 up to t_n. And for now we're not going to make any assumptions about what those time increments are, what those times are. They could be equally spaced. They could be unequally spaced. The returns, the logs of the relative price changes from time t_(j-1) to t_j, are independent random variables. And their distribution is normal with mean given by mu times the length of the time increment, and variance sigma squared times the length of the increment. And these properties will be covered by Choongbum in some later lectures. So for now we can just take this as true and apply the result. If we fix various time points for the observations and compute returns this way, and if it's a Geometric Brownian Motion, then we know that this is the distribution of the returns. Now knowing that distribution we can engage in maximum likelihood estimation. OK. If the increments are all just equal to 1, so we're thinking of daily data, say, then the maximum likelihood estimates are simple. They're basically the sample mean and the sample variance, with 1 over n instead of 1 over n minus 1 in the MLE. If delta_j varies, then, well, that's actually a case in the exercises. Now, in the class exercise, the issue that is important to think about is this: if you consider a given interval of time over which we're observing this Geometric Brownian Motion process, and we increase the sampling rate of prices over that interval, how does that change the properties of our estimates? Basically, do we obtain more accurate estimates of the underlying parameters? And as you increase the sampling frequency, it turns out that some parameters are estimated much, much better and you get basically much lower standard errors on those estimates. With other parameters you don't necessarily. And the exercise is to evaluate that.
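For the equal-increment case just described, the MLEs are simply the sample mean and the 1/n sample variance of the log returns; a short sketch on a simulated path (the drift and volatility values are made up):

set.seed(5)
price <- 100 * exp(cumsum(rnorm(252, mean = 0.0004, sd = 0.01)))   # simulated path, delta = 1

r          <- diff(log(price))        # r_j ~ N(mu * delta, sigma^2 * delta) with delta = 1
mu.hat     <- mean(r)                 # MLE of the drift per period
sigma2.hat <- mean((r - mu.hat)^2)    # MLE of the variance: 1/n, not 1/(n - 1)
c(mu.annual = 252 * mu.hat, vol.annual = sqrt(252 * sigma2.hat))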
Now another issue that's important is the issue of what is the appropriate time scale for Geometric Brownian Motion. Right now we're thinking of it as: you collect data, and whatever the periodicity of the data is, you take that to be the period for your Brownian Motion. Let's evaluate that. Let me go to another example. Let's see here. Yep. OK. Let's go control-minus here. OK. All right. Let's see. With this second case study there was data on exchange rates, looking for regime changes in exchange rate relationships. And so we have data from that case study on different foreign exchange rates. And here in the top panel I've graphed the euro/dollar exchange rate from the beginning of 1999 through just a few months ago. And the second panel is a plot of the daily returns for that series. And here is a histogram of those daily returns. And a fit of the Gaussian distribution for the daily returns, if our time scale is correct: basically, daily returns are normally distributed, days are disjoint increments in terms of the price change, and so the returns are independent and identically distributed under the model. And they all have the same normal distribution with mean mu and variance sigma squared. OK. This analysis assumes basically that we're dealing with trading days as the appropriate time scale for the Geometric Brownian Motion.

Let's see. One can ask, well, what if trading days really aren't the right time scale, but it's more calendar time. The change in value over the weekend may correspond to price changes, or value changes, over a longer period of time. And so this model really needs to be adjusted for that time scale. The exercise that allows you to consider different delta t's shows you what the maximum likelihood estimates are; you'll be deriving maximum likelihood estimates when we have different definitions of the time scale there. But if you apply the calendar time scale to this euro series, let me just show you what the different estimates are of the annualized mean return and the annualized volatility. So if we consider trading days for the euro it's 10.25%, or 0.1025. If you consider clock time, it actually turns out to be 12.2%. So depending on how you specify the model you get a different definition of volatility here. And it's important to basically understand what the assumptions are of your model and whether perhaps things ought to be different.

In stochastic modeling, there's an area called subordinated stochastic processes. And basically the idea is, if you have a stochastic process like Geometric Brownian Motion or simple Brownian motion, maybe you're observing that on the wrong time scale. You may fit the Geometric Brownian Motion model and it doesn't look right. But it could be that there's a different time scale that's appropriate, and it's really Brownian motion on that time scale. And so formally it's called a subordinated stochastic process. You have a different time function for how to model the stochastic process. And the evaluation of subordinated stochastic processes leads to consideration of different time scales. With, say, equity markets and futures markets, the cumulative volume of trading might really be an appropriate measure of the real time scale, because that's a measure of, in a sense, information flow coming into the market through the level of activity. So anyway I wanted to highlight how with different time scales you can get different results. And so that's something to be evaluated.
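As a rough sketch of the time-scale point, here is a comparison on simulated weekday closes under the simplifying assumption mu = 0 (the exercise derives the exact MLEs for unequal increments); the numbers come from fake data, not the euro figures quoted above.

set.seed(6)
dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 730)
dates <- dates[!(as.POSIXlt(dates)$wday %in% c(0, 6))]     # keep weekdays only
price <- 100 * exp(cumsum(rnorm(length(dates), sd = 0.006)))

r     <- diff(log(price))
delta <- as.numeric(diff(dates))                     # calendar days between closes (1 or 3)

vol.trading  <- sqrt(mean(r^2)) * sqrt(252)          # each return treated as one trading day
vol.calendar <- sqrt(mean(r^2 / delta)) * sqrt(365)  # variance per calendar day, then annualized
c(trading = vol.trading, calendar = vol.calendar)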
In looking at these different models, OK, these first few graphs here show the fit of the normal model with the trading day time scale. Let's see. Those of you who've ever taken a statistics class before, or an applied statistics class, may know about normal q-q plots. Basically, if you want to evaluate the consistency of the returns here with a Gaussian distribution, what we can do is plot the observed, sorted returns against what we would expect the sorted returns to be if they were from a Gaussian sample. So under the Geometric Brownian Motion model the daily returns are a sample of independent and identically distributed random variables from a Gaussian distribution. So the smallest return should be consistent with the expected smallest value in a Gaussian sample of size n. And what's being plotted here is the theoretical quantiles or percentiles versus the actual ones. And one would expect that to lie along a straight line if the theoretical quantiles were well-predicting the actual extreme values. What we see here is that as the theoretical quantiles get high, and they're in standard deviation units, the realized sample returns are in fact much higher than would be predicted by the Gaussian distribution. And similarly on the low end. So that's the normal q-q plot, which is used often in the diagnostics of these models.

Then down here I've actually plotted a fitted percentile distribution. Now what's been done here is, if we modeled the series as a series of Gaussian random variables, then we can evaluate the percentile of the fitted Gaussian distribution that was realized by every point. So if we have a return of, say, negative 2%, what percentile of the normal fit is that? And you can evaluate the cumulative distribution function of the fitted model at that value to get that point. And what should the distribution of these fitted percentiles be if we have a really good model? OK. Well, OK. Let's think. If you consider the 50th percentile you would expect, I guess, 50% of the data to lie above the 50th percentile and 50% to lie below the 50th percentile, right? OK. Let's consider, here I divided it up into 100 bins between zero and one, so this last bin is from the 99th to the 100th percentile. How many observations would you expect to find between the 99th and 100th percentile? This is an easy question. AUDIENCE: 1%. PROFESSOR: 1%. Right. And so in any of these bins we would expect to see 1% if the Gaussian model were fitting. And what we see is that, well, at the extremes there are more extreme values than expected. And actually inside there are somewhat fewer values. And actually this is exhibiting a leptokurtic distribution for the actually realized samples; basically the middle of the distribution is a little thinner and it's compensated for by fatter tails. But if this particular model were right we would basically expect to see a uniform distribution of percentiles in this graph. If we compare this with a fit on the clock time scale, we actually see that clock time does a bit of a better job at getting the extreme values closer to what we would expect them to be. So in terms of being a better model for the returns process, if we're concerned with these extreme values, we're actually getting a slightly better fit with that time scale.
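Here is a compact sketch of the two diagnostics just described, the normal q-q plot and the fitted-percentile histogram, using simulated Gaussian returns as a stand-in (real FX returns would show the fatter tails discussed above).

set.seed(7)
r <- rnorm(2500, sd = 0.006)        # stand-in for daily returns

qqnorm(r); qqline(r)                # sample quantiles vs theoretical Gaussian quantiles

u <- pnorm(r, mean = mean(r), sd = sd(r))    # percentile of each return under the fitted normal
hist(u, breaks = 100,
     main = "Fitted-percentile histogram",   # should look roughly uniform (about 1% per bin)
     xlab = "fitted Gaussian percentile")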
So all right. Let's move back to the notes and talk about the Garman-Klass Estimator. So let me do this. All right. View full screen. OK. All right. So, OK. The Garman-Klass Estimator is one where we consider the situation where we actually have much more information than simply closing prices at different intervals. Basically all transaction data is collected in a financial market. So really we have virtually all of the data available if we want it, or can pay for it. But let's consider a case where we expand upon just having closing prices to having additional information over increments of time that includes the open, high, and low price over the different periods. So those of you who are familiar with the bar graphs that you see whenever you plot stock prices over periods of weeks or months will be familiar with having seen those. Now the Garman-Klass paper addressed how we can exploit this additional information to improve upon our close-to-close estimates.

So let's make some assumptions and set notation. We'll assume that mu is equal to 0 in our Geometric Brownian Motion model, so we don't have to worry about the mean. We're just concerned with volatility. We'll assume that the increments are one, corresponding to daily data. And we'll let little f, between zero and one, correspond to the time of day at which the market opens. So over a day, from day zero to day one, at f we assume that the market opens, and basically the Geometric Brownian Motion process might have closed on day zero here. So this would be C_0, and it may have opened on day one at this value. So this would be O_1. It might have gone up and down and then closed here with the Brownian Motion process. OK. This value here would correspond to the high value. This value here would correspond to the low value on day one. And then the closing value here would be C_1. So the model is that we have this underlying Brownian Motion process actually operating in continuous time, but we just observe it over the times when the market is open. And so it can move between when the market closes and opens on any given day, and we have the additional information. Instead of just the close, we also have the high and low. So let's look at how we might exploit that information to estimate volatility.

OK. Using data from the first period as we've graphed here, let's first just highlight what the close-to-close return is. And that basically gives an estimate of the one-period variance. So sigma hat 0 squared is the single-period squared return. C_1 minus C_0 has a distribution which is normal with mean 0 and variance sigma squared. And if we consider squaring that, what's the distribution of that? That's the square of a normal random variable, which is chi-squared, but it's a multiple of a chi-squared. It's sigma squared times a chi-squared one random variable. And with a chi-squared one random variable the expected value is 1 and the variance is 2. So just knowing those facts tells us that we have an unbiased estimate of the variance parameter sigma squared, and the variance of that estimate is 2 sigma to the fourth. So that's basically the precision of close-to-close returns. Let's look at two other estimates. First, the close-to-open return squared, sigma hat 1 squared, normalized by f, the length of that interval. So we have sigma hat 1 squared equal to O_1 minus C_0, quantity squared, divided by f. OK. Actually, why don't I just do this? I'll just write down a few facts and then you can see that the results are clear. Basically, O_1 minus C_0 is distributed normal with mean 0 and variance f sigma squared. And C_1 minus O_1 is distributed normal with mean 0 and variance (1 minus f) sigma squared. OK. This is simply using the properties of the diffusion process over different periods of time. So if we normalize the squared values by the lengths of their intervals we get unbiased estimates of sigma squared.
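Written out directly, with made-up log prices for C_0, O_1, C_1 and an assumed opening fraction f, the three single-day variance estimates look like this.

f  <- 0.3
C0 <- log(100.0); O1 <- log(100.4); C1 <- log(100.9)   # made-up log prices

sigma2.0 <- (C1 - C0)^2             # close-to-close squared return
sigma2.1 <- (O1 - C0)^2 / f         # close-to-open squared return, scaled by its interval length f
sigma2.2 <- (C1 - O1)^2 / (1 - f)   # open-to-close squared return, scaled by 1 - f
c(sigma2.0, sigma2.1, sigma2.2)     # three estimates of the same one-day variance sigma^2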
And what's particularly significant about these estimates one and two is that they're independent. So we actually have two estimates of the same underlying parameter which are independent. And actually they both have the same mean and they both have the same variance. So if we consider a new estimate, which is basically averaging those two, then this new estimate is unbiased as well, but its variance is basically the variance of this average. So it's 1/2 squared times this variance plus 1/2 squared times this variance, which is a half of the variance of each of them. So this estimate has lower variance than our close-to-close estimate. And we can define the efficiency of this particular estimate relative to the close-to-close estimate as 2. Basically we get double the precision. Suppose you had the open, high, close for one day. How many days of close-to-close data would you need to have the same variance as this estimate? No. AUDIENCE: [INAUDIBLE]. Because of the three data points [INAUDIBLE]. PROFESSOR: No. No. Anyone else? One more. Four. OK. Basically if the variance is 1/2, basically to get the standard deviation, or the variance to be-- I'm sorry. The ratio of the variances is two. So no. So it actually is close to two. Let's see. Our 1/n is-- so it actually is two. OK. I was thinking in standard deviation units instead of squared units. So I was trying to be clever there. So it actually is basically two days. So sampling with this information gives you as much as two days' worth of information. So what does that mean? Well, if you want something that's as efficient as daily estimates you'll need to look back one day instead of two days to get the same efficiency with the estimate.

All right. The motivation for the Garman-Klass paper was actually a paper written by Parkinson in 1976, which dealt with using the extremes of a Brownian Motion to estimate the underlying parameters. And when Choongbum talks about Brownian Motion a bit later, I don't know if you'll derive this result, but in courses on stochastic processes one does derive properties of the maximum of a Brownian Motion over a given interval, and the minimum. And it turns out that if you look at the difference between the high and the low, squared, divided by 4 log 2, this is an estimate of the variance of the process. And the efficiency of this estimate turns out to be 5.2, which is better yet. Well, Garman and Klass were excited by that and wanted to find even better ones. So they wrote a paper that evaluated all different kinds of estimates. And I encourage you to Google that paper and read it because it's very accessible, and it highlights the statistical and probability issues associated with these problems. But what they did was they derived the best analytic scale-invariant estimator, which has this sort of bizarre combination of different terms, but basically it's using values of the high, low, and close normalized by the open. And they're able to get an efficiency of 7.4 with this combination. Now, regarding scale-invariant estimators: in statistical theory there are different principles that guide the development of different methodologies, and one kind of principle involves issues of scale invariance.
If you're estimating a scale parameter, and in this case volatility is telling you essentially how large the variability of this process is, then if you were to, say, multiply your original data all by a given constant, a scale-invariant estimator should be such that your estimator changes in that case only by that same scale factor. So the estimator doesn't depend on how you scale the data. That's the notion of scale invariance. The Garman-Klass paper actually goes to the nth degree and finds a particular estimator that has an efficiency of 8.4, which is really highly significant. So if you are working with a modeling process where you believe that the underlying parameters may be reasonably assumed to be constant over short periods of time, well, over those short periods of time, if you use extended estimators like this, you'll get much more precise measures of the underlying parameters than from just using simple close-to-close data.

All right. Let's introduce Poisson Jump Diffusions. With Poisson Jump Diffusions we have basically a stochastic differential equation for representing this model. And it's just like the Geometric Brownian Motion model, except we have this additional term, gamma sigma Z d pi of t. Now that's a lot of different variables, but essentially what we're thinking about is that over time a Brownian Motion process is fully continuous. There are basically no jumps in this Brownian Motion process. In order to allow for jumps, we assume that there's some process pi of t, which is a Poisson process. It's a counting process that counts when jumps occur and how many jumps have occurred. So that might start at the value 0. Then if there's a jump here it goes up by one. And then if there's another jump here, it goes up by one, and so forth. And so the Poisson Jump Diffusion model says this diffusion process is actually going to experience some shocks to it. Those shocks are going to be arriving according to a Poisson process. If you've taken stochastic modeling you know that that's a purely random process; basically the shocks arrive with exponentially distributed inter-arrival times, and you can't predict them. And when those occur, d pi of t is going to change from 0 up to the unit increment. So d pi of t is 1. And then we'll realize gamma sigma Z of t. So at this point we're going to have shocks. Here this is going to be gamma sigma Z_1. And at this point, maybe it's a negative shock, gamma sigma Z_2. Elsewhere this is 0. And so with this overall process we basically have a shift in the diffusion, up or down, according to these values. And so this model allows for the arrival of these shocks to be random according to the Poisson process, and for the magnitudes of the shocks to be random as well.
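A minimal simulation sketch of the jump-diffusion just described; all parameter values are made up, and the jump term is added to the log price, a simplification for the picture rather than the exact SDE.

set.seed(8)
n   <- 252; dt <- 1 / 252
mu  <- 0.05; sigma <- 0.15          # diffusion drift and volatility, annualized (made up)
lam <- 10;   gam   <- 3             # jump intensity per year and jump scale multiplier (made up)

dW <- rnorm(n, sd = sqrt(dt))                              # Brownian increments
dN <- rpois(n, lam * dt)                                   # number of jumps in each increment
J  <- sapply(dN, function(k) sum(gam * sigma * rnorm(k)))  # summed jump sizes gamma * sigma * Z

dlogS <- mu * dt + sigma * dW + J   # log-price increments with occasional jumps
S     <- 100 * exp(cumsum(dlogS))
plot(S, type = "l", xlab = "day", ylab = "price")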
Now, like the Geometric Brownian Motion model, this process has independent increments, which helps with the estimation. One could estimate this model by maximum likelihood, but it does get tricky in that basically, over different increments of time, the change in the process corresponds to the diffusion increment plus the sum of the jumps that have occurred over that same increment. And so the model ultimately is a Poisson mixture of Gaussian distributions. And in order to evaluate this model's properties, moment generating functions can be computed rather directly. And so one can understand how the moments of the process vary with these different model parameters. The likelihood function is a product of Poisson sums. And there's a closed form for the EM algorithm, which can be used to implement the estimation of the unknown parameters. And if you think about observing a Poisson Jump Diffusion process, if you knew where the jumps occurred, so you knew where the jumps occurred and how many there were per increment in your data, then the maximum likelihood estimation would all be very, very simple, because there is a separation of the estimation of the Gaussian parameters from the Poisson parameters. When you haven't observed those values, then you need to deal with methods appropriate for missing data. And the EM algorithm is a very famous algorithm developed by the people up at Harvard, Rubin, Laird, and Dempster, which deals with the following situation: if the problem would be much simpler when certain unobserved variables were observed, then you expand the problem to include your observed data plus the missing data, in this case where the jumps have occurred. You then take conditional expectations to estimate those jumps, and then, assuming the jumps occurred with those frequencies, estimate the underlying parameters. So the EM algorithm is very powerful and has extensive applications in all kinds of different models. I'll put up on the website a paper that I wrote with David Pickard and his student Arshad Zakaria, which goes through the maximum likelihood methodology for this. Looking at that, you can see how, with an extended model, maximum likelihood gets implemented, and I think that's useful to see.

All right. So let's turn next to ARCH models. And OK. Just as a bit of motivation, the Geometric Brownian Motion model and also the Poisson Jump Diffusion model are models which assume that volatility over time is essentially stationary. And with the independent increments of those processes, the volatility over different increments is essentially the same. So the ARCH models were introduced to accommodate the possibility that there's time dependence in volatility. And so let's see. Let me go. OK. At the very end, I'll go through an example showing that time dependence with our euro/dollar exchange rates. So the setup for this model is basically that we look at the log of the price relatives, y_t, and we model the residuals not to have constant volatility, but to be sigma_t multiples of white noise with mean 0 and variance 1, where sigma_t is given by this ARCH function, which says that the squared volatility at a given period t is a constant plus a weighted sum of the squared residuals over the last p lags. And so if there's a large residual, then that could persist and make the next observation have a large variance. And so this accommodates some time dependence. Now this model actually has parameter constraints, which are never a nice thing to have when you're fitting models. In this case the parameters alpha_1 through alpha_p all have to be non-negative. And why do they have to be non-negative? AUDIENCE: [INAUDIBLE]. PROFESSOR: Right. Variance is positive. So if any of these alphas were negative, then there would be a possibility that under this model you could have a negative variance, which you can't. So when we estimate this model, we estimate it with the constraint that all these parameter values are non-negative. So that does complicate the estimation a bit. In terms of understanding how this process works, one can actually see how the ARCH model implies an autoregressive model for the squared residuals, which turns out to be useful.
So the top line there is the ARCH model, saying that the variance of the period-t return is this weighted sum of the past squared residuals. And then if we simply add a new variable u_t, which is our squared residual minus its variance, to both sides, we get the next line, which says that epsilon_t squared follows an autoregression on itself, with the u_t value being the disturbance in that autoregression. Now u_t, which is epsilon_t squared minus sigma_t squared, what is the mean of that? The mean is 0. So it's almost white noise. But its variance is maybe going to change over time. So it's not standard white noise, but it basically has expectation 0. It's also conditionally independent, but there's some possible variability there. But what this implies is that there basically is an autoregressive model where we just have time-varying variances in the underlying process.

Now because of that, one can quickly evaluate whether there's ARCH structure in data by simply fitting an autoregressive model to the squared residuals, and testing whether that regression is significant or not. And that, formally, is a Lagrange multiplier test. Some of the original papers by Engle go through that analysis. And the test statistic turns out to just be a multiple of the R squared for that regression fit. Basically, under the null hypothesis that there isn't any ARCH structure, this regression model should have no predictability; if there's no time dependence in those squared residuals, that's evidence of an absence of ARCH structure. And so under the null hypothesis of no ARCH structure, that R squared statistic should be small. It turns out that n times the R squared statistic, with p variables in the regression, asymptotically has a chi-squared distribution with p degrees of freedom. So that's where that test statistic comes into play. And in implementing this, we are applying essentially least squares with the autoregression to implement this Lagrange multiplier test, implicitly making the Gauss-Markov assumptions in fitting that. This corresponds to the notion of quasi-maximum likelihood estimates for unknown parameters. And quasi-maximum likelihood estimates are used extensively in some stochastic volatility models. These are essentially situations where you use the normal approximation, or a second order approximation, to get your estimates, and they turn out to be consistent and decent.

All right. Let's go to maximum likelihood estimation. OK. Maximum likelihood estimation basically involves-- the hard part is defining the likelihood function, which is the density of the data given the unknown parameters. In this case, the data are conditionally independent. The joint density is the product of the densities of y_t given the information at t minus 1. So basically the joint probability density is the density at each time point conditional on the past, times the density of the next time point conditional on the past, and so on. And those are all normal random variables, so these are normal PDFs coming into play here. And so what we want to do is basically maximize this likelihood function subject to these constraints. And we already went through the fact that the alpha_i's have to be non-negative. And it turns out you also have to have that the sum of the alphas is less than one. Now what would happen if the sum of the alphas was not less than one?
AUDIENCE: [INAUDIBLE]. PROFESSOR: Right. And you basically could have the process start diverging. Basically these autoregressions can explode. So let's go through and see. Let's see. Actually, we're going to go to GARCH models next. OK. Let's see. Let me just go back here a second. OK. Very good. OK. In the remaining few minutes let me just introduce you to the GARCH models. The GARCH model basically adds to the ARCH equation a sum of q past values of the squared volatility itself, in the equation for the volatility sigma_t squared. And it may be that very high order ARCH terms are found to be significant when you fit ARCH models; it could be that much of that need is explained by adding these GARCH terms. And so let's just consider a simple GARCH model where we have only a first order ARCH term and a first order GARCH term. So we're basically saying that this is a weighted average of the previous squared volatility and the new squared residual. And this is a very parsimonious representation that actually ends up fitting data quite, quite well. And there are various properties of this GARCH model which we'll go through next time, but I want to just close this lecture by showing you fits of the ARCH models and of this GARCH model to the euro/dollar exchange rate process.

So let's just look at that here. OK. OK. With the euro/dollar exchange rate, there's the graph here which shows the auto-correlation function and the partial auto-correlation function of the squared returns. So is there dependence in these daily volatilities? And basically these blue lines are plus or minus two standard deviations of the estimated correlation coefficient. Basically we have highly significant auto-correlations and very highly significant partial auto-correlations, which suggests, if you're familiar with ARMA processes, that you would need a very high order ARMA process to fit the squared residuals. But this highlights how, with the statistical tools, you can actually identify this time dependence quite quickly. And here's a plot of the ARCH order one model and the ARCH order two model. And on each of these I've actually drawn a solid line where the constant variance model would be. So ARCH is saying that we have a lot of variability about that constant level. And a property, I guess, of these ARCH models is that they all have a minimum value for the volatility that they're estimating. If you look at the ARCH function, that alpha_0, the constant term, is basically the minimum value which the variance can take. So there's a constraint on the lower value. Then here's an ARCH(10) fit, which doesn't look like it has quite as much of a uniform lower bound in it, but one could go on and on with higher order ARCH terms. Rather than doing that, one can also fit just a GARCH(1,1) model. And this is what it looks like. So basically the time-varying volatility in this process is captured really, really well with just this two-parameter GARCH model, as compared with a high order autoregressive model. And it highlights the issues with the Wold decomposition, where a potentially infinite order autoregressive model will effectively fit most time series. Well, that's nice to know, but it's nice to have a parsimonious way of defining that infinite collection of parameters, and with the GARCH model a couple of parameters do a good job.
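To make the last two ideas concrete, here is a sketch of the Lagrange multiplier check for ARCH effects, done by hand with a lag regression, followed by a GARCH(1,1) fit via the optional fGarch package (shown as comments since it may not be installed); the returns are simulated and homoskedastic, so the test should not reject here, unlike the euro/dollar returns in the case study.

set.seed(9)
r  <- rnorm(2000, sd = 0.006)       # stand-in returns with constant variance
e2 <- r^2
p  <- 5
X  <- embed(e2, p + 1)              # column 1 is e2_t, columns 2..(p+1) are its first p lags
lm.fit  <- lm(X[, 1] ~ X[, -1])     # regress squared returns on their own lags
lm.stat <- nrow(X) * summary(lm.fit)$r.squared
c(LM = lm.stat, p.value = 1 - pchisq(lm.stat, df = p))

# A GARCH(1,1) fit via the optional fGarch package (tseries::garch is an alternative):
# library(fGarch)
# fit <- garchFit(~ garch(1, 1), data = r, trace = FALSE)
# plot(volatility(fit), type = "l")  # fitted conditional volatility path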
And then finally here's just a simultaneous plot of all those volatility estimates on the same graph. And so one can see the increased flexibility of the GARCH models compared to the ARCH models for capturing time-varying volatility. So all right. I'll stop there for today. And let's see. Next Tuesday is a presentation from Morgan Stanley. And today's the last day to sign up for the field trip.