The following content is
provided under a Creative Commons license. Your support will help
MIT OpenCourseWare continue to offer high quality
educational resources for free. To make a donation or
view additional materials from hundreds of MIT courses,
visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: All right. Let's see. We're going to start
today with a wrap up of our discussion of
univariate time series analysis. And last time we went through
the Wold representation theorem, which
applies to covariance stationary processes, a
very powerful theorem. And implementations of
the covariance stationary processes with ARMA models. And we discussed
estimation of those models with maximum likelihood. And here in this
slide I just wanted to highlight how when
we estimate models with maximum likelihood
we need to have an assumption of a probability
distribution for what's random, and in the ARMA structure
we consider the simple case where the innovations,
the eta_t, are normally
distributed white noise. So they're independent and
identically distributed normal random variables. And the likelihood
function can be maximized at the maximum
likelihood parameters. And it's simple to implement
the limited information maximum likelihood where one conditions
on the first few observations in the time series. If you look at the likelihood
structure for ARMA models, the density of an outcome
at a given time point depends on lags of that
dependent variable. So if those are unavailable,
then that can be a problem. One can implement limited
information maximum likelihood where you're just conditioning
on those initial values, or there are full information
maximum likelihood methods that you can apply as well. Generally though the
limited information case is what's applied. Then the issue is
model selection. And with model
selection the issues that arise with time
series are issues that arise in fitting any
kind of statistical model. Ordinarily one will
have multiple candidates for the model you
want to fit to data. And the issue is how
do you judge which ones are better than others. Why would you prefer
one over the other? And if we're considering a
collection of different ARMA models then we could say, fit
all ARMA models of order p,q with p and q varying
over some range: p from 0 up to p_max, q from 0 up to q_max. And evaluate each of those (p, q) models. And if we consider sigma
tilde squared of p, q being the MLE of
the error variance, then there are these
model selection criteria that are very popular. Akaike information criterion,
and Bayes information criterion, and Hannan-Quinn. Now these criteria all have the same first term, the log of the MLE of the error variance, so they don't differ in that term. They differ only in this second penalty term, but let's focus first on the AIC criterion. A given model is
going to be better if the log of the MLE for the
error variance is smaller. Now is that a good thing? Meaning, what is
the interpretation of that practically when you're
fitting different models? Well, the practical
interpretation is the variability of the
model about where you're predicting things, our
estimate of the error variance is smaller. So we have essentially a
model with a smaller error variance is better. So we're trying to minimize
the log of that variance. Minimizing that is a good thing. Now what happens when
you have many sort of independent variables
to include in a model? Well, if you were doing a
Taylor series approximation of a continuous function,
eventually you'd sort of get to probably
the smooth function with enough terms, but suppose
that the actual model, it does have a finite number
of parameters. And you're considering
new factors, new lags of
independent variables in the autoregressions. As you add more
and more variables, well, there really
should be a penalty for adding extra variables
that aren't adding real value to the model in terms
of reducing the error variance. So the Akaike
information criterion is penalizing different
models by a factor that depends on the size of the model
in terms of the dimensionality of the model parameters. So p plus q is the dimensionality of the ARMA model's parameter vector. So let's see. With the BIC criterion the
difference between that and the AIC criterion is
that this factor two is replaced by log n. So rather than having a sort
of unit increment of penalty for adding an extra parameter,
the Bayes information criterion is adding a log n penalty
times the number of parameters. And so as the sample size
gets larger and larger, that penalty gets
higher and higher. Now the practical interpretation
of the Akaike information criterion is that it is
very similar to applying a rule which says, we're
going to include variables in our model if the square of
the t statistic for estimating the additional parameter in the
model is greater than 2 or not. So in terms of when does the
Akaike information criterion become lower from adding
additional terms to a model? If you're considering two models
that differ by just one factor, it's basically whether the squared t statistic for the model coefficient on that factor is greater than 2 or not. Now many of you who have
seen regression models before and applied them, in
particular applications would probably
say, I really don't believe in the value
of an additional factor unless the t statistic
is greater than 1.96, or 2 or something. But the Akaike
information criterion says the t statistic
should be greater than the square root of 2. So it's sort of a weaker
constraint for adding variables into the model. And now why is it called
an information criterion? I won't go into
this in the lecture. I am happy to go into
it during office hours, but there's notions
of information theory and Kullback-Leibler
information of the model versus the true
model, and trying to basically maximize the
closeness of our fitted model to that. Now the Hannan-Quinn
criterion, let's just look at how that differs. Well, that basically has a
penalty midway between the log n and two. It's 2*log(log n). So this has a penalty that's
increasing with size n, but not as fast as log n. This becomes
relevant when we have models that get to be very large
because we have a lot of data. Basically the more
data you have, the more parameters
you should be able to incorporate in
the model if they're sort of statistically valid
factors, important factors. And the Hannan-Quinn
criterion basically allows for modeling processes
where really an infinite number of variables might
be appropriate, but you need larger
and larger sample sizes to effectively estimate those. So those are the criteria that
can be applied with time series models. And I should point out that if you multiply that first term by n over 2, you get n over 2 times log sigma tilde squared, and that term is basically one of the terms in the log likelihood function of the fitted model. So you can see how this criterion is basically adjusting the maximized likelihood value with a penalty for extra parameters. Let's see. OK.
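As an illustration, here is a hedged sketch in R, not taken from the course materials, of fitting ARMA(p, q) models over a grid and computing the three criteria in the form just described; the simulated series, the seed, and the maximum orders are arbitrary assumptions.

```r
# Sketch: compare AIC, BIC, and Hannan-Quinn over a grid of ARMA(p, q) fits.
set.seed(1)
y <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 500)  # stand-in series
n <- length(y)
grid <- expand.grid(p = 0:3, q = 0:3)
grid$aic <- grid$bic <- grid$hq <- NA
for (i in seq_len(nrow(grid))) {
  p <- grid$p[i]; q <- grid$q[i]
  fit <- arima(y, order = c(p, 0, q))       # ML fit of ARMA(p, q)
  s2  <- fit$sigma2                         # MLE of the innovation variance
  grid$aic[i] <- log(s2) + 2 * (p + q) / n
  grid$bic[i] <- log(s2) + log(n) * (p + q) / n
  grid$hq[i]  <- log(s2) + 2 * log(log(n)) * (p + q) / n
}
grid[which.min(grid$aic), ]                 # order minimizing the AIC
```

The same table can be scanned for the minimum of the BIC or Hannan-Quinn columns, which will typically pick a smaller model because of their heavier penalties.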
The next topic is testing for stationarity and non-stationarity. There's a famous test called
the Dickey-Fuller test, which is essentially to evaluate
the time series to see if it's consistent with a random walk. We know that we've been
discussing sort of lecture after lecture how simple random
walks are non-stationary. And the simple random walk is
given by the model up here, x_t equals phi
x_(t-1) plus eta_t. If phi is equal
to 1, right, that is a non-stationary process. Well, in the
Dickey-Fuller test we want to test whether
phi equals 1 or not. And so we can fit the AR(1)
model by least squares and define the test statistic to
be the estimate of phi minus 1 divided by its standard error, where phi hat is the least squares estimate and the standard error is the usual least squares standard error of that estimate. If our coefficient phi is less than 1 in modulus, so this really is a stationary series, then root n times the difference between phi hat and phi converges in distribution to a normal with mean 0 and variance 1 minus phi squared. And let's see. But if phi is equal
to 1, OK, so just to recap, that second to last bullet point is basically the property that when the modulus of phi is less than 1, our least squares estimate is asymptotically normally distributed with mean 0 and variance 1 minus phi squared, once we center at the true value and scale by root n. If phi is equal to 1, then it turns out that phi hat is super-consistent, converging at rate 1 over n rather than the usual 1 over root n. Now this super-consistency
is related to statistics converging
to some value, and what is the rate of
convergence of those statistics to different values. So in normal samples we can
estimate sort of the mean by the sample mean. And that will converge to
the true mean at rate of 1 over root n. When we have a non-stationary random walk, the matrix of independent variables is such that X transpose X over n grows without bound. So if we have y equal to X beta plus epsilon, and beta hat equal to X transpose X inverse times X transpose y, then beta hat is ultimately distributed as normal with mean beta and variance sigma squared times X transpose X inverse. When the process is a non-stationary random walk, X transpose X over n grows to infinity in magnitude, so that variance factor X transpose X inverse shrinks faster than 1 over n. Whereas when the process is stationary, X transpose X over n converges to a finite limit. So anyway, that leads
to the super-consistency, meaning that it converges
to the value much faster and so this normal
distribution isn't appropriate. And it turns out there's
Dickey-Fuller distribution for this test statistic,
which is based on integrals of diffusions and one
can read about that in the literature on unit roots
and test for non-stationarity. So there's a very rich
literature on this problem. If you're into econometrics,
basically a lot of time's been spent in that
field on this topic. And the mathematics gets
very, very involved, but good results are available.
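Before moving to the application, here is a hedged sketch, not from the lecture or the case study, of computing the Dickey-Fuller statistic by hand in R for a simulated random walk; the sample size and seed are arbitrary.

```r
# Sketch: Dickey-Fuller statistic from the least squares AR(1) fit.
set.seed(1)
n <- 500
x <- cumsum(rnorm(n))                      # phi = 1: a pure random walk
fit <- lm(x[2:n] ~ 0 + x[1:(n - 1)])       # regress x_t on x_{t-1}, no intercept
phi.hat <- coef(fit)[1]
se.hat  <- coef(summary(fit))[1, "Std. Error"]
(phi.hat - 1) / se.hat                     # Dickey-Fuller test statistic
# Under phi = 1 this statistic follows the Dickey-Fuller distribution, not the
# normal, so p-values come from tables or a routine such as tseries::adf.test.
```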
So let's see an application of some of these time series methods. Let me go to the
desktop here if I can. In this supplemental material
that'll be on the website, I just wanted you
to be able to work with time series,
real time series and implement these
autoregressive moving average fits and understand
basically how things work. So in this, it introduces
loading the R libraries and Federal Reserve data into
R, basically collecting it off the web. Creating weekly and monthly
time series from a daily series, which sounds like a trivial thing to do, but when you sit down and try to do it, it gets involved. So there's some nice
tools that are available. There's the ACF
and the PACF, which is the auto-correlation
function and the partial auto-correlation
function, which are used for interpreting series. Then we conduct Dickey-Fuller
test for unit roots and determine, evaluate
stationarity, non-stationarity of the 10-year yield. And then we evaluate
stationarity and cyclicality in the fitted autoregressive
model of order 2 to monthly data. And actually 1.7 there,
that cyclicality issue, relates to one of the
problems on the problem set for time series,
which is looking at, with second order
autoregressive models, is there cyclicality
in the process? And then finally
looking at identifying the best autoregressive model
using the AIC criterion. So let me just page through
and show you a couple of plots here. OK. Well, there's the
original 10-year yield collected directly from
the Federal Reserve website over a 10 year period. And, oh, here we go. This is nice. OK. OK. Let's see, this
section 1.4 conducts the Dickey-Fuller test. And it basically determines that the null hypothesis of non-stationarity, a unit root, is not rejected. And so, with the augmented
Dickey-Fuller test, the test statistic is computed. Its significance is
evaluated by the distribution for that statistic. And the p-value tells
you how extreme the value of the statistic is,
meaning how unusual is it. The smaller the p-value, the
more unlikely the value is. The p-value is the probability of getting as extreme or a more extreme value of the test statistic under the null hypothesis, so a small p-value is evidence against the null hypothesis. So in this case the p-value is basically 0.2726 for the monthly data, which says that basically we do not reject a unit root in the process. Let's see. OK. There's a section
on understanding partial auto-correlation
coefficients. And let me just state what
the partial correlation coefficients are. You have the
auto-correlation functions, which are simply the
correlations of the time series with lags of its values. The partial
auto-correlation coefficient is the correlation between the time series and, say, its p-th lag that is not explained by all lags lower than p. So it's basically the
incremental correlation of the time series variable with
the p-th lag after controlling for the others. And then let's see. With this, in section eight here there's a function in R called ar, for autoregressive, which basically will fit all autoregressive models up to a given order and provide diagnostic statistics for that.
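Here is a minimal sketch of those case-study steps, not the actual case-study code; a simulated series stands in for the 10-year Treasury yield (the real data come from the Federal Reserve website), and the tseries package is assumed to be available for adf.test.

```r
# Sketch: ACF/PACF, augmented Dickey-Fuller test, and AR order selection by AIC.
set.seed(1)
y <- 4 + cumsum(rnorm(250, sd = 0.05))   # stand-in "yield" series (a random walk)
r <- diff(y)                             # changes in the yield

acf(r)                                   # auto-correlation function
pacf(r)                                  # partial auto-correlation function

library(tseries)
adf.test(y)                              # augmented Dickey-Fuller test on the levels

fit <- ar(r, aic = TRUE, order.max = 10) # fits AR(p) for p = 0..10, selects by AIC
fit$order                                # order chosen by the AIC
fit$aic                                  # AIC values relative to the minimum
```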
And here is a plot of the relative AIC statistic for models of the monthly data. And you can see that basically
it takes all the AIC statistics and subtracts the smallest
one from all the others. So one can see that according
to the AIC statistic a model of order seven is
suggested for this treasury yield data. OK. Then finally because these
autoregressive models are implemented with
regression models, one can apply
regression diagnostics that we had introduced earlier
to look at those data as well. All right. So let's go down now. [INAUDIBLE] OK. [INAUDIBLE] Full screen. Here we go. All right. So let's move on to the
topic of volatility modeling. The discussion in
this section is going to begin with just
defining volatility. So we know what
we're talking about. And then measuring volatility
with historical data where we don't really apply sort
of statistical models so much, but we're concerned with
just historical measures of volatility and
their prediction. Then there are formal models. We'll introduce Geometric
Brownian Motion, of course. That's one of the standard
models in finance. But also Poisson
jump-diffusions, which is an extension of
Geometric Brownian Motion to allow for discontinuities. And then there's a property
of these Brownian motion and jump-diffusion
models which is models with independent increments. Basically you have disjoint
increments of the process, basically are independent
of each other, which is a key property when there's
time dependence in the models. There can be time dependence
actually in the volatility. And ARCH models were
introduced initially to try and capture that. And were extended
to GARCH models, and these are the
sort of simplest cases of time-dependent
volatility models that we can work
with and introduce. And in all of these the sort
of mathematical framework for defining these models
and the statistical framework for estimating their parameters
is going to be highlighted. And while it's a
very simple setting in terms of what
these models are, these issues that
we'll be covering relate to virtually all
statistical modeling as well. So let's define volatility. OK. In finance it's defined as the
annualized standard deviation of the change in price or
value of a financial security, or an index. So we're interested
in the variability of this process, a price
process or a value process. And we consider it on an
annualized time scale. Now because of that, when you talk about volatility it really is meaningful to communicate levels like 10%. If you think of, at what level
do sort of absolute bond yields vary over a year? It's probably less than 5%. Bond yields don't-- When you think of
currencies, how much do those vary over a year. Maybe 10%. With equity markets,
how do those vary? Well, maybe 30%, 40% or more. With the estimation and
prediction approaches, OK, these are what
we'll be discussing. There's different cases. So let's go on to
historical volatility. In terms of computing
the historical volatility we'll be considering
basically a price series of T plus 1 points. And then we can get
T period returns corresponding to
those prices, which is the difference in
the logs of the prices, or the log of the
price relatives. So R_t is going to be
the return for the asset. And one could use
other definitions, like sort of the absolute
return, not take logs. It's convenient in much
empirical analysis, I guess, to work with the
logs because if you sum logs you get sort of
log of the product. And so total cumulative
returns can be computed easily with sums of logs. But anyway, we'll work
with that scale for now. OK. Now the process R_t, the
return series process, is going to be assumed to
be covariance stationary, meaning that it does
have a finite variance. And the sample estimate
of that is just given by the square root
of the sample variance. And we're also considering
an unbiased estimate of that. And if we want to
basically convert these to annualized
values so that we're dealing with a
volatility, then if we have daily prices-- in financial markets in the US, markets are open roughly 252 days a year on average-- we multiply that sigma hat by the square root of 252. And for weekly data, root 52, and root 12 for monthly data. So regardless of the periodicity of our original data we can get them onto that volatility scale.
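As a tiny hedged sketch of that annualization, with made-up prices rather than any course data:

```r
# Sketch: annualizing the sample volatility of daily log returns.
prices <- c(100, 101, 100.5, 102, 101.2, 103)   # made-up daily closing prices
r <- diff(log(prices))                          # daily log returns
sigma.daily  <- sd(r)                           # sample standard deviation
sigma.annual <- sigma.daily * sqrt(252)         # annualized volatility
```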
Now in terms of prediction methods that one can build with
historical volatility, and there's a lot of work
done in finance by people who aren't sort of
trained as econometricians or statisticians, they basically
just work with the data. And there's a standard for
risk analysis called the risk metrics approach, where the
approach defines volatility and volatility estimates,
historical estimates, just using simple methodologies. And so let's just go
through what those are here. One can-- basically
for any period t, one can define the
sample volatility, just to be the sample standard
deviation of the period t returns. And so with daily
data that might just be the square of
that daily return. With monthly data it could be
the sample standard deviation of the returns over the
month and with yearly it would be the sample
over the year. Also with intraday data, it
could be the sample standard deviation over intraday periods
of say, half hours or hours. And the historical
average is simply the mean of those
estimates, which uses all the available data. One can consider the
simple moving average of these realized volatilities. And so that basically is using
the last m, for some finite m, values to average. And one could also consider
an exponential moving average of these sample
volatilities where we have-- our estimate of the
volatility is 1 minus beta times the current
period volatility plus beta times the
previous estimate. And these exponential
moving averages are really very nice
ways to estimate processes that change over time. And they're able to track
the changes quite well and they will tend to
come up again and again. This exponential
moving average actually uses all available data. And there can be discrete
versions of those where you say, well let's use not
an equal weighted average like the simple moving
average, but let's use a geometric average of the last
m values in an exponential way. And that's the exponential
weighted moving average that uses the last m. OK. There we go. OK.
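As a hedged sketch of the exponential moving average idea, where the recursion form, the default beta, and the initialization are my assumptions rather than the RiskMetrics specification itself:

```r
# Sketch: exponentially weighted moving average of squared returns,
# sigma2[t] = (1 - beta) * r[t]^2 + beta * sigma2[t - 1].
ewma.vol <- function(r, beta = 0.94) {          # beta near 0.94 is a common choice
  sigma2 <- numeric(length(r))
  sigma2[1] <- r[1]^2                           # initialize with the first squared return
  for (t in 2:length(r)) {
    sigma2[t] <- (1 - beta) * r[t]^2 + beta * sigma2[t - 1]
  }
  sqrt(sigma2)                                  # per-period volatility estimates
}
```

For example, ewma.vol(diff(log(prices))) applied to a daily price series gives a daily volatility track, which can then be annualized by multiplying by the square root of 252.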
Well, with these different measures of sample volatility, one can basically build models to estimate them with regression models and evaluate those models. And in terms of the
risk metrics benchmark, they consider a variety
of different methodologies for estimating volatility. And sort of determine
what methods are best for different kinds of
financial instruments. And different financial indexes. And there are different
performance measures one can apply. Sort of mean squared
error of prediction, mean absolute error
of prediction, mean absolute prediction
error, and so forth to evaluate different
methodologies. And on the web you can actually
look at the technical documents for risk metrics and they
go through these analyses and if your interest is in a
particular area of finance, whether it's fixed income
or equities, commodities, or currencies,
reviewing their work there is very
interesting because it does highlight different
aspects of those markets. And it turns out that basically
the exponential moving average is generally a very good
method for many instruments. And the sort of discounting
of the values over time corresponds to having roughly
between, I guess, a 45 and a 90 day period in
estimating your volatility. And these approaches are, I guess, a bit ad hoc; the formalism in defining them is basically just what has worked empirically in the past. Let's see. While these things are
ad hoc, they actually have been very, very effective. So let's move on to
formal statistical models of volatility. And the first class is-- model
is the Geometric Brownian Motion. So here we have basically
a stochastic differential equation defining the model
for Geometric Brownian Motion. And Choongbum will be
going in some detail about stochastic
differential equations, and stochastic calculus
for representing different processes,
continuous processes. And the formulation
is basically looking at increments of the price process: d S of t is equal to mu S of t dt, sort of a drift term, plus sigma S of t times d W of t, where sigma is the volatility of the security price, mu is the mean return per unit time, and d W of t is the increment of a standard Brownian motion process, or Wiener process. And this W process is such that its increments, basically the change in value
of the process between two time points is normally
distributed, with mean 0 and variance equal to the
length of the interval. And increments on disjoint
time intervals are independent. And well, if you
divide both sides of that equation by S of t then
you have d S of t over S of t is equal to mu dt
plus sigma d W of t. And so the increments d S of t, normalized by S of t, follow a Brownian motion with drift mu and volatility sigma. Now with sample data
from this process, now suppose we have
prices observed at times t_0 up to t_n. And for now we're not going
to make any assumptions about what those time increments
are, what those times are. They could be equally spaced. They could be unequally spaced. The returns, the log of the
relative price change from time t_(j-1) to t_j are
independent random variables. They are
normally distributed with mean given by mu times the
length of the time increment, and variance sigma squared times
the length of the increment. And these properties will
be covered by Choongbum in some later lectures. So for now what we can
just know that this is true and apply this result.
If we fix various time points for the observation
and compute returns this way. If it's a Geometric
Brownian Motion we know that this is the
distribution of the returns. Now knowing that
distribution we can now engage in maximum
likelihood estimation. OK. If the increments are
all just equal to 1, so we're thinking
of daily data, say. Then the maximum likelihood
estimates are simple. It's basically the sample mean
and the sample variance with 1 over n instead of 1 over
n minus 1 in the MLEs. If delta_j varies, then, well, that's actually a case in the exercises.
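For the equal-increment case just described, a hedged sketch with made-up prices, not course data, might look like this:

```r
# Sketch: ML estimates of mu and sigma for geometric Brownian motion when the
# observation increments are all equal to one period (e.g. daily data).
S <- c(100, 100.8, 101.5, 100.9, 102.3, 103.0)   # made-up prices
r <- diff(log(S))                                # one-period log returns
n <- length(r)
mu.hat     <- mean(r)                            # MLE of the drift per period
sigma2.hat <- mean((r - mu.hat)^2)               # MLE uses 1/n, not 1/(n - 1)
sqrt(sigma2.hat) * sqrt(252)                     # annualized volatility, if daily
```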
Now, in the class exercise the issue
that is important to think about is if you consider a given
interval of time over which we're observing this Geometric
Brownian Motion process, if we increase the sampling
rate of prices over a given interval, how does that
change the properties of our estimates? Basically, do we obtain
more accurate estimates of the underlying parameters? And as you increase
the sampling frequency, it turns out that some
parameters are estimated much, much better and you
get basically much lower standard errors
on those estimates. With other parameters
you don't necessarily. And the exercise is
to evaluate that. Now another issue
that's important is the issue of sort of what
is the appropriate time scale for Geometric Brownian Motion. Right now we're
thinking that you collect data, and whatever the periodicity of the data is, you treat that as the period
for your Brownian Motion. Let's evaluate that. Let me go to another example. Let's see here. Yep. OK. Let's go control-minus here. OK. All right. Let's see. With this second
case study there was data on exchange rates,
looking for regime changes in exchange rate relationships. And so we have data
from that case study on different foreign
exchange rates. And here in the top panel
I've graphed the euro/dollar exchange rate from
the beginning of 1999 through just a few months ago. And the second panel is a
plot of the daily returns for that series. And here is a histogram
of those daily returns. And a fit of the Gaussian
distribution for the daily returns if our sort of
time scale is correct. Basically daily returns
are normally distributed. Days are disjoint in
terms of the price change. And so they're independent
and identically distributed under the model. And they all have the
same normal distribution with mean mu and
variance sigma squared. OK. This analysis assumes
basically that we're dealing with trading days for
the appropriate time scale, the Geometric Brownian Motion. Let's see. One can ask, well, what
if trading days really aren't the right time scale, but it's more calendar time. The change in value over the weekends may correspond to price
changes, or value changes over a longer period of time. And so this model
really needs to be adjusted for that time scale. The exercise that
allows you to consider different delta t's shows you
what the maximum likelihood estimates-- you'll
be deriving maximum likelihood estimates if we
have different definitions of time scale there. But if you apply the calendar
time scale to this euro, let me just show you what
the different estimates are of the annualized mean return
and the annualized volatility. So if we consider trading days
for euro it's 10.25% or 0.1025. If you consider clock time, it
actually turns out to be 12.2%. So depending on how
you specify the model you get a different
definition of volatility here. And it's important to
basically understand sort of what the assumptions
are of your model and whether perhaps things
ought to be different. In stochastic modeling,
there's an area called subordinated
stochastic processes. And basically the idea is, if
you have a stochastic process like Geometric Brownian Motion
of simple Brownian motion, maybe you're observing that
on the wrong time scale. You may fit the Geometric
Brownian Motion model and it doesn't look right. But it could be that
there's a different time scale that's appropriate. And it's really Brownian
motion on that time scale. And so formally it's called
a subordinated stochastic process. You have a different
time function for how to model the
stochastic process. And the evaluation of
subordinated stochastic processes leads to consideration
of different time scales. With, say, equity markets,
and futures markets, sort of the volume of trading,
sort of cumulative volume of trading might be really
an appropriate measure of the real time scale. Because that's a
measure of, in a sense, information flow
coming into the market through the level of activity. So anyway I wanted to highlight
how with different time scales you can get different results. And so that's something
to be evaluated. In looking at these
different models, OK, these first few
graphs here show the fit of the normal model
with the trading day time scale. Let's see. Those of you who've ever taken
a statistics class before, or an applied statistics class, may
know about normal q-q plots. Basically if you
want to evaluate the consistency of
the returns here with a Gaussian
distribution, what we can do is plot the observed
ordered, sorted returns against what we would
expect the sorted returns to be if it were from
a Gaussian sample. So under the Geometric
Brownian Motion model the daily returns are a sample of independent and identically distributed random variables
sampled from a Gaussian distribution. So the smallest return should
be consistent with the smallest of the sample size n. And what's being plotted here
is the theoretical quantiles or percentiles versus
the actual ones. And one would expect that
to lie along a straight line if the theoretical quantiles
were well-predicting the actual extreme values. What we see here is that as the
theoretical quantiles get high, and it's in units of
standard deviation units, the realized sample
returns are in fact much higher than would be
predicted by the Gaussian distribution. And similarly, on
the low end side. So there's a normal
q-q plot that's used often in the
diagnostics of these models. Then down here I've actually
plotted a fitted percentile distribution. Now what's been done here
is if we modeled the series as a series of Gaussian
random variables then we can evaluate
the percentile of the fitted Gaussian
distribution that was realized by every point. So if we have a return of say
negative 2%, what percentile is the normal fit of that? And you can evaluate the
cumulative distribution function of the fitted model at
that value to get that point. And what should the
distribution of percentiles be for fitted percentiles if
we have a really good model? OK. Well, OK. Let's think. If you consider the 50th
percentile you would expect, I guess, 50% of the data to
lie above the 50th percentile and 50% to lie below the
50th percentile, right? OK. Let's consider,
here I divided up into 100 bins
between zero and one so this bin is the
99th percentile. How many observations
would you expect to find in between the
99th and 100 percentile? This is an easy question. AUDIENCE: 1%. PROFESSOR: 1%. Right. And so in any of
these bins we would expect to see 1% if the
Gaussian model were fitting. And what we see is that,
well, at the extremes they're more extreme values. And actually inside there
are some fewer values. And actually this is exhibiting
a leptokurtic distribution for the actually
realized samples; basically the middle
of the distribution is a little thinner
and it's compensated for by fatter tails. But with this
particular model we can basically expect to
see a uniform distribution of percentiles in this graph. If we compare this with
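Both diagnostics are easy to reproduce; here is a hedged sketch in which simulated fat-tailed returns stand in for the exchange-rate data, with the sample size and t distribution chosen only for illustration.

```r
# Sketch: normal q-q plot and fitted-percentile histogram for daily returns.
set.seed(1)
r <- 0.005 * rt(1000, df = 4)              # fat-tailed stand-in for daily returns

qqnorm(r); qqline(r)                       # sorted returns vs. normal quantiles

u <- pnorm(r, mean = mean(r), sd = sd(r))  # percentiles under the fitted normal
hist(u, breaks = 100)                      # should look uniform if the model fits
```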
If we compare this with a fit of the clock-time model, we actually see
that clock time does a bit of a better job at getting
the extreme values closer to what we would
expect them to be. So in terms of being a better
model for the returns process, if we're concerned with
these extreme values, we're actually getting
a slightly better fit with the clock-time scale. So all right. Let's move on back to the notes. And talk about the
Garman-Klass Estimator. So let me do this. All right. View full screen. OK. All right. So, OK. The Garman-Klass
Estimator is one where we consider the
situation where we actually have much more information
than simply sort of closing prices at different intervals. Basically all transaction
data's collected in a financial market. So really we have
virtually all of the data available if we want
it, or can pay for it. But let's consider
a case where we expand upon just having
closing prices to having additional information over
increments of time that include the open,
high, and low price over the different periods. So those of you who are
familiar with bar data graphs that you see whenever you
plot stock prices over periods of weeks or months you'll
be familiar with having seen those. Now the Garman-Klass
paper addressed how can we exploit this
additional information to improve upon our estimates
of the close-to-close. And so we'll just
use this notation. Well, let's make some
assumptions and notation. So we'll assume that mu is equal
to 0 in our Geometric Brownian Motion model. So we don't have to
worry about the mean. We're just concerned
with volatility. We'll assume that the
increments are one for daily, corresponding to daily. And we'll let little f,
between zero and one, correspond to the time of day
at which the market opens. So over a day, from
day zero to day one at f we assume that
the market opens and basically the Geometric
Brownian Motion process might have closed
on day zero here. So this would be C_0 and it
may have opened on day one at this value. So this would be O_1. Might have gone up
and down and then closed here with the
Brownian Motion process. OK. This value here would
correspond to the high value. This value here would correspond
to the low value on day one. And then the closing
value here would be C_1. So the model is we have this
underlying Brownian Motion process is actually working
over continuous time, but we just observe it over
the time when the markets open. And so it can move between when
the market closes and opens on any given day and we have
the additional information. Instead of just the close, we
also have the high and low. So let's look at how we might
exploit that information to estimate volatility. OK. Using data from the first period
as we've graphed here, let's first just highlight what
the close-to-close return is. And that basically
is an estimate of the one-period variance. And so sigma hat 0 squared is
a single period squared return. C_1 minus C_0 has a distribution
which is normal with mean 0, and variance sigma squared. And if we consider
squaring that, what's the distribution of that? That's the square of a
normal random variable, which is chi squared, but it's a
multiple of a chi squared. It's sigma squared times a chi
squared one random variable. And with a chi squared
random variable the expected value is 1. The variance of a chi squared
random variable is equal to 2. So just knowing
those facts gives us the fact we have an unbiased
estimate of the volatility parameter sigma and the variance
is 2 sigma to the fourth. So that's basically
the precision of close-to-close returns. Let's look at two
other estimates. Basically the
open-to-close return, sigma_1 squared,
normalized by f, the length of the interval f. So we have sigma_1 is equal
to O_1 minus C_0 squared. OK. Actually why don't
I just do this? I'll just write down
a few facts and then you can see that the
results are clear. Basically O_1 minus C_0 is
distributed normal with mean 0 and variance f sigma squared. And C_1 minus O_1 is
distributed normal with mean 0 and variance (1 minus f) times sigma squared. OK. This is simply using the
properties of the diffusion process over different
periods of time. So if we normalize
the squared values by the length of
the interval we get estimates of the volatility. And what's particularly
significant about these
estimates one and two is that they're independent. So we actually
have two estimates of the same
underlying parameter, which are independent. And actually they both
have the same mean and they both have
the same variance. So if we consider
a new estimate, which is basically
averaging those two. Then this new estimate has the
same-- is unbiased as well, but it's variance is basically
the variance of this sum. So it's 1/2 squared times
this variance plus 1/2 squared times this variance, which is
a half of the variance of each of them. So this estimate
has lower variance than our close-to-close. And we can define the efficiency
of this particular estimate relative to the
close-to-close estimate as 2. Basically we get
double the precision. Suppose you had the open,
high, close for one day. How many days of
close-to-close data would you need to have the
same variance as this estimate? No. AUDIENCE: [INAUDIBLE]. Because of the three
data points [INAUDIBLE]. PROFESSOR: No. No. Anyone else? One more. Four. OK. Basically if the
variance is 1/2, basically to get the standard
deviation, or the variance to be-- I'm sorry. The ratio of the
variance is two. So no. So it actually is close to two. Let's see. Our 1/n is-- so it
actually is two. OK. I was thinking standard
deviation units instead of squared units. So I was trying to
be clever there. So it actually is
basically two days. So sampling this
with this information gives you as much as two
days worth of information. So what does that mean? Well, if you want
something that's as efficient as daily
estimates you'll need to look back one
day instead of two days to get the same efficiency
with the estimate. All right. The motivation for
the Garman-Klass paper was actually a paper
written by Parkinson in 1976, which dealt with using
the extremes of a Brownian Motion to estimate the
underlying parameters. And when Choongbum talks about
Brownian Motion a bit later, I don't know if you'll
derive this result, but in courses on
stochastic processes one does derive properties
of the maximum of a Brownian Motion over a given
interval and the minimum. And it turns out that if you take the difference between the high and the low, square it, and divide by 4 log 2, this is an estimate of the variance, the squared volatility, of the process. And the efficiency of this estimate turns out to be 5.2, which is better yet.
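To make these estimators concrete, here is a hedged single-period sketch; the numbers, the value of f, and the use of log prices are assumptions for illustration, not the Garman-Klass paper's own code.

```r
# Sketch: one-day variance estimates from open/high/low/close data,
# assuming zero drift and working with log prices.
C0 <- log(100.0)                               # previous day's close
O1 <- log(100.2); H1 <- log(101.5)             # today's open and high
L1 <- log(99.4);  C1 <- log(100.9)             # today's low and close
f  <- 0.3                                      # fraction of the day before the open

sig2.cc   <- (C1 - C0)^2                       # close-to-close (efficiency 1)
sig2.avg  <- 0.5 * (O1 - C0)^2 / f +           # average of the two independent
             0.5 * (C1 - O1)^2 / (1 - f)       # pieces (efficiency 2)
sig2.park <- (H1 - L1)^2 / (4 * log(2))        # Parkinson high-low (efficiency ~5.2)
```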
Well, Garman and Klass were excited by that and wanted to find
even better ones. So they wrote a paper that
evaluated all different kinds of estimates. And I encourage you
to Google that paper and read it because
it's very accessible. And it sort of highlights the
statistical and probability issues associated
with these problems. But what they did
was they derived the best analytic
scale-invariant estimator, which has this sort of bizarre
combination of different terms, but basically we're
using normalized values of the high, low, close
normalized by the open. And they're able to get
an efficiency of 7.4 with this combination. Now scale-invariant estimates,
in statistical theory, they're different
principles that guide the development of
different methodologies. And one kind of principle is
issues of scale invariance. If you're estimating a scale
parameter, and in this case volatility is telling
you essentially how large is the
variability of this process, if you were to say multiply your
original data all by a given constant, then a
scale-invariant estimator should be such that your
estimator changes in that case only by that same scale factor. So sort of the
estimator doesn't depend on how you scale the data. So that's the notion
of scale invariance. The Garman-Klass paper
actually goes to the nth degree and actually finds a
particular estimator that has an efficiency
of 8.4, which is really highly significant. So if you are working
with a modeling process where you believe that the
underlying parameters may be reasonably assumed
to be constant over short periods
of time, well, over those short periods
of time if you use these extended
estimators like this, you'll get much more
precise measures of the underlying parameters
than from just using simple close-to-close data. All right. Let's introduce Poisson
Jump Diffusions. With Poisson Jump
Diffusions we have basically a stochastic
differential equation for representing this model. And it's just like the
Geometric Brownian Motion model, except we have this additional
term, gamma sigma Z d pi of t. Now that's a lot of
different variables, but essentially what
we're thinking about is over time a Brownian Motion
process is fully continuous. There are basically no jumps in
this Brownian Motion process. In order to allow
for jumps, we assume that there's some process pi of
t, which is a Poisson process. It's a counting process that
counts when jumps occur, how many jumps have occurred. So that might start
at 0 at the value 0. Then if there's a jump
here it goes up by one. And then if there's another
jump here, it goes up by one, and so forth. And so the Poisson Jump
Diffusion model says, this diffusion
process is actually going to experience
some shocks to it. Those shocks are
going to be arriving according to a Poisson process. If you've taken
stochastic modeling you know that that's a sort
of a purely random process. Basically shocks arrive with exponentially distributed interarrival times. You can't predict them.
occur, d pi of t is going to change from 0
up to the unit increment. So d pi of t is 1. And then we'll realize
gamma sigma Z of t. So at this point we're
going to have shocks. Here this is going to be gamma
sigma Z_1 And at this point, maybe it's a negative
shock, gamma sigma Z_2. This is 0. And so with this overall
process we basically have a shift in the
diffusion, up or down, according to these values. And so this model allows for the arrival times of these shocks to be random according to a Poisson process, and for the magnitudes of the shocks to be random as well.
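As a hedged illustration of the model, here is a sketch that simulates a discretized jump diffusion for the log price; the parameter values, the time step, and the jump-size specification gamma sigma Z with Z standard normal are all assumptions.

```r
# Sketch: simulate a Poisson jump diffusion for log S on a daily grid.
set.seed(1)
n <- 1000; dt <- 1 / 252
mu <- 0.05; sigma <- 0.20                  # diffusion drift and volatility (annual)
lambda <- 10; gamma <- 3                   # jump intensity per year; jump scale
k <- rpois(n, lambda * dt)                 # number of jumps in each increment
dlogS <- mu * dt + sigma * sqrt(dt) * rnorm(n) +
         rnorm(n, mean = 0, sd = gamma * sigma * sqrt(k))   # summed jump sizes
S <- 100 * exp(cumsum(dlogS))
plot(S, type = "l")                        # path with occasional discontinuities
```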
Now like the Geometric Brownian Motion model, this process sort of has
independent increments, which helps with this estimation. One could estimate this
model by maximum likelihood, but it does get tricky
in that basically over different increments
of time the change in the process corresponds
to the diffusion increment, plus the sum of the
jumps that have occurred over that same increment. And so the model ultimately
is a Poisson mixture of Gaussian distributions. And in order to evaluate this
model's properties, moment generating functions
can be computed rather directly with that. And so one can understand how
the moments of the process vary with these different
model parameters. The likelihood function is
a product of Poisson sums. And there's a closed form
for the EM algorithm, which can be used to
implement the estimation of the unknown parameters. And if you think about observing
a Poisson Jump Diffusion process, if you knew
where the jumps occurred, so you knew where
the jumps occurred and how many there were
per increment in your data, then the maximum
likelihood estimation would all be very, very simple, because this sort of separates the estimation of the Gaussian parameters from the Poisson parameters. When you haven't observed
those values then you need to deal with methods
appropriate for missing data. And the EM algorithm is a very
famous algorithm developed by the people up at Harvard,
Rubin, Laird, and Dempster, which deals with, basically if
the problem is much simpler, if you could posit there
being unobserved variables that you would observe,
then you sort of expand the problem to
include your observed data, plus the missing
data, in this case where the jumps have occurred. You then take conditional expectations to estimate those jumps, and then, assuming the jumps occurred with those frequencies, you estimate the underlying parameters. So the EM algorithm
is very powerful and has extensive
applications in all kinds of different models. I'll put up on the
website a paper that I wrote with David
Pickard and his student Arshad Zakaria, which goes
through the maximum likelihood methodology for this. But looking at that,
you can see how, with an extended model, maximum likelihood gets implemented, and I think
that's useful to see. All right. So let's turn next
to ARCH models. And OK. Just as a bit of motivation, the
Geometric Brownian Motion model and also the Poisson
Jump Diffusion model are models which assume
that volatility over time is essentially stationary. And with the sort of independent
increments of those processes, the volatility over
different increments is essentially the same. So the ARCH models
were introduced to accommodate the
possibility that there's time dependence in volatility. And so let's see. Let's see. Let me go. OK. At the very end, I'll go through
an example showing that time dependence with our
euro/dollar exchange rates. So the set up for this
model is basically we look at the log of
the price relatives y_t and we model the
residuals to not be of constant volatility,
but to be multiples of sort of white noise with mean 0
and variance 1, where sigma_t is given by this essentially
ARCH function, which says that the squared volatility at a given period t is a weighted sum of the squared residuals over the last p lags. And so if there's a
large residual then that could persist and
make the next observation have a large variance. And so this accommodates
some time dependence. Now this model actually
has parameter constraints, which are never a
nice thing to have when you're fitting models. In this case the parameters
alpha_one through alpha_p all have to be positive. And why do they
have to be positive? AUDIENCE: [INAUDIBLE]. PROFESSOR: Right. Variance is positive. So if any of these
alphas were negative, then there would be a
possibility that under this model that you could
have negative volatility, which you can't. So if we estimate this
model, we estimate it with the constraint that
all these parameter values are non-negative. So that does complicate
the estimation a bit. In terms of understanding
how this process works one can actually see how
the ARCH model implies an autoregressive model for
the squared residuals, which turns out to be useful. So the top line there
is the ARCH model saying that the variance
of the t period return is this weighted average
of the past residuals. And then if we simply add
a new variable u_t, which is our squared residual minus
its variance, to both sides we get the next line, which says
that epsilon_t squared follows an autoregression on itself,
with the u_t value being the disturbance in
that autoregression. Now u_t, which is epsilon_t
squared minus sigma squared t, what is the mean of that? The mean is 0. So it's almost white noise. But its variance is maybe
going to change over time. So it's not sort of
standard white noise, but it basically
has expectation 0. It's also conditionally independent, but there's some possible
variability there. But what this implies
is that there basically is an autoregressive
model where we just have time-varying variances
in the underlying process. Now because of that
one can sort of quickly evaluate whether there's
ARCH structure in data by simply fitting an
autoregressive model to the squared residuals. And testing whether
that regression is significant or not. And that formally is a
Lagrange multiplier test. Some of the original papers by
Engle go through that analysis. And the test statistic
turns out to just be a multiple of the r
squared for that regression fit. And basically under,
say, a null hypothesis that there isn't
any ARCH structure, then this regression model
should have no predictability. This ARCH model
in the residuals, basically if there's no time
dependence in those residuals, that's evidence of there being
an absence of ARCH structure. And so under the null
hypothesis of no ARCH structure that r squared statistic
should be small. It turns out that sort of n
times the r squared statistic with p variables
is asymptotically a chi-square distribution
with p degrees of freedom. So that's where that test
statistic comes into play. And in implementing this, the
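A hedged sketch of that Lagrange multiplier test, written by hand rather than with a packaged routine; the lag order p and the demeaning step are my assumptions.

```r
# Sketch: ARCH LM test -- regress squared residuals on p of their own lags and
# use n * R^2, asymptotically chi-squared with p degrees of freedom.
arch.lm.test <- function(r, p = 5) {
  e2 <- (r - mean(r))^2                   # squared (demeaned) returns
  X  <- embed(e2, p + 1)                  # column 1: e2_t; columns 2..p+1: lags
  fit  <- lm(X[, 1] ~ X[, -1])
  n    <- nrow(X)
  stat <- n * summary(fit)$r.squared
  c(statistic = stat,
    p.value = pchisq(stat, df = p, lower.tail = FALSE))
}
```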
And in implementing this, we are applying essentially least squares with the autoregression to carry out this Lagrange multiplier test, implicitly making the Gauss-Markov assumptions in fitting that. This corresponds to the notion
of quasi-maximum likelihood estimates for
unknown parameters. And quasi-maximum
likelihood estimates are used extensively in some
stochastic volatility models. And so essentially situations
where you sort of use the normal approximation, or
the second order approximation, to get your estimates,
and they turn out to be consistent and decent. All right. Let's go to Maximum
Likelihood Estimation. OK. Maximum Likelihood
Estimation basically involves-- the hard part
is defining the likelihood function, which is the
density of the data given the unknown parameters. In this case, the data are
conditionally independent. The joint density is the product
of the density of y_t given the information at t minus 1. So basically the joint
probability density is the density at each time
point conditional on the past, and then the density times the
density of the next time point conditional on the past. And those are all
normal random variables. So these are the normal
PDFs coming into play here. And so what we want
to do is basically maximize this
likelihood function subject to these constraints. And we already went
through the fact that the alpha_i's have
to be greater than zero. And it turns out you
also have to have that the sum of the
alphas is less than one. Now what would happen
if the sum of the alphas was not less than one? AUDIENCE: [INAUDIBLE]. PROFESSOR: Right. And you basically could have
the process start diverging. Basically these
autoregressions can explode. So let's go through and see. Let's see. Actually, we're going to
go to GARCH models next. OK. Let's see. Let me just go
back here a second. OK. Very good. OK. In the remaining few minutes
let me just introduce you to the GARCH models. The GARCH model is
basically a series of past values of the
squared volatilities, basically the q sum of
past squared volatilities for the equation for the
volatility sigma t squared. And so it may be
that very high order ARCH models are
actually important. Or very high order ARCH terms
are found to be significant when you fit ARCH models. It could be that
much of that need is explained by adding
these GARCH terms. And so let's just consider
a simple GARCH model where we have only a first order ARCH
term and a first order GARCH term. So we're basically
saying that this is a weighted average of
the previous volatility, the new squared residual. And this is a very
parsimonious representation that actually ends up fitting
data quite, quite well. And there are various
properties of this GARCH model which we'll go
through next time, but I want to just
close this lecture by showing you fits of the ARCH
models and of this GARCH model to the euro/dollar
exchange rate process. So let's just look at that here. OK. OK. With the euro/dollar
exchange rate, actually there's
the graph here which shows the
auto-correlation function and the partial
auto-correlation function of the squared returns. So is there dependence in
these daily volatilities? And basically these blue
lines are plus or minus two standard deviations of
the correlation coefficient. Basically we have highly
significant auto-correlations and very highly significant
partial auto-correlations, which suggests if you're
familiar with ARMA process that you would need a very
high order ARMA process to fit the squared residuals. But this highlights how
with the statistical tools you can actually identify this
time dependence quite quickly. And here's a plot of the ARCH
order one model and the ARCH order two model. And on each of
these I've actually drawn a solid line where the
sort of constant variance model would be. So ARCH is saying that we
have a lot of variability about that constant mean. And a property, I guess,
of these ARCH models is that they all have
sort of a minimum value for the volatility that
they're estimating. If you look at
the ARCH function, that alpha_0 now is--
the constant term is basically the minimum
value, which that can be. So there's a constraint
sort of on the lower value. Then here's an
ARCH(10) fit which, it doesn't look like it sort of
has quite as much of a uniform lower bound in it, but one could
go on and on with higher order ARCH terms, but rather than
doing that one can also fit just a GARCH(1,1) model. And this is what it looks like. So basically the time varying
volatility in this process is captured really,
really well with just this two-parameter GARCH model
as compared with a high order autoregressive model. And it sort of
highlights the issues with the Wold decomposition
where a potentially infinite order autoregressive
model will effectively fit most time series. Well, that's nice
to know, but it's nice to have a parsimonious
way of defining that infinite
collection of parameters and with the GARCH model
a couple of parameters do a good job. And then finally here's
just a simultaneous plot of all those volatility
estimates on the same graph. And so one can see the
increased flexibility basically of the GARCH models compared to
the ARCH models for capturing time-varying volatility. So all right. I'll stop there for today. And let's see. Next Tuesday is a presentation
from Morgan Stanley. And today's the last day to
sign up for a field trip.