The following content is
provided under a Creative Commons license. Your support will help
MIT OpenCourseWare continue to offer high quality
educational resources for free. To make a donation or
view additional materials from hundreds of MIT courses,
visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: OK. Well, last time I
was lecturing, we were talking about
regression analysis. And we finished up talking
about estimation methods for fitting regression models. I want to recap the method
of maximum likelihood, because this is really
the primary estimation method in statistical
modeling that you start with. And so let me just
review where we were. We have a normal linear
regression model. A dependent variable
y is explained by a linear combination
of independent variables given by a regression
parameter beta. And we assume that there are
errors about all the cases which are independent
identically distributed normal random variables. So because of that relationship,
the dependent variable vector y, which is an
n-vector, for n cases, is a multivariate
normal random variable. Now, the likelihood function is
equal to the density function for the data. And there's some
ambiguity really about how one manipulates
the likelihood function. The likelihood function
becomes defined once we've observed a sample of data. So in this expression for
the likelihood function as a function of beta
and sigma squared, we're considering evaluating
the probability density function for the
data conditional on the unknown parameters. So if this were simply a
univariate normal distribution with some unknown mean
and variance, then what we would have is
just a bell curve for mu centered around a
single observation y, if you look at the
likelihood function and how it varies with
the underlying mean of the normal distribution. So this likelihood
function is-- well, the challenge in maximum likelihood estimation is really calculating and computing the likelihood function. And with normal linear
regression models, it's very easy. Now, the maximum
likelihood estimates are those values that
maximize this function. And the question is, why
are those good estimates of the underlying parameters? Well, what those
estimates do is they are the parameter values for
which the observed data is most likely. So we're able to rank candidate parameter values by how likely those parameters could have generated these data values. So let's look at the
likelihood function for this normal linear
regression model. These first two lines here are
highlighting-- the first line is highlighting that
our response variable values are independent. They're conditionally
independent given the unknown parameters. And so the density of the
full vector of y's is simply the product of the density
functions for those components. And because this is a normal
linear regression model, each of the y_i's is
normally distributed. So what's in there
is simply the density function of a normal random
variable with mean given by the beta sum of independent
variables for each i, case i, given by the
regression parameters. And that expression
basically can be expressed in matrix form this way. And what we have is
the likelihood function ends up being a function
of our Q of beta, which was our least squares criteria. So the least squares
estimation is equivalent to maximum likelihood
estimation for the regression parameters if we have a normal
linear regression model. And there's this
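To make the equivalence concrete, here is a minimal numerical sketch (synthetic data; the variable names are mine, not from the lecture): the least squares solution also maximizes the normal log-likelihood, and any perturbation of it lowers the likelihood.

```python
import numpy as np

# Synthetic normal linear regression model: y = X beta + eps.
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares estimate: minimizes Q(beta) = ||y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def log_likelihood(beta, sigma2):
    """Normal log-likelihood of y given X, beta, sigma^2."""
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

# With sigma^2 held fixed, the OLS solution maximizes the likelihood;
# moving away from beta_hat can only decrease it.
s2 = np.mean((y - X @ beta_hat) ** 2)
ll_at_hat = log_likelihood(beta_hat, s2)
ll_perturbed = log_likelihood(beta_hat + 0.05, s2)
```

Here `ll_at_hat` exceeds `ll_perturbed`, illustrating that least squares estimation and maximum likelihood estimation coincide for the regression parameters under normal errors.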
extra term, minus n. Well, actually, if we're going
to maximize the likelihood function, we can also maximize
the log of the likelihood function, because that's
just a monotone function of the likelihood. And it's easier to maximize the
log of the likelihood function which is expressed here. And so we're able to
maximize over beta by minimizing Q of beta. And then we can maximize
over sigma squared given our estimate for beta. And that's achieved by
taking the derivative of the log-likelihood with
respect to sigma squared. So we basically have this
first order condition that finds the
maximum because things are appropriately convex. And taking that derivative
and solving for zero, we basically get
this expression. So this is just
taking the derivative of the log-likelihood with
respect to sigma squared. And you'll notice
here I'm taking the derivative with
respect to sigma squared as a parameter, not sigma. And that gives us that
the maximum likelihood estimate of the error variance
is Q of beta hat over n. So this is the sum of the
squared residuals divided by n. Now, I emphasize here
that that's biased. Who can tell me
why that's biased or why it ought to be biased? AUDIENCE: [INAUDIBLE]. PROFESSOR: OK. Well, it should be n
minus 1 if we're actually estimating one parameter. So if the independent variables
were, say, a constant, 1, so we're just estimating a
sample from a normal with mean beta 1 corresponding to
the units vector of the X, then we would have a one
degree of freedom correction to the residuals to get
an unbiased estimator. But what if we
have p parameters? Well, let me ask you this. What if we had n parameters
in our regression model? What would happen if
we had a full rank n independent variable matrix
and n independent observations? AUDIENCE: [INAUDIBLE]. PROFESSOR: Yes, you'd have
an exact fit to the data. So this estimate would be 0. And so clearly, if
the data do arise from a normal linear regression
model, 0 is not unbiased. And you need to have
some correction. Turns out you need
to divide by n minus the rank of the X
matrix, the degrees of freedom in the model, to get
an unbiased estimate. So this is an important
issue, highlights how the more parameters you add
in the model, the more precise your fitted values are. In a sense, there's
dangers of curve fitting which you want to avoid. But the maximum likelihood
estimates, in fact, are biased. You just have to
be aware of that. And when you're using
different software, fitting different
models, you need to know whether various corrections for bias are made or not. So this solves the
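A quick numerical sketch of the bias correction just discussed (synthetic data, illustrative names): the maximum likelihood estimate of the error variance divides the residual sum of squares by n, while dividing by n minus the rank of X gives the unbiased estimate.

```python
import numpy as np

# Synthetic normal linear regression model with p parameters.
rng = np.random.default_rng(1)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)   # Q(beta_hat), the residual sum of squares

sigma2_mle = rss / n            # ML estimate: biased, too small on average
sigma2_unbiased = rss / (n - p) # degrees-of-freedom correction: unbiased
```

The two estimates differ exactly by the factor (n - p)/n, so the more parameters you add, the more the uncorrected estimate understates the error variance.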
estimation problem for normal linear
regression models. And when we have normal
linear regression models, the theorem we
went through last time-- this is very important. Let me just go back and
highlight that for you. This theorem right here. This is really a very
important theorem indicating what is the
distribution of the least squares, now the maximum
likelihood estimates of our regression model? They are normally distributed. And the residuals, sum
of squares, have a chi squared distribution
with degrees of freedom given by n minus p. And we can look at how
much signal to noise there is in estimating
our regression parameters by calculating a t
statistic, which is take away from an estimate its
expected value, its mean, and divide through by an
estimate of the variability in standard deviation units. And that will have
a t distribution. So that's a critical
way to assess the relevance of different
explanatory variables in our model. And this approach will apply
with maximum likelihood estimation in all
kinds of models apart from normal linear
regression models. It turns out maximum
likelihood estimates generally are asymptotically
normally distributed. And so these properties here
will apply for those models as well. So let's finish up these
notes on estimation by talking about
generalized M estimation. So what we want to consider is
estimating unknown parameters by minimizing some
function, Q of beta, which is a sum of evaluations
of another function h, evaluated for each of
the individual cases. And choosing h to take on
different functional forms will define different
kinds of estimators. We've seen how when h
is simply the square of the case minus its
regression prediction, that leads to least squares,
and in fact, maximum likelihood estimation, as we saw before. Rather than taking the
square of the residual, the fitted residual,
we could take simply the modulus of that. And so that would be the
mean absolute deviation. So rather than summing
the squared deviations from the mean, we could
sum the absolute deviations from the mean. Now, from a
mathematical standpoint, if we want to solve
for those estimates, how would you go
about doing that? What methodology would you
use to maximize this function? Well, we try and apply
basically the same principles of if this is a
convex function, then we just want to take derivatives
of that and solve for that being equal to 0. So what happens when
you take the derivative of the modulus of y minus xi
beta with respect to beta? AUDIENCE: [INAUDIBLE]. PROFESSOR: What did you say? What did you say? AUDIENCE: Yeah, it's
not [INAUDIBLE]. The first [INAUDIBLE]
derivative is not continuous. PROFESSOR: OK. Well, this is not
a smooth function. But let me just plot x_i beta
here, and y_i minus that. Basically, this is going
to be a function that has slope 1 when it's positive
and slope minus 1 when it's negative. And so that will be true,
component-wise, or for the y. So what we end up
wanting to do is find the value of the
regression estimate that minimizes the
sum of predictions that are below the estimate plus
the sum of the predictions that are above the estimate given
by the regression line. And that solves the problem. Now, with the maximum
likelihood estimation, one can plug in minus log the
density of y_i given beta, x and sigma_i squared. And that function simply sums
to the log of the joint density for all the data. So that works as well. With robust M estimators, we can
consider another function chi which can be defined to have
good properties with estimates. And there's a whole theory
of robust estimation-- it's very rich-- which
talks about how best to specify this chi function. Now, one of the problems
with least squares estimation is that the squares
of very large values are very, very
large in magnitude. So there's perhaps
an undue influence of very large values, very large
residuals under least squares estimation and maximum likelihood estimation. So robust estimators
allow you to control that by defining the
function differently. Finally, there are
quantile estimators, which extend the mean
absolute deviation criterion. And so if we consider
the h function to be basically a
multiple of the deviation if the residual is positive
and a different multiple, a complementary multiple if
the deviation, the residual, is less than 0, then by varying tau, you end up getting quantile estimators, where the minimizing value estimates the tau quantile. So this general
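The different h functions can be sketched directly (a toy example with made-up data; the grid-search minimizer is mine, for illustration only): squared loss recovers the mean, absolute loss the median, and the tilted "check" loss the tau quantile.

```python
import numpy as np

def h_ls(u):
    """Least squares: squared residual."""
    return u ** 2

def h_mad(u):
    """Mean absolute deviation: modulus of the residual."""
    return np.abs(u)

def h_quantile(u, tau):
    """Check loss: tau*u for u >= 0, (tau - 1)*u for u < 0."""
    return u * (tau - (u < 0))

# Minimize the summed loss over a constant c by brute-force grid search.
y = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
candidates = np.linspace(0.0, 12.0, 1201)

def argmin_loss(h):
    losses = [np.sum(h(y - c)) for c in candidates]
    return candidates[int(np.argmin(losses))]

c_ls = argmin_loss(h_ls)                            # the sample mean
c_med = argmin_loss(h_mad)                          # the sample median
c_q25 = argmin_loss(lambda u: h_quantile(u, 0.25))  # the 0.25 quantile
```

Note that tau = 0.5 makes the check loss proportional to the absolute deviation, so the quantile estimators really do extend the mean absolute deviation criterion.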
class of M estimators encompasses most
estimators that we will encounter in fitting models. So that finishes the technical
or the mathematical discussion of regression analysis. Let me highlight for you--
there's a case study that I dragged to the desktop here. And I wanted to find that. Let me find that. There's a case study that's been
added to the course website. And this first one is on
linear regression models for asset pricing. And I want you to
read through that just to see how it applies to
fitting various simple linear regression models. And enter full screen. This case study begins by
introducing the capital asset pricing model, which
basically suggests that if you look at the
returns on any stocks in an efficient
market, then those should depend on the return
of the overall market but scaled by how
risky the stock is. And so if one looks
at basically what the return is on the
stock on the right scale, you should have a simple
linear regression model. So here, we just look at
a time series for GE stock in the S&P 500. And the case study guides you through how you can actually collect this data
on the web using R. And so the case notes
provide those details. There's also the
three-month treasury rate which is collected. And so if you're
thinking about return on the stock versus return
on the index, well, what's really of interest is the excess
return over a risk-free rate. And the efficient
markets models, basically the excess
return of a stock is related to the excess
return of the market as given by a linear
regression model. So we can fit this model. And here's a plot of the excess
returns on a daily basis for GE stock versus the market. So that looks like a
nice sort of point cloud for which a linear
model might fit well. And it does. Well, there are
regression diagnostics, which I'll get to-- well, there
are regression diagnostics which are detailed in the
problem set, where we're looking at how influential are
individual observations, what's their impact on
regression parameters. This display here
basically highlights with a very simple
linear regression model what are the
influential data points. And so I've highlighted
in red those values which are influential. Now, if you look at the
definition of leverage in a linear model,
it's very simple. A simple linear model is
just those observations that are very far from the
mean have large leverage. And so you can confirm
that with your answers to the problem set. This x indicates a
significantly influential point in terms of the
regression parameters given by Cook's distance. And that definition is also
given in the case notes. AUDIENCE: [INAUDIBLE]. PROFESSOR: By computing
the individual leverages with a function
that's given here, and by selecting out those
that exceed a given magnitude. Now, this is a very simple model, with stocks depending on one risk factor, given by the market. In modeling equity
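The leverage and Cook's distance computations just described can be sketched as follows (the case study itself uses R on real returns; this is a Python sketch on made-up data): leverage is the diagonal of the hat matrix, and in a simple linear regression it is largest for observations far from the mean of x.

```python
import numpy as np

# Simple linear regression with one point far from the mean of x.
rng = np.random.default_rng(2)
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])   # last case is far from the mean
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 0.5 * x + rng.normal(scale=0.1, size=x.size)

# Hat matrix H = X (X'X)^{-1} X'; its diagonal gives the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Cook's distance combines the residual size with the leverage.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
p = X.shape[1]
s2 = resid @ resid / (x.size - p)
cooks_d = resid ** 2 / (p * s2) * leverage / (1 - leverage) ** 2
```

The leverages sum to the rank of X (here 2), and the outlying x value carries the largest leverage, which is exactly the pattern highlighted in red in the case study's plot.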
returns, there are many different factors that
can have an impact on returns. So what I've done
in the case study is to look at adding
another factor which is just the return on crude oil. And so-- I need to go down here. So let me highlight
something for you here. With GE stock, what would you
expect the impact of, say, a high return on crude oil to
be on the return of GE stock? Would you expect it to
be positively related or negatively related? OK. Well, GE is a stock that's
just a broad stock invested in many different industries. And it really reflects the
overall market, to some extent. Many years ago,
10, 15 years ago, GE represented maybe 3% of
the GNP of the US market. So it was really highly related
to how well the market does. Now, crude oil is a commodity. And oil is used to drive cars,
to fuel energy production. So if you have an
increase in oil prices, then the cost of essentially
doing business goes up. So it is associated with
an inflation factor. Prices are rising. So if you can see here,
the regression estimate, if we add in a factor of
the return on crude oil, it's negative 0.03. And it has a t value
of minus 3.561. So in fact, the market, in
a sense, over this period, for this analysis, was not
efficient in explaining the return on GE; crude oil
is another independent factor that helps explain returns. So that's useful to know. And if you are clever about
defining and identifying and evaluating
different factors, you can build
factor asset pricing models that are
very, very useful for investing and trading. Now, as a comparison
to this case study, I also applied the same analysis to Exxon Mobil. Now, Exxon Mobil
is an oil company. So let me highlight this here. We basically are
fitting this model. Now let's highlight it. Here, if we consider
this two-factor model, the regression
parameter corresponding to the crude oil factor is
plus 0.13 with a t value of 16. So crude oil definitely
has an impact on the return of Exxon Mobil,
because it goes up and down with oil prices. This case study closes
with a scatter plot of the independent variables
and highlighting where the influential values are. And so just in the same way that
with a simple linear regression it was those that were far
away from the mean of the data were influential, in a
multivariate setting-- here, it's bivariate-- the
influential observations are those that are very
far away from the centroid. And if you look at one of the
problems in the problem set, it actually goes
through and you can see where these
leveraged values are and how it indicates influences
associated with the Mahalanobis distance of cases
from the centroid of the independent variables. So if you're a visual
type mathematician as opposed to an algebraic
type mathematician, I think these
kinds of graphs are very helpful in understanding
what is really going on. And the degree of influence
is associated with the fact that we're basically taking
least squares estimates, so we have the quadratic
form associated with the overall process. There's another
case study that I'll be happy to discuss after
class or during office hours. I don't think we have time
today during the lecture. But it concerns
exchange rate regimes. And the second case study
looks at the Chinese yuan, which was basically pegged
to the dollar for many years. And then I guess through
political influence from other countries,
they started to let the yuan vary
from the dollar, but perhaps pegged
it to some basket of securities-- of currencies. And so how would you determine
what that basket of currencies is? Well, there are
regression methods that have been
developed by economists that help you do that. And that case study goes
through the analysis of that. So check that out to see how
you can get immediate access to currency data and be
fitting these regression models and looking at the
different results and trying to evaluate those. So let's turn now
to the main topic-- let's see here-- which
is time series analysis. Today in the rest
of the lecture, I want to talk about univariate
time series analysis. And so we're thinking of
basically a random variable that is observed over time and
it's a discrete time process. And we'll introduce you
to the Wold representation theorem and definitions
of stationarity and its relationship there. Then, look at the classic
models of autoregressive moving average models. And then extending those
to non-stationarity with integrated autoregressive
moving average models. And then finally, talk about
estimating stationary models and how we test
for stationarity. So let's begin from
basically first principles. We have a stochastic process,
a discrete time stochastic process, X, which consists
of random variables indexed by time. And we're thinking
now discrete time. The stochastic behavior
of this sequence is determined by specifying
the density or probability mass functions for all finite
collections of time indexes. And so if we could specify
all finite-dimensional distributions of
this process, we would specify this
probability model for the stochastic process. Now, this stochastic process
is strictly stationary if the density function for
any collection of times, t_1 through t_m, is equal to
the density function for a tau translation of that. So the density function for any
finite-dimensional distribution is stationary, is constant
under arbitrary translations. So that's a very
strong property. But it's a reasonable
property to ask for if you're doing statistical modeling. And what do you want to do
when you're estimating models? You want to estimate
things that are constant. Constants are nice
things to estimate. And parameters of
models are constant. So we really want the underlying
structure of the distributions to be the same. That was strict
stationarity, which requires knowledge of
the entire distribution of the stochastic process. We're now going to introduce
a weaker definition, which is covariance stationarity. And a covariance
stationary process has a constant mean,
mu; a constant variance, sigma squared; and a
covariance over increments tau, given by a function gamma of
tau, that is also constant. Gamma isn't a constant function,
but basically for all t, covariance of X_t, X_(t+tau)
is this gamma of tau function. And we also can introduce
the autocorrelation function of the stochastic
process, rho of tau. And so the correlation
of two random variables is the covariance of those
random variables divided by the square root of the
product of the variances. And Choongbum I think
introduced that a bit in one of his lectures,
where we were talking about the correlation function. But essentially, the
correlation function is if you standardize the
data or the random variables to have mean 0-- so
subtract off the means and then divide through by
their standard deviations. So those translated variables
have mean 0 and variance 1. Then the correlation
coefficient is the covariance between those standardized
random variables. So this is going to come up
again and again in time series analysis. Now, the Wold
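Sample versions of the autocovariance and autocorrelation functions just defined can be sketched like this (illustrative helper functions, not from the lecture):

```python
import numpy as np

def sample_autocov(x, tau):
    """gamma_hat(tau) = (1/n) * sum_t (x_t - xbar)(x_{t+tau} - xbar)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.sum(xc[: n - tau] * xc[tau:]) / n

def sample_autocorr(x, tau):
    """rho_hat(tau) = gamma_hat(tau) / gamma_hat(0)."""
    return sample_autocov(x, tau) / sample_autocov(x, 0)

# For white noise, rho(0) = 1 and rho(tau) is near zero for tau != 0.
rng = np.random.default_rng(3)
white = rng.normal(size=5000)
rho0 = sample_autocorr(white, 0)
rho1 = sample_autocorr(white, 1)
```

Standardizing by gamma_hat(0) is exactly the mean-zero, variance-one standardization described above, so rho_hat(0) is identically 1.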
representation theorem is a very, very powerful theorem
about covariance stationary processes. It basically states that if
we have a zero-mean covariance stationary time
series, then it can be decomposed into two
components with a very nice structure. Basically, X_t can be
decomposed into V_t plus S_t. V_t is going to be a linearly
deterministic process, meaning that past values of
V_t perfectly predict what V_t is going to be. So this could be like a
linear trend or some fixed function of past values. It's basically a
deterministic process. So there's nothing
random in V_t. It's something that's
fixed, without randomness. And S_t is a sum
of coefficients, psi_i times eta_(t-i), where
the eta_t's are linearly unpredictable white noise. So what we have is S_t
is a weighted average of white noise with
coefficients given by the psi_i. And the coefficients psi_i
are such that psi_0 is 1, and the sum of the
squared psi_i's is finite. And the white noise
eta_t-- what's white noise? It has expectation zero. It has variance, given by
sigma squared, that's constant. And it has covariance across
different white noise elements that's 0 for all t and s. So eta_t's are uncorrelated
with themselves, and of course, they
are uncorrelated with the deterministic process. So this is really a very,
very powerful concept. If you are modeling
a process and it has covariance
stationarity, then there exists a representation
like this of the function. So it's a very
compelling structure, which we'll see how it applies
in different circumstances. Now, before getting into the
definition of autoregressive moving average
models, I just want to give you an intuitive
understanding of what's going on with the Wold decomposition. And this, I think,
will help motivate why the Wold
decomposition should exist from a mathematical standpoint. So consider just some
univariate stochastic process, some time series X_t
that we want to model. And we believe that it's
covariance stationary. And so we want to
specify essentially the Wold decomposition of that. Well, what we could
do is initialize a parameter p, the number
of past observations, in the linearly
deterministic term. And then estimate the linear
projection of X_t on the last p lag values. And so what I want to do
is consider estimating that relationship using
a sample of size n with some ending point t_0
less than or equal to T. And so we can consider y
values like a response variable being given by the successive
values of our time series. And so our response variables
y_j can be considered to be x t_0 minus n plus j. And define a y vector and
a Z matrix as follows. So we have values of our
stochastic process in y. And then our Z matrix,
which is essentially a matrix of
independent variables, is just the lagged
values of this process. So let's apply
ordinary least squares to specify the projection. This projection matrix
should be familiar now. And that basically gives
us a prediction of y hat depending on p lags. And we can compute the
projection residual from that fit. Well, we can conduct
time series methods to analyze these residuals,
which we'll be introducing here in a few minutes, to specify
a moving average model. We can then have estimates of
the underlying coefficients psi and estimates of
these residuals eta_t. And then we can evaluate whether
this is a good model or not. What does it mean to be
an appropriate model? Well, the residual should
be orthogonal to longer lags than t minus s, or
longer lags than p. So we basically shouldn't
have any dependence of our residuals on lags
of the stochastic process that weren't included
in the model. Those should be orthogonal. And the eta_t hats should be
consistent with white noise. So those issues
can be evaluated. And if there's
evidence otherwise, then we can change the
specification of the model. We can add additional lags. We can add additional
deterministic variables if we can identify
what those might be. And proceed with this process. But essentially that is
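The projection recipe just described can be sketched numerically (a simulation of my own, assuming an AR(2) data-generating process for illustration): regress X_t on its last p lagged values by ordinary least squares, then check that the residuals look like white noise.

```python
import numpy as np

rng = np.random.default_rng(4)
T, p = 2000, 2

# Simulate a stationary AR(2): x_t = 0.5 x_{t-1} - 0.3 x_{t-2} + eta_t.
eta = rng.normal(size=T)
x = np.zeros(T)
for t in range(2, T):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + eta[t]

# Response vector y and lagged design matrix Z, as in the slides.
y = x[p:]
Z = np.column_stack([x[p - j : T - j] for j in range(1, p + 1)])

# Ordinary least squares projection of X_t on its last p lags.
phi_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ phi_hat

# White-noise check: the residuals should be nearly uncorrelated at lag 1.
rc = resid - resid.mean()
lag1_corr = (rc[:-1] * rc[1:]).sum() / (rc * rc).sum()
```

With p chosen large enough, the fitted projection coefficients recover the dependence structure and the projection residuals are consistent with white noise, which is the evaluation step described above.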
how the Wold decomposition could be implemented. And theoretically, as
our sample gets large, if we're observing this time
series for a long time, then well certainly the
limit of the projections as p, the number of lags
we include, gets large, should be essentially
the projection of our data on its history. And that, in fact, is the
projection corresponding to, defining, the
coefficient's psi_i. And so in the limit, that
projection will converge and it will converge
in the sense that the coefficients of
the projection definition correspond to the psi_i. And now if p goes to
infinity is required, now p means that there's
basically a long term dependence in the process. Basically, it doesn't
stop at a given lag. The dependence
persists over time. Then we may require
that p goes to infinity. Now, what happens when
p goes to infinity? Well, if you let p go
to infinity too quickly, you run out of
degrees of freedom to estimate your models. And so from an
implementation standpoint, you need to let p/n
go to 0 so that you have essentially more
data than parameters that you're estimating. And so that is required. And in time series
modeling, what we look for are models where
finite values of p are required. So we're only estimating a
finite number of parameters. Or if we have a moving
average model which has coefficients that
are infinite in number, perhaps those can be defined by
a small number of parameters. So we'll be looking for
that kind of feature in different models. Let's turn to talking
about the lag operator. The lag operator is
a fundamental tool in time series models. We consider the operator L
that shifts a time series back by one time increment. And applying this
operator recursively, we get, if it's operating
0 times, there's no lag, one time, there's
one lag, two times, two lags-- doing
that iteratively. And in thinking of these,
what we're dealing with is like a transformation on
infinite dimensional space, where it's like
the identity matrix sort of shifted by
one element-- or not the identity, but an element. It's like the identity
matrix shifted by one column or two columns. So anyway, inverses
of these operators are well defined in terms
of what we get from them. So we can represent
the Wold representation in terms of these lag
operators by saying that our stochastic
process X_t is equal to V_t plus this
psi of L function, basically a
functional of the lag operator, which is a potentially
infinite-order polynomial of the lags. So this notation is
something that you need to get very
familiar with if you're going to be comfortable with
the different models that are introduced with
ARMA and ARIMA models. Any questions about that? Now relating to
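The action of a lag polynomial can be sketched as a small helper (my own illustrative function): psi(L) applied to a series eta gives x_t = sum_i psi_i eta_{t-i}.

```python
import numpy as np

def apply_lag_polynomial(psi, eta):
    """Return x with x[t] = sum_i psi[i] * eta[t - i] (zeros before t = 0)."""
    psi = np.asarray(psi, dtype=float)
    eta = np.asarray(eta, dtype=float)
    x = np.zeros_like(eta)
    for i, c in enumerate(psi):
        if i == 0:
            x += c * eta          # L^0 is the identity
        else:
            x[i:] += c * eta[:-i] # L^i shifts the series back i steps
    return x

# Example: (1 - L) applied to a linear trend differences it to a constant.
trend = np.arange(6, dtype=float)               # 0, 1, 2, 3, 4, 5
diffed = apply_lag_polynomial([1.0, -1.0], trend)
```

This is the basic manipulation behind the phi(L) and theta(L) polynomials in the ARMA notation that follows.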
this-- let me just introduce now, because this
will come up somewhat later. But there's the impulse
response function of the covariance
stationary process. If we have a stochastic process
X_t which is given by this Wold representation, then
you can ask yourself what happens to the innovation
at time t, which is eta_t, how does that affect
the process over time? And so, OK, pretend that you are
chairman of the Federal Reserve Bank. And you're interested in the GNP
or basically economic growth. And you're considering
changing interest rates to help the economy. Well, you'd like to
know what an impact is of your change in
this factor, how that's going to affect the
variable of interest, perhaps GNP. Now, in this case,
we're thinking of just a simple covariance
stationary stochastic process. It's basically a process that
is a random-- a weighted sum, a moving average of
innovations eta_t. But the question is, basically
any covariance stationary process could be
represented in this form. And the impulse
response function relates to what is
the impact of eta_t. What's its impact over time? Basically, it affects
the process at time t. That, because of the
moving average process, it affects it at t plus
1, affects it at t plus 2. And so this impulse
response is basically the derivative of the
value of the process with the j-th previous
innovation is given by psi_j. So the different
innovations have an impact on the current value given by
this impulse response function. So looking backward,
that definition is pretty well defined. But you can also
think about how does an impact of the
innovation affect the process going forward. And the long-run
cumulative response is essentially what is the
impact of that innovation in the process ultimately? And eventually, it's
not going to change the value of the process. But what is the value to
which the process is moving because of that one innovation? And so the long run
cumulative response is given by basically the
sum of these individual ones. And it's given by the
sum of the psi_i's. So that's the polynomial of
psi with lag operator, where we replace the lag operator by 1. We'll see this
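As a concrete illustration of the impulse response and the long-run cumulative response (an AR(1) example of my own, not from the slides): for x_t = phi x_{t-1} + eta_t, the Wold coefficients are psi_j = phi^j, so a unit innovation decays geometrically and the cumulative response approaches psi(1) = 1/(1 - phi).

```python
import numpy as np

phi = 0.6
T = 50

# Feed in a single unit impulse eta_0 = 1 and no further innovations.
x = np.zeros(T)
x[0] = 1.0
for t in range(1, T):
    x[t] = phi * x[t - 1]

irf = x                   # impulse response: psi_j = phi**j
long_run = irf.sum()      # long-run cumulative response, approx 1/(1 - phi)
```

Summing the impulse responses is exactly the "replace the lag operator by 1" evaluation psi(1); here that limit is 1/(1 - 0.6) = 2.5.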
again when we talk about vector
autoregressive processes with multivariate time series. Now, the Wold
representation, which is a infinite-order moving
average, possibly infinite order, can have an
autoregressive representation. Suppose that there is
another polynomial psi_i star of the lags, which we're
going to call psi inverse of L, which satisfies the fact if you
multiply that with psi of L, you get the identity lag 0. Then this psi inverse,
if that exists, is basically the
inverse of the psi of L. So if we start with psi of
L, if that's invertible, then there exists
a psi inverse of L, with coefficients psi_i star. And one can basically take
our original expression for the stochastic process,
which is as this moving average of the eta's, and express it
as this essentially moving averages of the X's. And so we've essentially
inverted the process and shown that the
stochastic process can be expressed as an infinite
order autoregressive representation. And so this infinite order
autoregressive representation corresponds to that intuitive
understanding of how the Wold representation exists. And it actually works with the--
the regression coefficients in that projection several
slides back corresponds to this inverse operator. So let's turn to some
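The inversion of a lag polynomial can be sketched by matching terms in psi*(L) psi(L) = 1, which is a convolution recursion (an illustrative helper of mine, shown here on an MA(1) where the inverse is known in closed form):

```python
import numpy as np

def invert_lag_polynomial(psi, n_terms):
    """Coefficients of psi(L)^{-1}, found by matching terms in
    psi*(L) psi(L) = 1 (a convolution recursion)."""
    psi = np.asarray(psi, dtype=float)
    inv = np.zeros(n_terms)
    inv[0] = 1.0 / psi[0]
    for k in range(1, n_terms):
        # The lag-k coefficient of the product must vanish.
        s = sum(psi[j] * inv[k - j] for j in range(1, min(k, len(psi) - 1) + 1))
        inv[k] = -s / psi[0]
    return inv

# For psi(L) = 1 + theta L, the inverse is 1 - theta L + theta^2 L^2 - ...
theta = 0.4
inv = invert_lag_polynomial([1.0, theta], 6)
```

Convolving the truncated inverse back against [1, theta] reproduces the identity polynomial up to the truncation, which is the defining property psi*(L) psi(L) = 1.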
specific time series models that are widely used. The class of autoregressive
moving average processes has this mathematical
definition. We define the X_t to be equal
to a linear combination of lags of X, going back p
lags, with coefficients phi_1 through phi_p. And then there are
residuals which are expressed in terms of a
q-th order moving average. So in this framework, the
eta_t's are white noise. And white noise, to reiterate,
has mean 0, constant variance, zero covariance between those. In this representation, I've
simplified things a little bit by subtracting off the
mean from all of the X's. And that just makes the formulas
a little bit more simpler. Now, with lag operators, we
can write this ARMA model as phi of L, p-th order
polynomial of lag L given with coefficients 1,
phi_1 up to phi_p, and theta of L given
by 1, theta_1, theta_2, up to theta_q. This is basically
a representation of the ARMA time series model. Basically, we're
taking a set of lags of the values of the stochastic
process up to order p. And that's equal to a weighted
average of the eta_t's. If we multiply by the inverse
of phi of L, if that exists, then we get this
representation here, which is simply the
Wold decomposition. So the ARMA models basically
have a Wold decomposition if this phi of L is invertible. And we'll explore
these by looking at simpler cases
of the ARMA models by just focusing on
autoregressive models first and then moving
average processes second so that
you'll get a better feel for how these things are
manipulated and interpreted. So let's move on to the p-th
order autoregressive process. So we're going to consider
ARMA models that just have autoregressive terms in them. So we have phi of L X_t
minus mu is equal to eta_t, which is white noise. So a linear combination of
the series is white noise. And X_t follows then a linear
regression model on explanatory variables, which are
lags of the process X. And this could be expressed
as X_t equal to c plus the sum from 1 to p of phi_j X_(t-j),
which is a linear regression model with regression
parameters phi_j. And c, the constant term, is
equal to mu times phi of 1. Now, if you take expectations of the process, you have coefficients of mu coming in from all the terms, and collecting them gives phi of 1 times mu as the constant term. So with this
autoregressive model, we now want to go over what are
the stationarity conditions. Certainly, stationarity is not automatic here: a simple random walk follows an autoregressive model but is not stationary. We'll highlight that in a minute as well. But if you think about it, that's true. And so stationarity is something
to be understood and evaluated. This polynomial
function phi, where if we replace the
lag operator L by z, a complex variable, the
equation phi of z equal to 0 is the characteristic
equation associated with this autoregressive model. And it turns out that we'll
be interested in the roots of this characteristic equation. Now, if we consider
writing phi of L as a function of the
roots of the equation, we get this expression
where you'll notice if you multiply
all those terms out, the 1's all multiply out
together, and you get 1. And with the lag operator
L to the p-th power, that would be the product
of 1 over lambda_1 times 1 over lambda_2,
or actually negative 1 over lambda_1 times
negative 1 over lambda_2, and so forth-- negative
1 over lambda_p. Basically, if there are
p roots to this equation, this is how it would
be written out. And the process
X_t is covariance stationary if and
only if all the roots of this characteristic equation
lie outside the unit circle. So what does that mean? It means that the modulus of each complex root is greater than 1; points on or inside the unit circle have modulus less than or equal to 1. So if the roots are outside the unit circle, then the modulus of each of the lambda_j's is greater than 1. And if we then consider
taking a complex number lambda, basically
the root, and have an expression for 1 minus
1 over lambda L inverse, we can get this series
expression for that inverse. And that series will exist and
be bounded if the lambda_i are greater than 1 in magnitude. So we can actually compute
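To make the root condition concrete, here is a minimal check (a sketch; the helper names are mine) that computes the characteristic roots of an AR polynomial with numpy and tests whether they all lie outside the unit circle:

```python
import numpy as np

def ar_char_roots(phis):
    """Roots of the characteristic equation
    phi(z) = 1 - phi_1 z - ... - phi_p z^p = 0.
    np.roots expects coefficients in descending powers of z."""
    coeffs = np.concatenate([-np.array(phis)[::-1], [1.0]])
    return np.roots(coeffs)

def is_covariance_stationary(phis):
    # Stationary iff every root lies strictly outside the unit circle.
    return bool(np.all(np.abs(ar_char_roots(phis)) > 1.0))

print(is_covariance_stationary([0.5]))  # AR(1), phi = 0.5: root z = 2 -> True
print(is_covariance_stationary([1.0]))  # random walk: root z = 1 -> False
```

The random walk case shows exactly why phi equal to 1 fails the condition: the root sits on the unit circle rather than outside it.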
an inverse of phi of L by taking the inverse
of each of the component products in that polynomial. So in introductory
time series courses, they talk about
stationarity and unit roots, but they don't really get into it, because people may not have the background in complex analysis and polynomial roots. So anyway, this
is just very simply how that framework is applied. So we have a
polynomial equation, the characteristic equation,
whose roots we're looking for. Those roots have to
be outside the unit circle for stationarity
of the process. Well, it's basically
conditions for invertibility of the process, of the
autoregressive process. And that invertibility renders
the process an infinite-order moving average process. So let's go through
these results for the autoregressive
process of order one, where things-- always start
with the simplest cases to understand things. The characteristic equation
for this model is just 1 minus phi z. The root is lambda equal to 1/phi. If the modulus of lambda is greater than 1, meaning the root is outside the unit circle, then the modulus of phi is less than 1. So for covariance stationarity of this autoregressive process, we need the magnitude of phi to be less than 1. The expected value of X is mu. The variance of X
is sigma squared X, which has the form sigma squared over 1 minus phi squared. That expression is basically obtained by looking at the infinite-order moving average representation. But notice that whenever phi is nonzero, 1 minus phi squared is less than 1, so the variance of X is actually greater than the variance of the innovations. The innovation variance basically is scaled up in the autoregressive process. The covariance at lag 1 is phi times sigma squared X. You'll be going through this in the problem set. And more generally, the covariance of X at lag j is phi to the j power times sigma squared X. And these expressions can
all be easily evaluated by simply writing out the
definition of these covariances in terms of the original
model and looking at what terms are independent,
cancel out, and that proceeds. Let's just go
through these cases. Let's show it all here. So we have if phi
is between 0 and 1, then the process experiences
exponential mean reversion to mu. So an autoregressive
process with phi between 0 and 1 corresponds to a
mean-reverting process. This process is
actually one that has been used theoretically
for interest rate models and a lot of theoretical
work in finance. The Vasicek model is
actually an example of the Ornstein-Uhlenbeck
process, which is basically a
mean-reverting Brownian motion. And any variables
that exhibit or could be thought of as
exhibiting mean reversion, this model can be
applied to those processes, such as interest rate
spreads or real exchange rates, variables where one can
expect that things never get too large or too small. They come back to some mean. Now, the challenge
is that this may be true over short periods of time. But over very long periods of time, the point to which you're reverting changes. So these models tend to
not have broad application over long time ranges. You need to adapt. Anyway, with the AR
process, we can also have negative
values of phi, which results in exponential mean
reversion that's oscillating in time, because the
autoregressive coefficient basically is a negative value. And for phi equal to 1, the Wold
decomposition doesn't exist. And the process is the
simple random walk. So basically, if
phi is equal to 1, that means that basically just
changes in value of the process are independent and identically
distributed white noise. And that's the
random walk process. And that process, as was
covered in earlier lectures, is non-stationary. If phi is greater than 1, then
you have an explosive process, because basically the
values are scaling up every time increment. So those are features
of the AR(1) model. For a general autoregressive
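These AR(1) moment formulas are easy to confirm by simulation. A minimal sketch (the parameter values phi = 0.6 and sigma = 1 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, sigma, n = 0.6, 1.0, 200_000

# Simulate X_t = phi * X_{t-1} + eta_t (taking mu = 0 for simplicity).
eta = rng.normal(0.0, sigma, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eta[t]
x = x[1_000:]  # drop burn-in so the start-up transient is gone

var_theory = sigma**2 / (1 - phi**2)  # stationary variance
gamma1_theory = phi * var_theory      # lag-1 autocovariance
print(x.var(), var_theory)            # both near 1.5625
print(np.cov(x[1:], x[:-1])[0, 1], gamma1_theory)  # both near 0.9375
```

Note that the sample variance exceeds the innovation variance of 1, as the formula predicts for any nonzero phi.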
process of order p, there's a method-- well, we
can look at the second order moments of that process, which
have a very nice structure, and then use those to
solve for estimates of the ARMA parameters, or
autoregressive parameters. And those happen to be
specified by what are called the Yule-Walker equations. So the Yule-Walker equations
is a standard topic in time series analysis. What is it? What does it correspond to? Well, we take our original
autoregressive process of order p. And we write out the
formulas for the covariance at lag j between
two observations. So what's the covariance
between X_t and X_(t-j)? And that expression is
given by this equation. And so this equation for gamma
of j is determined simply by evaluating the expectations
where we're taking the expectation of X_t in the
autoregressive process times the quantity X_(t-j) minus mu. So just evaluating
those terms, you can validate that
this is the equation. If we look at the equations
corresponding to j equals 1-- so lag 1 up through
lag p-- this is what those equations look like. Basically, the left-hand side
is gamma_1 through gamma_p. The covariance to
lag 1 up to lag p is equal to basically
linear functions given by the phi of
the other covariances. Who can tell me what the
structure is of this matrix? It's not a diagonal matrix? What kind of matrix is this? Math trivia question here. It has a special name. Anyone? It's a Toeplitz matrix. The off diagonals are
all the same value. And in fact, because of the
symmetry of the covariance, basically the gamma of 1 is
equal to gamma of minus 1. Gamma of minus 2 is equal to gamma of plus 2. Because of the covariance stationarity, it's actually also symmetric. So these equations allow
us to solve for the phis so long as we have estimates
of these covariances. So if we have a
system of estimates, we can plug these in and attempt to solve this. If they're consistent
estimates of the covariances, then there will be a solution. And then the 0th
equation, which was not part of the series
of equations-- if you go back and look
at the 0th equation, that allows you to get an estimate
for the sigma squared. So these Yule-Walker
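As a sketch of how this works in practice (the helper functions below are mine; statistics packages wrap the same idea), one can compute sample autocovariances, build the Toeplitz system, and solve for the autoregressive coefficients and the innovation variance:

```python
import numpy as np

def sample_autocov(x, max_lag):
    """gamma_hat(j) = (1/n) * sum (x_t - xbar)(x_{t-j} - xbar)."""
    x = np.asarray(x) - np.mean(x)
    n = len(x)
    return np.array([x[: n - j] @ x[j:] / n for j in range(max_lag + 1)])

def yule_walker(x, p):
    """Method-of-moments AR(p) fit: solve Gamma * phi = gamma,
    where Gamma is the Toeplitz matrix of autocovariances."""
    g = sample_autocov(x, p)
    Gamma = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma, g[1:])
    sigma2 = g[0] - phi @ g[1:]  # from the lag-0 (0th) equation
    return phi, sigma2

# Check on a simulated AR(2) with phi = (0.5, 0.25) and sigma = 1.
rng = np.random.default_rng(1)
n = 100_000
x = np.zeros(n)
eta = rng.normal(size=n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] + 0.25 * x[t - 2] + eta[t]
phi_hat, sigma2_hat = yule_walker(x, 2)
print(phi_hat, sigma2_hat)  # near [0.5, 0.25] and 1.0
```

The last line of `yule_walker` is exactly the 0th equation mentioned above, which recovers the estimate of sigma squared.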
equations are the way in which many ARMA
models are specified in different statistics packages
and in terms of what principles are being applied. Well, if we're using unbiased
estimates of these parameters, then this is applying
what's called the method of moments principle
for statistical estimation. With complicated models, the likelihood functions are sometimes very hard to specify and compute, and doing optimization over them is even harder. In such cases, it can turn out that
there are relationships between the moments of the
random variables, which are functions of the
unknown parameters. And you can solve for basically
the sample moments equalling the theoretical moments
and you apply the method of moments estimation method. Econometrics is rich with many
applications of that principle. The next section goes through
the moving average model. Let me highlight this. So with an order
q moving average, we basically have a polynomial
in the lag operator L, which operates upon the eta_t's. And if you write out
the expectations of X_t, you get mu. The variance of X_t,
which is gamma 0, is sigma squared times 1 plus the sum of the squares of the coefficients in the polynomial. And so this feature,
this property here is due to the fact that we have
uncorrelated innovations in the eta_t's. The eta_t's are white noise. So the only thing that comes
through in the square of X_t and the expectation of
that is the squared powers of the etas, which
have coefficients given by the theta_i squared. So these properties are left--
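That gamma_0 formula is easy to confirm numerically. A minimal sketch, with arbitrary illustrative theta values:

```python
import numpy as np

rng = np.random.default_rng(2)
theta1, theta2, sigma = 0.4, -0.3, 1.0
n = 500_000

eta = rng.normal(0.0, sigma, n + 2)
# MA(2): X_t = eta_t + theta_1 * eta_{t-1} + theta_2 * eta_{t-2}
x = eta[2:] + theta1 * eta[1:-1] + theta2 * eta[:-2]

# gamma_0 = sigma^2 * (1 + theta_1^2 + theta_2^2)
var_theory = sigma**2 * (1 + theta1**2 + theta2**2)
print(x.var(), var_theory)  # both near 1.25
```

The cross terms vanish in expectation precisely because the eta_t's are uncorrelated, which is the property emphasized above.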
I'll leave them for you to verify; they're very straightforward. But let's now turn to the
final minutes of the lecture today to accommodating
non-stationary behavior in time series. The original approaches in time series were to focus on estimation methodologies for covariance stationary processes. So if the series is not covariance stationary, then we would want to apply some transformation to the data so that the resulting series is stationary. And with the
differencing operators, delta, Box and Jenkins
advocated removing non-stationary trending
behavior, which is exhibited often in
economic time series, by using a first difference,
maybe a second difference, or a k-th order difference. So these operators are
defined in this way. Basically with the
k-th order operator having this
expression here, this is the binomial expansion
of a k-th power, which can be useful. It comes up all the time
in probability theory. And if a process has
a linear time trend, then delta X_t is going to
have no time trend at all, because you're
basically taking out that linear component by
taking successive differences. Sometimes, if you
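A quick illustration of this with numpy (a sketch; the intercept 2.0 and slope 0.5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(500)
x = 2.0 + 0.5 * t + rng.normal(0.0, 1.0, t.size)  # linear trend plus noise

dx = np.diff(x)  # first difference: delta X_t = X_t - X_{t-1}
# The trend slope 0.5 becomes the mean of the differenced series,
# and the deterministic time trend itself is removed.
print(dx.mean())                     # near 0.5
print(np.polyfit(t[1:], dx, 1)[0])   # remaining slope near 0
```

The differenced series has a constant mean equal to the old slope, so the linear component is gone after a single difference.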
have a real series that appears non-stationary, its first differences can still appear to trend over time, in which case sometimes the second difference will result in a process with no trend. So these are sort of
convenient tricks, techniques to render
the series stationary. And let's see. There's examples here of
linear trend reversion models which are rendered
covariance stationary under first differencing. In this case, this is an
example where you have a deterministic time trend. But then you have reversion
to the time trend over time. So we basically have
eta_t, the error about the deterministic trend,
is a first order autoregressive process. And the moments here
can be derived this way. Leave that as an exercise. One could also consider
the pure integrated process and talk about
stochastic trends. And basically,
random walk processes are often referred
to in econometrics as stochastic trends. And you may want to try and
remove those from the data, or accommodate them. And so the stochastic
trend process is basically given by the first difference
X_t is just equal to eta_t. And so we have essentially
this random walk from a given starting point. And it's easy to verify that if
you knew the 0th point, then the variance of the t-th time
point would be t sigma squared, because we're summing t
independent innovations. And the covariance between
t and lag t minus j is simply t minus
j sigma squared. And the correlation between
those has this form. What you can see is that this
definitely depends on time. So it's not a
stationary process. So first differencing results in stationarity, and the differenced process has those features. Let's see where we are. Final topic for
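These random-walk moments can be verified across many simulated paths. A sketch, taking sigma equal to 1:

```python
import numpy as np

rng = np.random.default_rng(4)
n_paths, T = 20_000, 100

# Many random-walk paths from X_0 = 0: X_t = X_{t-1} + eta_t.
paths = np.cumsum(rng.normal(0.0, 1.0, (n_paths, T)), axis=1)

# Var(X_t) = t * sigma^2 grows with t, so the process is non-stationary.
print(paths[:, 24].var())  # near 25  (t = 25)
print(paths[:, 99].var())  # near 100 (t = 100)

# First differences recover the i.i.d. innovations, which are stationary.
print(np.diff(paths, axis=1).var())  # near 1
```

The cross-sectional variance visibly scales with the time index, while the differenced increments have constant variance, matching the stationarity claim for the first difference.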
today is just how you incorporate non-stationary behavior into ARMA processes. Well, if you take
first differences or second differences
and the resulting process is covariance
stationary, then we can just incorporate that
differencing into the model specification itself, and define
ARIMA models, Autoregressive Integrated Moving
Average Processes. And so to specify
these models, we need to determine the order of the differencing required to remove trends, deterministic or stochastic, then estimate the unknown parameters, and then apply model selection criteria. So let me go very
quickly through this and come back to it at the beginning of next time. But in specifying the
parameters of these models, we can apply maximum
likelihood, again, if we assume normality of
these innovations eta_t. And we can express
the ARMA model in state space
form, which results in a form for the
likelihood function, which we'll see a few lectures ahead. But then we can apply limited
information maximum likelihood, where we just condition on the
first observations of the data and maximize the likelihood. Or we can avoid conditioning on the first few observations and instead use their information as well, looking at their density functions and incorporating those into the likelihood relative to the stationary
distribution for their values. And then the issue
becomes, how do we choose amongst different models? Now, last time we talked about linear regression models and how you'd specify a given model. Here, we're talking about autoregressive, moving average, and even integrated moving average processes, and how we specify those. Well, with the method of maximum likelihood, there are measures of how effective a fitted model is, given by an information criterion that you would want to minimize for a given fitted model. So we can consider
different sets of models, different numbers of
explanatory variables, different orders of
autoregressive parameters, moving average parameters,
and compute, say, the Akaike information criterion
or the Bayes information criterion or the
Hannan-Quinn criterion as different ways of judging
how good different models are. And let me just finish
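All three criteria share the same structure: minus two times the log likelihood plus a penalty per parameter. A sketch (the log-likelihood values below are made up purely for illustration):

```python
import numpy as np

def info_criteria(loglik, k, n):
    """AIC penalizes 2 per parameter, BIC log(n), Hannan-Quinn 2 log(log n)."""
    return {
        "AIC": -2.0 * loglik + 2.0 * k,
        "BIC": -2.0 * loglik + k * np.log(n),
        "HQ": -2.0 * loglik + 2.0 * k * np.log(np.log(n)),
    }

# Hypothetical fits: one extra parameter raises log L from -520.0 to -518.9.
small = info_criteria(-520.0, k=2, n=500)
large = info_criteria(-518.9, k=3, n=500)
for name in ("AIC", "BIC", "HQ"):
    print(name, round(small[name], 2), round(large[name], 2))
# AIC prefers the larger model here, while the stiffer BIC penalty
# (log 500 is about 6.2 per parameter) prefers the smaller one.
```

This toy comparison shows how the choice of penalty, not the likelihood alone, decides whether the extra parameter is kept.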
today by pointing out that what these
information criteria are is basically a function of the
log likelihood function, which is something we're
trying to maximize with maximum
likelihood estimates. And then adding some penalty
for how many parameters we're estimating. And so what I'd like you to
think about for next time is what kind of a penalty
is appropriate for adding an extra parameter. Like, what evidence is
required to incorporate extra parameters, extra
variables, in the model? Would it be a t-statistic that exceeds some threshold, or some other criterion? It turns out that these are
all related to those issues. And it's very interesting
how those play out. And I'll say that for those
of you who have actually seen these before, the
Bayes information criterion corresponds to an
assumption that there is some finite number of
variables in the model. And you know what those are. The Hannan-Quinn criterion
says maybe there's an infinite number of
variables in the model, but you want to be
able to identify those. And so anyway, model selection is a very challenging problem. And these criteria can
be used to specify those. So we'll go through
that next time.