The following content is
provided under a Creative Commons license. Your support will help
MIT OpenCourseWare continue to offer high quality
educational resources for free. To make a donation or
view additional materials from hundreds of MIT courses,
visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: Today's topic
is regression analysis. And this is a subject we're going to cover today, covering the mathematical and statistical foundations of regression, with a particular focus
on linear regression. This methodology is perhaps
the most powerful method in statistical modeling. And the foundations
of it, I think, are very, very important
to understand and master, and they'll help you in any
kind of statistical modeling exercise you might entertain
during or after this course. And its popularity in
finance is very, very high, but it's also a very
popular methodology in all other disciplines
that do applied statistics. So let's begin with setting up
the multiple linear regression problem. So we begin with a data set that
consists of data observations on different cases,
a number of cases. So we have n cases indexed by i. And there's a single variable,
a dependent variable or response variable, which is
the variable of focus. And we'll denote that y sub i. And together with that,
for each of the cases, there are explanatory variables
that we might observe. So the y_i's, the
dependent variables, could be returns on stocks. The explanatory variables could
be underlying characteristics of those stocks
over a given period. The dependent variable
could be the change in value of an index, the S&P
500 index or the yield rate, and the explanatory
variables can be various macroeconomic
factors or other factors that might be used to explain how
the response variable changes and takes on its value. Let's go through various
goals of regression analysis. OK, first it can be
to extract or exploit the relationship between
the dependent variable and the independent variables. And examples of
this are prediction. Indeed, in finance
that's where I've used regression analysis most. We want to predict what's going
to happen and take actions to take advantage of that. One can also use
regression analysis to talk about causal inference. What factors are really
driving a dependent variable? And so one can actually
test hypotheses about what are true
causal factors underlying the relationships
between the variables. Another application is for
just simple approximation. As mathematicians,
you're all very familiar with how
smooth functions can be-- that are smooth
in the sense of being differentiable and bounded. Those can be approximated
well by a Taylor series if you have a function of
a single variable or even a multivariable function. So one can use
regression analysis to actually approximate
functions nicely. And one can also use
regression analysis to uncover functional
relationships and validate functional
relationships amongst the variables. So let's set up the
general linear model from a mathematical
standpoint to begin with. In this lecture, OK,
we're going to start off with discussing ordinary
least squares, which is a purely mathematical
criterion for how you specify regression models. And then we're going to turn to
the Gauss-Markov theorem which incorporates some statistical
modeling principles there. They're essentially
weak principles. And then we will
turn to formal models with normal linear
regression models, and then consider extensions
of those to broader classes. Now we're in the
mathematical context. And a linear model is
basically attempting to model the conditional
distribution of the response variable y_i given the
independent variables x_i. And the conditional distribution
of the response variable is modeled simply
as a linear function of the independent variables. So the x_i's, x_(i,1)
through x_(i,p), are the key explanatory
variables that relate to the response
variables, possibly. And beta_1, beta_2, up through beta_p are the regression
parameters which would be used in defining
that linear relationship. So this relationship has
residuals, epsilon_i, basically where there's
uncertainty in the data-- whether due to measurement error, modeling error, or underlying
stochastic processes that are driving the error. This epsilon_i is a
residual error variable that will indicate how this
linear relationship varies across the different n cases. So OK, how broad are the models? Well, the models
really are very broad. First of all,
polynomial approximation is indicated here. It corresponds, essentially,
to a truncated Taylor series approximation
to a functional form. With variables that
exhibit cyclical behavior, Fourier series can be applied
in a linear regression context. How many people in here are
familiar with Fourier series? Almost everybody. So Fourier series
basically provide a set of basis functions
that allow you to closely approximate most functions. And certainly with
bounded functions that possibly have a
cyclical structure to them, it provides a
complete description. So we could apply
Fourier series here. Finally, time series regressions
where the cases i = 1 through n are really indexes of different
time points can be applied. And so the independent
variables can be variables that are
observable at a given time point or known at a given time. So those can include lags
of the response variables. So we'll see actually when
we talk about time series that there's
autoregressive time series models that can be specified. And those are very broadly
applied in finance. All right, so let's go through
what the steps are for fitting a regression model. First, one wants
to propose a model in terms of what
is it that we have to identify or be interested in
a particular response variable. And critical here is
specifying the scale of that response variable. Choongbum was discussing
problems of modeling stock prices. If, say, y is the stock price, well, it may be that it's
more appropriate to consider modeling it on a logarithmic
scale than on a linear scale. Who can tell me why that
would be a good idea? AUDIENCE: Because
the changes might become more percent
changes in price rather than absolute
changes in price. PROFESSOR: Very good, yeah. So price changes basically
on the percentage scale, which log changes would be,
may be much better predicted by knowing factors than
the absolute price level. OK, and so we have
to have a collection of independent variables,
which to include in the model. And it's important
to think about how general this set up is. I mean, the
independent variables can be functions or lagged values of the response variable. They can be different
functional forms of other independent variables. So the fact that we're talking
about a linear regression model here is not so limiting in terms of the linearity. We can really capture a lot of nonlinear behavior
address the assumptions about the distribution of
the residuals, epsilon, over the cases. So that has to be specified. Once we've set up
the model in terms of identifying the response
and the explanatory variables and the assumptions
underlying the distribution of the residuals, we need to
specify a criterion for judging different estimators. So given a particular
setup, what we want to do is be able to define a
methodology for specifying the regression parameters
so that we can then use this regression
model for prediction or whatever our purpose is. So the second
thing we want to do is define a criterion
for how we might judge different estimators of
the regression parameters. We're going to go
through several of those. And you'll see those-- least
squares is the first one, but there are actually
more general ones. In fact, the last
section of this lecture on generalized estimators
will cover those as well. Third, we need to characterize
the best estimator and apply it to the given data. So once we choose a
criterion for how good an estimate of
regression parameters is, then we have to have
a technology for solving for that. And then fourth, we need
to check our assumptions. Now, it's very often the case
that at this fourth step, where you're checking the
assumptions that you've made, you'll discover features
of your data or the process that it's modeling
that make you want to expand upon your assumptions
or change your assumptions. And so checking the
assumptions is a critical part of any modeling process. And then if necessary, modify
the model and assumptions and repeat this process. What I can tell you
is that this sort of protocol for
how you fit models is what I've applied
many, many times. And if you are lucky in a
particular problem area, the very simple
models will work well with small changes
in assumptions. But when you get
challenging problems, then this item five
of modify the model and/or assumptions is critical. And in statistical
modeling, my philosophy is you really want to,
as much as possible, tailor the model to the
process you're modeling. You don't want to fit a
square peg in a round hole and just apply, say,
simple linear regression to everything. You want to apply it when
the assumptions are valid. If the assumptions
aren't valid, maybe you can change the
specification of the problem so a linear model is still
applicable in a changed framework. But if not, then
you'll want to extend to other kinds of models. But what we'll be
doing-- or what you will be doing if you do
that-- is basically applying all the same principles
that are developed in the linear
modeling framework. OK, now let's see. I wanted to make
some comments here about specifying assumptions
for the residual distribution. What kind of assumptions
might we make? OK, would anyone like to
suggest some assumptions you might make in
a linear regression model for the residuals? Yes? What's your name, by the way? AUDIENCE: My name is Will. PROFESSOR: Will, OK. Will what? [? AUDIENCE: Ossler. ?] PROFESSOR: [? Ossler, ?] great. OK, thank you, Will. AUDIENCE: It might
be-- or we might want to say that the residual
might be normally distributed and it might not depend too
much on what value of the input variable we'd use. PROFESSOR: OK. Anyone else? OK. Well, that certainly
is an excellent place to start in terms of starting
with a distribution that's familiar. Familiar is always good. Although it's not something
that should be necessary, but we know from some of
Choongbum's lecture areas that Gaussian and
normal distributions arise in many
settings where we're taking basically sums of
independent, random variables. And so it may be that these
residuals are like that. Anyway, a slightly simpler
or weaker condition is to use the Gauss-- what
are called in statistics the Gauss-Markov assumptions. And these are assumptions
where we're only concerned with the means
or averages, statistically, and the variances
of the residuals. And so we assume that
there's zero mean. So on average, they're not
adding a bias up or down to the dependent variable. And those have a
constant variance. So the level of
uncertainty in our model doesn't depend on the case. And so indeed, if errors
on the percentage scale are more appropriate, then
one could look at, say, a time series of prices
that you're trying to model. And it may be that
on the log scale, that constant variance
looks much more appropriate than on the original
scale, which would have-- And then a third attribute of
the Gauss-Markov assumptions is that the residuals
are uncorrelated. So now uncorrelated does
not mean independent or statistically independent. So this is a somewhat weak
condition, or weaker condition, than independence
of the residuals. But in the Gauss-Markov
setting, we're just setting up
basically a reduced set of assumptions that we might
apply to fit the model.
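Just to restate those three conditions compactly in symbols-- this is only a summary of what was just said, nothing additional:

```latex
\[
  E[\epsilon_i] = 0, \qquad
  \mathrm{Var}(\epsilon_i) = \sigma^2, \qquad
  \mathrm{Cov}(\epsilon_i, \epsilon_j) = 0 \quad \text{for } i \neq j,
  \qquad i, j = 1, \ldots, n .
\]
```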
If we extend upon that, we can then consider normal linear
regression models, which Will just suggested. And in this case, those could
be assumed to be independent and identically
distributed-- IID is that notation for that-- with
Gaussian or normal with mean 0 and variance sigma squared. We can extend upon
that to consider generalized Gauss-Markov
assumptions where we maintain still the zero mean
for the residuals, but the general-- we might
have a covariance matrix which does not correspond to
independent and identically distributed random variables. Now, let's see. In the discussion of
probability theory, we really haven't talked yet
about matrix-valued random variables, right? But how many people
in the class have covered matrix-value or
vector-valued random variables before? OK, just a handful. Well, a vector-valued
random variable, we think of the
values of these n cases for the dependent variable
to be an n-vector of random variables. And so we can
generalize the variance of individual random variables
to the variance covariance matrix of the collection. And so you have a covariance
matrix characterizing the variance of the n-vector
which gives us the-- the (i, j) element gives us the
value of the covariance. All right, let me put
the screen up and just write that on the board so
that you're familiar with that. All right, so we have
y_1, y_2, down to y_n, our n values of our
response variable. And we can basically talk
about the expectation of that being equal to
mu_1, mu_2, down to mu_n. And the covariance matrix
of y_1, y_2, down to y_n is equal to a matrix
with the variance of y_1 in the upper 1, 1 element, and
the variance of y_2 in the 2, 2 element, and the variance of
y_n in the (n, n) element. And in the (i, j) element, we have the covariance between y_i and y_j.
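Written out, the board display is, roughly:

```latex
\[
  E\!\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
  = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix},
  \qquad
  \mathrm{Cov}\!\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
  = \begin{pmatrix}
      \mathrm{Var}(y_1) & \cdots & \mathrm{Cov}(y_1, y_n) \\
      \vdots            & \ddots & \vdots                 \\
      \mathrm{Cov}(y_n, y_1) & \cdots & \mathrm{Var}(y_n)
    \end{pmatrix}.
\]
```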
So we're going to use matrices to represent covariances. And that's something
which I want everyone to get very familiar
with because we're going to assume that we
are comfortable with those, and apply matrix algebra with
these kinds of constructs. So the generalized
Gauss-Markov theorem assumes a general
covariance matrix where you can have
nonzero covariances between the
independent variables or the dependent variables
and the residuals. And those can be correlated. Now, who can come
up with an example of why the residuals might
be correlated in a regression model? Dan? OK. That's a really good example
because it's nonlinear. If you imagine sort of
a simple nonlinear curve and you try to fit a
straight line to it, then the residuals
from that linear fit are going to be consistently
above or below the line depending on where you are
in the nonlinearity, how it might be fitting. So that's one example
where that could arise. Any other possibilities? Well, next week we'll be talking
about some time series models. And there can be time
dependence amongst variables where there are some
underlying factors maybe that are driving the process. And those ongoing
factors can persist in making the
linear relationship over or under gauge
the dependent variable. So that can happen as well. All right, yes? AUDIENCE: The Gauss-Markov
is just the diagonal case? PROFESSOR: Yes, the Gauss-Markov
is simply the diagonal case. And explicitly if we replace
y's here by the residuals, epsilon_1 through
epsilon_n, then that diagonal matrix
with a constant diagonal is the simple Gauss-Markov
assumption, yeah. Now, I'm sure it
comes as no surprise that Gaussian distributions
don't always fit everything. And so one needs to get
clever with extending the models to other cases. And there are-- I know--
Laplace distributions, Pareto distributions, contaminated
normal distributions, which can be used to
fit regression models. And these general cases really
extend the applicability of regression models to
many interesting settings. So let's turn to specifying
the estimator criterion in step two. So how do we judge what's a
good estimate of the regression parameters? Well, we're going to cover least
squares, maximum likelihood, robust methods, which are
contamination resistant. And other methods exist that we will mention but not really get into in the lectures: Bayes methods and accommodating
incomplete or missing data. Essentially, as your approach
to modeling a problem gets more and more
realistic, you start adding more and more
complexity as it's needed. And certainly issues
of-- well, robust methods is where you assume
most of the data arrives under normal
conditions, but once in a while there may be some
problem with the data. And you don't want
your methodology just to break down if there happens
to be some outliers in the data or contamination. Bayes methodologies
are the technology for incorporating
subjective beliefs into statistical models. And I think it's fair
to say that probably all statistical modeling
is essentially subjective. And so if you're going to be
good at statistical modeling, you want to be sure that you're
effectively incorporating subjective information in that. And so Bayes methodologies
are very, very useful, and indeed pretty much
required to engage in appropriate modeling. And then finally, accommodate
incomplete or missing data. The world is always sort
of cruel in terms of you often are missing what you
think is critical information to do your analysis. And so how do you
deal with situations where you have some
holes in your data? Statistical models provide
good methods and tools for dealing with that situation. OK. Then let's see. In case analyses for
checking assumptions, let me go through this. Basically when you fit
a regression model, you check assumptions by
looking at the residuals, which are basically estimates of
the epsilons, the deviations of the dependent variable
from their predictions. And what one wants
to do is analyze these to determine whether our
assumptions are appropriate. OK, with the Gauss-Markov assumptions, the question would be, do these appear to have constant variance? And it may be that their
variance depends on time, if the i is indexing time. Residuals might depend on
the other variables as well, and one wants to determine
that that isn't the case. There are also influence
diagnostics identifying cases which are highly influential. It turns out that when you
are building a regression model with data, you
treat all the cases as if they're equally important. Well, it may be
that certain cases are really critical to
estimating certain factors. And it may be that much of the
inference about how important a certain factor
is is determined by a very small number of points. So even though you
have a massive data set that you're using
to fit a model, it could be that
some of the structure is driven by a very
small number of cases. So influence diagnostics give
you a way of analyzing that. In the problem set
for this lecture, you'll be deriving some
influence diagnostics for linear regression
models and seeing how they're mathematically defined. And I'll be distributing
a case study which illustrates fitting
linear regression models for asset prices. And you can see
how those play out with some practical examples. OK, finally there's
outlier detection. With outliers, it's interesting. The exceptions in data are
often the most interesting. It's important in
modeling to understand whether certain
cases are unusual. And sometimes their
degree of idiosyncrasy can be explained away
so that one essentially discards those outliers. But other times,
those idiosyncrasies lead to extensions of the model. And so outlier detection can be
very important for validating a model. OK, so with that introduction to
regression, linear regression, let's talk about
ordinary least squares. Ah. OK, the least squares criterion
is, for a given regression parameter beta,
which is considered to be a column vector-- so I'm
taking the transpose of a row vector. The least squares criterion
is to basically take the sum of square deviations
from the actual value of the response variable
from its linear prediction. So y_i minus y hat i,
we're just plugging in for y hat i the
linear function of the independent variables
and then squaring that. And the ordinary least
squares estimate, beta hat, minimizes this function. So in order to solve for this,
we're going to use matrices. And so we're going to take
the y vector, the vector of n values of the
dependent variable, or the response variable,
and X, the matrix of values of the
independent variable. It's important in this
set up to keep straight that cases go by
rows and columns go by values of the
independent variable. Boy, this thing is
ultra sensitive. Excuse me. Do I turn off the touchpad here? OK. So we can now define
our fitted value, y hat, to be equal to the
matrix x times beta. And with matrix multiplication,
that results in the y hat 1 through y hat n. And Q of beta can basically
be written as y minus X beta transpose y minus X beta. So this term here is an
n-vector minus the product of the X matrix times beta,
which is another n-vector. And we're just taking the
cross product of that. And the ordinary least
squares estimate for beta solves the derivative of
this criterion equaling 0. Now, that's in
fact true, but who can tell me why that's true? Say again? AUDIENCE: Is that minimum? PROFESSOR: OK. So your name? AUDIENCE: Seth. PROFESSOR: Seth? Seth. Very good, Seth. Thanks, Seth. So if we want to
find a minimum of Q, then that minimum, if Q is a smooth function, will occur where the slope equals 0. Now, how do we know whether
it's a minimum or not? It could be a maximum. AUDIENCE: [INAUDIBLE]? PROFESSOR: OK, right. So in fact, this
is a-- Q of beta is a convex function of beta. And so its second
derivative is positive. And if you basically think
about the set-- basically, this is the first
derivative of Q with respect to beta equaling 0. If you were to solve for
the second derivative of Q with respect to beta,
well, beta is a p-vector. So the second
derivative is actually a second derivative
matrix, and that matrix, you can solve for it. It will be X
transpose X, which is a positive definite or
semi-definite matrix. So it basically has a positive second derivative there. So anyway, this ordinary least squares estimate will solve this d Q of
beta by d beta equals 0. What is d Q of beta by d beta_j? Well, you just take the
derivative of this sum. So we're taking the sum
of all these elements. And if you take the
derivative-- well, OK, the derivative
is a linear operator. So the derivative of a sum is
the sum of the derivatives. So we take the summation out and
we take the derivative of each term, so we get 2 minus x_(i,j),
then the thing in square brackets, y_i minus that. And what is that? Well, in matrix
notation, if we let this sort of bold X
sub square j denote the j-th column of the
independent variables, then this is minus 2. Basically, the j-th column of X
transpose times y minus X beta. So this j-th equation for
ordinary least squares has that representation in
terms-- in matrix notation. Now if we put that all
together, we basically can define this derivative
of Q with respect to the different
regression parameters as basically the minus twice
the j-th column stacked times y minus X beta, which is simply
minus 2 X transpose, y minus X beta. And this has to equal 0. And if we just simplify,
taking out the two, we get this set of equations. It must be satisfied by
the ordinary least squares estimate, beta. And that's called the
normal equations in books on regression modeling. So let's consider
how we solve that. Well, we can re-express that
by multiplying through the X transpose on each of the terms. And then beta hat basically
solves this equation. And if X transpose
X inverse exists, we get beta hat is equal
to X transpose X inverse X transpose y. So with matrix algebra, we
can actually solve this. And matrix algebra
is going to be very important to this
lecture and other lectures. So if this stuff is a bit rusty for you, do brush up.
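As a rough numerical sketch of that formula-- the simulated data and variable names here are illustrative assumptions, not something from the lecture-- the estimate can be computed as follows:

```python
import numpy as np

# Simulate a small regression data set.
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column is an intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations (X'X) beta_hat = X'y directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# A numerically safer route that avoids forming X'X explicitly.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat, beta_hat_lstsq)  # both should be close to beta_true
```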
This particular solution for beta hat
X inverse exists. Who can tell me
what assumptions do we need to make for X
transpose X to have an inverse? I'll call you in a second
if no one else does. Somebody just said something. Someone else. No? All right. OK, Will. AUDIENCE: So X
transpose X inverse needs to have full
rank, which means that each of the submatrices
needs to have [INAUDIBLE] smaller dimension. PROFESSOR: OK, so Will said,
basically, the matrix X needs to have full rank. And so if X has full rank,
then-- well, let's see. If X has full rank, then the
singular value decomposition which was in the very
first class can exist. And you have basically
p singular values that are all non-zero. And X transpose X
can be expressed as sort of a, from the
singular value decomposition, as one of the orthogonal
matrices times the square of the singular values times
that same matrix transpose, if you recall that definition. So that actually
is-- it basically provides a solution for X
transpose X inverse, indeed, from the singular value
decomposition of X. But what's required is that
you have a full rank in X. And what that means
is that you can't have independent variables
that are explained by other independent variables. So different columns of
X have to be linear-- or they can't linearly depend
on any other columns of X. Otherwise, you would
have reduced rank. So now if X
doesn't have full rank, then our least squares estimate
of beta might be non-unique. And in fact, it is
the case that if you are really interested
in just predicting values of a dependent
variable, then having non-unique
least squares estimates isn't as much of a
problem, because you still get estimates out of that. But for now, we want to assume
that there's full column rank in the independent variables. All right. Now, if we plug in the value
of the solution for the least squares estimate,
we get fitted values for the response variable, which
are simply the matrix X times beta hat. And this expression
for the fitted values is basically X times X transpose
X inverse X transpose y, which we can represent as Hy. Basically, this H matrix in
linear models and statistics is called the hat matrix. It's basically a
projection matrix that takes the linear vector,
or the vector of values of the response variable,
into the fitted values. So this hat matrix
is quite important. The problem set's going
to cover some features, go into some properties
of the hat matrix. Does anyone want to make any
comments about this hat matrix? It's actually a very
special type of matrix. Does anyone want to point out
what that special type is? It's a projection matrix, OK. Yeah. And in linear algebra,
projection matrices have some very
special properties. And it's actually an
orthogonal projection matrix. And so if you're
interested in that feature, you should look into that. But it's really a very rich
set of properties associated with this hat matrix. It's an orthogonal projection,
and it's-- let's see. What's it projecting? It's projecting from
n-space into what? Go ahead. What's your name? AUDIENCE: Ethan. PROFESSOR: Ethan, OK. AUDIENCE: Into space [INAUDIBLE] PROFESSOR: Basically, yeah. It's projecting into
the column space of X. So that's what linear
regression is doing. So in focusing and
understanding linear regression, you can think of, how do we
get estimates of this p-vector? That's all very good and useful,
and we'll do a lot of that. But you can also
think of it as, what's happening in the
n-dimensional space? So you basically
are representing this n-dimensional vector
y by its projection onto the column space. Now, the residuals are
basically the difference between the response value
and the fitted value. And this can be expressed
as y minus y hat, or I_n minus H times y. And it turns out that I_n minus
H is also a projection matrix, and it's projecting the data
onto the space orthogonal to the column space of x. And to show that that's
true, if we consider the normal equations, which
are X transpose y minus X beta hat equaling 0, that basically
is X transpose epsilon hat equals 0. And so from the
normal equations, we can see that
what they mean is they mean that the residual
vector epsilon hat is orthogonal to each
of the columns of X. You can take any column
in X, multiply that by the residual vector,
and get 0 coming out. So that's a feature of the residuals as they relate to the independent variables.
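Here is a small sketch, with simulated data of my own choosing, that checks the projection properties just described-- H is symmetric and idempotent, and the residuals are orthogonal to every column of X:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # the hat matrix
y_hat = H @ y                          # fitted values: projection onto the column space of X
eps_hat = y - y_hat                    # residuals: projection onto the orthogonal complement

print(np.allclose(H, H.T))             # H is symmetric
print(np.allclose(H @ H, H))           # H is idempotent, i.e., a projection
print(np.allclose(X.T @ eps_hat, 0))   # the residuals are orthogonal to the columns of X
```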
OK, all right. So at this point, we haven't really talked about any statistical properties used to specify the betas. All we've done is
introduced the least squares criterion and said, what
value of the beta vector minimizes that least
squares criterion? Let's turn to the
Gauss-Markov theorem and start introducing some
statistical properties, probability properties. So with our data, y and X-- yes? Yes. AUDIENCE: [INAUDIBLE]? PROFESSOR: That epsilon-- AUDIENCE: [INAUDIBLE]? PROFESSOR: OK. Let me go back to that. It's that X, the columns
of X, and the column vector of the residual are
orthogonal to each other. So we're not doing a
projection onto a null space. This is just a statement that
those values, or those column vectors, are orthogonal
to each other. And just to recap, the
epsilon is a projection of y onto the space orthogonal
to the column space. And y hat is the projection of y onto the column space of X. And these projections are
all orthogonal projections, and so the projected value epsilon hat must be orthogonal to
the column space of X, if you project it out. OK? All right. So the Gauss-Markov theorem,
we have data y and X again. And now we're going to
think of the observed data, little y_1 through
y_n, is actually an observation of the
random vector capital Y, composed of random
variables Y_1 up to Y_n. And the expectation
of this vector conditional on the values
of the independent variables and their regression
parameters given by X, beta-- so the dependent
variable vector has expectation
given by the product of the independent variables
matrix times the regression parameters. And the covariance matrix
of Y given X and beta is sigma squared
times the identity matrix, the n-dimensional
identity matrix. So the identity matrix has
1's along the diagonal, n-dimensional, and
0's off the diagonal. So the variances of the Y's
are the diagonal entries, those are all the
same, sigma squared. And the covariance between
any two are equal to 0 conditionally. OK, now the
Gauss-Markov theorem. This is a terrific result
in linear models theory. And it's terrific in terms of
the mathematical content of it. I think it's-- for a math class,
it's really a nice theorem to introduce you to and
highlight the power of, I guess, results that can arise
from applying the theory. And so to set this
theorem up, we want to think about trying
to estimate some function of the regression parameters. And so OK, our problem is
with ordinary least squares-- it was, how do we specify
the regression parameters beta_1 through beta_p? Let's consider a general
target of interest, which is a linear
combination of the betas. So we want to estimate a parameter theta, which is some linear combination
of the regression parameters. And because that linear
combination of the regression parameters corresponds to the
expectation of the response variable corresponding
to a given row of the independent
variables matrix, this is just a
generalization of trying to estimate the means
of the regression model at different points
in the space, or to be estimating other
quantities that might arise. So this is really a very
general kind of thing to want to estimate. It certainly is appropriate
for predictions. And if we consider the
least squares estimate by just plugging in beta hat
one through beta hat p, solved by the least squares,
well, it turns out that those are an unbiased
estimator of the parameter theta. So if we're trying to
estimate this combination of these unknown parameters,
you plug in the least squares estimate, you're going to get
an estimator that's unbiased. Who can tell me
what unbiased is? It's probably going to be a new
concept for some people here. Anyone? OK, well it's a basic
property of estimators in statistics where the
expectation of this statistic is the true parameter. So it doesn't, on average,
probabilistically, it doesn't over- or
underestimate the value. So that's what unbiased means. Now, it's also a
linear estimator of theta in terms
of this theta hat being a particular
linear combination of the dependent variables. So with our original
response variable y, in the case of y_1 through
y_n, this theta hat is simply a linear combination
of all the y's. And now why is that true? Well, we know that beta hat,
from the normal equations, is solved by X transpose
X inverse X transpose y. So it's a linear
transform of the y vector. So if we take a
linear combination of those components, it's also
another linear combination of the y vector. So this is a linear
function of the underlying-- of the response variables. Now, the Gauss-Markov
theorem says that, if the Gauss-Markov
assumptions apply, then the estimator theta hat
has the smallest variance amongst all linear unbiased
estimators of theta. So it actually is
like the optimal one, as long as this is our criterion. And this is really a
very powerful result. And to prove it, it's very easy. Let's see. Actually, these notes are
going to be distributed. So I'm going to go through
this very, very quickly and come back to it later
if we have more time. But you basically-- the
argument for the proof here is you consider another
linear estimate which is also an unbiased estimate. So let's consider a competitor
to the least squares value and then look at the difference
between that estimator and theta hat. And so that can be characterized
as basically this vector, f transpose y. And this difference
in the estimates must have expectation 0. So basically, if we look at--
if theta tilde is unbiased, then this expression
here is going to be equal to zero,
which means that f-- the difference in
these two estimators, f defines the difference
in the two estimators-- has to be orthogonal to
the column space of x. And with this
result, one then uses this orthogonality of
f and d to evaluate the variance of theta tilde. And in this proof, the
mathematical argument here is really something--
I should put some asterisks on a few lines here. This expression here is
actually very important. We're basically looking
at the decomposition of the variance to
be the variance of b transpose y, which is
the variance of the sum of these two random variables. So the page before
basically defined d and f such that this is true. Now when you consider
the variance of a sum, it's not the sum
of the variances. It's the sum of the
variances plus twice the sum of the covariances. And so when you are
calculating variances of sums of random variables,
you have to really keep track of the covariance terms. In this case, this
argument shows that the covariance
terms are, in fact, 0, and you get the
result popping out. But that's really a-- in
an econometrics class, they'll talk about BLUE
estimates of regression, or the BLUE property of the
least squares estimates. That's where that comes from. All right, so let's now consider
generalizing from Gauss-Markov to allow for unequal variances
and possibly correlated nonzero covariances
between the components. And in this case,
the regression model has the same linear set up. The only difference
is the expectation of the residual
vector is still 0. But the covariance matrix
of the residual vector is sigma squared,
a single parameter, times let's say capital sigma. And we'll assume here
that this capital sigma matrix is a known n by n
positive definite matrix specifying relative
variances and correlations between the observations. OK. Well, in order to solve
for regression estimates under these generalized
Gauss-Markov assumptions, we can transform the
data Y, X to Y star equals sigma to the
minus 1/2 y and X to X star, which is
sigma to the minus 1/2 x. And this model then becomes
a model, a linear regression model, in terms of
Y star and X star. We're basically multiplying
this regression model by sigma to the minus 1/2 across. And epsilon star actually
has a covariance matrix equal to sigma squared
times the identity. So if we just take a
linear transformation of the original data,
we get a representation of the regression
model that satisfies the original
Gauss-Markov assumptions. And what we had to
do was basically do a linear transformation
that makes the response variables all have constant
variance and be uncorrelated. So with that, we then have the
least squares estimate of beta is the least squares, the
ordinary least squares, in terms of Y star and X star. And so plugging that in, we then
have X star transpose X star inverse X star transpose Y star. And if you multiply through,
that's how the formula changes. So this formula characterizing
the least squares estimate under this generalized
set of assumptions highlights what you
need to do to be able to apply that theorem. So with response values that
have very large variances, you basically want to discount
those by the sigma inverse. And that's part of the way in which these generalized least squares work.
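A sketch of that recipe with simulated data-- the AR(1)-style choice of capital sigma and all variable names are assumptions made for the example, not from the lecture. The Cholesky factor of sigma is used in place of the symmetric square root; either transformation leaves residuals with covariance proportional to the identity and gives the same estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.5])

# A known n x n covariance structure (AR(1)-style correlations, for illustration only).
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Sigma)
y = X @ beta_true + L @ rng.normal(size=n)   # residual covariance proportional to Sigma

# Transform the data so the transformed residuals satisfy the ordinary Gauss-Markov assumptions.
X_star = np.linalg.solve(L, X)
y_star = np.linalg.solve(L, y)
beta_gls, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Equivalent closed form: (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
Sigma_inv = np.linalg.inv(Sigma)
beta_gls_direct = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
print(beta_gls, beta_gls_direct)
```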
All right. So now let's turn to distribution theory for normal regression models. Let's assume that
the residuals are normal with mean 0 and
variance sigma squared. OK, conditioning on the values
of the independent variable, the Y's, the response
variables, are going to be independent
over the index i. They're not going to be
identically distributed because they have
different means, mu_i for the dependent
variable Y_i, but they will have a constant variance. And what we can do is
basically condition on X, beta, and sigma squared
and then represent this model in terms of the
distribution of the epsilons. So if we're conditioning
on x and beta, this X beta is a constant,
known, we've conditioned on it. And the remaining uncertainty
is in the residual vector, which is assumed to
be all independent and identically distributed
normal random variables. Now, this is the
first time you'll see this notation, capital N sub
little n, for a random vector. It's a multivariate
normal random variable where you consider an n-vector
where each component is normally distributed,
with mean given by some corresponding
mean vector, and a covariance matrix
given by a covariance matrix. In terms of independent and
identically distributed values, the probability structure
here is totally well-defined. Anyone here who's taken a
beginning probability class knows what the
density function is for this multivariate
normal distribution because it's the product
of the independent density functions for the
independent components, because they're all
independent random variables. So this multivariate
normal random vector has a density function
which you can write down, given your first
probability class. OK, here I'm just
highlighting or defining the mu vector for the means
of the cases of the data. And the covariance matrix
sigma is this diagonal matrix. And so basically sigma_(i,j)
is equal to sigma squared times the Kronecker delta
for the (i,j) element. Now what we want to do
is, under the assumptions of normally
distributed residuals, to solve for the distribution
of the least squares estimators. We want to know, basically,
what kind of distribution does it have? Because what we want
to be able to do is to determine
whether estimates are particularly large or not. And maybe there's
no structure at all and the regression
parameters are 0 so that there's no dependence
on a given factor. And we need to be able to
judge how significant that is. So we need to know what
the distribution is of our least squares estimate. So what we're going to do
is apply moment generating functions to derive the
joint distribution of y and the joint
distribution of beta hat. And so Choongbum introduced
the moment generating function for individual, single-variate random variables. For n-variate
random variables, we can define the moment generating
function of the Y vector to be the expectation of
e to the t transpose Y. So t is an argument of the
moment generating function. It's another n-vector. And it's equal to the
expectation of e to the t_1 Y_1 plus t_2 Y_2 up to t_n Y_n. So this is a very
simple definition. Because of independence,
the expectation of the products, or
this exponential sum is the product of
the exponentials. And so this moment
generating function is simply the product of the moment
generating functions for Y_1 up through Y_n. And I think-- I don't know if
it was in the first problem set or in the first lecture, but e
to the t_i mu_i plus a half t_i squared sigma squared
was the moment generating function for the
single univariate normal random variable,
mean mu_i and variance sigma squared. And so if we have n of
these, we take their product. And the moment
generating function for y is simply e to the
t transpose mu plus 1/2 t transpose sigma t. And so for this multivariate
normal distribution, this is its moment
generating function. And this happens to be--
the distribution of y is a multivariate normal with
mean mu and covariance matrix sigma. So a fact that
we're going to use is that if we're working with
multivariate normal random variables, this is the structure
of their moment generating functions. And so if we solve for
the moment generating function of some
other item of interest and recognize that
it has the same form, we can conclude that it's also
a multivariate normal random variable. So let's do that. Let's solve for the
moment generating function of the least
squares estimate, beta hat. Now rather than dealing
with an n-vector, we're dealing with a p-vector
of the betas, beta hats. And this is simply the
definition of the moment generating function. If we plug in for basically
what the functional form is for the ordinary least
squares estimates and how they depend on
the underlying Y, then we basically-- OK, we have
A equal to, essentially, the linear projection of Y.
That gives us the least squares estimate. And then we can say that
this moment generating function for beta hat is
equal to the expectation of e to the t transpose Y, where
little t is A transpose tau. Well, we know what this is. This is the moment
generating function of X-- sorry, of Y-- evaluated
at the vector little t. So we just need to plug in
little t, that expression A transpose tau. So let's do that. And you do that and it turns
out to be e to the t transpose mu plus that. And we go through a
number of calculations. And at the end of the day, we
get that the moment generating function is just e to the tau
transpose beta plus a 1/2 tau transpose this matrix tau. And that is the moment
generating function of a multivariate normal. So these few lines that you
can go through after class basically solve for
the moment generating function of beta hat. And because we can
recognize this as the MGF of a multivariate normal, we
know that that's-- beta hat is a multivariate normal,
with mean the true beta, and covariance matrix given by the object in square brackets there.
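Spelled out-- this simply restates that conclusion, with the bracketed covariance written explicitly; it matches the sigma squared X transpose X inverse form that reappears later with the QR decomposition:

```latex
\[
  \hat{\beta} = (X^\top X)^{-1} X^\top Y
  \;\sim\; N_p\!\left(\beta,\ \sigma^2 (X^\top X)^{-1}\right).
\]
```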
OK, so this is essentially the conclusion of that previous analysis. The marginal distribution
of each of the beta hats is given by beta hat-- by a
univariate normal distribution with mean beta_j and variance equal to the corresponding diagonal entry. Now at this point, saying
that is like an assertion. But one can actually
prove that very easily, given this sequence of argument. And can anyone tell
me why this is true? Let me tell you. If you consider plugging into the moment generating function the value tau where only the j-th entry is non-zero, then you have the moment
generating function of the j-th component
of beta hat. And that's a Gaussian
moment generating function. So the marginal distribution of
the j-th component is normal. So you get that
almost for free from this multivariate analysis. And so there's no hand waving
going on in having that result. This actually follows
directly from the moment generating functions. OK, let's now turn
to another topic. Related, but it's the
QR decomposition of X. So we have-- with our
independent variables X, we want to express
this as a product of an orthonormal matrix
Q which is n by p, and an upper
triangular matrix R. So it turns out that any
matrix, n by p matrix, can be expressed in this form. And we'll quickly show you
how that can be accomplished. We can accomplish
that by conducting a Gram-Schmidt
orthonormalization of the independent
variables matrix X. And let's see. So if we define R, the upper
triangular matrix in the QR decomposition, to have
0's off the diagonal below and then possibly nonzero
value along the diagonal into the right, we're just
going to solve for Q and R through this
Gram-Schmidt process. So the first column of X is
equal to the first column of Q times the first
element, the top left corner of the matrix R. And if we take the cross product
of that vector with itself, then we get this expression
for r_(1,1) squared-- we can basically solve for
r_(1,1) as the square root of this dot product. And Q_[1] is simply the first
column of X divided by that square root. So this first element
of the Q matrix and the first element r, this
can be solved for right away. Then let's solve for
the second column of Q and the second column
of the R matrix. Well, X_X_[2], the second
column of the X matrix, is the first column
of Q times r_(1,2), plus the second column
of Q times r_(2,2). And if we multiply this
expression by Q_Q_[1] transpose, then we basically get this
expression for r_(1,2). So we actually have
just solved for r_(1,2). And Q_Q_[2] is solved for by
the arguments given here. So basically, we successively
are orthogonalizing columns of X to the
previous columns of X through this
Gram-Schmidt process. And it basically can be repeated
through all the columns. Now with this QR
decomposition, what we get is a really nice form for
the least squares estimate. Basically, it simplifies to the
inverse of R times Q transpose y. And this basically
means that you can solve for least squares
estimates by calculating the QR decomposition, which is a
very simple linear algebra operation, and then just do
a couple of matrix products to get the-- well, you do have to do a matrix inverse with R to get that out.
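A brief sketch of that computation on simulated data (the data and names are illustrative only), comparing the QR route with the normal equations and checking the Q Q transpose form of the hat matrix that comes up in a moment:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

Q, R = np.linalg.qr(X)                   # reduced QR: Q is n x p with orthonormal columns, R is p x p upper triangular
beta_qr = np.linalg.solve(R, Q.T @ y)    # beta_hat = R^{-1} Q'y, solved by back-substitution

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # same estimate via the normal equations
print(np.allclose(beta_qr, beta_ne))

H = Q @ Q.T                              # the hat matrix expressed through Q
print(np.allclose(H @ y, X @ beta_qr))   # fitted values agree
```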
And the covariance matrix of beta hat is equal to sigma squared
X transpose X inverse. And in terms of the covariance
matrix, what is implicit here but you should make
explicit in your study, is if you consider taking a
matrix, R inverse Q transpose times y, the only thing that's
random there is that y vector, OK? The covariance of a matrix
times a random vector is that matrix
times the covariance of the vector times the
transpose of the matrix. So if you take a
matrix transformation of a random vector,
then the covariance of that transformation
has that form. So that's where this covariance
matrix is coming into play. And from the MGF, the
moment generating function, for the least squares
estimate, this basically comes out of the moment
generating function definition as well. And if we take X
transpose X, plug in the QR decomposition,
only the R's play out, giving you that. Now, this also gives
us a very nice form for the hat matrix,
which turns out to just be Q times Q transpose. So that's a very simple form. So now with the
distribution theory, this next section is
going to actually prove what's really a
fundamental result about normal linear
regression models. And I'm going to go through
this somewhat quickly just so that we cover what the
main ideas are of the theorem. But the details, I think,
are very straightforward. And these course notes
that will be posted online will go through the various
steps of the analysis. OK, so there's an
important theorem here which is for any
matrix A, m by n, you consider transforming
the random vector y by this matrix A. The result is
also a random normal vector. And its distribution
is going to have a mean and covariance
matrix given by mu_z and sigma_z, which have
this simple expression in terms of the matrix A and
the underlying means and covariances of y. OK, earlier we actually
applied this theorem with A corresponding to the
matrix that generates the least squares estimates. So with A equal to X
transpose X inverse times X transpose, we actually previously went
through the solution for what's the distribution of beta hat. And with any other
matrix A, we can go through the same analysis
and get the distribution. So if we do that here,
well, we can actually prove this important
theorem, which says that with least
squares estimates of normal linear regression
models, our least squares estimate beta hat and
our residual vector epsilon hat are independent
random variables. So when we construct
these statistics, they are statistically
independent of each other. And the distribution of beta
hat is multivariate normal. The sum of the squared
residuals is, in fact, a multiple of a chi-squared
random variable. Now who in here can tell me what
a chi-squared random variable is? Anyone? AUDIENCE: [INAUDIBLE]? PROFESSOR: Yes, that's right. So a chi-squared random variable
with one degree of freedom is a squared normal zero
one random variable. A chi-squared with
two degrees of freedom is the sum of two independent
normals, zero one, squared. And so the sum of the n squared residuals is, in fact, a chi-squared random variable with n minus p degrees of freedom, scaled by sigma squared. And for each component
j, if we take the difference between the least
squares estimate beta hat j and beta_j and divide
through by this estimate of the standard
deviation of that, then that will, in fact, have a
t distribution on n minus p degrees of freedom. And let's see, a t distribution
in probability theory is the ratio of a standard normal random variable to the square root of an independent chi-squared random variable divided by its degrees of freedom. So basically these
properties characterize our regression parameter estimates and the t statistics for those estimates.
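As a sketch of how those quantities are computed in practice-- simulated data and names of my own, assuming the normal linear regression model:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.0, 2.0])    # the middle coefficient is truly zero
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

sigma2_hat = resid @ resid / (n - p)             # RSS / (n - p), the unbiased estimate of sigma^2
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))      # standard error of each beta_hat_j
t_stats = beta_hat / se                          # t statistic for H0: beta_j = 0; under H0 it has
print(t_stats)                                   # a t distribution with n - p degrees of freedom
```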
Now, OK, in the course notes, there's a moderately long proof.
are given, and I'll be happy to go through any
of those details with people during office hours. Let me just push
on to-- let's see. We have maybe two minutes
left in the class. Let me just talk about
maximum likelihood estimation. And in fitting models
and statistics, maximum likelihood estimation
comes up again and again. And with normal linear
regression models, it turns out that ordinary
least squares estimates are, in fact, our maximum
likelihood estimates. And what we want to do with maximum likelihood is to maximize the likelihood. We want to define the
likelihood function, which is the density function
for the data given the unknown parameters. And this density
function is simply the density function for a
multivariate normal random variable. And the maximum
likelihood estimates are the estimates of the
underlying parameters that basically maximize
the density function. So it's the values of
the underlying parameters that make the data that was
observed the most likely. And if you plug in the values
of the density function, basically we have these
independent random variables, Y_i, whose densities' product is the joint density. The likelihood
function turns out to be basically a function of
the least squares criterion.
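As a quick sketch of why, in the notation used here:

```latex
\[
  L(\beta, \sigma^2)
  = (2\pi\sigma^2)^{-n/2}
    \exp\!\Bigl(-\tfrac{1}{2\sigma^2}\sum_{i=1}^n \bigl(y_i - x_i^\top \beta\bigr)^2\Bigr)
  = (2\pi\sigma^2)^{-n/2} \exp\!\Bigl(-\tfrac{1}{2\sigma^2}\, Q(\beta)\Bigr),
\]
```

so for any fixed sigma squared, maximizing the likelihood over beta is exactly minimizing the least squares criterion Q of beta.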
So if you fit models by least squares, you're consistent with doing
something decent in at least applying the maximum
likelihood principle if you had a normal
linear regression model. And it's useful to know when
your statistical estimation algorithms are consistent
with certain principles like maximum likelihood
estimation or others. So let me, I guess,
finish there. And next time, I will
just talk a little bit about generalized M estimators. Those provide a
class of estimators that can be used for
finding robust estimates and also quantile estimates
of regression parameters which are very interesting.