NARRATOR: The
following content is provided by MIT OpenCourseWare
under a Creative Commons license. Additional information
about our license and MIT OpenCourseWare
in general is available at ocw.mit.edu. PROFESSOR: So I'm thinking
of large sparse matrices. So the point about A -- the
kind of matrices that we've been writing down, maybe
two-dimensional, maybe three-dimensional
difference matrices. So it might be five or seven
non-zeros on a typical row and that means that you don't
want to invert such a matrix. These are big matrices. The order could
easily be a million. What you can do fast is
multiply A by a vector, because multiplying a
sparse matrix by a vector, you only have maybe five
non-zeros times the n -- the vector of length n -- so say
5n operations, so that's fast. I'm thinking that we are at
a size where elimination, even with a good ordering,
is too expensive in time or in storage. So I'm going from direct
elimination methods to iterative methods. Iterative means that
I never actually get to the perfect answer x, but I
get close and what I want to do is get close quickly. So a lot of thought
has gone into creating iterative methods. I'll write down
one key word here. Part of the decision is to
choose a good preconditioner. I'll call that matrix P. So it's
a matrix that we have some freedom to
choose and that can very much speed up the iteration -- so the decision is,
what preconditioner to choose? A preconditioner that's
somehow like the matrix A, but hopefully a lot simpler. There are a bunch
of iterative methods and that's what today's
lecture is about. Multigrid is an idea that
just keeps developing. It really is producing answers
to very large problems very fast, so that will be
the following lectures. And then come these methods --
you may not be familiar with that word Krylov. Let me mention that in this
family, the best known method is called
conjugate gradients. The conjugate gradient method. So I'll just write that
down, conjugate gradient. So that will come next week. That's a method
that's a giant success for symmetric, positive
definite problems. So I don't assume
in general that A is symmetric or positive definite. Often it is if it comes
from a finite difference or finite element approximation. If it is, then
conjugate gradients is highly recommended. I'll start with the
general picture. Let me put the general
picture of what an iteration looks like. So an iteration
computes the new x. Let me call it x-new or
x_(k+1) from the known x. So we start with x_0,
any x_0, in principle. It's not too important
for x_0 to be close. In linear methods, you
could start even at 0. In nonlinear methods,
Newton methods, you do want to be close,
and a lot of attention goes into a good start. Here the start
isn't too important; it's the steps that you're taking. Let me -- so I guess
I'm going to -- I'm trying to solve A*x equal b. I'm going to split
that equation into -- I'm just going to write
the same equation this way. I could write it as x
equals (I minus A) x plus b. That would be the
same equation, right? Because x and x cancel. A*x comes over here
and I have A*x equal b. I've split up the equation and
now actually, let me bring -- so that's without
preconditioner. You don't see a matrix P. That's
just an ordinary iteration, no preconditioner. The preconditioner will come --
I'll have P and it'll go here too. Then, of course, I have
to change that to a P. So that's still the
same equation, right? P*x is on both sides, cancels
-- and I have A*x equal b. So that's the same equation,
but it suggests the iteration that I want to analyze: P x_(k+1) = (P - A) x_k + b. On the right-hand side is
the known approximation. On the left side is the
next approximation, x_(k+1). I multiply by P so I -- the
idea is, P should be close to A. Of course, if it was
exactly A -- if P equaled A, then this wouldn't be here,
and I would just be solving, in one step, A*x equal b. So P equal A is one extreme. I mean, it's totally,
perfectly preconditioned, but the point is
that if P equals A, we don't want to
solve that system, we're trying to escape
from solving that system. But we're willing to solve a
related system that is simpler. Why simpler? Here are some possible
preconditioners. P equals the diagonal. Just take the diagonal
of A and that choice is named after Jacobi. That's Jacobi's method. Actually, it's
already a good idea, because the effect of that
is somehow to properly scale the equations. When I had the identity
here, I had no idea whether the identity and A
are in the right scaling. A may be divided by delta
x squared or something. It could be totally
different units, but P -- if I take the diagonal of A,
at least they're comparable. Then you see what -- we'll
study Jacobi's method. So that's a very simple choice. Takes no more work than
the identity, really. Second choice would be -- so
that would be P for Jacobi. A second choice, that you
would think of pretty fast, is the diagonal and the
lower triangular part. So it's the -- I'll
say the triangular -- I'm going to say
lower triangular, triangular part of A,
including the diagonal. This is associated with
Gauss's great name and Seidel. So that's the
Gauss-Seidel choice. So why triangular? We'll analyze the
effect of this choice. If P is triangular,
then we can solve -- I'm never going to write
out an inverse matrix. That's understood, but
I can compute the new x from the old x just
by back substitution. I find the first component
of x_(k+1), then the second, then the third. It's just because the first
equation -- if P is triangular, the first equation will
only have one term, one new term on the left side
and everything else will be over there. The second equation will
have a diagonal and something that uses the number I just
found for the first component and onward. So in other words,
triangular matrices are good.
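Here is a minimal sketch of that triangular solve in Python with SciPy (the sizes and the 1D matrix are only for illustration, not from the lecture) -- one Gauss-Seidel-type step means one solve with the lower triangular part of A, component by component, never an explicit inverse:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve_triangular

    n = 1000
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')   # a sparse difference matrix
    b = np.ones(n)
    x_old = np.zeros(n)

    P = sp.tril(A, format='csr')                     # lower triangular part of A, diagonal included
    rhs = b - (A - P) @ x_old                        # the right side (P - A) x_old + b
    x_new = spsolve_triangular(P, rhs, lower=True)   # first component, then second, then third, ...

Then people thought about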
combinations of these. A lot of activity
went into something called overrelaxation,
where you had a weighting of these,
and it turned out that if you adjusted the weight, you
could speed up the method, sometimes called SOR--
successive overrelaxation. So when I was in graduate
school, that was a very -- that was sort of the beginning
of progress beyond Jacobi and Gauss-Seidel. Then after that came
a very natural -- ILU is the next one that I want
-- the I stands for incomplete, incomplete LU. So what are L and U? That's the letters I always
use as a signal for the lower triangular and upper triangular
factors of A, the ones that exact elimination
would produce. But we don't want to
do exact elimination. Why? Because non-zeros fill in. Non-zeros appear in those
spaces where A itself is 0. So the factors are much --
have many more non-zeros than the original A.
They're not as sparse. So the idea of incomplete LU is
-- and I'll come back to it -- the idea of incomplete
LU, you might say, should have occurred
to people earlier -- only keep the non-zeros that
are non-zeros in the original A. Don't allow fill-in or allow
a certain amount of fill-in. You could have a tolerance and
you would include a non-zero if it was big enough, but the
many, many small non-zeros that just fill in and
make the elimination long, you throw them out. So P is -- in this
case, P is an approximate L
times an approximate U. So it's close to A, but
because we've refused some of the fill-in,
it's not exactly A. And you have a tolerance there
where if you set the tolerance to 0, then these factors are exact, but
you've got all that fill-in; if you set the tolerance high,
you've got the other extreme.
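To make those choices concrete, here is a small sketch in Python with SciPy (the 2D problem size and the drop tolerance are illustrative, not from the lecture):

    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 100                                                        # grid size; A has order n^2
    K = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
    A = sp.kron(sp.identity(n), K) + sp.kron(K, sp.identity(n))    # 2D five-point difference matrix

    P_jacobi = sp.diags(A.diagonal())            # Jacobi: the diagonal of A
    P_gauss_seidel = sp.tril(A, format='csr')    # Gauss-Seidel: lower triangular part, with the diagonal
    ilu = spla.spilu(A.tocsc(), drop_tol=1e-3)   # incomplete LU: fill-in below the tolerance is dropped

Applying any of these means one easy solve with P -- a diagonal scaling, a triangular solve, or ilu.solve(r) -- rather than a solve with A itself. So that would give you an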
idea of preconditioners, of reasonable choices. Now I'm going to think
about -- first, how do -- how to decide whether the x's
approach the correct answer? I need an equation for the error
and because my equations are linear, I'm just
going to subtract -- I'll subtract the upper
equation from the lower one and I'll call the -- so the
error e_k will be the true answer x -- this is x exact -- minus x_k, my k-th approximation: e_k = x - x_k. So when I subtract
that from that, I have e_(k+1) and here
I have P minus A -- when I subtract that
from that, I have e_k, and when I subtract
that from that, 0. So very simple error equation: P e_(k+1) = (P - A) e_k. So this is an equation, this
would be the error equation. You see the b has
disappeared, because I'm only looking at differences now. Let me write that
over on this board, even slightly differently. Now I will multiply
through by P inverse. That really shows
it just so clearly. So the new error is, if
I multiply by P inverse, P inverse times P is I, and
I have P inverse A: e_(k+1) = (I - P inverse A) e_k.
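A small numerical sketch of that error equation (Python/NumPy, a little dense example purely for illustration): take one step of P x_(k+1) = (P - A) x_k + b and check that the error really was multiplied by M = I - P inverse A.

    import numpy as np

    n = 20
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # a small -1, 2, -1 matrix
    b = np.random.rand(n)
    x_exact = np.linalg.solve(A, b)

    P = np.diag(np.diag(A))                              # Jacobi choice of P, just as an example
    M = np.eye(n) - np.linalg.solve(P, A)                # iteration matrix M = I - P^{-1} A

    x_old = np.zeros(n)
    e_old = x_exact - x_old
    x_new = np.linalg.solve(P, (P - A) @ x_old + b)      # one iteration: P x_new = (P - A) x_old + b
    e_new = x_exact - x_new
    print(np.allclose(e_new, M @ e_old))                 # True: e_new = M e_old, and b has disappeared

So what does that say? That says that at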
every iteration, I multiply the error
by that matrix. Of course, if P
equals A, if I pick the perfect preconditioner, then
P inverse A is the identity, this is the zero matrix, and the
error in the first step is 0. I'm solving exactly. I don't plan to do that. I plan to have a P that's
close to A, so in some way, this should be close to I.
This whole matrix should be close to 0,
close in some sense. Let me give a name
for that matrix. Let's call that
matrix M. So that's the iteration matrix, which
we're never going to display. I mean, it would be
madness to display. We don't want to compute P
inverse explicitly or display it, but to understand it -- to
understand convergence or not convergence, it's M that
controls everything. So this is what I would
call pure iteration or maybe another word that's
used instead of pure is stationary iteration. What does stationary mean? It means that you do the
same thing all the time. At every step, you just use
the same equation, where -- these methods will be smarter. They'll adapt. They'll use more information. They'll converge faster, but
this is the simple one -- and this one plays
a part in those. So what's the story
on convergence now? What about that matrix M? If I take a starting
vector, e_0 -- my initial error
could be anything. I multiply by M to get e_1. Then I multiply by
M again to get e_2. Every step, I multiply
by M. So after k steps, I've multiplied k times
by M. I want to know -- so the key question
is, does that go to 0? So the question, does it go to
0, and if it does, how fast? The two parts, does it
go to 0 and how fast, are pretty much controlled
by the same property of M. So what's the answer? When does -- if I take -- I
don't know anything about e_0, so what property of M is going
to decide whether its powers get small, go to 0, so
that the error goes to 0? That's convergence. Or, maybe blow up, or
maybe stay away from 0. What controls it? One word is the answer. The eigenvalues. It's the eigenvalues of
M, and in particular, it's the biggest one. Because if I have an eigenvalue
of M, as far as I know, e_naught could be
the eigenvector, and then every
step would multiply by that eigenvalue lambda. So the condition is that
all eigenvalues of M have to be smaller than
1, in magnitude of course, because they could be negative,
they could be complex. So that's the total
condition, that's exactly the requirement
on M. How do we say how fast they go to 0? How fast -- if all the
eigenvalues are below 1, then the slowest
convergence is going to come for the eigenvalue
that's the closest to 1. So really it will be -- and this
is called the spectral radius and it's usually denoted
by a Greek rho of M, which is the maximum
of these guys. It's the max of the eigenvalues. And that has to be below 1. So it's this
number, that number, because it's that number
which really says what I'm multiplying by at every step. Because most likely, e_naught,
my initial unknown error, has some component in the
direction of this biggest eigenvalue, in the direction
of its eigenvector. Then that component will
multiply at every step by lambda, but that one will
be the biggest guy, rho. So it's the maximum eigenvalue. That's really what it comes to. What is the maximum eigenvalue? Let me take Jacobi. Can we figure out the maximum
eigenvalue for Jacobi? Of course, if I don't know
anything about the matrix, I certainly don't want
to compute eigenvalues. I could run Jacobi's method. So again, Jacobi's
method, I'm taking P to be the diagonal part. Let me make that explicit
and let me take the matrix that we know the best. I'll call it K. So k --
or A, this is the A -- is my friend with 2's on the
diagonal, minus 1's above and below. I apologize -- I don't
really apologize, but I'll pretend to apologize
-- for bringing this matrix in so often. Do we remember its eigenvalues? We computed them last semester,
but that's another world. I'll say what they are. The eigenvalues of A
-- this isn't M now -- the eigenvalues of this matrix
A -- I see 2's on the diagonal. So that's a 2. That accounts for the diagonal
part, 2 times the identity. I have a minus and then the
part that comes from these 1's, which are forward and back. What would von Neumann say? What would von Neumann
do with that matrix? Of course, he would test it on
exponentials and he would get 2 minus e to the i*theta minus
e to the minus i*theta. He would put the two
exponentials together into cosines and
that's the answer. It's 2 minus 2 cosine
of whatever angle theta. Those are the eigenvalues. The j-th eigenvalue is --
the cosine is at some angle theta_j. It's worth --
maybe just to take a minute again on this matrix. So the eigenvalues
are between 0 and 4. The eigenvalues never
go negative, right? Because the cosine
can't be bigger than 1. Actually, for this matrix
the cosine doesn't reach 1. So the smallest eigenvalue
is 2 minus 2 times that cosine close to 1. It's too close for
comfort, but that cosine is below 1, so the smallest eigenvalue is positive. And they're between 0 and 4. At the other extreme,
theta is near pi, the cosine is near minus 1,
and we have something near 4. So the eigenvalues of that --
if you look at that matrix, you should see. Special matrix eigenvalues
in this interval, 0 to 4.
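A quick numerical check of that statement (Python/NumPy; the size is arbitrary) -- the eigenvalues of this matrix are exactly 2 - 2 cos(theta_j) and they all sit between 0 and 4:

    import numpy as np

    N = 8
    A = 2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)   # 2's on the diagonal, -1's above and below
    theta = np.arange(1, N+1) * np.pi / (N+1)            # the angles theta_j
    print(np.linalg.eigvalsh(A))                         # computed eigenvalues, in increasing order
    print(np.sort(2 - 2*np.cos(theta)))                  # 2 - 2 cos(theta_j): the same numbers

The key point is,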
how close to 0? Now what about M --
what about P first? What's P? For Jacobi, so P for Jacobi
is the diagonal part, which is exactly 2I. That's a very simple
diagonal part, very simple P. It produces the matrix
M. You remember, M is the identity matrix minus
P inverse A. That's beautiful. So P is just 2I. So I'm just dividing A by
2, which puts a 1 there, and then subtracting
from the identity. So it has 0's on the diagonal. That's what we expect. It will be 0's on the diagonal
and off the diagonal, what do I have? Minus -- with that minus,
P is 2, so it's 1/2. It's 1/2 off the diagonal
and otherwise, all 0's. That's the nice matrix that
we get for M in one dimension. I have to say, of
course, as you know, we wouldn't be doing any of this
for this one-dimensional matrix A. It's tridiagonal. We can't ask for more than that. Direct elimination
would be top speed, but the 2D case, as we
understand, has a big bandwidth because
it's two-dimensional, three-dimensional even bigger
-- and the eigenvalues will come directly from the
eigenvalues of this 1D case. So if we understand
the 1D case, we'll knock out the two- and
three-dimensional easily. So I'll stay with this
one-dimensional case, where the matrix M
-- and maybe I should even write down exactly
what the iteration does -- but if I'm looking
now at convergence, do I have or don't I have
convergence, and how fast? What are the eigenvalues
and what's the biggest one? The eigenvalues of A are
this and P is just -- P inverse is just 1/2
times the identity. So what are the eigenvalues? Again, I mean, I know
the eigenvalues of A, so I divide by 2. Let me write down what
I'm going to get now for the eigenvalues of M,
remembering this connection. So the eigenvalues of M are 1
minus 1/2 times the eigenvalues of A. Better write that down. 1 minus 1/2 times
the eigenvalues of A. Of course, it's
very special that we have this terrific
example matrix where we know the eigenvalues and
really can see what's happening and we know the matrix. So what happens? I take half of that --
that's 1 minus cos theta. When I subtract it
from 1, the 1's cancel
-- I just get cos theta. Theta sub -- whatever
that angle was. It's a cosine and
it's less than 1. So convergence is going to
happen and convergence is going to be slow or fast according as
this is very near 1 or not -- and unfortunately, it is pretty
near 1, because the first, the largest one -- I could
tell you what the angles are. That's -- to complete
this information, I should have put in what --
and von Neumann got them right. The angle theta_j is
j times pi over N plus 1. Those are the
angles that come in, just the equally spaced
angles in the finite case. So we have N eigenvalues. And the biggest one is
going to be when j is 1. So the biggest one, rho,
the maximum of these guys, of these lambdas, will be
the cosine of, when j is 1, pi over N plus 1.
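Here is a one-line check of that spectral radius (Python/NumPy, illustrative size only):

    import numpy as np

    N = 50
    A = 2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    M = np.eye(N) - A/2                                  # Jacobi iteration matrix: 0's on the diagonal, 1/2 off
    rho = max(abs(np.linalg.eigvals(M)))
    print(rho, np.cos(np.pi/(N+1)))                      # the two numbers agree: rho = cos(pi/(N+1))

That's the one that's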
slowing us down. Notice the eigenvector
that's slowing us down, because it's the
eigenvector that goes with that eigenvalue that's
the biggest eigenvalue that's going to hang around the
longest, decay the slowest. That eigenvector is very smooth. It's the low frequency. It's frequency 1. So this is j equal to 1. This is the low frequency,
j equal 1, low frequency. The eigenvector is just
samples of the sine, of a discrete sine. Lowest frequency that
the grid can sustain. Why do I emphasize
which eigenvector it is? Because when you know which
eigenvector is slowing you down -- and here it's the eigenvector
you would think would be the simplest one to handle
-- this is the easiest eigenvector, higher-frequency
eigenvectors have eigenvalues that start dropping. Cosine of 2*pi over N plus 1, cosine of 3*pi
over N plus 1 -- those are going to be further away from 1. Now I have to admit that
the very highest frequency, when j is capital N -- so when j
is capital N I have to admit it happens to come back
with a minus sign, but the magnitude is
still just as bad. The cosine of N*pi over N plus
1 is just the negative of that one. So actually, we
have two eigenvalues that are both hitting
the spectral radius and both of them are the
worst, the slowest converging eigenvectors. This eigenvector
looks very smooth. The eigenvector for
that high-frequency one is the fastest oscillation. If I graph the eigenvectors,
eigenvalues -- and I will. You'll see that it's the
first and the last that are nearest in magnitude to 1. I was just going to
say, can I anticipate multigrid for a minute? So multigrid does -- is going
to try to get these guys to go faster. Now this one can be
handled, you'll see, by just a simple
adjustment of Jacobi. I just weight
Jacobi a little bit. Instead of 2I, I take 3I or 4I. 3I would be very good
for the preconditioner. That will have the effect
on this high frequency one that it'll be well
down now, below 1. This will still be the
slowpoke in converging. That's where multigrid
says, that's low frequency. That's too low. That's not converging quickly. Take a different grid
on which that same thing looks higher frequency. We'll see that tomorrow. So if I stay with Jacobi,
what have I learned? I've learned that this is rho. For this matrix,
the spectral radius is that number and let's
just see how big it is. How big is -- that's a
cosine of a small number. So the cosine of a small
number, theta near 0, the cosine of theta is 1
minus -- so it's below 1, of course -- 1/2 theta squared. Theta is pi over
N plus 1, so rho is about 1 minus 1/2 times (pi over N plus 1) squared. All these simple
estimates really tell you what happens when
you do Jacobi iteration. You'll see it on the output
from -- I mean, I hope you will -- code
up Jacobi's method, taking P to be a multiple of
the identity, so very fast, and you'll see the convergence
rate, you can graph. Here's an interesting point
and I'll make it again. What should you
graph in seeing -- to display convergence
and rate of convergence? Let me draw a
possible graph here. Let me draw two possible graphs. I could graph -- this is
k, the iteration count. As I take more steps, I could graph the error,
well, the size of the error. It's a vector, so
I take its length. I don't know where I
start -- at e_naught. This is k equals 0. This is the initial guess. If I use -- I'm going
to draw a graph, which you'll draw much better
by actually getting MATLAB to do it. Before I draw it, let me mention
the other graph we could draw. This tells us the error
in -- this is the error in the solution. If I think that's the
natural thing to draw, it is, but we could
also draw the error in the equation,
the residual error. So I'll just put up here
what I mean by that. If I put in -- so
A*x equals b exactly. A*x_k is close. The residual is -- let me call
it r, as everybody does -- is the difference A*x minus
A*x_k. It's how close we are to b, so it's b minus A*x_k. The true A*x gets
b exactly right. The wrong A*x gets
b nearly right. So you see, because we
have this linear problem, it's A times the error.
x minus x_k is e_k. So this is the residual
and I could graph that. That's how close my answer
comes to getting b right, how close the equation is. Where this is how
close the x_k is. So that would be the
other thing I could graph. r_k, the size of the
residual, which is A*e_k.
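If you want to produce those two graphs, here is a minimal sketch (Python/NumPy; the right side b and the sizes are made up for illustration) that runs plain Jacobi and records both norms at every step:

    import numpy as np

    N = 100
    A = 2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    b = np.random.rand(N)
    x_exact = np.linalg.solve(A, b)

    P = np.diag(np.diag(A))                        # Jacobi
    x = np.zeros(N)
    err, res = [], []
    for k in range(500):
        x = x + np.linalg.solve(P, b - A @ x)      # x_(k+1) = x_k + P^{-1} (b - A x_k)
        err.append(np.linalg.norm(x_exact - x))    # ||e_k||, the error in the solution
        res.append(np.linalg.norm(b - A @ x))      # ||r_k|| = ||A e_k||, the error in the equation
    # plot err and res against k on a semilog scale to see the two curves

So now I have to remember what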
I think these graphs look like, but of course, it's a little
absurd for me to draw them when -- I think what we see is that
the residual drops pretty fast. The residual drops pretty
fast and keeps going, for, let's say, Jacobi's method,
or, in a minute or maybe next time, another
iterative method -- but the error in the
solution starts down fast and what's happening here is the
high-frequency components are getting killed, but those
low-frequency components are not disappearing, because we
figured out that they give the spectral radius,
the largest eigenvalue. So somehow it slopes
off and it's -- the price of such a simple
method is that you don't get great convergence. Of course, if I put
this on a log-log plot, the slope would be
exactly connected to the spectral radius. That's the rate of decay. It's the -- e_k is
approximately -- the size of e_k is approximately
that worst eigenvalue to the k-th power. Do you see how this can happen? You see, this can
be quite small, the residual quite small. I think it's very
important to see the two different quantities. The residual can be quite small,
but then, from the residual, if I wanted to get
back to the error, I'd have to multiply
by A inverse and A inverse is not small. Our matrix A is pretty --
has eigenvalues close to 0. So A inverse is pretty big. A inverse times A*e_k -- this
is small but A inverse picks up those smallest
eigenvalues of A -- picks up those low
frequencies and it's bigger, slower to go to 0 as
we see in this picture. So it's this picture
that is our problem. That shows where Jacobi
is not good enough. So what more to
say about Jacobi? For this matrix K, we
know its eigenvalues. The eigenvalues of
M we know exactly. The matrix M we know exactly. Do you see in that matrix
-- now if I had said, look at that matrix and
tell me something about its eigenvalues, what would you say? You would say, it's symmetric,
so the eigenvalues are real and you would say that every
eigenvalue is smaller than 1. Why? Because every eigenvalue is --
if you like to remember this business of Gershgorin circles,
every eigenvalue is in a circle centered at this number -- in
other words, centered at 0 -- and its radius is the
sum of those, which is 1. So we know that every
eigenvalue is in the unit circle and actually, it comes
inside the unit circle, but not very much --
only by 1 over N squared. What do I learn from
this 1 over N squared? How many iterations
do I have to take to reduce the error by, let's
say, 10 or e or whatever? If the eigenvalues are 1 minus
a constant over N squared, I'll have to take
the N squared power. So if I have a
matrix of order 100, the number of iterations
I'm going to need is 100 squared, 10,000 steps. Every step is fast, especially
with the Jacobi choice. Here's my iteration -- and P is
just a multiple of the identity here. So every step is certainly fast. Every step involves a
matrix multiplication and that's why I began
the lecture by saying, matrix multiplications are fast. A times x_k -- that's the
work and it's not much. It's maybe five
non-zeros in A -- so 5N. That's fast, but if I
have to do it 10,000 times just to reduce the error
by some constant factor, that's not good. So that's why people think,
okay, Jacobi gives us the idea. It does knock off the high
frequencies quite quickly, but it leaves those
low frequencies. Well, I was going to
mention weighted Jacobi. Weighted Jacobi is to
take, instead of P -- put in a weight. So weighted Jacobi -- let me put
this here and we'll come back to it. So weighted Jacobi
simply takes P to be the diagonal
over a factor omega. For example, omega equals
2/3 is a good choice. Let me draw the -- you
can imagine that if I -- this is 2 times the identity,
so it's just 2 over omega I -- I'm just choosing a
different constant. I'm just choosing a
different constant in P and it has a simple effect
on M. I'll draw the picture. The eigenvalues of
M are these cosines. So I'm going to draw those
cosines, the eigenvalues of M. So they start -- I'll just draw
the picture of cos theta if I can. What does a graph of
cos theta look like? From 0 to pi? So I'm going to see --
this is theta equals 0. This is theta equal pi. If I just draw cos theta
-- please tell me if this isn't it. Is it something like that? Now where are these theta_1
-- the bad one -- theta_2, theta_3, theta_4, theta_5,
equally spaced, down here. So there's cos --
there's the biggest one. That's rho and you see
it's pretty near 1. Here I'm going to hit
-- at this frequency, I'm going to hit minus rho. It'll be, again, near 1. So the spectral radius
is that, that's rho. The cosine of that smallest
angle, pi over N plus 1. What happens when I
bring in these weights? It turns out that -- so,
now the eigenvalues of M, with the weights, will
be 1 -- it turns out -- 1 minus omega plus
omega cos theta. You'll see it in the
notes, no problem. So it's simply
influenced by omega. If omega is 2/3 --
if omega is 2/3, then I have 1 minus omega
is 1/3 plus 2/3 cos theta. Instead of graphing cos
theta, which I've done -- let me draw -- let me complete
this graph for cos theta. The eigenvalues unweighted,
with weight omega equal 1. Now it's this thing I draw. So what happens? What's going on here? For a small theta,
cosine is near 1. I'm again near 1. In fact, I'm probably even
a little worse, probably up here. I think it goes down like that. So it starts at near 1, but
it ends when theta is pi. This is 1/3 minus 2/3. It ends at minus 1/3. You see, that's a lot smarter. This one ended at minus 1,
that one ended at near minus 1, but this one will end with omega
-- this is the omega equal 2/3, the weighted one. All you've done is fix
the high frequency -- and that
was a good thing to do. The high frequencies
are no longer a problem. The low frequencies still are.
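A quick check of that picture (Python/NumPy, illustrative size): with omega equal to 2/3, the eigenvalues 1 - omega + omega cos(theta_j) run from just under 1 down to about -1/3, so only the low-frequency end stays near 1:

    import numpy as np

    N, omega = 50, 2/3
    theta = np.arange(1, N+1) * np.pi / (N+1)
    lam_plain = np.cos(theta)                          # unweighted Jacobi: from near +1 down to near -1
    lam_weighted = 1 - omega + omega*np.cos(theta)     # weighted, omega = 2/3: from near +1 down to about -1/3
    print(lam_weighted.max(), lam_weighted.min())

So the only way we'll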
get out of that problem is moving to multigrid. Can I, in the remaining
minute, report on Gauss-Seidel and then I'll come back to it? But quick report
on Gauss-Seidel. So what's the difference
in Gauss-Seidel? What does Gauss-Seidel do? It uses the latest information. As soon as you compute
an x, you use it. Jacobi computed all -- had to
keep all these x's until it found all the new x's, but
the Gauss-Seidel idea is, as soon as you have a better
x, put it in the system. So the storage is cut in
half, because you're not saving a big old vector while
you compute the new one.
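That "use it as soon as you have it" idea is clearest as a loop -- a minimal sketch (Python/NumPy, dense and unoptimized, purely to show the in-place update):

    import numpy as np

    def gauss_seidel_sweep(A, x, b):
        # One sweep: overwrite x component by component, so each update
        # already uses the newest values of the components before it.
        n = len(b)
        for i in range(n):
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        return x

Jacobi would compute the whole new vector from the old one before overwriting anything; here the overwrite is immediate, which is where the saving in storage comes from. I'll write down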
Gauss-Seidel again. So it's more up to
date and the effect is that the Jacobi
eigenvalues get squared. So you're squaring numbers
that are less than 1. So Gauss-Seidel, this gives
the Jacobi eigenvalues squared. The result is that if I draw
the same curve for Gauss-Seidel, I'll be below, right? Essentially, a Gauss-Seidel
step is worth two Jacobi steps, because eigenvalues
get squared as they would do in two Jacobi steps. If I do Jacobi twice, I
square its eigenvalues.
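You can see the squaring numerically -- a small check (Python/NumPy, illustrative size) comparing the two spectral radii for this matrix:

    import numpy as np

    N = 50
    A = 2*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
    M_j  = np.eye(N) - np.linalg.solve(np.diag(np.diag(A)), A)   # Jacobi iteration matrix
    M_gs = np.eye(N) - np.linalg.solve(np.tril(A), A)            # Gauss-Seidel iteration matrix
    rho_j  = max(abs(np.linalg.eigvals(M_j)))
    rho_gs = max(abs(np.linalg.eigvals(M_gs)))
    print(rho_j**2, rho_gs)                                      # for this matrix, rho_GS equals rho_Jacobi squared

So it seems like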
absolutely a smart move. It's not brilliant, because
this becomes cosine squared. I lose the 1/2, but I'm
still very, very close to 1. It still takes order N squared
iterations to get anywhere. So Jacobi is, you
could say, replaced by Gauss-Seidel as
being one step better, but it's not good enough. And that's why the
subject kept going. Overrelaxation gives rho as
1 minus order of 1 over N, first power, where Jacobi and Gauss-Seidel
were 1 minus order of 1 over N squared. So this is definitely
further from 1 and then this can be
way further from 1. So you can see some of
the numerical experiments and analysis that's coming. I'll see you Wednesday to finish
this and move into multigrid, which uses these guys.