So, we're gonna start on our very
last topic which is Markov Chains. And as you know, I have one more
homework due on the last day of class, which is next Friday. Almost all the sections
are still meeting today and tomorrow because there's really way too
much material for one more section. So this week we'll focus on law of
large numbers, central limit theorem, multivariate normal,
which we have been doing. Then in the last section next week, the TFs will do examples of Markov chains,
which is our last topic. Okay, so I'll just start from
the beginning of Markov chains. There's also a handout I wrote on
Markov chains which you can read at some point if you haven't seen it already,
just posted under the notes. One other thing not to scare anyone
before Thanksgiving, but the finals are on December 15th, so it's not really
too soon to start thinking about it. I did post a final review handout,
which is 66 pages, so there's a lot of stuff you can study. The 66 pages consist of
23 pages of review and then the rest of it just
consists of literally just all the past 110 finals that have
been at Harvard since 2006. I teach 110 every fall, and I just don't really believe in withholding information, or in some people getting access to more exams than others because they have a friend from wherever. This is the entire history of 110 finals; there are five full-length practice exams, and I think you should start practicing on them whenever you can: practice, practice, practice. And I posted the solutions
to them as well, but you should try your best to
resist looking at the solutions, except to check your answers or
if you're really, really stuck. So I would highly recommend doing at least two or three of those exams under conditions as close as you can get to real exam conditions. That is, time yourself strictly, you have four pages of notes double-sided,
that kind of thing. Then after each one,
then you can go through it. If you run out of time and you're not
done, it's still a good idea to spend some more time working on the problems without looking at the solutions. And then finally look at the solutions,
study your notes, study everything to see
where you can improve. So we've covered everything now except for
Markov chains, and so if you start on those practice exams before we've gotten far enough into Markov chains, it's not too much of a problem. Because in general the problems are not ordered so that the first problem is the first thing we did in the course and so on; it's not really chronological like that. But it has been true every year that I
put Markov chains as the last problem and nowhere else. So if you haven't gotten far enough on Markov chains yet, you can just skip the last
problem of each year. Until we get a little
farther with Markov chains. Everything else we've
covered the material so the sooner you start
practicing the better. So you can assume that the style will be similar to those past exams, and that will give you a good sense of everything. So Markov chains, what are they? Well, they were invented or discovered, depending on your
philosophical point of view. They were invented or discovered by Markov who we've already
heard of through Markov's inequality. He introduced them just over a hundred years ago; I think that was 1906. Now I'll tell you a little about
Markov's original reason for studying Markov chains, but the number of applications they found
since then has just been unbelievable, and it has gone in a lot of directions that Markov would not have anticipated. So a Markov chain,
it's an example of a stochastic process. I'll tell you just very briefly
what a stochastic process is. It's one of the most important
examples of stochastic processes. A stochastic process, well, originally we were dealing with one random variable at a time, and then sometimes two, and
with the central limit theorem and law of large numbers, we were looking
at sequences of random variables. So sometimes we have sequences
of random variables, and we're studying the sample mean, and
let n go to infinity, stuff like that. But when we're considering this
sequence of random variables, mostly we've been assuming
that they're i.i.d. The central limit theorem and law of large
numbers, there are more general versions of those theorems, but we've been assuming
i.i.d., which is a very strong assumption: independent and identically distributed. So a Markov chain is an example
of a stochastic process. Which basically just means random
variables evolving over time. So I'll just say examples
of stochastic processes. So, what does that mean? I mean,
it doesn't literally have to be time but that's the most common interpretation. So we have this sequence
of random variables X0, X1, X2, blah, blah, blah, blah, blah. We've been looking mostly at
the case where these are i.i.d. So we can think of the subscript as time, but if they're i.i.d., then it's just like saying it resets every time to a fresh independent random variable. So from a stochastic process point of view, it's more interesting when it's evolving in some way where there is some kind of dependence. But if we allow these to be
completely dependent in some arbitrarily complicated way,
that can be extremely complicated. You have this and they can all relate to
each other in some very complicated way. So Markov chains are kind of a compromise. It's basically going one step beyond i.i.d. So they're gonna become dependent, but it's in a very special way
that I'm gonna tell you about. That has very nice properties. So, we're gonna think of Xn as the state of a system. So, it can be interpreted very generally. You have any kind of system,
we'll see examples, that you're imagining
some kind of particle or whatever that's wandering around
randomly from state to state. It's jumping from state to state, and that's at time n, which we're
assuming is discrete right now. That is, n is a non-negative integer. So we can consider stochastic processes in continuous or discrete time, and continuous or discrete space, so there's basically four possibilities. That is, this index here,
we just do X0, X1, X2. If we want a continuous time,
we could look at Xt, where t is time,
which could be continuous. Then we'd have a collection
indexed by a continuous variable. Here we're just indexing it discretely,
so this is the discrete time. So in STAT 110, we're only gonna
look at discrete time processes. If you wanna go further into stochastic
processes, I highly recommend STAT 171, which is an entire course
on stochastic processes. We're gonna do Markov
chains in discrete time, so this is the discrete-time case, and we're also doing discrete space. Continuous space would mean that
each of these Xs is allowed to take values like let's say any real number
or values in some continuous space. But for our purposes we're gonna assume that each Xj takes values in the same discrete space, and actually we're gonna assume that it's a finite
space to make things easier. So we're gonna be looking at Markov chains
where we have finitely many states and each of these Xs is one of those states. And we just have this process that's
bouncing around randomly from state to state, and then we wanna
describe what the Markov property is. Okay, so here's the Markov property; what does Markovian mean? And then I'll tell you a little bit about why Markov introduced this and what some examples are. The Markov property is a certain
conditional probability statement. It says if we look at the next, our interpretation is think
of Xn as the current state. I mean I don't know how to
define the word now but we have some intuitive conception of now. Now is now, that's the present,
and then after now is the future. And before now is the past, right? So a lot of Markov chain stuff
you can understand intuitively if you think about past present and
future. So let's think Xn is right now and
we wanna predict the future. We wanna know what's gonna happen at time
N plus one cuz we're letting time be discrete. So one step in the future and we have
Xn is today, Xn plus one is tomorrow. Okay we want to know what's
the probability that Xn plus one equals j, where j is just some state. We're usually gonna assume that the states are numbered from 1 to capital
M just as a convention, but there's no rule saying
you have to do that. We're assuming there's
finitely many states. So we're saying what's the probability
that the state of the system tomorrow is state J, given the entire
past history of everything? So we're given, well, what's the state today? Xn = i. And what was the state yesterday? It was at Xn-1, and maybe that's i n-1, and Xn-2 is at i n-2, all the way back to the very beginning, and X0, let's say, is i0. So this is saying, conditionally, what's the probability of a certain state tomorrow, given the entire past
history of the system, right? That looks very complicated, right? It took a lot of time just to write all
these things and all these dot dot dots, and this is a big complicated thing. In a general stochastic process,
maybe you can't simplify this very much. But this is too complicated. The Markov structure means that only the most recent information matters. We can get rid of everything else. So this just becomes the probability that Xn+1 = j given Xn = i; you can get rid of all the rest.
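Just to record the statement in symbols, with i n-1, ..., i0 denoting the earlier states, the Markov property says:

P(Xn+1 = j | Xn = i, Xn-1 = i n-1, ..., X0 = i0) = P(Xn+1 = j | Xn = i).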
So all these concepts that we've been doing the entire semester come back here: the difference between independence and conditional independence. We're not saying that the past is independent of the future, but we are saying that if the Markov property holds, then the past and the future are conditionally independent given the present. So that's kind of a handy
intuitive way to remember it, so past and future are conditionally
independent given the present. Okay, that's exactly
what this statement says. Given the present, that is,
if we know the current value, Xn, if we know that, then anything that's
further back is just obsolete, outdated information which
we can just get rid of. So I just crossed out everything
further back in time, right? Only the most recent one matters. But note that this is conditional: if we wanted Xn+1 given Xn-1, and we were not given Xn, it doesn't mean we could get rid of Xn-1, right? It's conditional independence. So that's the Markov assumption. Whether that assumption is good or not for a real example completely
depends on the problem, okay? But that's the Markov property. And in particular for our purposes, let's actually assume that
this doesn't depend on n. So we'll usually call this thing qij, or you can call it whatever you want; let's call it a transition probability. So, sometimes this is called a homogeneous Markov chain. But a lot of times people will not bother
to say the word homogeneous, so you have to be a little careful from the context,
is it assumed to be homogeneous or not? Homogeneous means that this
doesn't depend on time, right? So this is saying we're looking at
the case where the probability, if the system is at state i, and then we wanna know what's
the probability at the next time point? It's at state j.
If that's always qij, if it doesn't change with time, then we would say that
that's homogeneous or time homogeneous. And for
our purposes we're just gonna be studying homogeneous Markov chains, so these qijs don't depend on n, cuz that's a nicer case to look at, okay?
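In symbols, homogeneity is saying that qij = P(Xn+1 = j | Xn = i) is the same number for every n.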
So just to have a mental picture in mind, I just made up a simple little example. It really helps to have pictures in mind,
so here's one Markov chain. You can make up your own, but
here's one that I just made up. So there are four states
we can represent as ovals. Let's number the states 1 through 4, okay? And then to specify the Markov chain
we just need to specify transition probabilities, that is what's the
probability going from state 1 to state 2, all the different states. So it may not be possible to go from
any state to any state in one step, in fact usually it's not possible. So for example,
that's just an example I made up, let's say the system is at state 1. It has probably one
third of staying there. And it has probably two
thirds of going to state 2. I'm just drawing arrows and I'm putting
probabilities on the arrows that reflect the probabilities of
those different transitions. From state 2, it's equally likely to go back to state 1 or to go on to state 3; I guess I'm putting the probabilities above the arrows. So from state 2, it either goes here or here, equally likely. From state 4, it either goes back to state 1 with probability one half, or it goes to state 3 with probability one quarter, or it stays there with probability one quarter. And then, just to make life a little bit easier, let's say from state 3 it always goes to state 4, so I'll just put a 1 there. You can make your own example. Anyway, it just helps to have
a picture like this in mind, and you can have a Markov chain that has infinitely many states, and then you can't actually draw all infinitely many states. But we're only gonna be looking at the
case where there are finitely many states. So you can always, for our purposes, you can always imagine a picture like
this where we have some number of states. And then you have a particle randomly
bouncing around between states, following the arrows, and the arrows
may have different probabilities. Okay, that's the mental picture. And then to explain in that picture,
what does this property say? It says that you are wandering around
from state to state, like that. And then what this says is,
if I wanna know the future. Or predict the future. All I care about is the current state. I don't care about the whole history of
all of the long wanderings throughout time to get there. It doesn't matter how you got there. It only matters where you are. That's what the Markov
property says intuitively. So we have these transition
probabilities q ij, and we can write that as a matrix,
that's called the transition matrix. So we can always just draw a picture
like this, but sometimes it's easier to work with a matrix, rather than
having to draw a picture every time. So we can encode the same information as
this picture just written in matrix form. The transition matrix,
usually we call Q in this course, but you don't have to call it that. That's just the matrix of all the qijs,
all the transition probabilities. So for this example it's
gonna be a 4 by 4 matrix, and then I'm just gonna write, what's
the probability of going from 1 to 1? That's one-third, because there's this loop there that has probability one-third. The probability of going from 1 to
2 is two-thirds, from 1 to 3 or 1 to 4 is 0,
we can't go there in one step. We can go there in more than one step. Then from state 2,
it either goes back to state 1 or it goes to state 3 with
equal probabilities. So this will be 1/2, 0, 1/2, 0. From state 3 it always goes to state 4,
so it's gonna be 0, 0, 0, 1. I'm just writing down these qijs as a matrix. And then from state 4 it either goes
back to state 1, with probability 1/2, or it goes to state 3, probability 1/4, or it stays at state 4, probability 1/4. So that would be the transition matrix. Notice that every row sums to one. And, again, it's just a convention,
some people would use the transpose of this matrix instead, and then
the columns sum to one, just a convention. We're going to take the convention
that we write it this way, the i, j entry is the probability of
jumping from i to j in one step. So therefore each row sums to 1. Columns may or may not sum to 1,
but the rows sum to 1. So we wanna go the other way around and say, what's a valid Markov
chain transition matrix? Well, we can just write down
a blank square matrix, fill in any numbers you want as long as they're
non-negative and each row sums up to one. And then you can easily see how you would
then convert that to a picture like this, just drawing arrows with probabilities. So, that's what a transition matrix is.
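Just to make the picture concrete in code: this isn't from the lecture, but here's a minimal sketch in Python with numpy (my own choice of tool) of the 4-state example from the board, a check that every row sums to 1, and a short simulation of the particle bouncing from state to state.

```python
import numpy as np

# Transition matrix for the 4-state example from the board:
# entry (i, j) is the probability of going from state i to state j in one step.
Q = np.array([
    [1/3, 2/3, 0,   0  ],   # from state 1: stay with prob 1/3, go to state 2 with prob 2/3
    [1/2, 0,   1/2, 0  ],   # from state 2: back to 1 or on to 3, equally likely
    [0,   0,   0,   1  ],   # from state 3: always go to state 4
    [1/2, 0,   1/4, 1/4],   # from state 4: to 1 w.p. 1/2, to 3 w.p. 1/4, stay w.p. 1/4
])

assert np.allclose(Q.sum(axis=1), 1)     # every row sums to one

# Simulate the particle bouncing around for a few steps, starting at state 1.
rng = np.random.default_rng(0)
state = 0                                # 0-indexed here; this is "state 1" on the board
path = [state]
for _ in range(10):
    state = rng.choice(4, p=Q[state])    # jump according to the current row of Q
    path.append(int(state))
print([s + 1 for s in path])             # print the path using the 1-through-4 labels
```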
By the way, the two most important concepts, other than the basic definition of the Markov property, the two most important concepts we're gonna need: one is the transition matrix. Now you know what a transition matrix is,
the other one's stationary distribution, which we'll talk about later. So, that's the Markov property. So to tell you a little bit
about the history, well, so, how it's used now, it's being used now,
in two sort of different ways. One is as a model that, might be used in various problems in the social
sciences, physical sciences, and biology, where you actually, literally believe
that you have a Markov chain. Or using it as an approximation for
some system evolving over time. That's very, very useful. But that's sort of limited in the sense that it seems like a pretty strong
conditional independence assumption. Where you can forget the entire past
as long as you have the present. So there have been thousands of pages
of debates about whether if you take the stock price over time,
is that Markovian or not? Or is the weather Markov,
things like that. So those become empirical questions,
which you can study. One thing to point out, though, is that it's not as limiting as it looks, in the sense that this is sometimes called
a First Order Markov Chain, where it only goes one step back. You can generalize this easily to the case where the conditional independence
is that it depends on Xn and Xn- 1. And it turns out that to understand
how that kind of thing works, the right starting point is
to be looking at this anyway. So you can extend this and consider, Markov chains where it goes,
ten steps back or things like that. Anyway, that's still kind of an empirical question: is that useful for a real system or not, or is that too strong of an assumption about this conditional independence? But the other way that this is used is for Markov chain Monte Carlo, which essentially created a revolution in scientific
computing and is being used everywhere. And I like to challenge people to find an example of some major field of study where Markov chain
Monte Carlo has never been applied, and no one has successfully
met that challenge yet. So say you're interested in French poetry, you can find Markov chains being
applied to different poetry. But why is that? Do you really want to think of French
poetry as actually following this model? No, but, the idea of Markov Chain
Monte Carlo is that you don't have to worry about whether the actual process
you're observing follows a Markov Chain. Markov Chain Monte Carlo means that
you synthetically construct your own Markov Chain. You then can't argue is it Markov or
not, because you built your own chain. Why would you want to
build your own chain? Well, the idea, which is a very beautiful,
one of the most useful and beautiful ideas of the 20th century, in my opinion, is that you can construct
your own Markov chain that will converge to a distribution
that you're interested in. Cuz suppose you're trying to
simulate some complicated system and the computations are too hard to do
everything analytically and explicitly. There are some extremely clever
ways to construct a Markov chain synthetically, that will converge
to the thing you're interested in. And so then you just program your Markov
chain on the computer, so this is a very recent thing, cuz we need to have fast
computers in order for this to be useful. It's a recent idea, you program
your Markov chain on the computer, run the chain for a long time, and then use those results to study the
distribution that you're interested in. So that is the idea; I'm just gonna write the acronym, MCMC. We don't really have time to
talk more about MCMC than that. But Markov chain Monte Carlo is
just getting more and more popular. Widely used every year as a way of doing
extremely complicated simulations, and computations that would have been
completely impossible, like 50 years ago. Or even 30 years ago. Okay, that's Markov chain Monte Carlo. So Markov chain Monte Carlo is when you
build your own Markov chain, right? And then the other application is just
that you have this natural system that you choose to model it,
using the Markov chain. So those are two important, but
different uses of Markov chains. Okay, so both of those ideas are very, very different from what Markov
himself originally had in mind. When he introduced Markov chains,
it was actually to help settle a religious debate over the existence of free will. Kind of a religious philosophical thing. So at the time, this was the early 1900s, the law of large numbers
had recently been proven. Like here, we recently proved
the law of large numbers. Okay, and I told you the story
of the athlete who was very upset about the law of large numbers
because he said in the long run, he'd always be average, and
so he couldn't improve. And so he hated statistics. Which is kind of like ridiculous. But that was actually very
similar kind of a concern that people had when the law of
large numbers was first proven. Which was that some of
the philosophers were upset that the law of large numbers might
prevent us from having free will. Because it says in the long run,
converge to the mean. Where's the scope for
having free will then, right? I'm saying it will converge to this,
right? Where's free will, okay? So one of Markov's rivals tried
to rescue free will by saying, well the law of large numbers assumes IID. And human behavior is not IID,
therefore we're okay, right? Okay, but you can see that that's not
a very convincing argument because we proved that if the random variables are
IID, then the law of large numbers holds. But we didn't show that if it's not IID, then the law of large numbers doesn't
hold, and there are more general versions. So what Markov wanted to do was to find a process that's one step in complexity beyond the i.i.d. case,
right? That's what this is, right? The IID case, would say, you can just completely forget
everything you're conditioning on. And he wanted to go one
step in complexity beyond IID where you go one step backward, right? So go one step forward in thinking by
going one step backward in conditioning. You're going one step beyond IID. And then he proved a version
of the law of large numbers that applies to Markov chains. And so then he said,
well look you don't actually need IID for the law of large numbers to hold, okay? And the chain that he
originally looked at, which might be the first Markov chain,
was this: he took a Russian novel and empirically looked at consonants and vowels. And he said, okay, there are two states: vowel and
consonant, and a vowel is either followed by a vowel or
a consonant. A consonant is either followed
by a vowel or a consonant. And then he kind of empirically computed the probability that a vowel is followed by a vowel, a vowel is followed by a consonant, and so on. And he was just studying
that simple chain. So just like this picture,
except even simpler. Only two states, and
I drew one with four states. But conceptually, it's the same thing. You're just bouncing around
between states, okay? So that's the whole
idea of a Markov chain. And if you keep these
simple pictures in mind, it will make things much,
much, much easier. Even for much bigger examples, we could have a Markov
chain that has 10 to the 100 states, but you can still keep pictures
like this in mind, okay? So we've defined the transition matrix. It's just this matrix of
transition probabilities. But I wanna show you how we actually use the transition matrix to get higher-order things, like the following probability. For example, we wanna answer the question: this is the probability of
going from i to j in one step, what's the probability of going from
i to j in two steps, or ten steps? Things like that. Those questions can be answered in
terms of the transition matrix. Okay, so let's do that now. So suppose that at time n, which we're considering as the present, the chain, which is Xn at time n, has distribution s, which is a row vector; in other words, we can think of it as a 1 x M matrix. And all I mean by this is, we're taking the PMF. I'm assuming that there are M states,
so here M = 4, in this example. M is the number of ovals in this picture. So we're assuming M states,
and we're saying, okay. The distribution,
I just mean the PMF cuz it's discrete, but since there are only finitely many values,
you may as well just list them all out. So this is, this thing is the PMF kind of
just listed out as a vector, right? So it's the probability of being in state one, the probability of being in state two, and so on; just list out the probabilities, okay?
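In other words, s = (P(Xn = 1), P(Xn = 2), ..., P(Xn = M)).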
So in particular, we know that s is a vector and we're writing it as a row, and the entries are non-negative and
add up to one. Okay, so let's assume that that's
how things stand at time n, and then we wanna know what
happens at time n plus one. So we wanna know what's
the P (Xn + 1 = j). So we wanna know the PMF,
this is the PMF at time n. It's just we're writing it as
a vector because that's convenient. That's the PMF at time n, now we want the
PMF at time n + 1, one step in the future. Okay, well, to do that it's not hard
to guess that what we should do is condition on the value now, right? Condition on what we wish that we knew,
it would be convenient if we knew where the thing is now,
that would be very useful, right? So we're just gonna condition, so we're gonna sum over all
states i of the P(Xn + 1 = j given Xn = i) times P(Xn = i),
right? Just the law of total probability. But these are things we know. Just by definition, this is qij. That is the probability of going from
i to j, by definition that's qij. And this thing here. Is just SI, right? Because that's just
the probability of being at i, and that's just the ith value in this vector, so that's just si.
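So, written as an equation, what we have on the board is:

P(Xn+1 = j) = sum over i of P(Xn = i) P(Xn+1 = j | Xn = i) = sum over i of si qij.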
I'm gonna write the si on the left, because it looks a little bit nicer to me. So that's the answer as a sum, but it's easier to think of
this in terms of a matrix. This is in fact the jth
entry of S times Q. Remember Q is an m x m matrix,
and S is a 1 x m matrix, so it's valid to multiply these. The inner numbers match up and then the dimensions of this
are the outer numbers, 1 by M. So S times Q is a 1 by M matrix. And you don't need a lot of
matrix stuff in this class but I'm assuming you can at
least multiply matrices. How do you multiply matrices? Well, you take a row of this and
do a dot product with a column of this. And when you do that,
it's exactly the sum, right, you're just going row dot column and
you're adding this up, okay? So the interpretation of
this sum is just the entry of this vector
times this matrix. Therefore, S times Q is the distribution, the PMF written as a vector, at time n + 1, okay? So that's now playing the role of S. So, by the same argument, if we do SQ squared, that's the distribution at time n + 2, two steps into the future, and SQ cubed is the distribution at time n + 3, and so on. Because it's just the same thing again,
right? Because now S times Q is now
playing the role of S, right? And what this calculation says is: to go one step into the future, multiply on the right by Q. So if I multiply SQ by Q, we get SQ squared; multiply this by Q, you get SQ cubed, and so on, okay? So what that says is that, if we multiply the initial vector
of probabilities by Q to a power, then that gives us the distribution
that number of steps in the future. Of course this is gonna be pretty tedious to do by hand for most problems, but using a calculator or a computer that can multiply matrices, you just take powers of Q; you don't have to think that hard each time, you just use powers of Q and you know how this is gonna evolve. Okay, so similarly, by definition Q gives the one-step probabilities, so
the probability that Xn + 1 = j, given Xn = i,
by definition that's just qij. Okay, but suppose we wanted the two-step thing: the probability that Xn + 2 = j given Xn = i, okay. This is a similar calculation
to what we just did. This is the transition probabilities. This is saying go one step ahead in time. And now this says what happens
if we only know Xn but we want to predict two steps ahead? Well it's not hard to figure
what to condition on, right? The missing link here is Xn + 1, that's what's missing, so clearly we should condition on Xn + 1. So let's say we condition on Xn + 1. So this is the sum over k of P(Xn + 2 = j given Xn + 1 = k, Xn = i), I'm just conditioning on Xn + 1, right, cuz that's the missing intermediary that we want, times P(Xn + 1 = k given Xn = i). That's just the law of total probability, the conditional version where everything is given Xn = i. Let's simplify that into something
that's more interpretable. First of all,
the Markov property says that, if we know the value at time n + 1 and
the value at time n, this is obsolete. That's irrelevant now,
cuz of this conditional independence, we only want the most recent value,
all right? So we can get rid of that. So by definition, Now we're just saying
we're going in one step from k to j. By definition, that's just
the transition probability qkj. And then this one is also one step, right? This is saying go from i to k in one step. And again I'm gonna write this on
the other side just cuz it looks nicer: qik qkj. I just swapped the order of those two terms to make it look nicer. We've expressed it in terms of one-step things, okay.
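So, written out, the two-step probability is:

P(Xn+2 = j | Xn = i) = sum over k of qik qkj.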
And now, something that looks like this, something of this form, has a nice interpretation in terms
of matrix multiplication again. This is just the ij entry,
row i, column j of q squared. Because well, why is that true? Well, just think of,
how do you do q squared? You do q times q. So you take a row of q and dot product with a column of q, and you'll get this, okay? So, similarly, same idea,
that we wanna know what's the probability, just repeating the same argument. We wanna know like,
what's the probability that X, let's say, I don't know, m steps in the future,
Xn + m = j given Xn = i. So this is saying it's at i now
what's the probability that it'll be at j m steps in the future? That's just gonna be the ij
entry of q to the m. So, what works out pretty nicely is that the powers of the transition matrix actually give us a lot of information. We don't need a separate study of every different transition and every different number of steps; we can just work with powers of the transition matrix. Very, very nice.
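Again, this isn't from the lecture itself, but here's a minimal sketch (Python with numpy, continuing the 4-state example) of how powers of Q answer these questions; the starting distribution s below is just something made up for illustration.

```python
import numpy as np
from numpy.linalg import matrix_power

# Same 4-state transition matrix as before.
Q = np.array([[1/3, 2/3, 0,   0  ],
              [1/2, 0,   1/2, 0  ],
              [0,   0,   0,   1  ],
              [1/2, 0,   1/4, 1/4]])

s = np.array([1.0, 0.0, 0.0, 0.0])   # made-up starting PMF: in state 1 for sure

print(s @ Q)                    # PMF one step later:   s Q
print(s @ matrix_power(Q, 5))   # PMF five steps later: s Q^5

# The m-step transition probabilities are the entries of Q^m:
# P(X_{n+m} = j | X_n = i) is the (i, j) entry of Q^m.
print(matrix_power(Q, 2)[0, 2])  # probability of going from state 1 to state 3 in two steps
```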
Okay, so just a couple more quick concepts. Well, one that's not so quick is
the notion of a stationary distribution. And I'm gonna define it now, but
we'll talk about it in detail next time. But I just want to mention it now just
cuz it's one of the most important ideas. Transition matrix and stationary distribution are the two most important ideas other
than the basic definition. What's a stationary
distribution of a chain? It's also called a steady state or long-run or equilibrium distribution, that kind of thing. It's supposed to capture the idea, so one reason it's important is if
we run the chain for a long time, we wanna know, does it converge
to some limiting distribution? Okay, and under some very mild conditions,
the answer is yes, it will converge to something that we
call a stationary distribution, okay? So that's gonna be very important for
describing how the chain behaves. Especially in the long run, but it's
also just useful for other things too. So we say that S,
which is a probability vector. 1 by M again. So again we're just thinking of
that as a PMF written as a row. And the definition is that this is stationary for the chain, for the Markov chain that we are considering, if S times Q equals S. Just to make sure this makes sense again: S is a 1 x M, Q is an M x M, so S times Q is gonna be a 1 by M matrix. So the definition is SQ = S, okay? So I'll tell you the intuition for
this definition but first of all for those of you who've seen eigenvalues and
eigenvectors before, this is pretty reminiscent of
an eigenvalue eigenvector equation. Usually you see eigenvalues and
eigenvectors written the other way, like a matrix times the vector but if you want to write it that way just
take the transpose of both sides. So that's exactly an eigenvalue
eigenvector equation, for those of you who know what that is. If you don't know what an eigenvalue is,
well you should learn but you don't need to know it for
this class, okay?
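For those of you who do know what that is, here's a minimal sketch of that computation (again Python with numpy, my own choice, on the made-up 4-state example from earlier): it finds an eigenvector of Q transpose with eigenvalue 1, rescales it so the entries add up to 1, and checks that sQ = s.

```python
import numpy as np

# Same 4-state transition matrix as before.
Q = np.array([[1/3, 2/3, 0,   0  ],
              [1/2, 0,   1/2, 0  ],
              [0,   0,   0,   1  ],
              [1/2, 0,   1/4, 1/4]])

# sQ = s is the same as Q^T s^T = s^T, so s^T is an eigenvector of Q^T with eigenvalue 1.
vals, vecs = np.linalg.eig(Q.T)
k = np.argmin(np.abs(vals - 1))     # pick the eigenvalue closest to 1
s = np.real(vecs[:, k])
s = s / s.sum()                     # rescale so the entries form a valid PMF

print(s)                            # a stationary distribution for this chain
print(np.allclose(s @ Q, s))        # check: sQ = s
```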
Now for the intuition: we just showed a few minutes ago that if the chain at time n follows the distribution s, then sQ is the distribution
it says if it starts out according to S, one step forward in time,
it's still following S. That's why it's called stationary, right? Cuz it didn't change. And we go another step forward in time,
the distribution is still S. And you go one more step,
it's still S, right? So if the chain starts out of
the stationary distribution, it's gonna have the stationary
distribution forever. It doesn't change,
that's why it's called stationary. Okay, well, there are a lot of questions,
we can ask though. I won't quite call this a cliffhanger but there's some important
questions we can ask, right? One is does a stationary
distribution exist? That is, can we solve this equation? Now, even if we solve this equation, if we
got an answer that had like some negative numbers and some positive numbers,
that's not gonna be useful, right? So we need to solve this for
S that is non negative and adds up to 1. So does such a solution
exist to this equation? Does it exist? Secondly, is it unique? Thirdly, just now,
I just kind of said intuitively that this has something to do with the long
run behavior of the chain, right? But this definition doesn't seem like it
has anything to do with any long run or limits, right? This is just S times Q equals S. So does the chain converge
to S in some sense? And we'll have to talk more about what
that means and whether it's true. Okay and
then there's a more practical question. Assuming that it exists and it's this
beautiful thing that tells us the long run behavior, there's still the question
of how can we compute it, right? So how can we compute it? So those are basic questions about stationary distributions, okay? So under some very mild conditions
the answer will be yes to these first three questions. There are a few technical
conditions that we'll get into but under some mild technical conditions. It will exist. It will be unique. The chain will converge to
the stationary distribution. So it does capture the long run behavior. As for this last question though,
how to compute it? In principle, if you had enough time,
you can just use a computer or well, if you had enough time you
can do it by hand in principle. Solve this equation, right? Even if you haven't done matrices before. You could write this out as some
enormous linear system of equations. And if the solution exists, in principle you can find it; you eliminate variables one by one, okay? That might take hundreds of years,
all right? So, can we compute it efficiently without
spending years doing tedious calculations? Or even with a computer,
it may get too nasty of a calculation. And so the answer to that,
in the very general case is that, it may be an extremely difficult
computational problem. But we're gonna study
certain examples of Markov chains where we can not only compute it,
we can compute it very quickly and easily without actually having
to even use matrices at all. So there's one really,
really nice case that we're gonna look at. Not today, though. Okay, so that's all for today,
happy Thanksgiving, everyone.