Okay, we are about to get started.
My name is Dimitri Bertsekas, a professor here at MIT, and I'd like to introduce
the next speaker - plenary speaker John Tsitsiklis. I'm sure John needs no
introduction to most of you but let me say a few things. He grew up in Greece
and came to MIT as an undergraduate in the mid 70's, and then he stayed on at MIT
up until now with the exception of one brief year at Stanford in the mid 80's. We, all of us, know John for his research and for his books, but he also has a very substantial service record. He has served -
he's now serving - as the director of LIDS, the Laboratory for Information and Decision Systems. Before that, he was a co-director of the Operations Research Center and
other positions within MIT but he has also been in very responsible service positions in his native Greece, including being the chair
of the Council of the Harokopio University. Turning to his research, he is
well known for that and has received quite a few awards, and the most recent
one is the IEEE Control Systems Award just last December, and just before
that, I had the great pleasure of sharing with him the John von Neumann award of
Informs for two books that we wrote together. One in 1989 on parallel and
distributed computation and the other one in 1996 on neuro-dynamic programming, and the interesting thing about these two books was that they both were the first
books on subjects that were very marginal at the time. Very few people
were looking at them and they were considered highly speculative. The other common thing about these two books is that 20 years later, the subject
became very popular, very hot. Parallel and distributed algorithms including asynchronous algorithms are very popular nowadays in
machine learning, and the second book Neuro-dynamic Programming is better
known these days as 'reinforcement learning' and reinforcement learning is
actually the hottest field in artificial intelligence these days, and John will
talk to us about reinforcement learning. [John Tsitsiklis]: All right, good afternoon. Thank you very much Dimitri for the introduction. Actually, it's a great pleasure to have this introduction coming from Dimitri, with whom I had so many great years working together. I don't know if I should thank Devavrat (Shah) or not for the
invitation, because he's having me speak on a topic that I was kind of qualified to talk about 20 years ago and rather unqualified today. The good thing
is that, because I'm coming from the theory side, fortunately or unfortunately, I'm not sure, there haven't been too many big theoretical breakthroughs since then. *audience laughter* Although the field has moved in great leaps, I mean we hear about the great success
of reinforcement learning in practice but we have a little bit of a theory-
practice gap here, in that the theory is lagging behind the practice, which is not
the usual way that things happen these days. In any case, what I will try to do
is - probably missing most of the people in the audience - first, to sort of run through the core ideas behind reinforcement learning. For some, you can
think of this as a tutorial that maybe goes too fast. I'm not going to go into
any great technical detail but we'll try to give a sense of the main key
conceptual ideas. All right, so okay, what is reinforcement learning about?
It's about making decisions in a situation that has some dynamics - things
evolve in time - and there is some uncertainty as well, and decisions have
to be made sequentially. How does one deal with such a context? That's
called stochastic control, that's the field of stochastic control, and the key
method for dealing with such problems is dynamic programming that goes back to Bellman. Now, for me, the subject starts when I was kind of starting graduate
school, and so I read Dimitri's book on dynamic programming from cover to cover,
and then I had the pleasure - that was actually the
first class that I taught. So dynamic programming gives you the methodology
for doing these things, and then enter reinforcement learning. There's a
sequence, and we'll not try to give a fair history there, but starting with the
checkers-playing program by Samuel, some ideas evolve. The modern version of
reinforcement learning pretty much starts with the activities of Barto and
Sutton, around the mid 90's or so. They wrote this book in 1998 called Reinforcement Learning. Now, from around that time when the activity was happening, we had the luck that some people like Barto and Michael I. Jordan pointed it out to Dimitri and myself and kind of said 'look, there's that thing
happening down there, kind of feels that it has something to do with the stuff
that you guys are doing.' So we looked at it, and the way we looked at it, it was basically a set of approximate methods for solving the classical dynamic
programming problem. Now when one says approximate, you first start thinking
about approximations in the traditional - let's say differential equation setting -
where approximation means discretizing, but these were a different kind of
approximation, they were approximations based on neural networks and that's why
we decided to give this title to our book. It's dynamic programming that
involves neural networks. Now, of course on the substance, it's really one and the
same topic. Now leap forward - that's 20 or 30 years later - and we have this big
success, I guess everybody must have heard about it, but with DeepMind with
AlphaGo and AlphaGo Zero they beat pretty much anyone in go, in chess
whatever - it's a really spectacular success. So it almost feels as if the
field has reached the point that it wanted to reach. Well, all of that, as Dimitri said, these days is called reinforcement learning.
What exactly is reinforcement learning? You look at Wikipedia: it's about systems that are supposed to learn their behavior based on feedback from the environment, and the idea is like a little child that you send out into the world and it learns how to play, or animals that kind of learn where to look for food, and so on. Is that exactly what happened with AlphaGo and
AlphaGo Zero? Was it interacting with any environment? So did that promise, did
that vision of RL - has it already materialized or maybe not quite. So I
guess the answer is going to - my answer is going to emerge as I move on, and the
other thing that's a bit concerning is that there is this huge thing that
happened, but the methods that are used do not really rely on any big
theoretical breakthroughs of the last three years. It's really huge
computational power together with people learning, figuring out how to use deep
neural networks, plus a few extra very clever subroutines or tricks that come in. So I'll walk you through all of this and try to make sense of the
field. So I'm going to talk a little bit about what dynamic programming is, set
the terminology and all that. The key ideas are approximation and
parametrization and there's different things that one can approximate or
parametrize and that gives different types of methods. So I'll walk through
those types of methods as well. I want to spend a little bit of time on something
that's really quite important: there, we are talking about offline algorithms versus online algorithms. So when you say systems that learn from their environment, you have that image of something working in the real world. Is that happening, or what does it take for it to happen? All right, so, dynamic programming. Okay, it's the first time in my life that I write
the state s_t instead of x_t, so I feel like I'm betraying my intellectual heritage *audience laughter* and I guess sometimes you have to do that. So there's a system, it has a state, and it's being controlled; what the controller does is it looks at the state and makes a decision. The decision could be just a function of the state, so it would be phi of s_t, but more broadly, it could choose the decisions in a randomized way, with a distribution that depends on the particular state, and when that's done there's a cost that's incurred that
has to do with the state of the system and the action. Now, if you're playing a
game, the costs are zero and only at the end you get a cost of one or minus one depending on whether you won or not. Okay, here's a compromise that I didn't make: I kept this as cost as opposed to reward. Somehow engineers have always thought about minimizing the costs of how you do things. In the brave new world of AI, people only talk about maximizing rewards. Maybe there's - I don't know what that means, but I'm sticking to costs. So we want to drive the cost down. Now the system
is a dynamical system, it has some dynamics and the dynamics are that the
next state is determined from the previous state somehow. It could be
through a dynamical equation that gives you the next state, or if you come from
an Operations Research type of tradition, you might describe it differently in terms of transition probabilities. So the state evolves like a Markov chain,
except that the Markov chain is controlled also by the decisions. So it's
a Markov chain whose transitions are affected by the decisions or the actions
that you're taking. All right, so what we're interested in is to minimize the cost over some horizon, starting from some given initial state, and there's a discount factor. It's a factor less than 1 that basically means that future costs matter less than current costs; that's a reasonable assumption in economics, for example. And what we want to do is to minimize this over all control policies - that's our objective - and the value of that minimum is the value function. We denote that by V-star; remember that symbol. It's the optimal cost, it's the
cost that you are going to incur if you act optimally over the given time
horizon, and it's a minimization over all policies and that makes the problem difficult. You're minimizing over functions, you're not minimizing
over decisions, you're minimizing over policies. What is capital T? It could be a
given finite horizon, it could be a random time at which the world ends or
the most convenient formulation - I wish I knew how to go back -
Okay, the most convenient formulation, I'm going to stick to that for simplicity, is
when the time horizon is infinite and the discount factor is less than one
which helps to keep the overall cost finite. Quite often, that discount factor
there is a surrogate. The problem that one really wants to solve in many
situations, is that you want a system that keeps running forever and you're
interested in the average performance of that cost. Mathematically, that's a more
difficult problem but it is approximated well by taking a discounted problem
with a discount factor that approaches one, so we'll stick with a discounted
formulation. Interesting point from dynamic programming theory is that there
exists an optimal policy under some basic assumptions. One policy that's good
no matter what your starting state is. Not the same action - the action will
change depending on the state, but the policy? You just need to find one policy which is simultaneously optimal for all possible starting states. Now dynamic programming, this framework basically is everywhere. No matter
what your field is, I'm sure you can come up with examples of problems and
applications where you have big complicated problems of sequential
decision making under uncertainty and where dynamic programming could be
applied, and that goes from toy situations like games all the way to
serious applications like energy or airplanes or cars or factories and so on.
All right, so, therefore, we take it for granted that it's an important problem.
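To pin down the discounted formulation in symbols (writing the per-stage cost as g(s_t, a_t), which is an assumed notation rather than a quote from the slides):

```latex
V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, g(s_t, a_t) \;\Big|\; s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\Big],
\qquad
V^{*}(s) = \min_{\pi} V^{\pi}(s), \qquad 0 < \gamma < 1.
```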
How do we solve it? Here's the way of thinking. If I'm sitting somewhere at
some state and I have to make a decision, what do I need to take into account? I need to take into account the cost that I'm going to
incur right now as I'm making my decision, but then I will move to some
next state and from the next state I'm going to incur some further cost. Let's
suppose that tomorrow after I get to the next state, I will be infinitely wise and I'm going to do the optimal thing starting from that state.
And right now what I need to do is to look at this sum of my current costs and
future costs that I'm going to incur and just simply take the minimum of these
and take the action that's the best one in that way. So that's the idea behind Bellman's equation, the key equation in dynamic programming. It tells you,
starting from some state, the optimal cost that you can get is the following -
it's the immediate cost plus starting from the next state, the optimal cost
from that next state. Now the next state is random so we throw in an expectation
there over the distribution of the next state, and that distribution is of course affected by the current state and the action, because the next state comes from a distribution depending on the current state and the action. And we throw in that factor of gamma because costs incurred in the future get discounted by certain
factors. So I'm looking at these two quantities - that's the total cost
I'm going to incur if I do A now and do the optimal thing afterwards and I'm
choosing my action in the best possible way. So that's Bellman's equation, if you
look at it. It's basically a system of equations. V star appears on both sides,
you have one equation per state, you have one unknown per state, so it's a system
of equations. In principle, you can solve it and you can write it in shorthand -
oops, I'm missing a star at the end there - V star is some kind of weird, nonlinear
operation applied to V star itself. So it's a fixed-point equation that needs
to be solved. So in the old days, whenever a mathematician looks at a fixed-point
equation, what do they do? They start proving necessary and sufficient
conditions for existence and uniqueness and all that
stuff. So that was done, some people still do it for more and more complicated
versions of the problem, and then one starts looking into algorithms. Again, in
the old days you would have theoretical algorithms that are nice for textbooks.
Some of those algorithms were actually practical, up to a certain point
though, because it's a system of equations. If S takes one out of a
thousand possible values, then you have a system of a thousand equations to solve.
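For concreteness, that system of equations - one equation and one unknown per state - can be written as follows, with p(s'|s,a) denoting the controlled transition probabilities (this anticipates the explicit form discussed a little later):

```latex
V^{*}(s) = \min_{a}\Big[ g(s,a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{*}(s') \Big] \quad \text{for every state } s,
\qquad \text{or in shorthand, } V^{*} = T V^{*}.
```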
That's okay. If S can take a trillion possible values, then things are
a little more complicated and in that case you stop looking at exact
algorithms and maybe start thinking about approximations. All right, so let's look at the computational angle a little more. Here's the equation in detail; instead of writing the expectation there, I'm writing explicitly what's involved. The next state s-prime is distributed according to some distribution that is determined by the current state and the action that you take. So you see that this now looks like an ordinary system of equations. It would
have been a linear system, except that there is this minimization there, and so the right-hand side is actually a nonlinear function of V star, which makes it a little more difficult. But in theory, at least, not too difficult. So, if you
look at finite problems where the number of states is some number N you can
have algorithms that run in polynomial time in N and that solve
these systems of equations exactly. Actually, the funny thing is that really, the main known method to show that this can be done in polynomial time is by reformulating the problem as a linear programming problem and solving it that way, and then there are special cases that are simpler, like shortest path problems, deterministic problems, and so on. Okay, so that's a well understood and old topic: as long as N is manageable, it's polynomial in N. But if N is a trillion, a polynomial of a trillion doesn't sound too good, and here's where the catch is. Think of any system that has D different components - say, a queuing system with D different queues. The state of the system: I have to tell you, for each particular queue, how many jobs it has, and let's say that each queue can have 0, 1, up to K jobs. To describe the state of the system, you need to give me D different numbers, the occupancies of each one of the different queues, and the number of possible configurations or states is actually exponential in the number of components. And any interesting problem that one might want to solve in practice would involve several subsystems, and that situation would show up. Having a polynomial-time algorithm - polynomial in K to the D - is still exponential in D, so we're starting to hit a bottleneck.
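As a quick back-of-the-envelope illustration of that blow-up (the specific numbers are only an example, not from the talk):

```latex
\#\text{states} = (K+1)^{D}, \qquad \text{e.g. } K = 9,\ D = 12 \ \Rightarrow\ 10^{12} \text{ states}.
```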
Now, would there be good news, that sometimes problems decompose and become simpler and easier even though the state space is very large? Can we maybe solve them? Yes, there are some places with good news. The multi-armed bandit problem is the famous one that can be solved in polynomial time even though the state space is exponential. In some cases you can get closed-form solutions; for instance, in control theory we love linear systems with quadratic costs - although the state space is infinite, we can solve them exactly with closed-form solutions, and that's great. These are
the problems that are heavily represented in textbooks, but it's a tiny,
tiny fraction of the problems that one would want to actually solve and once
you move outside these, then all the news is basically negative. There's very little positive that you can say: controlling a system with D queues takes exponential time to solve no matter how clever you are or what algorithm you use, and reinforcement learning or any other method cannot do it - it's an exponential-time problem. And similarly, if you tweak the dynamic programming problem a little bit and have imperfect observations, so that you need to do a little bit of filtering, then instead of being able to solve it in polynomial time, it most likely again requires exponential time. Well, formally the problem is PSPACE-complete, so for all practical purposes it also means that it's hopeless for an exact solution. So at that point,
one could give up. Or, let's say, well we have to do something. It's intractable.
It's high-dimensional. Let's start approximating. What does it mean to
approximate? Well, whatever quantities there are that
we want to manipulate, if they are too complicated then somehow approximate
them and represent them by something that's a little simpler. The universal
idea these days about function approximation goes as follows: you have some complicated function that is hard to describe, but you would like to approximate it by a simpler function. So F is unknown; what you do is you get data
points somehow about F. These could be noisy data points, and then you
interpolate between them using a parametric architecture. For example,
you write down some functional form that you like. Right, so this is a function of
X, it's going to be a function of X, but it has some tuneable parameters. F tilde
is your approximation architecture, you know what F tilde is and you want to set
theta so that this thing as a function of X is close to the function that you
wish to approximate. So you want to find parameters so that this approximation is
best in some sense or is good in some sense. Now, this is the setting of
what people call these days 'supervised learning.' You'd get data points about
some function, and you try to fit those values and do something. Parametric
approximations are not the only ways of approximating, there are other ways like
nearest neighbor methods and so on but I'm going to stick mostly to parametric
methods. So the simplest parametric method is what we call linear regression. That is, you say: I'm going to approximate my complicated function F with a linear function with coefficients theta, and let me find the thetas that best fit the unknown function. Or maybe you have some insight into the problem and you say there are some important features of X, and let me try to fit my function using a linear combination of those features. Here you can throw quite a bit of domain knowledge into the problem, which can be useful, but it's still nice mathematically because the coefficients you are trying to determine show up linearly here. So again, it becomes a linear problem to find the right thetas. These are so-called linear architectures.
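A minimal sketch of fitting such a linear-in-features architecture by least squares; the feature map and the target function here are made-up placeholders for illustration:

```python
import numpy as np

def phi(x):
    """Hand-picked features of x (a toy choice, just for illustration)."""
    return np.array([1.0, x, x ** 2])

# Noisy data points (x_i, y_i) about the unknown function F.
xs = np.linspace(-1.0, 1.0, 50)
ys = np.cos(2.0 * xs) + 0.1 * np.random.randn(50)   # stand-in for samples of F

# Fit theta so that F_tilde(x; theta) = theta . phi(x) matches the data in a least-squares sense.
Phi = np.stack([phi(x) for x in xs])                # 50 x 3 feature matrix
theta, *_ = np.linalg.lstsq(Phi, ys, rcond=None)

F_tilde = lambda x: phi(x) @ theta                  # the fitted approximation
```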
At the other end, we have nonlinear architectures. You can try to fit your function by just throwing in there a general-purpose neural network. The thetas in this case are going to be the tuneable weights inside the neural network. So the neural network takes X and produces F tilde, and you want to set the
parameters so that the input-output behavior of this neural network is
similar to the target function F that you have there. Okay, so that's the idea
of approximation. There are some things that one needs to keep track of - how many parameters do I have in my approximation? In regression, traditionally people have been working with small numbers of parameters. With neural networks they talk about millions of parameters, and depending on what architecture you use and how many parameters you have, you get different capabilities in terms of approximation power. And speaking about approximation power, once you start doing the theory it becomes relevant with respect to what metric you have good approximations, and that also has to do with what metric I use in my supervised learning when I try to fit. And some metrics are good in practice,
some metrics are good to do theoretical analysis, and the two do not always
coincide. All right, so now that we have this approximation idea, we can start
using it, and there's three main approaches - now I'm getting finally to
reinforcement learning - three basic approaches. One is to say: okay, the optimal policy is going to be something complicated, so let me write down a parametric form for the policies and try to tune the parameters theta so that I get a policy that works well. So that's optimization in policy space. The second approach would be to try to approximate the
value function - that V that showed up in the Bellman's equation. V basically tells
me for any state how good or bad is that state. It's a position evaluator in
traditional AI lingo, and there's a variant of this - the so-called Q
functions for those who know more about the field - which tells you if I am at
state S and I commit to doing action A right now, how well off
will I be starting from that point? And we want to choose those thetas so that the approximate value function that we have is not too far from the optimal one from Bellman's equation. What is the value function good for? Value functions are good because they indirectly tell you what to do. If I
have a value function that I believe is close to the true V star, I'm going to be
taking actions guided by that approximation of V star. If V tilde is
the same as the true V star, the optimal value function, this minimization tells
me the optimal thing to do and then the hope is that if V tilde is close to -
there's two hopes, that V tilde is close to V star and that by using the approximate
V tilde, I'm going to come close, I will be approximately optimal. One piece
of terminology, we call that the greedy way of making decisions based on V tilde.
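A sketch of that greedy rule for a small finite model, assuming the stage costs g[s, a], transition probabilities P[s, a, s'] and discount gamma are given as arrays (none of this is code from the talk):

```python
import numpy as np

def greedy_action(s, V_tilde, g, P, gamma):
    """One-step lookahead: immediate cost plus discounted approximate cost-to-go,
    minimized over actions. V_tilde is the approximate value function (one entry per state)."""
    q = g[s, :] + gamma * P[s, :, :] @ V_tilde   # one number per action
    return int(np.argmin(q))
```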
If you are working with Q-values, again the approximate Q-factor tells you what to do, and you can do some fancier things. This is
essentially a one stage look ahead. If I take that action and then the
consequences from there will be V tilde, how should I act? Or you could say I'm
going to have a short horizon and do multi-step look ahead. Let me think about
the optimal sequence of actions I could use, and then the game kind of
ends and at the end of the game I'm going to be at some state. How much do I
like or dislike that state? I'm going to judge that in terms of the V tilde that
I had computed. So you can do one stage look-ahead, multiple stage look-ahead or do something a little more clever than that which is called Monte
Carlo tree search, which is essentially again trying to find a good policy for a
short horizon problem. Except that this tree becomes a little kind of big and
you have intelligent ways of deciding where to explore in that tree, where to
go deeper, where not to go deeper, and actually this trick is, I think, one of the big things that happened in reinforcement learning in the last few years - that people realized that you get huge performance improvements in
practice by doing this, so this is an important idea. All right, so these are
two approaches and finally, why not combine the two? Try to approximate,
parameterize both the policy and the value function and have one help the
other. If I have a policy I can try to learn the value function associated with
that policy and I can then try to use the information, the value function, to
improve the policy. I'm going to come back to those methods in a little more
detail, shortly. So this is a preview, this is the high-level picture of the field,
basically, of the methods that are out there. Okay, now what is the theoretician
going to do in that space? I guess the minimum that one can do - of course, there
cannot be any magic. That is, problems that are exponential time hard, you
cannot solve them faster than exponential time. But you can ask some other things. Assume that I'm lucky and that, for the instance of the problem
that I had, somehow my supervised learning gave me good approximations and
all that, when I work with approximate functions am I going to be approximately
optimal? This is sort of the most basic sanity check for a method: any approximate method, as the approximations become better and better, should give you a sound exact algorithm, and that has been much of the lens with which we, and others, have tried to do much of the theory. So let's take the textbook method. Approximate methods should mimic exact, sound methods, so let's start with the main exact, sound
method and work backwards and think how you can make it an approximate one.
Suppose you have a policy, you can evaluate what that policy does starting
from any given state. That's the value function for that specific policy. If I'm using that policy, and I start at some state, how much cost am I going to get? That's the cost-to-go, the value function, for that specific policy. So policy iteration starts with a policy and does the policy evaluation. There's an
alternative way of doing the policy evaluation which is by solving a system
of equations. My cost starting at state S is my immediate cost, plus the future
cost that I'm going to have. These are two different ways of representing that
same object and you could use either one of those equations to come up with an
algorithm and once you have the value function that tells you what happens
under a policy, you can come up with a new policy which actually provably is
going to be better. What does that - what's the thinking behind this equation? I'm
sitting at state S, I'm looking at the immediate cost if I do a certain action
and I am saying after I take that action I'm going to be at the next state and
starting from that next state I will keep using the original policy pi and so
I'm going to incur the cost of that original policy. So instead of using pi
all the time, I am now contemplating doing A first and then reverting to pi.
Is there something better I can do in the first stage? This expression here
tells me what's the best thing I can do in the first stage and this actually
provably gives an improvement. The main theorem of policy iteration is that policies monotonically improve - they become better and better - and, at least for finite problems, you are eventually going to end up with an optimal policy. So that's a sound, exact algorithm.
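A sketch of that exact version for a small tabular problem, just to fix the two steps - policy evaluation by solving the linear system, then the greedy improvement (standard textbook pseudocode, not anything specific to the talk):

```python
import numpy as np

def policy_iteration(g, P, gamma, max_iters=100):
    """g[s, a]: stage costs; P[s, a, s']: transition probabilities."""
    n_states, n_actions = g.shape
    policy = np.zeros(n_states, dtype=int)            # start with an arbitrary policy
    for _ in range(max_iters):
        # Policy evaluation: solve V = g_pi + gamma * P_pi V for the current policy.
        g_pi = g[np.arange(n_states), policy]
        P_pi = P[np.arange(n_states), policy, :]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, g_pi)
        # Policy improvement: be greedy with respect to V.
        Q = g + gamma * np.einsum("sat,t->sa", P, V)
        new_policy = Q.argmin(axis=1)
        if np.array_equal(new_policy, policy):         # no change: policy is optimal
            break
        policy = new_policy
    return policy, V
```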
What's an approximate version of this? Well, just try to do all of the same steps, but using approximations instead. All right, how exactly does it go? The first step is: given a
policy, described somehow, come up with a value function for that. These are two
possible equations for the value function, let's just work with the first
one. The value function of the policy is the expected cost if you keep using that
policy. How do you evaluate that? How do you compute it? Very easy, just simulate.
It's an expectation. Start at a certain state, run your policy, see how much cost
you get. This is a data point, a noisy data point, for V of S. Do it for lots of initial S's, collect those data points, and then fit - run linear regression or a neural network or whatever to fit a function - and now you have a representation of your value function under this particular policy.
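A sketch of that simulate-and-fit step; the environment interface (env.reset, env.step), the feature map, and the truncation horizon are all assumptions made for illustration:

```python
import numpy as np

def evaluate_policy_approximately(env, policy, phi, gamma, n_rollouts=1000, horizon=200):
    """Monte Carlo policy evaluation followed by a least-squares fit.
    Assumes env.reset() returns a sampled initial state and
    env.step(s, a) returns (next_state, cost)."""
    features, returns = [], []
    for _ in range(n_rollouts):
        s0 = env.reset()
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):                       # truncate the infinite sum
            a = policy(s)
            s, cost = env.step(s, a)
            total += discount * cost
            discount *= gamma
        features.append(phi(s0))                       # noisy data point for V_pi(s0)
        returns.append(total)
    theta, *_ = np.linalg.lstsq(np.array(features), np.array(returns), rcond=None)
    return lambda state: phi(state) @ theta            # the fitted V_tilde
```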
Okay, let's assume that the gods are with us and that we get an approximation which is a good representative of the actual value function of that policy. If our
approximation is not good, of course, there's nothing you could ever hope to
guarantee, but if we're lucky and that happens, then what? I'm going to use
it to make a decision. I'm going to follow the rule that I was describing
before - if you have a value function, it's a guide on how to make decisions, and let
me give myself some slack. I may be off by some epsilons, it might not be the
exactly best arg min, it might be close to the best arg min, but let's suppose
again here that I managed to do that. So if approximation errors are small,
then what? Well, there's a basic sanity check for this algorithm. Interestingly, this is a result that we never published outside of our book with Dimitri, but it's actually one of the hardest proofs that we have in that book, and it is a non-obvious result. It says that if you do everything approximately, within epsilon, then in the end you're going to get policies that are within order of epsilon from the optimal.
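Stated as an inequality - and hedged, because I am quoting the bound from memory of the Neuro-Dynamic Programming book rather than from the slide - the result is roughly of the form

```latex
\limsup_{k \to \infty} \big\| V^{\pi_k} - V^{*} \big\|_{\infty} \;\le\; \frac{\delta + 2\gamma\,\varepsilon}{(1-\gamma)^{2}},
```

where epsilon bounds the policy-evaluation error and delta the slack allowed in the greedy minimization.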
Okay, in some ways it sounds trivial: if you change things by epsilon, in the end everything changes by epsilon. However, it's non-trivial, because this thing is very discontinuous. If V tilde changes just a little bit, the A that wins in this minimization could be a different one, and that discontinuity causes a huge jump in the next V tilde that you're going to have; it's going to be very different.
So it's a very discontinuous algorithm. Continuity arguments do not work but
nevertheless the theorem is true. Improvement is not monotonic but the
result says that the algorithm eventually will settle in some
performance which is within order of epsilon from the optimal. So this is a
sort of very basic sanity check. It's a very good thing to have. It doesn't mean
that you should implement this algorithm, though. It's really a meta algorithm. I
didn't say anything specific about how you fit the V tildes, and the
assumption that you're making that approximations are good - I mean it's an
assumption, if it's true it's okay but there's no way to a priori verify an
assumption of this kind. And then, if we're talking about details, is the right metric to use the classical one, with respect to the infinity norm - that you want very good approximations everywhere and that gives you very good performance everywhere? On the other hand, probably what one cares about is average
performance and maybe the approximation errors that one can guarantee would be
in some average sense. It's very hard to approximate uniformly and generalize uniformly well with a neural network, for example, and if you're
talking about averages then average over which measure, over which distribution? So
there are technically interesting questions that come up here, and yes, it would be interesting to explore variations of these with
different types of metrics and so on, and another issue that comes in is that at
least in the literature, half of the papers prove results about being close
to optimality for every initial state, and half of the papers give you versions
that are 'let's fix an initial state and optimize over policies starting from
that initial state.' They kind of feel similar but technically, they diverge
quite substantially and different versions have different levels of
difficulty, and again the main issue, both for the theoretical analysis and also for
why one does not use this algorithm in practice, is that at least when we
have discrete actions, little changes in V can cause big changes in the
policy that you would be using, and that leads to oscillatory behavior that
kind of doesn't feel good. So one would like to tame those oscillations, and the
way to tame oscillations in general is, instead of making wholesale changes, try
to make small changes. One place where you can make small changes is when
you're calculating the V's, the approximations to the value function.
Instead of - make a big change to your policy and then see what's your new V
which is going to be very different, start changing V's little by little.
One way of doing that is by looking at the Bellman equation which is this
equation that we want to solve. It's a fixed-point equation - again with a star missing on the right-hand side there - V star equals T V star, and the way one solves
fixed-point equations little by little is by iterating on this relation. There's a
similar equation you might want to solve if you have a fixed policy, it has the
same structure. Here, this turns out to be a linear operator, here it's non-linear.
Let's say we want to solve one of those two problems. The natural way of solving
fixed point problems is an iteration of this type - that is, fix the right-hand side, calculate the left-hand side, then put the left-hand side back to the right-hand side and keep going. All right, but we don't have
TV in our hands. We can get approximations of TV here and there by
trying a few states here and there and maybe approximating expectations using
simulation, so we also have noise, and that means that instead of making a wholesale transition, we want to avoid those oscillations - we want to take steps a little bit at a time. So alpha_t there is a small step size: I'm getting an estimate of TV at some states and I move my value function a little bit in that new direction. All
right, but how do we actually represent V? V is represented through some parametric
architecture, could be a neural network, so I cannot change V. I can only change
the parameters of my neural network. So, therefore, what I should do instead is to tweak the thetas so that the V's kind of move in the direction of TV, and that effectively corresponds to something like this, at least for linear architectures: there's a certain space of value functions - there should be a tilde here - value functions that are representable in the approximation architecture that I have. Since I'm working within those
value functions that are representable, any value function I produce should be
in that space. So essentially, I end up doing a projection within
that space. So this is a little bit abstract, but the essence of what's
happening is that you sort of try to do this update with small steps, but all the
time you also need to - effectively, you are projecting, so that you stay within
the space of representable functions. That's not how the algorithms actually run, but this is the mathematical structure behind the algorithms that keep updating V incrementally and try to find an
approximate solution to Bellman's equation. If all goes well, a method of
this kind is going to converge to a fixed point of this iteration. So if V tilde has converged, you cancel various stuff and you're left with Pi T V equals V; that's the equation that you end up solving, and if everything goes well, the fixed point of this is close to a fixed point of the initial equation, which is the exact equation that we wanted to solve.
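For a fixed policy and a linear architecture V_tilde(s) = theta . phi(s), the small-step, simulation-based version of this projected iteration is essentially a temporal-difference update. A hedged sketch, with the environment interface and features again being assumptions for illustration:

```python
import numpy as np

def td0_linear(env, policy, phi, gamma, alpha=0.01, n_steps=100_000):
    """Stochastic small-step iteration toward the fixed point of the projected
    Bellman equation, for a fixed policy and linear features."""
    s = env.reset()
    theta = np.zeros(len(phi(s)))
    for _ in range(n_steps):
        a = policy(s)
        s_next, cost = env.step(s, a)                  # assumed interface
        # Temporal difference: sampled estimate of (T V)(s) minus the current V(s).
        td_error = cost + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * td_error * phi(s)             # move V a little in that direction
        s = s_next
    return theta
```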
So, a short mathematical digression, but this was an exciting topic; it was the PhD thesis of Ben Van Roy more than twenty years ago. Essentially he looked at this type of equation - when do we get convergence and when not? Okay, allow me two minutes of math. The way that one proves that fixed-point methods work is that you have a norm and
you argue that this norm keeps going down. The operator T of dynamic
programming is always a contraction with respect to the maximum norm. How about
the projections? The projections we're talking about are Euclidean projections, things like that; they tend to be contractions with respect to quadratic norms. So
when you put the two together, you don't know which norm to work with, and you're
kind of stuck. Unless you're lucky and something extra happens and you get
additional contraction properties, and there are two cases. If you have a fixed policy, or you're looking at optimal stopping problems, it turns out that T will also be a contraction with respect to a weighted Euclidean norm, and the same thing will be true for the projection, with the same norm, and one gets a valid, solid
algorithm for that setting and actually one of the algorithms that is used in
the financial industry to solve high- dimensional options pricing problems,
which boil down to optimal stopping problems, is actually based on this type of analysis. The other case is if one looks at parameterizations where you approximate the value function with piecewise constant value functions; the projection of a general function onto piecewise constants turns out to be a
maximum norm contraction and that fits nicely with the maximum norm contraction
of T, and so one gets a sound algorithm once more. That kind of extends also to some non-parametric settings: if one talks about nearest-neighbor types of approximations, you get again an L-infinity contraction, and good things can happen in that context. So this is the good news, but in some ways it's
limited. It's for piecewise-constant approximations or, for the case of a
fixed policy and optimal stopping. The sad thing is that that's the end. If you
throw in nonlinear approximation architectures or if you look at linear
architectures without fixing the policy or if you look at q-learning with
function approximation and all that, none of these methods has convergence
guarantees and these are actually the methods that people do use in
practice, so it's a big gap. Sometimes they work, but they're not - they don't have any
guarantee of working, and there are even counterexamples where you put in an approximation architecture which is perfect. With the right choice of
parameters you could fit the true value function but somehow the iterations that
the algorithms do go away from the right theta star and go in the wrong place. So
that's a little bit of a sad situation for a theoretician, and one place where we kind of felt a little stuck 20 years ago. All right, so that was the idea of
incrementally changing V and if you go back, my original motivation was I want
to do things incrementally to avoid big oscillations. Another way of avoiding big
oscillations is to look at policies and have the policies change little by
little. Okay, now if my actions are discrete, how can I avoid oscillations? If
from action one I choose now to do action two it's a big jump, it's not a
small jump. So here's the trick that's being used here: talk about probabilistic decision rules. So I'm going to be looking into randomized policies, so my policy at state S is that I'm going to choose one of the three actions according to these probabilities, and now I'm going
to start moving those probabilities little by little. In the end when I get
to something optimal, all the probabilities should be concentrated on
the optimal action, but while I'm training and learning I'm going to be
using a mix. So now that we have made pi of S continuous, we're free to start throwing
in neural networks in there, and the typical neural network picture these
days would be: you feed S inside and you have three outputs going out there that
give you the three - the probabilities of the different actions, and you want to
tune those parameters so as to get good policies. And now, how do I tune them? Here's one idea: if I have a V in my hands, a value function that's kind
of reasonable, that's coming from some good policy and I'm sitting at some
state S, I can do a look ahead and ask - what's the best thing I should do in the
short run, and after that I'm going to revert to a policy that I know has
performance V tilde. I look at that problem, I find the decision that's good
for that small problem, and I take that as my guide and I update my policy by
moving a little bit in the direction of the one that was suggested by my look
ahead. That's what I would do if I had direct control of the pi's, but I don't have direct control of the pi's; I can only play with the parameters, so what do I
do? I change the parameters of my policy by moving in a sort of gradient
direction. The gradient tells me if I change theta upwards is pi tilde going
to go upwards or downwards, so it tells me which direction to change theta so
that pi tilde kind of matches pi hat. So I'm changing the parameters - basically, maybe don't look at the formula too much - it's changing the parameters in a way so that pi tilde gets closer to the new action that was suggested.
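A hedged sketch of that update step for a softmax policy with linear preferences: nudge theta so that the probability of the action a_hat suggested by the lookahead goes up. This is a generic step written for illustration; the actual losses used in AlphaGo and AlphaGo Zero have additional ingredients, such as matching the full search visit distribution and a value-function term:

```python
import numpy as np

def actor_update(theta, s, a_hat, phi_sa, n_actions, step_size=0.01):
    """Move theta a little in the direction of grad log pi_theta(a_hat | s),
    so that pi_theta(a_hat | s) increases."""
    prefs = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
    prefs -= prefs.max()
    p = np.exp(prefs)
    p /= p.sum()
    grad_log_pi = phi_sa(s, a_hat) - sum(p[b] * phi_sa(s, b) for b in range(n_actions))
    return theta + step_size * grad_log_pi
```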
All right, so this is not my method, but essentially it's the core of the method that was used in AlphaGo and in AlphaGo Zero. We can think of it - to use a word that goes back a while - as an actor-critic method, in the sense that
it involves both a critic v tilde that tells you how good different positions
are, and as the policy changes both are adjusted and there's very interesting
questions here about convergence properties of such a method that adjusts both the V's and the pi's. There's some work by Devavrat and co-authors that shows that this does something reasonable, at least under some
assumptions, and when one uses nearest neighbors instead of parametric
approximations. For the parametric approximations of value functions I guess there's probably some - there's room for some work to be done there. Okay, now
again I was mostly focused on value functions, let's start talking now about
just policy space optimization. Forget value functions - somebody comes up with policies with some tuneable parameters. Those who come from systems and control can easily think of PID controllers or things like that. You have the structure
of your controller, there's some parameters to tune, you want to play with
them but instead of three parameters you might have a million parameters. How do
you tune them? You need the gradient of your cost with respect to the parameters
of the policy. How do you compute gradients? There are messy formulas that
tell you how the gradient can be expressed as an expectation of some
function of the trajectory. So you can simulate trajectories and calculate
gradients and update thetas with stochastic gradient descent,
and that was proposed early on for a fixed initial state and so on, and
there's an algorithm for how to calculate those gradients efficiently.
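One standard form of those 'messy formulas' is the score-function (likelihood-ratio) gradient. A hedged sketch of estimating it from a simulated trajectory, for a softmax policy over state-action features; the exact expressions in the works mentioned differ in details such as baselines and the average-cost setting:

```python
import numpy as np

def softmax_probs(theta, phi_sa, s, n_actions):
    """pi_theta(a|s) proportional to exp(theta . phi_sa(s, a))."""
    prefs = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def policy_gradient_estimate(env, theta, phi_sa, gamma, n_actions, horizon=100):
    """Single-trajectory REINFORCE-style estimate of the gradient of the
    expected discounted cost from a fixed initial state."""
    s = env.reset()
    total_cost, discount = 0.0, 1.0
    score_sum = np.zeros_like(theta)
    for _ in range(horizon):
        p = softmax_probs(theta, phi_sa, s, n_actions)
        a = np.random.choice(n_actions, p=p)
        # grad of log pi_theta(a|s) for the softmax parameterization.
        score_sum += phi_sa(s, a) - sum(p[b] * phi_sa(s, b) for b in range(n_actions))
        s, cost = env.step(s, a)                       # assumed interface
        total_cost += discount * cost
        discount *= gamma
    return total_cost * score_sum                      # noisy (high-variance) gradient estimate
```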
With my student, we looked at the case of average-cost problems over the steady-state distribution, which is a little harder because the steady-state
distribution over which you take average is the steady-state distribution under
the optimal policy, but you don't know what the optimal policy is so there are
some loops to be closed there. So that basically works in theory, it has all the
properties of stochastic gradient descent. That is, under some assumptions
you'll get limit points that are stationary points, and maybe these are good local optima or whatever, but it's a very slow algorithm. Typically, one has to use a very small step size, and with high-variance gradient estimates it converges slowly. You
can do maybe a little better by using some variation of the gradient - so-called
'natural gradients,' but things are kind of stuck. But at this point, a thought comes to mind. These days, everybody is talking about deep neural networks, about how over-parameterization can be a great thing and somehow you will not have bad local minima and all that. Are there statements of this kind that can be made, at least for some control problems and some particular parameterizations, such that you get a nice landscape as a function of theta, so that gradient
descent in policy space does something reasonable? I'm not aware of any single
result of that kind. Actually, I'm not quite aware of how one would even start.
What kind of assumptions or structure to put in place, but something like that
should happen. Something good should happen if these methods work. All right,
as I'm running out of time there's another class of actor-critic algorithms
that tries to help the stochastic gradient descent by using an auxiliary
value function that reduces, supposedly, the variance in the gradient updates.
There's some intuition there that they should reduce the variance, but when you
try to do the theoretical analysis, things are complicated. It was a very bright PhD student, Vijay Konda, who looked at it and has a fascinating chapter at the end of his thesis, which no one has followed up on since then. But there are some very good open problems to attack in there. All right, so now,
reinforcement learning is supposed to be about this thing that goes out into the world, explores, and as it lives a useful life it does better and better things, and that means online. But much of the work is actually done offline. We
have a model, a simulator, a physical system that you can tweak and play with
or historical data, whereas online you only have the physical system in
principle that you work with. For all these methods, I didn't use the words online or offline, so you might be wondering what happens where. Something that one can do is: you train a policy using whatever method you want, burning as many forests as it takes by running thousands of servers to learn a good policy, and then you go and implement it, and that's basically what AlphaGo did. So there's
not much online, it's an offline heavy computational method that solves,
approximately - or tries to solve approximately - a dynamic programming problem. More
interesting things - okay, let me give a little more credit to AlphaGo. You work offline, you learn the policy, but you also learn a value function, and when you are online you are a little more clever than that. So online,
you don't just implement the policy that you learned, but wherever you happen to
be, you do some clever look-ahead to do better, and that's actually where
AlphaGo falls. But the - sort of a big promise of really intelligent systems
that live in the world and learn would be something like this, where there's some starting point - possibly an initial analysis or some very crude controller - and then you go out into the world. This is the real thing: you're supposed to train, learn, and do something wonderful. Well, as I have run out of time, I will just sort of give my bottom line here: I think that this is really an almost impossible, unrealistic, but also maybe not that interesting a goal to have. When one looks at the successes of these reinforcement
learning methods, all the practical ones, they do always involve very heavy
offline computation. If you go online and you can only collect so many data
samples, they will not be enough to fit your huge neural network. If you're
trying to fit a neural network with a million parameters to do something
useful, you're going to need a few million data points. Okay, so maybe in
some context you can think of having millions of data points accumulating in
real time, but in most applications this is not going to be the case. So there's a lot that needs to happen offline, and offline you work either with
a model or a simulator in most cases. Is AlphaGo a model-free method or not? I
guess it's not. AlphaGo uses a simulator of the Go game and the simulator is a
perfect exact model of the game of Go. So a model doesn't necessarily mean
equations, as long as you have a module that reproduces exactly what the system
does, this is a model-based method, and there is some discussion in this
literature. Sometimes one reads papers where the authors proudly say 'we have a
model-free method.' Usually it's not quite model-free, what it often means is
that we try to do as much as we can without touching the model, but the model
is really there. If it's completely model-free it's unlikely to be useful, or to
work, unless you are in domains that are kind of circumscribed. In multi-armed
bandit problems, for example, you can think of certain algorithms - you might
call them kind of model-free - that is, I just try different things - but in essence there is still an underlying model behind them. That is, the model that says I'm dealing with a coin that has a fixed, constant probability of success, and all that. So models are there, they're unavoidable, and one should do
the best one can do with them. So, as I'm running out of time I'll just
keep... Okay, I'll just keep the last slide in
there while taking questions, but one point I would like to emphasize is
that, in the same way that practitioners have benchmark problems, it's a very good idea for theoreticians to also have benchmark problems and see how different methods compare on problems that we know how to solve well. Forget that we know the solution - suppose I'm using a reinforcement learning method: what is it going to do? For example, Ben Recht has a very interesting program of that kind where he takes the
linear quadratic problem as his benchmark and asks 'how do the different
learning methods work there?' even if we don't really need them, just for the comparison. All right, and I'm hoping that all these successes that we see in practice will bring a resurgence of theory and many new interesting results coming up in the next few years from the new generation of people who are getting into the field. Thank you. *audience applause* [Audience Question 1]: John, just... so I thought that, okay, if I'm trying to learn to play chess against you, and the interesting
thing is that I don't know the transition probability kernel, right -
and somehow this should add an extra twist to the story. [John Tsitsiklis]: Okay, so games - now games are a little more complicated and all that. You don't have just - and the Bellman
equation is different and all this. Now, are you trying to find the equilibrium
policies or how to play against me? If you're playing against me, then that will
be online training - you play against me. If it's literally me, we'll play maybe
a few hundred games, but that's it. You cannot really tune everything about your policy; maybe you have a few parameters. Now, if you're playing against the computer, so that you can have a billion games played, then that could be done through an online learning thing, yes. [Audience Question 2]: When you say 'waiting for a theory to catch up with practice,' do you have any, like, examples at the top of your head where
theory has a chance to impact practice? Where some frontier that,
you know, through practically doing reinforcement learning we couldn't break,
but through some theoretical model or analysis we were able to overcome some
challenge. Do you have any sort of anecdotes or things that you have at the
top of your mind? [John Tsitsiklis]: Well, from the past I could mention this example of approximate solutions to high-dimensional optimal stopping problems, where the theory kind of informed the algorithms and ended up with algorithms that were actually practical. Now, often
in this field it's not that theory tells you exactly what to do, but it gives you
the sanity checks and kind of guides your thinking. So, in many cases the
impact of theory is indirect but still valuable. [Audience Question 3]: So, you mentioned the last chapter of Konda's thesis. Given that we have a few minutes, maybe you can elaborate what are
the - what sorts of questions are there? [John Tsitsiklis]: Okay, so there is a formula for the gradient of the cost as a function of the policy parameters. That
formula is a double expectation, and the inside expectation is something that can be expressed in terms of approximate value functions. And so
you say, if instead of simulating that inner expectation, if I use a trained
value function I should be reducing the variance and do better. Now, and there's
an extra aspect there that the value function needs only to be trained with
respect to some natural features that are determined by the policy space
architecture. There are some canonical features and these are the only ones
that you need. Extra features do not come into the gradient calculation.
Now, Konda has an example where he shows that using these learnt value functions does not improve the convergence rate, but then he says if we throw in some
extraneous features that are not necessary in the gradient formula,
somehow you get better performance in practice, and it's - I use the word mystery because it's a kind of fuzzy thing that
you look at it and it kind of raises questions in your mind. Not necessarily
perfectly formulated, but I think there is something there. [Host]: Is there a last question for John? [Audience Question 4]: Thank you for the
presentation. Apologies if I didn't understand this correctly, but you made a
point at the very end where you said reinforcement learning - something to the effect that you didn't think it would eventually get there. Now, could you explain why? Is it because you couldn't reach that optimal state of reinforcement learning? Is it because of limitations, such as people not having ready access to supercomputers, or is it a mathematical practicality? [John Tsitsiklis]: Now, with supercomputers... so the question is whether we can do reinforcement learning online, right? With offline computation we can do lots of things. The
question is whether you can start - your car starts driving and it learns how to
drive that day. [Audience Question 4] *inaudible, off mic* ... the older movie in the 80's with WarGames, with Hal where it learned to the point of playing certain games or certain
understanding that it was suboptimal to actually play the game. That's maybe very
simplified, it may be a tangential point, but could you explain the challenge or the blocks for reinforcement learning to get to that point that you tried to make - really, my question is, what is that challenge? Is it because of access to computing power, or is it because of a
practicality or math complexity? [John Tsitsiklis]: It's really neither; it's the amount of data. If your policy is complicated enough - a good policy that needs a million parameters - then to tune a million parameters you need a few billion data points, and in many applications you cannot do that on the fly. Leaving aside the issue that
if you're talking about an airplane that flies, I don't want to fly an airplane
that's trying to learn a good flying policy. So there will be some subset
where real-time learning is going to happen, but it's not going to be the real-time, model-free learning of a tabula rasa. It's going to be based probably on - you already have a
model and some good policy and stuff that you did offline, and then online, yes
you can refine, get a little better performance and all that but there's going
to be a limit, and I think that sample complexity - the amount of data that you
need is going to be the main bottleneck. *audience applause*