Okay, we are about to get started.
My name is Dimitri Bertsekas, a professor here at MIT, and I'd like to introduce
the next speaker - plenary speaker John Tsitsiklis. I'm sure John needs no
introduction to most of you but let me say a few things. He grew up in Greece
and came to MIT as an undergraduate in the mid 70's, and then he stayed on at MIT
up until now with the exception of one brief year at Stanford in the mid 80's. We, all of us, know John for his research and for his books, but he also has a very substantial service record. He has served -
he's now serving - as the director of LIDS, the Laboratory for Information and Decision Systems. Before that, he was a co-director of the Operations Research Center and
other positions within MIT but he has also been in very responsible service positions in his native Greece, including being the chair
of the Council of the Harokopio University. Turning to his research, he is
well known for that and has received quite a few awards, and the most recent
one is the IEEE Control Systems Award just last December, and just before
that, I had the great pleasure of sharing with him the John von Neumann award of
Informs for two books that we wrote together. One in 1989 on parallel and
distributed computation and the other one in 1996 on neuro-dynamic programming, and the interesting thing about these two books was that they both were the first
books on subjects that were very marginal at the time. Very few people
were looking at them and they were considered highly speculative. The other common thing about these two books is that 20 years later, the subject
became very popular, very hot. Parallel and distributed algorithms including asynchronous algorithms are very popular nowadays in
machine learning, and the second book Neuro-dynamic Programming is better
known these days as 'reinforcement learning' and reinforcement learning is
actually the hottest field in artificial intelligence these days, and John will
talk to us about reinforcement learning. [John Tsitsiklis]: All right, good afternoon. Thank you very much Dimitri for the introduction. Actually, it's a great pleasure to have this introduction coming from Dimitri, with whom I had so many great years working together. I don't know if I should thank Devavrat (Shah) or not for the
invitation, because he's having me speak on a topic that I was kind of qualified to talk about 20 years ago and rather unqualified today. The good thing
is that, because I'm coming from the theory side, fortunately or unfortunately, I'm not sure, there haven't been too many big theoretical breakthroughs since then. *audience laughter* Although the field has moved in great leaps, I mean we hear about the great success
of reinforcement learning in practice but we have a little bit of a theory-
practice gap here, in that the theory is lagging behind the practice, which is not
the usual way that things happen these days. In any case, what I will try to do
is - probably missing most of the people in the audience - first, to sort of run through the core ideas behind reinforcement learning. For some, you can
think of this as a tutorial that maybe goes too fast. I'm not going to go into
any great technical detail but we'll try to give a sense of the main key
conceptual ideas. All right, so okay, what is reinforcement learning about?
It's about making decisions in a situation that has some dynamics - things
evolve in time - and there is some uncertainty as well, and decisions have
to be made sequentially. How does one deal with such a context? That's
called stochastic control, that's the field of stochastic control, and the key
method for dealing with such problems is dynamic programming that goes back to Bellman. Now, for me, the subject starts when I was kind of starting graduate
school, and so I read Dimitri's book on dynamic programming from cover to cover,
and then I had the pleasure - that was actually the
first class that I taught. So dynamic programming gives you the methodology
for doing these things, and then enter reinforcement learning. There's a
sequence, and we'll not try to give a fair history there, but starting with the
checkers-playing program by Samuel, some ideas evolve. The modern version of
reinforcement learning pretty much starts with the activities of Barto and
Sutton, around the mid 90's or so. They wrote this book in 1998 called Reinforcement Learning. Now, from around that time when the activity was happening, we had the luck that some people like Barto and Michael I. Jordan pointed it out to Dimitri and myself and kind of said 'look, there's that thing
happening down there, kind of feels that it has something to do with the stuff
that you guys are doing.' So we looked at it, and the way we looked at it, it was basically a set of approximate methods for solving the classical dynamic
programming problem. Now when one says approximate, you first start thinking
about approximations in the traditional - let's say differential equation setting -
where approximation means discretizing, but these were a different kind of
approximation, they were approximations based on neural networks and that's why
we decided to give this title to our book. It's dynamic programming that
involves neural networks. Now, of course on the substance, it's really one and the
same topic. Now leap forward - that's 20 or 30 years later - and we have this big
success, I guess everybody must have heard about it, but with DeepMind with
AlphaGo and AlphaGo Zero they beat pretty much anyone in go, in chess
whatever - it's a really spectacular success. So it almost feels as if the
field has reached the point that it wanted to reach. Well, all of that, as Dimitri said, these days is called reinforcement learning.
What exactly is reinforcement learning? You look at Wikipedia: it's about systems that are supposed to learn their behavior based on feedback from the environment, and the idea is like a little child that you send out into the world and it learns how to play, or animals that kind of learn where to look for food, and so on. Is that exactly what happened with AlphaGo and
AlphaGo Zero? Was it interacting with any environment? So did that promise, did
that vision of RL - has it already materialized or maybe not quite. So I
guess the answer is going to - my answer is going to emerge as I move on, and the
other thing that's a bit concerning is that there is this huge thing that
happened, but the methods that are used do not really rely on any big
theoretical breakthroughs of the last three years. It's really huge
computational power together with people learning, figuring out how to use deep
neural networks, plus a few extra very clever subroutines or tricks that come in. So I'll walk you through all of this and try to make sense of the
field. So I'm going to talk a little bit about what dynamic programming is, set
the terminology and all that. The key ideas are approximation and
parametrization and there's different things that one can approximate or
parametrize and that gives different types of methods. So I'll walk through
those types of methods as well. I want to spend a little bit of time on something
that's really quite important: there, we are talking about offline algorithms versus online algorithms. So when you say systems that learn from their environment, you have that image of something working in the real world. Is that happening, or what does it take for it to happen? All right, so, dynamic programming. Okay, it's the first time in my life that I write
the state s_t instead of x_t, so I feel like I'm betraying my intellectual heritage *audience laughter* and I guess sometimes you have to do that. So there's a system, it has a state, and it's being controlled; what the controller does is it looks at the state and makes a decision. The decision could be just a function of the state, so it would be phi of s_t, but more broadly, it could choose the decisions in a randomized way, with a distribution that depends on the particular state, and when that's done there's a cost that's incurred that
has to do with the state of the system and the action. Now, if you're playing a
game, the costs are zero and only at the end you get a cost of one or minus one depending on whether you won or not. Okay, here's a compromise that I didn't make: I kept this as cost as opposed to reward. Somehow engineers have always thought about minimizing the costs of how you do things. In the brave new world of AI, people only talk about maximizing rewards. Maybe there's - I don't know what that means, but I'm sticking to costs. So we want to drive the cost down. Now the system
is a dynamical system, it has some dynamics and the dynamics are that the
next state is determined from the previous state somehow. It could be
through a dynamical equation that gives you the next state, or if you come from
an Operations Research type of tradition, you might describe it differently in terms of transition probabilities. So the state evolves like a Markov chain,
except that the Markov chain is controlled also by the decisions. So it's
a Markov chain whose transitions are affected by the decisions or the actions
that you're taking. All right, so what we're interested in is to minimize the cost over some horizon, starting from some given initial state, and there's a discount factor. It's a factor less than 1 that basically means that future costs matter less than current costs; that's a reasonable assumption in economics, for example. And what we want to do is to minimize this over all control policies - that's our objective - and the value of that minimum is the value function. We denote that by V-star; remember that symbol. It's the optimal cost, it's the
cost that you are going to incur if you act optimally over the given time
horizon, and it's a minimization over all policies and that makes the problem difficult. You're minimizing over functions, you're not minimizing
over decisions, you're minimizing over policies. What is capital T? It could be a
given finite horizon, it could be a random time at which the world ends or
the most convenient formulation - I wish I knew how to go back -
Okay, the most convenient formulation, I'm going to stick to that for simplicity, is
when the time horizon is infinite and the discount factor is less than one
which helps to keep the overall cost finite. Quite often, that discount factor
there is a surrogate. The problem that one really wants to solve in many
situations, is that you want a system that keeps running forever and you're
interested in the average performance of that cost. Mathematically, that's a more
difficult problem but it is approximated well by taking a discounted problem
with a discount factor that approaches one, so we'll stick with a discounted
formulation. Interesting point from dynamic programming theory is that there
exists an optimal policy under some basic assumptions. One policy that's good
no matter what your starting state is. Not the same action - the action will
change depending on the state, but the policy? You just need to find one policy which is simultaneously optimal for all possible starting states. Now dynamic programming, this framework basically is everywhere. No matter
what your field is, I'm sure you can come up with examples of problems and
applications where you have big complicated problems of sequential
decision making under uncertainty and where dynamic programming could be
applied, and that goes from toy situations like games all the way to
serious applications like energy or airplanes or cars or factories and so on.
All right, so, therefore, we take it for granted that it's an important problem.
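To pin down the discounted formulation in symbols (writing the per-stage cost as g(s_t, a_t), which is an assumed notation rather than a quote from the slides):

```latex
V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, g(s_t, a_t) \;\Big|\; s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\Big],
\qquad
V^{*}(s) = \min_{\pi} V^{\pi}(s), \qquad 0 < \gamma < 1.
```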
How do we solve it? Here's the way of thinking. If I'm sitting somewhere at
some state and I have to make a decision, what do I need to take into account? I need to take into account the cost that I'm going to
incur right now as I'm making my decision, but then I will move to some
next state and from the next state I'm going to incur some further cost. Let's
suppose that tomorrow after I get to the next state, I will be infinitely wise and I'm going to do the optimal thing starting from that state.
And right now what I need to do is to look at this sum of my current costs and
future costs that I'm going to incur and just simply take the minimum of these
and take the action that's the best one in that way. So that's the idea behind Bellman's equation, the key equation in dynamic programming. It tells you,
starting from some state, the optimal cost that you can get is the following -
it's the immediate cost plus starting from the next state, the optimal cost
from that next state. Now the next state is random so we throw in an expectation
there over the distribution of the next state, and that distribution is of course affected by the current state and the action, because the next state comes from a distribution depending on the current state and the action. And we throw in that factor of gamma because costs incurred in the future get discounted by certain
factors. So I'm looking at these two quantities - that's the total cost
I'm going to incur if I do A now and do the optimal thing afterwards and I'm
choosing my action in the best possible way. So that's Bellman's equation, if you
look at it. It's basically a system of equations. V star appears on both sides,
you have one equation per state, you have one unknown per state, so it's a system
of equations. In principle, you can solve it and you can write it in shorthand -
oops, I'm missing a star at the end there - V star is some kind of weird, nonlinear
operation applied to V star itself. So it's a fixed-point equation that needs
to be solved. So in the old days, whenever a mathematician looks at a fixed-point
equation, what do they do? They start proving necessary and sufficient
conditions for existence and uniqueness and all that
stuff. So that was done, some people still do it for more and more complicated
versions of the problem, and then one starts looking into algorithms. Again, in
the old days you would have theoretical algorithms that are nice for textbooks.
Some of those algorithms were actually practical, up to a certain point
though, because it's a system of equations. If S takes one out of a
thousand possible values, then you have a system of a thousand equations to solve.
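For concreteness, that system of equations - one equation and one unknown per state - can be written as follows, with p(s'|s,a) denoting the controlled transition probabilities (this anticipates the explicit form discussed a little later):

```latex
V^{*}(s) = \min_{a}\Big[ g(s,a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{*}(s') \Big] \quad \text{for every state } s,
\qquad \text{or in shorthand, } V^{*} = T V^{*}.
```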
That's okay. If S can take a trillion possible values, then things are
a little more complicated and in that case you stop looking at exact
algorithms and maybe start thinking about approximations. All right, so let's look at the computational angle a little more. Here's the equation in detail; instead of writing the expectation there, I'm writing explicitly what's involved. The next state s-prime is distributed according to some distribution that is determined by the current state and the action that you take. So you see that this now looks like an ordinary system of equations. It would
have been a linear system, except that there is this minimization there, and so the right-hand side is actually a nonlinear function of V star, which makes it a little more difficult. But in theory, at least, not too difficult. So, if you
look at finite problems where the number of states is some number N you can
have algorithms that run in polynomial time in N and that solve
these systems of equations exactly. Actually, the funny thing is that really, the main known method to show that this can be done in polynomial time is by reformulating the problem as a linear programming problem and solving it that way, and then there are special cases that are simpler, like shortest path problems, deterministic problems, and so on. Okay, so that's a well understood and old topic: as long as N is manageable, it's polynomial in N. But if N is a trillion, a polynomial of a trillion doesn't sound too good, and here's where the catch is. Think of any system that has D different components - say, a queuing system with D different queues. The state of the system: I have to tell you, for each particular queue, how many jobs it has, and let's say that each queue can have 0, 1, up to K jobs. To describe the state of the system, you need to give me D different numbers, the occupancies of each one of the different queues, and the number of possible configurations or states is actually exponential in the number of components. And any interesting problem that one might want to solve in practice would involve several subsystems, and that situation would show up. Having a polynomial-time algorithm - polynomial in K to the D - is still exponential in D, so we're starting to hit a bottleneck.
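As a quick back-of-the-envelope illustration of that blow-up (the specific numbers are only an example, not from the talk):

```latex
\#\text{states} = (K+1)^{D}, \qquad \text{e.g. } K = 9,\ D = 12 \ \Rightarrow\ 10^{12} \text{ states}.
```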
Now, would there be good news, that sometimes problems decompose and become simpler and easier even though the state space is very large? Can we maybe solve them? Yes, there are some places with good news. The multi-armed bandit problem is the famous one that can be solved in polynomial time even though the state space is exponential. In some cases you can get closed-form solutions; for instance, in control theory we love linear systems with quadratic costs - although the state space is infinite, we can solve them exactly with closed-form solutions, and that's great. These are
the problems that are heavily represented in textbooks, but it's a tiny,
tiny fraction of the problems that one would want to actually solve and once
you move outside these, then all the news is basically negative. There's very little positive that you can say: controlling a system with D queues takes exponential time to solve no matter how clever you are or what algorithm you use, and reinforcement learning or any other method cannot do it - it's an exponential-time problem. And similarly, if you tweak the dynamic programming problem a little bit and have imperfect observations, so that you need to do a little bit of filtering, then instead of being able to solve it in polynomial time, it most likely again requires exponential time. Well, formally the problem is PSPACE-complete, so for all practical purposes it also means that it's hopeless for an exact solution. So at that point,
one could give up. Or, let's say, well we have to do something. It's intractable.
It's high-dimensional. Let's start approximating. What does it mean to
approximate? Well, whatever quantities there are that
we want to manipulate, if they are too complicated then somehow approximate
them and represent them by something that's a little simpler. The universal
idea these days about function approximation goes as follows: you have some complicated function that is hard to describe, but you would like to approximate it by a simpler function. So F is unknown; what you do is you get data
points somehow about F. These could be noisy data points, and then you
interpolate between them using a parametric architecture. For example,
you write down some functional form that you like. Right, so this is a function of
X, it's going to be a function of X, but it has some tuneable parameters. F tilde
is your approximation architecture, you know what F tilde is and you want to set
theta so that this thing as a function of X is close to the function that you
wish to approximate. So you want to find parameters so that this approximation is
best in some sense or is good in some sense. Now, this is the setting of
what people call these days 'supervised learning.' You'd get data points about
some function, and you try to fit those values and do something. Parametric
approximations are not the only ways of approximating, there are other ways like
nearest neighbor methods and so on but I'm going to stick mostly to parametric
methods. So the simplest parametric method is what we call linear regression. That is, you say: I'm going to approximate my complicated function F with a linear function with coefficients theta, and let me find the thetas that best fit the unknown function. Or maybe you have some insight into the problem and you say there are some important features of X, and let me try to fit my function using a linear combination of those features. Here you can throw quite a bit of domain knowledge into the problem, which can be useful, but it's still nice mathematically because the coefficients you are trying to determine show up linearly here. So again, it becomes a linear problem to find the right thetas. These are so-called linear architectures.
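A minimal sketch of fitting such a linear-in-features architecture by least squares; the feature map and the target function here are made-up placeholders for illustration:

```python
import numpy as np

def phi(x):
    """Hand-picked features of x (a toy choice, just for illustration)."""
    return np.array([1.0, x, x ** 2])

# Noisy data points (x_i, y_i) about the unknown function F.
xs = np.linspace(-1.0, 1.0, 50)
ys = np.cos(2.0 * xs) + 0.1 * np.random.randn(50)   # stand-in for samples of F

# Fit theta so that F_tilde(x; theta) = theta . phi(x) matches the data in a least-squares sense.
Phi = np.stack([phi(x) for x in xs])                # 50 x 3 feature matrix
theta, *_ = np.linalg.lstsq(Phi, ys, rcond=None)

F_tilde = lambda x: phi(x) @ theta                  # the fitted approximation
```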
At the other end, we have nonlinear architectures. You can try to fit your function by just throwing in there a general-purpose neural network. The thetas in this case are going to be the tuneable weights inside the neural network. So the neural network takes X and produces F tilde, and you want to set the
parameters so that the input-output behavior of this neural network is
similar to the target function F that you have there. Okay, so that's the idea
of approximation. There are some things that one needs to keep track of - how many parameters do I have in my approximation? In regression, traditionally people have been working with small numbers of parameters. With neural networks they talk about millions of parameters, and depending on what architecture you use and how many parameters you have, you get different capabilities in terms of approximation power. And speaking about approximation power, once you start doing the theory it becomes relevant with respect to what metric you have good approximations, and that also has to do with what metric I use in my supervised learning when I try to fit. And some metrics are good in practice,
some metrics are good to do theoretical analysis, and the two do not always
coincide. All right, so now that we have this approximation idea, we can start
using it, and there's three main approaches - now I'm getting finally to
reinforcement learning - three basic approaches. One is to say: okay, the optimal policy is going to be something complicated, so let me write down a parametric form for the policies and try to tune the parameters theta so that I get a policy that works well. So that's optimization in policy space. The second approach would be to try to approximate the
value function - that V that showed up in the Bellman's equation. V basically tells
me for any state how good or bad is that state. It's a position evaluator in
traditional AI lingo, and there's a variant of this - the so-called Q
functions for those who know more about the field - which tells you if I am at
state S and I commit to doing action A right now, how well off
will I be starting from that point? And we want to choose those thetas so that the approximate value function that we have is not too far from the optimal one from Bellman's equation. What is the value function good for? Value functions are good because they indirectly tell you what to do. If I
have a value function that I believe is close to the true V star, I'm going to be
taking actions guided by that approximation of V star. If V tilde is
the same as the true V star, the optimal value function, this minimization tells
me the optimal thing to do and then the hope is that if V tilde is close to -
there's two hopes, that V tilde is close to V star and that by using the approximate
V tilde, I'm going to come close, I will be approximately optimal. One piece
of terminology, we call that the greedy way of making decisions based on V tilde.
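A sketch of that greedy rule for a small finite model, assuming the stage costs g[s, a], transition probabilities P[s, a, s'] and discount gamma are given as arrays (none of this is code from the talk):

```python
import numpy as np

def greedy_action(s, V_tilde, g, P, gamma):
    """One-step lookahead: immediate cost plus discounted approximate cost-to-go,
    minimized over actions. V_tilde is the approximate value function (one entry per state)."""
    q = g[s, :] + gamma * P[s, :, :] @ V_tilde   # one number per action
    return int(np.argmin(q))
```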
If you are working with Q-values, again the approximate Q-factor tells you what to do, and you can do some fancier things. This is
essentially a one stage look ahead. If I take that action and then the
consequences from there will be V tilde, how should I act? Or you could say I'm
going to have a short horizon and do multi-step look ahead. Let me think about
the optimal sequence of actions I could use, and then the game kind of
ends and at the end of the game I'm going to be at some state. How much do I
like or dislike that state? I'm going to judge that in terms of the V tilde that
I had computed. So you can do one stage look-ahead, multiple stage look-ahead or do something a little more clever than that which is called Monte
Carlo tree search, which is essentially again trying to find a good policy for a
short horizon problem. Except that this tree becomes a little kind of big and
you have intelligent ways of deciding where to explore in that tree, where to
go deeper, where not to go deeper, and actually this trick is, I think, one of the big things that happened in reinforcement learning in the last few years - that people realized that you get huge performance improvements in
practice by doing this, so this is an important idea. All right, so these are
two approaches and finally, why not combine the two? Try to approximate,
parameterize both the policy and the value function and have one help the
other. If I have a policy I can try to learn the value function associated with
that policy and I can then try to use the information, the value function, to
improve the policy. I'm going to come back to those methods in a little more
detail, shortly. So this is a preview, this is the high-level picture of the field,
basically, of the methods that are out there. Okay, now what is the theoretician
going to do in that space? I guess the minimum that one can do - of course, there
cannot be any magic. That is, problems that are exponential time hard, you
cannot solve them faster than exponential time. But you can ask some other things. Assume that I'm lucky and that, for the instance of the problem
that I had, somehow my supervised learning gave me good approximations and
all that, when I work with approximate functions am I going to be approximately
optimal? This is sort of the most basic sanity check for a method: any approximate method, as the approximations become better and better, should give you a sound exact algorithm, and that has been much of the lens with which we, and others, have tried to do much of the theory. So let's take the textbook method. Approximate methods should mimic exact, sound methods, so let's start with the main exact, sound
method and work backwards and think how you can make it an approximate one.
Suppose you have a policy, you can evaluate what that policy does starting
from any given state. That's the value function for that specific policy. If I'm using that policy, and I start at some state, how much cost am I going to get? That's the cost-to-go, the value function, for that specific policy. So policy iteration starts with a policy and does the policy evaluation. There's an
alternative way of doing the policy evaluation which is by solving a system
of equations. My cost starting at state S is my immediate cost, plus the future
cost that I'm going to have. These are two different ways of representing that
same object and you could use either one of those equations to come up with an
algorithm and once you have the value function that tells you what happens
under a policy, you can come up with a new policy which actually provably is
going to be better. What does that - what's the thinking behind this equation? I'm
sitting at state S, I'm looking at the immediate cost if I do a certain action
and I am saying after I take that action I'm going to be at the next state and
starting from that next state I will keep using the original policy pi and so
I'm going to incur the cost of that original policy. So instead of using pi
all the time, I am now contemplating doing A first and then reverting to pi.
Is there something better I can do in the first stage? This expression here
tells me what's the best thing I can do in the first stage and this actually
provably gives an improvement. The main theorem of policy iteration is that policies monotonically improve - they become better and better - and, at least for finite problems, you are eventually going to end up with an optimal policy. So that's a sound, exact algorithm.
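A sketch of that exact version for a small tabular problem, just to fix the two steps - policy evaluation by solving the linear system, then the greedy improvement (standard textbook pseudocode, not anything specific to the talk):

```python
import numpy as np

def policy_iteration(g, P, gamma, max_iters=100):
    """g[s, a]: stage costs; P[s, a, s']: transition probabilities."""
    n_states, n_actions = g.shape
    policy = np.zeros(n_states, dtype=int)            # start with an arbitrary policy
    for _ in range(max_iters):
        # Policy evaluation: solve V = g_pi + gamma * P_pi V for the current policy.
        g_pi = g[np.arange(n_states), policy]
        P_pi = P[np.arange(n_states), policy, :]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, g_pi)
        # Policy improvement: be greedy with respect to V.
        Q = g + gamma * np.einsum("sat,t->sa", P, V)
        new_policy = Q.argmin(axis=1)
        if np.array_equal(new_policy, policy):         # no change: policy is optimal
            break
        policy = new_policy
    return policy, V
```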
What's an approximate version of this? Well, just try to do all of the same steps, but using approximations instead. All right, how exactly does it go? The first step is: given a
policy, described somehow, come up with a value function for that. These are two
possible equations for the value function, let's just work with the first
one. The value function of the policy is the expected cost if you keep using that
policy. How do you evaluate that? How do you compute it? Very easy, just simulate.
It's an expectation. Start at a certain state, run your policy, see how much cost
you get. This is a data point, a noisy data point, for V of S. Do it for lots of initial S's, collect those data points, and then fit - run linear regression or a neural network or whatever to fit a function - and now you have a representation of your value function under this particular policy.
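A sketch of that simulate-and-fit step; the environment interface (env.reset, env.step), the feature map, and the truncation horizon are all assumptions made for illustration:

```python
import numpy as np

def evaluate_policy_approximately(env, policy, phi, gamma, n_rollouts=1000, horizon=200):
    """Monte Carlo policy evaluation followed by a least-squares fit.
    Assumes env.reset() returns a sampled initial state and
    env.step(s, a) returns (next_state, cost)."""
    features, returns = [], []
    for _ in range(n_rollouts):
        s0 = env.reset()
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):                       # truncate the infinite sum
            a = policy(s)
            s, cost = env.step(s, a)
            total += discount * cost
            discount *= gamma
        features.append(phi(s0))                       # noisy data point for V_pi(s0)
        returns.append(total)
    theta, *_ = np.linalg.lstsq(np.array(features), np.array(returns), rcond=None)
    return lambda state: phi(state) @ theta            # the fitted V_tilde
```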
Okay, let's assume that the gods are with us and that we get an approximation which is a good representative of the actual value function of that policy. If our
approximation is not good, of course, there's nothing you could ever hope to
guarantee, but if we're lucky and that happens, then what? I'm going to use
it to make a decision. I'm going to follow the rule that I was describing
before - if you have a value function, it's a guide on how to make decisions, and let
me give myself some slack. I may be off by some epsilons, it might not be the
exactly best arg min, it might be close to the best arg min, but let's suppose
again here that I managed to do that. So if approximation errors are small,
then what? Well, there's a basic sanity check for this algorithm. Interestingly, this is a result that we never published outside of our book with Dimitri, but it's actually one of the hardest proofs that we have in that book, and it is a non-obvious result. It says that if you do everything approximately, within epsilon, then in the end you're going to get policies that are within order of epsilon from the optimal.
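Stated as an inequality - and hedged, because I am quoting the bound from memory of the Neuro-Dynamic Programming book rather than from the slide - the result is roughly of the form

```latex
\limsup_{k \to \infty} \big\| V^{\pi_k} - V^{*} \big\|_{\infty} \;\le\; \frac{\delta + 2\gamma\,\varepsilon}{(1-\gamma)^{2}},
```

where epsilon bounds the policy-evaluation error and delta the slack allowed in the greedy minimization.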
Okay, in some ways it sounds trivial: if you change things by epsilon, in the end everything changes by epsilon. However, it's non-trivial, because this thing is very discontinuous. If V tilde changes just a little bit, the A that wins in this minimization could be a different one, and that discontinuity causes a huge jump in the next V tilde that you're going to have; it's going to be very different.
So it's a very discontinuous algorithm. Continuity arguments do not work but
nevertheless the theorem is true. Improvement is not monotonic but the
result says that the algorithm eventually will settle in some
performance which is within order of epsilon from the optimal. So this is a
sort of very basic sanity check. It's a very good thing to have. It doesn't mean
that you should implement this algorithm, though. It's really a meta algorithm. I
didn't say anything specific about how you fit the V tildes, and the
assumption that you're making that approximations are good - I mean it's an
assumption, if it's true it's okay but there's no way to a priori verify an
assumption of this kind. And then, if we're talking about details, is the right metric to use the classical one, with respect to the infinity norm - that you want very good approximations everywhere and that gives you very good performance everywhere? On the other hand, probably what one cares about is average
performance and maybe the approximation errors that one can guarantee would be
in some average sense. It's very hard to approximate uniformly and generalize uniformly well with a neural network, for example, and if you're
talking about averages then average over which measure, over which distribution? So
there are technically interesting questions that come up here, and yes, it would be interesting to explore variations of these with
different types of metrics and so on, and another issue that comes in is that at
least in the literature, half of the papers prove results about being close
to optimality for every initial state, and half of the papers give you versions
that are 'let's fix an initial state and optimize over policies starting from
that initial state.' They kind of feel similar but technically, they diverge
quite substantially and different versions have different levels of
difficulty, and again the main issue, both for the theoretical analysis and also for
why one does not use this algorithm in practice, is that at least when we
have discrete actions, little changes in V can cause big changes in the
policy that you would be using, and that leads to oscillatory behavior that
kind of doesn't feel good. So one would like to tame those oscillations, and the
way to tame oscillations in general is, instead of making wholesale changes, try
to make small changes. One place where you can make small changes is when
you're calculating the V's, the approximations to the value function.
Instead of - make a big change to your policy and then see what's your new V
which is going to be very different, start changing V's little by little.
One way of doing that is by looking at the Bellman equation which is this
equation that we want to solve. It's a fixed-point equation - again with a star missing on the right-hand side there - V star equals T V star, and the way one solves
fixed-point equations little by little is by iterating on this relation. There's a
similar equation you might want to solve if you have a fixed policy, it has the
same structure. Here, this turns out to be a linear operator, here it's non-linear.
Let's say we want to solve one of those two problems. The natural way of solving
fixed point problems is an iteration of this type - that is, fix the right-hand side, calculate the left-hand side, then put the left-hand side back to the right-hand side and keep going. All right, but we don't have
TV in our hands. We can get approximations of TV here and there by
trying a few states here and there and maybe approximating expectations using
simulation, so we also have noise, and that means that instead of making a wholesale transition, we want to avoid those oscillations - we want to take steps a little bit at a time. So alpha_t there is a small step size: I'm getting an estimate of TV at some states and I move my value function a little bit in that new direction. All
right, but how do we actually represent V? V is represented through some parametric
architecture, could be a neural network, so I cannot change V. I can only change
the parameters of my neural network. So, therefore, what I should do instead is to tweak the thetas so that the V's kind of move in the direction of TV, and that effectively corresponds to something like this, at least for linear architectures: there's a certain space of value functions - there should be a tilde here - value functions that are representable in the approximation architecture that I have. Since I'm working within those
value functions that are representable, any value function I produce should be
in that space. So essentially, I end up doing a projection within
that space. So this is a little bit abstract, but the essence of what's
happening is that you sort of try to do this update with small steps, but all the
time you also need to - effectively, you are projecting, so that you stay within
the space of representable functions. That's not how the algorithms actually run, but this is the mathematical structure behind the algorithms that keep updating V incrementally and try to find an
approximate solution to Bellman's equation. If all goes well, a method of
this kind is going to converge to a fixed point of this iteration. So if V tilde has converged, you cancel various stuff and you're left with Pi T V equals V; that's the equation that you end up solving, and if everything goes well, the fixed point of this is close to a fixed point of the initial equation, which is the exact equation that we wanted to solve.
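For a fixed policy and a linear architecture V_tilde(s) = theta . phi(s), the small-step, simulation-based version of this projected iteration is essentially a temporal-difference update. A hedged sketch, with the environment interface and features again being assumptions for illustration:

```python
import numpy as np

def td0_linear(env, policy, phi, gamma, alpha=0.01, n_steps=100_000):
    """Stochastic small-step iteration toward the fixed point of the projected
    Bellman equation, for a fixed policy and linear features."""
    s = env.reset()
    theta = np.zeros(len(phi(s)))
    for _ in range(n_steps):
        a = policy(s)
        s_next, cost = env.step(s, a)                  # assumed interface
        # Temporal difference: sampled estimate of (T V)(s) minus the current V(s).
        td_error = cost + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * td_error * phi(s)             # move V a little in that direction
        s = s_next
    return theta
```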
So, a short mathematical digression, but this was an exciting topic; it was the PhD thesis of Ben Van Roy more than twenty years ago. Essentially he looked at this type of equation - when do we get convergence and when not? Okay, allow me two minutes of math. The way that one proves that fixed-point methods work is that you have a norm and
you argue that this norm keeps going down. The operator T of dynamic
programming is always a contraction with respect to the maximum norm. How about
the projections? The projections we're talking about are Euclidean projections, things like that; they tend to be contractions with respect to quadratic norms. So
when you put the two together, you don't know which norm to work with, and you're
kind of stuck. Unless you're lucky and something extra happens and you get
additional contraction properties, and there are two cases. If you have a fixed policy, or you're looking at optimal stopping problems, it turns out that T will also be a contraction with respect to a weighted Euclidean norm, and the same thing will be true for the projection, with the same norm, and one gets a valid, solid
algorithm for that setting and actually one of the algorithms that is used in
the financial industry to solve high- dimensional options pricing problems,
which boil down to optimal stopping problems, is actually based on this type of analysis. The other case is if one looks at parameterizations where you approximate the value function with piecewise constant value functions; the projection of a general function onto piecewise constants turns out to be a
maximum norm contraction and that fits nicely with the maximum norm contraction
of T, and so one gets a sound algorithm once more. That kind of extends also to some non-parametric settings: if one talks about nearest-neighbor types of approximations, you get again an L-infinity contraction, and good things can happen in that context. So this is the good news, but in some ways it's
limited. It's for piecewise-constant approximations or, for the case of a
fixed policy and optimal stopping. The sad thing is that that's the end. If you
throw in nonlinear approximation architectures or if you look at linear
architectures without fixing the policy or if you look at q-learning with
function approximation and all that, none of these methods has convergence
guarantees and these are actually the methods that people do use in
practice, so it's a big gap. Sometimes they work, but they're not - they don't have any
guarantee of working, and there are even counterexamples where you put in an approximation architecture which is perfect. With the right choice of
parameters you could fit the true value function but somehow the iterations that
the algorithms do go away from the right theta star and go in the wrong place. So
that's a little bit of a sad situation for a theoretician, and one place where we kind of felt a little stuck 20 years ago. All right, so that was the idea of
incrementally changing V and if you go back, my original motivation was I want
to do things incrementally to avoid big oscillations. Another way of avoiding big
oscillations is to look at policies and have the policies change little by
little. Okay, now if my actions are discrete, how can I avoid oscillations? If
from action one I choose now to do action two it's a big jump, it's not a
small jump. So here's the trick that's being used here: talk about probabilistic decision rules. So I'm going to be looking into randomized policies, so my policy at state S is that I'm going to choose one of the three actions according to these probabilities, and now I'm going
to start moving those probabilities little by little. In the end when I get
to something optimal, all the probabilities should be concentrated on
the optimal action, but while I'm training and learning I'm going to be
using a mix. So now that we have made pi of S continuous, we're free to start throwing
in neural networks in there, and the typical neural network picture these
days would be: you feed S inside and you have three outputs going out there that
give you the three - the probabilities of the different actions, and you want to
tune those parameters so as to get good policies. And now, how do I tune them? Here's one idea: if I have a V in my hands, a value function that's kind
of reasonable, that's coming from some good policy and I'm sitting at some
state S, I can do a look ahead and ask - what's the best thing I should do in the
short run, and after that I'm going to revert to a policy that I know has
performance V tilde. I look at that problem, I find the decision that's good
for that small problem, and I take that as my guide and I update my policy by
moving a little bit in the direction of the one that was suggested by my look
ahead. That's what I would do if I had direct control of the pi's, but I don't have direct control of the pi's; I can only play with the parameters, so what do I
do? I change the parameters of my policy by moving in a sort of gradient
direction. The gradient tells me if I change theta upwards is pi tilde going
to go upwards or downwards, so it tells me which direction to change theta so
that pi tilde kind of matches pi hat. So I'm changing the parameters - basically, maybe don't look at the formula too much - it's changing the parameters in a way so that pi tilde gets closer to the new action that was suggested.
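A hedged sketch of that update step for a softmax policy with linear preferences: nudge theta so that the probability of the action a_hat suggested by the lookahead goes up. This is a generic step written for illustration; the actual losses used in AlphaGo and AlphaGo Zero have additional ingredients, such as matching the full search visit distribution and a value-function term:

```python
import numpy as np

def actor_update(theta, s, a_hat, phi_sa, n_actions, step_size=0.01):
    """Move theta a little in the direction of grad log pi_theta(a_hat | s),
    so that pi_theta(a_hat | s) increases."""
    prefs = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
    prefs -= prefs.max()
    p = np.exp(prefs)
    p /= p.sum()
    grad_log_pi = phi_sa(s, a_hat) - sum(p[b] * phi_sa(s, b) for b in range(n_actions))
    return theta + step_size * grad_log_pi
```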
All right, so this is not my method, but essentially it's the core of the method that was used in AlphaGo and in AlphaGo Zero. We can think of it - to use a word that goes back a while - as an actor-critic method, in the sense that
it involves both a critic v tilde that tells you how good different positions
are, and as the policy changes both are adjusted and there's very interesting
questions here about convergence properties of such a method that adjusts both the V's and the pi's. There's some work by Devavrat and co-authors that shows that this does something reasonable, at least under some
assumptions, and when one uses nearest neighbors instead of parametric
approximations. For the parametric approximations of value functions I guess there's probably some - there's room for some work to be done there. Okay, now
again I was mostly focused on value functions, let's start talking now about
just policy space optimization. Forget value functions - somebody comes up with policies with some tuneable parameters. Those who come from systems and control can easily think of PID controllers or things like that. You have the structure
of your controller, there's some parameters to tune, you want to play with
them but instead of three parameters you might have a million parameters. How do
you tune them? You need the gradient of your cost with respect to the parameters
of the policy. How do you compute gradients? There are messy formulas that
tell you how the gradient can be expressed as an expectation of some
function of the trajectory. So you can simulate trajectories and calculate
gradients and update thetas with stochastic gradient descent,
and that was proposed early on for a fixed initial state and so on, and
there's an algorithm for how to calculate those gradients efficiently.
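One standard form of those 'messy formulas' is the score-function (likelihood-ratio) gradient. A hedged sketch of estimating it from a simulated trajectory, for a softmax policy over state-action features; the exact expressions in the works mentioned differ in details such as baselines and the average-cost setting:

```python
import numpy as np

def softmax_probs(theta, phi_sa, s, n_actions):
    """pi_theta(a|s) proportional to exp(theta . phi_sa(s, a))."""
    prefs = np.array([theta @ phi_sa(s, a) for a in range(n_actions)])
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def policy_gradient_estimate(env, theta, phi_sa, gamma, n_actions, horizon=100):
    """Single-trajectory REINFORCE-style estimate of the gradient of the
    expected discounted cost from a fixed initial state."""
    s = env.reset()
    total_cost, discount = 0.0, 1.0
    score_sum = np.zeros_like(theta)
    for _ in range(horizon):
        p = softmax_probs(theta, phi_sa, s, n_actions)
        a = np.random.choice(n_actions, p=p)
        # grad of log pi_theta(a|s) for the softmax parameterization.
        score_sum += phi_sa(s, a) - sum(p[b] * phi_sa(s, b) for b in range(n_actions))
        s, cost = env.step(s, a)                       # assumed interface
        total_cost += discount * cost
        discount *= gamma
    return total_cost * score_sum                      # noisy (high-variance) gradient estimate
```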
With my student, we looked at the case of average-cost problems over the steady-state distribution, which is a little harder because the steady-state
distribution over which you take average is the steady-state distribution under
the optimal policy, but you don't know what the optimal policy is so there are
some loops to be closed there. So that basically works in theory, it has all the
properties of stochastic gradient descent. That is, under some assumptions
you'll get limit points that are stationary points, and maybe these are good local optima or whatever, but it's a very slow algorithm. Typically, one has to use a very small step size, and with high-variance gradient estimates it converges slowly. You
can do maybe a little better by using some variation of the gradient - so-called
'natural gradients,' but things are kind of stuck. But at this point, a thought comes to mind. These days, everybody is talking about deep neural networks, about how over-parameterization can be a great thing and somehow you will not have bad local minima and all that. Are there statements of this kind that can be made, at least for some control problems and some particular parameterizations, such that you get a nice landscape as a function of theta, so that gradient
descent in policy space does something reasonable? I'm not aware of any single
result of that kind. Actually, I'm not quite aware of how one would even start.
What kind of assumptions or structure to put in place, but something like that
should happen. Something good should happen if these methods work. All right,
as I'm running out of time there's another class of actor-critic algorithms
that tries to help the stochastic gradient descent by using an auxiliary
value function that reduces, supposedly, the variance in the gradient updates.
There's some intuition there that they should reduce the variance, but when you
try to do the theoretical analysis, things are complicated. It was a very bright PhD student, Vijay Konda, who looked at it and has a fascinating chapter at the end of his thesis, which no one has followed up on since then. But there are some very good open problems to attack in there. All right, so now,
reinforcement learning is supposed to be about this thing that goes out into the world, explores, and as it lives a useful life it does better and better things, and that means online. But much of the work is actually done offline. We
have a model, a simulator, a physical system that you can tweak and play with
or historical data, whereas online you only have the physical system in
principle that you work with. For all these methods, I didn't use the words online or offline, so you might be wondering what happens where. Something that one can do is: you train a policy using whatever method you want, burning as many forests as it takes by running thousands of servers to learn a good policy, and then you go and implement it, and that's basically what AlphaGo did. So there's
not much online, it's an offline heavy computational method that solves,
approximately - or tries to solve approximately - a dynamic programming problem. More
interesting things - okay, let me give a little more credit to AlphaGo. You work offline, you learn the policy, but you also learn a value function, and when you are online you are a little more clever than that. So online,
you don't just implement the policy that you learned, but wherever you happen to
be, you do some clever look-ahead to do better, and that's actually where
AlphaGo falls. But the - sort of a big promise of really intelligent systems
that live in the world and learn would be something like this, where there's some starting point - possibly an initial analysis or some very crude controller - and then you go out into the world. This is the real thing: you're supposed to train, learn, and do something wonderful. Well, as I have run out of time, I will just sort of give my bottom line here: I think that this is really an almost impossible, unrealistic, but also maybe not that interesting a goal to have. When one looks at the successes of these reinforcement
learning methods, all the practical ones, they do always involve very heavy
offline computation. If you go online and you can only collect so many data
samples, they will not be enough to fit your huge neural network. If you're
trying to fit a neural network with a million parameters to do something
useful, you're going to need a few million data points. Okay, so maybe in
some context you can think of having millions of data points accumulating in
real time, but in most applications this is not going to be the case. So there's a lot that needs to happen offline, and offline you work either with
a model or a simulator in most cases. Is AlphaGo a model-free method or not? I
guess it's not. AlphaGo uses a simulator of the Go game and the simulator is a
perfect exact model of the game of Go. So a model doesn't necessarily mean
equations, as long as you have a module that reproduces exactly what the system
does, this is a model-based method, and there is some discussion in this
literature. Sometimes one reads papers where the authors proudly say 'we have a
model-free method.' Usually it's not quite model-free, what it often means is
that we try to do as much as we can without touching the model, but the model
is really there. If it's completely model-free it's unlikely to be useful, or to
work, unless you are in domains that are kind of circumscribed. In multi-armed
bandit problems, for example, you can think of certain algorithms - you might
call them kind of model-free - that is, I just try different things - but in essence there is still an underlying model behind them. That is, the model that says I'm dealing with a coin that has a fixed, constant probability of success, and all that. So models are there, they're unavoidable, and one should do
the best one can do with them. So, as I'm running out of time I'll just
keep... Okay, I'll just keep the last slide in
there while taking questions, but one point I would like to emphasize is
that, in the same way that practitioners have benchmark problems, it's a very good idea for theoreticians to also have benchmark problems and see how different methods compare on problems that we know how to solve well. Forget that we know the solution - suppose I'm using a reinforcement learning method: what is it going to do? For example, Ben Recht has a very interesting program of that kind where he takes the
linear quadratic problem as his benchmark and asks 'how do the different
learning methods work there?' even if we don't really need them, just for the comparison. All right, and I'm hoping that all these successes that we see in practice will bring a resurgence of theory and many new interesting results coming up in the next few years from the new generation of people who are getting into the field. Thank you. *audience applause* [Audience Question 1]: John, just... so I thought that, okay, if I'm trying to learn to play chess against you, and the interesting
thing is that I don't know the transition probability kernel, right -
and somehow this should add an extra twist to the story. [John Tsitsiklis]: Okay, so games - now games are a little more complicated and all that. You don't have just - and the Bellman
equation is different and all this. Now, are you trying to find the equilibrium
policies or how to play against me? If you're playing against me, then that will
be online training - you play against me. If it's literally me, we'll play maybe
a few hundred games, but that's it. You cannot really tune everything about your policy; maybe you have a few parameters. Now, if you're playing against the computer, so that you can have a billion games played, then that could be done through an online learning thing, yes. [Audience Question 2]: When you say 'waiting for a theory to catch up with practice,' do you have any, like, examples at the top of your head where
theory has a chance to impact practice? Where some frontier that,
you know, through practically doing reinforcement learning we couldn't break,
but through some theoretical model or analysis we were able to overcome some
challenge. Do you have any sort of anecdotes or things that you have at the
top of your mind? [John Tsitsiklis]: Well, from the past I could mention this example of approximate solutions to high-dimensional optimal stopping problems, where the theory kind of informed the algorithms and ended up with algorithms that were actually practical. Now, often
in this field it's not that theory tells you exactly what to do, but it gives you
the sanity checks and kind of guides your thinking. So, in many cases the
impact of theory is indirect but still valuable. [Audience Question 3]: So, you mentioned the last chapter of Konda's thesis. Given that we have a few minutes, maybe you can elaborate what are
the - what sorts of questions are there? [John Tsitsiklis]: Okay, so there is a formula for the gradient of the cost as a function of the policy parameters. That
formula is a double expectation, and the inside expectation is something that can be expressed in terms of approximate value functions. And so
you say, if instead of simulating that inner expectation, if I use a trained
value function I should be reducing the variance and do better. Now, and there's
an extra aspect there that the value function needs only to be trained with
respect to some natural features that are determined by the policy space
architecture. There are some canonical features and these are the only ones
that you need. Extra features do not come into the gradient calculation.
Now, Konda has an example where he shows that using these learnt value functions does not improve the convergence rate, but then he says if we throw in some
extraneous features that are not necessary in the gradient formula,
somehow you get better performance in practice, and it's - I use the word mystery because it's a kind of fuzzy thing that
you look at it and it kind of raises questions in your mind. Not necessarily
perfectly formulated, but I think there is something there. [Host]: Is there a last question for John? [Audience Question 4]: Thank you for the
presentation. Apologies if I didn't understand this correctly, but you made a
point at the very end where you said reinforcement learning - something to the effect that you didn't think it would eventually get there. Now, could you explain why? Is it because you couldn't reach that optimal state of reinforcement learning? Is it because of limitations, such as people not having ready access to supercomputers, or is it a mathematical practicality? [John Tsitsiklis]: Now, with supercomputers... so the question is whether we can do reinforcement learning online, right? With offline computation we can do lots of things. The
question is whether you can start - your car starts driving and it learns how to
drive that day. [Audience Question 4] *inaudible, off mic* ... the older movie in the 80's with WarGames, with Hal where it learned to the point of playing certain games or certain
understanding that it was suboptimal to actually play the game. That's maybe very
simplified, it may be a tangential point, but could you explain the challenge or the blocks for reinforcement learning to get to that point that you tried to make - really, my question is, what is that challenge? Is it because of access to computing power, or is it because of a
practicality or math complexity? [John Tsitsiklis]: It's really neither; it's the amount of data. If your policy is complicated enough - a good policy that needs a million parameters - then to tune a million parameters you need a few billion data points, and in many applications you cannot do that on the fly. Leaving aside the issue that
if you're talking about an airplane that flies, I don't want to fly an airplane
that's trying to learn a good flying policy. So there will be some subset
where real-time learning is going to happen, but it's not going to be the real-time, model-free learning of a tabula rasa. It's going to be based probably on - you already have a
model and some good policy and stuff that you did offline, and then online, yes
you can refine, get a little better performance and all that but there's going
to be a limit, and I think that sample complexity - the amount of data that you
need is going to be the main bottleneck. *audience applause*