Dimitri Bertsekas: "Distributed and Multiagent Reinforcement Learning"

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

"Distributed and Multiagent Reinforcement Learning" Dimitri Bertsekas - Massachusetts Institute of Technology & Arizona State University

Abstract: We discuss issues of parallelization and distributed asynchronous computation for large scale dynamic programming problems. We first focus on asynchronous policy iteration with multiprocessor systems using state-partitioned architectures. Exact convergence results are given for the case of lookup table representations, and error bounds are given for their compact representation counterparts. A computational study is presented with POMDP problems with more than 10^15 states. In a related context, we introduce multiagent on-line schemes, whereby at each stage, each agent's decision is made by executing a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of local computation required at every stage by each agent is independent of the number of agents, while the amount of global computation (over all agents) grows linearly with the number of agents. By contrast, with the standard rollout algorithm, the amount of global computation grows exponentially with the number of agents. Despite the drastic reduction in required computation, we show that our algorithm has the fundamental cost improvement property of rollout: an improved performance relative to the base policy.

Institute for Pure and Applied Mathematics, UCLA February 24, 2020

👍︎︎ 3 👤︎︎ u/EmergenceIsMagic 📅︎︎ Apr 21 2020 🗫︎ replies

Captions

for partial draft of this is a my website it's not complete but it's close and it contains many of the ideas that that I'm going to talk about today and so here's an outline this deals mostly with approximate policy duration so I'm going to start with that and then I'm going to explain how it can be implemented with value and policy networks neural networks that is then I'm going to talk about some new ideas relating to parallelization one is a multi agent rollout and I'll explain what that means basically the control consists of multiple components with one component corresponding to an agent a separate agent and then I'm going to talk about a more traditional form of parallelization which is partitioning the state space in training in different parts of the space with communication between the processors so ok let me start with alpha 0 some amazing accomplishment it plays much better than all chess programs but most importantly plays different it has discovered new ways to play an ancient game that I think that's very remarkable moreover it has learned from scratch just the rules of the game and after four hours of training it could play as well as the best computer program and after several days it could play much better than that and moreover the algorithm that is used is portable because it in particular learned how to play other games such as gold shogi others ok now the officer methodology is a very skillful implementation involving several ideas the first idea is policy direction the fundamental dynamic programming idea of constellation policy improvement start from a strategy play the strategy and improve the strategy and we do this repeatedly cost function approximations and policy approximations using neural networks massive parallelization this four hours would translate two years of training had it not been for parallel computation finally it involves some very skillful look-ahead approximations in particular the technique of Monte Carlo tree search has been very effective within the alpha zero context now what we will aim to do here is develop a methodology that relates to alpha zero contains several of these ideas perhaps modified form but applies far more generally so I'm going to start with the standard alpha discounted Markovian decision problem where you have a stationary system and cost the system evolves over an infinite number of stages the system equation is what you see here XK is the state at time K you change the control at time J and WK is arrived on disturbance and we are considering policies that map States into control subject to some constraint and at each stage going from XK to XK plus 1 there is this cost here which is discounted by a factor alpha and the cost of a policy is obtained by adding up all those cores over a finite horizon take their expected value and then take the limit as the number of stages goes to infinity and the optimal cost function is the minimum of this over all new overall policies so it's a function it depends on the initial state and it's the main object that you are interested in because if you know J star then you can obtain an optimal policy by minimizing in the right-hand side of balance equation however obtaining J star is extremely difficult in challenging problems so so we're going to work with approximations of that now very fundamental algorithm here is the policy duration algorithm for you start with some policy which I'll be referring to as base policy this new you evaluate that you find the cost function and then you do a policy improvement where you minimize in balance equation with JMU instead of jstor and you obtain a new policy which I'll be referring to as a roll out policy so base policy generates a roll out policy and this process can be continued and the fundamental policy improvement property is that the roll out policy is better uniformly than the previous policy so this is what alpha 0 is also based starting from a policy strategy and improving it now for the exact version of the problem you have this policy this is probably coding everywhere for all X now there are many variants of Polish narration optimistic will restraining and influence policy evaluation has errors as multi-step instead of one-step look ahem we use multi-step look ahead you may operate with few factors and so on our focus is going to be on approximate versions okay so now point of a proverb iteration of the approximation operates like in this pinyin you start with baseball see you evaluate it approximately perhaps with something like a neural net without network then you improve it again approximately and you approximate the corresponding polysomes on policy network and you obtain a rollout policy and you keep going now you don't get policy improvement here but you get approximate policy improvement within some error bound another number of methodological issues to deal with here the theoretical issues as error bounds convergence guarantees sampling efficiency there are quite a few issues and they're also very very daunting implementation choices what to approximate how to sample how to explore the space how to train implementations online versus offline model free implementations versus model-based and so on and there's no guarantee of success in this there is no method that's good for all or even most problems instead there are several different methods that you can try based on your theoretical understanding your intuition to the problem your experience and hopefully something will work ok now as a mathematician I find this not very satisfying but I would not be honest if I didn't tell you that that's what's going on in reality our focus now there's tremendous amount computation involved no less and I will I will focus on how you speed up this computation through distributed and perhaps a synchronous computation so I will focus on two types of distributed computation schemes one is multi-agent parallelization and deals with large control spaces and I'm going to focus on the case where the control has multiple components like one component per agent M can be fairly large here and obviously the size of the control space becomes enormous in dealing with that in the context of one stable multi-step look ahead is very difficult the second type of perun session is more of a standard type of paralyzation multi-process parallelization in deals with large state spaces through partitioning you cut down the space into pieces and you train multiple value and policy networks one percent of V of the partition of course this networks have to exchange values they have to communicate with each other now distributed computation dynamic programming goes way back there are my papers from 80 to 83 and is three that quite extensively in the apparel industry computation book that I wrote with John Stickley's I had a series of papers with Jenny you from 2010 to 14 on distribute policy iteration these deals with value iteration this deal with bonus duration and there was an earlier paper by Williamson bird which motivated a lot of our work a few months ago I wrote a paper on multi agent rollout and just a few a couple of weeks ago we finished the paper with Stephanie Gill and three other students fish meter by the charger say Hill but y'all and Tom wheeler this paper just appeared and it's going to be published soon ok so now let's get into this idea of admin policy networks the standard scheme for approximation value space involves using a cost function approximation J tilde in place of J star the optimal cost function and at a given state that you come in to you do this minimization you form this expected value involved in the cost of the first stage and the cost of the future are properly discounted and you can view this as a cue factor okay the approximate q factor corresponding to a pair x and u you minimize the cue factors overall you and that gives you a control that that that that you use at state X now there are three approximations here one is J tilde in place of J star this expected value can be quite expensive you may wish to approximate this by some kind of limited simulation for instance and there's also the problem of simplifying the minimization operation which can be again daunting there are possibilities here like okay discretize in the state space of course grid or the scheme that I'm going to be talking about shortly [Music] you may know it you may know it and if you know you should use it on the other hand this also applies the model free implementation whereby you do you have instead of having a model you have a computer model and you obtain samples of this G and you approximately the expectation by doing simulation and that's what okay I mean there's a quite a bit of work along this direction but this scheme does not preclude a model free implementation okay now there's also the issue of how do we approximate policies okay if they're all out policies obtained by one step look ahead but how do we represent that and the major way of doing that is to introduce a family of policies a parametric family that depends on a parameter R such for example a neural network in which case R would be the set of weights of the neural network and then you find some ore here that somehow fits the situation okay so now how do you go from value approximation from a J tilde to a parametric policy given J tilde you can do one step look ahead and get training data state control pairs that come from the rollout policy and then you fill this into a classifier which gives you a policy now what do classifiers have to do with policies well if you think about it a policy is a classifier because if you have controls and take values that say ABC then if you had then the policies have the for that to each state X you classify it as being type a Type B for type C okay it's really the same problem X is classified as type u if at state X we apply throw you and how'd you train the rollout policy as a classifier you generate the training set by doing this one step look ahead optimization with this J tilde and for sample states XS you obtain sample controls and you collect a lot of sample pairs like that and then you approximately lock one step look ahead policy by feeding this sample sample set to a classifier and obtain an approximation in particular one way of doing it is by least squares regression so standard way of classifying of building classifiers and there others other possibilities also okay now an important point here is that this type of classifier will give you a randomized policy it will give the probabilities of applying various controls at the corresponding states okay and what you have to do then is go through a max operation and put out at the output the control of maximum probability but what Nellie you you'd use here randomized few actions which have similar feature returns then it's there's a lot of methodology that underlies this Dunham's oversimplified picture everything you know and love from classification theory can be applied here and there are software that will do this in many different ways if you don't even have to use least squares regression here and of course if you do use the squares regression you would also add a regularization function perhaps and so on okay so all the tricks that come from classification theory applied here okay so now how do you go from policies to value functions basically a policy defines has its cost function and you define it can be then a cost approximation can be obtained through truncated simulation so we have we have a policy and we have various initial States here and we simulate that policy generally a lot of simulation trajectories and because we perhaps because the simulation trajectories can be very long you truncate them and take some kind of heuristic cost function approximation it may come from an earlier approximation so how do we do this approximation well for deterministic problems we just run the policy from an initial state once only once because the problems deterministic and we I came a few later stage costs and perhaps we truncate them at some point for example we said that the rest of the cost is zero because in view of discounting this cost may be negligible for stochastic problems you have to do Monte Carlo simulation and run the policy from a given state many times and and obtain the Monte Carlo average altercation is important here in truncation we simulate nu for a limited number of stages and neglect the cost of the remaining stages or add some cost approximation at the end to compensate okay so we have from values we can go to policies and from policies we can go back to values with approximations and all of this gives the scheme whereby starting from a base policy we collect value data the previous slide approximate this cost function and fit a value network to that and then we use that we would then obtain the samples of the rollout policy given that value then you approximate the policy for the policy network and you obtain a new policy that can be used as the base policy for the new round of consideration and now in the idealized case where there are no approximations you have the cost improvement property with approximations the policy improvements approximate to women and error bound and there are error bounds for this there are many variations of the scheme optimistic policy Direction q-learning temporal differences and so on how many algorithms you can use in this box and in this box as well and most reinforcement learning algorithms including alpha zero use variations of this scheme and some variations are highly optimistic very little data between values and policy updates and so this is a big range of methods that have been tried now question is how do we use paralyzation and in this approximate policy duration scheme and there you can think of four different types of parallelization one is q-factor parallelization at the current state you have to evaluate lots of two factors one for each possible control however these two factors are independent from each other and can be computed in parallel okay so naive kind of parallelization alpha zero has plenty of that it's obvious if you have parallel computation you would want to do that Monte Carlo paralyzation of course Montecarlo simulations highly parallelizable and similarly each of the q-factor calculations involves for the problem stochastic involves the computation of an expectation and Monte Carlo simulation can be parallelized now the third type is what I call multi agent paralyzation and corresponds to the case where the control has M components therefore the look-ahead minimization at X involves the computation of as many as N to the M two factors there is a an exponential explosion of the number of factors that you can compute even if this components take two values let's say 0 & 1 still you have 2 to the M you factors and here is the maximum number of possible values and what we will do is consider schemes that reduce this computation dramatically to n times M a dramatic reduction while maintaining desirable properties of the scheme the others multiprocessor parallelization where we use a state space partition we execute separate valuing policy approximations only subsets and these approximations communicate with each other so we'll focus on just this two and start with multi-agent rollout and we aim to simplify and paralyze the one step look ahead ok this is a search-and-rescue problem but to make it more interesting I've turned it to a spy this in fly problem you have 15 spiders moving on a grid along the four directions by one unit at most and a fly that moves randomly now the spiders have perfect vision they know where each other are and they also know where the fly is after it has completed around emotion at every step so the control here is of course the joint motion of the spiders has 15 components and the objective is to catch the fly in minimum time once a spider lands on top of the fly then that's the end and we want this to happen in minimum time now if you try to do one step look ahead or one round full rollout of one round of constellations impossible because there are five to the 15 to factors each spider can move in four directions take what it is has five choices and the number of two factors extraordinary now we're going to reformulate this one step look ahead problem in a way that maintains the cost improvement property but the spiders moved one at a time they take turns in a given order and make a choice between five moves and each step so the total number of moves is 75 five times five times 15 and what we're going to do is break down the control into a sequence of 15 components and the theory behind this is the following actually this idea is more than twenty years old it appeared in the dynamic programming book that I wrote with John cichlids it's a whole section there and it deals with how do you trade off control space complexity with state space code complexity they reduce the complexity of the control space at the expense of increasing the complexity of the state space and the way we do this is we unfold the control action into its end components and we break down transition from XK to XK plus 1 into M little transitions where we choose the first control at the first step then knowing that which is the second throw the first control and so on now the problem is reformulated its equivalent ok because control tree said made with knowledge of X and the control space is simplified at the expense of introducing a more complicated state space and additional state additional cost functions now we apply the rollout one step look ahead idea to this problem and rollout algorithm is just the standard roll out algorithm for this problem it's not the same but it still maintains the cost improvement property and moreover the size the increase in size of the state space does not adversely adversely affect roll out roll out doesn't depend on how big your state space is you move from one state to the next and you do a one-step look-ahead calculation and how big the state space is does not matter and of course the complexity reduction here is that the one-step look-ahead branching factor is reduced from exponential to linear with n here is the number of max of possible choices of each component times more stages ya corresponds to M choices of control one at a time from smaller right if you were to calculate the cost function approximation you would have a problem because you need to approximate all these huge functions you could do it but it is something to consider it may be a daunting task on the other hand when we do one single policy duration we don't have to do that ok I'm sorry what predictive control what allowed is as contains a special case model prolific control it's a special case of rollout so that they share the the feature of landing at a state doing a heuristic calculation and then going forward to the next state and so on it's all connected but but multiple if control partner V has an infinite control space so the string cannot be played easily this is not exact there is always approximation involved right so you don't don't expect to get an optimal solution here the other thing now that you mentioned permutations the Ord that he is assumed to be fixed however it could be part of the optimization in other words you can try several different orders at every step and see what works or better okay so I have a demo it's the first time in my life that I've given a demo in a talk it may or may not work I have quite a few demos if this doesn't work I can show you privately but hopefully it will work and now this involves for spiders and to flies and and okay so this is the base policy that I'm going to show you okay there are four spiders and two flies the Flies move at random and and the base policy here is a naive heuristic a shortest path you ristic we're buying the spiders act on their own selfishly each one goes to the straight to the closest fly regardless of what the others do they don't the fly these spiders don't communicate now I'm going we're going to use this in our all out and see whether the robots he learns from the base policy so let's see if this will work I'm showing now the the base policy huh oops okay you see now that the spiders don't try to do anything smart they all go to the closest and this is a Manhattan distance that were using here in two of the spiders prefer the horizontal and two of the spiders prefer the vertical now they caught this one and they all go to the other one right because their policy is to is to go out to the closest surviving fly and and and take a step added in in that direction now unfortunately sponsee is not very good so it will take some time to complete but eventually the fly will not escape the fly is moving randomly so it may okay so that's it now okay so that's what the base policy does so now let's see what the rollout does with all okay as a present this the cost of this base policy that follows a shortest path is 85 now the all at once roll out when we optimize over the first simultaneously over the first step motion of all the spiders takes 30 for the one at the time also takes 34 and this is something that we observe here much greater reduction over the base policy and you're not hurt too much by doing things one at the time versus all at once and of course here takes it's 125 move choices per step this is only 15 and this is only because we have for spiders if you had 10 spiders it would have been difference would be much larger so let's see what happens okay so the spiders are right in the middle there are the two flies at the end and what you would like to see is despite the split you don't need all of them to go to one fly you sent to this way and shouldn't do that way so that would be something intelligent to do the other thing that would be intelligent to do is having two spiders go to fly they will try to encircle it somehow they will not go straight to where it is they will coordinate and try to encircle it and indeed roll-out does that okay they split off but they don't go in unison okay they try to do something like a an encircling motion okay does this does not show very clearly here okay so the first flight was caught and this one slowly started going in the other direction but it didn't try too hard and the reason is that is figured out that there are two spiders out there they're going to catch the fly before I have the chance to get there okay there's also a Monte Carlo simulation involved here so there are some slight errors but but the rollout is far more intelligent than the basic heuristic and the and this is the one at the time basically does the same similar things again the spiders separate it may go towards different supplies and they try to do some form of encircling encirclement of the of the fly here it's Gordon it's okay so that happened pretty quickly and of course it's 34 in both cases in some cases you see one type of policy do better in other cases the other type of policy do better and there are also some theoretical analysis and simpler problems that sort of verifies this behavior I think that what you see here is pretty close to optimal it's not quite optimal but it's pretty close to optimal okay what are some of the issues in this multi-agent parallelization first of all the one at the time we roll out in the all at once roll out produced different roll out policies and one may be better than the other depending on the problem how about doing one at a time look ahead in the context of policy duration will exact policy duration also work well the answer is no one at the time we roll out use repeatedly may stop short of the optimum I mean wanted one at time roll out it's like coordinate descent for you go try to reduce the cost along one component and then you try the second component and so on and we know that coordinate descent reduces cost but it makes a stack short of the optimum this can happen here also however when approximations are involved in randomization this phenomenon of getting stuck just doesn't happen too often okay and we speculate an approximate pulse duration one at the time we roll out will often perform about as well as all at once roll out and the order also may matter here so this is all in the realm of speculation at this point this has to do with faster computation or the one step look ahead but what is the parallelization because you have multiple decision-makers multiple agents and each one has computation capability you may give them some autonomy and and try to paralyze the computation and one possibility here is divide the agents into coupled groups and there is coordination within a within the groups but less of a commitment in coordination less communication among groups so for example I mean this spiders here don't really care what the spiders do right so they didn't need to coordinate their actions with the actions of this group similarly here this group these two groups require less coordination than reactions within the group so there is this possibility and there are obvious choices but there are a lot of interesting theoretical and algorithmic issues to be resolved here how do we form groups clusters of spiders or agents you probably need to use some kind of a some feature extraction to divide the groups in an intelligent way how frequently do the groups communicate how do what's the type of intergroup communication that you would like introduce communication you'd like to do what kind of aggregated information should this group sent to this group and so on furthermore there may be issues of estimation distribution estimation information processing and so on all of this is a is a very broad range of problems that have been addressed in the literature but they could also be viewed in the context of reinforcement learning okay now I'm going to go into my last topic perhaps I can take a little bit more time since Magda seems to be traveling still okay she'll be here at 4:00 yes you got a message okay okay I can talk until 4:00 I'm sorry I can talk a little more I like questions by the way since we have some more time okay but troubles in the afternoon right oh yeah mother okay okay okay give me a little slack why why don't we all take away some from Mandy okay okay multiprocessor paralyzation the other type of parallelization that's of interest here the idea here is that you have the state space and you want to partition into several subsets and construct a separate policy and value approximation in each subset now it's possible to do a regular partition of the state space like a grid type of partition but that's not a good idea a better idea is to use feature based partitioning where you have a feature vector for each state which summarizes important characteristics of the state and you have and you discretize more or less regularly the space of features and then map this partition into that partition through feature extraction and so in each set of the state space partition you have states with similar features okay that's the idea put states together that have there are more or less similar in the sense of in terms of their features and now there's a question of how do you implement truncated rollout and policy duration with partitioning okay there's an old and fairly fairly obvious idea you assign a processor to each subset of the partition all of this is offline training it can't be done online it's offline training you assign one processor to each substitute the partition each processor uses a local value in the local policy approximation and maintains a synchronous communication to the other processors the processor needs each other's estimates to do this training and we update values local units subsets update policies locally on its subset you may use multi agent parallelization on this in this policy update and then you communicate as synchronously the local values and policies to the other processors now this obvious algorithm has been around this idea has been around for a long time but Williamson bird gave a counter example to showing that this algorithm can fail even with a lookup table representation of values and policies and those for a deterministic problem actually but just there was some a synchrony now my work with Jenny you from 2010 corrects this difficulty we modify this algorithm so that this convergence difficulty does not occur and we have proved convergence to the optimal policy and optimal cost function assuming a look up look up table representation for policy and cost functions and also allowing for a synchrony in other words the values and policies that other processors we get from their neighboring processors could be out of sync they could be out of date all of this is not trivial to prove by the way can you give an idea for how you fix it it's just a matter of taking smaller steps how you fix what okay that's some okay in policy duration you saw the one step look ahead problem the theoretical way to fix this is instead of looking at one step minimization so a stopping problem itself which bounds the sizes of the value function above the value function estimates because this can explode that's what may happen so you introduce a certain bound and we convert we introduce a new type of policy Direction method actually which uses instead of a standard policy improvement it uses policy approval using a stopping problem an optimal stopping problem and okay so this idea has ramifications also in other contexts and there's a series of papers on that but it's it's not trivial now whether you actually need in practice to introduce its modifications a different story it certainly fixes the Williams and Byrd counter examples but I mean these artificial counter examples in practice the standard algorithm may work just fine in fact in the example the computation study that I'm going to give in the next slide we use we didn't use the we didn't use this corrected method okay now this algorithm here that we had with Jenny you admits extension to neural network approximations and doesn't converge to the optimum of course but you can you can obtain some error bound type of results that show that the method doesn't go crazy completely on you okay so you can look at the literature on this is quite a bit and let me give you an idea of how you do approximate policy duration with local value and policy networks suppose that you have a base policy and a value network for every subset of the partition this a partition with five sets and within each set there is a processor that maintains a policy approximation and a value approximation and now we wanted to obtain the rollout policy how do we do that within each set we pick initial States and we run the local policy up to the point where the state enters into some other subset of other processors now this process have communicated values for these states here and that's what the processor uses so the processors do simulations within their local set and take value approximations given to them by other processors initial state these are truncated roll out trajectories using the local policy Network in terminal cost approximation supplied by this guy here this processor the local value network that this guy has applied so it's truncated roll out using multiple approximations and then we repeat we update the local network will communicate with and so on and how does something say something briefly for those who have some experience with this idea a major problem with policy generation is the issue of exploration how do you sample the state space how do you pick this initial stage to do your approximation and generate the training self this is the Achilles heel of reinforcement learning you might say it's extremely difficult to deal with this problem I would argue that through partitioning you can deal with this problem better because you can control this distribution of the initial states more friendly within each subset this is something that just speculation there's no analysis associated with this but in experiments we have found out that partitioning helps with the issue of exploration okay so this is my last technical slide this is from the paper that I wrote with Stephanie you and her students from from few weeks ago it relates to reinforcement learning for pom DP and implements to distribute the ideas that I have discussed okay so this is a pipeline deeper problem where you have 20 locations along a pipeline which are potentially damaged they can have different levels of damage but the level of damage is imperfectly known it is known only if a repairment goes in there and does an inspection and determines that it's damaged to what extent and decides whether to fix the damage or go on right or left some other location so the number of states here okay we assume number of states is okay it was according to Markov chain the five levels of damage five to 20 states here okay and the total number of states of this pom DP is 10 to the 15 astronomical of course the real state of the problem is the is much bigger it's the belief state space over these is 10 to the 15 states okay so the robot moves to the left or to the right inspect locations and repairs them however it may want to skip a mild repair in the interest of getting closer to a location that needs are repaired urgently because remember there are five different levels of of of damage here and some of the damage will be critical and the department may need to rush there and don't waste time repairing locations where repair is less urgent so a huge boom DP problem and we addressed this with belief space but with six policy networks and three value networks so the partition is the state spaces partition in six okay why not why six and not 60 or 600 because even with six the computation time was very large we're talking about serious computations that run over days okay the problem stochastic devolves Monte Carlo simulations and so on however we did get some success and okay here you see some graphs this is policy the performance of policy duration the starting policy I'll tell you what that is after iteration one after iteration two three four you have steady improvement depending on how much approximation you have you may get better or worse performance so the blue lines correspond to six policy networks and t-value networks the red corresponds to no value networks just untranslated roll out basically with with six with the states which partition in six and the the yellow corresponds to no parallel computation at all so you see that the more you introduce approximations parallel computation there is a deterioration however it's not it is not you still do much better and we compared by the way our method with some standard Pompey P methods there's a method called desperate that people talk a lot about and I can't remember the author said many authors and there's another method called pom CP and again it has three or four office and one of them is David silver and they implement they try to address large-scale from the pyramids and we tried their methods with our with our without problem and we found out that actually their methods work worse than the base policy they evolve all sorts of approximation the but involves pruning the look-ahead tree and how you prune it would prove in many different ways we tried many different hyper parameters we never got it to work better than the base policy the same thing with palm CP pump CPUs a Monte Carlo tree search and again proves the look I had three in some randomized way and again we couldn't get it to work better than this okay now what is this policy this policy is though is a greedy policy that sends the repairment to the closest damage location it goes to the right if the closest damage location is to the right or goes to the left and always fixes our repair when it gets to it regardless of the agency both the urgency yeah if it's damaged okay one of the five levels is not damaged if it's damaged then you fix it that's the base policy yeah what is not observed the state of repair of the state of damage of its location is unknown and it evolves according to a Markov chain probability distribution over the part a certain location and you look to the left and you look to the right and you you figure out where is the closest damage location that's some damage and then you go in that direction so you may go on this way fix some damage over here and then go back fix some damage over there yeah but now the observations are when the repairment goes to a location it finds out exactly its level of damage it observes the level of damage exactly otherwise is unknown and also we assume here that one is fixed you stays fixed so so there's an estimator here a belief estimator that this uses the observations the inspections of the repairmen okay so I can give you offline details of this implementation [Music] it's a horrible optimization convexity is a dream beyond the reach of the widest fantasy these are neural networks they involve quite a few nonlinear units and their their cost functions are highly non non convex and they are trained with off-the-shelf software okay there are off-the-shelf software use off-the-shelf software but presumably behind your special software you would do better but still this look this this loss of code there's no convexity a local minima their gradient methods that that crap out on you most of the time and so on okay so my last slide obviously and it's generally accepted that reinforcement learning is a very promising but also very computationally intensive methodology and parentless synchronous computation is an obvious answer it has been the answer for alpha zero in fact so it's important to identify methods that are amenable to distributed computation and and one time we roll out just a single pulse duration with some base policy and multi agent parallelization and valuing policy networks is well suited to exploit parallel computation and roll out generally is often easy to implement and typically reliable as I look into this landscape of reinforcement length method through the first my learning members I find only one method that's reliable and that's roll out in my opinion it works every time okay it may not work spectacularly well may not improve the cost function spectacularly by spectacular amount but it works okay it never gives you worse results and it works similarly from one run to the next if you want some of these other methods you'll find that more than half the time they don't work or if they work it's okay it's if they were they are not also easy to fix with rollout is relatively easy to try to to figure out what to do there's so much opaqueness in some of these reinforcement learning methods that that okay you're trying often find often find yourself trying to blindly trying various things and hope that some will work you may have a different experience that has been my experience but we're all out is quite reliable repeatedly rollout now multiple policy improvements is much less reliable but hopefully that with Partition architectures multi-agent paralyzation you can get faster convergence and it's more ambitious more complicated the first talk dealt in part with Mal predictive control a rollout contains as a special case model predictive control so a lot of the things that I discussed here may have some connection to model predictive control although typically more than prolific control has continued state spaces doesn't have so that's that's an important difference role also has many applications to discrete combinatorial optimization problems like routing problem scheduling problems assignment the assignment problems and this is in my book this is this is dealt in considerably detail although I haven't dealt with this at all but the roll out ideas are relevant and needless to say that there are so many interesting analytical and implementation challenges here so thank you very much [Applause]

Info

Channel: Institute for Pure & Applied Mathematics (IPAM)

Views: 4,628

Rating: 4.9615383 out of 5

Keywords: ipam, math, mathematics, ucla, Dimitri Bertsekas, Massachusetts Institute of Technology, Arizona State University, reinforcement learning, asynchronous computation, programming

Id: nTPuL6iVuwU

Channel Id: undefined

Length: 57min 51sec (3471 seconds)

Published: Thu Apr 09 2020