Semicontractive Dynamic Programming, Lecture 2

Captions
Hello, this is Dimitri Bertsekas. Let me welcome you to the second of five lectures on semicontractive dynamic programming. The subject relates to a type of dynamic programming model, with many applications, in which some policies are well behaved, in the sense that their dynamic programming mapping has a contraction-like character, while other policies are not.

Let us briefly review the contents of the lecture series. In the first lecture we discussed various examples where the semicontractive behavior manifests itself; the examples came from broadly applicable types of problems, such as shortest path problems and linear-quadratic optimal control. In this lecture we are going to provide a mathematical framework that explains the semicontractive behavior, and we are going to restrict ourselves to stochastic optimal control problems. In the next lecture we are going to generalize the results of this second lecture to more general abstract dynamic programming models. Then we are going to discuss applications of the theory to various types of models, such as stochastic shortest path problems and others, and finally we are going to talk about algorithms such as value iteration and policy iteration.

In this lecture we deal exclusively with the stochastic optimal control problem that we also discussed in the previous lecture. It involves a discrete-time dynamic system, where x_k denotes the state at time k, u_k the control at time k, and w_k is a random disturbance with given probability distribution. The next state x_{k+1} is determined from the current state, the current control, and the current disturbance according to this equation. The random disturbance has a probability distribution that depends on x_k and u_k exclusively, and we make no assumptions on the spaces X, U, and W, other than the fact that W, for mathematical reasons, is assumed to be a countable set. We consider policies, which are sequences of functions from state to control that satisfy a constraint depending on the current state. Given a policy, we consider the cost function over an infinite horizon, as measured by a cost per stage function g; the cost is discounted by a discount factor alpha, which is between 0 and 1 or may be equal to 1, so the undiscounted case is allowed. The K-stage cost is given by this expression; it is a random variable, so we take its expected value, and then the limit as K goes to infinity gives us the cost of policy pi starting at the initial state x_0. This is a function that depends on the policy and the state. If we optimize over policies, we obtain the optimal cost starting at state x_0. An optimal policy is one that attains the infimum in this equation for all x simultaneously, and a lot of our analysis deals with stationary policies, which are of the form where the control function does not change over time.

We will now review some basic concepts. The cost of a stationary policy mu starting from state x, which we denote by J_mu(x), and the optimal cost starting from state x, denoted J*(x), typically satisfy Bellman's equations. These are fundamental equations: this equation corresponds to mu, this equation corresponds to the optimum, and there is one equation for every x. They can be viewed as functional equations for the functions J_mu and J*, and they are quite long and complicated, so it is very helpful to use some shorthand notation. The notation we are going to use is to take this expression here and write it as H(x, u, J): a function whose value depends on x, on u, and on the function J.
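As an illustration (not part of the lecture slides), here is a minimal sketch of the shorthand H for a hypothetical finite-state, finite-disturbance model; the arrays f, g, and the distribution p_w are made-up placeholders, not data from the lecture:

```python
import numpy as np

# Hypothetical finite model: states 0..n-1, controls 0..m-1, disturbances 0..d-1.
# f[x, u, w]   -> next state (made-up transition table)
# g[x, u, w]   -> cost per stage
# p_w[x, u, :] -> distribution of the disturbance w given (x, u)
n, m, d, alpha = 3, 2, 2, 0.9
rng = np.random.default_rng(0)
f = rng.integers(0, n, size=(n, m, d))
g = rng.random(size=(n, m, d))
p_w = np.full((n, m, d), 1.0 / d)

def H(x, u, J):
    """Shorthand H(x, u, J) = E_w[ g(x, u, w) + alpha * J(f(x, u, w)) ]."""
    return sum(p_w[x, u, w] * (g[x, u, w] + alpha * J[f[x, u, w]]) for w in range(d))
```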
We then introduce the mapping T_mu J by this equation, and TJ by this equation; T_mu J is also a function, whose value at state x is given by this expression. Looking at the Bellman equations, we can see that they can be written abstractly as fixed point equations: J_mu is a fixed point of the mapping T_mu, and J* is a fixed point of the mapping T. Semicontractive problems are those for which some of the mappings T_mu are well behaved, with a contraction-like structure, and others are not.

Let us also review the kind of results that we aim for in total cost stochastic optimal control. The first question is the characterization of the solution set of Bellman's equation: we want to know whether J_mu and J* are solutions of this equation, whether there are other solutions, and whether J_mu and J* are the unique solutions. There is a necessary and often sufficient condition for optimality, whereby we want to show that if mu*(x) attains the minimum in this expression for all x, then the policy mu* so constructed is optimal; there is a shorthand way to express this: if this equation holds, then mu* is optimal. We have the value iteration method, which starts from some initial function J and then generates TJ, T^2 J, and so on — in other words, it applies the dynamic programming operation many times — and we want to know whether, as the number of iterations goes to infinity, this algorithm gives us J* in the limit. Policy iteration operates on policies rather than functions and generates a sequence of policies mu^k; what we aim to show is that the sequence of policy cost functions converges to J*. The method operates in two stages per iteration: in the first stage, given the current policy mu^k, we evaluate it by solving the corresponding Bellman equation to find J_mu^k; then we plug J_mu^k into the minimization in Bellman's equation and obtain a new policy mu^{k+1}, and we continue this way.

In the first lecture we gave four pathological examples where strange things happen, such as J* not being a solution of Bellman's equation, or value and policy iteration behaving in strange ways, and we want to ask the question: what is the root of all these anomalies? An answer is that there are some policies that are not well behaved in terms of value iteration; in other words, value iteration using the Bellman equation for those policies does not converge to J_mu, the cost function of the policy mu. In more concrete terms, we saw that such not-well-behaved policies arise in shortest path problems as policies that involve zero-length cycles, or as policies that are unstable in linear-quadratic problems. Now we want to investigate mathematically how these policies cause difficulties; we will characterize such policies as irregular in a well-defined mathematical way, and we are going to introduce assumptions and analysis under which we can assert that these policies are harmless.

Our general approach is going to be as follows: we select a class of well-behaved, or regular, policies; we then consider a restricted optimization problem where we optimize over the regular policies only; we show that this restricted problem has nice theoretical and algorithmic properties; and we then relate the optimal policies of the restricted problem to those of the original problem, and under reasonable conditions we obtain strong theoretical and algorithmic results.
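As a companion to these descriptions of value iteration and policy iteration, here is a minimal sketch for a hypothetical finite-state, finite-control discounted model; the transition probabilities P and stage costs g are placeholders, not data from the lecture:

```python
import numpy as np

# Hypothetical finite model: P[u] is an n x n transition matrix under control u,
# g[x, u] is the expected cost per stage; alpha in (0, 1) for simplicity.
n, m, alpha = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((n, m))

def T(J):
    """Bellman operator: (TJ)(x) = min_u [ g(x,u) + alpha * sum_y P(y|x,u) J(y) ]."""
    return np.min(g + alpha * np.einsum('uxy,y->xu', P, J), axis=1)

def T_mu(J, mu):
    """Operator of a stationary policy mu (an array with one control per state)."""
    return np.array([g[x, mu[x]] + alpha * P[mu[x], x] @ J for x in range(n)])

# Value iteration: J, TJ, T^2 J, ... converges to J* in this contractive setting.
J = np.zeros(n)
for _ in range(500):
    J = T(J)

# Policy iteration: evaluate mu^k by solving J = T_mu J, then improve.
mu = np.zeros(n, dtype=int)
for _ in range(20):
    P_mu = np.array([P[mu[x], x] for x in range(n)])
    g_mu = np.array([g[x, mu[x]] for x in range(n)])
    J_mu = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)                # policy evaluation
    mu = np.argmin(g + alpha * np.einsum('uxy,y->xu', P, J_mu), axis=1)   # policy improvement
```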
The mathematical analysis for all of this, in much greater detail than we are going to discuss here, is contained in this research monograph. Chapter 3, and to some extent Chapter 4, of this monograph have been updated and can be found online at my web page, and I strongly urge you to look at these chapters as an aid for understanding what follows in these lectures.

Let me give an outline of this lecture. We are first going to define mathematically the regular policies; the definition is going to be relative to a set S of functions J, and we will explain this a little later. Then we are going to consider a restricted optimization problem where the optimization is done over the S-regular policies only. We are going to analyze this problem by using policy iteration ideas, and to this end we are going to introduce weak and strong policy iteration properties of the set S, and we are going to obtain correspondingly weaker or stronger results. These results relate to the solutions of Bellman's equation, the convergence of value and policy iteration, and the optimality conditions. We are going to demonstrate these results with two examples: one is a stochastic shortest path example with a finite number of states, and the other is a deterministic optimal control problem with an infinite number of states.

We will now begin the mathematical analysis, and to this end we introduce some notation for cost functions. We denote by R the set of real numbers, by R(X) the set of real-valued functions over the state space X, and by E(X) the set of extended real-valued functions over X; a function J in E(X) can take not only real values but also the values plus infinity and minus infinity. This is necessary because in various theoretical contexts this can happen; although in practice it may not quite happen, in theory and analysis it can, and we need this notation. Let us call J-bar the function that is identically zero, and let us give a mathematical expression for the N-stage cost of a policy starting from x. At the end of the N stages we have a terminal cost function J, which contributes to the cost through this terminal expression; it depends on the terminal state and is also discounted by alpha to the N. Now, if you look at this expression, T_mu J is just one cost per stage plus the terminal cost; T_mu^2 J is obtained by going for two stages and then adding the terminal cost; and T_mu^N J, evaluated at state x, is the result of starting at x, going for N stages, and then incurring this terminal cost. So we have this shorthand expression for finite horizon costs. If in place of J we use J-bar, the zero function, then this expression becomes what you see here, and this is the K-stage cost of the policy mu starting from x with zero terminal cost. We can obtain the infinite horizon cost of mu starting from x by taking the limit of this K-stage cost, and this is what we have called J_mu(x).

Now let us introduce a very basic definition for our purposes: the notion of an S-regular policy mu. The definition is as follows. Given a set S of extended real-valued functions over the state space, we say that mu is S-regular if it satisfies two properties. The first is that J_mu is a fixed point of T_mu within S: it satisfies this equation and J_mu belongs to S. The second is that if we iterate with T_mu on any function J in S, then in the limit we obtain J_mu; in other words, no matter where we start within S, value iteration using the T_mu mapping gives us the same result, J_mu. We say that a policy that is not S-regular is S-irregular.
So a policy can be either S-regular or S-irregular, and here are two examples. This policy here is S-regular: this is the set S, and the policy satisfies the defining properties — J_mu belongs to S, it is an equilibrium point of T_mu, a fixed point of T_mu, and starting at any point J within S we converge to this unique equilibrium point. That is the regularity property: being a stable equilibrium of T_mu, at least within the set S. The policy mu-bar is not an S-regular policy, and this can happen in a number of ways: J_mu-bar, for example, may be outside of S, or it may be inside of S but, starting from some J inside S, we go outside of S and do not have convergence. So the S-regular mu are the ones for which J_mu is a unique and stable equilibrium within S, and which are well behaved with respect to value iteration, at least for starting points within S.

Let us give a graphical illustration of S-regular and S-irregular policies, and let us do this for the case where S is just the real line, so X consists of just one state. We are going to use figures like this because they are convenient and also quite intuitive. In this figure we use J as the horizontal axis, and T_mu J provides the graph of T_mu as a function of the scalar J. When considering fixed points of mappings like T_mu or T, it is useful to draw the 45-degree line: every point where the graph crosses the 45-degree line is a fixed point. So if you look at this blue mapping T_mu, it is S-regular, because within the set S, J_mu is the unique fixed point, and also, starting from any point within S — like this point here, for example — value iteration works as follows: from the starting value we go up to T_mu J, over to the 45-degree line, and down again; if this is J_0, then this is J_1, then J_2, J_3, and so on. Because of the slope of this blue function, we have convergence to J_mu from any starting function, either from above or from below; value iteration converges like so, and this can be traced to the fact that the slope of this function is uniformly less than 1, so it is a contraction mapping. The red functions correspond to mappings T_mu that are not regular. For example, this one here is not regular because it does not have any equilibrium point — any fixed point — within the set S. This function here does not correspond to a nice regular policy because it has two fixed points. This mapping T_mu is also not S-regular, for the following reason: even though it has a unique fixed point, you can check that, because the slope of its graph is greater than one, the mapping is not a contraction, and value iteration starting from any point within S will diverge, either to the right or to the left of the fixed point. So the blue function satisfies all the conditions of the definition of S-regularity; the red functions do not.

Now let us consider some examples of S-regular policies in the context of stochastic optimal control, involving a discrete-time equation and a cost per stage over an infinite horizon. Recall the definition of S-regularity. One can prove that if the mapping T_mu for a given policy mu is a contraction mapping over the set S, then mu is S-regular.
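To mirror the one-state picture just described, here is a small sketch (with made-up affine mappings, not the ones on the slide) that iterates a scalar T_mu with slope less than 1 and one with slope greater than 1:

```python
def iterate(T_mu, J0, iters=30):
    """Return the sequence J0, T_mu(J0), T_mu^2(J0), ... (scalar value iteration)."""
    seq = [J0]
    for _ in range(iters):
        seq.append(T_mu(seq[-1]))
    return seq

# Contraction-like mapping: slope 0.5 < 1, unique fixed point at J = 2.
T_regular = lambda J: 1.0 + 0.5 * J
# Slope 2 > 1: unique fixed point at J = -1, but value iteration diverges from it.
T_irregular = lambda J: 1.0 + 2.0 * J

print(iterate(T_regular, 10.0)[-1])    # approaches 2.0 from any starting J
print(iterate(T_irregular, -0.9)[-1])  # blows up even when starting near the fixed point
```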
This applies to the case where S is a subset of the real-valued functions, where T_mu maps S into S — in other words, if J belongs to S, then T_mu J also belongs to S — and where T_mu shrinks the distance between any two functions in S: given any two functions J and J' in S, the distance between T_mu J and T_mu J' is at most beta times the distance between J and J', where beta is a scalar between 0 and 1. The distance here is measured by a norm; any norm over the set of real-valued functions over X would qualify to make this a contraction mapping, but in dynamic programming the norm used is typically a sup-norm or some kind of weighted sup-norm.

Now, if you consider shortest path problems that have n states, either deterministic or stochastic, and you take S to be equal to R^n, the set of real-valued functions over the n-state state space, then the S-regular policies are precisely the policies that terminate: those that reach the destination in finite time, for any initial condition, in deterministic problems, or reach the destination in finite expected time in the case of stochastic shortest path problems. So termination, contraction, and regularity are very closely tied together in stochastic shortest path problems. In particular, for deterministic problems that involve cycles, a policy that does not terminate is S-irregular, and we are going to see that it causes problems in the analysis. In linear-quadratic problems we do not use a contraction-type property to view policies as regular; instead, we view as regular the policies that have a stabilizing property — the linear policies that stabilize the closed-loop system. Such policies turn out to be S-regular, where S is the set of positive semidefinite quadratic functions. This is a particularly favorable set for linear-quadratic problems, because it contains the optimal cost function as well as the cost function of any stationary policy that is linear and stabilizing.

Let us also illustrate some of the fine points of the definition of S-regularity. Whether a policy mu is S-regular or not depends on the choice of S; in other words, mu may be S-regular for some choice of S but not for others. S is not part of the definition of the problem; you have to choose it properly so that good results can be obtained. As an example, consider a policy mu with mapping T_mu, and take S here to be a subset of the real line. Suppose that T_mu has as its graph this blue curve here, so it has two fixed points: one is this one, and the other is this one here. Suppose that we choose S to start to the right of this second fixed point, so this is S. Then J_mu is obtained from J-bar as the limit of T_mu^k J-bar — it is this one — and J_mu is such that S-regularity is satisfied: it is the unique fixed point of T_mu within this set S, and it is also well behaved with respect to value iteration, in the sense that starting at any point within this set we converge to J_mu, not just from J-bar but from any point in S; similarly, starting from a point to the right of J_mu, value iteration operates like so and converges to J_mu. However, suppose we change S and extend it to the left so that it includes J-tilde; then mu is not going to be S-regular, because T_mu would have two fixed points within S, and the definition is violated.
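Here is a small numerical companion to this fine point; the scalar mapping below is a made-up example with two fixed points, not the one on the slide, and it shows that whether the second defining property of S-regularity holds depends on which interval we call S.

```python
import math

def vi_limit(T, J0, iters=60):
    """Iterate T from J0 and return the limit (inf if the iterates blow up)."""
    J = J0
    for _ in range(iters):
        J = T(J)
        if abs(J) > 1e12:          # treat as divergence
            return math.inf
    return J

# Made-up monotone scalar mapping with two fixed points: J = 0 (stable) and J = 1 (unstable).
T_mu = lambda J: J ** 2

# Choice 1: S = [0, 0.5].  The only fixed point in S is 0, and iteration from any
# J in S converges to it, so mu would be S-regular for this choice.
print(vi_limit(T_mu, 0.5))   # -> 0.0 (approximately)

# Choice 2: S = [0, 2].  Now both fixed points 0 and 1 lie in S, and iteration
# from J = 1.5 in S diverges, so mu is not S-regular for this larger S.
print(vi_limit(T_mu, 1.5))   # -> inf
```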
Now let us consider a restricted optimization over the set of S-regular policies. Given a set S, let us denote by M_S the set of all S-regular policies, and let us minimize the cost function over this set; this is what we refer to as the restricted optimization problem. Let J*_S be the optimal cost function over the S-regular policies only; in other words, J*_S is given by this expression for every x. Graphically, representing functions J on this line here, let S be this set of functions. The S-regular policies have cost functions that lie within this set, by the definition of regularity, and the value all the way to the left is J*_S: it is the infimum over all these cost functions, and it can be approached arbitrarily closely by some S-regular mu. Now, J* is the infimum over all policies, so it is the infimum over a larger set; therefore it is no larger than J*_S, and J* lies to the left of J*_S. They could be equal, or there could be strict inequality between them; we have this relation.

Now let us look at an important result around S-regularity. Suppose we have a set S; consider the restricted optimization problem and its optimal cost function, and look at this figure: J lies on the horizontal axis, and we have this region here, which on the right is demarcated by the largest function in S and on the left is demarcated by J*_S. What the theorem says is that if J*_S happens to be a fixed point of T, then this region has nice properties. In particular, here is the proposition. Assume that J*_S is a fixed point of T. Then, first, J*_S is the only fixed point of T within this set, which we call the well-behaved region W_S; look at the definition — it is the set of all J demarcated on the right by the largest function in S and on the left by J*_S. The second result is that value iteration works well within this well-behaved region: if you start value iteration within W_S, you get convergence to J*_S — not necessarily to J*, but to J*_S. Third, there is the standard optimality condition for the restricted problem: if mu* is S-regular, J*_S belongs to S, and mu* attains the minimum in Bellman's equation when J*_S is used in place of J*, then mu* is optimal for the restricted problem; conversely, if mu* is optimal for the restricted problem, then it attains the minimum in Bellman's equation.

Let us prove this result about the well-behaved region. The well-behaved region W_S, shown here, is demarcated on the right by the largest J in S and on the left by the optimal value of the restricted problem, which we assume to be a fixed point of T. The first thing we want to show is that if we start value iteration from anywhere in this well-behaved region, we converge to J*_S. Here is the argument. Take any starting function J within the region W_S, so that J is bounded on the left by J*_S and on the right by some J-tilde in S. Now take any S-regular mu and write the following chain of relations involving value iteration. Value iteration applied to J*_S keeps us at J*_S, because J*_S is a fixed point of T. Now, J is larger than J*_S, so by the monotonicity of T^k I have this inequality; and J-tilde is larger than J, so I have this inequality as well; and I also have that this is less than T_mu^k J-tilde, because T is the infimum over all mu of T_mu, so this inequality clearly holds. Now, because mu is S-regular and J-tilde belongs to S, I can take the limit as k goes to infinity.
What I will get is J_mu, and if I then take the infimum over all S-regular mu, I obtain J*_S. So in this chain of inequalities, taking the limit as k goes to infinity and then the infimum over mu, the right-hand side collapses to the left-hand side, and in the middle we get that value iteration converges to J*_S.

The next result is that J*_S is the unique fixed point of T within the well-behaved region. If there were a J', another fixed point of T belonging to W_S, we could start value iteration at J'; value iteration would not move J', because it is a fixed point of T, so J' is the limit of T^k J', which equals J*_S by the convergence-of-value-iteration result we have just shown. So J*_S is the unique fixed point within the well-behaved region.

Now let us also prove the optimality condition. Suppose that mu* satisfies this condition: it attains the minimum in Bellman's equation when J*_S is used in place of J*. Then I have this relation; because J*_S is by assumption a fixed point of T, this is equal to J*_S, so I have this. Now I have this relationship here, with J*_S being an element of S and mu* being S-regular; this implies that J*_S is equal to J_mu*, because T_mu* has only one fixed point within S, and that is J_mu*. Thus J_mu* attains the optimal value of the restricted problem, and therefore mu* is restricted-problem optimal. Conversely, suppose that mu* is restricted-problem optimal. Then J_mu* attains the optimal value, so this equality holds; by the fixed point property of J*_S, it equals J_mu*, which equals T_mu* J_mu* because mu* is S-regular, which equals T_mu* J*_S because this equality holds. So for a restricted-problem optimal mu*, this is equal to this, which is what I wanted to prove.

Let us now consider the deterministic shortest path example of the previous lecture. Here we have a very simple graph: there is only one node other than the destination, so the set of states X is just a single state, state 1. The destination is cost-free and absorbing, so all cost functions have value zero at the destination, and we do not carry it as a separate state. At state 1 there is a choice between two controls: one is the control u that takes us straight to the destination and costs b, and the other is a control u' that keeps us at 1 and costs zero; so there is a zero-length cycle here, which is what causes all the problems. There are two corresponding policies. The policy that applies u, denoted mu, has cost equal to b, and it is S-regular; you can easily verify that, because the terminal cost function does not matter — after one step we are at the termination state. The policy mu' has cost zero and is S-irregular. The optimal cost, of course, is the minimum of the two; it is given by this expression. Bellman's equation has this form: the minimum, over the two controls, of b and 0 + J(1). You can see that the optimal cost is a fixed point of T, and the set of all fixed points is the set of all scalars that are less than or equal to b.

There are two cases to consider. In the first case, b is negative. Then the regular policy is optimal, and we have J* = J*_S = b, while the cost of the irregular policy is 0. The well-behaved region is this red region to the right, and the set of fixed points is the black region to the left.
We can see that the theorem applies: J*_S is the unique fixed point of T within the well-behaved region. In this case it also happens to be equal to J*, so J* is the unique fixed point within the well-behaved region — but of course there are other fixed points outside the well-behaved region. Value iteration also works from anywhere within the well-behaved region; in fact, it converges in just one step, as you can see.

Let us consider the other case, where b is positive and it is the irregular policy that is optimal; then J* is strictly less than J*_S, the well-behaved region is smaller, and the set of fixed points contains J* in its interior. As a result, value iteration cannot converge to J* unless you start at J* itself. On the other hand, value iteration will converge from the well-behaved region — in one step, in fact — and it will find J*_S. So this illustrates the theorem and the anomalies: even though anomalies can occur here, with J* not being the unique fixed point and also not being found by value iteration, there is still a theorem that guarantees convergence of value iteration, at least for J within the well-behaved region.

We have shown that if J*_S is a fixed point of T, then nice things happen with respect to Bellman's equation and the convergence of value iteration. But now the question is: how do we choose S so that J*_S is a fixed point of T? There are several approaches. One approach is to choose S so that J*_S, the restricted optimal value, is equal to J*, the optimal value, and then prove that J* is a fixed point of T. How do you prove that? It turns out that J* is a fixed point of T for several broad classes of problems, generically. For example, in all deterministic optimal control problems it is generically true that J* is a fixed point of T. There are also stochastic optimal control problems where this is so — for example, cases where the cost per stage is uniformly nonnegative or uniformly nonpositive. J* is also generically a fixed point of T for some other types of dynamic programming problems, such as minimax control problems. But we are not going to look into this issue, because it takes us in a different direction than where we want to go; if you look at the revised Chapter 3 of the abstract dynamic programming book, you will see several results of this type. We are going to follow here a different approach, based on policy iteration arguments. This approach applies when S is well behaved with respect to policy iteration; roughly speaking, what we mean is that, starting from an S-regular policy mu^0, policy iteration can generate S-regular policies exclusively. The significance of the policies being S-regular is that if you have a policy-iteration-generated sequence of S-regular policies, then the sequence of cost functions is monotonically nonincreasing; in other words, under S-regularity we have a cost improvement property. So the costs of the generated policies are improving and moving toward J*_S; they have a limit, and it turns out rather easily that this limit is a fixed point of T and that it is equal to J*_S. So policy iteration provides a good starting point: if we can choose S to have good properties with respect to policy iteration, we obtain the desired results.
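To make the one-node shortest path example concrete, here is a minimal sketch (the numbers are illustrative, not from the slides) of the Bellman operator T(J) = min(b, J) at state 1, its fixed points, and value iteration from inside and outside the well-behaved region:

```python
def T(J, b):
    """Bellman operator for the one-node example: choose u (cost b, terminate)
    or u' (cost 0, stay at node 1)."""
    return min(b, 0.0 + J)

def value_iteration(J0, b, iters=50):
    J = J0
    for _ in range(iters):
        J = T(J, b)
    return J

for b in (-1.0, 1.0):                 # case 1: b < 0, case 2: b > 0
    J_star   = min(b, 0.0)            # optimal cost over all policies
    J_S_star = b                      # optimal cost over the S-regular (terminating) policy
    print(f"b={b}: J*={J_star}, J*_S={J_S_star}")
    # Any J <= b is a fixed point of T; within the well-behaved region [J*_S, inf)
    # the unique fixed point is J*_S, and VI started there finds it in one step.
    print("  VI from J0 = b + 5 :", value_iteration(b + 5.0, b))   # -> b = J*_S
    # Started below J*_S (outside the well-behaved region), VI just stays put.
    print("  VI from J0 = b - 5 :", value_iteration(b - 5.0, b))   # -> b - 5, another fixed point
```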
So let us now flesh out the approach of proving that J*_S is a fixed point of T, based on policy iteration arguments. If a policy sequence is generated by policy iteration and consists of S-regular policies, then it has the cost improvement property: each new policy is no worse than the preceding policy, and all of the cost functions, of course, lie to the right of J*_S, the optimal value of the restricted problem over the S-regular policies. Let us prove this policy improvement property. For mu^k, I have that J_mu^k is a fixed point of T_mu^k, so this holds; and T is the infimum over all mu of T_mu, so this inequality holds. By the way the improved policy mu^{k+1} is generated, I have the equality T_mu^{k+1} J_mu^k = T J_mu^k — in other words, mu^{k+1} attains the minimum in Bellman's equation when J_mu^k is used — so I have this relationship by the definition of mu^{k+1}. Now, what is happening is that J_mu^k is greater than or equal to T_mu^{k+1} J_mu^k; in other words, applying T_mu^{k+1} to J_mu^k takes it down to a lower value. By the monotonicity of T_mu^{k+1}, it follows that each further application gives an even lower value, so the sequence of powers of T_mu^{k+1} applied to J_mu^k is monotonically nonincreasing and converges to a limit; and that limit is J_mu^{k+1}, because mu^{k+1} is S-regular by assumption, so value iteration starting from J_mu^k, which is in S, gives J_mu^{k+1} in the limit. So I have proved what I wanted: the new policy is no worse than the preceding policy, and we have this monotonic decrease toward J*_S.

This holds for any sequence of S-regular policies generated by policy iteration, and we are going to distinguish between two different assumptions relating to PI — a strong assumption and a weak assumption — which lead to correspondingly strong and weak results. In particular, we say that S has the weak PI property if there exists some sequence of S-regular policies that can be generated by the PI algorithm. We do not say that every sequence generated by the PI algorithm consists of S-regular policies; we assume only that some such sequence exists, and that is why we call it a weak property.

Now let us state a theorem related to the weak PI property. It says that if S has the weak PI property and mu^k is a sequence of S-regular policies generated by policy iteration — we know that such a sequence exists by the weak PI property — then the corresponding cost functions converge monotonically to something, and that something is J*_S; moreover, J*_S is a fixed point of T. So now the well-behaved-region theorem applies: J*_S is not only a fixed point of T but the only fixed point of T within the well-behaved region, and value iteration converges to J*_S starting from within that region.

Let us look briefly at the proof argument. We have a sequence of S-regular policies generated by PI, so the sequence of cost functions is monotonically nonincreasing, as we proved earlier; thus J_mu^k converges monotonically to some limit J_infinity. We also have that J_mu^{k+1} is a fixed point of T_mu^{k+1}; because J_mu^k is greater than or equal to J_mu^{k+1}, I have this inequality, by the definition of mu^{k+1} as the improved policy; I also have that this is true here; and because T is the infimum over all mu of T_mu applied to J_mu^k, I have this inequality as well; and I have equality here because mu^k is an S-regular policy and J_mu^k is the unique fixed point of T_mu^k. So I have this string of inequalities, and I take the limit as k goes to infinity.
This will converge to J_infinity, this will converge to J_infinity, everything in between converges to J_infinity, and by continuity this converges to T J_infinity; so J_infinity is a fixed point of T. In addition, using the fact that the sequence is monotonically nonincreasing, a simple argument shows that we have monotonic decrease all the way to J*_S. So the fact that J_infinity is a fixed point of T follows from this, and by the monotonic decrease of J_mu^k the limit is also J*_S.

Let us now introduce a stronger PI property. We say that S has the strong PI property if it has the weak PI property and, in addition, starting from an S-regular policy, PI generates only S-regular policies; in other words, you cannot have PI start from an S-regular policy and then, down the line, generate an irregular policy, then go back to a regular one, and so on. In other words, the set of S-regular policies is nonempty — there is at least one — and it is also closed under PI: once you start with S-regular policies, you continue to generate S-regular policies.

Now, how do you verify these properties, the weak PI property and the strong PI property? For the strong PI property there is a potent line of analysis that applies to the case where S is the set of all real-valued functions over X. One can prove that the strong PI property holds under three conditions. First, there exists at least one S-regular policy. Second, there is a certain compactness property of the control constraint set: the set of controls u that are feasible at x and bring the value of H(x, u, J) below a given level lambda is compact, for every real-valued lambda, for every x, and for every J in S; this is true in particular when U(x) is finite for every x, and also under some easily verified continuity properties when U(x) is compact. Let me not go into the details; let us take it as a compactness assumption, which is satisfied fairly easily in various problems — for example, in linear-quadratic problems this compactness condition is satisfied. Finally, the third assumption, which is the most important, says that given an S-irregular policy mu, for every J in S there exists some state x such that the finite-horizon cost corresponding to mu, with terminal cost function J, blows up to infinity for at least one initial state. Among other things, this guarantees that S-irregular policies cannot be optimal: we know that there exists at least one S-regular policy, whose cost function must be real-valued, so J* is real-valued; and if we put J-bar in place of J in the condition, it says that the cost of mu starting from x is infinite, so mu cannot be optimal.
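As a quick numerical illustration of this infinite-cost condition, using the one-node example from earlier with a made-up positive cycle cost and no discounting, the k-stage cost of the cycling policy grows without bound regardless of the terminal cost function J:

```python
a = 0.5                       # made-up positive cost of the self-cycle at node 1

def k_stage_cost_of_cycle(J1, k):
    """(T_mu'^k J)(1) for the policy mu' that always stays at node 1:
    k stages of cost a, plus the terminal cost J(1)."""
    return k * a + J1

for k in (1, 10, 100, 1000):
    print(k, k_stage_cost_of_cycle(J1=-3.0, k=k))   # grows like k*a, to +infinity
```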
Now let us see what we can prove under the strong PI property, again assuming that S is the set of all real-valued functions. Assume that the conditions of the previous slide hold, so that the strong PI property holds: there exists at least one S-regular policy, a certain compactness condition holds, and — most importantly — the infinite cost condition for S-irregular policies holds, so that an S-irregular policy cannot be optimal, because its cost is infinite from at least one initial state. Assume also that J*_S, the restricted optimal cost function, is real-valued and bounded below. Then nice things happen; we obtain results that are almost as strong as for contractive or discounted problems with bounded cost per stage.

The first result is that J*_S is equal to J*, and J* is the unique fixed point of T among real-valued functions; so Bellman's equation has a solution, and it is the unique solution among real-valued functions. Value iteration converges to J* starting from any real-valued function. There is a necessary and sufficient condition for optimality: a policy mu* is optimal if and only if it attains the minimum in Bellman's equation for every x. Moreover, there exists at least one optimal policy; this is a byproduct of the compactness assumption, which guarantees the attainment of the infimum and hence the existence of at least one optimal policy. Finally, policy iteration works, in the sense that, starting from an S-regular policy mu^0, it generates a sequence of S-regular policies that monotonically converges to J*. So in this figure, J* is equal to J*_S; both are the unique fixed point of T within the set of real-valued functions; and we have convergence of value iteration from anywhere in R(X) and monotonic convergence of policy iteration to J*. So here we have stronger conclusions: instead of only having that J*_S is a fixed point of T, we have that both of these are fixed points of T, unique and equal; we have the convergence of value iteration; and we have the convergence of policy iteration. There is a slight weakness: for this policy iteration result to apply, we have to know an initial S-regular policy, and such a policy may not be known, so this may be a slight problem. It turns out that it is possible to bypass this problem by modifying the policy iteration algorithm slightly, but we are going to discuss that in the last lecture.

Now let us revisit the deterministic shortest path example, where S is the real line, and let us try to illustrate the differences between the weak PI property and the strong PI property. Here we have a single node other than the destination, and at that node we have two choices. One is u, which takes us directly to the destination at a cost b; it corresponds to a regular policy, which we denote by mu. The other choice at node 1 is a control u', which costs a and keeps us at 1, so it involves a cycle, and the corresponding policy mu' is irregular. Bellman's equation is given by this expression here, and let us now focus on the policy improvement step. Starting with the regular policy mu, which has cost b, policy improvement selects the control that attains the minimum in this expression — in other words, the minimum between b and a + b. So if a is equal to 0, both controls attain the minimum; in other words, starting from the regular policy we can generate either the regular policy or the irregular policy. So here we have a situation where the weak PI property holds — it is possible for policy iteration to generate a sequence of regular policies — but it is also possible to generate a sequence of policies some of which are irregular; thus the weak PI property holds, but the strong PI property does not hold.
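Here is a minimal sketch of this policy improvement step at the two-control node (again with illustrative numbers); it shows that when a = 0 the improvement step is tied, so an unlucky tie-break can hand back the irregular, cycling policy:

```python
def improved_policy(J1, a, b):
    """Policy improvement at node 1: compare control u (terminate, cost b)
    with control u' (cycle, cost a + J(1)).  Returns the set of minimizers."""
    q = {"u": b, "u_prime": a + J1}
    best = min(q.values())
    return {u for u, val in q.items() if val == best}

b = 1.0
J_mu = b                       # cost of the regular policy mu (terminate at cost b)

print(improved_policy(J_mu, a=0.0, b=b))   # {'u', 'u_prime'}: tie, only the weak PI property
print(improved_policy(J_mu, a=0.5, b=b))   # {'u'}: the cycle costs more, the regular policy repeats
```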
Now let us see what happens when a is equal to 0. The restricted optimum J*_S is equal to b, the cost of the regular policy, while the optimal cost J* is the minimum between b and 0; so if b is positive, then J* is strictly smaller than J*_S. The set of fixed points is the set of all scalars less than or equal to b, and the well-behaved region is the region to the right of J*_S: the interval between b and infinity. Because the weak PI property holds, J*_S is a fixed point of T, and it is the unique fixed point within the well-behaved region. The value iteration algorithm will also converge to J*_S starting anywhere within the well-behaved region. The convergence of policy iteration, however, is problematic, in the sense that it is possible, starting from the regular policy, to generate the irregular policy, which has strictly larger cost, if you do not know how to break the tie in this equation — and in a more general problem you will not know; this example, of course, is very simple. So we have these pathologies under the weak PI property, but we also have some good results.

In the case where a is greater than 0, the irregular policy has infinite cost, so all the conditions that guarantee the strong PI property are in effect; in fact, it is very easily verified that, starting from the regular policy, we repeat that regular policy — we cannot obtain the irregular policy from this equation — which by definition is what we mean by the strong PI property. So now all the good results under the strong PI property hold: in particular, J*_S is equal to J*, and J*_S is the unique fixed point of T within S, which should be contrasted with the earlier case, where we had an infinity of fixed points; value iteration converges to J* from all initial conditions, not just from the well-behaved region; and policy iteration also works, assuming that we know an initial regular policy — which we do here, but which in a more general problem may be somewhat questionable.

Let us now discuss a stochastic shortest path example, where the state space is finite: it consists of the states 1 up to n, plus a termination state t, which is cost-free and absorbing. There are transition probabilities between the states and also from the states to the termination state; the number of controls at each state is finite for all states x; there is a cost per stage g(x, u); and there is no discounting. An important notion for this problem is the notion of a proper policy: a policy mu is proper if, under that policy, the termination state is reached with probability 1, or, equivalently, in a finite expected number of steps. If the policy is not proper, we call it improper. Now let us take S to be the set of all real-valued functions over X, so S is R^n, because there are n states. Then it can be shown that mu is S-regular if and only if it is proper; so the proper policies are the regular policies and vice versa. Another thing that can be proved is that the mapping T_mu of a policy is a weighted sup-norm contraction if and only if mu is proper; so proper, regular, and contractive are all the same for this shortest path problem. It can also be shown that if all policies are proper, then not just each T_mu but also the mapping T is a weighted sup-norm contraction, and then the results are very favorable: basically, the problem behaves like a discounted problem, and anything you can prove for discounted problems you can also prove for the stochastic shortest path problem when all stationary policies are proper. However, it is possible to have some policies that are proper and some others that are improper, so this shortest path problem is a prime example of a semicontractive model: some policies are regular and others are not.
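As an aside, here is a small sketch (with made-up 2-state examples) of one way to check properness numerically: under a proper policy, the substochastic matrix Q_mu of transitions among the non-terminal states has powers that decay to zero, i.e., its spectral radius is below 1, which is also what underlies the weighted sup-norm contraction property of T_mu.

```python
import numpy as np

def is_proper(Q_mu, tol=1e-9):
    """Q_mu[i, j] = probability of moving from non-terminal state i to non-terminal
    state j under policy mu (the missing row mass goes to the termination state t).
    The policy is proper iff termination is reached w.p. 1 from every state,
    i.e. iff the spectral radius of Q_mu is < 1."""
    return np.max(np.abs(np.linalg.eigvals(Q_mu))) < 1 - tol

# Made-up 2-state examples:
Q_proper   = np.array([[0.5, 0.3],   # some probability always leaks to t
                       [0.2, 0.4]])
Q_improper = np.array([[0.0, 1.0],   # states 1 and 2 cycle forever, never reach t
                       [1.0, 0.0]])

print(is_proper(Q_proper))     # True  -> S-regular / contraction
print(is_proper(Q_improper))   # False -> S-irregular
```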
Let us now summarize the results that we can prove for stochastic shortest path problems, considering two cases.

The first case is when the strong PI property holds: we assume that there exists at least one proper policy and that every improper policy mu has some state with infinite cost. Then, because the state space is finite, the control space is finite, and so on, the strong PI property can be shown to hold, and the corresponding results are true. In particular, J* is the unique fixed point of T within R^n, so Bellman's equation holds and has only one solution, namely J*; value iteration converges to J* starting from every J within R^n; and policy iteration converges to an optimal proper policy — if started with a proper policy, it generates proper policies exclusively, because the strong PI property holds.

Now consider the other case, where there are some improper policies that have finite cost, perhaps due to zero-length cycles, as in the deterministic shortest path problem that we looked at earlier. Let us call J-hat the optimal cost function over the proper stationary policies only, and let us assume that J-hat and J* are real-valued; in that case they are also bounded below, because the state space is finite and S is just R^n. Then the weak PI property can be shown to hold, and by the corresponding theorem we have the following results. First, J-hat is the unique fixed point of T within the well-behaved region; it is not necessarily true that J* is a fixed point of T, and it may be that J* is strictly less than J-hat. Value iteration converges to J-hat — not to J* — when started from any J above J-hat. Policy iteration may have problems, as we have seen with the deterministic shortest path example: it may not converge to an optimal policy even if started with a proper policy.

There is also a related line of analysis that applies here, which uses a perturbation argument. In particular, we add a positive scalar delta to g, and we then obtain a problem where the strong PI property holds and the corresponding results are obtained — namely, a unique fixed point and all the other nice results. However, the perturbed problem is not the same as the original, so we let the perturbation go down to zero, and in the limit we obtain interesting results. In particular, there is a perturbed version of policy iteration where we add delta_k to g, with delta_k going to 0 — at each iteration we decrease delta_k — and we obtain a sequence of proper policies that converges to an optimal policy within the class of proper policies, assuming that we start with a proper policy. There is also another version of PI, which we are going to talk about later, that does not require knowledge of an initial proper policy. Again, for this version of the problem it is possible to have an improper policy that is overall optimal, and J* need not be a fixed point of T, as we saw in one of our examples in the preceding lecture.
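Here is a minimal sketch of the perturbation idea on the one-node example with a zero-length cycle (a = 0, b > 0); the numbers are illustrative. Adding delta > 0 to the cost per stage makes the cycling policy's cost infinite, the perturbed problem is well behaved, and letting delta shrink recovers the optimal cost over the proper policies:

```python
def solve_perturbed(b, delta, iters=50):
    """Value iteration for the perturbed one-node problem: every stage cost is
    increased by delta, so T_delta(J) = min(b + delta, delta + J).  For delta > 0
    the cycling policy accumulates delta per stage and is no longer a finite-cost
    improper policy, so the perturbed problem has a unique fixed point b + delta."""
    J = b + 10.0                       # start inside the perturbed well-behaved region
    for _ in range(iters):
        J = min(b + delta, delta + J)
    return J

b = 1.0
for delta in (1.0, 0.1, 0.01, 0.001):
    print(delta, solve_perturbed(b, delta))   # -> b + delta, which tends to b = J-hat
```

Note that the limit is J-hat = b, the optimum over the proper policies, rather than the overall optimum J* = 0 of the unperturbed problem.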
Let us now consider a deterministic optimal control example where the state space is infinite. In particular, we have a discrete-time dynamic system that is deterministic — there is no disturbance — and there is a nonnegative cost per stage. We assume that there is a cost-free and absorbing terminal set of states X_0, and the aim is to reach, or asymptotically approach, this set of states at minimum cost. So this is a problem of regulation to a terminal set of states, such as the origin; it is a very basic problem in optimal control.

Let us take S, as I said, to be the set of nonnegative functions that are identically zero within the termination set. With this choice it can be seen that the terminating policies — the ones that reach the termination set in a finite number of steps from all x with optimal cost less than infinity — are S-regular. There may be some states where J* is infinity; these are the states from which we cannot reach the termination set — we roam around accumulating positive cost — and J* at those states is infinite.

Now let us introduce some assumptions that imply that the strong PI property holds. One such assumption is that outside the termination set the optimal cost is strictly positive, so if you do not reach, or asymptotically approach, the termination set, you are sure to accumulate an infinite cost; in particular, this implies that S-irregular policies that do not terminate have infinite cost. Another assumption is a kind of controllability assumption: for all states with finite optimal cost, and for every epsilon, there is a terminating policy that reaches X_0 starting from x with cost within epsilon of the optimum. In other words, it is not necessarily true that there exists an optimal terminating policy, but everything that you can do with any other policy you can also do, to within epsilon, with a terminating policy, starting from any state. This guarantees that J*_S is equal to J*, and because the problem here is deterministic, J* is a fixed point of T; so J*_S is a fixed point of T, and that makes things easy. There is also a certain compactness condition, which is required to verify that the strong PI property holds and also to strengthen the properties of value iteration.

The kind of results you can prove here are, first, that J* is the unique fixed point of T — the unique solution of Bellman's equation — within S, as a consequence of the strong PI property; then, that value iteration converges to J* starting from any J within S; and also that policy iteration converges to J* if started from a terminating policy. These are the standard types of results that you would expect to hold under the strong PI property. It is also possible to develop a line of analysis based on the weak PI property, in which case some of these results become weaker, but let us not go into that.
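To close with a concrete, made-up instance of this regulation setting: a deterministic integer-state system on 0..N, with terminal set X_0 = {0}, controls "move one step toward 0" or "stay", and a positive cost per stage outside X_0. Value iteration within S (nonnegative functions vanishing on X_0) converges to J*, and the terminating policy "always move" is the S-regular optimum:

```python
N = 10
states = range(N + 1)                    # X_0 = {0} is cost-free and absorbing
controls = ("move", "stay")

def f(x, u):                             # deterministic dynamics
    return max(x - 1, 0) if u == "move" else x

def g(x, u):                             # strictly positive cost per stage outside X_0
    return 0.0 if x == 0 else 1.0

def T(J):
    return [0.0 if x == 0 else min(g(x, u) + J[f(x, u)] for u in controls) for x in states]

J = [0.0] * (N + 1)                      # J-bar, which lies in S
for _ in range(2 * N):                   # value iteration within S
    J = T(J)

print(J)                                 # -> J*(x) = x, the cost of terminating in x steps
```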
Info
Channel: Dimitri Bertsekas
Views: 561
Rating: 5 out of 5
Keywords: Dynamic Programming, Optimal Control, Optimization, Semicontractive DP, Bertsekas
Id: jce5Y64-vrY
Length: 70min 49sec (4249 seconds)
Published: Sat Jul 30 2016