CS885 Lecture 15c: Semi-Markov Decision Processes

Captions
All right, so for the next set of slides I'm going to introduce some theory regarding semi-Markov decision processes. This is going to be essential to understand the next two papers that we're going to discuss, which are papers about hierarchical reinforcement learning, and the bottom line is that the theory underlying them is pretty much based on this notion of semi-Markov decision processes.

So let me first say a few words about hierarchical reinforcement learning, starting with an example. In reinforcement learning and Markov decision processes so far, we say that there's a set of states, in each state we execute some action, this influences the environment, we get into another state, we choose another action, and so on. But think of certain tasks, say driving: suppose we want to design a self-driving car. It might make sense to hierarchically decompose this reinforcement learning task, so that we don't have to think of everything in terms of low-level actions; we can think at a higher level. If you're driving a car, you often have a high-level goal where you simply want to reach some destination, but to reach that destination you might have to go through certain waypoints: you know that you need to reach a certain town, then another town, and then your destination, say town A, then town B, then town C. These are subgoals. In terms of reaching town A, as you drive along the way you might see a red light and therefore need to stop, you might reach an intersection and need to make a turn, you might be behind a slow driver and overtake them, and depending on whether this is your final destination, at the end you might want to park the car as well. So we can further decompose those subgoals into subtasks, and those subtasks in turn become just a sequence of low-level actions. At the end of the day, when you drive a car you do one of three things: press the brake pedal, press the gas pedal, or steer the car. So we can think of this as a hierarchical task: a high-level goal, decomposed into subgoals, which are decomposed into subtasks, and eventually the low-level actions. But when you plan to reach some destination, you don't start by deciding precisely at every second whether you're going to press the brake or the gas pedal; you think at a higher level, and then, depending on which subgoal and which subtask you're currently accomplishing, you eventually decide what to do in terms of brake versus gas and so on.

This is one example, but the same structure shows up in lots of other complex domains. In video games, for instance, there's a mission that you can break down into subtasks, which then turn into lower-level behaviors.
In robotics there's something similar: there's the low-level control, but a mobile robot may need to reach a location, and to reach that location it might need to open some doors, grab some things, and maybe carry an object, so there are lots of subtasks one might want to introduce. This is something very appealing, and in fact there have been decades of research on hierarchical reinforcement learning, but it's still not easy to accomplish. In any case, to understand how we could achieve something like this, where we divide what we want to do into subgoals and subtasks, we're going to use the notion of semi-Markov decision processes.

Let me actually start with semi-Markov processes. A semi-Markov process is defined by a set of states and some transition dynamics, but the difference from the Markov processes we've seen before is that the transition dynamics now involve not only states but also a notion of time, which I'll capture with τ. Here τ indicates the time to the next transition, in other words the time the environment spends in the current state before it changes to a new state. This is interesting because before, we always discretized time and simply said that at every time step there might be a transition; now we make these time intervals explicit and we don't assume they are all roughly the same length, so some may be longer than others. Concretely, it looks like this: we are in a certain state and remain there for some period of time τ, then we transition to another state s' and remain there for a period τ' that might be shorter than the previous one, then we transition to s'' and remain there for yet another period τ'', and so on. So in a semi-Markov process we include the notion of time. It is still Markovian in the sense that the next state depends only on the current state, but we also capture the time spent, which varies from state to state.

[In response to an audience question:] It depends on how we formulate this. At least the way I've written it here, which is the classical way, yes, everything still depends only on the current state: based on the current state I have a distribution over the next state, and also a distribution over how long it will take before the next transition happens. Another way of formulating this would be to say that, since we introduced time, let's make things symmetric and condition on both the state and the time. It is then possible to obtain an equivalent continuous-time process, but that process is not Markovian in a strict sense: at any given moment, what happens, say, an epsilon amount of time later obviously depends on the current state, but also on how long we have been in that state, and therefore we need to remember when we switched into the current state. So we would need to look at a history (I shouldn't say a history of time steps, because time is continuous here) of which state I was in and when, and from that perspective it does not look Markovian. But if we don't condition on time per se and write it the way I did, then it is Markovian, and by convention we call such processes semi-Markovian. Any other questions? Okay, good.
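As a rough illustration of this definition, here is a small Python sketch that samples from a semi-Markov process. The two-state chain and the gamma-distributed holding times are invented for the example; the one structural point it tries to show is that the sojourn time τ has its own distribution and need not be exponential, which is what distinguishes a semi-Markov process from a continuous-time Markov chain.

```python
import numpy as np

# Toy semi-Markov process sampler (illustrative only; all values are made up).
states = ["s0", "s1"]
P = np.array([[0.2, 0.8],          # P[i, j] = probability of jumping from s_i to s_j
              [0.6, 0.4]])
sojourn_shape = {"s0": 2.0, "s1": 5.0}   # gamma-distributed holding times
sojourn_scale = 0.5

def simulate(n_transitions, rng=np.random.default_rng(0)):
    """Return a list of (state, tau) pairs: the state visited and how long we stayed."""
    trajectory, s = [], 0
    for _ in range(n_transitions):
        tau = rng.gamma(sojourn_shape[states[s]], sojourn_scale)  # time spent in s
        trajectory.append((states[s], tau))
        s = rng.choice(len(states), p=P[s])   # next state depends only on the current one
    return trajectory

print(simulate(5))
```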
Now, based on this, we can formulate a semi-Markov decision process. It is analogous to a regular Markov decision process: we have states and actions, a transition model that now involves both s' and τ, just like in the semi-Markov process, a reward model that gives an expected reward for every state-action pair, a discount factor that can be less than 1 or equal to 1 (meaning no discounting), and a horizon that may be finite or infinite. So it's very similar to what we've seen for a regular Markov decision process, but the main difference is that we've introduced τ, and this has some impact on how we compute values and optimal policies.

To give a concrete example of where such processes arise: classically, in operations research, they come up in queueing theory, and there are lots of queueing problems that are still very important today. A classic one is a retail store with queues, say at the cashier or at the customer service desk; most retail operations deal with customers who form a queue to get some service, whether it's just to pay or to get something else. Now suppose the store has two queues, one at the cashier and one at the customer service desk, and on a given day, for whatever reason, only one employee is working in the store and has to deal with both queues. Then there's an interesting question: with two queues, who should be serviced when? At any point in time the employee services a customer from either queue 1 or queue 2, where the queues correspond to the customer service queue and the cashier queue. The state of this semi-Markov decision process is essentially the number of customers currently in each queue. The transition model is then, roughly speaking, a distribution over the arrival and service times of customers in each queue: customers arrive at the store and join a queue (that's the arrival time), and the employee services customers from each queue, which also takes time, and that time might vary. So in the transition model, what really matters is this notion of time: how long it takes to service customers and when customers arrive. There are all kinds of mathematical models that could be used here; I won't bother writing down a specific one, but you get the idea. For the reward model, we could take the expected revenue from servicing each customer and subtract some expected cost associated with the waiting time. The waiting time is the time customers spend in line; it might not impact revenue directly, but some customers might leave, or get angry and not come back to the store, so there is at least a hidden cost associated with waiting, and we could have a model for that. So that could be our reward model. This is a classic example of where semi-Markov decision processes are typically used.
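To make the queueing example a little more tangible, here is a toy sketch of such a two-queue environment. All the numbers (arrival rates, service times, revenue, waiting cost) and the exact form of the waiting penalty are invented for illustration and are not from the lecture; the only point is that each decision returns a next state, a reward, and an elapsed time τ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented parameters for a toy two-queue store with a single employee.
ARRIVAL_RATE = [0.3, 0.5]   # expected arrivals per unit time in each queue
MEAN_SERVICE = [4.0, 1.5]   # expected service time for each queue
REVENUE      = [10.0, 3.0]  # revenue per serviced customer
WAIT_COST    = 0.2          # cost per waiting customer per unit time

def step(state, action):
    """state = (n1, n2) customers in each queue; action = which queue to serve (0 or 1).
    Returns (next_state, reward, tau)."""
    n = list(state)
    busy = n[action] > 0
    tau = rng.exponential(MEAN_SERVICE[action]) if busy else 1.0  # idle for a unit of time
    reward = REVENUE[action] if busy else 0.0
    if busy:
        n[action] -= 1                          # one customer serviced
    for q in range(2):                          # arrivals during the elapsed period
        n[q] += rng.poisson(ARRIVAL_RATE[q] * tau)
    reward -= WAIT_COST * sum(n) * tau          # crude penalty for customers left waiting
    return tuple(n), reward, tau
```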
Now the question is the same as for regular Markov decision processes: how can we find an optimal policy that maximizes our rewards? We can again formulate as an objective a value function corresponding to a sum of expected rewards, but the main difference is that, in a scenario where rewards are discounted with a factor γ, we now have to take into account how long it takes between transitions. We get a reward each time we choose an action according to our policy, but that reward might come later if it took a long time before we could execute that action, or sooner if it took less time. Think of the customers in a queue: if you earn revenue from servicing customers and it takes you less time to service some of them, those rewards come in faster and you don't want to discount them as much. So we still discount rewards, but the exponent takes into account the amount of elapsed time: at the i-th decision point, the time that has passed is the sum of all the previous periods, so we add up those periods to get the time T_i of the i-th decision point and discount by γ raised to the power T_i. That's the main difference; otherwise, once you can compute a value function for different policies, the optimal policy is simply the one with the highest value in every state.

To find the optimal policy and the optimal value function we can use Bellman's equation. It is very similar to the one for a regular Markov decision process, and the main difference again has to do with discounting. We add the immediate reward to an expectation over future rewards, and you'll notice that instead of having the discount factor γ in front, the discount now sits inside the summation, because it depends on how long it takes before the transition happens: we want to discount future rewards based on how long it takes before we receive them. That's why the discount factor is pushed inside the sum, with an exponent τ corresponding to the amount of elapsed time, which is stochastic, since it depends on how long it takes before we switch to the next state s'. That's the main difference; it's not a big difference, but it's still a difference.
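In symbols, this is roughly what the slides describe; the notation below is my reconstruction, so treat it as a sketch rather than the slides' exact formulas. If τ is continuous, the sum over τ becomes an integral.

```latex
% Value of a policy in an SMDP: each decision point is discounted by the total elapsed time
V^{\pi}(s_0) = \mathbb{E}\left[ \sum_{i=0}^{\infty} \gamma^{T_i}\, R\big(s_i, \pi(s_i)\big) \right],
\qquad T_i = \sum_{j=0}^{i-1} \tau_j

% Bellman optimality equation: the discount \gamma^{\tau} sits inside the expectation over (s', \tau)
V^{*}(s) = \max_{a}\; R(s,a) + \sum_{s',\,\tau} \Pr(s', \tau \mid s, a)\, \gamma^{\tau}\, V^{*}(s')
```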
Now suppose we want to do reinforcement learning. Perhaps the simplest algorithm we've seen is Q-learning, and we can do a Q-learning update that's very similar to what we've seen before; again, the difference is only in the discount factor. At any decision point we receive from the environment the next state s', some reward r, and we also observe how long the transition took, the amount of time τ. So we actually receive three things: the next state s', the reward r, and the elapsed time τ. We use τ simply to apply the discount, and otherwise it is the same update as before. Here I'm showing the simplest type of Q-learning update, which assumes a tabular representation, but you can generalize this to function approximation with a neural network and turn it into a DQN if you want, where you simply use the discount raised to the power τ. It works the same way, and a gradient update can be done as well. Basically, wherever we used to have the discount factor γ, we now replace it by γ raised to the power τ. Any questions regarding this?

[In response to an audience question:] Yes, very good point. We can think of a semi-Markov decision process as a generalization of a regular Markov decision process: if we simply set τ equal to 1 all the time, we recover a regular Markov decision process.

[In response to another question:] I see, yes. When we introduce τ as time, the most general way is to treat it as continuous, but in some scenarios, and in fact in hierarchical reinforcement learning, it makes sense to consider τ restricted to positive integers. If τ is discrete like that, it is possible to go back to having decision points at every time step, with a distribution that says at each step whether or not we remain in the same state. This still provides a generalization, because if you do this trick of re-discretizing time, then to satisfy the Markov assumption you are essentially forced to use a distribution where, at every step, the probability of remaining in the same state is the same. Part of the reason we call this semi-Markovian is precisely that we can have distributions over τ that do not follow an exponential form, where the time to the next event is independent of how long we have already waited; here we can formulate distributions where it does depend on that, and that's why this is still a generalization. Okay, good.
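Here is a minimal sketch of the tabular update just described; the only departure from standard Q-learning is the γ^τ discount. The environment interface (reset() returning a state, step(a) returning s', r, τ, done) is an assumption made for the sketch, not a specific library's API.

```python
import numpy as np
from collections import defaultdict

gamma, alpha, n_actions = 0.95, 0.1, 2

def smdp_q_learning(env, episodes=100, eps=0.1, rng=np.random.default_rng(0)):
    """Tabular Q-learning where the bootstrap term is discounted by gamma ** tau."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection over the current state's Q-values
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, tau, done = env.step(a)          # tau = time until the transition
            target = r + (gamma ** tau) * np.max(Q[s_next]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])       # the usual TD update otherwise
            s = s_next
    return Q
```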
Now let's come back to hierarchical reinforcement learning. I introduced semi-Markov decision processes and told you they would provide us with some theory, but so far I've only explained them in the context of operations research, where there was no notion of hierarchy. If we want to introduce a hierarchy, there is a framework called the options framework that has become popular over the years. The idea is to formulate hierarchical reinforcement learning as a semi-Markov decision process whose actions are options, where an option is just a temporally extended sub-policy. Going back to my first slide: at the top there is a semi-Markov decision process whose actions correspond to "reach A", "reach B", and "reach C". "Reach A" is itself a policy, or sub-policy, whose actions are "turn", "overtake", "stop", and "park", and these in turn can be thought of as sub-policies whose actions are "brake", "gas", and "steer". That's the hierarchy: each circle in this graph corresponds, if you like, to a semi-Markov decision process whose actions are sub-policies at a lower level.

Because the actions correspond to sub-policies, a sub-policy might take a varying amount of time to execute. If my goal is to reach a certain destination, maybe I want to drive to Montreal; then I need to go through Toronto, then eventually Kingston, and then Montreal. Reaching Toronto might be my first subgoal, and that will take some time that depends on the traffic and so on. Because the amount of time varies, I cannot assume that my time steps are all equal: this will not take just one time step, it will take a certain amount of time, and perhaps I want to model that too. That's why a semi-Markov decision process gives us the right tools to understand the implications of taking the time component into account.

So we formulate our hierarchical Markov decision process as a semi-Markov decision process where the actions are options; "option" is really just a name for a temporally extended sub-policy. Inside the semi-Markov decision process we have an action space, and each of those actions is called an option. It's a fancy name, but it reflects the fact that we can choose between those different actions, we have the option of choosing each of them, and those options take a certain amount of time because they correspond to policies, or sub-policies, lower in our hierarchy. I'll denote an option by a-bar, and it has a corresponding sub-policy that I'll call pi-bar. The sub-policy should at some point terminate, and depending on how we formulate it there are many ways to trigger termination, but the simplest is to declare a set of states such that, when we reach one of them, the process stops; that is the end of the option. So typically an option is defined by a sub-policy pi-bar together with some termination condition, which for this example I'm simplifying to a subset of states that, once reached, means the sub-policy has ended.
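As a concrete picture of what an option is in this simplified form, here is a hedged Python sketch: a sub-policy paired with a set of terminal states, plus a helper that runs it to termination and reports where it ended, the discounted reward it collected along the way, and how many time steps it took. The class, the helper name, and the environment interface are made up for the sketch; they are not the canonical options-framework API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    policy: Callable[[Any], Any]   # pi-bar: maps a state to a primitive action
    terminal_states: Set[Any]      # terminate as soon as one of these states is reached

def run_option(env, state, option, gamma=0.95):
    """Execute the option's sub-policy until termination.
    Returns (final_state, G, tau): where it ended, the discounted return
    collected inside the option, and the number of time steps it took."""
    G, tau, discount = 0.0, 0, 1.0
    while state not in option.terminal_states:
        action = option.policy(state)
        state, reward = env.step(action)[:2]   # assumed low-level MDP interface
        G += discount * reward
        discount *= gamma
        tau += 1
    return state, G, tau
```

The triple (final_state, G, tau) returned here is exactly what the semi-Markov machinery needs from an option: a terminal state, an accumulated reward, and an elapsed time.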
Concretely, in the driving example, if my goal is to reach location A, perhaps my states are the possible locations, and when I reach location A I've completed that subgoal, so this triggers the termination of the sub-policy.

With this, we would like to define some transition dynamics: the distribution over states that can be reached when I execute my action, or sub-policy, a-bar, starting from state s_t. Again, this is a sub-policy: we are in some initial state, we execute a-bar, it takes a certain amount of time τ that might vary (that's why τ appears here), and at the end we terminate in some state s_{t+τ}, which, at least in this example, should be one of my terminal states. I can estimate this distribution with the expression on the slide: essentially, I look at all possible trajectories, all the paths that go from s_t to s_{t+τ}, and I unroll them by looking at all the consecutive pairs of states, computing their probabilities, and summing the probabilities of those trajectories. That is what one would do mathematically to obtain the distribution. The point of the theory is that when we execute a sub-policy, we end up in some terminal state after a certain amount of time τ, and there is a distribution over those terminal states as well as over the time τ; the expression on the slide is how we would estimate it.

That's the transition dynamics; we also want to associate a reward function with an option. Here I'm going to generalize our notion of reward: before, the reward depended only on the current state s_t and the action; now I'll make it depend as well on the final state s_{t+τ} and on the time it took, τ. This becomes important on the next slide; it's a slight generalization that doesn't really change the problem, it just gives us more flexibility in defining the reward. With this definition, the reward of the option a-bar corresponds to the sum of the rewards obtained while executing that option, and because those rewards are defined as expected rewards, I compute the sum of the expected rewards as I execute a-bar. When I execute this option I'm really executing the sub-policy pi-bar: I obtain an immediate reward for being in state s_t and executing the action prescribed by pi-bar, then I add the discounted future rewards. Here I'm working one time step at a time, discretizing in the usual way, and I unroll this for τ time steps, until I essentially reach the state s_{t+τ}. So I add up all the rewards obtained along the trajectories that go from s_t to s_{t+τ}, take their expectation, and summarize that in my reward function. So for an option there is an implicit transition distribution as well as an expected reward, and these are the corresponding mathematical quantities.
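In symbols, the option model looks roughly like this; again the notation is my reconstruction of what the slides express, not a verbatim copy.

```latex
% Option model: distribution over where the option terminates and how long it takes
\Pr\big(s_{t+\tau}, \tau \mid s_t, \bar a\big)

% Expected reward of option \bar a started in s_t, following its sub-policy \bar\pi until termination
R\big(s_t, \bar a\big) =
\mathbb{E}_{\bar\pi}\!\left[ \sum_{k=0}^{\tau-1} \gamma^{k}\,
r\big(s_{t+k}, \bar\pi(s_{t+k})\big) \,\middle|\, s_t \right]
```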
With this, we can also adjust Bellman's equation to work with options. Again, options are just like actions, except that they take some period of time to execute and there is a distribution over the resulting state at the end, so this expression comes from the equation we just saw. The idea is that at one level of my hierarchy I get to choose among options, those sub-policies pi-bar, and then I can work with this Bellman equation to optimize my choice of options, using the transition dynamics and the rewards I explained on the previous slide. It is again Bellman's equation; the main difference is that I've pushed the reward inside the expectation, because the reward now depends on s' as well as τ, but that's really the only difference.

That's Bellman's equation, and I can also do Q-learning here. If I want to do Q-learning, I again take the immediate reward obtained from executing the sub-policy corresponding to the action, and because it's a temporally extended action, an option, I need to sum together all the rewards obtained along the trajectory while executing the option and plug that in. After that, I discount by γ to the power τ, because the option took τ time steps, and the γ^τ discount also appears inside the target. That's the main idea. Any questions?

[In response to an audience question:] Yes, very good point. I'm using G because it does correspond to a Monte Carlo return of executing the policy, the same as a rollout, but now it's a rollout of the option; that's why I use G, to be consistent with the usual notation. I think someone else had a question here, or was it the same question? Okay, good. In that case, let's stop here. These were all my slides, and they should help you understand the next two papers we're going to discuss on hierarchical reinforcement learning.
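Tying the pieces above together, here is a hedged sketch of the option-level (SMDP) Q-learning update described in the last part of the lecture. It reuses the hypothetical run_option helper from the earlier sketch; the environment interface and the episode handling are simplified assumptions.

```python
import numpy as np
from collections import defaultdict

def smdp_q_learning_options(env, options, episodes=100, decisions_per_episode=50,
                            gamma=0.95, alpha=0.1, eps=0.1,
                            rng=np.random.default_rng(0)):
    """Q-learning over options: G is the discounted return collected while the
    option ran, and gamma ** tau discounts the bootstrapped value of the state
    where the option terminated. Episode termination is omitted for brevity."""
    Q = defaultdict(lambda: np.zeros(len(options)))
    for _ in range(episodes):
        s = env.reset()
        for _ in range(decisions_per_episode):      # decision points, not time steps
            i = int(rng.integers(len(options))) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, G, tau = run_option(env, s, options[i], gamma)   # helper from earlier sketch
            target = G + (gamma ** tau) * np.max(Q[s_next])
            Q[s][i] += alpha * (target - Q[s][i])   # same TD update, at the option level
            s = s_next
    return Q
```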
Info
Channel: Pascal Poupart
Views: 4,408
Rating: 4.9480519 out of 5
Id: 1nuTmzqKQyE
Length: 36min 4sec (2164 seconds)
Published: Thu Jun 21 2018