CS885 Lecture 1b: Markov Processes

Captions
Okay, so now I'm going to really get started with the meat of the course. In this lecture we'll discuss some important concepts regarding Markov processes; next lecture we'll talk about Markov decision processes, and then we'll talk about reinforcement learning. So we're building up to it gradually.

Markov processes are important in order to model environment dynamics. The idea is that a reinforcement learning agent selects actions as part of an environment, but this environment is evolving; it has some dynamics, and we are going to need to model them. For this we're going to use stochastic processes, and for those stochastic processes we're going to make two important assumptions: the Markovian assumption and the stationarity assumption. That's what I'm going to cover in the rest of this lecture and in the next few slides.

As a quick recall, this is our abstract representation for reinforcement learning. It essentially encodes a feedback loop: the agent executes an action, the environment provides a state and a reward, then there is another action, another state and reward, and so on. If we unroll this loop over time, what we observe is that at a certain time step the environment is in some state, the agent executes an action, a reward is provided, the environment moves to a new state that results from the action taken, the agent takes a new action, a new reward is provided, and so on.

So this feedback loop leads to a sequence of state, action, reward, state, action, reward. In general this sequence forms a stochastic process, in part because there is some uncertainty in the dynamics that underlie it; for most problems it will not be a deterministic sequence. If it were deterministic, we could essentially plan ahead everything that would happen: I'm currently in state zero, I execute a certain action, I know what reward I'm going to get and what state I will end up in, so I can already decide what the next action is going to be, and so on. This would make the problem a lot simpler (not necessarily perfectly solvable, but a lot simpler). The reality is that in practice, even if we know what state the environment is in and we execute an action, we are not certain what reward we are going to get, and we are not sure what the next state of the environment will be. So we're going to model this using the notion of a stochastic process.
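As a minimal sketch of this feedback loop (my toy example, not code from the lecture), the environment and agent below are placeholder stand-ins with stochastic transitions and an arbitrary reward; the point is just the shape of the state-action-reward sequence that gets generated.

```python
import random

# A minimal sketch of the agent-environment feedback loop described above.
# The environment and agent are toy stand-ins (not from the lecture):
# two states, two actions, stochastic transitions, and an arbitrary reward.
def env_step(state, action):
    next_state = random.choice([0, 1])              # stochastic next state
    reward = 1.0 if next_state == action else 0.0   # arbitrary toy reward
    return next_state, reward

def agent_act(state):
    return random.choice([0, 1])                    # placeholder policy

state = 0                                           # s_0
trajectory = []
for t in range(5):
    action = agent_act(state)                       # a_t chosen given s_t
    next_state, reward = env_step(state, action)    # environment responds
    trajectory.append((state, action, reward))      # record (s_t, a_t, r_t)
    state = next_state

print(trajectory)                                   # the s, a, r sequence
```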
Now, there are lots of processes that we could use to model this, but what tends to happen in practice is that most processes have some structure; they are not completely arbitrary. What I mean by this is that the laws of the process often do not change over time. The process is essentially the dynamics of how states, actions and rewards evolve, and I am going to claim that in many situations and many problems those laws do not change over time. It is also the case that for many problems there is a short history that is sufficient to predict what is going to happen next in the future. When I say predict, I don't necessarily mean in a deterministic fashion; it could be a probabilistic prediction. But often the history that we keep can be short and still be sufficient to predict the future.

As a concrete example, consider weather prediction. When we make predictions for the weather, there is a model that is used: it describes how humidity, wind, temperature and so on influence each other in various locations, but it is the same model that is used every day. So when I say that the laws of the process do not change, obviously the weather changes every day, but the model that we use to make predictions is typically the same. We can also assume that it is not just our constructed model: nature itself follows some underlying laws that do not change over time. That is an assumption, and it is not always the case, but it will be a useful one. Furthermore, the weather measurements from the past few days are often sufficient to predict what is going to happen in the near future. If we want to predict the weather tomorrow and the day after, the measurements that we have right now are important, and maybe we can look back a little bit to see how those measurements have evolved in the past few hours or the past few days, but we don't need to look back a month; that is unlikely to help us in any way. So in general, for weather prediction, we can just look at measurements from a short history.

Okay, so now we are ready to formalize a stochastic process. To keep things simple, let me just consider states: we talked about states, actions and rewards, but for the purpose of defining a process I am going to look at a process that corresponds just to the sequence of states. We will make this more general in the coming lectures, where we come back and add actions and rewards again, but for now let's restrict ourselves to states. A stochastic process with respect to states has a set that defines the possible states, and then some dynamics that, in general, we can express by a conditional distribution over the current state given the past states. Graphically, you can think of it this way: this is the state of the environment at time step 0, this is at time step 1, time step 2, and so on. If I want to predict the state of the environment at time step 4, then it depends on the state of the environment at time step 3, as well as time steps 2, 1 and 0, at least in the most general setting; it might depend on everything that happened before. So we can express a conditional distribution where the state at time step t depends on all the previous states of the environment. Any questions regarding this? Okay, good. Let me just point out that in this graph the circles are random variables, and I am really just expressing with the arcs that each random variable has a dependency on other random variables, so this is essentially a probabilistic graphical model, like a Bayesian network. If you're not familiar with this, the key is just that there is some dependency between these variables, and this is what the dynamics express.
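Written out (my notation for the picture being described, using s_t for the state at time step t), the general dynamics are a single conditional distribution over the current state given the entire history:

```latex
% General (unrestricted) dynamics: the state at time t may depend on
% the entire history of past states.
\Pr(s_t \mid s_{t-1}, s_{t-2}, \dots, s_0)
```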
Now there is a problem with the previous picture, because if I have a process that is very long, then this conditional distribution that tells me what the state at time step t might be could depend on a number of previous states that is very large. Expressing this type of conditional distribution would be intractable, and it might have to be infinitely large: if the process were infinitely long and every state depended on everything that happened before, I could not express this. So that is the first problem.

The solution is to make two assumptions: first, we assume that the process is stationary, and second, that the process satisfies the Markovian property. By a stationary process I mean that the dynamics do not change over time; we have seen this already in the context of weather prediction. The second one, the Markovian assumption, is that the current state depends only on a finite history of past states; this is the idea that a short history is sufficient. These are two very important and very common assumptions that we are going to make use of as part of our definitions.

Question: does the Markov property assume that the state depends only on the previous state? In some definitions that is correct, but we are going to see a more general definition with a notion of k-th order Markov processes, where the state might depend on the k previous states. So, k-th order Markov processes: here we assume that the last k states are sufficient. If k equals 1, we have a first-order Markov process, where the distribution can be rewritten as a simpler expression that depends only on the state at time t minus 1. We are doing this because, in terms of representation and computation, it would be intractable otherwise to take everything into account, so we typically want to shorten the history, and the simplest choice is to consider just the last state. The graph then changes to essentially a sequence of states, with one arrow connecting each state to the previous one. If we want to consider more than just the previous state, say the last two states, then we have a second-order Markov process, and every state depends on the previous one and the second-to-last one. In general, when people talk about Markov processes they usually mean a first-order Markov process, but more generally it can also mean a k-th order Markov process, where the state depends on the last k states.

So a first-order Markov process means we have the property that, for any time step t, the conditional distribution depends only on the previous time step. If, in addition, we make the assumption that the process is stationary, it means that this conditional distribution is the same regardless of which time step we consider. In this expression, the left-hand side depends on t and the right-hand side on t', so I am essentially equating these two conditional distributions for two different time steps, and I am saying that they are the same for any time step: that is stationarity. This is a little bit counterintuitive, because obviously the process is changing, at least the states are changing, but the dynamics of the process are not changing; we assume they remain the same. The advantage is that now we can specify an entire process with a single concise conditional distribution: we just need Pr(s' | s).
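Written out (my rendering of the equations being described), the two assumptions look like this:

```latex
% k-th order Markov assumption: the last k states are sufficient.
\Pr(s_t \mid s_{t-1}, \dots, s_0) = \Pr(s_t \mid s_{t-1}, \dots, s_{t-k})

% First-order Markov assumption (k = 1):
\Pr(s_t \mid s_{t-1}, \dots, s_0) = \Pr(s_t \mid s_{t-1})

% Stationarity: the same one-step dynamics hold at every pair of time steps t, t'.
\Pr(s_t \mid s_{t-1}) = \Pr(s_{t'} \mid s_{t'-1}) \qquad \forall\, t, t'
```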
Question: where is the t' in the process, and how are t and t' defined? They are simply two different time steps. I might have time steps 0, 1, 2, 3, 4, 5 and so on, so t and t' could correspond to any non-negative integer. Any pair of consecutive states is related by a conditional distribution that is the same regardless of which time step we are at.

As a concrete example, consider robotic control. Let's say we have a robotic arm and we are controlling the different articulations in the fingers so that we can grasp things. For every joint we might model as the state the x, y, z coordinates of the joint plus its angle, and for the dynamics we could assume that the motion of the joint is constant; this would satisfy our condition for a stationary process. Obviously this might not hold, because in practice the velocity is not going to be constant; there might be some acceleration, and we will see how we can deal with this. As another example, consider an inventory management problem: the state is defined by the inventory level, and the dynamics correspond to how the inventory level changes over time. The demand for the different widgets is assumed to be constant but stochastic; this might not always hold, but it is an assumption that we are making.

So when these assumptions do not hold, what can we do? It turns out that we can often add new state components to our description to make those assumptions closer to holding. Often the assumptions fail simply because we do not have enough features in our state representation. In the robotics case, if you have just the position and you assume that the velocity is constant, but in practice it varies because of acceleration, then you can simply add the velocity to the state representation and instead make the assumption that the acceleration is constant. Obviously this might not hold either: the acceleration might vary too, and then we can add the acceleration to the state description and keep going; a sketch of this augmentation is given below. The question then is where we stop. In theory we might never stop, but this is similar to taking a function and examining its Taylor series approximation: the position corresponds to the function value, the next component is the first derivative, then the second derivative, and so on, and you get a closer and closer approximation through the Taylor series expansion. The idea is the same here: your process can get closer and closer to satisfying the Markovian and stationarity assumptions simply by adding more and more components.

The problem with what I just described is that when we augment the state description, the computational complexity increases. So the key question becomes: can we find the smallest state description that still satisfies Markovianness and stationarity? This is not easy to come up with, and it often requires some domain knowledge, but that would be the ideal: to come up with that smallest state description.
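As an illustration of why augmenting the state helps (my toy numerical example, not code from the lecture; the time step and acceleration values are made up), a joint coordinate under constant acceleration is not well predicted from its position alone, but the pair (position, velocity) can be propagated one step at a time:

```python
# Toy illustration of state augmentation (assumed values, not from the lecture).
DT = 0.1    # discretization step
ACC = 2.0   # true, constant acceleration of the joint coordinate

# Position-only state with a "constant velocity" assumption: to predict the
# next position we also need the previous position, so the process over
# positions alone is not first-order Markov, and the prediction is biased
# whenever ACC != 0.
def predict_position_only(pos, prev_pos):
    return pos + (pos - prev_pos)   # repeat the last displacement

# Augmented state (position, velocity) with a "constant acceleration"
# assumption: the next augmented state depends only on the current one,
# so the first-order Markov assumption is much closer to holding.
def predict_augmented(pos, vel):
    return pos + vel * DT, vel + ACC * DT

print(predict_augmented(0.0, 1.0))  # -> (0.1, 1.2)
```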
Let's say that we now have a process that is Markovian and also stationary. What can we do with it? An interesting question is whether we can make some predictions; in other words, can we do inference to predict the value of some future state? The reason this is important is that in a Markov decision process, and more generally in reinforcement learning, the goal is to select actions that influence future states and hopefully get us into states that have high rewards. If we can predict what the future states are going to be, then perhaps we can select some good actions; if we cannot, it is going to be much harder to select actions. So at the core of Markov decision processes and reinforcement learning is this ability to make predictions, and it is based on those predictions that we select our actions.

Here we would like to predict the state k time steps into the future. Computationally, we can simply take advantage of the chain rule of probability theory to expand this into a product where we sum out all the intermediate states. If we have discrete states, so finitely many states, we can represent the conditional distribution as a matrix; I am going to use the letter T to indicate that this is a transition matrix. What we are really doing in this product is taking the product of that matrix by itself k times. The reason it is T to the power k is that we assume the process is stationary, so at every time step we have the same conditional distribution and therefore the same transition matrix T; if you are looking k steps into the future, you multiply the transition matrix by itself k times, which gives T to the k. The complexity of doing this is k times |S| cubed, because whenever we multiply two matrices of size |S| by |S| together, it takes cubic time with respect to the dimensionality of the matrix. In general this will be one of the core operations we will want to perform in order to predict future states. Any questions regarding this?

Question: why the matrix notation? In the next set of slides I will make use of matrix notation, and when we analyze some properties of Markov decision processes it will make things much simpler, so I am starting to introduce it here by noting that we can think of the dynamics as a transition matrix, and when I take this product all I am doing is multiplying the matrix by itself k times. Question: where did the summation go? The summation is implicit, because multiplying two matrices is not just a pure product: you multiply entries together and sum them, so there is an implicit summation each time you multiply two matrices; this really captures not just the products but also the summations. Question: so this gives the probability of reaching other states? Yes, this gives us the probability of going from some state at time step t to any other state at time step t plus k.
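A minimal sketch in code (my toy 3-state chain with made-up transition probabilities, not from the lecture): the k-step prediction Pr(s_{t+k} | s_t) is obtained from the matrix power T^k, and multiplying a starting distribution by T^k gives the distribution over the state k steps ahead.

```python
import numpy as np

# A stationary first-order Markov chain over 3 states, written as a
# transition matrix T where T[i, j] = Pr(s' = j | s = i). Values are made up.
T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])

k = 5
Tk = np.linalg.matrix_power(T, k)   # T^k: k-step transition probabilities

# Distribution over the state k steps ahead, starting from state 0.
p0 = np.array([1.0, 0.0, 0.0])
p_k = p0 @ Tk
print(p_k)                          # Pr(s_{t+k} | s_t = 0) for each state
```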
So this is good: we can make some predictions. However, predictions by themselves are not the goal. Whenever we make a prediction, it is because we want to use that prediction to select some action, to make some decision. This leads to the problem of decision making, and this is where we are going to need a more comprehensive model that takes actions into account as well; that will lead to Markov decision processes.
Info
Channel: Pascal Poupart
Views: 6,621
Rating: 4.9191918 out of 5
Id: yOWBb0mqENw
Length: 23min 40sec (1420 seconds)
Published: Tue May 08 2018