Reinforcement Learning 6: Policy Gradients and Actor Critics

Captions
Okay, let's get started. I'm going to continue where I left off last lecture, but we're also going to dive into something a little bit new. We touched on some of the topics of this lecture already in the second lecture, when we talked about exploration and exploitation, but there we considered a very simple case with only one state, the multi-armed bandit case, and we had a lot of control over what we knew about the problem: one state, a limited set of actions to pick from. We mostly focused on learning how to explore and exploit, because we had full control over the policy and the policy itself was very simple. In this lecture I want to talk about the more general case of learning policies directly from data. As we'll see, this can be used in combination with what we covered in the previous lecture, which was mostly about learning values, but it is a separate objective.

Before we dive in, here is one way to motivate this. It's a quote by Vladimir Vapnik, who is well known for his work on support vector machines and statistical learning theory. In one of his books he says, and I'm slightly paraphrasing so this might not be a direct quote, something like: never solve a more general problem as an intermediate step. The idea is that solving the more general problem is necessarily harder than solving the specific problem you're interested in. For instance, in terms of data efficiency, solving a more general problem will need at least as much data as solving the problem you actually care about. Applying this to reinforcement learning: if we care about optimal behaviour, about the control problem, then why not learn a policy directly that solves it? Why go through models or value functions? There are actually good reasons why you might want those, which I'll touch on in this lecture, but let's go with the direct view for a bit and consider it a viable objective.

As an overview, here are some pros and cons you could associate with different types of reinforcement learning; this is not an exhaustive list. Model-based reinforcement learning has certain benefits. It is "easy" to learn a model, in a sense. In practice it's not actually that easy if your world is very complex, and it's an active area of research, but when I say easy I mean it's well understood, because it is essentially supervised learning. Another benefit of learning models is that you learn basically everything there is to know about your problem: if you're learning the dynamics of the world, you're definitely not aiming too low. However, this objective also captures irrelevant information, and in particular it might spend compute or capacity on irrelevant details. Let me give a concrete example. Think of playing an Atari game and solving it by first learning a model that captures the essence of the game: when you take a certain action, how does one frame
transition into the next? Set aside the planning part, which sounds hard, and just consider learning that model. Now take the exact same game, but replace the typically uniformly coloured background with some arbitrary video that's playing. If you don't tell your model that this is irrelevant, it will spend a lot of capacity, for instance of the neural network you're using, on those irrelevant background frames. The supervised objective of learning the dynamics is not informed about how we want to use those dynamics, which means you might waste a lot of compute, effort, and function approximation capacity on irrelevant details, and that can hurt your performance in the end, because you might not spend enough capacity on the details that actually matter for your policy.

A final potential disadvantage of model-based reinforcement learning is that it's not always trivial, depending on how you do it, to go from your model to a policy: you might still need planning. Again, in the Atari case we actually have access to a simulator, so we can try each action in every state we happen to end up in. If you learn a model, you've basically learned something very similar to that simulator. There is a benefit to having a model, because you can query it: you can put in an arbitrary starting state, maybe a frame you've never even seen before, and ask what the next frame will be if you push this button in this situation, which you can't necessarily do with a black-box simulator of the game. But it's still not trivial to go from that to a policy; you might still want to derive a value function or a policy from your model. So you've made some progress, but maybe not all the way to the optimal policy. Of course this is a bit of a cartoonish view: in model-based reinforcement learning we don't assume you just learn a model and then you're on your own; there are ways to use models that are much more efficient and well suited for planning. I just wanted to point out that this step is not trivial, and that it's an active area of research as well.

Now, we've mostly spent our time talking about value-based reinforcement learning, and its benefit is that it's closer to the true objective. We might not care about all the irrelevant details of the transition function; we only care about the things that matter for the value of each action, for instance the cumulative reward you predict for each action. And indeed, if we have the true values, reading off an optimal policy, as discussed before, becomes quite simple, at least in the discrete-action case (in this lecture we'll also talk about continuous actions, but so far we've mostly focused on discrete actions): if you learn an action value function, a Q function, you can just read off the highest-valued action in each state, and that gives you a policy immediately. So it's fairly easy to go from values to a policy, and it's also fairly well understood
these days; it's similar to regression, although not precisely the same because of the bootstrapping, but we have algorithms that work quite well and can learn these values quite accurately. However, it's still not the true objective, and you might still spend capacity on less important details. Here's a slightly abstract example: imagine you have a robot you want to control optimally, and it just turns out that the optimal policy in that specific setting is simply to drive forward. That could be very easy to learn and very easy to represent: in every state, without even looking at the observation, go forward. But that doesn't mean the value is simple. The value might change a lot depending on the exact observations of the robot; things might pop up in its vision that temporarily slow it down or push it slightly off track. It might still be optimal to keep pushing forward, but the value function can change arbitrarily, and to represent it accurately you might need a fairly rich function approximator, like a deep neural network, whereas the policy could be essentially blind, not even looking at the observations. So in some cases the value function is quite complex while the policy is quite simple, and in those cases going through a value function to learn a policy might be more effort than you need.

Finally, policy-based reinforcement learning, which we'll cover in this lecture, basically targets the right objective: we directly learn a policy that optimizes the value, potentially without learning values separately. That's nice, because it's what we should be focusing on; it fits Vapnik's quote from the previous slide. There are downsides, though. One potential downside is that it ignores all other learnable knowledge. That can mean it's very slow to get off the ground, if it gets off the ground at all, because it might be very hard to find policies with any meaningful performance if you don't first learn, say, what the values of certain actions are. So in some cases you still need to learn more in order to be able to learn the policy at all. In addition, it might not be the most efficient use of your data: if you only use each sample to update your policy, there might be samples that teach you very little about the policy but would still teach you a lot about the world. So in some cases it's beneficial to predict lots of other things, maybe even to build a model, even if you only use that to shape the knowledge inside the agent. For instance, with a deep neural network that has convolutional filters at the bottom, it can be useful to learn lots of predictions about your problem just to get good features, good filters that look at the relevant parts of the environment. This is indeed sometimes done: people use a deep neural network that has a policy as its output, and I'll talk about this more concretely in subsequent slides, but abstractly you can just imagine there are some
observations coming in and a policy coming out. Even in that case it can be useful to hang other predictions off the network, which could be model predictions or value predictions, just to learn the parameters of that function, even if you don't use those predictions in any other way. This is simply more data efficient: you use more of your data to update all of your parameters, and if some of those parameters are shared, because you use the same convolutional filters for instance, this can be quite beneficial. But for now we're going to acknowledge that the policy is the right objective, and talk about how we could learn it.

So, a quick recap: previously we approximated parametric value functions (I see there's a typo on the slide). There's a slight notation difference here: following the new Sutton and Barto edition, I'm going to use w to refer to the parameters of a value function, and I'm going to reserve theta, which in the past we used for the weights of the value function, for the parameters of the policy. In some cases these might partially overlap, as I just said, for instance when multiple heads hang off one deep neural network and share a lot of weights, but we won't focus on that for now. Just consider them two different weight vectors which, if you're using deep neural networks to represent these, capture all of the weights of those networks; we're abstracting away the architecture, which is important but not something we'll discuss in this lecture. The idea is then quite simple: parameterize the policy directly. We'll have something that depends on those parameters, and the semantics is that it gives you the probability of selecting a certain action in a certain state under those parameters. We'll mostly focus on model-free reinforcement learning in this lecture.

This is then the high-level picture: on one side of the Venn diagram there's value-based reinforcement learning, maybe with an implicit policy; on the other side there's policy-based reinforcement learning, where you just represent the policy. And, to make you aware of the terminology, there are also actor-critic systems, which use a value function in some way, and I'll show you some examples, to update a policy. The actor-critic terminology comes from having an actor, which is your parametric policy, and a critic, which is your value function; the critic in some sense criticizes the behaviour of the actor in such a way that the actor can learn more efficiently. There are many ways to combine them, and I'll show some examples, so these things are not fully disjoint: value-based and policy-based reinforcement learning are often used together. By the way, stop me at any time whenever you have questions.

That's a good question: do you learn the values and the policy simultaneously, or do you first learn a value and then use that somehow to learn a policy? It turns out that in practice most people learn both simultaneously. This is a little akin to what we discussed previously with policy iteration, where you first evaluate a policy and then improve it: you could treat these as separate steps, but you can also interleave them at a very fine-grained time scale, maybe even evaluating and improving on the same time step, which is one way to think about what Q-learning does. In some cases it makes a lot of sense to first make sure your value function is quite accurate before you start using it to improve your policy; in other cases you can get away with improving both of them at the same time, all the time. It's a very good question.
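Just to make that parameterization concrete, here is a minimal sketch, entirely my own illustration rather than anything from the lecture, of a softmax policy head (parameters theta) and a value head (parameters w) hanging off a shared feature trunk, so that updating either head also shapes the shared features; all the sizes and names here are made up:

```python
# A minimal sketch (illustration only, not the lecture's code): a shared feature
# trunk with a softmax policy head ("theta") and a value head ("w").
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden, n_actions = 4, 16, 3

W_trunk = rng.normal(scale=0.1, size=(obs_dim, hidden))    # shared features
theta   = rng.normal(scale=0.1, size=(hidden, n_actions))  # policy head parameters
w       = rng.normal(scale=0.1, size=hidden)               # value head parameters

def features(obs):
    return np.tanh(obs @ W_trunk)

def policy(obs):
    """pi(a | s, theta): softmax over action preferences."""
    prefs = features(obs) @ theta
    prefs -= prefs.max()                # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def value(obs):
    """v(s, w): scalar value estimate from the same shared features."""
    return features(obs) @ w

obs = rng.normal(size=obs_dim)
print(policy(obs), value(obs))
```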
Zooming in a little on policy-based reinforcement learning, here are some more advantages and disadvantages. One advantage is that at least some of these methods have fairly good convergence properties, because, as we'll see, in some cases it boils down to stochastic gradient ascent, which is very similar to stochastic gradient descent. That means that with a nonlinear function you end up in a local optimum fairly reliably, depending on the assumptions you can make. Another big advantage is that it's very easy, as we'll see, to extend to high-dimensional or even continuous action spaces: because we parameterize the policy, we don't have to reason about individual actions that much anymore, and it turns out you can use the same algorithm and just plug in a different representation that happens to capture continuous actions. I'll give some examples, and there's a short sketch of one common construction right after this discussion. Another benefit, which is sometimes not as widely appreciated, is that it's useful to be able to learn stochastic policies; I'll go into that on the next slide. And one I already mentioned: sometimes policies are quite simple while the value functions or models are quite complex, like the example where the optimal policy is just to move forward but the value function is quite intricate and depends on the observations.

There are also disadvantages. It's quite susceptible to local optima, especially with nonlinear function approximation. This is shared with value functions to some extent, but for policies it can be even trickier, because the policy is also what generates your data. Also, the obtained knowledge is quite specific, by which I mean that the policy you're learning doesn't capture any information except what to do. If something then changes, there's very little you can generalize from, whereas with a model, if only one thing changes in the world, you can often learn quickly that this one thing is different while all your other knowledge stays relevant. If you're continually planning with a model, a lot of the knowledge in it might remain relevant even when the world around you changes a little. Why might the world change? Maybe there are other agents; maybe at some point somebody unlocks a door. With a model you could still know that you can walk there, and then try to open it, without relearning everything; but if you've only learned a policy that goes somewhere else, it might be very hard for that policy to switch to doing something different, simply because there isn't much knowledge in it. This is related to the third point, that we're ignoring a lot of information, which, as I mentioned before, means we might not be making the most efficient use of all the information in the data that comes at us: if you want to be efficient in learning, you want to learn all that you can from all the data that you get.
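Picking up the continuous-action point from above: one standard way to parameterize a policy over a continuous action, a common construction rather than something taken from these slides, is a Gaussian whose mean is a function of the state; the score function, the gradient of the log-probability, then has a simple closed form. A minimal sketch, with a made-up linear mean and a fixed standard deviation:

```python
# A common parameterization for continuous actions (standard construction,
# not specific to this lecture): a Gaussian policy with a linear mean.
import numpy as np

rng = np.random.default_rng(1)
obs_dim = 4
theta = rng.normal(scale=0.1, size=obs_dim)   # parameters of the policy mean
sigma = 0.5                                   # fixed exploration noise (assumed)

def sample_action(s):
    mu = theta @ s
    return rng.normal(mu, sigma)

def grad_log_pi(s, a):
    """d/dtheta log N(a; theta.s, sigma^2) = (a - mu) / sigma^2 * s."""
    mu = theta @ s
    return (a - mu) / sigma**2 * s

s = rng.normal(size=obs_dim)
a = sample_action(s)
print(a, grad_log_pi(s, a))
```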
So here's one example of why you might want stochastic policies. Consider the two-player game of rock-paper-scissors, where scissors beats paper, rock beats scissors, and paper beats rock, and consider a policy for the iterated game. Any deterministic policy, no matter which one, is very easily exploited, and in some sense a uniformly random policy is optimal if both players are learning. If the other player is not learning you could maybe exploit them, and then there might again be a good deterministic policy, but if the other player is learning as well, or is simply a good player, it will keep track of statistics of your play and exploit you whenever you're not uniformly random.

A different example, maybe even clearer because it's a single-agent case, is this very small grid world. There are eight states, and essentially three terminal states, two of which are bad and one of which is good: when you enter the one with the money bag you get a high reward, which is good; when you enter the states to either side of it, the agent dies, or gets a large negative reward, and the episode terminates. The way it's set up, the two gray states in the corridor are indistinguishable from each other in terms of the features the agent sees: say the agent can only see whether there's a wall to the north, west, south, or east. Both gray states have a wall above and a wall below, so with that observation they look identical. If that's all you can base your policy on, and you're not allowed to use memory or anything like that, then a deterministic policy can only do one thing in both of these states: it either always goes left or always goes right in both of them. Consider the one that goes left. You can start in any of these states; if you start in the top right corner, you go left, left again, then down, and you're happy. But if you start in the top left, or in the gray state next to it, you'll never actually go down into the bad states, but you'll keep going left and right forever, because whenever you enter the left gray state the deterministic policy says go left. You might want that one to point right instead, but if you flip it, the other gray state flips too, and you get the same problem on the other side. So you can get stuck and never reach the good state, just traversing the corridor indefinitely. It's not fully bad, in the sense that you never end up in the really bad states, but this is the best a deterministic policy can do here.

If you can have a stochastic policy, you can do better: in each of the indistinguishable gray states you randomly move either left or right; whenever you end up in a corner you realize you went the wrong way, go back, and try again; and whenever you end up in the middle, which you can distinguish because it's the only state with a wall only at the top, you just move down. This isn't fully efficient, but it's the best you can do, because the only reason it isn't fully efficient is that you really cannot distinguish those two gray states, and there's nothing better you can do there than to move randomly one way or the other.
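Here is a tiny simulation of a simplified version of this corridor, my own sketch rather than anything from the lecture: it collapses the grid world to its top row of five cells and treats "down" in the middle cell as reaching the goal. It just shows the memoryless deterministic policy getting stuck from some starts while the policy that is stochastic in the aliased cells always escapes:

```python
# Sketch of a simplified aliased corridor: cells 0..4, cells 1 and 3 are the
# indistinguishable grey states, and "down" in cell 2 reaches the goal.
import random

def run_episode(policy, start, max_steps=200):
    s = start
    for t in range(max_steps):
        a = policy(s)
        if s == 2 and a == 'down':
            return t + 1                         # reached the goal
        s = max(0, s - 1) if a == 'left' else min(4, s + 1)
    return None                                  # stuck, gave up

def deterministic(s):
    if s == 2: return 'down'
    if s == 0: return 'right'
    if s == 4: return 'left'
    return 'left'                                # same forced choice in both grey cells

def stochastic(s):
    if s in (1, 3):
        return random.choice(['left', 'right'])  # uniform in the aliased grey cells
    return deterministic(s)

for start in range(5):
    trials = [run_episode(stochastic, start) for _ in range(1000)]
    print(f"start {start}: deterministic -> {run_episode(deterministic, start)}, "
          f"stochastic (avg steps) -> {sum(trials) / len(trials):.1f}")
```

Starting from the left corner or the left grey cell, the deterministic version prints None (it cycles forever), while the stochastic version always reaches the goal in a few steps on average.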
But if you can only represent deterministic policies, you cannot reach this solution, so there is a benefit to being able to represent stochastic policies, and if you parameterize your policy explicitly this becomes possible. There's an example in the Sutton and Barto book where, similarly, epsilon-greedy is compared to being able to set these probabilities to arbitrary values, and it's shown that epsilon-greedy doesn't completely save you either: in this specific problem epsilon-greedy will eventually reach the desired goal state, although it might also sometimes walk into one of the bad states, but the example in the book, which is in some sense even simpler, just shows that the performance can be better if you can pick the probabilities more freely.

As an intuitive example you can also think about games with hidden information, which is the same situation here — this is essentially a POMDP. Think of something like poker, where you sometimes want to bluff, but you don't want to bluff predictably or predictably not bluff; you want to be a little unpredictable, and one way to do that is to have a genuinely stochastic policy. One thing to point out, though, is that the policy can be differently stochastic in different states, which is something we didn't have with the epsilon-greedy policies we discussed in the past. For instance, here the stochastic policy might be uniformly random between left and right in the gray states, but it puts no probability on the other actions there, and in, say, the top right state it deterministically moves left. You can represent this if you have full control over the action probabilities in each state. So when I say stochastic policy I don't mean it's random in every state; I just mean you have control over the probabilities of the different actions in each state.

I saw a good question: if you do something model-based, do you still need any stochasticity? There are multiple reasons why it might still be useful. One is simply exploration, which I'll touch on again later; this discussion wasn't about exploration, but we covered that quite a bit in the past. Another reason depends on whether the problem really is a Markov decision process. If it is, there will be a deterministic policy that is optimal, so if you can fully solve your model by planning through it somehow, say using dynamic programming, then you could find a deterministic policy. But what happens often in practice — Markov decision processes are used in reinforcement learning, but also in a field called operations research, where people build models of real problems and then solve them — is that things are modeled as Markov
decision processes, but of course these models are approximations to the real problem. The thing you're solving might be a Markov decision process, while the actual underlying problem doesn't completely fit that model, and if they don't completely fit you might be better off with something that isn't quite deterministic, that doesn't fully trust your model. Good question, thanks.

So that was intuition for why you might want policies to be stochastic; the question now is how we learn them. To talk about how we learn, we first have to decide what exactly we learn, and it's not that controversial or surprising: essentially we want a policy that has a high value, and we can basically turn that directly into an objective. There are a few different ways to do that, so I put some on the slide, because it matters a little which states you care about.

One thing you might care about is having a good policy from a certain start state. That's the first objective: we pick a certain state, or maybe a certain distribution over start states, and we want the actual value v_pi of that state to be high. This is an objective, similar to a loss except that we want to maximize it rather than minimize it, and it's a function of the policy parameters theta, because the value depends on the policy, which depends on theta. Note that this is just a definition: I'm not plugging in a learned value here, I'm just saying this is what we want to optimize.

In some cases there isn't really a natural start state you care about, and you care more about the states you actually end up in. So a second option is to define a distribution over states; here we pick the distribution mu over states induced by the policy you're following. The way to think about it is as a summation over states (if your states are continuous you can turn it into an integral; it's the same idea), where mu for each state is the ratio of time you spend in that state when following your current policy. You could of course also define an objective where this weighting is not given by the current policy — you just declare certain states important and others not — and that's perfectly fine; you'll find more discussion of that case in the book. But this is a fairly natural choice, because you might care about doing well in the states you actually end up in, and not so much about states you never visit anyway.

A very similar objective, at the bottom, is the average reward. That might look a little odd if you're used to value functions: why the average reward? How does it capture the future? Value functions were predictions about all the rewards that might come, and this one looks at the immediate reward. The reason it's still a valid and useful objective is that we're looking at all the states you might end up in, and that state distribution captures the dynamics of following the policy, so it does use the fact that your actions affect where you end up. In fact it's a very common quantity for people to put in their plots or graphs: the average reward per time step, especially when the sequences are very long.
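Written out, the three choices are the following (using mu, also written d on later slides, for the state distribution under the policy; the labels J_1, J_V, J_R are just names for this summary):

```latex
\begin{align*}
J_1(\theta)      &= v_{\pi_\theta}(s_0)
  && \text{(value of a designated start state)} \\
J_{V}(\theta)    &= \sum_s \mu_{\pi_\theta}(s)\, v_{\pi_\theta}(s)
  && \text{(average value over visited states)} \\
J_{R}(\theta)    &= \sum_s \mu_{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, r(s,a)
  && \text{(average reward per step)}
\end{align*}
```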
So the average-reward case again has this distribution over states, which is essentially the ratio of time you spend in each state. Both of these formulations, the average value and the average reward, apply just fine to a fully continuing problem that never terminates, with no episode boundaries, but we are assuming there's some meaningful region of states you end up in, so that you don't keep entering entirely new states indefinitely — that would be a very, very hard problem to solve in general. The mu distribution captures this: it basically says, when I follow this policy indefinitely, in the long run this is the fraction of time I spend in each of these states. Again, this is just defining the objective; I'm not saying anything yet about how to learn it. In practice we'll sample it, for instance by just using our actual on-policy behaviour, and that's why the state distribution on this slide depends on the current policy: it means we can roll out our policy, and the states we happen to end up in are valid samples for these objectives.

To optimize these objectives we have an optimization problem: we want to find a theta that maximizes whichever of these objectives we picked. There are of course ways to do that which don't use a gradient: I put hill climbing here as a generic optimization method, and you could use genetic algorithms or evolution strategies, which are quite popular these days. That's valid: as we will see, we can sample these objectives, you could call such a sample a fitness if you prefer, and do something evolutionary on top of it. Basically any optimization method could be used. We will mostly focus in this lecture on stochastic gradient ascent, which is just stochastic gradient descent going in the other direction; it often turns out to be quite efficient, and it's easy to use with deep neural networks, because you just backprop gradients as usual. It's not the only method, I just won't go into depth on the others.

Roughly it looks like this: you have some objective and you want to hill-climb it, and policy gradient algorithms look for a local maximum by locally following the direction in which the objective goes up. So in general we'll have some change to our parameters, delta theta, which is a small step along the gradient — with no minus sign, because we're doing gradient ascent rather than descent. If you take these steps you should be moving up, which means, if the objective is for instance the value of the states you end up in, that you're getting to states with higher values: a policy that leads you to better values. Again, so far this is just defining things, which seems fairly easy.
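In symbols, that is just the following, a sketch of the generic update, where in practice the sampled gradient estimate gets handed to whatever optimizer you prefer:

```latex
\begin{align*}
\Delta\theta &= \alpha\, \nabla_\theta J(\theta)
  && \text{(gradient ascent: no minus sign)} \\
\theta_{t+1} &= \theta_t + \alpha\, \widehat{\nabla_\theta J(\theta_t)}
  && \text{(stochastic version, using a sampled estimate of the gradient)}
\end{align*}
```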
One thing I want to note, as I did before: I'm just going to put the plain gradients on the slides, but if you have a gradient you can feed it into a more advanced optimizer like RMSProp or Adam or whatever your preferred flavour of optimizer is, and use that to update your parameters instead. For simplicity of notation I'll write everything as if we're doing the standard stochastic gradient descent or ascent algorithms. If you do this in practice in something like TensorFlow, by the way, note that the optimizers there expect a loss and will minimize it, so if you want to maximize this objective you have to pass in its negation: minimizing the negative objective is the same as maximizing the objective. This is a common error — if you run a policy gradient method and it does very, very poorly, try flipping the sign and maybe it'll work.

Okay, so we need an estimate of this gradient. It's all fine to define these things, but that doesn't yet give us a concrete algorithm we can run on data. We'll assume the policy is differentiable almost everywhere, which is a fairly natural assumption: the policy could be a linear function of the agent state, or a deep neural network of your observations or agent state, or you could even have a bunch of handcrafted expert controllers and just learn a few parameters that switch between them somehow — that's an easy way to put in domain knowledge. The goal is to compute the gradient of an expectation. (Sorry, a notation aside: I was changing notation on these slides to adhere more to the Sutton and Barto book, which uses mu for the state distribution; in the past we used d, so the d here is the same as the mu from the previous slides — just a distribution over states.) The only thing that's random in that inner expectation is the state: we plug in the true value of the state, which is not random; the state is drawn from some distribution, which could just be your start-state distribution if that's the objective you're interested in.

At first we'll use Monte Carlo samples to estimate these gradients, but to do that we first have to figure out how this expectation actually depends on theta, and for that we'll go through some steps we already went through in the bandit case. For simplicity, start with the one-step case, where there's no sequentiality: you take an action, you get a reward, and that's it. The objective then becomes: maximize your expected reward, where the expectation is over states and actions. As discussed before, we cannot just sample and then take a gradient, because the sample is just a number and no longer depends on your parameters. Instead we use the identity that the gradient of this expectation equals the expectation of the reward times the gradient of the logarithm of your policy. I'll show a proof of this on the next slide; we saw it in the second lecture as well. And why do we do this? Because
the right-hand side gives us an expected gradient that we can sample: the two sides are equal, but the right-hand side can be sampled and used in an algorithm.

The derivation uses the score function trick, sometimes called the REINFORCE trick or the log-likelihood trick — it has many names. First we just write out the expectation: there's a summation over states with the state distribution, and a summation over actions with the policy distribution, where the policy distribution is exactly the thing we're interested in — our parametric policy that gives the probability of selecting action a in state s — and then there's a reward function, which here is defined as the deterministic reward you get for that action in that state. You could also have a noisy reward; everything still goes through, and then this is just the expected reward for that state and action. Next, we push the gradient inside, all the way up to the point where it hits something that actually depends on theta, which here is just the policy. Then we multiply and divide by the probability of selecting that action; the reason for doing that is to get something that looks like an expectation again, where each action is weighted by the probability of it being selected. The next step is just notational — in the Sutton and Barto book they often keep the formulation where we just divide by the policy — but the gradient of something divided by that something is the gradient of the logarithm of that thing, because the gradient of log x is 1/x. Now we have something that looks like an expectation again: a summation over states weighted by the state distribution, and a summation over actions weighted by the probability of selecting each action. So we can write it as an expectation of the gradient of the log of pi times the reward, which is something we can sample. The d(s) is now captured by the fact that the state is random (and I apologize that it's not a mu on all the slides — it's sometimes a d; it's just a distribution over states). So when I say this is an expectation, I mean over both the states and the actions. One way to think about it: if you follow this policy, there are certain states you end up in; in the bandit case we can even pick a state distribution that doesn't depend on the policy — the states are just given to you, and each time you end up in some state your policy depends on that state. That distribution is the d(s) here. So from the first line onward we just expanded the expectation by saying there is some distribution over states, and we don't actually care what it is, because it stays fixed throughout the derivation, and at the end we just fold it back into the notation of the expectation.
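Here is that one-step derivation in one place, with d the state distribution, which does not depend on theta in the bandit case:

```latex
\begin{align*}
\nabla_\theta \mathbb{E}[R]
  &= \nabla_\theta \sum_s d(s) \sum_a \pi_\theta(a \mid s)\, r(s,a) \\
  &= \sum_s d(s) \sum_a \pi_\theta(a \mid s)\,
       \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\, r(s,a) \\
  &= \sum_s d(s) \sum_a \pi_\theta(a \mid s)\,
       \nabla_\theta \log \pi_\theta(a \mid s)\, r(s,a) \\
  &= \mathbb{E}\big[\, R \, \nabla_\theta \log \pi_\theta(A \mid S) \,\big].
\end{align*}
```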
So we didn't touch the state distribution, which we can do in the bandit case because it doesn't depend on our policy. Then we have this equality, as on the previous slides, with something we can sample, and a simple stochastic policy gradient update is just to change the parameters by the step size times the reward times the gradient of the log-probability of the action taken. This was also in the homework assignment for the bandit case, although we didn't really have states there — there was basically only one state.

One easy extension is when there are different observations: you're still doing a bandit, in the sense that you take an action, immediately get a reward, and the episode terminates, but there are different states you can be in. That's sometimes called a contextual bandit. Then you have a policy that doesn't just give you fixed action probabilities, but probabilities that depend on the state, or on the observations you get, and you can still use basically the same algorithm. For instance, the policy here could be a deep neural network, or a small neural network, or a linear function, that takes the state or the observation as input and outputs these probabilities, which lets you have different probabilities in each state.

In expectation this follows the actual gradient, which is nice, because then we know that under the usual assumptions we will reach a local optimum, or with a linear function we might even reach a global optimum; these stochastic gradient algorithms are fairly well understood. There's also a nice intuition here: what does this equation mean, how should we read it? Essentially, we're adding something to the policy parameters that is proportional to a step size alpha, the reward, and this gradient of the log-probability of the selected action. So the probability of an action you selected goes up when its reward was high, and goes down when the reward was negative, because the reward just multiplies the gradient. Now, if all your rewards are positive this still works: what will happen is that the probabilities of actions with higher rewards go up more than the probabilities of actions with lower rewards, which in the end gives you the right property, that the probability of selecting the higher-reward actions increases relative to the lower-reward ones. That's roughly the intuition, and it's why it's natural to have the reward there: it tells you how much to move in the direction of increasing the probability of the actions you selected.
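As a concrete sketch of this update — a toy example of my own, not the homework code, with a made-up two-state, three-action bandit — here is REINFORCE for a contextual bandit with a tabular softmax policy, applying exactly theta <- theta + alpha * R * grad log pi(A|S):

```python
# Toy contextual-bandit REINFORCE with a tabular softmax policy.
# The bandit itself (states, actions, mean rewards) is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, alpha = 2, 3, 0.1
mean_reward = np.array([[1.0, 0.0, -1.0],      # best action in state 0 is action 0
                        [-1.0, 0.0, 1.0]])     # best action in state 1 is action 2
theta = np.zeros((n_states, n_actions))        # action preferences per state

def pi(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

for step in range(5000):
    s = rng.integers(n_states)                     # observe a context/state
    p = pi(s)
    a = rng.choice(n_actions, p=p)                 # sample action from the policy
    r = mean_reward[s, a] + rng.normal(scale=0.5)  # noisy reward
    grad_log_pi = -p                               # softmax: grad = onehot(a) - pi(.|s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha * r * grad_log_pi            # REINFORCE update

print(np.round(pi(0), 2), np.round(pi(1), 2))      # should come to favour actions 0 and 2
```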
Just to make that more concrete, it's nice to have an example where you can see it, so we'll again consider the softmax policy, where we have some preferences. In the bandit lecture these preferences depended only on the action: we literally had a value per action, which was the preference for selecting it. Here we generalize that slightly: maybe it's a table, with no other parameters, with a preference for each state and action, so the preference can differ across states. This is the contextual bandit case again, but it also extends to sequential problems. One way to define a policy is then the softmax, or Boltzmann, policy, where we make the probability of selecting an action proportional to the exponentiated preference; the quantity we divide by, the sum of the exponentials, is just there to normalize, so that the probabilities over all actions sum to one, which of course is needed for a valid probability distribution.

If we look at the gradient, it turns out that the gradient of the logarithm of this policy is the gradient of the preference of the selected action, minus a term that comes from the normalization, from the division. Note that on the previous slide we took the gradient of the log-probability of the action you actually selected, in the state you were actually in — A_t in S_t — which is something to keep in mind here. What this does is push up the preferences for actions that got a high reward, and push down the preferences for actions that got a large negative reward. That's just an example — you can parameterize other policies as well — but the softmax is a very common one.

Now of course we want to extend this to the multi-step, full reinforcement learning problem, where there is sequentiality, and where we want values rather than just the immediate reward. It turns out there's a nice property, the policy gradient theorem, a theoretical result which says that we can basically replace the instantaneous reward with the long-term value, and it applies to the start-state objective, the average reward objective, and the average value objective. The important point is that the gradient of all of those objectives looks the same: it's the expectation of the gradient of the log-probability of the selected action times the long-term value of that action. I'll derive this in a moment so you can see it's actually the case — it's actually slightly tricky to derive accurately, and there are versions in the literature where people do something slightly different, so be aware of that. The expectation is again over states and actions: for instance, in the average value or average reward case you would just follow your policy, sample the states you actually end up in, and look at the actions you actually take there, which is a very natural thing to do — you basically just sample experience online as it is given to you, and that is a valid sample for this expectation. And importantly, the policy gradient does not need knowledge of the dynamics.
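For reference, here is the softmax score function just described and the statement of the policy gradient theorem as used here, with h denoting the preferences and q_pi the long-term action value:

```latex
\begin{align*}
\pi_\theta(a \mid s) &= \frac{e^{h_\theta(s,a)}}{\sum_b e^{h_\theta(s,b)}}, \qquad
\nabla_\theta \log \pi_\theta(a \mid s)
  = \nabla_\theta h_\theta(s,a) - \sum_b \pi_\theta(b \mid s)\, \nabla_\theta h_\theta(s,b), \\
\nabla_\theta J(\theta) &\propto
  \mathbb{E}_{\pi_\theta}\!\big[\, q_{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S) \,\big]
  \qquad \text{(policy gradient theorem).}
\end{align*}
```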
That is maybe kind of surprising: shouldn't we need to know how the policy influences the states? How can we get around not knowing that? It comes out of the derivation, which I'll now show you, and we'll step through it in a bit of detail, so stop me whenever you're confused about anything, because it's important to understand this.

Consider some trajectory, which I'll denote with this Greek symbol as a shorthand, and which has some return; the return depends on the trajectory and is a random variable. Just think of it as: you've run your robot, you restart at something we arbitrarily call time step zero, so you start in state S_0, take action A_0, get a reward — which in the sequential case we always index as time t plus one, so that's R_1 — and end up in state S_1, where you pick a new action A_1, and so on: a whole trajectory of data.

A valid objective is then just to say: I want the expected return to be high, because the return is an unbiased sample of the actual value of your policy, so the expected return is just the expected value. That means the first equation says the gradient of our objective is the gradient of the expectation of this return. Very similarly to before, we can use the score function trick to say: the gradient of this expectation is the expectation of the return times the gradient of the logarithm of the probability of that trajectory. This p here is the probability of the full trajectory, not of one intermediate step.

So what is that thing? We write it out: the gradient of the logarithm of the probability is the gradient of the log of this huge product — the probability of starting in S_0, times the probability of taking A_0 in S_0, times the probability of ending up in S_1 given S_0 and A_0, times the probability of taking A_1 in S_1, and so on. Because there's a logarithm in front, this product turns into a sum: the log of a product is the sum of the logs. So we write it out as one big sum of log terms, and then we take the gradient with respect to the policy parameters of this big sum. The gradient of a sum is the sum of the gradients of each part, and when we inspect the parts we notice that the dynamics terms, conditioned on the action — for instance the probability of ending up in state S_1 when you were in S_0 and selected A_0 — do not depend on the policy parameters, because they're already conditioned on the action. The only terms that do depend on the policy parameters are the ones with the policy in them. So this whole thing equals the gradient of the sum of just the policy terms; put differently, the gradient of each dynamics term with respect to the policy parameters is zero.
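The same steps in symbols, for a trajectory tau = (S_0, A_0, R_1, S_1, A_1, R_2, ...):

```latex
\begin{align*}
\nabla_\theta \mathbb{E}[G(\tau)]
  &= \mathbb{E}\big[\, G(\tau)\, \nabla_\theta \log p_\theta(\tau) \,\big], \\
p_\theta(\tau)
  &= p(S_0) \prod_{t \ge 0} \pi_\theta(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t), \\
\nabla_\theta \log p_\theta(\tau)
  &= \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(A_t \mid S_t)
  \qquad \text{(the dynamics terms do not depend on } \theta \text{).}
\end{align*}
```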
This means we can rewrite the objective we had at the top — this gradient of the log-probability of the trajectory — as the gradient of the sum of the action log-probabilities along the way. The way to think about that is that the parameters only affect your decisions; they don't affect the dynamics, and you can't change the dynamics, so they don't show up in the gradient. One final thing I did here is to write out the return, which is just the summation of the rewards — that's the definition of the return for this trajectory, as given at the top — so it's nice to write that out explicitly in the expectation at the bottom.

Yes — sorry, in this case I didn't put discounting in there; you should definitely consider the discounted case as well. I should probably have included the discount for generality. Thank you. (By the way, there are still many seats scattered around if people want to sit. Let me do two more slides and then we'll take a short break.)

The next point is a general one that's important, but it's also a little bit of an aside, because I'm going to use it on the next slide when we go back to the objective I just derived: we can use baselines. This was discussed in the exploration and exploitation lecture as well. One way to think about it: policy gradients look kind of intuitive, in the sense that if an action had a higher reward you increase its probability, and if it had a negative reward you decrease it. But if all the rewards are positive, you're always increasing the preferences, just some more than others, and it turns out that has higher variance than if you can push some up and others down. An easy way to reduce the variance is to use a baseline: for instance, track the average reward across all actions and just subtract it from the reward. Then the preferences move so that an action with a higher-than-expected reward has its preference increased, and one with a lower-than-expected reward has its preference decreased. That has lower variance, and it's valid to do, for the following reason.

Here I've picked some arbitrary baseline b, which I could actually have made a function of state — it's good to keep that in mind, let b be a function of state but not of the action — and I multiply it with this gradient of the log-probability of the selected action. Let's write that out a little. In the term on the left-hand side, both the state and the action are random; then I expand the action part, so now only the state is random, and the action part becomes a sum over all actions weighted by the probability of selecting each action in that state. Then I note that the gradient of the log of pi is the gradient of pi divided by pi — essentially the score function trick in reverse, what we did before but the other way around — which means we can write this as the sum
of the gradients of pi, and then I just pull the gradient outside the sum, because the sum of the gradients is the gradient of the sum. But then something interesting happens: this summation of the policy over all actions must sum to one, because it's a probability distribution. That means we're looking at the expectation of some arbitrary baseline b, which might be a function of state, times the gradient of a constant — and the gradient of a constant is zero — so this whole thing is zero. Why was I allowed to do that? Essentially because b, by assumption, does not depend on the action. If it did depend on the action, I couldn't pull it out of the sum, and then the gradient would not necessarily be zero. But as long as it doesn't depend on the action, this term is zero, which means we can add these arbitrary baselines to reduce variance without changing the expectation of the update: the expected update stays the same, we're only changing the variance.

So one thing this implies is that we can subtract a baseline. The other thing it implies concerns the big product of two summations we arrived at two slides ago: one is the return, the summation of all the rewards you've seen in this trajectory, and the other is the summation of all the gradient-log-pi terms of that same trajectory. It turns out that some of these rewards don't depend on some of these actions: in particular, the rewards up to some time step k do not depend on the actions you take after that time step — they're conditionally independent. That means we can take this expression — I've rewritten it slightly, just swapping the sums so the sum over rewards is at the end rather than at the beginning — and use this conditional independence to say: for each action, we actually only need the rewards that follow it. So for each action we only look at the return after that action was taken, which makes intuitive sense: an action only affects what happens after it; it can't affect what happened before, because of causality. But that means we can write that sum, which now starts at t rather than at zero, as the value of the policy — which brings us back to the policy gradient theorem, because this is essentially exactly what the theorem says: the first expression equals this last one, where you now sum over the whole trajectory.

This expectation looks at all the states in the trajectory, while the policy gradient theorem as I stated it earlier only looked at one of them; but if the trajectory is on-policy, it's the same distribution, so the two are the same apart from a multiplier that is the length of the trajectory. So one thing that's hidden in here a little bit is that this objective is very similar to the one from before, except that instead of considering one state we consider many states of a full trajectory at the same time, so the magnitude of these things is slightly different, but the principle is exactly the same. And this gives us back the very nice formulation where each update is just the gradient of the log-probability of one action, times the value you expect that action to have.
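Putting the two results of this derivation side by side: the baseline term vanishes in expectation, and each action only needs the return that follows it, which in expectation is the action value:

```latex
\begin{align*}
\mathbb{E}\big[\, b(S)\, \nabla_\theta \log \pi_\theta(A \mid S) \,\big]
  &= \sum_s \mu(s)\, b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
   = \sum_s \mu(s)\, b(s)\, \nabla_\theta 1 = 0, \\
\nabla_\theta J(\theta)
  &= \mathbb{E}\Big[ \sum_t \nabla_\theta \log \pi_\theta(A_t \mid S_t)\,
       \big( G_t - b(S_t) \big) \Big],
  \qquad G_t = \sum_{k \ge t} R_{k+1}, \quad
  \mathbb{E}[\,G_t \mid S_t, A_t\,] = q_{\pi_\theta}(S_t, A_t).
\end{align*}
```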
If you sample this, you could view it as one big batch update, where you consider doing many updates at once; you could also do each of them separately. You don't have to do the full sum: each term on its own gives a valid policy gradient algorithm, but just for simplicity of the derivation I kept the sum in there. The Q is just by definition: we have a return here, the rewards following time step t, but it's inside the expectation, so we can write it, by definition, as the value of that action in that state. So I didn't apply the policy gradient theorem; in fact you could view this as a proof of the policy gradient theorem, in a sense. Yes, so the reason the magnitude is different (good question) is that I started out by considering the whole trajectory: not one state in the trajectory but all states in the trajectory, so the objective here is essentially scaled by the number of states I'm considering. The policy gradient theorem, if I just go back, is for one random state and one random action, and as you see here there's a slight difference, because we're now considering a summation over random states and actions. So I'm pointing out that these are the same except for a scaling factor, which is how many states are in the summation. This is why I pointed out that you can sample the summation, but it's also perfectly valid to sample each term individually and treat them as separate updates, instead of always looking at the summation as a whole; you can easily pull the summation apart and sample each term on its own. In practice, especially if you're using deep neural networks, it's often better to use mini-batches, meaning you consider a few samples at the same time and update once for those few, not one and not all of them but somewhere in between, and this tends to give us good learning algorithms; you're essentially free to do that.
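To make those sampling choices concrete, here is a minimal sketch of a Monte Carlo (REINFORCE-style) update for a linear softmax policy, applied once per episode as a single batched update; all names, the feature-based parametrization, and the episode format are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(phi, probs, action, n_actions):
    # For a linear softmax policy pi(a|s) = softmax(theta^T phi(s))_a,
    # grad_theta log pi(a|s) = outer(phi(s), one_hot(a) - pi(.|s)).
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1.0
    return np.outer(phi, one_hot - probs)

def reinforce_episode_update(theta, episode, alpha=0.01, gamma=0.99):
    """One batched policy-gradient update from a full episode.

    episode: list of (phi, action, reward) tuples, phi = feature vector of the state.
    theta:   parameter matrix of shape (n_features, n_actions).
    """
    n_actions = theta.shape[1]
    # Discounted returns-to-go G_t for every time step, computed backwards.
    returns, g = [], 0.0
    for (_, _, r) in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    # Accumulate the gradient over the whole episode (one "batch" update);
    # you could equally apply each term as its own update, or group a few into a mini-batch.
    grad = np.zeros_like(theta)
    for (phi, a, _), g_t in zip(episode, returns):
        probs = softmax(theta.T @ phi)
        grad += g_t * grad_log_softmax(phi, probs, a, n_actions)
    return theta + alpha * grad
```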
Before going onwards with the rest of the slides, let me quickly stop here, because, also based on the very good questions I got in the break, and as I already realized before doing all of this, this material is fairly subtle and fairly hard to grok if you haven't seen it before. So I really encourage you to go and read the relevant part of Sutton and Barto's book; I believe this is now chapter 13. I'll look it up and put it on Moodle. They renumbered the chapters, so I have to check, but I think chapter 9 is function approximation these days, so most of what we did is in there. In terms of chapters: we did chapters 1 to 6; we mostly skipped over chapter 7, which I think is the n-step methods, although I said a few things about those; chapter 8 is planning, which we skipped and which I'll get back to in the next lecture; and chapter 9 is function approximation, which is actually easier to fold in because you're already getting function approximation in the other part of the course. Chapters 9 and 10 belong together in terms of function approximation, where chapter 10 is about the control case, and many of the later chapters, especially after chapter 9, become things you can look at individually without necessarily going through them in sequence, which is kind of what we're doing here as well. The policy gradient chapter, which again I believe is chapter 13, goes through these things a little more at ease. There's also a nice paper by Rich Sutton and others in which he derives and proves the policy gradient theorem, so if you want to step through that a little more at ease, which can be quite useful, I'll put that paper in the lecture materials on Moodle so you can look at it; he's a good communicator, so it's a nice paper that's fairly easy to read, and of course the chapter in the book is also very readable. So maybe for now the best thing is just to absorb it, suspend your disbelief a little bit, and trust that these things actually work; it's very good to be skeptical, but let's continue. On this slide, the last one I ended up on, I used the thing that we just proved: if something doesn't depend on the action, we can basically toss it in, because the expectation of that thing times the gradient of the log policy will be zero. I mentioned this in terms of the baseline (we can add a baseline), but then I actually used it to take things out: I used the fact that rewards don't depend on later actions, and I removed those rewards from the sum, which is a step that's often glossed over when people derive policy gradient algorithms, but I thought it was good to put it in there explicitly. Of course, and this is something we did in the exploration and exploitation lecture as well, we can also use it to add something; in this case we subtract a baseline. So this is again just the policy gradient we had down there. I kept the summation over time, over the trajectory, but you could also get rid of that summation and consider only one specific random state at a time. Because this value does not depend on the action, by the small proof we had two slides ago, the expectation of the value times the gradient of the log policy is zero, so it's completely fair game to put it in there: these things are equal in expectation, but it might reduce your variance quite a bit. But now something interesting happens, because on the previous slide we were implicitly (we haven't actually sampled it yet, but the idea behind these expectations is that you then sample them) using Monte Carlo returns to sample, and discounted ones, as mentioned for the earlier question; in general you want the discounted Monte Carlo returns if you're using discounting. But here, if we want this baseline, you don't want to plug in the same Monte Carlo return, because then you're basically subtracting zero from zero, and the Monte Carlo return also depends on the action, so you can't actually do that without changing the expectation. So instead we want to approximate this: we want to put some baseline in to reduce the variance, and maybe it's not too important that this baseline is exactly correct. A good baseline is the actual value of that state, but it's not the only baseline you could consider: you could put almost anything in there and it wouldn't change the expectation; using the actual value just tends to reduce the variance quite well.
Of course we can just approximate that value, and that's what we'll do: we estimate it explicitly, for instance using TD learning, on-policy TD learning. In addition, we can sample the other part, the Q-pi: we could sample it using a Monte Carlo return, but we could also use an n-step return. If we use the one-step return, it is just the reward plus the discounted next value according to our same approximation, and then minus the approximation for the current state, which means the whole thing becomes a TD error. So we're then multiplying the gradient of log pi with the TD error, which again has the intuitive interpretation that if your temporal difference error is positive, there was a happy surprise and you increase the probability of selecting that action again. Yes, so the Monte Carlo return is just your flat, potentially discounted, sum of rewards until termination, and when I use the notation G_t, that's the Monte Carlo return; if it has a superscript n, this means we take n steps and then bootstrap on the value function. The one-step version down here takes one extra reward from your trajectory and then bootstraps on the value at state S at time t plus one. Let me just quickly skip ahead: here's an example where we take multiple steps, a more generic formulation of an n-step return, as we call these. The n here refers to how many steps we take until we bootstrap, and we discount appropriately in this version: there's a discount factor on the bootstrap term raised to the power n, because that's how many steps we took before we actually used an estimate for the rest of the return. This value stands in for the remaining return you would have gotten if you had continued indefinitely. So keep the one-step return in mind as one possible choice. Then we can note that the critic (the value function, which we can now call a critic because of the actor-critic terminology) is solving a familiar problem: policy evaluation for the current policy. The question it's trying to answer is: what is the value of the policy, which depends on some parameters, for the current parameters? This problem was explored in the previous lecture: you could use Monte Carlo rollouts, following your policy for a while and doing regression to make the value function look like those returns, or you could bootstrap, learning a guess from a guess, using temporal difference learning or multi-step temporal difference learning. This was all discussed in the previous lecture for the function approximation case, and of course earlier in the course. To make that explicit, here's an algorithm. The critic has a value function with some parameters w, which are just the weights of your network if v is a neural network, and in this case the algorithm will use one-step temporal difference learning (not actually multi-step), while the actor uses the policy gradient update, as we've just derived it, to update the parameters of the policy.
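As a small sketch of the n-step return just described, bootstrapping on a value estimate after n steps; the function name, array layout, and termination handling are assumptions for illustration:

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return: G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n v(S_{t+n}).

    rewards[k] holds R_{k+1}, the reward after the action at time k.
    values[k] holds the current estimate v(S_k); assumed to exist for every
    visited state, including the last one (which can be 0 if terminal).
    If the episode ends before n steps, we simply stop and do not bootstrap.
    """
    T = len(rewards)              # number of transitions in the episode
    g, discount = 0.0, 1.0
    steps = min(n, T - t)         # do not run past termination
    for k in range(t, t + steps):
        g += discount * rewards[k]
        discount *= gamma
    if steps == n:                # only bootstrap if we took the full n steps
        g += discount * values[t + n]
    return g
```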
So how does it then look? First you initialize (oh, sorry, there's a w missing on the slide; you should of course also initialize the parameters of your value function) the first state, the parameters of your policy, and the parameters of your value function, and then you basically have an indefinite loop where at each step you sample an action from your current policy, which is stochastic, you apply that action in the world, and you get a reward and the next state. I write this here as sampling the reward and the next state; maybe you have a simulator, maybe you're running this on an actual physical robot, but either way you just observe a reward and the next state. Then one thing we could do is use the one-step TD error (that's one possibility), which means we use that reward and bootstrap immediately on the value at the next state to stand in for the remaining return of the current policy, and then we subtract the value at the current state, just as a baseline. This turns it into a temporal difference error, and the algorithm as a whole is sometimes called advantage actor-critic, because the action value minus the state value, which is the expectation of your temporal difference error, is sometimes called the advantage function. The terminology comes from the fact that you can view the action values as offset by the state value in some sense; if you subtract that state value out again, you're only looking at the advantage of taking one action compared to another. Some advantages will be positive and some will be negative for certain actions, and if you sample that quantity, you get your temporal difference error. Terminology-wise, this is why it's sometimes called advantage actor-critic; I just want to make you aware of these terms because they're used in the literature. Then we do a very familiar thing to update the critic parameters of our value function. I've introduced a new step size here, beta; it plays exactly the same role as the alpha we had before, but I want to point out that you might use a different step size for your policy than for your value function. Essentially the update is something we've seen before: we update the parameters of the value function w by adding a step size times the temporal difference error times the gradient of the value function with respect to those parameters, exactly as in the previous lecture. The policy gradient update then looks very similar in some sense: we add a step size times the temporal difference error, but we multiply by the gradient of the log probability of the action you actually selected, which is the policy gradient update.
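Here is a minimal sketch of that one-step advantage actor-critic loop, with a linear critic and a linear softmax actor; the environment interface (reset, step), the feature function phi, and all names are assumptions made for the example, and the step sizes alpha (actor) and beta (critic) mirror the discussion above:

```python
import numpy as np

def one_step_actor_critic(env, phi, n_features, n_actions,
                          alpha=0.01, beta=0.05, gamma=0.99, n_steps=10_000):
    """One-step advantage actor-critic with linear function approximation.

    env is assumed to expose reset() -> state and step(action) -> (state, reward, done).
    phi(state) returns a feature vector of length n_features.
    """
    theta = np.zeros((n_features, n_actions))   # actor parameters
    w = np.zeros(n_features)                    # critic parameters
    s = env.reset()
    for _ in range(n_steps):
        x = phi(s)
        logits = theta.T @ x
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = np.random.choice(n_actions, p=probs)        # sample from the stochastic policy
        s_next, r, done = env.step(a)
        x_next = phi(s_next)
        v = w @ x
        v_next = 0.0 if done else w @ x_next
        delta = r + gamma * v_next - v                  # one-step TD error (sampled advantage)
        w += beta * delta * x                           # critic: TD(0) update
        grad_log_pi = np.outer(x, np.eye(n_actions)[a] - probs)
        theta += alpha * delta * grad_log_pi            # actor: policy gradient update
        s = env.reset() if done else s_next
    return theta, w
```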
Yes, very good question: can you construct a certain parametrization of your policy so that this becomes almost equivalent to learning action values? It turns out the answer is yes. In particular, if you construct a policy parametrization that is somehow more constrained, you can set it up so that these log probabilities essentially become the action values. In general the parametrization is less constrained: the log probabilities can go up or down and don't really have the semantics of a value. But it turns out that if you put certain regularizations in there, this algorithm and neural Q-learning (where you'd have the gradient of Q rather than the gradient of log pi) can be made to look exactly alike; they can be made to do the exact same updates. That's for a specific case; in general this is a more generic thing, in the sense that these log probabilities are not constrained to have the semantics of a prediction. There's a paper by John Schulman on arXiv in which he proves, for a specific version of this and a specific version of Q-learning, that these things do the exact same updates. So, this is the slide I didn't get to in the last lecture, where we expand this a little bit more. It's the same algorithm as on the previous slide, but generalized slightly and made a little more concrete; you could say this constitutes almost a full agent, and if you implement all of it you could just run it on something, at scale perhaps, and you might get interesting results. We start off with some representation, which could be very simple: it could be that your representation just takes the observations, but I made it slightly more general, where the current agent state S_t doesn't just depend on your current observation but might also depend on your previous agent state, for instance if you're using a recurrent neural network. We have a network that maps each state to a value and a network that maps each state to a policy; these are your critic and your actor. In a specific instance of this algorithm, one thing that has been done in the past is to copy this policy a number of times and run a number of simulators, so you get a lot of data at the same time. This is just an implementation choice; it doesn't actually touch the core learning algorithm much, except in the way you generate the data, but it has been done in practice, so I wanted to put it on the slide as an example of something you might do if you have access to simulators. Then one thing we can do is construct a multi-step temporal difference loss. One thing I didn't put on the slide is that the multi-step return should have a stop-gradient on the value you bootstrap on; there should be a stop-gradient there, because otherwise you're not doing temporal difference learning, you're doing something more like the residual Bellman update we discussed last lecture. If you don't do that, it'll probably still work, especially if you use multi-step returns where n is not one but maybe 10 or 20, but it might work a little less well; it's still a valid algorithm, just a slightly different one. We can also construct a multi-step REINFORCE "loss", and I put loss in quotes because this is something we do for convenience when implementing it in, say, TensorFlow, whose optimizers expect you to give them a loss. What we derived wasn't a loss, it was a gradient, an update; but you can turn it into a pseudo-loss by getting rid of the gradient in front of the log pi. This is a slightly weird object, because it's not really a loss, but if you take the gradient of it, it turns out to do the update that we wanted to do. So that's one way to basically trick your TensorFlow program into doing the update you want: turn it into something such that, when you actually take the gradient, it becomes the update you intended. This is actually quite similar to the temporal difference case where we put the stop-gradient in: the stop-gradient is there basically to trick TensorFlow so that you're not taking the full gradient but what Sutton and Barto would call the semi-gradient, and then you can use it as a loss.
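A sketch of how those two "losses" might look in TensorFlow-style code: a stop-gradient on the bootstrapped target for the critic, and a pseudo-loss whose gradient reproduces the policy gradient update. The tensor names and the exact way the returns are assembled are assumptions for illustration, not the slide's code:

```python
import tensorflow as tf

def actor_critic_losses(log_pi_taken, values, multi_step_returns):
    """log_pi_taken:       log pi(A_t|S_t) for the actions actually taken, shape [T].
       values:             critic outputs v_w(S_t), shape [T].
       multi_step_returns: n-step returns that bootstrap on the critic, shape [T].
    """
    # Stop the gradient through the target (it contains a bootstrapped value),
    # so the critic term gives the semi-gradient TD update, not a residual one.
    targets = tf.stop_gradient(multi_step_returns)
    critic_loss = 0.5 * tf.reduce_sum(tf.square(targets - values))

    # "REINFORCE loss": not a real loss, but its gradient with respect to the
    # policy parameters is minus (return - baseline) * grad log pi, so
    # minimising it ascends the policy gradient objective.
    advantages = tf.stop_gradient(multi_step_returns - values)
    policy_loss = -tf.reduce_sum(advantages * log_pi_taken)

    return policy_loss, critic_loss
```

Note that with a minimiser the policy term does pick up a minus sign, since the derivation gives an update that ascends the objective; the critic term is already a quantity to minimise.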
Okay, and then you just toss these losses into an optimizer, for instance the Adam optimizer or something like that, and you minimize them. I'm not sure whether I put a minus sign on the REINFORCE loss there, or whether it should have one; I won't try to reason through that quickly, but you can carefully look at it and see whether it needs one. This is sometimes known as A2C, which is just short for advantage actor-critic; in the literature it's also called A3C when it's combined with asynchronous parameter updates. This is why I highlighted the part about having multiple copies: you could have multiple learners in different places, on different simulators, all trying to update one shared parameter set, with all these updates going in asynchronously, and that's why it's sometimes called A3C, for asynchronous advantage actor-critic. That's basically an implementation detail; it does change the algorithm in terms of the actual updates, but in terms of the objective it doesn't really change much, so I won't go into too much detail on that. Okay, so now we should be a little bit careful, because the policy gradient objective is: for your current policy, what is the return, and we use that to update your policy. But if you're going to approximate that, for instance by bootstrapping, you're going to introduce some bias, and this might or might not be a problem; if you have a very biased policy gradient estimate, you might not find the right solution. If you use the full return for your current policy, you just take your policy and run with it for a while, and you get an unbiased estimate of the value of that policy, so that one's fine, but it has high variance. If we bootstrap immediately, using one-step temporal difference learning, the variance will be quite low but the bias might be high, and sometimes the bias is too high and the gradient doesn't actually point in the right direction, in which case your algorithm, depending also on the function approximation and the optimizer you're using, might actually go the wrong way and lead to poor policies. This is why multi-step temporal difference methods are especially useful here, and they're very often used: we take a number of steps so that we have lower bias, but we don't take all of the steps, not a full Monte Carlo rollout, so we still reduce the variance a little, and this turns out to be quite successful in training these things. Another thing that's important is that the targets are actually for on-policy trajectories. Again, one way to do that is just to roll out your policy, say using the Monte Carlo return, all the way to the end; this gives you an on-policy estimate. But sometimes you get data for a policy that's not quite the policy you have right now: for instance, the data may have been generated with the policy you were using just before, a few updates ago; in the meantime your policy is different, but the data is from a previous policy. That's one way it could be off-policy. Another way, of course, is that the data actually comes from a completely different source: maybe you've observed data from another simulator where some other policy was run, or maybe you have data from humans acting in a certain setting.
In those cases it is very important to correct for the bias in your estimate, and for instance you can use importance sampling. I wrote it down here quite compactly, in a recursive way: the n-step importance-sampled return (the importance ratio is sometimes written shorthand as rho) can be written recursively by considering one step, and then the remaining n minus one steps, again importance-sampled at the next time step; n decreases by one at each step, and when n reaches zero, at the end, you bootstrap. We are assuming that this value is a valid approximation for the policy we're actually interested in; that is the assumption, and it will have a little bit of bias because you won't have an exact approximation there, but this is one way to define an importance-sampled return. It's very similar to the importance sampling we discussed in a previous lecture, but we're applying it recursively. Then one thing I wanted to point out: you could also do something slightly different from the n-step returns, and again I'm writing these out recursively. For now I just got rid of the importance sampling ratios, which you should put in there if you're off-policy, but for simplicity I took them out because I'm now considering the on-policy case. The recursive formulation is that taking n steps means you first take one step and then take n minus one steps (this is just a different way to write out that random return), and at the end you bootstrap. This is equivalent to writing it out in a more generic way where we don't have these two cases but only one case, with a parameter called lambda at time t plus one; the two are equivalent when lambda equals one, because in that case the value part, the one-minus-lambda term, goes to zero. So you can think of this lambda parameter as a binary switch that says: do I take one more step, or do I bootstrap here? These things are equivalent if you first step a few steps and then all of a sudden set lambda to zero, which zeroes out the rest of the return and instead uses the value to bootstrap fully. The generalization I then wanted to point out is that you could consider lambda parameters that are not binary, so you don't deterministically step a few steps and then bootstrap, but maybe on each of the steps you bootstrap a little bit, or maybe you even bootstrap conditionally. One way to correct for off-policy returns, for instance, is then to bootstrap whenever you take an action that was very unlikely, or even impossible, under the policy you're interested in now; that's a different way to think about how to correct for off-policy returns. The other reason I call this out is that lambda-returns are quite often quoted in the literature, and essentially they are a generalization of the n-step return; another way to think about them is that they can be interpreted as a mixture of n-step returns. This is all covered in chapter 12 of the Sutton and Barto book, in quite a lot of depth, much more than we'll go into for this course, and this is basically all I'm going to say about it.
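Written out, the two recursions being described look roughly like this; the placement of the importance ratios and the indexing conventions vary between texts, so treat this as one common convention rather than the exact form on the slide:

```latex
\[
G_t^{(n)} \approx \rho_t \Big( R_{t+1} + \gamma\, G_{t+1}^{(n-1)} \Big),
\qquad
G_t^{(0)} = v_w(S_t),
\qquad
\rho_t = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)},
\]
```

for the importance-sampled n-step return under behaviour policy mu, and, dropping the ratios for the on-policy case,

```latex
\[
G_t^{\lambda} = R_{t+1}
  + \gamma \Big[ (1 - \lambda_{t+1})\, v_w(S_{t+1}) + \lambda_{t+1}\, G_{t+1}^{\lambda} \Big],
\]
```

which reduces to the n-step return when lambda is one for the first n minus one steps and zero thereafter.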
Multi-step returns with a fixed n are also quite often used in practice these days, because they're quite easy to implement; one problem with a lambda-return where lambda is not zero or one is that it might mean you need indefinitely long rollouts in order to even compute it. There are ways around that, but we won't have time to go into them in this course. Okay, so this is just a generic formulation of the return; you can use it to update both your critic and your policy, or equivalently your value function, your prediction, or your policy gradient. So consider this basically an aside: feel free to think about it more, but for now you can set it aside again. I just wanted to talk more about the policy optimization part, and what is hard and what is easy about it. We already had a concrete algorithm just now which you could implement and run, and it might do something interesting, but it turns out that in practice there are certain problems which, if you think about them, are quite intuitive, but which might not be immediately apparent if you just think about the objective and optimizing it. For this reason many extensions and variants have been proposed over the years. One thing that's important is to be careful with your updates, because crucially we're changing the policy, which means we're changing the data distribution. So if you somehow mess up at some point and create a policy that is very poor (it just stands in the corner and tries to drive into the wall, or something like that), the data will become very poor, which means it becomes very hard to learn something meaningful after that. This is why it's very important to have policies that are somewhat reasonable. This is different from the supervised learning case, where learning and data are typically independent: we just have a data set and we can't mess it up with the learning process, so we can do basically anything during learning without necessarily making it impossible to recover. But because in the reinforcement learning setting the data and the learning are very tightly coupled, you have to be a little bit more careful. So one solution is to regularize the policy, for instance by making it not change too much; this is a common method in practice these days. The goal is to prevent instability, and also to keep the policy from locking in too quickly to things that aren't good, and a popular way to do that is to limit the difference between subsequent policies. So what we'll do is consider the policy before the update, which we'll call pi-old, and the policy after the update, the pi-theta we're considering changing to. Pi-old doesn't have to be exactly the policy from the previous time step, but that's one way to think about it. What we'll do here is define a divergence. If you haven't seen Kullback-Leibler divergences before, don't worry; this is just how they're defined, and the way to think about it is that a divergence is like a distance, but defined between distributions. It's not actually symmetric, so it's not strictly a metric, but that's okay; you can still use it as if it were the distance between these two policies. So what does this mean? It's very akin to a squared distance: we're basically saying we don't want pi-theta to be too different from pi-old, so we're considering a normal policy gradient update, but at the same time we're regularizing the policy not to change too much.
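As a sketch, for a discrete action space the regularizer and the combined objective could be computed like this; the function names, the epsilon smoothing, and the exact way the penalty is combined with the objective are illustrative assumptions, not the slide's definitions:

```python
import numpy as np

def kl_divergence(pi_old, pi_new, eps=1e-8):
    """KL(pi_old || pi_new) between two discrete action distributions at one state.
    Note the asymmetry: KL(p, q) != KL(q, p) in general."""
    p = np.asarray(pi_old) + eps
    q = np.asarray(pi_new) + eps
    return float(np.sum(p * np.log(p / q)))

def regularised_objective(policy_gradient_objective, pi_old, pi_new, eta=0.01):
    """Policy-gradient objective minus a penalty for straying too far from pi_old.
    eta plays the role of the small hyperparameter mentioned below."""
    return policy_gradient_objective - eta * kl_divergence(pi_old, pi_new)
```

TRPO enforces a bound like this as a constraint on each update rather than a penalty, while PPO uses a related penalty or a clipped surrogate objective; the penalised form above is just the simplest variant of the idea.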
This is a very nice thing to be able to do, because you can put other policies in there as well. For instance, if you have a certain control policy that you know is fairly safe on, say, a robot you're running, you could say: I'll take any solution you can find, but it has to stay a little bit close to this one, because I know that one is safe. The idea is just to maximize whatever objective you picked, the normal policy gradient objective of increasing the value, and you regularize by adding this term with some small hyperparameter that basically tells the system how much to care about the regularizer compared to the actual overall objective. In addition, it helps to use large batches, for many reasons, one of which is to reduce the variance. I just wanted to point out that there are multiple algorithms used a lot in practice that do this. One is called trust region policy optimization, where the way to think about it is that this regularizer basically defines a region in which we trust our new policy, which is why it's called a trust region. Similarly there's the PPO algorithm, which you could maybe consider a follow-up to TRPO and which is used quite commonly and quite successfully in practice; they both use this trick. One thing I wanted to show is that if you run this, it actually works; I showed this in the first lecture as well, I think. Of course, getting these things to work completely in practice requires a little bit of engineering and tuning and things like that, but otherwise this is basically just using one of those policy gradient algorithms, trying to optimize the reward of going forward as much as it can by changing the way this system moves. What's parametrized here is the way these limbs move relative to each other; the exact setup is not that important, but importantly, no information about potential solutions was put in, just the reward, and the policy gradient algorithm figured out how to change the parameters of the policy so that this thing walks forward. It turns out this algorithm is general enough that it doesn't really matter what the exact nature of the problem is, in some sense: it can learn to deal with all these different body types and all these different situations, and it can just learn to move forward. The previous two examples were basically constrained to a line; in this case it can actually move in the plane, it can also move left and right to avoid the obstacles, so it can basically pick either to climb over one or to move around it if possible. What is often done in these cases (very good question; the algorithm applies in both cases, and I don't actually know which one was used here, I should check), especially in these types of demonstrations, is that we don't use the actual pixel input but features. There are certain sensors on the robot, or in this case the simulated robot, and instead of the full pixel observation we use a much lower-dimensional feature vector as input, which might also include things such as feeling, in a sense, how far each joint is bent, which might be somewhat harder to read off the pixels. You could learn it from the pixels, and people have done that as well, but sometimes it's much easier if you have certain features that you know are quite useful.
Indeed, we as humans also use that: we don't just do things from our vision, we use our sense of balance, and we use the fact that we can kind of feel where our muscles are, so such inputs might be quite useful to put into the learning process. Now, the video I showed before: you can imagine doing that in multiple ways. You could imagine still having a discrete action set, where you have a number of controls (you could turn a certain amount one way, or move a joint a certain amount), or you could consider it basically free-form continuous. The algorithms we discussed can be used in both of those cases. One thing that's quite challenging in high-dimensional continuous spaces (maybe not the only thing) is exploration. So let's consider a concrete example: say you have an action that is a real-valued vector, maybe bounded somehow, or not. One thing you could do is define a Gaussian policy: we have a mean that is state-dependent, which might be parametrized with the policy parameters theta, and for simplicity let's just consider a fixed variance. If this is a multivariate Gaussian, this would just be an identity matrix; I wrote it down here as a single number, so in this case the action is a single real number. The exploration here is Gaussian, which means our policy is parametrized as a Gaussian distribution, which means the gradient of the log probability has a nice, pleasing, intuitive form: for a given action you're considering, the gradient of its log probability is the difference between the action and the mean of the Gaussian, divided by the variance, times the gradient of the mean with respect to the parameters. The mean here should probably be explicitly parametrized with theta, and the gradient should also have theta as a subscript, to be completely clear. Then, for instance, you can just use the algorithms we talked about, REINFORCE or advantage actor-critic-like policy gradient algorithms, to update the parameters of this policy; you could of course also parametrize the variance and learn that as well.
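A small sketch of that Gaussian-policy gradient, with a linear state-dependent mean and a fixed variance; the names, the linear parametrization, and the update helper are illustrative assumptions:

```python
import numpy as np

def gaussian_policy_grad_log_prob(theta, phi, action, sigma=1.0):
    """For pi_theta(a|s) = N(a; mu_theta(s), sigma^2) with mu_theta(s) = theta^T phi(s),
    grad_theta log pi_theta(a|s) = (a - mu_theta(s)) / sigma^2 * grad_theta mu_theta(s)
                                 = (a - theta^T phi(s)) / sigma^2 * phi(s).
    """
    mu = theta @ phi
    return (action - mu) / (sigma ** 2) * phi

def gaussian_policy_update(theta, phi, action, weight, alpha=0.01, sigma=1.0):
    """One policy-gradient step; 'weight' can be a return, an advantage, or a TD error."""
    return theta + alpha * weight * gaussian_policy_grad_log_prob(theta, phi, action, sigma)
```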
One other thing you could do, just to show that these intuitive algorithms can work quite well, is something a little more explicit and simpler than what we were doing before. We pick an action from the policy, which might be Gaussian over a continuous action, we look at the temporal difference error, and whenever the temporal difference error is positive, meaning we're happily surprised, we move the output of our actor towards that action. So we're not explicitly doing a policy gradient now; we're doing something simpler: we have an actor that outputs a single number, we add some Gaussian noise around it, we evaluate that action, and we see whether it was good, so we're doing something like hill climbing. It turns out this works, and you can use this simple algorithm in interesting ways. This is another video in which they train basically something similar to the thing you saw just now; again, this doesn't necessarily work out of the box immediately, and it takes a little bit of work even to define what the algorithm sees (this is again not done from pixels), but then you can make it work. Okay, so I guess we're running out of time, so we'll continue next week, when I'll talk a little bit more about policy gradients and about planning. Thank you, and sorry about that.
Info
Channel: DeepMind
Views: 57,446
Id: bRfUxQs6xIM
Length: 94min 41sec (5681 seconds)
Published: Fri Nov 23 2018