Deep RL Bootcamp Lecture 5: Natural Policy Gradients, TRPO, PPO

Captions
All right, hi everyone, I'm John. I'm going to build on two of the previous lectures: Andrej's lecture, and Pieter's lecture from earlier where he talked about actor-critic methods. I'm going to talk about more advanced optimization methods to use along with policy gradients.

What Andrej just talked about in the last lecture could be called a vanilla policy gradient method, because it's essentially just getting a policy gradient estimate and plugging it into stochastic gradient descent. There are various ways of modifying that, and bells and whistles you can add; Pieter talked a lot about actor-critic methods that use the value function to get a lower-variance advantage estimate. So you can do a lot to get better advantage estimates, but I'm actually not going to talk about how to get better advantage estimates. I'm going to talk about changing the underlying optimization algorithm: once you've computed your advantage estimate, how do you update the policy with it?

Essentially, there are two things we need to improve on from the vanilla policy gradient methods. One issue is that it's pretty hard to choose good step sizes for them, and in particular to choose a single step size that's going to work over the whole course of the optimization process. In reinforcement learning, one difference from supervised learning is that the data your network receives as input is non-stationary: as your policy changes, you see a different distribution of observations and rewards. You might, for example, be getting zero reward for a while, and then at a certain point your policy randomly gets a positive reward and your reward starts going up. So the statistics of your inputs change over time, which makes it particularly hard to choose a good step size that works over the whole course of the learning process. You can address that somewhat with normalization techniques, but it remains a problem.

Another issue with step sizes is that a bad step, or too big a step, is more damaging in the reinforcement learning setting than in the supervised learning setting. If you mess up your policy, you start collecting bad data: if you change your policy by too much, you'll visit a totally different part of state space, you'll have a bad policy, and you won't be able to recover, because all the data you see from that point forward comes from this bad state distribution. So you can have a disastrous drop in performance if you take a bad step, and this makes it really hard to design a nice, stable algorithm.

The other issue with vanilla policy gradient methods is their sample efficiency. When you collect some data, if you just use it to compute a single gradient and then throw it away, you're probably not getting all the available information out of it. You can see that especially well if you imagine your network is scaled poorly, say the gradients on some components are 10^5 times bigger than the gradients on other components. Then you'll make very little progress on the components with tiny gradients if you're just doing normal stochastic gradient descent.
So you might have really slow learning, and we might be able to do better with a more sophisticated optimization method: instead of just taking a single gradient step with the data, you solve a little optimization problem whenever data comes in.

I want to make that point in a slightly more philosophical way, describing what we'd like to do here. A lot of modern machine learning is about reducing the learning problem to numerical optimization. Learning is about being able to make some kind of prediction on data you haven't seen before, and what deep learning is all about is writing down some kind of numerical optimization problem and then using a gradient descent algorithm to approximately solve it. In supervised learning you're just minimizing training error; what you really care about is test error, but if you minimize training error that's usually pretty good, especially if you tweak regularization and things like that.

In RL the situation is not as clear-cut. We'd like to write down an optimization problem we can solve, which uses all the data we've seen so far and computes the best policy from it, but it's not clear how to do that: what optimization problem should we write down? Q-learning is actually quite nice here, because you can take any set of transitions and fit a Q function with them (you can try to fit Q-star), since the Bellman equation makes sense no matter what policy your data was sampled from. But this comes at the cost of optimizing the wrong objective: in Q-learning you're trying to minimize some kind of Q-function Bellman error, when what you really care about is that your policy performs well. Policy gradient methods, in contrast, are actually optimizing the thing you care about, which is the performance of the policy, but the downside is that they're not as good at using old data, at using all the data you have available.

So this lecture is going to improve a little bit on the vanilla policy gradient methods: instead of taking our gradient estimate and taking one step with it, we're going to write down a little optimization problem based on the data collected from the current policy, and we're going to solve that optimization problem approximately, so we get a bit more information out of our data. It's still not perfect, because you can't use all the data you've ever collected, but it's an improvement.

So the next question is what loss we should be optimizing. We want to write down some kind of loss function to define a policy gradient method, but what kind of loss should we use? First let's just look at our vanilla policy gradient formula. This is just one way of writing it: we have an empirical expectation (the hat means we collected a bunch of timesteps and are averaging over them, averaging over t), and we take grad log probability of the action times the advantage. That's our policy gradient formula, and we can write down a loss function such that differentiating it gives us the policy gradient. One choice is to write down log probability times the advantage estimate, with the advantage treated as a constant; obviously, if you take the gradient of that, you get our policy gradient estimate.
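Written out, the estimator and the loss just described are (using $\hat{A}_t$ for the advantage estimate and $\hat{\mathbb{E}}_t$ for the empirical average over timesteps):

$$\hat{g} = \hat{\mathbb{E}}_t\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big], \qquad L^{PG}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big].$$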
And this is actually what you want to do if you want to implement policy gradients with an autodiff library: you just write down this loss and then differentiate it. But the thing about this loss is that it doesn't make sense as a loss you can take many optimization steps on. You don't want to optimize it to convergence: if the advantage is positive, you'll try to drive the log probability to infinity (or, in a discrete action space, drive the probability of that action to 1), and if the advantage is negative you'll drive the probability to 0. But the advantage estimate is noisy, so you might have just gotten a noisy estimate, and you'd radically change your policy based on it, which will probably make your policy worse. So this clearly isn't exactly right.

I'm not going to fix that problem yet, but there's another version of this loss that's also interesting to look at and gives us the same policy gradient estimate. Instead of writing it as log probability times advantage, we can write it as a probability ratio times advantage: if we write down this ratio and differentiate it, we get the exact same gradient, just because of the chain rule; grad log of some function is grad of the function divided by the function value. So this formula is just as good, and it also has an interesting interpretation in terms of importance sampling. One sensible thing to do is to say we're going to collect data from some policy, which I'll call pi_theta_old, and the new (updated) policy is called pi_theta. We want pi_theta to have high expected advantage, that is, we want pi_theta to take positive-advantage actions, so we write down that expectation and then do importance sampling. Importance sampling lets you estimate what an expectation would be under one distribution even though you collected samples from some other distribution. If you do that, you arrive at the formula I wrote on the previous slide: we want to choose a policy pi_theta such that the expected advantage under that policy is high, where the states are still collected from the old policy pi_theta_old.

I don't want to get too bogged down in this, but it doesn't solve the big problem, which is that we want the policy update to be small. We can take a gradient step on this objective, and if it's a small step we're okay, but we don't want to optimize this objective fully, because then we'd make too big an update.
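As a minimal illustration of these two losses, here is a PyTorch-style sketch; the tensors `logp`, `logp_old`, and `adv` are assumed placeholders for the log-probabilities and advantage estimates you would compute from your own policy and advantage estimator, not code from the lecture:

```python
import torch

def surrogate_losses(logp, logp_old, adv):
    """Two losses whose gradient (at theta = theta_old) is the policy gradient.

    logp     -- log pi_theta(a_t | s_t), differentiable w.r.t. theta
    logp_old -- log pi_theta_old(a_t | s_t), held fixed
    adv      -- advantage estimates A_t, held fixed
    """
    adv = adv.detach()                        # the advantage is treated as a constant
    L_pg = (logp * adv).mean()                # "log prob times advantage" form
    ratio = torch.exp(logp - logp_old.detach())
    L_is = (ratio * adv).mean()               # importance-sampling ratio form
    # Maximizing either loss without a constraint drives the action
    # probabilities to extremes, which is why a trust region is needed.
    return L_pg, L_is
```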
So here's a way to solve that issue: use what's called a trust region. This is a well-known kind of method in optimization, where you say: I have a function I want to optimize, and I have some local approximation of it which is accurate locally but gets really inaccurate if you go far away from your starting point. So we have a trust region where we trust our local approximation; as long as we're inside the trust region we're willing to optimize the local approximation a lot, but if we get outside of it, all bets are off.

The way these methods work is you say: I want to optimize this objective, which is our local approximation, subject to a constraint that we don't move too far away from the starting point. There are lots of ways to define this constraint. For example, we could require that the Euclidean distance between our starting point theta_old and our final point theta is small. But instead of using Euclidean distance, we're going to use the KL divergence between the old policy pi_theta_old and the new policy pi_theta; that's what this KL expression means. It's just a way of measuring the distance between probability distributions. The nice thing here is that we're looking at the distributions the parameter vector determines, as opposed to looking at the parameter vector itself. It's not clear what a sensible update in parameter space is: how big an update should be in terms of Euclidean distance is hard to say, because we have some really complicated function going from parameter space to output space. But KL divergence is a unitless quantity with a nice interpretation: we're saying the action probabilities shouldn't change too much. That's very robust and it isn't problem-dependent, so we can define hyperparameters that make sense over a wide range of problems.

So what we do is write down this surrogate objective (recall that differentiating it just gives the policy gradient) subject to a KL constraint. The exact form comes out of some theory, but it doesn't make much of a difference in practice; we're going to approximate the KL later anyway, and the ordering doesn't really matter. So this is the constrained optimization problem we want to solve. It's also worth considering a penalized optimization problem instead of the constrained one: we could just penalize the KL divergence rather than putting a hard constraint on it, and by the method of Lagrange multipliers these are essentially equivalent, in the sense that the solution of the constrained problem is, for some coefficient, also a solution of the penalized problem.

Let's see, I had a couple of questions. The question was: have people tried using different distances here, like the Wasserstein distance, which has been used for GANs? Actually, it isn't going to make that big a difference what distance you use, because we usually use pretty simple distributions over actions, so any kind of distance on probability distributions is going to do roughly the same thing; there's not too much to say about different distances between Gaussian distributions. If you used more sophisticated distributions, it might become relevant. Another question: why use the penalty instead of the hard constraint? I'm going to talk about the trade-offs between the two later; here I'm just saying they're equivalent in a certain sense, because for any given hard constraint we can choose some beta so that we get the same result by using the penalty.
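In symbols, the constrained problem and its penalized counterpart described here are (with $\delta$ the trust-region size and $\beta$ the penalty coefficient):

$$\max_\theta \ \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] \quad \text{s.t.} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\big]\right] \le \delta,$$

$$\max_\theta \ \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] - \beta\,\hat{\mathbb{E}}_t\!\left[\mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\big]\right].$$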
Okay, I'm going to move on. It turns out there's a theoretical result that pertains to this penalized objective. It's not exactly the version I wrote on the previous slide: if you take a max over state space instead of a mean, so that you penalize the max KL between the new policy and the old policy over all states, then you get something that's actually a lower bound on the performance of the policy pi_theta. The point is, say we sampled some data with pi_theta_old and now we want a new policy pi_theta; we want to write down an objective such that maximizing it actually improves the performance of the policy. It's not obvious how to do that, because we don't have analytic access to the policy's performance (to calculate the performance you need to interact with the environment), and we want an objective we can optimize numerically without interacting with the environment. But it turns out we can write down an objective that is a lower bound on the performance of the policy, involving a max of the KL over state space, and if we maximize that, we're guaranteed to improve the policy.

We can illustrate that with this diagram. Say this red curve, eta, is the actual performance as a function of the policy parameter theta. If we compute the surrogate loss, the thing before we added the constraint or penalty, that's a good local approximation with the same gradient, but if you go too far away all bets are off; if we tried to maximize it we'd go all the way over here. If we add in the penalty, we get this blue curve, which is now a lower bound on the performance. If we maximize the blue curve we end up over here, and since it's a lower bound, we're guaranteed to make at least as much improvement on the real performance as we made on the lower bound. So that's pretty nice: we can take a non-trivial policy update and be guaranteed to improve the policy.

In practice we have to make some approximations to the theory, because it's way too conservative about the kind of update it tells you to do: it tells you to take way too tiny an update. So we use the mean KL over state space instead of the max KL, we use an empirical, sample-based approximation to it, and we take a bigger step than the theory recommends.

Putting this all together in pseudocode, each iteration looks like this: first you sample a bunch of data from your policy, running it for T timesteps or N trajectories using the current policy parameters; then you go back over all the trajectories (or trajectory pieces) and compute the advantage function at all timesteps; then you write down the constrained optimization problem, maximize the surrogate loss subject to the KL constraint, and it turns out you can solve that constrained optimization problem approximately and efficiently using conjugate gradient, which I'll describe in a moment; and then you repeat.
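A rough sketch of that outer loop in Python might look like the following; `run_policy`, `estimate_advantages`, and `solve_trust_region_subproblem` are hypothetical helper functions standing in for the data collection, advantage estimation, and constrained-optimization steps, and `kl_limit` is an assumed hyperparameter:

```python
def trpo_outer_loop(policy, env, num_iterations, kl_limit=0.01):
    """Rough sketch of the iteration structure, not a full implementation."""
    for _ in range(num_iterations):
        # 1. Roll out the current policy for T timesteps or N trajectories.
        trajectories = run_policy(policy, env)
        # 2. Estimate the advantage function at all visited timesteps.
        advantages = estimate_advantages(trajectories)
        # 3. Approximately maximize E[ratio * advantage] subject to
        #    mean KL(pi_old, pi_new) <= kl_limit (e.g. via conjugate gradient).
        solve_trust_region_subproblem(policy, trajectories, advantages, kl_limit)
        # 4. Repeat with the updated policy.
```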
This is an iterative process, and it's similar to some other methods like the natural policy gradient, natural actor-critic, and REPS; this idea has resurfaced many times and will continue to.

Now I'm going to talk a little more about how you do this efficient solution using conjugate gradient. This is actually not new to TRPO; these are well-established methods from numerical optimization. The idea is that you have some nonlinear optimization problem you want to solve, and what you do is repeatedly make an approximation to it and then solve that subproblem. We're going to make a pretty aggressive approximation to the nonlinear optimization problem: a quadratic approximation to the constraint and a linear approximation to the objective, and then we solve the resulting subproblems efficiently. Let's look at the penalized version of the problem and how to solve it.

First we make a linear approximation to the objective and a quadratic approximation to the KL term. Note that the KL term looks a lot like a quadratic distance function: the KL divergence has zero derivative at your starting point, because it's minimized there. The KL has two arguments (it's like a distance between two probability distributions), and if you vary one argument while keeping the other fixed, the distance is minimized when you're at the starting point and goes up smoothly as you move away from it. So it looks like a quadratic distance function, but we want to figure out exactly what matrix defines that quadratic. So we make this approximation to the problem, which is just a Taylor expansion to second order around theta_old: an affine approximation of the objective and a quadratic approximation of the constraint. I won't go into all the justifications, but the solution is just a constant multiple times F inverse times g, where F is the Hessian of the KL term and g is the gradient of the objective. As it turns out, this F is the Fisher information matrix, something that appears all over the place in probability theory. It tells you how sensitive the probability distribution is to different directions in parameter space: if I move in this direction in parameter space, how big a change to the probability distribution is that? The step you get when you solve this quadratic subproblem is called the natural policy gradient.
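Concretely, the linear-quadratic subproblem and its solution look like this (with $g$ the gradient of the surrogate objective and $F$ the Fisher information matrix, i.e. the Hessian of the KL at $\theta_{\text{old}}$; the explicit step-size expression is the standard one for this subproblem rather than something stated in the lecture):

$$\max_\theta\; g^\top(\theta-\theta_{\text{old}}) \quad \text{s.t.}\quad \tfrac{1}{2}(\theta-\theta_{\text{old}})^\top F\,(\theta-\theta_{\text{old}}) \le \delta, \qquad \theta-\theta_{\text{old}} = \sqrt{\frac{2\delta}{g^\top F^{-1}g}}\; F^{-1}g.$$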
This is all well and good, but we don't want to compute the whole Hessian explicitly, because that's really expensive. Computing the gradient of a scalar-valued function costs only a small multiple of the cost of computing the function itself (that's sometimes called the cheap gradient principle), but computing the Hessian is far more expensive: effectively you have to compute n gradients if you have n parameters. So we could go ahead and compute this Hessian, but it would be really expensive.

I had a slide on this, which I seem to have deleted by accident, but what I was going to say is that you can compute the expression F inverse times g without ever explicitly forming the matrix F, which is a pretty neat idea, using the conjugate gradient algorithm. The conjugate gradient algorithm is a way of approximately solving linear equations; in this case the linear equation we want to solve is F x = g. Conjugate gradient will approximately solve this equation without ever explicitly forming F; it just needs to take a bunch of matrix-vector products. You define a function that takes a vector x and computes F times x, you call that function a bunch of times, like ten times, and the conjugate gradient algorithm approximately solves the equation. And how do we get this matrix-vector-product function? It turns out that with a regular reverse-mode autodiff framework like TensorFlow, you can write down the gradient-vector product and then differentiate that to get a Hessian-vector product. You can check out the textbook Numerical Optimization by Nocedal and Wright, which has a really good explanation of this idea; it's used a lot in nonlinear optimization. It appeared in the deep learning community as Hessian-free optimization maybe five years back or so, and then it was used in trust region policy optimization in the deep RL setting.
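To make the conjugate-gradient and Hessian-vector-product machinery concrete, here is a compact PyTorch-style sketch, under the assumption that `mean_kl` is the (differentiable) mean KL between the old and new policies, `params` is the list of policy parameters, and `g` and `v` are detached 1-D tensors; it is an illustrative sketch, not the reference TRPO implementation:

```python
import torch

def fisher_vector_product(mean_kl, params, v, damping=1e-2):
    """Compute (F + damping*I) @ v, where F is the Hessian of mean_kl.

    Double-backward trick: differentiate the dot product of
    (d mean_kl / d params) with v, so F is never formed explicitly.
    """
    grads = torch.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad = torch.cat([gr.reshape(-1) for gr in grads])
    grad_dot_v = (flat_grad * v).sum()
    hvp = torch.autograd.grad(grad_dot_v, params, retain_graph=True)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v   # small damping keeps the solve stable

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only matrix-vector products F @ p."""
    x = torch.zeros_like(g)
    r = g.clone()    # residual g - F x, with x = 0 initially
    p = g.clone()    # search direction
    r_dot = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / p.dot(Fp)
        x = x + alpha * p
        r = r - alpha * Fp
        new_r_dot = r.dot(r)
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# Natural-gradient step direction (roughly):
#   step_dir = conjugate_gradient(
#       lambda v: fisher_vector_product(mean_kl, params, v), policy_gradient)
```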
So anyway, that's what TRPO is; I already showed you the pseudocode for it. To summarize what I've shown so far: we wanted to write down an optimization problem we can solve to do a policy update, so that we get a more sample-efficient and more robust algorithm than the vanilla policy gradient method. The first suggestion was to write down a loss whose derivative is the policy gradient: we got L^PG, which is log probability times advantage, and L^IS (IS for importance sampling), which is the probability ratio times the advantage. But that's no good on its own, because it doesn't constrain the size of the update in any way. So instead we use the KL divergence to constrain the size of the update; but then we have a constrained optimization problem, which is harder than an unconstrained one. We noted that when you make a quadratic and affine approximation to this problem, the update step looks like F inverse times g, where F is the Fisher information matrix and g is the policy gradient; this is the same as what was proposed previously as the natural policy gradient, and we can solve for this step approximately using the conjugate gradient method. TRPO also does a line search on the original constrained optimization problem, to make sure the update actually improves the objective of that constrained problem.

There's also another version of this that's worth knowing about, in which you don't use the conjugate gradient algorithm at all but instead work directly with the penalized version of the objective. This is useful if you don't want to implement conjugate gradient, or it doesn't work in your setting, or in some cases it may just work better. The idea is that we work directly with the penalized objective, and the pseudocode is: first collect data from the policy, estimate the advantage function, and then just run SGD or L-BFGS, any ordinary unconstrained optimization algorithm, on the objective with the KL penalty. That's an easy thing to do, but the problem is that it's hard to choose the penalty coefficient beta so that it works well over the whole course of the optimization. What we can do instead is choose a target KL divergence as our hyperparameter and adjust the penalty coefficient beta to try to make the constraint satisfied tightly: after each round of optimization, if the step was too small we reduce the penalty coefficient, and if the step was too big we increase it. We won't get exactly the right step size every time, but since each step is pretty small it's not that bad, and over the whole course of the optimization we'll basically stick to the right KL size.
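A minimal sketch of that adaptive-penalty rule might look like this; the factors of 1.5 and 2 are illustrative choices (roughly what the PPO paper reports), not values given in the lecture:

```python
def update_kl_penalty(beta, measured_kl, target_kl):
    """Adapt the KL penalty coefficient after each round of policy optimization.

    If the realized KL was well below the target, the step was too small,
    so loosen the penalty; if it was well above, tighten it.
    """
    if measured_kl < target_kl / 1.5:
        beta /= 2.0
    elif measured_kl > target_kl * 1.5:
        beta *= 2.0
    return beta
```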
I see a lot of hands, but I'm pretty low on time, so I want to get through my remaining slides in the next few minutes and then I'll take questions.

It's also worth looking at the connection between the trust-region problem we're solving in TRPO and other reinforcement learning methods. As I said on the last slide, if we make the linear-quadratic approximation we get the natural policy gradient. If we use no constraint at all, this is just normal policy iteration; that might not be totally obvious, but policy iteration just chooses a deterministic policy at each iteration that picks the action with the highest advantage under the previous policy, which is what you get by maximizing the surrogate with no constraint. And if you constrain the step with a Euclidean distance instead of a KL divergence and then make the linear-quadratic approximation again, you end up with the vanilla policy gradient; you can go through the logic and you'll come out with that result.

TRPO has some limitations, though. One issue is that it's hard to use with architectures that have multiple outputs: if the policy outputs action probabilities and a value function, it's not clear what you should use in place of the KL divergence, because the KL doesn't tell you anything about how to update the value-function part. Second, it empirically performs rather poorly on tasks involving deep convolutional networks and RNNs, at least on certain benchmarks like Atari; I don't know exactly why that's true, and the first issue might have something to do with it, but that's not the full explanation. And third, the conjugate gradient step makes it a bit less flexible than plain SGD, and more complicated.

So I want to talk about two alternatives to TRPO that do something similar, in that you do something like the policy gradient update with a more sophisticated step and you try to constrain the size of the update, but they constrain it in different ways.

The first is a method called K-FAC, which is actually pretty recent. The idea is to make an approximation to the Fisher information matrix that exploits the structure of a neural network. In particular, you use a block-diagonal approximation to the matrix, where each weight matrix in the network is treated as its own block of parameters, and then you assume that each block factorizes as a Kronecker product of two smaller matrices. It's actually quite simple: there's an alternative expression for the Fisher information matrix as an expectation of outer products of grad log probability. I'm running out of time, so I won't go through all the details, but you write out that expectation, which involves a product of four things, assume that two of them are independent, and you get a factorization. This lets you efficiently compute an approximation to the Fisher information matrix, and using this approximate inverse Fisher matrix you can take a natural gradient step in the same manner described previously. The only extra computation is maintaining running estimates of a couple of covariance matrices and periodically computing their inverses. There's an algorithm published a couple of weeks ago called ACKTR, which basically plugs this K-FAC natural gradient into the actor-critic algorithm A2C (which is like A3C but without the asynchrony); it's a pretty small amount of overhead on top of A2C, but it works a lot better. The limitations are that it's less straightforward to use when parameters are shared or used multiple times, and with recurrent neural networks, but it can probably be used in those cases too with a little extra work.

The other algorithm is proximal policy optimization. That's what I described before, where you use the penalty instead of the constraint. As it turns out, the version I described with the adaptive penalty is okay, but it's not the best possible thing; something that works even better is to write down an objective involving the importance ratio, like we did before, but clipping it so that it doesn't move too far from one. You're basically saying you don't want the probability ratio to change too much. That by itself wouldn't be exactly right, though, because then you'd no longer be optimizing a lower (pessimistic) bound on the policy performance, so instead you write down an element-wise minimum between the clipped and unclipped objectives. I'm low on time, so I won't explain it fully, but it's another kind of lower-bound objective that empirically works pretty well.
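For reference, a minimal PyTorch-style sketch of that clipped surrogate might look like the following; `eps` around 0.2 is a typical choice, and the tensor names are placeholders:

```python
import torch

def ppo_clip_objective(logp, logp_old, adv, eps=0.2):
    """Element-wise minimum of the unclipped and clipped surrogate terms."""
    adv = adv.detach()
    ratio = torch.exp(logp - logp_old.detach())   # pi_theta / pi_theta_old
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the minimum keeps this a pessimistic (lower-bound-style) objective,
    # so there is no incentive to push the ratio far from 1.
    return torch.min(unclipped, clipped).mean()
```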
This is a pretty easy thing to implement: you take your vanilla policy gradient implementation, switch to this funny-looking objective, and optimize it for many steps of gradient descent instead of just one. That way you get more information out of your data, and you end up doing an update that achieves a fairly reliable amount of KL divergence between the new policy and the old policy. So with this pretty simple pseudocode, PPO is a bit better than TRPO on continuous control problems, like these robotics control tasks, and it's a lot better on image-based problems like Atari, where TRPO does quite badly compared to A2C, whereas PPO does better, as does ACKTR from the previous slide. It's also compatible with RNNs and multi-output networks.

I'll send out the slides; there are some extra citations on this slide. Let me show you a cool video, because everyone likes cool videos. We had a blog post for this algorithm, PPO, where we trained this humanoid to run around. This environment is available in the repository called Roboschool, so you can actually do this yourself, and we also released the code for PPO, so you can train a policy to do this yourself, though we didn't release the exact training code for this demo because it was messy. You can see this guy is trying to run toward the red ball, which gets randomly placed around it every few seconds, and meanwhile it's getting pelted by cubes being thrown at it. It's able to get up off the ground and also steer toward the red ball. It's not perfect (sometimes it gets stuck on the ground for a few seconds), but it's actually pretty good, and it can recover from getting hit in all sorts of ways. Whether that's impressive is up to you to decide; I think there are some interesting questions there. When we do it without the pelting, it can basically run perfectly and never falls over; with the pelting it's a little harder, so it's not always perfect, but this guy can turn on a dime, which is quite nice. After training that policy, I wrote a little script to control the ball with the keyboard, and even though this is a bit out of sample, since we never trained it on this exact distribution of tasks, it handles it perfectly well and runs around as you steer it with the keyboard. So you can imagine a video game where everything is fully physical. You can also use this on a more realistic-looking robot like the Atlas robot; these really cool results are from my colleague Oleg, who trained all these crazy policies.

We also have a recent blog post on the ACKTR method I was talking about, and we have code for a bunch of these things: A2C, ACKTR, DDPG, DQN, PPO, and TRPO. I'm going to add some more tests and more documentation, but it's up there, so you can use it if you're willing to dig into the code a little bit.
Oh, and to train the humanoid that runs around and gets pelted by things, I think it took less than a day on one machine with about thirty-two cores. Okay, I'm wrapping up; that's all. Questions? Thank you. [Applause]
Info
Channel: AI Prism
Views: 37,168
Rating: 4.814815 out of 5
Id: xvRrgxcpaHY
Length: 41min 0sec (2460 seconds)
Published: Thu Oct 05 2017