CS885 Lecture 15a: Trust Region Policy Optimization (Presenter: Shivam Kalra)

Video Statistics and Information

Captions
So today's topic is Trust Region Policy Optimization, from the 2015 paper by John Schulman and others. Before we start discussing trust region policy updates, let's look at the hierarchy of reinforcement learning algorithms. These algorithms split into action-value methods and policy gradient methods. You have already worked on the action-value side, which is Q-learning and so on, and we have also seen actor-critic in the lectures and a little bit about A2C as well. TRPO is classified under policy gradients, but it is not really a method on its own: it is an optimization technique that could also be used within actor-critic. Basically, once you have the gradient, you can adjust the update according to the trust region, which I will discuss.

All policy gradient algorithms follow a similar routine. You start by collecting N trajectories for a given policy (initially some starting policy), then you estimate the advantage function A. Even in actor-critic, all you are doing is getting a better estimate of A; otherwise it follows the same procedure. Then you compute the policy gradient and update the policy.

Let's look at the problems with the vanilla policy gradient, without TRPO. The first problem is that if your policy is parameterized, if it is a function approximator, then the input to that function approximator is non-stationary: at the end of each step you change the distribution of the input as well as of the output (the output being the reward, or the advantage in this case), and this makes everything unstable. So one problem is non-stationary input data, and it happens because you update your policy and then collect data according to the updated policy. Another disadvantage is that the advantage estimate is essentially random when you start collecting from the very first policies; it keeps telling the policy that it is bad, and the policy can get stuck in a loop of never detecting where the optimal path is.

What we need is a more carefully crafted policy update: when we update the policy, we want improvement, not degradation. The idea of TRPO is that we can update the old policy π_old to a new policy π̃ such that they are at most a trust region apart; such a conservative policy update allows improvement instead of degradation. I will explain later what the trust region, or trusted distance, is in this case. I also want to mention that we can turn the update itself into an optimization problem: instead of just following wherever the advantage is highest with the update θ = θ_old + α g, we can solve some kind of optimization that gives a better update, and I will show you why this matters.

From RL to optimization: most machine learning algorithms are optimization, right? A supervised learning algorithm tries to reduce the training loss, which in turn is supposed to correspond to the test loss (it may or may not, but we assume it does). So what is the policy gradient optimizing? It is simply favoring the states and actions that give more advantage, but it is not really optimizing anything; that is the vanilla policy gradient.
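As a concrete picture of that generic loop (collect trajectories, estimate advantages, compute the gradient g, take the step θ = θ_old + α g), here is a minimal sketch in Python. The two-state toy MDP, the tabular softmax policy, and every constant in it are made-up assumptions for illustration only, not anything from the talk or the paper.

```python
import numpy as np

n_states, n_actions, gamma, alpha = 2, 2, 0.99, 0.05
theta = np.zeros((n_states, n_actions))            # parameters of a tabular softmax policy

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def env_step(s, a):                                # toy dynamics and reward (invented)
    return np.random.randint(n_states), float(s == 1 and a == 1)

for iteration in range(200):
    samples = []                                   # (state, action, return-to-go)
    for _ in range(10):                            # 1. collect N trajectories with the current policy
        s, traj = np.random.randint(n_states), []
        for _ in range(20):
            a = np.random.choice(n_actions, p=policy(s))
            s_next, r = env_step(s, a)
            traj.append((s, a, r))
            s = s_next
        G = 0.0
        for s_t, a_t, r_t in reversed(traj):
            G = r_t + gamma * G
            samples.append((s_t, a_t, G))
    baseline = np.mean([G for _, _, G in samples]) # 2. crude advantage estimate: A ~ G - baseline
    grad = np.zeros_like(theta)
    for s_t, a_t, G in samples:                    # 3. policy gradient: E[grad log pi(a|s) * A]
        grad_log = -policy(s_t)
        grad_log[a_t] += 1.0                       # derivative of log-softmax w.r.t. the logits
        grad[s_t] += grad_log * (G - baseline)
    theta += alpha * grad / len(samples)           # 4. vanilla update: theta <- theta_old + alpha * g
```

Nothing here constrains how far the new policy moves from the old one; that is exactly the gap TRPO addresses.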
TRPO, by the way, stands for Trust Region Policy Optimization. It suggests that we can write down an optimization problem that allows us to make a small update to the policy, based on data sampled from that policy, in a way that ensures there is always an improvement in the policy and no degradation.

First, what do we need to optimize? In any policy gradient method we have η(π), the expected cumulative discounted reward, which is also what turns into the advantage as we have seen. We collect data with π_old and optimize an objective to get a new policy π̃; that is what we will do in TRPO. Actually, not in this paper but in a 2002 paper, Kakade and Langford proved an identity that is the basis of the TRPO algorithm. It says that the expected return of the new policy equals the expected return of the old policy plus the advantages of the old policy accumulated along trajectories sampled from the new policy. It basically lets you relate the old policy to the new policy we are trying to move to.

In the TRPO paper, Schulman and his co-authors rewrite that 2002 equation. The only thing they change is the expectation over trajectories: they express it using the discounted state visitation frequency of the new policy, together with the probability of each action under the new policy and the advantage function of the old policy. So the new expected return is the old expected return plus this term, and if we can somehow guarantee that the expected advantage at all of these states is non-negative, then the new expected return is guaranteed to be an improvement over the old policy. This is not what we usually do: we generally just update in the direction of A, flow the gradient into the old policy, and expect something to happen. Here, instead, we get a guaranteed improvement over the old policy, measured by the expected return of the new policy, using data sampled from the old policy, which I think is quite interesting.

I want to repeat this equation because it is quite important: ρ_π̃ is the state visitation distribution of the new policy, and π̃ is the new policy itself. The problem, however, is that ρ_π̃ is very difficult to estimate, so we cannot optimize this function in its current form with any optimization technique. What TRPO suggests is to replace the state visitation of the new policy with that of the old policy, because we already have data sampled from the old policy. Since that obviously changes the function, we no longer call it η(π̃); we call it L_π(π̃). They are two different functions, but if we can bound the difference between η and L, we get a good local approximation of η. I will show you what I mean, but I hope the idea is clear: we change the basis to simplify the optimization, because we already have data sampled from the old policy π, not from π̃, a new policy we do not even know yet.
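For reference, here are the two expressions being described, reconstructed from the TRPO paper: the Kakade and Langford identity rewritten with discounted visitation frequencies, and the surrogate L obtained by swapping the new policy's visitation frequency for the old one's.

```latex
\eta(\tilde{\pi})
  = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\Big]
  = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)

L_{\pi}(\tilde{\pi})
  = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)
```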
So yes, what we wanted from the beginning was to estimate η(π̃), the expected return of the new policy. That is very difficult to estimate directly, so we converted it into L, which is somewhat simpler to estimate. We are saying that L is a local approximation of the expected return of the new policy. However, it is only accurate in a certain region: it is not a global estimate of η, only a local one, and it is only valid within a certain region called the trust region. If we update our policy inside that trust region, a monotonic improvement is guaranteed.

What I mean is this: in the green region of the figure, L closely tracks the actual η, the return of the new policy; if you go into the red region, L and η diverge. How far we allow ourselves to go is controlled by a hyperparameter called δ. So what the diagram shows is that if this is my old policy parameter and I update within the green region, I can guarantee there is an improvement in that direction.

It turns out the TRPO paper gives a direct relationship between η and L, building on a proof given in the 2002 paper by Kakade and Langford; they are extending that result. The bound says the actual expected return is greater than or equal to the local approximation L minus C times the maximum KL divergence between the new policy and the old policy, where C is a particular constant. So the policy can always be improved monotonically: if we pick the next policy as the argmax of this lower bound, we stay within the bound on the expected return of the new policy.

This kind of algorithm is called minorization-maximization (MM). What I mean is: if this curve is the actual η function, then at our current iterate we build a surrogate, which in this case is L minus C times the maximum KL divergence, and that surrogate is a lower bound on η in the local neighbourhood. To pick a good next policy we simply maximize this surrogate, and if we keep doing this we make sure the policy improves at every step.

So far this was just theoretical background, without discussing a parameterized policy. How does this work when the policy is a parameterized function of θ? The policies are now parameterized, so in the surrogate function, instead of writing π we write θ, because π is a function of θ, and we can simply rewrite the same objective in terms of θ_old and θ.
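Written out (again following the statement in the TRPO paper), the lower bound and the resulting minorization-maximization iteration are:

```latex
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}},
\quad \epsilon = \max_{s,a} \lvert A_{\pi}(s,a) \rvert

\pi_{i+1} = \arg\max_{\pi} \Big[ L_{\pi_i}(\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi_i, \pi) \Big]
```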
In practice, if we use the value of C given in the paper, these update steps are very small: each improvement from one black dot to the next in the diagram is tiny, so you would have to take a huge number of steps to make any progress, and that is not good. What the TRPO procedure suggests is that one way to take larger steps is to constrain the KL divergence between the new policy and the old policy; that constraint is the trust region. So instead of using the penalty with C, we simply add a constraint: we maximize L, subject to the maximum KL divergence between the old parameters and the new parameters being less than some hyperparameter δ. There is no C any more, and we can take much larger updates to the policy instead of changing it in tiny steps.

This is how they solve that problem. We have to maximize the objective over θ, and they also change the maximum KL divergence to the mean KL divergence, because they say it is better to optimize the mean of something rather than a max. They do not really provide much discussion of why the mean should be used instead of the max, or whether it causes any difference, so I assume it works. Then they make a linear approximation of L and a quadratic approximation of the KL term; you can also see this from the graph I showed, where L is roughly linear near θ_old while the penalty term is quadratic. So L is approximated by g · (θ - θ_old), and the mean KL divergence is approximated by the quadratic form (1/2)(θ - θ_old)ᵀ F (θ - θ_old), where g is the gradient of L and F is the Hessian matrix of the KL term.

With this linear approximation of L and quadratic approximation of the KL term, the solution of the KL-penalized problem is a step proportional to F⁻¹ g, so to get the new parameters you essentially need F⁻¹ g, where F is the Hessian of the KL term. But it is computationally very expensive to form and invert this Hessian matrix, so they approximate the computation to make the algorithm cheaper, using a technique that has existed for maybe 40 or 50 years called the conjugate gradient algorithm. What conjugate gradient does is compute F⁻¹ g, where F is the Hessian, without ever explicitly knowing what F is.
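Here is a minimal sketch of that conjugate gradient routine. The assumption is that we can compute Hessian-vector (Fisher-vector) products, i.e. a function fvp(v) that returns F v without materializing F, so that F⁻¹ g is approximated without ever forming F.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only matrix-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual r = g - F x (x starts at 0)
    p = g.copy()                      # first search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        step = rr / (p @ Fp)          # optimal step size along the current direction
        x += step * p
        r -= step * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p     # next direction, conjugate with respect to F
        rr = rr_new
    return x

# Tiny sanity check on an explicit 2x2 matrix (for illustration only):
F = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, 2.0])
assert np.allclose(conjugate_gradient(lambda v: F @ v, g), np.linalg.solve(F, g))
```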
I'll give you a very quick intuition behind it; the paper does not give one, but I will. On a convex quadratic surface, ordinary gradient descent moves along the purple path in the figure, making many zig-zag steps. With conjugate gradient, the next direction you take is not the direction of the gradient; instead, you take a direction that is conjugate to the previous (green) one. Conjugate does not mean at a right angle; it is not orthogonal in the usual sense, but orthogonal with respect to the surface: if the surface is stretched out, the conjugate directions are stretched out too, and if it is a perfectly round surface they are always at ninety degrees. This makes sure the search does not flow back into directions we have already covered. That is what they use in order to work with the Hessian matrix.

So what is TRPO? What we discussed so far was simply maximizing this surrogate term with respect to θ. What TRPO does is put a hard constraint on that problem: maximize L(θ) with respect to θ, subject to the KL divergence term being less than or equal to δ. The solution remains almost the same. This δ is a hyperparameter that stays fixed over the whole learning process. To solve this constrained problem you can again use the Lagrangian, differentiate it with respect to θ, and you eventually get a solution that is almost the same as before: F is again the Hessian, which we can compute implicitly using conjugate gradient, and you get the condition that one half of the step's quadratic form with respect to F equals δ. That condition comes from the constraint we put on the problem. What it means is that you can compute an unscaled step and then rescale it so that the constraint is satisfied: you take the unscaled direction s and multiply it by the scalar sqrt(2δ / (sᵀ F s)), and this ensures the constraint is followed.

So the TRPO algorithm looks like this, and it is almost the same as the policy gradient (a short code sketch follows at the end of this transcript). You collect N trajectories, then you estimate the advantage function; you could also use actor-critic here if you want a better estimate of the advantage. You compute the policy gradient g, then you compute F⁻¹ g (the slide writes H, but H here is the same Hessian, so either is fine) using conjugate gradient; these two lines correspond to solving that constrained subproblem. Finally you scale the step and apply it to your parameters, and this ensures that your update stays within the trust region. That is the TRPO algorithm.

I just want to show you the results they got. It is not very clear what the advantage is in terms of results, but these are their TRPO curves. I have not really told you what single path and vine are; those are just sampling techniques, and I think the next presenter will discuss them, because the two papers are very similar. I covered what TRPO is, what the trust region is, and how we can optimize it. So these are the results: it is not very clear why they are improving; it seems like they are improving, but that is it. [Applause]
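Putting the last pieces together (conjugate gradient for the direction, then the rescaling that keeps the step inside the trust region), here is the sketch referred to above. It reuses the conjugate_gradient routine sketched earlier; grad is the policy gradient g of the surrogate, fvp(v) returns the KL-Hessian-vector product F v, and delta is the trust-region size, all of which are assumed inputs rather than anything specified in the talk.

```python
import numpy as np

def trpo_step(theta_old, grad, fvp, delta):
    """One TRPO parameter update: natural-gradient direction, rescaled to the trust region."""
    s = conjugate_gradient(fvp, grad)              # unscaled direction s ~ F^{-1} g
    sFs = s @ fvp(s)                               # quadratic form s^T F s
    beta = np.sqrt(2.0 * delta / (sFs + 1e-8))     # chosen so (1/2) * (beta*s)^T F (beta*s) = delta
    return theta_old + beta * s                    # step lands on the trust-region boundary
```

The TRPO paper additionally backtracks on this step with a line search until the exact surrogate improves and the KL constraint holds, but the rescaling above is the part described in the talk.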
Info
Channel: Pascal Poupart
Views: 3,677
Rating: 4.9166665 out of 5
Id: jcF-HaBz0Vw
Length: 22min 33sec (1353 seconds)
Published: Thu Jun 21 2018