Policy Gradient Theorem Explained - Reinforcement Learning

Video Statistics and Information

Reddit Comments

This is pretty good, I started getting stuck at 45 mins in, and I'd trace it back to not truly understanding the step beginning 32 minutes in, when you introduce Ŷ. I also at that point started wanting to reconnect it to the original formulae given in the initial introduction.

4 points · u/eliminating_coasts · Nov 22 2020
Captions
in this video i want to give you an intuitive understanding of the policy gradient theorem this is the underlying idea behind the many policy gradient methods that are used to train ais some examples of ais that have been trained using some form of policy gradients are alphastar the ai trained by deepmind to play starcraft 2 and it eventually got so good that it could beat professional starcraft players another example is the ai trained by openai to play dota and it also eventually got so good that it could beat professional dota players gg is called and openai take game one and another example from robotics is the robotic hand that openai trained to be able to solve a rubik's cube and it could solve it with just a single hand which is a really impressive feat of dexterity especially for a robotic hand so when i first learned about the policy gradient theorem i learned it in a way that looks similar to this and this is openai's spinning up in deep rl website it's a great resource for learning about reinforcement learning and here they give a nice concise derivation of the basic policy gradient theorem and it usually looks like this you're trying to find the gradient of the expected return and you go through multiple mathematical steps and eventually end up with something down here which you can use to estimate the gradient so that you can improve your policy and this is how i initially learned it and i thought i understood what was happening but it wasn't until later when i actually worked through an example and tried to derive this equation myself just by thinking about how it improved the policy that i felt like i had a much better understanding so in this video i want to walk you through those steps that i took and kind of guide you through how we can come to the same conclusion as this equation does but hopefully end up with a better picture in our heads of what we're actually calculating so we're going to start off by analyzing this little game i came up with where we are this robot and it's a very simple game just to make it easy to study where we only have four positions zero one two and three we're starting off in this position number two and we only have two options we can either choose to go left or go right so if we choose to go right we'll end up at the flag we'll get a reward of plus 10 and this will end the episode or the game however you want to say it so the flag is a terminal state now if we go left we'll pick up the diamond which will give us a reward of plus three and then we'll end up on this little slope over here where we can again choose to go left or right but the game hasn't ended in this state and from here if we choose to go left we end up in the fire pit and we get a reward of negative 10 and that ends the episode so the fire pit is also a terminal state however if we choose to go right from this position there's a 90 percent chance we'll make it back up on top of the hill however there's a 10 percent chance that we'll slip on our way trying to go up the hill and end up back in the fire pit now if we make it back up on the hill that'll give us a reward of zero it's kind of like a bypass state and then we'll have the option of making another choice from this state we can choose to go right or left if we go left we'll end up back on the side of the hill and this will actually end the game just to make it simple i made it so that after three steps the game ends no matter what and this would be our third step however if from here we choose to go right we'll end up
back at the flag getting a reward of plus 10 and then the game will be over so if we look at this game in a tree structure it looks something like this we start off in this state and we can go right or left if we go right we end up at the flag and that will end this episode this run of the game and if we go left we'll end up on the hill over here from here if we go left we'll end up in the fire pit however if we go right we'll have a 90 chance of making it back up on top of the hill and a 10 chance of slipping and falling back in the fire pit and once we're back up on the hill we can either go left to end up on the side of the hill or go right to make it to the flag and get the final reward of plus 10. so you'll see the rewards we'll get in each state in these green boxes here and over on the right side of this state diagram you'll see some text and this is usually the kind of information we'll be working with when we're doing a reinforcement learning problem so we'll start off by getting an observation which will often be just a list of numbers usually floating point numbers but i just made them integers in this example to keep it simple and in this example our observation is going to be two values the first number is our position so zero one two or three we're starting off in position two and the second number is whether a diamond exists in the environment or not so right here it's one because the diamond exists however if we go left and pick up the diamond it will become zero because the diamond is no longer available so usually what we'll do is we'll pass this observation this list of numbers to our neural network which here is just our policy function and it will output two values a probability of going left and a probability of going right and these two numbers will sum to one because their probabilities and then once we enter a new state we'll get a reward when we enter that state and we'll also get a done variable which will be false if the episode is not over and will be true if the episode has ended and i wanted to add this little randomness element to the game so that you can still see how the policy gradient theorem even works in a non-deterministic environment or an environment where when you take an action you're not exactly sure what state you're going to end up in next so first let's just think about what is the optimal strategy for this game and there are pretty much two possible strategies one you either go right all the time from this initial state to get to the flag or you try to pick up the diamond and risk potentially falling in the fire pit when you try to climb back up the hill to get back to the flag so the goal of the game is to maximize the total reward you receive throughout the episode so this total reward is also called the return and the goal is to maximize the expected return which just means on average how much return we'll get if we follow that strategy so if we look at the strategy where we always initially go right no matter what from this first state the values we got out of our policy when we pass in this observation would look like this which means a hundred percent chance will go right and we'll get the flag so the expected return with this strategy is 10 because on average the total reward we'll get from following the strategy will be 10 because we'll get 10 every time now the other strategy would look like this where we always go left from the start then we always go right and then if we end up over here we'll go right again to end up at the flag 
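To make the setup concrete, here is a minimal sketch of how this toy environment could be written, assuming a gym-style reset/step interface. The class name HillGame, the position layout, and all code details are illustrative reconstructions from the description above, not code from the video.

```python
import random

class HillGame:
    """Toy environment from the video. Positions are assumed to be
    0 = fire pit, 1 = slope, 2 = hilltop (start), 3 = flag, which matches the
    observations used later in the video ([2, 1] at the start, [1, 0] on the slope).
    Observation = [position, diamond_exists]; actions: 0 = left, 1 = right."""

    LEFT, RIGHT = 0, 1

    def reset(self):
        self.position = 2
        self.diamond = 1
        self.steps = 0
        return [self.position, self.diamond]

    def step(self, action):
        self.steps += 1
        reward, done = 0, False
        if self.position == 2:
            if action == self.RIGHT:              # reach the flag
                self.position, reward, done = 3, 10, True
            else:                                 # slide down; +3 only if the diamond is still there
                self.position, reward = 1, 3 * self.diamond
                self.diamond = 0
        elif self.position == 1:
            if action == self.LEFT:               # walk into the fire pit
                self.position, reward, done = 0, -10, True
            elif random.random() < 0.9:           # 90% chance of climbing back up
                self.position, reward = 2, 0
            else:                                 # 10% chance of slipping into the fire pit
                self.position, reward, done = 0, -10, True
        if self.steps >= 3:                       # the game ends after three steps no matter what
            done = True
        return [self.position, self.diamond], reward, done
```

Sampling from this environment is all the agent gets to see; the transition probabilities and rewards inside step stay hidden from it, which is the whole premise behind comparing the two strategies that follows.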
so now what is the expected return of this strategy and there are basically two possible paths we could take we either go left here go right then end up in the fire pit when we slip or we go left then we go right and up on the hill go right again and get to the flag so if we look at the probability of both of those paths the probability that we'll end up in the fire pit is the product of all the transition probabilities on that path so there's a probability of one times one times point one or ten percent chance that we'll end up in the fire pit and we'll end up with a total reward of plus three initially for getting the diamond and then negative 10 for falling in the fire pit which will be a total of negative seven and when we multiply those together we get negative point seven now if we do the same thing for the other path we'll see that we'll get a total reward of 3 plus 0 plus 10 the zero being this intermediate state and throughout this video i've left all these zeros in the equations so that you can understand how the equations would work if these were non-zero rewards but over here our total reward would be 3 plus 0 plus 10 which is 13 and when we multiply 0.9 times 13 we get 11.7 so if we simplify those down it looks like this and when we add those together we get our expected return of 11. so on average we'll get a total reward of 11 with this strategy so compared to our other strategy where we only got an average of 10 this is the better strategy because on average we're going to get a total reward or an expected return of 11. so that's how to think about the game we're trying to maximize the expected return so now the question is how do we learn this optimal policy or at least how do we learn a good policy and we're going to think about this as if we were playing a video game where you start off and you don't actually know how the video game works you don't really know what's going to increase your score what's going to decrease your score you don't know what these probabilities are for if you try to go up the hill and you're just going to try to learn as much as you can about the environment by sampling episodes from it then you're going to try to come up with a good policy just by observing the data that you receive from interacting with the environment and the data that you have to work with are the observations the rewards and the done signals so the first thing we'll probably want to do is explore which means we would probably want to initialize our policy so that it returns 50 chance of going left and a 50 chance of going right for any state we end up in then after we gain a little bit of information we can try to improve our policy from there however in this example i'm going to initialize the policy a little bit differently i'm going to initialize it like this with these values 0.7.3 0.4.6 0.8 and 0.2 and this is just so that it makes it easier to understand what we're doing in the equations because all these values are unique numbers so it'll be easier to see which action probability we're talking about so let's imagine we start off with this policy and how can we figure out how we should change the policy so that we can increase the expected return and to start off let's just simplify this example down to one step so if we look at the example now if we go left we're guaranteed to get a reward of plus three and if we go right we're guaranteed to get a reward of plus ten so it's pretty obvious that we're going to want to go right in this situation but in policy gradients 
what we're trying to ask is how much the expected return increases for each small change to our policy and this is called finding the gradient of the expected return so if we change these values in our policy this point seven and this point three by tiny amounts how much will that change the expected return so we can think about it visually like this here we have the blue represents the reward we'll get if we go left and the red represents the reward we'll get if we go right so this is has a height of three because we'll get three reward for going left this has a height of 10 because that's the reward we get from going right and then down here you'll see the width of the blue is the probability of going left so 0.7 for 70 and this has a width of 0.3 for 30 and the total area of both of these rectangles combined is the expected return so we can see if we change the probability of going left and going right so if we go right more of the time eventually if we go right all the time our expected return will be 1 times 10. if we're going left all the time our expected reward will be 3. so the amount the expected return will change when we change these probabilities can be visualized by this orange box so if we move this probability line to the left we'll increase our expected return by the area of this orange box which will be dx times 7. or instead of looking at this difference we can think about how each of these individual rectangles would change for each small change in their probability so for the probability of going left for each increase of the probability by dx we would increase the expected return by three times dx and for this red rectangle for each increase in its probability by dx we'll increase the expected return by 10 times dx and then we could just cancel those out to find the difference between them which would be 7. so now that we have that visual concept of what the expected return looks like let's think about what would happen if we tried to calculate the gradient of the expected return so the expected return is the sum of these two values here so we have a 0.7 probability of receiving a reward of 3. we have a 0.3 probability of receiving a reward of 10. we add those together to get our expected return and now what happens if we take the gradient of this value with respect to these action probabilities so first we start back propagating the values and we get the partial derivative of the expected return with respect to this action probability the 0.7 and to do that we take the sum of our expected return replace this value the action probability with an x consider all other values constant and then find the partial derivative of the expected return with respect to x so in the slides i use this normal looking d letter instead of the more precise curly looking partial derivative d letter and so try not to get confused because this is a partial derivative but they're pretty much saying the same thing how much does r change with respect to how much x changes anyways this is the partial derivative of the expected return with respect to this action probability and this is how we find the expected return with respect to this action probability do the same thing take the sum replace that action probability with x consider all other values constant and then find the derivative of r with respect to x here this disappears this just becomes 10. 
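As a quick sanity check of the numbers above, here is a tiny autograd sketch of this one-step case; the probability tensor simply stands in for the policy's two outputs, so this is illustrative rather than how the full network would be wired up.

```python
import torch

# One-step example: going left pays +3, going right pays +10,
# and the current policy is [P(left), P(right)] = [0.7, 0.3].
probs = torch.tensor([0.7, 0.3], requires_grad=True)
rewards = torch.tensor([3.0, 10.0])

expected_return = (probs * rewards).sum()   # 0.7*3 + 0.3*10 = 5.1
expected_return.backward()

print(probs.grad)   # tensor([ 3., 10.]): each partial derivative is that action's reward
```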
so now we have the partial derivatives for each of these action probabilities then these would get back propagated through the neural network that produced these action probabilities and we would get gradient values on the parameters of our neural network so we'd get gradient values on the weights and biases what other parameters we used to calculate these action probabilities and then once we had those gradient values for the parameters in our neural network we could update those parameters by taking a step in the direction of that gradient and that should increase the expected return of the policy because the gradient will be pointing in the direction of steepest ascent to increase the expected return now one thing i want to point out here which might seem kind of weird is that the partial derivative of this action value is three so kind of makes sense if we increase it by a little bit we'll get a little bit more of this reward of three but it seems to ignore the fact that these two values have to sum to one so we're kind of saying we can just increase this point seven and not change this value at all and that's what this partial derivative is saying so the question is why does this work why can we just ignore that relationship that these have to add to one and look at these independently and then back propagate those values through the neural network and we'll get the correct gradient so to better understand this i created this diagram and this red arrow this axis is the probability of going left this green axis is the probability of going right and right now we are at the point right here where we have a 70 chance of going left and a 30 chance of going right and since both these probabilities will have to sum to 1 we're going to see that the probabilities output from our policy is going to have to lie somewhere along this line because these are all the values that add up to one so if we go back to this point right here where we are currently we can ask and this blue axis is representing the expected return and it's scaled down a lot more than these values so just keep that in mind but if we ask how much the expected return will increase for each little step in increasing the probability of going left we'll get the slope of this line and the slope of this line will be the partial derivative which is 3 for going left and it doesn't look like 3 in this diagram that's just because i have it scaled down if it was more proportionate it'd be something like this so this would have a slope of 3 but i'm just going to scale it down so we can see a little bit more what's going on so the slope for increasing the probability of going left is 3 and the slope for increasing the probability of going right is 10. and those are these two vectors here this one has a slope of 10 and this one has a slope of 3. 
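And here is the same one-step example with the probabilities produced by a softmax, the way a network head would produce them; the logit values are just chosen to reproduce the 0.7 / 0.3 policy, so treat this as an illustrative sketch. Backpropagating through the softmax is what accounts for the two probabilities having to sum to one, which is what the geometric picture being described here shows.

```python
import torch

# Logits chosen so that the softmax gives roughly [0.7, 0.3].
logits = torch.tensor([0.8473, 0.0], requires_grad=True)
rewards = torch.tensor([3.0, 10.0])

probs = torch.softmax(logits, dim=0)        # ~[0.70, 0.30]
expected_return = (probs * rewards).sum()   # ~5.1
expected_return.backward()

# The gradients on the logits come out equal and opposite (about [-1.47, +1.47]):
# the only direction the softmax can move is trading probability mass between
# left and right, i.e. the direction along the constraint line.
print(logits.grad)
```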
and you can see each of these arrows is just only going in the direction of that axis so we're basically ignoring the fact that these values have to sum to 1 and we're just saying if we could increase this probability and keep the other probability the same how would the expected return change and then what happens when we back propagate these partial derivatives through the neural network is it's going to figure out what the gradient is along this line and that essentially looks like this it'll find the plane that passes through both of these vectors so that'll be this blue plane right here and that is the gradient of this space and then to figure out how changing our policy will change the expected return we can look at how this pink plane which extends vertically above this line how it intersects this blue plane and then if we add this orange line to represent this intersection we can see what the gradient is of this line directly above this line which is the possible outputs from our policy and if we draw another line which connects the tips of these vectors given that the base of these triangles are both one we can see that it will be parallel to this orange line and it will have the same length and we can notice that the height difference for this yellow line is going to be the height of this which will be 10 because the base has a length of one and over here the height will be three so ten minus three is seven and that's what we were getting before how much changing our policy will change the expected return had a slope of seven the difference between these two and we can see that by looking at it from this angle if we increase the probability of going left we will decrease the expected return at a rate of seven and if we look at it from this angle if we increase the probability of going right we will increase the expected return by a rate of seven and this is really what's happening under the hood when we do the back propagation through the neural network it will take into account the fact that these values have to add up to one and it will find this gradient along the path of possible outputs from the neural network so hopefully that explains how we can look at each action probability independently and ignore the fact that they have to sum to one and then you'll find this plane which kind of represents the gradient in that space and then we can find the actual gradient of changing the policy in one direction or the other and that will just be done automatically through the back propagation process so as we saw with this example when we increase the probability of going right and decrease the probability of going left we'll increase the expected return at a rate proportional to seven and we also can get that value just by finding the independent partial derivatives for each of these action outputs and back propagating that through our neural network but we still don't know how to find the partial derivatives of these action probabilities just by exploring the environment so to figure that out let's go back to the full example so first let's ask what is the expected return of this current policy if our policy output these action probabilities and one way we could do that is to look at all the possible paths we could take and we would add up the sum of the probability of taking that path and the total reward we would receive on that path so for example one path would be going left and then going left with our current policy we'll have a point seven times a point four percent of that 
happening and the total reward we'll receive is plus three minus ten so that can be represented by this right here and we'll get this value negative 1.96 for this path for the next path we'll say we go left right and then the environment kicks us left and the probability of that happening will be 0.7 times 0.6 times 0.1 and the total reward we'll receive is plus 3 minus 10. so that will be this line right here and we'll get this value and if we do it for all the other paths going left right being kicked right and then left we'll get this value for this other path for this path we'll get this value in this value and if we sum up all these values we'll get our expected return which is 2.636 now this is one way of calculating the expected return but another way of calculating the expected return which will be more intuitive going forward is instead of summing over all the paths is to sum over all the rewards so we sum over all the possible rewards times the probability of receiving that reward so if we look at this reward of plus three the probability that we'll receive that reward is .7 given a random episode so point seven of the time will go left and receive that reward and point three of the time will go right and receive this reward of plus ten and you can see those two values right here point seven times the reward of three point three times the reward of ten so just after our first step we have an expected return of five point one but after taking this step we could also receive these additional rewards so we calculate those in we take this negative 10. the probability of receiving that is 0.7 times 0.4 and we add that in right here and then the probability of receiving this reward 0.7 times 0.6 times 0.1 we add that in right here and for all these other rewards we'll add them in with the probability of receiving them again i'm keeping in the rewards of zero just to show how the calculations would work if they were different and when we add all these up we'll get the same value for the expected return which is 2.636 and really these are the same sums it's just some of these values are moved around and it's just broken down differently so we'll be using this sum going forward it's the sum over all possible rewards multiplied by the probability of receiving that reward so if we actually knew what this sum is if we knew all the details of the environment and we could calculate the sum we could just find the gradient of the sum with respect to the parameters in our neural network and we would know how to improve our policy however since we're assuming that starting out we don't know all the details of the environment we won't be able to directly calculate the sum and the idea behind the policy gradient theorem is that actually all we need to do is figure out how to calculate the gradient of the sum so we don't actually need to calculate this exact sum we just need to figure out what the gradient of this sum would be and as we saw before we can think of one step in the backwards path of the gradient is finding all the partial derivatives of the expected return with respect to these action probabilities so if we could find all the partial derivatives for each of these action probabilities and then we could back propagate those through the neural network we would get the correct gradient for updating the network so let's go back to assuming we know what this sum is and look at what the partial derivatives for each of these action values would be so first to find the partial derivative with 
respect to this action value the 0.7 we will do as we did before and replace all the places where that value shows up with an x in the sum and then find the derivative of this sum with respect to x and this term right here is a constant so it'll drop out then we get a 3 for this x times 3 then we get this 0.4 times this negative 10 for this line right here and you can see we get this full sum down here for the derivative of r with respect to this action probability and now if you want you can pause the video and try to figure out what this sum represents and if you're back we can see it starts off with a three and then we're going to get a point four times this negative ten a point six times point one times this negative ten a point six times point nine times this reward of zero and so on and we can see that this value is actually the expected return after having taken this action so we're guaranteed to get a reward of three then there's a forty percent chance we'll also get a reward of negative 10 a 0.6 times 0.1 chance of getting this reward and so on so this sum is the expected return after taking this action so the partial derivative of this action value is the expected return after taking this action and that makes sense in our previous example of just this state we were also getting the same thing except we were only looking at the immediate reward but if we think about just increasing this probability by a little bit we'll increase the expected return by that little bit times the expected return after taking that action so now if we do the same thing with this action probability we'll replace its value with x this point three and then we'll take the derivative of this sum and we'll see we also get 10. so that's also the expected return we get after taking that action so now let's find the partial derivative of this action probability so we do as we did before we replace that point 4 with the x we find the derivative of r with respect to x and we get 0.7 times negative 10. now it's a little bit different than before this negative 10 is the expected return after taking this action but now we have an additional 0.7 so do you see what the pattern is now if not let's try another example and see if we can get a little bit more insight here we'll calculate the partial derivative for this 0.6 action probability we replace all the point sixes with this x and then we take the partial derivative and we get this value down here which can be transformed by pulling out this 0.7 and we get this value down here now do you see what this sum represents well you can see if we look at this value in the parentheses we're getting a 0.1 times negative 10 a 0.9 times this 0 a 0.9 times 0.8 times this zero and a 0.9 times 0.2 times this 10.
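The partial derivatives being worked out here can also be checked numerically. Below is an illustrative sketch that writes the rewards-based sum with the six action probabilities as autograd variables, using the x0 through x5 numbering that the walkthrough introduces next (x0/x1 for left/right from the start, x2/x3 for left/right from the slope, x4/x5 for left/right from the hilltop).

```python
import torch

# Current policy values for the six action probabilities.
x = torch.tensor([0.7, 0.3, 0.4, 0.6, 0.8, 0.2], requires_grad=True)
x0, x1, x2, x3, x4, x5 = x

# Expected return written as a sum over rewards
# (probability of receiving each reward times that reward).
R = (x0 * 3                      # pick up the diamond
     + x1 * 10                   # go right from the start, reach the flag
     + x0 * x2 * -10             # walk left into the fire pit
     + x0 * x3 * 0.1 * -10       # slip into the fire pit while climbing
     + x0 * x3 * 0.9 * 0         # make it back up the hill
     + x0 * x3 * 0.9 * x4 * 0    # step back onto the slope, episode ends
     + x0 * x3 * 0.9 * x5 * 10)  # reach the flag from the hilltop

print(R.item())   # 2.636, the expected return of this policy
R.backward()
print(x.grad)     # e.g. the gradient wrt x5 is 0.7 * 0.6 * 0.9 * 10 = 3.78:
                  # (probability of reaching that node) * (expected return after the action)
```

Each entry of x.grad reproduces the pattern being derived step by step here.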
so this again is the expected return after taking this action but again we multiply it by this 0.7 which is this probability up here so let's look at one more example and that's for this action probability down here this point two so we replace that point two with the x then we find the derivative of this sum and we get point seven times point six times point nine times ten and here we see the 10 is the expected return after taking this action but we multiply that now times 0.7 times 0.6 times 0.9 so those are all the probabilities on the path to arriving at this node and this node is the node from which that action is taken so the pattern is all these partial derivatives will be the probability of ending up in the node from which that action can be taken multiplied by the expected return after taking that action and if we think about it that makes intuitive sense once we're in this node if we can increase this action probability by a little bit we'll increase our expected return by that little bit times 10. however the probability that we'll actually end up in this node is 0.7 times 0.6 times 0.9 so the idea is that once we're in this node increasing this action will increase the expected return by this much but we'll only end up in this node with a probability of 0.7 times 0.6 times 0.9 so we multiply those together the probability of ending up in that node and then how much the expected return from that node onward will be increased by changing this action and that will be the expected return after taking that action so here's the idea instead of trying to find the sum that is the expected return we're going to try to find this other sum which i'm just going to use the symbol of y to represent so first of all with the sum each of these x values x0 x1 x2 are the variables for these action probabilities so x0 x1 x2 x3 over here and the number after the x is just to number that variable it's not x times 2 here it's just the x 2 variable so for each line of the sum we have the probability of ending up in the node from which that action could be taken multiplied by the average return after taking that action so the probability we'll end up in this node is 1 as the node from which x0 could be taken that's also the probability from which x1 could be taken for this line we have a 0.7 probability of ending up in this node from which the x2 action could be taken and we'll multiply that by the average return after taking x2 and so on for these other action probability variables down here so can you guess what is special about this sum it's different than the expected return but what's special about it is that it'll actually have the same gradient and we can see that by looking at all the partial derivatives for each of these action probabilities so if we find the partial derivative of y with respect to x0 then all these other values will drop out when they're considered as constants and we'll get one times the average return after x0 and down here you can see the other partial derivatives we get when we do the other values it's a similar process all the other lines except for the line with this action probability will get cancelled out so if we look at the partial derivative with respect to x3 all these other lines cancel out when they're considered constants and we get 0.7 times the average return after taking the action x3 and that's what we get down here so these were the same partial derivatives we got when we looked at the expected return the partial derivative was the probability of ending 
up in the node from which that action was taken multiplied by the average return after taking that action so for example here's the sum of our expected return as before and when we put it in this format using the x0 x1 x2 and so on we can see that it'll look like this with all the variables in place and then for example to find the partial derivative of each of these values with respect to x3 we can see that all the other values become constant then when we take the derivatives of each of these values up here we'll get 0.7 times the average return after x3 and down here we'll get this value and as we saw before this value on the right is the average return after x3 so we have this sum which will give us the same gradient as the expected return but you might be asking what's nice about this sum why did we come up with this sum and the nice thing about this sum is that each of the items in the sum only has one action probability variable in it so this item up here only has the x0 this item of the sum only has x1 this item only has x2 and so on and the nice thing about this is that it makes it easy to estimate this sum and if you want you can pause the video to try to figure out how we can estimate this sum just by sampling episodes from the environment and now let's figure it out together so you can see we have these average returns after taking the action and this probability of reaching this node from which the action was taken so let's just see what happens if each time we reach an action we add that action variable to the sum multiplied by the total reward received after taking that action so for example if we sample a first episode by first going left here then right ending up in this node down here and then going left we'll have reached the action values x0 x3 and x4 along that path so we'll add each of those to the sum multiplied by the total reward received after taking that action so let's imagine we sample another episode this time we'll be going through x0 x2 so that's x0 x2 down here into the fire pit and we'll add those values to the sum multiplied by the total reward received after taking that action so after taking this x0 action we got a reward of plus three and then a reward of negative ten so plus three minus ten here and so on let's say we sample another episode where we just go right here so that's x one plus ten sample one more episode here x zero x three and x5 so we're going to go along this path x0 x3 x5 and add those values to the sum now what would this sum come out to be on average if we sampled a large number of episodes let's just say n episodes it turns out that on average this sum would come out to this value so n is the number of episodes we sampled so on average the number of times we'll encounter the x zero will be n times 0.7 that'll be the probability of taking that action from this initial state and then each time we see that action we'll multiply it by the reward received after that action so on average when we take into account all those different additions into the sum it would be equal to this x0 times the average return after taking that action and down here if we look at x3 the probability that we'll have taken an x3 in any episode is 0.7 times 0.6 and we just multiply that by the number of episodes we ran to get the number of times we encountered that value and again we'll end up on average multiplying it by the average return after taking that action so this value is pretty close to this value but it's not exactly the value we are looking for so 
we have this additional n here you can also see we have this additional 0.7 in this line 0.3 0.4 0.6 additional 0.8 here an additional 0.2 here so what are these additional values in here well we can see that these are the probabilities of actually taking these actions so we have a probability of 0.2 of taking this action of x5 so that makes sense up here these were the probabilities of reaching the node from which the action was taken however down here these are going to be the probabilities of actually taking that action which will be the probability of not only ending up in that node from which the action would be taken but we're also multiplying it by the probability that we actually take that action so this sum is super close to what we're looking for all we have to do is correct for these additional values so you can see we could divide this total sum by n the number of episodes and we could also divide each of these action values by the probability of that action when we add it to the sum and if we did that we would get this value up here so just to make that clear let's try doing the same thing so first we take this episode we go x0 x3 x4 so each time we encounter an action we're going to multiply that value times 1 over the probability of taking that action so x0 has a probability of 0.7 currently so we'll add that in and then we'll again multiply it by the reward received after taking that action we'll do the same for these values down here and we'll also multiply this entire sum by one over n so n right now is just one for the first episode and now let's take another episode here we're going through x0 x2 x0 x2 we'll do as we did before and here we'll increment this value so now one over n is one over two because we have two episodes and we continue this for another episode it'll become this and for a final episode it'll become this total sum so one over four for four episodes and we continued the pattern all the way down and now on average this sum if we have taken n episodes will be this right here one over n times the probability of taking this action which will be n times 0.7 for this case times 1 over this action probability because we are adding those now to the sum multiplied by as before the average return after taking that action and that'll be the same for all these other lines as well and now you can see this one over n will cancel out with these ends and this one over 0.7 will cancel out with this 0.7 and so on and so on and what we'll get is this value down here which is the same as this sum up here so through the process that we just did we were able to estimate the sum just by sampling episodes from the environment so if we can estimate this sum just by sampling episodes like that we can find the gradient of this sum and that will give us the same gradient as the expected return so that we can just update our neural network in the direction of that gradient and we'll be moving in the direction that increases the expected return now one thing that you'll see down a lot in practice is that when we're building up the sum instead of calculating this action value multiplied by 1 over the probability of this action value what you can also do is just add the log of this x 0 value multiplied by the return after that action so this sum would become this sum down here and it's essentially just a mathematical trick the gradient of this value will be the same as the gradient of this value and that's just because the derivative of the log of x is 1 over x so just to see an 
example of that again if we took these sums and we found on average they would become these values we cancelled out the n's and up here in this sum these point sevens will cancel out with these point sevens and so on down here we don't have those values so it just becomes this value down here with all these other values the point seven the point three and the point four still in the sum down here but now let's see what happens when we would take the gradient let's look at the partial derivative with respect to x3 so up here what we got as before would be 0.7 times the average return after x3 and we do the same thing down here all these other lines will cancel out as before and we'll get 0.7 times 0.6 times 1 over x3 times the average return after x3 and this x3 will have the value of 0.6 so now it'll cancel out this point 6 and we'll get 0.7 times the average return after x3 so these two values will be the same so if each of these sums give you the same gradient why use one over the other and it turns out you can use both but the log version actually in my tests is computationally a little bit more efficient and you can think about it because most of these probabilities use something like a normal distribution or a soft max function and in those functions we're taking a value and exponentiating it to get our probability so when we're working with the log probability we don't actually have to go through that exponentiation we can just work directly with the log so we kind of avoid a couple of those mathematical steps of exponentiating and then in the backwards pass going through that exponentiation so in practice you'll usually see this log version but just know it's the same thing as what we did before by just taking that variable and dividing it by the value of that variable and if using the log is still somewhat confusing we'll now go over how we would implement this algorithm in practice and i'll show you how we could calculate the gradient using both of these methods and hopefully that will make it a little bit more clear so to implement this in practice we'll start by sampling episodes and collecting the data so over here on the right i'll have some pseudo-ish code it'll be python mostly pi torch motivated but it won't be exactly correct because these would have to be tensors instead of just lists but anyways you'll get the idea so we start off taking our first episode and we're going to sample our first action so let's say first we go left and we end up in this state down here and we're going to add this observation to our list of observations our action of going left here left will be 0 right will be one we'll add that to our list of actions we'll see we got a reward of three so we'll add that to our list of rewards and that'll be all for this step in the next step from here we'll go right and we'll end up getting kicked over into this state so we'll add the observation that we started at to the list we'll add the action we took one for going right this time we'll add the reward we got in this case zero that will be done for this step we'll take one more step ending up down here and we'll do the same thing adding this observation adding our action of going left and adding our reward of zero and now since we reached the end of an episode the done value is true we're going to calculate the future return for each of these steps so here the future return of this action was three plus zero plus zero so it's still just three zero is this zero plus zero and this will make more sense in the 
future episodes we sample but we're also going to increment the number of episodes count to one because we finished this episode so let's sample another episode so here we're going to go left add our observation our action our reward then we're going to go left again and again add the observation action and reward and now we'll calculate the future returns so here this negative 7 will be this 3 plus negative 10 and we get negative 7 this negative 10 will just be this negative 10 value and again we increment the number of episodes count so let's sample another episode this time we just go right we just add that observation our action our award we calculate the future returns for each of the actions and we increment our number of episodes and we'll add one more episode to the list here we're going to go left and then right and we're going to end up in this node down here and then we'll take another right so we'll add those observations for each of those steps each of the actions each of the rewards and then when we reach the end we're going to fill in the future returns for each of the actions during that episode so 13 is going to be the 3 plus 0 plus 10 10 is going to be the zero plus 10 and this 10 is just this 10. and then we increment number of episodes so we've collected four episodes and in practice you may want to collect more than four episodes but we'll just keep it simple and just work with these four episodes we collected and now we'll use this data to update our neural network so let's imagine we have a neural network in this case it'll just be a simple linear layer the input will be the size of the observation which is just two values and the output will be the number of possible actions so two one for going left one for going right and now we'll take our list of observations and pass them through our neural network so this is a batch of observations and we're going to get out a batch of logits which will be the input to the soft max function to get our probabilities so in pi torch instead of passing these logits directly through a softmax function you would instead create a probability distribution and then in this case it would be a categorical distribution we'll pass in our logit values and we'll get out a batch of probability distributions one for each observation now we're going to get the log probability for each of the actions given those probability distributions so when we pass in our actions to this log prob method on our distributions object we're going to get out the log probability values so what this is doing is it's going through this list of actions and for each action it's looking at the probability distribution for each of the corresponding observations so for this first action it corresponds to our first observation so when we input 2 1 our policy is going to output this probability distribution 0.7.3 so we're asking to get the log probability of the first probability in this list so it'll give us the log of 0.7 now for our second action this one corresponds to this observation so we're taking in observation 1 0 and that'll be this observation down here our policy would have output 0.4 and 0.6 so we're going to find the log probability of 0.6 since from here we ended up going right we had the action of one after this node so we're going to get the log probability of this 0.6 value so this log probs variable will be a batch of log probabilities one for each action we'll then multiply that list of log probabilities times the future return for each of those 
actions sum all those values together and then divide by the number of episodes and you'll also notice this negative out here which just means we're going to take the negative of the sum that we usually use and that's because in our examples we have been assuming that we're doing gradient ascent which means we're trying to increase this value but with most machine learning libraries the default is to do gradient descent so we just add a negative out front and that's kind of like taking the value space and flipping it upside down so that gradient descent takes us to the same place that gradient ascent would have taken us in the original case now if we call loss.backward() we'll back propagate the loss through our neural network getting the gradients on each of the parameters in our neural network now this gradient won't point in the true direction that increases the expected return because this is just a sample of episodes but on average it will point in the direction of the true gradient so with each step we may not be moving exactly in the direction we want to move but on average it'll point in the correct direction and so the errors should even out and we should be moving closer and closer to increasing the expected return so in this example we're using the log probabilities but let's look at what it would look like if we were just using the direct probabilities so one way we could get the probabilities is just by taking the log probabilities and exponentiating them so reversing that log to get the actual probabilities and then our loss would look like this we multiply the future returns by the probability divided by the value of that probability so this detach method here essentially turns this probability into a constant in our sum it's detaching this value from the backwards graph so when we compute the gradient we won't be going through this value so it'll pretty much be the same as if this was just a constant we just want to get the probability values and divide by them but not back propagate through them so if we take these values sum them all together and divide by the number of episodes then as before we add the negative just because we're doing gradient descent this will give us a loss that if we call loss.backward() on will give us the exact same gradient as if we used this loss up here so both of these methods work using the log probability or just using the probability but i recommend using the log probability because it's computationally a little bit more efficient if the computations are done correctly and you might be asking is it just more computationally efficient because we're going through the log prob here and then we're exponentiating it and actually even if we don't do this and instead we get the probabilities directly like this and it's a little confusing since in pytorch there's not a convenient method for getting the probability of each of the actions but this code will do that but even if we do it like this the log probs should be computationally a little more efficient but i just wanted to show you this other method since to me it makes more intuitive sense because for each action in this sum the probability that that action will be in the list is going to be the probability of taking that action in our environment so here if one of the probabilities is this point two it'll only be in the list on average point seven times point six times point nine times point two of the time but for the gradient to be correct we want to actually multiply this future return
value by the probability of ending up in the node from which that action was taken so if we just divide this value by the value of this action detached from the graph so it's just a constant we're not going to back propagate through this value then that kind of gets us back up into the probability of ending up in this node and then we'll multiply that by the future return to get the correct gradient and when you use the log probabilities it's just a convenient way of doing the same thing that's also computationally a little more efficient now this is a very simple version of what some people call the vanilla policy gradient algorithm and it's just vanilla because it's very simple and you can start adding on more bells and whistles to for example reduce the variance of your gradient or to make the step size of your updates more consistent and i just want to show you one of those little tricks that's frequently done just so you have an understanding of what it's doing so we'll start by going back to this version of the algorithm where we're using the log probabilities and now let's ask how consistent is the size of this loss going to be and how much is it going to vary depending on the environment so usually when we're training a neural network we want a somewhat consistent step size if the steps are too big then our network may not converge and if the steps are too small it may take too long for the network to converge so we want this loss to be somewhat consistent between the different environments we could train on so one thing you'll notice is that in this loss we're multiplying by this future returns variable and if the rewards in our environment are really big for example for if when we take one step we get a reward of plus a thousand if we take another step we get a reward of negative a thousand then this loss could be really big in the positive or the negative direction we may be taking huge steps just because that's how the environment was set up with the rewards at that scale now in another environment the rewards may be scaled down to be within one and negative one and we may be getting very small rewards so 0.1 0.2 and that would make this loss very small so the first idea is to make these future returns more consistent is we can normalize them and that looks like this we're going to find the mean of the future returns we're going to find the standard deviation and then we're going to take the future returns subtract the mean and divide by the standard deviation and we clamp by this very small value just to protect against the case where all the values in our list of future returns are the same which might happen if we have an environment with very sparse rewards and we may have a batch of episodes where we don't receive any rewards throughout all the episodes so in that case the standard deviation would be zero and when we take the future returns and subtract the mean this value will be zero and we would have zero divided by zero if we didn't add this clamp so to protect it against that we just clamp using this minimum value so that it would be zero divided by a small value and we'll get zero over here then instead of using this loss we're just going to replace our future returns by these normalized future returns down here and we'll use this loss instead now the question is how can we subtract by a value and divide by a value and still expect to get a gradient pointing in the same direction so first let's look at subtracting by the mean so what this essentially is doing is 
subtracting all our future returns by the same value so for example if the mean of our future returns is five that's the same as saying let's add a reward of negative five to the end of all of these paths what that essentially means is that it doesn't matter what you do no matter what you're gonna end up getting an additional reward of negative five so if there's nothing you can do to avoid that reward or increase the probability of getting that reward then that won't change the gradient because the direction that you should improve your policy before that reward was even present will be the same direction that you should improve your policy now that that reward is there it's basically a value that on average will get cancelled out and not have an effect on the gradient so subtracting by the mean won't change the gradient in expectation however because our future returns were sampled from the environment it may have an effect on this particular gradient step but the idea is that on average it won't have an effect so we can subtract by this value and on average we'll still be moving in the true direction of the gradient now to look at dividing by the standard deviation this will just scale all the future returns by the same amount so it may scale down our gradient or may scale up our gradient but it will still point in the same direction it'll just have a more consistent size and not depend as much on the size of the rewards for this specific environment so while that has helped with the fact that different environments can have different scales of rewards what's something else that can vary between environments the length of the episodes so in some environments the episodes may be really long we take a lot of steps and in some environments they may be very short like in this example where the maximum steps are three and if you look at our sum down here you'll see that we're dividing by the number of episodes so let's say we sample a thousand steps if the average length of an episode is a hundred steps then on average this num episode's value will be 10. 
however if on average our episodes are only 10 steps then on average this num episodes value will be a hundred so that will change the sum of this value by a factor of ten so when we have longer episodes on average this value will be larger a larger positive value or a larger negative value so can you think of what we can do to make the loss value more consistent and depend less on how long the episodes are on average and the idea is that instead of summing these values and then dividing by the number of episodes we just take the mean of these values and essentially what that's doing is also dividing by the average number of steps so if we look up here at our original sum we took the sum and then divided by the number of episodes if we also added to divide by the average number of steps per episode we would end up with this value right here because the number of episodes times the average number of steps per episode is on average going to be the total number of steps in these tensors so when we take the mean it's the same as taking the sum and dividing by the total number of steps so since we're essentially just dividing by the average number of steps per episode that's going to have the same effect as just dividing by a constant which on average is just going to scale our gradient either scale it down or scale it up but it's still going to point in the same direction and now regardless if our environment has long episodes or short episodes we'll get a more consistent lost value and we'll take more consistent size steps in our policy updates and just to clarify when i say that these two changes don't affect the direction of the gradient on average the two changes being normalizing the future returns and dividing by the number of steps instead of the number of episodes what i mean is that these two changes won't affect the direction of the gradient as the batch size increases to infinity but in practice since we're sampling a finite batch size these two changes will not only affect the direction of the gradient due to variance but also due to correlations in the sampled episodes for example correlation effects between episodes that have higher or lower than average variance in the future returns or episodes that have higher than average or lower than average number of steps per episode however in practice these two changes usually end up having a beneficial effect but just note if you have environments that have a high variation in the future returns or the episode length it may be beneficial to increase the batch size to make the direction of your gradient more consistent so to make it even more clear here's an outline of what an implementation might look like we start by initializing our neural network we'll create an optimizer and register the parameters of our neural network with that optimizer and then in a loop for however many iterations we want to update our policy we will first collect multiple episodes of data i don't show the code of doing that here but what we'll end up with is the collection of observations actions rewards and future returns and you'll notice we don't actually use this rewards value anywhere down here in the code but i just wanted to show it here because usually we use it to calculate the future returns so then we'll pass our batch of observations through our neural network to get our batch of logits and then we'll create a batch of distributions for those logits then we'll get the log probabilities for each of the actions in our batch of actions we'll then 
normalize the future returns to get our normalized future returns variable down here we'll then calculate our loss as we did before and we'll zero the gradients in our neural network back propagate the gradient of the loss through the neural network and we'll tell the optimizer to use those gradient values to take a step to update the parameters in our neural network and we'll do that in a loop the next time through we'll collect a completely new set of observations actions rewards and future returns from sampling episodes from our environment and we'll do all these steps over and over slowly improving our policy in the direction that increases the expected return so the key insights are that instead of finding the sum that is the expected return we find this other sum which just has the same gradient as the expected return and we find that sum by sampling episodes from the environment then we use that alternative sum to calculate the gradient values for the parameters in our neural network and on average they'll point in the direction of the true gradient of the expected return and the other big insight for me was just that i can see that the sum of values are just the probability of actually taking that action and then we're essentially dividing by the value of that probability to get back up into the probability of ending up in the node from which that action was taken and then we multiply that by the average return received after taking that action and that visual of going down into the graph dividing by its probability to go back up to the node from which that action was taken and multiplying that value by the expected return after taking that action really helps me have a more intuitive sense of why that sum gives us the correct gradient for increasing the expected return this was a long video but i hope to give you a better intuition for the policy gradient theorem and instead of just seeing it as a couple math operations that lead to this equation that you can see the actual visual of going through the tree structure of an environment and how we're getting the values that we are so that's all for this video but i'll be making more machine learning related and reinforcement learning related videos so if you're interested feel free to subscribe and i will see you guys next time
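Putting the pieces together, here is an illustrative sketch of the loop outlined above, assuming the HillGame environment sketched earlier. It mirrors the described steps (a linear policy, a categorical distribution, log probabilities of the sampled actions, normalized future returns, and the negative mean as the loss), but the batch size, learning rate, and the choice of Adam as the optimizer are assumptions for the example, not details given in the video.

```python
import torch
from torch.distributions import Categorical

env = HillGame()                                     # the toy environment sketched earlier
policy = torch.nn.Linear(2, 2)                       # observation size 2 -> one logit per action
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)

for iteration in range(1000):
    observations, actions, future_returns = [], [], []
    for _ in range(32):                              # collect a batch of episodes
        obs, done, rewards = env.reset(), False, []
        while not done:
            with torch.no_grad():
                logits = policy(torch.tensor(obs, dtype=torch.float32))
            action = Categorical(logits=logits).sample().item()
            observations.append(obs)
            actions.append(action)
            obs, reward, done = env.step(action)
            rewards.append(reward)
        # future return at each step = sum of the rewards from that step onward
        future_returns.extend(sum(rewards[i:]) for i in range(len(rewards)))

    obs_batch = torch.tensor(observations, dtype=torch.float32)
    act_batch = torch.tensor(actions)
    ret_batch = torch.tensor(future_returns, dtype=torch.float32)
    # normalize the future returns, clamping the std to avoid dividing by zero
    ret_batch = (ret_batch - ret_batch.mean()) / ret_batch.std().clamp(min=1e-8)

    log_probs = Categorical(logits=policy(obs_batch)).log_prob(act_batch)
    loss = -(log_probs * ret_batch).mean()           # negative because the optimizer descends

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

As discussed in the transcript, each sampled gradient only points in the true direction on average, so the batch size trades off noise in the update against the cost of collecting more episodes.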
Info
Channel: Elliot Waite
Views: 13,345
Keywords: policy gradient, policy gradient reinforcement learning, policy gradient reinforce, policy gradient explained, policy gradient theorem, policy gradient theorem explained, policy gradient theorem proof, policy gradient derivation, policy gradient methods, policy gradient pytorch, policy gradient algorithms, policy gradient reinforcement learning example, policy gradient rl, policy gradient formula, policy gradient introduction, policy gradient log trick, reinforcement learning
Id: cQfOQcpYRzE
Length: 59min 36sec (3576 seconds)
Published: Sun Nov 22 2020