Everything You Need To Master Actor Critic Methods | Tensorflow 2 Tutorial

Captions
Welcome back, everybody. In today's tutorial you're going to get a mini crash course in actor-critic methods in TensorFlow 2. We'll have around 15 minutes of lecture followed by about 20 minutes of coding, and you'll learn everything you need to go from start to finish with actor-critic methods. If you'd like to know more, you can always pick up one of my Udemy courses, on sale right now; link in the description. Let's get started.

Welcome to the crash course in actor-critic methods. I'm going to give a relatively quick overview of the fundamentals of reinforcement learning in general and then actor-critic methods in particular. Finally, we'll work together to code up our own actor-critic agent in TensorFlow 2. This is geared toward beginners, so feel free to ask questions in the comment section.

Reinforcement learning deals with agents acting on an environment, causing some change in that environment and receiving a reward in the process. The goal of our agent is to maximize its total reward over time, even if it starts out knowing literally nothing about its environment. Fortunately, we have some really useful mathematics at our disposal, which makes figuring out how to beat the environment a difficult yet solvable problem.

The mathematics we're going to use relies on a very simple property of the system: the Markov property. When a system depends only on its previous state and the last action of the agent, we say it is Markovian. As we said earlier, the agent is given some reward for its action, so the set of states the agent sees, the actions it takes, and the rewards it receives forms our Markov decision process. Let's take a look at each of these components in turn.

The states are just some convenient representation of the environment. If we're talking about an agent trying to navigate a maze, the state is just the position of the agent within that maze. The state can be more abstract, as in the case of the lunar lander, where the state is an array of continuous numbers describing the position of the lander, its velocity, its angle and angular velocity, and which legs are in contact with the ground. The main idea is that the state describes exactly what about the environment is changing at each time step. The rules that govern how the states change are called the dynamics of the environment.

The actions are a little simpler to understand. In the case of a maze-running robot, the actions would just be move up, down, left, and right; pretty straightforward. In the case of the lunar lander, the actions consist of doing nothing, firing the main engine, firing the left engine, and firing the right engine. In both of these cases the actions are discrete, meaning they're either one or the other; you can't simultaneously not fire the engine and fire the right thruster, for instance. This doesn't have to be the case, though: actions can in fact be continuous, and there are numerous videos on this channel dealing with continuous action spaces. Check out my videos on soft actor-critic, deep deterministic policy gradients, and twin delayed deep deterministic policy gradients.

From the agent's perspective, it's seeing some set of states and trying to decide what to do. How is our agent to decide? The answer is something called the agent's policy. A policy is a mathematical function that takes states as input and returns probabilities as output. In particular, the policy assigns some probability to each action in the action space for each state. It can be deterministic, meaning the probability of selecting one of the actions is one and the others are zero, but in general the probabilities will be somewhere between zero and one.
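To make the state, action, and reward loop concrete before we get to the actor-critic code, here is a minimal sketch using OpenAI Gym's CartPole environment (the same one used later in the video) with a purely random policy. The random action selection and the classic Gym API of the time are my illustrative assumptions, not part of the lecture's agent.

```python
import gym

# A bare-bones look at the Markov decision process loop:
# observe a state, pick an action, receive a reward and the next state.
env = gym.make('CartPole-v0')

state = env.reset()                     # initial state: an array of continuous numbers
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()  # placeholder policy: pick a random discrete action
    state, reward, done, info = env.step(action)
    total_reward += reward              # the agent's goal is to maximize this over time

print('episode return:', total_reward)
```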
The policy is typically denoted by the Greek letter pi, and learning to beat the environment is then a matter of finding the policy pi that maximizes the total return over time, by increasing the chances of selecting the best actions and reducing the chances of selecting the wrong ones.

The reward tells the agent exactly what is expected of it. Rewards can be either positive or negative, and the design of rewards is actually a tricky issue. Take the maze-running robot: if we give it a positive reward for exiting the maze and no other reward, what is the consequence? The consequence is that the agent has no motivation to solve the maze quickly; it gets the same reward whether it takes the minimum number of steps or a hundred times that number. We typically want to solve the maze as quickly as possible, so this simple reward scheme fails. In this example we have to give a penalty, a reward of -1 for each step, and a reward of zero for exiting the maze. Then the agent has a strong motivation to solve the maze in as few steps as possible, because it will be trying to maximize a negative total reward, meaning get it as close to zero as possible.

Now that we have our basic definitions out of the way, we can start to think through the mathematics of the reinforcement learning problem. From the agent's perspective, it has no idea how its actions affect the environment, so we have to use a probability distribution to describe the dynamics. This distribution is denoted p(s', r | s, a), which reads as the probability of ending up in state s' and receiving reward r, given that we're in state s and take action a. In general we won't know its values until we interact with the environment, and that's really part of solving the reinforcement learning problem.

Since we're dealing with probabilities, we have to start thinking in terms of expectation values. In general, an expectation value is calculated by taking into account all possible outcomes and multiplying the probability of each outcome by what you receive in that outcome. So in our Markov framework, the expected reward for a state and action pair is given by the expectation value of the reward: a sum over all possible rewards, each multiplied by the sum over the probabilities of ending up in each possible resulting state.

For a simple example, consider a coin toss game: if the coin comes up heads you get one point, and if it comes up tails you get minus one point. What is the expected number of points for a flip? It's the probability of getting heads multiplied by the reward for heads, plus the probability of getting tails multiplied by the reward for tails: 0.5 times 1 plus 0.5 times -1, which gives an expected reward of 0 points, exactly what you would intuitively expect. This is a trivial example, but I hope it illustrates the point: we have to consider all possible outcomes, their probabilities, and what we would receive in each of those outcomes.

When we put theory into practice, we're going to be doing it in systems that are what we call episodic. This means the agent has some starting state and there is some terminal state that causes the gameplay to end. Of course we can start over again with a new episode, but the agent takes no actions in the terminal state, and thus no future rewards follow.
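Here is the coin-toss expectation written out as a tiny calculation, just to tie the formula to numbers; the outcome list is my own encoding of the example.

```python
# Expected reward = sum over outcomes of P(outcome) * reward(outcome)
outcomes = [
    (0.5,  1),   # heads: probability 0.5, reward +1
    (0.5, -1),   # tails: probability 0.5, reward -1
]

expected_reward = sum(p * r for p, r in outcomes)
print(expected_reward)   # 0.0, as computed in the lecture
```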
In this case we're dealing not just with individual rewards but with the sequence of rewards over the course of an episode. We call the cumulative reward the return, and it's usually denoted by the letter G.

Now, this discussion is for games that are broken into episodes, but it would be nice if we could use the same mathematics for tasks that don't have a natural end. If a game goes on and on, the total sum of rewards approaches infinity, and it's absurd to talk about maximizing total reward in an environment where you can expect an infinite reward. So we have to make a small modification to our expression for the returns: we need to introduce the concept of discounting. We're going to reduce the contribution of each reward to the sum based on how far away in time it is from the current time step, using a power of a new hyperparameter we'll denote gamma. Gamma is between 0 and 1, so each time we increase the power, the contribution is reduced. If we introduce gamma into the expression for the return at time step t, we get that the return is the sum over k of gamma to the k multiplied by the reward at time t + 1 + k.

Besides being a trick to make sure we can use the same mathematics for episodic and continuing tasks, discounting has a reasonable basis in first principles. Since our state transitions are defined in terms of unknown probabilities, we can't really say how certain each of those rewards is; states we encounter further out in time become less and less certain, so the rewards for reaching those states are also less certain, and we shouldn't weight them as heavily as the reward we just received.

If you've been following along carefully, something may not quite add up here. All this math is for systems with the Markov property, which means they depend only on the previous state and action, so why do we want to keep track of the entire history of rewards? It turns out we don't have to. If you do some factoring in the expression for the return at time step t, you find that the return at time t is just the sum of the reward at time t + 1 and the discounted return at time step t + 1. This is a recursive relationship between returns at successive time steps, which is more consistent with the spirit of the Markov decision process, where we're only concerned with successive time steps.

Now that we know exactly what the agent wants to maximize, the total return, and the function that governs how it acts, the policy, we can start to make useful mathematical predictions. One quantity of particular interest is the value function. It depends on the agent's policy pi and the current state of the environment, and gives us the expected return starting from state s at time t, assuming the agent follows policy pi. There's a comparable function for state and action pairs, which tells us the value of taking action a in state s and then following policy pi afterwards; it's called the action-value function, and it's represented by the letter Q.

So how are these values calculated in practice? In reality we don't solve these equations; we estimate them. We can use neural networks to approximate the value or action-value function, because neural networks are universal function approximators. We sample rewards from the environment and use them to update the weights of our network, improving our estimate of the value or action-value function.
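As a quick sanity check on the return formula, here is a sketch that computes the discounted return both ways: as the direct sum over gamma to the k times the rewards, and through the recursive relation where the return at time t equals the next reward plus gamma times the next return. The reward list is made up for illustration.

```python
# Discounted return for a short, made-up episode of rewards
rewards = [-1, -1, -1, 0]   # e.g. the maze example: -1 per step, 0 on exit
gamma = 0.99

# Direct sum: G_t = sum_k gamma**k * R_{t+1+k}
g_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, working backwards from the end
g_recursive = 0.0
for r in reversed(rewards):
    g_recursive = r + gamma * g_recursive

print(g_direct, g_recursive)   # both give the same number
```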
Estimating the value function is important because it tells us the value of the current state and of any other state the agent may encounter. Solving the reinforcement learning problem then becomes an issue of constructing a policy that allows the agent to seek out the most profitable states. The policy that yields the best value function for all states in the state space is called the optimal policy; in reality it can be a set of policies, but they're all effectively equivalent.

Various schemes exist to find these optimal policies, and one such scheme is the actor-critic method. In actor-critic methods we use two deep neural networks. One of them approximates the agent's policy directly, which we can do because the policy is just a mathematical function; recall that it's a probability distribution over the set of actions, taking a state as input and outputting a probability of selecting each action. The other network, called the critic, approximates the value function. The critic acts just like any other critic, telling the actor how good each action is based on whether or not the resulting state is valuable. The two networks work together to find out how best to act in the environment: the actor selects actions, the critic evaluates states, and the result is compared to the rewards from the environment. Over time the critic becomes more accurate at estimating the values of states, which allows the actor to select the actions that lead to the best states.

From a practical perspective, we're going to update the weights of our deep neural network at each time step, because actor-critic methods belong to a class of algorithms called temporal difference learning. That's just a fancy way of saying we're going to estimate the difference in values of successive states, meaning states that are one time step apart; hence "temporal difference". Just like with any deep learning problem, we're going to calculate cost functions; in particular we'll have two of them, one for updating the critic and one for updating the actor. To calculate our costs we generate a quantity we'll call delta, which is the sum of the current reward and the discounted value estimate of the new state, minus the value of the current state. Keep in mind that the value of the terminal state is identically zero, so we need a way to take that into account. The cost for the critic is delta squared, which is kind of like a typical regression problem. The cost for the actor is a little more complex: we multiply delta by the log of the policy's probability for the action the agent took in the current state (with a minus sign, since we minimize the cost). The reasoning behind this is something I go into in more detail in my course, or you can look it up in the chapter on policy gradient methods in the free textbook by Sutton and Barto.

So let's talk implementation details. We're going to implement the following algorithm: initialize a deep neural network to model the actor and critic; repeat for a large number of episodes: reset the score, terminal flag, and environment; while the state is not terminal, select an action based on the current state of the environment, take the action and receive the new state, reward, and terminal flag from the environment, calculate delta and use it to update the actor and critic networks, set the current state to the new state, and increment the score by the reward. After all the episodes are finished, plot the trend in scores to look for evidence of learning.
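Here is that update rule written out as a small sketch, with made-up numbers standing in for the outputs of the critic and actor networks; it's only meant to show how delta feeds both losses, not the full TensorFlow implementation we build below.

```python
import math

gamma = 0.99

reward        = 1.0     # reward received from the environment
value_state   = 10.0    # critic's estimate V(s)
value_state_  = 10.5    # critic's estimate V(s')
done          = False   # terminal flag; V(terminal) is defined to be zero
prob_action   = 0.4     # actor's probability pi(a|s) for the action actually taken

# delta = r + gamma * V(s') * (1 - done) - V(s)
delta = reward + gamma * value_state_ * (1 - int(done)) - value_state

critic_loss = delta**2                          # squared temporal-difference error
actor_loss  = -math.log(prob_action) * delta    # log policy weighted by delta
total_loss  = actor_loss + critic_loss

print(delta, critic_loss, actor_loss, total_loss)
```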
There should be an overall increase in score over time. You will see lots of oscillations, because actor-critic methods aren't really stable, but the overall trend should be upward. Another thing you may see is that the score goes up for a while and then falls off a cliff. That isn't uncommon, because actor-critic methods are quite brittle; they're really not the best solution for all cases, but they are a stepping stone to more advanced algorithms.

Other important implementation details: you can use a single network for both the actor and the critic, with common input layers and two outputs, one for the actor and one for the critic. This has the benefit that we don't have to train two different networks to understand the environment. You can definitely use an independent actor and critic; it just makes the learning more difficult for an algorithm that is already pretty finicky. We'll let it play almost 2,000 games with a relatively large deep neural network, something like a thousand units in the first hidden layer and 500 in the second.

The hard part is going to be the actor. As I said earlier, the actor models the policy, which is a probability distribution. The actor layer will have as many outputs as there are actions, and we use a softmax activation because we're modeling probabilities, which had better sum to one. When selecting actions we're dealing with discrete action spaces, so this is what is called a categorical distribution. We're going to use the tensorflow_probability package for the categorical distribution, feed it the probabilities generated by the actor layer, then sample that distribution and use its built-in log_prob function for our cost function.

As far as the structure of the code, we'll have a class for our actor-critic network, which lives in its own file. We'll also have a class for our agent, with the functionality to choose actions, save models, and learn from experience; that goes in a separate file. The main loop is pretty straightforward, but it goes in its own file as well. Okay, now that we have all the details out of the way, let's get started coding.

So, now that the lecture is out of the way, we'll proceed with the coding, starting with the networks. We begin, as always, with our imports: we need os to handle file-joining operations for model checkpointing, we need keras, and we need our layers, which for this example is just going to be the Dense layer. We'll have our ActorCriticNetwork class, and you see a case of convergent engineering here: TensorFlow and PyTorch both have you derive your model class from a base Model class. We can go ahead and define our constructor, which takes the number of actions as input, along with the number of dimensions for the first fully connected layer, which we'll default to 1024, and for the second, which we'll default to 512.
We'll also have a name for model checkpointing purposes and a checkpoint directory. Very important: you must remember to do a mkdir on this tmp/actor_critic directory before you attempt to save a model, otherwise you're going to get an error. The first thing you want to do is call your super constructor and then start saving your parameters. Also very important: for a class derived from the base Model class, as our ActorCriticNetwork is, we have to use model_name instead of name, because name is reserved by the base class; just be aware of that, it's not a huge deal. Then we set the checkpoint directory and the checkpoint file, which is an os.path.join of the directory and the model name plus '_ac'. I like to append an underscore and the algorithm name, in this case ac for actor-critic, in case you have one directory that you use for many different algorithms; if you're just using, say, a working directory, you don't want to confuse the model types, and if you have a good model saved you don't want to overwrite it with something else.

Now we define our layers, which are fully connected Dense layers. The neat thing about keras is that the number of input dimensions is inferred, so we don't have to specify it; that's why there's no input_dims in our constructor. The first layer outputs fc1_dims with a relu activation, and fc2 is similar. Then we have two separate outputs: two common layers followed by two independent heads, one for the value function, which is single-valued with no activation, and one for our policy pi, which outputs n_actions with a softmax activation. Recall that the policy is just a probability distribution, so it assigns a probability to each action and those probabilities have to add up to one, because that's kind of what probabilities do, right?

Next we define our call function; this is really the feed forward, if you're familiar with that from PyTorch. We pass the state through the first fully connected layer (we'll just use some generic variable name like value, it doesn't really matter), then through the second fully connected layer, get our value function and our policy pi, and return both. That is really it for the actor-critic network; all of the interesting functionality happens in the agent class.
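Here is a sketch of the network file as just described, reconstructed from the walkthrough rather than copied from the repository; treat names like fc1_dims and the tmp/actor_critic checkpoint path as assumptions.

```python
import os
import tensorflow.keras as keras
from tensorflow.keras.layers import Dense


class ActorCriticNetwork(keras.Model):
    def __init__(self, n_actions, fc1_dims=1024, fc2_dims=512,
                 name='actor_critic', chkpt_dir='tmp/actor_critic'):
        super(ActorCriticNetwork, self).__init__()
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.n_actions = n_actions
        self.model_name = name                  # 'name' is reserved by keras.Model
        self.checkpoint_dir = chkpt_dir         # remember to mkdir this first
        self.checkpoint_file = os.path.join(self.checkpoint_dir, name + '_ac')

        # two common layers, then two independent heads
        self.fc1 = Dense(self.fc1_dims, activation='relu')
        self.fc2 = Dense(self.fc2_dims, activation='relu')
        self.v = Dense(1, activation=None)                  # state value
        self.pi = Dense(n_actions, activation='softmax')    # policy probabilities

    def call(self, state):
        value = self.fc1(state)
        value = self.fc2(value)
        v = self.v(value)
        pi = self.pi(value)
        return v, pi
```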
So let's start writing the agent class. We begin, as always, with our imports: we need tensorflow, we need our optimizer, in this case Adam, and we need tensorflow_probability to handle the categorical distribution we use to model the policy directly. Note that you have to do a pip install tensorflow-probability before you can run this; it's a separate package from tensorflow. We also need our ActorCriticNetwork. Now let's code up the agent. The initializer is pretty straightforward: it takes a default learning rate, I'm going to use 0.0003 (it doesn't really matter, I'll pass in a specific learning rate from the main file), a gamma of 0.99, and a default n_actions of some number like, say, 2.

We save our parameters; recall that gamma is our discount factor. We also need a variable to keep track of the last action we took. This will be a little clearer when we get to the learn function; it has to do with the way we calculate the loss, because we have to use a gradient tape in TensorFlow 2, so it's just a bit of a workaround for how TensorFlow 2 does things. We need our action space for random action selection, which is just a list of actions from zero to n_actions minus one. We need our actor-critic network, making sure to specify the number of actions, and we want to compile that model with an Adam optimizer and a learning rate given by alpha.

Next we have the most basic functionality of our agent: choosing an action. This takes the current state of the environment as input, which we have to convert to a tensor, and in particular we have to add an extra batch dimension. The reason is that the deep neural network expects a batch of inputs, so you can't pass a 1D array; it has to be two-dimensional, so we add an extra dimension along the zeroth axis. Then we feed that through our deep neural network. We don't care about the value of the state for the purpose of choosing an action, so we just use a blank for it, and we get the probabilities from the network's policy output. We use those probabilities to construct the tensorflow_probability Categorical distribution, then select an action by sampling that distribution; the distribution also gives us a log prob for that sample. We don't actually need the log prob at this stage; we will need it when we calculate the loss function for our deep neural network, but it doesn't make sense, or rather it doesn't actually work, to save it to a list for later use, because this calculation takes place outside of the gradient tape. TensorFlow 2 has this construct of the gradient tape, which is pretty cool: it lets you calculate gradients manually, which is really what we want to do here, but anything outside the tape doesn't get added to the calculation for backpropagation, so the log prob computed here doesn't matter; why bother calculating it? One thing we do need, however, is the action we selected, so we save that in the action variable, and we return a numpy version of the action, because the action is a TensorFlow tensor, which is incompatible with the OpenAI Gym; Gym does take numpy arrays. We also want the zeroth element, because we added a batch dimension for compatibility with the network. A little confusing, but that's what we have to deal with.

Next, a couple of bookkeeping functions to save and load models: save_models saves the weights of the network to the checkpoint file, and load_models does the inverse operation, loading the weights from the checkpoint file. That is it for the basic bookkeeping operations.
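A sketch of the agent's constructor, action selection, and checkpointing as just described; variable names like self.action follow the walkthrough, but treat the exact module name and signatures as my reconstruction.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
import tensorflow_probability as tfp   # pip install tensorflow-probability

from networks import ActorCriticNetwork   # assumed filename for the network class


class Agent:
    def __init__(self, alpha=0.0003, gamma=0.99, n_actions=2):
        self.gamma = gamma                  # discount factor
        self.n_actions = n_actions
        self.action = None                  # last action taken, used in learn()
        self.action_space = [i for i in range(self.n_actions)]

        self.actor_critic = ActorCriticNetwork(n_actions=n_actions)
        self.actor_critic.compile(optimizer=Adam(learning_rate=alpha))

    def choose_action(self, observation):
        state = tf.convert_to_tensor([observation])   # add a batch dimension
        _, probs = self.actor_critic(state)           # ignore the value output here

        action_probabilities = tfp.distributions.Categorical(probs=probs)
        action = action_probabilities.sample()
        self.action = action                          # remembered for the learn step

        return action.numpy()[0]                      # plain integer for the Gym API

    def save_models(self):
        self.actor_critic.save_weights(self.actor_critic.checkpoint_file)

    def load_models(self):
        self.actor_critic.load_weights(self.actor_critic.checkpoint_file)
```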
Next we have the real heart of the problem: the functionality to learn. It takes the state, the reward received, the new state, and the terminal flag as input. The first thing we want to do is convert each of those to TensorFlow tensors, making sure to add a batch dimension; I like to be really pedantic with my data types, so I cast to tf.float32. We don't have to add a batch dimension to the reward, because it isn't fed to a deep neural network.

Now we calculate our gradients using the gradient tape, with persistent set to True. I'm not actually sure that's needed; I'm going to experiment with that when we run the code, but I have it that way, possibly because I copied and pasted code from somewhere else, so let me double-check. Inside the tape we feed the current state and the new state through the actor-critic network and get back our quantities of interest. For the new state we don't care about the probabilities, just the value, which we need to calculate delta. For the calculation of our loss we have to get rid of the batch dimension, so we squeeze these two values. The reason is that the loss works best on a scalar value rather than a scalar inside of brackets; it has to be a scalar instead of a vector containing a single item. It's just something we have to do; I encourage you to play around with it and double-check me on that. I move between frameworks, so sometimes stuff isn't always 100 percent necessary, even if it doesn't hurt anything.

We need our action probabilities for the calculation of the log prob: a tfp.distributions.Categorical defined by the probabilities output by our deep neural network, and then our log_prob is action_probs.log_prob(self.action), where self.action is the action we saved up at the top when we chose the action, i.e. the most recent action. Then we calculate delta, which is the reward plus gamma times the value of the new state times one minus int(done), minus the value of the current state. The reason for that factor is that the value of the terminal state is identically zero: no rewards follow the terminal state, so it has no future value. Our actor loss is minus the log prob times delta, the critic loss is delta squared, and the total loss is the actor loss plus the critic loss. Then we calculate our gradients: the gradient is tape.gradient of the total loss with respect to the network's trainable variables, and optimizer.apply_gradients expects a zip of the gradients and the trainable variables as input. And that is it for the actor-critic agent's functionality.

I'm going to come back to this in a few minutes to see if I need that persistent=True; I don't believe I do. I believe it would be necessary if we had, say, separate actor and critic networks and had to calculate gradients with respect to two separate sets of trainable variables, or when there's coupling between the loss of one network and the other, so that the tape keeps track of the gradients after it does the backpropagation, kind of like in PyTorch where it throws the graph away and you have to tell it to retain it. I'll double-check on that, though. Let's write and quit, and then we're ready to code up our main file.
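Here is a sketch of that learn method, meant to sit inside the Agent class from the earlier sketch (same imports and assumptions); whether persistent=True is actually required is left as the open question it was in the video, since it's harmless here.

```python
    def learn(self, state, reward, state_, done):
        state = tf.convert_to_tensor([state], dtype=tf.float32)
        state_ = tf.convert_to_tensor([state_], dtype=tf.float32)
        reward = tf.convert_to_tensor(reward, dtype=tf.float32)  # no batch dim needed

        with tf.GradientTape(persistent=True) as tape:
            state_value, probs = self.actor_critic(state)
            state_value_, _ = self.actor_critic(state_)

            # the loss behaves better on scalars than on shape-(1,) tensors
            state_value = tf.squeeze(state_value)
            state_value_ = tf.squeeze(state_value_)

            action_probs = tfp.distributions.Categorical(probs=probs)
            log_prob = action_probs.log_prob(self.action)

            # delta = r + gamma * V(s') * (1 - done) - V(s); V(terminal) = 0
            delta = reward + self.gamma * state_value_ * (1 - int(done)) - state_value
            actor_loss = -log_prob * delta
            critic_loss = delta**2
            total_loss = actor_loss + critic_loss

        gradient = tape.gradient(total_loss,
                                 self.actor_critic.trainable_variables)
        self.actor_critic.optimizer.apply_gradients(zip(
            gradient, self.actor_critic.trainable_variables))
```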
So, for the main file imports, we need gym, numpy, our Agent, and our plot_learning_curve function. I'm not going to go into any detail on the latter; it's just a function that plots data using matplotlib with some labeled axes, nothing really worth going through.

The first thing we do is make our environment. I'm using CartPole because it runs very quickly, and the actor-critic method is quite brittle, quite finicky; you will observe many cases where it achieves a pretty decent score and then falls off a cliff because the learning rate was just a little too high. There are a number of problems with the algorithm, and it's easiest to test in a very simple environment. In my course we use the LunarLander environment, and I did more hyperparameter tuning to get pretty close to beating it. In this case we won't quite beat CartPole; we achieve a high score of around 140 points or so, where beating it is 200, but I leave the exercise of hyperparameter tuning to you, the viewer; I've got to leave something for you to do as well, right?

We define our agent with a learning rate of 1e-5 and a number of actions given by env.action_space.n. Then we'll have, say, 1,800 games, about two thousand, with a filename of cartpole.png. If you do hyperparameter testing, I'd encourage you to put the string representations of those hyperparameters in the filename, so that when you look at the plot later you don't get confused about which hyperparameters generated it. The figure file is just 'plots/' plus the filename; I split it up, but you don't have to do it that way. We keep track of the best score received, defaulting it to the lowest value of the environment's reward range, so that the first score you get is better than it and you save your models right away. We have an empty list to keep track of the score history and a boolean for whether or not we want to load a checkpoint; if we're loading a checkpoint, we call load_models.

Finally we start playing our games. We reset the environment, set our terminal flag to false and our score to zero, and while we're not done with the episode we choose an action, get the new state, reward, done flag, and info from the environment, and increment our score. If we're not loading a checkpoint, we learn. Either way we set the current state to the new state; otherwise you'd constantly be choosing actions based on the initial state of the environment, which obviously will not work. We also append the score to the score history for plotting purposes and calculate the average score over the previous, I don't know, say 100 games. If that average score is better than the best score, we set the best score to the average, and if we're not loading a checkpoint, we save our models. This inner conditional keeps you from overwriting the models that produced your best scores when you're actually testing; if you just saved the model every time, you'd be overwriting your best model with whatever comes along, which may not be the best model. At the end, whether or not we're loading a checkpoint, we can plot: build the x axis and call plot_learning_curve with x, the score history, and the figure file.

Okay, now I have to do a mkdir on plots, and on tmp, did I call it tmp/actor_critic? Otherwise this stuff won't work. You also want to do a pip install tensorflow-probability, because that is a separate package; of course, I already have it.
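Here is a sketch of that main file, assuming the Agent class above lives in a module named actor_critic and that plot_learning_curve(x, scores, figure_file) comes from a utils module; those names, the directory layout, and the classic Gym API are reconstructions from the walkthrough.

```python
import gym
import numpy as np
from actor_critic import Agent            # assumed module name for the Agent class
from utils import plot_learning_curve     # assumed helper that plots with matplotlib

if __name__ == '__main__':
    env = gym.make('CartPole-v0')
    agent = Agent(alpha=1e-5, n_actions=env.action_space.n)
    n_games = 1800

    filename = 'cartpole.png'              # include hyperparameters here if you tune them
    figure_file = 'plots/' + filename      # remember to mkdir plots first

    best_score = env.reward_range[0]       # lowest possible score, so the first run saves
    score_history = []
    load_checkpoint = False

    if load_checkpoint:
        agent.load_models()

    for i in range(n_games):
        observation = env.reset()
        done = False
        score = 0
        while not done:
            action = agent.choose_action(observation)
            observation_, reward, done, info = env.step(action)
            score += reward
            if not load_checkpoint:
                agent.learn(observation, reward, observation_, done)
            observation = observation_     # crucial: move on from the initial state

        score_history.append(score)
        avg_score = np.mean(score_history[-100:])

        if avg_score > best_score:
            best_score = avg_score
            if not load_checkpoint:
                agent.save_models()

        print('episode', i, 'score %.1f' % score, 'avg_score %.1f' % avg_score)

    x = [i + 1 for i in range(n_games)]
    plot_learning_curve(x, score_history, figure_file)
```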
So let's go ahead and try to run this and see if I made any typos; I'm almost certain I did. It says something is not callable; that's because I forgot a multiplication sign on line 49, so I was trying to call something when I really wanted to multiply. Oh, and one thing I did forget, of course, is my debug statement down here, so let's add a print of the episode number and the score; I always have to forget something. Okay, there we go. One other thing I want to do is come back to the actor-critic file and get rid of that persistent=True; I don't think I actually need it. Sometimes I just copy and paste code, and then as I'm doing the video I realize, hey, I don't need all of that. Okay, so it does run.

All right, I'm going to switch over to another window where I've let this finish, because there's no point watching it run for another 1,800 games. Here you can see the output of the 1,800 games I ran, and it does achieve a score of around 160 to 170 points or so, which is almost beating the environment; it's pretty close. If you take a look at the learning plot, you can see it has an overall upward, roughly linear trend. The reason I don't let it continue is, as I alluded to in the lecture, these models are very brittle, and sometimes you can land on a very narrow slice of parameter space where your model is doing well, and any small step out of that range blows up the model. One thing that can help is a replay memory, which gives you a broader sampling of experiences and a little more stable learning, but that doesn't work by bolting it directly onto actor-critic methods, at least in my experience. I've tried it, I have a video on that, and I wasn't able to get it to work; maybe some of you can, and that would be fantastic, but in my case I didn't. I thought it would work; it does not, and in fact there's a whole separate algorithm, called actor-critic with experience replay, that deals with adding experience replay to vanilla actor-critic methods.

So I hope this was helpful. It's pretty hard to give a really solid overview in a 30 to 40 minute YouTube video, but it serves to illustrate some of the finer points of actor-critic methods and some of the foundational points of deep reinforcement learning in general. In my courses I go into much more depth, and in particular I show you how to actually read papers and turn them into code, a useful skill that's really hard to find anywhere else on YouTube or Udemy. If you like the content, make sure to leave a like, subscribe if you haven't already, and leave a comment down below with any questions, comments, criticisms, or concerns, and I will see you in the next video.

Oh, really quick before we go, let's check in on the other terminal to make sure it's actually learning. Here is the output, and you can see that as it goes along it is saving models and the score is generally trending upward over time; that's why you keep seeing the model being saved, because the score keeps beating the last best score. So we didn't make any fundamental errors in our code. If it doesn't achieve the same result for you, that isn't entirely surprising, because there is a significant amount of run-to-run variation, but the code is functional. It's up on my GitHub if you want it; go ahead and check it out, I'll leave a link in the description. See you in the next one.
Info
Channel: Machine Learning with Phil
Views: 18,794
Keywords: actor critic explained, actor critic algorithm, actor critic methods, actor critic tensorflow2, actor critic reinforcement learning, actor critic tutorial, actor critic openai gym, actor critic cartpole, actor critic model, actor critic rl, actor critic tensorflow, deep reinforcement learning, deep rl tutorial, deep reinforcement learning tutorial, actor critic deep reinforcement learning, markov decision process
Id: LawaN3BdI00
Length: 40min 46sec (2446 seconds)
Published: Wed Sep 30 2020