How to Implement Deep Learning Papers | DDPG Tutorial

Captions
What is up, everybody! In today's video we're going to go from the paper on deep deterministic policy gradients all the way to a functional implementation in TensorFlow, so you're going to see how to go from a paper to a real-world implementation all in one video. Grab a snack and a drink, because this is going to take a while. Let's get started.

The first step in my process really isn't anything special: I just read the entirety of the paper, starting of course with the abstract. The abstract tells you what the paper is about at a high level; it's an executive summary. The introduction is where the authors pay homage to other work in the field and set the stage for what's going to be presented, as well as the need for it. The background expands on that, gives a bit of mathematics, and contains a lot of useful information. It won't say much about implementation details, but it does set the stage for the mathematics you're going to be implementing, which is of course critical for any deep learning, or in this case deep reinforcement learning, paper.

The algorithm section is where the meat of the problem is: that's where they lay out the exact steps you need to take to implement the algorithm, which is why it's titled that way, so it's the section we want to read most carefully. They typically also give a table that outlines the actual algorithm in pseudocode. If you're not familiar with the term, pseudocode is just an English representation of computer code, and it's commonly used when outlining a procedure in a paper. Often, if I'm in a hurry, I'll jump straight to that table, since I've done this enough times that I can read it, and then work backward through the paper to see what I missed.

The paper then talks about performance across a whole host of environments, and what they all have in common is that they are continuous control problems. That means the action space is a vector whose elements can vary along a continuous real number line, instead of discrete actions like zero, one, two, three, four, or five. That is the real motivation behind deep deterministic policy gradients: it allows us to use deep reinforcement learning to tackle these kinds of problems.

In today's video we're going to tackle the pendulum swing-up, also called the pendulum problem. The reason is that while it would be awesome to start with something like the bipedal walker, you never want to start at maximum complexity; you want to start with something very small and scale your way up. You're going to make mistakes, and it's quickest and easiest to debug simple environments that execute quickly. The pendulum problem has, I think, three elements in its state vector and only a single action — or maybe two, I forget — but either way it's a very small problem relative to something like the bipedal walker or many of the other environments. You could also use the continuous version of the cart pole or something like that; that would be perfectly fine. I've just chosen the pendulum because we haven't done it before.

The paper also gives a bunch of plots of the performance of the algorithm under various sets of constraints
and different implementations, so you can get an idea of how it performs. It's always important to look at the plots, because they give you a lot of information visually; it's much easier to gather information from plots than from text. One thing you notice right away is that the scores are on a scale of 1, so it's relative performance, and you have to read the paper to know relative to what. I don't like that particular approach. They present similar data in table form, and there you see a whole bunch of environments, a broad variety, because they wanted to show that the algorithm has a wide arena of applicability. That's a typical technique in papers: the authors want to show the work is relevant. If they only showed a single environment, readers would say, well, that's all well and good, you can solve one environment, but what about these dozen others? And part of the motivation behind reinforcement learning is generality: can we model real learning in biological systems in a way that mimics the generality of biological learning?

Another thing you notice right away is that these numbers are not actual scores. That's something I take note of, and it makes me raise an eyebrow: you have to wonder why the authors would express scores as a ratio. There are a couple of possible reasons. One is to make all the numbers look uniform — readers may not be familiar with each environment and wouldn't know what a good score is, which is a perfectly valid reason. Another possibility is to hide poor performance. I don't think that's going on here, but it does make me raise an eyebrow whenever I see it. One exception is TORCS, an open-source race car simulator environment; I don't know if we'll get to that on this channel — it would be a pretty cool project, but it would take me a few weeks to get through. So right away you see they have a whole bunch of environments, the scores are all relative to 1, and 1 is the score the agent gets with a planning algorithm, which they detail later on.

Those are the results. They also have a related-work section, which covers other similar algorithms and their shortcomings — you never want to talk up other algorithms; you want to talk up your own to make it sound good, otherwise why would you be writing the paper in the first place — and of course a conclusion that ties everything together, plus references. I don't usually go deep into the references. If there's something I feel I really need to know, I may look one up, but I don't typically bother. If you were a PhD student it would behoove you to go through the references, because you'd need to be an absolute expert on the topic, but we're just hobbyists and YouTubers here, so I don't go into too much depth with the background material.

The next most important part of the paper is the experimental details. That's where they give the parameters and architectures for the networks, so if you saw my previous video where I implemented DDPG in PyTorch on the continuous lunar lander environment, this is where I got most of that. It was almost identical, with a little bit of tweaking; I left out some things, but pretty much all of it came from here, in particular the
hidden layer sizes of 400 and 300 units, as well as the initialization of the parameters from uniform distributions over the given ranges.

So, to recap, that was a quick overview of the paper, just showing my process and what I look at. The most important parts are the details of the algorithm and the experimental details. As I said, I gloss over the introduction because I already understand the motivation. It basically tells us that you can't handle continuous action spaces with deep Q-networks, which we already know, and that you could discretize the action space, but then you end up with a whole boatload of actions — what is it, 2,187 actions in their example — so it's intractable anyway. Then they say what they present: a model-free, off-policy algorithm. And then you come to the section that says the network is trained off-policy with samples from a replay buffer to minimize correlations — very good — and trained with a target Q-network to give consistent targets during temporal difference backups, and that this work makes use of the same ideas along with batch normalization. That's a key chunk of text, and it's why you want to read the whole paper: sometimes they embed things you might not otherwise catch.

As I read the paper, I take notes. You can do this on paper or in a text document; in this case we'll do it in the editor so I can show you what's going on, and it's a natural place to put this stuff because that's also where the code gets implemented. So let's hop over to the editor and you'll see how I take notes.

Right off the bat, we always want to be thinking in terms of what sort of classes and functions we will need. The paper mentioned a replay buffer as well as a target Q-network. We don't really know yet what the target Q-network will look like, but we can write it down: we'll need a replay buffer class and a class for a target Q-network. Now, I'd assume that if you're implementing a paper of this difficulty you're already familiar with Q-learning, where the target network is just another instance of a general network class, and the difference between the target and evaluation networks is the way in which you update their weights. So right off the bat we know we're going to have at least one network class, and if you know something about actor-critic methods, you'll know we'll probably have two different classes, one for the actor and one for the critic, because those two architectures are generally a little different. But what do we know about Q-networks? They are state-action value functions, not just value functions. The critic in most actor-critic methods is a state value function, whereas here we have a Q-network, which is a function of the state and the action, so we know right away it's not quite the same as the usual critic. They also said we'll use batch normalization, which is a way of normalizing inputs to prevent divergence in a model — I think it was introduced around 2014 or 2015 — so we'll need that in our network as well. So we know at
least a little bit about what the network is going to look like. Let's go back to the paper and see what other bits of information we can glean from the text before we look at the algorithm.

Reading along, they say a key feature of the approach is its simplicity: it requires only a straightforward actor-critic architecture with very few moving parts. Then they talk it up and say it can learn policies that exceed the performance of the planner, in some cases even when learning from pixels, which we won't get to in this particular implementation. Okay, no other real nuggets there.

The background covers the mathematical structure of the algorithm, which is really important if you want an in-depth knowledge of the topic. If you already know the background, you'll know the formula for discounted future returns; you should, if you've implemented a bunch of reinforcement learning algorithms. If you haven't, definitely read through this section to get the full motivation behind the mathematics. They also note that the action-value function is used in many algorithms, which we know from deep Q-learning, and they describe the recursive relationship known as the Bellman equation, which we know as well.

The next nugget is this: if the target policy is deterministic, it can be described as a function mu, and you'll see in the remainder of the paper — in the algorithm, for instance — that they do indeed use this function mu. That tells us right away that our policy is going to be deterministic. You could probably guess that from the title, deep deterministic policy gradients, but what does it mean exactly? A stochastic policy maps states to probabilities of taking each action: you input a state, out comes a probability distribution over actions, and you select an action according to that distribution. That bakes a solution to the explore-exploit dilemma right into the policy, so long as the probability of taking each action stays finite for every state; there's always some element of exploration. Q-learning handles the explore-exploit dilemma with epsilon-greedy action selection, where a random parameter tells you how often to pick a random action, and you pick the greedy action the rest of the time. Policy gradient methods don't work that way; they typically use a stochastic policy. But here we have a deterministic policy, so you have to wonder right away: how are we going to handle the explore-exploit dilemma?

Let's go back to the text editor and make a note of that: the policy is deterministic — how do we handle explore-exploit? That's a critical question, because if you only ever take what are perceived to be the greedy actions, you never get good coverage of the parameter space of the problem, and you'll converge on a suboptimal strategy. So this is a question the paper has to answer. Let's head back and see how they handle it.

Back in the paper, you can see that the reason they introduce the deterministic
policy is to avoid an inner expectation — or maybe that's just a byproduct; I guess it's not accurate to say it's the reason they do it. What matters is this: because the expectation depends only on the environment, it's possible to learn Q to the mu — Q under the deterministic policy mu — off-policy, using transitions generated from a different stochastic behavior policy beta. So right there we have off-policy learning, stated explicitly, with a stochastic policy. We're actually going to have two different policies, and that already answers the question of how we get from a deterministic policy to a solution of the explore-exploit dilemma: we use a stochastic policy to learn the purely deterministic, greedy policy. They then talk about the parallels with Q-learning, because there are many between the two algorithms, and you get to the loss function, which is critical to the algorithm, along with its y_t target.
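Since that loss and its y_t target come back in the algorithm section, here is roughly how the paper writes them in its notation (the Bellman recursion for the deterministic policy, and the critic loss):

```latex
Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1}}\big[\, r(s_t, a_t) + \gamma\, Q^{\mu}\big(s_{t+1}, \mu(s_{t+1})\big) \big]

L(\theta^{Q}) = \mathbb{E}\big[\, \big( Q(s_t, a_t \mid \theta^{Q}) - y_t \big)^2 \big],
\qquad
y_t = r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1}) \mid \theta^{Q}\big)
```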
Then they talk about what Q-learning has been used for, make mention of deep neural networks — which is of course what we're going to use; that's where the "deep" comes from — and discuss the Atari games, which we've covered on this channel as well. Importantly, they call out the two changes introduced with deep Q-learning, the replay buffer and the target network, which they already mentioned before; they're reiterating and reinforcing what they said earlier, and that's exactly why we read the introduction and background material: to get a solid idea of what's coming.

Now we get to the algorithm section, where all the magic happens. They reiterate that it's not possible to apply Q-learning directly to continuous action spaces, for obvious reasons: you have an infinite number of actions, and that's a problem. Then they present the deterministic policy gradient algorithm, which we won't go too deep into; we don't want a full doctoral dissertation here, we just want to know how to implement it and get moving. They give the gradient of the performance objective J in terms of the gradient of Q, the state-action value function, and the gradient of the deterministic policy mu. The thing to note is that these are gradients with respect to two different things: the gradient of Q is with respect to the actions, evaluated where the action a equals mu(s_t). That tells you Q isn't just a function of the state; it's intimately tied to the policy mu — the action isn't chosen by an argmax, it's the output of the other network. The update for mu is the gradient with respect to its own weights, which is what you'd expect. They also mention another algorithm, NFQCA — I honestly don't know much about it, some minibatch variant, blah-de-blah — and then state their contribution: modifications to DPG, inspired by the success of DQN, which allow it to use neural network function approximators to learn in large state and action spaces online. They call it DDPG. Very creative. As they say, they again use a replay buffer to address the issue of correlations between samples generated on subsequent steps within an episode: a finite-sized cache of transitions sampled from the environment.

We know all of this already, but if you don't, what you need to know is that a transition is a state, action, reward, and new state: the agent started in some state s, took some action, received some reward, and ended up in some new state. Why is that important? Because in anything that isn't dynamic programming, you're really trying to learn the state transition probabilities — the probability of going from one state to another and receiving some reward along the way. If you knew all of those beforehand, you could simply solve a (very, very large) set of equations to arrive at the optimal solution: if I start in this state and take this action, I end up in that state with certainty, so which state gives me the largest reward? You could construct an algorithm for traversing that set of equations to maximize reward over time. Of course, you usually don't know those probabilities, and that's the point of the replay buffer: to learn them through experience, by interacting with the environment.

It says that when the replay buffer is full, the oldest samples are discarded — that makes sense, it's a finite size, it doesn't grow indefinitely — and at each time step the actor and critic are updated by sampling a minibatch uniformly from the buffer, exactly as in Q-learning: a uniform random sample of the buffer is used to update the actor and critic networks. What's critical, combining this with the previous paragraph, is that when we write our replay buffer class it must sample states at random. You don't want to sample a sequence of consecutive steps, because there are large correlations between them, as you might imagine, and those correlations can trap you in little nooks and crannies of parameter space and make the algorithm go wonky. Sampling uniformly means you sample across many different episodes and get a good sense of the breadth of the parameter space, to use loose language.

Then it says that directly implementing Q-learning with neural networks proved unstable in many environments, so they use a target network, but modified for actor-critic with soft target updates rather than directly copying the weights. In Q-learning we copy the weights from the evaluation network straight into the target network; here they create a copy of the actor and critic networks, mu prime and Q prime respectively, that are used for calculating the target values, and the weights of these target networks are updated by having them slowly track the learned networks: theta prime gets tau times theta plus one minus tau times theta prime, with tau much, much less than one. This means the target values are constrained to change slowly, greatly improving the stability of learning. That's our next little nugget, so let's head over to the text editor and make a note of it — not in caps, we don't want to shout: we have two actor and two critic networks, a target for each, and the updates are soft, according to theta prime = tau * theta + (1 - tau) * theta prime.
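To make that rule concrete, here's a minimal sketch of how the soft update could be wired up in TensorFlow 1.x-style code. It assumes target_params and eval_params are parallel lists of tf.Variables, for example gathered with tf.trainable_variables scoped by network name, as we'll do later; the function name is just illustrative.

```python
import tensorflow as tf

def build_soft_update_ops(target_params, eval_params, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta', one assign op per variable
    return [t.assign(tau * e + (1.0 - tau) * t)
            for t, e in zip(target_params, eval_params)]
```

Running those ops once per learning step (with sess.run) gives you the slowly tracking targets the paper describes.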
So that's the update rule for the parameters of our target networks, and since we have two target networks, one for the actor and one for the critic, we have a total of four deep neural networks. This is why the algorithm runs so slowly — even on my beastly rig it's quite slow, even in the continuous lunar lander environment. I've done the bipedal walker, and it took about 20,000 games to get something approximating a decent score, and those 20,000 games took about a day to run. So it's a very slow algorithm, but nonetheless quite powerful, and it's one of the few tools we have for deep reinforcement learning in continuous control environments, so beggars can't be choosers. Just to recap: we're going to use four networks, two of them on-policy and two off-policy, and the updates are going to be soft, with tau much less than one. If you're not familiar with the notation, the double less-than or double greater-than sign means "much less than" or "much greater than" respectively, so tau is going to be of order 0.01 or smaller — 0.1 isn't much smaller than one; 0.01 I would consider much smaller. We'll see in the details what value they actually use, but it's of that order, and the reason is to let the target updates happen very slowly and get good convergence, as they say in the paper.

Let's head back to the paper and see what other nuggets we can glean before getting to the outline of the algorithm. In the very next sentence they say that this simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning, a problem for which robust solutions exist, and that they found having both a target mu prime and a target Q prime was required to have stable targets and to consistently train the critic without divergence. This may slow learning, since the target networks delay the propagation of value estimates, but in practice they found this was always greatly outweighed by the stability of learning. I found that as well: you don't get a whole lot of divergence, but it does take a while to train.

Then they talk about learning in low-dimensional and high-dimensional environments, and they do that to motivate feature scaling. The problem is the range of the state variables: in an environment like the mountain car, the positions run from roughly -1.6 to 0.4, something like that, while the velocities are plus or minus 0.07, so you have about a two-order-of-magnitude variation in the parameters within a single environment, which is already kind of large. Compare that to other environments where parameters can be on the order of hundreds, and you can see there's a pretty big issue with the scaling of the inputs to the neural network — and we know from experience that neural networks are highly sensitive to the relative scaling of their inputs. Their solution is to scale the features so they're similar across environments and units, and they do that using batch normalization. This technique normalizes each dimension across the samples in a minibatch to have unit mean and variance, and it also maintains a running
average of the mean and variance that is used for normalization during testing. During exploration and evaluation — in our case, training and testing — things are a little different from supervised learning. In supervised learning you maintain separate datasets, or shuffled subsets of a single dataset, for training and evaluation, and in the evaluation phase you perform no weight updates; you just see how the model does based on its training. In reinforcement learning you can do something similar: train the agent for some number of games to achieve some level of performance, then turn off learning and let it choose actions based on whatever policy it has learned. If you're using batch normalization — in PyTorch in particular — there are significant differences in how it behaves in the two phases, so you have to be explicit about setting training or evaluation mode. PyTorch doesn't track the running statistics in evaluation mode, which is why, when we wrote the DDPG algorithm in PyTorch, we had to call the eval and train functions so often.

Okay, so we've established that we'll need batch normalization, and everything is starting to come together: we need a replay buffer, batch normalization, and four networks — two actor and two critic, half of them used on-policy and half as the off-policy targets. Scrolling down, it says a major challenge of learning in continuous action spaces is exploration, and an advantage of off-policy algorithms such as DDPG is that the problem of exploration can be treated independently from the learning algorithm. They construct an exploration policy mu prime by adding noise sampled from a noise process N to the actor policy. So right here it's telling us what the behavior policy is: the exploration policy is basically mu plus some noise N. N can be chosen to suit the environment, and as detailed in the supplementary materials they used an Ornstein-Uhlenbeck process to generate temporally correlated exploration, for exploration efficiency in physical control problems with inertia. Inertia, if you're not familiar, just means the tendency of stuff to stay in motion, so this has to do with environments that move, like the walkers, the cheetahs, the ants. So we've got another note to add to our text editor; let's head back over there and write it down: the exploration policy is just the evaluation actor, for lack of a better word, plus some noise process, and they used Ornstein-Uhlenbeck noise (I don't think I spelled that correctly; we'll need to look it up — actually, I've already looked it up). My background is in physics, so it made sense to me: it's a noise process that models the motion of Brownian particles, which are particles that move around under the influence of their interactions with other particles in some kind of medium, like a perfect fluid, and in that case the noise is temporally correlated, meaning each time step is related to the time step before it.
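To make the exploration side concrete, here's a minimal sketch of action selection with additive noise. It assumes a noise object like the one we'll write shortly and an actor with a predict-style method (both names are illustrative, not from the paper), and the clipping to the action bounds is my own addition, since the environment still has hard limits.

```python
import numpy as np

def choose_action(actor, noise, state, action_bound):
    # behavior policy: mu(s) from the online actor plus temporally
    # correlated OU noise, clipped to the environment's action limits
    mu = actor.predict(state[np.newaxis, :])[0]
    return np.clip(mu + noise(), -action_bound, action_bound)
```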
I hadn't thought about it before, but that temporal correlation is probably relevant to Markov decision processes: in MDPs the current state depends only on the prior state and the action taken — you don't need the full history of the environment — so I wonder whether there's some underlying physical reason the noise was chosen this way. That's just a question off the top of my head; I don't know the answer, so if someone does, drop it in the comments, I'd be very curious to see it.

We have enough nuggets now, so to summarize: we'll need a class for the noise, a class for the replay buffer, a class for the target Q-network, and we're going to use batch normalization. The policy will be deterministic, which in practice means the policy outputs the actual actions instead of a probability of selecting actions, and the policy will be limited by whatever the action space of the environment is, so we'll need a way to bound the actions to the environment's limits. Of course, these notes don't make it into the final code; they're just the things you think of as you read the paper, along with any questions you have. I don't have questions, since I've already implemented it, but this is my thought process as I went through it the first time, as best as I can reconstruct it after having finished the problem. You could also use a sheet of paper — there's some kind of magic about writing things down on paper — but we're going to use the code editor, because I don't want to use an overhead projector to show you a freaking sheet of paper; this isn't grade school.

So let's head back to the paper and take a look at the actual algorithm, to get a real sense of what we're going to implement. The results aren't super important to us yet; we'd use them later if we wanted to debug the model's performance, but the fact that they're expressed relative to a planner makes that difficult. Scrolling down to the data: one thing I didn't mention earlier is the stipulation on the performance numbers. It says performance after training across all environments for at most 2.5 million steps. As I said, I had to train the bipedal walker for around twenty thousand games, which I think works out to around two and a half million steps, maybe closer to three million, something like that. They report both the average and the best observed performance across five runs. Why five runs? Not because it's a bad algorithm — this isn't a slight on their work — but because there's an element of chance involved. One problem with deep learning is replicability: it's hard to replicate other people's results, particularly if system clocks are used as seeds for the random number generators. Seeding the random number generator from the system clock guarantees that if you run the simulation even a millisecond later, you're going to get different results, because you're starting with a different set of parameters. You'll get qualitatively similar results — you'll be able to reproduce the general idea of the experiments — but you won't get the exact same numbers. It's a common objection to the whole deep learning phenomenon, and it
makes it feel a bit unscientific, but whatever — it works, it's had enormous success, so we won't quibble about semantics or philosophical problems. For our purposes we just need to know that even the people who invented the algorithm had to run it several times to get an idea of what would happen, because the algorithm is inherently probabilistic, and so they report averages and best-case scenarios. That's another little tidbit. They also include results for both the low-dimensional case, where the agent receives just a state vector from the environment, and the pixel-input case. We won't be doing pixel inputs in this particular video, but maybe we'll get to them later; I'm working on that as well. So those are the results; other than noting that the method is probabilistic and takes multiple runs, they're not really our concern at the moment.

Now we have answers to all our questions: we know how we're going to handle the explore-exploit dilemma, the purpose of the target networks, how we're going to handle the noise, how we're going to handle the replay buffer, and what the policy actually outputs — the actual actions the agent is going to take. So it's time to look at the algorithm and see how we fill in the details.

First: randomly initialize a critic network and an actor network with weights theta-super-Q and theta-super-mu. This is handled by whatever library you use — you don't have to initialize weights manually — but we do know from the supplementary materials that they constrain these initializations to lie within certain ranges, so put a note in the back of your mind that you're going to have to constrain them a little. Then: initialize the target networks Q prime and mu prime with weights equal to the original networks, so theta-super-Q-prime gets initialized to theta-super-Q, and theta-super-mu-prime to theta-super-mu. So right off the bat the target networks' weights are set to the evaluation networks' weights. Then: initialize a replay buffer R. Now, this is an interesting question — how do you initialize the replay buffer? I've used a couple of different methods. You can initialize it with all zeros, and if you do that, when you perform the learning you want to make sure the number of stored memories is greater than or equal to the minibatch size of your training, so that you're not sampling the same states more than once: if you want to sample a batch of 64 memories but you only have 10 in the buffer, you're going to sample some of those memories multiple times, and that's no good. So if you initialize your replay buffer with zeros, you have to make sure you don't learn until you exit the warm-up period, where the warm-up period is just the number of steps equal to your batch sample size. Alternatively, you can initialize the buffer with actual environment play, but that takes quite a long time — the replay buffer is of order a million transitions, so filling it with random steps takes a while. I always use zeros and then just wait until the agent has filled up a minibatch's worth of memories; it's a minor detail.
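As a sketch of that warm-up logic — the class and attribute names are just illustrative of the agent we'll build, not anything prescribed by the paper:

```python
class Agent:
    # only the warm-up guard is sketched here; memory and batch_size are
    # assumed to be set up in __init__ along with the networks
    def learn(self):
        if self.memory.mem_cntr < self.batch_size:
            return  # still in the warm-up period: not enough real memories yet
        states, actions, rewards, new_states, done = \
            self.memory.sample_buffer(self.batch_size)
        # ...critic and actor updates follow from here
```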
Then it says: for some number of episodes, do — so, a for loop. Initialize a random process N for action exploration. Reading it now, I realize I actually made a small mistake here: in my previous implementation I didn't reset the noise process at the top of every episode, even though it's explicit in the pseudocode, so I must have missed that line. I've looked at other people's code, and some do it, some don't, but it worked anyway — the agent still managed to beat the continuous lunar lander environment — so is it critical? Maybe not, and I think I mentioned that in the video. Next: receive the initial state observation s1. Then, for each step of the episode, t = 1 to T: select the action a_t = mu(s_t | theta-super-mu) + N_t, according to the current policy and the exploration noise. That's straightforward: feed the state forward through the network, receive the vector output of the action, and add some noise to it. Execute the action, observe the reward and the new state — simple — and store the transition (old state, action, reward, new state) in the replay buffer R. Then, at each time step, sample a random minibatch of N transitions from the replay buffer and use it to set y_i, where i indexes the elements of that minibatch. You can loop over the batch or do a vectorized implementation; looping is more straightforward, and I always opt for the most straightforward, not necessarily most efficient, way of doing things the first time through, because you want to get it working first and worry about efficiency later. So set y_i equal to r_i plus gamma — gamma being your discount factor — times Q prime of the new state s_{i+1}, where the action is chosen according to mu prime, given the weights theta-super-mu-prime and theta-super-Q-prime. What's important here, and it isn't immediately clear the first time you read it, is that this action must be chosen according to the target actor network — so Q prime is a function of the state and of the output of another network. That's a very important detail. Then: update the critic by minimizing the loss, which is basically the mean of (y_i minus the output of the actual Q-network) squared, where the a_i fed into that Q-network are the actions you actually took during the episode — they come from the replay buffer — while the actions inside y_i are chosen by the target actor. So for each learning step you're going to do a feed-forward pass of not just the target critic network, but also the target actor network, as well as the evaluation critic network. I hope I said that right: the target critic, the target actor, and the evaluation critic.
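Here's a rough sketch of that target calculation and which networks get a forward pass. The predict-style helpers on the network classes are assumed interfaces for the sake of the sketch, not anything from the paper.

```python
import numpy as np

def critic_targets(rewards, new_states, terminal, gamma,
                   target_actor, target_critic):
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) * terminal_i,
    # where terminal_i already stores 1 - int(done) from the replay buffer
    target_actions = target_actor.predict(new_states)           # mu'(s_{i+1})
    q_next = target_critic.predict(new_states, target_actions)  # Q'(s_{i+1}, ...)
    y = [r + gamma * q * t
         for r, q, t in zip(rewards, q_next.flatten(), terminal)]
    return np.reshape(y, (len(y), 1))
```

The evaluation critic is then trained toward y using the buffer's actions a_i as its action input.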
Then it says: update the actor policy using the sampled policy gradient. This is the hardest step in the whole thing, the most confusing part. The gradient is 1/N times a sum — so a mean; whenever you see 1/N times a sum, that's a mean — of the gradient with respect to the actions of Q, where the actions are chosen according to the policy mu evaluated at the current states s, times the gradient with respect to the weights of mu, where you just feed in the set of states. That will be a little tricky to implement, and it's part of the reason I chose TensorFlow for this particular video: TensorFlow lets us calculate gradients explicitly. In PyTorch, you may have noticed, all I did was make Q a function of the current state and the actor network's output, and let PyTorch handle the chain rule — because this is effectively a chain rule. Let's scroll up in the paper for a second, because this gave me pause the first ten times I read it. If you scroll up, you see the exact same expression appears earlier, in reference to the gradient with respect to the weights theta-super-mu of Q(s, a), where the action a is chosen according to the policy mu. Really, this is the chain rule from calculus: the gradient is proportional to the gradient of one quantity times the gradient of the other. The two forms are equivalent, and it's perfectly valid to use either; in the PyTorch implementation we did that version, and today we're going to do this particular version, so that's good to know.
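For reference, here is that sampled policy gradient written out the way the paper gives it:

```latex
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i}
  \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_i,\, a = \mu(s_i)}
  \;\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_i}
```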
All right — the next step: on each time step you update the target networks according to the soft update rule, so theta-super-Q-prime gets tau times theta-super-Q plus one minus tau times theta-super-Q-prime, and likewise for theta-super-mu-prime, and then you just end the two loops. In practice this looks very simple, but what do we know off the bat? We need a class for our replay buffer, a class for our noise process, a class for the actor, and a class for the critic. You might think the actor and critic could share a class, but when you look at the details, which we'll get to in a minute, you realize you need two separate classes, so that's at least one class for each of the deep neural networks. And I always add an agent class on top, as an interface between the environment and the deep neural networks — so that's four, and we'll end up with five.

Now that we know the algorithm, let's look at the supplementary information to see precisely the architectures and parameters used. Scrolling down to the experimental details: they used Adam for learning the neural network parameters, with learning rates of 10^-4 and 10^-3 for the actor and critic respectively. For Q — the critic, and only the critic, not mu — they included L2 weight decay of 10^-2, and they used a discount factor gamma of 0.99. That gamma is pretty typical, but the critic-only weight decay is an important detail. For the soft target updates they used tau = 0.001, one part in a thousand, which is indeed much less than one. The networks used the rectified linear non-linearity for all hidden layers, and the final output layer of the actor was a tanh layer, to bound the actions. Tanh goes from -1 to +1, so in environments whose bounds are, say, plus or minus 2, you're going to need a multiplicative factor; that doesn't change the tanh itself, it just means there's a multiplicative factor related to your environment. The low-dimensional networks had two hidden layers with 400 and 300 units respectively, about 130,000 parameters, and the actions were not included until the second hidden layer of Q. So when you compute the critic function Q, you don't pass the action in from the very beginning; you include it as a separate input at the second hidden layer — a very important implementation detail. When learning from pixels they used three convolutional layers followed by two fully connected layers, which we don't need right now since we're not using pixels yet. The final layer weights and biases of both the actor and critic were initialized from a uniform distribution of plus or minus 3 x 10^-3 for the low-dimensional case, to ensure the initial outputs of the policy and value estimates were near zero, and the other layers were initialized from a uniform distribution of plus or minus 1 over the square root of f, where f is the fan-in of the layer — fan-in is just the number of input units. It also says the actions were not included until the fully connected layers, which is for the convolutional case. Here, I'll admit, I experienced some confusion: since they specify fully connected layers, I'm guessing they're talking about the pixel case, because otherwise every layer is fully connected and it wouldn't make sense to single them out. So is the statement referring to both the state-vector and pixel cases, or just the pixel case? I'm going to interpret it the way I originally did, because it seemed to work, but there's ambiguity there, and it's a good example of how reading papers can be confusing at times — the wording isn't always clear. Or maybe I'm just tired; I've been rambling for about 50 minutes and my brain is turning to mush, which is quite probable, actually. Anyway: they trained with minibatch sizes of 64 for the low-dimensional problems and 16 for pixels, with a replay buffer of size 10^6, and for the exploration noise process they used temporally correlated noise, in order to explore well in physical environments that have momentum: an Ornstein-Uhlenbeck process with theta = 0.15 and sigma = 0.2. The paper describes what that process does; all well and good.

So these are the implementation details we need: 400 and 300 units for our hidden layers; the Adam optimizer with 10^-4 for the actor and 10^-3 for the critic; an L2 weight decay of 10^-2 for the critic; a discount factor gamma of 0.99; a soft update factor of 0.001; and initializations proportional to 1 over the square root of the fan-in for the lower layers and plus or minus 0.003 for the final output layers of our fully connected networks. That's a lot of details, but we now have everything we need to start implementing the paper, and it only took us about 50 minutes — much shorter than when I first read it, which took me quite a while.
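Collected in one place, the supplementary details we just went through look something like this; the dictionary and its key names are just a convenient way to hold them, not anything from the paper.

```python
# Hyperparameters from the paper's supplementary details (low-dimensional case)
DDPG_HPARAMS = {
    "actor_lr": 1e-4,
    "critic_lr": 1e-3,
    "critic_l2_weight_decay": 1e-2,    # applied to Q only, not mu
    "gamma": 0.99,                     # discount factor
    "tau": 0.001,                      # soft target update rate
    "fc1_units": 400,
    "fc2_units": 300,
    "final_layer_init": 3e-3,          # uniform(-3e-3, 3e-3) on output layers
    "hidden_layer_init": "uniform(-1/sqrt(fan_in), 1/sqrt(fan_in))",
    "batch_size": 64,                  # 16 when learning from pixels
    "replay_buffer_size": int(1e6),
    "ou_theta": 0.15,
    "ou_sigma": 0.2,
}
```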
So let's head back up to the algorithm and keep it on screen as a reference for the remainder of the video, because it's quite critical, and then head over to our code editor and start coding this up. We'll start with what is probably one of the most confusing aspects of the problem: the Ornstein-Uhlenbeck action noise. You can do a Google search for it and you'll find a Wikipedia article that covers the physical process behind it with a lot of mathematical derivation, which isn't particularly helpful here. If you want to be a physicist I invite you to read it — it's got some pretty cool stuff, and it took me back to my grad school days — but our mission at the moment is to find a code implementation we can use. If you search for "Ornstein-Uhlenbeck github", looking for someone's example, you end up with a nice one in the OpenAI Baselines library that shows the precise form of it. In their repository there's a whole class for this, and it looks like it does exactly what we want: there's an x-previous term, a delta term, and a dt term, so it appears to create the temporal correlations through that x-previous term; there's a reset function to reset the noise, which we may want to use; and there's a __repr__ method, which we'll skip because it's not critical for this application, though it's a nice touch if you're writing a library, as they were.

So let's code that up in the editor and tackle our first class. I'm going to leave the notes at the top of the file for now; they do no harm, they're just comments. The first things we need are the imports: import numpy as np, and import tensorflow as tf, since we know we'll need TensorFlow; we may also need something like os to handle model saving, so we can import that as well. Fun fact: it's considered good practice to import your standard library packages first, followed by your third-party packages, followed by your own code, each group in alphabetical order. Those are all the imports we need to start, so let's write the class. We'll call it OUActionNoise, derived from the base object. The initializer takes a mu; a sigma and a theta — per the paper, theta is 0.15 and sigma is 0.2; a dt term of something like 1 x 10^-2; and an x0 that defaults to None. If you have any doubts about these, go check the OpenAI Baselines implementation; it's probably correct, I'll give them the benefit of the doubt. Save your parameters as usual — mu, theta, dt, sigma, and x0 — and call the reset function at the end of the initializer. They also override the __call__ method, which lets you write noise = OUActionNoise(...) and then, whenever you want a sample, just call noise() with parentheses; that's what overriding __call__ does, a good little tidbit to know. Inside __call__ we implement the equation they gave us: x equals self.x_prev plus self.theta times (self.mu minus self.x_prev) times self.dt, plus self.sigma times np.sqrt(self.dt) times np.random.normal with the size of mu. Then you set x_prev to the current value of x — that's how you create the temporal correlations — and return the noise value. We don't have a value for x_prev to start with, so we set it in the reset function, which takes no parameters: self.x_prev equals self.x0 if self.x0 is not None, else np.zeros_like(self.mu). And that's it for the noise class.
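Putting that together, here's a sketch of the noise class as just described; the sigma and theta defaults follow the paper's values, and the dt default is the one mentioned above. mu is expected to be a numpy array, e.g. np.zeros(n_actions).

```python
import numpy as np

class OUActionNoise(object):
    """Temporally correlated (Ornstein-Uhlenbeck) noise for exploration."""
    def __init__(self, mu, sigma=0.2, theta=0.15, dt=1e-2, x0=None):
        self.mu = mu
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        # x_{t+1} = x_t + theta*(mu - x_t)*dt + sigma*sqrt(dt)*N(0, 1)
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
            self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x   # remember the sample to create the temporal correlation
        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
```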
So that's one class down — we've taken care of the noise — and now we can move on to the replay buffer. This will be similar to things I've implemented in the past. There are many ways of doing it: many people use the built-in Python data structure called a deque (a double-ended queue), basically a queue you fill up over time, and that's perfectly valid. I prefer a set of numpy arrays, because that lets us tightly control the data types of what we're saving. For this pendulum environment it doesn't really matter, but as you get deeper into this field you'll see that you end up saving things of varying sizes — if you're saving images from one of the Atari environments, or a MuJoCo environment (I hope I pronounced that correctly), the memory can explode quickly and eat into your RAM — so the ability to choose the underlying representation, single or double precision for your floats and integers, is critical for memory management, as well as for taking advantage of optimizations on NVIDIA GPUs of the Turing generation and above. So I always use numpy arrays; it's a clean implementation that allows manipulation of the data types. Use a deque if you want; it's perfectly valid.

Our ReplayBuffer class gets its own initializer, of course, and we pass in a maximum size, the input shape, and the number of actions, because we have to store the state, action, reward, and new-state tuples. We're also going to store the done flag, so we'll have an extra array for that, and the reason is intimately related to how the Bellman equation is calculated: at the end of an episode the agent receives no further rewards, so the expected discounted future reward from the terminal state is identically zero. That means you have to multiply the value of the next state by zero when the episode has ended, so you don't take into account expected future rewards that don't exist. That's why we need the done flag — that's all I wanted to say there. So we save our parameters, set self.mem_cntr = 0 to keep track of the position of our most recently saved memory, then create the state memory, np.zeros((mem_size, *input_shape)); the new state memory, same deal, exactly the same; and the action memory, np.zeros((mem_size, n_actions)) — we didn't save n_actions as a member, so we'll just refer to it as n_actions.
We also have the reward memory, and since the reward is just a scalar, that only gets shape mem_size. We also need the terminal memory, also of shape mem_size, which I gave dtype np.float32 — if I recall correctly that was due to the data types in the PyTorch implementation; it's probably not necessary here in TensorFlow, but I left it the same way to be consistent, and it doesn't hurt anything. Next we need a function to store a transition, which takes a state, action, reward, new state, and done flag as input. The index where we store the memory is the memory counter modulo the memory size, so once the counter grows past mem_size it just wraps around: with a buffer of a million, you go from position zero all the way up to 999,999, and at a million it wraps back around to 0, then 1, then 2, and so on. That way you're overwriting the earliest memories in the array with the newest ones, precisely as described in the paper. If you're using the deque method, I believe you'd just pop items off the left — don't quote me, I haven't really used that implementation, but from what I've read that's how it operates. So: state_memory[index] = state, new_state_memory[index] = state_, reward_memory[index] = reward, and action_memory[index] = action — keep in mind that the actions here are arrays themselves, so the action memory is an array of arrays; keep that in the back of your mind so you can visualize the problem we're solving. Next up is the terminal memory, and the little twist is that we store 1 minus int(done). Done is either True or False, and you don't want to count rewards after the episode has ended, so when done is True you want to multiply by zero: 1 minus int(True) is 1 minus 1, which is 0, and when the episode isn't over it's 1 minus 0, which is 1 — precisely the behavior we want. Finally, increment mem_cntr by one every time you store a new memory.

Next we need a function to sample the buffer, and we'll pass in a batch size. Alternatively you could make batch_size a member variable of this class; it's not a big deal, I just chose to do it this way. What we want is to sample memories from the zeroth position up to the last filled memory: if the buffer isn't full yet, sample from 0 to mem_cntr, otherwise from the whole interval. So max_mem equals the minimum of mem_cntr and mem_size. The reason you can't just use mem_cntr is that it keeps growing beyond mem_size, and if you tell numpy to choose from a range larger than the array, you'll try to access elements that aren't there and it will throw an error; that's why you need this step. Then take a random choice from 0 to max_mem, of size batch_size, gather the corresponding entries from the respective arrays — the states, new states, actions, rewards, and terminal batch — and return the states, actions, rewards, new states, and terminal flags. Okay, so now we are done with the replay buffer class.
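Here's a sketch of the buffer class as described, assuming input_shape is a tuple or list (e.g. the environment's observation shape):

```python
import numpy as np

class ReplayBuffer(object):
    """Fixed-size replay memory backed by numpy arrays, as described above."""
    def __init__(self, max_size, input_shape, n_actions):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.state_memory = np.zeros((self.mem_size, *input_shape))
        self.new_state_memory = np.zeros((self.mem_size, *input_shape))
        self.action_memory = np.zeros((self.mem_size, n_actions))
        self.reward_memory = np.zeros(self.mem_size)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.float32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size     # overwrite oldest when full
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.terminal_memory[index] = 1 - int(done)  # 0 after terminal states
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size)
        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        new_states = self.new_state_memory[batch]
        terminal = self.terminal_memory[batch]
        return states, actions, rewards, new_states, terminal
```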
That's pretty straightforward, and if you've seen some of my other videos on deep Q-learning you've seen pretty much the same implementation — I just used to keep it inside the agent class. I'm refining my approach and getting more sophisticated over time, so it makes sense to give it its own class. We're already about 40 percent of the way there; we've got five classes in total, so that's good news. Next we have to contend with the actor and critic networks. We'll start with the actor, and keep in mind we actually have two actor networks — the actor and the target actor — plus some peculiarities in the way TensorFlow likes to do things.

So, the Actor class. In TensorFlow you don't derive from any particular class, whereas in PyTorch you derive from nn.Module. We'll need a learning rate, a number of actions, and a name — the name is there to distinguish the regular actor network from the target actor network — plus input dims and a session. TensorFlow has the construct of a session, which houses the graph and all the variables and parameters; you could give each class its own session, but it's tidier to pass a single session into each of the classes. Then the number of units for the first fully connected layer, which should be 400 if we're implementing the paper precisely, fc2_dims of 300, an action bound, a batch size defaulting to 64, and a checkpoint directory. The purpose of the checkpoint directory is to save our model — in the case of the pendulum it doesn't really matter because it's so quick to run, but in general you want a way of saving these models because training can take a long time. We save the learning rate, the number of actions, and all the other parameters we passed in. The purpose of the action bound is to accommodate environments where the action range is greater than plus or minus one: the tangent hyperbolic output only covers plus or minus one, so if your environment goes from minus two to plus two, the action bound is a multiplicative factor that makes sure you can sample the full range of actions available to your agent. We also save the checkpoint directory, and finally we call a build_network function — not defined yet, but we'll get to it.

Next, since we have to do the soft update rule for the target actor and target critic, we need a way of keeping track of the parameters in each network. We do that with self.params, which is tf.trainable_variables with a scope of self.name. We have a single session and a single graph, and there will be multiple deep neural networks within that graph; we don't want to update the critic's parameters when we're training the actor, or vice versa — we want them independent — so we scope each network with its own name. That tells TensorFlow these are totally separate sets of parameters, which also aids in copying parameters later and keeps everything nice and tidy. We'll also need a Saver object to save the model, and a checkpoint file — that's where we use os, joining the checkpoint directory with the name plus an '_ddpg' checkpoint suffix.
Scoping the save files by name like this means we won't confuse the parameters of the target actor with the actor, or the critic with the target critic — or even the actor with the critic, for that matter — which is very important.

Now, we're going to have to calculate some gradients by hand, so we need a series of operations to facilitate that. The first is the unnormalized actor gradients: tf.gradients of self.mu with respect to self.params, weighted by minus self.action_gradient. Here mu is the mu from the paper — the actual actions output by the agent — params are our network parameters, and action_gradient is the gradient of the critic with respect to the actions taken. If you go back to the algorithm in the paper, you can see we need the gradient of the critic with respect to the actions, and the gradient of mu with respect to the network weights; passing mu and the params into tf.gradients gives us the second piece, and the first piece gets fed in through the action_gradient placeholder, which we'll calculate later in the agent's learn function. That's where the minus self.action_gradient comes from. "Unnormalized" is because we still have to take one over N times the sum, so we need an operation to perform that normalization: the actor gradients have to be a list, so we cast the result as a list and map a lambda that divides each gradient by the batch size with tf.div — no big deal there. Then self.optimize is our optimization step, the Adam optimizer with our learning rate; normally you'd call minimize on a loss, but since we're calculating the gradients manually we call apply_gradients instead, applying the actor gradients to the params. I'd encourage you to go look at the TensorFlow documentation for all of these — the video is already well over an hour, so I'm not going to walk through the docs, but I had to when I was building this out.

Next we build the network, and this is where we handle the scoping: tf.variable_scope of self.name, so that every network gets its own scope. We need a placeholder for the input, a 32-bit float with shape None — that's the batch dimension — by input dims, and we give it a name; the name parameter isn't critical, it's just for debugging, so if something goes wrong you can trace where it went wrong. The action gradient is also a placeholder — the thing we'll compute in the agent's learn function — and it gets a shape of None by n_actions, because it's the gradient of Q with respect to each component of the action. Those are our two placeholders; now we get to construct the actual network.
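To make the tf.gradients trick concrete, here is a tiny, self-contained toy example — not the video's code; the one-layer "actor" and all the names are made up — showing how the grad_ys argument chains -dQ/da into dmu/dtheta, and how the 1/N normalization and apply_gradients fit together:

```python
# Toy illustration (TensorFlow 1.x) of the deterministic policy gradient plumbing.
import tensorflow as tf

BATCH_SIZE = 64
states = tf.placeholder(tf.float32, [None, 3], name='states')
dq_da = tf.placeholder(tf.float32, [None, 1], name='critic_action_gradient')

with tf.variable_scope('toy_actor'):
    mu = tf.layers.dense(states, 1, activation=tf.nn.tanh)   # deterministic policy

params = tf.trainable_variables(scope='toy_actor')

# tf.gradients(ys, xs, grad_ys) sums grad_ys * dys/dxs over the batch, so feeding
# -dQ/da turns the update into gradient *ascent* on the critic's value.
unnormalized = tf.gradients(mu, params, -dq_da)
actor_grads = list(map(lambda g: tf.div(g, BATCH_SIZE), unnormalized))  # the 1/N factor
optimize = tf.train.AdamOptimizer(1e-4).apply_gradients(zip(actor_grads, params))
```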
Let's handle the initialization first. f1 is the fan-in value: one divided by the numpy square root of fc1_dims. dense1 is tf.layers.dense, taking self.input as input with self.fc1_dims units, a kernel initializer of random_uniform from minus f1 to f1, and a bias initializer that's also random_uniform from minus f1 to f1 — I forgot an import, so we have to come back up to the top and say from tensorflow.initializers import random_uniform, and now we're good to go. After dense1 we do the batch normalization: batch1 is tf.layers.batch_normalization of dense1, and that doesn't get an initializer of its own. Then we activate the first layer: just the relu activation of the batch norm output. Now, it's an open debate — from what I've read online — whether you should do the activation before or after the batch normalization. I'm in the camp of doing the batch norm first and the activation after, because relu lops off everything below zero, so your statistics could get skewed positive when maybe they should be zero or negative. So I think the batch norm is probably best before the activation, and indeed this works out. It's something you can play with — fork the repo and see how much of a difference it makes for you; maybe I missed something when I tried it the other way, it's entirely possible, I miss stuff all the time.

Then f2 is one over the square root of fc2_dims, and dense2 is similar: it takes the layer-one activation as input with fc2_dims units, and I'll copy the initializers down and change f1 to f2. Then batch2 takes dense2 as input, and we activate it the same way. Finally we have the output layer, which is the actual policy of our agent — the deterministic policy. From the paper, that layer gets initialized with a value of 0.003; we call it mu, it takes the layer-two activation as input, it has n_actions output units, and its activation is the tangent hyperbolic, tanh. I'll copy the initializers again, with f3 instead of f2. Then we take into account the fact that our environment may very well require actions with magnitudes greater than one: self.mu is tf.multiply of mu and the action bound. The action bound will be something like two, and it needs to be positive so you don't flip the actions. So now we've built our network.

The next thing we need is a way of getting actual actions out of the network, so we have a predict function that takes some inputs and returns self.sess.run of self.mu with a feed dictionary mapping self.input to those inputs. That's all there is to the feed-forward — kind of an interesting contrast to how PyTorch does it with the explicit construction of a forward function; here you just run the session on the mu tensor and TensorFlow goes back and finds all the associations between the respective variables. Nice and simple.
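Putting this section together, a minimal sketch of the Actor class up to this point might look like the following. It assumes TensorFlow 1.x (tf.layers, tf.placeholder, tf.initializers.random_uniform); the training and checkpoint methods follow in the next section, and the exact argument names are my reconstruction rather than a copy of the video's code.

```python
import os
import numpy as np
import tensorflow as tf
from tensorflow.initializers import random_uniform

class Actor(object):
    def __init__(self, lr, n_actions, name, input_dims, sess, fc1_dims=400,
                 fc2_dims=300, action_bound=1.0, batch_size=64, chkpt_dir='tmp/ddpg'):
        self.lr = lr
        self.n_actions = n_actions
        self.name = name
        self.input_dims = input_dims
        self.sess = sess                       # one shared session for all networks
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.action_bound = action_bound       # rescales tanh output beyond +/- 1
        self.batch_size = batch_size
        self.checkpoint_file = os.path.join(chkpt_dir, name + '_ddpg.ckpt')

        self.build_network()
        # scope by name so actor / target actor parameters stay separate
        self.params = tf.trainable_variables(scope=self.name)
        self.saver = tf.train.Saver()

        # policy gradient: d mu / d theta chained with -dQ/da, averaged over the batch
        self.unnormalized_actor_gradients = tf.gradients(
            self.mu, self.params, -self.action_gradient)
        self.actor_gradients = list(map(lambda x: tf.div(x, self.batch_size),
                                        self.unnormalized_actor_gradients))
        self.optimize = tf.train.AdamOptimizer(self.lr).apply_gradients(
            zip(self.actor_gradients, self.params))

    def build_network(self):
        with tf.variable_scope(self.name):
            self.input = tf.placeholder(tf.float32, shape=[None, *self.input_dims],
                                        name='inputs')
            self.action_gradient = tf.placeholder(tf.float32,
                                                  shape=[None, self.n_actions],
                                                  name='action_gradients')
            f1 = 1. / np.sqrt(self.fc1_dims)
            dense1 = tf.layers.dense(self.input, units=self.fc1_dims,
                                     kernel_initializer=random_uniform(-f1, f1),
                                     bias_initializer=random_uniform(-f1, f1))
            batch1 = tf.layers.batch_normalization(dense1)
            layer1_activation = tf.nn.relu(batch1)       # batch norm before the relu

            f2 = 1. / np.sqrt(self.fc2_dims)
            dense2 = tf.layers.dense(layer1_activation, units=self.fc2_dims,
                                     kernel_initializer=random_uniform(-f2, f2),
                                     bias_initializer=random_uniform(-f2, f2))
            batch2 = tf.layers.batch_normalization(dense2)
            layer2_activation = tf.nn.relu(batch2)

            f3 = 0.003                                   # output-layer init from the paper
            mu = tf.layers.dense(layer2_activation, units=self.n_actions,
                                 activation=tf.nn.tanh,
                                 kernel_initializer=random_uniform(-f3, f3),
                                 bias_initializer=random_uniform(-f3, f3))
            self.mu = tf.multiply(mu, self.action_bound)  # scale to the env's action range

    def predict(self, inputs):
        return self.sess.run(self.mu, feed_dict={self.input: inputs})
```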
Now we need a function to train, which takes inputs and gradients — this is what performs the actual backpropagation through the network — and it runs self.optimize with a feed dictionary of self.input: inputs and self.action_gradient: gradients. That's also reasonably straightforward; let's just format it a little better. Next we need two functions to accommodate loading and saving the model. save_checkpoint prints a message — very creative — and calls self.saver.save to save the current session to the checkpoint file. load_checkpoint is the same thing in reverse: print 'loading checkpoint' and call self.saver.restore with the session and the checkpoint file. You can only call this after instantiating the agent, so that there's a default session with initialized values to load the variables from the checkpoint file into. And that's it for the Actor class — reasonably straightforward; the only real mojo is the actor gradients, those two operations that accommodate the fact that we're manually calculating the gradient of the critic with respect to the actions taken.
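Continuing the Actor sketch from above, those training and checkpoint methods might look roughly like this — these are class methods, indented inside the Actor class, and the printed messages are my own:

```python
    # Continuation of the Actor sketch above.
    def train(self, inputs, gradients):
        # `gradients` is dQ/da from the critic, fed into the action_gradient placeholder
        self.sess.run(self.optimize,
                      feed_dict={self.input: inputs,
                                 self.action_gradient: gradients})

    def save_checkpoint(self):
        print('... saving checkpoint ...')
        self.saver.save(self.sess, self.checkpoint_file)

    def load_checkpoint(self):
        print('... loading checkpoint ...')
        self.saver.restore(self.sess, self.checkpoint_file)
```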
Next up is the Critic class, which is very similar. It also derives from the base object and gets an initializer with pretty much the same arguments: a learning rate, a number of actions, a name, input dims, a session, fc1_dims, fc2_dims, a batch size defaulting to 64, and a checkpoint directory defaulting to 'tmp/ddpg'. Just a note: you have to mkdir that directory first, otherwise it'll bark at you — not a big deal, just something to be aware of. Since the bookkeeping is identical we can copy a good chunk of the actor's code: saving the parameters, the scoped params, the saver, and the checkpoint file are all the same. Now we can handle the optimizer — since we've already called the function to build our network, self.optimize is tf.train.AdamOptimizer minimizing the loss, which we'll calculate in build_network. We also need an operation to calculate the gradients of Q with respect to the actions: self.action_gradients is tf.gradients of self.q and self.actions.

Then we build the network inside tf.variable_scope of self.name. We need our placeholders again: self.input is a float32 placeholder with shape None by input dims — if you're not too familiar with TensorFlow, specifying None in the first dimension tells it to expect some batch of inputs whose size you don't know beforehand — with the name 'inputs'. We also need the actions, because remember, the actions only enter at the second hidden layer of the critic network: a float32 placeholder of shape None by n_actions, named 'actions'. And much like with Q-learning we have a target value; if you go back to the paper, the target y_i is the reward plus the discounted value the target critic assigns to the next state under the target actor's action, and we'll calculate that in the agent's learn function. So q_target is just another placeholder: a float32 of shape None by one — it's a scalar per sample, so batch size by one — which we'll call 'targets'.

Now we have a pretty similar setup to the actor network. f1 is the fan-in again, and we have a dense layer on the inputs initialized with the same random uniform scheme, followed by the batch norm and relu. The second layer is a little different: we drop the activation there, because after the batch norm we have to take the actions into account. So we add another layer, action_in, which is tf.layers.dense taking self.actions — which we'll feed in from the learn function — and outputting fc2_dims units with a relu activation. Then state_actions is the addition of batch2 and action_in, and we go ahead and activate that sum. This is something I also pointed out in my PyTorch video as a point of debate: I'm doing a double activation here — relu on the action branch and then relu on the sum — and relu is not commutative with addition; the relu of a sum is different from the sum of the relus, which you can prove to yourself on a sheet of paper. So it's debatable whether the way I've done it is correct, but it seems to work, so I'm sticking with it for now. Fork it, change it up, see how it works, and if you improve on it make a pull request and I'll disseminate that to the community.

Now we calculate the actual output. f3 is the uniform initializer for the final layer, which is self.q: tf.layers.dense taking state_actions as input and outputting a single unit, with kernel and bias initializers similar to the ones above. We're still missing one thing, and that's the regularizer: as they say in the paper, there's L2 regularization on the critic, which we add with kernel_regularizer equal to tf.keras.regularizers.l2 with a value of 0.01. Notice that q outputs a single unit — it's a scalar, the value of the particular state-action pair. Finally we have the loss, which is just the mean squared error between q_target and q: q_target is the placeholder above, which we'll feed from the agent's learn function, and self.q is the output of the deep neural network.
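As with the actor, here's a sketch of how the Critic described so far might be assembled — again TensorFlow 1.x style and my own reconstruction, with the prediction and training helpers following in the next section:

```python
import os
import numpy as np
import tensorflow as tf
from tensorflow.initializers import random_uniform

class Critic(object):
    def __init__(self, lr, n_actions, name, input_dims, sess, fc1_dims=400,
                 fc2_dims=300, batch_size=64, chkpt_dir='tmp/ddpg'):
        self.lr = lr
        self.n_actions = n_actions
        self.name = name
        self.input_dims = input_dims
        self.sess = sess
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.batch_size = batch_size
        self.checkpoint_file = os.path.join(chkpt_dir, name + '_ddpg.ckpt')

        self.build_network()
        self.params = tf.trainable_variables(scope=self.name)
        self.saver = tf.train.Saver()

        # minimize the MSE loss defined in build_network
        self.optimize = tf.train.AdamOptimizer(self.lr).minimize(self.loss)
        # dQ/da, handed to the actor during its update
        self.action_gradients = tf.gradients(self.q, self.actions)

    def build_network(self):
        with tf.variable_scope(self.name):
            self.input = tf.placeholder(tf.float32, shape=[None, *self.input_dims],
                                        name='inputs')
            self.actions = tf.placeholder(tf.float32, shape=[None, self.n_actions],
                                          name='actions')
            self.q_target = tf.placeholder(tf.float32, shape=[None, 1],
                                           name='targets')

            f1 = 1. / np.sqrt(self.fc1_dims)
            dense1 = tf.layers.dense(self.input, units=self.fc1_dims,
                                     kernel_initializer=random_uniform(-f1, f1),
                                     bias_initializer=random_uniform(-f1, f1))
            batch1 = tf.layers.batch_normalization(dense1)
            layer1_activation = tf.nn.relu(batch1)

            f2 = 1. / np.sqrt(self.fc2_dims)
            dense2 = tf.layers.dense(layer1_activation, units=self.fc2_dims,
                                     kernel_initializer=random_uniform(-f2, f2),
                                     bias_initializer=random_uniform(-f2, f2))
            batch2 = tf.layers.batch_normalization(dense2)

            # the actions enter the network at the second hidden layer
            action_in = tf.layers.dense(self.actions, units=self.fc2_dims,
                                        activation=tf.nn.relu)
            state_actions = tf.add(batch2, action_in)
            state_actions = tf.nn.relu(state_actions)

            f3 = 0.003
            self.q = tf.layers.dense(state_actions, units=1,
                                     kernel_initializer=random_uniform(-f3, f3),
                                     bias_initializer=random_uniform(-f3, f3),
                                     kernel_regularizer=tf.keras.regularizers.l2(0.01))
            self.loss = tf.losses.mean_squared_error(self.q_target, self.q)
```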
Similar to the actor, the critic gets a predict function: it takes inputs and actions and returns self.sess.run of self.q with a feed dictionary of self.input: inputs and self.actions: actions. Next is a training function, which is slightly more involved than the actor's because it takes inputs, actions, and a target, and returns self.sess.run of self.optimize with a feed dictionary of self.input: inputs, self.actions: actions, and self.q_target: the target — the spacing came out a little wonky, but we'll leave it. Next we need a function to get the action gradients, which runs the action_gradients operation defined above; it also takes inputs and actions and returns self.sess.run of self.action_gradients with its own feed dictionary of self.input: inputs and self.actions: actions. Then we have the save and load checkpoint functions, which are identical to the actor's, so let's just copy and paste those. And that's it for the Critic class.
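Continuing the Critic sketch, those helper methods might look like this — again, class methods indented inside the class, with the printed messages being my own:

```python
    # Continuation of the Critic sketch above.
    def predict(self, inputs, actions):
        return self.sess.run(self.q,
                             feed_dict={self.input: inputs, self.actions: actions})

    def train(self, inputs, actions, q_target):
        return self.sess.run(self.optimize,
                             feed_dict={self.input: inputs,
                                        self.actions: actions,
                                        self.q_target: q_target})

    def get_action_gradients(self, inputs, actions):
        return self.sess.run(self.action_gradients,
                             feed_dict={self.input: inputs, self.actions: actions})

    def save_checkpoint(self):
        print('... saving checkpoint ...')
        self.saver.save(self.sess, self.checkpoint_file)

    def load_checkpoint(self):
        print('... loading checkpoint ...')
        self.saver.restore(self.sess, self.checkpoint_file)
```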
So now we have most of what we need: the noise, the replay buffer, the actor, and the critic. All that's left is the Agent, which ties everything together and handles the learning functionality — it owns the noise, the replay-buffer memory, and the four deep neural networks. It derives from the base object, and its initializer is a little long. It takes alpha and beta, the learning rates for the actor and critic respectively — recall from the paper they use 0.0001 and 0.001 — plus input dims, tau, the environment (that's how we get the action bounds), a gamma of 0.99 as in the paper, the number of actions, a memory size of one million, a layer-one size of 400, a layer-two size of 300, and a batch size of 64. Of course we save all of our parameters. The memory is just a ReplayBuffer with the max size, input dims, and number of actions; we save the batch size; and we store the session here so that there's a single session for all four networks. Don't quote me on this, but I tried it with an individual session for each network and it was very unhappy when I attempted to copy parameters from one network to another — I figured there were some scoping issues — so I simplified it to a single session, and there's no real reason I can think of to have more than one. The actor gets alpha, the number of actions, the name 'Actor', input dims, the session, the layer-one and layer-two sizes, and env.action_space.high as the action bound. The critic gets beta, the number of actions, the name 'Critic', input dims, the session, and the layer sizes — we don't pass in anything about the environment there. Then we can copy those lines for the target actor and target critic, changing the names, and clean things up to stay consistent with the PEP 8 style guide — always important, and it actually makes you stand out; I've worked on projects where the manager was quite happy to see a somewhat strict adherence to it. That's all four of our deep neural networks. Then the noise: an OUActionNoise with mu equal to numpy zeros in the shape of n_actions.

Now we need operations to perform the soft updates. The first time I tried this I defined it as its own separate function, and that was a disaster — it got progressively slower with every soft update. I don't know exactly why, but my guess is that every call added something to the graph, which added overhead to the calculation; that's just how I reasoned about it. So instead we define the update operation once, here in the initializer. We iterate over the target critic's parameters and call the assign operation, assigning the critic's params multiplied by self.tau plus the target critic's params multiplied by one minus tau, as a list comprehension over i in range of the length of the target critic's params. Then there's a similar operation for the actor — just swap critic for actor and target critic for target actor. So now we have our soft-update operations, exactly as in the paper. Having finally constructed the graphs for all four networks, we initialize our variables: self.sess.run of tf.global_variables_initializer — you can't really run anything without initializing it. And as per the paper, at the very beginning we want the target networks to be updated with the full values of the evaluation networks, so I pass in a parameter first=True to update_network_parameters. Since that's a little confusing, let's write that function: update_network_parameters takes first, defaulting to False. If first, I save the old value of tau so I can reset it afterward, set tau to one, run the update_critic op and the update_actor op in the session — as I recall it mattered which session you use to run the update, though maybe not since I'm only using one session; play around with it if you like — and then reset tau to the old value. You only want that hard copy of the evaluation networks into the target networks on the very first pass; otherwise you just run the update ops as normal.

Next we need a way of storing transitions: remember takes a state, action, reward, new state, and done flag, and just calls self.memory.store_transition with all of that. It's an interface from one class to another — maybe not great computer-science practice, but it works.
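Here's a sketch of the Agent's constructor and the update machinery, assuming the classes above are defined in the same module. One note: because tau gets baked into the assign ops as a plain Python float when the graph is built, this sketch uses explicit hard-copy ops for the very first update instead of temporarily toggling self.tau the way the video describes — treat that as my adjustment for the sketch, not the video's code.

```python
import numpy as np
import tensorflow as tf

class Agent(object):
    def __init__(self, alpha, beta, input_dims, tau, env, gamma=0.99, n_actions=2,
                 max_size=1000000, layer1_size=400, layer2_size=300, batch_size=64):
        self.gamma = gamma
        self.tau = tau
        self.memory = ReplayBuffer(max_size, input_dims, n_actions)
        self.batch_size = batch_size
        self.sess = tf.Session()       # one session shared by all four networks

        self.actor = Actor(alpha, n_actions, 'Actor', input_dims, self.sess,
                           layer1_size, layer2_size, env.action_space.high)
        self.critic = Critic(beta, n_actions, 'Critic', input_dims, self.sess,
                             layer1_size, layer2_size)
        self.target_actor = Actor(alpha, n_actions, 'TargetActor', input_dims,
                                  self.sess, layer1_size, layer2_size,
                                  env.action_space.high)
        self.target_critic = Critic(beta, n_actions, 'TargetCritic', input_dims,
                                    self.sess, layer1_size, layer2_size)

        self.noise = OUActionNoise(mu=np.zeros(n_actions))

        # soft-update ops, built ONCE: theta' <- tau*theta + (1 - tau)*theta'
        self.update_critic = [
            self.target_critic.params[i].assign(
                tf.multiply(self.critic.params[i], self.tau)
                + tf.multiply(self.target_critic.params[i], 1. - self.tau))
            for i in range(len(self.target_critic.params))]
        self.update_actor = [
            self.target_actor.params[i].assign(
                tf.multiply(self.actor.params[i], self.tau)
                + tf.multiply(self.target_actor.params[i], 1. - self.tau))
            for i in range(len(self.target_actor.params))]
        # hard-copy ops for the first update (my adjustment; the video toggles self.tau)
        self.hard_copy_critic = [
            self.target_critic.params[i].assign(self.critic.params[i])
            for i in range(len(self.target_critic.params))]
        self.hard_copy_actor = [
            self.target_actor.params[i].assign(self.actor.params[i])
            for i in range(len(self.target_actor.params))]

        self.sess.run(tf.global_variables_initializer())
        self.update_network_parameters(first=True)

    def update_network_parameters(self, first=False):
        if first:
            self.sess.run(self.hard_copy_critic)
            self.sess.run(self.hard_copy_actor)
        else:
            self.sess.run(self.update_critic)
            self.sess.run(self.update_actor)

    def remember(self, state, action, reward, new_state, done):
        self.memory.store_transition(state, action, reward, new_state, done)
```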
Next we want a way of choosing an action, and that takes a state as input. Since we defined the actor's input placeholder to be shaped None by input dims, a single observation — which has shape input dims — has to be reshaped to one by the observation dimension, using np.newaxis along the first axis; otherwise TensorFlow is going to get uppity with you, because you're only passing in a single observation to determine what action to take. So mu is self.actor.predict of that state, the noise is self.noise(), mu_prime is mu plus the noise, and we return mu_prime sub zero — predict returns a batch, so you want the zeroth element.

Now for the learn function, and of course this is where all the magic happens. If you haven't filled up at least a batch worth of memory, you bail out; otherwise you sample your memory: state, action, reward, new state, and done from self.memory.sample_buffer with the batch size. Next we do the update from the paper. We've already sampled the transitions, so we need to calculate the targets, and to do that we need the output of the target critic — Q prime — evaluated on the new states with the actions that come from the target actor; we use that for the critic's loss, and then we'll need the outputs of the critic and the actor for the policy update. So we're basically passing states and actions through all four networks. Back in the code editor: critic_value_ is self.target_critic.predict of the new states and self.target_actor.predict of the new states. Then we calculate the y_i targets: start with an empty list, and for each j in range of the batch size, append the reward at j plus gamma times critic_value_ at j times done at j. This is where all my harping about getting no rewards after the terminal state comes in: done was stored as one minus int of done, so for a terminal transition it's zero, you multiply the next-state value — which is what we just calculated — by zero, and you only keep the most recent reward, as you'd want. Then we reshape the target into something that is batch size by one, to be consistent with our placeholder, and call the critic's train function — we have everything we need: the states and actions from the replay buffer and the target from this calculation. Very easy.

Now the actor update. a_outs is self.actor.predict of the states — the actor's current actions for those states; grads is self.critic.get_action_gradients of the states and those actions — that does a feed-forward and gets the gradient of the critic with respect to the actions taken; then we train the actor with the states and grads sub zero — it comes back as a tuple, so you dereference it and take the zeroth element. And finally we call update_network_parameters. That's it for the learn function.

We have two other bookkeeping functions to handle: save_models, which calls save_checkpoint on the actor, target actor, critic, and target critic, and load_models, which loads the checkpoints instead of saving them. That's it for the Agent class. That only took about an hour, so we're up to about two hours for the video — the longest one yet. This is an enormous amount of work — already around three hundred ten lines of code — so if you've made it this far, congratulations.
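Continuing the Agent sketch, the action-selection, learning, and model-saving methods might look like this (again my reconstruction, indented inside the class):

```python
    # Continuation of the Agent sketch above.
    def choose_action(self, state):
        state = state[np.newaxis, :]      # (obs_dim,) -> (1, obs_dim) for the placeholder
        mu = self.actor.predict(state)    # deterministic action, shape (1, n_actions)
        mu_prime = mu + self.noise()      # add OU noise for exploration
        return mu_prime[0]

    def learn(self):
        if self.memory.mem_cntr < self.batch_size:
            return
        state, action, reward, new_state, done = \
            self.memory.sample_buffer(self.batch_size)

        # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})); done was stored as 1 - int(done)
        critic_value_ = self.target_critic.predict(
            new_state, self.target_actor.predict(new_state))
        target = []
        for j in range(self.batch_size):
            target.append(reward[j] + self.gamma * critic_value_[j] * done[j])
        target = np.reshape(target, (self.batch_size, 1))

        self.critic.train(state, action, target)

        # actor update: dQ/da evaluated at the actor's current actions
        a_outs = self.actor.predict(state)
        grads = self.critic.get_action_gradients(state, a_outs)
        self.actor.train(state, grads[0])

        self.update_network_parameters()

    def save_models(self):
        for net in (self.actor, self.target_actor, self.critic, self.target_critic):
            net.save_checkpoint()

    def load_models(self):
        for net in (self.actor, self.target_actor, self.critic, self.target_critic):
            net.load_checkpoint()
```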
This is no mean feat — it took me a couple of weeks to hammer through, but we've gotten through it in a couple of hours. That's all there is to the implementation; now we have to actually test it. So let's open a new file and save it as main_tensorflow.py, and try this out in the pendulum environment. We import our Agent from the TensorFlow implementation we just wrote, we need gym, we need numpy — we don't need tensorflow itself — from utils we import plot_learning, and we don't need os. Under if __name__ == '__main__', env is gym.make of 'Pendulum-v0', and the agent gets a learning rate alpha of 0.0001, a beta of 0.001, input dims of [3], a tau of 0.001, our environment, a batch size of 64 — being verbose with the keywords here — and n_actions equal to one. It looks like when I ran this I actually used different layer sizes; that's a matter of hyperparameter tuning — when I got good results I used 800 by 600 units and cut the learning rates in half — but here we'll go with the values from the paper; I just want to demonstrate that this works. The other thing we need to do is set the random seed; we can just put it here. The reason is replicability — I have yet to see an implementation that doesn't set the random seed, and when I tried it without one I didn't get very good sampling of the replay buffer, so this seems to be a critical step, and most implementations I've seen on GitHub do the same thing.

Now let's play our episodes — say a thousand games. We forgot score_history, of course, which keeps track of the scores over the course of our games. For each game: obs is env.reset, done is False, and the score for the episode is zero. While the episode isn't done: act is agent.choose_action of obs; new_state, reward, done, and info come from env.step of act; and agent.remember stores obs, act, reward, new_state, and int of done — the int isn't really necessary, since we take care of that in the replay buffer class, but it doesn't hurt to be explicit. Then we learn on every time step, because this is a temporal-difference method; keep track of your score; and set the old state to the new state, so that when you choose an action on the next step you're using the most recent information. At the end of every episode, append the score to score_history and print the episode number, the score, and the average over the last hundred games. At the very end, the filename is 'pendulum.png' and we call plot_learning with the score history, the filename, and a window of 100. I chose a window of 100 because many environments define "solved" as a trailing hundred-game average over some threshold; the pendulum doesn't actually have a solved threshold, but what we get is on par with some of the best results on the leaderboard, so it does pretty well. That's it for the main function — let's head to the terminal and see how many typos I made.
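Before running it, here's roughly what that main script looks like end to end. The module the Agent is imported from and the plot_learning helper are assumptions based on the description above, not verified file names.

```python
# Minimal sketch of main_tensorflow.py
import gym
import numpy as np
from ddpg_tf import Agent          # hypothetical: whichever file the Agent class lives in
from utils import plot_learning    # the plotting helper referenced in the video

if __name__ == '__main__':
    env = gym.make('Pendulum-v0')
    agent = Agent(alpha=0.0001, beta=0.001, input_dims=[3], tau=0.001, env=env,
                  batch_size=64, layer1_size=400, layer2_size=300, n_actions=1)

    np.random.seed(0)              # replicability and consistent replay-buffer sampling

    score_history = []
    for i in range(1000):
        obs = env.reset()
        done = False
        score = 0
        while not done:
            act = agent.choose_action(obs)
            new_state, reward, done, info = env.step(act)
            agent.remember(obs, act, reward, new_state, int(done))
            agent.learn()          # learn every step: this is a TD method
            score += reward
            obs = new_state
        score_history.append(score)
        print('episode', i, 'score %.2f' % score,
              '100 game average %.2f' % np.mean(score_history[-100:]))

    filename = 'pendulum.png'
    plot_learning(score_history, filename, window=100)
```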
All right, here we are in the terminal; let's run the main file. Invalid syntax — I have an extra print in there, so let's delete that and run again. Now I have a star out of place; back to the code editor — it's on line 95, which is here — and back to the terminal to try again. Now it says line 198, invalid syntax — ah, it's a comma instead of a colon; fix that and off we go once more. That was close: 'Actor object has no attribute input_dims' on line 95 — okay, that's easy; it just means I forgot to save input_dims in the initializer, and since I cut and pasted it probably means I forgot it in the critic as well — yes, it does. Back to the terminal... that was the last one. Moment of truth: 'Critic takes one positional argument but seven were given' — good grief, okay, one second. That's in the agent, line 234; counting the parameters — learning rate, number of actions, name, input dims, session — one, two, three, four, five, six, seven... what have I done? Ah, of course, it's the class definition for Critic; fix that and back to the terminal. 'Name action_bound is not defined' on line 148 — that's in the critic, and it's there because I don't need it there; that was just for the actor, and it crept in because I cut and pasted — always dangerous; I tried to save a few keystrokes and wasted time instead. Delete it and back to the terminal. 'Actor has no attribute inputs' on line 126 — self.inputs should be self.input; same fix here. Okay, perfect — and it runs.

I'll let that run for a second, but I already let it run for a thousand games earlier, and this is the output I got — keep in mind, with slightly different parameters. The point here isn't whether we can replicate the paper's results exactly, because we don't even really know what those were — they're expressed as a fraction of the performance of a planning agent, so who knows what that really means. I did a little bit of hyperparameter tuning — all I did was double the number of hidden units and halve the learning rates — and I ended up with something that looks like this. You can see it gets to around 150 or so, and if you check the leaderboard that's actually a reasonable number: some of the best agents come in around 152, and some do a little better, so it's pretty respectable. The default implementation looks like it's very slow to learn — you can see how it starts out, gets worse, and then starts to get a little better — and that kind of oscillation in performance over time is something you see pretty frequently.

So that's how you go from a paper to an implementation in about two hours — of course it took me many times that to get this set up for you, and I hope it's helpful. I'm going to milk this for all it's worth; this has been a tough project, so I'm going to present many more environments in the future, and I may even do a video like this for PyTorch. I have yet to work on a Keras implementation, but there are many more DDPG videos to come, so subscribe to make sure you don't miss them, leave a comment, and share this if you found it helpful — that helps me immensely and I'd really appreciate it — and I'll see you all in the next video.
Info
Channel: Machine Learning with Phil
Views: 20,427
Keywords: deep deterministic policy gradients, ddpg tutorial, ddpg algorithm, deep deterministic policy gradients algorithm, ddpg network, continuous actor critic, ddpg openai gym, ddpg tensorflow, ddpg example, ddpg code, implement deep learning papers, how to read deep learning papers, continuous action reinforcement learning, ddpg implementation, ddpg paper
Id: jDll4JSI-xo
Length: 114min 2sec (6842 seconds)
Published: Tue Jul 02 2019