Deep Reinforcement Learning in Python Tutorial - A Course on How to Implement Deep Learning Papers

Captions
what is up, everybody. Today you're going to learn how to go from a paper to a fully functional implementation of deep deterministic policy gradients. If you're not familiar with deep deterministic policy gradients, or DDPG for short, it is a type of deep reinforcement learning used in environments with continuous action spaces. You see, most environments have discrete action spaces. This is the case with, say, the Atari library — say Breakout or Space Invaders — where the agent can move left, move right, and shoot, but it moves and shoots by fixed, discrete amounts. In other environments, like robotics, the robot can move a continuous amount, anywhere from, say, zero to one, or minus one to plus one — anything along a continuous number interval. This poses a problem for most deep reinforcement learning methods, like Q-learning, which works spectacularly well in discrete environments but cannot tackle continuous action spaces. Now, if you don't know what any of this means, don't worry — I'm going to give you the rundown here in a second. But for this set of tutorials you're going to need to have installed the OpenAI Gym, you'll need Python 3.6, and you'll also need TensorFlow and PyTorch. Other packages you'll need include matplotlib, to handle the plotting of the learning curve, which will allow us to see the actual learning of the agent, as well as NumPy to handle your typical vector operations. Now, here's a quick little rundown of reinforcement learning. The basic idea is that we have an agent that interacts with some environment and receives a reward. The rewards kind of take the place of labels in supervised learning, in that they tell the agent what is good — what it is shooting for in the environment. The agent attempts to maximize its total reward over time by solving something known as the Bellman equation. We don't have to worry about the actual mathematics of it, but just so you know for your future research, these algorithms are typically concerned with solving the Bellman equation, which tells the agent the expected future returns, assuming it follows something called its policy. The policy is the probability that the agent will take some action given it's in some state s — it's basically a probability distribution. Many algorithms, such as Q-learning, attempt to solve the Bellman equation by finding what's called the value function. The value function — or the action-value function in this case — maps the current state and set of possible actions to the expected future returns the agent expects to receive. In other words, the agent says, hey, I'm in some state — meaning some configuration of pixels on the screen, in the case of the Atari library, for instance — and asks: if I take one or another action, what is the expected future return, assuming I follow my policy afterward? Actor-critic methods are slightly different, in that they attempt to learn the policy directly — and recall that the policy is a probability distribution telling the agent the probability of selecting an action given it's in some state s. These two families of algorithms have a number of strengths between them, and deterministic policy gradients is a way to marry those strengths into something that does really well in continuous action spaces. You don't need to know too much more than that; everything else you need to know I'll explain in the respective videos.
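For reference, the policy and the action-value function alluded to above are usually written like this (standard notation, not something spelled out in the video); the Bellman equation is just the recursive form of the expected return:

```latex
\pi(a \mid s) = \Pr(A_t = a \mid S_t = s), \qquad
Q^{\pi}(s, a) = \mathbb{E}\!\left[\, r_t + \gamma\, \mathbb{E}_{a' \sim \pi}\, Q^{\pi}(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \right]
```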
In the first video you're going to see how I go from the paper to an implementation on the fly, and in the second video you're going to see an implementation of deep deterministic policy gradients in PyTorch, in a separate environment. Both of these environments are continuous, so they will demonstrate the power of the algorithm quite nicely. You don't need a particularly powerful GPU, but you do need some kind of GPU to run these, as it takes a considerably long time even on a GPU — so you will need at least, say, a Maxwell-class GPU or above, so something from the 700 series on the NVIDIA side. Unfortunately, neither of these frameworks really works well with AMD cards, so if you have one of those you'd have to figure out some sort of kludge to get an OpenCL implementation to transcompile to CUDA. That's just a technical detail — I don't have any information on that, so you're on your own, sorry. So, this is a few hours of content: grab a snack and a drink and watch it at your leisure. It's best to watch it in order. I actually released the videos in reverse order on my channel, just so I could get them out — I did the implementation in PyTorch first and then the video on implementing the paper in TensorFlow second — but it really is best for a new audience to go from the paper video to the PyTorch video. I hope you like it. Leave your comments, questions, suggestions, and issues down below and I'll try to address as many as possible. You can check out the code for this on my GitHub, and you can find many more videos like this on my YouTube channel, Machine Learning with Phil. I hope you all enjoy it; let's get to it. What is up, everybody. In today's video we're going to go from the paper on deep deterministic policy gradients all the way to a functional implementation in TensorFlow, so you're going to see how to go from a paper to a real-world implementation all in one video. Grab a snack and a drink, because this is going to take a while. Let's get started. The first step in my process really isn't anything special: I just read the entirety of the paper, starting, of course, with the abstract. The abstract tells you what the paper is about at a high level — it's kind of an executive summary. The introduction is where the authors pay homage to other work in the field, set the stage for what is going to be presented in the paper, and explain the need for it. The background expands on that, and you can see it gives us a bit of mathematics; you'll get a lot of useful information here. It won't say much about implementation details, but it does set the stage for the mathematics you're going to be implementing, which is of course critical for any deep learning — or in this case deep reinforcement learning — paper implementation. The algorithm section is where the meat of the problem is: it's in there that they lay out the exact steps you need to take to implement the algorithm — that's why it's titled that way — so this is the section we want to read most carefully. Then, of course, they will typically give a table where they outline the actual algorithm, and oftentimes, if I'm in a hurry, I will just jump straight to that, because I've done this enough times that I can read what's called pseudocode. If you're not familiar with pseudocode, it's just an English representation of computer code; we typically use it when outlining a problem, and it's often used in papers. So typically I'll start there, reading it, and then work backward by
reading through the paper to see what I missed. Of course, the paper also talks about performance across a whole host of environments, and what all of these have in common is that they are continuous control. What that means is that the action space is a vector whose elements can vary along a continuous real number line, instead of having discrete actions of zero, one, two, three, four, or five. That is the real motivation behind deep deterministic policy gradients: it allows us to use deep reinforcement learning to tackle these types of problems. In today's video we're going to tackle the pendulum swing-up, also called the pendulum problem. The reason is that while it would be awesome to start out with something like the bipedal walker, you never want to start out with maximum complexity — you always want to start out with something very small and then scale your way up. You're going to make mistakes, and it's easiest and quickest to debug very simple environments that execute very quickly. The pendulum problem only has, I think, three elements in its state vector and only a single action — or maybe it's two actions, I forget — but either way it's a very small problem relative to something like the bipedal walker or many of the other environments. You could also use the continuous version of the cart pole or something like that, and that would be perfectly fine; I've just chosen the pendulum for this particular video because we haven't done it before. So it's in the results that they give a bunch of plots of the performance of their algorithm under various sets of constraints and different implementations, so you can get an idea. It's always important to look at the plots, because they give you a lot of information visually — it's much easier to gather information from plots than from text — and one thing you notice right away is that they use a scale of 1. That's telling you it's relative performance, and you have to read the paper to know relative to what; I don't like that particular approach. They have similar data in table form, and there you see the whole bunch of environments they used — a broad, broad variety. They wanted to show that the algorithm has a wide arena of applicability, which is a typical technique in papers: they want to show the work is relevant. If they only showed a single environment, people reading it would say, well, that's all well and good, you can solve one environment, but what about these dozen other environments? And part of the motivation behind reinforcement learning is generality: can we model real learning in biological systems such that it mimics the generality of biological learning? One thing you notice right away is that these numbers are not actual scores, so that's something I take note of, and it causes me to raise an eyebrow. You have to wonder about the motivation behind that — why would the authors express scores as a ratio? There are a couple of different reasons. One is that they just want to make all the numbers look uniform: maybe the people reading the paper wouldn't be familiar with each of these environments, so they wouldn't know what a good score is, and that's a perfectly valid reason. Another possibility is that they want to hide poor performance — I don't think that's going on here, but it does make me raise an eyebrow whenever I see it. The one exception is TORCS, which is an open racing car simulator environment. I don't know if we'll get to that on this
channel — that would be a pretty cool project, but it would take me a few weeks to get through. Right away you notice that they have a whole bunch of environments, and these scores are all relative to 1, where 1 is the score the agent gets from a planning algorithm, which they also detail later on. So those are the results. They also talk about related work, which covers other algorithms that are similar and their shortcomings — they never want to talk up other algorithms; you always want to talk up your own algorithm to make yourself sound good, otherwise why would you be writing a paper in the first place? And of course there's a conclusion that ties everything together, and then references. I don't usually go deep into references; if there's something I feel I really need to know, I may look at a reference, but I don't typically bother with them. If you were a PhD student it would behoove you to go into the references, because you must be an absolute expert on the topic, but for us — we're hobbyists, I'm a YouTuber — I don't go into too much depth with the background material. The next most important bit of the paper is the experimental details, and it's in there that they give the parameters and architectures for the networks. If you saw my previous video, where I did the implementation of DDPG in PyTorch in the continuous lunar lander environment, this is where I got most of that — it was almost identical, with a little bit of tweaking; I left out some stuff from this paper, but pretty much all of it came from here, in particular the hidden layer sizes of 400 and 300 units, as well as the initialization of the parameters from a uniform distribution over the given ranges. So, just to recap, that was a really quick overview of the paper, showing my process and what I look at: the most important parts are the details of the algorithm as well as the experimental details. As you read the paper — like I said, I gloss over the introduction because I already understand the motivation behind it — it basically tells us that you can't really handle continuous action spaces with deep Q networks, which we already know, and it says you could discretize the action space, but then you end up with a whole boatload of actions — what is it, 2187 actions — so that's intractable anyway. Then they say: we present a model-free, off-policy algorithm. And it comes down to the section where it says the network is trained off-policy with samples from a replay buffer to minimize correlations — very good — and trained with a target Q network to give consistent targets during temporal difference backups; in this work we make use of the same ideas, along with batch normalization. This is a key chunk of text, and this is why you want to read the whole paper: sometimes they'll embed stuff in there that you may not otherwise catch. As I'm reading the paper, what I do is take notes. You can do this on paper, or in a text document — in this case we're reading in the editor, so that way I can show you what's going on, and it's a natural place to put this stuff because that's where you'll implement the code anyway. So let's hop over to the editor and you'll see how I take notes. Right off the bat, we always want to be thinking in terms of
what sort of classes and functions we will need to implement this algorithm. The paper mentioned a replay buffer as well as a target Q network. The target Q network — for now we don't really know what it's going to be, but we can write it down — so we'll say we'll need a replay buffer class and a class for a target Q network. Now, I would assume that if you're going to be implementing a paper of this level of difficulty, you're already familiar with Q-learning, where you know that the target network is just another instance of a generic network; the difference between the target and evaluation networks is the way in which you update their weights. So right off the bat we know we're going to have at least one network class. If you know something about actor-critic methods, you'll know you'll probably have two different classes — one for an actor, one for a critic — because those two architectures are generally a little bit different. But what do we know about Q networks? We know that Q networks are state-action value functions, not just value functions. The critic in actor-critic methods is generally just a state value function, whereas here we have a Q network, which is going to be a function of the state and the action — so we know it's a function of s and a. So right off the bat it's not the same as the usual critic; it's a little bit different. It also said we will use batch norm. Batch normalization is just a way of normalizing inputs to prevent divergence in a model — I think it was introduced around 2014 or 2015 — so we'll need that in our network. So we know at least two classes right off the bat, and we have a little bit of an idea of what the network is going to look like. Let's go back to the paper and see what other bits of information we can glean from the text before we take a look at the algorithm. Reading along, they say a key feature is simplicity: it requires only a straightforward actor-critic architecture and very few moving parts. Then they talk it up and say it can learn policies that exceed the performance of the planner — the planning algorithm — even when learning from pixels, which we won't get to in this particular implementation. Okay, no real other nuggets there. The background talks about the mathematical structure of the algorithm, so this is really important if you want a deep, in-depth knowledge of the topic. If you already know enough about the background, you'll know the formula for discounted future rewards — you should know it if you've done a whole bunch of reinforcement learning algorithms; if you haven't, then definitely read through this section to get the full idea of the background and the motivation behind the mathematics. The other thing to note is that it says the action-value function is used in many algorithms — we know that from deep Q-learning — and then it talks about the recursive relationship known as the Bellman equation, which we also know. The next nugget is this: if the target policy is deterministic, we can describe it as a function mu, and you'll see in the remainder of the paper — like in the algorithm — that they do indeed make use of this function mu. That tells us right off the bat that our policy is going to be deterministic. Now, you could probably guess that from the title — DDPG, deep deterministic policy gradients — so you would guess
from the name that the policy is deterministic. But what does that mean exactly? A stochastic policy is one in which the software maps a given state to the probabilities of taking each action: you input a state, out comes a probability of selecting each action, and you select an action according to that probability distribution. That right away bakes a solution to the explore-exploit dilemma into the algorithm, so long as all the probabilities stay finite — as long as the probability of taking an action doesn't go to zero for any state, there is some element of exploration involved. Q-learning handles the explore-exploit dilemma by using epsilon-greedy action selection, where you have a random parameter that tells you how often to select a random action, and you select the greedy action the remainder of the time. Policy gradient methods don't work that way — they typically use a stochastic policy — but in this case we have a deterministic policy, so you have to wonder right away: okay, we have a deterministic policy, how are we going to handle the explore-exploit dilemma? Let's go back to our text editor and make a note of that. We just want to note that the policy is deterministic — how do we handle explore-exploit? That's a critical question, because if you only take what are perceived as the greedy actions, you never get really good coverage of the parameter space of the problem, and you're going to converge on a suboptimal strategy. So this is a critical question the paper has to answer; let's head back to the paper and see how they handle it. Back in the paper, you can see the reason they introduce the deterministic policy is to avoid an inner expectation — or maybe that's just a byproduct; I guess it's not accurate to say that's the reason they do it. But what's neat is it says the expectation depends only on the environment, which means it's possible to learn Q to the mu — meaning Q as a function of mu — off-policy, using transitions generated from a different stochastic policy beta. So right there we have off-policy learning, which they say explicitly, with a stochastic behavior policy. We are actually going to have two different policies in this case, and that already answers the question of how we go from a deterministic policy to solving the explore-exploit dilemma: we're using a stochastic policy to learn the greedy — sorry, the purely deterministic — policy. Then of course they talk about the parallels with Q-learning, because there are many between the two algorithms, and you get to the loss function, which is of course critical to the algorithm, and this y of t target term. They talk about what Q-learning has been used for, they make mention of deep neural networks — which is of course what we're going to be using, that's where the "deep" comes from — and they talk about the Atari games, which we've covered on this channel as well. Importantly, they state the two changes they borrow from deep Q-learning: the concept of the replay buffer and the target network, which they already mentioned before — they're just reiterating and reinforcing what they said, and that's why we want to read the introduction and background material, to get a solid idea of what's going to happen. So now we get to the algorithmic portion, and this is where all of the magic happens. They again reiterate that it's not possible to apply Q-learning directly to continuous action spaces because, you know, reasons —
it's pretty obvious: you have an infinite number of actions, and that's a problem. Then they talk about the deterministic policy gradient algorithm, which we're not going to go too deep into — for this video we don't want to do a full thesis or doctoral dissertation on the field, we just want to know how to implement it and get moving. This section gives you an update for the gradient of this quantity J, and gives it in terms of the gradient of Q, the state-action value function, and the gradient of the policy — the deterministic policy mu. The other thing to note here is that these gradients are taken with respect to two different things: the gradient of Q is with respect to the actions, such that the action a equals mu of s of t. What this tells you is that Q is actually a function not just of the state but is intimately tied to that policy mu — it's not an action chosen according to an argmax, for instance; it's an action chosen according to the output of the other network. And the update for mu is just the gradient with respect to its weights, which you would kind of expect. They also mention another algorithm, NFQCA — I don't know much about that, honestly — and a minibatch version of it, and then say: our contribution here is to provide modifications to DPG, inspired by the success of DQN, which allow it to use neural network function approximators to learn in large state and action spaces online; we call it DDPG. Very creative. As they say, again: we use a replay buffer to address the issue of correlations between samples generated on subsequent steps within an episode — a finite-sized cache of transitions sampled from the environment. We know all of this, but if you don't, what you need to know is that you have state, action, reward, and new-state transitions. What this tells the agent is: I started in some state s, took some action, received some reward, and ended up in some new state. Why is this important? Because in anything that isn't dynamic programming, you're really trying to learn the state transition probabilities — the probability of going from one state to another and receiving some reward along the way. If you knew all of those beforehand, you could simply solve a set of equations — a very, very large set of equations, for that matter — to arrive at the optimal solution. If you knew all those transitions, you'd say, okay, if I start in this state and take some action I'm going to end up in some other state with certainty; then you'd ask which state is most advantageous, which state is going to give me the largest reward, and you could construct some sort of algorithm for traversing that set of equations to maximize your reward over time. Of course, you usually don't know that, and that's the point of the replay buffer: to learn it through experience, by interacting with the environment. It says that when the replay buffer is full, the oldest samples are discarded — okay, that makes sense; it's finite in size, it doesn't grow indefinitely. At each time step the actor and critic are updated by sampling a minibatch uniformly from the buffer, so it operates exactly like Q-learning: it does uniform random sampling of the buffer and uses that to update the actor and critic networks. What's critical here, combining this statement with the topic of the previous paragraph, is that when we write our replay buffer class, it must sample states at random. What that means is you
don't want to sample a long sequence of subsequent steps, because there are large correlations between those steps, as you might imagine, and those correlations can cause you to get trapped in little nooks and crannies of parameter space and really cause your algorithm to go wonky. So you want to sample uniformly, so that you're sampling across many, many different episodes and getting a good idea of the breadth of the parameter space, to use loose language. Then it says that directly implementing Q-learning with neural networks proved to be unstable in many environments, and they talk about using the target network — okay — but modified for actor-critic, using soft target updates rather than directly copying the weights. In Q-learning we directly copy the weights from the evaluation network to the target network; here it says we create a copy of the actor and critic networks, Q prime and mu prime respectively, that are used for calculating the target values. The weights of these target networks are then updated by having them slowly track the learned networks: theta prime gets tau times theta plus one minus tau times theta prime, with tau much, much less than one. This means the target values are constrained to change slowly, greatly improving the stability of learning. Okay, so this is our next little nugget — let's head over to our text editor and make a note of it. What we read was — not in caps, we don't want to shout — that we have two actor and two critic networks, a target for each, and the updates are soft, according to theta prime equals tau times theta plus one minus tau times theta prime. This is the update rule for the parameters of our target networks, and we have two target networks, one for the actor and one for the critic, so we have a total of four deep neural networks. This is why the algorithm runs so slowly — even on my beastly rig it runs quite slowly, even in the continuous lunar lander environment. I've done the bipedal walker and it took about 20,000 games to get something that approximates a decent score, so this is a very, very slow algorithm, and those 20,000 games took, I think, about a day to run. Quite slow, but nonetheless quite powerful — it's the only method we have so far for doing deep reinforcement learning in continuous control environments, so hey, beggars can't be choosers, right? Just to recap: we're going to use four networks, two of them on-policy and two off-policy, and the updates are going to be soft, with tau much less than one. If you're not familiar with the mathematical notation, the double less-than or double greater-than sign means "much less than" or "much greater than" respectively. What that means is that tau is going to be of order 0.01 or smaller — 0.1 isn't much smaller than one, that's just kind of smaller; 0.01 I would consider much smaller. We'll see what value they use in the experimental details, but you should know it's of order 0.01 or smaller, and the reason they do this is to let the target updates happen very slowly, to get good convergence, as they said in the paper. So let's head back to the paper and see what other nuggets we can glean before getting to the outline of the algorithm.
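As a quick illustration of that soft update rule, here is a minimal, framework-agnostic sketch with plain NumPy arrays standing in for network parameters (in the actual TensorFlow code this becomes a graph operation):

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(online_params, target_params)]

# toy example: one weight matrix and one bias vector per "network"
online = [np.ones((3, 2)), np.ones(2)]
target = [np.zeros((3, 2)), np.zeros(2)]
target = soft_update(target, online, tau=0.001)  # target moves 0.1% toward online
```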
In the very next sentence they say this simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning, a problem for which a robust solution exists. They found that having both the target mu prime and Q prime was required to have stable targets y_i, in order to consistently train the critic without divergence. This may slow learning, since the target networks delay the propagation of value estimates; however, in practice they found this was always greatly outweighed by the stability of learning. I found that as well — you don't get a whole lot of divergence, but it does take a while to train. Then they talk about learning in low-dimensional and higher-dimensional environments, and they do that to talk about the need for feature scaling. The problem is the range of variation in the observations. In different environments — like the mountain car, where the position goes from roughly minus 1.6 to 0.4, something like that, and the velocities are plus or minus 0.07 — you have a two-order-of-magnitude variation among the inputs, which is kind of large even within that one environment. When you compare that to other environments, where inputs can be much larger, on the order of hundreds, you can see there's a pretty big issue with the scaling of the inputs to the neural network — and we know from experience that neural networks are highly sensitive to the relative scaling of their inputs. Their solution to that problem is to scale the features so they're similar across environments and units, and they do that using batch normalization. It says this technique normalizes each dimension across the samples in a minibatch to have unit mean and variance, and it also maintains a running average of the mean and variance used for normalization during testing (during exploration and evaluation). In our case, training and testing are slightly different than in supervised learning. In supervised learning you maintain different data sets, or shuffled subsets of a single data set, for training and evaluation, and in the evaluation phase you perform no weight updates — you just see how the network does based on its training. In reinforcement learning you can do something similar: you have a set number of games where you train the agent to achieve some set of results, then you turn off the learning and let it choose actions based on whatever policy it has learned. And if you're using batch normalization — in PyTorch in particular — there are significant differences in how batch normalization behaves in the two phases, so you have to be explicit about setting training or evaluation mode. In particular, PyTorch doesn't use the batch statistics in evaluation mode, which is why, when we wrote the DDPG algorithm in PyTorch, we had to call the eval and train functions so often. Okay, so we've already established we'll need batch normalization, so everything's starting to come together: we need a replay buffer, batch normalization, and four networks — two each (a target and an evaluation) for the actor and the critic — so half of them are going to be used on-policy and half of them are going to be used off-policy, as the targets.
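Since that train/eval distinction trips people up, here's a tiny standalone PyTorch illustration (a toy stand-in, not the video's network) of why the mode matters once batch norm layers are present:

```python
import torch
import torch.nn as nn

# Toy critic-like network with a BatchNorm1d layer; 3 inputs as in Pendulum's state.
net = nn.Sequential(nn.Linear(3, 400), nn.BatchNorm1d(400), nn.ReLU(), nn.Linear(400, 1))
batch = torch.randn(64, 3)

net.train()                 # per-batch statistics are used and the running averages update
q_train = net(batch)

net.eval()                  # running averages are used; no statistics are updated
with torch.no_grad():
    q_eval = net(batch)
```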
Then, scrolling down, it says a major challenge of learning in continuous action spaces is exploration; an advantage of off-policy algorithms such as DDPG is that we can treat the problem of exploration independently from the learning algorithm. They constructed an exploration policy mu prime by adding noise sampled from a noise process N to the actor policy. So right here it's telling us what the behavior policy is: the exploration policy mu prime is basically mu plus some noise, and N can be chosen to suit the environment, as detailed in the supplementary materials. They used an Ornstein-Uhlenbeck process to generate temporally correlated exploration, for exploration efficiency in physical control problems with inertia. If you're not familiar with physics, inertia just means the tendency of stuff in motion to stay in motion; it has to do with environments that move, like the walkers, the cheetahs, the ants, stuff like that. Okay, so we've got another nugget to add to our text editor — let's head back over there and write it down. The exploration policy is just the evaluation actor — we'll call it that for lack of a better word — plus some noise process. They used Ornstein-Uhlenbeck noise (I don't think I spelled that correctly); we'd need to look that up, but I've already looked it up — my background is in physics, so it made sense to me. It's basically a noise process that models the motion of Brownian particles, which are particles that move around under the influence of their interactions with other particles in some kind of medium, like a viscous medium — a perfect fluid or something like that. In that case the noise is temporally correlated, meaning each time step is related to the time step prior to it. I hadn't thought about it before, but that's probably important for the case of Markov decision processes: in MDPs the current state is only related to the prior state and the action taken — you don't need the full history of the environment — so I wonder if the noise was chosen that way, if there's some underlying physical reason for it. That's just a question off the top of my head; I don't know the answer, so if someone knows, drop the answer in the comments — I'd be very curious to see it.
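For reference (this isn't spelled out in the video), the Ornstein-Uhlenbeck process being described is usually written as the stochastic differential equation

```latex
dx_t = \theta\,(\mu - x_t)\,dt + \sigma\,dW_t
```

where W_t is a Wiener process (Brownian motion); the dependence on x_t is what makes successive samples temporally correlated, with theta pulling the noise back toward the mean mu and sigma scaling the randomness.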
So we have enough nuggets here. Just to summarize: we need a replay buffer class, we'll also need a class for the noise, a class for the target Q network, and we're going to use batch normalization. The policy will be deterministic, and what that means in practice is that the policy will output the actual actions instead of the probability of selecting actions, so the policy will be limited by whatever the action space of the environment is, and we need some way of taking that into account. So: deterministic policy means it outputs the actual action instead of a probability, and we'll need a way to bound the actions to the environment's limits. Of course, these notes don't make it into the final code; they're just things you think of as you're reading the paper. You'd want to put all your questions here — I don't have questions, since I've already implemented it, but this is my thought process as I went through it the first time, as best as I can model it after having finished the problem. You can also use a sheet of paper — there's some kind of magic about writing stuff down on paper — but we're going to use the code editor, because I don't want to use an overhead projector to show you guys a frigging sheet of paper; this isn't grade school here. So let's head back to the paper and take a look at the actual algorithm, to get some real sense of what we're going to be implementing. The results really aren't super important to us yet — we'll use them later on if we want to debug the model's performance, but the fact that they express everything relative to a planner makes that difficult. So scroll down to the data really quick. Another thing to note — I didn't talk about this earlier, but now is a good time — are the stipulations on this performance data: it says performance after training across all environments for at most 2.5 million steps. I said earlier that I had to train the bipedal walker for around 20,000 games; that's roughly two and a half million steps or so — I think it was actually a bit more, maybe around three million steps, something like that. They report both the average and the best score observed across five runs. So why would they use five runs? If this were some perfectly reproducible algorithm they wouldn't need to — and none of them are; that isn't a slight on their algorithm, it's not meant to be a slight on anything. What it tells us is that they had to use five runs because there is some element of chance involved. One problem with deep learning is the problem of replicability: it's hard to replicate other people's results, particularly if you use the system clock as the seed for your random number generators. Using the system clock to seed the random number generator guarantees that if you run the simulation even a millisecond later, you're going to get different results, because you're starting with different sets of parameters. Now, you will get qualitatively similar results — you'll be able to reproduce the general idea of the experiments — but you won't get the exact same numbers. It's kind of an objection to the whole deep learning phenomenon, and it makes it kind of not scientific, but whatever — it works, it has enormous success, so we won't quibble about semantics or philosophical problems. For our purposes we just need to know that even the people who invented the algorithm had to run it several times to get some idea of what was going to happen, because the algorithm is inherently probabilistic, and so they report averages and best-case scenarios. That's another little tidbit. They also include results for both the low-dimensional case, where you receive just a state vector from the environment, and the pixel-input case. We won't be doing pixel inputs in this particular video, but maybe we'll get to them later — I'm trying to work on that as well. So those are the results; the interesting tidbit is that the algorithm is probabilistic and takes several runs. Other than that, we don't really care about the results for now — we'll take a look later, but that's not our concern at the moment. So now we have a series of questions, and we have answers to all of them: we know how we're going to handle the explore-exploit dilemma, we know the purpose of the target networks, we know how we're going to handle the noise, we know how we're going to handle the replay buffer, and we know what the policy actually outputs — the actual actions the agent is going to take. We know a whole bunch of stuff, so it's time to look at the algorithm and see how we fill in all the details.
Randomly initialize a critic network and an actor network with weights theta super Q and theta super mu. This is handled by whatever library you use — you don't have to manually initialize the weights — but we do know from the supplementary materials that they constrain these initializations to be within some range, so put a note in the back of your mind that you're going to have to constrain them a little bit. Then it says: initialize the target networks Q prime and mu prime with weights equal to the original networks, so theta super Q prime gets initialized to theta super Q and theta super mu prime gets initialized to theta super mu. So we will be updating the weights of the target networks from the evaluation networks right off the bat. Then: initialize a replay buffer R. Now, this is an interesting question — how do you initialize the replay buffer? I've used a couple of different methods. You can initialize it with all zeros, and if you do that, then when you perform the learning you want to make sure you have a number of memories greater than or equal to the minibatch size of your training, so that you're not sampling the same states more than once. If your batch is 64 memories but you only have 16 memories in your replay buffer, you're going to sample each of those memories about four times, and that's no good. So if you initialize your replay buffer with zeros, you have to make sure you don't learn until you exit the warm-up period, where the warm-up period is just a number of steps equal to your buffer sample size. Alternatively, you can initialize it with actual environment play, but that takes quite a long time — replay buffers are on the order of a million transitions, so if you load up the algorithm and take a million steps at random, it's going to take a long time. I always use zeros and then just wait until the agent fills up a minibatch's worth of memories — just a minor detail there. Then it says: for some number of episodes, do — so, a for loop — initialize a random process N for action exploration. Now, reading this, I realize I actually made a bit of a mistake: in my previous implementation I didn't reset the noise process at the top of every episode. It's explicit here — I must have missed that line — and I've looked at other people's code; some do it, some don't. But it worked: within, what was it, a thousand episodes the agent managed to beat the continuous lunar lander environment. So is it critical? Maybe not — I think I mentioned that in the video. Next: receive the initial state observation s1. Then, for each step of the episode, t equals 1 to capital T, do: select the action a sub t equals mu of s sub t plus N sub t, according to the current policy and exploration noise. That's straightforward — what does it mean? It means feed the state forward through the network, receive the vector output of the action, and add some noise to it. Execute that action, and observe the reward and the new state — simple. Then store the transition — the old state, action, reward, and new state — in your replay buffer R. Okay, that's straightforward.
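A tiny sketch of those two practical points — the noisy action selection and the warm-up guard before learning. The names here (actor_forward, noise, mem_cntr) and the plus/minus 2 action bound for Pendulum-v0 are placeholders for illustration, not the code we end up writing:

```python
import numpy as np

def choose_action(actor_forward, state, noise, action_bound=2.0):
    # mu(s) plus temporally correlated exploration noise; clipped here for
    # illustration (the video instead bounds actions with a scaled tanh output).
    mu = actor_forward(state)
    return np.clip(mu + noise(), -action_bound, action_bound)

def ready_to_learn(mem_cntr, batch_size=64):
    # With a zero-initialized buffer, skip learning until one full minibatch
    # of real transitions has been stored.
    return mem_cntr >= batch_size
```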
At each time step, sample a random minibatch of N transitions from the replay buffer, and then use that set of transitions to set y sub i. Here i indexes each element of that minibatch, so you either loop over the set or do a vectorized implementation — looping is more straightforward, and that's what I do; I always opt for the most straightforward, not necessarily the most efficient, way of doing things the first time through, because you want to get it working first and worry about efficiency later. So: set y sub i equal to r sub i plus gamma — gamma is your discount factor — times Q prime of the new state s sub i plus one, where the action is chosen according to mu prime, given the weights theta super mu prime and theta super Q prime. What's important here — and it isn't immediately clear if you're reading this for the first time, but it's a very important detail — is that the action must be chosen according to the target actor network. So you actually have Q prime as a function of the state as well as the output of another network. Then: update the critic by minimizing the loss — basically the mean squared difference between that y sub i and the output of the actual (evaluation) Q network, where the a sub i's are the actions you actually took during the course of the episode. So this a sub i comes from the replay buffer, while the actions inside y sub i are chosen according to the target actor network. That means for each learning step you're going to have to do a feed-forward pass of not just the target critic network but also the target actor network, as well as the evaluation critic network — I hope I said that right: a feed-forward pass of the target critic network, the target actor network, and the evaluation critic network. Then it says: update the actor policy using the sampled policy gradient. This is the hardest step in the whole thing — the most confusing part. The gradient is equal to one over N times a sum — so a mean, basically; whenever you see one over N times a sum, that's a mean — of the gradient with respect to the actions of Q, where the actions are chosen according to the policy mu evaluated at the current states s sub i, times the gradient with respect to the weights of mu, where you just feed in the set of states. That will be a little bit tricky to implement, and it's part of the reason I chose TensorFlow for this particular video: TensorFlow allows us to calculate gradients explicitly. In the PyTorch version, you may have noticed that all I did was set Q to be a function of the current state as well as the actor network's output, and I allowed PyTorch to handle the chain rule — because this is effectively a chain rule. Let's scroll up a little bit to look at that, because it gave me pause the first ten times I read it; this is the hardest part to implement. If you scroll up, you see that this exact same expression appears earlier, in reference to the gradient with respect to the weights theta super mu of Q of s and a, such that the action a is chosen according to the policy mu. Really, what this is is the chain rule: the gradient is proportional to the gradient of one quantity times the gradient of the other — it's just the chain rule from calculus. In the PyTorch video we implemented that version, and the two are equivalent — it's perfectly valid to do one or the other; in PyTorch we did that one, and today we're going to do this particular version. That's good to know. All right, next step: at each time step you update the target networks according to the soft update rule, so theta super Q prime gets updated as tau times theta super Q plus one minus tau times theta super Q prime, and likewise for theta super mu prime. Then you just end the two loops. So in practice this looks very, very simple.
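Written out in one place, the updates just walked through are (as they appear in the paper's Algorithm 1):

```latex
y_i = r_i + \gamma\, Q'\!\big(s_{i+1},\, \mu'(s_{i+1}\,|\,\theta^{\mu'})\,\big|\,\theta^{Q'}\big)

L = \frac{1}{N}\sum_i \big(y_i - Q(s_i, a_i\,|\,\theta^{Q})\big)^2

\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a\,|\,\theta^{Q})\big|_{s=s_i,\,a=\mu(s_i)}\;\nabla_{\theta^{\mu}} \mu(s\,|\,\theta^{\mu})\big|_{s_i}

\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}
```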
But what do we know off the bat? We need a class for our replay buffer, a class for our noise process, and a class for the actor and a class for the critic. Now, you might think those last two could be the same, but when you look at the details — which we're going to get to in a minute — you realize you need two separate classes. So you need at least one class to handle the deep neural networks, which means at least three classes, and I always add an agent class on top as an interface between the environment and the deep neural networks — so that's four, and we're going to end up with five, but that's what we know right off the bat. Now that we know the algorithm, let's take a look at the supplementary information to see precisely the architectures and parameters used. Scrolling down, here are the experimental details: they used Adam for learning the neural network parameters, with learning rates of ten to the minus four and ten to the minus three for the actor and critic respectively. For Q — the critic — they included an L2 weight decay of ten to the minus two and used a discount factor gamma of 0.99. That gamma is pretty typical, but the important detail is that the L2 weight decay of ten to the minus two applies to Q and only Q, not mu. For the soft target updates they used tau equals 0.001, so one part in a thousand — that is indeed very, very small. Okay, fine. The neural networks used the rectified linear non-linearity for all hidden layers, and the final output layer of the actor was a hyperbolic tangent layer, to bound the actions. Now, tanh goes from minus one to plus one, so in environments where the bounds are, say, plus or minus two, you're going to need a multiplicative factor — just something to keep in mind; it doesn't change the tanh itself, it just means there's a multiplicative factor related to your environment. The low-dimensional networks had two hidden layers with 400 and 300 units respectively — about 130,000 parameters — and actions were not included until the second hidden layer of Q. So when you're calculating the critic function Q, you aren't passing the action in from the very beginning: you include it as a separate input at the second hidden layer of Q. That's a very important implementation detail. When learning from pixels they used three convolutional layers — which we don't need right now, since we're not using pixels yet — followed by two fully connected layers. The final layer weights and biases of both the actor and critic were initialized from a uniform distribution of plus or minus three times ten to the minus three for the low-dimensional case; this was to ensure the initial outputs for the policy and value estimates were near zero. The other layers were initialized from a uniform distribution of plus or minus one over the square root of F, where F is the fan-in of the layer — the fan-in is just the number of input units to the layer.
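To pull those architecture details together in one place, here is a hedged sketch of the critic in Keras style — the video builds this later with low-level TensorFlow 1 graph ops, and the batch normalization placement here is my own guess, so treat it purely as an illustration (Pendulum's 3-dimensional state and 1-dimensional action are assumed):

```python
import tensorflow as tf

def build_critic(state_dims=3, n_actions=1, fc1=400, fc2=300):
    f3 = 0.003  # final-layer init range from the paper: U(-3e-3, 3e-3)
    state = tf.keras.Input(shape=(state_dims,))
    action = tf.keras.Input(shape=(n_actions,))
    x = tf.keras.layers.Dense(fc1, activation='relu')(state)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dense(fc2)(x)
    # The action is not fed in at the input; it joins at the second hidden layer.
    a = tf.keras.layers.Dense(fc2)(action)
    x = tf.keras.layers.Activation('relu')(tf.keras.layers.Add()([x, a]))
    init = tf.keras.initializers.RandomUniform(-f3, f3)
    q = tf.keras.layers.Dense(1, kernel_initializer=init, bias_initializer=init)(x)
    return tf.keras.Model([state, action], q)
```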
Oh, and one more thing: the actions were not included until the fully connected layers — that's for the convolutional case. And right here I'm experiencing some confusion reading this. It says the other layers were initialized from uniform distributions related to the fan-in, and that the actions were not included until the fully connected layers. I'm guessing that since they're talking about fully connected layers, they mean the pixel case, because otherwise all the layers are fully connected and it wouldn't make sense to single out "fully connected" layers. So there's a bit of confusion: is that statement referring to both cases — the state-vector case and the pixel case — or just one? I'm going to interpret it the way I originally did, because it seemed to work, but there is ambiguity there, and this is an example of how reading papers can be a little confusing at times: the wording isn't always clear. Or maybe I'm just tired — I've been rambling for about fifty minutes and my brain is turning to mush, which is quite probable, actually. Anyway: they trained with minibatch sizes of 64 for the low-dimensional problems and 16 on pixels, with a replay buffer size of ten to the six. For the exploration noise process they used temporally correlated noise in order to explore well in physical environments that have momentum — an Ornstein-Uhlenbeck process with theta equals 0.15 and sigma equals 0.2 — and the paper tells you what it does. All well and good. So these are the implementation details we need: 400 and 300 units for the hidden layers; the Adam optimizer at ten to the minus four for the actor and ten to the minus three for the critic; for the critic, an L2 weight decay of ten to the minus two; a discount factor gamma of 0.99; a soft update factor of 0.001; and initializations proportional to one over the square root of the fan-in for the lower layers, and plus or minus 0.003 for the final output layers of our fully connected networks (plus or minus 0.0003 in the pixel case). Okay, that's a lot of details, but we now have everything we need to start implementing the paper, and it only took us about 50 minutes, which isn't too shabby — when I read it the first time it took me quite a while. So let's head back up to the algorithm, and we'll keep it up as a reference for the remainder of the video, because it's quite critical. Now let's head over to our code editor and start coding this up; we'll start with the easy stuff first. We will begin with probably one of the most confusing aspects of the problem: the Ornstein-Uhlenbeck action noise. You can do a Google search for it, and you'll find a Wikipedia article that talks a lot about the physical process behind it, with a lot of mathematical derivations, and that's not particularly helpful. If you want to be a physicist, I invite you to read it and check it out — it's got some pretty cool stuff, and it took me back to my grad school days — but we have a different mission in mind at the moment. The mission now is to find a code implementation of this that we can use in our problem. If you do a Google search for "Ornstein Uhlenbeck github" — you want to find someone's GitHub example of it — you end up with a nice example from the OpenAI baselines library that shows you the precise form of it. Let me show you that for one second. You can see it here on GitHub: there's a whole class for this, right here, that I've highlighted, and it looks to do precisely what we want. It has an x-previous term plus a delta term and a dt term, so it looks like it creates temporal correlations through that x-previous term; there's a reset function to reset the noise, which we may want to use; and there's a representation method, which we'll probably skip, because we know what we're doing and it's not really critical for this project — it would be nice if you were writing a library, as these guys were, which is why they included it. So let's go ahead and code that up in the editor, and that will tackle our first class.
I'm going to leave the notes up there for now — they do no harm, they're just comments at the top of the file. The first thing we'll need is to import numpy as np; we know we can go ahead and import tensorflow as tf, since we're going to need TensorFlow; and we may need something like os to handle model saving, so we can import that as well. Just a fun fact: it's considered good practice to import your standard library packages first, followed by your third-party library packages, followed by your own personal code, each group in alphabetical order. So those are the imports we need to start. Let's code up our class — we'll call it OUActionNoise, and it'll just be derived from the base object. The initializer will take a mu and a sigma — they used a default of, I believe, 0.15 for sigma and a theta of 0.2 — a dt term of something like 1 times 10 to the minus 2, and an x0 that we default to None. If you have any doubts about that, just go check out the OpenAI baselines implementation; it's probably correct — I'll give them the benefit of the doubt. Go ahead and save your parameters as usual — so we have mu, theta, dt, sigma, and x0 — and we'll call the reset function right there in the initializer. Now, they override the __call__ method, and what that does is enable you to say noise = OUActionNoise(...) and then, when you want to get the noise, just write our_noise = noise() — you can use the parentheses directly; that's what overriding __call__ does, which is a good little tidbit to know. Inside __call__ we implement the equation they gave us: x equals self.x_prev plus self.theta times (self.mu minus self.x_prev) times self.dt, plus self.sigma times numpy square root of self.dt times a numpy random normal sample with the size of mu. Then set x_prev to the current value x — that's how you create the temporal correlations — and return the value of the noise. We don't have a value for x_prev yet, so we set it in the reset function, which takes no parameters: self.x_prev equals self.x0 if self.x0 is not None, else numpy zeros_like self.mu. And that's it for the noise class — pretty straightforward. So that's one class down; we've taken care of the noise.
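Putting that dictation together, a sketch of the class looks roughly like this (modeled on the OpenAI baselines version referenced above, with the defaults just mentioned):

```python
import numpy as np

class OUActionNoise(object):
    def __init__(self, mu, sigma=0.15, theta=0.2, dt=1e-2, x0=None):
        self.mu = mu
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        # x_t = x_{t-1} + theta*(mu - x_{t-1})*dt + sigma*sqrt(dt)*N(0, 1)
        x = (self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt
             + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape))
        self.x_prev = x  # storing the sample is what makes the noise temporally correlated
        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
```

For example, for Pendulum's single action you'd create it with `noise = OUActionNoise(mu=np.zeros(1))` and then call `noise()` at every step.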
Now we can move on to the replay buffer. This will be similar to what I've implemented in the past, and there are many valid ways to do it: many people use the built-in Python data structure called a deque, basically a queue you fill up over time, and that's perfectly fine. I prefer a set of numpy arrays, because it lets us tightly control the data types of what we're saving. For this pendulum environment it doesn't really matter, but as you get deeper into this field you'll be saving things of varying sizes — images from the Atari libraries or a MuJoCo environment, say — and the memory can explode quickly and eat into your RAM. Being able to choose single or double precision for your floats and integers is critical for memory management, and for taking advantage of other optimizations on NVIDIA GPUs of the Turing class and above. So I always use numpy arrays; it's a clean implementation that allows manipulation of the data types, but a deque is perfectly valid if you prefer it. The buffer class has its own initializer, of course, taking a maximum size, the input shape, and the number of actions, because we have to store the state, action, reward, and new-state tuples. We'll also store the done flag, and the reason is intimately related to how the Bellman equation is calculated: at the end of an episode the agent receives no further rewards, so the expected discounted future reward is identically zero, and we have to multiply the value of the state that follows the terminal state by zero so nothing beyond the episode's end gets counted. So we save our parameters, set mem_cntr = 0 to keep track of the position of the most recently saved memory, and allocate state_memory as np.zeros((mem_size, *input_shape)), new_state_memory the same way — totally the same — action_memory as np.zeros((mem_size, n_actions)), reward_memory as a scalar array of shape (mem_size,), and terminal_memory also of shape (mem_size,) with dtype np.float32. If I recall correctly the float32 is due to the data types in the PyTorch implementation; it's probably not necessary here in TensorFlow, but I left it the same way to be consistent and it doesn't hurt anything. Next we need a function to store a transition, taking a state, action, reward, new state, and done flag. The index where we store the memory is mem_cntr modulo mem_size: for any counter less than mem_size this just returns the counter, and for anything larger it wraps around — if you have a million memories, it counts from zero up to 999,999 and then wraps back to 0, 1, 2, and so on — so you're overwriting the earliest memories with the newest ones, precisely as described in the paper. (If you use the deque method I believe you'd just pop from the left, though don't quote me; I haven't used that implementation.) So state_memory[index] = state, new_state_memory[index] = state_, reward_memory[index] = reward, and action_memory[index] = action — keep in mind the actions in this case are arrays themselves, so it's an array of arrays; keep that in the back of your mind so you can visualize the problem. Then the terminal memory gets a little twist: instead of done we store 1 - int(done), because we don't want to
count rewards after the episode has ended. When done is True we want to multiply by zero, and 1 - int(True) is 1 - 1 = 0, while when the episode isn't over it's 1 - 0 = 1, which is precisely the behavior we want. Finally, increment mem_cntr by one every time you store a new memory. Next we need a function to sample the buffer, which takes a batch size as input (you could alternatively make batch_size a member variable of the class; it's not a big deal, I just chose to pass it in). We want to sample memories anywhere from the zeroth position up to the last filled memory, so max_mem = min(mem_cntr, mem_size). The reason you can't just use mem_cntr is that it grows larger than mem_size, and if you tell numpy to select from a range bigger than the array you'll end up trying to access elements that aren't there and it will throw an error. So we take a random choice of batch_size indices from 0 to max_mem, gather the corresponding states, new states, actions, rewards, and terminal flags from their respective arrays, and return them. That's it for the replay buffer class — pretty straightforward, and if you've seen my deep Q-learning videos you've seen essentially the same implementation; I just used to keep it inside the agent class, and I'm refining my approach over time, so it makes sense to give it its own class. We're already about forty percent of the way there; there are five classes in total.
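Here is a minimal sketch of that replay buffer with the store and sample methods just described; the names follow the video, and the details should be treated as illustrative.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size experience buffer backed by numpy arrays."""
    def __init__(self, max_size, input_shape, n_actions):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.state_memory = np.zeros((self.mem_size, *input_shape))
        self.new_state_memory = np.zeros((self.mem_size, *input_shape))
        self.action_memory = np.zeros((self.mem_size, n_actions))
        self.reward_memory = np.zeros(self.mem_size)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.float32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size      # wrap around, overwriting the oldest memories
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.terminal_memory[index] = 1 - int(done)  # 0 after terminal states, 1 otherwise
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)   # only sample memories that exist
        batch = np.random.choice(max_mem, batch_size)
        states = self.state_memory[batch]
        actions = self.action_memory[batch]
        rewards = self.reward_memory[batch]
        states_ = self.new_state_memory[batch]
        terminal = self.terminal_memory[batch]
        return states, actions, rewards, states_, terminal
```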
Next we have to contend with the actor and critic networks, starting with the actor; keep in mind there are two actor networks (the actor and the target actor), and we'll have to deal with some of the peculiarities of how TensorFlow likes to do things. In TensorFlow you don't derive from any particular class, whereas in PyTorch you'd derive from nn.Module. The initializer needs a learning rate, the number of actions, a name — the name is what distinguishes the regular actor from the target actor — the input dims, and a session. TensorFlow has the construct of a session, which houses the graph and all the variables and parameters; you could give each class its own session, but it's tidier to pass a single session into each of the classes. We also take the number of units for the first fully connected layer (400 if we're implementing the paper precisely), fc2_dims of 300, an action bound, a batch size defaulting to 64, and a checkpoint directory whose purpose is to save our model — for the pendulum it doesn't really matter because it's quick to run, but in general you want a way of saving these models because they can take a long time to train. We save the learning rate, number of actions, and all the other parameters we passed in. The purpose of the action bound is to accommodate environments where the action range is larger than ±1: the tanh output only covers ±1, so in a ±2 environment you'd only be sampling half the range, and the action bound is a multiplicative factor that makes sure you can reach the full range of actions available to your agent. We save the checkpoint directory and finally call a build_network function. Since we have to do the soft update rule for the target actor and target critic, we need a way of keeping track of the parameters in each network, and we do that with self.params = tf.trainable_variables(scope=self.name). We have a single session and a single graph with multiple deep neural networks inside it, and we don't want to update the critic's parameters when we're training the actor, or vice versa — we want them independent — so we scope each network with its own name; that tells TensorFlow these are totally separate sets of parameters, which keeps everything tidy and will also aid in copying parameters later. We also need a tf.train.Saver object to save the model, and a checkpoint file built with os.path.join from the checkpoint directory and the name plus '_ddpg.ckpt', so the save files are automatically scoped per network and we don't confuse the parameters of the actor, target actor, critic, and target critic — very important. Now, we're going to have to calculate some gradients by hand, so we need a series of operations to facilitate that. The first is the unnormalized actor gradients, given by tf.gradients(self.mu, self.params, -self.action_gradient). Here mu is the mu from the paper, the actual actions of the agent; params are our network parameters; and action_gradient is a placeholder for the gradient of the critic with respect to the actions taken, which we'll calculate later in the agent's learn function. If you look back at the algorithm in the paper, the policy update needs exactly these two pieces: the gradient of the critic with respect to the actions and the gradient of mu with respect to the network weights. And since the policy gradient in the paper is an average — one over N times a sum over the minibatch — we normalize by casting the result to a list and mapping a lambda that divides each gradient by the batch size.
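To make that concrete, here is a stripped-down sketch of the gradient machinery on a toy actor graph, assuming TensorFlow 1.x as used in the video; the variable names here are illustrative rather than the exact ones from the video's class.

```python
import numpy as np
import tensorflow as tf  # assuming TensorFlow 1.x

batch_size = 64
n_actions = 1
input_dims = 3

with tf.variable_scope('actor'):
    inputs = tf.placeholder(tf.float32, [None, input_dims], name='inputs')
    # dQ/da, fed in at run time from the critic
    action_gradient = tf.placeholder(tf.float32, [None, n_actions], name='gradients')
    mu = tf.layers.dense(inputs, n_actions, activation=tf.nn.tanh)

# only the variables scoped under 'actor' belong to this network
params = tf.trainable_variables(scope='actor')

# d(mu)/d(params), weighted by -dQ/da: the deterministic policy gradient
unnormalized_actor_gradients = tf.gradients(mu, params, -action_gradient)
# the paper's update is an average over the minibatch, so divide by N
actor_gradients = [g / batch_size for g in unnormalized_actor_gradients]
optimize = tf.train.AdamOptimizer(1e-4).apply_gradients(zip(actor_gradients, params))
```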
The optimize op is our optimization step, and of course that's the Adam optimizer with our learning rate. Normally you'd call .minimize(loss), but since we're calculating the gradients manually we call apply_gradients instead, applying the actor gradients to the network parameters. I'd encourage you to look up the TensorFlow documentation for these functions — the video is already well over an hour, so I'm not going to walk through all of it here, but I had to when I was building this. Next we build the network, and this is where the scoping happens: everything goes inside with tf.variable_scope(self.name), so every network gets its own scope. We need a placeholder for the input: float32, shape [None, *input_dims] — None is the batch dimension — with name='inputs'; the name parameter isn't critical, it just makes life easier when you're tracing where something went wrong. The action gradient is also a placeholder, the quantity we'll calculate in the agent's learn function, and it gets shape [None, n_actions], since it's the gradient of Q with respect to each component of the action. Those are our two placeholders; now we construct the actual network. Handle the initialization first: f1 is the fan-in factor, 1 divided by the square root of fc1_dims, and dense1 is tf.layers.dense taking self.input with fc1_dims units and kernel and bias initializers both random_uniform(-f1, f1). I forgot an import for that, so back up to the top of the file to add from tensorflow.initializers import random_uniform, and now we're good to go. Then batch1 is tf.layers.batch_normalization of dense1 (that doesn't get initialized), and the first layer's activation is just the relu of the batch-normalized output. Now, it's an open debate, from what I've read online, whether you should do the activation before or after the batch normalization. I'm in the camp of activating after the batch norm: with relu you lop off everything below zero, so if you activate first your batch statistics can get skewed positive when the real distribution might be centered at zero or even negative. So I think the batch norm is best before the activation, and indeed this works out — but it's something you can play with, so go ahead and fork the repo and see how much of a difference it makes for you; maybe I missed something when I tried it the other way. It's entirely possible, I miss stuff all the time, and who knows, maybe
other implementations work just as well. Now add some space and do the second layer: f2 is 1 over the square root of fc2_dims, and dense2 is another tf.layers.dense taking layer1_activation as input with fc2_dims units — copy the initializers from above, just changing f1 to f2. Then batch2 takes dense2 as input, and layer2_activation is the relu of batch2. Finally we have the output layer, which is the actual deterministic policy of our agent. From the paper it gets initialized uniformly within ±0.003, so f3 = 0.003, and mu is a dense layer taking layer2_activation as input, with n_actions output units, a tangent hyperbolic (tanh) activation, and the f3 initializers for both kernel and bias. Then, to account for the fact that our environment may well require actions with values beyond ±1, self.mu is tf.multiply(mu, self.action_bound) — the action bound will be something like 2, and it needs to be positive so you don't flip the actions. So now we've built our network, and the next thing we need is a way of getting actual actions out of it: a predict function that takes some inputs and returns self.sess.run(self.mu, feed_dict={self.input: inputs}). That's all there is to the feed-forward here — an interesting contrast to how PyTorch does it with the explicit construction of a forward function; this just runs the session on the output op and resolves all the associations between the respective variables itself. Next we need a train function, taking inputs and gradients, which performs the actual backpropagation through the network: it runs self.optimize with a feed dict of the inputs and the action gradients. Then two functions to handle loading and saving the model: save_checkpoint prints a message and calls self.saver.save — very creative, I know — saving the current session to the checkpoint file, and load_checkpoint is the same thing in reverse, printing "loading checkpoint" and calling self.saver.restore with the session and the checkpoint file. Note you can only call load after instantiating the agent, so there's a default session with initialized values to load the checkpoint's variables into. And that's it for the actor class; it's reasonably straightforward, and the only real mojo is the actor gradient machinery — two operations that exist because we're going to manually compute the gradient of the critic with respect to the actions taken.
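Here is a sketch of the actor's build_network logic as just described, written as a standalone function for clarity and assuming TensorFlow 1.x; in the video this lives inside the Actor class and reads its sizes and placeholders from self.

```python
import numpy as np
import tensorflow as tf  # assuming TensorFlow 1.x

def build_actor_network(inputs, fc1_dims, fc2_dims, n_actions, action_bound):
    """Two dense layers with batch norm then relu, and a tanh output
    rescaled by the environment's action bound."""
    f1 = 1. / np.sqrt(fc1_dims)
    init1 = tf.random_uniform_initializer(-f1, f1)  # same idea as the random_uniform import in the video
    dense1 = tf.layers.dense(inputs, units=fc1_dims,
                             kernel_initializer=init1, bias_initializer=init1)
    batch1 = tf.layers.batch_normalization(dense1)
    layer1_activation = tf.nn.relu(batch1)           # norm first, then activate

    f2 = 1. / np.sqrt(fc2_dims)
    init2 = tf.random_uniform_initializer(-f2, f2)
    dense2 = tf.layers.dense(layer1_activation, units=fc2_dims,
                             kernel_initializer=init2, bias_initializer=init2)
    batch2 = tf.layers.batch_normalization(dense2)
    layer2_activation = tf.nn.relu(batch2)

    f3 = 0.003                                       # +/- 3e-3 output init, per the paper
    init3 = tf.random_uniform_initializer(-f3, f3)
    mu = tf.layers.dense(layer2_activation, units=n_actions, activation=tf.nn.tanh,
                         kernel_initializer=init3, bias_initializer=init3)
    return tf.multiply(mu, action_bound)             # stretch tanh output to the env's action range
```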
Next let's do the critic class, which is very similar. It again derives from the base object, and its initializer takes pretty much the same things: a learning rate, the number of actions, a name, input dims, a session, fc1_dims, fc2_dims, a batch size defaulting to 64, and a checkpoint directory defaulting to 'tmp/ddpg'. Just a note: you have to mkdir tmp/ddpg first, otherwise it'll bark at you — not a big deal, just something to be aware of. Since so much is identical, copy a good chunk from the actor, including the checkpoint file construction. Now the optimizer: we've already called the function that builds our network, so we can define self.optimize as tf.train.AdamOptimizer with the critic's learning rate, minimizing the loss we calculate in build_network. We also need an operation to calculate the gradients of Q with respect to the actions: self.action_gradients = tf.gradients(self.q, self.actions). Then build_network, again wrapped in tf.variable_scope(self.name). We need our placeholders: the input is a float32 placeholder of shape [None, *input_dims] with name='inputs' — if you're not too familiar with TensorFlow, specifying None in the first dimension tells it to expect some batch of inputs whose size you don't know beforehand. We also need the actions, because remember, the actions only enter at the second hidden layer of the critic: a float32 placeholder of shape [None, n_actions] named 'actions'. And much like in Q-learning we have a target value — the quantity from the paper that we'll calculate in the agent's learn function — so q_target is another placeholder, a float32 of shape [None, 1] (it's a scalar per sample, batch size by one), named 'targets'. The network itself is a pretty similar setup to the actor, so copy the first block: f1 is the fan-in factor, and the first dense layer takes the inputs with the random uniform initializers. The second layer is a little different: we get rid of its activation, because after the batch norm we have to take the actions into account. So we add another layer, action_in, a tf.layers.dense that takes self.actions (which we'll feed in from the learn function) and outputs fc2_dims units with a relu activation; the state-action value is then the addition of batch2 and action_in, and we go ahead and activate that sum. This is something I also pointed out in my PyTorch video: it's a point of debate how I've implemented this. I've tried a few variations and this is what I found to work — I'm doing a double activation, a relu on the output of the action dense layer and then another relu on the sum. Now, the
relu function is not commutative with addition: the relu of a sum is different from the sum of the relus, and you can prove that to yourself on a sheet of paper. So it's debatable whether the way I've done it is strictly correct, but it seems to work, so I'm sticking with it for now; feel free to clone the repo, change it up, see how it works, and if you improve on it make a pull request and I'll disseminate it to the community. With the state-action pathway joined, we calculate the actual output of the layer: f3 is the ±0.003 uniform initializer for the final layer, and self.q is a tf.layers.dense taking state_actions as input, outputting a single unit, with the f3 kernel and bias initializers similar to the ones above. We're still missing one thing, though: the regularizer. As the paper says, the critic gets L2 regularization, and we do that with kernel_regularizer=tf.keras.regularizers.l2(0.01). Notice that Q outputs a single unit — it's a scalar, the value of that particular state-action pair. Finally the loss function, which is just the mean squared error between q_target and q; q_target is the placeholder up above, fed in from the agent's learn function, and self.q is the output of the deep neural network. Then, similar to the actor, we have a predict function that takes inputs and actions and returns sess.run(self.q) with a feed dict of the inputs and actions; a train function, slightly more complicated than the actor's because it takes inputs, actions, and a target, and runs self.optimize with all three fed in; and a get_action_gradients function that runs the action_gradients operation defined above, again taking inputs and actions and feeding them in. The save and load checkpoint functions are identical to the actor's, so copy and paste those. That's it for the critic class — we now have most of what we need: our noise, our replay buffer, our actor, and our critic; all that's left is the agent.
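Before the agent, here is the critic's graph gathered into one sketch, again as a standalone function under the same TensorFlow 1.x assumption; the commented usage lines show how the loss and action-gradient ops described above would hang off of it inside the class.

```python
import numpy as np
import tensorflow as tf  # assuming TensorFlow 1.x

def build_critic_network(inputs, actions, fc1_dims, fc2_dims):
    """State pathway through two dense layers; the action joins at the second
    layer; a single Q value comes out, with L2 regularization on the last layer."""
    f1 = 1. / np.sqrt(fc1_dims)
    init1 = tf.random_uniform_initializer(-f1, f1)
    dense1 = tf.layers.dense(inputs, units=fc1_dims,
                             kernel_initializer=init1, bias_initializer=init1)
    batch1 = tf.layers.batch_normalization(dense1)
    layer1_activation = tf.nn.relu(batch1)

    f2 = 1. / np.sqrt(fc2_dims)
    init2 = tf.random_uniform_initializer(-f2, f2)
    dense2 = tf.layers.dense(layer1_activation, units=fc2_dims,
                             kernel_initializer=init2, bias_initializer=init2)
    batch2 = tf.layers.batch_normalization(dense2)   # no activation here yet

    # the actions enter the network at the second hidden layer
    action_in = tf.layers.dense(actions, units=fc2_dims, activation=tf.nn.relu)
    state_actions = tf.nn.relu(tf.add(batch2, action_in))

    f3 = 0.003
    init3 = tf.random_uniform_initializer(-f3, f3)
    q = tf.layers.dense(state_actions, units=1,
                        kernel_initializer=init3, bias_initializer=init3,
                        kernel_regularizer=tf.keras.regularizers.l2(0.01))
    return q

# Inside the critic class (q_target being the [None, 1] placeholder fed from the
# agent's learn function), the remaining ops would look roughly like:
#   self.q = build_critic_network(self.input, self.actions, 400, 300)
#   self.loss = tf.losses.mean_squared_error(self.q_target, self.q)
#   self.action_gradients = tf.gradients(self.q, self.actions)
```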
Now for the agent, which ties everything together: it handles the learning functionality and owns the noise, the replay buffer memory, and the four deep neural networks. It derives from the base object, and the initializer is a little long. It takes alpha and beta, the learning rates for the actor and critic respectively — recall from the paper they use 0.0001 and 0.001 — plus input dims, tau, the environment (that's how we get the action bounds), a gamma of 0.99 as in the paper, n_actions defaulting to 2, a memory size of one million, a layer 1 size of 400, a layer 2 size of 300, and a batch size of 64. Of course we save all of our parameters. We need the memory, which is just a ReplayBuffer with the max size, input dims, and number of actions, and we save the batch size. Here is also where we store the session, so that there is a single session for all four networks — and I believe, don't quote me on this, that when I tried an individual session for each network it was very unhappy when I attempted to copy parameters from one network to another; I figured there were scoping issues, so I simplified it to a single session, and there's no real reason I can think of to have more than one. Then the networks: an actor, which gets alpha, the number of actions, the name 'Actor', the input dims, the session, the layer 1 and layer 2 sizes, and env.action_space.high as the action bound; and a critic, which gets beta, the number of actions, the name 'Critic', the input dims, the session, and the layer sizes — we don't pass in anything about the environment there. Then copy those two and change the names to target actor and target critic, and while we're at it clean things up to be consistent with the PEP 8 style guide — always important, and it actually makes you stand out; I've worked on projects where the manager was quite happy to see a somewhat strict adherence to it. That's all four of our deep neural networks. Next the noise: an OUActionNoise with mu = np.zeros(n_actions). Now we need the operations that perform the soft updates. The first time I tried this I defined the assignment as its own separate function, and that was a disaster: it got progressively slower at each soft update. I don't know the exact reason, but my guess is that every call added new operations to the graph, which adds overhead to the calculation, so instead I moved it into the initializer and defined one set of operations. For the critic we iterate over the target critic's parameters and call the assign operation on each: we assign tau times the corresponding critic parameter plus (1 - tau) times the target critic parameter itself, as a list comprehension over range(len(self.target_critic.params)); then a mirror-image operation for the actor and target actor, swapping actor for critic. With those soft update operations defined according to the paper, all four graphs are constructed, so we initialize our variables with self.sess.run(tf.global_variables_initializer()) — you can't really run anything without initializing it.
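Here is one way to express those soft-update operations, written as a small helper so the assign ops are built exactly once; the attribute names in the commented usage are the ones from the classes above and are meant as illustration, assuming TensorFlow 1.x.

```python
import tensorflow as tf  # assuming TensorFlow 1.x

def make_soft_update_ops(target_params, online_params, tau):
    """Build theta_target <- tau*theta + (1 - tau)*theta_target assign ops.
    Call this once in __init__ so nothing new is added to the graph later."""
    return [t.assign(tf.multiply(o, tau) + tf.multiply(t, 1. - tau))
            for t, o in zip(target_params, online_params)]

# e.g. in the Agent's __init__ (illustrative names):
#   self.update_critic = make_soft_update_ops(self.target_critic.params,
#                                             self.critic.params, self.tau)
#   self.update_actor = make_soft_update_ops(self.target_actor.params,
#                                            self.actor.params, self.tau)
# and at update time simply:
#   self.sess.run(self.update_critic)
#   self.sess.run(self.update_actor)
```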
As per the paper, at the very beginning we want the target networks to be initialized with the full values of the evaluation networks, so I pass a parameter first=True into update_network_parameters. Since that's a bit confusing, let's write that function first: first defaults to False, and if it's True we save the old value of tau, set tau to one, run the update_critic and update_actor operations, and then reset tau to its old value — you only want that hard copy on the very first pass; otherwise you just run the soft update operations as they are. (As I recall it mattered which session ran the update, although maybe not, since I'm using only one session; if you want to play around with it, go ahead.) Next we need a way of storing transitions: remember takes a state, action, reward, new state, and done flag and just calls memory.store_transition with all of it — it's an interface from one class to another, which may or may not be great computer-science practice, but it works. Then a way of choosing an action, which takes a state as input. Because we defined the actor's input placeholder with shape [None, input_dims], you have to reshape a single observation to shape (1, input_dims) — state[np.newaxis, :] — otherwise TensorFlow is going to get uppity with you about feeding in a bare observation vector when you only want a single action. Then mu = self.actor.predict(state), noise = self.noise(), mu_prime = mu + noise, and we return mu_prime[0], since predict returns a batch we need the zeroth element of. Now the learn function, where all the magic happens. If the memory hasn't filled up to at least a batch size, bail out; otherwise sample the memory with sample_buffer(batch_size). Let's go back to the paper to be clear on the update: we need to calculate the targets, and for that we need the output of the target critic Q' evaluated at the target actor's actions for the new states; we then use those targets in the critic's loss, and we'll also need the outputs of the critic and the actor themselves for the policy update — so states and actions get passed through all four networks. Back in the code editor: critic_value_ is target_critic.predict of the new states together with target_actor.predict of those same new states (one extra set of parentheses needed there). Then we calculate the y_i targets: start with an empty list and, for j in range(batch_size), append reward[j] + gamma * critic_value_[j] * done[j]. This is where all my harping about receiving no rewards after the terminal state comes in:
when done is True the stored flag is 1 - int(True) = 0, so that whole next-state quantity — calculated up above — gets multiplied by zero and you only take into account the most recent reward, exactly as you'd want. Then we reshape the target into shape (batch_size, 1) to be consistent with our placeholders, and call the critic's train function — we have everything we need: the states and actions from the replay buffer and the targets from this calculation. Very easy. Now the actor update: a_outs = self.actor.predict(states) gets the actor's predictions for those states; grads = self.critic.get_action_gradients(states, a_outs) does a feed-forward and gives us the gradient of the critic with respect to the actions taken; and then self.actor.train(states, grads[0]) — it comes back as a tuple, so you dereference the zeroth element. Finally, call update_network_parameters to do the soft update. That's it for the learn function. Two other bookkeeping functions remain: save_models, which calls save_checkpoint on the actor, target actor, critic, and target critic, and load_models, which calls the corresponding load_checkpoint functions instead (sounds like a dog barking out there, sorry). And that's it for the agent class. That only took about an hour, so we're up to roughly two hours for the video — the longest one yet. This is an enormous amount of work, already about three hundred and ten lines of code; if you've made it this far, congratulations, this is no mean feat. It took me a couple of weeks to hammer through originally, and we've gotten through it in a couple of hours.
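Before moving on to testing, here is the learn function from above gathered into one place. It assumes the Agent attributes and the network methods defined earlier in this video, so treat it as a sketch rather than drop-in code.

```python
import numpy as np

def learn(self):
    """One DDPG update: critic toward the Bellman targets, actor along dQ/da,
    then a soft update of the target networks."""
    if self.memory.mem_cntr < self.batch_size:
        return   # not enough transitions yet to sample a full batch
    state, action, reward, new_state, done = \
        self.memory.sample_buffer(self.batch_size)

    # y_j = r_j + gamma * Q'(s_{j+1}, mu'(s_{j+1})); done is already 1 - int(done),
    # so the bootstrap term is zeroed out after terminal states
    critic_value_ = self.target_critic.predict(
        new_state, self.target_actor.predict(new_state))
    target = []
    for j in range(self.batch_size):
        target.append(reward[j] + self.gamma * critic_value_[j] * done[j])
    target = np.reshape(target, (self.batch_size, 1))

    # critic update: minimize (y_j - Q(s_j, a_j))^2
    self.critic.train(state, action, target)

    # actor update: feed the actor's own actions back through the critic to get dQ/da
    a_outs = self.actor.predict(state)
    grads = self.critic.get_action_gradients(state, a_outs)
    self.actor.train(state, grads[0])     # tf.gradients returns a list, hence [0]

    # soft update of the target networks
    self.update_network_parameters()
```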
That's all there is to the implementation; now we have to actually test it. Open a new file and save it as main_tf.py — we'll try this out on the pendulum environment. Import the Agent from our TensorFlow file, gym, and numpy, and from utils import plot_learning; we don't need os or tensorflow here. Under if __name__ == '__main__': make the environment with gym.make('Pendulum-v0') and construct the agent with a learning rate alpha of 0.0001, a beta of 0.001, input_dims=[3], tau=0.001, the environment, a batch size of 64 (being verbose with the keywords), and n_actions=1. It looks like when I actually ran this I used different layer sizes — when I got good results I used 800 by 600 and cut these learning rates in half — but that's a hyperparameter-tuning issue; here we'll use the values from the paper. The other thing we need to do is set the random seed, for reproducibility: I have yet to see an implementation that doesn't set it, and when I tried without it I didn't get good sampling of the replay buffer, so it seems to be a critical step, and most implementations I've seen on GitHub do the same thing. Then play, say, a thousand games. Keep a score_history list to track the scores over the course of the games; each episode, reset the environment, set done to False and the score to zero, and while the episode isn't over: choose an action from the observation, step the environment to get the new state, reward, done flag, and info, and call agent.remember(obs, act, reward, new_state, int(done)) — the int isn't strictly necessary since the replay buffer class takes care of it, but it doesn't hurt to be explicit. Call agent.learn() on every time step, because this is a temporal-difference method; add the reward to the score and set the old state to the new state, so the next action is chosen from the most recent information. At the end of every episode append the score to score_history and print the episode number, the score, and the trailing 100-game average. Finally, with a filename of 'pendulum.png', call plot_learning(score_history, filename, window=100) — I choose a window of 100 because many environments define "solved" as a trailing average over 100 games; the pendulum doesn't actually have a solved threshold, but what we get is on par with some of the best results on the leaderboard. Now to the terminal to see how many typos I made — and there were a few: an extra print statement causing invalid syntax, a star out of place on line 95, a comma where a colon should have been on line 198, an input_dims I forgot to save in both the actor and the critic (the hazard of cut and paste), an action_bound left in the critic that only belongs in the actor (cut and paste again — I tried to save a few keystrokes and wasted time instead), a "takes one positional argument but seven were given" error that came from a botched class definition for the critic, and a self.inputs on line 126 that should have been self.input. With those fixed, it runs.
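For reference, this is roughly the shape of that main file in one place. Agent and plot_learning are the pieces built in this video and in the utils file; the module name for the agent, the seed value, and the exact keyword names are as I understood them from the description above, so double-check them against your own code.

```python
import gym
import numpy as np
from ddpg_tf import Agent          # module name assumed; the Agent class written above
from utils import plot_learning    # simple moving-average plot helper

if __name__ == '__main__':
    env = gym.make('Pendulum-v0')
    np.random.seed(0)              # seed value assumed; seeding itself matters for reproducibility
    agent = Agent(alpha=0.0001, beta=0.001, input_dims=[3], tau=0.001,
                  env=env, batch_size=64, layer1_size=400, layer2_size=300,
                  n_actions=1)
    score_history = []
    for i in range(1000):
        obs = env.reset()
        done = False
        score = 0
        while not done:
            act = agent.choose_action(obs)
            new_state, reward, done, info = env.step(act)
            agent.remember(obs, act, reward, new_state, int(done))
            agent.learn()          # temporal-difference method: learn every step
            score += reward
            obs = new_state        # act on the most recent information next step
        score_history.append(score)
        print('episode', i, 'score %.2f' % score,
              '100 game average %.2f' % np.mean(score_history[-100:]))
    plot_learning(score_history, 'pendulum.png', window=100)
```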
I let it run for a thousand games earlier, and this is the output I got — keep in mind, with slightly different parameters. The point here isn't whether we can replicate the paper's results exactly, because we don't even know what those really were: they express performance as a fraction of the performance of a planning agent, and who knows what that translates to. I did a little hyperparameter tuning — all I did was double the number of hidden units and halve the learning rates — and I ended up with something that looks like this: it gets to around -150 or so on the trailing average, and if you check the leaderboard that's actually a reasonable number; the best agents are in the same neighborhood, with some doing a little better. The default implementation looks very slow to learn — you can see how it starts out okay, gets worse, and then slowly starts to get better — and that kind of oscillation in performance over time is something you see pretty frequently. But that is how you go from a paper to an implementation in about two hours; of course it took me many times that to get this set up for you, but I hope it's helpful. I'm going to milk this project for all it's worth — it's been a tough one — so I'll present many more environments in the future, I may even do a video like this for PyTorch, and while I have yet to work on a Keras implementation, there are many more DDPG videos to come. Subscribe so you don't miss them, leave a comment, and share this if you found it helpful; that helps me immensely, and I'll see you all in the next video. Welcome back, everybody. In this tutorial you're going to code a deep deterministic policy gradient agent to beat the continuous lunar lander environment in PyTorch — no prior experience needed; you don't have to know anything about deep reinforcement learning, you just have to follow along. We start, as usual, with our imports: we'll need os to handle file operations, all the usual stuff from torch, and numpy. I'm not going to do a full overview of the paper — that will be a future video where I show you how to go from the paper to an actual implementation of deep deterministic policy gradients, so make sure you subscribe — in this video we're just taking the high-level, 50,000-foot view, which is sufficient to get an agent to beat the continuous lunar lander environment, and that's good enough. The gist is that we'll need several classes: a class to encourage exploration — as you might have guessed from the word "deterministic", the policy chooses its action with certainty, so if it's purely deterministic it can't really explore, and we need a type of noise to handle that; a class for the replay memory, because deep deterministic policy gradients works by combining the magic of actor-critic methods with the magic of deep Q-learning, which of course has a replay buffer; classes for our critic and actor; and the agent itself. That's a mouthful, so we'll handle it one bit at a time, starting with the noise. The class is called OUActionNoise, where OU stands for Ornstein-Uhlenbeck: a type of noise from physics that models the motion of a Brownian particle, meaning a particle subject to a random walk from its interactions
with other nearby particles. It gives you a temporally correlated — meaning correlated in time — type of noise that centers around a mean of zero. So we'll have a number of parameters: mu, sigma, theta (if I could type, that would be fantastic), dt as in the differential with respect to time, and an initial value x0 that defaults to None. If you want to know more about it, go check out the Wikipedia article, but the overview I just gave you is sufficient for this tutorial. Of course we save all of our values, including x0, and call a reset function. The reset function resets the temporal correlation, which you may want to do from time to time; it turns out not to be necessary for our particular implementation, but it's a good function to have nonetheless. Next we override the __call__ function: if you aren't familiar with this, it allows you to write noise = OUActionNoise(...) and then get a new sample just by calling noise() with parentheses, rather than something like noise.get_noise(). The equation for it is x = x_prev + theta * (mu - x_prev) * dt + sigma * np.sqrt(dt) * np.random.normal(size=mu.shape): it's a type of normally distributed noise that becomes correlated in time because each new value is computed from the previous one, and every time we calculate a new value we set x_prev to it and then return it. The reset function just checks whether x0 exists: x_prev = x0 if x0 is not None, else np.zeros_like(mu), a zero value in the shape of mu. That's it for the action noise; the agent will use it to add some exploration noise to its action selection. Next we need the replay buffer class, and this is pretty straightforward: it's just a set of numpy arrays shaped by the observation space, the action space, and the rewards, so that we have a memory of events that have happened and can sample them during the learning step. If you haven't seen my videos on deep Q-learning, please check those out — they'll make all of this much clearer — as will the videos on actor-critic methods, because again, DDPG kind of combines actor-critic with deep Q-learning; I'll link those here so you can get educated. Scrolling down a bit: we save max_size as mem_size, and mem_cntr starts out at zero. The buffer is a set of arrays — matrices in this case — that keep track of the state, action, and reward transitions, each of shape (mem_size, *input_shape): however many memories we want to store by the input shape. If you're relatively new to Python, this star-variable idiom isn't a pointer, if you're coming from C or C++; it's an idiom that unpacks a tuple, which makes our class extensible — we can pass in a list with a single element, as in the case of the continuous lunar lander environment we'll use today, or, later on when we get to the continuous car-racing environment, images from the screen; it accommodates both types of observation vectors. The new-state memory is, of course, the same
shape, so we just copy that line (it turns out I was missing a parenthesis somewhere, and had typed self where it should have been def — fixed). We also need an action memory, which is an array of zeros with shape (mem_size, n_actions), and a reward memory of shape (mem_size,). We also need a terminal memory. In reinforcement learning we have the concept of the terminal state: when the episode is over the agent enters the terminal state, from which it receives no future rewards, so the value of that terminal state is identically zero. The way we keep track of transitions into terminal states is by saving the done flags from the OpenAI gym environment, in an array of shape (mem_size,) with dtype float32 — probably because torch is a little particular with data types, so we have to be cognizant of that. Next, a function to store transitions, taking a state, action, reward, new state, and done flag. The index is the first available position: mem_cntr just keeps track of the last memory you stored, an integer from 0 up to mem_size, and when it grows past mem_size it wraps back around from zero — when mem_cntr equals mem_size the index is zero, at mem_size plus one it's one, and so on. So state_memory[index] = state, action_memory[index] = action, reward_memory[index] = reward, new_state_memory[index] = state_, and the terminal memory — good grief — doesn't store done but 1 - done. The reason is that when we get to the Bellman update in the learning function, you'll see we want to multiply by whether or not the episode is over, and storing 1 - done facilitates that. Then just increment the counter. Next we need to sample the buffer: sample_buffer takes a batch size as input, max_mem is the minimum (not the maximum) of mem_cntr and mem_size, and batch is a random choice of batch_size indices out of that maximum number of elements. Scroll down a bit, and then we get hold of the respective states, actions, rewards, new states, and terminal flags — it's not easy to type and talk at the same time, apparently — and pass them back to the learning function. That's it for our replay memory; you're going to see this class a lot in the other videos on deep deterministic policy gradients, because we need it for basically anything that uses a memory. Now let's get to the meat of the problem with our critic network. As is often the case when you're dealing with PyTorch, you want to derive your neural network classes from nn.Module: that gives you access to important things like the train and eval functions, which set the network to training or evaluation mode — very important later; I couldn't get this to work until I figured that out, so a little tidbit for you — and you also need access to the parameters for updating the weights of the network. So let's define our initializer: beta is our learning rate, and we'll need input_dims and the number of dimensions for
the first fully connected layer, as well as the second, the number of actions, and a name — the name is important for saving the network, since you'll see we have many different networks and we want to keep track of which is which — plus a checkpoint directory for saving the model; that's also very important, because this model runs very slowly, so you'll want to save it periodically. We call the super constructor for CriticNetwork, which calls the constructor of nn.Module, and save input_dims, fc1_dims, fc2_dims — these are just the parameters of the deep neural network that approximates the value function — the number of actions, and a checkpoint file built with os.path.join from the checkpoint directory and the name plus '_ddpg'. If you check my GitHub repo I'll upload the trained model, since it takes a long time to train — I've spent the resources and time training it up, so you may as well benefit. Next, the first layer of our neural network, just a linear layer taking input_dims and outputting fc1_dims. We also need a number for initializing the weights and biases of that layer, which we call f1: one over the square root of self.fc1.weight.data.size()[0] — size() returns a tuple, so we take the zeroth element — and we initialize the layer with T.nn.init.uniform_, passing the tensor to initialize, self.fc1.weight.data, and the range -f1 to f1; the biases get the same treatment with the bias data. This gives small numbers, of order 0.1 or so — not exactly 0.1, but that order. In the future video where we go over the paper I'll explain all of this; for now just know it constrains the initial weights of the network to a narrow region of parameter space to help you get better convergence. (Make sure to subscribe so you don't miss that video, because it's going to be lit.) Then bn1 is an nn.LayerNorm over fc1_dims; this normalization layer helps with convergence of your model — you don't get good convergence without it, so leave it in. fc2 is the second layer, another linear taking fc1_dims as input and outputting fc2_dims — good grief — and we do the same thing with the initialization: f2 is one over the square root of self.fc2.weight.data.size()[0], and nn.init.uniform_ is applied to fc2's weight and bias data from -f2 to f2; let's make sure that's right, because it's important — the syntax is that the first parameter is the tensor you want to initialize, then the lower and upper boundaries. Next bn2, our second normalization layer, over fc2_dims — and just a note, the fact that we have normalization layers means we have to use the eval and train functions later, which is kind of a nuisance; it took me a while to figure that out. The critic network also gets an action_value layer, because the action-value function takes the states and actions as input, but we're going to add the action in near the very end of the network: a linear layer from n_actions to fc2_dims. And the output layer, self.q, gets the ±0.003 initialization from the paper.
Since Q is a scalar value, self.q has just one output unit, and we initialize its weight data uniformly from -f3 to f3, and likewise the bias data. Now we have our optimizer, which is Adam, and what it optimizes is self.parameters() with a learning rate of beta — notice that we never defined parameters() ourselves; we're just calling it, and it comes from the inheritance from nn.Module, which is exactly why we derive from it: to get access to the network's parameters. Next, you certainly want to run this on a GPU, because it's an incredibly expensive algorithm, so we set self.device = T.device('cuda:0' if T.cuda.is_available() else 'cuda:1') — I have two GPUs; if you only have a single GPU the else branch should be 'cpu', though I don't recommend running this on a CPU — and then send the whole network to the device with self.to(self.device). We're almost done with the critic class; the next thing to worry about is the forward function, which takes a state and an action as input. Keep in mind the actions are continuous, so it's a vector — length two for the continuous lunar lander environment, two real numbers in a list or numpy array. So state_value = self.fc1(state), then pass it through bn1, and finally activate it with a relu. Now, it's an open debate whether you want to do the relu before or after the normalization. In my mind it makes more sense to normalize first: when you're calculating batch statistics, applying the relu first lops off everything below zero, so your statistics get skewed toward the positive end, when perhaps the real distribution has a mean of zero or even a negative mean, which you'd never see if you activated first. Just something to keep in mind — you can play around with it, feel free to clone this and see what you get — but I tried both ways and this seemed to work best. Then we feed it into the second fully connected layer and bn2, but we don't activate it yet; what we do first is take the action into account. We pass the action through the action_value layer and perform a relu on it right away — we're not calculating batch statistics on it, so no normalization needed — and then add the two pathways together and relu the sum: state_action_value = F.relu(T.add(state_value, action_value)). The thing that's a little wonky here, and I invite you to clone this and play with it yourself, is that I'm double relu-ing the action-value quantity: a relu on the action pathway and then another relu on the add. Relu is a non-commutative function with respect to addition, so the order matters: relu(-10) + relu(5) is 0 + 5 = 5, whereas relu(-10 + 5) is relu(-5), which is 0. This is the way I found to work; I've seen other implementations do it differently, so feel free to clone this and do your own thing with it — I welcome any additions, improvements, or comments. Then we get the actual state-action value by passing that summed quantity through the final layer of the network, self.q, and returning it. A little bookkeeping to finish: a save_checkpoint function that prints a message and calls T.save(self.state_dict(), self.checkpoint_file) — the state dict is a dictionary whose keys are the names of the parameters and whose values are the parameters themselves, saved to the checkpoint file — and a load_checkpoint function (good grief) that does the same thing in reverse, printing "loading checkpoint" and calling self.load_state_dict(T.load(self.checkpoint_file)). That's it for our critic network.
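Pulling those pieces together, here is a sketch of the critic class as described above. The structure follows the video — layer norm standing in for batch normalization, the action joined in before the output, narrow uniform initialization — but the exact argument names and defaults are illustrative.

```python
import os
import numpy as np
import torch as T
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class CriticNetwork(nn.Module):
    """State pathway with layer norm; the action joins before the Q output."""
    def __init__(self, beta, input_dims, fc1_dims, fc2_dims, n_actions,
                 name, chkpt_dir='tmp/ddpg'):
        super(CriticNetwork, self).__init__()
        self.checkpoint_file = os.path.join(chkpt_dir, name + '_ddpg')

        self.fc1 = nn.Linear(*input_dims, fc1_dims)
        f1 = 1. / np.sqrt(self.fc1.weight.data.size()[0])
        nn.init.uniform_(self.fc1.weight.data, -f1, f1)
        nn.init.uniform_(self.fc1.bias.data, -f1, f1)
        self.bn1 = nn.LayerNorm(fc1_dims)

        self.fc2 = nn.Linear(fc1_dims, fc2_dims)
        f2 = 1. / np.sqrt(self.fc2.weight.data.size()[0])
        nn.init.uniform_(self.fc2.weight.data, -f2, f2)
        nn.init.uniform_(self.fc2.bias.data, -f2, f2)
        self.bn2 = nn.LayerNorm(fc2_dims)

        self.action_value = nn.Linear(n_actions, fc2_dims)

        self.q = nn.Linear(fc2_dims, 1)
        f3 = 0.003                                 # narrow init for the output, per the paper
        nn.init.uniform_(self.q.weight.data, -f3, f3)
        nn.init.uniform_(self.q.bias.data, -f3, f3)

        self.optimizer = optim.Adam(self.parameters(), lr=beta)
        # single-GPU variant; the video uses a second GPU as the fallback
        self.device = T.device('cuda:0' if T.cuda.is_available() else 'cpu')
        self.to(self.device)

    def forward(self, state, action):
        state_value = F.relu(self.bn1(self.fc1(state)))   # normalize first, then activate
        state_value = self.bn2(self.fc2(state_value))     # hold the activation here
        action_value = F.relu(self.action_value(action))
        state_action_value = F.relu(T.add(state_value, action_value))
        return self.q(state_action_value)

    def save_checkpoint(self):
        T.save(self.state_dict(), self.checkpoint_file)

    def load_checkpoint(self):
        self.load_state_dict(T.load(self.checkpoint_file))
```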
Now we move on to the actor network, which of course also derives from nn.Module. Its init function takes alpha (if I can spell it correctly), input_dims, fc1_dims, fc2_dims, the number of actions, a name, and a checkpoint directory — it's pretty similar to the critic, it just has a different structure; in particular we don't have the action pathway (I can't type and talk at the same time, but it's similar nonetheless). We save input_dims, n_actions, fc1_dims, and fc2_dims — same deal — and copy over the checkpoint-file line to make life easy; I like making life easy. The first fully connected layer is nn.Linear(*input_dims, fc1_dims), and of course it operates in the same way I discussed in the replay buffer, where the star unpacks the tuple. The initialization is very similar: f1 = 1 / np.sqrt(self.fc1.weight.data.size()[0]), and we initialize the first layer uniformly within that interval with nn.init.uniform_ on the weight data from -f1 to f1, and the same for fc1's bias data. fc2 is another linear layer taking fc1_dims as input and outputting fc2_dims, as you might expect, and its initialization is basically the same thing for layer two: copy and paste it, carefully swapping in fc2 and f2 — whoo, good grief — with the range ±f2, and that's all well and good. The other thing we almost forgot is the normalization: bn1 is an nn.LayerNorm taking fc1_dims as input, and likewise bn2 is another layer norm over fc2_dims as its input shape; those don't get any special initialization. But we do have f3, which is 0.003 — this comes from the paper, and we'll go over it in a future video, so don't worry about it for now. Then self.mu: mu is the representation of the policy, in this case a real-valued vector of shape n_actions — the actual action, not a probability, because this policy is deterministic. It's just a linear layer taking fc2_dims as input and outputting the number of actions, and we initialize its weights and biases the same way, except with mu instead of fc2 and f3 instead of f2, as you might expect. Am I forgetting anything? I don't believe so. So finally we have an
So finally we have an optimizer, and that's again optim.Adam(self.parameters(), lr=alpha). We also do the same thing with the device: T.device('cuda:0' if T.cuda.is_available() else 'cpu'), and finally we send the network to the device. That is that.

Next we have the feed forward, and that takes the state as input. I'm just going to call the intermediate variable x; this is bad naming, don't ever do this, do as I say not as I do. So x = self.fc1(state), then x = self.bn1(x), and activate it with F.relu(x); then x = self.fc2(x), x = self.bn2(x), activate again, and then x = T.tanh(self.mu(x)), and return x. What this will do is take the current state, or whatever batch of states you want to look at, perform the first feed-forward pass, batch norm and ReLU it, send it through the second layer with batch norm and ReLU, and then pass it through the final layer mu and perform a hyperbolic tangent activation. What that does is bound the output between minus 1 and plus 1, and that's important for many environments. Later on we can multiply it by the actual action bounds: some environments have a max action of plus or minus 2, so if you're bounding it by plus or minus 1 that's not going to be very effective; you just add a multiplicative factor later on. Then I'm going to copy the save and load checkpoint functions, because those are precisely the same. That's it for our actor.

Next we come to our final class, the meat of the problem: the agent. That just derives from the base object, and it gets a whole bunch of parameters: alpha and beta, of course, because you need to pass in learning rates for the actor and critic networks; input_dims; a quantity called tau, which I haven't introduced yet but we'll get to in a few minutes; the environment, which we pass in to get the action space I talked about a second ago; and gamma, the agent's discount factor. If you're not familiar with reinforcement learning, an agent values a reward now more than it values a reward in the future, because there's uncertainty around future rewards, so it makes no sense to value them as much as a current reward. The discount factor says how much less it values a future reward, in this case 1 percent per step, which is where we get a gamma of 0.99. It's a hyperparameter you can play around with; values from 0.95 all the way up to 0.99 are typical. The number of actions defaults to two, since a lot of environments only have two actions; the max size of our memory defaults to 1,000,000; layer1_size defaults to 400 and layer2_size to 300, which again comes from the paper; and a batch size of 64 for batch learning from our replay memory. So go ahead and save tau, instantiate a memory, which is a ReplayBuffer of max_size, input_dims and n_actions, and store the batch size for the learning function. Then we instantiate our first actor (yes, there is more than one), and that gets alpha, input_dims, layer1_size, layer2_size, n_actions=n_actions and name='Actor'; let's copy that. Next we have our target actor: much like the deep Q network algorithm, this uses target networks as well as the base networks, so it's an off-policy method. The difference is that this one is called 'TargetActor', but it is otherwise identical.
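As a reference, here is a minimal sketch of that forward pass, written as a method of the ActorNetwork sketched above; the scaling by the environment's action bound is shown only as a comment, since in this walkthrough it is deferred until later.

```python
import torch as T
import torch.nn.functional as F

# method of the ActorNetwork class sketched above
def forward(self, state):
    # two hidden blocks: linear -> layer norm -> ReLU
    x = F.relu(self.bn1(self.fc1(state)))
    x = F.relu(self.bn2(self.fc2(x)))
    # tanh bounds every action component to (-1, +1)
    x = T.tanh(self.mu(x))
    # for environments with larger bounds you would scale here, e.g.
    # x = x * T.tensor(env.action_space.high, dtype=T.float).to(self.device)
    return x
```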
This will allow us to save multiple networks with similar but distinct names, and you'll see how that plays into it momentarily. We also need a critic, which is a CriticNetwork that takes beta, input_dims, layer1_size, layer2_size, n_actions=n_actions and name='Critic'; let's be nice and tidy there. We have a target critic as well, which is otherwise identical, it just gets a different name. This is very similar to Q-learning, where you have Q_eval and Q_next, or Q_target, whatever you want to call it; same concept. Okay, so those are all of our networks. What else do we need? We need noise, and that's our OUActionNoise, where mu is just np.zeros(n_actions), so it gives you an array of zeros: that is the mean the noise process reverts to over time. And we need another function you may be able to predict if you've seen my videos on Q-learning: update_network_parameters, and we call it here with an initial value of tau=1. What this does is solve the problem of a moving target. In Q-learning, if you use one network to calculate both the action as well as the value of that action, then you're really chasing a moving target, because you're updating that estimate every turn; you end up using the same parameters for both, and it can lead to divergence. The solution is to use a target network that learns the values of the state and action combinations, while the other network learns the policy, and then periodically you overwrite the target network's parameters with the evaluation network's parameters. This function will do precisely that, except that we have four networks instead of two.

Next we want to choose an action, and that takes whatever the current observation of the environment is. Now, very, very important: you have to put the actor into evaluation mode. This doesn't perform an evaluation step; it just tells PyTorch that you don't want to calculate statistics for the batch normalization. This is very critical: if you don't do this, the agent will not learn, and it doesn't do what you might think the name implies. The corresponding, complementary function is train(), which doesn't perform a training step either; it puts the network in training mode, where it does store those statistics in the graph for the batch normalization. If you don't use batch norm then you don't need to do this; and what's the other function with the same quirk? Dropout. Dropout has the same tic, where you have to call the eval and train functions. So let's start by putting our observation into a tensor: T.tensor(observation, dtype=T.float).to(self.actor.device), which turns it into a CUDA float tensor. Then you get the actual action from the actor network, so feed that forward: self.actor.forward(observation).to(self.actor.device), which makes sure it's on the device as a CUDA tensor. Then mu_prime is mu plus T.tensor(self.noise(), dtype=T.float), which gives us our exploration noise, and we send that to the actor's device as well. Then you want self.actor.train() (it should be actor.train, shouldn't it? yes), and finally return mu_prime.cpu().detach().numpy(). This is an idiom within PyTorch where you basically have to do this, otherwise it doesn't give you the actual numbers; it would try to hand back a tensor, which doesn't work, because you can't pass a tensor into the OpenAI gym. Kind of a funny little quirk, but it is necessary.
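A minimal sketch of choose_action under those assumptions (the agent holding the actor as self.actor and the Ornstein-Uhlenbeck noise as self.noise); it is only meant to illustrate the eval/train bookkeeping around the forward pass.

```python
import torch as T

# method of the Agent class described above
def choose_action(self, observation):
    # eval() only switches off batch-norm statistic tracking;
    # it does not perform any evaluation step
    self.actor.eval()
    observation = T.tensor(observation, dtype=T.float).to(self.actor.device)
    mu = self.actor.forward(observation).to(self.actor.device)
    # add Ornstein-Uhlenbeck exploration noise to the deterministic action
    mu_prime = mu + T.tensor(self.noise(), dtype=T.float).to(self.actor.device)
    # back into training mode so later learning steps update statistics
    self.actor.train()
    # detach from the graph and return a plain numpy array for the gym
    return mu_prime.cpu().detach().numpy()
```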
Now we need a function to store state transitions, and this is just an interface for our replay memory class: self.memory.store_transition(state, action, reward, new_state, done). Simple. And now we come to the meat of the problem, where we actually do the learning. You don't want to learn if you haven't filled up at least a batch size worth of your memory buffer, so if self.memory.mem_cntr is less than self.batch_size you just return. Otherwise, you sample your memory: state, action, reward, new_state, done = self.memory.sample_buffer(self.batch_size). Then you want to turn all of those into tensors, because they come back as numpy arrays. In this case we put them on the critic's device; as long as they're all on the same device it doesn't actually matter, I do this for consistency because these values will be used in the critic network. So done = T.tensor(done).to(self.critic.device), new_state = T.tensor(new_state, dtype=T.float).to(self.critic.device), and you need the actions and the states as tensors on the same device as well. Now we come to another quirk of PyTorch, where we send everything to eval mode for the target networks; it may not be that important, I did it for consistency. We want to calculate the target actions, much like you do with the Bellman equation in deep Q-learning: self.target_actor.forward(new_state). Then you want critic_value_, the value of the new states: self.target_critic.forward(new_state, target_actions). So what we're doing is getting the target actions from the target actor network, in other words what actions it should take based on the target actor's estimates, and plugging that into the state-action value function of the target critic network. You also want critic_value, which is self.critic.forward(state, action): in other words, your estimate of the values of the states and actions we actually encountered in our sample from the replay buffer. Now we have to calculate the targets that we're going to move towards: for j in range(self.batch_size). I use a loop instead of a vectorized implementation because the vectorized implementation is a little bit tricky; if you don't do it properly you can end up with something of shape batch_size by batch_size, which won't flag an error but definitely gives you the wrong answer, and you don't get learning. So target.append(reward[j] + self.gamma * critic_value_[j] * done[j]). This is where the done flags I talked about come in: if the episode is over, the value of the resulting state gets multiplied by zero, so you don't take it into account; you only take into account the reward from the current state, precisely as one would want. Now let's go ahead and turn that target into a tensor on the critic's device and reshape it: target = target.view(self.batch_size, 1). Now we can come to the calculation of the loss functions. We want to set the critic back into training mode, because we've already performed the evaluation and now we want to actually calculate the statistics for the batch normalization: self.critic.train(), then self.critic.optimizer.zero_grad(). In PyTorch, whenever you calculate a loss function you want to zero your gradients, so that gradients from previous steps don't accumulate and interfere with the calculation; it can also slow things down, and you don't want that.
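Here is a minimal sketch of that first half of learn(). It assumes the replay buffer returns numpy arrays and stores the terminal flag as a float that is 0 for terminal states, so multiplying by it zeroes out the bootstrap term, as described above; it also stacks and detaches the targets rather than rebuilding a tensor element by element, a small deviation that keeps the target cleanly out of the autograd graph.

```python
import torch as T

# method of the Agent class described above
def learn(self):
    # wait until we can sample a full batch
    if self.memory.mem_cntr < self.batch_size:
        return
    state, action, reward, new_state, done = \
        self.memory.sample_buffer(self.batch_size)

    # the buffer hands back numpy arrays; put everything on one device
    state = T.tensor(state, dtype=T.float).to(self.critic.device)
    action = T.tensor(action, dtype=T.float).to(self.critic.device)
    reward = T.tensor(reward, dtype=T.float).to(self.critic.device)
    new_state = T.tensor(new_state, dtype=T.float).to(self.critic.device)
    done = T.tensor(done, dtype=T.float).to(self.critic.device)

    self.target_actor.eval()
    self.target_critic.eval()
    self.critic.eval()

    # bootstrapped values of the sampled next states under the target networks
    target_actions = self.target_actor.forward(new_state)
    critic_value_ = self.target_critic.forward(new_state, target_actions)
    # values of the state-action pairs actually taken
    critic_value = self.critic.forward(state, action)

    # loop rather than vectorize: a shape mistake here fails silently
    target = []
    for j in range(self.batch_size):
        target.append(reward[j] + self.gamma * critic_value_[j] * done[j])
    target = T.stack(target).view(self.batch_size, 1).detach()
    # ... the critic and actor updates follow, as described below
```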
The critic loss is just F.mse_loss, the mean squared error between the target and the critic value. Then you back-propagate it with critic_loss.backward() and step your optimizer. Boom, that's it for the critic. Now we set the critic into evaluation mode for the calculation of the loss for our actor network, and call self.actor.optimizer.zero_grad(). I apologize that this is confusing; it was confusing to me, and it took me a while to figure it out. This is one of the ways in which TensorFlow is superior to PyTorch, because you don't have this quirk; I tend to like TensorFlow a little bit better, but you know, whatever, we'll just figure it out and get it going. So mu equals the forward propagation of the state, we put the actor into training mode, and we calculate the actor loss, which is just minus self.critic.forward(state, mu); then actor_loss = T.mean(actor_loss). Again, stay tuned for the derivation from the paper, it is all outlined there, otherwise this seems mysterious, but this video is already 45 minutes long, so that will have to wait for a future video. Then actor_loss.backward() and self.actor.optimizer.step(), and we're done learning. After you finish learning you want to update the network parameters for your target actor and target critic networks, so self.update_network_parameters(). Whew, man, okay, we're almost there, I promise.

Let's go ahead and write that: def update_network_parameters(self, tau=None) by default. Tau is a parameter that allows the update of the target networks to gradually approach the evaluation networks, and this is important for a nice slow convergence; you don't want to take too large a step between updates, so tau is a small number, much, much less than one. If tau is None, then you set tau = self.tau. This may seem mysterious; the reason I'm doing it is that at the very beginning, when we call the constructor, we call update_network_parameters with tau=1. That's because at the very start we want all the networks to begin with the same weights, so we call it with a tau of 1, and in that case tau is not None, so tau is just 1; you'll see the update rule here in a second. Then comes more hocus-pocus with PyTorch: self.actor.named_parameters() will get all the names of the parameters from that network, and we do the same thing for the critic, target actor and target critic parameters. Now that we have the parameters, let's turn them into dictionaries, which makes iterating over them much easier, because what comes back is actually a generator (don't quote me on that): critic_state_dict = dict(critic_params), actor_state_dict = dict(actor_params), target_critic_dict = dict(target_critic_params) and target_actor_dict = dict(target_actor_params). Boom, okay, almost there. Now we iterate over these dictionaries and copy parameters: for name in critic_state_dict, critic_state_dict[name] = tau * critic_state_dict[name].clone() + (1 - tau) * target_critic_dict[name].clone(), and then self.target_critic.load_state_dict(critic_state_dict). What this does is iterate over the dictionary, look at each key, and blend the values from this particular network with the target's values. You can see that when tau is 1 you get 1 minus 1 is 0, so it's just 1 times the critic's parameter, i.e. the identity, and then it loads the target critic with those parameters. So at the very beginning it loads the target critic with the parameters from the initial critic network.
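That blending rule, as a minimal sketch showing the critic half (the actor half described next mirrors it), assuming the four networks and tau are attributes of the agent:

```python
# method of the Agent class described above
def update_network_parameters(self, tau=None):
    # tau=1 is used once at construction time to hard-copy the weights
    if tau is None:
        tau = self.tau

    critic_state_dict = dict(self.critic.named_parameters())
    target_critic_dict = dict(self.target_critic.named_parameters())

    # new_target = tau * online + (1 - tau) * old_target
    for name in critic_state_dict:
        critic_state_dict[name] = tau * critic_state_dict[name].clone() + \
            (1 - tau) * target_critic_dict[name].clone()
    self.target_critic.load_state_dict(critic_state_dict)
    # ... and the same loop again for the actor / target actor pair
```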
And likewise for the actor network: let's go ahead and copy this, change critic to actor and target_critic to target_actor, and then we're done with that function. I believe that is it; yes, indeed it is, now it's right. We have only one other thing to take care of before we get to the main program: two bookkeeping functions to save the models. So def save_models, and you definitely want this because this thing takes forever to train; it calls save_checkpoint on the actor, the critic, the target actor and the target critic, and load_models does the inverse operation; just copy all of that and change save to load, keep things simple. And again, since this takes so long to train, I'm going to upload my saved model parameters to the GitHub for you. But this is it: around 275 lines, so this is probably the longest project we have worked on here at Machine Learning with Phil. If you made it this far, congratulations; we're already 50 minutes in and we're almost done, I promise.

So let's come over to our main file. We want to import our agent, from ddpg_torch import Agent; we want to import gym; we want numpy; and we want my super-duper awesome plot_learning function, and that is it. Then env = gym.make('LunarLanderContinuous-v2'), and agent = Agent with alpha=0.000025 (that's 2.5 by 10 to the minus 5), beta=0.00025 (2.5 by 10 to the minus 4), input_dims=[8], tau=0.001 and env=env. That reminds me, I didn't multiply by the action space high in the choose_action function; don't worry, that will be in the eventual implementation, or I can leave it as an exercise for the reader. It doesn't matter for this environment; when we get to other ones where it does matter, I'll be a little bit more diligent about that. The batch size is 64, layer1_size is 400, layer2_size is 300 and n_actions is 2. Now, another interesting tidbit is that we have to set the random seed. This is not something I've done before, but this is a highly sensitive learning method; if you read the original paper, they do averages over five runs, and that's because every run is a little bit different. I suspect that's also why they had to initialize the weights and biases within such a narrow range: you don't want to stray out toward plus and minus one when you can constrain it to something much narrower, because it gives you a little more repeatability. So we have to set the numpy random seed to some value instead of None; in this case I've used 0, but I've seen other values used, so please clone this and see what happens when you put in other seed values. Next we need a score_history to keep track of the scores over time, and we iterate over a thousand games: done = False, score = 0, observation = env.reset(), so we get a new observation. Then, while not done: act = agent.choose_action(obs); new_state, reward, done, info = env.step(act); agent.remember(obs, act, reward, new_state, int(done)), so we keep track of that transition; and agent.learn(). We learn on every step because this is a temporal-difference learning method, as opposed to a Monte Carlo type method where we would learn at the end of every episode. Keep track of the score and set your old state to the new state. Then at the end of every episode I want to append the score to score_history and print a progress marker.
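Putting it together, here is a minimal sketch of the main script as described; the module names (ddpg_torch for the agent, utils for the plot_learning helper) and the exact hyperparameters follow this walkthrough and are assumptions about how your own files are laid out.

```python
import gym
import numpy as np
from ddpg_torch import Agent          # the Agent class built above
from utils import plot_learning       # assumed location of the plotting helper

env = gym.make('LunarLanderContinuous-v2')
agent = Agent(alpha=0.000025, beta=0.00025, input_dims=[8], tau=0.001,
              env=env, batch_size=64, layer1_size=400, layer2_size=300,
              n_actions=2)

np.random.seed(0)   # runs are sensitive to the seed; try other values

score_history = []
for i in range(1000):
    done = False
    score = 0
    obs = env.reset()
    while not done:
        act = agent.choose_action(obs)
        new_state, reward, done, info = env.step(act)
        agent.remember(obs, act, reward, new_state, int(done))
        agent.learn()              # temporal-difference method: learn every step
        score += reward
        obs = new_state
    score_history.append(score)
    print('episode', i, 'score %.2f' % score,
          '100 game average %.2f' % np.mean(score_history[-100:]))
    if i % 25 == 0:
        agent.save_models()

filename = 'lunar_lander.png'
plot_learning(score_history, filename, window=100)
```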
We print the episode number, the score to two decimal places, and the trailing 100-game average, also to two decimal places. What this does is take the last 100 games and compute the mean, so you can get an idea of how fast it's learning; remember, for the lunar lander environment "solved" means it has achieved an average score of 200 over the last 100 games. Every 25 games we want to save the models with agent.save_models(), and at the end, filename = 'lunar_lander.png' (that's not inside the loop): after all the games you call plot_learning with the score history, the filename and a window of 100 games. Wow, so an hour in we finally finished this; now we get to go to the terminal and see how many typos I made. I'm sure there are probably 50, so let's get to it.

All right, so here we are, let's see what we get; run the torch lunar lander script, fingers crossed. Okay, that's a silly one: on line 30 we forgot an equals sign, so let's go back there and fix that. It says line 30, yes, right here. Did we do that anywhere else? Not that I can see, but that's no guarantee. Back to the terminal, let's try it again. Huh, line 119, okay, typical; that's in the agent class, and I didn't think I'd done it there, but there it is. Back to the terminal; so I started it and it ran, but then: "built-in function or method has no attribute numpy". That's an interesting bug; it's on line 192, in our choose_action function, on mu_prime. Oh, that's why: detach is a function, not an attribute, so it needs parentheses; let's fix that and head back to the terminal. "rewards is not defined", that's on line 55; ah, it's just called reward, there we go, I had an extra s. Back to the terminal. Ah, perfect, that one's easy to fix: the tmp/ddpg directory, because I didn't make the directory first; easy. Perfect, now it's running.

I'm not going to let this run all 1000 games, because it takes about a couple of hours or so. Instead, let's take a look here: I was running this earlier while recording the agent's gameplay for this video, and you can see that in under 650 games it went ahead and solved the environment. When you print out the trailing average for the last hundred games, we get a reward of well over 200. Keep in mind one interesting thing: this is still actually in training mode, not full evaluation mode, because we still have some noise. If you wanted to do a pure evaluation of the agent, you would set the noise to zero; we'll do that in a set of future videos, as there's a whole bunch of stuff I can do on this topic. But just keep in mind that this is an agent that is still taking some random actions; the noise is nonzero, so it is still taking suboptimal actions, and yet it's getting a score of 260 and still beating the environment even though that noise is present. You can see the noise at work in, say, episode 626, where it only gets a score of 26, and in episode 624, where it gets about eight and a half points. So that is pretty cool stuff. This is a very powerful algorithm, and keep in mind this was a continuous action space, totally intractable for Q-learning; that is simply not possible, it's an infinitude of actions, so you need something like DDPG to handle it, and it handles it quite well. In future videos we're going to get to the bipedal walker, we're going to get to learning from pixels, where we do the continuous car racing environment, and we'll probably get
into other stuff from the RoboSchool environments in the OpenAI Gym. So make sure to subscribe so that you can see that in future videos. Go ahead and check out the GitHub for this project so you can grab the weights and play around with this without having to spend a couple of hours training it on your GPU. Make sure to leave a like and share this if you found it helpful; that is incredibly helpful to me. And leave a comment down below; I answer all of my comments. I look forward to seeing you all in the next video.
Info
Channel: freeCodeCamp.org
Views: 53,807
Keywords: deep deterministic policy gradients, ddpg tutorial, ddpg algorithm, deep deterministic policy gradients algorithm, ddpg network, continuous actor critic, ddpg openai gym, ddpg tensorflow, ddpg example, ddpg code, implement deep learning papers, how to read deep learning papers, continuous action reinforcement learning, ddpg implementation, ddpg paper, lunar lander, python, pytorch
Id: GJJc1t0rtSU
Length: 177min 10sec (10630 seconds)
Published: Tue Jul 16 2019