An Introduction to Actor-Critic Deep RL Algorithms

Video Statistics and Information

Captions
Well, good evening, everyone. Tonight is the eleventh webinar in our series of talks on deep reinforcement learning, and today's topic is an important one: an introduction to the actor-critic set of algorithms. It's a really important topic because today's deep reinforcement learning often depends on this class of algorithms; most modern techniques rely on it, so it's an important one to understand, and I'll go through a brief introduction to the topic today. So with that, let's pull up the slides... it looks like I have two slide decks up, just a second... okay, here we go.

As I said, tonight's topic is an introduction to actor-critic methods in deep reinforcement learning. The objectives this evening are, first, to recognize some problems with the other two main methods we've talked about in the past, the value-based methods and the policy-based methods, and then to understand how these two independent methods can be combined into a single algorithm and thereby reduce variance and improve deep reinforcement learning.

We start with a review of the basic value-based and policy-based methods. We've been talking about these different methods for a while. We started with value-based methods; if you recall, that was the Deep Q-Network (DQN). If you've done the first project, the navigation project, you probably relied on this method to implement it. The idea is that we train a neural network to learn a value function, and then we use that value function to map each state-action pair to a value. In whatever state we're in, we look at all the available actions and the values estimated for each of them, and choose the action with the largest value. Fairly straightforward. This technique works really well when you have a finite action space, but if your action space is continuous it can be problematic, so we'll look at a couple of other methods that can work in continuous action spaces, and we'll find today that the actor-critic methods we discuss also work well in continuous action spaces. As I mentioned, Deep Q-Networks and Q-learning are good examples of value-based methods.

The next major type of algorithm we talked about recently is policy-based methods. The difference from a value-function approach is that we directly estimate or optimize the policy without using a value function: the neural network doing the function approximation tries to find an optimal policy for selecting actions without first estimating the values of those actions. We saw that this approach is good for a continuous action space or a stochastic action space, and there's a large class of problems that require that type of approach, so policy-based methods can work well in those domains. But we also saw a real problem with these methods: finding a good function to compute how good the policy is. In the initial techniques we used the total reward of the episode, which is a Monte Carlo technique, and the problem is that it requires your agent to traverse an entire episode from beginning to end before any learning can take place. Because you learn only episode to episode, actions taken at different points in the episode can produce drastic changes in the final result, and that results in high variance and, often, slow learning.
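As a quick illustration of the two ideas just reviewed, here is a minimal Python/PyTorch sketch. It is my own illustration, not code from the lecture: the names QNetwork, greedy_action, and discounted_return, the network sizes, and the discount factor are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Value-based idea: map a state to one value per discrete action (DQN-style)."""
    def __init__(self, state_size, action_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size),
        )

    def forward(self, state):
        return self.net(state)  # shape: (..., action_size)

def greedy_action(q_net, state):
    """Pick the action with the largest estimated value.
    This argmax only makes sense for a finite action space, which is why
    plain DQN struggles when the actions are continuous."""
    with torch.no_grad():
        return int(q_net(state).argmax(dim=-1).item())

def discounted_return(rewards, gamma=0.99):
    """Policy-based (REINFORCE-style) Monte Carlo return: it needs the rewards
    of the *whole* episode before a single learning step can happen, which is
    the source of the high variance discussed above."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```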
An example of this kind of Monte Carlo policy-based algorithm is REINFORCE, which we talked about two or three weeks ago.

With that review of the two major classes of algorithms, let's talk now about actor-critic methods. The real idea is actually fairly simple: in actor-critic methods we combine both of the approaches we just talked about. We take a value-based method and a policy-based method and put them into a single algorithm, using two separate neural networks. One neural network is the actor; it picks the actions and controls how the agent behaves, and this is the policy-based part. Then we implement a second neural network in the agent, the critic, whose purpose is to measure how good the action the actor recommended actually is; this is the value-based part. We take these two, put them together in one algorithm, and we can overcome some of the problems we saw with each of them individually.

The main advantage of this approach is that, whereas in the policy-based methods we had a Monte Carlo technique and couldn't do any learning until we completed an entire episode, with an actor-critic approach we can make updates at each time step: learning takes place every time we take a step in the environment. This is a temporal-difference learning approach, and we've seen how that can be advantageous because it reduces variance in learning. To make this work, we train the critic model to approximate the value function; this value function replaces the reward signal we saw in a policy-gradient method like REINFORCE, which only calculated the return at the end of the episode. Here we can take those rewards into account and take a learning step with every step we take in the environment. That's really the kernel of the entire method.

So, in summary, we implement two different neural networks. In reality, I should mention that you'll often have more than two neural networks, because you'll see in the standard algorithms, starting with the deep deterministic policy gradient (DDPG) algorithm that we use in the last two projects, that each of these main networks has a sister (target) network that goes along with it. This is a soft-update process: it lets you take actions from a policy that doesn't change very often and then slowly update that policy based on the learning that takes place in the more rapidly changing primary network. In any case, there are two main neural networks. The policy network, known as the actor, controls how the agent acts, i.e., the actions; you can see it's a function of the state you're in, the action you take, and the set of theta values that are the weights of the neural network, written pi(a|s; theta). Then there's the second network, the critic. This is a value-function approximator, and it measures how good the actions recommended by the policy network were. The critic is sometimes a Q function and sometimes an advantage function; Q-hat is a generic value function here, but in this case it's based on a state, an action, and a set of weights, where the weights w (or omega) are the weights of that critic network, written q_hat(s, a; w).
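Here is a minimal PyTorch sketch of the two-network setup just described, assuming a continuous action space. The names Actor, Critic, td_update, and soft_update are illustrative, and a full DDPG agent would also add a replay buffer and use the target (sister) copies inside the TD target; this is only a sketch of the core idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Policy network pi(a|s; theta): maps a state to a continuous action in [-1, 1]."""
    def __init__(self, state_size, action_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network q_hat(s, a; w): scores the action the actor recommended."""
    def __init__(self, state_size, action_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size + action_size, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def td_update(actor, critic, actor_opt, critic_opt,
              state, action, reward, next_state, done, gamma=0.99):
    """One temporal-difference learning step per environment step.
    All arguments are batched tensors; reward and done have shape (batch, 1)."""
    # Critic: regress q_hat(s, a) toward the one-step TD target.
    with torch.no_grad():
        next_action = actor(next_state)
        td_target = reward + gamma * (1.0 - done) * critic(next_state, next_action)
    critic_loss = F.mse_loss(critic(state, action), td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: adjust theta so the critic scores the actor's chosen actions higher.
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

def soft_update(target_net, local_net, tau=0.001):
    """The sister-network soft update mentioned above: slowly blend the
    fast-changing local weights into a slowly-changing target copy."""
    for t_param, l_param in zip(target_net.parameters(), local_net.parameters()):
        t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
```

In practice soft_update would be called for both the actor and the critic after each learning step, with a small tau so the target copies drift slowly, which is what keeps the behavior policy from changing too abruptly.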
So in summary, today we talked about two of the earlier methods we've discussed in the course: value-based methods, which map each state-action pair to a value; we saw how those work well with finite action spaces but are problematic for environments with continuous action spaces. Then we reviewed policy-based methods, which directly optimize a policy without relying on a value function. Those methods can be used in continuous or stochastic action spaces, but their problem was that the Monte Carlo learning updates require you to complete an episode before you get a step of learning, and that results in high variance and slow learning.

In actor-critic methods we combine those two stand-alone approaches, the value-based and the policy-based methods, into a single algorithm. The actor part of the algorithm controls how the agent behaves; that's the policy-based part, and it picks an action based on the state you're in. Then there's a critic that measures how good that action is, and that's the value-based part. The version of the critic used in these algorithms enables temporal-difference updates, which gets past the high variance we saw with the Monte Carlo updates in the standard policy-based approaches, and with that we can reduce variance and speed up learning.

I mentioned at the beginning of the talk that these methods are important: the actor-critic methods we've seen here are used in every modern deep reinforcement learning network that I know of, and there are lots of variations. You'll most likely use deep deterministic policy gradient (DDPG) in the first of those projects; you're welcome to use any algorithm you like, but most students use DDPG. It's an actor-critic method, and you'll find it works very well in a continuous action space. A2C, advantage actor-critic, is similar; it's another version of actor-critic, but rather than use a Q function it uses the advantage function we talked about last week (see the short sketch after the end of these captions).

With that, I hope this gives you a feeling for what actor-critic networks are and the basics of how they work, and that's it for this evening. As always, for those of you in the deep reinforcement learning program, there's an Ask Me Anything session immediately following tonight's lecture; I'll be online and happy to answer any questions you might have. So with that, I'll see you next week. Bye.
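To make the A2C remark concrete, here is a minimal sketch of how a one-step TD error can serve as the advantage estimate that replaces the Q value in the actor's update. The function a2c_losses and its arguments are my own illustration (not course code), and it assumes a discrete action space and batched PyTorch tensors.

```python
import torch
import torch.nn.functional as F

def a2c_losses(policy_logits, values, next_values, actions, rewards, dones, gamma=0.99):
    """Advantage actor-critic sketch: the critic estimates V(s; w), and the
    one-step TD error r + gamma * V(s') - V(s) is used as the advantage.

    policy_logits: (batch, num_actions) output of the actor network
    values, next_values: (batch,) critic estimates V(s) and V(s')
    actions, rewards, dones: (batch,) tensors from sampled transitions
    """
    # Advantage estimate A(s, a) ~= r + gamma * V(s') - V(s)  (the TD error).
    td_target = rewards + gamma * (1.0 - dones) * next_values.detach()
    advantage = td_target - values

    # Critic learns V(s) by regressing on the TD target.
    critic_loss = F.mse_loss(values, td_target)

    # Actor raises the log-probability of actions with positive advantage.
    log_probs = F.log_softmax(policy_logits, dim=-1)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(chosen_log_probs * advantage.detach()).mean()

    return actor_loss, critic_loss
```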
Info
Channel: Udacity-DeepRL
Views: 3,031
Keywords: #hangoutsonair, Hangouts On Air, #hoa
Id: n6K8FfqQ7ds
Length: 11min 10sec (670 seconds)
Published: Mon Jun 10 2019