Deep RL Bootcamp Lecture 4A: Policy Gradients

Captions
The next lecture is going to be on policy gradients and actor-critic methods. Lecture 4B will be on the same topic from a different perspective, and there is a reason for this: a lot of the current successes in reinforcement learning are through policy gradient methods. The math is quite a bit different from what we've seen so far, so it's nice because you can start a little bit from scratch, but there are also going to be a lot of new things to absorb. I'll cover it from a mathematical perspective, Andrej will then cover it from a more programming perspective, and later in John's session at the end you'll see a lot of nuts and bolts related to policy gradients.

Let's remind ourselves of where we're at. We're doing reinforcement learning: our agent is supposed to somehow learn to act in this world. It takes actions, which will now be u_t instead of a_t, a slight change in notation. Action u_t is taken, things change, there is a new state and a reward associated with what happened, the agent sees that and decides again, and this keeps going around. What we're going to look at in this lecture is this part of the puzzle: policy gradients and actor-critic methods, which are direct policy optimization methods rather than dynamic programming methods. Everything we've seen so far was a dynamic programming method, which tries to get a self-consistent set of equations satisfied for the Q values or for the V values. It's going to be very different now.

The goal will be to find a policy pi_theta, which will be stochastic for the entirety of this lecture, a distribution over actions given state, that maximizes the expected sum of rewards. Ultimately that's what we care about: finding such a policy. Thus far we've searched for such a policy in a very indirect way, finding a value function or Q function and then deriving a policy from it. Now we're going to see if we can directly find a policy that does well. The way to think of the policy: some neural net processes maybe some pixels or state information through a few layers and then outputs a distribution over the possible actions you might take. To act, you sample from that distribution, see what happens, and the process repeats.

For concreteness we'll think of the parametrization of our policy as a big neural net. The theta vector, which used to encode a parameterization of a Q function, now parameterizes the policy; that is, the weights in the neural net are the theta vector, and that is our policy. We still want to maximize the expected sum of rewards; we're not going to discount in this lecture, so gamma equals 1 inside that expected sum. And we're going to use a stochastic policy class, meaning the policy class, the set of policies we'll choose from, consists of stochastic policies, and we hope that set contains a good one that we'll find. There are a few reasons to use a stochastic policy class. Maybe the most important one is that it smooths out the optimization problem. What are we trying to do here? We're trying to find a policy that maximizes expected sum of rewards. With a deterministic policy, say in a grid world, changing the action in a state is a big, discrete change and can have a very significant effect on the outcome; it's not a smooth thing to do. But once we have a distribution over actions in every state, we can slightly shift the distribution, and that will only slightly shift the expected sum of rewards. So thanks to having a stochastic policy, the optimization problem becomes smooth.
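As a concrete picture of the kind of policy network being described, here is a minimal sketch in PyTorch; the class name, layer sizes, and interface are illustrative choices of mine, not anything from the lecture:

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """A stochastic policy pi_theta(u | s): state in, distribution over discrete actions out."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),   # logits, one per action
        )

    def distribution(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

    def act(self, obs):
        # Sample an action from the current distribution; this is how the agent behaves.
        dist = self.distribution(obs)
        u = dist.sample()
        return u, dist.log_prob(u)   # log pi_theta(u | s) is the piece the gradient will need later
```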
With a smooth optimization problem we can now apply some kind of gradient method and find at least a good local optimum.

So why policy optimization? Often the policy can be simpler than the Q function or the value function V. For example, say you need a robot to grasp something. The policy could be as simple as: the gripper goes somewhere and closes around the object, which conceptually is maybe not that difficult. But if you wanted to learn a Q value or a V value, you would need to learn exactly, numerically, how much reward is associated with that process, which means you need to understand the exact timing, exactly how robust the grasp is, and so forth. If you just learn a policy you don't need to worry about all those details; you can focus on the essence.

Another reason you might want a policy and not a V function is that V does not prescribe what actions to take: once you have V you still need a dynamics model, a transition model, to figure out which action to take in the current state. If you learn Q but have a high-dimensional or continuous action space, computing the argmax can be tricky, because even when you know the Q values for state-action combinations, encoded in your neural net, how are you going to know which action to take? In DQN, which you saw Vlad present, it's not too hard: with one output per action you can see which one has the highest score and take that one. But with continuous actions you cannot have one output per possible action, because there are infinitely many actions you could take, so you need to take the action as an input: input is action and state, output is the value of that state-action combination. Now you need to solve an optimization problem, which could be pretty difficult, just to know which action maximizes the Q value.

Question? Yes, good question about notation. In reinforcement learning there are different conventions: sometimes people use a for actions, sometimes u; sometimes x for state, sometimes s. It turns out that in direct optimization, u is often used, and I'm using it here, but you could use a and the math would be exactly the same; u is just what people are familiar with in this context.

Let's dive a little deeper into the differences. With policy optimization you optimize directly what you care about: you optimize a policy that maximizes expected sum of rewards. If you do dynamic programming, say you run DQN, you learn a Q function that tries to satisfy the Bellman equation, but if it doesn't satisfy it exactly, which will typically be the case, it's some approximation, and it's not clear how much you're losing; you're not directly optimizing for what happens when you then act using that Q function. When you directly optimize your policy, you directly optimize the objective you ultimately want. Another reason to like policy optimization is that it's more compatible with rich architectures, including recurrent architectures, and often easier to work with; hard to appreciate now, but that's some foreshadowing. It's more versatile: it's easy to set up a policy and just work with it, whereas with Q values it can be harder. It's also more compatible with auxiliary objectives. The nice thing about the dynamic programming methods, and especially Q-learning, that you don't get here is that they are more compatible with exploration and off-policy learning.
Exploration means finding parts of the space you haven't been to and learning about them. The methods we'll see here, policy gradient methods, are on-policy: you need to act with your current policy and then improve it gradually. In Q-learning you can have a data collection policy separate from the Q-learning itself, and that's fine; your data collection policy can be very exploration-oriented, which can be beneficial at times. That's the difference between on-policy and off-policy: how you collect your data. On-policy, data collection is directly tied to your learning; off-policy, you can collect your data whichever way you like. Also, dynamic programming methods are often more sample efficient because they look at more of the structure of the problem: by using the Bellman equation they can use the data more efficiently. That doesn't mean they're more computationally efficient, because you might need many updates on the data to actually take advantage of it, but they're often more sample efficient. So there are trade-offs in both directions. That said, the policy optimization methods are typically easier to get to work; if you're going to try one thing on a problem, it's probably what you'd try first.

The gradient we're going to use is a special kind of gradient, a little different from the gradients you're used to. It's called the likelihood ratio gradient, in this case the likelihood ratio policy gradient, and we're going to derive it. Some new notation: tau will denote the state-action sequence, the state at time 0, the action at time 0, and so forth until horizon H; that entire thing we call tau. We associate a reward R(tau) with the entire trajectory, which is the sum of the individual rewards along the trajectory. The utility of a policy with parameter vector theta, U(theta), is the expected sum of rewards, and we can write it as a sum over all possible trajectories of the probability of that trajectory under the policy times the reward associated with that trajectory. In this new notation, which we're using for compactness, we lose some structure but gain compactness; it's easier to think about. We want to maximize the utility U(theta) as a function of the policy parameter vector theta, which means maximizing this expectation.

Now we're going to do some math to derive the likelihood ratio policy gradient from scratch; it's a very interesting derivation, so let's do it slowly. We're interested in the gradient of the utility with respect to theta. If you're not too familiar with the gradient notation, just think of it as the vector of derivatives of this quantity with respect to theta. The gradient of a sum is the sum of the gradients, so we can bring the gradient inside the sum over trajectories. Then we play a little trick, with some foresight about why: we multiply and divide by the same thing, the probability of the trajectory under theta, which just moves things around without really changing anything. Then we remember that the derivative of log f(x) is f'(x) divided by f(x), and we use that in the reverse direction: grad P(tau; theta) divided by P(tau; theta) is the same as grad log P(tau; theta), so we end up with probability times grad log probability times the reward along the trajectory. There are two reasons we made this transition; one we can already see on the next line, and another will come up later.
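Written out in the lecture's notation, the derivation just described (the final expectation form is the "first reason" discussed next):

```latex
\nabla_\theta U(\theta)
  = \nabla_\theta \sum_\tau P(\tau;\theta)\, R(\tau)
  = \sum_\tau \nabla_\theta P(\tau;\theta)\, R(\tau)
  = \sum_\tau P(\tau;\theta)\, \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\, R(\tau)
  = \sum_\tau P(\tau;\theta)\, \nabla_\theta \log P(\tau;\theta)\, R(\tau)
  = \mathbb{E}_{\tau \sim P(\tau;\theta)}\!\left[\, \nabla_\theta \log P(\tau;\theta)\, R(\tau) \,\right]
```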
The first reason is that we now have an expectation: an expected value under the current policy. This sum over trajectories, weighted by their probability under the current policy, is an expectation of the quantity at the back. And once we have an expectation, as we've seen in Q-learning, we can approximate it by a sample-based estimate. That's what we're going to do to compute an estimate of the policy gradient: we take the quantity at the back, computed from sample paths obtained with our policy, where i indexes over the m sample paths. We execute our policy, we get paths, we compute the grad log probability of each path times its reward, and we average over all of them; that's our gradient estimate. I haven't said yet how to compute the grad log probability of a path; we'll get to that. What we see already is that we have an expectation, and if we can compute that inner quantity, this gives us an estimate of the gradient.

There are some interesting observations here. This estimate is valid even when the reward function is discontinuous and/or unknown. Take a step back: when you imagined optimizing this objective and taking a derivative with respect to theta, you might have expected a derivative of the reward with respect to the state, then a chain rule through how the state you visit depends on your policy pi_theta. That's not what's happening. This is a different kind of derivative; there are no derivatives of the reward anywhere. As a consequence, we can apply this equation even when the reward is discontinuous, and it's valid even when we don't know the reward function: we just experience a number, the reward, as we roll out, and that's all we need. It also works with discrete states and actions. Typically when you think about derivatives you think you need to chain rule through everything, and if states and actions are discrete that might seem difficult, but this kind of derivative works out just fine, which is really nice.

Now let's interpret what's happening here and why this even works. Say we have a policy, we do three rollouts, and one of them is pretty bad, one is medium, and one is pretty good; m equals three. For the good reward, the grad log probability of that path gets increased a lot; for the medium one it's increased a little; and for the bad one, if the reward is negative, it's decreased. The way this works is that you update your policy by looking at the probability of paths under your policy: good paths get their probability increased, bad paths get their probability decreased. You can even do this from one sample; it won't be a very precise gradient estimate, but with just one rollout you could still compute this estimate and do an update to your policy.

Tomorrow morning, in the first lecture, we'll look at a different kind of derivative, the path derivative you're often more used to, which goes through the dynamics model, through the policy, through the reward function and everything. That's a different way of doing this, and it requires access to a dynamics model and so forth. We'll see that here we won't need one once we work through the details, but tomorrow you'll see a
contrasting method. For now, just see this as one way of computing a gradient, and don't worry too much about which one is better; that's what tomorrow morning's lecture will be about. Question? Okay, so this quantity is the derivative of the log probability of a trajectory with respect to your parameter vector theta. If your reward is positive, then when you do an update, when you compute the gradient and change your parameter vector theta in the direction the gradient points, the consequence will be that you increase the log probability of the better paths. In fact, if all rewards are positive you'll try to increase the log probabilities of all paths, but because probabilities have to normalize to one, in practice you won't increase all of them: the neural net living underneath can't increase the probability of everything, so there will be a stronger force on the good paths and a weaker force on the bad ones, and it will increase the probability of the good ones more while increasing the bad ones less, or even decreasing them. We'll dive more into this in the next few slides, and I'll give you a new derivation of the exact same math where that stands out very clearly.

Now let's see how to compute this quantity: we have a policy, we use it, we get a rollout, and we need the grad log probability of the trajectory under that policy. What is the probability of a path? It's the product over time of the probability of the next state given the current state and action, encoded by the dynamics model, times the probability of the action given the state under the policy. Multiply that together over all time steps, like a Markov chain, and you get the probability of the particular path you experienced. We need the derivative of that probability with respect to theta. First, the log of a product is the sum of the logs. Once you have a sum, the gradient of a sum is the sum of the gradients, so we bring the gradient inside. Why did the dynamics term disappear? It has no theta in it: theta is the parameterization of the policy, and the dynamics model of the world does not involve the policy's parameter vector, so it plays no role and simply drops out. That's really the beauty of this derivation: you can compute this gradient with access only to your policy, without access to the dynamics model that encodes whatever happens in the world. No dynamics model required, which is very nice. All we need is the grad log probability of an action given a state, and that's your neural net: it encodes a distribution over actions given states for each state you encountered, and you can just backprop through your neural net to get the derivatives. Multiply those with the rewards and you get your gradient contribution.

Yes, the P's are different on the left and right hand side; there's some overloading of notation, and it's fine to think of them as different. Another way to think of it is that P is the general measure over all events that could happen in the world: if you ask P how likely a trajectory is, it can tell you, and if you ask P how likely the next state is given the current state and action,
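To make that decomposition concrete, here is a minimal sketch, reusing the CategoricalPolicy placeholder from above and assuming a gymnasium-style environment interface where reset() returns (obs, info) and step() returns (obs, reward, terminated, truncated, info); all names here are my own placeholders. It collects one rollout and returns the differentiable sum of log pi_theta(u_t | s_t), which is all the trajectory's log probability contributes to the gradient, since the dynamics terms carry no theta:

```python
import torch

def rollout_log_prob_and_reward(env, policy, horizon=200):
    """Run one rollout; return sum_t log pi_theta(u_t|s_t) (a differentiable scalar) and R(tau)."""
    obs, _ = env.reset()
    log_probs, total_reward = [], 0.0
    for _ in range(horizon):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        u, log_p = policy.act(obs_t)                       # log pi_theta(u_t | s_t)
        obs, reward, terminated, truncated, _ = env.step(u.item())
        log_probs.append(log_p)
        total_reward += reward
        if terminated or truncated:
            break
    # log P(tau; theta) = sum_t [ log p(s_{t+1}|s_t,u_t) + log pi_theta(u_t|s_t) ];
    # the dynamics terms do not depend on theta, so only the policy terms matter for the gradient.
    return torch.stack(log_probs).sum(), total_reward
```

Calling .backward() on (1/m) * sum_i log_prob_i * R_i over m such rollouts would then give the likelihood-ratio gradient estimate via autograd.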
it can tell you that too. Yes, so the assumption here, the Markov decision process assumption, is exactly this factorization; that's one of the main assumptions, along with how the reward is encoded. Often you can make the Markov property hold by making your state big enough: with a big enough state you encode a lot. But in practice you often can't make the state too big; if your state is a very high-dimensional vector, things become harder to learn and trickier, so there's a trade-off. You might not have a perfectly Markov model; you might just proceed anyway and hope it's a good enough approximation. Something else you can do, beyond the scope of this lecture, is replace the states in this derivation with observations, whatever you observe about the world, and a similar derivation can be made to go through. The policy you'd want in practice, if at any given time you don't observe enough to know the state, is a recurrent neural net policy, and the machinery gets a little more complicated, but the same philosophy can be made to work. I'll continue a little bit and then we'll see where we are on time.

So what do we have now? We have a way of estimating the gradient: that was our equation in the new notation, and when we look at the details again, the grad log probability of a path can be computed as just the sum of grad log probabilities of the actions given the states encountered. So we can do this.

Now, the likelihood ratio: here's another way of deriving it, which comes from importance sampling. You might wonder why we're deriving this twice when we already know what we're going to do. This idea will play a role in what John covers later when he looks at proximal policy optimization, which is probably the current state-of-the-art reinforcement learning algorithm and might be your first choice if you just throw an algorithm at something; it relies on this derivation. What is importance sampling? It's when you compute an expectation under some distribution, namely the one induced by your policy pi_theta, but your samples, your trajectories, come from another policy, pi_theta_old. You sample from pi_theta_old, which gives you trajectories, and you use those trajectories to compute an expectation, but the expectation you care about is under the new theta. What you then have to do is correct with the ratio P(tau; theta) / P(tau; theta_old) for the fact that you sampled from theta_old rather than theta; that ratio performs the correction. This is a way to compute the expectation we care about, but based on samples from an old policy rather than the current one.

What you're getting at is that we might need to learn a policy in a high-dimensional space where a lot of different paths could happen, and when you compute this expectation you won't have seen every path; each path contributes only a little. What we're counting on is that the neural net representing the policy can generalize across states, so that things you see along one path let you learn things that also apply to states you might encounter along another path. If that's not true, then you might indeed need a ridiculous number of samples to get this to work. Either way it already requires a pretty large number of samples, don't get me wrong.
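Written out, the importance sampling identity just described, together with the gradient step that the lecture takes next:

```latex
U(\theta) \;=\; \mathbb{E}_{\tau \sim \theta_{\text{old}}}\!\left[ \frac{P(\tau;\theta)}{P(\tau;\theta_{\text{old}})}\, R(\tau) \right]
\qquad\Longrightarrow\qquad
\nabla_\theta U(\theta)\Big|_{\theta=\theta_{\text{old}}}
 \;=\; \mathbb{E}_{\tau \sim \theta_{\text{old}}}\!\left[ \frac{\nabla_\theta P(\tau;\theta)\big|_{\theta_{\text{old}}}}{P(\tau;\theta_{\text{old}})}\, R(\tau) \right]
 \;=\; \mathbb{E}_{\tau \sim \theta_{\text{old}}}\!\left[ \nabla_\theta \log P(\tau;\theta)\big|_{\theta_{\text{old}}}\, R(\tau) \right]
```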
So what we can then do is take the gradient of this quantity. The gradient of an expectation is the expectation of the gradient, under some mild assumptions, so we can reformat this a little and evaluate it at theta_old, and the quantity we get turns out to be the same as what we had before. Effectively we looked at an infinitesimal perturbation: if we're curious about other policies, even though we only have trajectories from an old policy, we can understand how good another policy is going to be, at least locally as we deviate from theta_old, and it gives us the same gradient. What it also means is that you can do more: you can use the original equation, after a gradient update changes your policy, to estimate how good that update was, rather than just blindly trusting that the gradient put you in a good spot. That will come back a lot in John's lecture: having this kind of extra information about what happens when you change theta, rather than just looking at the gradient and taking a step.

There could be many trajectories you care about, a lot of things happening, which makes what I just showed you a valid estimate of the gradient but one that requires an impractical number of samples to estimate well empirically. We're going to introduce tricks to improve this: a baseline, temporal structure, and then, in John's part, a further set of tricks.

Let's revisit what's happening here. We have an objective and the gradient of the objective with respect to theta, and following the gradient means increasing the probability of the better paths. But think about it very concretely: say we did three rollouts and all three had positive reward. What's going to happen is that you'll try to increase the probability of all three of those rollouts; some are better than others, and you'll increase more for some than for others, but you'll increase it for all of them. Really, what you would like is that if something was bad, you decrease the probability of that rollout. That would happen if the reward were negative, but if all rewards are positive it won't happen at all. You can already see how this will require a lot of averaging over many rollouts for the good ones to overrule the bad ones in the update you converge to.

So what can we do to alleviate this? We introduce a baseline. What's the baseline? We take the original equation and subtract out b: instead of multiplying the grad log probability by the reward we experienced, we multiply it by the reward minus a baseline. Intuitively, we want to increase the probability of paths that are better than average and decrease the probability of paths that are worse than average. If we can measure average performance, by just looking at the rollouts and seeing what the average is, we can call that our baseline, plug it in, and now we've modified this to do what we want, or at least to be much closer to it. Can we do this mathematically; do we still have a valid gradient estimate? It turns out there's an answer, a proof from Williams' 1992 REINFORCE paper.
That proof shows that when you put this baseline in and average over infinitely many trajectories, you still get the exact gradient. So you're not biasing the computation; all you're doing is reducing variance, and that's a good thing: if you can stay unbiased and reduce variance, you need fewer samples to get a good estimate. There's some math that proves this; don't worry about it now, but it's in the slides you can download from Piazza, which you might already have, and you can go through it yourself. The thing to keep in mind is that this is okay as long as the baseline does not depend on the action: b cannot depend on the action you took or on its probability. But if we just take the average of the rewards we got, or something like that, that's fine.

We can actually do more. Thus far we have largely ignored the temporal structure. This is our policy gradient estimate now, but we also know we can decompose it into a sum of gradients, one per time step, multiplied by all the rewards, which are themselves a sum over time. So we have grad log probabilities of actions given states, and all the rewards. But why would you change the probability of an action at, say, the final time step based on the reward you got at the first time step? That doesn't make sense. Yet if you look at the multiplication, probabilities of actions at all times interact with rewards at all times. Really you only want actions to interact with the future rewards that come after them. So we rearrange: the part of the reward that comes before the current time does not depend on the action you take at the current time, so it should not affect the grad log probability update for that action, and we simply remove those terms. Removing them gives the policy gradient equation you'll see a lot, the standard policy gradient: increase the grad log probability of the action taken in a given state by an amount determined by how much the reward experienced from that time onwards, from t to the end, is better than what you would get on average from that time onwards; if it's worse than average, the probability goes down.

Let me keep going for a bit and think about good choices for b. What does it mean to take the average? A constant baseline is one where you just look at all rollouts, average their rewards, and subtract that out. It turns out there's some math, derived by Williams and also by Greensmith, Bartlett, and Baxter, that shows what the optimal baseline is: you should not simply average the rewards of the rollouts, but take a weighted average based on the grad log probabilities of the rollouts, where rollouts that are more likely contribute a little more to the average than rollouts that were less likely. That's an improvement over the simplest baseline, though admittedly I don't think anybody uses it, because it's not a huge improvement; but if you work through the math and want the minimum-variance choice, that's what it tells you. You can also make the baseline time dependent, because we now have grad log probabilities of actions given states, not just of the entire path: you take the average reward from a certain time onwards.
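Putting the baseline and the reward-to-go rearrangement together, the estimator being described here, in the lecture's notation (with the baseline b allowed to depend on t or, as discussed next, on s_t):

```latex
\nabla_\theta U(\theta) \;\approx\; \hat{g}
  \;=\; \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H-1}
        \nabla_\theta \log \pi_\theta\!\left(u_t^{(i)} \mid s_t^{(i)}\right)
        \left( \sum_{k=t}^{H-1} r_k^{(i)} \;-\; b\!\left(s_t^{(i)}\right) \right)
```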
You can also make the baseline explicitly state dependent: you can say, I have seen rewards from state s_t, so I can compute a value function V^pi(s_t). What is the value function for a policy? It tells you how much reward you're going to get from that state onwards. That's actually what we really want, because that value function tells us how much we would have gotten on average under the current policy, and then if the action did better or worse than that, we want to increase or decrease the probability of that action.

Of course we then need to estimate V^pi, and we haven't really seen how to do that yet, except when we have a transition model and can run policy evaluation; here we don't have a transition model, we have to estimate it from samples, and with a large state space that takes some extra work. Here's what you would do. You collect a bunch of trajectories under the current policy, and then you run an optimization where phi is the parameter vector of a separate neural net for the value function: one neural net for the policy, one for the value function, with the value function net parameterized by phi. You want this neural net, for the states you experienced, to predict whatever rewards you got after you were in that state. So all your experience becomes supervised learning data: input is the state, output is how much reward you got afterwards. That's a supervised learning problem, and that's what we're solving here: run an optimization to find the best value function for your current policy, farm it off to whatever standard supervised learning machinery you have, and then use that value function in your policy gradient update.

This is called a Monte Carlo estimate of the value function V^pi. You can do something else, a TD estimate. There, as you gather experience, you look at state, action, next state, reward combinations and set up an optimization problem saying: I want my value function at state s to equal the immediate reward plus the value at the next state according to my current estimate of the value function. So the current estimate is used to compute a target value, and phi is the new parameter you're optimizing. And then maybe you don't want to move too far too quickly, because you only have a small amount of samples and you have a reasonable prior that staying where you were isn't too bad; that's what the extra term is for. So there are different ways to do the same thing, with pros and cons. The Monte Carlo version is simple: you have a clear target, any time you do rollouts you regress onto it, and it's unbiased, which is nice; but it's not very sample efficient. If you want to be more sample efficient, the TD version will often be better, because it exploits the Bellman equation structure by computing target values and regressing onto them, just like Q-learning; but it can also be harder to stabilize, it doesn't always work out as easily, and you might have a lot of tweaking to do to get it stable.

Yes, how is this different from using a smaller learning rate? There is some equivalence there. This is a way of essentially suggesting that smaller steps are better; the prior says a small step is better than a large step, but rather than putting it in the algorithm, we're
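A minimal sketch of the Monte Carlo baseline fit just described, assuming PyTorch; the value network, optimizer, and hyperparameters are placeholder choices of mine:

```python
import torch
import torch.nn as nn

def fit_value_baseline(value_net, states, returns, epochs=20, lr=1e-3):
    """Regress V_phi(s_t) onto the empirical returns sum_{k>=t} r_k (Monte Carlo targets)."""
    opt = torch.optim.Adam(value_net.parameters(), lr=lr)
    states = torch.as_tensor(states, dtype=torch.float32)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    for _ in range(epochs):
        pred = value_net(states).squeeze(-1)       # V_phi(s_t)
        loss = ((pred - returns) ** 2).mean()      # plain supervised regression
        opt.zero_grad()
        loss.backward()
        opt.step()
    return value_net

# The TD alternative would instead regress V_phi(s_t) onto r_t + V_{phi_old}(s_{t+1}),
# optionally with an extra penalty keeping phi close to phi_old.
```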
putting it in the objective.

Yes, so what changed? When we first looked at value iteration we saw that we needed the model, this P. What changed is that we're doing sample-based evaluations of all of this. The target is an expected value over what happens at next states, and we're going to sample from it: the experience we collect by using our current policy gives us samples of this quantity, and by averaging those samples we get the expectation, or an approximation to it. So there are two alternative ways to do this: the Monte Carlo approach, which fits the value function directly, or the TD approach, which uses target values and iterates to get a better and better value function.

Now that we can get a value function, let's look at a full algorithm, the vanilla policy gradient algorithm; I think this will be part of the lab you work through tomorrow. You initialize your policy, say a big neural net with some parameter vector theta, and you initialize your baseline; we've seen many options for the baseline, and estimating a value function is probably the natural thing to do. Then you iterate. What happens in an iteration? We update the policy, and every iteration will hopefully lead to a better policy; in practice this is a stochastic procedure, so sometimes you get worse and sometimes you get better, but on average it keeps improving. In each iteration we collect a bunch of trajectories, meaning we execute the current policy and just gather data. Then, at each time step along a trajectory, we compute the return we got from that time onwards, possibly discounted. For now you can assume the discount is one, but it's actually often good to discount here to reduce the variance of this quantity, purely as a variance reduction trick, not because you think interest rates or some similar effect matter for when you receive rewards: it's just that this can be a sum of many terms with very high variance, and discounting reduces that. Next we compute an advantage estimate: the return along the rollout minus the baseline at that state s_t. That's exactly what we saw on the previous slides: the sum of rewards from that time onwards with the baseline subtracted. After that we refit the baseline, which is just recomputing our value function based on the data we've seen; as written in this algorithm it's the Monte Carlo version, fitting to the actual returns we observed, but you could use the TD version instead if you prefer. At that point we compute our policy gradient estimate g_hat from the quantities we've computed, summed over all times and all states and actions encountered, take a step in that direction, with some step sizing of course to make sure we don't step too far, and then we repeat. That's the algorithm. It's really not that many lines of code, and Andrej will show some code in his presentation to dive into what it does at the Python level.
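For illustration, here is a compact sketch of that loop, reusing the placeholder CategoricalPolicy and fit_value_baseline sketches from above; the step sizes, batch sizes, and environment interface are assumptions of mine, not the lab's reference code:

```python
import torch

def vanilla_policy_gradient(env, policy, value_net, iterations=100,
                            rollouts_per_iter=20, max_steps=1000, gamma=0.99, lr=1e-2):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(iterations):
        all_logps, all_states, all_returns = [], [], []
        # 1) Collect trajectories under the current policy.
        for _ in range(rollouts_per_iter):
            states, logps, rewards = [], [], []
            obs, _ = env.reset()
            for _ in range(max_steps):
                obs_t = torch.as_tensor(obs, dtype=torch.float32)
                u, logp = policy.act(obs_t)
                obs, r, terminated, truncated, _ = env.step(u.item())
                states.append(obs_t); logps.append(logp); rewards.append(r)
                if terminated or truncated:
                    break
            # 2) Discounted return from each time step onwards (variance reduction trick).
            ret, returns = 0.0, []
            for r in reversed(rewards):
                ret = r + gamma * ret
                returns.insert(0, ret)
            all_states += states; all_logps += logps; all_returns += returns
        states_t = torch.stack(all_states)
        returns_t = torch.tensor(all_returns, dtype=torch.float32)
        # 3) Advantage estimate: return-to-go minus the state-dependent baseline.
        with torch.no_grad():
            adv = returns_t - value_net(states_t).squeeze(-1)
        # 4) Refit the baseline (Monte Carlo regression, as in the earlier sketch).
        fit_value_baseline(value_net, states_t, returns_t)
        # 5) Policy gradient step: g_hat = mean of grad log pi(u_t|s_t) * advantage.
        loss = -(torch.stack(all_logps) * adv).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```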
So what do we have now? We have this gradient estimator and we've looked at how to build an algorithm around it. I'm going to keep going in the interest of time; you can post questions on Piazza and the TAs are following along.

What we would really want here is the Q value for that state and action, rather than an estimate from a single rollout, because on average the rollouts give us the Q value; so can we get something closer to that? If we just use rollouts, it's going to be high variance. We already reduced the variance by discounting, and we can reduce it even more by introducing function approximation. Discounting is straightforward: it doesn't give a better estimate per se, but it gives a lower-variance estimate than no discounting. How about function approximation? Well, the expected value of all future rewards is the expected immediate reward plus the expected value of the value at the next state. So we can replace the sum of all future rewards by just the reward we got immediately in that rollout plus the value function at the next time step; or by two steps of rewards plus the value two steps ahead, or three, and so forth. A3C, one of the state-of-the-art policy gradient methods, uses exactly this idea: you pick, say, k equal to 5, and instead of the sum of all future rewards you use the first five rewards you encounter, followed by your value function estimate. So we're now using the value function estimate twice: once to replace the tail of rewards after k steps, and once as the baseline you subtract at the current state.

Oh, I skipped a variant, hold on. A3C picks a specific number of steps; you can do something else, called generalized advantage estimation, which is an exponential averaging over all possible choices of k: you do a lambda-based weighting and average all of those horizons together. It's equally efficient to compute, and it might be an easier way to get what you want than choosing one specific number of steps. It's closely related to TD(lambda), which you can read about in the Sutton and Barto book.

So this is actor-critic. What does it mean to be actor-critic? It means you're running a policy gradient method where you estimate a value function for your baseline. Once you have a value function for the baseline, it's not usually called plain policy gradient anymore; you can still call it that if you like, but the technical term at that point is actor-critic. In an actor-critic method we have a current policy and a current value function estimate for that policy; we collect rollouts, we accumulate rewards along the way, and the advantage estimate could be based on the A3C-style k-step method or on generalized advantage estimation. You refit the value function, you do your policy gradient update based on it, and you repeat. Again you have a lot of choices, Monte Carlo versus the A3C-style few-steps-then-value-function versus generalized advantage estimation, both in the first term of the policy gradient update and when you refit your value function. So there are a lot of orthogonal choices to make.
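A sketch of the two advantage estimators just described, the A3C-style k-step version and generalized advantage estimation, computed from one rollout; the conventions are mine: values is assumed to have one extra entry for the state after the last step, taken to be 0 if the episode ended.

```python
import numpy as np

def k_step_advantages(rewards, values, k=5, gamma=0.99):
    """A_t = r_t + ... + gamma^(k-1) r_{t+k-1} + gamma^k V(s_{t+k}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        ret, discount = 0.0, 1.0
        for j in range(t, min(t + k, T)):
            ret += discount * rewards[j]
            discount *= gamma
        ret += discount * values[min(t + k, T)]   # bootstrap with the value estimate
        adv[t] = ret - values[t]                  # subtract the baseline at the current state
    return adv

def gae_advantages(rewards, values, gamma=0.99, lam=0.96):
    """Exponentially weighted (lambda-based) average over all k-step advantages."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```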
And this actually works quite well: A3C achieved state-of-the-art results on Atari and also really good results in continuous control environments, better than DQN in terms of how fast it gets somewhere as a function of wall-clock time. There are a few reasons for this; one is that there's a little bit of built-in exploration from running it over multiple machines that are not perfectly synchronized, so they run slightly different policies, which is related to the parameter noise that Vlad talked about in his very last slide. You get very good results; you can even do things like learning to navigate a labyrinth. What's underneath there is a recurrent neural network, because all you get is images, and from just the current image it's not possible to decide what to do: you need to retain memory of what you've seen in the past. The recurrent neural net encodes the policy, so the policy can remember what it has seen, where the good rewards are, and can then navigate towards them, the apples or cherries or watermelons and so forth. This is pretty amazing: it's taking in monocular images, doing low-level control, and effectively building a map of the environment it's in, and nobody built into it what it means to build a map. It simply learned that a recurrent neural net doing something equivalent to map building happens to make a good policy.

There are some parameters to tweak. When you do generalized advantage estimation there's lambda, how you exponentially decay the horizons you work with, and there's gamma, the discount factor. It turns out that if you study lambda, you find that effectively you want to be really close to Monte Carlo estimates for your value function and your rollouts, and only bring a little bit of TD into the equation. Bringing in a little TD reduces the variance and helps, but if you rely too much on TD, things get biased and work much less well. The same holds for gamma: you want to tune it a little. Here a lambda of 0.96 works well, and a gamma of 0.98. Keep in mind this does not mean we change gamma in the problem definition and ask whether the changed problem is easier: the problem remains gamma equal to 1, and we evaluate how well the policy does with gamma equal to 1, but we train with a gamma different from 1 and see which choice ends up doing better.

So let's see what we can do learning-wise with this. This uses something John will cover in his lecture, which improves the gradient step a little by using a trust region, combined with everything we've seen here. What we're going to watch is a neural net controlling this humanoid. The neural net gets joint angles and joint velocities as input and outputs torques at each of the motors. The reward function is as simple as: the further it travels in that direction, the better, and the less impact with the ground, the better. A very simple reward function, nothing in there about what walking should look like, just distance covered and minimal impact with the ground. The neural net has about a hundred thousand parameters. At iteration zero, what do you expect? It's randomly initialized, it doesn't know what to do, and it's probably going to fall over. It would be very surprising if a random initialization knew how to control this robot; it can happen, with very low probability, but it has never happened to us.
So let's see what happens. Indeed it falls over, but it's a random policy: it tries different things, it makes the things that were better more likely than the things that were worse, and over time it improves the policy to cover more and more ground. After 640 iterations it's able to run. If we ran this in real time on a real robot that behaved exactly like this simulated one, then, say, 2000 iterations would take about two weeks, so not totally crazy; of course, running it on many machines, it only takes a couple of hours. And it's the exact same algorithm, no changes, learning to control this robot. What's interesting is that traditionally, if you work in robotics, you would think about things like how to balance a robot, what it means to lose balance, what dynamic stability means, and so forth; people would think a lot about those concepts to design controllers for these robots, and then rethink everything for the next robot. No rethinking is needed here. For the getting-up task, what's the reward function? The distance of the head to standing head height: the closer to standing head height, the better. You can see that sitting is better than lying down, and it actually figures out how to get up. Question over there?

Okay, so timing-wise, what I meant to say, but maybe didn't, is that it would take two weeks if you ran this on a real robot, in terms of real-time experience. We run it in MuJoCo, a simulator built by Emo Todorov at the University of Washington; that simulator is designed to run faster than real time, and we can run it in parallel over many processes. But if you look at the real-time experience corresponding to this, on a real robot it would take two weeks, which is not too crazy. Now, a reality check: if you were actually to do this, the hyperparameters we used are not going to be the right ones for the real robot, so you might have to run it again, and you don't want to run for two weeks, then change some hyperparameters and go again. You'd probably want to stop sooner when it's not making progress, but a lot of these runs make almost no progress for a very long time and then finally start shooting up, and unless you can wait long enough to know whether you've reached the point where they start shooting up, you don't know if you have the right hyperparameters. So it's a little tricky. I wouldn't yet recommend running this on a single real robot and expecting a result in a reasonable time, but I also feel we're maybe only a few breakthrough papers away from getting there. Oh, of course, and we did, right: we ran much faster than real time for this research. I'm just trying to give you a picture: you see it in simulation, and maybe you have a physical robot at home, or you want to build one, and you're wondering, should I use this? And I'm saying, well, maybe, yes.

Yes, a lot of research goes in that direction; I'll rephrase the question as I answer it: can you leverage what happens in simulation to learn faster in real life? That's what you would do. You wouldn't learn from scratch in the real world, except maybe for kicks and for research purposes; if you actually wanted to deploy something, you would first train in simulation and then fine-tune on the real system, and if your simulator is somewhat matched, that can work well.
People are still researching how to do this right, but ideas people play with include things like randomizing your simulator and learning a policy that works across many simulators, because if that's the case it might also work in the real world, or at least be a robust policy that works in many situations and is easier to deploy.

I just want to contrast this with some real-world robots. This is from, I believe, the 2015 DARPA Robotics Challenge; a lot of people have probably seen this video. The point is not that these are the best walking robots that have ever existed; the point is that these teams worked for about two years, with a lot of very smart people, to compete really hard in this challenge, and still these kinds of failures happen. What that shows is that this remains a hard problem: robot locomotion is not a solved problem at all.

All right, I'm going to stop here, thank you. I think I have time for one question while Andrej sets up his computer. Yes? Okay, so the question is: say you have this robot that has learned how to run; how do you now make it go to goals? A natural way is to set up your problem a little differently: you take as input both the state and a desired direction, or a goal to go to, and you consider that part of the state. You augment your state space with the goal you want to achieve, and at that point your neural net can learn what to do as a function of both the state you're in and the goal you're trying to achieve. You can actually do this in pretty clever ways to speed up learning. Recently at OpenAI we had a paper called Hindsight Experience Replay; Marcin Andrychowicz is the first author. On top of having the neural net take in goal and state to learn something more general, it also, after the fact, in the replay that happens during the Q-learning optimization, changes the goal fed into the neural net to the one you actually achieved. Maybe you wanted to go buy pizza but you ended up in the ice cream store and bought ice cream. You might say there's no reward associated with that, or you could say: if my goal had just been to buy ice cream, I would have achieved high reward. That gives great signal, and that's what hindsight experience replay does to get signal into your reinforcement learning, which is often the really hard part: getting nonzero reward, good signal to learn from. All right, thanks again. [Applause]
Info
Channel: AI Prism
Views: 46,693
Rating: 4.9227052 out of 5
Id: S_gwYj1Q-44
Length: 53min 55sec (3235 seconds)
Published: Thu Oct 05 2017