Hado van Hasselt: Reinforcement Learning II

Captions
Next I'll touch upon some of the challenges in reinforcement learning. Of course I've already mentioned a couple along the way, but just to be clear. This was actually a question that was just asked right now as well: I talked about learning and planning, but I didn't really define these very concretely, and this is more or less deliberate, because these terms are somewhat fuzzy in the literature. If you ask many people to define, say, planning for you, they might give you quite different answers. One distinction that you could make is that for learning, the environment is initially unknown and the agent interacts with the environment in order to learn, whereas for planning, a model of the environment is given and the agent plans in this model without external interaction. This is sometimes also called reasoning, pondering, thought, search or planning, depending on who you ask; some of these terms are maybe a little more concrete or technical than others. The main distinction is whether you consume new data to do new things. I would say something is learning when it consumes new data and uses that to improve its internal representations, its internal policy, or its internal knowledge, any of those, whereas planning is more something that happens internally, without additional data coming in. And of course these things can be combined: you could be learning, and at the same time, maybe as something like a background process, you could be planning to improve the results coming out of the system as a whole.

So, more terminology. I already mentioned these terms, but let's also be clear about the difference between prediction and control; these are distinct goals that you might have. We have so far mostly talked about control. Control is a word that is used to basically mean "find the optimal solution"; it comes, obviously, from control theory, and sometimes this can be slightly confusing because some people think controlling something means keeping it in a certain state. Indeed a lot of control theory applications are like that, where you want to keep a plant in a certain state, a certain equilibrium, something that doesn't go haywire, but we use it more generally to also mean finding something that just optimizes your return. Prediction, separately from that, is to evaluate the future, to be able to predict what's going to happen. These are strongly related to the concepts we talked about: prediction can be encoded in terms of value functions, and control can be encoded in terms of an optimal, or approximately optimal, policy. They're strongly related in the sense that if you could somehow know the value functions for all policies, then the optimal policy would be the one with maximum value over all policies. So again, this is a question that I posed at the beginning: if we could predict everything, do we need anything else? It's kind of an open question, but it's kind of hinting towards an answer.

So what we'll want to do is learn components of an agent, and specifically we know that these components are basically all functions: policies map states to actions; value functions map states to values, which are basically just numbers; models map states to states and/or rewards, which are again just numbers; and state updates map states and observations to new states. All of these are just functions, so one way we could proceed is to model these functions as learnable functions, for instance as neural networks.
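Schematically, these components can be written as function signatures, each of which could then be represented by, say, a neural network. This is just an illustrative sketch; the placeholder type names are mine, not from the lecture:

```python
from typing import Callable, Tuple

# Placeholder types: in a real agent these would be arrays, tensors, etc.
State, Observation, Action = object, object, object

Policy      = Callable[[State], Action]                        # state -> action
ValueFn     = Callable[[State], float]                         # state -> value (just a number)
Model       = Callable[[State, Action], Tuple[State, float]]   # state, action -> next state and reward
StateUpdate = Callable[[State, Observation], State]            # state, observation -> new state
```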
We can then use a learning mechanism, for instance drawing on research from deep learning, to update these functions, and this is a fairly popular approach these days. However, we often violate standard assumptions from supervised learning and statistics, and it's good to be aware of that. For instance, you could use standard regression methods to do your value prediction, but many regression algorithms assume that the inputs and output labels you get are i.i.d., independent and identically distributed, and we also typically assume in supervised learning that the world is stationary: basically there's one mapping that you're trying to find and it stays the same. In reinforcement learning we typically violate these. One clear way in which we violate them is when we try to do prediction and control at the same time: we're trying to improve our policy while trying to predict things about the future, which is actually a very common thing to want to do, but that means that our predictions, which might be conditioned on our current policy, become somewhat outdated, they become stale, they might not be completely accurate anymore. The data that comes in is not stationary; it depends on the current policy. Moreover, and maybe even more importantly, the data coming in is not i.i.d., it's not identically distributed, it's sequential data, so the data coming in from one time step to the next might be very correlated, and standard regression methods may or may not deal well with that.

So deep reinforcement learning, which is basically the field of research that deals with combining reinforcement learning, combining these concepts, with deep learning as a tool to learn these functions, is a very rich and active research field, because you can plug in learnings from deep learning in many places in reinforcement learning: you can identify these things as being functions and then try to learn those functions by defining a well-defined loss on them. But you still have to be aware that you're doing something slightly different from what people are typically doing in, say, deep learning, so the same methods might not apply equally well. I should also say explicitly: neural networks are not always the best tool to solve something, that is the hope of some people at least, but they do often work quite well, so it's worthwhile to consider them.

As an example, as I said before, the Atari games were actually learned with reinforcement learning, and specifically with deep reinforcement learning. Schematically that looks a little bit like this: there's an observation, which is just the raw pixels of the screen in this case, and the output action is a joystick action. This picture may be somewhat misleading, because the joystick kind of looks as if it's an analog thing, but it isn't: there are only eight directional actions that the joystick can register, and then there's a fire button. So you could either not move or move in any of these eight directions, and then you could do the same nine things while pressing the fire button at the same time, so in total there are 18 different actions; it's actually a discrete action space. The goal is to pick actions on the joystick that improve the rewards, specifically the cumulative discounted future reward, and the rewards are defined as the change in score from one time step to the next. This means in particular that if we don't discount, the accumulation of these rewards would be your score at the end, so that seems like a fairly natural thing to try to achieve in these Atari games.
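In symbols (the return $G_t$ is written here in its standard form, which isn't spelled out in this part of the captions):

$$R_{t+1} = \text{score}_{t+1} - \text{score}_t, \qquad G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

and with no discounting ($\gamma = 1$) over an episode that ends at time $T$, the sum of rewards telescopes to the final score:

$$\sum_{k=t}^{T-1} R_{k+1} = \text{score}_T - \text{score}_t.$$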
Now, we were talking about learning versus planning, and in Atari, for instance, you could also think about planning instead of learning. The rules of the game could be known, and we could query the emulator, which is a perfect model; this actually turns out to be the case, and Atari is also deterministic, which makes it easier. There's still a fairly big branching factor in terms of the number of actions you can pick, but there's no additional noise coming into play, so you can take an action, see what ends up being the case, and repeat the process: you can search this whole tree. But of course you can only do this if you have access to the simulator, and this turns out to be harder if you actually play one of these Atari games: if you try an action you're already at the next screen and you can't go back. So this requires you to first have a model. That said, it actually works quite well if you do this on Atari; you don't even have to look at the screen, you can just build a whole tree of all the actions you could take, and because of the determinism that tree is fairly big but not that big, and then you can just plan through the tree and see if you can find good scores. Typically we don't want to do that, because it kind of overfits to the determinism of the domain, and in a stochastic domain this wouldn't work that well. So what we typically do is use algorithms that could also be applied to a stochastic domain, and then hopefully they work in both cases. Later versions of the Atari emulator that we use, both in DeepMind and externally, also add noise: they basically make the actions that you take a little bit sticky, so sometimes an action takes a little longer or shorter, which means you can't predict the next state deterministically from just the state and the action.

In order to learn, we must get useful information, and one very important challenge and currently active area of research within reinforcement learning is the topic of exploration, as we often call it, though it's maybe more precisely the topic of exploration and exploitation. The idea is basically this: we want to learn by trial and error, we want to learn by interaction, which is a different way to say the same thing, and the agent should discover a good policy from new experiences without sacrificing too much reward along the way. Now, trying to discover something, that's the exploration part: you're trying to discover new things that can help you build more information, that can help you build good internal value functions and policies and whatnot, and you want these to be informed as much as possible by different interactions with the world. But if you explore too much, your performance will be quite poor, and additionally you might only see a somewhat uninteresting part of the state space: think of a robot that can walk around but only turns randomly and moves randomly; it might stick to a very small portion of the room. The alternative is to follow what you currently think is the best possible thing you could be doing; this is what we call exploitation. The benefit of that is that it gets you to new interesting states, and it gets you good results along the way.
This can be important, for instance, if you're learning online: it might be important not just to find a good solution, but also to perform well while you are learning. This is related to the distinction that I made earlier between the goal being either to find a solution or to adapt while you're finding a solution, and sometimes it's really important to perform well while you're trying to find something better. Think, for instance, of the advertising example that I gave earlier, where you're trying to place ads on a site and these are your actions: maybe you don't want to lose too much reward along the way while trying to find the best possible policy.

So the problem then is how to balance this exploration versus exploitation, and basically the truth of it is that we have excellent algorithms to do this in the simplest possible case, where we have basically one state and only a bunch of actions; this is the multi-armed bandit setting. There we have mechanisms that basically work optimally, in the sense that they don't take suboptimal actions more often than they need to, and they take the optimal action basically as often as you possibly could with any learning algorithm, which is quite impressive. But that's a simpler case than the full reinforcement learning problem, and for the full reinforcement learning problem we don't have that yet; we don't have the best possible trade-off algorithms for exploration and exploitation, and it's a hard problem in general, so it's an active area of current research. So, as I said, exploration is about finding more information, and exploitation exploits the known information to maximize the current reward, and it's important to do both. It's also good to know that this is a fundamental problem inherent to the interactive nature of reinforcement learning: it doesn't occur in supervised learning, because there the data is already there. Of course there is another subfield called active learning, which is basically about how you then select your data, and this is very related.

Some simple examples: say you're trying to find the best possible restaurant. Exploitation would say go to what is currently your favourite; exploration would be to try something completely new, maybe somewhat informed, and maybe this is better than anything you've ever seen, and then you might go there again and again, or maybe you skip it next time and go back to the old one. In game playing, you might play the move you currently believe is best, but sometimes you might want to try a new strategy, maybe a move you've never played before, just to see what happens. This will give you more information; it might be a bad idea, you might never do it again, and that's fine, next time you might exploit in the same situation, but it's important to trade these off.
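As one concrete, and far from optimal, way to make this trade-off, here is a minimal sketch of ε-greedy action selection, which explores with a small probability ε and exploits otherwise. This particular mechanism is my illustration; it is not the (near-)optimal bandit algorithm alluded to above.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon (explore),
    otherwise pick an action with the highest estimated value (exploit).
    q_values: dict mapping actions to current value estimates."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: try something new
    return max(q_values, key=q_values.get)     # exploit: current best guess
```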
Okay, now to make a lot of the concepts that we touched upon a bit more concrete, here's another small example. This is a tiny grid world where you can move in any of the four directions, and you get minus one when you bump into a wall; there's a discount factor here as well, of 0.9. You can get some positive rewards: if you're in state A, then whatever action you take from that state, instead of moving in that direction it transports you to the state A' at the bottom and you get a reward of plus ten; similarly, when you're in state B and you take any action there, you're transported to B' and you get a reward of plus five.

On the right-hand side we see the value function for the uniformly random policy in this problem, and basically what we see is that the value of state A is the highest of all, which makes a lot of sense because it's just before you get a large reward, and there's a discount factor, so you care more about near-term rewards than about later rewards. If you didn't have a discount factor here, there's no termination, so your values could potentially be infinite, which would be a bit of a problem; the discount factor makes sure the values are well defined, but it's also a somewhat natural thing: you want your high rewards fast.

Now we could also ask what the optimal value function is and what the optimal policy is for this problem, and it might not be immediately obvious, because you could go to state A and repeatedly get a pretty high reward, but the transition transports you quite a bit away, whereas if you go to B you only get bumped two states away, so you can go back and get the lower reward of plus five more often. So it's an interesting question: where should I actually go, A or B? There's a trade-off here, but it turns out that if you solve this there's a clear winner: the value in state A is definitely higher than the value in state B, even if you're exactly in the middle between the two. If you look all the way at the right-hand side, there are two states that have arrows in all directions; this is because any action in state A and any action in state B leads to the corresponding state you're transported to. But if you look at the state in the middle of the top row, it has an arrow pointing left, because the value of state A is actually higher than the value of state B. Now, this is a function of the discount factor and of the rewards you get on these transitions; the main thing this is intended to highlight is that it's sometimes not obvious what the actual optimal solution is, but you can find it automatically and then just follow it. It also turns out that the optimal policy often doesn't distinguish between several actions; oftentimes there are two arrows. This is because this is a very simple MDP, very symmetric in a sense, and you can often go left first and then up, or up first and then left, and it's exactly the same value. Of course in many real-world problems there will be slight differences, and it might not matter too much whether you go up or left first, but it might matter a little bit. In practice we typically don't necessarily care about the exactly optimal solution; we care about something that is pretty good, close to optimal, rather than exactly optimal.

Now, how did we solve this, how do we find these values? [question] Okay, so how do we find these values? Well, this we could do with planning; we could use dynamic programming specifically, and I'll go a little bit into that. As I said before, I actually care more about learning algorithms myself than planning algorithms, but it's good to go through these things, because they give you a very good intuition about what's going on with the learning algorithms as well, and they're also very good algorithms if you can apply them. So there are four main Bellman equations. The first two are for state values: the very first defines the state value for a given policy π, and the second one
defines the state value for the optimal policy, which we call v*. There are two analogous versions for state-action values, one for the current (or some given) policy π and one for the optimal policy, q*. By definition there can be no policy with a higher value than v*. This is assuming MDPs, Markov decision processes, so we can just think of the state as being the Markov state of the environment for now. There are of course equivalences between these: in particular, the state value for a policy is the policy-weighted sum of the action values for that same policy, and the optimal state value is the maximum over actions of the optimal action values.

We can also talk about what it means to have an optimal policy, and in particular we can define an ordering over policies, where we say that a policy π is better than a policy π' if it has a higher value in all states. Again this is assuming the Markov property holds, so we have a Markov decision process, and then for any Markov decision process it turns out there exists an optimal policy that is better than or equal to all other policies: there is some π* that is, according to this ordering, greater than or equal to all π. There can actually be more than one such policy, but all of these policies will have the same optimal value v*. (I see I turned the subscript into a superscript there, but it's still the same function, sorry about that.) Why can there be multiple optimal policies? We actually already saw an example of that in that small grid world, where going up and then left has the same value as going left and then up: the policy of going up in that state and then following the optimal policy, and the policy of going left in that state and then following the optimal policy, are different policies, but they have the same value.

Now, how do we find an optimal policy? One very simple way is to first find the optimal values; that part might be hard, but if you have the optimal values, then finding an optimal policy is trivial, because you can just take the greedy action with respect to the optimal action values, and this is guaranteed to be an optimal policy. This basically means that if we can solve for the optimal value function, that is sufficient to find the optimal policy. There is always a deterministic optimal policy for MDPs, as said before (maybe with the adjective "finite" in front), and if we know this optimal value function, we have an optimal policy kind of trivially. If multiple actions maximize q*, because there can be multiple optimal policies, we can just pick an action arbitrarily, including stochastically; you are allowed to have a stochastic policy then, but you could also pick deterministically, which is why there is always a deterministic optimal policy.

Now, the Bellman optimality equations: let's go back and look at those. The second equation and the fourth equation are nonlinear, because they have a max in them, and that turns out to be important, because the linear versions we can actually solve directly, for instance by writing them out as vectors and matrices. These are fairly big vectors and matrices: there is a vector of state values for all of the states in your MDP, and if you have a big MDP that's a big vector, but you can solve that Bellman equation analytically. For the nonlinear versions you can't, and therefore what we typically do for these nonlinear Bellman
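For reference, the four equations just described, in standard notation (the slide itself isn't in the captions, so this is the conventional form; the max over actions in the second and fourth is what makes them nonlinear):

$$v_\pi(s) = \mathbb{E}\big[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ A_t \sim \pi(S_t) \big]$$

$$v_*(s) = \max_a\, \mathbb{E}\big[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s,\ A_t = a \big]$$

$$q_\pi(s,a) = \mathbb{E}\big[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\ A_t = a,\ A_{t+1} \sim \pi(S_{t+1}) \big]$$

$$q_*(s,a) = \mathbb{E}\big[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s,\ A_t = a \big]$$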
equations is to solve them iteratively. This means that we build up solutions and then incrementally improve them over time by making more and more updates; I'll give you an example of that in a moment. There are many different iterative solution methods. When you have a model you can use dynamic programming: if you have, say, the true model of your problem, then you can use dynamic programming, and in particular you might use an algorithm called value iteration, or one called policy iteration. If you don't have a model, you'll have to use samples, and then one way to go is to use a Monte Carlo sample, by which in reinforcement learning we typically mean a full sample of the return, or you could use algorithms called Q-learning or Sarsa, which are sample-based algorithms to learn action values, and I'll tell you more about those in a moment.

[question] Yes, for the Bellman optimality equation there's only one optimal value function in the Markov decision process case. One way to think about that: all of your states are kind of independent in some sense, because they're Markov, so you can think about the value of each state independently, and if you can do something to make that state's value better, this improves your policy overall according to the ordering we'll define, because it'll be the same in all other states but better in this one state, and by our definition that means it's a better policy.

[question] So, in general, this is a very good question: if you have a tabular representation, so you can represent the value of each state independently, you can actually find the exact optimal value function, and this is guaranteed to be the exact solution. Typically we don't have that, so typically you are in one of two other cases. One is where you have some features and you build a linear function of those; that case is actually pretty well understood, and there are proofs that certain methods will then also converge to a fixed point, which is well defined and we know what it is, so we can basically say it'll definitely go there in the limit. If it's a sample-based method you can't say much more than "in the limit"; you could say something about the rate, of course, but it's never going to be exactly there, because there's noise, although if you decay your step sizes appropriately you do know that it will eventually end up at the fixed point. For nonlinear functions less is known, essentially, as always, but the linear case is already a pretty strong result, and for the tabular case, showing that these methods get there is also a non-trivial result and quite interesting. So I'll give you an algorithm that does that, which is dynamic programming.

I thought this quote was quite fun, so I'll just read it out loud. This is by Richard Bellman: "The 1950s were not good years for mathematical research. I felt I had to shield the Air Force from the fact that I was really doing mathematics. What title, what name, could I choose? I was interested in planning, in decision making, in thinking. But planning is not a good word for various reasons. I decided to use the word 'programming'. I wanted to get across the idea that this was dynamic, this was time-varying. I thought, let's kill two birds with one stone: let's take a word that has a precise meaning, namely
dynamic, in the classical physical sense. It is also impossible to use the word dynamic in a pejorative sense; try thinking of some combination that will possibly give it a pejorative meaning, it's impossible. Thus I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."

So this kind of shows, because the name is sometimes a little bit confusing, why dynamic programming is called dynamic programming; maybe it's a somewhat arbitrary term and not very descriptive. There's a different definition in the book by Sutton and Barto, which is a very good reference for reinforcement learning; if you want to read more about anything that I said today, go read that book, it's called "Reinforcement Learning: An Introduction". Rich Sutton defines it as follows: dynamic programming refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process. So I'll next discuss a couple of methods that can be used to solve Markov decision processes, and all such methods basically consist of two important parts: policy evaluation and policy improvement. The reason this is important is, one, to understand where all the other methods come from, for instance the sampling-based methods that I personally do more research on; these two parts are also important for our sampling-based methods.

Okay, so first we'll briefly discuss how to estimate the value of a certain policy, and the idea is quite simple: we have an equation there, the equality at the top, and the idea is to turn it into an update. We initialize our value function at some initial guess v_0 (arbitrary, rather than random; for instance we set it to zero for all states), and then we iterate this equation, which says that v_{k+1} is set equal to the expectation of the immediate reward plus the discounted value of the next state according to our previous estimate v_k. Note that if v_{k+1} equals v_k for all states, we must have found the value as defined up there, because apparently we have an equality. The non-trivial part, but it is true, is that it actually goes there: eventually they become equal.

One important thing: we are using bootstrapping. This is a term that is used a lot in reinforcement learning, and it refers to the fact that we're using the value estimate from the previous iteration to stand in for the remainder of the return. What is the alternative? We could instead write the whole discounted return there, and if we did that we wouldn't even need to iterate, we could just do it in one go, but this requires you to compute an expectation which might have a huge branching factor: you're basically building the whole tree of possibilities and then backing that all the way up into your value, and that can be quite expensive. This one-step update, in contrast, only requires us to look one action into the future; you do have to replace the rest of the return with something else, because we don't have the full return, we didn't compute it, and for that we're using our current estimate of the value of that state.
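A minimal sketch of this iterated update, assuming a small tabular MDP given as explicit dynamics P[s][a] = list of (probability, next_state, reward) tuples and a policy pi[s][a] = probability of taking action a in state s (these data structures are my choice for illustration, not anything from the lecture):

```python
def policy_evaluation(P, pi, gamma, num_iterations=100):
    """Iterate v_{k+1}(s) = E_pi[ R_{t+1} + gamma * v_k(S_{t+1}) | S_t = s ] for all states.
    Terminal states can be modelled as absorbing states with zero reward, so their value stays zero."""
    num_states = len(P)
    v = [0.0] * num_states                      # arbitrary initial guess, here all zeros
    for _ in range(num_iterations):
        v_new = [0.0] * num_states
        for s in range(num_states):
            for a, action_prob in enumerate(pi[s]):
                for prob, s_next, reward in P[s][a]:
                    # expected one-step reward plus discounted bootstrap on v_k
                    v_new[s] += action_prob * prob * (reward + gamma * v[s_next])
        v = v_new                               # synchronous update of all states at once
    return v
```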
Now, does this always converge? The answer is yes, under appropriate conditions: for instance, the discount factor should be smaller than one for the general case. Here's a very simple proof sketch for the continuing, discounted case. What we do is look, over all states, at the maximum difference between the value estimate v_{k+1} at any state and the actual value of that state, so we're maxing over states. Now expand both of those using their definitions: v_{k+1} was defined as a one-step look-ahead with the reward and then v_k, the estimate we're bootstrapping on, whereas the actual true value we can expand similarly into one step and then bootstrapping on the actual true value at the next state. Now, the expected rewards in both terms are equal, and since the expectation of a sum is the sum of the expectations, we can subtract these rewards from each other and they disappear. What's left is the expectation of the value at the next state according to v_k versus according to the true value function, and we can pull the discount factor all the way outside because it multiplies both of these (this is for a fixed, constant discount factor which is the same everywhere, for simplicity). Then we can write an inequality: we have a max over states of an expectation over the next state, which we can bound by just a max over states. Now, the discount factor in front is a number smaller than one, so if we repeat this process over and over again (and we don't actually have to repeat it that often), the difference between v_{k+1} and v_π becomes small, it goes to zero, and when it's zero this means we have found the actual values.

[question] Yes, v_π is the actual value of following policy π, so it's the expected one-step reward plus the discounted value of the next state, or equivalently it's just the expectation of the full discounted return if you follow that policy. You could similarly prove, as somebody asked earlier, that the episodic, undiscounted case also converges; that's a bit harder to prove, but it also goes through, so it still converges to the right answer.
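In symbols, the contraction argument just sketched for the discounted continuing case (expectations are under policy π):

$$\max_s \big| v_{k+1}(s) - v_\pi(s) \big| = \max_s \Big| \mathbb{E}\big[ R_{t+1} + \gamma\, v_k(S_{t+1}) \mid S_t = s \big] - \mathbb{E}\big[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \big] \Big|$$

$$= \gamma \max_s \Big| \mathbb{E}\big[ v_k(S_{t+1}) - v_\pi(S_{t+1}) \mid S_t = s \big] \Big| \le \gamma \max_s \big| v_k(s) - v_\pi(s) \big|,$$

so each iteration shrinks the maximum error by at least a factor $\gamma < 1$.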
Now here's an example, again a small grid world. You get minus one on all transitions, and these two shaded corner states are terminal states: the episode ends there. It's again a simple problem. Let's say we start with a uniformly random policy and a value function that is zero everywhere. Now we apply the random policy in every state on every step, because this algorithm was defined to update all the states at the same time, which of course you can only do if the problem is small enough, but here we can. So we update all of the state values, and all of them look one step ahead to a next state, but we're going to peg the values of the terminal states to zero, because these are states you can never escape from; we're basically saying things stop there, you're never going to get any more reward once you enter these terminal states. That is how episodic problems are defined. In each of the other states we get a minus one, because all the rewards are minus one.

From this we also get a new policy, because now we can reason with a one-step look-ahead on the value function to see, from each state, what the value of each action is by looking at the next state it leads to; because we have access to the full model, we can see deterministically in which state we end up and just pick the best action. And what we notice is that, from the states next to the terminal states, going to the terminal state is now preferred over staying where you are or going to some other state, because its value has been pegged to zero while the other states now have a lower value. If you repeat this over and over again, you see more structure start to appear, and after three iterations we've actually already found the optimal policy, which maybe is not too surprising, because in this problem you can reach a terminal state from each state within three steps, so in this case three backups are enough. The value function keeps changing for a while, and in the limit it will find the true values of the random policy, because that's what we're evaluating here. And at the same time (sorry, I should have been clear about this from the get-go), the policy on the right-hand side is the current greedy policy with respect to the value function on the left, but we're not evaluating that policy, we're evaluating the random policy over and over again. So we get the values of the random policy, and we get the policy that corresponds to maximizing that one-step look-ahead; this is something we call policy improvement. Incidentally, in this case being greedy with respect to the values of the random policy actually gives you the optimal policy.

So this example shows that we can use evaluation, even of the random policy in this case, to then improve our policy, and in fact just being greedy with respect to these values, the actual values of the random policy, was sufficient here to get an optimal policy. That is not true in general: often you have to iterate this process a couple of times, where you make your policy a bit more greedy with respect to the current values, then re-evaluate that policy, and then repeat. This means we can iterate the policies as well. We first learn the actual values (let's pick action values here for ease of use), so we learn the actual q_π, which is defined as the value of taking action a and then following your current policy π, and now we can construct a new policy from this by being greedy with respect to these values. If you do that, you can show that the value of this new policy is an improvement over the value of the previous policy. If there wasn't a strict improvement, they can at least be equal; it cannot become worse, by definition, because we're maxing over something. So the only alternative is that the value of your new policy is equal to the value of your old policy, but that then implies something that is essentially exactly our Bellman optimality equation. So whenever the improvement stops, whenever a new policy does not attain a higher value through this process, you must have found the optimal policy. Not only does this process find the optimal policy, you can actually tell when it has found the optimal policy, which is quite useful.
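In symbols, the improvement argument just described (a sketch in standard notation, since the slide isn't in the captions):

$$\pi_{\text{new}}(s) = \operatorname*{arg\,max}_a\, q_\pi(s,a) \quad\Longrightarrow\quad v_{\pi_{\text{new}}}(s) \ge v_\pi(s) \ \text{for all } s,$$

and if the improvement stops, i.e. $v_{\pi_{\text{new}}}(s) = v_\pi(s)$ everywhere, then

$$v_\pi(s) = \max_a q_\pi(s,a) \quad \text{for all } s,$$

which is exactly the Bellman optimality equation, so π must already be optimal.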
Of course, this is restricted to the full planning case, where we have access to the full model; in each policy evaluation step we consume all of the states, and we even assumed that we compute the complete answer, the value of each policy. Schematically you can depict it like this: we start at some combination of policy and value, and the first step is that we go up, which means we find the value that is the true value of the current policy; then we make the policy greedy with respect to that value, and then we re-evaluate the new policy. This is a schematic drawing, the space here doesn't really mean anything, but what it's meant to show is that in some sense these things move closer and closer together; at some point they stop changing, and when that happens it means you've found the optimal value and the optimal policy, because that's the only place where they could stop.

[question] Yes, you could start with any random policy, you don't have to start informed, but this is exploiting the fact that we're updating all of the states: even if the policy you're evaluating would never reach one of those states, we're still updating its value. If you're going to do something sample-based this won't work, right, your random policy will simply not go to certain states, so you won't be able to learn about them; so this is limited to the full-model case. The picture on the left is meant to correspond to the one on the right: at first these things are different, from a certain policy you get a value, from that value you get a different new policy, you do that over and over again, and at some point they stop changing and you've found the optimum.

Alternatively: what we actually did for policy evaluation, let me go back to that, is that we turned the equality at the top, the Bellman equality, into an update. That was for the Bellman evaluation equation, where we're interested in the value of a certain policy. We could also take the Bellman optimality equation and turn that into an update. The Bellman optimality equation has this maximization over actions, and this turns out to be equivalent to the policy iteration algorithm we've just described, where policy iteration combines a policy evaluation step with a policy improvement step: it is equivalent to doing one step of policy evaluation, because the term on the right-hand side behind the max over a is basically doing one step of policy evaluation on the current policy, but then immediately maxing over the actions. So you don't have to commit to the current policy anymore, because this one step is fully defined by the action you condition on. (Sorry, there's a small typo there; of course it should say S_t = s and A_t = a, where that is the same a as in the max.)

[question] So, can we end up in a local optimum? It turns out no, not for full dynamic programming, which is quite interesting. Similar to the little proof for the policy evaluation case, you can prove for this one that it is actually a contraction, which means it keeps getting better until it stays the same, and then you've found the optimal solution. That's a non-trivial result again, and it's quite interesting, but it's also quite nice, because it means you can just apply this and be assured that you're going to find the optimal solution.
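A minimal sketch of that update with the max, i.e. the value iteration algorithm that is applied to a shortest-path problem below, reusing the same hypothetical P[s][a] = list of (probability, next_state, reward) structure as in the earlier sketch:

```python
def value_iteration(P, gamma, num_iterations=100):
    """Iterate v_{k+1}(s) = max_a E[ R_{t+1} + gamma * v_k(S_{t+1}) | S_t = s, A_t = a ]."""
    v = [0.0] * len(P)                           # arbitrary initial guess
    for _ in range(num_iterations):
        # one step of evaluation followed immediately by maxing over actions
        v = [max(sum(prob * (reward + gamma * v[s_next])
                     for prob, s_next, reward in P[s][a])
                 for a in range(len(P[s])))
             for s in range(len(P))]
    return v
```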
[question] Is that true with an infinite state space? Well, yes, if you could do these things, but you can't, because they require you to update all of the states at the same time. For an infinite state space you could either sample, and there are asynchronous dynamic programming algorithms that do that, or you could maybe somehow cluster states and then find a different MDP that kind of maps to the real MDP you're interested in and solve that, but of course if you do that, its solution wouldn't necessarily be optimal for the original problem anymore. More generally, if you have infinite states you're going to have to make some approximation somewhere, which means you might lose optimality just for computational reasons, barring very simple cases: there could be cases in which you can somehow update infinitely many states at the same time and still make it computationally feasible, but not for very interesting cases, I would say.

So now we're going to take this update with the max, the Bellman optimality equation turned into an update; this is called value iteration, and we're going to apply it to a shortest path problem. The idea is again that you get a minus one on each step, it's undiscounted, and the goal is to reach the terminal state in the top left corner; it's similar to the previous example, only we have one goal now instead of two. We start off with (well, it's labelled v_1 here, I guess I called it v_0 before) zeros everywhere, then we apply this update once and we get a minus one everywhere, because the first step always gives minus one. On the second step we see that the minus ones next to the goal don't change anymore, because their optimal choice is still to go to the terminal state immediately, but everything else goes to minus two. Then the minus twos don't change anymore, because they have a shortest path of two: minus two leads to minus one leads to zero. After a couple of iterations you can see the values basically tracing back, with as many steps as you need to reach the goal, and you find a certain value function. If you update v_7 here again, you would see no more change, which means this must be the optimal value function, and in this case it clearly is; it's very interpretable.

Now, dynamic programming is great, but it does require a full model, and therefore it's mainly applicable to small or moderately sized domains, as we briefly discussed just now. So we want two additional capabilities: one, we want to be able to learn from sampled interactions, and two, we do not want to require a model. Sampling is important for two reasons; one is that it doesn't require a model. You could say, aren't these the same thing? Well no, I wanted to state them separately, because you could also have access to a very big model that is just too big to reason about exhaustively; then you probably still want to sample, it's still useful to sample. Both of these are achievable, and in order to talk about that I'll switch to a different slide deck, just for my own convenience, no real semantic reason.

So what do we want? We want algorithms that can learn from samples of our interaction with the world, rather than only learning by planning. We want to be able to deal with real-world problems, which often have messy inputs and outputs; for instance, think of a robot which has a noisy camera sensor and that's about it: it doesn't have the real state, and you can't reason about planning through an MDP.
We want to deal with large problems: even on a robot with a small camera you typically have tens of thousands of pixels at least, and often even more than that, and you have to somehow deal with that; this is just your immediate input, just your observation, and it's already a potentially thousands-dimensional real-valued space. And we want to learn with little domain knowledge. This is maybe more of an aesthetic thing; if you have a specific problem you want to solve, it never hurts to put more domain knowledge in, you should definitely exploit it as much as you can. But what I'm personally interested in, and what we're interested in more generally in reinforcement learning, is to find very general methods that you can use on many different problems, and that maybe still allow you to put in domain knowledge for each of the different problems while sharing the mechanisms of the learning algorithms.

So let me see if this works... no, I don't think I can get the video to play, even by going out of the slide deck, so I'll just hand-wave and tell you, and you can use your imagination, which might be even more fun, especially since the slide deck isn't returning. What this was meant to show was an agent walking through a 3D maze environment. The point being that the agent can only see its immediate surroundings: it can see the walls, it can walk around and it can turn, but it can't see around the corner and it can't see everything there is to know. In addition, the visual input is quite rich: there are a lot of pixels, and for us it's quite obvious what's going on, because we're used to looking at these things, but from the robot's point of view it's just a whole bunch of different numbers. You don't know what they mean; brown is this one number and green is this different number (actually there are typically three numbers encoding each of these colours, RGB values), and you don't know what any of this means, and you just have to deal with that as an agent. These videos were meant to show the agent walking around, and this was actually done with reinforcement learning algorithms, so the agent has learned to deal with that.

This is a different example: what you see here is a gripper arm. It has a bunch of motors in all of the joints, so it can turn this joint and that joint, and the goal is to pick up a ball and place it at some target location; then we change the goal, we change the position of the ball and the target location, and it needs to do it again. It gets rewarded for putting the ball on top of the target location, or alternatively for bringing it as close as it possibly can. This is a continuous control problem, where the actions are in the continuous space of these motor controls, and again this was done with a reinforcement learning algorithm. And here's an example of such an algorithm.
This is again the Bellman optimality equation, this time for action values, and we can turn it into an update, as we already mentioned, by putting k+1 on one side and k on the other, but we can also turn it into a stochastic update, a sampled update. This is a temporal difference algorithm; that's the term Rich Sutton coined to describe this whole family of algorithms, and temporal difference learning works by looking at the temporal difference between the reward plus the value of the next state, and the value of your current state. That's the whole term between the brackets, and you can think of it as an error: we want the value of the current state, or state-action pair in this case, to be equal to the actual reward plus the value of the next state, and we try to make them equal by incrementally updating towards that target. That's one way to interpret this equation. If at some point the term between the brackets is zero, meaning the value of the state-action pair equals the reward plus the value of the next state, then we stop updating: the algorithm has converged, and it has found the optimal solution. In general there might be noise: the reward might be noisy, the next state might be noisy, so the error might never be exactly zero; what the algorithm then finds is a point where it is zero in expectation, which is exactly the equation at the top, so we're still good, and it does so with small incremental updates.

This alpha parameter that multiplies the term between the brackets is a small step size parameter; in the tabular case, where we store each value for each state-action pair separately, this might be something like one tenth, and the whole point of it is to average over the noise. If you're in a stationary setting you might want to decay it over time, to actually find the optimal solution in the limit, because then the updates become smaller and smaller and at some point you've averaged out all the noise. Note, though, that we're still bootstrapping: we're still using our current state-action values to bootstrap on. This means the update is actually non-stationary, so we can't, for instance, use a step size of 1/t and just average these things, because then we would be averaging quantities that include our old estimates. Instead, we typically have to use a step size that is somewhat larger; for instance, you could use a constant step size, in which case you won't actually converge to the solution, you'll keep jittering around it a little, but it will continue to adapt, and there are also choices in between, step sizes that decay but not quite as fast as 1/t, that will eventually get to the optimal solution.

This temporal difference learning algorithm is called Q-learning, for historical reasons, due to Chris Watkins, and this is also where the letter Q comes from for action value functions in general. Q-learning is actually doing something quite subtle and important: it's learning off-policy, it's learning about the value of the greedy policy while potentially following a different policy, and it does so through this max in the next state. Essentially, the whole thing is already conditioned on the action: we pick a certain action A_t in state S_t, we observe the reward R_{t+1}, we observe the next state S_{t+1}, and then we do this update.
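Written out, the update being described is the standard tabular Q-learning rule (the slide itself isn't reproduced in the captions, so this is the conventional form), followed by a minimal sketch assuming Q is stored as a dict of dicts:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big( R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \Big)$$

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step. Q[state][action] -> current value estimate.
    For a terminal next state you would use 0.0 in place of the max (not handled here)."""
    td_error = r + gamma * max(Q[s_next].values()) - Q[s][a]  # the term "between the brackets"
    Q[s][a] += alpha * td_error
    return Q
```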
This update is already conditional on having taken that action: each time we update this value, we have taken exactly that action, but in the next state we hypothesize that we will take the action with the maximum value, which is equivalent to saying that we hypothesize that in the next state we will be greedy with respect to our current action values. One way to think about this is that it is doing, similarly to value iteration, one step of evaluation and immediately a step of policy improvement, because we're looking at the greedy action. This also means we're learning off-policy, which is a term that means we're learning about a policy we're not currently following. What this means for Q-learning specifically is that we can approximate the optimal value function without ever following the optimal policy: you could continue to explore indefinitely, and in fact the convergence proof for Q-learning, for the tabular case where we store all these values separately, only requires that your policy tries each action in each state infinitely often, which is a mild restriction, and then it will converge. It doesn't say anything about having to make your policy more and more similar to the optimal one. If we have the optimal value function, then again we can very easily find the greedy policy, and therefore the optimal policy.

Okay, so this is fine for the tabular case, but we want to make approximations, because otherwise we won't have a practical algorithm, so here is a short detour about approximating functions. We want to learn policies, value functions and/or models in reinforcement learning, and each of these, as I said before, is a function: a policy, for instance, is a function from states to actions, and we want to learn those. Now, learning a function from data is a very well explored research field, so many examples exist. Say we want to learn a mapping from some input X to some output Y: for instance, X could be an image and Y could be a label, whether there is a cat or a dog in the image, or X could be the pixels of a video game and Y could be a prediction of the future score. Maybe we're not trying to control; maybe we have some fixed policy, maybe we're watching a human play, the data comes from a human playing, and we just want to make a prediction. There are many of these examples.

Our first approach might be to use tables: we keep an entry for each individual input, and in reinforcement learning we would then have state-action pairs as inputs, so we'd have a separate value for each of these, and in each cell of the table we update only that one cell whenever we take that one action in that one state, keeping everything else fixed. That is maybe the easiest setting in which to prove things, that it actually converges to the true values and whatnot, but it doesn't really scale well, for multiple reasons. One is that you have potentially huge memory requirements, because you have to store a huge table for all of your states and actions; you also have potentially huge data requirements, because you have to update each cell of the table enough times to get an accurate estimate of its value; and you have no way to learn about unseen events, which means you have no generalization, so whenever you go to a new state and try a new action, you have to learn its value basically from scratch.
This lack of generalization basically implies slow learning, which is very related to the second point of huge data requirements. What we actually want is that whenever we learn something about one state's value, we also learn a little about similar states; this is what generalization is. Here's a very simple toy example: say you want to learn housing prices. We could use the average of past sale prices per house, but if we use them separately for each individual house, we get very little data: how often is a house sold, even in a century? Only a couple of times, so you get some data, but not much, and because the underlying process is changing, it's non-stationary, just averaging the house price is obviously not going to give you a very good prediction. So what could you do instead? We want some generalization, and one possibility is to aggregate: you could cluster houses by street, or by size, or by colour, maybe you know in advance that some clusterings are more meaningful or useful than others, or you use some combination of these. You get more data per cluster: for instance, if each cluster contains three houses rather than one, you already get a lot more data per cluster, but there is still no generalization across different clusters.

So what we propose instead is something else, where we have a bunch of features, and I'll use φ to refer to the features, and we try to learn a linear function of those features. This is maybe the simplest step up from tabular: a linear approximation, where we simply say that our value function, or whatever function we're trying to predict here, will be some linear weighted sum, and the point now becomes to learn these weights. This generalizes the tabular setting, because you could, for instance, have a feature vector that is as long as the number of states and actions you have, where exactly one component becomes one and all the others are zero whenever you're in that one state taking that one action; then it becomes the tabular case again. But you don't have to do that: you could have a better representation that allows for more generalization.

In order to learn such a function, you need at least two things. One is an objective, a well-defined goal: what should this function represent? The other is a learning algorithm, something that yields an update to the parameters. People often conflate these two, but they're separate things: one is the actual thing you want, and the other is what you use to get it. In supervised learning and in deep learning these are often specified by losses, which is a very good, flexible way to specify them, and then the goal is basically always to minimize the loss, and the algorithm is then almost already given, as I'll talk about in a moment. A very common loss is the squared error: we just want our predictions to be close to the actual labels. So for some input X we have some parameters w, which we're going to update and which this function depends on, and we want this function of the input X and the parameters w to be close to the label Y, in expectation. We can call Y the target, and the expectation here is over the pairs of inputs and outputs; again, in supervised learning we often assume that this distribution is independent and identical, so it's stationary over time and you can sample from it independently. In reinforcement learning that's not really the case, but the goal is then to minimize this loss, whatever the loss is.
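In symbols, for the linear case just described (using the standard squared-error form; the exact slide notation isn't in the captions):

$$L(w) = \mathbb{E}\Big[ \big( Y - f_w(X) \big)^2 \Big], \qquad f_w(x) = w^\top \phi(x) = \sum_i w_i\, \phi_i(x).$$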
One way to do that is to take gradients. Now, I assume many of you have learned about gradient descent, or maybe used it, perhaps extensively, at some point, but I'll briefly go through it. The idea is quite simple: we take this loss and we compute its gradient. What is the gradient? By definition it's the vector in weight space that points in the direction of steepest ascent — the direction in which the loss changes the most for your current weights — and when we say gradient descent we just put a minus in front and go down rather than up. You can turn this into a stochastic update, which is then called stochastic gradient descent, and it's a very powerful algorithm because of its simplicity, because you can often calculate these things quite easily, and because it's an incremental algorithm. It is stochastic, so you'd better take small steps to average over the noise, but if you do, you can actually prove many things about it: it finds the minimum if your loss is convex — for instance, if you have a linear function that you're trying to fit and your loss is the squared loss, then your loss is convex, and in that case you will actually find the optimal solution eventually. Again there's a step size alpha, and you have to decay the step size appropriately if you want to actually find the optimal solution, but then it just works.

One thing I did here is write it down with a minus: the term here, minus alpha times and so on, is the negative stochastic gradient of our loss. I typically like to flip this around, because then it reads to me as if we're updating towards the target: the goal is to make the term between the brackets zero, and if you update enough this will happen in expectation, which means that this linear sum of the features will eventually be equal, in expectation, to the label Y. The way you can read this is that the change in weights is a step size, times a prediction error — Y minus the prediction that we're making, W times phi — times the features; that's the linear case. Now, this is a fairly simple algorithm, easy to code up, but it's actually quite strong: you can do many things with this, with suitably chosen features. Here is a small example where our feature vector is still a binary vector, but it's not one-hot — there isn't just one component that is one, there are two: maybe you don't just have a big house, it's also a red house, so these two features are on at the same time. Maybe there's some initial guess for the weights, say 10, 20, 30, 50, 70, and you have some observed target Y. If you have some step size, you'll notice that two of these weights change — in the tabular setting only one of them would change, here two change — and it makes your prediction better. These things are so common, so pervasive, that they've been made a lot easier these days: there are many modern software packages that calculate these gradients for you, so all you have to do is specify your function, specify the loss, and then you just tell your software "I want to do gradient descent, I forgot how it works, but you know how it works, you do it for me", and that actually works quite well. It makes it possible for people to easily do lots of quite impressive things without having to worry about small software bugs, because if you make a small error in, say, the calculation of the gradients, and it points in the wrong direction, things can get quite messy.
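A worked version of that two-feature example, again as a sketch: the initial weights come from the slide, but the step size and the target are values I made up, since they aren't given.

    import numpy as np

    w = np.array([10.0, 20.0, 30.0, 50.0, 70.0])   # initial weights from the slide
    phi = np.array([1.0, 0.0, 1.0, 0.0, 0.0])      # two active features, e.g. "big" and "red"
    alpha, y = 0.1, 55.0                           # illustrative step size and target

    prediction = np.dot(w, phi)                    # 10 + 30 = 40
    w += alpha * (y - prediction) * phi            # only the two active weights move
    print(w)                                       # [11.5 20.  31.5 50.  70. ]
    print(np.dot(w, phi))                          # 43.0, closer to the target 55

In the tabular (one-hot) case only a single weight would have moved; here the update is shared between the two features that were on.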
Okay, so this brings us to deep learning, because so far I've discussed linear functions. So what's wrong with them? I said they were quite good, and they are: they're great when you have good features. But what does it mean to have good features? There are a couple of requirements you have for your features, and the problem is that they're hard to guarantee in advance. One is that suitable solutions are a linear combination of those features — basically, if you only have linear functions, you'd better make sure that the appropriate solutions are within the space that you can represent. Additionally, you want the features to facilitate fast learning: the tabular setting, for instance, is linear — if you have a separate cell for each state it's also fully expressive — but it doesn't generalize, so it doesn't facilitate learning. Finding good features in general is a fairly hard thing to do, and there's lots of research on that as well, although less so these days for reasons I'll get to in a moment, and it might require substantial domain knowledge: you might need to learn a lot about the domain you're trying to solve before you can even define a good feature representation, and the quality of your solutions and the speed with which you attain them — so the amount of data that you need — depend very much on your choice of features.

Now, there is a different solution, which is to learn these features, and this is more or less what happens in deep learning. There are two ways to look at it: one is to say deep learning helps you learn features; a different one is to say we're just going to define a different function class that is not linear. Nonlinear is basically by definition more general than linear, which means it's a larger function class, which means it's more flexible, so it's much easier to define a nonlinear function class that has good solutions inside of it than it is to define a linear one. In fact, what people often do is define nonlinear functions — neural networks — that just consume the raw inputs; we don't even define features anymore, because the function is allowed to be nonlinear and we're not restricted to a linear function of the inputs. That means these functions are very flexible, which means we can often find good solutions even without any feature engineering, and that's the main selling point, I guess. Neural networks are basically just a flexible and broad class of nonlinear functions, for all intents and purposes — you could follow a whole course on deep learning and learn much more richness than just that statement, but for our purposes that's what they are. The main idea is that they're built up hierarchically, which means there are several steps of computation stacked inside the network, and typically we also learn them end to end, which means we just take gradients with respect to this full nonlinear function and incrementally update its parameters in that fashion. You lose some guarantees when you do that, because there are lots of guarantees for the linear case that we don't, at least yet, have for the nonlinear case, but you gain a lot of flexibility, and in practice it works really well.
Now, how do we make a nonlinear function? One example is given here at the bottom, schematically and in math. The schematic is meant to show that there is some input, depicted as nodes, where the nodes are just numbers — say some features coming in, or some raw pixels — and then there are a bunch of lines into other nodes; these lines are basically the weights that we're learning. One way to think of this is that we have a vector as input, we multiply it with some weight matrix, and this gives us a new vector, and then we repeat, we do this over and over. But if we only did that we'd still have a linear function, so what we do instead, shown there on the right-hand side, is that whenever we have multiplied our input vector with some weight matrix, we then apply a non-linearity. These days people often use a very simple one, which basically says: for the output of each of these nodes, if it's negative I'll set it to zero, and if it's positive I'll keep whatever value it is. This is a nonlinear function which looks like a hinge, and it has a couple of benefits — it just happens to work well in practice. It's not a given that it will stay that way; this might change, people are actively investigating other non-linearities, and a lot of algorithms don't particularly care which non-linearity you use, but in practice this specific one tends to work fairly well.

Now, what does it mean to be in a nonlinear space? One thing that might happen — this is just a cartoon of what might happen — is that we have a loss function with a high loss there at the bottom left and a very low loss at the bottom right; maybe that dark blue at the bottom right is zero. But because we're following a local update, the gradient, we might actually be pulled to the top right, where the loss is lower than where we currently are, without ever finding the optimal solution. This happens because our loss function is no longer necessarily convex: it can have multiple optima. Think of a mountain range — if you walk down the slope of a mountain range, you're not guaranteed to end up in the lowest valley; that's essentially what's happening. So, as I said, for our purposes a neural network is just a parametric nonlinear function, and sculpting and optimizing these functions is something of an art, which is what a lot of deep learning research is about, but it's also much easier now than it used to be, because we have more compute, better simulators, better pre-built optimizers — people have implemented these for you — and better software packages like TensorFlow and others.
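Here is a rough numpy sketch of the kind of network just described — matrix multiply, hinge non-linearity, matrix multiply — with arbitrary sizes of my own choosing:

    import numpy as np

    def relu(x):
        # The hinge non-linearity: negative values become zero, positive values pass through.
        return np.maximum(x, 0.0)

    def two_layer_net(x, W1, b1, W2, b2):
        # Without the relu in between, the two matrix multiplications would
        # collapse into a single linear function of the input.
        h = relu(W1 @ x + b1)
        return W2 @ h + b2

    rng = np.random.default_rng(0)
    x = rng.normal(size=8)                            # input vector: features or raw pixels
    W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
    W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)
    print(two_layer_net(x, W1, b1, W2, b2))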
Now, one function that we talked about is the state update function, and this looks very similar, but it's meant as a somewhat more general statement: one common way we use these nonlinear functions is to deal with sequences of inputs rather than a single instantaneous input, and one way to do that is to use a recurrent neural network, which consumes an input and the previous state of the network, and outputs an output and the next state of the network. This is the way people typically write down recurrent neural networks, but for the reinforcement learning setting, for the agent, we typically split it into multiple parts. We have one part, which we call the state update function, which only outputs the next state, and then we have separate parts, maybe multiple, that consume that state and give you the outputs: one might give you the policy, another might give you a value function — maybe both of those, maybe there's a model as well, and there might be other functions. In deep learning these are usually kept combined, because there we don't necessarily care about them separately as much, and of course you could always write it down that way, because it's just a well-defined function that consumes some inputs and gives you some outputs. Now, we can learn this mapping by using something called backpropagation through time. This is just stochastic gradient descent, but making sure we account for the fact that the previous state was itself also the output of the same function applied to different inputs, which means you can pass the gradient all the way back through multiple steps — although typically we use fixed windows for computational reasons. What this can provide you with is memory, because it becomes possible to change this function in such a way that the output many steps later becomes better according to some loss, by changing something that you did many steps before, just by following the gradients. One thing that is happening here is that the same weights appear in multiple places in this unrolled, larger function, but that's okay: we can still define the gradient, we can still follow it, and it will still lead us somewhere, and this can potentially help us in partially observable Markov decision processes.
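A rough sketch of that split, with a simple tanh recurrence standing in for the state update and two heads reading from the agent state; the sizes and the particular recurrence are my own assumptions, not the ones used in any specific agent.

    import numpy as np

    def state_update(state, obs, Ws, Wo):
        # Agent-state update: next state from previous state and new observation.
        return np.tanh(Ws @ state + Wo @ obs)

    def policy_head(state, Wp):
        # One head maps the state to action probabilities (softmax over preferences).
        prefs = Wp @ state
        e = np.exp(prefs - prefs.max())
        return e / e.sum()

    def value_head(state, Wv):
        # Another head maps the same state to a scalar value estimate.
        return float(Wv @ state)

    rng = np.random.default_rng(0)
    Ws, Wo = 0.1 * rng.normal(size=(32, 32)), 0.1 * rng.normal(size=(32, 4))
    Wp, Wv = rng.normal(size=(3, 32)), rng.normal(size=32)

    s = np.zeros(32)
    for obs in rng.normal(size=(10, 4)):   # a sequence of observations
        s = state_update(s, obs, Ws, Wo)   # unrolling this loop is what backprop through time differentiates
    print(policy_head(s, Wp), value_head(s, Wv))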
Now, in general, deep learning is the research and practice of training neural networks, and it has many different parts — it's a very active area of research right now. There are mathematical parts, which are about picking a function class and about the optimization process, defining and deriving optimizers, things like that. There's a lot of engineering as well, tinkering in architecture space — sometimes people find surprising things that work well without necessarily understanding exactly why, and that's perfectly valid — and how to include domain knowledge is an active area of research too. And there's of course also the science: what happens if you do this, what happens if you do that — you can ask scientific questions and just observe; think of it as a frog that you're poking, and then you just observe what happens when you poke the frog.

And that brings us to deep reinforcement learning. (By the way, what time do you normally end? Twenty minutes — okay.) So this will be the last part, where we combine these deep learning techniques with reinforcement learning, and the idea is quite simple and quite obvious by now, I guess: we're going to use deep learning techniques to learn policies, values and models to use in a reinforcement learning setting. That sounds perhaps simpler than it is, but it's also not all that hard: reinforcement learning provides a framework for making decisions, and algorithms, and deep learning provides tools to learn certain components — specifically, these functions. So then again there's the question I posed before — maybe AI is reinforcement learning plus deep learning, still with the question mark; I'm not going to take a stance on that — but concretely, we'll just implement reinforcement learning components with deep neural networks and then we'll see what happens.

To go into detail on a specific version of that, I'm going to talk a little bit about deep Q-networks. Now, what is a deep Q-network? It is what was used to produce the Atari videos that I showed earlier, where you saw an agent playing Atari games, and essentially how it works is that it does Q-learning — the same Q-learning algorithm that I showed before, but now with a nonlinear function — where the update to the weights is proportional to some step size, times a temporal difference error (the reward, plus the discounted maximum action value in the next state, minus the current estimate) — so this is a form of value iteration, but a stochastic form of it — times the gradient of the current value estimate. One way to think about this is that it's kind of trying to minimize the squared temporal difference error there between the brackets, but without taking the gradient with respect to the value at the next state. One way to think about that is causality: we want to update the value of our action towards this target of the reward plus the value of the next state, but we don't want to change the value of the next state to be a better target for our current value, because that violates causality — we only want to update in one direction. You could do the other thing as well; it just doesn't work as well in practice, and it feels less intuitive.

Now, what is this Q function here? It's some function that takes the state as input, an action as input, and a weight vector, some weights. And what more specifically was it? It was a deep neural network. How does this deep neural network work? There are a couple of inputs — in this case we're stacking a couple of frames, and that's all the memory we'll use, so there is no state update function; we just use the observation, but the observation will be your past four frames, and these are downsampled to 84 by 84 pixels, which is why we get 4 times 84 times 84 inputs. Then we apply a linear function on top of that, something called a convolutional layer. What a convolutional layer does is look at each little patch of the image with a small filter — say 8 by 8 pixels — and multiply those 8 by 8 pixels with some matrix to give a number of outputs; the number of outputs is a parameter, and in this case it was picked to be 16. You multiply each little patch with that matrix, and you do that for every patch in the image, in this case skipping over 4 pixels every time, so you basically divide the image up into a lot of patches and apply the same linear function to all of them. Then you apply that non-linearity I showed you earlier, and this gives you a new set of images — in this case 16 of them, which are now 20 by 20 because of the size and stride of the filter we used; that's a detail that's not really important. Then you can repeat that process — we do that again — and at some point we basically say: okay, we have a bunch of images, but they're not really images anymore, so we just stack them together, call them a feature vector, and apply another matrix multiplication to that. The exact details are not really important for this talk, but this is how the neural network was set up, and I can add that this is a fairly conventional convnet — a convolutional neural network — nothing too special.
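Sketched in PyTorch, the network just described might look as follows; the first convolution matches the numbers in the talk (16 filters of 8 by 8 with stride 4, giving 20 by 20 maps), while the second convolution and the hidden layer size are taken from the original DQN paper and should be treated as assumptions here.

    import torch
    import torch.nn as nn

    class AtariQNetwork(nn.Module):
        def __init__(self, num_actions):
            super().__init__()
            self.net = nn.Sequential(
                # 4 stacked 84x84 frames in; 16 filters of 8x8 with stride 4 -> 16 x 20 x 20.
                nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
                # Repeat: 32 filters of 4x4 with stride 2 -> 32 x 9 x 9 (sizes from the paper).
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
                # Stack the feature maps into one vector and apply further matrix multiplies.
                nn.Flatten(),
                nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
                nn.Linear(256, num_actions),   # one action value per possible action
            )

        def forward(self, frames):
            # frames: a batch of 4 x 84 x 84 observations; output: a Q-value per action.
            return self.net(frames)

    q_net = AtariQNetwork(num_actions=18)
    print(q_net(torch.zeros(1, 4, 84, 84)).shape)   # torch.Size([1, 18])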
It was of course tuned a little bit for the Atari setting, but there's nothing really fancy going on here in terms of deep learning; this is a fairly standard network, and for our purposes the most important thing is that it's some nonlinear function of the pixels that is not too expensive to compute. Then, as I said, this is trained by doing Q-learning, but there were a couple of innovations in the original DQN algorithm, which led to the Nature publication. One of these was the use of target networks. Basically, we noticed that if we really do the update up there, whenever we update the value of some state-action pair we might inadvertently also change the value at the next state, because of generalization: we're using the same weights for all inputs to compute our action values, so a single update might change values everywhere. That's a feature — we want generalization, because it helps speed up learning — but it can also lead to very awkward learning dynamics, where things spiral away. One way to prevent that from happening is to fix the target a bit more: noticing that the reward plus the discounted next-state value is just a stand-in for the actual optimal return, which we don't have, we can do something a little more stable — we make a copy of the online weights, W minus, which is just a copy of your normal weights that we keep fixed for a while — and this turned out to be quite useful for performance. Oh, sorry, here's another slide on that: the intuition is that changing the value of one action can cause the values of many other actions, in other states, to change as well, because of the generalization, and this means the network can, as I said, chase its own tail or spiral out of control.

A different mechanism in the DQN work was experience replay. This is a fairly straightforward idea, which basically means that whenever you interact with the environment, you store those samples in some big database, and then you can replay those samples. What this means is that the actual data the neural network — the deep reinforcement learning algorithm — consumes will be a bit more similar to what you'd get in normal supervised regression, because it will be more i.i.d., and it will also be more stationary in a sense, because it isn't tied as closely to your quickly changing policy. So the motivation was to make things a little bit more like the supervised learning setting that we already knew works well with deep learning, and it turns out to be a lot more data efficient as well if you do it like this. There have been many improvements to DQN since — this was, and still is, an active area of research — and I'll highlight a couple, partly to show what kind of ongoing research was happening in deep reinforcement learning and partly to show some larger trends. Something that might not be immediately obvious is that this replay database stores a whole bunch of transitions — state, action, reward and the next state that you actually observed. One way to interpret this is that it's basically an empirical, nonparametric model: it's something we can sample from, or, if you call it a model, sampling from it is querying it — it's a model we can query. In the original DQN paper this was sampled from uniformly at random, and obviously that's a valid thing to do.
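A simplified sketch of how the target network and uniform replay fit into one update step, assuming PyTorch and a replay buffer of tensor transitions; the hyperparameters and buffer interface are illustrative, and elsewhere the target network would be refreshed every so often with target_net.load_state_dict(online_net.state_dict()).

    import random
    import torch
    import torch.nn.functional as F

    def dqn_update(online_net, target_net, optimizer, replay, batch_size=32, gamma=0.99):
        # Sample past transitions uniformly at random from the replay buffer.
        batch = random.sample(replay, batch_size)
        s, a, r, s_next, done = (torch.stack(x) for x in zip(*batch))

        # Value of the action actually taken, according to the online network.
        q_taken = online_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

        # Bootstrap target computed from the frozen copy of the weights (W minus);
        # no gradient flows through it, so the update only goes in one direction.
        with torch.no_grad():
            q_next = target_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done) * q_next

        loss = F.mse_loss(q_taken, target)   # squared temporal difference error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()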
You could just do that — you can define it that way — but it doesn't mean it's the optimal thing to do, and if you view this as a model, then there's a question of whether we can use it more cleverly, and it turns out that yes, you can. One way, which I won't go into in more detail, is to use it to plan with: you could use the actual things that you've seen to do backups more often, to see if you can learn more from them, and to reason through trajectories that you could be following. The other one, which I'll explain a bit more, is that you can sample non-uniformly: you can pick the things that you want to learn about. One simple way to do that — this was explored by some of my colleagues at DeepMind — is to replay transitions proportionally to the error that they have, so you pick them more often when the error is large. (Sorry, I wrote W for the weights of the network before and it's theta here — both are used quite often in the literature, so I mixed them up, but theta is just the parameters of the network, the thing we're updating.) If you do this, it turns out it learns a lot faster and a lot better, and the intuition is that when there's a high temporal difference error, this is typically something that you can still learn a lot about. That's not necessarily the case in a stochastic environment, where you might have a high temporal difference error just because of noise, but in Atari the environment isn't very stochastic — or isn't stochastic at all — and then this is a very plausible thing to do. You could define other ways to sample, of course, and I'm not saying that this is the optimal way, but it turns out that sampling like this is already a lot better than sampling uniformly at random. It makes intuitive sense as well if you think about the update, because the update will be proportional to the magnitude of this error, so if you sample things where this error is zero, you're basically not doing an update — you're wasting compute — and if you only sample the things where it is high, you're wasting much less compute and doing much more useful updates. It does change the dynamics of learning, but it turns out that in general this just helps.

(Question: do we update the whole replay memory, going through it dynamic-programming style and updating everything at the same time?) We typically don't. We typically store a fairly large replay memory — tens or hundreds of thousands, or maybe even a million transitions — and then we sample from those while acting in the environment. The actual typical setup, which was also used in DQN, is that every four steps in the environment you sample 32 transitions at random from your replay and update your network with those, and the new samples just get added to the replay — we don't necessarily use them immediately to update, we just store them and eventually they'll get sampled. Because we're sampling 32 samples every four steps, we're going to see each transition on average eight times; of course, at random we will see some more often than others, but those are the actual numbers that we used. So you might actually never see some of the samples at all, just by chance, and if you sample proportionally to the error it becomes more likely that this happens, because you'll sample things with a high error more often. Of course, when you do sample them, we update, the error might go down, and you store them back with a lower error.
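A toy version of that prioritized sampling; the exponent, the small constant and the way priorities are stored here are simplifications of the actual prioritized replay work, so treat this as a sketch of the idea rather than the published algorithm.

    import numpy as np

    def sample_prioritized(td_errors, batch_size=32, exponent=0.6, eps=1e-3):
        # Sampling probability grows with |TD error|: transitions we were most
        # wrong about are replayed more often, instead of uniformly at random.
        priorities = (np.abs(td_errors) + eps) ** exponent
        probs = priorities / priorities.sum()
        return np.random.choice(len(td_errors), size=batch_size, p=probs)

    # Example: the transition with the large error dominates the sampled indices.
    errors = np.array([0.01, 0.05, 2.0, 0.0, 0.3])
    print(sample_prioritized(errors, batch_size=10))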
So you might not use them again, and things that got stored in the replay memory with near-zero error you might actually never sample at all — and that might be okay. (Question: where do these numbers come from?) Well, there are many numbers — this is a fairly complex system, right? There are many specific numbers: the size of the filters in the convolutional layers, the exploration, the step sizes, the batch sizes you were asking about. Many of them were very roughly tuned or just picked, because it's much too large a space to search exhaustively. (Do you have to do that for any problem?) A little bit of tuning, yes, you have to do for any problem. It does turn out that these days many methods are quite robust: for instance, this work was using RMSProp, which is a type of optimizer, basically an extension to stochastic gradient descent; these days many people use a different but quite similar optimizer called Adam, and that one turns out to be very robust, in the sense that you can apply it to many problems without really having to tune its parameters much anymore. But if you do tune, the performance will change a little, so it's always the case that if you have a certain problem and you tune all these parameters a little more, you will probably be able to push your performance upwards — maybe not big gains, but you will do at least a little better. You're touching on something that is a bit of a problem, but it's also kind of inherent: these are complex systems with lots of knobs, and we are not able to set all of them optimally, so we're going to have to do a little bit of search. (And if you don't tune them well, might you get undesirable learning dynamics?) Yes, this is true, and sometimes it happens, and actually let me show you an example — for a different reason — of how these things can go out of control. There are basically no guarantees, but it often works quite well; that's the current state of affairs, and it might be a somewhat unfortunate state of affairs, but it works so well that it's still interesting.

So let me show you an example of where things might fail. First I'll give you the solution, then I'll show you the failure. This is the DQN algorithm that we had up there before: the change in weights is proportional to the step size, times a temporal difference error, times the gradient of your function. Now, this is not actually quite what was used — as I just mentioned briefly, we weren't using plain stochastic gradient descent but something called RMSProp, a slight modification; that's kind of irrelevant for the point I'm going to make, it's just so you're aware — so this would be the stochastic gradient descent version of DQN. Now, you can write that differently, and it might make it a little more obvious what's going on: we take an action in a certain state, we see a reward R, we see a next state S prime, and what do we do then? We update towards the value of that immediate reward plus the value, in the next state, of the greedy action in that state. So we're maxing over the state-action values by doing an argmax to pick out the greedy action, and then we're evaluating that greedy action. But what we're doing is actually a little bit dangerous,
because we're using the same action values to pick an action and then to evaluate it, while we're still learning these action values, so they're going to be inherently noisy, approximate and wrong. Now let's assume you have a bunch of different action values and they're all a little bit noisy: if you're going to pick the greedy one there, if you're going to max over those, you're more likely to pick an action that you've overestimated than one that you've underestimated. So can we do something else? Well, yes. We can reuse the fact that we have two networks: we were updating the online network, with the bold black W in the top line, and we have this target network, as we called it, which is a slower-moving copy of that network. What we can do instead is still use the target network to evaluate the value, but select the greedy action according to the online network rather than according to the target network. These two networks will have slightly different predictions, because we only intermittently update the target network, say every 10,000 steps or so, and this decorrelates the selection of the action from its evaluation. It turns out this is already enough to mitigate the overestimation, and indeed we can see that here.

These are value estimates on two Atari games, Wizard of Wor and Asterix — you don't need to know anything about those games to understand this slide. What you see at the top are value estimates according to the action values that the agent has learned — note this is on a log scale — and we see that DQN has much larger value estimates than Double DQN, which uses that last update down here. Now, that by itself doesn't have to be bad: values could be higher simply because you're doing better — maybe the policy is better and therefore the value estimates are higher because the actual returns are higher. That turns out not to be the case, because the actual scores actually drop, and they drop almost exactly at the point where the value estimates from DQN go up. This is because of the bias we've just discussed, and we can see that because Double DQN doesn't have that problem: here the values do go up, though not as high as in DQN, but the scores also go up, and they seem very correlated. And indeed — I don't have that line here — we also compared the value estimates with the true discounted future return from that point onwards, and then we see that Double DQN is much more accurate than DQN; the actual value for DQN will actually be lower than that of Double DQN, even though it's estimating a much larger value. So apparently this is a real bias that is also hurting performance, because the performance really tanks there, for instance on Asterix for DQN. Then, if you compare this across the whole suite of Atari games, this is what DQN got, where the bars represent normalized scores: we took a human game tester who played these Atari games, and whatever score this human tester got we set at 100%, and we had a random agent just pick random actions and set that at zero, so we can compare across different games how close we are to the hundred percent the human tester got. That's actually the first line here in this plot, the human performance line.
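Written out side by side, the two targets differ only in which network selects the greedy action; a sketch assuming PyTorch-style Q networks:

    import torch

    def dqn_target(r, s_next, done, target_net, gamma=0.99):
        # Standard DQN: the target network both selects and evaluates the greedy action,
        # so noise that inflates a value also makes that action more likely to be picked.
        with torch.no_grad():
            q_next = target_net(s_next).max(dim=1).values
        return r + gamma * (1.0 - done) * q_next

    def double_dqn_target(r, s_next, done, online_net, target_net, gamma=0.99):
        # Double DQN: the online network selects the greedy action, the target network
        # evaluates it, decorrelating selection from evaluation.
        with torch.no_grad():
            greedy_a = online_net(s_next).argmax(dim=1, keepdim=True)
            q_next = target_net(s_next).gather(1, greedy_a).squeeze(1)
        return r + gamma * (1.0 - done) * q_next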
You see that in many games it's actually much better — the scale here is logarithmic above a hundred but, I think, linear below a hundred — on many games it's very close, on some games it's actually worse than human, and on some it's even worse than random, which is a bit odd. But then if we apply this Double DQN fix, which is basically a one-line code change — an informed line of change, of course — we see that it does a lot better basically across the board, and some of the games that were sub-random also got higher; in some cases a game might have been sub-random because the overestimation bias made things go completely haywire and settle into wrong solutions. So the take-home message here, apart from this specific algorithm, is that it's good to be aware of the actual properties of your learning algorithm: don't use it as a black box, don't just plug it in and hope for the best, but understand what it's doing, and then you can improve it. It's also very useful to track and analyze statistics, because if you can see these values escape and go off into unrealistic estimates, then maybe you can stop there and think: oh, there's something happening here, maybe I can understand what that is — and if you understand the problem, sometimes the solution is very simple. In this case, coming up with the fix is maybe the non-trivial part, but the implementation of Double DQN is like a one-line change of code to the DQN algorithm, and it leads to a huge improvement in performance. Similar things have happened many times in the past, so it's good to try to understand the low-level details of what's going on, and one way to do that is to track and plot lots of stuff whenever you play with these things.

I think I have time for one last topic and then I'll end — sorry, minus five, so I have time for no last topics. Okay, we could end here if you want; maybe anybody who wants to or has to leave can leave, and I'll just continue with a separate little topic on deep reinforcement learning, which is related to a larger narrative — you can understand everything up to now without going into this. This is going back to the reward function. We defined the reward in Atari as the change in score, and there are other ways you can define rewards: in the game of Go you could define it as a win or a loss, say plus one for a win and minus one for a loss — a very clear definition. In DQN it turns out all the rewards were clipped to the range minus one to one. Why? Well, this helps learning. And why does it help learning? These deep networks have all these hyperparameters we were talking about before, such as the step size, and it turns out that if you can make the problems much more uniform, make them all look alike, then it's much easier to set these hyperparameters so that you can use the same ones across all of these games — that was an explicit goal of the work on DQN, that the same algorithm could learn each of these games. But clipping does change the objective, and sometimes harmfully so. Now, is there a way around that? There is: instead of clipping the rewards, we tried an algorithm that adaptively normalizes the targets to lie within the range minus one to one, say, or to be roughly normally distributed. Here on the right-hand side you see a picture of the unclipped gradient norms — the size of the updates to the weights that you get; your update is typically proportional to this — across all these different Atari games.
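Clipping itself is one line, and a tiny example shows how it changes the objective; the specific reward values below are illustrative, not from the talk.

    import numpy as np

    def clip_reward(r):
        # DQN-style clipping: every reward is squashed into [-1, 1].
        return float(np.clip(r, -1.0, 1.0))

    # A small reward and a much larger one become indistinguishable after clipping,
    # which is exactly the change of objective discussed above.
    print(clip_reward(10.0), clip_reward(200.0))    # 1.0 1.0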
There are a bunch of little dotted lines in there, each for a specific Atari game, and then there are a bunch of coloured regions, which are basically percentiles — 50, 80, 90 percent — across these games; this is 57 games, so they're fairly meaningful percentiles. What we see is that if we don't clip the rewards, the gradient norms span something like seven orders of magnitude, and it's very hard to come up with a learning algorithm that can deal with that variety. If we clip the rewards, everything is much more stable — that's the middle plot — but it turns out that if you use the adaptive normalization, which we call PopArt, the range is even smaller, and interestingly this is without changing the objective, because when you clip the rewards you do change the objective. What we're doing instead is basically what is sometimes called whitening — very often used in deep learning at the input level, but here we do something similar at the target level — and this turns out to stabilize the gradient norms quite a bit, which hopefully makes learning easier.

Then, when we apply this to the Atari games, something interesting happens. This is the relative performance of using that normalization versus not using it, across all of these Atari games, and it turns out the median or mean performance difference is about zero, which was quite surprising and coincidental — but look at the y-axis: these are normalized differences, and remember that 100% is the difference between random and a human tester, so the differences are quite substantial. What we see is that a lot of games got a lot worse, a lot of games got a lot better, and there's a bunch of games in the middle that didn't change much. So what's going on, why is this happening? Let me see whether this works — there's a video, and you can also find it on my website. Essentially, what happens on, say, Pac-Man — one of the games that got better — is the following. Pac-Man goes around this maze and eats pellets; it can eat power pellets, which turn the ghosts into something it can chase, and then it can eat the ghosts. If you clip the rewards, Pac-Man cannot see the difference between eating a pellet and eating a ghost, so it doesn't care: it just goes for the pellets — they're just there, you eat them, you're happy. But if you don't clip the rewards, and use this adaptive normalization to still stabilize the learning process, Pac-Man basically becomes a ghost hunter: it goes for the power pellet, chases the ghosts, goes for the other power pellets, waits for the ghosts at the entrance — it's kind of fun to watch — and the overall score on Pac-Man is better. Then there are other games — there's a link there, you can see them on my website — for which the score got a lot worse, and often this is quite interpretable. There's a game somewhere down here called Time Pilot, a shooter game where you're flying around and shooting things, and the clipped version just shoots everything. If you don't clip the rewards, it actually doesn't care so much about shooting everything — it seems somewhat lacklustre — until it spots a blimp, and then it really goes for the blimp and tries to shoot it, where the clipped one didn't care. The reason is that when it shoots the blimp it goes to the next level, but the next level looks different: the background is a
different colour, there are other enemies, everything's different — so it's hard to generalize from the first level to the second one, and it turns out that in total the score is lower. But I would argue it's still doing the right thing, in a sense: it's trying to play the game the way it was intended; it just happens that the rest of our algorithm isn't quite good enough to then fully learn policies that are really good for this full game, with the changing levels and everything. Similarly, on Breakout you can see that the clipped version just hits all the blocks, and at some point learns the tunnelling trick, but before it learns the tunnelling it doesn't really care which blocks it hits, whereas the unclipped version really tries to hit the higher blocks quickly, because they're worth more points — and again this is harder, and the overall performance turns out to be a little bit worse.

But where this really shines is multi-task learning. Before, DQN was a single algorithm that could learn each of these Atari games separately, from scratch, with the same algorithm; but you could also try learning all of the games with a single agent, which is quite a hard thing to do — maybe not even really possible if you don't do something like this, because now the same agent needs to consume all of these different reward scales. Of course you could still clip, that might help and it's one thing you could try, but the other thing you can do is normalize, and then you can also trade off the losses from the different games and basically say maybe they're all equally important. So what we did is take an agent and make it play all of these Atari games basically at once, in parallel, and see how far we could get, and this is what happened. The orange lines at the top are the baseline agent with this adaptive normalization added — I won't go into detail on the agent itself; it's basically a version of REINFORCE, or A2C, if those terms mean anything to you (if they don't, don't worry about it), not a Q-learning-based agent but a policy-gradient-based agent, which I haven't had time to talk about, but a well-known reinforcement learning agent — and the baselines themselves, in blue and pink, are in the middle. When we apply the normalization we see that it gets a lot better, and in fact this is the first known agent that gets above human level on all of these Atari games at once; this is very recent work, so you're kind of getting a scoop here. Even more impressively: those middle lines are with clipped rewards, and if we use unclipped rewards the baselines just stay at zero — they're not precisely zero, I can tell you, but performance just tanks, it just doesn't do anything. And this makes complete sense, right? Some of these games have rewards in the thousands or even the millions, and other games have rewards of one or two, so it's very hard to have a single system that can consume all of those and learn about them. What will more likely happen is that you end up with a system that learns the one game with the highest reward magnitudes and just won't learn a good solution for any of the others. This adaptive normalization instead allows you to treat these games as being more or less equally important — or to weigh their importance however you want — rather than having the magnitude of the reward dictate it for you.
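Before wrapping up, here is a much-simplified sketch of adaptive target normalization in that spirit; this is not the actual PopArt algorithm, which additionally rescales the network's output layer so that its predictions are preserved when the statistics change, so read it only as an illustration of learning on roughly unit-scale targets.

    import numpy as np

    class TargetNormalizer:
        """Tracks running statistics of the targets and maps between raw and normalized scale."""

        def __init__(self, beta=1e-3):
            self.mean, self.var, self.beta = 0.0, 1.0, beta

        def update_stats(self, target):
            # Exponential moving estimates of the target mean and variance.
            self.mean += self.beta * (target - self.mean)
            self.var += self.beta * ((target - self.mean) ** 2 - self.var)

        def normalize(self, target):
            # The network is trained to predict this roughly unit-scale quantity,
            # whatever the reward magnitudes of the particular game are.
            return (target - self.mean) / np.sqrt(self.var + 1e-8)

        def denormalize(self, output):
            # Network outputs are mapped back to the true reward scale before use.
            return output * np.sqrt(self.var + 1e-8) + self.mean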
Okay, so this is where I wanted to end — thank you all for your attention. [Applause]
Info
Channel: Federated Logic Conference FLoC 2018
Views: 286
Rating: 5 out of 5
Id: z1fa6RYHNVM
Length: 99min 15sec (5955 seconds)
Published: Mon Jul 30 2018