Lecture 10: Reinforcement Learning

Captions
All right, still good on volume? OK. So today we're going to talk about reinforcement learning, and a lot of the machinery we saw in MDPs was really there to set up what we're going to do today and in the next lecture, so we'll be seeing values and Q-values all over again. Reinforcement learning is, in some sense, a very basic thing: you would like to teach an agent to act in an optimal way, and you do that by allowing it to try various things in the world and receive rewards. The idea, of course, is that it's supposed to act in a way that receives more rewards, and this slowly shapes its behavior. It turns out that although we can formulate the problem very simply, solving the reinforcement learning problem is very hard, and although we're going to make some progress on it in the next few lectures, we're really just going to scratch the surface. Some of these ideas will come back later in the course in new contexts.

So what is reinforcement learning? The basic idea is that you have an agent that receives feedback in the form of rewards, and you should think of these rewards exactly as the rewards were in an MDP: every time you take an action, you get a reward back in return. The flowchart you can see here is that you have an agent, and the agent takes actions in a certain state, just like in an MDP. Then it's the environment's turn. You take an action, but you don't necessarily know what the action is going to do. In an MDP you didn't know what the action was going to do because it was uncertain: there were multiple possible outcomes with probabilities attached to them. In the reinforcement learning context you also don't know what the possibilities even are until you experience them, so there are actually two kinds of uncertainty folded into one here. You take an action, and then the environment responds with, one, a resulting state that is the outcome of your action, and two, a reward. So you land in a state and you get some immediate reward.

The agent's utility is defined by the reward function, just like with an MDP, so your goal is to maximize the rewards, and there may be a discount associated with them: you want to learn to act so that you maximize the sum of expected (discounted) rewards. All of the learning you do is therefore based on what actually happens. You can't know, when you take an action and land in a certain state, what else might have happened; the only way to learn what else might have happened is to try it again and see if you get some new outcome. So in this whole context of reinforcement learning, all the learning comes from the samples we experience, and your experience might be typical or atypical, lucky or unlucky, so there's another layer of expectation here: you're not actually sure what experiences you're going to have.

Let's see an example of reinforcement learning in the real world. If you remember these AIBO soccer robots, we saw them in the first lecture as an example of how, if you have a dog-shaped robot, you might not want it to kick a ball the way a dog does, but rather in some weird plastic-robot way. Now we're going to look a little more at how these things even move around in the first place. You could look at this robot and say, OK, I know how it should move: it should move one leg forward and then the opposite leg in the rear, and you could try to write down a movement function by hand, but it would turn out not to work very well.
So what we're going to see here is a dog starting with an initialized policy that more or less moves it forward, trying variations of that policy over and over again. Its reward has to do with how quickly it makes progress and also with how stable the video is, and that stability turns out to be important for the vision that's being done while it plays soccer. Let's take a look; this is from Peter Stone's lab at UT Austin. This is the initial gait: it's more or less how you might think a robot dog would move, but you can see it's slipping and it's not super efficient, though it is making forward progress. That's the initial policy. Now, as it goes, it's getting rewards when it moves forward. One thing to note about walking is that sometimes you go faster and sometimes you go slower, and there are parts of the gait where you're actually slowing down even when you're doing the right thing. So of course it's not about acting in a way that gives you the next instantaneous reward; it's about acting in a way that also sets up future rewards. That's the initial gait, and then it trains: it walks and walks and walks, and it starts to get a little better. This is the reinforcement learning: the variations that led to better rewards are used to shape the policy, and now you can see it's much more efficient, and all of a sudden you've got an Olympic-caliber AIBO.

Now, you might think the reward function should only have to do with moving forward, and if you set the reward to only reward forward motion, you might be able to move forward faster, but then you might be shaking around, which makes the computer vision that tracks the ball hard. It's hard to play soccer when there's an earthquake going on, and that's what it looks like when you move in an unstable way. If you balance the rewards to also take stability into account, there's a trade-off, and the idea in reinforcement learning is that you specify the rewards and the learning then makes the trade-offs for you; you end up with different behavior based on the utilities you prescribe.

OK, so that's learning to walk. Let's see another example. A lot of these classic reinforcement learning videos and experiments have to do with some kind of walking (there actually are some deep reasons for that), but plenty of reinforcement learning has nothing to do with walking a robot around. This one is out of an MIT lab, and it's an interesting example: aside from the fact that this robot actually looks like a toddler in general shape, it's an example of the ability to walk being partly in the control policy but also partly in the hardware. With the human body, we have to learn how to walk, but partly our bodies are designed to move in ways that facilitate walking, and you can see the same thing here. This robot is pretty stable, so as it walks around it toggles back and forth and doesn't always move in the right direction, but it doesn't have to worry quite so hard about falling over. The next thing you know, it's more or less heading in some direction like a toddler might, and it's making more forward progress, and then it's moving purposefully forward, and then I guess it's off to college; there it goes. OK, so how do we do these things?
Well, let's take a look at a slightly simpler robot. You'll get to know and love this robot, because this is your Project 3. Here he is; I call him the crawler. If you ignore the fact that his body, aside from the rectangle, is basically a giant spike, he's actually kind of cute. This is about the simplest robot there is in two dimensions. There's physics here, in the sense that this is in simulation and it doesn't float off into the air or anything like that, but really there are only two arms: the yellow arm, which is actuated at the shoulder, and the red arm, which is actuated at the elbow, and this thing has to move forward. When you're this simple a robot, you can basically only do that by stabbing the ground in front of you and lurching forward, so it's not a design necessarily favored by nature.

So what's going on here? What you can see is that it's trying a bunch of stuff, and most of what it's trying doesn't actually make progress. This is exploration: it doesn't know what's going to work, so it tries a bunch of things. You look at this and you think, OK, I could do better than that, and you probably could. I could stop this thing and say: you're going to lift this arm, and you're going to go this way, and then down. I could try to give it a demonstration, and there are in fact reinforcement learning algorithms that use demonstrations as a valuable source of training, but here it's basically going to explore around, and in the interest of not spending the whole lecture watching it flail, we can skip a lot. So now it's like tomorrow or something, and I've changed something that I'll tell you about later, and you can see it has now basically learned how to walk. Cute, terrifying: you be the judge.

What's cool about reinforcement learning here is that it's learning how to make forward progress (forward here means to the right), but there are parts of this walking cycle where it gets no reward at all; it's just setting up future rewards. That's because the definition of the problem has to do with maximizing the sum of future rewards, not the instantaneous reward at that exact time step; otherwise it would just sit there and jiggle back and forth trying to collect instantaneous rewards. OK, let's stop that one. A cool thing in this context is why it chooses to go in that direction: I didn't tell it how to walk, and the reason it even walks to the right is that that's the direction the reward specifies. So if I, for example, invert the reward function, what do you think is going to happen? Let's skip a million steps, and there he goes: it can just as easily learn to go the other direction, based on the rewards you give. This is again an example of utilities being the inputs, while the actual behavior that's computed just optimizes whatever you asked for. You'll build that for Project 3.

OK, any questions about the setup? It's not all going to be robots walking; you'll also get a chance to do reinforcement learning for Pac-Man, and we'll see next lecture how that's going to work. But let's talk a little about what, formally, a reinforcement learning problem is, and the answer, at the bottom level, is that it's an MDP. So formally, we imagine there is an MDP.
You say, I know how to solve MDPs: I do value iteration, or I do expectimax search. And if you knew the MDP, you wouldn't need reinforcement learning, so there's got to be something different. There's still a set of states, just like in an MDP; there's still a set of actions, which can depend on the state; there's still a model, in that each state and action has some non-deterministic result; and there's still a reward function. We're still looking for a policy that maximizes the sum of rewards. The new twist: look at this car MDP that we know. There are these states, there are these actions, and we know what each action does, for example a 50% chance of staying in the same state and a 50% chance of cooling down, and there are various rewards. We needed all of those in order to run value iteration, and basically those are gone. The new twist is that we're in an MDP in which there is a transition function and there is a reward function, but we don't know what they are, or at least we don't know what they are until we experience them. So the difference is that rather than just planning offline on the basis of a known model, we now have to try things out in order to figure out the rules of the world we live in. That means we're going to make mistakes, and it means we have to think about things like how often you try something that looks bad before you give up on it. There are going to be new kinds of trade-offs, which we saw a little bit at the end of last lecture with our casino experiment.

Here's an important distinction, because it's very easy to confuse these things, and it's especially easy to confuse because when we talk about the real world versus simulation with a robot, it's obvious: the real world is when you actually have to plug it into the wall and it moves around, and simulation is when it sits there and smoke comes off the CPU. Whereas when you're talking about Pac-Man, what's the difference between the real world and simulation? It's whether or not you call game.getSuccessor or something like that, so it's a little subtle. That's why we keep pushing on this distinction between offline planning and online learning. The offline solution is what we did with MDPs: the idea is that you're sitting in the real world, but before you do anything stupid, you think about the consequences. You think, should I jump, should I move off to the left here? And you reason: if I move to the left there, I'm probably going to receive a negative reward; I don't like negative rewards, so I'm not going to do it. The falling into the pit and the suffering happened in your mind, because you knew it would happen; you didn't actually do it. In online learning it's more like: should I move to the left? I don't know. And then you try it. One big difference is that when you do online learning, you break a lot of robots. You'll see more about this later, when Pieter talks about some of his helicopter work: you break some robots. So the dollar cost, and the pain cost in terms of robot pain, is definitely higher for online learning. But you have to do online learning unless you already know how the world works, and so invariably you need to do a little bit of both to solve problems well in the real world. Now we're going to talk about ways you can do reinforcement learning, which means how you can decide how to act when you don't know the transitions and rewards of your MDP.
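To restate the setup in symbols (this is just a summary of what was said, using the same MDP notation as the last two lectures): the environment is an MDP whose transition and reward functions are hidden from the agent, and the objective is unchanged.

$$
\text{MDP } (S, A, T, R, \gamma) \text{ with } T(s,a,s') \text{ and } R(s,a,s') \text{ unknown}, \qquad
\pi^{*} = \arg\max_{\pi}\ \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t}\, r_{t}\Big].
$$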
Most of what we're going to talk about is something called model-free learning, which we'll get to in a few minutes, but first I want to sketch an alternative, because the alternative is a little more related to things we've already seen. The alternative is called model-based learning, and a lot of people do this. In model-based learning, what you say is: I don't know the MDP; if I knew it, I could solve it; so my priority is not actually learning how to act, it's learning the MDP. Once I learn the MDP, I'll just run value iteration and do what it says. So you construct a model of the world, and your learning is for building a model, not for evaluating actions. That's the model-based idea: you learn an approximate model, which is still based on your experiences (there's a subtle distinction there), and then you solve for values and policies as if the learned model were correct. Which it's not, so this won't be perfect.

Step one is to learn the empirical MDP. There are a lot of details about how to do this most efficiently, but here's one way that would, broadly speaking, work. You act in some way: you try a bunch of stuff, you jump off a bunch of ledges in grid world, and for each state and action you've tried a bunch of times, you record what happened. So you know: when I moved left, I always ended up in the pit; when I moved right, I ended up at the exit; and so on. For each s and a (s is a state, and (s, a) is a Q-state, a state-action pair) I know the things that have happened in the past, and I just declare: that's the transition function. I normalize the counts, meaning if one outcome state happened once and another happened twice, I say the first one has 33% and the second has 67%, so they add up to one, and I say that's my transition approximation; the little hat means empirical here. Similarly, for every (s, a, s') triple, when I experience it I get a reward back. We'll imagine the rewards are deterministic (there are fancier versions where the rewards are non-deterministic too), and we write down those rewards as we go. So we build an approximation to T and an approximation to R, and now you've got something that looks a little bit like the truth, and the more time you spend collecting data, the more it's going to look right. Then you solve that: step two is you just run value iteration, and this is basically fine. If you spent enough time on step one, this would work, but it's a little tricky to know when you're done with step one and when you should switch to step two, and we're going to prefer methods, which we'll talk about soon, where you're constantly learning and everything you've learned so far feeds into your actions; we don't always want this division between a learning phase and an acting phase. But that's the idea: learn a model, and then do with that model whatever you would have done last week. Questions?
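As a rough sketch of step one as just described (counting outcomes and normalizing; this is not the course's project code, and the function names are made up for illustration):

```python
from collections import defaultdict, Counter

# Tally observed transitions, normalize the counts into an empirical T-hat,
# and record the observed rewards as an empirical R-hat.
transition_counts = defaultdict(Counter)   # (s, a) -> Counter of s'
reward_estimates = {}                      # (s, a, s') -> observed reward

def record_transition(s, a, s_prime, r):
    """Called once per experienced transition (s, a, s', r)."""
    transition_counts[(s, a)][s_prime] += 1
    reward_estimates[(s, a, s_prime)] = r  # rewards assumed deterministic here

def empirical_T(s, a, s_prime):
    """Normalized counts: fraction of times (s, a) led to s'."""
    counts = transition_counts[(s, a)]
    total = sum(counts.values())
    return counts[s_prime] / total if total else 0.0

# Step two would hand empirical_T and reward_estimates to ordinary value
# iteration, exactly as in the previous lectures.
```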
OK, let's see an example of that. Here's an input policy. If you're going to be learning, you're going to need an input policy, and often the policy you learn with needs to be a random policy, which means that rather than having one recommended action, you may have to roll some dice to see which action you're going to take, because you're only going to learn about the things you do. That's going to be a recurring theme in reinforcement learning: you learn about the things you do, so if you want to learn about everything, you eventually have to do everything. But let's imagine we have this input policy and we're going to do some model-based learning. Imagine gamma is one, we're on this grid world, and we do some training. I won't walk you through every episode, but in the first episode: you're in B, you choose to go east, you happen to land in C, you get a -1; you're in C, you choose to go east, you happen to land in D, you get a -1; you're in D, you choose exit, you go to the terminal state and get a +10. That was episode one. Then we do episode two, and episode three, and episode four, and so on, until we're satisfied that we have enough data, and then we move on.

What we have to do when we move on is say: all right, on the basis of all that evidence, what happens when I go east from B? I look and I say, well, I did it here and I did it here, those are the only times, and it looks like 100% of the time I end up in C. So when I learn my transition model, it will say that the transition from B, choosing east and landing in C, always happens. Is that right? Probably not, but it's what my empirical data suggests. Similarly, if I go east from C, it looks like we got D once, twice, three times, and A once, and we write that down in the transition probabilities. Again, you'd have to do this a lot of times to have much confidence in these numbers; later in the course we'll talk about sampling confidence and things like that, but for now let's not worry about it. Similarly, we know what the rewards do: we know exiting from D is worth +10, but we didn't know it until we tried it. So one problem with this kind of approach is that you only learn about the things you try. But there it is: we take this learned model, we send it off to our value iteration algorithm, and now we know how to act. That's model-based learning.

Now I'm going to draw a distinction that, in this example, is going to seem extremely subtle, and worse than subtle, it's going to seem uninteresting, but bear with me. Let's say I wanted to know the expected, or average, age of students in this class. How could I do that? I want to compute an average, and really, if you think about it, everything you compute in reinforcement learning is an average: it's average rewards, it's what the chance nodes compute; everything is just averages, and we're computing them based on samples of experience. So this is a reasonable thing to ask: I want to compute the average age of students in this class without knowing all the details. If I knew the distribution over ages, if I happened to know exactly what fraction of the population is each age, how would I compute the average? I'd just sum up, over each age, the probability of that age times the age itself, E[A] = Σ_a P(a) · a; maybe 0.35 of the people are 20, and I just continue adding up the terms. Everybody clear? If I know the distribution over ages, it's a simple weighted-average calculation; easy. Now what happens when I take P(a) away from you? P(a) is gone; you don't know it, so you can't use this equation. And you can think of this equation as being like value iteration: it takes some averages and does something with them, and suddenly I've taken away your probability distribution.
What I can do instead is rely on samples. I can pick people at random and say: random person number one, how old are you? Random person number two, how old are you? I take samples out of the class; maybe I take n samples, and obviously more samples is better, but let's set that aside. Say I take 30 samples. Now the question is: once I've gotten my 30 samples, how do I compute the average age? There are two answers, and they're exactly the same thing; they're just two different perspectives on the same simple computation. One answer is the model-based way. It says: well, I know how to compute an expectation when I know the distribution over ages, so the first step is to figure out the distribution of ages. How many of those 30 people were 20? How many were 21? And so on. In that way I compute a probability, and then I just reuse my equation from before; I reduce it to the known-probability-distribution case. Everybody OK with that? But that's not the normal way you would compute a sample average. How do you normally compute a sample average? You just do it: you take your 30 numbers and you average them together. That's essentially the model-free way, where I just take an equally weighted average of all the numbers; I add up my 30 numbers and divide by 30.

The reason the thing on the left works is that eventually your probability distribution becomes correct. How come the thing on the right works? Why is it OK to weight everything equally? Here's the key confusion that will hit later, even if it hasn't hit yet: why aren't you up-weighting the common ages in this average? If there are more people who are 20 than 50, why wouldn't I give the 20s more weight? (Because they get counted in more times.) Yeah, exactly. It may seem obvious, or it may seem subtle, but the key point is exactly that: the things that are common show up more times in the sample, so we don't need to up-weight them. Everything in this average has the same weight; the frequent things just show up more often. And although this is clear here, you'll get a little confused by it later when the idea comes back. So samples come with the right frequencies, and things work out. It's the exact same computation, and it's the same with reinforcement learning: you're basically doing the same computation either way, but in the left view you reconstruct your probabilities, and in the right view you say, I don't need the probabilities, I'm going to directly take the average I care about. And that's going to be our lead-in to model-free reinforcement learning.
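Here is the same age calculation done both ways, on made-up sample ages, just to make concrete that the two views give the identical number:

```python
from collections import Counter

samples = [20, 20, 20, 21, 22, 25, 20, 23, 21, 20]   # made-up sample ages

# Model-based view: estimate the distribution P-hat(a) from the samples,
# then take the weighted average under that estimated distribution.
counts = Counter(samples)
n = len(samples)
model_based = sum((count / n) * age for age, count in counts.items())

# Model-free view: just average the samples with equal weights.
model_free = sum(samples) / n

# Same computation, two perspectives: frequent ages already show up more often.
assert abs(model_based - model_free) < 1e-9
```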
Is there a question? Yep. The question is: what happens when your action space is infinite? Everything we're going to talk about here is for discrete action spaces. When there are infinitely many actions, it's obviously not meaningful to talk about taking all of them many times; you can't even take all of them once. So in order to work in those spaces, you need either a model that's parametrized in some way, so that although there are infinitely many actions they're controlled by some smaller number of parameters (it can't be an arbitrary function from real numbers to real numbers), and then you're in the business of learning those parameters, or something that will be a little more natural to talk about after the next class, when we talk about parametrized representations. Great question. Any other questions?

OK, so let's talk about model-free learning. In model-free learning, we're going to directly try to estimate whatever it is we care about. If we care about a value, we're going to try to directly average the sums of rewards, rather than trying to learn transitions and rewards and then running value iteration. First we're going to talk about a simple case, something called passive reinforcement learning. In passive reinforcement learning you don't choose the actions; you're along for the ride. The learner just watches what's happening; there's no action selection going on. It watches, takes notes, and writes down what's working and what's not. In some sense you're an observer, just evaluating what's going on with your clipboard. Formally, it's the task of policy evaluation, which hopefully you remember from last lecture: the input is a policy. Somebody gives us a policy; we have neither the responsibility nor the luxury of trying to choose the best action; we're forced to run this policy. Our job is to figure out either the transitions and the rewards, in a model-based way, or, since we're going to be model-free, to directly learn the state values for this policy, and to do it in a way that's robust to the fact that we don't know T and R. So in this case the learner is along for the ride, with no choice of which actions to take; you execute and learn. It is not offline planning: you're actually doing things in the world and writing down what happens.

Here's our first algorithm for passive reinforcement learning. It's called direct evaluation. It's not a very good algorithm, but it's easy to understand. Since we're trying to compute the value of every state under pi, we're going to run the game a bunch of times. That could mean simulating Pac-Man, or it could mean starting the robot, letting it go for an hour, and then doing it again and again. Whatever it is, we get a bunch of plays of the game, and after we've acted according to this policy, we say: for each state, when I visited that state, what have my returns been empirically? I average them together, and that's my answer. Let's see it in action. Here's that same input policy, and A and D here are the exit states. Now what I do is train, just like I did when I was trying to learn a model; I'm just going to do different things with the results. In episode 1, I go from B to C to D and then exit; in episode 2, I go from B to C to D and exit again; and then I start at E. So the first episodes are all from B, and these ones are all from E: from E I go north, then east, then exit, and in the next episode I go north again, and so on. So I have all of these experiences. What's interesting is that the first ones are from B and the later ones are from E, and usually I exit at D and get a +10, but in episode 4 I exited from A and got a -10; A is the bad exit here. Now, rather than trying to learn transitions and rewards, I ask: what was my job? My job was, for each of these states, to write down its value under the policy. Let's take an example; let's take D. What's D's value under this policy? Well, I eyeball it, and it looks like it's +10.
What's A's value under this policy? -10. Are these right? Who knows; these are the empirical values. Let's do something more interesting, like C. What is C's value under this policy? I have to find every time I was in C and ask what my score was from then on: the first time my score was 9, the second time 9, the third time 9, and the last time it was -11. I take the average of those, and I continue like that, writing all of these down; these are the numbers I get. Is everybody clear on the algorithm? For every state I want to know the average returns, so for each state I look at my empirical returns and average them. We're sailing along with this algorithm, but any questions on it?

How did I arrive at the numbers? Say, where did this +8 come from? That is the return I get from B. I was in B twice, once here and once here; both times I received +8 in total, so I average those together and get +8, and that's the return from B. So what have we accomplished? We've now figured out a way to find the average returns under this policy without knowing the transition probabilities. They weren't relevant; we just found a way to do the computation without them, but in a very blunt way. There was another question, yes: for E, you get a +8 from this episode and a -12 from this one, so the average of those is -2. I might have messed up the arithmetic on the slide, but that was the intent. Other questions?

So we could end lecture early, right? Why does that not work? What's good about this is that it's easy to understand, it's clear that you did it without ever figuring out T and R explicitly along the way, those are nice properties, and it does eventually work. But what's bad is that it wastes a lot of information. For example, what does B do? In our experience, B only does one thing: under this policy it takes you to C. What does E do? Every time, it takes you to C. So B and E do the exact same thing: they both always take you to C, and you know this. So why do they end up with different values? How come E looks so much worse? Well, E is where you came from the one time things went badly from C. But that's really a statement about C, and about everything that takes you to C; yet in this method it only informed us about E, because that experience started from E. Even though all the episodes go through C, they don't share information across all the other ways you got to C. So this -2 is supported by only two runs through C, whereas really we have four relevant episodes. We waste a lot of information, and that means we have to run the game many, many duplicate times in order to get averages that mean something, so it takes a long time to learn.
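Here is a minimal sketch of direct evaluation as just described, run on the four episodes from the example; the episode encoding and the function name are just illustrative choices, not the project's code:

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Direct evaluation: average the observed returns from each visited state.

    Each episode is a list of (state, action, reward) steps, in order.
    """
    totals = defaultdict(float)   # sum of returns observed from each state
    visits = defaultdict(int)     # number of times each state was visited

    for episode in episodes:
        # Return from step t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        G = 0.0
        returns = []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            returns.append((s, G))
        for s, G in returns:
            totals[s] += G
            visits[s] += 1

    return {s: totals[s] / visits[s] for s in totals}

# The episodes from the example: B and E both lead to C, then usually out
# through D (+10), once through A (-10).
episodes = [
    [('B', 'east', -1), ('C', 'east', -1), ('D', 'exit', +10)],
    [('B', 'east', -1), ('C', 'east', -1), ('D', 'exit', +10)],
    [('E', 'north', -1), ('C', 'east', -1), ('D', 'exit', +10)],
    [('E', 'north', -1), ('C', 'east', -1), ('A', 'exit', -10)],
]
print(direct_evaluation(episodes))  # B: +8, E: -2, C: +4, D: +10, A: -10
```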
Let's try to fix that. We actually have an algorithm for policy evaluation, if you remember from last class. What was the algorithm? We said: if we knew the MDP, it would be like doing expectimax where you have only one choice of action, but then you have to average over all the possible outcomes of that action. We had a way of doing it with a one-step lookahead: we'd start with value estimates of zero, and then do updates that say my new value estimate for each state is written in terms of all the different s' that could result from that state, and for each s' we write down a reward plus a discounted future approximation. This algorithm was nice because it took everything we knew from the previous round and let us recycle it everywhere it was relevant; it connected all the states up through a one-step lookahead expectimax. So in some sense, this is the algorithm we want. Quick sanity check: why don't we just run this beautiful algorithm? For each state, do the one-step lookahead, compute this average over all the successor states, the one-step-ahead states, and get that nice sharing effect we had in policy evaluation. What is preventing us from executing this equation? This is what reinforcement learning is: there's something in this equation we don't know. What is it? We don't know T, and actually we don't know R either. So we can't run this equation, because we don't know T and R.

The key idea is going to be that we implement this equation, this update that takes approximate values and replaces them with better values based on a one-step lookahead, but we somehow do it without having to compute the average over everything that might have happened; we only get to know what did happen. Somehow we need to implement this average, this thing that says my value is an average of the one-step lookaheads. How are we going to do this update without knowing T and R? The answer is that we're going to use the samples we experience to do the averages, just the same way that, when we wanted to know the average age, we picked some sample people and averaged them together. The samples come from our experiences. You say, OK, but this is just a sample. Well, it's the same with the age computation: we didn't check everybody, we just used the people we happened to have, and then we let the law of large numbers eventually do the right thing.

So we'd like to improve our estimate of the value of a certain state s by using an average, and remember, pi is fixed here: we're in some state, the action is fixed, and there's just a bunch of things that might happen, and we want to average together all of their scores. Because we don't know how likely each outcome is, the first idea is to get sample s' outcomes by actually doing the action, and to average those together. So we're in this state, with this expectimax confronting us, but only one s' is going to happen. We get one sample: we take the action, we end up in s'_1, we get a certain reward, and we can approximate the future using our existing values; that's an idea we've had before and can now use. But this is only one sample in that average; the thing that happened to us this one time is just one of the possible outcomes. So if we wanted this average to be right, we'd need more samples: we'd sample another outcome of our action, and another, until we've got a whole bunch of outcomes of our action from this state, and for each one we have an instantaneous reward plus a discounted approximate future. If we had these n samples from this state, we could average them together, and that would be our new value.
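For reference, the update being described is the policy evaluation update from last lecture, and the sample-based version replaces the weighted average over s' with an equally weighted average of the samples we actually experience:

$$
V^{\pi}_{k+1}(s) \;=\; \sum_{s'} T\big(s,\pi(s),s'\big)\,\big[\,R\big(s,\pi(s),s'\big) + \gamma\, V^{\pi}_{k}(s')\,\big],
$$
$$
\text{sample}_i = R\big(s,\pi(s),s'_i\big) + \gamma\, V^{\pi}_{k}(s'_i), \qquad
V^{\pi}_{k+1}(s) \;\approx\; \frac{1}{n}\sum_{i=1}^{n}\text{sample}_i .
$$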
This would be a sample-based way of doing the expectation: instead of averaging things together according to the transition probabilities, we just take a sample-based average of the outcomes that actually happen, and it's that same idea again. The outcomes that have high weight in the transition function are going to appear more often as samples, and the right thing will happen in the limit. So on one hand, great: by taking samples rather than computing the average exactly, we can now implement this update from policy evaluation. But I kind of broke something here; there's something very wrong with this algorithm, and we have one more thing to fix. Do you see the flaw in this beautiful approach? I'm in state s, and the policy pi says go left, so I do it: I go left, and I receive a sample of what might happen. That was sample one. How do I get sample two? Do it again. OK, go back. How do you go back? I don't know; we're in reinforcement learning, an unknown MDP. I don't know how to go back; I didn't even know how I got where I got. So there's no way to go back, unless you can rewind time, or maybe predict the future, but in general that is not available to you. So the question is: how are you going to compute a sample-based average when you only get one sample? You might think, that's easy, you just average that one thing in, but it won't be a very good average.

So here's the big idea: you can't just wait until you've accumulated enough samples from a state before you do this averaging; you have to learn every time you take an action. Every single action we take, every time we experience what's called a transition (a transition is a state, an action, a resulting state s', and a reward), we're going to update the value of the state we just left. If I'm in a state, I go left, and I land somewhere, I'm going to update my source state. The s' that are likely will, over time, contribute more, but we're going to average them in as soon as they happen. What does that mean? The reason this is called temporal difference learning is that what we're essentially doing is saying: right now I'm in a state, and I currently think this state is worth 50. Now I take an action and I land somewhere. Maybe I land in a typical outcome, maybe I don't, maybe it's a rare event; that will all wash out in time. Say I get a reward of 10 and I land in a state that's worth 100. Well, I just got 10, and I landed in a really good state that's going to give me 100 in the future, so it looks like I'm on track to receive 110. But I used to think this was only going to be worth 50. So I need to take that 50, and I have a choice about what to do with it. I could clobber it with 110, but that's not right, because 110 may not be typical. What I'm going to do instead is blend in the new experience: I'm going to average the new experience with all of my past experiences, which are encoded in the current approximation. So: we're in some state, we act, and we receive a sample of the value. That sample is a real reward plus the value of a landing state. We also have our old value, which was our current estimate of V^pi(s) to begin with, and what we do is average together the new sample and the old value, in this kind of funny-looking way where we multiply one of them by alpha.
You could think of alpha here as something like 0.1, and we multiply the other by 1 minus alpha, which would be 0.9. So it's some linear combination, and typically most of the weight is on the old value: typically alpha is something like 0.1, not something like 0.9. So I'm constantly averaging the new value into my old values, which are encapsulated in a single number. You can write that in a different way: the value is itself plus alpha times an error term. This error says how far off you are, and so you can think of this either as taking a running average or as taking your current value and nudging it in the direction of the error, whether that error is positive or negative. These differ only in algebra, and they're both useful intuitions to have.

Yes, that's a great question. If we're trying to average a bunch of things together, you would think alpha should be some function of how many experiences I've had. There are two ways you could cash that out. One is that you keep close track: I've seen nine things, this is the tenth, so the ones I've already seen should get weight nine tenths and the one I just saw should get one tenth. The reason you don't do that is that the one you just saw is actually better than the old ones, because the sample you just took has, in its computation, a newer estimate of the future than the old ones you saw ten minutes ago. This new sample is worth more; it's more recent, so you want to give it a higher weight than it would get under even weighting. This kind of update, even though it's partly there for mathematical convenience, gives you that effect: newer samples contribute more. And it also means you don't need phases; you don't accumulate things until you switch to a new round. You're always running, and every experience you've ever had is in the average, just with exponentially decreasing weight.

To repeat that in a different way: suppose you go around keeping a single running number. Instead of taking an average of ages by adding up my 30 samples and dividing by 30, I ask, what's your age? Somebody says, I'm 23. OK, so far the average is 23. Somebody else says, I'm 22. Well, I take my 23, forgetting how many people led to it, multiply it by 0.9, and add 0.1 times the new number. At the beginning this might make sense, but eventually each new person contributes 0.1 even though there may be a thousand people already in the running average. This is not a bug, it's a feature: if you actually write out this update, you can rewrite it as a non-moving average in which every number that has ever contributed is averaged together, but with exponentially decaying weight. Everything contributes to the average, just with less and less weight over time, so the old information softly scrolls off the end in a way that you don't have to explicitly track. The second part of your question: in general, even though alpha is not determined in the obvious count-based way, you do want the learning rate to decrease over time so that this stabilizes, and then there's fine print that says alpha has to decrease, but it can't decrease too fast, in order to make things converge. Any questions on that? Let's do an example, and then we'll take a break. Let's do an example of temporal difference learning.
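As a minimal sketch of that update (both the running-average form and the error-nudge form are spelled out in the comments; the function name and the dictionary representation of V are just illustrative):

```python
def td_update(V, s, r, s_prime, gamma=1.0, alpha=0.1):
    """Temporal-difference update for the state we just left.

    sample = r + gamma * V[s_prime]          (one-step lookahead from experience)
    V[s]  <- (1 - alpha) * V[s] + alpha * sample
    which is the same as nudging V[s] by alpha times the error:
    V[s]  <- V[s] + alpha * (sample - V[s])
    """
    sample = r + gamma * V.get(s_prime, 0.0)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    return V[s]
```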
So here are my states, A, B, C, D, and E, and there's also some terminal state X that we know is worth zero. Let's assume the discount is 1, so we don't have to worry about tracking gammas, and that the learning rate alpha is one half. You would never do this in practice, but it makes the math work out a little more easily for an example. Right now you begin in this situation where, as far as you know, who knows what these numbers are: you currently have a value for every state (you always do), and they typically start out at zero. Right now everything is zero except for this one state D, whose value is currently estimated to be 8. Is that right? Who knows. Now we're going to have some experiences and change our numbers as we go, based on each experience.

The first experience: we are in B, as you can see from the red dot, and the policy says to choose east, so we do. We don't think hard about what to do; we just do what the policy says, because we're evaluating this policy. This time the result was that we landed in C and got an instantaneous one-step reward of -2. Now we have a little bit of a thought process. On one hand, my running average of the value of B is zero. On the other hand, I just got some fresh data that says I received a -2 and landed in C. What's my total reward according to my approximation? It's -2 plus V(C), which is currently zero, so I think, based on this sample, that I'm going to get a total of -2. So on one hand I have 0, representing my past experiences; on the other I have -2, representing my current experience; and I average them together. The old approximation is 0, the new sample is -2 plus gamma times 0, and with a learning rate of one half, I now get a -1. So these are my new values. Notice that the learning isn't about the state you land in; it's about the state you left. When B takes me to C, I learn something about B. I don't actually know anything new about C: I found a new way to get there, but I don't know any more than I did before about whether it's good. I know something more about B, because I just saw it lead to C, and I already know something about C. So as you move around, you leave learning in your wake: you learn about wherever you left, not wherever you land.

All right, another transition: from C we choose the action east, because apparently the policy wants that; we land in D, and we receive a -2. Currently, what do I think C is worth? I think it's worth an average of 0. What does this experience suggest? I just got a -2, I landed in D, and what is D going to give me over time? An 8. So I'm getting a -2 plus the 8, which is 6. That means I should now think this state is worth somewhere between 0, which is my old information, and 6, which is my new information, and if I average those evenly, it will be a 3. So you can see that as I move, each experience rolls into the value approximation of the state I just left. Quick sanity check: if my policy continually throws me into the pit of death, are the values going to be high or low? They're going to be low, because I'm going around saying, oh, this is bad, this is bad; bad things are happening, and I'm essentially writing down what's happening to me.
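Plugging the two transitions just walked through into the td_update sketch from above reproduces the same numbers (the state names and starting values simply mirror the example):

```python
V = {'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': 8.0, 'E': 0.0}

# In B, go east, land in C, reward -2: sample = -2 + V[C] = -2, so V[B] -> -1
td_update(V, 'B', -2, 'C', gamma=1.0, alpha=0.5)   # V['B'] is now -1.0

# In C, go east, land in D, reward -2: sample = -2 + V[D] = 6, so V[C] -> 3
td_update(V, 'C', -2, 'D', gamma=1.0, alpha=0.5)   # V['C'] is now 3.0
```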
So the values you get depend on what you do: if you act well, you're going to learn good values; if you act poorly, you're going to learn poor values. And that's actually what you want, because this is policy evaluation: you learn about the policy you're following, so that makes sense here. OK, let's take a break, and then we'll look at the problems with this, and then we'll see the core algorithm we're going to use going forward, which is an algorithm called Q-learning.

OK, let's get started again. We just learned a new algorithm. What is this new algorithm? It goes along and, for each state, tracks a value. Value iteration did this, policy evaluation did this; all of our algorithms have basically kept approximate values sitting around for all the states, and this one is no different. The only thing that's different now is that rather than doing sweeps, where you go to each state and compute some expectimax in order to find the new value, the new value isn't based on an expectimax; it's based on a single experience, and instead of the new value clobbering the old value, it's blended in as you go. Now there are two issues: one is like a data-structure issue, and the other is more subtle. The structural issue with this value-learning algorithm is that, while it is a model-free way to do policy evaluation (it's essentially the policy evaluation algorithm done in a sample-based way), it doesn't really work in a context where we actually have to act. We didn't design it for that; it was for passive learning. When you have to act, remember, you have to choose an action according to a policy. So far we've imagined that our policies were given to us, but if we had values and needed to compute a policy: well, remember, it was easy to choose a policy from Q-values. You look at your state, you look at its Q-values, which are the scores you'll get for each action, and you pick the action that gives you the highest score. The problem is that the Q-values can't be derived from the values without that Q-value update equation, which we can't compute, because we don't know T and R. So the idea, in terms of the structure, is that if we want to act, we really shouldn't be learning values; we should be learning Q-values. We're going to have an algorithm in a minute, called Q-learning, which learns Q-values.

But there's actually another, subtler problem with the algorithm we just saw. Remember we talked about how, if you keep jumping off the cliff, you learn that all of the states are bad. We'd really like an algorithm that, rather than learning the values of the policy you are executing, learns the optimal values. For a while, people weren't even sure that was possible; this was the big holdup in reinforcement learning for decades: trying to find a way to learn not the values under the policy you're following, but the optimal values. So you're throwing the robot left and right off the cliff, but occasionally doing something smart, and somehow learning that the values are actually pretty good, because although you've been jumping off the cliff, under the optimal policy you wouldn't be doing that. We'd like to solve both of these problems. It's easier to think about solving the problem of learning the Q-values, and we'll see along the way how that solves the more subtle problem at the same time.
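In symbols, the structural point is: acting greedily is trivial if you have Q-values, but recovering Q-values from state values requires exactly the quantities we don't know:

$$
\pi(s) = \arg\max_{a} Q(s,a), \qquad
Q(s,a) = \sum_{s'} T(s,a,s')\,\big[\,R(s,a,s') + \gamma\, V(s')\,\big] \ \text{ (needs } T \text{ and } R\text{)}.
$$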
OK, so: active reinforcement learning. This means the robot doesn't know what happens when you jump off the cliff, so it does it, and maybe it has to do it a lot of times, and maybe it has to jump off the cliff from every possible point, because you only learn about the things you do. Of course, as we flesh out the ideas of reinforcement learning, eventually you'd like to learn that if one cliff is bad, they're all bad, but today there's nothing like that. That won't happen until next lecture, when we start tying states together using features. At the moment you have to try everything, which means you actually incur the losses in the real world, and that means that slowly, over time, you're responsible not only for exploring and trying a bunch of things in order to learn; you're also eventually responsible for doing the right thing, and not just constantly incurring losses. So in full reinforcement learning, we're going to traffic in optimal policies, just like value iteration. What we'd really like to do is implement the value iteration updates somehow using samples, and we'll see that that's very hard. We still don't know the transitions, we still don't know the rewards, we are now choosing the actions, and our goal is to learn the optimal policy, or the optimal values, and we're going to do those at the same time, just like value iteration did. The learner is now making choices, so we have the fundamental trade-off between exploration and exploitation: should we try something that probably looks bad, just to be sure, in order to do things for science, or do we just go with whatever looks best now, which will give us higher returns but cause us to learn less? There's a trade-off there, and we'll formalize it going forward, though it's going to take a couple of different topics in the rest of this course to get all the pieces of that. And it's not offline planning: you're not just thinking deep thoughts, you're actually trying things in the world and then, slowly over time, trying to do the smart ones.

OK, now a little bit of a detour. Let's remember what value iteration did and why we can't just approximate it with samples. What was value iteration? I'm going to write the equation out, because it's really important; it's the key idea behind Q-learning. If I want to compute an approximate value for a state s, and I want to base it on what's going to happen in one time step under optimal action, I'd say: that's going to be a max over all the actions available to me, because I'm doing action selection now. For each action, if I knew the MDP, I'd take an average over all the possible outcomes, weighting the outcomes according to their probability; that's the transition function. And for each outcome, the value decomposes into a one-step instantaneous reward plus a discounted future value, for which I plug in my approximation, which for value iteration comes with a subscript k, at the landing state s'. That is the value iteration update.
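Written out, that update (the same one from the value iteration lecture) is:

$$
V_{k+1}(s) \;=\; \max_{a} \sum_{s'} T(s,a,s')\,\big[\,R(s,a,s') + \gamma\, V_{k}(s')\,\big].
$$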
The thing we can do with samples is mimic averages. We know this: if I want to average all the ages, I just grab a couple of samples and add them together; it's easy to do averages using samples. We don't know how to do maxes using samples; there's no real sample-based max. People stared at this equation and said: I can crack this. I don't need to know T in order to compute that sum; I can just take samples of the things that happen. But I don't know how to take a maximum over actions without actually having all the options and explicitly choosing the biggest one. Nobody knew how to push the samples through the max.

The key insight was that values, though they're the intuitive quantity, are not the only quantity there is. Here's the value iteration update, but you could also do one-step lookaheads on the Q-values. Nobody likes Q-values at first; they're a little bit weird. They're states paired with a specific action, but they answer the question: what is my score if I take this action from this state? So on one hand, they're actually more useful, because you can use them for action selection. Secondly, let's write down the one-step lookahead. If I start at a chance node in an expectimax tree, I have to consider the possible outcomes, and then for each outcome I would need to consider the possible actions, and then I'd be back at a chance node. So if I unroll one layer, half a ply down, what I get is an update that says: if I want a new approximation to the Q-value for (s, a), what do I need? I need to consider an average over all the s' that could happen, weighted by their likelihood, and then I add in the instantaneous reward plus a discounted future value. But here's the thing: if I'm writing Q's in terms of Q's, I can't write V there. I just did, but you can't do that, because you don't have the V's. If you want to reduce it to Q's, you have to write it in terms of Q's. So what's the value of a state in terms of the chance nodes below it? It's just the max, over all the actions a', of the Q-values at the landing state s'. So now I've written Q's in terms of Q's; done. All I did was shift my one layer of expectimax half a step down the tree, so it's chance nodes in terms of chance nodes. It's got the same ingredients: there's an average over outcomes, there's an instantaneous reward plus a future, and there's a maximum over the actions; the maximum is what makes it optimal. But now something really nice has happened: because the Q-values break my approximation up by actions, I can compute that max explicitly whenever I need it, just by looking at the Q-values and picking the biggest one, and the outer operation is now an average instead of a max. Because the average is on the outside, I can approximate the Q updates with samples even though I couldn't approximate the value updates. This was the big breakthrough in reinforcement learning: if you work with Q-values, then the outer operation in the equation you want to mimic with samples is suddenly an average and not a max. So we'd like to implement this update as we go, based on samples.
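The Q-value version of the one-step lookahead just derived, with the remaining max now sitting inside the expectation, where it can simply be read off the stored Q-values:

$$
Q_{k+1}(s,a) \;=\; \sum_{s'} T(s,a,s')\,\Big[\,R(s,a,s') + \gamma \max_{a'} Q_{k}(s',a')\,\Big].
$$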
So what's going to happen is you're going to get a sample: you're in some state s, you choose some action a, you land in s', and you receive a reward r. And then you go through the following reasoning. On one hand, until I received the sample, I thought my starting state s and my starting action a were worth Q(s, a); I thought that this combination of s and a was going to lead to Q(s, a) as my score. But now I've just gotten a new estimate: my new estimate is the reward I just got, plus the discounted estimate at the landing state, which has got this max embedded in it. So I have to look ahead at the state I land in, and in order to figure out what's going to happen, I have to consider every action and take a maximum. If you think about it, that's like looking at my landing state and asking not "what are you worth under the current policy" but "what are you worth if I do the best thing I can locally". So this sample is something like a one-step lookahead with a little bit of expectimax-style optimality built in, because of the max over actions. Then I've got my old estimate of the Q value and my new estimate of the Q value, and I do the same thing I did before, which is take a running average with a linear weighting: something like 0.9 times the old value plus 0.1 times the new value.

Let's see this. We're going to see it twice, because it's the key idea in this reinforcement learning algorithm. So let's take a look; let's do this one. Right now all the Q values are zero, and as I move, the Q values will start to update. The parameters here are that the living reward is zero, meaning I'm not getting any reward just for moving around. When I take that first action, I thought it was worth zero, I got zero, and I landed somewhere worth zero, so nothing really happens at the beginning: everything is zero. The only nonzero rewards in this MDP are at the exits. So as I move, nothing interesting happens until I choose the action east here and land in the pre-exit state. As I land here, nothing actually happens, because I still get a zero, but from this state I'm going to take the exit action, I'll be airlifted out, and I'm going to receive some reward; I think the reward here is plus one, and then I start over again. So now I start learning. Why is it 0.5 and not 1? I just got a one, so why didn't I write that down? Because my old value was zero, my new value is one, and I average them together; in this case the learning rate is apparently 0.5, which is nice for demos but probably a bad idea in the real world.

So now I move again and again and again, and nothing happens until I'm here. Now when I move east, I'm going to say: my Q value, which is the zero right here, I used to think was worth zero, and as I move east I land in a state which is worth 0.5, so I average zero and 0.5 together, and what does the Q value become on the east slice of the pie here? 0.25. Then as I leave, I average 0.5 and 1 together, so that goes up to 0.75. I'm just going to keep walking this path, and you can see something happen here when I move: I think east is worth zero, and as I land I discover it's worth 0.25, and I average them together. I'm just going to manually do this a bunch of times, and you can see it takes a while for things to propagate back, because with a learning rate of one half it takes a while for those 0.5's to move toward one.
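A minimal sketch of this sample-based update in Python. The function and variable names here are just illustrative (this is not the course project's API), the discount value is an assumption, and the learning rate matches the 0.5 used in the demo:

```python
from collections import defaultdict

GAMMA = 0.9   # discount (assumed value for this sketch)
ALPHA = 0.5   # learning rate, matching the demo

Q = defaultdict(float)   # Q values for (state, action) pairs, all start at zero

def q_update(s, a, s_prime, r, legal_actions):
    """One Q-learning update from a single observed sample (s, a, s', r)."""
    # Future value: the BEST action available at the landing state.
    # This max is what builds optimality into the sample, even if the
    # agent isn't actually acting optimally. Terminal states contribute 0.
    acts = legal_actions(s_prime)
    future = max((Q[(s_prime, a2)] for a2 in acts), default=0.0)
    sample = r + GAMMA * future          # new one-sample estimate
    # Running average of the old estimate and the new sample.
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample
```

With a reward of +1 at the exit and everything else zero, the first update along the demo's path gives (1 − 0.5)·0 + 0.5·1 = 0.5, which is exactly the 0.5 that shows up in the GUI.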
But let's look at several things that are happening. One: I'm only learning about the actions I'm taking, which makes sense. So I'm going to start doing some other things. First of all, if I go here, if I go west, what do I currently think west is worth? I currently think it's worth zero. Let's say I land to the left. What is my landing state worth? It's worth 0.98, so essentially one. So I should now think that west is worth about 0.5, which is what happens. And notice I learned something that is backed by the knowledge that when I act optimally from that landing state, it involves going east twice and then exiting. So I learned something in one step that's backed by all of these other experiences from other runs. That's nice; it lets me learn a lot faster.

Now let me show you the thing about optimality. What's the best way to do this? Let's go around the other way: let's say I decide to start acting dumb, and I start taking this wonderful path and using it to jump into the cliff. Let's look at what I'm learning. Obviously from here I'm learning that the east action is bad, but notice what I'm not learning: I'm not learning that this state is bad, even though I'm doing dumb things from it, because the only thing I update when I go south is the Q value for south; it doesn't mess up my knowledge of the other actions. Moreover, even though every time I go south I then fall into the pit of death, notice that the Q value for south is still zero. Even though from the state I land in there is apparently an action that's bad, the Q-value update tells me to approximate the future using the best value from that state, so it considers the state with the −0.69 in it to be worth zero, because it assumes that when I land there I act optimally, even though I don't. So it's telling me something about optimal action, not about what I'm actually doing. If I go back and forth between these, I'll start to learn that in fact going south from here is pretty good, because if you act optimally you should go back north and then go out the good exit. So what's neat about this is that we're learning about the things we try, and somehow the learning isn't corrupted by the experiments we do: we learn about optimal behavior, and that's because of the max in the equation.
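To make that point concrete, here's a tiny worked version of the −0.69 situation in Python. The action names and discount are assumptions; the learning rate is the demo's 0.5:

```python
GAMMA, ALPHA = 0.9, 0.5   # assumed discount; learning rate from the demo

# Q values at the landing state s': the bad east action (into the cliff)
# has been tried, everything else is still zero.
Q_s_prime = {'north': 0.0, 'south': 0.0, 'east': -0.69, 'west': 0.0}

# Update Q(s, 'south') after the sample (s, 'south', s', r = 0):
old_estimate = 0.0
sample = 0.0 + GAMMA * max(Q_s_prime.values())      # the max is 0.0, not -0.69
new_estimate = (1 - ALPHA) * old_estimate + ALPHA * sample
print(new_estimate)   # 0.0 -- the bad exploration below s' doesn't leak upward
```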
Okay, let me show you another version of this. All right, this is... well, let's see what it is. Oh, it's flickering; pretend it's not. So look at this: what do you think is shown on the bottom, in the giant flickering mess? These are states down at the bottom. It's a two-dimensional grid, because the state space of this robot is two angles, one for each part of the arm, and green and red here are relatively good versus relatively lower values. Values are shown on the left, Q values are shown on the right; you can tell because you've got four actions from every state and they're drawn in quadrants. It looks a little like grid world, but this is not a grid world. And you can see we're learning about the stuff we try. What about down here? We don't know anything about it, because we haven't tried it. So let me skip into the future. If I skip far into the future, suddenly you'll see that I've now learned about all these Q values.

One interesting thing to notice: we don't just go to the bright green. I need to click a couple of things down so you can see it for a second; I'll tell you in a minute what I did just there. We don't just go to the bright green states and sit there, because the bright green states give you a lot of future returns, but they do it by moving on through other states and receiving instantaneous rewards along the way. So you don't just sit at the good state; you follow the good policy, which is shown by the little blue lines over here. And if I stop this and just place the robot somewhere, even though it's not supposed to get there, it'll do the right thing and get back on track, because it's learned the whole policy. Okay, we'll stop the flickering.

So we learn Q values as we go. Let me tell you about some properties of this algorithm, and then we've got a couple of high-level things to talk about, and then that'll be today. There's this amazing result about Q-learning: this algorithm converges to the optimal policy and the optimal values even though you aren't following the optimal policy. It's kind of amazing. It's something called off-policy learning: you don't learn the values for the policy you're executing, you learn the optimal ones, even if you're not following them. With value learning you would have had to follow the optimal policy in order to get the optimal values. There's fine print, of course: you're not going to learn unless you explore enough, because you're not going to be able to learn about states and actions you don't take, and you've got to make the learning rate small enough, but not too small. But if you get all the fine print right, then in the limit you can select actions completely at random and you will still learn the optimal policy.

Now, you don't want to select actions at random in the process of learning the optimal policy. Why don't you want your robot just moving around at random? It'll work in the limit, but one, it's going to take a lot longer, and two, you're going to lose a lot of robots in the process. So your regret, formally speaking, will be high. A lot of these decision-theory concepts you know from life, and we're just formalizing them in some way. Here's the trade-off between exploration and exploitation: you have your place that you like to go to eat, and it's tasty, and there's this new place. Is the new place better than your favorite place? Probably not, but it might be. So you go and try the new restaurant, and let's say you have a kind of mediocre meal. Do you go back? Maybe. Maybe once, maybe you wait a while and go back. So you'll do a certain amount of taking actions which you think probably are not going to immediately reward you, but you do them for science; you do them to learn, to gather information. There are other things in the restaurant analogy, like maybe you enjoy trying something new, but that's secondary stuff in this metaphor. The key idea is that you take actions also for the sake of learning.

So how are we going to get our reinforcement learner to explore? The Q updates just happen under whatever policy you're following, but how do you decide what actions to take? There are a bunch of schemes; we're not going to cover everything you might do, and in fact formalizing the optimal thing to do is going to require machinery we're not going to have this week. But the simplest thing you can do is what's called an epsilon-greedy exploration policy: every time step, with some probability ε you take a random action, and with the much larger probability 1 − ε you follow your current policy.
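A minimal sketch of an epsilon-greedy action chooser (names are illustrative; Q is any mapping from (state, action) pairs to values, like the one in the earlier sketch):

```python
import random

EPSILON = 0.1   # exploration probability (an assumed value)

def epsilon_greedy_action(state, legal_actions, Q):
    """With probability epsilon act at random; otherwise act greedily w.r.t. Q."""
    actions = legal_actions(state)
    if random.random() < EPSILON:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```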
This says your robot should do its thing, but occasionally just jump off the cliff for kicks. Is this good? Well, you don't spend all your time exploring, so that's good, but on the other hand, when you do explore, it's unfocused: you eventually explore everything, but even once you've learned, you keep thrashing around. If you remember, I was doing something to stop the exploration in that applet; it was an epsilon-greedy policy, and I was lowering epsilon so that you could see just the greedy behavior. You can lower epsilon over time, and there are other solutions. I'm going to sketch one right now, but the exact form is actually not super important. If you want to know when to explore, you should really be taking random actions maybe at the beginning, but what you want to do over time is explore areas whose badness is not yet established. When you decide to try a new restaurant, you don't pick a restaurant at random; you choose a restaurant that you don't know much about, but that your prior belief says is reasonably promising. So you use exploration functions of some kind (there's a small sketch of this idea after the wrap-up below). One way to formalize it, and again the detail is not as important as the idea, is to say that in the Q update, rather than approximating the place you land with its current Q value, you take a function of the current Q value and a count of how many times you've been there, which upweights the Q values of places you haven't visited much. Essentially you treat states that take you to unknown territory as being a little bit better than they currently look, until that territory becomes known. This can give you nice behavior where, on that robot for example, you learn very quickly. Let me see if it'll behave... all right, so this is that GUI. Now it's exploring, but it's got an exploration function, so it stops exploring as it becomes confident, and, yeah, it's a little weird, but it's basically done. So rather than having to skip a million time steps and still watch it thrash around... you can see that it's actually still got some epsilon in it too, so let's turn that down; oh, we can't take it all out. Anyway, you get the idea: you can learn much faster if you damp down the exploration once it's no longer needed, and that can save a lot of time.

Okay, so in conclusion: lots of new ideas today. We now know how to do a lot of different things, with a handful of algorithms that actually look a little similar. If you know the MDP, you can compute optimal values and policies with value iteration, and you can evaluate a fixed policy with policy evaluation. When you don't know the MDP, we can learn the MDP, which is called model-based reinforcement learning, or we can do policy evaluation in a model-free way, which is called value learning, and we can learn optimal Q values, and therefore policies, with this algorithm we just learned, called Q-learning. So that's it. Next time we're going to see how to deal with large state spaces where you can't enumerate the states, like Pac-Man.
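Finally, picking up the exploration-function idea referenced above: a minimal sketch of one way it could look. The specific form f(u, n) = u + K/(n + 1) and the constants are assumptions; as noted in the lecture, the exact form matters less than the idea of treating rarely visited (state, action) pairs optimistically:

```python
from collections import defaultdict

GAMMA, ALPHA, K = 0.9, 0.5, 1.0   # discount, learning rate, exploration bonus (assumed)

Q = defaultdict(float)   # Q values, start at zero
N = defaultdict(int)     # visit counts for (state, action) pairs

def f(u, n):
    """Exploration function: a value u looks better when it has rarely been visited."""
    return u + K / (n + 1)

def q_update_with_exploration(s, a, s_prime, r, legal_actions):
    """Q update where the landing state's actions are scored optimistically."""
    N[(s, a)] += 1
    acts = legal_actions(s_prime)
    future = max((f(Q[(s_prime, a2)], N[(s_prime, a2)]) for a2 in acts), default=0.0)
    sample = r + GAMMA * future
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample
```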
Info
Channel: CS188Fall2013
Views: 52,807
Id: w33Lplx49_A
Length: 69min 49sec (4189 seconds)
Published: Thu Oct 03 2013