TD Learning - Richard S. Sutton

Captions
Thank you, everybody. Thank you, Joelle. Well, it's true, I have the most gray hair here — I've been around for a long time — and I want to talk about the long view a bit. First, let me say that it's really exciting to talk to all of you young people entering the field, learning about AI and reinforcement learning. It's a special opportunity and I really appreciate it. So I have to think about what I can try to communicate to you, because we have only a brief amount of time and the field is immense — I've written a whole book on the subject. My topic today is nominally temporal difference learning, and certainly I'm going to talk about that, but I also want to speak to you as young people entering the field and tell you the way I think about it from a longer view.

The first thing to be said along those lines is that this is a really special time. We can now look forward to doing an amazing thing: understanding the way the mind works, the way intelligence works. This is a monumental event — not just for this century, but maybe for thousands of years, for the history of the Earth — when intelligent beings, animals, things that can replicate themselves, finally come to understand the way they work well enough to create intelligence by design. We are going to come to understand how the mind works, how intelligence can work, how it does work, and just the fact of our understanding it is going to change the world. Obviously there will be lots of applications, but it will also change ourselves: our view of ourselves, what we do, what we play with, what we work at — everything. It's a big event, and we should keep that in mind. I'm not saying it's going to happen tomorrow, but in ten years it could happen, in thirty years it could happen, and in the great sweep of history that's a small amount of time. It's going to happen within your lives, with high probability.

So the first thing to ask is why this is happening, and I think it really comes down to Moore's law. I took a slide from Kurzweil — his standard slide of the increasing computation per dollar — and it's just reaching the point where it starts to compete with the computation we can do with our own brains, with natural systems. That's happening now, and it's just going to continue increasing dramatically. I'm sure you've seen these kinds of slides: time in years along the bottom, computations per second per dollar on the other axis. It's a log scale, so a straight line would be exponential increase, and Kurzweil argues it's actually slightly super-exponential. The point is that the computation available to us has become enormous, and that changes things. The rise of deep learning and all of that is purely because of GPUs, because computation keeps getting cheaper, and that will only continue and become more extreme as computation vastly increases. It alters everything, and it really is profound. I have a slide to say what I conclude from this, and I think it's the answer to a question AI has had for sixty or seventy years. I'm just going to do a couple of slides like this.
But one last bit of preface: my talk today is going to be about half big things, like what I'm doing now, and about half really small things, where we go back to the foundations — the basic stuff we need to understand. The big picture is important, but the details also really matter. We heard a lot of specific algorithms earlier today, and exactly how they work is really important. But the big picture matters too, and sometimes we lose sight of it; sometimes something becomes so obvious that we overlook it. I think the computation story is along those lines, and its implication, in a phrase, is that methods that scale with computation are the future of AI. By "scale with computation" I mean that as we get more computer power, those methods become more powerful. That has not always been true, but it is certainly true of learning and search methods — what we call general-purpose methods. And it's the answer to one of the oldest questions in AI, sixty years old: do we want weak methods? That's what they called them in the old days — they called general-purpose methods "weak" because they just use the data, or use computation; they're general purpose, so they're "weak." The "strong" methods were the ones that would use human knowledge, human expertise, human insight to make their systems so much better. Now, probably some of you are thinking that's crazy, the general-purpose ones are the way to go, and some of you like the human-insight side. How many of you would go with human insight? All right, some of both — that's a nice compromising position. In this big part of the talk I'm going to present strong views. Maybe you should hold compromises and nuances, but it's good to state strong views because they give you a working hypothesis, a point of view. You don't have to believe it; you can say, "that's a strong point of view I should be thinking about," and maybe you hold several of them. So here's a strong view: this old question has been answered, in favor of the weak, general-purpose methods. I could talk about this all day, but I'll refrain.

The next thing to say is that you may be thinking, "I'm into deep learning, or reinforcement learning, so I'm on the right side of history" — but I'm not sure that's right. Take any kind of supervised learning, or even model-free reinforcement learning, the thing I love more than anything: it's only weakly scalable. If I had a million times the computation, I'd still be limited by how fast I can gather data, and I'm only learning a value function and maybe a policy. What's the big deal? It's a tiny object — some map from states to what to do. And supervised learning is certainly only weakly scalable, because it requires people to provide datasets; the people become the bottleneck. If you get people to label things on the web, things scale, but only weakly —
only as fast as you can gather the data, and eventually that becomes the bottleneck. You really want to be able to learn from raw data; you want to be scalable. So if these things we love are not fully scalable, what is? What would be fully scalable? My answer is simple: it's what I call prediction learning — learning to predict what will happen. It's the unsupervised version of supervised learning: we have targets, but we just wait and see what does happen, and that's our target. You don't need a human to provide the label; you get it from the data. So it's unsupervised supervised learning, and it's definitely scalable, model-free learning. And prediction learning is at the heart of all of our control methods, where you learn value functions. That's the big argument I want to make about temporal difference learning: it's the scalable model-free method. Of course I haven't given you even a step toward that yet; I'm just saying that learning to predict is the key problem, the thing we should be thinking about. You may say that deep learning, supervised learning, is all about prediction learning — but that's not quite true. It's about predicting what the label will be, not about predicting over time, where you have to wait and see what happens, and that makes all the difference.

With all these words, I want to ground us a little and remind us what the data looks like. I put in a slide of what real life is like: real life is a temporal stream. We're playing soccer and have to act at every moment; we're a hyena being chased by a lion, trying to predict whether it's going to live or die; we're a baseball player watching a tiny ball flash by really fast, having to swing at just the right moment; or we're talking to someone and trying to predict what they will do. These are the data streams that AI should be learning to deal with, and we should keep them in mind when I say "learning to predict." Think of the hyena: fear is your prediction of whether you're going to die. Several times in the chase it looks good and bad, and at the end here it's not looking so good.

With those premises, let me start talking about temporal difference learning. TD learning is a method for learning to predict. It's widely used in reinforcement learning to predict future reward — value functions — and it's basically at the core of many methods you know about: Q-learning, Sarsa, TD(λ), Deep Q-Networks, TD-Gammon — the world-champion backgammon player using deep reinforcement learning from, my god, twenty-five years ago, 1992. But not all reinforcement learning methods use it. AlphaGo happens not to use TD learning — it uses other kinds of reinforcement learning; the helicopter work you might have heard of doesn't use it; and there are pure policy-based methods — Peter's talk was interesting because he talked about policy-based methods, and some of those don't use temporal difference learning,
but eventually he gets to it and puts it in to make things better. So it feels like it's optional; many people say it's optional. But I want to argue that you do want to use it — that really you should always be using it. It's ubiquitous, and it seems to me it's how the brain works. What else? It can be used to predict anything — it's a general method for prediction learning, not just for rewards. Don't be fooled: in reinforcement learning you read about TD as being all about predicting value functions, reward, but we can use it to predict anything, and since my talk is about TD I want to be sure we think about this general-use question.

Yes — I've forgotten the exact details, but the initial AlphaGo learning system, as I recall, learned from actual outcomes, actual games. It would look at a position, make a prediction of whether it was going to win or lose, then go all the way to the final outcome, see who actually won the game, and use who actually won as the target, instead of a temporal-difference target. So that's the thing to look for: do they wait and see who actually won — the outcome, the return — or do they update a guess from a guess?

Let me make that clear; it's my next slide. What is TD learning? Basically, it's learning a prediction from another, later prediction. If AlphaGo looks at a position, makes a prediction, and zooms all the way to the end of the game to see who won, then it's not TD — it's not learning a prediction from a prediction. On the other hand, if you make a prediction from a position, then make one move, see your next prediction, and use that next prediction to form a target, then you're doing TD learning. The quick phrase is: we are learning a guess from a guess. Sounds a bit dangerous, doesn't it? What would constrain it, what would tie it down? But that is the idea — we want to learn an estimate from an estimate — and we have to talk about whether that's good or not. The TD error is the difference between two temporally successive predictions. If you're playing a game and you say "I think I'm winning," then you make another move and now you think you're losing, you try to learn from that. You don't know who is actually going to win — not yet — but you learn from the temporal difference in your predictions. After that, it's the same as what you're all used to: you have an error and you backpropagate it, or whatever you're doing. The only question is where the error, the target, comes from: from the end of the game, or from the next prediction?

So here's the example of TD-Gammon, originally 1992. Tesauro would take a backgammon position and send it into a neural network — he had many versions, but the standard one had a single hidden layer — and it output a probability of winning. The error, then, was the probability of winning in one position minus the probability of winning in the next position. We look at the change in the estimated probability of winning, and that is the error backpropagated through the network. Stick it in a corner playing against itself, learning from its own trial and error, and it came out to be competitive with the world's best players — really the best in the world.
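To make the shape of that update concrete, here is a minimal sketch — my own illustration, not Tesauro's code — of a TD-Gammon-style value network with a single hidden layer, where the learning signal is the difference between successive win-probability estimates. The feature size, hidden size, and step size are illustrative assumptions, and the real system also used eligibility traces (TD(λ)), which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 198, 40                 # illustrative sizes (TD-Gammon used ~198 input features)
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
w2 = rng.normal(scale=0.1, size=n_hidden)
alpha = 0.1                              # illustrative step size

def predict(x):
    """Forward pass: estimated probability of winning from position features x."""
    h = np.tanh(W1 @ x)
    p = 1.0 / (1.0 + np.exp(-(w2 @ h)))
    return p, h

def td_update(x, x_next, terminal_outcome=None):
    """One TD(0)-style step: move the prediction for x toward the prediction for x_next
    (or toward the actual outcome, at the end of the game)."""
    p, h = predict(x)
    target = terminal_outcome if terminal_outcome is not None else predict(x_next)[0]
    delta = target - p                           # the temporal-difference error
    # Backpropagate delta; same order of work as one forward pass.
    grad_w2 = delta * p * (1 - p) * h
    grad_W1 = np.outer(delta * p * (1 - p) * w2 * (1 - h ** 2), x)
    w2 += alpha * grad_w2
    W1 += alpha * grad_W1

# Usage: ordinary step uses the next prediction as target; the final step uses who actually won.
x, x_next = rng.random(n_in), rng.random(n_in)
td_update(x, x_next)
td_update(x_next, None, terminal_outcome=1.0)
```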
OK, so that's familiar now. But I'm trying to get to these questions: do you need to use TD learning? All of this is motivation, and the motivation is the most important part. Do I need to use TD learning here, or can I get away without it? If you go into the field now, even within reinforcement learning, you'll find a good fraction of people who don't believe in TD learning and think they can get away without it. So it's a real question, and we should all be asking: do we need it? I want you, even as people just learning about the field, to be able to engage with this question and know the basic facts pertinent to whether we need TD learning. I'll skip over Atari, since you've seen that already.

So, TD learning: when do we need it? Well, it's only relevant on multi-step prediction problems — only when the thing predicted is multiple steps in the future. Obviously, if you're predicting the outcome of a game, that's multiple steps in the future. But if you're predicting a label — or, as in AlphaGo, where at least the initial data was "here's a guess, and then we'll just see who won the game" — if you just use who won the game, then it's essentially a one-step prediction.

Now I want to say that it's really broadly applicable. I said it's only applicable when you have multi-step predictions, but really everything you want to do is a multi-step prediction. I have some examples. If you want to predict the outcome of a game; if you want to predict what a stock-market index will be — you could just predict the final value, but you get more data every day after that, so you really make repeated predictions about a long-term outcome; if you want to predict who will be the next president, you can revise that prediction every day as new events happen; if you want to predict who the US will go to war against next — all of these are long-term predictions; they don't jump straight to the end. Even predicting a sensory observation: predicting just the very next observation is not multi-step, but predicting, say, ten steps ahead, or a discounted measure of the future, is a long-term, multi-step prediction. And go back to the real world — the hyena and the lion, the conversation, Messi playing soccer: he has to make long-term predictions. It's not "what's the next thing"; it's "am I going to make this goal, will I get around that fellow, where will the ball be a few milliseconds from now?" All of these are multi-step predictions.

Now, can we handle the multi-step case with our one-step methods? Sure, things happen bit by bit, but can we ignore that — just wait and see what happens, say see who won the game, and use our one-step methods? Or can we learn a one-step model and then compose the model? The answer is that we really can't do either of these things, and I want to try to give you some sense of that today. I'm mainly going to talk about the first question — can we think of the multi-step case as one big step, or do we have to deal with it bit by bit? — but I want to spend one slide on the second.
The second question is: can we learn one-step predictions — a model — and then iterate them to get a multi-step prediction when you need it? I just want to say that I think it's a trap. I can't properly explain this here, but I think it's a trap: the idea that it's enough to model the world as a low-level simulation — to treat it as a Markov decision process and model the transition probabilities, or to treat the world as an engineering model where we just have to learn the velocities and the effect of our actions on the accelerations, and then integrate that low-level differential equation. All of this is a trap. Learning short-term models and then iterating them feels good, because we know that if the one-step model were perfect, the multi-step prediction could be made perfectly however far we want to look into the future. But there are two problems. First, we can't do it perfectly — we will always have an approximation — and when we iterate an imperfect model we get a propagation and compounding of errors, and a useless long-term prediction. Second, it's exponentially complex: as we look ahead, at each step there are many possibilities — the world is stochastic, and we have different choices of action — so it quickly becomes computationally intractable. It will always be intractable to look ahead many small steps into the future. Try iterating your model of physics to figure out how much fun you'll have going to Montreal and taking the summer school — it's crazy. There's no future in that way of thinking; it's a trap, and lots of people, in my opinion, are falling into it.
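As a toy illustration of the compounding-error point (my own example, not from the talk): even a slightly wrong one-step model, iterated over many steps, drifts arbitrarily far from the truth.

```python
# True one-step dynamics: x_{t+1} = 0.9 * x_t. The learned model is only a little off: 0.95.
# Iterating the model k steps compounds that small error multiplicatively.
true_a, model_a = 0.9, 0.95
x0 = 1.0
for k in (1, 5, 10, 20, 50):
    true_pred = (true_a ** k) * x0
    model_pred = (model_a ** k) * x0
    print(f"{k:2d}-step prediction: true={true_pred:.4f} "
          f"model={model_pred:.4f} relative error={model_pred / true_pred - 1:.1%}")
```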
But let's go back to the other side — remember, I'm going backwards here. We have two ways to get away from TD, if we don't like TD: learn a model and iterate it (that was the second one), or treat it as a one-step thing and just do the one-step thing. The one-step question I'm about to take up in the low-level part of the talk, but let me first answer it at a high level, and then maybe we can even take questions after this high-level part. Can't we just use our familiar one-step supervised learning methods? In reinforcement learning these are known as Monte Carlo methods: just roll it out, see what happens, and use what happens as the target. Well, this has costs. Number one: you're making this prediction and then rolling out to the end, and really you're going to make a prediction at every moment in time, so you have to remember a whole mess of predictions as you go out to the end to see the outcome — or really, if you have returns, you'll get outcomes at different steps and have to relate them back to the earlier situations. It's horribly complex; it's nasty. First of all, you need to remember everything you did. Think of yourself as the hyena, with the lion right there, trying to make a good prediction. What do you have? All this stuff swirling around you — the other hyenas running away, a glimpse of the lion, the feeling of your feet — all this stuff that you can sense, and you want to relate it to how well the chase is going, how good you should feel about it. Wouldn't it be much better if you could do it now? Right now, all that stuff is in your mind and in your sensors: learn now whether it's going well or poorly, as opposed to waiting five seconds, by which point you've had a large number of frames of different sensations, different patterns of sensation, which you've forgotten. If you have to wait five seconds until the end, it's too late; you can't remember all that, and whatever you remember will be a tiny shadow of the vivid representation you had at the time it happened. And of course the computation is poorly distributed: you can't learn now, so what are you doing now? Later you'll see the outcome and you'll have to do all the learning then — a poor temporal distribution of computation. You can avoid these problems with special methods, and that's really what TD is about: specialized methods for the multi-step case. Another reason not to wait is that sometimes you never see the target. Say you're playing a chess game and there's a fire alarm, and you never finish the game, so you never see a final outcome. But if you're doing TD, maybe you thought you were winning and then things were going really poorly, so you know you did something bad — you can learn without waiting to be checkmated. Maybe the fire alarm came just one move before checkmate, so technically the game never ended, but you can still learn a lot from the experience. Obviously we could try to ignore all these things and treat them as nuisances, but I think of them as clues — hints from nature about how we should proceed.

OK, now I'm going to get down to it more technically, but I hope you're starting to see the view I'm presenting: we really need to learn from new predictions, so that we can learn as we go along, and I think that's ubiquitous across the different kinds of learning we're looking at in AI. I'm going to use notation a little different from what we heard earlier today — my new notation, which I use in the second edition of the reinforcement learning textbook. The big thing is the statisticians' convention that random variables are capital letters and instances are lowercase letters. All of the things that happen, that make up life, are capital letters, because they are the random events that have actually happened — not possibilities, but whatever actually occurred. So S_0 is the first state, A_0 is the first action taken in that state, and R_1 is the first reward, which depends on that state and on taking that action in it. At the same time as we get the reward, we get the new state, so I like to give them the same time index: R_1 and S_1 occur together; they're jointly determined, in fact. And then life goes on. That's the data; that's all we have — trajectories, maybe a single trajectory. Classically we're interested in the return, and the return is a sum of rewards. I'm not going to use capital R for the return, because capital R is the actual rewards, the sequence of rewards, so to define the return I need a new letter. I'm calling it capital G, because it's a random variable.
Whatever the sum of future rewards after time t actually turned out to be, we call G_t; G_t is the return. As noted here, the dot over the equals sign means this is a definition — not a statement that's true because it follows from other things I've said, but a definition. So G_t, the return, is defined as the sum of the future rewards, discounted by the discount rate γ, which is the most common way of handling it. And that can be rewritten — now as a real equality, not a definition — as the first reward plus γ times the sum of all the later rewards: we take one factor of γ out of each later term (this one had no γ, this one had one, this one had two, and so on), and what remains is the sum of future rewards from one time step later — which is just the same kind of thing we started with: it's G_{t+1}. So this is a true equality: any return can be written as the first reward plus γ times the next return. That is going to be the basis for our temporal difference learning, because we're going to use it as a target: the next reward plus γ times the next return, essentially, will be our target — that gets explained right next.

Next we look at a state-value function. Since we're using capital letters for random variables, I can't use capital V for the true value function; it has to be lowercase, and it's a function of the policy. So v_π(s), where s is any particular state — an instance, lowercase s — is the expectation of the random variable, the return, given that we start in state s. It's what we can expect the return to be under π, the policy — some way of picking actions. Of course the value of a state depends on what you do: if you dance at the top of the Grand Canyon it might be bad, but if you just walk up to the railing it's fine. So values depend on policies. Then, using the equation above, the return inside that expectation can be written as the first reward plus γ times the rest of the return, and since we're taking an expectation, that becomes the expected next reward plus γ times the expected value of the next state. That naturally leads to the notion of an error: we can compare the estimated value of a state at some time to the reward plus the estimated value of the next state. That's going to be our TD error — what we use to replace the normal, conventional error. Note that this V is a random variable: our estimates depend on what happens, so the estimated value function is random, and that's why it's a capital letter. And that's our TD error. Is that clear? Good.
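Collecting the definitions just given, in the notation of the talk:

```latex
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = R_{t+1} + \gamma G_{t+1}
    \qquad \text{(the return and its recursive form)}

v_\pi(s) \doteq \mathbb{E}\!\left[\, G_t \mid S_t = s \,\right]
    \qquad \text{(the value of state $s$ under policy $\pi$)}

\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
    \qquad \text{(the TD error, formed from the current estimates $V$)}
```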
OK, so now let's talk about our methods. I want to contrast supervised learning — remember, in this context supervised learning is called Monte Carlo. What exactly is that? We take our estimated value function for the state we run into, V(S_t), and we update it based on some experience. Here's the experience: here we are at S_t, and this is the tree of things that might happen — we might pick either of these two actions, and if we picked this action, either of these two states might arise. The black dots are actions and the open dots are states, so this is the tree of all the futures that might happen. In this case we're imagining there are terminal states — we add things up until we reach a terminal state. Here is a particular trajectory that might happen; say it did happen: from that state we went this, this, this, this, and then we terminated. Once we've terminated, we know what the return is, and we can do this update rule: we compare our estimate for the state at time t, up here, to the actual return, form that error, and increment toward the target — alpha here is the step size, a number like 0.1. That's the standard Monte Carlo learning rule, a supervised learning rule, and it's the competition for the TD method.

The simplest TD method looks like this instead: we only look ahead one step. We're at S_t, we look ahead one step, we see the reward that happens and the next state that happens, and from those we form the TD error, which again updates the estimate for the state at time t. We make an error between what we were guessing for that state and the new target: the reward plus γ times the estimated value of the next state.

Now, you've probably also heard about dynamic programming, and you can put dynamic programming in the same kind of figure. The dynamic programming version looks like this: it doesn't consider a single line through the tree of possibilities; it considers all the possibilities — both actions and both possible next states. This is where you need a model of the world, because although you know how you pick your actions, the probabilities of the possible next states are known only to the world; in dynamic programming you assume you know all of that. In dynamic programming, the estimated value of a state is moved toward the expectation of the first reward plus γ times the value of the next state. There's this expectation, and that's what makes it dynamic programming: it assumes you know all the probabilities, so you can compute the expectation. But that doesn't give you the answer directly, because your value is still learned from a guess — your new estimate still comes from your old estimates.

So really we can say the following: what's special about TD methods is that they bootstrap and sample. Bootstrapping is the idea that your target involves a guess, an existing prediction. The whole point of Monte Carlo is that it doesn't bootstrap: it looks all the way to the end and sees what the return is; no estimates play a role in the target. Dynamic programming does bootstrap: it looks ahead one step, takes the expected reward and the expected value of the next state, and backs them up, so you use your estimates and your estimates gradually get better. TD, of course, also uses your estimates. And — this is my next point — Monte Carlo and TD are learning methods: they sample what happens, because you don't know how the world works. Dynamic programming does not sample; it uses the expectation — it assumes you know what could happen.
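Here is a minimal tabular sketch of the two update rules just contrasted — my own code, assuming an episode is given as a list of (state, reward, next_state) transitions, with next_state set to None at termination:

```python
from collections import defaultdict

def monte_carlo_update(V, episode, alpha=0.1, gamma=1.0):
    """Wait until the end, compute the return G for each visited state,
    then move each estimate toward the actual return."""
    G = 0.0
    for state, reward, _ in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])

def td0_update(V, episode, alpha=0.1, gamma=1.0):
    """TD(0): after each step, move V(s) toward r + gamma * V(s'),
    using the next prediction (a guess) as the target."""
    for state, reward, next_state in episode:
        target = reward + (gamma * V[next_state] if next_state is not None else 0.0)
        V[state] += alpha * (target - V[state])

# Example: one short episode S0 -> S1 -> terminal, with rewards 0 and then 1.
V = defaultdict(float)                 # value estimates default to 0.0
td0_update(V, [("S0", 0.0, "S1"), ("S1", 1.0, None)])
```

In practice the TD update would be applied online, after each transition, rather than after the episode is over — which is exactly the point being developed here.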
So those are the two basic dimensions: whether you sample, and therefore learn, and whether you bootstrap — building your estimates from other estimates, learning guesses from guesses. And TD prediction is basically the update you saw before: the Monte Carlo rule is here and the TD rule is there; the contrast is just that in one the target is the actual return, and in the other the target is this one-step estimate of what the return will be.

Now let's do an example. Here I am, coming home after working a hard day at the office, and I'm trying to guess how long it will take me to get home. I'm leaving my office; it's Friday at six o'clock; I have some other features; and I make a guess of how long it will take — I guess it will take 30 minutes to get home. That's my prediction of the total time, because I'm just starting now and my elapsed time is zero. Then I come out of my building, go to the parking lot, and I see it's raining. When it's raining, it takes longer, because everyone drives slower in the rain. So I think: first, I've already spent five minutes just getting from my office down to the parking lot, and second, I now think the rest will take longer — 35 minutes from now, for a total of 40 minutes. The first thing I want you to see is that my guess of the total time to get home is constantly changing as I get more information; I revise my estimates. To carry the example through: I get in my car, I drive on the highway, and it turns out it didn't take as long as I feared; I've spent 20 minutes total now and I think it will take only 15 more, so that's 35 minutes total — this column, my total estimate, goes up and down. Then I get stuck behind a truck on a secondary road, so I think it's going to take longer. Then I reach my home street, I think it will take 43 minutes total, and it does take 43 minutes.

So that's a possible trajectory, and I want you to ask what you might learn from it. If you're doing a Monte Carlo method, you just say: it took 43 minutes to get home, and that's the answer. My initial estimate of 30 minutes gets moved toward 43 minutes — that's the error — and in fact all of my estimates get moved toward 43 minutes (or toward whatever time actually remained at that point), because whatever guess I made at each moment should be moved toward what actually happened. Now, if you're using a TD method — learning a guess from a guess — something very different happens; even some of the signs of the updates change. My first prediction will still move up, because I started at 30 and, after finding out it's raining, the next prediction is higher; but this one, for example, will move down. All the errors are different. In the long, long run it all washes out, but the actual learning is very different.
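With illustrative numbers (the 30, 40, 35, and 43 come from the talk; the estimate while stuck behind the truck is my assumption), the two kinds of errors look like this:

```python
# Each entry is the predicted TOTAL trip time at that moment:
# leaving office, raining, off the highway, behind the truck (assumed), home street.
predicted_totals = [30, 40, 35, 40, 43]
actual_total = 43

mc_errors = [actual_total - p for p in predicted_totals]      # every estimate moves toward 43
td_errors = [predicted_totals[i + 1] - predicted_totals[i]    # each estimate moves toward the next one
             for i in range(len(predicted_totals) - 1)]
print(mc_errors)   # [13, 3, 8, 3, 0]  -- all moves are upward (or zero)
print(td_errors)   # [10, -5, 5, 3]    -- note the sign change once the highway goes well
```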
Now I also want you to think about the computational consequences. If you're doing TD, then when you're here and you move to the next state, you get an error and you can update right away: why did I make that prediction, what were my features, how should I change them, what are the contents of my deep network that led to that prediction and how do I change those — and that's true at every step. As you go from here to here to here, you can update this guy and then forget about him, because you're never going to update him again. Whereas in Monte Carlo you have to remember why you made each one of these predictions until you get to the end, and then go back and say, OK, why did I make that one, and adjust its weights, with knowledge of its feature vector, of your network, and so on. And the distribution of computation is terrible: all this time you're doing nothing — you're driving home and you can't do any learning; you can only wait until the end, when you know the answer, and then go back to all the earlier predictions and learn them. So the distribution of computation is poor, the memory requirement is poor — it's just inconvenient. It's much more convenient to learn as you go along. Think about it: you're in your car, trying to drive home, you get stuck behind a truck, and you say, "this is bad, it's going to take longer, I was too optimistic before." You don't say, "well, maybe the truck will disappear," and you don't withhold judgment. You could hold judgment until you get home, but my feeling is that I learn as I go along, responding to what I see — and we actually do learn as we go along.

So, I think I've said these things: with TD, unlike Monte Carlo, you can be fully incremental, learning as you go along. You can learn before you know the final outcome, which means you need less memory and less peak computation — you don't do it all at the end. You can even learn if you never find out the final outcome: maybe you get a phone call, you're called away for something important, and you never find out how long the trip would have taken — but you can still learn. Now, when you do the math, both of these methods converge, so the only question is which is faster. That's the only question, but it's a big question. So let's do a simple experiment to find out.

Here's a trivial experiment — famous, well, not famous; I did it a long time ago — just a random walk, and it's meant to test which of the two is better. We have five states, and we keep an estimate, for each state, of what the outcome will be. The random walk takes steps right and left: you start in the middle and go back and forth, back and forth, until you exit off one side. If you exit on the left you get a zero — and you get zeros all along the way for your rewards; the only way you get a nonzero reward is if you exit on the right side, where you get a reward of one. Are you with me? So what's the correct prediction? There's no discounting here — γ is 1 — so we're just trying to predict the sum of the rewards up to the end. What's the correct prediction for the start state, C — the expected value of the return from C?
If you wander back and forth and end on the right side, the return is 1; if you end on the other side, the return is 0. You start in the middle — what do we expect the return to be? By symmetry, it's 0.5. State B is going to be less than 0.5, and state A still less. Anyone want to guess the true values of all the states? C is definitely one half. What do you think B is? One third — yes, it is one third, and the next one is one sixth. They just go up by sixths: 1/6, 2/6, 3/6, 4/6, 5/6. Those are plotted here — this line is the true values, so state A has a true value of 1/6, state B 1/3, C 1/2, and so forth — and the other lines are the estimated values from applying TD.

With TD, you do have to care about the initial conditions, because it's learning a guess from a guess: your initial guesses affect things — they either pollute things or brilliantly provide good starting values. Here the initial guess, at episode zero, is one half for every state: since the possible returns are zeros and ones, it seemed reasonable to start all the estimated values at a half. That happens to be exactly right for the middle state, but quite a bit wrong for the others. Then we run episodes, learning on every time step, updating the states according to the TD rule. After one episode we have this darker line — the curves are labeled by episode number. So what do you know happened on the first episode? Think about what the TD rule does. Say we start in the middle and move around: what is the TD error going to be? It's the reward — which is 0 during the walk — plus γ times the next value minus the current value, and γ is 1, so forget about γ; it's basically just the change in value. If you go from this state to that state, what's the change in value? Zero, because all the estimated values are one half: we go from a state whose estimate is a half to another state whose estimate is a half. So as we bounce around, nothing happens until we run into one of the ends. If we had run into the right end, we'd go from a state valued at one half to the terminal state — and the terminal state, by definition, always has value zero — while getting a reward of one, so the error would be 1 + 0 − 1/2 = +1/2, and the estimate of the state next to that end would go up. That didn't happen here: its estimate is still at one half. Instead, we ended on the other side: we went from a state with estimated value one half to the terminal state with value zero, with a reward of zero, so the TD error is −1/2 and that state's estimate moved down from 0.5 toward 0. You can even see how far it moved: from 0.5 to 0.45, which tells you the step size α was one tenth.
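Here is a sketch of that experiment in code — my own reconstruction, using the settings just described (all estimates start at 0.5, α = 0.1, no discounting):

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE = np.array([1/6, 2/6, 3/6, 4/6, 5/6])    # true values of states A..E

def run_td(n_episodes, alpha=0.1):
    V = np.full(5, 0.5)                        # initial guesses of one half
    for _ in range(n_episodes):
        s = 2                                  # start in the middle state, C
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            if s_next == 5:                    # exited on the right: reward 1, terminal value 0
                V[s] += alpha * (1.0 + 0.0 - V[s])
                break
            elif s_next == -1:                 # exited on the left: reward 0, terminal value 0
                V[s] += alpha * (0.0 + 0.0 - V[s])
                break
            else:                              # ordinary step: reward 0, bootstrap from V[s_next]
                V[s] += alpha * (0.0 + V[s_next] - V[s])
                s = s_next
    return V

for n in (1, 10, 100):
    v = run_td(n)
    rms = float(np.sqrt(((v - TRUE) ** 2).mean()))
    print(f"after {n:3d} episodes: {np.round(v, 2)}  RMS error {rms:.3f}")
```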
You can see this algorithm is very simple. As you run more episodes you get closer and closer to the true values: after 10 episodes you get the blue line; after 100 episodes you're close to the true values. You never get there exactly, because there's always randomness in the individual episodes and α is nonzero — it's 1/10 — so you keep bubbling around the true values. That's the example.

Now let's compare Monte Carlo versus TD on this problem. We have to draw whole learning curves now, and we have to worry about the value of the step size. What I'm showing is a learning curve: the x-axis is the episode number, and the y-axis is a measure of error — actually the root-mean-square error, averaged over the five states and over many repetitions of the whole experiment, I think 100 runs. As I said, the curves go down — everything gets better over time — but they won't go to zero, because with a step size of 1/10, or whatever it is, there's always going to be some residual error. Which one should we look at first? Maybe Monte Carlo; it's simpler: you just wait until you know the final return and then do your update for all the states that were visited. If you take a nice slow learning rate of one hundredth, you just gradually move down toward zero error — and this α = 0.01 curve will actually get closest to zero in the very long run, but it's very slow. So you might want to go faster: with α = 1/50 you go down faster but start to bubble, and if you go as fast as 0.04 the initial part is fastest but now you're definitely bubbling. And you can't really do better: there's no step size that beats the ones shown here — if you go faster you may gain a little at the very beginning, but you'll level out at a higher error. For TD we see a similar pattern, but all the numbers are lower. The slowest one I'm showing is α = 0.05, which goes slowest but gets lowest; the faster ones bubble more and don't get as low in the long run.

Now, some of you may be wondering what's going on with the TD curves, because they seem to go down and then start to come back up. Anybody wondering that? Yeah, it's weird, isn't it — and it's real; it's not a bug in my program, that's the first thing to be sure of. It has to do with the fact that starting the estimates at one half is not stupid — it's actually a reasonable guess. If you started all the estimates at something really bad, you wouldn't see that bounce. We get the bounce — we go down, and then we come up a little bit — and that bounce is really interesting: the initial estimates have some effect in TD, whereas they don't, at least not as much, in Monte Carlo. So this is just one random walk, and I haven't been systematic about it, but the big picture is that TD is faster. There's a bounce, OK, whatever — but it's still much faster.
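And here is a self-contained sketch of that comparison — again my own reconstruction, with illustrative step sizes and run counts — measuring root-mean-square error against the true values for both methods:

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE = np.array([1/6, 2/6, 3/6, 4/6, 5/6])

def episode():
    """One random walk from the middle; returns the visited states and the final outcome."""
    s, states = 2, []
    while 0 <= s <= 4:
        states.append(s)
        s += 1 if rng.random() < 0.5 else -1
    return states, (1.0 if s == 5 else 0.0)

def rms_after(method, alpha, n_episodes=100, n_runs=100):
    total = 0.0
    for _ in range(n_runs):
        V = np.full(5, 0.5)
        for _ in range(n_episodes):
            states, outcome = episode()
            if method == "MC":                          # wait for the outcome, update every visit
                for s in states:
                    V[s] += alpha * (outcome - V[s])
            else:                                        # TD(0): bootstrap from the next estimate
                for i, s in enumerate(states):
                    v_next = V[states[i + 1]] if i + 1 < len(states) else 0.0
                    r = outcome if i + 1 == len(states) else 0.0
                    V[s] += alpha * (r + v_next - V[s])
        total += np.sqrt(((V - TRUE) ** 2).mean())
    return total / n_runs

for method, alphas in (("MC", (0.01, 0.02, 0.04)), ("TD", (0.05, 0.1, 0.15))):
    print(method, {a: round(rms_after(method, a), 3) for a in alphas})
```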
But this is just one problem, just a random walk. Maybe there's something special about the random walk; maybe if I ran Atari games I'd get a more fundamental result — but I like to do simple things. Question? [Audience question about convergence.] No — they all converge, but with a nonzero step size that you don't reduce, you don't expect anything to converge exactly; they converge in the mean. All of them converge to a mean level that depends on the step size: a higher step size gives a higher convergence point, a lower step size a lower one. The convergence properties are roughly the same in both cases.

Now I want to ask whether I can say anything about the general case, not just the random walk. I'm going to get to that in a minute, but first I'm going to do the random walk again under a setting I call batch updating. Batch updating means we take some training set — a hundred episodes, or ten, or whatever — and present it over and over again until the learner converges. Even with a finite step size we get complete convergence if we repeatedly present the same training set, because there's no fresh randomness — it's the same data over and over — and the two methods, TD and Monte Carlo, each converge, but to two different things. This is for constant α, and as long as the step size is small enough that you don't diverge, the answer you converge to doesn't depend on the step size. The two algorithms converge to different things, so we can ask, on this problem, which one converges to the better thing when the data is presented over and over again. Here are the results on the random walk: we have training sets of different sizes, increasing along this axis, and in each case — say a training set of 50 episodes — we present those 50 over and over until convergence and measure the asymptotic error. That error is now independent of the step size, so I've eliminated the step-size effect and I get a clean measure of which algorithm gives the better result on this problem. And TD is better by this measure. Now, we're doing a lot more computation — we have to go to convergence, we have to repeatedly present things, none of which I like — but it gives us insight into the real difference between the two algorithms: TD is moving toward a better place, even on a single presentation, as the earlier results suggested, and if you present the data over and over you can see it clearly.

But this, again, is all on the random walk, and you have to ask whether it happens on all problems. One approach would be to run all problems, and that's obviously not satisfactory. What can you do instead? You can try to prove a theorem, and you can also try to get insight. I'm going to try to get insight first, and then we'll do the formal result. So let's get insight by using ourselves as the predictor — I want you to be the predictor. Imagine you're having some experience: imagine you experienced a training set of these eight episodes, all very short.
Most of them are episodes like "B, 0": I see state B, I get a reward of 0, and the episode ends — or I see state B, I get a reward of 1, and that's the end of the episode. The only non-trivial episode is the first one: I'm in state A, I get a 0, then I go to state B, and from B I get a reward of 0, and that's the end of the episode. That's all the data you see — just these eight episodes — and I want you to tell me what predictions you would make. First question: what prediction would you make for state B? If you found yourself in state B, what would you guess for the expected return ahead of you? Three quarters — I agree, because I was in state B all eight times, six of them ended with a one and two of them ended with a zero, so you'd guess three quarters. That was the easy one.

What about state A? State A is much more uncertain: you've only been in state A once. Take a moment and think about it: if you find yourself in state A again, what is your estimate of its value? I'll say right away that this is a question with multiple good answers. I'd like someone to raise a hand and give me one answer, and why it's a good answer. [First answer from the audience.] If you always go from A to B — the only time you saw A, you went from A to B — and B has value 3/4, then A should also be 3/4. That makes sense. What's another good answer? [Second answer.] You've seen A once, and every time you saw it the return was zero, so you might predict zero.

Those two answers are Monte Carlo's answer and TD's answer. We could say zero: that's what Monte Carlo would say. Monte Carlo just looks at what happened: I was in A once, the outcome was zero, so I should predict zero. The other answer is what TD predicts, and it's also what most of you would predict. The gentleman explained what was going on in his head: he was building a model. He was saying: the only time I saw A, it went to state B — and by the way, the reward was zero on that transition — so let's guess that's what happens every time; and I saw B eight times, six of the eight went, with a reward of one, to the end, and two of the eight went to a zero. He was building this in his head, and it has a name: it's the maximum-likelihood model of the MDP. It just means what you get by counting — how often you go from here to there — and turning the counts into probabilities. That's the maximum-likelihood model of the underlying Markov process, and if you take this model and solve it — if this were the true world — then the true value of A is 3/4. This is the general phenomenon of what TD does: if you present a training set to it over and over again, it gets the answer you would get if you formed the maximum-likelihood model of the world from all the data and then solved that model with dynamic programming, or any other method — the true solution if that model were the reality.
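Here is the arithmetic of that example as a small sketch (my own code; the eight episodes are the ones just described):

```python
from collections import defaultdict

# Each episode is a list of (state, reward, next_state); next_state None means terminal.
episodes = (
    [[("A", 0, "B"), ("B", 0, None)]]   # the one episode that starts in A
    + [[("B", 1, None)]] * 6            # six episodes: B, reward 1
    + [[("B", 0, None)]]                # one more episode: B, reward 0
)

# Monte Carlo answer: average the observed returns from each state.
returns = defaultdict(list)
for ep in episodes:
    g = 0.0
    for state, reward, _ in reversed(ep):
        g += reward
        returns[state].append(g)
mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print("Monte Carlo:", mc)                       # A -> 0.0, B -> 0.75

# Certainty-equivalence answer (what batch TD converges to): build the
# maximum-likelihood model from the counts and solve it.
# From A we always went to B with reward 0, so v(A) = 0 + v(B); and v(B) = 6/8.
v_B = 6 / 8
v_A = 0 + v_B
print("Batch TD / ML model:", {"A": v_A, "B": v_B})   # A -> 0.75, B -> 0.75
```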
And that's why TD can be faster, can be better: it uses the Markov property. It says: I've gotten to B, and B is a Markov state, so it doesn't matter how I got to B — whereas Monte Carlo just says: I don't care what happened in between; I ended up with a 0.

To summarize: the prediction that best matches the training data is the Monte Carlo estimate. Remember the training data: you saw A once and it ended in a 0, so if you want to match the training data, the right prediction is that the value of A is 0 — that is the prediction that minimizes the mean squared error on the training set. But of course what I want to tell you is that we don't want to match the training data; we don't want to minimize the mean squared error on the training set. Weird, huh? It seems like we should, and that's why I've gone through this example with you at some length — so you have some intuition for why we don't. So what can I offer you instead? Minimizing the mean squared error on future experience — because we don't really care about the training set, the past; we care about the future. If we believe we have real states here, then the estimate that the value of A is 3/4 will actually be a better match to future data: if we get a new experience starting in A, it will probably — with 3/4 chance — end in a one. So we really have to distinguish between minimizing error on the training set and minimizing error on the future; they are different things, and TD can be faster because it takes advantage of the state property and matches future experience better. Now, from what I've said, you may also get an immediate sense of a possible limitation of TD methods: they take advantage of the state property, but in real life you don't normally have complete state knowledge. Any time you're using function approximation — here we're just using discrete states — you're going to have imperfect state information, so in the end it's going to be a mix, a question of which wins in practice. But in the end, I think, it's going to be TD that wins in practice.

OK, good time for questions. [Audience question about what is being predicted.] No — V is still predicting the sum of the rewards from that state to the end; remember, the example was: A is followed by a 0, which is followed by B, which is followed by another 0. OK, thank you for clarifying. [Audience question about whether the TD update becomes expensive with function approximation.] Let's think that through. Assume that instead of a table lookup — this has all been table lookup — we have a complicated neural network. Then, when we get a new error, we have to backpropagate through the network; we have to do a somewhat expensive computation. But should we consider that expensive? I'm going to say no, because backpropagating even one error through the network has the same order of complexity as a forward pass, and we already had to make a forward pass
We already had to make a forward pass in order to get the prediction in the first place, so the update is of equivalent complexity; even though it touches a whole bunch of weights, we should consider it cheap — it's linear in the size of the network. Good. Any other questions? Okay. So far I've only done one-step methods, tabular methods, model-free methods — all of these qualifiers can be generalized — but even in this simplest case we can see the basic ideas. One-step means we're looking from one step to the next step, rather than, say, five steps ahead as in A3C. And it's tabular; tabular is easy to think about, but all the ideas really do generalize to the network case, to complicated function approximators. The basic things we've seen: we bootstrap and we sample, combining aspects of dynamic programming and of Monte Carlo; TD methods are computationally congenial — just a little bit of work on each step, and you don't have to wait until the end and then do a whole bunch of work; and if the world is truly Markov, then TD methods are faster — that's what we saw, and it has to do with past data versus future data. Now, before I go on to something new, I'd like to summarize where we are pictorially. Today we've contrasted TD methods — a one-step TD method is this little backup diagram, which I use to summarize the math exactly: there's the value I'm updating, I go ahead one action and one next state, and I use my estimate there to improve the value at the top. It's a picture of the algorithm. And there's the same kind of picture for Monte Carlo: to improve the estimate of this value, you go ahead action, state, action, state, all the way to the end, you see the final outcome, and then you back all of that up. This is a dimension, and you can occupy intermediate points along it: two-step methods, three-step methods, four-step, five-step; Monte Carlo is like an infinite-step method where you go all the way to the end of the episode. And then there's the parameter lambda, which you might have heard about in TD(lambda) — the eligibility-trace parameter. It's really a bootstrapping parameter: it's not the number of steps, but it's analogous to the number of steps. Either way, this really is a dimension, and we can occupy any point along it. Now there's a second dimension: are we going to use planning — are we going to use knowledge of a model of the world? Dynamic programming is this corner. It still does one-step backups — you look ahead only one step and use your estimates at that one step — but instead of doing a sample backup, you consider all the possibilities; that's dynamic programming, moving along the top while keeping these short one-step backups. And then there's a fourth corner, the analogue of Monte Carlo but with planning: it's like exhaustive search, where you consider all the possibilities all the way to the end. So we get these four corners as classic methods, and we can occupy the area in between them, and that's a big space of reinforcement learning methods — although it's certainly not the whole space.
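As a small illustration of the intermediate methods along the bootstrapping dimension just described, here is a sketch of the n-step return target that sits between one-step TD (n = 1) and Monte Carlo (n at least the remaining episode length). This is my own illustration, not code from the talk; it assumes `rewards[k]` is the reward received on leaving `states[k]`, that the episode ends when the reward list runs out, and that `V` is a dict of state-value estimates.

```python
def n_step_return(rewards, states, t, n, V, gamma=1.0):
    """Target G_{t:t+n}: n (discounted) rewards, plus the bootstrapped value
    of the state n steps ahead if the episode hasn't ended by then."""
    T = len(rewards)                       # the episode terminates after step T
    g = 0.0
    for k in range(t, min(t + n, T)):
        g += gamma ** (k - t) * rewards[k]
    if t + n < T:                          # only bootstrap if we stopped short of the end
        g += gamma ** n * V[states[t + n]]
    return g

# n = 1 gives the one-step TD target; any n >= T - t gives the Monte Carlo return.
```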
Now, where should we go next? I don't have time to present everything I wanted to say — we've done well as a group here, and I can almost wrap up and talk about the future from this point — but let me just tell you some of the things we might talk about if we had more time. First, instead of estimating state values we would talk about estimating action values, because, as you know, for control you really want action values. It's not that different: you estimate q_pi instead of v_pi — lower case, because that's the true value function — and since it's tabular I can write big Q(S, A) for the estimate of each state-action pair. Then the update is essentially the TD update, just done on state-action pairs rather than on states. It's still a TD error: here is my original estimate, and here is the estimate for the next state-action pair. This is the Sarsa algorithm — quite straightforward — and the Sarsa algorithm is just that update rule done over and over again. Then there's Q-learning. Q-learning is almost the same: the new state-action value estimate is the old one plus the step size times a TD error, but the TD error is a little bit different — we compare our original estimate to the reward plus something about the next state, and that something is the max over our possible actions at the next state. That's Q-learning. I like to draw this picture: I'm updating a state-action pair, I look at all the possible estimated action values at the next state, take the max of them — that's this part of the rule, the max over all possible a — and then I back up the max to improve the estimate at the top. Q-learning is a TD algorithm with that particular target. And here's a nice example — the cliff-walking example — comparing Sarsa and Q-learning. Sarsa is an on-policy method, Q-learning is an off-policy method, and we see that here, where the y-axis is reward per episode, Sarsa actually works out better. I don't have time to explain this example properly, and I didn't want to show something you wouldn't really understand, so it's better to skip over it. Okay, here's one sort of new algorithm: Expected Sarsa. If we look at the picture, Q-learning takes all the possible things you might do and takes the best of them, the max — the arc in the diagram means max, and no arc means expectation. In Expected Sarsa you don't take the best of the things you might do; you take the expectation based on how you would actually choose them according to your policy. So we sum over all the actions we might take, weighted by how likely we are to take each under our policy — which we know; we know our own policy, maybe it's epsilon-greedy — and we back up that expectation, each action value times our probability of taking that action. That's arguably an improved version of Sarsa, and it can also be made into an off-policy version — there are some other novelties you can do with an off-policy version of Expected Sarsa. Now, I've used the word off-policy a couple of times without explaining it — I'm sorry about that. Off-policy means you're learning about a policy that's different from the one you're following, and on-policy means you're learning about the same policy as the one you're following — the same one that's generating the data.
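For reference, here are the three one-step control updates just described, written on a tabular Q (a dict keyed by (state, action) pairs). This is my own sketch, not code from the talk: the function and argument names are invented, terminal transitions are omitted for brevity, and `policy_probs` stands in for whatever policy (e.g. epsilon-greedy) is being evaluated.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=1.0):
    # Sarsa: bootstrap on the action actually taken next (on-policy).
    td_error = r + gamma * Q[(s2, a2)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=1.0):
    # Q-learning: bootstrap on the best next action (off-policy, greedy target).
    td_error = r + gamma * max(Q[(s2, a2)] for a2 in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * td_error

def expected_sarsa_update(Q, s, a, r, s2, actions, policy_probs, alpha=0.1, gamma=1.0):
    # Expected Sarsa: bootstrap on the expectation over next actions under the policy.
    expected = sum(policy_probs(s2)[a2] * Q[(s2, a2)] for a2 in actions)
    td_error = r + gamma * expected - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
```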
The way to remember it is that "on-policy" is almost "one policy": in on-policy methods there is only one policy — the policy you're following is the policy you're learning about. But very often you want these to be different: you might want to behave in a way that's more exploratory while learning about the optimal policy. If you're going to learn the optimal policy but get your data in an exploratory way — which is just not going to be optimal — then you have two policies, and you're in the realm of off-policy learning. Q-learning does this, but off-policy learning is theoretically more difficult and more challenging for our methods. Okay, that's off-policy. So I've basically just extended these ideas to control, and we've seen some methods that can handle the on-policy case and the off-policy case; we didn't talk about double Q-learning. Now, I've talked a lot about whether you want to use Monte Carlo or TD, and there's a sense in which you don't have to choose, because with an algorithm like TD you can parametrically vary lambda — vary the height of your backups — to get any intermediate point between one-step TD and Monte Carlo; you can get both. A neat way of doing this is with the parameter lambda, the bootstrapping parameter, which I haven't really talked about, but it is a way to vary parametrically between TD and Monte Carlo. If lambda equals zero — the left side of all these graphs — that's pure TD, pure bootstrapping; if lambda is one, you're not doing any bootstrapping, and it's Monte Carlo. All these graphs have lambda across the bottom, going from pure bootstrapping to no bootstrapping, and a measure of performance on the vertical axis where, in every case, lower is better — think of mountain car, where you want few steps to get to the top of the hill. What you see is that performance depends on lambda. This is a random walk, and the best is actually not at lambda equals zero: pure TD is not the best; you can do better with some intermediate amount of bootstrapping between pure TD and Monte Carlo. But if you go all the way to lambda equals one, things get really bad — that's roughly the worst case in general, and that's the pattern. Lambda equals one is Monte Carlo, and Monte Carlo has really high variance; it more or less has to be ruled out — it's not a happy place to be if you're committed to pure Monte Carlo. With TD you can pick any point in between, and you want that facility of doing some bootstrapping. That's some evidence for it, even though this is old data — I think even Peter would agree that Monte Carlo in its pure form is not really an efficient strategy.
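To make the lambda dial concrete, here is a sketch of tabular TD(lambda) with accumulating eligibility traces — the backward-view algorithm the bootstrapping parameter belongs to. The environment interface (`env.reset()` returning a start state, `env.step()` returning next state, reward, and a done flag while following some fixed policy), a `defaultdict` value table, and all names are assumptions of mine for illustration, not an API from the talk.

```python
from collections import defaultdict

def td_lambda_episode(env, V, alpha=0.1, gamma=1.0, lam=0.9):
    e = defaultdict(float)           # one eligibility trace per state
    s, done = env.reset(), False
    while not done:
        s2, r, done = env.step()     # environment follows its own fixed policy
        target = r if done else r + gamma * V[s2]
        delta = target - V[s]        # the usual one-step TD error
        e[s] += 1.0                  # bump the trace of the current state
        for x in list(e):            # every recently visited state shares in the error
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam      # traces fade by gamma * lambda each step
        s = s2
    return V

# lam = 0.0 reduces to one-step TD(0); lam = 1.0 does no bootstrapping and
# behaves like an (online) Monte Carlo update by the end of the episode.
```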
Now I wanted to give one slide — while also leaving time for questions — to the linear case, the case with real function approximation. I'm not going to go to nonlinear networks, but I want to go to something we can say a lot about theoretically, which is the linear case. So suppose we're doing linear function approximation. Linear means the estimated value is formed as an inner product between a parameter vector — a weight vector — and a feature vector: φ_t is the feature vector for the state at time t, and θ is the parameter vector; you can think of θ as the weights of your network, if it were a linear network. We take the inner product — the transpose just means inner product — so θᵀφ_t is our estimated value of the state at time t, and θᵀφ_{t+1} is our estimated value at time t+1, so R_{t+1} + γ θᵀφ_{t+1} − θᵀφ_t really is a TD error, and this is the TD rule: the new parameters are the old parameters plus a step size times the TD error times the gradient of the prediction with respect to the parameters — and in the linear case that gradient is just the feature vector φ_t. So the rule is θ_{t+1} = θ_t + α (R_{t+1} + γ θᵀφ_{t+1} − θᵀφ_t) φ_t. That should be fairly familiar to you by now: it's the TD rule used in a stochastic-gradient-descent style. Lots can be said about that, but that's standard linear TD(0). If you look at it, you can of course rewrite it by carrying the φ_t inside — with a little transposing — so that in expectation the new parameter vector is the old one plus a step size times a certain quantity. What is that quantity? I'm just going to give names to its pieces: it's a vector minus a matrix times θ. The vector, which I'll call b, is the expected value of R_{t+1} φ_t — a well-defined vector; you don't know it, but it's there. The matrix, which I'll call A, is the expectation of the outer product of the feature vector with the change in the feature vector, E[φ_t (φ_t − γ φ_{t+1})ᵀ]. So the expected update is α (b − A θ), and I'm interested in the fixed point: where will this settle, where will the expected update be zero? It's zero when b = A θ, that is, when b − A θ = 0. The θ for which that's true is a special thing; I call it θ_TD, because it's the fixed point that linear TD converges to. And you can just compute it: b = A θ_TD, so take the inverse of A, and the TD fixed point is θ_TD = A⁻¹ b — which, by the way, is the key to another algorithm, least-squares TD: estimate A directly, take its inverse, estimate b directly, and multiply them together; that's how least-squares TD works. But this way of computing what the algorithm converges to also lets you say something theoretical about it. The guarantee is that the mean square value error — a measure of how far off the values are — at θ_TD is bounded by an expansion factor times the minimum mean square value error, the error of the best θ. So TD doesn't find the best θ, but it gets within an expansion of it. Anyway, that's what the theory would look like if we had more time to talk about it.
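Here is a small sketch of that fixed point computed the least-squares-TD way: estimate A and b from sampled transitions and solve θ_TD = A⁻¹b. The data format (a list of (phi, reward, next_phi) triples, with a zero feature vector standing in for the terminal state) and the tiny ridge term are my own assumptions for illustration, not details from the talk.

```python
import numpy as np

def lstd(transitions, gamma=1.0, eps=1e-6):
    d = len(transitions[0][0])
    A = np.zeros((d, d))
    b = np.zeros(d)
    for phi, r, phi_next in transitions:
        phi, phi_next = np.asarray(phi, float), np.asarray(phi_next, float)
        A += np.outer(phi, phi - gamma * phi_next)   # sample estimate of E[phi (phi - gamma phi')^T]
        b += r * phi                                 # sample estimate of E[R phi]
    A /= len(transitions)
    b /= len(transitions)
    # theta_TD = A^{-1} b; a small ridge term keeps A invertible on tiny samples
    return np.linalg.solve(A + eps * np.eye(d), b)
```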
Now I just want to mention quickly some of the frontiers, the things people are working on now. Off-policy prediction is a big area: people are trying to generalize the theory to the off-policy case. We'd also like theory for the case of nonlinear function approximation — there's just a little bit of that — and there's very little convergence theory for control methods, period; I think Csaba will maybe talk about that tomorrow. We'd also like to say things beyond convergence: how well can you do in a finite amount of time, how fast do you converge. When you combine TD with deep learning, a lot of different issues come up, and I think there's just a lot of uncertainty. Do we really need a replay buffer? One of the folk theorems is that otherwise you get instability and "correlation", so we need this thing called the replay buffer — but I think there are lots of open questions about what happens when we combine TD with deep learning. And finally, the idea of predicting things other than reward — remember, I started with that. TD is a general multi-step prediction method, and we want to use it to predict other things; in particular, we want to use it to learn a model of the world. So in conclusion, I guess what I want to say is something like this: TD learning is a uniquely important kind of learning. Maybe it's ubiquitous; maybe we're always going to be using it — I think this may be true, but it's a hypothesis. It's learning to predict, which is perhaps the only scalable kind of learning. It's a kind of learning specialized for general multi-step prediction, which may be the key to perception, to modeling the world, to the meaning of our knowledge. Its key idea is to take advantage of the state property, which can make it fast and efficient, but can also make it asymptotically biased. Its key claim to fame is that it's computationally cheap, computationally congenial, and we're only beginning to explore the different ways of using it for things other than reward. Thank you very much. [Applause]
Info
Channel: Wei Wei
Views: 10,259
Rating: 4.8762889 out of 5
Id: LyCpuLikLyQ
Length: 86min 24sec (5184 seconds)
Published: Thu Aug 31 2017