Let's get going. Today we're going to start to talk about model-free prediction. What that means is taking something like an environment that could be represented by a Markov decision process, except that no one gives us the MDP, and we still want to solve it. Up to this point you should rightly be wondering: this is all very well, but what if I don't know the exact equations governing the way my factory operates? What if no one tells me how the pollution level coming out of the stack really depends on the torque I put into the engine, and so forth? What do you do in those situations? That's what we start to understand in this class, where we'll talk about model-free prediction: methods where no one tells us the environment, and the agent still has to figure out the optimal way to behave. We're going to start by introducing the basic ideas, and then there are two major classes of model-free prediction that we're going to talk about. First, Monte Carlo learning, which in a nutshell means methods that go all the way to the end of a trajectory and estimate the value by just looking at sample returns. Then we'll talk about another family of methods which is also model-free and can be significantly more efficient in practice: temporal-difference learning. These are methods that just look one step ahead and then estimate the return after one step. What we'll see towards the end of the class is that we can actually unify these approaches: there's a whole spectrum of methods known as TD(λ) where we can go any number of steps along the trajectory to come up with a viable estimate of what the value function is for our problem. So that's the goal. I anticipate we'll probably get to about there; there are some details deferred to an appendix, which give the formal proofs of the equivalence between Monte Carlo learning and TD learning, so that might be left for you to look at in your own time after this lecture.

Okay, so let's start with the introduction and take stock of where we're up to. In the previous lectures we saw the definition of a reinforcement learning problem, and we formalised a large class of environments using a Markov decision process. Then last lecture we saw how to solve a Markov decision process, and by solve we meant find the optimal behaviour in that MDP, the behaviour that maximises the amount of reward the agent can expect to get from any state in that MDP. But this was all for a known MDP: someone had to plug in the dynamics of that MDP and the reward function, and then you could use dynamic programming to turn the handle, applying those iterative Bellman-equation backups again and again, and we could output the optimal value function and hence the optimal policy for that MDP. And what we did last lecture was break it down into two pieces: we started by using dynamic programming to evaluate a policy, and then in the second part of the lecture we saw how to use that as an inner loop, so that once we knew how to evaluate a policy we could find the optimal policy.
What we're going to do with model-free methods is something similar, except now we're going to give up on this major assumption that someone tells us how the environment works, which is really unrealistic for most interesting problems. Instead we're going to talk about model-free methods which go directly from experience, from the interactions of the agent with its environment, to a value function and hence a policy. But just like in the dynamic programming case, we're going to break it down into two pieces: policy evaluation, and then using our methods for policy evaluation to help us do control. In this lecture we're going to focus just on policy evaluation, just on prediction: if someone gives us a policy, how much reward will we get for that policy? And we're going to figure out how to do that entirely model-free, without anyone telling us the dynamics or the reward function of the environment. So now you can just run your factory, see how much reward you get, and estimate the value function directly from trying things and seeing what happens. Next lecture we'll use these core ideas to actually do control and find the optimal value function and hence the optimal policy for the MDP.

So there are two distinctions here. The first is between planning, which we did in the last lecture with a known model, and the full model-free reinforcement learning problem, where no one tells us the environment. The second distinction is between prediction and control: whether we're just evaluating a given policy, or trying to optimise for the best policy; that's this lecture and the next lecture respectively. So this lecture is model-free prediction: how to figure out the value function for an unknown MDP when someone gives us the policy.

Okay, so now let's talk about our major methods for doing this. You should be thinking, well, how can I possibly figure out the value function when no one even tells me how the world works? But we'll see it's actually straightforward. Our first method is Monte Carlo learning. It's not necessarily the most efficient method, but it's extremely effective and very widely used in practice. The idea is just to learn directly from episodes of experience: we don't need knowledge of the MDP transitions and rewards, so this is model-free. And we're going to look at complete episodes, so this is really suitable for episodic tasks, where you start at the beginning of your game or whatever it is, you play for some number of steps, and the episode always terminates. We go all the way through our episodes and use the simplest possible idea to estimate the value function, which is just to take sample returns. So we look at one episode where I got a return of five, and the next episode where I got a return of seven, and we can already estimate the value of our start state as six, the average of five and seven. We just look at sample returns and average over them.
That's Monte Carlo reinforcement learning. Now, there's a major caveat here, which is that this only works for episodic MDPs: you have to terminate to be able to form this mean. We'll see some other issues with Monte Carlo learning too, but nevertheless it's such a core idea that I think we should begin there. It helps us understand the key intuitions, and once you can do Monte Carlo learning you really can build up a value function, solve for that value function, and do all of reinforcement learning with this very simple idea.

So let's see how this actually works in a little more detail. What's our goal? We're trying to learn this value function, the expected future return from any state. What we're going to do is observe some episodes of experience under the policy we're interested in evaluating. If we want to know what happens under the random policy, we have our robot wander around randomly for a while; that's one episode; then we do another episode where it wanders over here; and then we look at those episodes of experience, the stream of states, actions and rewards that we sampled from our policy and from the environment. What we look at is the total discounted reward we got from each time step onwards. That's this G_t, the return: the reward from that time step onwards until the end of the episode, with some discount factor. This is going to be different for every time step in the episode: the return from the beginning of the episode includes everything all the way along the trajectory, whereas the return from halfway through only includes the rewards from halfway until the end.

What we're going to do now is estimate the value function by basically just taking this expectation. The value function is the expected return from time t onwards if we were to set our state to that particular point: if I set my state to here and look at the expected return from this point onwards, that's the definition of the value function. All we do in Monte Carlo policy evaluation, the simplest idea of Monte Carlo learning, is use the empirical mean return in place of this expected return. We replace the expectation with an empirical mean and collect as many samples as we can, from each state, of what happens from that point onwards. The only issue is how to do this when we don't get to reset our state back to that exact point repeatedly, and we want to figure this out for all states in the environment. How do we get the mean for all states just from trajectories?
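Just to pin down the notation from the slides before going on, the return and the value function being estimated are

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T, \qquad v_\pi(s) = \mathbb{E}_\pi\!\left[\,G_t \mid S_t = s\,\right],$$

and Monte Carlo policy evaluation simply replaces that expectation with the empirical mean of the sampled returns observed from state s.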
There are two different ways we can do this; they both work and they're both effective, but there's a subtle difference between them. The first approach is what's called first-visit Monte Carlo policy evaluation. To understand this, imagine you've got some loop in your MDP where you come back to the same state repeatedly. To evaluate a state, we consider only the very first time we visit it in each trajectory: even if I come back to that state later and collect more reward afterwards, we just look at the first time I arrived there. When I reach that state for the first time in an episode, I increment a counter saying how many times I've visited it, I add on the total return from that point onwards, and then I can just take the mean return from that state. So over each episode, we look at the first time we visited the state, we measure the total return we got from that point, and we take the mean over those returns. And all we're using is a very simple mathematical result, the law of large numbers, which says that when you take the mean of a bunch of i.i.d. random variables, that mean really does approach the true expectation. That tells us that if we see enough episodes, if we just randomly sample enough episodes, this very simple idea of Monte Carlo learning really will converge on the true value function for our policy as we get more and more samples. The only requirement is that we somehow visit all of these states, and we do that by sampling along our policy itself: we generate trajectories under our policy again and again, we look at the returns from the first time we visit each state onwards, and we average over those returns.

Any questions on that? Question from the audience: I'm slightly confused, because every time we enter the state we increment the counter; since it's the first visit, isn't it always one? No, because the counter persists over multiple episodes. What we're trying to do is get an average over multiple episodes, so this counter is counting, over a set of episodes, how many times we visited that particular state for the first time, and the other quantity is the total return accumulated over many episodes: we add on the complete return for each episode. It's just a very simple way of saying we take the mean of all of those returns.

Another question, about the law of large numbers: how quickly does it approach the mean? That's really the central limit theorem. You get roughly a normal distribution, and the error reduces with a variance of about one over n, where n is the number of visits to an individual state. That's completely independent of the size of the state space: you just need to see returns from that state enough times, and that fairly quickly reduces your variance. The only thing that matters is that we see those states enough times, so for each trajectory we see, we want to maintain this mean for every state in it. As long as your trajectories visit the states you're interested in, it depends only on the frequency with which your policy visits the states whose values you care about, and again that's independent of the size of the state space. That's one of the nice things about these model-free methods: unlike dynamic programming, where we do full sweeps, we're just sampling here, and sampling breaks this dependence on the size of the problem.
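Here's a minimal sketch of first-visit Monte Carlo policy evaluation in Python. The `env.reset()` / `env.step(action)` interface (returning next state, reward, done) and the `policy` function are assumptions made for illustration, not something specified in the lecture:

```python
from collections import defaultdict

def first_visit_mc_evaluation(env, policy, num_episodes, gamma=1.0):
    """Estimate v_pi(s) as the mean return observed from the first visit to s."""
    returns_sum = defaultdict(float)   # S(s): total return accumulated for state s
    visit_count = defaultdict(int)     # N(s): number of first visits to s across episodes
    V = defaultdict(float)             # current estimate V(s) = S(s) / N(s)

    for _ in range(num_episodes):
        # Roll out one complete episode following the policy we are evaluating.
        episode = []                                   # list of (state, reward) pairs
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            episode.append((state, reward))
            state = next_state

        # Record the first time step at which each state appears in this episode.
        first_occurrence = {}
        for t, (s, _) in enumerate(episode):
            first_occurrence.setdefault(s, t)

        # Walk backwards accumulating G_t = R_{t+1} + gamma * G_{t+1},
        # and update the running mean only at first visits.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_occurrence[s] == t:
                visit_count[s] += 1
                returns_sum[s] += G
                V[s] = returns_sum[s] / visit_count[s]
    return V
```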
Okay, so that's first-visit Monte Carlo; let's look at a very subtly different version. First, a quick question from the audience: how can you make sure that you visit the states? We'll talk about that much more next class, so I'm going to defer that question mostly to then. For now we're just doing policy evaluation, so all we care about is how good the states are that we visit under policy pi, and the way we make sure we see all the states we care about under policy pi is by following policy pi. Just by following the policy I'm trying to evaluate, we guarantee that we see all of the states that are reachable under that policy, and that's sufficient for now to make sure we get enough visits to all of those states. Your question is actually a very interesting one for subsequent lectures: how do you make sure you cover the whole state space when you're trying to find the best policy? That's the whole issue of exploration, one of the fundamental questions of reinforcement learning. But here we're just trying to evaluate this policy, so it's enough to run episodes with that policy. If I want to know what happens when I walk in a straight line towards the wall, all I need to do is keep following that policy of walking in a straight line towards the wall, and I'll see all the states which are of interest, and I can evaluate those states because I've run that policy already.

The second idea, very closely related, is every-visit Monte Carlo policy evaluation. Now, to evaluate a single state s, if we've got this loop again, the only difference is that we consider every visit to that state instead of just the first one. So if I go around my loop and come back to the state, I include both my original estimate of the return, including the loop, and also my second estimate from that state, from the second time I visited it. So in one episode you can increment the counter multiple times for the same state: each time you visit it you increment the counter and add on a different value of the return, based on the time step you were up to at each of those visits. Apart from that the algorithm is identical. We still use the law of large numbers, this still finds the true value function, everything's good. Both of these are valid estimators of the expectation; they just take samples of the mean in different ways.

Question: is there any intuition about which one is better, or which one to use in which case? It sort of depends on the setting, actually. When we talk about TD(λ) I'll come back to that question: there are different types of traces you can use, and they correspond to every-visit and first-visit traces, and there's some evidence in favour of both, depending on the domain.
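The every-visit variant is the same sketch with the first-visit check dropped, so the counter can be incremented several times per episode. This fragment reuses the (assumed) names from the snippet above and replaces its backward loop:

```python
# Every-visit MC: the update now fires at every occurrence of s in the episode,
# not just its first occurrence.
G = 0.0
for t in reversed(range(len(episode))):
    s, r = episode[t]
    G = r + gamma * G
    visit_count[s] += 1        # may fire more than once per episode for the same state
    returns_sum[s] += G
    V[s] = returns_sum[s] / visit_count[s]
```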
Okay, so let's do an example; it's a little bit related to the assignment. The idea is: how can I train you up to go and break the casino after this class? Don't hold me to that, I don't want to be responsible for anyone losing money; casinos don't follow exactly these rules, this is a simplified version of blackjack, so you won't be able to use it to really win money. But we're going to consider the game of blackjack, sometimes called 21 or pontoon. It's a very simple game: you start off being dealt two cards, and the idea is to get the sum of your cards as close as possible to 21 without going over 21. You just add up the face values of your cards, and if you're ever dealt a card such that the sum goes over 21, you're bust and you lose. The closer you get to 21, the more likely you are to beat the dealer, who also has two cards and plays out the game the same way, and if your total beats the dealer's then you win the game and you win some money. That's the game of blackjack; how many people know the game, by the way? Good, okay.

So we're going to represent this as an MDP, and we're going to treat the state as having three variables. The first variable is the current sum of our cards: if we've got three cards and the sum of those cards is 17, we're in a state with a sum of 17. Note that we're not going to consider sums below 12, because in those states we just automatically ask for another card: you can ask for another card for free, and there's no interesting decision to make if you have less than 12, because you'll always ask for another card. So we only consider the states where there's actually an interesting decision, where the sum of our cards is between 12 and 21. The second variable is what the dealer is showing: when you're playing, you see one of the dealer's cards, and you can use that to decide whether to ask for another card or not. Your goal, without having seen the dealer's other card, is to decide whether to ask for another card and possibly go bust, or to stick with what you have. So you've got two actions: stick, or twist. If you stick, the dealer plays out the game; think of that as the turn of the environment. The dealer plays the game all the way to the end, that's like one step of the environment, and then you get +1 reward if your sum is greater than the dealer's, 0 if it matches, and -1 if you've got less. If you twist, you get -1 if you go over 21, because you go bust, and 0 otherwise. For the transition structure of this MDP, the only thing we add is the idea that if you've got less than 12, we automatically twist: that's just part of the MDP, so you only take actions where there's an interesting action to take, and this auto-twist is part of the environment when you have less than 12.

Okay, so let's apply Monte Carlo policy evaluation to this problem. Next class we'll see how to actually find the optimal policy; for now we're just doing policy evaluation.
What we're going to do is choose a really naive policy and see what happens with it. The policy is: stick if the sum of our cards is 20 or 21, and otherwise always twist. So even if we've got 18 or 19, we're just going to twist, and we want to know how good or bad this is. Oh, and sorry, I forgot to mention one more state variable: we have three state variables, and the third is whether we have an ace in our hand which is usable, meaning it can take the value of either 1 or 11; it's good to have an ace in your hand.

What you see here is basically every-visit Monte Carlo learning, where what was done was to roll out 10,000 episodes of blackjack, literally playing out this game for 10,000 hands and then 500,000 hands, and apply exactly the procedure we've just seen. For those 10,000 hands, every time we came across one of these states, every time we had a sum of 13 and a usable ace and the dealer was showing a five or whatever, we would find the appropriate point in the value function space and update our estimate of the value function a little bit towards the mean: for each point here we keep a separate estimate of the mean for that state, and for all of these states we update the mean from these episodes. If you run enough episodes, you start to get the shape of the value function. This is really to give you a flavour of what happens with these algorithms. What you should immediately see is that even after 10,000 episodes you get a pretty decent estimate for the case where you don't have an ace, but the estimate where you do have an ace is much noisier, and that's because those states are rarer: after 10,000 episodes you don't actually see enough hands with an ace to figure out all of those means correctly. You need something like half a million episodes to really get the correct shape of the value function. And the shape tells you something quite interesting: as you'd expect, the better the card the dealer is showing, the worse your value is, and if the dealer has an ace that's even worse. This other axis is your sum: if you've got 20 or 21 you do quite well in this game, but for anything below that you do really badly, because with this very naive policy you just keep twisting and probably go bust. We see that emerging nicely from the structure of this value function just by sampling: no one told us the dynamics of the game, no one told us the probabilities that govern blackjack, no one told us the structure of the deck of cards; this came just from running episodes, trial-and-error learning, and figuring out the value function directly from experience.

Question: can you clarify the dealer axis, does it go ace, 2, 3, 4, all the way up? Yes. And the jacks and queens? Picture cards are just treated as tens in this game; in blackjack all picture cards have a value of 10. Another question: is this the expected reward, given that you're in that state and you follow the policy? It's the expected return, which is the value function.
So if you follow this policy, the height of each point on that diagram is showing you your estimated value function: your estimate of how well you will do, of whether you'll win or lose this game. If you're in a situation where the dealer is showing an ace and you've got a sum of, say, 19, this value tells you how well you think you will do in that game. The shape of this value function is exactly the information we're after: once you have this picture you can evaluate your decisions, pick the best decision, and make better policies, because you can look at the shape of this surface for all different policies and find the best one. And the assignment is to work on a similar game, something called Easy21, which I think you'll have fun playing around with and trying to understand; you can run all kinds of reinforcement learning algorithms on it, not just policy evaluation but also finding the optimal policy for that game.

Question: why does the surface dip slightly at a dealer 10 compared to a dealer 9, on the bottom right-hand side there? It might just be noise in the evaluation, although the dealer-10 column should be estimated better, simply because there are more tens once you bundle together all the jacks, queens and kings. Is it because a 10 is a better hand for the dealer? Well, it's a little unclear; it depends on the strategy the dealer uses to twist. It might be that the dealer has a similar strategy to ours, in which case having something like a seven or an eight is actually a bad situation for the dealer: they'll probably twist and most likely get a 10, and then most likely twist again, and there's a higher probability of going bust than if they already had a 10, in which case they're likely to get another 10, because there are more of them, and then they'll stick. So I think it depends on the dealer strategy, which whoever programmed this had to program as well, and which we don't know; you do know it for the assignment. But I think that illustrates that the value function depends on all kinds of factors: the true value function depends on every aspect of the environment, on how the dealer acts, on the deck of cards, on the randomness, on how many tens there are compared to nines and eights. All of these things affect the value function, and yet we don't need to tell our agent any of them: just by sampling the returns we're able to go directly to the thing we want, which is the value function, and hence, in the next lecture, the optimal behaviour, without needing to explain anything about the dynamics. That's the power of model-free reinforcement learning: you can go directly to the result without going through those intermediate steps.
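If you want to play with this yourself, here's a rough sketch of an episode generator under the naive stick-on-20-or-21 policy. It makes two assumptions the lecture leaves open: cards are drawn from an effectively infinite deck, and the dealer sticks on 17 or more (the usual casino rule); neither is specified above, so treat this purely as an illustration:

```python
import random

def draw_card():
    # Infinite-deck assumption: ranks 1..13 capped at 10, so ten/jack/queen/king
    # all count as 10 and P(10) = 4/13.
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (total, usable_ace), counting one ace as 11 when that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode(player_policy):
    """Play one hand; return the list of states visited and the final reward."""
    player = [draw_card(), draw_card()]
    dealer_showing, dealer_hidden = draw_card(), draw_card()

    states = []
    while True:
        total, usable_ace = hand_value(player)
        if total < 12:                           # auto-twist: no interesting decision yet
            player.append(draw_card())
            continue
        state = (total, dealer_showing, usable_ace)
        states.append(state)
        if player_policy(state) == "stick":
            break
        player.append(draw_card())               # twist
        if hand_value(player)[0] > 21:
            return states, -1.0                   # bust

    # Dealer's turn: "stick on 17 or more" is an assumed rule, not given in the lecture.
    dealer = [dealer_showing, dealer_hidden]
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())
    player_total, dealer_total = hand_value(player)[0], hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return states, 1.0
    if player_total == dealer_total:
        return states, 0.0
    return states, -1.0

def naive_policy(state):
    player_total, _, _ = state
    return "stick" if player_total >= 20 else "twist"
```

Because all intermediate rewards are zero and there's no discounting here, the return from every state visited in a hand equals the final reward, so the (states, reward) pairs from play_episode(naive_policy) can be fed straight into the every-visit averaging above.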
Okay, so far we've talked about an explicit computation of the mean. What we're going to see now is that this can be done incrementally; we're going to start to move towards online algorithms which update the mean step by step. This slide is probably a well-known result, sorry if you're already aware of it, but it's just to show that the mean can be computed incrementally: you don't have to first form the sum of everything and then divide by your count. Very briefly, the mean of k elements is one over k times the sum of those elements. We can split that sum into the sum of the first k-1 elements plus the k-th element, and then observe that the sum of the first k-1 elements is just k-1 times the previous mean. So you take your previous mean, multiply it by k-1, add on your new element, and divide by k again, and that gives you your new mean. If we rearrange those terms, what we see is that the new mean is the old mean plus a step size of one over k times the difference between your new element and what you thought the mean was.

We're going to see a lot of algorithms that take this form. There's an error term here: think of it as the difference between your estimate of the value before you see the new element, which is your previous mean, and the new element you actually see. There's some error between what you thought was going to happen and what actually happened, and what we do is update our mean a little bit in the direction of that error. We'll see many algorithms of that form, where you have a prediction and you correct that prediction in the direction of the error: if you're estimating your mean too low and you see something higher, you need to increase your mean in the direction of the error between what you thought the mean was and what you actually observed. That's the idea of these incremental mean computations, and pretty much every algorithm in the remainder of this lecture takes this form.

So let's first do this with incremental Monte Carlo updates. We take our Monte Carlo algorithm and make it an incremental update, episode by episode, without keeping track of these sums (you still need the visit count, but not the sum). For each state that we visit, with some observed return, we update every time we see that state: we measure the return from that state onwards, we look at the error between the value function, which is what we thought the return was going to be, and the return we actually observed, and we update our mean estimate for that state a little bit in the direction of that return.
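Written out, the algebra on that slide is

$$\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j
       = \frac{1}{k}\Big(x_k + \sum_{j=1}^{k-1} x_j\Big)
       = \frac{1}{k}\big(x_k + (k-1)\,\mu_{k-1}\big)
       = \mu_{k-1} + \frac{1}{k}\big(x_k - \mu_{k-1}\big),$$

and the corresponding incremental Monte Carlo update, for each state $S_t$ visited in an episode with return $G_t$, is

$$N(S_t) \leftarrow N(S_t) + 1, \qquad
  V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big).$$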
Okay, and this is exactly the same: we've just transformed our computation of the mean into this form. Question: are we computing intermediate returns here? No, and I'm glad you asked, because that's the next part of the class. What we're doing here still uses the complete return; it's only incremental in the sense that every episode we update our mean. It's a trivial rewrite of how we compute the mean, and the reason we're doing it this way is to show how this will develop towards other algorithms: we're going to replace this one over N by other quantities, and we're going to replace the target we move towards by other quantities. For now we're doing exactly the same computation as before, nothing has changed; it's just an incremental mean update, episode by episode, to put things in the form we'll use later. Another question: but we add these values after the episode, right, because only then do we know the whole return? Yes, everything with Monte Carlo so far is like that: you have to wait until the end of the episode to see the return you got, and then you go back to the state you want to update and say, okay, from that state I ended up getting 100 points of reward, and you update your estimate at the end of the episode. That's one disadvantage of Monte Carlo that we'll address shortly.

So what we really want to move towards is algorithms that don't have to maintain these statistics and just incrementally update themselves, and one way this can be done is by forgetting old episodes. You don't necessarily want to take a complete mean; sometimes you actually want to forget the stuff that came a long time ago, and one way to do that is to use a constant step size, which gives you an exponential forgetting rate when computing your mean: an exponentially weighted moving average of all the returns you've seen so far. It's another way to compute a different type of mean, where we've replaced our one over N by this fixed step size alpha, but the concept is the same: we had some estimate of what we thought the mean value was, we see some return from that state onwards, that gives us an error term, and we move our value function a little bit in the direction of that error, except that now we don't move all the way to correct exactly to the overall mean; we may under- or over-correct, depending on the value of alpha.
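To be explicit, the constant step-size form of the Monte Carlo update, with $1/N(S_t)$ replaced by a fixed $\alpha$, is simply

$$V(S_t) \leftarrow V(S_t) + \alpha\,\big(G_t - V(S_t)\big),$$

which computes an exponentially weighted average of the returns seen so far rather than the exact mean.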
Question: is the advantage of this just that we don't have to store the count? So maybe I wasn't clear: the first advantage of this approach is that it applies to non-stationary setups, where things can be drifting around. In the real world stuff is changing, and you don't want to remember everything all the way into the past, because stuff that came a hundred years ago you want to forget about; including it in your mean just slows you down and gives you baggage you actually want to let go of. And we'll see as the lectures develop that we're always in that case in reinforcement learning: things are always non-stationary, because although in this lecture we're evaluating a fixed policy, in general we'll start to improve the policy, so the thing we're evaluating is getting better and better. That's why we prefer these non-stationary estimators rather than taking the true mean. In addition, it gives us memoryless algorithms, where each new sample comes in, we look at it, and we do an update without having to track statistics about what came before, but I think that's a less important detail. So we're in this setup where we look at algorithms that move a little bit towards the sample we've seen.

Okay, so that's Monte Carlo learning: a very simple idea where you roll out episodes, look at the complete returns you've seen, and update your estimate of the mean value towards your sample return for each state you visit. Now we're going to move on to a different class of methods, which really do break up the episode and use incomplete returns. So what do we mean by temporal-difference learning? Like Monte Carlo, these TD methods learn directly from actual experience, from interaction with the environment. The slide says episodes, but TD actually applies to the non-episodic, continuing case as well. TD again is model-free: we don't need to know how the environment works, the transition structure, the reward structure; we don't need to know how the dealer works or the deck of cards or any of those details. But one difference from Monte Carlo is that now we're going to learn from incomplete episodes. I don't have to go all the way until I hit the wall and see how much reward I got from that complete trajectory; I can take a partial trajectory and use an estimate of how much reward I think I'll get from here up to the wall, in place of the actual return. This idea is called bootstrapping: substituting the remainder of the trajectory with our estimate of what will happen from that point onwards, so that we update our guess of the value function towards a later guess. I start off guessing how long it will take me to get to the wall, I walk some number of steps, I make another guess of how long it will take from there, and I update my original guess towards my subsequent guess. That idea is called bootstrapping, and it's the fundamental idea behind TD learning.

So let's make that more concrete. The goal is the same as before: we're trying to learn the value function v_pi; we're just doing policy evaluation in this class, and next class we'll turn it into a control method. We just look at our interactions under some policy, say my random walk around the room: how well will I do, how much reward will I get, and I want to estimate this efficiently. But now we're going to do it online: every step, I make a new estimate and adjust my estimate of the value function without waiting until the end. So let's think about what we've done so far.
In Monte Carlo, we were in some state, we had our previous estimate of the value function, we got our return, we looked at the error term between our estimated value and the return, and we updated our value function a little bit in the direction of that error. What we're going to do now is use temporal-difference learning, and let's consider the simplest version, TD(0); we'll understand later what that zero means. What this means is that we now update our value function towards the estimated return after one step. I estimate what the return is going to be, and that estimate consists of two parts, just like in the Bellman equation: the immediate reward, plus the discounted value of the next state. It's just like the Bellman equations we saw in the previous lectures. And now we substitute this estimated return and use it instead of the real return: we take exactly the same algorithm we had before, pull out the real return, and replace it with an estimate of the return consisting of these two pieces, the immediate reward plus our estimated value over the rest of the trajectory. So we're coming up with a biased estimate, and we'll see the consequences of that bias shortly.

This red term here is what we call the TD target. It's the thing we're now moving towards, and it's like a random target: it depends on exactly what happens over the next time step, but we get some immediate reward and some value at wherever we happen to end up, and we move towards that thing, the TD target. And the whole error term, the difference between the TD target and our estimated value before we took this step, is called the TD error.

So why is this a good idea? I think the simplest example is to imagine you're driving along in your car. If you're doing Monte Carlo learning, when would you update your value function? Consider the following scenario: you're driving, you suddenly see a car hurtling towards you, and you think you're going to crash, but at the last second the other car swerves out of the way and you don't actually crash. In Monte Carlo you wouldn't get a negative reward: there was no crash, so you wouldn't be able to update your values to reflect that you almost died. But in TD learning, you're in a situation where you think everything is fine, then you drive along and you're suddenly in a situation where you think this car crash is going to happen, and you can immediately update the value you had before, to say that was actually worse than I thought, maybe I should have slowed down and anticipated this potential near-death experience. You can do that immediately; you don't need to wait until you die to update your value function. There's a great slide by someone I know in reinforcement learning with the backup diagrams we saw in the last class and a big skull and crossbones, saying you can't back up death.
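Here's a minimal sketch of TD(0) evaluation, using the same assumed `env`/`policy` interface as the Monte Carlo sketch earlier:

```python
from collections import defaultdict

def td0_evaluation(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """TD(0): after every step, move V(S) a little towards R + gamma * V(S')."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # TD target: immediate reward plus the discounted current estimate of the
            # next state's value (zero if the next state is terminal).
            td_target = reward + (0.0 if done else gamma * V[next_state])
            td_error = td_target - V[state]      # the TD error, delta_t
            V[state] += alpha * td_error         # update towards the TD target
            state = next_state
    return V
```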
Anyway, let's do another driving example. This example is from Sutton, so you can look at it in your own time, but the idea is that I'm driving along on my way to work; let's imagine I didn't do my usual horrible commute on the train and was actually in a car. I start off by making some prediction of how long it's going to take me, and these rows show how much time has passed so far, how much time I think it will take from this point onwards, and the total of those two columns: time so far plus predicted time to go. I start off thinking it's going to take me half an hour to get to work. Then I leave the office and reach the car and see that it's raining, so I'll have to drive a bit slower and there will probably be more traffic; that took five minutes, and I now think it's going to take 35 more minutes, giving a total of 40. Next, I get off the motorway, and that's taken me 20 minutes, and I think, okay, that went a bit smoother than I expected, so now I think there are only 15 minutes to go, giving a total of 35. Then I get stuck behind a truck and think it's going to take a bit longer than I thought, so I increase my estimate: there are actually 10 minutes to go, and it's going to take me 40 minutes in total. Then I get onto my home street; things are going okay, but it took a little longer than I thought, and I think there are still 3 minutes to go. Eventually I get there, and it actually took 43 minutes in total: there's no more time to go, and everything lines up at the end.

So now, how would we update our value function based on this trajectory of experience? These figures really show the difference between Monte Carlo and TD in practice. In Monte Carlo learning, we look at each point along that trajectory, and the plot shows the total predicted journey time at each point. It actually took 43 minutes, but at every moment I had some prediction of what the total time would be: here I thought it was going to take 30 minutes, then when I reached the car I thought 40 minutes, then things were going better so I thought 35 minutes, and so forth. With Monte Carlo learning, you update towards the actual outcome: you have to wait until you finally get there, see that it actually took 43 minutes, and then update each of your value estimates along the way towards that 43 minutes. Whereas with TD learning it's quite different. Now, at every step: you started off thinking it was going to take you 30 minutes, and after one step you thought it was going to take you 40 minutes, so you can immediately update right there; you don't need to wait for anything else to happen. You can say, I was in a situation where I thought it was going to take 30 minutes, but then I saw it was raining, so my 30-minute estimate was probably too optimistic; let's adjust it a little bit towards 40 minutes. At the next step, I started off thinking 40 minutes, but then things went quite smoothly on the highway, so I can immediately pull that estimate down and say maybe I was a bit pessimistic, and I should move it towards my next estimate.
And then you have to realise that you might get stuck behind a truck, which can pull your estimate back up again, so you move that towards your subsequent estimate, and eventually you get grounded by the actual outcome. You're always updating a guess towards a guess, but eventually you're grounded by the outcome, where you really get there, that's the end of your episode, and you get your real outcome and have your day at work.

Question: just to check, in this example, what is the goal and what are the actions? First of all, the reward function is the time it takes you: think of the reward as time. You could think of the goal as getting to work with minus one reward per step, so if you made all of these values negative, that would be a well-formulated Markov decision process in which the optimum corresponds to getting to work as quickly as possible. As for the actions: there are no actions here; think of this as a Markov reward process. We're just doing policy evaluation, so the policy could be anything we want; the policy is just driving your car, and the action space could be anything. The reason it doesn't matter is that we're just trying to evaluate whatever that policy is, and for now we're just looking at how to evaluate how much reward I get from this particular trajectory. Remember, you can flatten any MDP into a Markov reward process given a fixed policy, and that's effectively what we're doing: we're just trying to understand how much reward you get, independent of the action space. Next class it will matter much more what the action space is, because we'll pick actions so as to optimise our decisions.

Okay, so is this a good or a bad idea? Well, there are several ways in which it's a good idea. The one we've just focused on is that TD can learn before you see the final outcome: you don't need to wait until you crash and die, you can already start learning after just one step, whereas Monte Carlo always has to wait until the end of the episode before it sees the return, and then it backs it up retrospectively. A consequence of this is that TD can learn in situations where you never see the final outcome at all. What if you get incomplete sequences, where you just don't get to observe the return at the end of the episode? We'll see this later when we do off-policy learning; it really does happen a lot. Or what if you want to deal with continuing environments that just go on forever and you never reach the end? Monte Carlo just doesn't apply there; you can go out to some arbitrary horizon in the future and back up from there, but there's always some error introduced. Whereas TD really works in all of these cases: even if the environment just goes on and on, you update now from what happens at the next step, and if your steps go on forever that's okay, everything still works, and you'll find the true value function regardless.
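Putting rough numbers on the driving example from a moment ago, here's a small snippet that prints the two kinds of update target side by side. The minute values are the ones recounted above, with a couple of elapsed times inferred from the stated totals, so treat them as illustrative:

```python
# (elapsed minutes so far, predicted minutes still to go) at each stage of the drive
stages = [
    (0, 30),    # leaving the office
    (5, 35),    # reach car, it's raining
    (20, 15),   # exit the highway
    (30, 10),   # stuck behind a truck
    (40, 3),    # home street
    (43, 0),    # arrive
]

actual_total = stages[-1][0]                      # the journey really took 43 minutes
for t in range(len(stages) - 1):
    elapsed, to_go = stages[t]
    prediction = elapsed + to_go                  # total journey time predicted at this stage
    mc_target = actual_total                      # Monte Carlo: wait for the real outcome
    next_elapsed, next_to_go = stages[t + 1]
    td_target = next_elapsed + next_to_go         # TD: update towards the next total prediction
    print(f"stage {t}: predicted {prediction}, MC target {mc_target}, TD target {td_target}")
```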
Question: does it learn the same behaviour? Say you have a car that almost crashes, but it doesn't, and the reason it doesn't is that other people avoid the lunatic. When you're learning step by step, you're going to learn a behaviour where you're scared of crashing, but if you only learn from the very end, you would never learn anything that encourages you to avoid your own manic behaviour, because by the time you get to the end you don't suffer the consequence. So does it learn the same thing? That's a really good question: does TD find the same answer as Monte Carlo? We're going to start to address that, and the basic answer is that TD does find the true value function, as long as you run this out, because it always grounds itself. Even though you correct yourself based on your guess, and that guess might not be right, that guess will then be updated towards something that happens subsequently, which grounds it more and more, so all of your guesses progressively become better, and that information backs up so that you get the correct value function. So yes, it's fine.

Okay, so we're trying to understand the pros and cons of TD compared to Monte Carlo, and one other issue that comes up, a major difference between the algorithms, is this bias-variance trade-off. Think about the return, this sample of all the rewards in the episode. In Monte Carlo we used this whole sequence of rewards that we actually sampled from our MDP, and that's an unbiased estimate of the value function: the definition of value is the expected return, so by unbiased we mean that this is just a sample of that expectation. If we update towards the return, we're not introducing any bias; the law of large numbers takes effect and we really will find the true value function, everything is good. What if we were instead to use the true TD target? By true TD target we mean that we drop in the true value function: if some oracle told us the true value function and we used the immediate reward plus the discounted true value at the next state, that would also be an unbiased estimate of the value function, and we know that from the Bellman expectation equation from lecture two, which tells us that the value function is equal to the expected reward plus the discounted value at the next state. So either of these cases gives us an unbiased estimate. But what we actually do in TD is different: we don't have an oracle to tell us the true value, so what we substitute in is our best guess so far of what the value function is at the next state. The target we move towards, the target we're updating our value towards, is the immediate reward I got over that step plus the discounted estimated value of the next state according to my current estimate. So now we've introduced bias: this estimate could be anything, it could be wildly wrong, I could have initialised it with some crazy value. We've introduced bias into our target, but we've reduced the variance: the TD target is biased, but it's much lower variance than the return. So intuitively, why is it lower variance?
Well, look at the return. The return depends on the immediate reward after one step, and then whatever the environment does will take us to some next state: there's noise in that transition, there's noise in the reward we see there, there's noise in the next transition and the next reward, there's noise in that whole trajectory, many random variables depending on the environment and the agent's policy over many steps. There's a lot of noise, and it multiplies out over the course of the trajectory. Whereas with the TD target we only incur noise from one step: there's one step of noise in the reward, the environment gets to take one turn, we get one random transition, and then we invoke our value function at that next state. So the TD target is much lower variance than the return: we're only exposed to the noise over the first step rather than over the entire trajectory. Question: isn't one of them a very noisy estimate? It's not noisy, it's biased. You have some estimator which may be wrong, which may not have the true value, but it's not noisy: at any given time it's just some fixed function in which you look up the value. The state you see is noisy, but the bias comes from the fact that your estimated value function is not equal to the true one.

Okay, so Monte Carlo is high variance but has no bias. Some good things about Monte Carlo: as a result of being zero bias it has very good convergence properties; we'll see later that even when we use function approximation, Monte Carlo just works out of the box, the algorithms converge to the correct answer. It's not very sensitive to the initial value, because we're not bootstrapping from the initial value; it still matters where you start, in the sense that it takes longer to adjust a wildly wrong value towards the correct one, but it doesn't feed off itself. And it's very simple to understand and use. TD, on the other hand, is much lower variance, but it has some bias, and so you should be wondering whether the algorithm even works if it's biased. The answer, luckily, is yes, and it's usually much more efficient than Monte Carlo because it's so much lower variance; in most situations, and we'll see examples of this, TD is generally a much more efficient algorithm. For the special case we're looking at, policy evaluation with a table lookup, TD(0) converges to the true value function. I'm not going to prove it, but this was proved by Peter Dayan, who was one of the first people to prove it for the general case. It's not always true with function approximation, though: there are special cases, which we'll see, where the bias in TD leads to the algorithm not behaving the way we would like, but those are quite specific cases that we'll start to address in the next couple of lectures. And TD is more sensitive to the initial value, because it's feeding off itself in this way.

Question: what do you mean by function approximation? In a couple of lectures' time, what we'll see is that everything we've done so far is not practical, because we're estimating the value function V(s) for every state separately, and in most interesting problems you've got more states than you can count. So what we'll do is use a function approximator to estimate the value function.
And the question is whether these methods still work effectively then, and the answer is yes, except for certain specific cases that we'll deal with.

Okay, let's make this a bit more concrete with an example. This problem is a random walk, one of the simplest MDPs you could imagine. There are two actions, left and right, and we're going to consider a uniform random policy that goes left with probability 0.5 and right with probability 0.5. If you end up in the state at either end, the episode terminates; the rewards on the transitions are zero everywhere, unless you make it all the way to the right, in which case you get a reward of one. So you just do a random walk along here: if you end up on the right, that's good, the episode finishes with reward one; if you end up over on the left, that's bad, you get zero. The question is: what's the value function for this, the value of being in each of these states? We can use this simplest possible case to get an intuition of how Monte Carlo and TD compare to each other.

If you apply TD, and this is the TD(0) algorithm, you can start off with a completely arbitrary value estimate, say initialising everything to 0.5. If you run this for one episode, then 10 episodes (in blue), then 100 episodes (in black), you see the estimate flatten out onto what is actually the true value function, which is this straight line. So after roughly 100 random episodes you've pretty much got the value function using TD(0). Intuitively, the value increases as you move towards the one: the true value function is 1/6, 2/6, 3/6, 4/6, 5/6 as you move along, which you can find by dynamic programming.

So now, what happens if we compare Monte Carlo with TD: is it really true that TD is more efficient? Here we're picking out each of those stages, like 1, 10, 50, 100 episodes, and plotting learning curves showing how the error in the value function reduces. What we're looking at on the y-axis is the root mean squared error averaged over all states, in other words the total error between the true value function and the value function we've estimated after some number of episodes: we look at the sum of squared errors and take the square root of all of that. In the black lines we have Monte Carlo learning with different step sizes, and in grey we've got TD learning with different step sizes. This is for a small number of episodes, but the same result extends out further: you just have to optimise the step size, and you'll see that for appropriately chosen step sizes TD continues to do better than Monte Carlo. This is really showing the effect of bootstrapping: by bootstrapping you can learn much more efficiently than by incurring the variance of everything that comes later in the episode.
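Here's a sketch of this random walk in the same assumed env interface as before, so it can be plugged into the earlier Monte Carlo and TD(0) snippets; the uniform random policy is baked into step(), so the policy argument is simply ignored:

```python
import random

class RandomWalk:
    """States 1..5 (the lecture's A..E), with terminal states 0 and 6.
    Reward is 1 for stepping off the right-hand end, 0 otherwise."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = None

    def reset(self):
        self.state = (self.n_states + 1) // 2      # start in the middle state (C)
        return self.state

    def step(self, action=None):
        # Uniform random policy: move left or right with probability 0.5 each.
        self.state += random.choice([-1, 1])
        done = self.state == 0 or self.state == self.n_states + 1
        reward = 1.0 if self.state == self.n_states + 1 else 0.0
        return self.state, reward, done

# True values for comparison: v(1..5) = 1/6, 2/6, 3/6, 4/6, 5/6.
```

Running td0_evaluation(RandomWalk(), lambda s: None, 100) and comparing against those true values gives the kind of RMS-error behaviour described above (the sketch initialises values at zero; initialise at 0.5 if you want to match the figure's setup).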
As for the curve appearing to go back up, I think that's just noise; these step sizes are quite high. With a large step size you can never get exactly to the true value function: you'll oscillate around it, within some ball whose size is governed by your step size, so you're not guaranteed to converge all the way to the exact solution. The convergence proofs for TD actually use a decaying step-size schedule satisfying the Robbins-Monro conditions, something like a one-over-n schedule, and then it really does converge. The same is true for any gradient-based method: with Monte Carlo, in many of the cases we'll see, you still need to decay the step sizes according to an appropriate schedule. So if you were to plot this out, I don't think the error is actually going up; it just stabilises, and you only get to within whatever accuracy your step size allows. If you're jumping around too much with your steps you can't expect to converge exactly to the right solution; you have to start reducing your steps to get exactly there.

Okay, so we've seen so far that both of these approaches converge: with both Monte Carlo and TD we find the true value function as we add in an infinite amount of experience. But what happens if we stop after a finite number of episodes? Suppose we show just three episodes to our system and then keep iterating on those same episodes, applying our updates again and again over that same data. Would Monte Carlo and TD find the same solution?

To get an intuition for that, we'll take a very simple example, the AB example. It's an MDP with two states, A and B, and each row is a different episode of experience. In the first episode you start in state A, get a reward of zero, transition to state B, get a reward of zero, and the episode terminates. In the next episode you start in B and get a reward of one; the same for the next five episodes; and in the final episode you start in B and get a reward of zero. So the question for you is: what's the value function? Let's do the easy one first: what's the value of B? We've got eight episodes starting from B, six of them got a return of one and two of them got a return of zero, so I think everyone would agree the value is 6/8, or 0.75. What about the value of A? We've got one person saying 0.75, others saying zero, and a show of hands comes out about half and half. Those are, respectively, the Monte Carlo and the TD solutions to this problem, and they're both perfectly legitimate; we'll see why.

First of all, the zero answer is easy to explain: we've seen one episode from A, and the return was zero from that point onwards, so we can legitimately estimate the value as zero. That's the Monte Carlo estimate: we had one complete trajectory starting from A, and it got zero.
What about the TD estimate? Implicitly, what TD is doing is something like the following. Imagine you tried to build an MDP model that explains the data seen so far. You would build an MDP that looks like this: you start in A, and 100% of the transitions we've seen from A went to B with zero reward along the way; once we got to B, 75% of the transitions gave a reward of one and 25% gave a reward of zero, and then we terminated. That MDP is the best explanation of the data; it's the maximum-likelihood MDP.

So now we can see what these two methods converge to. Monte Carlo converges to the solution that minimises the mean squared error, the best fit to the actual returns we've seen: over all time steps and all episodes, it minimises the difference between the observed return and the value function. TD(0), on the other hand, converges to the solution of the MDP that best explains the data: implicitly, TD first fits an MDP and then solves that MDP. Whatever data you've seen so far, it finds the solution to the maximum-likelihood MDP model for that data.

The maths on the slide says exactly that: first find the MDP that best fits the data, which you can do just by counting transitions. The transition-dynamics estimate counts all the transitions we've seen from state s and action a to each successor state and divides by the total number of times we took a in s, and the reward estimate is just the mean reward observed from state s and action a. The bold one is an indicator function: each time the event in question occurred, the indicator has value one, otherwise zero.

Question: if you know the MDP, can you give it to Monte Carlo so that it can sample from the MDP and work from those sampled returns? We'll come back to that in lecture eight or nine, where we'll do exactly that; it's a good idea.

Okay, so let's come back to understanding the advantages and disadvantages of Monte Carlo and TD. I want to summarise a third kind of advantage, which I think we can now understand better. A Markov decision process has the Markov property, and TD exploits that property: it exploits it by implicitly building this MDP structure and solving for it. That means that in Markov environments TD is normally more efficient, because it makes use of the fact that you don't need to blindly look at complete trajectories; you can understand the environment in terms of states, and the state summarises everything that came before. That's another way to understand why TD is more efficient. Monte Carlo, by contrast, does not exploit the Markov property; it simply ignores it.
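To make the batch distinction concrete, here is a small sketch of the two answers for the A,B example, computed directly from the eight episodes; the episode encoding and the variable names are mine, not from the slides.

```python
from collections import defaultdict

# Each episode is a list of (state, reward) steps; these are the eight
# episodes from the AB example described above.
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Monte Carlo: average the observed (undiscounted) returns that followed each state.
returns = defaultdict(list)
for ep in episodes:
    rewards = [r for _, r in ep]
    for t, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[t:]))
V_mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V_mc)        # {'A': 0.0, 'B': 0.75}

# TD / maximum-likelihood MDP: fit the model by counting, then solve it.
# Every observed transition from A went to B with reward 0, so V(A) = 0 + V(B);
# from B, 6 of the 8 observed terminations gave reward 1, so V(B) = 6/8.
V_td = {'B': 6 / 8}
V_td['A'] = 0.0 + V_td['B']
print(V_td)        # {'A': 0.75, 'B': 0.75}
```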
One benefit of not exploiting the Markov property, though, is that if you're in a non-Markov environment, partially observed, where you can't just rely on the state signal because everything is aliased and messy, then Monte Carlo can be a better choice. Often we sit somewhere on a spectrum, maybe a little bit non-Markov, and the question is how far between these methods you should go.

Right, so just to summarise where we're up to, the next few slides give a pictorial summary of what's going on. Yes, question: you said sometimes we're a little bit non-Markov; does an MDP even still apply then? Okay, let me explain that comment. Under the hood there can always be an MDP that defines the way the environment works; the question is what you observe as an agent. A partially observed Markov decision process is one where you get to observe some function of the state, some part of it: you see certain cards but not other cards, or certain statistics but not others. So there is an MDP running under the hood and you see some part of it, and the question is how you deal with that. If you just use your observed statistics as your state representation, TD(0) wouldn't work very well, but Monte Carlo would still do the right thing given the representation you've given it: it would still find the expected return conditioned on the information you've shown it. That might not be great, because a blind robot walking around will have worse value estimates than one that can see everything, but you still want it to estimate values consistently even with only partial information.

Okay, so let's pull things together a little. We've seen these different updates, which take particular forms, and we can think of them as backups, like the backup diagrams we saw for dynamic programming. You start in some state, and implicitly there's a look-ahead tree: from this state there are actions I could take, and from those actions the wind might blow me here or there, then there are more actions, one branch might terminate, the environment might take me somewhere else, and so on. The question is how we use this look-ahead tree to figure out the value function at the root node. What Monte Carlo does is sample one complete trajectory through the tree: it samples an action from the agent, samples where the wind takes you, samples another action, samples the environment again, one consistent trajectory obtained just by interacting with the environment. You get one sample of the whole thing and use it to update the value function at the root. We also do this at every intermediate state along the trajectory; we're just focusing on one state here.
When we run temporal difference learning, the backup is just over one step: we sample our own action, sample the environment, look at the value function at the state we ended up in, and back that value up towards the value function at the root node. We don't go all the way to the end like we did with Monte Carlo; we look one step ahead, get a sample of what happens over that one step, and back that sample up.

Now contrast that with what we did in dynamic programming. There we also did a one-step look-ahead, but we didn't sample: we had to know the dynamics, and we used those dynamics to compute a full expectation. For each action we considered every place the environment could take us, weighted each successor value by its known probability, combined over our actions (an expectation under the policy for evaluation, or a max if we were doing control), and did a complete backup to get the value at the root. So if you know the dynamics you can do this full one-step look-ahead. And just to complete the space of possibilities, you could also do an exhaustive look-ahead over the entire tree, going all the way to the end with complete backups; that's exhaustive tree search, and you get an exponential blow-up in the size of the update you do for just one state.

So let's tease these apart into dimensions that define our algorithm space. The first dimension is whether or not we bootstrap. Bootstrapping, remember, is the idea that you don't use the real returns as your target; you use your own estimated value function as the target, giving an iterative update rather than one based on real returns or real rewards. Monte Carlo does not bootstrap: it uses the real returns all the way. TD and dynamic programming both bootstrap: in both we do something like the Bellman equation, looking at the immediate reward plus our estimated value at the next step. In both cases the robot takes one step of real experience, then makes a guess of how much value it will get from there until the wall, and uses that guess to back itself up, either from one sample in TD or with an exhaustive full-width backup in dynamic programming.

The other dimension is whether we sample or take full-width backups. Monte Carlo samples: it samples the environment dynamics and its own policy, so there's no exhaustive search over everything the environment might do. Dynamic programming does full-width updates: it considers every possibility exhaustively and backs them all up. TD samples as well. Are those two dimensions clear? Good. So that gives us a picture, a kind of unified view of methods for policy evaluation; we'll have the same picture for control, and it applies throughout reinforcement learning.
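As a rough sketch of those two dimensions in code, the three backups for evaluating a fixed policy might look like the following; the names and the shape of the dynamics model P, R are illustrative assumptions (with the policy already folded into the dynamics), not the lecture's notation.

```python
# G is a full sampled return, (r, s_next) is one sampled step, and P[s][s2],
# R[s][s2] are known transition probabilities and expected rewards under the policy.

def monte_carlo_update(V, s, G, alpha):
    # deep, sampled backup: move V(s) toward the actual return of one trajectory
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, gamma, alpha):
    # shallow, sampled backup: bootstrap from the estimated value one step ahead
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def dp_expected_update(V, s, P, R, gamma):
    # shallow, full-width backup: expectation over every possible successor state
    V[s] = sum(P[s][s2] * (R[s][s2] + gamma * V[s2]) for s2 in P[s])
```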
Along one axis, you can either use full-width backups that do an exhaustive look-ahead over all the choices you might make, or you can sample the dynamics. Along the other axis, you have shallow backups that bootstrap from the value function after one step, looking one step ahead and updating from the value there towards the value at the root (in dynamic programming it's a full-width one-step look-ahead, but still an update from the next values back to the root), and in contrast you have deep backups, like Monte Carlo, where you go all the way to the end of the episode, look at what actually happened, and back the actual return up to the value function at the root. Similarly, exhaustive search does a deep, full-width look-ahead all the way to the end and backs that up to the root without ever bootstrapping from its own value estimate.

Okay, and now the final piece. In the remaining twenty minutes or so I'd like to talk about a spectrum of methods in between these shallow and deep backups. It's not actually the case that you have to choose either TD learning or Monte Carlo learning, because it turns out there's a unifying algorithm that has either end as a special case, and that algorithm is called TD(λ). We'll see that λ basically lets you pick your place along this spectrum, shallower or deeper, according to the value of λ.

Yes, question: when we sample a transition, we always update towards the value function of the next state, as if the next value were correct; what if we did it the opposite way and updated the next value towards the previous one? Shouldn't that also work? Great question: why do we assume the value after one step is more accurate than the value before it, and would the alternative algorithm give the right answer? The answer is that it gives the wrong answer. In fact, even if you do something like making the two values close to each other and minimising the mean squared TD error, you find the wrong answer in stochastic MDPs; that's a well-known fact. So why? The intuition is that when you take one step, you're always, in a sense, a little more accurate, because you've seen one step of reality in between: one step of the real reward and one step of the real dynamics, and then your estimated value at the state you ended up in. Because you've included that one real step, you're more accurate than where you were before, and if you take enough of these steps you end up grounding yourself completely in the real dynamics and the real rewards of what happened. Whereas if you go backwards, you're starting from a state where you've already taken that real step, you're closer to your goal, you've already seen the real step of what happened, and now you want to move that later estimate towards your estimate of
what happened before you saw that real step of the environment. There's no reason to think that earlier estimate is better, and often it's worse. I think the key, and this becomes apparent in the maths, in the contraction-mapping arguments for why these algorithms converge, is that you take one step of the real dynamics, and that real step always brings you closer to the ground truth. Yes: if you make the step backwards, do you observe the real reward? In reverse, no, I don't think you do; you'd need an MDP that can produce the observations in reverse, and I guess this isn't one. If I have time in future lectures I'll try to give a specific, very simple MDP for which that proposal fails; we can maybe do it afterwards. I really want to get to TD(λ) before we run out of time, but it's a great question.

Okay, so far we've seen this idea of temporal difference learning where you take one step of reality and then look at the value function after that one step. But why not take two steps of reality, look at the value function where you ended up after two steps, and use that to back up towards your value function from two steps earlier? Or three steps, or n steps? All of these are valid, reasonable ideas. So we're going to generalise TD learning to these n-step predictions. What we've seen so far, TD(0), is the special case where we take one action, see one step of the environment, end up in some successor state, and update our original state's value towards the reward plus the value at that successor. But we could just as well do the two-step version: take one step to the next state, another step to the state two time steps from now, and use the reward, plus the next reward, plus the value function at that state as our estimated return, updating our value towards that estimate. You can do this for any number of steps: observe some number of steps of the real world, see how much reward you got along the way, add on your estimated value from that point onwards, and that quantity is a valid estimate of the overall value function. We use that estimate to update our original value function, in the correct direction, as per the discussion in the last question.

So let's write out what that means. The TD(0) estimator is the case n = 1: the one-step return, call it G(1), looks at one step of the real reward plus our value function at the state we end up in after one step. The two-step return, G(2), is the reward after one step, plus the discounted reward after two steps, plus the twice-discounted estimated value at the state we reach after two steps; the value function tells us how much reward we expect from that point onwards, so our estimate of the overall return is the real rewards for two steps plus our estimate of what happens afterwards. And we can do this for any n.
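Here is a small sketch of how the n-step return might be computed from a stored trajectory, under the assumption that rewards[k] holds the reward received on step k + 1 and values[k] holds the current estimate of the k-th state's value; these names are illustrative, not from the slides.

```python
# n-step return: G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n}
#                          + gamma^n * V(S_{t+n})
def n_step_return(rewards, values, t, n, gamma=1.0):
    T = len(rewards)                 # episode length (number of transitions)
    n = min(n, T - t)                # cannot look past termination
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    if t + n < T:                    # bootstrap unless the n-th step reached the terminal state
        G += gamma ** n * values[t + n]
    return G
```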
In the case where n goes to infinity, we end up with the Monte Carlo estimator, the real return. So we start from TD(0) at one end and get all the way to Monte Carlo at the other. The reason the limit is Monte Carlo is that G(∞) uses all the steps of the real environment: the real reward, plus the discounted real reward at the next step, all the way to the end of the episode. We're just looking at the real rewards of what actually happened on that trajectory, which is exactly what Monte Carlo does: it uses those real quantities as its targets.

Generalising, then, the n-step return takes n steps of real reward, by really interacting with the environment; these are random quantities that depend on the particular trajectory we ended up on over those n steps. After n steps we use our current estimated value at the state we end up in as a proxy for all the rewards we haven't observed from that point onwards, and that's our estimator. We plug it in exactly the same way as before: previously we took the Monte Carlo update and plugged in the TD target in place of the actual return; now we plug in the n-step return as the target. Again we have an error term: we started with some estimate of the value, we ended up, after this particular trajectory, with an n-step estimate made of the observed rewards plus an estimated value, and the difference generates an error signal. We update our value function a little bit in the direction of that error to correct for it.

This is valid for any n, so the natural question is which n is best. Should we use n = 1, which was TD(0)? Should we use n = infinity? Or some n in between? Here's a little study on the same kind of random walk we saw before, but larger, I think with 21 states or so. It looks at how the performance, in terms of the root mean squared error we looked at before, varies across different step sizes, as before, but also across different choices of n: the little numbers on the curves are the choice of n in the n-step TD updates. The top and bottom diagrams are very similar; the difference is whether we do online or offline updates, that is, whether we update the value function immediately or defer the updates until the end of the episode. You get slightly different results, but the character is the same: as n approaches infinity, up at the top, you see what happens with Monte Carlo learning, and the errors are very high. These are short runs, the total training time is short, which is why the difference is so exaggerated here, but we see that Monte Carlo has high variance and doesn't do very well, while TD(0) exploits the Markov property, has lower variance, and does better. In between there's a sweet spot.
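For concreteness, here is a sketch of the offline flavour of the n-step TD update just described, applied once at the end of an episode and reusing the n_step_return sketch above; the names and the dictionary-based value table are assumptions, not the lecture's notation.

```python
# Offline n-step TD: at the end of an episode, move each V(S_t) toward its
# n-step return, bootstrapping from the values held at the start of the episode.
def n_step_td_offline(V, states, rewards, n, alpha, gamma=1.0):
    values = [V[s] for s in states]              # snapshot used as bootstrap targets
    for t, s in enumerate(states):
        G_n = n_step_return(rewards, values, t, n, gamma)
        V[s] += alpha * (G_n - V[s])             # move V(S_t) toward the n-step target
    return V
```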
If you look ahead just the right number of steps in your trajectories, you can propagate information backwards more efficiently: each TD update is no longer propagating information back over just one step, it can propagate it over n steps at a time, so you can move information more quickly across a long chain by using n greater than one. So we see a sweet spot with n around three or five. But it's a bit unsatisfactory, because even between the online and offline versions the best n changes, and if you change the size of the random walk the optimal n changes again: on a 100-state random walk all of these curves would shift so as to favour larger n. We'd like algorithms that are robust across many different n, so what we're going to do is come up with an algorithm that gets the best of all n; the goal is to consider all n at once, efficiently.

How do we do that? The way to do it is to notice that we can average over these n-step returns; we don't have to commit to just one of them, we can take a target that combines multiple n-step returns together. For example, consider doing one backup towards the average of two of them: our estimate of the return is the average of, on the one hand, two steps of real reward plus the value function after two steps, and on the other hand, four steps of real reward plus the value function after four steps. We average those together, call that the target, and that target is more robust because it gets the best of both cases; we can use it as the target for our TD learning. The question, then, is whether we can combine all n in this way, efficiently, without increasing the algorithmic complexity, and the answer is yes, we can; the algorithm that does it is called TD(λ).

The main idea is to use a quantity called the λ-return, which is a geometrically weighted average of all the n-step returns going into the future. We have a constant λ between zero and one, and it tells us how much to decay the weighting for each successive n. We give a weight of (1 - λ) to the one-step look-ahead, and then keep multiplying by λ for each successive n; the (1 - λ) is just a normalising factor that makes the weights sum to one. So we multiply the weight by λ going from n = 2 to n = 3, by λ again going to n = 4, and so on, each one weighted successively less, until we get to the end of the episode, at which point we take all the remaining weight, the weight of everything beyond termination if the diagram continued to infinity, gather it together, and put it on the final return. So this is what it looks like: we weight every n-step return
by (1 - λ) times λ to the power n - 1, with (1 - λ) as the normalising factor. That gives us the λ-return, G^λ: the weighted sum of all our different n-step returns, weighted geometrically. And now we use the same trick again: we take G^λ as the target for TD learning. We have our estimate V, we use the weighted sum of all the n-step returns as the target, and that generates an error: this is what we thought the value was going to be, and this, after a whole episode, is the weighted combination of our n-step returns, a very robust estimate of what the return actually looked like. We update our value function a little bit in the direction of that error signal. Is that clear? Good.

Yes, question: is the last weight in this formula always smaller than the previous one? It's always the case, up until the final step when the episode terminates, that the weights shrink by a factor of λ. The final weight can actually be larger. Think of it this way: you can always take an episode that terminates and replace the terminating MDP by one that goes on forever, staying in the same state and getting zero reward again and again. You could draw the diagram going on forever, with the returns getting longer and longer but the values never changing because you're stuck in that terminal state. All we're doing is gathering all of those final weights together into one update, because we know the episode has terminated; it's just an efficiency we gain from knowing that nothing is ever going to change the value after that point. But yes, that final weight can be larger.

So it looks like this: the weighting decays geometrically over the number of steps we look into the future, we generate this λ-return, and we sum up all of the tail weights into one bigger weight applied to the final return, the last step of the episode.

Yes, question: why geometric weighting, and not some other weighting? The answer is that it makes for an efficiently computable algorithm. Geometric weightings are memoryless, and that means you can do this in a very efficient way that doesn't require storing or computing something separately for each of your n-step terms: you can do TD(λ) for essentially the same computational cost as TD(0), and that's only true for geometric weightings. I should say that it is possible to use weightings where the λ varies per time step; that gives you a much broader class while still satisfying this nice property, and we're going to make use of that shortly.
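Putting the weighting into code, here is a sketch of the forward-view λ-return for a single time step of a finished episode, reusing the n_step_return sketch from earlier; as before, the variable names are illustrative assumptions.

```python
# Forward-view lambda-return: each n-step return gets weight (1 - lam) * lam^(n-1);
# the final, Monte Carlo return gets all remaining weight lam^(T-t-1), so the
# weights sum to one. lam = 0 recovers TD(0), lam = 1 recovers Monte Carlo.
def lambda_return(rewards, values, t, lam, gamma=1.0):
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):        # n-step returns that still bootstrap
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    G_lam += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lam
```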
Okay, so what we've seen so far is what we call the forward-view algorithm, and in that respect it's a bit like Monte Carlo: to compute the λ-return you have to wait until the end of the episode to get your n-step returns. You can only compute what happened ten steps from now once the whole episode is finished; then you can go back, compute all your n-step returns, average them together, get your λ-return, and plug it into your algorithm. So it suffers from some of the same disadvantages as Monte Carlo. That's the forward view of TD(λ). What's nice about TD(λ), though, is that there's an equivalent mechanistic view that achieves the same results as the forward view, but without having to look into the future, without having to wait until the end of the episode and gather up those λ-returns.

First, this is the same diagram we had before for the large random walk, now showing sensitivity to the value of λ. We see there's a sweet spot in λ again, but it's much more robust: the value of λ that works best is roughly the same regardless of the size of the random walk, it's much more robust to changes in the environment, and it does about as well as the best n, just by averaging over the n-step returns. λ = 1 is Monte Carlo and λ = 0 is TD(0), and we can do much better in between; you often see curves like this, with a sweet spot between zero and one where you get just the right bias-variance trade-off from bootstrapping.

Okay, so now let's come up with the backward view: how can we achieve this while keeping the nice properties of TD learning, updating online, every step, from incomplete sequences? That is the TD(λ) algorithm. To understand it, I'll put up one diagram. Imagine you're a rat again, like in the first lecture, and you think you're going to get a nice piece of cheese, but then you hear a bell three times, a light comes on, and you get electrocuted. Who thinks you got electrocuted because of the bell, and who thinks because of the light? Show of hands: bell... light... okay, most people thought the light. They're both reasonable answers. If you think the most frequently occurring state caused the event, you assign the credit for the error to the bell: the bell happened more, so you should change your estimates more because of the bell than because of the light. But there's also a recency heuristic, which apparently you are partial to, which says the most recent thing that happened should be assigned the credit for the error you see. The idea of eligibility traces is to combine both of these heuristics. With eligibility traces we look, over time, at the states we visit; this is the eligibility trace for one particular state, and these are the moments at which we visit it. Every time we visit the state we increase its eligibility trace, and while we're not visiting it the trace decays exponentially: we see it, increase it, see it again, increase it again, then don't see it for a long time and it decays away. That combines the frequency and recency heuristics. And what we do is, when we see an error, we update the value function in proportion to the eligibility trace, in
proportion to how much credit we think being in that state deserves for the error: how much was being in the state of hearing a bell the cause of the error I saw? We update the value function of each state directly, in proportion to its eligibility trace.

That gives us the backward-view TD(λ) algorithm. For every state we keep one of these eligibility traces, which is trivial computationally, just a simple operation per state, and we update the value function of every state in proportion to both the TD error and the eligibility trace. So we compute the one-step TD error, the difference between what we thought the value was and our estimate after just one step, and then we update each state's value in the direction of that TD error, scaled by our credit assignment: the states with the most eligibility, the ones we think are most responsible for the error, get updated the most. So this has a backwards flavour now: we're no longer looking into the future; instead the TD error is broadcast back to every state in the past, and the value of each of those states is updated in proportion to the TD error and its trace.

How does this relate to the algorithms we've seen so far? Trivially, when λ is zero this reduces to the TD(0) algorithm. λ tells us how rapidly we decay the trace, so if λ is zero we decay it completely, straight down, immediately after visiting a state, which means eligibility is assigned to a state only at the moment it's visited. So the value function is updated only for the state we just visited, in proportion to the TD error for that state, and never otherwise, which is exactly what TD(0) does. At the other extreme we have λ = 1, and in that case the credit is deferred all the way to the end of the episode. In an episodic environment this has the nice result that if you accumulate all of the updates made by this backward view, the sum of those updates is the same as for the forward view of TD(λ). The backward view and the forward view are exactly equivalent when you do offline updates; they're the same algorithm. So the backward view is just a mechanistic view that can be implemented efficiently, only ever looking backwards in time, and it achieves the principled objective we understood through the forward view, where we were averaging over the n-step returns. The theorem on the slide states the equivalence, and there are notes you can look at offline that prove it and give some intuition.
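Here is a minimal sketch of the backward-view algorithm just described, with accumulating eligibility traces; the env interface (reset and step, with the fixed policy folded into step) is a hypothetical convenience for the sketch, not something from the lecture.

```python
from collections import defaultdict

# Backward-view TD(lambda): keep a trace per state, bump the trace of the state
# just visited, then broadcast the one-step TD error to every traced state.
def td_lambda(env, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                    # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(s)
            delta = r + gamma * (0.0 if done else V[s_next]) - V[s]   # one-step TD error
            E[s] += 1.0                           # frequency + recency: bump the visited state
            for x in list(E):                     # broadcast the error back to all traced states
                V[x] += alpha * delta * E[x]
                E[x] *= gamma * lam               # decay every trace
            s = s_next
    return dict(V)
```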
Finally, there's a mention of some recent work showing that this equivalence result has been extended to online updating as well: even if you change your value function as you go along, you can achieve the same equivalence, using a method from just last year. So that's it, really; that's everything in the slides. Just before we close, any last questions? To summarise: TD(λ) gives you a spectrum between TD(0) and Monte Carlo, and it has two implementations: the forward view, which gives the principled picture of averaging over all your n-step returns, and the backward view, where you accumulate eligibility traces that can be updated very efficiently and that tell you how much credit to assign to each state. You never need to look into the future, you keep all the benefits of TD in terms of working from partial sequences and so on, and you can pick your point on the spectrum just by choosing λ, just as in the plot, so as to get the benefits of both, a sweet spot between Monte Carlo and TD updates. Okay, right: the assignment will come online shortly, and we should probably get out of here to let the next lecturer in. See you after reading week.