Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 5 - Value Function Approximation

All right. Good morning, we're going to go ahead and get started. Homework one is due today unless you're using late days, and homework two will be released today. Homework two covers function approximation in reinforcement learning; we'll start that material today and continue next week with deep learning. Deep learning is not a prerequisite for this class, so we'll be releasing a TensorFlow tutorial later this week, and next week's sessions will also go into some of the deep learning background. You're not expected to be an expert at it, but you need to know enough to do the homeworks and the function approximation. We will be assuming that you're very familiar with things like gradient descent and taking derivatives. TensorFlow and other packages can do that automatically for you, but you should be familiar with the general process that happens. Before we continue, are there any logistics questions? All right, let's go ahead and get started. As you can hear, I've lost my voice a little bit. It's coming back, but we'll see how it goes, and if it gets too tricky we'll have someone take over.

So, what we've been talking about so far is learning to evaluate policies and to make decisions in sequential decision-making settings when the world is unknown. By that I mean we're not given a dynamics model or a reward model in advance. What we're going to start talking about today is value function approximation. Just so I know, who here has seen some form of value function approximation before? Okay, a couple of people, but most people haven't. When I say value function approximation, I mean the following: so far we've been thinking about domains where we have a finite set of states and actions, and where it's computationally and memory-feasible to just write down a table that keeps track of the value of each state or each state-action pair, or to write down the reward model and dynamics model explicitly as tables. But many real-world problems have enormous state and action spaces. Think of the Atari games; we can debate whether that's a real-world problem, but it's certainly a challenging one. As we discussed at the beginning of the course, the state space there is essentially a set of pixels, so it's an enormous space and we're not going to be able to write it down as a table. In these cases we have to go beyond the tabular representation and really think about generalization. To learn to make good decisions, we need to be able to generalize from our prior experience, so that even if we end up in a state-action pair we've never seen exactly before, say a slightly different set of pixels, we can still make good decisions.
So today we're going to start with value function approximation for prediction, and then talk about control. The key idea is that we're going to represent the state value function, or the state-action value function, with a parameterized function. We now have a function where we input a state and, instead of looking up its value in a table, we have some parameters. This could be a deep neural network, it could be a polynomial, it could be many different function approximators, but the key is that we have some parameters that allow us to say, for any input state, what its value is. Just like before, we'll go back and forth between thinking about a state value function and a state-action value function. We'll mostly denote those parameters by w, which you can generally think of as a vector; that vector could be the parameters of a deep neural network or something much simpler.

So why do we want to do this, and what forms of approximation might we consider? We don't want to have to explicitly store or learn something for every individual state-action pair, whether that's a dynamics model, a value function, a state-action value function, or a policy. We need to be able to generalize, so that our agents and algorithms can figure out good policies for these enormous state and action spaces, and that requires compact representations. Once we do this we get several benefits, but we also incur potential problems. We reduce the memory needed to store everything, we reduce the computation needed, and we may be able to reduce the experience needed, meaning how much data our agent has to collect in order to learn to make good decisions. I do want to highlight that there can be really bad approximations: ones that need very little data, computation, and memory, but that just don't allow you to represent very good policies. These choices of representation define hypothesis classes, spaces over which you can represent policies and value functions, so there's a bias-variance-style trade-off here, a function approximation trade-off: if you pick a very small representation, you won't need much data to fit it, but it also won't have much capacity for representing complicated value functions or policies. As a simple example, we could assume that our agent is always in the same state all the time.
Pretend all video game frames are identical. That's a really compressed representation, we only have one state, but it's not going to allow us to learn to make different decisions in different parts of the game, so it won't let us achieve high reward. In general there's a trade-off between the representational capacity of the representation we choose and the memory, computation, and data it requires. Sometimes you get lucky and can choose something very compact that is still sufficient to represent what you need in order to make good decisions, but often there's an explicit trade-off, and we often don't know in advance what representational capacity is sufficient to achieve high reward. Yeah?

Student: Is this more or less an orthogonal consideration from the bias-variance trade-off in function approximation?

The question is whether this is orthogonal to the bias-variance trade-off. You can think of it as related: if you choose a really restricted representational capacity, you're going to have a bias forever, because you just can't represent the true function, and you'll also tend to have smaller variance, because it's a smaller representation. So it's closely related. If you've taken machine learning and talked about things like structural risk minimization, choosing your model class capacity versus how much data you have in order to minimize test error, it's similar to that too: how do you trade off the capacity to generalize versus expressive power?

All right. So a natural immediate question, which I've already started alluding to, is what function approximator we're going to use. There's a huge number of choices; today we'll only start to talk about one particular set, but pretty much anything you could use in supervised learning has probably been tried as a value function approximator in reinforcement learning: neural networks, decision trees, nearest neighbors, wavelet bases, lots of different things. In this class we're mostly going to focus on approximators that are differentiable. These are nice for a number of reasons: they tend to have really nice, smooth optimization properties, so they're easier to optimize, and that's one of the reasons we focus on them. They're not always the right choice, though. For those of you familiar with decision trees, can anybody give me an example of where you might want a decision tree to represent either your value function or your policy?

Student: They tend to be highly interpretable if you keep them simple, so being able to actually understand them could be helpful.

Exactly.
So, what he just said is that, depending on how you're using this reinforcement learning policy, it may be interacting directly with people. Say it's going to be used as decision support for doctors. In that case a deep neural network may not be very effective for justifying why you recommend a particular treatment for a patient, but a decision tree tends to be highly interpretable, depending on what features you use, and that can be really helpful. So the choice of function approximator often depends on how you're going to use it later. There's also been some exciting recent work on explainable deep neural networks, where you first fit a deep neural network and then fit a simpler function approximator, like a decision tree, on top of it, trying to get the best of both worlds: a super expressive function approximator plus interpretability afterwards. It's worth thinking about the application you're looking at, because different choices are appropriate in different cases.

Probably the two most popular classes in RL these days are linear value function approximation and deep neural networks. We're going to start with linear value function approximation for two reasons. First, it was probably the most well-studied function approximator in reinforcement learning up until the last few years. Second, you can often think of deep neural networks as computing some really complicated set of features over which you then do linear function approximation, at least in a number of cases, so it provides a nice foundation for the next part anyway.

All right. Let's do a quick review of gradient descent, because we'll be using it a lot over the next few lectures. Consider some objective function J(w) that is a differentiable function of a parameter vector w; in our case w will shortly be a set of linear weights. Our goal is to find the parameter vector w that minimizes the objective function (I haven't told you what the objective is yet, but we'll define it shortly). The gradient of J(w), denoted ∇_w J(w), is just the vector of partial derivatives of J with respect to each parameter in w. Gradient descent computes this gradient and then moves the parameter vector in the direction of the negative gradient: w ← w − α ∇_w J(w), where α is a learning rate. We figure out the derivative of our objective, take a step of size α, move our weights a little, and keep going. If we do this enough times, are we guaranteed to find a local optimum? Right, assuming there may be some conditions on the learning rate.
But yes, if we do this enough, we're guaranteed to get to a local optimum, under some conditions on the learning rate. Notice this is local. As we start thinking about doing RL this way, it's important to think about where we're going to converge to, and whether we converge at all; I'll talk more about that throughout the class. So this is a local way for us to smoothly change our parameterized representation of the value function in order to get a better approximation of it.

Now let's think about how this would apply to policy evaluation. Again, in policy evaluation someone gives you a policy, a mapping from states to actions, which could be stochastic, i.e. a mapping from states to a probability distribution over actions, and you want to figure out the value of that policy: the expected discounted sum of rewards you get by following it. Let's assume for a second that we could query a particular state and an oracle would give us the true value of the policy at that state. For example, I ask what the expected discounted sum of returns is for starting in this part of the room and navigating towards the door under some policy, and the oracle says the expected discounted number of steps is, say, 30 on average. So the oracle returns pairs (s, V^π(s)), and given all this data we want to fit our parameterized function to represent it accurately: we want the best representation in our function class of these state-value pairs. Framed as stochastic gradient descent, we directly try to minimize the loss between the value we predict and the true value, so it's really very similar to doing supervised learning. In general we'll use the mean squared loss, and we'll return to that later. The objective is

J(w) = E_π[(V^π(s) − V̂(s; w))²],

where V̂ is our approximate value function with parameter vector w, and we do gradient descent on it. Taking the derivative of the objective gives

∇_w J(w) = −2 E_π[(V^π(s) − V̂(s; w)) ∇_w V̂(s; w)],

and with stochastic gradient descent we don't compute the average of this gradient; we sample it, evaluated at a single state. Taking a step of −½α times that sampled gradient gives the update

Δw = α (V^π(s) − V̂(s; w)) ∇_w V̂(s; w).

So there's no expectation anymore; this is just for a single point.
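To make that concrete, here is a minimal sketch of stochastic-gradient policy evaluation against a hypothetical oracle. The oracle function, the feature map, and the toy states are illustrative assumptions, and the approximator happens to be linear just to keep the sketch short; for any other differentiable V̂ you would use ∇_w V̂(s; w) in place of x(s).

```python
import numpy as np

# Minimal sketch of SGD policy evaluation against an oracle. V_hat(s; w) = x(s) . w
# here only to keep the example short; for any differentiable V_hat, replace x below
# with grad_w V_hat(s; w). The oracle, features, and states are made up.

def sgd_policy_evaluation(sampled_states, oracle_value, features, n_weights,
                          alpha=0.05, n_passes=200):
    w = np.zeros(n_weights)
    for _ in range(n_passes):
        for s in sampled_states:                # states drawn under the policy
            x = features(s)                     # feature vector x(s)
            target = oracle_value(s)            # true value V^pi(s) from the oracle
            w += alpha * (target - x @ w) * x   # sampled gradient step on the squared error
    return w

# Toy usage: two states whose "true" values are 1.0 and 3.0.
feats = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
true_v = {0: 1.0, 1: 3.0}
print(sgd_policy_evaluation([0, 1], true_v.get, feats.get, n_weights=2))
# -> approximately [1.0, 3.0]
```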
Right now someone has given us these pairs of states and true values, so you take one of those pairs, compute the gradient at that point, update your weight vector, and do that many, many times. The nice thing is that the expected stochastic gradient update is the same as the full gradient update, so this has nice convergence properties.

Student: Just to confirm, why is the expectation over the policy and not over a set of states, if SGD uses a single state?

The question is what the expectation means in this case. It's over the distribution of states (and their values) that you'd encounter under this policy. That's an important point, and it will come up again later: what is the distribution of data you're going to encounter under a policy?

Of course, in reality we don't have access to an oracle that tells us the true value of any state; if we did, we'd already know the true value function and wouldn't need to do anything else. So what we're going to do now is talk about model-free value function approximation for prediction, i.e. policy evaluation, without a model. If we go back to what we talked about before, we had Monte Carlo style methods and TD learning style methods, where we could adaptively learn a value function online to estimate the value of following a particular policy, using data. We're going to do exactly the same thing now, except that whenever we do the update step with new data, we also have to do function approximation: instead of incrementally updating a table entry for the value of a state, we now re-approximate our function whenever we get new data.

When we start doing this, we have to choose a feature vector to represent the state. Let me ground out what that might mean. Imagine a robot; robots can have tons of amazing sensors, but let's imagine an old-school one that only has a laser range finder. A lot of laser range finders used to sweep 180 degrees, so you get the distance to the first obstacle along each of those 180 directions: maybe here it's two feet, here it's 1.5 feet, here it's 7 feet. That gives you an approximation of, say, what the wall looks like. So here's our robot with a laser range finder telling us the distance to the walls. What would the feature representation be in this case? Simply, for each of the 180 degrees, the distance in that direction. That's one example of a feature representation. It sounds like a pretty reasonable, maybe slightly primitive, feature representation, but what's the problem with it? Well, it probably isn't Markov.
A lot of buildings have hallways where the readings would say: on my left and my right there's a wall about two feet away, and there's nothing in front of me, at least out to the range of my laser range finder, so it reads out of range. That would be true for many different parts of the same hallway, and for many different hallways, so there would be a lot of perceptual aliasing. This feature representation is probably not Markov, but it might still be a reasonable one on which to condition decisions: if you're in the middle of a hallway and that's what it looks like, you probably just want to go forward. That's an example of a feature representation, and it emphasizes again that the choice of feature representation ends up being really important. For those of you who have taken deep learning classes you've probably heard this, but before deep learning there was, and still is, a huge amount of work on feature engineering: figuring out the right way to write down your state space so that you can make predictions or decisions. One of the nice things about deep neural networks is that they push back that feature selection problem, so you can use really high-dimensional sensor input with less hand-tuning. What do I mean by hand-tuning? In this case, you could use the raw features, how far away the nearest obstacle is along each of the 180 degrees, or you could imagine higher-level, abstract features, like trying to detect corners; that is, you could do some preprocessing on the raw data to extract features you think might be relevant for making decisions. The problem with doing that is that if you pick the wrong set, you might not be able to make the decisions you want.

Student: Could you elaborate on why this is not Markov?

Sure. If you just have the 180-degree readings, think about a long hallway, say floor one versus floor two of the Gates building. If your little robot is moving along and using its laser range finder to measure distances to everything, you're not going to be able to distinguish, with that representation, whether you're on floor one or floor two, because your immediate sensor readings look identical. In fact you can't even tell where you are within one hallway versus another.

Student: Can we generalize that: if we have perceptual aliasing, then the representation is not Markov?

Yes. You could change the state representation to make it Markov, for example by including the history; then each individual observation would still be aliased, but the whole state representation would not be. But in general, yes, if you have a state representation under which there is aliasing, it's not Markov. It might still be that you can do pretty well with that representation, or you might not, but it's good to be aware of when deciding which techniques apply. Good questions. All right.
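As a toy illustration of the raw-versus-engineered-features point, here is a small sketch; the 180 range readings and the particular summary features are made up for illustration.

```python
import numpy as np

# Toy sketch of two possible feature maps for the laser-range-finder example.
# The readings are made up; the point is raw vs. hand-engineered features.

def raw_features(ranges):
    """One feature per degree: the distance reading itself (180 features)."""
    return np.asarray(ranges, dtype=float)

def engineered_features(ranges, near=1.0):
    """A few hand-built summaries: min/mean distance and how many directions are 'blocked'."""
    r = np.asarray(ranges, dtype=float)
    return np.array([r.min(), r.mean(), float((r < near).sum())])

ranges = np.full(180, 7.0)      # nothing close in front...
ranges[:20] = 2.0               # ...but a wall roughly to one side
print(raw_features(ranges).shape)   # (180,)
print(engineered_features(ranges))  # [2.0, 6.44..., 0.0]
```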
So let's think about doing this with linear value function approximation. By linear value function approximation I mean we simply have a set of weights and we take the dot product with a set of features: maybe my 180 range readings, with one weight for each of those 180 features. You can use this to represent a state value function or a state-action value function. Those of you already thinking about state-action value functions might notice that there are at least two ways to do it once you get to Q: you can have a separate weight vector for each action, or you can include the action as an additional feature; you get different forms of sharing. But right now we're just estimating the value of a particular policy, so we'll think about state values: V̂(s; w) = x(s) · w, where w is a vector and x(s) is a vector giving the features of that state. The real state of the world might be where the robot is, and the features you get out are those 180 readings.

We again focus on mean squared error: the objective is the mean squared difference between the values we predict and the true values, and the weight update is the learning rate times the (negative) derivative of this objective. What does that look like for linear value function approximation? Taking the derivative of J and using the fact that V̂(s; w) = x(s) · w, whose derivative with respect to w is just x(s), a step of −½α times the sampled gradient gives

Δw = α (V^π(s) − x(s) · w) x(s).

Student: Is this an expected value over all states, or for a particular state?

Good question. When we update w, we evaluate this at one state, so we do it per state; we'll see different algorithms, but generally we're doing stochastic gradient descent, one state at a time. The expectation in the objective is over the state distribution sampled from the policy: if you executed this policy in the real MDP you'd encounter certain states, and we want to minimize our error over that distribution of states. We'll talk shortly about exactly what that distribution looks like.

So if we look at this update, what does it look like? We have a step size, which we've seen before with TD learning; then a prediction error, the difference between the true value and the value our estimator predicts; and then the feature values. That's one of the nice aspects of linear value function approximation: the update decomposes into this natural notion of how far off you were from the true value, weighted by the features.
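In code, the linear case is just a dot product, and its gradient with respect to w is the feature vector itself; a minimal sketch, with illustrative function names:

```python
import numpy as np

# Linear value function approximation: V_hat(s; w) = x(s) . w, so grad_w V_hat(s; w) = x(s).
# `target` is whatever stand-in for V^pi(s) we plug in later: an oracle value,
# a Monte Carlo return G_t, or a TD target r + gamma * V_hat(s'; w).

def v_hat(x, w):
    return x @ w

def linear_vfa_update(w, x, target, alpha):
    """One stochastic gradient step: w <- w + alpha * (target - x.w) * x."""
    return w + alpha * (target - v_hat(x, w)) * x
```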
Student: A question about the math here: you have the negative sign inside, against V̂(s). Shouldn't there be a negative on the x(s) term as well?

The question is about being careful with where the negatives end up. You can push that negative sign around; in general α is a constant, so you can fold signs into it. If you're minimizing, you'll be subtracting the gradient from the weights; you just want to be careful, depending on how you define α, that you're taking gradient steps in the right direction. Good question.

Okay, so how would we actually do this, remembering that we don't have access to the true value function? The equation above assumes an oracle has given you V^π(s), which of course we don't have. So we're going to use the same ideas we saw in the tabular setting, now with value function approximation. The return G_t, the sum of rewards from time step t until the end of the episode, is an unbiased but noisy sample of the true expected return from the state we're in at time step t. So we can think of Monte Carlo value function approximation as doing supervised learning on a set of (state, return) pairs: we substitute G_t in place of the true value. We don't know the true value, but we know the Monte Carlo return is an unbiased estimate of it, so we substitute it in.

What does that mean for linear value function approximation? Inside the weight update we now have G_t. We take the state, we compute the sum of rewards over the rest of that episode (so, as usual with Monte Carlo, this only applies in episodic settings), we take the derivative, which is just x, our features, because we're using a linear approximator, and we compare our return to our current estimate. As usual, G_t may be a very noisy estimate of the return.

Student: Do we distinguish between first-visit and every-visit, like before?

Great question. Yes, exactly the same distinctions apply to Monte Carlo as before. Here I'm showing a first-visit variant, but you could also do every-visit, with the same strengths and limitations as before: every-visit is biased but asymptotically consistent.

Okay, so what does the weight update look like? In this case,

w ← w + α (G_t − x(s) · w) x(s),

remembering that x(s) · w is just our linear estimate V̂(s; w) for that state.
So approximate Monte Carlo policy evaluation is very similar to what we saw before. Instead of a value function table we now just have a set of weights, initialized to the zero vector. We sample an episode, all the way to the end, using the policy; then we step through that episode, and if it's the first visit to a state, we compute the return from that state until the end of the episode and update our weights.

Student: Just to check, is the learning rate there because of the sign of the error term?

In general the update looks like w ← w + α (G_t − x(s) · w) x(s). α is a learning rate, which is a choice. We're trying to minimize our objective function, and we'll need to be a little careful about how we pick α over time; the update is evaluated at each of the states we encounter along the way.

Student: And to be careful on step six, is there a discount factor, or are we just adding up rewards?

Good question. In episodic RL you can always get away with γ = 1, so in an episodic setting γ can always equal one; it is also fine to include a γ here. Generally in episodic cases you'd set γ = 1, because one of the reasons to set γ < 1 is to make sure the value function is bounded, and in the episodic case it's guaranteed to be bounded anyway. But including a γ is completely fine.

Student: A couple of us have the same question about this G: we're sampling G's whose rewards are summed over episodes of different lengths. Doesn't that make the distribution nonstationary, or add variance?

The question is whether there's a problem with the returns being sums over different lengths. There isn't. You're always trying to estimate the value of being in this state under this policy. In the episodic case you might encounter the state early in a trajectory or late in a trajectory, and your value estimate is exactly the average over the returns you observed, whether the state came early or late. We're assuming all episodes are bounded, that with probability one the episode ends. If that's true, your returns are always bounded, and you can just average over them. Sometimes you'll encounter a state early in a trajectory and accumulate a lot of reward afterwards, other times near the end with very little reward afterwards, and the value of interest is the expectation over all of them. A sketch of the full procedure is given below.
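This is a hedged sketch, not the exact pseudocode from the slides: the episode format (a list of (state, reward) pairs generated by following the policy, with hashable states) and the feature map are assumptions, and γ defaults to 1 for the episodic case as discussed.

```python
import numpy as np

# Sketch of first-visit Monte Carlo policy evaluation with linear value function
# approximation. Episodes are lists of (state, reward) pairs generated by following
# the fixed policy pi; `features` maps a state to its feature vector x(s).

def mc_linear_policy_evaluation(episodes, features, n_weights, alpha=0.1, gamma=1.0):
    w = np.zeros(n_weights)
    for episode in episodes:
        states = [s for (s, _) in episode]
        rewards = [r for (_, r) in episode]
        # Returns G_t, computed backwards from the end of the episode.
        G = np.zeros(len(episode))
        running = 0.0
        for t in reversed(range(len(episode))):
            running = rewards[t] + gamma * running
            G[t] = running
        seen = set()
        for t, s in enumerate(states):
            if s in seen:                 # first-visit variant: skip repeat visits
                continue
            seen.add(s)
            x = features(s)
            w += alpha * (G[t] - x @ w) * x   # update toward the sampled return
    return w
```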
Student: One point of clarification: you're updating this value approximation within the episode, so not just once when the episode ends?

Right, you're not just updating the weights once per episode. You look at all of the states you encountered in that episode, and for each of them you update your weight vector.

Student: Is that equivalent to generating all the episodes and then fitting them in a single batch?

Well, you could do this in a batch setting: generate all your data and then fit it afterwards. This is an incremental approach to doing that, and it ends up converging to the same thing.

Student: I'm wondering, should there be a γ^(j−t) so we slowly start discounting going forward?

Again, you shouldn't need a γ in this case; in the episodic case there'd generally be no γ. But it's good to be precise about these things.

All right, so let's think about this for a particular example. It turns out that when we start to combine function approximation with making decisions, and with doing this kind of incremental online update, things can start to go badly: we may not converge, and we may not converge to places that are good in terms of representing the optimal value function. When people really started thinking hard about function approximation in the early 1990s, Baird came up with a nice example that illustrates some of the challenges of combining function approximation with control and decision making. We'll introduce the example now, doing Monte Carlo policy evaluation, and we'll see it a few more times throughout the class.

So what does this example look like? There are two actions. Action a1 is drawn with solid lines, and it deterministically goes to what I'll call state s7; the other states are s1, s2, s3, s4, s5, s6. Inside each bubble you can see that state's feature representation. What does s1 look like? Its feature vector is [2, 0, 0, 0, 0, 0, 0, 1]: the coefficient on weight one is two, and the coefficient on weight eight is one. s2 is [0, 2, 0, 0, 0, 0, 0, 1], s3 has the 2 in the third position, and so on, until we get to s7, which looks a little different: [0, 0, 0, 0, 0, 0, 1, 2]. That is the feature representation of these states. Notice that it looks pretty similar to a tabular representation; in fact there are more features than states, seven states and eight features. That's completely possible: your feature representation could be larger than the number of true states in the world. Action a1 takes us from any state deterministically to state s7. Action a2, drawn with dotted lines, takes you, with probability one sixth, to state s_i for i in one to six.
So a2 basically spreads you uniformly across the first six states. There are only two actions: either you deterministically go to state s7, or you take the second action and go to one of the first six states with equal probability. And it's a pretty simple control problem, because the reward is zero everywhere, for all actions, so the true value function is zero everywhere. And yet we can start to run into trouble in some cases.

Before we get to that part, let's first think about what a Monte Carlo update would do. Let's also imagine there's some small additional probability that from s7 we go to a terminal state: say with probability 0.99 we stay in s7 and with probability 0.01 we terminate. That's a slight modification, but I'm making it so that episodes end and we can do the Monte Carlo case. So from states one through six you either go to s7 or to one of states one through six; from s7 you either go to states one through six, stay in s7, or terminate.

All right, what might an episode look like? Let's say we start in state s1, take action a1, get reward zero, and deterministically end up in s7; take a1 again, get reward zero, stay in s7; take a1 again, get reward zero, and then terminate. That's our episode: (s1, a1, 0, s7, a1, 0, s7, a1, 0, terminal).

Now let's think about the Monte Carlo update, starting with state s1. The return from s1 is zero, since all the rewards are zero. Let's initialize all the weights to one. Then what's our initial estimate of the value of s1? The feature vector of s1 is [2, 0, 0, 0, 0, 0, 0, 1], so V̂(s1; w) = x(s1) · w = 2 + 1 = 3. Let's set α = 0.5. Then the change in the weights is

Δw = 0.5 × (0 − 3) × x(s1) = −1.5 × [2, 0, 0, 0, 0, 0, 0, 1] = [−3, 0, 0, 0, 0, 0, 0, −1.5].

Notice this gives an update for every single weight, but it's only nonzero for the weights whose features are nonzero in this state, namely the first weight and weight eight. If we then compute the new weights, w ← w + Δw, we get

w = [−2, 1, 1, 1, 1, 1, 1, −0.5].

That's one Monte Carlo update, for the first state. You would then do this for every state in the episode, the first time you see it, with the algorithm I defined before. So we'd do it next for state s7, where the return is also zero; the value would be something different, well, actually in this particular case the value of s7 is also three.
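That arithmetic is easy to double-check in a few lines; the feature encoding below is the Baird-style encoding described above.

```python
import numpy as np

# Numeric check of the Monte Carlo update for state s1 in the Baird-style example.
# x(s1) = [2,0,0,0,0,0,0,1] is the feature encoding described above (value 2*w1 + w8).

x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1.0])
w = np.ones(8)
alpha, G = 0.5, 0.0                  # all rewards are zero, so the return is 0

v_s1 = x_s1 @ w                      # initial estimate: 2*1 + 1 = 3
delta_w = alpha * (G - v_s1) * x_s1  # 0.5 * (0 - 3) * x(s1)
w_new = w + delta_w
print(v_s1)      # 3.0
print(delta_w)   # [-3.  0.  0.  0.  0.  0.  0. -1.5]
print(w_new)     # [-2.  1.  1.  1.  1.  1.  1. -0.5]
```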
It depends on whether you've already updated w; if you have, the value of s7 will already be different. Yes, we're doing SGD per state, not per episode.

Student: On the previous slide, for every state we encounter, does that mean every visit, even within the same episode?

For every first visit within that episode. We step along the episode, similar to what we did with Monte Carlo before, and the first time we encounter a state in that episode we update the weights using its return. We do that for every unique state in the episode, the first time we see it, and after all of that we get a new episode.

All right, so that's what would happen. You can see the changes can be fairly large, because we're comparing the full return to our value function. It depends, of course, on α; α can change over time, and generally we'll want it to change over time in order to get convergence. But this gives an example of what a Monte Carlo update looks like in this case with a linear value function approximator.

A natural question is whether this does anything reasonable: are we guaranteed to converge to the right thing? And what does "right" mean here? We're constrained by our linear value function approximator, so the question is whether we converge to the best thing within our linear class.

Before we do that, let's talk for a second about the distribution of states and how it influences the result. If you think back to the first or second lecture, we talked about the relationship between Markov processes, Markov reward processes, and Markov decision processes, and we said that once you define a particular policy, your Markov decision process becomes a Markov reward process: a chain where the next state is determined by your dynamics model, with the action chosen according to your policy. If you run that Markov chain, the one defined by the MDP with a particular policy, you will eventually converge to a probability distribution over states, called the stationary distribution. It tells you what percentage of the time, on average, you're in state one versus state two, and so on. It has to sum to one, because it's a probability distribution and you always have to be in some state, and it satisfies a balance equation. I think it's a little easier to think about it in this direction:

d^π(s') = Σ_s Σ_a d^π(s) π(a|s) p(s'|s, a).

We include the sum over actions so that we allow ourselves stochastic policies. We're in some distribution over states; we consider all the actions we could take from each of those states under the current policy, and where we might transition to next; and that gives us a new distribution over states s'.
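Here is a small sketch of what the balance equation means computationally, using a made-up two-state chain; the transition matrix P_pi is an illustrative assumption, not anything from the lecture.

```python
import numpy as np

# Sketch: the stationary distribution d of the Markov chain induced by a fixed policy
# satisfies d(s') = sum_s d(s) * P_pi(s, s'). We can find it by iterating the chain.
# The 2-state transition matrix here is a made-up example.

P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])        # P_pi[s, s'] = prob of s -> s' under the policy

d = np.array([1.0, 0.0])             # start anywhere
for _ in range(1000):                # run the chain "long enough" to mix
    d = d @ P_pi

print(d)                             # approx [0.8333, 0.1667]
print(np.allclose(d, d @ P_pi))      # True: the balance equation d = d P_pi holds
```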
And those two distributions have to be identical. This is often thought about as a mixing property: once your Markov chain has run for long enough, this balance equation holds, and it just says that your distribution over states at the previous time step has to be exactly the same as your distribution over states at the next time step, once the process is fully mixed. It tells you, on average, what the probability is that on any particular time step you're in a particular state. It does not tell us how long it takes for this to happen; that depends a lot on the underlying dynamics of the system. It might take millions of steps to reach the stationary distribution, or it might mix pretty quickly, depending on the properties of your transition matrix under the policy. I'm not going to get into that in this class; it's just important to know that you can't simply wait a hundred steps and be sure you're in the stationary distribution. It depends on the problem.

Student: Have there been any proven bounds on the mixing time for these kinds of Monte Carlo methods?

Not that I know of; there might be some. It's often a really tricky issue, because you don't know how long it will take to get to this stationary distribution. There's a really cool paper that came out about a month ago that talks about off-policy evaluation, which we'll discuss more later today: instead of thinking about per-step ratios, about whether you would have taken a certain action under a certain policy or not, you can think about these stationary distributions and the differences between them under different policies. The problem is you often don't know how long it takes, and whether your data has actually reached that stationary distribution. It would also be really nice if there were an easy test to tell whether it had; that's also really hard to know.

Student: Earlier you gave a long prelude saying things might not converge, but everything here looks fine. Why isn't this pathological?

Yes, we're going to get into that. In the on-policy setting, where we're just doing policy evaluation, everything is going to be fine. It's only when we get to the control case, where we're using data from one policy to estimate the value of another, that in this example and many others things start to go wrong. So we'll use this as a running example, but right now there's no reason for you to believe it's pathological.

Okay, so that's the stationary distribution, and the convergence guarantees are stated with respect to it. We're going to define the mean squared error of our linear value function approximator with respect to the stationary distribution. Why is this reasonable? Because you probably care more about your function approximator's error in states you visit a lot; if a state is really, really rare, it's probably okay to have bigger error there. You want your overall mean squared error to be defined over that stationary distribution.
So this is the mean squared value prediction error: it compares what we predict with the true value, weighted by this distribution over states,

MSVE(w) = Σ_s d^π(s) (V^π(s) − V̂(s; w))²,

and what we're assuming for now is that the approximator is a linear value function approximator. Let me note, for historical reasons, that this analysis is due to John Tsitsiklis and Ben Van Roy. John is at MIT; I had the pleasure of having him teach me probability, which was great. Ben Van Roy is here at Stanford, and I think was one of John's PhD students or postdocs. Around 1997, people were getting really interested in what happens when you combine function approximation with reinforcement learning, and whether things were good or bad, and they're responsible for this nice analysis.

So, assume we have a linear value function approximator. What you can prove is that Monte Carlo policy evaluation with a linear value function approximator converges to the weights with the minimum mean squared error possible, which is about the best you could hope for. This is in the limit of lots and lots of data, running it many, many times: you converge to the best weights possible. Now, this error might not be zero, because your true value function might not be representable by a linear combination of your features, but it will do the best job it can; it's basically doing the best linear regression you can do on your data. So that's a nice sanity check: it converges to the best thing you could hope to do.

Some people have been asking: okay, you've shown me this incremental method, and maybe in some cases that's reasonable, maybe you're running a customer recommendation system and data arrives over time and you update the estimator as it comes in. But in some cases you might just have access to a whole batch of data from this policy; couldn't you use it more directly? The answer is yes. This is often called batch Monte Carlo value function approximation. The idea is that you have a set of N episodes from the policy, and you can just analytically solve for the best approximator. Again, the returns G_i are unbiased samples of the true expected return, so this is really a linear regression problem: we use the returns as estimates of the true values and find the weights that minimize the mean squared error. You take the derivative, set it to zero, and solve analytically, giving w = (XᵀX)⁻¹ Xᵀ G, where X stacks the feature vectors and G the corresponding returns. Just like we talked about doing policy evaluation analytically in some cases, you can do it analytically here for the linear value function approximator. And note again, this is Monte Carlo: we're not making any Markov assumptions, we're just using the full return, so this is also fine in non-Markov environments.
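A hedged sketch of that batch solution: X stacks the feature vectors of the visited states, G holds the corresponding Monte Carlo returns, and a least-squares solve (rather than an explicit matrix inverse) does the linear regression. The toy data is made up.

```python
import numpy as np

# Batch Monte Carlo value function approximation: given returns G[i] observed from
# states with feature rows X[i], pick w minimizing sum_i (G[i] - X[i] . w)^2.
# Using lstsq instead of explicitly inverting X^T X, for numerical stability.

def batch_mc_linear(X, G):
    w, *_ = np.linalg.lstsq(X, G, rcond=None)
    return w

# Toy usage: 3 observations, 2 features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
G = np.array([1.0, 2.0, 3.1])
print(batch_mc_linear(X, G))   # approx [1.033, 2.033] (the least-squares fit)
```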
Student: Can you speak to the trade-offs of this batch approach versus the incremental gradient approach for policy evaluation? When would we want to do one versus the other?

This generally has higher computational cost. X can be a very large matrix; it may not even be possible to write down XᵀX, since X contains all of your data in feature representation form, and it requires taking a matrix inverse. That may not be feasible if you have huge feature vectors and, say, millions or billions of customers; Facebook can't do this directly. Also, if you run this repeatedly, you're always refitting with all of your data, which can be pretty expensive. So it's mostly about memory and computation: if you have a really small problem it's probably a good thing to do, and it also depends on whether you already have all your data or not.

Student: You could also do minibatches, right? That could help with convergence and keep the gradient estimates from fluctuating wildly.

Right, there's of course an in-between: if you have access to quite a bit of data, you can do it completely incrementally, all in one batch, or in sub-batches. There's some nice work by my colleagues, and by us, showing that, heading into deep learning, there can be a lot of benefit to doing some amount of this analytical computation over a sub-batch of data, particularly once you get into TD learning, because you propagate information a lot more quickly than with the incremental update. Remember that in TD learning we're only doing one step of backup, compared to propagating all of our information back like we do with Monte Carlo.

All right, now let's get into temporal difference learning. Remember that in temporal difference learning we use both bootstrapping and sampling. Monte Carlo only uses sampling to approximate the expectation; TD learning also uses bootstrapping, because we don't wait until the end of an episode: we bootstrap by plugging in our current value function as the estimate of the expected discounted sum of future rewards. So what we did before was bootstrap: our target, as it's often called, is the reward plus γ times the value of the next state. And the reason this is also sampling is that we sample the next state s' to approximate the expectation; we're not taking the full probability p(s'|s, a) and summing over all s'. Before, we represented everything as a table; now we don't want to do that anymore.

Before we get into the details, let me remind us of the forms of approximation we now have: function approximation, bootstrapping, and sampling. But we're still on-policy. What I mean is that right now we're still just doing policy evaluation, so we're getting data from the very policy whose value we're trying to estimate. It turns out things are just way easier in that case, and perhaps that should be somewhat intuitive: it's quite similar to the supervised learning case, where you generally assume your data is IID. Our data is a little more complicated than that.
But our data is closer to IID in this case, because we have a single policy; there isn't the nonstationarity that comes up once we start changing the policy. So right now we have these three forms of approximation, function approximation, bootstrapping, and sampling, but we're still on-policy, and mostly things are still going to be okay in terms of convergence.

So what does this look like? We again think about doing the equivalent of supervised learning. We'd like to have our states and an oracle telling us the values, and fit our function to that; instead of the oracle, we use our TD estimates. Our target is the reward plus γ times our approximate value of the next state, r + γ V̂(s'; w), and that forms our estimate of the true value. Then we find the weights that minimize the mean squared error with respect to that target. As a quick side note, I'm going to say TD(0) a lot. We haven't talked about it in this class, but there's a whole family of slight variants of TD, often called TD(λ), and if you're reading the book that might be a little confusing, so I want to be clear that we're doing the TD(0) variant, which is probably the most popular; there are lots of other extensions, but for simplicity we'll just focus on TD(0) for now.

So in the linear case, we have the TD target, we have our current estimate, and we take the derivative. That means we end up plugging our linear value function approximator in for both the current state and the next state, and looking at that difference, weighted by the feature vector:

Δw = α (r + γ x(s') · w − x(s) · w) x(s).

This should look almost identical to the Monte Carlo update, except that now we're bootstrapping: instead of G, the return for that episode that we saw before, we use the immediate reward plus the estimate of the discounted sum of future rewards, which we get from our value function approximator.

So here is what TD learning with linear value function approximation for policy evaluation looks like: we initialize our weight vector, we sample a tuple (s, a, r, s'), and then we update our weights. We get to update our weights after every single tuple, just like we saw for TD learning, and the update just plugs in our new estimator minus the old estimator, times x.
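A minimal sketch of that algorithm; the transition-tuple format and the feature map are illustrative assumptions.

```python
import numpy as np

# Sketch of TD(0) policy evaluation with linear value function approximation.
# Each transition is a tuple (s, r, s_next) generated by following the fixed policy;
# `features` maps states to feature vectors. Update:
#   w <- w + alpha * (r + gamma * x(s').w - x(s).w) * x(s)

def td0_linear_policy_evaluation(transitions, features, n_weights,
                                 alpha=0.1, gamma=0.9):
    w = np.zeros(n_weights)
    for (s, r, s_next) in transitions:
        x, x_next = features(s), features(s_next)
        td_target = r + gamma * (x_next @ w)   # bootstrapped target
        w += alpha * (td_target - x @ w) * x   # step toward the TD target
    return w
```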
So, why don't we take a minute and calculate what the new weights would be after we do that update, and maybe compare back to the Monte Carlo case in terms of how much they've changed. Feel free to talk to a neighbor. Let me make this a little bigger so it's easy to remember what s7 is.

All right. Have they moved a lot, or a little? How much did the weights change compared to what we saw with Monte Carlo? I've seen some people indicate smaller — yes, that's right. The initial value of the state, x(s1) times w, is still going to be 3. For x(s'), where s' is s7, we look that up and it's also going to be 3. But now Delta w is alpha times (0 + 0.9 times 3 minus 3), which is alpha times minus 0.3. Remember, before it was minus 3, which was a much bigger update. So when we add that into our new weights, we move them, but we move them much less than before. And this shouldn't be too surprising — it's consistent with what we saw for Monte Carlo versus TD learning without function approximation. TD learning only makes these smaller, local changes based on one state, action, reward, next state. Monte Carlo says: this is the full episodic return, I'm not bootstrapping, and the actual return from starting in state s1 was zero, so I'm going to move the value a lot. TD is saying: I'm going to pretend that the return from state s1 is 2.7, which is close to 3, not zero. So when we move our weights, the change is much smaller than for Monte Carlo, similar to what we saw without a function approximator.

All right. What are the theoretical properties in this case? Pretty good. With TD(0) you converge to weights that aren't necessarily quite as good as Monte Carlo, but they're within a constant factor: within 1/(1 - gamma) of the minimum possible error. So not quite as good as Monte Carlo, but pretty good — and depending on your discount factor and your function approximator, this varies in terms of how much it matters.

So, just to check our understanding, I've put up both of these results. One says the Monte Carlo policy evaluator converges to the minimum mean squared value error possible under your linear value function approximator, and TD(0) converges to within 1/(1 - gamma) of this minimum error. Again, what is this minimum error? It says: if you could pick any linear value function approximator, how good could it be at representing the true value of your policy? So let's take another minute — this is a good one to talk to a neighbor about. If the value function approximator is the tabular representation, what is the MSVE for both Monte Carlo and TD? Are we guaranteed to converge to the true value V of pi or not? And if it's not clear what the question is, feel free to ask.

So when you say it's a tabular representation, do you mean that you're reducing the representational capacity of the system? Well — if I say it's a tabular representation, what do I mean by that? I mean that there is one feature for each state; it's like a one-hot encoding.
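To make the one-hot point concrete, here is a tiny sketch (assuming numpy; the number of states and the indexing are just illustrative):

```python
import numpy as np

n_states = 7  # illustrative; one feature per state

def one_hot(s):
    """Tabular feature vector: a single 1 in the slot for state s."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.random.rand(n_states)
s = 3
# With one-hot features, x(s) . w just reads out w[s]:
# the "linear" approximator is literally a table with one entry per state,
# so it has enough capacity to represent any value function exactly.
assert one_hot(s) @ w == w[s]
```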
So it's the same kind of representation we saw in the first few lectures, where for each state you have a table lookup for the value of that state. Yeah? Can you please explain what the zero in TD(0) refers to — like how it relates to TD? Ah, good question. Everything we're talking about in class right now is TD(0). I'm using that name because there are multiple versions of TD, and if you look at the book you'll also see TD(lambda). So I'm just making sure it's clear which version of TD we're using, in case you read other resources.

All right, back to the first question. If we're using a tabular representation, can we exactly represent the value of a policy? If for every single state in the world you have a separate table entry, can we exactly represent the value of the policy? Yes, we can. If you have one feature for every single state in the world it's not going to be practical — we can't actually do this — but you can exactly represent the value of that policy. How could you do this? You could simply run the policy from every single state, take Monte Carlo returns, and average them, and that would give you the true value of each state. So you can represent the expected discounted sum of returns by storing it for every single state in the table. That means this minimum error is equal to 0, because your functional capacity is sufficient to represent the value.

So are we saying that in expectation the difference between what your function approximator predicts and the truth goes to 0, but any individual episode is going to be a bit different? In this case, if you have a tabular representation — and this is in the limit, with infinite amounts of data, et cetera — then this will be 0 for every single state. You will converge to the right value for every single state if you're using a tabular representation. And that's because if you had literally an infinite amount of data and ran your policy infinitely many times, then for every state you'd have an infinite number of trajectories starting from that state, and you can write down each value separately in the table. So it'll be 0. That means the mean squared value error for the Monte Carlo estimator is 0 if you're using a tabular representation. And because it's 0, the same holds for TD: the TD error is bounded by the Monte Carlo MSVE times 1/(1 - gamma), so it's also 0. So if it's a tabular representation — just to connect back to the question — none of these methods have any error.

Yeah, question at the back? Your name first? I'm wondering where the 1/(1 - gamma) constant came from. Yes — where does that 1/(1 - gamma) constant come from? In the interest of time I'm not going to go through it in detail; I encourage you to read the Tsitsiklis paper.
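For reference, the two results being compared can be written roughly as follows (my notation; see Tsitsiklis and Van Roy for the precise norm and conditions):

$$\overline{\mathrm{VE}}_{\min} \;=\; \min_{w} \sum_{s} d^{\pi}(s)\,\bigl(V^{\pi}(s) - \hat V(s;w)\bigr)^{2}$$

$$\overline{\mathrm{VE}}(w_{\mathrm{MC}}) \;=\; \overline{\mathrm{VE}}_{\min}, \qquad \overline{\mathrm{VE}}(w_{\mathrm{TD}(0)}) \;\le\; \frac{1}{1-\gamma}\,\overline{\mathrm{VE}}_{\min},$$

where $d^{\pi}$ is the stationary distribution over states under the policy $\pi$. If the minimum on the right is zero, as in the tabular case, both methods end up with zero error.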
Intuitively, there is error propagating here because of the fact that we're bootstrapping. What this result is trying to highlight is that if your function approximator has no error, then there's no difference between Monte Carlo and TD, because for both of them the minimum over w of the mean squared value error is 0 — it doesn't matter which you use. But if that's not true — if you can't exactly represent the value function — then you're going to get error, and you can think of 1/(1 - gamma) as approximately the horizon: those errors get added up, and the reason they add up is that you're bootstrapping, so you keep propagating the error back, whereas Monte Carlo doesn't suffer from that.

Yeah? In general, the mean squared error is taken over a distribution of the states — under the policy — but the only specific one we've seen is the stationary distribution. Do you ever use another one? Great question. Right now we're looking at this under the stationary distribution over the states you reach under the policy you care about evaluating. For policy evaluation I think that's the right choice, because those really are the states you're going to get to under this policy. Once we start to think about control, you might want others, because you're going to change your policy.

All right, just briefly more on this: is one of them faster, is one of them better? To my knowledge that's not really understood — if you come across any literature on it, I'd love to hear about it. Practically, TD is often better empirically; the bootstrapping often helps.

All right, let's move on briefly to control. It's going to be pretty similar, except that instead of representing the state value function we're going to represent the state-action value function, which is what we saw before when we moved from policy evaluation to control. Now we're going to interleave policy evaluation with a value function approximator and something like epsilon-greedy policy improvement. This is where things can start to get unstable. What are we doing in this case? We generally have function approximation and bootstrapping, we're often also doing sampling, but the really big issue seems to be the off-policy learning. Before, we had this nice stationary distribution — or convergence to a stationary distribution — over states. We won't have that anymore, because we're changing our control policy over time, and that changes the distribution of states we encounter. Sutton and Barto often call this the Deadly Triad: when you start combining function approximation, bootstrapping, and off-policy learning, things can fail to converge, or converge to something that isn't good.

All right. But before we get into that, let's think about it procedurally. Now we're going to have Q functions that are parameterized by a w, and we can again do stochastic gradient descent, so it's going to look almost identical to what we had before.
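Written out (my notation), the generic stochastic-gradient update on the Q weights looks like this, where "target" is whatever estimate of the return a particular algorithm plugs in:

$$\Delta w \;=\; \alpha\,\bigl(\text{target} - \hat Q(s,a;w)\bigr)\,\nabla_{w}\hat Q(s,a;w), \qquad \hat Q(s,a;w) = x(s,a)^{\top}w \;\Rightarrow\; \nabla_{w}\hat Q(s,a;w) = x(s,a).$$

The methods below differ only in what they use for the target.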
And again, stochastic gradient descent samples the gradient, which means for a particular state-action pair we do these updates. So here what we're going to do is represent our Q function with a set of linear state-action weights. That means we'll have features that encode both the state and the action — like what I saw when I was turning left, for my robot, for example — a combination of the two. Then we have a single weight vector on top of that for Q. So we're not keeping separate weight vectors for each action; instead, we have features that try to encompass both the state and the action, and we do stochastic gradient descent on top of that.

So how does this work for Monte Carlo? It looks almost identical to before — we again use our return, except now we define returns from a particular state-action pair. If we're doing first-visit, then the first time we reach that state-action pair in the episode, we take the return — the sum of rewards until the end of the episode — and use it as our target, our stand-in for the oracle, the true Q function, and we update towards it. In SARSA, we use a TD target: the immediate reward from our tuple plus gamma times Q of the next state we encountered and the action we took there, and we plug that in. And Q-learning looks almost identical, except that again we plug in function approximators everywhere: the bootstrapped term is a function of x of s' and a' times w, maximized over the next action, whereas the other term is a function of the current state and action. Everything is linear, and we're just doing different forms of bootstrapping and choosing whether or not to take a max.

All right, I went through that a little bit fast, but it's basically exactly analogous to the first part, which we stepped through more carefully — everything is just with Q functions now. So why might this get weird or tricky? TD with value function approximation does not really follow a gradient. I don't have time to go into the details today, but there are some nice explanations of this in Sutton and Barto, Chapter 11 — it's a great resource — and we also have lecture notes available online. Informally, we're interleaving an approximate, sampled Bellman backup with what's often known as a projection step, because we're trying to project our value function back into the space of representable functions. And the intuition for why this might be a problem is that the Bellman operator, as we showed, is a contraction: when we were doing dynamic programming, we showed that if you do Bellman backups you're guaranteed to converge to a fixed point. But the value function approximation step can be an expansion. What does an expansion mean? Well, just as a reminder, a contraction says: take an operator like the Bellman backup —
if you apply it to two different value functions, the distance between the results, in something like a max norm, is less than or equal to the previous distance. That means as you keep applying the operator, the distance between your old value function and your new value function keeps getting smaller and smaller, and eventually you get to a fixed point. The problem is that now we're not doing that anymore. It's more like we apply the operator to V and then apply some sort of projection operator — I'll just call it P. That's the projection operator: when you compute your new value function, it may no longer lie in your value function approximator's space, and you have to refit it back into that space. And when you do that, the projection itself can be an expansion. For those of you interested in some of the early discussions of this, Geoff Gordon has a really nice paper on averagers from 1995, and it talks about how linear value function approximators can be an expansion. So the Bellman backup is fine — it's a contraction — but when you do this approximation you might expand the distance, and that's one of the problems.

Okay. So let's go back to our Baird example and think about this a little more in the control case. Imagine a setting where you have two different policies. The first policy — the one you want to evaluate — always takes the solid line, always takes a1. Your behavior policy — the policy you're using to gather data — takes a2 six-sevenths of the time and a1 one-seventh of the time. Gamma is 0.99. What you do is generate a whole bunch of data under your behavior policy. Now, there's some really cool work on how you correct for the mismatch between the data you're getting and the policy you want to evaluate; let's imagine we don't go into any of that — which I think is super cool — and instead do something super simple: we throw out all the data that doesn't match. So imagine you just throw away data if a is not equal to pi of s. What do I mean by data points here? We had (s, a, r, s') tuples. If the action taken in a tuple is not the one the policy you want to evaluate would take — and that policy only ever takes a1 — just throw out that tuple and don't update. Now all of your remaining data is consistent with your policy. So imagine you try to do TD learning with that data. The problem is, you can diverge. What do I mean by that? Your weights can blow up. It's super interesting why this happens. The main intuition is that the distribution of your data is not the same as what you'd get under your desired target policy. In particular, if you actually ran the policy pi, what would happen? Say you start in state s1 and take a1. That deterministically gets you to state s7, and you stay in s7 for a really long time because it's deterministic. So you'd get s1, then s7, s7, s7.
Even if it weren't the episodic case and you had multiple episodes, you'd still have very little data about the other states and lots of data about s7. But in the data you get from your behavior policy, because a bunch of the time it takes a2, it keeps teleporting you back to one of s1 through s6. Which means the distribution of your data — the distribution of states you visit — looks very different from the distribution of states you'd visit under pi. And that is the problem. If you don't account for this mismatch, the values can diverge, even though all the data is compatible, in the sense that you only ever use a state-action pair if the action matches your desired policy. This sort of issue can also come up when you're using Q-learning, where you're generally updating the policy over time.

So, just to briefly summarize before we finish. In the tabular case everything converges — it's beautiful. In the linear case, what I mean is that you can chatter: you basically converge, but there might be some oscillation. But Q-learning — where we have this off-policy aspect — can diverge. And once we get into nonlinear value function approximation, mostly all bets are off. Now, this is a little bit of an oversimplification. There has been a huge amount of work and interest in this, because everyone wants to do function approximation so we can tackle real problems, and over the last decade or two there's been a lot of progress. There are some algorithms now that do have convergence guarantees, and there's some super cool recent work on batch RL that can converge with nonlinear approximators. So there's definitely a lot of work here that we're not going to get to. I just want to highlight that the really important issue is not just whether it converges, but what it converges to. You might converge to a point that is a really bad approximation — it's stable, your weights aren't blowing up, but it's just a bad approximator — and some of the critical choices there are your objective function and your feature representation.

Just before we close, I think this is a really nice figure from Sutton and Barto. What they're showing is that you can think of a plane containing all of your representable linear value function approximators. When you do a Bellman update, or a TD backup, you get a value function that might not be representable in your plane, and then you project it back. This lets you quantify different forms of error — basically, it lets you define different objective functions to minimize in order to find the best approximator. We've seen one today, which is essentially a minimum mean squared error objective, but that's not the only choice, and it's not necessarily even the best choice, because the approximator with the smallest error under that objective might not be the one with the best performance on your real problem.
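To put some rough symbols on that figure (this is my paraphrase of the Chapter 11 picture, not the slide itself): the projection onto the representable plane, under some state weighting d, is

$$\Pi V = \hat V(\cdot\,;w_{V}), \qquad w_{V} = \arg\min_{w}\sum_{s} d(s)\,\bigl(V(s) - x(s)^{\top}w\bigr)^{2},$$

and TD-style updates with function approximation behave roughly like iterating $V \leftarrow \Pi\,T^{\pi}V$. The Bellman backup $T^{\pi}$ is a contraction, but its composition with the projection $\Pi$ — especially under a state distribution that doesn't match the target policy — need not be, which is one way to see the divergence issue. The different distances in the figure correspond to different objectives, such as the mean squared value error, the mean squared Bellman error, and the mean squared projected Bellman error.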
So, that's a little bit fast, but it's covered in Sutton and Barto, Chapter 11, if you want to go into more detail. Just really quickly, what are the things you should understand? You should be able to implement these methods with a linear value function approximator, both for policy evaluation and control. You should understand whether or not things converge in the policy evaluation case, and when the solution has zero error versus non-zero error. And you should understand, qualitatively, what issues can come up — that some of these methods may not converge, and that the trouble is this combination of function approximation, bootstrapping, and off-policy learning. All right. That's enough to get started with homework two, which we're releasing today, and next week we'll start to talk about deep learning. Thanks.
Info
Channel: stanfordonline
Views: 33,501
Rating: 4.9491525 out of 5
Id: buptHUzDKcE
Length: 82min 26sec (4946 seconds)
Published: Fri Mar 29 2019