Lecture 11: Reinforcement Learning II

Captions
All right, so let's see where we were with MDPs and reinforcement learning. What was an MDP? An MDP is a problem with a set of states, which is how pretty much every problem we've looked at so far has been abstracted, and a set of actions available in each state. In an MDP we also have a model: it captures the probability of landing in state s' at the next time step, given that at the current time you are in state s and take action a. That's a three-dimensional table of numbers between 0 and 1. There is another three-dimensional table, the reward function, which encodes for every possible current state, current action, and next state what the reward is if you experience that triple. We are always looking for a policy pi: a vector with an entry for every state in the state space, so if there are 100 states, the policy pi is a 100-dimensional vector prescribing for each state the action you're supposed to take. The twist in reinforcement learning is that we don't know T or R. We still know they exist, we know these three-dimensional tables are out there; we just don't know what they are, so we have to somehow learn about them while acting in the state-action space.

In terms of solution techniques, what have we seen? If we know the MDP, meaning we know T and R, there are several things we can do. We can compute the optimal value function using value iteration or policy iteration; as a side effect we usually compute the optimal value function along the way and find an optimal policy pi*. Those are typically the things we're after: the policy is what we end up using, and it's often useful to also know the values. The two techniques are just different ways of getting to the same result; sometimes one is faster, sometimes the other, so which one you want to use depends on the situation. Then, if we know the policy ahead of time, we can do policy evaluation. This is a lot like value iteration, but you fix the action to be the one prescribed by your policy pi; aside from that, you do the same set of updates to find the value of that policy for each state. Alternatively, once the policy is fixed, you can solve a linear system of equations to find the value vector for that policy.

When we switch to reinforcement learning, T and R are not known, at least not ahead of time. One thing we can do is collect experience in the MDP, build an approximation of T and R from that experience, and then go back and use any of the techniques above, the same techniques, just now applied to the approximate MDP, the one we estimated from having experienced transitions and rewards in the environment we're acting in. Then we looked at empirical averaging. The example we started with was computing the expected age of students in the class. We said one thing we can do is first build a model of the distribution over ages and compute the expected age from that distribution; alternatively, we can directly average the sample values we get and never compute the distribution at all. We asked whether we can do the same thing for learning value functions and Q-value functions, and the answer was yes: we have ways of computing V and Q directly from experience, and we saw a couple of algorithms for this.
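To make the model-based route concrete, here is a minimal sketch of estimating T and R by counting and averaging over logged experience. The function and variable names are illustrative, not the course's code, and transitions are assumed to arrive as (s, a, r, s') tuples.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s, a, s') and R(s, a, s') from logged (s, a, r, s_next) tuples."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: count}
    reward_sums = defaultdict(float)                 # (s, a, s_next) -> summed reward

    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a, s_next)] += r

    T, R = {}, {}
    for (s, a), outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        for s_next, c in outcome_counts.items():
            T[(s, a, s_next)] = c / total                        # empirical transition probability
            R[(s, a, s_next)] = reward_sums[(s, a, s_next)] / c  # average observed reward
    return T, R
```

Once T and R are estimated this way, any of the planning techniques above (value iteration, policy iteration, policy evaluation) can be run on the approximate MDP.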
The first thing we looked at was learning values, and we had two ways of doing it. One was direct evaluation: you're in a state, you run the policy, you see what the sum of rewards is, you do this several times, and you average; that's your estimate. The downside of this procedure is that you do it independently for every state, and you don't leverage the fact that when you go from one state to another, there is a connection between the values of those states. So we saw an alternative called temporal difference learning, where we do something like a policy evaluation update, but based on samples. We were still only learning the value of a fixed policy, so we asked: can we move to learning the optimal value function? It turned out there was no way to directly learn the optimal value function V*, but there was a way to directly learn the optimal Q-value function Q*. That's Q-learning, and of course once you have Q* you can recover V* from it and the optimal policy pi* from it. Q-learning is the algorithm we'll build on top of today, so let's look at a reminder of how it works.

You have a sequence: state, action, reward, next state, the action taken in that next state, the reward for the next transition, the next next state, and so forth. At every transition you update your estimate of the value function, in this case the Q-value function, with something very similar to the Bellman equation updates. Our starting point was Q-value iteration, which is what we could run if we knew the model, i.e. knew T and R. Then we said: instead of running it based on the model, which we don't have, let's run it based on samples. How does that work? We keep a current running estimate of Q and average it with the current sample we get. What is the sample? It's the reward we experience, plus gamma, the discount factor, times our estimate of the value of the next state, which we take to be the max over actions of the Q-value in that next state. The alpha here controls the averaging: the closer to 1, the more you weigh the current sample; the closer to 0, the more you retain what you already had. You can imagine wanting alpha to decay over time: initially your estimate is bad, so you want big updates, and as time goes by your estimates get better and better and you want less and less influence from the latest sample, which would otherwise throw things around too much. Any questions about this algorithm? Yes?

Okay, it's a terminology question: the difference between model-based and model-free. In reinforcement learning your algorithm always just gets this stream of experience and nothing else; that's all you're given, whether the method is model-based or model-free. If your method is model-based, that means the first thing you do with the experience is build a model. If you never build a model, and instead directly estimate Q-values or V-values from the sequence of state, action, reward, state, and so forth, then it's a model-free approach. Q-learning is a model-free approach, because nowhere do we explicitly build a model of T or a model of R; we just keep track of a table of Q-values. Any other questions about this?
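Here is a minimal sketch of that tabular update, assuming Q is a dictionary keyed by (state, action) pairs (for example a `collections.defaultdict(float)`) and `actions(s)` returns the legal actions in s; these names are illustrative.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update for an observed transition (s, a, r, s_next)."""
    # Estimate of the next state's value: max over its Q-values (0 if terminal / no actions).
    next_value = max((Q[(s_next, a_next)] for a_next in actions(s_next)), default=0.0)
    sample = r + gamma * next_value
    # Running average: keep (1 - alpha) of the old estimate, blend in alpha of the new sample.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```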
Yes? Say again? The model is everything that comes with the MDP: it consists of the three-dimensional table for the transition model T, the reward function R, and then there is the set of states and the set of actions, which you are given in this case. Yes, the small r? Okay, what the lowercase indicates is that it's not the reward function you're evaluating: the big R would be a three-dimensional table you get to index into; the small r means a reward you experienced. Sometimes it will be indexed by time, to indicate at which time you experienced it, but here we just went with r, and then the next one is r', then r'', and so forth.

Okay, so that's Q-learning. Why do we like it so much? Hold on, this looks like a repeat; yes, it's just the slide we already stepped through. So what is so great about Q-learning? The beauty is that you can act according to pretty much any policy and, through these Q-learning updates, you will still find the optimal Q-value function Q*. You are learning something about the optimal policy, the values of the optimal policy, without ever having to know what it is or ever having to execute it. That's called off-policy learning, and it's a big deal; that's what makes Q-learning so appealing: it can find the optimal policy without having any idea of what that policy might be.

So let's take a look at what that looks like in action. What we have here is our standard grid world. Whenever a state only has the exit action available, you see just one number in that state; in a state with the four standard actions, north, east, south, and west, we see four numbers. These are the Q-values. What we'll watch here is Q-learning in action: it's running some policy with a bunch of randomness in it, and as it goes along you'll see the values get updated. So here is Q-learning in action. What have we seen so far? There is something good at the end, and some bad stuff down here. At this point Q-learning doesn't know that this is a bad state; we know that from looking at the problem, because we know how these problems are set up, but Q-learning has never visited that state, so it still has the original estimate of zero. At some point it may visit it and get a really bad reward, and that's part of the learning process.

So it's going around updating values, and one thing we observed last time, which was really interesting, is that it's good to go all the way right, and even if we periodically jump into the pit, the bad reward we get there will not affect the fact that we think going right is good. Even if we go right twice and then jump into the pit, the values of those going-right actions will still increase, because the Q-learning update for the preceding state uses the best value available in that state, not the value of the action we happened to take next. So when bad things happen in a state, that does not necessarily propagate back; only if it's the best thing that could happen in that state will it propagate back to previous states. Over time we see the values fill in. We also see that this can take a while: some states have barely been visited, so their values are still more or less meaningless at this point. You can in principle keep running this, and once it has visited all states sufficiently often it will have converged.
Okay, so what are the caveats here? We saw this in action, and it can take a long time to run. The need for exploration is still there: we said that whatever you do, as long as you visit every state sufficiently often, it's fine, but you still need to find a way to do that, and ideally you do it efficiently. Eventually you also have to make the learning rate small enough. Why is that? If you don't, you'll keep jumping around. Think of estimating the expected value of a coin flip where heads is minus one and tails is plus one: the expected value is zero. Once you've seen many samples, your estimate is probably very close to zero, but whatever you see next will be a 1 or a minus 1, so if your alpha is too big you'll move very far away from that zero toward the 1 or the minus 1. Your alpha needs to go down for the estimate to really converge to the average value. But it can't go down too quickly, because if it does and you had a weird streak initially, you can never make up for it later: you don't have enough leverage left once your alpha has become too small. In the limit, that's the beauty of it: it doesn't matter how you select your actions, as long as you satisfy these properties. All right, we saw this demo.
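As a small aside on the learning rate, here is a toy illustration of the coin-flip point, using only the standard library. With a decaying alpha such as 1/t the running estimate settles near the true mean of 0; a large fixed alpha would instead keep bouncing toward whichever of +1 or -1 was seen last.

```python
import random

def running_average_with_decaying_alpha(num_samples=10000):
    """Running estimate of the mean of +1/-1 coin flips using alpha_t = 1/t."""
    estimate = 0.0
    for t in range(1, num_samples + 1):
        sample = random.choice([-1.0, 1.0])
        alpha = 1.0 / t                              # decays, but slowly enough to keep learning
        estimate = (1 - alpha) * estimate + alpha * sample
    return estimate                                  # close to 0 for large num_samples
```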
So the question we now have to answer is how you decide to explore versus exploit. Think of it as a real-world scenario: you have your favorite restaurant in Berkeley that you go to very frequently. It's time to go out again, and you could go to your usual place, favorite food, pretty much guaranteed to be really good; but next door is the grand opening of a new place. You have to make a decision. You've never been there; you don't know what it will be like. Most likely it's worse, because the old place is so good, but it could be better, and the only way to find out is to try it. When you try it, you might find out it gives you food poisoning and was a really bad choice, but that's part of exploration: as long as you haven't seen it, you just can't know what it would be like.

So let's look at exploration in action in a different scenario, the bridge world, which you'll also see in Project 3. We can see what's going on here, but keep in mind the agent doesn't know any of it: it has no idea that there is a bridge running through the middle, that on the sides you fall off a cliff and it's really bad, and that there is a really good reward at the far end and a medium reward up front. As far as the agent is concerned, it's just a bunch of states. Now it starts acting in this world. Say it goes left first: you see the Q-value for going left get better and better. Now you have to make a decision: keep going left, which gives a reward of one, or go check out other things? Maybe you should check out other things sometime soon; maybe there's a better reward out there. At this point you cannot guarantee that going left is the best thing to do; you just know it's probably a good thing to do, but there is definitely no guarantee it's the best. So let's see what happens when we start going the other way. It tries it... oops, off the cliff. Again. Only once it has fallen into the cliff does the agent figure out that that was a bad thing to do. There is a bigger reward at the end if you exit there, so that could be a reasonable thing to do, but even once you've found it and keep collecting that reward, you still have no idea what would happen in a lot of these other states. Maybe they have a good reward; as far as we're concerned, there could be a really good reward at the bottom of one of these cliff cells: not that one, not that one, not that one. So you see that exploration is a painful experience: while you're exploring, you take a lot of negative rewards that don't get you anything, but it's the only way to guarantee that you eventually find the optimal policy, because you've ruled out that anything else could still be better.

So how do you explore, and how do you balance the trade-off between exploration and exploitation? The main thing to keep in mind is that to guarantee that after running for a long time we have the optimal Q-values Q*, we need to visit every state-action pair sufficiently often, and while doing so, hopefully we don't incur too much damage. One thing we can do is called epsilon-greedy. What does that refer to? You have some number epsilon, and with probability epsilon you act randomly, while with probability 1 minus epsilon you do whatever your current Q-values say is the right thing to do. If epsilon is close to 1, you do a lot of exploration and not a lot of exploitation, and the other way around.
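Here is a minimal sketch of epsilon-greedy action selection against a tabular Q-function; the dictionary-keyed Q and the `legal_actions` argument are illustrative assumptions.

```python
import random

def epsilon_greedy_action(Q, s, legal_actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit the current Q-values."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q[(s, a)])
```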
Let's look at that in action for the crawler robot. (It's not willing to let me start it... here we go... no, wrong one... here we go.) This is the crawler. At this point we have an epsilon of 0.8, shown down here, which means that 80% of the time the action is chosen randomly, and only 20% of the time are the Q-values followed. So there's going to be a lot of exploration, a lot of learning about how the system works, but maybe not a whole lot of progress off to the right, which is where you want to get, because you get positive reward for moving right. We see a lot of exploration; let's speed this up by skipping over some steps. After enough steps, it will have learned a good Q-function, but it's still not making progress. Why? It's still taking random actions 80% of the time: it might have a really good Q-function, but it's not using it. If we now lower that epsilon and start really exploiting what was learned, we get more interesting results. So let's lower epsilon, from 80% random down to zero, and see what happens when it exploits: it's actually doing really well already. We accelerated this by about a million steps at one point in the video, so after over a million steps of exploration, here is what it learned, and it's crawling forward pretty steadily.

Okay, so what are the problems with these random actions? The good thing is that you eventually explore the entire space, but you can keep thrashing around for a long time before you've done that, and in the meantime you're just doing random stuff: you're not collecting any good rewards along the way, or at least you're not trying to. One solution would be to lower epsilon as learning progresses, so that you do more exploitation and less exploration over time. That actually works; it's not a bad strategy. But you can do something better, using what are called exploration functions. Think back to the idea of lowering epsilon: we had a high epsilon, lots of exploration; we think we've done a good amount of learning, so we lower it and exploit more. That seems right, but the problem is that in one part of the state space you may be ready to exploit while in another part you haven't explored much yet. You want to tune how much exploration you do to where you are in the state space, rather than having one global parameter epsilon that sets how much you explore no matter where you are: any state-action pair you've barely seen, you need to explore to find out what happens there.

The idea, and this is a very frequently recurring idea throughout reinforcement learning, is the principle of optimism in the face of uncertainty. If you have a state-action pair you're very uncertain about, you just assume the outcome is going to be really good and act according to that assumption; if you've seen it many times before, you have statistics on how good or bad it is, and you just work with that estimate. That means that whenever something still looks even a little bit uncertain, you'll jump in and go explore that cave. You might still encounter bad things, and you have to, otherwise you haven't explored everything, but once you know something is bad, you won't explore it again.

How do you do this in practice? Here's one way. We have a new function f that modulates how much we explore. We keep track of a visit count n: for each particular state-action pair, how often you've been in it, so this table N has an entry for every s and a. We also have a constant k, which is just a fixed factor. Then we say: the utility I associate with (s, a) is the utility estimate u that I have for (s, a), plus a bonus that encodes how much uncertainty I have about (s, a). That bonus is k divided by n, where n is how often you've experienced (s, a); the more you've experienced (s, a), the less bonus you get. In the Q-learning update, the regular version looks like this: an alpha-weighted update where the old estimate is weighted by 1 minus alpha, and you add alpha times the reward you experienced plus gamma times the max over actions in the next state of the Q-value. We replace that last piece with something optimistic in the face of uncertainty: we wrap the function f around it. Q is still in there as the utility part and continues to contribute, but you add a bonus that encourages exploration, driven by the count N up here and the constant k, which is fixed for the entire process.
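Here is a minimal sketch of that modified update, assuming Q and the visit counts N are dictionaries keyed by (state, action) (for example defaultdicts) and k is the fixed exploration constant; the function names are illustrative.

```python
def exploration_value(Q, N, s, a, k=1.0):
    """Optimism in the face of uncertainty: f(u, n) = u + k / n.

    The bonus shrinks as a state-action pair is visited more often; unvisited
    pairs are treated as having count 1 to avoid dividing by zero.
    """
    return Q[(s, a)] + k / max(N[(s, a)], 1)

def q_update_with_exploration(Q, N, s, a, r, s_next, actions, alpha=0.5, gamma=0.9, k=1.0):
    """Q-learning update where the next-state value is wrapped in f(Q, N)."""
    N[(s, a)] += 1
    next_value = max((exploration_value(Q, N, s_next, a2, k) for a2 in actions(s_next)),
                     default=0.0)                  # 0 if s_next is terminal
    sample = r + gamma * next_value
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```

When choosing an action, the same `exploration_value` would be maximized over the actions available in the current state, so that uncertain actions look attractive.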
What happens now is that when you choose an action, you look at this modified version of your Q-function, with the uncertainty bonus included, compute the optimal action with respect to it, and act accordingly. One effect is that you will explore in your current state whenever there are actions that still have a lot of uncertainty associated with them. Another effect is that, because you also use f inside the Q-learning update, the bonus propagates through your state space: if there is a state really far out that you don't know much about, then once you visit it, the exploration bonus associated with it gets propagated through Q-learning to the state you were in just before, and the next time you travel in that direction it propagates one state further back, and so forth. So these exploration bonuses propagate throughout your entire Q-value table, which encourages you not only to take actions that are themselves uncertain, but to take actions that lead you toward parts of the state space where there is still a lot of uncertainty. You get very directed exploration: even in a state where you already know everything, you might still take an action that is technically exploration, because it leads you toward states you haven't seen yet. So this is a much more clever way of doing exploration than plain epsilon-greedy.

Yes? What does this arrow with alpha mean? If I write x with an alpha-arrow pointing to some quantity s, that is the same thing as saying x becomes (1 - alpha) times x plus alpha times s; that's the notation. Yes? Okay, that's a good question: you have a choice for alpha. The simplest thing is one global alpha, which would typically depend on how long you've been running if you want it to decay over time. There are other implementations: last time we said that if i keeps track of how many updates you've done, how many experiences you've had, then you could index alpha by i and set it, for example, to 1 over i, so that it decays over time, but not too quickly. You could be more clever still and set alpha equal to 1 over N(s, a), actually, let's see, it would be (s, a), not (s', a'), where you keep track of how often you've seen that particular state-action pair and weight the update accordingly. That often works better; it's more bookkeeping to implement, but it's tailored to each particular state-action pair. Yes? It cannot, right; so we should update this slide to use the lowercase r; you're right.

Okay, so that's Q-learning with a very guided exploration strategy. Let's see how it works on our crawler. Epsilon doesn't really play a role here, because we're using exploration functions to do the exploration. Initially there's a lot of exploration happening, because we don't yet know a whole lot about the states, so the exploration bonus tells you to explore, which is the right thing to do. But then relatively quickly, and note that we didn't skip any time steps here like we did in the other run, it's starting to move off to the right.
It's still doing some exploration along the way, because it doesn't know everything yet, but you can see that it's already starting to exploit very quickly and doing the right thing very quickly. You'll get to play with the crawler, the grid world, and the bridge world in Project 3.

Okay, so now that we've seen two different ways of doing exploration, epsilon-greedy and exploration functions, we could wonder how to quantify whether one is better than the other. Both are guaranteed, in the limit, to do the right thing and find Q*. So which is better? We think the one with the exploration function is better; how do we write that out formally? That's the notion of regret. Even when you eventually learn the optimal policy, you make mistakes along the way. Keep this picture in mind: this is the robot thinking back to the good old days when it was a little child robot, still learning about fire pits and how much they hurt, maybe with some regret, thinking that wasn't the greatest thing to do; it had a lot of pain at the time, but now, after a lot of learning, it knows what the right thing to do is. What regret quantifies is: looking back, once you've learned the optimal policy, how much less reward did I collect during my exploration compared to the optimal policy? If I had acted optimally from day zero, how much reward would I have gotten, in expectation, since it's stochastic, versus what I actually got, given that I had to explore and incur some painful rewards before I really knew how the system worked? The optimal way of exploring is the strategy that, in expectation, comes closest to having acted optimally all along from the beginning. Even an optimal exploration strategy can't match that exactly; there's no way around it, since you need to experience a bad reward before you know a state-action pair has a bad reward. But you can quantify how far off you are, and with exploration functions, at least in the crawler example, we end up much less far from optimal and get there very quickly. So that's a better way of doing exploration: it incurs less regret.

Okay, let's take a quick break here, and after the break let's look at approximate Q-learning. So, approximate Q-learning: we're going to change the algorithm a little bit to make it approximate, and there's a reason we want to do that. Let's look at the motivation first. Say you have some experience, a sequence of state, action, reward, state, action, reward, and so forth, and from it you learn that this situation over here is pretty bad: you're getting cornered by the ghosts, and soon thereafter you're dead. That value gets propagated back, and you realize this is a bad state to be in. Okay, that's great. A little later, you're still running your Q-learning agent, and this other situation comes up. In naive Q-learning, the way we've been doing it so far, the fact that you know the first situation is bad doesn't mean you know this one is bad: your Q-function is just a table of values, one value per state and action, and this is a different state. The fact that you've seen that other state, with ghosts around you, and it was pretty bad, tells you nothing about this state over here.
Q-learning will not somehow transfer that information; it has no mechanism for that; they're all separate Q-values. Even worse, look at this situation here: what has changed? You need to look very carefully to see what changed between the first one and the third one: a single dot. For Q-learning there's no mechanism for saying "just one dot has changed"; all it has is yet another state. So this Q-value could still be at its initial value, whereas that Q-value could already be at the right value, namely some really negative value. That's bad, because it means you learn nothing from experiencing something very similar; you need to actually experience everything before you learn, and that's a bad way to go.

Let's see how well this works. We'll start with just generic Q-learning in a very small world and watch the entire learning process. Okay... it's exploring... it's exploring... the ghost is enjoying this a lot. It has been acting for a while now, and every now and then it succeeds, but it still doesn't really know what the winning strategy is. Now let's skip over 2000 learning trials, 2000 runs until Pac-Man won or died, and see what happens. Here we are: after 2000 trials it's pretty good. Let's think about this. That's a nice result, but how many states are there? Pac-Man can be in any of six positions, the ghost can be in any of six positions, so 36 states and four actions. This is a very, very small problem, and it took about 2000 episodes to learn to act in it. Now look at another problem, just slightly bigger: watch and learn... it still hasn't won; it hasn't been able to learn. With a reasonable ghost it will take a really, really long time before Pac-Man ever experiences something good that it could learn from. This is not going to work very well.

So what was the problem? When we see a bad situation, we learn it's a bad situation, but when we see a similar situation, we don't realize it's most likely equally bad. That's because we keep a table of Q-values and don't connect the values in the table to each other. For a small problem, sure, you can keep your table of Q-values; in realistic situations, however, you cannot hold all of them in memory, there are just too many states. Moreover, aside from not being able to hold them in memory, you might not want to learn for so long that you need to experience all of those state-action pairs many times before you finally learn their values. Think of it as an entire library you'd need to not only keep track of, but also experience, before you could learn anything.

So we want to generalize. This is a very generic problem in machine learning; we'll see a lot of machine learning at the end of the course, but we'll get a little glimpse of it now. What's fundamental about machine learning is that you see some examples, then see new examples, and whatever you were able to say about the first examples, you want to generalize to the new ones. That's the generic problem we want to solve here. How could we do that? Think back to the project you just finished last week: you did something like this already. You weren't able to search to the bottom of the game tree, so you stopped at some point, and when you stopped, you used an evaluation function.
You didn't look up a stored value for that state in a table; you had a function that encoded how good a state might be. What could go into it? Things like the distance to the closest ghost, the distance to the closest dot, the number of ghosts in the game, 1 over the squared distance to the nearest dot, whether Pac-Man is in a tunnel (a 0/1 feature), and so forth. You could put even more specific features in there; you could ask "does the state exactly match the one shown over here?", and that's allowed, but the key idea is to make features that are not specific to just one state, because then they're only useful for that one state. Ideally you want features with general applicability: being close to dots is generally good, being close to ghosts is generally bad, and so forth. So you pick features that reflect what in general would be a good or a bad thing, and perhaps learn whether they're good or bad as you go along.

So what we're going to do now is describe states with features: rather than storing a table of values, we'll have features and combine them to predict the value of a Q-state. In your minimax or alpha-beta search, the value of a state looked like this: a weighted sum of the features. We'll do the same thing here, except these features will depend on both state and action. Could you have them depend only on the state? If you want to do that, you really want them to depend on the next state you'd land in if you took that action, so another way to set this up is to work with f1(s'), or in general, when several next states s' are possible, with something like f1 evaluated over the set of possible s' together with the probabilities of each. If you want to restrict yourself to features of states, which is often easier, that's how you would account for the action. But in general, this is the format: you have states and actions, features as functions of both, and you weight the features and sum them together; that's your Q-value.

What's the advantage? All you need to learn now are the weights w1 through wn, and often you can get away with maybe a hundred weights even when you might have billions of states. The disadvantage is a form of aliasing: two different states can have the same feature values, and if they do, then no matter how you choose the weights, they will have the same Q-values. So you need to be careful about how you pick your features, such that states which end up with the same, or very similar, feature values really are very similar states in terms of how good you think they are. We lose some generality, but we gain the fact that we only have to learn a handful of numbers that generalize over the entire state space.
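To make the feature idea concrete, here is a hypothetical feature map in the spirit of the examples just mentioned (closeness to food, a ghost nearby). The argument names and helpers are illustrative assumptions, not the Pac-Man project's actual API; in practice you would evaluate such features on the position reached by taking action a from state s, as discussed above.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) grid positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def pacman_features(pacman_pos, food_positions, ghost_positions):
    """Hypothetical feature map: a dict of named feature values, roughly in [0, 1]."""
    closest_food = min((manhattan(pacman_pos, f) for f in food_positions), default=0)
    closest_ghost = min((manhattan(pacman_pos, g) for g in ghost_positions), default=None)
    return {
        'bias': 1.0,
        'inverse-dist-to-closest-food': 1.0 / (1.0 + closest_food),
        'ghost-one-step-away': 1.0 if closest_ghost is not None and closest_ghost <= 1 else 0.0,
    }
```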
We don't yet know how to learn those weight entries w_i, so here is what we do. When we see a transition, we get (s, a, r, s'); the difference is what we compute in regular Q-learning, and then we do an update based on that difference: the new Q-value is the old one plus alpha, the learning rate controlling how much you update, times the difference. We now need something similar, but our Q-values aren't stored in a table. So we update each weight, because weights are all we can update; we can't update Q-values directly. For one weight entry w_i, the update keeps what it was and adds alpha times this expression over here. What does that expression say? It still has the difference in it; look at that part first. That's what we had before: if it were just a table of Q-values, we'd update by alpha times the difference, and that's still part of it, meaning that if we have an error, if the sample is different from the estimate, we're going to change something. But we don't just change by the difference; we also multiply by the feature value. Think of an extreme case where the features are always 0 or 1, and for some weight the feature value was 0. If the feature value was 0, then that feature didn't contribute when you computed the Q-value over here; as a consequence, the weight sitting in front of it played no role, so you have no reason to update that weight. It didn't do anything wrong, and you learned it from previous experience, so you keep it as is: it didn't contribute to the error. On the other hand, if the feature value was high, say 1, then it did contribute to your computation of Q, so it did contribute to it being wrong, and you do want to update it: the factor is 1, and you update by the difference. This reasoning doesn't only hold for 0 and 1. Imagine a feature value of minus 1: then you actually want to shift that weight in the opposite direction, because the feature contributed with a flipped sign, and multiplying by minus 1 moves you in the direction you actually need to move. In general, you scale your update by the size of the feature, including its sign, and that lets you put the blame for the error on the features with high absolute values and change those weights a lot, because those features are the ones that really contributed to this computation.

Yes? Good question: the difference is still computed this same way over here, but of course we no longer have a table to look those Q-values up in; we actually have to evaluate them using this weighted-sum expression.
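Here is a minimal sketch of approximate Q-learning with a linear combination of features, assuming `features(s, a)` returns a dictionary of feature values (such as the hypothetical map above) and `weights` is a dictionary of floats; the names are illustrative.

```python
def linear_q_value(weights, features, s, a):
    """Approximate Q(s, a) as the weighted sum of features: sum_i w_i * f_i(s, a)."""
    return sum(weights.get(name, 0.0) * value for name, value in features(s, a).items())

def approximate_q_update(weights, features, s, a, r, s_next, actions, alpha=0.2, gamma=0.9):
    """Shift each weight in proportion to its feature's value, scaled by the difference."""
    q_next = max((linear_q_value(weights, features, s_next, a2) for a2 in actions(s_next)),
                 default=0.0)                                  # 0 if s_next is terminal
    difference = (r + gamma * q_next) - linear_q_value(weights, features, s, a)
    for name, value in features(s, a).items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
```

Note how a feature with value 0 leaves its weight untouched, and a negative feature value pushes its weight in the opposite direction, exactly as argued above.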
Yes? That's a good question too. Something that's really important in machine learning is how you choose your features: the features you come up with determine what you are able to learn. If you have a very good set of features with the right generalization properties, you'll learn really well; if you have arbitrary features, you'll learn really poorly. One thing that's typically done, when a priori you don't know which features are the important ones and you think they're all about equally important, is to make sure they're all on the same scale. You pick a scale, say all features should lie between minus 1 and 1; then for each feature you'd figure out the maximum value it could ever achieve and normalize the feature by that maximum to put it on the right scale. If you don't do that, the problem is that features which happen to live on a very large scale will dominate the updates and swing things around very quickly, and it will be really hard to learn well. In practice, by the way, rather than analytically computing the maximum value of a feature, which can be hard, you often just collect data, compute the features, and empirically see what the maximum value was; or sometimes the standard deviation is computed, and you subtract the mean and divide by the standard deviation to normalize your features.

Okay, so that's our update; we now know how to do it, and the interpretation is that the active features are the ones whose weights get updated the most, which seems to make sense. So let's take a look at how this runs; we'll look at the actual justification soon. We have a Q-function, we're in an initial situation, and we take the action north. What was the Q-value for (s, north)? Well, f_dot is 0.5, times 4 is 2; then 1 times minus 1 is minus 1; so we land at plus 1 as the Q-value. That's our estimate based on the current set of weights. We take the action and get a reward, and we then have to compute the Q-value in the next state. Since we died, the Q-value there is 0: the game is over, there's nothing in the future anymore; we already got the losing penalty right here. So our sample value is the reward, minus 500, plus the max over Q-values at the next state, which is 0, and we have a difference of minus 501; keep track of the sign. That minus 501 appears over here as the difference, the original weights are over here, and these are the feature values, 0.5 and 1. So this is the Q-learning update for approximate Q-learning, with a weighted combination of features as our representation, and this is our new approximate Q-function after updating.

All right, let's see how this works. Say it again? What was alpha? Alpha is always the learning rate: the closer to zero, the slower you update, the less effect each update has, which you might want once you've already learned a lot; the closer to one, the more each sample changes your weight vector, which is what you want initially. So this is a much bigger situation: a full board, lots of states. The state depends on where the food pellets are and so forth, so there are many, many states. We're doing Q-learning with a feature-based representation of the Q-value, and we're watching this from the beginning. Here it learns that running into a ghost is a bad thing; it tried to escape, but I guess it hasn't learned yet that being close to a ghost is bad and that it needs to stay away. Now it's doing pretty well already, and this is just the third episode. Look at that; that's nice, right? Why is it not eating the power pellets? It could be several reasons; two good ones were brought up. One is that if there's no feature about power pellets, there's no way to ever learn anything about them. Another is that it has never experienced a power pellet and just doesn't realize that a power pellet is a useful thing to go get; as far as it knows right now, a power pellet could be good or bad.
All right, so this works really well. Between those two reasons, can we infer which it is: that it doesn't have a feature for power pellets, or that it has just never experienced one? Without watching several runs we can't know for sure, but suppose we were running epsilon-greedy, or, better than that, an exploration function. An exploration function would draw you toward things you don't know about yet, but in a state space like this you don't know about pretty much everything, so that's not very useful as is. To generalize the concept, you need to be drawn toward activating features that you haven't seen activated yet; that's how you generalize optimism in the face of uncertainty: go toward states whose features are very active in ways you haven't seen before. The math for keeping track of that is a little more complicated, but the intuition is the same. If it is indeed doing that, and it is clearly exploring in a clever way, then if it had a feature for eating a power pellet, it would be drawn toward one and go eat it. So after seeing this a couple of times, we know it most likely doesn't have a feature for power pellets, because a good implementation of an exploration function would be drawn toward features you haven't experienced yet.

Okay, so here's the justification for what we just did. What we've seen so far is an example of machine learning, and I presented to you an update that made a lot of sense: we blame the weights whose features were very active, because those weights had a chance to do something better for us. More formally, what's going on is called least squares. You might have seen this before: you have a bunch of points, you want to fit a line through them, and you find the line with the least squared error; it's also called linear regression. You have a function and you assume a form for it, here a weight w0 plus a weight w1 times f1(x), where this axis here is f1(x). That's what we're doing: we represent the Q-function as a weighted combination of features, as if the independent axes are the features and the function lives on top of them. Here is a two-dimensional version, with the function value in the third dimension: this is f1(x), this is f2(x), and this is y. What we try to do is find the weights w such that our function is a good approximation of the dots, the samples we've seen. The standard thing to do in least squares is to look at the error, or residual: think of the function as a prediction of what the value ought to be, look at how far off it is, and the further off, the worse the error; that's what we want to drive as low as possible. Formally, we take the square of the error, because that happens to make the math work out nicely: y-hat, the prediction, minus the sample value that you got to observe, squared, and we want to drive that difference toward zero. We can't drive it completely to zero, because there's no line that passes through all these points, so we find the line with the smallest sum of squares. Another way to write this, making the prediction explicit, is to write the prediction as the sum over weights times feature values. Okay, so how do you minimize that error?
One thing I want to point out here: whenever a slide has a star on it, it means this is something we find important to tell you about, but not something we expect you to fully understand, and we're not going to quiz you about it on an exam; we do want you to get at least the gist of it, which is why it's included in lecture. So how do we minimize that error? Imagine there were only one point; the features are f(x), the target value is y, and the weights are w. Then the error, as a function of w, because w is what we get to tweak, is this expression over here. How would you drive the error down? One thing you can do is take derivatives: to ask how I should change the m-th entry of w, I take the derivative of the error function with respect to w_m and see what it is. That derivative tells me, if it's a large positive value, that increasing that entry w_m will increase the error a lot and decreasing it will decrease the error a lot; it tells you how sensitive the error is to that entry w_m, and also the sign of that sensitivity. So then we update by the error in the prediction, as we said, times the feature value, which is the same thing we have over here. We see that the update we intuitively came up with is an update along the derivative, except you go in the opposite direction from where the derivative points, because you want to go downhill, not uphill. And look at that: it's our standard update, the error times the feature value times alpha, with the same weighting as before. So the standard update we've been looking at can be derived as a least-squares update if you look at just one sample. Same thing here: target minus prediction, that difference multiplied by the feature value, and then the step size alpha over here. So that's the justification for the update we had earlier on: view it as a least-squares fitting problem, as if we get samples of the Q-function and fit them, in the least-squares sense, with our representation, a sum over weights times feature values.
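For reference, here is the gradient computation the starred slides gesture at, written out for a single sample with prediction $\hat{y} = \sum_k w_k f_k(x)$:

$$
\text{error}(w) = \tfrac{1}{2}\Big(y - \sum_k w_k\, f_k(x)\Big)^2,
\qquad
\frac{\partial\,\text{error}(w)}{\partial w_m} = -\Big(y - \sum_k w_k\, f_k(x)\Big)\, f_m(x).
$$

Taking a step downhill with step size $\alpha$ gives $w_m \leftarrow w_m + \alpha\,(y - \hat{y})\, f_m(x)$, which matches the update above once $y$ is the sample $r + \gamma \max_{a'} Q(s', a')$ and $\hat{y}$ is the current estimate $Q(s, a)$.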
So that's our justification. Then you might ask yourself: the more features I have, the richer the functions I can fit, so why not have many, many features, so that I can fit very complicated functions if needed? I can still set weights to zero, so I can still fit simple functions too. Why wouldn't we do that? Well, that's at the core of machine learning: a big concern is that once you have a lot of features, it can be really hard to extract the right weight vector, because there are a lot of things that could have contributed and you don't know what to attribute the error to. Having too many features can make learning a lot harder: the more features you have, the more data you need to learn the right weights. There's a trade-off: you need good features, but if you have too many for the amount of data, you'll start overfitting.

Let's look at an example. Say this is x, and the features are f0(x) = 1, f1(x) = x, f2(x) = x squared, f3(x) = x cubed, and so forth, up to f15(x) = x to the 15th, and our predictions are of the form: the sum over k from 0 to 15 of w_k f_k(x). So we'd be fitting a 15th-degree polynomial to these data points. Here's what we get: the least-squares-best degree-15 polynomial. Is it what you want? Probably not; this stuff here and here is overfitting. It wants to fit the data, and it has the additional degrees of freedom to do so, so it comes out with this weird outcome. If you had a lot more data, it would straighten those out: with a lot of data points living over here, it wouldn't do that. But this shows that you need to trade off the capacity, the number of features, against the amount of data you have; you need to be really careful about having too many features. And here's a cartoon of the same point: with too many features, you might as well have this guy draw the line through your points; it would be equally good.

Okay, here's something else that has gotten some really good results in reinforcement learning; it's a little different from what we've looked at so far, and it's called policy search. Think of it as a big box of policies: you pull one out, see how good it is; it's not that good, so you somehow pick a better one, and you keep moving around until you find a good policy. What's the problem, why aren't we done already, why might we still want this different strategy? Well, say you learn your Q-values. We know that in the limit, with tabular representations, they'll be exact, but in practice we don't have tabular representations; we have these approximate Q-functions. What we learn are Q-functions that are really good at approximating Q-values, but what we care about is picking the right action. They're not trying to do that: they're trying to get the Q-values right, not the ordering between actions right; they're not zoned in on the problem you're really trying to solve. They solve something related, but not the actual problem of finding the best possible policy. The priority for Q-learning is essentially modeling Q-values, not getting the ordering between actions right; this will come back a lot later in machine learning. The solution is to directly learn policies that maximize rewards, rather than the values that predict those rewards. You start with an okay solution; how do you get one? You might actually run Q-learning first. Why is Q-learning good for this? It knows how to connect states that follow after each other, so it's a very efficient way of learning the values. Once you've learned the values reasonably well, you switch to policy search. Policy search, in its simplest form, is the simplest thing you can imagine: you run with a policy and see what the outcome is; if it's a stochastic system you run multiple times and average the sum of rewards, and that's your assessment of the value of the current policy. Then you tweak one of your weights just a little bit and do it again: if it's better, great, you keep the tweak; if it's worse, you go back to the previous weights. It's just hill-climbing, and it's really simple. It won't do well if you start with a bad policy, but if you start with a good policy, you can really nudge it onto the right spot. So it's a lot of little nudges, but you need a really good initialization and very few parameters, you might need many runs to evaluate your current policy because of stochasticity, and keep in mind that if you have a lot of features, you'd better start from a good spot or it's going to be impractical.
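Here is a minimal sketch of that hill-climbing loop, assuming an `evaluate_return(weights)` helper that runs the policy (averaging several rollouts if the environment is stochastic) and returns the average sum of rewards; all names are illustrative.

```python
import random

def hill_climb_policy_search(evaluate_return, weights, step=0.05, iterations=1000):
    """Naive policy search: nudge one policy weight at a time and keep improvements."""
    best_value = evaluate_return(weights)
    for _ in range(iterations):
        candidate = dict(weights)
        name = random.choice(list(candidate))            # tweak a single weight a little
        candidate[name] += random.choice([-step, step])
        value = evaluate_return(candidate)
        if value > best_value:                           # keep the tweak only if it helped
            weights, best_value = candidate, value
    return weights
```

As noted in the lecture, this only works well when the starting weights (for example, ones obtained by running Q-learning first) are already reasonable.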
Okay, all that being said, policy search can do some pretty amazing things, so let's look at what it can do and has done. What you're going to look at here is a helicopter. What's the control problem in flying a helicopter? You get to push air down, which is what allows you to stay up; you push air down because the blades go around, sweep through the air, and push it down. Now, just staying put where you are is really hard: pushing air down is all you get to do, and it's very easy to be pushing air down while in a slightly sideways orientation, and then you start drifting off, because now you're also pushing air sideways. The way you actually control a helicopter is by pushing down a differential amount of air on the left and the right, the front and the back, as you're flying; that's how you control your orientation. The tail rotor can also push a varying amount of air, which lets you swing the tail around. How do you push a varying amount of air down? The blade goes around the rotor, and rather than trying to change the speed of the blade, which is really hard because it goes around about 30 times per second, you change the angle of the blade: at a steeper angle more air gets pushed down; completely flat, no air gets pushed down; and you vary that throughout the cycle, differentially from front to back and left to right. It's a very complicated mechanism, but that's what it does. You can also change the average angle of the blades, which determines the overall amount of air pushed down. So you don't control these helicopters by controlling the RPM of the motor; you control them through the blade angles, which also means they can react very quickly, which they need to, because these are very unstable systems. If you have a helicopter in the air and you don't pay attention for two seconds and then look back, most likely the helicopter is gone; it has slammed into the ground. This is a very hard control problem; you need to be very precise about how you do it.

So here's a learned solution to controlling a helicopter, which is a very hard control problem: flying upside down. Why does that work? You control the angle of the blades, so you can just put in a negative angle and fly upside down. It's actually more efficient to fly upside down. Why? You don't have the body of the helicopter in the way. In more words: there is the body of the helicopter; you pull air in from the top and push it out below, and the air happens to move at about twice the speed below you compared to where you pulled it in. Air friction on the helicopter body goes up the faster the air is moving past it, in fact roughly with the velocity of the air squared, so if you can put the helicopter body above the rotor, it sits in the slower incoming air rather than the faster outgoing air, and you get about four times less friction over the helicopter body by putting it above rather than below. So flying inverted is not only cool, it's also energy efficient.

Okay, we are done with part one. What do I mean by that? We're done with the lectures for part one; there's still a project, there's still a homework, and there's still a midterm, so you're not done yet. But starting next lecture we're switching to part two, and part two will be about probabilistic reasoning: how do we deal with uncertainty, and can we start learning from data for large numbers of
variables? That's it for today.
Info
Channel: CS188Spring2013
Views: 24,057
Rating: 4.977778 out of 5
Id: Si1_YTw960c
Length: 65min 39sec (3939 seconds)
Published: Tue Feb 26 2013