Reinforcement Learning 1: Introduction to Reinforcement Learning

Captions
Welcome. This is going to be the first lecture in the reinforcement learning track of this course. As has already been explained, there are more or less two separate tracks in this course; there is overlap between the deep learning side and the reinforcement learning side, but they can also be viewed more or less separately. Some of the things we will be talking about will tie into the deep learning side — specifically, we will be using deep learning methods and techniques at some points during this course — but a lot of it is separable, can be studied separately, and has been studied separately for many, many years.

In this lecture specifically I will take a high-level view and cover lots of the concepts of reinforcement learning, and then in later lectures we will go into depth on several of the topics. So if you feel there is information missing, yes, that will indeed be the case. However, if you feel I am being confusing, feel free to stop me and ask questions at any time. There are no stupid questions: if you didn't understand something, it's probably because I didn't explain it well, and there are probably plenty of other people in the room who also didn't understand it the way I intended. So feel free to put your hand up and ask questions at any time. We'll also have a short break in the middle, just to refresh everybody.

Okay, let's dive in. I'll start with some boring admin, just so we can warm up. Schedule-wise, most of the reinforcement learning lectures are scheduled at this time, but not all of them; there are a few exceptions, which you can see in the schedule on Moodle. Of course, the schedule is what we currently believe it will remain, but feel free to keep checking it in case things change — or just come to all lectures, and then you won't miss anything. So check Moodle for updates, and also use Moodle for questions; we'll try to be responsive there. As you will know, grading is through assignments.

The background material for this reinforcement learning side of the course is the new edition of the Sutton and Barto book. A full draft can be found online, and I believe it is currently, or will very soon be, in press if you prefer a hard copy — though probably not in time for this course — but you can just get the whole PDF. Specifically for this lecture the background will mostly be chapters one and three, and then the next lecture will come out of chapter two. I especially encourage you to read chapter one, which gives you a high-level overview of the way Rich Sutton thinks about these things, talks about many of these concepts, and also gives you a broad historical view of how these ideas got developed: which ideas came from where, and how they changed over time. Because if you get everything from this course you'll have a certain view, but you might not realize that things may have been perceived quite differently in the past, and some people might still perceive them quite differently right now. I'll mostly give my view, of course, but I'll try to keep as close as possible to the book, and I think our views overlap quite substantially anyway, so that should be fine.

This is the outline for today. I'll start by talking about what reinforcement learning is — many of you will have a rough or detailed idea of this already, but it's good to be on the same page. I'll talk about the core concepts of a reinforcement learning system; one of these concepts is the agent, and I'll talk about the components of such an agent. And I'll talk a little bit about the
challenges in reinforcement learning — that is, what the research topics are, the things to think about within the research field of reinforcement learning. But of course it's good to start by defining what it is.

Before I do that, I'll start with a little bit of motivation. This is a very high-level, maybe abstract, view, but one way to think about it is this. First, many years ago, we started automating physical solutions with machines — this is the Industrial Revolution. Think of replacing horses with a train: we know how to pull something forward along a track, so we build that into a machine, and we use the machine instead of human — or, in the case of horses, animal — labor. That created a huge boom in productivity. After that, the second wave of automation — which is basically still happening, and has been happening for a long while now — is what you could call the digital revolution, in which we did a similar thing, but instead of taking physical solutions we took mental solutions. A canonical example of this is a calculator: we know how to do division, so we can program that into a calculator and have it carry out what used to be a purely mental task on a machine. So we automated mental solutions. But in both of these cases we still came up with the solutions ourselves: we decided what we wanted to do and how to do it, and then we implemented it in a machine. The next step is to define a problem and then have a machine solve it itself. For this you require learning; you require something in addition, because if you don't put anything into the system, how can it know? One thing you can put into a system is your own knowledge — this is what was done with those machines, for either mental or physical solutions. The other thing you can put in is some knowledge of how to learn, plus the data, and then have the machine learn for itself.

So what, then, is reinforcement learning? (By the way, there are still a couple of seats sprinkled throughout the room, so feel free to grab one, because it's getting rather busy.) What is specific about reinforcement learning? I'll posit that we — and many other beings that we would call intelligent — learn by interacting with our environment, and this differs from certain other types of learning. For instance, it is active rather than passive: you interact, and the environment responds to your interaction. This also means that your actions are often sequential: the environment might change because you do something, or you might end up in a different situation within that environment, which means that future interactions can depend on earlier ones. These things are a little bit different from, say, supervised learning, where you typically get a data set that is just given to you, and then you essentially crunch the numbers to come up with a solution. That is still learning — it is still getting new solutions out of data — but it is a different type of learning. In addition, many people agree that we are goal-directed: we seem to be working toward certain goals, maybe without knowing exactly how to reach those goals in advance, and we can learn without examples of optimal behavior. Obviously we can also learn from examples, as in education, but we are also able to learn just by trial and error, and that is going to be important.

So this is the canonical picture of reinforcement learning; there are many
versions of this figure. There is an agent, which is our learning system, and it sends out certain actions, or decisions. These actions are absorbed by the environment, which is basically everything around the agent — even though I drew it as a separate box, as is mostly done in these figures, you can think of the environment as simply everything that is outside the agent. The environment responds, in a sense, by sending back an observation; if you prefer, you can also think of this as more of a pull action by the agent, where the agent observes the environment, whatever it is. Then this loop continues: the agent can take more actions, the environment may or may not change depending on those actions, the observations may or may not change, and you are trying to learn within this interaction loop.

In order to understand why we want to do learning, it's good to realize that there are distinct types of learning. I already made a distinction between active and passive learning, but there are also different goals for learning. Two goals you might differentiate are, one, to find previously unknown solutions — maybe you don't care exactly how you arrive at the solution, but you might find it hard to code it up by hand or to invent it yourself, so you want to get it from data. It's good to realize that this is a different goal from being able to learn quickly in a new environment, and both of these are valid goals for learning. For the first type, an example might be that you want to find a program that can play the game of Go better than any human — the goal is to find a certain solution. For the second type, think of a robot navigating terrain that suddenly finds itself on terrain it has never seen before, and that also wasn't present when people built and trained the robot. Then you want the robot to learn online, and you want it to adapt quickly. Reinforcement learning as a field seeks to provide algorithms that can handle both of these cases. Sometimes they are not clearly differentiated, and sometimes people don't clearly specify which goal they're after, but it's good to keep this in mind. Also note that the second point is not just about generalization — it's not just about learning on many terrains and then being able to deal well with a new one. It's about that a little bit, but it's also about being able to learn online, to adapt while you're doing it. And this is fair game: we do that as well when we enter a new situation; we don't have to lean only on what we've learned in the past.

Another way to phrase what reinforcement learning is: it is the science of learning to make decisions from interaction. This requires us to think about many concepts, such as time and, related to that, the long-term consequences of actions. It requires us to think about actively gathering experience: because of the interaction, you cannot assume that all the relevant experience is just given to you; sometimes you must actively seek it out. It might require us to think about predicting the future, in order to deal with these long-term consequences. And typically it also requires us to deal with uncertainty. The uncertainty might be inherent to the problem — for instance, you might be dealing with a situation that is inherently noisy — or it might be that certain parts of the problem you're dealing with are hidden from you, for instance when you're
playing against an opponent and you don't know what goes on in their head. Or it might just be that you yourself create the uncertainty, because maybe you're following a behavior that is itself a little bit stochastic, so you can't predict the future with complete certainty even based on your own interaction. (I'm just going to repeat once more: there are still a few seats if people want to grab them — one back there, a few up here.)

There is huge potential scope for this, because decisions show up in many, many places if you think about it. One thing I want you to think about is whether this is sufficient to be able to talk about what artificial intelligence is. Of course I could take a stand here; this is just to provoke you into thinking about it. Can you think of things that we're not covering that you might need for artificial intelligence? That's basically what I want you to think about — and if so, we should probably add them.

There are a lot of related disciplines: reinforcement learning has been studied in one form or another many times and in many fields. This is a slide that I borrowed from David Silver, where he noted a few of these disciplines. There might be others, and you could debate some of these examples, although some of them are pretty persuasive. The disciplines he pointed out are, at the top, computer science — which a lot of you will be studying some variant of — in which we do something called machine learning, and you can think of reinforcement learning as being part of that discipline; I'll come back to that later. But also neuroscience: people have investigated the brain to a large extent and found that certain mechanisms within the brain look a lot like the reinforcement learning algorithms we'll study later in this course, so there might be a connection there as well — or maybe you can use the concepts we'll talk about to understand how we learn. Also psychology — maybe a higher-level version of the neuroscience argument — where there is behavior, there are decisions, and maybe you can model these in a very similar way, or even the same way, as you can model the reinforcement learning problem, and then think about what learning entails and how learning progresses using this framework. Separately, on the other side, you have engineering: sometimes you just want to solve a problem, and there are many decision problems out there that people want to solve for many different reasons, typically to optimize something. Within that we have a field called optimal control, which is very closely related to reinforcement learning; many of the methods overlap, although sometimes the focus is a little bit different and the notation can be a little bit different. Fairly similarly, in mathematics there is a subfield — or maybe it's not completely fair to say it's part of mathematics; it's a bit of a Venn diagram itself — called operations research, which is the field where you look for solutions to many problems using mathematical tools, including Markov decision processes, which we'll touch upon later in this course, and dynamic programming and things like that, which are also used in reinforcement learning. Finally, at the bottom, it says economics — though there are other related fields you might consider here. One thing that's quite interesting about
this is that it is very clearly a multi-agent setting: there are multiple actors in a situation, and together — but also separately — they make decisions, and there are all these interesting interactions between these agents. It's also quite natural in economics to think about optimizing something; many people talk about optimizing, say, returns or value, and this is very similar to what we'll discuss as well.

To zoom in a little bit on the machine learning part: people sometimes make the distinction that machine learning has a number of subfields. Maybe the biggest of these is supervised learning, which we're getting quite good at, I would say — a lot of deep learning work, for instance, is done in supervised settings. The goal there is to find a mapping: you have examples of inputs and outputs, and you want to learn that mapping, ideally one that also generalizes to new inputs you've never seen before. Unsupervised learning, in a nutshell, is what you do when you don't have labeled examples: you might have a lot of data, but you don't have clear examples of what the mapping should be, and all you can do — all you want to do — is somehow structure the data so that you can reason about it, or understand the data itself better. Now, some people perceive reinforcement learning as being part of one of these, or maybe a mixture of both, but I would argue that it is different and separate. In reinforcement learning, one of the main distinctions is that you get a reinforcement signal, which we call the reward, instead of a supervised signal. What this signal gives you — and I'll talk about it more later — is some notion of how good something is compared to something else, but it doesn't tell you exactly what to do; it doesn't give you a label or an action you should have taken. It just tells you: I like this, this much. But I'll go into more detail.

So, characteristics of reinforcement learning, and specifically how it differs from other machine learning paradigms, include that there is no strict supervision, only a reward signal, and also that the feedback can be delayed: sometimes you take an action and that action only leads to reward much later, which is also something you don't typically get in a supervised learning setting — although of course there are exceptions. In addition, time and sequentiality matter: if you take a decision now, it might be impossible to undo that decision later, whereas if you make a prediction and update your loss in a supervised setting, you can typically still redo that later. This means that earlier decisions affect later interaction; it's good to keep that in mind, and the next lecture also talks a lot about this.

Examples of decision problems — there are many, as I said, but some concrete examples to help you think about these things include: flying a helicopter, managing an investment portfolio, controlling a power station, making a robot walk, or playing video or board games. These are actual examples where reinforcement learning, or versions of it, have been applied. And maybe it's good to note that these are reinforcement learning problems because they are sequential decision problems, even if you don't necessarily use what people might call reinforcement learning methods to solve them. It's good to make that distinction, because some people think of the current reinforcement learning algorithms
and basically identify the field with those specific algorithms. But reinforcement learning is both a framework for how to think about these problems and a set of algorithms that people describe as reinforcement learning algorithms — and you could be working on a reinforcement learning problem without using any of those algorithms specifically.

I mentioned a few of these already, but the core concepts of a reinforcement learning system are: the environment that the agent is in, a reward signal that specifies the goal of the agent, and the agent itself. The agent itself may contain certain components, and I'm going to go through all of these in the rest of this lecture. Note that in the interaction figure I showed before — this is the same one — I actually didn't put the reward in, and there is a reason for that. Most of these figures in the literature have the reward going from the environment into the agent, and that's fair; in that case the agent itself is basically only the learning algorithm. That means that if you have a robot, the learning algorithm sits somewhere within that robot, but the agent in this picture is not the same as the robot as a whole: the learning algorithm can perceive part of the robot as its environment, in a sense. Typically the environment doesn't care — it doesn't have a reward, it doesn't have that notion; typically it is us who specify a reward, and it lives somewhere within your reinforcement learning system. That's why I didn't put it in the figure: you can think of it as coming from the environment into the agent, or you can think of it as part of the agent but not part of the learning algorithm. Because if the learning algorithm could modify its own reward, weird things could happen — it could find ways to optimize its reward only because it is setting it, not because it is learning anything interesting. So it is useful to think of the reward as being external to the learning algorithm, even if it is internal to the system as a whole.

So what happens here? This is the interaction loop I was talking about, now with a little bit of notation. At each time step t the agent receives some observation — a random variable, which is why I use a capital O there — and a reward, capital R, and the agent executes an action, capital A. The environment receives this action, and you can either think of the environment as emitting a new observation and a new reward, or think of the agent as pulling that from the environment; for now we'll just talk about it as if the environment gives this back as a function that takes in the action and returns the next observation and the next reward. This is a fairly simple setup — fairly small in some sense — but it turns out to be fairly general as well, and we can model many problems in this way.

The reward, specifically, is a scalar feedback signal. It indicates how well the agent is doing at time step t, and therefore it defines the goal, as I said. Now, the agent's job is to maximize the cumulative reward — not the instantaneous reward, but the reward over time — and we will call this the return. This sum trails off at the end there; I didn't specify when it stops. The easiest way to think about it is that there is always a time somewhere in the future where it stops, so that the return is well-defined and finite.
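To make the loop above concrete, here is a minimal sketch in Python of one episode of interaction and the accumulation of the return. The `env` and `agent` interfaces (`reset`, `step`, `act`) are illustrative assumptions for this sketch, not something defined in the lecture.

```python
# Minimal sketch of the agent-environment loop described above. At each step t
# the agent emits an action A_t, and the environment returns the next
# observation O_{t+1} and reward R_{t+1}; the return is the sum of rewards.

def run_episode(env, agent, max_steps=1000):
    """Run one episode and accumulate the (undiscounted) return."""
    observation = env.reset()                          # initial observation O_0
    episode_return = 0.0
    for t in range(max_steps):
        action = agent.act(observation)                # A_t
        observation, reward, done = env.step(action)   # O_{t+1}, R_{t+1}
        episode_return += reward                       # accumulate the return
        if done:
            break
    return episode_return
```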
A little while later I'll talk about what happens when that's not the case — when you have a continuing problem — and how you can then still define a return that is well-defined. Reinforcement learning is based on the reward hypothesis, which says that any goal can be formalized as the outcome of maximizing a cumulative reward. It's basically a statement about the generality of this framework. I encourage you to think about whether you agree that this is true, and if you think it's not, whether you can come up with examples of goals that you could not formalize as maximizing a cumulative reward. To help you think about that: these reward signals can be very dense — there could be a non-zero reward on every step — but they can also be very sparse. If a certain event specifies your goal, you could just give a positive reward whenever that event happens and a reward of zero on every other step, and then there is a reward function that models that specific goal. So the question is whether this is sufficiently general. I haven't been able to find any counterexamples myself, but maybe you will.

[In answer to a question about negative rewards:] Yes, that's a very good question. We use the word reward, but we basically just mean a real-valued reinforcement signal, and sometimes we talk about negative rewards as penalties — this is especially common in psychology and neuroscience. In the more computer-science view of reinforcement learning we typically just use the word reward, even if it is negative, and then indeed you can have things that push you away from situations you don't want to repeat. I'll give an example — I'll revisit it a little later, but it's good to give it now as well. Think of a maze where the goal is to exit the maze. There are multiple ways to set up a reward function that encodes that. One, as I just said, gives a reward of zero on every step but a positive reward when you exit the maze. What you could also do is give a negative reward on every step and then stop the episode when you exit. Maximizing your return then means minimizing the total of those negative rewards, so it still encodes the goal of getting out of the maze as quickly as possible. You could think of one as chasing the carrot and the other as avoiding the stick. To the learning algorithms — or at least to the formalism of the learning algorithms — it typically doesn't matter too much; in practice, of course, everything matters.

Okay, so now that we have returns, we can talk about predicting those returns, and to do that we first have to talk about values. The expected cumulative reward — which is basically the expected return as we just defined it — is what we call the value, and the value in this case is a function of state: the expectation is conditional on the state you put into the function, and is taken over anything that is random. The goal is then to maximize this expected value — rather than the actual random return, which you typically don't know yet — by picking suitable actions. Rewards and values both define the desirability of something, but you can think of the reward as defining the desirability of a single transition, a single step, and the value as defining the desirability of a state more generally, into the indefinite future.
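Written out with the notation used in the lecture, the definitions just given look roughly like this (undiscounted for now; the discount factor comes later):

```latex
% Return from time t up to a final time T, and the value of a state s
% as the expected return when starting from that state:
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T,
\qquad
v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right].
```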
Also note, because we'll be using it quite a bit in this course, that returns and values can be defined recursively. I've put it down here for the return: the return at time step t is just the one-step reward plus the return from the next time step, and that turns out to be something we can usefully exploit.

I said the goal is to pick actions, so we have to talk a little bit about what that means. Again, the goal is to select actions so as to maximize the value, basically from each state you might end up in, and these actions might have long-term consequences. In terms of the reward signal, this means that the immediate reward for taking an action might be low, or even negative, but you might still want to take it if it brings you to a state with a very high value, which basically means you'll get high rewards later. So it might be better to sacrifice immediate reward to gain more long-term reward. Examples of this include a financial investment, where you first pay some money to invest in something but hope to get much more money back later; refueling a helicopter, where you might not gain anything specifically related to your goal from doing that, but if you don't, at some point your helicopter will stop working; and, say, in playing a game, blocking an opponent's move rather than going for the win — you first prevent the loss, which might later give you a higher probability of winning. In any of these cases, the mapping from state to action is what we'll call a policy: you can think of it as a function that maps each state to an action.

It's also possible to condition the value on actions: instead of conditioning just on the state, you condition on a state and action pair. The definition is very similar to the state value; with a slight difference in notation, for historical reasons, this is called a Q-function. For states we use v and for state-action pairs we use q — there's really no reason other than history for that — and we'll talk in depth about these things later. The only difference is that it is now also conditioned on the action; otherwise the definition is exactly the same as before.

Okay, if everybody's on board, I'll now talk about agent components, and I'll start with the agent state. (There's still a little bit of room in the room if somebody wants to grab a chair, so people are not so uncomfortable. Thanks.) I've talked a little bit about states already, but I didn't actually say what a state is; I trusted that you would have some intuitive notion of it. So let me talk about what an agent state is. As I said, a policy is a mapping from states to actions — or, said the other way around, the actions depend on some state of the agent. Both the agent and the environment might have an internal state; typically they actually do. In the simplest case there might be only one state, and both environment and agent are always in that same state. We'll cover that quite extensively in the next lecture, because it turns out you can already meaningfully talk about some concepts, such as how to make decisions, when considering only a single state; it abstracts away the whole issue of sequentiality and states, and the whole next lecture will be devoted to that. But often, more generally, there are many different states, and there might even be infinitely many. What do I mean by infinitely many? Just think of your state as some continuous vector that can lie within some infinite space, simply because you don't know
exactly where it's going to be, and it can basically be anywhere in that space. Then you are in the typical kind of domain where deep learning also shines, where you can generalize across things you haven't seen, because things are sufficiently smooth in some sense.

The state of the agent generally differs from the state of the environment, though at first we're going to unify these, as I'll explain later. It's good to keep in mind that in general the agent might not know the full state of the environment. The state of the environment is basically everything that's necessary for the environment to produce its observations, and maybe its rewards, if those are considered part of the environment. As I said, it's usually not visible to the agent, but even if it were visible, it might contain lots of irrelevant information. Even if you think about the real world — us, or a robot, operating in the real world — even if you could know the locations of all the atoms and everything else that might somehow be relevant to your problem, you might not want to, or even be able to, process all of that. So even in that case it makes sense to have an agent state that is smaller than the full environment state.

Instead, the agent has access to a history: it gets an initial observation, and then the loop starts — you take an action, you get a reward and a new observation, you take another action, and so on. In principle the agent could keep track of this whole history; it might grow big, but we could imagine doing that. An example of such a history is the sensorimotor stream of a robot — just everything that ever happens to the robot. This history can then be used to construct an agent state, and the actions depend on that state.

In the fully observable case we assume that the agent can see the full environment state, so the observation is now equal to the environment state. This is especially useful in smallish problems where the environment is particularly simple, but it does occur sometimes in practice. For instance, playing a single-player board game where you can see the whole board might be such a case, or even a multiplayer board game where you have a fixed opponent. If you're playing against a learning opponent it's no longer the case, because you cannot look inside the opponent's head. If the observation does equal the environment state, then the agent is in a Markov decision process, and I'll define what that means. Many of you may already know, but Markov decision processes are essentially a useful mathematical framework that we're going to use to reason and talk about a lot of the concepts in reinforcement learning. It's much easier to reason about than the full problem, which is non-Markovian, as I'll explain in a bit, but it's also a little bit limited because of the Markov assumption.

So what does it mean to be Markov? A decision process is Markov, or Markovian, if the probability of a reward and subsequent state — I've written it down here, as the new Sutton and Barto edition also does, as a joint probability of the reward and the state — given your current state and action is fully informative: it's the same as if you conditioned on the full history. What that means is that the current state gives you all the information you need to predict the next reward and the next state.
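A minimal way to write that condition down, using the notation from earlier in the lecture (here H_t denotes the history up to time t):

```latex
% Markov property: conditioning on the current state (and action) is as
% informative as conditioning on the whole history H_t:
p\big(R_{t+1} = r,\, S_{t+1} = s' \mid S_t,\, A_t\big)
  \;=\;
p\big(R_{t+1} = r,\, S_{t+1} = s' \mid H_t,\, A_t\big).
```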
If such a probability exists and is fixed — even if you don't know it; I'm not claiming that the agent knows it — then it's a Markov decision process. Intuitively it means that the future is independent of the past given the present, where the present is now your state. In practice this is very nice and useful, because it means that when you have this state you can throw away the history — and the history can grow unboundedly, which is something you don't want to keep doing. So instead you much prefer the case where you can throw everything else away and just keep the state. Another way to phrase that is that the state is a sufficient statistic for the history. The environment state typically is Markov in most cases; there are exceptions, for instance a non-stationary environment, but typically you can think of the environment state as being Markovian and you're just not able to perceive it, so things might appear non-stationary even when they aren't. The history itself is also Markov — conditioning on the history given the history is trivially the same thing — but it grows big.

More commonly we are in a partially observable case. This means the agent gets partial information about the true state. Examples include a robot with camera vision that is not told its absolute location, or what is behind the wall, or a poker-playing agent which only observes the public cards. There are multiple ways in which that second one is partially observable: one is that it can't see the cards of the opponents, and the other is that it can't see inside the brains of the opponents. Formally these are called partially observable Markov decision processes. There is a lot of literature on these, and especially on solving them exactly; a lot of that we're not going to cover, but it's good to keep in mind that this is actually the common case — you just get some observations, and they don't tell you the full state. That doesn't mean you necessarily want to use the solution methods from the literature on partially observable Markov decision processes, especially those that solve these things exactly, because that tends to be a very hard and interesting problem, and it also tends to be quite computationally expensive. And again, the environment state can still be Markov even if you only get a partial observation of it — the agent simply has no way of knowing. Is that clear?

Okay, so now we can talk about what the agent state is. The agent state, as I said before, is a function of the history. The agent's actions depend on this state, so it is important to have it. In a simple example, you just make the observation your agent state. More generally, we can think of the agent state as something that updates over time: you have your previous agent state, an action, a reward, and an observation, and from those you construct a new agent state. Note, for instance, that building up your full history is of this form — you just append things — but there are other things you can do; you could, for instance, keep the size of the state fixed rather than have it grow over time as the history would. What I denote here with f is sometimes called the state update function, and it's an important notion that we'll get back to later. It's actually a very active area of research how to construct a state update function that is useful for your agent, especially if you cannot just rely on your observations.
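In symbols, the update just described has the general form below; taking the full history as the agent state is the special case where f simply appends the new action, reward, and observation.

```latex
% General agent-state update: the new agent state is a function of the
% previous agent state and the latest action, reward, and observation.
S_{t+1} = f\big(S_t,\, A_t,\, R_{t+1},\, O_{t+1}\big).
```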
The agent state is typically much smaller than the environment state, and typically also much smaller than the full history, simply for computational reasons.

Here's an example. Assume a very simple problem where this is the full state of the environment — a maze. Maybe it's not quite the full state; maybe there's also an agent in the maze that I didn't draw. But let's say this is your full state of the maze, and let's say there is an agent and it perceives a certain part of it, so the observation is now partial: the agent doesn't get its coordinates, it just sees these pixels, say. Now what might happen is that the agent walks around in this maze and, some time later, finds itself in this other situation. This is an example of a partially observable problem, because these two observations are indistinguishable from each other: just based on the observation, the agent has no way of knowing where it is. So, a question for you to ponder: how could you construct a Markov agent state in this maze, for any reward signal? I didn't specify what the reward signal is; if you want, you can just make one up — maybe there's a goal somewhere. Does anybody have a suggestion?

[In response to a suggestion:] Right — in this case you'd have to carefully check, for this specific maze, whether that is sufficient. It might be; it might depend on your policy. If you have an action that stands still, it might not be enough, because you might see the same observation twice. If that action doesn't exist in this maze, it might actually be enough — I didn't carefully check. But the more general idea, which I think is the right idea here, is that you use some part of your history to build up an agent state that somehow distinguishes these two situations. If you have a certain policy, it might be that in the left state you always came from above and in the right state you always came from below, and just having that additional information — what the previous observation was — might be enough to completely disambiguate these situations. That is indeed the idea of a state update function. A simple state update function would just concatenate the previous observations and, each time you see a new observation, drop the oldest one. That is actually done quite frequently; for instance, in the Atari games you saw before, the full agent state was just a concatenation of a couple of frames — basically an augmented observation.

[In answer to a question about the time indices:] In the ordering here, you're in a certain state S_t, and based on this state you take an action A_t; we consider time to tick after you take the action — basically when you send the action to the environment. This is just a convention; some people write R_t rather than R_{t+1}, so be aware. We'll take the convention that the time steps when you send the action to the environment; then the reward and the new observation come back, and we consider the next agent state S_{t+1} to be a function of this new observation, so that when you take your next action you can already take the newest observation into account. If it were O_t rather than O_{t+1}, you couldn't take your newest observation into account when taking your next action. It's a good question.
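As a concrete sketch of the frame-concatenation idea just mentioned, here is a minimal, hypothetical state update in Python that keeps only the last k observations (the class and method names are illustrative; in this simple choice of f the action and reward are ignored, as in the Atari example):

```python
from collections import deque

import numpy as np


class FrameStackState:
    """Agent state built by stacking the most recent k observations."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)   # oldest frame is dropped automatically

    def reset(self, observation):
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(observation)
        return self.state()

    def update(self, action, reward, observation):
        # f(S_t, A_t, R_{t+1}, O_{t+1}): here only the newest observation is used.
        self.frames.append(observation)
        return self.state()

    def state(self):
        # Fixed-size agent state, e.g. shape (k, height, width) for image frames.
        return np.stack(self.frames, axis=0)
```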
So, to summarize — I've said many of these things already — to deal with partial observability the agent can construct a suitable state representation. Examples include: you could just have the observation be the agent state, but this might not be enough in certain cases; you could have the complete history as your agent state, but this might be too large and hard to compute with; or you could have some incrementally updated state, which in this example only looks at the observations and ignores the rewards and the actions. If you write it down like this, you may notice that it looks remarkably similar to a recurrent neural network — which I know we haven't yet covered on the deep learning side, but we will — and the update there looks exactly like this. That already implies we can use deep learning techniques, namely recurrent neural networks, to implement the state update function, and indeed this has been done. For this reason the agent state is sometimes also called the memory of the agent. We use the more general term agent state, which maybe includes the memory and maybe additional things, but you can think of the memory as an essential part of your agent state, especially in these partially observable problems — or, alternatively, you can think of memory as a useful tool to build an appropriate agent state.

That wraps up the state part. Feel free to inject any questions; otherwise I'll continue with policies, which is fairly short. The policy defines the agent's behavior, and it's a map from the agent state to an action. There are two main cases. One is the deterministic policy, which we'll just write as a function that outputs an action: a state goes in, an action comes out. But there is also the important case of a stochastic policy, where there is a probability of selecting each action in each state. Typically we won't be too careful in differentiating these: you can think of the stochastic one as the more general case, where sometimes the distribution just happens to always select the same action, and then you've covered the deterministic case as well. Note that I haven't specified anything about the structure of this function, or even the structure of the action. In the beginning of the course — and actually throughout the course — we'll mostly focus on the case where the actions can be thought of as part of a discrete set. For instance, the joystick used in the Atari games basically has up, down, left, right, shoot, and those types of actions, but it doesn't have "move your motor a little bit in this direction". We call that a continuous action, and there are also algorithms that can deal with those. For the notation it doesn't really matter: it's just a function that outputs, say, an integer in the discrete case, or a vector or real-valued number in the continuous case. I'm not yet talking about how to learn these things — that will come later in the course — which means we can move on, because there's a lot to be said about learning policies but not too much about what a policy is.
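A minimal illustration of the two cases just described, for a small discrete action set (the table-based representation here is an assumption made for this sketch, not something prescribed by the lecture):

```python
import numpy as np


def deterministic_policy(state, action_table):
    """pi(s) = a: a simple lookup from state index to action index."""
    return action_table[state]


def stochastic_policy(state, action_probs, rng=None):
    """pi(a | s): sample an action from a per-state distribution over actions."""
    if rng is None:
        rng = np.random.default_rng()
    probs = action_probs[state]              # e.g. one row of a |S| x |A| table
    return rng.choice(len(probs), p=probs)   # action index sampled with these probabilities
```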
Now we'll move on to value functions. As said before, the value function is just the expected return conditioned on the state — and, something I didn't actually mention before, it's also conditioned on a policy; I basically hid that on the previous slide. There's another thing I hid, which I'm introducing here: a discount factor. So the return is now redefined; it's a slightly different return from before, with this gamma in between. If gamma equals one it's the same as before, just your accumulation of rewards into the future. In many cases we actually pick a gamma that is slightly less than one, and what that does is trade off immediate rewards against long-term rewards, putting higher weight on the immediate rewards. Basically you're down-weighting, or discounting — which is why it's called the discount factor — the future rewards in favor of the immediate ones. Think of the maze example from earlier, where you get a reward of zero on each step and then, say, a reward of +1 when you exit the maze. Without discounting, the agent basically has no incentive to exit the maze quickly; it would be just as happy to exit at some arbitrary time in the future. With discounting, all of a sudden the trade-off starts to differ, and it will favor being as quick as possible, because the exponent on this gamma will be smaller if it takes fewer steps to reach the exit — it will have discounted the future return less.

The value depends on the policy, as I said, and it can be used to evaluate the desirability of states — one state versus another — and therefore it can also be used to select between actions; you could, say, plan one step ahead. It might be more convenient in that case — although I didn't put it on the slide — to use the action values, because those immediately give you the value of each action. This is just the definition of the value; of course, we're going to approximate these things later in our agents, because we typically don't have access to the true value. (Oh, sorry — there's a plus sign missing there at the top: it should be the reward R_{t+1} plus the discounted future return gamma G_{t+1}; I'll fix that before the slides go on Moodle.)

I said this before for the undiscounted case, but I'll say it again for the discounted one: the return has a recursive form — it's the one-step reward plus the remaining return, now discounted once — and that means the value also has a recursive form. We can write down the value as the expectation of this return, and because the expectation can also be applied inside, to this G_{t+1}, this is equivalent to just putting the value there again. This is a very important recursive relationship that we will heavily exploit throughout the course. Notation-wise, note that I'm writing A as being sampled from the policy — so this assumes a stochastic policy, but as I said, the deterministic case can be viewed as a special case of that. This equation is known as the Bellman equation, after Richard Bellman, from 1957. Interestingly, there is a similar equation for the optimal value, which is the highest possible value for any policy; that equation is written down there. It takes the action that maximizes the one-step reward plus the optimal value at the next step, so it is again recursively defined. You can essentially view this as a system of equations: if there is a limited number of states and a limited number of actions, this is just a system of equations that you can solve, and thereby obtain the optimal values and the optimal policy. In order to do that you need to be able to compute this expectation, which is something we'll cover later as well, using dynamic programming techniques to solve it.
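For reference, the discounted return and the two recursive relationships just described can be written out as follows (with the missing plus sign restored):

```latex
% Discounted return and its recursive form:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = R_{t+1} + \gamma\, G_{t+1},

% Bellman equation for a policy pi:
v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ A_t \sim \pi(S_t) \right],

% Bellman optimality equation:
v_*(s) = \max_a\, \mathbb{E}\left[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s,\ A_t = a \right].
```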
[In answer to a question about the recursion:] Yes — it's basically the top line there, which is missing the plus and the gamma G_{t+1}, and it's based on the recurrence of the return, which I hope is somewhat clear: you can split the return into a single reward and then the rest of the return, which is again an accumulation of rewards. To then get the recursive form of the value, it's enough to note that inside the expectation on the top line — because it is already an expectation over the future — you can put another expectation around the inner return, which is then just the value by definition; it's a nested expectation, but that's equivalent. You can also write this down very explicitly, with sums over the probabilities of landing in each state, and we will get back to that later — I'll give you explicit formulas and show that this recursion holds — not in the next lecture, but in the one after that.

[In answer to another question:] Let me rephrase the question to make sure I understand it correctly: if you're looking ahead from a certain state and you want to consider, say, ten steps into the future, do you want to optimize for right now, or for each of those steps? At each state you basically want to follow the policy that maximizes the expected return from that state. That essentially means that in the last state you want to do the optimal thing, but in the state before that, you want to do the optimal thing conditioned on the fact that in the last state you're going to do the optimal thing — so in that sense it is also recursive. There is a different matter here, just to clarify: there's also the question of which states you care about. Do you care about behaving optimally from this state, or about behaving optimally from all states? If you can solve everything exactly, you can have both: you can just be optimal from every state you could possibly be in. Later, when we start approximating things, you'll have to pick which states you care about and which you don't, and then you might care more about having good solutions in certain states than in others.

[Another question:] So the question is whether you can solve this by recursing backwards, starting at the end. At the end it's a simple problem in a sense, because you just look at the instantaneous reward and pick the action that optimizes it, and that gives you the optimal value of that state; then iterating backwards is indeed a valid and often-used solution technique. What you could also do — and I'll talk about this in much more depth — is look at all states at the same time and use these recursive definitions to incrementally move towards the solution. So you could either start at certain states, say at the end, and recurse, which might be more efficient, or you could update all of them at the same time, and you'll still get to the optimal solution.
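As a small illustration of that backward recursion, here is a sketch for a tiny, fully known, finite-horizon problem with deterministic transitions (all of the arrays and names here are hypothetical; nothing is learned — the model is simply assumed to be given):

```python
import numpy as np


def backward_induction(rewards, next_state, horizon, gamma=1.0):
    """Optimal values for a known, deterministic, finite-horizon problem.

    rewards[s, a]    -- immediate reward for action a in state s (assumed given)
    next_state[s, a] -- resulting state index (deterministic transitions, assumed given)
    Returns values[h, s]: the optimal value of state s with h steps to go.
    """
    num_states, num_actions = rewards.shape
    values = np.zeros((horizon + 1, num_states))   # zero value with 0 steps to go
    for h in range(1, horizon + 1):                # work backwards from the end
        for s in range(num_states):
            # With one step to go this is just the best immediate reward.
            values[h, s] = max(
                rewards[s, a] + gamma * values[h - 1, next_state[s, a]]
                for a in range(num_actions)
            )
    return values
```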
[In answer to a question about distributions of returns:] It's a very good question. Here we're approximating expected cumulative rewards, or expected returns, but sometimes you care more about the whole distribution of returns, and this is definitely true. It actually hasn't been studied that much. There has been quite a bit of work on things like safe reinforcement learning, where people, for instance, want to optimize the expected return conditioned on the return never dropping below a certain level. But recently — and by that I mean last year — a paper was published on distributional reinforcement learning, where the distribution of returns is explicitly modeled. There is a little bit of prior work on that, but actually not as much as you might think, and it turns out you can do very similar things with recursive definitions in that case. Modeling the distribution is in some cases very helpful: it can help steer your decisions away from risky situations — that would be called risk-averse in economics lingo — or you could be more risk-seeking, which can also sometimes be useful, depending on what you want to do. So yes, very good question; that's very current research.

[In answer to a question about the max over actions:] We're not marginalizing over the action, we're literally maximizing over it, which is a little bit different. It is similar in the sense that you get rid of the dependence on a, and therefore on the policy, so this whole recursively defined value no longer depends on any policy, because we do this max at each step. You could similarly think of marginalizing at each step, but that's slightly different, because when marginalizing you take a distribution into account, and here we're not interested in a fixed distribution over actions — a fixed policy — but instead we choose to maximize over it. But yes, otherwise it's very similar.

[In answer to a two-part question:] One part is how to deal with continuous domains, for instance continuous time, and the other is how to deal with approximations, because even without continuous time the state space might be huge, which also requires you to approximate. Approximations are going to be very central in this course, and we're going to bump into them all the time. Even if everything is very small, you'll still have approximations in the sense that you don't know these values; and if you can't compute the expectation, because you don't know the model of the environment, then you still have to approximate these values. You could do this simply by sampling, but some ways of sampling are more efficient than others, and some learning algorithms are more efficient than others. On the continuous-time point: there is a variant of the Bellman equation called the Hamilton-Jacobi-Bellman equation — sometimes the Hamilton-Jacobi equation — which is basically its continuous-time version. That one is more often studied in control theory and control problems, where typically a lot of things are continuous, but there people also typically make more assumptions about the problem, which then allows them to solve it. It again essentially becomes a system of equations, but now with infinitely many inputs and outputs; you can still solve these things if you make suitable assumptions about the problem. We won't touch on that much in this course, but I'm happy to give you pointers if you want.

[In answer to a question about the return:] Yes — the return is the actual thing you see, so it's random, it's sampled, and the value is the expectation of that.

[In answer to a question about non-Markov problems:] I actually already gave an example: sometimes people set up an environment in which these probabilities change over time, which means it's already not Markov. We would call that a non-stationary
environment. In that case there are always ways to work around it, which can be a bit peculiar mathematically — in some sense you could say the way it changes might itself be a function of something, and if you take that into account maybe the whole thing becomes Markov again — but that's usually complex, so you often don't want to go there. It's often much simpler to just say it changes over time, and then it isn't Markov. There are other reasons a problem might not be Markov, but non-stationarity is one that pops up quite often.

[In answer to a question about defining returns:] So the question is how you define the returns, which you can actually fold back into the question of how to define the rewards. For instance, take the financial investment example: a natural way to model things is to let each reward be the difference in, say, the money you have; the accumulation of these is then the difference between what you had at the beginning and what you had at the end, and you want to maximize that — a very natural thing to do. But instead you might define events: you might say, I get a reward whenever my money goes above this level, and I get a penalty whenever it drops below that level. Maybe you don't care about the exact amount; maybe instead of modeling the expected return of money you care about some other function of the money. Often you can then fold that into the reward function. Related to the earlier question about modeling distributions rather than the expected return: you can think of this as modeling the distribution by modeling variants of the return that are more event-based, in a sense. Sometimes, though, it's quite tricky to set up these events cleanly; this is why, for instance, in safe reinforcement learning people more typically still model, say, the expected money, but then add the condition that they don't want it to drop below something. It might be possible to phrase the problem differently — weighted differently, where certain negative rewards are weighted more heavily, and that is the reward the learning system gets — but sometimes that is harder to do than just solving it with certain constraints. Very good questions.

One high-level thing I want to say here is that a lot of what I've shown so far are just definitions. The return and the value, for instance, are defined in a certain way, and the way they're defined might depend on the indefinite — essentially infinite — future, which means you don't have access to these quantities in practice; they are just definitions. Later we'll talk about how to learn, and when you learn we'll get back to this interaction loop, where you receive these rewards one at a time. That means you typically don't have access to the full return yet — or you might never have access to it, because it might be infinitely long — but you can still learn. For now we're just defining these concepts, and later we will get back to them, so don't worry if you're not quite sure yet how you would use them; I will explain that in future lectures.
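To make the investment discussion above a bit more concrete, here are two hypothetical ways to turn it into a reward signal; the threshold values are made up purely for illustration:

```python
def reward_difference(money_before, money_after):
    # Reward is the change in money; the return then telescopes to the
    # total change between the start and the end of the episode.
    return money_after - money_before


def reward_event(money_after, upper=110.0, lower=90.0):
    # Event-based alternative: +1 when money rises above one threshold,
    # -1 (a penalty) when it drops below another, and 0 otherwise.
    # The thresholds here are arbitrary illustrative values.
    if money_after > upper:
        return 1.0
    if money_after < lower:
        return -1.0
    return 0.0
```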
One high-level thing I wanted to say here: a lot of what I have shown so far are basically just definitions. The return and the value, for instance, are defined in a certain way, and the way they are defined may depend on the indefinite, essentially infinite, future, which means you don't have access to these quantities in practice; it is just a definition. Later we will talk about how to learn, and then we will get back to this interaction loop where you receive rewards one at a time, which means you typically don't have access to the full return yet, or you might never have access to it because the episode might be infinitely long. You can still learn even so. For now we are just defining these concepts, so don't worry if you are not yet sure how you would use them; I will explain that in future lectures.

So, as a final note on value functions: much of what we are going to talk about revolves around approximating them. As I said, these are just definitions, both of them: one for a given policy, the other for the optimal value function. I didn't say how to get them or how to approximate them, and there are multiple reasons you might want to approximate them. One reason, which I already mentioned, is that your state space might be too big to model these things exactly, or even to fit them in memory, so you might want to generalise across states, as you would typically also do with neural networks in deep learning. Another reason is that you might not have access to the model needed to compute these expectations, so you might need to sample, which means you end up with approximations that get better as you sample more, but may never be exactly correct.

Yes, I probably should have put Q-values on here; they will come back in a later lecture where I will have them explicitly, but since I have the V here, let me tell you what the Q function is for both of these; that might be helpful. For the first one, we are conditioning on a randomly picked action, which comes from your policy as a function of s. With a Q function there will be an action, a small a, on the left-hand side, and we condition on the action actually taken being that action. Then in the inner part, where we have the recursion, you could still have the V for the same policy pi, or alternatively you could write it as a summation over actions, with the probability of selecting each action times the associated Q function, the state-action value, at the next step. As I said, I will show those equations later in the course; we will get back to that extensively. For the optimal value definition, essentially what happens is that there will again be an action on the left-hand side which we condition on, so the max over a disappears on the outside of the expectation, because we have selected an action as an argument to the function rather than maximising over it, but it reappears inside: there will be a discount times the maximum action value in the next state. But like I said, you don't have to remember that right now; we will get back to these extensively. Thanks, good questions.

So I was talking about approximating these things, and we will discuss algorithms to learn them efficiently in many cases. In the case of a small MDP, where there is a small state space, you can approximate these things in some way, and maybe you have access to the model; we will talk about that. But we will also talk about the case of a huge state space, maybe pixels, where you have thousands of pixels each of which can take many different values, and we might still want to learn a value function; we will talk about how to learn in those cases, when you don't have access to the model and need to sample. Whatever we do, when we do get an accurate value function, we can use it to behave optimally. I say accurate here, by which I basically mean the exact optimal value function. More generally, with suitable approximations we can behave well even in intractably big domains. We lose optimality in that case: because we are learning and approximating, there is no way to get the actual optimal policy, but in practice you often don't care that much, because good performance is already very useful, and if the problem is intractable anyway, that is the best you are ever going to get.
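A minimal tabular sketch of the relations just described, with made-up numbers rather than anything from the slides: the state value under a policy is the policy-weighted average of the state-action values, and a greedy policy simply picks the action with the highest Q value.

    # q[state][action]: assumed state-action values; pi[state][action]: a fixed policy.
    q = {
        "s0": {"left": 1.0, "right": 3.0},
        "s1": {"left": 0.5, "right": -1.0},
    }
    pi = {
        "s0": {"left": 0.5, "right": 0.5},
        "s1": {"left": 0.9, "right": 0.1},
    }

    def v_from_q(state):
        """v_pi(s) = sum over actions of pi(a|s) * q_pi(s, a)."""
        return sum(pi[state][a] * q[state][a] for a in q[state])

    def greedy_action(state):
        """A greedy policy with respect to q: argmax over a of q(s, a)."""
        return max(q[state], key=lambda a: q[state][a])

    for s in q:
        print(s, "v_pi =", v_from_q(s), "greedy action =", greedy_action(s))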
So that wraps up the value part of the agent. I am going to talk a little bit about the model, although we cover that less in this course; one reason is that it is actually kind of tricky to learn and use these things, and there are also time constraints. A model is basically a prediction of what the environment dynamics are. For simplicity, think of the fully observable case, where the state is both the environment state and the agent state; that just simplifies thinking about these things, although you can generalise. We might have some function that predicts the probability of any next state given a state and an action. You could also predict the expected next state, but I chose to write down the probability distribution here, so we are explicitly modelling the distribution of next states. In some cases it is useful just to predict what the expected next state looks like; sometimes that is not so useful, because in expectation you might be partially in a hole instead of fully in a hole or not in a hole at all, or the door might be both open and not open in the expected state, which might not be a real state. So in some cases the expectation doesn't make a lot of sense, and in other cases it does; maybe the more general thing to do is to model the full distribution of possible next states for all states. Similarly, for the reward we could have a model that depends on the state and the action and predicts the reward for that state and action. You could augment this, making it also a function of the next state, so that it predicts the reward for a given state, action and next state. In some cases this is easy: maybe the reward is deterministic given those, and you can learn it very quickly. In other cases it might be stochastic, and in the worst case it could even be non-stationary, so you might want to track it rather than approximate it as a stationary quantity. A model is useful, and we will talk later about how to learn with models, or basically plan with them, but it doesn't immediately give you a good or optimal policy, because you still need to plan. In the next lecture we will talk about how to learn when you do have the exact model, using dynamic programming, and how to construct value functions in that case. There are many problems in which this is actually the situation: think of the game of Go. You are in a certain state, which is basically fully observable; you take a certain action and you know exactly what is going to happen: if you place your stone there, the stone ends up there, so your next state is fully known. The model is there and you can use it. In other cases, like a robot walking through a corridor, it is much trickier: you might not have access to the true model and it might be very hard to learn one. So whether it makes sense is very dependent on the domain, and this is why I put the model down as an optional part of your agent. Many reinforcement learning agents don't have a model component; some do. There are also in-between versions, where we might have something that looks a lot like a model but does not try to capture the full environment dynamics, only part of them, and then maybe you can still make use of that.
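As a minimal sketch of what such a model could look like in the simple tabular, fully observable case (an assumption for illustration, not the method covered later in the course), one can count observed transitions to estimate the next-state probabilities and average observed rewards to estimate the expected reward.

    from collections import defaultdict

    class TabularModel:
        """Count-based estimates of transition probabilities and expected rewards."""
        def __init__(self):
            self.transition_counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> s' -> count
            self.reward_sum = defaultdict(float)                            # (s,a) -> summed reward
            self.visits = defaultdict(int)                                  # (s,a) -> count

        def update(self, s, a, r, s_next):
            self.transition_counts[(s, a)][s_next] += 1
            self.reward_sum[(s, a)] += r
            self.visits[(s, a)] += 1

        def p(self, s_next, s, a):
            """Estimated probability of s_next given state s and action a."""
            n = self.visits[(s, a)]
            return self.transition_counts[(s, a)][s_next] / n if n else 0.0

        def expected_reward(self, s, a):
            """Estimated expected reward for taking action a in state s."""
            n = self.visits[(s, a)]
            return self.reward_sum[(s, a)] / n if n else 0.0

    # Usage with some made-up experience tuples (state, action, reward, next state):
    model = TabularModel()
    for s, a, r, s2 in [("s0", "right", 1.0, "s1"), ("s0", "right", 1.0, "s1"),
                        ("s0", "right", 0.0, "s0")]:
        model.update(s, a, r, s2)
    print(model.p("s1", "s0", "right"), model.expected_reward("s0", "right"))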
One last thing I wanted to say about models: here I had a version that gives you the full distribution, but sometimes it is useful for that to be implicit, and instead to have a model that you can sample from. We could call that a sample model, or a stochastic model, or, as it is often called in deep learning, a generative model: given a state and an action, it gives you a sampled next state. You can then still build a full trajectory by sampling from it again and again. This is something you cannot do with an expected-state model: if an expected state comes out of your model, you cannot necessarily feed that back into your model, because, as I said, that expected state might not be a state that actually occurs in the real problem.

I will make these things a little more concrete with an example: a simple maze. There is a certain start and a certain goal, and there are only four actions: you can move up, left, down and right, or north, east, south and west if you prefer. The state is just the location of the agent, which in this case gives you all the information you need, because the environment is fixed. It is a little bit weird if you think about it: the state doesn't include any observation of where the walls are, but because everything is fixed, the location still tells you everything you need to know. We define the reward to be minus one on each time step, and there is no discounting, but because of the minus one per step you are still encouraged to leave the maze as quickly as possible. So what might a policy look like? This is an optimal policy for this maze, which in each state gives you a deterministic action. In some problems the optimal policy might be stochastic, but here there is clearly a deterministic policy that gets you out of the maze as quickly as possible. This is maybe the simplest thing you might need to solve this problem: the policy mapping. I didn't specify how we might learn it, which we will touch upon later, but it is good to realise that this is the minimal thing you might need. Alternatively, or additionally, you might learn the value. This is the true value function for the policy I just showed; because that policy happens to be the optimal policy, this is also the optimal value function. If I had picked a different policy, the numbers would have been different, and it would be the value conditioned on that policy. The value here is particularly simple: it is just minus the number of steps before you reach the goal, as you would expect. Note, by the way, that we consider the goal reached when you actually exit the maze, so that final square has a value of minus one, because you still need to take the action of leaving the maze before the problem terminates. So the returns, which earlier were trailing off into a potentially infinite future, are actually finite here, at most 24 steps. The model in this case might also be quite simple: the reward model is just minus one in each of these states, and the transition model is also quite simple. But in this picture part of the maze is missing, which is meant to illustrate that maybe we only have a partial model, or our model is only partially correct: in one of these states a connection is missing, maybe because your model has simply never learned it — perhaps you have never taken that action in that state, and your model by default assumes there is a wall unless you have taken the action and seen that there isn't one.
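To make the "value equals minus the number of steps to the goal" observation concrete, here is a small sketch on a made-up maze layout (not the one on the slide), assuming reward minus one per step, no discounting, and optimal behaviour: a breadth-first search outward from the exit gives the optimal values directly.

    from collections import deque

    # Made-up maze: '#' is a wall, 'G' is the exit; reward is -1 per step, no discounting.
    maze = ["#######",
            "#  #  #",
            "#  #  #",
            "#     G",
            "#######"]

    def optimal_values(maze):
        """Optimal value of each free cell = -(number of steps needed to exit the maze)."""
        rows, cols = len(maze), len(maze[0])
        goal = next((r, c) for r in range(rows) for c in range(cols) if maze[r][c] == "G")
        values = {goal: 0}
        queue = deque([goal])
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] == " " and (nr, nc) not in values:
                    values[(nr, nc)] = values[(r, c)] - 1  # one more step away from the exit
                    queue.append((nr, nc))
        return values

    for (r, c), v in sorted(optimal_values(maze).items()):
        print((r, c), v)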
Given this approximate model, you could still plan through it, and you would still find the same optimal solution, even though the model is not fully correct in all states. In other cases your model might be approximate in a different way — there might be a wall in the model where there isn't one in reality — and then you might find a completely different value function and a different policy, which might not be appropriate for the true problem.

So now we can categorise agents, and this is also to get you acquainted with the language used in the literature. There are many different ways to build an agent: it can have any or many of these components, and there is also a difference between whether the agent has a component and whether it has it explicitly. By explicitly I mean that it has an actual function inside, some approximation, that it can use to compute something. When we say a value-based agent, I mean that the agent internally has some approximate value function which it uses to judge which actions are better than others. There might not be an explicit policy in that case; in fact, when I say value-based, I mean that there is no explicit policy, and that we construct the policy from the value whenever we need it. Alternatively, and maybe this is the simplest example, an agent could be policy-based: it just has a representation of the policy, some mapping from states to actions, and never has an explicit notion of value. The term actor-critic is used when an agent has both an explicit policy and a value function. This depends a little on who you ask and which literature you read, because sometimes people take actor-critic to also imply a certain way of learning these things, but I will use it whenever you have an explicit representation of your policy and your value and you are learning both; then I will call it an actor-critic system for simplicity, where the policy is the actor and the value function is the critic. Separately, there is the distinction between model-free agents and model-based agents: each of the agents from the previous slide could also have a model, and when they do, we say it is a model-based agent. So you could have a model-based actor-critic agent, for instance, or a model-based value-based agent. These things are of course a little more gray than I am making them sound, because you could also have partial models, or things that you can interpret as a model; in fact, some people would say a value function is also a type of model. Sure, but when I say model here, I mean something that tries to explicitly model some part of the environment that is not the value and not the policy. That looks a little like this: there are three components, a value function, a policy and a model, and the overlap of value function and policy is called an actor-critic. Actor-critics can also be part of the lower circle, the model circle, so you could have an actor-critic with a model; but they can also be model-free, which is everything outside the model circle, so you could have a model-free actor-critic.
Or you could have a model-based actor-critic; you could have model-based value-based agents and model-based policy-based agents. You could also have just a model, and, as I said, you then still have to plan to get your policy, but in some cases that is the appropriate way to solve the problem. We will mostly cover the top part here, where often there is no model, but even when there is a model there will typically also be a policy and/or a value function.

So that is the high-level view, and now I will talk about a few of the challenges in reinforcement learning. I have touched on some of these already, but it is good to be explicit. There are two fundamentally different things we might do to solve a decision problem. One is learning: the environment is initially unknown, the agent interacts with the environment, and thereby somehow comes up with a better policy. You don't need to learn a model for that, as I said, and I will give examples of algorithms in this course that don't learn a model but still learn how to behave optimally. Separately, there is something called planning. Planning is a hugely overloaded term — it means many things to many people — but when I say planning in this course, I mean that a model of the environment is given or has been learned, and that the agent plans in this model, without external interaction. The difference is the sampling: in the planning phase you don't sample from the environment, you are just thinking. Sometimes people use words such as reasoning, pondering, thought, search or planning to refer to that same process. The fact that the model could be an approximate model is important, because in the problems we care about you typically don't have access to the full model of the environment. In some cases you do, and then there is a huge planning literature with very efficient and very good algorithms for problems where you have the true model. One thing to be aware of, though, is that these algorithms often assume the model is true, which means that if you plan with an approximate model, your planning algorithm might find a very peculiar policy that happens to walk through a wall somewhere, because the model mis-modelled that wall. You could make planning algorithms more robust to model errors, and this is an active area of research, but we won't have time to go into depth on it in this course.

A separate distinction that is often made, and whose terminology is very useful, is the distinction between prediction and control. This is not actually a dichotomy — both can be important at the same time — but the terms matter because we, and the literature, use them a lot. Prediction basically means evaluating the future: all the value functions we talked about are predictions of something, in this case of the return; a model is also a prediction, a prediction of the dynamics. Control means optimising the future. This difference also shows up in the definitions of the value functions: one value function was defined for a given policy, so that is a prediction problem — we have a policy and we just want to predict how good it is — and the other value function was defined as the optimal value function: over all policies, what would be the best thing to do.
That is the control problem: finding the optimal policy. We are mostly concerned with the control problem — we want to optimise things — but in order to do so it sometimes makes sense to predict things, which is not necessarily optimising. So it is good to keep in mind that sometimes we are optimising and sometimes we are just predicting. This also means that sometimes strictly supervised learning techniques are very useful within the RL context: sometimes you just want to predict certain things, and maybe you can use supervised learning and all the tricks you can leverage to do that efficiently, which can be very useful. The two are also strongly related: if you have very good predictions of returns, it is typically fairly easy to extract a good policy. You could do this in one shot: if you somehow managed to predict the value of every policy, you could just select the best policy, although in practice that is not feasible. There is an algorithm we will talk about later that iterates this idea: you have a policy, you predict the value of that policy, you use those values to pick a new policy, you predict the value of the new policy, and you repeat this over and over. This is called policy iteration, and we will get back to it later; it is an effective way to improve your policy over time by using predictions (a small sketch follows below).
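Here is a minimal sketch of that policy iteration loop on a tiny made-up MDP with a known model; the states, actions, rewards and transition probabilities below are assumptions for illustration only. It alternates evaluating the current policy with acting greedily with respect to the resulting values.

    # Tiny made-up MDP: (state, action) -> list of (probability, next_state, reward).
    GAMMA = 0.9
    STATES = ["s0", "s1"]
    ACTIONS = ["stay", "go"]
    P = {
        ("s0", "stay"): [(1.0, "s0", 0.0)],
        ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
        ("s1", "stay"): [(1.0, "s1", 2.0)],
        ("s1", "go"):   [(1.0, "s0", 0.0)],
    }

    def evaluate(policy, sweeps=200):
        """Iterative policy evaluation: v(s) <- sum of p * (r + gamma * v(s')) under the policy."""
        v = {s: 0.0 for s in STATES}
        for _ in range(sweeps):
            v = {s: sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, policy[s])]) for s in STATES}
        return v

    def greedy(v):
        """Policy improvement: pick the action with the highest one-step lookahead value."""
        return {s: max(ACTIONS, key=lambda a: sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)]))
                for s in STATES}

    policy = {s: "stay" for s in STATES}   # start from an arbitrary policy
    for _ in range(5):                     # a few rounds of evaluate-then-improve
        values = evaluate(policy)
        policy = greedy(values)
    print(policy, values)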
Here is another thought nugget, similar to ones we had before; it is a question for you to ponder, and I am not claiming I have the answer: if we could predict everything, do we need anything else? Is anything missing from a system that can predict everything in order to have, say, full AI?

Now, most of this lecture wasn't about how to learn these things, but most of the course will be, and for that it is important to note already that all of the components I have talked about are basically functions. Policies are functions from states to actions; value functions, as the name says, map states to values; models map states to states, or to distributions over states, or to rewards, or any subset or superset of those; and state updates, which we haven't talked about that much, are also functions: they create a new state from your previous state and observation. We talked about a version where this was given — there was an example where you augment your observation with some prior observations — but maybe you can also learn how to efficiently build your state. In practice this means we can represent these things, for instance, as neural networks, and then we can maybe use all the deep learning tricks to optimise them efficiently. If we have a good loss and a strong function class such as deep neural networks, maybe that is a useful combination, and indeed we often use the tools from deep learning, in what we nowadays call deep reinforcement learning, to find good, efficient approximations to many of these functions. One thing to take care about is that in reinforcement learning we will often violate assumptions made in typical supervised learning. For instance, the data will typically not be iid, meaning identically and independently distributed. Why not? Well, one reason is that your policy changes, and even just the fact that you are changing your policy means that the data changes, which already makes the problem non-stationary and non-iid. So that is a challenge for typical supervised learning techniques, and you may need to track, perhaps, instead of just trying to fit a fixed dataset. Non-stationarity can also come in in other ways: maybe not just your policy changes, but also your updates change, or the problem itself is non-stationary — for instance, there might be multiple learning agents in a single environment, which makes everything very non-stationary and very hard, but interesting. The takeaway is that deep reinforcement learning is a rich and active research field, even though the beginning of this course will mostly focus on reinforcement learning without talking too much about the connection to deep learning. I will occasionally make those connections whenever appropriate, and it is good to keep in mind that we might use many of these techniques, but you have to take care when applying them, because you might be violating assumptions that were made when the techniques were conceived. One more thing to keep in mind is that neural networks are not always the best tool, although they often work very well. A lot of work in reinforcement learning has been done in the past with tabular and linear functions, which are much easier to analyse, and that is already a pretty rich setting where you can do many things. These days a lot of people prefer deep networks because they are more flexible and tend to fit weird functions more easily, but it is not the only choice, and you could sometimes be better off with, say, a linear function, which might be more stable in some sense and easier to learn. Then again, that function class is more limited, less flexible, and maybe that hurts you, unless your features are sufficiently rich — but then you have to create those features somehow, which maybe you don't want to think about, or can't, because you don't know enough about the problem. It is just something to keep in mind.

So here is an example of how that looks for Atari. As I said, there was a system that learned these Atari games, and that system assumed the rules of the game were unknown: there was no known model of the environment, and the system learned by just playing the game, learning directly from the interaction. What that means is that the joystick basically defines the action — as I said, the agent isn't the avatar you see on the screen, but the thing that pushes the buttons on the joystick — and that goes into the simulator for these Atari games, which outputs the reward, in this case extracted as the difference in the score that you can also see on the screen, and the observations are just pixels. Actually, the observation was a quick stack of a few frames, because in these Atari games the screen sometimes flickers, so individual observations in between can be completely black; to avoid that being a problem, we keep a very short history of a few frames. This also helps in certain games: you might know the game of Pong, where you have two paddles and a ball goes from one to the other. If you have more than one frame, you can use that to judge which direction the ball is going, whereas with only one frame you cannot tell which direction the ball is moving; it would be partially observable.
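A minimal sketch of that frame-stacking idea; the stack size of four and the toy one-dimensional "frames" here are assumptions for illustration, not a claim about the exact preprocessing used in the original system. Keeping the last few frames as the observation lets an agent infer, for example, which way the ball is moving in Pong.

    from collections import deque

    class FrameStack:
        """Keeps the last `k` frames and returns them together as the observation."""
        def __init__(self, k=4):
            self.k = k
            self.frames = deque(maxlen=k)

        def reset(self, first_frame):
            self.frames.clear()
            for _ in range(self.k):          # pad with copies of the first frame
                self.frames.append(first_frame)
            return list(self.frames)

        def step(self, frame):
            self.frames.append(frame)        # the oldest frame drops out automatically
            return list(self.frames)

    # Toy usage: 1-D "frames" in which a ball (the 1) moves to the right over time.
    stack = FrameStack(k=4)
    obs = stack.reset([1, 0, 0, 0])
    for t in range(1, 4):
        frame = [0] * 4
        frame[t] = 1
        obs = stack.step(frame)
    print(obs)  # from the stacked frames the direction of motion is recoverable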
In Atari you could also plan, assuming the rules of the game are known. In that case you could query the model: in each state you could take all of the different actions, see what all of the next states are and what the reward along the way was, build a huge tree, and search within that tree. In the original Atari emulator that we used for a lot of experiments, the games were actually deterministic: if you are in a certain state and take a certain action, the same thing always happens next. In a later version of the emulator they added a little bit of noise by making the actions stochastic, so they last a little longer or shorter, just to break certain algorithms that heavily exploit the determinism of the environment, because eventually you want algorithms that can deal with situations that are not deterministic. Most of the work on these Atari games has used algorithms that work just as well when the environment is not deterministic, but there are certain things you can do when the environment is deterministic that you cannot do when it is stochastic.

Just briefly, before we wrap up, there is one other thing I wanted to mention, which will be the focus of the next lecture, so I will talk about it in much more depth there; it is also quite central to reinforcement learning. As I said, we are learning from interaction and we are actively searching for information. This is sometimes called the dilemma between exploration and exploitation. As you learn, you learn more and more about the problem you are trying to solve, you get a better and better policy, and it becomes more and more tempting to just follow whatever you currently think is best. But if you do, you basically stop getting new information about things that might still be out there. So what you want to do is sometimes pick actions that you have never taken before. This is because you don't automatically get all the data — you actively have to search for it — and there might, for instance, be a treasure chest around the corner, and if you never go there, you will never know. So you want to make sure that you eventually, sometimes, go to places you have never seen before; that is called exploration. But you also don't want to just jitter all the time, doing random things, because that will hurt your performance and your rewards. When you do something that you currently think is good, that is called exploitation, and balancing these two is actually quite tricky in general; the next lecture will discuss many methods for doing so. The goal is to discover a good policy from new experiences without sacrificing too much reward along the way: the new-experiences part is the exploration, and the not-sacrificing-reward part is the exploitation. Also think of an agent that needs to walk, say, a tightrope to get across a ravine. In that case you might want to exploit a policy that can already walk across the tightrope, and only start exploring once you are on the other side. This shows that in some cases it is very good to exploit for a while just to get to the situations where you can then effectively explore; these things are very intertwined. But I will talk much more about that.
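One simple and common way to balance the two, shown here only as an illustrative sketch — the epsilon of 0.1 and the action values are made up, and the next lecture covers such methods properly — is epsilon-greedy action selection: mostly take the action that currently looks best, but occasionally pick a random one.

    import random

    def epsilon_greedy(action_values, epsilon, rng):
        """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
        if rng.random() < epsilon:
            return rng.choice(list(action_values))            # exploration
        return max(action_values, key=action_values.get)      # exploitation

    rng = random.Random(0)
    q = {"left": 0.2, "right": 1.1, "forward": 0.7}  # current (made-up) action-value estimates
    choices = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
    print({a: choices.count(a) for a in q})  # mostly "right", with occasional random picks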
Summarising what I just said: exploration finds more information, and exploitation uses the information you have right now to maximise reward as well as you currently can. It is important to do both, and it is a fundamental problem that does not naturally occur in supervised learning. In fact, we can already look at this without considering sequentiality and without considering states, and that is what we will do in the next lecture. Simple examples: if you want to find a good restaurant, you could go to your current favourite — you will fairly reliably get something good — or you could explore and try something new, which might be better than anything you have ever had, or might not; so exploration is a little bit risky. Another example is oil drilling: you might drill where you know the oil is, but maybe it becomes more and more costly to extract, so sometimes you want to try somewhere completely new. In game playing, you want to try new moves every once in a while. Essentially, there are examples of this in any decision problem you can think of.

Finally, before we wrap up, I wanted to go through one more example, a little more complex than the maze example from before, to make these things a bit clearer. This is a very simple grid; the agent walks around and gets a reward of minus one when it bumps into a wall. We can ask a predictive question: if you just behave randomly, if you move around this grid uniformly at random, what is the value function, the expected return conditioned on that policy? There are two special transitions: whenever you are in state A, you transition to state A' and get a reward of ten, the highest reward in this problem; whenever you are in state B, you get a reward of five and go to B'. It might not be immediately obvious which of these is preferred, because one has a lower reward but puts you less far away, so it might be easier to repeat it often, whereas the other gives a higher reward but it is a longer jump, so it takes longer to get back. To even talk about which is preferred, we need the discount factor, which trades off immediate rewards against later rewards; in this case it was set to 0.9, which is a somewhat arbitrary choice. That means the value function is now conditional both on the uniformly random policy and on the discount factor we picked, which together with the reward defines what the goal is: the goal here is not just to find high reward, but to do so reasonably quickly, because future rewards are discounted. Under (b) the state-value function for the uniform random policy is given, and what we see is that the most preferred state you can be in is state A, because you always reliably get a reward of ten and then transition to A', which has a negative value. The reason A' has a negative value is that your policy is random and will bump into the walls occasionally, so you get some negative rewards, and because that state is fairly close to the edge, you bump into the walls more often than if you were further away from the edge.
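Here is a hedged sketch of that predictive question. The 5x5 layout and the positions of A, B and their target states follow the classic Sutton and Barto gridworld, which I am assuming matches the slide; the sketch runs iterative policy evaluation for the uniformly random policy with discount 0.9, where bumping into the edge gives reward minus one.

    # Iterative policy evaluation for the uniformly random policy on a 5x5 gridworld.
    GAMMA = 0.9
    N = 5
    A, A_PRIME = (0, 1), (4, 1)   # assumed positions
    B, B_PRIME = (0, 3), (2, 3)   # assumed positions
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east

    def step(state, move):
        """Model of the grid: special jumps for A and B, reward -1 for hitting the edge."""
        if state == A:
            return A_PRIME, 10.0
        if state == B:
            return B_PRIME, 5.0
        r, c = state[0] + move[0], state[1] + move[1]
        if 0 <= r < N and 0 <= c < N:
            return (r, c), 0.0
        return state, -1.0                      # bumped into the wall: stay put, reward -1

    v = {(r, c): 0.0 for r in range(N) for c in range(N)}
    for _ in range(500):                        # sweep until approximately converged
        v = {s: sum(0.25 * (reward + GAMMA * v[s2])
                    for move in MOVES
                    for s2, reward in [step(s, move)])
             for s in v}

    for r in range(N):
        print(" ".join(f"{v[(r, c)]:5.1f}" for c in range(N)))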
Note, by the way, that the value of state B is higher than 5, even though you get a reward of 5 whenever you go from B to B': because the value of B' is positive, the value of being in B is higher than just the immediate reward. The value of A, on the other hand, is lower than 10, because the value of the state it transitions to is negative. We could also ask what the optimal value function is: if we could pick the policy any way we wanted, what would that policy be, and what would its value be? If you first look at the right-hand side, you see that in states A and B all actions are optimal. This is because we have defined them to be all equal: whatever action you take in state A, you jump to A', so it doesn't matter which one you select and we don't care which one you take. We can also see there is a lot of structure in the policy, so if you were to do some function approximation, you would probably be able to generalise quite well, because the policy is quite similar in many similar, nearby states. This is a very simple problem in which you probably don't need much function approximation, but think of a much bigger problem: say you are a robot in a corridor and your optimal action right now is to move forward; at the next step your observation is probably very similar, and because of generalisation you will just continue going forward. The optimal value function is now strictly positive everywhere, for the simple reason that the optimal policy can choose never to bump into a wall, so there are no negative rewards: it avoids them altogether and just goes and collects the positive rewards. Notice as well that the value of state A is now much higher than 10, because it can get the immediate reward of 10, and then a couple of steps later get a reward of 10 again, and so on. These are discounted, so the value does not grow to infinity, but you do get repeated visits. Also, state A is preferred to state B, which is a function of both the rewards along the way and the discount factor; you could trade these things off differently.

I have a video to show at the very end, but before I do, I just wanted to give you a high-level overview of what the course will entail. We will discuss how to learn by interaction as the main thing, and the focus is on understanding the core principles and learning algorithms. At some points during the course I will give nuggets of practical or empirical insight whenever I have them, and at the end of the course we will have guest lectures by Vlad and Dave, who will talk about their work, which also includes some of these nuggets. On the whole we will mostly be talking at a fairly conceptual level, but it is not that far removed from practice, and I will point out whenever I can how to make these things real and how to actually make them work. There will also be assignments, as you know, which will allow you to try these things out. The topics include, next lecture, exploration in what are called bandit problems. The name comes from the one-armed bandit, a slot machine where you have this one action and get a random return each time you try it. This has been generalised in the literature to the mathematical framework called the multi-armed bandit problem, where you can think of there being multiple slot machines, each of which gives a random reward, and your job is to decide which one is best.
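As a preview of that setting, here is a minimal sketch — three made-up arms with Gaussian rewards, not an algorithm from this lecture — of a multi-armed bandit with sample-average value estimates and occasional random exploration.

    import random

    # Three made-up slot machines ("arms") with mean payouts unknown to the agent.
    TRUE_MEANS = [0.2, 0.5, 0.8]
    rng = random.Random(0)

    def pull(arm):
        return rng.gauss(TRUE_MEANS[arm], 1.0)    # random reward from the chosen arm

    estimates = [0.0] * len(TRUE_MEANS)           # sample-average value estimate per arm
    counts = [0] * len(TRUE_MEANS)

    for t in range(2000):
        if rng.random() < 0.1:                    # explore occasionally
            arm = rng.randrange(len(TRUE_MEANS))
        else:                                     # otherwise exploit the current estimates
            arm = max(range(len(TRUE_MEANS)), key=lambda a: estimates[a])
        reward = pull(arm)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]   # incremental average

    print("estimates:", [round(e, 2) for e in estimates])
    print("pulls per arm:", counts)               # most pulls should go to the best arm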
There is no state: it is always the same slot machines, nothing changes, and there is no sequentiality in the problem, so the only problem is one of exploration and exploitation — how to trade these off and how to learn the values of the actions, which is fairly simple in that case. Later on we will talk more about Markov decision processes, which I touched upon a little today, and about how to plan in them with dynamic programming and the like. We will then move towards model-free prediction and control, where we no longer assume we have the model and have to sample instead. There will be something called policy gradient methods, a family of algorithms that allow you to learn the policy directly, and we will talk about challenges in deep reinforcement learning: how to set up a complete agent, how to combine these things, and how to integrate learning and planning. Are there any questions before we wrap up?

Yeah — I don't know; it is on Moodle somewhere. I used to know, but I don't want to commit to a date and get it wrong right now. Other questions, admin or topic related? The assignment will be out — right, I thought the question was when it would be due, not when it would be released. OK, so if Moodle says it starts this week, it probably should be out; I will have to check where it is, but thank you for flagging it, because that is important. If that schedule is correct, we will need to make sure it gets out as quickly as possible, and if it was due to be out at the beginning of this week, we will also have to check whether the due date is still correct — maybe it has to be postponed — but I will need to check the schedule and check with the people who should have released the assignment. Thanks, that is very important. Other questions? The link isn't working? Yes, that sometimes happens; I may have got the link slightly wrong, and in my experience the site doesn't always work either. But if you just Google for Sutton and Barto 2018 you should be able to find the book — add "reinforcement learning" if you want to be very sure — and then you should be able to find it. Yes, I will make sure the slides are kept updated. What is on Moodle right now are basically the slides from last year, and we will try to update them as soon as possible. When slides do change — some will stay the same — we will try to update them beforehand; that didn't work this time, but I will try to get them in as soon as possible. Beware that if you look at the slides for future lectures now, the material might change slightly — not greatly, but slightly. I will do my best on that.

So I wanted to end with this, and I will explain what you are looking at, because it is kind of cool. This is a learning system: there is something here that is learning, namely something learning to control the joints of this, if you want to call it that, virtual robot in simulation. What is interesting is that otherwise very little information was given to the system: essentially the reward function here is "go forward", and based on the body of the agent and the environment, the agent has learned to go forward.
But it has also learned to do so in interesting ways. Specifically, note that nobody put in any information on how to move or how to walk; nothing was pre-coded in terms of how to move the joints, which means you can also apply this to different bodies: same learning algorithm, different body, and it still learns to locomote. You could put it in different environments — you could make it walk on a plane rather than along a line, essentially — and it can then choose to either crawl over things or sometimes walk past them. Again, all of this comes from one simple goal, the reward to go forward. There is a general principle here: when you do code up a reinforcement learning system and have to define the reward function, it is typically good to define exactly what you want, because, as you can tell, you sometimes get slightly unexpected and not quite optimal solutions. So why — does anybody know why this agent was making these weird movements? It might be balance — yes, that is a very good one. Part of your agent state might be your previous action, encoded in your observation, so you can use your actions to give yourself a kind of memory in certain situations; that is a very interesting one. Another thing is that, as I mentioned, the reward here is just to go forward. For us that is typically not the whole story: typically we want to go somewhere, but we also want to minimise energy — we don't want to get too tired — and if you don't have that constraint, you can get these spurious movements, which might help for balance, might help for memory, but might also just be there because they don't hurt. That is something that occurs fairly generally in reinforcement learning: when you model the problem, be sure to put into the reward function what you actually care about, because otherwise the system will optimise what you give it, what you asked of it, which might not be what you want. In this case it is fine, because we didn't actually care about this, and it might even be helpful — I don't actually know; it might help with balance — but in other cases it is quite tempting to put things into your reward along the lines of "if you want to achieve that, maybe you should first do this", and that is a little bit dangerous, because in some cases the system will then optimise the thing you only wanted to be a subgoal along the way rather than the true thing you care about. Yes — that is a very good question: why was it running rather than crawling? There are two reasons. One is that the reward is essentially to go forward as quickly as possible, and the other is the body: with a different body, crawling, or rolling, might actually be the more efficient option. There are cool videos online of similar systems where people have done similar things, and there is some older work where people used evolutionary methods on all sorts of weird bodies to see what locomotion emerges, and it turns out you find very cute and weird ways of locomoting. OK, I think that is all the time we have; thank you all for coming.
Info
Channel: DeepMind
Views: 127,642
Rating: 4.9182391 out of 5
Id: ISk80iLhdfU
Length: 103min 17sec (6197 seconds)
Published: Fri Nov 23 2018