Hado van Hasselt: Reinforcement Learning I

Captions
Okay, so we are talking about reinforcement learning, and let's start with the obvious question: what is reinforcement learning? Before I dive in, let me just say that, from what I heard about the intent of this talk, the idea is to make sure we bring aboard people who aren't very familiar with the topics being discussed. That means I'll start with some of the core concepts of reinforcement learning in a moment and go through everything basically from the ground up. That said, we don't have a lot of time, we just have one afternoon, so at some point I will go fairly quickly through some of the material. So please stop me whenever you feel confused about anything, because if you do, you're probably not the only one, and you can help each other out quite a bit by just asking a question at that point. Later on, maybe we'll split down the middle, and in the second half I'll talk about some more cutting-edge current research in reinforcement learning, in what is sometimes these days called deep reinforcement learning. If you already know reinforcement learning, the beginning may be quite familiar to you, but it is still useful to know what my notation is and how I think about things. Again, stop me if you think anything is unclear or if you disagree with anything.

Okay, so let's dive in, but first let's pop up to the motivation, at a very high level. This is one way to think about how this fits into the larger context of artificial intelligence research. We could start by saying that we first automated repeated physical solutions in the Industrial Revolution: at some point we basically started building machines to take over tasks that we had been doing by hand. Of course this is actually older than the Industrial Revolution; throughout history we have found mechanisms and tools to make tasks easier for us. More recently there was a second revolution, you could say, where, using computers, we were able to automate repeated mental solutions. For instance, think of a calculator: we know how to do addition, we know how to do multiplication, and we know them precisely enough to be able to tell a machine how to do those operations, and then the machine can do them for us, maybe more precisely and maybe a lot faster than we can. Now I'd argue that we are basically at the start, or in the middle, of what could be yet another revolution, where we no longer think of the solutions ourselves beforehand, but instead build machines that can themselves find solutions. In a sense, this is the whole purpose of artificial intelligence: we want systems that we could then call artificially intelligent. Now, many people have many definitions of what intelligence means, but to me it means that these systems can find their own solutions to a well-posed problem. This requires that they can learn autonomously how to make decisions, and I'll get back to that word, decisions, specifically.

First, what does it mean, generally, to find solutions to problems? Earlier approaches to artificial intelligence involved encoding rules and knowledge by hand and then having the machine reason through these. Again, a calculator is actually a good example of this, where the rules are very well defined and there is some
knowledge involved, maybe, in defining these rules in the first place. You type in the knowledge, say you want to add two plus three, and then the machine can use its rules to deduce an answer to that question. Typically the rules were fairly general, like modus ponens: if A then B. You can model that precisely, then you can put in as knowledge which A's are true, and you can deduce which B's follow from that. Of course this is a very trivial rule, but you can build up quite complex systems by chaining these types of rules together.

However, encoding knowledge by hand has some caveats, some pitfalls. For instance, slight errors in the encoding can lead to unforeseen conclusions, because you are not in control of the complete reasoning: you are in control of setting up the reasoning, but you are not in the loop at each point of the reasoning. Typically this means that if you have a slight error in your encoding of the knowledge, there could be conclusions that come out of it that are basically false, and if you don't follow all of the steps of the reasoning, you would find it very hard to tell. This is related to the other two points up there, which is that you can err in two different ways when you set up a system like this. One is that you go for too low a level of detail, and then often the results are quite brittle and likely wrong, in the sense that they are too general to be useful, even though it might be easy to reason through such a system. This is why the other caveat is important: if you go for a very high level of detail, then even if all your encoded knowledge is correct, it might be very hard to compute the conclusions; it might just be very hard to go through all the required reasoning steps to come up with them.

Now, what is an alternative? Machine learning, in general, in my definition, refers to the research field that studies algorithms that do not define these rules and knowledge at that level, but instead define learning updates that can then be used to extract rules and knowledge from data, and this is the important bit. So basically the difference between these two approaches (of course I'm talking about them quite prototypically, as if they're completely separate, for simplicity, which they don't need to be) is that in one you set up a system and put all the knowledge in beforehand, and the other is a system that is situated: it can consume new information, new data, during its lifetime, if you want to call it that, and can then build up new knowledge from the data. Now you may see a large overlap with classical statistics here, where conclusions also come from the data, and indeed these two fields, statistics and machine learning, are very tightly related. In classical statistics we also analyse the data algorithmically, with some well-defined statistical methods, and then we can base the decisions we make on the outcome of this. However, typically we still choose ourselves which analysis to apply beforehand, and we decide what we do with the outcome of these analyses, so it is not fully self-contained, not fully autonomous in a sense. Essentially, we want to go further and ask whether we can learn to make decisions automatically from data as well, and that is basically what reinforcement learning is. As it says on the slide, we, and other intelligent beings, I posit,
learn by interacting with our environment and then consuming these interactions. This differs from certain other types of learning. For instance, it is active: you might actively seek out new information that can help you learn. But you also have to take into account that interactions are sequential, and that future interactions can depend on earlier ones: if you go down a certain path, you might not see certain things that would maybe have been useful to know about. So there is a tight coupling now between the learning process, the analysis (if you want to call it that, in more statistical terminology), and the decisions that the system makes. Typically we are also goal-directed; I'll get back to why this is important. And we can learn without examples of optimal behaviour. This one is maybe a little bit controversial, so let me say something very briefly about that. In some sense, one view of reinforcement learning is that you are a system that has a certain input-output mapping, and the inputs are just some raw sensory stimuli that you get: you get your vision, you get your hearing, at least like we do; you could of course set up an artificial system with a whole bunch of other sensors. And then, even if there is a similar system out there that already has good behaviour, you would have to either encode that knowledge, or the system would basically have to learn what to do just by looking at the other system. So this is not just mimicking, or at least it is not trivial to mimic, and typically we would not want to learn only from mimicking other smart entities; we also have to be able to learn in a different way.

So the general interaction loop of reinforcement learning is this one, where there is an agent (we will talk about the internals of that agent, how it can learn, in a moment) and there is an environment. The interaction is quite simple: the agent executes an action and sends that to the environment, and this action might have a certain consequence; then the environment, in a sense, sends back an observation, or, phrased differently, the agent pulls in a new observation, and this new observation could lead to a different next action, and so on. This continues over time and can continue indefinitely. I'll get back to the diagram again later, but first I want to touch upon two distinct reasons you might want to learn. This is good for putting other research into context, because people are not always very explicit about this distinction, or about which one they are interested in, but it is good to realise that there is a difference. The first distinct reason to learn would be to solve something: the idea here is to extract new, better solutions. When I say solve, I mean it in a somewhat soft sense; I don't necessarily mean finding the only optimal solution to a certain problem, I also mean finding good solutions, if there is a certain ranking among the solutions, at least better ones than you initially start with. An example of this would be to find a program that can play the board game of Go better than any human: the goal here is to find a certain solution. The other goal which might be of interest is to be able to adapt, and here it is important that you can find good solutions online, during the interaction. An example here would be a robot that navigates terrains that differ from any expected terrain, or any terrain that it was previously trained on.
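To make the interaction loop above a little more concrete, here is a minimal sketch in Python. Everything in it (the toy environment, the random agent, the method names) is made up purely for illustration and is not from any particular library; the point is only the shape of the loop: the agent selects an action, the environment responds with an observation and a reward, and the agent gets to update its internals.

```python
import random

class ToyEnvironment:
    """A trivial environment: reward is 1 if the action matches a hidden target action."""
    def __init__(self, target=2, num_actions=4):
        self.target = target
        self.num_actions = num_actions

    def step(self, action):
        reward = 1.0 if action == self.target else 0.0
        observation = 0  # a single, uninformative observation (one "state")
        return observation, reward

class RandomAgent:
    """An agent whose policy ignores the observation and acts uniformly at random."""
    def __init__(self, num_actions):
        self.num_actions = num_actions

    def select_action(self, observation):
        return random.randrange(self.num_actions)

    def update(self, action, reward, observation):
        pass  # a learning agent would update its internal state and predictions here

env = ToyEnvironment()
agent = RandomAgent(env.num_actions)
observation, total_reward = 0, 0.0
for t in range(100):
    action = agent.select_action(observation)   # A_t
    observation, reward = env.step(action)      # O_{t+1}, R_{t+1}
    agent.update(action, reward, observation)
    total_reward += reward
print(total_reward)  # a random agent collects roughly 25 reward on average here
```

Everything else in this talk is, in a sense, about what a smarter agent could do inside select_action and update.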
The main distinction here is whether it is enough to learn offline, maybe from some simulator or some pre-recorded data. In the first case, when you just want to find a solution, this might be sufficient. But if you want to learn online, you need algorithms that can work well within that regime, that can, for instance, learn fast if something important happens. If the situation is novel, you might want to be able to adapt quickly, because otherwise you might take the wrong decision, and if it is an irreversible path, you might take a wrong decision that you can never recover from. Then learning might cease, because, say, your robot is now stuck in a certain room that it can never exit again with its current wheels, or whatever you gave your robot. Reinforcement learning, as a general field of research, seeks to find solutions for both of these cases. I also wanted to note that the second point is not just about being able to generalise: it is not just about learning from your initial data set and then generalising to many different cases, it is also about efficient learning online, as I said, during operation.

Okay, so back to the question of what reinforcement learning is. I like to define reinforcement learning as the science of learning how to make decisions from interaction. It is also good to keep in mind that I think of reinforcement learning more as the framework than as a specific set of algorithms; this sometimes gets conflated, also in the academic literature on the topic. There is such a thing as reinforcement learning algorithms, but reinforcement learning as a whole could be viewed as both the algorithms and the framework around them, and the framework is very general: in order to reason about making decisions and interacting with the world, we need to reason about time, about consequences of actions, about actively gathering experience, about predicting the future, and about dealing with uncertainty. Now this is tricky, because it is a very general problem, but it also has huge potential scope. So I'm just positing, as a slightly provocative question to think about: is this enough as a framework to capture the goal of artificial intelligence? This is not saying that we're there yet; it is not saying that all current reinforcement learning algorithms solve artificial intelligence. It is merely asking whether the framework is sufficient. I'm not going to give an answer to that; it is up to you to decide whether you agree or not.

So how does reinforcement learning differ from other machine learning paradigms? Machine learning has two other big strands of research, one could say. One is unsupervised learning, but the larger strand is supervised learning, which assumes that you have certain labels that your system can consume: you basically have a mapping from certain inputs to outputs, and the system can see many examples of this and then learn to mimic that mapping, maybe generalising appropriately to new inputs. In reinforcement learning we do not assume that a mapping exists that gives you the exact optimal behaviour. Instead we are going to give you something else: a reward signal that tells the agent on each step how much I liked that action you just did. This is not quite the same as having a supervision signal, because for one we are not telling you that this was the best thing you could do, and you might not know what the best was: your
rewards might essentially be unbounded; you see, say, a +10, but you don't know whether you could have gotten a +20 in the same situation. But also, because we are talking about sequential interactions with the environment, the feedback can be delayed and not instantaneous, which means that a reward you see right now might actually depend on an action you did much earlier, because that earlier action got you into a situation in which you could now do something that gives you the reward. This means that time matters: there is a sequence to these interactions. This is also different from many earlier machine learning algorithms, where often we assume that we can sample from some data set independently, or, if we have an actual data set, we might just scramble its ordering because the ordering doesn't actually matter; in fact it might work better if you scramble it. Here this is not the case: earlier decisions typically do affect later interactions, which means you can't just throw away the time ordering when you reason about these. Essentially these last two points mean that we have to respect causality: an action can only affect things that happen later, not things that happened before.

So what are examples of decision problems? Here are a couple: flying a helicopter, managing an investment portfolio, controlling a power station, making a robot walk, or playing video or board games. This is a somewhat arbitrary list of examples, but these are all examples that reinforcement learning has been applied to, which is why they're up there. There are many more, but this is a nice, diverse set, I found. And, as I said, if you think about reinforcement learning as a framework rather than a single solution method, these are all reinforcement learning problems, no matter which solution method you use to find solutions to these decision problems. Indeed, in some of these cases the more popular approaches wouldn't typically be called reinforcement learning algorithms, because maybe they fall under some other terminological heading.

So here is a video, just to make it a little bit more concrete. This is an agent that has learned to play a bunch of Atari games. These are video games from the 80s, and they are fairly simple compared to current-day video games, but they are still complex enough that the dynamics are not immediately obvious. In this case there is a submarine that you are controlling, and apparently you have to shoot the fish, but every so often your oxygen almost runs out, which you can see at the bottom, and then you have to go up for new air; there are also divers that it has to collect, I believe. So there are long-term dependencies, there are short-term dependencies, and there are reactive parts to this decision process. This one is a more reactive game, in which the goal is to race against other cars (those are meant to be cars), and here you can see that the situation changes: I believe the course is probably more slippery when it is white, because it is icy, which means that maybe the interaction changes. This is another example. Most of these games are actually fairly reactive, in the sense that if you have a good reactive policy you can do quite well, but this doesn't hold for all of them; some actually require you to reason through multiple time steps in order to do the right thing. But as you see, the agent has learned to do fairly well, and it also learns, for instance, to hide
behind objects. This is the classic Space Invaders game. You can't actually see it on the screen right now, but here at the top you can see the score, and in this case the reward signal for the agent is, again, the score at the top. This is a boxing game, and it does fairly well. In all of these situations the reward signal was basically defined as the difference in score from one step to the next. This is the classic Breakout game, where the goal is to get rid of all these blocks by moving the paddle around and hitting the ball (the little pixel is the ball) against the blocks, and this agent does that fairly well, and actually learns to tunnel so it can get lots of score really quickly. So that was an example of a somewhat longer-term decision policy, where at some point it starts hitting the blocks on the side with, you could say, the intent (if you want to call it that) of getting the ball behind the other blocks so it can get many points fairly quickly. The reason it wants the points quickly is essentially that we have defined its objective (I'll say this more precisely later) in such a way that the agent actually prefers earlier rewards a little bit over later rewards, even if the rewards themselves are the same; otherwise it wouldn't care, and it might just very slowly chip away at everything, without caring about when the rewards come.

So the core high-level concepts of a reinforcement learning system are essentially that we have an environment and an agent, and the agent may contain (and I'll get back to all of these): an agent state, which is basically the internal state of the agent at any moment in time; a policy, which defines the behaviour of the agent; probably a value function, although this is not strictly necessary (the minimal thing you would want to have in an agent is a policy, and then you could already call it an agent; it might not be a very smart one, but actually there are good policy-learning methods as well, so it could be a fairly smart one; I'll talk more about what value functions are); and, optionally, you could also build or have a model. I personally am mostly interested in algorithms that learn from samples, which means that I won't assume that a true model is given. There is also an active area of research, typically called planning or search, where you are given a certain model and you want to find a solution as quickly as possible, or with as good performance as possible, maybe not optimal if that is too hard, if the search problem is too large. I personally am more interested in learning methods, so if you want to apply something like search or planning, you would then have to build a model, which means that the agent might internally have a model, but it might be one that it has learned from data. And then, as I mentioned, there is a reward signal. It could be internal to the agent (you could have different agents that have different reward signals), or, and this is basically more typically what is done these days, it could be our specification of what we want the agent to achieve. For instance, in these Atari games the rewards could be considered more or less external to the agent, because we, as the designers of the agent, tell it: whenever your score changes, that is your reward; this is what you are supposed to optimise. This is why I didn't put the reward in the figure. There is a variant of this figure that is often used where the
environment, in addition to an observation, also emits a reward signal. I basically wanted to not commit to that, because the reward could be inside the agent; alternatively, you could also just view the reward as being part of the observation, if you still want it to come from the environment. So that means that at each time step, which we will typically denote with t, the agent observes an observation O_t (a capital O, because it is a random variable) and a reward, and then it executes an action; the environment receives that action and emits a new observation. Or, as I said, you could view that as a pull action by the agent, which pulls in a new observation.

I'm essentially introducing a little bit of notation on the previous slide and on this one. The policy that I talked about, which defines the behaviour of the agent, is typically denoted by pi. It could be deterministic, in which case it is just a function that consumes a state (I'll talk about states in a moment, but you can think of it as the observation for now, for concreteness) and then outputs an action. It could also be stochastic; in some cases it is quite useful to have stochastic policies, both for learning and because sometimes the optimal policy actually is stochastic, and in that case we might write that the action is sampled from the policy at that state. Examples of policies could be: when the temperature drops below 15 degrees, you turn on the heat. So the temperature itself might be the state, and turning on the heat might be the action. As a different example, say you have a robot: it might have a policy that says, when there is a wall in front of me, I turn right, where again the observation that there is a wall in front is the state, and turning right is the action.

The reward is just a scalar feedback signal. Some people write this as a function of the state; I prefer to just subscript it with the time t, because then it doesn't have to be a function of the state. This is because in a moment I'll talk about the internal state of the agent and the state of the environment, which might be different, and if the reward comes from the environment, you might not be able to construct the reward from the internal state of the agent. But you can just think of it as a scalar signal that arrives at each time step, indicating how well the agent is doing, and it defines the goal. The actual goal of the agent is not to maximise its immediate reward; instead, it is set to maximise its cumulative reward over time. We denote this with a capital G; if you want to make it easier to remember, you can think of this as the goal, hence the G, but we call it the return, for historical reasons. So there are these two concepts: the reward is the immediate reward on each time step, and the return is the cumulative reward over time, into the future. Both of these are random. Now, reinforcement learning is based on something called the reward hypothesis, which states that any goal can be formalised as the outcome of maximising a cumulative reward. I put a question below, whether you agree with that, and it is good to think about this a little bit; I found it hard to come up with counterexamples myself. So if you have budget constraints, these could just be part of the state, and therefore the return from any state might still be well defined under those budget constraints.
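Going back briefly to the policy notation above, here is a minimal sketch of the difference between a deterministic and a stochastic policy, using the thermostat example; the threshold and the probabilities are made up purely for illustration.

```python
import random

def deterministic_policy(state):
    """pi(s): maps the state (here: the temperature) directly to an action."""
    temperature = state
    return "heat_on" if temperature < 15.0 else "heat_off"

def stochastic_policy(state):
    """A ~ pi(.|s): samples an action from a state-dependent distribution."""
    temperature = state
    p_heat = 0.9 if temperature < 15.0 else 0.1   # probability of turning the heating on
    return "heat_on" if random.random() < p_heat else "heat_off"

print(deterministic_policy(12.0))  # always 'heat_on'
print(stochastic_policy(12.0))     # usually 'heat_on', occasionally 'heat_off'
```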
You could also, depending on how you want to set it up, arrive at a different solution. Let's say you want to maximise rewards that are money (you mentioned investing as a potential application), but you do not want to go bankrupt, so maybe there is a constraint, in a sense. Can this still be phrased as a reward signal? I'm going to say yes, but maybe then you shouldn't pick the rewards to just be money; you should also give a large negative reward whenever you go bankrupt, at which point you can reason about the combination of those things. So it might not always be trivial to pick a reward function that encodes exactly what you want to achieve, because of these constraints that you might want to encode, but that is not what the hypothesis says: it doesn't say it is easy, it just says that there exists a reward that matches any goal you might want to give it.

Yes, so if the goal is impossible, can you then still encode it properly? In a sense it might be meaningless, and indeed it is quite possible to give a reward signal that you cannot actually meaningfully optimise, at which point the agent might just learn to do random things, because it doesn't really matter what it does. However, what we will typically do in reinforcement learning is define an optimal policy with respect to what the agent can do, and so if something is unachievable, it is basically irrelevant: it is not within the space of the problem that you are setting out to solve. But again, when you specify the reward and you don't think it through, you might think that a certain thing is possible, and then it turns out the agent never learns it, but actually that is because it was impossible. That could happen. These are great questions, thanks.

Yes, a goal that changes with the rewards, so one that depends on the reward that you have already accrued? This could very well be, but again, you could then define a new reward function that takes this into account, or alternatively you could think of the rewards that you have already accrued as being part of the state. Let me be more concrete. When modelling how humans, for instance, interact with money, there is this notion of an almost logarithmic curve from empirical data: when we have more money we care less about further increments, in some sense, which means we don't actually reason linearly about money, we reason maybe logarithmically about money. This could also be covered with this formulation, but then maybe what you want to optimise is not the raw money but some transformation thereof; with a suitable transformation you should be able to still capture this interaction between the money you have already accrued and the money you are still going to accrue. So again, the point is not that any reward definition will give you the right solution; it is more that there must exist a reward function that gives you the solution that you are intending to achieve, and it might be a non-trivial one. So maybe in that sense this hypothesis is not always that useful.
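As a tiny illustration of the point about folding constraints into the reward (the bankruptcy example above), here is one way such a reward signal might be defined; the penalty size and the numbers are made up for illustration, and this is just one possible choice, not the "right" one.

```python
def reward(previous_wealth, current_wealth, went_bankrupt):
    """Illustrative reward: change in wealth, plus a large penalty for going bankrupt.

    The reward hypothesis only claims that some reward like this exists for the goal
    you have in mind, not that raw money (or this particular penalty) is the right choice.
    """
    r = current_wealth - previous_wealth    # immediate change in money
    if went_bankrupt:
        r -= 1_000_000.0                    # large negative reward for violating the constraint
    return r

print(reward(100.0, 120.0, False))  # 20.0
print(reward(100.0, 0.0, True))     # -1000100.0
```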
Yes, so that is a good question about when time goes to infinity, when your future goes to infinity, and you touched on something there which is quite interesting. What I'll actually say on the next slide is that we often trade off near-term rewards against long-term rewards by using a discount factor; let me just skip to that slide. There is a discount factor gamma, which is between 0 and 1, and it basically just weighs each of the rewards down a little bit as they go further out. There are a couple of reasons for this. One is that it is sometimes simply convenient for defining your goals. Another is that even if your true goal is the undiscounted return into the future, the algorithms might find it much easier to reason about something that is a little bit more myopic; this is more of a practical concern. But in many problems we actually do care about trade-offs like this: as I said, in that Atari game we just saw, the agent actually cares about getting score quickly rather than late, because of a discount factor; the agent that learned that had a discount factor. Now, there is an alternative formulation that I'm not going to talk about further today, which is the average reward formulation, where you basically talk about the average reward you can accrue on each step. If you talk about the average reward, you don't necessarily need this discount anymore. Because if the rewards are all positive and you don't have a discount, the value could just be infinite, and it could be infinite for many policies, which makes it very hard to distinguish one policy having a larger infinite value than a different one. You could try to find mechanisms around that, and this might well be possible, but in those cases, if you are actually interested in the undiscounted future, it might be easier to reason about the average reward than about the future discounted cumulative reward. Fortunately, many of the algorithms and much of the theory still apply in both of these cases, so this is something we can interject later, and because of the more common use of discounted future returns, I'm going to focus on those today.

Okay, so I have already covered most of this slide; let me just give a couple of examples. If your discount factor is zero, this means we only care about immediate rewards. This is actually quite a common setup if there are also no states, or rather if there is only one single state, and we have a bunch of different actions and we only care about the immediate reward: this is called a multi-armed bandit problem. It is a problem of interest because there are lots of applications even for this much simpler reinforcement learning problem, with only a single state and no time dependencies in some sense. For instance, this is used often (there are many papers about it) in advertising, where you could think of a certain website where you have a certain ad you want to show, and you want to optimise the number of clicks: the number of clicks could be your reward, and then you just show an ad, you see if that works, you show a different one. The main problem then becomes one of exploration, where you want to try new things occasionally, you want to try maybe many of the possible actions that you have, but you don't want to try actions that don't give you a lot of return too often. This is called the exploration-exploitation trade-off, and interestingly, you can already reason about that even if the reward is immediate, but it becomes maybe an even bigger concern when you have a sequential problem. I'll talk more about that later, of course. Last point on this slide: the discounts can vary over time. Most of the classic papers have a fixed discount, which is given as part of the problem formulation, but it actually turns out to be quite useful to have the flexibility to allow this discount to change over time.
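To make the bandit setting and the exploration-exploitation trade-off above a bit more concrete, here is a minimal sketch of one simple strategy, epsilon-greedy action selection with sample-average reward estimates. The click probabilities are made up, and this is just one of many possible bandit algorithms, not the one from the slides.

```python
import random

# Multi-armed bandit sketch: a single state, immediate rewards only.
true_click_probs = [0.02, 0.05, 0.03]   # hypothetical click-through rate for each ad
counts = [0, 0, 0]                      # how often each ad has been shown
estimates = [0.0, 0.0, 0.0]             # sample-average reward estimate for each ad
epsilon = 0.1                           # probability of exploring a random ad

for t in range(10_000):
    if random.random() < epsilon:
        action = random.randrange(len(estimates))                        # explore
    else:
        action = max(range(len(estimates)), key=lambda a: estimates[a])  # exploit
    reward = 1.0 if random.random() < true_click_probs[action] else 0.0  # click or no click
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]   # incremental mean

print(estimates)  # the estimates should approach the true click probabilities over time
```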
The discount could for instance be a function of the state, it could vary over time because of non-stationarity in your problem, or it could be a random function of something else that is not in your state, and this turns out to be useful additional flexibility. But you don't need to worry too much about that here; you can mostly just think of the discount as being some number between 0 and 1 that trades off the near term versus the long term.

Yes, so how can the agent reason about the reward it gets when it has taken an action? Indeed, in general we assume that if you take an action, you will consume that reward, and you won't be able to take back that action. In the multi-armed bandit setting, because you are basically in the same state over and over again, you can try a different action next time in the same state. In the full reinforcement learning setting, taking the action will have changed your state, which means you can't actually undo that action, unless you are in a specific problem in which you could. So the point then becomes to learn to predict the consequences of your actions before taking them, so that you can take well-informed decisions before seeing the results. This is slightly different in planning: if you have access to the model, you could of course just look at all the actions and their hypothetical outcomes, and then the only problem becomes computational: do you actually want to look at everything, which might be too much. It is a great question.

And speaking of predictions: something we often want to predict is not the immediate reward but the value, which is the future, potentially discounted, return. There is an expectation here, so it is the expected return, and this expectation is basically over everything that is random, including potentially your policy, if your policy is random, but the problem itself might also be random: if you take a certain action in a certain situation, the reward you get and the next state you end up in might not always be the same, even if the state you started in was the same. We use a shorthand where we subscript the expectation with pi if all the actions are taken according to policy pi, typically your current policy, and this then allows you to reason about what the value of your current policy is in a certain state. Now the goal for the agent basically becomes to maximise value. We are not interested in maximising the actual return, because often you can't if it is stochastic, but you can try to maximise the expected return. The rewards and values basically define the desirability of a state or an action. There is no supervised feedback, which basically means that you might not be able to know the true value of any state or action, and you might not even be able to know the maximal attainable value from a certain state or action, but you can still learn these things from interaction. And note that the return, as shown on the slide in the equation, and also the value, can be defined recursively, and this is going to be useful for algorithms later.

So why are values so important in reinforcement learning? They are a very central concept, because, roughly speaking, if we can predict everything, arguably we know all there is to know. In order for that to be true, maybe we have to be quite lenient with what we call prediction, but you could think of doing an experiment and being able to predict the outcome: if you can predict the outcome exactly, then you knew all there was to know about that experiment, in a sense.
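For reference, the return and value just discussed, and the recursive form mentioned on the slide, can be written in the standard notation (with discount factor gamma) as:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma G_{t+1}

v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right]
         = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
```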
But also, if you can predict the outcome of some logical deduction, say, then again you knew everything that there was to know about it. So in that sense, values can be used to encode knowledge about the world. It might be good to note here that there are different types of knowledge: there is self-contained mathematical knowledge, and separately there is empirical knowledge, which is basically what we are talking about here, the knowledge about what the outcomes are of processes in the real world.

To make use of this, we can construct values over more than just the reward signal, which is this one scalar signal; essentially, we can define values over all of the signals the agent has access to. For instance, you can think of the raw sensory readings: you could try to predict those, or discounted future sums of those, or you could construct functions of those and then try to predict those. For instance, you could try to predict, for a robot, when it hits a wall. Let's say it has a bumper sensor, and when it actually hits a wall, this sensor goes from, say, zero to one; a bit flips somewhere internally. This robot could learn to predict when it will hit the wall under a certain policy, and this might be useful knowledge even if it is not actually what we are trying to optimise; the robot is probably not intending to hit the wall as much as possible, maybe quite the converse, or maybe it is just interesting for the agent to know when it will hit the wall or not, irrespective of the rewards it gets, for all kinds of other reasons. Okay, so if we have access to the values for all these signals, and transformations of these signals, we could then try to learn to predict and control all of these signals. And, to say something that follows quite naturally from what we said before: if we can fully control the reward signal, specifically if we can predict its future return and then also control and optimise it, then we have basically solved the problem that we set out to solve. To actually do that, it might be quite useful to try to predict and control many other things as well.

Now, on to actions. The goal is to select actions to maximise the value, and actions might have long-term consequences, because the reward might be delayed. Concretely, this means it might be better to sacrifice immediate reward to gain more long-term reward, and this second key piece is captured immediately in the value function: if that is the case, then the value of taking that action might be higher even if the immediate reward is lower. Examples of this include making financial investments, which might be a bad idea if you only look at the immediate reward but a good idea if you look at the long-term reward; or refuelling a helicopter, which might prevent you from having to land to refuel later at a maybe much more awkward time; or blocking opponent moves in, say, a board game, which might not immediately get you a lot of reward but might help your winning chances later on.

In addition to state values, we can also define action values, and that is sometimes going to be convenient. The definition here is very similar to the state-value definition that I gave earlier (let me go back, it was up here), which is basically just the return G, the cumulative
discounted reward from a state s, which we denote with v(s), the value of s. For action values, for historical reasons, we use q, and these are sometimes called Q-values; that is the only reason, just a historical one, there is no other reason why it is q. These are defined on the state-action pair, and that means we are simply conditioning on the first action already being a, the same a as on the left-hand side, whether or not our current policy would actually select that action. This allows you to do a very small notion of counterfactual reasoning, if you want to think of it like that: from a certain state, we could, hypothetically, for each of the actions, try to predict what the resulting value would be if we took that action and then followed policy pi, maybe your current policy, after that. If we have that for each of the actions, we can reason about what each of these actions would be worth, and this is useful. This is related to that question about predicting the reward, but instead of the reward we are predicting the whole value.

So this is all just definitions, right? This is the definition of the value of taking an action a and then following a policy pi. We haven't yet talked about what the agent actually does, so we'll talk about that next. The agent has a number of components, or potentially has a number of components, including the agent state and the policy, both of which you will basically always have, in a sense, and optionally a value function or a model, where the value function won't necessarily be the true value function but will be an approximation to it. The agent state is basically distinct from the environment state, and it is used as an input to the policy: the policy of the agent is defined as a mapping from states to actions, where the state there is the agent state; it cannot be anything else, because that is the only thing the agent has access to. The environment also has some internal state. In the simplest case, as I discussed before, there is only one state, which means we don't even have to think about that, but that is already an interesting learning problem, because you might still want to reason about all the possible actions. Often there are of course many different states, sometimes infinitely many, especially if the states lie in some continuous space. As I said, the state of the agent generally differs from the state of the environment, and the agent might not have access to the full state of the environment; even if it did, that state might actually be too big to reason about meaningfully.

So here we again see the interaction of the agent with the environment, and note that the agent only gets observations from the environment. The environment state, the full state of the environment, might be observable, in which case the observation might cover all of it, but usually that is not the case, and even if it were, it might contain loads of irrelevant information which, even to process, even to look at, would take simply too much computation for the agent for this to be a meaningful thing to do. So typically the observations are not the full environment state.
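Before moving on, the action-value definition from a moment ago can also be written in symbols: take action a first, then follow policy pi afterwards. For discrete actions, the state value is then the policy-weighted average of the action values; this second relation is standard, although it was not spelled out on the slide.

```latex
q_\pi(s, a) = \mathbb{E}\!\left[ G_t \mid S_t = s,\, A_t = a \right]
            = \mathbb{E}\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\, A_t = a \right]

v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)
```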
Then it is sometimes meaningful to reason about the history of the agent: starting at time zero, there is some first initial observation, then the agent takes an action A_0, it observes a reward R_1 and a new observation O_1, and this then continues and continues. You could think of this, for instance, as the sensorimotor stream of a robot, where the observations are the inputs from the sensors coming into the robot, say from a camera and maybe some audio signals; the actions are the motor controls, basically the power it sends to its motors; and the rewards might be something that we define, perhaps as a function of the observations. That is often a simple way to define a good reward, where you for instance tell the robot: I want you to go to the bright place, or something like that, and then it gets a reward whenever the brightness in its camera goes up. The history can then be used to construct an agent state, which we will denote S_t, and the future action will depend on this state.

As I said, if the world is fully observable, we could think of the observation as just being the full state of the environment; the agent then uses its observation as its state, and the agent is then in what is called a Markov decision process. Markov decision processes in general form a very useful mathematical framework and are used a lot within reinforcement learning. The definition is that a decision process is Markov if conditioning on the full history, as we see on the right-hand side of that equation, gives you the same probability of a reward (and potentially a discount) and the next state as if you had only conditioned on the current state. What does this mean? It basically means that the future is independent of the past given the present, where the present is your current state. Phrased differently, it means that the state contains everything that you need to know about the future. It doesn't mean the state contains the full history; it just means it is sufficient, in a sense, so that once the state is known, the history may be thrown away without incurring any disadvantage. The full environment state, I would posit, is typically Markov, although it might be extremely large. The history itself is also Markov as a state: if you just replace S_t on the left-hand side with H_t, the history, then obviously this is true. It is not particularly useful, though, if your history becomes too large and unwieldy to manage.

More commonly, we are in a partially observable setting, where the agent gets some partial information about the state of the environment. As I mentioned, you could think of a robot with camera vision: the robot might not be told its absolute location, even if knowing that location would be very useful for its policy. Another example would be a poker-playing agent that only observes the public cards but doesn't know the cards of the opponents. In that case this is not a fully observable setting, and the observation is not Markovian. Formally this is called a partially observable Markov decision process; it is quite an obvious name in that sense. Note that the environment state might still be Markov, although the agent cannot access it, and indeed it typically is: if you think about, say, running a simulator, there is something in the simulator that is used to compute the next step of the environment, and that would be the sufficient state, so that might be the Markov state of the environment, which might not be visible to the agent.
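For reference, the Markov property just described can be written in one common form as follows, where conditioning on the full history H_t gives the same distribution over the next reward and state as conditioning only on the current state:

```latex
p\!\left( R_{t+1}, S_{t+1} \mid S_t, A_t \right) = p\!\left( R_{t+1}, S_{t+1} \mid H_t, A_t \right),
\qquad
H_t = O_0, A_0, R_1, O_1, \dots, R_t, O_t
```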
Yes, so: in a POMDP, is the environment state Markov? I would actually posit that the environment state is always Markov in basically any problem of interest. You can define things in such a way that that is not true, but I'm not sure that is a particularly useful definition. Is it then still a POMDP? I would say that depends on how exactly you define a POMDP; in practice it doesn't really matter. In practice we can just think of the environment state as being Markov, but we just don't see everything, and then it would be a POMDP. Typically, formally, in a POMDP there is some definition of your observations being a function of a state which is Markov, so there must be some underlying Markov state, and your observation is just some function of that.

So the agent state can only ever be a function of the history, and the agent's actions can only ever depend on its state; that is the definition of the state, it is what is in the agent right now, so its actions must depend on that and only that. An example would be that the state is your observation; this would be one choice, but it might not be the best choice. More generally, it might be some function, as I put there, of the previous state, the action, and the current observation, where f is what is sometimes called the state update function. You could also toss the reward in there, or whatever else there is in terms of signals; I didn't put the reward there because you could also think of the reward as being part of the observation, or, as I said, as being internal to the agent. This agent state is typically much, much smaller than the environment state, for instance for computational reasons: you have some limited amount of compute, and therefore you don't want this to grow indefinitely.

Yeah, right, so, just losing the Markov property, if I paraphrase correctly: losing the Markov property does not actually mean that you have to depend on the history. For instance, as I said up there, you could still define the agent state as being the current observation; the only thing to be aware of then is that this observation might not have all the information that you need to make an optimal decision. More generally, we could define our agent state as being some function of the previous agent state, the action, and the observation, or even more generally some function of the history, and basically what I'm saying is that this is maybe useful in some cases: instead of relying only on the current observation, it might be useful to have some ongoing state within the agent that you continually update and that is also allowed to, for instance, remember things from the past, because this can be useful for the decision process. The problem would still be a POMDP: in terms of the interface, the problem the agent is trying to solve is a partially observable Markov decision process, and it is only on the agent's side that remembering something, building up an agent state, is maybe a useful algorithmic tool for making better decisions.

To give an example, this might be a very simple problem where there is some maze, and your full environment state might be the maze, maybe in addition to where you are in that maze, because typically you don't just have a maze, you have an agent in the maze (I didn't depict the agent here). And this might be a potential observation: let's say the agent is in the centre of this 3x3 little field and it can only see the immediate points around it. Then it might observe this, and these could just be some numbers that the agent consumes; maybe the walls are encoded as ones and the empty passages are encoded as zeros (sorry, the black is supposed to be a wall and the white is supposed to be empty), so this would just be nine numbers, or nine bits, that the agent gets as input.
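As a minimal sketch of the state-update idea just described, here is one very simple choice for the update function f: keep a fixed-size window of the most recent action-observation pairs as the agent state. This is just one possible f, made up for illustration; a learned update (for instance a recurrent network, which comes up again below) could replace it.

```python
from collections import deque

def make_state_update(window_size=4):
    """Returns a simple state-update function f, so that S_t = f(S_{t-1}, A_{t-1}, O_t)."""
    def f(previous_state, previous_action, observation):
        window = deque(previous_state, maxlen=window_size)   # keep at most `window_size` items
        window.append((previous_action, observation))        # add the newest action-observation pair
        return tuple(window)                                  # tuples are convenient as dictionary keys
    return f

f = make_state_update(window_size=2)
state = ()                                                    # empty initial agent state
state = f(state, None, (1, 1, 1, 1, 0, 1, 1, 1, 1))           # e.g. nine wall/empty bits from the maze
state = f(state, "up", (1, 0, 1, 1, 0, 1, 1, 1, 1))
print(state)
```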
Then it can hopefully try to do something with that observation. But we could see this same observation in a completely different location: note that in this specific example, both of those observations are actually indistinguishable, they are the same observation, even though you are in a different position. Now, depending on the reward function, you might want to be able to distinguish between those two in order to even decide which way you want to go. So here is a question: how could you construct a Markov agent state, one that does tell you everything you could possibly want to know in order to predict the future in this maze, for any reward signal? So even if you don't commit to any reward signal, you should still be able to build up a Markov agent state. Does anybody have a suggestion?

Yeah, so the suggestion is to basically keep track of a small window of history, to have a couple of previous observations, and indeed, in this case this would be sufficient. For instance, if you consider coming from the top, the previous observation would already be different in these two cases; if you are coming from the bottom, one step wouldn't actually do it, but if you go two steps down, you would again see something different. So if you have enough previous observations, in this case you would be able to tell apart these two different states, and this would give you enough information to know exactly where you are, given that the maze doesn't change. So as long as your memory is long enough, at some point you will know where you are just by looking at a couple of observations. So indeed, it is a good suggestion.

Yes, so if you keep track of your full history to decide what to do next, this would have the Markov property, because you cannot add your history to your history and then do something more. The only problem with that is that it is typically computationally too large; having a short history could be enough, but that depends on the problem. I'm not sure that answers your question, though. So, if I understood correctly, there are actually two questions. One is: if you have a short history, rather than just looking at the current observation, does that then violate the Markov property? The answer to that is: it depends. A short history like that might not be enough to still have all the information, and more generally, building up an agent state, even from your history, might not lead to a fully Markovian agent state; it might lead to something that is maybe more Markovian, and in fact a lot of people think about these things much more on a grey scale than as black and white, either Markov or not. So it might still be helpful to build up a short history, but it might not make things fully Markovian. The second question, if I paraphrase and generalise it slightly, is essentially: can you build a good decision policy from a non-Markov state? The answer to that is, in practice at least, we have found yes, you can. You don't need fully Markovian states in order to make the right decisions, although you can of course set up examples where you are missing that key ingredient that you didn't know about, and then everything falls apart. But in practice it turns out that if you just have a little bit of memory, for instance you
have a couple of previous observations, or you have some learning system that learns the agent state update function, this already helps a lot. Thanks.

Yes, so: if the environment is fully observable. This is actually a very interesting question. If the environment is fully observable, that means that everything that you want to know is in that observation, and then the question is: would you then never want to do anything else, never want to build up past observations or anything, because that would just be wasteful? In one sense this is obviously true: if you would have to consume multiple observations whereas just the last one already had all the information that you wanted to know, then this is true. However, the other thing that I said about the environment state is that it is often very big, so even if you could observe the full environment state, you might still choose to only pay attention to part of it, to only do compute on part of it, and that part might not be fully Markovian. So in practice, what is sometimes better is, even if you do have access to the full environment state, to only use a very small part of it and then still build up some history, or use some memory, that allows you to build up a state that is still roughly Markovian, simply for computational reasons. So it is a good question. It is indeed sufficient: if your current observation is sufficient, then that is enough. The distinction between these two approaches is that sometimes it is easier to build up the knowledge by looking at parts of the observations sequentially than it is to fully inspect the current observation. It is a good question.

Why do I call it a Markov agent state? So I defined Markov earlier; you can define it in several ways, I mean, Markov itself as a property has a well-defined definition, but what we exactly mean here is what I defined: based on your state and your action, you can predict the next state, the reward, and the discount. Here we are building up that state, so this is not actually reasoning about that property yet; the next step would be to see whether that state is Markovian, and this is the previous step, in a sense, building up that state. So is f itself going to have a certain property in terms of the Markov property? Well, typically, in many cases, f will actually be a deterministic function of your previous state and your observation. The observation is random, of course, but that doesn't matter too much for f. So f could be a fairly simple thing: stacking a couple of observations, for instance, as was suggested before, is a deterministic function; you just keep your previous observation around, you have, say, two observations, and each time you observe something new you just toss one away and replace it with the new one. It could sometimes be a stochastic function, and sometimes that is useful. In practice, for instance (if you want to know about these things), a common implementation these days is to implement this as a recurrent neural network, and a recurrent neural network has a very similar equation, which is basically saying: I have some internal state, I see an input, and then my new internal state will be some function of this input and the previous state. That is, almost by definition, a recurrent neural network, if you ask people who work with these.
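As a minimal sketch of that idea, here is a tiny recurrent state update written with NumPy. The sizes and the tanh nonlinearity are arbitrary choices for illustration, and the weights here are random; in practice they would be learned, and the action could be fed in alongside the observation.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_size, state_size = 9, 16     # e.g. nine observation bits, a 16-dimensional agent state
W_s = rng.normal(scale=0.1, size=(state_size, state_size))   # weights on the previous state
W_o = rng.normal(scale=0.1, size=(state_size, obs_size))     # weights on the new observation

def recurrent_state_update(previous_state, observation):
    """S_t = f(S_{t-1}, O_t): a differentiable recurrent update of the agent state."""
    return np.tanh(W_s @ previous_state + W_o @ observation)

state = np.zeros(state_size)     # initial agent state
for _ in range(5):
    observation = rng.integers(0, 2, size=obs_size).astype(float)   # a random nine-bit observation
    state = recurrent_state_update(state, observation)
print(state.shape)  # (16,)
```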
I'll say more about neural networks and deep learning a bit later. This is also called memory; in reinforcement learning we like to call it the agent state, because that reflects the reinforcement learning viewpoint, but it's good to know that these are basically the same concept. Yes?

How do we know how much memory we need? The sad answer is that typically we don't, so you make a well-informed guess. What's hard is that it's often difficult to pick a state-update function in advance for the problem you're trying to solve, without knowing enough about the problem; even if you do know quite a bit, it might still be hard to pick one. This is why many people these days implement these things with recurrent neural networks: then we can optimize them, we can learn from the data what a good update function is. There are related questions, like the one just asked about how much memory we need: how do we know how to update it, and how do we pick the specific architecture of the neural network we're going to put in there? Again, the answer is typically that we don't really know, so we try a couple of things and see what works. There is of course a lot of accumulated knowledge in the field about when things are sufficient, and there is a notion of how big your agent state is, how many bits it actually encodes, which gives you an upper bound on what it could potentially represent. So if you know something about how hard your problem is, if you know roughly how much you want to remember, you might want to pick the state to be at least a certain size in terms of capacity. Otherwise it's a hard problem. Yes?

Will this network affect what the agent pays attention to? Yes. At least one way of thinking about this is that the state-update function defined here is the only function that consumes the observation; all the other parts of the agent, the model, the policy, the value function, only look at the agent state. That means the state-update function may be the only part that looks at the actual raw observation. So either you pick beforehand what it pays attention to, or it is itself a learning system which learns which parts of the observation to pay attention to. Obviously there are caveats here, because if at some point it stops paying attention to certain parts of the observation, it might be very hard later to learn that those parts are actually important, since it isn't paying attention to them anymore. But those are hard problems in general, not specific to reinforcement learning, and it seems we're at least scratching the surface: there are methods that can learn these things quite successfully in at least some domains.
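A minimal sketch of that agent-loop shape (not from the talk; `update_fn` and `policy_fn` are hypothetical stand-ins), where only the state update ever touches the raw observation:

```python
def agent_step(state, observation, update_fn, policy_fn):
    """One interaction step. Only update_fn consumes the raw observation;
    the policy (and likewise any value function or model) reads nothing
    but the resulting agent state."""
    new_state = update_fn(state, observation)   # sole consumer of the observation
    action = policy_fn(new_state)               # works purely from the agent state
    return new_state, action
```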
Yes? So the question is: if we're going to learn this state, say with a neural network, how can we be sure that the state is correct? The short answer is that we can't, and in fact it might be a somewhat meaningless notion to talk about the correct state. Think about the case where the environment state is huge: it is a Markov state, but it's much larger than fits in the memory of the agent. Then we have no hope of reconstructing the actual environment state, but we might still be able to find a state that is useful; we might even be able to find a state that is sufficient to find the optimal policy.

Yes, indeed, that's a related point: even if the true environment state exists and your agent could in principle represent it, could store it in memory and reason about it, for instance if the agent has enough capacity to construct a full history, there is no guarantee that the recurrent neural network we train will find that solution, the one that is sufficient, that preserves what's needed for the optimal mapping. That is true. There is very little we can say with certainty these days, and maybe in the future this will get better, about the solutions these systems will find. The reason is the same reason these systems are so general: these are typically nonlinear functions that we're trying to learn, which can be applied quite broadly, but that flexibility makes them very hard to analyze, and we don't know a lot with certainty. What we typically do know is that they tend to get better over time, and in some restricted cases we can say much firmer things; we can say a method will actually find, say, a local optimum in some sense. But can we guarantee in general that these will find the true state, if it exists and is accessible, or even a useful state? Not really; we cannot say that with certainty. That doesn't mean they don't work in practice. These are distinct things, and both are important; it's also good to realize that we cannot always have these guarantees. Thanks.

Okay, that was the agent state, so now we can move on to the policy, which defines the agent's behaviour, as I've said a couple of times already. It's basically just a mapping from agent states to actions. This can be a deterministic policy, a deterministic mapping from the input state to an action, or it can be a stochastic policy, which we denote pi(a | s), or sometimes pi(s, a), where we see both of those as inputs and the output is a probability: simply the probability of selecting that action in the current agent state. We can move on from that quite quickly because we've already covered the policy quite a bit. Now the value function: this is the definition of the true value under a certain policy, which is the expectation of the cumulative discounted return; it's just the sum of the rewards into the future, with a discount factor, given your current state and given your current policy. It's good to note that this value depends on the policy. You could actually ask many different predictive questions, for many different policies: for this policy, what's my value; for that policy, what's my value. In many cases we're interested in just knowing the value for the current policy, and this can then be used to evaluate the desirability of different states, or to select between different actions.
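Written out in standard reinforcement-learning notation (a reconstruction of the definitions just described, not a reproduction of the slides):

```latex
% Stochastic policy: the probability of selecting action a in agent state s
\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)

% Value of state s under policy \pi: expected cumulative discounted reward
v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s,\ \pi \right]
```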
Yes, there's a good question: why is there always a discount factor in my slides, and does the theory change if it's set to one? The first answer is that the discount factor is somewhat more general: if we include it, we can state things more generally than if we left it out, because you can always set it to one. Then the second question is whether anything changes when it's equal to one, and yes, some things change; you sometimes need separate theory for the undiscounted case. In particular, for the undiscounted case you typically need an additional assumption that the process ends at some point, because otherwise your values could be infinite, and then it can be very hard to distinguish between certain policies: they might both have infinite value, in which case the agent can't tell them apart. Typically, when we do use undiscounted returns, we do so in problems we call episodic, which means that at some point the interaction ends but you're allowed to restart and do it again. The Atari games we saw a video of earlier are an example: you typically play until game over, and then you start over and play again and again. Another way to view this, relating to the time-varying discounts I had on an earlier slide, is that you can still model it as a continuing problem, but the discount is one most of the time and sometimes zero: the zero is when you hit game over, and when the discount is zero your prediction essentially ends there. If that is inevitably going to happen at some point, then these values are well defined even if the discount factor is otherwise one. The discount factor does interact with the algorithms as well, though; it turns out it's actually easier to learn values for lower discounts, which may be intuitively obvious, and this can be very useful in practice. For the Atari games, for instance, we used a discount factor lower than 1, namely 0.99, which in these games means a look-ahead of something like 10 seconds; beyond that you don't really care about the rewards anymore. For these Atari games that's sufficient; for other problems you might need a discount factor much closer to one, corresponding to a much longer horizon.

Could you have a different shape of discounting over time? That's a very good question, and yes, you can. The standard discount is a geometrically decaying function; sometimes it might be more useful to have something that is low at first, goes up, and then goes down again. The reason we don't typically consider those as much is that we want the recursive relationship: we want to be able to say we first observe the reward and then look at the value of the next state, and functions that go up and then down aren't as easy to reason about recursively. However, it turns out that if you do want a bump like that, you can get it by looking at the difference between the value with one discount and the value with a different discount; that difference has exactly this property. So you could simply predict two different values for two different time scales (the discount is sometimes referred to as the time scale of the problem) and then pull out the specific parts of the future that you care more about. These, by the way, seem to match quite nicely to what has been found in the human brain in terms of temporal sensitivity: those responses look a lot like differences between value functions at different discounts, which is interesting.
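A quick illustration of that claim (not from the talk): the value with discount 0.99 weights the reward k steps ahead by 0.99^k, so subtracting a value with discount 0.9 weights it by 0.99^k - 0.9^k, which is zero at k = 0, rises, and then decays back toward zero, a bump over future time.

```python
import numpy as np

gamma_hi, gamma_lo = 0.99, 0.90
k = np.arange(200)

# Weight that the difference of the two value functions places on the
# reward k steps into the future: zero at k = 0, rises, then decays.
bump = gamma_hi ** k - gamma_lo ** k
print(int(k[bump.argmax()]))   # the bump peaks a couple of dozen steps ahead
```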
Okay, so the value function inside the agent is going to be some approximation to this true value function. As I said, the return has a recursive form: the return is the cumulative discounted sum of rewards, which you can pull apart into the first reward plus the rest of the return multiplied by the discount. The value has the same recursive form, and this is going to be very useful: the value of a state is the expectation of the first reward plus the discounted value of the next state. This equation is known in the literature as the Bellman equation, written down by Richard Bellman in the 1950s. Similar equations hold for the optimal value, which is typically denoted with a star in the reinforcement learning literature: we replace the subscript pi, the current policy, with a star, which stands in for the optimal policy. This is quite interesting, because now we can reason about the optimal value you could attain in a specific Markov decision process. For simplicity, a lot of the theory is done in that setting: we think about the Markov decision process case where you have full observability, so the state here might be the full environment state; the values are defined relative to your agent state, but in that case they are optimal for the actual problem. Note that these optimal values, these equations, do not depend on some arbitrary policy anymore: they don't depend on pi, they're self-contained definitions of the optimal values (written out explicitly below). In reinforcement learning we heavily exploit equations like this and use them to create many algorithms. So agents often approximate value functions, and there are many algorithms to plan or learn these efficiently. The reason we care is that with an accurate value function we can behave optimally. This is most obvious for the action value function: if we have the optimal action value function, we can simply pick the action with the highest value in every state, and that is guaranteed to be an optimal policy for the problem. In general, as I said, our problems might be too big to solve exactly, but with suitable approximations we can behave well even in intractably large domains. We might lose optimality, we might not be able to find the optimal value function or the optimal policy, but we can still find better policies incrementally and repeatedly, and thereby do a lot better than we did initially. Okay, I'll go through the final agent components and then I think we'll do a short break.
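To pin down the equations just mentioned (standard notation, a reconstruction rather than the slides): the return and the value satisfy a one-step recursive relationship, and the optimal values satisfy the same kind of equation with a max in place of a fixed policy.

```latex
% Recursive form of the return and the Bellman equation for a policy \pi
G_t = R_{t+1} + \gamma\, G_{t+1}
v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ \pi \right]

% Bellman optimality equations: no dependence on any particular policy
v_*(s)   = \max_a \mathbb{E}\left[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s,\ A_t = a \right]
q_*(s,a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s,\ A_t = a \right]
```

Acting greedily with respect to q_*, that is, picking argmax_a q_*(s, a) in every state, then gives an optimal policy, which is the point made above about action values.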
Yes? So the question is: in a finite MDP you have a well-defined optimal policy; does something similar apply in continuous-state MDPs? I actually think the continuous part doesn't matter that much for the definition: the optimal policy is still well defined, it might just be intractable to solve for. We do have to be a bit careful when we talk about continuous states, because the state space is infinite, but I agree with the comment that the theory basically goes through. Algorithmically it may be less clean: even if the optimal policy is well defined, that doesn't mean it's easy to find. What people typically do then is reason about the solution, or the fixed point, of a certain method under some approximation, and the quality of your solution will depend on the choices you make in that approximation. You can reason about that as well: how close is that fixed point to the actual optimum, and how does that depend on your approximation? There is some theory on that, although, regarding the neural networks mentioned earlier, most of it is restricted to the linear case, to linear function approximation. And to the original question: in a Markov decision process the optimal policy will be stationary and deterministic; in a partially observable MDP that's different, but I think the continuous-state case doesn't change it, the optimal policy is still stationary and deterministic. I believe that goes through, but I don't have a reference handy for you.

Okay, so next: an agent can potentially have a model, which predicts what the environment will do next. This can be useful in many cases; it's an obvious thing you could try when interacting with an environment: you're trying to learn things, so you could try to learn a model. What is the model, then? It has two components that are important: the next-state prediction and the immediate-reward prediction. A model has certain benefits over value functions and policies, in the sense that it's typically easier to learn, because it's supervised learning: you can just see which state you end up in, or which reward you end up receiving, and learn a function that maps, say (for simplicity, take the case where your agent state is just your observation), from the observation to the next observation, and from the observation to the immediate reward. That's supervised learning, which is well understood and works really well in practice. The problem, though, is that the model itself doesn't immediately give us a good policy: we still need to plan with the model, and that can be non-trivial; it can take a lot of compute, depending on the model that you have. We can also consider what exactly to learn: the definition here says the reward model maps a state and an action to the expected reward. Alternatively, you could try to learn the whole distribution, or you could learn a system that samples from the distribution, so that the goal of the model is to output rewards with the same distribution as the actual rewards you're going to see. This is especially useful for states: if you output a sampled next state according to the same distribution as you would have seen in the real world, then you can plan through trajectories, and all of the branching that happens in the real world can happen inside the agent when it rolls out its model. Of course, you would then probably want to roll out trajectories repeatedly, to be able to average over the noise, but this can be a useful approach.
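A minimal sketch of planning by rolling out a learned sample model (not from the talk; `model` and `policy` are hypothetical callables assumed to return a sampled (next_state, reward) and an action, respectively): single rollouts are noisy, so values are estimated by averaging many of them.

```python
import numpy as np

def rollout_return(model, policy, state, horizon, gamma=0.99):
    """Roll a learned sample model forward and accumulate discounted reward.
    `model(state, action)` is assumed to return a sampled (next_state, reward)."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = model(state, action)   # sampled, so repeated rollouts branch
        g += discount * reward
        discount *= gamma
    return g

def estimate_value(model, policy, state, horizon=50, n_rollouts=100):
    """Average over many noisy rollouts to estimate the value of `state`."""
    return np.mean([rollout_return(model, policy, state, horizon)
                    for _ in range(n_rollouts)])
```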
Yes? So what happens if the model makes mistakes? A very good question, because models always make mistakes: for any non-trivial problem you're not going to have a perfect model. Even though it's supervised learning, even if you do pretty well, there will be some mistake, which might mean, say, that there's a hole in a wall in your model; the agent will think it's a door, it will go there, and it will bump into the wall, because its model said there was an opening. This happens very often. Essentially, the solution is to not trust your model too much, and there are many ways to do that. There was recent work by some of my colleagues at DeepMind where they trained a model, used it to predict a couple of steps into the future, and then simply fed the output of that as input to a function that was learning the values or the policy. So instead of trusting the model completely, you say: here's some additional information you might want to use; it might be correct and you can learn to trust it, or it may be a little shaky but still useful. If you're going to literally plan with a learned model, it happens very often that the resulting plan finds peculiarities of your approximation that aren't really there, and the reason is that most of our planning algorithms assume the model is completely correct, so they will exploit anything that's in there; they have no notion of uncertainty about how accurate the model is. This is very active ongoing research, by the way, how to work around inaccuracies in your model when planning; there are some ideas, but no well-accepted general solution yet, so it's a very good research question.

Here's a simple example, for exposition. Say we have a maze with a start and a goal, and the reward is minus one on each time step. There's no discounting, but the minus one already spurs the agent on to solve the problem as quickly as possible. There are four actions, corresponding to the cardinal directions, and the state is just where the agent is: if it's a fixed maze, your location alone is Markovian; it gives you all the information you could possibly need to have an optimal policy. This is an example of an optimal policy, which you could try to learn directly. I haven't given you any learning algorithms yet, so you'll have to trust me that there are methods that can learn this, or that can plan it if you have access to the whole problem. It's a deterministic policy in this case, which steps through the whole MDP. And this is the corresponding true value function. On the slide I wrote v_pi, but that's actually a mistake: I should have written v_star, because this is the value for the optimal policy; you could also think about the value for, say, the random policy, which would look different. Note that the values change in exact integer increments; this is because there's no discounting, so all the rewards are weighted equally heavily, and the value is just the negative number of steps until you exit the maze.
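A small illustration of that value function (not from the talk; the maze layout is made up rather than the one on the slide): value iteration with reward -1 per step and no discounting recovers exactly minus the number of steps to the goal.

```python
# A tiny maze for illustration: '#' is a wall, 'G' is the goal.
# Reward is -1 per step, no discounting (gamma = 1).
maze = ["#####",
        "#  G#",
        "# # #",
        "#   #",
        "#####"]

cells = {(r, c) for r, row in enumerate(maze) for c, ch in enumerate(row) if ch != '#'}
goal = next((r, c) for r, row in enumerate(maze) for c, ch in enumerate(row) if ch == 'G')
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # the four cardinal actions

V = {s: 0.0 for s in cells}
for _ in range(100):                          # value iteration sweeps
    for s in cells:
        if s == goal:
            continue                          # the goal is terminal: value stays 0
        candidates = []
        for dr, dc in moves:
            nxt = (s[0] + dr, s[1] + dc)
            if nxt not in cells:
                nxt = s                       # bumping into a wall leaves you in place
            candidates.append(-1.0 + V[nxt])  # reward -1, undiscounted
        V[s] = max(candidates)

# V[s] is now minus the number of steps from s to the goal under an optimal policy,
# matching the integer increments described above.
print(V[(3, 1)])   # -4.0 for this layout
```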
And then this might be a model which is wrong. It's an example of a wrong model because, essentially, this model learned an exactly correct reward function, but it's missing a transition near the start, going down: there's a whole part of the maze there, but the model thinks the maze stops early. This can happen. In this case it doesn't hurt, because planning through this model would still give you the optimal solution, but of course there are cases in which going down would actually have been a shortcut, in which case planning with this wrong model would give you the wrong solution.

Okay, so let me quickly categorize agents, just to give a bit of the terminology that I use and that other people in the literature use. Value-based agents explicitly build an approximation to the value function, by planning or learning, often by learning. Policy-based agents build an explicit representation of the policy, so they learn that function directly, but they typically don't have a value function; and value-based agents typically don't have an explicit policy, which is why they're called value based: the policy is inferred from the values. Typically, value-based agents learn action value functions, these q functions, and then you can build a policy by looking, for each action, at the value you predict you'll get when you take that action, and then, say, taking the action with the highest predicted value. When we have both components explicitly, a policy and a value function, we call it an actor-critic system, and then the policy is often learned by using the value function in some way. A separate categorization, a separate dimension along which you can make a distinction, is whether an agent has a model or not. Model-free agents have a policy and/or a value function, but no model. This terminology is a bit contentious, because some people would call any neural network, or indeed any statistical model, a model; it's in the name, and calling a value function a model would be a valid use of statistical terminology. But in reinforcement learning we reserve the term model, more or less, for things that predict the next state or the next reward, and if you only have a value function we call that model free, more or less for historical reasons. It then follows that a model-based agent could still optionally have an explicit policy and/or value function, but in any case it also has a model, and model here means a model of the dynamics of the system, of the transitions and of the rewards. This brings us to something that looks a little like a Venn diagram: on the top left, agents that store a value function; on the top right, agents that store a policy; at the bottom, agents that store a model. These are terms that are used in the literature, unfortunately not completely consistently, but consistently enough that it's useful to be aware of them: value-based methods, policy-based methods, model-based methods, and in the intersection of value based and policy based we have actor-critic methods, which may or may not have a model. There are no separate terms for agents that have a model and a policy, or a model and a value function; they're just called model-based value-function-learning agents or something like that; there's no specific term for it. Okay, and then I suggest we have a short break, say 15 minutes.
Info
Channel: Federated Logic Conference FLoC 2018
Views: 692
Rating: 5 out of 5
Id: dSxKxbqvSdA
Length: 86min 52sec (5212 seconds)
Published: Mon Jul 30 2018