MIT 6.S191: Reinforcement Learning

Video Statistics and Information

Captions
Hi everyone, and welcome back to 6.S191! Today is a really exciting day, because we'll learn how to marry the long-standing field of reinforcement learning with the recent advancements in deep learning we've been seeing in this class, and how combining these two fields lets us build some extraordinary applications: agents that can achieve superhuman performance. I think this field is particularly amazing because it moves away from the paradigm we've been constrained to so far. Until now, deep learning as we've seen it has been confined to fixed datasets that we collect or obtain online. In reinforcement learning, a deep learning model is placed in an environment, explores and interacts with that environment, and learns how to best accomplish its goal, usually without any human supervision or guidance, which makes it extremely powerful and flexible.

This has obvious impact in fields like robotics, self-driving cars, and robot manipulation, but it has also revolutionized the world of gameplay and strategic planning. It's this connection between deep learning, the real world, and the virtual world that makes it particularly exciting to me, and I hope the video I'm going to show you next conveys that as well.

"StarCraft has imperfect information and is played in real time. It also requires long-term planning and the ability to choose what action to take from millions and millions of possibilities." "I'm hoping for a 5-0, not to lose any games, but I think the realistic goal would be four and one in my favor." "I think he looks more confident than __. He was quite nervous before; the room was much more tense this time. We really didn't know what to expect. He's been playing StarCraft pretty much since his fight." "I wasn't expecting the AI to be that good. Everything that he did was proper, it was calculated, and it was done well. I thought, I'm learning something." "It's much better than I expected. I would consider myself a good player, right? But I lost every single one of five games."

So this is an example of how deep learning was used to compete against professionally trained game players, and not only to compete against them but to achieve remarkably superhuman performance, beating this professional StarCraft player five games to zero.

Let's take a step back and see how reinforcement learning fits with respect to the other types of learning problems we've seen so far in this class. The first and most comprehensive class of learning problems we've been exploring is supervised learning; this is roughly what we covered in the first, second, and third lectures. In this domain we're given data x and we try to learn a neural network to predict its label y, so the goal is to learn the functional mapping from x to y. Intuitively: if I give you a picture of an apple, I want to train a neural network to tell me that this thing is an apple.

The next class of algorithms, which we learned about in the last lecture, is unsupervised learning.
In that case we're only given data with no labels, say a bunch of images of apples, and we have to learn a model that captures the underlying structure of the dataset. In the apple scenario, we try to learn a model that, if we show it two pictures of apples, tells us that these things are basically like each other. It doesn't know they're apples, because it was never given labels that explicitly say so, but it can tell that this thing is pretty close to this other thing it has also seen, and it can pick out the underlying structure shared between the two.

In reinforcement learning (RL), which is what today's lecture focuses on, we're given data only in the form of what we call state-action pairs. States are the observations of the system, and actions are the behaviors the agent takes when it sees those states. The goal of RL is very different from that of supervised or unsupervised learning: it is to maximize the future reward of the agent in its environment over many time steps. Going back to the apple example, the analog would be that the agent learns it should eat this thing because eating it has kept it alive, made it healthier, and it needs food to survive. Like the unsupervised case, it doesn't know that this thing is an apple; it doesn't even recognize exactly what it is. All it knows is that in the past it ate it, survived longer, and became healthier. Through these state-action pairs, and a lot of trial and error, it learns these representations and these plans.

Our focus today is on this third class of learning problems, reinforcement learning. Before we dive into the technical details, it's really important to build up some key vocabulary, because we're going to build on each of these concepts later in the lecture. This is a very important part of the lecture, so I want to go through it slowly so that the rest makes as much sense as possible.

Let's start with the central part, the core of your reinforcement learning algorithm: the agent. The agent is something that can take actions in the environment. It could be a drone making a delivery in the world, or Super Mario navigating a video game. The algorithm in reinforcement learning is your agent, and you could say that in real life the agent is each of you.

The next piece is the environment: simply the world in which the agent lives, the place where it exists, operates, and carries out all of its actions. And that is exactly the connection between the two: the agent sends commands to the environment in the form of actions. We write a_t for the action the agent takes at time t, and we denote by capital A the action space, the set of all possible actions an agent can make in the environment.
Even though this is somewhat self-explanatory, I do want to point out that the action space can be either discrete, chosen from a finite set such as forward, backward, left, or right, or continuous, for example the exact location in the environment the agent wants to move to, as real-numbered coordinates like GPS coordinates. So actions can come from a discrete (categorical) distribution or from a continuous one.

Observations are how the environment interacts back with the agent: they are how the agent can observe where it is in the environment and how its actions have affected its own state. That leads nicely into the next point. The state is the concrete, immediate situation in which the agent finds itself. For example, a state could be the image feed you see through your eyes: the state of the world as you observe it.

A reward is feedback from the environment to the agent; it's how the environment measures the success or failure of the agent's actions. For example, in a video game, when Mario touches a coin he wins points. From a given state, an agent sends outputs in the form of actions to the environment, and the environment responds with the agent's new state, the one that results from acting in the previous state, along with any rewards collected (or penalties incurred) by reaching that state. It's important to note that rewards can be either immediate or delayed. They evaluate the agent's actions, but you may not actually receive a reward until much later; you might take many actions and only be rewarded far in the future. That is a very delayed reward, but it is a reward nonetheless.

We can also look at the total reward, which is just the sum of all rewards the agent collects after a certain time t: if r_i is the reward at time i, then R_t is the return, the total reward from time t onward. Expanding the summation, R_t = r_t + r_(t+1) + r_(t+2) + ..., adding up all the rewards the agent collects from this point into the future. Often, however, rather than this straight sum, we consider the discounted sum of rewards. The discounting factor gamma, typically between 0 and 1, is multiplied into the future rewards discovered by the agent to dampen their effect on the agent's choice of action: R_t = r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + .... Why would we want this? The formulation is designed by construction to make future rewards less important than immediate rewards; in other words, it biases the agent toward near-term rewards.
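As a rough illustration of this discounted return, here is a minimal NumPy sketch (not from the lecture); the reward values and the choice of gamma are made up for the example:

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Compute R_t = sum_k gamma^k * r_(t+k), starting at t = 0, for a finite rollout."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

# The same +5 reward is worth less the later it arrives.
print(discounted_return([5.0, 0.0, 0.0]))   # reward received now          -> 5.0
print(discounted_return([0.0, 0.0, 5.0]))   # reward received 2 steps later -> ~4.51
```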
A concrete example of this: if I offered to give you five dollars today or five dollars in five years, which would you take? The reward is the same five dollars either way, but you would prefer to have it today; you prefer short-term rewards over long-term rewards, and the discount factor encodes exactly that preference.

Finally, there is a very important function in RL that starts to put these pieces together: the Q-function. Remember the definition of the total discounted reward R_t, which measures the discounted sum of rewards obtained from time t onward. The Q-function is closely related to it. The Q-function takes as input the current state the agent is in and the action the agent takes in that state, and it returns the expected total future reward the agent can receive after that point. Think of it this way: if an agent finds itself in some state and takes some action, what return can it expect to receive from that point on? That is what the Q-function tells us.

Now suppose I give you this magical Q-function, and it really is magical, because it tells us a lot about the problem. It's an oracle: you can plug in any state-action pair and it tells you the expected return from your current time point onward. The question is, given the state you're currently in, can you determine the best action to take? You can perform any queries you like on this function. Ultimately you want to select the best action for your current state, and the best action is simply the one that results in the highest expected total return. So you choose a policy that maximizes this future return, which can be written as taking the argmax of your Q-function over all possible actions available in that state. In other words, feed in your state paired with every possible action, evaluate the expected total reward the Q-function gives for each state-action pair, and pick the action with the highest Q-value; that is the best action to take in the current state. This gives us a policy, which we call pi(s), that infers the best action to take. Think of the policy as another function that takes your state as input and tells you the action to execute in that state. Given a Q-function, the strategy for computing your policy is simply this argmax formulation: find the action that maximizes your Q-function.
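To make this concrete, here is a small sketch of that argmax strategy in Python; `q_function` is a stand-in for whatever oracle or model supplies Q(s, a), and the names are purely illustrative:

```python
import numpy as np

def greedy_policy(state, q_function, actions):
    """Given a Q-function oracle, return the action with the highest Q(s, a).

    `q_function(state, action)` is a placeholder for whatever model or table
    provides the expected return; `actions` is the (discrete) set of actions.
    """
    q_values = [q_function(state, a) for a in actions]
    return actions[int(np.argmax(q_values))]
```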
In this lecture we're going to focus on two classes of reinforcement learning algorithms. The first, value learning, tries to learn this Q-function, Q of your state and your action. The second class, policy learning algorithms, tries to learn the policy directly instead of using a Q-function to infer it: in policy learning we directly infer the policy pi(s) that governs which actions we should take. That is a much more direct way of thinking about the problem. But first we're going to focus on the value learning problem and what is called Q-learning, and then we'll build up to policy learning after that.

Let's start by digging a bit deeper into this Q-function. First I'll introduce the game of Atari Breakout, shown on the left; for those who haven't seen it, here is a brief introduction to how the game works. In this game, the agent is the paddle at the bottom of the board. It can move left or right, or stay in the same place, so it has three actions. In the environment there is also a ball, which in this case is traveling down toward the bottom of the board and is about to hit and ricochet off the paddle. The objective of the game is to move the paddle back and forth and hit the ball at the right time so it bounces up and breaks out the colored blocks at the top of the board. Each time the ball touches one of those blocks, the block is broken out (hence the name Breakout) and disappears, and the goal is to keep moving around and hitting the ball until you've knocked out all of the blocks.

The Q-function tells us the expected total return for a given state-action pair, and the point I'd like to make here is that it can sometimes be very challenging to intuitively guess the Q-value for a given state-action pair. Say I give you these two state-action pairs, option A and option B, and ask which one has the higher Q-value. In option A, the ball is already traveling toward the paddle and the paddle chooses to stay in the same place; it will probably hit the ball, which will bounce back up and break some blocks. In option B, the ball is coming in at an angle and the paddle is moving to the right to hit it. Which of these two state-action pairs do you believe will return the higher expected total reward?

Before I give you the answer, I want to show what these two policies actually look like when they play the game, rather than just this single state-action pair. Let's look first at option A. Option A is the relatively conservative option that doesn't move when the ball is traveling right toward it, and as it plays the game it actually does pretty well: it hits off a lot of the breakout pieces toward the center of the board and breaks out a lot of the colored blocks. But let's also take a look at option B, which does something really interesting.
Option B really likes to hit the ball at the corner of the paddle, so that the ball ricochets off at an extreme angle and breaks off blocks in the corners of the screen. It does this to the extreme: even when the ball is coming right toward it, it will move out of the way just so it can come back in and hit the ball at these extreme ricocheting angles. Let's take a look at how option B performs when it plays the game.

You can see it's really targeting the side of the paddle and knocking off a lot of those colored blocks. Why? Once it breaks out the corners, the two edges of the screen, the ball gets stuck in that top region and knocks off a ton of blocks. Let's take another look: once the ball is stuck up there, the agent doesn't have to worry about taking any actions at all; it's just accumulating a ton of reward. This is a great policy to learn, because it beats the game much faster than option A, and with much less effort. So the answer to the question of which state-action pair has the higher Q-value is option B. That was a relatively unintuitive answer, at least for me when I first saw this problem; I would have expected that not moving out of the way of a ball coming right toward you would be the better action. But this agent learned to move away from the ball just so it can come back, hit it, and attack at extreme angles. That is a very interesting behavior the agent discovered through learning.

Because Q-values are so difficult for humans to define, as we saw in this example, the question becomes: instead of having humans define the Q-function, how can we use deep neural networks to model it and learn it instead? The Q-value is a function that takes a state and an action as input, so one thing we could do is build a deep neural network that takes both the state and the candidate action as input and train it to predict the Q-value for that state-action pair, a single number. The problem is that this is rather inefficient to run forward in time. Remember how we compute the policy for this model: to predict the optimal action in a given state, we would need to evaluate this deep Q-network n times, where n is the number of possible actions, at every single time step. Instead, in a formulation that is equivalent in spirit but slightly different, it is often much more convenient to output all of the Q-values at once. You input the state, and the network outputs a vector of Q-values, one for each possible action: the Q-value for action 1, the Q-value for action 2, all the way up to your final action.
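A minimal sketch of what such a network might look like in Keras, assuming an Atari-style stack of 84x84 grayscale frames as the state and three actions; the exact layer sizes here are illustrative, not necessarily the ones DeepMind used:

```python
import tensorflow as tf

num_actions = 3  # e.g. move left, stay, move right in Breakout

# The state (a stack of game frames) goes in, and a vector of Q-values,
# one per action, comes out in a single forward pass.
dqn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu",
                           input_shape=(84, 84, 4)),
    tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(num_actions),  # Q(s, a_1), ..., Q(s, a_n)
])
```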
Given the state you're currently in, this output gives you the breakdown of Q-values across the different actions you could take. Now, how can we actually train this version of the deep Q-network? We know we want it to output these Q-values, but it isn't obvious how to train it, and this is genuinely challenging conceptually, because we don't have a dataset of Q-values; all we have are observations in the form of state, action, reward triplets. To train this kind of deep Q-network, we think about the best-case scenario: how would the agent perform if it acted optimally and took all the best actions? That would mean the target return is maximized, and we can use exactly this target return to serve as our ground truth, our dataset in some sense, for training the deep Q-network.

Concretely, we formulate the expected return if we were to take all the best actions: the initial reward r, plus the discounted value of the action that maximizes the expected return in the next state. This is our target, the Q-value we want our prediction to match. And what does our network predict? As we've seen, it predicts the Q-value for a given state-action pair. We can use these two pieces of information, the predicted Q-value and the target Q-value, to form what we call the Q-loss: essentially a mean squared error between the target and predicted Q-values, and we can use that to train the deep Q-network.
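Here is one way that Q-loss could be sketched in TensorFlow, assuming `dqn` is a network like the one above, `actions` holds the integer indices of the actions actually taken, and `rewards` is a float tensor; the stop_gradient on the target is a common implementation detail rather than something spelled out in the lecture:

```python
import tensorflow as tf

def q_loss(dqn, states, actions, rewards, next_states, gamma=0.99):
    """Mean squared error between target and predicted Q-values.

    target = r + gamma * max_a' Q(s', a')   (the "best case" future return)
    """
    # Target: bootstrap from the best Q-value available in the next state.
    next_q = tf.reduce_max(dqn(next_states), axis=1)
    target = rewards + gamma * next_q

    # Prediction: the Q-value of the action that was actually taken.
    q_all = dqn(states)
    predicted = tf.gather(q_all, actions, batch_dims=1)

    return tf.reduce_mean(tf.square(tf.stop_gradient(target) - predicted))
```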
So in summary, let's walk through the whole process of deep Q-learning end to end. Our deep neural network takes a state as input, the state gets fed through the network, and it outputs the Q-value for each of the three possible actions: move left, move right, or stay in the same place. To infer the optimal policy, we look at these Q-values. In this case, because the ball is moving to the left, the network sees that stepping a little to the left gives a higher chance of hitting the ball and continuing the game, so the Q-value for moving left is 20. If it stays in the same place, say the Q-value is 3; and if it moves right, out of the way of the incoming ball, the Q-value is 0.

Those are the Q-values for all the possible actions. How do we compute the optimal policy? As we saw before, we pick the action with the maximum Q-value. Here the maximum is attained by moving left, action 1, so we select action 1, send it back to the game engine, back to the environment, and receive our next state. Then the process repeats: the next state is fed through the deep neural network, we obtain a list of Q-values for each possible action, and so on.

DeepMind showed that these deep Q-networks could be applied to solve a variety of Atari games, not just Breakout. Essentially all they needed to do was provide the state as pixels, pass it through convolutional layers followed by non-linearities and pooling operations, as we learned in lecture three, and on the output side predict the Q-values for each possible action. Exactly as in the previous slides, the agent picks the action with the maximum Q-value, sends it back to the environment to execute, and receives its next state. This is remarkably simple, essentially trial and error, and yet, tested across many Atari games, it surpassed human-level performance on over 50 percent of them. The other games, shown on the right-hand side of the plot, were more challenging, but given how simple and clean this technique is, it is amazing to me that it works at all.

Despite the advantages of this approach, its simplicity, its elegance, and above all its ability to learn superhuman policies on some relatively simple tasks, there are important downsides to Q-learning. First, the model we learned about today can only handle discrete action spaces, and only when the action space is small, with just a few possible actions at each step. It cannot handle continuous action spaces. For example, if an autonomous vehicle wants to decide where to go in the world, instead of predicting left, right, or straight (discrete categories), how can we use reinforcement learning to learn a continuous steering wheel angle, one not discretized into bins but able to take any real number within some bound? That is a continuous variable with an infinite space of values, and it is not possible in the version of Q-learning presented in this lecture. Second, the flexibility of Q-learning is limited because it cannot learn stochastic policies, policies sampled from a probability distribution. The policy is computed deterministically from the Q-function through this maximum formulation: it always picks the action that maximizes the expected return.
So it can't represent stochastic policies. To address these limitations, we'll dive into the next phase of today's lecture, policy gradient methods, which as we'll see tackle these remaining issues.

The key difference between the first part of the lecture and this second part is the following. In value learning, we have a neural network learn the Q-value Q(s, a), and then we use that Q-value to infer the best action to take in a given state; that is our policy. Policy learning is different: it tries to directly learn the policy with our neural network. It takes the state as input and directly outputs the policy that tells us which action to take. This is a lot simpler, since it means we get the action essentially for free, by sampling straight away from the policy the network has learned.

Let's dive into the details of how policy learning works. First I really want to drive in the difference from Q-learning, because it is subtle but very important. Deep Q-networks aim to approximate the Q-function: given a state, they predict the Q-value for each possible action, then simply pick the best action, where best means the one with the maximum Q-value, the maximum expected return, and execute it. The key idea of policy learning is instead to directly optimize the policy pi(s): the distribution that directly governs how we should act given the current state we find ourselves in. The output gives us the desired action in a much more direct way: the outputs represent the probability that each action is the correct one to take at this step, in other words the one that will give us the maximum reward.

For example, suppose the network predicts these probabilities of the given actions being the optimal action. We take the state, and the policy network predicts a probability distribution over actions, which defines our policy. To compute the action to take, we simply sample from this distribution; in this case the sample is the paddle moving left, action a1. But since this is a probability distribution, the next time we sample we might get a different action: we might sample action a2, staying in the same place, because it has a nonzero probability of 0.1. Note that because this is a probability distribution, the probabilities of the actions given our state must sum to one.
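A tiny illustration of that sampling step, with made-up probabilities for the three Breakout actions:

```python
import numpy as np

# Illustrative output of a policy network for one state: a distribution over
# three discrete actions (left, stay, right) that sums to one.
action_probs = np.array([0.8, 0.1, 0.1])

# Unlike Q-learning's argmax, we sample the action, so a lower-probability
# action (e.g. "stay") can still occasionally be chosen.
action = np.random.choice(len(action_probs), p=action_probs)
```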
Now, what are some advantages of this formulation over Q-learning? Beyond the fact that it is a much more direct way to get what we want, directly optimizing the policy rather than optimizing a Q-function and then deriving a policy from it, there is one very important advantage: it can handle continuous action spaces. What we've been working with so far in Atari Breakout is a discrete action space: move left, move right, or stay in the center. Three actions, a finite number. The action space there represents the direction I should move. A continuous action space would instead tell us not just the direction but, for example, how fast to move, as a real number, with infinitely many possible answers: one meter per second to the left, half a meter per second to the left, or any numeric velocity. It also encodes direction through its sign: minus one meter per second means move left at one meter per second, plus one means move right at one meter per second.

When we plot this as a probability distribution, we can visualize the continuous action space using something like a Gaussian distribution. You can choose whatever distribution best fits your problem; a Gaussian is a popular choice because of its simplicity. Here we can see that the probability of moving faster to the left is much greater than moving faster to the right, and the mean of the distribution, the point where it is highest, gives an exact numerical value for how fast to move, not just that we should move left.

Now let's look at how we can model these continuous action spaces with a policy gradient method. Instead of predicting the probability of every possible action given a state, which in the continuous domain would be an infinite number of actions, let's assume the output distribution is a normal (Gaussian) distribution and have the network output a mean and a variance for it. Then we only have two outputs, but they describe a probability distribution over the entire continuous space, which would otherwise have required an infinite number of outputs. For instance, if we predict a mean action mu of -1 and a variance of 0.5, the distribution looks like the one on the bottom left-hand side: the paddle should move to the left with an average speed of one meter per second, with some variance. The model isn't totally confident that this is exactly the right speed, but it's pretty set on moving left. For this picture the paddle does need to move left, and if we plot the distribution we can see that most of its mass lies on the left-hand side of the number line. If we sample from this distribution, we might get, say, -0.8: the concrete velocity to execute indicates we should move left at a speed of 0.8 meters per second.
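A minimal sketch of such a Gaussian policy head, assuming a small vector-valued state and a single continuous action; the layer sizes and the sampled numbers below are illustrative only:

```python
import numpy as np
import tensorflow as tf

# For a continuous action, the network predicts just two numbers, a mean and a
# standard deviation, which together define a distribution over the entire
# (infinite) action space.
state_dim = 8  # illustrative size of the state vector
inputs = tf.keras.Input(shape=(state_dim,))
hidden = tf.keras.layers.Dense(32, activation="relu")(inputs)
mu = tf.keras.layers.Dense(1)(hidden)                            # mean action
sigma = tf.keras.layers.Dense(1, activation="softplus")(hidden)  # std. dev. > 0
policy = tf.keras.Model(inputs, [mu, sigma])

# With the lecture's example numbers (mu = -1, variance = 0.5), a sample might
# come out around -0.8 m/s: left of zero, but not exactly the mean.
action = np.random.normal(loc=-1.0, scale=np.sqrt(0.5))
```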
Note that even though the mean of this distribution is -1, we're not constrained to that exact number; this is a continuous probability distribution, so here we sampled an action that was not exactly the mean, and that's totally fine. That really highlights the difference between the discrete and continuous action spaces, and it opens up a ton of possibilities for applications where we need to model an infinite number of actions. And, like the discrete case, this distribution still has all the nice properties of a probability distribution; in particular, its integral still sums to one, so we can indeed sample from it, which is a nice confirming property.

Okay, great. Let's look at how the policy gradient algorithm works in a concrete example, starting by revisiting the reinforcement learning loop we saw at the very beginning of this lecture and thinking about how we could use the policy gradient algorithm to train an autonomous vehicle through trial and error. In this case study of self-driving cars, what are all the components? The agent is the vehicle. It travels in the environment, which is the world, the lane it's driving in. It has a state, obtained through camera data, lidar, radar, and so on. It takes actions: here the action is the steering wheel angle. Again, this is a concrete example of a continuous action space. You don't discretize the steering wheel angle into bins; it can take any real number between some bounds, so that is a continuous variable we're modeling through this action. Finally, it receives rewards, in the form of the distance it can travel before it needs some form of human intervention.

Now that we've identified all of these pieces, how can we train this car with a policy gradient network? We're using self-driving cars as an example because it's nice and intuitive, but this will apply to any domain where you can set the problem up the way we've set it up so far. We start by initializing our agent, the vehicle, and placing it on the road, in the center of the lane. The next step is to let the agent run. In the beginning it doesn't run very well; it crashes, and it's never been trained before, so we don't expect it to. But that's okay, because this is reinforcement learning. We run that policy until it terminates; here, termination is the moment it crashes and needs to be taken over. Along that rollout, as we call it, we record all of the state, action, reward triplets: at each step, where was the vehicle, what action did it execute, and what reward did it obtain by taking that action in that state.
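A sketch of that rollout-collection step; `env` and `policy` are hypothetical placeholders, with `env` assumed to expose a simple reset()/step() interface where step(action) returns (next_state, reward, done):

```python
def collect_rollout(env, policy):
    """Run the current policy until the episode terminates (e.g. a crash),
    recording every (state, action, reward) triplet along the way.

    Both `env` and `policy` are placeholders: `policy(state)` is assumed to
    return a sampled action, and `env` to follow the simplified interface
    described above.
    """
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    return states, actions, rewards
```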
The next step is to take all of those state-action-reward triplets and decrease the probability of taking any action that occurred close to the time of termination, close to where the crash happened, so we're less likely to take those actions again in the future. Likewise, we increase the probability of taking the actions from the beginning of the episode. Note that we don't actually know there was anything good about the first part of the episode; we're just assuming that because the crash occurred in the second part, it was likely caused by an action taken in that second part. This is, you could say, a very unintelligent algorithm, because that is all it assumes: it decreases the probability of anything that resulted in low reward and increases the probability of anything that resulted in high reward. It doesn't know that any particular action was better than another, especially at the beginning, because it doesn't have that kind of feedback.

If we do this and run the car again, it runs a bit longer. We repeat: on this rollout we again decrease the probability of actions that resulted in low reward and increase the probability of those that resulted in high reward, reinitialize, run until completion, and update the policy again. It runs a bit longer still, and we keep doing this until eventually it learns to follow the lanes without crashing. I think this is really awesome, because we never taught the vehicle anything about lanes; we never taught it what a lane marker is. It learns to stay in the lane and not crash just by observing very sparse rewards from crashing: it observed a lot of crashes, learned not to take the actions that occurred right before them, and by doing so was able to survive in the environment for longer and longer.

The remaining question is how we actually update the policy on each training iteration to decrease the probability of the bad actions and increase the probability of the good ones; that is points four and five of the training algorithm. Let's look at that in more detail, specifically the loss function for training policy gradients, and then dissect it to understand why it works. The loss consists of two parts I'd like to dive into. The first term is the log-likelihood of our policy, the probability of an action given our state. The second term multiplies this negative log-likelihood by the total discounted return R_t. Suppose we get a lot of reward for an action that had high log-likelihood: this loss will reinforce those actions, because they resulted in good returns.
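One possible sketch of that loss in TensorFlow, written for a discrete action space (the continuous Gaussian case would replace the cross-entropy with the Gaussian log-density of the sampled action); `policy_net`, `actions`, and `returns` are illustrative names:

```python
import tensorflow as tf

def policy_gradient_loss(policy_net, states, actions, returns):
    """loss = -log pi(a_t | s_t) * R_t, averaged over a rollout.

    Actions followed by high returns get their log-probability pushed up;
    actions followed by low returns get pushed down. `policy_net` outputs
    action logits, `actions` holds the sampled action indices, and `returns`
    holds the discounted return R_t from each step onward.
    """
    logits = policy_net(states)
    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)  # equals -log pi(a_t | s_t)
    return tf.reduce_mean(neg_log_prob * returns)
```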
On the other hand, if the reward is very low for an action the policy assigned high probability, the probabilities are adjusted so that action is less likely to be sampled again in the future, because it did not lead to a desirable return. When we plug this loss into the gradient descent algorithm to train our neural network, we see the policy gradient term, highlighted in blue, which is where the algorithm gets its name: we compute the gradient over the policy part of this function. To reiterate, this policy gradient term consists of two parts: the likelihood of an action, and the reward. If the action resulted in a good reward, the gradient amplifies it; and if an action was not very probable but did result in a good reward, it gets amplified even further, so something that was improbable before becomes probable because it led to a good return. The same logic applies in reverse.

Now I want to talk a little about how we can extend these reinforcement learning algorithms into real life. This is a particularly challenging question, and one of great interest to the reinforcement learning field right now, because applying these algorithms in the real world is difficult for one main reason: the step of running a policy until termination. I touched on it earlier but didn't really dissect it. Why is this difficult? In the real world, terminating means crashing or dying, usually pretty bad things. We usually get around this by training in simulation, but the problem is that modern simulators do not accurately depict the real world, and policies trained in them don't transfer when deployed: if you train something in simulation it will work very well in simulation, but when you take that policy and deploy it in the real world, it does not work very well.

One really cool result we created in my lab was a brand new, photorealistic simulation engine for self-driving cars called VISTA that is entirely data-driven and enables these types of reinforcement learning advances in the real world. It allows us to use real data of the world to simulate brand new virtual agents inside the simulation. The results are incredibly photorealistic, as you can see, and they allow us to train agents with reinforcement learning in simulation, using exactly the methods we saw today, so that they can be deployed directly into the real world without any transfer learning or domain adaptation. In fact, we did exactly this: we placed agents inside our simulator, trained them using exactly the policy gradient algorithm we learned about in this lecture, with all of the training done in simulation, and then put these policies directly onto our full-scale autonomous vehicle.
In this video, in the interior shot on the left-hand side, you can see me sitting inside the vehicle as it travels through the real world, completely autonomously. At the time we published these results, this represented the first time an autonomous vehicle had been trained using RL entirely in simulation and then deployed in real life, a really awesome result.

Now that we've covered the fundamentals behind value learning and policy gradient approaches to reinforcement learning, I think it's important to touch on some of the remarkable deep reinforcement learning applications we've seen in recent years. For that, we first turn to the game of Go, where reinforcement learning agents were put to the test against human champions and achieved what at the time was, and still is, an extremely exciting result. First, a bit of an introduction to the game of Go. It is played on a 19-by-19 grid between two players, one holding white pieces and one holding black, and the objective is to occupy more board territory with your pieces than your opponent. Even though the grid and the rules are very simple, solving the game of Go well enough to beat grandmasters is an extremely complex problem, because the number of possible board positions, the number of states you can encounter, is massive: on the full-size board there are more legal board positions than there are atoms in the universe. The objective here is to train an AI, a machine learning or deep learning algorithm, that can master the game of Go, not only beating the existing gold-standard software but also the reigning human world champions.

In 2016, Google DeepMind rose to this challenge with a reinforcement-learning-based pipeline that defeated champion Go players, and the idea at its core is very simple and follows everything we've learned in this lecture. First, a neural network was trained by watching many expert human Go players and learning to imitate their behaviors. This part did not use reinforcement learning; it was supervised learning, basically studying a lot of human experts. Then these pre-trained networks were played against reinforcement learning policy networks, which allows the policy to go beyond what the human experts did, play against itself, and achieve superhuman performance. In addition, one of the tricks that made this possible was an auxiliary network that took the state of the board as input and predicted how good that state was. Given this network, the AI could essentially hallucinate different board positions resulting from actions it could take and evaluate how good those actions would be based on the predicted values, effectively allowing it to traverse and plan its way through possible actions based on where it could end up in the future.

Finally, a more recent extension of these approaches, published in 2018 and called AlphaZero, used only self-play and generalized to three famous board games: not just Go, but also chess and shogi.
In these examples the authors demonstrated that it was not necessary to pre-train the networks from human experts; instead, they optimized them entirely from scratch. This is a purely reinforcement-learning-based solution, yet it was able not only to beat humans but also to beat the previous networks that had been pre-trained with human data. Then, as recently as last month, the next breakthrough in this line of work was released: MuZero, where the algorithm learns to master these environments without even knowing the rules.

I think the best way to describe MuZero is to compare and contrast its abilities with the previous advancements we've already discussed today. We started with AlphaGo, which demonstrated superhuman performance on Go using self-play plus pre-training on human grandmaster data. Then came AlphaGo Zero, which showed that even better performance could be achieved entirely on its own, without pre-training from the human grandmasters, by learning from scratch. Then came AlphaZero, which extended this idea beyond Go to chess and shogi, but still required the model to be given the rules of the games in order to learn them. And last month the authors demonstrated superhuman performance on over 50 games without the algorithm knowing the rules beforehand: it had to learn the rules as well as learning how to play optimally during its training process. This is critical because in many scenarios we do not know the rules beforehand to give to the model. Sometimes the rules or the dynamics are unknown, objects may interact stochastically or unpredictably, or we may be in an environment whose rules are simply too complicated for humans to describe. This idea of learning the rules of the game, of the task, is a very powerful concept.

Let's walk very briefly through how this works, because it's such an awesome algorithm, and at its core it builds on everything we've learned today, so you should be able to understand each part of it. We start by observing the board's state, and from this point we perform a tree search through the different possible scenarios that can arise: we consider actions and look at the next possible states they lead to. But since we don't know the rules, the network is forced to learn a dynamics model in order to do this search, that is, to learn what the next states could be given the current state and the action taken. At the base of the tree, this gives us a probability of executing each possible action based on the value attainable down that branch, and the agent uses this to plan the next action to take. This is essentially the policy network we've been learning about, amplified with this tree-search algorithm for planning into the future. Given this policy, it takes the action, receives a new observation from the game, and repeats the process over and over until the game is finished.
This is very similar to how we saw AlphaZero work, but the key difference is that the dynamics model used in the tree search at each of these steps is entirely learned, which greatly opens up the possibilities for applying these techniques outside of rigid game scenarios. In games we know the rules very well, so we can use them to train our algorithms, but in many settings this type of advancement lets us apply these algorithms where we simply don't know the rules, where we need to learn the rules in order to play the game, or where the rules are much harder to define, which in the real world is exactly the case for many of the interesting scenarios.

Let's briefly recap what we've learned in the lecture today. We started with the foundations of deep reinforcement learning: we defined agents, actions, and environments, and how they all interact in the reinforcement learning loop. Then we looked at a broad class of Q-learning problems, specifically the deep Q-network, where we learn a Q-function over state-action pairs and determine a policy by selecting the action that maximizes that Q-function. Finally, we learned how, instead of optimizing the Q-function, we can directly optimize the policy straight from the state, and we saw that this has really impactful applications in continuous action spaces, where Q-learning is somewhat limited.

Thank you for attending this lecture on deep reinforcement learning. At this point we'll move on to the next part of the class, the software lab focused on reinforcement learning, where you'll get hands-on experience applying these algorithms yourself, specifically the policy gradient algorithm, in the context of a very simple example, Pong, as well as more complex examples. You'll build up the body and the brain of the agent and the environment from scratch, and you'll really get to put together a lot of the ideas we've seen today. Please come to the Gather.Town if you have any questions; we'd be happy to discuss the software lab as well as anything from today's lecture. We look forward to seeing you there. Thank you.
Info
Channel: Alexander Amini
Views: 64,263
Rating: 4.9546099 out of 5
Keywords: deep learning, mit, artificial intelligence, neural networks, machine learning, 6s191, 6.s191, mit deep learning, ava soleimany, soleimany, alexander amini, amini, lecture 2, tensorflow, computer vision, deep mind, openai, basics, introduction, deeplearning, ai, tensorflow tutorial, what is deep learning, deep learning basics, deep learning python, reinforcement learning, rl, policy gradient, deep q network, deep q learning, deepmind, alphago, alphazereo
Id: 93M1l_nrhpQ
Channel Id: undefined
Length: 57min 7sec (3427 seconds)
Published: Fri Mar 05 2021