MIT 6.S191: Reinforcement Learning

Video Statistics and Information

Captions
Hi everyone, and welcome back to 6.S191! Today is a really exciting day, because we'll learn how to marry the long-standing field of reinforcement learning with the recent advancements in deep learning we've been seeing in this class, and how combining these two fields lets us build some extraordinary applications: agents that can achieve superhuman performance. I think this field is particularly amazing because it moves away from the paradigm we've been constrained to so far. Until now, deep learning as we've seen it has been confined to fixed datasets that we collect or obtain online. In reinforcement learning, a deep learning model is placed in an environment, explores and interacts with that environment, and learns how to best accomplish its goal, usually without any human supervision or guidance, which makes it extremely powerful and flexible.

This has obvious impact in fields like robotics, self-driving cars, and robot manipulation, but it has also revolutionized the world of gameplay and strategic planning. It's this connection between deep learning, the real world, and the virtual world that makes it particularly exciting to me, and I hope the video I'm going to show you next conveys that as well.

"StarCraft has imperfect information and is played in real time. It also requires long-term planning and the ability to choose what action to take from millions and millions of possibilities." "I'm hoping for a 5-0, not to lose any games, but I think the realistic goal would be four and one in my favor." "I think he looks more confident than __. He was quite nervous before; the room was much more tense this time. We really didn't know what to expect. He's been playing StarCraft pretty much since his fight." "I wasn't expecting the AI to be that good. Everything that he did was proper, it was calculated, and it was done well. I thought, I'm learning something." "It's much better than I expected. I would consider myself a good player, right? But I lost every single one of five games."

So this is an example of how deep learning was used to compete against professionally trained game players, and not only to compete against them but to achieve remarkably superhuman performance, beating this professional StarCraft player five games to zero.

Let's take a step back and see how reinforcement learning fits with respect to the other types of learning problems we've seen so far in this class. The first and most comprehensive class of learning problems we've been exploring is supervised learning; this is roughly what we covered in the first, second, and third lectures. In this domain we're given data x and we try to learn a neural network to predict its label y, so the goal is to learn the functional mapping from x to y. Intuitively: if I give you a picture of an apple, I want to train a neural network to tell me that this thing is an apple.

The next class of algorithms, which we learned about in the last lecture, is unsupervised learning.
In that case we're only given data with no labels, say a bunch of images of apples, and we have to learn a model that captures the underlying structure of the dataset. In the apple scenario, we try to learn a model that, if we show it two pictures of apples, tells us that these things are basically like each other. It doesn't know they're apples, because it was never given labels that explicitly say so, but it can tell that this thing is pretty close to this other thing it has also seen, and it can pick out the underlying structure shared between the two.

In reinforcement learning (RL), which is what today's lecture focuses on, we're given data only in the form of what we call state-action pairs. States are the observations of the system, and actions are the behaviors the agent takes when it sees those states. The goal of RL is very different from that of supervised or unsupervised learning: it is to maximize the future reward of the agent in its environment over many time steps. Going back to the apple example, the analog would be that the agent learns it should eat this thing because eating it has kept it alive, made it healthier, and it needs food to survive. Like the unsupervised case, it doesn't know that this thing is an apple; it doesn't even recognize exactly what it is. All it knows is that in the past it ate it, survived longer, and became healthier. Through these state-action pairs, and a lot of trial and error, it learns these representations and these plans.

Our focus today is on this third class of learning problems, reinforcement learning. Before we dive into the technical details, it's really important to build up some key vocabulary, because we're going to build on each of these concepts later in the lecture. This is a very important part of the lecture, so I want to go through it slowly so that the rest makes as much sense as possible.

Let's start with the central part, the core of your reinforcement learning algorithm: the agent. The agent is something that can take actions in the environment. It could be a drone making a delivery in the world, or Super Mario navigating a video game. The algorithm in reinforcement learning is your agent, and you could say that in real life the agent is each of you.

The next piece is the environment: simply the world in which the agent lives, the place where it exists, operates, and carries out all of its actions. And that is exactly the connection between the two: the agent sends commands to the environment in the form of actions. We write a_t for the action the agent takes at time t, and we denote by capital A the action space, the set of all possible actions an agent can make in the environment.
Even though this is somewhat self-explanatory, I do want to point out that the action space can be either discrete, chosen from a finite set such as forward, backward, left, or right, or continuous, for example the exact location in the environment the agent wants to move to, as real-numbered coordinates like GPS coordinates. So actions can come from a discrete (categorical) distribution or from a continuous one.

Observations are how the environment interacts back with the agent: they are how the agent can observe where it is in the environment and how its actions have affected its own state. That leads nicely into the next point. The state is the concrete, immediate situation in which the agent finds itself. For example, a state could be the image feed you see through your eyes: the state of the world as you observe it.

A reward is feedback from the environment to the agent; it's how the environment measures the success or failure of the agent's actions. For example, in a video game, when Mario touches a coin he wins points. From a given state, an agent sends outputs in the form of actions to the environment, and the environment responds with the agent's new state, the one that results from acting in the previous state, along with any rewards collected (or penalties incurred) by reaching that state. It's important to note that rewards can be either immediate or delayed. They evaluate the agent's actions, but you may not actually receive a reward until much later; you might take many actions and only be rewarded far in the future. That is a very delayed reward, but it is a reward nonetheless.

We can also look at the total reward, which is just the sum of all rewards the agent collects after a certain time t: if r_i is the reward at time i, then R_t is the return, the total reward from time t onward. Expanding the summation, R_t = r_t + r_(t+1) + r_(t+2) + ..., adding up all the rewards the agent collects from this point into the future. Often, however, rather than this straight sum, we consider the discounted sum of rewards. The discounting factor gamma, typically between 0 and 1, is multiplied into the future rewards discovered by the agent to dampen their effect on the agent's choice of action: R_t = r_t + gamma * r_(t+1) + gamma^2 * r_(t+2) + .... Why would we want this? The formulation is designed by construction to make future rewards less important than immediate rewards; in other words, it biases the agent toward near-term rewards.
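As a rough illustration of this discounted return, here is a minimal NumPy sketch (not from the lecture); the reward values and the choice of gamma are made up for the example:

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Compute R_t = sum_k gamma^k * r_(t+k), starting at t = 0, for a finite rollout."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

# The same +5 reward is worth less the later it arrives.
print(discounted_return([5.0, 0.0, 0.0]))   # reward received now          -> 5.0
print(discounted_return([0.0, 0.0, 5.0]))   # reward received 2 steps later -> ~4.51
```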
A concrete example of this: if I offered to give you five dollars today or five dollars in five years, which would you take? The reward is the same five dollars either way, but you would prefer to have it today; you prefer short-term rewards over long-term rewards, and the discount factor encodes exactly that preference.

Finally, there is a very important function in RL that starts to put these pieces together: the Q-function. Remember the definition of the total discounted reward R_t, which measures the discounted sum of rewards obtained from time t onward. The Q-function is closely related to it. The Q-function takes as input the current state the agent is in and the action the agent takes in that state, and it returns the expected total future reward the agent can receive after that point. Think of it this way: if an agent finds itself in some state and takes some action, what return can it expect to receive from that point on? That is what the Q-function tells us.

Now suppose I give you this magical Q-function, and it really is magical, because it tells us a lot about the problem. It's an oracle: you can plug in any state-action pair and it tells you the expected return from your current time point onward. The question is, given the state you're currently in, can you determine the best action to take? You can perform any queries you like on this function. Ultimately you want to select the best action for your current state, and the best action is simply the one that results in the highest expected total return. So you choose a policy that maximizes this future return, which can be written as taking the argmax of your Q-function over all possible actions available in that state. In other words, feed in your state paired with every possible action, evaluate the expected total reward the Q-function gives for each state-action pair, and pick the action with the highest Q-value; that is the best action to take in the current state. This gives us a policy, which we call pi(s), that infers the best action to take. Think of the policy as another function that takes your state as input and tells you the action to execute in that state. Given a Q-function, the strategy for computing your policy is simply this argmax formulation: find the action that maximizes your Q-function.
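To make this concrete, here is a small sketch of that argmax strategy in Python; `q_function` is a stand-in for whatever oracle or model supplies Q(s, a), and the names are purely illustrative:

```python
import numpy as np

def greedy_policy(state, q_function, actions):
    """Given a Q-function oracle, return the action with the highest Q(s, a).

    `q_function(state, action)` is a placeholder for whatever model or table
    provides the expected return; `actions` is the (discrete) set of actions.
    """
    q_values = [q_function(state, a) for a in actions]
    return actions[int(np.argmax(q_values))]
```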
In this lecture we're going to focus on two classes of reinforcement learning algorithms. The first, value learning, tries to learn this Q-function, Q of your state and your action. The second class, policy learning algorithms, tries to learn the policy directly instead of using a Q-function to infer it: in policy learning we directly infer the policy pi(s) that governs which actions we should take. That is a much more direct way of thinking about the problem. But first we're going to focus on the value learning problem and what is called Q-learning, and then we'll build up to policy learning after that.

Let's start by digging a bit deeper into this Q-function. First I'll introduce the game of Atari Breakout, shown on the left; for those who haven't seen it, here is a brief introduction to how the game works. In this game, the agent is the paddle at the bottom of the board. It can move left or right, or stay in the same place, so it has three actions. In the environment there is also a ball, which in this case is traveling down toward the bottom of the board and is about to hit and ricochet off the paddle. The objective of the game is to move the paddle back and forth and hit the ball at the right time so it bounces up and breaks out the colored blocks at the top of the board. Each time the ball touches one of those blocks, the block is broken out (hence the name Breakout) and disappears, and the goal is to keep moving around and hitting the ball until you've knocked out all of the blocks.

The Q-function tells us the expected total return for a given state-action pair, and the point I'd like to make here is that it can sometimes be very challenging to intuitively guess the Q-value for a given state-action pair. Say I give you these two state-action pairs, option A and option B, and ask which one has the higher Q-value. In option A, the ball is already traveling toward the paddle and the paddle chooses to stay in the same place; it will probably hit the ball, which will bounce back up and break some blocks. In option B, the ball is coming in at an angle and the paddle is moving to the right to hit it. Which of these two state-action pairs do you believe will return the higher expected total reward?

Before I give you the answer, I want to show what these two policies actually look like when they play the game, rather than just this single state-action pair. Let's look first at option A. Option A is the relatively conservative option that doesn't move when the ball is traveling right toward it, and as it plays the game it actually does pretty well: it hits off a lot of the breakout pieces toward the center of the board and breaks out a lot of the colored blocks. But let's also take a look at option B, which does something really interesting.
Option B really likes to hit the ball at the corner of the paddle, so that the ball ricochets off at an extreme angle and breaks off blocks in the corners of the screen. It does this to the extreme: even when the ball is coming right toward it, it will move out of the way just so it can come back in and hit the ball at these extreme ricocheting angles. Let's take a look at how option B performs when it plays the game.

You can see it's really targeting the side of the paddle and knocking off a lot of those colored blocks. Why? Once it breaks out the corners, the two edges of the screen, the ball gets stuck in that top region and knocks off a ton of blocks. Let's take another look: once the ball is stuck up there, the agent doesn't have to worry about taking any actions at all; it's just accumulating a ton of reward. This is a great policy to learn, because it beats the game much faster than option A, and with much less effort. So the answer to the question of which state-action pair has the higher Q-value is option B. That was a relatively unintuitive answer, at least for me when I first saw this problem; I would have expected that not moving out of the way of a ball coming right toward you would be the better action. But this agent learned to move away from the ball just so it can come back, hit it, and attack at extreme angles. That is a very interesting behavior the agent discovered through learning.

Because Q-values are so difficult for humans to define, as we saw in this example, the question becomes: instead of having humans define the Q-function, how can we use deep neural networks to model it and learn it instead? The Q-value is a function that takes a state and an action as input, so one thing we could do is build a deep neural network that takes both the state and the candidate action as input and train it to predict the Q-value for that state-action pair, a single number. The problem is that this is rather inefficient to run forward in time. Remember how we compute the policy for this model: to predict the optimal action in a given state, we would need to evaluate this deep Q-network n times, where n is the number of possible actions, at every single time step. Instead, in a formulation that is equivalent in spirit but slightly different, it is often much more convenient to output all of the Q-values at once. You input the state, and the network outputs a vector of Q-values, one for each possible action: the Q-value for action 1, the Q-value for action 2, all the way up to your final action.
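A minimal sketch of what such a network might look like in Keras, assuming an Atari-style stack of 84x84 grayscale frames as the state and three actions; the exact layer sizes here are illustrative, not necessarily the ones DeepMind used:

```python
import tensorflow as tf

num_actions = 3  # e.g. move left, stay, move right in Breakout

# The state (a stack of game frames) goes in, and a vector of Q-values,
# one per action, comes out in a single forward pass.
dqn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu",
                           input_shape=(84, 84, 4)),
    tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(num_actions),  # Q(s, a_1), ..., Q(s, a_n)
])
```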
Given the state you're currently in, this output gives you the breakdown of Q-values across the different actions you could take. Now, how can we actually train this version of the deep Q-network? We know we want it to output these Q-values, but it isn't obvious how to train it, and this is genuinely challenging conceptually, because we don't have a dataset of Q-values; all we have are observations in the form of state, action, reward triplets. To train this kind of deep Q-network, we think about the best-case scenario: how would the agent perform if it acted optimally and took all the best actions? That would mean the target return is maximized, and we can use exactly this target return to serve as our ground truth, our dataset in some sense, for training the deep Q-network.

Concretely, we formulate the expected return if we were to take all the best actions: the initial reward r, plus the discounted value of the action that maximizes the expected return in the next state. This is our target, the Q-value we want our prediction to match. And what does our network predict? As we've seen, it predicts the Q-value for a given state-action pair. We can use these two pieces of information, the predicted Q-value and the target Q-value, to form what we call the Q-loss: essentially a mean squared error between the target and predicted Q-values, and we can use that to train the deep Q-network.
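Here is one way that Q-loss could be sketched in TensorFlow, assuming `dqn` is a network like the one above, `actions` holds the integer indices of the actions actually taken, and `rewards` is a float tensor; the stop_gradient on the target is a common implementation detail rather than something spelled out in the lecture:

```python
import tensorflow as tf

def q_loss(dqn, states, actions, rewards, next_states, gamma=0.99):
    """Mean squared error between target and predicted Q-values.

    target = r + gamma * max_a' Q(s', a')   (the "best case" future return)
    """
    # Target: bootstrap from the best Q-value available in the next state.
    next_q = tf.reduce_max(dqn(next_states), axis=1)
    target = rewards + gamma * next_q

    # Prediction: the Q-value of the action that was actually taken.
    q_all = dqn(states)
    predicted = tf.gather(q_all, actions, batch_dims=1)

    return tf.reduce_mean(tf.square(tf.stop_gradient(target) - predicted))
```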
So in summary, let's walk through the whole process of deep Q-learning end to end. Our deep neural network takes a state as input, the state gets fed through the network, and it outputs the Q-value for each of the three possible actions: move left, move right, or stay in the same place. To infer the optimal policy, we look at these Q-values. In this case, because the ball is moving to the left, the network sees that stepping a little to the left gives a higher chance of hitting the ball and continuing the game, so the Q-value for moving left is 20. If it stays in the same place, say the Q-value is 3; and if it moves right, out of the way of the incoming ball, the Q-value is 0.

Those are the Q-values for all the possible actions. How do we compute the optimal policy? As we saw before, we pick the action with the maximum Q-value. Here the maximum is attained by moving left, action 1, so we select action 1, send it back to the game engine, back to the environment, and receive our next state. Then the process repeats: the next state is fed through the deep neural network, we obtain a list of Q-values for each possible action, and so on.

DeepMind showed that these deep Q-networks could be applied to solve a variety of Atari games, not just Breakout. Essentially all they needed to do was provide the state as pixels, pass it through convolutional layers followed by non-linearities and pooling operations, as we learned in lecture three, and on the output side predict the Q-values for each possible action. Exactly as in the previous slides, the agent picks the action with the maximum Q-value, sends it back to the environment to execute, and receives its next state. This is remarkably simple, essentially trial and error, and yet, tested across many Atari games, it surpassed human-level performance on over 50 percent of them. The other games, shown on the right-hand side of the plot, were more challenging, but given how simple and clean this technique is, it is amazing to me that it works at all.

Despite the advantages of this approach, its simplicity, its elegance, and above all its ability to learn superhuman policies on some relatively simple tasks, there are important downsides to Q-learning. First, the model we learned about today can only handle discrete action spaces, and only when the action space is small, with just a few possible actions at each step. It cannot handle continuous action spaces. For example, if an autonomous vehicle wants to decide where to go in the world, instead of predicting left, right, or straight (discrete categories), how can we use reinforcement learning to learn a continuous steering wheel angle, one not discretized into bins but able to take any real number within some bound? That is a continuous variable with an infinite space of values, and it is not possible in the version of Q-learning presented in this lecture. Second, the flexibility of Q-learning is limited because it cannot learn stochastic policies, policies sampled from a probability distribution. The policy is computed deterministically from the Q-function through this maximum formulation: it always picks the action that maximizes the expected return.
So it can't represent stochastic policies. To address these limitations, we'll dive into the next phase of today's lecture, policy gradient methods, which as we'll see tackle these remaining issues.

The key difference between the first part of the lecture and this second part is the following. In value learning, we have a neural network learn the Q-value Q(s, a), and then we use that Q-value to infer the best action to take in a given state; that is our policy. Policy learning is different: it tries to directly learn the policy with our neural network. It takes the state as input and directly outputs the policy that tells us which action to take. This is a lot simpler, since it means we get the action essentially for free, by sampling straight away from the policy the network has learned.

Let's dive into the details of how policy learning works. First I really want to drive in the difference from Q-learning, because it is subtle but very important. Deep Q-networks aim to approximate the Q-function: given a state, they predict the Q-value for each possible action, then simply pick the best action, where best means the one with the maximum Q-value, the maximum expected return, and execute it. The key idea of policy learning is instead to directly optimize the policy pi(s): the distribution that directly governs how we should act given the current state we find ourselves in. The output gives us the desired action in a much more direct way: the outputs represent the probability that each action is the correct one to take at this step, in other words the one that will give us the maximum reward.

For example, suppose the network predicts these probabilities of the given actions being the optimal action. We take the state, and the policy network predicts a probability distribution over actions, which defines our policy. To compute the action to take, we simply sample from this distribution; in this case the sample is the paddle moving left, action a1. But since this is a probability distribution, the next time we sample we might get a different action: we might sample action a2, staying in the same place, because it has a nonzero probability of 0.1. Note that because this is a probability distribution, the probabilities of the actions given our state must sum to one.
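A tiny illustration of that sampling step, with made-up probabilities for the three Breakout actions:

```python
import numpy as np

# Illustrative output of a policy network for one state: a distribution over
# three discrete actions (left, stay, right) that sums to one.
action_probs = np.array([0.8, 0.1, 0.1])

# Unlike Q-learning's argmax, we sample the action, so a lower-probability
# action (e.g. "stay") can still occasionally be chosen.
action = np.random.choice(len(action_probs), p=action_probs)
```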
Now, what are some advantages of this formulation over Q-learning? Beyond the fact that it is a much more direct way to get what we want, directly optimizing the policy rather than optimizing a Q-function and then deriving a policy from it, there is one very important advantage: it can handle continuous action spaces. What we've been working with so far in Atari Breakout is a discrete action space: move left, move right, or stay in the center. Three actions, a finite number. The action space there represents the direction I should move. A continuous action space would instead tell us not just the direction but, for example, how fast to move, as a real number, with infinitely many possible answers: one meter per second to the left, half a meter per second to the left, or any numeric velocity. It also encodes direction through its sign: minus one meter per second means move left at one meter per second, plus one means move right at one meter per second.

When we plot this as a probability distribution, we can visualize the continuous action space using something like a Gaussian distribution. You can choose whatever distribution best fits your problem; a Gaussian is a popular choice because of its simplicity. Here we can see that the probability of moving faster to the left is much greater than moving faster to the right, and the mean of the distribution, the point where it is highest, gives an exact numerical value for how fast to move, not just that we should move left.

Now let's look at how we can model these continuous action spaces with a policy gradient method. Instead of predicting the probability of every possible action given a state, which in the continuous domain would be an infinite number of actions, let's assume the output distribution is a normal (Gaussian) distribution and have the network output a mean and a variance for it. Then we only have two outputs, but they describe a probability distribution over the entire continuous space, which would otherwise have required an infinite number of outputs. For instance, if we predict a mean action mu of -1 and a variance of 0.5, the distribution looks like the one on the bottom left-hand side: the paddle should move to the left with an average speed of one meter per second, with some variance. The model isn't totally confident that this is exactly the right speed, but it's pretty set on moving left. For this picture the paddle does need to move left, and if we plot the distribution we can see that most of its mass lies on the left-hand side of the number line. If we sample from this distribution, we might get, say, -0.8: the concrete velocity to execute indicates we should move left at a speed of 0.8 meters per second.
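A minimal sketch of such a Gaussian policy head, assuming a small vector-valued state and a single continuous action; the layer sizes and the sampled numbers below are illustrative only:

```python
import numpy as np
import tensorflow as tf

# For a continuous action, the network predicts just two numbers, a mean and a
# standard deviation, which together define a distribution over the entire
# (infinite) action space.
state_dim = 8  # illustrative size of the state vector
inputs = tf.keras.Input(shape=(state_dim,))
hidden = tf.keras.layers.Dense(32, activation="relu")(inputs)
mu = tf.keras.layers.Dense(1)(hidden)                            # mean action
sigma = tf.keras.layers.Dense(1, activation="softplus")(hidden)  # std. dev. > 0
policy = tf.keras.Model(inputs, [mu, sigma])

# With the lecture's example numbers (mu = -1, variance = 0.5), a sample might
# come out around -0.8 m/s: left of zero, but not exactly the mean.
action = np.random.normal(loc=-1.0, scale=np.sqrt(0.5))
```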
Note that even though the mean of this distribution is -1, we're not constrained to that exact number; this is a continuous probability distribution, so here we sampled an action that was not exactly the mean, and that's totally fine. That really highlights the difference between the discrete and continuous action spaces, and it opens up a ton of possibilities for applications where we need to model an infinite number of actions. And, like the discrete case, this distribution still has all the nice properties of a probability distribution; in particular, its integral still sums to one, so we can indeed sample from it, which is a nice confirming property.

Okay, great. Let's look at how the policy gradient algorithm works in a concrete example, starting by revisiting the reinforcement learning loop we saw at the very beginning of this lecture and thinking about how we could use the policy gradient algorithm to train an autonomous vehicle through trial and error. In this case study of self-driving cars, what are all the components? The agent is the vehicle. It travels in the environment, which is the world, the lane it's driving in. It has a state, obtained through camera data, lidar, radar, and so on. It takes actions: here the action is the steering wheel angle. Again, this is a concrete example of a continuous action space. You don't discretize the steering wheel angle into bins; it can take any real number between some bounds, so that is a continuous variable we're modeling through this action. Finally, it receives rewards, in the form of the distance it can travel before it needs some form of human intervention.

Now that we've identified all of these pieces, how can we train this car with a policy gradient network? We're using self-driving cars as an example because it's nice and intuitive, but this will apply to any domain where you can set the problem up the way we've set it up so far. We start by initializing our agent, the vehicle, and placing it on the road, in the center of the lane. The next step is to let the agent run. In the beginning it doesn't run very well; it crashes, and it's never been trained before, so we don't expect it to. But that's okay, because this is reinforcement learning. We run that policy until it terminates; here, termination is the moment it crashes and needs to be taken over. Along that rollout, as we call it, we record all of the state, action, reward triplets: at each step, where was the vehicle, what action did it execute, and what reward did it obtain by taking that action in that state.
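A sketch of that rollout-collection step; `env` and `policy` are hypothetical placeholders, with `env` assumed to expose a simple reset()/step() interface where step(action) returns (next_state, reward, done):

```python
def collect_rollout(env, policy):
    """Run the current policy until the episode terminates (e.g. a crash),
    recording every (state, action, reward) triplet along the way.

    Both `env` and `policy` are placeholders: `policy(state)` is assumed to
    return a sampled action, and `env` to follow the simplified interface
    described above.
    """
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    return states, actions, rewards
```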
The next step is to take all of those state-action-reward triplets and decrease the probability of taking any action that occurred close to the time of termination, close to where the crash happened, so we're less likely to take those actions again in the future. Likewise, we increase the probability of taking the actions from the beginning of the episode. Note that we don't actually know there was anything good about the first part of the episode; we're just assuming that because the crash occurred in the second part, it was likely caused by an action taken in that second part. This is, you could say, a very unintelligent algorithm, because that is all it assumes: it decreases the probability of anything that resulted in low reward and increases the probability of anything that resulted in high reward. It doesn't know that any particular action was better than another, especially at the beginning, because it doesn't have that kind of feedback.

If we do this and run the car again, it runs a bit longer. We repeat: on this rollout we again decrease the probability of actions that resulted in low reward and increase the probability of those that resulted in high reward, reinitialize, run until completion, and update the policy again. It runs a bit longer still, and we keep doing this until eventually it learns to follow the lanes without crashing. I think this is really awesome, because we never taught the vehicle anything about lanes; we never taught it what a lane marker is. It learns to stay in the lane and not crash just by observing very sparse rewards from crashing: it observed a lot of crashes, learned not to take the actions that occurred right before them, and by doing so was able to survive in the environment for longer and longer.

The remaining question is how we actually update the policy on each training iteration to decrease the probability of the bad actions and increase the probability of the good ones; that is points four and five of the training algorithm. Let's look at that in more detail, specifically the loss function for training policy gradients, and then dissect it to understand why it works. The loss consists of two parts I'd like to dive into. The first term is the log-likelihood of our policy, the probability of an action given our state. The second term multiplies this negative log-likelihood by the total discounted return R_t. Suppose we get a lot of reward for an action that had high log-likelihood: this loss will reinforce those actions, because they resulted in good returns.
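One possible sketch of that loss in TensorFlow, written for a discrete action space (the continuous Gaussian case would replace the cross-entropy with the Gaussian log-density of the sampled action); `policy_net`, `actions`, and `returns` are illustrative names:

```python
import tensorflow as tf

def policy_gradient_loss(policy_net, states, actions, returns):
    """loss = -log pi(a_t | s_t) * R_t, averaged over a rollout.

    Actions followed by high returns get their log-probability pushed up;
    actions followed by low returns get pushed down. `policy_net` outputs
    action logits, `actions` holds the sampled action indices, and `returns`
    holds the discounted return R_t from each step onward.
    """
    logits = policy_net(states)
    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)  # equals -log pi(a_t | s_t)
    return tf.reduce_mean(neg_log_prob * returns)
```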
On the other hand, if the reward is very low for an action the policy assigned high probability, the probabilities are adjusted so that action is less likely to be sampled again in the future, because it did not lead to a desirable return. When we plug this loss into the gradient descent algorithm to train our neural network, we see the policy gradient term, highlighted in blue, which is where the algorithm gets its name: we compute the gradient over the policy part of this function. To reiterate, this policy gradient term consists of two parts: the likelihood of an action, and the reward. If the action resulted in a good reward, the gradient amplifies it; and if an action was not very probable but did result in a good reward, it gets amplified even further, so something that was improbable before becomes probable because it led to a good return. The same logic applies in reverse.

Now I want to talk a little about how we can extend these reinforcement learning algorithms into real life. This is a particularly challenging question, and one of great interest to the reinforcement learning field right now, because applying these algorithms in the real world is difficult for one main reason: the step of running a policy until termination. I touched on it earlier but didn't really dissect it. Why is this difficult? In the real world, terminating means crashing or dying, usually pretty bad things. We usually get around this by training in simulation, but the problem is that modern simulators do not accurately depict the real world, and policies trained in them don't transfer when deployed: if you train something in simulation it will work very well in simulation, but when you take that policy and deploy it in the real world, it does not work very well.

One really cool result we created in my lab was a brand new, photorealistic simulation engine for self-driving cars called VISTA that is entirely data-driven and enables these types of reinforcement learning advances in the real world. It allows us to use real data of the world to simulate brand new virtual agents inside the simulation. The results are incredibly photorealistic, as you can see, and they allow us to train agents with reinforcement learning in simulation, using exactly the methods we saw today, so that they can be deployed directly into the real world without any transfer learning or domain adaptation. In fact, we did exactly this: we placed agents inside our simulator, trained them using exactly the policy gradient algorithm we learned about in this lecture, with all of the training done in simulation, and then put these policies directly onto our full-scale autonomous vehicle.
In this video, in the interior shot on the left-hand side, you can see me sitting inside the vehicle as it travels through the real world, completely autonomously. At the time we published these results, this represented the first time an autonomous vehicle had been trained using RL entirely in simulation and then deployed in real life, a really awesome result.

Now that we've covered the fundamentals behind value learning and policy gradient approaches to reinforcement learning, I think it's important to touch on some of the remarkable deep reinforcement learning applications we've seen in recent years. For that, we first turn to the game of Go, where reinforcement learning agents were put to the test against human champions and achieved what at the time was, and still is, an extremely exciting result. First, a bit of an introduction to the game of Go. It is played on a 19-by-19 grid between two players, one holding white pieces and one holding black, and the objective is to occupy more board territory with your pieces than your opponent. Even though the grid and the rules are very simple, solving the game of Go well enough to beat grandmasters is an extremely complex problem, because the number of possible board positions, the number of states you can encounter, is massive: on the full-size board there are more legal board positions than there are atoms in the universe. The objective here is to train an AI, a machine learning or deep learning algorithm, that can master the game of Go, not only beating the existing gold-standard software but also the reigning human world champions.

In 2016, Google DeepMind rose to this challenge with a reinforcement-learning-based pipeline that defeated champion Go players, and the idea at its core is very simple and follows everything we've learned in this lecture. First, a neural network was trained by watching many expert human Go players and learning to imitate their behaviors. This part did not use reinforcement learning; it was supervised learning, basically studying a lot of human experts. Then these pre-trained networks were played against reinforcement learning policy networks, which allows the policy to go beyond what the human experts did, play against itself, and achieve superhuman performance. In addition, one of the tricks that made this possible was an auxiliary network that took the state of the board as input and predicted how good that state was. Given this network, the AI could essentially hallucinate different board positions resulting from actions it could take and evaluate how good those actions would be based on the predicted values, effectively allowing it to traverse and plan its way through possible actions based on where it could end up in the future.

Finally, a more recent extension of these approaches, published in 2018 and called AlphaZero, used only self-play and generalized to three famous board games: not just Go, but also chess and shogi.
In these examples the authors demonstrated that it was not necessary to pre-train the networks from human experts; instead, they optimized them entirely from scratch. This is a purely reinforcement-learning-based solution, yet it was able not only to beat humans but also to beat the previous networks that had been pre-trained with human data. Then, as recently as last month, the next breakthrough in this line of work was released: MuZero, where the algorithm learns to master these environments without even knowing the rules.

I think the best way to describe MuZero is to compare and contrast its abilities with the previous advancements we've already discussed today. We started with AlphaGo, which demonstrated superhuman performance on Go using self-play plus pre-training on human grandmaster data. Then came AlphaGo Zero, which showed that even better performance could be achieved entirely on its own, without pre-training from the human grandmasters, by learning from scratch. Then came AlphaZero, which extended this idea beyond Go to chess and shogi, but still required the model to be given the rules of the games in order to learn them. And last month the authors demonstrated superhuman performance on over 50 games without the algorithm knowing the rules beforehand: it had to learn the rules as well as learning how to play optimally during its training process. This is critical because in many scenarios we do not know the rules beforehand to give to the model. Sometimes the rules or the dynamics are unknown, objects may interact stochastically or unpredictably, or we may be in an environment whose rules are simply too complicated for humans to describe. This idea of learning the rules of the game, of the task, is a very powerful concept.

Let's walk very briefly through how this works, because it's such an awesome algorithm, and at its core it builds on everything we've learned today, so you should be able to understand each part of it. We start by observing the board's state, and from this point we perform a tree search through the different possible scenarios that can arise: we consider actions and look at the next possible states they lead to. But since we don't know the rules, the network is forced to learn a dynamics model in order to do this search, that is, to learn what the next states could be given the current state and the action taken. At the base of the tree, this gives us a probability of executing each possible action based on the value attainable down that branch, and the agent uses this to plan the next action to take. This is essentially the policy network we've been learning about, amplified with this tree-search algorithm for planning into the future. Given this policy, it takes the action, receives a new observation from the game, and repeats the process over and over until the game is finished.
This is very similar to how we saw AlphaZero work, but the key difference is that the dynamics model used in the tree search at each of these steps is entirely learned, which greatly opens up the possibilities for applying these techniques outside of rigid game scenarios. In games we know the rules very well, so we can use them to train our algorithms, but in many settings this type of advancement lets us apply these algorithms where we simply don't know the rules, where we need to learn the rules in order to play the game, or where the rules are much harder to define, which in the real world is exactly the case for many of the interesting scenarios.

Let's briefly recap what we've learned in the lecture today. We started with the foundations of deep reinforcement learning: we defined agents, actions, and environments, and how they all interact in the reinforcement learning loop. Then we looked at a broad class of Q-learning problems, specifically the deep Q-network, where we learn a Q-function over state-action pairs and determine a policy by selecting the action that maximizes that Q-function. Finally, we learned how, instead of optimizing the Q-function, we can directly optimize the policy straight from the state, and we saw that this has really impactful applications in continuous action spaces, where Q-learning is somewhat limited.

Thank you for attending this lecture on deep reinforcement learning. At this point we'll move on to the next part of the class, the software lab focused on reinforcement learning, where you'll get hands-on experience applying these algorithms yourself, specifically the policy gradient algorithm, in the context of a very simple example, Pong, as well as more complex examples. You'll build up the body and the brain of the agent and the environment from scratch, and you'll really get to put together a lot of the ideas we've seen today. Please come to the Gather.Town if you have any questions; we'd be happy to discuss the software lab as well as anything from today's lecture. We look forward to seeing you there. Thank you.
Info
Channel: Alexander Amini
Views: 64,263
Rating: 4.9546099 out of 5
Keywords: deep learning, mit, artificial intelligence, neural networks, machine learning, 6s191, 6.s191, mit deep learning, ava soleimany, soleimany, alexander amini, amini, lecture 2, tensorflow, computer vision, deep mind, openai, basics, introduction, deeplearning, ai, tensorflow tutorial, what is deep learning, deep learning basics, deep learning python, reinforcement learning, rl, policy gradient, deep q network, deep q learning, deepmind, alphago, alphazereo
Id: 93M1l_nrhpQ
Channel Id: undefined
Length: 57min 7sec (3427 seconds)
Published: Fri Mar 05 2021