MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL)

Video Statistics and Information

Captions
Today I'd like to overview the exciting field of deep reinforcement learning — introduce it, give an overview, and provide some of the basics. I think it's one of the most exciting fields in artificial intelligence: it's marrying the power of deep neural networks to represent and comprehend the world with the ability to act on that understanding, on that representation. Taken as a whole, that's really what the creation of intelligent beings is: understand the world and act. The exciting breakthroughs that have recently happened captivate our imagination about what's possible, and that's why this is my favorite area of deep learning and of artificial intelligence in general — I hope you feel the same. So what is deep reinforcement learning? We've talked about deep learning, which is taking samples of data and, in a supervised way, compressing and encoding a representation of that data such that you can reason about it. Deep RL takes that power and applies it to worlds where sequential decisions are to be made: problems and formulations of tasks where an agent, an intelligent system, has to make a sequence of decisions, and the decisions that are made have an effect on the world around the agent. How do any of us — any intelligent being tasked with operating in the world — learn anything, especially when we know very little in the beginning? Trial and error is the fundamental process by which reinforcement learning agents learn, and the "deep" part of deep reinforcement learning is using neural networks within the reinforcement learning framework, where the neural network forms the representation of the world based on which the actions are made. We have to take a step back when we look at the types of learning; sometimes the terminology itself can confuse us about the fundamentals. There's supervised learning, there's semi-supervised learning, there's unsupervised learning, there's reinforcement learning, and there's this feeling that supervised learning is the only one where you have to perform manual annotation, where you have to do large-scale supervision. That's not the case: every type of machine learning is supervised learning. It's supervised by a loss function, or some function that tells you what's good and what's bad. Even looking at our own existence — how do we humans figure out what's good and bad? There are all kinds of sources, direct and indirect, by which our morals and ethics, our sense of what's good and bad, are formed. The difference between supervised, unsupervised, and reinforcement learning is the source of that supervision. What's implied when you say "unsupervised" is that the cost of human labor required to attain the supervision is low. But it's never turtles all the way down — it's turtles, and then there's a human at the bottom. At some point there needs to be human intervention, human input, to define what's good and what's bad, and this arises in reinforcement learning as well. We have to remember that, because the challenges and the exciting opportunities of reinforcement learning lie in how we get that supervision in the most efficient way possible — but supervision, nevertheless, is required for any system that has an input and an output and is trying to learn, like a neural network does, to provide an output that's good. It needs somebody to say what's good and what's bad. If you're curious about that, there have been a few books written throughout the last few centuries, from Socrates to Nietzsche — I recommend the latter,
especially. So let's look at supervised learning and reinforcement learning. I'd like to propose a way to think about the difference that is illustrative and useful when we start talking about the techniques. Supervised learning is taking a bunch of examples of data and learning from those examples, where a ground truth provides you the compressed semantic meaning of what's in that data; from those examples, one by one — whether sequences or single samples — we learn how to take future such samples and interpret them. Reinforcement learning is teaching an agent through experience: not by showing it singular samples from a dataset, but by putting it out into the world. The essential element of reinforcement learning — we'll talk about a bunch of algorithms, but the essential design step — is to provide the world in which to gain experience. The agent learns from the world: from the world it gets the dynamics, the physics; from that world it gets the rewards, what's good and bad. And we, as designers of that agent, don't just have to design the algorithm, we have to design the world in which that agent is trying to solve a task. The design of the world is the process of reinforcement learning; the design and annotation of examples is the world of supervised learning. And the essential, perhaps the most difficult, element of reinforcement learning is the reward, the good versus bad. Here, a baby starts walking across the room. We want to define success as the baby walking across the room and reaching the destination — that's success — and failure is the inability to reach that destination. Simple. And reinforcement learning in humans — the way we appear to learn from very few examples of trial and error — is a mystery, a beautiful mystery full of open questions. It could be from the huge amount of data: 230 million years' worth of bipedal movement, of mammals learning the ability to walk, or 500 million years of the ability to see, of having eyes. That's the hardware side: somehow genetically encoded in us is the ability to comprehend this world extremely efficiently. Or it could be not the hardware, not the five hundred million years, but the few minutes, hours, days, months, maybe even years at the very beginning after we're born: the ability to learn really quickly through observation, to aggregate that information, filter out all the junk you don't need, and learn really quickly through imitation, through observation. For walking, that might mean observing others walk; the idea there is that if there were nobody else around, we might never learn the fundamentals of walking, or at least not as efficiently. And then it could be the algorithm. The algorithm our brain uses to learn is totally not understood — backpropagation in artificial neural networks has no well-understood counterpart in the brain — and that could be the key. So I want you to think about that as we talk about the, by comparison, very trivial accomplishments in reinforcement learning, and how we take the next steps. But it is nevertheless exciting to have machines that learn how to act in the world. For those of us who have fallen in love with artificial intelligence, the process of learning is thought of as intelligence itself: the ability to know very little and, through experience, examples, interaction with the world — in whatever medium, whether
it's data or simulation and so on — be able to form much richer and more interesting representations of that world and to act in that world. That's the dream. So let's look at this stack of what it means to be an agent in this world, from the top, the input, to the bottom, the output. There's an environment, and we have to sense that environment. We have just a few tools: as humans we have several sensory systems; on cars you can have lidar, camera, stereo vision, audio, microphone, networking, GPS, IMU sensors, and so on — for whatever robot you can think about, there's a way to sense that world, and you get this raw sensory data. Once you have the raw sensory data, you're tasked with representing it in such a way that you can make sense of it. As opposed to the raw sensors — the eye, the cones, and so on — taking in just a giant stream of high-bandwidth information, we have to form higher abstractions, features based on which we can reason: from edges to corners to faces and so on. That's exactly what deep learning, neural networks, have stepped in to do: in an automated fashion, with as little human input as possible, form higher-order representations of that information. Then there is the learning aspect: building on top of the abstractions formed through those representations, accomplish something useful. There are discriminative tasks, generative tasks, and so on — based on the representation, make sense of the data, generate new data, from sequence to sequence, from sample to sequence, and so on and so forth, all the way to actions, as we'll talk about. And then there is the ability to aggregate all the information received in the past into the useful information that's pertinent to the task at hand. It's the old thing: it looks like a duck, quacks like a duck, swims like a duck — three different datasets, and I'm sure there are state-of-the-art algorithms for all three: image classification, audio recognition, and video classification or activity recognition. Aggregating those three together is still an open problem, and that could be the last missing piece. Again, I want you to think, as we consider reinforcement learning agents: how do we transfer from the game of Atari to the game of Go, to the game of Dota, to a robot navigating an uncertain environment in the real world? And once you have that — once you sense the raw world, once you have a representation of that world — then we need to act: provide actions, within the constraints of the world, in such a way that we believe can get us towards success. The promise, the excitement, of deep learning is the part of the stack that converts raw data into meaningful representations; the promise, the dream, of deep reinforcement learning is going beyond that and building an agent that uses that representation and acts to achieve success in the world. That's super exciting. The framework, the formulation of reinforcement learning at its simplest, is that there's an environment and there's an agent that acts in that environment. The agent senses the environment through some observation — a partial or complete observation of the environment — and it gives the environment an action; it acts in that environment, and through the action the environment changes in some way, a new observation occurs, and as you make the observations you receive a reward.
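To make that loop concrete, here is a minimal sketch in Python of the observation-action-reward cycle just described. The Environment and Agent interfaces here are hypothetical placeholders, not from any particular library:

```python
# Minimal sketch of the observation-action-reward loop described above.
# `Environment` and `Agent` are hypothetical interfaces, not from any library.

class Environment:
    def reset(self):
        """Return the initial observation."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action; return (observation, reward, done)."""
        raise NotImplementedError

class Agent:
    def act(self, observation):
        """Pick an action given the current observation (the policy)."""
        raise NotImplementedError

def run_episode(env: Environment, agent: Agent):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:                      # repeat until a terminal state
        action = agent.act(observation)  # agent acts on its observation
        observation, reward, done = env.step(action)  # world changes, reward arrives
        total_reward += reward
    return total_reward
```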
In most formulations of this framework, the entire system has no memory: the only things you need to be concerned about are the state you came from, the state you arrived in, and the reward received. The open question here is: what can't be modeled in this way? Can we model all of it, from human life to the game of Go? Can all of it be modeled this way, and is this a good way to formulate the learning problem for robotic systems in the real world and in simulated worlds? Those are the open questions. The environment could be fully observable or partially observable, like in poker; single-agent or multi-agent — Atari versus driving, like DeepTraffic; deterministic or stochastic; static versus dynamic — static as in chess, dynamic as in driving and most real-world applications; and discrete versus continuous — discrete like games such as chess, or continuous like cart-pole, balancing a pole on a cart. The challenge for RL in real-world applications is that, as a reminder, supervised learning is teaching by example, while reinforcement learning is teaching by experience, and the way we currently provide experience to reinforcement learning agents, for the most part, is through simulation or through highly constrained real-world scenarios. So the challenge is that most of the successes are with systems and environments that are simulated. There are two ways to close this gap, two directions of research and work. One is to improve the algorithms — improve their ability to form policies that are transferable across all kinds of domains, including, especially, the real world: train in simulation, transfer to the real world. The other is to improve the simulation so that its fidelity increases to the point where the gap between reality and simulation is minimal, to a degree that things learned in simulation are directly, trivially transferable to the real world. Okay, the major components of an RL agent. An agent operates based on a strategy called the policy: it sees the world, it makes a decision on how to act — that's the policy — sees the reward, sees a new state, acts, sees a reward, sees a new state, acts, and this repeats until a terminal state. The value function is the estimate of how good a state is, or how good a state-action pair is — meaning, taking an action in a particular state, how good is that — the ability to evaluate that. And then the model — different from the environment, this is from the perspective of the agent: the environment has a model based on which it operates, and the agent has a representation, its best understanding, of that model. The purpose of an RL agent in this simply formulated framework is to maximize reward, and the way the reward is mathematically and practically treated is with discounting: we discount rewards that are further into the future, so a reward that's farther into the future means less, in terms of maximization, than a reward in the near term. Why do we discount? First, a lot of it is a math trick, to be able to prove and analyze certain aspects of convergence. And, in a more philosophical sense, because environments either are, or can be thought of as, stochastic, random, there's a degree of uncertainty that makes it difficult to really estimate the reward you'll get in the future, because of the ripple effect of that uncertainty.
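As a small illustration of the discounting just described, here is a sketch of computing a discounted return in Python, with an assumed discount factor gamma of 0.99:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where each step further into the future is
    multiplied by an extra factor of gamma (0 < gamma <= 1)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: the same +1 reward matters less the later it arrives.
print(discounted_return([0, 0, 1]))   # 0.9801
print(discounted_return([1, 0, 0]))   # 1.0
```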
Let's look at an example — a simple one that helps us understand policies, rewards, and actions. There's a robot in a room with 12 cells in which it can step. It starts in the bottom left and tries to get a reward at the top right: there's a +1, a really good thing, at the top right, and it wants to get there by walking around. There's also a -1, which is really bad, and it wants to avoid that square. The choice of actions is up, down, left, right — four actions. You can think of there being a negative reward of 0.04 for each step, so there's a cost to each step, and there's potentially a stochastic nature to this world — we'll talk about both the deterministic and the stochastic case. In the stochastic case, when you choose the action "up", with an 80% chance you move up, but with a 10% chance you move left and a 10% chance you move right. That's the stochastic nature: even though you try to go up, you might end up in the block to the left or to the right. For a deterministic world, the optimal policy here, given that we always start in the bottom left, is simply the shortest path: because there's no stochasticity, you're never going to screw up and fall into the -1 hole, so you just compute the shortest path and walk along it. Why the shortest path? Because every single step hurts — there's a negative reward of 0.04 attached to it — so the shortest path is the thing that minimizes the accumulated penalty on the way to the +1 block. Okay, now let's look at the stochastic world I mentioned: 80% up, with the remaining 20% split 10% each to left and right. How does the policy change? First of all, we now need a plan for every single block in the area, because you might end up anywhere due to the stochasticity of the world. The basic addition is that we try to avoid going up the closer we get to the -1 hole, because the stochastic nature of "up" means you might fall into the hole with a 10% chance, and given the 0.04 step penalty, you're willing to take the long way home in some cases to avoid that possibility. Now let's change the reward for each step: if it decreases to -2, it really hurts to take every step, and we go back to the shortest path despite the stochasticity — in fact, you don't really care if you step into the -1 hole, because every step hurts so much that you just want to get home. And you can keep playing with this reward structure: instead of -2 or -0.04, look at -0.1, and you can see immediately that the structure of the policy changes. With a higher negative reward per step, the urgency of the agent increases; with a lower one, there's less urgency. And when the reward flips to positive — every step is a positive, which is actually quite common in reinforcement learning, the entire system full of positive rewards — the optimal policy becomes the longest path. It's grad school: taking as long as possible, never reaching the destination. So what lessons do we draw from the robot in the room? Two things. First, the environment model, the dynamics — even in this trivial example, the stochastic nature, the difference between 80 percent, 100 percent, and 50 percent, the model of the world — has a big impact on what the optimal policy is.
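For concreteness, here is a rough sketch in Python of the stochastic action model from this example. The 80/10/10 split and the per-step cost are the values mentioned above; the dictionary and function names are just illustrative:

```python
import random

# The 80/10/10 action noise from the grid-world example above:
# trying to go "up" sometimes slides you sideways.
SLIP = {
    "up":    [("up", 0.8), ("left", 0.1), ("right", 0.1)],
    "down":  [("down", 0.8), ("left", 0.1), ("right", 0.1)],
    "left":  [("left", 0.8), ("up", 0.1), ("down", 0.1)],
    "right": [("right", 0.8), ("up", 0.1), ("down", 0.1)],
}

def sample_actual_move(intended_action):
    """Return the move that actually happens in the stochastic world."""
    moves, weights = zip(*SLIP[intended_action])
    return random.choices(moves, weights=weights)[0]

STEP_REWARD = -0.04   # the small cost per step; try -2, -0.1, or a positive
                      # value and the optimal policy changes completely
```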
And second, the reward structure — most importantly, the thing we can often control more in how we construct the task we're trying to solve with reinforcement learning: what is good and what is bad, and how good or how bad is it. The reward structure has a big impact and, as Robert Frost might say, completely changes the road taken — the policy, the choices the agent makes. So when you formulate a reinforcement learning problem, as researchers, as students, what you often do is design the environment, design the world in which the system learns — even when your ultimate goal is a physical robot, a lot of work is still done in simulation — so you design the world, the parameters of that world, and you also design the reward structure, and that can have transformative results: slight variations in those parameters lead to huge differences in the policy that's arrived at. And of course, the example I've shown before that I really love: changes to the reward structure can have unintended consequences, and those consequences, for a real-world system, can have highly detrimental costs that are more than just a failed game of Atari. Here is a human performing the task: playing the game of CoastRunners, racing around the track. When you finish first and you finish fast, you get a lot of points, so it's natural to say, okay, let's build an RL agent and optimize it for those points. What you find out in the game is that you also get points by picking up the little green turbo things, and what the agent figures out is that you can get a lot more points by simply focusing on the green turbos — just rotating over and over, slamming into the wall, fire and everything, just picking them up — especially because picking up those turbos lets you avoid the terminal state at the end of finishing the race. In fact, finishing the race means you stop collecting positive reward, so you never want to finish; you just keep collecting turbos. That's a trivial example — and it's not actually easy to find such examples, but they're out there — of unintended consequences that can have highly negative, detrimental effects when put into the real world. We'll talk a little bit about robotics: when you put robots, wheeled ones like autonomous vehicles, into the real world, and you have objective functions that have to navigate difficult intersections full of pedestrians, you have to form intent models of those pedestrians. Here you see cars asserting themselves through dense intersections, taking risks, and those risks that are taken by us humans when we drive vehicles — we then have to encode that ability to take subtle risks into AI-based control algorithms and perception. At the end of the day there's an objective function, and if that objective function does not anticipate the green turbos that are there to be collected, and the resulting unintended consequences, it could have very negative effects, especially in situations that involve human life. That's the field of AI safety, and some of the folks we'll talk about — DeepMind and OpenAI — who are doing incredible work in RL also have groups working on AI safety, for a very good reason. I believe artificial intelligence will define some of the most impactful, positive things in the 21st century, but I also believe we are nowhere close to solving some of the fundamental problems of AI safety, which we also need to address as we develop these algorithms. Okay, examples of reinforcement learning systems — all of it has to do with the formulation of rewards, states, and actions. You have the
traditional, often-used benchmark of cart-pole balancing, which is continuous: the action is the horizontal force applied to the cart, the goal is to balance the pole so it stays upright on the moving cart, and the reward is 1 at each time step that the pole is upright. The state measured by the agent is the pole angle and angular speed, plus the cart's own position and horizontal velocity. Another example — I didn't want to include the video because it's really disturbing, but I do want to include the slide because it's important to think about — is sensing the raw pixels and teaching an agent to play the game of Doom. The goal there is to eliminate all opponents, the state is the raw game pixels, the actions are move, turn, shoot, reload, and so on, and the reward is positive when an opponent is eliminated and negative when the agent is eliminated. Simple. I added it here because, again on the topic of AI safety, we have to think about objective functions and how they translate into the world of not just autonomous vehicles but things that can even more directly do harm, like autonomous weapons systems — we have a lecture on this in the AGI series. And on the robotics side, there's object manipulation and grasping: there are a few benchmarks and a few interesting applications for learning to grab objects, move objects, manipulate and rotate them, especially when those objects have complicated shapes. In the pure object-grasping challenge, the goal is to pick up an object, the state is visual — the raw pixels showing the objects — the actions are to move the arm, grasp the object, and pick it up, and obviously the reward is positive when the pickup is successful. The reason I'm personally excited by this is because it'll finally allow us to solve the problem of the claw, which has been torturing me for many years. I don't know — that's not at all why I'm excited by it. Okay, and then we have to think, as we get a greater and greater degree of application in the real world with robotics like cars — the main focus of my passion in terms of robotics — about how we encode some of the things that we humans encode. We have to think about our own objective function, our own reward structure, our own model of the environment that we perceive and reason about, in order to then build machines that do the same. I believe autonomous driving is in that category: to ask questions of ethics we have to ask questions of risk, of the value of human life, of efficiency, of money, and so on — all these difficult ethical questions that an autonomous vehicle unfortunately has to address before it becomes fully autonomous. So here are the key takeaways for the real-world impact of reinforcement learning agents. On the deep learning side — these neural networks that form high-level representations — the fun part is the algorithms: all the different architectures, the different encoder-decoder structures, attention and self-attention, recurrent cells, LSTMs, GRUs, all the fun architectures; and the data, the ability to leverage different datasets in order to perform discriminative tasks better — you know, MIT does better than Stanford, that kind of thing. That's the fun part. The hard part is asking good questions and collecting huge amounts of data that are representative of the task. That's what matters for real-world impact — not a CVPR publication: real-world impact requires a huge amount of data. On
the deep reinforcement learning side, the fun part, again, is the algorithms: how we learn from data, some of which I'll talk about today. The hard part is defining the environment, defining the action space and the reward structure — as I mentioned, that's the big challenge. And the hardest part is how to cross the gap between simulation and the real world; that leap is the hardest part, and we don't even know how to solve that transfer learning problem yet for the real world. The three types of reinforcement learning: there are countless algorithms and a lot of ways to taxonomize them, but at the highest level there's model-based and there's model-free. Model-based algorithms learn the model of the world: as you interact with the world, you construct your estimate of how you believe the dynamics of that world operate. The nice thing about doing that is that once you have a model, or an estimate of one, you're able to anticipate, to plan into the future; you're able to use the model, in a branching way, to predict how your actions will change the world, so you can plan far ahead. This is the mechanism by which you can play chess — in the simplest form, because in chess you don't even need to learn the model, the model is given to you; chess, Go, and so on. The most important way these categories differ, I think, is sample efficiency: how many examples of data are needed to operate successfully in the world. Model-based methods, because they're constructing a model — when they can — are extremely sample efficient: once you have a model, you can do all kinds of reasoning that doesn't require experiencing every possibility; you can unroll the model to see how the world changes based on your actions. Value-based methods are ones that look to estimate the quality of states, or the quality of taking a certain action in a certain state; they're called off-policy, versus the last category, which is on-policy. What does it mean to be off-policy? Value-based agents constantly update how good it is to take an action in a state; they have this model of the goodness of taking an action in a state and use it to pick the optimal action. They don't directly learn a policy, a strategy of how to act — they learn how good it is to be in a state and use that goodness information to pick the best action, and every once in a while flip a coin in order to explore. Policy-based methods are ones that directly learn a policy function: they take as input the world — the neural network's representation of that world — and output an action, where the action is stochastic. So that's the range: model-based, value-based, and policy-based. Here's an image from OpenAI that I really like — I encourage you, as we explore further, to look up Spinning Up in Deep RL from OpenAI — an image that taxonomizes, in the way I described, some of the recent developments in RL. At the very top is the distinction between model-free RL and model-based RL. In model-free RL, which is what we'll focus on today, there is a distinction between policy optimization — on-policy methods — and Q-learning, which is off-policy. Policy optimization methods directly optimize, directly learn, the policy in some way; Q-learning, off-policy methods, learn, as I mentioned, the value of taking a certain action in a state, and from that learned Q-value they are able
to choose how to act in the world. So let's look at a few representative approaches in this space, starting with one of the first great breakthroughs from Google DeepMind on the deep RL side, solving Atari games: DQN, Deep Q-Networks. Let's take a step back and think about what Q-learning is. Q-learning looks at the state-action value function Q, which estimates — based on a particular policy, or based on an optimal policy — how good it is to take an action in this state: the estimated reward if I take an action in this state and continue operating under an optimal policy. It directly gives you a way to say, among all the actions available, which action should I take to maximize reward. Now, in the beginning you know nothing — you don't have this value estimate, you don't have this Q-function — so you have to learn it, and you learn it with a Bellman-equation update: you take your current estimate and update it with the reward you received after taking an action. It's off-policy and model-free: you don't need any estimate or knowledge of the world, you don't need any policy whatsoever; all you're doing is roaming about the world, collecting data — which action you took, what reward you received — and gradually updating a table with states on one axis and actions on the other. The key part is that because you always have an estimate of the value of taking each action, you can always take the "optimal" one — but because you know very little in the beginning, you have no way of knowing whether that optimal action is actually good, so there needs to be some degree of exploration. The fundamental aspect of value-based methods — of most RL methods, like I said, it's trial and error — is exploration. For value-based methods like Q-learning, the way that's done is with the flip of a coin, epsilon-greedy: with some probability epsilon you take a random action, and you slowly decrease epsilon towards zero as the agent learns more and more. So in the beginning you explore a lot, with epsilon near 1, and with epsilon near 0 at the end you're acting greedily based on your understanding of the world as represented by the Q-value function. For non-neural-network approaches, this Q-function is simply a table — states on one axis, actions on the other — and in each cell you have the estimated discounted reward you expect to receive; as you walk around, you update that table with the Bellman equation. But it's a table nevertheless: number of states times number of actions. Now, if you look at any practical real-world problem — and an arcade game with raw sensory input is a very crude first step towards the real world — this kind of value iteration, updating a table, is impractical with raw sensor information. For the game of Breakout, if we look at four consecutive frames, the raw sensory input is 84 by 84 grayscale pixels, and every pixel has 256 values — that's 256 to the power of (84 times 84 times 4), which is significantly larger than the number of atoms in the universe. So the size of this Q-table, using the traditional tabular approach, is intractable.
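Here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration, assuming a small discrete environment with the hypothetical reset()/step() interface sketched earlier and hashable states; all names and hyperparameters are illustrative:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99,
               eps_start=1.0, eps_end=0.05):
    Q = defaultdict(float)  # the "table": Q[(state, action)] -> estimated return
    for ep in range(episodes):
        # linearly decay epsilon from eps_start to eps_end over training
        eps = eps_start + (eps_end - eps_start) * ep / (episodes - 1)
        state, done = env.reset(), False
        while not done:
            if random.random() < eps:                      # explore: flip of a coin
                action = random.choice(actions)
            else:                                          # exploit: best known action
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bellman update: move the estimate toward reward + discounted best next value
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```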
Neural networks to the rescue: deep RL is RL plus neural networks, where, for value-based methods, the neural network is tasked with taking this Q-table and learning a compressed representation of it — an approximator for the function from state and action to value. That's the powerful ability we talked about previously: neural networks forming representations from extremely high-dimensional, complex, raw sensory information. The framework remains, for the most part, the same; it's just that the Q-function becomes a neural network, an approximator, where the hope is that as you navigate the world and pick up new knowledge, by backpropagating the gradient of the loss function you form a good representation of the optimal Q-function. Using neural networks — which are good function approximators — this way is DQN, the Deep Q-Network used to get the initial, incredibly nice results on Atari games: the input is the raw pixels, passed through a few convolutional layers and fully connected layers, and the output is a set of action values, from which you choose the best action. This simple agent, with a very simple network estimating the Q-function, is able to achieve superhuman performance on many of these arcade games. That excited the world, because it takes raw sensory information, with a pretty simple network that in the beginning understands none of the physics of the world, none of the dynamics of the environment, and through that intractable state space learns to do pretty well. The loss function for DQN has two Q terms: one is the predicted Q-value of taking an action in a particular state, and the other is the target against which the loss is calculated — the value you got once you actually took that action, computed by looking at the next state and taking the max, assuming you take the best action in the next state. So there are two estimates, two forward passes — two Q's in this equation. In traditional DQN that's done by a single neural network with a few tricks; in Double DQN it's done by two neural networks. And I mention tricks because, with this and with most of RL, the tricks tell a lot of the story — a lot of what makes systems work lies in the details, in games and in robotic systems alike. The two biggest tricks for DQN, which reappear in a lot of value-based methods, are, first, experience replay: think of an agent playing through these games as also collecting memories; you collect a bank of memories that can then be replayed. One of the central elements that makes value-based methods attractive is that, because you're not directly estimating the policy but learning the quality of taking an action in a particular state, you're able to jump around through your memory and replay different parts of it — train the network on historical data. The other trick is that, like I said, the loss function has two Q's, so it's a dragon chasing its own tail: it's easy for the loss to become unstable and for training not to converge. The trick of fixing a target network is taking one of the Q's and only updating it every X steps — every thousand steps, say.
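As a rough sketch of the loss just described, assuming q_net and target_net are PyTorch modules that map a batch of states to per-action Q-values (names illustrative), and the batch is sampled from the replay buffer:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a): the predicted value of the action actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target: r + gamma * max_a' Q_target(s', a'), with the target network held fixed
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next
    return F.mse_loss(q_pred, q_target)
```

The target network would then be synced to the online network only every N steps, for example with target_net.load_state_dict(q_net.state_dict()).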
You take the same kind of network and just fix it: the target network that defines the loss function is kept fixed and only updated at regular intervals, so you're chasing a fixed target with your loss function, as opposed to a dynamic one. With this you can solve a lot of the Atari games with minimal effort and come up with some creative solutions — here, Breakout after 10 minutes of training on the left and after 2 hours of training on the right, coming up with some creative solutions. Again, it's pretty cool because this is raw pixels. It's been a few years since this breakthrough, so we kind of take it for granted, but I'm still captivated by just how beautiful it is that from raw sensory information, neural networks learn to act in a way that actually supersedes humans, in terms of creativity and in terms of raw performance. It's really exciting, and games, in their simple form, are the cleanest way to demonstrate that. The same kind of DQN network achieves superhuman performance on a bunch of different games. There are improvements to this, like Dueling DQN: the Q-function can be decomposed — which is useful — into the value estimate of being in that state and what's called the advantage, the advantage of taking an action in that state. The nice thing about the advantage as a measure is that it measures the quality of an action relative to the average action that could be taken there; its usefulness versus the raw reward is that if all the actions available are pretty good, you want to know how much better one is relative to the others — that's a better measure for choosing actions in a value-based sense. So in Dueling DQN you have these two estimates, two streams in the neural network: one estimates the value, the other the advantage. That dueling structure is also useful because there are many states in which the quality of the actions is decoupled from the state — in many states it doesn't matter which action you take — so you don't need to learn the full topology of the different actions in every particular state. Another improvement is prioritized experience replay. Like I said, experience replay is really key to these algorithms — and it's something on-policy policy optimization methods can't rely on in the same way. Experience replay is collecting different memories, but if you just sample randomly from those memories, the sampled experiences are determined by how frequently they occurred, not by their importance. Prioritized experience replay assigns a priority based on the magnitude of the temporal-difference error, so the experiences you have the most to learn from are given a higher priority, and therefore, through the replay process, you get to see those particular experiences more often.
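A minimal sketch of the dueling head described above, where one stream estimates the state value V(s) and the other the advantages A(s, a); layer sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                       nn.Linear(128, num_actions))

    def forward(self, features):
        v = self.value(features)                    # V(s), shape [batch, 1]
        a = self.advantage(features)                # A(s, a), shape [batch, num_actions]
        # Subtract the mean advantage so V and A are identifiable
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a)
```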
Okay, moving on to policy gradients — this is on-policy, versus Q-learning, which is off-policy. Policy gradient methods directly optimize the policy: the input is the raw pixels, the policy network forms the representations of that environment, and as output it produces a stochastic estimate, a probability over the different actions — here in Pong, a single output that gives the probability of moving the paddle up. The way vanilla policy gradients, the very basic version, work is that you unroll the environment: you play through it — here Pong, moving the paddle up and down — collecting no rewards along the way and only getting a reward at the very end, based on whether you win or lose. Every single action you took along the way gets either punished or rewarded based on whether it led to victory or defeat. It's remarkable that this works at all, because of the credit assignment: every single thing you did along the way is averaged out, muddied together. That's the reason policy gradient methods are less sample efficient, but it's still surprising that it works at all. The pros versus DQN, the value-based methods: if the world is so messy that you can't learn a good Q-function, the nice thing about policy gradients, because they learn the policy directly, is that they will at least learn a pretty good policy — usually, in many cases, with faster convergence; they can deal with stochastic policies, which value-based methods can't learn; and they deal much more naturally with continuous actions. The cons: they're sample-inefficient compared to DQN, they can become highly unstable during training — we'll talk about some solutions to that — and there's the credit assignment problem: in the chain of actions that led to a positive reward, some might be awesome, some good, some terrible, but that doesn't matter — as long as the destination was good, every single action along the way gets positive reinforcement. That's the downside, and there are improvements: advantage actor-critic methods, A2C, combine the best of value-based and policy-based methods. You have two networks: an actor, which is policy-based and is the one that takes the actions — samples them from the policy network — and a critic, which measures how good those actions are, and the critic is value-based. So as opposed to the plain policy update — the first equation, where the reward comes from the destination, from whether you won the game — at every single step along the way you now learn a Q-value function, Q of state and action, using the critic network, so you're able to evaluate your own actions at every step, which makes you much more sample efficient. There are asynchronous (A3C, from DeepMind) and synchronous (A2C, from OpenAI) variants of the advantage actor-critic framework, and both are highly parallelizable. The difference with A3C, the asynchronous one, is that you just throw these agents into the environment; they roll out the games, get the reward, and update the global network parameters asynchronously — and as a result they're also constantly operating on outdated versions of that network. The OpenAI approach fixes this with a coordinator: there are rounds where all the agents roll out episodes in parallel, but the coordinator waits for everybody to finish before making the update to the global network, and then distributes the same parameters to all the agents. That means every iteration starts with the same global parameters, which has really nice properties in terms of convergence and stability of the training process.
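Here is a rough sketch of the vanilla policy gradient (REINFORCE) loss described above, where every log-probability in an episode is weighted by the (normalized) discounted return that followed it; this is a generic textbook form with illustrative names, not code from the lecture:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t | s_t) tensors from the policy network."""
    returns, g = [], 0.0
    for r in reversed(rewards):          # discounted return from each step onward
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns)
    # Common variance-reduction trick: normalize returns within the episode
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Gradient ascent on expected return == gradient descent on this loss
    return -(torch.stack(log_probs) * returns).sum()
```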
Okay, from Google DeepMind, the Deep Deterministic Policy Gradient, DDPG, combines the ideas of DQN with continuous action spaces: it uses the actor-critic framework, but instead of the actor operating with a stochastic policy, it picks a deterministic policy — it always chooses the best action. The problem, quite naturally, is that while a deterministic policy can handle a continuous action space, it never explores. So the way we inject exploration into the system is by adding noise: either adding noise to the actions on the output, or adding noise to the parameters of the network, which creates perturbations in the actions, so that the final result is that you try different kinds of things. And the scale of that noise — just like epsilon in epsilon-greedy exploration for DQN — decreases as you learn more and more. On the policy optimization side, from OpenAI and others — we'll do a lecture just on this — there's been a lot of exciting work. The basic idea of policy optimization with PPO and TRPO is, first, that we want to formulate reinforcement learning purely as an optimization problem, and second, that in policy optimization the actions you take influence the rest of the optimization process, so you have to be very careful about them — in particular, you have to avoid taking really bad actions, because the training performance can collapse. How do we do that? There are line-search methods — which gradient descent, the way we train deep neural networks, falls under — where you first pick the direction of the gradient and then pick the step size. The problem is that this can get you into trouble: there's a nice visualization of walking along a ridge, where it can result in you stepping off that ridge and the training process, the performance, collapsing. The trust region is the underlying idea for these policy optimization methods: first pick the step size — constrain, in various ways, the magnitude of the change applied to the weights — and then the direction, thereby placing a much higher priority on not choosing bad actions that can throw you off the optimization path we should actually be taking.
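One concrete instance of this trust-region idea is the PPO-style clipped surrogate objective; here is a hedged sketch, with illustrative tensor names and a typical clipping epsilon of 0.2:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)        # pi_new / pi_old
    unclipped = ratio * advantages
    # Keep the new policy close to the one that collected the data
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate for gradient descent
    return -torch.min(unclipped, clipped).mean()
```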
And finally, the model-based methods — we'll also talk about them on the robotics side. There are a lot of interesting approaches now where deep learning is starting to be used in model-based methods when the model has to be learned. But of course, when the model doesn't have to be learned — when it's given, inherent to the game, like in Go and chess — AlphaZero has done really incredible stuff. So what is the model here? The way a lot of these games work — take the game of Go — is turn-based: one player goes, then the other, and there's a game tree where at every point there's a set of actions that could be taken. If you look at that game tree, it quickly grows exponentially; it becomes huge. The game of Go is the hugest of all, because the number of choices is the largest, then chess, then checkers, then tic-tac-toe — the branching factor at every step increases or decreases based on the game structure. The task for a neural network there is to learn the quality of the board: to learn which board positions, which game positions, are most useful to explore, most likely to lead to a highly successful state. That choice of what's good to explore, which branch to go down, is where neural networks step in. With AlphaGo — the first success, which beat the world champion — the network was pretrained on expert games; with AlphaGo Zero there was no pretraining on expert games, no imitation learning — it learned purely through self-play, by playing itself into new board positions. Many of these systems use Monte Carlo tree search, and during the search they balance exploitation and exploration: going deep on promising positions based on the network's estimates, or, with a flip of a coin, playing under-explored positions. You can think of this as an intuition for looking at a board and estimating both how good that board is in general and how likely it is to lead to victory down the line. The next step forward was AlphaZero, using a similar architecture with Monte Carlo tree search, but applying it to different games and competing against other state-of-the-art engines in Go, in shogi, and in chess — and outperforming them with very few training steps. So these model-based approaches, which are extremely sample efficient if you can construct such a model — and, in robotics, if you can learn such a model — can be exceptionally powerful. Here it's beating engines that are already far superior to humans — Stockfish can destroy most humans on Earth at the game of chess — and the ability, through learning, through estimating the quality of a board, to defeat these engines is incredible. The exciting aspect, versus engines that don't use neural networks, is which positions get explored: guided by the neural network, you explore only certain parts of the tree. If you look at grandmasters, human chess players, they seem to explore very few moves — they have a really good internal "neural network" for estimating which branches are likely to provide value — while Stockfish and similar engines are much more brute-force in their search. AlphaZero, with MCTS guided by the network, is a step towards the grandmaster: the number of branches that need to be explored is much, much smaller, because a lot of the work is done by the representation formed by the neural network. It's just super exciting — and it's able to outperform Stockfish in chess, Elmo in shogi, and, in Go, itself, the previous iterations of AlphaGo Zero.
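As a sketch of how the search balances exploitation and exploration, here is an AlphaZero-style (PUCT-like) selection score; the exact constant and data layout are assumptions for illustration, not the lecture's or DeepMind's exact formulation:

```python
import math

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Balance exploitation (average value Q) against exploration
    (network prior P, discounted by how often the child was visited)."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration

def select_child(children):
    """children: list of dicts with keys 'Q' (value), 'P' (prior), 'N' (visits)."""
    parent_visits = sum(child["N"] for child in children) + 1
    return max(children,
               key=lambda c: puct_score(c["Q"], c["P"], parent_visits, c["N"]))
```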
Now, the challenge here — the sobering truth — is that the majority of real-world applications of agents that have to perceive the world and act in it involve, for the most part, no RL at all: the action is not learned. They use neural networks to perceive certain aspects of the world, but ultimately the action is not learned from data. That's true for most — arguably all — of the autonomous vehicle companies operating today, it's true for robotic manipulation in industrial robotics, and for the humanoid and legged robots that have to navigate this world under uncertain conditions: all the work from Boston Dynamics doesn't involve any machine learning, as far as we know. That's beginning to change. With ANYmal, in recent work, certain aspects of a robot's control can be learned: you're trying to learn more efficient movement, more robust movement, on top of the other controllers. So it's quite exciting to learn some of the control dynamics through RL — here it's able to teach this particular robot to get up from arbitrary positions, with less hard-coding, in order to deal with unexpected initial conditions and unexpected perturbations. That's exciting in terms of learning control dynamics. And some of the driving policy — behavioral driving decisions, changing lanes, turning, and so on: if you were here last week and heard from Waymo, they're starting to use some RL in the driving policy, especially to predict the future — intent modeling, anticipating what the pedestrians and the cars are going to do based on the environment, unrolling what's happened recently into the future — beginning to move beyond the pure end-to-end learning approach, like NVIDIA's, for control decisions, towards RL and long-term planning decisions. But again, the challenge is the gap, the leap needed to go from simulation to the real world. Almost all the work is in the design of the environment and the reward structure, and because most of that work now is in simulation, we need to either develop better algorithms for transfer learning or close the distance between simulation and the real world. We could also think outside the box a little bit: in a conversation with Pieter Abbeel recently — one of the leading researchers in deep RL — he mentioned, kind of on the side, the idea that we don't need to make simulation more realistic; what we could do is just create a very large, effectively infinite, number of simulations, and the natural regularization effect of having all those simulations will make it so that our reality is just another sample from them. So maybe the solution isn't to create higher-fidelity simulation or better transfer learning algorithms — maybe it's to build an arbitrary number of simulations, so that the step towards creating an agent that works in the real world is a trivial one. And maybe that's exactly what whoever created the simulation, the multiverse, we're living in did. Next steps: the lecture videos, several of them on RL, will all be made available on deeplearning.mit.edu; we'll have several tutorials on RL on GitHub — the link is there — and I really like the essay from OpenAI, "Spinning Up as a Deep RL Researcher": if you're interested in getting into research in RL, what are the steps you need to take, from developing the mathematical background — probability, statistics, multivariate calculus — to some of the basics covered last week on deep learning, some of the basic ideas and terminology in RL, then picking a framework, TensorFlow or PyTorch, and learning by doing. Implement the core RL algorithms I mentioned today from scratch — it should only take about two or three hundred lines of code; when you put them down on paper, they're actually quite simple, intuitive algorithms. Then read the papers about those algorithms and the ones that followed, looking not for the big hand-waving performance claims but for the tricks that were used to make these algorithms work — the tricks tell a lot of the story, and those are the useful parts you need to learn. And iterate fast on simple benchmark environments: OpenAI Gym provides a lot of easy-to-use environments that you can play with, where you can train an agent in minutes or hours as opposed to days and weeks, and iterating fast is the best way to learn these algorithms.
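A quick-start sketch of that iterate-fast loop using OpenAI Gym's classic interface (this assumes the pre-0.26 API, where reset() returns an observation and step() returns four values; newer Gym/Gymnasium versions differ slightly):

```python
import gym

env = gym.make("CartPole-v1")
for episode in range(5):
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = env.action_space.sample()          # random policy: a place to start
        obs, reward, done, info = env.step(action)  # classic 4-tuple API
        total += reward
    print(f"episode {episode}: return {total}")
env.close()
```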
Then, on the research side, there are three ways to get a best paper award — to publish, to contribute, and to have an impact on the RL research community. One is improving an existing approach: given particular benchmarks — there are a few benchmark datasets and environments emerging — you improve on an existing approach in some aspect of convergence or performance. Another is to focus on an unsolved task: there are certain games that just haven't been solved through an RL formulation. Or you can come up with a totally new problem that hasn't been addressed by RL before. So with that, I'd like to thank you very much; tomorrow I hope to see you here for DeepTraffic. Thanks.
Info
Channel: Lex Fridman
Views: 186,758
Rating: 4.9257517 out of 5
Keywords: introduction, basics, mit, deep rl, ai, deep learning, machine learning, reinforcement learning, robotics, tensorflow, github, alphazero, alphago, dqn, policy, ai safety, openai, deepmind, simulation, tutorial, model-based, value-based, policy optimization, lex, lex mit
Id: zR11FLZ-O9M
Length: 67min 29sec (4049 seconds)
Published: Thu Jan 24 2019